The structure of a compiler:
Phase 1: Lexical Analyzer: converts a file into tokens.
This is the part of the compiler that figures out where the keywords, identifiers, numbers, and operators are. It goes through the file character by character, usually discarding whitespace, and groups characters into tokens. It's what decides, for example, whether "IF" is an identifier naming something or the keyword that starts an if statement.
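As a rough sketch of that idea (the token names, keyword set, and regular expressions here are all hypothetical, not from any particular compiler), a lexer can try a list of patterns at each position, skip whitespace, and decide after matching a word whether it's a keyword or an ordinary identifier:

```python
import re

KEYWORDS = {"if", "while", "return"}

# Patterns are tried in order at the current position.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("WORD",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()<>;{}]"),
    ("SKIP",   r"\s+"),          # whitespace: matched but discarded
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m is None:
                continue
            lexeme = m.group(0)
            pos += len(lexeme)
            # A word is only a keyword if it's in the keyword table;
            # otherwise it identifies something (a variable, a function...).
            if kind == "WORD":
                kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
            if kind != "SKIP":
                tokens.append((kind, lexeme))
            break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
    return tokens
```

So tokenize("if (x1 < 10) x1 = x1 + 1;") starts with ("KEYWORD", "if"), while a word like "x1" comes out as an IDENT token.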
Phase 2: Syntax Analyzer: converts list of tokens into a parse tree
The syntax analyzer matches the token stream against a grammar to figure out how to build the parse tree.
A simple grammar:
S -> S + S | S - S | T
T -> T * T | T / T | U
U -> (S) | NUMBER
NUMBER -> [0-9]+
This grammar generates all expressions like 7+4*3, 1+2*3/4, etc.
But this grammar has a problem, it is ambiguous. For each of the strings above, there are multiple different ways to build a parse tree that matches.
This means that an expression like 1-4+3 could be interpreted in different ways.
The compiler could evaluate it as 1-(4+3) or as (1-4)+3.
We could get -6 or 0 in this case.
So ambiguous grammars are bad, because they can lead to those kinds of problems.
For simple expressions like these, we need a grammar that handles:
-Precedence (multiplying comes before adding)
This is done by having operations with higher precedence lower in the grammar.
-Associativity (whether a-b-c is interpreted as a-(b-c) or (a-b)-c )
This is done by pairing each left associative operator with a left recursive production, and each right associative operator with a right recursive one. This looks like:
A -> A * B | B / A | B
B -> NUMBER
In this example, multiplication is left associative and division is right associative. So the division a/b/c/d groups as a/(b/(c/d)), and the compiler would evaluate c/d first.
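Putting precedence and associativity together, here is a minimal recursive-descent evaluator sketch for the S/T/U grammar above (hypothetical function names; + - * / are all treated as left associative, as in conventional arithmetic, and the input is a list of single-character tokens for simplicity). Each precedence level gets its own function, and looping on the operator at each level makes it left associative:

```python
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expr():      # S -> T (("+" | "-") T)*   lowest precedence
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():      # T -> U (("*" | "/") U)*   binds tighter than + and -
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):
            op = tokens[pos]; pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor():    # U -> "(" S ")" | NUMBER
        nonlocal pos
        tok = tokens[pos]; pos += 1
        if tok == "(":
            value = expr()
            pos += 1     # consume the ")"
            return value
        return int(tok)

    return expr()
```

Because expr() loops, parse(list("1-4+3")) evaluates as (1-4)+3 = 0; and because * is handled one level down, parse(list("7+4*3")) gives 19, not 33.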
Phase 3: Semantic Analyzer
This phase does things like type checking. You basically walk the parse tree and make sure the types match.
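A sketch of what that walk can look like, on a tiny hypothetical AST format (tuples like ("num", 3), ("bool", True), ("add", left, right), ("eq", left, right) — none of this is from a real compiler). The checker returns each node's type and raises when an operator's operand types don't match:

```python
def check(node):
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "bool":
        return "bool"
    if kind == "add":
        # "+" requires both operands to be ints.
        if check(node[1]) != "int" or check(node[2]) != "int":
            raise TypeError("'+' expects int operands")
        return "int"
    if kind == "eq":
        # "==" requires both operands to have the same type.
        if check(node[1]) != check(node[2]):
            raise TypeError("'==' operands must have the same type")
        return "bool"
    raise ValueError(f"unknown node kind {kind!r}")
```

So check(("add", ("num", 1), ("num", 2))) comes back "int", while adding an int to a bool raises a TypeError.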
Phase 4: Intermediate code generation
This phase walks the parse tree and emits intermediate code for each construct.
This part can be pretty tricky. You want to do things like building a symbol table to keep track of where each variable is declared and used, and to keep track of scope.
The code generator has to consult the symbol table to know, for example, whether x refers to a local variable or one in an enclosing scope.
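One common way to handle scope (sketched here with a hypothetical API) is a stack of dictionaries, one per scope. Lookup walks outward from the innermost scope, so a local x shadows an outer x:

```python
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]            # start with the global scope

    def enter_scope(self):
        self.scopes.append({})        # e.g. on entering a function body

    def exit_scope(self):
        self.scopes.pop()

    def define(self, name, info):
        self.scopes[-1][name] = info  # declare in the innermost scope

    def lookup(self, name):
        # Search innermost scope first, then outward.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise KeyError(f"undeclared variable {name}")
```

With this, defining x globally, entering a scope, and defining a local x makes lookup("x") return the local one until exit_scope() is called.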
If you're compiling into assembly, you'll also need to keep track of which memory locations are used for which variables, as well as which variables are in which registers.
Statements like
a = a + b;
int c = a + b;
might have to become something like this (in rough pseudocode):
load register1, a
load register2, b
add register1, register1, register2
store register1, a
add register3, register1, register2
store register3, c
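Load/add/store sequences like the one above fall out of a tree walk. Here's a sketch (hypothetical node format: ("var", name) and ("add", left, right)) where each loaded value gets a fresh register and each add reuses its left operand's register:

```python
import itertools

def gen(node, out, regs):
    """Emit pseudo-assembly for node into out; return the register holding its value."""
    kind = node[0]
    if kind == "var":
        reg = f"register{next(regs)}"      # allocate a fresh register
        out.append(f"load {reg}, {node[1]}")
        return reg
    if kind == "add":
        lhs = gen(node[1], out, regs)
        rhs = gen(node[2], out, regs)
        out.append(f"add {lhs}, {lhs}, {rhs}")   # result lands in the lhs register
        return lhs
    raise ValueError(f"unknown node kind {kind!r}")

out = []
reg = gen(("add", ("var", "a"), ("var", "b")), out, itertools.count(1))
out.append(f"store {reg}, a")
```

Running this for a = a + b produces the same shape as the pseudocode above: two loads, an add into register1, and a store of register1 back to a. A real code generator would also spill registers and reuse values already loaded, which is where the register-tracking mentioned above comes in.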
Phases 1, 2, and 4 are essentially the minimum of what a compiler needs.
A good compiler is also going to have phases for code optimization and final code generation.
A fancy compiler can have some pretty sophisticated processes for code optimization.
Some things that can be done include looking at the exit condition for a loop. I recall seeing that gcc will actually simplify the following code:
int square(int x){
    int i = 0;
    while(i < x*x){
        i++;
    }
    return i;
}
into the following:
int square(int x){
    return x*x;
}
I don't know if that's actually what the example was, but it was something like that.