Summary of what I did
My main goal here was to show you how to make a complete compiler with a help of
ANTLR. I made the compiler for the same language called “Good For Nothing” described
here. Compiler that was built there is written completely by hand. This is summary of what I did:
- I wrote combined ANTLR lexer and parser grammar with the AST tree as output from the parsing process
- I generated lexer and parser classes (with ANTLRWorks 1.4 IDE, and c# as target language (later releases of ANTLRWorks were not compatible with the C# target at the time I started using it. You can find ANTLRWorks 1.4 in the "grammar" folder of the source code I provided))
- wrote ANTLR tree grammar
- generated tree parser class from the tree grammar
- Made Visual Studio Console project, wrote code generator that emits CIL, added ANTLR generated classes to the project, and also some code that performs semantic checks, like type checking etc. (this checks are not complete, I just showed one way to do it)
Source code of the project and how to use it
Complete source code of the compiler can be downloaded
here. You can extract it and run it from the Visual Studio. When you run the solution it will open test.txt file which is placed in bin/Debug path of the solution. In test.txt there is program written in Good For Nothing Language. It is the same program that is used as an example at
http://msdn.microsoft.com/en-us/magazine/cc136756.aspx :
var ntimes = 0;
print "How much do you love this company? (1-10) ";
read_int ntimes;
var x = 0;
for x = 0 to ntimes do
print "Developers!";
end;
print "Who said sit down?!!!!!";
After compiler executes, you will find generated test.il file in the Debug folder. It will contain CIL code made by the Good For Nothing Compiler:
.assembly extern mscorlib
{
}
.assembly test
{
}
.module test.exe
.class private auto ansi test extends [mscorlib]System.Object {
.method private hidebysig static void Main(string[] args) cil managed {
.entrypoint
ldc.i4 0
.locals init ( int32 ntimes)
stloc ntimes
ldstr "How much do you love this company? (1-10) "
call void [mscorlib]System.Console::WriteLine(string)
call string [mscorlib]System.Console::ReadLine()
call int32 [mscorlib]System.Convert::ToInt32(string)
stloc ntimes
ldc.i4 0
.locals init ( int32 x)
stloc x
ldc.i4 0
stloc x
br IL_0000
IL_0001:
ldloc x
ldc.i4 1
add
stloc x
ldstr "Developers!"
call void [mscorlib]System.Console::WriteLine(string)
IL_0000:
ldloc x
ldloc ntimes
clt
brtrue IL_0001
ldstr "Who said sit down?!!!!!"
call void [mscorlib]System.Console::WriteLine(string)
ret
}
}
Then you will use IL Asembler of .NET framework to make exe file from test.il. Navigate to the folder where test.il lies and type:
After you get an exe file, you can run it:
Where to start
When I started to think about making the language in February, the idea was to make a language, and compiler for it, that will emit CIL code. I had a number of questions since I was new to this field, so I started to look for the books and on-line resources about this topic. One of the first articles that I found on the Internet was " Create a Language Compiler for the .NET Framework ", written by Joel Pobar. He showed there how to make a simple .NET language and compiler by hand. It is good place to start if you want to build a compiler for .NET because the article gives you the whole picture about language definition and steps of compiler construction. I recommend you to read it before you continue with this article because I intend to make the same language and compiler not by hand, but with the help of ANTLR.
Also, if you are not that familiar with the fundamentals of language grammars and language design issues, I will strongly recommend reading the book
Programming Languages: Principles and Paradigms. This book emphasizes a thorough, hands-on treatment of the key issues in programming language design and shows you the definition and an implementation of a modest subset of C language, called C-lite. A complete C-lite interpreter is made from scratch, so it is great place to start.
The best book about CIL I have found is
Expert .NET 2.0 IL Assembler. You will not need any other book to understand CIL code. I would also recommend using CIL Disassembler – ILDASM. The CIL Disassembler is a companion tool to the CIL Assembler (Ilasm.exe). Ildasm.exe takes a portable executable (PE) file that contains binary Common intermediate language code and creates a text file that contains CIL Assembly – textual CIL code that human can read. You can look inside CIL Assembly to see how CIL code looks like. More about ildasm you can find
here.
The tool to generate language recognizer - ANTLR
Making language grammar and compiler by hand becomes increasingly difficult as language becomes bigger. Just as developers use IDEs (integrated development environments) to dramatically improve their productivity, programmers need a sophisticated development environment for building, understanding, and debugging grammars. I decided to use one such environment - ANTLRWorks. ANTLRWorks is a complete development environment for ANTLR grammars. There are two very common diffulties encountered by grammar developers: Understanding why a grammar fragment results in parser nondeterminism and determining why a generated parser incorrectly interprets an input sentence. You can write and debug grammars in ANTLR. From this formal language specification, ANTLRWorks generates a program that determines whether sentences conform to that language i.e. - a language recognizer that consists of lexer and parser. ANTLR supports multiple target languages such as Java, C#, Python, Ruby, Objective-C, C, and C++.
Combined ANTLR lexer and parser grammar of Good For Nothing language
Here is the complete ANTLR lexer and parser grammar (You can also find this file in the “grammar” folder in the source code):
In the options section of an ANTLR grammar, you can specify a series of key/value assignments that alter the way ANTLR generates code. Language option specifies the target language in which ANTLR should generate recognizers.
Basically I told ANTLR to generate lexer and parser classes in C#. A deep discussion of ANTLR grammar is beyond the scope here, but I will explain the basics.
According to Good For Nothing language definition, program (prgm rule) consists of one or more statements (stmt rule) following by the EOF token. EOF is a predefined token for the end-of-file. You should add an EOF token; otherwise ANTLR will stop parsing when there is no valid token to be matched. Statement can be variable declaration, assignment, for loop, reading of integers from the command line, or printing to the screen. Expressions (expr) can be strings, integers, arithmetic expressions, or identifiers. Identifiers (ident) can be named using an alphabetic character or “_” as the first letter, followed by characters, numbers or “_”. And so on. Quite simply, I've defined a language syntax that provides basic arithmetic capabilities, a small type system, and simple, console-based user interaction.
It is not compiler yet
There are a lot of examples of ANTLR grammars
here, but when you finally write your grammar and generate lexer and parser from it - you just got the language recognizer. It is most important part of the compiler, but in order to build a whole compiler, you must not only determine whether some sentence conforms to the rules of some language but to translate it to some other form, according to the meaning of the sentence. In our case, that other form is the CIL code. The way of linking parse phase of the compiler with the code generation that emits CIL was a real little mystery for me earlier this year, and I almost found no example about how to do it, but when I finally realize the way, it turned out very simple.
Intermediate representation of the source code - Abstract Syntax Tree (AST)
It is valuable to define the structure of a programming language on a more functional level than that is offered by its syntax. Abstract syntax is a notation that allows the parser to strip away the syntactic sugar and generate a tree that contains only essential elements of the program (for example punctuations, whitespaces and other irrelevant details are omitted). Typically an Abstract Syntax Tree is used as an intermediate representation of the program.
When you use the output=AST option, each of the grammar rules will implicitly return a node or subtree. The tree you get from invoking the starting rule is the complete AST. Building ASTs with ANTLR is straightforward. Just add AST construction rules (so called rewrite rules) to the parser grammar that indicate what tree shape you want to build. AST construction rules are rules behind the ’->’.
First element inside ^() becomes the root, and other elements becomes its children. So, for example, if we have the declaration: var x = 0; we get following subtree:
You can define AST tree with rewrite rules or operators for the tree construction – „^“ and „!“.To specify a tree structure, simply indicate which tokens should be considered operators (subtree roots) and which tokens should be excluded from the tree. Use the ^ and ! token reference suffixes, respectively. The expression rules use AST construction operators almost exclusively because they are much more terse than rewrite rules. To define expressions there is a series of rules, one for each operator precedence level and one for the lowest level describing primary expressions like identifiers, strings and integers. Start with an overall rule called expr that represents a complete expression. Rule expr will match operators with the weakest precedence, plus and minus, and will refer to a rule that matches subexpressions for operators with the next highest precedence. In this case, that next operators are multiply and divide.
From this lexer and parser grammar we can generate lexer and parser classes with ANTLRWorks IDE simply by choosing “Generate code” from the menu:
When parsing some Good For Nothing program, these classes will generate AST representation of the program. Next we need a code that will “walk” this tree and perform some actions according to the meaning of the program.
AST grammar
ANTLR gives you notation to describe the structure of the abstract syntax tree – just like you wrote your ANTLR parser grammar and generated lexer and parser programs from it, you can write you tree grammar and define rules that describe appropriate structure of the tree. After this you can generate tree walker – a program that checks if some particular tree conforms AST grammar rules that you wrote. AST grammar of Good For Nothing language is given below. It looks like parser grammar with only rewrite rules left. All of the expression rules from the parser grammar collapse into a single recursive rule in the tree grammar. The parser constructs ASTs for expressions that encode the order of operations by their very structure. The tree grammar does not need to repeat the multilevel expression grammar pattern used by parser grammars to deal with different levels of precedence. A tree grammar can just list the possibilities, yielding a much simpler description for expressions than a parser grammar can use. There are no lexer rules in the tree grammar because you are parsing tree nodes, not tokens or characters.
Linking syntax and semantics analysis phases of a compiler with AST
Abstract syntax forms basics for linking syntax and semantic analysis phases of a compiler. It allows a considerably more compact parse tree than the concrete syntax, and yet it conveys all the required information for semantic processing.
ANTLR provides you the way to add code snippets in the target language to the rules of your ANTLR grammar. With this code snippets you define what happens when tree walker match some rule in the AST of the program. By adding code snippets to the grammar, the recognizer becomes a translator. You can use these code snippets to invoke code generation and apply semantic rules (for example type checks or check if some variable is declared before it is used). AST grammar with code snippets between curly braces is given below:
I will try to explain the most important parts this code. First, I defined a global scope to handle dictionary of identifiers:
Scope has its name, and Dictionary symbols where I will put defined variables, their names (as a key), types etc. Prgm rule then can access this scope with:
For simplicity, Good For Nothing has only this global scope. If you have more scopes, you will declare new scope under the particular rule. Recognizer pushes a new Dictionary onto a stack of scopes upon entry to each method that declares scope GoodForNothingScope.
In @header section I specified namespaces needed by tree walker class.
In @members seciton I put variables declarations and methods visible to all tree grammar rules. First I declared instance target of the code generator class, that I will show you later:
private CodeGen target = new CodeGen();
Then I defined method resolveType(), wich will return the type of the compound expression:
public static string resolveType (string e1, string e2)
{
if((e1== "string") && (e2 == "string"))
{
return "string";
}
return "integer";
}
For example if we have expression like 2+3, the type of 2 will be integer, and the type of 3 will be integer, so expression 2+3 will also be of the integer type.
Next I defined method isDefined(), witch for the given name (id) of the variable checks if variable is declared in the global scope:
public bool isDefined(String id)
{
if ( $GoodForNothingScope[0]::symbols.ContainsKey(id) )
{
return true;
}
return false;
}
Next I defined @init and @after actions for the prgm rule. To execute an action before anything else in the rule and to define local variables, use an init action. Similarly, to execute something after any alternative has been executed and right before the rule returns, use an after action. I use @init in the prgm rule to initialize scope, and to give it a name – “global”.
Code generation for Good For Nothing language
With target.Start(); Start() method of the CodeGen class is called to start emit CIL code into output file stream.
I use @after in the prgm rule to call End() method:
@after {
target.End();
}
Also, when tree walker matches particular AST grammar rule it will perform semantic checks and emit CIL code corresponding to that particular rule.
Start() method emits following code:
.assembly extern mscorlib
{}
.assembly test
{}
.module test.exe
.class private auto ansi test extends [mscorlib]System.Object {
.method private hidebysig static void Main(string[] args) cil managed {
.entrypoint
End() method emits following code:
ret
}
}
Between calling Start() and End() methods, tree walker will walk the AST of the particular program and emit the rest of the CIL code between these two parts. I will show you how this process takes place on the example of declaration and initialization of variable x.
Tree walker first goes to the ident rule which only returns the name of the variable:
ident returns [string text] : ID {$ident.text = $ID.text;}
;
Then tree walker goes to the expression. It matches INT subrule and emits instruction ldc.i4 0, which puts int32 value 0 to the stack. Then code in the curly braces in the statement rule is executed. Compiler checks if variable with name x is already defined:
if ( isDefined($ident.text) )
{
throw new MyException("variable is already declared: " + $ident.text);
}
If everything is OK, compiler concludes the type of the variable according to the type of the initializing expression. Then constructor of the Variable class is called. After this I put variable name (as the key) and instance of the Variable class (as the value) into Dictionary that represents global scope. EmitLocals() method of the CodeGen class is called and it emits the declaration of the local variable x of the type int32 into the output file stream. After this, compiler emits stloc x instruction, which takes 0 from the stack (e.g. zero) and stores it to the local variable x.
This is complete code of the CodeGen class: