Wednesday, February 15, 2012

FON*C kompajler

FON*C je strogo tipizirani, imperativni, portabilni, po svojoj definiciji konzistentan i koncizan programski jezik C familije, baziran na .NET platformi. Za definisanje sintakse jezika i generisanje leksera i parsera korišćen je ANTLR (Another Tool For Language Recognition) alat. Kao sporedni efekat procesa parsiranja napravljeno je Apstraktno sintaksno stablo koje je iskorišćeno kao veza između sintaksnog i semantičkog dela kompajlera. Semantička pravila jezika i generator koda implementirani su u programskom jeziku C#. Ciljanu platformu jezika čini .NET platforma koja pruža Zajedničko izvršno okruženje (Common Language Runtime, CLR) za sve jezike implementirane na ovoj platformi. Generator koda FON*C kompajlera emituje Zajednički međujezik (Common Intermediate Language, CIL) CLR izvršnog okruženja. Implementirana je alatka make kojom se, na osnovu opisa zavisnosti između datoteka izvornog koda FON*C programa, vrši automatska kompilacija u izvršnu datoteku. 

Slika 1: Šematksi prikaz procesa kompajliranja 

Kao što se može videti, kompajler FON*C-a emituje tekstualne CIL fajlove. Da bi se tekstualni fajl preveo u izvršnu datoteku koristi se specijalni kompajler – IL Asembler (ILASM), koji je deo .NET okruženja. Ovaj kompajler može da se poziva ručno ali je za potrebe FON*C-a, napravljena alatka make, po ugledu na slične alate napravljene za programski jezik C. Alatka make služi za automatsku kompilaciju izvornog koda programa na osnovu opisa zavisnosti između datoteka od kojih se izvršni program sastoji. Program make automatski pronalazi heder fajlove, vrši kompilaciju .c fajlova u .il fajlove i spaja ih u izvršnu datoteku pozivanjem ILASM asemblera u pozadini.

Materijali za preuzimanje

FON*C kompajler, kao i primeri izvornog koda  (alatka make ce biti posebno objašnjena u nekom narednom postu).

Prezentacija o osobinama jezika FON*C kao i o konstrukciji FON*C kompajlera napravljena u sklopu odbrane Master rada. 

Kompletan tekst mog Master rada na temu Konstrukcija kompajlera jezika FON*C za .NET platformu primenom ANTLR-a U prvom delu rada razmotrena su osnovna pitanja konstrukcije programskih jezika: sintaksa, imena, tipovi, semantika, funkcije i upravljanje memorijom. U drugom delu rada opisano je Zajedničko jezičko izvršno okruženje (Common Language Runtime, CLR) .NET programske platforme: Zajednički sistem tipova (Common Type System, CTS) i programski jezik samog CLR-a - Zajednički međujezik (Common Intermediate Language, CLI). U trećem delu rada implementiran je kompajler programskog jezika FON*C.

Pravci daljeg rada:
  •  Integracija jezika u Visual Studio 2010, što bi trebalo da obuhvati:
  • Isticanje sintakse (syntax highlighting)
  • Strukturiranje - kolapsiranje teksta (outlining)
  • Intelisens
    • Dovršavanje naredbi
    • Brzi info
    • Pomoć oko parametara
  • Podudaranje zagrada (bracket matching)
  • Formatiranje
  • Isticanje grešaka


Povezani postovi:


Tuesday, December 6, 2011

Assignment as an expression in Antlr grammar

Look at this example of statement:
a = b = 4;
This statement consists of two assignments: b = 4 is an assignment expression, and a = b = 4; is an assignment statement

Here is the topic on stackoverflow about how to define an assignment as an expression in ANTLR grammar. I gave one of the answers, but you should see the other that uses syntactic predicates. It turned out to be very good example if you want to understand how to effectively use syntactic predicates.
I will present you here how to define assignment expression

About an assignment expression

What is so special about assignment expression? There is one crucial difference between operators like + or *, and an assignment operator =. It is right associative. So, if you write a = b = 4;  b = 4 is evaluated first. (In contrary if you write 2 + 3 + 4, 2+3 is evaluated first). Next thing is that assignment operator has almost the lowest precedence (only comma operator has lower precedence in C). So when you add the assignment expression to you grammar, it has to be on the top of the other expressions.

Simple ANTLR grammar with an assignment expression

Here is the complete example in ANTLR parser grammar:

As you can see, expression is defined as an assignment expression or like an add expression:
The way assign expression is defined ensures it's right associativity, since expr on the right side of the Assign token is always evaluated first. Also, this way of definition ensures the lowest precedence of the assign operator. A few parse tree examples are given below:

Statement       a = 2 + 3;  gives the following parse tree:

And statements

a=b=4;
a = 2 * (b = 1);

give the following parse tree:

Create a Language Compiler for the .NET Framework using ANTLR


Summary of what I did

My main goal here was to show you how to make a complete compiler with a help of ANTLR. I made the compiler for the same language called “Good For Nothing” described here. Compiler that was built there is written completely by hand. This is summary of what I did:
  1. I wrote combined ANTLR lexer and parser grammar with the AST tree as output from the parsing process
  2. I generated lexer and parser classes (with ANTLRWorks 1.4 IDE, and c# as target language (later releases of ANTLRWorks were not compatible with the C# target at the time I started using it. You can find ANTLRWorks 1.4 in the "grammar" folder of the source code I provided))
  3. wrote ANTLR tree grammar 
  4. generated tree parser class from the tree grammar 
  5. Made Visual Studio Console project, wrote code generator that emits CIL, added ANTLR generated classes to the project, and also some code that performs semantic checks, like type checking etc. (this checks are not complete, I just showed one way to do it)

Source code of the project and how to use it

Complete source code of the compiler can be downloaded here. You can extract it and run it from the Visual Studio. When you run the solution it will open test.txt file which is placed in bin/Debug path of the solution. In test.txt there is program written in Good For Nothing Language. It is the same program that is used as an example at http://msdn.microsoft.com/en-us/magazine/cc136756.aspx :

var ntimes = 0;
print "How much do you love this company? (1-10) ";
read_int ntimes;
var x = 0;
for x = 0 to ntimes do
   print "Developers!";
end;
print "Who said sit down?!!!!!";

After compiler executes, you will find generated test.il file in the Debug folder. It will contain CIL code made by the Good For Nothing Compiler:

.assembly extern mscorlib
{
}
.assembly test
{
}
.module test.exe
.class private auto ansi test extends [mscorlib]System.Object {
.method private hidebysig static void  Main(string[] args) cil managed {
 .entrypoint
        ldc.i4  0
        .locals init ( int32 ntimes)
        stloc ntimes
        ldstr  "How much do you love this company? (1-10) "
        call       void [mscorlib]System.Console::WriteLine(string)
        call       string [mscorlib]System.Console::ReadLine()
        call       int32 [mscorlib]System.Convert::ToInt32(string)
        stloc ntimes
        ldc.i4  0
        .locals init ( int32 x)
        stloc x
        ldc.i4  0
        stloc x
        br IL_0000
        IL_0001:
        ldloc x
        ldc.i4    1
        add
        stloc x
        ldstr  "Developers!"
        call       void [mscorlib]System.Console::WriteLine(string)
        IL_0000:
        ldloc x
        ldloc  ntimes
        clt
        brtrue IL_0001
        ldstr  "Who said sit down?!!!!!"
        call       void [mscorlib]System.Console::WriteLine(string)
ret
}
}

Then you will use IL Asembler of .NET framework to make exe file from test.il. Navigate to the folder where test.il lies and type:

After you get an exe file, you can run it:

Where to start

When I started to think about making the language in February, the idea was to make a language, and compiler for it, that will emit CIL code. I had a number of questions since I was new to this field, so I started to look for the books and on-line resources about this topic. One of the first articles that I found on the Internet was " Create a Language Compiler for the .NET Framework ", written by Joel Pobar. He showed there how to make a simple .NET language and compiler by hand. It is good place to start if you want to build a compiler for .NET because the article gives you the whole picture about language definition and steps of compiler construction. I recommend you to read it before you continue with this article because I intend to make the same language and compiler not by hand, but with the help of ANTLR.
Also, if you are not that familiar with the fundamentals of language grammars and language design issues, I will strongly recommend reading the book Programming Languages: Principles and Paradigms. This book emphasizes a thorough, hands-on treatment of the key issues in programming language design and shows you the definition and an implementation of a modest subset of C language, called C-lite. A complete C-lite interpreter  is made from scratch, so it is great place to start.
If you want to use ANTLR to make language recognizers I will recommend this book The Definitive ANTLR Reference: Building Domain-Specific Languages. It is the most complete reference of ANTLR and an excellent guide, with many concrete examples of how to write ANTLR grammars.
The best book about CIL I have found is Expert .NET 2.0 IL Assembler. You will not need any other book to understand CIL code. I would also recommend using CIL Disassembler – ILDASM. The CIL Disassembler is a companion tool to the CIL Assembler (Ilasm.exe). Ildasm.exe takes a portable executable (PE) file that contains binary Common intermediate language code and creates a text file that contains CIL Assembly – textual CIL code that human can read. You can look inside CIL Assembly to see how CIL code looks like. More about ildasm you can find here.

The tool to generate language recognizer - ANTLR

Making language grammar and compiler by hand becomes increasingly difficult as language becomes bigger. Just as developers use IDEs (integrated development environments) to dramatically improve their productivity, programmers need a sophisticated development environment for building, understanding, and debugging grammars. I decided to use one such environment - ANTLRWorks. ANTLRWorks is a complete development environment for ANTLR grammars. There are two very common diffulties encountered by grammar developers: Understanding why a grammar fragment results in parser nondeterminism and determining why a generated parser incorrectly interprets an input sentence. You can write and debug grammars in ANTLR. From this formal language specification, ANTLRWorks generates a program that determines whether sentences conform to that language i.e. - a language recognizer that consists of lexer and parser. ANTLR supports multiple target languages such as Java, C#, Python, Ruby, Objective-C, C, and C++. 

Combined ANTLR lexer and parser grammar of Good For Nothing language

Here is the complete ANTLR lexer and parser grammar (You can also find this file in the “grammar” folder in the source code):

In the options section of an ANTLR grammar, you can specify a series of key/value assignments that alter the way ANTLR generates code. Language option specifies the target language in which ANTLR should generate recognizers. 
Basically I told ANTLR to generate lexer and parser classes in C#.  A deep discussion of ANTLR grammar is beyond the scope here, but I will explain the basics.
According to Good For Nothing language definition, program (prgm rule) consists of one or more statements (stmt rule) following by the EOF token. EOF is a predefined token for the end-of-file. You should add an EOF token; otherwise ANTLR will stop parsing when there is no valid token to be matched. Statement can be variable declaration, assignment, for loop, reading of integers from the command line, or printing to the screen. Expressions (expr) can be strings, integers, arithmetic expressions, or identifiers. Identifiers (ident) can be named using an alphabetic character or “_” as the first letter, followed by characters, numbers or “_”. And so on. Quite simply, I've defined a language syntax that provides basic arithmetic capabilities, a small type system, and simple, console-based user interaction.

It is not compiler yet

There are a lot of examples of ANTLR grammars here, but when you finally write your grammar and generate lexer and parser from it - you just got the language recognizer. It is most important part of the compiler, but in order to build a whole compiler, you must not only determine whether some sentence conforms to the rules of some language but to translate it to some other form, according to the meaning of the sentence. In our case, that other form is the CIL code. The way of linking parse phase of the compiler with the code generation that emits CIL was a real little mystery for me earlier this year, and I almost found no example about how to do it, but when I finally realize the way, it turned out very simple.

Intermediate representation of the source code - Abstract Syntax Tree (AST)

It is valuable to define the structure of a programming language on a more functional level than that is offered by its syntax. Abstract syntax is a notation that allows the parser to strip away the syntactic sugar and generate a tree that contains only essential elements of the program (for example punctuations, whitespaces and other irrelevant details are omitted). Typically an Abstract Syntax Tree is used as an intermediate representation of the program.

When you use the output=AST option, each of the grammar rules will implicitly return a node or subtree. The tree you get from invoking the starting rule is the complete AST. Building ASTs with ANTLR is straightforward. Just add AST construction rules (so called rewrite rules) to the parser grammar that indicate what tree shape you want to build. AST construction rules are rules behind the ’->’.


First element inside ^() becomes the root, and other elements becomes its children. So, for example, if we have the declaration: var x = 0; we get following subtree:
You can define AST tree with rewrite rules or operators for the tree construction – „^“ and „!“.To specify a tree structure, simply indicate which tokens should be considered operators (subtree roots) and which tokens should be excluded from the tree. Use the ^ and ! token reference suffixes, respectively. The expression rules use AST construction operators almost exclusively because they are much more terse than rewrite rules. To define expressions there is a series of rules, one for each operator precedence level and one for the lowest level describing primary expressions like identifiers, strings and integers. Start with an overall rule called expr that represents a complete expression. Rule expr will match operators with the weakest precedence, plus and minus, and will refer to a rule that matches subexpressions for operators with the next highest precedence. In this case, that next operators are multiply and divide.
From this lexer and parser grammar we can generate lexer and parser classes with ANTLRWorks IDE simply by choosing “Generate code” from the menu:


When parsing some Good For Nothing program, these classes will generate AST representation of the program. Next we need a code that will “walk” this tree and perform some actions according to the meaning of the program. 



AST grammar

ANTLR gives you notation to describe the structure of the abstract syntax tree – just like you wrote your ANTLR parser grammar and generated lexer and parser programs from it, you can write you tree grammar and define rules that describe appropriate structure of the tree. After this you can generate tree walker – a program that checks if some particular tree conforms AST grammar rules that you wrote. AST grammar of Good For Nothing language is given below. It looks like parser grammar with only rewrite rules left. All of the expression rules from the parser grammar collapse into a single recursive rule in the tree grammar. The parser constructs ASTs for expressions that encode the order of operations by their very structure. The tree grammar does not need to repeat the multilevel expression grammar pattern used by parser grammars to deal with different levels of precedence. A tree grammar can just list the possibilities, yielding a much simpler description for expressions than a parser grammar can use. There are no lexer rules in the tree grammar because you are parsing tree nodes, not tokens or characters.

Linking syntax and semantics analysis phases of a compiler with AST

Abstract syntax forms basics for linking syntax and semantic analysis phases of a compiler. It allows a considerably more compact parse tree than the concrete syntax, and yet it conveys all the required information for semantic processing.
ANTLR provides you the way to add code snippets in the target language to the rules of your ANTLR grammar. With this code snippets you define what happens when tree walker match some rule in the AST of the program. By adding code snippets to the grammar, the recognizer becomes a translator. You can use these code snippets to invoke code generation and apply semantic rules (for example type checks or check if some variable is declared before it is used). AST grammar with code snippets between curly braces is given below:

I will try to explain the most important parts this code. First, I defined a global scope to handle dictionary of identifiers:

Scope has its name, and Dictionary symbols where I will put defined variables, their names (as a key), types etc. Prgm rule then can access this scope with:

For simplicity, Good For Nothing has only this global scope. If you have more scopes, you will declare new scope under the particular rule. Recognizer pushes a new Dictionary onto a stack of scopes upon entry to each method that declares scope GoodForNothingScope.
In @header section I specified namespaces needed by tree walker class.
In @members seciton I put variables declarations and methods visible to all tree grammar rules. First I declared instance target of the code generator class, that I will show you later:

private CodeGen target = new CodeGen();

Then I defined method resolveType(), wich will return the type of the compound expression:

public static string resolveType (string e1, string e2)
{
if((e1== "string") && (e2 == "string"))
{
return "string";
}
return "integer";
}

For example if we have expression like 2+3, the type of 2 will be integer, and the type of 3 will be integer, so expression 2+3 will also be of the integer type.
Next I defined method isDefined(), witch for the given name (id) of the variable checks if variable is declared in the global scope:

public bool isDefined(String id) 
        {
            if ( $GoodForNothingScope[0]::symbols.ContainsKey(id) ) 
            {
               return true;
            }
            return false;
        } 

Next I defined @init and @after actions for the prgm rule. To execute an action before anything else in the rule and to define local variables, use an init action. Similarly, to execute something after any alternative has been executed and right before the rule returns, use an after action. I use @init in the prgm rule to initialize scope, and to give it a name – “global”.

Code generation for Good For Nothing language

With    target.Start(); Start() method  of the CodeGen class is called to start emit CIL code into output file stream. 
I use @after in the prgm rule to call End() method:
@after {
     target.End();
}

Also, when tree walker matches particular AST grammar rule it will perform semantic checks and emit CIL code corresponding to that particular rule. 
Start() method emits following code:
.assembly extern mscorlib
{}
.assembly test
{}
.module test.exe
.class private auto ansi test extends [mscorlib]System.Object {
 .method private hidebysig static void  Main(string[] args) cil managed {
 .entrypoint

End() method emits following code:
ret
}
}

Between calling Start() and End() methods, tree walker will walk the AST of the particular program and emit the rest of the CIL code between these two parts. I will show you how this process takes place on the example of declaration and initialization of variable x.

Tree walker first goes to the ident rule which only returns the name of the variable:


ident returns [string text] : ID {$ident.text = $ID.text;}
;

Then tree walker goes to the expression. It matches INT subrule and emits instruction ldc.i4   0, which puts int32 value 0 to the stack. Then code in the curly braces in the statement rule is executed. Compiler checks if variable with name x is already defined:


if ( isDefined($ident.text) ) 
     {
                throw new MyException("variable is already declared: " + $ident.text);
                   }

If everything is OK, compiler concludes the type of the variable according to the type of the initializing expression. Then constructor of the Variable class is called. After this I put variable name (as the key) and instance of the Variable class (as the value) into Dictionary that represents global scope. EmitLocals() method of the CodeGen class is called and it emits the declaration of the local variable x of the type int32 into the output file stream. After this, compiler emits stloc    x instruction, which takes 0 from the stack (e.g. zero) and stores it to the local variable x.

This is complete code of the CodeGen class:




Monday, December 5, 2011

Short about FON*C


This is update of this blog after some time, so first, I want to tell you what happened with the language I intended to make earlier this year ...  As you can see in my previous post, I intended to make a language that will compile to CIL code. Well, I did it. I made it during the studies at the Faculty of Organizational Sciences under the supervision of my mentor - professor Saša D. Lazarević who gave me a very idea of making the language and great help with language definition and the necessary literature.

The name of the language is FON*C, and it is completely C-like, imperative, strongly typed language made with ANTLR and C#. FON*C is statically typed and has numerous built in types: int, long, bool, char, string, float, file, etc., and user defined types: enums, structs and delegates. It supports following flow control statements: if, switch, for, goto, while and do while. It supports header files, so you can make your own function libraries and include it with #include directive, just like in C. You can pass function arguments by value and by reference, etc...

I will not post the source code of the FON*C compiler here at this time because it became pretty big over the last couple of months, so it might happen that one can't see the forest for the trees.
Instead, int the next post I will show you how to make one much simpler, but complete compiler. When you realize how to do it, you can make bigger and more complex compilers by yourself. All you need then is the idea what to build, and the time to do it.

Sunday, May 1, 2011

ANTLR parser and tree grammar for my simple language

Here is some work I've done recenly: it is ANTLR parser and tree grammar for simple imperative (C-Ada-Pascal-or-whatever)-like programming language. You can find grammars on stackoverflow in antlr-parser-and-tree-grammars-for-one-simple-language topic. 

Next, I am planning to add code generator for this, that will emit Common Intermediate Language. So the source code will compile to CIL.

The Common Intermediate Language (also known as Microsoft Intermediate Language, or MSIL) is an intermediate representation developed by Microsoft, to be the core language of .Net platform. All the .Net languages such as C#, VB.NET or VC++ are compiled to CIL which is the only language accepted by the Common Language Runtime (CLR) of .Net.

Feel free for any suggestions about grammars.

Vladimir