Toy Compiler Using Ocamllex, Menhir, and LLVM

2019-05-01

Implementing our own compiler for a simple C-like language.

Getting Started

A compiler is a pipeline consisting of sub-components: Lexical Analysis, Parsing, and Assembly (and linking). We implement the three components to form a complete compiler using the corresponding tools: ocamllex, menhir, and LLVM. There is a lot of theory going on in the background such as LL(0), LL(1), or BNF grammar, abstract syntax trees(AST), and so on... but we simply use this theory and recommend brushing up if unfamiliar.

The full complete code can be found at the about page.

What Is Our Grammar?

The grammar is the central part of any language. For our toy example we want to keep it simple but complex enough to keep it interesting. We will be using a simple C-like syntax, as shown below:

double circle(double r) {
    3.14 * r * r
}

double cyl(double r, double h) {
    h * circle(r)
}

int main() {
    double vol = cyl(2.0, 5.0)
    printf(vol)
    0
}

The last statement is treated as the return value for the function.

Step 1 - Lexical Analysis

Since OCAML allows use to easily pattern match and traverse ASTs, everything will be written in OCAML. A nice library for lexical analysis is ocamllex. Given our syntax, we need to break it down to a list of known tokens (with the help of regex) such as variables, numbers, operators, etc. Our file is called lexer.mll and simply contains:

{
    open Parser
}

rule token = parse
  | [' ' '\t' '\n']                  { token lexbuf }
  | ['a'-'z' 'A'-'Z' '_']['a'-'z' 'A'-'Z' '0'-'9' '_']* as s       {  TIDENTIFIER (s)}
  | ['0'-'9']+ as lxm                { TINTEGER (int_of_string lxm) }
  | ['0'-'9']+'.'['0'-'9']* as lxm   { TDOUBLE(float_of_string lxm)}
  | "=" as s                         { TEQUAL(String.make 1 s) }
  | "==" as s                        { TCEQ(s) }
  | "!=" as s                        { TCNE(s) }
  | "<"  as s                        { TCLT(String.make 1 s) }
  | "<=" as s                        { TCLE(s) }
  | ">"  as s                        { TCGT(String.make 1 s) }
  | ">=" as s                        { TCGE(s) }
  | "&&" as s                        { TCAND(s) }
  | "||" as s                        { TCOR(s) }
  | "+"  as s                        { TPLUS(String.make 1 s) }
  | "-"  as s                        { TMINUS(String.make 1 s) }
  | "*"  as s                        { TMUL(String.make 1 s) }
  | "/"  as s                        { TDIV(String.make 1 s) }
  | '('                              { TLPAREN }
  | ')'                              { TRPAREN }
  | '{'                              { TLBRACE }
  | '}'                              { TRBRACE }
  | ','                              { TCOMMA }
  | eof                              { EOF }

As you can see we define a token which can be an identifier, an integer, a double, equals, etc. And each option returns a type that may contain internal data such as the name of the identifier, or the value of the integer or double.

Step 2 - Parsing

Now we want to creeate an AST which will represent our language. To do this, we need to define fundamental expression and statement that each node in the AST will define. Hence we have the following:


type expr = 
  | Int of int
  | Double of float
  | Assignment of string * expr
  | MethodCall of string * expr list
  | BinaryOperator of expr * string * expr
  | Identifier of string

type statement = 
    | Expr of expr
    | VariableDeclarationExpr of string * string * expr (* type, id, assignment expr *)
    | VariableDeclaration of string * string
    | FunctionDeclaration of string * string * statement list * statement list

type block = statement list

We have an expression type that can be classified as an int, double, assignment, method call, binary operator, or a variable. A statement type that can be an expression, a variable decleration or assignment, and a function decleration. A block is simple a list of statements. With these types, we can parse the tokens using menhir and generate an AST that represents each expression, statement, and block. In our file parser.mly we have the following:

%{
    open Expr
%}

%token <int> TINTEGER
%token <float> TDOUBLE
%token <string> TIDENTIFIER
%token <string> TPLUS TMINUS TMUL TDIV
%token TLPAREN TRPAREN TLBRACE TRBRACE
%token <string> TEQUAL TCEQ TCNE TCLT TCLE TCGT TCGE TCAND TCOR
%token TCOMMA EOF


%type <string> ident
%type <Expr.expr> numeric expr 
%type <Expr.statement list> func_decl_args
%type <Expr.expr list> call_args
%type <Expr.block> program stmts block
%type <Expr.statement> stmt var_decl func_decl
%type <string> comparison

%start program

%%

program: stmts EOF                   { $1 }
    ;

stmts : stmt { [$1] }
    | stmts stmt { $1@[$2] }
    ;

stmt : var_decl {$1} | func_decl {$1}
    | expr { Expr($1) }
    ;

block : TLBRACE stmts TRBRACE { $2 }
    | TLBRACE TRBRACE { [] }
    ;

var_decl : ident ident { VariableDeclaration($1, $2) }
    | ident ident TEQUAL expr { VariableDeclarationExpr($1, $2, $4) }
    ;

func_decl : ident ident TLPAREN func_decl_args TRPAREN block 
                {FunctionDeclaration($1, $2, $4, $6)}
    ;

func_decl_args : {[]}
    | var_decl {[$1]}
    | func_decl_args TCOMMA var_decl {$1@[$3]}
    ;

ident : TIDENTIFIER { $1 }
    ;

numeric : TINTEGER { Int($1) }
    | TDOUBLE {Double($1)}
    ;

expr : ident TEQUAL expr {Assignment($1, $3)}
    | ident TLPAREN call_args TRPAREN {MethodCall($1, $3)}
    | ident {Identifier($1)}
    | numeric {$1}
    | expr comparison expr {BinaryOperator($1, $2, $3)}
    | TLPAREN expr TRPAREN {$2} 
    ;

call_args : {[]}
    | expr {[$1]}
    | call_args TCOMMA expr {$1@[$3]}
    ;

    comparison : TCEQ {$1} | TCNE {$1} | TCLT {$1} | TCLE {$1} | TCGT {$1} | TCGE {$1} | TPLUS {$1} | TMINUS {$1} | TMUL {$1} | TDIV {$1} | TCAND{$1} | TCOR{$1}
           ;

%%

Step 3 - Assembly using LLVM

With an AST constructed we can now generate LLVM-IR code. I recommend taking a look at the offical LLVM example project, to get a feel for how LLVM works. Essientally we go through our AST, match on the type of node, and preform the corresponding IR operation. For example, in codegen.ml we have the following match statement for Expr:

let rec codegen_expr (e: expr) =
    match e with
    | Int n -> const_int int_type n
    | Double n -> const_float double_type n
    | MethodCall (callname, args) -> 
            (*Printf.eprintf "Calling function %s\n" callname;*)
            let callee =  match lookup_function callname the_module with
                            | Some callee -> callee
                            | None -> raise (Error "unknown function referenced")
            in
            let params = params callee in
            if Array.length params == List.length args then () else
                raise (Error "incorrect # args");
            let args = List.map codegen_expr args in
            if callname = "printf" then
                let printval = List.hd args in
                match classify_type (type_of printval) with
                | TypeKind.Integer ->
                    let const_int n = const_int int_type n in
                    let str s = define_global "buf" (const_stringz context s) the_module in
                    let int_spec = build_gep (str "%d\n") [| const_int 0; const_int 0 |] "int_spec" builder in
                build_call callee [| int_spec;  (List.hd args) |] "" builder
                | _ ->
                    let const_int n = const_int int_type n in
                    let str s = define_global "buf" (const_stringz context s) the_module in
                    let int_spec = build_gep (str "%f\n") [| const_int 0; const_int 0 |] "float_spec" builder in
                build_call callee [| int_spec;  (List.hd args) |] "" builder

            else
            build_call callee (Array.of_list args) "calltmp" builder
    | BinaryOperator (lhs, op, rhs) ->
            (*Printf.eprintf "Operator %s\n" op;*)
            let lhs_val = codegen_expr lhs in
            let rhs_val = codegen_expr rhs in
            (*Printf.eprintf " type LHS %s" (string_of_lltype (type_of lhs_val));
            Printf.eprintf " type RHS %s\n" (string_of_lltype (type_of rhs_val));*)
            let w =
                match classify_type (type_of lhs_val), classify_type (type_of rhs_val) with
                | TypeKind.Integer, TypeKind.Integer -> 0
                | _, TypeKind.Double -> 1
                | TypeKind.Double, _ -> 1
                | _,_ -> 1
            in
            (
                match op with
                    | "+" -> if w = 0 then build_add lhs_val rhs_val "addtmp" builder else build_fadd lhs_val rhs_val "addtmp" builder
                    | "-" -> if w = 0 then build_sub lhs_val rhs_val "subtmp" builder else build_fsub lhs_val rhs_val "subtmp" builder
                    | "*" -> if w = 0 then build_mul lhs_val rhs_val "multmp" builder else build_fmul lhs_val rhs_val "multmp" builder
                    | "/" -> if w = 0 then build_sdiv lhs_val rhs_val "divtmp" builder else build_fdiv lhs_val rhs_val "multmp" builder
                    | "&&" -> build_and lhs_val rhs_val "andtmp" builder
                    | "||" -> build_or lhs_val rhs_val "ortmp" builder
                    | _ -> failwith "Unsupported"
            )
    | Identifier name ->
            (
            (*Printf.eprintf "Variable %s\n" name;*)
            let v = try Hashtbl.find named_values name with
                | Not_found -> raise (Error "unknown variable name")
            in
            match v with
            | t, v -> (*build_load v name builder*) v
)

As you can see, we match an expression with either Int, Double, Method Call, Binary Operator, or Identifier and then generate LLVM code such as const_int int_type for integer number.

Building/Compiling

We create a main file that defines the compiler sub-components. In our file main.ml we have:

open Lexing
open Parser
open Lexer
open Codegen
open Expr

let filename = Sys.argv.(1)
let filename_llvm = Sys.argv.(2)

let () = 
  let block = open_in filename |>
  Lexing.from_channel |>
  Parser.program Lexer.token in
  (*|> print_block in*)
codegen_main block filename_llvm

This uses the pipe operator to pass along the block of code between each of the sub-components, with the final function generating the LLVM.

Now to build ocaml: ocamlbuild -use-ocamlfind -pkgs llvm,llvm.analysis,llvm.bitwriter -use-menhir main.byte. We can now pass into main.byte a file using toy language and output name for LLVM-IR. Afterwards, we compile using llc and gcc or clang to create our executable.

For our example to find the volume of a cylinder:

main.byte cylinder.in llvm-ir.out

more llvm-ir.out
@buf = global [4 x i8] c"%f\0A\00"

declare i32 @printf(i8*, ...)

; Function Attrs: noinline nounwind optnone sspstrong uwtable
define double @circle(double %r) #0 {
entry:
  %multmp = fmul double %r, %r
  %multmp1 = fmul double 3.140000e+00, %multmp
  ret double %multmp1
}

; Function Attrs: noinline nounwind optnone sspstrong uwtable
define double @cyl(double %r, double %h) #0 {
entry:
  %calltmp = call double @circle(double %r)
  %multmp = fmul double %h, %calltmp
  ret double %multmp
}

; Function Attrs: noinline nounwind optnone sspstrong uwtable
define i32 @main() #0 {
entry:
  %calltmp = call double @cyl(double 2.000000e+00, double 5.000000e+00)
  %0 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @buf, i32 0, i32 0), double %calltmp)
  ret i32 0
}

attributes #0 = { noinline nounwind optnone sspstrong uwtable }

llc llvm-ir.out && clang -o cylinder llvm-ir.s

And finally, running our compiled code prints our the volume of the cylinder!

./cylinder
62.800000

This is pretty cool! We went from defining our little language, to parsing it and compiling it, to running it. There are of course plenty of other features that we can add or include such as using LLVM optimization, or even optimizing the AST ourselves. But in the end we created a language that we can call our own.

Daniel Volya