&SPL – Creating a simple imperative programming language

SPL sample code

&SPL is an acronym for Simple Programming Language, and is a C-like imperative programming language, with classes, lists, and tuples. For example, the following code produces the Fibonacci sequence.

The language is formally defined with a context-free grammar and typing rules. In this post an implementation of an &SPL source code compiler implemented in Haskell will be discussed. The compiler compiles the source code into assembly for a Simple Stack Machine [2] (http://www.staff.science.uu.nl/~dijks106/SSM/) in a four-part process of lexing, parsing, typing and code generation. The source code is available on GitHub.

Grammar

The &SPL grammar defines the legal syntax of the language. The semantics are only partially defined through the grammar, and are more fully defined through the language typing rules defined in the Typing section.

The context-free grammar is as follows:

Here alpha and alphaNum refer to alphabetical and alphabetical or numerical characters respectively.

Operator precedence, associativity and semantics

The operator precedence, associativity and semantics of the binary operators are as follows:

Operator Precedence Associativity Semantics
|| 1 Left Boolean or
| 1 Left Bitwise or
&& 2 Left Boolean and
& 2 Left Bitwise and
== 3 Left Equals
!= 3 Left Not equals
< 4 Left Less than
> 4 Left Greater than
<= 4 Left Less than or equals
>= 4 Left Greater than or equals
: 5 Right List concatenation
&+ 5 Left Reference integer addition
&- 5 Left Reference integer subtraction
&-& 5 Left Reference reference subtraction
<< 5 Left Bit-shift left
>> 5 Left Bit-shift right
+ 6 Left Integer addition
6 Left Integer subtraction
* 7 Left Integer multiplication
/ 7 Left Integer division
% 7 Left Modulo

The unary operators are defined as follows:

Operator Semantics
! Boolean negation
Integer negation
(<Type>) Type casting
& Referencing
* Dereferencing

Lexing

The lexing process is the initial phase of &SPL compilation. Through lexing raw source code is turned into a list of tokens. Some such tokens represent single haracters in the source code, but other tokens represent more complex structures. For example, the source code string 49020 is represented as a single token TConstant (CInt 49020).

The lexer is implemented as a set of recognizers. Each recognizer takes as input the remaining source code, and either fails or outputs the recognized initial string and the corresponding token. If multiple recognizers produce output, the output of the recognizer matching the largest part of the input string is used. Tie-breaking is performed by a priority assigned to each recognizer. For example, the source code string if will be lexed as TKeyword KIf as opposed to TIdentifier "if", as the keyword recognizer has a greater priority.

The recognizer data types are as follows:

and a single recognize step takes a list of such RecognizerPriority, the source code string, and either fails or outputs the recognized initial string and corresponding token:

Regular expression engine

Internally, the recognizers make use of regular expressions to match strings. The regular expression engine implementation is based on Thompson’s construction [4, 5]. In this construction, the regular expression engine compiles a regular expression to a non-deterministic finite automaton (NFA). Here, the compilation is implemented similarly to the main compiler as separate processes of lexing, parsing, and “code” generation in the form of building the NFA. The compiled NFA can subsequently be simulated on an input string, returning all matching strings.

The NFAs are implemented as machines taking string input, and have a set of states, a set of transitions between states, an initial state, and a set of accepting states. Four transition types are implemented: transitions from one state to another predicated on an input character, a conditional transition with an arbitrary predicate on characters, an empty transition, and a transition if the end of the input string has been reached.

The regular expression engine supports the following regular expression constructs:

  • the basics: union |, concatenation, literals , and the Kleene closure;
  • the option operator (“0 or 1 times”) ?;
  • characters sets (“one of …”) [a-zA-Z0-9_!];
  • negative character sets (“none of …”) [^a-z];
  • escaping of control characters (“literal ‘?'”) \?; and
  • the end-of-string marker $.

Many contemporary regular expression engines are implemented using some form of complex recursion, e.g. through parser combinators [1]. As such, they are built using a construct able to recognize more expressive types of languages, such as the context-free grammar, as opposed to pure regular expressions. In contrast, NFAs and regular expressions are equivalent regarding the types of languages they can recognize. Thus, Thompson’s construction generally results in a better-performing implementation of regular expressions. For example, compiling the following regular expression

and matching it with

takes 58 seconds using Python’s regular expression implementation, and only 0.7 seconds using the implementation created for this compiler.

Constructing recognizers

The following illustrates some of the recognizers used and how to construct them:

Parsing

Parsing &SPL proceeds after the lexing phase. The input to the parser is a list of tokens. The parsing phase is aimed at transforming tokens into a more direct representation of the program structure. For example, an expression using a binary operator will have associated with it the specific operator and two sub-expressions, where each sub-expression can be arbitrarily large, resembling a tree structure. As such, the parser transforms the tokens into an abstract syntax tree, and is implemented using a recursive top-down implementation with parser combinators, provided by the Haskell library Parsec.

Abstract syntax tree

At a high level, the &SPL abstract syntax tree consists of declarations (auxiliary source code includes, class declarations, variable declarations and function declarations), statements (such as control structures), expressions (actual calculations) and constants. The &SPL abstract syntax tree is represented naturally as an algebraic data type in Haskell, e.g.:

The types are implemented to keep track of AST meta-information; namely the original source code position for error messages, and the resulting data type of the underlying tree:

Parser implementation

Using parser combinators, the parser closely mimics the defined grammar. For example, 'while' '(' ')' '' is implemented as follows:

However, not all elements of the original grammar are translated so readily. The defined grammar is ambiguous; e.g. it does not inherently make explicit whether 1 + 2 * 4 should be parsed as (1 + 2) * 4 or as 1 + (2 * 4). As such, on top of the the AST data structure we further define operator precedence and associativity rules:

for which we can implement an expression parser as:

Pretty printing

Based on the abstract syntax tree, &SPL source code can automatically be pretty printed. For example,

becomes

Additionally, unnecessary parentheses are ignored:

becomes

Typing

&SPL is a strongly typed-language. However, unlike strongly-typed languages such as C++, the programmer is not required to annotate the source code with variable and function type definitions in &SPL. The &SPL compiler has a powerful type inference system based on type unification. In many cases a programmer does not need to supply any type annotations in a &SPL program at all.

Type system

&SPL has well-defined typing rules. The type system is based on the Hindley-Milner system [bicite key=milner1978theory], allowing for polymorphism and inference to find the most general type. The rules state what the resulting type is given the input types of an expression, statement, etc. Any expression, statement, etc., without a corresponding typing rule is illegal.

For example, given an element a of type A, the expression [a] will be of type [A]. Typing &SPL source code entails proving the types of statements, expression, etc., are consistent using these rules. The rules can be formalized in a natural deduction style:

Type inference

The type system used allows for automatic inference to the most general type. Type inference is implemented using Algorithm M, which is a top-down algorithm using unification to solve type constraints. The type constraints come from the AST in combination with the defined type rules (see the Type system section above).

For example, the reference and dereference rules shown above are implemented as follows:

Here, the metaMGU statements are used to enable annotating the AST with type information. They are not actually part of the type inference itself. We see that the type inference for the unary reference operator is based on an additional “meta type” of the expression, namely its value type. This is expanded upon in the next section.

Value types

&SPL provides pointers, with referencing and dereferencing, and assignment to expressions. These constructs make use of the address of a value in memory. The address of a value only makes semantic sense if the value exists for longer than the expression it is used in; i.e., it should be persistent over multiple statements. As such, each expression has an associated value type that is either persistent or temporary.

Intuitively, an expression has a persistent value type if it is derived from a named entity, and has a temporary value type otherwise. Value types are, with some abuse of notation, inductively defined as follows:

Dependency analysis

The type system used allows for polymorphism, but this entails AST types should be inferred to their most general type. As such, it is undesirable to solve the entire program in a single set of constraints, as this would destroy the ability to have polymorphism.

Instead, type constraint unification is performed in blocks of mutually dependent declarations, after which the found types in those blocks are generalized. Generalization transforms the free type variables in the found most general types into bound type schemes, or “forall types.”

The dependencies of a declaration are found by traversing the AST and noting all other declarations it uses, thereby building a graph of dependencies. The strongly connected components are identified in this graph, thus finding all declarations that are mutually dependent. Then, the groups of mutually dependent declarations are toplogically ordered such that groups containing declarations are processed and generalized before groups that depend on those declarations.

AST notation

The AST is annotated with the found types, such that the code generator has access to type information of any AST element.

Example

The following unannotated code

is automatically typed as follows

Code generation

After typing has been completed successfully, the resulting AST makes semantic sense within the rules of the language, and can be transformed into machine code. Here, the AST is transformed into SSM assembly. Each line of SSM assembly contains an instruction and optionally a label and a comment. The comments are for annotation purposes only and otherwise ignored. The labels are resolved to numeric constants by the SSM assembler.

Intermediate SSM representation

It would be possible to output the code generated from the AST directly in the plaintext assembly format. However, this is prone to mistakes and makes post-processing difficult. Instead, the AST is first compiled to an intermediate SSM representation, which is then compiled to plaintext. The SSM code is represented as a list of SSM lines. Lines are defined as:

with:

Stack-based

The SSM is, as its name suggests, stack-based, and as such code generation proceeds in a simple stack-based manner; i.e., evaluation of expressions is implemented akin to reverse Polish notation.

For example, evaluating a binary operation expression is performed by evaluating the first expression, evaluating the second expression, and then executing the operator instruction.

However, the SSM also provides a heap and multiple registers. The registers are largely ignored in generated code, though some are used implicitly by instructions to control the flow through the program code and stack. The only exception is the return register, which is used explicitly to set and retrieve function results. The heap is used in dynamic memory allocation.

Caller / callee convention

It is important to clearly define the program flow when generating code. Perhaps most apparent is the need to set up a clear function caller / callee convention. When calling a function, the function frame has to be set up, and the frame should be cleared upon leaving the function. However, there are multiple ways this can be achieved. For example, either the calling or the called code could reserve stack space for the function locals, and either the calling or the called code could subsequently clear the stack space.

Here, the convention is taken that the calling code evaluates the argument expressions in reverse order, places the return address on the stack, and finally branches to the called code. Upon returning, the calling code cleans the evaluated argument expressions from the stack. Upon entering the called code, stack space is allocated for any additional locals. Prior to leaving the called code, any objects with a class type (including objects passed as arguments) are destructed, the reserved space for locals is cleared, and the return address is popped from the stack.

Inspecting the stack for the following program right before returning from f, we see:

Address Value RegPtr
00000079 ffffffff Global g2
0000007a baaaaaad Global g1
0000007b 00000006 Ret. addr. from main (to halt)
0000007c 00000078 Prev. frame pointer
0000007d cafebabe Arg. c
0000007e 000007d1 Arg. b
0000007f deadbeef Arg. a
00000080 0000005f Ret. addr. from f (to main)
00000081 0000007c FP Prev. frame pointer
00000082 cafed00d Local 1, d
00000083 99887766 Local 2, g2
00000084 11223344 SP Local 3, a

Global variables

Global variables are evaluated at the start of the program. Their results are then stored sequentially at the top of the stack. As they are evaluated first, their values will subsequently remain at the bottom of the stack throughout the program. To access these variables, a label __end_pc is added to the end of the program code. This label is a constant offset 0x12 lower than the bottom of the stack. As such, a global can be loaded by first putting the label value on the stack, and then loading from the address pointed to with a static offset based on which global should be accessed (0 for the first global variable, 1 for the second, etc.). In pseudo-SSM assmembly:

Order of global variables

As global variables can depend on other global variables (but mutual dependence between global variables is not supported), the code generator uses the reverse topologically ordered AST. This guarantees that all global variables are declared and initialized before the global variables depending on them.

Tuple

When evaluating a tuple, the tuple contents are placed in the heap, and the tuple address is placed on the stack. The tuple data looks as follows in the heap:

Address offset Value
0 Tuple element 2
1 Tuple element 1

Lists

Lists are implemented as linked lists. Similar to tuples, a linked list takes its values on the heap:

Address offset Value
0 List element value
1 Address of list tail

An empty list has the next address set to -1.

Value copying

Values of expressions can be copied on the stack, which means a new value is created from the value of the expression. For expressions of all types except a class type, copying entails simply evaluating the expression, thereby placing the value on the stack. For expressions of a class type, however, the “value” of the object does not make semantic sense. Should it be its first member’s value, or something else? Instead, a full copy of the object is placed on the stack. Expression values are copied in three cases:

  1. The value is assigned to an expression.
  2. The value is passed to a function.
  3. The value is returned from a function.

Though, note that this last case is not yet supported in the code generation, as the return register used for returning values can only pass values of a single word.

Memory

&SPL provides refined control over memory usage. The heap can be managed explicitly using memory allocation functions.

Function malloc allocates a block of consecutive memory of at least the amount of words requested (or a null pointer if no memory is available). Conversely, free frees the pointed-to memory that was previously allocated.

This allows for efficient implementation of arrays, without the overhead incurred by lists.

As this will likely be a common use-case, subscript notation is provided. The previous code can be written more succinctly as:

Buddy block system

The memory allocator is based on the binary buddy block system (see e.g. [3]). Here, a block of consecutive memory is divided into equal-sized fragments. Each fragment has a buddy, with which it forms a larger block of memory. The larger blocks of memory similarly have a buddy, recursively, until the root is reached. Each memory allocation request receives a block of memory that is at least as large as the requested memory, but will likely be larger, as it must be a power of two. Thus, if the smallest fragment size is, e.g., 2^5 words, and the total memory size is 2^20 words, a maximum of 2^15 objects can be held in memory.

For example, for 64 words of memory with fragments of 8 words, the memory tree would look as follows, where the header represents the memory offset, and the cells the block size.

Implementation

The buddy block system keeps track of buddy block allocations, or splits.

When allocating memory, the algorithm checks whether a large enough consecutive block is available. If not, it returns a null pointer. Otherwise, it starts from the root node. If the node is larger than needed for the requested memory and is not yet split, the node is split. The algorithm travels to the child with the least consecutive memory left that still fits. Once the smallest fitting block is found, the system updates the block allocations and returns a pointer to the start of the block.

When freeing memory, the system first frees the block pointed to, and subsequently recursively travels up the tree, re-merging two child blocks if they are now both free.

The algorithm is implemented in the SPL standard library.

Classes

A class in &SPL is a collection of variable members and methods. An object can be instantiated from a class by using the class constructor. An object can either be instantiated dynamically on the heap, or in-place on the stack. An object instantiated on the stack is automatically destructed using its class destructor when the object falls out of scope. When a class object is copied, the new object is instantiated through the class copy constructor.

New and delete

A class can be created dynamically on the stack by using the new keyword in front of the regular constructor expression. Note that, unlike languages such as C++, the new keyword does not require some “default” constructor. The defined class constructor can be used with dynamic parameters.

A class object created dynamically through the new keyword, must be manually destructed using the delete keyword. Note that delete is currently defined for any pointer, but using delete on a non-object pointer results in undefined behavior. An effective solution is already available in the type checker and code generator, where an object can be determined to be of a variable class type; e.g. Class (TVar "a") instead of Class (TClassIdentifier "Car"). However, no &SPL syntax has been implemented for this yet.

Members and methods

Members of class objects are stored in the object space (either on the heap or in the stack). Currently object members cannot have default values; they must be initialized using the copy constructor. Methods of class objects are shared by all objects of that class, and are located statically in the program code. When a class method is called, the called code receives an argument self which is a pointer to the specific object.

Type inference

Type inference for classes poses a problem. The methods and members of a class can depend on each other, and further other functions or methods and members of other classes can depend on the class as well. However, without typing the AST it is not immediately clear on which methods and members a piece of code is dependent. For example, to know the code obj->get() is dependent on a method SomeClass.get as opposed to SomeOtherClass.get would require knowing obj is of type SomeClass. This could be solved by typing the AST in two passes, where in the second pass enough information should be available for full dependency analysis.

However, instead, we opted for a simpler approach: in dependency analysis, once a class identifier was named explicitly, the code was deemed dependent on all methods and members of that class. Note that by nature of class method and member aliasing, or “namespacing,” any code using a member or method specific to a class must have prior named that class explicitly.

Type frames

To ensure class copy constructors and destructors can be called without knowing the specific class in the code generator, each class object contains a pointer to a location in the program code. This pointer can be seen as the object’s class type frame. The program code location contains specific instructions to load basic information of the object to the stack, such as the object’s size, and the addresses of the copy constructor and destructor.

As such, when the code generator is required to generate code for deleting an object, but that specific area of code does not know the class type of the object, the type frame can be used to load the destructor address.

Address offset Value
0 Instruction to load object’s size to stack
3 Instruction to load address of object’s copy constructor to stack
6 Instruction to load address of object’s destructor to stack

Standard library

To make &SPL more useful, a standard library is provided. The library contains functionality such as the buddy block memory management system, smart pointers, and some utility functions. The current library is distributed over four files:

  • std.spl
  • stdbuddyallocator.spl
  • stdsmartpointers.spl
  • stdmath.spl

Linking

To facilitate a standard library, some rudimentary form of code linking must be available. In the current version of &SPL and its compiler this is solved straightforwardly. The grammar is extended to allow for include statements, e.g. std.spl contains:

On compilation, the separate source files are parsed and type-checked independently. Code including another source receives the type-inferred type schemes of that source. On code generation, the generated ASTs are concatenated.

Smart pointers

For more automatic memory management and to prevent programmer mistakes resulting in memory leaks or dangling references, a reference counting smart pointer class is implemented in the standard library. This class wraps around a dynamically allocated object and leverages the copy constructor and the automatic destructor called when objects fall out of scope to keep track of the references to the wrapped object.

On constructing a smart pointer object it creates a dynamically allocated Count object. On copying a smart pointer object, the references to the count and wrapped object are shared, and the count is incremented. On destructing a smart pointer object, the shared count is decremented. Once the count reaches 0, the count and wrapped object are destructed. As a result, the programmer is no longer required to manually call delete on their objects.

For example, the following code will keep track of the references to the Vehicle object, even after the code returns from fun. Only once the last smart pointer wrapping the Vehicle object falls out of scope, will the object be destructed.

Higher-order functions

&SPL has rudimentary support for higher-order functions. Global functions can be assigned to variables and passed around as expressions. The support for higher-order functions was a natural precursor to implementing classes, due to the need for value assigning to persistent value typed expressions’ addresses and later the need for dynamically calling functions based on expression results (i.e., calling an object method). However, no anonymous or nested functions are supported.

The following program outputs 1, 2, 3, 4, …, 19, 20.

SSM assembly post-processing

The SSM assembly resulting from the code generation can contain unnecessary inefficiencies. Some of these inefficiencies stem from the recursive, and separated, nature of the code generation. For example, in the code generation of an if-else statement, a label is generated in the SSM pointing to the end of the else code. We can not simply print a raw label without an operation, as it is possible the statement that is to be generated after the if-else statement will generate a label at the start. This would result in illegal SSM assembly: lbl1: lbl2: .... Instead, the if-else statement generates a label to a nop — no-operation instruction — resulting in wasted cycles:

In post-processing of the SSM assembly some of these inefficiencies can be optimized out while maintaining semantics of the original program. The following optimizations are carried out in the given order:

  1. adjacent stack pointer adjustments, where the second adjustment does not contain a label, are merged;
  2. stack pointer adjustments set to adjust the stack by 0 are changed into a nop operation; and
  3. nop operations are removed; if a nop is labeled and the next operation is not, the label is moved to the next operation. If the next operation is also labeled, the label clash is resolved by renaming the nop label to the label of the second operation throughout the program code.

For example,

would become

Known issues

Currently type inference with programmer-given annotations is bugged. If a programmer uses a type variable annotation, such as a, twice at different locations, the type inference process will attempt to unify those types. Additionally, tuples and lists are not yet implemented as classes. As such, they do not use dynamic memory management yet, and can not be garbage collected.

Source code and statistics

The source code is available on https://github.com/Beskhue/spl-compiler. Lines of code:

  • Data structures: 961
  • Regex: 515
  • Lexer: 248
  • Parser: 551
  • Type checker: 1573
  • Code generator: 1154
[1] R. Cox, “Regular expression matching can be simple and fast (but is slow in Java, Perl, PHP, Python, Ruby,…),” , 2007.
[Bibtex]
@article{cox2007regular,
title={Regular expression matching can be simple and fast (but is slow in {Java}, {Perl}, {PHP}, {Python}, {Ruby},...)},
author={Cox, Russ},
year={2007}
}
[2] A. Dijkstra, “Simple stack machine.”
[Bibtex]
@Manual{SSM,
title = {Simple Stack Machine},
author = {Atze Dijkstra},
timestamp = {2017-06-18},
url = {http://www.staff.science.uu.nl/~dijks106/SSM/},
}
[3] J. L. Peterson and T. A. Norman, “Buddy systems,” Communications of the acm, vol. 20, iss. 6, p. 421–431, 1977.
[Bibtex]
@article{peterson1977buddy,
title={Buddy systems},
author={Peterson, James L and Norman, Theodore A},
journal={Communications of the ACM},
volume={20},
number={6},
pages={421--431},
year={1977},
publisher={ACM}
}
[4] K. Thompson, “Programming techniques: regular expression search algorithm,” Communications of the acm, vol. 11, iss. 6, p. 419–422, 1968.
[Bibtex]
@article{thompson1968programming,
title={Programming techniques: Regular expression search algorithm},
author={Thompson, Ken},
journal={Communications of the ACM},
volume={11},
number={6},
pages={419--422},
year={1968},
publisher={ACM}
}
[5] G. Xing, “Minimized Thompson NFA,” International journal of computer mathematics, vol. 81, iss. 9, p. 1097–1106, 2004.
[Bibtex]
@article{xing2004minimized,
title={Minimized {Thompson NFA}},
author={Xing, Guangming},
journal={International Journal of Computer Mathematics},
volume={81},
number={9},
pages={1097--1106},
year={2004},
publisher={Taylor \& Francis}
}

Leave a Reply

Your email address will not be published. Required fields are marked *