Operating System Design :: Projects :: Compiler Part 1
Problem
The files for this project are in the projects directory under 10 when you downloaded the class software.
You will need two tools: the programming language with which you will implement your Syntax Analyzer, and the supplied TextComparer utility (available in your tools directory), which allows comparing the output files generated by your analyzer to the compare files supplied by us. You may also want to inspect the generated and supplied output files visually, using some XML viewer (any standard web browser should do the job).
Instructions
Chapter 10 includes a proposed, language-independent Syntax Analyzer API, which can serve as your implementation's blueprint. We propose implementing the Syntax Analyzer in two stages. First, write and test a Tokenizer module. Next, write and test a Parser module.
Stage I: Tokenizer: A tokenizer is a program that breaks a given textual input into a list of separate tokens. And while it is at it, it can also classify the tokens into lexical categories. This is the first task that syntax analyzers normally do. Your first task it to implement the JackTokenizer module specified in chapter 10. When applied to a text file containing Jack code, the tokenizer should produce a list of tokens, each printed in a separate line along with its classification: symbol, keyword, identifier, integer constant or string constant. Here is an example:
Source Code (Input)
if (x < 0) {let state = "negative";}
Tokenizer Output
<tokens> <keyword> if </keyword> <symbol> ( </symbol> <identifier> x </identifier> <symbol> < </symbol> <integerConstant> 0 </integerConstant> <symbol> ) </symbol> <symbol> { </symbol> <keyword> let </keyword> <identifier> state </identifier> <symbol> = </symbol> <stringConstant> negative </stringConstant> <symbol> ; </symbol> <symbol> } </symbol> </tokens>
Note that in the case of string constants, the tokenizer throws away the double-quote characters. This behavior is intended, and is part of our tokenizer specification.
Note that four of the symbols used in the Jack language (<, >, ", and &) are also used for XML markup, and thus they cannot appear verbatim as XML data. To solve the problem, we require the tokenizer to output these tokens as <, >, ", and &, respectively. For example, in order for the text "<symbol> < </symbol>" to be displayed properly in a web browser, it should be generated as "<symbol><</symbol>".
Stage II: Parser: Once you are done implementing the tokenizer for the Jack language, you can proceed implementing a Jack parser. A parser goes over the tokenized text and emits output indicating that it "understood" the text's grammatical structure. In order to do so, the parser must include methods that look for canonic structures in a certain language - in our case Jack - and then emit these structures in some agreed upon formalism - in our case the XML described in Chapter 10.
In both Project 10 and Project 11, the Jack parser is implemented as a module called CompilationEngine. We recommend following the compilation engine's API, and writing each one of its methods, making sure that it emits the correct XML output. We also recommend to first write a compilation engine that handles any given Jack code except for expressions, and unit-testing it as explained below.
Testing
A natural way to test your analyzer it is to have it parse some representative Jack programs. We supply two such test programs: Square Dance and Array Test. The former includes all the syntactic features of the Jack language except for array processing, which appears in the latter. We also provide a simpler version of the SquareDance program that does not include expressions.
Tokenizer Testing:
- Test your tokenizer on the Square Dance and the TestArray programs.
- For each Xxx.jack source file, have your tokenizer give the output file the name XxxT.xml. Apply your tokenizer to each class file in the test programs, then use the supplied TextComparer utility to compare the generated output to the supplied .xml compare files.
- Since the output files generated by your tokenizer will have the same names and extensions as those of the supplied compare files, we suggest putting them in separate directories.
Parser Testing:
- Apply your analyzer to the supplied test programs, then use the supplied TextComparer utility to compare the generated output to the supplied .xml compare files.
- Since the output files generated by your analyzer will have the same names and extensions as those of the supplied compare files, we suggest putting them in separate directories.
- Note that the indentation of the XML output is only for readability. Web browsers and the supplied TextComparer ignore white space.