Project 1: Decaf Lexer
Due: Sunday, February 3, 2013 at 11:59pm
In this first part of the term project, we will build a lexer for a simple object-oriented language called Decaf.
The tokens are defined as follows.
The keywords are all reserved, which means they cannot be used as identifiers or redefined.
void          int           double        bool
string        class         interface     null
this          extends       implements    for
while         if            else          return
break         New           NewArray      Print
ReadInteger   ReadLine      true          false
An identifier is a sequence of letters, digits, and underscores, and must start with a letter. Decaf is case-sensitive. Identifiers are at most 31 characters long.
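The identifier rule can be checked by hand in a few lines of C, which may help when testing the regular expression you write for it. This is only a sketch: the names `is_keyword`, `is_identifier`, and `MAX_ID_LEN` are made up for illustration, the keyword table lists only a handful of the reserved words, and rejecting identifiers longer than 31 characters is an assumption, since the handout does not say how an over-long identifier should be handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, partial keyword table; a real lexer would list every reserved word. */
static const char *const KEYWORDS[] = {
    "void", "int", "double", "bool", "string", "class",
    "while", "if", "else", "return", "true", "false"
};

enum { MAX_ID_LEN = 31 };

/* Returns 1 if s is one of the reserved words in the partial table above. */
int is_keyword(const char *s) {
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++) {
        if (strcmp(s, KEYWORDS[i]) == 0) return 1;   /* comparison is case-sensitive */
    }
    return 0;
}

/* Returns 1 if s starts with a letter, continues with letters, digits, or
   underscores, is at most MAX_ID_LEN characters long, and is not a keyword.
   Treating an over-long name as "not an identifier" is an assumption. */
int is_identifier(const char *s) {
    size_t n = strlen(s);
    if (n == 0 || n > MAX_ID_LEN) return 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    for (size_t i = 1; i < n; i++) {
        if (!isalnum((unsigned char)s[i]) && s[i] != '_') return 0;
    }
    return !is_keyword(s);
}
```

Because Decaf is case-sensitive, `While` is accepted as an identifier here even though `while` is reserved.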
Whitespace separates tokens but is otherwise discarded.
An integer constant can either be specified in decimal (base 10) or hexadecimal (base 16). A decimal integer is a sequence of decimal digits (0-9). A hexadecimal integer must begin with 0X or 0x and is followed by a sequence of hexadecimal digits. Hexadecimal digits include the decimal digits and the letters a through f (either upper or lowercase).
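As a rough illustration of the two integer forms, the following C sketch validates a lexeme and converts it with the standard `strtol`. The function name `parse_int_constant` is hypothetical, requiring at least one digit after `0x` is an assumption, and overflow is not checked.

```c
#include <ctype.h>
#include <stdlib.h>

/* Returns 1 and stores the value in *out if s is a decimal or hexadecimal
   integer constant as described above; returns 0 otherwise.
   Assumption: "0x" with no digits after it is not an integer constant. */
int parse_int_constant(const char *s, long *out) {
    int base = 10;
    const char *digits = s;

    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        base = 16;
        digits = s + 2;                       /* skip the 0x / 0X prefix */
    }
    if (*digits == '\0') return 0;            /* need at least one digit */

    for (const char *p = digits; *p != '\0'; p++) {
        int ok = (base == 16) ? isxdigit((unsigned char)*p)
                              : isdigit((unsigned char)*p);
        if (!ok) return 0;
    }
    *out = strtol(digits, NULL, base);        /* overflow is not handled here */
    return 1;
}
```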
Boolean constants are true and false as indicated in the reserved words above.
A double constant is a sequence of digits, a period, and then any sequence of digits, possibly none. Thus, .12 is not a valid double, but both 0.12 and 12. are valid. A double can also have an optional exponent, e.g., 12.2E+2. For a double in this sort of scientific notation, the decimal point is required, the sign of the exponent is optional (if not specified, + is assumed), and the E can be lower or upper case. As above, .12E+2 is invalid, but 12.E+2 is valid. Leading zeroes on the mantissa and exponent are allowed.
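The double rule is easy to get subtly wrong, so a hand-written checker can be a useful cross-check for your regular expression. The sketch below uses a hypothetical function name, `is_double_constant`, and assumes that an exponent, when present, must contain at least one digit.

```c
#include <ctype.h>

/* Returns 1 if s has the shape  digits '.' [digits] [ (E|e) [+|-] digits ]
   and 0 otherwise. Requiring a digit in the exponent is an assumption. */
int is_double_constant(const char *s) {
    const char *p = s;

    if (!isdigit((unsigned char)*p)) return 0;    /* .12 is rejected here */
    while (isdigit((unsigned char)*p)) p++;

    if (*p != '.') return 0;                      /* the decimal point is required */
    p++;
    while (isdigit((unsigned char)*p)) p++;       /* fraction digits may be absent */

    if (*p == 'E' || *p == 'e') {
        p++;
        if (*p == '+' || *p == '-') p++;          /* the sign is optional */
        if (!isdigit((unsigned char)*p)) return 0;
        while (isdigit((unsigned char)*p)) p++;
    }
    return *p == '\0';                            /* nothing may follow */
}
```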
A string constant is a sequence of characters enclosed in double quotes. Strings can contain any character except a newline or double quote. A string must start and end on a single line; it cannot be split over multiple lines:
"this string is missing its close quote
this is not a part of the string above
The following escape sequences are supported:
\n    Newline
\t    Tab
\"    Quotation mark
\\    Backslash
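One way to handle the four escape sequences is to decode the body of the string after it has been matched. The following is a minimal sketch, not a required design: `decode_escapes` is a hypothetical helper, and returning the offset of the offending backslash is just one convenient way to locate the column for the error message.

```c
#include <stddef.h>

/* Decodes the body of a string constant (without the enclosing quotes)
   from src into dst. dst must have room for strlen(src) + 1 bytes.
   Returns -1 on success, or the index in src of the backslash that
   starts an unsupported escape sequence. */
long decode_escapes(const char *src, char *dst) {
    size_t i = 0, j = 0;

    while (src[i] != '\0') {
        if (src[i] != '\\') {
            dst[j++] = src[i++];                 /* ordinary character */
            continue;
        }
        switch (src[i + 1]) {
        case 'n':  dst[j++] = '\n'; break;
        case 't':  dst[j++] = '\t'; break;
        case '"':  dst[j++] = '"';  break;
        case '\\': dst[j++] = '\\'; break;
        default:                                 /* also covers a trailing lone backslash */
            dst[j] = '\0';
            return (long)i;
        }
        i += 2;                                  /* skip the backslash and its letter */
    }
    dst[j] = '\0';
    return -1;
}
```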
A single-line comment is started by // and extends to the end of the line. Multi-line comments start with /* and end with */. Any symbol is allowed in a comment except the sequence */ which ends the current comment. Multi-line comments do not nest.
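The comment rules, including the fact that multi-line comments do not nest, can be sketched as a small scanner in C. The name `skip_comment` is hypothetical, and returning -1 for a multi-line comment that never closes is an assumption, since the handout does not say how that case should be reported.

```c
/* If text[pos] begins a comment, returns the index just past the comment;
   otherwise returns pos unchanged. Returns -1 if a multi-line comment is
   never closed (an assumption, not something the handout specifies). */
long skip_comment(const char *text, long pos) {
    if (text[pos] != '/') return pos;

    if (text[pos + 1] == '/') {                  /* single-line comment */
        pos += 2;
        while (text[pos] != '\0' && text[pos] != '\n') pos++;
        return pos;                              /* the newline itself is left as whitespace */
    }

    if (text[pos + 1] == '*') {                  /* multi-line comment */
        pos += 2;
        while (text[pos] != '\0') {
            if (text[pos] == '*' && text[pos + 1] == '/') {
                return pos + 2;                  /* the first closing marker ends it: no nesting */
            }
            pos++;
        }
        return -1;
    }

    return pos;                                  /* a lone '/' is the division operator */
}
```

With the input `/* a /* b */ c */`, the scan stops after the first `*/`, so ` c */` is lexed as ordinary tokens.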
The operators and punctuation characters are:
+    -    *    /    %    <
<=   >    >=   =    ==   !=
&&   ||   !    ;    ,    .
[    ]    (    )    {    }
· Use flex or JFlex to implement a Decaf lexer
· The action associated with each regular expression in the lexical rules should return the corresponding token
o If you’re using C, the token type can be a set of #defines or an enum, combined with a union to return the value. You can look ahead at the bison parser manual to see how the union and token types will work.
o If you’re using Java, this can be a set of public static final ints and the Symbol class. The Symbol class is part of the runtime jar file of the parser generator (JavaCUP). You may use it in the lexer simply by setting the classpath to include the JavaCUP jar file and importing the class.
· The lexer should be driven by a program that prints out the line and column numbers of each token encountered in the program. This is information you can track in your token. An example is shown below.
· The lexer should discard all comments.
· The lexer should report any invalid character with the message "Illegal character '%c' at line %d column %d\n"
· The lexer should report any invalid escape sequence with the message "Illegal escape sequence '%s' at line %d column %d\n"
· The lexer should report the start of an unterminated string with the message "Unterminated string literal at line %d column %d"
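For the C route, the requirements above could be organized roughly as follows. This is a sketch under several assumptions: the type and function names (`TokenType`, `TokenValue`, `Token`, `Position`, `advance`, `format_illegal_char`) are invented for illustration, only a few token types are listed, every character (including a tab) is counted as one column, and the quotes in the error message are taken to be plain ASCII single quotes.

```c
#include <stdio.h>

/* Hypothetical subset of token types; a full lexer would list every token. */
typedef enum { T_CLASS, T_ID, T_INTCONST, T_DOUBLECONST, T_STRING, T_LBRACE } TokenType;

/* Value carried by a token; which member is meaningful depends on the type. */
typedef union {
    long        int_value;
    double      double_value;
    const char *text;
} TokenValue;

/* A token together with the position where it starts. */
typedef struct {
    TokenType  type;
    TokenValue value;
    int        line;
    int        column;
} Token;

/* Current position in the input; both fields start at 1. */
typedef struct {
    int line;
    int column;
} Position;

/* Advances pos over the characters of a matched lexeme.
   Assumption: every character, including a tab, counts as one column. */
void advance(Position *pos, const char *lexeme, size_t length) {
    for (size_t i = 0; i < length; i++) {
        if (lexeme[i] == '\n') {
            pos->line++;
            pos->column = 1;
        } else {
            pos->column++;
        }
    }
}

/* Writes the "Illegal character" message into buf and returns snprintf's result. */
int format_illegal_char(char *buf, size_t size, char c, Position pos) {
    return snprintf(buf, size, "Illegal character '%c' at line %d column %d\n",
                    c, pos.line, pos.column);
}
```

A driver would record the current `Position` in each `Token` before calling `advance` on the matched text, so the token carries the line and column where it begins.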
class Hello {
	void main() {
		Print("Hello world\n");
	}
}
Line   Column   Token       Value
======================================================================
1      1        CLASS
1      7        ID          Hello
1      13       LBRACE
2      2        VOID
2      7        ID          main
2      11       LPAREN
2      12       RPAREN
2      14       LBRACE
3      3        PRINT
3      8        LPAREN
3      9        STRING      Hello world
3      24       RPAREN
3      25       SEMICOLON
4      2        RBRACE
5      1        RBRACE
The value of the STRING token is printed with the \n escape converted to an actual newline.
By the deadline, you need to submit:
1. Your JFlex or flex file containing your lexer
2. Your C or Java files containing main() and any auxiliary files you have used
3. A Makefile to build it all
4. A README text file describing how to run it
5. Two or more examples of Decaf programs that you have written and tested your program on
Create a zip file of the above files and copy it to:
/afs/pitt.edu/home/j/r/jrmst106/submit/2210