Project 1: Decaf Lexer

Due: Sunday, February 3, 2013 at 11:59pm

Description

In this first part of the term project, we will build a lexer for a simple object-oriented language called Decaf.

The tokens are defined below:

Keywords

The keywords are all reserved, which means they cannot be used as identifiers or redefined.

void

int

double

bool

string

class

interface

null

this

extends

implements

for

while

if

else

return

break

New

NewArray

Print

ReadInteger

ReadLine

true

false
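In a flex specification, each keyword would typically get its own rule placed before the identifier rule, so that the keyword wins when both match. The same check can be sketched in plain C as a table lookup. The names `DECAF_KEYWORDS` and `is_decaf_keyword` are hypothetical and not part of the assignment.

```c
#include <stddef.h>
#include <string.h>

/* Reserved words exactly as listed above; the comparison is case-sensitive. */
static const char *const DECAF_KEYWORDS[] = {
    "void", "int", "double", "bool", "string", "class", "interface",
    "null", "this", "extends", "implements", "for", "while", "if",
    "else", "return", "break", "New", "NewArray", "Print",
    "ReadInteger", "ReadLine", "true", "false"
};

/* Hypothetical helper: returns 1 if word is a reserved word, 0 otherwise. */
int is_decaf_keyword(const char *word)
{
    size_t count = sizeof DECAF_KEYWORDS / sizeof DECAF_KEYWORDS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(word, DECAF_KEYWORDS[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Because the comparison is exact, `New` is reserved while `new` would be lexed as an ordinary identifier.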

Identifiers

An identifier is a sequence of letters, digits, and underscores, and must start with a letter. Decaf is case-sensitive. Identifiers are at most 31 characters long.
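The identifier rule corresponds to a pattern such as `[A-Za-z][A-Za-z0-9_]*` plus a length check. Below is a minimal sketch of that check in C. The function name and the `DECAF_MAX_IDENT_LEN` macro are hypothetical, and since the text above does not say whether an over-long identifier should be truncated or reported, this sketch simply rejects it. It does not test for keywords.

```c
#include <ctype.h>
#include <stddef.h>

#define DECAF_MAX_IDENT_LEN 31   /* limit stated in the text above */

/* Hypothetical helper: returns 1 if s is a well-formed identifier, else 0. */
int is_valid_identifier(const char *s)
{
    size_t len;

    /* Must start with a letter (an underscore or digit is not allowed). */
    if (!isalpha((unsigned char)s[0])) {
        return 0;
    }
    /* Every character must be a letter, digit, or underscore. */
    for (len = 0; s[len] != '\0'; len++) {
        unsigned char c = (unsigned char)s[len];
        if (!isalnum(c) && c != '_') {
            return 0;
        }
    }
    /* Assumption: identifiers longer than the limit are rejected. */
    return len <= DECAF_MAX_IDENT_LEN;
}
```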

Whitespace

Whitespace separates tokens but is otherwise discarded.

Constants

An integer constant can either be specified in decimal (base 10) or hexadecimal (base 16). A decimal integer is a sequence of decimal digits (0-9). A hexadecimal integer must begin with 0X or 0x and is followed by a sequence of hexadecimal digits. Hexadecimal digits include the decimal digits and the letters a through f (either upper or lowercase).
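The two integer forms can be sketched as a small recognizer plus a conversion routine. Both function names are hypothetical. The sketch assumes that at least one hexadecimal digit must follow the `0x` prefix, which the text above does not state explicitly. The conversion passes an explicit base to `strtol`, because letting `strtol` guess the base (base 0) would read a decimal constant with a leading zero, such as 010, as octal.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if s has a "0x"/"0X" prefix. */
static int has_hex_prefix(const char *s)
{
    return s[0] == '0' && (s[1] == 'x' || s[1] == 'X');
}

/* Hypothetical helper: returns 1 if s is a decimal or hexadecimal constant. */
int is_integer_constant(const char *s)
{
    const char *p;

    if (has_hex_prefix(s)) {
        p = s + 2;
        if (*p == '\0') {
            return 0;   /* assumption: "0x" with no digits is invalid */
        }
        for (; *p != '\0'; p++) {
            if (!isxdigit((unsigned char)*p)) {
                return 0;
            }
        }
        return 1;
    }
    if (s[0] == '\0') {
        return 0;
    }
    for (p = s; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p)) {
            return 0;
        }
    }
    return 1;
}

/* Converts an already validated constant; overflow is not handled here. */
long integer_constant_value(const char *s)
{
    return strtol(s, NULL, has_hex_prefix(s) ? 16 : 10);
}
```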

Boolean constants are true and false as indicated in the reserved words above.

A double constant is a sequence of digits, followed by a period, followed by any sequence of digits (possibly none). Thus, .12 is not a valid double, but both 0.12 and 12. are valid. A double can also have an optional exponent, e.g., 12.2E+2. For a double in this sort of scientific notation, the decimal point is required, the sign of the exponent is optional (if not specified, + is assumed), and the E can be lower or upper case. As above, .12E+2 is invalid, but 12.E+2 is valid. Leading zeroes on the mantissa and exponent are allowed.
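These rules correspond roughly to the pattern `[0-9]+\.[0-9]*([eE][+-]?[0-9]+)?`. The hand-written recognizer below is a sketch of the same rules in C. The function name is hypothetical, and the requirement of at least one exponent digit after the E is an assumption, since the text above does not spell it out.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s is a double constant, else 0. */
int is_double_constant(const char *s)
{
    const char *p = s;

    /* At least one digit must come before the period. */
    if (!isdigit((unsigned char)*p)) {
        return 0;
    }
    while (isdigit((unsigned char)*p)) {
        p++;
    }
    /* The period is required, even when an exponent follows. */
    if (*p != '.') {
        return 0;
    }
    p++;
    /* The fractional digits may be absent. */
    while (isdigit((unsigned char)*p)) {
        p++;
    }
    /* Optional exponent: E or e, optional sign, then digits. */
    if (*p == 'e' || *p == 'E') {
        p++;
        if (*p == '+' || *p == '-') {
            p++;
        }
        if (!isdigit((unsigned char)*p)) {
            return 0;   /* assumption: the exponent needs a digit */
        }
        while (isdigit((unsigned char)*p)) {
            p++;
        }
    }
    return *p == '\0';
}
```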

A string constant is a sequence of characters enclosed in double quotes. Strings can contain any character except a newline or double quote. A string must start and end on a single line; it cannot be split over multiple lines. For example, the first line below is an unterminated string, and the second line is not part of it:

"this string is missing its close quote

this is not a part of the string above
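The single-line rule can be sketched as a scanner that starts at the opening quote and gives up at the first newline or end of input. The function name and the return convention are hypothetical. A backslash followed by another character is skipped as a pair, so that an escaped quotation mark does not close the string.

```c
#include <stddef.h>

/*
 * Hypothetical helper. src[0] must be the opening double quote.
 * Returns the length of the literal including both quotes, or -1 if a
 * newline or the end of input is reached before the closing quote.
 */
long scan_string_literal(const char *src)
{
    size_t i = 1;

    while (src[i] != '\0' && src[i] != '\n') {
        /* Skip an escaped character so that \" does not end the string. */
        if (src[i] == '\\' && src[i + 1] != '\0' && src[i + 1] != '\n') {
            i += 2;
            continue;
        }
        if (src[i] == '"') {
            return (long)(i + 1);
        }
        i++;
    }
    return -1;   /* unterminated string literal */
}
```

A result of -1 is the point at which the lexer would report the unterminated string, using the position of the opening quote.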

The following escape sequences are supported:

\n      Newline
\t      Tab
\"      Quotation mark
\\      Backslash
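The escape sequences above can be handled by a decoder that runs over the text between the quotes. This is a sketch only: the function name, the output-buffer convention, and the use of -1 to signal an illegal escape are all assumptions made for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper. Decodes the body of a string constant (the text
 * between the quotes) into out, which must have room for strlen(body) + 1
 * bytes. Returns the decoded length, or -1 on an illegal escape sequence,
 * in which case *bad_offset is the offset of the offending backslash.
 */
int decode_string_body(const char *body, char *out, size_t *bad_offset)
{
    int n = 0;

    for (size_t i = 0; body[i] != '\0'; i++) {
        if (body[i] != '\\') {
            out[n++] = body[i];
            continue;
        }
        switch (body[i + 1]) {
        case 'n':  out[n++] = '\n'; break;
        case 't':  out[n++] = '\t'; break;
        case '"':  out[n++] = '"';  break;
        case '\\': out[n++] = '\\'; break;
        default:
            /* Any other character (or end of text) after a backslash. */
            *bad_offset = i;
            return -1;
        }
        i++;   /* consume the character after the backslash */
    }
    out[n] = '\0';
    return n;
}
```

The offset returned through `bad_offset` can be added to the column of the string's first character to locate the bad escape in the source line.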

 

Comments

A single-line comment is started by // and extends to the end of the line. Multi-line comments start with /* and end with */. Any symbol is allowed in a comment except the sequence */ which ends the current comment. Multi-line comments do not nest.
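The comment rules can be sketched as a function that, given a position in the input, returns the position just past any comment that starts there. The function name and the `unterminated` flag are hypothetical. The text above does not say how to treat a multi-line comment that is never closed, so the sketch only flags that case.

```c
#include <stddef.h>

/*
 * Hypothetical helper. If a comment starts at src[pos], returns the index
 * just past it; otherwise returns pos unchanged. *unterminated is set to 1
 * when a multi-line comment reaches the end of input without a closing
 * delimiter.
 */
size_t skip_comment(const char *src, size_t pos, int *unterminated)
{
    *unterminated = 0;
    if (src[pos] != '/') {
        return pos;
    }
    if (src[pos + 1] == '/') {
        /* Single-line comment: stop at the newline, which is left in place. */
        pos += 2;
        while (src[pos] != '\0' && src[pos] != '\n') {
            pos++;
        }
        return pos;
    }
    if (src[pos + 1] == '*') {
        /* Multi-line comment: the first closing delimiter ends it (no nesting). */
        pos += 2;
        while (src[pos] != '\0') {
            if (src[pos] == '*' && src[pos + 1] == '/') {
                return pos + 2;
            }
            pos++;
        }
        *unterminated = 1;
        return pos;
    }
    return pos;   /* a lone '/' is the division operator, not a comment */
}
```

Because comments do not nest, an inner opening delimiter is ignored and the first closing delimiter ends the comment, leaving any later closing delimiter to be lexed as ordinary tokens.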

Operators

+

-

*

/

%

<=

>=

=

==

!=

&&

||

!

;

,

.

[

]

(

)

{

}

Requirements

·  Use flex or JFlex to implement a Decaf lexer.

·  Each lexical rule (the action associated with a regular expression) should return a token.

    o  If you're using C, the token type can be a set of #defines or an enum, combined with a union to return the value. You can look ahead at the bison parser manual to see how the union and token types will work.

    o  If you're using Java, this can be a set of public static final ints and the Symbol class. The Symbol class is part of the runtime jar file of the parser generator (JavaCUP). You may use it in the lexer by adding the JavaCUP jar file to the classpath and importing the class.

·  The lexer should be driven by a program that prints out the line and column numbers of each token encountered in the program. This is information you can track in your token. An example is shown below.

·  The lexer should discard all comments.

·  The lexer should report any invalid character with the message "Illegal character '%c' at line %d column %d\n"

·  The lexer should report any invalid escape sequence with the message "Illegal escape sequence '%s' at line %d column %d\n"

·  The lexer should report the start of an unterminated string with the message "Unterminated string literal at line %d column %d"
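For the C option, the enum-plus-union token and the line and column tracking might be sketched as follows. All type, field, and function names here are hypothetical, and only a few token types are listed. The sketch counts every character other than a newline, including a tab, as one column; that is an assumption about how columns should be counted. In flex, one common place to call such a routine is the `YY_USER_ACTION` macro, with `yytext` and `yyleng` as arguments.

```c
#include <stddef.h>
#include <stdio.h>

/* A few hypothetical token types; a full lexer would list them all. */
typedef enum { T_CLASS, T_ID, T_INTCONST, T_DOUBLECONST, T_STRINGCONST } TokenType;

/* Token with its source position and a union holding its value. */
typedef struct {
    TokenType type;
    int line;
    int column;
    union {
        long int_value;
        double double_value;
        const char *text;
    } value;
} Token;

/* Current position in the input; both fields start at 1. */
typedef struct {
    int line;
    int column;
} Position;

/* Advances pos over the len characters of a matched lexeme. */
void advance_position(Position *pos, const char *text, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (text[i] == '\n') {
            pos->line++;
            pos->column = 1;
        } else {
            pos->column++;   /* assumption: a tab counts as one column */
        }
    }
}

/* Formats the invalid-character message using the required format string. */
int format_illegal_char(char *buf, size_t size, char c, Position pos)
{
    return snprintf(buf, size, "Illegal character '%c' at line %d column %d\n",
                    c, pos.line, pos.column);
}
```

Recording the position before advancing gives the start of each token, which is the value the driver needs to print.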

Example

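class Hello {
	void main() {
		Print("Hello world\n");
	}
}

(The column numbers in the output below imply that each level of indentation is a single character, such as one tab.)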

Example Output

Line    Column  Token       Value
======================================================================
1       1       CLASS
1       7       ID          Hello
1       13      LBRACE
2       2       VOID
2       7       ID          main
2       11      LPAREN
2       12      RPAREN
2       14      LBRACE
3       3       PRINT
3       8       LPAREN
3       9       STRING      Hello world
3       24      RPAREN
3       25      SEMICOLON
4       2       RBRACE
5       1       RBRACE

Submission

By the deadline, you need to submit:

1. Your JFlex or flex file containing your lexer

2. Your C or Java files containing main() and any auxiliary files you have used

3. A Makefile to build it all

4. A README text file describing how to run it

5. Two or more examples of Decaf programs that you have written and tested your program on

Create a zip file of the above files and copy it to:

/afs/pitt.edu/home/j/r/jrmst106/submit/2210