5034 Homework Assignment on LEX

Posted: October 24, 1996 Due: Tuesday, November 4, 1997 at 17:00

This assignment is worth a total of 75 points.

Rules

This assignment MUST be done and submitted individually. You may NOT work with a partner on this assignment.

Description

For your lex assignment you are to write lex specifications to match the described constructs. Each such construct is called a token. Whenever a token is found, the type of token should be printed to standard output, one per line, and the token should be counted. If the token is a variable name or integer, the variable name or integer value should be printed right after the token type is printed. At the end of the program, a summary containing the number of times each token type was matched should be printed.

The lex specification that you create for this assignment will be used in a following homework assignment.

About Lex

Information on how to use lex is posted on the CS5034 web pages. This information should be correct for most versions of lex, however some versions differ from others (e.g. GNU's flex vs. ATT lex). To avoid problems, all lex specifications turned in will be built using the version of lex on csgrad. Make sure that you test your lex specification on csgrad before submitting, especially if you develop your program on a different machine.

Mechanics

Your lex source file should be named TREE.lex. You should use the Makefile to build your program. The final executable's name is TREE. Lex programs take their input from standard input by default, so to run your program use the command

prompt% TREE
and type in the input by hand or
prompt% TREE < input_file
where input_file is a file containing data that you want to test.

Make sure to incorporate your ``main'' function into your lex source file. Everything should fit in a single file. Document your lex constructs for clarity.

This assignment assumes some knowledge of the C programming language. If you are unfamiliar with this language, contact Mohammad for help.

Test Data

Sample test data will be available via the CS5034 web pages. Warning: your submission will be tested by Mohammad on his own data file, which will be different than the sample test data!

Submission

Use the following command on csgrad to submit your lex source file to Mohammad:

cat TREE.lex | Mail -s "HWLEX: Lex Assignment" cs5034@ei.cs.vt.edu

You should receive a reply message within a few minutes. Make sure the number of lines stated in the reply message is around the number of lines in your lex source file. If you do not receive a reply message or the number of lines in the reply message is significantly more or less than your lex source file, resend your assignment.

Be sure to include your name and your ID number as a comment at the top of your Lex file.

TREE Language Constructs

Note that case is significant. All key words must be lower case to be matched. Variables can contain both upper and lower case letters.

Variable Names

Description

A variable name consists of a string that starts with an underscore or letter, and then is followed by any number of underscores, hyphens, letters, or digits.

Token Type

VARIABLE

Output

VARIABLE: foo
Where foo is the string matched by this construct.

Comments

Description

Comments either begin with \\ and end with end-of-line or begin with a /* symbol and end with a */.

Token Type

COMMENT

Output

COMMENT

Integers

Description

Integers can be represented as hexadecimal, octal, or decimal constants. They have NO plus or minus sign. Match one of these three constructs:

Make a distinction between hexadecimal, octal, binary, and decimal constants for counting purposes.

Token Type

INTEGER

Output

INTEGER: type val
where type is either hex, oct, bin, or dec; and val is the value of the integer constant in decimal.

Reserved Words

Description

The following strings are considered reserved words and should be matched, each with its own rule.

     print
     left
     val
     right
     root
     tree
     isempty
     while
     if

Token Type

Each reserved word has the token type of the reserved word in all capital letters. For example print has the token type PRINT.

Output

The token type of the reserved word.

Punctuation

Description

The following punctuation should be matched and counted.

Symbol    Name                            Token Type
{         left curly brace                LBRACE
}         right curly brace               RBRACE
(         left parenthesis                LPAREN
)         right parenthesis               RPAREN
+         plus                            PLUS
-         minus                           MINUS
=         equals(assignment)              ASSIGN
==        equal to                        EQUAL_TO
<         less than                       LESS_THAN
>         greater than                    GRTR_THAN
<=        less than or equal to           LESS_EQUAL
>=        greater than or equal to        GRTR_EQUAL
|         or                              OR
&         and                             AND
++        inc                             INC
--        dec                             DEC

Token Type

Listed in the table above.

Output

The token type from the table above.

Whitespace

Description

White space is spaces, tabs, and newlines (that do not terminate comments). White space should be matched, but otherwise it is ignored (i.e. there is no WHITESPACE token, and nothing should be printed out about it).

Invalid Characters

Description

Any character not matched by one of the above constructs is invalid. Your program should exit with the error message: Invalid character in input. Exiting. and the summary of tokens matched up to the invalid character should be printed.

Matching Priority

Reserved words have priority over variable names. That is, if the string left is read, it is considered the reserved word, but if the string lefta is read, it is considered a variable. Do not count strings as two different constructs (i.e. don't match a string twice).

Summary Output

The summary output should be in the format:

TOKEN_TYPE1: %d matches
TOKEN_TYPE2: %d matches
...

Where %d is the number of matches for the TOKEN_TYPE. Each token type that was matched should be listed in the summary. Token types not matched should NOT be listed.

Special Exception: Integers should be listed by the types of integers that were matched. The types that should be used are

DEC_INTEGER
HEX_INTEGER
OCT_INTEGER
BIN_INTEGER
for decimal, hexadecimal, binary, and octal constants respectively.