NAME
lex - Generates a C Language program that matches patterns for simple lexi-
cal analysis of an input stream
SYNOPSIS
lex [-cnrtv] [-V] [-Qy|-Qn] [file ...]
The lex command reads file or standard input and generates a compilable C
Language program, which it writes to a file named lex.yy.c.
FLAGS
If the environment variable CMD_ENV is set to svr4, all flags listed in the
synopsis are legal. Otherwise n, t, v are the only legal flags, and they
may be upper or lower case.
-c Writes C code to the file lex.yy.c. This is the default.
-n Suppresses the statistics summary. When you set your own table sizes
for the finite state machine, lex automatically produces this summary
if you do not select this flag.
-r Writes RATFOR code to the file lex.yy.r. Note: there is no RATFOR com-
piler for DEC OSF/1.
-t Writes to standard output instead of to a file.
-v Provides a summary of the generated finite state machine statistics.
-V Outputs lex version number to standard error. Requires the environment
variable CMD_ENV to be set to svr4.
-Q[y|n]
Determines whether the lex version number is written to the output
file. -Qn does not do so, and is the default. Requires the environ-
ment variable CMD_ENV to be set to svr4.
DESCRIPTION
The lex command uses the rules and actions contained in file to generate a
program, lex.yy.c, which can be compiled with the cc command. That program
can then receive input, break the input into the logical pieces defined by
the rules in file, and run program fragments contained in the actions in
file.
The generated program is a C Language function called yylex(). The lex
command stores yylex() in a file named lex.yy.c. You can use yylex() alone
to recognize simple, one-word input, or you can use it with other C Language
programs to perform more difficult input analysis functions. For example,
you can use lex to generate a program that tokenizes an input stream before
Input File Format
The input file can contain three sections: definitions, rules, and user
subroutines. Each section must be separated from the others by a line con-
taining only the delimiter, %%. The format is as follows:
definitions
%%
rules
%%
user_subroutines
The purpose and format of each are described in the following sections.
Definitions
If you want to use variables in rules, you must define them in this sec-
tion. The variables make up the left column, and their definitions make up
the right column. For example, to define D as a numerical digit, enter:
D [0-9]
You can use a defined variable in the rules section by enclosing the vari-
able name in braces, {D}.
In the definitions section, you can also set table sizes for the resulting
finite state machine. The default sizes are large enough for small pro-
grams. You may want to set larger sizes for more complex programs.
%p number
Number of positions is number (default 5000)
%n number
Number of states is number (default 2500)
%e number
Number of parse tree nodes is number (default 2000)
%a number
Number of transitions is number (default 5000)
%k number
Number of packed character classes is number (default 1000)
%o number
Number of output slots is number (default 5000)
If extended characters appear in regular expression strings, you may need
to reset the output array size with the %o parameter (possibly to array
sizes in the range 10,000 to 20,000). This reset reflects the much larger
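Rules
In the rules section, each rule consists of a pattern in the left column
and an optional action in the right column. The columns are separated by a
tab. For example, to search files for the word LEAD and replace it with
GOLD, perform the following steps: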
Create a file called transmute.l containing the lines:
%%
(LEAD) printf("GOLD");
Then issue the following commands to the shell:
lex transmute.l
cc -o transmute lex.yy.c -ll
You can test the resulting program with the command:
transmute <transmute.l
This command echoes the contents of transmute.l, with the occurrences of
LEAD changed to GOLD.
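The effect of the transmute rule can be approximated in plain C. The
following is a minimal sketch, not the code that lex generates; the function
name transmute() and its buffer-based interface are assumptions made for
illustration. It copies unmatched characters through unchanged, as the
generated program does for input that no rule matches, and substitutes GOLD
wherever LEAD appears.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of lex): copies in to out, replacing each
 * occurrence of "LEAD" with "GOLD". Returns 0 on success, or -1 if out is
 * too small to hold the result and its terminating null character.
 */
int transmute(const char *in, char *out, size_t outsz)
{
    size_t o = 0;

    while (*in != '\0') {
        if (strncmp(in, "LEAD", 4) == 0) {
            if (o + 4 >= outsz)          /* need 4 characters plus '\0' */
                return -1;
            memcpy(out + o, "GOLD", 4);  /* the action for the matched pattern */
            o += 4;
            in += 4;
        } else {
            if (o + 1 >= outsz)          /* need 1 character plus '\0' */
                return -1;
            out[o++] = *in++;            /* unmatched input is echoed as is */
        }
    }
    if (outsz == 0)
        return -1;
    out[o] = '\0';
    return 0;
}
```

Calling transmute() on the text of transmute.l itself would turn the pattern
(LEAD) into (GOLD), which mirrors the test command shown above.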
Each pattern may have a corresponding action, that is, a fragment of C
source code to execute when the pattern is matched. Each statement must
end with a ; (semicolon). If you use more than one statement in an action,
you must enclose all of them in {} (braces). A second delimiter, %%, must
follow the rules section if you have a user subroutine section.
When yylex() matches a string in the input stream, it copies the matched
text to an external character array, yytext, before it executes any actions
in the rules section.
You can use the following operators to form patterns that you want to
match:
x, y
Matches the characters written.
[ ] Matches any one character in the enclosed range ([.-.]) or the enclosed
list ([...]). [abcx-z] matches a, b, c, x, y, or z.
" " Matches the enclosed character or string even if it is an operator.
"$" prevents lex from interpreting the $ character as an operator.
\ Acts the same as double quotes. \$ prevents lex from interpreting the
$ character as an operator.
* Matches zero or more occurrences of the character immediately preceding
it. x* matches zero or more repeated literal characters x.
$ Matches the end of a line.
| Matches either of two characters. x|y matches either x or y.
/ Matches one character only when followed by a second character. It
reads only the first character into yytext. x/y matches x when it is
followed by y, and reads x into yytext.
( ) Matches the pattern in the ( ) (parentheses). This is used for group-
ing. It reads the whole pattern into yytext. A group in parentheses
can be used in place of any single character in any other pattern.
(xyz123) matches the pattern xyz123 and reads the whole string into
yytext.
{} Matches the character as defined in the Definitions section. If D is
defined as numeric digits, {D} matches all numeric digits.
{m,n}
Matches m to n occurrences of the character. x{2,4} matches 2, 3, or 4
occurrences of x.
If a line begins with a space, lex copies it to the lex.yy.c output
file. If the line is in the definitions section of file, lex copies it to
the declarations section of lex.yy.c. If the line is in the rules section,
lex copies it to the program code section of lex.yy.c.
User Subroutines
The lex library has three subroutines defined as macros that you can use in
the rules.
input( ) Reads a character from yyin.
unput( ) Returns a character to the input stream after it is read.
output( ) Writes an output character to yyout.
You can override these three macros by writing your own code for these rou-
tines in the user subroutines section. But if you write your own, you must
undefine these macros in the definitions section as follows:
%{
#undef input
#undef unput
#undef output
%}
When you are using lex as a simple transformer/recognizer for stdin to
stdout piping, you can avoid writing the framework by using libl.a (the lex
library). It has a main routine that calls yylex() for you.
Other Special Characters
The lex program recognizes many of the normal C language special charac-
ters. These character sequences are as follows:
Sequence Meaning
\n Newline
\t Tab
\b Backspace
\\ Backslash
Do not use the actual newline character in an expression.
When using these special characters in an expression, you do not need to
enclose them in quotes. Every character, except these special characters
and the previously described operator symbols, is always a text character.
Matching Rules
When more than one expression can match the current input, lex chooses the
longest match first. Among rules that match the same number of characters,
the rule that occurs first is chosen. For example:
integer keyword action...;
[a-z]+ identifier action...;
If the preceding rules are given in that order, and integers is the input
word, lex matches the input as an identifier because [a-z]+ matches eight
characters, while integer matches only seven. However, if the input is
integer, both rules match seven characters. The keyword rule is selected
because it occurs first. A shorter input, such as int, does not match the
keyword rule integer, so lex selects the identifier rule.
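The selection order described above (longest match first, then the earliest
rule among equal-length matches) can be sketched in plain C. This is an
illustration only and not how lex implements its finite state machine; the
function names are hypothetical, and the character range test assumes an
ASCII-compatible character set.

```c
#include <stddef.h>
#include <string.h>

/* Length matched by the first rule, the literal keyword "integer" (0 if none). */
static size_t match_keyword(const char *s)
{
    const char *kw = "integer";
    size_t n = strlen(kw);

    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length matched by the second rule, [a-z]+ (0 if none). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/*
 * Hypothetical chooser: returns the name of the selected rule and stores
 * the number of matched characters in *len.
 */
const char *choose_rule(const char *s, size_t *len)
{
    size_t k = match_keyword(s);
    size_t i = match_identifier(s);

    if (k == 0 && i == 0) {
        *len = 0;
        return "none";
    }
    if (k >= i) {            /* on a tie, the rule listed first is chosen */
        *len = k;
        return "keyword";
    }
    *len = i;                /* otherwise the longer match is chosen */
    return "identifier";
}
```

With these two rules, choose_rule() selects identifier for integers and for
int, and keyword for integer.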
Matching a String with Wildcard Characters
Because lex chooses the longest match first, do not use rules containing
expressions like .*. For example:
'.*'
The preceding rule might seem like a good way to recognize a string in sin-
gle quotes. However, the lexical analyzer reads far ahead, looking for a
distant single quote to complete the long match. If a lexical analyzer
with such a rule gets the following input, it matches the whole string:
'first' quoted string here, 'second' here
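The difference between the long match and a match that stops at the first
closing quote can be sketched in plain C. Both functions below are
hypothetical illustrations, not part of lex. They assume, as with the .
operator, that a match does not extend past a newline character.

```c
#include <stddef.h>

/*
 * Length of the longest match of '.*' starting at s: from the opening
 * quote to the LAST single quote on the same line. Returns 0 if no match.
 */
size_t match_greedy(const char *s)
{
    size_t last = 0;
    size_t i;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i;                /* remember the most distant quote */
    return last ? last + 1 : 0;
}

/*
 * Length of a match that ends at the FIRST closing quote on the same
 * line. Returns 0 if no match.
 */
size_t match_to_next_quote(const char *s)
{
    size_t i;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            return i + 1;            /* stop at the nearest quote */
    return 0;
}
```

On the input line above, match_greedy() covers everything from the first
quote through the quote that closes second, while match_to_next_quote()
covers only 'first'.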
To find the smaller strings, first and second, use the following rule:
The lex program partitions the input stream, and does not search for all
possible matches of each expression. Each character is accounted for once
and only once. For example, to count occurrences of both she and he in an
input text, try the following rules:
she s++;
he h++;
\n |
. ;
The last two rules ignore everything besides he and she. However, because
she includes he, lex does not recognize the instances of he that are
included in she.
To override this choice, use the REJECT action. This directive tells lex
to go to the next rule. The lex command then restores the input pointer to
where it was before the first rule was executed, and executes the second
choice rule. For example, to count the included instances of he, use the
following rules:
she {s++; REJECT;}
he {h++; REJECT;}
\n |
. ;
After counting an occurrence of she, lex rejects the match and then counts
the occurrence of he that it contains. Because she includes he, but not
vice versa, you can omit the REJECT action on he in this case. In other
cases, it may be difficult to determine which input characters are in both
classes.
In general, REJECT is useful whenever the purpose of lex is not to parti-
tion the input stream but to detect all examples of some items in the
input, and the instances of these items may overlap or include each other.
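The difference between partitioning the input and detecting overlapping
items can be sketched in plain C. The function below is a hypothetical
illustration and does not reproduce the REJECT mechanism itself. It assumes
that, after a rejected match, the scanner falls back to the single-character
rule and so advances by one character instead of by the length of the match.

```c
#include <string.h>

struct counts {
    int she;    /* corresponds to s in the rules above */
    int he;     /* corresponds to h in the rules above */
};

/*
 * Hypothetical counter. With use_reject == 0, each character belongs to
 * exactly one match, so an he inside she is never counted. With
 * use_reject != 0, a match is counted and then only one character is
 * consumed, so included instances are counted as well.
 */
struct counts count_words(const char *text, int use_reject)
{
    struct counts c = {0, 0};
    const char *p = text;

    while (*p != '\0') {
        if (strncmp(p, "she", 3) == 0) {        /* longest match first */
            c.she++;
            p += use_reject ? 1 : 3;
        } else if (strncmp(p, "he", 2) == 0) {
            c.he++;
            p += use_reject ? 1 : 2;
        } else {
            p++;                                /* the \n and . rules */
        }
    }
    return c;
}
```

For the text "she sells he shells", the partitioning behavior counts she
twice and he once, and the rejecting behavior counts she twice and he three
times.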
EXAMPLES
1. The following command draws lex instructions from the file lexcom-
mands, and places the output in lex.yy.c:
lex lexcommands
2. The contents of the file lexcommands are an example of a lex program
that would be put into a lex command file. This program converts
uppercase to lowercase, removes spaces at the end of a line, and
replaces multiple spaces with single spaces:
%%
[A-Z] putchar(tolower(yytext[0]));
[ ]+$ ;
[ ]+ putchar(' ');
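The behavior described for this example (converting uppercase to lowercase,
removing spaces at the end of a line, and replacing multiple spaces with a
single space) can also be sketched as an ordinary C function. This is an
illustration under stated assumptions, not the code that lex generates: the
function name normalize() is hypothetical, it rewrites a string in place
instead of reading an input stream, and it treats only spaces that
immediately precede a newline character as being at the end of a line.

```c
#include <ctype.h>

/*
 * Hypothetical helper: rewrites s in place. The result is never longer
 * than the original string.
 */
void normalize(char *s)
{
    char *out = s;

    while (*s != '\0') {
        if (*s == ' ') {
            while (*s == ' ')       /* consume the whole run of spaces */
                s++;
            if (*s != '\n')         /* drop the run at the end of a line, */
                *out++ = ' ';       /* otherwise keep a single space */
        } else {
            *out++ = (char)tolower((unsigned char)*s);
            s++;
        }
    }
    *out = '\0';
}
```

For example, applying normalize() to a buffer that contains
"Hello   World  " followed by a newline leaves "hello world" followed by a
newline.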
FILES
Default C language skeleton finite state machine for lex.
/usr/ccs/lib/nrform
Default RATFOR language skeleton finite state machine for lex.
RELATED INFORMATION
Commands: yacc(1).
Programming Support Tools