NAME
  lex -	Generates a C Language program that matches patterns for simple	lexi-
  cal analysis of an input stream

SYNOPSIS

  lex [-cnrtv] [-V] [-Qy|-Qn] [file ...]

  The lex command reads file or standard input, generates a compilable C
  Language program, and writes it to a file named lex.yy.c.

FLAGS

  If the environment variable CMD_ENV is set to svr4, all flags listed in the
  synopsis are legal. Otherwise, -n, -t, and -v are the only legal flags, and
  they may be uppercase or lowercase.

  -c  Writes C code to the file	lex.yy.c. This is the default.

  -n  Suppresses the statistics	summary.  When you set your own	table sizes
      for the finite state machine, lex	automatically produces this summary
      if you do	not select this	flag.

  -r  Writes RATFOR code to the	file lex.yy.r. Note: there is no RATFOR	com-
      piler for	DEC OSF/1.

  -t  Writes to	standard output	instead	of to a	file.

  -v  Provides a summary of the	generated finite state machine statistics.

  -V  Outputs lex version number to standard error. Requires the environment
      variable CMD_ENV to be set to svr4.

  -Q[y|n]
      Determines whether the lex version number	is written to the output
      file.  -Qn does not do so, and is	the default.  Requires the environ-
      ment variable CMD_ENV to be set to svr4.

DESCRIPTION

  The lex command uses the rules and actions contained in file to generate a
  program, lex.yy.c, which can be compiled with	the cc command.	 That program
  can then receive input, break	the input into the logical pieces defined by
  the rules in file, and run program fragments contained in the	actions	in
  file.

  The generated program is a C Language function called yylex().  The lex
  command stores yylex() in a file named lex.yy.c.  You can use yylex() alone
  to recognize simple, one-word input, or you can use it with other C Language
  programs to perform more difficult input analysis functions.  For example,
  you can use lex to generate a program that tokenizes an input stream before
  a parser, such as one generated by the yacc command, processes it.

  Input	File Format

  The input file can contain three sections:  definitions, rules, and user
  subroutines.	Each section must be separated from the	others by a line con-
  taining only the delimiter, %%.  The format is as follows:

       definitions
       %%
       rules
       %%
       user_subroutines

  The purpose and format of each are described in the following	sections.

  Definitions

  If you want to use variables in rules, you must define them in this sec-
  tion.	 The variables make up the left	column,	and their definitions make up
  the right column.  For example, to define D as a numerical digit, enter:

       D       [0-9]


  You can use a	defined	variable in the	rules section by enclosing the vari-
  able name in braces, {D}.

  In the definitions section, you can also set table sizes for the resulting
  finite state machine.	 The default sizes are large enough for	small pro-
  grams.  You may want to set larger sizes for more complex programs.

  %p  number
	  Number of positions is number	(default 5000)

  %n  number
	  Number of states is number (default 2500)

  %e  number
	  Number of parse tree nodes is	number (default	2000)

  %a  number
	  Number of transitions	is number (default 5000)

  %k  number
	  Number of packed character classes is	number (default	1000)

  %o  number
	  Number of output slots is number (default 5000)

  If extended characters appear	in regular expression strings, you may need
  to reset the output array size with the %o parameter (possibly to array
  sizes	in the range 10,000 to 20,000).	 This reset reflects the much larger

  The columns are separated by a tab.  For example, to search files for	the
  word LEAD and	replace	it with	GOLD, perform the following steps:

  Create a file	called transmute.l containing the lines:

       %%
       (LEAD)  printf("GOLD");


  Then issue the following commands to the shell:

       lex transmute.l
       cc -o transmute lex.yy.c	-ll


  You can test the resulting program with the command:

       transmute <transmute.l


  This command echoes the contents of transmute.l, with	the occurrences	of
  LEAD changed to GOLD.
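
  The effect of the transmute rule, together with the copying of unmatched
  input to the output, can be approximated in plain C.  The sketch below is
  not the code that lex generates; the function name transmute_text is
  hypothetical and is used only for illustration.

```c
#include <string.h>

/*
 * Hypothetical helper: copy `in` to `out`, writing GOLD wherever LEAD
 * appears and copying every other character unchanged.  `out` must have
 * room for strlen(in) + 1 bytes; the output never grows because both
 * words have four letters.
 */
void transmute_text(const char *in, char *out)
{
    while (*in != '\0') {
        if (strncmp(in, "LEAD", 4) == 0) {
            memcpy(out, "GOLD", 4);   /* the action: printf("GOLD") */
            in += 4;
            out += 4;
        } else {
            *out++ = *in++;           /* unmatched input is echoed */
        }
    }
    *out = '\0';
}
```

  For instance, passing the text LEAD pipe yields GOLD pipe.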

  Each pattern may have	a corresponding	action,	that is, a fragment of C
  source code to execute when the pattern is matched.  Each statement must
  end with a ; (semicolon).  If	you use	more than one statement	in an action,
  you must enclose all of them in {} (braces).	A second delimiter, %%,	must
  follow the rules section if you have a user subroutine section.

  When yylex() matches a string	in the input stream, it	copies the matched
  text to an external character	array, yytext, before it executes any actions
  in the rules section.

  You can use the following operators to form patterns that you	want to
  match:

  x, y
      Matches the characters written.

  [ ] Matches any one character in the enclosed range ([.-.]) or the enclosed
      list ([...]).  [abcx-z] matches a, b, c, x, y, or z.

  " " Matches the enclosed character or	string even if it is an	operator.
      "$" prevents lex from interpreting the $ character as an operator.

  \   Acts the same as double quotes.  \$ prevents lex from interpreting the
      $	character as an	operator.

  *   Matches zero or more occurrences of the character	immediately preceding
      it.  x* matches zero or more repeated literal characters x.

  $   Matches the end of a line.

  |   Matches either of two alternatives.  x|y matches either x or y.

  /   Matches one character only when followed by a second character.  It
      reads only the first character into yytext.  x/y matches x when it is
      followed by y, and reads x into yytext.

  ( ) Matches the pattern in the ( ) (parentheses).  This is used for group-
      ing.  It reads the whole pattern into yytext.  A group in	parentheses
      can be used in place of any single character in any other	pattern.
      (xyz123) matches the pattern xyz123 and reads the	whole string into
      yytext.

  {}  Matches the character as defined in the Definitions section.  If D is
      defined as numeric digits, {D} matches all numeric digits.

  {m,n}
      Matches m	to n occurrences of the	character.  x{2,4} matches 2, 3, or 4
      occurrences of x.
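
  Two of the operators above, {m,n} and /, can be illustrated with a small
  hand-written C sketch.  It is not how lex implements them; the function
  names match_repeat and match_trailing are hypothetical, and both functions
  handle single characters only.

```c
#include <stddef.h>

/*
 * ch{m,n}: length of the match at the start of s, taking as many
 * occurrences of ch as possible up to n.  Returns 0 when fewer than m
 * occurrences are present.  ch is assumed not to be the NUL character.
 */
size_t match_repeat(const char *s, char ch, size_t m, size_t n)
{
    size_t count = 0;

    while (count < n && s[count] == ch)
        count++;
    return count >= m ? count : 0;
}

/*
 * x/y: x matches only when y follows, and only x counts as matched text.
 * Returns the number of characters that would be placed in yytext.
 * x is assumed not to be the NUL character.
 */
size_t match_trailing(const char *s, char x, char y)
{
    return (s[0] == x && s[1] == y) ? 1 : 0;
}
```

  With the input xxxxx, match_repeat(s, 'x', 2, 4) reports a match of four
  characters, which corresponds to x{2,4} taking the longest allowed run.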

  If a line begins with	only a space, lex copies it to the lex.yy.c output
  file.	 If the	line is	in the definitions section of file, lex	copies it to
  the declarations section of lex.yy.c.	 If the	line is	in the rules section,
  lex copies it	to the program code section of lex.yy.c.

  User Subroutines

  The lex library has three subroutines	defined	as macros that you can use in
  the rules.

  input( )  Reads a character from yyin.

  unput( )  Returns a character to the input stream after it is read.

  output( ) Writes an output character to yyout.

  You can override these three macros by writing your own code for these rou-
  tines	in the user subroutines	section.  But if you write your	own, you must
  undefine these macros	in the definitions section as follows:

       %{
       #undef input
       #undef unput
       #undef output
       %}


  When you are using lex as a simple transformer/recognizer for	stdin to
  stdout piping, you can avoid writing the framework by	using libl.a (the lex
  library).  It	has a main routine that	calls yylex() for you.


  Other	Special	Characters

  The lex program recognizes many of the normal	C language special charac-
  ters.	 These character sequences are as follows:

       Sequence		       Meaning

       \n		       Newline
       \t		       Tab
       \b		       Backspace
       \\		       Backslash

  Do not use the actual	newline	character in an	expression.

  When using these special characters in an expression,	you do not need	to
  enclose them in quotes.  Every character, except these special characters
  and the previously described operator	symbols, is always a text character.

  Matching Rules

  When more than one expression	can match the current input, lex chooses the
  longest match	first.	Among rules that match the same	number of characters,
  the rule that	occurs first is	chosen.	 For example:

       integer keyword action...;
       [a-z]+ identifier action...;

  If the preceding rules are given in that order, and integers is the input
  word,	lex matches the	input as an identifier because [a-z]+ matches eight
  characters, while integer matches only seven.	 However, if the input is
  integer, both	rules match seven characters. The keyword rule is selected
  because it occurs first.  A shorter input, such as int, does not match the
  keyword rule integer, so lex selects the identifier rule.
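
  The selection logic described above (longest match first, then the earlier
  rule on a tie) can be sketched in plain C.  This is an illustration only,
  not the table-driven code that lex produces; the names select_rule,
  match_keyword, and match_identifier are hypothetical, and the character
  range test assumes an ASCII-compatible character set.

```c
#include <stddef.h>
#include <string.h>

enum rule { RULE_NONE = -1, RULE_KEYWORD = 0, RULE_IDENTIFIER = 1 };

/* Length matched by the literal pattern integer at the start of s (0 if none). */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched by [a-z]+ at the start of s (0 if none). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Pick the rule with the longest match; on a tie the earlier rule wins. */
enum rule select_rule(const char *s, size_t *len)
{
    size_t lens[2];
    size_t best_len = 0;
    enum rule best = RULE_NONE;
    int i;

    lens[RULE_KEYWORD] = match_keyword(s);
    lens[RULE_IDENTIFIER] = match_identifier(s);

    for (i = 0; i < 2; i++) {
        if (lens[i] > best_len) {   /* strictly longer, so ties keep the earlier rule */
            best_len = lens[i];
            best = (enum rule)i;
        }
    }
    *len = best_len;
    return best;
}
```

  Called with integers, select_rule reports RULE_IDENTIFIER with a length of
  8; called with integer, it reports RULE_KEYWORD with a length of 7.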

  Matching a String with Wildcard Characters

  Because lex chooses the longest match	first, do not use rules	containing
  expressions like .*.	For example:

       '.*'


  The preceding	rule might seem	like a good way	to recognize a string in sin-
  gle quotes.  However,	the lexical analyzer reads far ahead, looking for a
  distant single quote to complete the long match.  If a lexical analyzer
  with such a rule gets	the following input, it	matches	the whole string:

       'first' quoted string here, 'second' here


  To find the smaller strings, first and second, use the following rule:
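
  A rule of this kind is commonly written with a character class that
  excludes the quote and the newline, for example '[^'\n]*'.  That exact
  pattern is an assumption here.  The C sketch below contrasts the two
  behaviors by hand; the function names match_greedy and
  match_to_first_quote are hypothetical, and the code is not what lex
  generates.

```c
#include <stddef.h>

/*
 * Length matched by '.*' at the start of s: an opening quote, then as many
 * non-newline characters as possible, ending at the LAST quote on the line.
 * Returns 0 if s does not match.
 */
size_t match_greedy(const char *s)
{
    size_t i, last = 0;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++) {
        if (s[i] == '\'')
            last = i;          /* remember the most distant closing quote */
    }
    return last ? last + 1 : 0;
}

/*
 * Length matched by the assumed rule '[^'\n]*': an opening quote, then
 * characters that are neither a quote nor a newline, then the FIRST quote.
 * Returns 0 if s does not match.
 */
size_t match_to_first_quote(const char *s)
{
    size_t i = 1;

    if (s[0] != '\'')
        return 0;
    while (s[i] != '\0' && s[i] != '\n' && s[i] != '\'')
        i++;
    return s[i] == '\'' ? i + 1 : 0;
}
```

  On the sample input line shown earlier, match_greedy reports a
  36-character match that spans both quoted strings, while
  match_to_first_quote reports 7 characters, that is, 'first' alone.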

  The lex program partitions the input stream, and does	not search for all
  possible matches of each expression.	Each character is accounted for	once
  and only once.  For example, to count	occurrences of both she	and he in an
  input	text, try the following	rules:

       she     s++;
       he      h++;
       \n      |
       .       ;


  The last two rules ignore everything besides he and she.  However, because
  she includes he, lex does not	recognize the instances	of he that are
  included in she.

  To override this choice, use the REJECT action.  This	directive tells	lex
  to go	to the next rule.  lex then adjusts the	position of the	input pointer
  to where it was before the first rule	was executed, and executes the second
  choice rule.	For example, to	count the included instances of	he, use	the
  following rules:

       she     {s++; REJECT;}
       he      {h++; REJECT;}
       \n      |
       .       ;


  After counting an occurrence of she, lex rejects the match and then counts
  the occurrence of he within it.  Because in this case, she includes he,
  but not vice versa, you can omit the REJECT action on	he.  In	other cases,
  it may be difficult to determine which input characters are in both
  classes.

  In general, REJECT is	useful whenever	the purpose of lex is not to parti-
  tion the input stream	but to detect all examples of some items in the
  input, and the instances of these items may overlap or include each other.
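
  The difference between partitioning the input and detecting overlapping
  items can be sketched in plain C.  The two functions below mimic the
  she/he rules without and with REJECT; they are a hand-written
  approximation, not generated scanner code, and the names counts,
  count_partition, and count_with_reject are hypothetical.

```c
#include <string.h>

struct counts { int she; int he; };

/*
 * Partitioning, as in the rules without REJECT: each character belongs to
 * exactly one match, so the he inside she is never seen by the he rule.
 */
struct counts count_partition(const char *text)
{
    struct counts c = {0, 0};

    while (*text != '\0') {
        if (strncmp(text, "she", 3) == 0) {
            c.she++;
            text += 3;          /* consume the whole match */
        } else if (strncmp(text, "he", 2) == 0) {
            c.he++;
            text += 2;
        } else {
            text++;             /* the \n and . rules: ignore one character */
        }
    }
    return c;
}

/*
 * Overlapping detection, as in the rules with REJECT: after each action the
 * scanner falls back to the next rule at the same position, so in the end
 * only one character is consumed per step.
 */
struct counts count_with_reject(const char *text)
{
    struct counts c = {0, 0};

    for (; *text != '\0'; text++) {
        if (strncmp(text, "she", 3) == 0)
            c.she++;
        if (strncmp(text, "he", 2) == 0)
            c.he++;
    }
    return c;
}
```

  For the input she he, count_partition finds one she and one he, while
  count_with_reject finds one she and two occurrences of he.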

EXAMPLES

   1.  The following command draws lex instructions from the file
       lexcommands, and places the output in lex.yy.c:
	    lex	lexcommands


   2.  The contents of the file	lexcommands are	an example of a	lex program
       that would be put into a	lex command file.  This	program	converts
       uppercase to lowercase, removes spaces at the end of a line, and
       replaces	multiple spaces	with single spaces:
	    %%
	    [A-Z] putchar(tolower(yytext[0]));
	    [ ]+$ ;
	     Default C language	skeleton finite	state machine for lex.

  /usr/ccs/lib/nrform
	     Default RATFOR language skeleton finite state machine for lex.

RELATED	INFORMATION

  Commands:  yacc(1).

  Programming Support Tools