summaryrefslogtreecommitdiff
path: root/usr.bin/lex/PSD.doc/lex.ms
diff options
context:
space:
mode:
authorMichael Shalayeff <mickey@cvs.openbsd.org>2002-12-03 21:44:00 +0000
committerMichael Shalayeff <mickey@cvs.openbsd.org>2002-12-03 21:44:00 +0000
commit5dc6f996d1d715bac7c9d5a59f939d29eacc3a14 (patch)
treeda347c07f4cfc5737063c99f7c28ebedd054105e /usr.bin/lex/PSD.doc/lex.ms
parent7c40f51a9a07812997cc937835ea8a45aab0d3e1 (diff)
caldera-licensed docs, now that they are free. need more work, thus not installed yet
Diffstat (limited to 'usr.bin/lex/PSD.doc/lex.ms')
-rw-r--r--usr.bin/lex/PSD.doc/lex.ms2335
1 files changed, 2335 insertions, 0 deletions
diff --git a/usr.bin/lex/PSD.doc/lex.ms b/usr.bin/lex/PSD.doc/lex.ms
new file mode 100644
index 00000000000..e1b4b0085a4
--- /dev/null
+++ b/usr.bin/lex/PSD.doc/lex.ms
@@ -0,0 +1,2335 @@
+.\" $OpenBSD: lex.ms,v 1.1 2002/12/03 21:43:59 mickey Exp $
+.\"
+.\" Copyright (C) Caldera International Inc. 2001-2002.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code and documentation must retain the above
+.\" copyright notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed or owned by Caldera
+.\" International, Inc.
+.\" 4. Neither the name of Caldera International, Inc. nor the names of other
+.\" contributors may be used to endorse or promote products derived from
+.\" this software without specific prior written permission.
+.\"
+.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
+.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
+.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+.\" IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE FOR ANY DIRECT,
+.\" INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+.\" (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+.\" SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+.\" STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+.\" IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+.\" POSSIBILITY OF SUCH DAMAGE.
+.\"
+.\" @(#)lex.ms 8.2 (Berkeley) 5/24/94
+.\"
+.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator'
+.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%'
+.hc ~
+.bd I 2
+.de TS
+.br
+.nf
+.SP 1v
+.ul 0
+..
+.de TE
+.SP 1v
+.fi
+..
+.\".de PT
+.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
+.\".if \\n%>1 'sp
+.\"..
+.ND July 21, 1975
+.\".RP
+.\".TM 75-1274-15 39199 39199-11
+.TL
+Lex \- A Lexical Analyzer ~Generator~
+.AU ``MH 2C-569'' 6377
+M. E. Lesk and E. Schmidt
+.AI
+.MH
+.AB
+.sp
+.bd I 2
+.\".nr PS 8
+.\".nr VS 9
+.\".ps 8
+.\".vs 9p
+Lex helps write programs whose control flow
+is directed by instances of regular
+expressions in the input stream.
+It is well suited for editor-script type transformations and
+for segmenting input in preparation for
+a parsing routine.
+.PP
+Lex source is a table of regular expressions and corresponding program fragments.
+The table is translated to a program
+which reads an input stream, copying it to an output stream
+and partitioning the input
+into strings which match the given expressions.
+As each such string is recognized the corresponding
+program fragment is executed.
+The recognition of the expressions
+is performed by a deterministic finite automaton
+generated by Lex.
+The program fragments written by the user are executed in the order in which the
+corresponding regular expressions occur in the input stream.
+.if n .if \n(tm .ig
+.PP
+The lexical analysis
+programs written with Lex accept ambiguous specifications
+and choose the longest
+match possible at each input point.
+If necessary, substantial look~ahead
+is performed on the input, but the
+input stream will be backed up to the
+end of the current partition, so that the user
+has general freedom to manipulate it.
+.PP
+Lex can generate analyzers in either C or Ratfor, a language
+which can be translated automatically to portable Fortran.
+It is available on the PDP-11 UNIX, Honeywell GCOS,
+and IBM OS systems.
+This manual, however, will only discuss generating analyzers
+in C on the UNIX system, which is the only supported
+form of Lex under UNIX Version 7.
+Lex is designed to simplify
+interfacing with Yacc, for those
+with access to this compiler-compiler system.
+..
+.\".nr PS 9
+.\".nr VS 11
+.AE
+.2C
+.NH
+Introduction.
+.PP
+Lex is a program generator designed for
+lexical processing of character input streams.
+It accepts a high-level, problem oriented specification
+for character string matching,
+and
+produces a program in a general purpose language which recognizes
+regular expressions.
+The regular expressions are specified by the user in the
+source specifications given to Lex.
+The Lex written code recognizes these expressions
+in an input stream and partitions the input stream into
+strings matching the expressions. At the bound~aries
+between strings
+program sections
+provided by the user are executed.
+The Lex source file associates the regular expressions and the
+program fragments.
+As each expression appears in the input to the program written by Lex,
+the corresponding fragment is executed.
+.PP
+.de MH
+Bell Laboratories, Murray Hill, NJ 07974.
+..
+The user supplies the additional code
+beyond expression matching
+needed to complete his tasks, possibly
+including code written by other generators.
+The program that recognizes the expressions is generated in the
+general purpose programming language employed for the
+user's program fragments.
+Thus, a high level expression
+language is provided to write the string expressions to be
+matched while the user's freedom to write actions
+is unimpaired.
+This avoids forcing the user who wishes to use a string manipulation
+language for input analysis to write processing programs in the same
+and often inappropriate string handling language.
+.PP
+Lex is not a complete language, but rather a generator representing
+a new language feature which can be added to
+different programming languages, called ``host languages.''
+Just as general purpose languages
+can produce code to run on different computer hardware,
+Lex can write code in different host languages.
+The host language is used for the output code generated by Lex
+and also for the program fragments added by the user.
+Compatible run-time libraries for the different host languages
+are also provided.
+This makes Lex adaptable to different environments and
+different users.
+Each application
+may be directed to the combination of hardware and host language appropriate
+to the task, the user's background, and the properties of local
+implementations.
+At present, the only supported host language is C,
+although Fortran (in the form of Ratfor [2] has been available
+in the past.
+Lex itself exists on UNIX, GCOS, and OS/370; but the
+code generated by Lex may be taken anywhere the appropriate
+compilers exist.
+.PP
+Lex turns the user's expressions and actions
+(called
+.ul
+source
+in this memo) into the host general-purpose language;
+the generated program is named
+.ul
+yylex.
+The
+.ul
+yylex
+program
+will recognize expressions
+in a stream
+(called
+.ul
+input
+in this memo)
+and perform the specified actions for each expression as it is detected.
+See Figure 1.
+.TS
+center;
+l _ r
+l|c|r
+l _ r
+l _ r
+l|c|r
+l _ r
+c s s
+c s s.
+
+Source \(-> Lex \(-> yylex
+
+.sp 2
+
+Input \(-> yylex \(-> Output
+
+.sp
+An overview of Lex
+Figure 1
+.TE
+.PP
+For a trivial example, consider a program to delete
+from the input
+all blanks or tabs at the ends of lines.
+.TS
+center;
+l l.
+%%
+[ \et]+$ ;
+.TE
+is all that is required.
+The program
+contains a %% delimiter to mark the beginning of the rules, and
+one rule.
+This rule contains a regular expression
+which matches one or more
+instances of the characters blank or tab
+(written \et for visibility, in accordance with the C language convention)
+just prior to the end of a line.
+The brackets indicate the character
+class made of blank and tab; the + indicates ``one or more ...'';
+and the $ indicates ``end of line,'' as in QED.
+No action is specified,
+so the program generated by Lex (yylex) will ignore these characters.
+Everything else will be copied.
+To change any remaining
+string of blanks or tabs to a single blank,
+add another rule:
+.TS
+center;
+l l.
+%%
+[ \et]+$ ;
+[ \et]+ printf(" ");
+.TE
+The finite automaton generated for this
+source will scan for both rules at once,
+observing at
+the termination of the string of blanks or tabs
+whether or not there is a newline character, and executing
+the desired rule action.
+The first rule matches all strings of blanks or tabs
+at the end of lines, and the second
+rule all remaining strings of blanks or tabs.
+.PP
+Lex can be used alone for simple transformations, or
+for analysis and statistics gathering on a lexical level.
+Lex can also be used with a parser generator
+to perform the lexical analysis phase; it is particularly
+easy to interface Lex and Yacc [3].
+Lex programs recognize only regular expressions;
+Yacc writes parsers that accept a large class of context free grammars,
+but require a lower level analyzer to recognize input tokens.
+Thus, a combination of Lex and Yacc is often appropriate.
+When used as a preprocessor for a later parser generator,
+Lex is used to partition the input stream,
+and the parser generator assigns structure to
+the resulting pieces.
+The flow of control
+in such a case (which might be the first half of a compiler,
+for example) is shown in Figure 2.
+Additional programs,
+written by other generators
+or by hand, can
+be added easily to programs written by Lex.
+.BS 2
+.ps 9
+.vs 11
+.TS
+center;
+l c c c l
+l c c c l
+l c c c l
+l _ c _ l
+l|c|c|c|l
+l _ c _ l
+l c c c l
+l _ c _ l
+l|c|c|c|l
+l _ c _ l
+l c s s l
+l c s s l.
+ lexical grammar
+ rules rules
+ \(da \(da
+
+ Lex Yacc
+
+ \(da \(da
+
+Input \(-> yylex \(-> yyparse \(-> Parsed input
+
+.sp
+ Lex with Yacc
+ Figure 2
+.TE
+.ps 10
+.vs 12
+.BE
+Yacc users
+will realize that the name
+.ul
+yylex
+is what Yacc expects its lexical analyzer to be named,
+so that the use of this name by Lex simplifies
+interfacing.
+.PP
+Lex generates a deterministic finite automaton from the regular expressions
+in the source [4].
+The automaton is interpreted, rather than compiled, in order
+to save space.
+The result is still a fast analyzer.
+In particular, the time taken by a Lex program
+to recognize and partition an input stream is
+proportional to the length of the input.
+The number of Lex rules or
+the complexity of the rules is
+not important in determining speed,
+unless rules which include
+forward context require a significant amount of re~scanning.
+What does increase with the number and complexity of rules
+is the size of the finite
+automaton, and therefore the size of the program
+generated by Lex.
+.PP
+In the program written by Lex, the user's fragments
+(representing the
+.ul
+actions
+to be performed as each regular expression
+is found)
+are gathered
+as cases of a switch.
+The automaton interpreter directs the control flow.
+Opportunity is provided for the user to insert either
+declarations or additional statements in the routine containing
+the actions, or to
+add subroutines outside this action routine.
+.PP
+Lex is not limited to source which can
+be interpreted on the basis of one character
+look~ahead.
+For example,
+if there are two rules, one looking for
+.I ab
+and another for
+.I abcdefg ,
+and the input stream is
+.I abcdefh ,
+Lex will recognize
+.I ab
+and leave
+the input pointer just before
+.I "cd. . ."
+Such backup is more costly
+than the processing of simpler languages.
+.2C
+.NH
+Lex Source.
+.PP
+The general format of Lex source is:
+.TS
+center;
+l.
+{definitions}
+%%
+{rules}
+%%
+{user subroutines}
+.TE
+where the definitions and the user subroutines
+are often omitted.
+The second
+.I %%
+is optional, but the first is required
+to mark the beginning of the rules.
+The absolute minimum Lex program is thus
+.TS
+center;
+l.
+%%
+.TE
+(no definitions, no rules) which translates into a program
+which copies the input to the output unchanged.
+.PP
+In the outline of Lex programs shown above, the
+.I
+rules
+.R
+represent the user's control
+decisions; they are a table, in which the left column
+contains
+.I
+regular expressions
+.R
+(see section 3)
+and the right column contains
+.I
+actions,
+.R
+program fragments to be executed when the expressions
+are recognized.
+Thus an individual rule might appear
+.TS
+center;
+l l.
+integer printf("found keyword INT");
+.TE
+to look for the string
+.I integer
+in the input stream and
+print the message ``found keyword INT'' whenever it appears.
+In this example the host procedural language is C and
+the C library function
+.I
+printf
+.R
+is used to print the string.
+The end
+of the expression is indicated by the first blank or tab character.
+If the action is merely a single C expression,
+it can just be given on the right side of the line; if it is
+compound, or takes more than a line, it should be enclosed in
+braces.
+As a slightly more useful example, suppose it is desired to
+change a number of words from British to American spelling.
+Lex rules such as
+.TS
+center;
+l l.
+colour printf("color");
+mechanise printf("mechanize");
+petrol printf("gas");
+.TE
+would be a start. These rules are not quite enough,
+since
+the word
+.I petroleum
+would become
+.I gaseum ;
+a way of dealing
+with this will be described later.
+.2C
+.NH
+Lex Regular Expressions.
+.PP
+The definitions of regular expressions are very similar to those
+in QED [5].
+A regular
+expression specifies a set of strings to be matched.
+It contains text characters (which match the corresponding
+characters in the strings being compared)
+and operator characters (which specify
+repetitions, choices, and other features).
+The letters of the alphabet and the digits are
+always text characters; thus the regular expression
+.TS
+center;
+l l.
+integer
+.TE
+matches the string
+.ul
+integer
+wherever it appears
+and the expression
+.TS
+center;
+l.
+a57D
+.TE
+looks for the string
+.ul
+a57D.
+.PP
+.I
+Operators.
+.R
+The operator characters are
+.TS
+center;
+l.
+" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
+.TE
+and if they are to be used as text characters, an escape
+should be used.
+The quotation mark operator (")
+indicates that whatever is contained between a pair of quotes
+is to be taken as text characters.
+Thus
+.TS
+center;
+l.
+xyz"++"
+.TE
+matches the string
+.I xyz++
+when it appears. Note that a part of a string may be quoted.
+It is harmless but unnecessary to quote an ordinary
+text character; the expression
+.TS
+center;
+l.
+"xyz++"
+.TE
+is the same as the one above.
+Thus by quoting every non-alphanumeric character
+being used as a text character, the user can avoid remembering
+the list above of current
+operator characters, and is safe should further extensions to Lex
+lengthen the list.
+.PP
+An operator character may also be turned into a text character
+by preceding it with \e as in
+.TS
+center;
+l.
+xyz\e+\e+
+.TE
+which
+is another, less readable, equivalent of the above expressions.
+Another use of the quoting mechanism is to get a blank into
+an expression; normally, as explained above, blanks or tabs end
+a rule.
+Any blank character not contained within [\|] (see below) must
+be quoted.
+Several normal C escapes with \e
+are recognized: \en is newline, \et is tab, and \eb is backspace.
+To enter \e itself, use \e\e.
+Since newline is illegal in an expression, \en must be used;
+it is not
+required to escape tab and backspace.
+Every character but blank, tab, newline and the list above is always
+a text character.
+.PP
+.I
+Character classes.
+.R
+Classes of characters can be specified using the operator pair [\|].
+The construction
+.I [abc]
+matches a
+single character, which may be
+.I a ,
+.I b ,
+or
+.I c .
+Within square brackets,
+most operator meanings are ignored.
+Only three characters are special:
+these are \e \(mi and ^. The \(mi character
+indicates ranges. For example,
+.TS
+center;
+l.
+[a\(miz0\(mi9<>_]
+.TE
+indicates the character class containing all the lower case letters,
+the digits,
+the angle brackets, and underline.
+Ranges may be given in either order.
+Using \(mi between any pair of characters which are
+not both upper case letters, both lower case letters, or both digits
+is implementation dependent and will get a warning message.
+(E.g., [0\-z] in ASCII is many more characters
+than it is in EBCDIC).
+If it is desired to include the
+character \(mi in a character class, it should be first or
+last; thus
+.TS
+center;
+l.
+[\(mi+0\(mi9]
+.TE
+matches all the digits and the two signs.
+.PP
+In character classes,
+the ^ operator must appear as the first character
+after the left bracket; it indicates that the resulting string
+is to be complemented with respect to the computer character set.
+Thus
+.TS
+center;
+l.
+[^abc]
+.TE
+matches all characters except a, b, or c, including
+all special or control characters; or
+.TS
+center;
+l.
+[^a\-zA\-Z]
+.TE
+is any character which is not a letter.
+The \e character provides the usual escapes within
+character class brackets.
+.PP
+.I
+Arbitrary character.
+.R
+To match almost any character, the operator character
+.TS
+center;
+l.
+\&.
+.TE
+is the class of all characters except newline.
+Escaping into octal is possible although non-portable:
+.TS
+center;
+l.
+[\e40\-\e176]
+.TE
+matches all printable characters in the ASCII character set, from octal
+40 (blank) to octal 176 (tilde).
+.PP
+.I
+Optional expressions.
+.R
+The operator
+.I ?
+indicates
+an optional element of an expression.
+Thus
+.TS
+center;
+l.
+ab?c
+.TE
+matches either
+.I ac
+or
+.I abc .
+.PP
+.I
+Repeated expressions.
+.R
+Repetitions of classes are indicated by the operators
+.I \(**
+and
+.I + .
+.TS
+center;
+l.
+\f2a\(**\f1
+.TE
+is any number of consecutive
+.I a
+characters, including zero; while
+.TS
+center;
+l.
+a+
+.TE
+is one or more instances of
+.I a.
+For example,
+.TS
+center;
+l.
+[a\-z]+
+.TE
+is all strings of lower case letters.
+And
+.TS
+center;
+l.
+[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
+.TE
+indicates all alphanumeric strings with a leading
+alphabetic character.
+This is a typical expression for recognizing identifiers in
+computer languages.
+.PP
+.I
+Alternation and Grouping.
+.R
+The operator |
+indicates alternation:
+.TS
+center;
+l.
+(ab\||\|cd)
+.TE
+matches either
+.ul
+ab
+or
+.ul
+cd.
+Note that parentheses are used for grouping, although
+they are
+not necessary on the outside level;
+.TS
+center;
+l.
+ab\||\|cd
+.TE
+would have sufficed.
+Parentheses
+can be used for more complex expressions:
+.TS
+center;
+l.
+(ab\||\|cd+)?(ef)\(**
+.TE
+matches such strings as
+.I abefef ,
+.I efefef ,
+.I cdef ,
+or
+.I cddd\| ;
+but not
+.I abc ,
+.I abcd ,
+or
+.I abcdef .
+.PP
+.I
+Context sensitivity.
+.R
+Lex will recognize a small amount of surrounding
+context. The two simplest operators for this are
+.I ^
+and
+.I $ .
+If the first character of an expression is
+.I ^ ,
+the expression will only be matched at the beginning
+of a line (after a newline character, or at the beginning of
+the input stream).
+This can never conflict with the other meaning of
+.I ^ ,
+complementation
+of character classes, since that only applies within
+the [\|] operators.
+If the very last character is
+.I $ ,
+the expression will only be matched at the end of a line (when
+immediately followed by newline).
+The latter operator is a special case of the
+.I /
+operator character,
+which indicates trailing context.
+The expression
+.TS
+center;
+l.
+ab/cd
+.TE
+matches the string
+.I ab ,
+but only if followed by
+.ul
+cd.
+Thus
+.TS
+center;
+l.
+ab$
+.TE
+is the same as
+.TS
+center;
+l.
+ab/\en
+.TE
+Left context is handled in Lex by
+.I
+start conditions
+.R
+as explained in section 10. If a rule is only to be executed
+when the Lex automaton interpreter is in start condition
+.I
+x,
+.R
+the rule should be prefixed by
+.TS
+center;
+l.
+<x>
+.TE
+using the angle bracket operator characters.
+If we considered ``being at the beginning of a line'' to be
+start condition
+.I ONE ,
+then the ^ operator
+would be equivalent to
+.TS
+center;
+l.
+<ONE>
+.TE
+Start conditions are explained more fully later.
+.PP
+.I
+Repetitions and Definitions.
+.R
+The operators {} specify
+either repetitions (if they enclose numbers)
+or
+definition expansion (if they enclose a name). For example
+.TS
+center;
+l.
+{digit}
+.TE
+looks for a predefined string named
+.I digit
+and inserts it
+at that point in the expression.
+The definitions are given in the first part of the Lex
+input, before the rules.
+In contrast,
+.TS
+center;
+l.
+a{1,5}
+.TE
+looks for 1 to 5 occurrences of
+.I a .
+.PP
+Finally, initial
+.I %
+is special, being the separator
+for Lex source segments.
+.2C
+.NH
+Lex Actions.
+.PP
+When an expression written as above is matched, Lex
+executes the corresponding action. This section describes
+some features of Lex which aid in writing actions. Note
+that there is a default action, which
+consists of copying the input to the output. This
+is performed on all strings not otherwise matched. Thus
+the Lex user who wishes to absorb the entire input, without
+producing any output, must provide rules to match everything.
+When Lex is being used with Yacc, this is the normal
+situation.
+One may consider that actions are what is done instead of
+copying the input to the output; thus, in general,
+a rule which merely copies can be omitted.
+Also, a character combination
+which is omitted from the rules
+and which appears as input
+is likely to be printed on the output, thus calling
+attention to the gap in the rules.
+.PP
+One of the simplest things that can be done is to ignore
+the input. Specifying a C null statement, \fI;\fR as an action
+causes this result. A frequent rule is
+.TS
+center;
+l l.
+[ \et\en] ;
+.TE
+which causes the three spacing characters (blank, tab, and newline)
+to be ignored.
+.PP
+Another easy way to avoid writing actions is the action character
+|, which indicates that the action for this rule is the action
+for the next rule.
+The previous example could also have been written
+.TS
+center;
+l l.
+" " |
+"\et" |
+"\en" ;
+.TE
+with the same result, although in different style.
+The quotes around \en and \et are not required.
+.PP
+In more complex actions, the user
+will
+often want to know the actual text that matched some expression
+like
+.I [a\(miz]+ .
+Lex leaves this text in an external character
+array named
+.I
+yytext.
+.R
+Thus, to print the name found,
+a rule like
+.TS
+center;
+l l.
+[a\-z]+ printf("%s", yytext);
+.TE
+will print
+the string in
+.I
+yytext.
+.R
+The C function
+.I
+printf
+.R
+accepts a format argument and data to be printed;
+in this case, the format is ``print string'' (% indicating
+data conversion, and
+.I s
+indicating string type),
+and the data are the characters
+in
+.I
+yytext.
+.R
+So this just places
+the matched string
+on the output.
+This action
+is so common that
+it may be written as ECHO:
+.TS
+center;
+l l.
+[a\-z]+ ECHO;
+.TE
+is the same as the above.
+Since the default action is just to
+print the characters found, one might ask why
+give a rule, like this one, which merely specifies
+the default action?
+Such rules are often required
+to avoid matching some other rule
+which is not desired. For example, if there is a rule
+which matches
+.I read
+it will normally match the instances of
+.I read
+contained in
+.I bread
+or
+.I readjust ;
+to avoid
+this,
+a rule
+of the form
+.I [a\(miz]+
+is needed.
+This is explained further below.
+.PP
+Sometimes it is more convenient to know the end of what
+has been found; hence Lex also provides a count
+.I
+yyleng
+.R
+of the number of characters matched.
+To count both the number
+of words and the number of characters in words in the input, the user might write
+.TS
+center;
+l l.
+[a\-zA\-Z]+ {words++; chars += yyleng;}
+.TE
+which accumulates in
+.ul
+chars
+the number
+of characters in the words recognized.
+The last character in the string matched can
+be accessed by
+.TS
+center;
+l.
+yytext[yyleng\-1]
+.TE
+.PP
+Occasionally, a Lex
+action may decide that a rule has not recognized the correct
+span of characters.
+Two routines are provided to aid with this situation.
+First,
+.I
+yymore()
+.R
+can be called to indicate that the next input expression recognized is to be
+tacked on to the end of this input. Normally,
+the next input string would overwrite the current
+entry in
+.I
+yytext.
+.R
+Second,
+.I
+yyless (n)
+.R
+may be called to indicate that not all the characters matched
+by the currently successful expression are wanted right now.
+The argument
+.I
+n
+.R
+indicates the number of characters
+in
+.I
+yytext
+.R
+to be retained.
+Further characters previously matched
+are
+returned to the input. This provides the same sort of
+look~ahead offered by the / operator,
+but in a different form.
+.PP
+.I
+Example:
+.R
+Consider a language which defines
+a string as a set of characters between quotation (") marks, and provides that
+to include a " in a string it must be preceded by a \e. The
+regular expression which matches that is somewhat confusing,
+so that it might be preferable to write
+.TS
+center;
+l l.
+\e"[^"]\(** {
+ if (yytext[yyleng\-1] == \(fm\e\e\(fm)
+ yymore();
+ else
+ ... normal user processing
+ }
+.TE
+which will, when faced with a string such as
+.I
+"abc\e"def\|"
+.R
+first match
+the five characters
+\fI"abc\e\|\fR;
+then
+the call to
+.I yymore()
+will
+cause the next part of the string,
+\fI"def\|\fR,
+to be tacked on the end.
+Note that the final quote terminating the string should be picked
+up in the code labeled ``normal processing''.
+.PP
+The function
+.I
+yyless()
+.R
+might be used to reprocess
+text in various circumstances. Consider the C problem of distinguishing
+the ambiguity of ``=\(mia''.
+Suppose it is desired to treat this as ``=\(mi a''
+but print a message. A rule might be
+.ps 9
+.vs 11
+.TS
+center;
+l l.
+=\(mi[a\-zA\-Z] {
+ printf("Op (=\(mi) ambiguous\en");
+ yyless(yyleng\-1);
+ ... action for =\(mi ...
+ }
+.TE
+.ps 10
+.vs 12
+which prints a message, returns the letter after the
+operator to the input stream, and treats the operator as ``=\(mi''.
+Alternatively it might be desired to treat this as ``= \(mia''.
+To do this, just return the minus
+sign as well as the letter to the input:
+.ps 9
+.vs 11
+.TS
+center;
+l l.
+=\(mi[a\-zA\-Z] {
+ printf("Op (=\(mi) ambiguous\en");
+ yyless(yyleng\-2);
+ ... action for = ...
+ }
+.TE
+.ps 10
+.vs 12
+will perform the other interpretation.
+Note that the expressions for the two cases might more easily
+be written
+.TS
+center;
+l l.
+=\(mi/[A\-Za\-z]
+.TE
+in the first case and
+.TS
+center;
+l.
+=/\-[A\-Za\-z]
+.TE
+in the second;
+no backup would be required in the rule action.
+It is not necessary to recognize the whole identifier
+to observe the ambiguity.
+The
+possibility of ``=\(mi3'', however, makes
+.TS
+center;
+l.
+=\(mi/[^ \et\en]
+.TE
+a still better rule.
+.PP
+In addition to these routines, Lex also permits
+access to the I/O routines
+it uses.
+They are:
+.IP 1)
+.I
+input()
+.R
+which returns the next input character;
+.IP 2)
+.I
+output(c)
+.R
+which writes the character
+.I
+c
+.R
+on the output; and
+.IP 3)
+.I
+unput(c)
+.R
+pushes the character
+.I
+c
+.R
+back onto the input stream to be read later by
+.I
+input().
+.R
+.LP
+By default these routines are provided as macro definitions,
+but the user can override them and supply private versions.
+These routines
+define the relationship between external files and
+internal characters, and must all be retained
+or modified consistently.
+They may be redefined, to
+cause input or output to be transmitted to or from strange
+places, including other programs or internal memory;
+but the character set used must be consistent in all routines;
+a value of zero returned by
+.I
+input
+.R
+must mean end of file; and
+the relationship between
+.I
+unput
+.R
+and
+.I
+input
+.R
+must be retained
+or the Lex look~ahead will not work.
+Lex does not look ahead at all if it does not have to,
+but every rule ending in
+.ft I
++ \(** ?
+.ft R
+or
+.ft I
+$
+.ft R
+or containing
+.ft I
+/
+.ft R
+implies look~ahead.
+Look~ahead is also necessary to match an expression that is a prefix
+of another expression.
+See below for a discussion of the character set used by Lex.
+The standard Lex library imposes
+a 100 character limit on backup.
+.PP
+Another Lex library routine that the user will sometimes want
+to redefine is
+.I
+yywrap()
+.R
+which is called whenever Lex reaches an end-of-file.
+If
+.I
+yywrap
+.R
+returns a 1, Lex continues with the normal wrapup on end of input.
+Sometimes, however, it is convenient to arrange for more
+input to arrive
+from a new source.
+In this case, the user should provide
+a
+.I
+yywrap
+.R
+which
+arranges for new input and
+returns 0. This instructs Lex to continue processing.
+The default
+.I
+yywrap
+.R
+always returns 1.
+.PP
+This routine is also a convenient place
+to print tables, summaries, etc. at the end
+of a program. Note that it is not
+possible to write a normal rule which recognizes
+end-of-file; the only access to this condition is
+through
+.I
+yywrap.
+.R
+In fact, unless a private version of
+.I
+input()
+.R
+is supplied
+a file containing nulls
+cannot be handled,
+since a value of 0 returned by
+.I
+input
+.R
+is taken to be end-of-file.
+.PP
+.2C
+.NH
+Ambiguous Source Rules.
+.PP
+Lex can handle ambiguous specifications.
+When more than one expression can match the
+current input, Lex chooses as follows:
+.IP 1)
+The longest match is preferred.
+.IP 2)
+Among rules which matched the same number of characters,
+the rule given first is preferred.
+.LP
+Thus, suppose the rules
+.TS
+center;
+l l.
+integer keyword action ...;
+[a\-z]+ identifier action ...;
+.TE
+to be given in that order. If the input is
+.I integers ,
+it is taken as an identifier, because
+.I [a\-z]+
+matches 8 characters while
+.I integer
+matches only 7.
+If the input is
+.I integer ,
+both rules match 7 characters, and
+the keyword rule is selected because it was given first.
+Anything shorter (e.g. \fIint\fR\|) will
+not match the expression
+.I integer
+and so the identifier interpretation is used.
+.PP
+The principle of preferring the longest
+match makes rules containing
+expressions like
+.I \&.\(**
+dangerous.
+For example,
+.TS
+center;
+l.
+\&\(fm.\(**\(fm
+.TE
+might seem a good way of recognizing
+a string in single quotes.
+But it is an invitation for the program to read far
+ahead, looking for a distant
+single quote.
+Presented with the input
+.TS
+center;
+l l.
+\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
+.TE
+the above expression will match
+.TS
+center;
+l l.
+\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
+.TE
+which is probably not what was wanted.
+A better rule is of the form
+.TS
+center;
+l.
+\&\(fm[^\(fm\en]\(**\(fm
+.TE
+which, on the above input, will stop
+after
+.I \(fmfirst\(fm .
+The consequences
+of errors like this are mitigated by the fact
+that the
+.I \&.
+operator will not match newline.
+Thus expressions like
+.I \&.\(**
+stop on the
+current line.
+Don't try to defeat this with expressions like
+.I (.|\en)+
+or
+equivalents;
+the Lex generated program will try to read
+the entire input file, causing
+internal buffer overflows.
+.PP
+Note that Lex is normally partitioning
+the input stream, not searching for all possible matches
+of each expression.
+This means that each character is accounted for
+once and only once.
+For example, suppose it is desired to
+count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
+Some Lex rules to do this might be
+.TS
+center;
+l l.
+she s++;
+he h++;
+\en |
+\&. ;
+.TE
+where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
+Remember that . does not include newline.
+Since \fIshe\fR includes \fIhe\fR, Lex will normally
+.I
+not
+.R
+recognize
+the instances of \fIhe\fR included in \fIshe\fR,
+since once it has passed a \fIshe\fR those characters are gone.
+.PP
+Sometimes the user would like to override this choice. The action
+REJECT
+means ``go do the next alternative.''
+It causes whatever rule was second choice after the current
+rule to be executed.
+The position of the input pointer is adjusted accordingly.
+Suppose the user really wants to count the included instances of \fIhe\fR:
+.TS
+center;
+l l.
+she {s++; REJECT;}
+he {h++; REJECT;}
+\en |
+\&. ;
+.TE
+these rules are one way of changing the previous example
+to do just that.
+After counting each expression, it is rejected; whenever appropriate,
+the other expression will then be counted. In this example, of course,
+the user could note that \fIshe\fR includes \fIhe\fR but not
+vice versa, and omit the REJECT action on \fIhe\fR;
+in other cases, however, it
+would not be possible a priori to tell
+which input characters
+were in both classes.
+.PP
+Consider the two rules
+.TS
+center;
+l l.
+a[bc]+ { ... ; REJECT;}
+a[cd]+ { ... ; REJECT;}
+.TE
+If the input is
+.I ab ,
+only the first rule matches,
+and on
+.I ad
+only the second matches.
+The input string
+.I accb
+matches the first rule for four characters
+and then the second rule for three characters.
+In contrast, the input
+.I accd
+agrees with
+the second rule for four characters and then the first
+rule for three.
+.PP
+In general, REJECT is useful whenever
+the purpose of Lex is not to partition the input
+stream but to detect all examples of some items
+in the input, and the instances of these items
+may overlap or include each other.
+Suppose a digram table of the input is desired;
+normally the digrams overlap, that is the word
+.I the
+is considered to contain
+both
+.I th
+and
+.I he .
+Assuming a two-dimensional array named
+.ul
+digram
+to be incremented, the appropriate
+source is
+.TS
+center;
+l l.
+%%
+[a\-z][a\-z] {
+ digram[yytext[0]][yytext[1]]++;
+ REJECT;
+ }
+\. ;
+\en ;
+.TE
+where the REJECT is necessary to pick up
+a letter pair beginning at every character, rather than at every
+other character.
+.2C
+.NH
+Lex Source Definitions.
+.PP
+Remember the format of the Lex
+source:
+.TS
+center;
+l.
+{definitions}
+%%
+{rules}
+%%
+{user routines}
+.TE
+So far only the rules have been described. The user needs
+additional options,
+though, to define variables for use in his program and for use
+by Lex.
+These can go either in the definitions section
+or in the rules section.
+.PP
+Remember that Lex is turning the rules into a program.
+Any source not intercepted by Lex is copied
+into the generated program. There are three classes
+of such things.
+.IP 1)
+Any line which is not part of a Lex rule or action
+which begins with a blank or tab is copied into
+the Lex generated program.
+Such source input prior to the first %% delimiter will be external
+to any function in the code; if it appears immediately after the first
+%%,
+it appears in an appropriate place for declarations
+in the function written by Lex which contains the actions.
+This material must look like program fragments,
+and should precede the first Lex rule.
+.IP
+As a side effect of the above, lines which begin with a blank
+or tab, and which contain a comment,
+are passed through to the generated program.
+This can be used to include comments in either the Lex source or
+the generated code. The comments should follow the host
+language convention.
+.IP 2)
+Anything included between lines containing
+only
+.I %{
+and
+.I %}
+is
+copied out as above. The delimiters are discarded.
+This format permits entering text like preprocessor statements that
+must begin in column 1,
+or copying lines that do not look like programs.
+.IP 3)
+Anything after the third %% delimiter, regardless of formats, etc.,
+is copied out after the Lex output.
+.PP
+Definitions intended for Lex are given
+before the first %% delimiter. Any line in this section
+not contained between %{ and %}, and begining
+in column 1, is assumed to define Lex substitution strings.
+The format of such lines is
+.TS
+center;
+l l.
+name translation
+.TE
+and it
+causes the string given as a translation to
+be associated with the name.
+The name and translation
+must be separated by at least one blank or tab, and the name must begin with a letter.
+The translation can then be called out
+by the {name} syntax in a rule.
+Using {D} for the digits and {E} for an exponent field,
+for example, might abbreviate rules to recognize numbers:
+.TS
+center;
+l l.
+D [0\-9]
+E [DEde][\-+]?{D}+
+%%
+{D}+ printf("integer");
+{D}+"."{D}\(**({E})? |
+{D}\(**"."{D}+({E})? |
+{D}+{E} printf("real");
+.TE
+Note the first two rules for real numbers;
+both require a decimal point and contain
+an optional exponent field,
+but the first requires at least one digit before the
+decimal point and the second requires at least one
+digit after the decimal point.
+To correctly handle the problem
+posed by a Fortran expression such as
+.I 35.EQ.I ,
+which does not contain a real number, a context-sensitive
+rule such as
+.TS
+center;
+l l.
+[0\-9]+/"."EQ printf("integer");
+.TE
+could be used in addition to the normal rule for integers.
+.PP
+The definitions
+section may also contain other commands, including the
+selection of a host language, a character set table,
+a list of start conditions, or adjustments to the default
+size of arrays within Lex itself for larger source programs.
+These possibilities
+are discussed below under ``Summary of Source Format,''
+section 12.
+.2C
+.NH
+Usage.
+.PP
+There are two steps in
+compiling a Lex source program.
+First, the Lex source must be turned into a generated program
+in the host general purpose language.
+Then this program must be compiled and loaded, usually with
+a library of Lex subroutines.
+The generated program
+is on a file named
+.I lex.yy.c .
+The I/O library is defined in terms of the C standard
+library [6].
+.PP
+The C programs generated by Lex are slightly different
+on OS/370, because the
+OS compiler is less powerful than the UNIX or GCOS compilers,
+and does less at compile time.
+C programs generated on GCOS and UNIX are the same.
+.PP
+.I
+UNIX.
+.R
+The library is accessed by the loader flag
+.I \-ll .
+So an appropriate
+set of commands is
+.KS
+.in 5
+lex source
+cc lex.yy.c \-ll
+.in 0
+.KE
+The resulting program is placed on the usual file
+.I
+a.out
+.R
+for later execution.
+To use Lex with Yacc see below.
+Although the default Lex I/O routines use the C standard library,
+the Lex automata themselves do not do so;
+if private versions of
+.I
+input,
+output
+.R
+and
+.I unput
+are given, the library can be avoided.
+.PP
+.2C
+.NH
+Lex and Yacc.
+.PP
+If you want to use Lex with Yacc, note that what Lex writes is a program
+named
+.I
+yylex(),
+.R
+the name required by Yacc for its analyzer.
+Normally, the default main program on the Lex library
+calls this routine, but if Yacc is loaded, and its main
+program is used, Yacc will call
+.I
+yylex().
+.R
+In this case each Lex rule should end with
+.TS
+center;
+l.
+return(token);
+.TE
+where the appropriate token value is returned.
+An easy way to get access
+to Yacc's names for tokens is to
+compile the Lex output file as part of
+the Yacc output file by placing the line
+.TS
+center;
+l.
+# include "lex.yy.c"
+.TE
+in the last section of Yacc input.
+Supposing the grammar to be
+named ``good'' and the lexical rules to be named ``better''
+the UNIX command sequence can just be:
+.TS
+center;
+l.
+yacc good
+lex better
+cc y.tab.c \-ly \-ll
+.TE
+The Yacc library (\-ly) should be loaded before the Lex library,
+to obtain a main program which invokes the Yacc parser.
+The generations of Lex and Yacc programs can be done in
+either order.
+.2C
+.NH
+Examples.
+.PP
+As a trivial problem, consider copying an input file while
+adding 3 to every positive number divisible by 7.
+Here is a suitable Lex source program
+.TS
+center;
+l l.
+%%
+ int k;
+[0\-9]+ {
+ k = atoi(yytext);
+ if (k%7 == 0)
+ printf("%d", k+3);
+ else
+ printf("%d",k);
+ }
+.TE
+to do just that.
+The rule [0\-9]+ recognizes strings of digits;
+.I
+atoi
+.R
+converts the digits to binary
+and stores the result in
+.ul
+k.
+The operator % (remainder) is used to check whether
+.ul
+k
+is divisible by 7; if it is,
+it is incremented by 3 as it is written out.
+It may be objected that this program will alter such
+input items as
+.I 49.63
+or
+.I X7 .
+Furthermore, it increments the absolute value
+of all negative numbers divisible by 7.
+To avoid this, just add a few more rules after the active one,
+as here:
+.TS
+center;
+l l.
+%%
+ int k;
+\-?[0\-9]+ {
+ k = atoi(yytext);
+ printf("%d",
+ k%7 == 0 ? k+3 : k);
+ }
+\-?[0\-9.]+ ECHO;
+[A-Za-z][A-Za-z0-9]+ ECHO;
+.TE
+Numerical strings containing
+a ``.'' or preceded by a letter will be picked up by
+one of the last two rules, and not changed.
+The
+.I if\-else
+has been replaced by
+a C conditional expression to save space;
+the form
+.ul
+a?b:c
+means ``if
+.I a
+then
+.I b
+else
+.I c ''.
+.PP
+For an example of statistics gathering, here
+is a program which histograms the lengths
+of words, where a word is defined as a string of letters.
+.TS
+center;
+l l.
+ int lengs[100];
+%%
+[a\-z]+ lengs[yyleng]++;
+\&. |
+\en ;
+%%
+.T&
+l s.
+yywrap()
+{
+int i;
+printf("Length No. words\en");
+for(i=0; i<100; i++)
+ if (lengs[i] > 0)
+ printf("%5d%10d\en",i,lengs[i]);
+return(1);
+}
+.TE
+This program
+accumulates the histogram, while producing no output. At the end
+of the input it prints the table.
+The final statement
+.I
+return(1);
+.R
+indicates that Lex is to perform wrapup. If
+.I
+yywrap
+.R
+returns zero (false)
+it implies that further input is available
+and the program is
+to continue reading and processing.
+To provide a
+.I
+yywrap
+.R
+that never
+returns true causes an infinite loop.
+.PP
+As a larger example,
+here are some parts of a program written by N. L. Schryer
+to convert double precision Fortran to single precision Fortran.
+Because Fortran does not distinguish upper and lower case letters,
+this routine begins by defining a set of classes including
+both cases of each letter:
+.TS
+center;
+l l.
+a [aA]
+b [bB]
+c [cC]
+\&...
+z [zZ]
+.TE
+An additional class recognizes white space:
+.TS
+center;
+l l.
+W [ \et]\(**
+.TE
+The first rule changes
+``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
+.TS
+center;
+l.
+{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
+ printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
+ }
+.TE
+Care is taken throughout this program to preserve the case
+(upper or lower)
+of the original program.
+The conditional operator is used to
+select the proper form of the keyword.
+The next rule copies continuation card indications to
+avoid confusing them with constants:
+.TS
+center;
+l l.
+^" "[^ 0] ECHO;
+.TE
+In the regular expression, the quotes surround the
+blanks.
+It is interpreted as
+``beginning of line, then five blanks, then
+anything but blank or zero.''
+Note the two different meanings of
+.I ^ .
+There follow some rules to change double precision
+constants to ordinary floating constants.
+.TS
+center;
+l.
+[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
+[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
+"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
+ /\(** convert constants \(**/
+ for(p=yytext; \(**p != 0; p++)
+ {
+ if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
+ \(**p=+ \(fme\(fm\- \(fmd\(fm;
+ ECHO;
+ }
+.TE
+After the floating point constant is recognized, it is
+scanned by the
+.ul
+for
+loop
+to find the letter
+.I d
+or
+.I D .
+The program than adds
+.c
+.I \(fme\(fm\-\(fmd\(fm ,
+which converts
+it to the next letter of the alphabet.
+The modified constant, now single-precision,
+is written out again.
+There follow a series of names which must be respelled to remove
+their initial \fId\fR.
+By using the
+array
+.I
+yytext
+.R
+the same action suffices for all the names (only a sample of
+a rather long list is given here).
+.TS
+center;
+l l.
+{d}{s}{i}{n} |
+{d}{c}{o}{s} |
+{d}{s}{q}{r}{t} |
+{d}{a}{t}{a}{n} |
+\&...
+{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
+.TE
+Another list of names must have initial \fId\fR changed to initial \fIa\fR:
+.TS
+center;
+l l.
+{d}{l}{o}{g} |
+{d}{l}{o}{g}10 |
+{d}{m}{i}{n}1 |
+{d}{m}{a}{x}1 {
+ yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
+ ECHO;
+ }
+.TE
+And one routine
+must have initial \fId\fR changed to initial \fIr\fR:
+.TS
+center;
+l l.
+{d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
+ ECHO;
+ }
+.TE
+To avoid such names as \fIdsinx\fR being detected as instances
+of \fIdsin\fR, some final rules pick up longer words as identifiers
+and copy some surviving characters:
+.TS
+center;
+l l.
+[A\-Za\-z][A\-Za\-z0\-9]\(** |
+[0\-9]+ |
+\en |
+\&. ECHO;
+.TE
+Note that this program is not complete; it
+does not deal with the spacing problems in Fortran or
+with the use of keywords as identifiers.
+.br
+.2C
+.NH
+Left Context Sensitivity.
+.PP
+Sometimes
+it is desirable to have several sets of lexical rules
+to be applied at different times in the input.
+For example, a compiler preprocessor might distinguish
+preprocessor statements and analyze them differently
+from ordinary statements.
+This requires
+sensitivity
+to prior context, and there are several ways of handling
+such problems.
+The \fI^\fR operator, for example, is a prior context operator,
+recognizing immediately preceding left context just as \fI$\fR recognizes
+immediately following right context.
+Adjacent left context could be extended, to produce a facility similar to
+that for adjacent right context, but it is unlikely
+to be as useful, since often the relevant left context
+appeared some time earlier, such as at the beginning of a line.
+.PP
+This section describes three means of dealing
+with different environments: a simple use of flags,
+when only a few rules change from one environment to another,
+the use of
+.I
+start conditions
+.R
+on rules,
+and the possibility of making multiple lexical analyzers all run
+together.
+In each case, there are rules which recognize the need to change the
+environment in which the
+following input text is analyzed, and set some parameter
+to reflect the change. This may be a flag explicitly tested by
+the user's action code; such a flag is the simplest way of dealing
+with the problem, since Lex is not involved at all.
+It may be more convenient,
+however,
+to have Lex remember the flags as initial conditions on the rules.
+Any rule may be associated with a start condition. It will only
+be recognized when Lex is in
+that start condition.
+The current start condition may be changed at any time.
+Finally, if the sets of rules for the different environments
+are very dissimilar,
+clarity may be best achieved by writing several distinct lexical
+analyzers, and switching from one to another as desired.
+.PP
+Consider the following problem: copy the input to the output,
+changing the word \fImagic\fR to \fIfirst\fR on every line which began
+with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
+which began with the letter \fIb\fR, and changing
+\fImagic\fR to \fIthird\fR on every line which began
+with the letter \fIc\fR. All other words and all other lines
+are left unchanged.
+.PP
+These rules are so simple that the easiest way
+to do this job is with a flag:
+.TS
+center;
+l l.
+ int flag;
+%%
+^a {flag = \(fma\(fm; ECHO;}
+^b {flag = \(fmb\(fm; ECHO;}
+^c {flag = \(fmc\(fm; ECHO;}
+\en {flag = 0 ; ECHO;}
+magic {
+ switch (flag)
+ {
+ case \(fma\(fm: printf("first"); break;
+ case \(fmb\(fm: printf("second"); break;
+ case \(fmc\(fm: printf("third"); break;
+ default: ECHO; break;
+ }
+ }
+.TE
+should be adequate.
+.PP
+To handle the same problem with start conditions, each
+start condition must be introduced to Lex in the definitions section
+with a line reading
+.TS
+center;
+l l.
+%Start name1 name2 ...
+.TE
+where the conditions may be named in any order.
+The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
+The conditions may be referenced at the
+head of a rule with the <> brackets:
+.TS
+center;
+l.
+<name1>expression
+.TE
+is a rule which is only recognized when Lex is in the
+start condition \fIname1\fR.
+To enter a start condition,
+execute the action statement
+.TS
+center;
+l.
+BEGIN name1;
+.TE
+which changes the start condition to \fIname1\fR.
+To resume the normal state,
+.TS
+center;
+l.
+BEGIN 0;
+.TE
+resets the initial condition
+of the Lex automaton interpreter.
+A rule may be active in several
+start conditions:
+.TS
+center;
+l.
+<name1,name2,name3>
+.TE
+is a legal prefix. Any rule not beginning with the
+<> prefix operator is always active.
+.PP
+The same example as before can be written:
+.TS
+center;
+l l.
+%START AA BB CC
+%%
+^a {ECHO; BEGIN AA;}
+^b {ECHO; BEGIN BB;}
+^c {ECHO; BEGIN CC;}
+\en {ECHO; BEGIN 0;}
+<AA>magic printf("first");
+<BB>magic printf("second");
+<CC>magic printf("third");
+.TE
+where the logic is exactly the same as in the previous
+method of handling the problem, but Lex does the work
+rather than the user's code.
+.2C
+.NH
+Character Set.
+.PP
+The programs generated by Lex handle
+character I/O only through the routines
+.I
+input,
+output,
+.R
+and
+.I
+unput.
+.R
+Thus the character representation
+provided in these routines
+is accepted by Lex and employed to return
+values in
+.I
+yytext.
+.R
+For internal use
+a character is represented as a small integer
+which, if the standard library is used,
+has a value equal to the integer value of the bit
+pattern representing the character on the host computer.
+Normally, the letter
+.I a
+is represented as the same form as the character constant
+.I \(fma\(fm .
+If this interpretation is changed, by providing I/O
+routines which translate the characters,
+Lex must be told about
+it, by giving a translation table.
+This table must be in the definitions section,
+and must be bracketed by lines containing only
+``%T''.
+The table contains lines of the form
+.TS
+center;
+l.
+{integer} {character string}
+.TE
+which indicate the value associated with each character.
+Thus the next example
+.TS
+center;
+l l.
+%T
+ 1 Aa
+ 2 Bb
+\&...
+26 Zz
+27 \en
+28 +
+29 \-
+30 0
+31 1
+\&...
+39 9
+%T
+.TE
+.sp
+.ce 1
+Sample character table.
+maps the lower and upper case letters together into the integers 1 through 26,
+newline into 27, + and \- into 28 and 29, and the
+digits into 30 through 39.
+Note the escape for newline.
+If a table is supplied, every character that is to appear either
+in the rules or in any valid input must be included
+in the table.
+No character
+may be assigned the number 0, and no character may be
+assigned a bigger number than the size of the hardware character set.
+.2C
+.NH
+Summary of Source Format.
+.PP
+The general form of a Lex source file is:
+.TS
+center;
+l.
+{definitions}
+%%
+{rules}
+%%
+{user subroutines}
+.TE
+The definitions section contains
+a combination of
+.IP 1)
+Definitions, in the form ``name space translation''.
+.IP 2)
+Included code, in the form ``space code''.
+.IP 3)
+Included code, in the form
+.TS
+center;
+l.
+%{
+code
+%}
+.TE
+.ns
+.IP 4)
+Start conditions, given in the form
+.TS
+center;
+l.
+%S name1 name2 ...
+.TE
+.ns
+.IP 5)
+Character set tables, in the form
+.TS
+center;
+l.
+%T
+number space character-string
+\&...
+%T
+.TE
+.ns
+.IP 6)
+Changes to internal array sizes, in the form
+.TS
+center;
+l.
+%\fIx\fR\0\0\fInnn\fR
+.TE
+where \fInnn\fR is a decimal integer representing an array size
+and \fIx\fR selects the parameter as follows:
+.TS
+center;
+c c
+c l.
+Letter Parameter
+p positions
+n states
+e tree nodes
+a transitions
+k packed character classes
+o output array size
+.TE
+.LP
+Lines in the rules section have the form ``expression action''
+where the action may be continued on succeeding
+lines by using braces to delimit it.
+.PP
+Regular expressions in Lex use the following
+operators:
+.br
+.TS
+center;
+l l.
+x the character "x"
+"x" an "x", even if x is an operator.
+\ex an "x", even if x is an operator.
+[xy] the character x or y.
+[x\-z] the characters x, y or z.
+[^x] any character but x.
+\&. any character but newline.
+^x an x at the beginning of a line.
+<y>x an x when Lex is in start condition y.
+x$ an x at the end of a line.
+x? an optional x.
+x\(** 0,1,2, ... instances of x.
+x+ 1,2,3, ... instances of x.
+x|y an x or a y.
+(x) an x.
+x/y an x but only if followed by y.
+{xx} the translation of xx from the
+ definitions section.
+x{m,n} \fIm\fR through \fIn\fR occurrences of x
+.TE
+.NH
+Caveats and Bugs.
+.PP
+There are pathological expressions which
+produce exponential growth of the tables when
+converted to deterministic machines;
+fortunately, they are rare.
+.PP
+REJECT does not rescan the input; instead it remembers the results of the previous
+scan. This means that if a rule with trailing context is found, and
+REJECT executed, the user
+must not have used
+.ul
+unput
+to change the characters forthcoming
+from the input stream.
+This is the only restriction on the user's ability to manipulate
+the not-yet-processed input.
+.PP
+.2C
+.NH
+Acknowledgments.
+.PP
+As should
+be obvious from the above, the outside of Lex
+is patterned
+on Yacc and the inside on Aho's string matching routines.
+Therefore, both S. C. Johnson and A. V. Aho
+are really originators
+of much of Lex,
+as well as debuggers of it.
+Many thanks are due to both.
+.PP
+The code of the current version of Lex was designed, written,
+and debugged by Eric Schmidt.
+.SG MH-1274-MEL-unix
+.sp 1
+.2C
+.NH
+References.
+.SP 1v
+.IP 1.
+B. W. Kernighan and D. M. Ritchie,
+.I
+The C Programming Language,
+.R
+Prentice-Hall, N. J. (1978).
+.IP 2.
+B. W. Kernighan,
+.I
+Ratfor: A Preprocessor for a Rational Fortran,
+.R
+Software \- Practice and Experience,
+\fB5\fR, pp. 395-496 (1975).
+.IP 3.
+S. C. Johnson,
+.I
+Yacc: Yet Another Compiler Compiler,
+.R
+Computing Science Technical Report No. 32,
+1975,
+.MH
+.if \n(tm (also TM 75-1273-6)
+.IP 4.
+A. V. Aho and M. J. Corasick,
+.I
+Efficient String Matching: An Aid to Bibliographic Search,
+.R
+Comm. ACM
+.B
+18,
+.R
+333-340 (1975).
+.IP 5.
+B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
+.I
+QED Text Editor,
+.R
+Computing Science Technical Report No. 5,
+1972,
+.MH
+.IP 6.
+D. M. Ritchie,
+private communication.
+See also
+M. E. Lesk,
+.I
+The Portable C Library,
+.R
+Computing Science Technical Report No. 31,
+.MH
+.if \n(tm (also TM 75-1274-11)