diff options
author | Michael Shalayeff <mickey@cvs.openbsd.org> | 2002-12-03 21:44:00 +0000 |
---|---|---|
committer | Michael Shalayeff <mickey@cvs.openbsd.org> | 2002-12-03 21:44:00 +0000 |
commit | 5dc6f996d1d715bac7c9d5a59f939d29eacc3a14 (patch) | |
tree | da347c07f4cfc5737063c99f7c28ebedd054105e /usr.bin/lex/PSD.doc/lex.ms | |
parent | 7c40f51a9a07812997cc937835ea8a45aab0d3e1 (diff) |
caldera-licensed docs, now that they are free. need more work, thus not installed yet
Diffstat (limited to 'usr.bin/lex/PSD.doc/lex.ms')
-rw-r--r-- | usr.bin/lex/PSD.doc/lex.ms | 2335 |
1 files changed, 2335 insertions, 0 deletions
diff --git a/usr.bin/lex/PSD.doc/lex.ms b/usr.bin/lex/PSD.doc/lex.ms new file mode 100644 index 00000000000..e1b4b0085a4 --- /dev/null +++ b/usr.bin/lex/PSD.doc/lex.ms @@ -0,0 +1,2335 @@ +.\" $OpenBSD: lex.ms,v 1.1 2002/12/03 21:43:59 mickey Exp $ +.\" +.\" Copyright (C) Caldera International Inc. 2001-2002. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code and documentation must retain the above +.\" copyright notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed or owned by Caldera +.\" International, Inc. +.\" 4. Neither the name of Caldera International, Inc. nor the names of other +.\" contributors may be used to endorse or promote products derived from +.\" this software without specific prior written permission. +.\" +.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA +.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR +.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES +.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. +.\" IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE FOR ANY DIRECT, +.\" INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +.\" (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +.\" SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, +.\" STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING +.\" IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +.\" POSSIBILITY OF SUCH DAMAGE. +.\" +.\" @(#)lex.ms 8.2 (Berkeley) 5/24/94 +.\" +.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator' +.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%' +.hc ~ +.bd I 2 +.de TS +.br +.nf +.SP 1v +.ul 0 +.. +.de TE +.SP 1v +.fi +.. +.\".de PT +.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0'' +.\".if \\n%>1 'sp +.\".. +.ND July 21, 1975 +.\".RP +.\".TM 75-1274-15 39199 39199-11 +.TL +Lex \- A Lexical Analyzer ~Generator~ +.AU ``MH 2C-569'' 6377 +M. E. Lesk and E. Schmidt +.AI +.MH +.AB +.sp +.bd I 2 +.\".nr PS 8 +.\".nr VS 9 +.\".ps 8 +.\".vs 9p +Lex helps write programs whose control flow +is directed by instances of regular +expressions in the input stream. +It is well suited for editor-script type transformations and +for segmenting input in preparation for +a parsing routine. +.PP +Lex source is a table of regular expressions and corresponding program fragments. +The table is translated to a program +which reads an input stream, copying it to an output stream +and partitioning the input +into strings which match the given expressions. +As each such string is recognized the corresponding +program fragment is executed. +The recognition of the expressions +is performed by a deterministic finite automaton +generated by Lex. +The program fragments written by the user are executed in the order in which the +corresponding regular expressions occur in the input stream. +.if n .if \n(tm .ig +.PP +The lexical analysis +programs written with Lex accept ambiguous specifications +and choose the longest +match possible at each input point. +If necessary, substantial look~ahead +is performed on the input, but the +input stream will be backed up to the +end of the current partition, so that the user +has general freedom to manipulate it. +.PP +Lex can generate analyzers in either C or Ratfor, a language +which can be translated automatically to portable Fortran. +It is available on the PDP-11 UNIX, Honeywell GCOS, +and IBM OS systems. +This manual, however, will only discuss generating analyzers +in C on the UNIX system, which is the only supported +form of Lex under UNIX Version 7. +Lex is designed to simplify +interfacing with Yacc, for those +with access to this compiler-compiler system. +.. +.\".nr PS 9 +.\".nr VS 11 +.AE +.2C +.NH +Introduction. +.PP +Lex is a program generator designed for +lexical processing of character input streams. +It accepts a high-level, problem oriented specification +for character string matching, +and +produces a program in a general purpose language which recognizes +regular expressions. +The regular expressions are specified by the user in the +source specifications given to Lex. +The Lex written code recognizes these expressions +in an input stream and partitions the input stream into +strings matching the expressions. At the bound~aries +between strings +program sections +provided by the user are executed. +The Lex source file associates the regular expressions and the +program fragments. +As each expression appears in the input to the program written by Lex, +the corresponding fragment is executed. +.PP +.de MH +Bell Laboratories, Murray Hill, NJ 07974. +.. +The user supplies the additional code +beyond expression matching +needed to complete his tasks, possibly +including code written by other generators. +The program that recognizes the expressions is generated in the +general purpose programming language employed for the +user's program fragments. +Thus, a high level expression +language is provided to write the string expressions to be +matched while the user's freedom to write actions +is unimpaired. +This avoids forcing the user who wishes to use a string manipulation +language for input analysis to write processing programs in the same +and often inappropriate string handling language. +.PP +Lex is not a complete language, but rather a generator representing +a new language feature which can be added to +different programming languages, called ``host languages.'' +Just as general purpose languages +can produce code to run on different computer hardware, +Lex can write code in different host languages. +The host language is used for the output code generated by Lex +and also for the program fragments added by the user. +Compatible run-time libraries for the different host languages +are also provided. +This makes Lex adaptable to different environments and +different users. +Each application +may be directed to the combination of hardware and host language appropriate +to the task, the user's background, and the properties of local +implementations. +At present, the only supported host language is C, +although Fortran (in the form of Ratfor [2] has been available +in the past. +Lex itself exists on UNIX, GCOS, and OS/370; but the +code generated by Lex may be taken anywhere the appropriate +compilers exist. +.PP +Lex turns the user's expressions and actions +(called +.ul +source +in this memo) into the host general-purpose language; +the generated program is named +.ul +yylex. +The +.ul +yylex +program +will recognize expressions +in a stream +(called +.ul +input +in this memo) +and perform the specified actions for each expression as it is detected. +See Figure 1. +.TS +center; +l _ r +l|c|r +l _ r +l _ r +l|c|r +l _ r +c s s +c s s. + +Source \(-> Lex \(-> yylex + +.sp 2 + +Input \(-> yylex \(-> Output + +.sp +An overview of Lex +Figure 1 +.TE +.PP +For a trivial example, consider a program to delete +from the input +all blanks or tabs at the ends of lines. +.TS +center; +l l. +%% +[ \et]+$ ; +.TE +is all that is required. +The program +contains a %% delimiter to mark the beginning of the rules, and +one rule. +This rule contains a regular expression +which matches one or more +instances of the characters blank or tab +(written \et for visibility, in accordance with the C language convention) +just prior to the end of a line. +The brackets indicate the character +class made of blank and tab; the + indicates ``one or more ...''; +and the $ indicates ``end of line,'' as in QED. +No action is specified, +so the program generated by Lex (yylex) will ignore these characters. +Everything else will be copied. +To change any remaining +string of blanks or tabs to a single blank, +add another rule: +.TS +center; +l l. +%% +[ \et]+$ ; +[ \et]+ printf(" "); +.TE +The finite automaton generated for this +source will scan for both rules at once, +observing at +the termination of the string of blanks or tabs +whether or not there is a newline character, and executing +the desired rule action. +The first rule matches all strings of blanks or tabs +at the end of lines, and the second +rule all remaining strings of blanks or tabs. +.PP +Lex can be used alone for simple transformations, or +for analysis and statistics gathering on a lexical level. +Lex can also be used with a parser generator +to perform the lexical analysis phase; it is particularly +easy to interface Lex and Yacc [3]. +Lex programs recognize only regular expressions; +Yacc writes parsers that accept a large class of context free grammars, +but require a lower level analyzer to recognize input tokens. +Thus, a combination of Lex and Yacc is often appropriate. +When used as a preprocessor for a later parser generator, +Lex is used to partition the input stream, +and the parser generator assigns structure to +the resulting pieces. +The flow of control +in such a case (which might be the first half of a compiler, +for example) is shown in Figure 2. +Additional programs, +written by other generators +or by hand, can +be added easily to programs written by Lex. +.BS 2 +.ps 9 +.vs 11 +.TS +center; +l c c c l +l c c c l +l c c c l +l _ c _ l +l|c|c|c|l +l _ c _ l +l c c c l +l _ c _ l +l|c|c|c|l +l _ c _ l +l c s s l +l c s s l. + lexical grammar + rules rules + \(da \(da + + Lex Yacc + + \(da \(da + +Input \(-> yylex \(-> yyparse \(-> Parsed input + +.sp + Lex with Yacc + Figure 2 +.TE +.ps 10 +.vs 12 +.BE +Yacc users +will realize that the name +.ul +yylex +is what Yacc expects its lexical analyzer to be named, +so that the use of this name by Lex simplifies +interfacing. +.PP +Lex generates a deterministic finite automaton from the regular expressions +in the source [4]. +The automaton is interpreted, rather than compiled, in order +to save space. +The result is still a fast analyzer. +In particular, the time taken by a Lex program +to recognize and partition an input stream is +proportional to the length of the input. +The number of Lex rules or +the complexity of the rules is +not important in determining speed, +unless rules which include +forward context require a significant amount of re~scanning. +What does increase with the number and complexity of rules +is the size of the finite +automaton, and therefore the size of the program +generated by Lex. +.PP +In the program written by Lex, the user's fragments +(representing the +.ul +actions +to be performed as each regular expression +is found) +are gathered +as cases of a switch. +The automaton interpreter directs the control flow. +Opportunity is provided for the user to insert either +declarations or additional statements in the routine containing +the actions, or to +add subroutines outside this action routine. +.PP +Lex is not limited to source which can +be interpreted on the basis of one character +look~ahead. +For example, +if there are two rules, one looking for +.I ab +and another for +.I abcdefg , +and the input stream is +.I abcdefh , +Lex will recognize +.I ab +and leave +the input pointer just before +.I "cd. . ." +Such backup is more costly +than the processing of simpler languages. +.2C +.NH +Lex Source. +.PP +The general format of Lex source is: +.TS +center; +l. +{definitions} +%% +{rules} +%% +{user subroutines} +.TE +where the definitions and the user subroutines +are often omitted. +The second +.I %% +is optional, but the first is required +to mark the beginning of the rules. +The absolute minimum Lex program is thus +.TS +center; +l. +%% +.TE +(no definitions, no rules) which translates into a program +which copies the input to the output unchanged. +.PP +In the outline of Lex programs shown above, the +.I +rules +.R +represent the user's control +decisions; they are a table, in which the left column +contains +.I +regular expressions +.R +(see section 3) +and the right column contains +.I +actions, +.R +program fragments to be executed when the expressions +are recognized. +Thus an individual rule might appear +.TS +center; +l l. +integer printf("found keyword INT"); +.TE +to look for the string +.I integer +in the input stream and +print the message ``found keyword INT'' whenever it appears. +In this example the host procedural language is C and +the C library function +.I +printf +.R +is used to print the string. +The end +of the expression is indicated by the first blank or tab character. +If the action is merely a single C expression, +it can just be given on the right side of the line; if it is +compound, or takes more than a line, it should be enclosed in +braces. +As a slightly more useful example, suppose it is desired to +change a number of words from British to American spelling. +Lex rules such as +.TS +center; +l l. +colour printf("color"); +mechanise printf("mechanize"); +petrol printf("gas"); +.TE +would be a start. These rules are not quite enough, +since +the word +.I petroleum +would become +.I gaseum ; +a way of dealing +with this will be described later. +.2C +.NH +Lex Regular Expressions. +.PP +The definitions of regular expressions are very similar to those +in QED [5]. +A regular +expression specifies a set of strings to be matched. +It contains text characters (which match the corresponding +characters in the strings being compared) +and operator characters (which specify +repetitions, choices, and other features). +The letters of the alphabet and the digits are +always text characters; thus the regular expression +.TS +center; +l l. +integer +.TE +matches the string +.ul +integer +wherever it appears +and the expression +.TS +center; +l. +a57D +.TE +looks for the string +.ul +a57D. +.PP +.I +Operators. +.R +The operator characters are +.TS +center; +l. +" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < > +.TE +and if they are to be used as text characters, an escape +should be used. +The quotation mark operator (") +indicates that whatever is contained between a pair of quotes +is to be taken as text characters. +Thus +.TS +center; +l. +xyz"++" +.TE +matches the string +.I xyz++ +when it appears. Note that a part of a string may be quoted. +It is harmless but unnecessary to quote an ordinary +text character; the expression +.TS +center; +l. +"xyz++" +.TE +is the same as the one above. +Thus by quoting every non-alphanumeric character +being used as a text character, the user can avoid remembering +the list above of current +operator characters, and is safe should further extensions to Lex +lengthen the list. +.PP +An operator character may also be turned into a text character +by preceding it with \e as in +.TS +center; +l. +xyz\e+\e+ +.TE +which +is another, less readable, equivalent of the above expressions. +Another use of the quoting mechanism is to get a blank into +an expression; normally, as explained above, blanks or tabs end +a rule. +Any blank character not contained within [\|] (see below) must +be quoted. +Several normal C escapes with \e +are recognized: \en is newline, \et is tab, and \eb is backspace. +To enter \e itself, use \e\e. +Since newline is illegal in an expression, \en must be used; +it is not +required to escape tab and backspace. +Every character but blank, tab, newline and the list above is always +a text character. +.PP +.I +Character classes. +.R +Classes of characters can be specified using the operator pair [\|]. +The construction +.I [abc] +matches a +single character, which may be +.I a , +.I b , +or +.I c . +Within square brackets, +most operator meanings are ignored. +Only three characters are special: +these are \e \(mi and ^. The \(mi character +indicates ranges. For example, +.TS +center; +l. +[a\(miz0\(mi9<>_] +.TE +indicates the character class containing all the lower case letters, +the digits, +the angle brackets, and underline. +Ranges may be given in either order. +Using \(mi between any pair of characters which are +not both upper case letters, both lower case letters, or both digits +is implementation dependent and will get a warning message. +(E.g., [0\-z] in ASCII is many more characters +than it is in EBCDIC). +If it is desired to include the +character \(mi in a character class, it should be first or +last; thus +.TS +center; +l. +[\(mi+0\(mi9] +.TE +matches all the digits and the two signs. +.PP +In character classes, +the ^ operator must appear as the first character +after the left bracket; it indicates that the resulting string +is to be complemented with respect to the computer character set. +Thus +.TS +center; +l. +[^abc] +.TE +matches all characters except a, b, or c, including +all special or control characters; or +.TS +center; +l. +[^a\-zA\-Z] +.TE +is any character which is not a letter. +The \e character provides the usual escapes within +character class brackets. +.PP +.I +Arbitrary character. +.R +To match almost any character, the operator character +.TS +center; +l. +\&. +.TE +is the class of all characters except newline. +Escaping into octal is possible although non-portable: +.TS +center; +l. +[\e40\-\e176] +.TE +matches all printable characters in the ASCII character set, from octal +40 (blank) to octal 176 (tilde). +.PP +.I +Optional expressions. +.R +The operator +.I ? +indicates +an optional element of an expression. +Thus +.TS +center; +l. +ab?c +.TE +matches either +.I ac +or +.I abc . +.PP +.I +Repeated expressions. +.R +Repetitions of classes are indicated by the operators +.I \(** +and +.I + . +.TS +center; +l. +\f2a\(**\f1 +.TE +is any number of consecutive +.I a +characters, including zero; while +.TS +center; +l. +a+ +.TE +is one or more instances of +.I a. +For example, +.TS +center; +l. +[a\-z]+ +.TE +is all strings of lower case letters. +And +.TS +center; +l. +[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(** +.TE +indicates all alphanumeric strings with a leading +alphabetic character. +This is a typical expression for recognizing identifiers in +computer languages. +.PP +.I +Alternation and Grouping. +.R +The operator | +indicates alternation: +.TS +center; +l. +(ab\||\|cd) +.TE +matches either +.ul +ab +or +.ul +cd. +Note that parentheses are used for grouping, although +they are +not necessary on the outside level; +.TS +center; +l. +ab\||\|cd +.TE +would have sufficed. +Parentheses +can be used for more complex expressions: +.TS +center; +l. +(ab\||\|cd+)?(ef)\(** +.TE +matches such strings as +.I abefef , +.I efefef , +.I cdef , +or +.I cddd\| ; +but not +.I abc , +.I abcd , +or +.I abcdef . +.PP +.I +Context sensitivity. +.R +Lex will recognize a small amount of surrounding +context. The two simplest operators for this are +.I ^ +and +.I $ . +If the first character of an expression is +.I ^ , +the expression will only be matched at the beginning +of a line (after a newline character, or at the beginning of +the input stream). +This can never conflict with the other meaning of +.I ^ , +complementation +of character classes, since that only applies within +the [\|] operators. +If the very last character is +.I $ , +the expression will only be matched at the end of a line (when +immediately followed by newline). +The latter operator is a special case of the +.I / +operator character, +which indicates trailing context. +The expression +.TS +center; +l. +ab/cd +.TE +matches the string +.I ab , +but only if followed by +.ul +cd. +Thus +.TS +center; +l. +ab$ +.TE +is the same as +.TS +center; +l. +ab/\en +.TE +Left context is handled in Lex by +.I +start conditions +.R +as explained in section 10. If a rule is only to be executed +when the Lex automaton interpreter is in start condition +.I +x, +.R +the rule should be prefixed by +.TS +center; +l. +<x> +.TE +using the angle bracket operator characters. +If we considered ``being at the beginning of a line'' to be +start condition +.I ONE , +then the ^ operator +would be equivalent to +.TS +center; +l. +<ONE> +.TE +Start conditions are explained more fully later. +.PP +.I +Repetitions and Definitions. +.R +The operators {} specify +either repetitions (if they enclose numbers) +or +definition expansion (if they enclose a name). For example +.TS +center; +l. +{digit} +.TE +looks for a predefined string named +.I digit +and inserts it +at that point in the expression. +The definitions are given in the first part of the Lex +input, before the rules. +In contrast, +.TS +center; +l. +a{1,5} +.TE +looks for 1 to 5 occurrences of +.I a . +.PP +Finally, initial +.I % +is special, being the separator +for Lex source segments. +.2C +.NH +Lex Actions. +.PP +When an expression written as above is matched, Lex +executes the corresponding action. This section describes +some features of Lex which aid in writing actions. Note +that there is a default action, which +consists of copying the input to the output. This +is performed on all strings not otherwise matched. Thus +the Lex user who wishes to absorb the entire input, without +producing any output, must provide rules to match everything. +When Lex is being used with Yacc, this is the normal +situation. +One may consider that actions are what is done instead of +copying the input to the output; thus, in general, +a rule which merely copies can be omitted. +Also, a character combination +which is omitted from the rules +and which appears as input +is likely to be printed on the output, thus calling +attention to the gap in the rules. +.PP +One of the simplest things that can be done is to ignore +the input. Specifying a C null statement, \fI;\fR as an action +causes this result. A frequent rule is +.TS +center; +l l. +[ \et\en] ; +.TE +which causes the three spacing characters (blank, tab, and newline) +to be ignored. +.PP +Another easy way to avoid writing actions is the action character +|, which indicates that the action for this rule is the action +for the next rule. +The previous example could also have been written +.TS +center; +l l. +" " | +"\et" | +"\en" ; +.TE +with the same result, although in different style. +The quotes around \en and \et are not required. +.PP +In more complex actions, the user +will +often want to know the actual text that matched some expression +like +.I [a\(miz]+ . +Lex leaves this text in an external character +array named +.I +yytext. +.R +Thus, to print the name found, +a rule like +.TS +center; +l l. +[a\-z]+ printf("%s", yytext); +.TE +will print +the string in +.I +yytext. +.R +The C function +.I +printf +.R +accepts a format argument and data to be printed; +in this case, the format is ``print string'' (% indicating +data conversion, and +.I s +indicating string type), +and the data are the characters +in +.I +yytext. +.R +So this just places +the matched string +on the output. +This action +is so common that +it may be written as ECHO: +.TS +center; +l l. +[a\-z]+ ECHO; +.TE +is the same as the above. +Since the default action is just to +print the characters found, one might ask why +give a rule, like this one, which merely specifies +the default action? +Such rules are often required +to avoid matching some other rule +which is not desired. For example, if there is a rule +which matches +.I read +it will normally match the instances of +.I read +contained in +.I bread +or +.I readjust ; +to avoid +this, +a rule +of the form +.I [a\(miz]+ +is needed. +This is explained further below. +.PP +Sometimes it is more convenient to know the end of what +has been found; hence Lex also provides a count +.I +yyleng +.R +of the number of characters matched. +To count both the number +of words and the number of characters in words in the input, the user might write +.TS +center; +l l. +[a\-zA\-Z]+ {words++; chars += yyleng;} +.TE +which accumulates in +.ul +chars +the number +of characters in the words recognized. +The last character in the string matched can +be accessed by +.TS +center; +l. +yytext[yyleng\-1] +.TE +.PP +Occasionally, a Lex +action may decide that a rule has not recognized the correct +span of characters. +Two routines are provided to aid with this situation. +First, +.I +yymore() +.R +can be called to indicate that the next input expression recognized is to be +tacked on to the end of this input. Normally, +the next input string would overwrite the current +entry in +.I +yytext. +.R +Second, +.I +yyless (n) +.R +may be called to indicate that not all the characters matched +by the currently successful expression are wanted right now. +The argument +.I +n +.R +indicates the number of characters +in +.I +yytext +.R +to be retained. +Further characters previously matched +are +returned to the input. This provides the same sort of +look~ahead offered by the / operator, +but in a different form. +.PP +.I +Example: +.R +Consider a language which defines +a string as a set of characters between quotation (") marks, and provides that +to include a " in a string it must be preceded by a \e. The +regular expression which matches that is somewhat confusing, +so that it might be preferable to write +.TS +center; +l l. +\e"[^"]\(** { + if (yytext[yyleng\-1] == \(fm\e\e\(fm) + yymore(); + else + ... normal user processing + } +.TE +which will, when faced with a string such as +.I +"abc\e"def\|" +.R +first match +the five characters +\fI"abc\e\|\fR; +then +the call to +.I yymore() +will +cause the next part of the string, +\fI"def\|\fR, +to be tacked on the end. +Note that the final quote terminating the string should be picked +up in the code labeled ``normal processing''. +.PP +The function +.I +yyless() +.R +might be used to reprocess +text in various circumstances. Consider the C problem of distinguishing +the ambiguity of ``=\(mia''. +Suppose it is desired to treat this as ``=\(mi a'' +but print a message. A rule might be +.ps 9 +.vs 11 +.TS +center; +l l. +=\(mi[a\-zA\-Z] { + printf("Op (=\(mi) ambiguous\en"); + yyless(yyleng\-1); + ... action for =\(mi ... + } +.TE +.ps 10 +.vs 12 +which prints a message, returns the letter after the +operator to the input stream, and treats the operator as ``=\(mi''. +Alternatively it might be desired to treat this as ``= \(mia''. +To do this, just return the minus +sign as well as the letter to the input: +.ps 9 +.vs 11 +.TS +center; +l l. +=\(mi[a\-zA\-Z] { + printf("Op (=\(mi) ambiguous\en"); + yyless(yyleng\-2); + ... action for = ... + } +.TE +.ps 10 +.vs 12 +will perform the other interpretation. +Note that the expressions for the two cases might more easily +be written +.TS +center; +l l. +=\(mi/[A\-Za\-z] +.TE +in the first case and +.TS +center; +l. +=/\-[A\-Za\-z] +.TE +in the second; +no backup would be required in the rule action. +It is not necessary to recognize the whole identifier +to observe the ambiguity. +The +possibility of ``=\(mi3'', however, makes +.TS +center; +l. +=\(mi/[^ \et\en] +.TE +a still better rule. +.PP +In addition to these routines, Lex also permits +access to the I/O routines +it uses. +They are: +.IP 1) +.I +input() +.R +which returns the next input character; +.IP 2) +.I +output(c) +.R +which writes the character +.I +c +.R +on the output; and +.IP 3) +.I +unput(c) +.R +pushes the character +.I +c +.R +back onto the input stream to be read later by +.I +input(). +.R +.LP +By default these routines are provided as macro definitions, +but the user can override them and supply private versions. +These routines +define the relationship between external files and +internal characters, and must all be retained +or modified consistently. +They may be redefined, to +cause input or output to be transmitted to or from strange +places, including other programs or internal memory; +but the character set used must be consistent in all routines; +a value of zero returned by +.I +input +.R +must mean end of file; and +the relationship between +.I +unput +.R +and +.I +input +.R +must be retained +or the Lex look~ahead will not work. +Lex does not look ahead at all if it does not have to, +but every rule ending in +.ft I ++ \(** ? +.ft R +or +.ft I +$ +.ft R +or containing +.ft I +/ +.ft R +implies look~ahead. +Look~ahead is also necessary to match an expression that is a prefix +of another expression. +See below for a discussion of the character set used by Lex. +The standard Lex library imposes +a 100 character limit on backup. +.PP +Another Lex library routine that the user will sometimes want +to redefine is +.I +yywrap() +.R +which is called whenever Lex reaches an end-of-file. +If +.I +yywrap +.R +returns a 1, Lex continues with the normal wrapup on end of input. +Sometimes, however, it is convenient to arrange for more +input to arrive +from a new source. +In this case, the user should provide +a +.I +yywrap +.R +which +arranges for new input and +returns 0. This instructs Lex to continue processing. +The default +.I +yywrap +.R +always returns 1. +.PP +This routine is also a convenient place +to print tables, summaries, etc. at the end +of a program. Note that it is not +possible to write a normal rule which recognizes +end-of-file; the only access to this condition is +through +.I +yywrap. +.R +In fact, unless a private version of +.I +input() +.R +is supplied +a file containing nulls +cannot be handled, +since a value of 0 returned by +.I +input +.R +is taken to be end-of-file. +.PP +.2C +.NH +Ambiguous Source Rules. +.PP +Lex can handle ambiguous specifications. +When more than one expression can match the +current input, Lex chooses as follows: +.IP 1) +The longest match is preferred. +.IP 2) +Among rules which matched the same number of characters, +the rule given first is preferred. +.LP +Thus, suppose the rules +.TS +center; +l l. +integer keyword action ...; +[a\-z]+ identifier action ...; +.TE +to be given in that order. If the input is +.I integers , +it is taken as an identifier, because +.I [a\-z]+ +matches 8 characters while +.I integer +matches only 7. +If the input is +.I integer , +both rules match 7 characters, and +the keyword rule is selected because it was given first. +Anything shorter (e.g. \fIint\fR\|) will +not match the expression +.I integer +and so the identifier interpretation is used. +.PP +The principle of preferring the longest +match makes rules containing +expressions like +.I \&.\(** +dangerous. +For example, +.TS +center; +l. +\&\(fm.\(**\(fm +.TE +might seem a good way of recognizing +a string in single quotes. +But it is an invitation for the program to read far +ahead, looking for a distant +single quote. +Presented with the input +.TS +center; +l l. +\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here +.TE +the above expression will match +.TS +center; +l l. +\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm +.TE +which is probably not what was wanted. +A better rule is of the form +.TS +center; +l. +\&\(fm[^\(fm\en]\(**\(fm +.TE +which, on the above input, will stop +after +.I \(fmfirst\(fm . +The consequences +of errors like this are mitigated by the fact +that the +.I \&. +operator will not match newline. +Thus expressions like +.I \&.\(** +stop on the +current line. +Don't try to defeat this with expressions like +.I (.|\en)+ +or +equivalents; +the Lex generated program will try to read +the entire input file, causing +internal buffer overflows. +.PP +Note that Lex is normally partitioning +the input stream, not searching for all possible matches +of each expression. +This means that each character is accounted for +once and only once. +For example, suppose it is desired to +count occurrences of both \fIshe\fR and \fIhe\fR in an input text. +Some Lex rules to do this might be +.TS +center; +l l. +she s++; +he h++; +\en | +\&. ; +.TE +where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR. +Remember that . does not include newline. +Since \fIshe\fR includes \fIhe\fR, Lex will normally +.I +not +.R +recognize +the instances of \fIhe\fR included in \fIshe\fR, +since once it has passed a \fIshe\fR those characters are gone. +.PP +Sometimes the user would like to override this choice. The action +REJECT +means ``go do the next alternative.'' +It causes whatever rule was second choice after the current +rule to be executed. +The position of the input pointer is adjusted accordingly. +Suppose the user really wants to count the included instances of \fIhe\fR: +.TS +center; +l l. +she {s++; REJECT;} +he {h++; REJECT;} +\en | +\&. ; +.TE +these rules are one way of changing the previous example +to do just that. +After counting each expression, it is rejected; whenever appropriate, +the other expression will then be counted. In this example, of course, +the user could note that \fIshe\fR includes \fIhe\fR but not +vice versa, and omit the REJECT action on \fIhe\fR; +in other cases, however, it +would not be possible a priori to tell +which input characters +were in both classes. +.PP +Consider the two rules +.TS +center; +l l. +a[bc]+ { ... ; REJECT;} +a[cd]+ { ... ; REJECT;} +.TE +If the input is +.I ab , +only the first rule matches, +and on +.I ad +only the second matches. +The input string +.I accb +matches the first rule for four characters +and then the second rule for three characters. +In contrast, the input +.I accd +agrees with +the second rule for four characters and then the first +rule for three. +.PP +In general, REJECT is useful whenever +the purpose of Lex is not to partition the input +stream but to detect all examples of some items +in the input, and the instances of these items +may overlap or include each other. +Suppose a digram table of the input is desired; +normally the digrams overlap, that is the word +.I the +is considered to contain +both +.I th +and +.I he . +Assuming a two-dimensional array named +.ul +digram +to be incremented, the appropriate +source is +.TS +center; +l l. +%% +[a\-z][a\-z] { + digram[yytext[0]][yytext[1]]++; + REJECT; + } +\. ; +\en ; +.TE +where the REJECT is necessary to pick up +a letter pair beginning at every character, rather than at every +other character. +.2C +.NH +Lex Source Definitions. +.PP +Remember the format of the Lex +source: +.TS +center; +l. +{definitions} +%% +{rules} +%% +{user routines} +.TE +So far only the rules have been described. The user needs +additional options, +though, to define variables for use in his program and for use +by Lex. +These can go either in the definitions section +or in the rules section. +.PP +Remember that Lex is turning the rules into a program. +Any source not intercepted by Lex is copied +into the generated program. There are three classes +of such things. +.IP 1) +Any line which is not part of a Lex rule or action +which begins with a blank or tab is copied into +the Lex generated program. +Such source input prior to the first %% delimiter will be external +to any function in the code; if it appears immediately after the first +%%, +it appears in an appropriate place for declarations +in the function written by Lex which contains the actions. +This material must look like program fragments, +and should precede the first Lex rule. +.IP +As a side effect of the above, lines which begin with a blank +or tab, and which contain a comment, +are passed through to the generated program. +This can be used to include comments in either the Lex source or +the generated code. The comments should follow the host +language convention. +.IP 2) +Anything included between lines containing +only +.I %{ +and +.I %} +is +copied out as above. The delimiters are discarded. +This format permits entering text like preprocessor statements that +must begin in column 1, +or copying lines that do not look like programs. +.IP 3) +Anything after the third %% delimiter, regardless of formats, etc., +is copied out after the Lex output. +.PP +Definitions intended for Lex are given +before the first %% delimiter. Any line in this section +not contained between %{ and %}, and begining +in column 1, is assumed to define Lex substitution strings. +The format of such lines is +.TS +center; +l l. +name translation +.TE +and it +causes the string given as a translation to +be associated with the name. +The name and translation +must be separated by at least one blank or tab, and the name must begin with a letter. +The translation can then be called out +by the {name} syntax in a rule. +Using {D} for the digits and {E} for an exponent field, +for example, might abbreviate rules to recognize numbers: +.TS +center; +l l. +D [0\-9] +E [DEde][\-+]?{D}+ +%% +{D}+ printf("integer"); +{D}+"."{D}\(**({E})? | +{D}\(**"."{D}+({E})? | +{D}+{E} printf("real"); +.TE +Note the first two rules for real numbers; +both require a decimal point and contain +an optional exponent field, +but the first requires at least one digit before the +decimal point and the second requires at least one +digit after the decimal point. +To correctly handle the problem +posed by a Fortran expression such as +.I 35.EQ.I , +which does not contain a real number, a context-sensitive +rule such as +.TS +center; +l l. +[0\-9]+/"."EQ printf("integer"); +.TE +could be used in addition to the normal rule for integers. +.PP +The definitions +section may also contain other commands, including the +selection of a host language, a character set table, +a list of start conditions, or adjustments to the default +size of arrays within Lex itself for larger source programs. +These possibilities +are discussed below under ``Summary of Source Format,'' +section 12. +.2C +.NH +Usage. +.PP +There are two steps in +compiling a Lex source program. +First, the Lex source must be turned into a generated program +in the host general purpose language. +Then this program must be compiled and loaded, usually with +a library of Lex subroutines. +The generated program +is on a file named +.I lex.yy.c . +The I/O library is defined in terms of the C standard +library [6]. +.PP +The C programs generated by Lex are slightly different +on OS/370, because the +OS compiler is less powerful than the UNIX or GCOS compilers, +and does less at compile time. +C programs generated on GCOS and UNIX are the same. +.PP +.I +UNIX. +.R +The library is accessed by the loader flag +.I \-ll . +So an appropriate +set of commands is +.KS +.in 5 +lex source +cc lex.yy.c \-ll +.in 0 +.KE +The resulting program is placed on the usual file +.I +a.out +.R +for later execution. +To use Lex with Yacc see below. +Although the default Lex I/O routines use the C standard library, +the Lex automata themselves do not do so; +if private versions of +.I +input, +output +.R +and +.I unput +are given, the library can be avoided. +.PP +.2C +.NH +Lex and Yacc. +.PP +If you want to use Lex with Yacc, note that what Lex writes is a program +named +.I +yylex(), +.R +the name required by Yacc for its analyzer. +Normally, the default main program on the Lex library +calls this routine, but if Yacc is loaded, and its main +program is used, Yacc will call +.I +yylex(). +.R +In this case each Lex rule should end with +.TS +center; +l. +return(token); +.TE +where the appropriate token value is returned. +An easy way to get access +to Yacc's names for tokens is to +compile the Lex output file as part of +the Yacc output file by placing the line +.TS +center; +l. +# include "lex.yy.c" +.TE +in the last section of Yacc input. +Supposing the grammar to be +named ``good'' and the lexical rules to be named ``better'' +the UNIX command sequence can just be: +.TS +center; +l. +yacc good +lex better +cc y.tab.c \-ly \-ll +.TE +The Yacc library (\-ly) should be loaded before the Lex library, +to obtain a main program which invokes the Yacc parser. +The generations of Lex and Yacc programs can be done in +either order. +.2C +.NH +Examples. +.PP +As a trivial problem, consider copying an input file while +adding 3 to every positive number divisible by 7. +Here is a suitable Lex source program +.TS +center; +l l. +%% + int k; +[0\-9]+ { + k = atoi(yytext); + if (k%7 == 0) + printf("%d", k+3); + else + printf("%d",k); + } +.TE +to do just that. +The rule [0\-9]+ recognizes strings of digits; +.I +atoi +.R +converts the digits to binary +and stores the result in +.ul +k. +The operator % (remainder) is used to check whether +.ul +k +is divisible by 7; if it is, +it is incremented by 3 as it is written out. +It may be objected that this program will alter such +input items as +.I 49.63 +or +.I X7 . +Furthermore, it increments the absolute value +of all negative numbers divisible by 7. +To avoid this, just add a few more rules after the active one, +as here: +.TS +center; +l l. +%% + int k; +\-?[0\-9]+ { + k = atoi(yytext); + printf("%d", + k%7 == 0 ? k+3 : k); + } +\-?[0\-9.]+ ECHO; +[A-Za-z][A-Za-z0-9]+ ECHO; +.TE +Numerical strings containing +a ``.'' or preceded by a letter will be picked up by +one of the last two rules, and not changed. +The +.I if\-else +has been replaced by +a C conditional expression to save space; +the form +.ul +a?b:c +means ``if +.I a +then +.I b +else +.I c ''. +.PP +For an example of statistics gathering, here +is a program which histograms the lengths +of words, where a word is defined as a string of letters. +.TS +center; +l l. + int lengs[100]; +%% +[a\-z]+ lengs[yyleng]++; +\&. | +\en ; +%% +.T& +l s. +yywrap() +{ +int i; +printf("Length No. words\en"); +for(i=0; i<100; i++) + if (lengs[i] > 0) + printf("%5d%10d\en",i,lengs[i]); +return(1); +} +.TE +This program +accumulates the histogram, while producing no output. At the end +of the input it prints the table. +The final statement +.I +return(1); +.R +indicates that Lex is to perform wrapup. If +.I +yywrap +.R +returns zero (false) +it implies that further input is available +and the program is +to continue reading and processing. +To provide a +.I +yywrap +.R +that never +returns true causes an infinite loop. +.PP +As a larger example, +here are some parts of a program written by N. L. Schryer +to convert double precision Fortran to single precision Fortran. +Because Fortran does not distinguish upper and lower case letters, +this routine begins by defining a set of classes including +both cases of each letter: +.TS +center; +l l. +a [aA] +b [bB] +c [cC] +\&... +z [zZ] +.TE +An additional class recognizes white space: +.TS +center; +l l. +W [ \et]\(** +.TE +The first rule changes +``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''. +.TS +center; +l. +{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} { + printf(yytext[0]==\(fmd\(fm? "real" : "REAL"); + } +.TE +Care is taken throughout this program to preserve the case +(upper or lower) +of the original program. +The conditional operator is used to +select the proper form of the keyword. +The next rule copies continuation card indications to +avoid confusing them with constants: +.TS +center; +l l. +^" "[^ 0] ECHO; +.TE +In the regular expression, the quotes surround the +blanks. +It is interpreted as +``beginning of line, then five blanks, then +anything but blank or zero.'' +Note the two different meanings of +.I ^ . +There follow some rules to change double precision +constants to ordinary floating constants. +.TS +center; +l. +[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ | +[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ | +"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ { + /\(** convert constants \(**/ + for(p=yytext; \(**p != 0; p++) + { + if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm) + \(**p=+ \(fme\(fm\- \(fmd\(fm; + ECHO; + } +.TE +After the floating point constant is recognized, it is +scanned by the +.ul +for +loop +to find the letter +.I d +or +.I D . +The program than adds +.c +.I \(fme\(fm\-\(fmd\(fm , +which converts +it to the next letter of the alphabet. +The modified constant, now single-precision, +is written out again. +There follow a series of names which must be respelled to remove +their initial \fId\fR. +By using the +array +.I +yytext +.R +the same action suffices for all the names (only a sample of +a rather long list is given here). +.TS +center; +l l. +{d}{s}{i}{n} | +{d}{c}{o}{s} | +{d}{s}{q}{r}{t} | +{d}{a}{t}{a}{n} | +\&... +{d}{f}{l}{o}{a}{t} printf("%s",yytext+1); +.TE +Another list of names must have initial \fId\fR changed to initial \fIa\fR: +.TS +center; +l l. +{d}{l}{o}{g} | +{d}{l}{o}{g}10 | +{d}{m}{i}{n}1 | +{d}{m}{a}{x}1 { + yytext[0] =+ \(fma\(fm \- \(fmd\(fm; + ECHO; + } +.TE +And one routine +must have initial \fId\fR changed to initial \fIr\fR: +.TS +center; +l l. +{d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm; + ECHO; + } +.TE +To avoid such names as \fIdsinx\fR being detected as instances +of \fIdsin\fR, some final rules pick up longer words as identifiers +and copy some surviving characters: +.TS +center; +l l. +[A\-Za\-z][A\-Za\-z0\-9]\(** | +[0\-9]+ | +\en | +\&. ECHO; +.TE +Note that this program is not complete; it +does not deal with the spacing problems in Fortran or +with the use of keywords as identifiers. +.br +.2C +.NH +Left Context Sensitivity. +.PP +Sometimes +it is desirable to have several sets of lexical rules +to be applied at different times in the input. +For example, a compiler preprocessor might distinguish +preprocessor statements and analyze them differently +from ordinary statements. +This requires +sensitivity +to prior context, and there are several ways of handling +such problems. +The \fI^\fR operator, for example, is a prior context operator, +recognizing immediately preceding left context just as \fI$\fR recognizes +immediately following right context. +Adjacent left context could be extended, to produce a facility similar to +that for adjacent right context, but it is unlikely +to be as useful, since often the relevant left context +appeared some time earlier, such as at the beginning of a line. +.PP +This section describes three means of dealing +with different environments: a simple use of flags, +when only a few rules change from one environment to another, +the use of +.I +start conditions +.R +on rules, +and the possibility of making multiple lexical analyzers all run +together. +In each case, there are rules which recognize the need to change the +environment in which the +following input text is analyzed, and set some parameter +to reflect the change. This may be a flag explicitly tested by +the user's action code; such a flag is the simplest way of dealing +with the problem, since Lex is not involved at all. +It may be more convenient, +however, +to have Lex remember the flags as initial conditions on the rules. +Any rule may be associated with a start condition. It will only +be recognized when Lex is in +that start condition. +The current start condition may be changed at any time. +Finally, if the sets of rules for the different environments +are very dissimilar, +clarity may be best achieved by writing several distinct lexical +analyzers, and switching from one to another as desired. +.PP +Consider the following problem: copy the input to the output, +changing the word \fImagic\fR to \fIfirst\fR on every line which began +with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line +which began with the letter \fIb\fR, and changing +\fImagic\fR to \fIthird\fR on every line which began +with the letter \fIc\fR. All other words and all other lines +are left unchanged. +.PP +These rules are so simple that the easiest way +to do this job is with a flag: +.TS +center; +l l. + int flag; +%% +^a {flag = \(fma\(fm; ECHO;} +^b {flag = \(fmb\(fm; ECHO;} +^c {flag = \(fmc\(fm; ECHO;} +\en {flag = 0 ; ECHO;} +magic { + switch (flag) + { + case \(fma\(fm: printf("first"); break; + case \(fmb\(fm: printf("second"); break; + case \(fmc\(fm: printf("third"); break; + default: ECHO; break; + } + } +.TE +should be adequate. +.PP +To handle the same problem with start conditions, each +start condition must be introduced to Lex in the definitions section +with a line reading +.TS +center; +l l. +%Start name1 name2 ... +.TE +where the conditions may be named in any order. +The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR. +The conditions may be referenced at the +head of a rule with the <> brackets: +.TS +center; +l. +<name1>expression +.TE +is a rule which is only recognized when Lex is in the +start condition \fIname1\fR. +To enter a start condition, +execute the action statement +.TS +center; +l. +BEGIN name1; +.TE +which changes the start condition to \fIname1\fR. +To resume the normal state, +.TS +center; +l. +BEGIN 0; +.TE +resets the initial condition +of the Lex automaton interpreter. +A rule may be active in several +start conditions: +.TS +center; +l. +<name1,name2,name3> +.TE +is a legal prefix. Any rule not beginning with the +<> prefix operator is always active. +.PP +The same example as before can be written: +.TS +center; +l l. +%START AA BB CC +%% +^a {ECHO; BEGIN AA;} +^b {ECHO; BEGIN BB;} +^c {ECHO; BEGIN CC;} +\en {ECHO; BEGIN 0;} +<AA>magic printf("first"); +<BB>magic printf("second"); +<CC>magic printf("third"); +.TE +where the logic is exactly the same as in the previous +method of handling the problem, but Lex does the work +rather than the user's code. +.2C +.NH +Character Set. +.PP +The programs generated by Lex handle +character I/O only through the routines +.I +input, +output, +.R +and +.I +unput. +.R +Thus the character representation +provided in these routines +is accepted by Lex and employed to return +values in +.I +yytext. +.R +For internal use +a character is represented as a small integer +which, if the standard library is used, +has a value equal to the integer value of the bit +pattern representing the character on the host computer. +Normally, the letter +.I a +is represented as the same form as the character constant +.I \(fma\(fm . +If this interpretation is changed, by providing I/O +routines which translate the characters, +Lex must be told about +it, by giving a translation table. +This table must be in the definitions section, +and must be bracketed by lines containing only +``%T''. +The table contains lines of the form +.TS +center; +l. +{integer} {character string} +.TE +which indicate the value associated with each character. +Thus the next example +.TS +center; +l l. +%T + 1 Aa + 2 Bb +\&... +26 Zz +27 \en +28 + +29 \- +30 0 +31 1 +\&... +39 9 +%T +.TE +.sp +.ce 1 +Sample character table. +maps the lower and upper case letters together into the integers 1 through 26, +newline into 27, + and \- into 28 and 29, and the +digits into 30 through 39. +Note the escape for newline. +If a table is supplied, every character that is to appear either +in the rules or in any valid input must be included +in the table. +No character +may be assigned the number 0, and no character may be +assigned a bigger number than the size of the hardware character set. +.2C +.NH +Summary of Source Format. +.PP +The general form of a Lex source file is: +.TS +center; +l. +{definitions} +%% +{rules} +%% +{user subroutines} +.TE +The definitions section contains +a combination of +.IP 1) +Definitions, in the form ``name space translation''. +.IP 2) +Included code, in the form ``space code''. +.IP 3) +Included code, in the form +.TS +center; +l. +%{ +code +%} +.TE +.ns +.IP 4) +Start conditions, given in the form +.TS +center; +l. +%S name1 name2 ... +.TE +.ns +.IP 5) +Character set tables, in the form +.TS +center; +l. +%T +number space character-string +\&... +%T +.TE +.ns +.IP 6) +Changes to internal array sizes, in the form +.TS +center; +l. +%\fIx\fR\0\0\fInnn\fR +.TE +where \fInnn\fR is a decimal integer representing an array size +and \fIx\fR selects the parameter as follows: +.TS +center; +c c +c l. +Letter Parameter +p positions +n states +e tree nodes +a transitions +k packed character classes +o output array size +.TE +.LP +Lines in the rules section have the form ``expression action'' +where the action may be continued on succeeding +lines by using braces to delimit it. +.PP +Regular expressions in Lex use the following +operators: +.br +.TS +center; +l l. +x the character "x" +"x" an "x", even if x is an operator. +\ex an "x", even if x is an operator. +[xy] the character x or y. +[x\-z] the characters x, y or z. +[^x] any character but x. +\&. any character but newline. +^x an x at the beginning of a line. +<y>x an x when Lex is in start condition y. +x$ an x at the end of a line. +x? an optional x. +x\(** 0,1,2, ... instances of x. +x+ 1,2,3, ... instances of x. +x|y an x or a y. +(x) an x. +x/y an x but only if followed by y. +{xx} the translation of xx from the + definitions section. +x{m,n} \fIm\fR through \fIn\fR occurrences of x +.TE +.NH +Caveats and Bugs. +.PP +There are pathological expressions which +produce exponential growth of the tables when +converted to deterministic machines; +fortunately, they are rare. +.PP +REJECT does not rescan the input; instead it remembers the results of the previous +scan. This means that if a rule with trailing context is found, and +REJECT executed, the user +must not have used +.ul +unput +to change the characters forthcoming +from the input stream. +This is the only restriction on the user's ability to manipulate +the not-yet-processed input. +.PP +.2C +.NH +Acknowledgments. +.PP +As should +be obvious from the above, the outside of Lex +is patterned +on Yacc and the inside on Aho's string matching routines. +Therefore, both S. C. Johnson and A. V. Aho +are really originators +of much of Lex, +as well as debuggers of it. +Many thanks are due to both. +.PP +The code of the current version of Lex was designed, written, +and debugged by Eric Schmidt. +.SG MH-1274-MEL-unix +.sp 1 +.2C +.NH +References. +.SP 1v +.IP 1. +B. W. Kernighan and D. M. Ritchie, +.I +The C Programming Language, +.R +Prentice-Hall, N. J. (1978). +.IP 2. +B. W. Kernighan, +.I +Ratfor: A Preprocessor for a Rational Fortran, +.R +Software \- Practice and Experience, +\fB5\fR, pp. 395-496 (1975). +.IP 3. +S. C. Johnson, +.I +Yacc: Yet Another Compiler Compiler, +.R +Computing Science Technical Report No. 32, +1975, +.MH +.if \n(tm (also TM 75-1273-6) +.IP 4. +A. V. Aho and M. J. Corasick, +.I +Efficient String Matching: An Aid to Bibliographic Search, +.R +Comm. ACM +.B +18, +.R +333-340 (1975). +.IP 5. +B. W. Kernighan, D. M. Ritchie and K. L. Thompson, +.I +QED Text Editor, +.R +Computing Science Technical Report No. 5, +1972, +.MH +.IP 6. +D. M. Ritchie, +private communication. +See also +M. E. Lesk, +.I +The Portable C Library, +.R +Computing Science Technical Report No. 31, +.MH +.if \n(tm (also TM 75-1274-11) |