src - OpenBSD base system


author	Michael Shalayeff <mickey@cvs.openbsd.org>	2002-12-03 21:44:00 +0000
committer	Michael Shalayeff <mickey@cvs.openbsd.org>	2002-12-03 21:44:00 +0000
commit	5dc6f996d1d715bac7c9d5a59f939d29eacc3a14 (patch)
tree	da347c07f4cfc5737063c99f7c28ebedd054105e /usr.bin/lex/PSD.doc/lex.ms
parent	7c40f51a9a07812997cc937835ea8a45aab0d3e1 (diff)

caldera-licensed docs, now that they are free. need more work, thus not installed yet

Diffstat (limited to 'usr.bin/lex/PSD.doc/lex.ms')

-rw-r--r-- usr.bin/lex/PSD.doc/lex.ms 2335

1 files changed, 2335 insertions, 0 deletions

diff --git a/usr.bin/lex/PSD.doc/lex.ms b/usr.bin/lex/PSD.doc/lex.ms
new file mode 100644
index 00000000000..e1b4b0085a4
--- /dev/null
+++ b/usr.bin/lex/PSD.doc/lex.ms

@@ -0,0 +1,2335 @@

+.\" $OpenBSD: lex.ms,v 1.1 2002/12/03 21:43:59 mickey Exp $

+.\"

+.\" Copyright (C) Caldera International Inc. 2001-2002.

+.\"

+.\" Redistribution and use in source and binary forms, with or without

+.\" modification, are permitted provided that the following conditions

+.\" are met:

+.\" 1. Redistributions of source code and documentation must retain the above

+.\" copyright notice, this list of conditions and the following disclaimer.

+.\" 2. Redistributions in binary form must reproduce the above copyright

+.\" notice, this list of conditions and the following disclaimer in the

+.\" documentation and/or other materials provided with the distribution.

+.\" 3. All advertising materials mentioning features or use of this software

+.\" must display the following acknowledgement:

+.\" This product includes software developed or owned by Caldera

+.\" International, Inc.

+.\" 4. Neither the name of Caldera International, Inc. nor the names of other

+.\" contributors may be used to endorse or promote products derived from

+.\" this software without specific prior written permission.

+.\"

+.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA

+.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR

+.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES

+.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.

+.\" IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE FOR ANY DIRECT,

+.\" INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES

+.\" (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR

+.\" SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)

+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,

+.\" STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING

+.\" IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE

+.\" POSSIBILITY OF SUCH DAMAGE.

+.\"

+.\" @(#)lex.ms 8.2 (Berkeley) 5/24/94

+.\"

+.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator'

+.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%'

+.hc ~

+.bd I 2

+.de TS

+.br

+.nf

+.SP 1v

+.ul 0

+..

+.de TE

+.SP 1v

+.fi

+..

+.\".de PT

+.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''

+.\".if \\n%>1 'sp

+.\"..

+.ND July 21, 1975

+.\".RP

+.\".TM 75-1274-15 39199 39199-11

+.TL

+Lex \- A Lexical Analyzer ~Generator~

+.AU ``MH 2C-569'' 6377

+M. E. Lesk and E. Schmidt

+.AI

+.MH

+.AB

+.sp

+.bd I 2

+.\".nr PS 8

+.\".nr VS 9

+.\".ps 8

+.\".vs 9p

+Lex helps write programs whose control flow

+is directed by instances of regular

+expressions in the input stream.

+It is well suited for editor-script type transformations and

+for segmenting input in preparation for

+a parsing routine.

+.PP

+Lex source is a table of regular expressions and corresponding program fragments.

+The table is translated to a program

+which reads an input stream, copying it to an output stream

+and partitioning the input

+into strings which match the given expressions.

+As each such string is recognized the corresponding

+program fragment is executed.

+The recognition of the expressions

+is performed by a deterministic finite automaton

+generated by Lex.

+The program fragments written by the user are executed in the order in which the

+corresponding regular expressions occur in the input stream.

+.if n .if \n(tm .ig

+.PP

+The lexical analysis

+programs written with Lex accept ambiguous specifications

+and choose the longest

+match possible at each input point.

+If necessary, substantial look~ahead

+is performed on the input, but the

+input stream will be backed up to the

+end of the current partition, so that the user

+has general freedom to manipulate it.

+.PP

+Lex can generate analyzers in either C or Ratfor, a language

+which can be translated automatically to portable Fortran.

+It is available on the PDP-11 UNIX, Honeywell GCOS,

+and IBM OS systems.

+This manual, however, will only discuss generating analyzers

+in C on the UNIX system, which is the only supported

+form of Lex under UNIX Version 7.

+Lex is designed to simplify

+interfacing with Yacc, for those

+with access to this compiler-compiler system.

+..

+.\".nr PS 9

+.\".nr VS 11

+.AE

+.2C

+.NH

+Introduction.

+.PP

+Lex is a program generator designed for

+lexical processing of character input streams.

+It accepts a high-level, problem-oriented specification

+for character string matching,

+and

+produces a program in a general purpose language which recognizes

+regular expressions.

+The regular expressions are specified by the user in the

+source specifications given to Lex.

+The Lex written code recognizes these expressions

+in an input stream and partitions the input stream into

+strings matching the expressions. At the bound~aries

+between strings

+program sections

+provided by the user are executed.

+The Lex source file associates the regular expressions and the

+program fragments.

+As each expression appears in the input to the program written by Lex,

+the corresponding fragment is executed.

+.PP

+.de MH

+Bell Laboratories, Murray Hill, NJ 07974.

+..

+The user supplies the additional code

+beyond expression matching

+needed to complete his tasks, possibly

+including code written by other generators.

+The program that recognizes the expressions is generated in the

+general purpose programming language employed for the

+user's program fragments.

+Thus, a high level expression

+language is provided to write the string expressions to be

+matched while the user's freedom to write actions

+is unimpaired.

+This avoids forcing the user who wishes to use a string manipulation

+language for input analysis to write processing programs in the same

+and often inappropriate string handling language.

+.PP

+Lex is not a complete language, but rather a generator representing

+a new language feature which can be added to

+different programming languages, called ``host languages.''

+Just as general purpose languages

+can produce code to run on different computer hardware,

+Lex can write code in different host languages.

+The host language is used for the output code generated by Lex

+and also for the program fragments added by the user.

+Compatible run-time libraries for the different host languages

+are also provided.

+This makes Lex adaptable to different environments and

+different users.

+Each application

+may be directed to the combination of hardware and host language appropriate

+to the task, the user's background, and the properties of local

+implementations.

+At present, the only supported host language is C,

+although Fortran (in the form of Ratfor [2]) has been available

+in the past.

+Lex itself exists on UNIX, GCOS, and OS/370; but the

+code generated by Lex may be taken anywhere the appropriate

+compilers exist.

+.PP

+Lex turns the user's expressions and actions

+(called

+.ul

+source

+in this memo) into the host general-purpose language;

+the generated program is named

+.ul

+yylex.

+The

+.ul

+yylex

+program

+will recognize expressions

+in a stream

+(called

+.ul

+input

+in this memo)

+and perform the specified actions for each expression as it is detected.

+See Figure 1.

+.TS

+center;

+l _ r

+l|c|r

+l _ r

+l|c|r

+l _ r

+c s s

+c s s.

+Source \(-> Lex \(-> yylex

+.sp 2

+Input \(-> yylex \(-> Output

+.sp

+An overview of Lex

+Figure 1

+.TE

+.PP

+For a trivial example, consider a program to delete

+from the input

+all blanks or tabs at the ends of lines.

+.TS

+center;

+l l.

+%%

+[ \et]+$ ;

+.TE

+is all that is required.

+The program

+contains a %% delimiter to mark the beginning of the rules, and

+one rule.

+This rule contains a regular expression

+which matches one or more

+instances of the characters blank or tab

+(written \et for visibility, in accordance with the C language convention)

+just prior to the end of a line.

+The brackets indicate the character

+class made of blank and tab; the + indicates ``one or more ...'';

+and the $ indicates ``end of line,'' as in QED.

+No action is specified,

+so the program generated by Lex (yylex) will ignore these characters.

+Everything else will be copied.

+To change any remaining

+string of blanks or tabs to a single blank,

+add another rule:

+.TS

+center;

+l l.

+%%

+[ \et]+$ ;

+[ \et]+ printf(" ");

+.TE

+The finite automaton generated for this

+source will scan for both rules at once,

+observing at

+the termination of the string of blanks or tabs

+whether or not there is a newline character, and executing

+the desired rule action.

+The first rule matches all strings of blanks or tabs

+at the end of lines, and the second

+rule all remaining strings of blanks or tabs.
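The effect of these two rules can be approximated by hand in plain C. The sketch below is not what Lex generates; the function name `squeeze_blanks` and its buffer interface are assumptions made for illustration. It deletes a run of blanks or tabs that is immediately followed by a newline and replaces any other run with a single blank.

```c
#include <stddef.h>

/* Hypothetical helper, not Lex output: copies `in` to `out`, applying
 * the two rules by hand. `out` must hold at least strlen(in) + 1 bytes. */
void squeeze_blanks(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ' || in[i] == '\t') {
            /* Consume the whole run: the longest match of [ \t]+ */
            while (in[i] == ' ' || in[i] == '\t')
                i++;
            /* First rule: a run followed by newline is deleted.
             * Second rule: any other run becomes a single blank. */
            if (in[i] != '\n')
                out[o++] = ' ';
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
}
```

As with the `$` operator described in the text, a run of blanks at the very end of the input with no newline after it is treated by the second rule, not the first.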

+.PP

+Lex can be used alone for simple transformations, or

+for analysis and statistics gathering on a lexical level.

+Lex can also be used with a parser generator

+to perform the lexical analysis phase; it is particularly

+easy to interface Lex and Yacc [3].

+Lex programs recognize only regular expressions;

+Yacc writes parsers that accept a large class of context free grammars,

+but require a lower level analyzer to recognize input tokens.

+Thus, a combination of Lex and Yacc is often appropriate.

+When used as a preprocessor for a later parser generator,

+Lex is used to partition the input stream,

+and the parser generator assigns structure to

+the resulting pieces.

+The flow of control

+in such a case (which might be the first half of a compiler,

+for example) is shown in Figure 2.

+Additional programs,

+written by other generators

+or by hand, can

+be added easily to programs written by Lex.

+.BS 2

+.ps 9

+.vs 11

+.TS

+center;

+l c c c l

+l _ c _ l

+l|c|c|c|l

+l _ c _ l

+l c c c l

+l _ c _ l

+l|c|c|c|l

+l _ c _ l

+l c s s l

+l c s s l.

+ lexical grammar

+ rules rules

+ \(da \(da

+ Lex Yacc

+ \(da \(da

+Input \(-> yylex \(-> yyparse \(-> Parsed input

+.sp

+ Lex with Yacc

+ Figure 2

+.TE

+.ps 10

+.vs 12

+.BE

+Yacc users

+will realize that the name

+.ul

+yylex

+is what Yacc expects its lexical analyzer to be named,

+so that the use of this name by Lex simplifies

+interfacing.

+.PP

+Lex generates a deterministic finite automaton from the regular expressions

+in the source [4].

+The automaton is interpreted, rather than compiled, in order

+to save space.

+The result is still a fast analyzer.

+In particular, the time taken by a Lex program

+to recognize and partition an input stream is

+proportional to the length of the input.

+The number of Lex rules or

+the complexity of the rules is

+not important in determining speed,

+unless rules which include

+forward context require a significant amount of re~scanning.

+What does increase with the number and complexity of rules

+is the size of the finite

+automaton, and therefore the size of the program

+generated by Lex.

+.PP

+In the program written by Lex, the user's fragments

+(representing the

+.ul

+actions

+to be performed as each regular expression

+is found)

+are gathered

+as cases of a switch.

+The automaton interpreter directs the control flow.

+Opportunity is provided for the user to insert either

+declarations or additional statements in the routine containing

+the actions, or to

+add subroutines outside this action routine.

+.PP

+Lex is not limited to source which can

+be interpreted on the basis of one character

+look~ahead.

+For example,

+if there are two rules, one looking for

+.I ab

+and another for

+.I abcdefg ,

+and the input stream is

+.I abcdefh ,

+Lex will recognize

+.I ab

+and leave

+the input pointer just before

+.I "cd. . ."

+Such backup is more costly

+than the processing of simpler languages.

+.2C

+.NH

+Lex Source.

+.PP

+The general format of Lex source is:

+.TS

+center;

+l.

+{definitions}

+%%

+{rules}

+%%

+{user subroutines}

+.TE

+where the definitions and the user subroutines

+are often omitted.

+The second

+.I %%

+is optional, but the first is required

+to mark the beginning of the rules.

+The absolute minimum Lex program is thus

+.TS

+center;

+l.

+%%

+.TE

+(no definitions, no rules) which translates into a program

+which copies the input to the output unchanged.

+.PP

+In the outline of Lex programs shown above, the

+.I

+rules

+.R

+represent the user's control

+decisions; they are a table, in which the left column

+contains

+.I

+regular expressions

+.R

+(see section 3)

+and the right column contains

+.I

+actions,

+.R

+program fragments to be executed when the expressions

+are recognized.

+Thus an individual rule might appear

+.TS

+center;

+l l.

+integer printf("found keyword INT");

+.TE

+to look for the string

+.I integer

+in the input stream and

+print the message ``found keyword INT'' whenever it appears.

+In this example the host procedural language is C and

+the C library function

+.I

+printf

+.R

+is used to print the string.

+The end

+of the expression is indicated by the first blank or tab character.

+If the action is merely a single C expression,

+it can just be given on the right side of the line; if it is

+compound, or takes more than a line, it should be enclosed in

+braces.

+As a slightly more useful example, suppose it is desired to

+change a number of words from British to American spelling.

+Lex rules such as

+.TS

+center;

+l l.

+colour printf("color");

+mechanise printf("mechanize");

+petrol printf("gas");

+.TE

+would be a start. These rules are not quite enough,

+since

+the word

+.I petroleum

+would become

+.I gaseum ;

+a way of dealing

+with this will be described later.
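The `petroleum` problem is easy to reproduce outside Lex. The following C sketch uses a hypothetical helper, `replace_all`, which substitutes a literal string wherever it occurs and, like the rules above, pays no attention to word boundaries.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replaces every occurrence of `from` in `in` with
 * `to`, writing at most `outsz` bytes (including the NUL) to `out`.
 * Output is truncated if it does not fit. */
void replace_all(const char *in, const char *from, const char *to,
                 char *out, size_t outsz)
{
    size_t flen = strlen(from), tlen = strlen(to), o = 0;

    if (outsz == 0)
        return;
    while (*in != '\0') {
        if (flen > 0 && strncmp(in, from, flen) == 0) {
            if (o + tlen >= outsz)
                break;              /* no room left: stop early */
            memcpy(out + o, to, tlen);
            o += tlen;
            in += flen;
        } else {
            if (o + 1 >= outsz)
                break;
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
}
```

Calling it with `"petrol"` and `"gas"` on the text `petrol and petroleum` yields `gas and gaseum`, which is the unwanted result the text describes.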

+.2C

+.NH

+Lex Regular Expressions.

+.PP

+The definitions of regular expressions are very similar to those

+in QED [5].

+A regular

+expression specifies a set of strings to be matched.

+It contains text characters (which match the corresponding

+characters in the strings being compared)

+and operator characters (which specify

+repetitions, choices, and other features).

+The letters of the alphabet and the digits are

+always text characters; thus the regular expression

+.TS

+center;

+l l.

+integer

+.TE

+matches the string

+.ul

+integer

+wherever it appears

+and the expression

+.TS

+center;

+l.

+a57D

+.TE

+looks for the string

+.ul

+a57D.

+.PP

+.I

+Operators.

+.R

+The operator characters are

+.TS

+center;

+l.

+" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >

+.TE

+and if they are to be used as text characters, an escape

+should be used.

+The quotation mark operator (")

+indicates that whatever is contained between a pair of quotes

+is to be taken as text characters.

+Thus

+.TS

+center;

+l.

+xyz"++"

+.TE

+matches the string

+.I xyz++

+when it appears. Note that a part of a string may be quoted.

+It is harmless but unnecessary to quote an ordinary

+text character; the expression

+.TS

+center;

+l.

+"xyz++"

+.TE

+is the same as the one above.

+Thus by quoting every non-alphanumeric character

+being used as a text character, the user can avoid remembering

+the list above of current

+operator characters, and is safe should further extensions to Lex

+lengthen the list.

+.PP

+An operator character may also be turned into a text character

+by preceding it with \e as in

+.TS

+center;

+l.

+xyz\e+\e+

+.TE

+which

+is another, less readable, equivalent of the above expressions.

+Another use of the quoting mechanism is to get a blank into

+an expression; normally, as explained above, blanks or tabs end

+a rule.

+Any blank character not contained within [\|] (see below) must

+be quoted.

+Several normal C escapes with \e

+are recognized: \en is newline, \et is tab, and \eb is backspace.

+To enter \e itself, use \e\e.

+Since newline is illegal in an expression, \en must be used;

+it is not

+required to escape tab and backspace.

+Every character but blank, tab, newline and the list above is always

+a text character.

+.PP

+.I

+Character classes.

+.R

+Classes of characters can be specified using the operator pair [\|].

+The construction

+.I [abc]

+matches a

+single character, which may be

+.I a ,

+.I b ,

+or

+.I c .

+Within square brackets,

+most operator meanings are ignored.

+Only three characters are special:

+these are \e \(mi and ^. The \(mi character

+indicates ranges. For example,

+.TS

+center;

+l.

+[a\(miz0\(mi9<>_]

+.TE

+indicates the character class containing all the lower case letters,

+the digits,

+the angle brackets, and underline.

+Ranges may be given in either order.

+Using \(mi between any pair of characters which are

+not both upper case letters, both lower case letters, or both digits

+is implementation dependent and will get a warning message.

+(E.g., [0\-z] in ASCII is many more characters

+than it is in EBCDIC).

+If it is desired to include the

+character \(mi in a character class, it should be first or

+last; thus

+.TS

+center;

+l.

+[\(mi+0\(mi9]

+.TE

+matches all the digits and the two signs.

+.PP

+In character classes,

+the ^ operator must appear as the first character

+after the left bracket; it indicates that the resulting string

+is to be complemented with respect to the computer character set.

+Thus

+.TS

+center;

+l.

+[^abc]

+.TE

+matches all characters except a, b, or c, including

+all special or control characters; or

+.TS

+center;

+l.

+[^a\-zA\-Z]

+.TE

+is any character which is not a letter.

+The \e character provides the usual escapes within

+character class brackets.
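The character class rules above (ranges with the minus sign, a literal minus when first or last, and complementation with a leading ^) can be sketched as a small C function. The name `class_match` is an assumption, the \e escapes are deliberately not handled, and range comparison uses the machine character codes, so the results depend on the character set exactly as the text warns.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: `cls` is the text between [ and ], such as
 * "a-z0-9<>_" or "^abc". Backslash escapes are not handled. */
bool class_match(const char *cls, char c)
{
    bool negate = false, found = false;
    unsigned char uc = (unsigned char)c;
    size_t i = 0, n;

    if (cls[0] == '^') {        /* ^ is special only as the first character */
        negate = true;
        cls++;
    }
    n = strlen(cls);
    while (i < n) {
        unsigned char lo = (unsigned char)cls[i];

        /* x-y is a range; a '-' that is first or last is ordinary text */
        if (i + 2 < n && cls[i + 1] == '-') {
            unsigned char hi = (unsigned char)cls[i + 2];

            if (lo > hi) {      /* ranges may be given in either order */
                unsigned char t = lo;
                lo = hi;
                hi = t;
            }
            if (uc >= lo && uc <= hi)
                found = true;
            i += 3;
        } else {
            if (uc == lo)
                found = true;
            i++;
        }
    }
    return negate ? !found : found;
}
```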

+.PP

+.I

+Arbitrary character.

+.R

+To match almost any character, use the operator character

+.TS

+center;

+l.

+\&.

+.TE

+which is the class of all characters except newline.

+Escaping into octal is possible although non-portable:

+.TS

+center;

+l.

+[\e40\-\e176]

+.TE

+matches all printable characters in the ASCII character set, from octal

+40 (blank) to octal 176 (tilde).

+.PP

+.I

+Optional expressions.

+.R

+The operator

+.I ?

+indicates

+an optional element of an expression.

+Thus

+.TS

+center;

+l.

+ab?c

+.TE

+matches either

+.I ac

+or

+.I abc .

+.PP

+.I

+Repeated expressions.

+.R

+Repetitions of classes are indicated by the operators

+.I \(**

+and

+.I + .

+.TS

+center;

+l.

+\f2a\(**\f1

+.TE

+is any number of consecutive

+.I a

+characters, including zero; while

+.TS

+center;

+l.

+a+

+.TE

+is one or more instances of

+.I a .

+For example,

+.TS

+center;

+l.

+[a\-z]+

+.TE

+is all strings of lower case letters.

+And

+.TS

+center;

+l.

+[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**

+.TE

+indicates all alphanumeric strings with a leading

+alphabetic character.

+This is a typical expression for recognizing identifiers in

+computer languages.

+.PP

+.I

+Alternation and Grouping.

+.R

+The operator |

+indicates alternation:

+.TS

+center;

+l.

+(ab\||\|cd)

+.TE

+matches either

+.ul

+ab

+or

+.ul

+cd.

+Note that parentheses are used for grouping, although

+they are

+not necessary on the outside level;

+.TS

+center;

+l.

+ab\||\|cd

+.TE

+would have sufficed.

+Parentheses

+can be used for more complex expressions:

+.TS

+center;

+l.

+(ab\||\|cd+)?(ef)\(**

+.TE

+matches such strings as

+.I abefef ,

+.I efefef ,

+.I cdef ,

+or

+.I cddd\| ;

+but not

+.I abc ,

+.I abcd ,

+or

+.I abcdef .
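To check the strings listed above against the last expression, the pattern can be hand-coded in C. This is only a sketch for this one expression, with a hypothetical function name; it tests whether an entire string matches and relies on the fact that the alternatives begin with different letters, so no backtracking is needed.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the whole of `s` matches (ab|cd+)?(ef)* */
bool matches_example(const char *s)
{
    /* Optional group: either "ab", or "c" followed by one or more "d" */
    if (strncmp(s, "ab", 2) == 0) {
        s += 2;
    } else if (strncmp(s, "cd", 2) == 0) {
        s += 2;
        while (*s == 'd')
            s++;
    }
    /* Zero or more repetitions of "ef" */
    while (strncmp(s, "ef", 2) == 0)
        s += 2;
    return *s == '\0';
}
```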

+.PP

+.I

+Context sensitivity.

+.R

+Lex will recognize a small amount of surrounding

+context. The two simplest operators for this are

+.I ^

+and

+.I $ .

+If the first character of an expression is

+.I ^ ,

+the expression will only be matched at the beginning

+of a line (after a newline character, or at the beginning of

+the input stream).

+This can never conflict with the other meaning of

+.I ^ ,

+complementation

+of character classes, since that only applies within

+the [\|] operators.

+If the very last character is

+.I $ ,

+the expression will only be matched at the end of a line (when

+immediately followed by newline).

+The latter operator is a special case of the

+.I /

+operator character,

+which indicates trailing context.

+The expression

+.TS

+center;

+l.

+ab/cd

+.TE

+matches the string

+.I ab ,

+but only if followed by

+.ul

+cd.

+Thus

+.TS

+center;

+l.

+ab$

+.TE

+is the same as

+.TS

+center;

+l.

+ab/\en

+.TE

+Left context is handled in Lex by

+.I

+start conditions

+.R

+as explained in section 10. If a rule is only to be executed

+when the Lex automaton interpreter is in start condition

+.I

+x,

+.R

+the rule should be prefixed by

+.TS

+center;

+l.

+<x>

+.TE

+using the angle bracket operator characters.

+If we considered ``being at the beginning of a line'' to be

+start condition

+.I ONE ,

+then the ^ operator

+would be equivalent to

+.TS

+center;

+l.

+<ONE>

+.TE

+Start conditions are explained more fully later.

+.PP

+.I

+Repetitions and Definitions.

+.R

+The operators {} specify

+either repetitions (if they enclose numbers)

+or

+definition expansion (if they enclose a name). For example

+.TS

+center;

+l.

+{digit}

+.TE

+looks for a predefined string named

+.I digit

+and inserts it

+at that point in the expression.

+The definitions are given in the first part of the Lex

+input, before the rules.

+In contrast,

+.TS

+center;

+l.

+a{1,5}

+.TE

+looks for 1 to 5 occurrences of

+.I a .

+.PP

+Finally, initial

+.I %

+is special, being the separator

+for Lex source segments.

+.2C

+.NH

+Lex Actions.

+.PP

+When an expression written as above is matched, Lex

+executes the corresponding action. This section describes

+some features of Lex which aid in writing actions. Note

+that there is a default action, which

+consists of copying the input to the output. This

+is performed on all strings not otherwise matched. Thus

+the Lex user who wishes to absorb the entire input, without

+producing any output, must provide rules to match everything.

+When Lex is being used with Yacc, this is the normal

+situation.

+One may consider that actions are what is done instead of

+copying the input to the output; thus, in general,

+a rule which merely copies can be omitted.

+Also, a character combination

+which is omitted from the rules

+and which appears as input

+is likely to be printed on the output, thus calling

+attention to the gap in the rules.

+.PP

+One of the simplest things that can be done is to ignore

+the input. Specifying a C null statement, \fI;\fR as an action

+causes this result. A frequent rule is

+.TS

+center;

+l l.

+[ \et\en] ;

+.TE

+which causes the three spacing characters (blank, tab, and newline)

+to be ignored.

+.PP

+Another easy way to avoid writing actions is the action character

+|, which indicates that the action for this rule is the action

+for the next rule.

+The previous example could also have been written

+.TS

+center;

+l l.

+" " |

+"\et" |

+"\en" ;

+.TE

+with the same result, although in different style.

+The quotes around \en and \et are not required.

+.PP

+In more complex actions, the user

+will

+often want to know the actual text that matched some expression

+like

+.I [a\(miz]+ .

+Lex leaves this text in an external character

+array named

+.I

+yytext.

+.R

+Thus, to print the name found,

+a rule like

+.TS

+center;

+l l.

+[a\-z]+ printf("%s", yytext);

+.TE

+will print

+the string in

+.I

+yytext.

+.R

+The C function

+.I

+printf

+.R

+accepts a format argument and data to be printed;

+in this case, the format is ``print string'' (% indicating

+data conversion, and

+.I s

+indicating string type),

+and the data are the characters

+in

+.I

+yytext.

+.R

+So this just places

+the matched string

+on the output.

+This action

+is so common that

+it may be written as ECHO:

+.TS

+center;

+l l.

+[a\-z]+ ECHO;

+.TE

+is the same as the above.

+Since the default action is just to

+print the characters found, one might ask why

+give a rule, like this one, which merely specifies

+the default action?

+Such rules are often required

+to avoid matching some other rule

+which is not desired. For example, if there is a rule

+which matches

+.I read

+it will normally match the instances of

+.I read

+contained in

+.I bread

+or

+.I readjust ;

+to avoid

+this,

+a rule

+of the form

+.I [a\(miz]+

+is needed.

+This is explained further below.

+.PP

+Sometimes it is more convenient to know the end of what

+has been found; hence Lex also provides a count

+.I

+yyleng

+.R

+of the number of characters matched.

+To count both the number

+of words and the number of characters in words in the input, the user might write

+.TS

+center;

+l l.

+[a\-zA\-Z]+ {words++; chars += yyleng;}

+.TE

+which accumulates in

+.ul

+chars

+the number

+of characters in the words recognized.

+The last character in the string matched can

+be accessed by

+.TS

+center;

+l.

+yytext[yyleng\-1]

+.TE
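For comparison, the word-counting rule can be written out by hand. The C sketch below is an illustration, not Lex output: `count_words` and `struct word_counts` are hypothetical names, the local variable `yyleng` only mimics the Lex variable, and the letter test assumes a character set such as ASCII in which each alphabet is contiguous.

```c
#include <stddef.h>

struct word_counts {
    long words;     /* number of matches of [a-zA-Z]+ */
    long chars;     /* total characters in those matches */
};

static int is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Hypothetical helper: scans `text` as the single rule would. */
struct word_counts count_words(const char *text)
{
    struct word_counts wc = {0, 0};
    size_t i = 0;

    while (text[i] != '\0') {
        if (is_letter(text[i])) {
            size_t yyleng = 0;      /* length of the current match */

            while (is_letter(text[i + yyleng]))
                yyleng++;
            wc.words++;
            wc.chars += (long)yyleng;
            i += yyleng;
        } else {
            i++;    /* Lex would copy this character; the sketch skips it */
        }
    }
    return wc;
}
```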

+.PP

+Occasionally, a Lex

+action may decide that a rule has not recognized the correct

+span of characters.

+Two routines are provided to aid with this situation.

+First,

+.I

+yymore()

+.R

+can be called to indicate that the next input expression recognized is to be

+tacked on to the end of this input. Normally,

+the next input string would overwrite the current

+entry in

+.I

+yytext.

+.R

+Second,

+.I

+yyless (n)

+.R

+may be called to indicate that not all the characters matched

+by the currently successful expression are wanted right now.

+The argument

+.I

+n

+.R

+indicates the number of characters

+in

+.I

+yytext

+.R

+to be retained.

+Further characters previously matched

+are

+returned to the input. This provides the same sort of

+look~ahead offered by the / operator,

+but in a different form.

+.PP

+.I

+Example:

+.R

+Consider a language which defines

+a string as a set of characters between quotation (") marks, and provides that

+to include a " in a string it must be preceded by a \e. The

+regular expression which matches that is somewhat confusing,

+so that it might be preferable to write

+.TS

+center;

+l l.

+\e"[^"]\(** {

+ if (yytext[yyleng\-1] == \(fm\e\e\(fm)

+ yymore();

+ else

+ ... normal user processing

+ }

+.TE

+which will, when faced with a string such as

+.I

+"abc\e"def\|"

+.R

+first match

+the five characters

+\fI"abc\e\|\fR;

+then

+the call to

+.I yymore()

+will

+cause the next part of the string,

+\fI"def\|\fR,

+to be tacked on the end.

+Note that the final quote terminating the string should be picked

+up in the code labeled ``normal user processing''.
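The repeated matching driven by yymore() can be imitated with an ordinary loop. The C sketch below uses a hypothetical function, `scan_string`, that returns the length of the whole string token; like the rule in the text, it treats any quote preceded by a backslash as escaped, so it shares that rule's limitations (for instance, a string ending in an escaped backslash).

```c
#include <stddef.h>

/* Hypothetical helper: `s` must start with a quotation mark. Returns the
 * length of the string token, including the closing quote if present. */
size_t scan_string(const char *s)
{
    size_t len = 1;                 /* the opening quote */

    for (;;) {
        /* One match of "[^"]* : extend up to, not including, the next quote */
        while (s[len] != '\0' && s[len] != '"')
            len++;
        /* As in the action: if the match ends in a backslash the quote is
         * escaped, so keep the text (yymore) and match again from the quote. */
        if (s[len] == '"' && s[len - 1] == '\\')
            len++;
        else
            break;
    }
    if (s[len] == '"')              /* closing quote: normal user processing */
        len++;
    return len;
}
```

On the example string from the text the first pass stops after five characters, the second pass adds the quote and `def`, and the closing quote brings the total to ten.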

+.PP

+The function

+.I

+yyless()

+.R

+might be used to reprocess

+text in various circumstances. Consider the C problem of distinguishing

+the ambiguity of ``=\(mia''.

+Suppose it is desired to treat this as ``=\(mi a''

+but print a message. A rule might be

+.ps 9

+.vs 11

+.TS

+center;

+l l.

+=\(mi[a\-zA\-Z] {

+ printf("Op (=\(mi) ambiguous\en");

+ yyless(yyleng\-1);

+ ... action for =\(mi ...

+ }

+.TE

+.ps 10

+.vs 12

+which prints a message, returns the letter after the

+operator to the input stream, and treats the operator as ``=\(mi''.

+Alternatively it might be desired to treat this as ``= \(mia''.

+To do this, just return the minus

+sign as well as the letter to the input:

+.ps 9

+.vs 11

+.TS

+center;

+l l.

+=\(mi[a\-zA\-Z] {

+ printf("Op (=\(mi) ambiguous\en");

+ yyless(yyleng\-2);

+ ... action for = ...

+ }

+.TE

+.ps 10

+.vs 12

+will perform the other interpretation.

+Note that the expressions for the two cases might more easily

+be written

+.TS

+center;

+l l.

+=\(mi/[A\-Za\-z]

+.TE

+in the first case and

+.TS

+center;

+l.

+=/\-[A\-Za\-z]

+.TE

+in the second;

+no backup would be required in the rule action.

+It is not necessary to recognize the whole identifier

+to observe the ambiguity.

+The

+possibility of ``=\(mi3'', however, makes

+.TS

+center;

+l.

+=\(mi/[^ \et\en]

+.TE

+a still better rule.
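What yyless(n) does to the match and to the input position can be sketched in C. The structure and function names below are assumptions made for illustration; real Lex keeps yytext and yyleng as external variables and manages its own input buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner state standing in for the Lex globals. */
struct scanner {
    const char *input;   /* whole input text */
    size_t pos;          /* index of the next unread character */
    char yytext[64];     /* text of the current match */
    size_t yyleng;       /* its length */
};

/* Record that the next `len` input characters were matched.
 * The caller must ensure that many characters remain. */
void take_match(struct scanner *sc, size_t len)
{
    if (len >= sizeof sc->yytext)
        len = sizeof sc->yytext - 1;
    memcpy(sc->yytext, sc->input + sc->pos, len);
    sc->yytext[len] = '\0';
    sc->yyleng = len;
    sc->pos += len;
}

/* Keep the first n characters of the match; return the rest to the input. */
void sketch_yyless(struct scanner *sc, size_t n)
{
    if (n > sc->yyleng)
        return;                     /* nothing sensible to do */
    sc->pos -= sc->yyleng - n;
    sc->yyleng = n;
    sc->yytext[n] = '\0';
}
```

After matching the three characters `=-a`, a call with `yyleng - 1` leaves `=-` in `yytext` and the letter as the next unread character; a call with `yyleng - 2` leaves only `=`.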

+.PP

+In addition to these routines, Lex also permits

+access to the I/O routines

+it uses.

+They are:

+.IP 1)

+.I

+input()

+.R

+which returns the next input character;

+.IP 2)

+.I

+output(c)

+.R

+which writes the character

+.I

+c

+.R

+on the output; and

+.IP 3)

+.I

+unput(c)

+.R

+pushes the character

+.I

+c

+.R

+back onto the input stream to be read later by

+.I

+input().

+.R

+.LP

+By default these routines are provided as macro definitions,

+but the user can override them and supply private versions.

+These routines

+define the relationship between external files and

+internal characters, and must all be retained

+or modified consistently.

+They may be redefined, to

+cause input or output to be transmitted to or from strange

+places, including other programs or internal memory;

+but the character set used must be consistent in all routines;

+a value of zero returned by

+.I

+input

+.R

+must mean end of file; and

+the relationship between

+.I

+unput

+.R

+and

+.I

+input

+.R

+must be retained

+or the Lex look~ahead will not work.

+Lex does not look ahead at all if it does not have to,

+but every rule ending in

+.ft I

++ \(** ?

+.ft R

+or

+.ft I

+$

+.ft R

+or containing

+.ft I

+/

+.ft R

+implies look~ahead.

+Look~ahead is also necessary to match an expression that is a prefix

+of another expression.

+See below for a discussion of the character set used by Lex.

+The standard Lex library imposes

+a 100 character limit on backup.
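A private pair of input and unput routines that reads from memory instead of a file might look like the C sketch below. All names here are hypothetical, and these are ordinary functions, not the macros that Lex supplies; the sketch keeps the two properties the text requires, namely that a returned value of zero means end of file and that a character pushed back is the next one read.

```c
#include <stddef.h>

#define PUSHBACK_MAX 100    /* mirrors the 100 character backup limit */

/* Hypothetical in-memory source used in place of a file. */
struct source {
    const char *text;               /* NUL-terminated input */
    size_t pos;                     /* next unread character in text */
    char pushed[PUSHBACK_MAX];      /* characters returned by unput */
    size_t npushed;
};

/* Next character, or 0 at end of input (so the text cannot contain NULs). */
int src_input(struct source *s)
{
    if (s->npushed > 0)
        return (unsigned char)s->pushed[--s->npushed];
    if (s->text[s->pos] == '\0')
        return 0;
    return (unsigned char)s->text[s->pos++];
}

/* Push c back; returns 0 on success, -1 if the pushback buffer is full. */
int src_unput(struct source *s, int c)
{
    if (s->npushed >= PUSHBACK_MAX)
        return -1;
    s->pushed[s->npushed++] = (char)c;
    return 0;
}
```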

+.PP

+Another Lex library routine that the user will sometimes want

+to redefine is

+.I

+yywrap()

+.R

+which is called whenever Lex reaches an end-of-file.

+If

+.I

+yywrap

+.R

+returns a 1, Lex continues with the normal wrapup on end of input.

+Sometimes, however, it is convenient to arrange for more

+input to arrive

+from a new source.

+In this case, the user should provide

+.I

+yywrap

+.R

+which

+arranges for new input and

+returns 0. This instructs Lex to continue processing.

+The default

+.I

+yywrap

+.R

+always returns 1.
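The convention for yywrap (return 0 after arranging more input, return 1 to finish) can be sketched in C with a list of in-memory inputs. The structure and both function names are assumptions for illustration; a real replacement would be named yywrap and would typically reopen the input file.

```c
#include <stddef.h>

/* Hypothetical state: a list of in-memory inputs read one after another. */
struct inputs {
    const char *const *texts;   /* the input strings */
    size_t count;               /* how many there are */
    size_t current;             /* index of the one being read */
    size_t pos;                 /* position within it */
};

/* Called at end of input: returns 0 if more input was arranged, 1 to stop. */
int sketch_yywrap(struct inputs *in)
{
    if (in->current + 1 < in->count) {
        in->current++;
        in->pos = 0;
        return 0;
    }
    return 1;
}

/* Next character across all inputs, or 0 when everything is consumed. */
int next_char(struct inputs *in)
{
    if (in->count == 0)
        return 0;
    while (in->texts[in->current][in->pos] == '\0') {
        if (sketch_yywrap(in) != 0)
            return 0;
    }
    return (unsigned char)in->texts[in->current][in->pos++];
}
```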

+.PP

+This routine is also a convenient place

+to print tables, summaries, etc. at the end

+of a program. Note that it is not

+possible to write a normal rule which recognizes

+end-of-file; the only access to this condition is

+through

+.I

+yywrap.

+.R

+In fact, unless a private version of

+.I

+input()

+.R

+is supplied

+a file containing nulls

+cannot be handled,

+since a value of 0 returned by

+.I

+input

+.R

+is taken to be end-of-file.

+.PP

+.2C

+.NH

+Ambiguous Source Rules.

+.PP

+Lex can handle ambiguous specifications.

+When more than one expression can match the

+current input, Lex chooses as follows:

+.IP 1)

+The longest match is preferred.

+.IP 2)

+Among rules which matched the same number of characters,

+the rule given first is preferred.

+.LP

+Thus, suppose the rules

+.TS

+center;

+l l.

+integer keyword action ...;

+[a\-z]+ identifier action ...;

+.TE

+to be given in that order. If the input is

+.I integers ,

+it is taken as an identifier, because

+.I [a\-z]+

+matches 8 characters while

+.I integer

+matches only 7.

+If the input is

+.I integer ,

+both rules match 7 characters, and

+the keyword rule is selected because it was given first.

+Anything shorter (e.g. \fIint\fR\|) will

+not match the expression

+.I integer

+and so the identifier interpretation is used.
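+.PP
+The two preference rules can be sketched in C for exactly the two expressions above.
+This is a hand-written illustration, not the code Lex generates;
+the names \fIchoose_rule\fR, \fImatch_keyword\fR and \fImatch_identifier\fR are invented here,
+and the letter test assumes a character set in which a through z are contiguous.

```c
#include <string.h>

enum { RULE_NONE = -1, RULE_KEYWORD = 0, RULE_IDENTIFIER = 1 };

/* Length matched at the start of s by the literal rule "integer". */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched at the start of s by the rule [a-z]+ . */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Pick the rule for the text at s; *len receives the match length. */
int choose_rule(const char *s, size_t *len)
{
    size_t lens[2];
    int best = RULE_NONE;
    int i;

    lens[RULE_KEYWORD] = match_keyword(s);
    lens[RULE_IDENTIFIER] = match_identifier(s);
    *len = 0;
    for (i = 0; i < 2; i++) {
        /* A strictly longer match wins; on a tie the earlier rule is kept. */
        if (lens[i] > *len) {
            *len = lens[i];
            best = i;
        }
    }
    return best;
}
```

+The strict comparison is what implements the second preference: a later rule replaces an earlier one only when it matches more characters.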

+.PP

+The principle of preferring the longest

+match makes rules containing

+expressions like

+.I \&.\(**

+dangerous.

+For example,

+.TS

+center;

+l.

+\&\(fm.\(**\(fm

+.TE

+might seem a good way of recognizing

+a string in single quotes.

+But it is an invitation for the program to read far

+ahead, looking for a distant

+single quote.

+Presented with the input

+.TS

+center;

+l l.

+\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here

+.TE

+the above expression will match

+.TS

+center;

+l l.

+\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm

+.TE

+which is probably not what was wanted.

+A better rule is of the form

+.TS

+center;

+l.

+\&\(fm[^\(fm\en]\(**\(fm

+.TE

+which, on the above input, will stop

+after

+.I \(fmfirst\(fm .

+The consequences

+of errors like this are mitigated by the fact

+that the

+.I \&.

+operator will not match newline.

+Thus expressions like

+.I \&.\(**

+stop on the

+current line.

+Don't try to defeat this with expressions like

+.I (.|\en)+

+or

+equivalents;

+the Lex generated program will try to read

+the entire input file, causing

+internal buffer overflows.
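+.PP
+The difference between the two quoted-string expressions can be checked with a small C sketch.
+The functions \fImatch_greedy\fR and \fImatch_bounded\fR are illustrative names, not part of Lex;
+each returns the number of characters its expression would match at the start of a string.

```c
#include <string.h>

/* Length matched by  '.*'  : longest match, and "." never matches newline. */
size_t match_greedy(const char *s)
{
    size_t i, last = 0;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '\'')
            last = i;           /* remember the most distant closing quote */
    return last ? last + 1 : 0;
}

/* Length matched by the better rule: a quote, any run of characters that
   are neither quote nor newline, then a closing quote. */
size_t match_bounded(const char *s)
{
    size_t i = 1;

    if (s[0] != '\'')
        return 0;
    while (s[i] != '\0' && s[i] != '\n' && s[i] != '\'')
        i++;
    return s[i] == '\'' ? i + 1 : 0;
}
```

+On the sample input the first function runs through the closing quote of the second string, while the second stops after the first quoted string.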

+.PP

+Note that Lex is normally partitioning

+the input stream, not searching for all possible matches

+of each expression.

+This means that each character is accounted for

+once and only once.

+For example, suppose it is desired to

+count occurrences of both \fIshe\fR and \fIhe\fR in an input text.

+Some Lex rules to do this might be

+.TS

+center;

+l l.

+she s++;

+he h++;

+\en |

+\&. ;

+.TE

+where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.

+Remember that . does not include newline.

+Since \fIshe\fR includes \fIhe\fR, Lex will normally

+.I

+not

+.R

+recognize

+the instances of \fIhe\fR included in \fIshe\fR,

+since once it has passed a \fIshe\fR those characters are gone.

+.PP

+Sometimes the user would like to override this choice. The action

+REJECT

+means ``go do the next alternative.''

+It causes whatever rule was second choice after the current

+rule to be executed.

+The position of the input pointer is adjusted accordingly.

+Suppose the user really wants to count the included instances of \fIhe\fR:

+.TS

+center;

+l l.

+she {s++; REJECT;}

+he {h++; REJECT;}

+\en |

+\&. ;

+.TE

+these rules are one way of changing the previous example

+to do just that.

+After counting each expression, it is rejected; whenever appropriate,

+the other expression will then be counted. In this example, of course,

+the user could note that \fIshe\fR includes \fIhe\fR but not

+vice versa, and omit the REJECT action on \fIhe\fR;

+in other cases, however, it

+would not be possible a priori to tell

+which input characters

+were in both classes.
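+.PP
+The effect of REJECT on the \fIshe\fR and \fIhe\fR counts can be imitated in plain C.
+This sketch does not reproduce the Lex machinery; it only assumes that, after both rules
+reject, the single-character rule consumes one character, so the scan advances one position at a time.
+The function names are invented for the illustration.

```c
#include <string.h>

/* Partitioning scan: each character belongs to exactly one match. */
void count_partition(const char *s, int *she, int *he)
{
    *she = *he = 0;
    while (*s != '\0') {
        if (strncmp(s, "she", 3) == 0) {
            (*she)++;
            s += 3;
        } else if (strncmp(s, "he", 2) == 0) {
            (*he)++;
            s += 2;
        } else {
            s++;                /* the "." and newline rules: skip one character */
        }
    }
}

/* Scan with REJECT on both rules: every rule matching at a position is
   counted, then the scan moves on by a single character. */
void count_reject(const char *s, int *she, int *he)
{
    *she = *he = 0;
    for (; *s != '\0'; s++) {
        if (strncmp(s, "she", 3) == 0)
            (*she)++;
        if (strncmp(s, "he", 2) == 0)
            (*he)++;
    }
}
```

+For the text "she said he" the partitioning scan finds one \fIshe\fR and one \fIhe\fR,
+while the rejecting scan finds one \fIshe\fR and two instances of \fIhe\fR.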

+.PP

+Consider the two rules

+.TS

+center;

+l l.

+a[bc]+ { ... ; REJECT;}

+a[cd]+ { ... ; REJECT;}

+.TE

+If the input is

+.I ab ,

+only the first rule matches,

+and on

+.I ad

+only the second matches.

+The input string

+.I accb

+matches the first rule for four characters

+and then the second rule for three characters.

+In contrast, the input

+.I accd

+agrees with

+the second rule for four characters and then the first

+rule for three.

+.PP

+In general, REJECT is useful whenever

+the purpose of Lex is not to partition the input

+stream but to detect all examples of some items

+in the input, and the instances of these items

+may overlap or include each other.

+Suppose a digram table of the input is desired;

+normally the digrams overlap, that is, the word

+.I the

+is considered to contain

+both

+.I th

+and

+.I he .

+Assuming a two-dimensional array named

+.ul

+digram

+to be incremented, the appropriate

+source is

+.TS

+center;

+l l.

+%%

+[a\-z][a\-z] {

+ digram[yytext[0]][yytext[1]]++;

+ REJECT;

+ }

+\. ;

+\en ;

+.TE

+where the REJECT is necessary to pick up

+a letter pair beginning at every character, rather than at every

+other character.
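+.PP
+The same overlapping count can be written directly in C, which may help to check what the Lex source is expected to produce.
+This is only a sketch: the function name \fIcount_digrams\fR is invented,
+and, unlike the Lex action, the table is indexed by letter position (0 to 25) rather than by raw character value.

```c
#include <string.h>

#define LETTERS 26

/* Count every adjacent pair of lower-case letters, overlapping pairs included. */
void count_digrams(const char *s, int digram[LETTERS][LETTERS])
{
    size_t i;

    memset(digram, 0, sizeof(int) * LETTERS * LETTERS);
    for (i = 0; s[i] != '\0' && s[i + 1] != '\0'; i++) {
        char a = s[i], b = s[i + 1];

        /* Assumes a character set in which a through z are contiguous. */
        if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z')
            digram[a - 'a'][b - 'a']++;
    }
}
```

+Advancing by one character per step corresponds to the REJECT in the Lex action; advancing by two would count only every other pair.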

+.2C

+.NH

+Lex Source Definitions.

+.PP

+Remember the format of the Lex

+source:

+.TS

+center;

+l.

+{definitions}

+%%

+{rules}

+%%

+{user routines}

+.TE

+So far only the rules have been described. The user needs

+additional options,

+though, to define variables for use in his program and for use

+by Lex.

+These can go either in the definitions section

+or in the rules section.

+.PP

+Remember that Lex is turning the rules into a program.

+Any source not intercepted by Lex is copied

+into the generated program. There are three classes

+of such things.

+.IP 1)

+Any line which begins with a blank or tab,

+and which is not part of a Lex rule or action, is copied into

+the Lex generated program.

+Such source input prior to the first %% delimiter will be external

+to any function in the code; if it appears immediately after the first

+%%,

+it appears in an appropriate place for declarations

+in the function written by Lex which contains the actions.

+This material must look like program fragments,

+and should precede the first Lex rule.

+.IP

+As a side effect of the above, lines which begin with a blank

+or tab, and which contain a comment,

+are passed through to the generated program.

+This can be used to include comments in either the Lex source or

+the generated code. The comments should follow the host

+language convention.

+.IP 2)

+Anything included between lines containing

+only

+.I %{

+and

+.I %}

+is

+copied out as above. The delimiters are discarded.

+This format permits entering text like preprocessor statements that

+must begin in column 1,

+or copying lines that do not look like programs.

+.IP 3)

+Anything after the second %% delimiter, regardless of formats, etc.,

+is copied out after the Lex output.

+.PP

+Definitions intended for Lex are given

+before the first %% delimiter. Any line in this section

+not contained between %{ and %}, and beginning

+in column 1, is assumed to define Lex substitution strings.

+The format of such lines is

+.TS

+center;

+l l.

+name translation

+.TE

+and it

+causes the string given as a translation to

+be associated with the name.

+The name and translation

+must be separated by at least one blank or tab, and the name must begin with a letter.

+The translation can then be called out

+by the {name} syntax in a rule.

+Using {D} for the digits and {E} for an exponent field,

+for example, might abbreviate rules to recognize numbers:

+.TS

+center;

+l l.

+D [0\-9]

+E [DEde][\-+]?{D}+

+%%

+{D}+ printf("integer");

+{D}+"."{D}\(**({E})? |

+{D}\(**"."{D}+({E})? |

+{D}+{E} printf("real");

+.TE

+Note the first two rules for real numbers;

+both require a decimal point and contain

+an optional exponent field,

+but the first requires at least one digit before the

+decimal point and the second requires at least one

+digit after the decimal point.

+To correctly handle the problem

+posed by a Fortran expression such as

+.I 35.EQ.I ,

+which does not contain a real number, a context-sensitive

+rule such as

+.TS

+center;

+l l.

+[0\-9]+/"."EQ printf("integer");

+.TE

+could be used in addition to the normal rule for integers.
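+.PP
+As a cross-check on the number rules, the following C sketch classifies a complete token by hand.
+It is not what Lex generates, and the names \fIclassify\fR, \fIdigits\fR and \fIexponent\fR are invented here.
+It looks at whole tokens only, so it says nothing about where a scanner would stop inside a longer input.

```c
#include <ctype.h>

enum number_kind { NOT_A_NUMBER, INTEGER, REAL };

/* Skip a run of digits ({D}*) and return how many were seen. */
static int digits(const char **p)
{
    int n = 0;

    while (isdigit((unsigned char)**p)) {
        (*p)++;
        n++;
    }
    return n;
}

/* Skip an exponent field {E} = [DEde][-+]?{D}+ if present; 1 if skipped. */
static int exponent(const char **p)
{
    const char *q = *p;

    if (*q != 'D' && *q != 'E' && *q != 'd' && *q != 'e')
        return 0;
    q++;
    if (*q == '-' || *q == '+')
        q++;
    if (digits(&q) == 0)
        return 0;               /* a letter without digits is not an exponent */
    *p = q;
    return 1;
}

/* Classify a complete token according to the integer and real rules. */
enum number_kind classify(const char *s)
{
    int before, after = 0, dot = 0, has_exp;

    before = digits(&s);
    if (*s == '.') {
        dot = 1;
        s++;
        after = digits(&s);
    }
    has_exp = exponent(&s);
    if (*s != '\0')
        return NOT_A_NUMBER;    /* trailing characters: no rule covers the token */
    if (dot)
        return (before > 0 || after > 0) ? REAL : NOT_A_NUMBER;
    if (before == 0)
        return NOT_A_NUMBER;
    return has_exp ? REAL : INTEGER;
}
```

+Taken as a whole, the Fortran text 35.EQ.I is rejected, because what follows the decimal point is not an exponent field.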

+.PP

+The definitions

+section may also contain other commands, including the

+selection of a host language, a character set table,

+a list of start conditions, or adjustments to the default

+size of arrays within Lex itself for larger source programs.

+These possibilities

+are discussed below under ``Summary of Source Format,''

+section 12.

+.2C

+.NH

+Usage.

+.PP

+There are two steps in

+compiling a Lex source program.

+First, the Lex source must be turned into a generated program

+in the host general purpose language.

+Then this program must be compiled and loaded, usually with

+a library of Lex subroutines.

+The generated program

+is on a file named

+.I lex.yy.c .

+The I/O library is defined in terms of the C standard

+library [6].

+.PP

+The C programs generated by Lex are slightly different

+on OS/370, because the

+OS compiler is less powerful than the UNIX or GCOS compilers,

+and does less at compile time.

+C programs generated on GCOS and UNIX are the same.

+.PP

+.I

+UNIX.

+.R

+The library is accessed by the loader flag

+.I \-ll .

+So an appropriate

+set of commands is

+.KS

+.in 5

+lex source

+cc lex.yy.c \-ll

+.in 0

+.KE

+The resulting program is placed on the usual file

+.I

+a.out

+.R

+for later execution.

+To use Lex with Yacc see below.

+Although the default Lex I/O routines use the C standard library,

+the Lex automata themselves do not do so;

+if private versions of

+.I

+input,

+output

+.R

+and

+.I unput

+are given, the library can be avoided.

+.PP

+.2C

+.NH

+Lex and Yacc.

+.PP

+If you want to use Lex with Yacc, note that what Lex writes is a program

+named

+.I

+yylex(),

+.R

+the name required by Yacc for its analyzer.

+Normally, the default main program on the Lex library

+calls this routine, but if Yacc is loaded, and its main

+program is used, Yacc will call

+.I

+yylex().

+.R

+In this case each Lex rule should end with

+.TS

+center;

+l.

+return(token);

+.TE

+where the appropriate token value is returned.

+An easy way to get access

+to Yacc's names for tokens is to

+compile the Lex output file as part of

+the Yacc output file by placing the line

+.TS

+center;

+l.

+# include "lex.yy.c"

+.TE

+in the last section of Yacc input.

+Supposing the grammar to be

+named ``good'' and the lexical rules to be named ``better'',

+the UNIX command sequence can just be:

+.TS

+center;

+l.

+yacc good

+lex better

+cc y.tab.c \-ly \-ll

+.TE

+The Yacc library (\-ly) should be loaded before the Lex library,

+to obtain a main program which invokes the Yacc parser.

+The Lex and Yacc programs can be generated in

+either order.

+.2C

+.NH

+Examples.

+.PP

+As a trivial problem, consider copying an input file while

+adding 3 to every positive number divisible by 7.

+Here is a suitable Lex source program

+.TS

+center;

+l l.

+%%

+ int k;

+[0\-9]+ {

+ k = atoi(yytext);

+ if (k%7 == 0)

+ printf("%d", k+3);

+ else

+ printf("%d",k);

+ }

+.TE

+to do just that.

+The rule [0\-9]+ recognizes strings of digits;

+.I

+atoi

+.R

+converts the digits to binary

+and stores the result in

+.ul

+k.

+The operator % (remainder) is used to check whether

+.ul

+k

+is divisible by 7; if it is,

+it is incremented by 3 as it is written out.

+It may be objected that this program will alter such

+input items as

+.I 49.63

+or

+.I X7 .

+Furthermore, it increments the absolute value

+of all negative numbers divisible by 7.

+To avoid this, just add a few more rules after the active one,

+as here:

+.TS

+center;

+l l.

+%%

+ int k;

+\-?[0\-9]+ {

+ k = atoi(yytext);

+ printf("%d",

+ k%7 == 0 ? k+3 : k);

+ }

+\-?[0\-9.]+ ECHO;

+[A\-Za\-z][A\-Za\-z0\-9]+ ECHO;

+.TE

+Numerical strings containing

+a ``.'' or preceded by a letter will be picked up by

+one of the last two rules, and not changed.

+The

+.I if\-else

+has been replaced by

+a C conditional expression to save space;

+the form

+.ul

+a?b:c

+means ``if

+.I a

+then

+.I b

+else

+.I c ''.
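+.PP
+The arithmetic in the action can be isolated in a small C function for testing.
+The name \fIadjust\fR is invented for this sketch; it takes the text of a token matched by the signed-integer rule
+and returns the value that the action would print.

```c
#include <stdlib.h>

/* Value printed for a token matched by -?[0-9]+ . */
int adjust(const char *token)
{
    int k = atoi(token);

    /* Zero and negative multiples of 7 also satisfy this test. */
    return k % 7 == 0 ? k + 3 : k;
}
```

+Note that 0 is treated as divisible by 7 and becomes 3, and that \-7 becomes \-4;
+whether that is wanted depends on how ``positive'' in the problem statement is read.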

+.PP

+For an example of statistics gathering, here

+is a program which histograms the lengths

+of words, where a word is defined as a string of letters.

+.TS

+center;

+l l.

+ int lengs[100];

+%%

+[a\-z]+ lengs[yyleng]++;

+\&. |

+\en ;

+%%

+.T&

+l s.

+yywrap()

+{

+int i;

+printf("Length No. words\en");

+for(i=0; i<100; i++)

+ if (lengs[i] > 0)

+ printf("%5d%10d\en",i,lengs[i]);

+return(1);

+}

+.TE

+This program

+accumulates the histogram, while producing no output. At the end

+of the input it prints the table.

+The final statement

+.I

+return(1);

+.R

+indicates that Lex is to perform wrapup. If

+.I

+yywrap

+.R

+returns zero (false)

+it implies that further input is available

+and the program is

+to continue reading and processing.

+Providing a

+.I

+yywrap

+.R

+that never

+returns true causes an infinite loop.
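+.PP
+For comparison, here is a sketch of the same histogram in plain C, without Lex.
+The names \fItally\fR and \fIreport\fR are invented; \fIreport\fR plays the part of \fIyywrap\fR.
+One assumption is added that the Lex source does not make: words of 100 letters or more are counted in the last slot rather than indexing past the end of the array.

```c
#include <stdio.h>

#define MAXLEN 100

static int lengs[MAXLEN];

/* Tally the length of each run of lower-case letters in text. */
void tally(const char *text)
{
    int len = 0;

    for (;; text++) {
        if (*text >= 'a' && *text <= 'z') {
            len++;
        } else {
            if (len > 0)
                lengs[len < MAXLEN ? len : MAXLEN - 1]++;   /* clamp long words */
            len = 0;
            if (*text == '\0')
                break;
        }
    }
}

/* Print the table at end of input; returning 1 means "wrap up". */
int report(void)
{
    int i;

    printf("Length  No. words\n");
    for (i = 0; i < MAXLEN; i++)
        if (lengs[i] > 0)
            printf("%5d%10d\n", i, lengs[i]);
    return 1;
}
```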

+.PP

+As a larger example,

+here are some parts of a program written by N. L. Schryer

+to convert double precision Fortran to single precision Fortran.

+Because Fortran does not distinguish upper and lower case letters,

+this routine begins by defining a set of classes including

+both cases of each letter:

+.TS

+center;

+l l.

+a [aA]

+b [bB]

+c [cC]

+\&...

+z [zZ]

+.TE

+An additional class recognizes white space:

+.TS

+center;

+l l.

+W [ \et]\(**

+.TE

+The first rule changes

+``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.

+.TS

+center;

+l.

+{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {

+ printf(yytext[0]==\(fmd\(fm? "real" : "REAL");

+ }

+.TE

+Care is taken throughout this program to preserve the case

+(upper or lower)

+of the original program.

+The conditional operator is used to

+select the proper form of the keyword.

+The next rule copies continuation card indications to

+avoid confusing them with constants:

+.TS

+center;

+l l.

+^"     "[^ 0] ECHO;

+.TE

+In the regular expression, the quotes surround the

+blanks.

+It is interpreted as

+``beginning of line, then five blanks, then

+anything but blank or zero.''

+Note the two different meanings of

+.I ^ .

+There follow some rules to change double precision

+constants to ordinary floating constants.

+.TS

+center;

+l.

+[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |

+[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |

+"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {

+ /\(** convert constants \(**/

+ for(p=yytext; \(**p != 0; p++)

+ {

+ if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)

+ \(**p += \(fme\(fm \- \(fmd\(fm;

+ }

+ ECHO;

+ }

+.TE

+After the floating point constant is recognized, it is

+scanned by the

+.ul

+for

+loop

+to find the letter

+.I d

+or

+.I D .

+The program then adds

+.c

+.I \(fme\(fm\-\(fmd\(fm ,

+which converts

+it to the next letter of the alphabet.

+The modified constant, now single-precision,

+is written out again.
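+.PP
+The conversion step on its own can be sketched as a C function, written with the modern compound assignment operator.
+The name \fIto_single\fR is invented for the illustration, and the arithmetic assumes, as the original action does,
+that \fIe\fR immediately follows \fId\fR in the character set.

```c
/* Rewrite the exponent letter of a double precision constant in place. */
void to_single(char *p)
{
    for (; *p != '\0'; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';    /* d becomes e, D becomes E: case is preserved */
}
```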

+There follow a series of names which must be respelled to remove

+their initial \fId\fR.

+By using the

+array

+.I

+yytext

+.R

+the same action suffices for all the names (only a sample of

+a rather long list is given here).

+.TS

+center;

+l l.

+{d}{s}{i}{n} |

+{d}{c}{o}{s} |

+{d}{s}{q}{r}{t} |

+{d}{a}{t}{a}{n} |

+\&...

+{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);

+.TE

+Another list of names must have initial \fId\fR changed to initial \fIa\fR:

+.TS

+center;

+l l.

+{d}{l}{o}{g} |

+{d}{l}{o}{g}10 |

+{d}{m}{i}{n}1 |

+{d}{m}{a}{x}1 {

+ yytext[0] += \(fma\(fm \- \(fmd\(fm;

+ ECHO;

+ }

+.TE

+And one routine

+must have initial \fId\fR changed to initial \fIr\fR:

+.TS

+center;

+l l.

+{d}1{m}{a}{c}{h} {yytext[0] += \(fmr\(fm \- \(fmd\(fm;

+ ECHO;

+ }

+.TE

+To avoid such names as \fIdsinx\fR being detected as instances

+of \fIdsin\fR, some final rules pick up longer words as identifiers

+and copy some surviving characters:

+.TS

+center;

+l l.

+[A\-Za\-z][A\-Za\-z0\-9]\(** |

+[0\-9]+ |

+\en |

+\&. ECHO;

+.TE

+Note that this program is not complete; it

+does not deal with the spacing problems in Fortran or

+with the use of keywords as identifiers.

+.br

+.2C

+.NH

+Left Context Sensitivity.

+.PP

+Sometimes

+it is desirable to have several sets of lexical rules

+to be applied at different times in the input.

+For example, a compiler preprocessor might distinguish

+preprocessor statements and analyze them differently

+from ordinary statements.

+This requires

+sensitivity

+to prior context, and there are several ways of handling

+such problems.

+The \fI^\fR operator, for example, is a prior context operator,

+recognizing immediately preceding left context just as \fI$\fR recognizes

+immediately following right context.

+Adjacent left context could be extended, to produce a facility similar to

+that for adjacent right context, but it is unlikely

+to be as useful, since often the relevant left context

+appeared some time earlier, such as at the beginning of a line.

+.PP

+This section describes three means of dealing

+with different environments: a simple use of flags,

+when only a few rules change from one environment to another,

+the use of

+.I

+start conditions

+.R

+on rules,

+and the possibility of making multiple lexical analyzers all run

+together.

+In each case, there are rules which recognize the need to change the

+environment in which the

+following input text is analyzed, and set some parameter

+to reflect the change. This may be a flag explicitly tested by

+the user's action code; such a flag is the simplest way of dealing

+with the problem, since Lex is not involved at all.

+It may be more convenient,

+however,

+to have Lex remember the flags as initial conditions on the rules.

+Any rule may be associated with a start condition. It will only

+be recognized when Lex is in

+that start condition.

+The current start condition may be changed at any time.

+Finally, if the sets of rules for the different environments

+are very dissimilar,

+clarity may be best achieved by writing several distinct lexical

+analyzers, and switching from one to another as desired.

+.PP

+Consider the following problem: copy the input to the output,

+changing the word \fImagic\fR to \fIfirst\fR on every line which began

+with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line

+which began with the letter \fIb\fR, and changing

+\fImagic\fR to \fIthird\fR on every line which began

+with the letter \fIc\fR. All other words and all other lines

+are left unchanged.

+.PP

+These rules are so simple that the easiest way

+to do this job is with a flag:

+.TS

+center;

+l l.

+ int flag;

+%%

+^a {flag = \(fma\(fm; ECHO;}

+^b {flag = \(fmb\(fm; ECHO;}

+^c {flag = \(fmc\(fm; ECHO;}

+\en {flag = 0 ; ECHO;}

+magic {

+ switch (flag)

+ {

+ case \(fma\(fm: printf("first"); break;

+ case \(fmb\(fm: printf("second"); break;

+ case \(fmc\(fm: printf("third"); break;

+ default: ECHO; break;

+ }

+ }

+.TE

+should be adequate.

+.PP

+To handle the same problem with start conditions, each

+start condition must be introduced to Lex in the definitions section

+with a line reading

+.TS

+center;

+l l.

+%Start name1 name2 ...

+.TE

+where the conditions may be named in any order.

+The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.

+The conditions may be referenced at the

+head of a rule with the <> brackets:

+.TS

+center;

+l.

+<name1>expression

+.TE

+is a rule which is only recognized when Lex is in the

+start condition \fIname1\fR.

+To enter a start condition,

+execute the action statement

+.TS

+center;

+l.

+BEGIN name1;

+.TE

+which changes the start condition to \fIname1\fR.

+To resume the normal state,

+.TS

+center;

+l.

+BEGIN 0;

+.TE

+resets the initial condition

+of the Lex automaton interpreter.

+A rule may be active in several

+start conditions:

+.TS

+center;

+l.

+<name1,name2,name3>

+.TE

+is a legal prefix. Any rule not beginning with the

+<> prefix operator is always active.

+.PP

+The same example as before can be written:

+.TS

+center;

+l l.

+%START AA BB CC

+%%

+^a {ECHO; BEGIN AA;}

+^b {ECHO; BEGIN BB;}

+^c {ECHO; BEGIN CC;}

+\en {ECHO; BEGIN 0;}

+<AA>magic printf("first");

+<BB>magic printf("second");

+<CC>magic printf("third");

+.TE

+where the logic is exactly the same as in the previous

+method of handling the problem, but Lex does the work

+rather than the user's code.
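+.PP
+The behaviour that both versions are meant to have can be pinned down with a line-at-a-time C sketch.
+The function \fIrewrite_line\fR and its buffer interface are invented for this illustration;
+the replacement word is chosen once from the first character of the line,
+which corresponds to entering AA, BB or CC, and the default corresponds to the normal state.

```c
#include <string.h>

/* Copy line to out (capacity outsz), replacing "magic" according to the
   first character of the line. Returns 0 on success, -1 if out is too small. */
int rewrite_line(const char *line, char *out, size_t outsz)
{
    const char *repl = "magic";                 /* normal state: word unchanged */
    size_t used = 0;

    if (outsz == 0)
        return -1;
    if (line[0] == 'a')
        repl = "first";                         /* like BEGIN AA */
    else if (line[0] == 'b')
        repl = "second";                        /* like BEGIN BB */
    else if (line[0] == 'c')
        repl = "third";                         /* like BEGIN CC */

    while (*line != '\0') {
        const char *piece;
        size_t n;

        if (strncmp(line, "magic", 5) == 0) {
            piece = repl;
            n = strlen(repl);
            line += 5;
        } else {
            piece = line;                       /* copy one ordinary character */
            n = 1;
            line += 1;
        }
        if (used + n + 1 > outsz)
            return -1;                          /* keep room for the terminator */
        memcpy(out + used, piece, n);
        used += n;
    }
    out[used] = '\0';
    return 0;
}
```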

+.2C

+.NH

+Character Set.

+.PP

+The programs generated by Lex handle

+character I/O only through the routines

+.I

+input,

+output,

+.R

+and

+.I

+unput.

+.R

+Thus the character representation

+provided in these routines

+is accepted by Lex and employed to return

+values in

+.I

+yytext.

+.R

+For internal use

+a character is represented as a small integer

+which, if the standard library is used,

+has a value equal to the integer value of the bit

+pattern representing the character on the host computer.

+Normally, the letter

+.I a

+is represented as the same form as the character constant

+.I \(fma\(fm .

+If this interpretation is changed, by providing I/O

+routines which translate the characters,

+Lex must be told about

+it, by giving a translation table.

+This table must be in the definitions section,

+and must be bracketed by lines containing only

+``%T''.

+The table contains lines of the form

+.TS

+center;

+l.

+{integer} {character string}

+.TE

+which indicate the value associated with each character.

+Thus the next example

+.TS

+center;

+l l.

+%T

+ 1 Aa

+ 2 Bb

+\&...

+26 Zz

+27 \en

+28 +

+29 \-

+30 0

+31 1

+\&...

+39 9

+%T

+.TE

+.sp

+.ce 1

+Sample character table.

+maps the lower and upper case letters together into the integers 1 through 26,

+newline into 27, + and \- into 28 and 29, and the

+digits into 30 through 39.

+Note the escape for newline.

+If a table is supplied, every character that is to appear either

+in the rules or in any valid input must be included

+in the table.

+No character

+may be assigned the number 0, and no character may be

+assigned a bigger number than the size of the hardware character set.
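+.PP
+The sample table can be expressed as a C lookup array, which is one way private I/O routines might apply the same translation.
+This is a sketch under that assumption; the name \fIbuild_table\fR and the 256-entry array are not part of Lex.

```c
#include <string.h>

/* Fill tbl with the sample mapping. Unmapped characters are left as 0,
   a value that no character may be assigned in a real table. */
void build_table(unsigned char tbl[256])
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    int i;

    memset(tbl, 0, 256);
    for (i = 0; i < 26; i++) {
        /* Both cases of a letter share one code, 1 through 26. */
        tbl[(unsigned char)upper[i]] = (unsigned char)(1 + i);
        tbl[(unsigned char)lower[i]] = (unsigned char)(1 + i);
    }
    tbl['\n'] = 27;
    tbl['+'] = 28;
    tbl['-'] = 29;
    for (i = 0; i < 10; i++)
        tbl['0' + i] = (unsigned char)(30 + i);  /* digits: 30 through 39 */
}
```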

+.2C

+.NH

+Summary of Source Format.

+.PP

+The general form of a Lex source file is:

+.TS

+center;

+l.

+{definitions}

+%%

+{rules}

+%%

+{user subroutines}

+.TE

+The definitions section contains

+a combination of

+.IP 1)

+Definitions, in the form ``name space translation''.

+.IP 2)

+Included code, in the form ``space code''.

+.IP 3)

+Included code, in the form

+.TS

+center;

+l.

+%{

+code

+%}

+.TE

+.ns

+.IP 4)

+Start conditions, given in the form

+.TS

+center;

+l.

+%S name1 name2 ...

+.TE

+.ns

+.IP 5)

+Character set tables, in the form

+.TS

+center;

+l.

+%T

+number space character-string

+\&...

+%T

+.TE

+.ns

+.IP 6)

+Changes to internal array sizes, in the form

+.TS

+center;

+l.

+%\fIx\fR\0\0\fInnn\fR

+.TE

+where \fInnn\fR is a decimal integer representing an array size

+and \fIx\fR selects the parameter as follows:

+.TS

+center;

+c c

+c l.

+Letter Parameter

+p positions

+n states

+e tree nodes

+a transitions

+k packed character classes

+o output array size

+.TE

+.LP

+Lines in the rules section have the form ``expression action''

+where the action may be continued on succeeding

+lines by using braces to delimit it.

+.PP

+Regular expressions in Lex use the following

+operators:

+.br

+.TS

+center;

+l l.

+x the character "x"

+"x" an "x", even if x is an operator.

+\ex an "x", even if x is an operator.

+[xy] the character x or y.

+[x\-z] the characters x, y or z.

+[^x] any character but x.

+\&. any character but newline.

+^x an x at the beginning of a line.

+<y>x an x when Lex is in start condition y.

+x$ an x at the end of a line.

+x? an optional x.

+x\(** 0,1,2, ... instances of x.

+x+ 1,2,3, ... instances of x.

+x|y an x or a y.

+(x) an x.

+x/y an x but only if followed by y.

+{xx} the translation of xx from the

+ definitions section.

+x{m,n} \fIm\fR through \fIn\fR occurrences of x

+.TE

+.NH

+Caveats and Bugs.

+.PP

+There are pathological expressions which

+produce exponential growth of the tables when

+converted to deterministic machines;

+fortunately, they are rare.

+.PP

+REJECT does not rescan the input; instead it remembers the results of the previous

+scan. This means that if a rule with trailing context is found, and

+REJECT executed, the user

+must not have used

+.ul

+unput

+to change the characters forthcoming

+from the input stream.

+This is the only restriction on the user's ability to manipulate

+the not-yet-processed input.

+.PP

+.2C

+.NH

+Acknowledgments.

+.PP

+As should

+be obvious from the above, the outside of Lex

+is patterned

+on Yacc and the inside on Aho's string matching routines.

+Therefore, both S. C. Johnson and A. V. Aho

+are really originators

+of much of Lex,

+as well as debuggers of it.

+Many thanks are due to both.

+.PP

+The code of the current version of Lex was designed, written,

+and debugged by Eric Schmidt.

+.SG MH-1274-MEL-unix

+.sp 1

+.2C

+.NH

+References.

+.SP 1v

+.IP 1.

+B. W. Kernighan and D. M. Ritchie,

+.I

+The C Programming Language,

+.R

+Prentice-Hall, N. J. (1978).

+.IP 2.

+B. W. Kernighan,

+.I

+Ratfor: A Preprocessor for a Rational Fortran,

+.R

+Software \- Practice and Experience,

+\fB5\fR, pp. 395-406 (1975).

+.IP 3.

+S. C. Johnson,

+.I

+Yacc: Yet Another Compiler Compiler,

+.R

+Computing Science Technical Report No. 32,

+1975,

+.MH

+.if \n(tm (also TM 75-1273-6)

+.IP 4.

+A. V. Aho and M. J. Corasick,

+.I

+Efficient String Matching: An Aid to Bibliographic Search,

+.R

+Comm. ACM

+.B

+18,

+.R

+333-340 (1975).

+.IP 5.

+B. W. Kernighan, D. M. Ritchie and K. L. Thompson,

+.I

+QED Text Editor,

+.R

+Computing Science Technical Report No. 5,

+1972,

+.MH

+.IP 6.

+D. M. Ritchie,

+private communication.

+See also

+M. E. Lesk,

+.I

+The Portable C Library,

+.R

+Computing Science Technical Report No. 31,

+.MH

+.if \n(tm (also TM 75-1274-11)