.\" $OpenBSD: lex.ms,v 1.1 2002/12/03 21:43:59 mickey Exp $
.\"
.\" Copyright (C) Caldera International Inc. 2001-2002.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code and documentation must retain the above
.\" copyright notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed or owned by Caldera
.\" International, Inc.
.\" 4. Neither the name of Caldera International, Inc. nor the names of other
.\" contributors may be used to endorse or promote products derived from
.\" this software without specific prior written permission.
.\"
.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE FOR ANY DIRECT,
.\" INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
.\" (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
.\" SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
.\" STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
.\" IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.\" @(#)lex.ms 8.2 (Berkeley) 5/24/94
.\"
.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator'
.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%'
.hc ~
.bd I 2
.de TS
.br
.nf
.SP 1v
.ul 0
..
.de TE
.SP 1v
.fi
..
.\".de PT
.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
.\".if \\n%>1 'sp
.\"..
.ND July 21, 1975
.\".RP
.\".TM 75-1274-15 39199 39199-11
.TL
Lex \- A Lexical Analyzer ~Generator~
.AU ``MH 2C-569'' 6377
M. E. Lesk and E. Schmidt
.AI
.MH
.AB
.sp
.bd I 2
.\".nr PS 8
.\".nr VS 9
.\".ps 8
.\".vs 9p
Lex helps write programs whose control flow
is directed by instances of regular
expressions in the input stream.
It is well suited for editor-script type transformations and
for segmenting input in preparation for
a parsing routine.
.PP
Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program
which reads an input stream, copying it to an output stream
and partitioning the input
into strings which match the given expressions.
As each such string is recognized the corresponding
program fragment is executed.
The recognition of the expressions
is performed by a deterministic finite automaton
generated by Lex.
The program fragments written by the user are executed in the order in which the
corresponding regular expressions occur in the input stream.
.if n .if \n(tm .ig
.PP
The lexical analysis
programs written with Lex accept ambiguous specifications
and choose the longest
match possible at each input point.
If necessary, substantial look~ahead
is performed on the input, but the
input stream will be backed up to the
end of the current partition, so that the user
has general freedom to manipulate it.
.PP
Lex can generate analyzers in either C or Ratfor, a language
which can be translated automatically to portable Fortran.
It is available on the PDP-11 UNIX, Honeywell GCOS,
and IBM OS systems.
This manual, however, will only discuss generating analyzers
in C on the UNIX system, which is the only supported
form of Lex under UNIX Version 7.
Lex is designed to simplify
interfacing with Yacc, for those
with access to this compiler-compiler system.
..
.\".nr PS 9
.\".nr VS 11
.AE
.2C
.NH
Introduction.
.PP
Lex is a program generator designed for
lexical processing of character input streams.
It accepts a high-level, problem-oriented specification
for character string matching,
and
produces a program in a general purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the
source specifications given to Lex.
The Lex-written code recognizes these expressions
in an input stream and partitions the input stream into
strings matching the expressions. At the bound~aries
between strings,
program sections
provided by the user are executed.
The Lex source file associates the regular expressions and the
program fragments.
As each expression appears in the input to the program written by Lex,
the corresponding fragment is executed.
.PP
.de MH
Bell Laboratories, Murray Hill, NJ 07974.
..
The user supplies the additional code
beyond expression matching
needed to complete his tasks, possibly
including code written by other generators.
The program that recognizes the expressions is generated in the
general purpose programming language employed for the
user's program fragments.
Thus, a high level expression
language is provided to write the string expressions to be
matched while the user's freedom to write actions
is unimpaired.
This avoids forcing the user who wishes to use a string manipulation
language for input analysis to write processing programs in the same
and often inappropriate string handling language.
.PP
Lex is not a complete language, but rather a generator representing
a new language feature which can be added to
different programming languages, called ``host languages.''
Just as general purpose languages
can produce code to run on different computer hardware,
Lex can write code in different host languages.
The host language is used for the output code generated by Lex
and also for the program fragments added by the user.
Compatible run-time libraries for the different host languages
are also provided.
This makes Lex adaptable to different environments and
different users.
Each application
may be directed to the combination of hardware and host language appropriate
to the task, the user's background, and the properties of local
implementations.
At present, the only supported host language is C,
although Fortran (in the form of Ratfor [2]) has been available
in the past.
Lex itself exists on UNIX, GCOS, and OS/370; but the
code generated by Lex may be taken anywhere the appropriate
compilers exist.
.PP
Lex turns the user's expressions and actions
(called
.ul
source
in this memo) into the host general-purpose language;
the generated program is named
.ul
yylex.
The
.ul
yylex
program
will recognize expressions
in a stream
(called
.ul
input
in this memo)
and perform the specified actions for each expression as it is detected.
See Figure 1.
.TS
center;
l _ r
l|c|r
l _ r
l _ r
l|c|r
l _ r
c s s
c s s.
Source \(-> Lex \(-> yylex
.sp 2
Input \(-> yylex \(-> Output
.sp
An overview of Lex
Figure 1
.TE
.PP
For a trivial example, consider a program to delete
from the input
all blanks or tabs at the ends of lines.
.TS
center;
l l.
%%
[ \et]+$ ;
.TE
is all that is required.
The program
contains a %% delimiter to mark the beginning of the rules, and
one rule.
This rule contains a regular expression
which matches one or more
instances of the characters blank or tab
(written \et for visibility, in accordance with the C language convention)
just prior to the end of a line.
The brackets indicate the character
class made of blank and tab; the + indicates ``one or more ...'';
and the $ indicates ``end of line,'' as in QED.
No action is specified,
so the program generated by Lex (yylex) will ignore these characters.
Everything else will be copied.
To change any remaining
string of blanks or tabs to a single blank,
add another rule:
.TS
center;
l l.
%%
[ \et]+$ ;
[ \et]+ printf(" ");
.TE
The finite automaton generated for this
source will scan for both rules at once,
observing at
the termination of the string of blanks or tabs
whether or not there is a newline character, and executing
the desired rule action.
The first rule matches all strings of blanks or tabs
at the end of lines, and the second
rule all remaining strings of blanks or tabs.
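.PP
As a rough illustration only, the combined effect of these two rules
can be sketched in plain C.
The function name squeeze_blanks is hypothetical and is not part of Lex
or its library; the sketch assumes the caller supplies an output buffer
at least as large as the input.

```c
#include <string.h>

/* Hypothetical helper, not part of Lex: copies src to dst while
 * (1) deleting runs of blanks/tabs that are followed by a newline,
 *     as the rule with the $ operator and the empty action does;
 * (2) replacing every other run of blanks/tabs with one blank,
 *     as the rule with the printf(" ") action does.
 * dst must have room for strlen(src) + 1 bytes. */
void squeeze_blanks(const char *src, char *dst)
{
    size_t i = 0, o = 0;

    while (src[i] != '\0') {
        if (src[i] == ' ' || src[i] == '\t') {
            /* Consume the whole run first: the longest match wins. */
            while (src[i] == ' ' || src[i] == '\t')
                i++;
            /* Only now look at what follows the run. */
            if (src[i] != '\n')
                dst[o++] = ' ';
        } else {
            dst[o++] = src[i++];   /* default action: copy */
        }
    }
    dst[o] = '\0';
}
```

In this sketch a run of blanks at the very end of the input, with no
newline after it, is kept as a single blank, since the $ operator
requires a following newline.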
.PP
Lex can be used alone for simple transformations, or
for analysis and statistics gathering on a lexical level.
Lex can also be used with a parser generator
to perform the lexical analysis phase; it is particularly
easy to interface Lex and Yacc [3].
Lex programs recognize only regular expressions;
Yacc writes parsers that accept a large class of context-free grammars,
but require a lower-level analyzer to recognize input tokens.
Thus, a combination of Lex and Yacc is often appropriate.
When used as a preprocessor for a later parser generator,
Lex is used to partition the input stream,
and the parser generator assigns structure to
the resulting pieces.
The flow of control
in such a case (which might be the first half of a compiler,
for example) is shown in Figure 2.
Additional programs,
written by other generators
or by hand, can
be added easily to programs written by Lex.
.BS 2
.ps 9
.vs 11
.TS
center;
l c c c l
l c c c l
l c c c l
l _ c _ l
l|c|c|c|l
l _ c _ l
l c c c l
l _ c _ l
l|c|c|c|l
l _ c _ l
l c s s l
l c s s l.
lexical grammar
rules rules
\(da \(da
Lex Yacc
\(da \(da
Input \(-> yylex \(-> yyparse \(-> Parsed input
.sp
Lex with Yacc
Figure 2
.TE
.ps 10
.vs 12
.BE
Yacc users
will realize that the name
.ul
yylex
is what Yacc expects its lexical analyzer to be named,
so that the use of this name by Lex simplifies
interfacing.
.PP
Lex generates a deterministic finite automaton from the regular expressions
in the source [4].
The automaton is interpreted, rather than compiled, in order
to save space.
The result is still a fast analyzer.
In particular, the time taken by a Lex program
to recognize and partition an input stream is
proportional to the length of the input.
The number of Lex rules or
the complexity of the rules is
not important in determining speed,
unless rules which include
forward context require a significant amount of re~scanning.
What does increase with the number and complexity of rules
is the size of the finite
automaton, and therefore the size of the program
generated by Lex.
.PP
In the program written by Lex, the user's fragments
(representing the
.ul
actions
to be performed as each regular expression
is found)
are gathered
as cases of a switch.
The automaton interpreter directs the control flow.
Opportunity is provided for the user to insert either
declarations or additional statements in the routine containing
the actions, or to
add subroutines outside this action routine.
.PP
Lex is not limited to source which can
be interpreted on the basis of one character
look~ahead.
For example,
if there are two rules, one looking for
.I ab
and another for
.I abcdefg ,
and the input stream is
.I abcdefh ,
Lex will recognize
.I ab
and leave
the input pointer just before
.I "cd. . ."
Such backup is more costly
than the processing of simpler languages.
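.PP
The backup described above can be sketched in C under simplifying
assumptions: the rules are literal strings rather than full regular
expressions, and the function name scan_token is hypothetical.
The scanner remembers the end of the longest complete match seen so far
and returns to it when no rule can be extended further.

```c
#include <string.h>

/* Hypothetical sketch, not Lex's generated code: returns the length of
 * the longest rule that matches at the start of input, or 0 if none. */
size_t scan_token(const char *const rules[], size_t nrules,
                  const char *input)
{
    size_t last_accept = 0;  /* end of the longest complete match seen */
    size_t pos = 0;          /* number of characters examined so far */
    int alive = 1;           /* can some rule still be extended? */

    while (alive) {
        alive = 0;
        for (size_t r = 0; r < nrules; r++) {
            size_t len = strlen(rules[r]);

            if (len < pos || strncmp(rules[r], input, pos) != 0)
                continue;              /* this rule has already failed */
            if (len == pos && pos > 0)
                last_accept = pos;     /* this rule matched completely */
            else if (len > pos && input[pos] != '\0' &&
                     rules[r][pos] == input[pos])
                alive = 1;             /* rule can take one more character */
        }
        if (alive)
            pos++;
    }
    /* pos may lie beyond last_accept; returning last_accept is the backup. */
    return last_accept;
}
```

With the rules ab and abcdefg and the input abcdefh, the sketch
examines six characters before the longer rule fails, then reports a
match of length two.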
.2C
.NH
Lex Source.
.PP
The general format of Lex source is:
.TS
center;
l.
{definitions}
%%
{rules}
%%
{user subroutines}
.TE
where the definitions and the user subroutines
are often omitted.
The second
.I %%
is optional, but the first is required
to mark the beginning of the rules.
The absolute minimum Lex program is thus
.TS
center;
l.
%%
.TE
(no definitions, no rules) which translates into a program
which copies the input to the output unchanged.
.PP
In the outline of Lex programs shown above, the
.I
rules
.R
represent the user's control
decisions; they are a table, in which the left column
contains
.I
regular expressions
.R
(see section 3)
and the right column contains
.I
actions,
.R
program fragments to be executed when the expressions
are recognized.
Thus an individual rule might appear
.TS
center;
l l.
integer printf("found keyword INT");
.TE
to look for the string
.I integer
in the input stream and
print the message ``found keyword INT'' whenever it appears.
In this example the host procedural language is C and
the C library function
.I
printf
.R
is used to print the string.
The end
of the expression is indicated by the first blank or tab character.
If the action is merely a single C expression,
it can just be given on the right side of the line; if it is
compound, or takes more than a line, it should be enclosed in
braces.
As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling.
Lex rules such as
.TS
center;
l l.
colour printf("color");
mechanise printf("mechanize");
petrol printf("gas");
.TE
would be a start. These rules are not quite enough,
since
the word
.I petroleum
would become
.I gaseum ;
a way of dealing
with this will be described later.
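.PP
The difficulty can be reproduced outside Lex with a naive substring
replacement in C.
This is only a sketch of the problem, not of the remedy; the function
name replace_all is hypothetical, and the caller is assumed to provide
a large enough output buffer.

```c
#include <string.h>

/* Hypothetical helper: copies src to dst, replacing each occurrence of
 * from by to wherever it starts, with no regard for word boundaries.
 * dst must be large enough to hold the result. */
void replace_all(const char *src, const char *from, const char *to,
                 char *dst)
{
    size_t flen = strlen(from), tlen = strlen(to);

    while (*src != '\0') {
        if (flen > 0 && strncmp(src, from, flen) == 0) {
            memcpy(dst, to, tlen);   /* emit the replacement */
            dst += tlen;
            src += flen;             /* skip the matched text */
        } else {
            *dst++ = *src++;         /* copy everything else */
        }
    }
    *dst = '\0';
}
```

Replacing petrol by gas in the text ``petrol and petroleum'' gives
``gas and gaseum'', which is the unwanted result noted above.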
.2C
.NH
Lex Regular Expressions.
.PP
The definitions of regular expressions are very similar to those
in QED [5].
A regular
expression specifies a set of strings to be matched.
It contains text characters (which match the corresponding
characters in the strings being compared)
and operator characters (which specify
repetitions, choices, and other features).
The letters of the alphabet and the digits are
always text characters; thus the regular expression
.TS
center;
l l.
integer
.TE
matches the string
.ul
integer
wherever it appears
and the expression
.TS
center;
l.
a57D
.TE
looks for the string
.ul
a57D.
.PP
.I
Operators.
.R
The operator characters are
.TS
center;
l.
" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
.TE
and if they are to be used as text characters, an escape
should be used.
The quotation mark operator (")
indicates that whatever is contained between a pair of quotes
is to be taken as text characters.
Thus
.TS
center;
l.
xyz"++"
.TE
matches the string
.I xyz++
when it appears. Note that a part of a string may be quoted.
It is harmless but unnecessary to quote an ordinary
text character; the expression
.TS
center;
l.
"xyz++"
.TE
is the same as the one above.
Thus by quoting every non-alphanumeric character
being used as a text character, the user can avoid remembering
the list above of current
operator characters, and is safe should further extensions to Lex
lengthen the list.
.PP
An operator character may also be turned into a text character
by preceding it with \e as in
.TS
center;
l.
xyz\e+\e+
.TE
which
is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into
an expression; normally, as explained above, blanks or tabs end
a rule.
Any blank character not contained within [\|] (see below) must
be quoted.
Several normal C escapes with \e
are recognized: \en is newline, \et is tab, and \eb is backspace.
To enter \e itself, use \e\e.
Since newline is illegal in an expression, \en must be used;
it is not
required to escape tab and backspace.
Every character but blank, tab, newline and the list above is always
a text character.
.PP
.I
Character classes.
.R
Classes of characters can be specified using the operator pair [\|].
The construction
.I [abc]
matches a
single character, which may be
.I a ,
.I b ,
or
.I c .
Within square brackets,
most operator meanings are ignored.
Only three characters are special:
these are \e \(mi and ^. The \(mi character
indicates ranges. For example,
.TS
center;
l.
[a\(miz0\(mi9<>_]
.TE
indicates the character class containing all the lower case letters,
the digits,
the angle brackets, and underline.
Ranges may be given in either order.
Using \(mi between any pair of characters which are
not both upper case letters, both lower case letters, or both digits
is implementation dependent and will get a warning message.
(E.g., [0\-z] in ASCII is many more characters
than it is in EBCDIC).
If it is desired to include the
character \(mi in a character class, it should be first or
last; thus
.TS
center;
l.
[\(mi+0\(mi9]
.TE
matches all the digits and the two signs.
.PP
In character classes,
the ^ operator must appear as the first character
after the left bracket; it indicates that the resulting string
is to be complemented with respect to the computer character set.
Thus
.TS
center;
l.
[^abc]
.TE
matches all characters except a, b, or c, including
all special or control characters; or
.TS
center;
l.
[^a\-zA\-Z]
.TE
is any character which is not a letter.
The \e character provides the usual escapes within
character class brackets.
.PP
.I
Arbitrary character.
.R
To match almost any character, use the operator character
.TS
center;
l.
\&.
.TE
which is the class of all characters except newline.
Escaping into octal is possible although non-portable:
.TS
center;
l.
[\e40\-\e176]
.TE
matches all printable characters in the ASCII character set, from octal
40 (blank) to octal 176 (tilde).
.PP
.I
Optional expressions.
.R
The operator
.I ?
indicates
an optional element of an expression.
Thus
.TS
center;
l.
ab?c
.TE
matches either
.I ac
or
.I abc .
.PP
.I
Repeated expressions.
.R
Repetitions of classes are indicated by the operators
.I \(**
and
.I + .
.TS
center;
l.
\f2a\(**\f1
.TE
is any number of consecutive
.I a
characters, including zero; while
.TS
center;
l.
a+
.TE
is one or more instances of
.I a .
For example,
.TS
center;
l.
[a\-z]+
.TE
is all strings of lower case letters.
And
.TS
center;
l.
[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
.TE
indicates all alphanumeric strings with a leading
alphabetic character.
This is a typical expression for recognizing identifiers in
computer languages.
.PP
.I
Alternation and Grouping.
.R
The operator |
indicates alternation:
.TS
center;
l.
(ab\||\|cd)
.TE
matches either
.ul
ab
or
.ul
cd.
Note that parentheses are used for grouping, although
they are
not necessary on the outside level;
.TS
center;
l.
ab\||\|cd
.TE
would have sufficed.
Parentheses
can be used for more complex expressions:
.TS
center;
l.
(ab\||\|cd+)?(ef)\(**
.TE
matches such strings as
.I abefef ,
.I efefef ,
.I cdef ,
or
.I cddd\| ;
but not
.I abc ,
.I abcd ,
or
.I abcdef .
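.PP
The examples just listed can be checked with the POSIX regular
expression routines regcomp and regexec, whose extended syntax agrees
with Lex for the operators used here.
This is a sketch that stands in for Lex rather than showing its
implementation: the function name matches_whole is hypothetical, and
the anchors and outer parentheses are added on the assumption that each
example is to be matched as a whole string.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole of s matches (ab|cd+)?(ef)*, 0 if it does
 * not, and -1 if the pattern fails to compile. */
int matches_whole(const char *s)
{
    regex_t re;
    int rc;

    /* ^ and $ force the expression to cover the entire string. */
    if (regcomp(&re, "^((ab|cd+)?(ef)*)$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Since both parts of the expression are optional, the empty string
also matches.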
.PP
.I
Context sensitivity.
.R
Lex will recognize a small amount of surrounding
context. The two simplest operators for this are
.I ^
and
.I $ .
If the first character of an expression is
.I ^ ,
the expression will only be matched at the beginning
of a line (after a newline character, or at the beginning of
the input stream).
This can never conflict with the other meaning of
.I ^ ,
complementation
of character classes, since that only applies within
the [\|] operators.
If the very last character is
.I $ ,
the expression will only be matched at the end of a line (when
immediately followed by newline).
The latter operator is a special case of the
.I /
operator character,
which indicates trailing context.
The expression
.TS
center;
l.
ab/cd
.TE
matches the string
.I ab ,
but only if followed by
.ul
cd.
Thus
.TS
center;
l.
ab$
.TE
is the same as
.TS
center;
l.
ab/\en
.TE
Left context is handled in Lex by
.I
start conditions
.R
as explained in section 10. If a rule is only to be executed
when the Lex automaton interpreter is in start condition
.I
x,
.R
the rule should be prefixed by
.TS
center;
l.
<x>
.TE
using the angle bracket operator characters.
If we considered ``being at the beginning of a line'' to be
start condition
.I ONE ,
then the ^ operator
would be equivalent to
.TS
center;
l.
<ONE>
.TE
Start conditions are explained more fully later.
.PP
.I
Repetitions and Definitions.
.R
The operators {} specify
either repetitions (if they enclose numbers)
or
definition expansion (if they enclose a name). For example
.TS
center;
l.
{digit}
.TE
looks for a predefined string named
.I digit
and inserts it
at that point in the expression.
The definitions are given in the first part of the Lex
input, before the rules.
In contrast,
.TS
center;
l.
a{1,5}
.TE
looks for 1 to 5 occurrences of
.I a .
.PP
Finally, initial
.I %
is special, being the separator
for Lex source segments.
.2C
.NH
Lex Actions.
.PP
When an expression written as above is matched, Lex
executes the corresponding action. This section describes
some features of Lex which aid in writing actions. Note
that there is a default action, which
consists of copying the input to the output. This
is performed on all strings not otherwise matched. Thus
the Lex user who wishes to absorb the entire input, without
producing any output, must provide rules to match everything.
When Lex is being used with Yacc, this is the normal
situation.
One may consider that actions are what is done instead of
copying the input to the output; thus, in general,
a rule which merely copies can be omitted.
Also, a character combination
which is omitted from the rules
and which appears as input
is likely to be printed on the output, thus calling
attention to the gap in the rules.
.PP
One of the simplest things that can be done is to ignore
the input. Specifying a C null statement, \fI;\fR, as an action
causes this result. A frequent rule is
.TS
center;
l l.
[ \et\en] ;
.TE
which causes the three spacing characters (blank, tab, and newline)
to be ignored.
.PP
Another easy way to avoid writing actions is the action character
|, which indicates that the action for this rule is the action
for the next rule.
The previous example could also have been written
.TS
center;
l l.
" " |
"\et" |
"\en" ;
.TE
with the same result, although in different style.
The quotes around \en and \et are not required.
.PP
In more complex actions, the user
will
often want to know the actual text that matched some expression
like
.I [a\(miz]+ .
Lex leaves this text in an external character
array named
.I
yytext.
.R
Thus, to print the name found,
a rule like
.TS
center;
l l.
[a\-z]+ printf("%s", yytext);
.TE
will print
the string in
.I
yytext.
.R
The C function
.I
printf
.R
accepts a format argument and data to be printed;
in this case, the format is ``print string'' (% indicating
data conversion, and
.I s
indicating string type),
and the data are the characters
in
.I
yytext.
.R
So this just places
the matched string
on the output.
This action
is so common that
it may be written as ECHO:
.TS
center;
l l.
[a\-z]+ ECHO;
.TE
is the same as the above.
Since the default action is just to
print the characters found, one might ask why
give a rule, like this one, which merely specifies
the default action?
Such rules are often required
to avoid matching some other rule
which is not desired. For example, if there is a rule
which matches
.I read
it will normally match the instances of
.I read
contained in
.I bread
or
.I readjust ;
to avoid
this,
a rule
of the form
.I [a\(miz]+
is needed.
This is explained further below.
.PP
Sometimes it is more convenient to know the end of what
has been found; hence Lex also provides a count
.I
yyleng
.R
of the number of characters matched.
To count both the number
of words and the number of characters in words in the input, the user might write
.TS
center;
l l.
[a\-zA\-Z]+ {words++; chars += yyleng;}
.TE
which accumulates in
.ul
chars
the number
of characters in the words recognized.
The last character in the string matched can
be accessed by
.TS
center;
l.
yytext[yyleng\-1]
.TE
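.PP
The bookkeeping done by the word-counting rule can be sketched in plain C.
This is a minimal illustration, not code produced by Lex:
the function \fIcount_words\fR and the structure \fIword_stats\fR are hypothetical names,
and the local variable \fIlen\fR plays the role of \fIyyleng\fR.
Unlike a Lex program, which would copy unmatched characters to the output,
the sketch simply skips them.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical container for the two counters kept by the rule's action. */
struct word_stats {
    long words;   /* number of maximal letter runs, one per rule match */
    long chars;   /* total letters in those runs, the sum of yyleng */
};

/* Imitates the rule  [a-zA-Z]+  {words++; chars += yyleng;}  on a string. */
struct word_stats count_words(const char *text)
{
    struct word_stats st = {0, 0};
    size_t i = 0;

    while (text[i] != '\0') {
        if (isalpha((unsigned char)text[i])) {
            size_t len = 0;              /* plays the role of yyleng */
            while (isalpha((unsigned char)text[i + len]))
                len++;
            st.words++;
            st.chars += (long)len;
            i += len;                    /* the whole match is consumed */
        } else {
            i++;                         /* anything else is skipped here */
        }
    }
    return st;
}
```

.PP
Called on ``the quick, brown fox'', the function reports 4 words and 16 characters in words.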
.PP
Occasionally, a Lex
action may decide that a rule has not recognized the correct
span of characters.
Two routines are provided to aid with this situation.
First,
.I
yymore()
.R
can be called to indicate that the next input expression recognized is to be
tacked on to the end of this input. Normally,
the next input string would overwrite the current
entry in
.I
yytext.
.R
Second,
.I
yyless (n)
.R
may be called to indicate that not all the characters matched
by the currently successful expression are wanted right now.
The argument
.I
n
.R
indicates the number of characters
in
.I
yytext
.R
to be retained.
Further characters previously matched
are
returned to the input. This provides the same sort of
look~ahead offered by the / operator,
but in a different form.
.PP
.I
Example:
.R
Consider a language which defines
a string as a set of characters between quotation (") marks, and provides that
to include a " in a string it must be preceded by a \e. The
regular expression which matches that is somewhat confusing,
so that it might be preferable to write
.TS
center;
l l.
\e"[^"]\(** {
if (yytext[yyleng\-1] == \(fm\e\e\(fm)
yymore();
else
... normal user processing
}
.TE
which will, when faced with a string such as
.I
"abc\e"def\|"
.R
first match
the five characters
\fI"abc\e\|\fR;
then
the call to
.I yymore()
will
cause the next part of the string,
\fI"def\|\fR,
to be tacked on the end.
Note that the final quote terminating the string should be picked
up in the code labeled ``normal user processing''.
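.PP
The accumulation performed by
.I yymore()
in this example can be imitated in C.
The sketch below is an illustration under stated assumptions, not the code Lex generates:
\fIscan_quoted\fR is a hypothetical helper, \fIbuf\fR stands in for \fIyytext\fR,
and \fIlen\fR for \fIyyleng\fR.
Like the rule it imitates, it leaves the closing quote unread.

```c
#include <stddef.h>

/*
 * Hypothetical helper.  s must begin with a double quote.  The text
 * accumulated by repeated matches of  \"[^"]*  is copied into buf
 * (opening quote included, closing quote excluded).  Returns the
 * accumulated length, or 0 if s does not start with a quote or buf
 * is too small.
 */
size_t scan_quoted(const char *s, char *buf, size_t bufsize)
{
    size_t len = 0;   /* plays the role of yyleng */
    size_t i = 0;     /* read position in s */

    if (s[0] != '"')
        return 0;

    do {
        /* One match of  \"[^"]* : a quote, then everything up to the next quote. */
        do {
            if (len + 1 >= bufsize)
                return 0;
            buf[len++] = s[i++];
        } while (s[i] != '\0' && s[i] != '"');
        /* The yymore() case: the match ended in a backslash and a quote follows. */
    } while (buf[len - 1] == '\\' && s[i] == '"');

    buf[len] = '\0';
    return len;
}
```

.PP
For the ten-character input used above, the sketch first collects the five characters up to the backslash,
then tacks on the next four, giving an accumulated length of 9.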
.PP
The function
.I
yyless()
.R
might be used to reprocess
text in various circumstances. Consider the problem in C of resolving
the ambiguity of ``=\(mia''.
Suppose it is desired to treat this as ``=\(mi a''
but print a message. A rule might be
.ps 9
.vs 11
.TS
center;
l l.
=\(mi[a\-zA\-Z] {
printf("Op (=\(mi) ambiguous\en");
yyless(yyleng\-1);
... action for =\(mi ...
}
.TE
.ps 10
.vs 12
which prints a message, returns the letter after the
operator to the input stream, and treats the operator as ``=\(mi''.
Alternatively it might be desired to treat this as ``= \(mia''.
To do this, just return the minus
sign as well as the letter to the input:
.ps 9
.vs 11
.TS
center;
l l.
=\(mi[a\-zA\-Z] {
printf("Op (=\(mi) ambiguous\en");
yyless(yyleng\-2);
... action for = ...
}
.TE
.ps 10
.vs 12
This performs the other interpretation.
Note that the expressions for the two cases might more easily
be written
.TS
center;
l l.
=\(mi/[A\-Za\-z]
.TE
in the first case and
.TS
center;
l.
=/\-[A\-Za\-z]
.TE
in the second;
no backup would be required in the rule action.
It is not necessary to recognize the whole identifier
to observe the ambiguity.
The
possibility of ``=\(mi3'', however, makes
.TS
center;
l.
=\(mi/[^ \et\en]
.TE
a still better rule.
.PP
In addition to these routines, Lex also permits
access to the I/O routines
it uses.
They are:
.IP 1)
.I
input()
.R
which returns the next input character;
.IP 2)
.I
output(c)
.R
which writes the character
.I
c
.R
on the output; and
.IP 3)
.I
unput(c)
.R
which pushes the character
.I
c
.R
back onto the input stream to be read later by
.I
input().
.R
.LP
By default these routines are provided as macro definitions,
but the user can override them and supply private versions.
These routines
define the relationship between external files and
internal characters, and must all be retained
or modified consistently.
They may be redefined, to
cause input or output to be transmitted to or from strange
places, including other programs or internal memory;
but the character set used must be consistent in all routines;
a value of zero returned by
.I
input
.R
must mean end of file; and
the relationship between
.I
unput
.R
and
.I
input
.R
must be retained
or the Lex look~ahead will not work.
Lex does not look ahead at all if it does not have to,
but every rule ending in
.ft I
+ \(** ?
.ft R
or
.ft I
$
.ft R
or containing
.ft I
/
.ft R
implies look~ahead.
Look~ahead is also necessary to match an expression that is a prefix
of another expression.
See below for a discussion of the character set used by Lex.
The standard Lex library imposes
a 100 character limit on backup.
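.PP
As an illustration of private I/O routines that read from internal memory,
here is a minimal sketch in C.
The names \fImy_input\fR, \fImy_unput\fR and \fIset_source\fR are hypothetical;
a real replacement would redefine
.I input
and
.I unput
themselves.
The sketch keeps the two properties required above:
a value of zero means end of file, and characters pushed back are read again in last-in, first-out order.

```c
#define PUSHBACK_MAX 100   /* mirrors the library's 100 character backup limit */

static const char *src = "";          /* hypothetical in-memory input */
static char pushback[PUSHBACK_MAX];   /* characters returned by my_unput */
static int npushed = 0;

/* Selects a new in-memory source and discards any pushed-back characters. */
void set_source(const char *text)
{
    src = text;
    npushed = 0;
}

/* Returns the next input character, or 0 at end of input. */
int my_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Pushes c back onto the input; returns 0 on success, -1 if the buffer is full. */
int my_unput(int c)
{
    if (npushed >= PUSHBACK_MAX)
        return -1;
    pushback[npushed++] = (char)c;
    return 0;
}
```

.PP
Because zero is reserved for end of input, a source containing null characters cannot be read this way.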
.PP
Another Lex library routine that the user will sometimes want
to redefine is
.I
yywrap()
.R
which is called whenever Lex reaches an end-of-file.
If
.I
yywrap
.R
returns a 1, Lex continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more
input to arrive
from a new source.
In this case, the user should provide
a
.I
yywrap
.R
which
arranges for new input and
returns 0. This instructs Lex to continue processing.
The default
.I
yywrap
.R
always returns 1.
.PP
This routine is also a convenient place
to print tables, summaries, etc. at the end
of a program. Note that it is not
possible to write a normal rule which recognizes
end-of-file; the only access to this condition is
through
.I
yywrap.
.R
In fact, unless a private version of
.I
input()
.R
is supplied
a file containing nulls
cannot be handled,
since a value of 0 returned by
.I
input
.R
is taken to be end-of-file.
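.PP
The protocol between end of input and
.I yywrap
can be sketched in C as follows.
This is an assumption-laden illustration rather than the Lex library code:
\fImy_yywrap\fR, \fIset_sources\fR and \fInext_char\fR are hypothetical names,
and the inputs are in-memory strings rather than files.

```c
#include <stddef.h>

static const char *const *sources;   /* hypothetical list of in-memory inputs */
static size_t nsources = 0;
static size_t cur = 0;               /* index of the source being read */
static const char *pos = "";         /* read position in the current source */

/* Installs a list of n sources and starts reading the first one. */
void set_sources(const char *const *list, size_t n)
{
    sources = list;
    nsources = n;
    cur = 0;
    pos = (n > 0) ? list[0] : "";
}

/* Returns 1 to request normal wrapup, or 0 after arranging for more input. */
int my_yywrap(void)
{
    if (cur + 1 >= nsources)
        return 1;
    pos = sources[++cur];
    return 0;
}

/* Returns the next character; 0 only when my_yywrap reports no more input. */
int next_char(void)
{
    while (*pos == '\0') {
        if (my_yywrap())
            return 0;
    }
    return (unsigned char)*pos++;
}
```

.PP
A version of \fImy_yywrap\fR that always returned 0 without supplying new input would make \fInext_char\fR loop forever,
which is why the wrapup routine must eventually return 1.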
.PP
.2C
.NH
Ambiguous Source Rules.
.PP
Lex can handle ambiguous specifications.
When more than one expression can match the
current input, Lex chooses as follows:
.IP 1)
The longest match is preferred.
.IP 2)
Among rules which matched the same number of characters,
the rule given first is preferred.
.LP
Thus, suppose the rules
.TS
center;
l l.
integer keyword action ...;
[a\-z]+ identifier action ...;
.TE
to be given in that order. If the input is
.I integers ,
it is taken as an identifier, because
.I [a\-z]+
matches 8 characters while
.I integer
matches only 7.
If the input is
.I integer ,
both rules match 7 characters, and
the keyword rule is selected because it was given first.
Anything shorter (e.g. \fIint\fR\|) will
not match the expression
.I integer
and so the identifier interpretation is used.
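.PP
The two disambiguation principles can be written out in C for this pair of rules.
The sketch is illustrative only: the enumeration and the functions
\fImatch_keyword\fR, \fImatch_identifier\fR and \fIclassify\fR are hypothetical,
and the letter test assumes a character set in which the lower case letters are contiguous, as in ASCII.

```c
#include <stddef.h>
#include <string.h>

enum token { TOK_NONE, TOK_KEYWORD, TOK_IDENTIFIER };

/* Length matched by the literal rule  integer  at the start of s, or 0. */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched by  [a-z]+  at the start of s, or 0. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Chooses a rule as described above and stores the match length in *matched. */
enum token classify(const char *s, size_t *matched)
{
    static const enum token kind[2] = { TOK_KEYWORD, TOK_IDENTIFIER };
    size_t len[2];
    enum token best = TOK_NONE;
    size_t best_len = 0;
    int i;

    len[0] = match_keyword(s);      /* rules in the order they were given */
    len[1] = match_identifier(s);
    for (i = 0; i < 2; i++) {
        /* A strictly longer match wins; on a tie the earlier rule is kept. */
        if (len[i] > best_len) {
            best_len = len[i];
            best = kind[i];
        }
    }
    *matched = best_len;
    return best;
}
```

.PP
The strict comparison in the loop is what implements the second principle:
a later rule can displace an earlier one only by matching more characters.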
.PP
The principle of preferring the longest
match makes rules containing
expressions like
.I \&.\(**
dangerous.
For example,
.TS
center;
l.
\&\(fm.\(**\(fm
.TE
might seem a good way of recognizing
a string in single quotes.
But it is an invitation for the program to read far
ahead, looking for a distant
single quote.
Presented with the input
.TS
center;
l l.
\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
.TE
the above expression will match
.TS
center;
l l.
\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
.TE
which is probably not what was wanted.
A better rule is of the form
.TS
center;
l.
\&\(fm[^\(fm\en]\(**\(fm
.TE
which, on the above input, will stop
after
.I \(fmfirst\(fm .
The consequences
of errors like this are mitigated by the fact
that the
.I \&.
operator will not match newline.
Thus expressions like
.I \&.\(**
stop on the
current line.
Don't try to defeat this with expressions like
.I (.|\en)+
or
equivalents;
the Lex generated program will try to read
the entire input file, causing
internal buffer overflows.
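.PP
The difference between the two quoted-string expressions can be checked with a small C sketch.
The functions \fImatch_greedy\fR and \fImatch_restricted\fR are hypothetical hand-written matchers,
not Lex output; each returns the number of characters its expression would match at the start of a string.

```c
#include <stddef.h>

/*
 * Imitates  '.*'  : the longest match runs to the LAST single quote
 * on the current line.  Returns the match length, or 0 if none.
 */
size_t match_greedy(const char *s)
{
    size_t i, last = 0;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++) {
        if (s[i] == '\'')
            last = i;            /* remember the most distant quote so far */
    }
    return last ? last + 1 : 0;
}

/*
 * Imitates  '[^'\n]*'  : the match ends at the FIRST closing quote.
 * Returns the match length, or 0 if none.
 */
size_t match_restricted(const char *s)
{
    size_t i;

    if (s[0] != '\'')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n' && s[i] != '\''; i++)
        ;
    return s[i] == '\'' ? i + 1 : 0;
}
```

.PP
On the sample input above, the first function runs through the quote that closes \fIsecond\fR,
while the second stops after the seven characters of the first quoted word.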
.PP
Note that Lex is normally partitioning
the input stream, not searching for all possible matches
of each expression.
This means that each character is accounted for
once and only once.
For example, suppose it is desired to
count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
Some Lex rules to do this might be
.TS
center;
l l.
she s++;
he h++;
\en |
\&. ;
.TE
where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
Remember that . does not include newline.
Since \fIshe\fR includes \fIhe\fR, Lex will normally
.I
not
.R
recognize
the instances of \fIhe\fR included in \fIshe\fR,
since once it has passed a \fIshe\fR those characters are gone.
.PP
Sometimes the user would like to override this choice. The action
REJECT
means ``go do the next alternative.''
It causes whatever rule was second choice after the current
rule to be executed.
The position of the input pointer is adjusted accordingly.
Suppose the user really wants to count the included instances of \fIhe\fR:
.TS
center;
l l.
she {s++; REJECT;}
he {h++; REJECT;}
\en |
\&. ;
.TE
these rules are one way of changing the previous example
to do just that.
After counting each expression, it is rejected; whenever appropriate,
the other expression will then be counted. In this example, of course,
the user could note that \fIshe\fR includes \fIhe\fR but not
vice versa, and omit the REJECT action on \fIhe\fR;
in other cases, however, it
would not be possible a priori to tell
which input characters
were in both classes.
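.PP
The contrast between partitioning the input and counting every occurrence can be sketched in C.
Both functions below are hypothetical illustrations: \fIcount_partition\fR imitates the rules without REJECT,
and \fIcount_overlapping\fR imitates the effect of the rules with REJECT.

```c
#include <string.h>

struct counts {
    int she;   /* plays the role of s in the rules */
    int he;    /* plays the role of h in the rules */
};

/* Partitioning: each character is consumed by exactly one match. */
struct counts count_partition(const char *s)
{
    struct counts c = {0, 0};

    while (*s != '\0') {
        if (strncmp(s, "she", 3) == 0) {        /* longest match first */
            c.she++;
            s += 3;
        } else if (strncmp(s, "he", 2) == 0) {
            c.he++;
            s += 2;
        } else {
            s++;                                /* the rules for \n and . */
        }
    }
    return c;
}

/* REJECT style: every occurrence is counted, including he inside she. */
struct counts count_overlapping(const char *s)
{
    struct counts c = {0, 0};

    for (; *s != '\0'; s++) {
        if (strncmp(s, "she", 3) == 0)
            c.she++;
        if (strncmp(s, "he", 2) == 0)
            c.he++;
    }
    return c;
}
```

.PP
On the input ``she said he'', the first function finds one \fIshe\fR and one \fIhe\fR;
the second finds one \fIshe\fR and two instances of \fIhe\fR.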
.PP
Consider the two rules
.TS
center;
l l.
a[bc]+ { ... ; REJECT;}
a[cd]+ { ... ; REJECT;}
.TE
If the input is
.I ab ,
only the first rule matches,
and on
.I ad
only the second matches.
The input string
.I accb
matches the first rule for four characters
and then the second rule for three characters.
In contrast, the input
.I accd
agrees with
the second rule for four characters and then the first
rule for three.
.PP
In general, REJECT is useful whenever
the purpose of Lex is not to partition the input
stream but to detect all examples of some items
in the input, and the instances of these items
may overlap or include each other.
Suppose a digram table of the input is desired;
normally the digrams overlap, that is the word
.I the
is considered to contain
both
.I th
and
.I he .
Assuming a two-dimensional array named
.ul
digram
to be incremented, the appropriate
source is
.TS
center;
l l.
%%
[a\-z][a\-z] {
digram[yytext[0]][yytext[1]]++;
REJECT;
}
\. ;
\en ;
.TE
where the REJECT is necessary to pick up
a letter pair beginning at every character, rather than at every
other character.
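.PP
The table built by this rule can be reproduced by a short C function.
The sketch is hypothetical and differs from the rule in one labeled respect:
it subtracts the letter \fIa\fR from each index so that a 26 by 26 table suffices,
which assumes contiguous lower case letters as in ASCII.

```c
#include <string.h>

#define NLETTERS 26

/*
 * Counts every adjacent pair of lower case letters in s, starting a
 * pair at every character so that overlapping pairs are included.
 */
void count_digrams(const char *s, long digram[NLETTERS][NLETTERS])
{
    size_t i;

    memset(digram, 0, sizeof(long) * NLETTERS * NLETTERS);
    for (i = 0; s[i] != '\0' && s[i + 1] != '\0'; i++) {
        if (s[i] >= 'a' && s[i] <= 'z' &&
            s[i + 1] >= 'a' && s[i + 1] <= 'z')
            digram[s[i] - 'a'][s[i + 1] - 'a']++;
    }
}
```

.PP
Advancing by one character per step corresponds to the REJECT in the rule;
advancing by two after each pair would count only every other digram.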
.2C
.NH
Lex Source Definitions.
.PP
Remember the format of the Lex
source:
.TS
center;
l.
{definitions}
%%
{rules}
%%
{user routines}
.TE
So far only the rules have been described. The user needs
additional options,
though, to define variables for use in his program and for use
by Lex.
These can go either in the definitions section
or in the rules section.
.PP
Remember that Lex is turning the rules into a program.
Any source not intercepted by Lex is copied
into the generated program. There are three classes
of such things.
.IP 1)
Any line which is not part of a Lex rule or action
which begins with a blank or tab is copied into
the Lex generated program.
Such source input prior to the first %% delimiter will be external
to any function in the code; if it appears immediately after the first
%%,
it appears in an appropriate place for declarations
in the function written by Lex which contains the actions.
This material must look like program fragments,
and should precede the first Lex rule.
.IP
As a side effect of the above, lines which begin with a blank
or tab, and which contain a comment,
are passed through to the generated program.
This can be used to include comments in either the Lex source or
the generated code. The comments should follow the host
language convention.
.IP 2)
Anything included between lines containing
only
.I %{
and
.I %}
is
copied out as above. The delimiters are discarded.
This format permits entering text like preprocessor statements that
must begin in column 1,
or copying lines that do not look like programs.
.IP 3)
Anything after the second %% delimiter, regardless of formats, etc.,
is copied out after the Lex output.
.PP
Definitions intended for Lex are given
before the first %% delimiter. Any line in this section
not contained between %{ and %}, and beginning
in column 1, is assumed to define Lex substitution strings.
The format of such lines is
.TS
center;
l l.
name translation
.TE
and it
causes the string given as a translation to
be associated with the name.
The name and translation
must be separated by at least one blank or tab, and the name must begin with a letter.
The translation can then be called out
by the {name} syntax in a rule.
Using {D} for the digits and {E} for an exponent field,
for example, might abbreviate rules to recognize numbers:
.TS
center;
l l.
D [0\-9]
E [DEde][\-+]?{D}+
%%
{D}+ printf("integer");
{D}+"."{D}\(**({E})? |
{D}\(**"."{D}+({E})? |
{D}+{E} printf("real");
.TE
Note the first two rules for real numbers;
both require a decimal point and contain
an optional exponent field,
but the first requires at least one digit before the
decimal point and the second requires at least one
digit after the decimal point.
To correctly handle the problem
posed by a Fortran expression such as
.I 35.EQ.I ,
which does not contain a real number, a context-sensitive
rule such as
.TS
center;
l l.
[0\-9]+/"."EQ printf("integer");
.TE
could be used in addition to the normal rule for integers.
.PP
The definitions
section may also contain other commands, including the
selection of a host language, a character set table,
a list of start conditions, or adjustments to the default
size of arrays within Lex itself for larger source programs.
These possibilities
are discussed below under ``Summary of Source Format,''
section 12.
.2C
.NH
Usage.
.PP
There are two steps in
compiling a Lex source program.
First, the Lex source must be turned into a generated program
in the host general purpose language.
Then this program must be compiled and loaded, usually with
a library of Lex subroutines.
The generated program
is on a file named
.I lex.yy.c .
The I/O library is defined in terms of the C standard
library [6].
.PP
The C programs generated by Lex are slightly different
on OS/370, because the
OS compiler is less powerful than the UNIX or GCOS compilers,
and does less at compile time.
C programs generated on GCOS and UNIX are the same.
.PP
.I
UNIX.
.R
The library is accessed by the loader flag
.I \-ll .
So an appropriate
set of commands is
.KS
.in 5
lex source
cc lex.yy.c \-ll
.in 0
.KE
The resulting program is placed on the usual file
.I
a.out
.R
for later execution.
To use Lex with Yacc see below.
Although the default Lex I/O routines use the C standard library,
the Lex automata themselves do not do so;
if private versions of
.I
input,
output
.R
and
.I unput
are given, the library can be avoided.
.PP
.2C
.NH
Lex and Yacc.
.PP
If you want to use Lex with Yacc, note that what Lex writes is a program
named
.I
yylex(),
.R
the name required by Yacc for its analyzer.
Normally, the default main program on the Lex library
calls this routine, but if Yacc is loaded, and its main
program is used, Yacc will call
.I
yylex().
.R
In this case each Lex rule should end with
.TS
center;
l.
return(token);
.TE
where the appropriate token value is returned.
An easy way to get access
to Yacc's names for tokens is to
compile the Lex output file as part of
the Yacc output file by placing the line
.TS
center;
l.
# include "lex.yy.c"
.TE
in the last section of Yacc input.
Supposing the grammar to be
named ``good'' and the lexical rules to be named ``better''
the UNIX command sequence can just be:
.TS
center;
l.
yacc good
lex better
cc y.tab.c \-ly \-ll
.TE
The Yacc library (\-ly) should be loaded before the Lex library,
to obtain a main program which invokes the Yacc parser.
The generation of Lex and Yacc programs can be done in
either order.
.2C
.NH
Examples.
.PP
As a trivial problem, consider copying an input file while
adding 3 to every positive number divisible by 7.
Here is a suitable Lex source program
.TS
center;
l l.
%%
int k;
[0\-9]+ {
k = atoi(yytext);
if (k%7 == 0)
printf("%d", k+3);
else
printf("%d",k);
}
.TE
to do just that.
The rule [0\-9]+ recognizes strings of digits;
.I
atoi
.R
converts the digits to binary
and stores the result in
.ul
k.
The operator % (remainder) is used to check whether
.ul
k
is divisible by 7; if it is,
it is incremented by 3 as it is written out.
It may be objected that this program will alter such
input items as
.I 49.63
or
.I X7 .
Furthermore, it increments the absolute value
of all negative numbers divisible by 7.
To avoid this, just add a few more rules after the active one,
as here:
.TS
center;
l l.
%%
int k;
\-?[0\-9]+ {
k = atoi(yytext);
printf("%d",
k%7 == 0 ? k+3 : k);
}
\-?[0\-9.]+ ECHO;
[A\-Za\-z][A\-Za\-z0\-9]+ ECHO;
.TE
Numerical strings containing
a ``.'' or preceded by a letter will be picked up by
one of the last two rules, and not changed.
The
.I if\-else
has been replaced by
a C conditional expression to save space;
the form
.ul
a?b:c
means ``if
.I a
then
.I b
else
.I c ''.
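.PP
The combined effect of the three rules, including the preference for the longest match, can be imitated in C.
This is a sketch under stated assumptions, not the program Lex would generate:
all function names are hypothetical, the output goes to a caller-supplied buffer instead of the standard output,
and \fIsnprintf\fR from the modern C library is used for bounded writes.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Length of  -?[0-9]+  at the start of s, or 0. */
static size_t match_int(const char *s)
{
    size_t start = (s[0] == '-') ? 1 : 0, n = start;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n > start ? n : 0;
}

/* Length of  -?[0-9.]+  at the start of s, or 0. */
static size_t match_dotted(const char *s)
{
    size_t start = (s[0] == '-') ? 1 : 0, n = start;
    while (isdigit((unsigned char)s[n]) || s[n] == '.')
        n++;
    return n > start ? n : 0;
}

/* Length of  [A-Za-z][A-Za-z0-9]+  at the start of s, or 0. */
static size_t match_name(const char *s)
{
    size_t n = 1;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n >= 2 ? n : 0;
}

/* Copies in to out, adding 3 to integers divisible by 7; 0 on success, -1 if out is too small. */
int add3_to_multiples_of_7(const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';
    while (*in != '\0') {
        size_t li = match_int(in), ld = match_dotted(in), ln = match_name(in);
        int w;

        if (li > 0 && li >= ld && li >= ln) {   /* first rule wins ties */
            long k = atol(in);
            w = snprintf(out + used, outsize - used, "%ld",
                         k % 7 == 0 ? k + 3 : k);
            in += li;
        } else {
            size_t len = ld > ln ? ld : ln;
            if (len == 0)
                len = 1;                        /* default action: copy one character */
            w = snprintf(out + used, outsize - used, "%.*s", (int)len, in);
            in += len;
        }
        if (w < 0 || (size_t)w >= outsize - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

.PP
Given ``14 49.63 X7 \-21 8'' the sketch produces ``17 49.63 X7 \-18 8'':
only the items matched by the first rule are altered.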
.PP
For an example of statistics gathering, here
is a program which histograms the lengths
of words, where a word is defined as a string of letters.
.TS
center;
l l.
int lengs[100];
%%
[a\-z]+ lengs[yyleng]++;
\&. |
\en ;
%%
.T&
l s.
yywrap()
{
int i;
printf("Length No. words\en");
for(i=0; i<100; i++)
if (lengs[i] > 0)
printf("%5d%10d\en",i,lengs[i]);
return(1);
}
.TE
This program
accumulates the histogram, while producing no output. At the end
of the input it prints the table.
The final statement
.I
return(1);
.R
indicates that Lex is to perform wrapup. If
.I
yywrap
.R
returns zero (false)
it implies that further input is available
and the program is
to continue reading and processing.
To provide a
.I
yywrap
.R
that never
returns true causes an infinite loop.
.PP
As a larger example,
here are some parts of a program written by N. L. Schryer
to convert double precision Fortran to single precision Fortran.
Because Fortran does not distinguish upper and lower case letters,
this routine begins by defining a set of classes including
both cases of each letter:
.TS
center;
l l.
a [aA]
b [bB]
c [cC]
\&...
z [zZ]
.TE
An additional class recognizes white space:
.TS
center;
l l.
W [ \et]\(**
.TE
The first rule changes
``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
.TS
center;
l.
{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
}
.TE
Care is taken throughout this program to preserve the case
(upper or lower)
of the original program.
The conditional operator is used to
select the proper form of the keyword.
The next rule copies continuation card indications to
avoid confusing them with constants:
.TS
center;
l l.
^"     "[^ 0] ECHO;
.TE
In the regular expression, the quotes surround the
blanks.
It is interpreted as
``beginning of line, then five blanks, then
anything but blank or zero.''
Note the two different meanings of
.I ^ .
There follow some rules to change double precision
constants to ordinary floating constants.
.TS
center;
l.
[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
/\(** convert constants \(**/
for(p=yytext; \(**p != 0; p++)
{
if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
\(**p += \(fme\(fm \- \(fmd\(fm;
}
ECHO;
}
.TE
After the floating point constant is recognized, it is
scanned by the
.ul
for
loop
to find the letter
.I d
or
.I D .
The program then adds
.c
.I \(fme\(fm\-\(fmd\(fm ,
which converts
it to the next letter of the alphabet.
The modified constant, now single-precision,
is written out again.
There follows a series of names which must be respelled to remove
their initial \fId\fR.
By using the
array
.I
yytext
.R
the same action suffices for all the names (only a sample of
a rather long list is given here).
.TS
center;
l l.
{d}{s}{i}{n} |
{d}{c}{o}{s} |
{d}{s}{q}{r}{t} |
{d}{a}{t}{a}{n} |
\&...
{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
.TE
Another list of names must have initial \fId\fR changed to initial \fIa\fR:
.TS
center;
l l.
{d}{l}{o}{g} |
{d}{l}{o}{g}10 |
{d}{m}{i}{n}1 |
{d}{m}{a}{x}1 {
yytext[0] += \(fma\(fm \- \(fmd\(fm;
ECHO;
}
.TE
And one routine
must have initial \fId\fR changed to initial \fIr\fR:
.TS
center;
l l.
{d}1{m}{a}{c}{h} {yytext[0] += \(fmr\(fm \- \(fmd\(fm;
ECHO;
}
.TE
To avoid such names as \fIdsinx\fR being detected as instances
of \fIdsin\fR, some final rules pick up longer words as identifiers
and copy some surviving characters:
.TS
center;
l l.
[A\-Za\-z][A\-Za\-z0\-9]\(** |
[0\-9]+ |
\en |
\&. ECHO;
.TE
Note that this program is not complete; it
does not deal with the spacing problems in Fortran or
with the use of keywords as identifiers.
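.PP
The character arithmetic used in the constant rule of this example can be tried in isolation.
The following C function is a hypothetical stand-alone version of that action,
operating on a caller-supplied string instead of \fIyytext\fR;
it relies on \fIe\fR directly following \fId\fR in the character set, in both cases.

```c
/*
 * Rewrites a double precision constant as a single precision one,
 * in place, by replacing the exponent letter d or D with e or E.
 */
void single_precision(char *constant)
{
    char *p;

    for (p = constant; *p != '\0'; p++) {
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';     /* next letter: d becomes e, D becomes E */
    }
}
```

.PP
Because the same offset is added in either case, the upper or lower case of the original letter is preserved:
``1.5d+10'' becomes ``1.5e+10'' and ``3D0'' becomes ``3E0''.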
.br
.2C
.NH
Left Context Sensitivity.
.PP
Sometimes
it is desirable to have several sets of lexical rules
to be applied at different times in the input.
For example, a compiler preprocessor might distinguish
preprocessor statements and analyze them differently
from ordinary statements.
This requires
sensitivity
to prior context, and there are several ways of handling
such problems.
The \fI^\fR operator, for example, is a prior context operator,
recognizing immediately preceding left context just as \fI$\fR recognizes
immediately following right context.
Adjacent left context could be extended, to produce a facility similar to
that for adjacent right context, but it is unlikely
to be as useful, since often the relevant left context
appeared some time earlier, such as at the beginning of a line.
.PP
This section describes three means of dealing
with different environments: a simple use of flags,
when only a few rules change from one environment to another,
the use of
.I
start conditions
.R
on rules,
and the possibility of making multiple lexical analyzers all run
together.
In each case, there are rules which recognize the need to change the
environment in which the
following input text is analyzed, and set some parameter
to reflect the change. This may be a flag explicitly tested by
the user's action code; such a flag is the simplest way of dealing
with the problem, since Lex is not involved at all.
It may be more convenient,
however,
to have Lex remember the flags as initial conditions on the rules.
Any rule may be associated with a start condition. It will only
be recognized when Lex is in
that start condition.
The current start condition may be changed at any time.
Finally, if the sets of rules for the different environments
are very dissimilar,
clarity may be best achieved by writing several distinct lexical
analyzers, and switching from one to another as desired.
.PP
Consider the following problem: copy the input to the output,
changing the word \fImagic\fR to \fIfirst\fR on every line which began
with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
which began with the letter \fIb\fR, and changing
\fImagic\fR to \fIthird\fR on every line which began
with the letter \fIc\fR. All other words and all other lines
are left unchanged.
.PP
These rules are so simple that the easiest way
to do this job is with a flag:
.TS
center;
l l.
int flag;
%%
^a {flag = \(fma\(fm; ECHO;}
^b {flag = \(fmb\(fm; ECHO;}
^c {flag = \(fmc\(fm; ECHO;}
\en {flag = 0 ; ECHO;}
magic {
switch (flag)
{
case \(fma\(fm: printf("first"); break;
case \(fmb\(fm: printf("second"); break;
case \(fmc\(fm: printf("third"); break;
default: ECHO; break;
}
}
.TE
should be adequate.
.PP
To handle the same problem with start conditions, each
start condition must be introduced to Lex in the definitions section
with a line reading
.TS
center;
l l.
%Start name1 name2 ...
.TE
where the conditions may be named in any order.
The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
The conditions may be referenced at the
head of a rule with the <> brackets:
.TS
center;
l.
<name1>expression
.TE
is a rule which is only recognized when Lex is in the
start condition \fIname1\fR.
To enter a start condition,
execute the action statement
.TS
center;
l.
BEGIN name1;
.TE
which changes the start condition to \fIname1\fR.
To resume the normal state,
.TS
center;
l.
BEGIN 0;
.TE
resets the initial condition
of the Lex automaton interpreter.
A rule may be active in several
start conditions:
.TS
center;
l.
<name1,name2,name3>
.TE
is a legal prefix. Any rule not beginning with the
<> prefix operator is always active.
.PP
The same example as before can be written:
.TS
center;
l l.
%START AA BB CC
%%
^a {ECHO; BEGIN AA;}
^b {ECHO; BEGIN BB;}
^c {ECHO; BEGIN CC;}
\en {ECHO; BEGIN 0;}
<AA>magic printf("first");
<BB>magic printf("second");
<CC>magic printf("third");
.TE
where the logic is exactly the same as in the previous
method of handling the problem, but Lex does the work
rather than the user's code.
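.PP
The behavior of these start conditions can be imitated by a small state machine in C.
The sketch below is hypothetical: the function \fItranslate_magic\fR and the enumeration are illustrative names,
and the output is written to a caller-supplied buffer.
The variable \fIstate\fR plays the part of the current start condition,
and the value \fIINITIAL\fR corresponds to the state selected by BEGIN 0.

```c
#include <string.h>

enum start { INITIAL, AA, BB, CC };

/* Copies in to out, rewriting magic by line; 0 on success, -1 if out is too small. */
int translate_magic(const char *in, char *out, size_t outsize)
{
    /* Replacement for magic in each start condition. */
    static const char *const word[] = { "magic", "first", "second", "third" };
    enum start state = INITIAL;
    int at_line_start = 1;
    size_t used = 0;

    while (*in != '\0') {
        const char *piece;
        size_t len, consumed;
        char one[2];

        if (strncmp(in, "magic", 5) == 0) {
            piece = word[state];                 /* the <AA>, <BB>, <CC> rules */
            consumed = 5;
        } else {
            if (at_line_start && *in == 'a')
                state = AA;                      /* ^a  BEGIN AA */
            else if (at_line_start && *in == 'b')
                state = BB;                      /* ^b  BEGIN BB */
            else if (at_line_start && *in == 'c')
                state = CC;                      /* ^c  BEGIN CC */
            else if (*in == '\n')
                state = INITIAL;                 /* \n  BEGIN 0 */
            one[0] = *in;
            one[1] = '\0';
            piece = one;                         /* ECHO of a single character */
            consumed = 1;
        }
        at_line_start = (in[consumed - 1] == '\n');
        len = strlen(piece);
        if (used + len >= outsize)
            return -1;
        memcpy(out + used, piece, len);
        used += len;
        in += consumed;
    }
    if (used >= outsize)
        return -1;
    out[used] = '\0';
    return 0;
}
```

.PP
A line that begins with none of the three letters leaves \fIstate\fR at \fIINITIAL\fR,
so \fImagic\fR on such a line is copied unchanged.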
.2C
.NH
Character Set.
.PP
The programs generated by Lex handle
character I/O only through the routines
.I
input,
output,
.R
and
.I
unput.
.R
Thus the character representation
provided in these routines
is accepted by Lex and employed to return
values in
.I
yytext.
.R
For internal use
a character is represented as a small integer
which, if the standard library is used,
has a value equal to the integer value of the bit
pattern representing the character on the host computer.
Normally, the letter
.I a
is represented as the same form as the character constant
.I \(fma\(fm .
If this interpretation is changed, by providing I/O
routines which translate the characters,
Lex must be told about
it, by giving a translation table.
This table must be in the definitions section,
and must be bracketed by lines containing only
``%T''.
The table contains lines of the form
.TS
center;
l.
{integer} {character string}
.TE
which indicate the value associated with each character.
Thus the next example
.TS
center;
l l.
%T
1 Aa
2 Bb
\&...
26 Zz
27 \en
28 +
29 \-
30 0
31 1
\&...
39 9
%T
.TE
.sp
.ce 1
Sample character table.
maps the lower and upper case letters together into the integers 1 through 26,
newline into 27, + and \- into 28 and 29, and the
digits into 30 through 39.
Note the escape for newline.
If a table is supplied, every character that is to appear either
in the rules or in any valid input must be included
in the table.
No character
may be assigned the number 0, and no character may be
assigned a bigger number than the size of the hardware character set.
.2C
.NH
Summary of Source Format.
.PP
The general form of a Lex source file is:
.TS
center;
l.
{definitions}
%%
{rules}
%%
{user subroutines}
.TE
The definitions section contains
a combination of
.IP 1)
Definitions, in the form ``name space translation''.
.IP 2)
Included code, in the form ``space code''.
.IP 3)
Included code, in the form
.TS
center;
l.
%{
code
%}
.TE
.ns
.IP 4)
Start conditions, given in the form
.TS
center;
l.
%S name1 name2 ...
.TE
.ns
.IP 5)
Character set tables, in the form
.TS
center;
l.
%T
number space character-string
\&...
%T
.TE
.ns
.IP 6)
Changes to internal array sizes, in the form
.TS
center;
l.
%\fIx\fR\0\0\fInnn\fR
.TE
where \fInnn\fR is a decimal integer representing an array size
and \fIx\fR selects the parameter as follows:
.TS
center;
c c
c l.
Letter Parameter
p positions
n states
e tree nodes
a transitions
k packed character classes
o output array size
.TE
.LP
Lines in the rules section have the form ``expression action''
where the action may be continued on succeeding
lines by using braces to delimit it.
.PP
Regular expressions in Lex use the following
operators:
.br
.TS
center;
l l.
x the character "x"
"x" an "x", even if x is an operator.
\ex an "x", even if x is an operator.
[xy] the character x or y.
[x\-z] the characters x, y or z.
[^x] any character but x.
\&. any character but newline.
^x an x at the beginning of a line.
<y>x an x when Lex is in start condition y.
x$ an x at the end of a line.
x? an optional x.
x\(** 0,1,2, ... instances of x.
x+ 1,2,3, ... instances of x.
x|y an x or a y.
(x) an x.
x/y an x but only if followed by y.
{xx} the translation of xx from the
definitions section.
x{m,n} \fIm\fR through \fIn\fR occurrences of x
.TE
.NH
Caveats and Bugs.
.PP
There are pathological expressions which
produce exponential growth of the tables when
converted to deterministic machines;
fortunately, they are rare.
.PP
REJECT does not rescan the input; instead it remembers the results of the previous
scan. This means that if a rule with trailing context is found, and
REJECT executed, the user
must not have used
.ul
unput
to change the characters forthcoming
from the input stream.
This is the only restriction on the user's ability to manipulate
the not-yet-processed input.
.PP
.2C
.NH
Acknowledgments.
.PP
As should
be obvious from the above, the outside of Lex
is patterned
on Yacc and the inside on Aho's string matching routines.
Therefore, both S. C. Johnson and A. V. Aho
are really originators
of much of Lex,
as well as debuggers of it.
Many thanks are due to both.
.PP
The code of the current version of Lex was designed, written,
and debugged by Eric Schmidt.
.SG MH-1274-MEL-unix
.sp 1
.2C
.NH
References.
.sp 1v
.IP 1.
B. W. Kernighan and D. M. Ritchie,
.I
The C Programming Language,
.R
Prentice-Hall, N. J. (1978).
.IP 2.
B. W. Kernighan,
.I
Ratfor: A Preprocessor for a Rational Fortran,
.R
Software \- Practice and Experience,
\fB5\fR, pp. 395-406 (1975).
.IP 3.
S. C. Johnson,
.I
Yacc: Yet Another Compiler Compiler,
.R
Computing Science Technical Report No. 32,
1975,
.MH
.if \n(tm (also TM 75-1273-6)
.IP 4.
A. V. Aho and M. J. Corasick,
.I
Efficient String Matching: An Aid to Bibliographic Search,
.R
Comm. ACM
.B
18,
.R
333-340 (1975).
.IP 5.
B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
.I
QED Text Editor,
.R
Computing Science Technical Report No. 5,
1972,
.MH
.IP 6.
D. M. Ritchie,
private communication.
See also
M. E. Lesk,
.I
The Portable C Library,
.R
Computing Science Technical Report No. 31,
.MH
.if \n(tm (also TM 75-1274-11)