summaryrefslogtreecommitdiff
path: root/usr.bin/yacc/PSD.doc/ss4
blob: 9859b2489fba37c622a901a48633330320e1641b (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
.\"	$OpenBSD: ss4,v 1.1 2002/12/03 21:36:35 mickey Exp $
.\"
.\" Copyright (C) Caldera International Inc.  2001-2002.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code and documentation must retain the above
.\"    copyright notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed or owned by Caldera
.\"	International, Inc.
.\" 4. Neither the name of Caldera International, Inc. nor the names of other
.\"    contributors may be used to endorse or promote products derived from
.\"    this software without specific prior written permission.
.\"
.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE FOR ANY DIRECT,
.\" INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
.\" (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
.\" SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
.\" STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
.\" IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.\"	@(#)ss4	8.2 (Berkeley) 6/1/94
.\"
.SH
4: How the Parser Works
.PP
Yacc turns the specification file into a C program, which
parses the input according to the specification given.
The algorithm used to go from the
specification to the parser is complex, and will not be discussed
here (see
the references for more information).
The parser itself, however, is relatively simple,
and understanding how it works, while
not strictly necessary, will nevertheless make
treatment of error recovery and ambiguities much more
comprehensible.
.PP
The parser produced by Yacc consists
of a finite state machine with a stack.
The parser is also capable of reading and remembering the next
input token (called the
.I lookahead
token).
The
.I "current state"
is always the one on the top of the stack.
The states of the finite state machine are given
small integer labels; initially, the machine is in state 0,
the stack contains only state 0, and no lookahead token has been read.
.PP
The machine has only four actions available to it, called
.I shift ,
.I reduce ,
.I accept ,
and
.I error .
A move of the parser is done as follows:
.IP 1.
Based on its current state, the parser decides
whether it needs a lookahead token to decide
what action should be done; if it needs one, and does
not have one, it calls
.I yylex
to obtain the next token.
.IP 2.
Using the current state, and the lookahead token
if needed, the parser decides on its next action, and
carries it out.
This may result in states being pushed onto the stack, or popped off of
the stack, and in the lookahead token being processed
or left alone.
.PP
The
.I shift
action is the most common action the parser takes.
Whenever a shift action is taken, there is always
a lookahead token.
For example, in state 56 there may be
an action:
.DS
	IF	shift 34
.DE
which says, in state 56, if the lookahead token is IF,
the current state (56) is pushed down on the stack,
and state 34 becomes the current state (on the
top of the stack).
The lookahead token is cleared.
.PP
The
.I reduce
action keeps the stack from growing without
bounds.
Reduce actions are appropriate when the parser has seen
the right hand side of a grammar rule,
and is prepared to announce that it has seen
an instance of the rule, replacing the right hand side
by the left hand side.
It may be necessary to consult the lookahead token
to decide whether to reduce, but
usually it is not; in fact, the default
action (represented by a ``.'') is often a reduce action.
.PP
Reduce actions are associated with individual grammar rules.
Grammar rules are also given small integer
numbers, leading to some confusion.
The action
.DS
	\fB.\fR	reduce 18
.DE
refers to
.I "grammar rule"
18, while the action
.DS
	IF	shift 34
.DE
refers to
.I state
34.
.ne 1i
.PP
Suppose the rule being reduced is
.DS
A	\fB:\fR	x  y  z    ;
.DE
The reduce action depends on the
left hand symbol (A in this case), and the number of
symbols on the right hand side (three in this case).
To reduce, first pop off the top three states
from the stack
(In general, the number of states popped equals the number of symbols on the
right side of the rule).
In effect, these states were the ones
put on the stack while recognizing
.I x ,
.I y ,
and
.I z ,
and no longer serve any useful purpose.
After popping these states, a state is uncovered
which was the state the parser was in before beginning to
process the rule.
Using this uncovered state, and the symbol
on the left side of the rule, perform what is in
effect a shift of A.
A new state is obtained, pushed onto the stack, and parsing continues.
There are significant differences between the processing of
the left hand symbol and an ordinary shift of a token,
however, so this action is called a
.I goto
action.
In particular, the lookahead token is cleared by a shift, and
is not affected by a goto.
In any case, the uncovered state contains an entry such as:
.DS
	A	goto 20
.DE
causing state 20 to be pushed
onto the stack, and become the current state.
.PP
In effect, the reduce action ``turns back the clock'' in the parse,
popping the states off the stack to go back to the
state where the right hand side of the rule was first seen.
The parser then behaves as if it had seen the left side at that time.
If the right hand side of the rule is empty,
no states are popped off of the stack: the uncovered state
is in fact the current state.
.PP
The reduce action is also important in the treatment of user-supplied
actions and values.
When a rule is reduced, the code supplied with the rule is executed
before the stack is adjusted.
In addition to the stack holding the states, another stack,
running in parallel with it, holds the values returned
from the lexical analyzer and the actions.
When a shift takes place, the external variable
.I yylval
is copied onto the value stack.
After the return from the user code, the reduction is carried out.
When the
.I goto
action is done, the external variable
.I yyval
is copied onto the value stack.
The pseudo-variables $1, $2, etc., refer to the value stack.
.PP
The other two parser actions are conceptually much simpler.
The
.I accept
action indicates that the entire input has been seen and
that it matches the specification.
This action appears only when the lookahead token is 
the endmarker, and indicates that the parser has successfully
done its job.
The
.I error
action, on the other hand, represents a place where the parser
can no longer continue parsing according to the specification.
The input tokens it has seen, together with the lookahead token,
cannot be followed by anything that would result
in a legal input.
The parser reports an error, and attempts to recover the situation and
resume parsing: the error recovery (as opposed to the detection of error)
will be covered in Section 7.
.PP
It is time for an example!
Consider the specification
.DS
%token  DING  DONG  DELL
%%
rhyme	:	sound  place
	;
sound	:	DING  DONG
	;
place	:	DELL
	;
.DE
.PP
When Yacc is invoked with the
.B \-v
option, a file called
.I y.output
is produced, with a human-readable description of the parser.
The
.I y.output
file corresponding to the above grammar (with some statistics
stripped off the end) is:
.DS
state 0
	$accept  :  \_rhyme  $end 

	DING  shift 3
	.  error

	rhyme  goto 1
	sound  goto 2

state 1
	$accept  :   rhyme\_$end 

	$end  accept
	.  error

state 2
	rhyme  :   sound\_place 

	DELL  shift 5
	.  error

	place   goto 4

state 3
	sound   :   DING\_DONG 

	DONG  shift 6
	.  error

state 4
	rhyme  :   sound  place\_    (1)

	.   reduce  1

state 5
	place  :   DELL\_    (3)

	.   reduce  3

state 6
	sound   :   DING  DONG\_    (2)

	.   reduce  2
.DE
Notice that, in addition to the actions for each state, there is a
description of the parsing rules being processed in each
state.  The \_ character is used to indicate
what has been seen, and what is yet to come, in each rule.
Suppose the input is
.DS
DING  DONG  DELL
.DE
It is instructive to follow the steps of the parser while
processing this input.
.PP
Initially, the current state is state 0.
The parser needs to refer to the input in order to decide
between the actions available in state 0, so
the first token,
.I DING ,
is read, becoming the lookahead token.
The action in state 0 on
.I DING
is
is ``shift 3'', so state 3 is pushed onto the stack,
and the lookahead token is cleared.
State 3 becomes the current state.
The next token,
.I DONG ,
is read, becoming the lookahead token.
The action in state 3 on the token
.I DONG
is ``shift 6'',
so state 6 is pushed onto the stack, and the lookahead is cleared.
The stack now contains 0, 3, and 6.
In state 6, without even consulting the lookahead,
the parser reduces by rule 2.
.DS
	sound  :   DING  DONG
.DE
This rule has two symbols on the right hand side, so
two states, 6 and 3, are popped off of the stack, uncovering state 0.
Consulting the description of state 0, looking for a goto on 
.I sound ,
.DS
	sound	goto 2
.DE
is obtained; thus state 2 is pushed onto the stack,
becoming the current state.
.PP
In state 2, the next token,
.I DELL ,
must be read.
The action is ``shift 5'', so state 5 is pushed onto the stack, which now has
0, 2, and 5 on it, and the lookahead token is cleared.
In state 5, the only action is to reduce by rule 3.
This has one symbol on the right hand side, so one state, 5,
is popped off, and state 2 is uncovered.
The goto in state 2 on
.I place ,
the left side of rule 3,
is state 4.
Now, the stack contains 0, 2, and 4.
In state 4, the only action is to reduce by rule 1.
There are two symbols on the right, so the top two states are popped off,
uncovering state 0 again.
In state 0, there is a goto on
.I rhyme
causing the parser to enter state 1.
In state 1, the input is read; the endmarker is obtained,
indicated by ``$end'' in the
.I y.output
file.
The action in state 1 when the endmarker is seen is to accept,
successfully ending the parse.
.PP
The reader is urged to consider how the parser works
when confronted with such incorrect strings as
.I "DING DONG DONG" ,
.I "DING DONG" ,
.I "DING DONG DELL DELL" ,
etc.
A few minutes spend with this and other simple examples will
probably be repaid when problems arise in more complicated contexts.