Cooper K.Engineering a compiler
.pdf3.4. BOTTOM-UP PARSING |
79 |
push invalid
token ← next token()
repeat until (top of stack = Goal & token = eof) if we have a handle α → β on top of the stack
then reduce by α→β
pop | β | symbols o the stack push α onto the stack
else if (token = eof) then shift
push token
token ← next token()
else
report syntax error & halt
Figure 3.7: Shift-reduce parsing algorithm
Using this algorithm, Figure 3.5 can be reinterpreted to show the actions of our shift-reduce parser on the input stream x − 2 × y. The row labelled Token shows the contents of the variable token in the algorithm. The row labelled Frontier depicts the contents of the stack at each step; the stack top is to the right. Finally, the action extend indicates a shift; reduce still indicates a reduction.
For an input stream of length s, this parser must perform s shifts. It must perform a reduction for each step in the derivation, for r steps. It looks for a handle on each iteration of the while loop, so it must perform s+r handle-finding operations. If we can keep the cost of handle-finding to a small constant, we have a parser that can operate in time proportional to the length of the input stream (in words) plus the length of the derivation. Of course, this rules out traversing the entire stack on each handle-find, so it places a premium on e cient handle-finding.
3.4.2Finding Handles
The handle-finding mechanism is the key to e cient bottom-up parsing. Let’s examine this problem in more detail. In the previous section, handles appeared in the example as if they were derived from an oracle. Lacking an oracle, we need an algorithm.
As it parses an input string, the parser must track multiple potential handles. For example, on every legal input, the parser eventually reduces to its goal symbol. In the classic expression grammar, this implies that the parser reaches a state where its has the handle Goal→Expr,1 on its stack. This particular handle represents one half of the halting condition—having Goal as the root of the parse tree. (The only production reducing to Goal is Goal→Expr. Since it must be the last reduction, the position must be 1.) Thus, Goal→Expr is a
80 CHAPTER 3. PARSING
potential handle at every step in the parse, from first to last.
In between, the parser discovers other handles. In the example of Figure 3.5, it found eight other handles. At each step, the set of potential handles represent all the legal completions of the sentential form that has already been recognized. (The sentential form consists of the upper frontier, concatenated with the remainder of the input stream—beginning with the current lookahead token.) Each of these potential handles contains the symbol on top of the stack; that symbol can be at a di erent location in each handle.
To represent the position of the top of stack in a potential handle, we introduce a placeholder, •, into the right-hand side of the production. Inserting the placeholder into α→βγδ gives rise to four di erent strings:
α→•βγδ, α→β • γδ, α→βγ • δ, and α→βγδ•.
This notation captures the di erent relationships between potential handles and the state of the parser. In step 7 of the example, the parser has the frontier Expr − Num, with the handle Factor→Num,3 . Using •, we can write this asFactor→Num • ; the position is implicit relative to the top of the stack. A potential handle becomes a handle when • appears at the right end of the production. It must also track a number of other potential handles.
For example, Expr→Expr − • Term represents the possibility that Num will eventually reduce to Term, with a right context that allows the parser to reduce Expr − Term to Expr. Looking ahead in the parse, this potential handle becomes the active handle in step 13, after Num × Id has been reduced to Term. Other potential handles at step 7 include Term→ • Factor and
Goal→ • Expr .
This notation does not completely capture the state of the parser. Consider the parser’s action at step 9. The frontier contains Expr − Term, with potential handles including Expr→Expr − Term • and Term→Term • × Factor . Rather than reduce by Expr→Expr − Term • , the parser extended the frontier to follow the future represented by Term→Term • × Factor . The example demonstrates that this is the right action. Reducing would move the parser into a state where it could make no further progress, since it cannot reduce subsequently Expr × Factor.
To choose between these two actions, the parser needs more context than it has on the frontier. In particular, the parser can look ahead one symbol to discover that the next word in the input stream is ×. This single-word lookahead allows it to determine that the correct choice is pursuing the handle represented by Term→Term • × Factor , rather than reducing as represented by Expr→Expr − Term • . The key to making this decision is the value of the next token, also called the lookahead symbol.
This suggests that we can represent each possible future decision of the parser as a pair, alpha→β • γδ, a , where α→βγδ P , and a T . This pair, called an item, is interpreted as
The parser is in a state where finding an α, followed by the terminal a, would be consistent with its left context. It has already found a
3.5. BUILDING AN LR(1) PARSER |
81 |
β, so finding γδa would allow it to reduce by α→βγδ
At each step in the parse, a collection of these pairs represents the set of possible futures for the derivation, or the set of su xes that would legally complete the left context that the parser has already seen. This pair is usually called an lr(1) item. When written as an lr(1) item, it has square brackets rather than angle brackets.
The set of lr(1) items is finite. If r is the maximal length of a right-hand side for any production in P , then the number of lr(1) items can be no greater than (r + 1)· | T |. For a grammar G, the set of lr(1) items includes all the possible handles for G. Thus, the set of handles is finite and can be recognized by a dfa. This is the central insight behind lr(1) parsers:
Because the set of lr(1) items is finite, we can build a handle-finder that operates as a dfa.
The lr(1) parser construction algorithm builds the dfa to recognize handles. It uses a shift-reduce parsing framework to guide the application of the dfa. The framework can invoke the dfa recursively; to accomplish this, it stores information about internal states of the dfa on the stack.
3.5Building an LR(1) parser
The lr(1) parsers, and their cousins slr(1) and lalr(1), are the most widely used family of parsers. The parsers are easy to build and e cient. Tools to automate construction of the tables are widely available. Most programming language constructs have a natural expression in a grammar that is parsable with an lr(1) parser. This section explains how lr(1) parsers work, and shows how to construct the parse tables for one kind of lr(1) parser, a canonical lr(1) parser.
Lr(1) table construction is an elegant application of theory to practice. However, the actual process of building tables is better left to parser generators than to humans. That notwithstanding, the algorithm is worth studying because it explains the kinds of errors that the parser generator can encounter, how they arise, and how they can be remedied.
3.5.1The LR(1) parsing algorithm
An lr(1) parser consists of a skeleton parser and a pair of tables that drive the parser. Figure 3.8 shows the skeleton parser; it is independent of the grammar being parsed. The bottom half of the figure shows the action and goto tables for the classic expression grammar without the production Factor → ( Expr ).
The skeleton parser resembles the shift-reduce parser shown in Figure 3.7. At each step, it pushes two objects onto the stack: a grammar symbol from the frontier and a state from the handle recognizer. It has four actions:
1.shift: extends the frontier by shifting the lookahead token onto the stack, along with a new state for the handle recognizer. This may result in a recursive invocation of the recognizer.
82 |
CHAPTER 3. PARSING |
push invalid push s0
token ← next token()
while (true)
s ← top of stack
if action[s,token] = ”shift si”’ then push token
push si
token ← next token()
else if action[s,token] = ”reduce A → β” then pop 2 × | β | symbols
s ← top of stack push A
push goto[s,A]
else if action[s, token] = ”accept” then return
else error()
The Skeleton lr(1) Parser
Action Table |
|
|
|
|
|
|
Goto Table |
|
|||
State |
EOF |
+ |
- |
× |
÷ |
Num |
id |
|
Expr |
Term |
Factor |
0 |
|
|
|
|
|
s 4 |
s 5 |
|
1 |
2 |
3 |
1 |
acc |
s 6 |
s 7 |
|
|
|
|
|
0 |
0 |
0 |
2 |
r 4 |
r 4 |
r 4 |
s 8 |
s 9 |
|
|
|
0 |
0 |
0 |
|
|
|
|
|
|
|
|
|
|
|
|
3 |
r 7 |
r 7 |
r 7 |
r 7 |
r 7 |
|
|
|
0 |
0 |
0 |
4 |
r 8 |
r 8 |
r 8 |
r 8 |
r 8 |
|
|
|
0 |
0 |
0 |
5 |
r 9 |
r 9 |
r 9 |
r 9 |
r 9 |
|
|
|
0 |
0 |
0 |
6 |
|
|
|
|
|
s 4 |
s 5 |
|
0 |
10 |
3 |
7 |
|
|
|
|
|
s 4 |
s 5 |
|
0 |
11 |
3 |
8 |
|
|
|
|
|
s 4 |
s 5 |
|
0 |
0 |
12 |
|
|
|
|
|
|
|
|
|
|
|
|
9 |
|
|
|
|
|
s 4 |
s 5 |
|
0 |
0 |
13 |
10 |
r 2 |
r 2 |
r 2 |
s 8 |
s 9 |
|
|
|
0 |
0 |
0 |
11 |
r 3 |
r 3 |
r 3 |
s 8 |
s 9 |
|
|
|
0 |
0 |
0 |
12 |
r 5 |
r 5 |
r 5 |
r 5 |
r 5 |
|
|
|
0 |
0 |
0 |
13 |
r 6 |
r 6 |
r 6 |
r 6 |
r 6 |
|
|
|
0 |
0 |
0 |
lr(1) Tables for the Classic Expression Grammar
Figure 3.8: An lr(1) parser
3.5. BUILDING AN LR(1) PARSER |
83 |
||||||
|
|
|
|
|
|
||
|
|
Token |
Stack |
Action |
|
||
|
1. |
Id |
$ |
0 |
|
shift |
|
|
2. |
- |
$ |
0 |
id 5 |
reduce |
|
|
3. |
- |
$ |
0 |
Factor 3 |
reduce |
|
|
4. |
- |
$ |
0 |
Term 2 |
reduce |
|
|
5. |
- |
$ |
0 |
Expr 1 |
shift |
|
|
6. |
Num |
$ |
0 |
Expr 1 - 7 |
shift |
|
|
7. |
× |
$ |
0 |
Expr 1 - 7 Num 4 |
reduce |
|
|
8. |
× |
$ |
0 |
Expr 1 - 7 Factor 3 |
reduce |
|
|
9. |
× |
$ |
0 |
Expr 1 - 7 Term 11 |
shift |
|
|
10. |
Id |
$ |
0 |
Expr 1 - 7 Term 11 × 8 |
shift |
|
|
11. |
eof |
$ |
0 |
Expr 1 - 7 Term 11 × 8 id 5 |
reduce |
|
|
12. |
eof |
$ |
0 |
Expr 1 - 7 Term 11 × 8 Factor 12 |
reduce |
|
|
13. |
eof |
$ |
0 |
Expr 1 - 7 Term 11 |
reduce |
|
|
14. |
eof |
$ |
0 |
Expr 1 |
accept |
|
Figure 3.9: lr(1) parser states for x − 2 × y
2.reduce: shrinks the frontier by replacing the current handle with its right hand site. It discards all the intermediate states used to recognize the handle by popping o two stack items for each symbol in the handle’s right-hand side. Next, it uses the state underneath the handle and the left-hand side to find a new recognizer state, and pushes both the lefthand side and the new state onto the stack.
3.accept: reports success back to the user. This state is only reached when the parser has reduced the frontier to the goal symbol and the lookahead character is eof. (All other entries in state s1 indicate errors!)
4.error: reports a syntax error. This state is reached anytime the action table contains an entry other than shift, reduce, or accept. In the figure, these entries are left blank.
Entries in the action table are encoded using the letter ‘s’ as “shift” and ‘r’ as “reduce”. Thus, the entry “s3” indicates the action “shift and go to state s3”, while “r5” indicates “reduce by production 5”. On a reduce item, the new state is determined by the goto table entry for the left-hand side of the production, and the state revealed on top of the stack after the right hand side is popped.
To understand this parser, consider again our simple example. Figure 3.9 shows the succession of states that the lr(1) parser takes to parse the expression x − 2 × y. The parser shifts id onto the stack, then reduces it to a Factor, to a Term, and to an Expr. Next, it shifts the -, followed by the Num. At this point, it reduces Num to Factor, then Factor to Term. At this point, it shifts × and then id onto the stack. Finally, it reduces id to Factor, Term × Factor to Term, and then Expr - Term to Expr. At this point, with Expr on the stack and eof as the next token, it accepts the input string. This is similar to the sequence portrayed in Figure 3.5.
84 |
|
|
|
|
|
|
|
CHAPTER 3. |
PARSING |
|||
source |
|
|
|
|
|
|
|
|
|
|
||
- |
scanner |
|
- |
table-driven |
|
- |
|
ir |
||||
code |
|
|
|
|
parser |
|
|
|||||
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
6 |
|
|
|
|
||
|
|
|
|
|
|
|
|
? |
|
|
|
|
|
|
|
|
|
|
|
ActionGoto & |
|
|
|
|
|
grammar |
- |
|
parser |
- |
|
|
|
|
||||
|
|
generator |
|
|
|
|
|
|||||
|
|
|
|
|
tables |
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
Figure 3.10: Structure of an lr(1) parser generator system
The key to building an lr(1) parser is constructing the action and goto tables. These tables encode the actions of the handle-recognizing dfa, along with the information necessary to use limited right context to decide whether to shift, reduce, or accept. While it is possible to construct these tables by hand, the algorithm requires manipulation of lots of detail, as well as scrupulous book-keeping. Building lr(1) tables is a prime example of the kind of task that should be automated and relegated to a computer. Most lr(1) parse tables are constructed by parser generator systems, as shown in Figure 3.10. Using a parser generator, the compiler-writer creates a grammar that describes the source language; the parser generator converts that into action and goto tables. In most such systems, the individual productions can be augmented with ad hoc code that will execute on each reduction. These “actions” are used for many purposes, including context-sensitive error checking, generating intermediate representations (see Section 4.4.3).
3.5.2Building the tables
To construct action and goto tables, an lr(1) parser generator builds a model of the handle-recognizing dfa and uses that model to fill in the tables. The model uses a set of lr(1) items to represent each parser state; these sets are constructed using a disciplined and systematic technique. The model is called the canonical collection of sets of lr(1) items. Each set in the canonical collection represents a parse state.
To explain the construction, we will use two examples. The SheepNoise grammar, SN, is small enough to use as a running example to clarify the individual steps in the process.
1.Goal → SheepNoise
2. SheepNoise → SheepNoise baa
3.| baa
This version of SN includes a distinct Goal production. Including a separate
3.5. BUILDING AN LR(1) PARSER |
85 |
production for the goal symbol simplifies the implementation of the parser generator. As a second example, we will use the classic expression grammar. It includes complexity not found in SN ; this makes it an interesting example, but too large to include incrementally in the text. Thus, this subsection ends with the classic expression grammar as a detailed example.
LR(1) Items An lr(1) item is a pair [α→β • γδ, a], where α→βγδ P is production of the grammar, the symbol • is a placeholder in the right-hand side, and a T is a word in the source language. Individual lr(1) items describe configurations of a bottom-up parser; they represent potential handles that would be consistent with the left context that the parser has already seen. The • marks the point in the production where the current upper frontier of the parse tree ends. (Alternatively, it marks the top of the parser’s stack. These two views are functionally equivalent.)
For a production α→βγδ and a lookahead symbol a T , the addition of the placeholder generates four possible items, each with its own interpretation. In each case, the presence of the item in the set associated with some parser state indicates that the input that the parser has seen is consistent with the occurrence of an α followed by an a. The position of • in the item provides more information.
[α→•βγδ, a] indicates that an α would be valid and that recognizing a β at this point would be one step toward discovering an α.
[α→β • γδ, a] indicates that the parser has progressed from the state where an α would be valid by recognizing β. The β is consistent with recognizing an α. The next step would be to recognize a γ.
[α→βγ • δ, a] indicates that the parser has moved forward from the state where α would be valid by recognizing βγ. At this point, recognizing a δ, followed by a, would allow the parser to reduce βγδ to α.
[α→βγδ•, a] indicates that the parser has found βγδ in a context where an α followed by a would be valid. If the lookahead symbol is a, the parser can reduce βγδ to α (and the item is a handle).
The lookahead symbols in lr(1) items encode left-context that the parser has already seen, in the sense that [α→βγδ•, a] only indicates a reduction if the lookahead symbol is a.
The SheepNoise grammar produces the following lr(1) items:
[Goal → • SheepNoise, EOF] [Goal → SheepNoise •, EOF] [SheepNoise → • baa, EOF] [SheepNoise → baa •, EOF] [SheepNoise → • baa, baa] [SheepNoise → baa •, baa]
[SheepNoise → • SheepNoise baa, EOF] [SheepNoise → SheepNoise • baa, EOF] [SheepNoise → SheepNoise baa •, EOF] [SheepNoise → • SheepNoise baa, baa] [SheepNoise → SheepNoise • baa, baa] [SheepNoise → SheepNoise baa •, baa]
86 |
CHAPTER 3. PARSING |
|
for each α T |
|
First(α) ← α |
|
for each α N T |
|
First(α) ← |
|
while (First sets are still changing) |
|
for each p P , where p has the form α → β |
|
if β is |
|
then First(α) ← First(α) { } |
|
else if β is β1β2 . . . βk then |
|
First(α) ← First(α) First(β1) |
|
for i ← 1 to k − 1 by 1 |
if First(βi)
then First(α) ← First(α) First(βi+1) else break
Figure 3.11: Computing First sets
SN generates two terminal symbols. The first, baa, comes directly from the grammar. The second, EOF (for end of file) arises from the need to represent the parser’s final state. The item [Goal → SheepNoise •, EOF] represents a parser configuration where it has already recognized a string that reduces to Goal, and it has exhausted the input, indicated by the lookahead of EOF.
Constructing First Sets The construction of the canonical collection of sets of lr(1) items uses the First sets for various grammar symbols. This set was informally defined in our discussion of predictive parsing (see Section 3.3.3). For the lr(1) construction, we need a more constructive definition:
if α aβ, a T, β (T N T ) then a First(α)
if α then First(α)
To compute First sets, we apply these two rules inside a fixed-point framework. Figure 3.11 shows an algorithm that computes the First set for every symbol in a context-free grammar G = (T, N T, S, P ).
On successive iterations, First(α) takes on some value in 2(T ). The entire collection of First sets is a subset of 2(T ). Each iteration of the while loop either increases the size of some First set, or it changes no set. The algorithm halts when an iteration leaves all the First sets unchanged. The actual running time of the algorithm depends on the structure of the grammar.
3.5. BUILDING AN LR(1) PARSER |
87 |
The First sets for the augmented SN grammar are trivial.
Symbol |
First |
Goal |
baa |
SheepNoise |
baa |
baa |
baa |
|
|
EOF |
EOF |
|
|
A more involved example of the First set computation occurs with the classic expression grammar. (See the detailed example later in this section.)
Constructing the Canonical Collection The construction begins by building a model of the parser’s initial state. To the grammar, G = (T, N T, S, P ), the construction adds an additional non-terminal, S , and one production, S →S. Adding this production allows us to specify the parser’s start state concisely; it is represented by the lr(1) item [S →•S, eof].
Closure To this initial item, the construction needs to add all of the items implied by [S →•S, eof]. The procedure closure does this.
closure(si )
while (si is still changing)
item [α→β • γδ, a] si
production γ→ς P
b First(δa)
if [γ→•ς, b] si
then add [γ→•ς, b] to si
It iterates over the items in set. If an item has the • immediately before some non-terminal γ, it adds every production that can derive a γ, with the • at the left end of the right-hand side. The rationale is clear. If [α→β • γδ, a] is a member of the set, then one potential completion for the left context is to find the string γδa, possibly followed by more context. This implies that γ is legal; hence, every production deriving a γ is part of a potential future handle. To complete the item, closure needs to add the appropriate lookahead symbol. If δ is non-empty, it generates items for every element of First(δa). If δ is , this devolves into First(a) = a.
The closure computation is another fixed-point computation. At each
point, the triply-nested loop either adds some elements to si or leaves it intact. Since the set of lr(1) items is finite, this loop must halt. In fact, si 2items,
the power set of the set of all lr(1) items. The triply-nested loop looks expensive. However, close examination should convince you that each item in si must be processed exactly once. Only items added in the previous iteration need be processed in the inner two loops. The middle loop iterates over the set of alternative right-hand sides for a single production; this can easily be restricted so that each left-hand side is processed once per invocation of closure. The
88 |
CHAPTER 3. PARSING |
amount of work in the inner loop depends entirely on the size of First(δa); if δ is , the loop makes a single trip for a. Thus, this computation may be much more sparse than it first appears.
In SN, the item [Goal → • SheepNoise,EOF] represents the initial state of the parser. (The parser is looking for an input string that reduces to SheepNoise, followed by EOF.) Taking the closure of this initial state produces the set
|
Item |
Reason |
1. |
[Goal → • SheepNoise,EOF] |
original item |
2. |
[SheepNoise → • SheepNoise baa, EOF] |
from 1, δa is “EOF” |
3. |
[SheepNoise → • baa,EOF] |
from 1, δa is “EOF” |
4. |
[SheepNoise → • SheepNoise baa, baa] |
from 2, δa is “baa EOF” |
5. |
[SheepNoise → • baa,baa] |
from 2, δa is “baa EOF” |
Items two and three derive directly from the first item. The final two items derive from item two; their lookahead symbol is just First(baa EOF). This set represents the initial state, s0, of an lr(1) parser for SN.
Goto If closure([S →•S, eof]) computes the initial state s0, the remaining step in the construction is to derive, from s0, the other parser states. To accomplish this, we compute the state that would arise if we recognized a grammar symbol X while in s0. The procedure goto does this.
goto(si, x) new ←
items i si
if i is [α→β • xδ, a] then
moved ← [α→βx • δ, a] new ← new moved
return closure(new)
Goto takes two arguments, a set of lr(1) items si and a grammar symbol x. It iterates over the items in si. When it finds one where • immediately precedes x, it creates the item resulting from recognizing x by moving the • rightward past x. It finds all such items, then returns their closure to fill out the state.
To find all the states that can be derived directly from s0, the algorithm iterates over x (T N T ) and computes Goto(s0, x). This produces all the sets that are one symbol away from s0. To compute the full canonical collection, we simply iterate this process to a fixed point.
In SN, the set s0 derived earlier represents the initial state of the parser. To build a representation of the parser’s state after seeing an initial baa, the construction computes Goto(s0,baa). Goto creates two items:
|
Item |
Reason |
1. |
[SheepNoise → baa •,EOF] |
from item 3 in s0 |
2. |
[SheepNoise → baa •,baa] |
from item 5 in s0 |
The final part of goto invokes closure on this set. It finds no items to add because • is at the end of the production in each item.