Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Cooper K.Engineering a compiler

.pdf
Скачиваний:
49
Добавлен:
23.08.2013
Размер:
2.31 Mб
Скачать

3.4. BOTTOM-UP PARSING

79

push invalid

token next token()

repeat until (top of stack = Goal & token = eof) if we have a handle α → β on top of the stack

then reduce by αβ

pop | β | symbols o the stack push α onto the stack

else if (token = eof) then shift

push token

token next token()

else

report syntax error & halt

Figure 3.7: Shift-reduce parsing algorithm

Using this algorithm, Figure 3.5 can be reinterpreted to show the actions of our shift-reduce parser on the input stream x 2 × y. The row labelled Token shows the contents of the variable token in the algorithm. The row labelled Frontier depicts the contents of the stack at each step; the stack top is to the right. Finally, the action extend indicates a shift; reduce still indicates a reduction.

For an input stream of length s, this parser must perform s shifts. It must perform a reduction for each step in the derivation, for r steps. It looks for a handle on each iteration of the while loop, so it must perform s+r handle-finding operations. If we can keep the cost of handle-finding to a small constant, we have a parser that can operate in time proportional to the length of the input stream (in words) plus the length of the derivation. Of course, this rules out traversing the entire stack on each handle-find, so it places a premium on e cient handle-finding.

3.4.2Finding Handles

The handle-finding mechanism is the key to e cient bottom-up parsing. Let’s examine this problem in more detail. In the previous section, handles appeared in the example as if they were derived from an oracle. Lacking an oracle, we need an algorithm.

As it parses an input string, the parser must track multiple potential handles. For example, on every legal input, the parser eventually reduces to its goal symbol. In the classic expression grammar, this implies that the parser reaches a state where its has the handle GoalExpr,1 on its stack. This particular handle represents one half of the halting condition—having Goal as the root of the parse tree. (The only production reducing to Goal is GoalExpr. Since it must be the last reduction, the position must be 1.) Thus, GoalExpr is a

80 CHAPTER 3. PARSING

potential handle at every step in the parse, from first to last.

In between, the parser discovers other handles. In the example of Figure 3.5, it found eight other handles. At each step, the set of potential handles represent all the legal completions of the sentential form that has already been recognized. (The sentential form consists of the upper frontier, concatenated with the remainder of the input stream—beginning with the current lookahead token.) Each of these potential handles contains the symbol on top of the stack; that symbol can be at a di erent location in each handle.

To represent the position of the top of stack in a potential handle, we introduce a placeholder, , into the right-hand side of the production. Inserting the placeholder into αβγδ gives rise to four di erent strings:

α•βγδ, αβ • γδ, αβγ • δ, and αβγδ•.

This notation captures the di erent relationships between potential handles and the state of the parser. In step 7 of the example, the parser has the frontier Expr − Num, with the handle FactorNum,3 . Using , we can write this asFactorNum ; the position is implicit relative to the top of the stack. A potential handle becomes a handle when appears at the right end of the production. It must also track a number of other potential handles.

For example, ExprExpr − • Term represents the possibility that Num will eventually reduce to Term, with a right context that allows the parser to reduce Expr − Term to Expr. Looking ahead in the parse, this potential handle becomes the active handle in step 13, after Num × Id has been reduced to Term. Other potential handles at step 7 include Term• Factor and

Goal• Expr .

This notation does not completely capture the state of the parser. Consider the parser’s action at step 9. The frontier contains Expr − Term, with potential handles including ExprExpr − Term • and TermTerm • × Factor . Rather than reduce by ExprExpr − Term • , the parser extended the frontier to follow the future represented by TermTerm • × Factor . The example demonstrates that this is the right action. Reducing would move the parser into a state where it could make no further progress, since it cannot reduce subsequently Expr × Factor.

To choose between these two actions, the parser needs more context than it has on the frontier. In particular, the parser can look ahead one symbol to discover that the next word in the input stream is ×. This single-word lookahead allows it to determine that the correct choice is pursuing the handle represented by TermTerm • × Factor , rather than reducing as represented by ExprExpr − Term • . The key to making this decision is the value of the next token, also called the lookahead symbol.

This suggests that we can represent each possible future decision of the parser as a pair, alphaβ • γδ, a , where αβγδ P , and a T . This pair, called an item, is interpreted as

The parser is in a state where finding an α, followed by the terminal a, would be consistent with its left context. It has already found a

3.5. BUILDING AN LR(1) PARSER

81

β, so finding γδa would allow it to reduce by αβγδ

At each step in the parse, a collection of these pairs represents the set of possible futures for the derivation, or the set of su xes that would legally complete the left context that the parser has already seen. This pair is usually called an lr(1) item. When written as an lr(1) item, it has square brackets rather than angle brackets.

The set of lr(1) items is finite. If r is the maximal length of a right-hand side for any production in P , then the number of lr(1) items can be no greater than (r + 1)· | T |. For a grammar G, the set of lr(1) items includes all the possible handles for G. Thus, the set of handles is finite and can be recognized by a dfa. This is the central insight behind lr(1) parsers:

Because the set of lr(1) items is finite, we can build a handle-finder that operates as a dfa.

The lr(1) parser construction algorithm builds the dfa to recognize handles. It uses a shift-reduce parsing framework to guide the application of the dfa. The framework can invoke the dfa recursively; to accomplish this, it stores information about internal states of the dfa on the stack.

3.5Building an LR(1) parser

The lr(1) parsers, and their cousins slr(1) and lalr(1), are the most widely used family of parsers. The parsers are easy to build and e cient. Tools to automate construction of the tables are widely available. Most programming language constructs have a natural expression in a grammar that is parsable with an lr(1) parser. This section explains how lr(1) parsers work, and shows how to construct the parse tables for one kind of lr(1) parser, a canonical lr(1) parser.

Lr(1) table construction is an elegant application of theory to practice. However, the actual process of building tables is better left to parser generators than to humans. That notwithstanding, the algorithm is worth studying because it explains the kinds of errors that the parser generator can encounter, how they arise, and how they can be remedied.

3.5.1The LR(1) parsing algorithm

An lr(1) parser consists of a skeleton parser and a pair of tables that drive the parser. Figure 3.8 shows the skeleton parser; it is independent of the grammar being parsed. The bottom half of the figure shows the action and goto tables for the classic expression grammar without the production Factor ( Expr ).

The skeleton parser resembles the shift-reduce parser shown in Figure 3.7. At each step, it pushes two objects onto the stack: a grammar symbol from the frontier and a state from the handle recognizer. It has four actions:

1.shift: extends the frontier by shifting the lookahead token onto the stack, along with a new state for the handle recognizer. This may result in a recursive invocation of the recognizer.

82

CHAPTER 3. PARSING

push invalid push s0

token next token()

while (true)

s top of stack

if action[s,token] = ”shift si”’ then push token

push si

token next token()

else if action[s,token] = ”reduce A β” then pop 2 × | β | symbols

s top of stack push A

push goto[s,A]

else if action[s, token] = ”accept” then return

else error()

The Skeleton lr(1) Parser

Action Table

 

 

 

 

 

 

Goto Table

 

State

EOF

+

-

×

÷

Num

id

 

Expr

Term

Factor

0

 

 

 

 

 

s 4

s 5

 

1

2

3

1

acc

s 6

s 7

 

 

 

 

 

0

0

0

2

r 4

r 4

r 4

s 8

s 9

 

 

 

0

0

0

 

 

 

 

 

 

 

 

 

 

 

 

3

r 7

r 7

r 7

r 7

r 7

 

 

 

0

0

0

4

r 8

r 8

r 8

r 8

r 8

 

 

 

0

0

0

5

r 9

r 9

r 9

r 9

r 9

 

 

 

0

0

0

6

 

 

 

 

 

s 4

s 5

 

0

10

3

7

 

 

 

 

 

s 4

s 5

 

0

11

3

8

 

 

 

 

 

s 4

s 5

 

0

0

12

 

 

 

 

 

 

 

 

 

 

 

 

9

 

 

 

 

 

s 4

s 5

 

0

0

13

10

r 2

r 2

r 2

s 8

s 9

 

 

 

0

0

0

11

r 3

r 3

r 3

s 8

s 9

 

 

 

0

0

0

12

r 5

r 5

r 5

r 5

r 5

 

 

 

0

0

0

13

r 6

r 6

r 6

r 6

r 6

 

 

 

0

0

0

lr(1) Tables for the Classic Expression Grammar

Figure 3.8: An lr(1) parser

3.5. BUILDING AN LR(1) PARSER

83

 

 

 

 

 

 

 

 

Token

Stack

Action

 

 

1.

Id

$

0

 

shift

 

 

2.

-

$

0

id 5

reduce

 

 

3.

-

$

0

Factor 3

reduce

 

 

4.

-

$

0

Term 2

reduce

 

 

5.

-

$

0

Expr 1

shift

 

 

6.

Num

$

0

Expr 1 - 7

shift

 

 

7.

×

$

0

Expr 1 - 7 Num 4

reduce

 

 

8.

×

$

0

Expr 1 - 7 Factor 3

reduce

 

 

9.

×

$

0

Expr 1 - 7 Term 11

shift

 

 

10.

Id

$

0

Expr 1 - 7 Term 11 × 8

shift

 

 

11.

eof

$

0

Expr 1 - 7 Term 11 × 8 id 5

reduce

 

 

12.

eof

$

0

Expr 1 - 7 Term 11 × 8 Factor 12

reduce

 

 

13.

eof

$

0

Expr 1 - 7 Term 11

reduce

 

 

14.

eof

$

0

Expr 1

accept

 

Figure 3.9: lr(1) parser states for x 2 × y

2.reduce: shrinks the frontier by replacing the current handle with its right hand site. It discards all the intermediate states used to recognize the handle by popping o two stack items for each symbol in the handle’s right-hand side. Next, it uses the state underneath the handle and the left-hand side to find a new recognizer state, and pushes both the lefthand side and the new state onto the stack.

3.accept: reports success back to the user. This state is only reached when the parser has reduced the frontier to the goal symbol and the lookahead character is eof. (All other entries in state s1 indicate errors!)

4.error: reports a syntax error. This state is reached anytime the action table contains an entry other than shift, reduce, or accept. In the figure, these entries are left blank.

Entries in the action table are encoded using the letter ‘s’ as “shift” and ‘r’ as “reduce”. Thus, the entry “s3” indicates the action “shift and go to state s3”, while “r5” indicates “reduce by production 5”. On a reduce item, the new state is determined by the goto table entry for the left-hand side of the production, and the state revealed on top of the stack after the right hand side is popped.

To understand this parser, consider again our simple example. Figure 3.9 shows the succession of states that the lr(1) parser takes to parse the expression x 2 × y. The parser shifts id onto the stack, then reduces it to a Factor, to a Term, and to an Expr. Next, it shifts the -, followed by the Num. At this point, it reduces Num to Factor, then Factor to Term. At this point, it shifts × and then id onto the stack. Finally, it reduces id to Factor, Term × Factor to Term, and then Expr - Term to Expr. At this point, with Expr on the stack and eof as the next token, it accepts the input string. This is similar to the sequence portrayed in Figure 3.5.

84

 

 

 

 

 

 

 

CHAPTER 3.

PARSING

source

 

 

 

 

 

 

 

 

 

 

-

scanner

 

-

table-driven

 

-

 

ir

code

 

 

 

 

parser

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6

 

 

 

 

 

 

 

 

 

 

 

 

?

 

 

 

 

 

 

 

 

 

 

 

ActionGoto &

 

 

 

 

grammar

-

 

parser

-

 

 

 

 

 

 

generator

 

 

 

 

 

 

 

 

 

 

tables

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3.10: Structure of an lr(1) parser generator system

The key to building an lr(1) parser is constructing the action and goto tables. These tables encode the actions of the handle-recognizing dfa, along with the information necessary to use limited right context to decide whether to shift, reduce, or accept. While it is possible to construct these tables by hand, the algorithm requires manipulation of lots of detail, as well as scrupulous book-keeping. Building lr(1) tables is a prime example of the kind of task that should be automated and relegated to a computer. Most lr(1) parse tables are constructed by parser generator systems, as shown in Figure 3.10. Using a parser generator, the compiler-writer creates a grammar that describes the source language; the parser generator converts that into action and goto tables. In most such systems, the individual productions can be augmented with ad hoc code that will execute on each reduction. These “actions” are used for many purposes, including context-sensitive error checking, generating intermediate representations (see Section 4.4.3).

3.5.2Building the tables

To construct action and goto tables, an lr(1) parser generator builds a model of the handle-recognizing dfa and uses that model to fill in the tables. The model uses a set of lr(1) items to represent each parser state; these sets are constructed using a disciplined and systematic technique. The model is called the canonical collection of sets of lr(1) items. Each set in the canonical collection represents a parse state.

To explain the construction, we will use two examples. The SheepNoise grammar, SN, is small enough to use as a running example to clarify the individual steps in the process.

1.Goal SheepNoise

2. SheepNoise SheepNoise baa

3.| baa

This version of SN includes a distinct Goal production. Including a separate

3.5. BUILDING AN LR(1) PARSER

85

production for the goal symbol simplifies the implementation of the parser generator. As a second example, we will use the classic expression grammar. It includes complexity not found in SN ; this makes it an interesting example, but too large to include incrementally in the text. Thus, this subsection ends with the classic expression grammar as a detailed example.

LR(1) Items An lr(1) item is a pair [αβ • γδ, a], where αβγδ P is production of the grammar, the symbol is a placeholder in the right-hand side, and a T is a word in the source language. Individual lr(1) items describe configurations of a bottom-up parser; they represent potential handles that would be consistent with the left context that the parser has already seen. The marks the point in the production where the current upper frontier of the parse tree ends. (Alternatively, it marks the top of the parser’s stack. These two views are functionally equivalent.)

For a production αβγδ and a lookahead symbol a T , the addition of the placeholder generates four possible items, each with its own interpretation. In each case, the presence of the item in the set associated with some parser state indicates that the input that the parser has seen is consistent with the occurrence of an α followed by an a. The position of in the item provides more information.

[α•βγδ, a] indicates that an α would be valid and that recognizing a β at this point would be one step toward discovering an α.

[αβ • γδ, a] indicates that the parser has progressed from the state where an α would be valid by recognizing β. The β is consistent with recognizing an α. The next step would be to recognize a γ.

[αβγ • δ, a] indicates that the parser has moved forward from the state where α would be valid by recognizing βγ. At this point, recognizing a δ, followed by a, would allow the parser to reduce βγδ to α.

[αβγδ•, a] indicates that the parser has found βγδ in a context where an α followed by a would be valid. If the lookahead symbol is a, the parser can reduce βγδ to α (and the item is a handle).

The lookahead symbols in lr(1) items encode left-context that the parser has already seen, in the sense that [αβγδ•, a] only indicates a reduction if the lookahead symbol is a.

The SheepNoise grammar produces the following lr(1) items:

[Goal → • SheepNoise, EOF] [Goal → SheepNoise •, EOF] [SheepNoise → • baa, EOF] [SheepNoise → baa , EOF] [SheepNoise → • baa, baa] [SheepNoise → baa , baa]

[SheepNoise → • SheepNoise baa, EOF] [SheepNoise → SheepNoise • baa, EOF] [SheepNoise → SheepNoise baa , EOF] [SheepNoise → • SheepNoise baa, baa] [SheepNoise → SheepNoise • baa, baa] [SheepNoise → SheepNoise baa , baa]

86

CHAPTER 3. PARSING

 

for each α T

 

First(α) α

 

for each α N T

 

First(α)

 

while (First sets are still changing)

 

for each p P , where p has the form α → β

 

if β is

 

then First(α) First(α) { }

 

else if β is β1β2 . . . βk then

 

First(α) First(α) First(β1)

 

for i 1 to k − 1 by 1

if First(βi)

then First(α) First(α) First(βi+1) else break

Figure 3.11: Computing First sets

SN generates two terminal symbols. The first, baa, comes directly from the grammar. The second, EOF (for end of file) arises from the need to represent the parser’s final state. The item [Goal SheepNoise •, EOF] represents a parser configuration where it has already recognized a string that reduces to Goal, and it has exhausted the input, indicated by the lookahead of EOF.

Constructing First Sets The construction of the canonical collection of sets of lr(1) items uses the First sets for various grammar symbols. This set was informally defined in our discussion of predictive parsing (see Section 3.3.3). For the lr(1) construction, we need a more constructive definition:

if α aβ, a T, β (T N T ) then a First(α)

if α then First(α)

To compute First sets, we apply these two rules inside a fixed-point framework. Figure 3.11 shows an algorithm that computes the First set for every symbol in a context-free grammar G = (T, N T, S, P ).

On successive iterations, First(α) takes on some value in 2(T ). The entire collection of First sets is a subset of 2(T ). Each iteration of the while loop either increases the size of some First set, or it changes no set. The algorithm halts when an iteration leaves all the First sets unchanged. The actual running time of the algorithm depends on the structure of the grammar.

3.5. BUILDING AN LR(1) PARSER

87

The First sets for the augmented SN grammar are trivial.

Symbol

First

Goal

baa

SheepNoise

baa

baa

baa

 

 

EOF

EOF

 

 

A more involved example of the First set computation occurs with the classic expression grammar. (See the detailed example later in this section.)

Constructing the Canonical Collection The construction begins by building a model of the parser’s initial state. To the grammar, G = (T, N T, S, P ), the construction adds an additional non-terminal, S , and one production, S S. Adding this production allows us to specify the parser’s start state concisely; it is represented by the lr(1) item [S •S, eof].

Closure To this initial item, the construction needs to add all of the items implied by [S •S, eof]. The procedure closure does this.

closure(si )

while (si is still changing)

item [αβ • γδ, a] si

production γς P

b First(δa)

if [γ•ς, b] si

then add [γ•ς, b] to si

It iterates over the items in set. If an item has the immediately before some non-terminal γ, it adds every production that can derive a γ, with the at the left end of the right-hand side. The rationale is clear. If [αβ • γδ, a] is a member of the set, then one potential completion for the left context is to find the string γδa, possibly followed by more context. This implies that γ is legal; hence, every production deriving a γ is part of a potential future handle. To complete the item, closure needs to add the appropriate lookahead symbol. If δ is non-empty, it generates items for every element of First(δa). If δ is , this devolves into First(a) = a.

The closure computation is another fixed-point computation. At each

point, the triply-nested loop either adds some elements to si or leaves it intact. Since the set of lr(1) items is finite, this loop must halt. In fact, si 2items,

the power set of the set of all lr(1) items. The triply-nested loop looks expensive. However, close examination should convince you that each item in si must be processed exactly once. Only items added in the previous iteration need be processed in the inner two loops. The middle loop iterates over the set of alternative right-hand sides for a single production; this can easily be restricted so that each left-hand side is processed once per invocation of closure. The

88

CHAPTER 3. PARSING

amount of work in the inner loop depends entirely on the size of First(δa); if δ is , the loop makes a single trip for a. Thus, this computation may be much more sparse than it first appears.

In SN, the item [Goal • SheepNoise,EOF] represents the initial state of the parser. (The parser is looking for an input string that reduces to SheepNoise, followed by EOF.) Taking the closure of this initial state produces the set

 

Item

Reason

1.

[Goal → • SheepNoise,EOF]

original item

2.

[SheepNoise → • SheepNoise baa, EOF]

from 1, δa is “EOF”

3.

[SheepNoise → • baa,EOF]

from 1, δa is “EOF”

4.

[SheepNoise → • SheepNoise baa, baa]

from 2, δa is “baa EOF”

5.

[SheepNoise → • baa,baa]

from 2, δa is “baa EOF”

Items two and three derive directly from the first item. The final two items derive from item two; their lookahead symbol is just First(baa EOF). This set represents the initial state, s0, of an lr(1) parser for SN.

Goto If closure([S •S, eof]) computes the initial state s0, the remaining step in the construction is to derive, from s0, the other parser states. To accomplish this, we compute the state that would arise if we recognized a grammar symbol X while in s0. The procedure goto does this.

goto(si, x) new

items i si

if i is [αβ • xδ, a] then

moved [αβx • δ, a] new new moved

return closure(new)

Goto takes two arguments, a set of lr(1) items si and a grammar symbol x. It iterates over the items in si. When it finds one where immediately precedes x, it creates the item resulting from recognizing x by moving the rightward past x. It finds all such items, then returns their closure to fill out the state.

To find all the states that can be derived directly from s0, the algorithm iterates over x (T N T ) and computes Goto(s0, x). This produces all the sets that are one symbol away from s0. To compute the full canonical collection, we simply iterate this process to a fixed point.

In SN, the set s0 derived earlier represents the initial state of the parser. To build a representation of the parser’s state after seeing an initial baa, the construction computes Goto(s0,baa). Goto creates two items:

 

Item

Reason

1.

[SheepNoise → baa ,EOF]

from item 3 in s0

2.

[SheepNoise → baa ,baa]

from item 5 in s0

The final part of goto invokes closure on this set. It finds no items to add because is at the end of the production in each item.

Соседние файлы в предмете Электротехника