

We've barely started trying to figure out exactly what an expression
is and then how to code a recursive descent parser for it and already we've encountered an
embarassing problem -- how to decide between several options for a given production --
particularly when it causes our code to call itself over and over again, infinitely, or
else simply quit too early. This is a common problem in setting up useful productions for
any language. And algebraic expressions are no exception.
Left Recursion
Here are the productions, so far:
expression := term | expression + term | expression - term
term := number | ( expression )
The first production, expression, has its last
two choices using the production expression and, as such, it recursively calls
the production expression. But note that this recursive use is on the left side
of each of these last two choices. In other words, it implies that we'd need to
recursively call the Expression function before a check is made for a
plus or minus, not after. Of course, we already know this from trying to code it. And it
is a problem.
This is called left-recursion and in recursive descent
parsing circles, this is a bad thing. What you really want is right-recursion, as that
means you only make a recursive call if and only if everything else has already been
verified. It's just fine to recursively call a production, if it's the last thing you do
as by then you've already figured out which choice is better.
You really want to arrange things so that recursive
calls are the last thing to do -- sometimes called 'tail recursion.'
Right Recursion
Well, as you might guess, there's a trick in switching
from a left-recursive form to a right recursive form. And once you see the result you'll
see why it's worth doing.
It works kind of like this -- if your right-recursive
form can be though of as being, in some abstract sense:
A := A x | z
Then it can be re-written into two new abstract
productions, looking like this:
A := z B
B := x B | <null>
This is one of those parser guru tricks (well,
actually, it's a first semester compiler guru trick) used to get rid of left recursion in
productions. And it's not hard to memorize (or figure out.) Let's see how the above works.
Look again at the left-recursive form (the one with
only one production.) It's simply a z production followed by zero or more x
productions. Note, I said followed by. Why not write it like that?
A := z B
We don't yet know how the production B should
be written, yet. But it is easy to imagine that it is zero or more x's. So how to
write that? Like this:
B := x B | <null>
That simply says that B can be nothing at all
(empty) or it can be an x followed by a B. This is what gets us the
"zero or more" meaning. The meaning of <null> is simply that it matches
the empty set. In other words, B can be nothing at all -- empty -- and still be quite
valid.
Anyway, you put those two together and it works to get
rid of the left-recursion and replace it with right-recursion, which is our desired
result.
What about the case with several left-recursive forms
and several non-recursive forms? Well, you simply define A to be any one of the
non-recursive forms, where each is followed by a B that is any one of the
recursive forms or <null>. So if you first have:
A := A x | A y | A z | j | k | m | n
Then this would become:
A := j B | k B | m B | n B
B := x B | y B | z B | <null>
I hope that makes how it works a little clearer.
Now, getting back to our expressions, if you look at
the first production:
expression := term | expression + term | expression - term
It can be expressed in this slightly altered form:
expression := term | expression x | expression y
That is, it can be if and only if we've also
arbitrarily declared that,
x := + term
y := - term
Now we can follow those rules, so it can be transformed
such that:
expression := term B
B := x B | y B | <null>
Substituting back in for x and y,
now, we get:
expression := term B
B := + term B | - term B | <null>
Or, renaming B to something more meaningful, like moreterms,
gives us:
expression := term moreterms
moreterms := + term moreterms | - term moreterms | <null>
Recalling our production for term and adding
that back to the list, we get:
expression := term moreterms
moreterms := + term moreterms | - term moreterms | <null>
term := number | ( expression )
The main benefit in doing all this is to lead off with
options that start with plus and minus, so that we have an easy way to detect which of the
options to take. In other words, we can predict based on the next token. Predictive
parsing!
Note that we now have right-recursion and that we have
three productions, instead of two. Plus, one of them can be matched by nothing. But the
main result here is that our production moreterms can actually be _decided_ upon
easily by a parser, so that which of the three possibilities should be accepted is clear
and concrete. It is either going to start with a +, start with a -, or else it simply is
the third option, the empty set, and requires no tokens at all. But there is no ambiguity
about which.
Having neatly tucked the recursive call to moreterms
at the end of each option,we've improved the situation and we can now write code that
won't get into an infinite, recursive call. A nice thing! We can now easily decide which
option is the right option for a production. In the case of moreterms, all we
need to is check for a plus or minus. If either are found, we know exactly which option to
use. If not, there is only one other choice.
By the way, you might also try and imagine rewriting
the last set of productions above as:
expression := term moreterms
moreterms := + expression | - expression | <null>
term := number | ( expression )
This might occur to you because of the patterns you
might match, by eye. However, this set of productions wouldn't be the same thing as the
earlier set. It would associate the minus, for example, from right to left instead of from
left to right. In other words, the expression (2 - 3 - 5) would calculate as 4 instead of
-6. Do you see why this is true?
Coding?
In this case, we create the routines called MoreTerms,
Terms, Expression, and Number. With these, it's now pretty clear how to write the code for
them, as well, so that they can easily determine which of their options to select. When
there are options from which to select, the options is always clear from the next
character being examined. This makes things much nicer.
Again avoiding Number, we have:
FUNCTION MoreTerms% (eqpos AS INTEGER, eq AS STRING)
DIM status AS INTEGER
SkipSpaces eqpos, eq
IF Match("+-", eqpos, eq) THEN
LET status = Term(eqpos, eq)
IF status THEN
LET status = MoreTerms(eqpos, eq)
END IF
ELSE
LET status = -1
END IF
LET MoreTerms = status
END FUNCTION
FUNCTION Term% (eqpos AS INTEGER, eq AS STRING)
DIM status AS INTEGER
SkipSpaces eqpos, eq
IF Match("(", eqpos, eq) THEN
LET status = Expression(eqpos, eq)
IF status THEN
SkipSpaces eqpos, eq
LET status = Match(")", eqpos, eq)
END IF
ELSE
LET status = Number(eqpos, eq)
END IF
LET Term = status
END FUNCTION
FUNCTION Expression% (eqpos AS INTEGER, eq AS STRING)
DIM status AS INTEGER
LET status = Term(eqpos, eq)
IF status THEN
LET status = MoreTerms(eqpos, eq)
END IF
LET Expression = status
END FUNCTION
Note that now I'm using Match to match either of two
characters, in MoreTerms, above. An example of a working program for this is provided
here:
- EQ_02.ZIP or EQ_02.BAS
- An example program that will only accept the a sum-type expression.
We can also take advantage of the tail recursion and
combine Expression and MoreTerms, this way:
FUNCTION Term% (eqpos AS INTEGER, eq AS STRING)
DIM status AS INTEGER
SkipSpaces eqpos, eq
IF Match("(", eqpos, eq) THEN
LET status = Expression(eqpos, eq)
IF status THEN
SkipSpaces eqpos, eq
LET status = Match(")", eqpos, eq)
END IF
ELSE
LET status = Number(eqpos, eq)
END IF
LET Term = status
END FUNCTION
FUNCTION Expression% (eqpos AS INTEGER, eq AS STRING)
DIM status AS INTEGER
LET status = Term(eqpos, eq)
DO WHILE status
SkipSpaces eqpos, eq
IF Match("+-", eqpos, eq) THEN
LET status = Term(eqpos, eq)
ELSE
EXIT DO
END IF
LOOP
LET Expression = status
END FUNCTION
If you don't see how I did that quite yet, that's fine.
But it might be worth thinking through it at this point, too. Here's an example set of
code to try out:
- EQ_03.ZIP or EQ_03.BAS
- An example program that will only accept the a sum-type expression,
but this time using the merged version just mentioned.
A Note
We could have tried something like this:
moreterms := + term moreterms | - term moreterms | <null>
term := ( term moreterms ) | number
That would do away with expression. But it
creates other problems. Which routine gets called first, to evaluate an expression? If you
call Term, you can't just have an expression of 5+3 because
it's not a number, but neither does it begin with an open paren. So, it doesn't
work. We could call two routines in consecutive order, namely Term and then MoreTerms, but
that forces the any user of these functions to know that detail, too, because it will have
to replicate it. This pushes knowledge back onto the user's code, where it shouldn't be.
We could try and expand it to read something like:
moreterms := + term moreterms | - term moreterms | <null>
term := ( term moreterms ) | term moreterms | number
But then we are taken squarely back into that
left-recusion problem in trying to figure out which option to take. That's what we were
trying to avoid, in the first place.
No. The correct way is to make a new production using expression
and refer to that.
Last updated: Friday, January 13, 2006 23:58
|