Javanotes 9.0, Section 9.5 — A Simple Recursive Descent Parser

Amanda Shelton

43 Javanotes 9.0, Section 9.5 — A Simple Recursive Descent Parser

Section 9.5

A Simple Recursive Descent Parser

I have always been fascinated by language—both
natural languages like English and the artificial languages that are used by
computers. There are many difficult questions about how languages can convey
information, how they are structured, and how they can be processed. Natural
and artificial languages are similar enough that the study of programming
languages, which are pretty well understood, can give some insight into the
much more complex and difficult natural languages. And programming languages
raise more than enough interesting issues to make them worth studying in their
own right. How can it be, after all, that computers can be made to “understand”
even the relatively simple languages that are used to write programs?
Computers can only directly use instructions expressed in very
simple machine language. Higher level languages must be translated into machine
language. But the translation is done by a compiler, which is just a program.
How could such a translation program be written?

9.5.1 Backus-Naur Form

Natural and artificial languages are similar in that they have a structure
known as grammar or syntax. Syntax can be expressed by a set of rules that
describe what it means to be a legal sentence or program. For programming
languages, syntax rules are often expressed in BNF
(Backus-Naur Form), a system that was developed by computer scientists John
Backus and Peter Naur in the late 1950s. Interestingly, an equivalent system
was developed independently at about the same time by linguist Noam Chomsky to
describe the grammar of natural language. BNF cannot express all possible
syntax rules. For example, it can’t express the fact that a variable must be
defined before it is used. Furthermore, it says nothing about the meaning or
semantics of the language. The problem of specifying the semantics of a
language—even of an artificial programming language—is one that is still
far from being completely solved. However, BNF does express the basic structure
of the language, and it plays a central role in the design of compilers.

A variety of different notations are used for BNF. The one that I will use here is
fairly common. Although other notations are used, they express the same concepts.

In English, terms such as “noun”, “transitive verb,” and “prepositional
phrase” are syntactic categories that describe
building blocks of sentences. Similarly, “statement”, “number,” and “while
loop” are syntactic categories that describe building blocks of Java programs.
In BNF, a syntactic category is written as a word enclosed between
“<” and “>“. For example: <noun>,
<verb-phrase>, or <while-loop>. A
rule in BNF specifies the structure of an item in a given
syntactic category, in terms of other syntactic categories and/or basic symbols
of the language. For example, one BNF rule for the English language might
be

<sentence>  ::=  <noun-phrase> <verb-phrase>

The symbol “::=” is read “can be”, so this rule says that a
<sentence> can be a <noun-phrase> followed by a
<verb-phrase>. (The term is “can be” rather than “is” because
there might be other rules that specify other possible forms for a sentence.)
This rule can be thought of as a recipe for a sentence: If you want to make a
sentence, make a noun-phrase and follow it by a verb-phrase. Noun-phrase and
verb-phrase must, in turn, be defined by other BNF rules.

In BNF, a choice between alternatives is represented by the symbol “|“,
which is read “or”. For example, the rule

<verb-phrase>  ::=  <intransitive-verb>  |
                    ( <transitive-verb> <noun-phrase> )

says that a <verb-phrase> can be an
<intransitive-verb>, or a
<transitive-verb> followed by a <noun-phrase>.
Note also that parentheses can be used for grouping. To express the fact that
an item is optional, it can be enclosed between “[” and “]“.
An optional item that can be repeated any number of times is enclosed between
“[” and “]…“. And a symbol that is an actual part of the
language that is being described is enclosed in quotes. For example,

<noun-phrase>  ::=  <common-noun> [ "that" <verb-phrase> ]  |
                    <common-noun> [ <prepositional-phrase> ]...

says that a <noun-phrase> can be a
<common-noun>, optionally followed by the literal word
“that” and a <verb-phrase>, or it can be a
<common-noun> followed by zero or more
<prepositional-phrase>’s. Obviously, we can describe very
complex structures in this way. The real power comes from the fact that BNF
rules can be recursive. In fact, the two preceding rules, taken together, are
recursive. A <noun-phrase> is defined partly in terms of
<verb-phrase>, while <verb-phrase> is defined
partly in terms of <noun-phrase>. For example, a
<noun-phrase> might be “the rat that ate the cheese”, since “ate
the cheese” is a <verb-phrase>. But then we can, recursively,
make the more complex <noun-phrase> “the cat that caught the rat
that ate the cheese” out of the <common-noun> “the cat”, the word “that”
and the <verb-phrase> “caught the rat that ate the cheese”.
Building from there, we can make the <noun-phrase> “the dog that
chased the cat that caught the rat that ate the cheese”. The recursive
structure of language is one of the most fundamental properties of language,
and the ability of BNF to express this recursive structure is what makes it so
useful.

BNF can be used to describe the syntax of a programming language such as
Java in a formal and precise way. For example, a <while-loop>
can be defined as

<while-loop>  ::=  "while" "(" <condition> ")" <statement>

This says that a <while-loop> consists of the word “while”,
followed by a left parenthesis, followed by a <condition>,
followed by a right parenthesis, followed by a <statement>. Of
course, it still remains to define what is meant by a condition and by a
statement. Since a statement can be, among other things, a while loop,
we can already see the recursive structure of the Java language. The exact
specification of an if statement, which is hard to express clearly in
words, can be given as

<if-statement>  ::=  
             "if" "(" <condition> ")" <statement>
             [ "else" "if" "(" <condition> ")" <statement> ]...
             [ "else" <statement> ]

This rule makes it clear that the “else” part is optional and that
there can be, optionally, one or more “else if” parts.

9.5.2 Recursive Descent Parsing

In the rest of this section, I will show how a BNF grammar for a language
can be used as a guide for constructing a parser. A parser is a program that
determines the grammatical structure of a phrase in the language. This is the
first step in determining the meaning of the phrase—which for a programming
language means translating it into machine language. Although we will look at
only a simple example, I hope it will be enough to convince you that compilers
can in fact be written and understood by mortals and to give you some idea of
how that can be done.

The parsing method that we will use is called recursive descent parsing.
It is not the only possible parsing
method, or the most efficient, but it is the one most suited for writing
compilers by hand (rather than with the help of so called “parser generator”
programs). In a recursive descent parser, every rule of the BNF grammar is the
model for a subroutine. Not every BNF grammar is suitable for recursive descent
parsing. The grammar must satisfy a certain property. Essentially, while
parsing a phrase, it must be possible to tell what syntactic category is coming
up next just by looking at the next item in the input. Many grammars are
designed with this property in mind.

When we try to parse a phrase that contains a syntax error, we need some way
to respond to the error. A convenient way of doing this is to throw an
exception. I’ll use an exception class called ParseError, defined as
follows:

/**
 * An object of type ParseError represents a syntax error found in 
 * the user's input.
 */
private static class ParseError extends Exception {
   ParseError(String message) {
      super(message);
   }
} // end nested class ParseError

Another general point is that our BNF rules don’t say anything about spaces
between items, but in reality we want to be able to insert spaces between items
at will. To allow for this, I’ll always call the routine TextIO.skipBlanks()
before trying to look ahead to see what’s coming up next in input.
TextIO.skipBlanks() skips past any whitespace, such as spaces and tabs, in the input,
and stops when the next character in the input is either a non-blank character or the
end-of-line character. (For a discussion of robust handling of TextIO input, see
Subsection 8.2.4.)

Let’s start with a very simple example. A “fully parenthesized expression”
can be specified in BNF by the rules

<expression>  ::=  <number>  |
                   "(" <expression> <operator> <expression> ")"
                   
<operator>  ::=  "+" | "-" | "*" | "/"

where <number> refers to any non-negative real number. An example
of a fully parenthesized expression is “(((34-17)*8)+(2*7))“. Since
every operator corresponds to a pair of parentheses, there is no ambiguity
about the order in which the operators are to be applied. Suppose we want a
program that will read and evaluate such expressions. We’ll read the
expressions from standard input, using TextIO. To apply recursive
descent parsing, we need a subroutine for each rule in the grammar.
Corresponding to the rule for <operator>, we get a subroutine
that reads an operator. The operator can be a choice of any of four things. Any
other input will be an error.

/**
 * If the next character in input is one of the legal operators,
 * read it and return it.  Otherwise, throw a ParseError.
 */
static char getOperator() throws ParseError {
   TextIO.skipBlanks();
   char op = TextIO.peek(); // look ahead at the next char, without reading it
   if ( op == '+' || op == '-' || op == '*' || op == '/' ) {
      TextIO.getAnyChar();  // read the operator, to remove it from the input
      return op;
   }
   else if (op == '\n')
      throw new ParseError("Missing operator at end of line.");
   else
      throw new ParseError("Missing operator.  Found \"" +
            op + "\" instead of +, -, *, or /.");
} // end getOperator()

I’ve tried to give a reasonable error message, depending on whether the next
character is an end-of-line or something else. I use TextIO.peek() to
look ahead at the next character before I read it, and I call
TextIO.skipBlanks() before testing TextIO.peek() in order to ignore
any blanks that separate items. I will follow this same pattern in every
case.

When we come to the subroutine for <expression>, things are a
little more interesting. The rule says that an expression can be either a
number or an expression enclosed in parentheses. We can tell which it is by
looking ahead at the next character. If the character is a digit, we have to
read a number. If the character is a “(“, we have to read the “(“, followed by
an expression, followed by an operator, followed by another expression,
followed by a “)”. If the next character is anything else, there is an error.
Note that we need recursion to read the nested expressions. The routine doesn’t
just read the expression. It also computes and returns its value. This requires
semantical information that is not specified in the BNF rule.

/**
 * Read an expression from the current line of input and return its value.
 * @throws ParseError if the input contains a syntax error
 */
private static double expressionValue() throws ParseError {
   TextIO.skipBlanks();
   if ( Character.isDigit(TextIO.peek()) ) {
          // The next item in input is a number, so the expression
          // must consist of just that number.  Read and return
          // the number.
      return TextIO.getDouble();
   }
   else if ( TextIO.peek() == '(' ) {
          // The expression must be of the form 
          //         "(" <expression> <operator> <expression> ")"
          // Read all these items, perform the operation, and
          // return the result.
      TextIO.getAnyChar();  // Read the "("
      double leftVal = expressionValue();  // Read and evaluate first operand.
      char op = getOperator();             // Read the operator.
      double rightVal = expressionValue(); // Read and evaluate second operand.
      TextIO.skipBlanks();
      if ( TextIO.peek() != ')' ) {
             // According to the rule, there must be a ")" here.
             // Since it's missing, throw a ParseError.
         throw new ParseError("Missing right parenthesis.");
      }
      TextIO.getAnyChar();  // Read the ")"
      switch (op) {   //  Apply the operator and return the result. 
         case '+':  return leftVal + rightVal;
         case '-':  return leftVal - rightVal;
         case '*':  return leftVal * rightVal;
         case '/':  return leftVal / rightVal;
         default:   return 0;  // Can't occur since op is one of the above.
                               // (But Java syntax requires a return value.)
      }
   }
   else {  // No other character can legally start an expression.
      throw new ParseError("Encountered unexpected character, \"" + 
            TextIO.peek() + "\" in input.");
   }
} // end expressionValue()

I hope that you can see how this routine corresponds to the BNF rule. Where
the rule uses “|” to give a choice between alternatives, there is an
if statement in the routine to determine which choice to take. Where
the rule contains a sequence of items, “(” <expression>
<operator> <expression> “)”, there is a sequence
of statements in the subroutine to read each item in turn.

When expressionValue() is called to evaluate the expression
(((34-17)*8)+(2*7)), it sees the “(” at the beginning of the input, so
the else part of the if statement is executed. The “(” is
read. Then the first recursive call to expressionValue() reads and
evaluates the subexpression ((34-17)*8), the call to
getOperator() reads the “+” operator, and the second recursive call to
expressionValue() reads and evaluates the second subexpression
(2*7). Finally, the “)” at the end of the expression is read. Of
course, reading the first subexpression, ((34-17)*8), involves further
recursive calls to the expressionValue() routine, but it’s better not
to think too deeply about that! Rely on the recursion to handle the
details.

You’ll find a complete program that uses these routines in the file
SimpleParser1.java.

Fully parenthesized expressions aren’t very natural for people to use. But
with ordinary expressions, we have to worry about the question of operator
precedence, which tells us, for example, that the “*” in the
expression “5+3*7” is applied before the “+“. The complex expression
“3*6+8*(7+1)/4-24” should be seen as made up of three “terms”,
3*6, 8*(7+1)/4, and 24, combined with “+” and “–”
operators. A term, on the other hand, can be made up of several factors
combined with “*” and “/” operators. For example,
8*(7+1)/4 contains the factors 8, (7+1) and
4. This example also shows that a factor can be either a number or an
expression in parentheses. To complicate things a bit more, we allow for
leading minus signs in expressions, as in “-(3+4)” or “-7“.
(Since a <number> is a positive number, this is the only way we
can get negative numbers. It’s done this way to avoid “3 * -7“, for
example.) This structure can be expressed by the BNF rules

<expression>  ::=  [ "-" ] <term> [ ( "+" | "-" ) <term> ]...
<term>  ::=  <factor> [ ( "*" | "/" ) <factor> ]...
<factor>  ::=  <number>  |  "(" <expression> ")"

The first rule uses the “[ ]…” notation, which says that the
items that it encloses can occur zero, one, two, or more times. The rule means that
an <expression> can begin, optionally, with a “-“. Then there
must be a <term> which can optionally be followed by one of the
operators “+” or “–” and another <term>, optionally followed by
another operator and <term>, and so on. In a subroutine that
reads and evaluates expressions, this repetition is handled by a while
loop. An if statement is used at the beginning of the loop to test
whether a leading minus sign is present:

/**
 * Read an expression from the current line of input and return its value.
 * @throws ParseError if the input contains a syntax error
 */
private static double expressionValue() throws ParseError {
   TextIO.skipBlanks();
   boolean negative;  // True if there is a leading minus sign.
   negative = false;
   if (TextIO.peek() == '-') {
      TextIO.getAnyChar();  // Read the minus sign.
      negative = true;
   }
   double val;  // Value of the expression.
   val = termValue();  // Read and evaluate the first term.
   if (negative)
      val = -val;  // Apply the leading minus sign to the first term.
   TextIO.skipBlanks();
   while ( TextIO.peek() == '+' || TextIO.peek() == '-' ) {
          // Read the next term and add it to or subtract it from
          // the value of previous terms in the expression.
      char op = TextIO.getAnyChar();  // Read the operator.
      double nextVal = termValue();    // Read and evaluate the next term.
      if (op == '+')
         val += nextVal;
      else
         val -= nextVal;
      TextIO.skipBlanks();
   }
   return val;
} // end expressionValue()

The subroutine for <term> is very similar to this, and the
subroutine for <factor> is similar to the example given above
for fully parenthesized expressions. A complete program that reads and
evaluates expressions based on the above BNF rules can be found in the file
SimpleParser2.java.

9.5.3 Building an Expression Tree

Now, so far, we’ve only evaluated expressions. What does that have to do
with translating programs into machine language? Well, instead of actually
evaluating the expression, it would be almost as easy to generate the machine
language instructions that are needed to evaluate the expression. If we are
working with a “stack machine,” these instructions would be stack operations
such as “push a number” or “apply a + operation”. The program
SimpleParser3.java can both evaluate the
expression and print a list of stack machine operations for evaluating the
expression.

It’s quite a jump from this program to a recursive descent parser that can
read a program written in Java and generate the equivalent machine language
code—but the conceptual leap is not huge.

The SimpleParser3 program doesn’t actually generate the stack
operations directly as it parses an expression. Instead, it builds an
expression tree, as discussed in Subsection 9.4.3, to
represent the expression. The expression tree is then used to find the value
and to generate the stack operations. The tree is made up of nodes belonging to
classes ConstNode and BinOpNode that are similar to those
given in Subsection 9.4.3. Another subclass of ExpNote,
UnaryMinusNode, has been
introduced to represent the unary minus operation. I’ve added a method,
printStackCommands(), to each class. This method is responsible for
printing out the stack operations that are necessary to evaluate an expression.
Here for example is the new BinOpNode class from SimpleParser3.java:

private static class BinOpNode extends ExpNode {
   char op;        // The operator.
   ExpNode left;   // The expression for its left operand.
   ExpNode right;  // The expression for its right operand.
   BinOpNode(char op, ExpNode left, ExpNode right) {
          // Construct a BinOpNode containing the specified data.
      assert op == '+' || op == '-' || op == '*' || op == '/';
      assert left != null && right != null;
           // (for assert statements, see Subsection 8.4.1)
      this.op = op;
      this.left = left;
      this.right = right;
   }
   double value() {
          // The value is obtained by evaluating the left and right
          // operands and combining the values with the operator.
      double x = left.value();
      double y = right.value();
      switch (op) {
      case '+':  
         return x + y;
      case '-':  
         return x - y;
      case '*':  
         return x * y;
      case '/':  
         return x / y;
      default:   
         return Double.NaN;  // Bad operator! Should not be possible.
      }
   }
   void  printStackCommands() {
          // To evaluate the expression on a stack machine, first do
          // whatever is necessary to evaluate the left operand, leaving
          // the answer on the stack.  Then do the same thing for the
          // second operand.  Then apply the operator (which means popping
          // the operands, applying the operator, and pushing the result).
      left.printStackCommands();
      right.printStackCommands();
      System.out.println("  Operator " + op);
   }
}

It’s also interesting to look at the new parsing subroutines. Instead of
computing a value, each subroutine builds an expression tree. For example, the
subroutine corresponding to the rule for <expression>
becomes

    static ExpNode expressionTree() throws ParseError {
           // Read an expression from the current line of input and
           // return an expression tree representing the expression.
           // (The return value is a pointer to the root of the tree.)
       TextIO.skipBlanks();
       boolean negative;  // True if there is a leading minus sign.
       negative = false;
       if (TextIO.peek() == '-') {
          TextIO.getAnyChar();
          negative = true;
       }
       ExpNode exp;   // The expression tree for the expression.
       exp = termTree();  // Start with a tree for first term.
       if (negative) {
              // Build the tree that corresponds to applying a
              // unary minus operator to the term we've
              // just read.
          exp = new UnaryMinusNode(exp);
       }
       TextIO.skipBlanks();
       while ( TextIO.peek() == '+' || TextIO.peek() == '-' ) {
                // Read the next term and combine it with the
                // previous terms into a bigger expression tree.
           char op = TextIO.getAnyChar();
           ExpNode nextTerm = termTree();
                // Create a tree that applies the binary operator
                // to the previous tree and the term we just read.
           exp = new BinOpNode(op, exp, nextTerm);
           TextIO.skipBlanks();
       }
       return exp;
    } // end expressionTree()

In some real compilers, the parser creates a tree to represent the program
that is being parsed. This tree is called a parse tree
or abstract syntax tree.
Parse trees are somewhat different in form from expression trees,
but the purpose is the same. Once you have the tree, there are a number of
things you can do with it. For one thing, it can be used to generate machine
language code. But there are also techniques for examining the tree and
detecting certain types of programming errors, such as an attempt to reference
a local variable before it has been assigned a value. (The Java compiler, of
course, will reject the program if it contains such an error.) It’s also
possible to manipulate the tree to optimize the
program. In optimization, the tree is transformed to make the program more
efficient before the code is generated.

And so we are back where we started in Chapter 1, looking at programming
languages, compilers, and machine language. But looking at them, I hope, with a
lot more understanding and a much wider perspective.