Monday, July 22, 2013

LL and LR Parsing Demystified

My first adventures in parsing theory came when I was doing an independent study of programming languages in college. When I got to the part about algorithms such as LL, LR, and their many variations (Strong-LL, SLR, LALR, etc), I was fascinated. I felt like I was peering at some deep and powerful incantations, the significance of which I could not yet appreciate, but I was sure that someday terms like "left-to-right, rightmost derivation" would make perfect sense, and I looked forward to achieving this enlightenment.

I can say now, 10 years and a whole shelf of parsing books later, that I understand these algorithms pretty well. But the way I think about them is quite different than any of the literature that I have found. I think more from an implementation perspective than a mathematical one, which could have something to do with it. In any case, I want to explain how I think of these algorithms; hopefully some people will find this perspective intuitive, as I do.

This article will only address the parser from a black-box perspective: what are its inputs/outputs and what are its constraints? A planned future article will break open the black box for more details about the inner workings of these algorithms.

Parsing And Polish Notation

If you studied Computer Science at university, or ever owned an HP calculator, you likely came across Polish and Reverse Polish notation. These are ways of writing mathematical expressions that don't need parentheses or order-of-operations rules. We're used to writing expressions as infix, where the operators go in between the operands:
  1 + 2 * 3
In this case, how do you know what order to do the operations in? You have to follow the conventional rules (PEDMAS), and if you want a different order you have to use parentheses, like:
  (1 + 2) * 3
Polish and Reverse Polish notation let you write these expressions without needing arbitrary order-of-operations rules or parentheses to disambiguate. They work by putting the operators before (Polish) or after (Reverse Polish) the operands. These are also known as prefix and postfix notation.
  // First example:
  1 + 2 * 3  // Infix
  + 1 * 2 3  // Polish (prefix)
  1 2 3 * +  // Reverse Polish (postfix)

  // Second example:
  (1 + 2) * 3  // Infix
  * + 1 2 3    // Polish (prefix)
  1 2 + 3 *    // Reverse Polish (postfix)
Besides not needing parentheses or a defined order of operations, Polish and Reverse Polish are much easier to write evaluators for (maybe the HP calculator's designers decided to use Reverse Polish so they could spend a week in the Bahamas). Here is a simple Reverse Polish evaluator in Python.
# Functions that define the operators and how to evaluate them.
# This example assumes binary operators, but this is easy to extend.
ops = {
  "+": (lambda a, b: a + b),
  "-": (lambda a, b: a - b)
}
 
def eval(tokens):
  stack = []
 
  for token in tokens:
    if token in ops:
      arg2 = stack.pop()
      arg1 = stack.pop()
      result = ops[token](arg1, arg2)
      stack.append(result)
    else:
      stack.append(int(token))
 
  return stack.pop()
 
print "Result:",  eval("7 2 3 + -".split())
Polish and Reverse Polish notation, as they are usually described, do require that all operators have a known arity. Arity is just the number of operands that the operator takes. This means, for example, that unary minus needs to be a different operator than subtraction. Otherwise we don't know how many operands to pop from the stack when we see an operator.

A similar formulation that avoids this problem is one like Lisp's s-expressions. S-expressions (and similar encodings like XML) avoid the need for fixed operator arity by explicitly marking both the beginning and end of each expression:
  ; Lisp-style prefix notation; the same operator can be used
  ; for different numbers of arguments.
  (+ 1 2)
  (+ 1 2 3 4 5)

  ; Lisp equivalents of our first two examples:
  ; Prefix: + 1 * 2 3
  (+ 1 (* 2 3))

  ; Prefix: * + 1 2 3
  (* (+ 1 2) 3)
This variant has slightly different tradeoffs (we traded fixed arity for required parentheses) but the underlying algorithms for parsing/processing these are quite similar, so generally we'll consider this a minor variant of prefix notation.

It might seem like I've veered off-topic, but I've been sneakily talking about LL and LR the whole time. Polish and Reverse Polish notation directly correspond, in my view, to LL and LR parsing, respectively. But to fully explore this idea we need to first describe what kind of output we expect from a parser.

For a fun exercise, try implementing an algorithm to convert Polish to Reverse Polish notation. See if you can do it without building the entire expression into a tree first; you can do it with a stack alone. Now say you wanted to do the opposite (Reverse Polish to Polish) -- you can just run the same algorithm on the input, but backwards! Of course you can build an intermediate tree too, but this takes O(input length) space, whereas the stack-only solution takes only O(tree depth) space. How about going infix to postfix? There is a very clever and efficient algorithm for that called the Shunting Yard Algorithm.

A Parser And Its Output

We can all agree that the input to a parser is a stream of tokens (which probably came from a lexer, but we can talk about that part another day). But what is a parser's output? You might be inclined to say "a parse tree," and while you can certainly use a parser to build a parse tree, it's also possible to consume a parser's output without ever actually building a tree at all. For example, this Bison example evaluates arithmetic expressions in-line with the parse. Every time a subexpression is recognized, it is immediately evaluated until the final result is just a single number. No parse tree is ever explicitly built.

For this reason, saying that the output of a parser is a parse tree is not general enough. Instead, I claim that the output of a parser, at least for the cases of LL and LR which we are discussing today, is a parse tree traversal.

Apologies if I've set off anyone's nonsense detectors. I can hear people protest that a tree traversal is an algorithm; an operation you perform on a tree. How can I say that a parser outputs a tree traversal? The answer lies in thinking back to Polish and Reverse Polish notation. These are usually thought of as just a notation for mathematical expressions, but we can think of them more generally as flat (serialized) encodings of tree traversals.

Think back to our first example of 1 + 2 * 3. Here is that expression written as a tree:


There are three ways of traversing a binary tree, as explained on Wikipedia: in-order, pre-order, and post-order. They differ in whether you visit the parent node before (pre-order), after (post-order), or in between (in-order) the children of that parent. These three correspond exactly to infix, Polish, and Reverse Polish notation:
  1 + 2 * 3  // Infix expression; in-order traversal.
  + 1 * 2 3  // Polish (prefix) expression; pre-order traversal.
  1 2 3 * +  // Reverse Polish (postfix) expression; post-order traversal.
So Polish and Reverse Polish notation fully encode a tree structure and the steps you would take to traverse it. The main difference between these encodings and an actual parse tree is that Polish and Reverse Polish encodings are not random access. With a real tree you can choose to follow an interior node to its right subtree, its left subtree, or even (for many trees) its parent. With these linear encodings there is no such flexibility: you have to follow the traversal as it is already encoded.

But on the plus side, this allows the parser's output to be a stream that can be consumed while the parse is happening. This is how the Bison example from before could manage to evaluate the arithmetic expression without ever actually building a tree. If a fully-fledged tree is actually required, the linear tree traversal can easily build one. But in cases where it is not, the cost of building one can be avoided.

This brings us to a key point:
The primary difference between how LL and LR parsers operate is that an LL parser outputs a pre-order traversal of the parse tree and an LR parser outputs a post-order traversal.
This is equivalent to these more traditional but (in my view) more confusing and less intuitive explanations of the distinction:
  • "LL parsers produce a leftmost derivation, while LR parsers produce a reversed rightmost derivation."
  • "LL parsers build the tree from the top-down, while LR parsers build the tree from the bottom-up."
  • LL parsers are often called "predictive parsers," while LR parsers are often called "shift-reduce parsers."

The Shape of a Parse Tree

Our arithmetic expression tree that we've been using so far isn't truly a parse tree, because it doesn't correspond to a grammar. To examine actual parse trees, we'll need a real grammar. Unfortunately writing grammars for infix arithmetic expressions isn't as simple or elegant as you might expect. Encoding the precedence and associativity rules into a grammar that is unambiguous (and can be handled by LL and LR parsers) is pretty ugly and nonintuitive. This is one reason why LL and LR parsers are often extended with capabilities that let you specify operator precedence; for example, see the precedence features of Bison. But for the purposes of this article we want to discuss pure LL and LR.

So we need to ditch our arithmetic expressions example in favor of something that is easier to write a grammar for. We'll use JSON since it is very simple but just complex enough to be interesting.
object → '{' pairs '}'

pairs → pair pairs_tail | ε
pair → STRING ':' value
pairs_tail → ',' pairs | ε

value → STRING | NUMBER | 'true' | 'false' | 'null' | object | array
array → '[' elements ']'

elements → value elements_tail | ε
elements_tail → ',' elements | ε
I've used single-quoted strings for literal tokens and upper-case names like STRING for tokens whose spelling can vary (for example, "abc" and "" are both valid STRING tokens). All lower-case names are grammar rules (also known as "nonterminals").

You might wonder why I am doing this pairs_tail and elements_tail business instead of using a repetition construct, which many parser generators like ANTLR support. They let you write something like:
elements → value (',' value)*
While this is more convenient and leads to simpler grammars, it makes parse trees conceptually a bit more complicated, because the number of children for a given grammar rule can vary. Also, LR parsers can't support repetition operators (for example, Bison doesn't support them), whereas the grammar I wrote above can be used with both LL and LR parsers. So we'll use this slightly more complicated grammar for now.

Now that we have a grammar, we can look at an example of a stream of tokens and the resulting parse tree.
{"message":"Hello, World!"}
The token stream for this text is:
{ STRING : STRING }
And the parse tree, according to our grammar, is:


Notice that the leaf nodes (colored in green) are all tokens, and correspond exactly to the sequence of tokens that was the input to our parser. (I cheated slightly by making ε into a leaf node, but this turns out to be pretty clean and principled as we will see).

I claimed earlier that LL parsers output a pre-order traversal and LR parsers a post-order traversal. From this we can say what output we would expect to get from an LL and LR parser given this input:
// LL output: pre-order traversal.
object '{' pairs(1) pair STRING ':' value(1) STRING pairs_tail(2) '}'

// LR output: post-order traversal.
'{' STRING ':' STRING value(1) pair pairs_tail(2) pairs(1) '}' object
Since leaf nodes are always exactly the input tokens themselves, in exactly the order of the input, all the parser is really doing is inserting the interior nodes in the appropriate places. Another way of looking at it is that a parse tree is just a bunch of structure that is defined on top of the sequence of input tokens. We can see this more clearly if we rearrange our previous diagram slightly:


We are converging on a very simple model for how LL and LR parsers operate. Both read a stream of input tokens and output that same token stream, inserting rules in the appropriate places to achieve a pre-order (LL) or post-order (LR) traversal of the parse tree.


Here we see another advantage of thinking of a parser's output in terms of Polish and Reverse Polish Notation. It lets us model the parser's input and output both as simple, flat streams. This is a lot simpler than thinking of a parser's intermediate output state as a partially-built tree -- it is unclear how you could meaningfully consume or inspect that.

Lookahead

LL and LR parsers are "on-line," meaning that they can start producing output while they are still consuming input. But in many cases the tokens prior to a stream position do not contain enough information for the parser to know whether it should insert a rule or not (or if so, which rule it should insert). So the parser will "look ahead" in the stream to see what the next token(s) are before it makes a decision. When you see designations like LL(1), LR(0), etc. the number in parentheses is the number of tokens of lookahead.

Note that the lookahead is relative to where the rule should be inserted, which (as you will remember) is before that rule's tokens for LL parsers or after that rule's tokens for LR parsers. That means that LL lookahead counts from the beginning of the rule's tokens, whereas LR lookahead counts from the end. This gives LR parsers a huge advantage, because they get to see all of the rule's tokens (and maybe some lookahead) before they have to commit to a decision, whereas LL parsers only get to see the first few tokens of the rule.



This is why there is such a thing as an LR(0) parser, whereas an LL(0) parser would be impossible; it would have no information with which to know what rule to use for the following tokens!

Consequences

With this understanding of LL vs LR parsing, we can draw a number of very significant conclusions about why certain things are the way they are. These illustrate a lot of the pros/cons of LL vs LR parsing.

LR parsers can handle more grammars

This follows from the previous section about lookahead. Since LR lookahead starts from the end of a rule, a LR(1) parser has strictly more information available to it when making a decision than an LL(1) parser. It follows that LR(1) parsers can parse strictly more grammars than LL(1) (modulo LL-only grammar extensions; see below). LR parsers can also handle left recursion, which LL parsers cannot.

Advantage: LR

On the other hand, since LL parsers commit to what rule they are parsing before they parse that rule's tokens, and LL parser knows the context of what it is parsing whenever it parses a token. While that is more difficult job (since they have less information to go on), it also leads to some important advantages.

LL parsers can support regex-like operators in grammars

Knowing the parsing context makes it possible to extend grammars with rich regex-like operators like repetition, alternation anywhere (not just at the top level), etc. Basically each rule can form a DFA. This is possible with top-down parsing because the parser knows what rule it is in and can step through that rule's state machine as it is parsing it. I don't believe this is possible with bottom-up parsing (even if you could somehow get the parsing tables to do the right thing, the "reduce" step counts on the reduction having a fixed arity). This is a really nice advantage of LL, because grammars are often more readable with these rich grammar extensions. In practice this helps mitigate the more restrictive grammar rules of LL, since many things that you'd want left-recursion for you can use repetition operators instead.
// LR Grammar: nothing fancy allowed, alternation only allowed
// at the top-level.
//
// This is only allowed because it is equivalent to:
// pairs → pair pairs_tail
// pairs → ε
pairs → pair pairs_tail | ε

// Extended LL grammar; possible because we can build each
// rule into a DFA.
pairs → (pair (',' pair)*)?
The latter rule could be built into a DFA like (green states are "accept" states):


Knowing the context also allows for mid-rule actions (custom code that runs in-between any two elements of the rule). Bison supports this, but only by rewriting the grammar internally which makes it appear more complicated in any kind of visualization.

Advantage: LL

LL parsers support inherited attributes

Knowing the context also allows LL-based applications to pass attributes/metadata down the tree as it is being built (this is sometimes referred to as "inherited attributes." (Both LL and LR parsers can support "synthesized attributes" which are passed up the tree).

Advantage: LL

Conclusion

I have described a alternate model for LL and LR parsers that is equivalent to, but more intuitive (for me at least), than most of the literature. We can consider a parser a black box that inputs and outputs streams of tokens and rules according to pre-order or post-order notation. So far we have not explored the inner workings of these parsers at all; we have just considered them black boxes and we have no idea how they work internally. We have not explored issues of what grammars they can handle and which they can't. We also have not explored the variants of LL and LR (Strong-LL, SLR, LALR, etc). I hope to explore these issues more fully, including example code, in a follow-up article.

7 comments:

  1. Thanks for the article. I'm looking forward to the next one, especially since I've been beating the "parser-output-is-traversal" drum for years.

    ReplyDelete
  2. I think that's the only explanation of LL and LR that will stick in my head. Thanks!

    ReplyDelete
  3. An addition with PEG parsing would be nice, also what happened to gazelle?!

    ReplyDelete
  4. i agree with ale, i have been mystified about LL and LR up to this point,....its starting to make some sense now :)

    ReplyDelete
  5. Really well written and much clearer than the competing explanations. I'm writing a compiler in my spare time and have been fiddling with expression parsing for a while now (hello, Shunting Yard). Thanks a lot for posting.

    ReplyDelete
  6. Personally, I strongly believe that the output of a parser should be the object graph that it represents, and by "object graph" I'm NOT talking about a parse tree (AST). Rather, think of a parser as a deserializer. Also, I think there's no reason to choose LL, LR, etc only - especially not in these days of advanced languages with embedded DSLs (such as LINQ in C#). Instead, pick and choose what works best for each subset of your language. Most importantly, though, I think it should be possible to extend the parsed language by simply creating new classes for new language features and "plugging" them in without the need to update a formal grammar and re-generate your parser. This article explains further, and includes full sources for a C# parser implemented using these ideas:
    http://www.codeproject.com/Articles/492466/Object-Oriented-Parsing-Breaking-With-Tradition-Pa

    ReplyDelete
  7. Really nice post. Looking forward for the next one.

    ReplyDelete