Sunday, June 22, 2014

Beware of Lua finalizers in C modules!

If you write Lua C modules, there is a corner case surrounding finalizers (ie. __gc metamethods) that you should be aware of. I haven't run into a public and prominent warning about this, so I decided to write it up in this entry. I only discovered it myself through trial and error.

Lua provides the __gc metamethod for releasing resources associated with a userdata. It is commonly used in C modules to free any C-level objects or resources that were allocated by a type.
#include "lauxlib.h"
#include "mytype.h"

#if LUA_VERSION_NUM == 501
#define setfuncs(L, l) luaL_register(L, NULL, l)
#define luaL_setmetatable(L, name) \
  luaL_getmetatable(L, name); \
  lua_setmetatable(L, -2)
#elif LUA_VERSION_NUM == 502
#define setfuncs(L, l) luaL_setfuncs(L, l, 0)

static const char *MYTYPE = "mytype";

typedef struct {
  mytype_t *val;
} MyTypeWrapper;

static int newobj(lua_State *L) {
  MyTypeWrapper *obj = lua_newuserdata(L, sizeof(MyTypeWrapper));
  obj->val = mytype_new();
  luaL_setmetatable(L, MYTYPE);
  return 1;

static int call(lua_State *L) {
  MyTypeWrapper *obj = lual_checkudata(L, 1, MYTYPE);
  return 0;

static int gc(lua_State *L) {
  MyTypeWrapper *obj = lual_checkudata(L, 1, MYTYPE);
  return 0;

static const struct luaL_Reg mm[] = {
  {"__gc", gc},
  {"__call", call},

int luaopen_ext(lua_State *L) {
  luaL_newmetatable(L, mytype);
  setfuncs(L, mm);
  lua_pushcfunction(L, &newobj);
  return 1;
It turns out this will work nearly all of the time. But there is one very unusual corner case that can reliably cause gc() to run before call()! This program can trigger this behavior on both Lua 5.1 and Lua 5.2 (and LuaJIT):
local ext = require "ext"

if _VERSION >= 'Lua 5.2' then
  function defer(fn)
    setmetatable({}, { __gc = fn })
  function defer(fn)
    getmetatable(newproxy(true)).__gc = fn

local y = {}
defer(function() y[1]() end)
y[1] = ext()
Basically, any Lua code that runs inside a __gc metamethod can get access to a userdata that has already been finalized! This can crash your C extension if you don't handle this case.

There are two main solutions to this problem:
  1. Clear the userdata's metatable inside the finalizer. This will ensure that the userdata fails any subsequent luaL_checkudata() check later. The downside is that the error message for trying to call a method on a finalized value will be very unhelpful (something like "attempt to index field '?' (a user data value)")
  2. Set a "finalized" flag inside the finalizer, and check this flag right after calling luaL_checkudata(). For example, you could set the pointer to NULL and check for this. This gives you the benefit of being able to return a custom error message like "attempted to call into dead object."
Here is an example of what the second solution might look like:
mytype *mytype_check(lua_State *L, int narg) {
  mytype *obj = luaL_checkudata(L, narg, MYTYPE);
  if (!obj->val) luaL_error(L, "called into dead object");
  return obj;

static int call(lua_State *L) {
  MyTypeWrapper *obj = mytype_check(L, 1);
  return 0;

static int gc(lua_State *L) {
  mytype *obj = mytype_check(L, 1);
  // The critical step that will prevent us from allowing
  // a call into a dead object.
  obj->val = NULL;
  return 0;
For a bit more discussion about this, see this lua-l thread where I raised the issue.

The moral of the story is: anytime you are using __gc from a C module, you need to handle the case where the finalizer gets called before other methods. Otherwise a user could SEGV your module.

Saturday, February 22, 2014

On "The Future of JavaScript MVC Frameworks"

This entry is a sort of "part 2" to my previous entry React Demystified. Now that I understand React better, I want to take a closer look at the blog post that motivated this quest to begin with, The Future of JavaScript MVC Frameworks. That entry advocates a vision for what MVC's of the future will look like in JavaScript, and is accompanied by some strong benchmark numbers.

The article's overall argument is that the design/architecture of most JavaScript MVC libraries today makes them slow. While they can be optimized to improve performance, a much better approach (according to the article) is to change the design of the MVC library to something that is inherently faster; a design where the optimizations fall out of the design "for free". A design that is fast by default. This is the claim that I want to take a deeper look at and evaluate in detail.

I want to go deeper because the form of the argument -- the idea that a fundamentally different design can render a lot of busy-work obsolete -- is one that resonates with me. But there are a lot of aspects of this that are all mashed together a bit in the article, so I want to break them apart and look at them one by one.

React/Om vs. Backbone.js benchmarks

The article begins with some benchmarks from TodoMVC, a very cool web site that creates the same little TODO application in a ton of different JavaScript MVC frameworks. This is not only a great body of example code, but it has benchmarks for doing performance comparisons between the frameworks.

The article starts by noting that the React/Om library (which the author also wrote) is 2-4x faster than Backbone in "Benchmark 1" of TodoMVC, and about 800x faster than Backbone for Benchmark 2. This is followed with a profiling graph that shows Backbone making a ton of short-running function calls, where React/Om make many fewer calls that run longer.

From what I can tell, it appears that these benchmark numbers are mainly illustrating the difference between doing direct DOM updates (as in the Backbone example) and using a library like React that batches DOM updates. I suspect that these benchmarks have little to do with Om and everything to do with React.

In particular, "Benchmark 2" (the one with the 800x speedup vs Backbone) is effectively a no-op on the DOM, so none of Om's custom shouldComponentUpdate() optimizations are coming into play here. The only thing that Om seems to be possibly contributing performance-wise to this benchmark is that it uses requestAnimationFrame() instead of the default React batching strategy, but this can be done easily with plain React too: here is a github project to do it. Unfortunately the React example for TodoMVC doesn't implement the benchmarks, but if it did I suspect the performance would be almost identical to the React/Om numbers.

The author addresses this point a moment later:
Of course you can use Backbone.js or your favorite JS MVC with React, and that's a great combo that delivers a lot of value. However, I'll go out on a limb and say I simply don't believe in event-oriented MVC systems - the flame graph above says it all. Decoupling the models from the views is only the first important step.
The argument here is that even if you use React with Backbone, Backbone is still based around change events (ie. a "push model"). Using Backbone together with React means calling React's forceUpdate() whenever a Backbone change handler fires. The author argues that the flame graph from before illustrates that making lots of little function calls whenever there is a change event is slow.

I'm not sure I buy this argument. The flame graph from before is significant because it shows lots of DOM updates. It illustrates that Backbone performs all of its DOM updates eagerly, and DOM updates are known to be one of the biggest application bottlenecks in JavaScript applications. React is fast because it aggressively batches and minimizes its DOM updates. Having Backbone fire a bunch of change handlers that call React's forceUpdate() function a lot isn't slow, because React will still wait a while to actually re-render everything and update the DOM.

Om and Immutable Data Structures

The next section describes a lot of points that are more Om-specific. Om is an immutable data structure library: a structure is never mutated once created, which means that the contents of the entire tree are captured in the identity of the object root. Or in simpler terms, if you are passed an object that is the same object (by identity) that you saw earlier, you can be assured that it hasn't changed in the meantime. This pattern has a lot of nice properties, but also generates more garbage than mutable objects.

React/Om takes advantage of this by implementing React's shouldComponentUpdate() call to optimize away render() calls (and related diff-ing) when the objects are the same. When objects are the same they are known to have the same value, which is how we know that the optimization is safe. This is particularly important for React/Om, because unlike React with mutability (ie. setState()) React/Om completely refreshes the tree from the root every time. So React/Om's shouldComponentUpdate() will return true for the root and for any paths to nodes which have changed, but can prune away diffs for any subtrees that have not changed.

I think it's worthwhile to note that React/Om does not necessarily do less work than if you used React with setState(). If you called setState() on an interior node of the virtual DOM, you would only trigger a re-render of that one component. But with React/Om, you will trigger a re-render of all components between the root and the changed component. The final DOM update will be the same in both cases, but React will have had to do more work to discover that. It's a small amount more work (and in other cases the setState() approach will require more work, possibly a lot more if you're careless), but it was an interesting and unexpected that the immutable approach isn't a strict improvement over the mutable one.

The article goes on to describe several other benefits of the functional-style immutable data structures. The biggest one as I see it is "undo for free" -- this is an undeniably powerful pattern of immutable data structures.


The article's broad claims that current JavaScript MVCs are inherently slow made a big impression on me. But when I examined the article's substance in more depth, I was not fully convinced of this. To me most of the claimed improvements boil down to: updating the DOM is slow, so use a framework that batches updates to it. React does this, and I have heard that Angular does batching to some extent too. After looking at this, I'm not convinced that immutable data structures are inherently better-performing in the browser than mutable ones. And it is hard to imagine convenient two-way data binding being built on top of immutable data structures.

That said, there is a lot to like about immutable data structures, particularly their snapshot/undo capabilities.

React Demystified

This entry will be a bit of a departure from the usual content of this blog, which is mostly about parsing and low-level programming. Lately I've had some interest in JavaScript frameworks, including Facebook's React. Some recent articles I have read, particularly The Future of JavaScript MVC Frameworks, have convinced me that there are some deep and powerful ideas in React, but none of the articles or documentation I could find explained the core abstractions in a way that satisfied me. Much like my previous article LL and LR Parsing Demystified, this article is an attempt to explain the core ideas in a way that makes sense to me.

The 1000-Foot View

In a traditional web app, you interact extensively with the DOM, usually using jQuery:

I made the DOM red because updating the DOM is expensive. Now sometimes the "App" will have model classes that it uses internally to represent state, but for our purposes that is an implementation detail that is internal to the app.

React's primary goal is to provide a different and more efficient way of performing DOM updates. Instead of mutating the DOM directly, your app builds a "virtual DOM", and React handles updating the real DOM to match:

How does introducing an extra layer make things faster? Doesn't that imply that the browsers have sub-optimal DOM implementations, if adding a layer on top of them can speed them up?

It would mean that, except that the virtual DOM has different semantics than the real DOM. Most notably, changes to the virtual DOM are not guaranteed to take effect immediately. This allows React to wait until the end of its event loop before it even touches the real DOM at all. At that point it calculates a nearly-minimal diff and applies it to the real DOM in as few steps as possible.

Batching DOM updates and applying minimal diffs are things that an application could do on its own. Any application that did this would be as efficient as React. But doing this manually is tedious and error-prone. React handles that for you.


I mentioned that the virtual DOM has different semantics than the real DOM, but it also has a noticeably different API. The nodes in the DOM tree are elements, but the nodes of the virtual DOM are a completely different abstraction called components.

The use of components is very important to React, because components are designed to make calculating the DOM diff much more efficient than the O(n^3) that the fully general tree-diff algorithm would cost.

To find out why, we'll have to dig in to the design of components a bit. Let's take the React "Hello, World" example from their front page:
/** @jsx React.DOM */
var HelloMessage = React.createClass({
  render: function() {
    return <div>Hello {}</div>;

React.renderComponent(<HelloMessage name="John" />, mountNode);
There is an awful lot going on here that isn't entirely explained. Even this short example illustrates some big ideas, so I'm going to take some time here and go slow.

This example creates a React component class "HelloMessage", then creates a virtual DOM with one component (<HelloMessage>, essentially an "instance" of the HelloMessage class) and "mounts" it onto the real DOM element mountNode.

The first thing to notice is that the React virtual DOM is made up of custom, application-defined components (in this case <HelloMessage>). This is a significant departure from the real DOM where all of the elements are browser built-ins like <p>, <ul>, etc. The real DOM carries no application-specific logic; it is just a passive data structure that lets you attach event handlers. The React virtual DOM, on the other hand, is built from application-specific components that can carry application-specific APIs and internal logic. This is more than a DOM-updating library; it is a new abstraction and framework for building views.

As a side note: If you've been keeping up with all things HTML you may know that HTML custom elements may be coming to browsers soon. This will bring to the real DOM a similar capability: defining application-specific DOM elements with their own logic. But React has no need to wait for official custom elements because the virtual DOM isn't a real DOM. This allows it to jump the gun and integrate features similar to custom elements and Shadow DOM before browsers add those features to the real DOM.

Getting back to our example, we have established that it creates a component called <HelloMessage> and "mounts" it on mountNode. I want to diagram this initial situation in a couple of ways. First let's visualize the relationship between the virtual DOM and the real DOM. Let's assume that mountNode is the document's <body> tag:

The arrow indicates that the virtual element is mounted on the real DOM element, which we'll see in action shortly. But let's also take a look at the logical illustration of our application's view right now:

That is to say, our entire web page's content is represented by our custom component <HelloMessage>. But what does a <HelloMessage> look like?

The rendering of a component is defined by its render() function. React does not say exactly when or how often it will call render(), only that it will call it often enough to notice valid changes. Whatever you return from your render() method represents how your view should look in the real browser DOM.

In our case, render() returns a <div> with some content in it. React calls our render() function, gets the <div>, and updates the real DOM to match. So now the picture looks more like this:

It doesn't just update the DOM though; it remembers what it updated it to. This is how it will perform fast diffs later.

I glossed over one thing, which is how a render() function can return DOM nodes. This is obscured by the JSX which isn't plain JavaScript. It's instructive to see what this JSX compiles to:
/** @jsx React.DOM */
var HelloMessage = React.createClass({displayName: 'HelloMessage',
  render: function() {
    return React.DOM.div(null, "Hello ",;

React.renderComponent(HelloMessage( {name:"John"} ), mountNode);
Aha, so what we're returning aren't real DOM elements, but React shadow DOM equivalents (like React.DOM.div) of real DOM elements. So the React shadow DOM really has no true DOM nodes.

Representing State and Changes

So far I've left out a huge piece of the story, which is how a component is allowed to change. If a component wasn't allowed to change, then React would be nothing more than a static rendering framework, similar to a plain templating engine like Mustache or HandlebarsJS. But the entire point of React is to do updates efficiently. To do updates, components must be allowed to change.

React models its state as a state property of the component. This is illustrated in the second example on the React web page:
/** @jsx React.DOM */
var Timer = React.createClass({
  getInitialState: function() {
    return {secondsElapsed: 0};
  tick: function() {
    this.setState({secondsElapsed: this.state.secondsElapsed + 1});
  componentDidMount: function() {
    this.interval = setInterval(this.tick, 1000);
  componentWillUnmount: function() {
  render: function() {
    return (
      <div>Seconds Elapsed: {this.state.secondsElapsed}</div>

React.renderComponent(<Timer />, mountNode);
The callbacks getInitialState(), componentDidMount(), and componentWillUnmount() are all invoked by React at appropriate times, and their names should pretty clearly give away their meanings given the concepts we have explained so far.

So the basic assumptions behind a component and its state changes are:
  1. render() is only a function of the component's state and props.
  2. the state does not change except when setState() is called.
  3. the props do not change except when our parent re-renders us with different props.
(I did not explicitly mention props before, but they are the attributes passed down by a component's parent when it is rendered.)

So earlier when I said that React would call render "often enough", that means that React has no reason to call render() again until somebody calls setState() on that component, or it gets re-rendered by its parent with different props.

We can put all of this information together to illustrate the data-flow when the app initiates a virtual DOM change (for example, in response to an AJAX call):

Getting Data from the DOM

So far we have only talked about propagating changes to the real DOM. But in a real application, we'll want to get data from the DOM also, because that is how we receive all input from the user. To see how this works, we can examine the third example on the React home page:
/** @jsx React.DOM */
var TodoList = React.createClass({
  render: function() {
    var createItem = function(itemText) {
      return <li>{itemText}</li>;
    return <ul>{}</ul>;
var TodoApp = React.createClass({
  getInitialState: function() {
    return {items: [], text: ''};
  onChange: function(e) {
  handleSubmit: function(e) {
    var nextItems = this.state.items.concat([this.state.text]);
    var nextText = '';
    this.setState({items: nextItems, text: nextText});
  render: function() {
    return (
        <TodoList items={this.state.items} />
        <form onSubmit={this.handleSubmit}>
          <input onChange={this.onChange} value={this.state.text} />
          <button>{'Add #' + (this.state.items.length + 1)}</button>
React.renderComponent(<TodoApp />, mountNode);
The short answer is, you handle DOM events manually (as with the onChange() handler in this example), and your event handler can call setState() to update the UI. If your app has model classes, your event handlers will probably want to update your model appropriately and also call setState() so React also knows there were changes. If you've gotten used to frameworks that provide automatic two-way data binding, where changes to your model are automatically propagated to the view and vice versa, this may seem like a step backwards.

There is more to this example than meets the eye though. Despite how this example may look, React will not actually install an "onChange" handler on the <input> element on the real DOM. Instead it installs handlers at the document level, lets events bubble up, and then dispatches them into the appropriate element of the virtual DOM. This gives benefits such as speed (installing lots of handlers on the real DOM can be slow) and consistent behavior across browsers (even on browsers that have non-standard behavior for how events are delivered or what properties they have).

So putting all of this together, we can finally get a full picture for the data flow when a user event (ie. a mouse click) results in a DOM update:


I learned a lot about React by writing this entry. Here are my primary takeaways.

React is a view library. React doesn't impose anything about your models. A React component is a view-level concept and a component's state is just the state of that portion of the UI. You could bind any sort of model library to React (though certain ways of writing the model will make it easier to optimize updates further, as the Om post explains).

React's component abstraction is very good at pushing changes to the DOM. The component abstraction is principled, composes well, and efficient DOM updates fall out of the design.

React components are less convenient for getting updates from the DOM. Writing event handlers gives React a distinctly lower-level feel than libraries that automatically propagate view changes into the model.

React is a leaky abstraction. Most of the time you will program only to the virtual DOM, but sometimes you need to escape this and interact with the real DOM directly. The React docs talk more about this and the cases where this is necessary in their Working With the Browser section.

With my new knowledge I am inclined to more closely examine the claims made in the article The Future of JavaScript MVC Frameworks, but that is a slightly different topic that will have to wait for another entry.

I am not an expert in React, so kindly let me know of any mistakes.

Thursday, September 5, 2013

LL and LR in Context: Why Parsing Tools Are Hard

In my last blog entry LL and LR Parsing Demystified, we explored LL and LR parsers from a black-box perspective. We arrived at a model for these parsers where both their input and output were streams of tokens, with the parser inserting rules as appropriate according to Polish and Reverse Polish notation.

In future articles I want to focus in even closer on the details of LL and LR algorithms, but I realized that I should first zoom out and give some motivation for why anyone should care about LL or LR to begin with.

As I wrote this article, it turned into an answer to the question "why is parsing hard?" Or alternatively "why doesn't everybody use parser generators?" LL and LR parsing theory is taught in in books like Compilers: Principles, Techniques, and Tools (known as "The Dragon Book" and used in many university compilers courses), but then people graduate to find that most parsers in the real world don't work like this. What gives? This article is my answer to that question.

Theory vs. Practice

The theory of LL and LR parsing is almost 50 years old: Knuth's paper On the Translation of Languages from Left to Right that first defined LR(k) was published in 1965. This is only one of an incredible number of mathematically-oriented papers about parsing and language theory. Over the last 50 years academics have explored the mathematical dimensions of parsing with great vigor, but the field is nowhere near exhausted; even in the last five years we've seen some entirely new and important results published. One of the best surveys of the field is the book Parsing Techniques: A Practical Guide, whose bibliography contains over 1700 cited papers!

Despite this vast body of theoretical knowledge, few of the parsers that are in production systems today are textbook cases of the theory. Many opt for hand-written parsers that are not based on any formalism at all. Language specifications are often defined in terms of a formalism like BNF, but it's almost never the case that real parsers can be generated directly from this formalism. GCC moved away from their Bison-based parser to a handwritten recursive descent parser. While some notable language implementations do use Bison (like Ruby, PHP, and Go), many choose not to.

Why this divergence between theory and practice? While it is tempting to blame ignorance of the literature, that could hardly explain why GCC moved away from an LR parser.

I think it is safe to say that pure LL and LR parsers have proven to be largely inadequate for real-world use cases. Many grammars that you'd naturally write for real-world use cases are not LL or LR, as we will see. The two most popular LL and LR-based parsing tools (ANTLR and Bison, respectively) both extend the pure LL and LR algorithms in various ways, adding features such as operator precedence, syntactic/semantic predicates, optional backtracking, and generalized parsing.

But even the evolved tools that are currently available sometimes come up short, and are still evolving to address the traditional pain points of parser generators. ANTLR v4 completely reworked its parsing algorithm to improve ease-of-use vs ANTLR v3 with a new algorithm it calls ALL(*). Bison is experimenting with IELR, an alternative to LALR that was published in 2008 and intended to expand the number of grammars it can accept and parse efficiently. Some people have explored alternatives to LL/LR such as Parsing Expression Grammars (PEGs) that attempt to solve these pain points in a different way entirely.

Does this mean that LL and LR are obsolete? Far from it. While pure LL and LR do indeed come up short in several ways, these algorithms can be extended in ways that preserves their strengths, in much the same way that a multi-paradigm programming language can offer features of imperative, functional, and object-oriented programming styles. I firmly believe that as parser tools continue to improve with better tooling, better error reporting, better visualization, better language integration, etc. they will become something you'd reach for as readily as you reach for a regex today. There is a lot of room for improvement in this space, and I want to help make that happen (my tabled project Gazelle is where I have invested effort so far and I intend to do more). But I digress.

LL and LR parsers have some indisputable strengths. They are the most efficient parsing algorithms around. The grammar analysis they perform ahead-of-time can tell you important things about your grammar, and properly visualized can help you catch bugs, in much the same way that regex visualizing tools like regexper can. They offer some of the earliest and best error reporting of syntax errors at parse time (this is separate from the shift/reduce and reduce/reduce errors you might get at grammar analysis time).

Even if you are not sold on the usefulness of LL and LR, learning about them will help you better understand the tradeoffs that your favorite parsing method makes compared to LL/LR. Alternatives to LL/LR are generally forced to give up some at least one of the advantages of LL/LR.

Clarifying "LL parser" and "LR parser"

"LL parser" and "LR parser" are not actually specific algorithms at all, but rather generic terms referring to families of algorithms. You may have seen names such as LR(k), "full LL", LALR(1), SLR, LL(*), etc; these are specific algorithms (or variants of the same algorithm, depending on how you look at it) that fall under the category of "LL parser" or "LR parser." These variants have different tradeoffs in terms of what grammars they can handle and how big the resulting parsing automata are, but they share a common set of characteristics.

LL and LR parsers usually (but not always) involve two separate steps: a grammar analysis step that is performed ahead-of-time and the actual parser that runs at parse time. The grammar analysis step builds an automaton if it can, otherwise the grammar is rejected as not LALR/SLL/SLR/whatever. Once the automaton is built, the parsing step is much simpler because the automaton encodes the structure of the grammar such that what to do with each input token is a simple decision.

What then is the distinguishing characteristic that makes a parser an LL parser or an LR parser? We will answer the question with a pair of definitions. Don't worry if these definitions make no sense to you; the entire rest of the article is dedicated to explaining them. These definitions are not given in the literature, since they are informal terms, but they correspond to the general usage of what is meant if you look at (for example) the Wikipedia pages for "LL Parser" or "LR Parser."

An LL parser is a deterministic, canonical top-down parser for context-free grammars.

An LR parser is a deterministic, canonical bottom-up parser for context-free grammars.

Any parser that meets these definitions is an LL or LR parser. Both the strengths and weaknesses of LL and LR are encapsulated in these definitions.

Beware that not every parser with "LR" or "LL" in its name is actually an LR or LL parser. For example, GLR, LR(k,∞), and Partitioned LL(k) are all examples of parsing algorithms that are not actually LL or LR; they are variations on an LL or LR algorithm, but give up one or more of the essential LL/LR properties.

We will now more deeply explore the key parts of these definitions.

Context-Free Grammars: powerful, but not all-powerful

LL and LR parsers use context-free grammars as their way of specifying formal languages. Most programmers have seen context-free grammars in one form or another, possibly in the form of BNF or EBNF. A close variant called ABNF is used in documentation of protocols in RFCs.

On one hand CFGs are really nice, because they match the way that programmers think about languages. The fact that RFCs use a CFG-like abstraction to write documentation speaks to how readable context-free grammars can be.

Here is the JSON context-free grammar from my previous article:
object → '{' pairs '}'
pairs → pair pairs_tail | ε
pair → STRING ':' value
pairs_tail → ',' pairs | ε
value → STRING | NUMBER | 'true' | 'false' | 'null' | object | array
array → '[' elements ']'
elements → value elements_tail | ε
elements_tail → ',' elements | ε
This is so intuitive to read that I didn't even bother to explain it. An object is a bunch of pairs surrounded by curly brackets. A "pairs" is either a pair followed by a pairs_tail or empty. It reads really nicely.

Context-free grammars not only tell us whether a given string is valid according to the language, they also define a tree structure for any valid string. It's the tree structure that helps us figure out what the string actually means, which is arguably the most important part of parsing (a compiler that did nothing but say "yep, this is a valid program" wouldn't be very useful). So writing a CFG really goes a long way in helping us parse and analyze a language.

On the other hand context-free grammars can be frustrating, for two related reasons:
  1. When writing a CFG intuitively, you often end up with something ambiguous.
  2. When writing a CFG intuitively, you often end up with something that is unambiguous but can't be parsed by LL or LR algorithms.
While the second problem is LL/LR's "fault" the first is just an inherent challenge of designing formal languages. Let's talk about ambiguity first.

Ambiguity in CFGs

If a grammar is ambiguous, it means that there is at least one string that can have multiple valid parse trees. This is a real problem in the design of a language, because the two valid parse trees almost certainly have different semantic meaning. If both are valid according to your grammar, your users cannot know which meaning to expect.

The simplest and most common example is arithmetic expressions. The intuitive way to write a grammar is something like:
expr → expr '+' expr |
       expr '-' expr |
       expr '*' expr |
       expr '/' expr |
       expr '^' expr |
       - expr |
But this grammar is highly ambiguous because it doesn't capture the standard rules of precedence and associativity. Without these rules to disambiguate, a string like 1+2*3-4^5 has exponentially many valid parse trees, all with different meanings.

It's possible to rewrite this to capture the rules of precedence and associativity:
expr → expr '+' term |
       expr '-' term |

term → term '*' factor |
       term '/' factor |

factor → '-' factor |

prim → NUMBER |
       NUMBER '^' prim
Now we have an unambiguous grammar that encodes the precedence and associativity rules, but it is not at all easy or self-evident what these rules are from reading the grammar. For example, it's certainly not obvious at first glance that all of the operators in this grammar are left-associative except exponentiation (^) which is right-associative. And it's not easy to write grammars in this style; I have a fair amount of experience writing grammars but I still have to slow down and be careful (without testing, I'm not even 100% confident I have got it right).

It's really unfortunate that one of the first and most common use cases of text parsing is one that pure context-free grammars are so bad at. No wonder people can get turned off of CFG-based tools when something that seems like it should be so simple ends up being so complicated. It's especially unfortunate because other non-CFG parsing techniques like the Shunting Yard Algorithm are so good at this kind of operator precedence parsing. This is clearly one of the most glaring examples where CFGs and pure LL/LR let us down.

Another famous example of actual grammar ambiguity is the dangling else problem. For languages that don't have an "endif" statement, what does this mean?
if a then if b then s else s2

// Ambiguity: which of these is meant?
if a then (if b then s) else s2
if a then (if b then s else s2)
Unlike with arithmetic expressions, there are no standard precedence/associativity rules to tell us which of these interpretations is "correct." The choice here is pretty much arbitrary. Any language that has this construct must tell users which meaning is correct. The grammar ambiguity here is a symptom of the fact that our language has a confusing case.

One final example of ambiguity, this one from C and C++. This is known as the type/variable ambiguity.
  // What does this mean?
  x * y;
The correct answer is that it depends on whether x was previously declared as a type with typedef. If x is a type, then this line declares a pointer-to-x named y. If x is not a type, then this line multiplies x and y and throws away the result. The traditional solution to this problem is to give the lexer access to the symbol table so it can lex a type name differently than a regular variable; this is known as the lexer hack. (While this may seem out of place in an article about parsers, the same ambiguity manifests in C++ in a way that cannot so easily be confined to the lexer).

In other words, this ambiguity is resolved according to the semantic context of the statement. People sometimes refer to this as a "context-sensitive," (like the article The context sensitivity of C's grammar), but context-sensitive grammar is a very specific term that has a mathematical meaning in the Chomsky hierarchy of languages. The Chomsky definition refers to syntactic context sensitivity, which almost never occurs in computer languages. Because of this, it's good to clarify that we are talking about semantic context-sensitivity, which is an entirely different thing.

A key point about semantic context-sensitivity is that you need a Turing-complete language to properly disambiguate between the ambiguous alternatives. What this means for parser generator tools is that it's effectively impossible to parse languages like this correctly unless you allow the user to write arbitrary disambiguation code snippets in a Turing-complete language. No mathematical formalism alone (not CFGs, not PEGs, not operator grammars) can be sufficient to express these languages. This is one very notable case where theory and practice diverge.

In tools such as ANTLR, these disambiguating code snippets are known as "semantic predicates." For example, to disambiguate the type/variable ambiguity, you'd need to write code to build/maintain a symbol table whenever a "typedef" is seen, and then a predicate to see if a symbol is in the table or not.

Dealing with ambiguity in CFGs

No matter what parsing strategy is being used, a language designer must be aware of and directly confront any ambiguities in the language. If possible, the best idea is often to change the language to avoid having the ambiguity at all. For example, most languages these days do not have the dangling else problem because "if" statements have an explicit end (either an "endif" keyword or curly brackets that surround the statements in the "else" clause).

If the designer can't or doesn't want to remove the ambiguity, they must decide which meaning is intended, implement the ambiguity resolution appropriately, and communicate this decision to users.

But to confront ambiguities you must first know about them. Unfortunately this is easier said than done. One of the huge bummers about grammars (both CFGs and other formalisms like PEGs) is that many of the useful questions you might want to ask about them are undecidable (if you haven't studied Theory of Computation, a rough approximation for "undecidable" is "impossible to compute"). Determining whether a context-free grammar is ambiguous is unfortunately one of these undecidable problems.

If it's impossible to compute whether a grammar is ambiguous a priori, how can we be aware of ambiguities and address them?

One approach is to use a parsing algorithm that can handle ambiguous grammars (for example GLR). These algorithms can handle any grammar and any input string, and can detect at parse time if the input string is ambiguous. If ambiguity is detected, they can yield all valid parse trees and the user can disambiguate between them however they see fit.

But with this strategy you don't learn about ambiguity until an ambiguous string is seen in the wild. You can never be sure that your grammar is unambiguous, because it could always be the case that you just haven't seen the right ambiguous string yet. You could ship your compiler only to learn, years later, that your grammar has had an unknown ambiguity in it all along. This can actually happen in the real world; it was not discovered that ALGOL 60 had a "dangling else" problem until the language had already been published in a technical report.

Another strategy is to abandon context-free grammars completely and use a formalism like parsing expression grammars that is unambiguous by definition. Parsing expression grammars avoid ambiguity by forcing all grammar rules to be defined in terms of prioritized choice, so in cases where multiple grammar rules match the input the first one is correct by definition.
// PEG solution of the if/else ambiguity:
stmt <- "if" cond "then" stmt "else" stmt /
        "if" cond "then" stmt /
Prioritized choice is a great tool for resolving some ambiguities; it works perfectly for the solving the dangling else problem. But while this has given us a tool for resolving ambiguity, it hasn't solved the problem of finding ambiguities. Every rule in PEGs is required to be defined in terms of prioritized choice, which means that every PEG rule could be hiding a "conceptual" ambiguity:
// Is this PEG rule equivalent to a <- c / b ?
a <- b / c

// We can't know (it's undecidable in general),
// so every rule could be hiding an ambiguity we don't know about.
I call this a "conceptual" ambiguity because even though a PEG-based tool does not consider this ambiguous, it still looks ambiguous to a user. Another way of thinking about this is that you have resolved the ambiguity without ever being aware the ambiguity existed, thus denying you the opportunity to think about it and make a conscious choice about how it should be resolved. Prioritized choice doesn't make the dangling else problem go away, it just hides it. Users still see a language construct that could plausibly be interpreted in two different ways, and users still need to be informed which option the parser will choose.

Prioritized choice also requires that an ambiguity is resolved in the same way each time; it can't accommodate cases like the C/C++ variable/type ambiguity which are resolved by semantic information.

And unlike GLR, Packrat Parsing (the linear-time algorithm for parsing PEGs) doesn't tell you even at parse time if the input string is ambiguous. So with a Packrat-Parsing-based strategy, you are really flying blind about whether there are "conceptual" ambiguities in your grammar. It's also possible for entire alternatives or rules of a PEG to be unreachable (see here for a bit more discussion of this). The net result is that with PEGs you know very little about the properties of your grammar.

So far none of the options we've discussed can actually help us find ambiguities up-front. Surely there must be a way of analyzing our grammar ahead of time and proving that it's not ambiguous, as long as the grammar isn't too crazy? Indeed there is, but the answer brings us back to where we started: LL and LR parsers.

It turns out that simply trying to construct an LR parser for a grammar is very nearly the most powerful ambiguity test we know of for CFGs. We know it cannot be a perfect test, since we already stated that testing ambiguity is undecidable. If a grammar is not LR we don't know whether it is ambiguous or not. But every grammar that we can construct an LR parser for is guaranteed to be unambiguous, and as a bonus, you also get an efficient linear-time parser for it.

But wait, there's more. Let's review the three types of ambiguity we've encountered and the most natural solution for solving each:
  1. Arithmetic expressions: the ideal solution is to be able to declare precedence/associativity directly, and not have to solve it at a grammar level.
  2. Dangling else: because this can be resolved by always preferring one alternative over another, the ideal solution is prioritized choice. Another example of this case is C++'s "most vexing parse", which is likewise resolved by simply preferring one interpretation over the other when both are valid.
  3. Type/variable ambiguity: the only real solution is to allow Turing-complete semantic predicates to resolve the ambiguity. C++ has an even more extreme version of this problem since type/variable disambiguation could require arbitrary amounts of template instantiation, and therefore just parsing C++ is technically undecidable (!!), unless you limit template instantiation depth -- more gory detail and an example here.
The good news is that all three of these ambiguity-resolution strategies can be incorporated into a LL or LR-based parser generator, to some extent. While pure LL or LR only support context-free grammars, it is entirely possible to add operator precedence, prioritized choice, and semantic predicates to such tools, subject to some limitations. I liken this to multi-paradigm programming languages. Just as a language that supports procedural, OO, and functional styles is more powerful and expressive than a language offering just one, so is a CFG+precedence+predicates tool more powerful than one that only supports CFGs.

We can sum up all of this information with this venn diagram:

To more fully understand how LL/LR grammar analysis can prove grammars unambiguous (and how we can add non-CFG features like prioritized choice to them), we will explore the concept of a deterministic parser.

Deterministic Parsers

Without going into too much detail, a deterministic parser is one that works by building a deterministic automaton (this is very much related to automata theory for regular expressions, except that parsing automata also have a stack). That means that as the parser reads tokens left-to-right, it is always in one particular state, and each token transitions it into exactly one other state.

More informally, we can say that a deterministic parser is one that doesn't have to do any guessing or searching. Terence Parr, author of ANTLR, often uses the metaphor of parsing as a maze; at every fork in the maze, a deterministic parser always knows which fork to take the first time. It might "look ahead" to make the decision, but it never makes a decision and then backs up (a parser that "backs up" is known as a "backtracking parser" and has exponential worst-case running time).

This determinism is really the defining characteristic that gives LL/LR both their advantages and disadvantages. On one hand, they are the fastest algorithms because they are simply transitioning a state machine, and you can know that an LL/LR grammar is unambiguous because an ambiguous grammar won't allow you to build a deterministic automaton. To build the automaton you have to be able to prove, for every grammar state and every input token, that there is only one valid path through the grammar that you can take for this token.

A Bison "shift/reduce" or "reduce/reduce" conflict is a case where Bison was not able to make the parser deterministic, because two state transitions for the same token were both valid. But Bison can't prove whether this is because of grammar ambiguity (in which case both transitions could ultimately lead to a successful parse) or whether this is an unambiguous grammar that just isn't LR (in which case one of the paths would eventually hit a dead-end.

"Generalized parsing" algorithms like GLR and GLL can handle any grammar, because they just take both paths simultaneously. If one of the paths hits a dead-end, then we're back to an unambiguous string. But if multiple paths turn out to all be valid, we have an ambiguous string and the parser can give you all of the valid parse trees.

This also gives us a hint about how pure LL/LR algorithms can be extended with extra features. Any method we can think of for deciding which path is the "right" one is fair game! It turns out that operator precedence declarations can provide a way of resolving shift/reduce and reduce/reduce conflicts in LR parsers, and Bison supports this with its precedence features. Prioritized choice can also give us enough information in some cases to resolve a nondeterminism by deciding that one of the paths is correct because it has a higher priority. And when all else fails, if we can write our own predicates that run at parse time, we can write arbitrary logic that uses whatever other criteria we could possibly want to decide which path is the correct one.

Conclusion: so why are parsing tools hard?

Given all of this, what is the answer to our original question? Why are parsing tools hard? I think there are two sets of reasons: some inherent reasons and some reasons that are just a weakness of current parsing tools and could be improved.

The inherent reasons parsing tools are hard have to do with ambiguity. More specifically:
  1. The input grammar may be ambiguous but we can't robustly check for this because it's undecidable.
  2. We can use a deterministic (LL/LR) parsing algorithm: this gives us a fast parser and a proof that the grammar is unambiguous. But unfortunately no deterministic parsing algorithm can handle all unambiguous grammars. So in some cases we're forced to adapt our grammar and debug any nondeterminism.
  3. We can use a generalized parsing algorithm like GLL or GLR that can handle all grammars (even ambiguous ones), but because of (1) we can't know for sure whether the grammar is ambiguous or not. With this strategy we have to always be prepared to get multiple valid parse trees at parse time. If we didn't know about the ambiguity, we probably won't know how we should disambiguate.
  4. We can use a formalism like Parsing Expression Grammars that defines ambiguity away -- this will always give us one unique parse tree, but can still hide "conceptual" ambiguities.
  5. Some real-world ambiguities can't be resolved at a grammar-level because they have semantic context-sensitivity. To parse these languages, there must be a way of embedding arbitrary logic into the parser to disambiguate.
While this may all seem annoying, ambiguity is a real language design issue, and anyone designing a language or implementing a parser benefits from getting early warning about ambiguity. In other words, while LL and LR tools are not perfect, some of their pain simply comes from the fact that parsing and language design are complicated. While rolling your own parser will free you from ever getting error messages from your parser generator, it will also keep you from learning about ambiguities you may be inadvertently designing into your language.

The other reasons parsing tools are hard are things that could realistically be improved. The tools could benefit from greater flexibility, more reusable grammars, better ability to compose languages, and more. This is where the opportunity lies.

Friday, August 23, 2013

Parsing C++ is literally undecidable

Many programmers are aware that C++ templates are Turing-complete, and this was proved in the 2003 paper C++ Templates are Turing Complete.

However, there is an even stronger result that many people are not aware of. The C++ FQA has a section showing that parsing C++ is undecidable, but many people have misinterpreted the full implications of this (understandable, since the FQA is discussing several issues over the course of its questions and does not make explicit the undecidability proof).

Some people misinterpret this statement to simply mean that fully compiling a C++ program is undecidable, or that showing the program valid is undecidable. This line of thinking presumes that constructing a parse tree is decidable, but only further stages of the compiler such as template instantiation are undecidable.

For example, see this (incorrect, but top-voted) Stack Overflow answer to the question What do people mean when they say C++ has “undecidable grammar”? This answer errs when it says: "Note this has nothing to do with the ambiguity of the C++ grammar."

In fact, simply producing a parse tree for a C++ program is undecidable, because producing a parse tree can require arbitrary template instantiation. I will demonstrate this with a short program, which is a simplification/adaptation of what is in the FQA link above.
struct SomeType {};

template <...> struct TuringMachine {
  // Insert implementation of a Turing machine here, which we know
  // is possible from previous proofs.

template <typename T> struct S {
  static int name;

template<> struct S<SomeType> {
  typedef int name;

int x;
int main() {
  S<TuringMachine<...>::output>::name * x;
The parse tree for this program depends on whether TuringMachine::output is SomeType or not. If it is SomeType then ::name is an integer and the parse tree for the program is multiplying two integers and throwing away the result. If it is not SomeType, then ::name is a typedef for int and the parse tree is declaring a pointer-to-int named x. These two are completely different parse trees, and the difference between them cannot be delayed to further stages of the compiler.

The parse tree itself depends on arbitrary template instantiation, and is therefore the parsing step is undecidable.

In practice, compilers limit template instantiation depth, so this is more of a theoretical problem than a practical one. But it is still a deep and significant result if you are ever planning on writing a C++ parser.

Monday, August 19, 2013

Microsoft, thank you (tentatively) for supporting some of C99

Last year I vented my frustration that Visual C++ was the only mainstream C compiler that had almost no C99 support: Microsoft, please support (at least a tiny bit of) C99. I did not have much hope that anything would change, since their stated position seemed firm that C99 was not a priority.

I'm not sure how I missed this at the time, but a few months ago there was a blog post: C++11/14 STL Features, Fixes, And Breaking Changes In VS 2013. In addition to a bunch of C++ features, it contains this key excerpt:
Additionally, some C99 Core Language features will be implemented in 2013 RTM:
  • C99 _Bool
  • C99 compound literals
  • C99 designated initializers
  • C99 variable declarations
These are some of the most welcome parts of C99 to see implemented (since they are some of the hardest to work around when absent). I'm not sure what caused this about-face, but presuming this is released as expected, I feel compelled to say: thank you Microsoft!

Hopefully this will obsolete the need for c99-to-c89, which I must give props to as a very clever hack for working around this problem.

(It appears VC++ 2013 will also contain extensive support for the C99 standard library, as detailed in C99 library support in Visual Studio 2013; while this is less critical to me personally, I'm sure others will welcome this development also).

Monday, July 22, 2013

LL and LR Parsing Demystified

My first adventures in parsing theory came when I was doing an independent study of programming languages in college. When I got to the part about algorithms such as LL, LR, and their many variations (Strong-LL, SLR, LALR, etc), I was fascinated. I felt like I was peering at some deep and powerful incantations, the significance of which I could not yet appreciate, but I was sure that someday terms like "left-to-right, rightmost derivation" would make perfect sense, and I looked forward to achieving this enlightenment.

I can say now, 10 years and a whole shelf of parsing books later, that I understand these algorithms pretty well. But the way I think about them is quite different than any of the literature that I have found. I think more from an implementation perspective than a mathematical one, which could have something to do with it. In any case, I want to explain how I think of these algorithms; hopefully some people will find this perspective intuitive, as I do.

This article will only address the parser from a black-box perspective: what are its inputs/outputs and what are its constraints? A planned future article will break open the black box for more details about the inner workings of these algorithms.

Parsing And Polish Notation

If you studied Computer Science at university, or ever owned an HP calculator, you likely came across Polish and Reverse Polish notation. These are ways of writing mathematical expressions that don't need parentheses or order-of-operations rules. We're used to writing expressions as infix, where the operators go in between the operands:
  1 + 2 * 3
In this case, how do you know what order to do the operations in? You have to follow the conventional rules (PEDMAS), and if you want a different order you have to use parentheses, like:
  (1 + 2) * 3
Polish and Reverse Polish notation let you write these expressions without needing arbitrary order-of-operations rules or parentheses to disambiguate. They work by putting the operators before (Polish) or after (Reverse Polish) the operands. These are also known as prefix and postfix notation.
  // First example:
  1 + 2 * 3  // Infix
  + 1 * 2 3  // Polish (prefix)
  1 2 3 * +  // Reverse Polish (postfix)

  // Second example:
  (1 + 2) * 3  // Infix
  * + 1 2 3    // Polish (prefix)
  1 2 + 3 *    // Reverse Polish (postfix)
Besides not needing parentheses or a defined order of operations, Polish and Reverse Polish are much easier to write evaluators for (maybe the HP calculator's designers decided to use Reverse Polish so they could spend a week in the Bahamas). Here is a simple Reverse Polish evaluator in Python.
# Functions that define the operators and how to evaluate them.
# This example assumes binary operators, but this is easy to extend.
ops = {
  "+": (lambda a, b: a + b),
  "-": (lambda a, b: a - b)
def eval(tokens):
  stack = []
  for token in tokens:
    if token in ops:
      arg2 = stack.pop()
      arg1 = stack.pop()
      result = ops[token](arg1, arg2)
  return stack.pop()
print "Result:",  eval("7 2 3 + -".split())
Polish and Reverse Polish notation, as they are usually described, do require that all operators have a known arity. Arity is just the number of operands that the operator takes. This means, for example, that unary minus needs to be a different operator than subtraction. Otherwise we don't know how many operands to pop from the stack when we see an operator.

A similar formulation that avoids this problem is one like Lisp's s-expressions. S-expressions (and similar encodings like XML) avoid the need for fixed operator arity by explicitly marking both the beginning and end of each expression:
  ; Lisp-style prefix notation; the same operator can be used
  ; for different numbers of arguments.
  (+ 1 2)
  (+ 1 2 3 4 5)

  ; Lisp equivalents of our first two examples:
  ; Prefix: + 1 * 2 3
  (+ 1 (* 2 3))

  ; Prefix: * + 1 2 3
  (* (+ 1 2) 3)
This variant has slightly different tradeoffs (we traded fixed arity for required parentheses) but the underlying algorithms for parsing/processing these are quite similar, so generally we'll consider this a minor variant of prefix notation.

It might seem like I've veered off-topic, but I've been sneakily talking about LL and LR the whole time. Polish and Reverse Polish notation directly correspond, in my view, to LL and LR parsing, respectively. But to fully explore this idea we need to first describe what kind of output we expect from a parser.

For a fun exercise, try implementing an algorithm to convert Polish to Reverse Polish notation. See if you can do it without building the entire expression into a tree first; you can do it with a stack alone. Now say you wanted to do the opposite (Reverse Polish to Polish) -- you can just run the same algorithm on the input, but backwards! Of course you can build an intermediate tree too, but this takes O(input length) space, whereas the stack-only solution takes only O(tree depth) space. How about going infix to postfix? There is a very clever and efficient algorithm for that called the Shunting Yard Algorithm.

A Parser And Its Output

We can all agree that the input to a parser is a stream of tokens (which probably came from a lexer, but we can talk about that part another day). But what is a parser's output? You might be inclined to say "a parse tree," and while you can certainly use a parser to build a parse tree, it's also possible to consume a parser's output without ever actually building a tree at all. For example, this Bison example evaluates arithmetic expressions in-line with the parse. Every time a subexpression is recognized, it is immediately evaluated until the final result is just a single number. No parse tree is ever explicitly built.

For this reason, saying that the output of a parser is a parse tree is not general enough. Instead, I claim that the output of a parser, at least for the cases of LL and LR which we are discussing today, is a parse tree traversal.

Apologies if I've set off anyone's nonsense detectors. I can hear people protest that a tree traversal is an algorithm; an operation you perform on a tree. How can I say that a parser outputs a tree traversal? The answer lies in thinking back to Polish and Reverse Polish notation. These are usually thought of as just a notation for mathematical expressions, but we can think of them more generally as flat (serialized) encodings of tree traversals.

Think back to our first example of 1 + 2 * 3. Here is that expression written as a tree:

There are three ways of traversing a binary tree, as explained on Wikipedia: in-order, pre-order, and post-order. They differ in whether you visit the parent node before (pre-order), after (post-order), or in between (in-order) the children of that parent. These three correspond exactly to infix, Polish, and Reverse Polish notation:
  1 + 2 * 3  // Infix expression; in-order traversal.
  + 1 * 2 3  // Polish (prefix) expression; pre-order traversal.
  1 2 3 * +  // Reverse Polish (postfix) expression; post-order traversal.
So Polish and Reverse Polish notation fully encode a tree structure and the steps you would take to traverse it. The main difference between these encodings and an actual parse tree is that Polish and Reverse Polish encodings are not random access. With a real tree you can choose to follow an interior node to its right subtree, its left subtree, or even (for many trees) its parent. With these linear encodings there is no such flexibility: you have to follow the traversal as it is already encoded.

But on the plus side, this allows the parser's output to be a stream that can be consumed while the parse is happening. This is how the Bison example from before could manage to evaluate the arithmetic expression without ever actually building a tree. If a fully-fledged tree is actually required, the linear tree traversal can easily build one. But in cases where it is not, the cost of building one can be avoided.

This brings us to a key point:
The primary difference between how LL and LR parsers operate is that an LL parser outputs a pre-order traversal of the parse tree and an LR parser outputs a post-order traversal.
This is equivalent to these more traditional but (in my view) more confusing and less intuitive explanations of the distinction:
  • "LL parsers produce a leftmost derivation, while LR parsers produce a reversed rightmost derivation."
  • "LL parsers build the tree from the top-down, while LR parsers build the tree from the bottom-up."
  • LL parsers are often called "predictive parsers," while LR parsers are often called "shift-reduce parsers."

The Shape of a Parse Tree

Our arithmetic expression tree that we've been using so far isn't truly a parse tree, because it doesn't correspond to a grammar. To examine actual parse trees, we'll need a real grammar. Unfortunately writing grammars for infix arithmetic expressions isn't as simple or elegant as you might expect. Encoding the precedence and associativity rules into a grammar that is unambiguous (and can be handled by LL and LR parsers) is pretty ugly and nonintuitive. This is one reason why LL and LR parsers are often extended with capabilities that let you specify operator precedence; for example, see the precedence features of Bison. But for the purposes of this article we want to discuss pure LL and LR.

So we need to ditch our arithmetic expressions example in favor of something that is easier to write a grammar for. We'll use JSON since it is very simple but just complex enough to be interesting.
object → '{' pairs '}'

pairs → pair pairs_tail | ε
pair → STRING ':' value
pairs_tail → ',' pairs | ε

value → STRING | NUMBER | 'true' | 'false' | 'null' | object | array
array → '[' elements ']'

elements → value elements_tail | ε
elements_tail → ',' elements | ε
I've used single-quoted strings for literal tokens and upper-case names like STRING for tokens whose spelling can vary (for example, "abc" and "" are both valid STRING tokens). All lower-case names are grammar rules (also known as "nonterminals").

You might wonder why I am doing this pairs_tail and elements_tail business instead of using a repetition construct, which many parser generators like ANTLR support. They let you write something like:
elements → value (',' value)*
While this is more convenient and leads to simpler grammars, it makes parse trees conceptually a bit more complicated, because the number of children for a given grammar rule can vary. Also, LR parsers can't support repetition operators (for example, Bison doesn't support them), whereas the grammar I wrote above can be used with both LL and LR parsers. So we'll use this slightly more complicated grammar for now.

Now that we have a grammar, we can look at an example of a stream of tokens and the resulting parse tree.
{"message":"Hello, World!"}
The token stream for this text is:
And the parse tree, according to our grammar, is:

Notice that the leaf nodes (colored in green) are all tokens, and correspond exactly to the sequence of tokens that was the input to our parser. (I cheated slightly by making ε into a leaf node, but this turns out to be pretty clean and principled as we will see).

I claimed earlier that LL parsers output a pre-order traversal and LR parsers a post-order traversal. From this we can say what output we would expect to get from an LL and LR parser given this input:
// LL output: pre-order traversal.
object '{' pairs(1) pair STRING ':' value(1) STRING pairs_tail(2) '}'

// LR output: post-order traversal.
'{' STRING ':' STRING value(1) pair pairs_tail(2) pairs(1) '}' object
Since leaf nodes are always exactly the input tokens themselves, in exactly the order of the input, all the parser is really doing is inserting the interior nodes in the appropriate places. Another way of looking at it is that a parse tree is just a bunch of structure that is defined on top of the sequence of input tokens. We can see this more clearly if we rearrange our previous diagram slightly:

We are converging on a very simple model for how LL and LR parsers operate. Both read a stream of input tokens and output that same token stream, inserting rules in the appropriate places to achieve a pre-order (LL) or post-order (LR) traversal of the parse tree.

Here we see another advantage of thinking of a parser's output in terms of Polish and Reverse Polish Notation. It lets us model the parser's input and output both as simple, flat streams. This is a lot simpler than thinking of a parser's intermediate output state as a partially-built tree -- it is unclear how you could meaningfully consume or inspect that.


LL and LR parsers are "on-line," meaning that they can start producing output while they are still consuming input. But in many cases the tokens prior to a stream position do not contain enough information for the parser to know whether it should insert a rule or not (or if so, which rule it should insert). So the parser will "look ahead" in the stream to see what the next token(s) are before it makes a decision. When you see designations like LL(1), LR(0), etc. the number in parentheses is the number of tokens of lookahead.

Note that the lookahead is relative to where the rule should be inserted, which (as you will remember) is before that rule's tokens for LL parsers or after that rule's tokens for LR parsers. That means that LL lookahead counts from the beginning of the rule's tokens, whereas LR lookahead counts from the end. This gives LR parsers a huge advantage, because they get to see all of the rule's tokens (and maybe some lookahead) before they have to commit to a decision, whereas LL parsers only get to see the first few tokens of the rule.

This is why there is such a thing as an LR(0) parser, whereas an LL(0) parser would be impossible; it would have no information with which to know what rule to use for the following tokens!


With this understanding of LL vs LR parsing, we can draw a number of very significant conclusions about why certain things are the way they are. These illustrate a lot of the pros/cons of LL vs LR parsing.

LR parsers can handle more grammars

This follows from the previous section about lookahead. Since LR lookahead starts from the end of a rule, a LR(1) parser has strictly more information available to it when making a decision than an LL(1) parser. It follows that LR(1) parsers can parse strictly more grammars than LL(1) (modulo LL-only grammar extensions; see below). LR parsers can also handle left recursion, which LL parsers cannot.

Advantage: LR

On the other hand, since LL parsers commit to what rule they are parsing before they parse that rule's tokens, and LL parser knows the context of what it is parsing whenever it parses a token. While that is more difficult job (since they have less information to go on), it also leads to some important advantages.

LL parsers can support regex-like operators in grammars

Knowing the parsing context makes it possible to extend grammars with rich regex-like operators like repetition, alternation anywhere (not just at the top level), etc. Basically each rule can form a DFA. This is possible with top-down parsing because the parser knows what rule it is in and can step through that rule's state machine as it is parsing it. I don't believe this is possible with bottom-up parsing (even if you could somehow get the parsing tables to do the right thing, the "reduce" step counts on the reduction having a fixed arity). This is a really nice advantage of LL, because grammars are often more readable with these rich grammar extensions. In practice this helps mitigate the more restrictive grammar rules of LL, since many things that you'd want left-recursion for you can use repetition operators instead.
// LR Grammar: nothing fancy allowed, alternation only allowed
// at the top-level.
// This is only allowed because it is equivalent to:
// pairs → pair pairs_tail
// pairs → ε
pairs → pair pairs_tail | ε

// Extended LL grammar; possible because we can build each
// rule into a DFA.
pairs → (pair (',' pair)*)?
The latter rule could be built into a DFA like (green states are "accept" states):

Knowing the context also allows for mid-rule actions (custom code that runs in-between any two elements of the rule). Bison supports this, but only by rewriting the grammar internally which makes it appear more complicated in any kind of visualization.

Advantage: LL

LL parsers support inherited attributes

Knowing the context also allows LL-based applications to pass attributes/metadata down the tree as it is being built (this is sometimes referred to as "inherited attributes." (Both LL and LR parsers can support "synthesized attributes" which are passed up the tree).

Advantage: LL


I have described a alternate model for LL and LR parsers that is equivalent to, but more intuitive (for me at least), than most of the literature. We can consider a parser a black box that inputs and outputs streams of tokens and rules according to pre-order or post-order notation. So far we have not explored the inner workings of these parsers at all; we have just considered them black boxes and we have no idea how they work internally. We have not explored issues of what grammars they can handle and which they can't. We also have not explored the variants of LL and LR (Strong-LL, SLR, LALR, etc). I hope to explore these issues more fully, including example code, in a follow-up article.