Friday, August 23, 2013

Parsing C++ is literally undecidable

Many programmers are aware that C++ templates are Turing-complete, and this was proved in the 2003 paper C++ Templates are Turing Complete.

However, there is an even stronger result that many people are not aware of. The C++ FQA has a section showing that parsing C++ is undecidable, but many people have misinterpreted the full implications of this (understandable, since the FQA is discussing several issues over the course of its questions and does not make explicit the undecidability proof).

Some people misinterpret this statement to simply mean that fully compiling a C++ program is undecidable, or that showing the program valid is undecidable. This line of thinking presumes that constructing a parse tree is decidable, but only further stages of the compiler such as template instantiation are undecidable.

For example, see this (incorrect, but top-voted) Stack Overflow answer to the question What do people mean when they say C++ has “undecidable grammar”? This answer errs when it says: "Note this has nothing to do with the ambiguity of the C++ grammar."

In fact, simply producing a parse tree for a C++ program is undecidable, because producing a parse tree can require arbitrary template instantiation. I will demonstrate this with a short program, which is a simplification/adaptation of what is in the FQA link above.
struct SomeType {};

template <...> struct TuringMachine {
  // Insert implementation of a Turing machine here, which we know
  // is possible from previous proofs.
};

template <typename T> struct S {
  static int name;
};

template<> struct S<SomeType> {
  typedef int name;
};

int x;
int main() {
  S<TuringMachine<...>::output>::name * x;
}
The parse tree for this program depends on whether TuringMachine::output is SomeType or not. If it is SomeType then ::name is an integer and the parse tree for the program is multiplying two integers and throwing away the result. If it is not SomeType, then ::name is a typedef for int and the parse tree is declaring a pointer-to-int named x. These two are completely different parse trees, and the difference between them cannot be delayed to further stages of the compiler.

The parse tree itself depends on arbitrary template instantiation, and is therefore the parsing step is undecidable.

In practice, compilers limit template instantiation depth, so this is more of a theoretical problem than a practical one. But it is still a deep and significant result if you are ever planning on writing a C++ parser.

6 comments:

  1. Note that this piece of code is not legal C++ (and this is not related to template instantiation depth).
    To make this legal (and decidable, in fact) the standard requires to use "typename" (in order to disambiguate whether "name" is type or not): http://eigen.tuxfamily.org/dox-devel/TopicTemplateKeyword.html

    ReplyDelete
    Replies
    1. I believe you are misunderstanding where "typename" is required. Typename for disambiguation is only required from within a template. See this Stack Overflow question: Why is the keyword “typename” needed before qualified dependent names, and not before qualified independent names?.

      This program actually compiles with both GCC and Clang if I fill in the placeholders with "int N" and "0", respectively (and typedef "output" within TuringMachine).

      Delete
  2. Josh, thanks for posting this, especially with the concrete example and the FQA links. I knew C++ syntax analysis was messy, but I didn't realize it was so completely tangled with semantic analysis. I've updated my SO answer to address this. Please comment if I got anything wrong.

    Consider copying this post as an answer to the same question.

    ReplyDelete
    Replies
    1. Hey thanks Jay for being so gracious about this, and sorry for calling out your answer like that, I just wanted to show that it was a common misconception.

      I have just one minor comment with your revision, which I'll put on SO.

      Delete
  3. So if the correct parsing is undecidable, does that make it undefined? Or implementation-defined? Or defined but logically impossible to implement?

    ReplyDelete
    Replies
    1. The C++ standard says:

      "There is an implementation-defined quantity that specifies the limit on the total depth of recursive instantiations, which could involve more than one template. The result of an infinite recursion in instantiation is undefined."

      So the parse is decidable given a specific (implementation-defined) template instantiation depth.

      Delete