Friday, February 13, 2009

The C Typedef Parsing Problem

The well-known "typedef problem" with parsing C is that the standard C grammar is ambiguous unless the lexer distinguishes identifiers bound by typedef and other identifiers as two separate lexical classes. This means that the parser needs to feed scope information to the lexer during parsing. One upshot is that lexing must be done concurrently with parsing. That's standard, although because parser generators usually allow fixed lookahead, you have to pay very close attention to making sure the lexer stays properly in synch with the parser, even when it gets ahead. For example:
typedef int my_int;
my_int x;
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name. But this is just a small matter of heroic hacking.

The real problem is a larger engineering one. Just to parse a program, you need to have the full type environment. And the program you're parsing may have included other programs. So if you want to write tools that perform analyses on fragments of C, you still have to feed them the entire environment one way or another. Either that becomes a requirement you foist onto the user ("this tool takes two inputs: a fragment of C and an initial type environment"), or you only allow whole programs instead of fragments, or you use an ambiguous grammar with, say, a GLR parser and divine some clever disambiguation heuristics.

And of course, C's braindead non-module-system is just a preprocessor directive (#include). So if you punt and require whole programs, that means you have to implement the C preprocessor, too. Or at least feed it through an existing implementation of the C preprocessor. And then process stdio.h, stdlib.h, etc. etc. every time you analyze just about any C program. So then you start thinking about caching results of #include for efficiency...

All this because of one silly grammar ambiguity.


benpsm said...

A good post.

What a mess, though. These are the sorts of things about C syntax that just make my skin crawl.

Relatedly, I think C's attempt to make variable and function declaration "look like usage" to be a demented and failed experiment. It's a shame, really that a language where pointers are so central makes declaring types dereferencing those pointers syntactically so awkward.

I'll take:

type Handler = procedure(var m: array of Message): pointer to Result;


typedef Result *(*Handler)(message (*m)[]));

or whatevertheheck the correct mangling would be. In C is virtually impossible to even tell which identifier is being declared without playing parser. A declaration syntax that can't even manage that really is a failure.

Leo Meyerovich said...

Even in terms of lexing and parsing, I'd say that's bad enough to be a performance bug: if you want to try to parallelize the process, this means you're forced to do wacko speculative tricks beyond the norm. Stepping back, however, unless variable identifiers and type identifiers can have different values, I'm not sure why one would advocate differentiating them within the lexer.

I'm curious as to why this came up for you :)

Dave Herman said...

I'm curious as to why this came up for you.

It's totally mundane, nothing research-related. I am building a tool for PLT Scheme that allows you to take a C header and compute binary layout information for the foreign function interface.

Sebastien Carlier said...

I found this post while trying to remind myself what the deal is with this ambiguity in C's syntax.

I think the C statement "a * b;" illustrates the problem well. This statement can be read as declaring 'b' as a pointer to an entity of type 'a', or it can be read as multiplying the values of the variables 'a' and 'b' (and discarding the result). It is impossible to tell which is meant without knowing whether 'a' names a variable or a type.

Perhaps it can be shown that all such examples involve throwing away the result of a calculation in a trivial way... which might lead to a heuristic allowing sensible compilation in the absence of a type dictionary.

Of course, C++ inherited the issue, and '*' above could then be overloaded and do all sorts of wacky stuff behind the scene...