Showing posts with label C. Show all posts
Showing posts with label C. Show all posts

Monday, January 25, 2010

Wading into the C

No time for deep thoughts these days; too much hacking, dissertating, designing, and committeefying going on. Just a couple notes based on recent experiences hacking in C/C++:

1. Not being able to rely on recursion makes me sad.

2. "Downwards macro-args" in C:
#define MY_ENUM_LIST(m) \
m(RED, 0), \
m(GREEN, 1), \
m(BLUE, 2)

#define DEF_ENUM_ENTRY(c, v) c = v
#define QUOTE_ENUM_ENTRY(c, v) #c

typedef enum rgb {
MY_ENUM_LIST(DEF_ENUM_ENTRY)
} rgb;

const char *rgb_names[] = {
MY_ENUM_LIST(QUOTE_ENUM_ENTRY)
};
3. I am fast becoming acquainted with gdb.

4. And a few choice command-line shortcuts really save extraordinary amounts of time. My new fav: the !? bash-history shortcut.

Monday, February 16, 2009

I feel much better about myself

I'd always felt stupid for being stymied by the syntax of C declarations. But I think I get it now.

First of all, you have to understand that a C declaration starts with one single base type followed by a sequence of fragments of declarations; you essentially can interpret this as a sequence of separate declarations, each created by plugging the base type into each separate fragment. Syntactically, the "hole" inside each fragment actually contains the name of the identifier being bound; conceptually, you pull out that identifier and bind it to the type you get by plugging the base type into the position where you found the identifier.

For example:
int x, *y, z[3];
declares an int x, an int pointer y, and an int array z. Now for function types, the base type is interpreted as the return type; in other words, a fragment of a declaration of function type has its "hole" in the return type position. So:
int x, f(char);
declares an integer x and a function f of type charint. Next we have to worry about pointer types. Oh, pointer types. First of all, you have to know that pointer types in the declaration are associated with the fragments, not the base type. So:
int *p, x;
declares an int pointer p and a plain int x. Now, notice that the asterisk is a prefix operator, whereas the function and array type constructors are postfix operators. So now we get to worry about precedence. You just have to remember that the asterisk binds loosest. So:
int *f(char);
is a function that returns an int pointer, whereas
int (*f)(char);
is a pointer to a function that returns an int. (You can throw in parentheses most anywhere in these things; forgot to mention that.) Okay, but here's the kicker: the nesting of type constructors is interpreted inside-out. Consider:
int (*(foo[3]))(char);
((Not that it helps, but I've over-parenthesized to avoid precedence issues.)) If you're still trying, you might be tempted to read this as declaring a function that returns an array of int pointers, or maybe a pointer to an array of ints. But the inner-most syntactically nested type constructor is interpreted as the outer-most semantically nested type constructor. So foo is in fact an array of pointers to functions.

Friday, February 13, 2009

The C Typedef Parsing Problem

The well-known "typedef problem" with parsing C is that the standard C grammar is ambiguous unless the lexer distinguishes identifiers bound by typedef and other identifiers as two separate lexical classes. This means that the parser needs to feed scope information to the lexer during parsing. One upshot is that lexing must be done concurrently with parsing. That's standard, although because parser generators usually allow fixed lookahead, you have to pay very close attention to making sure the lexer stays properly in synch with the parser, even when it gets ahead. For example:
typedef int my_int;
my_int x;
At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name. But this is just a small matter of heroic hacking.

The real problem is a larger engineering one. Just to parse a program, you need to have the full type environment. And the program you're parsing may have included other programs. So if you want to write tools that perform analyses on fragments of C, you still have to feed them the entire environment one way or another. Either that becomes a requirement you foist onto the user ("this tool takes two inputs: a fragment of C and an initial type environment"), or you only allow whole programs instead of fragments, or you use an ambiguous grammar with, say, a GLR parser and divine some clever disambiguation heuristics.

And of course, C's braindead non-module-system is just a preprocessor directive (#include). So if you punt and require whole programs, that means you have to implement the C preprocessor, too. Or at least feed it through an existing implementation of the C preprocessor. And then process stdio.h, stdlib.h, etc. etc. every time you analyze just about any C program. So then you start thinking about caching results of #include for efficiency...

All this because of one silly grammar ambiguity.

Wednesday, February 04, 2009

Using C to talk to C

I've released the first version of a PLaneT package for working with C. It's based on an idea I got from Felix Klock's Scheme Workshop 2008 paper: rather than try to do manual pointer arithmetic based on the current architecture's ABI, you can find out byte offsets of data structures by creating a C program that tells you what they should be. So that's what c.plt allows you to do; using a system-installed C compiler, you can generate a representation of the binary layouts of data structures. Here's a sample interaction:
> (define time.h
(make-header
#reader (planet dherman/c/reader) {
struct tm {
int tm_sec;
int tm_min;
int tm_hour;
int tm_mday;
int tm_mon;
int tm_year;
int tm_wday;
int tm_yday;
int tm_isdst;
};
}))
> (define time-abi
(compile-header time.h
(system-compiler #:include<> '("time.h") gcc)))
> (layout-size (time-abi 'tm))
36
> (layout-offset (time-abi 'tm) 'tm_sec)
0
> (layout-offset (time-abi 'tm) 'tm_year)
20