The Little Calculist: February 2009

Friday, February 20, 2009

Whoa

This is a really neat new feature of the Northeastern University library catalog:

Unless you spend the next 10 minutes taking screen shots and blogging about it, this is really convenient. And it saves our precious reserves of little scraps of paper.

Thursday, February 19, 2009

PLT System Facilities (Software Components): Modules

Lexical scope in general and lambda in particular go a long way towards supporting modular, separate development. As I mentioned before, PLT Scheme builds a number of first-class, module-like component systems on top of lambda. But it also contains a more primitive module system in which modules are not first-class. This serves a number of purposes.

First of all, static modules provide systematic support for packaging, compiling, and deploying code. First-class modules are flexible and expressive, but they don't have anything to say about compilation and deployment. Somewhere along the line there has to be a notion of what the compiler takes as input, and when you have separate development, you need separate deployment and ideally separate compilation.

Another critical purpose of static modules is the ability to modularize static entities in a language. In ML, for example, modules can import and export types. Scheme is of course more dynamic than ML, but it still has its own crucial compile-time abstractions: macros. With dynamic modules, there's no straightforward way to import and export static entities like macros. A secondary benefit of static modules is that you can import all the bindings from another module at once without having to spell them all out; this is admittedly less important but still very convenient.

Finally, PLT Scheme was designed to support multiple languages. For pedagogical purposes, this has allowed the PLT group to design multiple, concentric subsets of the language tailored to the How to Design Programs curriculum. This has also facilitated language research by making it easy to design and implement new languages (by macro-compiling them to Scheme) and to research language interactions in a multi-language environment. The relevant piece of the module system is a single hook at the beginning of a module definition: the grammar of a module is an S-expression containing the symbol module, a symbol naming the module, and an S-expression indicating the language of the body:

(module foo scheme body ...)

Typically the language chosen is the special built-in language scheme, as above, or the somewhat leaner scheme/base. But the language position works by simply importing another module that implements the required macros for compiling the body. In place of scheme, you can put in a module path for any module installed on the system.

It's also possible to specify a custom reader for a PLT language so it doesn't even have to be restricted to an S-expression syntax. The initial reader allows you to specify a language with the special shebang-like #lang syntax:

#lang scheme/base
body ...

From that point on, the language's reader has access to the input stream to parse it any way it likes.

Anyone can implement a language module, which means people have developed PLT implementations of Algol, Java, ML, and JavaScript, to name a few.

Update: Added in the missing word "dynamic" above. Also, this is a better link for module paths.

Also, Sam is right in the comments: the scheme language really isn't special or built-in in any significant way. It's simply another module provided in the standard PLT collections.

Wednesday, February 18, 2009

Ooh, that's pretty

OS X calculator "programmer" view:

PLT System Facilities (Software Components): Lexical scope

Like any programming system, PLT Scheme involves multiple people and a lot of code. So managing software components is a primary concern. And of course, as a Scheme, PLT uses lexical scope as the linguistic linchpin for managing software. The importance of lexical scope in a component architecture is that it allows code to be shared and mixed while guaranteeing the internal integrity of components.

With the power of macros, PLT has developed a number of software component abstractions on top of little lambda. One of these is a single-inheritance, class- and interface-based OOP system. Classes are first-class values, which means it's easy to implement mixins (i.e., classes parameterized over their superclass) as functions that return classes. More recently, they've introduced traits, which are like fragments of classes that can be more freely and flexibly combined than mixins. Also built with macros is the unit system, which is a first-class, parameterized, recursive module system.

All of these abstractions admit modular and separate development because they protect the internal integrity of their local bindings and definitions. You can hand a component to an untrusted context and know that the context won't be able to modify or even inspect the component's internals. And you can reason locally about your code knowing that no context can change its behavior based on these internals.

Tuesday, February 17, 2009

The PLT Scheme Operating System

Back in the day, programming languages were considered a systems concern. These days a lot of PL research is done in a vacuum, with abstract models and stand-alone prototypes. But programming languages rise and fall by their applications, and the deployment of a language always involves interesting and non-trivial systems challenges.

The PLT group has done a ton of work in this area. A decade ago, Matthew, Robby, Shriram and Matthias wrote an ICFP paper on Programming Languages as Operating Systems, sketching a few of the highlights of the PLT virtual machine. They focused on MrEd, the GUI engine. But taking a step back, there are many more systems-y facilities in PLT Scheme than just the UI infrastructure. And of course, that paper is a decade old and beginning to show its age.

I'm planning to write a series of posts on some of the myriad systems facilities in PLT Scheme and how they fit together.

Monday, February 16, 2009

I feel much better about myself

I'd always felt stupid for being stymied by the syntax of C declarations. But I think I get it now.

First of all, you have to understand that a C declaration starts with one single base type followed by a sequence of fragments of declarations; you essentially can interpret this as a sequence of separate declarations, each created by plugging the base type into each separate fragment. Syntactically, the "hole" inside each fragment actually contains the name of the identifier being bound; conceptually, you pull out that identifier and bind it to the type you get by plugging the base type into the position where you found the identifier.

For example:

int x, *y, z[3];

declares an int x, an int pointer y, and an int array z. Now for function types, the base type is interpreted as the return type; in other words, a fragment of a declaration of function type has its "hole" in the return type position. So:

int x, f(char);

declares an integer x and a function f of type char → int. Next we have to worry about pointer types. Oh, pointer types. First of all, you have to know that pointer types in the declaration are associated with the fragments, not the base type. So:

int *p, x;

declares an int pointer p and a plain int x. Now, notice that the asterisk is a prefix operator, whereas the function and array type constructors are postfix operators. So now we get to worry about precedence. You just have to remember that the asterisk binds loosest. So:

int *f(char);

is a function that returns an int pointer, whereas

int (*f)(char);

is a pointer to a function that returns an int. (You can throw in parentheses most anywhere in these things; forgot to mention that.) Okay, but here's the kicker: the nesting of type constructors is interpreted inside-out. Consider:

int (*(foo[3]))(char);

(Not that it helps, but I've over-parenthesized to avoid precedence issues.) If you're still trying, you might be tempted to read this as declaring a function that returns an array of int pointers, or maybe a pointer to an array of ints. But the inner-most syntactically nested type constructor is interpreted as the outer-most semantically nested type constructor. So foo is in fact an array of pointers to functions.
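If you'd rather have the compiler confirm this reading than take it on faith, here is a minimal sketch. The helper functions (to_code, is_digit, foo_length, call_foo) and the typedef names are made up for illustration; only the declaration of foo comes from the text above. The second declaration builds what should be the same type outside-in with typedefs, which is the order in which you'd say it out loud.

```c
#include <assert.h>
#include <stddef.h>

/* Two hypothetical functions of type char -> int, for illustration only. */
static int to_code(char c)  { return (int)c; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* The declaration from the text: an array of 3 pointers to
   functions taking a char and returning an int. */
int (*(foo[3]))(char) = { to_code, is_digit, to_code };

/* The same type built outside-in, one constructor at a time. */
typedef int char_to_int(char);   /* function: char -> int        */
typedef char_to_int *fn_ptr;     /* pointer to such a function   */
fn_ptr bar[3] = { to_code, is_digit, to_code };  /* array of 3 of them */

/* Element count of foo, computed from its size and its element size. */
size_t foo_length(void) { return sizeof foo / sizeof foo[0]; }

/* Index into the array, then call through the function pointer. */
int call_foo(size_t i, char c) { return foo[i](c); }
```

Indexing comes first and calling second, e.g. foo[1]('7'), which mirrors the semantic nesting: array on the outside, then pointer, then function.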

Friday, February 13, 2009

My first PLT patch

I've submitted my first patch to the internals of PLT Scheme. It's a tiny thing: the ability to unquote inside a quasiquoted hash-table literal, e.g.:

> `#hash((x . ,(+ 1 2)))
#hash((x . 3))

And Matthew had to fix up some issues with it.

But anyway, making contributions to PLT is easier these days due to a couple of factors. First, the creation of plt-dev, a public mailing list for discussing the development of PLT Scheme, means that there's better support for people who want to contribute. Second, I'm told that more of the internals of PLT Scheme are self-hosted than they apparently used to be, i.e. a lot of PLT Scheme is itself implemented in PLT Scheme. This makes it a lot less scary to dive into the code.

The C Typedef Parsing Problem

The well-known "typedef problem" with parsing C is that the standard C grammar is ambiguous unless the lexer distinguishes identifiers bound by typedef from other identifiers, treating them as two separate lexical classes. This means that the parser needs to feed scope information to the lexer during parsing. One upshot is that lexing must be done concurrently with parsing. That's standard, although because parser generators usually allow fixed lookahead, you have to pay very close attention to making sure the lexer stays properly in sync with the parser, even when it gets ahead. For example:

typedef int my_int;
my_int x;

At the semicolon, the type environment needs to be updated with an entry for my_int. But if the lexer has already looked ahead to my_int, it will have lexed it as an identifier rather than a type name. But this is just a small matter of heroic hacking.
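To make the ambiguity itself concrete, here is a small sketch (the function names are made up for illustration). The same three tokens, T * x, parse as a declaration or as a multiplication depending on what T is bound to in the enclosing scope, which is exactly the information the lexer needs from the parser.

```c
#include <assert.h>

typedef int T;   /* at file scope, T names a type */

/* Here `T * x;` is a declaration: x is a pointer to int. */
int as_declaration(void) {
    int target = 7;
    T * x;
    x = &target;
    return *x;
}

/* Here a local variable named T shadows the typedef, so the very
   same tokens `T * x` form a multiplication expression. */
int as_expression(void) {
    int T = 6, x = 7;
    return T * x;
}
```

A parser that sees only the token stream can't tell these two bodies apart without knowing which binding of T is in scope at that point.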

The real problem is a larger engineering one. Just to parse a program, you need to have the full type environment. And the program you're parsing may have included other programs. So if you want to write tools that perform analyses on fragments of C, you still have to feed them the entire environment one way or another. Either that becomes a requirement you foist onto the user ("this tool takes two inputs: a fragment of C and an initial type environment"), or you only allow whole programs instead of fragments, or you use an ambiguous grammar with, say, a GLR parser and divine some clever disambiguation heuristics.

And of course, C's braindead non-module-system is just a preprocessor directive (#include). So if you punt and require whole programs, that means you have to implement the C preprocessor, too. Or at least feed it through an existing implementation of the C preprocessor. And then process stdio.h, stdlib.h, etc. etc. every time you analyze just about any C program. So then you start thinking about caching results of #include for efficiency...

All this because of one silly grammar ambiguity.

Monday, February 09, 2009

History lesson

A history and semantics lesson from Matthias on plt-scheme:

Normal order and applicative order are failed attempts to explain the nature of call-by-name programming languages and call-by-value programming languages as models of the lambda calculus. Each describes a so-called reduction strategy, which is an algorithm that picks the position of next redex BETA that should be reduced. By 1972, it was clear that instead you want different kind of calculi for different calling conventions and evaluation strategies (to the first outermost lambda, not inside). That is, you always reduce at the leftmost-outermost point in a program but you use either BETA-NAME or BETA-VALUE. Non-PL people were confused (and still are) because BETA-NAME looks like BETA but nearly 40 years later, everyone should figure this out. SICP was written when the majority of people were still confused. -- Matthias

Saturday, February 07, 2009

A funny reader context

Trivia time: find a place in the Scheme reader where ,expr is not equivalent to (unquote expr).

Answer:

> (define ls '(1 2 3))
> `#(unquote ls)
#(1 2 3)
> `#,ls
#,ls

Wednesday, February 04, 2009

Using C to talk to C

I've released the first version of a PLaneT package for working with C. It's based on an idea I got from Felix Klock's Scheme Workshop 2008 paper: rather than try to do manual pointer arithmetic based on the current architecture's ABI, you can find out byte offsets of data structures by creating a C program that tells you what they should be. So that's what c.plt allows you to do; using a system-installed C compiler, you can generate a representation of the binary layouts of data structures. Here's a sample interaction:

> (define time.h
   (make-header
     #reader (planet dherman/c/reader) {
         struct tm {
             int tm_sec;
             int tm_min;
             int tm_hour;
             int tm_mday;
             int tm_mon;
             int tm_year;
             int tm_wday;
             int tm_yday;
             int tm_isdst;
         };
     }))
> (define time-abi
   (compile-header time.h
     (system-compiler #:include<> '("time.h") gcc)))
> (layout-size (time-abi 'tm))
36
> (layout-offset (time-abi 'tm) 'tm_sec)
0
> (layout-offset (time-abi 'tm) 'tm_year)
20
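The underlying trick can be sketched in plain C. This is not the code that c.plt generates; it's a minimal illustration, assuming the system's own time.h, of asking the compiler for the layout instead of computing it by hand. The function names are hypothetical, and the numbers you get depend on your platform and C library, so they may well differ from the ones in the interaction above.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */
#include <stdio.h>
#include <time.h>     /* struct tm */

/* Total size of struct tm as laid out by this compiler. */
size_t tm_size(void) { return sizeof(struct tm); }

/* Byte offsets of two fields, as the compiler sees them. */
size_t tm_sec_offset(void)  { return offsetof(struct tm, tm_sec); }
size_t tm_year_offset(void) { return offsetof(struct tm, tm_year); }

/* Print the layout as simple "name value" lines that another
   program could read back in. */
void print_tm_layout(FILE *out) {
    fprintf(out, "size %zu\n", tm_size());
    fprintf(out, "tm_sec %zu\n", tm_sec_offset());
    fprintf(out, "tm_year %zu\n", tm_year_offset());
}
```

Calling print_tm_layout(stdout) from a main function and capturing the output gives you the offsets for whatever ABI the compiler targets, with no pointer arithmetic done by hand.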
