Friday, February 22, 2008

True unions

I am a Super Big Fan of disjoint union datatypes in programming languages, but there are places where they are really inconvenient. Having to inject and project data between related types can be prohibitively cumbersome, especially when dealing with large data definitions such as the AST definitions for realistic programming languages. I know of a couple of languages where "true" unions are used instead, including Typed Scheme and the draft ECMAScript Edition 4. In both cases, unions are being added to a language where the runtime is already paying the price for runtime type tags, so keeping the variants distinct where they don't overlap doesn't introduce any extra burden.

But I was thinking this morning, what would happen if you tried to add true unions to Hindley-Milner languages? For concreteness, let's imagine extending Haskell. So instead of writing
data Foo = I Int
| S String
you could instead write
type Foo = Int | String
Now if you want to pattern match on such a construct, you have to explicitly mention the type; there's no other way to distinguish the variants. So you could imagine type-annotating the patterns in a match expression:
showFoo :: Foo -> String
showFoo foo = case foo of
n :: Int -> show n
s :: String -> s
Note that because there's nothing prevent overlap in the variants of a true union, the order in which you match the patterns is significant. Alternatively, you could write the definition of showFoo in pattern-definition style:
showFoo :: Foo -> String
showFoo (n :: Int) = show n
showFoo (s :: String) = s
Consider the possible drawbacks to such a feature:

Cost of tagging:

One of the benefits of (sound) static typing is the ability to compile to efficient runtimes that avoid tagging runtime objects with their datatype. The Foo type, by contrast would require its Ints and Strings to be tagged. But this is the wrong way of looking at it; the language requires you to tag them anyway if you use the disjoint union type, so there's no additional cost over what you would already have been paying. For non-union types, you can still do the same amount of erasure.

Possible overlap:

You can express overlapping types like (Int | (Int | String)) with true unions, which makes the order of patterns significant, could result in surprising (order-dependent) logical relationships between case dispatches, and could generally lead to messy type definitions. Maybe a more principled way of looking at it is a disjoint union can be thought of as a type abstraction, whereas with a true union you might have to know its full definition to use it effectively. But hey, the same is true of type-definitions but they're still useful; and besides, nothing's preventing you from using disjoint unions when they're more appropriate.

Case exhaustiveness:

Standard ML's pattern matching facility is carefully designed to allow the compiler to prove or disprove case exhaustiveness. I tend to think this is a bad trade-off; the language is significantly crippled to enable a very weak theorem prover to prove a theorem that's of limited utility. Sure, it has its uses, but when you know things the theorem prover doesn't, you have to abandon pattern matching entirely. Other language designers seem to agree with me, since Haskell and I think Ocaml also don't check for case exhaustiveness.

Inference problems:

I have no expertise in Hindley-Milner inference, but I'm sure true unions are not inferrable. But one thing recent Haskell research has demonstrated is that convenient but uninferrable extensions to Hindley-Milner type systems can often be safely added with the requirement that they be explicitly annotated. I bet you could add true unions, require explicit type annotations when they're used, and get along just fine.

Function types and polymorphic types:

This is where it gets tougher. How do you tag these kinds of types? Do you do runtime subtyping checks for things like the more-polymorphic-than relation? I don't have an answer. A simple one is to restrict unions to types with simple tags, although those kinds of arbitrary restrictions lead to lack of compositionality in language design. But perhaps it's possible to make this work with the entire type system.

Some of this design space has been explored with Ocaml's polymorphic variants. I should take a look at the literature on that.