Friday, February 22, 2008

True unions

I am a Super Big Fan of disjoint union datatypes in programming languages, but there are places where they are really inconvenient. Having to inject and project data between related types can be prohibitively cumbersome, especially when dealing with large data definitions such as the AST definitions for realistic programming languages. I know of a couple of languages where "true" unions are used instead, including Typed Scheme and the draft ECMAScript Edition 4. In both cases, unions are being added to a language where the runtime is already paying the price for runtime type tags, so keeping the variants distinct where they don't overlap doesn't introduce any extra burden.

But I was thinking this morning, what would happen if you tried to add true unions to Hindley-Milner languages? For concreteness, let's imagine extending Haskell. So instead of writing
data Foo = I Int
| S String
you could instead write
type Foo = Int | String
Now if you want to pattern match on such a construct, you have to explicitly mention the type; there's no other way to distinguish the variants. So you could imagine type-annotating the patterns in a match expression:
showFoo :: Foo -> String
showFoo foo = case foo of
n :: Int -> show n
s :: String -> s
Note that because there's nothing prevent overlap in the variants of a true union, the order in which you match the patterns is significant. Alternatively, you could write the definition of showFoo in pattern-definition style:
showFoo :: Foo -> String
showFoo (n :: Int) = show n
showFoo (s :: String) = s
Consider the possible drawbacks to such a feature:

Cost of tagging:

One of the benefits of (sound) static typing is the ability to compile to efficient runtimes that avoid tagging runtime objects with their datatype. The Foo type, by contrast would require its Ints and Strings to be tagged. But this is the wrong way of looking at it; the language requires you to tag them anyway if you use the disjoint union type, so there's no additional cost over what you would already have been paying. For non-union types, you can still do the same amount of erasure.

Possible overlap:

You can express overlapping types like (Int | (Int | String)) with true unions, which makes the order of patterns significant, could result in surprising (order-dependent) logical relationships between case dispatches, and could generally lead to messy type definitions. Maybe a more principled way of looking at it is a disjoint union can be thought of as a type abstraction, whereas with a true union you might have to know its full definition to use it effectively. But hey, the same is true of type-definitions but they're still useful; and besides, nothing's preventing you from using disjoint unions when they're more appropriate.

Case exhaustiveness:

Standard ML's pattern matching facility is carefully designed to allow the compiler to prove or disprove case exhaustiveness. I tend to think this is a bad trade-off; the language is significantly crippled to enable a very weak theorem prover to prove a theorem that's of limited utility. Sure, it has its uses, but when you know things the theorem prover doesn't, you have to abandon pattern matching entirely. Other language designers seem to agree with me, since Haskell and I think Ocaml also don't check for case exhaustiveness.

Inference problems:

I have no expertise in Hindley-Milner inference, but I'm sure true unions are not inferrable. But one thing recent Haskell research has demonstrated is that convenient but uninferrable extensions to Hindley-Milner type systems can often be safely added with the requirement that they be explicitly annotated. I bet you could add true unions, require explicit type annotations when they're used, and get along just fine.

Function types and polymorphic types:

This is where it gets tougher. How do you tag these kinds of types? Do you do runtime subtyping checks for things like the more-polymorphic-than relation? I don't have an answer. A simple one is to restrict unions to types with simple tags, although those kinds of arbitrary restrictions lead to lack of compositionality in language design. But perhaps it's possible to make this work with the entire type system.

Some of this design space has been explored with Ocaml's polymorphic variants. I should take a look at the literature on that.

13 comments:

Kenn Knowles said...

Dave,

You start by talking about how annoying it is to project/inject between related types, which I can agree with. But then your example uses a union of essentially unrelated types and takes advantage of the fact that the union _is_ disjoint.

I do like your example, which seems to be essentially a proposal for automatically creating disjoint unions between named types. Something like this exists in Haskell as the 'Typeable' type class, but I could see language support making this very easy (also easily abused!)

True union types are more complicated and can be found in various guises (apologies if this is old news to you):
- The set-based analysis of N Heintze, A Aiken, and many others. It is the topic of Cormac's dissertation, in fact.
- In Pfenning and company's intersection type systems, unions naturally get involved. Some references: Unions in Call-by-Value Languages by Dunfield and Pfenning (http://www-2.cs.cmu.edu/~rwh/courses/refinements/papers/DunfieldPfenning03/fossacs03.pdf), Subtyping Union Types by Vouillon (http://www.pps.jussieu.fr/~vouillon/publi/subtyping.pdf)
- Your desires also seem related to "overloading" which is solved in Haskell by type classes, Generic Haskell by recursion over types, and in little calculi like A Calculus for Overloaded Functions with Subtyping by Castagna, Ghelli, and Longo (http://www.di.unipi.it/~ghelli/papers/CasGheLon95-ic.ps).

(Apologies for the URLs, Blogger denied all my link tags)

Jeremy said...

You might to look at CDuce, which allows you to write "showFoo" more or less exactly as you have it:

type Foo = Int | String

let fun showFoo (Foo -> String)

  | n & Int -> string_of n

  | s & String -> s

Dave Herman said...

Hi Kenn--

But then your example uses a union of essentially unrelated types and takes advantage of the fact that the union _is_ disjoint.

The way I was thinking about it is that disjoint union types enforce disjointness, which you usually want to be true, but often you already know that the variants are disjoint and don't need the extra level of tagging to enforce it.

For example, in ECMAScript 4, we have runtime tags for things like instances of classes, so when you already know that Int and String are disjoint, there's not much benefit to wrapping elements of (Int | String) in an extra constructor.

But of course, I'm more used to untyped languages like Scheme, where the tagging is always there, so I suspect I might be making an apples-to-oranges comparison. It's a somewhat different situation in ML and Haskell, where unions of Int and String can never appear in collections or as the result of computations, so runtime tags aren't necessarily going to be there. So I'm willing to acknowledge the possibility that my analogy is leading me astray.

I am aware of the work on SBA, though by "true unions" I don't mean "set-theoretic," I just mean "not necessarily disjoint unions". But thanks for all the references!

Vesa Karvonen said...

I doubt that there are (m)any people who would think that SML's pattern matching could not or should not be extended. As a case in point, I know that SML/NJ, Alice ML, and HaMLet-S all provide more expressive pattern matching than what is specified in the Definition of Standard ML. While it is more than likely that people would have differing opinions on what exactly should be supported, I don't think that this would be an issue where designers of SML derivatives would be completely disagreeing with "other language designers". Personally I value exhaustiveness checking. Some of the extensions (e.g. (disjunctive) or-patterns, nested patterns, and generalized (conjunctive) as-patterns) do not even necessarily make it significantly more difficult to check for exhaustiveness.

BTW, I tried to post a comment like this earlier already, but for some reason it didn't get through. Is the blog moderated or does it just regularly eat comments?

Chris said...

One note: GHC will check for exhaustive patterns if you use the -fwarn-incomplete-patterns flag.

Adrien said...

% cat ex.ml
type sum = Foo | Bar

let print_sum = function
| Foo -> "foo";;

let print_int_2 = function
| 0 -> "0";;

% ocaml ex.ml
File "ex.ml", line 3, characters 16-41:
Warning P: this pattern-matching is not exhaustive.
Here is an example of a value that is not matched:
Bar
File "ex.ml", line 6, characters 18-39:
Warning P: this pattern-matching is not exhaustive.
Here is an example of a value that is not matched:
1


That seems pretty good. OCaml's patterns checked that from a finite sum type, pattern was incomplete, and could even do it for an "infinite" one (modulo 2^31). I reckon that even on really infinite data such as GMP's bignums, pattern exhaustivity in present.

Dave Herman said...

Vesa: I don't know why you weren't able to get through before. It's not moderated. Blame blogger, I suppose. But thanks for the comment!

I'm not against exhaustiveness checking per se. It's just that there's a trade-off between the expressivity of the pattern matching language and the feasibility of exhaustiveness checking. This trade-off changes in different contexts (for an extreme example, if you have a language with an embedded proof assistant or theorem prover, which seems to be becoming a popular research topic, then you may not need to restrict the pattern language at all). But the loss of e.g. arbitrary boolean expression guards makes me suspect that the trade-off that SML makes may not be worth it. But I do take your point that it's possible to make the pattern language more expressive and still have exhaustiveness checking.

Chris, Adrien: I'm curious--I was under the impression OCaml and Haskell use backtracking algorithms rather than a decision tree for pattern matching, which I thought made exhaustiveness checking harder. I've never implemented this myself. What am I missing?

Vesa Karvonen said...

Alice ML and HaMLet-S both include guard(ed) patterns.

Also, AFAIK, all of SML, OCaml, and Haskell specify a simple left-to-right, top-to-bottom order for pattern matching. The way it is implemented internally by a specific compiler is another matter.

Chris said...

{-# OPTIONS_GHC -fwarn-incomplete-patterns #-}

data Sum = Foo | Bar

print_sum Foo = "foo"
print_int_2 0 = "0"

>ghci Tbp2.hs
GHCi, version 6.8.2: http://www.haskell.org/ghc/ :? for help
...
Tpb2.hs:5:0:
Warning: Pattern match(es) are non-exhaustive
In the definition of `print_sum': Patterns not matched: Bar

Tpb2.hs:6:0:
Warning: Pattern match(es) are non-exhaustive
In the definition of `print_int_2':
Patterns not matched: #x with #x `notElem` [0#]


compiler/deSugar/Match.lhs in GHC says it's basically the same as in the Wadler chapter, except monadised to carry around extra state. It looks to me like it uses a series of classification, grouping and extraction rules to build the match expression. Each pattern group has its own way of determining exhaustiveness.

I'm hardly an expert though, and I know there have been a few problems in the with it working precisely in the past. Nontrivial guards (i.e. not including a literal True case) are always assumed to be incomplete.

Ezra Cooper said...

I do recommend looking at OCaml's polymorophic variants; I think they're quite a good solution to a similar (but perhaps different) problem to the one you're looking at.

One reason why they're liberating is because they do allow overlapping variant types. This allows you to avoid a lot of tagging/untagging steps.

We use them to great effect throughout the Links system, and they are the only kind of variants in Links itself. We also offer their dual, "extensible records" which OCaml misses out (or maybe they just lose the duality by calling them "objects" and implementing a different set of features on them).

There are some choices to the way type inference is done, which Jacques Garrigue has usefully explored in a couple of papers. The choices made in OCaml could lead to the inference of surprising types, but I haven't had problems, so Garrigue seems to have made the decent choices.

Ezra Cooper said...

Back to your original question with a little healthy criticism.

Are you sure you really want the type of the value itself to determine the behavior? I think it leads to a lack of abstraction. As a straw man, suppose you have a function that takes either a database ID (Int) or a name (String) as a way of specifying a user; fine, you can use true unions to handle this--until you decide, say, that database IDs should actually be alphanumeric. Now you've got two things colliding that are semantically unrelated, just because they happen to be represented by the same type.

To be fair, when I'm programming in languages like Perl or Scheme (let's say "tagless languages"), I do the above type-driven checking all the time. But I know I'm cheating on myself: I think the failure of abstraction does make my code inflexible in this way, and I think the discipline of variant tags is a good one, and not onerous, at least in the polymorphic form.

Dave Herman said...

Ezra,

Good points, but I just want to note that dynamically typed =/= lack of data abstraction. In Scheme, you can create new records that are distinct from all other datatypes so that you can in fact still make these guarantees. My point is not that one way is uniformly better than the other, but rather that a cost of data abstraction is sometimes-excessive explicit tagging and untagging, and there are cases where I don't need the data abstraction.

Bryan O'Sullivan said...

You can tack somewhat close to your goal using GADTs.