Monday, July 11, 2005

Deriving an Efficient Tree Traversal, Part I

In an attempt to live up to this blog's name, I'm going to spend the next few posts describing a series of transformations that derive an efficient implementation of a tree traversal suitable for imperative languages, starting from a natural, structural recursion in a language that isn't purposefully broken. This is mostly based on material from Mitch's Principles of Programming Languages course that all first-year Northeastern PhD students take--especially data abstraction, contexts, and continuations. This is great stuff, and it's worth learning.

The Problem

The problem we'll solve is to write a traversal of the DHTML DOM tree in a web page. The problem is useful and ubiquitous, but it's hard to solve in a language that doesn't like recursion. It's pretty well-known that you need to maintain an explicit stack to drive the traversal, but it's not obvious how to get it right. I could point out popular libraries that get it wrong, but I'll be polite.

For the sake of example, I'll simplify the problem somewhat: we'll represent DOM nodes as a simple structure:
;; tree = symbol × (listof tree)
(define-struct tree (tag children))
Leaf nodes are represented as trees with empty lists of children. We'll traverse the DOM in the style of CSS selectors, which are represented as an ancestry path of tags:
;; path = (listof selector)
;; selector = symbol
For example, the path '(foo bar) represents the set of all bar nodes that are descendants of foo nodes.

Natural Solution

The clearest implementation of the problem is a structural induction on the DOM tree and selector path (technically, the ordering is lexicographic over both inputs):
;; select : tree × path → (listof tree)
(define (select tree path)
(cond
[(null? path) (list tree)]
[(matches? tree (car path)) ; does node's tag match first tag in path?
(append (select tree (cdr path))
(select* (children tree) path))]
[else (select* (children tree) path)]))

;; select* : (listof tree) × path → (listof tree)
(define (select* trees path)
(cond
[(null? trees) null]
[else (append (select (car trees) path)
(select* (cdr trees) path))]))
We have two distinct datatypes to recur on: tree and (listof tree), which leads to two separate functions. When we find a node that matches, we need to test it against the rest of the path as well; otherwise we just search its descendants. We only produce a match when we find a node that has matched the entire selector path.1

CPS

The recursive solution uses the implicit control stack of Scheme to drive the traversal, but we want an explicit implementation. The easiest way to represent control is to CPS a program, so let's try that:
;; select/k : tree × path × ((listof tree) → (listof tree)) → (listof tree)
(define (select/k tree path k)
(cond
[(null? path) (k (list tree))]
[(matches? tree (car path))
(select/k tree (cdr path)
(lambda (ts1)
(select*/k (children tree) path
(lambda (ts2)
(k (append ts1 ts2))))))]
[else (select*/k (children tree) path k)]))

;; select*/k : (listof tree) × path × ((listof tree) → (listof tree)) → (listof tree)
(define (select*/k trees path k)
(cond
[(null? trees) (k null)]
[else (select/k (car trees) path
(lambda (ts1)
(select*/k (cdr trees) path
(lambda (ts2)
(k (append ts1 ts2))))))]))
This no longer relies on the control stack of Scheme, because every function call is in tail position. However, it isn't a satisfying solution for an iterative language--even one with higher-order functions--because without proper tail calls, it will still use unbounded control space.

To be continued...
1 For the sake of simplicity, I'm not going to worry about duplicate elements in the result list.

5 comments:

Jacob Matthews said...

Great series, thanks for writing this. One thing: it seems as though your program in all its forms collapses any number of repetitions of any tag. For instance:

> (select (make-tree 'a (list (make-tree 'b '()))) '(a a a b))
(#(struct:tree b ()))
> (select (make-tree 'a (list (make-tree 'b '()))) '(a b))
(#(struct:tree b ()))

That seems totally weird to me, but maybe it's the intended behavior. Anyway, fixing it to do what I'd expect (i.e. make repetitions of a tag in a path require sequential occurences of the tag to appear in the path) is a bit of extra work, and given all the transformations you're doing maybe you just didn't want to make the program any more complicated than it already is?

(Oddly enough I noticed this behavior by studying the final program, not the first one; I guess I'm just weird that way.)

Dave Herman said...

You win the prize for finding the first problem! Basically, the semantics I've implemented is that the "descendant" relation is reflexive, i.e., a node is its own descendant. It would probably make more sense not to be reflexive, but it's actually not hard to fix; just change the recursion to

(append (select* (children t) (cdr path)) (select* (children t) path))

But I think the bigger problem is of repeated nodes in the output. You probably don't want to generate a list with duplicates, but removing duplicates is inefficient. (In this simplified example, if you remove the reflexivity, I think it might not come up, but in general, i.e. with real CSS selectors, it will.)

You could use an efficient set representation like that crazy union-find stuff, but a better way might be to visit every node just once, collecting up the set of all tests for each node and try them in order, stopping as soon as you find one that accepts the node.

This is starting to sound more like graph traversal... which full CSS selectors are anyway, since you have to do things like test siblings. Blech.

(BTW, you are weird for finding this bug in the final program. I'd found it in the original program. I hardly thought anyone could even read the final one...)

Jacob Matthews said...

it's actually not hard to fix; just change the recursion to

(append (select* (children t) (cdr path)) (select* (children t) path))


I'm weirder than you think; I actually tried that before posting. Doing it that way changes the output list into the children of matching trees, not that tree that matches the last tag (as the function as-is returns); to fix that the easiest way I came up with was to treat the path as a nonempty list and return the tree if it matches the last tag in the path. A bit nasty, but doable.

Dave Herman said...

Oops, right you are. Thanks!

Dave Herman said...

FWIW, I just noticed in the CSS spec that CSS sibling selectors don't necessitate graph traversal: they're only for adjacent siblings, and they're ordered. That makes it easier. Ugly, but easier.