Minor improvements to pattern matching

Kragen Javier Sitaker, 02021-03-24 (updated 02021-04-08) (10 minutes)

Reading EOPL I enountered their variant-case structure. The idea is that if you have, say, a tree made out of interior records and leaf records defined as (p. 80, §3.4.1)

(define-record interior (symbol left-tree right-tree))
(define-record leaf (number))

then you can define, say (p. 81, §3.4.2, slightly tweaked):

(define (leaf-sum tree)
  (variant-case tree
     (leaf (number) number)
     (interior (left-tree right-tree)
       (+ (leaf-sum left-tree) (leaf-sum right-tree)))
     (else (error "leaf-sum: Invalid tree" tree))))

Polymorphic variants in OCaml

This is pretty closely analogous to polymorphic variants in OCaml, except that the fields are named; in the last case, the symbol field is unused and so not mentioned. In OCaml we can define this without defining the record types first, but the fields are named only positionally:

# let rec leaf_sum = function `Leaf n -> n | `Interior (_, left, right) -> leaf_sum left + leaf_sum right ;;
val leaf_sum : ([< `Interior of 'b * 'a * 'a | `Leaf of int ] as 'a) -> int =
  <fun>
# leaf_sum (`Interior (`Leaf 4, `Interior (`Leaf 5, `Leaf 6)));;
Characters 9-60:
  leaf_sum (`Interior (`Leaf 4, `Interior (`Leaf 5, `Leaf 6)));;
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type
         [> `Interior of
              [> `Leaf of int ] *
              [> `Interior of [> `Leaf of int ] * [> `Leaf of int ] ] ]
       but an expression was expected of type
         [< `Interior of 'b * 'a * 'a | `Leaf of int ] as 'a
       Types for tag `Interior are incompatible
# leaf_sum (`Interior ("foo", `Leaf 4, `Interior ("bar", `Leaf 5, `Leaf 6)));;
- : int = 15

The type inferred is not actually fully general, because it requires the type for a given tag to be consistent:

# leaf_sum (`Interior ("foo", `Leaf 4, `Interior (3.14, `Leaf 5, `Leaf 6)));;
Characters 9-73:
  leaf_sum (`Interior ("foo", `Leaf 4, `Interior (3.14, `Leaf 5, `Leaf 6)));;
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type
         [> `Interior of
              string *
              ([< `Interior of string * 'a * 'a | `Leaf of int > `Leaf ]
               as 'a) *
              [> `Interior of float * [> `Leaf of int ] * [> `Leaf of int ] ] ]
       but an expression was expected of type 'a
       Types for tag `Interior are incompatible

Named fields for terser code

Usually you have more than one function operating on a given type, so it occurred to me that the Scheme code is somewhat redundant; as long as it’s only dispatching on record types, it could be written as follows:

(define (leaf-sum tree)
  (variant-case tree
     (leaf number)
     (interior (+ (leaf-sum left-tree) (leaf-sum right-tree)))
     (else (error "leaf-sum: Invalid tree" tree))))

Moreover the error could be implicit as it is in OCaml.

This way of doing things requires you to name your fields in a type declaration, and only accommodates the simplest pattern-matches, but those are nevertheless the most commonly used ones. (It also has the disadvantage that adding fields to a record type could silently change the meaning of existing code, instead of just breaking it as it normally does.) So you could imagine saying, for example:

a tree:
    a leaf:
        n: int
    a interior:
        sym: symbol
        left-tree: tree
        right-tree: tree

to leaf-sum:
    on leaf:
        n
    on interior:
        leaf-sum left-tree + leaf-sum right-tree

Pointer-bit variant discrimination

Another vaguely related pattern-matching note is that if your record types are all non-polymorphic sum types like the above, and you do full type erasure, as is normal in ML, then in most cases you can get away with discriminating them entirely with pointer tag bits, avoiding embedding a tag field in the record itself. tree above, for example, needs only one tag bit, to distinguish leaf from interior; very many such sum types need only 2–4. You could provide an “overflow tag”, say, when all the pointer-tag bits are 1, which indicates that the record does indeed contain a tag field further discriminates the record type, but only types with 8 or more variants will need it if your pointers are 64-bit aligned.

Here are the last few sum types I defined. These are from porting μKanren to OCaml:

type var = Var of Index.t (* the index is a counter typically from call_fresh *)
type term = Vart of var | Const of int | Pair of term * term
type 'a stream = Cons of 'a * 'a stream | Thunk of (unit -> 'a stream) | Mzero
type state = State of env * Index.t (* index of the next variable to create *)

This is from an incomplete port of COMFY-65 to OCaml; the real type would have about five more variants:

type ast = If of ast * ast * ast | Not of ast | Seq of ast list | Const of int

This is also sort of an example:

type num = Int of int | Float of float
type expr = Sum of expr * expr | Product of expr * expr | Const of num

This was also sort of an example:

type test_item = Hematocrit of int | Creatinine of float | Glucose of int
type test_items = EmptyTest | TestCons of test_item * test_items
type test = Test of (int * float * test_items)
type int_tag = HematocritT | GlucoseT
type float_tag = CreatinineT
type by_type_tag = EmptyBTT
                 | BTTConsInt of (int_tag * int * by_type_tag)
                 | BTTConsFloat of (float_tag * float * by_type_tag)
type int_item = HematocritI | GlucoseI
type float_item = CreatinineI
type item = IntItem of int_item * int | FloatItem of float_item * float
type maps_test = MTest of (int IntMap.t * float FloatMap.t)
  type item = Int of K.int_key * int
            | Float of K.float_key * float
  type int_key = Hematocrit | Glucose
  type float_key = Creatinine

This was from Neel Krishnaswami:

type 'a exp =
  | Var of string
  | App of 'a exp * 'a exp
  | Lam of string * 'a

This is a regular expression engine, based on a remark by Dave Long, which I cut down to use polymorphic variants in order to minimize the amount of code:

let rec any = function `N -> false | `C (h, t) -> h || any (t ())
and map f = function `N -> `N | `C (a, b) -> `C (f a, fun () -> map f (b ()))
and iota m n = if m = n then `N else `C (m, fun () -> iota (m+1) n)
let rec splits s = let n = String.length s in
                map (fun i -> String.sub s 0 i, String.sub s i (n-i))
                  (iota 0 (n+1))
and matches s = function `Lit t -> s = t
                       | `Cat (h, t) -> any (map (fun (a, b) ->
                         matches a h && matches b t) (splits s))
                       | `Alt (a, b) -> matches s a || matches s b
                       | `Star r -> s = "" || matches s (`Cat (r, `Star r))

This uses two types, which could be defined in the conventional way as

type stream = Cons of bool * (unit -> stream) | Nil
type regex = Alt of regex * regex | Cat of regex * regex | Star of regex | Lit of string

And here are some types from Bicicleta:

type methods = NoDefs
                           (* name, body, is_positional ... *)
               | Definition of string * bicexpr * bool * methods
and bicexpr = Name of string
              | Call of bicexpr * string
              | Literal of string option * methods
              | Derivation of bicexpr * string option * methods
              | StringConstant of string
              | Integer of int
              | Float of float
              | NativeMethod of (lookup -> bicobj)
and userdata = UserString of string
              | UserInteger of int 
              | UserFloat of float
                  (* name, selfname, body, env *)
and bicmethod = string * string option * bicexpr * lookup
and bicobj = ProtoObject 
             | BaseObject of lookup
            (* Derive of positional method names, methods, parent, cache *)
             | Derive of string list * bicmethod list * bicobj
                 * (string, bicobj) Hashtbl.t option ref
             | UserData of userdata
             | Error of string * string
and lookup = Phi | Add of string * bicobj * lookup ;;

So, in reverse order, these types have 2, 5, 1, 3, 8, 2, 4, 2, 3, 1, 2, 2, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 3, 2, 4, 1, 3, 3, and 1 variant. So in most cases you could distinguish them entirely with pointer bits, even if you only had two pointer bits to play with.

A more aggressive way to handle this is to represent references of a given type as integers, some of whose bits indicate which variant the object belongs to, while the other bits index an array of all objects of that variant. For example, the high 16 bits of a 32-bit oop might indicate whether an object is a ProtoObject, a BaseObject, a Derive, a UserData, or an Error, while the low 16 bits index an array of Derives or BaseObjects or whatever. For really simple generational garbage collection you could allocate a second set of typecodes for nursery ProtoObjects, nursery BaseObjects, and so on, where the “index” bits directly indicate an offset into the nursery (probably bit-shifted by whatever your nursery allocator alignment is.)

The regexp engine revisited

The regex engine above in OCaml with polymorphic variants consists of 547 non-indentation characters. We could rewrite it in the above notation:

a stream:
    a cons:
        car: bool
        cdr: unit -> stream
    a nil

a regex:
    a literal:
        content: string
    a catenation:
        head: regex
        tail: regex
    a alternation:
        a: regex
        b: regex
    a closure:
        content: regex

to any:
    on nil:
        false
    on cons:
        head or any (tail())

to map f:
    on nil:
        nil
    on cons:
        cons (f head) λ().map f (tail())

to iota m n:
    nil if m == n else cons m λ().iota (m+1) n

to splits s:
    n ← #s
    map (λi.s[0:i], s[i:n-i]) (iota 0 (n+1))

to matches s:
    on literal:
        s == content
    on catenation:
        any (map (λa b.matches a head and matches b tail) (splits s)
    on alternation:
        matches s a or matches s b
    on closure:
        s == "" or matches s (cat content (star content))

That’s 646 non-indentation characters, 15% larger. You could imagine that if you had more than one function on regexps, you could start winning:

to can_be_empty:
    on literal:
        s == ""
    on catenation:
        can_be_empty head and can_be_empty tail
    on alternation:
        can_be_empty a or can_be_empty b
    on closure:
        true

Topics