A compact bytecode sketch that should average about 3 bytes per line of C

Kragen Javier Sitaker, 02021-08-17 (updated 02021-09-13) (66 minutes)

From Table I we see that the assignment, IF, CALL, RETURN and FOR statements together account for 96 percent of the source statements. Therefore we will design an instruction set to handle the object code from these statements efficiently. To push local variables (including parameters) onto the stack, we propose 12 distinct 1-byte (format 1) opcodes, one each for offsets 0-11. Twelve instructions allow access to all the locals (and parameters) in 94.6 percent of the procedures, and to more than 50 percent of the locals in the remaining procedures. For example, opcodes 114-125 might be used for PUSH LOCAL 0, PUSH LOCAL 1..., PUSH LOCAL 11.

(Tanenbaum 01978, §5, p. 243)

What would a compact stack bytecode for C look like? I think you could usually manage to compile to about 3 bytes per line of C code, which would enable you to run C programs of about 10kloc on an Arduino or about 300kloc on an Ambiq Apollo3, in exchange for nearly an order of magnitude worse power usage and runtime for the interpreted code. This could expand the uses of such small computers dramatically.

Why?

There are two major reasons to do this: to pack more functionality into less memory and to improve in-application programmability.

To pack more functionality into less memory

There are lots of microcontrollers around now, but for traditional personal computing purposes they have more speed than needed and less memory. The 3¢ Padauk microcontrollers (the PMS150 family) as described in file minimal-cost-computer.md in Derctuo has “512-4096 words of program memory and 64-256 bytes of RAM, all running at up to 16 MHz (normally 8 MIPS) ... running on 2.2–5 volts at 750 μA/MHz,” though they’re out of stock everywhere last I saw. As mentioned in A bargain-basement Holter monitor with a BOM under US$2.50, the ATTiny1614 costs 82¢ from Digi-Key; it has 16KiB of program memory (8192 words) and 2048 bytes of RAM and runs at 20 MHz (≈20 MIPS). (The formerly cheaper ATTiny25V-15MT is now out of stock due to the Chipocalypse and its price has shot up enormously to US$1.82 in quantity 1000.) The STM32F051C8T6 is still out of stock at Digi-Key with a standard lead time of 52 weeks, and now the CKS32F051C8T6 is out of stock at LCSC too, but at least they have the GD32F130C8T6: 64KiB program memory, 8KiB RAM, 72 MHz Cortex-M3 (80 DMIPS?), US$3.25. And the Ambiq Apollo3 mentioned in Energy autonomous computing has 1MB Flash and 384KiB SRAM and is a 48MHz Cortex-M4F, running at about 10 μA/MHz at 3.3 volts.

In How do you fit a high-level language into a microcontroller? Let’s look at BBN Lisp and Energy autonomous computing I’ve written about how to use off-chip memory effectively to bridge this gap. Swapping in code or data from off-chip Flash is a great deal faster than swapping from a spinning-rust disk drive or loading overlays from a floppy, so it can be a lot more transparent.

However, an alternative approach is to make more frugal use of the internal memory, and in How do you fit a high-level language into a microcontroller? Let’s look at BBN Lisp (and in file tiny-interpreters-for-microcontrollers in Dercuano) I explored compact bytecode formats a bit.

The Microsoft Excel team wrote their own C compiler in the 01980s for this purpose; as I understand it, to fit more functionality into PCs of the time, such as the Macintosh, their compiler allowed them to choose for each function whether to compile it to bytecode or to native code.

The MakeCode VM reportedly consists of about 500 bytes of AVR machine code:

MicroPython and similar environments cannot run on the [Arduino] Uno due to flash [32 KiB] and RAM [2 KiB] size limitations. We also ran into these limitations, and as a result, developed two compilation modes for AVR. One compiles STS [Static TypeScript, a JS-like language] to AVR machine code, and the other (MakeCode VM) generates density-optimized byte code for a tiny (~500 bytes of code) interpreter. The native strategy achieves code density of about 60.8 bytes per statement, which translates into space for 150 lines of STS user code. The VM achieves 12.3 bytes per statement allowing for about 800 lines. For comparison, the ARM Thumb code generator used in other targets achieves 37.5 bytes per statement, but due to the larger flash sizes we did not run into space issues.

In exchange for this 4.9× code compression, they accepted about a 6.4× interpretive slowdown.

The Aztec C compiler for the Apple (available for free download) included a bytecode mode to reduce space, and you could compile only part of your program with it:

As an alternative, the pseudo-code C compiler, CCI, produces machine language for a theoretical machine with 8, 16 and 32 bit capabilities. This machine language is interpreted by an assembly language program that is about 3000 bytes in size.

The effects of using CCI are twofold. First, since one instruction can manipulate a 16 or 32 bit quantity, the size of the compiled program is generally more than fifty percent smaller than the same program compiled with C65 [the Aztec C native code compiler for the 6502]. However, interpreting the pseudo-code incurs an overhead which causes the execution speed to be anywhere from five to twenty times slower.

Chuck McManis famously reverse-engineered the Parallax BASIC Stamp, based on a PIC16C56, and found that it used a variable-bit-length “bytecode” system to take maximum advantage of the 2048 bytes of its serial EEPROM. Most of the “tokens” in the “bytecode” corresponded straightforwardly to BASIC tokens, but some of them were significantly swizzled around; NEXT B0 compiles to 10111 CCCVVV VVVVVV F CCCVVV CCCVVV AAAAAAAAAAA where CCCVVV encodes the increment 1, VVVVVV encodes B0, F encodes +, the second CCCVVV encodes the ending counter value (255, say), the third CCCVVV encodes the starting counter value (0, say), and the final AAAAAAAAAAA encodes the address of the statement to jump to (the first statement inside the loop). This is 41 bits, plus 16 in the FOR token, for a total of 57 bits per loop.

How big is, uh, functionality? The original MacOS Finder was 46KiB and fit on a 400KiB floppy with MacOS (“System”) and an application and a few documents, making it possible to use the Macintosh with a single floppy, though drawing on the 64KiB of ROM including LisaGraf/QuickDraw (the 128KiB ROM didn’t arrive until the Mac Plus?); MacPaint was under 50 KB and consisted of 5804 lines of Pascal and 2738 lines of assembly, but that includes blank lines and comments and excludes MyHeapAsm.a and MyTools.a. A more reliable figure, from David A. Wheeler’s “SLOCCount”, is 4688 lines of Pascal and 2176 lines of assembly. If we figure that a line of Pascal is about 5 lines of assembly, the folklore.org figures give the equivalent of 6351 lines of Pascal, and under 7.9 bytes of compiled code per line; the SLOCCount figures give 5123 Pascal-equivalent lines and under 9.8 bytes of compiled code per line.

If you could get 4 bytecode bytes per Pascal-level line of code, which seems plausible but difficult, you could fit MacPaint into about 20 KB. The same 2.5× compression would reduce the Mac ROM to 26 KB.

There’s a nonlinear benefit to this kind of thing, too: the functionality of software comes more from interactions among the components than from individual components. If you can fit 12000 lines of code into your Arduino, there are nine times as many potential pairwise interactions as if you can only fit 4000 lines of code in. At least potentially, it’s an order-of-magnitude increase in system functionality.

To facilitate programmability

So, the primary objective is to fit more functionality into less memory. But a secondary objective is to improve “in-application programmability” (that is, reprogrammability without popping the microcontroller out of its circuit and into a PROM burner) and thus flexibility.

Such a bytecode can help overcome several different obstacles to in-application programmability.

One-time-programmable devices like the PMS150C can only have code burned into them once; any further change requires either making that code do different things based on what’s in RAM, or replacing the chip with a fresh, unburned chip (3.18¢ according to file minimal-cost-computer.md in Derctuo — but taking advantage of that potentially involves desoldering and/or waiting for a package in the mail). It has 1024 words of PROM but only 64 bytes of RAM, and like the AVR and the GD32VF, it’s a Harvard-architecture device, so it can’t execute native code from RAM. Only some kind of interpreter can make it programmable.

Moreover, RAM is often small, but loading code into RAM is easier than loading code into program memory, when that is possible at all. The PMS150C’s 64 bytes of RAM is enough for about 15 or 20 lines of C using a compact bytecode, but would be only about 5 lines of C using native code. (And in practice you probably only have about 48 of those bytes, since you need a few for stack and variables.) The 82¢ ATTiny1614 I mentioned earlier has 16384 bytes of Flash but only 2048 bytes of RAM; 2048 bytes is enough for 500–800 lines of C.

Finally, compiling to bytecode can be considerably easier than compiling to native code, which makes it more feasible to include a compiler in the device itself. Its input format doesn’t necessarily have to be C; it might be something like ladder logic, scientific-calculator formulas, keyboard macros, Lisp, or Forth.

How?

Probably the right thing to do, to be able to run existing microcontroller-friendly code, which is mostly written in C, C++, or the Arduino dialect of C++, is to write a C compiler to some kind of very compact bytecode, along with a bytecode interpreter for it.

A virtual machine designed for C probably needs a frame-pointer register, but otherwise a registerless stack architecture will probably result in more compact code. A lot of the usual C compiler optimizations aren’t useful on a stack machine, though dead-code elimination, constant folding, and common subexpression elimination may be valuable.

There’s a tradeoff between interpreter size and code size. In the MakeCode case, they evidently had about 9–10 KiB left for STS user code, and by using 500 bytes of it for the interpreter they were effectively able to increase the functionality that fit in memory by almost a factor of 5. But this is probably far from the optimum; if by doubling the size of the interpreter they were able to increase the code density from 12.3 to 10 bytes per statement, then instead of 800 statements, they would be able to fit 930 statements into the same space. How much further could you take that approach?

So it’s quite plausible that the right tradeoff for minimum space is to use the majority of program memory for a really elaborate interpreter. But we can’t do this simply by adding common subroutines as built-in interpreter instructions written in native code; we can certainly vector some built-in interpreter instructions to common subroutines, but if we want to use minimum space, we should still encode that common subroutine in bytecode, not native code. (For the most part, anyway — while you could implement multiplication in terms of addition and bit tests, for example, invoking a multiply instruction is going to be a lot simpler, and similarly for most functionality commonly provided as part of the CPU instruction set.) So what should we put in the interpreter proper? Other than optimizations, of course.

Probably a big part of the answer is “weird addressing modes”.

For really tiny chips like the PMS150C with its 1024 instructions of ROM, probably only 512-768 instructions or so of interpreter are affordable; obviously 1024 instructions would not be. So I think it’s worthwhile to think about what kind of kernel bytecode interpreter you could fit inside that kind of constraint, even on an 8-bit machine, before decorating it with richer functionality for machines with larger memory.

Elisp and Smalltalk bytecodes commonly have an operand field within the byte, 3–5 bits of operand. Often these operands are indexes into tables (of constants, selectors, or functions) that are associated with the method or function, and which can contain more data than the bytecode itself.

I tend to think that variable-bit-length instructions like the BASIC Stamp are hard to justify, particularly on machines like the AVR where you apparently need to write a 5-iteration loop to shift a 16-bit word by 5 bits. But variable-byte-length instructions are probably easy to justify, where perhaps you have 3 bits of operand within the instruction byte, perhaps another operand byte following, and perhaps, with a different opcode, two operand bytes following.

Making a global table of all subroutines is probably a win. Talking to solrize about the subject, they made the observation that if a subroutine is called zero times, it can be omitted from the compiled program; if it is called once, it can be inlined; and if it is called twice or more, then it is quite plausibly a savings to put its address into a global table of subroutines, and refer to it elsewhere only with an index into this subroutine table. For example, on the ATMega328 in the Arduino Uno, a word-aligned address anywhere in the 32-KiB program memory requires 14 bits, but if you can only fit 800 lines of user code in there, you can’t plausibly have more than 160 user subroutines, so you only need 8 bits for a subroutine-table index. Or 7.32 bits, really. So by putting a 16-bit address into the subroutine table, you can invoke the subroutine with an 8-bit operand instead of a 14-bit operand, which probably saves you the 2 bytes that it cost to put it into the subroutine table. If there are more than 2 calls to it, you’re in the black.

Exceptions might include where there was one reference to the subroutine, but you were passing its address rather than invoking it, so it can’t be inlined; or where the alternative to a global table of subroutines is a local table (like the ones Smalltalk and Elisp use) that can be indexed with fewer bits, like 3–5 instead of 7.32.

If a subroutine table is too large for a compact operand field to index into it, the subroutines can be sorted so that the ones with the most callsites come first, and the losers who drew the long addresses can’t be invoked without a 2-byte calling sequence.

We can improve in density over Elisp bytecode to the extent that we can exclude K&R C’s generic variadic calling convention, where the caller knows how many parameters are passed but the callee doesn’t, and every function returns a value — because we don’t have to encode the number of arguments at the callsite, and we don’t have to explicitly discard void return values. A function header can specify how many arguments there are, automatically moving them from the operand stack into the function’s stack frame.

C is pretty thorough about having byte-oriented memory. IIRC you can do pointer arithmetic between nested struct members. You’re not supposed to do pointer arithmetic between different things in the same stack frame. On the other hand, you could very reasonably allocate word-sized local variables in a separate area, even if you take their addresses.

A sketch with a low-level bytecode

Let’s look at some examples of what bytecode-compiled subroutines might look like. My examples so far, with a strawman bytecode I made up as I went along, are respectively 3.7, 3.1, 7.5, 3.4, 2.4, and 2.3 bytes per source line of code. I think this is a compelling argument that 4 bytes per line of code is usually achievable and 3 bytes might be.

DrawTxScrap

Here’s a Pascal subroutine from MacPaint; C and this Pascal are almost identical.

{$S        }
PROCEDURE DrawTxScrap(dstRect: Rect);
VAR i:          INTEGER;
    myProcs:    QDProcs;
BEGIN
  KillMask;   { not enough room in heap for mask and font }
  ClipRect(dstRect);
  EraseRect(dstRect);
  InsetRect(dstRect,2,0);
  dstRect.right := Max(dstRect.right,dstRect.left + txMinWidth);
  dstRect.bottom := Max(dstRect.bottom,dstRect.top + txMinHeight);
  SetStdProcs(myProcs);
  thePort^.grafProcs := @myProcs;
  myProcs.textProc := @PatchText;
  i := CharWidth('A');  { swap in font }
  HLock(txScrapHndl);
  TextBox(txScrapHndl^,txScrapSize,dstRect,textJust);
  HUnlock(txScrapHndl);
  ClipRect(pageRect);
  thePort^.grafProcs := Nil;   { restore to normal }
END;

What would this ideally look like in bytecode?

This only has three local variables, but two of them are structs (records in Pascal). dstRect evidently has .top, .bottom, .left, and .right members, which are probably integers, and it’s evidently being modified by InsetRect, which I guess must take it as a var parameter. The @ operator, an Apple extension, takes the address of a variable like & in C, just as passing a var parameter does implicitly. thePort is a global variable imported from elsewhere; txMinHeight, txMinWidth, txScrapHndl, textJust, and pageRect are five of the 128 global variables in this file; and PatchText is a procedure defined just above.

The repeated use of the same txScrapHndl and thePort globals makes me think that maybe a function could have an associated vector of referenced global variable indices so that such repeated references only cost one byte.

Maybe in idealized bytecode assembly this would look something like this:

; This next line compiles to a procedure header structure with bit
; fields.  This procedure takes 8 bytes of “blob arguments” and
; has another 18 bytes of local blobs immediately after them; both
; of these are in the blob part of the activation record.  It also
; takes 1 word of word (register-sized) arguments, which go into
; the word part of the activation record.
PROCEDURE DrawTxScrap argblob=8 localblob=18 localwords=1
globals txScrapHndl, thePort
    call KillMask
    lea_blob 0       ; load address of offset 0 into stack frame blobs
    call ClipRect
    lea_blob 0
    call EraseRect
    ;      InsetRect(dstRect,2,0);
    lea_blob 0
    tinylit 2        ; push constant 2 (coded in a 3-bit field in the bytecode)
    tinylit 0
    call InsetRect
    ;       dstRect.right := Max(dstRect.right,dstRect.left + txMinWidth);
    loadword_blob 1  ; load word from stack frame blobs at offset 1 word (.right)
    loadword_blob 0  ; load .left
    loadglobal txMinWidth
    add
    max
    storeword_blob 1 ; store max result into .right
    ;       dstRect.bottom := Max(dstRect.bottom,dstRect.top + txMinHeight);
    loadword_blob 3  ; load .bottom
    loadword_blob 2  ; load .top
    loadglobal txMinHeight
    add
    max
    storeword_blob 3
    lea_blob 8          ; load @myProcs
    call SetStdProcs
    lea_blob 8          ; load @myProcs again
    loadglobal thePort
    storeword_offset 5  ; using thePort value from top of stack, offset 5 words and store result
    ;       myProcs.textProc := @PatchText;
    loadfuncptr PatchText
    lea_blob 8
    storeword_offset 3  ; let’s say .textProc is the fourth word of myProcs
    ;      i := CharWidth('A');  { swap in font } (oddly this return value i is not used, but let’s compile it anyway)
    lit8 65             ; 'A'
    call CharWidth
    storeword 0         ; store into i, in the stack frame word-sized variables (rather than blobs)
    loadglobal txScrapHndl
    call HLock
    ;      TextBox(txScrapHndl^,txScrapSize,dstRect,textJust);
    loadglobal txScrapHndl  ; not sure if this is a var parameter, I’ll assume so
    loadglobal_long txScrapSize  ; this variable is 32 bits
    lea_blob 0
    loadglobal textJust
    call TextBox
    loadglobal txScrapHndl
    call HUnlock
    lea_global pageRect     ; again, assuming this is a var parameter; if not we must memcpy
    call ClipRect
    ;      thePort^.grafProcs := Nil;   { restore to normal }
    tinylit 0
    loadglobal thePort
    storeword_offset 0
    ret

That’s 47 bytecode instructions, so 47 opcode bytes; how many operand bytes? All but 5 of them have immediate operands, but probably none of those operands need to be more than 1 byte, so we have at most 42 operand bytes. 10 of them are call instructions (or 12 if we count the max instances); there are 211 functions and procedures declared in this file, including EXTERNAL procedures, and each of them is only called once in this function, so plausibly those 10 call operations do need an operand byte to index into a global table of subroutines. Another 10 operands are global variables; of these half (thePort twice and txScrapHndl three times) are referenced more than once and could thus be usefully listed in the function’s global-references vector so they could be referenced in a single byte; the other 5 would require a separate operand byte. loadfuncptr requires a separate operand byte, too. Of the other 21 operands, most are small integers between -3 and 3 or between 0 and 6, so they could be packed into a 3-bit immediate field; the only exceptions are 8, 8, 8, and 65, so there would be 4 more bytes of operands, and 17 bytecodes with embedded operands.

So that’s 4 numeric operand bytes, 5 global-index operand bytes, 10 function-call operand bytes, and one loadfuncptr operand byte, for 20 operand bytes, and 47+20 = 67 bytes of bytecode. The procedure header is probably 2 bytes, the procedure’s address in the global subroutine table is another 2 bytes, and then you have two global-variable offset bytes, so all in all the subroutine is probably about 67 + 2 + 2 + 2 = 73 bytes. This corresponds to 20 non-comment non-blank source lines of Pascal code, or about 3.7 bytes per line, which is about 2.7 times the density of the original MacOS compiled program.

All this is assuming that there are few enough bytecode operations to fit into a single byte. The above code already uses call, lea_blob, tinylit, loadword_blob, loadglobal, add, max, storeword_blob, lit8, storeword, storeword_offset, and ret, which is 12, to 9 of which we are imputing this 3-bit operand field (let’s call these heavyweight opcodes “baryonic”); furthermore supersymmetry implies the existence of such undiscovered massive particles as storeglobal, loadbyte_blob, storebyte_blob, loadword_offset, storebyte_offset, and loadbyte_offset, which bring us to 15 of the 32 major-opcode slots filled, plus leptons such as subtract, multiply, divide, mod, bitand, bitor, xor. So I think it’s pretty plausible that we’ll have plenty of opcode space, but it’s something to watch.

(If we wanted the bytecodes to be printable ASCII things might get more difficult: from 32 to 128 we only have room for 11 baryons and 8 leptons, or 10 baryons and 16 leptons, with one of the baryons missing the DEL character. But that’s probably too strict a restriction; text in ISO-8859-1 (24 baryons instead of 12) or Windows CP1252 (most of 26, though maybe the A0 control characters would be best as leptons due to the holes) would be more reasonable, and conceivably UTF-8 would work well if operand bytes were trailing 10xx xxxx bytes. Cyrillic and Greek codepages have many more homoglyphs.)

It’s a little bit unclear how blob parameters are supposed to get passed here. Do we pass them in word-sized chunks on the operand stack, or is there a separate blob stack or something? If we assume that the bytecode interpreter is equipped to go look at the function header in memory when it’s interpreting a call, it might be reasonable to put the call bytecode before the arguments so that the interpreter can allocate the callee’s stack frame, allowing arguments to be poked directly into it in the appropriate places instead of needing to be copied there on subroutine entry.

strlcpy

Here’s strlcpy, originally from OpenBSD, but this version is the copy from avr-libc, for which SLOCCount reports 26 lines of code, condensed down to 14 lines for ease of reading:

size_t strlcpy (char *dst, const char *src, size_t siz) {
        register char *d = dst;
        register const char *s = src;
        register size_t n = siz;

        if (n != 0 && --n != 0) {
                do { if ((*d++ = *s++) == 0) break; } while (--n != 0);
        }

        if (n == 0) {
                if (siz != 0) *d = '\0';
                while (*s++)
                        ;
        }

        return(s - src - 1);
}

Unlike the straight-line Pascal code above, this has a bunch of control flow, seven conditional jumps. (Normally a while would also involve an unconditional jump, but in this case the body is empty.)

This has six local variables, all word-sized, but only five of them are live at once; dst passes the torch to d early on, which can be eliminated by the compiler.

If we try encoding it in the same bytecode as before with a fairly traditional compilation strategy, I think it looks like this:

PROCEDURE strlcpy argwords=3 localwords=2
    loadword 1       ; load argument 1, src
    storeword 3      ; store into word variable 3, s
    loadword 2       ; load siz (argument 2)
    storeword 5      ; n
    jz 5, 1f         ; jump to label 1 if word variable 5 is 0
    decrword 5       ; decrement word variable 5
    jz 5, 1f
2:  loadword 0       ; d, preparation for *d++
    incrword 0       ; d++
    loadbyte_offset 0 ; dereference pre-incremented pointer
    dup              ; this will be used in an assignment whose result value is used
    loadword 3       ; s
    incrword 3
    storebyte_offset 0
    jztos 1f
    decrword 5       ; --n
    jnz 5, 2b        ; repeat do loop if word variable 5 still isn’t 0
1:  jnz 5, 1f        ; if (n == 0)
    jz 2, 3f         ; if (siz != 0) using word variable 2
    tinylit 0        ; '\0'
    loadword 0       ; d
    storebyte_offset 0 ; *d =
3:  loadword 3       ; s
    incrword 3       ; ++
    loadbyte_offset 0
    jnztos 3b
1:  loadword 3       ; s
    loadword 1       ; src
    subtract
    tinylit 1
    subtract
    ret

That’s 32 bytecode instructions. 4 of these are zero-operand leptons (dup, subtract, subtract, ret), 7 are conditional jumps (5 with 2 arguments), and the other 21 are the same kind of one-operand one-byte operations that dominated DrawTxScrap. Maybe jnz 5, 2b gets encoded with 5 in the 3-bit immediate field, indicating that it’s looking at local register-sized variable 5, and jumping to label 2, looking backwards, if it is Not Zero. The byte offset to label 2 is encoded in a following operand byte, within the range ±127; if it’s -128 then two more operand bytes follow giving the real jump offset. jnztos jumps if the top of stack is 0 instead of if a local variable is zero; I think the way to do this is that if the 3-bit immediate field is 0 through 6, it tests that variable, but if it’s 7, it tests the top of stack (and pops it). By contrast, in the other baryonic operations, I was thinking 7 would indicate that the immediate parameter is in the following byte, but if you want to test a local variable that’s higher than 6, then you could just loadword it and then jztos.

So, in addition to demonstrating the previously speculative loadbyte_offset and storebyte_offset operations, this imposes new baryonic opcodes on us: incrword, decrword, jz, jnz, and probably js, jns, and jmp, bringing the total from 15 to 22 out of 32.

So with the above-suggested encodings, we have 7 jump-offset argument bytes, 2 bytes of procedure header, and 2 bytes of global subroutine table entry, for a total of 32+7+2+2 = 43 bytes. That’s 1.7 bytes of bytecode per SLOCCount line or 3.1 bytes per physical non-blank line in the above.

You could very plausibly squeeze this down a bit more. The 8086’s LOOP instruction provides the decrword/jnz functionality, its LODSB instruction provides the loadword/incrword/loadbyte_offset functionality, and its STOSB instruction provides loadword/incrword/storebyte_offset. And of course the final expression being returned admits all sorts of optimizations, including a decrement-top-of-stack operation or an increment or decrement to one of the variables being subtracted.

In an alternative direction, note that this code has 7 jump instructions, comprising 14 bytes, but only 4 labels; following Henry Baker’s COMFY-65, if we could use a 1-byte instruction to set the destination to jump to on failure or success, maybe we’d only need 4 bytes of jump destinations instead of 7.

proc_get_cam_register_3

Let’s look at something a little less nice and pretty: proc_get_cam_register_3 from linux-2.6/drivers/staging/rtl8192e/rtl_debug.c:

static int proc_get_cam_register_3(char *page, char **start,
                          off_t offset, int count,
                          int *eof, void *data)
{
        struct net_device *dev = data;
        u32 target_command = 0;
        u32 target_content = 0;
        u8 entry_i = 0;
        u32 ulStatus;
        int len = 0;
        int i = 100, j = 0;

        /* This dump the current register page */
        len += snprintf(page + len, count - len,
                        "\n#################### SECURITY CAM (22-31) ######"
                        "############\n ");
        for (j = 22; j < TOTAL_CAM_ENTRY; j++) {
                len += snprintf(page + len, count - len, "\nD:  %2x > ", j);
                for (entry_i = 0; entry_i < CAM_CONTENT_COUNT; entry_i++) {
                        target_command = entry_i + CAM_CONTENT_COUNT * j;
                        target_command = target_command | BIT31;

                        while ((i--) >= 0) {
                                ulStatus = read_nic_dword(dev, RWCAM);
                                if (ulStatus & BIT31)
                                        continue;
                                else
                                        break;
                        }
                        write_nic_dword(dev, RWCAM, target_command);
                        target_content = read_nic_dword(dev, RCAMO);
                        len += snprintf(page + len, count - len, "%8.8x ",
                                        target_content);
                }
        }

        len += snprintf(page + len, count - len, "\n");
        *eof = 1;
        return len;
}

This is 36 lines of code, which I think is maybe a little too much for a sketch, so I’ll try just the first half of it, plus enough cleanup to get it to compile, which brings it to 22 lines:

static int proc_get_cam_register_3(char *page, char **start,
                          off_t offset, int count,
                          int *eof, void *data)
{
        struct net_device *dev = data;
        u32 target_command = 0;
        u32 target_content = 0;
        u8 entry_i = 0;
        u32 ulStatus;
        int len = 0;
        int i = 100, j = 0;
        len += snprintf(page + len, count - len,
                        "\n#################### SECURITY CAM (22-31) ######"
                        "############\n ");
        for (j = 22; j < TOTAL_CAM_ENTRY; j++) {
                len += snprintf(page + len, count - len, "\nD:  %2x > ", j);
                for (entry_i = 0; entry_i < CAM_CONTENT_COUNT; entry_i++) {
                        target_command = entry_i + CAM_CONTENT_COUNT * j;
                }
        }
        return len;
}

Let’s say our C is a 16-bit-pointer platform, like Arduino, so the u32 items all go into blobland instead of wordland. As before, I’ll elide the copy from data to dev. And I’ll assume that the interpreter by default initializes all local variables to zero, which is a reasonable thing for a C implementation to do.

PROCEDURE proc_get_cam_register_3 argwords=6 localwords=4 localblob=16
const "\n#################### SECURITY CAM (22-31) ##################\n "
const "\nD:  %2x > "
    lit8 100
    storeword 8  ; i.  8 doesn’t fit in the 3-bit immediate field, so this takes a separate operand byte
    loadword 0   ; page *
    loadword 7   ; len
    add
    loadword 3   ; count *
    loadword 7
    subtract
    loadconst 0  ; the string *
    call snprintf
    loadword 7
    add          ; len +=
    storeword 7
    ; for (j = 22; j < TOTAL_CAM_ENTRY; j++) {
    lit8 22
    storeword 9
1:  loadword 9
    lit8 32      ; TOTAL_CAM_ENTRY
    subtract
    jstos 1f     ; if j - TOTAL_CAM_ENTRY is nonnegative, exit loop
    incrword 9
    loadword 0   ; *
    loadword 7
    add
    loadword 3   ; *
    loadword 7
    subtract
    loadconst 1  ; the other string *
    call snprintf
    loadword 7
    add
    storeword 7
    tinylit 0    ; *
    storebyte_blob 0  ; entry_i = 0.  In the blob because u8 *
2:  loadbyte_blob 0   ; *
    lit8 8            ; CAM_CONTENT_COUNT
    subtract
    jstos 1f
    loadbyte_blob 0   ; *
    tinylit 1         ; *
    add
    storebyte_blob 0  ; *
    ; target_command = entry_i + CAM_CONTENT_COUNT * j;
    loadbyte_blob 0   ; entry_i *
    lit8 8            ; CAM_CONTENT_COUNT
    loadword 9        ; j
    multiply
    add
    storelong_blob 4  ; target_command (blob offset 4; entry_i is at 0) *
    jmp 2b
1:  loadword 7
    ret

This is a mess! But a workable mess. 49 bytecode instructions. Of these, 11 have no operands at all; another 14, marked with * above, have operands that can be packed into a 3-bit field; the remaining 24 each need an operand byte. 73 bytes of bytecode! But by itself that would be pretty okay for 21 lines of code (3.8 bytes per line). What really kills us here is the 64-byte literal string and the other 12-byte literal string, plus a constant vector (probably two 16-bit pointers). All that, plus the 2-byte procedure header and 2-byte entry in the global subroutine table, adds up to 73 + 64 + 12 + 4 + 2 + 2 = 157 bytes. That’s 7.5 bytes per line of code! Those two string literals doubled the space this truncated subroutine needs.

I think that if I were to add in the missing 15 lines of code, this would get slightly less ugly, maybe another 64 bytes, which would get the total down to about 4.4 bytes per line of code. But it would be easy for someone to write code that really does have that many quoted # signs in it.

Of course, this particular code probably has no business running on a microcontroller, even if I hadn’t snipped out the read_nic_dword call that does the real work; it’s part of a Linux device driver for a Wi-Fi card that I think requires a PCI bus to operate at all. But I think it’s a good representative of workaday C code.

File_pipe2file

This is from php5-5.4.4/ext/fileinfo/libmagic/compress.c:

protected int
file_pipe2file(struct magic_set *ms, int fd, const void *startbuf,
    size_t nbytes)
{
        char buf[4096];
        ssize_t r;
        int tfd;
#ifdef HAVE_MKSTEMP
        int te;
#endif

        (void)strlcpy(buf, "/tmp/file.XXXXXX", sizeof buf);
#ifndef HAVE_MKSTEMP
        {
                char *ptr = mktemp(buf);
                tfd = open(ptr, O_RDWR|O_TRUNC|O_EXCL|O_CREAT, 0600);
                r = errno;
                (void)unlink(ptr);
                errno = r;
        }
#else
        tfd = mkstemp(buf);
        te = errno;
        (void)unlink(buf);
        errno = te;
#endif
        if (tfd == -1) {
                file_error(ms, errno,
                    "cannot create temporary file for pipe copy");
                return -1;
        }

It... goes on for another 36 lines from there; that’s 30 lines. It’s not super plausible that we could fit the PHP interpreter into an Arduino, but we could surely fit it into an Ambiq Apollo3. The protected suggests that this is not actually C at all but C++, but it’s pretty close. Let’s see what this would look like in the strawman bytecode assembly. Let’s imagine we do HAVE_MKSTEMP.

PROCEDURE file_pipe2file argwords=4 localwords=3 localblob=4096
const "/tmp/file.XXXXXX"
const "cannot create temporary file for pipe copy"
const 4096
globals errno
    lea_blob 0     ; buf
    loadconst 0    ; filename template
    loadconst 2    ; 4096, sizeof buf
    call strlcpy
    drop           ; discard result, because of course that makes sense when calling strlcpy
    ; #ifndef HAVE_MKSTEMP drops the next N lines
    lea_blob 0     ; tfd = mkstemp(buf); 
    call mkstemp
    storeword 6
    loadglobal errno ; te = errno
    storeword 7
    lea_blob 0
    call unlink
    drop
    loadword 7       ; errno = te
    storeglobal errno
    loadword 6
    tinylit -1
    subtract
    jnz 1f
    loadword 1
    loadglobal errno
    loadconst 1
    call file_error
    tinylit -1
    ret
1:

So that’s 25 bytecodes, of which the four calls and the two references to te require an operand byte. The others can all reasonably be 1 byte each, so we have 31 bytes of bytecode, plus 43 + 17 = 60 bytes of literal strings, 6 bytes of constant table, 1 byte of global vector, 2 bytes of procedure header, and 2 bytes of global subroutine table entry, 31 + 60 + 6 + 1 + 2 + 2 = 102 bytes, nearly ⅔ in those two stupid strings. Still, that’s 102 bytes for 30 lines of code: 3.4 bytes per line. But only because 12 of those 30 lines were discarded by the preprocessor!

Like most of the PHP interpreter, this is a good example of really pretty shitty C code that people nevertheless want to run, and run correctly.

Insertion sort

Here’s an excerpt from the latest .c file I wrote in my dev3 directory, which does an insertion sort on an array of ints (obviously a programming exercise):

static inline void
swap (int *x, int *y)
{
  int tmp = *x;
  *x = *y;
  *y = tmp;
}

void
isort(int *a, size_t n)
{
  for (size_t i = 1; i < n; i++) {
    for (size_t j = i; j > 0; j--) {
      if (a[j-1] > a[j]) swap(&a[j], &a[j-1]);
    }
  }
}

This counts as 16 lines of source code, although that’s pretty generous! Let’s suppose our compiler does in fact inline swap and cancel out the resulting *&:

void
isort(int *a, size_t n)
{
  for (size_t i = 1; i < n; i++) {
    for (size_t j = i; j > 0; j--) {
      if (a[j-1] > a[j]) {
         int tmp = a[j];
         a[j] = a[j-1];
         a[j-1] = tmp;
      }
    }
  }
}

This is the first example we’ve seen that does real pointer arithmetic, the kind where you have to multiply by the size of the pointer. Using just the strawman bytecode above and no CSE I think it looks something like this:

PROCEDURE isort argwords=2 localwords=3
    tinylit 1
    storeword 2    ; i
1:  loadword 2
    loadword 1     ; n
    subtract
    jstos 2f       ; if i - n is nonnegative, exit loop
    loadword 2
    storeword 3    ; j
3:  jz 3, 4f       ; if j == 0, exit loop
    loadword 0     ; a for a[j-1]
    loadword 3     ; j-1
    tinylit 1
    subtract
    tinylit 2      ; sizeof int
    multiply
    add
    loadword_offset 0
    loadword 0      ; a[j]
    loadword 3
    tinylit 2
    multiply
    add
    loadword_offset 0
    subtract          ; if >
    js 5f
    loadword 0        ; tmp = a[j]
    loadword 3
    tinylit 2
    multiply
    add
    loadword_offset 0
    storeword 4       ; tmp
    loadword 0        ; a[j] = a[j-1]
    loadword 3
    tinylit 1
    subtract
    tinylit 2
    multiply
    add
    loadword_offset 0
    loadword 0
    loadword 3
    tinylit 2
    multiply
    add
    storeword_offset 0
    loadword 4        ; a[j-1] = tmp
    loadword 0
    loadword 3
    tinylit 1
    subtract
    tinylit 2
    multiply
    add
    storeword_offset 0
5:  decrword 3     ; j--
    jmp 3b
4:  incrword 2
    jmp 1b
2:  ret

This is 61 bytecode instructions, which I think is pretty horrific. 5 of them are jumps which take an operand byte, so 66 bytecode bytes in all, plus the usual 4 bytes of per-subroutine overhead, for 70 bytes. That’s 4.4 bytes per line of code.

However, there are a couple of directions to go to improve this. One is that, despite the examples above, neither a[j-1] nor for (...; i < expr; i++) is actually at all unusual in C. We could support the first one with leptonic bytecodes to decrement top of stack and to do base+index addressing:

    loadword 0       ; a[
    loadword 4       ; j
    decrtos          ; -1
    loadword_indexed ; ]

You’d also have storeword_indexed, loadbyte_indexed, storebyte_indexed, leaword_indexed, and maybe leabyte_indexed, though that last one is really just add.

And we could support the loop with a baryonic bytecode similar to the 8086 LOOP instruction mentioned earlier, but for up-counting loops; if at the bottom of the loop:

    loadword 1       ; n
    countup 2, 1b    ; i < n? if so increment and jump back to label 1

For the full C semantics, which loops zero times when n is 0, you’d need to initialize the counter to one less than the initial value and then unconditionally jump to that loop trailer; 5 bytes per loop in total. Alternatively you could reverse the sense of the conditional jump, put it at the top of the loop, and put the unconditional jump at the end of the loop.
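In C terms, that bottom-test transformation looks like the following sketch (assuming countup means “increment, then jump back if still below the limit”; the iteration counter here just stands in for the loop body):

```c
#include <assert.h>

/* Sketch: "for (i = start; i < n; i++) body;" compiled with the test at
   the bottom of the loop.  The counter starts at start - 1 and we jump
   straight to the loop trailer, so the body runs zero times when
   start >= n, matching C semantics. */
int count_iterations(int start, int n)
{
    int iterations = 0;
    int i = start - 1;
    goto trailer;
top:
    iterations++;        /* the loop body goes here */
trailer:
    i++;                 /* the countup instruction: increment... */
    if (i < n) goto top; /* ...and jump back while i < n */
    return iterations;
}
```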

The Forth way to handle counting loops is different; it stores the loop counters, limits, and increments on the return stack instead. This would require more analysis from the compiler (it has to verify that the loop limit is a constant expression and that the loop counter is only read from, insert an UNLOOP instruction at breaks) but it would allow you to write the outer loop with two new leptonic instructions as follows:

    tinylit 1        ; start value
    loadword 1       ; n, loop limit
    fortoloop        ; stash loop counters and loop address on the loop stack
    ...
    continue         ; update loop counter, maybe jump to address after fortoloop

And the inner loop as follows:

    i                ; get outer loop counter as start value
    tinylit 0        ; loop limit
    tinylit -1       ; step
    fortosteploop
    ...
    continue

These are respectively 4 bytes (32 bits) and 5 bytes (40 bits), which are significantly less than the BASIC Stamp’s 57 bits despite using a byte-oriented encoding.
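As a sketch of how the interpreter might model that loop stack (the struct layout and function names here are my invention, not Forth’s actual implementation):

```c
#include <assert.h>

/* Loop counters, limits, and steps live on a separate loop stack, as in
   Forth.  fortosteploop pushes a frame; "i" reads the innermost counter;
   "continue" steps the counter and says whether to jump back to the body. */
struct loopframe { long counter, limit, step; };

static struct loopframe loopstack[8];
static int loopdepth;

static void fortosteploop(long start, long limit, long step)
{
    struct loopframe f = { start, limit, step };
    loopstack[loopdepth++] = f;
}

static void fortoloop(long start, long limit)  /* common case: step +1 */
{
    fortosteploop(start, limit, 1);
}

static long i_(void)  /* the "i" instruction */
{
    return loopstack[loopdepth - 1].counter;
}

static int continue_(void)  /* nonzero: jump back to the loop body */
{
    struct loopframe *f = &loopstack[loopdepth - 1];
    f->counter += f->step;
    if (f->step > 0 ? f->counter < f->limit : f->counter > f->limit)
        return 1;
    loopdepth--;  /* loop finished; pop it (what UNLOOP would do) */
    return 0;
}
```

Note that, as in the bytecode sketches above, the body runs before the first test, so this models a loop that always executes at least once.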

Instead of having Forth-like i, j, and k instructions, you could maybe stick the loop counter in a regular word-sized local variable, using a baryonic for-loop instruction. But suppose we go with the leptonic approach above:

PROCEDURE isort argwords=2 localwords=1
    tinylit 1
    loadword 1
    fortoloop
    i               ; outer loop counter, currently topmost
    tinylit 0
    tinylit -1
    fortosteploop
    loadword 0
    i               ; inner loop counter, called j in the code, now topmost
    decrtos
    loadword_indexed
    loadword 0
    i
    loadword_indexed
    subtract        ; a[j-1] - a[j]
    js 1f           ; if a[j-1] < a[j], skip body
    loadword 0      ; a[j] → tmp
    i             
    loadword_indexed
    storeword 2
    loadword 0      ; a[j-1] → a[j]
    i
    decrtos
    loadword 0
    i
    storeword_indexed
    loadword 2      ; tmp → a[j-1]
    loadword 0
    i
    decrtos
    storeword_indexed
1:  continue
    continue
    ret

That gets us down to 34 bytecode instructions and 35 bytes of bytecode, 39 bytes in all, 2.4 bytes per line of code.

Finally, you could actually do common subexpression elimination and stack-allocate the value of tmp:

PROCEDURE isort argwords=2 localwords=2
    tinylit 1
    loadword 1
    fortoloop
    i
    tinylit 0
    tinylit -1
    fortosteploop
    loadword 0      ; &a[j-1]
    i
    decrtos
    leaword_indexed
    storeword 2
    loadword 0
    i
    leaword_indexed
    storeword 3     ; &a[j]
    loadword 2
    loadword_offset 0
    loadword 3
    loadword_offset 0
    subtract
    js 1f           ; if a[j-1] < a[j], skip the swap
    loadword 3      ; swap
    loadword_offset 0
    loadword 2
    loadword_offset 0
    loadword 3
    storeword_offset 0
    loadword 2
    storeword_offset 0
1:  continue
    continue
    ret

That gets it down to 33 bytecode instructions, 34 bytes of bytecode, 38 bytes in all, still 2.4 bytes per line. It’s surprising, though, how little space those complicated compiler optimizations actually bought us: one measly byte!

However, we did need some complicated compiler optimizations to use the above version.

XXX what if we use the dumber loop operations? how about if we don’t inline?

Rob Kendrick mentions that DSP loop instructions are often implemented as Intercal’s “COMEFROM”, which seems like another approach that might be worth investigating — a for-loop instruction that encodes the length of the loop, so that instead of

    tinylit 1
    loadword 1
    fortoloop
    ...
    continue
    ret

you would write

    tinylit 1
    loadword 1
    fortoloop 2f
    ...
2:  ret

so that interpreting each instruction would require first checking to see if you had hit the address of the end of the innermost for-loop. This would improve compression over having a continue bytecode only if the loop extent could be packed into the fortoloop byte, which limits us to fairly short loops.
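Here’s a sketch of what that check might look like in the interpreter’s dispatch loop (the data structures are invented for illustration):

```c
#include <assert.h>

/* COMEFROM-style loops: fortoloop records the end address of the loop,
   and the dispatch loop compares pc against the innermost loop's end
   address before fetching each instruction. */
struct comefrom_loop { int counter, limit, body, end; };

static struct comefrom_loop cfloops[8];
static int ncfloops;

/* Called before each fetch; returns the pc to actually fetch from. */
static int check_loop_end(int pc)
{
    while (ncfloops > 0 && pc == cfloops[ncfloops - 1].end) {
        struct comefrom_loop *l = &cfloops[ncfloops - 1];
        if (++l->counter < l->limit)
            return l->body;  /* back to the top of the loop body */
        ncfloops--;          /* done; an outer loop might end here too */
    }
    return pc;
}
```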

Also, it might be worthwhile to have a special short form for for (int i = 0; i < x; i++) where x and the loop end are the only varying parameters, so the starting index doesn’t eat up a bytecode. Here’s a random sampling of 16 for loops in C codebases I have handy, divided into loops that fit that pattern and loops that don’t:

9base_6.orig/rc/trap.c: while(ntrap) for(i = 0;i!=NSIG;i++) while(trap[i]){
cmusphinx/multisphinx/sphinxbase/feat.c:        for (i = 0; i < start_pad; ++i)
emacs24_24.5+1.orig/src/coding.c:  for (reg = 0; reg < 4; reg++)
exopc/lib/libtermlib/mkinfo.c:    for (i = 0; i < numcaps; i++)
flashlight-firmware/ToyKeeper/crescendo/crescendo.c:    for(i=0;i<32;i++) {
gmp-5.0.5+dfsg/mpn/generic/mu_divappr_q.c:      for (i = 0; i <= err_rec; i++)
linux-2.6/drivers/watchdog/sbc8360.c:           for (i = 0; i != count; i++) {
linux-2.6/sound/soc/soc-dapm.c:     for (i = 0; i < ARRAY_SIZE(dapm_up_seq); i++)
puzzles-20200513.c9b3c38/bridges.c:        for (x = 0; x < params->w; x++) {

exopc/bin/less/charset.c:   for (p = charsets;  p->name != NULL;  p++)
exopc/bin/less/opttbl.c:    for (o = option;  o->oletter != '\0';  o++)
exopc/bin/sh/show.c:    for (p = arg->narg.text ; *p ; p++) {
exopc/sys/ubb/dependency.c: for(e = l_first(a->out); e; e = l_next(e, out_next))
gawk_4.2.1+dfsg.orig/cint_array.c:  for (i = NHAT; i < INT32_BIT; i++) {
linux-2.6/drivers/atm/eni.c:            for (*pre = 3; *pre >= 0; (*pre)--)
maru-2.4/eval.c:    for (;;) {

9 of these, the majority, though barely, fit that pattern, more or less. Saving a tinylit 0 byte on half of all for loops seems like it would be worthwhile. There’s substantial diversity in the termination condition — <, <=, != — though in many cases <= err_rec is equivalent to < err_rec + 1 or != err_rec + 1.

Anduril config_state_base

Let’s look at some BudgetLightForum flashlight firmware, because Anduril now compiles to over 8 KiB and so won't fit in some of the smaller AVRs. Here’s a random 27-line function from flashlight-firmware/ToyKeeper/spaghetti-monster/anduril/anduril.c, with internal comments removed:

// ask the user for a sequence of numbers, then save them and return to caller
uint8_t config_state_base(Event event, uint16_t arg,
                          uint8_t num_config_steps,
                          void (*savefunc)()) {
    static uint8_t config_step;
    if (event == EV_enter_state) {
        config_step = 0;
        set_level(0);
        return MISCHIEF_MANAGED;
    }
    else if (event == EV_tick) {
        if (config_step < num_config_steps) {
            push_state(number_entry_state, config_step + 1);
        }
        else {
            savefunc();
            save_config();
            pop_state();
        }
        return MISCHIEF_MANAGED;
    }
    else if (event == EV_reenter_state) {
        config_state_values[config_step] = number_entry_value;
        config_step ++;
        return MISCHIEF_MANAGED;
    }
    return EVENT_HANDLED;
}

Event is uint8_t. config_step is a “global variable” that no other functions use. EV_enter_state is (B_SYSTEM|0b00001000). B_SYSTEM is just 0, so EV_enter_state is just 8. MISCHIEF_MANAGED is EVENT_HANDLED which is 0. EV_tick is (B_SYSTEM|0b00000001) or 1. EV_reenter_state is (B_SYSTEM|0b00001010), or 10. config_state_values is an array of 3 uint8_t. number_entry_value is a volatile global uint8_t. So maybe that function would look something like this:

; event and num_config_steps are two argument blob bytes; arg and
; savefunc are two word args.
PROCEDURE config_state_base argwords=2 argblob=2
globals config_step
    loadbyte_blob 0     ; event
    lit8 8              ; EV_enter_state
    subtract
    jnz 1f              ; if ==
    tinylit 0
    storeglobal_byte config_step
    tinylit 0
    call set_level
    tinylit 0
    ret
1:  loadbyte_blob 0
    tinylit 1           ; EV_tick
    subtract
    jnz 3f
    loadglobal_byte config_step
    loadbyte_blob 1     ; num_config_steps
    subtract
    jns 1f              ; if <
    loadfuncptr number_entry_state
    loadglobal_byte config_step
    incrtos
    call push_state
    jmp 2f
1:  call savefunc
    call save_config
    call pop_state
2:  tinylit 0
    ret
3:  loadbyte_blob 0
    lit8 10             ; EV_reenter_state
    subtract
    jnz 1f
    loadglobal_byte number_entry_value  ; config_state_values[config_step] = number_entry_value
    leaglobal config_state_values
    loadglobal_byte config_step
    add
    storebyte_offset 0  ; byte store: config_state_values is a uint8_t array
    loadglobal_byte config_step
    incrtos
    storeglobal_byte config_step
    tinylit 0           ; return MISCHIEF_MANAGED
    ret
1:  tinylit 0           ; return EVENT_HANDLED
    ret

That’s 44 bytecode instructions. 5 of these are calls, 2 are lit8s, 5 are jumps, and 1 is loadfuncptr, and those 13 instructions will need a separate operand byte. I think all the others can be single-byte instructions, so this would be 57 bytes of bytecode. There would additionally be 2 bytes of procedure header, 2 bytes of subroutine table entry, and 1 or 2 bytes of global vector (to enable the function to refer to config_step with a 3-bit immediate index field), for a total of 63 bytes, which is about 2.3 bytes per line of code.

(There’s questions here of how this kind of thing affects the instruction set, and whether there might be opcode space contention, but I feel like there was enough headroom previously that I shouldn’t worry about that.)

I asked what this function looks like in AVR machine code; solrize generously provided the following assembler output listing for a slightly different version of the code, which I’ve edited down for readability:

 4121                   config_state_base:
 4134 0fc0 8830                 cpi r24,lo8(8)                    ;     config_step = 0;
 4135 0fc2 01F4                 brne .L348
 4138 0fc4 1092 0000            sts config_step.2502,__zero_reg__ ;     set_level(0);
 4141 0fc8 80E0                 ldi r24,0
 4142 0fca 0E94 0000            call set_level
 4145 0fce 8091 0000            lds r24,button_last_state         ;         config_step ++;
 4146 0fd2 8111                 cpse r24,__zero_reg__
 4147 0fd4 00C0                 rjmp .L350
 4150 0fd6 81E0                 ldi r24,lo8(1)
 4151 0fd8 8093 0000            sts config_step.2502,r24          ;         push_state(number_entry_state, 0);
 4152 0fdc 00C0                 rjmp .L362
 4154 0fde 982F         .L348:  mov r25,r24
 4155 0fe0 907F                 andi r25,lo8(-16)
 4158 0fe2 903B                 cpi r25,lo8(-80)                  ;     if (config_step <= num_config_steps) {
 4159 0fe4 01F4                 brne .L351
 4162 0fe6 2091 0000            lds r18,config_step.2502          ;         if (2 == (arg % (TICKS_PER_SECOND*3/2))) {
 4163 0fea 4217                 cp r20,r18
 4164 0fec 00F0                 brlo .L352
 4167 0fee CB01                 movw r24,r22                      ;             config_step ++;
 4168 0ff0 6DE5                 ldi r22,lo8(93)
 4169 0ff2 70E0                 ldi r23,0
 4170 0ff4 0E94 0000            call __udivmodhi4
 4171 0ff8 0297                 sbiw r24,2
 4172 0ffa 01F4                 brne .L353
 4175 0ffc 2F5F                 subi r18,lo8(-(1))
 4176 0ffe 2093 0000            sts config_step.2502,r18
 4179 1002 4217                 cp r20,r18                        ;                 set_level(RAMP_SIZE * 3 / 8);
 4180 1004 00F0                 brlo .L350
 4183 1006 88E3                 ldi r24,lo8(56)                   ;         }
 4184 1008 00C0                 rjmp .L360
 4188 100a 82E1         .L353:  ldi r24,lo8(18)                   ;         }
 4189 100c 00C0                 rjmp .L360
 4193 100e 80E0         .L352:  ldi r24,0                         ;     }
 4195 1010 0E94 0000    .L360:  call set_level
 4196 1014 00C0                 rjmp .L350
 4200 1016 903E         .L351:  cpi r25,lo8(-32)
 4201 1018 01F4                 brne .L355
 4204 101a 8091 0000            lds r24,config_step.2502          ;         push_state(number_entry_state, 0);
 4205 101e 8823                 tst r24
 4206 1020 01F0                 breq .L361
 4209 1022 4817                 cp r20,r24                        ;         push_state(number_entry_state, 0);
 4210 1024 00F0                 brlo .L361
 4214 1026 60E0         .L362:  ldi r22,0                         ;     }
 4215 1028 70E0                 ldi r23,0
 4216 102a 80E0                 ldi r24,lo8(gs(number_entry_state))
 4217 102c 90E0                 ldi r25,hi8(gs(number_entry_state))
 4218 102e 0E94 0000            call push_state
 4219 1032 00C0                 rjmp .L350
 4223 1034 8A30         .L355:  cpi r24,lo8(10)
 4224 1036 01F4                 brne .L350
 4227 1038 6091 0000            lds r22,number_entry_value
 4228 103c 8091 0000            lds r24,config_step.2502
 4229 1040 F901                 movw r30,r18
 4230 1042 0995                 icall
 4233 1044 0E94 0000            call save_config                  ;     pop_state();
 4237 1048 0E94 0000    .L361:  call pop_state                    ; }
 4241 104c 80E0         .L350:  ldi r24,0
 4242 104e 0895                 ret

That’s 58 instructions, and with the 16-bit immediate address arguments for instructions like sts and call, it turns out to be 144 bytes of AVR machine code, 5.3 bytes per line of code, 2.3 times the size of the strawman bytecode above.

The Anduril repo is fairly large; it draws from the 1703 SLOCCount lines of code in its parent directory and contains 2814 SLOCCount lines itself. But the majority of these lines are #ifdeffed out in most configurations or are declarations or preprocessor directives. A typical configuration might have 1200 logical lines of code (with semicolons) that actually make it through the preprocessor, a quarter of which are declarations. But that still compiles to over 8 KiB of AVR code. So, although I could be mistaken, I tentatively think that this sort of bytecode trick could be useful even for existing embedded projects that have gone to substantial trouble to fit into the limitations of tiny microcontrollers.

A sketch with an AST bytecode

The above design corresponds pretty closely to machine operations, although in some cases a single bytecode stands for several of them. But, for example, loadbyte_blob 0 pushes a byte onto the stack from the beginning of blob space.

A disadvantage of this is that, in a way, the type of every variable is encoded redundantly all over the program. Every time config_step is loaded or stored, the loadglobal_byte or storeglobal_byte instruction repeats the fact that it’s a byte. This is fast but uses up a lot of the 32 possible baryonic opcodes: you need the cross product of {lea, load, store}, {global, local}, and {word, byte}, which is 12, and at some point you need to handle longs too, which I’ve given short shrift to above, and maybe long longs, at which point you’d be at 24. (The load and store instructions that use computed pointers on the operand stack could be leptonic, although in the above sketch they contain a rarely-used offset in their operand field, but the global and local instructions need an operand field to specify which global or local variable they’re referring to. Potentially you could supply the long (or short, on a 32-bit system) and long long versions only in leptonic form so they always take a pointer from the stack.)

But you could imagine storing the fact that config_step is a byte just once, in the config_step object, and just having loadglobal consult that information. Moreover, you could even eliminate the distinct types of storeword_offset etc. if the pointers on the stack carried a data type with them (which could be overridden by an explicit cast operation).

In a sense, your virtual machine ends up being dynamically typed, purely in order to reduce the number of distinct opcodes; it mirrors the structure of the C source code, where type information is confined to variable declarations rather than replicated across all uses of the variables. Aside from potentially making more efficient use of opcode space (and thus getting a more compact bytecode) such dynamic typing might be beneficial in two other ways: it might be friendlier for interactive use, and it might allow library functions to be polymorphic, so you could just invoke a linear-search function rather than writing a linear-search loop.
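A sketch of what that might look like in the interpreter (the descriptor layout and names are my invention): the type lives once in the global’s descriptor, and a single pair of load/store handlers dispatches on it.

```c
#include <assert.h>
#include <stdint.h>

/* One loadglobal/storeglobal pair serves all widths by consulting a
   per-variable type descriptor, instead of burning opcode space on
   {load,store} x {byte,word,long} variants. */
enum vtype { T_BYTE, T_WORD, T_LONG };

struct global_desc { enum vtype type; void *addr; };

static long loadglobal(const struct global_desc *g)
{
    switch (g->type) {
    case T_BYTE: return *(uint8_t *)g->addr;
    case T_WORD: return *(uint16_t *)g->addr;
    default:     return *(uint32_t *)g->addr;
    }
}

static void storeglobal(const struct global_desc *g, long v)
{
    switch (g->type) {
    case T_BYTE: *(uint8_t *)g->addr = (uint8_t)v; break;
    case T_WORD: *(uint16_t *)g->addr = (uint16_t)v; break;
    default:     *(uint32_t *)g->addr = (uint32_t)v; break;
    }
}
```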

Self took this approach further; because its bytecode was intended for JIT-compiling rather than direct interpretation, it was very nearly just an abstract syntax tree of the Self program. Could this provide a more compact representation? Consider my isort example above:

void
isort(int *a, size_t n)
{
  for (size_t i = 1; i < n; i++) {
    for (size_t j = i; j > 0; j--) {
      if (a[j-1] > a[j]) {
         int tmp = a[j];
         a[j] = a[j-1];
         a[j-1] = tmp;
      }
    }
  }
}

As an S-expression, its abstract syntax tree might look like this:

(function isort ((pointer int) unsigned) ; argument types, words 0 and 1
                (unsigned unsigned int)  ; local types, words 2, 3, 4
   (forto (word 2) (const 1) (word 1)    ; i is word 2, n is word 1
      (fortostep (word 3) (word 2) (const 0) (const -1)
         (if (> (aref (word 0) (- (word 3) (const 1))) (aref (word 0) (word 3)))
            (progn
                (setf (word 4) (aref (word 0) (word 3)))
                (setf (aref (word 0) (word 3)) (aref (word 0) (- (word 3) (const 1))))
                (setf (aref (word 0) (- (word 3) (const 1))) (word 4)))))))

Most of the node types here have a fixed arity; the exceptions are progn and the function arguments and locals. So if we wanted to build up this tree in RPN, we could use a setf operation that pops two expressions off the stack and pushes a setf node, and so on, but for progn we’d need some kind of marker. A dumb way of writing this down might be:

isort:
    [        ; begin list
    int
    pointer  ; make pointer of int
    unsigned
    ]        ; end argument type list
    [        ; begin local variable type list
    unsigned
    unsigned
    int
    ]
    word 2   ; (for) i =
    const 1  ; 1 to
    word 1   ; n
    word 3   ; (for) j =
    word 2   ; i to
    const 0  ; 0 step
    const -1 ; -1
    word 0   ; a[
    word 3
    const 1
    -
    aref     ; ]
    word 0   ; a[
    word 3
    aref     ; ]
    >
    [        ; progn
    word 4   ; tmp =
    word 0   ; a[
    word 3   ; j
    aref     ; ]
    setf
    word 0
    word 3
    aref
    word 0
    word 3
    const 1
    -
    aref
    setf
    word 0
    word 3
    const 1
    -
    aref
    word 4
    setf
    ]progn
    if
    fortostep
    forto

By my count, this is 52 AST-node building operations, each quite plausibly encodable in a byte; const 1 corresponds rather closely to tinylit 1, word 3 to loadword 3, etc. It’s pretty hard to read, as stack-machine code often is, because the context that gives meaning to the for-loop stuff at the beginning isn't provided until half a page later. The type-building operations could conceivably use a separate encoding from the expression-building operations like aref and -, which could conceivably use a separate encoding from the statement-building operations like if, forto, and ]progn, but that would probably also require a REBOL-like or LOGO-like non-reverse Polish notation approach. That might look like this. The . token represents a byte that terminates a variable-length thing like a progn; word.2 represents a byte with 2 packed into the low 3 bits.

function isort pointer int unsigned .
               unsigned unsigned int .
   forto word.2 const.1 word.1
      fortostep word.3 word.2 const.0 const.-1
         if > aref word.0 - word.3 const.1  aref word.0 word.3
            progn
                setf word.4 aref word.0 word.3
                setf aref word.0 word.3  aref word.0 - word.3 const.1
                setf aref word.0 - word.3 const.1  word.4
            .

(I’m not suggesting that writing in this format would be a good way to program, just trying to write out the contents of the sketched bytecode in a way that’s easy to run through wc and easy to see if I left something out of it.)

These 52 tokens are almost the same tokens as before, but in a different order. My earlier lower-level sketch of this using fortoloop and fortosteploop and no CSE was more compact for the following reasons:

So it was 34 bytes, and 34 + 8 + 2 + 3 + 3 - 1 = 51, which I think is everything but the “function isort” at the beginning.

If we apply the applicable improvements from the above, we get:

function isort pointer int unsigned .
               int .
   forto const.1 word.1
      fortostep i const.0 const.-1
         if > aref word.0 1- i  aref word.0 i
            progn
                setword.4         aref word.0 i
                aset word.0 i     aref word.0 1- i
                aset word.0 1- i  word.4
            .

This gets us down to, say, 40 bytes, which is still worse than the low-level bytecode version.

It’s a little annoying that aref needs to be explicit; in Forth you can define and use an array-creating word as follows:

: array create cells allot does> swap cells + ;  ok
5 array myarray  ok
57 0 myarray !  68 1 myarray !  ok
0 myarray ?  1 myarray ? 57 68  ok

Here we have defined a 5-entry array called myarray and stored 57 in its entry 0 and 68 in its entry 1. No explicit aref is needed, but this being Forth, an explicit ? or @ is needed to fetch from it in rvalue context anyway! But we could imagine avoiding that in an AST format designed for C, somehow.

However, aside from being (it seems) bulkier than the low-level approach, this structure seems like it would be much more difficult to interpret.
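The tree-building half, at least, is simple; a sketch in C (the node layout and names are my invention): fixed-arity operators pop their children from a stack of finished subtrees, and ]progn collects statements back to the [ marker.

```c
#include <assert.h>
#include <stdlib.h>

/* RPN AST building: fixed-arity ops pop children and push a node; a
   NULL marker on the build stack delimits a variable-arity progn. */
struct ast {
    const char *op;
    int nkids;
    struct ast *kids[8];  /* capped at 8 children for this sketch */
};

static struct ast *build_stack[64];
static int build_sp;

static struct ast *mknode(const char *op, int nkids)
{
    struct ast *n = calloc(1, sizeof *n);
    n->op = op;
    n->nkids = nkids;
    for (int k = nkids; k-- > 0;)
        n->kids[k] = build_stack[--build_sp];  /* pushed left to right */
    build_stack[build_sp++] = n;
    return n;
}

static void begin_progn(void)  /* the "[" operation */
{
    build_stack[build_sp++] = NULL;
}

static struct ast *end_progn(void)  /* the "]progn" operation */
{
    int base = build_sp;
    while (build_stack[base - 1] != NULL)
        base--;
    struct ast *n = calloc(1, sizeof *n);
    n->op = "progn";
    n->nkids = build_sp - base;
    for (int k = 0; k < n->nkids; k++)
        n->kids[k] = build_stack[base + k];
    build_sp = base - 1;  /* also drop the marker */
    build_stack[build_sp++] = n;
    return n;
}
```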

Notes on Tanenbaum 01978

Tanenbaum and his students tackled this idea in 01978, with the idea of implementing the interpreter as microcode and making their computer faster and cheaper. Since they couldn’t measure how fast unimplemented microcode was, they settled for measuring how small their programs were, which was under the streetlight and is vaguely related. The paper is sometimes cited as inspiring RISC, due to its emphasis on measuring the frequencies of real operations in real programs, but the actual “EM-1” instruction set they arrived at is about as far from RISC as it is possible to be.

Where RISCs have many general-purpose registers, EM-1 has none, not even a single accumulator, instead using an operand stack (which is also the call stack and storage for globals). Where RISCs have a single instruction width, EM-1 has instructions of 1, 2, 3, and 4 bytes. Where RISC instructions typically have 3 operand fields, EM-1 instructions have 0 or 1. Where RISCs have carefully laid out bit fields to minimize instruction decoding latency, EM-1 declares, “There is no need to have distinct “opcode” and “address” bits.” Where RISCs are, like nearly all CPUs, untyped, and strive to keep all instructions constant-time so that they can be implemented without microcode, EM-1 has hardware support for bounds-checked Numpy-like arbitrary-dimensional array descriptors and for “accessing intermediate lexicographical levels in block structured languages”, which latter seems to be defined handwavily and buggily, and involves a single load or store instruction following an arbitrarily long linked list through RAM.
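That lexical-level mechanism amounts to chasing static links; in C it might look like this sketch (the frame layout is my invention):

```c
#include <assert.h>

/* The EM-1's "intermediate lexicographical levels": each frame points
   at the frame of its lexically enclosing procedure, so a load from n
   levels out follows n pointers through RAM -- unbounded work for a
   single load instruction. */
struct frame {
    struct frame *static_link;  /* frame of the enclosing procedure */
    int locals[4];
};

static int load_outer(const struct frame *fp, int levels_up, int slot)
{
    while (levels_up-- > 0)
        fp = fp->static_link;
    return fp->locals[slot];
}
```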

However, the EM-1 is RISC-like in that it’s roughly a load-store machine: ALU operations operate strictly on the stack, not even including the add-immediate instruction, which is included even in the very frugal RV32E, except that the EM-1 has 14 increment instructions, including one for TOS. And, like the EM-1, the Berkeley RISC-I and -II (and their demon offspring SPARC) are designed around reducing the cost of procedure call and return.

The strawman bytecode outlined earlier is remarkably similar to the EM-1! The biggest differences are:

The EM-1’s instruction set consists of the following:

I seem to have omitted 10 opcode bytes somewhere. Unfortunately they didn’t include an opcode table.

Discussion of records (structs) is, bizarrely, completely lacking; in the EM-1 design it seems to be impossible to heap-allocate linked list nodes. Perhaps multidimensional arrays were the only data-structuring mechanism contemplated, but 01978 is about 10 years too late for such an omission. Even arrays were accessed via “descriptors” in the stack frame, suggesting that the EM-1’s stack frames were considered to be homogeneous vectors of machine words.

There are lots of good ideas in this paper. They mention that 95% of FOR loops have steps of +1 or -1, so maybe it’s best to support those two with special for-loop instructions, and relegate other steps to compilation as while loops. Their conditional branches combine testing with jumping, like RISC-V, for example, and they propose an assembler that sorts local variables by number of references.
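That is, only ±1 steps would get a for-loop instruction; a loop like for (i = 0; i < n; i += 3) would just be compiled the way a while loop is, as in this sketch:

```c
#include <assert.h>

/* Sketch: a FOR loop whose step isn't +1 or -1 gets no special loop
   instruction; the compiler lowers it to the equivalent while loop. */
int sum_by_threes(int n)
{
    /* for (i = 0; i < n; i += 3) sum += i; */
    int sum = 0, i = 0;
    while (i < n) {
        sum += i;
        i += 3;
    }
    return sum;
}
```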

Topics