UMC Bytecode Format

In UMC we have not only “unlimited” registers, but also “unlimited” register types. These need to be represented in a fixed amount of space, which requires a variable-length encoding of instructions.

Constraints:

Unlimited registers and register types
Minimum 64 instruction opcodes, ideally more or extensible

Register-based bytecodes are generally faster than stack-based ones, but spend significantly longer on “instruction-fetch”. So we want to use an efficient encoding to mitigate this, especially since we have unbounded size.

Register Types

Unsigned Int (+ width)
Signed Int (+ width)
Float (+ width)
Memory Address
Instruction Address

Plus vector types with a “length” This could fit in 4 bits, first bit as a flag for a vector. However we might also want a flag for a constant, at which point we’d need 8 bits (a full byte).

Basic/Naive encoding

Example of 1.5 byte type and 1 byte index

u32:1 | 0001 00010000 00000001 | 2.5 bytes
        ^^^^ ^^^^^^^^ ^^^^^^^^
    unsigned   |      |
              32      index 1

Example of vectorised register

u32x8:1 | 1001 00010000 0100 00000001
          ^^^^ ^^^^^^^^ ^^^^ ^^^^^^^^
 unsigned vector  |      |      |
                  32     8    index

Example of a full u32-u32 instruction (8.5 bytes):

add     | 00000010 <- add opcode
u32:1,  | 0001 00010000 00000001
u32:2,  | 0001 00010000 00000010
u32:3   | 0001 00010000 00000011

Some instructions have only one valid register type, or a constant, so they don’t necessarily need a type:

jmp | 00001000 <- jump opcode
n:0 | 00000000 <- jmp to register 0 value

jmp    | 00001000 <- Jump-constant instruction ideally?
.LABEL | 00001000

A lot of instructions are likely to be “uniform”, using the same types (for the registers that aren’t forced), so we could consider an encoding like (5.5 bytes):

add     | 00000010 <- add opcode
u32     | 0001 00010000
:1      | 0000 0001
:2      | 0000 0010
:3      | 0000 0011

However, using a full byte just to store the 32/64-part of u32/u64 is pretty wasteful, especially considering they are probably the most common type. Unfortunately the full range of u1-u32 is allowed, so we can’t simply store log2(width).

Huffman encoding of register types

Ultimately, the most common register types will be used massively more than the less common types.

I propose a Huffman encoding of types (with their widths / lengths!).

In the following example, a program uses 32-bit registers most often, plus some others used in UMC.

 entry    | register
--------------------
0000 0000 | u32
0000 0001 | i32
0000 0010 | u64
0000 0011 | i64
0000 0100 | f32
0000 0101 | f64
0000 0111 | u1 (used as a boolean in UMC)
0000 1001 | n (Instruction Address)
0000 1011 | m (Memory Address)
0000 1111 | u32x8 (example of vector registers, probably only a handful used)
... a few more application specific, maybe u128/i128, u8 etc.

0000 0000 0000 0001 | If necessary, for rare types, a 2-byte or 1.5-byte encoding can be used

Instruction and Memory address registers will only really be used in add/sub for offsetting, so won’t be too high, but should still fit in the table.

It might also be a good idea to ‘fix’ the first few entries of the table for types we know will be used, like u32, u64, i32, i64, u1, n and m to reduce how much binaries change and make things like combining two umc binary files easier.

Instructions now for non-uniform types (7 bytes):

add     | 0000 0010 <- add opcode
u32x8:1 | 0000 1111 0000 0001
u32x8:2 | 0000 1111 0000 0010
u32:2   | 0000 0000 0000 0010

Covers any 3 operand types with a range of 256 different registers.

Only 5 bytes needed for a uniform instruction

add     | 0100 0010 <- add opcode
u32     | 0000 0000 <- u32 index
:1      | 0000 0001
:2      | 0000 0010
:3      | 0000 0011

Re-indexing with the most common as lower numbers can also be used to ensure the most common register indexes are packed tightly, without changing the result of the program. 1 byte is probably the minimum for a register index. Another advantage is that registers can be stored in an implementation-dependent way without decoding penalty, and we can just store a list of register operands.

Can reserve one or two bits in the instruction opcode for 2-byte encoding where rare types or high indices are used. Leading 1 on operand type could indicate a inline constant

Huffman encoding registers

We could take this even further for a much greater space saving, by encoding the exact registers (type + index).

   entry  | register
---------------------
0000 0000 | u32:0
0000 0001 | u32:1
0000 0010 | u32:2
0000 0011 | u32:3
0000 0100 | u64:0
0000 0101 | u64:1
0000 0110 | i32:0
0000 0111 | i32:1
0000 1000 | i32:2
0000 1001 | f32:0
0000 1010 | f32:1
0000 1011 | u32x8:0

This causes an 4-byte encoding of:

add    | 0000 0010 <- add opcode
u32:2  | 0000 0010
u32:1  | 0000 0001
u32:0  | 0000 0000

The cost of non-uniform encodings is no longer an issue (Still 4 bytes):

add   | 0000 0010 <- add opcode
u64:0 | 0000 0100
u32:1 | 0000 0001
u32:2 | 0000 0010

However this would require a decently large table at the start of the file, and decoding logic would also be more complex, if you wanted to store values of the same type contiguously.

Keyboard shortcuts

Universal Machine Code Docs

UMC Bytecode Format

Register Types

Basic/Naive encoding

Huffman encoding of register types

Huffman encoding registers