I needed to figure out what ID type to use for atlas9. Initially, I reached for UUIDv7, which seemed like the obvious choice: it's the new version of UUID that incorporates years of experience and wishes. It's sortable, which seems nice, and probably plays nicely with databases –  I'm guessing that if your UUID is a primary key, new IDs will be inserted closer to each other, rather than spread randomly over the index, which probably helps with caching and pages and that kind of stuff.

But then I started to experiment with ltree. ltree doesn't support hyphens in labels, so I would need to convert/remove them somehow. It wouldn't be terrible if the ltree path didn't match the IDs exactly, but it's not ideal either. I started wondering if maybe, for consistency, I should just remove hyphens from all my IDs, that way IDs in the database, logs, etc always look the same. That was the slippery slope that sent me way down this rabbit hole.

So, if I wanted consistent ID strings, I could just remove the hyphens. UUIDs are 128-bit (16 byte) numbers. In their string form, they're represented by 36 characters (including hyphens, 32 without). Postgres has a uuid type that can store the 16 byte version, which saves some space per ID compared to the string form.

I started to wonder if there was a more compact string form. I came across some suggestions to use base58 or base32 or other bases. Base58 encodes a UUID into 21 characters. I discovered Crockford's Base32, which is interesting because it's URL-friendly and also tries to be human-friendly by excluding characters that might look the same like l and 1 –  "Be pronounceable. Humans should be able to accurately transmit the symbols to other humans using a telephone.". I don't think anyone is reading these IDs character by character (especially over the phone), but it's a fun find nonetheless.

Along the way I found other ID specifications like ULID. These have similar goals: sortable, compact, URL-friendly. Some of them have had bugs – for example, some were designed to be case-sensitive, which created issues with sorting in Postgres due to collation settings.

I also dug into the implementation of UUIDv7 in the google/uuid and the UUIDv7 spec itself, and I learned some interesting things:

  • they generate random data in blocks, perhaps because generating random data with crypto/rand is expensive enough to make it worth it
  • the UUID format contains fields that specify the version and variant
  • the time precision (because UUIDv7 is time-sortable) is variable: the spec defines at least 1 millisecond resolution, but implementations can extend this. The google/uuid library uses fractional milliseconds: getV7Time returns the time in milliseconds and nanoseconds / 256, and has some extra logic to ensure IDs are monotonic.
  • the crypto.rand.Read function never returns an error and always fills the buffer entirely (docs)

The UUIDv7 layout:

/*
		 0                   1                   2                   3
		 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
		+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
		|                           unix_ts_ms                          |
		+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
		|          unix_ts_ms           |  ver  |  rand_a (12 bit seq)  |
		+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
		|var|                        rand_b                             |
		+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
		|                            rand_b                             |
		+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
	*/

I spent a few hours going down this rabbit hole, learning all the things people have considered when designing these ID formats. At some point, I started writing my own implementation into atlas9. Initially, I just wanted to avoid the dependency on google/uuid – the code is fairly small and straightforward, so avoiding an external dependency seemed reasonable.

As I implemented ID generation, I realized I don't really need some of these features: monotonicity, introspectable IDs (version and variant).

I don't need IDs to be monotonic. Libraries put in extra effort to ensure that IDs generated by the same instance are monotonic (always increasing). Maybe I'm missing something, but I don't see the point. You most likely have multiple instances of a service spread across multiple servers and there's no guarantee they are monotonic, so I don't quite understand why it's important that IDs are strictly monotonic. I suppose if you want to rely on IDs for ordering, and you don't want to pass along a high-resolution timestamp, and you're ok with relying on system clocks for ordering guarantees (which is bold, perhaps).

So anyway, I decided to keep the code dead simple and drop monotonicity. I also dropped the bits used for variant and version from the UUID spec, because I don't need IDs to be introspected like UUID does. That provides a few more bits for random data (or could be traded for higher resolution timestamp). The current implementation uses 1 millisecond resolution (just a unix timestamp). IDs are encoded in Crockford's Base32. The IDs encode to a 26 character string.

Initially, I stored the ID as a slice of bytes (type ID []byte), but it seemed silly to be constantly encoding and decoding that to and from a string (which happened all over the codebase: APIs, database, access checks, logging, etc), so I ended up just storing it as a type ID string.

I'm a bit sad that I lost the compact, 16-byte representation in Postgres. I did look into the CREATE DOMAIN and CREATE TYPE statements in Postgres – seems like you can define your own types with tons of detail – but it started to feel like too much, and was over my head, and also very Postgres-specific. Fascinating features though!

I'm not super worried about compact byte size anyway – with 1 million rows, it's what like 10 megabytes? Probably more with indexes and such, but :shrug:, if that many bytes are truly a problem, maybe UUIDs are the wrong choice anyway (int64 would be used maybe).

So, after all that, I'm not certain what I made is better, but I think it will work.

Sometimes engineering is like driving to the grocery store – you've been there hundreds of times, you know exactly where you're going and what's needed. Sometimes, like this time, it's more like Lewis and Clark – venturing out west somewhere, with a vague destination, looking for lots of help from others along the way, until you end up somewhere and it's all ok.

Here's the full implementation of generation (encoding is implemented elsewhere):

> NewID()
06E2MKE0YP14ENMYC5X2Q3E34C

func NewID() ID {
	var data [16]byte
	t := time.Now().UnixMilli()

	data[0] = byte(t >> 40)
	data[1] = byte(t >> 32)
	data[2] = byte(t >> 24)
	data[3] = byte(t >> 16)
	data[4] = byte(t >> 8)
	data[5] = byte(t)

	poolMtx.Lock()
	if poolPos == poolSize {
		// https://pkg.go.dev/crypto/rand#Read
		// Read fills b with cryptographically secure random bytes.
		// It never returns an error, and always fills b entirely.
		// Read calls io.ReadFull on Reader and crashes the program irrecoverably
		// if an error is returned. The default Reader uses operating system APIs
		// that are documented to never return an error on all but legacy Linux systems.
		_, err := rand.Read(pool[:])
		if err != nil {
			poolMtx.Unlock()
			panic(err)
		}
		poolPos = 0
	}
	copy(data[6:], pool[poolPos:poolPos+10])
	poolPos += 10
	poolMtx.Unlock()
	return ID{str: encodeBase32(data)}
}