Part 1 of 2; here’s where I set out the premise; in part 2 I’ll pester you for help and involvement
Q: what's faster than doing {anything}?
A: not doing it
I spend a lot of time playing with raw data, whether it is serialization protocols (protobuf-net), materialization (dapper) or network api protocols (redis, web-sockets, etc). I don't claim to be the world's gift to binary, but I like to think I can "hold my own" in this area. And one thing that I've learned over and over again is: at the application level, sure: do what you want - you're not going to upset the runtime / etc - but at the library level (in these areas, and others) you often need to do quite a bit of work to minimize CPU and memory (allocation / collection) overhead.
In the real world, we have competing needs. We want to process large amounts of data, but we don't want to pay the price. What to do?
Get me a bowl of Cap’n Proto…
A little while ago I became aware of Cap'n Proto. You might suspect from the name that it is tangentially related to Protocol Buffers (protobuf), and you would be right: it is the brain-child of Kenton Varda, one of the main originators of protobuf. So at this point you're probably thinking "another serialization protocol? boooring", but stick with me - it gets more interesting! And not just because it describes itself as a “cerealization protocol” (badum tish?). Rather than simply describing a serialization format, Cap'n Proto instead describes a general way of mapping structured data to raw memory. And here's the fun bit: in a way that is also suitable for live objects.

Compare and contrast:
- load (or stream) the source data through memory, applying complex formatting rules, constructing a tree (or graph) of managed objects, before handing the root object back to the caller
- load the source data into memory, and... that's it
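To make the contrast concrete, here's a minimal sketch (in Python, using the `struct` module; the `ParsedPoint`/`LazyPoint` names and the two-field layout are purely illustrative, not anything from Cap'n Proto itself). The first class does the traditional thing - parse everything up front into a new object; the second just wraps the raw buffer and decodes a field only when you actually read it:

```python
import struct

# A toy record: a little-endian int32 followed by a float64, no padding
raw = struct.pack("<id", 42, 3.14)

class ParsedPoint:
    """Traditional approach: eagerly decode every field at load time."""
    def __init__(self, buf):
        self.id, self.value = struct.unpack_from("<id", buf)

class LazyPoint:
    """Zero-copy approach: keep the buffer; decode fields only on access."""
    def __init__(self, buf):
        self._buf = buf

    @property
    def id(self):
        return struct.unpack_from("<i", self._buf, 0)[0]

    @property
    def value(self):
        return struct.unpack_from("<d", self._buf, 4)[0]
```

With one small record the difference is academic, but scale it to millions of records - or to records you never actually touch - and "load the data into memory, and... that's it" starts to look very attractive.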
So how does that work?
Basically, it describes a “message” of data as a series of one or more “segments” (which can be of different sizes), with the objects densely packed inside a segment. Each object is split into a number of “data words” (this is where your ints, bools, floats, etc, get stored), and a number of “pointers”, where “pointer” here just means a few bits to describe the nature of the data, and either a relative offset to another chunk in the same segment, or an absolute offset into a different segment. Visually:

We can then make an object model that sits on top of that, where each object just has a reference to the segment, an absolute offset to the first word of our data, and a few bits to describe the shape, and we’re done. This format works both on disk and for live interactive objects, so the disk format is exactly the in-memory format. Zero processing.

Note that there’s a lot of up-front processing that defines how an object with 2 bools, an int, a float, a string, and another bool (that was added in a later revision of your schema) should be laid out in the data words, but ultimately that is a solved problem.
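To give a feel for the shape of that (and only the shape - real Cap'n Proto pointers pack kind bits, offsets and sizes into the word in a specific encoding that this deliberately doesn't reproduce), here's a toy sketch in Python: one segment as a flat buffer of 8-byte words, an object with one data word and one pointer word, and a view type that reads fields straight out of the segment with no parsing step:

```python
import struct

# A "segment" is just a flat buffer of 8-byte words
segment = bytearray(4 * 8)  # four words

# Word 0: the object's data word - an int32 and a float32 packed together
struct.pack_into("<if", segment, 0, 123, 1.5)

# Word 1: the object's "pointer" word - here simplified to just a relative
# word offset to the pointed-to data (real pointers also carry type/size bits)
struct.pack_into("<q", segment, 8, 1)  # target starts 1 word past the pointer

# Words 2-3: the pointed-to data (a short string; length hard-coded for brevity)
segment[16:16 + 5] = b"hello"

class View:
    """An object model over the raw segment: just a segment reference
    plus an offset; fields are decoded on demand, never copied out."""
    def __init__(self, seg, offset):
        self.seg, self.offset = seg, offset

    @property
    def number(self):
        # read the int32 out of the first data word
        return struct.unpack_from("<i", self.seg, self.offset)[0]

    @property
    def text(self):
        # follow the pointer word: relative offset is in whole words
        ptr_pos = self.offset + 8
        rel = struct.unpack_from("<q", self.seg, ptr_pos)[0]
        start = ptr_pos + rel * 8
        return bytes(self.seg[start:start + 5]).decode()
```

The point of the exercise: `View` never builds a tree of managed objects. Hand it the bytes you read from disk or the wire, and those bytes *are* the object.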