Code, code and more code.: November 2014

Friday, 14 November 2014

Episode 2; a cry for help

Part 2 of 2; in part 1 I talked about Cap’n Proto; here’s where you get to be a hero!

In my last post, I talked about the fun I've been having starting a Cap'n Proto spike for .NET/C#. Well, here's the thing: I don't scale. I already have an absurd number of OSS projects needing my time, plus a full time job, plus a family, including a child with "special needs", plus I help run my local school (no, really - I'm legally accountable for the education of 240 small people; damn!), and a number of speaking engagements (see you at London NDC or WCDC ?).

I don't have infinite time. What I don’t need is yet another project where I’m the primary contributor, struggling to give it anything like the time it deserves. So here's me asking if anybody in my reach wants to get involved and help me take it from a barely-usable shell, into a first rate tool. Don't get me wrong: the important word in "barely usable" is "usable", but it could be so much better.

Remind me: why do you care about this?

Let’s summarize the key points that make Cap’n Proto interesting:

next to zero processing
next to zero allocations (even for a rich object model)
possibility to exploit OS-level concepts like memory mapped files (actually, this is already implemented) for high performance
multi-platform
version tolerant
interesting concepts like “unions” for overlapping data

Although if I’m honest, pure geek curiosity was enough for me. The simple “nobody else has done it” was my lure.

So what needs doing?

While the core is usable, there’s a whole pile of stuff that could be done to make it into a much more useful tool. Conveniently, a lot of these are fairly independent, and could be sliced off very easily if a few other people wanted to get involved in an area of particular interest. But; here’s the my high-level ideas:

Schema parsing: this is currently a major PITA, since the current "capnp" tool only really works / compiles for Linux. There is a plan in the core Cap'n Proto project to get this working on MinGW (for Windows), but it would be nice to have a full .NET parser - I'm thinking "Irony"-based, although I'm not precious about the implementation
Offset processing: related to schema parsing, we need to compute the byte offsets of a parsed schema. This is another job that the "capnp" tool does currently. Basically, the idea is to look at all of the defined fields, and assign byte offsets to each, taking into account some fairly complicated "group" and "union" rules
Code generation: I have some working code-gen, but it is crude and "least viable". It feels like this could be done much better. In particular I'm thinking "devenv" tooling, whether that means T4 or some kind of VS plugin, ideally trivially deployed (NuGet or similar) - so some experience making Visual Studio behave would be great. Of course, it would be great if it worked on Mono too – I don’t know what that means for choices like T4.
Code-first: “schemas? we don't need no stinking schemas!”; here I mean the work to build a schema model from pre-existing types, presumably via attributes – or perhaps combining an unattributed POCO model with a regular schema
POCO serializer: the existing proof-of-concept works via generated types; however, building on the code-first work, it is entirely feasible to write a "regular" serializer that talks in terms of some pre-existing POCO model, but uses the library api to read/write the objects per the wire format
RPC: yes, Cap'n Proto defines an RPC stack. No I haven't even looked at it. It would be great if somebody did, though
Packed encoding: the specification defines an alternative "packed" representation for data, that requires some extra processing during load/save, but removes some redundant data; not currently implemented
Testing: I'm the worst possible person to test my own code – too “close” to it. I should note that I have a suite of tests related to my actual needs that aren't in then open source repo (I’ll try and migrate many of them), but: more would be great
Multi-platform projects: for example, an iOS / Windows Store version probably needs to use less (well, zero) of the “unsafe” code (mostly there for efficiency); does it compile / run on Mono? I don’t know.
Proper performance testing; I'm casually happy with it, but some dedicated love would be great
Much more compatibility testing against the other implementations
Documentation; yeah, telling people how to use it helps
And probably lots more stuff I'm forgetting

Easy budget planning

Conveniently, I have a budget that can be represented using no bits of data! See how well we're doing already? I can offer nothing except my thanks, and the satisfaction of having fun hacking at some fun stuff (caveat: for small values of "fun"), and maybe some community visibility if it takes off. Which it might not. I'm more than happy to open up commit access to the repo (I can always revert :p) - although I'd rather keep more control over NuGet (more risk to innocents).
So... anyone interested? Is my offer of work for zero pay and limited reward appealing? Maybe you’re looking to get involved in OSS, but haven’t found an interesting project. Maybe you work for a company that has an interest in being able to process large amounts of data with minimum overheads.

Besides, it could be fun ;p

If you think you want to get involved, drop me an email (marc.gravell@gmail.com) and we'll see if we can get things going!

Efficiency: the work you don’t do…

Part 1 of 2; here’s where I talk about a premise; in part 2 I pester you for help and involvement

Q: what's faster than doing {anything}?
A: not doing it
I spend a lot of time playing with raw data, whether it is serialization protocols (protobuf-net), materialization (dapper) or network api protocols (redis, web-sockets, etc). I don't claim to be the world's gift to binary, but I like to think I can "hold my own" in this area. And one thing that I've learned over and over again is: at the application level, sure: do what you want - you're not going to upset the runtime / etc - but at the library level (in these areas, and others) you often need to do quite a bit of work to minimize CPU and memory (allocation / collection) overhead.
In the real world, we have competing needs. We want to process large amounts of data, but we don't want to pay the price. What to do?

Get me a bowl of Cap’n Proto…

A little while ago I became aware of Cap'n Proto. You might suspect from the name that it is tangentially related to Protocol Buffers (protobuf), and you would be right: it it the brain-child of Kenton Varda, one of the main originators of protobuf. So at this point you're probably thinking "another serialization protocol? boooring", but stick with me - it gets more interesting! And not just because it describes itself as a “cerealization protocol” (badum tish?). Rather than simply describing a serialization format, Cap'n Proto instead describes a general way of mapping structured data to raw memory. And here's the fun bit: in a way that is also suitable for live objects.
Compare and contrast:

load (or stream) the source data through memory, applying complex formatting rules, constructing a tree (or graph) of managed objects, before handing the root object back to the caller
load the source data into memory, and... that's it

Cap'n Proto is the second of these; it is designed to be fully usable (read and write) in the raw form, so if you can load the data, your work is done.

So how does that work?

Basically, it describes a “message” of data as a series of one or more “segments” (which can be of different sizes), with the objects densely packed inside a segment. Each object is split into a number of “data words” (this is where your ints, bools, floats, etc, get stored), and a number of “pointers”, where “pointer” here just means a few bits to describe the nature of the data, and either a relative offset to another chunk in the same segment, or an absolute offset into a different segment. Visually:

We can then make an object model that sits on top of that, where each object just has a reference to the segment, an absolute offset to the first word of our data, and a few bits to describe the shape, and we’re done. This format works both on disk and for live interactive objects, so the disk format is exactly the in-memory format. Zero processing. Note that there’s a lot of up-front processing that defines how an object with 2 bools, an int, a float, a string, and another bool (that was added in a later revision of your schema) should be laid out in the data words, but ultimately that is a solved problem.

Nothing is free

Of course everything has a price; the cost in this case is that traversing from object to object involves parsing the pointer offsets, but this is all heavily optimized - it just means the actual data is one level abstracted. Indeed, most of the work in shimming between the two can be done via basic bitwise operations (“and”, “shift”, etc). Plus of course, you only pay this miniscule cost for data you actually touch. Ignoring data is effectively entirely free. So all we need to do is load the data into memory. Now consider: many modern OSes offer "memory mapped files" to act as an OS-optimized way of treating a flat file as raw memory. This holds some very interesting possibilities for insanely efficient processing of arbitrarily large files.

I couldn't stay away

In my defence, I did have a genuine business need, but it was something I'd been thinking about for a while; so I took the plunge and started a .NET / C# implementation of Cap'n Proto, paying lots of attention to effeciency - and in particular allocations (the "pointers" to the underlying data are implemented as value types, with the schema-generated entities also existing as value types that simply surround a "pointer", making the defined members more conveniently accessible). This means that not only is there no real processing involved in parsing the data, but there are also no allocations - and no allocations means nothing to collect. Essentially, we can use the “message” as a localized memory manager, with no .NET managed reference-types for our data – just structs that refer to the formatted data inside the “message” – and then dispose of the entire “message” in one go, rather than having lots of object references. Consider also that while a segment could be a simple byte[] (or perhaps ulong[]), it could also be implemented using entirely unmanaged memory, avoiding the “Large Object Heap” entirely (we kinda need to use unmanaged memory if we want to talk to memory mapped files). A bit “inner platform”, but I think we’ll survive.

Sounds interesting! How do I use it?

Well, that’s the gotcha; the core is stable, but - I have lots more to do, and this is where I need your help. I think there's lots of potential here for a genuinely powerful and versatile piece of .NET tooling. In my next blog entry, I'm going to be describing what the gaps are and inviting community involvement.