Wednesday 17 May 2017

protobuf-net: large data, and the future

protobuf-net was born into a different world

On Jul 17, 2008 I pushed the first commits of protobuf-net. It is easy to forget, but back then, most machines had access to a lot less memory than they do today, with x86 still being a common choice, meaning that 2GB user space (or maybe a little more if you fancied fighting with /3GB+LAA) was a hard upper limit. In reality, your usable memory was much less. Processors were much less powerful - user desktops were doing well if their single core had hyper-threading support (dual and quad cores existed, but were much rarer).

Thanks for the 2GB memories

It is in this context that protobuf-net was born, and in which many of the early design decisions were made. Although to be fair, even Google (who designed the thing) suggested an upper bound in the low hundreds of MB. Here's the original author (Kenton Varda) saying on Stack Overflow that 10MB is "pushing it" - although he does also note that 1GB works, but that 2GB is a hard limit.

protobuf-net took these limitations on board, and many aspects of the code could only work inside these borders. In particular, one of the key design questions in protobuf-net was how, when serializing general purpose objects, to handle the length prefix.

protobuf strings

Protobuf is actually a relatively simple binary format; it has few primitives, one of which is the length-prefixed string (where "string" means "arbitrary payload", not just text). The encoding of this is a variable length "varint" that tells it how many bytes are involved, then that many bytes of the payload:

[field x, "string"]
[n, 1-10 bytes]
[payload, n bytes]
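
To make that layout concrete, here is a minimal C# sketch of writing one length-prefixed field. The names (`WireSketch`, `WriteVarint`, `WriteLengthPrefixed`) are invented for illustration and are not protobuf-net's API. The field header is the field number shifted left by 3 bits, combined with wire type 2 ("length-delimited" in the protobuf encoding).

```csharp
using System;
using System.IO;

public static class WireSketch
{
    // Writes an unsigned value as a varint: 7 bits per byte, low bits first,
    // with the high bit set on every byte except the last.
    public static void WriteVarint(Stream dest, ulong value)
    {
        while (value >= 0x80)
        {
            dest.WriteByte((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        dest.WriteByte((byte)value);
    }

    // Writes [field header][length][payload] for a length-prefixed field.
    // Note that the payload length must be known before the payload is written.
    public static void WriteLengthPrefixed(Stream dest, int fieldNumber, byte[] payload)
    {
        const uint WireTypeString = 2; // length-delimited
        WriteVarint(dest, ((ulong)(uint)fieldNumber << 3) | WireTypeString);
        WriteVarint(dest, (ulong)payload.Length);
        dest.Write(payload, 0, payload.Length);
    }
}
```

For example, field 1 with the two-byte ASCII payload "hi" comes out as the bytes 0A 02 68 69.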

The requirement to know the length in advance is fine for the Google implementation - as I understand it, the "builder" approach means that the length is calculated when the "builder" creates the actual object, which is long before serialization happens (note: I'm happy to be corrected here if I've misunderstood). But protobuf-net doesn't work with "builder" types; it works against general every-day POCOs - usually written without any DSL schema ("code-first"). We can't rely on any construction-time calculations. So: how to write the length?

Essentially, there are two ways of doing this:

  • serialize the data first (perhaps hoping that the length prefix will fit in a single byte, and leaving a space for it); when you've finished serializing, you know the length - so now backfill that into the original space, which might mean nudging the data over a bit if the prefix took more space than expected
  • compute the actual required length, write the prefix, then serialize the data

Both have advantages and disadvantages. The first requires us to buffer all the data in the payload (you can't flush something that you might need to update later), and might require us to move a lot of data. The second requires us to do more thinking without actually writing anything - which might mean doing a lot of work twice.

At the current time, protobuf-net chooses the first approach. For quite a lot of small leaf types, this doesn't actually mean much more than backfilling a single byte of length data, but it becomes progressively more expensive as the payload size increases.
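
As a rough illustration of that first approach (and emphatically not protobuf-net's actual writer), the sketch below reserves one byte for the prefix, serializes the payload into an in-memory buffer, then backfills the length, widening the gap if one byte turns out not to be enough. `BackfillSketch`, `VarintSize` and `WriteWithBackfill` are hypothetical names, and a `List<byte>` stands in for the real buffer.

```csharp
using System;
using System.Collections.Generic;

public static class BackfillSketch
{
    // Number of bytes a varint needs for the given value (7 bits per byte).
    public static int VarintSize(ulong value)
    {
        int size = 1;
        while (value >= 0x80) { value >>= 7; size++; }
        return size;
    }

    // Appends a length-prefixed payload without knowing the length up front.
    public static void WriteWithBackfill(List<byte> buffer, Action<List<byte>> writePayload)
    {
        int prefixAt = buffer.Count;
        buffer.Add(0);             // optimistic: reserve a single byte for the length
        writePayload(buffer);      // serialize; nothing can be flushed yet
        ulong length = (ulong)(buffer.Count - prefixAt - 1);

        int extra = VarintSize(length) - 1;
        if (extra > 0)
        {
            // The prefix did not fit in one byte: nudge the whole payload over.
            buffer.InsertRange(prefixAt + 1, new byte[extra]);
        }

        // Backfill the varint into the (possibly widened) gap.
        int pos = prefixAt;
        while (length >= 0x80)
        {
            buffer[pos++] = (byte)((length & 0x7F) | 0x80);
            length >>= 7;
        }
        buffer[pos] = (byte)length;
    }
}
```

The `InsertRange` call is where the cost hides: a payload of 128 bytes or more forces everything after the prefix to be shifted, and the entire payload has to stay in memory until the prefix has been written.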

I hate limits

Over the time since then, I have seen many, many requests from people asking for protobuf-net to support larger data sizes - at least an order of magnitude above what has previously been usable, tens of GB or more, which makes perfect sense when you consider the data that some apps load into the plentiful RAM available on even a mid-range server. In principle this is simple (mostly making sure that the reader and writer use 64-bit tracking internally), but there are 3 stumbling blocks:

  • the need to buffer vast quantities of data would demand excessive amounts of RAM
  • the current buffer implementation would be prohibitively hard to refactor to go above 2GB
  • even if we did, it would then take a loooong time to output the buffered data after backfilling

I've recently pushed some commits intended to address the 64-bit reader/writer issue - unblocking some users, but the other factors are much harder to solve in the current implementation.

Wait... how does that unblock anyone?

Good catch; indeed, simply enabling 64-bit readers and writers doesn't fix the buffering problem - but: there is a workaround. A long time in protobuf's past, there were two ways of encoding sub-messages. One was the length-prefixed string that we've discussed; the other was the "group". At the binary level, the difference is that "groups" don't have a length prefix - instead a sentinel value suffix is used to denote the end of the message:

[field x, "start group"]
[field x, "end group"]

(the protocol itself means that "end group" could not occur as an immediate child of the payload, so this is unambiguous)
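
In code, the shape of a group write might look something like the sketch below. This is an illustration under assumed names (`GroupSketch`, `WriteGroup`), not protobuf-net's writer. The wire types 3 ("start group") and 4 ("end group") come from the protobuf encoding; the body is streamed straight through to the destination.

```csharp
using System;
using System.IO;

public static class GroupSketch
{
    const uint StartGroup = 3, EndGroup = 4; // wire types from the protobuf encoding

    // Standard varint: 7 bits per byte, high bit set on all but the last byte.
    static void WriteVarint(Stream dest, ulong value)
    {
        while (value >= 0x80)
        {
            dest.WriteByte((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        dest.WriteByte((byte)value);
    }

    // Writes [field x, "start group"], the body, then [field x, "end group"].
    public static void WriteGroup(Stream dest, int fieldNumber, Action<Stream> writeBody)
    {
        WriteVarint(dest, ((ulong)(uint)fieldNumber << 3) | StartGroup);
        writeBody(dest); // forwards-only: no length needed, no buffering, no backfill
        WriteVarint(dest, ((ulong)(uint)fieldNumber << 3) | EndGroup);
    }
}
```

Because nothing is written before the body that depends on the body's size, `dest` can be a file or network stream that is flushed as it goes.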

As with most things, this has various advantages and disadvantages - but most significantly in our case here, it means we don't need to know the length in advance. And if we don't need to know the length, then we don't need to buffer anything - we can write the data in a purely forwards direction without any need to backfill data. There's just one problem: it is out of favor with the protobuf specification owners - it was marked as deprecated but supported in the proto2 DSL, and there is no syntax for it at all in the proto3 DSL (these all just describe data against the same binary format).

But: I really, really like groups, at least at the binary format level. Essentially, the current 2GB+ unblocking in an upcoming deploy of protobuf-net is limited to scenarios where it is possible to use groups extensively. The closer something is to being a leaf, the more it'll be OK to use length-prefixed strings; the closer something is to the root object, the more it will benefit from being treated as a "group". With this removing the need to buffer+backfill, arbitrarily large files can be produced. The cost, however, is that you won't be able to interop with data that is expressed as proto3 schemas.

Historically, you have been able to indicate that a member should be treated as a group via:

// for field number "n"
[ProtoMember(n, DataFormat = DataFormat.Group)]
public SomeType MemberName { get; set; }

However, this is hard to express in some cases (such as dictionaries), so this has been extended to allow declaration at the type-level:

[ProtoContract(IsGroup = true)]
public class SomeType {...}

(both of which can also be expressed via the RuntimeTypeModel API for runtime configuration)

These changes move us forward, at least - but are mainly appropriate when using protobuf-net as the only piece of the puzzle, since it simply cannot be expressed in the proto3 DSL.

The future

This is all great, but isn't ideal. So in parallel with that, I have some early-stage work in progress that is taking a much more aggressive look at the future of protobuf-net and what it needs to move forward. I have many lofty aims on the list:

  • true 2GB+ support including length-prefix, achieved by a redesign of the writer API, including switching to precalculation of lengths as required
  • optimized support for heterogeneous backend targets, including in-memory serialization, Streams, "Channels" (the experimental redesign of the .NET IO stack), memory-mapped-files, etc
  • making use of new concepts like Utf8String, Span<T> where appropriate
  • full support for async backend targets, making optimal use of ValueTask<T> as appropriate so that performance is retained in the case where it is possible to complete entirely synchronously
  • rework of the codegen / meta-programming layer, reducing or removing the dependency on IL-emit, and moving more towards compile-time code-gen (ideally fully automated and silent) using Roslyn
  • in doing so, greatly improve the experience for AOT scenarios, where meta-programming is restricted or impossible
  • improve the performance of a range of common scenarios by every mechanism imaginable
  • and maybe, just maybe: getting around to implementing updated DSL parsing tooling (but realistically: that isn't the key selling-point of protobuf-net)

As counterpoints, I also imagine that I'll be dropping support for everything that isn't either ".NET Framework recent-enough to build via dotnet build" (4.0 and above, IIRC) or ".NET Standard (something)". The reality is that I'm not in a position to support some obscure PCL configuration or an ancient version of Silverlight. If you can make it compile: great! I'm also entirely open to including targets for things like Xamarin or Unity as long as somebody else can make them work in the build - I'm simply not a user of those tools, and it would be artificial to say that I've seen it work. I'm also moving away from my historic aim of being able to compile on down-level compiler versions. These days, with NuGet as the de-facto package manager, and dotnet build readily available, and the free Visual Studio Community edition, I'm not sure it makes sense to worry about old compilers.

As you can see, there's a lot in the planning. I've been experimenting with various pieces of it to see how it fits together, and I'm confident that I see a viable route forward. Now all I need is to make it happen.

The first step there is to get the "longification" changes shipped; this has now seen real-world usage, so it is just some packaging work to do. I hope to have that available on NuGet before next week.

Fun times!