Wednesday, 21 June 2017

protobuf-net gets proto3 support

For quite a while, protobuf-net hasn't seen any major changes. Sure, I've been pottering along with ongoing maintenance and things like .NET Core support, but it hasn't had any step changes in behavior. Until recently.

2.3.0 Released

I'm pleased to say that 2.3.0 has finally dropped. The most significant part of this is "proto3", which ties into the 3.0.0 version of Protocol Buffers - released by Google at the end of July 2016. There are a few reasons why I haven't looked at this for protobuf-net before now, including:

  • zero binary format changes; so ultimately, even without any library or tooling changes, everything expressed in proto3 can be handled by existing proto2 code, interchangeably; I didn't feel under immense pressure to rush out a release
  • significant DSL changes for "proto3" syntax, coupled with the fact that protobuf-net's existing DSL tools were in bad shape; not least, they were tied into some technologies with a bad cross-platform story. Since I knew I needed a new answer for DSL tooling, it seemed a poor investment to hack the new features into the end-of-life tooling. A significant portion of protobuf-net's usage is from code-first users who don't even have a DSL version of their schema, which is why this wasn't at the top of my list of priorities
  • some new data contracts targeting commonly exchanged types, but this is tied into the DSL changes
  • I misunderstood the nature of the "proto3" syntax changes; I assumed it would be adding features and complexity, when in fact it removes a lot of the more awkward features. The few pieces that it did actually add were backported into "proto2" anyway
  • I've been busy with lots of other things, including a lot of .NET Core work for multiple libraries

But; I've finally managed to get enough time together to look at things properly.

First, some notes on proto3:

proto3 is simpler than proto2

This genuinely surprised me, but it was a very pleasant surprise. When writing protobuf-net, I made a conscious decision to make it easy and natural to implement the most common scenarios. I supported the full range of protobuf features, but some of them were more awkward to use. As such, I made some pragmatic decisions towards making it simple and obvious to use:

  • implicit zero defaults: most people don't have complex default values, where-as this makes it simple and efficient to store "empty" data (in zero bytes) without any configuration
  • don't worry about implicitly set vs explicitly set values: values are values are values; the library supports a few common .NET patterns for explicit assignment (ShouldSerialize* / *Specified / Nullable<T> + null) - see the sketch after this list - but it doesn't demand them and is perfectly fine without them
  • extensions and unknown data entirely optional: the question here is what to do if the serialized data contains unexpected / unknown values - which could be from external "extensions", or could just be new fields that the code doesn't know about. protobuf-net supports this type of usage, but accepts that it isn't something that most folks need or even want - they just want to get the expected data in and out
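
As a quick illustration of those presence patterns (a minimal sketch - the type and members here are invented, but the attributes and the ShouldSerialize* / Nullable<T> conventions are the ones the library recognises):

using ProtoBuf;

[ProtoContract]
public class Order
{
    // implicit zero default: a value of 0 is simply not written to the stream
    [ProtoMember(1)]
    public int Quantity { get; set; }

    // Nullable<T>: null means "not specified", so nothing is serialized
    [ProtoMember(2)]
    public int? Discount { get; set; }

    // ShouldSerialize* pattern: consulted when deciding whether to write the field
    [ProtoMember(3)]
    public string Notes { get; set; }
    public bool ShouldSerializeNotes() => !string.IsNullOrEmpty(Notes);
}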

It turns out that proto3 makes some striking omissions from proto2:

  • default values are gone - implicit zero values are assumed and are the only permitted defaults
  • explicit assignment is gone - if something has a value other than the zero default, it is serialized, and that's it
  • extensions are largely missing

A part of me feels that these changes totally validate the decisions I made when making protobuf-net as simple to use as possible. Note that protobuf-net still retains full support for the wider set of protobuf features (including all the proto2 features) - they're not going anywhere.

what about protobuf JSON?

protobuf 3.0.0 added a well-defined JSON encoding for protobuf data. I confess that I'm deeply conflicted on this. In the .NET world, JSON is a solved problem. If I want my data serialized as JSON, I'm probably going to look at JIL (if I want raw performance) or Json.NET (if I want greater flexibility and range of features, or just want to use the de-facto platform serializer). Since protobuf-net targets idiomatic .NET types that would already serialize just fine with either of these, it seems to me that there is very little benefit in spending a large amount of time writing JSON support directly for protobuf-net. As such, protobuf-net still does not support this. If there is a genuine need for it, the first thing I would do would be to look at JIL or Json.NET to see if there is some combination of configuration options that I can specify that would conveniently be compatible with the expected JSON encoding. At the very worst, I could see either some PRs to JIL or a fork of JIL to support it, but frankly I'm going to defer on touching the JSON option until I understand the use-case. On the surface, the JSON option here seems to take all the main reasons for using protobuf and throw them out of the window. My reservations are probably because I'm spoiled by working on a platform where I can take virtually any object, and both JIL and Json.NET will be able to serialize and deserialize it for me.

So what do we get in protobuf-net 2.3.0?

Brand new protogen tooling for both proto2 and proto3

This release completely replaces the protogen DSL parsing tool; it has been 100% rewritten from scratch using pure managed code. The old version used to:

  • shell execute to call Google's "protoc" tool to emit a compiled schema (in the protobuf serialization format, naturally) as a file
  • then deserialize that file into the corresponding type model using protobuf-net
  • serialize that same object as xml
  • run the xml through an xslt 1.0 engine to generate C#

This worked, but it was a cross-platform nightmare as well as a maintenance nightmare. I doubt that xslt was a good choice for codegen even when it was written, but today... just painful. I looked at a range of parsing engines, but ultimately decided on a manual tokenizer and an imperative forwards-only parser. It turned out to be nothing like as much work as I had feared, which was nice. In order to have confidence in the parser, I have tested it on every .proto schema I can find, including about 220 schemas that describe a good portion of Google's public API surface. I've tested these against protoc's binary output to ensure that not only does it parse the files meaningfully, but it also produces the exact same bytes (as a compiled / serialized schema) that protoc produces.

This parser is then tied into a fairly basic codegen system. At the moment this is crude, and is subject to significant change. The good thing is that now that everything is in place, it can be reworked relatively easily - perhaps to use one of the many templating systems available in .NET.

As an illustration of how the parser and codegen are neatly decoupled, Roger Johansson has also independently converted his Proto Actor code to use protobuf-net's parser rather than protoc, which is great! https://twitter.com/RogerAlsing/status/871829162218184704. If you want to use the parser and code-generation tools outside of the tools I provide, protobuf-net.Reflection may be useful to you.
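
As a rough sketch of what direct usage of protobuf-net.Reflection can look like (hedged: this API surface is new and may shift, and "my.proto" is just a placeholder path):

using System;
using System.IO;
using ProtoBuf.Reflection;

static class SchemaTool
{
    static void Main()
    {
        var set = new FileDescriptorSet();
        set.AddImportPath(".");   // where to resolve schemas and imports from
        set.Add("my.proto");      // queue a schema for parsing
        set.Process();            // parse and resolve everything

        foreach (var error in set.GetErrors())
            Console.Error.WriteLine(error); // surface any parse problems

        // emit C# via the bundled code generator
        foreach (var file in CSharpCodeGenerator.Default.Generate(set))
            File.WriteAllText(file.Name, file.Text);
    }
}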

How do I use it?

OK, you have a .proto schema (proto2 or proto3). At the moment, you have 2 options for codegen from protobuf-net:

  1. compile, build and execute the protogen command line tool (which deliberately shares command-line switches with Google's protoc tool)
  2. use https://protogen.marcgravell.com/ to do it online

(as a 2.1 option you could also clone that same website from git and host it locally; that's totally fine)
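
For the command-line route, since the switches deliberately mirror protoc, an invocation should look something like the following (hedged - check the tool's help output for the definitive switch list; --proto_path and --csharp_out are the protoc names):

protogen --proto_path=. --csharp_out=generated my.proto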

I want to introduce much better tooling options, including something that ties into msbuild and the dotnet CLI, and (optionally) devenv, but so far this is looking like hard work, so I wanted to ship 2.3.0 before tackling it. It is my opinion that https://protogen.marcgravell.com/ is now perhaps the easiest way to play with .proto schemas - and to show willing, it also includes support for all official protoc output languages, and includes the entire public Google API surface as readily available imports (those same 220 schemas from before).

Support for maps

Maps (map<key_type, value_type>) in .proto are the equivalent of dictionaries in .NET. If you're familiar with protobuf-net, you'll know that it has offered dictionary support for many years. Fortunately, Google's idea of how this should be implemented matches perfectly with the arbitrary and unilateral decisions I stumbled into, so maps are 99.95% interchangeable with how protobuf-net already handles dictionaries. The 0.05% relates to what happens with duplicate keys. Historically, protobuf-net used theData.Add(key, value), which would throw if a key was duplicated. However, maps are defined such that the last value replaces previous values - so: theData[key] = value;. This is a very small difference, and doesn't impact any data that would currently successfully deserialize, so I've made the executive decision that from 2.3.0 all dictionaries should follow the "map" rules by default (when appropriate). To allow full control, protobuf-net has a new ProtoMapAttribute ([ProtoMap]). This has options to use the old .Add behavior, and also options to control the sub-format used for the key and value. The protogen tool will always include the appropriate [ProtoMap] options for your data.
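
To make that concrete, a minimal sketch (the type and field names are invented; DisableMap is my reading of the opt-out option name, with KeyFormat / ValueFormat controlling the sub-formats):

using System.Collections.Generic;
using ProtoBuf;

[ProtoContract]
public class Cache
{
    // behaves like: map<int32, string> items = 1; in .proto
    // [ProtoMap] is implied for dictionaries; shown here to illustrate the opt-out
    [ProtoMember(1)]
    [ProtoMap(DisableMap = false)] // DisableMap = true should restore the old .Add (throw-on-duplicate) semantics
    public Dictionary<int, string> Items { get; set; } = new Dictionary<int, string>();
}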

Support for Timestamp and Duration

Timestamp and Duration refer to a point in time (think: DateTime) and an amount of time (think: TimeSpan). Again, protobuf-net has had support for DateTime and TimeSpan for many years, but this time my arbitrary interpretation and Google's differ significantly. I have added native support for these formats, but because it is different to (and fundamentally incompatible with) what protobuf-net has done historically, this has to be done on an opt-in basis. I've added a new DataFormat.WellKnown option that indicates that you want to use these formats. For example:

[ProtoMember(7, DataFormat = DataFormat.WellKnown)]
public DateTime CreationDate { get; set; }

will be serialized as a Timestamp. The protogen tool recognises Timestamp and Duration and will emit the appropriate options.
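
Similarly, TimeSpan gets the same opt-in treatment (the field number here is invented for illustration):

[ProtoMember(8, DataFormat = DataFormat.WellKnown)]
public TimeSpan Elapsed { get; set; }

will be serialized as a Duration.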

Simpler enum handling

Historically, enums in .proto were a bit awkward when it came to unknown values, and protobuf-net defaulted to the most paranoid option of panicking if it saw a value it didn't explicitly expect. However, the guidance now includes this new remark:

During deserialization, unrecognized enum values will be preserved in the message, though how this is represented when the message is deserialized is language-dependent. In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation.

Enums in .NET are open enum types, so it makes sense to relax the handling here. Additionally, protobuf-net historically didn't properly implement the older "make it available as an extension value" approach from proto2 (it would throw an exception instead) - far from ideal. So: from 2.3.0 onwards, all enums will be (by default) interpreted directly and without checking against expected values, with the exception of the unusual scenario where [ProtoEnum(Value=...)] has been used to re-map an enum such that the serialized value is different to the natural value. In this case, it can't assume that a direct interpretation will be valid, so the legacy checks remain. Emphasis: this is a very rare scenario, and probably won't impact anyone except me (and my test suite). Because of this, the [ProtoContract(EnumPassthru = ...)] option is now mostly redundant: the only time it is useful is to explicitly set it to false to revert to the previous "throw an exception" behaviour.
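
To illustrate both halves of that (a sketch - the enum members are invented, but [ProtoEnum(Value=...)] and EnumPassthru are the real options):

using ProtoBuf;

// remapped values: the serialized value differs from the natural value,
// so this enum keeps the legacy validation checks
[ProtoContract]
public enum Status
{
    [ProtoEnum(Value = 100)] Active = 1,
    [ProtoEnum(Value = 200)] Inactive = 2,
}

// explicitly reverting to the old "throw on unknown values" behaviour
[ProtoContract(EnumPassthru = false)]
public enum StrictStatus
{
    None = 0,
    Known = 1,
}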

Discriminated unions, aka one-of

One of the features introduced in proto3 (and back-ported to proto2) is the ability for multiple fields to overlap such that only one of them can contain a value at a time. The ideal in-memory representation of this is a discriminated union, which C# can't really represent directly, but which can be simulated via a struct with explicit layout; so that's exactly what we now do! A family of discriminated union structs have been introduced for this purpose, and are mainly intended to be used with generated code. But if you want to use them directly: have fun!
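
To illustrate the underlying trick (a hand-rolled sketch of the general technique, not protobuf-net's actual types): with explicit layout, several fields can share the same bytes, with a separate discriminator recording which one is currently meaningful.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
public struct Union32
{
    [FieldOffset(0)] public int Discriminator; // which overlapped field is active

    // these fields all occupy the same 4 bytes
    [FieldOffset(4)] public int Int32;
    [FieldOffset(4)] public float Single;
}

// usage (inside a method): var value = new Union32 { Discriminator = 1, Int32 = 42 };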

proto3 schema generation

Since the DSL tools accept proto2 or proto3 syntax, it makes sense that we should be able to emit both proto2 and proto3 syntax, so there are now overloads of GetSchema / GetProto<T> that allow this. These tools have also been updated to be aware of maps, Timestamp, Duration etc.
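
A quick sketch of what that looks like (MyDto is a placeholder, and I'm assuming the new overload takes a ProtoSyntax value to select the dialect):

using System;
using ProtoBuf;
using ProtoBuf.Meta;

[ProtoContract]
public class MyDto
{
    [ProtoMember(1)]
    public string Name { get; set; }
}

static class SchemaDump
{
    static void Main()
    {
        // emit the schema in proto3 dialect; ProtoSyntax.Proto2 selects the older syntax
        Console.WriteLine(Serializer.GetProto<MyDto>(ProtoSyntax.Proto3));
    }
}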

New custom option DSL support

The new DSL tooling makes use of the "extensions" feature to add custom syntax options to your .proto files. At the moment the options here are pretty limited, allowing you to control the accessibility and naming of elements, but as new controls become necessary: that's where they will go.

General bug fixes

This build also includes a range of more general fixes for specific scenarios, as covered by the release notes.

What next?

I'm keeping a basic future roadmap on the release notes. There are some significant pieces of work ahead, including (almost certainly) a major rework of the core serializer to support async IO, "Pipelines", etc. I also want to improve the build-time tooling. My work here is very much not done.