Saturday, 17 September 2016

Channelling my inner geek


I do a lot of IO programming – both from a client perspective (for example, StackExchange.Redis), and from a server perspective (for example, our custom web-sockets server usually has between 350k and 500k open connections, depending on how long it has been since we gave it a kick). So I’m always interested in new “players” when it comes to IO code, and a few weeks ago Microsoft’s David Fowler came through with some interesting goodies – specifically the “Channels” API.

First, a couple of disclaimers: everything I’m talking about here is completely experimental, evolving rapidly, depends on “corefxlab” concepts (i.e. stuff the .NET team is playing with to see what works well), and is virtually untested. Even the name is purely a working title. But let’s be bold and take a dive anyway!

What are Channels, and what problem do they solve?

The short description of Channels would be something like: “high performance zero-copy buffer-pool-managed asynchronous message pipes”. Which is quite a mouthful, so we need to break that down a bit. I’m going to talk primarily in the context of network IO software (aka network client libraries and server applications), but the same concepts apply equally to anything where data goes in or out of a stream of bytes. It is relatively easy to write a basic client library or server application for a simple protocol; heck, spin up a Socket via a listener, maybe wrap it in a NetworkStream, call Read until we get a message we can process, then spit back a response. But trying to do that efficiently is remarkably hard – especially when you are talking about high volumes of connections. I don’t want to get bogged down in details, but it quickly gets massively complicated, with concerns like:

  • threading – don’t want a thread per socket
  • connection management (listeners starting the read/write cycle when connections are accepted)
  • buffers to use to wait for incoming data
  • buffers to use to store any backlog of data you can’t process yet (perhaps because you don’t have an entire message)
  • dealing with partial messages (i.e. you read all of one message and half of another – do you shuffle the remainder down in the existing buffer? or…?)
  • dealing with “frame” reading/writing (another way of saying: how do we decode a stream into multiple successive messages for sequential processing)
  • exposing the messages we parse as structured data to calling code – do we copy the data out into new objects like string? how “allocatey” can we afford to be? (don’t believe people who say that objects are cheap; they become expensive if you allocate enough of them)
  • do we re-use our buffers? and if so, how? lots of small buffers? fewer big buffers and hand out slices? how do we handle recycling?

Basically, before long it becomes an ugly mess of code. And almost none of this has any help from the framework. A few years ago I put together NetGain to help with some of this; NetGain is the code that drives our web-socket server, and has pieces to cover most of the above, so implementing a new protocol mostly involves writing the code to read and write frames. It works OK, but it isn’t great. Some bits of it are decidedly ropey.
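To make that concrete, the naive starting point looks something like this sketch (one thread per connection, a fresh buffer per connection, no framing, no pooling – every concern in the list above still unsolved; the port number is arbitrary):

var listener = new TcpListener(IPAddress.Any, 5000);
listener.Start();
while (true)
{
    var client = listener.AcceptTcpClient();
    new Thread(() =>
    {
        using (var stream = client.GetStream())
        {
            var buffer = new byte[4096]; // per-connection allocation
            int bytes;
            while ((bytes = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // naively assume one Read is one message (it isn't!)
                stream.Write(buffer, 0, bytes);
            }
        }
    }).Start();
}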

But here’s the good news: Channels does all of these things, and does them better.

You didn’t really answer my question; what is a Channel?

Damn, busted. OK; let’s put it like this – a Channel is like a Stream that pushes data to you rather than having you pull. One chunk of code feeds data into a Channel, and another chunk of code awaits data to pull from the channel. In the case of a network socket, there are two Channels – one in each direction:

  • “input” for our application code:
    • the socket-aware code borrows a buffer from the memory manager and talks to the socket; as data becomes available it pushes it into the channel then repeats
    • the application code awaits data from the channel, and processes it
  • “output” for our application code:
    • our application may at any point (but typically as a response to a request) choose to write data to the channel
    • the socket-aware code awaits data from the channel, and pushes it to the network infrastructure

In most cases you should not expect to have to write any code at all that talks to the underlying socket; that piece is abstracted away into library code. But if you want to write your own channel backend you can. A small selection of backend providers have been written to explore the API surface (mostly focusing on files and networking; there are actually already 3 different backend ‘sockets’ libraries: libuv, Windows RIO, and async managed sockets). So let’s focus on what the application code might look like, with the example of a basic “echo” server:

IChannel channel = ...
while (true)
{
    // await data from the input channel
    var request = await channel.Input.ReadAsync();

    // check for "end of stream" condition
    if (request.IsEmpty && channel.Input.Completion.IsCompleted)
    {
        request.Consumed();
        break;
    }

    // start writing to the output channel
    var response = channel.Output.Alloc();
    // pass whatever we got as input to the output
    response.Append(ref request);
    // finish writing (this should activate the socket driver)
    await response.FlushAsync();
    // tell the engine that we have read all of that data
    // (note: we could also express that we read *some* of it,
    // in which case the rest would be retained and supplied next time)
    request.Consumed();
}

What isn’t obvious here is the types involved; “request” is a ReadableBuffer, which is an abstraction over one or more slices of memory from the buffer pool, from which we can consume data. “response” is a WritableBuffer, which is again an abstraction over any number of slices of memory from the buffer pool, to which we can write data (with more blocks being leased as we need them). The idea is that you can “slice” the request as many times as you need without allocation cost:

var prefix = request.Slice(0, 4);
var len = prefix.Read<int>();
var payload = request.Slice(4, len);

Note that ReadableBuffer is a value-type, so slicing it like this is not allocating - the new value simply has a different internal tracking state of the various buffers. Each block is also reference counted, so the block is only made available back to the pool when it has been fully released. Normally this would happen as we call “Consumed()”, but it has more interesting applications. The “echo” server already gave an example of this – the “Append” call incremented the counter, which means that the data remains valid even after “request.Consumed()” has been called. Instead, it is only when the socket driver says that it too has consumed it, that it gets released back to the pool. But we can do the same thing ourselves to expose data to our application without any copies; our previous example can become:

var prefix = request.Slice(0, 4);
var len = prefix.Read<int>();
var payload = request.Slice(4, len).Preserve();

Here “Preserve” is essentially “increment the reference counter”. We can then pass that value around and the data will be valid until we “Dispose()” it (which will decrement the counter, releasing it to the pool if it becomes zero). Some of you might now be twitching uncontrollably at the thought of explicit counting, but for high volume scenarios it is far more scalable than garbage collection, allocations, finalizers, and all that jazz.
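To sketch what that looks like in practice (using only the members discussed above; “ProcessLater” is just a hypothetical stand-in for whatever your application does with the data):

// keep the payload alive beyond Consumed()
var payload = request.Slice(4, len).Preserve();
request.Consumed(); // the engine can recycle everything *except* payload
ProcessLater(payload); // e.g. hand it to another thread
// ... later, when the application is finished with it:
payload.Dispose(); // counter hits zero => blocks return to the pool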

So where did the zero copy come in?

A consequence of this is that the memory we handed to the network device is the exact same memory that our application code can access for values. Recall that the socket-aware code asked for a buffer from the pool and then handed the populated data to the channel. Those are the exact same blocks that “payload” is accessing. Issues like data backlog are handled simply by only releasing blocks once the consumer has indicated that it has consumed all the data known to be valid on that block; if the caller only reads 10 bytes, it just slices into the existing block. Of course, if the application code chooses to access the data in terms of “string” etc, there are mechanisms to read that, but the data can also be accessed via the new “Span&lt;byte&gt;” API, and potentially as “Utf8String” (although there are some complications there when it comes to data in non-contiguous blocks).
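As a purely illustrative sketch (method names as per the earlier examples; “ProcessFrame” is a hypothetical stand-in for application logic, and I’m glossing over how *partial* consumption is expressed, since that detail is in flux), a length-prefixed frame reader that exploits this might look like:

while (true)
{
    var data = await channel.Input.ReadAsync();
    if (data.Length >= 4)
    {
        var len = data.Slice(0, 4).Read<int>();
        if (data.Length >= 4 + len)
        {
            // zero copies: "frame" is a view over the same pooled
            // blocks that the socket driver originally filled
            var frame = data.Slice(4, len).Preserve();
            ProcessFrame(frame); // valid until frame.Dispose()
        }
    }
    data.Consumed();
}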

Synopsis

It looks really really exciting for anyone involved in high performance applications – as either a client or server. Don’t get me wrong: there’s no “magic” here; most of the tricks can already be done and are done, but it is really hard to do this stuff well. This API essentially tries to make it easy to get it right, but at the same time has performance considerations at every step by some folks who really really care about absurd performance details.

But is it practical? How complete is it?

It is undeniably very early days. But it is enough to be useful. There is a range of trivially simple examples in the repository, and some non-trivial ones; David Fowler is looking to hack the entire thing underneath ASP.NET to see what kind of difference it makes. At the less ambitious end of things, I have re-written the web-sockets code from NetGain on top of the Channels API; and frankly, it was way less work than you might expect. This means that there is a fully working web-sockets server and web-sockets client. I’ve also been playing with the StackExchange.Redis multiplexed redis client code on top of Channels (although that is far less complete). So it is very definitely possible to implement non-trivial systems. But be careful: the API is changing rapidly – you definitely shouldn’t be writing your mission critical pacemaker monitoring and ambulance scheduling software on top of this just yet!

There are also a lot of missing pieces – SSL is a huge hole that is going to take a lot of work to fill; and probably a few more higher level abstractions to make the application code even simpler (David has mentioned a frame encoder/decoder layer, for example). The most important next step is to explore what the framework offers in terms of raw performance: how much “better” does it buy you? I genuinely don’t have a simple answer to that currently, although I’ve been trying to explore it.

Can I get involved?

If you want to explore the API and help spot the missing pieces, the ugly pieces, and the broken pieces – and perhaps even offer the fixes, then the code is on GitHub and there’s a discussion room. Note: nothing here is up to me – I’m just an interested party; David Fowler is commander in chief on this.

And what next?

And then, once all those easy hurdles have been jumped – then starts the really really hard work: figuring out what, in terms of formal API names, to call it.