Saturday 17 September 2016

Channelling my inner geek

 

I do a lot of IO programming – both from a client perspective (for example, StackExchange.Redis), and from a server perspective (for example, our custom web-sockets server usually has between 350k and 500k open connections, depending on how long it has been since we gave it a kick). So I’m always interested in new “players” when it comes to IO code, and a few weeks ago Microsoft’s David Fowler came through with some interesting goodies – specifically the “Channels” API.

First, a couple of disclaimers: everything I’m talking about here is completely experimental, evolving rapidly, depends on “corefxlab” concepts (i.e. stuff the .NET team is playing with to see what works well), and is virtually untested. Even the name is purely a working title. But let’s be bold and take a dive anyway!

What are Channels, and what problem do they solve?

The short description of Channels would be something like: “high performance zero-copy buffer-pool-managed asynchronous message pipes”. Which is quite a mouthful, so we need to break that down a bit. I’m going to talk primarily in the context of network IO software (aka network client libraries and server applications), but the same concepts apply equally to anything where data goes in or out of a stream of bytes. It is relatively easy to write a basic client library or server application for a simple protocol; heck, spin up a Socket via a listener, maybe wrap it in a NetworkStream, call Read until we get a message we can process, then spit back a response. But trying to do that efficiently is remarkably hard – especially when you are talking about high volumes of connections. I don’t want to get bogged down in details, but it quickly gets massively complicated, with concerns like:

  • threading – don’t want a thread per socket
  • connection management (listeners starting the read/write cycle when connections are accepted)
  • buffers to use to wait for incoming data
  • buffers to use to store any backlog of data you can’t process yet (perhaps because you don’t have an entire message)
  • dealing with partial messages (i.e. you read all of one message and half of another – do you shuffle the remainder down in the existing buffer? or…?)
  • dealing with “frame” reading/writing (another way of saying: how do we decode a stream into multiple successive messages for sequential processing)
  • exposing the messages we parse as structured data to calling code – do we copy the data out into new objects like string? how “allocatey” can we afford to be? (don’t believe people who say that objects are cheap; they become expensive if you allocate enough of them)
  • do we re-use our buffers? and if so, how? lots of small buffers? fewer big buffers and hand out slices? how do we handle recycling?
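To make that concrete, here’s roughly where everyone starts – a purely illustrative, naive blocking echo server (assuming the usual System.Net, System.Net.Sockets and System.Threading namespaces) that ignores every one of the concerns above:

// naive blocking echo server - illustrative only, not how you'd want to ship it
var listener = new TcpListener(IPAddress.Any, 5000);
listener.Start();
while (true)
{
    var client = listener.AcceptTcpClient();
    new Thread(() =>       // one thread per connection: fine for 10 sockets, fatal for 500k
    {
        using (client)
        using (var stream = client.GetStream())
        {
            var buffer = new byte[4096];   // per-connection buffer, no pooling
            int bytes;
            while ((bytes = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                stream.Write(buffer, 0, bytes);   // echo it straight back; no framing, no backlog handling
            }
        }
    }).Start();
}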

Basically, before long it becomes an ugly mess of code. And almost none of this has any help from the framework. A few years ago I put together NetGain to help with some of this; NetGain is the code that drives our web-socket server, and has pieces to cover most of the above, so implementing a new protocol mostly involves writing the code to read and write frames. It works OK, but it isn’t great. Some bits of it are decidedly ropey.

But here’s the good news: Channels does all of these things, and does them better.

You didn’t really answer my question; what is a Channel?

Damn, busted. OK; let’s put it like this – a Channel is like a Stream that pushes data to you rather than having you pull. One chunk of code feeds data into a Channel, and another chunk of code awaits data to pull from the channel. In the case of a network socket, there are two Channels – one in each direction:

  • “input” for our application code:
    • the socket-aware code borrows a buffer from the memory manager and talks to the socket; as data becomes available it pushes it into the channel then repeats
    • the application code awaits data from the channel, and processes it
  • “output” for our application code:
    • our application may at any point (but typically as a response to a request) choose to write data to the channel
    • the socket-aware code awaits data from the channel, and pushes it to the network infrastructure

In most cases you should not expect to have to write any code at all that talks to the underlying socket; that piece is abstracted away into library code. But if you want to write your own channel backend you can. A small selection of backend providers have been written to explore the API surface (mostly focusing on files and networking; there are actually already 3 different backend ‘sockets’ libraries: libuv, Windows RIO, and async managed sockets). So let’s focus on what the application code might look like, with the example of a basic “echo” server:

IChannel channel = ...
while (true)
{
    // await data from the input channel
    var request = await channel.Input.ReadAsync();

    // check for "end of stream" condition
    if (request.IsEmpty && channel.Input.Completion.IsCompleted)
    {
        request.Consumed();
        break;
    }

    // start writing to the output channel
    var response = channel.Output.Alloc();
    // pass whatever we got as input to the output
    response.Append(ref request);
    // finish writing (this should activate the socket driver)
    await response.FlushAsync();
    // tell the engine that we have read all of that data
    // (note: we could also express that we read *some* of it,
    // in which case the rest would be retained and supplied next time)
    request.Consumed();
}

What isn’t obvious here is the types involved; “request” is a ReadableBuffer, which is an abstraction over one or more slices of memory from the buffer pool, from which we can consume data. “response” is a WritableBuffer, which is again an abstraction over any number of slices of memory from the buffer pool, to which we can write data (with more blocks being leased as we need them). The idea is that you can “slice” the request as many times as you need without allocation cost:

var prefix = request.Slice(0, 4);
var len = prefix.Read<int>();
var payload = request.Slice(4, len);...

Note that ReadableBuffer is a value-type, so slicing it like this is not allocating - the new value simply has a different internal tracking state of the various buffers. Each block is also reference counted, so the block is only made available back to the pool when it has been fully released. Normally this would happen as we call “Consumed()”, but it has more interesting applications. The “echo” server already gave an example of this – the “Append” call incremented the counter, which means that the data remains valid even after “request.Consumed()” has been called. Instead, it is only when the socket driver says that it too has consumed it, that it gets released back to the pool. But we can do the same thing ourselves to expose data to our application without any copies; our previous example can become:

var prefix = request.Slice(0, 4);
var len = prefix.Read<int>();
var payload = request.Slice(4, len).Preserve();...

Here “Preserve” is essentially “increment the reference counter”. We can then pass that value around and the data will be valid until we “Dispose()” it (which will decrement the counter, releasing it to the pool if it becomes zero). Some of you might now be twitching uncontrollably at the thought of explicit counting, but for high volume scenarios it is far more scalable than garbage collection, allocations, finalizers, and all that jazz.
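To make the lifetime explicit – and this is just a sketch, since the exact types and names in corefxlab are still moving, and ProcessLater is a made-up consumer – the pattern looks something like:

// illustrative only: "Preserve" bumps the reference count, "Dispose" releases it
var preserved = request.Slice(4, len).Preserve();   // ref-count +1: the underlying blocks stay valid
request.Consumed();                                 // the channel can move on, but our data isn't recycled
try
{
    ProcessLater(preserved);                        // hypothetical: hand the preserved data to other code
}
finally
{
    preserved.Dispose();                            // ref-count -1: blocks return to the pool when it hits zero
}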

So where did the zero copy come in?

A consequence of this is that the memory we handed to the network device is the exact same memory that our application code can access for values. Recall that the socket-aware code asked for a buffer from the pool and then handed the populated data to the channel. Those are the exact same blocks that “payload” is accessing. Issues like data backlog are handled simply by only releasing blocks once the consumer has indicated that they’ve consumed all the data we know to be valid on that block; if the caller only reads 10 bytes, it just slices into the existing block. Of course, if the application code chooses to access the data in terms of “string” etc, there are mechanisms to read that, but the data can also be accessed via the new “Span<byte>” API, and potentially as “Utf8String” (although there are some complications there when it comes to data in non-contiguous blocks).

Synopsis

It looks really really exciting for anyone involved in high performance applications – as either a client or server. Don’t get me wrong: there’s no “magic” here; most of the tricks can already be done and are done, but it is really hard to do this stuff well. This API essentially tries to make it easy to get it right, but at the same time has performance considerations at every step by some folks who really really care about absurd performance details.

But is it practical? How complete is it?

It is undeniably very early days. But it is enough to be useful. There are a range of trivially simple examples in the repository, and some non-trivial; David Fowler is looking to hack the entire thing underneath ASP.NET to see what kind of difference it makes. At the less ambitious end of things, I have re-written the web-sockets code from NetGain on top of the Channels API; and frankly, it was way less work than you might expect. This means that there is a fully working web-sockets server and web-sockets client. I’ve also been playing with the StackExchange.Redis multiplexed redis client code on top of Channels (although that is far less complete). So it is very definitely possible to implement non-trivial systems. But be careful: the API is changing rapidly – you definitely shouldn’t be writing your mission-critical pacemaker monitoring and ambulance scheduling software on top of this just yet!

There’s also a lot of missing pieces – SSL is a huge hole that is going to take a lot of work to fill; and probably a few more higher level abstractions to make the application code even simpler (David has mentioned a frame encoder/decoder layer, for example). The most important next step is to explore what the framework offers in terms of raw performance: how much “better” does it buy you? I genuinely don’t have a simple answer to that currently, although I’ve been trying to explore it.

Can I get involved?

If you want to explore the API and help spot the missing pieces, the ugly pieces, and the broken pieces – and perhaps even offer the fixes, then the code is on GitHub and there’s a discussion room. Note: nothing here is up to me – I’m just an interested party; David Fowler is commander in chief on this.

And what next?

And then, once all those easy hurdles have been jumped – then starts the really really hard work: figuring out what, in terms of formal API names, to call it.

Monday 9 May 2016

CUDA–getting started in .NET

 

In my two previous posts (part 1, part 2) I described how we have found uses for CUDA. I am, however, aware that this abstract text doesn’t necessarily help the reader (aka “you”) make inroads into the world of CUDA. So: the purpose of this post is to take the concepts discussed previously and put them together into some actual code.

Firstly, a word on OpenCL: so far, I’ve spoken about CUDA – NVIDIA’s framework for GPGPU programming. Other frameworks exist, and the largest “other” is undoubtedly OpenCL. This framework is designed to let you use the same code over a range of technologies – GPUs from different providers, CPUs, etc. I am aware of it – I just haven’t even tried to address it. If that’s your bag: have fun. I don’t feel hugely limited by restricting myself to CUDA for what I need.

The first things you need for CUDA, then, are:

  • a CUDA capable GPU device
  • the CUDA toolkit
  • a programming environment

Since I’m focused on .NET, we’re also going to need one of the tools for interacting with CUDA from .NET, or we’ll need to write extern wrappers for the C API. A range of CUDA tools for .NET exist, including CUDAfy.NET, managedCUDA, Alea GPU / QuantAlea, Campy.NET, and several others that Campy.NET list and describe on their home page. These fall into two major categories:

  • wrappers for the driver API (you write your kernels in C etc and launch them from C#)
  • IL-to-PTX generators (you write and launch your kernels in C#)

While I’m a huge fan of C#, I’m also a huge fan of keeping close to the metal, and in a heartbeat will sacrifice some programming convenience for performance – so for the examples here I’ll be taking the first route: writing the kernels in C, and launching them from C# via a wrapper over the driver API.

Our first kernel

So without further ado, I’m going to jump straight in with a basic CUDA kernel written in C. Rather than trying to discuss the tag-engine, we’re going to do something simpler… like multiplying numbers. Multiplication is hard, right?

typedef struct {
    int Id;
    unsigned int Value;
} SomeBasicType;

extern "C" {
    __global__ void Multiply(const int N, SomeBasicType* __restrict data, int factor)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
             i += blockDim.x * gridDim.x)
        {
            (data + i)->Value *= factor;
        }
    }
}

The first thing this does is to define a data-type (SomeBasicType) for us to share data between the CPU and GPU code. The __global__ indicates that we’re declaring a kernel, along with a given name and signature. We’re passing in N, the number of elements to process, data, a pointer to some of the data to work against, and factor - the number to multiply the values by. The significance of the N comes into play when we see the i < N in the for loop on the next line. Recall that we usually launch a number of blocks, each consisting of a set number of threads. It is not always the case that the number of elements to process is a convenient multiple of some block size, so it is common for kernels to be passed a count to compare the position against, exiting if the current thread is not needed. This sounds like a contradiction to the “don’t branch” guidance, but as long as most of the threads make the same choice, this isn’t going to be a performance problem.

Recall also that I described a “monolithic kernel” earlier, whereby we launch a thread for every element to process. But: there is a hard limit to the number of threads in a block, and the number of blocks in any dimension – and as always: the more threads there are, the more management overhead there is (although this works differently to CPU threads). As such, it is common to use a “grid-stride loop” rather than a monolithic kernel. As in our example, a for loop is used so that each thread processes multiple elements. But unlike the CPU version shown previously (where each thread processed a separate chunk of data), we instead have each thread in an iteration process adjacent elements, and then move forwards by the stride – namely the width of each block multiplied by the number of blocks. This has multiple side benefits:

  • it can cope with data of any size – it is not subject to the ~67M thread limit mentioned in part 2
  • it requires less overhead for the GPU engine
  • it can still be used as a monolithic kernel (by setting the dimensions such that it only performs one iteration per thread), or as a single-threaded debugging kernel that processes all the data (by setting the dimensions to 1 thread in 1 block)
  • the work being done by a warp or block is contiguous, minimizing memory access overheads

Finally, we get to the actual meat of the kernel, which does an in-place multiplication of one of the fields of our data by one of the parameters.

That’s our first kernel, all done!

A side note on data width

You’ll note that the above example is doing 32-bit arithmetic. For doing serious CUDA work, it is important to be aware of whether you are doing 16-bit, 32-bit or 64-bit arithmetic, as it can significantly impact your choice of hardware – considering them in release order:

  • the “Kepler” architecture (server devices: K40, K80) will serve as a baseline – let’s arbitrarily call it “good” at both 32-bit and 64-bit; 16-bit operations are performed using 32-bit arithmetic
  • the “Maxwell” architecture (server device: M40) is faster than Kepler at 32-bit operations, but is relatively poor at 64-bit operations; 16-bit operations are performed using 32-bit arithmetic
  • to complicate this: the fact that the K80 packs two devices on a physical package means that even though a single M40 may be faster than a single K80 device, code that appropriately uses both K80 devices on a package may still out-perform a single M40
  • the “Pascal” architecture apparently (not yet available for testing; server device: P100) significantly increases both 32-bit and 64-bit performance, while also introducing specific optimized instructions for 16-bit processing

Because of this, there is no single way of saying “this device is faster” – it depends a lot on what exactly you are doing!

Compiling our kernel (and: hello managedCUDA)

A kernel as a text source-file isn’t much use by itself. We need to convert that into GPU-runnable code – either intermediate (“PTX”) or fully compiled to bytecode (“cubin”). There is also a middle-ground that combines both bytecode and intermediate instructions – a “fatbin”; this allows a compiled kernel to target multiple physical architectures and still allow usage on other architectures via a JIT step. There are two common ways of compiling a C kernel – compile-time (NVCC) and runtime (NVRTC). I’ve spoken previously about using NVCC, so for this example I’ll use NVRTC instead.

Since we want to do this from .NET, we need a wrapper library. My examples today are going to use managedCUDA, written by Michael Kunz. This is available on NuGet and is licenced under the LGPL – which I understand (and: IANAL) is not the same as the notoriously infectious and restrictive GPL. What I particularly like about this library is that in addition to exposing an idiomatic .NET object model that maps to the C API, it also exposes the raw C API directly – which makes it convenient to use even when the method you want doesn’t map naturally to a .NET concept.

So let’s load our kernel from a text file and compile it at runtime:

string path = "MyKernels.c";
ManagedCuda.NVRTC.nvrtcResult result;
using (var rtc = new ManagedCuda.NVRTC.CudaRuntimeCompiler(
    File.ReadAllText(path), Path.GetFileName(path)))
{
    try
    {
        rtc.Compile(new string[0]);
        result = ManagedCuda.NVRTC.nvrtcResult.Success;
    }
    catch (ManagedCuda.NVRTC.NVRTCException ex)
    {
        result = ex.NVRTCError;
    }
    string log = rtc.GetLogAsString();
    if (result == ManagedCuda.NVRTC.nvrtcResult.Success)
    {
        byte[] ptx = rtc.GetPTX();
        // ... use the compiled kernel (see below)
    }
}

This takes our file contents as input, and (if successful) produces a blob that contains the compiled kernel. We can also obtain the log, which will contain detailed error messages if our kernel is invalid.

The CUDA context

One of the key objects in CUDA is the context. This wraps the state for a long-running CUDA scenario. Most operations, including memory allocation and kernel launches, typically happen within a CUDA context. managedCUDA exposes a class that wraps this. Note:

  • a CUDA context is bound to a particular GPU device, and it is when creating the CUDA context that you can target specific devices
  • a CUDA context is associated with the CPU thread that creates it; as such, CUDA code would work well for scenarios such as a dedicated worker thread processing items from a queue (see the sketch below), but would not work well from .NET async/await code that switches between multiple threads
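As a minimal sketch of that worker-thread pattern (WorkItem, Process and deviceId are placeholders invented for illustration; BlockingCollection<T> comes from System.Collections.Concurrent – the important point is that every CUDA call happens on the one thread that created the context):

// hedged sketch: a single dedicated thread owns the CudaContext and drains a queue
var queue = new BlockingCollection<WorkItem>();
var worker = new Thread(() =>
{
    using (var ctx = new CudaContext(deviceId, true))   // created *and used* on this thread only
    {
        foreach (var item in queue.GetConsumingEnumerable())
        {
            Process(ctx, item);   // allocate / copy / launch / copy back - all on this thread
        }
    }
});
worker.IsBackground = true;
worker.Start();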

So; let’s create our context and load the kernel:

CudaContext ctx = new CudaContext(deviceId, true);
CudaKernel multiply = ctx.LoadKernelFatBin(ptx, "Multiply");

Note that the context (like many CUDA objects) is IDisposable, and you should ensure to Dispose() it properly, via using or otherwise.

CUDA streams and asynchronous processing

In the simplest usage, the CUDA methods are blocking, however it is often the case that either the CPU can do other useful work while the GPU churns, or we can overlap multiple GPU operations – memory transfers running in parallel with kernel execution, or (depending on the hardware) even multiple kernels running at the same time. To allow this, CUDA introduces the notion of a “stream”. Work on the same stream is processed in order, but the work on different streams may overlap. I’m not going to make extensive usage of streams in this walkthrough, but it is easier to include stream considerations from the start than to add it later.

Initializing memory at the host and on the device

Traditionally, the host (CPU) memory and the device (GPU) memory were completely separate, and all CUDA code had to specifically be cautious of whether it was referring to host or device memory. More recently, the concept of “unified memory” has been introduced that hides this distinction, and performs transfers automatically as needed – it is intended to make it simpler to use the API, but for the maximum performance it is recommended to take full control over allocation and copy. To that end, I’ll stick to the traditional approach, which means explicitly allocating the data at both ends.

We’ll start by allocating the host memory:

struct SomeBasicType {
    public int Id;
    public uint Value;
}

IntPtr hostPointer = IntPtr.Zero;
var res = DriverAPINativeMethods.MemoryManagement.cuMemAllocHost_v2(
    ref hostPointer, count * sizeof(SomeBasicType));
if (res != CUResult.Success) throw new CudaException(res);
SomeBasicType* hostBuffer = (SomeBasicType*)hostPointer;

As before, we declare a raw data type to share the data between CPU and GPU. Here is one of those cases where I’m making use of the raw C API rather than the wrapped objects, as this (IMO) more conveniently allows naked allocations. You might think “hang on, you’re just allocating some memory – surely a .NET array would suffice, or maybe at worst Marshal.AllocHGlobal” – but: in order for the CUDA memory-copy engine to work at maximum performance (avoiding double-copies, etc) it is required to be configured in a specific way, and the best way to guarantee that is to let the CUDA API allocate the memory in the first place. A pinned .NET array certainly wouldn’t suffice. Or at least, it would work – but not as fast. Yes, this needs some unsafe code; it’ll be fine, don’t panic.

Once allocated, we can initialize this memory with some invented data:

for (int i = 0; i < count; i++) {
    hostBuffer[i].Id = i;
    hostBuffer[i].Value = (uint)i;
}

Next we’ll want to allocate a similar block of memory on the device, and start copying the data from the host to the device. Emphasis: large memory transfers are fast, but not instant – if possible you should minimize the amount of data you need to transfer for individual operations. The ideal scenario is to copy the data to the device once / periodically, and use it many times once it is there.

CudaDeviceVariable<SomeBasicType> deviceBuffer = new CudaDeviceVariable<SomeBasicType>(count);
CudaStream defaultStream = new CudaStream();
res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyHtoDAsync_v2(deviceBuffer.DevicePointer,
    hostPointer, deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

This time I'm using the CudaDeviceVariable<T> object to represent the memory at the device, which is perfectly convenient since we will not usually need to do anything except access the DevicePointer to pass into methods. Once again I'm using a raw C-style API for the actual memory copy. There are some transfer methods exposed directly on the object, but this particular combination is not exposed directly. Note that because we have created and used a stream, this is non-blocking. Related operations sent to the same stream will be queued behind this one.

Launching our kernel

We have data on the device; we have a kernel that we have compiled and loaded. Now finally to run the kernel against the data! First we must configure the dimensions – the number of threads in a block and the number of blocks:

multiply.BlockDimensions = new ManagedCuda.VectorTypes.dim3(threadsPerBlock, 1, 1);
multiply.GridDimensions = new ManagedCuda.VectorTypes.dim3(blockCount, 1, 1);

multiply.RunAsync(defaultStream.Stream, new object[] {
    count, deviceBuffer.DevicePointer, value
});

Note that the parameters to RunAsync (other than the stream) match the signature on the kernel itself. value in the above is the factor to multiply by. Once we've got the data in place, running kernels is actually alarmingly simple!
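For reference, I haven’t said where threadsPerBlock and blockCount come from; a reasonable starting point for a grid-stride kernel like ours (purely illustrative – the right numbers depend on your device and data) might be:

// a common default: 256 threads per block, with enough blocks to cover the data,
// capped so that the grid-stride loop picks up whatever is left over
const int threadsPerBlock = 256;
int blockCount = Math.Min(
    (count + threadsPerBlock - 1) / threadsPerBlock,   // blocks needed for one element per thread
    1024);                                             // cap; remaining elements are handled by the loop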

Getting results back

It isn’t usually sufficient to run kernels – sadly we usually expect to get results. As mentioned before, we should strive to minimize the amount of data we want to transfer, but for simplicity in this example I’m going to fetch back the entire buffer, over-writing the previous contents. In real code you might be extracting only small portions of the buffer, or summary / aggregate data.

var res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyDtoHAsync_v2(
    new IntPtr(hostBuffer), deviceBuffer.DevicePointer,
    deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

Once more, for memory operations it has been easier to use the raw C API than the wrapped managed API, and that's OK. Note that because we've been using streams, at this point we've only queued the copy. If we've set all our wheels in motion, and want to sit back and wait while they turn, we can use either ctx.Synchronize(); (to wait for all streams to complete) or defaultStream.Synchronize() (to wait for a specific stream to complete). Finally, when that is done, we can iterate over the data and congratulate ourselves for managing to over-complicate the multiplication of numbers!
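As a purely illustrative sanity check, once the stream has been synchronized we can walk the host buffer and confirm the kernel did what we expected (recall that we initialized Value to i, and the kernel multiplied it in-place by value):

defaultStream.Synchronize();   // wait for the device-to-host copy to complete
for (int i = 0; i < count; i++)
{
    if (hostBuffer[i].Value != (uint)(i * value))
        throw new InvalidOperationException("unexpected result at " + i);
}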

Not enough code for you?

Everything I’ve shown here is available in a github project, that you can clone, fork, etc. I don’t claim it is the most useful tool ever, but hopefully it might help a few people get up and running with CUDA.

Enjoy!

How I found CUDA, or: Rewriting the Tag Engine–part 2

(part 1, part 3)

So we got to work…


We knew we had a problem where CUDA might be part of an answer, especially for performance. And in reality, there’s only one way to find out which horse is faster. You race the horses. We knew that the v1 tag-engine was poorly designed (organically, so to speak), so it wouldn’t be meaningful to compare that for performance purposes. And even if the GPU is better, we still wanted a better CPU implementation:

  • it needs to run on developer local machines; not all developers have a CUDA device, especially if they are on a laptop, or on a VM, or on a VM on a laptop
  • in the event of the tag-engine servers being unreachable, our fallback strategy is to load the tag-engine in-memory on the production web-servers, so it needs to be able to run there
  • if the GPU approach turns out to not be what we want, we still want to have moved our code forwards
  • and if the GPU approach turns out to be exactly what we want, then we’d have physical factors like server hardware, cage configuration, etc to add lead-time, where-as we’d like to replace the code ASAP

So: we definitely wanted a shiny new CPU version, but one that was designed to work nicely side-by-side with the GPU version. The other thing you need to race your horses is: a course. The best way of representing your load is to capture your load, in a way that you can play back. So the next thing we did was to edit our existing tag-engine to make it possible to record all the real requests it was serving, in a way that we could play back and use to compare performance, features, and results. Helpfully, tag-engine requests don’t contain any PII data – note that this isn’t possible in all circumstances.

Parallel or sequential?

Next up, we need to think about how GPUs work. CPUs are often (not always) used to perform multiple independent operations in parallel. GPUs, by contrast, are usually used to make a single operation (per device) happen really quickly, processing the operations sequentially. The parallel approach, while scalable, has a lot of complications:

  • you need to be careful how you perform data updates without breaking running operations (or: you need to duplicate the data to perform updates) – contrast sequential, where you can just squeeze the update in as just one more thing in the queue (it is never fighting other requests), and update the data in-place with impunity
  • each parallel operation needs memory for its processing (the results it has collected, and everything it needed to get there) – contrast sequential, where you can allocate a single workspace and just keep re-using it

After considering these factors, and looking at our typical request rate, we concluded that our best approach for both GPU and CPU would be to use sequential operations, using all the resources available to us (cores, etc) to make each individual operation as fast as possible. This also makes for a very simple CPU vs GPU race, while also making for some amusing CPU task-manager charts:

[screenshot: Task Manager showing all 72 logical cores busy]

(yes, that’s me punishing a 36-core, HT-enabled server for 72 logical cores of goodness)

The CPU bits

Recall from part 1 that we have two categories of query; trivially simple, and absurdly hard. I’m going to completely ignore the first set. The interesting case is the second, and in those scenarios you can pretty much guarantee that there is no single index that is ever going to fit your query, and we’re essentially trying to do a table-scan of a “where” clause, perhaps with some short-cuts. Let’s consider an example. Say we want to know all the “java and android” questions, sorted by activity. One of the main things the tag-engine stores is an index of “questions by tag” – i.e. given a tag, what questions exist in that tag (pre-sorted). Since the “and” makes this restrictive (intersection rather than union), what we can do is choose the smallest (“android”, which takes us to a manageable size) and then just test them all. For reasonable sizes of data, this can be much cheaper than trying to do complicated index combinations, and can be done with very little memory allocation. Note that we almost always want the total count and the “intersecting related tags” data, so it won’t help to cheat and just scroll forwards until we have enough data to return.
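As a loose illustration of the idea (this is not the real tag-engine code; questionsByTag and HasTag are placeholders I’ve made up):

// pick the smallest pre-sorted index, then test each candidate against the rest of the query
int[] candidates = questionsByTag["android"];   // the smaller tag: a manageable candidate list
var matches = new List<int>();
foreach (int questionId in candidates)
{
    if (HasTag(questionId, "java"))             // the remaining restriction(s) in the "and"
        matches.Add(questionId);                // the pre-sorted order is preserved for free
}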

Let’s say that “android” has 800,000 questions. If we want to parallelize this work (to use all of the available cores), it might be tempting to use Parallel.ForEach to outsource this work to the TPL and aggregate the results, but:

  • this has quite a lot of overhead in terms of doing lots of small things instead of a small number of big things
  • since the order is now unpredictable, it makes it very hard to exploit the fact that we have pre-sorted the data

Instead, what we can do is to carve these 800,000 questions into a small number of chunks of contiguous questions (8, say, or something related to the number of available cores), and then inside each chunk (unit of work): test each question in turn for the remaining conditions, writing successive matches in blocks to a similarly sized workspace.

[diagram: the candidate questions carved into contiguous chunks, with each chunk writing its matches into its own region of a similarly sized workspace]

By letting different threads process different blocks, we get lots of benefits:

  • the number of units-of-work for the threading library (TPL) to manage is minimal, reducing overheads
  • each unit-of-work is always looking at contiguous data, maximizing cache usage and memory locality
  • we get to exploit the fact that we’ve already sorted the data once – we never need to re-sort
  • we don’t have to synchronize when writing any of the results, since each unit-of-work is writing to a separate area of the workspace
  • all we need to know is how many matches each unit-of-work contained and we can step through the final workspace very efficiently

This is actually pretty simple to do with Parallel.Invoke - something like:

int chunks = DecideNumberOfChunks(questions, cpuCores);
int workPerChunk = (questions + chunks - 1) / chunks; // round up, so the chunks cover every question
WorkUnit[] workUnits = new WorkUnit[chunks];
Action[] actions = new Action[chunks];
for(int chunk = 0; chunk < chunks ; chunk++) {
    int start = chunk * workPerChunk,
        stop = Math.Min(start + workPerChunk, questions);
    workUnits[chunk] = new WorkUnit(start, stop, ...);
    actions[chunk] = workUnits[chunk].Execute;
}
Parallel.Invoke(actions);

Where each Execute method is essentially:

public void Execute() {
    int matches = 0, resultIndex = start;
    for(int i = start ; i < stop ; i++) {
        if(IsMatch(i)) { // some test or set of composite tests
            resultWorkspace[resultIndex++] = i;
            matches++;
        }
    }
    this.Matches = matches;
}

The above is a very simplified illustration of the design that drives the CPU implementation of the re-written tag-engine. The result is that it creates a packed set of contiguous matches for each unit-of-work, while allowing us to scale the query effectively over all the available CPU cores. The IsMatch method might be non-trivial, of course. We use a combination of meta-programming and special-cased tag-tests to allow us to support a wide range of queries; it works very well. So how does this compare to GPU? How does it change our approach?

CUDA: Kernels, Threads, Warps, Blocks and Grids

I’m going to run through some high level CUDA concepts now before showing any CUDA code, and before showing how it relates to the tag-engine.

When developing with CUDA, the first things you need to learn about are kernels, blocks and grids. A kernel is just a chunk of your code that you can invoke on the GPU device. But unlike regular code, when you “launch” a kernel, you aren’t usually asking for it to be run once; you’re asking for the exact same method to be run lots of times. As a trivial example, we could ask it to call the same kernel 800,000 times – once for each of our 800,000 “android” questions. A kernel designed to be called once per input element is sometimes called a “monolithic” kernel. But that isn’t quite the end of the story. Each separate invoke of the kernel is a “thread” – so in this monolithic case, we’d be launching 800,000 threads – but we don’t just ask for 800,000; instead we might ask it to launch 3125 “blocks”, each block consisting of 256 “threads”. Each of these numbers has limits – a block is limited to a maximum of 1024 threads on most current hardware, and you can have a maximum of 65535 blocks (per dimension, but that’s a separate story).

This means that for a single-dimension kernel, there’s a limit of about 67M, but don’t worry – I’ll cover how to get around that later. All I’m trying to do here is give an overview of the topology of what we’re playing in. The significance of these different concepts is that per-clock-cycle, each symmetric multiprocessor in a GPU actually works on multiple threads in the same block. This ability to operate on multiple threads at once is what makes GPUs so much more powerful (for some tasks) than CPUs. This group of threads that are being controlled in unison is a “warp” (the warp-size is 32 in current hardware).

[diagram: a single block of threads, executed warp by warp – 32 threads at a time – on a symmetric multiprocessor]

The above is then duplicated for however many blocks you asked for – 3125 blocks of 256 threads in our example, with a warp-size of 32. In simple examples we’re often just talking about 1 dimension, but note that CUDA blocks and threads are actually 3-dimensional; this means you can actually have 65535 x 65535 x 65535 blocks, with 1024 threads per block – but… you probably shouldn’t ever do that! That’s not the way to solve the “more than 67M” problem.

Why do I need to know?

There are a few reasons why you need to understand the topology; firstly, in addition to knowing the limits of what you can ask for, it impacts things like memory locality and cache. Remember in the CPU example we optimized by having each thread work on a different chunk of data? In CUDA, a warp of threads will be executing at once (per symmetric multiprocessor), and the infrastructure is built specifically to share memory lookup costs between a warp. Because of this, you usually want all the threads in a warp looking at adjacent data.

The “secondly” is perhaps even more important: the ability of a symmetric multiprocessor to progress multiple threads simultaneously is dependent upon each of those threads doing exactly the same thing. Which makes sense when you think about it. This doesn’t mean that your CUDA code can’t ever branch (if, etc - basically, any decision point), but it does mean that if different threads in a warp branch in different directions, then the symmetric multiprocessor has to identify the ones in different states and progress them separately. Which means: you kill the performance.

So: aim to work on adjacent data, and try to branch in a single direction.

How does this impact tag-engine matching?

If we go back to the index that we described in the CPU example, we are now essentially applying all our tests in parallel (at least, in a single kernel launch; how the hardware schedules it is up to the hardware). With this setup, we can’t really have the concept of “write to the next position in the result index and increment the write-position”, because all matches in a single CPU cycle would want to write to the same position. CUDA of course provides mechanisms for doing atomic increments, but if these collide you’re ultimately going to be branching and breaking the lock-step on the warps – and more importantly, since we don’t control which thread runs when, we’d actually be randomizing the results, which would require us to sort the data again.

So; instead of trying to write packed data, we’ll instead try to create a sparse vector of just the matches – so zero (or another sentinel) for data that didn’t match, and the key otherwise:

[diagram: the sparse result vector – zero for each non-matching question, the question id for each match]

What we’ll then do is pack that down to give just the non-zero data:

[diagram: the sparse vector packed down to just the non-zero matches, with the original order preserved]

Note that we’ve preserved the sort on the data, and now we’re in a position where we can just do a memory copy from the device (GPU) to the host (CPU) of just the page of results we want.
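Conceptually – and only conceptually, since on the device this packing is done in parallel rather than with a simple loop – the pack step is (results and packed are placeholders):

// illustrative CPU-side version of "pack the sparse vector"
int packedCount = 0;
for (int i = 0; i < results.Length; i++)
{
    int id = results[i];
    if (id != 0)
        packed[packedCount++] = id;   // non-matches (zeros) are skipped; order is preserved
}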

Show me some code!

OK, time for some CUDA. Note that there are many ways of creating CUDA binaries, and I intend showing much more on this next time – but for now, I’m using C code, since that works directly with NVIDIA’s tooling.

The first thing we need to do is to figure out what element we’re meant to be processing. CUDA makes available to us the block dimensions (blockDim), the logical position of the current block (blockIdx), and the logical position of the current thread inside the block (threadIdx). For a 1-dimensional kernel, this means that our actual index is:

int i = blockIdx.x * blockDim.x + threadIdx.x;

We can also assume that the kernel has access to the underlying data (I’ll demonstrate this next time), and somewhere to put the results. For simplicity, let's consider a test on the score of the post; what we want to do is essentially:

Question* q = allData + i; // pointer arithmetic 
results[i] = (q->score >= minScore && q->score <= maxScore) ? i : 0;

However, the observant might notice that both the short-circuiting "and" (&&) and the ternary-conditional (? :) are branching operations, and are best avoided. Fortunately, it isn't hard to rewrite this using non-branching equivalents. The "and" is simple - we'll just make it non-short-circuiting. The conditional is harder, but "true" in C is expressed by the number 1. This means that if we negate (numerically, not bitwise) the result of our boolean test we get 0 and -1; and -1 in two's-complement binary is "all the ones", so -1 & i is just i, while 0 & i is 0. This means we can just do a bitwise "and" between this and the number we want to store (i):

results[i] = (-(q->score >= minScore & q->score <= maxScore)) & i;

While it might not be as intuitive, this type of thinking is key to keeping warps in lock-step to get the maximum performance from the GPU.

OK, I see why this might be useful, but how well does it work in practice? Is it worth it?

Note: as I have tried to emphasize: all performance data is deeply dependent on your exact scenario and your exact implementation. I’m going to talk about what we’ve found, but please don’t use this to justify a big hardware spend: race your own horses.

All of this work would be purely academic if it didn’t help us. So we’ve spent a lot of time and effort comparing performance using our captured data between the CPU versions (v1 and v2) and the GPU version, using a range of devices. For local development purposes today, a GTX 980 is more than sufficient.

So without further ado: some numbers. Looking only at the “hard” queries (the easy queries aren’t a bottleneck), the 72-core high-end server that I showed in the task-manager screenshot above managed an average of 259.9 requests per second for the CPU version – 3.8ms per request, which is pretty respectable for non-trivial filters against 12 million records. The GPU version, however, managed nearly double that (507.7 requests per second) just on my development GTX 980. This is not “high end” in GPU terms. There are server devices like the K40, K80 (two K40s back-to-back with a shared power limit), and the M40 which are quite a bit more powerful, and in my testing the maximum throughput went up linearly as you added more GPUs. Given that a C4130 has space for 4 GPUs, that could give you 4 M40 devices or 8 K80 devices (two devices per package), to get a massive speedup compared to CPU limits.

And also note: the GTX 1080 (desktop) and P100 (server) are just around the corner, boasting the next generation of architecture and hopefully another huge jump in performance (I haven’t got my grubby hands on those yet).

But: from what we’ve seen, we’re keen to push to the next level and get some of this kit deployed into production.

My curiosity on CUDA is piqued; are you done? What next?

No, you don’t shut me up that easily! Next time, I’m going to be talking about things like grid-stride loops, async and CUDA streams, memory transfers, and showing how we can get access to all this goodness from my preferred every-day language: C#. I’ll also be walking you through some code that you can clone and play with to see it in action, rather than as text.

How I found CUDA, or: Rewriting the Tag Engine–part 1

(part 2, part 3)

This post is largely an introduction to set context so that the following few articles make sense… there won’t be any code here, but: well, take it or leave it :)

The Context – what is the tag engine?

At Stack Overflow / Stack Exchange, a lot of our pages relate to “tags” (topics). As an obvious example, a lot of users browse questions in specific technologies looking for new things to answer, or have feeds / email notifications configured for specific tags. Other users might be interested in all the newest questions, but absolutely never want to see another question that reminds them of their last job (by the way, if your job sucks, you should fix that – life is too short to be miserable). We also do things like showing “related tags” – essentially the counts of the intersections between technologies of what you’re looking at and other questions we know about.

All of this needs a non-trivial amount of processing and memory. Back in the day, it was sufficient to use our RDBMS for that (via some hacks that in turn left us some technical debt, but that is long gone now), but as we grew that simply wasn’t going to work. So after investigating a few options, out popped the “tag engine” – basically some bespoke code with a small set of jobs that we could run out-of-process to the main web-servers (so they don’t have to reload everything when we deploy / recycle).

So… life was good?

All was well. Sure, we had to fight a few things like GC, but… it worked. But as we grew, that code base started to become more and more of a limiting factor. It is nearly 5 years old now, and we’ve grown a lot in that time, and our needs have changed a lot in that time. The tag engine was never really “designed” so much as … “grew”. We gradually hacked in features we needed, and tweaked bits, and more or less it kept working. It was a bit ugly, but it wasn’t actually a problem. We’re pragmatists: we fix problems. We don’t fix things that aren’t problems.

Lately, it has been moving more and more from the “not a problem” camp to “problem”, so yay, more things to do. Performance was a key part of this, stemming in part from data volume, part from design choices - an overhaul was overdue.

Starting to think about GPUs

Around this time, I happened to see an email conversation from Daniel Egloff at QuantAlea, and it made me think about how much of the tag-engine might be suitable for GPU work. After a brief exchange with Daniel, I was satisfied that I wasn’t totally crazy. As a brief aside: QuantAlea were great and seemed keen to help us work on the problem. I think their tools show real promise, but for us we made the decision to keep everything in-house and do it ourselves. This is in part because our scenario is relatively simple. What I’m saying here is: they were really helpful, and if you’re interested in CUDA you might want to think about them, but: we didn’t go that way ourselves.

So what the hell is the tag engine doing? Why do you even need that level of crazy?

The interesting thing (to me) about the tag engine is that there are two very different scenarios. About half the queries tend to be embarrassingly trivial; and the other half are absurd. If you come at the tag-engine thinking “list the first 50 newest C# questions” – then: that’s the embarrassingly trivial side. This really doesn’t need a lot of work – just keep track of questions by tag, pre-sorted, and pick out the pages you need. Very fast, very simple. It just works. Although I’ll leave it as an exercise for the reader to think about how to generate the tag intersection cloud.

The real problem is the other half, which could be “page 200 (50 per page) of all 'java or .net or sql' questions, and everything that starts with ‘visual’, but never show me anything with php or fortran, and only show me questions with a score above 2 that were created after some date”. Lots of complex unions, restrictions, etc. You can do this type of thing with general purpose indexing tools (Elasticsearch for example), but a: we want the maximum performance, and b: some of our data is not really amenable to such tools – for example, some of our sort-orders are highly time dependent, which is complex and awkward for such tools (and may require lots of re-sends). Equally, some of the data we want to get back out (including, but not limited to, the tag intersection cloud) are not easy to do.

It is this second category of queries where the tag-engine shines; essentially, what it has to do is to keep some pre-built indexes and then try to figure out how best to exploit those indexes to answer a complex query – or, worst case, to do the moral equivalent of a table scan, and just walk the data.

OK, there’s a tricky technical problem; how could CUDA help? What the hell even is CUDA?

CUDA is an offshoot from the gaming world. In their efforts to make games that have high frame rates at high visual quality, the graphics card (GPU) vendors came up with a different approach to processing. Rather than having a relatively small number of very fast general purpose CPUs that switch constantly between doing 200 things, they went with a higher number of “symmetric multiprocessors” – not individually as fast as regular CPUs, but able to do the same thing many many times in parallel (think: SIMD gone mad), and with access to a large number of math processors (ALUs). NVIDIA correctly reasoned that this type of technology might be awesome for many computing tasks, so CUDA was developed as a framework to enable general purpose computing on the GPU, aka GPGPU. Of course, the bus between the device and main memory isn’t as fast as direct CPU<->RAM access (although they’re working hard on that problem), but: very powerful.

And as a result of this, GPU programming is very, very good at scenarios where you want to do a relatively simple and predictable operation many times, ideally where the data you need can be pushed onto the device up-front (and left there, updated only periodically), and where you only need to get back relatively small quantities of results. Doesn't this sound a lot like the second – harder - scenario we just described for the tag engine?

So…

That’s why we were intrigued with GPU programming on the tag-engine. The next post will be more technical, discussing the relative merits of CPU and GPU programming and how we might need to use different approaches to optimize each. And hopefully some code, yay! As a teaser though: it works great.

Sunday 1 May 2016

Using the Windows Subsystem for Linux to simplify CUDA builds

Recently, Microsoft announced the Windows Subsystem for Linux, aka Bash on Ubuntu on Windows. This is currently pre-release, and allows you to run linux tools inside Windows. In particular, this opens up a whole range of development tools and options. First, a caveat: this stuff is early release and only currently available on the “Fast Ring” of insider builds. If it works: great. If it doesn’t work: oh well. If it melts your PC, stops it booting, or causes skynet to launch the missiles: that’s tough too.

Now, it just so happens that I’ve been playing a lot with CUDA lately – it turns out that it works quite nicely for our “tag engine” back-end service (and I shall no doubt be blogging about my journey here very soon). If you’ve done any CUDA, you will know that there are two compilation options for compiling your CUDA kernel source-code into something that can run on your GPU – runtime (NVRTC) and build-time (NVCC). The runtime compiler is very convenient for fast iterations – edit a text file, reload your app – but: it doesn’t currently support one very important feature: dynamic parallelism. I’m hoping that the CUDA 8.0 Toolkit (due some time around August) might fix this, but until then, it means that I need to use NVCC for my kernels, since I happen to need this feature.

The next thing you’ll know if you’ve done any CUDA work is that NVCC on Windows is fussy. Not just a little fussy, but really fussy. Not only does it require specific versions of Visual Studio (that are several major versions behind what I usually have installed), but it also requires your command-line C++ environment to be configured perfectly, and even then (as I found out the hard way), it might still turn around and laugh in your face. Possibly for reasons, possibly just for the giggles. I eventually gave up on making NVCC work in Windows, so I configured a Ubuntu VM in Hyper-V, added an SMB share, and used the VM to build my kernels. It worked, but it was unsatisfactory. So I thought: can I use the new Windows goodies to fix this? And would you believe it: yes… ish.

Scope: note that for the purposes of this article I’m not going to try to explain what CUDA kernels are, or how to use NVCC. If you don’t happen to know these things, it will suffice to know that NVCC is a tool you need, it is really awkward to get it working on Windows, and quite a bit easier to get it working on linux.

It turns out that actually this was pretty easy; first we need to get Bash working:

1. enable the fast ring updates
2. be patient for a day or so until it deigns to offer to let you install the new Windows build
3. install the new Windows build and once again learn the true meaning of patience
4. enable developer mode in Windows settings:

[screenshot: enabling Developer mode in Windows settings]

5. turn on the new system in Windows features:

[screenshot: enabling the Windows Subsystem for Linux in the Windows Features dialog]

6. find and run the new tool

[screenshot: finding and launching the new Bash tool]

The first time you run this, it downloads some pieces and prompts for your new linux credentials. When it has finished installing itself, you get a Ubuntu 14.04 terminal:

[screenshot: the Ubuntu 14.04 terminal after installation]

Conveniently, it also mounts your Windows drives for you – so “C:” is “/mnt/c” etc.

Important note: ctrl-v to paste doesn’t currently work, but don’t worry: you don’t need to re-type lines – you can successfully paste via the window menu:

[screenshot: pasting via the window menu]

Right. So we have a vanilla (if minimal) Ubuntu installation. From my Hyper-V install, I know that NVCC on linux requires gcc 4.9, so we can follow a few lines from Ask Ubuntu to install this. We also need to install NVCC itself; I recommend not using the version from apt-get, as that is very old – I installed 7.5 using the notes on nvidia.com, which really just meant choosing my preferred install from here - I went for the network-enabled deb install:

[screenshot: the CUDA 7.5 download options, with the network-enabled deb install selected]

And remember: once you have downloaded the file, you can access it under /mnt/c/ or similar. The website tells you the full commands to run, so it isn’t a challenge even if you don’t know much linux. The installer takes quite a while to download and complete, so again: patience. It is also worth noting that when installing this way, no symlink or similar is added to the path – so you can either do that yourself, or just use the tool in the installation location of “/usr/local/cuda-7.5/bin/nvcc”:

[screenshot: running nvcc from /usr/local/cuda-7.5/bin/nvcc]

After that, compiling to cubin, fatbin, or ptx should just work. The actual commands aren’t very interesting and are described in the documentation. I just wrap my NVCC commands in a bash file (remembering to ensure newlines are LF only), and run that. Here we can see no “fatbin” before the script, executing the script, and the “fatbin” existing after the script:

[screenshot: the directory with no fatbin before the script, the script being executed, and the fatbin present afterwards]

So what have we done?

  • we have installed the linux subsystem
  • we have installed some linux tools and utilities
  • we have used those tools as part of a development build process, accessing our windows file system
  • all without needing a separate VM

The fact that it involved NVCC and CUDA is largely incidental; the same techniques could be used on many tools.

Minor confessions

1. Life doesn’t often run smoothly. I’m not much of a linux person, and it took me very little time to completely trash my linux install. Fortunately the nuclear reset option exists.

2. The fast ring installs are … temperamental. As it happens, it installed just fine on my travel laptop (hence @TRAVEL in the screens), but it silently failed to install on my main development desktop, and has now disappeared from the update UI … so I can’t find any way to get it installed. Unfortunately, my laptop doesn’t have a CUDA-enabled GPU, so actually this hasn’t saved me anything – I still need to use my Hyper-V VM for real work. But: it was still fun to get it working!