Saturday, 17 September 2016

Channelling my inner geek


I do a lot of IO programming – both from a client perspective (for example, StackExchange.Redis), and from a server perspective (for example, our custom web-sockets server usually has between 350k and 500k open connections, depending on how long it has been since we gave it a kick). So I’m always interested in new “players” when it comes to IO code, and a few weeks ago Microsoft’s David Fowler came through with some interesting goodies – specifically the “Channels” API.

First, a couple of disclaimers: everything I’m talking about here is completely experimental, evolving rapidly, depends on “corefxlab” concepts (i.e. stuff the .NET team is playing with to see what works well), and is virtually untested. Even the name is purely a working title. But let’s be bold and take a dive anyway!

What are Channels, and what problem do they solve?

The short description of Channels would be something like: “high performance zero-copy buffer-pool-managed asynchronous message pipes”. Which is quite a mouthful, so we need to break that down a bit. I’m going to talk primarily in the context of network IO software (aka network client libraries and server applications), but the same concepts apply equally to anything where data goes in or out of a stream of bytes. It is relatively easy to write a basic client library or server application for a simple protocol; heck, spin up a Socket via a listener, maybe wrap it in a NetworkStream, call Read until we get a message we can process, then spit back a response. But trying to do that efficiently is remarkably hard – especially when you are talking about high volumes of connections. I don’t want to get clogged down in details, but it quickly gets massively complicated, with concerns like:

  • threading – don’t want a thread per socket
  • connection management (listeners starting the read/write cycle when connections are accepted)
  • buffers to use to wait for incoming data
  • buffers to use to store any backlog of data you can’t process yet (perhaps because you don’t have an entire message)
  • dealing with partial messages (i.e. you read all of one message and half of another – do you shuffle the remainder down in the existing buffer? or…?)
  • dealing with “frame” reading/writing (another way of saying: how do we decode a stream in to multiple successive messages for sequential processing)
  • exposing the messages we parse as structured data to calling code – do we copy the data out into new objects like string? how “allocatey” can we afford to be? (don’t believe people who say that objects are cheap; they become expensive if you allocate enough of them)
  • do we re-use our buffers? and if so, how? lots of small buffers? fewer big buffers and hand out slices? how do we handle recycling?

Basically, before long it becomes an ugly mess of code. And almost none of this has any help from the framework. A few years ago I put together NetGain to help with some of this; NetGain is the code that drives our web-socket server, and has pieces to cover most of the above, so implementing a new protocol mostly involves writing the code to read and write frames. It works OK, but it isn’t great. Some bits of it are decidedly ropey.

But here’s the good news: Channels does all of these things, and does it better.

You didn’t really answer my question; what is a Channel?

Damn, busted. OK; let’s put it like this – a Channel is like a Stream that pushes data to you rather than having you pull. One chunk of code feeds data into a Channel, and another chunk of code awaits data to pull from the channel. In the case of a network socket, there are two Channels – one in each direction:

  • “input” for our application code:
    • the socket-aware code borrows a buffer from the memory manager and talks to the socket; as data becomes available it pushes it into the channel then repeats
    • the application code awaits data from the channel, and processes it
  • “output” for our application code:
    • our application may at any point (but typically as a response to a request) choose to write data to the channel
    • the socket-aware code awaits data from the channel, and pushes it to the network infrastructure

In most cases you should not expect to have to write any code at all that talks to the underlying socket; that piece is abstracted away into library code. But if you want to write your own channel backend you can. A small selection of backend providers have been written to explore the API surface (mostly focusing on files and networking; there are actually already 3 different backend ‘sockets’ libraries: libuv, Windows RIO, and async managed sockets). So let’s focus on what the application code might look like, with the example of a basic “echo” server:

IChannel channel = ...
while (true)
    // await data from the input channel
    var request = await channel.Input.ReadAsync();

    // check for "end of stream" condition
    if (request.IsEmpty && channel.Input.Completion.IsCompleted)

    // start writing to the output channel
    var response = channel.Output.Alloc();
    // pass whatever we got as input to the output
    response.Append(ref request);
    // finish writing (this should activate the socket driver)
    await response.FlushAsync();
    // tell the engine that we have read all of that data
    // (note: we could also express that we read *some* of it,
    // in which case the rest would be retained and supplied next time)

What isn’t obvious here is the types involved; “request” is a ReadableBuffer, which is an abstraction over one or more slices of memory from the buffer pool, from which we can consume data. “response” is a WritableBuffer, which is again an abstraction over any number of slices of memory from the buffer pool, to which we can write data (with more blocks being leased as we need them). The idea is that you can “slice” the request as many times as you need without allocation cost:

var prefix = request.Slice(0, 4);
var len = prefix.Read<int>();
var payload = request.Slice(4, len);...

Note that ReadableBuffer is a value-type, so slicing it like this is not allocating - the new value simply has a different internal tracking state of the various buffers. Each block is also reference counted, so the block is only made available back to the pool when it has been fully released. Normally this would happen as we call “Consumed()”, but it has more interesting applications. The “echo” server already gave an example of this – the “Append” call incremented the counter, which means that the data remains valid even after “request.Consumed()” has been called. Instead, it is only when the socket driver says that it too has consumed it, that it gets released back to the pool. But we can do the same thing ourselves to expose data to our application without any copies; our previous example can become:

var prefix = request.Slice(0, 4);
var len = prefix.Read();
var payload = request.Slice(4, len).Preserve();...

Here “Preserve” is essentially “increment the reference counter”. We can then pass that value around and the data will be valid until we “Dispose()” it (which will decrement the counter, releasing it to the pool if it becomes zero). Some of you might now be twitching uncontrollably at the thought of explicit counting, but for high volume scenarios it is far more scalable than garbage collection, allocations, finalizers, and all that jazz.

So where did the zero copy come in?

A consequence of this is that the memory we handed to the network device is the exact same memory that our application code can access for values. Recall that the socket-aware code asked for a buffer from the pool and then handed the populated data to the channel. That is the exact same blocks as the block that “payload” is accessing. Issues like data backlog is handled simply by only releasing blocks once the consumer has indicated that they’ve consumed all the data we know valid on that block; if the caller only reads 10 bytes, it just slices into the existing block. Of course, if the application code chooses to access the data in terms of “string” etc, there are mechanisms to read that, but the data can also be accessed via the new “Span<byte>” API, and potentially as “Utf8String” (although there are some complications there when it comes to data in non-contiguous blocks).


It looks really really exciting for anyone involved in high performance applications – as either a client or server. Don’t get me wrong: there’s no “magic” here; most of the tricks can already be done and are done, but it is really hard to do this stuff well. This API essentially tries to make it easy to get it right, but at the same time has performance considerations at every step by some folks who really really care about absurd performance details.

But is it practical? How complete is it?

It is undeniably very early days. But is is enough to be useful. There are a range of trivially simple examples in the repository, and some non-trivial; David Fowler is looking to hack the entire thing underneath ASP.NET to see what kind of difference it makes. At the less ambitious end of things, I have re-written the web-sockets code from NetGain on top of the Channels API; and frankly, it was way less work than you might expect. This means that there is a fully working web-sockets server and web-sockets client. I’ve also been playing with the StackExchange.Redis multiplexed redis client code on top of Channels (although that is far less complete). So it is very definitely possible to implement non-trivial systems. But be careful: the API is changing rapidly – you definitely shouldn’t be writing your mission critical pacemaker monitoring and ambulance scheduling software on top of this just yet!

There’s also a lot of missing pieces – SSL is a huge hole that is going to take a lot of work to fill; and probably a few more higher level abstractions to make the application code even simpler (David has mentioned a frame encoder/decoder layer, for example). The most important next step is to explore what the framework offers in terms of raw performance: how much “better” does it buy you? I genuinely don’t have a simple answer to that currently, although I’ve been trying to explore it.

Can I get involved?

If you want to explore the API and help spot the missing pieces, the ugly pieces, and the broken pieces – and perhaps even offer the fixes, then the code is on GitHub and there’s a discussion room. Note: nothing here is up to me – I’m just an interested party; David Fowler is commander in chief on this.

And what next?

And then, once all those easy hurdles have been jumped – then starts the really really hard work: figuring out what, in terms of formal API names, to call it.

Monday, 9 May 2016

CUDA–getting started in .NET


In my two previous posts (part 1, part 2) I described how we have found uses for CUDA. I am, however, aware that this abstract text doesn’t necessarily help the reader (aka “you”) make inroads into the world of CUDA. So: the purpose of this post is to take the concepts discussed previously and put them together into some actual code.

Firstly, a word on OpenCL: so far, I’ve spoken about CUDA – NVIDIA’s framework for GPGPU programming. Other frameworks exist, and the largest “other” is undoubtedly OpenCL. This framework is designed to let you use the same code over a range of technologies – GPUs from different providers, CPUs, etc. I am aware of it – I just haven’t even tried to address it. If that’s your bag: have fun. I don’t feel hugely limited by restricting myself to CUDA for what I need.

The first things you need for CUDA, then, are:

  • a CUDA capable GPU device
  • the CUDA toolkit
  • a programming environment

Since I’m focused on .NET, we’re also going to need one of the tools for interacting with CUDA from .NET, or we’ll need to write extern wrappers for the C API. A range of CUDA tools for .NET exist, including CUDAfy.NET, managedCUDA, Alea GPU / QuantAlea, Campy.NET, and several others that Campy.NET list and describe on their home page. These fall into two major categories:

  • wrappers for the driver API (you write your kernels in C etc and launch them from C#)
  • IL-to-PTX generators (you write and launch your kernels in C#)

While I’m a huge fan of C#, I’m also a huge fan of keeping close to the metal, and in a heartbeat will sacrifice some programming convenience for performance.

Our first kernel

So without further ado, I’m going to jump straight in with a basic CUDA kernel written in C. Rather than trying to discuss the tag-engine, we’re going to do something simpler… like multiplying numbers. Multiplication is hard, right?

typedef struct {
    int Id;
    unsigned int Value;
} SomeBasicType;

extern "C" {
    __global__ void Multiply(const int N, SomeBasicType* __restrict data, int factor) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
i += blockDim.x * gridDim.x)
{ (data + i)->Value *= factor; } } }

The first thing this does is to define a data-type (SomeBasicType) for us to share data between the CPU and GPU code. The __global__ indicates that we’re declaring a kernel, along with a given name and signature. We’re passing in N, the number of elements to process, data, a pointer to some of the data to work against, and factor - the number to multiply the values by. The significance of the N comes into play when we see the i < N in the for loop on the next line. Recall that we usually launch a number of blocks, each consisting of a set number of threads. It is not always the case that the number of elements to process is a convenient multiple of some block size, so it is common for kernels to be passed a count to compare the position against, exiting if the current thread is not needed. This sounds like a contradiction to the “don’t branch” guidance, but as long as most of the threads make the same choice, this isn’t going to be a performance problem.

Recall also that I described a “monolithic kernel” earlier, where-by we launch a thread for every element to process. But: there is a hard limit to the number of threads in a block, and the number of blocks in any dimension – and as always: the more threads there are, the more management overhead there is (although this works differently to CPU threads). As such, it is common to use a “grid-stride loop” rather than a monolithic kernel. As in our example, a for loop is used to that each thread processes multiple elements. But unlike the CPU version shown previously (where each thread processed a separate chunk of data), we instead have each thread in an iteration process adjacent elements, and then move forwards by the stride – namely the width of each block multiplied by the number of blocks. This has multiple side benefits:

  • it can cope with data of any size – it is not subject to the 75M limitation
  • it requires less overhead for the GPU engine
  • it can still be used as a monolithic kernel (by setting the dimensions such that it only performs one iteration per thread), or as a single-threaded debugging kernel that processes all the data (by setting the dimensions to 1 thread in 1 block)
  • the work being done by a warp or block is contiguous, minimizing memory access overheads

Finally, we get to the actual meat of the kernel, which does an in-place multiplication of one of the fields of our data by one of the parameters.

That’s our first kernel, all done!

A side note on data width

You’ll note that the above example is doing 32-bit arithmetic. For doing serious CUDA work, it is important to be aware of whether you are doing 16-bit, 32-bit or 64-bit arithmetic, as it can significantly impact your choice of hardware – considering them in release order:

  • the “Kepler” architecture (server devices: K40, K80) will serve as a baseline – let’s arbitrarily call it “good” at both 32-bit and 64-bit; 16-bit operations are performed using 32-bit arithmetic
  • the “Maxwell” architecture (server device: M40) is faster than Kepler at 32-bit operations, but is relatively poor at 64-bit operations; 16-bit operations are performed using 32-bit arithmetic
  • to complicate this: the fact that the K80 packs two devices on a physical package means that even though a single M40 may be faster than a single K80 device, code that appropriately uses both K80 devices on a package may still out-perform a single M40
  • the “Pascal” architecture apparently (not yet available for testing; server device: P100) significantly increases both 32-bit and 64-bit performance, while also introducing specific optimized instructions for 16-bit processing

Because of this, there is no single way of saying “this device is faster” – it depends a lot on what exactly you are doing!

Compiling our kernel (and: hello managedCUDA)

A kernel as a text source-file isn’t much use by itself. We need to convert that into GPU-runnable code – either intermediate (“PTX”) or fully compiled to bytecode (“cubin”). There is also a middle-ground that combines both bytecode and intermediate instructions – a “fatbin”; this allows a compiled kernel to target multiple physical architectures and still allow usage on other architectures via a JIT step. There are two common ways of compiling a C kernel – compile-time (NVCC) and runtime (NVRTC). I’ve spoken previously about using NVCC, so for this example I’ll use NVRTC instead.

Since we want to do this from .NET, we need a wrapper library. My examples today are going to use managedCUDA, written by Michael Kunz. This is available on NuGet and is licenced under the LGPL – which I understand (and: IANAL) is not the same as the notoriously infectious and restrictive GPL. What I particularly like about this library is that in addition to exposing an idiomatic .NET object model that maps to the C API, it also exposes the raw C API directly – which makes it convenient to use even when the method you want doesn’t map naturally to a .NET concept.

So let’s load our kernel from a text file and compile it at runtime:

string path = "MyKernels.c";
ManagedCuda.NVRTC.nvrtcResult result;
using (var rtc = new ManagedCuda.NVRTC.CudaRuntimeCompiler(
File.ReadAllText(path), Path.GetFileName(path))) { try { rtc.Compile(new string[0]); result = ManagedCuda.NVRTC.nvrtcResult.Success; } catch(ManagedCuda.NVRTC.NVRTCException ex) { result = ex.NVRTCError; } log = rtc.GetLogAsString(); if (result == ManagedCuda.NVRTC.nvrtcResult.Success) { byte[] ptx = rtc.GetPTX(); // ... } }

This takes our file contents as input, and (if successful) produces a blob that contains the compiled kernel. We can also obtain the log which will contain detained error messages if our kernel is invalid.

The CUDA context

One of the key objects in CUDA is the context. This wraps the state for a long-running CUDA scenario. Most operations, including memory allocation and kernel launches typically happen within a CUDA context. managedCUDA exposes a class that wraps this. Note:

  • a CUDA context is bound to a particular GPU device, and it is when creating the CUDA context that you can target specific devices
  • a CUDA context is associated with the CPU thread that creates it; as such, CUDA code would work well for scenarios such as a dedicated worker thread processing items from a queue, but would not work well from .NET async/await code that switches between multiple threads

So; let’s create our context and load the kernel:

CudaContext ctx = new CudaContext(deviceId, true);
CudaKernel multiply = ctx.LoadKernelFatBin(ptx, "Multiply");

Note that the context (like many CUDA objects) is IDisposable, and you should ensure to Dispose() it properly, via using or otherwise.

CUDA streams and asynchronous processing

In the simplest usage, the CUDA methods are blocking, however it is often the case that either the CPU can do other useful work while the GPU churns, or we can overlap multiple GPU operations – memory transfers running in parallel with kernel execution, or (depending on the hardware) even multiple kernels running at the same time. To allow this, CUDA introduces the notion of a “stream”. Work on the same stream is processed in order, but the work on different streams may overlap. I’m not going to make extensive usage of streams in this walkthrough, but it is easier to include stream considerations from the start than to add it later.

Initializing memory at the host and on the device

Traditionally, the host (CPU) memory and the device (GPU) memory were completely separate, and all CUDA code had to specifically be cautious of whether it was referring to host or device memory. More recently, the concept of “unified memory” has been introduced that hides this distinction, and performs transfers automatically as needed – it is intended to make it simpler to use the API, but for the maximum performance it is recommended to take full control over allocation and copy. To that end, I’ll stick to the traditional approach, which means explicitly allocating the data at both ends.

We’ll start by allocating the host memory:

struct SomeBasicType {
    public int Id;
    public uint Value;

IntPtr hostPointer = IntPtr.Zero;
var res = DriverAPINativeMethods.MemoryManagement.cuMemAllocHost_v2(
    ref hostPointer, count * sizeof(SomeBasicType));
if (res != CUResult.Success) throw new CudaException(res);
SomeBasicType* hostBuffer = (SomeBasicType*)hostPointer;

As before, we declare a raw data type to share the data between CPU and GPU. Here is one of those cases where I’m making use of the raw C API rather than the wrapped objects, as this (IMO) more conveniently allows naked allocations. You might think “hang on, you’re just allocating some memory – surely a .NET array would suffice, or maybe at worst Marshal.AllocHGlobal” – but: in order for the CUDA memory-copy engine to work at maximum performance (avoiding double-copies, etc) it is required to be configured in a specific way, and the best way to guarantee that is to let the CUDA API allocate the memory in the first place. A pinned .NET array certainly wouldn’t suffice. Or at least, it would work – but not as fast. Yes, this needs some unsafe code; it’ll be fine, don’t panic.

Once allocated, we can intialize this memory with some invented data:

for (int i = 0; i < count; i++) {
    hostBuffer[i].Id = i;
    hostBuffer[i].Value = (uint)i;

Next we’ll want to allocate a similar block of memory on the device, and start copying the data from the host to the device. Emphasis: large memory transfers are fast, but not instant – if possible you should minimize the amount of data you need to transfer for individual operations. The ideal scenario is to copy the data to the device once / periodically, and use it many times once it is there.

CudaDeviceVariable<SomeBasicType> deviceBuffer = new CudaDeviceVariable<SomeBasicType>(count);
CudaStream defaultStream = new CudaStream();
res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyHtoDAsync_v2(deviceBuffer.DevicePointer,
    hostPointer, deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

This time I'm using the CudaDeviceVariable<T> object to represent the memory at the device, which is perfectly convenient since we will not usually need to do anything except access the DevicePointer to pass into methods. Once again I'm using a raw C-style API for the actual memory copy. There are some transfer methods exposed directly on the object, but this particular combination is not exposed directly. Note that because we have created and used a stream, this is non-blocking. Related operations sent to the same stream will be queued behind this one.

Launching our kernel

We have data on the device; we have a kernel that we have compiled and loaded. Now finally to run the kernel against the data! First we must configure the dimensions – the number of threads in a block and the number of blocks:

multiply.BlockDimensions = new ManagedCuda.VectorTypes.dim3(threadsPerBlock, 1, 1);
multiply.GridDimensions = new ManagedCuda.VectorTypes.dim3(blockCount, 1, 1);

multiply.RunAsync(defaultStream.Stream, new object[] {
    count, deviceBuffer.DevicePointer, value

Note that the parameters to RunAsync (other than the stream) match the signature on the kernel itself. value in the above is the factor to multiply by. Once we've got the data in place, running kernels is actually alarmingly simple!

Getting results back

It isn’t usually sufficient to run kernels – sadly we usually expect to get results. As mentioned before, we should strive to minimize the amount of data we want to transfer, but for simplicity in this example I’m going to fetch back the entire buffer, over-writing the previous contents. In real code you might be extracting only small portions of the buffer, or summary / aggregate data.

var res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyDtoHAsync_v2(
    new IntPtr(hostBuffer), deviceBuffer.DevicePointer,
    deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

Once more, for memory operations it has been easier to use the raw C API than the wrapped managed API, and that's OK. Note that because we've been using streams, at this point we've only queued the copy. If we've set all our wheels in motion, and want to sit back and wait while they turn, we can use either ctx.Synchronize(); (to wait for all streams to complete) or defaultStream.Synchronize() (to wait for a specific stream to complete). Finally, when that is done, we can iterate over the data and congratulate ourselves for managing to over-complicate the multiplication of numbers!

Not enough code for you?

Everything I’ve shown here is available in a github project, that you can clone, fork, etc. I don’t claim it is the most useful tool ever, but hopefully it might help a few people get up and running with CUDA.


How I found CUDA, or: Rewriting the Tag Engine–part 2

(part 1, part 3)

So we got to work…

We knew we had a problem where CUDA might be part of an answer, especially for performance. And in reality, there’s only one way to find out which horse is faster. You race the horses. We knew that the v1 tag-engine was poorly designed (organically, so to speak), so it wouldn’t be meaningful to compare that for performance purposes. And even if the GPU is better, we still wanted a better CPU implementation:

  • it needs to run on developer local machines; not all developers have a CUDA device, especially if they are on a laptop, or on a VM, or on a VM on a laptop
  • in the event of the tag-engine servers being unreachable, our fallback strategy is to load the tag-engine in-memory on the production web-servers, so it needs to be able to run there
  • if the GPU approach turns out to not be what we want, we still want to have moved our code forwards
  • and if the GPU approach turns out to be exactly what we want, then we’d have physical factors like server hardware, cage configuration, etc to add lead-time, where-as we’d like to replace the code ASAP

So: we definitely wanted a shiny new CPU version, but one that was designed to work nicely side-by-side with the GPU version. The other thing you need to race your horses is: a course. The best way of representing your load is to capture your load, in a way that you can play back. So the next thing we did was to edit our existing tag-engine to make it possible to record all the real requests it was serving, in a way that we could play back and use to compare performance, features, and results. Helpfully, tag-engine requests don’t contain any PII data – note that this isn’t possible in all circumstances.

Parallel or sequential?

Next up, we need to think about how GPUs work. CPUs are often (not always) used to perform multiple independent operations in parallel. GPUs, by contrast, are usually used to make a single operation (per device) happen really quickly, processing the operations sequentially. The parallel approach, while scalable, has a lot of complications:

  • you need to be careful how you perform data updates without breaking running operations (or: you need to duplicate the data to perform updates) – contrast sequential, where you can just squeeze the update in as just one more thing in the queue (it is never fighting other requests), and update the data in-place with impunity
  • each parallel operation needs memory for it’s processing (the results it has collected, and everything it needed to get there) – contrast sequential where you can allocate a single workspace and just keep re-using it

After considering these factors, and looking at our typical request rate, we concluded that our best approach for both GPU and CPU would be to use sequential operations, using all the resources available to us (cores, etc) to make each individual operation as fast as possible. This also makes for a very simple CPU vs GPU race, while also making for some amusing CPU task-manager charts:

(yes, that’s me punishing a 36-core, HT-enabled server for 72 logical cores of goodness)

The CPU bits

Recall from part 1 that we have two categories of query; trivially simple, and absurdly hard. I’m going to completely ignore the first set. The interesting case is the second, and in those scenarios you can pretty much guarantee that there is no single index that is ever going to fit your query, and we’re essentially trying to do a table-scan of a “where” clause, perhaps with some short-cuts. Let’s consider an example. Say we want to know all the “java and android” questions, sorted by activity. One of the main things the tag-engine stores is an index of “questions by tag” – i.e. given a tag, what questions exist in that tag (pre-sorted). Since the “and” makes this restrictive (intersection rather than union), what we can do is choose the smallest (“android”, which takes us to a manageable size) and then just test them all. For reasonable sizes of data, this can be much cheaper than trying to do complicated index combinations, and can be done with very little memory allocation. Note that we almost always want the total count and the “intersecting related tags” data, so it won’t help to cheat and just scroll forwards until we have enough data to return.

Let’s say that “android” has 800,000 questions. If we want to parallelize this work (to use all of the available cores), it might be tempting to use Parallel.ForEach to outsource this work to the TPL and aggregate the results, but:

  • this has quite a lot of overhead in terms of doing lots of small things instead of a small number of big things
  • since the order is now unpredictable, it makes it very hard to exploit the fact that we have pre-sorted the data

Instead, what we can do is to carve these 800,000 questions into a small number of chunks of contiguous questions (8, say, or something related to the number of available cores), and then inside each chunk (unit of work): test each question in turn for the remaining conditions, writing successive matches in blocks to a similarly sized workspace.


By letting different threads process different blocks, we get lots of benefits:

  • the number of units-of-work for the threading library (TPL) to manage is minimal, reducing overheads
  • each unit-of-work is always looking at contiguous data, maximizing cache usage and memory locality
  • we get to exploit the fact that we’ve already sorted the data once – we never need to re-sort
  • we don’t have to synchronize when writing any of the results, since each unit-of-work is writing to a separate area of the workspace
  • all we need to know is how many matches each unit-of-work contained and we can step through the final workspace very efficiently

This is actually pretty simple to do with Parallel.Invoke - something like:

int chunks = DecideNumberOfChunks(questions, cpuCores);
int workPerChunk = questions / chunks;
WorkUnit[] workUnits = new WorkUnit[chunks];
Action[] actions = new Action[chunks];
for(int chunk = 0; chunk < chunks ; chunk++) {
    int start = chunk * workPerChunk,
        stop = Math.Min(start + workPerChunk, questions);
    workUnits[chunk] = new WorkUnit(start, stop, ...);
    actions[chunk] = workUnits[chunk].Execute;

Where each Execute method is essentially:

public void Execute() {
    int matches = 0, resultIndex = start;
    for(int i = start ; i < stop ; i++) {
        if(IsMatch(i)) { // some test or set of composite tests
            resultWorkspace[resultIndex++] = i;
    this.Matches = matches;

The above is a very simplified illustration of the design that drives the CPU implementation of the re-written tag-engine. The result is that it creates a packed set of contiguous matches for each unit-of-work, while allowing us to scale the query effectively over all the available CPU cores. The IsMatch method might be non-trivial, of course. We use a combination of meta-programming and special-cased tag-tests to allow us to support a wide range of queries; it works very well. So how does this compare to GPU? How does it change our approach?

CUDA: Kernels, Threads, Warps, Blocks and Grids

I’m going to run through some high level CUDA concepts now before showing any CUDA code, and before showing how it relates to the tag-engine.

When developing with CUDA, the first things you need to learn about are kernels, blocks and grids. A kernel is just a chunk of your code that you can invoke on the GPU device. But unlike regular code, when you “launch” a kernel, you aren’t usually asking it to call it once; you’re asking it to call the exact same method lots of times. As a trivial example, we could ask it to call the same kernel 800,000 times – once for each of our 800,000 “android” questions. A kernel designed to be called once per input element is sometimes called a “monolithic” kernel. But that isn’t quite the end of the story. Each separate invoke of the kernel is a “thread” – so in this monolithic case, we’d be launching 800,000 threads – but we don’t just ask for 800,000 – instead we might ask it to launch 3125 “blocks”, with each block consisting of 256 “threads”. Each of these numbers has limits – a block is limited to a maximum of 1024 threads on most current hardware, and you can have a maximum of 65535 blocks (per dimension, but that’s a separate story).

This means that for a single-dimension kernel, there’s a limit of about 67M, but don’t worry – I’ll cover how to get around that later. All I’m trying to do here is give an overview of the topology of what we’re playing in. The significance of these different concepts is that per-clock-cycle, each symmetric multiprocessor in a GPU actually works on multiple threads in the same block. This ability to operate on multiple threads at once is what makes GPUs so much more powerful (for some tasks) than CPUs. This group of threads that are being controlled in unison is a “warp” (the warp-size is 32 in current hardware).


The above is then duplicated for however many blocks you asked for – 3125 blocks of 256 threads in our example, with a warp-size of 32. In simple examples we’re often just talking about 1 dimension, but note that CUDA blocks and threads are actually 3-dimensional; this means you can actually have 65535 x 65535 x 65535 blocks, with 1024 threads per block – but… you probably shouldn’t ever do that! That’s not the way to solve the “more than 67M” problem.

Why do I need to know?

There are a few reasons why you need to understand the topology; firstly, in addition to knowing the limits of what you can ask for, it impacts things like memory locality and cache. Remember in the CPU example we optimized by having each thread work on a different chunk of data? In CUDA, a warp of threads will be executing at once (per symmetric multiprocessor), and the infrastructure is built specifically to share memory lookup costs between a warp. Because of this, you usually want all the threads in a warp looking at adjacent data.

The “secondly” is perhaps even more important: the ability of a symmetric multiprocessor to progress multiple threads simultaneously is dependent upon each of those threads doing exactly the same thing. Which makes sense when you think about it. This doesn’t mean that your CUDA code can’t ever branch (if, etc - basically, any decision point), but it does mean that if different threads in a warp branch in different directions, then the symmetric multiprocessor has to identify the ones in different states and progress them separately. Which means: you kill the performance.

So: aim to work on adjacent data, and try to branch in a single direction.

How does this impact tag-engine matching?

If we go back to the index that we described in the CPU example, we are now essentially applying all our tests in parallel (at least, in a single kernel launch; how the hardware schedules it is up to the hardware). With this setup, we can’t really have the concept of “write to the next position in the result index and increment the write-position”, because all matches in a single CPU cycle would want to write to the same position. CUDA of course provides mechanisms for doing atomic increments, but if these collide you’re ultimately going to be branching and breaking the lock-step on the warps – and more importantly, since we don’t control which thread runs when, we’d actually be randomizing the results, which would require us to sort the data again.

So; instead of trying to write packed data, we’ll instead try to create a sparse vector of just the matches – so zero (or another sentinel) for data that didn’t match, and the key otherwise:


What we’ll then do is pack that down to give just the non-zero data:


Note that we’ve preserved the sort on the data, and now we’re in a position where we can just do a memory copy from the device (GPU) to the host (CPU) of just the page of results we want.

Show me some code!

OK, time for some CUDA. Note that there are are many ways of creating CUDA binaries, and I intend showing much more on this next time – but for now, I’m using C code, since that works directly with NVIDIA’s tooling.

The first thing we need to do is to figure out what element we’re meant to be processing. CUDA makes available to us the block dimensions (blockDim), the logical position of the current block (blockIdx), and the logical position of the current thread inside the block (threadIdx). For a 1-dimensional kernel, this means that our actual index is:

int i = blockIdx.x * blockDim.x + threadIdx.x;

We can also assume that the kernel has access to the underlying data (I’ll demonstrate this next time), and somewhere to put the results. For simplicity, let's consider a test on the score of the post; what we want to do is essentially:

Question* q = allData + i; // pointer arithmetic 
results[i] = (q->score >= minScore && q->score <= maxScore) ? i : 0;

However, the observant might notice that both the short-circuiting "and" (&&) and the ternary-conditional (? :) are branching operations, and are best avoided. Fortunately, it isn't hard to rewrite this using non-branching equivalents. The "and" is simple - we'll just make it non-short-curcuiting. The conditional is harder, but "true" in C is expressed by the number 1. This means that if we negate (numerically, not bitwise) the result of our boolean test we get 0 and -1; and -1 in two's-complement binary is "all the ones". This means we can just do a bitwise "and" between this and the number we want to store (i):

results[i] = (-(q->score >= minScore & q->score <= maxScore)) & i;

While it might not be as intuitive, this type of thinking is key to keeping warps in lock-step to get the maximum performance from the GPU.

OK, I see why this might be useful, but how well does it work in practice? Is it worth it?

Note: as I have tried to emphasize: all performance data is deeply dependent on your exact scenario and your exact implementation. I’m going to talk about what we’ve found, but please don’t use this to justify a big hardware spend: race your own horses.

All of this work would be purely  academic if it didn’t help us. So we’ve spent a lot of time and effort comparing performance using our captured data between the CPU versions (v1 and v2) and the GPU version, using a range of devices. For local development purposes today, a GTX 980 is more than sufficient.

So without further ado: some numbers. Looking only at the “hard” queries (the easy queries aren’t a bottleneck), the 72-core high-end server that I showed in the task-manager screenshot above managed and average of 259.9 requests per second for the CPU version – 3.8ms per request, which is pretty respectable for non-trivial filters against 12 million records. The GPU version, however, managed nearly double that (507.7 requests per second) just on my development GTX 980. This is not “high end” in GPU terms. There are server devices like the K40, K80 (two K40s back-to-back with a shared power limit), and the M40 which are quite a bit more powerful, and in my testing the maximum throughput went up linearly as you added more GPUs. Given that a C4130 has space for 4 GPUs, that could give you 4 M40 devices or 8 K80 devices (two devices per package), to get a massive speedup  compared to CPU  limits.

And also note: the GTX 1080 (desktop) and P100 (server) around the corner boasting the next generation of architecture and hopefully another huge jump in performance (I haven’t got my grubby hands on those yet).

But: from what we’ve seen, we’re keen to push to the next level and get some of this kit deployed into production.

My curiosity on CUDA is piqued; are you  done? what next?

No, you don’t shut me up that easily! Next time, I’m going to be talking about things like grid-stride loops, async and CUDA streams, memory transfers, and showing how we can get access to all this goodness from my preferred every-day language: C#. I’ll also be walking you through some code that you can clone and play with to see it in action, rather than as text.

How I found CUDA, or: Rewriting the Tag Engine–part 1

(part 2, part 3)

This post is largely an introduction to set context so that the following few articles make sense… there won’t be any code here, but: well, take it or leave it :)

The Context – what is the tag engine?

At Stack Overflow / Stack Exchange, a lot of our pages relate to “tags” (topics). As an obvious example, a lot of users browse questions in specific technologies looking for new things to answer, or have feeds / email notifications configured for specific tags. Other users might be interested in all the newest questions, but absolutely never want to see another question that reminds them of their last job (by the way, if your job sucks, you should fix that – life is too short to be miserable). We also  do things like showing “related tags” – essentially the counts of the intersections between technologies of what you’re looking at and other questions we know about.

All of this needs a non-trivial amount of processing and memory. Back in the day, it was sufficient to use our RDBMS for that (via some hacks that in turn left us some technical debt, but that is long gone now), but as we grew that simply wasn’t going to work. So after investigating a few options, out popped the “tag engine” – basically some bespoke code with a small set of jobs that we could run out-of-process to the main web-servers (so they don’t have to reload everything when we deploy / recycle).

So… life was good?

All was well. Sure, we had to fight a few things like GC, but… it worked. But as we grew, that code base started to become more and more of a limiting factor. It is nearly 5 years old now, and we’ve grown a lot in that time, and our needs have changed a lot in that time. The tag engine was never really “designed” so much as … “grew”. We gradually hacked in features we needed, and tweaked bits, and more or less it kept working. It was a bit ugly, but it wasn’t actually a problem. We’re pragmatists: we fix problems. We don’t fix things that aren’t problems.

Lately, it has been moving more and more from the “not a problem” camp to “problem”, so yay, more things to do. Performance was a key part of this, stemming in part from data volume, part from design choices - an overhaul was overdue.

Starting to think about GPUs

Around this time, I happened to see an email conversation from Daniel Egloff at QuantAlea, and it made me think about how much of the tag-engine might be suitable for GPU work. After a brief exchange with Daniel, I was satisfied that I wasn’t totally crazy. As a brief aside: QuantAlea were great and seemed keen to help us work on the problem. I think their tools show real promise, but for us we made the decision to keep everything in-house and do it ourselves. This is in part because our scenario is relatively simple. What I’m saying here is: they were really helpful, and if you’re interested in CUDA you might want to think about them, but: we didn’t go that way ourselves.

So what the hell is the tag engine doing? Why do you even need that level of crazy?

The interesting thing (to me) about the tag engine is that there are two very different scenarios. About half the queries tend to be embarrassingly trivial; and the other half are absurd. If you come at the tag-engine thinking “list the first 50 newest C# questions” – then: that’s the embarrassingly trivial side. This really doesn’t need a lot of work – just keep track of questions by tag, pre-sorted, and pick out the pages you need. Very fast, very simple. It just works. Although I’ll leave it as an exercise for the reader to think about how to generate the tag intersection cloud.

The real problem is the other half, which could be “page 200 (50 per page) of all 'java or .net or sql' questions, and everything that starts with ‘visual’, but never show me anything with php or fortran, and only show me questions with a score above 2 that were created after some date”. Lots of complex unions, restrictions, etc. You can do this type of thing with general purpose indexing tools (Elasticsearch for example), but a: we want the maximum performance, and b: some of our data is not really amenable to such tools – for example, some of our sort-orders are highly time dependent, which is complex and awkward for such tools (and may require lots of re-sends). Equally, some of the data we want to get back out (including, but not limited to, the tag intersection cloud) are not easy to do.

It is this second category of queries where the tag-engine shines; essentially, what it has to do is to keep some pre-built indexes and then try to figure out how best to exploit those indexes to answer a complex query – or, worst case, to do the moral equivalent of a table scan, and just walk the data.

OK, there’s a tricky technical problem; how could CUDA help? What the hell even is CUDA?

CUDA is an offshoot from the gaming world. In their efforts to make games that have high frame rates at high visual quality, the graphics card (GPU) vendors came up with a different approach to processing. Rather than having a relatively small number of very fast general purpose CPUs that switch constantly between doing 200 things, they went with a higher number of “symmetric multiprocessors” – not individually as fast as regular CPUs, but able to do the same thing many many times in parallel (think: SIMD gone mad), and with access to a large number of math processors (ALUs). NVIDIA correctly reasoned that this type of technology might be awesome for many computing tasks, so CUDA was developed as a framework to enable general purpose computing on the GPU, aka GPGPU. Of course, the bus between the device and main memory isn’t as fast as direct CPU<->RAM access (although they’re working hard on that problem), but: very powerful.

And as a result of this, GPU programming is very, very good at scenarios where you want to do a relatively simple and predictable operation many times, ideally where the data you need can be pushed onto the device up-front (and left there, updated only periodically), and where you only need to get back relatively small quantities of results. Doesn't this sound a lot like the second – harder - scenario we just described for the tag engine?


That’s why we were intrigued with GPU programming on the tag-engine. The next post will be more technical, discussing the relative merits of CPU and GPU programming and how we might need to use different approaches to optimize each. And hopefully some code, yay! As a teaser though: it works great.

Sunday, 1 May 2016

Using the Windows Subsystem for Linux to simplify CUDA builds

Recently, Microsoft announced the Windows Sybsystem for Linux, aka Bash on Ubuntu on Windows. This is currently pre-release, and allows you to run linux tools inside Windows. In particular, this opens up a whole range of development tools and options. First. a caveat: this stuff is early release and only currently available on the “Fast Ring” of insider builds. If it works: great. If it doesn’t work: oh well. If it melts your PC, stops it booting, or causes skynet to launch the missiles: that’s tough too.

Now, it just so happens that I’ve been playing a lot with CUDA lately – it turns out that it works quite nicely for our “tag engine” back-end service (and I shall no-doubt be blogging about my journey here very soon). If you’ve done any CUDA, you will know that there are two compilation options for compiling your CUDA kernel source-code into something that can run on your GPU – runtime (NVRTC) and build-time (NVCC). The runtime compiler is very convenient for fast iterations – edit a text file, reload your app – but: it doesn’t currently support one very important feature: dynamic parallelization. I’m hoping that CUDA 8.0 Toolkit (due some time around August) might fix this, but until then, it means that I need to use NVCC for my kernels, since I happen to need this feature.

The next thing you’ll know if you’ve done any CUDA work is that NVCC on Windows is fussy. Not just a little fussy, but really fussy. Not only does it require specific versions of Visual Studio (that are several major versions behind what I usually have installed), but it also requires your command-line C++ environment to be configured perfectly, and even then (as I found out the hard way), it might still turn around and laugh in your face. Possibly for reasons, possibly just for the giggles. I eventually gave up on making NVCC work in Windows, so I configured a Ubuntu VM in Hyper-V, added a SMB share, and used the VM to build my kernels. It worked, but it was unsatisfactory. So I thought: can I use the new Windows goodies to fix this? And would you believe it: yes… ish.

Scope: note that for the purposes of this article I’m not going to try to explain what CUDA kernels are, or how to use NVCC. If you don’t happen to know these things, it will suffice to know that NVCC is a tool you need, it is really awkward to get it working on Windows, and quite a bit easier to get it working on linux.

It turns out that actually this was pretty easy; first we need to get Bash working:

1. enable the fast ring updates
2. be patient for a day or so until it deigns to offer to let you install the new Windows build
3. install the new Windows build and once again learn the true meaning of patience
4. enable developer mode in Windows settings:


5. turn on the new system in Windows features:


6. find and run the new tool


The first time you run this, it downloads some pieces and prompts for your new linux credentials. When it has finished installing itself, you get a Ubuntu 14.04 terminal:


Conveniently, it also mounts your Windows drives for you – so “C:” is “/mnt/c” etc.

Important note: ctrl-v to paste doesn’t currently work, but don’t worry: you don’t need to re-type lines – you can successfully paste via the window menu:


Right. So we have a vanilla (if minimal) Ubuntu installation. From my Hyper-V install, I know that NVCC on linux requires gcc 4.9, so we can follow a few lines from Ask Ubuntu to install this. We also need to install NVCC itself; I recommend not using the version from apt-get, as that is very old – I installed 7.5 using the notes on, which really just meant choosing my preferred install from here - I went for the network-enabled deb install:


And remember: once you have downloaded the file, you can access it under /mnt/c/ or similar. The website tells you the full commands to run, so it isn’t a challenge even if you don’t know much linux. The installer takes quite a while to download and complete, so again: patience. It is also worth noting that when installing this way, no symlink or similar is added to the path – so you can either do that yourself, or just use the tool in the installation location of “/usr/local/cuda-7.5/bin/nvcc”:


After that, compiling to cubin, fatbin, or ptx should just work. The actual commands aren’t very interesting and are described in the documentation. I just wrap my NVCC commands in a bash file (remembering to ensure newlines are LF only), and run that. Here we can see no “fatbin” before the script, executing the script, and the “fatbin” existing after the script:


So what have we done?

  • we have installed the linux subsystem
  • we have installed some linux tools and utilities
  • we have used those tools as part of a development build process, accessing our windows file system
  • all without needing a separate VM

The fact that it involved NVCC and CUDA is largely incidental; the same tecnhiques could be used on many tools.

Minor confessions

1. Life doesn’t often run smoothly. I’m not much of a linux person, and it took me very little time to completely trash my linux install. Fortunately the nuclear reset option exists.

2. The fast ring installs are … temperamental. As it happens, it installed just fine on my travel laptop (hence @TRAVEL in the screens), but it silently failed to install on my main development desktop, and has now disappeared from the update UI … so I can’t find any way to get it installed. Unfortunately, my laptop doesn’t have a CUDA-enabled GPU, so actually this hasn’t saved me anything – I still need to use my Hyper-V VM for real work. But: it was still fun to get it working!

Tuesday, 24 November 2015

The road to DNX–part 3

In part 1, we  looked at an existing library that we wanted to move to core-clr; we covered the basics of the tools, and made the required changes just to change to the project.json build approach, targeting the same frameworks.

In part 2, we looks at “dnxcore50”, and how to port a library to support this new framework alongside existing .net frameworks. We looked at how to setup and debug tests. We then introduced  “dnx451”: the .net framework running inside DNX.

In part 3, we dive deeper still…

Targeting Hell

You could well be thinking that all these frameworks (dnx451, dnxcore40, net35, net40, etc) could start to become  tedious. And you’d be right! FastMember only targets a few, but as a library author, you may know the …. joy … of targeting a much wider set of .net frameworks. Here’s the build tree for protobuf-net r668 (pre core-clr  conversion):


These  are incredibly hard  to build  currently (often requiring per-platform tools). It is a mess. Adding more frameworks isn’t going to make our life any easier. However, many of these frameworks have huge intersections. Rather than having to explicitly target 20 similar frameworks, how about if we could just target an entire flavor of similar APIs? That is what the .NET Platform Standard (aka: netstandard) introduces. This is mainly targeting a lot of the newer frameworks, but then… it is probably about time I dropped support for Silverlight 3. With these new tools, we  can increase our target audience very quickly, without overly increasing our development burden.

Important: the names here are moving. At the current time (rc1), the documentation talks about “netstandard1.4” etc; however, the tools recognise “dotnet5.5” to mean the same thing. (edited: I originally said dotnet5.4===netstandard1.4; I had an off-by-one error - the versions are 4.1 off from each-other!) Basically, the community made it clear that “dotnet” was too confusing for this purpose, so the architects wisely changed the nomenclature. So  “netstandard1.1” === “dotnet5.2” – savvy?

Great! So how do we do this? It is much easier than it sounds; we just change our project.json from “dnxcore50” to “dotnet5.4”:


netstandard1.4 (dotnet5.5) is the richest variant – the intersection of DNX Core 5.0 and .NET Framework 4.6.*. We're going to target dotnet5.4. If you go backwards (1.3, 1.2, 1.1, etc) you can target  a wider audience, but using a narrower intersection of available APIs. netstandard1.1, for example, includes Windows Phone 8.0. There are various tables  on the .NET Platform Standard documentation that tell you what each name targets. Notice I added a “COREFX” define. That  is because the compiler is now including “DOTNET5_4” as a build symbol, not the more specific “DNXCORE50”. To avoid confusion (especially since I know it will change in the next tools drop), I’ve changed my existing “#if DNXCORE50” to “#if COREFX”, for my convenience:


We don’t have to stop with netstandard1.3, though; what I suggest library authors do is:

  1. get it working on what they actively want to support
  2. then try wider (lower number) versions to see what they can support

For example, changing to 1.2 (dotnet5.3) only gives me 13 errors, many of them duplicates and trivial to fix:


And interestingly, this is the same 13 errors that I get for 1.1 (dotnet5.2). If  I try  targeting dotnet5.1, I lose a lot of  things that I absolutely depend on (TypeBuilder, etc), so perhaps draw the line at dotnet5.2; that is still a lot more potential users than 1.4. With some minimal changes, we can support dotnet5.2 (netstandard1.1); the surprising bits there are:

  • the need to add a System.Threading dependency to get Monitor support (aka: the “lock” keyword)
  • the need to explicitly specify the System.Runtime version in the test project

We can  test this with “dnx  test” / “dnx perf”, in both core-clr and .net  under dnx, and it works fine. We don’t need the dnx451 specific build any more


  • I have seen issues with dotnet5.4 projects trying to consume libraries that expose dotnet5.2 builds; this might just be because of the in-progress tooling
  • At the moment, xunit targets dnxcore50, not dotnet*/netstandard* – so you’ll need to keep your test projects targeting dnxcore50 and dnx451 for  now; however, your library code should be able to just target the .NET Platform Standard without dnx451 or dnxcore50:


That’s pretty much the key bits of netstandard; it lets you target a wider audience without having a myriad of individual frameworks defined. But you can use this in combination with more specific targets if you want to use specific features of a particular framework, when available.


As this is aimed at library authors, I’m assuming  you have previously deployed to nuget, so you should be familiar with the hoops you need to jump through, and the maintenance overhead. When you think about it, our project.json already defines quite a few of the key things nuget  needs (dependencies, etc). The dnx tools, then, introduce a new way to package our libraries. What we need to do first is fill in some extra fields (copying from an existing nuspec, typically):


Now all we need to do is “dnu pack --configuration release”:


and … we’ve just built our nupkg (or two). Aside: does anyone else think “dnu pack” should default to release builds? Or is that just me? We can go in and see what it has created:


The nupkg is the packed contents, but we can also see what it was targeting. Looking at the above, it occurs that I should probably go back and remove the explicit dnx451 build, deferring to dotnet5.2, but… meh. It’ll all change again in rc2 ;p

I wish there was a “dnu push” for uploading to nuget, but for now I’ll just use the manual upload:


The details are as expected, so: library uploaded! (and repeated for the strong-name version; don’t get me started on the “strong-name or don’t strong-name” debate; it makes me lose the will to live).

We have now built, tested, packaged and deployed our multi-targeting library that uses the .NET Platform Standard 1.1 and above, plus .Net 3.5 and .Net 4.0. Hoorah for us!

I should also note that Visual Studio also offers the ability to create packages; this is hidden in the project properties (this is manipulating the xproj file):


and if you build this way, the outputs go into “artifacts” under the solution (not the project):


Either way: we have our nupkg, ready to distribute.

Common Problems

The feature  I want isn’t available in core-clr

First, search dotnet/corefx; it is, of course, entirely possible that it isn’t supported, especially if you are doing WPF over WCF, or something obscure like … DataTable ;p Hopefully you find it tucked away in a different package; lots of things move.

The feature is there on github, but not on nuget

One thing you could do here is to try using the experimental feed. You can control  your package feeds using NuGet.config in your solution folder, like in this example, which disregards whatever package feeds are defined globally, and uses the experimental feed and official nuget feed. You may need to explicitly specify a full release number (including the beta marker) if it is pre-release. If that still doesn’t work, you could perhaps enquire with the corefx team on why/when.

The third party package I want doesn’t support core-clr

If it is open source, you could always throw them a pull-request with the changes. That depends on a lot of factors, obviously. Best practice would be to communicate with the project owner first, and check for branches. If it is available in source but not on NuGet, you could build it locally, and (using the same trick as above) add a local package source – all you need to do is drop the nupkg in a folder on the file-system, and add the folder to the NuGet.config file. When the actual package gets released, remember to nuke any temporary packages from  %USERPROFILE%/.dnx/packages.

I don’t like having the csproj as well as the project.json

Long term, we can probably nuke those csproj; they are handy to keep for now, though, to make it easy for people to build the solution (minus core-clr support).

The feature I want isn’t available in my target framework / Platform Standard

Sometimes, you’ll be able to work around it. Sometimes you’ll have to restrict what you can support to more forgiving configurations. However, sometimes there are cheeky workarounds. For example, RegexOptions.Compiled is not available on a lot of Platform Standard configurations, but it is there really. You can cheat by checking if the enum is defined at runtime, and use it when available; here’s a nice example of that. There are uglier things you can do, too, such as using reflection to see if types and methods are actually available, even if they aren’t there in the declared API – you should try to minimize these things. As an example, protobuf-net would really like to use FormatterServices.GetUninitializedObject() when it is available. Just… be careful. This trick work on things like universal applications, but then: neither will hardly any of what protobuf-net does, so that is a moot point.

I’m having a problem with the tooling

The various teams are very open to feedback. I confess that I sometimes struggle to know what should go to the corefx team vs the team (some of the boundaries are largely arbitrary and historical), but it’ll probably find a receptive ear.


The core-clr project moves a lot of pieces, and a lot of things are still in flux. But: it is now stable enough that many library authors should be more than capable of porting their projects, and quite  possibly simplifying their build process at the same time.

Happy coding.

The road to DNX - part 2

In part 1 I gave a brief introduction to the core-clr project and the key tools involved, from the perspective of a library author with existing .net libraries that they want to migrate to core-clr. I took a sample project (FastMember), and made some tooling changes to take it from a csproj-based build (targeting .net 3.5 and .net 4.0), to a build using project.json (again, targeting .net 3.5 and .net 4.0). Because the core-clr tools are not yet stable (rtm) or mainstream, I have retained the ability to build everything from the csproj – so that any arbitrary developer who clones the repo can build right away without having to install a pile of unfamiliar, unreleased tools.

In part 2, we start exploring what we can now do with the new tools.

Target Platforms

Here we’re going to look at adding a new core-clr build, and making the necessary code changes to make it compile.

The first thing we probably want to do is start playing with core-clr. At current (rc1), in terms of build tools this is “dnxcore50”. Our framework dependencies change from being “frameworkAssemblies” to “dependencies”: we’re now going to be pulling down each set of libraries from nuget separately (cached in %USERPROFILE%/.dnx/packages), rather than a monolithic platform install. All we need to do to start, then, is add a “dnxcore50” token into our project.json (for those who have done a lot of core-clr work: we’ll be changing this later, don’t worry), including a dependency on “System.Runtime” (the version number was chosen with the help of auto-complete, so I didn’t need to go out of the editor to look this up):


Since we’ve changed our dependencies, we need to use “dnu restore”. This looks at what project.json requires, and compares it to project.project.lock.json which tracks what has been resolved already. If anything is missing, it looks in .dnx/packages, and if they are missing it uses our defined package sources to go and get them. This works identically for framework dependencies and 3rd-party libraries – it makes the entire thing painless.

Having added “dnxcore50” and the “System.Runtime” dependency, we can try to build – although we’re not actually expecting it to compile yet. In fact, “dnu build” reports 270 errors, and Visual Studio lights up like the Blackpool illuminations:


At this point, we have a small job to do of going through the errors and figuring out what packages we are missing. Granular framework packages makes deployment more convenient (and hopefully more frequent), but means we need a few more dependencies.

Let’s look at the second one – Hashtable; one way of resolving this is to play auto-complete pot-luck by guessing that this is probably in System.Collections.Something:


Since Hashtable is non-generic, it is probably in System.Collections.NonGeneric; we can add this and that error goes away:


We can go through a few obvious ones this way, getting rid of almost two-thirds of the fail:


I still have errors relating to:

  • TypeBuilder / MethodBuilder / ILGenerator etc (the library does metaprogramming)
  • members of Type / MemberInfo: IsValueType, IsDefined, MemberType (the library does a lot of reflection)
  • IDataReader and DataTable

Which – broadly – covers some of the more subtle hurdles you need to jump.

finding rogue types

(edited: this tool does exactly this! - thanks for pointing that out, TIL).

It isn’t necessarily obvious where to find something like ILGenerator, especially since we’re included System.Reflection. There may be better tools, but my usual go-to place is the dotnet/corefx repo; by searching for “class ILGenerator” you can usually quickly determine whether something exists. Often you’ll hit the actual definition first time, which tells you the package in the name. In this case we get the tests, which is probably enough:


but if you’re still stuck,  you can click into the test, then go back a few levels until you find the test’s project.json, and look at what it referenced:


So I’m probably going to want System.Reflection.Emit and System.Reflection.Emit.ILGeneration.

working with refactored APIs

You’ll find  a lot of places where the API available to you has been changed. For example, if you work with reflection – everything changes; System.Type ceases (or rather: ceased, quite a long time ago on a lot of frameworks) being the rich “I know everything” type. The idea is that Type is a lightweight token that allows you to identify and compare types, and if you want to know more you use System.TypeInfo; there are methods to translate between the two (GetTypeInfo() and AsType(), respectively). Likewise,  MemberTypes no longer exists. As far as I know, there is no single master list of these changes – you just  kinda need to tease  each one separately.

Some changes you can make  in a way that works satisfactorily on all frameworks; for example, rather than doing a switch on MemberType:


we can use some combination of “is”/”as”/cast:


In some places, there are extension methods in utility libraries you can use to bridge the gap; for example, to add a lot of familiar methods back onto Type, you can add a dependency on “System.Reflection.TypeExtensions”, and ensure that you have “using System.Reflection;” in the code-file (because  these are extension methods added by the System.Reflection.TypeExtensions type)

In other places, like IsValueType, IsPublic, etc - there is no single common API we can use; it is fundamentally different (and “extension properties” aren’t a thing). The good news is that the project.json build chain makes it easy to use #if sections to switch between different implementations. The upper-case name of the target framework is automatically added as a build symbol  - so we can check using “#if DNXCORE50”. If the impacted API is only used in one place, you can just use #if in-situ, but for frequently recurring things like IsValueType (which is often all over your code), I do not recommend polluting all your code with constant #if. Rather, my strategy is to create a utility class that bridges the gap (usually, but not always, via extension methods), and have just that class deal with different implementations.

The really nice thing is that the IDE helps us here: in the top left corner, it now tells us all the frameworks we are targeting, and we can switch between them in the context of a filelook in particular at which sections are greyed / colorized as we switch between frameworks:



Here’s the commit with everything compiling except System.Data, which I have just excised for now.

System.Data is a much more complicated story; while System.Data has been migrated, there are significant changes:

I’m going to look at this System.Data reference in a bit more detail, but the takeaway here is not System.Data: it is how we can investigate problems.

The first is a big problem if you’re using DataTable; this is a controversial change (see the linked thread), but in reality it is often misused. There are times when it is genuinely the right tool, but for most scenarios you should really have moved  to an ORM or micro-ORM approach by now (did  I mention that  Dapper is available for core-clr? – 1.50 and upwards). It is also (mis)used as part of the reader metadata API. Even if we don’t ever see DataTable, we will need (soon) a  new API that has similar aims as GetSchemaTable(). This is all a bit of an aside, but the point I’m trying to emphasize is that : some APIs have irreconcilable differences. If your library depends on these features, you’re going to have some soul searching to do.

Ignoring the DataTable difference, we can still access much of the rest of System.Data; but we need to move from interfaces to abstract base types (DbDataReader instead of IDataReader). In this case, we have a little grunt work to do, but afterwards: we again have a single implementation with minimal #if. The one interesting bit is that DbEnumerator doesn’t seem to be in the current packaging:


It is dotnet/corefx in a branch, but not in “master”. This looks like some “work in progress” in the conversion, since that API is meant to talk in terms of IDataRecord (or DbDataRecord), and neither of those is supported in core-clr currently, so it isn’t clear to me what this enumerator is  meant to do on core-clr! You will occasionally find pockets like this; to seek clarification, I could look at what SqlDataReader does, or I could ask the developers. Checking github, it looks like it uses a copied implementation in System.Data.SqlClient.  And despite being declared “public”, this simply isn’t in the currently published assemblies. In this case, it all looks a bit of a mess, and it isn’t critical, so I’ll ask the developers, and throw an exception in that scenario for now. Here’s my eventual core-clr conversion of the System.Data-related code.

Testing Against Multiple Frameworks

Woohoo! We now have a project that compiles against .net 3.5, .net 4.0 and dnxcore50. That’s a great start, but we haven’t actually done anything except get it to compile yet. We want to run our tests, too! If you remember from part 1, FastMember has tests that use NUnit. I’m a pragmatist when it comes to test tools. I’ll be honest: at the current time, the easiest way to test on core-clr is via xunit. I’m sure the other tools will catch up, though.

Now, you’re probably thinking “but I don’t want to change all my tests”. I agree with you. Which is why I don’t do that. Instead: I cheat. We can make a final decision whether to migrate the tests more formally when everything is RTM. Right now, we just want things to work. What we’re going to do is:

  • add a “dnxcore50” framework block to our test projects
  • use xunit from the new framework
  • add a bridge file (only active when targeting xunit) that shims between the two
  • #if out any tests that won’t compile on core-clr due to missing features

With these changes, we can start moving to testing. A key piece here is learning about dnx commands; out project.json can actually declare multiple  named commands (which map to assemblies in which to locate an entry-point) that dnx can then invoke. For xunit, the one we want is “xunit.runner.dnx”, but as it happens FastMember.Tests already declares a Main() entry-point that does some performance tests. As such, we can declare multiple commands:


The IDE even updates to let us choose very conveniently what the “play” button should do:


If we’re going to use the IDE, we also need to make sure that we’re targeting .NET Core, since we haven’t enabled DNX tools for regular .NET yet (DNXCORE50 is .NET Core):


Finally, we can hit play, and amazingly our tests pass first time (this is the exception, not  the rule):


We can switch to the “perf” command and run that:


To do the same thing at the command-line; the really really important thing to remember is to switch framework to core-clr via dnvm (here, c64 is an alias for “rc1 64-bit core-clr on windows” that I created in part 1):


In the IDE, you can debug tests with breakpoints in the ways you would hope. The test tooling is “functional” right now, but is improving at a rate.

But what about regular .net?

This is where it starts getting fun! Remember that dnxcore50 is core-clr using the dnx tools. net40 and net35 are regular  .net; the dnx tools don’t really do much other than compile them. But! The dnx tools themselves allow you to run .net apps! There is a different build we use for this: dnx451. This is .net 4.5.1 on dnx. Because this is consuming framework assemblies (not nuget feeds), it is basically the same configuration as net40 – except we can now use the up-to-date xunit bits (which support dnx451); we can add a new build to all of our projects in the project.json:


The IDE now lets us successfully run tests targeting the .NET Framework using the dnx tools:


And again, we can use the command-line to run our tests from dnx, making sure to use dnvm to switch to the .NET Framework (I have an n64 alias for this):


Oops! I broke something, and it only impacts the .NET Framework version. This is easiest to diagnose in the IDE, where pressing F5 quickly tells me it is something to do with the DbDataReader.Dispose method:


Fair enough; that’s just my own brain-dead implementation. I fixed this, and another error (a boolean inversion, along with committing the dnx451 builds, here; Nick Craver is going to laugh at me for this…) that I had introduced, but: this shows the importance of testing any changes you make to your  existing codebase! Our tests now pass for dnxcore50 and dnx451:


End of part 2

That got longer than I expected. We’ve now got as far as targeting dnxcore50 and dnx451 (alongside regular net35 and net40), running test suites, debugging, etc. We’ve actually seen something happen, and we’ve seen things go wrong.

Coming up in the unplanned part 3 (part 2 got too big):

  • Targeting hell: what the hell is netstandard, and  why should I care?
  • Packaging and deployment
  • Common problems

Continue to part 3