Monday, 9 May 2016

CUDA–getting started in .NET


In my two previous posts (part 1, part 2) I described how we have found uses for CUDA. I am, however, aware that this abstract text doesn’t necessarily help the reader (aka “you”) make inroads into the world of CUDA. So: the purpose of this post is to take the concepts discussed previously and put them together into some actual code.

Firstly, a word on OpenCL: so far, I’ve spoken about CUDA – NVIDIA’s framework for GPGPU programming. Other frameworks exist, and the largest “other” is undoubtedly OpenCL. This framework is designed to let you use the same code over a range of technologies – GPUs from different providers, CPUs, etc. I am aware of it – I just haven’t even tried to address it. If that’s your bag: have fun. I don’t feel hugely limited by restricting myself to CUDA for what I need.

The first things you need for CUDA, then, are:

  • a CUDA capable GPU device
  • the CUDA toolkit
  • a programming environment

Since I’m focused on .NET, we’re also going to need one of the tools for interacting with CUDA from .NET, or we’ll need to write extern wrappers for the C API. A range of CUDA tools for .NET exist, including CUDAfy.NET, managedCUDA, Alea GPU / QuantAlea, Campy.NET, and several others that Campy.NET list and describe on their home page. These fall into two major categories:

  • wrappers for the driver API (you write your kernels in C etc and launch them from C#)
  • IL-to-PTX generators (you write and launch your kernels in C#)

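While I’m a huge fan of C#, I’m also a huge fan of keeping close to the metal, and in a heartbeat will sacrifice some programming convenience for performance. So for this post I’m going with the first category: kernels written in C, launched from C# via a driver API wrapper.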

Our first kernel

So without further ado, I’m going to jump straight in with a basic CUDA kernel written in C. Rather than trying to discuss the tag-engine, we’re going to do something simpler… like multiplying numbers. Multiplication is hard, right?

typedef struct {
    int Id;
    unsigned int Value;
} SomeBasicType;

extern "C" {
    __global__ void Multiply(const int N, SomeBasicType* __restrict data, int factor) {
        // grid-stride loop: start at this thread's global index, then step by the total thread count
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
            i += blockDim.x * gridDim.x)
        {
            (data + i)->Value *= factor;
        }
    }
}

The first thing this does is to define a data-type (SomeBasicType) for us to share data between the CPU and GPU code. The __global__ indicates that we’re declaring a kernel, along with a given name and signature. We’re passing in N, the number of elements to process, data, a pointer to some of the data to work against, and factor - the number to multiply the values by. The significance of the N comes into play when we see the i < N in the for loop on the next line. Recall that we usually launch a number of blocks, each consisting of a set number of threads. It is not always the case that the number of elements to process is a convenient multiple of some block size, so it is common for kernels to be passed a count to compare the position against, exiting if the current thread is not needed. This sounds like a contradiction to the “don’t branch” guidance, but as long as most of the threads make the same choice, this isn’t going to be a performance problem.

Recall also that I described a “monolithic kernel” earlier, whereby we launch a thread for every element to process. But: there is a hard limit to the number of threads in a block, and the number of blocks in any dimension – and as always: the more threads there are, the more management overhead there is (although this works differently to CPU threads). As such, it is common to use a “grid-stride loop” rather than a monolithic kernel. As in our example, a for loop is used so that each thread processes multiple elements. But unlike the CPU version shown previously (where each thread processed a separate chunk of data), we instead have each thread in an iteration process adjacent elements, and then move forwards by the stride – namely the width of each block multiplied by the number of blocks. This has multiple side benefits:

  • it can cope with data of any size – it is not subject to the roughly 67M limitation (65,535 blocks of 1,024 threads) of a monolithic kernel
  • it requires less overhead for the GPU engine
  • it can still be used as a monolithic kernel (by setting the dimensions such that it only performs one iteration per thread), or as a single-threaded debugging kernel that processes all the data (by setting the dimensions to 1 thread in 1 block)
  • the work being done by a warp or block is contiguous, minimizing memory access overheads
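To see why a grid-stride loop touches every element exactly once, whatever dimensions are chosen, here is a minimal host-side sketch in plain C. It is only a simulation: the function names and the explicit blockIdx / threadIdx parameters are my own stand-ins for the values CUDA would supply, and the "threads" run one after another rather than in parallel.

```c
#include <assert.h>

typedef struct {
    int Id;
    unsigned int Value;
} SomeBasicType;

/* Work done by one simulated thread; the loop mirrors the kernel's grid-stride loop. */
void multiply_one_thread(int n, SomeBasicType *data, int factor,
                         int blockIdx, int threadIdx, int blockDim, int gridDim)
{
    for (int i = blockIdx * blockDim + threadIdx; i < n; i += blockDim * gridDim) {
        data[i].Value *= (unsigned int)factor;
    }
}

/* Run every simulated thread in turn (a GPU would run them in parallel). */
void multiply_simulated(int n, SomeBasicType *data, int factor,
                        int blockDim, int gridDim)
{
    assert(blockDim > 0 && gridDim > 0); /* a zero stride would never terminate */
    for (int block = 0; block < gridDim; block++) {
        for (int thread = 0; thread < blockDim; thread++) {
            multiply_one_thread(n, data, factor, block, thread, blockDim, gridDim);
        }
    }
}
```

Calling multiply_simulated with 1 block of 1 thread gives the single-threaded debugging behaviour described above; with enough threads to cover n it behaves like a monolithic kernel, with any surplus threads simply failing the i < n test.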

Finally, we get to the actual meat of the kernel, which does an in-place multiplication of one of the fields of our data by one of the parameters.

That’s our first kernel, all done!

A side note on data width

You’ll note that the above example is doing 32-bit arithmetic. For doing serious CUDA work, it is important to be aware of whether you are doing 16-bit, 32-bit or 64-bit arithmetic, as it can significantly impact your choice of hardware – considering them in release order:

  • the “Kepler” architecture (server devices: K40, K80) will serve as a baseline – let’s arbitrarily call it “good” at both 32-bit and 64-bit; 16-bit operations are performed using 32-bit arithmetic
  • the “Maxwell” architecture (server device: M40) is faster than Kepler at 32-bit operations, but is relatively poor at 64-bit operations; 16-bit operations are performed using 32-bit arithmetic
  • to complicate this: the fact that the K80 packs two devices on a physical package means that even though a single M40 may be faster than a single K80 device, code that appropriately uses both K80 devices on a package may still out-perform a single M40
  • the “Pascal” architecture apparently (not yet available for testing; server device: P100) significantly increases both 32-bit and 64-bit performance, while also introducing specific optimized instructions for 16-bit processing

Because of this, there is no single way of saying “this device is faster” – it depends a lot on what exactly you are doing!

Compiling our kernel (and: hello managedCUDA)

A kernel as a text source-file isn’t much use by itself. We need to convert that into GPU-runnable code – either intermediate (“PTX”) or fully compiled to bytecode (“cubin”). There is also a middle-ground that combines both bytecode and intermediate instructions – a “fatbin”; this allows a compiled kernel to target multiple physical architectures and still allow usage on other architectures via a JIT step. There are two common ways of compiling a C kernel – compile-time (NVCC) and runtime (NVRTC). I’ve spoken previously about using NVCC, so for this example I’ll use NVRTC instead.

Since we want to do this from .NET, we need a wrapper library. My examples today are going to use managedCUDA, written by Michael Kunz. This is available on NuGet and is licenced under the LGPL – which I understand (and: IANAL) is not the same as the notoriously infectious and restrictive GPL. What I particularly like about this library is that in addition to exposing an idiomatic .NET object model that maps to the C API, it also exposes the raw C API directly – which makes it convenient to use even when the method you want doesn’t map naturally to a .NET concept.

So let’s load our kernel from a text file and compile it at runtime:

string path = "MyKernels.c";
ManagedCuda.NVRTC.nvrtcResult result;
using (var rtc = new ManagedCuda.NVRTC.CudaRuntimeCompiler(
    File.ReadAllText(path), Path.GetFileName(path)))
{
    try
    {
        rtc.Compile(new string[0]);
        result = ManagedCuda.NVRTC.nvrtcResult.Success;
    }
    catch(ManagedCuda.NVRTC.NVRTCException ex)
    {
        result = ex.NVRTCError;
    }
    log = rtc.GetLogAsString();

    if (result == ManagedCuda.NVRTC.nvrtcResult.Success)
    {
        byte[] ptx = rtc.GetPTX();
        // ...
    }
}

This takes our file contents as input, and (if successful) produces a blob that contains the compiled kernel. We can also obtain the log, which will contain detailed error messages if our kernel is invalid.

The CUDA context

One of the key objects in CUDA is the context. This wraps the state for a long-running CUDA scenario. Most operations, including memory allocation and kernel launches, typically happen within a CUDA context. managedCUDA exposes a class that wraps this. Note:

  • a CUDA context is bound to a particular GPU device, and it is when creating the CUDA context that you can target specific devices
  • a CUDA context is associated with the CPU thread that creates it; as such, CUDA code would work well for scenarios such as a dedicated worker thread processing items from a queue, but would not work well from .NET async/await code that switches between multiple threads

So; let’s create our context and load the kernel:

CudaContext ctx = new CudaContext(deviceId, true);
CudaKernel multiply = ctx.LoadKernelFatBin(ptx, "Multiply");

Note that the context (like many CUDA objects) is IDisposable, and you should make sure to Dispose() it properly, via using or otherwise.

CUDA streams and asynchronous processing

In the simplest usage, the CUDA methods are blocking; however, it is often the case that either the CPU can do other useful work while the GPU churns, or we can overlap multiple GPU operations – memory transfers running in parallel with kernel execution, or (depending on the hardware) even multiple kernels running at the same time. To allow this, CUDA introduces the notion of a “stream”. Work on the same stream is processed in order, but the work on different streams may overlap. I’m not going to make extensive usage of streams in this walkthrough, but it is easier to include stream considerations from the start than to add them later.

Initializing memory at the host and on the device

Traditionally, the host (CPU) memory and the device (GPU) memory were completely separate, and all CUDA code had to specifically be cautious of whether it was referring to host or device memory. More recently, the concept of “unified memory” has been introduced that hides this distinction, and performs transfers automatically as needed – it is intended to make it simpler to use the API, but for the maximum performance it is recommended to take full control over allocation and copy. To that end, I’ll stick to the traditional approach, which means explicitly allocating the data at both ends.

We’ll start by allocating the host memory:

struct SomeBasicType {
    public int Id;
    public uint Value;
}

IntPtr hostPointer = IntPtr.Zero;
var res = DriverAPINativeMethods.MemoryManagement.cuMemAllocHost_v2(
    ref hostPointer, count * sizeof(SomeBasicType));
if (res != CUResult.Success) throw new CudaException(res);
SomeBasicType* hostBuffer = (SomeBasicType*)hostPointer;

As before, we declare a raw data type to share the data between CPU and GPU. Here is one of those cases where I’m making use of the raw C API rather than the wrapped objects, as this (IMO) more conveniently allows naked allocations. You might think “hang on, you’re just allocating some memory – surely a .NET array would suffice, or maybe at worst Marshal.AllocHGlobal” – but: in order for the CUDA memory-copy engine to work at maximum performance (avoiding double-copies, etc) it is required to be configured in a specific way, and the best way to guarantee that is to let the CUDA API allocate the memory in the first place. A pinned .NET array certainly wouldn’t suffice. Or at least, it would work – but not as fast. Yes, this needs some unsafe code; it’ll be fine, don’t panic.
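Because the same bytes are read by the C kernel and by the C# host code, the two declarations of SomeBasicType have to agree on size and field offsets. As a hedged sketch (assuming the usual 4-byte int, and with buffer_size_in_bytes as a hypothetical helper of my own), the C side of that contract can be checked at compile time:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int Id;
    unsigned int Value;
} SomeBasicType;

/* C11 compile-time checks: two 4-byte fields, back to back, no padding.
   The C# struct (int followed by uint) is expected to have the same layout. */
_Static_assert(sizeof(SomeBasicType) == 8, "expected 8 bytes per element");
_Static_assert(offsetof(SomeBasicType, Id) == 0, "Id should be the first field");
_Static_assert(offsetof(SomeBasicType, Value) == 4, "Value should follow Id");

/* Bytes needed for a buffer of `count` elements, the C equivalent of
   count * sizeof(SomeBasicType) in the C# allocation call. */
size_t buffer_size_in_bytes(size_t count)
{
    return count * sizeof(SomeBasicType);
}
```

If the struct ever grows fields of mixed sizes, padding can differ between compilers, and a check like this fails at build time rather than corrupting data at runtime.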

Once allocated, we can initialize this memory with some invented data:

for (int i = 0; i < count; i++) {
    hostBuffer[i].Id = i;
    hostBuffer[i].Value = (uint)i;
}

Next we’ll want to allocate a similar block of memory on the device, and start copying the data from the host to the device. Emphasis: large memory transfers are fast, but not instant – if possible you should minimize the amount of data you need to transfer for individual operations. The ideal scenario is to copy the data to the device once / periodically, and use it many times once it is there.

CudaDeviceVariable<SomeBasicType> deviceBuffer = new CudaDeviceVariable<SomeBasicType>(count);
CudaStream defaultStream = new CudaStream();
res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyHtoDAsync_v2(deviceBuffer.DevicePointer,
    hostPointer, deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

This time I'm using the CudaDeviceVariable<T> object to represent the memory at the device, which is perfectly convenient since we will not usually need to do anything except access the DevicePointer to pass into methods. Once again I'm using a raw C-style API for the actual memory copy. There are some transfer methods exposed directly on the object, but this particular combination is not exposed directly. Note that because we have created and used a stream, this is non-blocking. Related operations sent to the same stream will be queued behind this one.

Launching our kernel

We have data on the device; we have a kernel that we have compiled and loaded. Now finally to run the kernel against the data! First we must configure the dimensions – the number of threads in a block and the number of blocks:

multiply.BlockDimensions = new ManagedCuda.VectorTypes.dim3(threadsPerBlock, 1, 1);
multiply.GridDimensions = new ManagedCuda.VectorTypes.dim3(blockCount, 1, 1);

multiply.RunAsync(defaultStream.Stream, new object[] {
    count, deviceBuffer.DevicePointer, value
});

Note that the parameters to RunAsync (other than the stream) match the signature on the kernel itself. value in the above is the factor to multiply by. Once we've got the data in place, running kernels is actually alarmingly simple!
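The snippet above doesn't say where threadsPerBlock and blockCount come from. One common choice, sketched here in plain C as an assumption rather than as what the sample project does, is to launch enough blocks to cover the data, capped at some maximum; the grid-stride loop in the kernel then picks up whatever the cap leaves over. The function name and the max_blocks parameter are hypothetical.

```c
#include <assert.h>

/* Number of blocks to launch for `count` elements: enough to cover the data
   (rounding up), but never more than max_blocks and never fewer than one. */
int choose_block_count(int count, int threads_per_block, int max_blocks)
{
    assert(threads_per_block > 0 && max_blocks > 0);
    if (count <= 0) {
        return 1; /* launch a single block; its threads all fail the i < N test */
    }
    /* ceiling division, written to avoid overflow for large counts */
    int needed = count / threads_per_block + (count % threads_per_block != 0);
    return needed < max_blocks ? needed : max_blocks;
}
```

For example, 800,000 elements at 256 threads per block needs 3,125 blocks, while 100,000,000 elements would be capped at the maximum, leaving each thread to loop several times.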

Getting results back

It isn’t usually sufficient to run kernels – sadly we usually expect to get results. As mentioned before, we should strive to minimize the amount of data we want to transfer, but for simplicity in this example I’m going to fetch back the entire buffer, over-writing the previous contents. In real code you might be extracting only small portions of the buffer, or summary / aggregate data.

var res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyDtoHAsync_v2(
    new IntPtr(hostBuffer), deviceBuffer.DevicePointer,
    deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

Once more, for memory operations it has been easier to use the raw C API than the wrapped managed API, and that's OK. Note that because we've been using streams, at this point we've only queued the copy. If we've set all our wheels in motion, and want to sit back and wait while they turn, we can use either ctx.Synchronize(); (to wait for all streams to complete) or defaultStream.Synchronize() (to wait for a specific stream to complete). Finally, when that is done, we can iterate over the data and congratulate ourselves for managing to over-complicate the multiplication of numbers!

Not enough code for you?

Everything I’ve shown here is available in a GitHub project that you can clone, fork, etc. I don’t claim it is the most useful tool ever, but hopefully it might help a few people get up and running with CUDA.


How I found CUDA, or: Rewriting the Tag Engine–part 2

(part 1, part 3)

So we got to work…

We knew we had a problem where CUDA might be part of an answer, especially for performance. And in reality, there’s only one way to find out which horse is faster. You race the horses. We knew that the v1 tag-engine was poorly designed (organically, so to speak), so it wouldn’t be meaningful to compare that for performance purposes. And even if the GPU is better, we still wanted a better CPU implementation:

  • it needs to run on developer local machines; not all developers have a CUDA device, especially if they are on a laptop, or on a VM, or on a VM on a laptop
  • in the event of the tag-engine servers being unreachable, our fallback strategy is to load the tag-engine in-memory on the production web-servers, so it needs to be able to run there
  • if the GPU approach turns out to not be what we want, we still want to have moved our code forwards
  • and if the GPU approach turns out to be exactly what we want, then we’d have physical factors like server hardware, cage configuration, etc to add lead-time, whereas we’d like to replace the code ASAP

So: we definitely wanted a shiny new CPU version, but one that was designed to work nicely side-by-side with the GPU version. The other thing you need to race your horses is: a course. The best way of representing your load is to capture your load, in a way that you can play back. So the next thing we did was to edit our existing tag-engine to make it possible to record all the real requests it was serving, in a way that we could play back and use to compare performance, features, and results. Helpfully, tag-engine requests don’t contain any PII data – note that this isn’t possible in all circumstances.

Parallel or sequential?

Next up, we need to think about how GPUs work. CPUs are often (not always) used to perform multiple independent operations in parallel. GPUs, by contrast, are usually used to make a single operation (per device) happen really quickly, processing the operations sequentially. The parallel approach, while scalable, has a lot of complications:

  • you need to be careful how you perform data updates without breaking running operations (or: you need to duplicate the data to perform updates) – contrast sequential, where you can just squeeze the update in as just one more thing in the queue (it is never fighting other requests), and update the data in-place with impunity
  • each parallel operation needs memory for its processing (the results it has collected, and everything it needed to get there) – contrast sequential where you can allocate a single workspace and just keep re-using it

After considering these factors, and looking at our typical request rate, we concluded that our best approach for both GPU and CPU would be to use sequential operations, using all the resources available to us (cores, etc) to make each individual operation as fast as possible. This also makes for a very simple CPU vs GPU race, while also making for some amusing CPU task-manager charts:

(yes, that’s me punishing a 36-core, HT-enabled server for 72 logical cores of goodness)

The CPU bits

Recall from part 1 that we have two categories of query; trivially simple, and absurdly hard. I’m going to completely ignore the first set. The interesting case is the second, and in those scenarios you can pretty much guarantee that there is no single index that is ever going to fit your query, and we’re essentially trying to do a table-scan of a “where” clause, perhaps with some short-cuts. Let’s consider an example. Say we want to know all the “java and android” questions, sorted by activity. One of the main things the tag-engine stores is an index of “questions by tag” – i.e. given a tag, what questions exist in that tag (pre-sorted). Since the “and” makes this restrictive (intersection rather than union), what we can do is choose the smallest (“android”, which takes us to a manageable size) and then just test them all. For reasonable sizes of data, this can be much cheaper than trying to do complicated index combinations, and can be done with very little memory allocation. Note that we almost always want the total count and the “intersecting related tags” data, so it won’t help to cheat and just scroll forwards until we have enough data to return.

Let’s say that “android” has 800,000 questions. If we want to parallelize this work (to use all of the available cores), it might be tempting to use Parallel.ForEach to outsource this work to the TPL and aggregate the results, but:

  • this has quite a lot of overhead in terms of doing lots of small things instead of a small number of big things
  • since the order is now unpredictable, it makes it very hard to exploit the fact that we have pre-sorted the data

Instead, what we can do is to carve these 800,000 questions into a small number of chunks of contiguous questions (8, say, or something related to the number of available cores), and then inside each chunk (unit of work): test each question in turn for the remaining conditions, writing successive matches in blocks to a similarly sized workspace.


By letting different threads process different blocks, we get lots of benefits:

  • the number of units-of-work for the threading library (TPL) to manage is minimal, reducing overheads
  • each unit-of-work is always looking at contiguous data, maximizing cache usage and memory locality
  • we get to exploit the fact that we’ve already sorted the data once – we never need to re-sort
  • we don’t have to synchronize when writing any of the results, since each unit-of-work is writing to a separate area of the workspace
  • all we need to know is how many matches each unit-of-work contained and we can step through the final workspace very efficiently

This is actually pretty simple to do with Parallel.Invoke - something like:

int chunks = DecideNumberOfChunks(questions, cpuCores);
// round up, so that any remainder is covered by the final chunk
int workPerChunk = (questions + chunks - 1) / chunks;
WorkUnit[] workUnits = new WorkUnit[chunks];
Action[] actions = new Action[chunks];
for(int chunk = 0; chunk < chunks ; chunk++) {
    int start = Math.Min(chunk * workPerChunk, questions),
        stop = Math.Min(start + workPerChunk, questions);
    workUnits[chunk] = new WorkUnit(start, stop, ...);
    actions[chunk] = workUnits[chunk].Execute;
}
Parallel.Invoke(actions);

Where each Execute method is essentially:

public void Execute() {
    int matches = 0, resultIndex = start;
    for(int i = start ; i < stop ; i++) {
        if(IsMatch(i)) { // some test or set of composite tests
            resultWorkspace[resultIndex++] = i;
            matches++;
        }
    }
    this.Matches = matches;
}

The above is a very simplified illustration of the design that drives the CPU implementation of the re-written tag-engine. The result is that it creates a packed set of contiguous matches for each unit-of-work, while allowing us to scale the query effectively over all the available CPU cores. The IsMatch method might be non-trivial, of course. We use a combination of meta-programming and special-cased tag-tests to allow us to support a wide range of queries; it works very well. So how does this compare to GPU? How does it change our approach?
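To show how the pieces fit together end to end, here is a small sequential sketch in plain C of the same idea: split the range into contiguous chunks, pack each chunk's matches at the start of its own slice of the workspace, then step through the slices in order. It is an illustration under my own assumptions, not the tag-engine code: the names are hypothetical, is_multiple_of_three is a toy stand-in for IsMatch, and the chunks run one after another instead of on separate threads.

```c
#include <assert.h>

typedef int (*match_fn)(int index); /* hypothetical stand-in for IsMatch */

typedef struct {
    int start, stop; /* half-open range [start, stop) */
    int matches;     /* number of matches found in this range */
} work_unit;

/* Toy predicate, used only for demonstration. */
int is_multiple_of_three(int index) { return index % 3 == 0; }

/* Split [0, total) into `chunks` contiguous ranges, rounding the chunk size up
   so that the final chunk covers any remainder. */
void make_work_units(int total, int chunks, work_unit *units)
{
    assert(chunks > 0 && total >= 0);
    int per_chunk = total / chunks + (total % chunks != 0);
    for (int c = 0; c < chunks; c++) {
        int start = c * per_chunk;
        if (start > total) start = total;
        int stop = start + per_chunk;
        if (stop > total) stop = total;
        units[c].start = start;
        units[c].stop = stop;
        units[c].matches = 0;
    }
}

/* One unit of work: matches are written, packed, from workspace[start] onwards.
   No other unit writes to this slice, so no synchronization is needed. */
void execute(work_unit *unit, match_fn is_match, int *workspace)
{
    int result_index = unit->start;
    for (int i = unit->start; i < unit->stop; i++) {
        if (is_match(i)) workspace[result_index++] = i;
    }
    unit->matches = result_index - unit->start;
}

/* Walk each unit's packed matches in order; returns the total match count. */
int gather(const work_unit *units, int chunks, const int *workspace, int *out)
{
    int n = 0;
    for (int c = 0; c < chunks; c++) {
        for (int k = 0; k < units[c].matches; k++) {
            out[n++] = workspace[units[c].start + k];
        }
    }
    return n;
}
```

Because the chunks are contiguous and are visited in order, the gathered output keeps the original sort order without any re-sorting.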

CUDA: Kernels, Threads, Warps, Blocks and Grids

I’m going to run through some high level CUDA concepts now before showing any CUDA code, and before showing how it relates to the tag-engine.

When developing with CUDA, the first things you need to learn about are kernels, blocks and grids. A kernel is just a chunk of your code that you can invoke on the GPU device. But unlike regular code, when you “launch” a kernel, you aren’t usually asking it to call it once; you’re asking it to call the exact same method lots of times. As a trivial example, we could ask it to call the same kernel 800,000 times – once for each of our 800,000 “android” questions. A kernel designed to be called once per input element is sometimes called a “monolithic” kernel. But that isn’t quite the end of the story. Each separate invoke of the kernel is a “thread” – so in this monolithic case, we’d be launching 800,000 threads – but we don’t just ask for 800,000 – instead we might ask it to launch 3125 “blocks”, with each block consisting of 256 “threads”. Each of these numbers has limits – a block is limited to a maximum of 1024 threads on most current hardware, and you can have a maximum of 65535 blocks (per dimension, but that’s a separate story).

This means that for a single-dimension kernel, there’s a limit of about 67M threads (65,535 blocks × 1,024 threads), but don’t worry – I’ll cover how to get around that later. All I’m trying to do here is give an overview of the topology of what we’re playing in. The significance of these different concepts is that per-clock-cycle, each streaming multiprocessor in a GPU actually works on multiple threads in the same block. This ability to operate on multiple threads at once is what makes GPUs so much more powerful (for some tasks) than CPUs. This group of threads that are being controlled in unison is a “warp” (the warp-size is 32 in current hardware).


The above is then duplicated for however many blocks you asked for – 3125 blocks of 256 threads in our example, with a warp-size of 32. In simple examples we’re often just talking about 1 dimension, but note that CUDA blocks and threads are actually 3-dimensional; this means you can actually have 65535 x 65535 x 65535 blocks, with 1024 threads per block – but… you probably shouldn’t ever do that! That’s not the way to solve the “more than 67M” problem.
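The arithmetic behind this topology is easy to check by hand. As a rough sketch in plain C (the struct and function names are mine, and the limits are the ones quoted in the text rather than queried from a device), this maps an element of a 1-dimensional launch onto its block, thread, warp and lane, and computes the monolithic limit:

```c
#include <assert.h>

enum { WARP_SIZE = 32 }; /* warp size quoted above for current hardware */

typedef struct {
    int block;  /* which block the element falls in (blockIdx.x) */
    int thread; /* position inside that block (threadIdx.x) */
    int warp;   /* which warp of the block the thread belongs to */
    int lane;   /* position of the thread inside its warp */
} position;

/* Where element `element` lands in a 1-dimensional monolithic launch. */
position locate(int element, int threads_per_block)
{
    assert(element >= 0 && threads_per_block > 0);
    position p;
    p.block = element / threads_per_block;
    p.thread = element % threads_per_block;
    p.warp = p.thread / WARP_SIZE;
    p.lane = p.thread % WARP_SIZE;
    return p;
}

/* Largest element count a 1-dimensional monolithic launch can cover. */
long long max_monolithic_elements(long long max_blocks, long long max_threads_per_block)
{
    return max_blocks * max_threads_per_block;
}
```

With 256 threads per block, the last of the 800,000 elements lands in block 3124 as thread 255, which is lane 31 of warp 7; and 65,535 blocks of 1,024 threads gives 67,107,840, the "about 67M" figure.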

Why do I need to know?

There are a few reasons why you need to understand the topology; firstly, in addition to knowing the limits of what you can ask for, it impacts things like memory locality and cache. Remember in the CPU example we optimized by having each thread work on a different chunk of data? In CUDA, a warp of threads will be executing at once (per streaming multiprocessor), and the infrastructure is built specifically to share memory lookup costs between a warp. Because of this, you usually want all the threads in a warp looking at adjacent data.

The “secondly” is perhaps even more important: the ability of a streaming multiprocessor to progress multiple threads simultaneously is dependent upon each of those threads doing exactly the same thing. Which makes sense when you think about it. This doesn’t mean that your CUDA code can’t ever branch (if, etc - basically, any decision point), but it does mean that if different threads in a warp branch in different directions, then the streaming multiprocessor has to identify the ones in different states and progress them separately. Which means: you kill the performance.

So: aim to work on adjacent data, and try to branch in a single direction.

How does this impact tag-engine matching?

If we go back to the index that we described in the CPU example, we are now essentially applying all our tests in parallel (at least, in a single kernel launch; how the hardware schedules it is up to the hardware). With this setup, we can’t really have the concept of “write to the next position in the result index and increment the write-position”, because all matches in a single clock cycle would want to write to the same position. CUDA of course provides mechanisms for doing atomic increments, but if these collide you’re ultimately going to be branching and breaking the lock-step on the warps – and more importantly, since we don’t control which thread runs when, we’d actually be randomizing the results, which would require us to sort the data again.

So; instead of trying to write packed data, we’ll instead try to create a sparse vector of just the matches – so zero (or another sentinel) for data that didn’t match, and the key otherwise:


What we’ll then do is pack that down to give just the non-zero data:


Note that we’ve preserved the sort on the data, and now we’re in a position where we can just do a memory copy from the device (GPU) to the host (CPU) of just the page of results we want.
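The packing step itself is conceptually simple. The sketch below is a sequential, host-side version in plain C, included only to pin down what "pack" means here; the function name is hypothetical, and how the packing is actually performed on the device is not covered in this post (on a GPU it would normally be some form of parallel compaction rather than a loop like this).

```c
#include <assert.h>

/* Copy the non-zero entries of `sparse` to the front of `packed`, keeping
   their relative order. Returns the number of entries written.
   Zero is treated as the "no match" sentinel. */
int pack_nonzero(const int *sparse, int n, int *packed)
{
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (sparse[i] != 0) {
            packed[count++] = sparse[i];
        }
    }
    return count;
}
```

One thing this makes visible: with zero as the sentinel, a genuine match whose key is 0 cannot be told apart from "no match", which is one reason to consider a different sentinel.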

Show me some code!

OK, time for some CUDA. Note that there are many ways of creating CUDA binaries, and I intend to show much more on this next time – but for now, I’m using C code, since that works directly with NVIDIA’s tooling.

The first thing we need to do is to figure out what element we’re meant to be processing. CUDA makes available to us the block dimensions (blockDim), the logical position of the current block (blockIdx), and the logical position of the current thread inside the block (threadIdx). For a 1-dimensional kernel, this means that our actual index is:

int i = blockIdx.x * blockDim.x + threadIdx.x;

We can also assume that the kernel has access to the underlying data (I’ll demonstrate this next time), and somewhere to put the results. For simplicity, let's consider a test on the score of the post; what we want to do is essentially:

Question* q = allData + i; // pointer arithmetic 
results[i] = (q->score >= minScore && q->score <= maxScore) ? i : 0;

However, the observant might notice that both the short-circuiting "and" (&&) and the ternary-conditional (? :) are branching operations, and are best avoided. Fortunately, it isn't hard to rewrite this using non-branching equivalents. The "and" is simple - we'll just make it non-short-circuiting. The conditional is harder, but "true" in C is expressed by the number 1. This means that if we negate (numerically, not bitwise) the result of our boolean test we get 0 and -1; and -1 in two's-complement binary is "all the ones". This means we can just do a bitwise "and" between this and the number we want to store (i):

results[i] = (-(q->score >= minScore & q->score <= maxScore)) & i;

While it might not be as intuitive, this type of thinking is key to keeping warps in lock-step to get the maximum performance from the GPU.
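If the bit-twiddling looks suspicious, it is easy to convince yourself on the CPU that the two forms agree. This is a plain C sketch with function names of my own choosing; it demonstrates the equivalence of the results only, and says nothing about what instructions a particular compiler will actually emit for either form.

```c
/* Branching form: short-circuit "and" plus a conditional. */
int select_branching(int score, int minScore, int maxScore, int i)
{
    return (score >= minScore && score <= maxScore) ? i : 0;
}

/* Non-branching form: the comparisons yield 0 or 1, negation turns that into
   0 or -1 (all bits set in two's complement), and the mask keeps or clears i. */
int select_branchless(int score, int minScore, int maxScore, int i)
{
    int inRange = (score >= minScore) & (score <= maxScore);
    return -inRange & i;
}
```

For a score inside the range both functions return i; outside the range both return 0.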

OK, I see why this might be useful, but how well does it work in practice? Is it worth it?

Note: as I have tried to emphasize: all performance data is deeply dependent on your exact scenario and your exact implementation. I’m going to talk about what we’ve found, but please don’t use this to justify a big hardware spend: race your own horses.

All of this work would be purely academic if it didn’t help us. So we’ve spent a lot of time and effort comparing performance using our captured data between the CPU versions (v1 and v2) and the GPU version, using a range of devices. For local development purposes today, a GTX 980 is more than sufficient.

So without further ado: some numbers. Looking only at the “hard” queries (the easy queries aren’t a bottleneck), the 72-core high-end server that I showed in the task-manager screenshot above managed an average of 259.9 requests per second for the CPU version – 3.8ms per request, which is pretty respectable for non-trivial filters against 12 million records. The GPU version, however, managed nearly double that (507.7 requests per second) just on my development GTX 980. This is not “high end” in GPU terms. There are server devices like the K40, K80 (two K40s back-to-back with a shared power limit), and the M40 which are quite a bit more powerful, and in my testing the maximum throughput went up linearly as you added more GPUs. Given that a C4130 has space for 4 GPUs, that could give you 4 M40 devices or 8 K80 devices (two devices per package), to get a massive speedup compared to CPU limits.

And also note: the GTX 1080 (desktop) and P100 (server) are around the corner, boasting the next generation of architecture and hopefully another huge jump in performance (I haven’t got my grubby hands on those yet).

But: from what we’ve seen, we’re keen to push to the next level and get some of this kit deployed into production.

My curiosity on CUDA is piqued; are you done? what next?

No, you don’t shut me up that easily! Next time, I’m going to be talking about things like grid-stride loops, async and CUDA streams, memory transfers, and showing how we can get access to all this goodness from my preferred every-day language: C#. I’ll also be walking you through some code that you can clone and play with to see it in action, rather than as text.

How I found CUDA, or: Rewriting the Tag Engine–part 1

(part 2, part 3)

This post is largely an introduction to set context so that the following few articles make sense… there won’t be any code here, but: well, take it or leave it :)

The Context – what is the tag engine?

At Stack Overflow / Stack Exchange, a lot of our pages relate to “tags” (topics). As an obvious example, a lot of users browse questions in specific technologies looking for new things to answer, or have feeds / email notifications configured for specific tags. Other users might be interested in all the newest questions, but absolutely never want to see another question that reminds them of their last job (by the way, if your job sucks, you should fix that – life is too short to be miserable). We also do things like showing “related tags” – essentially the counts of the intersections between technologies of what you’re looking at and other questions we know about.

All of this needs a non-trivial amount of processing and memory. Back in the day, it was sufficient to use our RDBMS for that (via some hacks that in turn left us some technical debt, but that is long gone now), but as we grew that simply wasn’t going to work. So after investigating a few options, out popped the “tag engine” – basically some bespoke code with a small set of jobs that we could run out-of-process to the main web-servers (so they don’t have to reload everything when we deploy / recycle).

So… life was good?

All was well. Sure, we had to fight a few things like GC, but… it worked. But as we grew, that code base started to become more and more of a limiting factor. It is nearly 5 years old now, and we’ve grown a lot in that time, and our needs have changed a lot in that time. The tag engine was never really “designed” so much as … “grew”. We gradually hacked in features we needed, and tweaked bits, and more or less it kept working. It was a bit ugly, but it wasn’t actually a problem. We’re pragmatists: we fix problems. We don’t fix things that aren’t problems.

Lately, it has been moving more and more from the “not a problem” camp to “problem”, so yay, more things to do. Performance was a key part of this, stemming in part from data volume, part from design choices - an overhaul was overdue.

Starting to think about GPUs

Around this time, I happened to see an email conversation from Daniel Egloff at QuantAlea, and it made me think about how much of the tag-engine might be suitable for GPU work. After a brief exchange with Daniel, I was satisfied that I wasn’t totally crazy. As a brief aside: QuantAlea were great and seemed keen to help us work on the problem. I think their tools show real promise, but for us we made the decision to keep everything in-house and do it ourselves. This is in part because our scenario is relatively simple. What I’m saying here is: they were really helpful, and if you’re interested in CUDA you might want to think about them, but: we didn’t go that way ourselves.

So what the hell is the tag engine doing? Why do you even need that level of crazy?

The interesting thing (to me) about the tag engine is that there are two very different scenarios. About half the queries tend to be embarrassingly trivial; and the other half are absurd. If you come at the tag-engine thinking “list the first 50 newest C# questions” – then: that’s the embarrassingly trivial side. This really doesn’t need a lot of work – just keep track of questions by tag, pre-sorted, and pick out the pages you need. Very fast, very simple. It just works. Although I’ll leave it as an exercise for the reader to think about how to generate the tag intersection cloud.

The real problem is the other half, which could be “page 200 (50 per page) of all 'java or .net or sql' questions, and everything that starts with ‘visual’, but never show me anything with php or fortran, and only show me questions with a score above 2 that were created after some date”. Lots of complex unions, restrictions, etc. You can do this type of thing with general purpose indexing tools (Elasticsearch for example), but a: we want the maximum performance, and b: some of our data is not really amenable to such tools – for example, some of our sort-orders are highly time dependent, which is complex and awkward for such tools (and may require lots of re-sends). Equally, some of the data we want to get back out (including, but not limited to, the tag intersection cloud) are not easy to do.

It is this second category of queries where the tag-engine shines; essentially, what it has to do is to keep some pre-built indexes and then try to figure out how best to exploit those indexes to answer a complex query – or, worst case, to do the moral equivalent of a table scan, and just walk the data.
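To make the “table scan” worst case a bit more concrete, here is a minimal C# sketch of simply walking the data for a query like the one above. Everything in it is an assumption made for illustration: the `Question` type, its fields, the `TagScan.Query` signature and the newest-first sort are all invented, and it ignores the pre-built indexes entirely. It is not the tag engine’s actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record shape; the real tag engine's layout is not shown in this post.
public sealed class Question
{
    public int Id;
    public int Score;
    public DateTime Created;
    public HashSet<string> Tags = new HashSet<string>();
}

public static class TagScan
{
    // Brute-force scan: test every question against the filter, sort newest first, then page.
    public static List<int> Query(
        IEnumerable<Question> all,
        ICollection<string> anyOf,   // e.g. java, .net, sql
        string prefix,               // e.g. "visual"; null to skip the prefix match
        ICollection<string> noneOf,  // e.g. php, fortran
        int minScoreExclusive,
        DateTime createdAfter,
        int page,                    // 1-based page number
        int pageSize)
    {
        return all
            .Where(q => q.Score > minScoreExclusive && q.Created > createdAfter)
            .Where(q => !q.Tags.Overlaps(noneOf))
            .Where(q => q.Tags.Overlaps(anyOf)
                || (prefix != null
                    && q.Tags.Any(t => t.StartsWith(prefix, StringComparison.Ordinal))))
            .OrderByDescending(q => q.Created)
            .Skip((page - 1) * pageSize)
            .Take(pageSize)
            .Select(q => q.Id)
            .ToList();
    }
}
```

Note that every question is touched on every query, and each test is small and predictable; that shape is what matters for the rest of the discussion.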

OK, there’s a tricky technical problem; how could CUDA help? What the hell even is CUDA?

CUDA is an offshoot from the gaming world. In their efforts to make games that have high frame rates at high visual quality, the graphics card (GPU) vendors came up with a different approach to processing. Rather than having a relatively small number of very fast general purpose CPUs that switch constantly between doing 200 things, they went with a higher number of “streaming multiprocessors” – not individually as fast as regular CPUs, but able to do the same thing many many times in parallel (think: SIMD gone mad), and with access to a large number of math processors (ALUs). NVIDIA correctly reasoned that this type of technology might be awesome for many computing tasks, so CUDA was developed as a framework to enable general purpose computing on the GPU, aka GPGPU. Of course, the bus between the device and main memory isn’t as fast as direct CPU<->RAM access (although they’re working hard on that problem), but: very powerful.

And as a result of this, GPU programming is very, very good at scenarios where you want to do a relatively simple and predictable operation many times, ideally where the data you need can be pushed onto the device up-front (and left there, updated only periodically), and where you only need to get back relatively small quantities of results. Doesn't this sound a lot like the second – harder - scenario we just described for the tag engine?


That’s why we were intrigued with GPU programming on the tag-engine. The next post will be more technical, discussing the relative merits of CPU and GPU programming and how we might need to use different approaches to optimize each. And hopefully some code, yay! As a teaser though: it works great.

Sunday, 1 May 2016

Using the Windows Subsystem for Linux to simplify CUDA builds

Recently, Microsoft announced the Windows Subsystem for Linux, aka Bash on Ubuntu on Windows. This is currently pre-release, and allows you to run linux tools inside Windows. In particular, this opens up a whole range of development tools and options. First, a caveat: this stuff is early release and only currently available on the “Fast Ring” of insider builds. If it works: great. If it doesn’t work: oh well. If it melts your PC, stops it booting, or causes skynet to launch the missiles: that’s tough too.

Now, it just so happens that I’ve been playing a lot with CUDA lately – it turns out that it works quite nicely for our “tag engine” back-end service (and I shall no-doubt be blogging about my journey here very soon). If you’ve done any CUDA, you will know that there are two compilation options for compiling your CUDA kernel source-code into something that can run on your GPU – runtime (NVRTC) and build-time (NVCC). The runtime compiler is very convenient for fast iterations – edit a text file, reload your app – but: it doesn’t currently support one very important feature: dynamic parallelism. I’m hoping that CUDA 8.0 Toolkit (due some time around August) might fix this, but until then, it means that I need to use NVCC for my kernels, since I happen to need this feature.

The next thing you’ll know if you’ve done any CUDA work is that NVCC on Windows is fussy. Not just a little fussy, but really fussy. Not only does it require specific versions of Visual Studio (that are several major versions behind what I usually have installed), but it also requires your command-line C++ environment to be configured perfectly, and even then (as I found out the hard way), it might still turn around and laugh in your face. Possibly for reasons, possibly just for the giggles. I eventually gave up on making NVCC work in Windows, so I configured an Ubuntu VM in Hyper-V, added an SMB share, and used the VM to build my kernels. It worked, but it was unsatisfactory. So I thought: can I use the new Windows goodies to fix this? And would you believe it: yes… ish.

Scope: note that for the purposes of this article I’m not going to try to explain what CUDA kernels are, or how to use NVCC. If you don’t happen to know these things, it will suffice to know that NVCC is a tool you need, it is really awkward to get it working on Windows, and quite a bit easier to get it working on linux.

It turns out that actually this was pretty easy; first we need to get Bash working:

1. enable the fast ring updates
2. be patient for a day or so until it deigns to offer to let you install the new Windows build
3. install the new Windows build and once again learn the true meaning of patience
4. enable developer mode in Windows settings:


5. turn on the new system in Windows features:


6. find and run the new tool


The first time you run this, it downloads some pieces and prompts for your new linux credentials. When it has finished installing itself, you get an Ubuntu 14.04 terminal:


Conveniently, it also mounts your Windows drives for you – so “C:” is “/mnt/c” etc.

Important note: ctrl-v to paste doesn’t currently work, but don’t worry: you don’t need to re-type lines – you can successfully paste via the window menu:


Right. So we have a vanilla (if minimal) Ubuntu installation. From my Hyper-V install, I know that NVCC on linux requires gcc 4.9, so we can follow a few lines from Ask Ubuntu to install this. We also need to install NVCC itself; I recommend not using the version from apt-get, as that is very old – I installed 7.5 using the notes on, which really just meant choosing my preferred install from here - I went for the network-enabled deb install:


And remember: once you have downloaded the file, you can access it under /mnt/c/ or similar. The website tells you the full commands to run, so it isn’t a challenge even if you don’t know much linux. The installer takes quite a while to download and complete, so again: patience. It is also worth noting that when installing this way, no symlink or similar is added to the path – so you can either do that yourself, or just use the tool in the installation location of “/usr/local/cuda-7.5/bin/nvcc”:


After that, compiling to cubin, fatbin, or ptx should just work. The actual commands aren’t very interesting and are described in the documentation. I just wrap my NVCC commands in a bash file (remembering to ensure newlines are LF only), and run that. Here we can see no “fatbin” before the script, executing the script, and the “fatbin” existing after the script:


So what have we done?

  • we have installed the linux subsystem
  • we have installed some linux tools and utilities
  • we have used those tools as part of a development build process, accessing our windows file system
  • all without needing a separate VM

The fact that it involved NVCC and CUDA is largely incidental; the same techniques could be used on many tools.

Minor confessions

1. Life doesn’t often run smoothly. I’m not much of a linux person, and it took me very little time to completely trash my linux install. Fortunately the nuclear reset option exists.

2. The fast ring installs are … temperamental. As it happens, it installed just fine on my travel laptop (hence @TRAVEL in the screens), but it silently failed to install on my main development desktop, and has now disappeared from the update UI … so I can’t find any way to get it installed. Unfortunately, my laptop doesn’t have a CUDA-enabled GPU, so actually this hasn’t saved me anything – I still need to use my Hyper-V VM for real work. But: it was still fun to get it working!

Tuesday, 24 November 2015

The road to DNX–part 3

In part 1, we looked at an existing library that we wanted to move to core-clr; we covered the basics of the tools, and made the required changes just to change to the project.json build approach, targeting the same frameworks.

In part 2, we looked at “dnxcore50”, and how to port a library to support this new framework alongside existing .net frameworks. We looked at how to setup and debug tests. We then introduced “dnx451”: the .net framework running inside DNX.

In part 3, we dive deeper still…

Targeting Hell

You could well be thinking that all these frameworks (dnx451, dnxcore50, net35, net40, etc) could start to become tedious. And you’d be right! FastMember only targets a few, but as a library author, you may know the … joy … of targeting a much wider set of .net frameworks. Here’s the build tree for protobuf-net r668 (pre core-clr conversion):


These are incredibly hard to build currently (often requiring per-platform tools). It is a mess. Adding more frameworks isn’t going to make our life any easier. However, many of these frameworks have huge intersections. Rather than having to explicitly target 20 similar frameworks, how about if we could just target an entire flavor of similar APIs? That is what the .NET Platform Standard (aka: netstandard) introduces. This is mainly targeting a lot of the newer frameworks, but then… it is probably about time I dropped support for Silverlight 3. With these new tools, we can increase our target audience very quickly, without overly increasing our development burden.

Important: the names here are moving. At the current time (rc1), the documentation talks about “netstandard1.4” etc; however, the tools recognise “dotnet5.5” to mean the same thing. (edited: I originally said dotnet5.4===netstandard1.4; I had an off-by-one error - the versions are 4.1 off from each-other!) Basically, the community made it clear that “dotnet” was too confusing for this purpose, so the architects wisely changed the nomenclature. So “netstandard1.1” === “dotnet5.2” – savvy?

Great! So how do we do this? It is much easier than it sounds; we just change our project.json from “dnxcore50” to “dotnet5.4”:


netstandard1.4 (dotnet5.5) is the richest variant – the intersection of DNX Core 5.0 and .NET Framework 4.6.*. We're going to target dotnet5.4. If you go backwards (1.3, 1.2, 1.1, etc) you can target a wider audience, but using a narrower intersection of available APIs. netstandard1.1, for example, includes Windows Phone 8.0. There are various tables on the .NET Platform Standard documentation that tell you what each name targets. Notice I added a “COREFX” define. That is because the compiler is now including “DOTNET5_4” as a build symbol, not the more specific “DNXCORE50”. To avoid confusion (especially since I know it will change in the next tools drop), I’ve changed my existing “#if DNXCORE50” to “#if COREFX”, for my convenience:


We don’t have to stop with netstandard1.3, though; what I suggest library authors do is:

  1. get it working on what they actively want to support
  2. then try wider (lower number) versions to see what they can support

For example, changing to 1.2 (dotnet5.3) only gives me 13 errors, many of them duplicates and trivial to fix:


And interestingly, these are the same 13 errors that I get for 1.1 (dotnet5.2). If I try targeting dotnet5.1, I lose a lot of things that I absolutely depend on (TypeBuilder, etc), so perhaps draw the line at dotnet5.2; that is still a lot more potential users than 1.4. With some minimal changes, we can support dotnet5.2 (netstandard1.1); the surprising bits there are:

  • the need to add a System.Threading dependency to get Monitor support (aka: the “lock” keyword)
  • the need to explicitly specify the System.Runtime version in the test project
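The first of those surprised me less once I remembered what “lock” really is. As a rough sketch (the `Counter` class is invented for illustration, and the second method is only approximately what the C# compiler generates), these two methods do the same thing, and the second makes the dependency on the Monitor type explicit:

```csharp
using System.Threading;

// Hypothetical example type; the point is only the lock/Monitor relationship.
public sealed class Counter
{
    private readonly object _sync = new object();
    private int _value;

    // The everyday form: no Monitor in sight...
    public int IncrementWithLock()
    {
        lock (_sync)
        {
            return ++_value;
        }
    }

    // ...but this is roughly what the compiler turns it into.
    public int IncrementWithMonitor()
    {
        bool lockTaken = false;
        try
        {
            Monitor.Enter(_sync, ref lockTaken);
            return ++_value;
        }
        finally
        {
            if (lockTaken) Monitor.Exit(_sync);
        }
    }
}
```

So code that never mentions System.Threading by name can still need the package that provides Monitor.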

We can test this with “dnx test” / “dnx perf”, in both core-clr and .net under dnx, and it works fine. We don’t need the dnx451 specific build any more.


  • I have seen issues with dotnet5.4 projects trying to consume libraries that expose dotnet5.2 builds; this might just be because of the in-progress tooling
  • At the moment, xunit targets dnxcore50, not dotnet*/netstandard* – so you’ll need to keep your test projects targeting dnxcore50 and dnx451 for now; however, your library code should be able to just target the .NET Platform Standard without dnx451 or dnxcore50:


That’s pretty much the key bits of netstandard; it lets you target a wider audience without having a myriad of individual frameworks defined. But you can use this in combination with more specific targets if you want to use specific features of a particular framework, when available.


As this is aimed at library authors, I’m assuming you have previously deployed to nuget, so you should be familiar with the hoops you need to jump through, and the maintenance overhead. When you think about it, our project.json already defines quite a few of the key things nuget needs (dependencies, etc). The dnx tools, then, introduce a new way to package our libraries. What we need to do first is fill in some extra fields (copying from an existing nuspec, typically):


Now all we need to do is “dnu pack --configuration release”:


and … we’ve just built our nupkg (or two). Aside: does anyone else think “dnu pack” should default to release builds? Or is that just me? We can go in and see what it has created:


The nupkg is the packed contents, but we can also see what it was targeting. Looking at the above, it occurs that I should probably go back and remove the explicit dnx451 build, deferring to dotnet5.2, but… meh. It’ll all change again in rc2 ;p

I wish there was a “dnu push” for uploading to nuget, but for now I’ll just use the manual upload:


The details are as expected, so: library uploaded! (and repeated for the strong-name version; don’t get me started on the “strong-name or don’t strong-name” debate; it makes me lose the will to live).

We have now built, tested, packaged and deployed our multi-targeting library that uses the .NET Platform Standard 1.1 and above, plus .Net 3.5 and .Net 4.0. Hoorah for us!

I should also note that Visual Studio also offers the ability to create packages; this is hidden in the project properties (this is manipulating the xproj file):


and if you build this way, the outputs go into “artifacts” under the solution (not the project):


Either way: we have our nupkg, ready to distribute.

Common Problems

The feature I want isn’t available in core-clr

First, search dotnet/corefx; it is, of course, entirely possible that it isn’t supported, especially if you are doing WPF over WCF, or something obscure like … DataTable ;p Hopefully you find it tucked away in a different package; lots of things move.

The feature is there on github, but not on nuget

One thing you could do here is to try using the experimental feed. You can control your package feeds using NuGet.config in your solution folder, like in this example, which disregards whatever package feeds are defined globally, and uses the experimental feed and official nuget feed. You may need to explicitly specify a full release number (including the beta marker) if it is pre-release. If that still doesn’t work, you could perhaps enquire with the corefx team on why/when.

The third party package I want doesn’t support core-clr

If it is open source, you could always throw them a pull-request with the changes. That depends on a lot of factors, obviously. Best practice would be to communicate with the project owner first, and check for branches. If it is available in source but not on NuGet, you could build it locally, and (using the same trick as above) add a local package source – all you need to do is drop the nupkg in a folder on the file-system, and add the folder to the NuGet.config file. When the actual package gets released, remember to nuke any temporary packages from %USERPROFILE%/.dnx/packages.

I don’t like having the csproj as well as the project.json

Long term, we can probably nuke those csproj; they are handy to keep for now, though, to make it easy for people to build the solution (minus core-clr support).

The feature I want isn’t available in my target framework / Platform Standard

Sometimes, you’ll be able to work around it. Sometimes you’ll have to restrict what you can support to more forgiving configurations. However, sometimes there are cheeky workarounds. For example, RegexOptions.Compiled is not available on a lot of Platform Standard configurations, but it is there really. You can cheat by checking if the enum is defined at runtime, and use it when available; here’s a nice example of that. There are uglier things you can do, too, such as using reflection to see if types and methods are actually available, even if they aren’t there in the declared API – you should try to minimize these things. As an example, protobuf-net would really like to use FormatterServices.GetUninitializedObject() when it is available. Just… be careful. This trick won’t work on things like universal applications, but then: neither will most of what protobuf-net does, so that is a moot point.
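For the RegexOptions.Compiled case, a minimal sketch of the runtime check might look like the following. This is my own illustration rather than the linked example; the `RegexCompat` class and its members are hypothetical names. It looks the enum member up by name, so the code never needs RegexOptions.Compiled to exist at compile time:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names are invented for illustration.
public static class RegexCompat
{
    // RegexOptions.None if the runtime we are actually running on does not define "Compiled".
    public static readonly RegexOptions CompiledIfAvailable = Detect();

    private static RegexOptions Detect()
    {
        // Resolve by name at runtime instead of referencing the member directly.
        return Enum.IsDefined(typeof(RegexOptions), "Compiled")
            ? (RegexOptions)Enum.Parse(typeof(RegexOptions), "Compiled")
            : RegexOptions.None;
    }

    // Creates a regex that is compiled where possible and interpreted otherwise.
    public static Regex Create(string pattern, RegexOptions options = RegexOptions.None)
    {
        return new Regex(pattern, options | CompiledIfAvailable);
    }
}
```

Usage is then just `RegexCompat.Create(@"^v\d+$")`; the caller never sees whether the compiled option was actually applied.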

I’m having a problem with the tooling

The various teams are very open to feedback. I confess that I sometimes struggle to know what should go to the corefx team vs the team (some of the boundaries are largely arbitrary and historical), but it’ll probably find a receptive ear.


The core-clr project moves a lot of pieces, and a lot of things are still in flux. But: it is now stable enough that many library authors should be more than capable of porting their projects, and quite possibly simplifying their build process at the same time.

Happy coding.

The road to DNX - part 2

In part 1 I gave a brief introduction to the core-clr project and the key tools involved, from the perspective of a library author with existing .net libraries that they want to migrate to core-clr. I took a sample project (FastMember), and made some tooling changes to take it from a csproj-based build (targeting .net 3.5 and .net 4.0), to a build using project.json (again, targeting .net 3.5 and .net 4.0). Because the core-clr tools are not yet stable (rtm) or mainstream, I have retained the ability to build everything from the csproj – so that any arbitrary developer who clones the repo can build right away without having to install a pile of unfamiliar, unreleased tools.

In part 2, we start exploring what we can now do with the new tools.

Target Platforms

Here we’re going to look at adding a new core-clr build, and making the necessary code changes to make it compile.

The first thing we probably want to do is start playing with core-clr. At current (rc1), in terms of build tools this is “dnxcore50”. Our framework dependencies change from being “frameworkAssemblies” to “dependencies”: we’re now going to be pulling down each set of libraries from nuget separately (cached in %USERPROFILE%/.dnx/packages), rather than a monolithic platform install. All we need to do to start, then, is add a “dnxcore50” token into our project.json (for those who have done a lot of core-clr work: we’ll be changing this later, don’t worry), including a dependency on “System.Runtime” (the version number was chosen with the help of auto-complete, so I didn’t need to go out of the editor to look this up):


Since we’ve changed our dependencies, we need to use “dnu restore”. This looks at what project.json requires, and compares it to project.lock.json which tracks what has been resolved already. If anything is missing, it looks in .dnx/packages, and if they are missing it uses our defined package sources to go and get them. This works identically for framework dependencies and 3rd-party libraries – it makes the entire thing painless.

Having added “dnxcore50” and the “System.Runtime” dependency, we can try to build – although we’re not actually expecting it to compile yet. In fact, “dnu build” reports 270 errors, and Visual Studio lights up like the Blackpool illuminations:


At this point, we have a small job to do of going through the errors and figuring out what packages we are missing. Granular framework packages make deployment more convenient (and hopefully more frequent), but mean we need a few more dependencies.

Let’s look at the second one – Hashtable; one way of resolving this is to play auto-complete pot-luck by guessing that this is probably in System.Collections.Something:


Since Hashtable is non-generic, it is probably in System.Collections.NonGeneric; we can add this and that error goes away:


We can go through a few obvious ones this way, getting rid of almost two-thirds of the fail:


I still have errors relating to:

  • TypeBuilder / MethodBuilder / ILGenerator etc (the library does metaprogramming)
  • members of Type / MemberInfo: IsValueType, IsDefined, MemberType (the library does a lot of reflection)
  • IDataReader and DataTable

Which – broadly – covers some of the more subtle hurdles you need to jump.

finding rogue types

(edited: this tool does exactly this! - thanks for pointing that out, TIL).

It isn’t necessarily obvious where to find something like ILGenerator, especially since we’ve included System.Reflection. There may be better tools, but my usual go-to place is the dotnet/corefx repo; by searching for “class ILGenerator” you can usually quickly determine whether something exists. Often you’ll hit the actual definition first time, which tells you the package in the name. In this case we get the tests, which is probably enough:


but if you’re still stuck, you can click into the test, then go back a few levels until you find the test’s project.json, and look at what it referenced:


So I’m probably going to want System.Reflection.Emit and System.Reflection.Emit.ILGeneration.

working with refactored APIs

You’ll find a lot of places where the API available to you has been changed. For example, if you work with reflection – everything changes; System.Type ceases (or rather: ceased, quite a long time ago on a lot of frameworks) being the rich “I know everything” type. The idea is that Type is a lightweight token that allows you to identify and compare types, and if you want to know more you use System.Reflection.TypeInfo; there are methods to translate between the two (GetTypeInfo() and AsType(), respectively). Likewise, MemberTypes no longer exists. As far as I know, there is no single master list of these changes – you just kinda need to tease each one out separately.
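As a small sketch of that split (the `TypeInfoDemo` class is a made-up name, purely for illustration), the round trip between the token and the rich type looks like this:

```csharp
using System;
using System.Reflection;

// Hypothetical demo class showing the Type <-> TypeInfo round trip.
public static class TypeInfoDemo
{
    // Type is the lightweight token; TypeInfo carries the richer reflection surface.
    public static bool IsValueType(Type type)
    {
        TypeInfo info = type.GetTypeInfo(); // extension method, needs "using System.Reflection;"
        return info.IsValueType;
    }

    // AsType() goes back the other way, returning the original token.
    public static bool RoundTrips(Type type)
    {
        return type.GetTypeInfo().AsType() == type;
    }
}
```

So `TypeInfoDemo.IsValueType(typeof(int))` is true, `TypeInfoDemo.IsValueType(typeof(string))` is false, and the round trip preserves the type.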

Some changes you can make in a way that works satisfactorily on all frameworks; for example, rather than doing a switch on MemberType:


we can use some combination of “is”/”as”/cast:
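To illustrate the idea (this is a sketch, not necessarily the exact change made in FastMember; `MemberHelpers.GetValueType` is a hypothetical name), here is a helper that finds the value type of a field or property without touching MemberType:

```csharp
using System;
using System.Reflection;

// Hypothetical helper; shows "as" checks replacing a switch on MemberType.
public static class MemberHelpers
{
    // Before (relies on MemberTypes, which is not available on core-clr):
    //   switch (member.MemberType)
    //   {
    //       case MemberTypes.Field:    return ((FieldInfo)member).FieldType;
    //       case MemberTypes.Property: return ((PropertyInfo)member).PropertyType;
    //   }
    // After: ask the object what it is instead.
    public static Type GetValueType(MemberInfo member)
    {
        if (member == null) throw new ArgumentNullException("member");

        var field = member as FieldInfo;
        if (field != null) return field.FieldType;

        var property = member as PropertyInfo;
        if (property != null) return property.PropertyType;

        throw new NotSupportedException("Unsupported member: " + member.GetType().Name);
    }
}
```

The same source compiles unchanged for every target, which is the attraction of this kind of change over an #if.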


In some places, there are extension methods in utility libraries you can use to bridge the gap; for example, to add a lot of familiar methods back onto Type, you can add a dependency on “System.Reflection.TypeExtensions”, and ensure that you have “using System.Reflection;” in the code-file (because these are extension methods added by the System.Reflection.TypeExtensions type).

In other places, like IsValueType, IsPublic, etc - there is no single common API we can use; it is fundamentally different (and “extension properties” aren’t a thing). The good news is that the project.json build chain makes it easy to use #if sections to switch between different implementations. The upper-case name of the target framework is automatically added as a build symbol - so we can check using “#if DNXCORE50”. If the impacted API is only used in one place, you can just use #if in-situ, but for frequently recurring things like IsValueType (which is often all over your code), I do not recommend polluting all your code with constant #if. Rather, my strategy is to create a utility class that bridges the gap (usually, but not always, via extension methods), and have just that class deal with different implementations.
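A minimal sketch of such a bridging class might look like this. The class and method names (`TypeCompat`, `IsValueTypeCompat`, `IsPublicCompat`) are hypothetical, chosen for illustration; the only thing taken from the build chain is the DNXCORE50 symbol:

```csharp
using System;
using System.Reflection;

// Hypothetical bridge: the only place in the code base that knows about the API split.
public static class TypeCompat
{
    public static bool IsValueTypeCompat(this Type type)
    {
#if DNXCORE50
        // core-clr: the rich members live on TypeInfo
        return type.GetTypeInfo().IsValueType;
#else
        // regular .net: Type still exposes them directly
        return type.IsValueType;
#endif
    }

    public static bool IsPublicCompat(this Type type)
    {
#if DNXCORE50
        return type.GetTypeInfo().IsPublic;
#else
        return type.IsPublic;
#endif
    }
}
```

Calling code then just writes `someType.IsValueTypeCompat()` everywhere, with no #if in sight.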

The really nice thing is that the IDE helps us here: in the top left corner, it now tells us all the frameworks we are targeting, and we can switch between them in the context of a file; look in particular at which sections are greyed / colorized as we switch between frameworks:



Here’s the commit with everything compiling except System.Data, which I have just excised for now.

System.Data is a much more complicated story; while System.Data has been migrated, there are significant changes:

I’m going to look at this System.Data reference in a bit more detail, but the takeaway here is not System.Data: it is how we can investigate problems.

The first is a big problem if you’re using DataTable; this is a controversial change (see the linked thread), but in reality it is often misused. There are times when it is genuinely the right tool, but for most scenarios you should really have moved to an ORM or micro-ORM approach by now (did I mention that Dapper is available for core-clr? – 1.50 and upwards). It is also (mis)used as part of the reader metadata API. Even if we don’t ever see DataTable, we will need (soon) a new API that has similar aims as GetSchemaTable(). This is all a bit of an aside, but the point I’m trying to emphasize is that some APIs have irreconcilable differences. If your library depends on these features, you’re going to have some soul searching to do.

Ignoring the DataTable difference, we can still access much of the rest of System.Data; but we need to move from interfaces to abstract base types (DbDataReader instead of IDataReader). In this case, we have a little grunt work to do, but afterwards: we again have a single implementation with minimal #if. The one interesting bit is that DbEnumerator doesn’t seem to be in the current packaging:


It is in dotnet/corefx in a branch, but not in “master”. This looks like some “work in progress” in the conversion, since that API is meant to talk in terms of IDataRecord (or DbDataRecord), and neither of those is supported in core-clr currently, so it isn’t clear to me what this enumerator is meant to do on core-clr! You will occasionally find pockets like this; to seek clarification, I could look at what SqlDataReader does, or I could ask the developers. Checking github, it looks like it uses a copied implementation in System.Data.SqlClient. And despite being declared “public”, this simply isn’t in the currently published assemblies. In this case, it all looks a bit of a mess, and it isn’t critical, so I’ll ask the developers, and throw an exception in that scenario for now. Here’s my eventual core-clr conversion of the System.Data-related code.

Testing Against Multiple Frameworks

Woohoo! We now have a project that compiles against .net 3.5, .net 4.0 and dnxcore50. That’s a great start, but we haven’t actually done anything except get it to compile yet. We want to run our tests, too! If you remember from part 1, FastMember has tests that use NUnit. I’m a pragmatist when it comes to test tools. I’ll be honest: at the current time, the easiest way to test on core-clr is via xunit. I’m sure the other tools will catch up, though.

Now, you’re probably thinking “but I don’t want to change all my tests”. I agree with you. Which is why I don’t do that. Instead: I cheat. We can make a final decision whether to migrate the tests more formally when everything is RTM. Right now, we just want things to work. What we’re going to do is:

  • add a “dnxcore50” framework block to our test projects
  • use xunit from the new framework
  • add a bridge file (only active when targeting xunit) that shims between the two
  • #if out any tests that won’t compile on core-clr due to missing features

With these changes, we can start moving to testing. A key piece here is learning about dnx commands; our project.json can actually declare multiple named commands (which map to assemblies in which to locate an entry-point) that dnx can then invoke. For xunit, the one we want is “xunit.runner.dnx”, but as it happens FastMember.Tests already declares a Main() entry-point that does some performance tests. As such, we can declare multiple commands:


The IDE even updates to let us choose very conveniently what the “play” button should do:


If we’re going to use the IDE, we also need to make sure that we’re targeting .NET Core, since we haven’t enabled DNX tools for regular .NET yet (DNXCORE50 is .NET Core):


Finally, we can hit play, and amazingly our tests pass first time (this is the exception, not the rule):


We can switch to the “perf” command and run that:


To do the same thing at the command-line, the really really important thing to remember is to switch framework to core-clr via dnvm (here, c64 is an alias for “rc1 64-bit core-clr on windows” that I created in part 1):


In the IDE, you can debug tests with breakpoints in the ways you would hope. The test tooling is “functional” right now, but is improving at a rate.

But what about regular .net?

This is where it starts getting fun! Remember that dnxcore50 is core-clr using the dnx tools. net40 and net35 are regular .net; the dnx tools don’t really do much other than compile them. But! The dnx tools themselves allow you to run .net apps! There is a different build we use for this: dnx451. This is .net 4.5.1 on dnx. Because this is consuming framework assemblies (not nuget feeds), it is basically the same configuration as net40 – except we can now use the up-to-date xunit bits (which support dnx451); we can add a new build to all of our projects in the project.json:
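
For a library project, the new build might be sketched like this, assuming the net40 block only needs System.Data as a framework assembly (the version shown is the .NET 4.x assembly version). The test projects would additionally list the xunit packages under “dependencies” for dnx451; that part is left out here because the exact versions depend on your runtime.

```json
{
  "frameworks": {
    "net40": {
      "frameworkAssemblies": {
        "System.Data": "4.0.0.0"
      }
    },
    "dnx451": {
      "frameworkAssemblies": {
        "System.Data": "4.0.0.0"
      }
    }
  }
}
```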


The IDE now lets us successfully run tests targeting the .NET Framework using the dnx tools:


And again, we can use the command-line to run our tests from dnx, making sure to use dnvm to switch to the .NET Framework (I have an n64 alias for this):


Oops! I broke something, and it only impacts the .NET Framework version. This is easiest to diagnose in the IDE, where pressing F5 quickly tells me it is something to do with the DbDataReader.Dispose method:


Fair enough; that’s just my own brain-dead implementation. I fixed this, and another error that I had introduced (a boolean inversion; the fix was committed along with the dnx451 builds, here; Nick Craver is going to laugh at me for this…), but: this shows the importance of testing any changes you make to your existing codebase! Our tests now pass for dnxcore50 and dnx451:


End of part 2

That got longer than I expected. We’ve now got as far as targeting dnxcore50 and dnx451 (alongside regular net35 and net40), running test suites, debugging, etc. We’ve actually seen something happen, and we’ve seen things go wrong.

Coming up in the unplanned part 3 (part 2 got too big):

  • Targeting hell: what the hell is netstandard, and why should I care?
  • Packaging and deployment
  • Common problems

Continue to part 3

Monday, 23 November 2015

The road to DNX – part 1

Target audience: library authors who want to get into this “dnx” thing.

Part 2; Part 3

Unless you have been asleep at the wheel, you probably know that Microsoft have been working really really hard at moving forward with the “corefx” / “core-clr” / “dnx” / “ASP.NET 5” stream of work (all broadly related and often used interchangeably, whether correctly or incorrectly) – their effort to make .net a truly open-source, cross-platform open technology. An awesome set of aims. A few days ago saw Release Candidate 1, and it is now becoming very capable.

Any platform is only as rich as the ecosystem – the libraries available for it. In the case of core-clr (which, right or wrong, I will now use to mean the set of things mentioned above), this is a combination of the framework libraries (which are now open source, this is the “corefx” piece), but also the third party community libraries – which often, but not exclusively, means nuget.

It is my judgement that there are a large number of library authors and contributors who want to start exploring the tools, but find the current state confusing and overwhelming. My aim here, then, is to try to demystify what you need to do. I’m going to assume that you already have some .net libraries that you want to migrate to core-clr. I’m drawing on the involvement I’ve had working on the core-clr conversions of Dapper, protobuf-net, Sigil, Jil (all now available for core-clr on nuget), SimpleSpeedTester (PR not yet taken), and StackExchange.Redis (I’m still working through a big PR kindly contributed by some awesome Microsoft folks). Topics I hope to cover:

Part 1 – running fast to stand still
  • the tools of the trade – what you need to get started, what each does, and where to find things
  • our sample project: FastMember
  • say hello to project.json and package structure
Part 2 – learning to fly
  • understanding target platforms / monikers
  • more on package management
  • changing your code to fit the platform; what is going to hurt?
  • testing your code
  • packaging and deployment
I’m not going to cover application code such as MVC applications.

Caveat emptor

All of these things are evolving; I hope it is all correct at the moment (rc1), but many details may have subtle changes by rtm and beyond. Such is the life of the software developer. Expect your cheese to be moved.

The tools of the trade

In the past, the .net framework has been huge system-wide installs of the entire framework library, with many upgrades over-the-top (meaning: once you’ve installed 4.6.1 or whatever: all 4.6 apps get those changes). In dnx, everything is much more granular; this makes for a much better upgrade cadence – System.Some.Component has changes they want to get out? Sure thing: they just deploy it to nuget, and you pick it up when you choose. The tool-chain is very different, and you need some new pieces; so… let’s go get them.


The first thing you need is “dnvm” – the “.NET Version Manager”. This tool is in charge of installing and managing as many different runtimes as you like, including cross-targeting reference runtimes. Essentially, when people talk about “1.0.0-rc1-final” (the current release), that is the runtime version. You can install this for windows, mac or linux. If you’re using Visual Studio, be sure to install the appropriate bits (look just above “runtime and tooling”). In particular: don’t get confused by the fact that the page is talking about “ASP.NET 5”. Even if you are a pure library author with no interest in ASP.NET, this is the right stuff. I told you the names were largely interchangeable!

So what does dnvm do? Once it is installed, the first thing you want to do is “dnvm update-self” (to ensure you have the latest dnvm tooling), then “dnvm upgrade”, to update to the latest runtime.
dnvm is basically a tool that manages and switches between any number of runtimes, where runtimes are just folders under %USERPROFILE%/.dnx/runtimes; here are mine:


which is exactly what I get if I type “dnvm list”


If I decide I don’t need those beta 4-8 bits, I can just delete the folders any way I choose, and type “dnvm list” again:


(I could also use “dnvm uninstall” for this)

I can install additional runtimes with “dnvm install”, and I can give runtimes aliases – for example, in my examples I’ve added “c64” to mean “rc1 coreclr on x64 targeting windows”. I did this with:

dnvm alias c64 1.0.0-rc1-final -r coreclr -a x64

(and likewise for the others) I can now switch my command-line tools between runtimes by entering “dnvm use c64” or “dnvm use n86”. Note that a runtime comprises not just the core pieces to make .net work at all, but also includes per runtime our other main tools - “dnu” and “dnx”.


Our next tool is “dnu”, the “Microsoft .NET Development Utility”. This acts as a wrapper including:

  • package management (for obtaining and managing our dependencies)
  • build tools (the compiler)
  • packaging and deployment tools (think: “nuget pack”)

To see this tool in action we really need to have a project, but perhaps the most important thing to remember about a lot of what it does is: %USERPROFILE%/.dnx/packages. In the same way that dnvm owns /runtimes, dnu owns /packages – the local cache of dependencies we have on our local machine.


The last of our tools is “dnx”, the “Microsoft .NET Execution environment”; basically, it runs stuff! There are ways of bootstrapping things, but for most dev purposes, dnx is your friend (unless you’re using an IDE to do the same thing).
Again, we can’t really show dnx doing much without a project, so we’ll come back to it.
All of these are command-line tools; almost everything can also be done in the IDE (with the right tools); but it is worth understanding what is going on.

What is FastMember?

Frankly, it is a little project I wrote ages ago and haven’t changed in ages (I didn’t even migrate it from google-code until recently). In fact, I even lost the snk password and had to break the identity (well damn, that’s embarrassing). What it does isn’t particularly important – just that it is an existing real-world library that I want to move to core-clr (one of the things it does very nicely is allow you to expose an IEnumerable<T> as an IDataReader for SqlBulkCopy). It does a few non-trivial things, but we’ll burn that bridge when we get to it.

Say hello to project.json


This is actually one of the hardest bits. Once you have the project structure working, most other things are relatively easy! This step is awkward for existing projects. It would be nice if the tools made this a little less messy.

You may have heard mention of project.json; this is the new format that can be used as an alternative to a csproj file. It is clean, human maintainable, and relatively versatile. So how do we get one? There is a way to do this with “dnu wrap”, but I’m not personally very satisfied with the hybrid beast that results from that – I’m going to focus instead on a complete transition to core-clr tooling. The good news is that a project.json file is very simple – and it is opinionated, with a lot of assumptions made implicitly (like: “include all the .cs files in the sub-tree”). One of the opinions it currently holds very strongly is that the folder name defines identity. So since I want my package to be FastMember, it needs to be in a folder called FastMember. This fits my existing file structure, except currently my .csproj is also in the FastMember folder:


This is actually slightly problematic, and is a current pain point – because project.json and csproj do not play nicely in the same folder. While transitioning (as in: while you wait for the tools to stabilise so that the entire team are familiar with working in dnx), you probably want to have both csproj and project.json builds side-by-side. Your mileage may vary, but what has worked best for me is to move the csproj.

You should apply your own thoughts, but since a csproj targets a single framework (and a project.json doesn’t), I’m using a _Net40, _Net35 etc suffix for each of my csproj folders, with the project.json going into the main folder. So for each of my projects I’m going to:

  • Relocate the csproj file (and packages.config) into FastMember_Net40 (the original FastMember/FastMember.csproj targets .net 4.0) – don’t worry about multi-targeting – that will be covered later
  • Manually edit the csproj to pick up code files from the existing location – there’s a trick you can do here with the same “everything under the sub-tree” approach; basically, from the new location I can tell it to include “..\FastMember\**\*.cs”
  • Create a minimal project.json; for starters, “{ }” will let us at least get to the next step


  • Rename the FastMember_Signed folder to FastMember.Signed, because the nuget package is called FastMember.Signed
  • Update the existing sln with the new csproj project locations
  • Create a new sln that has the project.json projects

Here’s the outcome:


The _Net35, _Net40 etc just contain the various csproj files for different builds. My actual code is in the FastMember and FastMember.Tests folders, so that I can move the project.json into each. I split this into two commits – one that refactored the existing csproj / sln, and one that created the project.json and corresponding sln.

IMPORTANT: as soon as you add a project.json to a sln you get a second file per project.json for free: {YourProject}.xproj; these should be included in source control.

IMPORTANT: as soon as you try to build (which first invokes package restore), or explicitly run “dnu restore” – you get a project.lock.json file per project.json; this is an internal tracking file and does not need to be included in source control.

So after this, I have:

  • actual code (.cs) in FastMember and FastMember.Tests
  • a project.json (and lock file, and xproj) in FastMember, FastMember.Tests, FastMember.Signed, FastMember.Signed.Tests (note that the .Signed ones will be identical once complete, but with strong names included) – linked by FastMember.DNX.sln
  • csproj / packages.config in each of the _Net35 / _Net40 folders – linked by FastMember.sln

Now, after all that, I can load *either* of the two solutions.
At this point, my project.json is just a dummy “{ }”, but we can fill it out; it is alarmingly simple to get a minimal build that mirrors the existing .net (pre core-clr) setup:

  • FastMember and FastMember.Signed should each target .net 3.5 and .net 4.0
  • they all need to reference System.Data from the BCL
  • the 3.5 build needs to define NO_DYNAMIC to compile without “dynamic”
  • the two test projects should reference NUnit 3 and their corresponding main project file
  • the .Signed versions should use the SNK, and need to obtain their .cs files from the parallel non-signed versions
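
As a hedged sketch of how the first three bullets might translate for the main FastMember project (the version number and the exact System.Data assembly versions are assumptions for illustration, not copied from the repository):

```json
{
  "version": "1.0.0-*",
  "frameworks": {
    "net35": {
      "compilationOptions": {
        "define": [ "NO_DYNAMIC" ]
      },
      "frameworkAssemblies": {
        "System.Data": "2.0.0.0"
      }
    },
    "net40": {
      "frameworkAssemblies": {
        "System.Data": "4.0.0.0"
      }
    }
  }
}
```

The .Signed variants would, as far as I understand the rc1 schema, add a “keyFile” entry under “compilationOptions” and a top-level “compile” pattern such as “../FastMember/**/*.cs” to pull in the unsigned project’s source; treat those names as assumptions to verify against the published schema.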

The project.json for the above is very simple; and if you use Visual Studio the IDE will automatically prompt you in all the right places (for other editors, the schema is published here). Note that BCL references come under “frameworkAssemblies”, whereas packages from our package manager come under “dependencies”. One really nice thing the Visual Studio tools do for us here is auto-completion on package sources – on both names and versions:


A quick shout out to “Visual Studio Code”: it should be noted that the folder-based approach used by core-clr works very well in this cut down (but fast and well-featured) editor. And for extra awesome, it includes all these same abilities (just by typing “code .” from the command-line in the folder of choice):


In addition to building in Visual Studio, we can also build at the command line (after running “dnu restore”) by using “dnu build” or “dnu build --configuration release” in any of the folders with a project.json:


As you can see, building a single project.json can build for multiple targets – in this case .NET 3.5 and .NET 4.0. This results in the usual dll, pdb and xml outputs under bin/debug/net35 and bin/debug/net40 – or bin/release/net35 and bin/release/net40 if we specified release. So far, so good.

End of Part 1

At this point, you would be well within your rights to be underwhelmed. We’ve taken quite a bit of effort to get back to exactly where we started from: a project we can build that targets .net 3.5 and .net 4.0 and can be compiled to binaries - but using a project.json instead of a csproj. Everything so far has been just tooling changes. In part 2, we’ll get into what this enables. It goes uphill from here, honest! See you soon.

Continue to part 2