Monday 9 May 2016

CUDA–getting started in .NET


In my two previous posts (part 1, part 2) I described how we have found uses for CUDA. I am, however, aware that this abstract text doesn’t necessarily help the reader (aka “you”) make inroads into the world of CUDA. So: the purpose of this post is to take the concepts discussed previously and put them together into some actual code.

Firstly, a word on OpenCL: so far, I’ve spoken about CUDA – NVIDIA’s framework for GPGPU programming. Other frameworks exist, and the largest “other” is undoubtedly OpenCL. This framework is designed to let you use the same code over a range of technologies – GPUs from different providers, CPUs, etc. I am aware of it – I just haven’t even tried to address it. If that’s your bag: have fun. I don’t feel hugely limited by restricting myself to CUDA for what I need.

The first things you need for CUDA, then, are:

  • a CUDA capable GPU device
  • the CUDA toolkit
  • a programming environment

Since I’m focused on .NET, we’re also going to need one of the tools for interacting with CUDA from .NET, or we’ll need to write extern wrappers for the C API. A range of CUDA tools for .NET exist, including CUDAfy.NET, managedCUDA, Alea GPU / QuantAlea, Campy.NET, and several others that Campy.NET list and describe on their home page. These fall into two major categories:

  • wrappers for the driver API (you write your kernels in C etc and launch them from C#)
  • IL-to-PTX generators (you write and launch your kernels in C#)

While I’m a huge fan of C#, I’m also a huge fan of keeping close to the metal, and in a heartbeat will sacrifice some programming convenience for performance.

Our first kernel

So without further ado, I’m going to jump straight in with a basic CUDA kernel written in C. Rather than trying to discuss the tag-engine, we’re going to do something simpler… like multiplying numbers. Multiplication is hard, right?

typedef struct {
    int Id;
    unsigned int Value;
} SomeBasicType;

extern "C" {
    __global__ void Multiply(const int N, SomeBasicType* __restrict data, int factor) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
i += blockDim.x * gridDim.x)
{ (data + i)->Value *= factor; } } }

The first thing this does is to define a data-type (SomeBasicType) for us to share data between the CPU and GPU code. The __global__ indicates that we’re declaring a kernel, along with a given name and signature. We’re passing in N, the number of elements to process, data, a pointer to some of the data to work against, and factor - the number to multiply the values by. The significance of the N comes into play when we see the i < N in the for loop on the next line. Recall that we usually launch a number of blocks, each consisting of a set number of threads. It is not always the case that the number of elements to process is a convenient multiple of some block size, so it is common for kernels to be passed a count to compare the position against, exiting if the current thread is not needed. This sounds like a contradiction to the “don’t branch” guidance, but as long as most of the threads make the same choice, this isn’t going to be a performance problem.

Recall also that I described a “monolithic kernel” earlier, where-by we launch a thread for every element to process. But: there is a hard limit to the number of threads in a block, and the number of blocks in any dimension – and as always: the more threads there are, the more management overhead there is (although this works differently to CPU threads). As such, it is common to use a “grid-stride loop” rather than a monolithic kernel. As in our example, a for loop is used to that each thread processes multiple elements. But unlike the CPU version shown previously (where each thread processed a separate chunk of data), we instead have each thread in an iteration process adjacent elements, and then move forwards by the stride – namely the width of each block multiplied by the number of blocks. This has multiple side benefits:

  • it can cope with data of any size – it is not subject to the 75M limitation
  • it requires less overhead for the GPU engine
  • it can still be used as a monolithic kernel (by setting the dimensions such that it only performs one iteration per thread), or as a single-threaded debugging kernel that processes all the data (by setting the dimensions to 1 thread in 1 block)
  • the work being done by a warp or block is contiguous, minimizing memory access overheads

Finally, we get to the actual meat of the kernel, which does an in-place multiplication of one of the fields of our data by one of the parameters.

That’s our first kernel, all done!

A side note on data width

You’ll note that the above example is doing 32-bit arithmetic. For doing serious CUDA work, it is important to be aware of whether you are doing 16-bit, 32-bit or 64-bit arithmetic, as it can significantly impact your choice of hardware – considering them in release order:

  • the “Kepler” architecture (server devices: K40, K80) will serve as a baseline – let’s arbitrarily call it “good” at both 32-bit and 64-bit; 16-bit operations are performed using 32-bit arithmetic
  • the “Maxwell” architecture (server device: M40) is faster than Kepler at 32-bit operations, but is relatively poor at 64-bit operations; 16-bit operations are performed using 32-bit arithmetic
  • to complicate this: the fact that the K80 packs two devices on a physical package means that even though a single M40 may be faster than a single K80 device, code that appropriately uses both K80 devices on a package may still out-perform a single M40
  • the “Pascal” architecture apparently (not yet available for testing; server device: P100) significantly increases both 32-bit and 64-bit performance, while also introducing specific optimized instructions for 16-bit processing

Because of this, there is no single way of saying “this device is faster” – it depends a lot on what exactly you are doing!

Compiling our kernel (and: hello managedCUDA)

A kernel as a text source-file isn’t much use by itself. We need to convert that into GPU-runnable code – either intermediate (“PTX”) or fully compiled to bytecode (“cubin”). There is also a middle-ground that combines both bytecode and intermediate instructions – a “fatbin”; this allows a compiled kernel to target multiple physical architectures and still allow usage on other architectures via a JIT step. There are two common ways of compiling a C kernel – compile-time (NVCC) and runtime (NVRTC). I’ve spoken previously about using NVCC, so for this example I’ll use NVRTC instead.

Since we want to do this from .NET, we need a wrapper library. My examples today are going to use managedCUDA, written by Michael Kunz. This is available on NuGet and is licenced under the LGPL – which I understand (and: IANAL) is not the same as the notoriously infectious and restrictive GPL. What I particularly like about this library is that in addition to exposing an idiomatic .NET object model that maps to the C API, it also exposes the raw C API directly – which makes it convenient to use even when the method you want doesn’t map naturally to a .NET concept.

So let’s load our kernel from a text file and compile it at runtime:

string path = "MyKernels.c";
ManagedCuda.NVRTC.nvrtcResult result;
using (var rtc = new ManagedCuda.NVRTC.CudaRuntimeCompiler(
File.ReadAllText(path), Path.GetFileName(path))) { try { rtc.Compile(new string[0]); result = ManagedCuda.NVRTC.nvrtcResult.Success; } catch(ManagedCuda.NVRTC.NVRTCException ex) { result = ex.NVRTCError; } log = rtc.GetLogAsString(); if (result == ManagedCuda.NVRTC.nvrtcResult.Success) { byte[] ptx = rtc.GetPTX(); // ... } }

This takes our file contents as input, and (if successful) produces a blob that contains the compiled kernel. We can also obtain the log which will contain detained error messages if our kernel is invalid.

The CUDA context

One of the key objects in CUDA is the context. This wraps the state for a long-running CUDA scenario. Most operations, including memory allocation and kernel launches typically happen within a CUDA context. managedCUDA exposes a class that wraps this. Note:

  • a CUDA context is bound to a particular GPU device, and it is when creating the CUDA context that you can target specific devices
  • a CUDA context is associated with the CPU thread that creates it; as such, CUDA code would work well for scenarios such as a dedicated worker thread processing items from a queue, but would not work well from .NET async/await code that switches between multiple threads

So; let’s create our context and load the kernel:

CudaContext ctx = new CudaContext(deviceId, true);
CudaKernel multiply = ctx.LoadKernelFatBin(ptx, "Multiply");

Note that the context (like many CUDA objects) is IDisposable, and you should ensure to Dispose() it properly, via using or otherwise.

CUDA streams and asynchronous processing

In the simplest usage, the CUDA methods are blocking, however it is often the case that either the CPU can do other useful work while the GPU churns, or we can overlap multiple GPU operations – memory transfers running in parallel with kernel execution, or (depending on the hardware) even multiple kernels running at the same time. To allow this, CUDA introduces the notion of a “stream”. Work on the same stream is processed in order, but the work on different streams may overlap. I’m not going to make extensive usage of streams in this walkthrough, but it is easier to include stream considerations from the start than to add it later.

Initializing memory at the host and on the device

Traditionally, the host (CPU) memory and the device (GPU) memory were completely separate, and all CUDA code had to specifically be cautious of whether it was referring to host or device memory. More recently, the concept of “unified memory” has been introduced that hides this distinction, and performs transfers automatically as needed – it is intended to make it simpler to use the API, but for the maximum performance it is recommended to take full control over allocation and copy. To that end, I’ll stick to the traditional approach, which means explicitly allocating the data at both ends.

We’ll start by allocating the host memory:

struct SomeBasicType {
    public int Id;
    public uint Value;

IntPtr hostPointer = IntPtr.Zero;
var res = DriverAPINativeMethods.MemoryManagement.cuMemAllocHost_v2(
    ref hostPointer, count * sizeof(SomeBasicType));
if (res != CUResult.Success) throw new CudaException(res);
SomeBasicType* hostBuffer = (SomeBasicType*)hostPointer;

As before, we declare a raw data type to share the data between CPU and GPU. Here is one of those cases where I’m making use of the raw C API rather than the wrapped objects, as this (IMO) more conveniently allows naked allocations. You might think “hang on, you’re just allocating some memory – surely a .NET array would suffice, or maybe at worst Marshal.AllocHGlobal” – but: in order for the CUDA memory-copy engine to work at maximum performance (avoiding double-copies, etc) it is required to be configured in a specific way, and the best way to guarantee that is to let the CUDA API allocate the memory in the first place. A pinned .NET array certainly wouldn’t suffice. Or at least, it would work – but not as fast. Yes, this needs some unsafe code; it’ll be fine, don’t panic.

Once allocated, we can intialize this memory with some invented data:

for (int i = 0; i < count; i++) {
    hostBuffer[i].Id = i;
    hostBuffer[i].Value = (uint)i;

Next we’ll want to allocate a similar block of memory on the device, and start copying the data from the host to the device. Emphasis: large memory transfers are fast, but not instant – if possible you should minimize the amount of data you need to transfer for individual operations. The ideal scenario is to copy the data to the device once / periodically, and use it many times once it is there.

CudaDeviceVariable<SomeBasicType> deviceBuffer = new CudaDeviceVariable<SomeBasicType>(count);
CudaStream defaultStream = new CudaStream();
res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyHtoDAsync_v2(deviceBuffer.DevicePointer,
    hostPointer, deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

This time I'm using the CudaDeviceVariable<T> object to represent the memory at the device, which is perfectly convenient since we will not usually need to do anything except access the DevicePointer to pass into methods. Once again I'm using a raw C-style API for the actual memory copy. There are some transfer methods exposed directly on the object, but this particular combination is not exposed directly. Note that because we have created and used a stream, this is non-blocking. Related operations sent to the same stream will be queued behind this one.

Launching our kernel

We have data on the device; we have a kernel that we have compiled and loaded. Now finally to run the kernel against the data! First we must configure the dimensions – the number of threads in a block and the number of blocks:

multiply.BlockDimensions = new ManagedCuda.VectorTypes.dim3(threadsPerBlock, 1, 1);
multiply.GridDimensions = new ManagedCuda.VectorTypes.dim3(blockCount, 1, 1);

multiply.RunAsync(defaultStream.Stream, new object[] {
    count, deviceBuffer.DevicePointer, value

Note that the parameters to RunAsync (other than the stream) match the signature on the kernel itself. value in the above is the factor to multiply by. Once we've got the data in place, running kernels is actually alarmingly simple!

Getting results back

It isn’t usually sufficient to run kernels – sadly we usually expect to get results. As mentioned before, we should strive to minimize the amount of data we want to transfer, but for simplicity in this example I’m going to fetch back the entire buffer, over-writing the previous contents. In real code you might be extracting only small portions of the buffer, or summary / aggregate data.

var res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyDtoHAsync_v2(
    new IntPtr(hostBuffer), deviceBuffer.DevicePointer,
    deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

Once more, for memory operations it has been easier to use the raw C API than the wrapped managed API, and that's OK. Note that because we've been using streams, at this point we've only queued the copy. If we've set all our wheels in motion, and want to sit back and wait while they turn, we can use either ctx.Synchronize(); (to wait for all streams to complete) or defaultStream.Synchronize() (to wait for a specific stream to complete). Finally, when that is done, we can iterate over the data and congratulate ourselves for managing to over-complicate the multiplication of numbers!

Not enough code for you?

Everything I’ve shown here is available in a github project, that you can clone, fork, etc. I don’t claim it is the most useful tool ever, but hopefully it might help a few people get up and running with CUDA.