Sunday, 23 April 2017

Spans and ref part 2 : spans

Spans and ref part 2 : spans

In part 1, we looked at ref locals and ref return, and hinted at a connection to “spans”; this time we’re going to take a deeper look at what this connection might be, and how we can use make use of it.


I’m mostly on the outside of this - looking in at the public artefacts, playing with the API etc - maybe the odd PR or issue report. It is entirely possible that I’ve misunderstood some things, and it is possible that things will change between now and general availability.

What are spans?

By spans, I mean System.Span<T>, which is part of .NET Core, living in the System.Memory assembly. It is also available for .NET via the System.Memory package. But please note: it is a loaded gun to use at the moment - you can currently compile code that has undefined behavior, and which may not compile at some point in the future. Although to be fair, to get into any of the terrible scenarios you need to use the unsafe keyword, at which point you already said “I take full responsibility for everything that goes wrong here”. I’ll discuss this more below, but I wanted to mention that at the top in case you stop reading and don’t get to that important point.

Note that some of the code in this post uses unreleased features; I’m using:

<PackageReference Include="System.Memory"
    Version="4.4.0-preview1-25219-04" />
<PackageReference Include="System.Runtime.CompilerServices.Unsafe"
    Version="4.4.0-preview1-25219-04" />

Obviously all bets are off with preview code; things may change.

Why do spans need to exist?

We saw previously how ref T can be used similarly to pointers (T*) to represent a reference to a single value. Basically, anything that allows us to talk about complex scenarios without needing pointers is a good thing. But: representing a single value is not the only use-case of pointers. The much more common scenario for pointers is for talking about a range of contiguous data, usually when paired with a count of the elements.

At the most basic level, a Span<T> represents a strongly typed contiguous chunk of elements of type T with a known and enforced length. In many ways, very comparable to an array (T[]) or segment ArraySegment<T>) - but… more. They also provide safe (by which I mean: not unsafe in the C# sense) access to features that would previously have required pointers (T*).

I’m probably missing a few things here, but the most immediate features are:

  • provide a unified type system over all contiguous memory, including: arrays, unmanaged pointers, stack pointers, fixed / pinned pointers to managed data, and references into the interior of values
  • allow type coercion for primitives and value-types
  • work with generics (unlike pointers, which don’t)
  • respect garbage collection (GC) semantics by using references instead of pointers (the GC only walks references)

Now: if none of the above sounds like things you ever need to do, then great: you probably won’t ever need to use Span<T> - and that’s perfectly OK. Most application code will never need to use these features. Ultimately, these tools are designed for lower level code (usually: library code) that is performance critical. That said, there are some great uses in regular code, that we’ll get onto.

But… what is a span?

OK, OK. Conceptually, a Span<T> can be thought of as a reference and a length:

public struct Span<T> {
    ref T _reference;
    int _length;
    public ref T this[int index] { get {...} }

with a cousin:

public struct ReadOnlySpan<T> {
    ref T _reference;
    int _length;
    public T this[int index] { get {...} }

You would be perfectly correct to complain “but… but… in the last part you said no ref fields!”. That’s fair, but I did say conceptually. At least… for now!

Spans as ranges of an array

As a completely trivial (and rather pointless) example, we can see how we can use a Span<T> very similarly to how we might have used a T[]:

void ArrayExample() {
    byte[] data = new byte[1024];
    // not shown: populate data
void ProcessData(Span<byte> span) {
    for (int i = 0; i < span.Length; i++) {

Here we implicitly convert the byte[] to Span<byte> when calling the method, but at this point you would still be justified in being underwhelmed - we could have done everything here with just an array.

Similarly, we can talk about just a portion of the array:

void ArrayExample() {
    byte[] data = new byte[1024];
    // not shown: populate data
    ProcessData(new Span<byte>(data, 10, 512));
void ProcessData(Span<byte> span) {
    for (int i = 0; i < span.Length; i++) {

And again you could observe that we could have just used ArraySegment<T>. Actually, let’s be realistic: very few people use ArraySegment<T> - but we could have just passed int offset and int count as additional parameters, it would have worked fine. But I mentioned pointers earlier…

Spans as ranges of pointers

The second way we can use Span<T> is over a pointer; which could be any of:

  • a stackalloc pointer for a small value that we want to work on without allocating an array
  • a managed array that we previously fixed
  • a managed array that we previously pinned with GCHandle.Alloc
  • a fixed-sized buffer that we previously fixed
  • the contents of a string that we previously fixed
  • a coerced pointer from any of the above (I’ll explain what this means below)
  • a chunk of unmanaged memory obtained with Marshal.AllocHGlobal or any other unmanaged memory API
  • etc

All of these will necessarily involve unsafe, but: we’ll tread carefully! Let’s have a look at a stackalloc example (stackalloc is where you obtain a chunk of data directly on the call-stack):

void StackAllocExample() {
    unsafe {
        byte* data = stackalloc byte[128];
        var span = new Span<byte>(data, 128);
        // not shown: populate data / span
void ProcessData(Span<byte> span) {
    for (int i = 0; i < span.Length; i++) {

That’s… actually pretty huge! We just used the exact same processing code to handle an array and a pointer, and we didn’t need to use unsafe (except in the code that initially obtained the pointer). This opens up a huge range of possibilities, especially for things like network IO and serialization. Even better, it means that we can do all of the above with a “zero copy” mentality: rather than having managed code writing to a byte[] that later gets copied to some unmanaged chunk (for whatever IO we need), we can write directly to the unmanaged memory via a Span<T>.

Slice and dice

A very common scenario when working with buffers and buffer segments is the need to sub-divide the buffer. Span<T> makes this easy via the Slice() method, best illustrated by an example:

void ProcessData(Span<byte> span) {
    while(span.Length > 0) {
        // first byte is single-byte length-prefix
        int len = span[0];

        // process the next "len" bytes
        ProcessChunk(span.Slice(1, len));

        // move forward len+1 bytes
        span = span.Slice(len + 1);

This isn’t something we couldn’t do other ways, but it is very convenient here. Importantly, we haven’t allocated anything here - there’s no “new array” or similar - we just have a reference to a different part of the existing range, and / or a different length.


A more interesting example is coercion; this is something that you can do with pointers, but is very hard to do with arrays. A classic scenario here would be IO / serialization: you have a chunk of bytes, and at some point in that data you need to treat the data as fixed-size int, float, double, etc data. In the world of pointers, you just… do that:

byte* raw = ...
float* floats = (float*)raw;
float x = floats[0], y = floats[1]; // consume 8 bytes

With arrays, there is no direct way to do this; you’d either need to use unsafe hacks, or you can use BitConverter if the types you need are supported. But this is easy with Span<T>:

Span<byte> raw = ...
var floats = raw.NonPortableCast<byte, float>();
float x = floats[0], y = floats[1]; // consume 8 bytes

Not only can we do it, but we have the added advantage that it has correctly tracked the end range for us during the conversion - we will find that floats.Length is equal to raw.Length / 4 (since each float requires 4 bytes). The important thing to realise here is that we haven’t copied any data - we’re still looking at the exact same place in memory - but instead of treating it as a ref byte, we’re treating it as a ref float.

Except… better!

We observed that with pointers we could coerce from byte* to float*. That’s fine, but you can’t use pointers with all types. Span<T> has much stronger support here. A particularly interesting illustration is SIMD, which is exposed in .NET via Vector<T>. A vexing limitation of pointers is that we cannot talk about a Vector<float>* pointer (for example). This means that we can’t use pointer coercion as a convenient way of reading and writing SIMD vectors (you’ll usually have to use Unsafe.Read<T> and Unsafe.Write<T> instead). But we can coerce directly to Vector<T> from a span! Here’s an example that might come up in things like applying the web-sockets xor mask to a received frame’s payload:

void ApplyXor(Span<byte> span, uint mask) {
    if(Vector.IsHardwareAccelerated) {
        // apply the mask to SIMD-width bytes at a time
        var vectorMask = new Vector<uint>(mask);
        var typed = span.NonPortableCast<byte, Vector<uint>>();
        for (int i = 0; i < typed.Length; i++) {
            typed[i] ^= vectorMask;
        // move past that data (might be a few bytes left)
        span = span.Slice(Vector<uint>.Count * typed.Length);
    // not shown - finish any remaining data 

That’s pretty minimal code for vectorizing something; it is especially nice that we didn’t even need to do the math to figure out the vectorizable range - typed.Length did everything we wanted. It would be premature for me to know for sure, but I’m also hopeful that these 0-Span<T>.Length loops will also elide the bounds check in the same way that array access from 0-T[].Length elides the bounds check.

And readonly too!

Pointers are notoriously permissive; if you have a pointer: you can do anything. You can use fixed to obtain the char* pointer inside a string: if you change the data via the pointer, the string now has different contents. string is not immutable if you allow unsafe: nothing is immutable if you allow unsafe. But just as we can obtain a Span<T>, we can also get a ReadOnlySpan<T>. If you only expect a method to read the data, you can give them a ReadOnlySpan<T>.

Zero-cost substrings

In the “corefxlab” preview code, there’s a method-group with signatures along the lines of:

 public static  ReadOnlySpan<char> Slice(this string text, ...)

(where the overloads allow an initial range to be specified). This gives us a ReadOnlySpan<char> that points directly at a range inside the string. If we want a substring, we can just Slice() again and again - with zero allocations and zero string copying - we just have different spans over the same data. A rich set of APIs already exists in the corefxlab code for working with this type of string-like data. If you do a lot of text processing, this could have some really interesting aspects.

This all sounds too good to be true - what’s the catch?

Here’s the gotcha: in order to have the appropriate correctness guarantees when discussing something that could be a managed object, could be data on the stack, or could be unmanaged data, we run into very similar problems that make it impossible to store a ref T local as a field. Remember that a Span<T> is conceptually a ref T (reference) and int (length) - well: we still need to obey the rules imposed by that “conceptually”. For a trivial example of how we can get in a mess, we can tweak our stackalloc example:

private Span<byte> _span;
unsafe void StackAllocExample() {
    byte* data = stackalloc byte[128];
    _span = new Span<byte>(data, 128);
void SomeWhileLater() {

Where does _span refer to in SomeWhileLater? I can’t tell you. We get into similar problems with anything that used fixed to get a pointer - the pointer is only guaranteed to make sense inside the fixed. Conceptually the issue is not restricted to pointers - it would apply equally if we could initialize Span<T> directly with a ref T constuctor:

private Span<SomeStruct> _span;
void StackRefExample() {
    var val = new SomeStruct(123, 456);
    _span = new Span<SomeStruct>(ref val);
    // ^^^ hypothetical span of length 1

We didn’t even need unsafe to break things this time. No such constructor currently exists, very wisely!

We should be OK if we only ever use managed heap objects (arrays, etc) to initialize a Span<T>, but the entire point of Span<T> is to provide feature parity between things like arrays and pointers while making it hard to shoot yourself in the foot.

In addition to this, we also need to worry about atomicity. The runtime and language guarantee that a single reference can be read atomically (in one CPU instruction), but it makes no guarantees about anything larger. If we have a reference and a length, we start getting into very complex issues around “torn” values (an invalid pair of the reference and length that didn’t actually exist, due to two threads squabbling). A torn value is vexing at the best of times, but in this case it would lead to valid-looking code accessing unexpected memory - a very bad thing.

The stackalloc example above is a perfect example of code that will compile without complaint today, but will end very very badly - although we used unsafe, so: self-inflicted. But this and the atomicity issue are both illustrations of why we have…

The Important Big Rule Of Spans

Span<T> has undefined behavior off the stack. And in the future: may not be allowed off the stack at all - this means no fields, no arrays, no boxing, etc. In the same way that ref T only has defined behavior on the stack (locals, parameters, return values) - so Span<T> only has defined behavior on the stack. You are not meant to ever put a Span<T> in a field (including all those times when things look like locals but are actually fields, that I touched on last time). An immediate consequence of this is that atomicity is no longer an issue: each stack is specific to a single thread; if our value can’t escape the stack, then two threads can’t have competing reads and writes.

There’s some in-progress discussion on how the rules for this requirement should work, but it looks like the concept of a “ref-like” stack-only type is being introduced. ref T as a field would be ref-like, and Span<T> would be ref-like. Any ref-like type would only be valid directly on the stack, or as an instance field (not a static field) on a ref-like type. If I had to speculate at syntax, I’d expect this to look something like:

public ref struct Span<T> {
    ref T _reference;
    int _length;
    public ref T this[int index] { get {...} }

Emphasis: this syntax is pure speculation based on the historic reluctance to introduce new keywords, but the ref struct here denotes a ref-like type. It could also be done via attributes or a range of other ideas, but note that we’re now allowed to embed the ref-like ref T field. Additionally, the compiler and runtime would verify that Span<T> is never used illegally as a field or in an array etc. Notionally, we could also do this for our own types that shouldn’t escape the stack, if we have similar semantics but Span<T> doesn’t represent our scenario.

Thinking back to the StackRefExample, if we wanted to safely support usage like:

var val = new SomeStruct(123, 456);
var span = new Span<SomeStruct>(ref val); // local, not field

then presumably it could work, but we’d have to have similar logic about returning ref-like types as currently exists for ref return, further complicated by the fact that we don’t have the single-assignment guarantee - we can reassign a Span<T>. If ref-like types work in the general case, then the logic about passing and returning such a value needs ironing out. And that’s complex. I’m very happy to defer to Vladimir Sadov on this!

EDIT: to clarify - it is only the pair of ref T and length (together known as a span, Span<T> or ReadOnlySpan<T>) that need to stay on the stack; the memory that we're spanning can be anywhere - and will often be part of a regular array (T[]) on the managed heap. It could also be a reference to the unmanaged heap, or to a separate part of the current stack.

So how am I meant to work with spans?

Sure, not everything is on the stack.

This isn’t as much of a limitation as it sounds. Instead of storing the Span<T> itself, you just need to store something that can manifest a span. For example, if you’re actually using arrays you might have a type that contains an ArraySegment<T>, but which has a property:

public Span<T> Span { get { ... } }

As long as you can switch into Span<T> mode when you’re inside an appropriate method, all is good.

For a more unified model, the corefxlab code contains the Buffer<T> concept, but it is still very much a work in progress. We’ll have to see how it shakes out in time.

Wait… why so much ref previously?

We covered a lot of ref details - you might feel cheated. Well, partly we needed that information to understand the stack-only semantics of Span<T>. But there’s more! Span<T> also exposes the ref T directly via the aptly named DangerousGetPinnableReference() method. This is a ref return, and allows us to do any of:

  • store the ref return into a ref local and work with it
  • pass the ref return as a ref or out parameter to another method
  • use fixed to convert the ref to a pointer (preventing GC movement at the same time)

The latter option means that not only can we get from unsafe to Span<T>, but we can go the other direction if we need:

fixed(byte* ptr = &span.DangerousGetPinnableReference())
{ ... }

If I can get a ref, can I escape the bounds?

The DangerousGetPinnableReference() method give us back a ref to the start of the range, comparable to how a T* pointer refers to the start of a range in pointer terms. So: can we use this to get around the range constraints? Well… yes… ish:

ref int somewhere = ref Unsafe.Add(
    ref span.DangerousGetPinnableReference(), 5000);

This cheeky duo gives us a reference to whatever is 5000-integers ahead of the span we were thinking of. It might still be part of our data (if we have a large array, for example), or it might be something completely random. But the sharp eyed might have noticed some key words in that expression… “Unsafe...” and “Dangerous...”. If you keep sprinting past signs with words like that on: expect to hit rocks. There’s nothing here that you couldn’t already do with unsafe code, note.

Doing crazy things with unmanaged memory

Sometimes you need to use unmanaged memory - this could be because of memory / collection issues, or could be because of interfacing with unmanaged systems - I use it in CUDA work, for example, where the CUDA driver has to allocate the memory in a special way to get optimal performance. Historically, working with unmanaged memory is hard - you will be using pointers all the time. But we can simplify everything by using spans. Here’s our dummy type that we will store in unmanaged memory:

// could be explict layout to match external definition
struct SomeType
    public SomeType(int id, DateTime creationDate)
        Id = id;
        _creationDate = creationDate.ToEpochTime();
        // ...
    public int Id { get; }
    private long _creationDate;
    public DateTime CreationDate => _creationDate.FromEpochTime();
    // ...
    public override string ToString()
        => $"{Id}: {CreationDate}, ...";

We’ll need to allocate some memory and ensure it is collected, usually via a finalizer in a wrapper class:

unsafe class UnmanagedStuff : IDisposable
    private SomeType* ptr;
    public UnmanagedStuff(int count)
        ptr = (SomeType*) Marshal.AllocHGlobal(
            sizeof(SomeType) * count).ToPointer();
    ~UnmanagedStuff() { Dispose(); }
    public void Dispose()
        var ip = new IntPtr(ptr);
        if (ip != IntPtr.Zero) Marshal.Release(ip);
        ptr = default(SomeType*);

The wrapper type needs to know about the pointers, so is going to be unsafe - but does the rest of the code need to? Sure, we could add an indexer that uses Unsafe.Read / Unsafe.Write to access individual elements, but that means copying the data constantly, which is probably not what we want - and it doesn’t help us represent ranges. But spans do: we can return a span of the data (perhaps via a Slice() API):

public Span<SomeType> Slice(int offset, int count)
    => new Span<SomeType>(ptr + offset, count);
// ^^^ not shown: validate range first

And we can consume this pretty naturally without unsafe:

// "stuff" is our UnmanagedStuff object
// easily talk about a slice of unmanaged data
var slice = stuff.Slice(5, 10);
slice[0] = new SomeType(123, DateTime.Now);                

// (separate slices work)
slice = stuff.Slice(0, 25);
Console.WriteLine(slice[5]); // 123: 23/04/2017 09:09:51, ...

If we want to talk about individual elements (rather than a range), then a ref local (via a ref return) is what we want; we could use the DangerousGetPinnableReference() API on a Span<T> for this, but in this case it is probably easier just to use Unsafe directly:

public ref SomeType this[int index]
    => ref Unsafe.AsRef<SomeType>(ptr + index);
// ^^^ not shown: validate range first 

We can consume this with similar ease:

// talk about a *reference* to unmanaged data
ref SomeType item = ref stuff[5];
Console.WriteLine(item); // 123: 23/04/2017 09:09:51, ...
item = new SomeType(42, new DateTime(2016, 1, 8));

// prove that updated *inside* the slice
Console.WriteLine(slice[5]); // 42: 08/01/2016 00:00:00, ...

And now from any code, we can talk directly to the unmanaged memory simply by passing it in as a ref parameter - it will never be copied, just dereferenced. If you want to talk about an isolated copy or store a copy as a field, then you can dereference, but that is easy:

SomeType isolated = item;

If you’ve ever worked with unmanaged memory from C#, this is a huge difference - and opens up a whole range of interesting scenarios for allocation-free systems without requiring the entire codebase to be unsafe. For context, in an allocation-free system, the lifetime of a set of data is strictly defined by some unit of work - processing an inbound request, for example. This means we don’t need reference tracking and garbage collection (and GC pauses can hurt high performance systems), so instead we simply take some slabs of memory, work from them (incrementing counters as we consume space), and then when we’ve finished the request we just set all the counters back to zero and we’re ready for the next request, no mess. Spans and ref locals and ref return make this friendly, even in the unmanaged memory scenario. The only caveat being - once again: Span<T> and ref T cannot legally escape the stack. But as we’ve seen, we can expose on-demand a Span<T> or ref T - so it isn’t a burden.


Spans; they’re very powerful if you need that kind of thing. And they force a range of new concepts into C#, giving us all the combined strong points of arrays, pointers, references and generics - with very few of the pain points. If you don’t care about pointers, buffers, etc - you probably won’t need to learn about spans. But if you do, they’re awesome. The amount of effort the .NET folks (and the community, but mostly Microsoft) have made making this span concept so rich and powerful is huge - it impacts the compiler, the JIT, the runtime, and multiple libraries both pre-existing and brand new. And it impacts both .NET and .NET Core. As someone who works a lot in the areas affected by spans and ref - it is also hugely appreciated. Good things are coming.

Saturday, 22 April 2017

Spans and ref part 1 : ref

Spans and ref part 1 : ref

One of the new features in C# 7 is by-reference (ref) return values and locals. This is a complex topic to explain, but a good example of why we might want this is “spans” (Span<T>). I don’t have any inside knowledge on the design meetings, but I’d go further and speculate that if Span<T> wasn’t a thing, the ref-changes wouldn’t have happened, so it makes sense to consider them together. Most of the fun things you can do with ref returns / locals start to make a lot more sense when we look at Span<T> - which we’ll do in part 2, but first we need to remind ourselves what ref means, and explore the new ref changes.

ref returns and locals

There’s a reason that ref (and cousin out) aren’t used extensively on many APIs: they are hard to fully understand. A lot of people will describe them in terms of “changes being visible”, but that is just a side-effect, not the meaning. I don’t mean this as a criticism: it isn’t necessary for every C# developer to have a deep knowledge of the inner workings of these things.

But consider the following common question:

void PassByRef()
    int i = 42;
    IncrementByRef(ref i);
    // what does this line print, and why?
void IncrementByRef(ref int x)
    x = x + 1; // increment

Most developers will be able to correctly understand that this will output “43”, but asking them exactly what happened can reveal very different levels of understanding. The short summary is that a reference to the variable i was passed to IncrementByRef; all the code in IncrementByRef that looks like it is reading / writing to the parameter is actually dereferencing the parameter at each stage. This is clearer if we write it in unsafe pointer code instead:

unsafe void PassByPointer()
    int i = 42;
    // what does this line print, and why?
unsafe void IncrementByPointer(int* x)
    *x = *x + 1; // increment

Here we can clearly see the “take a reference to” operation (&) and “dereference” (*) operations, but there’s a lot of problems with pointers:

  • the garbage collector (GC) refuses to even try to walk pointers, which means you need to be very careful to only access memory that won’t move (unmanaged memory, stack memory, or pinned managed objects)
  • pointer arithmetic makes it trivially possible to access adjacent memory without any bounds checking
  • it forces us to use unsafe, which makes it very easy to make subtle but major bugs that cause just about any level of silliness imaginable
  • pointers only work for a small subset of types - essentially primitives and structs composed of primitives

The point of ref parameters is to get the best of both worlds. ref is essentially just like pointers, but with enough restrictions to stop us getting into messes. The aditional sanity checks and restrictions mean that the IL knows enough about the meaning for the GC to sensibly be able to navigate them without getting confused, so we don’t need to worry about the reference suddenly being meaninglses - and since we can’t do anything too silly, we don’t need to drop to unsafe. And it should work for any regular type.

But, historically, this ability to add automatic dereferencing and talk about ref has been restricted to method parameters; no fields, no locals, and no return values.

ref locals

The first change in C#7 allows us to talk about automatically dereferenced ref items as local (method) variables. In the same way that a ref parameter is denoted by a ref prefix before the type, so it is with ref locals, with the added bonus that ref var is legal:

void ByRefLocal()
    int i = 42;
    ref var x = ref i;
    x = x + 1;
    // what does this line print, and why?

This prints “43” for exactly the same reasons as before - the only difference is that we now have a syntax to express ref when talking about locals. Previously, we would have to have added an additional method to switch to ref semantics for a local. One slight peculiarity here is that ref locals must be assigned a value at the point of declaration - and we can only assign it a value at this point. Any further attempt to assign a value to a ref local is interpreted as a dereferencing assignment - the *x = in our pointer example.

This ability is nice, but it isn’t very useful until we combine it with…

ref return

A much more interesting and powerful addition in C# 7 is ref returns. As the name suggests, this allows us to return a ref value from a method. We can capture this value into a ref local as long as we include an additional ref just before the value to make it very clear that we don’t want to dereference - which is the regular behaviour whenever touching a ref parameter or local:

ref int GetArrayReference(int[] items, int index)
    => ref items[index];

void IncrementInsideArrayByRef()
    int[] values = { 1, 2, 3 };

    ref int item = ref GetArrayReference(values, 1);
    IncrementByRef(ref item);
    // what does this line print, and why?
    Console.WriteLine(string.Join(",", values));

Here the GetArrayReference method provides the caller a ref to inside the array. Note that the ability to get a ref into an array is not by itself new - this has always worked:

IncrementByRef(ref values[item]);

The bit that is new and different is only the ability to return a ref value.

Since we increment a ref int that refers to the array index 1 (the second element), the result is “1,3,3”.

Note that we don’t need to capture the ref value before we use it - we can also pass a ref return result directly into a ref or out parameter:

IncrementByRef(ref GetArrayReference(values, 1));

Are there restrictions on what we can ref return?

Yes, yes there are. Figuring out the rules on what can and can’t be safely returned as ref without letting the author get into an accidental ugly mess of landmines is probably why it has never been supported in the past. We’ve seen that we can return ref references into arrays. And we’ve seen that we can take a ref of a local variable. But a local only exists in the context of the current stack-frame: very bad things would happen if we could ref return a ref to a local - the caller would have a ref to a position outside the active stack, which would be undefined behavior:

ref int ReturnInvalidStackReference()
    int i = 32;
    return ref i; // can't do this
void WhatHappensHere()
    ref int v = ref ReturnInvalidStackReference();
    CallSomeOtherMethods(); // to use the stack
    int i = v; // dereference the ref

You’ll be relieved to know that the compiler doesn’t let us do this - the compiler is very strict to ensure that if we want to ref return something, then it must demonstrably refer to a safe value. Put very simply: as long as the assignment doesn’t involve a ref to a local, we’ll be fine. return ref i; clearly involves the local i, so can’t be returned. The return ref expression is inspected for safety; each part of the expression must be safe. This includes any ref parameters that we have passed into any method calls:

ref int ReturnInvalidStackReference()
    int j = 42;
    return ref DoSomething(ref j);

This might look like a confusing restriction, but note that DoSomething could be implemented as:

ref int DoSomething(ref int evil) => ref evil;

which would expose the ref j stack reference to the caller of ReturnInvalidStackReference, so any such possibility is excluded. The implementation here is pretty solid, so if it refuses to let you ref return something, you’ve probably attempted something that looks too much like you’re involving locals of the current method.

But no ref fields

We have ref parameters, ref locals and ref returns. However, there is no currently support for ref fields (instance or static variables). Specifically, this is not legal:

struct Foo {
    ref int _reference;


class Foo {
    ref int _reference;

The reason for this again relates to undefined behaviour and escaping the stack-frame. If we could put a ref into a field, then we are at grave danger of proving invalid access (at some point later on) to a position in memory that now means something completely unrelated to what it meant when we took the ref. Strictly speaking we can prove that ref fields would be valid if the assigned value comes from inside an object, and we’ll discuss another safe scenario in part 2,but currently the rule is simple: no ref fields.

When is a local not a local?

This has some additional consequences for a number of code concepts that look like locals, but which are actually fields; for example:

  • locals in an iterator block (yield return)
  • locals in an async method
  • captured variables in lambdas, anonymous methods, and LINQ syntax comprehensions

All of these situations are - and for similar reasons (with the added bonus of issues of lifetime) - the scenarios where you can’t use ref or out parameters or unsafe, so basically: if you can’t use ref, out parameters or unsafe, you won’t be able to use ref locals or ref return either.

One additional scenario, though, is tuples: as I discussed previously, tuples are secretly implemented as fields on the ValueTuple<...> family. So: no ref values in tuples.


This should give you enough to start understanding what ref locals and ref returns are, but for them really start to make sense we need a concrete example. And we get that in “spans”, coming up next!

Sunday, 9 April 2017

Exploring Tuples as a Library Author

Exploring Tuples as a Library Author


This works in an upcoming version of dapper; we like it:

// very lazy dapper query 
var users = db.Query<(int id, string name)>(
    "select Id, Name from Users").AsList();

And this works in an upcoming release of protobuf-net:

// poor man's serialization contract
var token = Serializer.Deserialize<(DateTime dob, string key)>(source);

But - all is not necessarily as it seems. Here we will discuss the nuances of C# 7 tuple usage, and look at the limitations and restrictions that the implementation imposes. In doing so, we can look at what this means for library authors, and what expectations consumers should have.

What is new in the world of tuples?

One of the fun new tools in C# 7 is language-level tuple support. Sure, we’ve had the Tuple<...> family for ages, but that forces you to use Item1, Item2 (etc) naming, and it is a class - so involves allocations, but… it works:

// System.Tuple usage
var tuple = Tuple.Create(123, "abc");
int foo = tuple.Item1; // 123
string bar = tuple.Item2; // "abc"

The ever-terser syntax for read-only properties makes it not too hard to write your own manual tuples, but it isn’t necessarily the best use of your time (especially if you want to do it well - the code is less trivial than people imagine); a simple example might look like:

// manual tuple struct definition
struct MyType
    public MyType(int id, string name)
        Id = id;
        Name = name;
    public int Id { get; }
    public string Name { get; }

Maneagable, but… not very friendly if you’re just trying to hack a few values around in close proximity to each-other.

More recently, the ValueTuple<...> family has been added on nuget; this solves the allocation problem, but still leaves us with names like Item1, Item2:

// System.ValueTuple usage
var tuple = ValueTuple.Create(123, "abc");
int foo = tuple.Item1; // 123
string bar = tuple.Item2; // "abc"

The new thing, then, is language support to prettify this. As an illustration, we can re-write that as:

// language tuple usage
var tuple = (id: 23, name: "abc");
int foo =; // 123
string bar =; // "abc"

This allows us to conveniently represent some named pieces of data. It looks broadly similar to anonymous types:

// anonymous type
var obj = new { Id = 23, Name = "abc" };
int foo = obj.Id; // 123
string bar =; // "abc"

But it is implemented very differently, as we’ll see. Anonymous types are very limited in scope - bacause you can’t say their name, you can’t expose them on any API boundary (even internal methods). Tuples, however, don’t have this limitation - they can be used in parameters and return types:

// language tuples on method signatures
public static (int x, int y) DoSomething(
    string foo, (int id, DateTime when) bar) {...}

And you can even express them in generics:

// language tuples as generic type parameters
var points = new List<(int x, int y)>();

How are language tuples implemented?

Given how similar they look to anonymous types, you’d be forgiven for assuming that they were implemented similarly - perhaps defining a struct instead of a class. Let’s test that by decompiling a basic console exe:

// test console using language tuples
static class Program
    static void Main()
        var tuple = (id: 23, name: "abc");
        System.Console.WriteLine(; // "abc"

We’ll compile that in release mode and look at the IL:

ildasm mytest.exe

and find the Main() method - we see:

.method private hidebysig static void  Main() cil managed
// Code size       23 (0x17)
.maxstack  8
IL_0000:  ldc.i4.s   23
IL_0002:  ldstr      "abc"
IL_0007:  newobj     instance void valuetype [System.ValueTuple]System.ValueTuple`2<int32,string>::.ctor(!0, !1)
IL_000c:  ldfld      !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32,string>::Item2
IL_0011:  call       void [mscorlib]System.Console::WriteLine(string)
IL_0016:  ret
} // end of method Program::Main

I don’t expect everyone to be able to read IL, but what you won’t see in there is any mention of id or name - and there are no additional types in the assembly (just Program). Instead, it is using System.ValueTuple<int,string>. The IL we have here is exactly the same as we would get if we had compiled:

// test console using System.ValueTuple<...>
static class Program
    static void Main()
        var tuple = new System.ValueTuple<int, string>(23, "abc");
        System.Console.WriteLine(tuple.Item2); // "abc"

This is by design and is always the case - which is why you need a reference to System.ValueTuple in order to use language tuples. In fact, to make things more fun, you’re *still allowed to use the Item1 etc names:

// accessing a tuple via the implementation names
var tuple = (id: 23, name: "abc");
System.Console.WriteLine(tuple.Item2); // "abc"

Things would obviously get very confusing if you had used Item2 as a name in your tuple syntax - the compiler has enough sense to check that if you declare Item2, it must be in the second spot; and a few other well-known method names are blocked too, but generally: you’ll be fine.

But if it is just ValueTuple<...>, how does it work on the public API?

After all, this works:

// declare a method that uses tuples on the public API
public static class TypeInAssemblyA
    public static List<(int x, int y)> GetCoords((string key, string value) foo)
        return new[] { (1, 2) }.ToList();
// consume a method that uses tuples on the public API
static class TypeInAssemblyB
    static void ConsumeData()
        foreach (var pair in TypeInAssemblyA.GetCoords(("abc", "def")))
            System.Console.WriteLine($"{pair.x}, {pair.y}");

So the logical names (x and y) are clearly being conveyed somehow, but this is still ValueTuple<...>. If we look at the IL again, we see the trick in the top of the declaration of GetCoords:

.method public hidebysig static class [mscorlib]System.Collections.Generic.List`1<valuetype [System.ValueTuple]System.ValueTuple`2<int32,int32>> 
        GetCoords(valuetype [System.ValueTuple]System.ValueTuple`2<string,string> foo) cil managed
.param [0]
.custom instance void [System.ValueTuple]System.Runtime.CompilerServices.TupleElementNamesAttribute::.ctor(string[])
= ( 01 00 02 00 00 00 01 78 01 79 00 00 )
// .......x.y..
.param [1]
.custom instance void [System.ValueTuple]System.Runtime.CompilerServices.TupleElementNamesAttribute::.ctor(string[])
= ( 01 00 02 00 00 00 03 6B 65 79 05 76 61 6C 75 65 00 00 )
// .......key.value

Don’t worry if you’re not an IL expert - they key point is that it declares attributes ([TupleElementNames(...)]) against the parameter and return type that include the names, essentially as though we had done:

// equivalent: use tuple element name attributes explicitly
[return: TupleElementNames(new[] { "x", "y}" })]
public static List<ValueTuple<int, int>> GetCoords(
    [TupleElementNames(new[] {"key","value"})]
    ValueTuple<string,string> foo)
    return new[] { (1, 2) }.ToList();

(the C# 7 compiler doesn’t let us do this manually, though - it tells us to use tuple syntax instead)

This tells us once and for all that the names are nothing to do with the type, and are either removed completely, or exposed simply as decorations via attributes (when used on an API).

But tuples can be arbitrarily complex

Note that we don’t just need to consider returning flat tuples - the individual elements of tuples can themselves be tuples, or involve tuples (lists of tuples, arrays of tuples, tuple fields, etc). To be honest I don’t expect to see this much in the wild (if you’re getting this complex, you should probably define your own type), but this works:

// expose nested tuples
public static (int x, (int y, (float a, float b))[] coords) Nested()
    => (1, new[] { (2, (3.0f, 4.0f)) });

which is described as:

.param [0]
.custom instance void [System.ValueTuple]System.Runtime.CompilerServices.TupleElementNamesAttribute::.ctor(string[])
= ( 01 00 06 00 00 00 01 78 06 63 6F 6F 72 64 73 01 79 FF 01 61 01 62 00 00 )
// .......x.coords.y..a.b..

which is the equivalent of:

 [return: TupleElementNames(new[] {
     "x", "coords", "y", null, "a", "b" })]

In truth, I don’t know the significance of the null, or how exactly the mapping works in terms of position in the compound tuple vs position in the array.

EDIT: oops, silly me - the null is because I didn't name the element described by (float a, float b) - yes, that's allowed.

We can get names out of an API - can we push them in?

It is very common in code that uses libraries to create an insteance and then push it into an API either as object or via generics (<T>). For example:

var val = (id: 123, name: "Fred");
SomeLibrary.SomeGenericMethod(val); // <T> inferred

with (for a simple example:

public static void SomeGenericMethod<T>(T val) {
    foreach(var field in typeof(T).GetFields())

public static void SomeObjectMethod(object val)
    foreach (var field in val.GetType().GetFields())

If we run this, we see:


This output confirms that once again we’re seeing ValueTuple<...> in play. As before, let’s look at the relevant part of the IL for these two calls (note I added some other code, not shown, to force the value into locals for simplicity):

box        valuetype [System.ValueTuple]System.ValueTuple`2<int32,string>
call       void SomeLibrary::SomeObjectMethod(object)

call       void SomeLibrary::SomeGenericMethod<valuetype [System.ValueTuple]System.ValueTuple`2<int32,string>>(!!0)

The first call converts the tuple to object by “boxing” it; the second call uses generics without boxing. But neither of them include anything whatsoever about the names. This is consistent with what we saw by dumping the fields, and gives us the summary statement:

Tuple names pass outwards, never inwards

Tuple names can be passed outwards (names declared in the public signature of a method can be used in code that consumes that method), but can not be passed inwards (names declared in one method will never be available to general library code that consumes those values). This has significant implications in a wide range of scenarios that we normally expect to be able to access names via reflection: the names simply won’t exist. For example, if we serialized a tuple to JSON, we would expect to see Item1, Item2, etc instead of the names we thought we declared. And there is nothing whatsoever the library author can do to get the original names. It would be nice if there was something akin to the method caller attributes ([CallerMemberName] etc), that library authors could attach to an optional parameter to receive the dummy names in some structured form (one part per generic type parameter), but: I can see that it would be very hard to define the semantics of this in the general case (it could be a generic method on a generic type nested in a generic type, for example).

So what can library authors do?

OK, so names are off the table. But that doesn’t mean that tuples are completely useless for use with library code. We still have position. And if a library can do something sensible just using the position semantics, then it doesn’t matter that the library doesn’t know the names that the caller had in mind.

Let’s take 2 examples; protobuf-net and dapper.


This is a binary serialization library that follows the protocol buffers (protobuf) serialization format. One of the ways that protobuf achieves efficent encoding is that it doesn’t care about names. At all (at least, in the binary protocol). All it knows is that there is an integer in field 1, a utf8 string in field 2, and an array of floats in field 3 (for example). In this case, we don’t lose anything when we see ValueTuple<...> - all we need to do is to make sure that the library recognises the pattern and knows how to infer the contract from the type (so Item1 maps to field 1, Item2 maps to field 2, etc).

As it happens, protobuf-net already had code to recognise a general family of tuple patterns, so this very nearly worked for free without any changes. The only thing that tripped it up was that historically the “is this a tuple” check made the assumption that any sensible tuple would be immutable. As it happens, ValueTuple<...> is implemented as fully mutable (with mutable public fields). By softening that check a little bit, protobuf-net now works with any such tuple it sees.


This is an ADO.NET helper utility, that makes it really really simple to correctly execute parameterized commands and queries, and populate records from the results. In terms of the latter, when executing Query<T>(...), it executes the ADO.NET query, then maps the results by name into the fields and properties of T, one instance of T per row received. As we’ve discovered before, we are never going to able to respect the caller’s names when using C# 7 tuples, and it is unlikely that people will conveniently call their columns Item1, Item2, etc. That means that it probably isn’t going to be useful to use value tuples with domain entity objects, but… if you’re using value tuples for your domain entity objects you already need a stiff talking to.

However! There is another very common query scenario with dapper; ad-hoc local queries that get trivial data. In a lot of these cases, it really isn’t worth going to the trouble of declaring a custom type to receive the results, so dapper supports dynamic as a “meh, it’ll work” helper:

// query dapper via "dynamic"
int id = ...
var row = conn.QuerySingle("select Name, LastUpdated from Records where Id=@id", new {id});
string name = row.Name; // "dynamic"
DateTime lastUpdated = row.LastUpdated;
// use name and lastUpdated

This works, but: it involves the DLR (relatively expensive compared to direct C#), and has a few other overheads. What we can do is change dapper to recognise that names aren’t useful in ValueTuple<...>, but rely instead on the column position:

// query dapper via C# 7 tuples
int id = ...
var row = conn.QuerySingle<(string name, DateTime lastUpdated)>(
     "select Name, LastUpdated from Records where Id=@id", new {id});
// use and row.lastUpdated

That’s actually pretty handy for a wide range of scenarios where we’re going to consume the data immediately adjacent to the query. And because they are value types we don’t even pay heap allocation costs. I wouldn’t use it for populating real domain objects, though.

dapper: what about parameters?

you’ll notice that we also have a new {id} anonymous type usage to pass parameters into dapper. There are a number of reasons that I’ve left this alone and have made no attempt to support value tuples:

  • in ADO.NET, most providers support names, and some important providers (SqlClient for SQL Server for example) do not support positional parameters; while we could in theory expose a positional parameter syntax in the API, the use would be somewhat limited
  • at the moment, the parameters are passed as object; if we pass a C# tuple it would be boxed, which means an allocation - and at that point, we might as well have just used an anonymous type object (which would have conveyed names)
  • if you’re thinking “make the method generic so the parameter is passed as T to avoid boxing”, the problem is that the method is already generic in the return type, and in C# when calling a generic method you can either supply all the generic types explicitly, or let the compiler infer all of them; there is no middle ground. Making the parameters a generic would be inconvenient in all cases, and impossible in others - we cannot name the T used in code that uses anonymous types, precisely because it is anonymous. There are hacks around this, but: nothing great

All in all, I’m happy to just say “use anonymous types for the parameters; use tuples for output rows” - it is a pragmatic compromise that covers the huge majority of useful real-world scenarios. I have added code to detect ValueTuple<...> being used in the parameters and raise an appropriate exception guiding the user on the problem an how to fix it.


We’ve explored C# 7 tuples and their implementation. Armed with the implementation details, we’ve explored what that means for code that consumes C# 7 tuples - both upstream and downstream. We’ve discussed the limitations for use with libary code, but we’ve also seen that they still have plenty of uses with library code in some scenarios. If I’ve missed anything, let me know on twitter!


After publishing, Daniel Crabtree very rightly notes that the Item* fields only go up to Item7, and that additional fields are stored in the Rest field which can itself be a tuple. You can read more about that in his post here.

Saturday, 17 September 2016

Channelling my inner geek


I do a lot of IO programming – both from a client perspective (for example, StackExchange.Redis), and from a server perspective (for example, our custom web-sockets server usually has between 350k and 500k open connections, depending on how long it has been since we gave it a kick). So I’m always interested in new “players” when it comes to IO code, and a few weeks ago Microsoft’s David Fowler came through with some interesting goodies – specifically the “Channels” API.

First, a couple of disclaimers: everything I’m talking about here is completely experimental, evolving rapidly, depends on “corefxlab” concepts (i.e. stuff the .NET team is playing with to see what works well), and is virtually untested. Even the name is purely a working title. But let’s be bold and take a dive anyway!

What are Channels, and what problem do they solve?

The short description of Channels would be something like: “high performance zero-copy buffer-pool-managed asynchronous message pipes”. Which is quite a mouthful, so we need to break that down a bit. I’m going to talk primarily in the context of network IO software (aka network client libraries and server applications), but the same concepts apply equally to anything where data goes in or out of a stream of bytes. It is relatively easy to write a basic client library or server application for a simple protocol; heck, spin up a Socket via a listener, maybe wrap it in a NetworkStream, call Read until we get a message we can process, then spit back a response. But trying to do that efficiently is remarkably hard – especially when you are talking about high volumes of connections. I don’t want to get clogged down in details, but it quickly gets massively complicated, with concerns like:

  • threading – don’t want a thread per socket
  • connection management (listeners starting the read/write cycle when connections are accepted)
  • buffers to use to wait for incoming data
  • buffers to use to store any backlog of data you can’t process yet (perhaps because you don’t have an entire message)
  • dealing with partial messages (i.e. you read all of one message and half of another – do you shuffle the remainder down in the existing buffer? or…?)
  • dealing with “frame” reading/writing (another way of saying: how do we decode a stream in to multiple successive messages for sequential processing)
  • exposing the messages we parse as structured data to calling code – do we copy the data out into new objects like string? how “allocatey” can we afford to be? (don’t believe people who say that objects are cheap; they become expensive if you allocate enough of them)
  • do we re-use our buffers? and if so, how? lots of small buffers? fewer big buffers and hand out slices? how do we handle recycling?

Basically, before long it becomes an ugly mess of code. And almost none of this has any help from the framework. A few years ago I put together NetGain to help with some of this; NetGain is the code that drives our web-socket server, and has pieces to cover most of the above, so implementing a new protocol mostly involves writing the code to read and write frames. It works OK, but it isn’t great. Some bits of it are decidedly ropey.

But here’s the good news: Channels does all of these things, and does it better.

You didn’t really answer my question; what is a Channel?

Damn, busted. OK; let’s put it like this – a Channel is like a Stream that pushes data to you rather than having you pull. One chunk of code feeds data into a Channel, and another chunk of code awaits data to pull from the channel. In the case of a network socket, there are two Channels – one in each direction:

  • “input” for our application code:
    • the socket-aware code borrows a buffer from the memory manager and talks to the socket; as data becomes available it pushes it into the channel then repeats
    • the application code awaits data from the channel, and processes it
  • “output” for our application code:
    • our application may at any point (but typically as a response to a request) choose to write data to the channel
    • the socket-aware code awaits data from the channel, and pushes it to the network infrastructure

In most cases you should not expect to have to write any code at all that talks to the underlying socket; that piece is abstracted away into library code. But if you want to write your own channel backend you can. A small selection of backend providers have been written to explore the API surface (mostly focusing on files and networking; there are actually already 3 different backend ‘sockets’ libraries: libuv, Windows RIO, and async managed sockets). So let’s focus on what the application code might look like, with the example of a basic “echo” server:

IChannel channel = ...
while (true)
    // await data from the input channel
    var request = await channel.Input.ReadAsync();

    // check for "end of stream" condition
    if (request.IsEmpty && channel.Input.Completion.IsCompleted)

    // start writing to the output channel
    var response = channel.Output.Alloc();
    // pass whatever we got as input to the output
    response.Append(ref request);
    // finish writing (this should activate the socket driver)
    await response.FlushAsync();
    // tell the engine that we have read all of that data
    // (note: we could also express that we read *some* of it,
    // in which case the rest would be retained and supplied next time)

What isn’t obvious here is the types involved; “request” is a ReadableBuffer, which is an abstraction over one or more slices of memory from the buffer pool, from which we can consume data. “response” is a WritableBuffer, which is again an abstraction over any number of slices of memory from the buffer pool, to which we can write data (with more blocks being leased as we need them). The idea is that you can “slice” the request as many times as you need without allocation cost:

var prefix = request.Slice(0, 4);
var len = prefix.Read<int>();
var payload = request.Slice(4, len);...

Note that ReadableBuffer is a value-type, so slicing it like this is not allocating - the new value simply has a different internal tracking state of the various buffers. Each block is also reference counted, so the block is only made available back to the pool when it has been fully released. Normally this would happen as we call “Consumed()”, but it has more interesting applications. The “echo” server already gave an example of this – the “Append” call incremented the counter, which means that the data remains valid even after “request.Consumed()” has been called. Instead, it is only when the socket driver says that it too has consumed it, that it gets released back to the pool. But we can do the same thing ourselves to expose data to our application without any copies; our previous example can become:

var prefix = request.Slice(0, 4);
var len = prefix.Read();
var payload = request.Slice(4, len).Preserve();...

Here “Preserve” is essentially “increment the reference counter”. We can then pass that value around and the data will be valid until we “Dispose()” it (which will decrement the counter, releasing it to the pool if it becomes zero). Some of you might now be twitching uncontrollably at the thought of explicit counting, but for high volume scenarios it is far more scalable than garbage collection, allocations, finalizers, and all that jazz.

So where did the zero copy come in?

A consequence of this is that the memory we handed to the network device is the exact same memory that our application code can access for values. Recall that the socket-aware code asked for a buffer from the pool and then handed the populated data to the channel. That is the exact same blocks as the block that “payload” is accessing. Issues like data backlog is handled simply by only releasing blocks once the consumer has indicated that they’ve consumed all the data we know valid on that block; if the caller only reads 10 bytes, it just slices into the existing block. Of course, if the application code chooses to access the data in terms of “string” etc, there are mechanisms to read that, but the data can also be accessed via the new “Span<byte>” API, and potentially as “Utf8String” (although there are some complications there when it comes to data in non-contiguous blocks).


It looks really really exciting for anyone involved in high performance applications – as either a client or server. Don’t get me wrong: there’s no “magic” here; most of the tricks can already be done and are done, but it is really hard to do this stuff well. This API essentially tries to make it easy to get it right, but at the same time has performance considerations at every step by some folks who really really care about absurd performance details.

But is it practical? How complete is it?

It is undeniably very early days. But is is enough to be useful. There are a range of trivially simple examples in the repository, and some non-trivial; David Fowler is looking to hack the entire thing underneath ASP.NET to see what kind of difference it makes. At the less ambitious end of things, I have re-written the web-sockets code from NetGain on top of the Channels API; and frankly, it was way less work than you might expect. This means that there is a fully working web-sockets server and web-sockets client. I’ve also been playing with the StackExchange.Redis multiplexed redis client code on top of Channels (although that is far less complete). So it is very definitely possible to implement non-trivial systems. But be careful: the API is changing rapidly – you definitely shouldn’t be writing your mission critical pacemaker monitoring and ambulance scheduling software on top of this just yet!

There’s also a lot of missing pieces – SSL is a huge hole that is going to take a lot of work to fill; and probably a few more higher level abstractions to make the application code even simpler (David has mentioned a frame encoder/decoder layer, for example). The most important next step is to explore what the framework offers in terms of raw performance: how much “better” does it buy you? I genuinely don’t have a simple answer to that currently, although I’ve been trying to explore it.

Can I get involved?

If you want to explore the API and help spot the missing pieces, the ugly pieces, and the broken pieces – and perhaps even offer the fixes, then the code is on GitHub and there’s a discussion room. Note: nothing here is up to me – I’m just an interested party; David Fowler is commander in chief on this.

And what next?

And then, once all those easy hurdles have been jumped – then starts the really really hard work: figuring out what, in terms of formal API names, to call it.

Monday, 9 May 2016

CUDA–getting started in .NET


In my two previous posts (part 1, part 2) I described how we have found uses for CUDA. I am, however, aware that this abstract text doesn’t necessarily help the reader (aka “you”) make inroads into the world of CUDA. So: the purpose of this post is to take the concepts discussed previously and put them together into some actual code.

Firstly, a word on OpenCL: so far, I’ve spoken about CUDA – NVIDIA’s framework for GPGPU programming. Other frameworks exist, and the largest “other” is undoubtedly OpenCL. This framework is designed to let you use the same code over a range of technologies – GPUs from different providers, CPUs, etc. I am aware of it – I just haven’t even tried to address it. If that’s your bag: have fun. I don’t feel hugely limited by restricting myself to CUDA for what I need.

The first things you need for CUDA, then, are:

  • a CUDA capable GPU device
  • the CUDA toolkit
  • a programming environment

Since I’m focused on .NET, we’re also going to need one of the tools for interacting with CUDA from .NET, or we’ll need to write extern wrappers for the C API. A range of CUDA tools for .NET exist, including CUDAfy.NET, managedCUDA, Alea GPU / QuantAlea, Campy.NET, and several others that Campy.NET list and describe on their home page. These fall into two major categories:

  • wrappers for the driver API (you write your kernels in C etc and launch them from C#)
  • IL-to-PTX generators (you write and launch your kernels in C#)

While I’m a huge fan of C#, I’m also a huge fan of keeping close to the metal, and in a heartbeat will sacrifice some programming convenience for performance.

Our first kernel

So without further ado, I’m going to jump straight in with a basic CUDA kernel written in C. Rather than trying to discuss the tag-engine, we’re going to do something simpler… like multiplying numbers. Multiplication is hard, right?

typedef struct {
    int Id;
    unsigned int Value;
} SomeBasicType;

extern "C" {
    __global__ void Multiply(const int N, SomeBasicType* __restrict data, int factor) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
i += blockDim.x * gridDim.x)
{ (data + i)->Value *= factor; } } }

The first thing this does is to define a data-type (SomeBasicType) for us to share data between the CPU and GPU code. The __global__ indicates that we’re declaring a kernel, along with a given name and signature. We’re passing in N, the number of elements to process, data, a pointer to some of the data to work against, and factor - the number to multiply the values by. The significance of the N comes into play when we see the i < N in the for loop on the next line. Recall that we usually launch a number of blocks, each consisting of a set number of threads. It is not always the case that the number of elements to process is a convenient multiple of some block size, so it is common for kernels to be passed a count to compare the position against, exiting if the current thread is not needed. This sounds like a contradiction to the “don’t branch” guidance, but as long as most of the threads make the same choice, this isn’t going to be a performance problem.

Recall also that I described a “monolithic kernel” earlier, where-by we launch a thread for every element to process. But: there is a hard limit to the number of threads in a block, and the number of blocks in any dimension – and as always: the more threads there are, the more management overhead there is (although this works differently to CPU threads). As such, it is common to use a “grid-stride loop” rather than a monolithic kernel. As in our example, a for loop is used to that each thread processes multiple elements. But unlike the CPU version shown previously (where each thread processed a separate chunk of data), we instead have each thread in an iteration process adjacent elements, and then move forwards by the stride – namely the width of each block multiplied by the number of blocks. This has multiple side benefits:

  • it can cope with data of any size – it is not subject to the 75M limitation
  • it requires less overhead for the GPU engine
  • it can still be used as a monolithic kernel (by setting the dimensions such that it only performs one iteration per thread), or as a single-threaded debugging kernel that processes all the data (by setting the dimensions to 1 thread in 1 block)
  • the work being done by a warp or block is contiguous, minimizing memory access overheads

Finally, we get to the actual meat of the kernel, which does an in-place multiplication of one of the fields of our data by one of the parameters.

That’s our first kernel, all done!

A side note on data width

You’ll note that the above example is doing 32-bit arithmetic. For doing serious CUDA work, it is important to be aware of whether you are doing 16-bit, 32-bit or 64-bit arithmetic, as it can significantly impact your choice of hardware – considering them in release order:

  • the “Kepler” architecture (server devices: K40, K80) will serve as a baseline – let’s arbitrarily call it “good” at both 32-bit and 64-bit; 16-bit operations are performed using 32-bit arithmetic
  • the “Maxwell” architecture (server device: M40) is faster than Kepler at 32-bit operations, but is relatively poor at 64-bit operations; 16-bit operations are performed using 32-bit arithmetic
  • to complicate this: the fact that the K80 packs two devices on a physical package means that even though a single M40 may be faster than a single K80 device, code that appropriately uses both K80 devices on a package may still out-perform a single M40
  • the “Pascal” architecture apparently (not yet available for testing; server device: P100) significantly increases both 32-bit and 64-bit performance, while also introducing specific optimized instructions for 16-bit processing

Because of this, there is no single way of saying “this device is faster” – it depends a lot on what exactly you are doing!

Compiling our kernel (and: hello managedCUDA)

A kernel as a text source-file isn’t much use by itself. We need to convert that into GPU-runnable code – either intermediate (“PTX”) or fully compiled to bytecode (“cubin”). There is also a middle-ground that combines both bytecode and intermediate instructions – a “fatbin”; this allows a compiled kernel to target multiple physical architectures and still allow usage on other architectures via a JIT step. There are two common ways of compiling a C kernel – compile-time (NVCC) and runtime (NVRTC). I’ve spoken previously about using NVCC, so for this example I’ll use NVRTC instead.

Since we want to do this from .NET, we need a wrapper library. My examples today are going to use managedCUDA, written by Michael Kunz. This is available on NuGet and is licenced under the LGPL – which I understand (and: IANAL) is not the same as the notoriously infectious and restrictive GPL. What I particularly like about this library is that in addition to exposing an idiomatic .NET object model that maps to the C API, it also exposes the raw C API directly – which makes it convenient to use even when the method you want doesn’t map naturally to a .NET concept.

So let’s load our kernel from a text file and compile it at runtime:

string path = "MyKernels.c";
ManagedCuda.NVRTC.nvrtcResult result;
using (var rtc = new ManagedCuda.NVRTC.CudaRuntimeCompiler(
File.ReadAllText(path), Path.GetFileName(path))) { try { rtc.Compile(new string[0]); result = ManagedCuda.NVRTC.nvrtcResult.Success; } catch(ManagedCuda.NVRTC.NVRTCException ex) { result = ex.NVRTCError; } log = rtc.GetLogAsString(); if (result == ManagedCuda.NVRTC.nvrtcResult.Success) { byte[] ptx = rtc.GetPTX(); // ... } }

This takes our file contents as input, and (if successful) produces a blob that contains the compiled kernel. We can also obtain the log which will contain detained error messages if our kernel is invalid.

The CUDA context

One of the key objects in CUDA is the context. This wraps the state for a long-running CUDA scenario. Most operations, including memory allocation and kernel launches typically happen within a CUDA context. managedCUDA exposes a class that wraps this. Note:

  • a CUDA context is bound to a particular GPU device, and it is when creating the CUDA context that you can target specific devices
  • a CUDA context is associated with the CPU thread that creates it; as such, CUDA code would work well for scenarios such as a dedicated worker thread processing items from a queue, but would not work well from .NET async/await code that switches between multiple threads

So; let’s create our context and load the kernel:

CudaContext ctx = new CudaContext(deviceId, true);
CudaKernel multiply = ctx.LoadKernelFatBin(ptx, "Multiply");

Note that the context (like many CUDA objects) is IDisposable, and you should ensure to Dispose() it properly, via using or otherwise.

CUDA streams and asynchronous processing

In the simplest usage, the CUDA methods are blocking, however it is often the case that either the CPU can do other useful work while the GPU churns, or we can overlap multiple GPU operations – memory transfers running in parallel with kernel execution, or (depending on the hardware) even multiple kernels running at the same time. To allow this, CUDA introduces the notion of a “stream”. Work on the same stream is processed in order, but the work on different streams may overlap. I’m not going to make extensive usage of streams in this walkthrough, but it is easier to include stream considerations from the start than to add it later.

Initializing memory at the host and on the device

Traditionally, the host (CPU) memory and the device (GPU) memory were completely separate, and all CUDA code had to specifically be cautious of whether it was referring to host or device memory. More recently, the concept of “unified memory” has been introduced that hides this distinction, and performs transfers automatically as needed – it is intended to make it simpler to use the API, but for the maximum performance it is recommended to take full control over allocation and copy. To that end, I’ll stick to the traditional approach, which means explicitly allocating the data at both ends.

We’ll start by allocating the host memory:

struct SomeBasicType {
    public int Id;
    public uint Value;

IntPtr hostPointer = IntPtr.Zero;
var res = DriverAPINativeMethods.MemoryManagement.cuMemAllocHost_v2(
    ref hostPointer, count * sizeof(SomeBasicType));
if (res != CUResult.Success) throw new CudaException(res);
SomeBasicType* hostBuffer = (SomeBasicType*)hostPointer;

As before, we declare a raw data type to share the data between CPU and GPU. Here is one of those cases where I’m making use of the raw C API rather than the wrapped objects, as this (IMO) more conveniently allows naked allocations. You might think “hang on, you’re just allocating some memory – surely a .NET array would suffice, or maybe at worst Marshal.AllocHGlobal” – but: in order for the CUDA memory-copy engine to work at maximum performance (avoiding double-copies, etc) it is required to be configured in a specific way, and the best way to guarantee that is to let the CUDA API allocate the memory in the first place. A pinned .NET array certainly wouldn’t suffice. Or at least, it would work – but not as fast. Yes, this needs some unsafe code; it’ll be fine, don’t panic.

Once allocated, we can intialize this memory with some invented data:

for (int i = 0; i < count; i++) {
    hostBuffer[i].Id = i;
    hostBuffer[i].Value = (uint)i;

Next we’ll want to allocate a similar block of memory on the device, and start copying the data from the host to the device. Emphasis: large memory transfers are fast, but not instant – if possible you should minimize the amount of data you need to transfer for individual operations. The ideal scenario is to copy the data to the device once / periodically, and use it many times once it is there.

CudaDeviceVariable<SomeBasicType> deviceBuffer = new CudaDeviceVariable<SomeBasicType>(count);
CudaStream defaultStream = new CudaStream();
res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyHtoDAsync_v2(deviceBuffer.DevicePointer,
    hostPointer, deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

This time I'm using the CudaDeviceVariable<T> object to represent the memory at the device, which is perfectly convenient since we will not usually need to do anything except access the DevicePointer to pass into methods. Once again I'm using a raw C-style API for the actual memory copy. There are some transfer methods exposed directly on the object, but this particular combination is not exposed directly. Note that because we have created and used a stream, this is non-blocking. Related operations sent to the same stream will be queued behind this one.

Launching our kernel

We have data on the device; we have a kernel that we have compiled and loaded. Now finally to run the kernel against the data! First we must configure the dimensions – the number of threads in a block and the number of blocks:

multiply.BlockDimensions = new ManagedCuda.VectorTypes.dim3(threadsPerBlock, 1, 1);
multiply.GridDimensions = new ManagedCuda.VectorTypes.dim3(blockCount, 1, 1);

multiply.RunAsync(defaultStream.Stream, new object[] {
    count, deviceBuffer.DevicePointer, value

Note that the parameters to RunAsync (other than the stream) match the signature on the kernel itself. value in the above is the factor to multiply by. Once we've got the data in place, running kernels is actually alarmingly simple!

Getting results back

It isn’t usually sufficient to run kernels – sadly we usually expect to get results. As mentioned before, we should strive to minimize the amount of data we want to transfer, but for simplicity in this example I’m going to fetch back the entire buffer, over-writing the previous contents. In real code you might be extracting only small portions of the buffer, or summary / aggregate data.

var res = DriverAPINativeMethods.AsynchronousMemcpy_v2.cuMemcpyDtoHAsync_v2(
    new IntPtr(hostBuffer), deviceBuffer.DevicePointer,
    deviceBuffer.SizeInBytes, defaultStream.Stream);
if (res != CUResult.Success) throw new CudaException(res);

Once more, for memory operations it has been easier to use the raw C API than the wrapped managed API, and that's OK. Note that because we've been using streams, at this point we've only queued the copy. If we've set all our wheels in motion, and want to sit back and wait while they turn, we can use either ctx.Synchronize(); (to wait for all streams to complete) or defaultStream.Synchronize() (to wait for a specific stream to complete). Finally, when that is done, we can iterate over the data and congratulate ourselves for managing to over-complicate the multiplication of numbers!

Not enough code for you?

Everything I’ve shown here is available in a github project, that you can clone, fork, etc. I don’t claim it is the most useful tool ever, but hopefully it might help a few people get up and running with CUDA.