Tuesday 30 January 2018

Sorting myself out, extreme edition

...where I go silly with optimization

Yes, I’m still talking about sorting... ish. I love going deep on problems - it isn’t enough to just have a solution - I like knowing that I have the best solution that I am capable of. This isn’t always possible in a business setting - deadlines etc - but this is my own time.

In part 1, we introduced the scenario - and discussed how to build a composite unsigned value that we could use to sort multiple properties as a single key.

In part 2, we looked a little at radix sort, and found that it was a very compelling candidate.

In this episode, we’re going to look at some ways to signficantly improve what we’ve done so far. In particular, we’ll look at:

  • using knowledge of how signed data works to avoid having to transform between them
  • performing operations in blocks rather than per value to reduce calls
  • using Span<T> as a replacemment for unsafe code and unmanaged pointers, allowing you to get very high performance even in 100% managed/safe code
  • investigating branch removal as a performance optimization of critical loops
  • vectorizing critical loops to do the same work with significantly fewer CPU operations

Hopefully, as a follow up after this one, I’ll look at pratical guidance on parallelizing this same work to spread the load over available cores.

Key point: the main purpose of these words is not to discuss how to implement a radix sort - in fact, we don’t even do that. Instead, it uses one small part of radix sort as an example problem with which to discuss much broader concepts of performance optimization in C# / .NET.

Obviously I can’t cover the entire of radix sort for these, so I’m going to focus on one simple part: composing the radix for sorting. To recall, a naive implementation of radix sort requires unsigned keys, so that the data is naturally sortable in their binary representation. Signed integers and floating point numbers don’t follow this layout, so in part 1 we introduced some basic tools to change between them:

uint Sortable(int value)
{
    // re-base eveything upwards, so anything
    // that was the min-value is now 0, etc
    var val = unchecked((uint)(value - int.MinValue));
    return val;
}
unsafe uint Sortable (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    if ((raw & MSB) != 0) // IEEE first bit is the sign bit
    {
        // is negative; shoult interpret as -(the value without the MSB) - not the same as just
        // dropping the bit, since integer math is twos-complement
        raw = -(raw & ~MSB);
    }
    return Sortable(raw);
}

These two simple transformation - applied to our target values - will form the central theme of this entry.

To measure performance, I’ll be using the inimitable BenchmarkDotNet, looking at multiple iterations of transforming 2 million random float values taken from a seeded Random(), with varying signs etc. The method above will be our baseline, and at each step we’ll add a new row to our table at the bottom:

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00

This gives us a good starting point.

A negative sign of the times

What’s faster than performing a fast operation? Not performing a fast operation. The way radix sort works is by looking at the sort values r bits at a time (commonly 4, 8, 10, but any number is valid) and for that block of bits: counting how many candidates are in each of the 1 << r possible buckets. So if r is 3, we have 8 possible buckets. From that it computes target offsets for each group: if there are 27 values in bucket 0, 12 in bucket 1, 3 in bucket 2, etc - then when sorted bucket 0 will start at offset 0, bucket 1 at offset 27, bucket 2 at offset 39, bucket 3 at offset 41, and so on - just by accumulating the counts. But this breaks if we have signed numbers.

Why?

First, let’s remind ourselves of the various ways that signed and unsigned data can be represented in binary, using a 4 bit number system and integer representations:

Binary Unsigned 2s-complement 1s-complement Sign bit
0000 0 0 +0 +0
0001 1 1 1 1
0010 2 2 2 2
0011 3 3 3 3
0100 4 4 4 4
0101 5 5 5 5
0110 6 6 6 6
0111 7 7 7 7
1000 8 -8 -7 -0
1001 9 -7 -6 -1
1010 10 -6 -5 -2
1011 11 -5 -4 -3
1100 12 -4 -3 -4
1101 13 -3 -2 -5
1110 14 -2 -1 -6
1111 15 -1 -0 -7

We’re usually most familiar with unsigned and 2s-complement representations, because that is what most modern processors use to represent integers. 1s-complement is where -x ≡ ~x - i.e. to negate something we simply invert all the bits. This works fine but has two zeros, which is one more than we usually need - hence we usually use 2s-complement which simply adds an off-by-one step; this makes zero unambiguous (very useful for false as we’ll see later) and (perhaps less important) gives us an extra negative value to play with.

The final option is to use a sign bit; to negate a number we flip the most significant bit, so -x ≡ x ^ 0b1000. IEEE754 floating point numbers (float and double) are implemented using a sign bit, which is why floating point numbers have +0 and -0. Due to clever construction, the rest of the value can be treated as naturally/bitwise sortable - even without needing to understand about the “mantissa”, “exponent” and “bias”. This means that to convert a negative float (or any other sign-bit number) to a 1s-complement representation, we simply flip all the bits except the most significant bit. Or we flip all the bits and put the most significant bit back again, since we know it should be a 1.


So: armed with this knowledge, we can see that signed data in 1s-complement or 2s-complement is almost “naturally sortable” in binary, but simply: the negative values are sorted in increasing numerical value, but come after the positive values (we can happily assert that -0 < +0). This means that we can educate radix sort about 1s-complement and 2s-complement signed data simply by being clever when processing the final left-most chunk: based on r and the bit-width, calculate which bit is the most-signficant bit (which indicates sign), and simply process the negative buckets first (still in the same order) when calculating the offsets; then calculate the offsets of the non-negative buckets. If we were using the 4-bit system above and r=4, we would have 16 buckets, and would calculate offsets in the order (of unsigned buckets) 8-15 then 0-7.

By doing this, we can completely remove any need to do any pre-processing when dealing with values like int. We could perhaps wish that the IEEE754 committee had preferred 1s-complement so we could skip all of this for float too, but a: I think it is fair to assume that there are good technical reasons for the choice (presumably relating to fast negation, and fast access to the mantissa/exponent), and b: it is moot: IEEE754 is implemented in CPU architectures and is here to stay.

So we’re still left with an issue for float: we can’t use the same trick for values with a sign bit, because the sign bit changes the order throughout the data - making grouping impossible. We can make our lives easier though: since the algorithm can now cope with 1s-complement and 2s-complement data, we can switch to 1s-complement rather than to fully unsigned, which as discussed above: is pretty easy:

(Aside: actually, there is a related trick we can do to avoid having to pre-process floating point data, but: it would make this entire blog redundant! So for the purposes of a problem to investigate, we’re going to assume that we need to do this transformation.)

unsafe int ToRadix (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    // if sign bit set: flip all bits except the MSB
    return (raw & MSB) == 0 ? raw : ~raw | MSB;
}

A nice side-effect of this is that it is self-reversing: we can apply the exact same bit operations to convert from 1s-complement back to a sign bit.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97

We’ve made a slight but measurable improvment - nothing drastic, but the code is nicer.

Blocking ourselves out

We have a large chunk of data, and we want to perform a transformation on each value. So far, we’ve looked at a per-value transformation function (Sortable), but that means the overhead of a call per-value (which may or may not get inlined, depending on the complexity, and how we resolve the method - i.e. does it involve a virtual call to a type that isn’t reliably known). Additioanlly, it makes it very hard for us to apply more advanced optimizations! Blocks good.

So; let’s say we have our existing loop:

float[] values = ...
int[] radix = ...
for(int i = 0 ; i < values.Length; i++)
{
    radix = someHelper.Sortable(values[i]);
}

and we want to retain the ability to swap in per-type implementations of someHelper.Sortable; we can significantly reduce the call overhead by performing a block-based transformation. Consider:

float[] values = ...
int[] radix = ...
someHelper.ToRadix(values, radix);
...
unsafe void ToRadix(float[] source, int[] destination)
{
    const int MSB = 1 << 31;
    for(int i = 0 ; i < source.Length; i++)
    {
        var val = source[i];
        int raw = *(int*)(&val);
        // if sign bit set: flip all bits except the MSB
        destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
    }
}

How much of a speed improvement this makes depends a lot on whether the JIT managed to inline the IL from the original version. It is usually a good win by itself, but more importantly: it is a key stepping stone to further optimizations.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96

Another small improvement; I was hoping for more, but I suspect that the JIT was already doing a good job of inlining the method we're calling, making it almost the same loop at runtime. This is not always the case, though - especially if you have multiple different transformations to apply through a single API.

Safely spanning the performance chasm

You’ll notice that in the code above I’ve made use of unsafe code. There are a few things that make unsafe appealing, but one of the things it does exceptionally well is allow us to reintrepret chunks of data as other types, which is what this line does:

int raw = *(int*)(&val);

Actually, there are some methods on BitConverter to do exactly this, but only the 64-bit (double/long) versions exist in the “.NET Framework” (“.NET Core” has both 32-bit and 64-bit) - and that only helps us with this single example, rather than the general case.

For example, if we are talking in pointers, we can tell the compiler to treat a float* as though it were an int*. One way we might be tempted to rewrite our ToRadix method could be to move this coercison earlier:

unsafe void ToRadix(float[] source, int[] destination)
{
    const int MSB = 1 << 31;
    fixed(float* fPtr = source)
    {
        int* ptr = (int*)fPtr;
        for(int i = 0 ; i < values.Length; i++)
        {
            int raw = *ptr++;
            destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
        }
    }
}

Now we’re naturally reading values out as int, rather than performing any reinterpretation per value. This is useful, but it requires us to use unsafe code (always a great way to get hurt), and it doesn’t work with generics - you cannot use T* for some <T>, even with the where T : struct constraint.

I’ve spoken more than a few times about Span<T>; quite simply: it rocks. To recap, Span<T> (and it’s heap-friendly cousin, Memory<T>) is a general purpose, efficient, and versatile representation of contiguous memory - which includes things like arrays (float[]), but more exotic things too.

One of the most powerful (but simple) features of Span<T> is that it allows us to do type coercison in fully safe managed code. For example, instead of a float[], let’s say that we have a Span<float>. We can reinterpet that as int very simply:

Span<float> values = ...
var asIntegers = values.NonPortableCast<float, int>();

Note: this API is likely to change - by the time it hits general availablity, it’ll probably be:

var asIntegers = values.Cast<int>();

What this does is:

  • look at the sizes of the original and target type
  • calculate how many of the target type fit into the original data
  • round down (so we never go out of range)
  • hand us back a span of the target type, of that (possibly reduced) length

Since float and int are the same size, we’ll find that asIntegers has the same length as values.

What is especially powerful here is that this trick works with generics. It does something that unsafe code will not do for us. Note that a lot of love has been shown to Span<T> in the runtime and JIT - essentially all of the same tricks that make T[] array performance largely indistinguishable from pointer T* performance.

This means we could simplify a lot of things by converting from our generic T even earlier (so we only do it once for the entire scope of our radix code), and having our radix converter just talk in terms of the raw bits (usually: uint).

// our original data
float[] values = ...
// recall that radix sort needs some extra space to work in
float[] workspace = ... 
// implicit <float> and implicit conversion to Span<float>
RadixSort32(values, workspace); 

...
RadixSort32<TSource>(Span<T> typedSource, Span<T> typedWorkspace)
    where T : struct
{
    // note that the JIT can inline this *completely* and remove impossible
    //  code paths, because for struct T, the JIT is per-T
    if (Unsafe.SizeOf<T>() != Unsafe.SizeOf<uint>()) .. throw an error

    var source = typedSource.NonPortableCast<T, uint>();
    var workspace = typedWorkspace.NonPortableCast<T, uint>();

    // convert the radix if needed (into the workspace)
    var converter = GetConverter<T>();
    converter?.ToRadix(source, workspace);

    // ... more radix sort details not shown
}
...
// our float converter
public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;
    for(int i = 0 ; i < values.Length; i++)
    {
        uint raw = values[i];
        destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
    }
}

The code is getting simpler, while retaining performance and becoming more generic-friendly; and we haven’t needed to use a single unsafe. You’ll have to excuse me an Unsafe.SizeOf<T>() - despite the name, this isn’t really an “unsafe” operation in the usual sense - this is simply a wrapper to the sizeof IL instruction that is perfectly well defined for all T that are usable in generics. It just isn’t directly available in safe C#.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96
ToRadixSpan 7,976.3 us 0.76

Now we’re starting to make decent improvements - Span<T> is really useful for large operations where type coercion is necessary.

Taking up tree surgery: prune those branches

Something that gnaws at my soul in where we’ve got to is that it includes a branch - an if test, essentially - in the inner part of the loop. Actually, there’s two and they’re both hidden. The first is in the for loop, but the one I’m talking about here is the one hidden in the ternary conditional operation, a ? b : c. CPUs are very clever about branching, with branch prediction and other fancy things - but it can still stall the instruction pipeline, especially if the prediction is wrong. If only there was a way to rewrite that operation to not need a branch. I’m sure you can see where this is going.

Branching: bad. Bit operations: good. A common trick we can use to remove branches is to obtain a bit-mask that is either all 0s (000…000) or all 1s (111…111) - so: 0 and -1 in 2s-complement terms. There are various ways we can do that (although it also depends on the actual value of true in your target system, which is a surprisingly complex question). Obviously one way to do that would be:

// -1 if negative, 0 otherwise
var mask = (raw & MSB) == 0 ? 0 : ~0;

but that just adds another branch. If we were using C, we could use the knowledge that the equality test returns an integer of either 0 or 1, and just negate that to get 0 or -1:

// -1 if negative, 0 otherwise
var mask = -((raw & MSB) == 0);

But no such trick is available in C#. What we can do, though, is use knowledge of arithmetic shift. Left-shift (<<) is simple; we shift our bits n places to the left, filling in with 0s on the right. So binary 11001101 << 3 becomes 01101000 (we lose the 110 from the left).

Right-shift is more subtle, as there are two of them: logical and arithmetic, which are essentially unsigned and signed. The logical shift (used with uint etc) moves our bits n places to the right, filling in with 0s on the left, so 11001101 >> 3 gives 00011001 (we lose the 101 from the right). The arithmetic shift behaves differently depending on whether the most significant bit (which tells us the sign of the value) is 0 or 1. If it is 0, it behaves exactly like the logical shift; however, if it is 1, it fills in with 1s on the left; so 11001101 >> 3 gives 11111001. Using this, we can use >> 31 (or >> 63 for 64-bit data) to create a mask that matches the sign of the original data:

// -1 if negative, 0 otherwise
var mask = (uint)(((int)raw) >> 31);

Don’t worry about the extra conversions: as long as we’re in an unchecked context (which we are by default), they simply don’t exist. All they do is tell the compiler which shift operation to emit. If you’re curious, you can see this here, but in IL terms, this is just:

(push the value onto the stack)
ldc.i4.s 31 // push 31 onto the stack
shr         // arithmetic right shift

In IL terms, there’s really no difference between signed and unsigned integers, other than whether the compiler emites the signed or unsigned opcodes for operations. In this case we’ve told it to emit shr - the signed/arithmetic opcode, instead of shr.un - the unsigned/logical opcode.

OK; so we’ve got a mask that is either all zeros or all ones. Now we need to use it to avoid a branch, but how? Consider:

var condition = // all zeros or all ones
var result = (condition & trueValue) | (~condition & falseValue);

If condition is all zeros, then the condition & trueValue gives us zero; the ~condition becomes all ones, and therefore ~condition & falseValue gives us falseValue. When we “or” (|) those together, we get falseValue.

Likewise, if condition is all ones, then condition & trueValue gives us trueValue; the ~condition becomes all zeros, and therefore ~condition & falseValue gives us zero. When we “or” (|) those together, we get trueValue.

So our branchless operation becomes:

public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;
    for(int i = 0 ; i < values.Length; i++)
    {
        uint raw = values[i];
        var ifNeg = (uint)(((int)raw) >> 31);
        destination[i] =
            (ifNeg & (~raw | MSB)) // true
            | (~ifNeg & raw);      // false
    }
}

This might look more complicated, but it is very CPU-friendly: it pipelines very well, and doesn’t involve any branches for it to worry about. Doing a few extra bit operations is nothing to a CPU - especially if they can be pipelined. Long instruction pipelines are actually a good thing to a CPU - compared to a branch or something that might involve a cache miss, at least.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96
ToRadixSpan 7,976.3 us 0.76
Branchless 2,507.0 us 0.24

By removing the branches, we’re down to less than a quarter of the original run-time; that’s a huge win, even if the code is slightly more complex.

Why do one thing at a time?

OK, so we’ve got a nice branchless version, and the world looks great. We’ve made significant improvements. But we can still get much better. At the moment we’re processing each value one at a time, but as it happens, this is a perfect scenario for vectorization via SIMD (“Single instruction, multiple data”).

We have a pipelineable operation without any branches; if we can execute it for one value, we can probably execute it for multiple values at the same time - magic, right? Many modern CPUs include support for performing basic operations like the above on multiple values at a time, using super-wide registers. Right now we’re using 32-bit values, but most current CPUs will have support for AVX (mostly: 128-bit) or AVX2 (mostly: 256-bit) operations. If you’re on a very expensive server, you might have more (AVX512). But let’s assume AVX2: that means we can handle 8 32-bit values at a time. That means 1/8th of the main operations, and also 1/8th of the if branches hidden in the for loop.

Some languages have automatic vectorization during compilation; C# doesn’t have that, and neither does the JIT. But, we still have access to a range of vectorized operations (with support for the more exotic intrinsics being added soon). Until recently, one of the most awkward things about working with vectorization has been loading the values. This might sound silly, but it is surprisingly difficult to pull the values in efficiently when you don’t know how wide the vector registers are on the target CPU. Fortunately, our amazing new friend Span<T> jumps to the rescue here - making it almost embarrassingly easy!

First, let’s look at what the shell loop might look like, without actually doing the real work in the middle:

public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;

    int i = 0;
    if (Vector.IsHardwareAccelerated)
    {                               
        var vSource = values.NonPortableCast<uint, Vector<uint>>();
        var vDest = destination.NonPortableCast<uint, Vector<uint>>();
        
        for (int j = 0; j < vSource.Length; j++)
        {
            var vec = vSource[j];
            vDest[j] = // TODO
        }
        // change our root offset for the remainder of the values
        i = vSource.Length * Vector<uint>.Count;
    }
    for( ; i < values.Length; i++)
    {
        uint raw = values[i];
        var ifNeg = (uint)(((int)raw) >> 31);
        destination[i] =
            (ifNeg & (~raw | MSB)) // true
            | (~ifNeg & raw);      // false
    }
}

First, look at the bottom of the code; here we see that our regular branchless code still persists. This is for two reasons:

  • the target CPU might not be capable of vectorization
  • our input data might not be a nice multiple of the register-width, so we might need to process a final few items the old way

Note that we’ve changed the for loop so that it doesn’t reset the position of i - we don’t necessarily start at 0.

Now look at the if (Vector.IsHardwareAccelerated); this checks that suitable vectorization support is available. Note that the JIT can optimize this check away completely (and remove all of the inner code if it won’t be reached). If we do have support, we cast the span from a Span<uint> to a Span<Vector<uint>>. Note that Vector<T> is recognized by the JIT, and will be reshaped by the JIT to match the size of the available vectorization support on the running computer. That means that when using Vector<T> we don’t need to worry about whether the target computer has SSE vs AVX vs AVX2 etc - or what the available widths are; simply: “give me what you can”, and the JIT worries about the details.

We can now loop over the vectors available in the cast spans - loading an entire vector at a time simply using our familiar: var vec = vSource[j];. This is a huge difference to what loading vectors used to look like. We then do some operation (not shown) on vec, and assign the result again as an entire vector to vDest[j]. On my machine with AVX2 support, vec is block of 8 32-bit values.

Next, we need to think about that // TODO - what are we actually going to do here? If you’ve already re-written your inner logic to be branchless, there’s actually a very good chance that it will be a like-for-like translation of your branchless code. In fact, it turns out that the ternary conditional scenario we’re looking at here is so common that there are vectorized operations precisely to do it; the “conditional select” vectorized CPU instruction can essentially be stated as:

// result conditionalSelect(condition, left, right)
result = (condition & left) | (~condition & right);

Where condition is usually either all-zeros or all-ones (but it doesn’t have to be; if you want to pull different bits from each value, you can do that too).

This intrinsic is exposed directly on Vector, so our missing code becomes simply:

var vMSB = new Vector<uint>(MSB);
var vNOMSB = ~vMSB;
for (int j = 0; j < vSource.Length; j++)
{
    var vec = vSource[j];
    vDest[j] = Vector.ConditionalSelect(
        condition: Vector.GreaterThan(vec, vNOMSB),
        left: ~vec | vMSB, // when true
        right: vec // when false
    );
}

Note that I’ve pre-loaded a vector with the MSB value (which creates a vector with that value in every cell), and I’ve switched to using a > test instead of a bit test and shift. Partly, this is because the vectorized equality / inequality operations expect this kind of usage, and very kindly return -1 as their true value - using the result to directly feed “conditional select”.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96
ToRadixSpan 7,976.3 us 0.76
Branchless 2,507.0 us 0.24
Vectorized 930.0 us 0.09

As you can see, the effect of vectorization on this type of code is just amazing - with us now getting more than an order-of-magnitude improvement on the original data. That’s why I’m so excited about how easy (relatively speaking) Span<T> makes vectorization, and why I can’t wait for Span<T> to hit production.

A reasonable range of common operations are available on Vector and Vector<T>. If you need exotic operations like “gather”, you might need to wait until System.Runtime.Intrinsics lands. One key difference here is that Vector<T> exposes the common intersection of operations that might be available (with different widths) against different CPU instruction sets, where as System.Runtime.Intrinsics aims to expose the underlying intrinsics - giving access to the full range of instructions, but forcing you to code specifically to a chosen instruction set (or possibly having two implementations - one for AVX and one for AVX2). This is simply because there isn’t a uniform API surface between generatons and vendors - it isn’t simply that you get the same operations with different widths: you get different operations too. So you’d typically be checking Aes.IsSupported, Avx2.IsSupported, etc. Being realistic: Vector<T> is what we have today, and it worked damned well.

Summary

We’ve looked at a range of advanced techniques for improving performance of critical loops of C# code, including (to repeat the list from the start):

  • using knowledge of how signed data works to avoid having to transform between them
  • performing operations in blocks rather than per value to reduce calls
  • using Span<T> as a replacemment for unsafe code and unmanaged pointers, allowing you to get very high performance even in 100% managed/safe code
  • investigating branch removal as a performance optimization of critical loops
  • vectorizing critical loops to do the same work with significantly fewer CPU operations

And we’ve seen dramatic improvements to the performance. Hopefully, some or all of these techniques will be applicable to your own code. Either way, I hope it has been an interesting diversion.

Next time: practical parallelization

Addendum

For completeness: yes I also tried a val ^ ~MSB approach for both branchless and vectorized; it wasn't an improvement.

And for the real implementation (the "aside" mentioned above): what the code actually does for sign-bit data (IEEE754) is: sort just on the sign bit first, use the count data to find where the sign changes (without scanning over the data an extra time), and then sort the two halves separately ignoring the MSB, with the first chunk in descending order and the second chunk in ascending order. By doing this, we avoid the need for the transform - again, by using knowledge of the bit layout of the data.