Tuesday, 30 January 2018

Sorting myself out, extreme edition

...where I go silly with optimization

Yes, I’m still talking about sorting... ish. I love going deep on problems - it isn’t enough to just have a solution - I like knowing that I have the best solution that I am capable of. This isn’t always possible in a business setting - deadlines etc - but this is my own time.

In part 1, we introduced the scenario - and discussed how to build a composite unsigned value that we could use to sort multiple properties as a single key.

In part 2, we looked a little at radix sort, and found that it was a very compelling candidate.

In this episode, we’re going to look at some ways to signficantly improve what we’ve done so far. In particular, we’ll look at:

  • using knowledge of how signed data works to avoid having to transform between them
  • performing operations in blocks rather than per value to reduce calls
  • using Span<T> as a replacemment for unsafe code and unmanaged pointers, allowing you to get very high performance even in 100% managed/safe code
  • investigating branch removal as a performance optimization of critical loops
  • vectorizing critical loops to do the same work with significantly fewer CPU operations

Hopefully, as a follow up after this one, I’ll look at pratical guidance on parallelizing this same work to spread the load over available cores.

Key point: the main purpose of these words is not to discuss how to implement a radix sort - in fact, we don’t even do that. Instead, it uses one small part of radix sort as an example problem with which to discuss much broader concepts of performance optimization in C# / .NET.

Obviously I can’t cover the entire of radix sort for these, so I’m going to focus on one simple part: composing the radix for sorting. To recall, a naive implementation of radix sort requires unsigned keys, so that the data is naturally sortable in their binary representation. Signed integers and floating point numbers don’t follow this layout, so in part 1 we introduced some basic tools to change between them:

uint Sortable(int value)
{
    // re-base eveything upwards, so anything
    // that was the min-value is now 0, etc
    var val = unchecked((uint)(value - int.MinValue));
    return val;
}
unsafe uint Sortable (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    if ((raw & MSB) != 0) // IEEE first bit is the sign bit
    {
        // is negative; shoult interpret as -(the value without the MSB) - not the same as just
        // dropping the bit, since integer math is twos-complement
        raw = -(raw & ~MSB);
    }
    return Sortable(raw);
}

These two simple transformation - applied to our target values - will form the central theme of this entry.

To measure performance, I’ll be using the inimitable BenchmarkDotNet, looking at multiple iterations of transforming 2 million random float values taken from a seeded Random(), with varying signs etc. The method above will be our baseline, and at each step we’ll add a new row to our table at the bottom:

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00

This gives us a good starting point.

A negative sign of the times

What’s faster than performing a fast operation? Not performing a fast operation. The way radix sort works is by looking at the sort values r bits at a time (commonly 4, 8, 10, but any number is valid) and for that block of bits: counting how many candidates are in each of the 1 << r possible buckets. So if r is 3, we have 8 possible buckets. From that it computes target offsets for each group: if there are 27 values in bucket 0, 12 in bucket 1, 3 in bucket 2, etc - then when sorted bucket 0 will start at offset 0, bucket 1 at offset 27, bucket 2 at offset 39, bucket 3 at offset 41, and so on - just by accumulating the counts. But this breaks if we have signed numbers.

Why?

First, let’s remind ourselves of the various ways that signed and unsigned data can be represented in binary, using a 4 bit number system and integer representations:

Binary Unsigned 2s-complement 1s-complement Sign bit
0000 0 0 +0 +0
0001 1 1 1 1
0010 2 2 2 2
0011 3 3 3 3
0100 4 4 4 4
0101 5 5 5 5
0110 6 6 6 6
0111 7 7 7 7
1000 8 -8 -7 -0
1001 9 -7 -6 -1
1010 10 -6 -5 -2
1011 11 -5 -4 -3
1100 12 -4 -3 -4
1101 13 -3 -2 -5
1110 14 -2 -1 -6
1111 15 -1 -0 -7

We’re usually most familiar with unsigned and 2s-complement representations, because that is what most modern processors use to represent integers. 1s-complement is where -x ≡ ~x - i.e. to negate something we simply invert all the bits. This works fine but has two zeros, which is one more than we usually need - hence we usually use 2s-complement which simply adds an off-by-one step; this makes zero unambiguous (very useful for false as we’ll see later) and (perhaps less important) gives us an extra negative value to play with.

The final option is to use a sign bit; to negate a number we flip the most significant bit, so -x ≡ x ^ 0b1000. IEEE754 floating point numbers (float and double) are implemented using a sign bit, which is why floating point numbers have +0 and -0. Due to clever construction, the rest of the value can be treated as naturally/bitwise sortable - even without needing to understand about the “mantissa”, “exponent” and “bias”. This means that to convert a negative float (or any other sign-bit number) to a 1s-complement representation, we simply flip all the bits except the most significant bit. Or we flip all the bits and put the most significant bit back again, since we know it should be a 1.


So: armed with this knowledge, we can see that signed data in 1s-complement or 2s-complement is almost “naturally sortable” in binary, but simply: the negative values are sorted in increasing numerical value, but come after the positive values (we can happily assert that -0 < +0). This means that we can educate radix sort about 1s-complement and 2s-complement signed data simply by being clever when processing the final left-most chunk: based on r and the bit-width, calculate which bit is the most-signficant bit (which indicates sign), and simply process the negative buckets first (still in the same order) when calculating the offsets; then calculate the offsets of the non-negative buckets. If we were using the 4-bit system above and r=4, we would have 16 buckets, and would calculate offsets in the order (of unsigned buckets) 8-15 then 0-7.

By doing this, we can completely remove any need to do any pre-processing when dealing with values like int. We could perhaps wish that the IEEE754 committee had preferred 1s-complement so we could skip all of this for float too, but a: I think it is fair to assume that there are good technical reasons for the choice (presumably relating to fast negation, and fast access to the mantissa/exponent), and b: it is moot: IEEE754 is implemented in CPU architectures and is here to stay.

So we’re still left with an issue for float: we can’t use the same trick for values with a sign bit, because the sign bit changes the order throughout the data - making grouping impossible. We can make our lives easier though: since the algorithm can now cope with 1s-complement and 2s-complement data, we can switch to 1s-complement rather than to fully unsigned, which as discussed above: is pretty easy:

(Aside: actually, there is a related trick we can do to avoid having to pre-process floating point data, but: it would make this entire blog redundant! So for the purposes of a problem to investigate, we’re going to assume that we need to do this transformation.)

unsafe int ToRadix (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    // if sign bit set: flip all bits except the MSB
    return (raw & MSB) == 0 ? raw : ~raw | MSB;
}

A nice side-effect of this is that it is self-reversing: we can apply the exact same bit operations to convert from 1s-complement back to a sign bit.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97

We’ve made a slight but measurable improvment - nothing drastic, but the code is nicer.

Blocking ourselves out

We have a large chunk of data, and we want to perform a transformation on each value. So far, we’ve looked at a per-value transformation function (Sortable), but that means the overhead of a call per-value (which may or may not get inlined, depending on the complexity, and how we resolve the method - i.e. does it involve a virtual call to a type that isn’t reliably known). Additioanlly, it makes it very hard for us to apply more advanced optimizations! Blocks good.

So; let’s say we have our existing loop:

float[] values = ...
int[] radix = ...
for(int i = 0 ; i < values.Length; i++)
{
    radix = someHelper.Sortable(values[i]);
}

and we want to retain the ability to swap in per-type implementations of someHelper.Sortable; we can significantly reduce the call overhead by performing a block-based transformation. Consider:

float[] values = ...
int[] radix = ...
someHelper.ToRadix(values, radix);
...
unsafe void ToRadix(float[] source, int[] destination)
{
    const int MSB = 1 << 31;
    for(int i = 0 ; i < source.Length; i++)
    {
        var val = source[i];
        int raw = *(int*)(&val);
        // if sign bit set: flip all bits except the MSB
        destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
    }
}

How much of a speed improvement this makes depends a lot on whether the JIT managed to inline the IL from the original version. It is usually a good win by itself, but more importantly: it is a key stepping stone to further optimizations.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96

Another small improvement; I was hoping for more, but I suspect that the JIT was already doing a good job of inlining the method we're calling, making it almost the same loop at runtime. This is not always the case, though - especially if you have multiple different transformations to apply through a single API.

Safely spanning the performance chasm

You’ll notice that in the code above I’ve made use of unsafe code. There are a few things that make unsafe appealing, but one of the things it does exceptionally well is allow us to reintrepret chunks of data as other types, which is what this line does:

int raw = *(int*)(&val);

Actually, there are some methods on BitConverter to do exactly this, but only the 64-bit (double/long) versions exist in the “.NET Framework” (“.NET Core” has both 32-bit and 64-bit) - and that only helps us with this single example, rather than the general case.

For example, if we are talking in pointers, we can tell the compiler to treat a float* as though it were an int*. One way we might be tempted to rewrite our ToRadix method could be to move this coercison earlier:

unsafe void ToRadix(float[] source, int[] destination)
{
    const int MSB = 1 << 31;
    fixed(float* fPtr = source)
    {
        int* ptr = (int*)fPtr;
        for(int i = 0 ; i < values.Length; i++)
        {
            int raw = *ptr++;
            destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
        }
    }
}

Now we’re naturally reading values out as int, rather than performing any reinterpretation per value. This is useful, but it requires us to use unsafe code (always a great way to get hurt), and it doesn’t work with generics - you cannot use T* for some <T>, even with the where T : struct constraint.

I’ve spoken more than a few times about Span<T>; quite simply: it rocks. To recap, Span<T> (and it’s heap-friendly cousin, Memory<T>) is a general purpose, efficient, and versatile representation of contiguous memory - which includes things like arrays (float[]), but more exotic things too.

One of the most powerful (but simple) features of Span<T> is that it allows us to do type coercison in fully safe managed code. For example, instead of a float[], let’s say that we have a Span<float>. We can reinterpet that as int very simply:

Span<float> values = ...
var asIntegers = values.NonPortableCast<float, int>();

Note: this API is likely to change - by the time it hits general availablity, it’ll probably be:

var asIntegers = values.Cast<int>();

What this does is:

  • look at the sizes of the original and target type
  • calculate how many of the target type fit into the original data
  • round down (so we never go out of range)
  • hand us back a span of the target type, of that (possibly reduced) length

Since float and int are the same size, we’ll find that asIntegers has the same length as values.

What is especially powerful here is that this trick works with generics. It does something that unsafe code will not do for us. Note that a lot of love has been shown to Span<T> in the runtime and JIT - essentially all of the same tricks that make T[] array performance largely indistinguishable from pointer T* performance.

This means we could simplify a lot of things by converting from our generic T even earlier (so we only do it once for the entire scope of our radix code), and having our radix converter just talk in terms of the raw bits (usually: uint).

// our original data
float[] values = ...
// recall that radix sort needs some extra space to work in
float[] workspace = ... 
// implicit <float> and implicit conversion to Span<float>
RadixSort32(values, workspace); 

...
RadixSort32<TSource>(Span<T> typedSource, Span<T> typedWorkspace)
    where T : struct
{
    // note that the JIT can inline this *completely* and remove impossible
    //  code paths, because for struct T, the JIT is per-T
    if (Unsafe.SizeOf<T>() != Unsafe.SizeOf<uint>()) .. throw an error

    var source = typedSource.NonPortableCast<T, uint>();
    var workspace = typedWorkspace.NonPortableCast<T, uint>();

    // convert the radix if needed (into the workspace)
    var converter = GetConverter<T>();
    converter?.ToRadix(source, workspace);

    // ... more radix sort details not shown
}
...
// our float converter
public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;
    for(int i = 0 ; i < values.Length; i++)
    {
        uint raw = values[i];
        destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
    }
}

The code is getting simpler, while retaining performance and becoming more generic-friendly; and we haven’t needed to use a single unsafe. You’ll have to excuse me an Unsafe.SizeOf<T>() - despite the name, this isn’t really an “unsafe” operation in the usual sense - this is simply a wrapper to the sizeof IL instruction that is perfectly well defined for all T that are usable in generics. It just isn’t directly available in safe C#.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96
ToRadixSpan 7,976.3 us 0.76

Now we’re starting to make decent improvements - Span<T> is really useful for large operations where type coercion is necessary.

Taking up tree surgery: prune those branches

Something that gnaws at my soul in where we’ve got to is that it includes a branch - an if test, essentially - in the inner part of the loop. Actually, there’s two and they’re both hidden. The first is in the for loop, but the one I’m talking about here is the one hidden in the ternary conditional operation, a ? b : c. CPUs are very clever about branching, with branch prediction and other fancy things - but it can still stall the instruction pipeline, especially if the prediction is wrong. If only there was a way to rewrite that operation to not need a branch. I’m sure you can see where this is going.

Branching: bad. Bit operations: good. A common trick we can use to remove branches is to obtain a bit-mask that is either all 0s (000…000) or all 1s (111…111) - so: 0 and -1 in 2s-complement terms. There are various ways we can do that (although it also depends on the actual value of true in your target system, which is a surprisingly complex question). Obviously one way to do that would be:

// -1 if negative, 0 otherwise
var mask = (raw & MSB) == 0 ? 0 : ~0;

but that just adds another branch. If we were using C, we could use the knowledge that the equality test returns an integer of either 0 or 1, and just negate that to get 0 or -1:

// -1 if negative, 0 otherwise
var mask = -((raw & MSB) == 0);

But no such trick is available in C#. What we can do, though, is use knowledge of arithmetic shift. Left-shift (<<) is simple; we shift our bits n places to the left, filling in with 0s on the right. So binary 11001101 << 3 becomes 01101000 (we lose the 110 from the left).

Right-shift is more subtle, as there are two of them: logical and arithmetic, which are essentially unsigned and signed. The logical shift (used with uint etc) moves our bits n places to the right, filling in with 0s on the left, so 11001101 >> 3 gives 00011001 (we lose the 101 from the right). The arithmetic shift behaves differently depending on whether the most significant bit (which tells us the sign of the value) is 0 or 1. If it is 0, it behaves exactly like the logical shift; however, if it is 1, it fills in with 1s on the left; so 11001101 >> 3 gives 11111001. Using this, we can use >> 31 (or >> 63 for 64-bit data) to create a mask that matches the sign of the original data:

// -1 if negative, 0 otherwise
var mask = (uint)(((int)raw) >> 31);

Don’t worry about the extra conversions: as long as we’re in an unchecked context (which we are by default), they simply don’t exist. All they do is tell the compiler which shift operation to emit. If you’re curious, you can see this here, but in IL terms, this is just:

(push the value onto the stack)
ldc.i4.s 31 // push 31 onto the stack
shr         // arithmetic right shift

In IL terms, there’s really no difference between signed and unsigned integers, other than whether the compiler emites the signed or unsigned opcodes for operations. In this case we’ve told it to emit shr - the signed/arithmetic opcode, instead of shr.un - the unsigned/logical opcode.

OK; so we’ve got a mask that is either all zeros or all ones. Now we need to use it to avoid a branch, but how? Consider:

var condition = // all zeros or all ones
var result = (condition & trueValue) | (~condition & falseValue);

If condition is all zeros, then the condition & trueValue gives us zero; the ~condition becomes all ones, and therefore ~condition & falseValue gives us falseValue. When we “or” (|) those together, we get falseValue.

Likewise, if condition is all ones, then condition & trueValue gives us trueValue; the ~condition becomes all zeros, and therefore ~condition & falseValue gives us zero. When we “or” (|) those together, we get trueValue.

So our branchless operation becomes:

public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;
    for(int i = 0 ; i < values.Length; i++)
    {
        uint raw = values[i];
        var ifNeg = (uint)(((int)raw) >> 31);
        destination[i] =
            (ifNeg & (~raw | MSB)) // true
            | (~ifNeg & raw);      // false
    }
}

This might look more complicated, but it is very CPU-friendly: it pipelines very well, and doesn’t involve any branches for it to worry about. Doing a few extra bit operations is nothing to a CPU - especially if they can be pipelined. Long instruction pipelines are actually a good thing to a CPU - compared to a branch or something that might involve a cache miss, at least.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96
ToRadixSpan 7,976.3 us 0.76
Branchless 2,507.0 us 0.24

By removing the branches, we’re down to less than a quarter of the original run-time; that’s a huge win, even if the code is slightly more complex.

Why do one thing at a time?

OK, so we’ve got a nice branchless version, and the world looks great. We’ve made significant improvements. But we can still get much better. At the moment we’re processing each value one at a time, but as it happens, this is a perfect scenario for vectorization via SIMD (“Single instruction, multiple data”).

We have a pipelineable operation without any branches; if we can execute it for one value, we can probably execute it for multiple values at the same time - magic, right? Many modern CPUs include support for performing basic operations like the above on multiple values at a time, using super-wide registers. Right now we’re using 32-bit values, but most current CPUs will have support for AVX (mostly: 128-bit) or AVX2 (mostly: 256-bit) operations. If you’re on a very expensive server, you might have more (AVX512). But let’s assume AVX2: that means we can handle 8 32-bit values at a time. That means 1/8th of the main operations, and also 1/8th of the if branches hidden in the for loop.

Some languages have automatic vectorization during compilation; C# doesn’t have that, and neither does the JIT. But, we still have access to a range of vectorized operations (with support for the more exotic intrinsics being added soon). Until recently, one of the most awkward things about working with vectorization has been loading the values. This might sound silly, but it is surprisingly difficult to pull the values in efficiently when you don’t know how wide the vector registers are on the target CPU. Fortunately, our amazing new friend Span<T> jumps to the rescue here - making it almost embarrassingly easy!

First, let’s look at what the shell loop might look like, without actually doing the real work in the middle:

public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;

    int i = 0;
    if (Vector.IsHardwareAccelerated)
    {                               
        var vSource = values.NonPortableCast<uint, Vector<uint>>();
        var vDest = destination.NonPortableCast<uint, Vector<uint>>();
        
        for (int j = 0; j < vSource.Length; j++)
        {
            var vec = vSource[j];
            vDest[j] = // TODO
        }
        // change our root offset for the remainder of the values
        i = vSource.Length * Vector<uint>.Count;
    }
    for( ; i < values.Length; i++)
    {
        uint raw = values[i];
        var ifNeg = (uint)(((int)raw) >> 31);
        destination[i] =
            (ifNeg & (~raw | MSB)) // true
            | (~ifNeg & raw);      // false
    }
}

First, look at the bottom of the code; here we see that our regular branchless code still persists. This is for two reasons:

  • the target CPU might not be capable of vectorization
  • our input data might not be a nice multiple of the register-width, so we might need to process a final few items the old way

Note that we’ve changed the for loop so that it doesn’t reset the position of i - we don’t necessarily start at 0.

Now look at the if (Vector.IsHardwareAccelerated); this checks that suitable vectorization support is available. Note that the JIT can optimize this check away completely (and remove all of the inner code if it won’t be reached). If we do have support, we cast the span from a Span<uint> to a Span<Vector<uint>>. Note that Vector<T> is recognized by the JIT, and will be reshaped by the JIT to match the size of the available vectorization support on the running computer. That means that when using Vector<T> we don’t need to worry about whether the target computer has SSE vs AVX vs AVX2 etc - or what the available widths are; simply: “give me what you can”, and the JIT worries about the details.

We can now loop over the vectors available in the cast spans - loading an entire vector at a time simply using our familiar: var vec = vSource[j];. This is a huge difference to what loading vectors used to look like. We then do some operation (not shown) on vec, and assign the result again as an entire vector to vDest[j]. On my machine with AVX2 support, vec is block of 8 32-bit values.

Next, we need to think about that // TODO - what are we actually going to do here? If you’ve already re-written your inner logic to be branchless, there’s actually a very good chance that it will be a like-for-like translation of your branchless code. In fact, it turns out that the ternary conditional scenario we’re looking at here is so common that there are vectorized operations precisely to do it; the “conditional select” vectorized CPU instruction can essentially be stated as:

// result conditionalSelect(condition, left, right)
result = (condition & left) | (~condition & right);

Where condition is usually either all-zeros or all-ones (but it doesn’t have to be; if you want to pull different bits from each value, you can do that too).

This intrinsic is exposed directly on Vector, so our missing code becomes simply:

var vMSB = new Vector<uint>(MSB);
var vNOMSB = ~vMSB;
for (int j = 0; j < vSource.Length; j++)
{
    var vec = vSource[j];
    vDest[j] = Vector.ConditionalSelect(
        condition: Vector.GreaterThan(vec, vNOMSB),
        left: ~vec | vMSB, // when true
        right: vec // when false
    );
}

Note that I’ve pre-loaded a vector with the MSB value (which creates a vector with that value in every cell), and I’ve switched to using a > test instead of a bit test and shift. Partly, this is because the vectorized equality / inequality operations expect this kind of usage, and very kindly return -1 as their true value - using the result to directly feed “conditional select”.

Method Mean Scaled
SortablePerValue 10,483.8 us 1.00
ToRadixPerValue 10,120.5 us 0.97
ToRadixBlock 10,080.0 us 0.96
ToRadixSpan 7,976.3 us 0.76
Branchless 2,507.0 us 0.24
Vectorized 930.0 us 0.09

As you can see, the effect of vectorization on this type of code is just amazing - with us now getting more than an order-of-magnitude improvement on the original data. That’s why I’m so excited about how easy (relatively speaking) Span<T> makes vectorization, and why I can’t wait for Span<T> to hit production.

A reasonable range of common operations are available on Vector and Vector<T>. If you need exotic operations like “gather”, you might need to wait until System.Runtime.Intrinsics lands. One key difference here is that Vector<T> exposes the common intersection of operations that might be available (with different widths) against different CPU instruction sets, where as System.Runtime.Intrinsics aims to expose the underlying intrinsics - giving access to the full range of instructions, but forcing you to code specifically to a chosen instruction set (or possibly having two implementations - one for AVX and one for AVX2). This is simply because there isn’t a uniform API surface between generatons and vendors - it isn’t simply that you get the same operations with different widths: you get different operations too. So you’d typically be checking Aes.IsSupported, Avx2.IsSupported, etc. Being realistic: Vector<T> is what we have today, and it worked damned well.

Summary

We’ve looked at a range of advanced techniques for improving performance of critical loops of C# code, including (to repeat the list from the start):

  • using knowledge of how signed data works to avoid having to transform between them
  • performing operations in blocks rather than per value to reduce calls
  • using Span<T> as a replacemment for unsafe code and unmanaged pointers, allowing you to get very high performance even in 100% managed/safe code
  • investigating branch removal as a performance optimization of critical loops
  • vectorizing critical loops to do the same work with significantly fewer CPU operations

And we’ve seen dramatic improvements to the performance. Hopefully, some or all of these techniques will be applicable to your own code. Either way, I hope it has been an interesting diversion.

Next time: practical parallelization

Addendum

For completeness: yes I also tried a val ^ ~MSB approach for both branchless and vectorized; it wasn't an improvement.

And for the real implementation (the "aside" mentioned above): what the code actually does for sign-bit data (IEEE754) is: sort just on the sign bit first, use the count data to find where the sign changes (without scanning over the data an extra time), and then sort the two halves separately ignoring the MSB, with the first chunk in descending order and the second chunk in ascending order. By doing this, we avoid the need for the transform - again, by using knowledge of the bit layout of the data.

Saturday, 20 January 2018

More Of A Sort Of Problem

(part 1 here)

(part 3 here)

So last time I talked about a range of ways of performing a sort, ranging from the simple thru to hijacking .NET source code. Very quickly, some folks pointed out that I should have looked at “radix sort”, and they’re absoltely right - I should have. In fact, in the GPU version of this same code, we do exactly that via the CUB library.

The great thing is, radix sort is relatively simple, so:

Attempt 9: radix sort

The key point about radix sort is that it works by grouping the data by groups of bits in the data, using the same “bitwise sortable” idea that we used previously. We’ve already done the hard work to get a radix compliant representation of our sort data.

We can get a basic radix sort implementation from wikibooks, but this version has a few things we need to fix:

  • this is a single-array version; we want a dual array
  • we can use unsafe code to get rid of a lot of array range checks (and just: don’t be wrong!)
  • radix sort needs a workspace the same same as the input values as a scratch area; in the version shown, it allocates this internally, but in “real” code we’ll want to manage that externally and pass it in
  • we can make r (the number of bits to consider at a time) configurable
  • the shown code copies the workspace over the real data each cycle, but we can avoid this by simply swapping what we consider “real” and “workspace” each cycle, and copying once at the end if required

I’m not going to try and describe how or why radix sort works (wikipedia covers much of that here); the key thing that will be relevant in a moment is: for the group size r, it loops through all the data looking r bits at a time, to see how many values there are with each possible value for those r bits. So if r=4, there are 16 possible values over each 4 bits. Once it has that, it iterates a second time, writing the values into the corresponding places for the group it is in.

Once we have an implementation, our code basically consists of preparing the bit-sortable keys just like we did before, then simply invoking the algoritm, passing in our reusable workspace:

Helpers.RadixSort(sortKeys, index, keysWorkspace, valuesWorkspace, r);

(where keysWorkspace and valuesWorkspace are scratch areas of the required size, shared between sort cycles).

One consideration here is: what value of r (the number of bits to consider at a time) to choose. 4 is a reasonably safe default, but you can experiment with different values for your data to see what works well.

I get:

  • r=2: 3800ms
  • r=4: 1900ms
  • r=8: 1200ms
  • r=16: 2113ms

This r=8 is very tempting that is a significant improvement on our previous best.

Attempt 10: radix sort with parellelization

Remember the “that will be relevant in a moment” from a few paragraphs ago? Recall: a key point of radix sort is that for each group of bits (of size r), it needs to iterate the entire key-set to count the frequencies of each possible group value. This count operation is something that is embarrasingly parallelizable, since counting chunks can be done independently over the entire data.

To do that, we can create a number of workers, divide the key-space into that many chunks, and tell each worker to perform the counts for that chunk. Fire these workers in parallel via Parallel.Invoke or similar, and reap the rewards. This creates a slight complexity that we need to combine the counts, and there will be thread races. A naive but thread-safe implementation would be to use Interlocked.Increment to do all the counts, but that would have severe collision penalties - it is far preferable to count each chunk in complete isolation, and only worry about the combination at the end. At that point, either lock or Interlocked would be fine, as it is going to happen very minimally. We should also be careful to hoist everything we want into a local, to avoid a lot of ldarg.0, ldfld overhead:

public void Invoke()
{
    var len = Length;
    var mask = Mask;
    var keys = Keys + Offset;
    var shift = Shift;
    int* count = stackalloc int[CountLength];

    // count into a local buffer
    for (int i = 0; i < len; i++)
        count[(*keys++ >> shift) & mask]++;

    // now update the origin data, synchronized
    lock (SyncLock)
    {
        for (int i = 0; i < CountLength; i++)
            Counts[i] += count[i];
    }
}

Here we’re also using stackalloc to do all our counting in the stack space, rather than allocating a count buffer per worker. This is fine, since we’ll typically be dealing with values like r=4 (CountLength=16). Even for larger reasonable r, the stack space is fine. We could very reasonably put an upper bound on r of 16 if we wanted to be sure.

Our calling code is virtually idental - all we’re doing is changing the internal implementation:

Helpers.RadixSortParallel(sortKeys, index, keysWorkspace, valuesWorkspace, r);

So what does this do for performance? Note: I’m using Environment.ProcessorCount * 2 workers, but we could play with other values.

I get

  • r=2: 3600ms
  • r=4: 1800ms
  • r=8: 1200ms
  • r=16: 2000ms

So; we don’t get a vast improvement really - our key benefit comes from simply choosing a suitable r for our data, like r=8.

Throws down gauntlet

So; so far we’ve gone from 17s (LINQ) to 1.2s (radix sort, single-threaded or parallel). What more can we do? Can we parallelize the second half of radix sort? Can we try a completely different sort? Can we combine our index and keys so we are performing a single array sort? Can we make use of some obscure CPU instructions to perform 128-bit (or wider) operations to combine our existing 64-bit key and 32-bit value? Vectorize a key part of one of the existing algorithms with SIMD?

If you have more ideas, please feel free to fork and PR from here.

Friday, 19 January 2018

A Sort Of Problem

(part 2 here)

(part 3 here)

I love interesting questions, especially when they directly relate to things I need to do. A great question came up on Stack Overflow today about how to efficiently sort large data. I gave an answer, but there’s so much more we can say on the topic, so I thought I’d turn it into a blog entry, exploring pragmatic ways to improve sort performance when dealing with non-trivial amounts of data. In particular, this is remarkably similar to time I’ve spent trying to make our “tag engine” faster.

The problem

So, the premise is this:

  • we have a complex entity, SomeType, with multiple properties
  • we have a large number of these entities - lets say 16M+
  • we want to sort this data using a sort that considers multiple properties - “this, then that”
  • and we want it to be fast

Note that sorting data when it is already sorted or nearly-sorted is usually cheap under most common algorithms, so I’m going to be focusing only on the initial painful sort when the data is not at all sorted.

Because we’re going to have so many of them, and they are going to be basic storage types only, this is a good scenario to consider a struct, and I was delighted to see that the OP in the question had already done this. We’ll play with a few of the properties (for sorting, etc), but to simulate the usual context, there will be extra stuff that isn’t relevant to the question, so we’ll pad the size of the struct with some dummy fields up to 64 bytes. So, something like:

readonly partial struct SomeType
{
    public int Id { get; }
    public DateTime ReleaseDate { get; }
    public double Price { get; }

    public SomeType(int id, DateTime releaseDate, double price)
    {
        Id = id;
        ReleaseDate = releaseDate;
        Price = price;
        _some = _other = _stuff = _not = _shown = 0;
    }

#pragma warning disable CS0414 // suppress "assigned, never used"
    private readonly long _some, _other, _stuff, _not, _shown;
#pragma warning restore CS0414
}

Note: yes, I know that double is a terrible choice for something that describes money.

Note: readonly struct is a new C# feature, described in more detail here - this is a good fit, and might help us avoid some large “load” costs.

For something interesting to do, we’ll try sorting things “most recent, then cheapest”.

Inventing some data

The first thing we need is some data; a very basic seeded random data script might be something like:

var rand = new Random(data.Length);
for (int i = 0; i < data.Length; i++)
{
    int id = rand.Next();
    var releaseDate = Epoch
        .AddYears(rand.Next(50))
        .AddDays(rand.Next(365))
        .AddSeconds(rand.Next(24 * 60 * 60));
    var price = rand.NextDouble() * 50000;
    data[i] = new SomeType(
        id, releaseDate, price);
}

Attempt 1: LINQ

LINQ is great; I love LINQ, and it makes some code very expressive. So let’s try the most obvious thing first:

sorted = (from item in data
        orderby item.ReleaseDate descending,
                item.Price
        select item).ToArray();

This LINQ expression performs the sort we want, creating a copy of the data - but on my machine this takes about 17 seconds to run - not ideal. So that’s the target to beat. The key thing about LINQ is that it is designed for your efficiency, i.e. the size and complexity of the code that you need to write, on the assumption that you’ll only use it on reasonable data. We do not have reasonable data here.

Attempt 2: IComparable<T>

Since we’re talking about arrays, another obvious thing to do is Array.Sort; for the simplest version of that, we need to implement IComparable<T> on our type:

partial struct SomeType : IComparable<SomeType>
{
    int IComparable<SomeType>.CompareTo(SomeType other)
    {
        var delta = other.ReleaseDate
            .CompareTo(this.ReleaseDate);
        if (delta == 0) // second property
            delta = this.Price.CompareTo(other.Price);
        return delta;
    }
}

And then we can use:

Array.Sort<SomeType>(data);

to perform an in-place sort. The generic <SomeType> here is actually redundant, but I’ve included it to make it obious that I am using the generic API.

This takes just over 6 seconds for me, so: a huge improvement! Note that for the purpose of our tests, we will re-populate the data after this, to oensure that all tests start with randomized data. For brevity, assume we’re doing this whenever necessary - I won’t keep calling it out.

Attempt 3: IComparer<T>

There’s a second common sort API: IComparer<T> custom comparers. This has the advantages that a: you don’t need to edit the target type, and b: you can support multiple different sorts against the same type via different custom comparers. For this, we add or own comparer:

sealed class SomeTypeComparer : IComparer<SomeType>
{
    private SomeTypeComparer() { }
    public static SomeTypeComparer Default { get; } = new SomeTypeComparer();
    int IComparer<SomeType>.Compare(SomeType x, SomeType y)
    {
        var delta = y.ReleaseDate
                .CompareTo(x.ReleaseDate);
        if (delta == 0) // second property
            delta = x.Price.CompareTo(y.Price);
        return delta;
    }
}

using it via:

Array.Sort<SomeType>(data, SomeTypeComparer.Default);

This takes around 8 seconds; we’re not going in the right direction here!

Attempt 4: Comparison<T>

Why stop at two ways to do the same thing, when we can have 3? For completeness, there’s yet another primary Array.Sort<T> variant that takes a delegate, for example:

Array.Sort<SomeType>(data, (x, y) =>
{
    var delta = y.ReleaseDate
            .CompareTo(x.ReleaseDate);
    if (delta == 0) // second property
        delta = x.Price.CompareTo(y.Price);
    return delta;
});

This keeps the “do a sort” and “like this” code all in the same place, which is nice; but: it performs virtually identically to the previous attempt, at around 8 seconds.

First intermission: what is going wrong?

We’re doing a lot of work here, that much is true; but there are things that are exacerbating the situation:

  • we have a large struct, which means we need to copy that data on the stack whenever we do anything
  • because it needs to compare values to their neighbours, there are a lot of virtual calls going on

These costs are fine for reasonable data, but for larger volumes the costs start building up.

We need an alternative.

It happens that Array.Sort also has overloads that accept two arrays - the keys and the values. What this does is: perform the sort logic on the first array, but whenever it swaps data around: it swaps the corresponding items in both arrays. This has the effect of sorting the second array by the values of the first. In visual terms, it is like selecting two columns of a spreadsheet and clicking sort.

If we only had a single value, this would be great! For example…

Attempt 5: dual arrays, single property

Let’s pretend for a moment that we only want to sort by the date, in ascending order. Which isn’t what we want, but: humour me.

What we could do is keep a CreationDate[] hanging around (reuse it between operations), and when we want to sort: populate the data we want to sort by into this array:

for (int i = 0; i < data.Length; i++)
    releaseDates[i] = data[i].ReleaseDate;

and then to sort:

Array.Sort(releaseDates, data);

For me, this takes about 150ms to prepare the keys, and 4.5s to execute the sort. Promising, although hard to tell if that is useful until we can handle the complex dual sort.

Second intermission: how can we compose the sort?

We have two properties that we want to sort by, and a Sort method that only takes a single value. We could start looking at tuple types, but that is just making things even more complex. What we want is a way to simplify the complex sort into a single value. What if we could use something simple like an integer to represent our combined sort? Well, we can!

Many basic values can - either directly, or via a hack - be treated as a bitwise-sortable value. By bitwise sortable, I essentially mean: “sorts like the samme bits expressed as an unsigned integer would sort”. Consider a 32-bit integer: obviously an unsigned integer sorts just like an unsigned integer, but a signed integer does not - negative numbers are problematic. What would be great is if int.MinValue was treated as 0, with int.MinValue + 1 treated as 1, etc; we can do that by subtraction:

protected static ulong Sortable(int value)
{
    // re-base eveything upwards, so anything
    // that was the min-value is now 0, etc
    var val = unchecked((uint)(value - int.MinValue));
    return val;
}

The result of this is that Sortable will return 32-bits worth of data (the same as the input), but with 000...000 as the minimum expected value, and 111...111 as the maximum expected value.

Now; notice that we’re only talking about 32 bits here, but we’ve returne a ulong; that’s because we’re going to pack 2 values into a single token.

For our actual data, we hae two pieces:

  • a DateTime
  • a Double

Now, that’s 16 bytes worth of data, and we only have 8 to play with. This sounds like a dilemma, but usually: we can cheat by fudging the precision.

For many common applications - and especially things like a ReleaseDate, most of the bits in a DateTime are not useful. We probably don’t need to handle every tick in a 10,000-year range. We can almost certainly use per-second precision - perhaps even per-day for a release date. Unix time in seconds using 32 bits has us covered until January 19, 2038. If we need less precision than seconds, we can extend that hugely; and we can often use a different epoch that fits our minimum expected data. Heck, starting at the year 2000 instead of 1970 buys 30 years even in per-second precision. Time in an epoch is bitwise-sortable.

Likewise, an obvious way of approximating a double in 32 bits would be to cast it as a float. This doesn’t have the same range or precision, but will usually be just fine for sorting purposes. Floating point data in .NET has a complex internal structure, but fortunately making it bitwise-sortable can be achieved through some simple well-known bit hacks:

protected static unsafe ulong Sortable (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    if ((raw & MSB) != 0) // IEEE first bit is the sign bit
    {
        // is negative; shoult interpret as -(the value without the MSB) - not the same as just
        // dropping the bit, since integer math is twos-complement
        raw = -(raw & ~MSB);
    }
    return Sortable(raw);
}

Putting these together, we havev all the tools we need to create a single composite value that is totally meanningless for all ordinary purposes, but which represents our sort perfectly.

Attempt 6: dual arrays, dual property

We can create a method that composes our two properties:

static ulong Sortable(in SomeType item)
{
    return (~(ulong)item.ReleaseDate.ToMillenialTime()) << 32
                | Sortable((float)item.Price);
}

This might look complex, but what it does is:

  • compute the time in seconds since 2000 as a 32-bit unsigned integer
  • extends it to 64-bits
  • inverts it; this has the same effect as “descending”, since it reverses the order
  • left shifts it by 32 bits, to place those 32 bits in the upper half of our 64 bits (padding on the right with zero)
  • compute the bitwise-sortable representation of the price as a 32-bit unsigned integer
  • throws that value into the lower 32 bits

We can prepare our data into a ulong[] that we keep around between sort operations:

for (int i = 0; i < data.Length; i++)
    sortKeys[i] = Sortable(in data[i]);

and finally sort:

Array.Sort(sortKeys, data);

The prepare operation is more complex now - and has gone up to 300ms, but the sort is faster at just over 4 seconds. We’re moving in the right direction. Note that the prepare operation is embarrassingly parallelizable, so we can trivially divide that over a number of cores (say: 16 blocks of 1M records per block) - and can often be further reduced by storing the data in the struct in similar terms to the sortable version (so: the same representation of time, and the same floating-point scale) - thus I’m not going to worry about the prepare cost here.

But we’re still paying a lot of overhead from having to move around those big structs. We can avoid that by… just not doing that!

Attempt 7: indexed

Rather than sorting our SomeType[] array, we could instead leave that data alone, forever. Never move the items around (although it is usually fine to replace them with updates). This has multiple advantages, but the one we’re keen on is the reduction of cost copying the data.

So; we can declare an int[] index that is our index - it just tells us the offsets to look in the actual data. We can sort that index as though it were the actual data, and just make sure we go through the index. We need to initialize the index as well as the composite sortable value (although we can re-use the positions if we are re-sorting the data, as usually the data doesn’t move much between cycles - we’ll get another huge boost on re-sorts when the data hasn’t drifted much; we do not need to reset the index when re-sorting the same data):

for (int i = 0; i < data.Length; i++)
{
    index[i] = i;
    sortKeys[i] = Sortable(in data[i]);
}

and sort:

Array.Sort(sortKeys, index);

The only complication is that now, to access the sorted data - instead of looking at data[i] we need to look at data[index[i]], i.e. find the i’th item in the index, and use that value as the offset in the actual data.

This takes the time down to 3 seconds - we’re getting there.

Attempt 8: indexed, direct compare, no range checks

The introspective sort that Array.Sort does is great, but it is still going to be talking via the general CompareTo API on our key type (ulong), and using the array indexers extensively. The JIT in .NET is good, but we can help it out a little bit more by … “borrowing” (ahem) the IntroSort code, and:

  • replacing the CompareTo usage on the keys with direct integer operations
  • replacing the array access with unsafe code that uses ulong* (for the keys) and int* (for the index)

(as an aside, it’ll be interesting to see how this behaves with Span<T>, but that’s not complete yet).

I’m not going to show the implementation here, but it is available in the source project. Our code for consuming this doesn’t change much, except to call our butchered sort API:

Helpers.Sort(sortKeys, index);

For me this now takes just over 2.5 seconds.

Conclusion

So there we go; I’ve explored some common approaches to improrving sort performance; we’ve looked at LINQ; we’ve looked at basic sorts using comparables, comparers and comparisons (which are all different, obviously); we’ve looked at keyed dual-array sorts; we’ve looked at indexed sorts (where the source data remains unsorted); and finally we’ve hacked the introspective sort to squeeze a tiny bit more from it.

We’ve seen performance range from 17 seconds for LINQ, 8 seconds for the 3 basic sort APIs, then 4 seconds for our dual array sorts, 3 seconds for the indexed sort, and finally 2.5 seconds with our hacked and debased version.

Not bad for a night’s work!

All the code discussed here is available on github.

(part 2 here)