Tuesday, 30 January 2018

Sorting myself out, extreme edition

...where I go silly with optimization

Yes, I’m still talking about sorting... ish. I love going deep on problems - it isn’t enough to just have a solution - I like knowing that I have the best solution that I am capable of. This isn’t always possible in a business setting - deadlines etc - but this is my own time.

In part 1, we introduced the scenario - and discussed how to build a composite unsigned value that we could use to sort multiple properties as a single key.

In part 2, we looked a little at radix sort, and found that it was a very compelling candidate.

In this episode, we’re going to look at some ways to significantly improve what we’ve done so far. In particular, we’ll look at:

  • using knowledge of how signed data works to avoid having to transform between them
  • performing operations in blocks rather than per value to reduce calls
  • using Span<T> as a replacement for unsafe code and unmanaged pointers, allowing you to get very high performance even in 100% managed/safe code
  • investigating branch removal as a performance optimization of critical loops
  • vectorizing critical loops to do the same work with significantly fewer CPU operations

Hopefully, as a follow up after this one, I’ll look at practical guidance on parallelizing this same work to spread the load over available cores.

Key point: the main purpose of this post is not to discuss how to implement a radix sort - in fact, we don’t even do that. Instead, we use one small part of radix sort as an example problem with which to discuss much broader concepts of performance optimization in C# / .NET.

Obviously I can’t cover the entirety of radix sort here, so I’m going to focus on one simple part: composing the radix for sorting. To recall, a naive implementation of radix sort requires unsigned keys, so that the data is naturally sortable in its binary representation. Signed integers and floating point numbers don’t follow this layout, so in part 1 we introduced some basic tools to convert between them:

uint Sortable(int value)
{
    // re-base everything upwards, so anything
    // that was the min-value is now 0, etc
    var val = unchecked((uint)(value - int.MinValue));
    return val;
}
unsafe uint Sortable (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    if ((raw & MSB) != 0) // IEEE first bit is the sign bit
    {
        // is negative; should interpret as -(the value without the MSB) - not the same as just
        // dropping the bit, since integer math is twos-complement
        raw = -(raw & ~MSB);
    }
    return Sortable(raw);
}

These two simple transformations - applied to our target values - will form the central theme of this entry.

To measure performance, I’ll be using the inimitable BenchmarkDotNet, looking at multiple iterations of transforming 2 million random float values taken from a seeded Random(), with varying signs etc. The method above will be our baseline, and at each step we’ll add a new row to our table at the bottom:

Method           | Mean        | Scaled
SortablePerValue | 10,483.8 us | 1.00

This gives us a good starting point.

A negative sign of the times

What’s faster than performing a fast operation? Not performing a fast operation. The way radix sort works is by looking at the sort values r bits at a time (commonly 4, 8, 10, but any number is valid) and for that block of bits: counting how many candidates are in each of the 1 << r possible buckets. So if r is 3, we have 8 possible buckets. From that it computes target offsets for each group: if there are 27 values in bucket 0, 12 in bucket 1, 3 in bucket 2, etc - then when sorted bucket 0 will start at offset 0, bucket 1 at offset 27, bucket 2 at offset 39, bucket 3 at offset 42, and so on - just by accumulating the counts. But this breaks if we have signed numbers.

Why?

First, let’s remind ourselves of the various ways that signed and unsigned data can be represented in binary, using a 4 bit number system and integer representations:

Binary | Unsigned | 2s-complement | 1s-complement | Sign bit
0000   | 0        | 0             | +0            | +0
0001   | 1        | 1             | 1             | 1
0010   | 2        | 2             | 2             | 2
0011   | 3        | 3             | 3             | 3
0100   | 4        | 4             | 4             | 4
0101   | 5        | 5             | 5             | 5
0110   | 6        | 6             | 6             | 6
0111   | 7        | 7             | 7             | 7
1000   | 8        | -8            | -7            | -0
1001   | 9        | -7            | -6            | -1
1010   | 10       | -6            | -5            | -2
1011   | 11       | -5            | -4            | -3
1100   | 12       | -4            | -3            | -4
1101   | 13       | -3            | -2            | -5
1110   | 14       | -2            | -1            | -6
1111   | 15       | -1            | -0            | -7

We’re usually most familiar with unsigned and 2s-complement representations, because that is what most modern processors use to represent integers. 1s-complement is where -x ≡ ~x - i.e. to negate something we simply invert all the bits. This works fine but has two zeros, which is one more than we usually need - hence we usually use 2s-complement which simply adds an off-by-one step; this makes zero unambiguous (very useful for false as we’ll see later) and (perhaps less important) gives us an extra negative value to play with.
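
To see that "off-by-one step" in running code (purely illustrative):

int x = 5;
Console.WriteLine(~x);     // -6: flipping the bits alone is 1s-complement negation
Console.WriteLine(~x + 1); // -5: adding one gives the familiar 2s-complement negation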

The final option is to use a sign bit; to negate a number we flip the most significant bit, so -x ≡ x ^ 0b1000. IEEE754 floating point numbers (float and double) are implemented using a sign bit, which is why floating point numbers have +0 and -0. Due to clever construction, the rest of the value can be treated as naturally/bitwise sortable - even without needing to understand about the “mantissa”, “exponent” and “bias”. This means that to convert a negative float (or any other sign-bit number) to a 1s-complement representation, we simply flip all the bits except the most significant bit. Or we flip all the bits and put the most significant bit back again, since we know it should be a 1.


So: armed with this knowledge, we can see that signed data in 1s-complement or 2s-complement is almost “naturally sortable” in binary, but simply: the negative values are sorted in increasing numerical value, but come after the positive values (we can happily assert that -0 < +0). This means that we can educate radix sort about 1s-complement and 2s-complement signed data simply by being clever when processing the final left-most chunk: based on r and the bit-width, calculate which bit is the most-significant bit (which indicates sign), and simply process the negative buckets first (still in the same order) when calculating the offsets; then calculate the offsets of the non-negative buckets. If we were using the 4-bit system above and r=4, we would have 16 buckets, and would calculate offsets in the order (of unsigned buckets) 8-15 then 0-7.
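
As a rough sketch of that final-chunk handling (not the full implementation; it assumes 2s-complement keys where the top chunk contains the sign bit, and that the per-bucket counts have already been taken):

void ComputeSignedOffsets(int[] counts, int[] offsets)
{
    int bucketCount = counts.Length,     // 1 << r, for the final chunk
        firstNegative = bucketCount / 2, // buckets whose top bit (the sign bit) is set
        offset = 0;

    // negative buckets first, still in increasing order
    for (int bucket = firstNegative; bucket < bucketCount; bucket++)
    {
        offsets[bucket] = offset;
        offset += counts[bucket];
    }
    // then the non-negative buckets
    for (int bucket = 0; bucket < firstNegative; bucket++)
    {
        offsets[bucket] = offset;
        offset += counts[bucket];
    }
}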

By doing this, we can completely remove any need to do any pre-processing when dealing with values like int. We could perhaps wish that the IEEE754 committee had preferred 1s-complement so we could skip all of this for float too, but a: I think it is fair to assume that there are good technical reasons for the choice (presumably relating to fast negation, and fast access to the mantissa/exponent), and b: it is moot: IEEE754 is implemented in CPU architectures and is here to stay.

So we’re still left with an issue for float: we can’t use the same trick for values with a sign bit, because the sign bit changes the order throughout the data - making grouping impossible. We can make our lives easier though: since the algorithm can now cope with 1s-complement and 2s-complement data, we can switch to 1s-complement rather than to fully unsigned, which as discussed above: is pretty easy:

(Aside: actually, there is a related trick we can do to avoid having to pre-process floating point data, but: it would make this entire blog redundant! So for the purposes of a problem to investigate, we’re going to assume that we need to do this transformation.)

unsafe int ToRadix (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    // if sign bit set: flip all bits except the MSB
    return (raw & MSB) == 0 ? raw : ~raw | MSB;
}

A nice side-effect of this is that it is self-reversing: we can apply the exact same bit operations to convert from 1s-complement back to a sign bit.
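
A quick, purely illustrative way to convince yourself of that, restating the same bit operations over the raw bits:

uint Transform(uint raw)
{
    const uint MSB = 1U << 31;
    return (raw & MSB) == 0 ? raw : ~raw | MSB;
}
// for any bit pattern x: Transform(Transform(x)) == x
// (for the negative case: ~(~raw | MSB) | MSB == (raw & ~MSB) | MSB == raw)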

Method           | Mean        | Scaled
SortablePerValue | 10,483.8 us | 1.00
ToRadixPerValue  | 10,120.5 us | 0.97

We’ve made a slight but measurable improvement - nothing drastic, but the code is nicer.

Blocking ourselves out

We have a large chunk of data, and we want to perform a transformation on each value. So far, we’ve looked at a per-value transformation function (Sortable), but that means the overhead of a call per-value (which may or may not get inlined, depending on the complexity, and how we resolve the method - i.e. does it involve a virtual call to a type that isn’t reliably known). Additionally, it makes it very hard for us to apply more advanced optimizations! Blocks good.

So; let’s say we have our existing loop:

float[] values = ...
int[] radix = ...
for(int i = 0 ; i < values.Length; i++)
{
    radix[i] = someHelper.Sortable(values[i]);
}

and we want to retain the ability to swap in per-type implementations of someHelper.Sortable; we can significantly reduce the call overhead by performing a block-based transformation. Consider:

float[] values = ...
int[] radix = ...
someHelper.ToRadix(values, radix);
...
unsafe void ToRadix(float[] source, int[] destination)
{
    const int MSB = 1 << 31;
    for(int i = 0 ; i < source.Length; i++)
    {
        var val = source[i];
        int raw = *(int*)(&val);
        // if sign bit set: flip all bits except the MSB
        destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
    }
}

How much of a speed improvement this makes depends a lot on whether the JIT managed to inline the IL from the original version. It is usually a good win by itself, but more importantly: it is a key stepping stone to further optimizations.

Method           | Mean        | Scaled
SortablePerValue | 10,483.8 us | 1.00
ToRadixPerValue  | 10,120.5 us | 0.97
ToRadixBlock     | 10,080.0 us | 0.96

Another small improvement; I was hoping for more, but I suspect that the JIT was already doing a good job of inlining the method we're calling, making it almost the same loop at runtime. This is not always the case, though - especially if you have multiple different transformations to apply through a single API.

Safely spanning the performance chasm

You’ll notice that in the code above I’ve made use of unsafe code. There are a few things that make unsafe appealing, but one of the things it does exceptionally well is allow us to reinterpret chunks of data as other types, which is what this line does:

int raw = *(int*)(&val);

Actually, there are some methods on BitConverter to do exactly this, but only the 64-bit (double/long) versions exist in the “.NET Framework” (“.NET Core” has both 32-bit and 64-bit) - and that only helps us with this single example, rather than the general case.
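
For reference, the 64-bit pair looks like this in use (SingleToInt32Bits / Int32BitsToSingle being the 32-bit counterparts that exist only in “.NET Core”):

double value = 123.45;
long bits = BitConverter.DoubleToInt64Bits(value);   // reinterpret the 64 bits as a long
double back = BitConverter.Int64BitsToDouble(bits);  // ...and back again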

For example, if we are talking in pointers, we can tell the compiler to treat a float* as though it were an int*. One way we might be tempted to rewrite our ToRadix method could be to move this coercion earlier:

unsafe void ToRadix(float[] source, int[] destination)
{
    const int MSB = 1 << 31;
    fixed(float* fPtr = source)
    {
        int* ptr = (int*)fPtr;
        for(int i = 0 ; i < source.Length; i++)
        {
            int raw = *ptr++;
            destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
        }
    }
}

Now we’re naturally reading values out as int, rather than performing any reinterpretation per value. This is useful, but it requires us to use unsafe code (always a great way to get hurt), and it doesn’t work with generics - you cannot use T* for some <T>, even with the where T : struct constraint.

I’ve spoken more than a few times about Span<T>; quite simply: it rocks. To recap, Span<T> (and its heap-friendly cousin, Memory<T>) is a general purpose, efficient, and versatile representation of contiguous memory - which includes things like arrays (float[]), but more exotic things too.

One of the most powerful (but simple) features of Span<T> is that it allows us to do type coercion in fully safe managed code. For example, instead of a float[], let’s say that we have a Span<float>. We can reinterpret that as int very simply:

Span<float> values = ...
var asIntegers = values.NonPortableCast<float, int>();

Note: this API is likely to change - by the time it hits general availability, it’ll probably be:

var asIntegers = values.Cast<int>();

What this does is:

  • look at the sizes of the original and target type
  • calculate how many of the target type fit into the original data
  • round down (so we never go out of range)
  • hand us back a span of the target type, of that (possibly reduced) length

Since float and int are the same size, we’ll find that asIntegers has the same length as values.
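
The round-down behaviour only matters when the sizes differ; a contrived example using the same API:

Span<byte> bytes = new byte[10];
var ints = bytes.NonPortableCast<byte, int>();
// ints.Length == 2: only 8 of the 10 bytes are visible as whole ints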

What is especially powerful here is that this trick works with generics. It does something that unsafe code will not do for us. Note that a lot of love has been shown to Span<T> in the runtime and JIT - essentially all of the same tricks that make T[] array performance largely indistinguishable from pointer T* performance.

This means we could simplify a lot of things by converting from our generic T even earlier (so we only do it once for the entire scope of our radix code), and having our radix converter just talk in terms of the raw bits (usually: uint).

// our original data
float[] values = ...
// recall that radix sort needs some extra space to work in
float[] workspace = ... 
// implicit <float> and implicit conversion to Span<float>
RadixSort32(values, workspace); 

...
void RadixSort32<T>(Span<T> typedSource, Span<T> typedWorkspace)
    where T : struct
{
    // note that the JIT can inline this *completely* and remove impossible
    //  code paths, because for struct T, the JIT is per-T
    if (Unsafe.SizeOf<T>() != Unsafe.SizeOf<uint>())
        throw new InvalidOperationException("expected a 32-bit element type");

    var source = typedSource.NonPortableCast<T, uint>();
    var workspace = typedWorkspace.NonPortableCast<T, uint>();

    // convert the radix if needed (into the workspace)
    var converter = GetConverter<T>();
    converter?.ToRadix(source, workspace);

    // ... more radix sort details not shown
}
...
// our float converter
public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;
    for(int i = 0 ; i < values.Length; i++)
    {
        uint raw = values[i];
        destination[i] = (raw & MSB) == 0 ? raw : ~raw | MSB;
    }
}

The code is getting simpler, while retaining performance and becoming more generic-friendly; and we haven’t needed to use a single unsafe. You’ll have to excuse the Unsafe.SizeOf<T>() - despite the name, this isn’t really an “unsafe” operation in the usual sense - this is simply a wrapper to the sizeof IL instruction that is perfectly well defined for all T that are usable in generics. It just isn’t directly available in safe C#.

Method           | Mean        | Scaled
SortablePerValue | 10,483.8 us | 1.00
ToRadixPerValue  | 10,120.5 us | 0.97
ToRadixBlock     | 10,080.0 us | 0.96
ToRadixSpan      |  7,976.3 us | 0.76

Now we’re starting to make decent improvements - Span<T> is really useful for large operations where type coercion is necessary.

Taking up tree surgery: prune those branches

Something that gnaws at my soul about where we’ve got to is that it includes a branch - an if test, essentially - in the inner part of the loop. Actually, there are two, and they’re both hidden. The first is in the for loop, but the one I’m talking about here is the one hidden in the ternary conditional operation, a ? b : c. CPUs are very clever about branching, with branch prediction and other fancy things - but it can still stall the instruction pipeline, especially if the prediction is wrong. If only there was a way to rewrite that operation to not need a branch. I’m sure you can see where this is going.

Branching: bad. Bit operations: good. A common trick we can use to remove branches is to obtain a bit-mask that is either all 0s (000…000) or all 1s (111…111) - so: 0 and -1 in 2s-complement terms. There are various ways we can do that (although it also depends on the actual value of true in your target system, which is a surprisingly complex question). Obviously one way to do that would be:

// -1 if negative, 0 otherwise
var mask = (raw & MSB) == 0 ? 0 : ~0;

but that just adds another branch. If we were using C, we could use the knowledge that a comparison returns an integer of either 0 or 1, and just negate that to get 0 or -1:

// -1 if negative, 0 otherwise
var mask = -((raw & MSB) != 0);

But no such trick is available in C#. What we can do, though, is use knowledge of arithmetic shift. Left-shift (<<) is simple; we shift our bits n places to the left, filling in with 0s on the right. So binary 11001101 << 3 becomes 01101000 (we lose the 110 from the left).

Right-shift is more subtle, as there are two of them: logical and arithmetic, which are essentially unsigned and signed. The logical shift (used with uint etc) moves our bits n places to the right, filling in with 0s on the left, so 11001101 >> 3 gives 00011001 (we lose the 101 from the right). The arithmetic shift behaves differently depending on whether the most significant bit (which tells us the sign of the value) is 0 or 1. If it is 0, it behaves exactly like the logical shift; however, if it is 1, it fills in with 1s on the left; so 11001101 >> 3 gives 11111001. Using this, we can use >> 31 (or >> 63 for 64-bit data) to create a mask that matches the sign of the original data:

// -1 if negative, 0 otherwise
var mask = (uint)(((int)raw) >> 31);

Don’t worry about the extra conversions: as long as we’re in an unchecked context (which we are by default), they simply don’t exist. All they do is tell the compiler which shift operation to emit. If you’re curious, you can see this here, but in IL terms, this is just:

(push the value onto the stack)
ldc.i4.s 31 // push 31 onto the stack
shr         // arithmetic right shift

In IL terms, there’s really no difference between signed and unsigned integers, other than whether the compiler emits the signed or unsigned opcodes for operations. In this case we’ve told it to emit shr - the signed/arithmetic opcode, instead of shr.un - the unsigned/logical opcode.

OK; so we’ve got a mask that is either all zeros or all ones. Now we need to use it to avoid a branch, but how? Consider:

var condition = // all zeros or all ones
var result = (condition & trueValue) | (~condition & falseValue);

If condition is all zeros, then the condition & trueValue gives us zero; the ~condition becomes all ones, and therefore ~condition & falseValue gives us falseValue. When we “or” (|) those together, we get falseValue.

Likewise, if condition is all ones, then condition & trueValue gives us trueValue; the ~condition becomes all zeros, and therefore ~condition & falseValue gives us zero. When we “or” (|) those together, we get trueValue.

So our branchless operation becomes:

public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;
    for(int i = 0 ; i < values.Length; i++)
    {
        uint raw = values[i];
        var ifNeg = (uint)(((int)raw) >> 31);
        destination[i] =
            (ifNeg & (~raw | MSB)) // true
            | (~ifNeg & raw);      // false
    }
}

This might look more complicated, but it is very CPU-friendly: it pipelines very well, and doesn’t involve any branches for it to worry about. Doing a few extra bit operations is nothing to a CPU - especially if they can be pipelined. Long instruction pipelines are actually a good thing to a CPU - compared to a branch or something that might involve a cache miss, at least.

Method           | Mean        | Scaled
SortablePerValue | 10,483.8 us | 1.00
ToRadixPerValue  | 10,120.5 us | 0.97
ToRadixBlock     | 10,080.0 us | 0.96
ToRadixSpan      |  7,976.3 us | 0.76
Branchless       |  2,507.0 us | 0.24

By removing the branches, we’re down to less than a quarter of the original run-time; that’s a huge win, even if the code is slightly more complex.

Why do one thing at a time?

OK, so we’ve got a nice branchless version, and the world looks great. We’ve made significant improvements. But we can still get much better. At the moment we’re processing each value one at a time, but as it happens, this is a perfect scenario for vectorization via SIMD (“Single instruction, multiple data”).

We have a pipelineable operation without any branches; if we can execute it for one value, we can probably execute it for multiple values at the same time - magic, right? Many modern CPUs include support for performing basic operations like the above on multiple values at a time, using super-wide registers. Right now we’re using 32-bit values, but most current CPUs will have support for AVX (mostly: 128-bit) or AVX2 (mostly: 256-bit) operations. If you’re on a very expensive server, you might have more (AVX512). But let’s assume AVX2: that means we can handle 8 32-bit values at a time. That means 1/8th of the main operations, and also 1/8th of the if branches hidden in the for loop.

Some languages have automatic vectorization during compilation; C# doesn’t have that, and neither does the JIT. But, we still have access to a range of vectorized operations (with support for the more exotic intrinsics being added soon). Until recently, one of the most awkward things about working with vectorization has been loading the values. This might sound silly, but it is surprisingly difficult to pull the values in efficiently when you don’t know how wide the vector registers are on the target CPU. Fortunately, our amazing new friend Span<T> jumps to the rescue here - making it almost embarrassingly easy!

First, let’s look at what the shell loop might look like, without actually doing the real work in the middle:

public void ToRadix(Span<uint> values, Span<uint> destination)
{
    const uint MSB = 1U << 31;

    int i = 0;
    if (Vector.IsHardwareAccelerated)
    {                               
        var vSource = values.NonPortableCast<uint, Vector<uint>>();
        var vDest = destination.NonPortableCast<uint, Vector<uint>>();
        
        for (int j = 0; j < vSource.Length; j++)
        {
            var vec = vSource[j];
            vDest[j] = vec; // TODO: the real transformation goes here
        }
        // change our root offset for the remainder of the values
        i = vSource.Length * Vector<uint>.Count;
    }
    for( ; i < values.Length; i++)
    {
        uint raw = values[i];
        var ifNeg = (uint)(((int)raw) >> 31);
        destination[i] =
            (ifNeg & (~raw | MSB)) // true
            | (~ifNeg & raw);      // false
    }
}

First, look at the bottom of the code; here we see that our regular branchless code still persists. This is for two reasons:

  • the target CPU might not be capable of vectorization
  • our input data might not be a nice multiple of the register-width, so we might need to process a final few items the old way

Note that we’ve changed the for loop so that it doesn’t reset the position of i - we don’t necessarily start at 0.

Now look at the if (Vector.IsHardwareAccelerated); this checks that suitable vectorization support is available. Note that the JIT can optimize this check away completely (and remove all of the inner code if it won’t be reached). If we do have support, we cast the span from a Span<uint> to a Span<Vector<uint>>. Note that Vector<T> is recognized by the JIT, and will be reshaped by the JIT to match the size of the available vectorization support on the running computer. That means that when using Vector<T> we don’t need to worry about whether the target computer has SSE vs AVX vs AVX2 etc - or what the available widths are; simply: “give me what you can”, and the JIT worries about the details.

We can now loop over the vectors available in the cast spans - loading an entire vector at a time simply using our familiar: var vec = vSource[j];. This is a huge difference to what loading vectors used to look like. We then do some operation (not shown) on vec, and assign the result again as an entire vector to vDest[j]. On my machine with AVX2 support, vec is a block of 8 32-bit values.
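
If you’re curious what the JIT has decided on for a particular machine, it is trivial to check (the outputs shown are just examples):

Console.WriteLine(Vector.IsHardwareAccelerated); // e.g. True
Console.WriteLine(Vector<uint>.Count);           // e.g. 8 with AVX2, 4 with only SSE2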

Next, we need to think about that // TODO - what are we actually going to do here? If you’ve already re-written your inner logic to be branchless, there’s actually a very good chance that it will be a like-for-like translation of your branchless code. In fact, it turns out that the ternary conditional scenario we’re looking at here is so common that there are vectorized operations precisely to do it; the “conditional select” vectorized CPU instruction can essentially be stated as:

// result = conditionalSelect(condition, left, right)
result = (condition & left) | (~condition & right);

Where condition is usually either all-zeros or all-ones (but it doesn’t have to be; if you want to pull different bits from each value, you can do that too).

This intrinsic is exposed directly on Vector, so our missing code becomes simply:

var vMSB = new Vector<uint>(MSB);
var vNOMSB = ~vMSB;
for (int j = 0; j < vSource.Length; j++)
{
    var vec = vSource[j];
    vDest[j] = Vector.ConditionalSelect(
        condition: Vector.GreaterThan(vec, vNOMSB),
        left: ~vec | vMSB, // when true
        right: vec // when false
    );
}

Note that I’ve pre-loaded a vector with the MSB value (which creates a vector with that value in every cell), and I’ve switched to using a > test instead of a bit test and shift. Partly, this is because the vectorized equality / inequality operations expect this kind of usage, and very kindly return -1 as their true value - using the result to directly feed “conditional select”.

Method           | Mean        | Scaled
SortablePerValue | 10,483.8 us | 1.00
ToRadixPerValue  | 10,120.5 us | 0.97
ToRadixBlock     | 10,080.0 us | 0.96
ToRadixSpan      |  7,976.3 us | 0.76
Branchless       |  2,507.0 us | 0.24
Vectorized       |    930.0 us | 0.09

As you can see, the effect of vectorization on this type of code is just amazing - with us now getting more than an order-of-magnitude improvement on the original data. That’s why I’m so excited about how easy (relatively speaking) Span<T> makes vectorization, and why I can’t wait for Span<T> to hit production.

A reasonable range of common operations are available on Vector and Vector<T>. If you need exotic operations like “gather”, you might need to wait until System.Runtime.Intrinsics lands. One key difference here is that Vector<T> exposes the common intersection of operations that might be available (with different widths) against different CPU instruction sets, whereas System.Runtime.Intrinsics aims to expose the underlying intrinsics - giving access to the full range of instructions, but forcing you to code specifically to a chosen instruction set (or possibly having two implementations - one for AVX and one for AVX2). This is simply because there isn’t a uniform API surface between generations and vendors - it isn’t simply that you get the same operations with different widths: you get different operations too. So you’d typically be checking Aes.IsSupported, Avx2.IsSupported, etc. Being realistic: Vector<T> is what we have today, and it works damned well.
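
For a flavour of that instruction-set-specific dispatch, here is a sketch against the System.Runtime.Intrinsics API (names are based on the proposal, so treat them as illustrative - this is not something you can use today):

using System.Runtime.Intrinsics.X86;

static string DescribeSupport()
{
    if (Avx2.IsSupported) return "AVX2: 256-bit integer vectors";
    if (Sse2.IsSupported) return "SSE2: 128-bit integer vectors";
    return "no x86 vector intrinsics available";
}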

Summary

We’ve looked at a range of advanced techniques for improving performance of critical loops of C# code, including (to repeat the list from the start):

  • using knowledge of how signed data works to avoid having to transform between them
  • performing operations in blocks rather than per value to reduce calls
  • using Span<T> as a replacement for unsafe code and unmanaged pointers, allowing you to get very high performance even in 100% managed/safe code
  • investigating branch removal as a performance optimization of critical loops
  • vectorizing critical loops to do the same work with significantly fewer CPU operations

And we’ve seen dramatic improvements to the performance. Hopefully, some or all of these techniques will be applicable to your own code. Either way, I hope it has been an interesting diversion.

Next time: practical parallelization

Addendum

For completeness: yes I also tried a val ^ ~MSB approach for both branchless and vectorized; it wasn't an improvement.

And for the real implementation (the "aside" mentioned above): what the code actually does for sign-bit data (IEEE754) is: sort just on the sign bit first, use the count data to find where the sign changes (without scanning over the data an extra time), and then sort the two halves separately ignoring the MSB, with the first chunk in descending order and the second chunk in ascending order. By doing this, we avoid the need for the transform - again, by using knowledge of the bit layout of the data.

Saturday, 20 January 2018

More Of A Sort Of Problem

(part 1 here)

(part 3 here)

So last time I talked about a range of ways of performing a sort, ranging from the simple through to hijacking .NET source code. Very quickly, some folks pointed out that I should have looked at “radix sort”, and they’re absolutely right - I should have. In fact, in the GPU version of this same code, we do exactly that via the CUB library.

The great thing is, radix sort is relatively simple, so:

Attempt 9: radix sort

The key point about radix sort is that it works by grouping the data by groups of bits in the data, using the same “bitwise sortable” idea that we used previously. We’ve already done the hard work to get a radix compliant representation of our sort data.

We can get a basic radix sort implementation from wikibooks, but this version has a few things we need to fix:

  • this is a single-array version; we want a dual array
  • we can use unsafe code to get rid of a lot of array range checks (and just: don’t be wrong!)
  • radix sort needs a workspace the same size as the input values as a scratch area; in the version shown, it allocates this internally, but in “real” code we’ll want to manage that externally and pass it in
  • we can make r (the number of bits to consider at a time) configurable
  • the shown code copies the workspace over the real data each cycle, but we can avoid this by simply swapping what we consider “real” and “workspace” each cycle, and copying once at the end if required

I’m not going to try and describe how or why radix sort works (wikipedia covers much of that here); the key thing that will be relevant in a moment is: for the group size r, it loops through all the data looking r bits at a time, to see how many values there are with each possible value for those r bits. So if r=4, there are 16 possible values over each 4 bits. Once it has that, it iterates a second time, writing the values into the corresponding places for the group it is in.
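
A minimal sketch of that counting pass (assuming the keys are already in their bitwise-sortable uint form, with shift selecting which group of r bits we are currently considering):

int[] CountBuckets(uint[] keys, int shift, int r)
{
    var counts = new int[1 << r];
    uint mask = (uint)((1 << r) - 1);
    for (int i = 0; i < keys.Length; i++)
        counts[(keys[i] >> shift) & mask]++;
    return counts;
}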

Once we have an implementation, our code basically consists of preparing the bit-sortable keys just like we did before, then simply invoking the algorithm, passing in our reusable workspace:

Helpers.RadixSort(sortKeys, index, keysWorkspace, valuesWorkspace, r);

(where keysWorkspace and valuesWorkspace are scratch areas of the required size, shared between sort cycles).

One consideration here is: what value of r (the number of bits to consider at a time) to choose. 4 is a reasonably safe default, but you can experiment with different values for your data to see what works well.

I get:

  • r=2: 3800ms
  • r=4: 1900ms
  • r=8: 1200ms
  • r=16: 2113ms

That r=8 timing is very tempting - it is a significant improvement on our previous best.

Attempt 10: radix sort with parallelization

Remember the “that will be relevant in a moment” from a few paragraphs ago? Recall: a key point of radix sort is that for each group of bits (of size r), it needs to iterate the entire key-set to count the frequencies of each possible group value. This count operation is something that is embarrassingly parallelizable, since counting chunks can be done independently over the entire data.

To do that, we can create a number of workers, divide the key-space into that many chunks, and tell each worker to perform the counts for that chunk. Fire these workers in parallel via Parallel.Invoke or similar, and reap the rewards. This creates a slight complexity that we need to combine the counts, and there will be thread races. A naive but thread-safe implementation would be to use Interlocked.Increment to do all the counts, but that would have severe collision penalties - it is far preferable to count each chunk in complete isolation, and only worry about the combination at the end. At that point, either lock or Interlocked would be fine, as it is going to happen very minimally. We should also be careful to hoist everything we want into a local, to avoid a lot of ldarg.0, ldfld overhead:

public void Invoke()
{
    var len = Length;
    var mask = Mask;
    var keys = Keys + Offset;
    var shift = Shift;
    int* count = stackalloc int[CountLength];

    // count into a local buffer
    for (int i = 0; i < len; i++)
        count[(*keys++ >> shift) & mask]++;

    // now update the origin data, synchronized
    lock (SyncLock)
    {
        for (int i = 0; i < CountLength; i++)
            Counts[i] += count[i];
    }
}

Here we’re also using stackalloc to do all our counting in the stack space, rather than allocating a count buffer per worker. This is fine, since we’ll typically be dealing with values like r=4 (CountLength=16). Even for larger reasonable r, the stack space is fine. We could very reasonably put an upper bound on r of 16 if we wanted to be sure.
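
For illustration only, here is a simplified, fully managed version of that chunked counting (the real worker above uses pointers and stackalloc; the names and chunking here are made up for the sketch, and it assumes the usual System / System.Threading.Tasks imports):

int[] CountParallel(uint[] keys, int shift, uint mask, int bucketCount, int workers)
{
    var totals = new int[bucketCount];
    int chunkSize = (keys.Length + workers - 1) / workers;
    var actions = new Action[workers];
    for (int w = 0; w < workers; w++)
    {
        int start = Math.Min(w * chunkSize, keys.Length),
            end = Math.Min(start + chunkSize, keys.Length);
        actions[w] = () =>
        {
            var local = new int[bucketCount]; // count in complete isolation
            for (int i = start; i < end; i++)
                local[(keys[i] >> shift) & mask]++;
            lock (totals) // combine once per worker; contention is minimal
            {
                for (int b = 0; b < bucketCount; b++)
                    totals[b] += local[b];
            }
        };
    }
    Parallel.Invoke(actions);
    return totals;
}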

Our calling code is virtually identical - all we’re doing is changing the internal implementation:

Helpers.RadixSortParallel(sortKeys, index, keysWorkspace, valuesWorkspace, r);

So what does this do for performance? Note: I’m using Environment.ProcessorCount * 2 workers, but we could play with other values.

I get

  • r=2: 3600ms
  • r=4: 1800ms
  • r=8: 1200ms
  • r=16: 2000ms

So; we don’t get a vast improvement really - our key benefit comes from simply choosing a suitable r for our data, like r=8.

Throws down gauntlet

So; so far we’ve gone from 17s (LINQ) to 1.2s (radix sort, single-threaded or parallel). What more can we do? Can we parallelize the second half of radix sort? Can we try a completely different sort? Can we combine our index and keys so we are performing a single array sort? Can we make use of some obscure CPU instructions to perform 128-bit (or wider) operations to combine our existing 64-bit key and 32-bit value? Vectorize a key part of one of the existing algorithms with SIMD?

If you have more ideas, please feel free to fork and PR from here.

Friday, 19 January 2018

A Sort Of Problem

(part 2 here)

(part 3 here)

I love interesting questions, especially when they directly relate to things I need to do. A great question came up on Stack Overflow today about how to efficiently sort large data. I gave an answer, but there’s so much more we can say on the topic, so I thought I’d turn it into a blog entry, exploring pragmatic ways to improve sort performance when dealing with non-trivial amounts of data. In particular, this is remarkably similar to time I’ve spent trying to make our “tag engine” faster.

The problem

So, the premise is this:

  • we have a complex entity, SomeType, with multiple properties
  • we have a large number of these entities - let’s say 16M+
  • we want to sort this data using a sort that considers multiple properties - “this, then that”
  • and we want it to be fast

Note that sorting data when it is already sorted or nearly-sorted is usually cheap under most common algorithms, so I’m going to be focusing only on the initial painful sort when the data is not at all sorted.

Because we’re going to have so many of them, and they are going to be basic storage types only, this is a good scenario to consider a struct, and I was delighted to see that the OP in the question had already done this. We’ll play with a few of the properties (for sorting, etc), but to simulate the usual context, there will be extra stuff that isn’t relevant to the question, so we’ll pad the size of the struct with some dummy fields up to 64 bytes. So, something like:

readonly partial struct SomeType
{
    public int Id { get; }
    public DateTime ReleaseDate { get; }
    public double Price { get; }

    public SomeType(int id, DateTime releaseDate, double price)
    {
        Id = id;
        ReleaseDate = releaseDate;
        Price = price;
        _some = _other = _stuff = _not = _shown = 0;
    }

#pragma warning disable CS0414 // suppress "assigned, never used"
    private readonly long _some, _other, _stuff, _not, _shown;
#pragma warning restore CS0414
}

Note: yes, I know that double is a terrible choice for something that describes money.

Note: readonly struct is a new C# feature, described in more detail here - this is a good fit, and might help us avoid some large “load” costs.

For something interesting to do, we’ll try sorting things “most recent, then cheapest”.

Inventing some data

The first thing we need is some data; a very basic seeded random data script might be something like:

var rand = new Random(data.Length);
for (int i = 0; i < data.Length; i++)
{
    int id = rand.Next();
    var releaseDate = Epoch
        .AddYears(rand.Next(50))
        .AddDays(rand.Next(365))
        .AddSeconds(rand.Next(24 * 60 * 60));
    var price = rand.NextDouble() * 50000;
    data[i] = new SomeType(
        id, releaseDate, price);
}

Attempt 1: LINQ

LINQ is great; I love LINQ, and it makes some code very expressive. So let’s try the most obvious thing first:

sorted = (from item in data
        orderby item.ReleaseDate descending,
                item.Price
        select item).ToArray();

This LINQ expression performs the sort we want, creating a copy of the data - but on my machine this takes about 17 seconds to run - not ideal. So that’s the target to beat. The key thing about LINQ is that it is designed for your efficiency, i.e. the size and complexity of the code that you need to write, on the assumption that you’ll only use it on reasonable data. We do not have reasonable data here.

Attempt 2: IComparable<T>

Since we’re talking about arrays, another obvious thing to do is Array.Sort; for the simplest version of that, we need to implement IComparable<T> on our type:

partial struct SomeType : IComparable<SomeType>
{
    int IComparable<SomeType>.CompareTo(SomeType other)
    {
        var delta = other.ReleaseDate
            .CompareTo(this.ReleaseDate);
        if (delta == 0) // second property
            delta = this.Price.CompareTo(other.Price);
        return delta;
    }
}

And then we can use:

Array.Sort<SomeType>(data);

to perform an in-place sort. The generic <SomeType> here is actually redundant, but I’ve included it to make it obvious that I am using the generic API.

This takes just over 6 seconds for me, so: a huge improvement! Note that for the purpose of our tests, we will re-populate the data after this, to ensure that all tests start with randomized data. For brevity, assume we’re doing this whenever necessary - I won’t keep calling it out.

Attempt 3: IComparer<T>

There’s a second common sort API: IComparer<T> custom comparers. This has the advantages that a: you don’t need to edit the target type, and b: you can support multiple different sorts against the same type via different custom comparers. For this, we add our own comparer:

sealed class SomeTypeComparer : IComparer<SomeType>
{
    private SomeTypeComparer() { }
    public static SomeTypeComparer Default { get; } = new SomeTypeComparer();
    int IComparer<SomeType>.Compare(SomeType x, SomeType y)
    {
        var delta = y.ReleaseDate
                .CompareTo(x.ReleaseDate);
        if (delta == 0) // second property
            delta = x.Price.CompareTo(y.Price);
        return delta;
    }
}

using it via:

Array.Sort<SomeType>(data, SomeTypeComparer.Default);

This takes around 8 seconds; we’re not going in the right direction here!

Attempt 4: Comparison<T>

Why stop at two ways to do the same thing, when we can have 3? For completeness, there’s yet another primary Array.Sort<T> variant that takes a delegate, for example:

Array.Sort<SomeType>(data, (x, y) =>
{
    var delta = y.ReleaseDate
            .CompareTo(x.ReleaseDate);
    if (delta == 0) // second property
        delta = x.Price.CompareTo(y.Price);
    return delta;
});

This keeps the “do a sort” and “like this” code all in the same place, which is nice; but: it performs virtually identically to the previous attempt, at around 8 seconds.

First intermission: what is going wrong?

We’re doing a lot of work here, that much is true; but there are things that are exacerbating the situation:

  • we have a large struct, which means we need to copy that data on the stack whenever we do anything
  • because it needs to compare values to their neighbours, there are a lot of virtual calls going on

These costs are fine for reasonable data, but for larger volumes the costs start building up.

We need an alternative.

It happens that Array.Sort also has overloads that accept two arrays - the keys and the values. What this does is: perform the sort logic on the first array, but whenever it swaps data around: it swaps the corresponding items in both arrays. This has the effect of sorting the second array by the values of the first. In visual terms, it is like selecting two columns of a spreadsheet and clicking sort.

If we only had a single value, this would be great! For example…

Attempt 5: dual arrays, single property

Let’s pretend for a moment that we only want to sort by the date, in ascending order. Which isn’t what we want, but: humour me.

What we could do is keep a DateTime[] of the release dates hanging around (reusing it between operations), and when we want to sort: populate the data we want to sort by into this array:

for (int i = 0; i < data.Length; i++)
    releaseDates[i] = data[i].ReleaseDate;

and then to sort:

Array.Sort(releaseDates, data);

For me, this takes about 150ms to prepare the keys, and 4.5s to execute the sort. Promising, although hard to tell if that is useful until we can handle the complex dual sort.

Second intermission: how can we compose the sort?

We have two properties that we want to sort by, and a Sort method that only takes a single value. We could start looking at tuple types, but that is just making things even more complex. What we want is a way to simplify the complex sort into a single value. What if we could use something simple like an integer to represent our combined sort? Well, we can!

Many basic values can - either directly, or via a hack - be treated as a bitwise-sortable value. By bitwise sortable, I essentially mean: “sorts like the same bits expressed as an unsigned integer would sort”. Consider a 32-bit integer: obviously an unsigned integer sorts just like an unsigned integer, but a signed integer does not - negative numbers are problematic. What would be great is if int.MinValue was treated as 0, with int.MinValue + 1 treated as 1, etc; we can do that by subtraction:

protected static ulong Sortable(int value)
{
    // re-base everything upwards, so anything
    // that was the min-value is now 0, etc
    var val = unchecked((uint)(value - int.MinValue));
    return val;
}

The result of this is that Sortable will return 32-bits worth of data (the same as the input), but with 000...000 as the minimum expected value, and 111...111 as the maximum expected value.

Now; notice that we’re only talking about 32 bits here, but we’ve returned a ulong; that’s because we’re going to pack 2 values into a single token.

For our actual data, we have two pieces:

  • a DateTime
  • a Double

Now, that’s 16 bytes worth of data, and we only have 8 to play with. This sounds like a dilemma, but usually: we can cheat by fudging the precision.

For many common applications - and especially things like a ReleaseDate, most of the bits in a DateTime are not useful. We probably don’t need to handle every tick in a 10,000-year range. We can almost certainly use per-second precision - perhaps even per-day for a release date. Unix time in seconds using 32 bits has us covered until January 19, 2038. If we need less precision than seconds, we can extend that hugely; and we can often use a different epoch that fits our minimum expected data. Heck, starting at the year 2000 instead of 1970 buys 30 years even in per-second precision. Time in an epoch is bitwise-sortable.
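
As a sketch of that idea - and of the ToMillenialTime() helper used further down (this exact implementation is an assumption): seconds since a year-2000 epoch, as an unsigned 32-bit value:

static class TimeExtensions
{
    private static readonly DateTime Epoch2000 =
        new DateTime(2000, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // per-second precision; assumes dates on or after the epoch
    public static uint ToMillenialTime(this DateTime value)
        => (uint)(value.ToUniversalTime() - Epoch2000).TotalSeconds;
}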

Likewise, an obvious way of approximating a double in 32 bits would be to cast it as a float. This doesn’t have the same range or precision, but will usually be just fine for sorting purposes. Floating point data in .NET has a complex internal structure, but fortunately making it bitwise-sortable can be achieved through some simple well-known bit hacks:

protected static unsafe ulong Sortable (float value)
{
    const int MSB = 1 << 31;
    int raw = *(int*)(&value);
    if ((raw & MSB) != 0) // IEEE first bit is the sign bit
    {
        // is negative; should interpret as -(the value without the MSB) - not the same as just
        // dropping the bit, since integer math is twos-complement
        raw = -(raw & ~MSB);
    }
    return Sortable(raw);
}

Putting these together, we have all the tools we need to create a single composite value that is totally meaningless for all ordinary purposes, but which represents our sort perfectly.

Attempt 6: dual arrays, dual property

We can create a method that composes our two properties:

static ulong Sortable(in SomeType item)
{
    return (~(ulong)item.ReleaseDate.ToMillenialTime()) << 32
                | Sortable((float)item.Price);
}

This might look complex, but what it does is:

  • compute the time in seconds since 2000 as a 32-bit unsigned integer
  • extends it to 64-bits
  • inverts it; this has the same effect as “descending”, since it reverses the order
  • left shifts it by 32 bits, to place those 32 bits in the upper half of our 64 bits (padding on the right with zero)
  • compute the bitwise-sortable representation of the price as a 32-bit unsigned integer
  • throws that value into the lower 32 bits

We can prepare our data into a ulong[] that we keep around between sort operations:

for (int i = 0; i < data.Length; i++)
    sortKeys[i] = Sortable(in data[i]);

and finally sort:

Array.Sort(sortKeys, data);

The prepare operation is more complex now - and has gone up to 300ms, but the sort is faster at just over 4 seconds. We’re moving in the right direction. Note that the prepare operation is embarrassingly parallelizable, so we can trivially divide that over a number of cores (say: 16 blocks of 1M records per block) - and can often be further reduced by storing the data in the struct in similar terms to the sortable version (so: the same representation of time, and the same floating-point scale) - thus I’m not going to worry about the prepare cost here.
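
Purely to illustrate that point (the block count and scheduling are arbitrary here), the divide-and-prepare could look something like:

const int blocks = 16;
int blockSize = (data.Length + blocks - 1) / blocks;
Parallel.For(0, blocks, block =>
{
    int start = block * blockSize,
        end = Math.Min(start + blockSize, data.Length);
    for (int i = start; i < end; i++)
        sortKeys[i] = Sortable(in data[i]);
});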

But we’re still paying a lot of overhead from having to move around those big structs. We can avoid that by… just not doing that!

Attempt 7: indexed

Rather than sorting our SomeType[] array, we could instead leave that data alone, forever. Never move the items around (although it is usually fine to replace them with updates). This has multiple advantages, but the one we’re keen on is the reduction of cost copying the data.

So; we can declare an int[] index that is our index - it just tells us the offsets to look in the actual data. We can sort that index as though it were the actual data, and just make sure we go through the index. We need to initialize the index as well as the composite sortable value (although we can re-use the positions if we are re-sorting the data, as usually the data doesn’t move much between cycles - we’ll get another huge boost on re-sorts when the data hasn’t drifted much; we do not need to reset the index when re-sorting the same data):

for (int i = 0; i < data.Length; i++)
{
    index[i] = i;
    sortKeys[i] = Sortable(in data[i]);
}

and sort:

Array.Sort(sortKeys, index);

The only complication is that now, to access the sorted data - instead of looking at data[i] we need to look at data[index[i]], i.e. find the i’th item in the index, and use that value as the offset in the actual data.
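
So consuming the sorted data becomes something like:

for (int i = 0; i < data.Length; i++)
{
    ref readonly var item = ref data[index[i]]; // the i'th item in sorted order
    // ... use item ...
}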

This takes the time down to 3 seconds - we’re getting there.

Attempt 8: indexed, direct compare, no range checks

The introspective sort that Array.Sort does is great, but it is still going to be talking via the general CompareTo API on our key type (ulong), and using the array indexers extensively. The JIT in .NET is good, but we can help it out a little bit more by … “borrowing” (ahem) the IntroSort code, and:

  • replacing the CompareTo usage on the keys with direct integer operations
  • replacing the array access with unsafe code that uses ulong* (for the keys) and int* (for the index)

(as an aside, it’ll be interesting to see how this behaves with Span<T>, but that’s not complete yet).

I’m not going to show the implementation here, but it is available in the source project. Our code for consuming this doesn’t change much, except to call our butchered sort API:

Helpers.Sort(sortKeys, index);

For me this now takes just over 2.5 seconds.

Conclusion

So there we go; I’ve explored some common approaches to improving sort performance; we’ve looked at LINQ; we’ve looked at basic sorts using comparables, comparers and comparisons (which are all different, obviously); we’ve looked at keyed dual-array sorts; we’ve looked at indexed sorts (where the source data remains unsorted); and finally we’ve hacked the introspective sort to squeeze a tiny bit more from it.

We’ve seen performance range from 17 seconds for LINQ, 8 seconds for the 3 basic sort APIs, then 4 seconds for our dual array sorts, 3 seconds for the indexed sort, and finally 2.5 seconds with our hacked and debased version.

Not bad for a night’s work!

All the code discussed here is available on github.

(part 2 here)

Wednesday, 6 December 2017

Dapper, Prepared Statements, and Car Tyres

Why Doesn't Dapper Use Prepared Statements?

I had a very interesting email in my inbox this week from a Dapper user; I'm not going to duplicate the email here, but it can be boiled down to:

My external security consultant is telling me that Dapper is insecure because it doesn't use prepared statements, and is therefore susceptible to SQL injection. What are your thoughts on this?

with a Dapper-specific example of something comparable to:

List<Order> GetOpenOrders(int customerId) => _connection.Query<Order>(
        "select * from Orders where CustomerId=@customerId and Status=@Open",
        new { customerId, OrderStatus.Open }).AsList();

Now this is a fun topic for me, because in my head I'm reading it in the same way that I would read:

My car mechanic is telling me my car is dangerous because it doesn't use anti-smear formula screen-wash, and is therefore susceptible to skidding in icy weather. What are your thoughts on this?

Basically, these are two completely unrelated topics. You can have a perfectly good and constructive conversation about either in isolation. There are merits to both discussions. But when you smash them together, it might suggest that the person raising the issue (the "security consultant" in this case, not the person sending me the email) has misunderstood something fundamental.

My initial response - while in my opinion valid - probably needs to be expanded upon:

Hi! No problem at all. If your security consultant is telling you that a correctly parameterized SQL query is prone to SQL injection, then your security consultant is a fucking idiot with no clue what they're talking about, and you can quote me on that.

So; let's take this moment to discuss the two topics and try to put this beast to bed!

Part The First: What is SQL injection?

Most folks will be very familiar with this, so I'm not going to cover every nuance, but: SQL injection is the major error made by concatenating inputs into SQL strings. It could be typified by the bad example:

string customerName = customerNameTextBox.Value; // or a http request input; whatever
var badOptionOne = connection.Query<Customer>(
    "select * from Customers where Name='" + customerName + "'");
var badOptionTwo = connection.Query<Customer>(
    string.Format("select * from Customers where Name='{0}'", customerName));
var badOptionThree = connection.Query<Customer>(
    $"select * from Customers where Name='{customerName}'");

As an aside on badOptionThree, it really frustrates me that C# overloading prefers string to FormattableString (interpolated $"..." strings can be assigned to either, but only FormattableString retains the semantics). I would really have loved to be able to add a method to Dapper like:

[Obsolete("No! Bad developer! Bobby Tables will find you in the dark", error: true)]
public static IEnumerable<T> Query<T>(FormattableString query, /* other args not shown */)
    => throw new InvalidOperationException(...);

This category of coding error is perpetually on the OWASP "top 10" list, and is now infamously associated with xkcd's "Bobby Tables":

Did you really name your son Robert'); DROP TABLE Students;-- ?

The problem, as the cartoon shows us, is that this allows malicious input to do unexpected and dangerous things. In this case the hack was to use a quote to end a SQL literal (');... - in this case with the attacker guessing that the clause was inside parentheses), then issue a separate command (DROP TABLE ...), then discard anything at the end of the original query using a comment (-- ...). But the issue is not limited to quotes, and frankly any attempt to play "replace/escape the risky tokens" is an arms race where you need to win every time, but the attacker only needs to win once. Don't play that game.

It can also be a huge internationalization problem, familiar to every developer who has received bug reports about the search not working for some people of Irish or Scottish descent. This (SQL injection - not Irish or Scottish people) is such an exploitable problem that readily available tools exist that can trivially search a site for exploitable inputs and give free access to the database with a friendly UI. So... yeah, you really don't want to have SQL injection bugs. No argument there.

So how do we prevent SQL injection?

The solution to SQL injection is parameters. One complaint I have about the xkcd comic - and a range of other discussions on the topic - is the suggestion that you should "sanitize" your inputs to prevent SQL injection. Nope! Wrong. You "sanitize" your inputs to check that they are within what your logic allows - for example, if the only permitted options from a drop-down are 1, 2 and 3 - then you might want to check that they haven't entered 42. Sanitizing the inputs is not the right solution to SQL injection: parameters are. We already showed an example of parameters in my SQL example at the top, but to take our search example:

string customerName = customerNameTextBox.Value; // or a http request input; whatever
var customers = connection.Query<Customer>(
    "select * from Customers where Name=@customerName",
    new { customerName });

What this does is add a parameter called "customerName" with the chosen value, passing that alongside and separate to the command text, in a raw form that doesn't need it to be encoded to work inside a SQL string. At no point does the parameter value get written into the SQL as a literal. Well, except perhaps on some DB backends that don't support parameters at all, in which case frankly it is up to the DB provider to get the handling right (and: virtually all RDBMS have first-class support for parameters).

Note that parameters solve other problems too:

  • the formatting of things like dates and numbers: if you use injection you need to know the format that the RDBMS expects, which is usually not the format that the "current culture" is going to specify, making it awkward; but by using a parameter, the value doesn't need to be formatted as text at all - with things like numbers and dates usually being sent in a raw binary format - some-defined-endian-fixed-width for integers (including dates), or something like IEEE754 for floating point.
  • query-plan re-use: the RDBMS can cache our ...Name=@customerName query and re-use the same plan automatically and trivially (without saturating the cache with a different plan for every unique name searched), with different values of the parameter @customerName - this can provide a great performance boost (side note: this can be double-edged, so you should also probably learn about OPTIMIZE FOR ... UNKNOWN (or the equivalent on your chosen RDBMS) if you're serious about SQL performance - note this should only be added reactively based on actual performance investigations)

Dapper loves parameters

Parameterization is great; Dapper loves parameterization, and does everything it can to make it easy for you to parameterize your queries. So: whatever criticism you want to throw at Dapper: SQL injection isn't really a fair one. The only time Dapper will be complicit in SQL injection is when you feed it a query that already has an injection bug before Dapper ever sees it. We can't fix stupid.

For full disclosure: there is actually one case where Dapper allows literal injection. Consider our GetOpenOrders query from above. This can also be written:

List<Order> GetOpenOrders(int customerId) => _connection.Query<Order>(
        "select * from Orders where CustomerId=@customerId and Status={=Open}",
        new { customerId, OrderStatus.Open }).AsList();

Note that instead of @Open we're now using {=Open}. This is not SQL syntax - it is telling Dapper to do an injection of a literal value. This is intended for things that don't change per query, such as status codes - and can in some cases result in performance improvements. Dapper doesn't want to make it easy to blow your own feet off, so it STRICTLY only allows this for integers (including enum values, which are fundamentally integers), since integers are a: very common for this scenario, and b: follow predictable rules as literals.
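
To make that concrete, here's a hypothetical enum for the example above (the real definition isn't shown here, and the values are illustrative); with {=Open}, Dapper bakes the underlying integer value directly into the SQL text instead of sending a parameter:

public enum OrderStatus
{
    Open = 1,      // {=Open} would become the literal 1 in the command text
    Shipped = 2,
    Cancelled = 3
}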

Part The Second: What are prepared statements?

There's often a slight confusion here with "stored procedures", so we'll have to touch on that too...

It is pretty common to issue commands to a RDBMS, where the SQL for those commands is contained in the calling application. This isn't universal - some applications are written with all the SQL in "stored procedures" that are deployed separately to the server, so the only SQL in the application is the names to invoke. There are merits to both approaches, which might include discussions around:

  • isolation - the ability to deploy and manage the SQL separately to the application (which might be desirable in client applications in particular, where re-deploying all the client installations to fix a small SQL bug is hard or expensive)
  • performance - historically stored procedures tended to out-perform ad-hoc commands; in most modern RDBMS this simply isn't a real concern, with the query-plan-cache working virtually identically regardless of the mechanism
  • granular security - in a high security application you might not want users (even if the "user" is a central app-server) to have direct SELECT permission on the tables or views - instead preferring to wrap the allowed queries in stored procedures that the calling user can be granted EXEC permission; of course a counter-argument there is that a blind EXEC can hide what a stored procedure is doing (so it does something the caller didn't expect), but ultimately if someone has pwned your RDBMS server, you're already toast
  • flexibility - being able to construct SQL to match a specific scenario (for example: the exact combination of 17 search options) can be important to improving performance (compared, in our search example, to 17 and (@someArg is null or row.SomeCol=@someArg) clauses). Tools like LINQ and ORMs rely extensively on runtime query generation to match the queries and model known at runtime, so allowing them to execute ad-hoc parameterized commands is required; it should also be noted that most RDBMS can also execute ad-hoc parameterized commands from within SQL - via things like sp_executesql from inside a stored procedure (a sketch of parameterized runtime composition follows this list)
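
As a sketch of what that "flexibility" can look like while staying fully parameterized (the table, columns and filter variables here are purely illustrative) - only the clauses for the filters actually supplied get appended, and every user-supplied value still travels as a parameter:

var sql = new StringBuilder("select * from Customers where 1=1");
var args = new DynamicParameters();
if (name != null)
{
    sql.Append(" and Name = @name");
    args.Add("name", name);
}
if (region != null)
{
    sql.Append(" and Region = @region");
    args.Add("region", region);
}
var customers = connection.Query<Customer>(sql.ToString(), args).AsList();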

You'll notice that SQL injection is not part of that discussion on the merits of "ad-hoc commands" vs "stored procedures", because parameterization makes it a non-topic.

So: let's assume that we've had the conversation about stored procedures and we've decided to use ad-hoc statements.

What does it mean to "prepare" a statement?

"Preparing a statement" is a sometimes-optional / sometimes-mandatory (depending on the RDBMS) step required to issue ad-hoc SQL commands. Conceptually, it takes our "select * from Orders where CustomerId=@customerId and Status=@Open" query - along with the defined parameters - and says "I'm going to want to run this in a moment; kindly figure out what that means to you and get everything in place". In terms of ADO.NET, this means calling the DbCommand.Prepare() method. There are 3 possible outcomes of a Prepare() call (ignoring errors):

  • it does literally nothing - a no-op; this might commonly be the case if you've told it that you're running a stored procedure (it is already as prepared as it will ever be), or if your chosen RDBMS isn't interested in the concept of prepared statements
  • it runs an additional optional operation that it wouldn't have done otherwise - adding a round trip
  • it runs a required operation that it was otherwise going to do automatically when we executed the query

So on the surface, the best case is that we achieve no benefit (the first and third options). The worst case is that we've added a round trip. You might be thinking "so why does Prepare() exist, if it is only ever harmful?" - and the reason is: I've only talked about running the operation once.

The main scenario in which Prepare() helps us is when you're going to be issuing exactly the same command (including the parameter definition, but not values), on exactly the same connection, many many times, and especially when your RDBMS requires command preparation. In that scenario, preparing a statement can be a very important performance tweak.
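
In plain ADO.NET, that scenario looks roughly like this (a sketch - the table and the customerIds sequence are illustrative): prepare the command once, then execute it repeatedly on the same connection with different parameter values:

using (var cmd = connection.CreateCommand())
{
    cmd.CommandText = "select count(*) from Orders where CustomerId=@customerId";

    var p = cmd.CreateParameter();
    p.ParameterName = "@customerId";
    p.DbType = DbType.Int32;
    cmd.Parameters.Add(p);

    cmd.Prepare(); // the potentially expensive bit, paid once

    foreach (var customerId in customerIds)
    {
        p.Value = customerId;
        var orderCount = Convert.ToInt32(cmd.ExecuteScalar());
        // ... use the count
    }
}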

You'll notice - similarly to stored procedures - that SQL injection is not part of that discussion on the merits of "prepared statements".

It is entirely true to say that Dapper does not currently call Prepare().

Why doesn't Dapper Prepare() statements?

There are various reasons for this, but the most important one is: on most providers, a prepared statement is scoped to the connection and is stored as part of the DbCommand. To actually provide a useful prepared statement story, Dapper would need to store and re-use every DbCommand for every DbConnection. Dapper really, really doesn't want to store your connections. It is designed with high concurrency in mind, and typically works in scenarios where the DbConnection is short-lived - perhaps scoped to the context of a single web-request. Note that connection pooling doesn't mean that the underlying connection is short-lived, but Dapper only gets to see the managed DbConnection, so anything else is opaque to it.

Without tracking every DbConnection / DbCommand and without a new abstraction, the best Dapper could do would be to call .Prepare() on every DbCommand immediately before executing it - but this is exactly the situation we discussed previously where the only two options are "has no effect" and "makes things worse".

Actually, there is one scenario using the current API in which Dapper could usefully consider doing this:

connection.Execute(someSql, someListOfObjects);

In this case, Dapper unrolls someListOfObjects, executing someSql with the parameters from each object in turn - on the same connection. I will acknowledge that a case could be made for Dapper to call .Prepare() in anticipation of the loop here, although it would require some changes to implement.
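
For example (the table and the values are purely illustrative) - the same INSERT is executed once per element of the array, re-using the same connection each time, which is exactly the prepare-friendly shape described earlier:

var rows = new[]
{
    new { customerId = 1, status = 1 },
    new { customerId = 2, status = 3 },
};
connection.Execute(
    "insert into Orders (CustomerId, Status) values (@customerId, @status)",
    rows);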

But fundamentally, the main objection that Dapper has to prepared statements is that typically, the connections that Dapper works with are transient and short-lived.

Could Dapper usefully offer a Prepare() API for systems with long-lived connections?

Hypothetically, yes: there is something that Dapper could do here, specifically targeted at the scenario:

I have a long-lived connection and an RDBMS that needs statements to be prepared, and I want the best possible performance when issuing repeated ad-hoc parameterized commands.

We could conceptualize an API that pins a command to a single connection:

var getOrders = connection.Prepare<Order>(
        "select * from Orders where CustomerId=@customerId",
        new { customerId = 123 }); // dummy args, for type inference
// ...
var orders = getOrders.Query(new { customerId }).AsList();

Note that in this imaginary API the connection is trapped and pinned inside the object that we stored in getOrders. There are some things that would need to be considered - for example, how this would work with literal injection and Dapper's fancy "in" support (shown below). A trivial answer might be: just don't support those features when used with .Prepare().
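
For reference, here's what that "in" support looks like today - the list is expanded into individual parameters ((@ids1, @ids2, ...)) at execution time, which is per-call rewriting that sits awkwardly with a statement prepared once up front:

var orders = connection.Query<Order>(
    "select * from Orders where Id in @ids",
    new { ids = new[] { 1, 2, 3 } }).AsList();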

I think there's plenty of merit to have this kind of discussion, and I'm 100% open to discussing API features and additions. As long as we are discussing the right thing - i.e. the "I have a long-lived..." discussion from above.

If, however, we start that conversation (via a security consultant) via:

I want to use prepared statements to avoid SQL injection

then: that is not a useful discussion.

Tl;dr:

If you want to avoid your car skidding in icy weather, you fit appropriate tyres. You don't change the screen-wash.