Code, code and more code.: October 2011

TL;DR;

We had performance spikes, which we eased with some insane use of structs.

History

For a while now, we had been seeing a problem in the Stack Exchange engine where we would see regular and predictable stalls in performance. So much so that our sysadmins blogged about it back here. Annoyingly, we could only see this problem from the outside (i.e. haproxy logging, etc) – and to cut a very long story short, these stalls were due to garbage collection (GC).

We had a problem, you see, with some particularly large and long-lived sets of data (that we use for some crazy-complex performance-related code), that would hobble the server periodically.

Short aside on Server GC

The regular .NET framework has a generational GC – meaning: new objects are allocated “generation 0” (GEN-0). When it chooses, the system scans GEN-0 and finds (by walking references from the so-called “roots”) which (if any) of the objects are still in use to your application. Most objects have very short lives, and it can very quickly and efficiently reclaim the space from GEN-0; any objects that survived move to GEN-1. GEN-1 works the same (more-or-less), but is swept less often – moving any survivors into GEN-2. GEN-2 is the final lurking place for all your long-lived data – it is swept least often, and is the most expensive to check (especially if it gets big).

Until .NET 4.5 rolls into town with a background server GC, checking GEN-2 is a “stop the world” event – it (usually briefly) pauses your code, and does what it needs to. Now imagine you have a huge set of objects which will never be available to collect (because you will always be using them) all sat in GEN-2. What does that look like? Well, using StackExchange Data Explorer to analyse our haproxy logs, it looks a bit like this:

I’ve omitted the numbers, as they don’t matter; but interpretation: normally the server is ticking along with nice fast response times, then WHAM! A big spike (which we have correlated with GC) that just hammers the response-times.

So what to do?

We take performance very seriously at Stack Exchange, so as you imagine this got a bit of attention. The obvious answer of “don’t keep that data”, while a valid suggestion, would have hurt a lot of the overall performance, so we needed to find a way to remove or reduce this while keeping the data.

Our initial efforts focused on removing things like unnecessary copy/extend operations on the data, which helped some, but didn’t really make a step-change. Eventually, we concluded…

Break all the rules

Important: the following is merely a discussion of what has helped us. This is not a magic bullet, and should only be applied to some very limited scenarios after you have profiled and you know what you are doing and why. And it helps if you are just a little bit crazy.

First, a brief overview – imagine you are holding Customer objects in memory for an extended period; you have a lot of them in a List<Customer>, and occasionally add more (with suitable threading protection, etc). Further, you have some pre-calculated subsets of the data (perhaps by region) – so a number of smaller List<Customer>. That is entirely inaccurate, but sets the scene ;p

After exhausting alternatives, what we did was:

change Customer from a class to a struct (only within this crazy code)
change the main store from a List<Customer> to a Customer[]
change the subsets from List<Customer> to List<int>, specifically the offset into the main Customer[]

eek; so what does that do for us?

the main data is now a single object (the array) on the "large object heap", rather than lots of small objects
by using direct access into an array (rather than a list indexer) you can access the data in-situ, so we are not constantly copying Customer values on the stack
the int offsets for subsets are essential to make sure we don't have multiple separate values of each record, but on x64 using int offsets rather than references also means our subsets suddenly take half the memory

Note that for some of the more complex code (applying predicates etc), we also had to switch to by-ref passing, i.e.

void SomethingComplex(ref Customer customer) {...}
...
int custIndex = ...
SomethingComplex(ref customers[custIndex]);

This again is accessing the Customer value in-situ inside the array, rather than copying it.

To finish off the bad-practice sheet, we also had some crazy changes to replace some reference data inside the record to fixed sized buffers inside the value (avoiding an array on the heap per item), and some corresponding unsafe code to query it (including the rarely-used stackalloc), but that code is probably a bit complex to cover properly - essentially: we removed all the objects/references from this data. And after removing the objects, there is nothing for GC to look at.

Impact

It helped! Here’s the “after”, at the same scales:

As you can see, there are still a few vertical spikes (which still tie into GC), but they are much less dense, and less tall. Basically, the server is no longer tied up in knots. The code is a bit less OOP than we might usually do, but it is a: constrained to this very specific scenario (this is absolutely not a carte-blanche “make everything a struct”), and b: for understood reasons.

Plus, it was pretty interesting (oh dear; I really, really need to get out more). Enjoy.

Disclaimers

You need to be really careful messing with this – especially with updates etc, since your records are certainly over-size for atomic read/write, and you don’t want callers seeing torn data.
This is not about “struct is faster” – please don’t let that be what you take away; this is a very specific scenario.
You need to be really careful to avoid copying fat values on the stack.
This code really is well outside of the normal bell-curve. Indeed, it is pretty resonant of XNA-style C# (for games, where GC hurts the frame-rate).

Credit

A lot of input, data-analysis, scheming, etc here is also due to Sam Saffron and Kyle Brandt; any stupid/wrong bits are mine alone.

A little while ago, I mentioned how new tuple handling in protobuf-net meant that likely-looking tuples could now be handled automatically, and even better this meant I could remove my hacky KeyValuePair<,> handling. All was well in the world – after all, what could go wrong?

Answer: Mono.

Actually, I’m not criticizing the Mono implementation really, but simply: Mono has a subtly different implementation of KeyValuePair<,> to Microsoft – nothing huge; simply there exist some set accessors (private ones, visible here). And the library was pretty fussy – if there was a set it wouldn’t be treated as a sure-thing tuple.

This is now fixed in r447, but: if you are using Mono and your dictionaries stopped serializing – then honestly and humbly: sorry about that.

And if you are a library author – watch out: sometimes the simplest most subtle differences between implementations can kill you.

Code, code and more code.

Monday, 24 October 2011

Assault by GC