TL;DR;
We had performance spikes, which we eased with some insane use of struct
s.
History
For a while now, we had been seeing a problem in the Stack Exchange engine where we would see regular and predictable stalls in performance. So much so that our sysadmins blogged about it back here. Annoyingly, we could only see this problem from the outside (i.e. haproxy logging, etc) – and to cut a very long story short, these stalls were due to garbage collection (GC).
We had a problem, you see, with some particularly large and long-lived sets of data (that we use for some crazy-complex performance-related code), that would hobble the server periodically.
Short aside on Server GC
The regular .NET framework has a generational GC – meaning: new objects are allocated “generation 0” (GEN-0). When it chooses, the system scans GEN-0 and finds (by walking references from the so-called “roots”) which (if any) of the objects are still in use to your application. Most objects have very short lives, and it can very quickly and efficiently reclaim the space from GEN-0; any objects that survived move to GEN-1. GEN-1 works the same (more-or-less), but is swept less often – moving any survivors into GEN-2. GEN-2 is the final lurking place for all your long-lived data – it is swept least often, and is the most expensive to check (especially if it gets big).
Until .NET 4.5 rolls into town with a background server GC, checking GEN-2 is a “stop the world” event – it (usually briefly) pauses your code, and does what it needs to. Now imagine you have a huge set of objects which will never be available to collect (because you will always be using them) all sat in GEN-2. What does that look like? Well, using StackExchange Data Explorer to analyse our haproxy logs, it looks a bit like this:
I’ve omitted the numbers, as they don’t matter; but interpretation: normally the server is ticking along with nice fast response times, then WHAM! A big spike (which we have correlated with GC) that just hammers the response-times.
So what to do?
We take performance very seriously at Stack Exchange, so as you imagine this got a bit of attention. The obvious answer of “don’t keep that data”, while a valid suggestion, would have hurt a lot of the overall performance, so we needed to find a way to remove or reduce this while keeping the data.
Our initial efforts focused on removing things like unnecessary copy/extend operations on the data, which helped some, but didn’t really make a step-change. Eventually, we concluded…
Break all the rules
Important: the following is merely a discussion of what has helped us. This is not a magic bullet, and should only be applied to some very limited scenarios after you have profiled and you know what you are doing and why. And it helps if you are just a little bit crazy.
First, a brief overview – imagine you are holding Customer
objects in memory for an extended period; you have a lot of them in a List<Customer>
, and occasionally add more (with suitable threading protection, etc). Further, you have some pre-calculated subsets of the data (perhaps by region) – so a number of smaller List<Customer>
. That is entirely inaccurate, but sets the scene ;p
After exhausting alternatives, what we did was:
- change
Customer
from aclass
to astruct
(only within this crazy code) - change the main store from a
List<Customer>
to aCustomer[]
- change the subsets from
List<Customer>
toList<int>
, specifically the offset into the mainCustomer[]
eek; so what does that do for us?
- the main data is now a single object (the array) on the "large object heap", rather than lots of small objects
- by using direct access into an array (rather than a list indexer) you can access the data in-situ, so we are not constantly copying
Customer
values on the stack - the
int
offsets for subsets are essential to make sure we don't have multiple separate values of each record, but on x64 usingint
offsets rather than references also means our subsets suddenly take half the memory
Note that for some of the more complex code (applying predicates etc), we also had to switch to by-ref passing, i.e.
void SomethingComplex(ref Customer customer) {...}
...
int custIndex = ...
SomethingComplex(ref customers[custIndex]);
This again is accessing the Customer
value in-situ inside the array, rather than copying it.
To finish off the bad-practice sheet, we also had some crazy changes to replace some reference data inside the record to fixed sized buffers inside the value (avoiding an array on the heap per item), and some corresponding unsafe
code to query it (including the rarely-used stackalloc
), but that code is probably a bit complex to cover properly - essentially: we removed all the objects/references from this data. And after removing the objects, there is nothing for GC to look at.
Impact
It helped! Here’s the “after”, at the same scales:
As you can see, there are still a few vertical spikes (which still tie into GC), but they are much less dense, and less tall. Basically, the server is no longer tied up in knots. The code is a bit less OOP than we might usually do, but it is a: constrained to this very specific scenario (this is absolutely not a carte-blanche “make everything a struct”), and b: for understood reasons.
Plus, it was pretty interesting (oh dear; I really, really need to get out more). Enjoy.
Disclaimers
- You need to be really careful messing with this – especially with updates etc, since your records are certainly over-size for atomic read/write, and you don’t want callers seeing torn data.
- This is not about “struct is faster” – please don’t let that be what you take away; this is a very specific scenario.
- You need to be really careful to avoid copying fat values on the stack.
- This code really is well outside of the normal bell-curve. Indeed, it is pretty resonant of XNA-style C# (for games, where GC hurts the frame-rate).
Credit
A lot of input, data-analysis, scheming, etc here is also due to Sam Saffron and Kyle Brandt; any stupid/wrong bits are mine alone.