
Thursday, 13 March 2014

Beware jamming the thread pool

 

I’ve been struggling over the last few days to diagnose and eradicate a fun little bug in the cache tier of Stack Overflow / Stack Exchange; for context, we use redis extensively as a shared cache (and for some other things – including the realtime updates via web-sockets, which rely heavily on redis pub/sub) – and have this week deployed a new implementation of our redis communications stack. Brain-dead bugs aside (and I really did manage a howler, for which I apologise: sorry), we got it in and stable: it would be working just fine, processing a few million messages without issue, and then out of the blue… WHAM! 10 thousand timeouts in a second, and when you immediately go to look, everything is happy again, merrily churning through load as though you had imagined things. Local load testing failed to reproduce this issue.

As always, I got bitten by the rule: interesting problems only happen at scale.

ONE DOES NOT SIMPLY ... DO ANYTHING AT STACKEXCHANGE SCALE

By which, I don’t mean to imply that we’re the biggest site out there, or that we’re doing anything especially clever (quite the opposite in my case, ahem), but we take great pride in the fact that we run a very busy site on very small numbers of servers. We accidentally found out on Tuesday (again, my bad) that we can still run the entire Stack Exchange Network on two web-servers. Not quite as well as normal, but it worked. edit - proof (courtesy of @Nick_Craver):

The Stack Exchange Network running on two servers

But unless you have a very expensive test lab and dedicated load-test team, it is really quite hard to simulate realistic load for the network.

Enough rambling; what went wrong?

I got hit by the thread-pool. You see, like BookSleeve (which I will probably talk about in the next post), the new client is an async-enabled multiplexer. As outbound messages for an endpoint come in, we dispatch them to a queue (you need to serialize work on a socket, and know which requests marry up to which responses), and if we don’t already have a writer for that queue, we request one. The problem here was: I requested it from the thread-pool. Now, the thread-pool is very, very clever – it has lots of smarts in there to handle automatic growth and shrinking, etc. It works perfectly well under sane conditions. But I asked too much of it: asp.net will already be chewing through workers for requests, and we don’t currently use much async code – our requests are processed pretty much synchronously. This means that during a storm (and only then) we were essentially getting into a deadlock condition:

  • asp.net requests were using lots of workers
  • which were blocked waiting on a response from redis
  • which hadn’t been sent the request yet, because no thread-pool thread could be allocated to write the queue
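To make that concrete, here is a minimal sketch of the problematic pattern (the names – Multiplexer, outboundQueue, WriteQueue and so on – are illustrative, not the actual client code, and the shutdown races are elided for brevity):

using System.Collections.Concurrent;
using System.Threading;

class Message { /* payload, completion source, etc */ }

class Multiplexer
{
    private readonly ConcurrentQueue<Message> outboundQueue = new ConcurrentQueue<Message>();
    private int writerActive; // 0 = idle, 1 = a writer is running

    public void Enqueue(Message message)
    {
        outboundQueue.Enqueue(message);

        // no writer yet? ask for one... from the shared thread-pool; during a
        // storm every pool thread may already be blocked waiting on a redis
        // reply, so this work item can sit unserviced - hence the deadlock
        if (Interlocked.CompareExchange(ref writerActive, 1, 0) == 0)
        {
            ThreadPool.QueueUserWorkItem(WriteQueue);
        }
    }

    private void WriteQueue(object state)
    {
        try
        {
            Message next;
            while (outboundQueue.TryDequeue(out next))
            {
                // write to the socket, in order, so that responses can be
                // married back up to their requests
            }
        }
        finally
        {
            Interlocked.Exchange(ref writerActive, 0);
        }
    }
}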

Fortunately, this is self-curing; eventually either:

  • the blocking requests will timeout (we always specify timeouts, and pretty short ones), allowing the requests to complete and making more writers available
  • the thread-pool will grow and allocate a writer (although to be honest, when competing with constant inbound asp.net requests, it is dubious whether the writer would get there fast enough)

That is why, when you looked at the system a second after the trouble, it was all healthy again – the timeouts had happened, releasing enough workers to the pool to service the queue.

Sigh; all fun.

The moral of the story

See, I do have morals! Don’t use the thread-pool for time-critical operations if there’s a good chance you’ll already be stressing the thread-pool. The fix was pretty simple once we understood the problem: we now retain a dedicated writer thread per multiplexer (note: not per socket). We do reserve the right to seek help from the thread-pool if there is a backlog, but frankly that is pretty rare – the dedicated writer is usually more than able to keep up (in local testing, I’ve seen the multiplexer servicing around 400k ops/s – far above what most people need).
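For the shape of the fix, reworking the sketch from earlier (again illustrative, not the actual client code; Message is reused from the sketch above): the writer becomes a thread owned by the multiplexer, so socket writes never have to compete with asp.net for pool threads:

using System.Collections.Concurrent;
using System.Threading;

class Multiplexer
{
    private readonly ConcurrentQueue<Message> outboundQueue = new ConcurrentQueue<Message>();
    private readonly AutoResetEvent writeSignal = new AutoResetEvent(false);

    public Multiplexer()
    {
        // one dedicated writer per multiplexer (note: not per socket);
        // it cannot be starved by a saturated thread-pool
        var writer = new Thread(WriteLoop) { Name = "RedisWriter", IsBackground = true };
        writer.Start();
    }

    public void Enqueue(Message message)
    {
        outboundQueue.Enqueue(message);
        writeSignal.Set(); // wake the writer if it is parked
    }

    private void WriteLoop()
    {
        while (true)
        {
            Message next;
            while (outboundQueue.TryDequeue(out next))
            {
                // write to the socket, in order
            }
            // park until Enqueue signals more work; a Set that arrives just
            // before this call leaves the event signalled, so no wake-up is lost
            writeSignal.WaitOne();
        }
    }
}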

Next time: announcing a new redis client for .NET! (also: reasons why, and what about BookSleeve?)

Monday, 24 October 2011

Assault by GC

TL;DR

We had performance spikes, which we eased with some insane use of structs.

History

For a while now, we had been seeing a problem in the Stack Exchange engine where we would see regular and predictable stalls in performance. So much so that our sysadmins blogged about it back here. Annoyingly, we could only see this problem from the outside (i.e. haproxy logging, etc) – and to cut a very long story short, these stalls were due to garbage collection (GC).

We had a problem, you see, with some particularly large and long-lived sets of data (that we use for some crazy-complex performance-related code) that would hobble the server periodically.

Short aside on Server GC

The regular .NET framework has a generational GC – meaning: new objects are allocated in “generation 0” (GEN-0). When it chooses, the system scans GEN-0 and finds (by walking references from the so-called “roots”) which (if any) of the objects are still in use by your application. Most objects have very short lives, and it can very quickly and efficiently reclaim the space from GEN-0; any objects that survived move to GEN-1. GEN-1 works the same (more-or-less), but is swept less often – moving any survivors into GEN-2. GEN-2 is the final lurking place for all your long-lived data – it is swept least often, and is the most expensive to check (especially if it gets big).
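You can watch that promotion happen with a trivial snippet (GC.Collect here is purely for illustration – forcing collections in production code is almost always a mistake):

using System;

static class GenerationDemo
{
    static void Main()
    {
        object obj = new object();
        Console.WriteLine(GC.GetGeneration(obj)); // 0 - freshly allocated

        GC.Collect(); // obj is still rooted, so it survives the sweep...
        Console.WriteLine(GC.GetGeneration(obj)); // 1 - promoted

        GC.Collect(); // ...and survives again
        Console.WriteLine(GC.GetGeneration(obj)); // 2 - the long-lived heap

        GC.KeepAlive(obj); // keep obj rooted to the end
    }
}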

Until .NET 4.5 rolls into town with a background server GC, checking GEN-2 is a “stop the world” event – it (usually briefly) pauses your code, and does what it needs to. Now imagine you have a huge set of objects which will never be available to collect (because you will always be using them) all sat in GEN-2. What does that look like? Well, using StackExchange Data Explorer to analyse our haproxy logs, it looks a bit like this:

[image: haproxy response-time chart – mostly flat, fast response times punctuated by regular tall spikes]

I’ve omitted the numbers, as they don’t matter; but by way of interpretation: normally the server is ticking along with nice fast response times, then WHAM! A big spike (which we have correlated with GC) that just hammers the response-times.

So what to do?

We take performance very seriously at Stack Exchange, so as you can imagine this got a bit of attention. The obvious answer of “don’t keep that data”, while a valid suggestion, would have hurt a lot of the overall performance, so we needed to find a way to remove or reduce the stalls while keeping the data.

Our initial efforts focused on removing things like unnecessary copy/extend operations on the data, which helped some, but didn’t really make a step-change. Eventually, we concluded…

Break all the rules

Important: the following is merely a discussion of what has helped us. This is not a magic bullet, and should only be applied to some very limited scenarios after you have profiled and you know what you are doing and why. And it helps if you are just a little bit crazy.

First, a brief overview – imagine you are holding Customer objects in memory for an extended period; you have a lot of them in a List<Customer>, and occasionally add more (with suitable threading protection, etc). Further, you have some pre-calculated subsets of the data (perhaps by region) – so a number of smaller List<Customer>. That is entirely inaccurate, but sets the scene ;p

After exhausting alternatives, what we did was:

  • change Customer from a class to a struct (only within this crazy code)
  • change the main store from a List<Customer> to a Customer[]
  • change the subsets from List<Customer> to List<int>, specifically the offset into the main Customer[]
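In code, the reshaped storage looks something like this (a sketch only – Customer and its fields are the illustrative stand-ins from above, not our real data):

using System.Collections.Generic;

// a value-type record: an array of these is one contiguous block of memory,
// with no per-record object header and no references for the GC to trace
struct Customer
{
    public int Id;
    public int RegionId;
    // ... more fields; note: no reference-type fields
}

class CustomerStore
{
    // one big object (on the large object heap), rather than
    // thousands of small long-lived objects cluttering GEN-2
    private Customer[] customers;

    // subsets hold int offsets into the array, not object references -
    // no duplicated records, and half the width of a reference on x64
    private Dictionary<int, List<int>> customersByRegion;
}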

eek; so what does that do for us?

  • the main data is now a single object (the array) on the "large object heap", rather than lots of small objects
  • by using direct access into an array (rather than a list indexer) you can access the data in-situ, so we are not constantly copying Customer values on the stack
  • the int offsets for subsets are essential to make sure we don't have multiple separate values of each record, but on x64 using int offsets rather than references also means our subsets suddenly take half the memory

Note that for some of the more complex code (applying predicates etc), we also had to switch to by-ref passing, i.e.

// operates on the record in-place; no copy of the (large) value is made
void SomethingComplex(ref Customer customer) {...}
...
int custIndex = ...
// take a ref directly into the array element - the method sees the actual record
SomethingComplex(ref customers[custIndex]);

This again is accessing the Customer value in-situ inside the array, rather than copying it.

To finish off the bad-practice sheet, we also made some crazy changes to replace some reference data inside the record with fixed-size buffers inside the value (avoiding an array on the heap per item), and some corresponding unsafe code to query it (including the rarely-used stackalloc), but that code is probably a bit too complex to cover properly here - essentially: we removed all the objects/references from this data. And after removing the objects, there is nothing for GC to look at.
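For flavour, a sketch of that last trick (entirely illustrative – our real fields and sizes differ; this needs the /unsafe compiler switch): a fixed-size buffer embeds the characters directly in the struct, so there is no separate heap array per record:

using System;

unsafe struct Customer
{
    public int Id;
    public fixed char Name[32]; // 32 chars stored inline in the value,
                                // rather than a char[]/string on the heap

    public void SetName(string value)
    {
        fixed (char* buffer = Name) // pin the buffer while we write to it
        {
            int length = Math.Min(value.Length, 31);
            for (int i = 0; i < length; i++) buffer[i] = value[i];
            buffer[length] = '\0'; // null-terminate for readers
        }
    }
}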

Impact

It helped! Here’s the “after”, at the same scales:

[image: the same response-time chart after the changes – far fewer, shorter spikes]

As you can see, there are still a few vertical spikes (which still tie into GC), but they are much less dense, and less tall. Basically, the server is no longer tied up in knots. The code is a bit less OOP than we might usually do, but it is a: constrained to this very specific scenario (this is absolutely not a carte-blanche “make everything a struct”), and b: for understood reasons.

Plus, it was pretty interesting (oh dear; I really, really need to get out more). Enjoy.

Disclaimers

  • You need to be really careful messing with this – especially with updates etc, since your records are certainly over-size for atomic read/write, and you don’t want callers seeing torn data (see the sketch after this list).
  • This is not about “struct is faster” – please don’t let that be what you take away; this is a very specific scenario.
  • You need to be really careful to avoid copying fat values on the stack.
  • This code really is well outside of the normal bell-curve. Indeed, it is quite reminiscent of XNA-style C# (for games, where GC hurts the frame-rate).
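To illustrate that first point (a minimal sketch with hypothetical names, reusing the illustrative Customer): a multi-field struct assignment is not atomic, so an unsynchronised reader can observe a half-written record; the cheapest defence is to make both sides take the same lock:

struct Customer { public int Id; public int RegionId; public long Balance; }

class CustomerStore
{
    private readonly object sync = new object();
    private readonly Customer[] customers = new Customer[1024];

    public void Update(int index, Customer updated)
    {
        lock (sync)
        {
            // struct assignment copies field-by-field; without the lock a
            // concurrent reader could see a mix of old and new fields
            customers[index] = updated;
        }
    }

    public Customer Read(int index)
    {
        lock (sync) // readers must take the lock too, or torn reads remain possible
        {
            return customers[index];
        }
    }
}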

Credit

A lot of input, data-analysis, scheming, etc here is also due to Sam Saffron and Kyle Brandt; any stupid/wrong bits are mine alone.