I’ve been struggling over the last few days to diagnose and eradicate a fun little bug in the cache tier of Stack Overflow / Stack Exchange; for context, we use redis extensively as a shared cache (and for some other things – including the realtime updates via web-sockets, which rely heavily on redis pub/sub) – and have this week deployed a new implementation of our redis communications stack. Brain-dead bugs aside (and I really did manage a howler, for which I apologise: sorry), we got it in and stable: it would be working just fine, processing a few million messages without issue, and then out of the blue… WHAM! 10 thousand timeouts in a second, and when you immediately go to look, everything is happy again, merrily churning through load as though you had imagined things. Local load testing failed to reproduce this issue.
As always, I got bitten by the rule: interesting problems only happen at scale.
By which I don’t mean to imply that we’re the biggest site out there, or that we’re doing anything especially clever (quite the opposite in my case, ahem), but we take great pride in the fact that we run a very busy site on very small numbers of servers. We accidentally found out on Tuesday (again, my bad) that we can still run the entire Stack Exchange Network on two web-servers. Not quite as well as normal, but it worked. Edit: proof, courtesy of @Nick_Craver: [image]
But unless you have a very expensive test lab and dedicated load-test team, it is really quite hard to simulate realistic load for the network.
Enough rambling; what went wrong?
I got hit by the thread-pool. You see, like BookSleeve (which I will probably be talking about in the next blog post), the new client is an async-enabled multiplexer. As outbound messages for an endpoint come in, we dispatch them to a queue (you need to serialize work on a socket, and know which requests marry to which responses), and if we don’t already have a writer for that queue, we request one. The problem here was: I requested it from the thread-pool. Now, the thread-pool is very very clever – it has lots of smarts in there to handle automatic growth and shrinking, etc. It works perfectly well under sane conditions. But I asked too much of it: asp.net will already be chewing through workers for requests, and we don’t currently use much async code – our requests are processed pretty much synchronously. This means that during a storm (and only then) we were essentially getting into a deadlock condition:
- asp.net requests were using lots of workers
- which were blocked waiting on a response from redis
- which hadn’t been sent the request yet, because no thread-pool thread could be allocated to write the queue
Fortunately, this is self-curing; eventually either:
- the blocking requests will time out (we always specify timeouts, and pretty short ones), allowing the requests to complete and freeing up workers for a writer
- the thread-pool will grow and allocate a writer (although to be honest, when competing with constant inbound asp.net requests, it is dubious whether the writer would get there fast enough)
That is why, when you look at the system a second after the trouble, it is all healthy again - the timeouts have happened, releasing enough workers to the pool to service the queue.
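To make the failure mode concrete, here is a minimal sketch of the same shape of problem. It is written in Python rather than .NET, with `ThreadPoolExecutor` standing in for the thread-pool; it is an analogy, not the real client code. Every name in it (`run_storm`, `request`, `writer`, the 0.3s timeout) is made up for illustration, and the "write to redis and get a reply" step is faked by setting an event.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def run_storm(pool_size, request_count, timeout=0.3):
    """Simulate a burst of blocking requests; return how many timed out."""
    assert request_count <= pool_size  # every request must get a worker up front
    pool = ThreadPoolExecutor(max_workers=pool_size)  # stand-in for the shared thread-pool
    outbound = queue.Queue()                          # messages waiting to be written
    lock = threading.Lock()
    state = {"writer_active": False}
    all_running = threading.Barrier(request_count)    # make the storm simultaneous

    def writer():
        # Drain the queue; "sending" a message is faked by completing its reply.
        while True:
            with lock:
                if outbound.empty():
                    state["writer_active"] = False
                    return
                reply = outbound.get_nowait()
            reply.set()

    def request():
        all_running.wait()            # wait until every request holds a worker
        reply = threading.Event()
        with lock:
            outbound.put(reply)
            need_writer = not state["writer_active"]
            state["writer_active"] = True
        if need_writer:
            pool.submit(writer)       # the mistake: the writer comes from the same pool
        return reply.wait(timeout)    # block synchronously; False means timeout

    futures = [pool.submit(request) for _ in range(request_count)]
    timeouts = sum(1 for f in futures if not f.result())
    pool.shutdown(wait=True)
    return timeouts


if __name__ == "__main__":
    print("pool exactly full:", run_storm(pool_size=4, request_count=4), "timeouts")
    print("one spare worker: ", run_storm(pool_size=5, request_count=4), "timeouts")
```

When the requests fill the pool, the writer cannot start until at least one request has given up, so there is always at least one timeout; with a single spare worker the writer runs immediately and nothing times out. The exact number of timeouts in the starved case depends on scheduling.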
Sigh; all fun.
The moral of the story
See, I do have morals! Don’t use the thread-pool for time-critical operations if there’s a good chance you’ll already be stressing the thread-pool. The fix was pretty simple once we understood the problem: we now retain a dedicated writer thread per multiplexer (note: not per socket). We do reserve the right to seek help from the thread-pool if there is a backlog, but frankly that is pretty rare – the dedicated writer is usually more than able to keep up (in local testing, I’ve seen the multiplexer servicing around 400k ops/s – far above what most people need).
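The shape of that fix can be sketched in the same Python analogy (again: not the real client, just an illustration under made-up names such as `run_storm_with_dedicated_writer` and `writer_loop`). The writer is now a long-lived thread of its own that blocks on the queue, so it never has to compete with requests for a pool worker. The "ask the thread-pool for help when there is a backlog" part is deliberately left out of this sketch.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def run_storm_with_dedicated_writer(pool_size, request_count, timeout=0.3):
    """Same simulated storm, but the writer owns its own thread."""
    assert request_count <= pool_size
    pool = ThreadPoolExecutor(max_workers=pool_size)  # still used for the requests
    outbound = queue.Queue()
    all_running = threading.Barrier(request_count)
    stop = object()                                   # sentinel to end the writer loop

    def writer_loop():
        # Runs for the lifetime of the (simulated) multiplexer.
        while True:
            reply = outbound.get()    # blocks until there is something to write
            if reply is stop:
                return
            reply.set()               # fake "written, and the response arrived"

    writer = threading.Thread(target=writer_loop, name="dedicated-writer", daemon=True)
    writer.start()

    def request():
        all_running.wait()            # every pool worker is busy at this point
        reply = threading.Event()
        outbound.put(reply)           # no pool thread needed to get this written
        return reply.wait(timeout)

    futures = [pool.submit(request) for _ in range(request_count)]
    timeouts = sum(1 for f in futures if not f.result())
    outbound.put(stop)
    writer.join()
    pool.shutdown(wait=True)
    return timeouts


if __name__ == "__main__":
    print("pool exactly full:",
          run_storm_with_dedicated_writer(pool_size=4, request_count=4), "timeouts")
```

Even with every pool worker blocked on a reply, the queue keeps draining, because the one thread that matters was never in the pool to begin with.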
Next time: announcing a new redis client for .NET! (also: reasons why, and what about BookSleeve?)