Monday, 9 May 2016

How I found CUDA, or: Rewriting the Tag Engine – part 1

(part 2, part 3)

This post is largely an introduction that sets the context so the following few articles make sense. It is light on code, but: well, take it or leave it :)

The Context – what is the tag engine?

At Stack Overflow / Stack Exchange, a lot of our pages relate to “tags” (topics). As an obvious example, a lot of users browse questions in specific technologies looking for new things to answer, or have feeds / email notifications configured for specific tags. Other users might be interested in all the newest questions, but absolutely never want to see another question that reminds them of their last job (by the way, if your job sucks, you should fix that – life is too short to be miserable). We also do things like showing “related tags” – essentially the counts of the intersections between the tags you’re looking at and the tags on the other questions we know about.

All of this needs a non-trivial amount of processing and memory. Back in the day, it was sufficient to use our RDBMS for that (via some hacks that in turn left us some technical debt, but that is long gone now), but as we grew that simply wasn’t going to work. So after investigating a few options, out popped the “tag engine” – basically some bespoke code with a small set of jobs that we could run out-of-process to the main web-servers (so they don’t have to reload everything when we deploy / recycle).

So… life was good?

All was well. Sure, we had to fight a few things like GC, but… it worked. But as we grew, that code base started to become more and more of a limiting factor. It is nearly 5 years old now; we’ve grown a lot in that time, and our needs have changed a lot too. The tag engine was never really “designed” so much as … “grew”. We gradually hacked in features we needed, and tweaked bits, and more or less it kept working. It was a bit ugly, but it wasn’t actually a problem. We’re pragmatists: we fix problems. We don’t fix things that aren’t problems.

Lately, it has been moving more and more from the “not a problem” camp to “problem”, so yay, more things to do. Performance was a key part of this, stemming partly from data volume and partly from design choices; an overhaul was overdue.

Starting to think about GPUs

Around this time, I happened to see an email conversation from Daniel Egloff at QuantAlea, and it made me think about how much of the tag-engine might be suitable for GPU work. After a brief exchange with Daniel, I was satisfied that I wasn’t totally crazy. As a brief aside: QuantAlea were great and seemed keen to help us work on the problem. I think their tools show real promise, but we decided to keep everything in-house and do it ourselves, in part because our scenario is relatively simple. What I’m saying here is: they were really helpful, and if you’re interested in CUDA you might want to think about them, but: we didn’t go that way ourselves.

So what the hell is the tag engine doing? Why do you even need that level of crazy?

The interesting thing (to me) about the tag engine is that there are two very different scenarios. About half the queries tend to be embarrassingly trivial; and the other half are absurd. If you come at the tag-engine thinking “list the first 50 newest C# questions” – then: that’s the embarrassingly trivial side. This really doesn’t need a lot of work – just keep track of questions by tag, pre-sorted, and pick out the pages you need. Very fast, very simple. It just works. Although I’ll leave it as an exercise for the reader to think about how to generate the tag intersection cloud.
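To make the trivial side concrete, here is a minimal sketch in Python. It is not the actual tag engine: the `TagIndex` class, its method names and the integer timestamps are all invented for illustration. The idea is simply that if each tag keeps its question ids newest-first, a page is just a slice.

```python
from bisect import insort
from collections import defaultdict


class TagIndex:
    """Hypothetical per-tag index: question ids kept newest-first."""

    def __init__(self):
        # tag -> sorted list of (-created, question_id); negating the
        # timestamp makes ascending sort order equal newest-first.
        self._by_tag = defaultdict(list)

    def add(self, question_id, created, tags):
        for tag in tags:
            insort(self._by_tag[tag], (-created, question_id))

    def page(self, tag, page_number, page_size=50):
        """Return the ids on a 1-based page of the newest questions for a tag."""
        start = (page_number - 1) * page_size
        rows = self._by_tag.get(tag, [])[start:start + page_size]
        return [question_id for _, question_id in rows]


index = TagIndex()
index.add(1, created=100, tags=["c#", ".net"])
index.add(2, created=300, tags=["c#"])
index.add(3, created=200, tags=["java"])
index.add(4, created=400, tags=["c#", "sql"])

print(index.page("c#", page_number=1, page_size=2))  # the two newest C# questions
```

All the cost is paid when a question is added; reading a page never has to sort or filter anything.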

The real problem is the other half, which could be “page 200 (50 per page) of all 'java or .net or sql' questions, and everything that starts with ‘visual’, but never show me anything with php or fortran, and only show me questions with a score above 2 that were created after some date”. Lots of complex unions, restrictions, etc. You can do this type of thing with general purpose indexing tools (Elasticsearch for example), but a: we want the maximum performance, and b: some of our data is not really amenable to such tools – for example, some of our sort-orders are highly time dependent, which is complex and awkward for such tools (and may require lots of re-sends). Equally, some of the data we want to get back out (including, but not limited to, the tag intersection cloud) is not easy to get from such tools.
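To show the shape of such a query, here is a hedged Python sketch. The `Question` and `TagQuery` types, their field names, and the newest-first sort order are assumptions made for this example, not how the tag engine models things. It just spells out the "any of these, or this prefix, but none of those, above this score, after this date" logic as a predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    id: int
    tags: frozenset
    score: int
    created: int  # e.g. seconds since some epoch


@dataclass(frozen=True)
class TagQuery:
    any_of: frozenset = frozenset()    # match if the question has any of these tags
    prefixes: tuple = ()               # ...or any tag starting with one of these
    none_of: frozenset = frozenset()   # never show questions with these tags
    min_score: int = 0                 # score must be strictly above this
    created_after: int = 0             # creation time must be strictly after this

    def matches(self, q):
        included = bool(q.tags & self.any_of) or any(
            tag.startswith(p) for tag in q.tags for p in self.prefixes
        )
        return (
            included
            and not (q.tags & self.none_of)
            and q.score > self.min_score
            and q.created > self.created_after
        )


def run(query, questions, page_number, page_size=50):
    """Filter, sort newest-first (an assumption), and return one 1-based page of ids."""
    hits = sorted((q for q in questions if query.matches(q)),
                  key=lambda q: q.created, reverse=True)
    start = (page_number - 1) * page_size
    return [q.id for q in hits[start:start + page_size]]


questions = [
    Question(1, frozenset({"java", "spring"}), score=5, created=100),
    Question(2, frozenset({"visual-studio"}), score=3, created=200),
    Question(3, frozenset({"java", "php"}), score=9, created=300),
    Question(4, frozenset({"sql"}), score=1, created=400),
    Question(5, frozenset({".net", "c#"}), score=4, created=500),
]
query = TagQuery(
    any_of=frozenset({"java", ".net", "sql"}),
    prefixes=("visual",),
    none_of=frozenset({"php", "fortran"}),
    min_score=2,
    created_after=50,
)
print(run(query, questions, page_number=1, page_size=2))
```

Note that this naive version touches every question and sorts every hit just to return one page, which is exactly the cost that becomes painful at scale.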

It is this second category of queries where the tag-engine shines; essentially, what it has to do is to keep some pre-built indexes and then try to figure out how best to exploit those indexes to answer a complex query – or, worst case, to do the moral equivalent of a table scan, and just walk the data.
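The "use an index if you can, otherwise walk the data" decision can be sketched like this. Again, this is a toy under stated assumptions: `QUESTIONS`, `TAG_INDEX`, `candidates` and `search` are hypothetical names, and a real engine would weigh up far more than "did the query name any tags".

```python
from collections import defaultdict

# Hypothetical in-memory data: question id -> (tags, score)
QUESTIONS = {
    1: ({"java", "spring"}, 5),
    2: ({"visual-studio"}, 3),
    3: ({"java", "php"}, 9),
    4: ({"sql"}, 1),
    5: ({".net", "c#"}, 4),
}

# Pre-built index: tag -> set of question ids carrying that tag.
TAG_INDEX = defaultdict(set)
for qid, (tags, _score) in QUESTIONS.items():
    for tag in tags:
        TAG_INDEX[tag].add(qid)


def candidates(any_of=(), none_of=()):
    """Use the tag index when the query names tags; otherwise walk everything."""
    if any_of:
        ids = set()
        for tag in any_of:
            ids |= TAG_INDEX.get(tag, set())
    else:
        ids = set(QUESTIONS)  # the "table scan" fallback
    for tag in none_of:
        ids -= TAG_INDEX.get(tag, set())
    return ids


def search(any_of=(), none_of=(), min_score=0):
    """Narrow by index first, then apply the remaining filter per question."""
    return sorted(qid for qid in candidates(any_of, none_of)
                  if QUESTIONS[qid][1] > min_score)


print(search(any_of=["java", "sql"], none_of=["php"], min_score=2))
print(search(min_score=3))  # no tags named, so every question is examined
```

The index shrinks the set of questions that need the per-question checks; when no index applies, every question gets checked.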

OK, there’s a tricky technical problem; how could CUDA help? What the hell even is CUDA?

CUDA is an offshoot from the gaming world. In their efforts to make games that have high frame rates at high visual quality, the graphics card (GPU) vendors came up with a different approach to processing. Rather than having a relatively small number of very fast general purpose CPUs that switch constantly between doing 200 things, they went with a higher number of “streaming multiprocessors” – not individually as fast as regular CPUs, but able to do the same thing many many times in parallel (think: SIMD gone mad), and with access to a large number of math processors (ALUs). NVIDIA correctly reasoned that this type of technology might be awesome for many computing tasks, so CUDA was developed as a framework to enable general purpose computing on the GPU, aka GPGPU. Of course, the bus between the device and main memory isn’t as fast as direct CPU<->RAM access (although they’re working hard on that problem), but: very powerful.

And as a result of this, GPU programming is very, very good at scenarios where you want to do a relatively simple and predictable operation many times, ideally where the data you need can be pushed onto the device up-front (and left there, updated only periodically), and where you only need to get back relatively small quantities of results. Doesn't this sound a lot like the second – harder – scenario we just described for the tag engine?
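To illustrate that shape of work, here is a CPU-only stand-in written in plain Python. It is not CUDA and not our implementation; the bit assignments in `TAG_BITS` and the `kernel` / `launch` names are invented for this sketch. The data sits in flat parallel arrays (as it might on the device), every index gets the same few cheap tests, and only a short list of matching indexes comes back.

```python
# Hypothetical bit assignments for a handful of tags.
TAG_BITS = {"java": 1 << 0, ".net": 1 << 1, "sql": 1 << 2, "php": 1 << 3}


def mask(*tags):
    """Combine tag names into a single integer bitmask."""
    m = 0
    for tag in tags:
        m |= TAG_BITS[tag]
    return m


# "Device-resident" data, laid out as parallel arrays (one slot per question).
tag_masks = [mask("java"), mask("java", "php"), mask("sql"), mask(".net", "sql")]
scores = [5, 9, 1, 4]


def kernel(i, include, exclude, min_score):
    """The work one GPU thread would do for question i: a few cheap tests."""
    return (
        (tag_masks[i] & include) != 0
        and (tag_masks[i] & exclude) == 0
        and scores[i] > min_score
    )


def launch(include, exclude, min_score):
    """CPU stand-in for a kernel launch: same test at every index, small result out."""
    return [i for i in range(len(scores)) if kernel(i, include, exclude, min_score)]


print(launch(mask("java", ".net", "sql"), mask("php"), min_score=2))
```

On a GPU the loop in `launch` would disappear: each index would be handled by its own thread, with only the query parameters going in and the matches coming out across the bus.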


That’s why we were intrigued by GPU programming for the tag-engine. The next post will be more technical, discussing the relative merits of CPU and GPU programming and how we might need to use different approaches to optimize each. And hopefully some code, yay! As a teaser though: it works great.