Wednesday, 27 April 2011

Completion tasks - the easy way

One of the good things about doing things in public – people point out when you’ve missed a trick.

Just the other day, I was moaning about how Task seemed too tightly coupled to schedulers, and wouldn’t it be great if you could have a Task that just allowed you to indicate success/failure – I even went so far as to write some experimental code to do just that.

So you can predict what comes next; it already existed – I just didn’t know about it. I give you TaskCompletionSource<T>, which has SetResult(…), SetException(…) and a Task property. Everything I need (except maybe support for result-free tasks, but I can work around that).
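As a minimal sketch of the shape of that API (the class and method names here are mine, purely for illustration): the producer holds the TaskCompletionSource<T>, hands out its Task, and later signals success or failure with no scheduler involved. The result-free work-around shown is one possible approach, using a dummy bool result and exposing the task as a plain Task.

```csharp
using System;
using System.Threading.Tasks;

static class CompletionDemo
{
    // Completes a task with a value; nothing is scheduled or run.
    public static Task<int> Succeed(int value)
    {
        var source = new TaskCompletionSource<int>();
        source.SetResult(value);
        return source.Task;
    }

    // Completes a task as faulted.
    public static Task<int> Fail(Exception error)
    {
        var source = new TaskCompletionSource<int>();
        source.SetException(error);
        return source.Task;
    }

    // Possible work-around for result-free tasks: a dummy result type,
    // exposed to callers as the base Task (Task<bool> derives from Task).
    public static Task SucceedVoid()
    {
        var source = new TaskCompletionSource<bool>();
        source.SetResult(true);
        return source.Task;
    }

    static void Main()
    {
        Console.WriteLine(Succeed(42).Result);        // 42
        Console.WriteLine(SucceedVoid().IsCompleted); // True

        var failed = Fail(new InvalidOperationException("boom"));
        Console.WriteLine(failed.IsFaulted);          // True
        // Reading Exception also marks the failure as observed.
        Console.WriteLine(failed.Exception.InnerException.Message); // boom
    }
}
```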

And wouldn’t you believe it, the TPL folks shine again, easily doubling my lousy performance, and trebling (or more) the performance compared to “faking it” via RunSynchronously:

Future (uncontested): 1976ms // my offering
Task (uncontested): 4149ms // TPL Task via RunSynchronously
Source (uncontested): 765ms // via TaskCompletionSource
Future (contested): 5608ms
Task (contested): 6982ms
Source (contested): 2389ms


I am hugely indebted to Sam Jack, who corrected me via a comment on the blog entry. Cheers!

Monday, 25 April 2011

Musings on async

OUT OF DATE: SEE UPDATE


For BookSleeve, I wanted an API that would work with current C#/.NET, but which would also mesh directly into C# 5 with the async/await pattern. The obvious option there was the Task API, which is familiar to anyone who has used the TPL in .NET 4.0, but which gains extension methods with the Async CTP to enable async/await.

This works, but in the process I found a few niggles that made me have a few doubts:

Completion without local scheduling

In my process, I never really want to schedule a task; the task is performed on a separate server, and what I really want to do is signal that something is now complete. There isn’t really an API for that on Task (since it is geared more for operations you are running locally, typically in parallel). The closest you can do is to ask the default scheduler to run your task immediately, but this feels a bit ungainly, not least because task-schedulers are not required to offer synchronous support.

In my use-case, I’m not even remotely interested in scheduling; personally I’d quite like it if Task supported this mode of use in isolation, perhaps via a protected method and a subclass of Task.

(yes, there is RunSynchronously, but that just uses the current scheduler, which in library code you can’t assume is capable of actually running synchronously).
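For reference, “faking it” via RunSynchronously looks roughly like the sketch below. This is an illustration of the idea rather than the actual BookSleeve code, and the helper name is made up: the task's delegate does no real work, it only hands back a value we already have.

```csharp
using System;
using System.Threading.Tasks;

static class RunSynchronouslyDemo
{
    // Hypothetical helper: turns an already-known reply into a completed task.
    public static Task<string> CompleteViaScheduler(string reply)
    {
        // The delegate only returns a value that has already arrived.
        var task = new Task<string>(() => reply);

        // Runs the delegate via the current scheduler. If that scheduler cannot
        // run it inline, the task is queued and this call blocks until it is done.
        task.RunSynchronously();
        return task;
    }

    static void Main()
    {
        Task<string> task = CompleteViaScheduler("PONG");
        Console.WriteLine(task.IsCompleted); // True
        Console.WriteLine(task.Result);      // PONG
    }
}
```

Either way the scheduler is involved just to mark something as finished, which is the ungainly part.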

Death by exception

The task API is also pretty fussy about errors – which isn’t unreasonable. If a task fails and you don’t explicitly observe the exception (by asking it for the result, etc), then it intentionally re-surfaces that exception in a finalizer. Having a finalizer is noteworthy in itself, and you get one last chance to convince the API that you’re sane – but if you forget to hook that exception it is a process-killer.
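A sketch of both safety nets, with hypothetical helper names: a continuation that observes the exception as soon as the task faults, and the TaskScheduler.UnobservedTaskException event as the “one last chance” hook. (From .NET 4.5 onwards the default policy changed and unobserved exceptions no longer terminate the process unless you opt back in, but the event is still raised.)

```csharp
using System;
using System.Threading.Tasks;

static class ObservedExceptionDemo
{
    // Attaches a continuation that reads task.Exception, which marks it as observed.
    public static Task ObserveErrors(Task task, Action<Exception> log)
    {
        return task.ContinueWith(
            t => log(t.Exception.GetBaseException()),
            TaskContinuationOptions.OnlyOnFaulted |
            TaskContinuationOptions.ExecuteSynchronously);
    }

    static void Main()
    {
        // Last-chance hook; not expected to fire in this demo because the
        // continuation below observes the exception first.
        TaskScheduler.UnobservedTaskException += (sender, args) =>
        {
            Console.WriteLine("Unobserved: " + args.Exception.GetBaseException().Message);
            args.SetObserved();
        };

        var source = new TaskCompletionSource<int>();
        Task observer = ObserveErrors(source.Task,
            ex => Console.WriteLine("Logged: " + ex.Message));

        source.SetException(new InvalidOperationException("server said no"));
        observer.Wait(); // prints "Logged: server said no"
    }
}
```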

So what would it take to do it ourselves?

So: what is involved in writing our own sync+async friendly API? It turns out it isn’t that hard; in common with things like LINQ (and foreach if you really want), the async API is pattern-based rather than interface-based; this is convenient for retro-fitting the async tools onto existing APIs without changing existing interfaces.

What you need (in the Async CTP Refresh for VS2010 SP1) is:

  • Some GetAwaiter() method (possibly but not necessarily an extension method) that returns something with all of:
      • A boolean IsCompleted property (get)
      • A void OnCompleted(Action callback)
      • A GetResult() method which returns void, or the desired outcome of the awaited operation
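The pattern above can be sketched as follows. Signal<T> is a hypothetical type written for this illustration (it is not the Future<T> from BookSleeve), and it acts as its own awaiter. One difference from the CTP: in the released C# 5 compiler the awaiter must also implement INotifyCompletion, so the sketch does that.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// Hypothetical set-once signal that can be awaited directly.
public sealed class Signal<T> : INotifyCompletion
{
    private readonly object gate = new object();
    private Action callbacks; // continuations registered before completion
    private bool done;
    private T value;

    public Signal<T> GetAwaiter() { return this; }

    public bool IsCompleted { get { lock (gate) return done; } }

    public void OnCompleted(Action continuation)
    {
        bool runNow;
        lock (gate)
        {
            runNow = done;
            if (!runNow) callbacks += continuation;
        }
        if (runNow) continuation(); // already complete: run immediately
    }

    public T GetResult()
    {
        lock (gate)
        {
            if (!done) throw new InvalidOperationException("Not completed yet.");
            return value;
        }
    }

    public void Set(T result)
    {
        Action toRun;
        lock (gate)
        {
            if (done) throw new InvalidOperationException("Already set.");
            value = result;
            done = true;
            toRun = callbacks;
            callbacks = null;
        }
        if (toRun != null) toRun(); // invoke continuations outside the lock
    }
}

static class AwaiterDemo
{
    public static async Task<int> AddOneAsync(Signal<int> signal)
    {
        int value = await signal; // compiles because Signal<int> has GetAwaiter()
        return value + 1;
    }

    static void Main()
    {
        var signal = new Signal<int>();
        Task<int> pending = AddOneAsync(signal);
        Console.WriteLine(pending.IsCompleted); // False
        signal.Set(41);                         // runs the rest of AddOneAsync
        Console.WriteLine(pending.Result);      // 42
    }
}
```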

So this isn’t a hugely challenging API to implement if you want to write a custom awaitable object. I have a working implementation that I put together with BookSleeve in mind. Highlights:

  • Acts as a set-once value with wait (sync) and continuation (async) support
  • Thread-safe
  • Isolated from the TPL, and scheduling in particular
  • Based on Monitor, but allowing efficient re-use of the object used as the sync-lock (my understanding is that once used/contested in a Monitor, the object instance obtains additional cost; may as well minimise that)
  • Supporting typed (Future<T>) or untyped (Future) usage – compares to Task<T> and Task respectively
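To give a flavour of the Monitor-based part, here is a deliberately simplified sketch of a set-once value with a blocking wait. SetOnce<T> is a made-up name, this is not the actual Future<T> code, and it leaves out both the continuation support and the re-use of the lock object mentioned above.

```csharp
using System;
using System.Threading;

// Hypothetical, simplified set-once value with a synchronous wait.
public sealed class SetOnce<T>
{
    private readonly object gate = new object();
    private bool done;
    private T value;

    // Returns false if a value had already been set.
    public bool TrySet(T result)
    {
        lock (gate)
        {
            if (done) return false;
            value = result;
            done = true;
            Monitor.PulseAll(gate); // wake every thread blocked in TryWait
            return true;
        }
    }

    // Blocks until the value is set or the timeout elapses.
    public bool TryWait(TimeSpan timeout, out T result)
    {
        lock (gate)
        {
            // Wait releases the lock while blocked and re-acquires it before returning.
            if (!done) Monitor.Wait(gate, timeout);
            result = value;
            return done;
        }
    }
}

static class SetOnceDemo
{
    static void Main()
    {
        var future = new SetOnce<string>();
        ThreadPool.QueueUserWorkItem(_ =>
        {
            Thread.Sleep(50); // pretend the reply takes a while to arrive
            future.TrySet("PONG");
        });

        string reply;
        bool ok = future.TryWait(TimeSpan.FromSeconds(5), out reply);
        Console.WriteLine(ok + " " + reply); // True PONG
    }
}
```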

My local tests aren’t exhaustive, but (over 500,000 batches of 10 operations each), I get:

Future (uncontested): 1993ms
Task (uncontested): 4126ms
Future (contested): 5487ms
Task (contested): 6787ms

So our custom awaitable object is faster… but I’m just not convinced that it is enough of an improvement to justify changing away from the Task API. This call density is somewhat artificial, and we’re talking less than a µs per-operation difference.

Conclusions

In some ways I’m pleasantly surprised with the results; if Task is keeping up (more or less), even outside of its primary design case, then I think we should forget about it; use Task, and move on to the actual meat of the problem we are trying to solve.

However, I’ll leave my experimental Future/Future<T> code as reference only off on the side of BookSleeve – in case anybody else feels the need for a non-TPL implementation. I’m not saying mine is ideal, but it works reasonably.

But: I will not be changing away from Task / Task<T> at this time. I’m passionate about performance, but I’m not (quite) crazy; I’ll take the more typical and more highly-tested Task API that has been put together by people who really, really understand threading optimisation, to quite ludicrous levels.

Tuesday, 19 April 2011

Practical Profiling

If you don’t know what is causing delays, you are doomed to performance woes. Performance is something I care about deeply, and there is no end of performance / profiling tools available for .NET. Some invasive and detailed, some at a higher level. And quite often, tools that you wouldn’t leave plugged into your production code 24/7.

Yet… I care about my production environment 24/7; and trying to reproduce a simulated load for the sites I work on can be somewhat challenging. So how can we get realistic and detailed data without adversely impacting the system?

Keep It Simple, Stupid

A common strapline in agile programming states:

Do the simplest thing that could possibly work

Now, I don’t profess to be strictly “agile”, or indeed strictly anything (except maybe “pragmatic”) in my development process, but there is much wisdom in the above. And it makes perfect sense when wanting to add constant (live) profiling capabilities.

So what is the simplest thing that could possibly work with profiling? Automated instrumentation? Process polling on the debugging API? Way too invasive. IoC/DI chaining with profiling decorators? Overkill. AOP with something like PostSharp? Unnecessary complexity. How about we just tell the system openly what we are doing?

Heresy! That isn’t part of the system! It has no place in the codebase!

Well, firstly – remember I said I was “pragmatic”, and secondly (more importantly) performance is both a requirement and a feature, so I have no qualms whatsoever changing my code to improve our measurements.

So what are you talking about?

Frustrated by the inconvenience of many of the automated profiling tools, I cobbled together the simplest, hackiest, yet fully working mini-profiler – and I thought I’d share. What I want, as a developer, is to be able to review the performance of pages I’m viewing in the production environment – sure, this doesn’t cover every scenario, but it certainly does the job on sites that are read-intensive. So say I browse to “http://mysite/grobbits/MK2-super-grobit” – I want immediate access to how that page was constructed (and where the pain was). And in particular, I want it live so I can hit “refresh” a few times and watch how it behaves as different caches expire. Nothing rocket-science, just a basic tool that will let me home in on the unexpected performance bumps. Finally, it can’t impact performance for the 99.99% of regular users who will never see that data.

I’m currently having great success using this mini tool; the concept is simple – you have a MiniProfiler object (which would be null for most users), and you just surround the interesting code:

using (profiler.Step("Set page title"))
{
    ViewBag.Message = "Welcome to ASP.NET MVC!";
}

using (profiler.Step("Doing complex stuff"))
{
    using (profiler.Step("Step A"))
    {   // something more interesting here
        Thread.Sleep(100);
    }
    using (profiler.Step("Step B"))
    {   // and here
        Thread.Sleep(250);
    }
}

(the Step(…) method is implemented as an extension method, so it is perfectly happy operating on a null reference; which is a crude but simple way of short-circuiting the timings for regular users)
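The null-friendly trick can be sketched like this. SketchProfiler and its members are hypothetical stand-ins, not the real MiniProfiler source; the point is only that an extension method is a static call, so it can check for null itself, and that a using block over a null resource simply skips Dispose.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for the real profiler type.
public sealed class SketchProfiler
{
    public IDisposable StartStep(string name) { return new Timing(name); }

    private sealed class Timing : IDisposable
    {
        private readonly string name;
        private readonly Stopwatch watch = Stopwatch.StartNew();

        public Timing(string name) { this.name = name; }

        // Disposal marks the end of the step and reports the elapsed time.
        public void Dispose()
        {
            watch.Stop();
            Console.WriteLine("{0} = {1}ms", name, watch.Elapsed.TotalMilliseconds);
        }
    }
}

public static class SketchProfilerExtensions
{
    // Extension methods are static calls, so 'profiler' may legitimately be null.
    public static IDisposable Step(this SketchProfiler profiler, string name)
    {
        return profiler == null ? null : profiler.StartStep(name);
    }
}

static class NullProfilerDemo
{
    static void Main()
    {
        SketchProfiler off = null; // what a regular user would get
        using (off.Step("Regular user"))
        {
            // no exception and no timing: using (null) never calls Dispose
        }

        var on = new SketchProfiler(); // what a developer would get
        using (on.Step("Developer"))
        {
            // the elapsed time is written when this block exits
        }
    }
}
```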

Obviously this isn’t very sophisticated, and it isn’t meant to be – but it is very fast, and easy to make as granular as you like as you focus in on some specific knot of code that is hurting. But for all its simplicity, it is remarkably useful in finding trouble spots on your key pages, and reviewing ongoing performance.


So what do I see?

The output is very basic but functional – you simply get a call tree of the code you’ve marked as interesting (above), to whatever granularity you marked it to; no more, no less. So with the sample project (see below), the home-page displays (in the html markup):


<!--
MYPC at 19/04/2011 11:28:12
Path: http://localhost:3324/
http://localhost:3324/ = 376.9ms
> Set page title = 0ms
> Doing complex stuff = 349.3ms
>> Step A = 99.3ms
>> Step B = 250ms
> OnResultExecuting = 27.5ms
>> Some complex thinking = 24.8ms
-->

(for your developers only; there is no change for users without the profiler enabled)

Stop waffling, Man!

Anyway, if you have similar (high-level, but live) profiling needs, I’ve thrown the mini-profiler onto google-code and NuGet, along with a tweaked version of the MVC3 (razor) sample project to show typical usage. Actually, you only need a single C# file (it really is basic code, honest).

  • If you find it useful, let me know!
  • If it sucks, let me know!
  • If I’ve gone mad, let me know!
  • If you’ve gone mad, keep that to yourself.

Monday, 11 April 2011

async Redis await BookSleeve

UPDATE

BookSleeve has now been succeeded by StackExchange.Redis, for lots of reasons. The API and intent are similar, but the changes are significant enough that we had to reboot. All further development will be in StackExchange.Redis, not BookSleeve.

ORIGINAL CONTENT

At Stack Exchange, performance is a feature we work hard at. Crazy hard. Whether that means sponsoring load-balancer features to reduce system impact, or trying to out-do the ORM folks on their own turf.

One of the many tools in our performance toolkit is Redis; a highly performant key-value store that we use in various ways:

  • as our second-level cache
  • for various tracking etc counters, that we really don’t want to bother SQL Server about
  • for our pub/sub channels
  • for various other things that don’t need to go direct to SQL Server

It is really fast; we were using the redis-sharp bindings and they served us well. I owe redis-sharp many thanks, and my intent here is not to critique it at all – but rather to highlight that in some environments you might need that extra turn of the wheel. First some context:

  • Redis itself is single threaded supporting multiple connections
  • the Stack Exchange sites work in a multi-tenancy configuration, and in the case of Redis we partition (mainly) into Redis databases
  • to reduce overheads (both handshakes etc and OS resources like sockets) we re-use our Redis connection(s)
  • but since redis-sharp is not thread-safe we need to synchronize access to the connection
  • and since redis-sharp is synchronous we need to block while we get each response
  • and since we are split over Redis databases we might also first have to block while we select database

Now, LAN latency is low; most estimates put it at around 0.3ms per call – but this adds up, especially if you might be blocking other callers behind you. And even more so given that you might not even care what the response is (yes, I know we could offload that somewhere so that it doesn’t impact the current request, but we would still end up adding blocking for requests that do care).

Enter BookSleeve

Seriously, what now? What on earth is BookSleeve?

As a result of the above, we decided to write a bespoke Redis client with specific goals around solving these problems. Essentially it is a wrapper around Redis dictionary storage; and what do you call a wrapper around a dictionary? A book-sleeve. Yeah, I didn’t get it at first, but naming stuff is hard.

And we’re giving it away (under the Apache License 2.0)! Stack Exchange is happy to release our efforts here as open source, which is groovy.

So; what are the goals?

  • to operate as a fully-functional Redis client (obviously)
  • to be thread-safe and non-blocking
  • to support implicit database switching to help with multi-tenancy scenarios
  • to be on-par with redis-sharp on like scenarios (i.e. a complete request/response cycle)
  • to allow absolute minimum cost fire-and-forget usage (for when you don’t care what the reply is, and errors will be handled separately)
  • to allow use as a “future” – i.e. request some data from Redis and start some other work while it is on the wire, and merge in the Redis reply when available
  • to allow use with callbacks for when you need the reply, but not necessarily as part of the current request
  • to allow C# 5 continuation usage (aka async/await)
  • to allow fully pipelined usage – i.e. issue 200 requests before we’ve even got the first response
  • to allow fully multiplexed usage – i.e. it must handle meshing the responses from different callers on different threads and on different databases but on the same connection back to the originator

(actually, Stack Exchange didn’t strictly need the C# 5 scenario; I added that while moving it to open-source, but it is an excellent fit)
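To make the pipelining goal a little more concrete, here is a toy sketch. It is not how BookSleeve is actually implemented and every name in it is hypothetical; it relies only on the fact that Redis answers commands on a connection in the order they were sent, so a queue of pending completions is enough to match each reply to its caller.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Toy model of a pipelined connection; no real socket is involved.
public sealed class PipelineSketch
{
    private readonly Queue<TaskCompletionSource<string>> pending =
        new Queue<TaskCompletionSource<string>>();

    // Called by any thread; returns immediately with a task for the reply.
    public Task<string> Send(string command)
    {
        var source = new TaskCompletionSource<string>();
        lock (pending)
        {
            // A real client would also write the command to the socket inside
            // this lock, so that queue order matches wire order.
            pending.Enqueue(source);
        }
        return source.Task;
    }

    // Called by the (single) reader as each reply arrives off the wire.
    public void OnReply(string reply)
    {
        TaskCompletionSource<string> source;
        lock (pending)
        {
            source = pending.Dequeue(); // the oldest request owns this reply
        }
        source.SetResult(reply); // complete outside the lock
    }
}

static class PipelineDemo
{
    static void Main()
    {
        var connection = new PipelineSketch();

        // Three requests are issued before any reply has been seen.
        Task<string> first = connection.Send("INCR hits");
        Task<string> second = connection.Send("INCR hits");
        connection.Send("INCR misses"); // fire-and-forget: the task is ignored

        Console.WriteLine(first.IsCompleted); // False

        connection.OnReply("1");
        connection.OnReply("2");
        connection.OnReply("1");

        Console.WriteLine(first.Result + " " + second.Result); // 1 2
    }
}
```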

Where are we? And where can I try it?

It exists; it works; it even passes some of the tests! And it is fast. It still needs some tidying, some documentation, and more tests, but I offer you BookSleeve:

http://code.google.com/p/booksleeve/

The API is very basic and should be instantly familiar to anyone who has used Redis; and documentation will be added.

In truth, the version I’m open-sourcing is more like the offspring of the version we’re currently using in production – you tend to learn a lot the first time through. But as soon as we can validate it, Stack Exchange will be using BookSleeve too.

So how about some numbers

These are based on my dev machine, running redis on the same machine, so I also include estimates using the 0.3ms latency per request as mentioned above.

In each test we are doing 5000 INCR commands (purely as an arbitrary test); spread over 5 databases, in a round-robin in batches of 10 per db – i.e. 10 on db #0, 10 on db #1, … 10 on db #4 – so that is an additional 500 SELECT commands too.

redis-sharp:

  • to completion 430ms
  • (not meaningful to measure fire-and-forget)
  • to completion assuming 0.3ms LAN latency: 2080ms

BookSleeve

  • to completion 391ms
  • 2ms fire-and-forget
  • to completion assuming 0.3ms LAN latency: 391ms

The last 2 are the key, in particular noting that the time we aren’t waiting on LAN latency is otherwise-blocking time we have subtracted for other callers (web servers tend to have more than one thing happening…); the fire-and-forget performance allows us to do a lot of operations without blocking the current caller.
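The redis-sharp estimate can be reproduced with some simple arithmetic, assuming (my reading of the numbers) that a blocking client pays the 0.3ms round trip once per command, for all 5000 INCR plus 500 SELECT commands. The pipelined figure is reported as unchanged because the commands are not waiting on each other's round trips.

```csharp
using System;

static class LatencyEstimate
{
    // Assumption: a blocking client pays one full round trip per command.
    public static double BlockingMs(double measuredMs, int commands, double latencyMs)
    {
        return measuredMs + commands * latencyMs;
    }

    static void Main()
    {
        int commands = 5000 + 500; // INCR commands plus the extra SELECT commands
        double estimate = BlockingMs(430, commands, 0.3);
        Console.WriteLine("{0:F0}ms", estimate); // 2080ms
    }
}
```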

As a bonus we have added the ability to do genuinely parallel work on a single caller – by starting a Redis request first, doing the other work (TSQL typically), and then asking for the Redis result. And let’s face it, while TSQL is versatile, Redis is so fast that it would be quite unusual for the Redis reply not to already be there by the time you get to look.

Wait – did you say C# 5?

Yep; because the API is task based, it can be used in any of 3 ways without needing separate APIs:

As an example of the last:

(image: async redis)

IMPORTANT: in the above “await” does not mean “block until this is done” – it means “yield back to the caller here, and run the rest as a callback when the answer is available” – or for a better definition see Eric Lippert’s blog series.
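As a rough illustration of what a task-based API allows, the sketch below consumes the same Task<T> by blocking, via a callback, and via await. This is one plausible reading based on the goals listed earlier (future, callbacks, async/await). IncrementAsync is a hypothetical stand-in that fakes a reply after a short delay; it is not a BookSleeve method.

```csharp
using System;
using System.Threading.Tasks;

static class UsageDemo
{
    // Hypothetical stand-in for a client call; completes after a short delay.
    public static Task<long> IncrementAsync(string key)
    {
        var source = new TaskCompletionSource<long>();
        Task.Delay(10).ContinueWith(_ => source.SetResult(1));
        return source.Task;
    }

    public static async Task<long> AwaitIt()
    {
        // Yields to the caller here; the rest runs when the reply is available.
        long value = await IncrementAsync("hits");
        return value;
    }

    static void Main()
    {
        // 1: start the request, then block only when the value is needed
        Task<long> future = IncrementAsync("hits");
        Console.WriteLine("wait: " + future.Result);

        // 2: callback when the reply arrives
        Task callback = IncrementAsync("hits")
            .ContinueWith(t => Console.WriteLine("callback: " + t.Result));

        // 3: C# 5 continuation (blocking on it here only because Main is synchronous)
        Console.WriteLine("await: " + AwaitIt().Result);

        callback.Wait(); // make sure the callback has run before exiting
    }
}
```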

And did I mention…

…that a high performance binary-based dictionary store works well when coupled with a high performance binary serializer? ;p