Friday, 23 August 2019

Prefer ValueTask to Task, always; and don't await twice

Preamble - not a part 2

A little while ago I blogged here and I set it up to be a "continues..." style post. I haven't had the energy to continue it in that context, and this fact was putting me off concluding the post. I then realised: the thing that matters isn't some overarching narrative structure, but that I get my ideas down. So: I'm aborting any attempt at making this post a continuation, and just focusing on the content!

Prefer ValueTask[<T>] to Task[<T>], always.

There's been a lot of confusion over when to use Task[<T>] vs ValueTask[<T>] (note: I'm going to drop the [<T>] from now on; just pretend they're there when you see Task / ValueTask etc).

Context: what are Task and ValueTask?

In case you don't know, Task and ValueTask are the two primary implementations of "awaitable" types in .NET; "awaitable" here means that there is a duck-typed signature that allows the compiler to turn this:

int i = await obj.SomeMethodAsync();

into something like this:

var awaiter = obj.SomeMethodAsync().GetAwaiter();
if (!awaiter.IsCompleted)
{
    // voodoo here that schedules a
    // continuation that resumes here
    // once the result becomes available
}
int i = awaiter.GetResult();

Task is the original and most well known API, since it shipped with the TPL, but it means that an object allocation is necessary even for scenarios where it turns out that it was already available, i.e. awaiter.IsCompleted returned true. The ValueTask value-type (struct) acts as a hybrid result that can represent an already completed result without allocating or an incomplete pending operation. You can implement your own custom awaitables, but it isn't common.

When to choose each, the incorrect version

If you'd asked me a while back about when to choose each, I might have incorrectly said something like:

Use Task when something is usually or always going to be genuinely asynchronous, i.e. not immediately complete; use ValueTask when something is usually or always going to be synchronous, i.e. the value will be known inline; also use ValueTask in a polymorphic scenario (virtual, interface) where you can't know the answer.

The logic behind this incorrect statement is that if something is incomplete, your ValueTask is going to end up being backed by a Task anyway, but without the extra indirection and false promise of ValueTask. This is incorrect, though, because it is based on the premise that a ValueTask is a composite of "known result (T)" and "Task". In fact, ValueTask is also a composite of a third thing: IValueTaskSource[<T>].

What is IValueTaskSource[<T>]?

IValueTaskSource is an abstraction that allows you to represent the logical behaviour of a task separately to the result itself. That's a little vague, so an example:

IValueTaskSource<int> someSource = // ...
short token = // ...
var vt = new ValueTask<int>(someSource, token);
// ...
int i = await vt;

This now functions like you'd expect from an awaitable, but even in the incomplete/asynchronous case the logic about how everything works is now down to whatever implements the interface - it does not need to be backed by a Task. You might be thinking:

ah, but we still need an instance of whatever is implementing the interface, and we're treating it as a reference, so: we're still going to allocate; what's the point? what have you gained?

And that's when I need to point out the short token. This little gem allows us to use the same interface instance with multiple value-tasks, and have them know the difference. There are two ways you could use this:

  • keep the state for multiple asynchronous operations concurrently, using the token to pick the correct state (presumably from a vector)
  • keep a single piece of state for multiple consecutive operations, using the token to guarantee that we're talking about the correct one

The second is actually by far the more common implementation, and in fact is now included in the BCL for you to make direct use of - see ManualResetValueTaskSourceCore<T>.

So what? How does this help me?

OK; so - we've seen that this alternative exists. There are two ways that people commonly author awaitable APIs today:

  • using TaskCompletionSource<T> and handing the caller the .Task (perhaps wrapped in a ValueTask), and calling TrySetResult etc when we want to trigger completion
  • using async and await, having the compiler generate all the machinery behind the scenes - noting that this currently involves creating a Task in the incomplete case, even for ValueTask methods (because it has to come from somewhere)

Hopefully you can see that if we have ValueTask available to us it is relatively easy to substitute in a ManualResetValueTaskSourceCore backer, allowing us to reuse the same IValueTaskSource instance multiple times, avoiding lots of allocations. But: there's an important caveat - it changes the API. No, really. Let's take a stroll to discuss how...

Don't await twice

Right now, the following code works - assuming the result is backed by either a fixed T or a Task<T>:

var pending = obj.SomeMethodAsync();
int i = await pending;
// ...
int j = await pending;

You'll get the same answer from each await, unsurprisingly - but the actual operation (the method) is only performed once. But: if we switch to ManualResetValueTaskSourceCore, we should only assume that each token is valid exactly once; once we've awaited the result, the entire point is that the backing implementation is free to re-use that IValueTaskSource with a different token for another consumer. That means that the code shown above is no longer legal, and we should expect that the second await can now throw an exception about the token being incorrect.

This is a pretty rare thing to see in code, so personally I'm OK with saying "tough; await once only". Think of it in human terms; this is like a manager going to someone's desk and saying:

Hi, I need the answer to (some topical question); do you know that now? if so, tell me now; otherwise, when you have the answer, bring it (somewhere) and nudge me.

All fine and reasonable so far; our office hero didn't know the answer right away, so they went away and got it, took it where instructed and handed the answer to the manager.

20 minutes later (or 2 days later), the manager stops by their desk again:

Hey, give me that answer

At this point, our hero might reasonably say

Boss, I already gave it you; I only printed it out once - you have the copy; I deal with lots of requests each day, and I can't even remember what you asked about, let alone what the answer was; if you've forgotten the answer, that's on you - feel free to ask again, it's all billable

This is kinda how I anthropomorphize ValueTask, especially in the context of IValueTaskSource. So key point: don't await twice. Treat the results of awaitables exactly the same as you would the result of any other expression: if you are going to need the value twice, store it in a local when you first fetch it.

How else can we benefit from IValueTaskSource?

So; we've seen how we can manually use an IValueTaskSource to efficiently issue ValueTask awaitable results; but if we use async/await, in the incomplete / asynchronous case the compiler is still going to be generating a Task - and also generating a bunch of other state boxes associated with the continuation voodoo. But.. it doesn't have to! A while ago I did some playing in this area that resulted in "Pooled Await"; I'm not going to go into details about this here, and for reasons that will become clear in a moment, I don't recommend switching to this, but the short version is: you can write a method that behaves exactly like a ValueTask awaitable method (including async), but the library makes the compiler generate different code that using IValueTaskSource to avoid the Task allocation, and uses state machine boxing to reduce the other allocations. It works pretty well, but as you might expect, it has the above caveat about awaiting things more than once

So; why am I saying don't leap at this? That because the BCL folks are also now playing in this space, as evidenced by this PR, which has pretty much the exact same feature set, but the advantages of:

  • being written by people who really, really understand async
  • it not adding any dependencies - it would just work out of the box for ValueTask awaitables

If that happens, then a lot of asynchronous code will magically get less allocatey all at once. I know this is something they've discussed in the past, so maybe my "Pooled Await" stuff gave them the metaphorical kick to go and take another look at implementing it for real; or maybe it was just a timing coincidence.

For both my own implementation and the BCL version, it can't do all the magic if you return Task - for best results, a ValueTask is needed (although "Pooled Await" still reuses the state-machine boxes for Task APIs)

Conclusion

So, going back to the earlier question of when to use Task vs ValueTask, IMO the answer is now obvious:

Use ValueTask[<T>], unless you absolutely can't because the existing API is Task[<T>], and even then: at least consider an API break

And also keep in mind:

Only await any single awaitable expression once

If we put those two things together, libraries and the BCL are free to work miracles in the background to improve performance without the caller needing to care.

Thursday, 21 February 2019

Fun with the Spiral of Death

Subtitled: "a cautionary tale of SemaphoreSlim", an adventure in two parts:

  • In part 1 I want to discuss a very fun series of problems we had in some asynchronous code - where "fun" here means "I took Stack Overflow offline, again". Partly because it is a fun story, but mostly because I think there's some really useful learning points in there for general adventures in asynchronicity
  • In part 2 I want to look at some of the implementation details of our eventual fix, which covers some slightly more advanced themes around how to implement awaitable code in non-trivial scenarios

I took Stack Overflow offline, again

As a side note: many of the themes here run hand-in-hand with David and Damian's recent presentation "Why your ASP.NET Core application won't scale" at NDC; if you haven't seen it yet: go watch it - in particular everything around "the application works fine until it suddenly doesn't" and "don't sync-over-async or async-over-sync".

A lot of this journey relates to our migration of StackExchange.Redis to use "pipelines", the new IO layer in .NET (previously discussed here, here, here, and here - I love me some pipelines). One of the key design choices in StackExchange.Redis is for the library to implement multiplexing to allow multiple concurrent calling threads to communicate over the same underlying socket to the server; this keeps the socket count low while also helping to reduce packet fragmentation, but it means that we need to do some synchronization around how the many caller threads access the underlying socket.

Before the pipeline migration, this code was basically synchronous (it was a bit more complex, but… that's close enough), and the "write an actual command" code could be expressed (if we take some liberties for readability) as below:

readonly object syncLock = new object(); // single writer

void WriteMessage(Message message)
{
    bool haveLock = false;
    try
    {
        Monitor.TryEnter(syncLock, timeout, ref haveLock);
        if (!haveLock) ThrowTimeout();

        ActuallyWriteTheThing(message);
        Flush();
    }
    finally
    {
        if (haveLock) Monitor.Exit(syncLock);
    }
}

This is a fairly normal style of coding - the try/finally/Monitor/haveLock code here is just a standard implementation of "lock with a timeout", so all this really does is:

  • try to acquire exclusive access to the socket, guarded by syncLock
  • if successful, write and flush

All reasonable. But then we moved to pipelines, and one of the defining features of the pipelines implementation is that key steps in it are async. You might assume that it is the write that is async - but since you write to a buffer pool, this isn't actually the case - it's the flush that is async. The flush in pipelines achieves a few different things:

  • if necessary, it activates the consumer that is pulling work from the pipe and sending it to the next step (a socket in our case)
  • it provides back-pressure to the provider (WriteMessage in this case), so that if the consumer is falling behind and there's too much backlog, we can slow down the provider (in an asynchronous way) so we don't get unbounded buffer growth

All very neat.

But switching from synchronous code to an API that uses async is not always trivial - async begets async, and once you start going async, it all goes async. So… I did a bad thing; I was lazy, and figured "hey, flush will almost always complete synchronously anyway; we can probably get away with a sync-over-async here" (narrator: they didn't get away with it).

So; what I did was something like:

readonly object syncLock = new object(); // single writer

void WriteMessage(Message message)
{
    bool haveLock = false;
    try
    {
        Monitor.TryEnter(syncLock, timeout, ref haveLock);
        if (!haveLock) ThrowTimeout();

        ActuallyWriteTheThing(message);
        FlushSync();
    }
    finally
    {
        if (haveLock) Monitor.Exit(syncLock);
    }
}

void FlushSync() // evil hack, DO NOT USE
{
    var flush = FlushAsync();
    if (!flush.IsCompletedSuccessfully)
    {
        flush.Wait();
    }
}

The IsCompletedSuccessfully here is a check you can use on many task-like (awaitable) results to see if it completed synchronously and without faulting; if it did, you're safe to access the .Result (etc.) and it will all be available already - a good trick for avoiding the async state-machine complications in high-throughput code (typically library code, not application code). The bad bit is the .Wait(…) when it isn't already completed - this is a sync-over-async.

What happened next?

A key thing to keep in mind is that StackExchange.Redis exposes both synchronous and asynchronous APIs - i.e. there are twin methods, for example:

  • RedisValue StringGet(RedisKey key)
  • Task<RedisValue> StringGetAsync(RedisKey key)

Internally they are implemented very differently so that they both get the job done with the minimum of fuss and overhead, but they were both calling into the same WriteMessage at some point. Actually, never afraid to double-down on the anti-patterns, this means that for the async callers, they were effectively doing async-over-sync-over-async; ouch.

The WriteMessage code above is used from both the synchronous and asynchronous call paths. As it happens, much of our internal existing application codebase mostly uses the synchronous paths (we're gradually adding more async, but we need to complete our in-progress transition from .NET Framework to .NET Core to be able to do it more extensively), and on the synchronous paths you were always going to be blocked anyway, so from the perspective of synchronous callers, there's not really that much wrong with the above. It does what it promises: execute synchronously.

The problem here comes from asynchronous callers, who thought they were calling StringGetAsync, and their thread got blocked. The golden rule of async is: don't block an async caller unless you really, really have to. We broke this rule, and we had reports from users about big thread-jams with async call paths all stuck at WriteMessage, because one thread had paused for the flush, and all the other threads were trying to obtain the lock.

Note: the problem here isn't that "a backlog happened, and we had to delay" - that's just business as normal. That happens, especially when you need mutex-like semantics. The problem is that we blocked the worker threads (although we did at least have the good grace to include a timeout), which under heavy load caused thread-pool starvation and a cascading failure (again: watch the video above).

So what should we have done in theory?

Given that we have both synchronous and asynchronous call-paths, what we should do is have two versions of the write code:

  • void WriteMessage(Message message)
  • ValueTask WriteMessageAsync(Message message)

but we get into immediate problems when we talk about our locking mechanism. We can see this more clearly if we use a simple lock rather than the more complex Monitor usage above - the following does not compile:

async ValueTask Foo()
{
    lock (syncLock)
    {
        // CS1996 - Cannot await in the body of a lock statement
        await Task.Delay(SomeAmount);
    }
}

The reason this doesn't work is that lock (aka Monitor) is thread-oriented. You need to [Try]Enter (take the lock) and Exit (release the lock) the constrained region from the same thread. But the moment you await, you're saying "this might continue synchronously, or it might resume later on a different thread". This actually has two consequences:

  • it would mean that when we try to release the lock, it will fail because the resuming thread probably won't actually have it
  • when we await, we're releasing the current thread back to do whatever else needs doing… which could actually end up calling back into Foo… and Monitor is "re-entrant", meaning: if you have the lock once, you can actually lock again successfully (it maintains a counter internally), which means that code in a completely unrelated execution context could incorrectly end up inside the lock, before we've resumed from the await and logically released it

As a side note, it is worth knowing that the compiler only spots this (CS1996) if you use lock; if you use manual Monitor code (because of timeouts), it won't warn you - you just need to know not to do this (which perhaps by itself is good motivation for "lock with timeout" as a language feature). Fortunately, I did know not to do this - and I moved to the next most obvious locking primitive: SemaphoreSlim. A semaphore is like Monitor, but instead of being thread-based, it is purely counter-based. Theoretically you can use a semaphore to say "no more than 5 in here", but in reality it is often used as a mutex by saying "no more than 1". SemaphoreSlim is particularly enticing because it has both synchronous and asynchronous APIs, allowing us to split our code in two fairly neatly:

readonly SemaphoreSlim singleWriter
    = new SemaphoreSlim(1); // single writer

void WriteMessage(Message message)
{
    if (!singleWriter.Wait(timeout))
        ThrowTimeout();
    try
    {
        ActuallyWriteTheThing(message);
        FlushSync(); // our hack from before
    }
    finally
    {
        singleWriter.Release();
    }
}

async ValueTask WriteMessageAsync(Message message)
{
    if (!await singleWriter.WaitAsync(timeout))
        ThrowTimeout();
    try
    {
        ActuallyWriteTheThing(message);
        await FlushAsync();
    }
    finally
    {
        singleWriter.Release();
    }
}

This looks broadly similar to what we had before; the new SemaphoreSlim(1) initializes the semaphore with a limit of 1, i.e. a mutex. In the synchronous path, it works mostly like it always did, but the asynchronous path (now used by the asynchronous callers) now correctly releases worker threads back to wherever worker threads go - when either they can't get the lock yet, or when they are waiting (or rather: awaiting) on the flush. We still have the sync-over-async in the sync path, but that's not really a problem in this case - but we've completely fixed the async path. Short of removing or optionally disabling the sync path (which is an idea I'm putting serious thought into doing, as an opt-in thing), that's probably about as good as we can get.

This looks like it should work, and the chances are that this would have completely solved the problems being seen by our consumers with heavily asynchronous workloads. But one of the nice things about working at Stack Overflow is that I have an opportunity to dogfood library releases under Stack Overflow load (which isn't "big big" by any stretch, but it is comfortably big enough to give me confidence that the library isn't pathologically broken). So, we dropped the above changes into production (after testing etc.), and: BOOM!

We went down.

What happened there?

Fortunately, we were lucky enough to manage to grab some process dumps from the production servers in their death-throes before we stood them back up (with the older version of the library), and the stack-traces in the doomed processes were very interesting; they are pretty verbose, but something that kept recurring (note: I've inverted and summarised this trace for readability):

WriteMessage
…
System.Threading.SemaphoreSlim.Wait
…
System.Threading.SemaphoreSlim.WaitAsync
…
KERNELBASE!WaitForMultipleObjectsEx

This was the case for 650+ threads - almost all of them; and critically, no-one actually had the lock - nobody was doing anything useful. The semaphore had, in an edge case, failed to activate the lucky winner of the conch.

What actually went wrong?

Looking at it, our synchronous WriteMessage implementation, when calling Wait on the semaphore, was calling into WaitAsync, and then blocking at the kernel for the object. Despite looking odd, this by itself isn't actually a terrible idea. It turns out that SemaphoreSlim has different strategies that it uses internally:

  • if you're just using synchronous Wait calls, it can handle everything using regular synchronous code and syncronous blocks
  • if you're just using WaitAsync, because it wants to release the caller promptly, it needs to maintain a queue (actually a linked-list) of waiting callers as Task<bool>; when you release something, it takes the next item from one end of the list, and reactivates (TrySetResult) that caller
  • if you're using a mixture of Wait and WaitAsync, if it can't get access immediately, then it uses the WaitAsync approach so that the Wait and WaitAsync consumers are in the same queue - otherwise you'd have two separate queues and it gets very confusing and unpredictable

Now this seems fine, but it turns out that the way it was using TrySetResult was… problematic. It wasn't using TrySetResult directly, but instead was enqueuing a work item to do the TrySetResult. There's actually a good - albeit now legacy - reason for this: thread stealing, another problem I've had to contend with many times.

When you call TrySetResult etc. on a Task<T> (usually via TaskCompletionSource<T>), it is possible (likely, even) that the async continuation is going to run immediately and inline on the thread that called TrySetResult. This is something you need to be really careful about - it can lead to dedicated IO threads somehow ending up serving web requests; or more generally: just … not doing what you expected. But in the scenario presented we got into a "spiral of death": due to a very brief blip from the FlushAsync, our workers had got stuck in the Wait->WaitAsync path, and the very thing that was meant to unblock everything: needed a worker. To release (resource) you need more of (resource), and (resource) is currently exhausted. It is almost impossible to recover from that situation due to the growth limits on workers, and the servers became increasingly unstable until they stopped working completely.

This is clearly a dangerous scenario, so we reported it as an issue, and amazingly within a day Stephen Toub had a surprisingly minimal and elegant fix for SemaphoreSlim. The commit message (and code changes themselves) explain it in more depth, but by configuring the queued tasks with the TaskCreationOptions.RunContinuationsAsynchronously flag, it means the "release" code can call TrySetResult directly, without needing an extra worker as an intermediary. In the specific case where the only thing waiting on the task is a synchronous Wait, the task code already has specific detection to unblock that scenario directly without needing a worker at all, and in the genuine async/await case, we just end up with the actual work going to the queue, rather than the "call TrySetResult" going to the queue. Tidy!

But that isn't the end of the story

It would be nice to say "all's well that ends well; bug in SemaphoreSlim fixed", but it isn't as easy as that. The fix for SemaphoreSlim has been merged, but a) that won't help "us" until the next .NET Framework service release, and b) as library authors, we can't rely on which service releases are on our consumers' machines. We need a fix that works reliably everywhere. So whilst it is great to know that our pain has improved things for future users of SemaphoreSlim, we needed something more immediate and framework-independent. So that's when I went away and created a bespoke synchronous/asynchronous MutexSlim that we are now using in StackExchange.Redis.

It is amazing how much simpler things become if you limit yourself to "0 or 1 people in the pool", so it wasn't actually that much work; but: I thought I knew a lot about async/await, yet in writing MutexSlim I dove deeper into that topic than I have usually had to; and in the second part I'll talk about some of what I learned.