Saturday, 26 November 2011

The autistic elephant in the server-room

A bit of a serious blog entry this time. Normally I blog about language lemmas, IL intricacies, serialization subtleties, and materialisation mechanisms; but occasionally bigger topics present themselves.

This blog entry is probably more for my purposes than yours. But you might like it anyway. If not, just come back next time.

Woah, lots of words; why should I read this?


Because knowledge never hurts. You might start to understand a colleague better; maybe it will help you recognise autistic traits in a family member; or maybe you'll just think for a second before "tutting" when you see a child apparently misbehaving in the supermarket - perhaps you don't know what is happening quite as much as you think.

Why am I putting this on my geek blog?


Because, while the reasons still aren't exactly clear [edit: alternative], it tends to present more commonly in the children (mainly sons) of people in geeky professions. Indeed, our profession has a pretty high rate of successful autistic or Asperger's members; some very notable, some doing the day job like everyone else.

More specifically, I care because my eldest son is autistic. Now, autism (or more correctly, ASD) is a pretty large spectrum. In the UK, there is a trend not to bother trying to diagnose more specific groupings (like Asperger's) because it doesn't actually help with treatment: everyone with ASD needs to be assessed as an individual to understand their needs.

Evan is at the (perhaps unfairly named) "high functioning" end of the spectrum, which means he is of pretty normal intelligence and largely independent, but experiences the world in a slightly different way.

If you want to very quickly get a feel for stereotypical ASD at the "high" end, then Sheldon Cooper is perhaps the easiest place to look, with two caveats:

- there is no automatic "brilliance"; normal intelligence (albeit focused) is more likely
- everybody truly is individual; it helps to remember that generally speaking, generalisations don't help

Also, AFAIK The Big Bang Theory has never explicitly labelled Sheldon with ASD or Asperger's.

(boo, someone deleted my Sheldon image... take your pick from here)

Typical observations would include (not exhaustive):

- attention / interest can be intensely focused on a narrow topic, to the exclusion of others
- sensory perception (sound, touch, taste, smell, sight, etc) can be extreme (both hyper- and hypo-); for example, simply the washing/size-label on a t-shirt can cause agitation
- cognitive processing can get focused on a particular detail rather than processing the wider context; in programming terms think "depth first" rather than "breadth first"
- language and interpretation can be very literal (for example, "laughing your head off" could be concerning) and repetitive
- routine and predictability are important (in particular, this reduces the stress of processing unfamiliar scenarios), to the point where an individual can seem inflexible and stubborn
- social interaction tends to be limited (people aren't predictable, and often aren't very interesting - even less so if they aren't talking about your preferred topic) and a little awkward (not least, brutal honesty doesn't always go down well - "that lady smells bad" etc)
- there can be a definite need for "quiet time", or just time to unwind in whatever way works. That might be "stimming", or it might be sitting under a table for half an hour
- recollection (in particular visual memory) can be more acute

Of course, at the other end of ASD are far more debilitating issues, which may mean dedicated life-long care, maybe institutionalisation. It is not always gentle. I can't speak of this from experience, so I won't try.

So... What is the point of this blog entry?


It is a serious issue that affects our industry, perhaps more than most. Also, our industry is well suited to ASD. Computers are predictable; IT has a use for massively-focused detail-oriented people (experts perhaps in a very narrow and specialised field), and people who can quickly spot some really minor and subtle differences.

Perhaps more importantly though: it shouldn't be something we don't talk about. ASD tends to be an invisible and private thing (you can see a wheel-chair or hearing-aid; you can't necessarily see that the person you are talking to is actually massively stressed to breaking point because a road was closed and they took a different route to work, and then the 2nd lift on the right - THE LIFT THEY USE - was out of order).

Ultimately, no matter where on the spectrum someone is, we should not feel anything like embarrassment. I'm not embarrassed by my son - he's awesome! It took us as a family a little time to properly understand him and his needs, but I think we have something that works well now. And his current school is really great (mainstream, but really understand ASD; his last two schools..... not so much).

If you have a child with ASD


Moving from denial through to acceptance can take time. There are only so many birthday parties you can take them to where opening and closing a door for an hour is more interesting than the hired entertainer - eventually you need to accept that they are simply different. Not better or worse; just different.

Yes, I know, it can be hard sometimes. You may feel like a chess grand-master constantly thinking 12 steps ahead to avoid some meltdown later in the day. Remembering to give advance notice of change, and implementing damage limitation when something truly unexpected happens. Talking openly about it can be hard (yet is also very liberating). There are often local groups of other parents with similar experiences that you may find helpful (assuming you can arrange cover - regular babysitters may not work out so well).

Laughing helps. Try reading #youmightbeanautismparentif. Or watching Mary and Max.

Also, don't buy into any of the snake-oil.

If you have ASD


Then you know ASD better than me. I've seen some folks with ASD get annoyed before because it is always "parents of..." doing the talking. Well, the world is your podium: I'd love to hear your thoughts. The "parents of" also have an awful lot to bring to the table, though.

And for Evan


He's doing great. He likes maths and his handwriting is almost as bad as mine. He can probably name a handful of the children in his class, but I'm sure he could tell me how many lights there are in each room in the school (and how many have broken/missing bulbs). We have good days, and we've had some really horrible days - but as we get better, together, at knowing what each of us needs, we get more of the former and less of the latter.

Final words




(image: a quote attributed to Dr. Sheldon Cooper)

Monday, 24 October 2011

Assault by GC

TL;DR;

We had performance spikes, which we eased with some insane use of structs.

History

For a while now, we had been seeing a problem in the Stack Exchange engine where we would see regular and predictable stalls in performance. So much so that our sysadmins blogged about it back here. Annoyingly, we could only see this problem from the outside (i.e. haproxy logging, etc) – and to cut a very long story short, these stalls were due to garbage collection (GC).

We had a problem, you see, with some particularly large and long-lived sets of data (that we use for some crazy-complex performance-related code), that would hobble the server periodically.

Short aside on Server GC

The regular .NET framework has a generational GC – meaning: new objects are allocated in “generation 0” (GEN-0). When it chooses, the system scans GEN-0 and finds (by walking references from the so-called “roots”) which (if any) of the objects are still in use by your application. Most objects have very short lives, and it can very quickly and efficiently reclaim the space from GEN-0; any objects that survived move to GEN-1. GEN-1 works the same (more-or-less), but is swept less often – moving any survivors into GEN-2. GEN-2 is the final lurking place for all your long-lived data – it is swept least often, and is the most expensive to check (especially if it gets big).

Until .NET 4.5 rolls into town with a background server GC, checking GEN-2 is a “stop the world” event – it (usually briefly) pauses your code, and does what it needs to. Now imagine you have a huge set of objects which will never be available to collect (because you will always be using them) all sat in GEN-2. What does that look like? Well, using StackExchange Data Explorer to analyse our haproxy logs, it looks a bit like this:

(image: response-time graph from our haproxy logs, showing periodic spikes)

I’ve omitted the numbers, as they don’t matter; but the interpretation is: normally the server is ticking along with nice fast response times, then WHAM! A big spike (which we have correlated with GC) that just hammers the response-times.

So what to do?

We take performance very seriously at Stack Exchange, so as you imagine this got a bit of attention. The obvious answer of “don’t keep that data”, while a valid suggestion, would have hurt a lot of the overall performance, so we needed to find a way to remove or reduce this while keeping the data.

Our initial efforts focused on removing things like unnecessary copy/extend operations on the data, which helped some, but didn’t really make a step-change. Eventually, we concluded…

Break all the rules

Important: the following is merely a discussion of what has helped us. This is not a magic bullet, and should only be applied to some very limited scenarios after you have profiled and you know what you are doing and why. And it helps if you are just a little bit crazy.

First, a brief overview – imagine you are holding Customer objects in memory for an extended period; you have a lot of them in a List<Customer>, and occasionally add more (with suitable threading protection, etc). Further, you have some pre-calculated subsets of the data (perhaps by region) – so a number of smaller List<Customer>. That is entirely inaccurate, but sets the scene ;p

After exhausting alternatives, what we did was:

  • change Customer from a class to a struct (only within this crazy code)
  • change the main store from a List<Customer> to a Customer[]
  • change the subsets from List<Customer> to List<int>, specifically the offset into the main Customer[]

eek; so what does that do for us?

  • the main data is now a single object (the array) on the "large object heap", rather than lots of small objects
  • by using direct access into an array (rather than a list indexer) you can access the data in-situ, so we are not constantly copying Customer values on the stack
  • the int offsets for subsets are essential to make sure we don't have multiple separate values of each record, but on x64 using int offsets rather than references also means our subsets suddenly take half the memory
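To make that concrete, here is a minimal sketch (using the fictional Customer example, not our real code):

using System.Collections.Generic;

struct Customer // was: class Customer
{
    public int Id;
    public decimal Balance;
}

class DataStore
{
    // the main store: one big array - a single object on the large object heap
    Customer[] customers = new Customer[500000];

    // a pre-calculated subset: offsets into the array, not references
    List<int> regionSubset = new List<int>();

    int FirstIdInRegion()
    {
        // indexing the array accesses the struct in-situ; reading .Id
        // does not copy the whole Customer onto the stack
        return customers[regionSubset[0]].Id;
    }
}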

Note that for some of the more complex code (applying predicates etc), we also had to switch to by-ref passing, i.e.

// operates on the value where it lives, via the reference
void SomethingComplex(ref Customer customer) {...}
...
int custIndex = ...
// pass a reference straight into the array slot - no struct copy on the stack
SomethingComplex(ref customers[custIndex]);

This again is accessing the Customer value in-situ inside the array, rather than copying it.

To finish off the bad-practice sheet, we also made some crazy changes to replace some reference data inside the record with fixed-size buffers inside the value (avoiding an array on the heap per item), plus some corresponding unsafe code to query it (including the rarely-used stackalloc). That code is probably a bit complex to cover properly - essentially: we removed all the objects/references from this data. And after removing the objects, there is nothing for GC to look at.
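For the curious, the fixed-buffer part looks something like this (a sketch with invented names; the real record, and the stackalloc-based query code, were more involved):

using System;

unsafe struct CustomerRecord
{
    public int Id;
    public fixed char Name[16]; // inline storage: no per-record string/array on the heap

    public void SetName(string value)
    {
        fixed (char* chars = Name)
        {
            int len = Math.Min(value.Length, 15);
            for (int i = 0; i < len; i++) chars[i] = value[i];
            chars[len] = '\0'; // terminate so readers know where the text ends
        }
    }
}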

Impact

It helped! Here’s the “after”, at the same scales:

(image: the same response-time graph after the change)

As you can see, there are still a few vertical spikes (which still tie into GC), but they are much less dense, and less tall. Basically, the server is no longer tied up in knots. The code is a bit less OOP than we might usually do, but it is a: constrained to this very specific scenario (this is absolutely not a carte-blanche “make everything a struct”), and b: for understood reasons.

Plus, it was pretty interesting (oh dear; I really, really need to get out more). Enjoy.

Disclaimers

  • You need to be really careful messing with this – especially with updates etc, since your records are certainly over-size for atomic read/write, and you don’t want callers seeing torn data.
  • This is not about “struct is faster” – please don’t let that be what you take away; this is a very specific scenario.
  • You need to be really careful to avoid copying fat values on the stack.
  • This code really is well outside of the normal bell-curve. Indeed, it is pretty reminiscent of XNA-style C# (for games, where GC hurts the frame-rate).

Credit

A lot of input, data-analysis, scheming, etc here is also due to Sam Saffron and Kyle Brandt; any stupid/wrong bits are mine alone.

Friday, 7 October 2011

The trouble with tuples

A little while ago, I mentioned how new tuple handling in protobuf-net meant that likely-looking tuples could now be handled automatically, and even better this meant I could remove my hacky KeyValuePair<,> handling. All was well in the world – after all, what could go wrong?

Answer: Mono.

Actually, I’m not criticizing the Mono implementation really, but simply: Mono has a subtly different implementation of KeyValuePair<,> to Microsoft’s – nothing huge; simply that some set accessors exist (private ones, visible here). And the library was pretty fussy – if there was a set accessor, the type wouldn’t be treated as a sure-thing tuple.
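The kind of check involved looks something like this (a simplified sketch, not the actual protobuf-net code):

using System;
using System.Reflection;

static bool LooksLikeSureThingTuple(Type type)
{
    foreach (PropertyInfo prop in type.GetProperties(
        BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance))
    {
        // GetSetMethod(true) also returns non-public accessors, which is
        // exactly where Mono's KeyValuePair<,> differs from Microsoft's
        if (prop.GetSetMethod(true) != null) return false;
    }
    return true;
}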

This is now fixed in r447, but: if you are using Mono and your dictionaries stopped serializing – then honestly and humbly: sorry about that.

And if you are a library author – watch out: sometimes the simplest most subtle differences between implementations can kill you.

Monday, 15 August 2011

Automatic serialization; what’s in a tuple?

I recently had a lot of reason (read: a serialization snafu) to think about tuples in the context of serialization. Historically, protobuf-net has focused on mutable types, which are convenient to manipulate while processing a protobuf stream (setting fields/properties on a per-field basis). However, if you think about it, tuples have an implicit obvious positional “contract”, so it makes a lot of sense to serialize them automatically. If I write:

var tuple = Tuple.Create(123, "abc");

then it doesn’t take a lot of initiative to think of tuple as an implicit contract with two fields:

  • field 1 is an integer with value 123
  • field 2 is a string with value “abc”

Since it is deeply immutable, at the moment we would need to either abuse reflection to mutate private fields, or write a mutable surrogate type for serialization, with conversion operators, and tell protobuf-net about the surrogate. Wouldn’t it be nice if protobuf-net could make this leap for us?

Well, after a Sunday-night hack it now (in the source code) does.

The rules are:

  • it must not already be marked as an explicit contract
  • only public fields / properties are considered
  • any public fields (spit) must be readonly
  • any public properties must have a get but not a set (on the public API, at least)
  • there must be exactly one interesting constructor, with parameters that are a case-insensitive match for each field/property in some order (i.e. there must be an obvious 1:1 mapping between members and constructor parameter names)

If all of the above conditions are met then it is now capable of behaving as you might hope and expect, deducing the contract and using the chosen constructor to rehydrate the objects. Which is nice! As a few side-benefits:

  • this completely removes the need for the existing KeyValuePairSurrogate<,>, which conveniently meets all of the above requirements
  • it also works for C# anonymous types if we want, since they too have an implicit positional contract (I am not convinced this is significant, but it may have uses)
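For example, a hypothetical type like this ticks every box above, so would now serialize without attributes or surrogates:

public class Point
{
    private readonly int x, y;
    public int X { get { return x; } } // get-only on the public API
    public int Y { get { return y; } }

    // exactly one interesting constructor; parameters match the
    // properties case-insensitively, so the contract can be deduced
    public Point(int x, int y) { this.x = x; this.y = y; }
}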

This should make it into the next deploy, once I’m sure there are no obvious failures in my assumptions.

Just one more thing, sir…

While I’m on the subject of serialization (which, to be fair, I often am) – I have now also completed some changes to use RuntimeHelpers.GetHashCode() for reference-tracking (serialization). This lets me construct a reference-preserving hash-based lookup to minimise the cost of checking whether an object has already been seen (and if so, fetch the existing token). Wins all round.
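The shape of that lookup is roughly this (a sketch of the obvious implementation, not the actual code):

using System.Collections.Generic;
using System.Runtime.CompilerServices;

sealed class ReferenceComparer : IEqualityComparer<object>
{
    // reference identity only: bypass any overridden Equals/GetHashCode
    bool IEqualityComparer<object>.Equals(object x, object y)
    {
        return ReferenceEquals(x, y);
    }
    int IEqualityComparer<object>.GetHashCode(object obj)
    {
        return RuntimeHelpers.GetHashCode(obj);
    }
}

// usage: map each already-seen object to its existing token
// var seen = new Dictionary<object, int>(new ReferenceComparer());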

Friday, 12 August 2011

Shifting expectations: non-integer shift operators in C#

In C#, it is quite nice that the shift operators (<< and >>) have had their second argument constrained to int arguments, avoiding the oft-confusing piped C++ usage:

cout << "abc" << "def";

But wait a minute! The above line is actually C#, in an all-C# project. OK, I cheated a little (I always do…) but genuinely! This little nugget comes from a stackoverflow post that really piqued my curiosity. All credit here goes to vcsjones, but indeed the line in the documentation about restricting to int (and the return type, etc) is about declaring the operator – not consuming it – so it is seemingly valid for the C# dynamic implementation to use it quite happily.

In fact, the main cheat I used here was simply hiding the assignment of the result, since that is still required. Here’s the full evil:

using System;
using System.Dynamic;
using System.Linq.Expressions;

class Evil : DynamicObject {
    static void Main() {
        dynamic cout = new Evil();
        var hacketyHackHack =
            cout << "abc" << "def";
    }
    public override bool TryBinaryOperation(
        BinaryOperationBinder binder,
        object arg, out object result) {
        switch (binder.Operation) {
            case ExpressionType.LeftShift:
            case ExpressionType.LeftShiftAssign:
                // or whatever you want to do
                Console.WriteLine(arg);
                result = this;
                return true;
        }
        return base.TryBinaryOperation(
            binder, arg, out result);
    }
}

I've seen more evil things (subverting new() via ContextBoundObject is still my favorite evil), but quite dastardly, IMO!

Additional: grrr! I don't know why blogger hates me so much; here it is on pastie.org.

Monday, 4 July 2011

BookSleeve, transactions, hashes, and flukes

Sometimes, you just get lucky.

tl;dr; version : BookSleeve now has transactions and Hashes; new version on nuget

Transactions

For some time now I’ve been meaning to add multi/exec support to BookSleeve (which is how redis manages transactions). The main reason it wasn’t there already is simply: we don’t use them within StackExchange.

To recap, because BookSleeve is designed to be entirely asynchronous, the API almost-always returns some kind of Task or (more often) Task-of-T; from there you can hook a “continue-with” handler, you can elect to wait for a result (or error), or you can just drop it on the floor (fire-and-forget).

Over the weekend, I finally got around to looking at multi/exec. The interesting thing here is that once you’ve issued a MULTI, redis changes the output – returning “QUEUED” for pretty much every command. If I was using a synchronous API, I’d be pretty much scuppered at this point; I’d need to duplicate the entire (quite large) API to provide a way of using it. I’d love to say it was by-design (I’d be lying), but the existing asynchronous API made this a breeze (relatively speaking) – all I had to do was wrap the messages in a decorator that expects to see “QUEUED”, and then (after the EXEC) run the already-existing logic against the wrapped message. Or more visually:

conn.Remove(db, "foo"); // just to reset
using (var tran = conn.CreateTransaction())
{   // deliberately ignoring INCRBY here
    tran.Increment(db, "foo");
    tran.Increment(db, "foo");
    var val = tran.GetString(db, "foo");

    tran.Execute(); // this *still* returns a Task

    Assert.AreEqual("2", conn.Wait(val));
}

Here all the operations happen as an atomic (but pipelined) unit, allowing for more complex integrity conditions. Note that Execute() is still asynchronous (returning a Task), so you can still carry on doing useful work while the network and redis server does some thinking (although redis is shockingly quick).

So what is CreateTransaction?

Because a BookSleeve connection is a thread-safe multiplexer, it is not a good idea to start sending messages down the wire until we have everything we need. If we did that, we’d have to block all the other clients, which is not a good thing. Instead, CreateTransaction() creates a staging area to build commands (using exactly the same API) and capture future results. Then, when Execute() is called the buffered commands are assembled into a MULTI/EXEC unit and sent down in a contiguous block (the multiplexer will send all of these together, obviously).

As an added bonus, we can also use the Task cancellation metaphor to cleanly handle discarding the transaction without execution.

The only disadvantage of the multiplexer approach here is that it makes it pretty hard to use WATCH/UNWATCH, since we wouldn’t be sure what we are watching. Such is life; I’ll think some more on this.

Hashes

Another tool we don’t use much at StackExchange is hashes; however, for your convenience this is now fully implemented in BookSleeve. And since we now have transactions, we can solve the awkward variadic/non-variadic issue of HDEL between 2.2+ and previous versions; if BookSleeve detects a 2.2+ server it will automatically use the variadic version, otherwise it will automatically use a transaction (or enlist in the existing transaction). Sweet!

Go play

A new build is on nuget; I hope you find it useful.

Wednesday, 8 June 2011

Profiling with knobs on

A little while ago, I mentioned the core part of the profiling tool we use at stackexchange – our mini-profiler. Well, I’m delighted to say that with the input from Jarrod and Sam this tool has been evolving at an alarming rate, and has now gained:

  • full integration with our bespoke SQL profiler, which is now also included
    • basically, this is a custom ADO.NET DbConnection that you wrap around your existing connection, and it adds logging (see the sketch after this list)
  • an awesome UI (rather than html comments), including a much clearer timeline
  • full support for AJAX activity
  • some very basic highlighting of duplicated queries, etc
  • webform support
  • and a ton of other features and enhancements
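Wiring up that connection wrapper looks roughly like this (type and constructor names from memory – check the project page for the exact API; CreateRealConnection is a hypothetical stand-in for your own factory):

using System.Data.Common;
using MvcMiniProfiler;
using MvcMiniProfiler.Data;

static DbConnection GetOpenConnection()
{
    DbConnection cnn = CreateRealConnection(); // your existing factory (hypothetical name)

    // when MiniProfiler.Current is null (i.e. a regular user), the wrapper
    // simply passes calls through without logging
    return new ProfiledDbConnection(cnn, MiniProfiler.Current);
}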

To illustrate with Jarrod’s image from the project page:

(image: the MiniProfiler UI, from the project page)

The custom connection has been exercised extensively with dapper-dot-net, LINQ-to-SQL and direct ADO.NET, but should also work with Entity Framework (via ObjectContextUtils, provided) and hopefully any other DB frameworks built on top of ADO.NET.

We use it constantly. It is perhaps our primary tool in both understanding problems and just keeping a constant eye on the system while we use it – since this is running (for our developers) constantly, 24×7.

All too often such fine-grain profiling is just “there’s a problem, ask a DBA to attach a profiler to the production server for 2 minutes while I validate something”.

Enjoy.

http://code.google.com/p/mvc-mini-profiler/

Wednesday, 25 May 2011

So soon, beta 2

I’m delighted by the level of response to my protobuf-net v2 beta; I have been kept really busy with a range of questions, feature requests and bug fixes (almost exclusively limited to the new features).

So, before I annoy people with too many “fixed in source” comments, I thought I’d better re-deploy. Beta 2 gives the same as beta 1, but with:

  • reference-tracked objects
    • root object now included (BREAKING CHANGE)
    • fixed false-positive recursion issue
  • full type metadata
    • support custom formatting/parsing
  • interface serialization
    • support serialization from interfaces
    • support known-implementations of interfaces
    • support default concrete type
  • WCF
    • expose WCF types on public API to allow custom models
  • misc
    • support shadow set-methods
    • fix IL glitch
    • fix cold-start thread-race condition
    • fix missing method-forwarding from legacy non-generic API

In addition to a very encouraging level of interest, I’m also pleased that things like the interface-based support mentioned above were pretty effortless to push through the v2 type-model. These are things that had been proposed previously, but there was just no way to push them into the codebase without hideous (and cumulative) hackery. With v2, it was pretty much a breeze.

But for v2 fun see the project download page

And please, keep pestering me ;p

Note that the BREAKING CHANGE marked above only applies to data serialized with the new full-graph AsReference option. No v1 data is impacted, but if you have stored data from earlier v2 alpha/beta code, please drop me a line and I can advise.

Thursday, 19 May 2011

protobuf-net v2, beta

It has been a long time in the coming, I know. I make no excuses, but various other things have meant that it didn’t get much input for a while…

But! I’m happy to say that I’ve just pushed a beta download up onto the project site

So… this v2… what is it?

Basically, it is still the core protobuf stream, but against all the safe advice of my employer I rewrote the entire core. For many reasons:

  • the original design grew organically, and ideas that seemed good at the time turned out to make for some problematic code later
  • the vast overuse of generics (and in particular, generics at runtime via reflection) was really hurting some platforms
  • there were many things the design could not readily support – structs, runtime models, pre-built serialization dlls, usage on mobile devices, etc
  • the code path at runtime was just not as minimal as I would like
  • and a myriad of other things

So… I rewrote it. Fortunately I had a barrage of integration tests, which grew yet further during this exercise. The way I figured, I had 2 choices here:

  • go with a richer code-generator to use as a build task (not runtime)
  • go crazy with meta-programming

After some thought, I opted for the latter. In particular, this would (in my reasoning) give me lots of flexibility for keeping things runtime based (where you are able to, at least), and would be a good opportunity to have some fun and learn IL emit to dangerous levels (which, incidentally, turns out to be very valuable – hence mine and Sam’s work on dapper-dot-net).

So what is new?

I don’t claim this list is exhaustive:

  • the ability to define models at runtime (also useful for working with types you don’t control) (example)
  • the ability to have separate models for the same types
  • pre-generation to dll
  • runtime performance improvements
  • support for structs
  • support for immutable objects via surrogates (example)
  • the ability to run on mobile platforms (unity, wp7 – and perhaps moot for a while: MonoDroid, MonoTouch – but presumably Xamarin)
  • the ability to serialize object graphs preserving references (example)
  • the ability to work with types not known in advance (same)
  • probably about 40 other things that slip my mind

I’ll try to go into each of these in detail when I can.
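As a taster of the first item, defining a model at runtime goes something like this (a sketch against the v2 API; Customer is just an illustrative type – the linked example has the canonical version):

using ProtoBuf.Meta;

var model = TypeModel.Create(); // returns a RuntimeTypeModel

// describe a type we don't control: false = don't apply attribute-based rules
MetaType meta = model.Add(typeof(Customer), false);
meta.Add(1, "Id");   // field 1 <-> Customer.Id
meta.Add(2, "Name"); // field 2 <-> Customer.Name

// model.Serialize(...) / model.Deserialize(...) now understand Customer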

And what isn’t there yet?

This is beta; at some point I had to make a milestone cut – but some things aren’t quite complete yet; these are not usually needed (hint: I’m happy to use protobuf-net v2 on my employer’s code-base, so I have confidence in it, and confidence that the core paths have been smoke-tested).

But things that definitely aren’t there in the beta:

  • WCF hooks
  • VS tooling (it is just the core runtime dlls)
  • round-trip safe / “extension” fields
  • extraction of .proto (schema) files from a model
  • my build / deploy script needs an overhaul
  • tons of documentation / examples
  • tooling to make the “mobile” story easier (but; it should work – it is being used in a few, for example)

But; have at it

If you want to play with it, go ahead. I only ask that if something behaves unexpectedly, drop me a line before declaring loudly “it sux, dude!”. It might just need some guidance, or maybe even a code fix (I’m far from perfect).

Likewise, any feature suggestions, etc; let me know.

Wednesday, 27 April 2011

Completion tasks - the easy way

One of the good things about doing things in public – people point out when you’ve missed a trick.

Just the other day, I was moaning about how Task seemed too tightly coupled to schedulers, and wouldn’t it be great if you could have a Task that just allowed you to indicate success/failure – I even went so far as to write some experimental code to do just that.

So you can predict what comes next; it already existed – I just didn’t know about it. I give you TaskCompletionSource<T>, which has SetResult(…), SetException(…) and a Task property. Everything I need (except maybe support for result-free tasks, but I can work around that).
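In other words (a minimal sketch):

using System;
using System.Threading.Tasks;

var source = new TaskCompletionSource<string>();

// hand the task to the caller; they can Wait, ContinueWith, or (later) await it
Task<string> task = source.Task;

// ...and when the out-of-process work eventually finishes:
source.SetResult("pong");
// or, on failure: source.SetException(new TimeoutException());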

And wouldn’t you believe it, the TPL folks shine again, easily doubling my lousy performance, and trebling (or more) the performance compared to “faking it” via RunSynchronously:

Future (uncontested): 1976ms // my offering
Task (uncontested): 4149ms // TPL Task via RunSynchronously
Source (uncontested): 765ms // via TaskCompletionSource
Future (contested): 5608ms
Task (contested): 6982ms
Source (contested): 2389ms


I am hugely indebted to Sam Jack, who corrected me via a comment on the blog entry. Cheers!

Monday, 25 April 2011

Musings on async

OUT OF DATE: SEE UPDATE


For BookSleeve, I wanted an API that would work with current C#/.NET, but which would also mesh directly into C# 5 with the async/await pattern. The obvious option there was the Task API, which is familiar to anyone who has used the TPL in .NET 4.0, but which gains extension methods with the Async CTP to enable async/await.

This works, but in the process I found a few niggles that made me have a few doubts:

Completion without local scheduling

In my process, I never really want to schedule a task; the task is performed on a separate server, and what I really want to do is signal that something is now complete. There isn’t really an API for that on Task (since it is geared more for operations you are running locally, typically in parallel). The closest you can do is to ask the default scheduler to run your task immediately, but this feels a bit ungainly, not least because task-schedulers are not required to offer synchronous support.

In my use-case, I’m not even remotely interested in scheduling; personally I’d quite like it if Task supported this mode of use in isolation, perhaps via a protected method and a subclass of Task.

(yes, there is RunSynchronously, but that just uses the current scheduler, which in library code you can’t assume is capable of actually running synchronously).

Death by exception

The task API is also pretty fussy about errors – which isn’t unreasonable. If a task fails and you don’t explicitly observe the exception (by asking it for the result, etc), then it intentionally re-surfaces that exception in a finalizer. Having a finalizer is noteworthy in itself, and you get one last chance to convince the API that you’re sane – but if you forget to hook that exception it is a process-killer.

So what would it take to do it ourselves?

So: what is involved in writing our own sync+async friendly API? It turns out it isn’t that hard; in common with things like LINQ (and foreach if you really want), the async API is pattern-based rather than interface-based; this is convenient for retro-fitting the async tools onto existing APIs without changing existing interfaces.

What you need (in the Async CTP Refresh for VS2010 SP1) is:

  • Some GetAwaiter() method (possibly but not necessarily an extension method) that returns something with all of:
    • a boolean IsCompleted property (get)
    • a void OnCompleted(Action callback) method
    • a GetResult() method which returns void, or the desired outcome of the awaited operation
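Put together, a bare-bones awaitable looks something like this (illustrative only – not my Future implementation, and skipping the thread-safety a real one needs):

using System;

class TrivialAwaitable
{
    private Action continuation;
    private int result;

    public bool IsCompleted { get; private set; }

    // awaitable and awaiter rolled into one object, for brevity
    public TrivialAwaitable GetAwaiter() { return this; }

    public void OnCompleted(Action callback)
    {
        if (IsCompleted) callback();
        else continuation = callback;
    }

    public int GetResult() { return result; }

    // called by whatever completes the operation (the network reader, say)
    public void SetResult(int value)
    {
        result = value;
        IsCompleted = true;
        Action callback = continuation;
        if (callback != null) callback();
    }
}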

So this isn’t a hugely challenging API to implement if you want to write a custom awaitable object. I have a working implementation that I put together with BookSleeve in mind. Highlights:

  • Acts as a set-once value with wait (sync) and continuation (async) support
  • Thread-safe
  • Isolated from the TPL, and scheduling in particular
  • Based on Monitor, but allowing efficient re-use of the object used as the sync-lock (my understanding is that once used/contested in a Monitor, the object instance incurs additional cost; may as well minimise that)
  • Supporting typed (Future<T>) or untyped (Future) usage – compares to Task<T> and Task respectively

My local tests aren’t exhaustive, but (over 500,000 batches of 10 operations each), I get:

Future (uncontested): 1993ms
Task (uncontested): 4126ms
Future (contested): 5487ms
Task (contested): 6787ms

So our custom awaitable object is faster… but I’m just not convinced that it is enough of an improvement to justify changing away from the Task API. This call density is somewhat artificial, and we’re talking less than a µs per-operation difference.

Conclusions

In some ways I’m pleasantly surprised with the results; if Task is keeping up (more or less), even outside of its primary design case, then I think we should forget about it; use Task, and move on to the actual meat of the problem we are trying to solve.

However, I’ll leave my experimental Future/Future<T> code as reference only off on the side of BookSleeve – in case anybody else feels the need for a non-TPL implementation. I’m not saying mine is ideal, but it works reasonably.

But: I will not be changing away from Task / Task<T> at this time. I’m passionate about performance, but I’m not (quite) crazy; I’ll take the more typical and more highly-tested Task API that has been put together by people who really, really understand threading optimisation, to quite ludicrous levels.

Tuesday, 19 April 2011

Practical Profiling

(image: “Profiling the hard way”)

If you don’t know what is causing delays, you are doomed to performance woes. Performance is something I care about deeply, and there is no end of performance / profiling tools available for .NET. Some invasive and detailed, some at a higher level. And quite often, tools that you wouldn’t leave plugged into your production code 24/7.

Yet… I care about my production environment 24/7; and trying to reproduce a simulated load for the sites I work on can be somewhat challenging. So how can we get realistic and detailed data without adversely impacting the system?

Keep It Simple, Stupid

A common strapline in agile programming states:

Do the simplest thing that could possibly work

Now, I don’t profess to be strictly “agile”, or indeed strictly anything (except maybe “pragmatic”) in my development process, but there is much wisdom in the above. And it makes perfect sense when wanting to add constant (live) profiling capabilities.

So what is the simplest thing that could possibly work with profiling? Automated instrumentation? Process polling on the debugging API? Way too invasive. IoC/DI chaining with profiling decorators? Overkill. AOP with something like PostSharp? Unnecessary complexity. How about we just tell the system openly what we are doing?

Heresy! That isn’t part of the system! It has no place in the codebase!

Well, firstly – remember I said I was “pragmatic”, and secondly (more importantly) performance is both a requirement and a feature, so I have no qualms whatsoever changing my code to improve our measurements.

So what are you talking about?

Frustrated by the inconvenience of many of the automated profiling tools, I cobbled together the simplest, hackiest, yet fully working mini-profiler – and I thought I’d share. What I want, as a developer, is to be able to review the performance of pages I’m viewing in the production environment – sure, this doesn’t cover every scenario, but it certainly does the job on sites that are read-intensive. So say I browse to “http://mysite/grobbits/MK2-super-grobit” – I want immediate access to how that page was constructed (and where the pain was). And in particular, I want it live so I can hit “refresh” a few times and watch how it behaves as different caches expire. Nothing rocket-science, just a basic tool that will let me home in on the unexpected performance bumps. Finally, it can’t impact performance for the 99.99% of regular users who will never see that data.

I’m currently having great success using this mini tool; the concept is simple – you have a MiniProfiler object (which would be null for most users), and you just surround the interesting code:

using (profiler.Step("Set page title"))
{
    ViewBag.Message = "Welcome to ASP.NET MVC!";
}

using (profiler.Step("Doing complex stuff"))
{
    using (profiler.Step("Step A"))
    {   // something more interesting here
        Thread.Sleep(100);
    }
    using (profiler.Step("Step B"))
    {   // and here
        Thread.Sleep(250);
    }
}

(the Step(…) method is implemented as an extension method, so it is perfectly happy operating on a null reference; which is a crude but simple way of short-circuiting the timings for regular users)
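The null-handling is just this (sketch; StartStep is a hypothetical stand-in for whatever actually records the timing):

using System;

public static class MiniProfilerExtensions
{
    // extension methods are static calls, so a null 'profiler' is fine;
    // and 'using' happily accepts a null IDisposable
    public static IDisposable Step(this MiniProfiler profiler, string name)
    {
        return profiler == null ? null : profiler.StartStep(name);
    }
}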

Obviously this isn’t very sophisticated, and it isn’t meant to be – but it is very fast, and easy to make as granular as you like as you focus in on some specific knot of code that is hurting. But for all the simplicity, it is remarkably useful in finding trouble spots on your key pages, and reviewing ongoing performance.


So what do I see?

The output is very basic but functional – you simply get a call tree of the code you’ve marked as interesting (above), to whatever granularity you marked it to; no more, no less. So with the sample project (see below), the home-page displays (in the html markup):


<!--
MYPC at 19/04/2011 11:28:12
Path: http://localhost:3324/
http://localhost:3324/ = 376.9ms
> Set page title = 0ms
> Doing complex stuff = 349.3ms
>> Step A = 99.3ms
>> Step B = 250ms
> OnResultExecuting = 27.5ms
>> Some complex thinking = 24.8ms
-->

(for your developers only; there is no change for users without the profiler enabled)

Stop waffling, Man!

Anyway, if you have similar (high-level, but live) profiling needs, I’ve thrown the mini-profiler onto google-code and NuGet, along with a tweaked version of the MVC3 (razor) sample project to show typical usage. Actually, you only need a single C# file (it really is basic code, honest).

  • If you find it useful, let me know!
  • If it sucks, let me know!
  • If I’ve gone mad, let me know!
  • If you’ve gone mad, keep that to yourself.

Monday, 11 April 2011

async Redis await BookSleeve

UPDATE

BookSleeve has now been succeeded by StackExchange.Redis, for lots of reasons. The API and intent is similar, but the changes are significant enough that we had to reboot. All further development will be in StackExchange.Redis, not BookSleeve.

ORIGINAL CONTENT

At Stack Exchange, performance is a feature we work hard at. Crazy hard. Whether that means sponsoring load-balancer features to reduce system impact, or trying to out-do the ORM folks on their own turf.

One of the many tools in our performance toolkit is Redis; a highly performant key-value store that we use in various ways:

  • as our second-level cache
  • for various tracking etc counters, that we really don’t want to bother SQL Server about
  • for our pub/sub channels
  • for various other things that don’t need to go direct to SQL Server

It is really fast; we were using the redis-sharp bindings and they served us well. I have much thanks for redis-sharp, and my intent here is not to critique it at all – but rather to highlight that in some environments you might need that extra turn of the wheel. First some context:

  • Redis itself is single-threaded, supporting multiple connections
  • the Stack Exchange sites work in a multi-tenancy configuration, and in the case of Redis we partition (mainly) into Redis databases
  • to reduce overheads (both handshakes etc and OS resources like sockets) we re-use our Redis connection(s)
  • but since redis-sharp is not thread-safe we need to synchronize access to the connection
  • and since redis-sharp is synchronous we need to block while we get each response
  • and since we are split over Redis databases we might also first have to block while we select database

Now, LAN latency is low; most estimates put it at around 0.3ms per call – but this adds up, especially if you might be blocking other callers behind you. And even more so given that you might not even care what the response is (yes, I know we could offload that somewhere so that it doesn’t impact the current request, but we would still end up adding blocking for requests that do care).

Enter BookSleeve

Seriously, what now? What on earth is BookSleeve?

As a result of the above, we decided to write a bespoke Redis client with specific goals around solving these problems. Essentially it is a wrapper around Redis dictionary storage; and what do you call a wrapper around a dictionary? A book-sleeve. Yeah, I didn’t get it at first, but naming stuff is hard.

And we’re giving it away (under the Apache License 2.0)! Stack Exchange is happy to release our efforts here as open source, which is groovy.

So; what are the goals?

  • to operate as a fully-functional Redis client (obviously)
  • to be thread-safe and non-blocking
  • to support implicit database switching to help with multi-tenancy scenarios
  • to be on-par with redis-sharp on like scenarios (i.e. a complete request/response cycle)
  • to allow absolute minimum cost fire-and-forget usage (for when you don’t care what the reply is, and errors will be handled separately)
  • to allow use as a “future” – i.e request some data from Redis and start some other work while it is on the wire, and merge in the Redis reply when available
  • to allow use with callbacks for when you need the reply, but not necessarily as part of the current request
  • to allow C# 5 continuation usage (aka async/await)
  • to allow fully pipelined usage – i.e. issue 200 requests before we’ve even got the first response
  • to allow fully multiplexed usage – i.e. it must handle meshing the responses from different callers on different threads and on different databases but on the same connection back to the originator

(actually, Stack Exchange didn’t strictly need the C# 5 scenario; I added that while moving it to open-source, but it is an excellent fit)

Where are we? And where can I try it?

It exists; it works; it even passes some of the tests! And it is fast. It still needs some tidying, some documentation, and more tests, but I offer you BookSleeve:

http://code.google.com/p/booksleeve/

The API is very basic and should be instantly familiar to anyone who has used Redis; and documentation will be added.

In truth, the version I’m open-sourcing is more like the offspring of the version we’re currently using in production – you tend to learn a lot the first time through. But as soon as we can validate it, Stack Exchange will be using BookSleeve too.

So how about some numbers

These are based on my dev machine, running redis on the same machine, so I also include estimates using the 0.3ms latency per request as mentioned above.

In each test we are doing 5000 INCR commands (purely as an arbitrary test); spread over 5 databases, in a round-robin in batches of 10 per db – i.e. 10 on db #0, 10 on db #1, … 10 on db #4 – so that is an additional 500 SELECT commands too.

redis-sharp:

  • to completion 430ms
  • (not meaningful to measure fire-and-forget)
  • to completion assuming 0.3ms LAN latency: 2080ms

BookSleeve

  • to completion 391ms
  • 2ms fire-and-forget
  • to completion assuming 0.3ms LAN latency: 391ms

The last 2 are the key, in particular noting that the time we aren’t waiting on LAN latency is otherwise-blocking time given back to other callers (web servers tend to have more than one thing happening…); the fire-and-forget performance allows us to do a lot of operations without blocking the current caller.

As a bonus we have added the ability to do genuinely parallel work on a single caller – by starting a Redis request first, doing the other work (TSQL typically), and then asking for the Redis result. And let’s face it, while TSQL is versatile, Redis is so fast that it would be quite unusual for the Redis reply not to already be there by the time you get to look.

Wait – did you say C# 5?

Yep; because the API is task based, it can be used in any of 3 ways without needing separate APIs:

  • synchronously, by simply waiting on the task
  • as a callback, by hooking a “continue-with” on the task
  • via C# 5 continuations, aka async/await

As an example of the last:

(image: async/await code sample)
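In sketch form (reusing the API shapes from the transaction example above), it looks like this:

using System.Threading.Tasks;

static async Task<string> GetFoo(RedisConnection conn, int db)
{
    // yields back to the caller here; resumes when the redis reply arrives
    string value = await conn.GetString(db, "foo");
    return value;
}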

IMPORTANT: in the above “await” does not mean “block until this is done” – it means “yield back to the caller here, and run the rest as a callback when the answer is available” – or for a better definition see Eric Lippert’s blog series.

And did I mention…

…that a high performance binary-based dictionary store works well when coupled with a high performance binary serializer? ;p

Monday, 7 March 2011

Objects, Graphs and all that Jazz

protobuf-net v2 ticks along slowly; I’m embarrassed to say that due to a combination of factors progress has been slower than I would have liked – and for purely human reasons (availability etc).

But; I thought I’d better update with the tweaks I’m looking at currently; they have all been ticking along in the back of head for ages, but frankly people kept nagging me to provide them, so who am I to argue?

Caveat

This is all work-in-progress; don’t try to pull the trunk and shove it into production! Soon… I should also stress that all of the thoughts described below are outside the interoperable protobuf spec; it’ll still be a valid protobuf stream, but this only applies to protobuf-net talking to protobuf-net.

So; here’s some common questions I get…

Type meta

The serialization is fine, but I don’t know (and cannot know) all of my types up front. How can I do this?

Well, protobuf is a contract based format; if you don’t know the types, it will struggle – as will any contract based serializer…

Yes, I get that; now: how do I do it?

Now, I’ve held off putting any meta in the stream for various reasons:

  • it steps far outside the core protobuf spec
  • it flashes the warning signs of BinaryFormatter, my nemesis

But, so many people seem to want this that I think I have to buckle; but on my terms! So in v2, I’m adding the ability to indicate that (on a per-member basis) objects should resolve their type information from the stream. By default, by embedding the assembly-qualified-name, but providing an abstraction layer over that allowing you to provide your own string<===>Type map (and thus avoiding the knots in my stomach caused by too much type dependency).

Full graphs

The serialization is fine, but my data is a graph, not a tree. How can I do this?

Well, protobuf is a tree format; it doesn’t work like that…

Yes, I get that; now: how do I do it?

(by the way, are you spotting a pattern in the questions I get?)

Type meta is something I didn’t want to add, but graph support is something I have really wanted to sneak in; breaking all the rules with type meta seems a reasonable excuse. So in v2, I’m adding the ability (on a per-member basis) to use reference tracking to (de)serialize a complete object graph (except events; don’t get me started on that…). A minor complication here is that for technical reasons it is a nightmare to support this concurrently with inheritance (which, btw, the core protobuf doesn’t support – I feel quite the anarchist here…); but I can support it in conjunction with type meta; so you still get to keep inheritance, but implemented differently.

Repeated strings

It is fairly common to have repeated string data in a large graph. As an offshoot of full-graph support, we also get a mechanism to support string re-use for free; woot! So again, in v2 I’ll be enabling this on a per-member basis.

Footnote

All of this goes so far outside of the protobuf spec that I have little right to even call it protobuf-related any more; maybe I should force the caller to explicitly set an EnableImplementationSpecificOptions flag? So there is a portable core with some opt-in hacks for these random options.

And again; this whole area is under development; I’m working on it, honest!

That’s great! My full-graph of dynamic objects using inheritance and string re-use works fine! Now how do I load it into my c++/java/php/python/whatever client?

Don’t make me hurt you…