Tuesday 18 March 2014

So I went and wrote another Redis client…

…aka: Introducing StackExchange.Redis (nuget | github)

The observant out there are probably thinking “Wut?” about now; after all, didn’t I write this one? Yes, yes I did. This one too, although that was just a mechanism to illustrate the C# dynamic API. So you would be perfectly justified in thinking that I had finally lost the plot and meandered into crazy-town. Actually, you’d have been reasonably justified in thinking that already. So let me take a step back and explain…

Why? Just… why?

Firstly, BookSleeve has served us well: it has got the job done with ever increasing load through the network. If it is possible to salute code as a trusted friend, then BookSleeve: I salute you. But: some things had to change. There were a number of problems and design decisions that were conspiring against me, including:

  • Robustness: BookSleeve acts as a wrapper around a single connection to a single endpoint; while it is up: great! But occasionally sockets die, and it was incredibly hard to do anything useful, either to investigate or to recover. In our internal code we had a whole second library that existed mainly to hide this wrinkle (and for things like emergency slave fallback), but even then it wasn’t perfect and we were getting increasing problems with network issues causing downstream impact.
  • Ability to identify performance issues: BookSleeve has only minimal instrumentation – it wasn’t a first class feature, and it showed. Again; fine when everything works great, but if you hit any load issues, there was virtually nothing you could do to resolve them.
  • Single-server: in common with “Robustness” above, the single-endpoint nature of BookSleeve means that it isn’t in a great position if you want to talk to multiple nodes; this could be to fully exploit the available master and slave nodes, or, thinking ahead, could be in consideration of “redis cluster” (currently in beta) – and simply wrapping multiple RedisConnection instances doesn’t play nicely in terms of efficient network API usage.
  • The socket query API: again, partly tied up with the above multi-server concerns, but also tied up with things like the thread pool and IO pool (the issues described there apply equally to the async network read API, not just the thread-pool).

And those are just the immediate problems.

There were some other longstanding glitches in the API that deserved some love, but didn’t justify huge work by themselves: only string keys were supported (some folks make use of binary keys), and you constantly had to specify the database number (rather than abstracting over that). Then there was the necessity of involving the TPL when you didn’t actually want to be true “async” (forcing you to use sync-over-async, an ugly anti-pattern), and the fact that the names didn’t follow the correct async convention (although I can’t honestly remember whether the current conventions existed at the time).
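To make the sync-over-async complaint concrete, here is a minimal sketch of the two API shapes. Both client classes are hypothetical stand-ins invented for illustration (neither is the real surface of BookSleeve or of StackExchange.Redis), and the "values" are just computed strings so that the example runs without a server.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical Task-only client: every call returns a Task, needs the
// database number each time, and lacks the conventional "Async" suffix.
class TaskOnlyClient
{
    public Task<string> Get(int db, string key)
        => Task.FromResult("value-of-" + key + "@db" + db);
}

// Hypothetical client with the database number abstracted away and with
// separate, conventionally named sync and async methods.
class DualApiClient
{
    private readonly int db;
    public DualApiClient(int db) { this.db = db; }

    public string StringGet(string key)
        => "value-of-" + key + "@db" + db;

    public Task<string> StringGetAsync(string key)
        => Task.FromResult(StringGet(key));
}

static class ApiShapeDemo
{
    static async Task Main()
    {
        var taskOnly = new TaskOnlyClient();
        // Sync-over-async: a synchronous caller has to block on the Task.
        string blocked = taskOnly.Get(0, "foo").Result;

        var dual = new DualApiClient(0);
        string direct = dual.StringGet("foo");             // plain synchronous call
        string awaited = await dual.StringGetAsync("foo"); // opt in to async only when wanted

        Console.WriteLine(blocked == direct && direct == awaited); // True
    }
}
```

With only the first shape available, a caller who just wants a value is pushed into blocking on a Task; with the second, the synchronous path never touches the TPL at all.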

Looking ahead, “redis cluster” as mentioned above introduced a range of concerns; it probably would have been possible to wrap multiple connections in a super connection, but the existing implementations didn’t really make that feasible without significant work. Also, when looking at “redis cluster”, it is critically important that you know at every point whether a particular value represents a key versus a value: keys impact how commands are targeted for sharding (and impact whether a multi-key operation is even possible); values do not. In BookSleeve the distinction between them had been lost, and restoring it would basically need an operation-by-operation review of the entire codebase.
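To show why the key/value distinction matters for sharding, here is a small sketch of how the “redis cluster” specification maps a key to one of 16384 hash slots (CRC16 of the key, modulo 16384, with optional {hash tags}). This is illustrative code written for this explanation, not code taken from either client library; the class and method names are made up, and UTF-8 encoding of string keys is an assumption of the sketch.

```csharp
using System;
using System.Text;

static class ClusterSlots
{
    // CRC16, XMODEM variant (polynomial 0x1021, initial value 0), as named
    // in the redis cluster specification.
    static ushort Crc16(byte[] data, int offset, int count)
    {
        ushort crc = 0;
        for (int i = offset; i < offset + count; i++)
        {
            crc ^= (ushort)(data[i] << 8);
            for (int bit = 0; bit < 8; bit++)
            {
                crc = (crc & 0x8000) != 0
                    ? unchecked((ushort)((crc << 1) ^ 0x1021))
                    : unchecked((ushort)(crc << 1));
            }
        }
        return crc;
    }

    // Only the key takes part in slot selection; the value never does.
    public static int HashSlot(string key)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(key); // assumption: UTF-8 keys
        int start = 0, count = bytes.Length;

        // Hash tag rule: if there is a '{' followed later by a '}' with at
        // least one byte in between, hash only the bytes between them.
        int open = -1;
        for (int i = 0; i < bytes.Length && open < 0; i++)
            if (bytes[i] == (byte)'{') open = i;
        if (open >= 0)
        {
            int close = -1;
            for (int i = open + 1; i < bytes.Length && close < 0; i++)
                if (bytes[i] == (byte)'}') close = i;
            if (close > open + 1) { start = open + 1; count = close - start; }
        }
        return Crc16(bytes, start, count) % 16384;
    }
}

static class SlotDemo
{
    static void Main()
    {
        Console.WriteLine(ClusterSlots.HashSlot("123456789")); // 12739

        // Keys sharing a hash tag land in the same slot, so a multi-key
        // operation across them is possible.
        Console.WriteLine(ClusterSlots.HashSlot("{user:1}:name")
                       == ClusterSlots.HashSlot("{user:1}:email")); // True
    }
}
```

A client that cannot tell which arguments are keys has no way to run this calculation per command, which is why the distinction has to be preserved all the way through the API.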

In short, to address any of:

  • our immediate internal needs
  • the community needs that weren’t internal priorities
  • any future “redis cluster” needs
  • providing a generally usable platform going forward

was going to require significant work, and would have by necessity involved significant breaking API changes. If I had reworked the existing code, not only would it have shattered the old API, but it would have meant compromise both for users of the old code and users of the new. And that simply wasn’t acceptable. So: I drew a line, and went for a clean break.

So what does this mean for users?

Firstly, if you are using BookSleeve, feel free to continue using it; I won’t delete the files or anything silly like that – but: my main development focus going forward is going to be in StackExchange.Redis, not BookSleeve. The API is basically similar – the work to migrate between them is not huge, but first – why would you? How about:

  • Full multi-server connection abstraction including automatic reconnect and fallback (so read operations continue on a slave if the master is unavailable), and automatic pub/sub resubscription
  • Ability to express preferences to target work at slaves, or demand a certain operation happens on a slave, etc – trivially
  • Full support for “redis cluster”
  • Completely reworked network layer, designed to avoid stalls due to worker starvation while efficiently scaling to multiple connections, while also reducing overheads and moving steps like text encode/decode to the caller/consumer rather than the network layer
  • Full support for both binary and string keys, while reducing (not increasing) the methods necessary (you no longer need to tell it which you want in advance)
  • Observes TPL guidance: no more sync-over-async (there is a full sync API that does not involve the TPL), and the TPL/async methods are now named appropriately
  • Instrumentation as a design feature
  • And lots more…
  • … heck, when I get a moment I might also throw our 2-tier cache (local in-memory cache with a shared redis central cache, including pub/sub-based cache expiry notification etc) down into the client library too

Is it ready?

Let’s consider it a “late beta” (edit: fully released now); on the Q&A sites we have now replaced all of our BookSleeve code with StackExchange.Redis code, which means that hopefully we’ve already stubbed our toes on all the big bugs. I don’t plan on any breaking changes to the API (and will try to avoid them). Lua script support is not yet implemented (edit: it is now), and “redis cluster” isn’t yet released and thus support for this is still a work in progress, but basically: it works, and works fine.

A full introduction and basic usage examples are shown on the project site; please do take a look, and let me know if I’ve moved too much cheese!