Friday, 11 July 2014

Securing redis in the cloud

Redis has utility in just about any system, but when we start thinking about “the cloud” we have a few additional things to worry about. One of the biggest issues is that the cloud is not your personal space (unless you have a VPN / subnet setup, etc) – so you need to think very carefully about what data is openly available to inspection at the network level. Redis does have an AUTH concept, but frankly it is designed to deter casual access: all commands and data remain unencrypted and visible in the protocol, including any AUTH requests themselves. What we probably want, then, is some kind of transport level security.
Now, redis itself does not provide this; there is no standard encryption, but you could configure a secure tunnel to the server using something like stunnel. This works, but it requires configuration at both client and server. To make our lives a bit easier, though, some of the redis hosting providers are beginning to offer encrypted redis access as a supported option. This applies to both Microsoft’s “Azure Redis Cache” and Redis Labs’ “redis cloud”. I’m going to walk through both of these, discussing their implementations and showing how we can connect.

Creating a Microsoft “Azure Redis Cache” instance

First, we need a new redis instance, which you can provision at https://portal.azure.com by clicking on “NEW”, “Everything”, “Redis Cache”, “Create”:
image
image
There are different sizes of server available; they are all currently free during the preview, and I’m going to go with a “STANDARD” / 250MB:
image
Azure will now go away and start creating your instance:
image
This could take a few minutes (actually, it takes surprisingly long IMO, considering that starting a redis process is virtually instantaneous; but for all I know it is running on a dedicated VM for isolation etc; and either way, it is quicker and easier than provisioning a server from scratch). After a while, it should become ready:
image

Connecting to a Microsoft “Azure Redis Cache” instance

We have our instance; let’s talk to it. Azure Redis Cache uses a server-side certificate chain that should be valid without having to configure anything, and uses a client-side password (not a client certificate), so all we need to know is the host address, port, and key. These are all readily available in the portal:
image
image
Normally you wouldn’t post these on the internet, but I’m going to delete the instance before I publish, so; meh. You’ll notice that there are two ports: we only want to use the SSL port. You also want either stunnel, or a client library that can talk SSL; I strongly suggest that the latter is easier! So; Install-Package StackExchange.Redis, and you’re sorted (or Install-Package StackExchange.Redis.StrongName if you are still a casualty of the strong name war). The configuration can be set either as a single configuration string, or via properties on an object model; I’ll use a single string for convenience – and my string is:
mgblogdemo.redis.cache.windows.net,ssl=true,password=LLyZwv8evHgveA8hnS1iFyMnZ1A=

The first part is the host name without a port; the middle part enables ssl, and the final part is either of our keys (the primary in my case, for no particular reason). Note that if no port is specified, StackExchange.Redis will select 6379 if ssl is disabled, and 6380 if ssl is enabled. There is no official convention on this, and 6380 is not an official “ssl redis” port, but: it works. You could also explicitly specify the ssl port (6380) using the standard {host}:{port} syntax. With that in place, we can access redis (an overview of the library API is available in the StackExchange.Redis documentation; the redis API is on http://redis.io):
// the configuration string shown above
var configString =
    "mgblogdemo.redis.cache.windows.net,ssl=true,password=LLyZwv8evHgveA8hnS1iFyMnZ1A=";
var muxer = ConnectionMultiplexer.Connect(configString);
var db = muxer.GetDatabase();
db.KeyDelete("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
int i = (int)db.StringGet("foo");
Console.WriteLine(i); // 3

and there we are; readily talking to an Azure Redis Cache instance over SSL.
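For reference, the same connection can be expressed via the object model instead of a single string; a minimal sketch, using the same demo host and key as above:

var options = new ConfigurationOptions();
options.EndPoints.Add("mgblogdemo.redis.cache.windows.net"); // no port: 6380 is implied because ssl is enabled
options.Ssl = true;
options.Password = "LLyZwv8evHgveA8hnS1iFyMnZ1A=";
var muxer = ConnectionMultiplexer.Connect(options);

This is handy when the configuration is being composed programmatically rather than stored as a single setting.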

Creating a new Redis Labs “redis cloud” instance and configuring the certificates


Another option is Redis Labs; they too have an SSL offering, although it makes some different implementation choices. Fortunately, the same client can connect to both, giving you flexibility. Note: the SSL feature of Redis Labs is not yet available directly through the UI, as they are still gauging uptake etc. But it exists and works, and is available upon request; here’s how:

Once you have logged in to Redis Labs, you should immediately have a simple option to create a new redis instance:

image

Like Azure, a range of different levels is available; I’m using the Free option, purely for demo purposes:

image

We’ll keep the config simple:

image

and wait for it to provision:

image

(note: this only takes a few moments)

Don’t add anything to this DB yet, as it will probably get nuked in a moment! Now we need to contact Redis Labs; the best option here is support@redislabs.com; make sure you tell them who you are, your subscription number (blanked out in the image above), and that you want to try their SSL offering. At some point in that dialogue, a switch gets flipped, or a dial cranked, and the Access Control & Security changes from password:

image

to SSL; click edit:

image

and now we get many more options, including the option to generate a new client certificate:

image

Clicking this button will cause a zip file to be downloaded, which has the keys to the kingdom:

image

The pem file is the certificate authority; the crt and key files are the client key. They are not in the most convenient format for .NET code like this, so we need to tweak them a bit; openssl makes this fairly easy:
c:\OpenSSL-Win64\bin\openssl pkcs12 -inkey garantia_user_private.key -in garantia_user.crt -export -out redislabs.pfx

This converts the 2 parts of the user key into a pfx, which .NET is much happier with. The pem can be imported directly by running certmgr.msc (note: if you don’t want to install the CA, there is another option, see below):

image

Note that it doesn’t appear in any of the pre-defined lists, so you will need to select “All Files (*.*)”:

image

After the prompts, it should import:

image

So now we have a physical pfx for the client certificate, and the server’s CA is known; we should be good to go!

Connecting to a Redis Labs “redis cloud” instance


Back on the Redis Labs site, select your subscription, and note the Endpoint:

image

We need a little bit more code to connect than we did with Azure, because we need to tell it which certificate to load; the configuration object model has events that mimic the callback methods on the SslStream constructor:
var options = new ConfigurationOptions();
options.EndPoints.Add(
    "pub-redis-16398.us-east-1-3.1.ec2.garantiadata.com:16398");
options.Ssl = true;
options.CertificateSelection += delegate {
    return new System.Security.Cryptography.X509Certificates.X509Certificate2(
    @"C:\redislabs_creds\redislabs.pfx", "");
};

var muxer = ConnectionMultiplexer.Connect(options);
var db = muxer.GetDatabase();
db.KeyDelete("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
int i = (int)db.StringGet("foo");
Console.WriteLine(i); // 3

Which is the same smoke test we did for Azure. If you don’t want to import the CA certificate, you could also use the CertificateValidation event to provide custom certificate checks (return true if you trust it, false if you don’t).
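For example, a minimal sketch of that approach, pinning the expected server certificate by thumbprint (the thumbprint shown here is a made-up placeholder; you would use the real server certificate’s thumbprint):

options.CertificateValidation += (sender, certificate, chain, sslPolicyErrors) =>
{
    // trust only the exact server certificate we expect; anything else is rejected
    var cert = certificate as System.Security.Cryptography.X509Certificates.X509Certificate2;
    return cert != null
        && cert.Thumbprint == "0123456789ABCDEF0123456789ABCDEF01234567";
};

Obviously you take responsibility for whatever checks you perform here; returning true unconditionally would defeat the point of SSL.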

Way way way tl:dr;


Cloud host providers are happy to let you use redis, and happy to provide SSL support so you can do it without being negligent. StackExchange.Redis has hooks to let this work with the two SSL-based providers that I know of.

Thursday, 3 July 2014

Dapper gets type handlers and learns how to read maps

A recurring point of contention in dapper has been that it is a bit limited in terms of the types it handles. If you are passing around strings and integers: great. If you are passing around DataTable – that’s a bit more complicated (although moderate support was added for table valued parameters). If you are passing around an Entity Framework spatial type: forget it.
Part of the problem here is that we don’t want dapper to take a huge pile of dependencies on external libraries, that most people aren’t using – and often don’t even have installed or readily available. What we needed was a type handler API. So: I added a type handler API! Quite a simple one, really – dapper still deals with most of the nuts and bolts, and to add your own handler all you need to provide is some basic parameter setup.
For example, here’s the code for DbGeographyHandler; the only interesting thing that dapper doesn’t do internally is set the value – but the type-handler can also do other things to configure the ADO.NET parameter (in this case, set the type name). It also needs to convert between the Entity Framework representation of geography and the ADO.NET representation, but that is pretty easy:
public override void SetValue(IDbDataParameter parameter, DbGeography value)
{
    parameter.Value = value == null ? (object)DBNull.Value
        : (object)SqlGeography.Parse(value.AsText());
    if (parameter is SqlParameter)
    {
        ((SqlParameter)parameter).UdtTypeName = "GEOGRAPHY";
    }
}
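The reverse direction – reading a geography column back into an Entity Framework DbGeography – is handled by the handler’s Parse method; a rough sketch of what that looks like (an approximation rather than the exact Dapper.EntityFramework source), round-tripping via the WKT text form:

public override DbGeography Parse(object value)
{
    return value == null || value is DBNull
        ? null
        : DbGeography.FromText(value.ToString());
}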

and… that’s it. All you need to do is register any additional handlers (SqlMapper.AddTypeHandler()) and it will hopefully work. We can now use geography values in parameters without pain – i.e.
conn.Execute("... @geo ...",
    new { id = 123, name = "abc", geo = myGeography });
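Writing a handler for one of your own types follows the same pattern; purely for illustration, here is a sketch of a handler that stores a System.Uri as a string column (not something dapper ships with):

public class UriHandler : SqlMapper.TypeHandler<Uri>
{
    public override void SetValue(IDbDataParameter parameter, Uri value)
    {
        parameter.Value = value == null ? (object)DBNull.Value : (object)value.ToString();
    }
    public override Uri Parse(object value)
    {
        return value == null || value is DBNull ? null : new Uri(value.ToString());
    }
}

which you register once at startup: SqlMapper.AddTypeHandler(new UriHandler());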

Plugin packages

This means that handlers for exotic types with external dependencies can be shipped separately to dapper, meaning we can now add packages like Dapper.EntityFramework, which brings in support for the Entity Framework types. Neat.

Free to a good home: better DataTable support

At the same time, we can also make a custom handler for DataTable, simplifying a lot of code. There is one slight wrinkle, though: if you are using stored procedures, the type name of the custom type is known implicitly, but a lot of people (including us) don’t really use stored procedures much: we just use raw command text. In this situation, it is necessary to specify the custom type name along with the parameter. Previously, support for this has been provided via the AsTableValuedParameter() extension method, which created a custom parameter with an optional type name – but dapper now internally registers a custom type handler for DataTable to make this easier. We still might need the type name, though, so dapper adds a separate extension method for this, exploiting the extended-properties feature of DataTable:
DataTable table = ...
table.SetTypeName("MyCustomType");
conn.Execute(someSql, new { id = 123, values = table });
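To make that concrete, a minimal end-to-end sketch; this assumes a matching table type (here called MyCustomType, with a single int column) has already been created in the database:

var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Rows.Add(1);
table.Rows.Add(2);
table.Rows.Add(3);
table.SetTypeName("MyCustomType"); // dapper's extension method from above

var ids = conn.Query<int>(
    "select Id from @values", new { values = table }).ToList();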

That should make things a bit cleaner! Custom type handlers are welcome and encouraged - please do share them with the community (ideally in ready-to-use packages).

Tuesday, 1 July 2014

SNK and me work out a compromise

and bonus feature: our build server configuration
A few days ago I was bemoaning the issue of strong names and nuget. There isn’t a perfect solution, but I think I have a reasonable compromise now. What I have done is:
  • Copy/pasted the csproj, with the duplicates referencing a strong name key
  • Changed those projects to emit an assembly named StackExchange.Redis.StrongName
  • Copy/pasted the nuspec, with the new spec creating the StackExchange.Redis.StrongName package from the new assemblies
  • PINNED the assembly version number; this will not change unless I am introducing breaking changes, which will also coincide with major/minor version number changes – or maybe also for reasonably significant feature additions; but not for every point-release
  • Ensured I am using [assembly:AssemblyFileVersion] and [assembly:AssemblyInformationalVersion] to describe the point-release status
This should achieve the broadest reach with the minimum of fuss and maintenance.
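In AssemblyInfo.cs terms, the pinning looks something like this (version numbers purely illustrative):

using System.Reflection;

// pinned: only changes for breaking changes or significant releases
[assembly: AssemblyVersion("1.0.0")]

// these are free to change with every point-release
[assembly: AssemblyFileVersion("1.0.316")]
[assembly: AssemblyInformationalVersion("1.0.316")]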
Since that isn’t enough words for a meaningful blog entry, I thought I’d also talk about the build process we use, and the tooling changes we needed for that. Since we are using TeamCity for the builds, it is pretty easy to double everything without complicating the process – I just added the 2 build steps for the new csproj, and told it about the new nuspec:
image
Likewise, TeamCity includes a tool to automatically tweak the assembly metadata before the build happens – the “AssemblyInfo patcher” – which we can use to pin the one value (to change manually based on human decisions), while keeping the others automatic:
image
Actually, I’ll probably change that to use a parameter to avoid me having to dig 3 levels down. After that, we can just let it run the build, and out pop the 2 nupkg as artefacts ready for uploading:
image
If you choose, you can also use TeamCity to set up an internal nuget feed – ideal for both internal use and dogfooding before you upload to nuget.org. In this case, you don’t need to grab the artefacts manually or add a secondary upload step – you can get TeamCity to publish them by default, making them immediately available in-house:
image
It isn’t obvious in the UI, but this search is actually restricted to the “SE” feed, which is my Visual Studio alias to the TeamCity nuget server:
image
Other build tools will generally have similar features – simply: this is what works for us.

Thursday, 19 June 2014

SNK, we need to talk…


Because the world needs another rant about SNK and NuGet


In .NET assemblies, strong names are an optional lightweight signing mechanism that provides identity, including versioning support. Part of the idea here is that a calling assembly can have a pretty good idea that what it asks for is what it gets – it asks for Foo.Bar with key {blah}, version “1.2.3”, and it gets exactly that from the GAC, and the world is rosy. If somebody installs a new additional version of Foo.Bar, or a Foo.Bar from a different author, the old code still gets the dll it wanted, happy in the knowledge that the identity is reliable.
There are only a few problems with this:
  • Most applications don’t use the GAC; the only times people generally “choose” to use the GAC is when their list of options had exactly one option: “use the GAC”. Sharepoint and COM+, I’m looking at you and judging you harshly
  • Actually, there’s a second category of this: people who use strong names because that is what their corporate policy says they must do, with some well-meaning but completely misguided and incorrect notion that this provides some kind of security. Strong naming is not a security feature. You are just making work and issues for yourself; seriously
  • It doesn’t actually guarantee the version: binding redirect configuration options (just in an xml file in your application) allow for a different version (with the same key) to be provided
  • It doesn’t actually guarantee the integrity of the dll: if somebody has enough access to your computer that they have access to the GAC, they also have enough access to configure .NET to skip assembly identity checking for that dll (just an “sn -Vr {assembly}” away)
And of course, it introduces a range of problems:
  • Versioning becomes a huge pain in the backside for all downstream callers, who now need to manage the binding redirect configuration every time a dll gets upgraded anywhere (there are some tools that can help with this, but it isn’t perfect)
  • A strong-named assembly can only reference other strong-named assemblies
  • You now have all sorts of key management issues over your key file (despite the fact that it is pointless and can be bypassed, as already mentioned)

 

Assembly management versus package management


Now enter package management, i.e. NuGet and kin (note: I’m only using NuGet as the example here because it is particularly convenient and readily available to .NET developers; other package management tools exist, each with strengths and weaknesses). Here we have a tool that clearly targets the way 95% of the .NET world actually works:
  • no strong name; we could not care less (or for the Americans: we could care less)
  • no GAC: libraries deployed alongside the application for per-application isolation and deployment convenience (this is especially useful for web-farms, where we just want to robocopy the files out)
  • versioning managed by the package management tool
So, as a library author, it is hugely tempting to simply ignore strong naming, and just put assemblies without a strong name onto NuGet. And that is exactly what I have done with StackExchange.Redis: it has no strong name. And for most people, that is fine. But now I get repeated calls and emails from people saying “please can you strong name it”.
The argument for and against strong-naming in NuGet is very verbose; there are threads with hundreds of messages for and against – both with valid points. There is no simple answer here.
  • if it is strong named, I introduce the problems already mentioned – when for 95% (number totally invented, note) of the people using it, this is simply not an issue
  • if it isn’t strong named, people doing Sharepoint development, or COM+ development, or just with awkward local policies cannot use it – at least not conveniently
They can of course self-sign locally, and there are tools to help with that – including Nivot.StrongNaming. But this only helps for immediate references: any such change will be fatal to indirect references. You can’t use binding redirects to change the identity of an assembly – except for the version.
I could start signing before deployment, but that would be a breaking change. So I’d have to at a minimum do a major version release. Again, direct references will be fine – just update the package and it works – but indirect references are still completely toast, with no way of fixing them except to recompile the intermediate assembly against the new identity. Not ideal.

I’m torn


In some ways, it is tempting to say “screw it, I need to add a strong name so that the tiny number of people bound by strong naming can use it”, but that is also saying “I need to totally and irreconcilably break all indirect references, to add zero functionality and despite the fact that it was working fine”, and also “I actively want to introduce binding redirect problems for users who currently don’t have any issues whatsoever”.
This is not an easy place to be. Frankly, at this stage I’m also not sure I want to be adding implicit support to the problems that SNK introduce by adding a strong name.

But what if…


Imagineering is fun. Let’s be realistic and suppose that there is nothing we can do to get those systems away from strong names, and that we can’t change the strong-named-can-only-reference-strong-named infection. But let’s also assume we don’t want to keep complicating package management by adding strong names by default. Why can’t we let this all just work? Or at least, why can’t our package management tools fix it all for us? It seems to me that we would only need two things:
  • assembly binding redirects that allow unsigned assemblies to be forwarded to signed assemblies
  • some inbuilt well-known publicly available key that the package management tools could use to self-sign assemblies
For example, imagine that you have your signed application, and you use NuGet to add a package reference to Microsoft.Web.RedisSessionStateProvider, which in turn references StackExchange.Redis. In my imaginary world, NuGet would detect that the local project is signed, and these aren’t – so it would self-sign them with the well-known key, and add an assembly-binding redirect from “StackExchange.Redis” to “StackExchange.Redis with key hash and version”. Important note: the well-known key here is not being used to assert any particular authorship etc – it is simply “this is what we got; make it work”. There’s no need to protect the private key of that.
The major wrinkle in this, of course, is that it would require .NET changes to the fusion loader, in order to allow a binding redirect that doesn’t currently exist. But seriously: haven’t we been having this debate for long enough now? Isn’t it time the .NET framework started helping us with this? If I could request a single vNext CLR feature: this would be it.
Because I am so very tired of having this whole conversation, after a decade of it.
There are probably huge holes in my reasoning here, and reasons why it isn’t a simple thing to change. But: this change, or something like it, is so very very overdue.

And for now…


Do I add a strong name to StackExchange.Redis? Local experiments have shown me that whatever I do: somebody gets screwed, and no matter what I do: I’m going to get yelled at by someone. But I’m open to people’s thoughts and suggestions.

Wednesday, 18 June 2014

Dapper : some minor but useful tweaks

At Stack Exchange we pay a lot of attention to our data access, which is why a few years ago we spawned dapper. In case you haven’t seen it, the purpose of dapper is to remove the mind-numbing and easy-to-get-wrong parts of ADO.NET, making it easy to do proper parameterization, without having the overhead of a full ORM. In other words, to make it easy to do this:
DateTime from = ...
int id = ...
int count = ...
var orders = connection.Query<Order>(
    @"select top (@count) * from Orders where CustomerId=@id
    and CreationDate >= @from and Status=@open",
    new { id, from, count, open=OrderStatus.Open }).ToList();

It is unapologetic about expecting you to know and “own” your SQL – entirely good things, IMO (by comparison: most professionals wouldn't program the web tier without knowing what their html looks like).

Inline value injection


With a few exceptions, the sql that you pass in is essentially opaque; historically, the only things dapper has done with it are:
  • check the sql to see which possible parameters can definitely be ignored, because they aren’t mentioned (assuming it isn’t a stored procedure, etc)
  • parameter expansion (typically used for in queries, i.e. where id in @ids, where @ids is something like an int[] – a quick example follows below)
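For instance, a minimal sketch of that expansion in use (Order and connection as per the earlier example); dapper rewrites @ids into an explicit list of parameters before execution:

int[] ids = { 1, 2, 3 };
var orders = connection.Query<Order>(
    "select * from Orders where Id in @ids", new { ids }).ToList();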
Sometimes, however, parameters bite back. There are times when you don’t really want something to be a parameter – perhaps it changes the query so significantly that a separate query plan would be desirable – but you also don’t want to build a different command string per value yourself, as this would force dapper to perform the meta-programming per value (since each distinct string would be a cache miss on the strategy cache).

Alternatively, often the value will never be changed (but is treated as a parameter for code reasons), or will only ever be one value unless some configuration setting is changed, maybe twice a year. Classic values here might be a status code from an enum, or the number of rows (select top 50 sometimes performs quite differently to select top (@rows)).

Because of this, the current builds of dapper introduce value injection. This uses a syntax deliberately similar to, but distinct from, parameters, and injects the value directly into the SQL immediately before execution (while still allowing dapper to re-use the cached strategy). It probably makes more sense with an example:
DateTime from = ...
int id = ...
int count = ...
var orders = connection.Query<Order>(
    @"select top {=count} * from Orders where CustomerId=@id
    and CreationDate >= @from and Status={=open}",
    new { id, from, count, open=OrderStatus.Open }).ToList();

Essentially, anything of the form {=name} will be treated as an inlined value, and will be replaced rather than treated as a parameter. Of course, at this point you’re probably screaming “nooooooo! what about sql injection!” – which is why it restricts usage to integer-based types (including enums). This significantly reduces the risk of any abuse, and of course: you don’t have to use it if you don’t want to!

Parameter expansion support for OPTIMIZE FOR query hints


Parameter sniffing can be a real pain. For us, we call this “the Jon Skeet problem”: we have some very different users – some with maybe a handful of posts and a few dozen reputation, and some with tens of thousands of posts and over a half-million reputation (which means: a lot of vote records). Let’s assume we want to keep the user-id as a regular SQL parameter: if the first use of a query is for “new user Fred”, the query plan won’t work well for Jon. If the first use of a query is for Jon, it won’t work well for “new user Fred”. Fortunately, SQL Server has a mechanism to tell it not to get too attached to a query-plan based on a parameter – the OPTIMIZE FOR query hint. This can be left open (applies to all parameters), but it is often desirable to use the variant that tells it exactly which parameters we are concerned about. This is easy normally, but recall that dapper offers parameter expansion. So how can we use this query hint with the in query above? We don't conveniently know the names of the eventual parameters...

Because of this, dapper now recognises this specific pattern, and performs parameter expansion compatible with this hint. If you use:
option (optimize for (@ids unknown))

it will expand this out to the correct query hint for the parameters that @ids becomes during expansion.
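Putting the two together, a sketch of what usage looks like (the Post type and table here are illustrative):

int[] ids = { 1, 2, 3 };
var posts = connection.Query<Post>(
    @"select * from Posts where Id in @ids
      option (optimize for (@ids unknown))",
    new { ids }).ToList();

Both the in clause and the query hint are rewritten to match the expanded parameters.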

Async all the things


The usage of async keeps growing, so dapper has now evolved much better async methods (with a number of useful contributions from users); most of these are self explanatory – simply using the *Async method names, but some key additions:
  • Support for cancellation tokens (see the sketch after this list)
  • Opt-in use of pipelining (performing a range of operations on a connection without waiting for the early operations to complete – this requires MARS to be enabled)
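A minimal sketch of the async usage (connection, id and cancellationToken are assumed from context); the cancellation token travels via CommandDefinition:

// plain async usage
var orders = (await connection.QueryAsync<Order>(
    "select * from Orders where CustomerId=@id", new { id })).ToList();

// with a cancellation token
var command = new CommandDefinition(
    "select * from Orders where CustomerId=@id", new { id },
    cancellationToken: cancellationToken);
var cancellable = (await connection.QueryAsync<Order>(command)).ToList();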

Summary

No huge changes, but hopefully useful to a few people. Enjoy.

Thursday, 17 April 2014

Technical Debt, a case study : tags

 

This is going to be more a discussion / wordy thing; no real code here. This is just to wander through some points, and maybe you’ll find them interesting or thought provoking. You have been warned.

At Stack Exchange, we have a fair understanding of technical debt. Like real debt, technical debt is not by necessity a bad thing – it can allow you to choose an acceptable (but not ideal) solution today, which means you can ship today, but you know that at some point you are going to have to revisit it. Like all loans, technical debt carries interest.

Today I’m talking about tags; specifically, the tags on Stack Exchange questions. These things:

image

I’m going to run through a lot of back-story, but if you’re lazy, just pretend you read all of it and skip to the last paragraph. Or, heck, go look at cat videos. I won’t tell.

Step 0 : the original problem

Way way back (long before I joined the company), the then-much-smaller team needed a way to store and query tags – in particular a wide range of “{a} and {b} and {c}”, “{d} or {e}”, “{f} but not {y}” etc queries for the “show me the questions in {some tags}” pages, or the search page (which allows a lot of tag-based filtering). Your classic database approach here might be to have a Posts table, and Tags table, and a PostTags table, and do a lot of work on PostTags, but as I understand it this simply didn’t want to perform well. Equally, we access and display questions a lot. No, a real lot. Huge amounts. The one thing that we want to happen really really efficiently is “fetch a question”.

Having that data spread over 3 tables requires complex joins or multiple queries (because you could be fetching 500 questions, which could each have 1-5 tags), and then processing the more complicated data. Because of this, we store the tags as a single character-data field in the database – all the tags in one string. This makes it possible to get the post including all the tags in a single row query, and just worry about deciphering it at the UI. Neat. But it doesn’t really allow for efficient query.

Aside: at this point I should also note that we additionally store the data in a PostTags table (although I can’t remember whether this was the case at the time) – the inline vs normalized data is kept in sync and used for different purposes; the things you do for performance, eh?

Step 1 : the evil/awesome hack

So, we’ve got inline tag data that is simple to display, but is virtually impossible to query. Regular indexing doesn’t really work well at finding matches in the middle of character data. Enter (trumpets) SQL Server Full-Text Search. This is inbuilt to SQL Server (which we were already using), and allows all kinds of complex matching to be done using CONTAINS, FREETEXT, CONTAINSTABLE and FREETEXTTABLE. But there were some problems: stop words and non-word characters (think “c#”, “c++”, etc). For the tags that weren’t stop words and didn’t involve symbols, it worked great. So how to convince Full Text Search to work with these others? Answer: cheat, lie and fake it. At the time, we only allowed ASCII alpha-numerics and a few reserved special characters (+, –, ., #), so it was possible to hijack some non-English characters to replace these 4, and the problem of stop words could be solved by wrapping each tag in a pair of other characters. It looked like gibberish, but we were asking Full Text Search for exact matches only, so frankly it didn’t matter. A set of tags like “.net c#” thus became “éûnetà écñà”. Obviously! This worked; it performed well (remember, Stack Overflow was still very young here), and allowed the team to ship. Shipping is a pretty important feature! The team knew it was a fudge, but all was well (enough) in the world…

Step 2 : the creaks begin

Stack Overflow was a hit. Over time, the question count grew steadily larger, as did the number of page requests. Eventually it became apparent that our flagrant abuse of Full Text Search was becoming a performance bottleneck. To be fair, Full Text Search wasn’t really intended to be used in this way, and we were using it at an alarming rate (even after caching). But by now the team had a viable product, more developers, and enough information to judge that more effort was both necessary and worthwhile. At different points, then, our search efforts were re-written to use Lucene.Net and then Elasticsearch, and the “list by tag” work grew a bespoke entity that we call the “tag engine” (which is what an earlier post of mine is about) – essentially an out-of-process index service. Nowhere near as fully-featured as (say) Lucene.Net, but it only needs to do one thing.

Step 3 : Stack Exchange has sites about language

In time, we had requests for sites that were still primarily in English, but were about languages; French, Japanese, Russian, etc. A lot of their tags were in English, but there was a need to remove our “ASCII alpha-numerics” restriction. Fortunately, since we had Elasticsearch and the tag-engine, this mainly meant changing the few remaining Full Text Search usages (things that were used rarely, and hadn’t been worth fixing) to use alternatives. The format in the database, however, remained. And since we didn’t want to introduce even more bizarre rules, we kept the 6 reserved characters. Sorry, French.StackExchange – no “é” or “à” in your tags. In reality this was less of a problem than it sounds, as French.StackExchange elects to use accent-free tags – while sites like Russian.StackExchange could use the tags they wanted. And the world keeps turning.

Step 4 : Stack Exchange goes fully multi-lingual (well, ok, bilingual)

Enter Portuguese; in pt.stackoverflow.com, we have our first site that is completely in another language. No weasel room left now: they want tags like this:

image

And guess what: ç was one of our reserved tokens (representing +, so “c++” was stored as “écççà”). We couldn’t even whistle a happy tune and hope nobody would notice the gaps: they found it on day 1 of closed beta. Maybe day 2. We finally had a reason to remove this legacy from the past. But – and I cannot emphasize this enough – these changes were scary. So scary that we didn’t want to do an “update all the things” until we’d had a chance to “bed it in” on pt.StackOverflow, and fix any glitches that we’d missed. As it happened, all the code was making use of a single “decipher the tags” method, so it was sensible and pragmatic to simply make that understand both the old format and whatever we came up with now. After some thought and consideration, we settled on a pipe (bar) delimited natural representation, with leading/trailing pipes, so “.net c#” becomes simply “|.net|c#|”. This has virtues:

  • very simple to parse
  • we can tell immediately from the first character which format it is
  • bulk update and removal of tags can be done with a simple replace (including the pipes, to avoid replacing mid-tag matches)
  • and unlike the old format, you don’t need to worry about special-casing the first/last tag when doing a bulk operation (trailing spaces, etc)

Sure, there’s still a reserved |, but we can live with that. We could have used a non-printing character, but that would be very awkward in things like JSON – lots of risk of subtle bugs.
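To make the format concrete, here is a rough sketch (not the actual Stack Exchange code) of a parse that can tell the two formats apart and handle both:

static string[] ParseTags(string stored)
{
    if (string.IsNullOrEmpty(stored)) return new string[0];

    if (stored[0] == '|')
    {
        // new format: "|.net|c#|" – pipe-delimited, with leading/trailing pipes
        return stored.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // old format: space-delimited, each tag wrapped in marker characters with a
    // few symbols (+ - . #) substituted; DecodeLegacyTag is a hypothetical helper
    // standing in for that substitution table
    var parts = stored.Split(' ');
    var tags = new string[parts.Length];
    for (int i = 0; i < parts.Length; i++)
    {
        tags[i] = DecodeLegacyTag(parts[i]);
    }
    return tags;
}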

Once this had been working on pt.StackOverflow for a while, we flipped a switch and all new write operations switched to the new format; all “language” sites could have free access to the 6 previously reserved characters. Hoorah!

Step 5 : the backfill

When we did that, we only affected new writes. There was still data using the old format. A lot of data (not just the questions themselves: all tag edits, for example, were stored in this way). But while it remained, our “bulk remove a tag” code had to cope with two formats, which is an unnecessary maintenance overhead. So finally this week we absorbed the pain to do a batched migration of all the old data to the new format. Fairly routine, if a little scary.

And happy dance; the hack is no more!

So why are you telling me all this? What is the point?

Our job as engineers is not always to write the best possible thing that we can, and that can solve every problem ever, and is beautiful code that makes us want to adopt it and take it on picnics. An important part of our job is to:

  • understand what the code actually and genuinely needs to be able to do today and tomorrow
  • think of a way of delivering that with the available resources
  • think a little about what might be needed in the future, acknowledging that this is only a best guess
  • make sure that it is not impossible (or prohibitively expensive) to change things later
At the time the original hack was put in, it was absolutely the right choice. A few lines of code, a few clicks in SQL Server, job done. Shipped. Working. It would not have been sensible to invest time getting an external indexing service working just for this. But there was always the possibility to change the implementation, and the “unscramble the mess” code was in one place. Sure: there was debt, but the level of pain was kept to a minimum, and the debt has finally been paid back in full.

Friday, 28 March 2014

Windows Redis-64

As you know, I’m a huge fan of Redis. This week, a team at the Microsoft Open Tech group released a new drop of redis-64 for windows (note: all links there are old – github is fresher), with binaries available on both NuGet and chocolatey. It is even marked by them as “production ready”. The current drop is 2.8.4 in Redis terms (the last public drop was 2.6.12).

So congratulations, thanks and kudos to the team involved: an important milestone.

In production, at Stack Exchange we use CentOS to host Redis, so I’m simply not in a position to comment on how well it holds up under production load, but for windows developers using Redis it is really handy having convenient access to a Redis server without having to spin up a VM.

One really important thing to watch

From reading the "Redis Release Notes.docx" and "redis.windows.conf" file in the packages/Redis-64.2.8.4 folder, the approach they used to solve the fork problem was to introduce a shared, memory-mapped file, and it defaults to the size of your physical memory (and that is before it has forked, when it can grow due to copy-on-write). So whether you are using it as a production server or a dev tool, you might want to pay particular attention to the “maxheap” setting in your .conf file. For my local dev work I usually spin up 5 servers, and I currently have 24GB physical memory in my machine – so by default, when running 5 completely empty databases, it instantly chewed up 120GB of hard disk space (you get it back when the server exits). If you don’t have enough disk space: the server won’t start. You'll probably see:

QForkMasterInit: system error caught. error code=0x000005af, message=VirtualAllocEx failed.

Fortunately it is simple to configure a memory bound:

maxheap 4gb # or whatever

Please note – this really isn’t a criticism; I understand the “why” – they need the shared memory-mapped file for the pseudo-fork, and they need to allocate the “maxheap” amount from the outset because IIRC you can’t grow such a map very flexibly. My intent is merely to flag it in flashing colours that if you’re playing with Redis-64, you want to think about your memory vs disk.

Now go, play, have some Redis fun.