Friday, 14 November 2014

Episode 2; a cry for help


Part 2 of 2; in part 1 I talked about Cap’n Proto; here’s where you get to be a hero!
In my last post, I talked about the fun I've been having starting a Cap'n Proto spike for .NET/C#. Well, here's the thing: I don't scale. I already have an absurd number of OSS projects needing my time, plus a full-time job, plus a family, including a child with "special needs", plus I help run my local school (no, really - I'm legally accountable for the education of 240 small people; damn!), and a number of speaking engagements (see you at London NDC or WCDC?).

I don't have infinite time. What I don't need is yet another project where I'm the primary contributor, struggling to give it anything like the time it deserves. So here's me asking if anybody in my reach wants to get involved and help me take it from a barely-usable shell into a first-rate tool. Don't get me wrong: the important word in "barely usable" is "usable", but it could be so much better.
 

Remind me: why do you care about this?


Let’s summarize the key points that make Cap’n Proto interesting:
  • next to zero processing
  • next to zero allocations (even for a rich object model)
  • possibility to exploit OS-level concepts like memory mapped files (actually, this is already implemented) for high performance
  • multi-platform
  • version tolerant
  • interesting concepts like “unions” for overlapping data
Although if I’m honest, pure geek curiosity was enough for me. The simple “nobody else has done it” was my lure.

So what needs doing?


While the core is usable, there's a whole pile of stuff that could be done to make it into a much more useful tool. Conveniently, a lot of these are fairly independent, and could be sliced off very easily if a few other people wanted to get involved in an area of particular interest. But; here are my high-level ideas:

  • Schema parsing: this is currently a major PITA, since the current "capnp" tool only really works / compiles for Linux. There is a plan in the core Cap'n Proto project to get this working on MinGW (for Windows), but it would be nice to have a full .NET parser - I'm thinking "Irony"-based, although I'm not precious about the implementation
  • Offset processing: related to schema parsing, we need to compute the byte offsets of a parsed schema. This is another job that the "capnp" tool does currently. Basically, the idea is to look at all of the defined fields, and assign byte offsets to each, taking into account some fairly complicated "group" and "union" rules
  • Code generation: I have some working code-gen, but it is crude and "least viable". It feels like this could be done much better. In particular I'm thinking "devenv" tooling, whether that means T4 or some kind of VS plugin, ideally trivially deployed (NuGet or similar) - so some experience making Visual Studio behave would be great. Of course, it would be great if it worked on Mono too – I don’t know what that means for choices like T4.
  • Code-first: “schemas? we don't need no stinking schemas!”; here I mean the work to build a schema model from pre-existing types, presumably via attributes – or perhaps combining an unattributed POCO model with a regular schema (see the sketch after this list for the sort of thing I mean)
  • POCO serializer: the existing proof-of-concept works via generated types; however, building on the code-first work, it is entirely feasible to write a "regular" serializer that talks in terms of some pre-existing POCO model, but uses the library api to read/write the objects per the wire format
  • RPC: yes, Cap'n Proto defines an RPC stack. No, I haven't even looked at it. It would be great if somebody did, though
  • Packed encoding: the specification defines an alternative "packed" representation for data, that requires some extra processing during load/save, but removes some redundant data; not currently implemented
  • Testing: I'm the worst possible person to test my own code – too “close” to it. I should note that I have a suite of tests related to my actual needs that aren't in the open source repo (I'll try and migrate many of them), but: more would be great
  • Multi-platform projects: for example, an iOS / Windows Store version probably needs to use less (well, zero) of the “unsafe” code (mostly there for efficiency); does it compile / run on Mono? I don’t know.
  • Proper performance testing; I'm casually happy with it, but some dedicated love would be great
  • Much more compatibility testing against the other implementations
  • Documentation; yeah, telling people how to use it helps
  • And probably lots more stuff I'm forgetting
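To make the "code-first" idea a bit more concrete, here is the sort of thing I have in mind – emphasis on hypothetical: none of these attribute names exist yet, this is purely illustrative:

// hypothetical attributes - nothing like this exists in the project today
[CapnpStruct]
public class Person
{
    [CapnpField(0)] public string Name { get; set; }
    [CapnpField(1)] public int Age { get; set; }
    [CapnpField(2)] public bool IsActive { get; set; }
}

The job would then be to compute the same data-word / pointer layout from a model like that as the schema tools would otherwise produce.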

Easy budget planning


Conveniently, I have a budget that can be represented using no bits of data! See how well we're doing already? I can offer nothing except my thanks, and the satisfaction of having fun hacking at some fun stuff (caveat: for small values of "fun"), and maybe some community visibility if it takes off. Which it might not. I'm more than happy to open up commit access to the repo (I can always revert :p) - although I'd rather keep more control over NuGet (more risk to innocents).
So... anyone interested? Is my offer of work for zero pay and limited reward appealing? Maybe you’re looking to get involved in OSS, but haven’t found an interesting project. Maybe you work for a company that has an interest in being able to process large amounts of data with minimum overheads.

Besides, it could be fun ;p

If you think you want to get involved, drop me an email (marc.gravell@gmail.com) and we'll see if we can get things going!





Efficiency: the work you don’t do…

 
Part 1 of 2; here’s where I talk about a premise; in part 2 I pester you for help and involvement
Q: what's faster than doing {anything}?
A: not doing it
I spend a lot of time playing with raw data, whether it is serialization protocols (protobuf-net), materialization (dapper) or network api protocols (redis, web-sockets, etc). I don't claim to be the world's gift to binary, but I like to think I can "hold my own" in this area. And one thing that I've learned over and over again is: at the application level, sure: do what you want - you're not going to upset the runtime / etc - but at the library level (in these areas, and others) you often need to do quite a bit of work to minimize CPU and memory (allocation / collection) overhead.
In the real world, we have competing needs. We want to process large amounts of data, but we don't want to pay the price. What to do?
 

Get me a bowl of Cap’n Proto…

A little while ago I became aware of Cap'n Proto. You might suspect from the name that it is tangentially related to Protocol Buffers (protobuf), and you would be right: it is the brain-child of Kenton Varda, one of the main originators of protobuf. So at this point you're probably thinking "another serialization protocol? boooring", but stick with me - it gets more interesting! And not just because it describes itself as a “cerealization protocol” (badum tish?). Rather than simply describing a serialization format, Cap'n Proto instead describes a general way of mapping structured data to raw memory. And here's the fun bit: in a way that is also suitable for live objects.
Compare and contrast:
  • load (or stream) the source data through memory, applying complex formatting rules, constructing a tree (or graph) of managed objects, before handing the root object back to the caller
  • load the source data into memory, and... that's it
Cap'n Proto is the second of these; it is designed to be fully usable (read and write) in the raw form, so if you can load the data, your work is done.
 

So how does that work?

Basically, it describes a “message” of data as a series of one or more “segments” (which can be of different sizes), with the objects densely packed inside a segment. Each object is split into a number of “data words” (this is where your ints, bools, floats, etc, get stored), and a number of “pointers”, where “pointer” here just means a few bits to describe the nature of the data, and either a relative offset to another chunk in the same segment, or an absolute offset into a different segment. Visually:
image
We can then make an object model that sits on top of that, where each object just has a reference to the segment, an absolute offset to the first word of our data, and a few bits to describe the shape, and we’re done. This format works both on disk and for live interactive objects, so the disk format is exactly the in-memory format. Zero processing. Note that there’s a lot of up-front processing that defines how an object with 2 bools, an int, a float, a string, and another bool (that was added in a later revision of your schema) should be laid out in the data words, but ultimately that is a solved problem.
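To make that a bit more concrete, here's a minimal sketch of the sort of value-type "pointer" I mean – illustrative only, not the actual API:

// illustrative only; the real implementation has rather more going on
struct ExamplePointer
{
    private readonly ulong[] segment;  // the segment this object lives in
    private readonly int wordOffset;   // absolute offset of the first data word
    public ExamplePointer(ulong[] segment, int wordOffset)
    {
        this.segment = segment;
        this.wordOffset = wordOffset;
    }
    // reading a field is just indexing and bit-shifting against the raw words
    public int GetInt32(int byteOffset)
    {
        return unchecked((int)(segment[wordOffset + (byteOffset >> 3)] >> ((byteOffset & 7) * 8)));
    }
}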
 

Nothing is free

Of course everything has a price; the cost in this case is that traversing from object to object involves parsing the pointer offsets, but this is all heavily optimized - it just means the actual data is one level abstracted. Indeed, most of the work in shimming between the two can be done via basic bitwise operations (“and”, “shift”, etc). Plus of course, you only pay this minuscule cost for data you actually touch. Ignoring data is effectively entirely free. So all we need to do is load the data into memory. Now consider: many modern OSes offer "memory mapped files" to act as an OS-optimized way of treating a flat file as raw memory. This holds some very interesting possibilities for insanely efficient processing of arbitrarily large files.
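To give a flavour of that bit-twiddling, decoding a "struct pointer" word looks roughly like the following – this is my paraphrase of the encoding spec, so treat the exact field widths as indicative rather than gospel:

// rough sketch of unpacking a Cap'n Proto struct-pointer word; check the
// encoding spec before relying on the exact layout
static void DecodeStructPointer(ulong word, out int offsetWords,
    out int dataWords, out int pointerWords)
{
    int kind = (int)(word & 3);                    // lower 2 bits: pointer kind (0 = struct)
    if (kind != 0) throw new InvalidOperationException("not a struct pointer");
    offsetWords = unchecked((int)(uint)word) >> 2; // signed offset to the target, in words
    dataWords = (int)((word >> 32) & 0xFFFF);      // data section size, in words
    pointerWords = (int)((word >> 48) & 0xFFFF);   // pointer section size, in words
}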
 

I couldn't stay away

In my defence, I did have a genuine business need, but it was something I'd been thinking about for a while; so I took the plunge and started a .NET / C# implementation of Cap'n Proto, paying lots of attention to efficiency - and in particular allocations (the "pointers" to the underlying data are implemented as value types, with the schema-generated entities also existing as value types that simply surround a "pointer", making the defined members more conveniently accessible). This means that not only is there no real processing involved in parsing the data, but there are also no allocations - and no allocations means nothing to collect. Essentially, we can use the “message” as a localized memory manager, with no .NET managed reference-types for our data – just structs that refer to the formatted data inside the “message” – and then dispose of the entire “message” in one go, rather than having lots of object references. Consider also that while a segment could be a simple byte[] (or perhaps ulong[]), it could also be implemented using entirely unmanaged memory, avoiding the “Large Object Heap” entirely (we kinda need to use unmanaged memory if we want to talk to memory mapped files). A bit “inner platform”, but I think we’ll survive.
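As a taster of the memory-mapped angle, getting at a file as raw memory is only a few lines using System.IO.MemoryMappedFiles – no Cap'n Proto involved yet, just the OS primitive (the file path here is obviously made up):

// sketch: map a file and read 64-bit words straight out of it, without
// copying the contents into managed arrays
using (var mmf = MemoryMappedFile.CreateFromFile(@"c:\data\big-message.bin"))
using (var accessor = mmf.CreateViewAccessor())
{
    // in a Cap'n Proto message, the leading words describe the segment count/sizes
    ulong firstWord = accessor.ReadUInt64(0);
    Console.WriteLine(firstWord);
}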
 

Sounds interesting! How do I use it?

Well, that’s the gotcha; the core is stable, but I have lots more to do, and this is where I need your help. I think there's lots of potential here for a genuinely powerful and versatile piece of .NET tooling. In my next blog entry, I'm going to be describing what the gaps are and inviting community involvement.

















Monday, 1 September 2014

Optional parameters: maybe not so harmful

 

A few days ago I blogged Optional parameters considered harmful; well, as is often the case – I was missing a trick. Maybe they aren’t as bad as I said. Patrick Huizinga put me right; I don’t mind admitting: I was being stupid.

To recap, we had a scenario where these two methods were causing a problem:

static void Compute(int value, double factor = 1.0,
    string caption = null)
{ Compute(value, factor, caption, null); }
static void Compute(int value, double factor = 1.0,
    string caption = null, Uri target = null)
{ /* ... */ }

What I should have done is: make the parameters non-optional in all overloads except the new one. Existing compiled code isn’t interested in the optional parameters, so will still use the old methods. New code will use the most appropriate overload, which will often (but not quite always) be the overload with the most parameters – the optional parameters.


static void Compute(int value, double factor,
    string caption)
{ Compute(value, factor, caption, null); }
static void Compute(int value, double factor = 1.0,
    string caption = null, Uri target = null)
{ /* ... */ }
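The reason this is safe: optional-argument defaults are baked into the calling assembly when it is compiled, so an existing caller has effectively already written out the full three-argument call, and binds happily to the (now non-optional) three-parameter overload:

// what an already-compiled caller of the old method actually contains
Compute(123);            // what was written in the source
Compute(123, 1.0, null); // what got baked into the calling assembly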

There we go; done; simple; working; no issues. My thanks to Patrick for keeping me honest.

Thursday, 28 August 2014

Optional parameters considered harmful (in public libraries)


UPDATE: I was being stupid. See here for the right way to do this

TAKE THIS POST WITH A HUGE PINCH OF SALT

Optional parameters are great; they can really cut down the number of overloads needed on some APIs where the intent can be very different in different scenarios. But; they can hurt. Consider the following:
static void Compute(int value, double factor = 1.0,
    string caption = null)
{ /* ... */ }

Great; our callers can use Compute(123), or Compute(123, 2.0, "awesome"). All is well in the world. Then a few months later, you realize that you need more options. The nice thing is, you can just add them at the end, so our method becomes:
static void Compute(int value, double factor = 1.0,
  string caption = null, Uri target = null)
{ /* ... */ }

This works great if you are recompiling everything that uses this method, but it isn’t so great if you are a library author; the method could be used inside another assembly that you don’t control and can’t force to be recompiled. If someone updates your library without rebuilding that other dll, it will start throwing MissingMethodException, because that dll was compiled against the old three-parameter signature (with the defaults baked in at each call site) and the method it expects no longer exists; not great.

OK, you think; I’ll just add it as an overload instead:
static void Compute(int value, double factor = 1.0,
    string caption = null)
{ Compute(value, factor, caption, null); }
static void Compute(int value, double factor = 1.0,
    string caption = null, Uri target = null)
{ /* ... */ }

And you test Compute(123, 1.0, "abc") and Compute(123, target: foo), and everything works fine; great! ship it! No, don’t. You see, what doesn’t work is: Compute(123). This instead creates a compiler error:


The call is ambiguous between the following methods or properties: 'blah.Compute(int, double, string)' and 'blah.Compute(int, double, string, System.Uri)'

Well, damn…

This is by no means new or different; this has been there for quite a while now – but it still sometimes trips me (and other people) up. It would be really nice if the compiler had a heuristic for this scenario, such as “pick the one with the fewest parameters”, but: not today. A limitation to be wary of (if you are a library author). For reference, it bit me hard when I failed to add CancellationToken as a parameter on some *Async methods in Dapper that had optional parameters - and of course, I then couldn't add it after the fact.

Friday, 11 July 2014

Securing redis in the cloud

Redis has utility in just about any system, but when we start thinking about “the cloud” we have a few additional things to worry about. One of the biggest issues is that the cloud is not your personal space (unless you have a VPN / subnet setup, etc) – so you need to think very carefully about what data is openly available to inspection at the network level. Redis does have an AUTH concept, but frankly it is designed to deter casual access: all commands and data remain unencrypted and visible in the protocol, including any AUTH requests themselves. What we probably want, then, is some kind of transport level security.
Now, redis itself does not provide this; there is no standard encryption, but you could configure a secure tunnel to the server using something like stunnel. This works, but requires configuration at both client and server. But to make our lives a bit easier, some of the redis hosting providers are beginning to offer encrypted redis access as a supported option. This certainly applies both to Microsoft “Azure Redis Cache” and Redis Labs “redis cloud”. I’m going to walk through both of these, discussing their implementations, and showing how we can connect.

Creating a Microsoft “Azure Redis Cache” instance

First, we need a new redis instance, which you can provision at https://portal.azure.com by clicking on “NEW”, “Everything”, “Redis Cache”, “Create”:
image
image
There are different sizes of server available; they are all currently free during the preview, and I’m going to go with a “STANDARD” / 250MB:
image
Azure will now go away and start creating your instance:
image
This could take a few minutes (actually, it takes surprisingly long IMO, considering that starting a redis process is virtually instantaneous; but for all I know it is running on a dedicated VM for isolation etc; and either way, it is quicker and easier than provisioning a server from scratch). After a while, it should become ready:
image

Connecting to a Microsoft “Azure Redis Cache” instance

We have our instance; let's talk to it. Azure Redis Cache uses a server-side certificate chain that should be valid without having to configure anything, and uses a client-side password (not a client certificate), so all we need to know is the host address, port, and key. These are all readily available in the portal:
image
image
Normally you wouldn’t post these on the internet, but I’m going to delete the instance before I publish, so; meh. You’ll notice that there are two ports: we only want to use the SSL port. You also want either stunnel, or a client library that can talk SSL; I strongly suggest that the latter is easier! So; Install-Package StackExchange.Redis, and you’re sorted (or Install-Package StackExchange.Redis.StrongName if you are still a casualty of the strong name war). The configuration can be set either as a single configuration string, or via properties on an object model; I’ll use a single string for convenience – and my string is:
mgblogdemo.redis.cache.windows.net,ssl=true,password=LLyZwv8evHgveA8hnS1iFyMnZ1A=

The first part is the host name without a port; the middle part enables ssl, and the final part is either of our keys (the primary in my case, for no particular reason). Note that if no port is specified, StackExchange.Redis will select 6379 if ssl is disabled, and 6380 if ssl is enabled. There is no official convention on this, and 6380 is not an official “ssl redis” port, but: it works. You could also explicitly specify the ssl port (6380) using standard {host}:{port} syntax. With that in place, we can access redis (an overview of the library API is available here; the redis API is on http://redis.io)
var configString =
    "mgblogdemo.redis.cache.windows.net,ssl=true,password=LLyZwv8evHgveA8hnS1iFyMnZ1A=";
var muxer = ConnectionMultiplexer.Connect(configString);
var db = muxer.GetDatabase();
db.KeyDelete("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
int i = (int)db.StringGet("foo");
Console.WriteLine(i); // 3

and there we are; readily talking to an Azure Redis Cache instance over SSL.
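For completeness, the same thing via the object model (rather than the single string) should be just a few properties – same host and key as above:

var options = new ConfigurationOptions();
options.EndPoints.Add("mgblogdemo.redis.cache.windows.net:6380"); // the ssl port
options.Ssl = true;
options.Password = "LLyZwv8evHgveA8hnS1iFyMnZ1A=";
var muxer = ConnectionMultiplexer.Connect(options);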

Creating a new Redis Labs “redis cloud” instance and configuring the certificates


Another option is Redis Labs; they too have an SSL offering, although it makes some different implementation choices. Fortunately, the same client can connect to both, giving you flexibility. Note: the SSL feature of Redis Labs is not available just through the UI yet, as they are still gauging uptake etc. But it exists and works, and is available upon request; here’s how:

Once you have logged in to Redis Labs, you should immediately have a simple option to create a new redis instance:

image

Like Azure, a range of different levels is available; I’m using the Free option, purely for demo purposes:

image

We’ll keep the config simple:

image

and wait for it to provision:

image

(note; this only takes a few moments)

Don’t add anything to this DB yet, as it will probably get nuked in a moment! Now we need to contact Redis Labs; the best option here is support@redislabs.com; make sure you tell them who you are, your subscription number (blanked out in the image above), and that you want to try their SSL offering. At some point in that dialogue, a switch gets flipped, or a dial cranked, and the Access Control & Security changes from password:

image

to SSL; click edit:

image

and now we get many more options, including the option to generate a new client certificate:

image

Clicking this button will cause a zip file to be downloaded, which has the keys to the kingdom:

image

The pem file is the certificate authority; the crt and key files are the client key. They are not in the most convenient format for .NET code, so we need to tweak them a bit; openssl makes this fairly easy:
c:\OpenSSL-Win64\bin\openssl pkcs12 -inkey garantia_user_private.key -in garantia_user.crt -export -out redislabs.pfx

This converts the 2 parts of the user key into a pfx, which .NET is much happier with. The pem can be imported directly by running certmgr.msc (note: if you don’t want to install the CA, there is another option, see below):

image

Note that it doesn’t appear in any of the pre-defined lists, so you will need to select “All Files (*.*)”:

image

After the prompts, it should import:

image

So now we have a physical pfx for the client certificate, and the server’s CA is known; we should be good to go!

Connecting to a Redis Labs “redis cloud” instance


Back on the Redis Labs site, select your subscription, and note the Endpoint:

image

We need a little bit more code to connect than we did with Azure, because we need to tell it which certificate to load; the configuration object model has events that mimic the callback methods on the SslStream constructor:
var options = new ConfigurationOptions();
options.EndPoints.Add(
    "pub-redis-16398.us-east-1-3.1.ec2.garantiadata.com:16398");
options.Ssl = true;
options.CertificateSelection += delegate {
    return new System.Security.Cryptography.X509Certificates.X509Certificate2(
    @"C:\redislabs_creds\redislabs.pfx", "");
};

var muxer = ConnectionMultiplexer.Connect(options);
var db = muxer.GetDatabase();
db.KeyDelete("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
db.StringIncrement("foo");
int i = (int)db.StringGet("foo");
Console.WriteLine(i); // 3

Which is the same smoke test we did for Azure. If you don’t want to import the CA certificate, you could also use the CertificateValidation event to provide custom certificate checks (return true if you trust it, false if you don’t).
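For example, a validation callback that pins a single known certificate might look like this – a sketch only; you get to decide what “trust” means for you, and the thumbprint is obviously a placeholder:

options.CertificateValidation += (sender, certificate, chain, sslPolicyErrors) =>
{
    // illustration only: trust exactly one known certificate, by thumbprint
    var cert = new System.Security.Cryptography.X509Certificates.X509Certificate2(certificate);
    return cert.Thumbprint == "EXPECTED-THUMBPRINT-GOES-HERE";
};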

Way way way tl;dr


Cloud host providers are happy to let you use redis, and happy to provide SSL support so you can do it without being negligent. StackExchange.Redis has hooks to let this work with the two SSL-based providers that I know of.

Thursday, 3 July 2014

Dapper gets type handlers and learns how to read maps

A recurring point of contention in dapper has been that it is a bit limited in terms of the types it handles. If you are passing around strings and integers: great. If you are passing around DataTable – that’s a bit more complicated (although moderate support was added for table valued parameters). If you were passing around an entity framework spatial type: forget it.
Part of the problem here is that we don’t want dapper to take a huge pile of dependencies on external libraries that most people aren’t using – and often don’t even have installed or readily available. What we needed was a type handler API. So: I added a type handler API! Quite a simple one, really – dapper still deals with most of the nuts and bolts, and to add your own handler all you need to provide is some basic parameter setup.
For example, here’s the code for DbGeographyHandler; the only interesting thing that dapper doesn’t do internally is set the value – but the type-handler can also do other things to configure the ADO.NET parameter (in this case, set the type name). It also needs to convert between the Entity Framework representation of geography and the ADO.NET representation, but that is pretty easy:
public override void SetValue(IDbDataParameter parameter, DbGeography value)
{
    parameter.Value = value == null ? (object)DBNull.Value
        : (object)SqlGeography.Parse(value.AsText());
    if (parameter is SqlParameter)
    {
        ((SqlParameter)parameter).UdtTypeName = "GEOGRAPHY";
    }
}

and… that’s it. All you need to do is register any additional handlers (SqlMapper.AddTypeHandler()) and it will hopefully work. We can now use geography values in parameters without pain – i.e.
conn.Execute("... @geo ...",
    new { id = 123, name = "abc", geo = myGeography }
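For reference, a complete handler – including the conversion back when reading a geography column – might look something like this; a sketch only, and the shipped Dapper.EntityFramework handler may differ in detail:

// sketch of a complete Dapper type handler for DbGeography
// (assumes: using Dapper; System; System.Data; System.Data.SqlClient;
//  System.Data.Entity.Spatial; Microsoft.SqlServer.Types)
public class DbGeographyHandler : SqlMapper.TypeHandler<DbGeography>
{
    public override void SetValue(IDbDataParameter parameter, DbGeography value)
    {
        parameter.Value = value == null ? (object)DBNull.Value
            : (object)SqlGeography.Parse(value.AsText());
        if (parameter is SqlParameter)
        {
            ((SqlParameter)parameter).UdtTypeName = "GEOGRAPHY";
        }
    }
    public override DbGeography Parse(object value)
    {
        // convert the ADO.NET value (well-known text here) back to the EF type
        return value == null || value is DBNull
            ? null : DbGeography.FromText(value.ToString());
    }
}
// registered once at startup:
// SqlMapper.AddTypeHandler(new DbGeographyHandler());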

Plugin packages

This means that handlers for exotic types with external dependencies can be shipped separately to dapper, meaning we can now add packages like Dapper.EntityFramework, which brings in support for the Entity Framework types. Neat.

Free to a good home: better DataTable support

At the same time, we can also make a custom handler for DataTable, simplifying a lot of code. There is one slight wrinkle, though: if you are using stored procedures, the type name of the custom type is known implicitly, but a lot of people (including us) don’t really use stored procedures much: we just use raw command text. In this situation, it is necessary to specify the custom type name along with the parameter. Previously, support for this has been provided via the AsTableValuedParameter() extension method, which created a custom parameter with an optional type name – but dapper now internally registers a custom type handler for DataTable to make this easier. We still might need the type name, though, so dapper adds a separate extension method for this, exploiting the extended-properties feature of DataTable:
DataTable table = ...
table.SetTypeName("MyCustomType");
conn.Execute(someSql, new { id = 123, values = table });

That should make things a bit cleaner! Custom type handlers are welcome and encouraged - please do share them with the community (ideally in ready-to-use packages).

Tuesday, 1 July 2014

SNK and me work out a compromise

and bonus feature: our build server configuration
A few days ago I was bemoaning the issue of strong names and nuget. There isn’t a perfect solution, but I think I have a reasonable compromise now. What I have done is:
  • Copy/pasted the csproj, with the duplicates referencing a strong name key
  • Changed those projects to emit an assembly named StackExchange.Redis.StrongName
  • Copy/pasted the nuspec, with the new spec creating the StackExchange.Redis.StrongName package from the new assemblies
  • PINNED the assembly version number; this will not change unless I am introducing breaking changes, which will also coincide with major/minor version number changes – or maybe also for reasonably significant feature additions; not for every point-release, is the thing
  • Ensured I am using [assembly:AssemblyFileVersion] and [assembly:AssemblyInformationalVersion] to describe the point-release status (see the snippet below)
This should achieve the broadest reach with the minimum of fuss and maintenance.
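Concretely, the attribute usage looks something like this (illustrative version numbers only):

// pinned; only changes for breaking changes / major+minor bumps
[assembly: AssemblyVersion("1.0.0.0")]
// patched on every build to describe the point-release
[assembly: AssemblyFileVersion("1.0.316.0")]
[assembly: AssemblyInformationalVersion("1.0.316")]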
Since that isn’t enough words for a meaningful blog entry, I thought I’d also talk about the build process we use, and the tooling changes we needed for that. Since we are using TeamCity for the builds, it is pretty easy to double everything without complicating the process – I just added the 2 build steps for the new csproj, and told it about the new nuspec:
image
Likewise, TeamCity includes a tool to automatically tweak the assembly metadata before the build happens – the “AssemblyInfo patcher” – which we can use to pin the one value (to change manually based on human decisions), while keeping the others automatic:
image
Actually, I’ll probably change that to use a parameter to avoid me having to dig 3 levels down. After that, we can just let it run the build, and out pop the 2 nupkg as artefacts ready for uploading:
image
If you choose, you can also use TeamCity to set up an internal nuget feed – ideal for both internal use and dogfooding before you upload to nuget.org. In this case, you don’t need to grab the artefacts manually or add a secondary upload step – you can get TeamCity to publish them by default, making them immediately available in-house:
image
It isn’t obvious in the UI, but this search is actually restricted to the “SE” feed, which is my Visual Studio alias to the TeamCity nuget server:
image
Other build tools will generally have similar features – simply: this is what works for us.