Wednesday, 6 December 2017

Dapper, Prepared Statements, and Car Tyres

Why Doesn't Dapper Use Prepared Statements?

I had a very interesting email in my inbox this week from a Dapper user; I'm not going to duplicate the email here, but it can be boiled down to:

My external security consultant is telling me that Dapper is insecure because it doesn't use prepared statements, and is therefore susceptible to SQL injection. What are your thoughts on this?

with a Dapper-specific example of something comparable to:

List<Order> GetOpenOrders(int customerId) => _connection.Query<Order>(
        "select * from Orders where CustomerId=@customerId and Status=@Open",
        new { customerId, OrderStatus.Open }).AsList();

Now this is a fun topic for me, because in my head I'm reading it in the same way that I would read:

My car mechanic is telling me my car is dangerous because it doesn't use anti-smear formula screen-wash, and is therefore susceptible to skidding in icy weather. What are your thoughts on this?

Basically, these are two completely unrelated topics. You can have a perfectly good and constructive conversation about either in isolation. There are merits to both discussions. But when you smash them together, it might suggest that the person raising the issue (the "security consultant" in this case, not the person sending me the email) has misunderstood something fundamental.

My initial response - while in my opinion valid - probably needs to be expanded upon:

Hi! No problem at all. If your security consultant is telling you that a correctly parameterized SQL query is prone to SQL injection, then your security consultant is a fucking idiot with no clue what they're talking about, and you can quote me on that.

So; let's take this moment to discuss the two topics and try to put this beast to bed!

Part The First: What is SQL injection?

Most folks will be very familiar with this, so I'm not going to cover every nuance, but: SQL injection is the error of concatenating untrusted inputs directly into SQL strings. It could be typified by the bad examples:

string customerName = customerNameTextBox.Value; // or a http request input; whatever
var badOptionOne = connection.Query<Customer>(
    "select * from Customers where Name='" + customerName + "'");
var badOptionTwo = connection.Query<Customer>(
    string.Format("select * from Customers where Name='{0}'", customerName));
var badOptionThree = connection.Query<Customer>(
    $"select * from Customers where Name='{customerName}'");

As an aside on badOptionThree, it really frustrates me that C# overloading prefers string to FormattableString (interpolated $"..." strings can be assigned to either, but only FormattableString retains the semantics). I would really have loved to be able to add a method to Dapper like:

[Obsolete("No! Bad developer! Bobby Tables will find you in the dark", error: true)]
public static IEnumerable<T> Query<T>(FormattableString query, /* other args not shown */)
    => throw new InvalidOperationException(...);
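
To see the overload-resolution problem in action, here's a minimal stand-alone sketch (not Dapper code): given both overloads, the compiler binds the interpolated string to the string version:

using System;

static class OverloadDemo
{
    static void Query(string sql)
        => Console.WriteLine("string overload - injectable!");
    static void Query(FormattableString sql)
        => Console.WriteLine("FormattableString overload");

    static void Main()
    {
        string customerName = "O'Brien";
        // overload resolution prefers string, so the unsafe path wins
        Query($"select * from Customers where Name='{customerName}'");
    }
}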

This category of coding error is perpetually on the OWASP "top 10" list, and is now infamously associated with xkcd's "Bobby Tables":

Did you really name your son Robert'); DROP TABLE Students;-- ?

The problem, as the cartoon shows us, is that this allows malicious input to do unexpected and dangerous things. In this case the hack was to use a quote to end a SQL literal (');... - with the attacker guessing that the clause was inside parentheses), then issue a separate command (DROP TABLE ...), then discard anything at the end of the original query using a comment (-- ...). But the issue is not limited to quotes, and frankly any attempt to play "replace/escape the risky tokens" is an arms race where you need to win every time, but the attacker only needs to win once. Don't play that game.

It can also be a huge internationalization problem, familiar to every developer who has received bug reports about the search not working for some people of Irish or Scottish descent. This (SQL injection - not Irish or Scottish people) is such an exploitable problem that readily available tools exist that can trivially search a site for exploitable inputs and give free access to the database with a friendly UI. So... yeah, you really don't want to have SQL injection bugs. No argument there.

So how do we prevent SQL injection?

The solution to SQL injection is parameters. One complaint I have about the xkcd comic - and a range of other discussions on the topic - is the suggestion that you should "sanitize" your inputs to prevent SQL injection. Nope! Wrong. You "sanitize" your inputs to check that they are within what your logic allows - for example, if the only permitted options from a drop-down are 1, 2 and 3, then you might want to check that they haven't entered 42. Sanitizing the inputs is not the right solution to SQL injection: parameters are. We already saw parameters in the Dapper example at the top, but to take our search example:

string customerName = customerNameTextBox.Value; // or a http request input; whatever
var customers = connection.Query<Customer>(
    "select * from Customers where Name=@customerName",
    new { customerName });

What this does is add a parameter called "customerName" with the chosen value, passing that alongside and separate to the command text, in a raw form that doesn't need it to be encoded to work inside a SQL string. At no point does the parameter value get written into the SQL as a literal. Well, except perhaps on some DB backends that don't support parameters at all, in which case frankly it is up to the DB provider to get the handling right (and: virtually all RDBMS have first-class support for parameters).
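
For context, here's roughly the moral equivalent in raw ADO.NET - a sketch of what "passing the value separately" means, not Dapper's actual implementation:

using (var cmd = connection.CreateCommand())
{
    cmd.CommandText = "select * from Customers where Name=@customerName";
    var p = cmd.CreateParameter();
    p.ParameterName = "@customerName";
    p.Value = customerName; // travels separately to the SQL text
    cmd.Parameters.Add(p);
    using (var reader = cmd.ExecuteReader())
    {
        // ... materialize Customer rows; no escaping ever happens
    }
}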

Note that parameters solve other problems too:

  • the formatting of things like dates and numbers: if you use injection you need to know the format that the RDBMS expects, which is usually not the format that the "current culture" is going to specify, making it awkward; but by using a parameter, the value doesn't need to be formatted as text at all - with things like numbers and dates usually being sent in a raw binary format - some-defined-endian-fixed-width for integers (including dates), or something like IEEE754 for floating point.
  • query-plan re-use: the RDBMS can cache our ...Name=@customerName query and re-use the same plan automatically and trivially (without saturating the cache with a different plan for every unique name searched), with different values of the parameter @customerName - this can provide a great performance boost (side note: this can be double-edged, so you should also probably learn about OPTIMIZE FOR ... UNKNOWN (or the equivalent on your chosen RDBMS) if you're serious about SQL performance - note this should only be added reactively based on actual performance investigations)

Dapper loves parameters

Parameterization is great; Dapper loves parameterization, and does everything it can to make it easy for you to parameterize your queries. So: whatever criticism you want to throw at Dapper: SQL injection isn't really a fair one. The only time Dapper will be complicit in SQL injection is when you feed it a query that already has an injection bug before Dapper ever sees it. We can't fix stupid.

For full disclosure: there is actually one case where Dapper allows literal injection. Consider our GetOpenOrders query from above. This can also be written:

List<Order> GetOpenOrders(int customerId) => _connection.Query<Order>(
        "select * from Orders where CustomerId=@customerId and Status={=Open}",
        new { customerId, OrderStatus.Open }).AsList();

Note that instead of @Open we're now using {=Open}. This is not SQL syntax - it is telling Dapper to inject a literal value. This is intended for things that don't change per query, such as status codes - and can, in some cases, result in performance improvements. Dapper doesn't want to make it easy to blow your own feet off, so it STRICTLY only allows this for integers (including enum values, which are fundamentally integers), since integers are a: very common for this scenario, and b: follow predictable rules as literals.

Part The Second: What are prepared statements?

There's often a slight confusion here with "stored procedures", so we'll have to touch on that too...

It is pretty common to issue commands to a RDBMS, where the SQL for those commands is contained in the calling application. This isn't universal - some applications are written with all the SQL in "stored procedures" that are deployed separately to the server, so the only SQL in the application is the names to invoke. There are merits of both approaches, which might include discussions around:

  • isolation - the ability to deploy and manage the SQL separately to the application (which might be desirable in client applications in particular, where re-deploying all the client installations to fix a small SQL bug is hard or expensive)
  • performance - historically stored procedures tended to out-perform ad-hoc commands; in most modern RDBMS this simply isn't a real concern, with the query-plan-cache working virtually identically regardless of the mechanism
  • granular security - in a high security application you might not want users (even if the "user" is a central app-server) to have direct SELECT permission on the tables or views - instead preferring to wrap the allowed queries in stored procedures that the calling user can be granted EXEC permission; of course a counter-argument there is that a blind EXEC can hide what a stored procedure is doing (so it does something the caller didn't expect), but ultimately if someone has pwned your RDBMS server, you're already toast
  • flexibility - being able to construct SQL to match a specific scenario (for example: the exact combination of 17 search options) can be important to improving performance (compared, in our search example, to 17 and (@someArg is null or row.SomeCol=@someArg) clauses). Tools like LINQ and ORMs rely extensively on runtime query generation to match the queries and model known at runtime, so allowing them to execute ad-hoc parameterized commands is required; it should also be noted that most RDBMS can also execute ad-hoc parameterized commands from within SQL - via things like sp_executesql from inside a stored procedure

You'll notice that SQL injection is not part of that discussion on the merits of "ad-hoc commands" vs "stored procedures", because parameterization makes it a non-topic.

So: let's assume that we've had the conversation about stored procedures and we've decided to use ad-hoc statements.

What does it mean to "prepare" a statement?

"Preparing a statement" is a sometimes-optional / sometimes-mandatory (depending on the RDBMS) step required to issue ad-hoc SQL commands. Conceptually, it takes our "select * from Orders where CustomerId=@customerId and Status=@Open" query - along with the defined parameters - and says "I'm going to want to run this in a moment; kindly figure out what that means to you and get everything in place". In terms of ADO.NET, this means calling the DbCommand.Prepare() method. There are 3 possible outcomes of a Prepare() call (ignoring errors):

  • it does literally nothing - a no-op; this might commonly be the case if you've told it that you're running a stored procedure (it is already as prepared as it will ever be), or if your chosen RDBMS isn't interested in the concept of prepared statements
  • it runs an additional optional operation that it wouldn't have done otherwise - adding a round trip
  • it runs a required operation that it was otherwise going to do automatically when we executed the query

So on the surface, the best case is that we achieve no benefit (the first and third options). The worst case is that we've added a round trip. You might be thinking "so why does Prepare() exist, if it is only ever harmful?" - and the reason is: I've only talked about running the operation once.

The main scenario in which Prepare() helps us is when you're going to be issuing exactly the same command (including the parameter definition, but not values), on exactly the same connection, many many times, and especially when your RDBMS requires command preparation. In that scenario, preparing a statement can be a very important performance tweak.
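
In ADO.NET terms, that pattern looks something like this - a sketch, with customerIds standing in for whatever source of values you have (DbType lives in System.Data):

using (var cmd = connection.CreateCommand())
{
    cmd.CommandText = "select * from Orders where CustomerId=@customerId";
    var p = cmd.CreateParameter();
    p.ParameterName = "@customerId";
    p.DbType = DbType.Int32; // the parameter *definition* is fixed up front
    cmd.Parameters.Add(p);
    cmd.Prepare(); // one-off cost, paid once for this command/connection pair

    foreach (int id in customerIds)
    {
        p.Value = id; // same command, same connection; only the value changes
        using (var reader = cmd.ExecuteReader())
        {
            // ... consume rows
        }
    }
}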

You'll notice - similarly to stored procedures - that SQL injection is not part of that discussion on the merits of "prepared statements".

It is entirely true to say that Dapper does not currently call Prepare().

Why doesn't Dapper Prepare() statements?

There are various reasons for this, but the most important one is: on most providers, a prepared statement is scoped to the connection and is stored as part of the DbCommand. To actually provide a useful prepared statement story, Dapper would need to store and re-use every DbCommand for every DbConnection. Dapper really, really doesn't want to store your connections. It is designed with high concurrency in mind, and typically works in scenarios where the DbConnection is short-lived - perhaps scoped to the context of a single web-request. Note that connection pooling doesn't mean that the underlying connection is short-lived, but Dapper only gets to see the managed DbConnection, so anything else is opaque to it.

Without tracking every DbConnection / DbCommand and without a new abstraction, the best Dapper could do would be to call .Prepare() on every DbCommand immediately before executing it - but this is exactly the situation we discussed previously where the only two options are "has no effect" and "makes things worse".

Actually, there is one scenario using the current API in which Dapper could usefully consider doing this, which is the scenario:

connection.Execute(someSql, someListOfObjects);

In this case, Dapper unrolls someListOfObjects, executing someSql with the parameters from each object in turn - on the same connection. I will acknowledge that a case could be made for Dapper to call .Prepare() in anticipation of the loop here, although it would require some changes to implement.
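
For example (table and member names here are purely illustrative):

var rows = new[]
{
    new { Id = 1, Name = "abc" },
    new { Id = 2, Name = "def" },
    new { Id = 3, Name = "ghi" },
};
// Dapper runs the same SQL once per element, re-using the same connection
connection.Execute(
    "insert into SomeTable (Id, Name) values (@Id, @Name)", rows);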

But fundamentally, the main objection that Dapper has to prepared statements is that typically, the connections that Dapper works with are transient and short-lived.

Could Dapper usefully offer a Prepare() API for systems with long-lived connections?

Hypothetically, yes: there is something that Dapper could do here, specifically targeted at the scenario:

I have a long-lived connection and an RDBMS that needs statements to be prepared, and I want the best possible performance when issuing repeated ad-hoc parameterized commands.

We could conceptualize an API that pins a command to a single connection:

var getOrders = connection.Prepare<Order>(
        "select * from Orders where CustomerId=@customerId",
        new { customerId = 123 }); // dummy args, for type inference
// ...
var orders = getOrders.Query(new { customerId }).AsList();

Note that in this imaginary API the connection is trapped and pinned inside the object that we stored in getOrders. There are some things that would need to be considered - for example, how does this work for literal injection and Dapper's fancy "in" support. A trivial answer might be: just don't support those features when used with .Prepare().

I think there's plenty of merit to have this kind of discussion, and I'm 100% open to discussing API features and additions. As long as we are discussing the right thing - i.e. the "I have a long-lived..." discussion from above.

If, however, we start that conversation (via a security consultant) via:

I want to use prepared statements to avoid SQL injection

then: that is not a useful discussion.

Tl;dr:

If you want to avoid your car skidding in icy weather, you fit appropriate tyres. You don't change the screen-wash.

Wednesday, 21 June 2017

protobuf-net gets proto3 support

protobuf-net gets proto3

For quite a while, protobuf-net hasn't seen any major changes. Sure, I've been pottering along with ongoing maintenance and things like .NET Core support, but it hasn't had any step changes in behavior. Until recently.

2.3.0 Released

I'm pleased to say that 2.3.0 has finally dropped. The most significant part of this is "proto3", which ties into the 3.0.0 version of Protocol Buffers - released by Google at the end of July 2016. There are a few reasons why I haven't looked at this for protobuf-net before now, including:

  • zero binary format changes; so ultimately, even without any library or tooling changes: everything that can be done in proto2 can be done in proto3, interchangeably; I didn't feel under immense pressure to rush out a release
  • significant DSL changes for "proto3" syntax, coupled with the fact protobuf-net's existing DSL tools were in bad shape; not least, they were tied into some technologies with a bad cross-platform story. Since I knew I needed a new answer for DSL tooling, it seemed a poor investment to hack the new features into the end-of-life tooling. A significant portion of protobuf-net's usage is from code-first users who don't even have a DSL version of their schema, hence why this wasn't at the top of my list of priorities
  • some new data contracts targeting commonly exchanged types, but this is tied into the DSL changes
  • I misunderstood the nature of the "proto3" syntax changes; I assumed it would be adding features and complexity, when in fact it removes a lot of the more awkward features. The few pieces that it did actually add were backported into "proto2" anyway
  • I've been busy with lots of other things, including a lot of .NET Core work for multiple libraries

But; I've finally managed to get enough time together to look at things properly.

First, some notes on proto3:

proto3 is simpler than proto2

This genuinely surprised me, but it was a very pleasant surprise. When writing protobuf-net, I made a conscious decision to make it easy and natural to implement the most common scenarios. I supported the full range of protobuf features, but some of them were more awkward to use. As such, I made some random decisions towards making it simple and obvious to use:

  • implicit zero defaults: most people don't have complex default values, whereas this makes it simple and efficient to store "empty" data (in zero bytes) without any configuration
  • don't worry about implicitly set vs explicitly set values: values are values; the library supports a few common .NET patterns for explicit assignment (ShouldSerialize* / *Specified / Nullable<T> + null), but it doesn't demand them and is perfectly fine without them
  • extensions and unknown data entirely optional: the question here is what to do if the serialized data contains unexpected / unknown values - which could be from external "extensions", or could just be new fields that the code doesn't know about. protobuf-net supports this type of usage, but accepts that it isn't something that most folks need or even want - they just want to get the expected data in and out

It turns out that proto3 makes some striking omissions from proto2:

  • default values are gone - implicit zero values are assumed and are the only permitted defaults
  • explicit assignment is gone - if something has a value other than the zero default, it is serialized, and that's it
  • extensions are largely missing

A part of me feels that these changes totally validate the decisions I made when making protobuf-net as simple to use as possible. Note that protobuf-net still retains full support for the wider set of protobuf features (including all the proto2 features) - they're not going anywhere.

what about protobuf JSON?

protobuf 3.0.0 added a well-defined JSON encoding for protobuf data. I confess that I'm deeply conflicted on this. In the .NET world, JSON is a solved problem. If I want my data serialized as JSON, I'm probably going to look at JIL (if I want raw performance) or Json.NET (if I want greater flexibility and range of features, or just want to use the de-facto platform serializer). Since protobuf-net targets idiomatic .NET types that would already serialize just fine with either of these, it seems to me of very little benefit to spend a large amount of time writing JSON support directly for protobuf-net. As such, protobuf-net still does not support this. If there is a genuine need for this, the first thing I would do would be to look at JIL or Json.NET to see if there is some combination of configuration options that I can specify that would conveniently be compatible with the expected JSON encoding. At the very worst case, I could see either some PRs to JIL or a fork of JIL to support it, but frankly I'm going to defer on touching the JSON option until I understand the use-case. On the surface, it seems like the JSON option here takes all the main reasons for using protobuf and throws them out the window. My reservations here are probably because I'm spoiled by working in a platform where I can take virtually any object, and both JIL and Json.NET will be able to serialize and deserialize it for me.

So what do we get in protobuf-net 2.3.0?

Brand new protogen tooling for both proto2 and proto3

This release completely replaces the protogen DSL parsing tool; it has been 100% rewritten from scratch using pure managed code. The old version used to:

  • shell execute to call Google's "protoc" tool to emit a compiled schema (in the protobuf serialization format, naturally) as a file
  • then deserialize that file into the corresponding type model using protobuf-net
  • serialize that same object as xml
  • run the xml through an xslt 1.0 engine to generate C#

This worked, but is a cross-platform nightmare as well as being a maintenance nightmare. I doubt that xslt was a good choice for codegen even when it was written, but today... just painful. I looked at a range of parsing engines, but ultimately decided on a manual tokenizer and imperative forwards-only parser. It turned out to not be anything like as much work as I had feared, which was nice. In order to have confidence in the parser, I have tested it on every .proto schema I can find, including about 220 schemas that describe a good portion of Google's public API surface. I've tested these against protoc's binary output to ensure that not only does it parse the files meaningfully, but it produces the exact same bytes (as a compiled / serialized schema) that protoc produces.

This parser is then tied into a fairly basic codegen system. At the moment this is quite crude, and is subject to significant change. The good thing is that now that everything is in place, this can be reworked relatively easily - perhaps to use one of the many templating systems that are available in .NET.

As an illustration of how the parser and codegen are neatly decoupled, Roger Johansson has also independently converted his Proto Actor code to use protobuf-net's parser rather than protoc, which is great! https://twitter.com/RogerAlsing/status/871829162218184704. If you want to use the parser and code-generation tools outside of the tools I provide, protobuf-net.Reflection may be useful to you.

How do I use it?

OK, you have a .proto schema (proto2 or proto3). At the moment, you have 2 options for codegen from protobuf-net:

  1. compile, build and execute the protogen command line tool (which deliberately shares command-line switches with Google's protoc tool)
  2. use https://protogen.marcgravell.com/ to do it online

(as a 2.1 option you could also clone that same website from git and host it locally; that's totally fine)

I want to introduce much better tooling options, including something that ties into msbuild and dotnet CLI, and (optionally) devenv, but so far this is looking like hard work, so I wanted to ship 2.3.0 before tackling it. It is my opinion that https://protogen.marcgravell.com/ is now perhaps the easiest way to play with .proto schemas - and to show willing, it also includes support for all official protoc output languages, and includes the entire public Google API surface as readily available imports (those same 220 schemas from before).

Support for maps

Maps (map<key_type, value_type>) in .proto are the equivalent of dictionaries in .NET. If you're familiar with protobuf-net, you'll know that it has offered dictionary support for many years. Fortunately, Google's idea of how this should be implemented matches perfectly with the arbitrary and unilateral decisions I stumbled into, so maps are 99.95% interchangeable with how protobuf-net already handles dictionaries. The 0.05% relates to what happens with duplicate keys. Basically: historically, protobuf-net used theData.Add(key, value), which would throw if a key was duplicated. However, maps are defined such that the last value replaces previous values - so: theData[key] = value;. This is a very small difference, and doesn't impact any data that would currently successfully deserialize, so I've made the executive decision that from 2.3.0 all dictionaries should follow the "map" rules by default (when appropriate). To allow full control, protobuf-net has a new ProtoMapAttribute ([ProtoMap]). This has options to use the old .Add behavior, and also has options to control the sub-format used for the key and value. The protogen tool will always include the appropriate [ProtoMap] options for your data.
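
As a sketch of what that looks like in code (the type here is illustrative; check the docs for the exact [ProtoMap] options):

[ProtoContract]
public class OrderIndex
{
    // "map" rules by default: on duplicate keys, the last value wins
    [ProtoMember(1), ProtoMap]
    public Dictionary<int, string> Orders { get; }
        = new Dictionary<int, string>();

    // reverting a member to the legacy .Add (throw-on-duplicate) behavior:
    // [ProtoMember(2), ProtoMap(DisableMap = true)]
}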

Support for Timestamp and Duration

Timestamp and Duration refer to a point in time (think: DateTime) and an amount of time (think: TimeSpan). Again, protobuf-net has had support for DateTime and TimeSpan for many years, but this time my arbitrary interpretation and Google's differs significantly. I have added native support for these formats, but because it is different to (and fundamentally incompatible with) what protobuf-net has done historically, this has to be done on an opt-in basis. I've added a new DataFormat.WellKnown option that indicates that you want to use these formats. For example:

[ProtoMember(7, DataFormat = DataFormat.WellKnown)]
public DateTime CreationDate { get; set; }

will be serialized as a Timestamp. The protogen tool recognises Timestamp and Duration and will emit the appropriate options.

Simpler enum handling

Historically, enums in .proto were a bit awkward when it came to unknown values, and protobuf-net defaulted to the most paranoid options of panicking if it saw a value it didn't explicitly expect. However, the guidance now includes the new remark:

During deserialization, unrecognized enum values will be preserved in the message, though how this is represented when the message is deserialized is language-dependent. In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation.

Enums in .NET are open enum types, so it makes sense to relax the handling here. Additionally, historically protobuf-net didn't really properly implement the older "make it available as an extension value" approach from proto2 (it would throw an exception instead) - far from ideal. So: from 2.3.0 onwards, all enums will be (by default) interpreted directly and without checking against expected values, with the exception of the unusual scenario where [ProtoEnum(Value=...)] has been used to re-map any enum such that the serialized value is different to the natural value. In this case, it can't assume that a direct interpretation will be valid, so the legacy checks will remain. Emphasis: this is a very rare scenario, and probably won't impact anyone except me (and my test suite). Because of this, the [ProtoContract(EnumPassthru = ...)] option is now mostly redundant: the only time it is useful is to explicitly set this to false to revert to the previous "throw an exception" behaviour.
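
So if you do want the previous strict behaviour back, it's now an explicit opt-out, along these lines (the enum itself is illustrative):

// revert to the old "throw on unexpected values" handling (rarely needed)
[ProtoContract(EnumPassthru = false)]
public enum OrderStatus { Open = 0, Closed = 1, Cancelled = 2 }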

Discriminated unions, aka one-of

One of the features introduced in proto3 (and back-ported to proto2) is the ability for multiple fields to overlap such that only one of them can contain a value at a time. The ideal in-memory representation of this is a discriminated union, which C# can't really represent directly, but which can be simulated via a struct with explicit layout; so that's exactly what we now do! A family of discriminated union structs have been introduced for this purpose, and are mainly intended to be used with generated code. But if you want to use them directly: have fun!

proto3 schema generation

Since the DSL tools accept proto2 or proto3 syntax, it makes sense that we should be able to emit both proto2 and proto3 syntax, so there are now overloads of GetSchema / GetProto<T> that allow this. These tools have also been updated to be aware of maps, Timestamp, Duration etc.

New custom option DSL support

The new DSL tooling makes use of the "extensions" feature to add custom syntax options to your .proto files. At the moment the options here are pretty limited, allowing you to control the accessibility and naming of elements, but as new controls become necessary: that's where they will go.

General bug fixes

This build also includes a range of more general fixes for specific scenarios, as covered by the release notes.

What next?

I'm keeping a basic future roadmap on the release notes. There are some significant pieces of work ahead, including (almost certainly) a major rework of the core serializer to support async IO, "Pipelines", etc. I also want to improve the build-time tooling. My work here is very much not done.

Wednesday, 17 May 2017

protobuf-net: large data, and the future

protobuf-net was born into a different world

On Jul 17, 2008 I pushed the first commits of protobuf-net. It is easy to forget, but back then, most machines had access to a lot less memory than they do today, with x86 still being a common choice, meaning that 2GB user space (or maybe a little more if you fancied fighting with /3GB+LAA) was a hard upper limit. In reality, your usable memory was much less. Processors were much less powerful - user desktops were doing well if their single core had hyper-threading support (dual and quad cores existed, but were much rarer).

Thanks for the 2GB memories

It is in this context that protobuf-net was born, and in which many of the early design decisions were made. Although to be fair, even Google (who designed the thing) suggested an upper bound in the low hundreds of MB. Here's the original author (Kenton Varda) saying on Stack Overflow that 10MB is "pushing it" - although he does also note that 1GB works, but that 2GB is a hard limit.

protobuf-net took these limitations on board, and many aspects of the code could only work inside these borders. In particular, one of the key design questions in protobuf-net was how, when serializing general purpose objects, to handle the length prefix.

protobuf strings

Protobuf is actually a relatively simple binary format; it has few primitives, one of which is the length-prefixed string (where "string" means "arbitrary payload", not just text). The encoding of this is a variable length "varint" that tells it how many bytes are involved, then that many bytes of the payload:

[field x, "string"]
[n, 1-10 bytes]
[payload, n bytes]
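
For the curious, the "varint" is protobuf's variable-length integer encoding: 7 bits of payload per byte, with the high bit flagging "more bytes follow". A minimal sketch (not protobuf-net's actual writer code; Stream is System.IO.Stream):

static int WriteVarint(Stream dest, ulong value)
{
    int count = 1;
    while (value >= 0x80)
    {
        dest.WriteByte((byte)(value | 0x80)); // high bit set: more bytes follow
        value >>= 7;
        count++;
    }
    dest.WriteByte((byte)value); // final byte: high bit clear
    return count; // 1-10 bytes for a 64-bit value
}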

The requirement to know the length in advance is fine for the Google implementation - as I understand it, the "builder" approach means that the length is calculated when the "builder" creates the actual object, which is long before serialization happens (note: I'm happy to be corrected here if I've misunderstood). But protobuf-net doesn't work with "builder" types; it works against general every-day POCOs - usually written without any DSL schema ("code-first"). We can't rely on any construction-time calculations. So: how to write the length?

Essentially, there's two ways of doing this:

  • serialize the data first (perhaps hoping that the length prefix will fit in a single byte, and leaving a space for it); when you've finished serializing, you know the length - so now backfill that into the original space, which might mean nudging the data over a bit if the prefix took more space than expected
  • compute the actual required length, write the prefix, then serialize the data

Both have advantages and disadvantages. The first requires you to buffer all the data in the payload (you can't flush something that you might need to update later), and might need us to move a lot of data. The second requires us to do more thinking without actually writing anything - which might mean doing a lot of work twice.

At the current time, protobuf-net chooses the first approach. For quite a lot of small leaf types, this doesn't actually mean much more than backfilling a single byte of length data, but it becomes progressively more expensive as the payload size increases.

I hate limits

Over the time since then, I have seen many, many requests from people asking for protobuf-net to support larger data sizes - at least an order of magnitude above what has previously been usable, tens of GB or more, which makes perfect sense when you consider the data that some apps load into the plentiful RAM available on even a mid-range server. In principle this is simple (mostly making sure that the reader and writer use 64-bit tracking internally), but there are a few stumbling blocks:

  • the need to buffer vast quantities of data would demand excessive amounts of RAM
  • the current buffer implementation would be prohibitively hard to refactor to go above 2GB
  • even if we did, it would then take a loooong time to output the buffered data after backfilling

I've recently pushed some commits intended to address the 64-bit reader/writer issue - unblocking some users, but the other factors are much harder to solve in the current implementation.

Wait... how does that unblock anyone?

Good catch; indeed, simply enabling 64-bit readers and writers doesn't fix the buffering problem - but: there is a workaround. A long time in protobuf's past, there were two ways of encoding sub-messages. One was the length-prefixed string that we've discussed; the other was the "group". At the binary level, the difference is that "groups" don't have a length prefix - instead a sentinel value suffix is used to denote the end of the message:

[field x, "start group"]
[payload]
[field x, "end group"]

(the protocol itself means that "end group" could not occur as an immediate child of the payload, so this is unambiguous)

As with most things, this has various advantages and disadvantages - but most significantly in our case here, it means we don't need to know the length in advance. And if we don't need to know the length, then we don't need to buffer anything - we can write the data in a purely forwards direction without any need to backfill data. There's just one problem: it is out of favor with the protobuf specification owners - it was marked as deprecated but supported in the proto2 DSL, and there is no syntax for it at all in the proto3 DSL (these all just describe data against the same binary format).

But: I really, really like groups, at least at the binary format level. Essentially, the current 2GB+ unblocking in an upcoming deploy of protobuf-net is limited to scenarios where it is possible to use groups extensively. The closer something is to being a leaf, the more it'll be OK to use length-prefixed strings; the closer something is to the root object, the more it will benefit from being treated as a "group". With this removing the need to buffer+backfill, arbitrarily large files can be produced. The cost, however, is that you won't be able to interop with data that is expressed as proto3 schemas.

Historically, you have been able to indicate that a member should be treated as a group via:

// for field number "n"
[ProtoMember(n, DataFormat = DataFormat.Group)]
public SomeType MemberName { get; set; }

However, this is hard to express in some cases (such as dictionaries), so this has been extended to allow declaration at the type-level:

[ProtoContract(IsGroup = true)]
public class SomeType {...}

(both of which can also be expressed via the RuntimeTypeModel API for runtime configuration)
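
For example, something along these lines (a sketch only; check the RuntimeTypeModel docs for the exact member names):

var model = RuntimeTypeModel.Create();
// per-member, for field number n:
model.Add(typeof(SomeType), true)[n].DataFormat = DataFormat.Group;
// or per-type:
model.Add(typeof(SomeOtherType), true).IsGroup = true;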

These changes move us forward, at least - but are mainly appropriate when using protobuf-net as the only piece of the puzzle, since it simply cannot be expressed in the proto3 DSL.

The future

This is all great, but isn't ideal. So in parallel with that, I have some work-in-progress early-stages work that is taking a much more aggressive look at the future of protobuf-net and what it needs to move forward. I have many lofty aims on the list:

  • true 2GB+ support including length-prefix, achieved by a redesign of the writer API, including switching to precalculation of lengths as required
  • optimized support for heterogeneous backend targets, including in-memory serialization, Streams, "Channels" (the experimental redesign of the .NET IO stack), memory-mapped-files, etc
  • making use of new concepts like Utf8String, Span<T> where appropriate
  • full support for async backend targets, making optimal use of ValueTask<T> as appropriate so that performance is retained in the case where it is possible to complete entirely synchronously
  • rework of the codegen / meta-programming layer, reducing or removing the dependency on IL-emit, and moving more towards compile-time code-gen (ideally fully automated and silent) using Roslyn
  • in doing so, greatly improve the experience for AOT scenarios, where meta-programming is restricted or impossible
  • improve the performance of a range of common scenarios by every mechanism imaginable
  • and maybe, just maybe: getting around to implementing updated DSL parsing tooling (but realistically: that isn't the key selling-point of protobuf-net)

As counterpoints, I also imagine that I'll be dropping support for everything that isn't either ".NET Framework recent-enough to build via dotnet build" (4.0 and above, IIRC) or ".NET Standard (something)". The reality is that I'm not in a position to support some obscure PCL configuration or an ancient version of Silverlight. If you can make it compile: great! I'm also entirely open to including targets for things like Xamarin or Unity as long as somebody else can make them work in the build - I'm simply not a user of those tools, and it would be artificial to say that I've seen it work. I'm also moving away from my historic aim of being able to compile on down-level compiler versions. These days, with NuGet as the de-facto package manager, and dotnet build readily available, and the free Visual Studio Community edition, I'm not sure it makes sense to worry about old compilers.

As you can see, there's a lot in the planning. I've been experimenting with various pieces of it to see how it fits together, and I'm confident that I see a viable route forward. Now all I need is to make it happen.

The first step there is to get the "longification" changes shipped; this has now seen real-world usage, so it is just some packaging work to do. I hope to have that available on NuGet before next week.

Fun times!

Saturday, 29 April 2017

StackExchange.Redis and Redis 4.0 Modules

StackExchange.Redis and Redis Modules

This is largely a brain-dump of my plans for Redis 4.0 Modules and the StackExchange.Redis client library.

Redis 4.0 is in RC 3, which is great for folks interested in Redis. As the primary maintainer of StackExchange.Redis, new releases also bring me some extra work in terms of checking whether there are new features that I need to incorporate into the client library. Some client libraries expose a very raw API surface, leaving the individual commands etc to the caller - this has the advantage of simplicity, but it has disadvantages too:

  • it presents a higher barrier to entry, as users need to learn the redis command semantics
  • it prevents the library offering any special-case optimizations or guidance
  • it makes it hard to ensure that key-based sharding is being implemented correctly (as to do that you need to know with certainty which tokens are keys vs values vs command semantics)
  • it is hard to optimize the API

For all these reasons, StackExchange.Redis has historically offered a more method-per-command experience, allowing full intellisense, identification of keys, helper enums for options, sanity checking of operands, and various scenario-specific optimizations / fallback strategies. And ... if that isn't enough, you can always hack by using Lua to do things at the server directly.
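
For example, via ScriptEvaluate (the key name here is illustrative):

// falling back to Lua for something the typed API doesn't cover
RedisResult result = db.ScriptEvaluate(
    "return redis.call('GET', KEYS[1])",
    new RedisKey[] { "some-key" });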

Along comes Modules

A key feature in Redis 4.0 is the introduction of modules. This allows anyone to write a module that does something interesting and useful that they want to run inside Redis, and load that module into their Redis server - then invoke it using whatever exotic commands they choose. If you're interested in Redis, you should go check it out! There's already a gallery of useful modules started by Redis Labs - things like JSON support, Machine Learning, or Search - with an option to submit your own modules to the community.

Clearly, my old approach of "manually update the API when new releases come out" doesn't scale to the advent of modules, and saying "use Lua to run them" is ... ungainly. We need a different approach.

Adding Execute / ExecuteAsync

As a result, in an upcoming (not yet released) version, the plan is to add some new methods to StackExchange.Redis to allow more direct and raw access to the pipe; for example the rejson module adds a JSON.GET command that takes a key to an existing JSON value, and a path inside that json - we can invoke this via:

string foo = (string)db.Execute(
    "JSON.GET", key, "[1].foo");

(there's a similar ExecuteAsync method)

The return value of these methods is the flexible RedisResult type that the Lua API already exposes, which handles all the expected scenarios of primitives, arrays, etc. The parameters are simply a string command name, and a params object[] of everything else - with appropriate handling of the types you're likely to use with redis commands (string, int, double, etc). It also recognises parameters typed as RedisKey and uses them for routing / sharding purposes as necessary.

The key from all of this is that it should be easy to quickly hook into any modules that you write or want to consume.

What about more graceful handling for well-known modules?

My hope here is that for well-known but non-trivial modules, "someone" (maybe me, maybe the wider community) will be able to write helper methods as C# extension methods against the client library, and package them as module-specific NuGet packages; for example, a package could add:

public static RedisValue JsonGet(this IDatabase db, RedisKey key,
    string path = ".", CommandFlags flags = CommandFlags.None)
{
    return (RedisValue)db.Execute("JSON.GET",
        new object[] { key, path }, flags);
}

to expose raw json functionality, or could choose to add serialization / deserialization into the mix too:

public static T JsonGet<T>(this IDatabase db, RedisKey key,
    string path = ".", CommandFlags flags = CommandFlags.None)
{
    byte[] bytes = (byte[])db.Execute("JSON.GET",
        new object[] { key, path }, flags);
    using (var ms = new MemoryStream(bytes))
    {
        return SomeJsonSerializer.Deserialize<T>(ms);
    }
}
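
Usage then becomes idiomatic (Order being whatever type you choose to deserialize as):

var order = db.JsonGet<Order>("order:123", ".basket");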

The slight wrinkle here is that it is still using the Execute[Async] API; as a general-purpose API it is very convenient and flexible, but slightly more expensive than it absolutely needs to be. But being realistic, it is probably fine for 95% of use-cases, so: let's get that shipped and iterate from there.

I'd like to add a second API specifically intended for extensions like this (more direct, less allocations, etc), but a: ideally I'd want to ensure that I can subsequently tie it cleanly into the "pipelines" concept (which is currently just a corefxlab dream, without a known ETA for "real" .NET), and b: it would be good to gauge interest and uptake before spending any time doing this.

But what should consumers target?

This also makes "strong naming" rear it's ugly head. I'm not going to opine on strong naming here - the discussion is not very interesting and has been done to death. Tl,dr: currently, there are two packages for the client library - strong named and not strong named. It would be sucky if there was a mix of external extensions targeting one, the other, or both. The mid range plan is to make a breaking package change and re-deploy StackExchange.Redis (which currently is not strong-named) as: strong-named. The StackExchange.Redis.StrongName would be essentially retired, although I guess it could be an empty package with a StackExchange.Redis dependency for convenience purposes, possibly populated entirely by [assembly:TypeForwardedTo(...)] markers. I'm open to better ideas, of course!

So that's "The Plan"

If you have strong views, hit me on twitter (@marcgravell), or log an issue and we can discuss it.

Sunday, 23 April 2017

Spans and ref part 2 : spans

Spans and ref part 2 : spans

In part 1, we looked at ref locals and ref return, and hinted at a connection to “spans”; this time we’re going to take a deeper look at what this connection might be, and how we can make use of it.

Disclaimer

I’m mostly on the outside of this - looking in at the public artefacts, playing with the API etc - maybe the odd PR or issue report. It is entirely possible that I’ve misunderstood some things, and it is possible that things will change between now and general availability.

What are spans?

By spans, I mean System.Span<T>, which is part of .NET Core, living in the System.Memory assembly. It is also available for .NET via the System.Memory package. But please note: it is a loaded gun to use at the moment - you can currently compile code that has undefined behavior, and which may not compile at some point in the future. Although to be fair, to get into any of the terrible scenarios you need to use the unsafe keyword, at which point you already said “I take full responsibility for everything that goes wrong here”. I’ll discuss this more below, but I wanted to mention that at the top in case you stop reading and don’t get to that important point.

Note that some of the code in this post uses unreleased features; I’m using:

<PackageReference Include="System.Memory"
    Version="4.4.0-preview1-25219-04" />
<PackageReference Include="System.Runtime.CompilerServices.Unsafe"
    Version="4.4.0-preview1-25219-04" />

Obviously all bets are off with preview code; things may change.

Why do spans need to exist?

We saw previously how ref T can be used similarly to pointers (T*) to represent a reference to a single value. Basically, anything that allows us to talk about complex scenarios without needing pointers is a good thing. But: representing a single value is not the only use-case of pointers. The much more common scenario for pointers is for talking about a range of contiguous data, usually when paired with a count of the elements.

At the most basic level, a Span<T> represents a strongly typed contiguous chunk of elements of type T with a known and enforced length. In many ways, very comparable to an array (T[]) or segment (ArraySegment<T>) - but… more. They also provide safe (by which I mean: not unsafe in the C# sense) access to features that would previously have required pointers (T*).

I’m probably missing a few things here, but the most immediate features are:

  • provide a unified type system over all contiguous memory, including: arrays, unmanaged pointers, stack pointers, fixed / pinned pointers to managed data, and references into the interior of values
  • allow type coercion for primitives and value-types
  • work with generics (unlike pointers, which don’t)
  • respect garbage collection (GC) semantics by using references instead of pointers (the GC only walks references)

Now: if none of the above sounds like things you ever need to do, then great: you probably won’t ever need to use Span<T> - and that’s perfectly OK. Most application code will never need to use these features. Ultimately, these tools are designed for lower level code (usually: library code) that is performance critical. That said, there are some great uses in regular code, that we’ll get onto.

But… what is a span?

OK, OK. Conceptually, a Span<T> can be thought of as a reference and a length:

public struct Span<T> {
    ref T _reference;
    int _length;
    public ref T this[int index] { get {...} }
}

with a cousin:

public struct ReadOnlySpan<T> {
    ref T _reference;
    int _length;
    public T this[int index] { get {...} }
}

You would be perfectly correct to complain “but… but… in the last part you said no ref fields!”. That’s fair, but I did say conceptually. At least… for now!

Spans as ranges of an array

As a completely trivial (and rather pointless) example, we can see how we can use a Span<T> very similarly to how we might have used a T[]:

void ArrayExample() {
    byte[] data = new byte[1024];
    // not shown: populate data
    ProcessData(data);
}
void ProcessData(Span<byte> span) {
    for (int i = 0; i < span.Length; i++) {
        DoSomething(span[i]);
    }
}

Here we implicitly convert the byte[] to Span<byte> when calling the method, but at this point you would still be justified in being underwhelmed - we could have done everything here with just an array.

Similarly, we can talk about just a portion of the array:

void ArrayExample() {
    byte[] data = new byte[1024];
    // not shown: populate data
    ProcessData(new Span<byte>(data, 10, 512));
}
void ProcessData(Span<byte> span) {
    for (int i = 0; i < span.Length; i++) {
        DoSomething(span[i]);
    }
}

And again you could observe that we could have just used ArraySegment<T>. Actually, let’s be realistic: very few people use ArraySegment<T> - but we could have just passed int offset and int count as additional parameters, it would have worked fine. But I mentioned pointers earlier…

Spans as ranges of pointers

The second way we can use Span<T> is over a pointer; which could be any of:

  • a stackalloc pointer for a small value that we want to work on without allocating an array
  • a managed array that we previously fixed
  • a managed array that we previously pinned with GCHandle.Alloc
  • a fixed-sized buffer that we previously fixed
  • the contents of a string that we previously fixed
  • a coerced pointer from any of the above (I’ll explain what this means below)
  • a chunk of unmanaged memory obtained with Marshal.AllocHGlobal or any other unmanaged memory API
  • etc

All of these will necessarily involve unsafe, but: we’ll tread carefully! Let’s have a look at a stackalloc example (stackalloc is where you obtain a chunk of data directly on the call-stack):

void StackAllocExample() {
    unsafe {
        byte* data = stackalloc byte[128];
        var span = new Span<byte>(data, 128);
        // not shown: populate data / span
        ProcessData(span);
    }
}
void ProcessData(Span<byte> span) {
    for (int i = 0; i < span.Length; i++) {
        DoSomething(span[i]);
    }
}

That’s… actually pretty huge! We just used the exact same processing code to handle an array and a pointer, and we didn’t need to use unsafe (except in the code that initially obtained the pointer). This opens up a huge range of possibilities, especially for things like network IO and serialization. Even better, it means that we can do all of the above with a “zero copy” mentality: rather than having managed code writing to a byte[] that later gets copied to some unmanaged chunk (for whatever IO we need), we can write directly to the unmanaged memory via a Span<T>.

Slice and dice

A very common scenario when working with buffers and buffer segments is the need to sub-divide the buffer. Span<T> makes this easy via the Slice() method, best illustrated by an example:

void ProcessData(Span<byte> span) {
    while(span.Length > 0) {
        // first byte is single-byte length-prefix
        int len = span[0];

        // process the next "len" bytes
        ProcessChunk(span.Slice(1, len));

        // move forward len+1 bytes
        span = span.Slice(len + 1);
    }
}

This isn’t something we couldn’t do other ways, but it is very convenient here. Importantly, we haven’t allocated anything here - there’s no “new array” or similar - we just have a reference to a different part of the existing range, and / or a different length.

Coercion

A more interesting example is coercion; this is something that you can do with pointers, but is very hard to do with arrays. A classic scenario here would be IO / serialization: you have a chunk of bytes, and at some point in that data you need to treat the data as fixed-size int, float, double, etc data. In the world of pointers, you just… do that:

byte* raw = ...
float* floats = (float*)raw;
float x = floats[0], y = floats[1]; // consume 8 bytes

With arrays, there is no direct way to do this; you’d either need to use unsafe hacks, or you can use BitConverter if the types you need are supported. But this is easy with Span<T>:

Span<byte> raw = ...
var floats = raw.NonPortableCast<byte, float>();
float x = floats[0], y = floats[1]; // consume 8 bytes

Not only can we do it, but we have the added advantage that it has correctly tracked the end range for us during the conversion - we will find that floats.Length is equal to raw.Length / 4 (since each float requires 4 bytes). The important thing to realise here is that we haven’t copied any data - we’re still looking at the exact same place in memory - but instead of treating it as a ref byte, we’re treating it as a ref float.

Except… better!

We observed that with pointers we could coerce from byte* to float*. That’s fine, but you can’t use pointers with all types. Span<T> has much stronger support here. A particularly interesting illustration is SIMD, which is exposed in .NET via Vector<T>. A vexing limitation of pointers is that we cannot talk about a Vector<float>* pointer (for example). This means that we can’t use pointer coercion as a convenient way of reading and writing SIMD vectors (you’ll usually have to use Unsafe.Read<T> and Unsafe.Write<T> instead). But we can coerce directly to Vector<T> from a span! Here’s an example that might come up in things like applying the web-sockets xor mask to a received frame’s payload:

void ApplyXor(Span<byte> span, uint mask) {
    if(Vector.IsHardwareAccelerated) {
        // apply the mask to SIMD-width bytes at a time
        var vectorMask = new Vector<uint>(mask);
        var typed = span.NonPortableCast<byte, Vector<uint>>();
        for (int i = 0; i < typed.Length; i++) {
            typed[i] ^= vectorMask;
        }
        // move past that data, measured in *bytes* (might be a few bytes left)
        span = span.Slice(Vector<byte>.Count * typed.Length);
    }
    // not shown - finish any remaining data 
}

That’s pretty minimal code for vectorizing something; it is especially nice that we didn’t even need to do the math to figure out the vectorizable range - typed.Length did everything we wanted. It would be premature for me to know for sure, but I’m also hopeful that these 0-Span<T>.Length loops will also elide the bounds check in the same way that array access from 0-T[].Length elides the bounds check.

And readonly too!

Pointers are notoriously permissive; if you have a pointer: you can do anything. You can use fixed to obtain the char* pointer inside a string: if you change the data via the pointer, the string now has different contents. string is not immutable if you allow unsafe: nothing is immutable if you allow unsafe. But just as we can obtain a Span<T>, we can also get a ReadOnlySpan<T>. If you only expect a method to read the data, you can give them a ReadOnlySpan<T>.

Zero-cost substrings

In the “corefxlab” preview code, there’s a method-group with signatures along the lines of:

public static ReadOnlySpan<char> Slice(this string text, ...)

(where the overloads allow an initial range to be specified). This gives us a ReadOnlySpan<char> that points directly at a range inside the string. If we want a substring, we can just Slice() again and again - with zero allocations and zero string copying - we just have different spans over the same data. A rich set of APIs already exists in the corefxlab code for working with this type of string-like data. If you do a lot of text processing, this could have some really interesting aspects.
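
Hypothetically, usage might look like this (the exact overloads are per the corefxlab preview, so subject to change):

ReadOnlySpan<char> all = "/orders/123/items".Slice();
var tail = all.Slice(8);   // "123/items" - no new string allocated
var id = tail.Slice(0, 3); // "123" - still zero copies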

This all sounds too good to be true - what’s the catch?

Here’s the gotcha: in order to have the appropriate correctness guarantees when discussing something that could be a managed object, could be data on the stack, or could be unmanaged data, we run into very similar problems that make it impossible to store a ref T local as a field. Remember that a Span<T> is conceptually a ref T (reference) and int (length) - well: we still need to obey the rules imposed by that “conceptually”. For a trivial example of how we can get in a mess, we can tweak our stackalloc example:

private Span<byte> _span;
unsafe void StackAllocExample() {
    byte* data = stackalloc byte[128];
    _span = new Span<byte>(data, 128);
    ...
}
void SomeWhileLater() {
    ProcessData(_span);
}

Where does _span refer to in SomeWhileLater? I can’t tell you. We get into similar problems with anything that used fixed to get a pointer - the pointer is only guaranteed to make sense inside the fixed. Conceptually the issue is not restricted to pointers - it would apply equally if we could initialize Span<T> directly with a ref T constructor:

private Span<SomeStruct> _span;
void StackRefExample() {
    var val = new SomeStruct(123, 456);
    _span = new Span<SomeStruct>(ref val);
    // ^^^ hypothetical span of length 1
}

We didn’t even need unsafe to break things this time. No such constructor currently exists, very wisely!

We should be OK if we only ever use managed heap objects (arrays, etc) to initialize a Span<T>, but the entire point of Span<T> is to provide feature parity between things like arrays and pointers while making it hard to shoot yourself in the foot.
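For contrast, a span over a managed array is always safe to create, because the GC tracks the array for us (the stack-only rule below still applies to the span itself, of course):

int[] arr = { 1, 2, 3 };
Span<int> span = arr;           // implicit conversion from T[]
Span<int> tail = span.Slice(1); // { 2, 3 } - same array, no copy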

In addition to this, we also need to worry about atomicity. The runtime and language guarantee that a single reference can be read atomically (in one CPU instruction), but it makes no guarantees about anything larger. If we have a reference and a length, we start getting into very complex issues around “torn” values (an invalid pair of the reference and length that didn’t actually exist, due to two threads squabbling). A torn value is vexing at the best of times, but in this case it would lead to valid-looking code accessing unexpected memory - a very bad thing.

The stackalloc example above is a perfect example of code that will compile without complaint today, but will end very very badly - although we used unsafe, so: self-inflicted. But this and the atomicity issue are both illustrations of why we have…

The Important Big Rule Of Spans

Span<T> has undefined behavior off the stack. And in the future: may not be allowed off the stack at all - this means no fields, no arrays, no boxing, etc. In the same way that ref T only has defined behavior on the stack (locals, parameters, return values) - so Span<T> only has defined behavior on the stack. You are not meant to ever put a Span<T> in a field (including all those times when things look like locals but are actually fields, that I touched on last time). An immediate consequence of this is that atomicity is no longer an issue: each stack is specific to a single thread; if our value can’t escape the stack, then two threads can’t have competing reads and writes.

There’s some in-progress discussion on how the rules for this requirement should work, but it looks like the concept of a “ref-like” stack-only type is being introduced. ref T as a field would be ref-like, and Span<T> would be ref-like. Any ref-like type would only be valid directly on the stack, or as an instance field (not a static field) on a ref-like type. If I had to speculate at syntax, I’d expect this to look something like:

public ref struct Span<T> {
    ref T _reference;
    int _length;
    public ref T this[int index] { get {...} }
}

Emphasis: this syntax is pure speculation based on the historic reluctance to introduce new keywords, but the ref struct here denotes a ref-like type. It could equally be done via attributes or a range of other ideas - but note that we’re now allowed to embed the ref-like ref T field. Additionally, the compiler and runtime would verify that Span<T> is never used illegally as a field or in an array etc. Notionally, we could also declare our own ref-like types for scenarios that need the same stack-only semantics but where Span<T> isn’t the right shape.

Thinking back to the StackRefExample, if we wanted to safely support usage like:

var val = new SomeStruct(123, 456);
var span = new Span<SomeStruct>(ref val); // local, not field

then presumably it could work, but we’d have to have similar logic about returning ref-like types as currently exists for ref return, further complicated by the fact that we don’t have the single-assignment guarantee - we can reassign a Span<T>. If ref-like types work in the general case, then the logic about passing and returning such a value needs ironing out. And that’s complex. I’m very happy to defer to Vladimir Sadov on this!

EDIT: to clarify - it is only the pair of ref T and length (together known as a span, Span<T> or ReadOnlySpan<T>) that need to stay on the stack; the memory that we're spanning can be anywhere - and will often be part of a regular array (T[]) on the managed heap. It could also be a reference to the unmanaged heap, or to a separate part of the current stack.

So how am I meant to work with spans?

Sure, not everything is on the stack.

This isn’t as much of a limitation as it sounds. Instead of storing the Span<T> itself, you just need to store something that can manifest a span. For example, if you’re actually using arrays you might have a type that contains an ArraySegment<T>, but which has a property:

public Span<T> Span { get { ... } }

As long as you can switch into Span<T> mode when you’re inside an appropriate method, all is good.
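As a minimal sketch of that pattern (the DataChunk name and shape are hypothetical - just enough to show the idea):

class DataChunk<T>
{
    // fine to store as a field: ArraySegment<T> is not ref-like
    private ArraySegment<T> _segment;
    public DataChunk(T[] array, int offset, int count)
        => _segment = new ArraySegment<T>(array, offset, count);
    // a fresh span is manifested on the stack each time it is needed
    public Span<T> Span => new Span<T>(
        _segment.Array, _segment.Offset, _segment.Count);
}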

For a more unified model, the corefxlab code contains the Buffer<T> concept, but it is still very much a work in progress. We’ll have to see how it shakes out in time.

Wait… why so much ref previously?

We covered a lot of ref details - you might feel cheated. Well, partly we needed that information to understand the stack-only semantics of Span<T>. But there’s more! Span<T> also exposes the ref T directly via the aptly named DangerousGetPinnableReference() method. This is a ref return, and allows us to do any of:

  • store the ref return into a ref local and work with it
  • pass the ref return as a ref or out parameter to another method
  • use fixed to convert the ref to a pointer (preventing GC movement at the same time)

The latter option means that not only can we get from unsafe to Span<T>, but we can go the other direction if we need:

fixed(byte* ptr = &span.DangerousGetPinnableReference())
{ ... }
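And the first option - capturing into a ref local - might look like (a sketch, using an array-backed span for simplicity):

Span<int> span = new int[] { 1, 2, 3 };
ref int first = ref span.DangerousGetPinnableReference();
first = 42;                 // writes through to the underlying data
Console.WriteLine(span[0]); // 42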

If I can get a ref, can I escape the bounds?

The DangerousGetPinnableReference() method gives us back a ref to the start of the range, comparable to how a T* pointer refers to the start of a range in pointer terms. So: can we use this to get around the range constraints? Well… yes… ish:

ref int somewhere = ref Unsafe.Add(
    ref span.DangerousGetPinnableReference(), 5000);

This cheeky duo gives us a reference to whatever is 5000 integers ahead of the span we were thinking of. It might still be part of our data (if we have a large array, for example), or it might be something completely random. But the sharp-eyed might have noticed some key words in that expression… “Unsafe...” and “Dangerous...”. If you keep sprinting past signs with words like that on: expect to hit rocks. There’s nothing here that you couldn’t already do with unsafe code, note.

Doing crazy things with unmanaged memory

Sometimes you need to use unmanaged memory - this could be because of memory / collection issues, or because of interfacing with unmanaged systems - I use it in CUDA work, for example, where the CUDA driver has to allocate the memory in a special way to get optimal performance. Historically, working with unmanaged memory has been hard - you end up using pointers all the time. But we can simplify everything by using spans. Here’s our dummy type that we will store in unmanaged memory:

// could be explicit layout to match an external definition
struct SomeType
{
    public SomeType(int id, DateTime creationDate)
    {
        Id = id;
        _creationDate = creationDate.ToEpochTime();
        // ...
    }
    public int Id { get; }
    private long _creationDate;
    public DateTime CreationDate => _creationDate.FromEpochTime();
    // ...
    public override string ToString()
        => $"{Id}: {CreationDate}, ...";
}

We’ll need to allocate some memory and ensure it is collected, usually via a finalizer in a wrapper class:

unsafe class UnmanagedStuff : IDisposable
{
    private SomeType* ptr;
    public UnmanagedStuff(int count)
    {
        ptr = (SomeType*) Marshal.AllocHGlobal(
            sizeof(SomeType) * count).ToPointer();
    }
    ~UnmanagedStuff() { Dispose(false); }
    public void Dispose() => Dispose(true);
    private void Dispose(bool disposing)
    {
        if(disposing) GC.SuppressFinalize(this);
        var ip = new IntPtr(ptr);
        // memory from AllocHGlobal must be released via FreeHGlobal
        // (Marshal.Release is for COM pointers, and would be wrong here)
        if (ip != IntPtr.Zero) Marshal.FreeHGlobal(ip);
        ptr = default(SomeType*);
    }
}

The wrapper type needs to know about the pointers, so is going to be unsafe - but does the rest of the code need to? Sure, we could add an indexer that uses Unsafe.Read / Unsafe.Write to access individual elements, but that means copying the data constantly, which is probably not what we want - and it doesn’t help us represent ranges. But spans do: we can return a span of the data (perhaps via a Slice() API):

public Span<SomeType> Slice(int offset, int count)
    => new Span<SomeType>(ptr + offset, count);
// ^^^ not shown: validate range first

And we can consume this pretty naturally without unsafe:

// "stuff" is our UnmanagedStuff object
// easily talk about a slice of unmanaged data
var slice = stuff.Slice(5, 10);
slice[0] = new SomeType(123, DateTime.Now);                

// (separate slices work)
slice = stuff.Slice(0, 25);
Console.WriteLine(slice[5]); // 123: 23/04/2017 09:09:51, ...

If we want to talk about individual elements (rather than a range), then a ref local (via a ref return) is what we want; we could use the DangerousGetPinnableReference() API on a Span<T> for this, but in this case it is probably easier just to use Unsafe directly:

public ref SomeType this[int index]
    => ref Unsafe.AsRef<SomeType>(ptr + index);
// ^^^ not shown: validate range first 

We can consume this with similar ease:

// talk about a *reference* to unmanaged data
ref SomeType item = ref stuff[5];
Console.WriteLine(item); // 123: 23/04/2017 09:09:51, ...
item = new SomeType(42, new DateTime(2016, 1, 8));

// prove that updated *inside* the slice
Console.WriteLine(slice[5]); // 42: 08/01/2016 00:00:00, ...

And now from any code, we can talk directly to the unmanaged memory simply by passing it in as a ref parameter - it will never be copied, just dereferenced. If you want to talk about an isolated copy or store a copy as a field, then you can dereference, but that is easy:

SomeType isolated = item;

If you’ve ever worked with unmanaged memory from C#, this is a huge difference - and opens up a whole range of interesting scenarios for allocation-free systems without requiring the entire codebase to be unsafe. For context, in an allocation-free system, the lifetime of a set of data is strictly defined by some unit of work - processing an inbound request, for example. This means we don’t need reference tracking and garbage collection (and GC pauses can hurt high performance systems), so instead we simply take some slabs of memory, work from them (incrementing counters as we consume space), and then when we’ve finished the request we just set all the counters back to zero and we’re ready for the next request, no mess. Spans and ref locals and ref return make this friendly, even in the unmanaged memory scenario. The only caveat being - once again: Span<T> and ref T cannot legally escape the stack. But as we’ve seen, we can expose on-demand a Span<T> or ref T - so it isn’t a burden.
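To sketch that “slab + counter” idea (all names here are hypothetical, and range checks are omitted):

class Slab
{
    private readonly byte[] _memory = new byte[64 * 1024];
    private int _used; // bytes consumed by the current unit of work

    // hand out the next "size" bytes as a span; no per-call allocation
    public Span<byte> Rent(int size)
    {
        var span = new Span<byte>(_memory, _used, size);
        _used += size;
        return span;
    }

    // end of the request: everything is reclaimed at once
    public void Reset() => _used = 0;
}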

Summary

Spans; they’re very powerful if you need that kind of thing. And they force a range of new concepts into C#, giving us all the combined strong points of arrays, pointers, references and generics - with very few of the pain points. If you don’t care about pointers, buffers, etc - you probably won’t need to learn about spans. But if you do, they’re awesome. The amount of effort the .NET folks (and the community, but mostly Microsoft) have put into making this span concept so rich and powerful is huge - it impacts the compiler, the JIT, the runtime, and multiple libraries both pre-existing and brand new. And it impacts both .NET and .NET Core. As someone who works a lot in the areas affected by spans and ref - it is also hugely appreciated. Good things are coming.

Saturday, 22 April 2017

Spans and ref part 1 : ref


One of the new features in C# 7 is by-reference (ref) return values and locals. This is a complex topic to explain, but a good example of why we might want this is “spans” (Span<T>). I don’t have any inside knowledge on the design meetings, but I’d go further and speculate that if Span<T> wasn’t a thing, the ref-changes wouldn’t have happened, so it makes sense to consider them together. Most of the fun things you can do with ref returns / locals start to make a lot more sense when we look at Span<T> - which we’ll do in part 2, but first we need to remind ourselves what ref means, and explore the new ref changes.

ref returns and locals

There’s a reason that ref (and cousin out) aren’t used extensively on many APIs: they are hard to fully understand. A lot of people will describe them in terms of “changes being visible”, but that is just a side-effect, not the meaning. I don’t mean this as a criticism: it isn’t necessary for every C# developer to have a deep knowledge of the inner workings of these things.

But consider the following common question:

void PassByRef()
{
    int i = 42;
    IncrementByRef(ref i);
    // what does this line print, and why?
    Console.WriteLine(i);
}
void IncrementByRef(ref int x)
{
    x = x + 1; // increment
}

Most developers will be able to correctly understand that this will output “43”, but asking them exactly what happened can reveal very different levels of understanding. The short summary is that a reference to the variable i was passed to IncrementByRef; all the code in IncrementByRef that looks like it is reading / writing to the parameter is actually dereferencing the parameter at each stage. This is clearer if we write it in unsafe pointer code instead:

unsafe void PassByPointer()
{
    int i = 42;
    IncrementByPointer(&i);
    // what does this line print, and why?
    Console.WriteLine(i);
}
unsafe void IncrementByPointer(int* x)
{
    *x = *x + 1; // increment
}

Here we can clearly see the “take a reference to” operation (&) and “dereference” (*) operations, but there’s a lot of problems with pointers:

  • the garbage collector (GC) refuses to even try to walk pointers, which means you need to be very careful to only access memory that won’t move (unmanaged memory, stack memory, or pinned managed objects)
  • pointer arithmetic makes it trivially possible to access adjacent memory without any bounds checking
  • it forces us to use unsafe, which makes it very easy to make subtle but major bugs that cause just about any level of silliness imaginable
  • pointers only work for a small subset of types - essentially primitives and structs composed of primitives

The point of ref parameters is to get the best of both worlds. ref is essentially just like pointers, but with enough restrictions to stop us getting into messes. The additional sanity checks and restrictions mean that the IL carries enough meaning for the GC to navigate these references sensibly, so we don’t need to worry about the reference suddenly becoming meaningless - and since we can’t do anything too silly, we don’t need to drop to unsafe. And it works for any regular type.

But, historically, this ability to add automatic dereferencing and talk about ref has been restricted to method parameters; no fields, no locals, and no return values.

ref locals

The first change in C#7 allows us to talk about automatically dereferenced ref items as local (method) variables. In the same way that a ref parameter is denoted by a ref prefix before the type, so it is with ref locals, with the added bonus that ref var is legal:

void ByRefLocal()
{
    int i = 42;
    ref var x = ref i;
    x = x + 1;
    // what does this line print, and why?
    Console.WriteLine(i);
}

This prints “43” for exactly the same reasons as before - the only difference is that we now have a syntax to express ref when talking about locals. Previously, we would have had to add an extra method just to switch to ref semantics for a local. One slight peculiarity here is that a ref local must be assigned a value at the point of declaration - and that is the only point at which we can assign it. Any further attempt to assign a value to a ref local is interpreted as a dereferencing assignment - the *x = in our pointer example.
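To illustrate that assignment rule (valid for C# 7.0, where a ref local can never be re-targeted after declaration):

int a = 1, b = 2;
ref int r = ref a;    // must be initialized at the point of declaration
r = b;                // not a re-target: this dereferences, so a becomes 2
Console.WriteLine(a); // 2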

This ability is nice, but it isn’t very useful until we combine it with…

ref return

A much more interesting and powerful addition in C# 7 is ref returns. As the name suggests, this allows us to return a ref value from a method. We can capture this value into a ref local as long as we include an additional ref just before the value to make it very clear that we don’t want to dereference - which is the regular behaviour whenever touching a ref parameter or local:

ref int GetArrayReference(int[] items, int index)
    => ref items[index];

void IncrementInsideArrayByRef()
{
    int[] values = { 1, 2, 3 };

    ref int item = ref GetArrayReference(values, 1);
    IncrementByRef(ref item);
    // what does this line print, and why?
    Console.WriteLine(string.Join(",", values));
}

Here the GetArrayReference method provides the caller a ref to inside the array. Note that the ability to get a ref into an array is not by itself new - this has always worked:

IncrementByRef(ref values[1]);

The bit that is new and different is only the ability to return a ref value.

Since we increment a ref int that refers to the array index 1 (the second element), the result is “1,3,3”.

Note that we don’t need to capture the ref value before we use it - we can also pass a ref return result directly into a ref or out parameter:

IncrementByRef(ref GetArrayReference(values, 1));

Are there restrictions on what we can ref return?

Yes, yes there are. Figuring out the rules on what can and can’t be safely returned as ref without letting the author get into an accidental ugly mess of landmines is probably why it has never been supported in the past. We’ve seen that we can return ref references into arrays. And we’ve seen that we can take a ref of a local variable. But a local only exists in the context of the current stack-frame: very bad things would happen if we could ref return a ref to a local - the caller would have a ref to a position outside the active stack, which would be undefined behavior:

ref int ReturnInvalidStackReference()
{
    int i = 32;
    return ref i; // can't do this
}
void WhatHappensHere()
{
    ref int v = ref ReturnInvalidStackReference();
    CallSomeOtherMethods(); // to use the stack
    int i = v; // dereference the ref
}

You’ll be relieved to know that the compiler doesn’t let us do this - the compiler is very strict to ensure that if we want to ref return something, then it must demonstrably refer to a safe value. Put very simply: as long as the assignment doesn’t involve a ref to a local, we’ll be fine. return ref i; clearly involves the local i, so can’t be returned. The return ref expression is inspected for safety; each part of the expression must be safe. This includes any ref parameters that we have passed into any method calls:

ref int ReturnInvalidStackReference()
{
    int j = 42;
    return ref DoSomething(ref j);
}

This might look like a confusing restriction, but note that DoSomething could be implemented as:

ref int DoSomething(ref int evil) => ref evil;

which would expose the ref j stack reference to the caller of ReturnInvalidStackReference, so any such possibility is excluded. The implementation here is pretty solid, so if it refuses to let you ref return something, you’ve probably attempted something that looks too much like you’re involving locals of the current method.

But no ref fields

We have ref parameters, ref locals and ref returns. However, there is currently no support for ref fields (instance or static variables). Specifically, this is not legal:

struct Foo {
    ref int _reference;
}

or:

class Foo {
    ref int _reference;
}

The reason for this again relates to undefined behaviour and escaping the stack-frame. If we could put a ref into a field, then we are at grave danger of allowing access (at some point later on) to a position in memory that now means something completely unrelated to what it meant when we took the ref. Strictly speaking we can prove that ref fields would be valid if the assigned value comes from inside an object - and we’ll discuss another safe scenario in part 2 - but currently the rule is simple: no ref fields.

When is a local not a local?

This has some additional consequences for a number of code concepts that look like locals, but which are actually fields; for example:

  • locals in an iterator block (yield return)
  • locals in an async method
  • captured variables in lambdas, anonymous methods, and LINQ syntax comprehensions

All of these are - for similar reasons (with the added complication of lifetime) - scenarios where you already can’t use ref or out parameters or unsafe; so basically: if you can’t use ref, out parameters or unsafe, you won’t be able to use ref locals or ref return either.
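For example, the compiler rejects a ref local inside an iterator block, because the “local” would really be a field on a compiler-generated class and could outlive the stack-frame:

IEnumerable<int> Naughty(int[] values)
{
    ref int first = ref values[0]; // compile error: iterators cannot have ref locals
    yield return first;
}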

One additional scenario, though, is tuples: as I discussed previously, tuples are secretly implemented as fields on the ValueTuple<...> family. So: no ref values in tuples.

Summary

This should give you enough to start understanding what ref locals and ref returns are, but for them to really start to make sense we need a concrete example. And we get that in “spans”, coming up next!