Thursday, 17 April 2014

Technical Debt, a case study : tags

 

This is going to be more a discussion / wordy thing; no real code here. This is just to wander through some points, and maybe you’ll find them interesting or thought provoking. You have been warned.

At Stack Exchange, we have a fair understanding of technical debt. Like real debt, technical debt is not by necessity a bad thing – it can allow you to choose an acceptable (but not ideal) solution today, which means you can ship today, but you know that at some point you are going to have to revisit it. Like all loans, technical debt carries interest.

Today I’m talking about tags; specifically, the tags on Stack Exchange questions. These things:

image

I’m going to run through a lot of back-story, but if you’re lazy, just pretend you read all of it and skip to the last paragraph. Or, heck, go look at cat videos. I won’t tell.

Step 0 : the original problem

Way way back (long before I joined the company), the then-much-smaller team needed a way to store and query tags – in particular a wide range of “{a} and {b} and {c}”, “{d} or {e}”, “{f} but not {y}” etc queries for the “show me the questions in {some tags}” pages, or the search page (which allows a lot of tag-based filtering). Your classic database approach here might be to have a Posts table, and Tags table, and a PostTags table, and do a lot of work on PostTags, but as I understand it this simply didn’t want to perform well. Equally, we access and display questions a lot. No, a real lot. Huge amounts. The one thing that we want to happen really really efficiently is “fetch a question”.

Having that data spread over 3 tables requires complex joins or multiple queries (because you could be fetching 500 questions, which could each have 1-5 tags), and then processing the more complicated data. Because of this, we store the tags as a single character-data field in the database – all the tags in one string. This makes it possible to get the post including all the tags in a single row query, and just worry about deciphering it at the UI. Neat. But it doesn’t really allow for efficient query.

Aside: at this point I should also note that we additionally store the data in a PostTags table (although I can’t remember whether this was the case at the time) – the inline vs normalized data is kept in sync and used for different purposes; the things you do for performance, eh?

Step 1 : the evil/awesome hack

So, we’ve got inline tag data that is simple to display, but is virtually impossible to query. Regular indexing doesn’t really work well at finding matches in the middle of character data. Enter (trumpets) SQL Server Full-Text Search. This is inbuilt to SQL Server (which we were already using), and allows all kinds of complex matching to be done using CONTAINS, FREETEXT, CONTAINSTABLE and FREETEXTTABLE. But there were some problems: stop words and non-word characters (think “c#”, “c++”, etc). For the tags that weren’t stop words and didn’t involve symbols, it worked great. So how to convince Full Text Search to work with these others? Answer: cheat, lie and fake it. At the time, we only allowed ASCII alpha-numerics and a few reserved special characters (+, –, ., #), so it was possible to hijack some non-English characters to replace these 4, and the problem of stop words could be solved by wrapping each tag in a pair of other characters. It looked like gibberish, but we were asking Full Text Search for exact matches only, so frankly it didn’t matter. A set of tags like “.net c#” thus became “éûnetà écñà”. Obviously! This worked; it performed well (remember, Stack Overflow was still very young here), and allowed the team to ship. Shipping is a pretty important feature! The team knew it was a fudge, but all was well (enough) in the world…

Step 2 : the creaks begin

Stack Overflow was a hit. Over time, the question count grows steadily larger, as do the number of page requests. Eventually it became apparent that our flagrant abuse of Full Text Search was becoming a performance bottleneck. To be fair, Full Text Search wasn’t really intended to be used in this way, and we were using it at an alarming rate (even after caching). But by now the team had a viable product, more developers, and enough information to judge that more effort was both necessary and worthwhile. At different points, then, our search efforts were re-written to use Lucene.Net and then Elasticsearch, and the “list by tag” work grew a bespoke entity that we call the “tag engine” (which is actually what this previous post is about) – essentially an out-of-process index service. Nowhere near as fully-featured as (say) Lucene.Net, but it only needs to do one thing.

Step 3 : Stack Exchange has sites about language

In time, we had requests for sites that were still primarily in English, but were about languages; French, Japanese, Russian, etc. A lot of their tags were in English, but there was a need to remove our “ASCII alpha-numerics” restriction. Fortunately, since we had Elasticsearch and the tag-engine, this mainly meant changing the few remaining Full Text Search usages (things that  were used rarely, and hadn’t been worth fixing) to use alternatives. The format in the database, however, remained. And since we didn’t want to introduce even more bizarre rules, we kept the 6 reserved characters. Sorry, French.StackExchange – no “é” or “à” in your tags. In reality this was less of a problem than it sounds, as French.StackExchange elects to use accent-free tags – while sites like Russian.StackExchange could use the tags they wanted. And the world keeps turning.

Step 4 : Stack Exchange goes fully multi-lingual (well, ok, bilingual)

Enter Portuguese; in pt.stackoverflow.com, we have our first site that is completely in another language. No weasel room left now: they want tags like this:

image

And guess what: ç was one of our reserved tokens (representing +, so “c++” was stored as “écççà”). We couldn’t even whistle a happy tune and hope nobody would notice the gaps: they found it on day 1 of closed beta. Maybe day 2. We finally had a reason to remove this legacy from the past. But – and I cannot emphasize this enough: these changes were scary. So scary that we didn’t want to do a “update all the things” until we’d had chance to “bed it in” on pt.StackOverflow, and fix any glitches that we’d missed. As it happened, all the code was making use of a single “decipher the tags” method, so it was sensible and pragmatic to simply make that understand both the old format and whatever we came up with now. After some thought and consideration, we settled on a pipe (bar) delimited natural representation, with leading/trailing pipes, so “.net c#” becomes simply “|.net|c#|”. This has virtues:

  • very simple to parse
  • we can tell immediately from the first character which format it is
  • bulk update and removal of tags can be done with a simple replace (including the pipes, to avoid replacing mid-tag matches)
  • and unlike the old format, you don’t need to worry about special-casing the first/last tag when doing a bulk operation (trailing spaces, etc)

Sure, there’s still a reserved |, but we can live with that. We could have used a non-printing character, but that would be very awkward in things like JSON – lots of risk of subtle bugs.

Once this had been working on pt.StackOverflow for a while, we flipped a switch and all new write operations switched to the new format; all “language” sites could have free access to the 6 previously reserved characters. Hoorah!

Step 5 : the backfill

When we did that, we only affected new writes. There was still data using the old format. A lot of data (not just the questions themselves: all tag edits, for example, were stored in this way). But while it remained, our “bulk remove a tag” code had to cope with two formats, which is an unnecessary maintenance overhead. So finally this week we absorbed the pain to do a batched migration of all the old data to the new format. Fairly routine, if a little scary.

And happy dance; the hack is no more!

So why are you telling me all this? What is the point?

Our job as engineers is not always to write the best possible thing that we can, and that can solve every problem ever, and is beautiful code that makes us want to adopt it and take it on picnics. An important part of our job is to:

  • understand what the code actually and genuinely needs to be able to do today and tomorrow
  • think of a way of delivering that with the available resources
  • think a little about what might be needed in the future, acknowledging that this is only a best guess
  • make sure that it is not impossible (or prohibitively expensive) to change things later
  • At the time the original hack was put in, it was absolutely the right choice. A few lines of code, a few clicks in SQL Server, job done. Shipped. Working. It would not have been sensible to invest time getting an external indexing service working just for this. But there was always the possibility to change the implementation, and the “unscramble the mess” code was in one place. Sure: there was debt, but the level of pain was kept to a minimum, and has finally been paid back in full.

    41 comments:

    Binary Worrier said...

    You have NO IDEA how exactly that post matches our own "indexing" evolution, even to the point of "last few bits still using SQL FTI".

    Nice :)

    Tiendq said...

    Excellent post about technical debt I've ever read, hopefully all my bosses and ex-bosses read it too :)

    Xorlev said...

    Talk about deja vu -- had a very similar tag evolution. Was like reading about our own.

    Anonymous said...

    I am going to name my next computer language: |

    Makro said...

    the good thing is that you actually worked towards reducing the debt at the moment you felt it was too big. In most places, people just keep adding debt forever, until it becomes impossible or too costly to eliminate.

    Soner Gönül said...

    Excellent write up Marc.

    "Our job as engineers is not always to write the best possible thing that we can, and that can solve every problem ever, and is beautiful code that makes us want to adopt it and take it on picnics.."

    Nate said...

    I am SO happy to have stumbled across this post -- very reassuring. I spent several days trying to determine the best way to handle a tags system for a website I'm working on (http://www.pricewombat.com) that allows content to be tagged and for users to search by tags and this is basically exactly what I came up with. The only difference is that I decided to use commas as my delimiter and am using Sphinx to search.

    Dixon Leavitt said...

    Awesome post. Thank you. Very helpful to see the play by play of technical debt being payed. Helps put my current project worries in perspective. BTW I came to your blog through researching protobuf-net. Looks awesome. I'm going to give it a try in my Unity project.

    Banane9 said...

    Well, I'm sure French.StackExchange still hates you for not differentiating between é and è letters... Only one of them is an accent, the other one is something else :)

    Marc Gravell said...

    banane9 the entire point is that we don't need that any more, so all the "e"s are OK.

    Kimberly McNeeley said...

    I truly like it such a plans and uncommonly help the understudies. We appreciatively help premeir essay helped me write my paper the training and that is truly pleasant plans and we likewise dissection report of science studies. A debt of gratitude is in order regarding offer.

    You know who said...

    Why didn't you guys use the old school character escaping but rather borrowed some extraANSII chars for `+` and such?

    get hack here said...

    get from here the link to brave frontier hack.check url for free gems and coins .visit us at this website.

    Monster Warlord said...

    monster warlord cheat I am giving online hacks and cheats codes of monster warlord hack at zero coins than go to the website on us.

    Raja Hindustani said...

    Amazing working hacks and cheats for boom beach by supercell,its for free hack for boom beach

    club penguin code said...

    now you can get club penguin codes free of cost this is the bets place to get it

    obat liver untuk anak said...

    thank you for sharing information guys
    obat pegal linu
    obat benjolan di vagina
    obat pegal linu alami
    obat pegal linu alami yang aman
    obat bronkitis
    obat kista ginjal

    obat kista rahim alami said...

    good morning guys
    obat kista rahim alami

    obat benjolan di anus said...

    thank you information guys
    obat benjolan di anus

    obey alami said...

    terimakasih banyak untuk infonya
    obat asam lambung alami yang aman

    obey alami said...

    obat kista ginjal

    throne rush hacks said...

    It is imperative that we read blog post very carefully. I am already done it and find that this post is really amazing.throne rush hack tool

    obat amandel anak 5 tahun said...

    thanks for the information useful guys

    obat amandel anak 5 tahun
    obat liver untuk anak
    obey hungkul is typing...
    obey hungkul: obat radang prostat
    obat infeksi telinga untuk anak
    obat benjolan di leher
    obat pembengkakan kelenjar getah bening
    obat untuk menyembuhkan diabetes basah
    obat untuk meringankan kanker payudara

    obat amandel anak 5 tahun said...

    thanks for the information useful guys

    obat amandel anak 5 tahun
    obat liver untuk anak
    obey hungkul is typing...
    obey hungkul: obat radang prostat
    obat infeksi telinga untuk anak
    obat benjolan di leher
    obat pembengkakan kelenjar getah bening
    obat untuk menyembuhkan diabetes basah
    obat untuk meringankan kanker payudara





    Indra nunoel21 said...

    thanks for the information useful guys

    obat infeksi jantng

    rudra said...

    Then we have to ask that safety and security codes to those 3 buddy by mail or telephone call. free facebook accounts hacks

    Hanudin said...

    Thank you for the info gan, https://obatdalamuntuklukaoperasicaesar.wordpress.com/ | https://obatdalamuntuklukaoperasicaesar.wordpress.com/2015/10/06/obat-dalam-untuk-luka-operasi-caesar/ | http://www.tokodehan.com/

    PrimeDice Bot said...

    the good thing is that you actually worked towards reducing the debt at the moment you felt it was too big.

    animal jam said...

    I enjoyed over read your blog post. Your blog have nice information, I got good ideas from this amazing blog. I am always searching like this type blog post. I hope I will see again..
    halloween| animal jam| cool games| cool math games| 8 ball pool| sudoku| yoob| stickman games| unblocked games | tetris| friv4school

    Obat Herbal Darah Tinggi Yang Ampuh said...

    thank you for the information you present here,
    Keep posting and I really like it.. it is very good

    Trang Minh said...

    Nhật bản - đất nước mặt trời mọc với nhiều mặt hàng được nhiều người tại Việt Nam tin dùng và yêu thích với những điểm cao về chất lượng, ngoại hình, giá cả, ...Tuy nhiên, về các khoảng cách địa lý mà khách hàng không thể mua được những loại sản phẩm mà mình yêu thích. Do vậy chúng tôi cung cấp dịch vụ ship hàng từ nhật về việt nam
    , nhận ship hàng từ nhật về việt nam để có thể giúp khách hàng mua sắm đươc thuận lợi hơn.
    Ngoài ra, chúng tôi còn nhận ship hàng toàn quốc ở Việt Nam, ship hàng free ở 2 tình thành lơn là Hồ Chí Minh và Hà Nội: nhận ship hàng từ nhật về tphcm , nhận ship hàng từ nhật về hà nội>
    Hy vọng dịch vụ ship đồ từ nhật về việt nam dịch vụ vận chuyển hàng từ nhật về việt nam sẽ mang lại nhiều thuận lợi nhất cho khách hàng để khách hàng có thể tha hồ mua sắm các mặt hàng tại Nhật Bản.
    Ngoài ra chúng tôi còn cung cấp dịch vụ mua hàng trên amazon ship về việt nam với chi phí rẻ, thời gian chờ ngắn, đang và đã được nhiều người sử dụng.

    Anonymous said...

    I just want to say : why you do not click to play!

    - Mortal Kombat XL
    - Atari Breakout
    - Dragon Ball Z Games

    mario said...

    Busy life circumstances than the current world history.
    Mario , Friv , Doraemon Games , Kizi , Yoob Juegos

    Klinik Indonesia said...


    Berita yang sangat menarik dan bagus sekali untuk di baca. Artikel ini akan bermanfaat dan berguna untuk semua pembaca artikel di seluruh dunia. semangat untuk menulis Artikel menarik lainnya

    Klinik Indonesia

    Candy Sim said...

    All the best blogs that is very useful for keeping me share the ideas
    of the future as well this is really what I was looking for, and I am
    very happy to come here. Thank you very much
    earn to die play
    earn to die
    earn to die 3
    Hi! I’ve been reading your blog for a while now and finally got the
    earn to die 4
    courage to go ahead and give youu a shout out from
    earn to die 6
    Austin Texas! Just wanted to tell
    earn to die 5
    you keep up the fantastic work!my weblog
    age of war
    Hi! I’ve been reading your blog for a while now and finally got the
    happy wheels
    strike force heroes
    slither io
    slitherio
    good game empire

    friv said...

    Thanks for giving me the useful information. I think I need it. Thank you
    Ebog
    Z6
    Huz 10
    Frip 10
    friv 2
    Kizi 4
    friv

    Regina Hilary said...

    The war between humans, orcs and elves continues. Lead your race through a series of epic battles, using your crossbow to fend off foes and sending out units to destroy castles. Researching and upgrading wisely will be crucial to your success!
    slitherio | unfair mario 2 | age of war 2
    The game controls are shown just under . Movement mechanisms primarily include acceleration and tilting controls.
    cubefield | tank trouble | happy wheels | earn to die 2 |
    earn to die 1 | earn to die 2

    Candy Sim said...

    The blog or and best that is extremely useful to keep I can share the ideas
    of the future as this is really what I was looking for, I am very comfortable and pleased to come here. Thank you very much.
    strike force heroes| slitherio | earn to die 2
    earn to die 5| goodgame empire |tank trouble 2

    Anonymous said...

    Which database did u use to effectively store and retrieve tags?
    I am working on similar thing and I pretty much got what you wanted to say but I am still confused as to how to store these tags in database.

    Did u use MySQL, NoSQL or graph databases?

    Kristofer Kihn said...

    I cherished the way you talk about the point awesome work much obliged for the share Your useful post.
    Python Programming Assignment Help

    Jacob Martin said...

    Those who come to read your article will find lots of helpful and informative tipsOperations Research Project Help