Thursday, November 1, 2012

Hardware trends for datacenter computing

Moore’s Law (and family) are a significant boon to anyone writing distributed systems. As the systems grow over time, so too does the underlying hardware get more powerful, allowing exponential growth to be readily managed. Programs need to be modified to cope, of course, but the raw compute power is there and available. That said, designing algorithms that can scale with more than an order of magnitude of growth remains a difficult challenge, and regular rewrites are inevitable. In the interest of minimizing that number, here are a few trends I’ve observed that should continue into the next few years. Caveat emptor, this is my personal perspective, but if it doesn't line up with your experience I'd love to hear about it.

1. Compute Units are Static

  1. Moore’s Law tells us that the number of transistors on a die double every 2 years, and practical experience tells us that performance doubles every 18 months. In the multicore era, however, the performance of a single core has largely plateaued, with further advances coming in the form of more cores per chip.
  2. RAM and on-die caches roughly keep pace with CPUs, which means that now that we’re into core count growth, the RAM and cache per core is now approximately fixed.
  3. Disk storage is also keeping the same pace, and hence disk per core is now also approximately fixed.

Thus, we can define a theoretical “compute unit” as, say, 1 core, 2GB RAM, 20GB flash*, 200 GB disk (adjust numbers as you see fit); this unit should stay roughly stable going forward, with the number of such units increasing exponentially.

Note that the purchasable range of resources-per-core is generally increasing; this represents the most economical average rather than a strict upper or lower bound. Resources do not need to be apportioned to each process in these ratios, either; e.g. disk-heavy and CPU-heavy processes can coexist on a machine and balance each other out. But it's worth looking closely at anything expected to grow significantly on one axis and not in others, to see if it's sustainable long term.

* Flash adoption remains spotty, but early indications put it roughly halfway between RAM and disk on performance and price, and it seems to be keeping up the growth pace. I speculate it will become a standard inclusion going forward, but this isn’t as clear as the others.

Update: One of my colleagues has pointed out that disk performance is not keeping pace with capacity increases (neither seek nor transfer rates). I'm not sure how to include this detail in the compute unit definition itself, but it will likely prove to be significant and help drive the adoption of flash going forward.

2. The Datacenter is the New Machine

This is already largely true, but the trend will only get stronger. This is driven by two main trends:

  1. NUMA architectures are reducing sharing and performance uniformity within machines, and
  2. Intra-datacenter connectivity is growing faster than individual machine performance, reducing the impact of communicating between machines and racks.

Thus, the trend of running programs as many smaller components that may or may not share any particular physical hardware will continue to get stronger. Correspondingly, virtualization will be increasingly necessary for isolation, further abstracting processes from the hardware they run on.

We may also start to see this trend spread to the client-side as well, but I’m not sure what form it would take. In-home compute sharing?

3. Decomposable Programs are Ideal

As a corollary to the previous point, algorithms that can be decomposed into fixed-sized chunks are best. They can be apportioned among CPUs/machines/racks easily, and scaling is decoupled from machine growth (which occurs in generational bursts when hardware is refreshed). Which means that if sharding is necessary, dynamic sharding - a variable number of fixed-size chunks - should be preferred over having a fixed number of shards that grow separately.

MapReduce (or Hadoop), Pregel (or Giraph), and GFS (or HDFS) are all examples of decomposable programs, tackling data processing, graph computation, and storage, respectively.

4. Consumer Bandwidth is Lagging

Consumer network growth is lagging behind intra- and inter-datacenter networks, and hardware trends. Some of this is due to a shift towards mobile, where equivalent resources (primarily images and videos) are smaller, but it holds on wired internet as well. It is unclear to me how long this will last - gigabit customer internet is already being tested in a few places, for example - but for the time being it remains true. This means inter-datacenter communication is growing cheaper relative to communicating with consumers. While it remains true we should expect more work to be pushed remotely to the “cloud” and an increase in datacenter entities cross communicating, and a slow shift towards smaller and more expensive resources (heavier compression, more computationally expensive codecs, etc) for customers.

5. Latency Doesn’t Scale

A fact that has always remained true: the speed of light is (effectively) constant, and does not decrease in line with other improvements. As bandwidth increases latency will continue to dominate long-distance performance, and round-trip times will continue to constrain serial throughput. Together, these will continue to push a couple of long-lived trends:

  1. Reducing data proximity. Caching (local and edge) and having data prefetched are key in reducing latency, and get comparatively cheaper over time. One could imagine future efforts that will proactively push resources straight into user networks or devices as well.
  2. Supporting long-lived connections and message parallelism. Connection establishment requires at least one round trip, and any form of serial channel use cannot fully utilize one. Together these will lead to more connection sharing where possible, and more long-lived idle connections otherwise.

Further Reading:

IEEE Industry Connections Ethernet Bandwidth Assessment
(Thorough study of bandwidth trends at all levels)

Sunday, August 26, 2012

Primes part 2: A segmented sieve

TL;DR: The sum of all primes <= 1,000,000,000,000 is 18,435,588,552,550,705,911,377

This post is a followup to Writing an efficient Sieve of Eratosthenes

A while back I wrote a post detailing a memory-efficient Sieve of Eratosthenes. The algorithm I used took advantage of lazy evaluation and a sliding array to reduce the RAM requirement to a fraction of what a 'vanilla' implementation would require, at the cost of a non-trivial increase in CPU time. The resulting code ran at approximately 25% the throughput of the vanilla implementation, and maxed out at 2 billion candidates.

While researching that post, I noted that the most efficient generators at present use either the Sieve of Atkin or a 'segmented sieve'. As an excuse to play with Go[1], a couple weeks ago I decided to implement a segmented Sieve of Eratosthenes. This post details my results.


Language: Go
Time complexity:
Space complexity: O(√n)
Candidates in 1 sec: ~50,000,000

Gist (expand/contract)

The algorithm proceeds as follows:

  1. Calculate primes up to √max via a vanilla array sieve
  2. Slice up segments of about √max candidates for checking
  3. To check a range,
    1. For each prime p from 1., find the first multiple within the range that's >= p2
    2. Cross off every multiple from there to the end of the range
  4. Merge the results from the processed segments

You'll note that other than splitting the full candiate set into segments, this is the standard Sieve of Eratosthenes. Hence, it's the segmented Sieve of Eratosthenes.

In my Go version this is implemented by starting segments as individual goroutines that output to their own channels. A single worker goroutine is responsible for marshaling the results from these channels to a single channel read by the main thread. This architecture was chosen simply because it fits well with the Go way of doing things, but it also has the side-effect of providing some amount of free parallelism.


The very first run of this variant was faster than the most optimized version from my previous post. It runs at about 65% the speed of a vanilla implementation, making it about 2.5x as efficient as the previous lazy implementations, with a lower memory footprint. As always, a better algorithm is worth more than any amount of low level code tuning :). I should point out that in the current implementation I also implemented a bit array rather than simply using an array of bools. This reduced the memory footprint somewhat, but did not appear to have any significant impact in either direction on CPU time required, and so could reasonably be dropped to shorten the code.

With all primes needing to be marshaled back to the main thread parallelism maxes out below 2x linear. If all we care about is an aggregate value computed from the primes (the sum in this case), rather than the primes themselves in order, we can achieve >4x parallelism simply by adding more processes. This is also more efficient in general, and allows >300,000,000 primes to be processed in one second[2].

The net result is an implementation that can sieve 50M candidates in one second on one core or sum 300M on four; sum primes up to one trillion in an hour; or sum primes in a range of one billion (10^9) in the region of one quintillion (10^18) in under a minute. I'm happy with that...for now.


  1. Let me say right now that Go is a fantastic language to work with, being both easy to write and efficient to run. I fully intend to start writing some of my work projects in Go in the near future.
  2. As noted in the previous post, we use the generic "primes in one second" metric for making apples-to-oranges comparisons of algorithms and implementations. This is not intended to provide anything more than a rough comparison.

Sunday, August 19, 2012

The algorithms of memory

The human brain has the best storage system on the block in a lot of ways. It’s notably lossy and doesn’t have the easiest API to work with, but in terms of flexibility it’s second to none. Computer scientists have been trying to model and mimic its features for a lot of years, but we haven’t got it cracked quite yet. Part of the challenge lies in the sheer variety of access styles that human memory allows. I don’t think we even have them all cataloged yet, much less working together in one system.

I’ve been trying over the last couple days to list all the different patterns I can see myself using. I’ve also tried to pull out systems I know of that do the same for comparison, although I can’t find direct equivalents in all cases. Those without an existing equivalent are probably the most interesting - would mimicking these patterns be useful in the future?

NameMind exampleSystem exampleCharacterized by
Cache hitFacts immediately available for use with no delayIn-memory data storageSynchronous; low latency
Long term storageFacts that take a while to look up. “Um...his name... was.... Let's move on - it'll come to me in a minute.”Lookups from disk or tapeAsynchronous (notify on delivery); high latency
RemindersRemembering to go buy groceries on your way home from workCalendar notificationsTime or event based; defined in advance
Information requestsAll the associations that come up when you think of a topic. “New Orleans” brings to mind...Web searchWeb of relationships; can be explored further in any direction
Background processingComing up with answers/solutions while you sleep or otherwise aren’t explicitly working on the problemUploading a video to YouTube - when you check again all the different formats and qualities will be availableProcessing item while in storage; queued work requests; separated from foreground tasks
Full ScanLife flashing before your eyes (wiki)Processing ordered events during crash recoveryOrdered; sequential scan of all items
Guided randomnessBrainstorming, free association games, mad libs, improv?Random item output or random exploration of web; subject to limited constraints
Unsolicited triggered remindersBeing reminded of facts/stories/events by things you see or hear? [1]
Unsolicited notifications; loosely related to recent input
Unsolicited untriggered remindersMemories that come to mind with no discernible trigger, e.g. past regrets?Unsolicited notifications; no relation to recent input; may be randomly derived
State affectingMemories that change your mood or attitude. E.g. remembering a birthday makes you happy; remembering specific near miss makes you cautious.? [2]State changes triggered by the contents of the information retrieved
(suggested by e-socrates on Reddit)
"We constantly predict from memory what to expect, then compare that to experience"?Continuous short-term predictions; characterizing events as unexpected or surprising


  • [1] Google Now is trying for this to some extent. Online advertising partially fits, however it is not bubbling up details you already know - rather, it’s feeding you new data.
  • [2] There are some trigger-based examples of this in security, e.g. logging of accesses to documents, but they don’t really change the state of the system itself (they trigger work in others instead).

Monday, January 30, 2012

On hosting static files

You may have noticed that I have a blog. Shocking, I know. It’s predominantly composed of text, as many are, but that’s not the only content I like to share. Posts are more approachable and memorable if they contain images, sorting visualizations require Javascript frameworks, etcetera. Unfortunately, all these support files need to be stored somewhere.

This blog is hosted on Blogger, a decision I’m very happy with. Blogger doesn’t do arbitrary file hosting (that I know of), but it does support pushing images to Picasa. So images are hosted by Picasa, another Google service, no sweat. Now, what about other files?


It turns out the answer to this question is a lot trickier than I would have hoped. I was hoping for another Google service, to keep my dependencies down, so I started with Sites.

Google Sites

Google Sites is a free and easy way to create and share webpages”, the tagline goes. And it is – we use it internally all the time. One quick form later, I have a site, and another couple clicks puts a single attachment on it. Great! Update the post to reference it, double-check in incognito mode, and my work here is done. Right?

Well…not quite. It turns out that Sites uses some interesting redirect magic to actually link you to your content. Redirect magic that, for reasons I don’t understand, doesn’t actually resolve when you’re using it to hotlink content. For exactly this reason I guess, I don’t know. Anyways, since I had visited this content while on Sites, my browser had it cached, and even incognito mode could access it, but it wouldn’t resolve anywhere else. Which is a good reason to test things on more than one computer, I suppose.

Ok, not sites. App Engine?

Google App Engine

App Engine does do static file hosting. Instructions are here – you must simply download the SDK, create the app.yaml file appropriately, upload your new App (the files must belong to something, after all), etc. This looks doable, but certainly non-trivial. I also cannot figure out how this is going to be priced, nor the latency to expect from it. So let’s keep looking.

I’m running out of Google services (I considered and rejected a few more). Time to look further afield. I don’t know of any good, trustworthy solutions offhand, and a quick search isn’t taking me anywhere I really want to go, so lets look at Amazon AWS. They have a service for every occasion, right?

Amazon AWS

As it turns out, yes, yes they do. A quick look through the options (there are so many!) says that Amazon Simple Storage Service (S3) will do nicely, optionally backed by CloudFront if I ever really need to scale. A quick signup, upload of my files, and one new CNAME, and I’ve got my files living under, backed by S3. Nice! Doubly so since the free usage tier should make this entirely free for the next year or so, and pennies after.

Finally, in researching this blog post, I found one more option. And it’s a Google service, which was my original goal. Let’s check it out!

Google Cloud Storage

New on the block, Google Cloud Storage appears to be a competitor to Amazon S3. Priced a tiny bit cheaper, the features seem roughly comparable. The documentation is a bit rough still, which partly explains why I didn’t figure this out earlier, but everything is there if you look hard enough. One significant distinction is that you do not choose where data is stored (unless you want to add restrictions), and it will get replicated wherever needed. Note that this includes automatic edge caching for popular files, so this is pretty much S3 and CloudFront all rolled into one. Fancy! It supports the same CNAME aliases, so I’ve got this set up as a hot-spare for my S3 data. I’ll leave S3 as primary for now since I’ve got it all configured and tested happily, but it looks like I’d be well served either way.

Maybe in a future post I’ll do a head-to-head comparison of S3 and GCS, if I can figure out a fair way of measuring performance all over the globe. Until then, I’m happy to stick with either.

Mission accomplished – static files hosted. Time for bed.

Monday, January 16, 2012

Marriage Sort, a visualization

You'll need a modern browser that handles SVG (a recent Chrome, Firefox, or IE9 will do) to properly see this post.

On Saturday, Mike Bostock posted a link on Hacker News to an explanation of the Fisher-Yates Shuffle he had written. Alongside the explanation he included a very slick visualization where you could see it in action. Now, at a fundamental level, shuffle algorithms and sorting algorithms are simply the reverse of each other. So if this visualization does well on shuffle algorithms, why not a sorting algorithm as well?

A couple years ago I wrote a sorting algorithm, which I gave the tongue-in-cheek name Marriage Sort (an homage to the problem that inspired it - see the introduction post for more details). I was looking for a project to play with on the weekend, so I thought, why not try the visualization on that? Mike has kindly let me re-use his code, so here's the same visualization of Marriage Sort.

Disclaimer: The visualization I'm building off looks great across platforms, so if there is anything wrong with this one, it's from my changes. Go see the original if you don't believe me.

For those that don't remember the algorithm, here's a synopsis. There are two stages:
  • Stage 1:
    1. Take a prefix of size √m, where m is the size of the working set.
    2. Pick the maximum (in this case, most-right-leaning) of the prefix, and take it as the pivot.
    3. Walk through the list, pulling out anything greater than the pivot and moving it to the end of the working set.
    4. Move the pivot to the end of the working set.
    5. Go back to step 1 and repeat.
  • Stage 2: a bog-standard insertion sort.
As the passes proceed the pivots decrease in size and more values are matched, allowing the array to be put into a "mostly sorted" state. On average, every element is within √n spots of the correct position, which the insertion sort then corrects. Overall complexity O(n1.5) time, O(1) extra space. Code for the visualization is at for easy perusing.

Looks good to me! What do you think?

Tuesday, May 10, 2011

The Game of Life, part 2: HashLife

Last time I wrote I gave a brief introduction to the Game of Life and a very simple Python implementation for visualizing it. I will freely admit that was a teaser post; this post gets into the real meat of the topic with an overview of the HashLife algorithm and a much more interesting 570


This entry has taken me an embarrassingly long time to post. As is my habit, I wrote the code and 90% of the post, and then left it for months and months. Whoops!

If you haven’t played with a Game of Life viewer before they are legitimately fun to toy around with - I encourage you to check this one out (code is here). Since the last version everything is much improved. The viewer supports a larger set of controls (see the README for details) and basic file reading is implemented so it’s possible to try new starting patterns on the fly. And, as promised, I’ve implemented the HashLife algorithm to massively speed up iterations, so enormous patterns billions of generations forward are easily within your reach.


HashLife is a simple yet interesting algorithm. Invented in 1984 by Bill Gosper (of Gosper glider gun fame), it exploits repeated patterns to dramatically cut down the work required to support large patterns over vast numbers of iterations. Between the Wikipedia page and the enigmatically named “An Algorithm for Compressing Space and Time” in Dr. Dobb’s Journal I think it’s decently well explained, but it took me a couple read-throughs to really wrap my head around so I’m going to try to give an overview of the key insights it utilizes.


At it’s heart, HashLife is built around the concept of a quadtree. If you’re unfamiliar with it, a quadtree takes a square region and breaks it into four quadrants, each a quarter the size of the original. Each quadrant is further broken down into quadrants of its own, and on down. At the bottom, in squares of some minimum size like 2x2, actual points are stored. This structure is usually used to make spatial queries like “what points intersect this bounding box” efficient, but in this case two other properties are taken advantage of. First, nodes at any level are uniquely defined by the points within their region, which means duplicated regions can be backed by the same node in memory. For the Game of Life, where there are repeated patterns and empty regions galore, this can drastically reduce the space required. Second, in the Game of Life a square of  (n)x(n) points fully dictates the inner (n-2)x(n-2) core one generation forward, the inner (n/2)x(n/2) core n/4 generations forward, irrespective of what cells are adjacent to it. So the future core of a node can be calculated once and will apply at any future point in time, anywhere in the tree.

Inner nodesTogether these properties allow for ridiculous speedups. Hashing and sharing nodes drastically reduces the space requirements, with exponentially more sharing the further down the tree you go. There are only 16 possible leaf nodes, after all! From this, calculating the future core for a node requires exponentially less time than a na├»ve implementation would. It can be done by recursively calculating the inner core of smaller nodes, where the better caching comes into play, and then combining them together into a new node. You might be wondering if the gains from caching are lost to the increasing difficulty of determining which nodes are equal, but with a couple careful invariants we actually get that for free. First, nodes must be immutable - this one’s pretty straightforward. Second, nodes must be unique at all times. This forces us to build the tree from the bottom up, but then checking if a new node duplicates an existing one is simply a matter of checking if there are any existing nodes that point to the same set of quadrants in the same order, a problem that hash tables trivially solve.

def __hash__(self):
# Hash is dependent on cells only, not e.g. _next.
# Required for Canonical(), so cannot be simply the id of the current
# object (which would otherwise work).
return hash((id(self._nw), id(self._ne), id(self._sw), id(self._se)))

def __eq__(self, other):
"""Are two nodes equal? Doesn't take caching _next into account."""
if id(self) == id(other):
return True
return (id(self._nw) == id(other._nw) and
id(self._ne) == id(other._ne) and
id(self._sw) == id(other._sw) and
id(self._se) == id(other._se))


As before, the code I’ve written is for Python 2.6 and makes use of PyGame, although neither dependency is terribly sticky. The code lives in a repository on github, and I welcome any contributions you care to make. As the code here is complicated enough to be almost guaranteed a bug or two, there is a basic set of unit tests in and the code itself is liberally sprinkled with asserts. Incidentally, removing the asserts nets a 20% performance gain (as measured by the time it takes to run the ‘PerformanceTest’ unit test), although I find the development time saved by having them is easily worth keeping them in forever. As noted later, the performance of the implementation isn’t all that important anyways. Which is a good thing, since I coded it in Python!

A comment on rewrites: during the transition from version 1 - a simple brute force algorithm - to version 2 - the Node class that implements HashLife - I had both algorithms implemented in parallel for a while. This let me have every second frame rendered by the old algorithm so I could ensure that at different times and different render speeds that the algorithms were coming up with the same results. I’ve seen this pattern used at work for migrating to replacement systems and it’s very much worth the extra glue code you have to write or the confidence it gives. John Carmack recently wrote about parallel implementations on his own blog, if you want to hear more on the topic.


The performance is hard to objectively detail for an algorithm like this. For example, it takes ~1 second to generate the billionth generation of the backrake 3 pattern, which has around 300,000,000 live cells; it takes ~2 seconds to generate the quintillionth generation with 3x10^17 live cells. But this is a perfect patterns to showcase HashLife - a simple spaceship traveling in a straight line, generating a steady stream of gliders. In comparison, a chaotic pattern like Acorn takes almost 25 seconds to generate just 5000 generations with at most 1057 alive at any time. As it stands the properties of the algorithm drastically outweigh the peculiarities of the implementation for anything I care to do. Although I must say, if you want to compare it to another implementation in an apples to apples comparison I’d love to hear the numbers you get.

As always, I’d love to hear what you think!

Saturday, March 12, 2011

The Game of Life, part 1

Update: See part 2 for the implemented HashLife algorithm.

The Game of Life is a fascinating system. It was invented by John Conway in 1970 and has been studied continuously ever since. For those reading who haven’t heard of it before, a brief explanation: The world is an infinite grid of points, all either alive or dead. After each generation – or ‘iteration’ if you’d prefer – cells are updated according to the following three rules:

  1. If a cell is alive and it has two or three live neighbours, it stays alive.
  2. If a cell is dead and it has exactly three live neighbours, it becomes alive (tripartite reproduction?).
  3. Any other cell is dead.

Blinker infinite growth Glider

From these simple rules amazing complexity can arise. Some configurations are stable, like the period two “blinker” [above left], or the period four “glider” [above right] that moves one row over and one row down with every cycle. Other configurations, like the one above centre, grow infinitely – this one spits out two gliders then lays down a zig-zag strip of blocks forever after.

There is more to the Game of Life than pretty patterns and curious growth, I must hasten to add. It has been studied by a host of people in a variety of fields and has gone on to start a new branch of mathematics (cellular automata) and spur discussions on whether a sufficiently complicated pattern could be considered intelligent. It has also been proven to be Turing complete, so any computation your computer can run can be run by simulating the Game of Life with the correct starting state.

I have implemented a basic python program for simulating the Game of Life on GitHub. It allows for infinite patterns, grows the field of view automatically, and allows speed to be controlled with Up/Down, but otherwise is a very simple implementation. The goal here is to eventually implement some of the more interesting algorithms for speeding up the simulation.  There are numerous such algorithms, although the one I find the most interesting is called Hashlife and exploits repeated patterns through space and time to achieve an exponential speedup in running the simulation. More details in part 2, whenever I write it :).