Thursday, November 1, 2012

Hardware trends for datacenter computing

Moore’s Law (and family) is a significant boon to anyone writing distributed systems. As a system grows over time, the underlying hardware grows more powerful along with it, allowing exponential growth to be readily managed. Programs need to be modified to cope, of course, but the raw compute power is there and available. That said, designing algorithms that can scale through more than an order of magnitude of growth remains a difficult challenge, and periodic rewrites are inevitable. In the interest of keeping their number down, here are a few trends I’ve observed that should continue into the next few years. Caveat emptor: this is my personal perspective, and if it doesn't line up with your experience I'd love to hear about it.

1. Compute Units are Static

  1. Moore’s Law tells us that the number of transistors on a die doubles every 2 years, and practical experience tells us that performance doubles every 18 months. In the multicore era, however, the performance of a single core has largely plateaued, with further advances coming in the form of more cores per chip.
  2. RAM and on-die caches roughly keep pace with CPUs, which means that with growth now coming from core count, RAM and cache per core are approximately fixed.
  3. Disk storage is also keeping the same pace, and hence disk per core is now also approximately fixed.

Thus, we can define a theoretical “compute unit” as, say, 1 core, 2 GB RAM, 20 GB flash*, 200 GB disk (adjust numbers as you see fit); this unit should stay roughly stable going forward, with the number of such units increasing exponentially.
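
In code form, for concreteness, here is a sketch of the idea in Go (a hypothetical definition and a hypothetical sizing helper; adjust the constants to taste):

    // A hypothetical "compute unit", using the ratios above. Capacity
    // grows by adding units; the unit itself stays roughly fixed.
    type ComputeUnit struct {
        Cores   int // 1
        RAMGB   int // 2
        FlashGB int // 20
        DiskGB  int // 200
    }

    // unitsNeeded sizes a job by its hungriest axis. A job that grows on
    // one axis only still pays for whole units, idle resources and all -
    // which is why lopsided growth deserves a close look.
    func unitsNeeded(cores, ramGB, diskGB int, u ComputeUnit) int {
        ceil := func(need, per int) int { return (need + per - 1) / per }
        n := ceil(cores, u.Cores)
        if m := ceil(ramGB, u.RAMGB); m > n {
            n = m
        }
        if m := ceil(diskGB, u.DiskGB); m > n {
            n = m
        }
        return n
    }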

Note that the purchasable range of resources-per-core is generally increasing; the compute unit represents the most economical average rather than a strict upper or lower bound. Nor do resources need to be apportioned to each process in these ratios; e.g. disk-heavy and CPU-heavy processes can coexist on a machine and balance each other out. But it's worth looking closely at anything expected to grow significantly on one axis but not the others, to see whether it's sustainable long term.

* Flash adoption remains spotty, but early indications put it roughly halfway between RAM and disk on performance and price, and it seems to be keeping up the growth pace. I speculate it will become a standard inclusion going forward, but this isn’t as clear as the others.

Update: One of my colleagues has pointed out that disk performance is not keeping pace with capacity increases (neither seek nor transfer rates). I'm not sure how to include this detail in the compute unit definition itself, but it will likely prove to be significant and help drive the adoption of flash going forward.

2. The Datacenter is the New Machine

This is already largely true, and it will only get stronger, driven by two main factors:

  1. NUMA architectures are reducing sharing and performance uniformity within machines, and
  2. Intra-datacenter connectivity is growing faster than individual machine performance, reducing the impact of communicating between machines and racks.

Thus, programs will increasingly run as many smaller components that may or may not share any particular piece of physical hardware. Correspondingly, virtualization will be increasingly necessary for isolation, further abstracting processes from the hardware they run on.

We may start to see this trend spread to the client side as well, but I’m not sure what form it would take. In-home compute sharing?

3. Decomposable Programs are Ideal

As a corollary to the previous point, algorithms that can be decomposed into fixed-size chunks are best. They can be apportioned among CPUs, machines, and racks easily, and scaling is decoupled from machine growth (which occurs in generational bursts when hardware is refreshed). This means that if sharding is necessary, dynamic sharding - a variable number of fixed-size chunks - should be preferred over a fixed number of shards that each grow separately.
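
As a rough illustration of the distinction, dynamic sharding looks something like this in Go (names are mine, not from any particular system):

    // A key range covered by one fixed-size chunk.
    type Chunk struct {
        Start, End uint64 // covers keys in [Start, End)
    }

    // dynamicShards cuts a keyspace into however many fixed-size chunks
    // are needed. As the keyspace grows, the chunk count grows with it,
    // and each chunk stays small enough to schedule onto one compute unit.
    func dynamicShards(totalKeys, chunkSize uint64) []Chunk {
        var chunks []Chunk
        for start := uint64(0); start < totalKeys; start += chunkSize {
            end := start + chunkSize
            if end > totalKeys {
                end = totalKeys
            }
            chunks = append(chunks, Chunk{start, end})
        }
        return chunks
    }

Contrast this with a fixed shard count, where each shard's size doubles along with the data and eventually outgrows whatever machine it lives on.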

MapReduce (or Hadoop), Pregel (or Giraph), and GFS (or HDFS) are all examples of decomposable programs, tackling data processing, graph computation, and storage, respectively.

4. Consumer Bandwidth is Lagging

Consumer network growth is lagging behind intra- and inter-datacenter networks, and behind hardware trends generally. Some of this is due to a shift towards mobile, where equivalent resources (primarily images and videos) are smaller, but it holds on wired internet as well. It is unclear to me how long this will last - gigabit consumer internet is already being tested in a few places, for example - but for the time being it remains true. This means inter-datacenter communication is growing cheaper relative to communicating with consumers. While that holds, we should expect more work to be pushed remotely to the “cloud”, an increase in datacenter entities communicating with each other, and a slow shift towards smaller but more computationally expensive resources for customers (heavier compression, more expensive codecs, etc).

5. Latency Doesn’t Scale

A fact that has always remained true: the speed of light is (effectively) constant, so latency does not improve in line with other hardware. As bandwidth increases, latency will continue to dominate long-distance performance, and round-trip times will continue to constrain serial throughput. Together, these will continue to push a couple of long-lived trends:

  1. Increasing data proximity. Caching (local and edge) and prefetching data are key to reducing latency, and get comparatively cheaper over time. One could imagine future efforts that proactively push resources straight into user networks or devices as well.
  2. Supporting long-lived connections and message parallelism. Connection establishment requires at least one round trip, and any strictly serial use of a channel cannot fully utilize it (see the sketch below). Together these will lead to more connection sharing where possible, and more long-lived idle connections otherwise.
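
As a back-of-the-envelope illustration of the second point (numbers mine, purely illustrative):

    package main

    import (
        "fmt"
        "time"
    )

    // maxSerialThroughput is the ceiling for a strictly serial
    // request/response channel: one window of data per round trip,
    // no matter how fat the pipe is.
    func maxSerialThroughput(bytesPerRoundTrip int, rtt time.Duration) float64 {
        return float64(bytesPerRoundTrip) / rtt.Seconds() // bytes/sec
    }

    func main() {
        // 64 KB per round trip at 100 ms RTT caps out around 640 KB/s,
        // even on a 10 Gb/s link. Only parallel messages or fewer round
        // trips can help.
        fmt.Printf("%.0f bytes/sec\n",
            maxSerialThroughput(64<<10, 100*time.Millisecond))
    }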

Further Reading:

IEEE Industry Connections Ethernet Bandwidth Assessment
(Thorough study of bandwidth trends at all levels)

Sunday, August 26, 2012

Primes part 2: A segmented sieve

TL;DR: The sum of all primes <= 1,000,000,000,000 is 18,435,588,552,550,705,911,377

This post is a follow-up to Writing an efficient Sieve of Eratosthenes

A while back I wrote a post detailing a memory-efficient Sieve of Eratosthenes. The algorithm I used took advantage of lazy evaluation and a sliding array to reduce the RAM requirement to a fraction of what a 'vanilla' implementation would require, at the cost of a non-trivial increase in CPU time. The resulting code ran at approximately 25% of the throughput of the vanilla implementation, and maxed out at 2 billion candidates.

While researching that post, I noted that the most efficient generators at present use either the Sieve of Atkin or a 'segmented sieve'. As an excuse to play with Go[1], a couple weeks ago I decided to implement a segmented Sieve of Eratosthenes. This post details my results.

Implementation

Language: Go
Time complexity: O(n∙log(log(n)))
Space complexity: O(√n)
Candidates in 1 sec: ~50,000,000


The algorithm proceeds as follows:

  1. Calculate primes up to √max via a vanilla array sieve
  2. Slice up segments of about √max candidates for checking
  3. To check a range,
    1. For each prime p from step 1, find the first multiple within the range that's >= p^2
    2. Cross off every multiple from there to the end of the range
  4. Merge the results from the processed segments

You'll note that other than splitting the full candidate set into segments, this is the standard Sieve of Eratosthenes. Hence, it's the segmented Sieve of Eratosthenes.
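
In Go, the heart of step 3 might look something like this (a simplified sketch for illustration, not the exact code from my implementation):

    // sieveSegment finds the primes in [lo, hi), given the base primes
    // up to √hi computed by the vanilla sieve in step 1. Assumes lo >= 2.
    func sieveSegment(lo, hi uint64, basePrimes []uint64) []uint64 {
        composite := make([]bool, hi-lo)
        for _, p := range basePrimes {
            // Start at p*p or the first multiple of p in the range,
            // whichever is larger; smaller composites were already
            // handled by smaller primes or earlier segments.
            start := p * p
            if start < lo {
                start = (lo + p - 1) / p * p
            }
            for m := start; m < hi; m += p {
                composite[m-lo] = true
            }
        }
        var primes []uint64
        for i, c := range composite {
            if !c {
                primes = append(primes, lo+uint64(i))
            }
        }
        return primes
    }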

In my Go version this is implemented by starting segments as individual goroutines that output to their own channels. A single worker goroutine is responsible for marshaling the results from these channels to a single channel read by the main thread. This architecture was chosen simply because it fits well with the Go way of doing things, but it also has the side-effect of providing some amount of free parallelism.
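
Stripped down, that wiring looks roughly like the following (reusing the sieveSegment sketch from above; the real code differs in the details):

    // primeStream launches one goroutine per segment, plus a merger that
    // forwards results to a single output channel in segment order.
    func primeStream(max uint64, basePrimes []uint64, segSize uint64) <-chan []uint64 {
        var results []chan []uint64
        for lo := uint64(2); lo < max; lo += segSize {
            hi := lo + segSize
            if hi > max {
                hi = max
            }
            ch := make(chan []uint64, 1)
            results = append(results, ch)
            go func(lo, hi uint64, ch chan<- []uint64) {
                ch <- sieveSegment(lo, hi, basePrimes)
            }(lo, hi, ch)
        }
        out := make(chan []uint64)
        go func() {
            for _, ch := range results {
                out <- <-ch // blocks until that segment finishes
            }
            close(out)
        }()
        return out
    }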

Results

The very first run of this variant was faster than the most optimized version from my previous post. It runs at about 65% of the speed of a vanilla implementation, making it about 2.5x as efficient as the previous lazy implementations, with a lower memory footprint. As always, a better algorithm is worth more than any amount of low-level code tuning :). I should point out that this implementation also uses a bit array rather than simply an array of bools. This reduced the memory footprint somewhat, but did not appear to have any significant impact on CPU time in either direction, and so could reasonably be dropped to shorten the code.

With all primes needing to be marshaled back to the main thread, parallelism maxes out below 2x. If all we care about is an aggregate value computed from the primes (the sum, in this case), rather than the primes themselves in order, we can achieve better than 4x parallelism simply by adding more processes. This is also more efficient in general, and allows more than 300,000,000 primes to be processed in one second[2].

The net result is an implementation that can sieve 50M candidates in one second on one core or sum 300M on four; sum primes up to one trillion in an hour; or sum primes in a range of one billion (10^9) in the region of one quintillion (10^18) in under a minute. I'm happy with that...for now.

Footnotes

  1. Let me say right now that Go is a fantastic language to work with, being both easy to write and efficient to run. I fully intend to start writing some of my work projects in Go in the near future.
  2. As noted in the previous post, we use the generic "primes in one second" metric for making apples-to-oranges comparisons of algorithms and implementations. This is not intended to provide anything more than a rough comparison.

Sunday, August 19, 2012

The algorithms of memory

The human brain has the best storage system on the block in a lot of ways. It’s notably lossy and doesn’t have the easiest API to work with, but in terms of flexibility it’s second to none. Computer scientists have been trying to model and mimic its features for a lot of years, but we haven’t got it cracked quite yet. Part of the challenge lies in the sheer variety of access styles that human memory allows. I don’t think we even have them all cataloged yet, much less working together in one system.

I’ve been trying over the last couple days to list all the different patterns I can see myself using. I’ve also tried to pull out systems I know of that do the same for comparison, although I can’t find direct equivalents in all cases. Those without an existing equivalent are probably the most interesting - would mimicking these patterns be useful in the future?


  • Cache hit
    Mind example: Facts immediately available for use with no delay
    System example: In-memory data storage
    Characterized by: Synchronous; low latency
  • Long term storage
    Mind example: Facts that take a while to look up. “Um...his name... was.... Let's move on - it'll come to me in a minute.”
    System example: Lookups from disk or tape
    Characterized by: Asynchronous (notify on delivery); high latency
  • Reminders
    Mind example: Remembering to go buy groceries on your way home from work
    System example: Calendar notifications
    Characterized by: Time or event based; defined in advance
  • Information requests
    Mind example: All the associations that come up when you think of a topic. “New Orleans” brings to mind...
    System example: Web search
    Characterized by: Web of relationships; can be explored further in any direction
  • Background processing
    Mind example: Coming up with answers/solutions while you sleep or otherwise aren’t explicitly working on the problem
    System example: Uploading a video to YouTube - when you check again, all the different formats and qualities will be available
    Characterized by: Processing items while in storage; queued work requests; separated from foreground tasks
  • Full scan
    Mind example: Life flashing before your eyes (wiki)
    System example: Processing ordered events during crash recovery
    Characterized by: Ordered; sequential scan of all items
  • Guided randomness
    Mind example: Brainstorming, free association games, mad libs, improv
    System example: ?
    Characterized by: Random item output or random exploration of a web; subject to limited constraints
  • Unsolicited triggered reminders
    Mind example: Being reminded of facts/stories/events by things you see or hear
    System example: ? [1]
    Characterized by: Unsolicited notifications; loosely related to recent input
  • Unsolicited untriggered reminders
    Mind example: Memories that come to mind with no discernible trigger, e.g. past regrets
    System example: ?
    Characterized by: Unsolicited notifications; no relation to recent input; may be randomly derived
  • State affecting
    Mind example: Memories that change your mood or attitude, e.g. remembering a birthday makes you happy; remembering a specific near miss makes you cautious
    System example: ? [2]
    Characterized by: State changes triggered by the contents of the information retrieved
  • Expectations (suggested by e-socrates on Reddit)
    Mind example: “We constantly predict from memory what to expect, then compare that to experience”
    System example: ?
    Characterized by: Continuous short-term predictions; characterizing events as unexpected or surprising

Footnotes

  • [1] Google Now is trying for this to some extent. Online advertising partially fits; however, it is not bubbling up details you already know - rather, it’s feeding you new data.
  • [2] There are some trigger-based examples of this in security, e.g. logging of accesses to documents, but they don’t really change the state of the system itself (they trigger work in others instead).

Monday, January 30, 2012

On hosting static files

You may have noticed that I have a blog. Shocking, I know. It’s predominantly composed of text, as many are, but that’s not the only content I like to share. Posts are more approachable and memorable if they contain images, sorting visualizations require Javascript frameworks, etcetera. Unfortunately, all these support files need to be stored somewhere.

This blog is hosted on Blogger, a decision I’m very happy with. Blogger doesn’t do arbitrary file hosting (that I know of), but it does support pushing images to Picasa. So images are hosted by Picasa, another Google service, no sweat. Now, what about other files?

Hmm.

It turns out the answer to this question is a lot trickier than I would have hoped. I was hoping for another Google service, to keep my dependencies down, so I started with Sites.

Google Sites

“Google Sites is a free and easy way to create and share webpages”, the tagline goes. And it is – we use it internally all the time. One quick form later, I have a site, and another couple clicks puts a single attachment on it. Great! Update the post to reference it, double-check in incognito mode, and my work here is done. Right?

Well…not quite. It turns out that Sites uses some interesting redirect magic to actually link you to your content. Redirect magic that, for reasons I don’t understand, doesn’t resolve when you’re using it to hotlink content - perhaps to prevent exactly that, I don’t know. Anyways, since I had visited this content while on Sites, my browser had it cached, and even incognito mode could access it, but it wouldn’t resolve anywhere else. Which is a good reason to test things on more than one computer, I suppose.

Ok, not sites. App Engine?

Google App Engine

App Engine does do static file hosting. Instructions are here – you must simply download the SDK, create the app.yaml file appropriately, upload your new App (the files must belong to something, after all), etc. This looks doable, but certainly non-trivial. I also cannot figure out how this is going to be priced, nor the latency to expect from it. So let’s keep looking.
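
For the record, the static hosting part of app.yaml is short. Something like this should do it, if memory serves (a hypothetical app ID, untested here):

    application: static-example   # hypothetical app ID
    version: 1
    runtime: python27
    api_version: 1
    threadsafe: true

    handlers:
    - url: /static
      static_dir: static          # serve everything under static/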

I’m running out of Google services (I considered and rejected a few more). Time to look further afield. I don’t know of any good, trustworthy solutions offhand, and a quick search isn’t taking me anywhere I really want to go, so let’s look at Amazon AWS. They have a service for every occasion, right?

Amazon AWS

As it turns out, yes, yes they do. A quick look through the options (there are so many!) says that Amazon Simple Storage Service (S3) will do nicely, optionally backed by CloudFront if I ever really need to scale. A quick signup, upload of my files, and one new CNAME, and I’ve got my files living under static.thelowlyprogrammer.com, backed by S3. Nice! Doubly so since the free usage tier should make this entirely free for the next year or so, and pennies after.

Finally, in researching this blog post, I found one more option. And it’s a Google service, which was my original goal. Let’s check it out!

Google Cloud Storage

New on the block, Google Cloud Storage appears to be a competitor to Amazon S3. Priced a tiny bit cheaper, it offers features that seem roughly comparable. The documentation is a bit rough still, which partly explains why I didn’t figure this out earlier, but everything is there if you look hard enough. One significant distinction is that you do not choose where data is stored (unless you want to add restrictions), and it will get replicated wherever needed. Note that this includes automatic edge caching for popular files, so this is pretty much S3 and CloudFront all rolled into one. Fancy! It supports the same CNAME aliases, so I’ve got this set up as a hot-spare for my S3 data. I’ll leave S3 as primary for now since I’ve got it all configured and tested happily, but it looks like I’d be well served either way.

Maybe in a future post I’ll do a head-to-head comparison of S3 and GCS, if I can figure out a fair way of measuring performance all over the globe. Until then, I’m happy to stick with either.

Mission accomplished – static files hosted. Time for bed.

Monday, January 16, 2012

Marriage Sort, a visualization

You'll need a modern browser that handles SVG (a recent Chrome, Firefox, or IE9 will do) to properly see this post.

On Saturday, Mike Bostock posted a link on Hacker News to an explanation of the Fisher-Yates Shuffle he had written. Alongside the explanation he included a very slick visualization where you could see it in action. Now, at a fundamental level, shuffle algorithms and sorting algorithms are simply the reverse of each other. So if this visualization does well on shuffle algorithms, why not a sorting algorithm as well?

A couple years ago I wrote a sorting algorithm, which I gave the tongue-in-cheek name Marriage Sort (an homage to the problem that inspired it - see the introduction post for more details). I was looking for a project to play with on the weekend, so I thought, why not try the visualization on that? Mike has kindly let me re-use his code, so here's the same visualization of Marriage Sort.


Disclaimer: The visualization I'm building off looks great across platforms, so if there is anything wrong with this one, it's from my changes. Go see the original if you don't believe me.

For those that don't remember the algorithm, here's a synopsis. There are two stages:
  • Stage 1:
    1. Take a prefix of size √m, where m is the size of the working set.
    2. Pick the maximum (in this case, most-right-leaning) of the prefix, and take it as the pivot.
    3. Walk through the list, pulling out anything greater than the pivot and moving it to the end of the working set.
    4. Move the pivot to the end of the working set.
    5. Go back to step 1 and repeat.
  • Stage 2: a bog-standard insertion sort.
As the passes proceed, the pivots decrease in value and more values are moved into place, putting the array into a "mostly sorted" state. On average, every element is within √n spots of its correct position, which the insertion sort then corrects. Overall complexity: O(n^1.5) time, O(1) extra space. Code for the visualization is at https://gist.github.com/1628602 for easy perusing.
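
For those who'd rather read code than watch it, here's a compact Go sketch of the algorithm as described above (my own translation for this post, not the code behind the visualization):

    import "math"

    // marriageSort sorts a in place: repeated √m-prefix passes put every
    // element within about √n of home, then insertion sort finishes up.
    func marriageSort(a []int) {
        end := len(a) // the working set is a[:end]
        for end > 1 {
            skip := int(math.Sqrt(float64(end)))
            // Steps 1-2: the maximum of the prefix becomes the pivot.
            pivot := 0
            for i := 1; i < skip; i++ {
                if a[i] > a[pivot] {
                    pivot = i
                }
            }
            // Step 3: move anything greater than the pivot to the end.
            for i := skip; i < end; {
                if a[i] > a[pivot] {
                    end--
                    a[i], a[end] = a[end], a[i] // recheck the swapped-in value
                } else {
                    i++
                }
            }
            // Step 4: the pivot joins them; repeat on what remains.
            end--
            a[pivot], a[end] = a[end], a[pivot]
        }
        // Stage 2: a bog-standard insertion sort.
        for i := 1; i < len(a); i++ {
            for j := i; j > 0 && a[j-1] > a[j]; j-- {
                a[j-1], a[j] = a[j], a[j-1]
            }
        }
    }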

Looks good to me! What do you think?