The Lowly Programmer

Monday, April 26, 2021

The Rise of Datacenter Computing

The coming decades will see a significant change in the usage and deployment of “compute” (herein CPU cycles, storage, etc), specifically relating to the rise of datacenter computing. I don’t know the exact form this change will take - though I will point to the first phases of it below - but the fact that significant paradigm shifts are coming seems inevitable.

Note that I’m not talking about IoT or anything in that space, though it’s peripherally related. Here I’m specifically talking about the rise of datacenters as the majority share of compute power, and how the economy of computing and our individual access will be reshaped by it.

Part 1: The Data Center of Gravity

Two trends here seem clear: first, the balance of compute power is going to continue shifting towards datacenters and away from personal devices (even as those devices proliferate); and second, economics will dictate a shift in how this capacity is delivered to meet the ever-growing demand.

The balance of compute power will shift inexorably towards Centralized: as the fraction of GDP put towards compute rises, the portion of it situated in the home and on the person will fall, as will non-datacenter corporate compute.

Centralized is efficient: better power utilization (PUE), better resource utilization, more cost-efficient devices (e.g. larger processors, rack-oriented computer designs) all favour datacenter compute. But also because high density computing itself often allows for more efficient processing: lots of computers and memory and storage together on a high bandwidth fabric permits algorithms ill-suited to - or impossible on - highly-distributed small-scale devices.
Our willingness to have compute resources poorly utilized and at high risk will decline. Most compute capability of a cell phone or a personal computer goes almost entirely wasted already, which is tolerable for the convenience they bring at the relative cost. But would you want a $100k computer sitting idle at home, and a $10k cell phone in your pocket, if that’s what it took to keep up with compute demand growth? Almost certainly not. And relatedly, we’re seeing a rapid shift from low efficiency corporate computing - on-prem, colo, and private small datacenters - towards larger datacenters and especially hyperscalars with cost effectiveness as one of the driving motivators.
When we consider the electricity consumption of global compute rising from the present-day ~3% (including consumer devices, datacenters, and networks) to say 21% by 2030 (or perhaps a few years later once efficiencies are played out), we couldn’t carry all those watt-hours in our collective pocket even if we wanted to. At best it could shift to remote devices that live in the home, but efficiency still pushes strongly for it being situated near cheap electricity instead.

This is not to say that individuals won’t continue to grow their net compute usage. Rather, the trend of comparatively weak computing devices on the person and in the home (e.g. 2-4 core cell phones) will continue, augmented by a growing fraction of compute happening in responsibilities offloaded to centralized (datacenter) environments and their hundred-plus core machines.

To make this concrete: today the majority of compute capabilities (e.g. CPU cores) remain in the hands of consumers, in the form of cell phones, tablets, laptops, PCs, game consoles, and other such devices. But server-oriented compute devices are rapidly growing, with disks leading the charge. Note that server-oriented disks do not dominate by unit count yet, but already do by capacity as the server- and consumer-oriented devices diverge. GPUs are soon to tip towards server-dominant as well (and may have already when measured by capabilities, if we account for GPUs used crypto mining that appear in the consumer “Graphics” segment, capability segmentation consumer vs enterprise, and generally higher prices / revenue in the consumer market skewing revenue-based breakdowns). CPUs and ram are the most difficult to pinpoint, but anecdotally servers helping to hold up the ram market, and AMD’s datacenter-first strategy, suggest that we’re nearing the tipping point there too.

By usage aka utilization, datacenters are already likely dominant, and growing rapidly. The average consumer device has an extremely low overall utilization (single-digit percentage), when considering the time-on-device, core scaling, and frequency scaling utilized by devices to keep power consumption low. Whereas datacenters have entire teams dedicated to keeping utilization high and timing capacity delivery against demand, achieving overall utilizations reaching 50% and likely averaging 30-40% industry-wide. Combined with the relative share of compute capabilities, this suggests the balance of computing is already happening in datacenters, and incremental compute coming online suggests usage highly skewed towards datacenters.

On the economic side, the complexity of producing these datacenters will continue to shift as well. “Supply” will tend towards fixed over the coming decades: not deployed according to short-term needs, but rather, at a fixed pace determined over increasingly long time-scales. Consider the recurring semiconductor shortages (bottlenecked on fabrication capacity) combined with the rising price of chip fabs, the lead times on electric grid scaling, the lead time on adjusting mining output for rare earth elements, the cost of a modern datacenter building, and the general inertia of any industry as its share of global GDP rises.

This will see datacenter compute capabilities continue the shift into “base infrastructure” IaaS / PaaS plus higher levels layered above, with Public Clouds being only the first step on this road. One might also expect planning to trend towards more centralized as well: akin to the Oil industry, I could imagine the producers and consumers of bottleneck infrastructure resources - power, computer chips, flash, and disk drives today - coming together to commit to a particular curve years or decades in advance, and pre-committing demand in a way that makes the funding of mega fabs ($20B today; $50B in a couple more generations) possible.

Beyond that, I’m not sure. Do whole countries start to take a seat at the negotiating table? The world’s reliance on TSMC for top-end chips, and the collaboration with the US government to land a fab on US soil, suggests it’s decently likely. Does the cryptocurrency share get reigned in? I guess we’ll see.

Part 2: “Personal” computing

Let’s run a thought experiment: what does it look like for centralized “compute” to become available not only to corporations who pay for it, but also made directly and practically available to the citizens of the world. What will it be used for? How will that be decided?

“Practically” is important - it’s true that individuals can sign up for accounts at public Clouds today, and use the many services available, but the vast majority do not have the knowledge necessary to do so successfully. Whereas almost everyone knows how to download and run an app on a local device. As such, the predominant consumer compute patterns leveraged today are:

“Local general-purpose”: apps and programs running on the capabilities of a device you provide.
“Ad-supported”: websites, social networks, etc, where the load you personally contribute is small enough to be funded entirely by advertising or offered for free; requiring no payments either way.
"Remote fixed-cost” services: costly enough to host that they’re not offered for free, but where the load you personally contribute is still sufficiently bounded that a flat-rate subscription pricing model suffices.
Plus a comparably small amount of “remote compute services”: e.g. photo storage; paid for intentionally by the user, with variable costs or tiered plans.

The paradigm of yore was “local general-purpose” - you had a computer or device, and used it however you saw fit; the networks didn’t really exist to support anything else. The internet then ushered in an era of “ad-supported” services, e.g. search engines, maps, and social networks. More recently yet “Remote fixed-cost” has been on the rise, with services like Spotify or Netflix leading the charge into the home from the media front with VPNs riding on their coattails, team-oriented services like Zoom gaining significant mindshare in the “free for home, paid for business” space, and most recently “cloud gaming” starting to take off as well. And finally “remote compute services” exist in limited form, particularly for storage applications, but not yet having taken off.

(Tangent: I looked up the Adobe Creative Cloud and other competitors to see if “paid cloud photo/video editing” is a thing, and it seems not yet? I’m a little surprised, as data-heavy applications that idle at 0 but could easily spike to hundreds of cores during active use seems inherently well suited to this model.)

You’ll note that I did not include “remote general-purpose” in the list above, as I do not see this as a prevalent paradigm today, with a bit of niche usage of “virtual desktop in the Cloud” style services and little else. But it’s likely coming next - I expect the consumer portion of computing to shift in decent part towards remote variable-cost compute as the local slice of compute consumption becomes relatively smaller, through some mix of proliferating “remote compute services” offering pay-as-you-go or subscription access to myriad services for storage, photo/video editing, data analysis, model generation, data generation, etc; as well as new “remote general-purpose” paradigms that have yet to be developed, such as “apps” executing on dynamic remote resources.

While this could be presented as “desktop in the Cloud”, I doubt it will be - the desktop paradigm is suited to a small, always on, fixed-sized, pre-purchased unit of compute (one computer), and less clearly suited to expressing irregular bursts of short-lived high-consumption activities. Solving for burst usage with cost predictability will likely require developing paradigms more inherently suited to that task, rather than simply transplanting a (relatively) dying paradigm into “the Cloud”. It’s much more likely to me that subscription services will take the broadest share, and decently likely that a cost-predictable paradigm for “remote apps” will be found to fill the remaining gap.

As a final thought to leave you with, the donation of spare cycles towards causes of interest may be interesting in a future world also. Today imagine collectivist projects like Folding@Home, or seeding torrents to challenge content “ownership”. Will these still exist, and in what form?

Thursday, November 1, 2012

Hardware trends for datacenter computing

Moore’s Law (and family) are a significant boon to anyone writing distributed systems. As the systems grow over time, so too does the underlying hardware get more powerful, allowing exponential growth to be readily managed. Programs need to be modified to cope, of course, but the raw compute power is there and available. That said, designing algorithms that can scale with more than an order of magnitude of growth remains a difficult challenge, and regular rewrites are inevitable. In the interest of minimizing that number, here are a few trends I’ve observed that should continue into the next few years. Caveat emptor, this is my personal perspective, but if it doesn't line up with your experience I'd love to hear about it.

1. Compute Units are Static

Moore’s Law tells us that the number of transistors on a die double every 2 years, and practical experience tells us that performance doubles every 18 months. In the multicore era, however, the performance of a single core has largely plateaued, with further advances coming in the form of more cores per chip.
RAM and on-die caches roughly keep pace with CPUs, which means that now that we’re into core count growth, the RAM and cache per core is now approximately fixed.
Disk storage is also keeping the same pace, and hence disk per core is now also approximately fixed.

Thus, we can define a theoretical “compute unit” as, say, 1 core, 2GB RAM, 20GB flash*, 200 GB disk (adjust numbers as you see fit); this unit should stay roughly stable going forward, with the number of such units increasing exponentially.

Note that the purchasable range of resources-per-core is generally increasing; this represents the most economical average rather than a strict upper or lower bound. Resources do not need to be apportioned to each process in these ratios, either; e.g. disk-heavy and CPU-heavy processes can coexist on a machine and balance each other out. But it's worth looking closely at anything expected to grow significantly on one axis and not in others, to see if it's sustainable long term.

* Flash adoption remains spotty, but early indications put it roughly halfway between RAM and disk on performance and price, and it seems to be keeping up the growth pace. I speculate it will become a standard inclusion going forward, but this isn’t as clear as the others.

Update: One of my colleagues has pointed out that disk performance is not keeping pace with capacity increases (neither seek nor transfer rates). I'm not sure how to include this detail in the compute unit definition itself, but it will likely prove to be significant and help drive the adoption of flash going forward.

2. The Datacenter is the New Machine

This is already largely true, but the trend will only get stronger. This is driven by two main trends:

NUMA architectures are reducing sharing and performance uniformity within machines, and
Intra-datacenter connectivity is growing faster than individual machine performance, reducing the impact of communicating between machines and racks.

Thus, the trend of running programs as many smaller components that may or may not share any particular physical hardware will continue to get stronger. Correspondingly, virtualization will be increasingly necessary for isolation, further abstracting processes from the hardware they run on.

We may also start to see this trend spread to the client-side as well, but I’m not sure what form it would take. In-home compute sharing?

3. Decomposable Programs are Ideal

As a corollary to the previous point, algorithms that can be decomposed into fixed-sized chunks are best. They can be apportioned among CPUs/machines/racks easily, and scaling is decoupled from machine growth (which occurs in generational bursts when hardware is refreshed). Which means that if sharding is necessary, dynamic sharding - a variable number of fixed-size chunks - should be preferred over having a fixed number of shards that grow separately.

MapReduce (or Hadoop), Pregel (or Giraph), and GFS (or HDFS) are all examples of decomposable programs, tackling data processing, graph computation, and storage, respectively.

4. Consumer Bandwidth is Lagging

Consumer network growth is lagging behind intra- and inter-datacenter networks, and hardware trends. Some of this is due to a shift towards mobile, where equivalent resources (primarily images and videos) are smaller, but it holds on wired internet as well. It is unclear to me how long this will last - gigabit customer internet is already being tested in a few places, for example - but for the time being it remains true. This means inter-datacenter communication is growing cheaper relative to communicating with consumers. While it remains true we should expect more work to be pushed remotely to the “cloud” and an increase in datacenter entities cross communicating, and a slow shift towards smaller and more expensive resources (heavier compression, more computationally expensive codecs, etc) for customers.

5. Latency Doesn’t Scale

A fact that has always remained true: the speed of light is (effectively) constant, and does not decrease in line with other improvements. As bandwidth increases latency will continue to dominate long-distance performance, and round-trip times will continue to constrain serial throughput. Together, these will continue to push a couple of long-lived trends:

Reducing data proximity. Caching (local and edge) and having data prefetched are key in reducing latency, and get comparatively cheaper over time. One could imagine future efforts that will proactively push resources straight into user networks or devices as well.
Supporting long-lived connections and message parallelism. Connection establishment requires at least one round trip, and any form of serial channel use cannot fully utilize one. Together these will lead to more connection sharing where possible, and more long-lived idle connections otherwise.

Sunday, August 26, 2012

Primes part 2: A segmented sieve

TL;DR: The sum of all primes <= 1,000,000,000,000 is 18,435,588,552,550,705,911,377

This post is a followup to Writing an efficient Sieve of Eratosthenes

A while back I wrote a post detailing a memory-efficient Sieve of Eratosthenes. The algorithm I used took advantage of lazy evaluation and a sliding array to reduce the RAM requirement to a fraction of what a 'vanilla' implementation would require, at the cost of a non-trivial increase in CPU time. The resulting code ran at approximately 25% the throughput of the vanilla implementation, and maxed out at 2 billion candidates.

While researching that post, I noted that the most efficient generators at present use either the Sieve of Atkin or a 'segmented sieve'. As an excuse to play with Go^[1], a couple weeks ago I decided to implement a segmented Sieve of Eratosthenes. This post details my results.

Implementation

Language: Go
Time complexity: O(n∙log(log(n)))
Space complexity: O(√n)
Candidates in 1 sec: ~50,000,000

Gist (expand/contract)

The algorithm proceeds as follows:

Calculate primes up to √max via a vanilla array sieve
Slice up segments of about √max candidates for checking
To check a range,
1. For each prime p from 1., find the first multiple within the range that's >= p²
2. Cross off every multiple from there to the end of the range
Merge the results from the processed segments

You'll note that other than splitting the full candiate set into segments, this is the standard Sieve of Eratosthenes. Hence, it's the segmented Sieve of Eratosthenes.

In my Go version this is implemented by starting segments as individual goroutines that output to their own channels. A single worker goroutine is responsible for marshaling the results from these channels to a single channel read by the main thread. This architecture was chosen simply because it fits well with the Go way of doing things, but it also has the side-effect of providing some amount of free parallelism.

Results

The very first run of this variant was faster than the most optimized version from my previous post. It runs at about 65% the speed of a vanilla implementation, making it about 2.5x as efficient as the previous lazy implementations, with a lower memory footprint. As always, a better algorithm is worth more than any amount of low level code tuning :). I should point out that in the current implementation I also implemented a bit array rather than simply using an array of bools. This reduced the memory footprint somewhat, but did not appear to have any significant impact in either direction on CPU time required, and so could reasonably be dropped to shorten the code.

With all primes needing to be marshaled back to the main thread parallelism maxes out below 2x linear. If all we care about is an aggregate value computed from the primes (the sum in this case), rather than the primes themselves in order, we can achieve >4x parallelism simply by adding more processes. This is also more efficient in general, and allows >300,000,000 primes to be processed in one second^[2].

The net result is an implementation that can sieve 50M candidates in one second on one core or sum 300M on four; sum primes up to one trillion in an hour; or sum primes in a range of one billion (10^9) in the region of one quintillion (10^18) in under a minute. I'm happy with that...for now.

Footnotes

Let me say right now that Go is a fantastic language to work with, being both easy to write and efficient to run. I fully intend to start writing some of my work projects in Go in the near future.
As noted in the previous post, we use the generic "primes in one second" metric for making apples-to-oranges comparisons of algorithms and implementations. This is not intended to provide anything more than a rough comparison.

Sunday, August 19, 2012

The algorithms of memory

The human brain has the best storage system on the block in a lot of ways. It’s notably lossy and doesn’t have the easiest API to work with, but in terms of flexibility it’s second to none. Computer scientists have been trying to model and mimic its features for a lot of years, but we haven’t got it cracked quite yet. Part of the challenge lies in the sheer variety of access styles that human memory allows. I don’t think we even have them all cataloged yet, much less working together in one system.

I’ve been trying over the last couple days to list all the different patterns I can see myself using. I’ve also tried to pull out systems I know of that do the same for comparison, although I can’t find direct equivalents in all cases. Those without an existing equivalent are probably the most interesting - would mimicking these patterns be useful in the future?

Name	Mind example	System example	Characterized by
Cache hit	Facts immediately available for use with no delay	In-memory data storage	Synchronous; low latency
Long term storage	Facts that take a while to look up. “Um...his name... was.... Let's move on - it'll come to me in a minute.”	Lookups from disk or tape	Asynchronous (notify on delivery); high latency
Reminders	Remembering to go buy groceries on your way home from work	Calendar notifications	Time or event based; defined in advance
Information requests	All the associations that come up when you think of a topic. “New Orleans” brings to mind...	Web search	Web of relationships; can be explored further in any direction
Background processing	Coming up with answers/solutions while you sleep or otherwise aren’t explicitly working on the problem	Uploading a video to YouTube - when you check again all the different formats and qualities will be available	Processing item while in storage; queued work requests; separated from foreground tasks
Full Scan	Life flashing before your eyes (wiki)	Processing ordered events during crash recovery	Ordered; sequential scan of all items
Guided randomness	Brainstorming, free association games, mad libs, improv	?	Random item output or random exploration of web; subject to limited constraints
Unsolicited triggered reminders	Being reminded of facts/stories/events by things you see or hear	? [1]	Unsolicited notifications; loosely related to recent input
Unsolicited untriggered reminders	Memories that come to mind with no discernible trigger, e.g. past regrets	?	Unsolicited notifications; no relation to recent input; may be randomly derived
State affecting	Memories that change your mood or attitude. E.g. remembering a birthday makes you happy; remembering specific near miss makes you cautious.	? [2]	State changes triggered by the contents of the information retrieved
Expectations (suggested by e-socrates on Reddit)	"We constantly predict from memory what to expect, then compare that to experience"	?	Continuous short-term predictions; characterizing events as unexpected or surprising

Footnotes

[1] Google Now is trying for this to some extent. Online advertising partially fits, however it is not bubbling up details you already know - rather, it’s feeding you new data.
[2] There are some trigger-based examples of this in security, e.g. logging of accesses to documents, but they don’t really change the state of the system itself (they trigger work in others instead).

Monday, January 30, 2012

On hosting static files

You may have noticed that I have a blog. Shocking, I know. It’s predominantly composed of text, as many are, but that’s not the only content I like to share. Posts are more approachable and memorable if they contain images, sorting visualizations require Javascript frameworks, etcetera. Unfortunately, all these support files need to be stored somewhere.

This blog is hosted on Blogger, a decision I’m very happy with. Blogger doesn’t do arbitrary file hosting (that I know of), but it does support pushing images to Picasa. So images are hosted by Picasa, another Google service, no sweat. Now, what about other files?

Hmm.

It turns out the answer to this question is a lot trickier than I would have hoped. I was hoping for another Google service, to keep my dependencies down, so I started with Sites.

Google Sites

“Google Sites is a free and easy way to create and share webpages”, the tagline goes. And it is – we use it internally all the time. One quick form later, I have a site, and another couple clicks puts a single attachment on it. Great! Update the post to reference it, double-check in incognito mode, and my work here is done. Right?

Well…not quite. It turns out that Sites uses some interesting redirect magic to actually link you to your content. Redirect magic that, for reasons I don’t understand, doesn’t actually resolve when you’re using it to hotlink content. For exactly this reason I guess, I don’t know. Anyways, since I had visited this content while on Sites, my browser had it cached, and even incognito mode could access it, but it wouldn’t resolve anywhere else. Which is a good reason to test things on more than one computer, I suppose.

Ok, not sites. App Engine?

Google App Engine

App Engine does do static file hosting. Instructions are here – you must simply download the SDK, create the app.yaml file appropriately, upload your new App (the files must belong to something, after all), etc. This looks doable, but certainly non-trivial. I also cannot figure out how this is going to be priced, nor the latency to expect from it. So let’s keep looking.

I’m running out of Google services (I considered and rejected a few more). Time to look further afield. I don’t know of any good, trustworthy solutions offhand, and a quick search isn’t taking me anywhere I really want to go, so lets look at Amazon AWS. They have a service for every occasion, right?

Amazon AWS

As it turns out, yes, yes they do. A quick look through the options (there are so many!) says that Amazon Simple Storage Service (S3) will do nicely, optionally backed by CloudFront if I ever really need to scale. A quick signup, upload of my files, and one new CNAME, and I’ve got my files living under static.thelowlyprogrammer.com, backed by S3. Nice! Doubly so since the free usage tier should make this entirely free for the next year or so, and pennies after.

Finally, in researching this blog post, I found one more option. And it’s a Google service, which was my original goal. Let’s check it out!

Google Cloud Storage

New on the block, Google Cloud Storage appears to be a competitor to Amazon S3. Priced a tiny bit cheaper, the features seem roughly comparable. The documentation is a bit rough still, which partly explains why I didn’t figure this out earlier, but everything is there if you look hard enough. One significant distinction is that you do not choose where data is stored (unless you want to add restrictions), and it will get replicated wherever needed. Note that this includes automatic edge caching for popular files, so this is pretty much S3 and CloudFront all rolled into one. Fancy! It supports the same CNAME aliases, so I’ve got this set up as a hot-spare for my S3 data. I’ll leave S3 as primary for now since I’ve got it all configured and tested happily, but it looks like I’d be well served either way.

Maybe in a future post I’ll do a head-to-head comparison of S3 and GCS, if I can figure out a fair way of measuring performance all over the globe. Until then, I’m happy to stick with either.

Mission accomplished – static files hosted. Time for bed.

Monday, January 16, 2012

Marriage Sort, a visualization

You'll need a modern browser that handles SVG (a recent Chrome, Firefox, or IE9 will do) to properly see this post.

On Saturday, Mike Bostock posted a link on Hacker News to an explanation of the Fisher-Yates Shuffle he had written. Alongside the explanation he included a very slick visualization where you could see it in action. Now, at a fundamental level, shuffle algorithms and sorting algorithms are simply the reverse of each other. So if this visualization does well on shuffle algorithms, why not a sorting algorithm as well?

A couple years ago I wrote a sorting algorithm, which I gave the tongue-in-cheek name Marriage Sort (an homage to the problem that inspired it - see the introduction post for more details). I was looking for a project to play with on the weekend, so I thought, why not try the visualization on that? Mike has kindly let me re-use his code, so here's the same visualization of Marriage Sort.

Disclaimer: The visualization I'm building off looks great across platforms, so if there is anything wrong with this one, it's from my changes. Go see the original if you don't believe me.

For those that don't remember the algorithm, here's a synopsis. There are two stages:

Stage 1:
1. Take a prefix of size √m, where m is the size of the working set.
2. Pick the maximum (in this case, most-right-leaning) of the prefix, and take it as the pivot.
3. Walk through the list, pulling out anything greater than the pivot and moving it to the end of the working set.
4. Move the pivot to the end of the working set.
5. Go back to step 1 and repeat.
Stage 2: a bog-standard insertion sort.

As the passes proceed the pivots decrease in size and more values are matched, allowing the array to be put into a "mostly sorted" state. On average, every element is within √n spots of the correct position, which the insertion sort then corrects. Overall complexity O(n^1.5) time, O(1) extra space. Code for the visualization is at https://gist.github.com/1628602 for easy perusing.

Looks good to me! What do you think?

Tuesday, May 10, 2011

The Game of Life, part 2: HashLife

Last time I wrote I gave a brief introduction to the Game of Life and a very simple Python implementation for visualizing it. I will freely admit that was a teaser post; this post gets into the real meat of the topic with an overview of the HashLife algorithm and a much more interesting implementation.

Introduction

This entry has taken me an embarrassingly long time to post. As is my habit, I wrote the code and 90% of the post, and then left it for months and months. Whoops!

If you haven’t played with a Game of Life viewer before they are legitimately fun to toy around with - I encourage you to check this one out (code is here). Since the last version everything is much improved. The viewer supports a larger set of controls (see the README for details) and basic file reading is implemented so it’s possible to try new starting patterns on the fly. And, as promised, I’ve implemented the HashLife algorithm to massively speed up iterations, so enormous patterns billions of generations forward are easily within your reach.

Algorithm

HashLife is a simple yet interesting algorithm. Invented in 1984 by Bill Gosper (of Gosper glider gun fame), it exploits repeated patterns to dramatically cut down the work required to support large patterns over vast numbers of iterations. Between the Wikipedia page and the enigmatically named “An Algorithm for Compressing Space and Time” in Dr. Dobb’s Journal I think it’s decently well explained, but it took me a couple read-throughs to really wrap my head around so I’m going to try to give an overview of the key insights it utilizes.

At it’s heart, HashLife is built around the concept of a quadtree. If you’re unfamiliar with it, a quadtree takes a square region and breaks it into four quadrants, each a quarter the size of the original. Each quadrant is further broken down into quadrants of its own, and on down. At the bottom, in squares of some minimum size like 2x2, actual points are stored. This structure is usually used to make spatial queries like “what points intersect this bounding box” efficient, but in this case two other properties are taken advantage of. First, nodes at any level are uniquely defined by the points within their region, which means duplicated regions can be backed by the same node in memory. For the Game of Life, where there are repeated patterns and empty regions galore, this can drastically reduce the space required. Second, in the Game of Life a square of (n)x(n) points fully dictates the inner (n-2)x(n-2) core one generation forward, the inner (n/2)x(n/2) core n/4 generations forward, irrespective of what cells are adjacent to it. So the future core of a node can be calculated once and will apply at any future point in time, anywhere in the tree.

Together these properties allow for ridiculous speedups. Hashing and sharing nodes drastically reduces the space requirements, with exponentially more sharing the further down the tree you go. There are only 16 possible leaf nodes, after all! From this, calculating the future core for a node requires exponentially less time than a naïve implementation would. It can be done by recursively calculating the inner core of smaller nodes, where the better caching comes into play, and then combining them together into a new node. You might be wondering if the gains from caching are lost to the increasing difficulty of determining which nodes are equal, but with a couple careful invariants we actually get that for free. First, nodes must be immutable - this one’s pretty straightforward. Second, nodes must be unique at all times. This forces us to build the tree from the bottom up, but then checking if a new node duplicates an existing one is simply a matter of checking if there are any existing nodes that point to the same set of quadrants in the same order, a problem that hash tables trivially solve.

def __hash__(self):
# Hash is dependent on cells only, not e.g. _next.
# Required for Canonical(), so cannot be simply the id of the current
# object (which would otherwise work).
return hash((id(self._nw), id(self._ne), id(self._sw), id(self._se)))

def __eq__(self, other):
"""Are two nodes equal? Doesn't take caching _next into account."""
if id(self) == id(other):
return True
return (id(self._nw) == id(other._nw) and
id(self._ne) == id(other._ne) and
id(self._sw) == id(other._sw) and
id(self._se) == id(other._se))

Implementation

As before, the code I’ve written is for Python 2.6 and makes use of PyGame, although neither dependency is terribly sticky. The code lives in a repository on github, and I welcome any contributions you care to make. As the code here is complicated enough to be almost guaranteed a bug or two, there is a basic set of unit tests in life_test.py and the code itself is liberally sprinkled with asserts. Incidentally, removing the asserts nets a 20% performance gain (as measured by the time it takes to run the ‘PerformanceTest’ unit test), although I find the development time saved by having them is easily worth keeping them in forever. As noted later, the performance of the implementation isn’t all that important anyways. Which is a good thing, since I coded it in Python!

A comment on rewrites: during the transition from version 1 - a simple brute force algorithm - to version 2 - the Node class that implements HashLife - I had both algorithms implemented in parallel for a while. This let me have every second frame rendered by the old algorithm so I could ensure that at different times and different render speeds that the algorithms were coming up with the same results. I’ve seen this pattern used at work for migrating to replacement systems and it’s very much worth the extra glue code you have to write or the confidence it gives. John Carmack recently wrote about parallel implementations on his own blog, if you want to hear more on the topic.

Performance

The performance is hard to objectively detail for an algorithm like this. For example, it takes ~1 second to generate the billionth generation of the backrake 3 pattern, which has around 300,000,000 live cells; it takes ~2 seconds to generate the quintillionth generation with 3x10^17 live cells. But this is a perfect patterns to showcase HashLife - a simple spaceship traveling in a straight line, generating a steady stream of gliders. In comparison, a chaotic pattern like Acorn takes almost 25 seconds to generate just 5000 generations with at most 1057 alive at any time. As it stands the properties of the algorithm drastically outweigh the peculiarities of the implementation for anything I care to do. Although I must say, if you want to compare it to another implementation in an apples to apples comparison I’d love to hear the numbers you get.

As always, I’d love to hear what you think!

The Lowly Programmer

Monday, April 26, 2021

The Rise of Datacenter Computing

Part 1: The Data Center of Gravity

Part 2: “Personal” computing

Thursday, November 1, 2012

Hardware trends for datacenter computing

1. Compute Units are Static

2. The Datacenter is the New Machine

3. Decomposable Programs are Ideal

4. Consumer Bandwidth is Lagging

5. Latency Doesn’t Scale

Further Reading:

Sunday, August 26, 2012

Primes part 2: A segmented sieve

Implementation

Results

Footnotes

Sunday, August 19, 2012

The algorithms of memory

Footnotes

Monday, January 30, 2012

On hosting static files

Google Sites

Google App Engine

Amazon AWS

Google Cloud Storage

Monday, January 16, 2012

Marriage Sort, a visualization

Tuesday, May 10, 2011

The Game of Life, part 2: HashLife

Introduction

Algorithm

Implementation

Performance

About Me

Blog Archive