## Sunday, August 26, 2012

### Primes part 2: A segmented sieve

TL;DR: The sum of all primes <= 1,000,000,000,000 is 18,435,588,552,550,705,911,377

This post is a followup to Writing an efficient Sieve of Eratosthenes

A while back I wrote a post detailing a memory-efficient Sieve of Eratosthenes. The algorithm I used took advantage of lazy evaluation and a sliding array to reduce the RAM requirement to a fraction of what a 'vanilla' implementation would require, at the cost of a non-trivial increase in CPU time. The resulting code ran at approximately 25% the throughput of the vanilla implementation, and maxed out at 2 billion candidates.

While researching that post, I noted that the most efficient generators at present use either the Sieve of Atkin or a 'segmented sieve'. As an excuse to play with Go[1], a couple weeks ago I decided to implement a segmented Sieve of Eratosthenes. This post details my results.

## Implementation

Language: Go
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n)
Candidates in 1 sec: ~50,000,000

Gist (expand/contract)

The algorithm proceeds as follows:

1. Calculate primes up to √max via a vanilla array sieve
2. Slice up segments of about √max candidates for checking
3. To check a range,
1. For each prime p from 1., find the first multiple within the range that's >= p2
2. Cross off every multiple from there to the end of the range
4. Merge the results from the processed segments

You'll note that other than splitting the full candiate set into segments, this is the standard Sieve of Eratosthenes. Hence, it's the segmented Sieve of Eratosthenes.

In my Go version this is implemented by starting segments as individual goroutines that output to their own channels. A single worker goroutine is responsible for marshaling the results from these channels to a single channel read by the main thread. This architecture was chosen simply because it fits well with the Go way of doing things, but it also has the side-effect of providing some amount of free parallelism.

## Results

The very first run of this variant was faster than the most optimized version from my previous post. It runs at about 65% the speed of a vanilla implementation, making it about 2.5x as efficient as the previous lazy implementations, with a lower memory footprint. As always, a better algorithm is worth more than any amount of low level code tuning :). I should point out that in the current implementation I also implemented a bit array rather than simply using an array of bools. This reduced the memory footprint somewhat, but did not appear to have any significant impact in either direction on CPU time required, and so could reasonably be dropped to shorten the code.

With all primes needing to be marshaled back to the main thread parallelism maxes out below 2x linear. If all we care about is an aggregate value computed from the primes (the sum in this case), rather than the primes themselves in order, we can achieve >4x parallelism simply by adding more processes. This is also more efficient in general, and allows >300,000,000 primes to be processed in one second[2].

The net result is an implementation that can sieve 50M candidates in one second on one core or sum 300M on four; sum primes up to one trillion in an hour; or sum primes in a range of one billion (10^9) in the region of one quintillion (10^18) in under a minute. I'm happy with that...for now.

## Footnotes

1. I also wrote an Eratosthenesque prime sieve using Go and channels here a while ago: http://blog.onideas.ws/eratosthenes.go

It is slower than yours but is more channel-ish, i.e. lots of communicating over channels.

Interesting that your parallel segment filtering design will output primes but not in increasing order, e.g. hard to ask "which is the 10 millionth prime?", but good for summing them I guess.

1. Thanks for the link! The paper your approach is based on was the basis for my original post (linked above), albeit not in Go. Definitely an interesting read (and fun to port to non-Haskell).

> Interesting that your parallel segment filtering design will output primes but not in increasing order [...]

It actually does output them in order. Segments are computed in parallel, but writing results to their own buffer channels. The contents of the buffers are later copied to the main output channel in segment order - there is a goroutine specifically dedicated to doing so. This somewhat limits the parallelism achievable (to 1.85x on my machine) since it involves non-trivial serial work, but it also means that all primes are output sequentially.