**TL;DR: The sum of all primes <= 1,000,000,000,000 is 18,435,588,552,550,705,911,377**

*This post is a followup to Writing an efficient Sieve of Eratosthenes*

A while back I wrote a post detailing a memory-efficient Sieve of Eratosthenes. The algorithm I used took advantage of lazy evaluation and a sliding array to reduce the RAM requirement to a fraction of what a 'vanilla' implementation would require, at the cost of a non-trivial increase in CPU time. The resulting code ran at approximately 25% the throughput of the vanilla implementation, and maxed out at 2 billion candidates.

While researching that post, I noted that the most efficient generators at present use either the Sieve of Atkin or a 'segmented sieve'. As an excuse to play with Go^{[1]}, a couple weeks ago I decided to implement a segmented Sieve of Eratosthenes. This post details my results.

## Implementation

__Language:__ Go

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n)

__Candidates in 1 sec:__ ~50,000,000


The algorithm proceeds as follows:

- Calculate primes up to √max via a vanilla array sieve
- Slice up segments of about √max candidates for checking
- To check a range:
  - For each prime p from step 1, find the first multiple within the range that's >= p^2
  - Cross off every multiple from there to the end of the range
- Merge the results from the processed segments

You'll note that other than splitting the full candidate set into segments, this is the standard Sieve of Eratosthenes. Hence, it's the *segmented* Sieve of Eratosthenes.

In my Go version this is implemented by starting segments as individual goroutines that output to their own channels. A single worker goroutine is responsible for marshaling the results from these channels to a single channel read by the main thread. This architecture was chosen simply because it fits well with the Go way of doing things, but it also has the side-effect of providing some amount of free parallelism.

## Results

The very first run of this variant was faster than the most optimized version from my previous post. It runs at about 65% the speed of a vanilla implementation, making it about 2.5x as efficient as the previous lazy implementations, with a lower memory footprint. As always, a better algorithm is worth more than any amount of low level code tuning :). I should point out that the current version also uses a bit array rather than simply an array of bools. This reduced the memory footprint somewhat, but did not appear to have any significant impact in either direction on CPU time, and so could reasonably be dropped to shorten the code.

With all primes needing to be marshaled back to the main thread, parallelism maxes out below 2x linear. If all we care about is an aggregate value computed from the primes (the sum, in this case), rather than the primes themselves in order, we can achieve >4x parallelism simply by adding more processes. This is also more efficient in general, and allows >300,000,000 primes to be processed in one second^{[2]}.

The net result is an implementation that can sieve 50M candidates in one second on one core, or sum 300M on four; sum the primes up to one trillion in an hour; or sum the primes in a window of one billion (10^9) candidates up in the region of one quintillion (10^18) in under a minute. I'm happy with that...for now.

## Footnotes

1. Let me say right now that Go is a fantastic language to work with, being both easy to write and efficient to run. I fully intend to start writing some of my work projects in Go in the near future.
2. As noted in the previous post, we use the generic "primes in one second" metric for making apples-to-oranges comparisons of algorithms and implementations. This is not intended to provide anything more than a rough comparison.

## Comments

I also wrote an Eratosthenesque prime sieve using Go and channels a while ago: http://blog.onideas.ws/eratosthenes.go

It is slower than yours but is more channel-ish, i.e. lots of communicating over channels.

Interesting that your parallel segment filtering design will output primes but not in increasing order, e.g. hard to ask "which is the 10 millionth prime?", but good for summing them I guess.

Thanks for the link! The paper your approach is based on was the basis for my original post (linked above), albeit not in Go. Definitely an interesting read (and fun to port to non-Haskell).

> Interesting that your parallel segment filtering design will output primes but not in increasing order [...]

It actually does output them in order. Segments are computed in parallel, but each writes its results to its own buffer channel. The contents of the buffers are later copied to the main output channel in segment order - there is a goroutine specifically dedicated to doing so. This somewhat limits the parallelism achievable (to 1.85x on my machine) since it involves non-trivial serial work, but it also means that all primes are output sequentially.

Eric, a great first post introducing the principles of prime sieving using the Sieve of Eratosthenes, and a logical progression of improved performance with gradually improved algorithms in this follow-up post. But I think you stopped too soon in not considering further improvements to the array-based sieves, such as paging, wheel factorization, and multi-threading, not to mention eliminating the enumeration itself for some types of results such as the count, the nth, and the sum of all elements over a range; when the other optimizations are fully utilized, enumeration takes about ten times longer than the actual culling of composite numbers.

Following is a link to a C# program that produces the count of the primes among the first four billion candidates (actually 2^32) in about a second on about the same speed of machine as you were likely using for these tests (the i7-2700K 3.5 GHz machine I was using), and it even enumerates them about five times faster than your array-based version: http://stackoverflow.com/a/18885065/549617.

I'm not sure how to compute your "candidates per second" here, as this algorithm pre-culls some composite numbers using wheel factorization: a range of four billion in one second might mean four billion candidates per second, or it might mean 718 million candidates per second processed after pre-culling; alternatively it could mean about 515 million composite numbers eliminated per second, leaving about 200 million primes. All of these are much faster than any C# prime sieving code I have seen published, and only slower than extremely optimized C code.
