**TL;DR: The sum of all primes <= 1,000,000,000,000 is 18,435,588,552,550,705,911,377**

*This post is a followup to Writing an efficient Sieve of Eratosthenes*

A while back I wrote a post detailing a memory-efficient Sieve of Eratosthenes. The algorithm I used took advantage of lazy evaluation and a sliding array to reduce the RAM requirement to a fraction of what a 'vanilla' implementation would require, at the cost of a non-trivial increase in CPU time. The resulting code ran at approximately 25% the throughput of the vanilla implementation, and maxed out at 2 billion candidates.

While researching that post, I noted that the most efficient generators at present use either the Sieve of Atkin or a 'segmented sieve'. As an excuse to play with Go^{[1]}, a couple weeks ago I decided to implement a segmented Sieve of Eratosthenes. This post details my results.

## Implementation

__Language:__Go

Time complexity:

Time complexity:

**O(n∙log(log(n)))**

__Space complexity__

**:**O(√n)

__Candidates in 1 sec:__~50,000,000

Gist (expand/contract)

The algorithm proceeds as follows:

- Calculate primes up to √max via a vanilla array sieve
- Slice up segments of about √max candidates for checking
- To check a range,

- For each prime p from 1., find the first multiple within the range that's >= p
^{2} - Cross off every multiple from there to the end of the range

- For each prime p from 1., find the first multiple within the range that's >= p
- Merge the results from the processed segments

You'll note that other than splitting the full candiate set into segments, this is the standard Sieve of Eratosthenes. Hence, it's the *segmented* Sieve of Eratosthenes.

In my Go version this is implemented by starting segments as individual goroutines that output to their own channels. A single worker goroutine is responsible for marshaling the results from these channels to a single channel read by the main thread. This architecture was chosen simply because it fits well with the Go way of doing things, but it also has the side-effect of providing some amount of free parallelism.

## Results

The very first run of this variant was faster than the most optimized version from my previous post. It runs at about 65% the speed of a vanilla implementation, making it about 2.5x as efficient as the previous lazy implementations, with a lower memory footprint. As always, a better algorithm is worth more than any amount of low level code tuning :). I should point out that in the current implementation I also implemented a bit array rather than simply using an array of bools. This reduced the memory footprint somewhat, but did not appear to have any significant impact in either direction on CPU time required, and so could reasonably be dropped to shorten the code.

With all primes needing to be marshaled back to the main thread parallelism maxes out below 2x linear. If all we care about is an aggregate value computed from the primes (the sum in this case), rather than the primes themselves in order, we can achieve >4x parallelism simply by adding more processes. This is also more efficient in general, and allows >300,000,000 primes to be processed in one second^{[2]}.

The net result is an implementation that can sieve 50M candidates in one second on one core or sum 300M on four; sum primes up to one trillion in an hour; or sum primes in a range of one billion (10^9) in the region of one quintillion (10^18) in under a minute. I'm happy with that...for now.

## Footnotes

- Let me say right now that Go is a fantastic language to work with, being both easy to write and efficient to run. I fully intend to start writing some of my work projects in Go in the near future.
- As noted in the previous post, we use the generic "primes in one second" metric for making apples-to-oranges comparisons of algorithms and implementations. This is not intended to provide anything more than a rough comparison.