*See also the followup post containing a segmented sieve implementation in Go that beats all these.*

I recently came across a paper by Melissa E. O'Neill on implementing a lazily evaluated Sieve of Eratosthenes. It was a very interesting read, although as a non-Haskeller I found the code tricky to follow. And by happy happenstance, not two weeks later I ended up needing a prime generator of my own for one of the Project Euler problems. This, I thought, was the perfect opportunity to try implementing the algorithm from that paper! That thought ended up leading me on a many-day journey to see how efficient an implementation I could make. This post details the key changes my code went through in that time, culminating in a C# algorithm that can sieve around 20 million candidates per second (or a Python version that can do 1.4 million).

## Disclaimers

- During the course of this journey I learned about segmented sieves and the Sieve of Atkin, but I decided not to make use of either of these, instead sticking to the algorithm I chose from the start. Maybe next time.
- If you are simply looking for the fastest implementation possible, I point you towards http://cr.yp.to/primegen.html, which is written in highly optimized C and at least an order of magnitude faster than mine.

## Version 1: Straight re-implementation of algorithm from paper

__Language:__ Python

__Time complexity:__ O(n∙log(n)∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~300,000

This is simply a re-implementation in Python of the Haskell code from the paper above, although it loses some of the elegance in the translation. You'll note that I am measuring performance with the very unscientific 'candidates in 1 sec' - since I am trying to compare algorithms with different asymptotic complexities in different languages with order-of-magnitude performance differences, it seemed like the easiest way to get a feel for the general magnitudes in this apples-to-oranges comparison.

The key idea here is that for each prime we encounter, we make a generator for its multiples (starting at p²) that we can use to filter out later candidates. For efficiency, all multiples of 2 and 3 are skipped automatically as well, since that cuts the number of candidates to check by two thirds. These generators are all added to a min-heap, keyed by the next value each will generate. Then we keep checking the next candidate against the top of the heap. If it matches, we pop that generator, advance it, and push it back keyed by its next value. If not, the candidate must be prime, so we build a new generator for it and add that instead.
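As a concrete sketch, the heap-based scheme can be written in Python roughly like this (the names and the wheel bookkeeping are my own, not the code from the paper):

```python
import heapq
from itertools import count, islice

def multiples(p):
    # Yield p*p, then p times each later 2,3-wheel value
    # (i.e. p*q for q = p, p+2 or p+4, ... skipping multiples of 2 and 3).
    k = p
    step = 2 if k % 6 == 5 else 4  # wheel steps alternate +2, +4
    while True:
        yield p * k
        k += step
        step = 6 - step

def primes_heap():
    yield 2
    yield 3
    heap = []      # entries: (next multiple, tiebreaker, generator)
    tie = count()  # tiebreaker so equal keys never compare generators
    candidate, step = 5, 2
    while True:
        if heap and heap[0][0] == candidate:
            # Composite: advance every generator sitting at this value.
            while heap[0][0] == candidate:
                _, _, gen = heapq.heappop(heap)
                heapq.heappush(heap, (next(gen), next(tie), gen))
        else:
            # Prime: add a generator of its multiples, keyed by p*p.
            yield candidate
            gen = multiples(candidate)
            heapq.heappush(heap, (next(gen), next(tie), gen))
        candidate += step
        step = 6 - step  # candidates also follow the wheel: +2, +4, +2, ...
```

For example, `list(islice(primes_heap(), 10))` yields the first ten primes.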

So, let's try an example of this. The first candidate we try is 5 (remembering that all multiples of 2 and 3 are already gone), and our heap is empty. This becomes the first item on the heap, and then 7 is checked:

```
 5? []
    Nope, prime
 7? [25:25,35,55,...]
    Nope, prime
11? [25:25,35,55,...  49:49,77,91,...]
    Nope, prime
13? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...]
    Nope, prime
17? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...]
    Nope, prime
19? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Nope, prime
23? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Nope, prime
25? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Aha! Matches the top of our heap... not prime
29? [35:35,55,65,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Nope, prime
...
```

Make sense? This is the algorithm straight from the paper, so if my explanation is lacking you can always try there. When I write it down this way, however, an optimization jumps out at me - why are we using a heap at all? At any given time we query for a single value and don't actually care about the ordering of the other items. Instead of updating a heap structure all the time, we can simply use a hash table. Strictly speaking, a multi-way hash table, since several generators can be waiting at the same composite value at once.

## Version 2: Using a hash table

__Language:__ Python

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~1,400,000

Not bad: under 50 lines of code, and with good complexity. This is as far as I know how to go in Python, and as a dynamic language it isn't really suited to this kind of number crunching anyway, so for the next version I'm switching to C#. There I know the time required for most operations, and, more importantly, I have a good profiler I know how to use.

## Version 3: Re-implement in C#

__Language:__ C#

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~10,000,000

Wow, a significant speedup! And with a modest 25-line code hit (plus a main method for convenience), that is certainly reasonable. A little profiling tells me the bottleneck is the list operations, which isn't much of a surprise. Of course, C#'s native list implementation isn't necessarily the fastest for our exact situation - as is often the case when optimizing, the next step is to write our own data structure. Here I use a dead simple one: essentially just an array with an Add operation that dynamically resizes it.
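The actual helper was C#; in Python terms the idea is roughly this (a sketch under my own naming, not the post's code):

```python
class GrowArray:
    # A bare backing array plus a count: add doubles the storage when
    # full, and clear just resets the count without freeing anything.
    def __init__(self, capacity=4):
        self.items = [None] * capacity
        self.count = 0

    def add(self, value):
        if self.count == len(self.items):
            self.items.extend([None] * len(self.items))  # double capacity
        self.items[self.count] = value
        self.count += 1

    def clear(self):
        self.count = 0  # O(1); old slots are simply overwritten later

    def values(self):
        return self.items[:self.count]
```

The geometric doubling keeps `add` amortized O(1), same as the native list, so the win over the built-in here is small; the cheap `clear` is what matters next.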

## Version 4: Use a custom structure for the 'multi' in multi dictionary

__Language:__ C#

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~12,000,000

Not bad, a modest speedup in this version. But I must admit, I cheated a little: the main benefit of this new data structure is not the speedup gained at this step. No, the real win is an essentially free Clear operation. Because as useful as a hash table is, we can rewrite it too and squeeze out even more speed. The idea is that at any given time our iterators' 'next' values all fall within roughly √n of each other, so a circular array will improve cache locality, remove the cost of hashing, and let us reuse the 'multi' objects. A circular array here is essentially a giant array of which only a small sliding window is actually backed by memory and allowed to hold values at any given time.
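A sketch of the circular-array idea in Python (names and sizing are my assumptions, not the original C#): the slot for value v lives at index v mod size, which is only valid as long as every stored key stays within size of the current candidate.

```python
class CircularTable:
    # Maps values in a sliding window [candidate, candidate + size)
    # to lists of generators; the slot for value v is v % size.
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size

    def add(self, value, gen):
        # Caller must guarantee value < current candidate + size.
        i = value % self.size
        if self.slots[i] is None:
            self.slots[i] = []
        self.slots[i].append(gen)

    def take(self, value):
        # Pop and return everything waiting at `value` (possibly nothing).
        i = value % self.size
        gens, self.slots[i] = self.slots[i] or [], None
        return gens
```

Each generator's next multiple is at most about 4√n past the current candidate (the wheel advances a prime p ≤ √n by at most 4p), so a window of a few multiples of √n suffices - which is where the O(√n) space bound below comes from.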

## Version 5: Use a multi cyclic array instead of hash table

__Language:__ C#

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n)

__Candidates in 1 sec:__ ~20,000,000

OK, that is about as far as I'm willing to go. Beyond this point I expect to start getting into low-benefit, high-complexity optimizations, and I don't want to go down that road. Although by some counts I am already there - since version 3 there has been a 2x speedup, but at the cost of doubling the code size and ending up with algorithms that fall firmly within the realm of 'tricky'. If I actually had to maintain a prime sieve in a professional setting that wasn't absolutely time-critical, I think I would go with version 3 - the later code starts to impose too much of a mental tax.

And there you have it, a reasonably efficient Sieve of Eratosthenes, and for the most part without 'tricky' optimizations. Let me know what you think!

## Postscript

In the comments, Isaac was wondering how these compared to "vanilla array" implementations (essentially, pre-allocate the array and cross off), so I have added two array implementations to GitHub (Python, C#) for comparison. Both are about 4 times faster than the comparative best in their language, testing 5.5 million and 75 million candidates in one second (respectively). The Python version runs out of memory somewhere before 500 million candidates, and the C# version can't get beyond about 2 billion due to limitations of the BitArray class.
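For completeness, a vanilla-array sieve of the kind Isaac meant looks roughly like this in Python (a sketch, not the exact version on GitHub):

```python
def primes_array(limit):
    # Pre-allocate one flag per candidate, then cross off multiples.
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            # Zero every multiple of p from p*p upward in one slice write.
            flags[p * p :: p] = bytes(len(range(p * p, limit + 1, p)))
    return [i for i, f in enumerate(flags) if f]
```

The bytearray is what keeps memory at one byte per candidate; it is also why this version eventually runs out of memory, since the whole range must be allocated up front rather than the O(√n) working set of the incremental versions.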