## Tuesday, March 2, 2010

### Writing an efficient Sieve of Eratosthenes

See also the followup post containing a segmented sieve implementation in Go that beats all these.

I recently came across a paper by Melissa E. O'Neill on implementing a lazily evaluated Sieve of Eratosthenes. It was a very interesting read, although for a non-Haskeller understanding the code was certainly tricky. And by happy happenstance, not two weeks later I ended up needing a prime generator of my own for one of the Project Euler problems. This, I thought, was the perfect opportunity to try implementing the algorithm from that paper! That thought ended up leading me on a many-day journey to see how efficient an implementation I could make. This post details the key changes my code went through in that time, culminating in a C# algorithm that can sieve around 20 million candidates per second (or a Python version that can do 1.4 million).

## Disclaimers

1. During the course of this journey I learned about segmented sieves and the Sieve of Atkin, but I decided not to make use of either of these, instead sticking to the algorithm I chose from the start. Maybe next time.
2. If you are simply looking for the fastest implementation possible, I point you towards http://cr.yp.to/primegen.html, which is written in highly optimized C and at least an order of magnitude faster than mine.

## Version 1: Straight re-implementation of algorithm from paper

Language: Python
Time complexity:
O(n∙log(n)∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~300,000

Code (expand/contract)

This is simply a re-implementation in Python of the Haskell code from the paper above, although it loses some of the elegance in the translation.  You'll note that I am measuring performance with the very unscientific 'candidates in 1 sec' - since I am trying to compare algorithms with different asymptotic complexities in different languages with order-of-magnitude performance differences, it seemed like the easiest way to get a feel for the general magnitudes in this apples-to-oranges comparison.

The key idea here is that for each prime we encounter, we make a generator for multiples of it (starting at p2) that we can use to filter out our candidates. For efficiency, all multiples of 2 and 3 are skipped automatically as well, since that reduces the number of candidates to check by two thirds. These generators are all added to a min-heap, keyed by the next value each will generate. Then we can keep checking the next candidate and seeing if it matches the top of the heap. If it does, pop that, increment it, and push it back on keyed by the next value. If not, the candidate must be prime, so build a new generator and add that instead.

So, let's try an example of this. The first candidate we try is 5 (remembering that all multiples of 2 and 3 are already gone), and our heap is empty. This becomes the first item on the heap, and then 7 is checked:

5? []
Nope, prime
7? [25: 25,35,55,...]
Nope, prime
11? [25:25,35,55,... 49:49,77,91,...]
Nope, prime
13? [25:25,35,55,... 49:49,77,91,... 121:121,143,187...]
Nope, prime
17? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...]
Nope, prime
19? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Nope, prime
23? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Nope, prime
25? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Aha! Matches the top of our heap...not prime
29? [35:35,55,65,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Nope, prime
...

Make sense? This is the algorithm straight from the paper, so if my explanation is lacking you could always try there. When I write it down this way, however, an optimization springs out at me - why are we using a heap at all? At any given time we are querying for a single value, and don't actually care about the other items. Instead of updating a heap structure all the time, we can simply use a hash table instead. Strictly speaking a multi-way hash table, since some values will have multiple prime factors.

## Version 2: Using a hash table

Language: Python
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~1,400,000

Code (expand/contract)

Not bad, under 50 lines of code and with good complexity. This is as far as I know how to go with Python, and it is a dynamic language anyways which isn't really suited to this kind of number crunching, so for the next version I'm switching to C#. There I know the time required for most operations, and more importantly, I have a good profiler I know how to use.

## Version 3: Re-implement in C#

Language: C#
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~10,000,000

Code (expand/contract)

Wow, significant speedup there! And with a modest 25 line code hit (plus a main method for convenience), that is certainly reasonable. A little profiling tells me that the bottleneck is the list operations, which isn't too much of a surprise. Of course, C#'s native list implementation isn't necessarily going to be the fastest for our exact situation - as is often the case when optimizing, the next step is to write our own data structure. Here I use a dead simple data structure, essentially just an array with an Add operation that dynamically resizes it.

## Version 4: Use a custom structure for the 'multi' in multi dictionary

Language: C#
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~12,000,000

Code (expand/contract)

Not bad, a modest speedup in this version. But I must admit, I cheated a little. The main benefit of this new data structure is not the speedup gained at this step. No, the real speedup is allowing an essentially free Clear operation. Because as useful as a hash table is, we can re-write it too to get even more of a speedup. The idea here is that most of our current iterators will have a 'next' value that differ by at most √n or so, so a circular array will be better for cache locality, remove the cost of hashing, and let us re-use the 'multi' objects. Note that a circular array is essentially a giant array, except only a small sliding window is actually backed by memory and allowed to contain values at any given time.

## Version 5: Use a multi cyclic array instead of hash table

Language: C#
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n)
Candidates in 1 sec: ~20,000,000

Code (expand/contract)

Ok, that is about as far as I'm willing to go. Beyond this point I expect I'll start getting into low-benefit, high-complexity optimizations, and I don't want to go down that road. Although by some counts I am already there - from version 3 there has been a 2x speedup, but at the cost of doubling the code size and ending up with algorithms that fall firmly within the realm of 'tricky'. If I actually had to maintain a prime sieve in a professional setting that wasn't absolutely time critical, I think I would be going with version 3 - the later code starts to impose too much of a mental tax.

And there you have it, a reasonably efficient Sieve of Eratosthenes, and for the most part without 'tricky' optimizations. Let me know what you think!

## Postscript

In the comments, Isaac was wondering how these compared to "vanilla array" implementations (essentially, pre-allocate the array and cross-off), so I have added two array implementations to GitHub (Python, C#) for comparision. Both are about 4 times faster than the comparative best in their language, testing 5.5 million and 75 million candidates in one second (respectively). The Python version runs out of memory somewhere before 500 million candidates, and the C# version can't get beyond about 2 billion due to limitations of the BitArray class.

1. Hi Eric. If you're interested you can see a couple of these that I did (Erlang and Clojure) here: http://tinyurl.com/yl9pe7s

2. 1) "Candidates in 1 sec" makes me ask - On what hardware? - even though your point is to allow comparison between your 5 programs :-)

2) For comparison, what about Python and C# for a naïve (not lazy) Sieve of Eratosthenes algorithm? How do these lazy programs compare to using vanilla arrays?

3. Mike,
Thanks for the link! It is always interesting to see what different implementations people come up with, especially coming from the same paper as you and I did. Now I'd be interested in seeing a functional implementation using a hash table :).

Isaac,
1. The numbers were generated by very ad hoc testing on my Intel i7 920, which puts them on the higher end of the performance spectrum. But I really wouldn't trust them for more than ballpark comparisons :).

2. I'm not at home at the moment so I can't give you an exact number, but from what I remember the vanilla array implementations are quite a bit faster for low values of n (say up to a few million or so). Beyond that the linear memory usage starts really affecting the performance. And of course, the memory usage provides a solid upper bound to to the number of composites that can be tested.

4. Some other implementations, I got this one because of the Limbo multi threaded version.

http://www.cs.cmu.edu/~phoenix/nsc2/program/paper/1-2.pdf

5. > so I can't give you an exact number

I think it would give some more context if you could add those naïve programs and times.

6. Python+Psyco (http://psyco.sourceforge.net/) can give nice gains with little effort in some cases.

Of course if you want to make it really fast you have to give Pyrex (http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/) a go. :)

7. Isaac,
Ok, I tried a couple array versions, and I need to amend my answer to 2) slightly. The performance doesn't degrade as n is increased as I had thought, so the performance improvement over the lazy version is pretty constant. Take a look at the postscript I've added for details, but in both Python and C# an array based implementation seems to be about 4 times faster.

Maht,
Thanks! I had been idly wondering about parallelizing a prime sieve for a while now, but didn't really know how to go about it. This paper looks like it will be a great read for ideas :).

8. Juho,
I was looking at Pyrex earlier, but I decided that if I was going to go down that road I'd probably just code it in C myself. It seemed like a bit of a cheat to me to use :P. On the other hand, I hadn't seen Psyco, and after a quick glance it seems like something I'd really like to try. If it is as easy as they say, I should be reporting back results shortly :).

9. Unfortunately, Pysco is failing to load on both Python 2.6 and 3.1 for me and I don't really have the time to fight it, so I won't be giving it a try right now.

10. Great post, I really like to see how progressed step by step.

I see you mentioned The Sieve of Atkins. I failed, for obvious reasons, to scale that one over multiple cores in C# and blogged about it here http://alicebobandmallory.com/articles/2010/01/14/prime-factorization-in-parallel

I think I will have to take a look at segmented sieves, thanks for that!