*See also the followup post containing a segmented sieve implementation in Go that beats all these.*

I recently came across a paper by Melissa E. O'Neill on implementing a lazily evaluated Sieve of Eratosthenes. It was a very interesting read, although as a non-Haskeller I found the code tricky to follow. And by happy happenstance, not two weeks later I ended up needing a prime generator of my own for one of the Project Euler problems. This, I thought, was the perfect opportunity to try implementing the algorithm from that paper! That thought ended up leading me on a many-day journey to see how efficient an implementation I could make. This post details the key changes my code went through in that time, culminating in a C# algorithm that can sieve around 20 million candidates per second (or a Python version that can do 1.4 million).

## Disclaimers

- During the course of this journey I learned about segmented sieves and the Sieve of Atkin, but I decided not to make use of either of these, instead sticking to the algorithm I chose from the start. Maybe next time.
- If you are simply looking for the fastest implementation possible, I point you towards http://cr.yp.to/primegen.html, which is written in highly optimized C and at least an order of magnitude faster than mine.

## Version 1: Straight re-implementation of algorithm from paper

__Language:__ Python

__Time complexity:__ O(n∙log(n)∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~300,000

This is simply a re-implementation in Python of the Haskell code from the paper above, although it loses some of the elegance in the translation. You'll note that I am measuring performance with the very unscientific 'candidates in 1 sec' - since I am trying to compare algorithms with different asymptotic complexities in different languages with order-of-magnitude performance differences, it seemed like the easiest way to get a feel for the general magnitudes in this apples-to-oranges comparison.

The key idea here is that for each prime we encounter, we make a generator for its multiples (starting at p²) that we can use to filter out later candidates. For efficiency, all multiples of 2 and 3 are skipped automatically as well, since that cuts the number of candidates to check by two thirds. These generators are all added to a min-heap, keyed by the next value each will generate. Then we keep checking the next candidate against the top of the heap. If it matches, we pop that generator, advance it, and push it back keyed by its next value. If not, the candidate must be prime, so we build a new generator for it and add that instead.
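As a concrete sketch, the heap-based scheme can be written in Python roughly like this (the names and the wheel bookkeeping are my own, not the code from the paper):

```python
import heapq
from itertools import count, islice

def multiples(p):
    # Yield p*p, then p times each later 2,3-wheel value
    # (i.e. p*q for q = p, p+2 or p+4, ... skipping multiples of 2 and 3).
    k = p
    step = 2 if k % 6 == 5 else 4  # wheel steps alternate +2, +4
    while True:
        yield p * k
        k += step
        step = 6 - step

def primes_heap():
    yield 2
    yield 3
    heap = []      # entries: (next multiple, tiebreaker, generator)
    tie = count()  # tiebreaker so equal keys never compare generators
    candidate, step = 5, 2
    while True:
        if heap and heap[0][0] == candidate:
            # Composite: advance every generator sitting at this value.
            while heap[0][0] == candidate:
                _, _, gen = heapq.heappop(heap)
                heapq.heappush(heap, (next(gen), next(tie), gen))
        else:
            # Prime: add a generator of its multiples, keyed by p*p.
            yield candidate
            gen = multiples(candidate)
            heapq.heappush(heap, (next(gen), next(tie), gen))
        candidate += step
        step = 6 - step  # candidates also follow the wheel: +2, +4, +2, ...
```

For example, `list(islice(primes_heap(), 10))` yields the first ten primes.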

So, let's try an example of this. The first candidate we try is 5 (remembering that all multiples of 2 and 3 are already gone), and our heap is empty. This becomes the first item on the heap, and then 7 is checked:

```
 5? []
    Nope, prime
 7? [25:25,35,55,...]
    Nope, prime
11? [25:25,35,55,...  49:49,77,91,...]
    Nope, prime
13? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...]
    Nope, prime
17? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...]
    Nope, prime
19? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Nope, prime
23? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Nope, prime
25? [25:25,35,55,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Aha! Matches the top of our heap... not prime
29? [35:35,55,65,...  49:49,77,91,...  121:121,143,187,...  169:169,221,247,...  ...]
    Nope, prime
...
```

Make sense? This is the algorithm straight from the paper, so if my explanation is lacking you can always try there. When I write it down this way, however, an optimization jumps out at me - why are we using a heap at all? At any given time we query for a single value and don't actually care about the ordering of the other items. Instead of updating a heap structure all the time, we can simply use a hash table. Strictly speaking, a multi-way hash table, since several generators can be waiting at the same composite value at once.

## Version 2: Using a hash table

__Language:__ Python

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~1,400,000

Not bad: under 50 lines of code, and with good complexity. This is as far as I know how to go in Python, and as a dynamic language it isn't really suited to this kind of number crunching anyway, so for the next version I'm switching to C#. There I know the time required for most operations, and, more importantly, I have a good profiler I know how to use.

## Version 3: Re-implement in C#

__Language:__ C#

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~10,000,000

Wow, a significant speedup! And with a modest 25-line code hit (plus a main method for convenience), that is certainly reasonable. A little profiling tells me the bottleneck is the list operations, which isn't much of a surprise. Of course, C#'s native list implementation isn't necessarily the fastest for our exact situation - as is often the case when optimizing, the next step is to write our own data structure. Here I use a dead simple one: essentially just an array with an Add operation that dynamically resizes it.
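The actual helper was C#; in Python terms the idea is roughly this (a sketch under my own naming, not the post's code):

```python
class GrowArray:
    # A bare backing array plus a count: add doubles the storage when
    # full, and clear just resets the count without freeing anything.
    def __init__(self, capacity=4):
        self.items = [None] * capacity
        self.count = 0

    def add(self, value):
        if self.count == len(self.items):
            self.items.extend([None] * len(self.items))  # double capacity
        self.items[self.count] = value
        self.count += 1

    def clear(self):
        self.count = 0  # O(1); old slots are simply overwritten later

    def values(self):
        return self.items[:self.count]
```

The geometric doubling keeps `add` amortized O(1), same as the native list, so the win over the built-in here is small; the cheap `clear` is what matters next.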

## Version 4: Use a custom structure for the 'multi' in multi dictionary

__Language:__ C#

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n/log(n))

__Candidates in 1 sec:__ ~12,000,000

Not bad, a modest speedup in this version. But I must admit, I cheated a little: the main benefit of this new data structure is not the speedup gained at this step. No, the real win is an essentially free Clear operation. Because as useful as a hash table is, we can rewrite it too and squeeze out even more speed. The idea is that at any given time our iterators' 'next' values all fall within roughly √n of each other, so a circular array will improve cache locality, remove the cost of hashing, and let us reuse the 'multi' objects. A circular array here is essentially a giant array of which only a small sliding window is actually backed by memory and allowed to hold values at any given time.
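A sketch of the circular-array idea in Python (names and sizing are my assumptions, not the original C#): the slot for value v lives at index v mod size, which is only valid as long as every stored key stays within size of the current candidate.

```python
class CircularTable:
    # Maps values in a sliding window [candidate, candidate + size)
    # to lists of generators; the slot for value v is v % size.
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size

    def add(self, value, gen):
        # Caller must guarantee value < current candidate + size.
        i = value % self.size
        if self.slots[i] is None:
            self.slots[i] = []
        self.slots[i].append(gen)

    def take(self, value):
        # Pop and return everything waiting at `value` (possibly nothing).
        i = value % self.size
        gens, self.slots[i] = self.slots[i] or [], None
        return gens
```

Each generator's next multiple is at most about 4√n past the current candidate (the wheel advances a prime p ≤ √n by at most 4p), so a window of a few multiples of √n suffices - which is where the O(√n) space bound below comes from.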

## Version 5: Use a multi cyclic array instead of hash table

__Language:__ C#

__Time complexity:__ O(n∙log(log(n)))

__Space complexity:__ O(√n)

__Candidates in 1 sec:__ ~20,000,000

OK, that is about as far as I'm willing to go. Beyond this point I expect to start getting into low-benefit, high-complexity optimizations, and I don't want to go down that road. Although by some counts I am already there - since version 3 there has been a 2x speedup, but at the cost of doubling the code size and ending up with algorithms that fall firmly within the realm of 'tricky'. If I actually had to maintain a prime sieve in a professional setting that wasn't absolutely time-critical, I think I would go with version 3 - the later code starts to impose too much of a mental tax.

And there you have it, a reasonably efficient Sieve of Eratosthenes, and for the most part without 'tricky' optimizations. Let me know what you think!

## Postscript

In the comments, Isaac was wondering how these compared to "vanilla array" implementations (essentially, pre-allocate the array and cross off), so I have added two array implementations to GitHub (Python, C#) for comparison. Both are about 4 times faster than the comparative best in their language, testing 5.5 million and 75 million candidates in one second (respectively). The Python version runs out of memory somewhere before 500 million candidates, and the C# version can't get beyond about 2 billion due to limitations of the BitArray class.
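For completeness, a vanilla-array sieve of the kind Isaac meant looks roughly like this in Python (a sketch, not the exact version on GitHub):

```python
def primes_array(limit):
    # Pre-allocate one flag per candidate, then cross off multiples.
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            # Zero every multiple of p from p*p upward in one slice write.
            flags[p * p :: p] = bytes(len(range(p * p, limit + 1, p)))
    return [i for i, f in enumerate(flags) if f]
```

The bytearray is what keeps memory at one byte per candidate; it is also why this version eventually runs out of memory, since the whole range must be allocated up front rather than the O(√n) working set of the incremental versions.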