Tuesday, June 8, 2010

New Job!

Today I officially started at Google! Exciting stuff. I will be working on the DoubleClick Ad Exchange doing something-or-other for the time being — however as I am neither a senior engineer nor a corporate figurehead, I do not plan to write about my work. This should, however, explain my recent absence :).

That is all.

Wednesday, April 28, 2010

Robustly hot swapping binaries (with code)

Going down for swap
Coming up from swap!
This is generation 2 running 0.2 software (initial version 0.1)
Running state machine

  A while ago I remember reading an article by Nathan Wiegand on hot swapping binaries. This was a very eye-opening article for me – before reading it, hot swapping was one of those black arts I never really thought about, and certainly wouldn’t have thought was at all easy. I highly recommend you read it for yourself. Go ahead, I’ll wait.

There. Did you notice it? The elephant in the room? One thing the article doesn’t address is how to design programs to use this fancy new ability, without being fragile and crashing and all those bad things. I’ve been mulling this over since reading it, and have settled on a basic design that I’ll present here. But don’t worry, this isn’t just a thought design…I’ve actually coded it up and made sure it works as expected. Feel free to jump down and check it out before reading the design. Still, caveat emptor and all that.

Design Goals

For this design, I am focusing on three key goals:

  • Allow updating to any future version
  • Allow updating to any previous version
  • Make it easy to be crash-free

Simple enough? Updating forward is a pretty obvious goal, as is crash-free code. I want to allow updating backwards as well for the simple expedient that I don’t expect all new code to be bug free, and so it might be desirable to roll back when a bug is introduced.

Design

To achieve this, I’ve settled on a shortlist of constraints for the code:

  1. Use protocol buffers to store all state
  2. Provide suitable defaults for everything, and be robust against weird state
  3. Structure the code as a state machine

I did say the list was short. Let’s look at these in detail:

  1. Protocol buffers are an obvious choice for persisting state, as they are forward and backward compatible by design. Care must be taken to not re-use tag numbers and to never add ‘required’ fields, but this is an easy requirement to satisfy. Now, using protocol buffers to store everything does incur some overhead, but they are quite efficient by design and we really only need to store all state between state machine transitions so local variables are still quick.
  2. Hand in hand with 1), we cannot always expect to have the data we want in the format we want version to version. To accommodate this we must pick suitable defaults for fields, and if necessary be able to get new values at runtime. At the same time, if the meaning or format of a field is changing, it is probably better to use a new field. We will always try to handle weird data that may show up, but this shouldn’t be abused.
  3. Finally, structure the code as a state machine. This surfaces all state machine transitions as potential points for upgrading versions, and forces state to be in the protocol buffer when these transitions are crossed to ensure important data isn’t forgotten. And like everything else, the next state data can be stored in the protocol buffer.

There is one problem with 3), however. What happens when new states are added? Going forward is easy, but if we update to a previous version when we’re in one of these new states, it will have no idea where to start running again. We could try storing fallback states or something like that, but that seems too fragile. Instead, I would recommend not allowing updates to occur when transitioning to these new states. Then, a few versions down the line when you’re sure you won’t need to downgrade past where they were added, remove that restriction.

enum State {
    // Special states supported by the program infrastructure.
    STATE_NONE  = 0;
    STATE_DONE  = 1;
    STATE_ERROR = 2;
    // Program states. Unknown state transitions lead to ERROR and terminate the
    // program, so should be avoided at all costs.
    STATE_INIT          = 3;
    STATE_PROCESS_LINE  = 4;
    STATE_MUTATE_LINE   = 5;
  }
  optional State prev_state = 2 [default = STATE_NONE];
  optional State cur_state  = 3 [default = STATE_ERROR];

What About Threads?

You may have noticed that this design is inherently single-threaded. Threading can be added easily enough if the main thread owns all the work, and can easily and safely wait for or cancel all worker threads without losing anything. In that case, spin down the workers when you’re about to swap, and spin them up again when it completes. If your program doesn’t fit that description, however, this design may not be for you.

Testing?

Of course! I would recommend trying all transitions on a test instance first before upgrading the real process. You could also build in consistency checks that auto-revert if the state doesn’t meet expectations, regression tests for certain upgrade patterns, etc. This design is meant to make it easy to hot swap successfully, but it is no silver bullet.

Let's See the Code!

As always, the code is up on GitHub for you to peruse. It is broken into two demonstration applications, ‘v1’ and ‘v2’, that can be swapped between at will. While looping they respond to ‘u’ and ‘q’ (update and quit), although at times you may be prompted for other input. the makefiles build to the same target location, so build whichever one you want run next and press ‘u’ to swap to it.

The code is structured so you can use it as a framework to play with yourself easily enough. You should only need to write an init method, update the state machine and .proto file, and write the respective state methods to do the real work. The state machine and state methods will look something like this:

ReturnCode runStateMachine(ProgramState& state) {
    cerr << "Running state machine\n";
    // Put stdin into non-blocking, raw mode, so we can watch for character
    // input one keypress at a time.
    setStdinBlocking(false);
    while (true) {
        ProgramState::State next;
        switch (state.cur_state()) {
            case ProgramState::STATE_INIT:
                next = runState_init(state);
                break;
            case ProgramState::STATE_PROCESS_LINE:
                next = runState_process_line(state);
                break;
            case ProgramState::STATE_DONE:
                setStdinBlocking(true);
                return SUCCESS;
            case ProgramState::STATE_NONE:
            case ProgramState::STATE_ERROR:
            default:
                setStdinBlocking(true);
                return FAILURE;
        }

        ProgramState::State cur = state.cur_state();
        state.set_prev_state(cur);
        state.set_cur_state(next);

        // For now, simply let the user decide when to swap and quit. We can
        // always change this later.
        ReturnCode code = checkForUserSignal();
        if (code != CONTINUE) {
            setStdinBlocking(true);
            return code;
        }
    }
}

ProgramState::State runState_init(ProgramState& state) {
    cout << "Please provide a line of text for me to repeat ad-nauseum\n";
    string line;
    setStdinBlocking(true);
    getline(cin, line);
    setStdinBlocking(false);
    state.set_line_text(line);
    cout << "Thanks!\n";
    state.set_line_count(0);

    return ProgramState::STATE_PROCESS_LINE;
}

Easy, right? And here is an example transcript from going forward and then back between the versions in the repository (behind the scenes compiles not shown):

eric:~/code/hotswap/v1$ ../bin/hotswap.out
HotSwap example started - version 0.1
Initial call
Running state machine
Please provide a line of text for me to repeat ad-nauseum
All work and no play makes jack a dull boy
Thanks!
0: All work and no play makes jack a dull boy
1: All work and no play makes jack a dull boy
2: All work and no play makes jack a dull boy
u
Going down for swap
Coming up from swap!
This is generation 2 running 0.2 software (initial version 0.1)
Running state machine
3 mutations: All work and no play makes jack a dull boy
4 mutations: All work and no play maXes jack a dull boy
5 mutations: All workqand no play maXes jack a dull boy
6 mutations: All workqand nL play maXes jack a dull boy
u
Going down for swap
HotSwap example started - version 0.1
Coming up from swap!
Running state machine
7: All workqand nL play maXes jack a dull boy
8: All workqand nL play maXes jack a dull boy
9: All workqand nL play maXes jack a dull boy
q
Terminating with code 0


As you can see, version 0.2 mutates the line as it goes, while version 0.1 simply prints it forever. There are more differences than that, but you can find all that out from the code.

Enjoy! If you do end up playing with it, I’d love to hear about your experiences, or your thoughts on the design even if not.


This will probably be my last post for a while – on Saturday I leave the continent for 2 weeks and the city for 6. I will try to respond to emails and comments while I’m gone, but I may be a bit slower than usual.

Friday, April 23, 2010

Indexing and enumerating subsets of a given size

Subsets of a 9 element set containing up to 3 elements, in Banker's order

I received an email yesterday from a gentleman named Calvin Miracle, asking my opinion on subset enumeration strategy. He also provided a copy of this paper, which details an algorithm to generate a sequence of subsets in Banker’s order1, and asked about an algorithm to generate these subsets in a stateless manner. I’ll let him describe it:

Given a call to the method banker(n,k,i), where n is the size of a set, k is the subset size under consideration, and i is the i'th subset of size k, the method will return a boolean vector of size n, with only k TRUE values, that selects the i'th k-size subset from the overall set.

Now, prior to this I had never heard of Banker’s sequences nor really thought about enumerating subsets, but I’m always willing to be nerd sniped so I gave it a go. Presented here is the algorithm I designed for him.

Disclaimer

After writing this algorithm, I did some more digging and found out that some recent research has been done into enumerating subsets of a specific size, although in lexicographic order rather than Banker’s order. Wikipedia has details on this, and provides an algorithm very similar to the one I created here for except lexicographic ordering. So none of this should be seen as especially new or groundbreaking, although it was new to me and hopefully will be to you as well.

Algorithm

The basic idea of this algorithm is that given any choice of the first element for our subset, we can calculate how many possible ways to choose the remaining elements there are, and we know that every subset with this first element will come before every subset with a later first element according to our ordering scheme. So we can iterate through the possible choices of first element, totalling up how many subsets each represents, until we find the range that the desired index falls within. We can then recursively determine the rest of the subset in the same manner.

So, Let n be the number of items, k the size of the subsequence, and i the specific index we are looking for. We will enumerate sequences by considering all subsets that start with “1”, then all that start with “01”, then all that start with “001”, etc., where “0” represents skipping an item and “1” represents selecting it. There are n-1 choose k-1 of the first sort, n-2 choose k-1 of the second, n-3 choose k-1 of the third, etc. So:

“1”: 0 <= i < n-1Ck-1
“01”: n-1Ck-1 <= i < n-1Ck-1 + n-2Ck-1
“001”: n-1Ck-1 + n-2Ck-1 <= i < n-1Ck-1 + n-2Ck-1 + n-3Ck-1
     

 

Once we know what prefix i has, we can recursively determine the next sequence using n' = n-(prefix length), k' = k-1, i' = i-(bottom of range).

Example

Let n = 5, k = 3, i = 7:

“1” 0 <= i < 6                        (4C2)
“01”: 6 <= i < 9 = 6+3          (4C2 + 3C2)

So, initial prefix is “01”.

We recurse with n = 3, k = 2, i = 1:

“1” 0 <= i < 2                        (2C1)

So, the next piece is “1”.

Finally, we recurse with n=2, k=1, i=1:
Trivially we can see that for “10” = 0, “01” = 1, so the last piece is “01”.

Putting this together, the sequence is “01101”,  so items 2, 3, and 5 of the 5 compose the 7th subset2 of length 3.

Complexity

This algorithm is quite efficient – it needs to check at most n prefixes in total over all the levels of recursion, since each we consider shortens our candidate set by 1. There will be at most k recursive levels, and kn. So if we handwaive the “a choose b” calculations, we find that this can be done in O(n) time, which is O(1) in the size of the output, and thus optimal.

If we don’t handwaive the choice functions, we find that nCk is O(n!) in the worst case, so the result requires O(log(n!)) =O(nlogn) bits to store. If addition is O(1) we still may need to add up to n of these, so the total worst-case runtime is O(n2 logn), or O(nlogn) in the size of the output. However, with small values for k we don't get anywhere close to that worst case, and indeed are closer to O(logn) in size of the input. This is still quite efficient, and probably about the best that can be achieved in a function of this type. We should also consider the time needed to calculate the choice values, but if we are iterating over these indices the values can be pre-cached first and then the amortized cost per subsequence lookup goes to effectively zero.

Code

If you want to play with this algorithm yourself, I have placed some java code implementing it on GitHub. It is as efficient as expected, and should be plenty fast enough for anything you could conceivably need it for. For example, it takes essentially no time to determine that the 160,000,000,000,000,000,000,000,000,000th subset of size 12 from a 10,000 item set is {0, 1, 2, 69, 1212, 1381, 4878, 5291, 5974, 6139, 6639, 8979}. So have at it!

Update: Ligon Liu has kindly provided a C# port of this code, which I have added to the repository (it the c# directory).

Update 2: Josiah Carlson has kindly provided a python port of this code, also in the repository now.

Update 3: Richard Lyon has kindly provided ports to both JavaScript and PHP, on GitHub as well. Thanks guys!

Update 4: Corin Lawson has a C implementation in his GitHub repository, along with other Banker's order functions. Great!

Footnotes

  1. The easiest way to think of Banker’s ordering is to think of comparing the indices of the items that make up the subsets. So to order two subsets, take the list of indices of elements in the first subset, and compare it to that of the second subset in standard dictionary (lexicographic) order. So {2, 3, 4} comes before {2, 3, 5} but after {1, 3, 5}, for example.

    The image at the top of this post depicts the Banker’s ordering for subsets of  lengths 1, 2, and 3, respectively, from a set of size 9.
  2. Subsets are indexed from 0, so you may consider this the 8th subset.

Wednesday, April 14, 2010

Introducing: Marriage Sort


Update: I've created a live visualization of this algorithm, so you can see it in action - see Marriage Sort, a Visualization.

Two weeks ago, a link showed up on Hacker News on how to (mathematically) select the best wife. Tongue firmly in cheek, of course. The key takeaway from the article is that knowing only the relative rankings of items in a sequence (and assuming you can never go back to a previous one), you can maximize your chances of selecting the maximum one by letting N/e go by and then picking the next one that's better than all those. Further, to maximize your expected value you only need to let the first √N - 1 go by. After rattling around in the back of my mind for a couple weeks, yesterday this tidbit coalesced into an actual Idea. Could I turn this selection routine into a sorting algorithm?

And thus, Marriage Sort was born. I present it here purely out of interest, not because it is especially fast - if nothing else, because it is the only sorting algorithm I know of that has an average complexity of O(n1.5)1.

Algorithm

The basic idea of this algorithm is to repeatedly choose the maximum element from the first √N - 1 elements of our working set, and then scan to the end looking for elements bigger than it. When one is found, swap it to the end of the working set, and decrease the working set by one. After the pass is complete, swap out the maximum element from the first √N - 1 as well, and start again. When everything is finished (√N - 1 <= 0), use insertion sort to put the array into true sorted order.
One pass of the Marriage Sort. Two elements are found and swapped to the end, followed by the maximum element from the first √N - 1.


This works because, as we have noted from the article linked above, each element bigger than the first √N - 1 is expected to be close to the largest remaining element in the array. By repeatedly taking these elements and moving them to the end, we put the array into an approximately sorted order, where elements should be reasonably close to their 'true' positions. When this is done insertion sort will do the final rearranging, moving elements the short distances to their true positions.

In pseudocode:
def marriagesort(array):
    end = array.length
    while true:
       skip = sqrt(end) - 1
       if skip <= 0: break

       # Pick the best element in the first vN - 1:
       bestPos = 0, i = 1
       while i < skip:
           if array[i] > array[bestPos]: bestPos = i
           i += 1

       # Now pull out elements >= array[bestPos], and move to the end:
       i = skip
       while i < end:
           if array[i] >= array[bestPos]:
               array.swap(i, end-1)
               end -= 1
           else:
               i += 1

       # Finally, move our best pivot element to the end
       swap(bestPos, end-1)
       end -= 1

    # Finish off with insertion sort to put the elements into true sorted order
    insertionsort(array)

Here you can see this algorithm working on a randomized array. Many thanks to Aldo Cortesi for the wonderful sorting algorithm visualization framework!

Update: Here is another visualization made using Aldo Cortesi's tools, of a 512 element array this time. If you take a close look at the insertion sort phase (the slope on the right side), you can see that it is composed out of a lot of little triangles. Each of these triangles corresponds to one pass of the marriage selection routine, and shows how far elements can be from their true sorted position.

Complexity

Some quick back-of-the-napkin calculations indicate that this algorithm takes O(n1.5) compares in the average case, although the worst case looks to be O(n2). For the first stage (i.e. without the final insertion sort) each pass will take at most n compares and move approximately √N items to the end, requiring about √N passes. I don't know if this is a tight bound - the passes will actually speed up and require fewer compares as the working set shrinks, however I didn't want to try including that in the calculations. Similarly, after each pass all the items moved are guaranteed to be greater than all the items left, capping the distance moved items are from "home" to √N on average for the insertion sort. This holds the insertion sort to O(n1.5) compares on average as well.

Similarly, this algorithm takes O(n1.5) swaps in the average case (and O(n2) in the worst case). The number of swaps could certainly be decreased if a different final sort were used - the first pass only takes Θ(n) by itself - although in practice this isn't much use since the compares will still dominate the runtime.

Note that both of these graphs are log-log plots, so the important feature to look at is the slope of lines rather than the absolute position. For example, in the "Swaps" plot quicksort appears to be hovering around the n∙log(log(n)) line, however looking at the slope we can see that it is actually approximating n∙log(n), just with a better constant factor.

Note 2: The x axis of these graphs is the number of elements in the array, and the y axis is the number of comparisons/sorts required. I should have labeled these better.

Postscript

Some sample code implementing this algorithm is up on GitHub - this is the code to generate the log-log plots shown above.

As always, questions and comments are welcome!

Footnotes

  1. Most sorting algorithms are either Θ(n2) or Θ(n∙log(n)) in the average case. This one falls between those two extremes, making it faster (asymptotically) than algorithms like insertion sort or bubble sort but slower than quicksort or merge sort.

Wednesday, March 17, 2010

Visualizing RGB, take two - Update


When I posted my 'Visualizing RGB, take two' article last month, an anonymous commenter going by the name Full-size had a good suggestion - I should use a form of error diffusion to better hide the pixel errors when selecting the colours to use. Within a couple days I had this implemented, and wow does it make an improvement! Unfortunately, between school and needing to reinstall Windows I never ended up posting the results. So, three weeks late, here they are.

Results for Barcelona image

As you may recall, the goal of the algorithm was to transform a source image into a 4096x4096 version that uses every RGB colour exactly once. As a reminder, here is the original image, and what my algorithm previously did with it:
Original
Result from previous version


Now, take a look at how it looks using a form of error diffusion (with error terms capped at 60):

(Full 50 meg png available here). Much nicer to look at! Take a look at the sky in particular - where it used to go psychedelic trying to deal with all that near white, large portions now end up simply turning into a uniform gray. The new version is worse in a couple spots (e.g. the wheel wells of the car), but overall I think it is hugely improved. Now, I wonder how the new version would do on a harder image?

Results for the Mona Lisa

As promised, I am also posting the results of running this algorithm on an image of the Mona Lisa. This is an especially difficult image, because the colour palette is very limited - notice the lack of blue in particular. First, let's take a look at the original, and the result from the previous version of the code:
Original
Result from previous version


Ouch. Poor Lisa. Still, let's forge on and see how the new version does, shall we?

(Full 50 meg png available here). Considerably better overall, although the colour burning on the forehead and neck is pretty ugly to look at. Still, considering we are trying to use every RGB colour once, I think the results are quite decent.

Postscript

The code to achieve this is in the same place as before, updated on GitHub. Right now you have to change the source code to edit the error diffusion cap, sorry - but feel free to fix that!

Tuesday, March 2, 2010

Writing an efficient Sieve of Eratosthenes

See also the followup post containing a segmented sieve implementation in Go that beats all these.


I recently came across a paper by Melissa E. O'Neill on implementing a lazily evaluated Sieve of Eratosthenes. It was a very interesting read, although for a non-Haskeller understanding the code was certainly tricky. And by happy happenstance, not two weeks later I ended up needing a prime generator of my own for one of the Project Euler problems. This, I thought, was the perfect opportunity to try implementing the algorithm from that paper! That thought ended up leading me on a many-day journey to see how efficient an implementation I could make. This post details the key changes my code went through in that time, culminating in a C# algorithm that can sieve around 20 million candidates per second (or a Python version that can do 1.4 million).

Disclaimers

  1. During the course of this journey I learned about segmented sieves and the Sieve of Atkin, but I decided not to make use of either of these, instead sticking to the algorithm I chose from the start. Maybe next time.
  2. If you are simply looking for the fastest implementation possible, I point you towards http://cr.yp.to/primegen.html, which is written in highly optimized C and at least an order of magnitude faster than mine.

Version 1: Straight re-implementation of algorithm from paper

Language: Python
Time complexity:
O(n∙log(n)∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~300,000

Code (expand/contract)

This is simply a re-implementation in Python of the Haskell code from the paper above, although it loses some of the elegance in the translation.  You'll note that I am measuring performance with the very unscientific 'candidates in 1 sec' - since I am trying to compare algorithms with different asymptotic complexities in different languages with order-of-magnitude performance differences, it seemed like the easiest way to get a feel for the general magnitudes in this apples-to-oranges comparison.


The key idea here is that for each prime we encounter, we make a generator for multiples of it (starting at p2) that we can use to filter out our candidates. For efficiency, all multiples of 2 and 3 are skipped automatically as well, since that reduces the number of candidates to check by two thirds. These generators are all added to a min-heap, keyed by the next value each will generate. Then we can keep checking the next candidate and seeing if it matches the top of the heap. If it does, pop that, increment it, and push it back on keyed by the next value. If not, the candidate must be prime, so build a new generator and add that instead.


So, let's try an example of this. The first candidate we try is 5 (remembering that all multiples of 2 and 3 are already gone), and our heap is empty. This becomes the first item on the heap, and then 7 is checked:

5? []
Nope, prime
7? [25: 25,35,55,...]
Nope, prime
11? [25:25,35,55,... 49:49,77,91,...]
Nope, prime
13? [25:25,35,55,... 49:49,77,91,... 121:121,143,187...]
Nope, prime
17? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...]
Nope, prime
19? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Nope, prime
23? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Nope, prime
25? [25:25,35,55,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Aha! Matches the top of our heap...not prime
29? [35:35,55,65,... 49:49,77,91,... 121:121,143,187... 169:169,221,247,...  ...]
Nope, prime
...

Make sense? This is the algorithm straight from the paper, so if my explanation is lacking you could always try there. When I write it down this way, however, an optimization springs out at me - why are we using a heap at all? At any given time we are querying for a single value, and don't actually care about the other items. Instead of updating a heap structure all the time, we can simply use a hash table instead. Strictly speaking a multi-way hash table, since some values will have multiple prime factors.

Version 2: Using a hash table

Language: Python
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~1,400,000

Code (expand/contract)

Not bad, under 50 lines of code and with good complexity. This is as far as I know how to go with Python, and it is a dynamic language anyways which isn't really suited to this kind of number crunching, so for the next version I'm switching to C#. There I know the time required for most operations, and more importantly, I have a good profiler I know how to use.

Version 3: Re-implement in C#

Language: C#
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~10,000,000

Code (expand/contract)

Wow, significant speedup there! And with a modest 25 line code hit (plus a main method for convenience), that is certainly reasonable. A little profiling tells me that the bottleneck is the list operations, which isn't too much of a surprise. Of course, C#'s native list implementation isn't necessarily going to be the fastest for our exact situation - as is often the case when optimizing, the next step is to write our own data structure. Here I use a dead simple data structure, essentially just an array with an Add operation that dynamically resizes it.

Version 4: Use a custom structure for the 'multi' in multi dictionary

Language: C#
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n/log(n))
Candidates in 1 sec: ~12,000,000

Code (expand/contract)

Not bad, a modest speedup in this version. But I must admit, I cheated a little. The main benefit of this new data structure is not the speedup gained at this step. No, the real speedup is allowing an essentially free Clear operation. Because as useful as a hash table is, we can re-write it too to get even more of a speedup. The idea here is that most of our current iterators will have a 'next' value that differ by at most √n or so, so a circular array will be better for cache locality, remove the cost of hashing, and let us re-use the 'multi' objects. Note that a circular array is essentially a giant array, except only a small sliding window is actually backed by memory and allowed to contain values at any given time.

Version 5: Use a multi cyclic array instead of hash table

Language: C#
Time complexity:
O(n∙log(log(n)))
Space complexity: O(√n)
Candidates in 1 sec: ~20,000,000

Code (expand/contract)

Ok, that is about as far as I'm willing to go. Beyond this point I expect I'll start getting into low-benefit, high-complexity optimizations, and I don't want to go down that road. Although by some counts I am already there - from version 3 there has been a 2x speedup, but at the cost of doubling the code size and ending up with algorithms that fall firmly within the realm of 'tricky'. If I actually had to maintain a prime sieve in a professional setting that wasn't absolutely time critical, I think I would be going with version 3 - the later code starts to impose too much of a mental tax.


And there you have it, a reasonably efficient Sieve of Eratosthenes, and for the most part without 'tricky' optimizations. Let me know what you think!

Postscript

In the comments, Isaac was wondering how these compared to "vanilla array" implementations (essentially, pre-allocate the array and cross-off), so I have added two array implementations to GitHub (Python, C#) for comparision. Both are about 4 times faster than the comparative best in their language, testing 5.5 million and 75 million candidates in one second (respectively). The Python version runs out of memory somewhere before 500 million candidates, and the C# version can't get beyond about 2 billion due to limitations of the BitArray class.

Saturday, February 20, 2010

Visualizing RGB, take two


Update: The follow-up post can be found here.

About 3 weeks ago, I wrote about a program I created to transform pictures into all-RGB images (4096x4096 images using each possible RGB colour exactly once). It worked by ordering pixels based on a Hilbert ordering of the 3d colour cube and then re-colouring them in order, and while it produced interesting images, it pretty much failed at the stated goal of “keep[ing] the overall look untouched”. The problem was that the general hue of the pixels was often very shifted from the input, so the overall features were preserved but the colour balance was not. So for the past week or so I’ve been working on a new program, one that will (hopefully!) do a better job of keeping the base look intact. As with last time, I’m using an image I took in Barcelona for testing – let me know if you have a different one you’d like to see.


Original
Result From Take One

Choose the closest colour…

My idea this time was that instead of choosing an ordering of the pixels, it would be better to try to minimize the distance between the source and destination colours overall. The easiest way I could think of was to simply choose pixels at random, and assign them the “closest” colour remaining. Hopefully deviations would occur in all directions equally, so the average colour of a region would be as close as possible to the source. By popular demand, I will try to make this algorithm a little more explicit this time:

  1. Go through the pixels in the source image in a random order.
    1. For each, select the closest remaining unused colour, by Euclidean distance between the coordinates in the colour space.
    2. Assign the found colour as the output colour of the pixel, and mark it used.

But in which colour space?

Sources: one and two

A key question I had was which colour space would be best for preserving hues? There are a number of different “colour solids” that I could use coordinates from, with RGB being only one of many. I had a strong suspicion that something like HSL would do better than using RGB directly, but the easiest way to find out which to do a direct comparison. I tried the RGB cube as well as HSL and HSV cylinders for the comparison. My test images are presented below.


Original
RGB


HSL
HSV

As you can see, HSL and HSV give essentially the same results, which are both much better than RGB (look closely at the wheel wells, or the buildings in the trees on the right to see the differences). I like to think that HSV is slightly better, but I might be imagining differences that really aren’t there. Either way, I chose to use HSV for the final copy.



Looks good! Certainly a lot closer to the source image – I’m satisfied with this one for now.

Postscript

As with last time I am using a conceptually simple algorithm, however this time the implementation was considerably more difficult. The problem is that choosing the closest remaining colour to a source pixel is a hard problem to do efficiently, especially since the set of candidate colours changes at every step. I wrote the code in C# for performance this time, but I have still had to spend quite a few hours optimizing the code to get the program to finish at all. The final version can take 30+ hours to generate an image, and peak at over 4 GB of ram. I based my code around a KD-tree I found online, then rewrote to optimize for the 3D, single-nearest-neighbour case as well as to support branch pruning on delete. The rewritten tree – as well as the rest of my code – is available in a repository on GitHub: http://github.com/EricBurnett/AllRGBv2. Feel free to try it out for yourself - if you do, I’d love to hear about it!

Friday, February 12, 2010

Buzz: The perfect spam platform?

Buzz is out, genius or a privacy fiasco or just another me-too, depending on your view. Those topics have been covered to death already, but one thing I haven't seen talked about is how easy it makes spamming. And I don't mean shouting inanities to all your friends - that's what it's for, after all - I'm talking about targeted spam by spammers, like the blog spam that used to be a huge problem.

Here is the problem as I see it:

  1. You can follow anyone you want simply by adding their email address.
  2. When they "buzz", you are notified.
  3. When you comment, they see your response.

Does this seem like a bad idea to anyone else? I have a feeling that Google is going to have to allow people to "accept" followers, bringing it that much closer to Facebook.

Straining the limits of C#

Two weeks ago I wrote about an algorithm to generate All-RGB images from pictures. I am currently working on a follow-up post about a new algorithm, in C# this time. This one is a bit more computationally intensive, and despite the language change it is running into scaling issues. So while I wait for it to finish, I thought I'd write about a few of them.

Good data structures are hard to find

When you start processing large numbers of items in different ways, choosing the right data structure to store them becomes an absolute necessity. They can mean the difference between an O(n∙log(n)) and an O(n2) and algorithm, which can be the difference between taking 1 hour to run or 100 years. For this project, the requirements were simple – a data structure mapping points to objects that supported nearest-neighbour searches and deletion. To me, that immediately translated to kd-tree.


Usually in cases like this I end up needing to roll my own structure, but this time I was lucky. After some Googling I found exactly one viable implementation to use, and better yet, it was open source. I'm glad it was; it turned out later that there was a bug1 that needed fixing, and I needed to compile a 64-bit version anyways (I wonder if there's a lesson in here?). It is unfortunate that this was the only option, however. I mean, there are a ton of data structure libraries for most languages you can imagine, but the vast majority of them implement the same set of structures, are buggy, unsupported, and incompatible. I would love to see a Stack Overflow-style site to address this – community edited and supported code, implementations ranked together for comparison, searching by requirements if you don't know what you need, and the list goes on.


But even with the appropriate structure, the algorithm I have chosen will take more than a day to run and  4+ GB of memory. That is fine, I knew the approximate complexity when I started, but it does lead to the next set of issues.

Good algorithms are hard to find

Or should I say, good implementations of algorithms are hard to find. By way of introduction, a brief digression: my computer is unstable. Not terribly unstable, not enough to for me to actually take the time to fix it, but my sound card is slightly unsupported on Windows 7 so every once in a blue moon something like skipping a song will blue-screen the computer. Just about all my programs know how to pick up where they left off, but of course that doesn't hold for these projects I throw together in an evening. So when my computer decided to crash this morning, I decided to add some basic checkpointing. Checkpointing is easy, right? Hah!


Attempt 1: tag classes with [Serializable], run needed structures through a BinaryFormatter, streaming to file.

So, anyone want to guess what the problem is here? If you said object-graph size, you're right on the money. BinaryFormatter doesn't support object graphs with more than 6M items or so, and arrays get no special treatment. So serializing an array of 16.7M items throws a very unhelpful error message ("The internal array cannot expand to greater than Int32.MaxValue elements")2. Fine, I can unwrap my own arrays easily enough.


Attempt 2: unwrap arrays manually.

With each array element being serialized as a separate object, the overhead in the file is huge. If I had to guess, I'd say that the size on disk is about 10 times the size in memory. And since I'm trying to write about 1 GB of data...you can probably guess where this is going. Something in the output stack explodes when more than 4 GB of data is written, a number suspiciously close to the max size of an Int32. This is simply poor implementation, since it's not like I'm trying to mmap the data, and large files have been supported in all modern OS' for years. Not a big deal though, that data is going to be very redundant and I/O is expensive, so writing a compressed stream is probably faster in the long run.


Attempt 3: write to the file using a System.IO.Compression.GZipStream wrapper.

With compressed data, I expect the on-disk size to be comparable to the in-memory size, or a bit better. So the 4 GB limit should be no, problem, right? Wrong! The GZipStream has the same problem, and refuses to support more than 4 GB uncompressed. The fix here is even simpler – swap in a better GZip library.


Attempt 4: write to the file using a SharpZipLib.GZipOutputStream wrapper.

Success! The output file is about 700 MB and takes somewhere around 20 minutes to write, for a compression rate of about 9 MB/sec and space savings of about 93%.


Now, I could chalk these problems up as a failing of C#, but that wouldn't be accurate. By playing with this much data I am working outside the limits expected by the library designers, and I know it. I have focused on C#, but the issues are far more general than that – I can't even find a 64-bit version of python 2.6 for Windows to test with at all, but I'm sure I would run into a different set of problems if I could use it, and the same goes for the rest of the languages out there. The real issue is that versatile algorithm implementation is hard, and not getting much easier with time. And that I don't have a workaround for.


Footnotes

  1. The problem is that "deletions" are supported by tombstoning, so you periodically have to rebuild the index to clean them up. That is fine, except the range() method used to get the current entries out doesn't check the deleted flag! Don't worry, I'll be sending a fix back upstream.
  2. Someone else did the digging, and it seems the list of prime sizes for some internal hash table stops at 6 million, so the next prime it tries to resize to is something enormous (-1 unsigned?). Microsoft decided this is a non-issue, so no fix is coming. Their suggested workaround was to use the NetDataContractSerializer and a binary xml writer, but when I tested it the performance was too terrible to consider.

Sunday, January 31, 2010

Visualizing RGB


Update: See the follow-up post here.

Aldo Cortesi posted a link today to allrgb.com, a site dedicated to images visualizing the RGB colour space - in particular, 4096x4096 images that use each RGB value exactly once. Inspired by his hilbert curve visualization and the urge to spend a day programming, I present to you: the all-RGB Mandelbrot set.

Sort by colour...

My idea was this: instead of trying to visualize the colour space directly, why not use a base image for the "shape", and then map the RGB spectrum onto it? I thought that if I could find an image with an even spread of colours, this would let me make each pixel unique yet keep the overall look untouched.
To perform this mapping, I chose to define an ordering based on the 3 dimensional Hilbert curve. Cortesi explains it far better than I can, but the basic idea is this: the Hilbert curve can be used to find an ordering of all 16.8 million colours so that if you were to stretch them out on a line, every colour would be there and they would flow smoothly from one to another. Like this, except a lot smoother and a lot longer.
With this ordering in hand it is easy to find the index into this line for each pixel in the source image, sort the pixels, and then assign them the colour they line up with.

Choosing an image

When I started this morning, I had the idea that the output images would look reasonably close to the source images. I was half right; the images certainly have all the same features as before, but the colouring is all wrong. In hindsight the reason is obvious - unless the original had a perfectly even spectrum of colours, the mapping would be stretched in some places and shifted in others, and in general not line up nicely.


While interesting, this wasn't exactly what I was going for. Hmm...what image could I use where it wouldn't matter if the colours were all shifted? The first thing that came to mind was a visualization like the Mandelbrot set, where the colours are arbitrarily chosen anyways. A quick Google search found me this:



Which, when transformed, came out as this:
Perfect!

Postscript

I created all the visualizations here using Python 2.6 and the Python Imaging Library. It isn't the most efficient code (it takes half an hour to render a 4096x4096 image), but the quick development time easily makes up for it. If anyone is interested in playing with it, I have placed the code on github. Or if you just want to see what something would look like, feel free to send it to me and I'd be happy to run it for you.

Saturday, January 30, 2010

Moved!

This blog has moved! It now lives at www.thelowlyprogrammer.com. As is probably obvious, I have also changed the name to The Lowly Programmer (see the updated about me for details). Nothing else is changing at the moment; in particular, I doubt I will be posting to it with any regularity yet :).

Saturday, January 16, 2010

The value of a point

Recently I have been spending a lot of time on Hacker News. I've lurked for quite a while, but it is only in the last month or so that I have actively started to comment. I usually try for informative comments rather than going for the inflammatory or popular topics. This only gets me a point or two per comment I post, but these are points I have earned by contributing to the discussion in a positive way. I had almost reached 50 points this morning and was feeling quite good about my contributions overall.

But then I made a submission.

Most of my submissions go the way of my comments, earning with a point or two plus some interesting responses. It's these responses I am after; I find an interesting article somewhere, so I submit it to see what the HN community has to say about it. But this one, for whatever reason, the community really appreciated, and within 20 minutes it was at the top of the front page. The discussion was lively and as interesting to read as I could hope for, and it was raking in the points.

And therein lies the problem. Someone, a self proclaimed member of HN no less, took the time and effort to write a thought provoking piece and there I was earning scores of points off it, which left me feeling like a bit of a cheat. But even had I been the original author, I don't think I would want that many points for it. By the time it falls off the front page it will likely be over 50 points itself, surpassing a month's effort in one fell swoop and devaluing the points accordingly. I was looking at points as a general reflection of my value to the site, but I cannot do that anymore, at least not directly.

So what can be done about the 'problem' with points? I can think of a few suggestions.

  1. Nothing. After all, I am only a single opinion, and the system seems to be working fine as it is.
  2. Don't count any points past a certain cutoff. They would still be useful for display and ranking, but for karma the extra would be discarded. I kind of like this idea, but any cutoff would be an arbitrary one. Stack Overflow uses a system like this, although there it's points per day rather than per action.
  3. Count points on a log scale. Display them as normal, but change the way karma is calculated. This would help address crazy outliers like this one, and in general promote sustained quality over hot-button topics (for people actively trying for points).
I know pg is always tweaking the system to make it work better; it will be interesting to see what, if anything, actually changes in the next while.