Wednesday, April 28, 2010

Robustly hot swapping binaries (with code)

Going down for swap
Coming up from swap!
This is generation 2 running 0.2 software (initial version 0.1)
Running state machine

  A while ago I remember reading an article by Nathan Wiegand on hot swapping binaries. This was a very eye-opening article for me – before reading it, hot swapping was one of those black arts I never really thought about, and certainly wouldn’t have thought was at all easy. I highly recommend you read it for yourself. Go ahead, I’ll wait.

There. Did you notice it? The elephant in the room? One thing the article doesn’t address is how to design programs to use this fancy new ability, without being fragile and crashing and all those bad things. I’ve been mulling this over since reading it, and have settled on a basic design that I’ll present here. But don’t worry, this isn’t just a thought design…I’ve actually coded it up and made sure it works as expected. Feel free to jump down and check it out before reading the design. Still, caveat emptor and all that.

Design Goals

For this design, I am focusing on three key goals:

  • Allow updating to any future version
  • Allow updating to any previous version
  • Make it easy to be crash-free

Simple enough? Updating forward is a pretty obvious goal, as is crash-free code. I want to allow updating backwards as well for the simple expedient that I don’t expect all new code to be bug free, and so it might be desirable to roll back when a bug is introduced.


To achieve this, I’ve settled on a shortlist of constraints for the code:

  1. Use protocol buffers to store all state
  2. Provide suitable defaults for everything, and be robust against weird state
  3. Structure the code as a state machine

I did say the list was short. Let’s look at these in detail:

  1. Protocol buffers are an obvious choice for persisting state, as they are forward and backward compatible by design. Care must be taken to not re-use tag numbers and to never add ‘required’ fields, but this is an easy requirement to satisfy. Now, using protocol buffers to store everything does incur some overhead, but they are quite efficient by design and we really only need to store all state between state machine transitions so local variables are still quick.
  2. Hand in hand with 1), we cannot always expect to have the data we want in the format we want version to version. To accommodate this we must pick suitable defaults for fields, and if necessary be able to get new values at runtime. At the same time, if the meaning or format of a field is changing, it is probably better to use a new field. We will always try to handle weird data that may show up, but this shouldn’t be abused.
  3. Finally, structure the code as a state machine. This surfaces all state machine transitions as potential points for upgrading versions, and forces state to be in the protocol buffer when these transitions are crossed to ensure important data isn’t forgotten. And like everything else, the next state data can be stored in the protocol buffer.

There is one problem with 3), however. What happens when new states are added? Going forward is easy, but if we update to a previous version when we’re in one of these new states, it will have no idea where to start running again. We could try storing fallback states or something like that, but that seems too fragile. Instead, I would recommend not allowing updates to occur when transitioning to these new states. Then, a few versions down the line when you’re sure you won’t need to downgrade past where they were added, remove that restriction.

enum State {
    // Special states supported by the program infrastructure.
    STATE_NONE  = 0;
    STATE_DONE  = 1;
    STATE_ERROR = 2;
    // Program states. Unknown state transitions lead to ERROR and terminate the
    // program, so should be avoided at all costs.
    STATE_INIT          = 3;
  optional State prev_state = 2 [default = STATE_NONE];
  optional State cur_state  = 3 [default = STATE_ERROR];

What About Threads?

You may have noticed that this design is inherently single-threaded. Threading can be added easily enough if the main thread owns all the work, and can easily and safely wait for or cancel all worker threads without losing anything. In that case, spin down the workers when you’re about to swap, and spin them up again when it completes. If your program doesn’t fit that description, however, this design may not be for you.


Of course! I would recommend trying all transitions on a test instance first before upgrading the real process. You could also build in consistency checks that auto-revert if the state doesn’t meet expectations, regression tests for certain upgrade patterns, etc. This design is meant to make it easy to hot swap successfully, but it is no silver bullet.

Let's See the Code!

As always, the code is up on GitHub for you to peruse. It is broken into two demonstration applications, ‘v1’ and ‘v2’, that can be swapped between at will. While looping they respond to ‘u’ and ‘q’ (update and quit), although at times you may be prompted for other input. the makefiles build to the same target location, so build whichever one you want run next and press ‘u’ to swap to it.

The code is structured so you can use it as a framework to play with yourself easily enough. You should only need to write an init method, update the state machine and .proto file, and write the respective state methods to do the real work. The state machine and state methods will look something like this:

ReturnCode runStateMachine(ProgramState& state) {
    cerr << "Running state machine\n";
    // Put stdin into non-blocking, raw mode, so we can watch for character
    // input one keypress at a time.
    while (true) {
        ProgramState::State next;
        switch (state.cur_state()) {
            case ProgramState::STATE_INIT:
                next = runState_init(state);
            case ProgramState::STATE_PROCESS_LINE:
                next = runState_process_line(state);
            case ProgramState::STATE_DONE:
                return SUCCESS;
            case ProgramState::STATE_NONE:
            case ProgramState::STATE_ERROR:
                return FAILURE;

        ProgramState::State cur = state.cur_state();

        // For now, simply let the user decide when to swap and quit. We can
        // always change this later.
        ReturnCode code = checkForUserSignal();
        if (code != CONTINUE) {
            return code;

ProgramState::State runState_init(ProgramState& state) {
    cout << "Please provide a line of text for me to repeat ad-nauseum\n";
    string line;
    getline(cin, line);
    cout << "Thanks!\n";

    return ProgramState::STATE_PROCESS_LINE;

Easy, right? And here is an example transcript from going forward and then back between the versions in the repository (behind the scenes compiles not shown):

eric:~/code/hotswap/v1$ ../bin/hotswap.out
HotSwap example started - version 0.1
Initial call
Running state machine
Please provide a line of text for me to repeat ad-nauseum
All work and no play makes jack a dull boy
0: All work and no play makes jack a dull boy
1: All work and no play makes jack a dull boy
2: All work and no play makes jack a dull boy
Going down for swap
Coming up from swap!
This is generation 2 running 0.2 software (initial version 0.1)
Running state machine
3 mutations: All work and no play makes jack a dull boy
4 mutations: All work and no play maXes jack a dull boy
5 mutations: All workqand no play maXes jack a dull boy
6 mutations: All workqand nL play maXes jack a dull boy
Going down for swap
HotSwap example started - version 0.1
Coming up from swap!
Running state machine
7: All workqand nL play maXes jack a dull boy
8: All workqand nL play maXes jack a dull boy
9: All workqand nL play maXes jack a dull boy
Terminating with code 0

As you can see, version 0.2 mutates the line as it goes, while version 0.1 simply prints it forever. There are more differences than that, but you can find all that out from the code.

Enjoy! If you do end up playing with it, I’d love to hear about your experiences, or your thoughts on the design even if not.

This will probably be my last post for a while – on Saturday I leave the continent for 2 weeks and the city for 6. I will try to respond to emails and comments while I’m gone, but I may be a bit slower than usual.

Friday, April 23, 2010

Indexing and enumerating subsets of a given size

Subsets of a 9 element set containing up to 3 elements, in Banker's order

I received an email yesterday from a gentleman named Calvin Miracle, asking my opinion on subset enumeration strategy. He also provided a copy of this paper, which details an algorithm to generate a sequence of subsets in Banker’s order1, and asked about an algorithm to generate these subsets in a stateless manner. I’ll let him describe it:

Given a call to the method banker(n,k,i), where n is the size of a set, k is the subset size under consideration, and i is the i'th subset of size k, the method will return a boolean vector of size n, with only k TRUE values, that selects the i'th k-size subset from the overall set.

Now, prior to this I had never heard of Banker’s sequences nor really thought about enumerating subsets, but I’m always willing to be nerd sniped so I gave it a go. Presented here is the algorithm I designed for him.


After writing this algorithm, I did some more digging and found out that some recent research has been done into enumerating subsets of a specific size, although in lexicographic order rather than Banker’s order. Wikipedia has details on this, and provides an algorithm very similar to the one I created here for except lexicographic ordering. So none of this should be seen as especially new or groundbreaking, although it was new to me and hopefully will be to you as well.


The basic idea of this algorithm is that given any choice of the first element for our subset, we can calculate how many possible ways to choose the remaining elements there are, and we know that every subset with this first element will come before every subset with a later first element according to our ordering scheme. So we can iterate through the possible choices of first element, totalling up how many subsets each represents, until we find the range that the desired index falls within. We can then recursively determine the rest of the subset in the same manner.

So, Let n be the number of items, k the size of the subsequence, and i the specific index we are looking for. We will enumerate sequences by considering all subsets that start with “1”, then all that start with “01”, then all that start with “001”, etc., where “0” represents skipping an item and “1” represents selecting it. There are n-1 choose k-1 of the first sort, n-2 choose k-1 of the second, n-3 choose k-1 of the third, etc. So:

“1”: 0 <= i < n-1Ck-1
“01”: n-1Ck-1 <= i < n-1Ck-1 + n-2Ck-1
“001”: n-1Ck-1 + n-2Ck-1 <= i < n-1Ck-1 + n-2Ck-1 + n-3Ck-1


Once we know what prefix i has, we can recursively determine the next sequence using n' = n-(prefix length), k' = k-1, i' = i-(bottom of range).


Let n = 5, k = 3, i = 7:

“1” 0 <= i < 6                        (4C2)
“01”: 6 <= i < 9 = 6+3          (4C2 + 3C2)

So, initial prefix is “01”.

We recurse with n = 3, k = 2, i = 1:

“1” 0 <= i < 2                        (2C1)

So, the next piece is “1”.

Finally, we recurse with n=2, k=1, i=1:
Trivially we can see that for “10” = 0, “01” = 1, so the last piece is “01”.

Putting this together, the sequence is “01101”,  so items 2, 3, and 5 of the 5 compose the 7th subset2 of length 3.


This algorithm is quite efficient – it needs to check at most n prefixes in total over all the levels of recursion, since each we consider shortens our candidate set by 1. There will be at most k recursive levels, and kn. So if we handwaive the “a choose b” calculations, we find that this can be done in O(n) time, which is O(1) in the size of the output, and thus optimal.

If we don’t handwaive the choice functions, we find that nCk is O(n!) in the worst case, so the result requires O(log(n!)) =O(nlogn) bits to store. If addition is O(1) we still may need to add up to n of these, so the total worst-case runtime is O(n2 logn), or O(nlogn) in the size of the output. However, with small values for k we don't get anywhere close to that worst case, and indeed are closer to O(logn) in size of the input. This is still quite efficient, and probably about the best that can be achieved in a function of this type. We should also consider the time needed to calculate the choice values, but if we are iterating over these indices the values can be pre-cached first and then the amortized cost per subsequence lookup goes to effectively zero.


If you want to play with this algorithm yourself, I have placed some java code implementing it on GitHub. It is as efficient as expected, and should be plenty fast enough for anything you could conceivably need it for. For example, it takes essentially no time to determine that the 160,000,000,000,000,000,000,000,000,000th subset of size 12 from a 10,000 item set is {0, 1, 2, 69, 1212, 1381, 4878, 5291, 5974, 6139, 6639, 8979}. So have at it!

Update: Ligon Liu has kindly provided a C# port of this code, which I have added to the repository (it the c# directory).

Update 2: Josiah Carlson has kindly provided a python port of this code, also in the repository now.

Update 3: Richard Lyon has kindly provided ports to both JavaScript and PHP, on GitHub as well. Thanks guys!

Update 4: Corin Lawson has a C implementation in his GitHub repository, along with other Banker's order functions. Great!


  1. The easiest way to think of Banker’s ordering is to think of comparing the indices of the items that make up the subsets. So to order two subsets, take the list of indices of elements in the first subset, and compare it to that of the second subset in standard dictionary (lexicographic) order. So {2, 3, 4} comes before {2, 3, 5} but after {1, 3, 5}, for example.

    The image at the top of this post depicts the Banker’s ordering for subsets of  lengths 1, 2, and 3, respectively, from a set of size 9.
  2. Subsets are indexed from 0, so you may consider this the 8th subset.

Wednesday, April 14, 2010

Introducing: Marriage Sort

Update: I've created a live visualization of this algorithm, so you can see it in action - see Marriage Sort, a Visualization.

Two weeks ago, a link showed up on Hacker News on how to (mathematically) select the best wife. Tongue firmly in cheek, of course. The key takeaway from the article is that knowing only the relative rankings of items in a sequence (and assuming you can never go back to a previous one), you can maximize your chances of selecting the maximum one by letting N/e go by and then picking the next one that's better than all those. Further, to maximize your expected value you only need to let the first √N - 1 go by. After rattling around in the back of my mind for a couple weeks, yesterday this tidbit coalesced into an actual Idea. Could I turn this selection routine into a sorting algorithm?

And thus, Marriage Sort was born. I present it here purely out of interest, not because it is especially fast - if nothing else, because it is the only sorting algorithm I know of that has an average complexity of O(n1.5)1.


The basic idea of this algorithm is to repeatedly choose the maximum element from the first √N - 1 elements of our working set, and then scan to the end looking for elements bigger than it. When one is found, swap it to the end of the working set, and decrease the working set by one. After the pass is complete, swap out the maximum element from the first √N - 1 as well, and start again. When everything is finished (√N - 1 <= 0), use insertion sort to put the array into true sorted order.
One pass of the Marriage Sort. Two elements are found and swapped to the end, followed by the maximum element from the first √N - 1.

This works because, as we have noted from the article linked above, each element bigger than the first √N - 1 is expected to be close to the largest remaining element in the array. By repeatedly taking these elements and moving them to the end, we put the array into an approximately sorted order, where elements should be reasonably close to their 'true' positions. When this is done insertion sort will do the final rearranging, moving elements the short distances to their true positions.

In pseudocode:
def marriagesort(array):
    end = array.length
    while true:
       skip = sqrt(end) - 1
       if skip <= 0: break

       # Pick the best element in the first vN - 1:
       bestPos = 0, i = 1
       while i < skip:
           if array[i] > array[bestPos]: bestPos = i
           i += 1

       # Now pull out elements >= array[bestPos], and move to the end:
       i = skip
       while i < end:
           if array[i] >= array[bestPos]:
               array.swap(i, end-1)
               end -= 1
               i += 1

       # Finally, move our best pivot element to the end
       swap(bestPos, end-1)
       end -= 1

    # Finish off with insertion sort to put the elements into true sorted order

Here you can see this algorithm working on a randomized array. Many thanks to Aldo Cortesi for the wonderful sorting algorithm visualization framework!

Update: Here is another visualization made using Aldo Cortesi's tools, of a 512 element array this time. If you take a close look at the insertion sort phase (the slope on the right side), you can see that it is composed out of a lot of little triangles. Each of these triangles corresponds to one pass of the marriage selection routine, and shows how far elements can be from their true sorted position.


Some quick back-of-the-napkin calculations indicate that this algorithm takes O(n1.5) compares in the average case, although the worst case looks to be O(n2). For the first stage (i.e. without the final insertion sort) each pass will take at most n compares and move approximately √N items to the end, requiring about √N passes. I don't know if this is a tight bound - the passes will actually speed up and require fewer compares as the working set shrinks, however I didn't want to try including that in the calculations. Similarly, after each pass all the items moved are guaranteed to be greater than all the items left, capping the distance moved items are from "home" to √N on average for the insertion sort. This holds the insertion sort to O(n1.5) compares on average as well.

Similarly, this algorithm takes O(n1.5) swaps in the average case (and O(n2) in the worst case). The number of swaps could certainly be decreased if a different final sort were used - the first pass only takes Θ(n) by itself - although in practice this isn't much use since the compares will still dominate the runtime.

Note that both of these graphs are log-log plots, so the important feature to look at is the slope of lines rather than the absolute position. For example, in the "Swaps" plot quicksort appears to be hovering around the n∙log(log(n)) line, however looking at the slope we can see that it is actually approximating n∙log(n), just with a better constant factor.

Note 2: The x axis of these graphs is the number of elements in the array, and the y axis is the number of comparisons/sorts required. I should have labeled these better.


Some sample code implementing this algorithm is up on GitHub - this is the code to generate the log-log plots shown above.

As always, questions and comments are welcome!


  1. Most sorting algorithms are either Θ(n2) or Θ(n∙log(n)) in the average case. This one falls between those two extremes, making it faster (asymptotically) than algorithms like insertion sort or bubble sort but slower than quicksort or merge sort.