Wednesday, April 28, 2010

Robustly hot swapping binaries (with code)

A while ago I read an article by Nathan Wiegand on hot swapping binaries. It was eye-opening for me: before reading it, hot swapping was one of those black arts I never really thought about, and I certainly wouldn’t have thought it was at all easy. I highly recommend you read it for yourself. Go ahead, I’ll wait.

There. Did you notice it? The elephant in the room? One thing the article doesn’t address is how to design programs that use this fancy new ability without being fragile and crashing and all those bad things. I’ve been mulling this over since reading it, and have settled on a basic design that I’ll present here. But don’t worry, this isn’t just a thought experiment: I’ve actually coded it up and made sure it works as expected. Feel free to jump down and check out the code before reading the design. Still, caveat emptor and all that.

Design Goals

For this design, I am focusing on three key goals:

  • Allow updating to any future version
  • Allow updating to any previous version
  • Make it easy to be crash-free

Simple enough? Updating forward is a pretty obvious goal, as is crash-free code. I want to allow updating backwards as well for the simple reason that I don’t expect all new code to be bug free, so it might be desirable to roll back when a bug is introduced.

Design

To achieve this, I’ve settled on a shortlist of constraints for the code:

  1. Use protocol buffers to store all state
  2. Provide suitable defaults for everything, and be robust against weird state
  3. Structure the code as a state machine

I did say the list was short. Let’s look at these in detail:

  1. Protocol buffers are an obvious choice for persisting state, as they are forward and backward compatible by design. Care must be taken to not re-use tag numbers and to never add ‘required’ fields, but this is an easy requirement to satisfy. Using protocol buffers to store everything does incur some overhead, but they are quite efficient by design, and state only has to be in the protocol buffer when a state machine transition is crossed, so local variables within a state are still quick.
  2. Hand in hand with 1), we cannot always expect to have the data we want in the format we want version to version. To accommodate this we must pick suitable defaults for fields, and if necessary be able to get new values at runtime. At the same time, if the meaning or format of a field is changing, it is probably better to use a new field. We will always try to handle weird data that may show up, but this shouldn’t be abused.
  3. Finally, structure the code as a state machine. This surfaces all state machine transitions as potential points for upgrading versions, and forces state to be in the protocol buffer when these transitions are crossed to ensure important data isn’t forgotten. And like everything else, the next state data can be stored in the protocol buffer.
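To make the second constraint concrete, here is a minimal sketch of filling in defaults and repairing odd values when state is loaded. It does not use the real generated protocol buffer class: `LoadedState`, `CleanState` and `sanitize` are hypothetical names, and `std::optional` (C++17) stands in for optional fields that another version may not have written.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the generated protocol buffer class. Each optional
// field may be absent if the state was written by a different version.
struct LoadedState {
    std::optional<std::string> line_text;
    std::optional<std::int64_t> line_count;
};

// The same state after defaults are applied and odd values are repaired.
struct CleanState {
    std::string line_text;
    std::int64_t line_count = 0;
};

// Fill in a default for every missing field and clamp values that make no
// sense, instead of crashing on them.
CleanState sanitize(const LoadedState& in) {
    CleanState out;
    out.line_text = in.line_text.value_or("");
    out.line_count = in.line_count.value_or(0);
    if (out.line_count < 0) {
        out.line_count = 0;  // a negative count is meaningless; restart at 0
    }
    return out;
}
```

With real protocol buffers the `[default = ...]` annotations cover the missing-field case, so only the range checks would need to be written by hand.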

There is one problem with 3), however. What happens when new states are added? Going forward is easy, but if we update to a previous version when we’re in one of these new states, it will have no idea where to start running again. We could try storing fallback states or something like that, but that seems too fragile. Instead, I would recommend not allowing updates to occur when transitioning to these new states. Then, a few versions down the line when you’re sure you won’t need to downgrade past where they were added, remove that restriction.
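One way to express that restriction is a small set of states that must not be entered across a swap. This is only a sketch: the enum mirrors the states from the .proto file, while `kNoSwapStates` and `canSwapBefore` are hypothetical names, and treating STATE_MUTATE_LINE as the newly added state is an assumption made for illustration.

```cpp
#include <set>

// Mirror of the State enum from the .proto file, so the sketch stands alone.
enum State {
    STATE_NONE = 0,
    STATE_DONE = 1,
    STATE_ERROR = 2,
    STATE_INIT = 3,
    STATE_PROCESS_LINE = 4,
    STATE_MUTATE_LINE = 5,
};

// States that an older version might not know about (assumed for this example
// to be just STATE_MUTATE_LINE). Remove entries once downgrading past the
// version that introduced them is no longer needed.
const std::set<State> kNoSwapStates = {STATE_MUTATE_LINE};

// Returns true if it is safe to swap binaries just before entering `next`.
bool canSwapBefore(State next) {
    return kNoSwapStates.count(next) == 0;
}
```

The state machine loop would then honour a swap request only when `canSwapBefore(next)` is true, and otherwise keep running until it reaches a transition that every version understands.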

In the .proto file, the states and the two fields that track them look like this:

message ProgramState {
  enum State {
    // Special states supported by the program infrastructure.
    STATE_NONE  = 0;
    STATE_DONE  = 1;
    STATE_ERROR = 2;
    // Program states. Unknown state transitions lead to ERROR and terminate the
    // program, so should be avoided at all costs.
    STATE_INIT          = 3;
    STATE_PROCESS_LINE  = 4;
    STATE_MUTATE_LINE   = 5;
  }
  optional State prev_state = 2 [default = STATE_NONE];
  optional State cur_state  = 3 [default = STATE_ERROR];
  // Other fields (such as line_text and line_count) omitted.
}

What About Threads?

You may have noticed that this design is inherently single-threaded. Threading can be added easily enough if the main thread owns all the work, and can easily and safely wait for or cancel all worker threads without losing anything. In that case, spin down the workers when you’re about to swap, and spin them up again when it completes. If your program doesn’t fit that description, however, this design may not be for you.
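As a rough sketch of that pattern, the main thread could own a small pool that it stops before a swap and restarts afterwards. The `Workers` class and its method names are hypothetical, and the counter increment is only a placeholder for one small unit of work that can be abandoned safely.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker pool owned by the main thread.
class Workers {
public:
    ~Workers() { spinDown(); }

    // Start `count` workers. Call at start-up and again once a swap completes.
    void spinUp(int count) {
        stop_ = false;
        for (int i = 0; i < count; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Ask every worker to stop and wait for all of them, so that no work is
    // in flight when the binary is swapped.
    void spinDown() {
        stop_ = true;
        for (std::thread& t : threads_) {
            t.join();
        }
        threads_.clear();
    }

    std::size_t running() const { return threads_.size(); }
    int unitsDone() const { return done_; }

private:
    void run() {
        while (!stop_) {
            ++done_;  // placeholder for one small, cancellable unit of work
            std::this_thread::yield();
        }
    }

    std::atomic<bool> stop_{false};
    std::atomic<int> done_{0};
    std::vector<std::thread> threads_;
};
```

The main thread would call `spinDown()` just before going down for the swap and `spinUp()` after coming back up. Anything the workers need to resume has to be in the persisted state by then, since the threads themselves do not survive the swap.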

Testing?

Of course! I would recommend trying all transitions on a test instance first before upgrading the real process. You could also build in consistency checks that auto-revert if the state doesn’t meet expectations, regression tests for certain upgrade patterns, etc. This design is meant to make it easy to hot swap successfully, but it is no silver bullet.
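A consistency check of that kind might look like the sketch below. `StateView` and `looksConsistent` are hypothetical names, the numeric state values are the ones from the .proto enum, and the rule that only STATE_INIT may have an empty line is an assumption about this particular example program.

```cpp
#include <string>

// Hypothetical, trimmed-down view of the persisted state after a swap.
struct StateView {
    int cur_state;          // numeric value of cur_state from the .proto enum
    long line_count;
    std::string line_text;
};

// Returns false if the loaded state breaks an expectation, in which case the
// caller could swap straight back to the previous binary.
bool looksConsistent(const StateView& s) {
    // Program states are 3 (INIT) through 5 (MUTATE_LINE) in the enum.
    const bool known_state = s.cur_state >= 3 && s.cur_state <= 5;
    if (!known_state) return false;
    if (s.line_count < 0) return false;
    // Assumption: only STATE_INIT (3) runs before a line has been provided.
    if (s.cur_state != 3 && s.line_text.empty()) return false;
    return true;
}
```

Running a check like this right after coming up from a swap gives the new version a chance to hand control back before it acts on state it does not understand.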

Let's See the Code!

As always, the code is up on GitHub for you to peruse. It is broken into two demonstration applications, ‘v1’ and ‘v2’, that can be swapped between at will. While looping they respond to ‘u’ and ‘q’ (update and quit), although at times you may be prompted for other input. The makefiles build to the same target location, so build whichever one you want to run next and press ‘u’ to swap to it.

The code is structured so you can use it as a framework to play with yourself easily enough. You should only need to write an init method, update the state machine and .proto file, and write the respective state methods to do the real work. The state machine and state methods will look something like this:

ReturnCode runStateMachine(ProgramState& state) {
    cerr << "Running state machine\n";
    // Put stdin into non-blocking, raw mode, so we can watch for character
    // input one keypress at a time.
    setStdinBlocking(false);
    while (true) {
        ProgramState::State next;
        switch (state.cur_state()) {
            case ProgramState::STATE_INIT:
                next = runState_init(state);
                break;
            case ProgramState::STATE_PROCESS_LINE:
                next = runState_process_line(state);
                break;
            case ProgramState::STATE_DONE:
                setStdinBlocking(true);
                return SUCCESS;
            case ProgramState::STATE_NONE:
            case ProgramState::STATE_ERROR:
            default:
                setStdinBlocking(true);
                return FAILURE;
        }

        ProgramState::State cur = state.cur_state();
        state.set_prev_state(cur);
        state.set_cur_state(next);

        // For now, simply let the user decide when to swap and quit. We can
        // always change this later.
        ReturnCode code = checkForUserSignal();
        if (code != CONTINUE) {
            setStdinBlocking(true);
            return code;
        }
    }
}

ProgramState::State runState_init(ProgramState& state) {
    cout << "Please provide a line of text for me to repeat ad-nauseum\n";
    string line;
    setStdinBlocking(true);
    getline(cin, line);
    setStdinBlocking(false);
    state.set_line_text(line);
    cout << "Thanks!\n";
    state.set_line_count(0);

    return ProgramState::STATE_PROCESS_LINE;
}

Easy, right? And here is an example transcript from going forward and then back between the versions in the repository (behind the scenes compiles not shown):

eric:~/code/hotswap/v1$ ../bin/hotswap.out
HotSwap example started - version 0.1
Initial call
Running state machine
Please provide a line of text for me to repeat ad-nauseum
All work and no play makes jack a dull boy
Thanks!
0: All work and no play makes jack a dull boy
1: All work and no play makes jack a dull boy
2: All work and no play makes jack a dull boy
u
Going down for swap
Coming up from swap!
This is generation 2 running 0.2 software (initial version 0.1)
Running state machine
3 mutations: All work and no play makes jack a dull boy
4 mutations: All work and no play maXes jack a dull boy
5 mutations: All workqand no play maXes jack a dull boy
6 mutations: All workqand nL play maXes jack a dull boy
u
Going down for swap
HotSwap example started - version 0.1
Coming up from swap!
Running state machine
7: All workqand nL play maXes jack a dull boy
8: All workqand nL play maXes jack a dull boy
9: All workqand nL play maXes jack a dull boy
q
Terminating with code 0


As you can see, version 0.2 mutates the line as it goes, while version 0.1 simply prints it forever. There are more differences than that, but you can find all that out from the code.

Enjoy! If you do end up playing with it, I’d love to hear about your experiences. Even if you don’t, I’d welcome your thoughts on the design.


This will probably be my last post for a while – on Saturday I leave the continent for 2 weeks and the city for 6. I will try to respond to emails and comments while I’m gone, but I may be a bit slower than usual.

2 comments

  1. Thanks for another smart idea and good luck in your trip :) I've added this post to http://www.strchr.com/news/

  2. Nice, I hadn't realized strchr had a dedicated news feed, and even better, it looks like it's right up my alley.

    And thanks, I'm looking forward to it :).
