Saturday, February 20, 2010

Visualizing RGB, take two


Update: The follow-up post can be found here.

About 3 weeks ago, I wrote about a program I created to transform pictures into all-RGB images (4096x4096 images using each possible RGB colour exactly once). It worked by ordering pixels based on a Hilbert ordering of the 3D colour cube and then re-colouring them in order. While it produced interesting images, it pretty much failed at the stated goal of “keep[ing] the overall look untouched”: the general hue of the pixels often shifted significantly from the input, so the overall features were preserved but the colour balance was not. So for the past week or so I’ve been working on a new program, one that will (hopefully!) do a better job of keeping the base look intact. As with last time, I’m using an image I took in Barcelona for testing – let me know if you have a different one you’d like to see.


Original
Result From Take One

Choose the closest colour…

My idea this time was that instead of choosing an ordering of the pixels, it would be better to try to minimize the overall distance between the source and destination colours. The easiest way I could think of was to simply visit the pixels at random and assign each the “closest” remaining colour. Hopefully the deviations would occur in all directions equally, so the average colour of a region would stay as close as possible to the source. By popular demand, I will try to make the algorithm a little more explicit this time (a code sketch follows the steps below):

  1. Go through the pixels of the source image in a random order.
  2. For each pixel, select the closest remaining unused colour, measured by Euclidean distance between coordinates in the colour space.
  3. Assign that colour as the output colour of the pixel and mark it used.
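
In code, the core loop looks roughly like the simplified sketch below. I’ve used a brute-force scan over the remaining colours so the sketch stays self-contained; the real program replaces that scan with a kd-tree lookup (more on that in the postscript), and the names here are illustrative rather than taken from my actual code.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Brute-force sketch of the matching loop: visit pixels in random order and give each
    // the closest remaining colour. The real program swaps the linear scan for a kd-tree.
    static class ClosestColourSketch
    {
        struct Rgb { public byte R, G, B; }

        static double DistanceSquared(Rgb a, Rgb b)
        {
            double dr = a.R - b.R, dg = a.G - b.G, db = a.B - b.B;
            return dr * dr + dg * dg + db * db;
        }

        // sourcePixels: the input image's pixels; palette: every colour to be used exactly once.
        // The two must have the same length.
        static Rgb[] Recolour(Rgb[] sourcePixels, List<Rgb> palette, Random rng)
        {
            var output = new Rgb[sourcePixels.Length];

            // Step 1: visit the pixels in a random order (Fisher-Yates shuffle of the indices).
            int[] order = Enumerable.Range(0, sourcePixels.Length).ToArray();
            for (int i = order.Length - 1; i > 0; i--)
            {
                int j = rng.Next(i + 1);
                int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            }

            foreach (int pixel in order)
            {
                // Step 2: find the closest remaining unused colour by Euclidean distance.
                int best = 0;
                double bestDist = double.MaxValue;
                for (int c = 0; c < palette.Count; c++)
                {
                    double d = DistanceSquared(sourcePixels[pixel], palette[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }

                // Step 3: assign it as the output colour and mark it used by removing it.
                output[pixel] = palette[best];
                palette.RemoveAt(best);
            }
            return output;
        }
    }

With a 4096x4096 image the brute-force scan is hopeless, of course, which is exactly what the kd-tree described in the postscript is for.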

But in which colour space?

Sources: one and two

A key question I had was which colour space would be best for preserving hues. There are a number of different “colour solids” I could take coordinates from, with RGB being only one of many. I had a strong suspicion that something like HSL would do better than using RGB directly, but the easiest way to find out was to do a direct comparison. I tried the RGB cube as well as the HSL and HSV cylinders for the comparison. My test images are presented below.
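
First, a note on what “distance” means in each space: every colour gets re-expressed as coordinates in the candidate space before taking Euclidean distances. Below is a sketch of the standard RGB-to-HSV conversion, plus a mapping of (H, S, V) onto Cartesian coordinates of the cylinder so that hue 359° lands next to hue 0°; treat the cylinder mapping as an assumption about the approach rather than a description of my actual code.

    using System;

    // Sketch: convert an RGB colour to HSV, then map (H, S, V) onto Cartesian coordinates of
    // the HSV cylinder so that plain Euclidean distance respects the circular hue axis.
    static class HsvSketch
    {
        // Standard RGB -> HSV conversion. r, g, b in [0, 1]; returns h in degrees, s and v in [0, 1].
        public static void RgbToHsv(double r, double g, double b, out double h, out double s, out double v)
        {
            double max = Math.Max(r, Math.Max(g, b));
            double min = Math.Min(r, Math.Min(g, b));
            double delta = max - min;

            if (delta == 0)    h = 0;
            else if (max == r) h = 60 * (((g - b) / delta) % 6);
            else if (max == g) h = 60 * (((b - r) / delta) + 2);
            else               h = 60 * (((r - g) / delta) + 4);
            if (h < 0) h += 360;

            s = max == 0 ? 0 : delta / max;
            v = max;
        }

        // Hue becomes an angle and saturation the radius, giving 3D coordinates to measure against.
        public static double[] ToCylinderPoint(double h, double s, double v)
        {
            double radians = h * Math.PI / 180.0;
            return new[] { s * Math.Cos(radians), s * Math.Sin(radians), v };
        }
    }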


Original
RGB


HSL
HSV

As you can see, HSL and HSV give essentially the same results, which are both much better than RGB (look closely at the wheel wells, or the buildings in the trees on the right to see the differences). I like to think that HSV is slightly better, but I might be imagining differences that really aren’t there. Either way, I chose to use HSV for the final copy.



Looks good! Certainly a lot closer to the source image – I’m satisfied with this one for now.

Postscript

As with last time, the algorithm is conceptually simple; this time, however, the implementation was considerably more difficult. The problem is that choosing the closest remaining colour to a source pixel is hard to do efficiently, especially since the set of candidate colours changes at every step. I wrote the code in C# for performance this time, but I still had to spend quite a few hours optimizing it to get the program to finish at all. The final version can take 30+ hours to generate an image and peaks at over 4 GB of RAM. I based my code around a kd-tree I found online, then rewrote it to optimize for the 3D, single-nearest-neighbour case as well as to support branch pruning on delete. The rewritten tree – as well as the rest of my code – is available in a repository on GitHub: http://github.com/EricBurnett/AllRGBv2. Feel free to try it out for yourself - if you do, I’d love to hear about it!
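
To give a flavour of what the rewritten tree does, here is a simplified sketch of a 3D kd-tree whose nearest-neighbour search skips tombstoned entries and prunes any branch whose live count has dropped to zero. It’s an illustration written for this post, not the code from the repository, which is considerably more optimized.

    using System;
    using System.Collections.Generic;

    // Simplified sketch of a 3D kd-tree supporting nearest-neighbour queries and deletion.
    // Deleted nodes are flagged rather than removed, and each node tracks how many live
    // entries remain in its subtree so whole branches can be skipped once they are empty.
    class KdNode
    {
        public double[] Point;      // 3 coordinates, e.g. (R, G, B) or cylinder (X, Y, Z)
        public KdNode Left, Right;
        public int Axis;            // splitting dimension: 0, 1 or 2
        public bool Deleted;        // tombstone flag
        public int LiveCount;       // live (non-deleted) nodes in this subtree

        public KdNode(double[] point, int axis) { Point = point; Axis = axis; LiveCount = 1; }
    }

    static class KdTreeSketch
    {
        // Builds a balanced tree over points[lo, hi), cycling the split axis by depth.
        public static KdNode Build(double[][] points, int lo, int hi, int axis = 0)
        {
            if (lo >= hi) return null;
            Array.Sort(points, lo, hi - lo, Comparer<double[]>.Create((a, b) => a[axis].CompareTo(b[axis])));
            int mid = (lo + hi) / 2;
            var node = new KdNode(points[mid], axis)
            {
                Left = Build(points, lo, mid, (axis + 1) % 3),
                Right = Build(points, mid + 1, hi, (axis + 1) % 3)
            };
            node.LiveCount = 1 + (node.Left?.LiveCount ?? 0) + (node.Right?.LiveCount ?? 0);
            return node;
        }

        static double DistSq(double[] a, double[] b)
        {
            double s = 0;
            for (int i = 0; i < 3; i++) { double d = a[i] - b[i]; s += d * d; }
            return s;
        }

        // Returns the closest live (non-deleted) node to 'target', or null if none remain.
        public static KdNode Nearest(KdNode root, double[] target)
        {
            KdNode best = null;
            double bestDistSq = double.MaxValue;
            Search(root, target, ref best, ref bestDistSq);
            return best;
        }

        static void Search(KdNode node, double[] target, ref KdNode best, ref double bestDistSq)
        {
            if (node == null || node.LiveCount == 0) return;    // prune branches with no live nodes

            if (!node.Deleted)
            {
                double d = DistSq(node.Point, target);
                if (d < bestDistSq) { bestDistSq = d; best = node; }
            }

            double diff = target[node.Axis] - node.Point[node.Axis];
            Search(diff <= 0 ? node.Left : node.Right, target, ref best, ref bestDistSq);

            // Only cross the splitting plane if it is closer than the best match found so far.
            if (diff * diff < bestDistSq)
                Search(diff <= 0 ? node.Right : node.Left, target, ref best, ref bestDistSq);
        }

        // Tombstones the node holding 'point', decrementing live counts along the path.
        public static bool Delete(KdNode node, double[] point)
        {
            if (node == null || node.LiveCount == 0) return false;

            bool removed;
            if (!node.Deleted && node.Point[0] == point[0] && node.Point[1] == point[1] && node.Point[2] == point[2])
            {
                node.Deleted = true;
                removed = true;
            }
            else
            {
                double diff = point[node.Axis] - node.Point[node.Axis];
                removed = Delete(diff <= 0 ? node.Left : node.Right, point);
                if (!removed && diff == 0)                      // ties may sit on either side
                    removed = Delete(node.Right, point);
            }
            if (removed) node.LiveCount--;
            return removed;
        }
    }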

Friday, February 12, 2010

Buzz: The perfect spam platform?

Buzz is out: genius, a privacy fiasco, or just another me-too, depending on your view. Those topics have been covered to death already, but one thing I haven't seen talked about is how easy it makes spamming. And I don't mean shouting inanities to all your friends - that's what it's for, after all - I'm talking about targeted spam by spammers, like the blog spam that used to be a huge problem.

Here is the problem as I see it:

  1. You can follow anyone you want simply by adding their email address.
  2. When they "buzz", you are notified.
  3. When you comment, they see your response.

Does this seem like a bad idea to anyone else? I have a feeling that Google is going to have to allow people to "accept" followers, bringing it that much closer to Facebook.

Straining the limits of C#

Two weeks ago I wrote about an algorithm to generate All-RGB images from pictures. I am currently working on a follow-up post about a new algorithm, in C# this time. This one is a bit more computationally intensive, and despite the language change it is running into scaling issues. So while I wait for it to finish, I thought I'd write about a few of them.

Good data structures are hard to find

When you start processing large numbers of items in different ways, choosing the right data structure to store them becomes an absolute necessity. It can mean the difference between an O(n·log(n)) algorithm and an O(n²) one, which can be the difference between taking 1 hour to run and taking 100 years (with ~16.7 million entries, n² is roughly 700,000 times larger than n·log₂(n)). For this project, the requirements were simple – a data structure mapping points to objects that supported nearest-neighbour searches and deletion. To me, that immediately translated to a kd-tree.
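
Spelled out as an interface, that requirement is roughly the following (illustrative names only, not the API of the library I ended up using):

    using System.Collections.Generic;

    // Roughly the contract I needed: map 3D points to values, find the nearest remaining
    // point to a query, and delete points once they have been consumed.
    interface IPointIndex<TValue>
    {
        void Add(double[] point, TValue value);
        KeyValuePair<double[], TValue> Nearest(double[] query);    // closest remaining point
        bool Delete(double[] point);                               // remove once consumed
    }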


Usually in cases like this I end up needing to roll my own structure, but this time I was lucky. After some Googling I found exactly one viable implementation to use, and better yet, it was open source. I'm glad it was; it turned out later that there was a bug1 that needed fixing, and I needed to compile a 64-bit version anyways (I wonder if there's a lesson in here?). It is unfortunate that this was the only option, however. I mean, there are a ton of data structure libraries for most languages you can imagine, but the vast majority of them implement the same set of structures, are buggy, unsupported, and incompatible. I would love to see a Stack Overflow-style site to address this – community edited and supported code, implementations ranked together for comparison, searching by requirements if you don't know what you need, and the list goes on.


But even with the appropriate structure, the algorithm I have chosen will take more than a day to run and use 4+ GB of memory. That is fine; I knew the approximate complexity when I started, but it does lead to the next set of issues.

Good algorithms are hard to find

Or should I say, good implementations of algorithms are hard to find. By way of introduction, a brief digression: my computer is unstable. Not terribly unstable, not enough for me to actually take the time to fix it, but my sound card is slightly unsupported on Windows 7, so every once in a blue moon something like skipping a song will blue-screen the computer. Just about all my programs know how to pick up where they left off, but of course that doesn't hold for the projects I throw together in an evening. So when my computer decided to crash this morning, I decided to add some basic checkpointing. Checkpointing is easy, right? Hah!


Attempt 1: tag classes with [Serializable], run the needed structures through a BinaryFormatter, and stream the result to a file.

So, anyone want to guess what the problem is here? If you said object-graph size, you're right on the money. BinaryFormatter doesn't support object graphs with more than 6M items or so, and arrays get no special treatment. So serializing an array of 16.7M items throws a very unhelpful error message ("The internal array cannot expand to greater than Int32.MaxValue elements")2. Fine, I can unwrap my own arrays easily enough.
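
For concreteness, attempt 1 was shaped roughly like the sketch below, with placeholder types standing in for the real program state:

    using System;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    [Serializable]
    class ColourEntry              // placeholder for the real per-colour objects
    {
        public byte R, G, B;
        public bool Used;
    }

    [Serializable]
    class CheckpointState          // placeholder for the real program state
    {
        public ColourEntry[] Colours;   // ~16.7M entries in the real run
        public long PixelsProcessed;
    }

    static class CheckpointV1
    {
        // Attempt 1: hand the whole object graph to BinaryFormatter in one call. With more
        // than ~6M objects in the graph, this is where the unhelpful "internal array cannot
        // expand" exception appears.
        public static void Save(CheckpointState state, string path)
        {
            using (var file = File.Create(path))
            {
                new BinaryFormatter().Serialize(file, state);
            }
        }

        public static CheckpointState Load(string path)
        {
            using (var file = File.OpenRead(path))
            {
                return (CheckpointState)new BinaryFormatter().Deserialize(file);
            }
        }
    }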


Attempt 2: unwrap arrays manually.

With each array element being serialized as a separate object, the overhead in the file is huge; if I had to guess, I'd say the size on disk is about 10 times the size in memory. And since I'm trying to write about 1 GB of data... you can probably guess where this is going. Something in the output stack explodes once more than 4 GB of data is written, a number suspiciously close to the limit of a 32-bit integer. This is simply a poor implementation, since it's not like I'm trying to mmap the data, and large files have been supported in all modern OSes for years. Not a big deal though: that data is going to be very redundant and I/O is expensive, so writing a compressed stream is probably faster in the long run.


Attempt 3: write to the file using a System.IO.Compression.GZipStream wrapper.

With compressed data, I expect the on-disk size to be comparable to the in-memory size, or a bit better. So the 4 GB limit should be no problem, right? Wrong! The GZipStream has the same problem, and refuses to handle more than 4 GB of uncompressed data. The fix here is even simpler – swap in a better GZip library.


Attempt 4: write to the file using a SharpZipLib.GZipOutputStream wrapper.

Success! The output file is about 700 MB and takes somewhere around 20 minutes to write, for a compression rate of about 9 MB/sec of uncompressed data and space savings of about 93%.
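
Put together, the working path looks something like the sketch below: wrap the FileStream in SharpZipLib's GZipOutputStream and write the big array element by element with a BinaryWriter, rather than handing BinaryFormatter the whole graph. It reuses the placeholder ColourEntry and CheckpointState types from the attempt 1 sketch above, and the layout is illustrative rather than my actual format.

    using System.IO;
    using ICSharpCode.SharpZipLib.GZip;

    static class CheckpointV4
    {
        // Attempt 4: stream through SharpZipLib's GZipOutputStream and write the large array
        // element by element, so nothing has to cope with a multi-gigabyte object graph or a
        // >4 GB uncompressed stream in one piece.
        public static void Save(CheckpointState state, string path)
        {
            using (var file = File.Create(path))
            using (var gzip = new GZipOutputStream(file))
            using (var writer = new BinaryWriter(gzip))
            {
                writer.Write(state.PixelsProcessed);
                writer.Write(state.Colours.Length);
                foreach (var c in state.Colours)        // arrays unwrapped by hand (attempt 2)
                {
                    writer.Write(c.R);
                    writer.Write(c.G);
                    writer.Write(c.B);
                    writer.Write(c.Used);
                }
            }
        }

        public static CheckpointState Load(string path)
        {
            using (var file = File.OpenRead(path))
            using (var gzip = new GZipInputStream(file))
            using (var reader = new BinaryReader(gzip))
            {
                var state = new CheckpointState { PixelsProcessed = reader.ReadInt64() };
                state.Colours = new ColourEntry[reader.ReadInt32()];
                for (int i = 0; i < state.Colours.Length; i++)
                {
                    state.Colours[i] = new ColourEntry
                    {
                        R = reader.ReadByte(), G = reader.ReadByte(),
                        B = reader.ReadByte(), Used = reader.ReadBoolean()
                    };
                }
                return state;
            }
        }
    }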


Now, I could chalk these problems up as a failing of C#, but that wouldn't be accurate. By playing with this much data I am working outside the limits expected by the library designers, and I know it. I have focused on C#, but the issues are far more general than that – I can't even find a 64-bit build of Python 2.6 for Windows to test with, but I'm sure I would run into a different set of problems if I could use it, and the same goes for the rest of the languages out there. The real issue is that versatile algorithm implementation is hard, and it isn't getting much easier with time. And that I don't have a workaround for.


Footnotes

  1. The problem is that "deletions" are supported by tombstoning, so you periodically have to rebuild the index to clean them up. That is fine, except the range() method used to get the current entries out doesn't check the deleted flag! Don't worry, I'll be sending a fix back upstream.
  2. Someone else did the digging, and it seems the list of prime sizes for some internal hash table stops at 6 million, so the next prime it tries to resize to is something enormous (-1 unsigned?). Microsoft decided this is a non-issue, so no fix is coming. Their suggested workaround was to use the NetDataContractSerializer and a binary xml writer, but when I tested it the performance was too terrible to consider.