Two weeks ago I wrote about an algorithm to generate All-RGB images from pictures. I am currently working on a follow-up post about a new algorithm, in C# this time. This one is a bit more computationally intensive, and despite the language change it is running into scaling issues. So while I wait for it to finish, I thought I'd write about a few of them.
Good data structures are hard to find
When you start processing large numbers of items in different ways, choosing the right data structure to store them becomes an absolute necessity. They can mean the difference between an O(n∙log(n)) and an O(n2) and algorithm, which can be the difference between taking 1 hour to run or 100 years. For this project, the requirements were simple – a data structure mapping points to objects that supported nearest-neighbour searches and deletion. To me, that immediately translated to kd-tree.
Usually in cases like this I end up needing to roll my own structure, but this time I was lucky. After some Googling I found exactly one viable implementation to use, and better yet, it was open source. I'm glad it was; it turned out later that there was a bug1 that needed fixing, and I needed to compile a 64-bit version anyways (I wonder if there's a lesson in here?). It is unfortunate that this was the only option, however. I mean, there are a ton of data structure libraries for most languages you can imagine, but the vast majority of them implement the same set of structures, are buggy, unsupported, and incompatible. I would love to see a Stack Overflow-style site to address this – community edited and supported code, implementations ranked together for comparison, searching by requirements if you don't know what you need, and the list goes on.
But even with the appropriate structure, the algorithm I have chosen will take more than a day to run and 4+ GB of memory. That is fine, I knew the approximate complexity when I started, but it does lead to the next set of issues.
Good algorithms are hard to find
Or should I say, good implementations of algorithms are hard to find. By way of introduction, a brief digression: my computer is unstable. Not terribly unstable, not enough to for me to actually take the time to fix it, but my sound card is slightly unsupported on Windows 7 so every once in a blue moon something like skipping a song will blue-screen the computer. Just about all my programs know how to pick up where they left off, but of course that doesn't hold for these projects I throw together in an evening. So when my computer decided to crash this morning, I decided to add some basic checkpointing. Checkpointing is easy, right? Hah!
Attempt 1: tag classes with [Serializable], run needed structures through a BinaryFormatter, streaming to file.
So, anyone want to guess what the problem is here? If you said object-graph size, you're right on the money. BinaryFormatter doesn't support object graphs with more than 6M items or so, and arrays get no special treatment. So serializing an array of 16.7M items throws a very unhelpful error message ("The internal array cannot expand to greater than Int32.MaxValue elements")2. Fine, I can unwrap my own arrays easily enough.
Attempt 2: unwrap arrays manually.
With each array element being serialized as a separate object, the overhead in the file is huge. If I had to guess, I'd say that the size on disk is about 10 times the size in memory. And since I'm trying to write about 1 GB of data...you can probably guess where this is going. Something in the output stack explodes when more than 4 GB of data is written, a number suspiciously close to the max size of an Int32. This is simply poor implementation, since it's not like I'm trying to mmap the data, and large files have been supported in all modern OS' for years. Not a big deal though, that data is going to be very redundant and I/O is expensive, so writing a compressed stream is probably faster in the long run.
Attempt 3: write to the file using a System.IO.Compression.GZipStream wrapper.
With compressed data, I expect the on-disk size to be comparable to the in-memory size, or a bit better. So the 4 GB limit should be no, problem, right? Wrong! The GZipStream has the same problem, and refuses to support more than 4 GB uncompressed. The fix here is even simpler – swap in a better GZip library.
Attempt 4: write to the file using a SharpZipLib.GZipOutputStream wrapper.
Success! The output file is about 700 MB and takes somewhere around 20 minutes to write, for a compression rate of about 9 MB/sec and space savings of about 93%.
Now, I could chalk these problems up as a failing of C#, but that wouldn't be accurate. By playing with this much data I am working outside the limits expected by the library designers, and I know it. I have focused on C#, but the issues are far more general than that – I can't even find a 64-bit version of python 2.6 for Windows to test with at all, but I'm sure I would run into a different set of problems if I could use it, and the same goes for the rest of the languages out there. The real issue is that versatile algorithm implementation is hard, and not getting much easier with time. And that I don't have a workaround for.
- The problem is that "deletions" are supported by tombstoning, so you periodically have to rebuild the index to clean them up. That is fine, except the range() method used to get the current entries out doesn't check the deleted flag! Don't worry, I'll be sending a fix back upstream.
- Someone else did the digging, and it seems the list of prime sizes for some internal hash table stops at 6 million, so the next prime it tries to resize to is something enormous (-1 unsigned?). Microsoft decided this is a non-issue, so no fix is coming. Their suggested workaround was to use the NetDataContractSerializer and a binary xml writer, but when I tested it the performance was too terrible to consider.