Saturday, May 10, 2008

Scaling User Content

Scalability. One little word, and yet it continues to be a major problem plaguing programmers to this day. From Wiktionary:

2. (computing) The ability to support the required quality of service as the system load increases without changing the system.

A desirable property in any system to be sure, and to some extent it's easy enough to do. But there is always a limit, and once hit, the system needs to change. A lot of research has been done on how to scale just about anything – a party, a business, a network – and I don't really want to talk about these. But there is one issue I will talk about: user content.

Let's take a look at one of the earliest examples: the book. At first there were only a couple of books to be found, laboriously copied out by hand and hugely expensive. This in itself wasn't too much of a problem, as very few knew how to read anyway. But the effort involved meant that only the best content would do, and only the elite would ever see it. Reading and writing grew slowly, and over time libraries were built. But as the price of admission fell and the literacy rate rose, more people started reading and writing. And then, around 1440, Johannes Gutenberg invented the printing press. Within a very short period of time the problem had changed – rather than getting access to books, people now had to choose between them, and the quality bar had been lowered. People have been solving this problem ever since; Oprah's Book Club, the New York Times best-seller list, and the Hugo Awards are all ways to separate the wheat from the chaff.

In the internet age, it is so easy to get your words out there that websites need to have a strategy for scaling content beyond the purely technical problems of storage and retrieval, to keep from being drowned in the chaff. Let's take a look at a few popular websites and see how they do it.


Born in an era when most companies were only interested in creating portal sites and search engines were 'good enough', Google was a very different kind of company. Focused on providing good quality search results, and unwilling to let advertisers 'buy' position in the index, the Google search engine was generations better than its competitors. It used innovative algorithms (notably the PageRank algorithm) and a team of first-rate programmers to stay ahead of spammers and keep the search results fresh and relevant. Even when branching out into other tools, Google has always used specialized algorithms to solve its problems and provide quality content for users.

  • Approach: Custom algorithms for filtering content
  • Drawback: Spammers can find flaws in the algorithms, and exploit them for gain
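The core idea behind PageRank is simple enough to sketch in a few lines: a page is important if important pages link to it. The toy link graph below is made up, and real PageRank deals with dead-end pages, enormous graphs, and constant adversarial tweaking, but the iteration is the same in spirit (0.85 is the damping factor from the original paper):

```python
# Toy PageRank sketch: a page's rank is fed by the ranks of the pages
# linking to it, split across each linker's outgoing links.
def pagerank(links, iterations=50, damping=0.85):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

# Hypothetical three-page web: 'c' is linked from both 'a' and 'b',
# so it ends up with the highest rank.
links = {
    'a': ['b', 'c'],
    'b': ['c'],
    'c': ['a'],
}
ranks = pagerank(links)
```

The appeal is that the ranking emerges from the structure of the web itself rather than from anything a spammer can type into their own page – which is also why link farms became the next battleground.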


Originally conceived as a feeder project for Nupedia, an expert-written online encyclopedia (complete with full-time editor), Wikipedia was born in early 2001. It was to be a more 'open' complement to Nupedia where anyone could edit articles, but the idea was not well received by Nupedia's editors and reviewers. Only five days after creation it was given its own name, Wikipedia, and moved to its own site. A popular idea from the very beginning, Wikipedia had over 20,000 articles in its very first year. To keep the content clean of spam, a whole set of tools was created, along with a hierarchy of moderators to use them. The goal was to make it easier to undo vandalism than to create it in the first place, and looking at the current state of Wikipedia, it was highly successful. There is also a series of guidelines for acceptable content, and a court-style system for resolving complaints.

  • Approach: Content moderated by the community itself, with sophisticated tools and guidelines to assist this
  • Drawback: A lot of time and effort goes into keeping the content clean, usually done by a small subset of the site's users
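The "undoing vandalism must be cheaper than creating it" principle falls directly out of how a wiki stores articles: every edit is kept as a revision, so restoring any earlier version is a single operation. A minimal sketch (the class and method names here are illustrative, not MediaWiki's actual design):

```python
# Minimal wiki-style revision history: edits append, nothing is ever
# destroyed, and a revert is just re-appending an older revision.
class Article:
    def __init__(self, text=""):
        self.revisions = [text]

    def edit(self, new_text):
        self.revisions.append(new_text)

    def revert(self, revision=-2):
        # Reverting is one cheap call, however elaborate the vandalism was.
        self.revisions.append(self.revisions[revision])

    @property
    def current(self):
        return self.revisions[-1]

page = Article("Accurate article text.")
page.edit("VANDALIZED!!!")
page.revert()   # one call restores the previous revision
```

A vandal has to compose their edit by hand; a moderator undoes it with one click. That asymmetry is what lets a small subset of users keep a huge site clean.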


Created by Mark Zuckerberg as a way to connect with friends at Harvard University, Facebook rapidly expanded to other universities, high schools, and eventually, anyone who wanted to join. It works on the principle that people are grouped by the networks they are part of, and these can be used to find people you know. Users 'friend' each other, identifying that they do indeed know one another and giving permission to access each other's profile. This keeps the content remarkably relevant and spam free, since each user sees only the actions of people they personally know.

  • Approach: Content associated with its author, and users manually select which authors they want to see content by
  • Drawback: Not well suited to finding new associations, instead good for maintaining ones created externally
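Filtering by authorship is the simplest strategy of the four to sketch: content is only ever shown to a user if its author is in that user's friend set. The function and data shapes below are illustrative, not Facebook's actual implementation:

```python
# Friend-graph content filter: a user's feed contains only posts
# whose author that user has explicitly friended, so a stranger's
# spam simply never appears.
def feed_for(user, friendships, posts):
    friends = friendships.get(user, set())
    return [p for p in posts if p['author'] in friends]

friendships = {'alice': {'bob', 'carol'}}
posts = [
    {'author': 'bob',      'text': 'Lunch tomorrow?'},
    {'author': 'spammer1', 'text': 'BUY NOW!!!'},
    {'author': 'carol',    'text': 'New photos up'},
]
alice_feed = feed_for('alice', friendships, posts)
```

Note that no ranking or voting is needed at all – the quality filter is the friendship itself, which is exactly why this model is weak at surfacing content from people you don't already know.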


Digg was one of the first widely used tools for people to share interesting content they found on the internet. Users can 'digg' an article, picture, or video they find, and subsequent users can 'digg' or 'bury' it depending on whether or not they like it. This kind of social voting brings the most interesting articles to the top and leaves the rest behind. The same kind of voting system is used on the comments for each submission, highlighting interesting ones and hiding the rest.

  • Approach: Submitted content is voted up or down by the rest of the community, filtering out what's interesting
  • Drawback: Using community opinion leads to a groupthink mentality, promoting content that reinforces the community's opinions
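Mechanically, this style of social voting amounts to sorting by net score and hiding anything buried past a threshold. The sketch below is illustrative – the threshold and field names are assumptions, and Digg's real promotion algorithm was deliberately more opaque to resist gaming:

```python
# Digg-style community voting sketch: sort submissions by net votes,
# hide anything the community has buried heavily.
HIDE_THRESHOLD = -3   # assumed cutoff, not a real Digg parameter

def front_page(submissions):
    visible = [s for s in submissions
               if s['ups'] - s['downs'] > HIDE_THRESHOLD]
    return sorted(visible,
                  key=lambda s: s['ups'] - s['downs'],
                  reverse=True)

stories = [
    {'title': 'Cool physics demo', 'ups': 120, 'downs': 10},
    {'title': 'Obvious spam',      'ups': 1,   'downs': 40},
    {'title': 'Niche tutorial',    'ups': 15,  'downs': 5},
]
ordered = front_page(stories)
```

The spam never reaches the page, but notice that the same mechanism buries anything the majority merely disagrees with – the groupthink drawback is built right into the sort key.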

This is by no means all of the approaches out there, just a few that I think are worth noting. For example, Slashdot uses a more sophisticated version of comment voting that takes into account how 'correct' a user's opinions usually are. Each one has its own set of drawbacks, but they are all better than doing nothing. I am excited to see what innovative solutions people will come up with in the coming years. I mentioned in my last post that I thought blog comments had a systemic flaw; they just don't scale with one moderator and n users. I'm sure someone will find a good solution for this problem soon, whether it is comment voting à la Digg or something entirely new. But until then, manual moderation it is.
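The Slashdot refinement mentioned above – weighting a vote by how 'correct' the voter has historically been – can be sketched as a karma-weighted score. This is entirely illustrative; Slashdot's real system uses bounded scores, per-moderator point budgets, and metamoderation on top:

```python
# Karma-weighted comment voting sketch: each vote is +1 or -1, scaled
# by the voter's karma, with a cap so no single account dominates.
def comment_score(votes):
    """votes: list of (direction, voter_karma) pairs."""
    score = 0.0
    for direction, karma in votes:            # direction is +1 or -1
        weight = min(max(karma, 0.0), 5.0)    # cap one voter's influence
        score += direction * weight
    return score

# A downvote from a low-karma account barely dents a comment that a
# few trusted users upvoted.
votes = [(+1, 4.0), (+1, 3.5), (-1, 0.5)]
```

This is one plausible way out of the one-moderator-and-n-users problem: moderation capacity grows with the community instead of staying constant, while karma keeps drive-by accounts from swamping the trusted ones.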
