Tepom.com

Personal finance advice for the average American.

Friday, August 15, 2008

Buying tickets online and translating old newspapers

Do you remember the last time you bought something online and had to prove that you were a human? You know the drill: sites will display a set of hard-to-read characters that you must read and type into a text box before the transaction can continue. This is an important feature, especially for ticketing websites, because it keeps hackers' programs from accessing the site, following the appropriate links, and buying all of the tickets before anyone else can.

By making the purchaser read and reenter text that is irregular and slightly challenging to discern, programs designed to cheat the system are stopped in their tracks. Most people don't mind this extra step, which takes, on average, ten seconds to complete. After all, it levels the playing field for everyone interested in buying those Hannah Montana tickets the instant they become available on Ticketmaster.

But did you know that when you're entering this text, you're actually helping the newspaper industry? Many longstanding newspapers have been converting their archives of old printed pages to digital text, allowing them to be indexed, searched, and browsed in a modern fashion. By using Optical Character Recognition (OCR) software, just like your home scanner may have, workers are able to scan the old pages and let the computer translate the print to editable, searchable text.

Quite frequently the computers come across a word that they are not able to translate. After all, some of these newspapers are dozens of years old and not in the best condition. Imagine how much time it would take to manually evaluate and correct the blurry, ink-smeared words in 50 years' worth of newspapers! It would take thousands of man hours. Who's got time and budget for that? Well, the Computer Science department at Carnegie Mellon University found a way to procure some pretty cheap labor. And it includes you and me!

CMU developed a system that extracts the words that the newspapers' OCR software can't identify and places them into a package that is provided to websites with the "prove you're a human" security measure. Instead of entering randomly generated characters, users digitize the OCR-unreadable words from an old newspaper or book. The thousands of hours that humans spend typing in random text are now productive and helpful. To ensure accuracy, each unreadable word is shown to multiple humans. As long as there is consensus on its translation, the newly digitized word is returned to the New York Times or whoever asked for its translation.

Additionally, computer scientists at CMU argue that humans actually spend less time filling in the letters with this model because it is easier to type an actual word than it is a random alphanumeric phrase.
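
For the programmers in the audience, here is a rough sketch of what that "consensus" step might look like. To be clear, this is my own illustration and not CMU's actual code: the function name, the minimum of three answers, and the 75% agreement cutoff are all assumptions made up for the example.

```python
from collections import Counter


def consensus_transcription(answers, min_votes=3, min_agreement=0.75):
    """Return the word most people typed, or None if there is no consensus yet.

    Hypothetical helper: min_votes and min_agreement are illustrative
    thresholds, not values taken from the real system.
    """
    # Normalize so that "Morning " and "morning" count as the same answer,
    # and drop answers that are empty after trimming.
    cleaned = [a.strip().lower() for a in answers if a.strip()]

    # Too few people have seen the word so far; keep showing it.
    if len(cleaned) < min_votes:
        return None

    # Find the most common answer and check what share of people gave it.
    word, count = Counter(cleaned).most_common(1)[0]
    if count / len(cleaned) >= min_agreement:
        return word
    return None


if __name__ == "__main__":
    # Four people were shown the same smeared word from an old page.
    print(consensus_transcription(["morning", "Morning ", "morning", "mourning"]))
```

In this sketch, a word that doesn't reach the cutoff simply stays in the pool and gets shown to more people until enough of them agree.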

So the next time you post a blog comment or buy tickets online and have to fill in that annoying text box, know that you're contributing something to the information age. You're helping to index the writing and reporting of our pre-PC, typewriting ancestors.

2 Comments:

  • At August 15, 2008 11:16 AM , Blogger Steve said...

    The CMU program is called reCAPTCHA. http://recaptcha.net/captcha.html and http://recaptcha.net/learnmore.html.

    There are several other places that use human distributed work, like www.galaxyzoo.com, which first teaches you to recognize types of galaxies, then has you classify a library of un-cataloged images. As of August 2007, 80,000 users had classified 10 million images (http://en.wikipedia.org/wiki/Galaxy_Zoo). This was inspired largely by a NASA project, Stardust@Home (http://en.wikipedia.org/wiki/Stardust%40home).

    These projects were largely based on distributed computing programs, like SETI@Home (http://en.wikipedia.org/wiki/SETI@home) and Folding@Home (http://en.wikipedia.org/wiki/Folding@home), which simply require users to donate processor time when their computers are idle. These have effectively become the world's most powerful supercomputers.

    All of these models get work for free. There are some startups looking at giving money to people for small amounts of work, with similar distributed networks (http://www.wired.com/science/discoveries/news/2000/06/37293).

    There is another company based on distributed engineering work (www.innocentive.com). It is more like an agency for free-lance writers than a distributed work network, but I thought I would throw it in as another interesting example of the internet allowing geographically distributed professional quality work to be done on a new scale.

    Steve
    www.iHateWheat.com
    stevescookingjournal.blogspot.com

     
  • At August 15, 2008 11:23 AM , Blogger Scott Bliss said...

    I've heard of some of those distributed computing projects before. I think recently, a woman made some significant space discovery while doing one of these free work-from-home projects. I believe that she was browsing photos from the Hubble telescope.

    What I found most interesting about the CMU CS department's idea was that it was harnessing the power of work that was A) already being done, and B) essentially wasting time. It's almost like harnessing a new form of energy. I mean, the wind is going to blow anyway... why not use it to generate energy?

     
