reCAPTCHA (a.k.a. Those Infernal Squiggly Words) Almost Done Digitizing the New York Times Archive

You've done it so many times, at so many sites across the Internet, that chances are you don't even think about it anymore: deciphering and typing in a "CAPTCHA," those squiggly, mucked-up words presented each time you buy tickets online, write a blog comment, or join a social network. Their purpose is clear: they tell Web sites that you are a person and not a computer, theoretically cutting down on spam. More perceptive Web users may have noticed that sometimes the garbled strings appear in pairs, with one looking more like it's been scanned out of a library book or old newspaper, perhaps with some sloppy underlining or stray pen marks. The latter is a variant known as "reCAPTCHA," and for two years it has been performing double duty, both authenticating you and helping to digitize old printed material at the same time. Far from just wasting your time, it has now helped digitize almost all of the New York Times archives.

Both CAPTCHA (which stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart) and reCAPTCHA are the invention of Luis von Ahn, a Carnegie Mellon computer scientist and MacArthur "genius grant" recipient. "A couple hundred million CAPTCHAs are typed daily around the world," von Ahn tells NEWSWEEK. "The first time I did the calculations, I felt quite proud. And then I felt bad because people really find these annoying." They're also wasteful. It takes about 10 seconds to type a CAPTCHA (more, obviously, if you err and have to start over), meaning a total of some 500,000 human hours per day are spent typing them in. As a point of comparison, according to von Ahn, the Empire State Building took 7 million human hours to build. "Life is only like 700,000 hours," he says. "It's almost the equivalent of a life. We thought, is there any way we can use this human effort in a way that's good for humanity?"
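For readers who want to check the math, here is a rough back-of-the-envelope sketch. It assumes "a couple hundred million" means exactly 200 million CAPTCHAs a day, which is our reading rather than a figure von Ahn gave; the variable and function names are ours too.

```python
# Back-of-the-envelope check of the figures quoted above.
CAPTCHAS_PER_DAY = 200_000_000   # assumption: "a couple hundred million" read as 200 million
SECONDS_PER_CAPTCHA = 10         # from the article
EMPIRE_STATE_HOURS = 7_000_000   # von Ahn's comparison figure


def daily_human_hours(captchas_per_day, seconds_each):
    """Total human hours spent typing CAPTCHAs in one day."""
    return captchas_per_day * seconds_each / 3600


hours_per_day = daily_human_hours(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
days_per_empire_state = EMPIRE_STATE_HOURS / hours_per_day

print(f"{hours_per_day:,.0f} human hours per day")
print(f"{days_per_empire_state:.1f} days of typing per Empire State Building")
```

Under that assumption the total comes to roughly 556,000 hours a day, in line with the "some 500,000" quoted above, or about one Empire State Building's worth of labor every 12 to 13 days.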

Turns out, there is. Recognizing distorted words is one of the (dwindling number of) things that the human brain can still do better than computers. In order to make old books, newspapers, and other texts searchable, pages are scanned and fed into optical character-recognition software. Because ink and paper degrade over time, some words remain inscrutable. The reCAPTCHA system presents Web users with two words: one word that computers can't read, and one that they can. So long as you type the known word in correctly, and a few other people agree with you on the unknown word, you have helped digitize an archival page. And, von Ahn says, typing in two words instead of one doesn't cost you a significant amount of extra time.
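The two-word scheme can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not reCAPTCHA's actual code: the class name, the method names, and the three-vote threshold are all hypothetical (the article says only that "a few other people" must agree).

```python
from collections import Counter

# Hypothetical threshold: the article says only "a few other people" must agree.
AGREEMENT_THRESHOLD = 3


def normalize(word):
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return word.strip().lower()


class Challenge:
    """One word pair: a control word the computer knows, and one it could not read."""

    def __init__(self, control_answer):
        self.control_answer = normalize(control_answer)
        self.votes = Counter()  # guesses for the unknown word, from users who passed

    def submit(self, control_guess, unknown_guess):
        """Return True if the user passes; count their unknown-word guess only then."""
        if normalize(control_guess) != self.control_answer:
            return False  # failed the known word, so the other guess is discarded
        self.votes[normalize(unknown_guess)] += 1
        return True

    def accepted_transcription(self):
        """Return the unknown word once enough users agree on it, else None."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= AGREEMENT_THRESHOLD else None


challenge = Challenge(control_answer="morning")
challenge.submit("morning", "overlooks")
challenge.submit("mourning", "overlocks")  # wrong control word: vote ignored
challenge.submit("Morning", "overlooks")
challenge.submit("morning", "overlooks")
print(challenge.accepted_transcription())  # overlooks
```

The point of the design is that the known word does the authenticating while the unknown word rides along for free: a guess counts toward the transcription only if the same person got the control word right.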

Von Ahn is tough to pin down on a number of details: he won't say how much the New York Times pays for reCAPTCHA's services, nor the precise amount of progress it has made in digitizing its 150 years of pages. But he hinted in his recent talk at the 2009 PopTech conference that the project was on track to finish by the end of 2009 or slightly later.

reCAPTCHA, which is free for Web sites to implement, is being used by Facebook, Craigslist, Twitter, and more than 100,000 other sites. In September, it was acquired by Google, which has massive human proofreading needs in its Google Books and Google News Archive projects. At some 40 million deciphered words a day, and approximately 100,000 words per book, that means the reCAPTCHA army could in theory chew through about 400 books a day, or well over 100,000 books per year.
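As a rough check on that throughput estimate, the sketch below simply divides the two quoted figures. It assumes every deciphered word counts toward a finished book and that the pace holds 365 days a year, neither of which the article states.

```python
# Rough throughput estimate from the two figures quoted above.
WORDS_PER_DAY = 40_000_000   # deciphered words per day, from the article
WORDS_PER_BOOK = 100_000     # approximate words per book, from the article
DAYS_PER_YEAR = 365          # assumption: the pace holds every day of the year

books_per_day = WORDS_PER_DAY // WORDS_PER_BOOK
books_per_year = books_per_day * DAYS_PER_YEAR

print(books_per_day)   # 400
print(books_per_year)  # 146000
```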

It's been said that we shouldn't ask what's next in terms of what the Internet and technology will be able to do but instead try to understand what we've already got and figure out how to put it to good use. Von Ahn's efforts surely prove that point. They also show that in some ways, we can help computers as much as they help us.
