October 8, 2007

Positivity: Spam weapon helps preserve books

Filed under: Positivity — Tom @ 5:56 am

From Pittsburgh, PA, via the BBC:

Published: 2007/10/02 10:01:32 GMT

A weapon used to fight spammers is now helping university researchers preserve old books and manuscripts.

Many websites use an automated test to tell computers and humans apart when signing up to an account or logging in.

The test consists of typing in a few random letters in an image and is designed to fight spammers.

Carnegie Mellon is using this test to help decipher words in books that machines cannot read, by letting websites use those words to authenticate log-ins.

The test, known as a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help keep out automated programs known as “bots.”

Spam messages

Bots are designed by spammers to post advertisements in discussion forums or to sign up for large numbers of e-mail addresses which are later used to send spam messages.

A CAPTCHA consists of an image containing letters or numbers which have been heavily distorted, making it hard or impossible for a bot to “read.”

Requiring web site visitors to type in the contents of the CAPTCHA before being allowed into the site admits humans while rebuffing all but the smartest bots.

CAPTCHAs are unpopular with many Internet users because the words they contain are often so heavily distorted to foil bots that many humans struggle to read them.

This means potential visitors’ time is wasted while they make repeated attempts to decipher the CAPTCHA they are presented with.

But the CMU research team, based in Pittsburgh, Pennsylvania, has devised an ingenious system to put the time used interpreting CAPTCHAs to good use.

Text files

The team is involved in digitising old books and manuscripts supplied by a non-profit organisation called the Internet Archive, and uses Optical Character Recognition (OCR) software to examine scanned images of texts and turn them into digital text files which can be stored and searched by computers.

But the OCR software is unable to read about one in 10 words, due to the poor quality of the original documents.

The only reliable way to decode them is for a human to examine them individually – a mammoth task since CMU processes thousands of pages of text every month.

To solve this problem the team takes images of the words which the OCR software can’t read, and uses them as CAPTCHAs.

These CAPTCHAs, known as reCAPTCHAs, are then distributed to websites around the world to be used in place of conventional CAPTCHAs.

When visitors decipher the reCAPTCHAs to gain access to the web site, the answers – the results of humans examining the images – are sent back to CMU.

Every time an Internet user deciphers a reCAPTCHA, another word from an old book or manuscript is digitised.

Deciphered correctly

To ensure that the reCAPTCHAs are deciphered correctly, website visitors are actually presented with images of two words to examine, one of which is already known.

“If a person types the correct answer to the one we already know, we have confidence that they will give the correct answer to the other,” says Luis von Ahn, a Professor at CMU.

“We send the same unknown words to two different people, and if they both provide the same answer then effectively we can be sure that it is correct.”
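The two-word check described above can be sketched in a few lines of Python. This is only an illustration based on the description in the article, not CMU's actual code: the class name WordPool, the submit method, the word IDs, and the case-insensitive comparison are all assumptions made for the example. A visitor's answer for the unknown word counts only if the known word was typed correctly, and the unknown word is accepted once two visitors agree on it.

```python
from collections import defaultdict


def normalize(text):
    # Assumption: answers are compared ignoring case and surrounding spaces.
    return text.strip().lower()


class WordPool:
    """Hypothetical tracker for unknown words awaiting human agreement."""

    def __init__(self, required_matches=2):
        # Number of identical answers needed before a word is accepted.
        self.required_matches = required_matches
        # votes[word_id][answer] -> how many trusted visitors typed that answer
        self.votes = defaultdict(lambda: defaultdict(int))
        # resolved[word_id] -> accepted transcription
        self.resolved = {}

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Return True if the visitor passes the test (control word correct)."""
        if normalize(control_typed) != normalize(control_answer):
            # The known word was wrong, so the other answer is not trusted.
            return False
        if unknown_id not in self.resolved:
            guess = normalize(unknown_typed)
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.required_matches:
                self.resolved[unknown_id] = guess
        return True


pool = WordPool()
first = pool.submit("morning", "morning", "word-1", "upon")
failed = pool.submit("river", "rivr", "word-1", "upon")
second = pool.submit("river", "River", "word-1", "Upon")
```

In this run the first visitor passes but the word stays unresolved, the second visitor fails the known word and is ignored, and the third visitor's matching answer settles "word-1" as "upon".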

Go here for the rest of the story.
