Digitising Books One Word at a Time
A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a program that can tell whether its user is a human or a computer. You’ll have seen them before: distorted text at the bottom of sign-up and contact forms. CAPTCHAs are used to prevent abuse from "bots", automated programs usually written to generate spam. No computer program can read distorted text as well as humans can, so bots cannot use sites protected by CAPTCHAs.
About 60 million CAPTCHAs are solved by humans around the world every day, and each one takes roughly ten seconds of human time. Individually, that's not a lot of time, but together they consume more than 150,000 hours of work each day. What if we could make positive use of this human effort?
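The daily total follows directly from the two figures above. As a quick check, the arithmetic can be written out as below; the variable names are only illustrative.

```python
# Figures taken from the text above.
captchas_per_day = 60_000_000   # CAPTCHAs solved worldwide each day
seconds_per_captcha = 10        # rough human time spent on each one

total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600  # 3600 seconds in an hour

print(f"{total_hours:,.0f} hours per day")  # prints "166,667 hours per day"
```

That is about 166,667 hours, which is consistent with the "more than 150,000 hours" stated above.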
For this reason we are now using reCAPTCHA, which channels the effort spent solving CAPTCHAs online into "reading" books.
To make information more accessible to the world and to archive what exists, multiple projects are currently digitising physical books. The book pages are being photographically scanned, and then, to make them searchable, transformed into text using "Optical Character Recognition" (OCR). The problem is that OCR is not perfect.
reCAPTCHA improves the process of digitising books by sending words that cannot be read by computers to the web in the form of CAPTCHAs for humans to decipher. Each word that OCR cannot read correctly is used as a CAPTCHA.
But how does the computer know the answer to the CAPTCHA if it can’t read the word in the first place? Each new word that cannot be read correctly by OCR is given in conjunction with another word for which the answer is already known. You are then asked to read both words. If you correctly solve the one for which the answer is known, the system assumes your answer is also correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
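The two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual implementation: the function names, the case-insensitive comparison, and the thresholds (at least 3 answers, 75% agreement) are all assumptions made for the example.

```python
from collections import Counter


def check_submission(control_answer, known_answer):
    """Return True if the user typed the known (control) word correctly."""
    return control_answer.strip().lower() == known_answer.strip().lower()


def record_guess(votes, unknown_guess):
    """Count one trusted guess for the word OCR could not read."""
    votes[unknown_guess.strip().lower()] += 1


def resolve(votes, min_votes=3, min_share=0.75):
    """Return the agreed reading, or None if confidence is still too low.

    min_votes and min_share are hypothetical thresholds for this sketch.
    """
    total = sum(votes.values())
    if total < min_votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count / total >= min_share else None


# Each pair is (answer for the known word, answer for the unknown word).
submissions = [
    ("morning", "upon"),
    ("morning", "upon"),
    ("mourning", "open"),  # control word wrong, so this guess is ignored
    ("morning", "upon"),
]

votes = Counter()
for control, guess in submissions:
    if check_submission(control, "morning"):
        record_guess(votes, guess)

print(resolve(votes))  # prints "upon"
```

A guess for the unknown word only counts when the same person got the known word right, and the word is accepted only once enough independent people agree on it.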
Currently, reCAPTCHA is helping to digitise books from the Internet Archive.

