open...: The Digital Library of India

There's plenty of noise in the press (and on blogs) about the Google Book project and the Million Book Project. Both are interesting and laudable (well, those bits in the public domain, at least), but what about elsewhere?

Here's an interesting piece about the Digital Library of India (DLI) initiative. It raises, for example, an issue I bet you've never considered before - I know I hadn't:

Designing an accurate OCR in the Indian languages is one of the greatest challenges in computer science. Unlike European languages, Indian languages have more than 300 characters to distinguish, a task that is an order of magnitude greater than distinguishing 26 characters. This also means that the training set needed is significantly larger for Indian languages. It is estimated that at least a ten million-word corpus would be needed in any font to recognize Indian languages with an acceptable level of accuracy. DLI is expected to provide such a phenomenally large amount of data for training and testing of OCRs in Indian Languages. Many of the contents, besides scanned images, have been manually entered for this purpose. Using this extremely large repertoire of data, a Kannada OCR has been developed.
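
To put rough numbers on the "order of magnitude" claim, here is a quick back-of-the-envelope sketch in Python. The two counts (26 and 300) come from the passage above, where 300 is only a lower bound. Counting the pairs of characters a recognizer could mix up is my own illustrative assumption, not something the DLI piece does.

```python
from math import comb

# Character counts taken from the quoted passage; 300 is a lower bound
# ("more than 300 characters").
LATIN_CLASSES = 26
INDIC_CLASSES = 300


def confusable_pairs(n_classes: int) -> int:
    """Number of distinct character pairs a recognizer could confuse.

    Illustrative assumption: every pair of classes is a potential confusion.
    """
    return comb(n_classes, 2)


# How many times more classes there are to tell apart.
ratio = INDIC_CLASSES / LATIN_CLASSES

latin_pairs = confusable_pairs(LATIN_CLASSES)
indic_pairs = confusable_pairs(INDIC_CLASSES)

print(f"classes: {INDIC_CLASSES} vs {LATIN_CLASSES} ({ratio:.1f}x)")
print(f"pairs:   {indic_pairs} vs {latin_pairs} ({indic_pairs / latin_pairs:.0f}x)")
```

On these assumptions, the class count grows by a bit more than a factor of ten, while the number of possible pairwise mix-ups grows by roughly a factor of a hundred, which gives some feel for why the training corpus has to be so much larger.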

(Via Open Access News.)

30 November 2006
