Here's an interesting piece about the Digital Library of India (DLI) initiative. It raises, for example, an issue I bet you've never considered before, and I know I haven't:
Designing an accurate OCR in the Indian languages is one of the greatest challenges in computer science. Unlike European languages, Indian languages have more than 300 characters to distinguish, a task that is an order of magnitude greater than distinguishing 26 characters. This also means that the training set needed is significantly larger for Indian languages. It is estimated that at least a ten million-word corpus would be needed in any font to recognize Indian languages with an acceptable level of accuracy. DLI is expected to provide such a phenomenally large amount of data for training and testing of OCRs in Indian Languages. Many of the contents, besides scanned images, have been manually entered for this purpose. Using this extremely large repertoire of data, a Kannada OCR has been developed.
(Via Open Access News.)
You can download books from DLI in PDF using @ABS DLI Downloader.
Visit http://alokshukla.wordpress.com/2009/12/11/where-knoledge-is-free-digital-library-of-india/ .
thanks