
29 January 2008

Deciphering OCR Cuneiform

One area of weakness for open source is OCR, so the news that OCR CuneiForm is now available as freeware, and will be released as open source, is highly welcome.

13 December 2007

Building the Zotero Commons

One of the many insights that have come out of open source is what might be called the "pebble on the cairn" effect - the idea that by combining the small, even negligible, individual efforts we can create something large and durable.

Here's a perfect example that builds on the fact that scholars very often scan books in the public domain during the course of their research, but then don't do anything with those scans. What if they were all brought together, and then fed into an OCR system?

If many researchers have had to scan rare documents or books for their own perusal, there’s a potential treasure trove of material that exists among their combined efforts. Rather than let all that scholarship rot, or waste away in data files, the university’s Center for History and New Media sees an opportunity to create an open archive of scholarly resources in the public domain.

...

In partnership with the Internet Archive, and with funding from the Andrew W. Mellon Foundation, the center is creating a way for scholars to upload existing data files to be optically scanned (to make them text-searchable) and stored in a database available to the public.

Even better is the fact that open source software can be used to realise this idea:

The vehicle for the new environment will be the Zotero plug-in for the Firefox browser, also developed by the center. The software stores Web pages, collects citations and lets scholars annotate and organize online documents. A new feature of the plug-in will allow people to collaborate and share materials through a dedicated server. Building on that functionality, according to Cohen, the system will allow scholars to drag and drop documents onto an icon in Zotero that essentially sends them to the Internet Archive for storage and free optical character recognition.

The eventual result of the project, called Zotero Commons, could be a reduced need for research trips, Cohen suggested.

(Via Open Access News.)

12 April 2007

Recognising Google's True Character

It's easy to become apprehensive about the massive and growing power of Google. After all, its operating plan is essentially to know everything about everything that happens online - and, as a consequence, offline. I certainly share those concerns, but it's also important to note that the company continues to make moves that contribute to the free software commons.

The latest one is pretty cool:

We're happy to announce the OCRopus OCR Project, a Google-sponsored project to develop advanced OCR technologies in the IUPR research group, headed by Prof. Thomas Breuel at the DFKI (German Research Center for Artificial Intelligence, Kaiserslautern, Germany).

The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.

Just as important is the choice of base platform:

We are initially targeting Linux x86 and x86/64 and are developing under Ubuntu 6.10. The code should be easily portable to other Linux distributions and other platforms. If you're interested in taking responsibility for another platform, please let us know.

OCR is an area where free software still lags somewhat behind proprietary code: Google's latest gift to the community is therefore highly welcome - even if it will ultimately help Google know even more about documents, and hence about us. (Via Matt Asay.)

30 November 2006

The Digital Library of India

There's plenty of noise in the press (and blogs) about the Google Book project and the Million Book Project. These are all interesting and laudable (well, those parts of them in the public domain, at least), but what about elsewhere?

Here's an interesting piece about the Digital Library of India (DLI) initiative. Here, for example, is an issue I bet you've never considered before - I know I haven't:

Designing an accurate OCR in the Indian languages is one of the greatest challenges in computer science. Unlike European languages, Indian languages have more than 300 characters to distinguish, a task that is an order of magnitude greater than distinguishing 26 characters. This also means that the training set needed is significantly larger for Indian languages. It is estimated that at least a ten million-word corpus would be needed in any font to recognize Indian languages with an acceptable level of accuracy. DLI is expected to provide such a phenomenally large amount of data for training and testing of OCRs in Indian Languages. Many of the contents, besides scanned images, have been manually entered for this purpose. Using this extremely large repertoire of data, a Kannada OCR has been developed.

(Via Open Access News.)