LACE Greek OCR project

On a better note, we live in blessed times where technology and the ancient world are concerned. The results of a project to OCR volumes of ancient Greek from Archive.org may now be found online here. Clicking on the first entry, and then on one of the outputs in it, gives astonishingly good results.

Ancient Greek OCR – progress, perhaps!

A correspondent has sent me a very interesting message from a Bruce Robertson, taken from the Digital Classicists list, which I think might interest people here.

Federico and I have been working quite a bit on Greek OCR this past year, and have made some advances since the publications below. We now have a process on Compute Canada servers using my Gamera-based ‘Rigaudon’ code

https://github.com/brobertson/rigaudon

This process undertakes OCR at multiple levels of darkness, uses a weighted Levenshtein distance correction system that I’ve worked on, and when possible it combines Greek and Latin-script OCR to produce a good mixed result, preserving information in the app. crit.
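For readers who have not met the idea, a weighted Levenshtein distance can be sketched roughly as below. This is not Rigaudon's code: the function names, the confusion table and its 0.2 cost are all assumptions made up for illustration, and accents are left off the Greek words to keep the example short. The point is only that substitutions between glyphs an OCR engine tends to confuse are made cheaper than other edits, so the nearest dictionary word is more likely to be the right one.

```python
def weighted_levenshtein(source, target, sub_cost=None, indel_cost=1.0):
    """Edit distance where some substitutions are cheaper than others.

    sub_cost maps (char_a, char_b) pairs to a cost; pairs not listed
    cost 1.0. The table is treated as symmetric.
    """
    sub_cost = sub_cost or {}
    # prev[j] holds the distance between source[:i-1] and target[:j]
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [i * indel_cost]
        for j, t in enumerate(target, 1):
            if s == t:
                cost = 0.0
            else:
                cost = sub_cost.get((s, t), sub_cost.get((t, s), 1.0))
            curr.append(min(prev[j] + indel_cost,      # delete s
                            curr[j - 1] + indel_cost,  # insert t
                            prev[j - 1] + cost))       # substitute s -> t
        prev = curr
    return prev[-1]


def best_correction(word, lexicon, sub_cost=None):
    """Return the lexicon entry closest to word under the weighted distance."""
    return min(lexicon, key=lambda cand: weighted_levenshtein(word, cand, sub_cost))


# Hypothetical confusion table: assume sigma and omicron are easily confused.
CONFUSIONS = {("σ", "ο"): 0.2}
OCR_WORD = "λσγος"             # imagined OCR misreading
LEXICON = ["λαγος", "λογος"]   # unaccented forms, for simplicity

print(best_correction(OCR_WORD, LEXICON, CONFUSIONS))
```

With plain, unweighted Levenshtein both candidates would sit at distance 1 from the misread word; the cheaper sigma/omicron substitution is what breaks the tie.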

This group is probably most interested in looking over the results.

Here’s a typical volume in Teubner serif font, which took about 2h to run on our 40 cores (all results are pure machine output, without manual spellcheck or other human intervention):

http://heml.mta.ca/Rigaudon/Views/SideBySide/alciphronisrheto00alciuoft_2013-02-09-19-28_Kaibel_Round_4_sidebyside/alciphronisrheto00alciuoft_0016.html

Here’s a rather challenging papyrological text:

http://heml.mta.ca/Rigaudon/Views/SideBySide/griechischeurkun00mitt_2013-02-07-06-30_Kaibel_Round_4_sidebyside/griechischeurkun00mitt_0063.html

We also have Teubner sans font texts working:

http://heml.mta.ca/Rigaudon/Views/SideBySide/metrodoriepicure00metruoft_2013-02-27-21-33_TeubnerSans_2012_12_29_no_sigmas_cnn_sidebyside/metrodoriepicure00metruoft_0033.html

And the more challenging Didot foundry:

http://heml.mta.ca/Rigaudon/Views/SideBySide/scholiaintheocri00buss_2013-03-08-00-07_Didor_sidebyside/scholiaintheocri00buss_0046.html

And of course, Oxford:

http://heml.mta.ca/Rigaudon/Views/SideBySide/workswithenglish03juliuoft_2013-02-09-12-56_OCT6_sidebyside/workswithenglish03juliuoft_0120.html

If you’re into bleeding-edge experiments, here’s some Migne:

http://heml.mta.ca/Rigaudon/Views/SideBySide/migne_2013-03-18-06-50_Migne4_No_Latin_sidebyside/migne_0001.html

There are many more, along with some experiments (successful or otherwise) at:

http://heml.mta.ca/Rigaudon/Views/SideBySide/

I have set up a public spreadsheet for OCR requests from archive.org volumes, and I’d be delighted if anyone on the list wanted to add a request, or just email me with a request. Output will be in standard hOCR or plain text.
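For anyone wondering what can be done with hOCR output: it is ordinary HTML in which each recognised word usually sits in a span of class `ocrx_word`, with its position on the page given as a `bbox` in the `title` attribute. The following is a minimal sketch of reading it with nothing but the Python standard library. The sample markup is invented, and I am assuming that word spans contain no nested spans; real output, from Rigaudon or anything else, may need more care.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Pull 'bbox x0 y0 x1 y1' out of an hOCR title attribute, if present."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from spans of class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attr.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        if self._in_word:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word span.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._in_word = False


# Invented sample, not taken from any real volume.
SAMPLE = """<span class='ocr_line' title='bbox 10 10 200 40'>
<span class='ocrx_word' title='bbox 10 10 80 40'>arma</span>
<span class='ocrx_word' title='bbox 90 10 200 40; x_wconf 91'>uirumque</span>
</span>"""

parser = HocrWordParser()
parser.feed(SAMPLE)
parser.close()
for text, bbox in parser.words:
    print(text, bbox)
```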

Currently, I’m working on a classifier for Migne, which is a very challenging but potentially quite useful series of volumes. We’re also working on implementing an idea Federico had quite a while ago, aligning the output of multiple engines or runs to improve the overall output. This would allow one to add the best of Nick White’s recent important work on Tesseract to the output of Rigaudon, for instance.
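To give a feel for what aligning the output of two engines might involve, here is a rough sketch using Python's standard `difflib`. It is not the method described in the message, which gives no details: the function name `merge_outputs`, the word-level alignment, and the rule "take the second engine's word only when it is in the dictionary and the first engine's is not" are all my own assumptions.

```python
from difflib import SequenceMatcher


def merge_outputs(primary, secondary, lexicon):
    """Align two OCR outputs word by word and merge them.

    Where the two disagree on a stretch of equal length, prefer the
    secondary reading only if it is a known word and the primary is not.
    Everywhere else the primary output is kept.
    """
    a, b = primary.split(), secondary.split()
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for x, y in zip(a[i1:i2], b[j1:j2]):
                merged.append(y if (y in lexicon and x not in lexicon) else x)
        else:
            # 'equal', 'delete', 'insert', or an uneven 'replace':
            # fall back to whatever the primary engine produced.
            merged.extend(a[i1:i2])
    return " ".join(merged)


# Invented example: each run gets one word wrong.
LEXICON = {"arma", "uirumque", "cano", "Troiae"}
RUN_A = "arma uirumque cauo Troiae"
RUN_B = "arrna uirumque cano Troiae"

print(merge_outputs(RUN_A, RUN_B, LEXICON))
```

A real system would presumably align at the character level and weigh each engine's confidence, but even this crude version shows why two imperfect runs can add up to something better than either.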

The code is written in the Python language, and requires that you first install Gamera, a document-analysis framework for Python. I believe Python runs on Windows as well as on Unix.

If I had any time, I’d be interested to find out how well this runs. But a caveat: when I looked at the home page, I saw the dreaded word “training”: the code has to be trained to recognise characters. I suspect this stuff is not yet mature enough for normal people.

All the same, this is excellent work!