On a better note, we live in blessed times where technology and the ancient world are concerned. The results of a project to OCR volumes of ancient Greek from Archive.org may now be found online here. Clicking on the first entry, and then on one of the outputs in it here, gives astonishingly good results.
A correspondent has sent me a very interesting message from a Bruce Robertson, taken from the Digital Classicists list, which I think might interest people here.
Federico and I have been working quite a bit on Greek OCR this past year, and have made some advances since the publications below. We now have a process on Compute Canada servers using my Gamera-based ‘Rigaudon’ code.
This process undertakes OCR at multiple levels of darkness, uses a weighted Levenshtein distance correction system that I’ve worked on, and when possible it combines Greek and Latin-script OCR to produce a good mixed result, preserving information in the app. crit.
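For readers wondering what a weighted Levenshtein correction might look like, here is a minimal sketch in Python. It is not the Rigaudon code, which I have not examined. The confusion weights, the tiny word list, and the names weighted_levenshtein, correct, CONFUSIONS and LEXICON are all invented for illustration. The idea is simply that substituting two glyphs which an OCR engine often confuses costs less than an arbitrary substitution, so a misread word lands "closer" to the word it was meant to be.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel=1.0):
    """Edit distance in which some substitutions are cheaper than others.

    sub_cost maps (char, char) pairs to a cost; unlisted pairs cost 1.0.
    Pairs are treated as symmetric.
    """
    sub_cost = sub_cost or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * indel]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                s = 0.0
            else:
                s = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            cur.append(min(prev[j] + indel,      # delete ca
                           cur[j - 1] + indel,   # insert cb
                           prev[j - 1] + s))     # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical weights: glyph pairs assumed to be easily confused.
CONFUSIONS = {("ο", "σ"): 0.3, ("ν", "υ"): 0.2}

# Hypothetical word list (unaccented, to keep the example small).
LEXICON = ["λογος", "λεγω", "νομος"]


def correct(word, lexicon, sub_cost, max_dist=1.0):
    """Return the nearest lexicon word, or the input if nothing is close."""
    best = min(lexicon, key=lambda w: weighted_levenshtein(word, w, sub_cost))
    if weighted_levenshtein(word, best, sub_cost) <= max_dist:
        return best
    return word


if __name__ == "__main__":
    print(correct("λσγος", LEXICON, CONFUSIONS))
```

A real system would presumably derive its weights from observed OCR errors and use a far larger dictionary; the threshold max_dist is there so that words with no close match are left alone rather than "corrected" into something else.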
This group is probably most interested in looking over the results.
Here’s a typical volume in Teubner serif font, which took about 2h to run on our 40 cores (all results are pure machine output, without manual spellcheck or other human intervention):
Here’s a rather challenging papyrological text:
We also have Teubner sans font texts working:
And the more challenging Didot foundry:
And of course, Oxford:
If you’re into bleeding-edge experiments, here’s some Migne:
There are many more, along with some experiments (successful or otherwise) at:
I have set up a public spreadsheet for OCR requests from archive.org volumes, here: https://docs.google.com/spreadsheet/ccc?key=0ArJt01185Q8mdERsS2VRMngtTWRNUDAtNGFxZXhFQVE&usp=sharing
and I’d be delighted if anyone on the list wanted to add a request, or just email me with a request. Output will be in standard HOCR or plain text.
Currently, I’m working on a classifier for Migne, which is a very challenging but potentially quite useful series of volumes. We’re also working on implementing an idea Federico had quite a while ago, aligning the output of multiple engines or runs to improve the overall output. This would allow one to add the best of Nick White’s recent important work on Tesseract to the output of Rigaudon, for instance.
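To give a feel for the alignment idea, here is a rough sketch in Python using the standard library's difflib. It is an assumption about how such a merge could work, not a description of what Federico or Bruce have built. The function name merge_outputs, the sample lines and the word list are invented. Two OCR readings of the same line are aligned word by word; where they disagree, the reading that appears in a dictionary wins, and otherwise the first engine is trusted.

```python
from difflib import SequenceMatcher


def merge_outputs(primary, secondary, lexicon):
    """Merge two OCR readings of one line, preferring dictionary words."""
    a, b = primary.split(), secondary.split()
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of words on both sides: compare them pairwise.
            for wa, wb in zip(a[i1:i2], b[j1:j2]):
                if wb in lexicon and wa not in lexicon:
                    merged.append(wb)
                else:
                    merged.append(wa)
        else:
            # Equal runs, insertions, deletions and uneven replacements:
            # keep whatever the primary engine produced.
            merged.extend(a[i1:i2])
    return " ".join(merged)


if __name__ == "__main__":
    # Hypothetical word list and engine outputs.
    words = {"arma", "uirumque", "cano", "Troiae", "qui", "primus", "ab", "oris"}
    run_one = "arma uirumque cano Troiae qui primns ab oris"
    run_two = "arma uirumqne cano Troiae qui primus ab oris"
    print(merge_outputs(run_one, run_two, words))
```

Each run here gets one word wrong, and the merge recovers both. A serious implementation would need to work at the character level as well, and to cope with engines that split or join words differently.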
The code is a script written in the Python language. It requires that you first install Gamera (itself a Python-based toolkit, with C++ components). I believe Python can run on Windows as well as on Unix.
If I had any time, I’d be interested to find out how well this runs. But a caveat: when I looked at the home page, I saw the dreaded word “training”: the code has to be trained to recognise characters. I suspect this stuff is not mature enough for normal people.
All the same, this is excellent work!