A correspondent draws my attention to Tesseract, a Google-hosted project to do Optical Character Recognition. The Tesseract website is here. Tesseract is a command-line tool, but there are front-ends available.
I am a long-term fan of ABBYY FineReader and, let’s face it, have probably OCR’d more text than most. So I thought that I would give Tesseract 3.02.02 a go.
First, getting the bits. I work on Windows 7, so I downloaded:
I double-clicked on the Tesseract installer. This went smoothly. It gave me the option to download and install extra languages (English is the default); among others I chose ancient Greek, German, and German (fraktur). The last of these is the “gothic” style of lettering fashionable in Germany until 1945. Curiously the list of languages is not in alphabetical order: French follows German.
Next I clicked on the GImageReader installer. This ran quickly, and warned that you need a copy of Tesseract installed. It did not create a desktop icon; you have to locate the program in the Start menu. This would throw some users, I suspect.
I then started GImageReader. It started with an error, saying that it was missing the “spellcheck dictionary for Dansk(Frak)”. Why it looks for this I cannot imagine. Not a good start, I fear. I suspect that it expects Tesseract to be installed with all possible languages.
Next I browsed to a tif file containing part of the English translation of Cyril of Alexandria on John. The file explorer is clunky and does not follow the Windows standard. The page displayed OK, although if you swap to another window and then back again it seems to re-render the image.
At the top of the page is the recognition language – set by default to the mysterious Dansk (Frak). I changed this to English. I then hit “Recognize all”. The recognition was quick.
So far, so good, then. While unpolished, the interface is usable without a lot of stress.
The result of the OCR was not bad. A window pops open on the right, with ASCII text in it. It didn’t cope very well with layout issues, nor with small text. But the basic recognition quality seemed good.
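Since Tesseract itself is a command-line tool, the same step can in principle be done without the front-end. What follows is only a minimal sketch, assuming that the `tesseract` program is on the PATH and that the language packs are installed under the codes `eng` (English) and `grc` (ancient Greek); the file names are made up for illustration, and the helper functions are my own, not part of Tesseract.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for a basic Tesseract run.

    Tesseract adds the ".txt" extension to output_base itself.
    The language codes ("eng", "grc") are assumptions about which
    language packs were chosen in the installer.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the name of the text file it writes.

    Assumes the tesseract executable can be found on the PATH.
    """
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical file name; substitute a real scanned page.
    print(run_ocr("cyril_john_page.tif", "cyril_john_page", lang="eng"))
```

Keeping the command in a list, rather than one long string, avoids trouble with spaces in file names.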
My next choice was a PDF with the text of Severian of Gabala, De pace, in Greek and Latin. This opened fine (rather to my surprise). I held the cursor over the page, and it turned into a cross. Holding down the left mouse button and dragging drew a rectangle around the text I wanted to recognise. A quick language change to Ellenika and I hit “Recognise selection”.
The result was not bad at all. Polytonic accents were recognised (although it did not like the two g’s in a)/ggeloi).
There were some UI issues here. I could zoom the window being read – great! But annoyingly I could not zoom the text window, nor copy and paste from it to Notepad. But I could and did save it to a Unicode text file. The result was this:
1. Οἱ ἄηε).οι τὸν οὐρἀνιον χο-
ρὸ·· συστησἀμενοι εὺηγγελίζοντο
τοῖς ποιμἑσι λἑγοντες· «εὐαγγε-
λιζόμεθα ὑμῖν σήμερον χαρὰ· με-
γάλην, ήτις ἔσται παντὶ τῷ λαῷ».
Παρ’ αὐτῶν τοίνυν τῶν ὰγίων ἐκεί-
νων ὰηέλων καὶ ῆμεῖς δανεισἀ-
μενοι φωνὴν οὐαηελιζόμεθα ὑμῖν
σήμερον, ὅτι σήμερον τὰ τῆς
ὲκπλησίας ἐν γαλή~›η καὶ τὰ τῶν
αἰρετικῶν ἐν ζάλη. Σἡμερον τὸ
οπιάφος τῆς ἑκκλησίας ἐν γαλήνη
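The output keeps the line breaks of the printed column, including the hyphens where a word is split across two lines. If you want running text, the saved file can be tidied afterwards. This is a rough sketch only: the function name is my own invention, and it assumes that every hyphen at the end of a line marks a split word, which will not always be true.

```python
def join_hyphenated_lines(text):
    """Join OCR lines into running text.

    A line ending in "-" is glued directly to the next line (the hyphen
    is dropped); any other line is followed by a single space.
    Assumption: every line-final hyphen is a word split, so a genuine
    hyphen at the end of a line would be removed as well.
    """
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith("-"):
            pieces.append(line[:-1])  # drop the hyphen, add no space
        else:
            pieces.append(line + " ")
    return "".join(pieces).strip()


if __name__ == "__main__":
    # Hypothetical file name for the text saved from the front-end.
    with open("severian_de_pace.txt", encoding="utf-8") as handle:
        print(join_hyphenated_lines(handle.read()))
```

It does nothing about the misread letters, of course; those still need correcting by eye.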
Conclusions? I’ve used worse in the past. I think it looks pretty good. I suspect that, to use it, one would need to train it a bit more, but you can’t complain about the price!
Well done, those who created the training dictionary.