Last night I completed the arduous task of manually correcting all the OCR’d pages of Ibn Abi Usaibia. Not that it is perfect even now — optically correcting is an error-prone business.
Today I moved on to the next step — getting the text out of Abbyy Finereader 10, and into some format that can be edited for layout, etc. This is proving rather trickier than it should.
To do the OCR, I divided the 1,000+ pages up into 27 projects, each of about 40 pages. Since the manuscript is typescript, there is really no text formatting to retain (no italics, bold, etc.), so simply exporting it as plain text in HTML format, using the Windows-1252 encoding, would seem to be the right choice.
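Whether Windows-1252 really is a safe choice depends on the corrected text containing nothing outside that code page. Here is a minimal sketch of a check in Python, which has nothing to do with Finereader itself. The function name is hypothetical, and the sample string with macrons is invented for illustration rather than taken from the typescript.

```python
def non_cp1252_chars(text: str) -> dict[str, int]:
    """Return every character that Windows-1252 cannot encode, with its count."""
    counts: dict[str, int] = {}
    for ch in text:
        try:
            # "cp1252" is Python's name for the Windows-1252 code page.
            ch.encode("cp1252")
        except UnicodeEncodeError:
            counts[ch] = counts.get(ch, 0) + 1
    return counts


if __name__ == "__main__":
    # Invented sample: curly quotes and accented Latin letters fit in
    # Windows-1252, but letters with macrons do not.
    sample = "The physician\u2019s caf\u00e9, and a name such as \u0100mid\u012b"
    for ch, n in non_cp1252_chars(sample).items():
        print(f"U+{ord(ch):04X} {ch!r} occurs {n} time(s)")
```

If the result is empty, nothing should be lost by exporting in Windows-1252. If it is not, a Unicode export is the safer route.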
Unfortunately projects 2 and 3 are refusing to do the export. Attempts to do so bring up programme errors, complete with .cpp file and line number. This sort of unreliability arrived with Finereader 10, and it is an unmitigated pain. I can’t export as Word either. Nor can I import the projects into Finereader 11 (a truly duff version, if ever I saw one, which will rarely import any project from a preceding version successfully).
I’ve managed to export the text of those two projects as Unicode text, in a .txt file. But naturally I am rather annoyed. The projects show no special sign of corruption, although Finereader projects can become corrupt, mysteriously.
This is infuriating, and it undermines the point of using the software. Nobody should invest weeks of work in editing something, only to find that the work cannot easily be got out again.
Finereader 8 was rock-solid. Finereader 9 had better recognition, but was less reliable. And so it has gone on.
Abbyy need to invest some time in improving reliability, or they will lose their market. People who use OCR software work hard. They should be able to rely on the software not to crash.
UPDATE: I have now installed Microsoft FrontPage 2002. I usually use FrontPage 2000 for general editing (it is curious how neither DreamWeaver nor ExpressionWeb has a decent WYSIWYG editor, almost 10 years on), but FP2000 can’t handle Unicode characters. FP2002 can; but for some reason you cannot run both on the same machine. And, sure enough, installing FP2002 has silently uninstalled FP2000, drat it.
Fortunately I was able to create new .htm files for projects 02 and 03 in FP2002, by the simple process of pasting the contents of the Unicode .txt files into them.
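The same paste-into-a-page step could also be scripted. What follows is only a sketch in Python, not what I actually did. It assumes that paragraphs in the .txt file are separated by blank lines and that the file is UTF-16 (which is what Windows programs often mean by "Unicode text", though I have not confirmed that for Finereader). The function names are hypothetical.

```python
import html
import re
from pathlib import Path


def text_to_html(text: str, title: str) -> str:
    """Wrap plain text in a minimal HTML page, one <p> per paragraph."""
    # Normalise Windows line endings, then split on blank lines.
    normalised = text.replace("\r\n", "\n")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", normalised) if p.strip()]
    # html.escape protects &, < and > (and quotes) in the OCR'd text.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )


def convert_file(src: Path, dst: Path, source_encoding: str = "utf-16") -> None:
    """Read a text export and write it out as a UTF-8 .htm file.

    source_encoding is an assumption; change it if the export turns out
    to be UTF-8 or something else.
    """
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text_to_html(text, src.stem), encoding="utf-8")
```

Writing the page as UTF-8 with a matching meta tag means an editor that understands Unicode should open it without mangling anything. An editor that does not would still need a Windows-1252 version.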
What I shall need to do now is work out how to format 1,000+ pages of text satisfactorily. Particularly now that FP2002 has uninstalled all my macros!