Experiments in Arabic OCR

A correspondent has suggested to me the possibility of using Optical Character Recognition (OCR) software to read a portion of al-Makin that was published in the Bulletin d’études orientales 15, back in the 1950s.  I admit that I was dubious, but I’ve spent a little time this evening looking into the matter.

I believe that Adobe Acrobat Pro XI may have a facility to OCR text in Arabic.  Certainly Acrobat Pro 9 does not; at least, my copy doesn’t seem to.  There is discussion at the Adobe forums here.

One product mentioned there was something called Novoverus.  This is supposedly used by the US government.  It comes as no surprise, therefore, that the company website omits any prices and will only deal with customers personally.  However, I did find a site offering it for sale, here, at a cool $1,299!

Fortunately a post on the Adobe forum noted that Abbyy Finereader Pro 11 supports Arabic OCR.  This I have.  The user interface to this version of FR is buggy.  It caused me endless grief while scanning Theodoret’s commentary on Romans.  So I have mostly used an older version.

I’ve installed FR11 (version 10 is not good enough) and it does indeed have an Arabic option: “Arabic (Saudi Arabia)”.

I tried OCR’ing the text on a page of Erpenius.  I didn’t think the results were that great; but then it wasn’t a fair test on a 1625 font!  So I tried again on Cahen’s text.  The result is as follows:


I don’t think that seems particularly impressive; but perhaps those who can actually read Arabic might comment.

Finereader 11 – do not install!

I have just, this evening, finished manually adding italics to 40 pages of a scanned text in Finereader 11.  I exported this to Word, and the result doesn’t seem to contain my changes.  And … while I was fiddling with formatting on the very last page, and trying to export my work, it silently erased all my formatting changes in the previous 39 pages as well!  I am unbelievably angry!  Days and days of work … silently deleted.

This product is not fit for use.  DO NOT BUY IT!

I hate Abbyy.  How can anyone ship such a piece of worthless junk as this?

UPDATE: I took a backup of my disk late afternoon.  I’ve lost all the work since 4:30.  18 pages of manual corrections, all hard on the eyes and the hands.  I really, really hate Abbyy.

UPDATE2: It has taken forever to scan 80 pages of stuff.  I think the problem has always, always been Abbyy Finereader 11.  The filters to export don’t work properly; and when you change settings, things happen which you don’t want and didn’t ask for.  I’m not sure what best to do, but I am quite sure that I have had enough of FR11.

UPDATE3: And I can’t even export the 22 pages of corrected stuff that I still have, without erasing all the formatting on every page other than the one displayed in the editor!

Am giving up.  I’ll export the page images out, and read them in again in FR10 and see if I can get better results.  And … don’t I have Omnipage around here somewhere?

So angry.

UPDATE4: And … I realise that all the italic text was garbage, and that I had to manually correct it.  And there are italics on every other ratted line.  I have to do days and days of work again!!!!!

So angry.  I want to hurt someone at Abbyy, really badly.  I want to stick a broken bottle up his backside and twist.  How dare they ship stuff this badly broken?!?

UPDATE5: OK … what happens if I go into the 22 page version and just do Ctrl-A, select all the contents, page by page, and paste them into Word?  Answer: Word sees verse numbers and starts trying to assign automatic page numbers.  Grrrr!!!

Now trying Wordpad instead.

Two broken bottles up the backside of the CEO of Abbyy and twisting really hard.  I want to hear him scream like a damned soul.  You swine, how dare you put me through this?

Wordpad seems to work.  Setting FR11 so that the whole page is displayed before doing the Ctrl-A saves paging up and down, since Abbyy have also broken the next-page hot-key in FR11.

Well, it works more or less.  The Ctrl-A doesn’t include the * against footnotes at the bottom of each page.

UPDATE6: Well, I have rescued, more or less, my 22 pages in a .rtf file.  I am loath ever to touch FR11 again.

UPDATE7: Looks like I have lost most of the italics in the first 40 pages as well!!!!

The trouble is, if you can only work at things for an hour or two here or there, you rely on the software to keep things straight.  In this case, I shall have to stop work tomorrow, and not look at this again for ages.  So … when I come back, will I even remember where I am?  And will I remember how the software has been biting me?

Frustrated with Finereader

I’ve been working on placing Theodoret’s commentary on Romans on the web for a while.  I OCR’d it in Abbyy Finereader 11, and I finished proofing the OCR in Finereader before Easter.

Today I tried exporting the text to HTML.  It has rather a lot of italics in it, so imagine my fury when I discovered that exporting “formatted” text had lost all the italics!  A bit of experimentation revealed that the same happened when saving “formatted” text as .RTF.  Only saving “exact text” retained the italics.  And you don’t want all the crud that comes with that.

I imagine that it’s just a bug; but it is a frustrating one.  I really do not want to reitalicise some 100 pages.
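One cheap safeguard against this sort of silent loss is to check an export for italics before trusting it.  What follows is only a rough sketch in Python, not anything that Finereader provides: it assumes that the exported HTML marks italics with plain <i> or <em> tags (an export that uses CSS styles instead would need a different check), and the class and function names are my own invention.

```python
from html.parser import HTMLParser


class ItalicCounter(HTMLParser):
    """Counts the characters that sit inside <i> or <em> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # how many italic elements we are inside
        self.italic_chars = 0   # running total of italic characters

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("i", "em") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.italic_chars += len(data.strip())


def count_italic_chars(html_text):
    """Return how many characters of html_text are marked as italic."""
    parser = ItalicCounter()
    parser.feed(html_text)
    parser.close()
    return parser.italic_chars


# An export that has kept its italics, and one that has lost them.
print(count_italic_chars("<p>Plain <i>Romans</i> text</p>"))
print(count_italic_chars("<p>Plain Romans text</p>"))
```

If the count for a freshly exported file comes back as zero when the book is full of italics, that is the moment to stop and keep a backup of the project, rather than discovering the loss 100 pages later.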

Another annoyance was that Finereader now attempts to work out where footnotes are involved, and create its own numeration.  In Word this is fine, as inserting and renumbering footnotes is trivial.  In HTML, however, it simply creates work that has to be undone.

Finereader does excellent OCR.  But I wish they would spend some time getting the product user-tested, really I do.

Abbyy Finereader 11 – a dog indeed?

I’ve scanned and uploaded two books by Michael Bourdeaux here.  The Faith on Trial in Russia volume in particular is important reading for the persecution of the Russian Baptists in the USSR.

I’ve been working on Gorbachev, Glasnost & The Gospel, one of the late Keston volumes.  I scanned the pages using Finereader 8 — the last version that allowed me to drive my Opticbook 3600 at 400 dpi.  I scanned the photos and the cover in Finereader 11; and then I imported the image files from FR8 into FR11.

But it isn’t working out that well.  In fact I am giving up and going back to Finereader 10, which I used earlier today for Faith on Trial in Russia.  FR11 gives odd spelling errors: a word ending in “tly ” like “currently ” will be given as “currendy”.  That wastes time.  Worse, it has decided to treat 100 pages as “Batnan” font, which looks a lot like Courier.  I don’t want to go through every page fixing that.

So I’m exporting the images and going back.  Wish me luck!

UPDATE: In fairness, I’m finding the same -dy problem in FR10.  It must be the rather odd font in use.  But much else is still better in FR10.  Words in italics are bolded in FR11; not in FR10.  The pages in Batnan are not so in FR10.  Hmm.
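Since the “tly” to “dy” misreading is so regular, it could in principle be hunted down by a script run over the exported text, rather than by eye.  Here is a minimal sketch in Python.  It is my own illustration, not a Finereader feature: the function name and the tiny word list are made up for the example, and a real run would need a proper English word list.

```python
import re


def suggest_tly_fixes(text, known_words):
    """Find words ending in 'dy' that are not real words, but would be
    real words if the ending were 'tly'.  Returns (ocr_word, suggestion)."""
    suggestions = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0)
        lower = word.lower()
        # Genuine words such as "study" or "ready" are left alone.
        if not lower.endswith("dy") or lower in known_words:
            continue
        candidate = lower[:-2] + "tly"
        if candidate in known_words:
            suggestions.append((word, candidate))
    return suggestions


# A toy word list, purely for illustration.
known = {"currently", "recently", "ready", "study", "the", "was"}
sample = "The church was currendy closed, but recendy a study was ready."
for ocr_word, fix in suggest_tly_fixes(sample, known):
    print(ocr_word, "->", fix)
```

The script only makes suggestions.  I would still want to look at each one before changing anything, since a word list will never be complete.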

On the other hand, it is good that FR11 highlights “words” that aren’t in the dictionary; that really does help in spotting errors.

From my diary

Last night I completed the arduous task of manually correcting all the OCR’d pages of Ibn Abi Usaibia.  Not that it is perfect even now — optically correcting is an error-prone business.

Today I moved on to the next step — getting the text out of Abbyy Finereader 10, and into some format that can be edited for layout, etc.  This is proving rather trickier than it should.

To do the OCR, I divided the 1,000+ pages up into 27 projects, each of about 40 pages.  Since the manuscript is typescript, there is really no text formatting to retain — no italics, bold, etc — so simply exporting it as plain text in HTML format, using the Windows 1252 encoding, would seem to be the right choice.

Unfortunately projects 2 and 3 are refusing to do the export.  Attempts to do so bring up programme errors, complete with .cpp file and line number.  This sort of unreliability arrived with Finereader 10, and it is an unmitigated pain.  I can’t export as Word either.  Nor can I import the projects into Finereader 11 (a truly duff version, if ever I saw one, which will rarely import any project from a preceding version successfully).

I’ve managed to export the text as unicode text format, in a .txt file.  But naturally I am rather annoyed.  The projects show no special sign of corruption, although Finereader projects can become corrupt, mysteriously.

This is infuriating, and it undermines the point of using the software.  Investing weeks of work in editing something, only to find that you can’t get your work out very easily, is quite annoying.

Finereader 8 was rock-solid.  Finereader 9 had better recognition, but was less reliable.  And so it has gone on.

Abbyy need to invest some time in improving reliability, or they will lose their market.  People who use OCR software work hard.  They should be able to rely on the software not to crash.

UPDATE: I have now installed Microsoft FrontPage 2002.  I usually use FrontPage 2000 for general editing — it is curious how neither DreamWeaver nor ExpressionWeb has a decent WYSIWYG editor, almost 10 years on — but this can’t handle unicode characters.  FP2002 can; but for some reason you cannot run both on the same machine.  And, sure enough, FP2002 has silently deinstalled FP2000, drat it.

Fortunately FP2002 has created new .htm files for projects 02 and 03, by the simple process of pasting the unicode .txt files into them.
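The same job of wrapping a unicode .txt export in a bare .htm file could also be done by a small script, which might matter with 27 projects to get through.  The following is a sketch only, in Python.  The function names are my own, and the default encoding is an assumption: I have not checked exactly which unicode encoding Finereader writes, so the encoding argument may need changing (to “utf-16”, say).

```python
import html
from pathlib import Path


def text_to_html(text, title):
    """Wrap plain text in a minimal HTML page, one <p> per blank-line-
    separated paragraph, escaping &, < and > along the way."""
    paragraphs = [p.strip() for p in text.replace("\r\n", "\n").split("\n\n")]
    body = "\n".join(
        "<p>" + html.escape(p).replace("\n", "<br>\n") + "</p>"
        for p in paragraphs
        if p
    )
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        "<title>" + html.escape(title) + "</title>\n"
        "</head>\n<body>\n" + body + "\n</body>\n</html>\n"
    )


def convert_file(src, dst, encoding="utf-8-sig"):
    """Read a text export (the encoding is an assumption) and write UTF-8 HTML."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(text_to_html(text, Path(src).stem), encoding="utf-8")


# A small in-memory example; real use would call convert_file() per project.
print(text_to_html("Ibn Abī Uṣaybiʿa & co.\n\nSecond paragraph.", "project02"))
```

This does nothing about layout, of course.  It simply gets the text into a file that an HTML editor can open, with the unicode characters intact.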

What I shall need to do now is think up a way to format 1000 pages of text in a satisfactory way.  Particularly now that FP2002 has uninstalled all my macros!

Problems with Abbyy Finereader 11

Tonight I realised that I was getting close to the end of one section of Ibn Abi Usaibia, and that the next 350 pages were in sight.  I thought that it might be a good idea to create a Finereader project for those pages, and run the optical character recognition on them, and do a few global search-and-replaces.

So far I have been working with Finereader 10, although I did a small experiment with Finereader 11 when I got it.  But this new chunk is an obvious break-point to move up.

I started up Finereader 11, and attempted to import my settings — primarily my custom English-Arabic language setting — from FR10.  This promptly crashed.

I restarted FR11, and after a bit of fiddling recreated the language and saved it.  I then opened the PDF with the 350 pages, which was fine.

Then I OCR’d the lot.  This seemed to go OK; and then started popping up horrible-looking internal error messages.  In fact it just would not allow me to view the “read” pages.

I ended up going back to FR10, which is running at the moment.  Doubtless I have done something wrong, but it is troublingly easy to crash FR11.

First impressions of Abbyy Finereader 11

Finereader 11 looks quite a lot like Finereader 10.  So far, it seems very similar.  One nice touch is that when it is reading a page, a vertical bar travels down the thumbnail.

But I have already found an oddity.  I imported into it the project that I am currently working on in Finereader 10 — part of Ibn Abi Usaibia — and it looks really weird!  All the recognised text is spaced out vertically!  The paragraph style is “bar code”, and no other styles are available. 

Here’s what I see when I open it:

Opening a Finereader 10 project in Finereader 11

Not very useful, is it?  But when I minimise the image, and increase the recognised panel to 100% size, it looks like this!

Finereader 11 – zoomed version of recognised text

There seems to be no rhyme or reason for the massive gaps between lines.  And here is the very same project in Finereader 10:

Finereader 10 image of same document

Weird.  Doubtless there is some setting to persuade FR11 to behave, but it isn’t obvious what.  This does NOT happen when I recognise the page again in FR11.  The style gets set to “Body Text (2)”, in this case. 

And … when I do Ctrl-Z, and revert the recognition, it goes back to the weird appearance above.  But … this time, a bunch of other styles are available, and if I change to BodyText2, that is what I get.  But on the next page … once again, Barcode is the only style. 

This must be a bug, I think.  It means that Abbyy’s testers have not tested importing documents from FR10 sufficiently.  What it means is that you can’t upgrade projects once you start them.  Well … I try to keep my projects small, and break up large documents into small chunks, so I shan’t mind.  That would seem to be the workaround.

One good feature that is new, is that it remembers where you were in the document last time.  All previous versions always opened the document at page 1.  I got quite accustomed, indeed, to placing a “qqq” at the point where I stopped, so I could find it again next time.  No need in FR11, it seems.

Also FR11 comes bundled with “PDF Transformer 3”.  This suggests that the latter product was bought in, to beef up the rather unremarkable PDF handling in Finereader.  I’ve not tried this yet, tho.