Scanning – Roger Pearse

From my diary

Posted on September 9, 2013 by Roger Pearse

One item that has hung around on my PC for ages now is Theodoret’s Commentary on Romans. A translation actually exists of this obscure item, published by an Oxford Movement person in the 1840’s, in a journal, and then forgotten. I did scan it in the then-new Finereader 11 back in early 2012; but a bug in the software promptly erased a whole load of formatting. The original editor had used italics instead of quotes, where bits of the bible were involved, which means there are a lot of them.

I re-added the italics, laboriously, not realising why it had disappeared; and lo! it vanished again.

After trudging through 80 pages, twice, adding italics all over each page, my will to live disappeared and I left it to one side.

But I have got stuck into this again. This time I add italics to a page, and then copy the page into Word before I do anything else. Slowly, slowly, I am building up the text. Another 25 pages to go. I hope to get it done this week.

Finereader 11 – do not install!

Posted on November 3, 2012 by Roger Pearse

I have just, this evening, finished manually adding italics to 40 pages of a scanned text in Finereader 11. I export this to Word, and it doesn’t seem to contain my changes. And … while I was fiddling with formatting on the very last page, and trying to export my work, it has silently erased all my formatting changes in the previous 39 pages as well! I am unbelievably angry! Days and days of work … silently deleted.

This product is not fit for use. DO NOT BUY IT!

I hate Abbyy. How can anyone ship such a piece of worthless junk as this?

UPDATE: I took a backup of my disk late afternoon. I’ve lost all the work since 4:30. 18 pages of manual corrections, all hard on the eyes and the hands. I really, really hate Abbyy.

UPDATE2: It has taken forever to scan 80 pages of stuff. I think the problem has always, always been Abbyy Finereader 11. The filters to export don’t work properly; and when you change settings, things happen which you don’t want and didn’t like. I’m not sure what best to do, but I am quite sure that I have had enough of FR11.

UPDATE3: And I can’t even export the 22 pages of corrected stuff that I still have, without erasing all the formatting on every page other than the one displayed in the editor!

Am giving up. I’ll export the page images out, and read them in again in FR10 and see if I can get better results. And … don’t I have Omnipage around here somewhere?

So angry.

UPDATE4: And … I realise that all the italic text was garbage, and that I had to manually correct it. And there is italics on every other ratted line. I have to do days and days of work again!!!!!

So angry. I want to hurt someone at Abbyy, really badly. I want to stick a broken bottle up his backside and twist. How dare they ship stuff this badly broken?!?

UPDATE5: OK … what happens if I go into the 22-page version and just do Ctrl-A, select all the contents, page by page, and paste them into Word? Answer: Word sees verse numbers and starts trying to assign automatic page numbers. Grrrr!!!

Now trying Wordpad instead.

Two broken bottles up the backside of the CEO of Abbyy and twisting really hard. I want to hear him scream like a damned soul. You swine, how dare you put me through this?

Wordpad seems to work. Setting FR11 so that the whole page is displayed before doing the Ctrl-A saves paging up and down, since Abbyy have also broken the next-page hot-key in FR11.

Well, it works more or less. The Ctrl-A doesn’t include the * against footnotes at the bottom of each page.

UPDATE6: Well, I have rescued, more or less, my 22 pages in a .rtf file. I am loath ever to touch FR11 again.

UPDATE7: Looks like I have lost most of the italics in the first 40 pages as well!!!!

The trouble is, if you can only work at things for an hour or two here or there, you rely on the software to keep things straight. In this case, I shall have to stop work tomorrow, and not look at this again for ages. So … when I come back, will I even remember where I am? And will I remember how the software has been biting me?

From my diary

Posted on January 5, 2012 by Roger Pearse

This afternoon I sat down with Origen, Homilies on Ezekiel 8-10 (and Jerome’s preface), and compared our translation with the 2010 ACW one. The object of the exercise was to locate any serious differences in understanding, and allow us to revise the translation if the ACW version suggested an improvement. I am pleased to say that I think all the deviations so far are in our favour. There is one obscure section where I am not convinced that we are right, but we’ll see. I’ve passed this material over to the translator for review. I still have homilies 11-14 to do, but I think I have done what I will do today. It is hard work!

This evening I’ve been playing with Abbyy Finereader 11, using the PDF’s of the unpublished translation of the Book of Asaph the Physician, discovered by Douglas Galbi at the US National Library of Medicine. I don’t know a sausage about this text, I should say at once, so it’s a voyage of discovery here. I’m not committed to OCR’ing it either! But it’s a convenient vehicle for experimentation.

Now in the past I found that Finereader 11 wouldn’t play with my Finereader 10 projects, so I ignored it. But starting afresh, I’m discovering some interesting and useful new facilities.

The photos of Asaph are all rather skewed. This is inevitable in photographing books, unless you can press the pages on a glass to get them flat.

But in Finereader 11, I find that some new tools have been added to the image editor. There’s a very nice facility to adjust for “trapezium” effects — and it works well. Even better is the line straightener. Also there is a brightness/contrast control. If the type on the far side of the paper shows through, you can lose it by increasing the brightness.

The image files for Asaph are pretty bulky, so things are slow. But I was able to turn a page that was skewed to blazes back into something straight. Skewed pages require intervention on pretty much every line, which slows OCR to a crawl. But Finereader 11 can cope with this. I’d like the facility to apply the same deskew to a bunch of images, rather than one-by-one, tho.

Something Abbyy could usefully do is allow us to change the background colour of the OCR window. The green-ish coloured images result in a green-ish coloured background in the text window, for some reason, and this is very unpleasant and impossible to remove.

One pleasing thing that I see has at last arrived: an “insert symbol” facility. Long overdue and very welcome it is too!

Abbyy Finereader 11 – a dog indeed?

Posted on December 26, 2011 by Roger Pearse

I’ve scanned and uploaded two books by Michael Bourdeaux here. The Faith on Trial in Russia volume in particular is important reading for the persecution of the Russian baptists in the USSR.

I’ve been working on Gorbachev, Glasnost & The Gospel, one of the late Keston volumes. I scanned the pages using Finereader 8 — the last version that allowed me to drive my Opticbook 3600 at 400 dpi. I scanned the photos and the cover in Finereader 11; and then I imported the image files from FR8 into FR11.

But it isn’t working out that well. In fact I am giving up on FR11 and going back to Finereader 10, which I used earlier today for Faith on Trial in Russia. The reason is that FR11 gives odd spelling errors: a word ending in “tly” like “currently” will be given as “currendy”. That wastes time. Worse, it has decided to treat 100 pages as “Batnan” font, which looks a lot like Courier. I don’t want to go through every page fixing that.

So I’m exporting the images and going back. Wish me luck!

UPDATE: In fairness, I’m finding the same -dy problem in FR10. It must be the rather odd font in use. But much else is still better in FR10. Words in italics are bolded in FR11; not in FR10. The pages in Batnan are not so in FR10. Hmm.

On the other hand, it is good that FR11 highlights “words” that aren’t in the dictionary — that really does help in spotting errors.
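The “-dy” for “-tly” misreading described above is regular enough to hunt for mechanically once the text is out of the OCR program. What follows is only a rough sketch in Python, not anything Finereader offers: the function name and the tiny word list are invented for illustration, and a real run would need a full dictionary file.

```python
import re

# Hypothetical mini word list for illustration only; in practice this
# would be loaded from a full dictionary file.
KNOWN_WORDS = {"currently", "recently", "exactly", "already", "steady", "body"}


def suspect_tly_errors(text, known_words=KNOWN_WORDS):
    """Return (found, suggestion) pairs for words ending in 'dy' that are
    not in the word list but become a known word when 'dy' is read as 'tly'."""
    suspects = []
    for match in re.finditer(r"\b[A-Za-z]+dy\b", text):
        word = match.group(0)
        lower = word.lower()
        if lower in known_words:
            continue  # a genuine word such as "body" or "already"
        candidate = lower[:-2] + "tly"
        if candidate in known_words:
            suspects.append((word, candidate))
    return suspects


sample = "He is currendy away, and recendy the body was already steady."
for found, suggestion in suspect_tly_errors(sample):
    print(f"{found} -> {suggestion}?")
```

The sketch only suggests corrections; it deliberately changes nothing, since a proper name ending in “dy” would otherwise be at risk.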

From my diary

Posted on December 15, 2011 by Roger Pearse

Last night I completed the arduous task of manually correcting all the OCR’d pages of Ibn Abi Usaibia. Not that it is perfect even now — optically correcting is an error-prone business.

Today I moved on to the next step — getting the text out of Abbyy Finereader 10, and into some format that can be edited for layout, etc. This is proving rather trickier than it should.

To do the OCR, I divided the 1,000+ pages up into 27 projects, each of about 40 pages. Since the manuscript is typescript, there is really no text formatting to retain — no italics, bold, etc — so simply exporting it as plain text in HTML format, using the Windows 1252 encoding, would seem to be the right choice.

Unfortunately projects 2 and 3 are refusing to do the export. Attempts to do so bring up programme errors, complete with .cpp file and line number. This sort of unreliability arrived with Finereader 10, and it is an unmitigated pain. I can’t export as Word either. Nor can I import the projects into Finereader 11 (a truly duff version, if ever I saw one, which will rarely import any project from a preceding version successfully).

I’ve managed to export the text as unicode text format, in a .txt file. But naturally I am rather annoyed. The projects show no special sign of corruption, although Finereader projects can become corrupt, mysteriously.

This is infuriating, and it undermines the point of using the software. Investing weeks of work in editing something, only to find that you can’t get your work out very easily, is quite annoying.

Finereader 8 was rock-solid. Finereader 9 had better recognition, but was less reliable. And so it has gone on.

Abbyy need to invest some time in improving reliability, or they will lose their market. People who use OCR software work hard. They should be able to rely on the software not to crash.

UPDATE: I have now installed Microsoft FrontPage 2002. I usually use FrontPage 2000 for general editing — it is curious how neither DreamWeaver nor ExpressionWeb has a decent WYSIWYG editor, almost 10 years on — but this can’t handle unicode characters. FP2002 can; but for some reason you cannot run both on the same machine. And, sure enough, FP2002 has silently deinstalled FP2000, drat it.

Fortunately FP2002 has created new .htm files for projects 02 and 03, by the simple process of pasting the unicode .txt files into them.
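The step of pasting a Unicode .txt export into a new .htm file could also be scripted. A minimal sketch in Python follows, under some loud assumptions: that the “unicode text” export is UTF-16 with a byte-order mark (common on Windows, but not confirmed here for Finereader 10), that paragraphs are separated by blank lines, and that the output should be UTF-8. The function names and file names are hypothetical.

```python
import html
import re
from pathlib import Path


def text_to_html(text, title):
    """Wrap plain text in a minimal HTML page, one <p> per blank-line-separated
    paragraph, escaping &, < and > so the text displays literally."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n" + body + "\n</body>\n</html>\n"
    )


def convert_file(txt_path, htm_path, title, source_encoding="utf-16"):
    """Read the exported text file and write the HTML page as UTF-8.
    source_encoding is an assumption; change it if the export differs."""
    text = Path(txt_path).read_text(encoding=source_encoding)
    Path(htm_path).write_text(text_to_html(text, title), encoding="utf-8")


# Hypothetical usage, one call per exported project:
# convert_file("project02.txt", "project02.htm", "Ibn Abi Usaibia, part 2")
```

Declaring the charset in the page itself means the long-vowel characters should survive whichever editor opens the file next.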

What I shall need to do now is think up a way to format 1000 pages of text in a satisfactory way. Particularly now that FP2002 has uninstalled all my macros!

Nuance Omnipage 18

Posted on November 5, 2011 by Roger Pearse

This morning I got hold of Nuance Omnipage 18 standard edition. The box was very light: mostly air, a CDROM, and a cheeky bit of cheaply printed paper announcing that they included no manuals at all, in order to save the planet. Humph.

The footprint is quite small, and I copied the CDROM to my hard disk before installation. Curiously the disk packet had two numbers both labelled as “serial number”.

The installation was unfamiliar. As I always do, I clicked on the “select options” and found that it wanted to install some voice-related stuff. I unchecked that. Then I went ahead and did the install. At one point it announced that it was going to install something called “CloudConnector”, without giving me the chance to decline. But I hit cancel, and the rest of the install went fine. It then popped up a box asking me to register — this opened a web page with a rather shoddy page collecting details. Every page gave an “invalid certificate” error in IE, which is sloppy. And then it asked if I wanted to activate, which I did. So far, so good.

I then opened OP. It popped up some “friendly” menu, which I removed. Then I looked at the main screen, and decided to open a PDF and work on it in OP. It took a little while to work out that I needed “Process … Workflows … PDF or Scanned Image to Omnipage document”. Somehow I think “File … Open” would be rather more normal! Once you’ve selected this, you click on a button on the tool bar to start processing. It prompted for a PDF, which I had created myself from some digital photos of Ibn Abi Usaibia, and it promptly objected “non-supported image size” to each page and refused to open it! Silly programme: I don’t care what the image size is, I want to get some OCR of the pages!

OK, let’s see if I can work around it. I select instead “Camera image to Omnipage document” and select a bunch of the same images, as they were before I put them in a PDF. This time it decides to cooperate. It reads the images, rotates them to portrait mode (correctly). Then it pops up some kind of dictionary thing, which is annoying. I hit “close” and the Windows cursor starts spinning. It doesn’t seem to be doing anything, but it’s just sitting there. Hum.

After a while I get bored, and close the program down. At least it dies gracefully, prompting me to save my work. I reopen it, and reopen my project. Then I click the “Text editor” tab. It looks as if it recognised page 1 OK, despite being typescript. No errors, anyway. My first encounter with OCR quality is good.

But … I can only see EITHER the image, or the recognised text, not both at the same time. Hum. It ought to be possible to do this. After a bit of hunting, I find “Window … Classic view” which gives me side-by-side. But I go back to “flexible view”, because I have just discovered that, if I click on the text window, the line of text from the image appears in a hover box above the line.

Now this is really rather convenient. Mind you, when the lines are slanted — as is often the case — I wonder how it would do?

I hit Alt-Down, and nothing happens. Of course, this is not Finereader. A bit of hunting and the Edit menu informs me that Ctrl-PgDn is next page. F4 is next suspect character. I never used this in Finereader, but here using it with the hover box really works. My text here has quite a few vowels with overscores. None of these are recognised by default, but at least I can see them!

So far, not too bad! Better, indeed, than I had feared.

Now I need to start adding custom characters. I want to define my own “language” for recognition, based on English but with all the funny characters that I need in this document to represent long vowels. “Tools … Options” seems to give me choices. On the process tab I see a box saying “Open PDF as images”. It’s unchecked by default; I’ll check it now, and see if I can open that PDF. Looks as if you have to save your settings; I save mine to the same directory where I stored the install CDROM. Then I do “File … New”, and … still can’t open my PDF. Oh well.

Back to the OPD project from the digital images. Can I define some extra characters? Well you can; but it all looks rather weedy compared to Finereader’s options. Let’s try these: āīōūšŠ. I get them from charmap, pointing at the Alphabetum Unicode font; but any reasonably full unicode font such as Arial Unicode MS or Titus Cyberbit Basic would do. Then “Tools … Options … OCR … Additional characters” and I just paste them into the box. The “…” button next to that box leads to some weedy, underspecified lookup, which really needs to be more like Charmap. But do these characters get picked up?
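For anyone hunting for the same six characters in charmap, it may help to know their code points. The short Python sketch below lists them using the standard unicodedata module. As a side check it also reports whether each one exists in Windows-1252, the encoding mentioned elsewhere on this page for HTML export; the helper names are invented for illustration, and none of this has anything to do with how Omnipage itself stores its additional characters.

```python
import unicodedata

# The six characters pasted into "Additional characters" above,
# written as escapes so the file encoding cannot garble them:
# a, i, o, u with macron; s with caron (lower and upper case).
EXTRA_CHARS = "\u0101\u012b\u014d\u016b\u0161\u0160"


def describe(chars):
    """Return (character, code point, official Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars]


def in_cp1252(ch):
    """True if the character can be encoded in Windows-1252."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


for ch, code, name in describe(EXTRA_CHARS):
    note = "in Windows-1252" if in_cp1252(ch) else "not in Windows-1252"
    print(ch, code, name, f"({note})")
```

The macron vowels fall outside Windows-1252 while the two caron letters are in it, which is one reason a Unicode export is the safer route for text like this.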

Now I want to re-recognise. I click on the thumbnail for page 1 and … the menu gives me no option. Hum. Wonder what to do.

In fact I’ve spent some time now trying to work out how to kick off a limited re-read. No luck yet. Surely this should be simple and obvious? Eventually I work out that you select the thumbnails of the pages you want, and hit the toolbar button and that kicks it off.

So how does it do? Well, it recognises the overscore a. None of the other characters are picked up. That’s not so good as Finereader.

Also the more skewed the page is, the less well OP handles it (understandable), and the less easy it is to fix. OP rather presumes that the recognition is near perfect, and has only limited fixing to do. In such a situation, indeed, OP will be quicker to do a job than Finereader. And I notice that a ribbon with characters to paste is across the top of the text window — nice touch. This motivates me to go back and explore again. I haven’t worked out how to set MY characters in that ribbon. But when I went into the weedy charmap substitute, there was a similar ribbon at the top, and right-clicking on it allowed you to add more character sets, which increased the number of characters; and by clicking on them, to add them to the ribbon. How you remove them from the ribbon I don’t know. It is, in truth, a badly designed feature. And the OCR still doesn’t recognise what I need.

I’ve had enough for now and closed it down. Is it any good? Almost certainly. It’s less good for weird characters. But it undoubtedly will see service.

UPDATE: Have just discovered, on starting Word 2010, that Nuance have seen fit to mess with the menus in this (without asking me). Drat them!

First impressions of Abbyy Finereader 11

Posted on November 3, 2011 by Roger Pearse

Finereader 11 looks quite a lot like Finereader 10. So far, it seems very similar. One nice touch is that when it is reading a page, a vertical bar travels down the thumbnail.

But I have already found an oddity. I imported into it the project that I am currently working on in Finereader 10 — part of Ibn Abi Usaibia — and it looks really weird! All the recognised text is spaced out vertically! The paragraph style is “bar code”, and no other styles are available.

Here’s what I see when I open it:

[Screenshot: Opening a Finereader 10 project in Finereader 11]

Not very useful, is it? But when I minimise the image, and increase the recognised panel to 100% size, it looks like this!

[Screenshot: Finereader 11 – zoomed version of recognised text]

There seems to be no rhyme or reason for the massive gaps between lines. And here is the very same project in Finereader 10:

[Screenshot: Finereader 10 image of same document]

Weird. Doubtless there is some setting to persuade FR11 to behave, but it isn’t obvious what. This does NOT happen when I recognise the page again in FR11. The style gets set to “Body Text (2)”, in this case.

And … when I do Ctrl-Z, and revert the recognition, it goes back to the weird appearance above. But … this time, a bunch of other styles are available, and if I change to BodyText2, that is what I get. But on the next page … once again, Barcode is the only style.

This must be a bug, I think. It means that Abbyy’s testers have not tested importing documents from FR10 sufficiently. What it means is that you can’t upgrade projects once you start them. Well … I try to keep my projects small, and break up large documents into small chunks, so I shan’t mind. That would seem to be the workaround.

One good feature that is new, is that it remembers where you were in the document last time. All previous versions always opened the document at page 1. I got quite accustomed, indeed, to placing a “qqq” at the point where I stopped, so I could find it again next time. No need in FR11, it seems.

Also FR11 comes bundled with “PDF Transformer 3”. This suggests that the latter product was bought in, to beef up the rather unremarkable PDF handling in Finereader. I’ve not tried this yet, tho.

OCR: Omnipage and Finereader

Posted on November 2, 2011 by Roger Pearse

Scanning and OCR is on my mind at the moment. A new version of Abbyy Finereader — version 11 — is out. Since I have some 750 pages of Ibn Abi Usaibia to do, any improvement in accuracy is welcome, however slight.

Originally I did my OCR using Omnipage. It is many years since I was led (by Susan Rhoads of Elfinspell.com) to look at Finereader 5. This was immensely superior, and I have never used any other product since. But I see that Omnipage 18 is now out. Stirred by a bit of curiosity, I’ve been wondering what this would be like.

Finereader is not without its faults. Foremost among them, for what I want to do, is that it cannot make a PDF searchable without making the PDF much, much larger, messing with the images, and so forth. This is so bad, in fact, that I use Adobe Acrobat Pro 9 for that task, despite the much inferior OCR.

Omnipage seems to be aware of the issue, and a look at their site suggests that they realise that a lot of this activity goes on.

I decided, therefore, to buy both and see what they’re like. I will let you know!

But … software vendors are thieves and robbers! If you go to the Abbyy site, the cost of a downloaded upgrade to Finereader Pro 11 is “€ 89 / £ 65 (download)”. The full version is “€ 129 / £ 99” — and if you want just the download, it’s exactly the same price, despite the fact that it costs them less! But go to Amazon.co.uk, the complete boxed set is just £63.16 — less than the upgrade. Needless to say, that’s what I ordered.

Omnipage are no better. Go to the Nuance site, and Omnipage 18 (standard version) is £79.99, whether download or boxed. Again they swindle the download users. But go to Amazon.co.uk, and the complete boxed set is £46.90!

I didn’t buy the Omnipage Pro version, but stuck with the standard one. It’s a lot more money, and I wasn’t convinced that I’d use the extra features — especially since I don’t know if the OCR is any good at all. Here a trial version would have helped — Finereader make trial versions available online. This is smart marketing on their part, because magazine reviews of such a specialised area of software are invariably useless.

My current interest in Russian texts of Methodius means that I was interested to see that Omnipage offer a separate Russian version. Finereader used to have a specific “Cyrillic option” version — indeed I owned a copy, back in the FR5 days — but this seems to have vanished from their product list. Kudos to Finereader: Russian support is included in the main product! I only wish their obscure “fraktur” recognition module was included too! This recognises old “Gothic”-style typefaces, and some of us would find it handy. But I could only find it in their SDK for Linux. And it doesn’t seem that you can even buy the latter off-the-shelf.

Abbyy Finereader 10 upgrade now out

Posted on December 17, 2009 (updated December 18, 2009) by Roger Pearse

For many years I have used Abbyy Finereader as my OCR software. Version 10 is now out, and I have just bought an upgrade.

Mind you, I have retained copies of FR8 and FR9 on my disk, installed and ready to use. FR9 was quite an improvement in OCR terms on FR8, and has better PDF handling, but the user interface is a lot harder to use. It fights you. I’ve never got used to its quirks. In particular it decided that it wouldn’t allow me to scan images at 400 dpi on my Plustek Opticbook 3600 — which FR8 did — and since I prefer to scan at that resolution, I had to retain FR8. It’s also better for image cropping.

So … FR10. I’ve just installed it, which was painless. It asks if I want to start some screengrab software every time I start my PC — I uncheck this. I open it up for the first time, and it wants me to register – that too is painless.

Then I get a screen with a big red window of “helpful” options — with no way to close it. I uncheck “display on startup” and it still won’t go. I’m forced to close the application, and restart. Not really that good a start.

Next I open an existing FR9 project. I’d started work on Censorinus, so I use that. I select the folder; and then it asks me to save it somewhere else. Yes, OK, we never had to do that in FR5, FR6, FR7, FR8 or FR9. Why change it? So I waste some disk space and create folder censorinus_fr10. I suppose newcomers will find it useful. And it opens the project OK. Hmmm. Now what?

I click on a page, and it doesn’t seem to include any of the OCR’d text. I select ‘Read’ and it OCR’s it. But … where is the text I was working on? A look shows that FR10 has kindly deleted all my recognised text. It’s kept the blocks on the screen, and that’s it. B*****ds!! Now we know why they insisted on keeping the old directory — boy would they be lynched if they hadn’t! This is bad. This is really, really bad. Who wants to restart a whole project?

OK, well I look through a few pages rather hopelessly, and I see one where the image needs editing. So … what do we have? Well, we have the FR9 style: “Let’s hide all the tools boys! Hee hee!” I had to customise mine to get an eraser on it. How do I do that now?

Well, I can’t say. If I choose Page|Edit page image, I get a rubbish image editor, with no tools, on which I can crop. This is the FR9 approach, way inferior to the FR8 one. It looks as if they still haven’t got rid of that idiot who ruined the interface. I erase a bit of rubbish on the image … it takes ages. The page flashes as I do. Awful!

OK, I see it. You choose View|Toolbars|Quick Access bar. This puts an extra bar at the top, under the file menu. Then you do View|Toolbars|Customize. Choose categories “Image”, and you are looking at that toolbar. Now go down the icons on the left, and insert them where you want them on that toolbar. I add erase and a few others, and suddenly I can clean up the image as I want to. I can zoom the image (although only to 200%, unlike before – another degradation in service), and I can get rid of the image of some long dead student’s pencil on the page.

I’m dispirited, tho. I’m having to work at this, just to do simple OCR tasks.

OK. Let’s OCR that page. Right-click, read and … off it goes. I get two windows, image and text. Luckily the “Quick Access Bar” also allows me to minimise the image! And I click on the text at one point, where it’s duff, and … hang on, where’s the zoom at the bottom? Ah, it’s still at the bottom; just not displayed by default. (Why?!) One click on it, and it appears.

The OCR quality appears about the same, or possibly a little better. We’ll see.

Overall verdict? Wish they’d shoot the interface designer.

UPDATE: another glitch. While working on Censorinus, I had to do a global replace of “aera” to “era”. This I did, but they’ve made a subtle change. After the replace, I used to just hit Esc to get rid of the search/replace dialog box. Now it doesn’t work. And why? Because each time you do a replace, they shift the focus to the document, meaning you have to click the dialog box to get back to where you were.

This is unbelievably infuriating, and will make for much more work in using the product. All those extra clicks during a long search/replace…
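Once the text is exported, a global replace like “aera” to “era” can be done outside the OCR program in one pass, with no dialog box to fight. A minimal sketch in Python, which is not how Finereader’s own search/replace works: it matches whole words only, so that a word such as “chimaera” is left alone, and it keeps an initial capital. The function name is made up for the example.

```python
import re


def replace_whole_word(text, old, new):
    """Replace `old` with `new` only where it stands as a complete word.
    Returns (new_text, number_of_replacements)."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b", re.IGNORECASE)

    def repl(match):
        found = match.group(0)
        # Keep a leading capital, e.g. at the start of a sentence.
        return new.capitalize() if found[0].isupper() else new

    return pattern.subn(repl, text)


sample = "The aera of Nabonassar. Aera is an old spelling; chimaera is not."
fixed, count = replace_whole_word(sample, "aera", "era")
print(count, "replacements:", fixed)
```

Returning the count gives a quick sanity check against the number of hits the OCR program itself reported.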

Housekeeping journal articles; from my diary 2

Posted on August 5, 2009 by Roger Pearse

It’s hot and humid here; so much so, that I can’t think straight. So I’ve been looking at the piles of photocopied articles and running them through my scanner and throwing away the photocopy. That’s a mindless activity I can do.

Not sure I’m quite there yet, tho. The PDF’s are OK, but they aren’t OCR’d. The scanner software has OCR, but it’s not good enough. Nor is the built-in OCR in Acrobat. The best still seems to be Finereader 9; but the PDF’s don’t go through FR9 unchanged. The images can look strange.

Not sure what to do about that. But I am gradually freeing up storage space.
