OCR – Roger Pearse

Finereader 15 includes Fraktur OCR! Finally!

Posted on December 19, 2020 by Roger Pearse

Excellent news this afternoon. It seems that the new version of Abbyy Finereader, version 15 (which for some reason they have renamed Finereader PDF 15) incorporates their excellent Fraktur recognition engine for the first time.

And it works! I tried it out on some 19th century German text.

That is pretty darned good. That’s exactly what comes out, without any editing!

This has been an awful long time coming. Back in 2003 a “European Union” (i.e. German) project commissioned Russian software firm Abbyy to adapt their excellent OCR engine to handle Fraktur. They did so, and the results were good. But then somehow it all went wrong. Instead of being added to Finereader, which we all were buying, they created a standalone version purely for Fraktur, at a price that only universities could afford. The result is that for 17 years we have been denied the use of something paid for by taxpayers. But no longer.

The addition feels a bit bodged in. You turn on Fraktur recognition by selecting one of 6 languages. Instead of the language being “German (Fraktur)”, it is “Old German”, so you don’t see it in the list of languages next to “German”. But once you know, it’s fine. That’s all you have to do; just select “Old German”.

Myself I can barely read texts printed in Fraktur, and German is not my best language anyway. But with the help of this, and dear old Google Translate, we can see what these authors have to say!

From my diary

Posted on February 15, 2019 by Roger Pearse

Yesterday was Valentine’s Day. Inevitably I found myself wondering what kind of ancient or medieval literary material there was about St Valentine.

I found very little. What little there was to be found by a Google search suggested that it was all derived at many removes from the old Catholic Encyclopedia. The article in this is vague too.

So off I went to the Acta Sanctorum. Feb. 14, the feast day, is in February volume 2. There wasn’t a lot, and this is one of the oldest volumes, from 1658.

I’ve been working on a Latin Life of St George lately, so I am very much “in the zone” to work on another Latin life. So I thought that perhaps I would OCR the Latin text, and maybe look at translating it.

Abbyy Finereader 14 is an excellent piece of software. It supports the Latin language properly, which makes it very useful. Indeed I remember yearning for such a thing in days gone by.

I didn’t think that a 1658 edition, complete with long-s, would OCR that well. So I looked for the Paris reprint of the 1850’s. This I found without difficulty, as they are all in Archive.org; but the quality is not good. Not even Finereader could make much of those grainy faint pages.

My next step was to find some more copies of the book. As I indicated in my last post, I faintly remembered a Google spreadsheet full of links to PDFs of the Acta Sanctorum. A kind correspondent found it, and it is here. But … the links were all to the original edition.

So I’ve spent this morning trying to locate a better scan of one of the Paris reprint volumes. Eventually I succeeded, in Google Books, in finding it here, in the 1864 reprint. This, I was delighted to find, OCRs quite well. The page layout is hardly designed for OCR, but if you manually move the text boxes around, the results are really quite decent.

Time for lunch now. I think that I need to go out and buy the materials that I intend to cook, actually! But I shall continue correcting the OCR after that.

Once I have a Latin text, I shall post it. I shall then look at translating at least some of it.

I’ve yet to see any studies of the St Valentine literature, which is odd. It must exist; if not in English, then in German or French or certainly Italian. My search terms clearly are not good. But I can try out some searches over lunch!

UPDATE: Over lunch a kind correspondent emailed me a link to an obscure German site where they have apparently uploaded the transcribed text of the whole Acta Sanctorum. The German site itself is poorly designed, but I am assured that buried within is the entire text. If so, of course, then there is no point in my doing it. Once I’ve worked out how to use the site, I’ll write a post on it.

A few months of interesting links

Posted on July 14, 2018 by Roger Pearse

For some months I’ve been collecting bits and pieces. Mostly I have nothing much to add, but they shouldn’t be lost.

Cool 9th century manuscript online as PDF

Via Rick Brannan I learn that a downloadable PDF of the Greek-Latin St Gall 9th century manuscript of Paul’s letters is online and can be downloaded as a single PDF:

Note the link on this page where you can download a PDF of what appears to be the entire Codex Boernerianus. It is beautiful.

And so you can. It’s at the SLUB in Dresden here, where it has the shelfmark A.145.b. It also contains Sedulius Scottus, I gather.

Nice to see the interlinear, isn’t it?

Codex Trecensis of Tertullian online

A correspondent advised me that the Codex Trecensis of the works of Tertullian has appeared online in scanned microfilm form at the IRHT. Rubbish quality, but far better than nothing. The ms is here. De Resurrectione Carnis begins on 157r and ends on 194r. De Baptismo begins on folio 194r and ends on 200v. De Paenitentia begins on folio 200v.

Saints lives = Christian novels?

A review at BMCR by Elisabeth Schiffer of Stratis Papaioannou, Christian Novels from the ‘Menologion’ of Symeon Metaphrastes. Dumbarton Oaks medieval library, 45. Harvard University Press, 2017, caught my eye. This contains 6 lives from Metaphrastes’ collection.

Even though hagiographical texts are among the most frequently translated Byzantine sources, little effort has been made so far to translate parts of Symeon Metaphrastes’ Menologion. This is primarily due to the generally unfortunate editorial situation of these texts: They are transmitted relatively standardized, but in a vast number of liturgical manuscripts.

…

In addition to summarizing the status of research on Symeon’s rewriting enterprise, Papaioannou explains in his introduction why he calls the texts in focus “Christian novels.” It is not unproblematic to apply this modern term, as he himself states, but he decided to do so because of the fictionality of these narratives and because of their resemblances to the late antique Greek novel. When saying this, it is important to emphasize—as Papaioannou explicitly does—that these texts of novelistic character were not understood as such by their audience. On the contrary, the Byzantines regarded these texts as relating true stories, written for edification and liturgical purposes (see pp. xiv-xviii).

It’s an interesting review of a neglected area of scholarship where the tools for research – editions and translations – are not available.

Full-text of the Greek Sibylline Oracles online for free

Annette Y Reed broke the story on Twitter: it’s J. Geffcken, Die Oracula Sibyllina, Leipzig: Hinrichs, 1902, which has turned up at Archive.org here. A useful transcription, rather than the original book, is also online here.

All known mss in the Bodleian library – detailed in online catalogue

Ben Albritton on Twitter shares:

This is awesome – medieval.bodleian.ox.ac.uk “This catalogue provides descriptions of all known Western medieval manuscripts in the Bodleian Library, and of medieval manuscripts in selected Oxford colleges (currently Christ Church).” Sharing ICYMI too.

It also has direct links to the pinakes.irht.cnrs.fr for Greek mss!

Where did the Byzantine text of the New Testament come from?

Peter Gurry at the ETC blog asks the question, and suggests that Westcott and Hort are no longer the authorities to consult.

How to respond to politically motivated persecution

Since the election of President Trump I have noted on Twitter a new form of anti-Christian posting. There has been an endless stream of anti-Christian jeering online, demanding “how dare you support Trump”? It is surreal to see how people who hate Christians suddenly have become expert theologians on what Jesus would do. Thankfully a certain Kurt Schlichter writes *Sigh* No, Being A Christian Does Not Require You Meekly Submit To Leftist Tyranny:

Everyone seems to want to tell Christians that they are obligated to give in. There’s always some IPA-loving hipster who writes video game reviews when he’s not sobbing alone in the dark because no one loves him tweeting “Oh, that’s real Christian!” whenever a conservative fights back. I know that when I need theological clarification, I seek out the militant atheist who thinks Christ was a socialist and believes that the Golden Rule is that Christians are never allowed to never offend anyone.

It’s a good article, and sadly necessary in these horribly politicised times. It’s worth remembering that, were times different, rightists would most certainly adopt the same lofty lecturing tone.

A quote for pastors from St Augustine

Timothy P. Jones posted on Twitter:

“If I fail to show concern for the sheep that strays, the sheep who are strong will think it’s nothing but a joke to stray and to become lost. I do desire outward gains–but I’m more concerned with inward losses” (Augustine of Hippo).

Queried as to the source, he wrote:

It’s from Sermon 46 by Augustine–the entire message is an outstanding exposition of what it means to be a shepherd of God’s people…. I translated the above from this. Here’s a good English translation as well.

Artificial Intelligence in the Vatican Archives

I knew it. It’s alive!!!

Well, not quite. This is a piece in the Atlantic, Artificial Intelligence Is Cracking Open the Vatican’s Secret Archives: A new project untangles the handwritten texts in one of the world’s largest historical collections:

That said, the VSA [Vatican Secret Archives] isn’t much use to modern scholars, because it’s so inaccessible. Of those 53 miles, just a few millimeters’ worth of pages have been scanned and made available online. Even fewer pages have been transcribed into computer text and made searchable. If you want to peruse anything else, you have to apply for special access, schlep all the way to Rome, and go through every page by hand.

But a new project could change all that. Known as In Codice Ratio, it uses a combination of artificial intelligence and optical-character-recognition (OCR) software to scour these neglected texts and make their transcripts available for the very first time.

They’ve found a way around the limitations of OCR by using stroke recognition instead of letter recognition. They open-sourced the manpower by getting students (who didn’t know Latin) to input sample data, and started getting results.

All early days, but … just imagine if we could really read the contents of our archives!

Kazakhstan abandons Cyrillic for Latin-based alphabet

Via SlashDot I read:

The Central Asian nation of Kazakhstan is changing its alphabet from Cyrillic script to the Latin-based style favored by the West. The change, announced on a blustery Tuesday morning in mid-February, was small but significant — and it elicited a big response. The government signed off on a new alphabet, based on a Latin script instead of Kazakhstan’s current use of Cyrillic, in October. But it has faced vocal criticism from the population — a rare occurrence in this nominally democratic country ruled by Nazarbayev’s iron fist for almost three decades. In this first version of the new alphabet, apostrophes were used to depict sounds specific to the Kazakh tongue, prompting critics to call it “ugly.” The second variation, which Kaipiyev liked better, makes use of acute accents above the extra letters. So, for example, the Republic of Kazakhstan, which would in the first version have been Qazaqstan Respy’bli’kasy, is now Qazaqstan Respyblikasy, removing the apostrophes.

The article at SlashDot instinctively opposed a change, which can only benefit every single Kazakhstani, by making a world of literature accessible. Ataturk did the same, and for the same reason.

Tell Google that a book is in the public domain

Sometimes Google misclassifies books. But there is a way to tell it that actually the book is public domain. The Google link is here. From It’s surprisingly easy to make government records public on Google Books:

While working on a recent story about hate speech spread by telephone in the ’60s and ’70s, I came across an interesting book that had been digitized by Google Books. Unfortunately, while it was a transcript of a Congressional hearing, and therefore should be in the public domain and not subject to copyright, it wasn’t fully accessible through Google’s archive….

But, as it turns out, Google provides a form where anyone can ask that a book scanned as part of Google Books be reviewed to determine if it’s in the public domain. And, despite internet companies sometimes earning a mediocre-at-best reputation for responding to user inquiries about free services, I’m happy to report that Google let me know within a week after filling out the form that the book would now be available for reading and download.

What does it mean to speak of an authorial/original/initial form of a Scriptural writing when faced with tremendous complexity in the actual data itself?

Back at ETC blog, Peter Gurry discusses this with Greg Lanier here.

Some of the difficulty, one senses, is because the interaction of the divine with an imperfect world is always inherently beyond our ability to understand. It requires revelation, which is not supplied in this case.

And with that, I think I’ve dealt with a bunch of interesting stories which didn’t deserve a separate post. Onward!

LACE Greek OCR project

Posted on December 13, 2013 by Roger Pearse

On a better note, we live in blessed times where technology and the ancient world are concerned. The astonishing results of a project to OCR volumes of ancient Greek from Archive.org may now be found online here. Clicking on the first entry, and then on one of the outputs in it here, gives astonishingly good results.

Free ancient Greek OCR – getting started with Tesseract

Posted on March 29, 2013 by Roger Pearse

A correspondent draws my attention to Tesseract, a Google-hosted project to do Optical Character Recognition. The Tesseract website is here. Tesseract is a command-line tool, but there are front-ends available.
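
Since Tesseract is driven from the command line, one run can be sketched as below. This is only a minimal sketch, assuming the usual `tesseract <image> <output base> -l <language>` form and the language codes `grc` (ancient Greek), `eng` and `deu-frak` (German Fraktur) as I understand them for the 3.0x series; the codes differ between releases, and the file names are made up for illustration. The helper only builds the argument list; actually running it needs Tesseract installed and on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Build the argument list for one Tesseract run.

    Tesseract adds the ".txt" extension to output_base itself.
    Several language codes are joined with "+", e.g. "grc+eng".
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run Tesseract; raises CalledProcessError if it exits with an error."""
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_tesseract_command("severian_p1.tif", "severian_p1", ["grc"])))
```

Keeping the command construction separate from the call that runs it makes it easy to check what would be executed before pointing it at a whole directory of page images.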

I am a long-term fan of Abbyy Finereader, and, let’s face it, have probably OCR’d more text than most. So I thought that I would give Tesseract 3.02.02 a go.

First, getting the bits. I work on Windows 7, so I downloaded:

The windows installer
The documentation
A front-end from the third-party page. I downloaded GImageReader.

I double-clicked on the tesseract installer. This went smoothly. It gave me the option to download and install extra languages (English is the default); among others I chose ancient Greek, and German, and German (fraktur). The last of these covers the “gothic” style characters fashionable in Germany until 1945. Curiously the list of languages is not in alphabetical order, with French following German.

Next I clicked on the GImageReader installer. This ran quickly, and warned that you need a copy of Tesseract installed. It did not create a desktop icon; you have to locate the program in the Start menu. This would throw some users, I suspect.

I then started GImageReader. It started with an error; that it was missing the “spellcheck dictionary for Dansk(Frak)”. Why it looks for this I cannot imagine. Not a good start, I fear. I suspect that it expects Tesseract to be installed with all possible languages.

Next I browsed to a tif file containing part of the English translation of Cyril of Alexandria on John. The file explorer is clunky and non-Windows standard. The page displayed OK, although if you swap back to another window and then back again it seems to re-render the image.

At the top of the page is the recognition language – set by default to the mysterious Dansk (Frak). I changed this to English. I then hit “Recognize all”. The recognition was quick.

So far, so good, then. While unpolished, the interface is usable without a lot of stress.

The result of the OCR was not bad. A window pops open on the right, with ASCII text in it. It didn’t cope very well with layout issues, nor with small text. But the basic recognition quality seemed good.

My next choice was a PDF with the text of Severian of Gabala, De pace, in Greek and Latin. This opened fine! (rather to my surprise). I held the cursor over the page, and it turned into a cross. Holding down the left mouse button drew a rectangle around the text I wanted to recognise. A quick language change to Ellenika and I hit “Recognise selection”.

The result was not bad at all. Polytonic accents were recognised (although it did not like the two g’s in a)/ggeloi).

There were some UI issues here. I could zoom the window being read – great! But annoyingly I could not zoom the text window, nor copy and paste from it to Notepad. But I could and did save it to a Unicode text file. The result was this:

1. Οἱ ἄηε).οι τὸν οὐρἀνιον χο-
ρὸ·· συστησἀμενοι εὺηγγελίζοντο
τοῖς ποιμἑσι λἑγοντες· «εὐαγγε-
λιζόμεθα ὑμῖν σήμερον χαρὰ· με-
γάλην, ήτις ἔσται παντὶ τῷ λαῷ».
Παρ’ αὐτῶν τοίνυν τῶν ὰγίων ἐκεί-
νων ὰηέλων καὶ ῆμεῖς δανεισἀ-
μενοι φωνὴν οὐαηελιζόμεθα ὑμῖν
σήμερον, ὅτι σήμερον τὰ τῆς
ὲκπλησίας ἐν γαλή~›η καὶ τὰ τῶν
αἰρετικῶν ἐν ζάλη. Σἡμερον τὸ
οπιάφος τῆς ἑκκλησίας ἐν γαλήνη
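
The saved text keeps the line breaks of the printed column, so words such as εὐαγγε- / λιζόμεθα are split by a hyphen at the end of the line. A minimal sketch of rejoining them is below. It is my own illustration, not something Tesseract or GImageReader does, and it assumes that every hyphen at the end of a line marks a split word, which will be wrong for genuine compound hyphens.

```python
def join_hyphenated_lines(text):
    """Rejoin words split across line ends by a trailing hyphen.

    Lines are merged into one paragraph separated by single spaces.
    Assumption: a hyphen at the end of a line always marks a split word.
    """
    pieces = []
    carry = ""  # text waiting to be glued to the start of the next line
    for raw_line in text.splitlines():
        line = carry + raw_line.strip()
        carry = ""
        if line.endswith("-") and len(line) > 1:
            carry = line[:-1]  # drop the hyphen and hold for the next line
        else:
            pieces.append(line)
    if carry:
        pieces.append(carry + "-")  # nothing followed, so restore the hyphen
    return " ".join(piece for piece in pieces if piece)


if __name__ == "__main__":
    sample = "τοῖς ποιμἑσι λἑγοντες· «εὐαγγε-\nλιζόμεθα ὑμῖν σήμερον"
    print(join_hyphenated_lines(sample))
```

The recognition errors themselves are left alone; this only undoes the column layout so that the text can be searched or proofread as running prose.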

Conclusions? I’ve used worse in the past. I think it looks pretty good. I suspect that, to use it, one would need to train it a bit more, but you can’t complain about the price!

Well done, those who created the training dictionary.

From my diary

Posted on November 7, 2011 by Roger Pearse

Thankfully my PC decided that it would boot second time around. Windows is quite an unstable platform these days, I find.

A correspondent writes that there is now OCR software available which can recognise Arabic. It’s sold by Novodynamics of Michigan and called “Verus”. Sadly it is ridiculously expensive — $1300 for the “standard edition” and they don’t dare print a price for the “professional edition”.

An extraordinarily advanced OCR solution, VERUS™ Professional provides the most innovative Middle Eastern language and Asian optical character recognition in the world. VERUS™ Middle East Professional recognizes Arabic, Persian (Farsi, Dari), Pashto, Urdu, including embedded English and French. It also recognizes the Hebrew language, including embedded English. VERUS™ Asia Professional provides support for both Simplified and Traditional Chinese, Korean and Russian languages, including embedded English. Both products automatically detect and clean degraded and skewed documents, automatically identify a page’s primary language, and recognize a page’s fonts without manual intervention. VERUS’™ intuitive user interface allows users to quickly review and edit recognized text.

http://www.novodynamics.com/verus_pro.htm

I would imagine that it should be possible to adapt this software to recognise Syriac, if the manufacturer would agree.

Nuance Omnipage 18

Posted on November 5, 2011 by Roger Pearse

This morning I got hold of Nuance Omnipage 18 standard edition. The box was very light: mostly air, a CDROM, and a cheeky bit of cheaply printed paper announcing that they included no manuals at all, in order to save the planet. Humph.

The footprint is quite small, and I copied the CDROM to my hard disk before installation. Curiously the disk packet had two numbers both labelled as “serial number”.

The installation was unfamiliar. As I always do, I clicked on the “select options” and found that it wanted to install some voice-related stuff. I unchecked that. Then I went ahead and did the install. At one point it announced that it was going to install something called “CloudConnector”, without giving me the chance to decline. But I hit cancel, and the rest of the install went fine. It then popped up a box asking me to register — this opened a web page with a rather shoddy page collecting details. Every page gave an “invalid certificate” error in IE, which is sloppy. And then it asked if I wanted to activate, which I did. So far, so good.

I then opened OP. It popped up some “friendly” menu, which I removed. Then I looked at the main screen, and decided to open a PDF and work on it in OP. It took a little while to work out that I needed “Process … Workflows … PDF or Scanned Image to Omnipage document”. Somehow I think “File … Open” would be rather more normal! Once you’ve selected this, you click on a button on the tool bar to start processing. It prompted for a PDF, which I had created myself from some digital photos of Ibn Abi Usaibia, and it promptly objected “non-supported image size” to each page and refused to open it! Silly programme: I don’t care what the image size is, I want to get some OCR of the pages!

OK, let’s see if I can work around this. I select instead “Camera image to Omnipage document” and select a bunch of the same images before I put them in a PDF. This time it decides to cooperate. It reads the images, rotates them to portrait mode (correctly). Then it pops up some kind of dictionary thing, which is annoying. I hit “close” and the windows cursor starts spinning. It doesn’t seem to be doing anything, but it’s just sitting there. Hum.

After a while I get bored, and close the program down. At least it dies gracefully, prompting me to save my work. I reopen it, and reopen my project. Then I click the “Text editor” tab. It looks as if it recognised page 1 OK, despite being typescript. No errors, anyway. My first encounter with OCR quality is good.

But … I can only see EITHER the image, or the recognised text, not both at the same time. Hum. It ought to be possible to do this. After a bit of hunting, I find “Window … Classic view” which gives me side-by-side. But I go back to “flexible view”, because I have just discovered that, if I click on the text window, the line of text from the image appears in a hover box above the line.

Now this is really rather convenient. Mind you, when the lines are slanted — as is often the case — I wonder how it would do?

I hit Alt-Down, and nothing happens. Of course, this is not Finereader. A bit of hunting and the Edit menu informs me that Ctrl-PgDn is next page. F4 is next suspect character. I never used this in Finereader, but here using it with the hover box really works. My text here has quite a few vowels with overscores. None of these are recognised by default, but at least I can see them!

So far, not too bad! Better, indeed, than I had feared.

Now I need to start adding custom characters. I want to define my own “language” for recognition, based on English but with all the funny characters that I need in this document to represent long vowels. “Tools … Options” seems to give me choices. On the process tab I see a box saying “Open PDF as images”. It’s unchecked by default; I’ll check it now, and see if I can open that PDF. Looks as if you have to save your settings; I save mine to the same directory where I stored the install CDROM. Then I do “File … New”, and … still can’t open my PDF. Oh well.

Back to the OPD project from the digital images. Can I define some extra characters? Well you can; but it all looks rather weedy compared to Finereader’s options. Let’s try these: āīōūšŠ. I get them from charmap, pointing at the Alphabetum Unicode font; but any reasonably full unicode font such as Arial Unicode MS or Titus Cyberbit Basic would do. Then “Tools… Options … OCR … Additional characters” and I just paste them into the box. The “…” button next to that box leads to some weedy, underspecified lookup, which really needs to be more like Charmap. But do these characters get picked up?
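
When hunting for characters like these in charmap it helps to know their Unicode code points and names. The short sketch below lists them using Python's standard `unicodedata` module; it is just an illustration and has nothing to do with Omnipage itself. The text is normalised to NFC first, on the assumption that the OCR package wants single precomposed characters rather than a base letter plus a combining mark.

```python
import unicodedata

# The six extra characters tried in the "Additional characters" box.
EXTRA_CHARACTERS = "āīōūšŠ"


def describe(chars):
    """Return (character, code point, Unicode name) for each character."""
    composed = unicodedata.normalize("NFC", chars)
    return [(c, f"U+{ord(c):04X}", unicodedata.name(c)) for c in composed]


if __name__ == "__main__":
    for char, code_point, name in describe(EXTRA_CHARACTERS):
        print(char, code_point, name)
```

The names show that the four vowels carry a macron and the two forms of "s" carry a caron, which are the terms to search for in a character map.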

Now I want to re-recognise. I click on the thumbnail for page 1 and … the menu gives me no option. Hum. Wonder what to do.

In fact I’ve spent some time now trying to work out how to kick off a limited re-read. No luck yet. Surely this should be simple and obvious? Eventually I work out that you select the thumbnails of the pages you want, and hit the toolbar button and that kicks it off.

So how does it do? Well, it recognises the overscore a. None of the other characters are picked up. That’s not so good as Finereader.

Also the more skewed the page is, the less well OP handles it (understandable), and the less easy it is to fix. OP rather presumes that the recognition is near perfect, and has only limited fixing to do. In such a situation, indeed, OP will be quicker to do a job than Finereader. And I notice that a ribbon with characters to paste is across the top of the text window — nice touch. This motivates me to go back and explore again. I haven’t worked out how to set MY characters in that ribbon. But when I went into the weedy charmap substitute, there was a similar ribbon at the top, and right-clicking on it allowed you to add more character sets, which increased the number of characters; and by clicking on them, to add them to the ribbon. How you remove them from the ribbon I don’t know. It is, in truth, a badly designed feature. And the OCR still doesn’t recognise what I need.

I’ve had enough for now and closed it down. Is it any good? Almost certainly. It’s less good for weird characters. But it undoubtedly will see service.

UPDATE: Have just discovered, on starting Word 2010, that Nuance have seen fit to mess with the menus in this (without asking me). Drat them!

First impressions of Abbyy Finereader 11

Posted on November 3, 2011 by Roger Pearse

Finereader 11 looks quite a lot like Finereader 10. So far, it seems very similar. One nice touch is that when it is reading a page, a vertical bar travels down the thumbnail.

But I have already found an oddity. I imported into it the project that I am currently working on in Finereader 10 — part of Ibn Abi Usaibia — and it looks really weird! All the recognised text is spaced out vertically! The paragraph style is “bar code”, and no other styles are available.

Here’s what I see when I open it:

[Image: Opening a Finereader 10 project in Finereader 11]

Not very useful, is it? But when I minimise the image, and increase the recognised panel to 100% size, it looks like this!

[Image: Finereader 11 – zoomed version of recognised text]

There seems to be no rhyme or reason for the massive gaps between lines. And here is the very same project in Finereader 10:

[Image: Finereader 10 image of same document]

Weird. Doubtless there is some setting to persuade FR11 to behave, but it isn’t obvious what. This does NOT happen when I recognise the page again in FR11. The style gets set to “Body Text (2)”, in this case.

And … when I do Ctrl-Z, and revert the recognition, it goes back to the weird appearance above. But … this time, a bunch of other styles are available, and if I change to BodyText2, that is what I get. But on the next page … once again, Barcode is the only style.

This must be a bug, I think. It means that Abbyy’s testers have not tested importing documents from FR10 sufficiently. What it means is that you can’t upgrade projects once you start them. Well … I try to keep my projects small, and break up large documents into small chunks, so I shan’t mind. That would seem to be the workaround.

One good feature that is new, is that it remembers where you were in the document last time. All previous versions always opened the document at page 1. I got quite accustomed, indeed, to placing a “qqq” at the point where I stopped, so I could find it again next time. No need in FR11, it seems.

Also FR11 comes bundled with “PDF Transformer 3”. This suggests that the latter product was bought in, to beef up the rather unremarkable PDF handling in Finereader. I’ve not tried this yet, tho.

OCR: Omnipage and Finereader

Posted on November 2, 2011 by Roger Pearse

Scanning and OCR is on my mind at the moment. A new version of Abbyy Finereader — version 11 — is out. Since I have some 750 pages of Ibn Abi Usaibia to do, any improvement in accuracy is welcome, however slight.

Originally I did my OCR using Omnipage. It is many years since I was led (by Susan Rhoads of Elfinspell.com) to look at Finereader 5. This was immensely superior, and I have never used any other product since. But I see that Omnipage 18 is now out. Stirred by a bit of curiosity, I’ve been wondering what this would be like.

Finereader is not without its faults. Foremost among them, for what I want to do, is that it cannot make a PDF searchable without making the PDF much, much larger, messing with the images, and so forth. This is so bad, in fact, that I use Adobe Acrobat Pro 9 for that task, despite the much inferior OCR.

Omnipage seems to be aware of the issue, and a look at their site suggests that they realise that a lot of this activity goes on.

I decided, therefore, to buy both and see what they’re like. I will let you know!

But … software vendors are thieves and robbers! If you go to the Abbyy site, the cost of a downloaded upgrade to Finereader Pro 11 is “€ 89 / £ 65 (download)”. The full version is “€ 129 / £ 99” — and if you want just the download, it’s exactly the same price, despite the fact that it costs them less! But go to Amazon.co.uk, the complete boxed set is just £63.16 — less than the upgrade. Needless to say, that’s what I ordered.

Omnipage are no better. Go to the Nuance site, and Omnipage 18 (standard version) is £79.99, whether download or boxed. Again they swindle the download users. But go to Amazon.co.uk, and the complete boxed set is £46.90!

I didn’t buy the Omnipage Pro version, but stuck with the standard one. It’s a lot more money, and I wasn’t convinced that I’d use the extra features — especially since I don’t know if the OCR is any good at all. Here a trial version would have helped — Finereader make trial versions available online. This is smart marketing on their part, because magazine reviews of such a specialised area of software are invariably useless.

My current interest in Russian texts of Methodius means that I was interested to see that Omnipage offer a separate Russian version. Finereader used to have a specific “Cyrillic option” version — indeed I owned a copy, back in the FR5 days — but this seems to have vanished from their product list. Kudos to Finereader: Russian support is included in the main product! I only wish their obscure “fraktur” recognition module was included too! This recognises old “Gothic”-style typefaces, and some of us would find it handy. But I could only find it in their SDK for Linux. And it doesn’t seem that you can even buy the latter off-the-shelf.

OCR with macrons and other funny letters in Finereader

Posted on July 14, 2011 by Roger Pearse

I’m scanning Brockelmann’s Geschichte der arabischen Litteratur. It’s mostly in German, of course; but the Arabic is transliterated using a wide variety of odd unicode characters. There are letter “a” with a macron over it (a horizontal line), and “sh” written as “s” with a little hat on it and so forth. These don’t occur in modern German, so get weeded out.

But you can do this, in Finereader. You just define a new language, based on German. I called mine “German with Arabic”. And when you do, you specify which unicode characters the language contains. So all I had to do was scroll down through the unicode characters, find the funnies that Brockelmann had used, and add them in.

And, if you don’t get them all first time, you can edit the language, select it, get the properties, and add the next few in. And … it works. It really does.
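
If you already have a few pages corrected by hand, a short script can list which characters fall outside the ordinary German alphabet, so that you know what to add to the custom language. This is only a sketch of that idea, not anything Finereader provides: the base alphabet and the sample string are my own assumptions, and the sample is made up rather than taken from Brockelmann.

```python
import unicodedata
from collections import Counter

# Assumed base alphabet for modern German (compared in lower case).
GERMAN_LETTERS = set("abcdefghijklmnopqrstuvwxyzäöüß")


def extra_letters(text, base=GERMAN_LETTERS):
    """Count the letters in text that are not part of the base alphabet."""
    counts = Counter()
    for ch in unicodedata.normalize("NFC", text):
        if ch.isalpha() and ch.lower() not in base:
            counts[ch] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample of transliterated words inside German text.
    sample = "Die Handschrift nennt šāh und kitāb, später auch sūra."
    for char, count in extra_letters(sample).most_common():
        print(char, f"U+{ord(char):04X}", count)
```

Digits, punctuation and spaces are ignored, so the output is just the list of "funnies" with how often each occurs.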

Finereader is really amazing OCR software. And I learned all this from the help file. Look under “alphabet” in the search.
