In my last post I mentioned how the Life of St Garima in Ethiopian was printed by Rossini, but without a translation. In fact it has never been translated into any modern language, to my knowledge. I don’t know any Ethiopian, and I doubt that I ever will.
But we live in an age of wonders, when it comes to unfamiliar languages.
So… is it possible to work with Ethiopian language editions, even if you know no Ethiopian? What about Google Translate? Ethiopian is in this heavy unfamiliar script. Is there OCR for this? If you can scan Rossini’s edition, can you pop it into Google Translate and get the English?
There are two sorts of Ethiopian out there, I know. There is Ge`ez, or classical Ethiopian; and there is Amharic, the modern dialect. Rossini printed his text from a 19th century manuscript. So it seems likely that this is in Amharic.
A quick Google search confirmed it: Google Translate knows Amharic! A bit of googling found me an Amharic news website online, here. I’m using Chrome, so all I had to do was right-click anywhere and select “Translate to English” and the whole website was rendered into some sort of English. And… it worked!! Yay me! It’s obviously not 100%, but it’s way better than 0%!
So what about OCR? I was sad to see that Abbyy Finereader apparently doesn’t support Amharic. That’s a blow. It was developed originally to handle Cyrillic, so it certainly has the capability. But it’s not offered. Drat.
A bit of googling brought me to a dubious-looking website here, claiming to offer a selection of tools which could do Amharic OCR. The prose felt a bit machine-generated, so I worried that it was bunk, or worse, a malicious site. But the first option was… Google Drive.
I never knew this, but it seems that, if you upload a PDF containing an image of text, and then open it in Drive as a Google Docs document, it OCRs the content.
Well, I thought, let’s give it a try. So I extracted the first page of Rossini’s edition, using Adobe Acrobat Pro 9 – no flashy latest-edition stuff going on here! Here’s a pic:
Then I uploaded it, and opened it as a Google Docs document. And … it just treated the Amharic as an image. Dang! But I noticed that it did indeed OCR the Italian at the top of the page!
This is supposed to work. So I thought maybe I should work over the image a bit. I imported the one-page PDF into Abbyy Finereader 15, and chopped off the Italian at the top, and the critical apparatus at the bottom. I then used the image editor in Finereader to “whiten the background”. This can be flaky, but this time it worked fine, and I got a pure white background. And I got this:
(I’ve just seen the marginal notes, which I need to chop off as well, so I’ll have to go round the loop again)
I exported the image as a PNG, and I used Acrobat again to create a PDF from the image. Then I uploaded the new PDF to Google Drive, and opened it as a Google Docs document. And… it worked! Sort of…
በስመ : አብ : ወወልድ ‘ ወመንፈስ ፡ ቅዱስ ፡ ፩ ፡ አምላከ ፡ ላዕሌሁ ፡ ተወ ከልኩ፡ ወቦቱ ፡ አመንኩ ፡ እስከ ፡ ላዓለመ ፡ ዓለም ፡ አሜን ።
ድርሳን ፡ ዘደረሰ ፡ ቅዱስ ፡ ዮሐንስ ፡ ኤጲስ ፡ ቆጶስ ፡ ዘአክሱም o ፡ በእንተ ዕበዩ ፡ ወክብሩ ፡ ለቅዱስ ፡ ይስሓቅ = ወይቤ ፤ ስምዑ ‘ ወልብዉ ፡ ኦአኀውየ 5 ፍቁራንየ ፡ ዘእነግረከሙ ። ርኢኩ ፡ ብእሲተ ፡ እንዘ ፡ ይዘብጥዋ ፡ ዕራቃ ወእንዘ ፡ ይሀርፉ ፡ ላዕሌሃ ፡ ወላዕለ ፡ እግዝእትነ ፡ ማርያም ፡ እንዘ ፡ ይብሉ በእንተ ፡ ወልዳ ፡ ክርስቶስ ፤ እምብእሲት ፡ ኪያሁ : ኢተወልደ ፣ ይብሉ ፡ እላ ፡ ኢየአምኑ ፡ በክርስቶስ = ወኮንኩ ፡ እንዘ ፡ እረውጽ ፡ ወአኀዝኩ እስዐም ፡ ታሕተ ፡ እገሪሃ ፡ ለይእቲ ፡ ብእሲት ፡ እንዘ ፡ ትብል ፤ እወ ▪በዝ ፡ አንቀጽ ፡ ወፅአ ፡ ንጉሠ ፡ ሰማያት ፡ ወምድር ። ወሶበ ፡ ትብል፡ ከሙዝ ፡ ወ
That’s… rather astonishing. No idea what all that is, but it looks sort of right. Let’s bear in mind that Rossini printed his edition in 1897. This is not a modern typeface. So this is rather good.
Next step was to paste it into Google Translate. I set it to auto-detect the language, and pasted in the first bit. And… it worked. In fact it gave a really useful transcription into Roman letters as well, which makes it a LOT easier to manipulate the text.
OK, I’m cheating slightly. The first time I uploaded, the translation ended at “Spirit”. But this is a Google Translate bug – it sometimes omits the remainder of a sentence. If you split the text with a line feed, you often get the rest. And that’s what I did. I worked out by experiment where I needed to be, and then I got the above.
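For anyone wanting to automate the line-feed trick, here is a minimal Python sketch. I found my split points by experiment, so this is only one possible starting point: it assumes that breaking after the Ethiopic full stop (።), semicolon (፤) and paragraph separator (፨) is good enough, which may not hold for every passage. The function name `split_for_translation` is my own invention, not part of any tool.

```python
import re

# Assumption: these Ethiopic marks usually close a sentence or clause.
# U+1362 full stop, U+1364 semicolon, U+1368 paragraph separator.
SENTENCE_END = "\u1362\u1364\u1368"


def split_for_translation(text: str) -> str:
    """Return text with a line feed inserted after each closing mark."""
    # Flatten the line wraps left over from the OCR first.
    flat = " ".join(text.split())
    # Split just after each closing mark, swallowing any spaces that follow.
    pieces = re.split(f"(?<=[{SENTENCE_END}])\\s*", flat)
    return "\n".join(p for p in pieces if p)


if __name__ == "__main__":
    sample = "በስመ ፡ አብ ። ድርሳን ፡ ዘደረሰ ፤ ስምዑ"
    print(split_for_translation(sample))
```

The output can then be pasted into Google Translate a few lines at a time, and the breaks adjusted by hand where the translation still cuts off.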
I don’t quite believe the translation of the second sentence either. I suspect I need to play with this a bit to work out what each word is.
I notice all those colons between every word. It might help if I actually looked up the script online!
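As far as I can tell, those "colons" are the Ethiopic wordspace character (፡, Unicode U+1361), which sits between words where we would put a space. The OCR output above sometimes gives a plain ASCII colon instead, plus the odd stray character such as ‘ or =. Here is a rough Python sketch for breaking the OCR text into words and flagging anything that is not in the main Ethiopic Unicode block (U+1200 to U+137F), which is a fair hint that the OCR misread something. The function names are hypothetical, and treating the ASCII colon as a misread wordspace is my assumption.

```python
import re

WORDSPACE = "\u1361"  # ፡ ETHIOPIC WORDSPACE


def is_ethiopic(ch: str) -> bool:
    """True if ch lies in the main Ethiopic block, U+1200..U+137F."""
    return "\u1200" <= ch <= "\u137f"


def split_words(text: str) -> list:
    """Split on the wordspace mark, an ASCII colon (assumed misread), or whitespace."""
    return [w for w in re.split(f"[{WORDSPACE}:\\s]+", text) if w]


def suspicious_tokens(text: str) -> list:
    """Return the tokens that contain any character outside the Ethiopic block."""
    return [w for w in split_words(text) if not all(is_ethiopic(c) for c in w)]


if __name__ == "__main__":
    line = "በስመ : አብ : ወወልድ ‘ ወመንፈስ ፡ ቅዱስ ፡ ፩ ፡ አምላከ"
    print(split_words(line))
    print(suspicious_tokens(line))
```

Anything flagged is worth checking against the printed page before trusting the translation of that word.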
But I think you’ll agree that this is quite marvellous – I, who know absolutely nothing about the language, am getting something useful out!