In my last post I found that it was possible to turn a PDF full of images of Amharic text into recognised electronic text using Google Drive, and then get some translation of the results into English using Google Translate.
There were some extremely interesting comments made on the post, which I have been reading. I have also prepared a PDF of the whole text of the Life of Garima by Yohannes, and run that through the Google Drive process.
We started by trying to read a passage of this text in which – supposedly – God stopped the sun so that St Garima could copy the Bible in one day. The summary of the work given by Rossini (instead of a proper translation, drat him) indicates that this was on lines 356-60 of his text, which turn out to be the last line of p.161 and the first three of p.162. Here they are:
The output from the OCR is good, but you still have to compare the characters carefully. Errors can often be picked up just by dumping the raw scan output into Google Translate, where oddities such as stray numerals stand out in the English.
Here we have a character that is plainly wrong, and coming out as a numeral “4”. It looks like an “o” with a hat and two dots under. The two dots under are legs in another copy of Rossini.
I’m guessing that it’s a “ge” character, from looking at the Wikipedia article, but I can’t be sure. The script isn’t an alphabet, but a syllabary, based on syllables. Each character is a consonant followed by a vowel, which makes for a lot more characters. There’s a table of the characters on the Wikipedia article, consonants down the left, vowels across the top. I’ve not really looked at this.
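One way to take some of the guesswork out of identifying a doubtful character is to ask what Unicode calls it. What follows is only a minimal sketch, using Python's standard `unicodedata` module; the function name `describe_characters` and the sample string are made up for illustration. The sample is the first two tokens of the OCR output as it appears later in this post, written as escapes, and I am assuming the OCR really did emit those code points. It says nothing about whether the OCR read the printed page correctly, only what the electronic character is.

```python
import unicodedata


def describe_characters(text):
    """Return (character, code point, Unicode name) for each distinct
    non-space character, in order of first appearance."""
    seen = []
    for ch in text:
        if ch.isspace() or ch in seen:
            continue
        seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]


# Hypothetical sample: the first two tokens of the OCR output, as escapes.
sample = "\u136c \u1361"
for ch, code, name in describe_characters(sample):
    print(ch, code, name)
```

If the names reported include the word DIGIT, that may explain why a numeral turns up in the Google Translate output: the Ethiopic script has its own numerals, and U+136C is named ETHIOPIC DIGIT FOUR.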
The Google translate output is also interesting because of the choice of “detected language” – Tigrayan, rather than Amharic. If you force it to Amharic, you get a lot less meaning.
One awkward part of using Google Drive to do the OCR is that it doesn’t preserve the line breaks, which makes comparing the lines harder. So you have to put them back manually:
፬ ፡ ወኮነ ፡ በአሐቲ ፡ ዕላት ፡ ወነሥአ ፡ መጽሐፈ ፡ ወቀለመ ፡ ወወጠነ፡
ይጽሐፍ ። ወተንሥአ ፡ ለጸሎት በሰርክ ። ወጸሐፉ ፡ ሎቱ : መላእክት ፡ ወንጌ ለ ፡
በ፬ ፡ ሰዓት ፡ ወትርጓሜሁ ። ወመላእክተ ፡ እግዚአብሔር ፡ ወትረ ፡ ይት ለአክዎ ፡
ወእግዚእነሂ ፡ ክርስቶስ ፡ ያንሶሱ ፡ ምስሌሁ ። ወተሰምዐ ፡ ዜናሁ :
ውስተ ፡ ኵሉ ፡ ሀገር ። ጸሎቱ ፡ ወበረከቱ ፡ የሀሉ ፡ ምስሌነ ።
The Wikipedia article mentioned earlier gave me a list of punctuation marks. There are two sorts of punctuation visible in here. The colon mark is actually word division, which means that some words above go over two lines. I’ve chosen not to split words above. The double colon mark “::” is the full stop. Interestingly Google Translate gives different results if you remove the spaces!
Going through the electronic text, removing spaces, I noticed that sometimes the word-separator isn’t detected by the OCR. So I added that in. Sometimes it put a Roman colon instead, so I replaced that. Finally I split the text into sentences, one per line:
፬፡ወኮነ፡በአሐቲ፡ዕላት፡ወነሥአ፡መጽሐፈ፡ወቀለመ፡ወወጠነ፡ይጽሐፍ።
ወተንሥአ፡ለጸሎት፡በሰርክ።
ወጸሐፉ፡ሎቱ፡መላእክት፡ወንጌ ለ፡በ፬፡ሰዓት፡ወትርጓሜሁ።
ወመላእክተ፡እግዚአብሔር፡ወትረ፡ይትለአክዎ፡ወእግዚእነሂ፡ክርስቶስ፡ያንሶሱ፡ምስሌሁ።
ወተሰምዐ፡ዜናሁ፡ውስተ፡ኵሉ፡ሀገር።
ጸሎቱ፡ወበረከቱ፡የሀሉ፡ምስሌነ።
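The mechanical part of that clean-up could be scripted. This is a rough sketch only, not what was actually done for this post (that was by hand): the function name `clean_ocr_text` is invented, and it assumes that every Roman colon in the OCR output is a misread word-separator. It swaps Roman colons for the Ethiopic word-separator (U+1361), strips all spaces and line breaks, and splits after each Ethiopic full stop (U+1362). It cannot put back a word-separator that the OCR missed entirely, so those still have to be found by eye.

```python
import re

WORDSPACE = "\u1361"  # Ethiopic word-separator, looks like a colon
FULL_STOP = "\u1362"  # Ethiopic full stop, looks like a double colon


def clean_ocr_text(raw):
    """Normalise separators and return the text as a list of sentences."""
    # Assumption: a Roman colon is always a misread word-separator.
    text = raw.replace(":", WORDSPACE)
    # Drop all whitespace, including line breaks, so split words rejoin.
    text = re.sub(r"\s+", "", text)
    # Split after each full stop, keeping the full stop with its sentence.
    sentences = re.split(f"(?<={FULL_STOP})", text)
    return [s for s in sentences if s]


# Sample: the tail end of the passage as it came out of the OCR.
raw = "ወተሰምዐ ፡ ዜናሁ :\nውስተ ፡ ኵሉ ፡ ሀገር ። ጸሎቱ ፡ ወበረከቱ ፡ የሀሉ ፡ ምስሌነ ።"
for sentence in clean_ocr_text(raw):
    print(sentence)
```

Run on that sample, it should give back the last two sentences in the same form as the hand-cleaned version.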
Running the cleaned-up text through Google Translate again, I get this:
But this still is not good enough to do much with. If we didn’t have an idea what the text said, this would not tell us.
All this fiddling about would certainly get you into contact with the language, and start you on a journey to learning it. But the translation, although intriguing, is not good enough for other purposes.
One suggestion that was made in the comments to the last article was that ChatGPT gave better results. The output quoted was indeed produced, and was very smooth and seemed to be a series of liturgical prayers. But… I don’t think that this is actually the content. These AI tools are really only an improved version of the text prediction tools you get on messaging on a mobile phone. So it was pumping out garbage.
Anyway I tried it on this passage, and it crashed GPT very effectively! At the moment I can’t get any reply of any sort, not even to “hello”.
I don’t think that I will do more here. Clearly the technology is almost, but not quite, good enough to be useful.