More experiments with Amharic and technology

In my last post I found that it was possible to turn a PDF full of images of Amharic text into recognised electronic text using Google Drive, and then get some translation of the results into English using Google Translate.

There were some extremely interesting comments made on the post, which I have been reading.  I have also prepared a PDF of the whole text of the Life of Garima by Yohannes, and run that through the Google Drive process.

Where we started was in trying to read a passage of this text, in which – supposedly – God stopped the sun so that St Garima could copy the Bible in one day.  The summary of the work given by Rossini (instead of a proper translation, drat him) indicates that this was on lines 356-60 of his text, which turns out to be the last line of p.161 and the first three of p.162.  Here they are:

The output from the OCR is good, but you still have to compare the characters carefully.  Errors can often be picked up just by dumping the raw scan output into Google Translate, which shows things like numerals.

Here we have a character that is plainly wrong, and coming out as a numeral “4”.  It looks like an “o” with a hat and two dots under.  The two dots under are legs in another copy of Rossini.

I’m guessing that it’s a “ge” character, from looking at the Wikipedia article, but I can’t be sure. The script isn’t an alphabet, but a syllabary, based on syllables.  Each character is a consonant followed by a  vowel, which makes for a lot more characters.  There’s a table of the characters on the Wikipedia article, consonants down the left, vowels across the top.  I’ve not really looked at this.
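One way to check a doubtful character without eyeballing the table is to ask Unicode what it thinks each character is. The following is only a minimal sketch, assuming Python 3 and its standard unicodedata module; the function names describe and suspects are my own invention for illustration, and the sample is the start of the OCR output for this passage.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        # Fall back to "UNKNOWN" for anything without a Unicode name.
        name = unicodedata.name(ch, "UNKNOWN")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows

def suspects(text):
    """Return the rows that are neither Ethiopic syllables nor the two
    Ethiopic punctuation marks; these are candidates for OCR errors."""
    expected = ("ETHIOPIC SYLLABLE", "ETHIOPIC WORDSPACE", "ETHIOPIC FULL STOP")
    return [row for row in describe(text) if not row[2].startswith(expected)]

if __name__ == "__main__":
    sample = "፬ ፡ ወኮነ ፡ በአሐቲ ፡ ዕላት"
    for ch, codepoint, name in suspects(sample):
        print(ch, codepoint, name)
```

Run on the sample, this should single out the leading ፬ as an Ethiopic digit rather than a syllable. A flagged character is only a candidate for checking against the printed page, since a numeral in the text may be perfectly genuine. Calling describe on a single syllable gives its Unicode name, which states the consonant and vowel as Unicode labels them.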

The Google Translate output is also interesting because of the choice of “detected language” – Tigrayan, rather than Amharic.  If you force it to Amharic, you get a lot less meaning.

One awkward part of using Google Drive to do the OCR is that it doesn’t preserve the line breaks.  That makes comparing the lines more awkward.   So you have to manually do this:

፬ ፡ ወኮነ ፡ በአሐቲ ፡ ዕላት ፡ ወነሥአ ፡ መጽሐፈ ፡ ወቀለመ ፡ ወወጠነ፡
ይጽሐፍ ። ወተንሥአ ፡ ለጸሎት በሰርክ ። ወጸሐፉ ፡ ሎቱ : መላእክት ፡ ወንጌ ለ ፡
በ፬ ፡ ሰዓት ፡ ወትርጓሜሁ ። ወመላእክተ ፡ እግዚአብሔር ፡ ወትረ ፡ ይት ለአክዎ ፡
ወእግዚእነሂ ፡ ክርስቶስ ፡ ያንሶሱ ፡ ምስሌሁ ። ወተሰምዐ ፡ ዜናሁ :
ውስተ ፡ ኵሉ ፡ ሀገር ። ጸሎቱ ፡ ወበረከቱ ፡ የሀሉ ፡ ምስሌነ ።

The Wikipedia article mentioned earlier gave me a list of punctuation marks.  There are two sorts of punctuation visible in here.  The colon mark is actually word division, which means that some words above go over two lines.  I’ve chosen not to split words above.  The double colon mark “::” is the full stop.  Interestingly Google Translate gives different results if you remove the spaces!

Going through the electronic text, removing spaces, I noticed that sometimes the word-separator isn’t detected by the OCR.  So I added that in.  Sometimes it put a Roman colon instead, so I replaced that.  Finally I split the text into sentences:

ወጸሐፉ፡ሎቱ፡መላእክት፡ወንጌ ለ፡በ፬፡ሰዓት፡ወትርጓሜሁ።
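The mechanical parts of that clean-up can be scripted. This is only a rough sketch in Python 3, assuming the OCR text has been pasted into a string; the names normalise and split_sentences are mine, not anything from Google. It swaps Roman colons for the Ethiopic word-separator, strips the spaces and line breaks, and cuts after each full stop. It does not put back a word-separator that the OCR missed altogether; that still needs a human eye on the printed page.

```python
import re

WORDSPACE = "\u1361"  # the Ethiopic word-separator (looks like a colon)
FULL_STOP = "\u1362"  # the Ethiopic full stop (looks like a double colon)

def normalise(text):
    """Tidy OCR output: fix Roman colons, then drop all whitespace."""
    # A doubled Roman colon is assumed to stand for the full stop.
    text = text.replace("::", FULL_STOP)
    # A single Roman colon is assumed to be a misread word-separator.
    text = text.replace(":", WORDSPACE)
    # Remove spaces and line breaks.
    return re.sub(r"\s+", "", text)

def split_sentences(text):
    """Split normalised text after each Ethiopic full stop (Python 3.7+)."""
    pieces = re.split(f"(?<={FULL_STOP})", normalise(text))
    return [p for p in pieces if p]

if __name__ == "__main__":
    raw = "ወጸሐፉ ፡ ሎቱ : መላእክት ። ወተሰምዐ ፡ ዜናሁ :"
    for sentence in split_sentences(raw):
        print(sentence)
```

Because removing spaces also glues together any two words whose separator went missing, it is worth comparing the result with the page before pasting each sentence into Google Translate.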

Then I ran it again and got this:

But this still is not good enough to do much with.  If we didn’t have an idea what the text said, this would not tell us.

All this fiddling about would certainly get you into contact with the language, and start you on a journey to learning it.  But the translation is not good enough for other purposes, although intriguing.

One suggestion that was made in the comments to the last article was that ChatGPT gave better results.  The output quoted was indeed produced, and was very smooth and seemed to be a series of liturgical prayers.  But… I don’t think that this is actually the content.  These AI tools are really only an improved version of the text prediction tools you get on messaging on a mobile phone.  So it was pumping out garbage.

Anyway I tried it on this passage, and it crashed GPT very effectively!  At the moment I can’t get any reply of any sort, not even to “hello”.

I don’t think that I will do more here.  Clearly the technology is almost, but not quite, good enough to be useful.


6 thoughts on “More experiments with Amharic and technology”

  1. I’ve had very mixed results with ChatGPT at first.

    I found the best method is to separate a text into 2-3 sentences at a time, max – and I also used the function interface to get just the translation back (avoiding the occasional prefix like “The translation is X”).

    Using a python script to go through an entire document like that has resulted in some very readable translations, which seem to hold up pretty well when compared to other human-based translations.

    Here are my results – most of Jerome’s commentaries here were run through this process (and are now in the public domain):

  2. Roger, forgive me if I am stating the obvious and you have mentioned this in a previous post, but this is neither Amharic nor Tigrayan, but Ge’ez, the classical language of Ethiopia; the liturgical language of the Ethiopian Orthodox Church and the mother language from which both Amharic and Tigrayan derive. Moreover, as I understand it, Tigrayan stands closer to classical Ethiopic, or Ge’ez, so it is not surprising that Google Translate has chosen it when “detecting” the language. It is also not surprising that Google Translate does not have a Ge’ez option, as the language is used, as far as I know, only in liturgical and theological settings.

  3. No, not obvious at all – I know zilch about Ethiopian! Indeed I know nothing much about this text. Thank you! So that was part of the problem, then – that it was Ge’ez. Oh well! 🙂

  4. Here’s a result from Anthropic’s Claude LLM – I didn’t specify the source language:

    “Here is my attempt at translating this Ge’ez passage into English:

    “And it happened on that day, he took a book and read it. And he began to write.
    And he stood for prayer with incense.
    And the angels wrote for him, and interpreted for him for four hours.
    And the angels of God came and began to speak with him, and our Lord Jesus encouraged him.
    And his news was heard in all the land.
    May his prayer and blessing be with us.”

    Please let me know if any part of the translation could be improved or clarified. I aimed to convey the meaning clearly while retaining the style of the original Ge’ez. Translating early Christian texts requires balancing precision and interpretation.”

