An algorithm for matching ancient Greek despite the accents?

I need to do some more work on my translation helper for ancient Greek.  But I have a major problem to overcome.  It’s all to do with accents and breathings.

These foreigners, they just don’t understand the idea of letters.  Instead they insist on trying to stick things above the letters — extra dots, and quotes going left and right, little circumflexes and what have you.  (In the case of the Greeks they foolishly decided to use different letters as well, but that’s another issue).

If you have a dictionary of Latin words, you can bang in “amo” and you have a reasonable chance.  But if you have a dictionary of Greek, the word will have these funny accent things on it.  And people get them wrong, which means that a word isn’t recognised.

Unfortunately sometimes the accents are the only thing that distinguishes two different words.  Most of the time they don’t make a bit of difference.

What you want, obviously, is to search first for a perfect match.  Then you want the system to search for a partial match — all the letters correct, and as many of the accents, breathings, etc. as possible.

Anyone got any ideas on how to do that? 

I thought about holding a table of all the words in the dictionary, minus their accents; then taking the word that I am trying to look up, stripping off its accents, and doing a search.  That does work, but gives me way too many matches.  I need to prune down the matches, by whatever accents I have, bearing in mind that some of them may be wrong.
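This strip-and-look-up idea might be sketched in Python, using Unicode decomposition to drop the combining marks; the three dictionary entries are just placeholders:

```python
import unicodedata

def strip_diacritics(word: str) -> str:
    """Decompose to NFD, drop the combining marks, recompose."""
    decomposed = unicodedata.normalize("NFD", word)
    bare = "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)

# Hypothetical mini-dictionary, keyed by stripped form; one stripped
# form can map to several fully accented headwords.
dictionary = {}
for headword in ["λόγος", "λύω", "καί"]:
    dictionary.setdefault(strip_diacritics(headword), []).append(headword)

def lookup(query: str):
    """Perfect match first; otherwise every entry sharing the letters."""
    candidates = dictionary.get(strip_diacritics(query), [])
    if query in candidates:
        return [query]       # perfect match
    return candidates        # partial matches, still to be pruned

print(lookup("λογος"))  # → ['λόγος']
```

The pruning step — picking among the partial matches by whatever accents survive in the input — is the part still to be solved.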

Ideas, anyone?

19 Responses to “An algorithm for matching ancient Greek despite the accents?”


  1. Wieland

    Where do you want to do a search?
    In a Word document?
    In Google?

    For what reason do you want to do a search?

  2. Roger Pearse

    Sorry if I was unclear!

    I have written a computer programme to turn Greek into English. It has a dictionary of Greek words, with their meanings and morphology.

    When I type in LOGO/S, I want to get back “Word”. But if I type in LOGO\S or LOGOS or LO=GOS, I also want to get back “Word”. At the moment I only get “no result found”. That’s a bit silly.
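    For input typed in that transliterated style, one cheap normalisation is simply to drop the marker characters before the lookup — a sketch, assuming the usual beta-code conventions for which characters are accents and breathings:

```python
# Accent and breathing markers in beta-code-style input: / \ = are the
# acute, grave and circumflex; ( ) the breathings; | and + the iota
# subscript and diaeresis.
ACCENT_MARKS = set("/\\=()|+")

def strip_beta_accents(beta: str) -> str:
    """Drop the accent and breathing markers, keeping the letters."""
    return "".join(ch for ch in beta if ch not in ACCENT_MARKS)

# LOGO/S, LOGO\S, LO=GOS and LOGOS all collapse to one lookup key:
for form in ["LOGO/S", "LOGO\\S", "LO=GOS", "LOGOS"]:
    assert strip_beta_accents(form) == "LOGOS"
```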

  3. Jim Darlack

    Is there a way to create some kind of table that would define letters with various accents as the same?

    α = ἄ = ἅ = ἀ = ἁ = ᾳ

    So in essence rather than trying to figure out multiple combinations of letters/accents in a word, you are stating that all alphas regardless of what accents are associated are alphas.

    I have no idea how this would be programmed, but I would imagine that this would be the way to go.
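    One way such a table might be programmed, sketched in Python: rather than typing out every equivalence by hand, derive the base letter of each precomposed polytonic character by Unicode decomposition (the Greek Extended block, U+1F00–U+1FFF, holds the accented forms):

```python
import unicodedata

def base_letter(ch: str) -> str:
    """An accented Greek letter decomposes to base letter + marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0] if decomposed else ch

# Build the table once for the whole Greek Extended block
# (U+1F00-U+1FFF), skipping the few non-letter code points.
EQUIV = {chr(cp): base_letter(chr(cp))
         for cp in range(0x1F00, 0x2000)
         if unicodedata.category(chr(cp)).startswith("L")}

# All the alphas in Jim's example key to plain alpha:
assert all(EQUIV[c] == "α" for c in "ἄἅἀἁᾳ")
```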

    How do sites like TLG work this out? In TLG you can enter a Greek word with any number of different accents and still find all (or at least most of) the appropriate iterations of the word.

  4. Roger Pearse

    This was my first thought; just treat alpha+anything as alpha. But it gave too many matches.

    I wonder how the TLG do it.

  5. Wieland

    Why does it give too many results?
    I mean, there is only one LOGOS.
    There are only very few cases where an accent makes a difference.

  6. Roger Pearse

    You can get some interesting results for kai, I know.

  7. Wieland

    So then you are searching for parts of words, not words?

  8. Roger Pearse

    I have a table of words that appear in various texts, such as the New Testament, in the form that they appear (e.g. logoi). For each of these words, there is the part of speech, and the root form (=logos, in this case) and the meaning(s) in English. I’m starting with some other text, finding the word (e.g.) “logoi”, and then looking to see what the part of speech and the meaning are.

    For kai I am finding some oddities. In computerised dictionaries this can happen.

    What I need to do, perhaps, is try the theory out on the New Testament, and see how many words end up with incorrect duplicates, if I strip the accents. That might be quite interesting to see.
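    As a concrete (made-up) illustration of that word table, one row per attested form, each carrying part of speech, root form, and meanings:

```python
# Hypothetical rows of the word table described above: each attested
# form maps to (part of speech, root form, English meanings).
word_table = {
    "λόγοι": ("noun, nom. pl.", "λόγος", ["words", "accounts"]),
    "καί":   ("conjunction",    "καί",   ["and", "also", "even"]),
}

# Finding "λόγοι" in some other text, look up its analysis:
pos, root, meanings = word_table["λόγοι"]
print(root, meanings)  # → λόγος ['words', 'accounts']
```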

  9. Nick Nicholas

    > I wonder how the TLG do it.

    Sorry, wasn’t following. We do what you thought: treat alpha plus anything as alpha. It does indeed give too many matches. That’s the way it is: ambiguity happens.

    To give more detail: for lemmatised search, if a word is unaccented, it is parsed against all possible accentuations, and if it lacks breathings, against both possible breathings. If a word is in a text known to be diplomatic, analyses with different breathings and accents are allowed, but are treated as lower-ranked preferences.

    For word form search, with no grammatical analysis, by default all diacritics are ignored. In advanced search you can tune which diacritics to ignore and which not to.

    You will inevitably get a deluge of analyses, and the only thing to do there is rank the results by preference. You can use syntactic context, but that doesn’t go very far — the only reliable rule is that articles and prepositions precede nouns.
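    The rank-by-preference step might look like this in Python: record which diacritic sits on which letter, then put candidates that agree with the query's marks ahead of the rest (the candidate list is hypothetical):

```python
import unicodedata

def diacritics(word):
    """The set of (letter position, combining mark) pairs in a word."""
    marks, pos = set(), -1
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            marks.add((pos, ch))
        else:
            pos += 1
    return marks

def rank(query, candidates):
    """Candidates sharing more of the query's marks come first."""
    q = diacritics(query)
    return sorted(candidates, key=lambda c: -len(q & diacritics(c)))

# Hypothetical candidate list from an accent-stripped lookup:
print(rank("λόγος", ["λογός", "λόγος"]))  # → ['λόγος', 'λογός']
```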

  10. Roger Pearse

    Interesting, Nick – thank you. I will bear this in mind.

    It’s the sort of thing that means I’m feeling sympathetic to the Greeks getting rid of most of their accents, actually!

  11. Michael Gilleland

    Perhaps some variation on the Levenshtein distance algorithm might be helpful — see e.g. http://www.merriampark.com/ld.htm. Good luck. Your project sounds very worthwhile.

  12. Roger Pearse

    Too complex for me, but thank you for the thought and the reference.

  13. Nick Nicholas

    You could tweak Levenshtein or whatever distance metric quite simply: give relative weights to strings that differ just by breathing, by accent, and both, and by choice of accent vs location of accent.
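    That tweak might be sketched in Python, assuming precomposed (NFC) strings: an ordinary Levenshtein matrix, except that substituting one form of a base letter for another form of the same letter is charged a fraction of a full edit. The 0.25 weight is an arbitrary placeholder, and could be split further into separate breathing and accent weights as suggested:

```python
import unicodedata

def base(ch):
    """First character of the NFD decomposition: the bare letter."""
    return unicodedata.normalize("NFD", ch)[0]

def weighted_distance(a, b, accent_cost=0.25):
    """Levenshtein, but same-letter/different-diacritics substitutions
    cost accent_cost instead of a full edit."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif base(a[i - 1]) == base(b[j - 1]):
                sub = accent_cost          # accent/breathing-only change
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)
    return d[len(a)][len(b)]

print(weighted_distance("λόγος", "λογός"))  # → 0.5
```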

    For the fathers, especially if you’re using Migne texts, you’ll also have to deal with misspellings by classical norms. (And misaccentuations; they’re everywhere in Byzantine texts.) That leads to combinatorial explosion, so you only have recourse to them when you can’t really work out what the word is any other way…

  14. Nick Nicholas

    As for the Greeks getting rid of the accents: the rules were a lot more capricious applied to Modern than to Ancient Greek: accenting Attic is nowhere near as fraught. (But the system is indeed breaking down in the Fathers as edited in what we have access to.) I’ve posted about that in passing chez moi.

  15. Roger Pearse

    An interesting article – thank you!

  16. Joshua

    Hmm,

    There’s a nerdy way, and a quick way to do this in my opinion; but it’s a multi-step process either way.

    Assuming some kind of SQL database, your user makes a query.

    If an exact match is made; return, life is wonderful.

    If no match is made, run a second query swapping all accented characters for their non-accented equivalent. A SQL “LIKE” query might handle this. If not, a second table with the words likewise cleaned of accents but correlated to their accented equivalents would be needed.

    Return the results to an array, and now run a secondary parsing/sorting function on said results.

    The aforementioned Levenshtein Distance could do the trick. You could compare each result to the given query and anything with a distance < fuzzy is sorted to another array; fuzzy representing your tolerance for difference. If you replaced three accented characters, a good tolerance would be fuzzy = 3.

    Now we would create all possible permutations of the user-query based on swapping out breathing marks and compare them to the array containing the tolerance sorted matches.

    Store all matches in a new array, which will be output as results to the user. If no matches return, you could always return the tolerance sorted array as a "did you mean?" style query.

    This sums up the "quick" way of doing things, and can be done on the fly without eating up tons of resources. Your database can grow without any effort since the sorting is done via query.
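    The quick way above might be sketched against an in-memory SQLite table with a second, accent-stripped column. The two dictionary rows are genuine Greek (the accent alone separates the interrogative from the indefinite), but the schema is just an assumption for illustration:

```python
import sqlite3
import unicodedata

def strip_marks(word):
    bare = "".join(ch for ch in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (form TEXT, stripped TEXT, meaning TEXT)")
# Two real Greek entries where only the accent separates the words:
for form, meaning in [("τίς", "who? which?"), ("τις", "someone, anyone")]:
    conn.execute("INSERT INTO words VALUES (?, ?, ?)",
                 (form, strip_marks(form), meaning))

def query(word):
    exact = conn.execute("SELECT form, meaning FROM words WHERE form = ?",
                         (word,)).fetchall()
    if exact:
        return exact          # exact match: life is wonderful
    return conn.execute(      # fall back to the accent-stripped column
        "SELECT form, meaning FROM words WHERE stripped = ?",
        (strip_marks(word),)).fetchall()
```

    Here query("τίς") returns only the interrogative, while a wrongly accented query("τῖς") falls through to the stripped column and returns both rows as candidates for the later sorting step.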

    The nerdy way involves preprocessing entries using N-Gram models.

    Basically you sort through your dictionary, and create bi, tri, and quad-grams for the words in a parallel database.

    When a user makes a query take the original search query and break it up into grams as well, and try to make matches off of the N-Grams, swapping out breathing marks based on the grams rather than the entire word.

    In your case, I think the quick way is probably the more effective. N-Grams are amazing for linguistic processing models, but they're also a pain in the butt to deal with, and are resource intensive.
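    A minimal sketch of the N-gram idea in Python: index accent-stripped dictionary forms by their trigrams, then retrieve anything sharing enough grams with the query (the word list and threshold are placeholders):

```python
from collections import defaultdict

def ngrams(word, n=3):
    padded = "#" + word + "#"     # pad so word edges form grams too
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# Trigram index over accent-stripped dictionary forms (placeholders):
index = defaultdict(set)
for word in ["λογος", "λεγω", "λογιος"]:
    for gram in ngrams(word):
        index[gram].add(word)

def candidates(query, min_shared=2):
    """Dictionary forms sharing at least min_shared trigrams."""
    counts = defaultdict(int)
    for gram in ngrams(query):
        for word in index[gram]:
            counts[word] += 1
    return {w for w, c in counts.items() if c >= min_shared}

print(sorted(candidates("λογους")))  # → ['λογιος', 'λογος']
```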

  17. Joshua

    There are a lot of ways to handle things like this, and none of them are particularly “clean.”

    A dynamic way to implement this would be with the aforementioned Levenshtein Distance, and a Stemming Algorithm for classical Greek.

    1:) Stem the word to its root and return all related lexical entries.

    2:) If no match for the stemmed root, do some error handling. You could try a diacritic replacement on the root and offer a “did you mean” menu of relevant roots, for example.

    3:) See if an exact match occurs anywhere; if so, celebrate by returning it to the user.

    4:) If the root is available, but no exact match occurs, parse the original search term by replacing diacritics that are outside the root and return matches. As noted, this is fuzzy and will return way too many entries… but there is a fix for that.

    5:) Using a Levenshtein Distance algorithm where the distance is equal to the number of diacritic substitutions, you can now sort your results into the most likely matches. In short, if you replaced two diacritics, the only things that might even potentially match are things that are only two characters’ difference from the original query.

    6:) If desired you can get even more granular: using the substitutions, you can try to weed it down to words which require only one “error” in the user-entered data. At this point it basically depends on your own internal understanding of how easy it is for a user to bungle input, and on whom you believe your target application users to be.
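    Steps 3 to 5 might be sketched like this in Python, leaving the stemming step aside and using a caller-supplied fuzz parameter as the tolerance discussed in step 5:

```python
import unicodedata

def strip_marks(word):
    bare = "".join(ch for ch in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)

def levenshtein(a, b):
    """Plain edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_lookup(query, lexicon, fuzz=2):
    """Exact hit wins; otherwise entries with the same stripped form,
    sorted by edit distance and cut off at the fuzz tolerance."""
    if query in lexicon:
        return [query]
    hits = sorted((levenshtein(query, w), w) for w in lexicon
                  if strip_marks(w) == strip_marks(query))
    return [w for d, w in hits if d <= fuzz]

print(fuzzy_lookup("λογὸς", ["λόγος", "λόγου"]))  # → ['λόγος']
```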

  18. Roger Pearse

    Thank you for these thoughts! — I will consider them.

  19. Roger Pearse

    The idea of treating it as if it was a database is helpful too – thanks!