An algorithm for matching ancient Greek despite the accents?

I need to do some more work on my translation helper for ancient Greek.  But I have a major problem to overcome.  It’s all to do with accents and breathings.

These foreigners, they just don’t understand the idea of letters.  Instead they insist on trying to stick things above the letters — extra dots, and quotes going left and right, little circumflexes and what have you.  (In the case of the Greeks they foolishly decided to use different letters as well, but that’s another issue).

If you have a dictionary of Latin words, you can bang in “amo” and you have a reasonable chance.  But if you have a dictionary of Greek, the word will have these funny accent things on it.  And people get them wrong, which means that a word isn’t recognised.

Unfortunately sometimes the accents are the only thing that distinguishes two different words.  Most of the time they don’t make a bit of difference.

What you want, obviously, is to search first for a perfect match.  Then you want the system to search for a partial match — all the letters correct, and as many of the accents, breathings, etc as possible.

Anyone got any ideas on how to do that? 

I thought about holding a table of all the words in the dictionary, minus their accents; then taking the word that I am trying to look up, stripping off its accents, and doing a search.  That does work, but gives me way too many matches.  I need to prune down the matches, by whatever accents I have, bearing in mind that some of them may be wrong.
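
For what it is worth, the idea above can be sketched in a few lines of Python: index the dictionary by its accent-free form, try for a perfect match first, and otherwise rank the candidates by how well their accents and breathings agree with what was typed. This is only a sketch under assumptions. The names split_marks, build_index and lookup are invented for illustration, and the scoring rule (marks in common minus marks that differ) is one possible choice, not an established method. It uses only the standard unicodedata module. Decomposing each word first also avoids the trap that the same accented letter can be encoded in more than one way.

```python
import unicodedata
from collections import defaultdict


def split_marks(word):
    """Return (bare lowercase letters, set of (letter position, mark))."""
    letters, marks = [], set()
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Attach the accent or breathing to the preceding base letter.
            marks.add((len(letters) - 1, ch))
        else:
            letters.append(ch)
    return "".join(letters).lower(), marks


def build_index(dictionary_words):
    """Map each accent-free spelling to the dictionary words sharing it."""
    index = defaultdict(list)
    for word in dictionary_words:
        bare, _ = split_marks(word)
        index[bare].append(word)
    return index


def lookup(word, index):
    """Exact matches if there are any, else candidates ranked best first."""
    bare, wanted = split_marks(word)
    candidates = index.get(bare, [])
    target = unicodedata.normalize("NFC", word)
    exact = [c for c in candidates
             if unicodedata.normalize("NFC", c) == target]
    if exact:
        return exact

    def score(candidate):
        _, have = split_marks(candidate)
        # Reward marks in common, penalise marks present on only one side.
        return len(wanted & have) - len(wanted ^ have)

    return sorted(candidates, key=score, reverse=True)
```

With a dictionary holding ἡ, ἥ, ἤ and ἦ, looking up ἥ returns just that word, while looking up ἣ (which is not in the list) returns all four with ἡ first, since it shares the rough breathing and differs by only one mark. Whether that ordering is the right one for real texts is exactly the open question.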

Ideas, anyone?


How to use Diogenes with the TLG disk (not that any of us have one, oh no)

A friend has sent me this set of instructions on how to use the Diogenes software with the TLG.  Apparently this has a really nice look-and-feel interface.

1: Go to Edit, Preferences, and under “Location of TLG database” enter the location, e.g. “c:\tlg/”, or wherever you have stored the TLG on your computer. Click save settings.

2: Go to the main page on Diogenes and under Corpus select “TLG texts”

3: Under Action select “Browse to a specific passage in a given text” and then under “query” type in the first few letters of the author you are interested in finding. This is the Latin name so the complete English spelling will likely come up with nothing. A list will come up with either one author or the names of multiple authors. Select the author you want. A list will come up with their works. Select the work you want. Now you may enter the book and chapter numbers. Enter zeros to view the beginning of the work. Now you can browse the work. If you click the word in the work Diogenes will parse it and find the dictionary’s definition. Pretty neat.

4: To run a TLG search go back to the homepage. Under action choose “search the TLG”. Then under “query” type in a Greek word using a Unicode font. Diogenes will attempt to match this word against the TLG database.

One more thing, under the homepage for Diogenes you can select “search for conjunctions and multiple words”  I find it best if you search for only one word and then add on another word after the second screen pops up. 

Another classicist receives justice

I am told that the TLG FAQ on its website claims that the TLG CD was never sold and that no one should have it now, even libraries, so it’s risky even to say that you have access to CD E, because as far as they are concerned you shouldn’t.  As a humble member of the public, I most certainly don’t have a copy.  But I post these instructions, just in case some scumbag somewhere is still making use of an ‘illegal’ CD, and hasn’t been reported to the police yet.

If anyone does know someone with a CD, I hope that every reader will most certainly denounce them to the police.  If you are at school, and you discover that your father or mother has a copy, you should denounce them likewise.

This may seem harsh, but it’s the only way to curb the criminal element among classicists and to build a better world.   Mind how you go.


Why do we write accents on our ancient Greek?

The most obvious omission to strike the eye [in his book] is the disappearance of accents.  We are indebted to D. F. Hudson’s Teach Yourself New Testament Greek for pioneering this revolution.  The accentual tradition is so deeply rooted in the minds of classical scholars and of reputable publishers that the sight of a naked unaccented text seems almost indecent.  Yet from the point of view of academic integrity, the case against their use is overwhelming.  The oldest literary texts regularly using accents of any sort date from the first century B.C.  The early uncial manuscripts of the New Testament had no accents at all.  The accentual system now in use dates only from the ninth century A.D. 

It is not suggested that the modern editor should slavishly copy first-century practices.  By all means let us use every possible device that will make the text easier and pleasanter to read; but the accentual system is emphatically not such a device.  Accurate accentuation is in fact difficult.  Most good scholars will admit that they sometimes have to look their accents up.  To learn them properly consumes a great deal of time and effort with no corresponding reward in the understanding of the language.  When ingrained prejudice has been overcome, the clear unaccented text becomes very pleasant to the eye. 

In Hellenistic Greek the value of accents is confined to the distinguishing of pairs of words otherwise the same.  In this whole book it means only four groups of words; EI) and EI=); the indefinite and interrogative pronouns; parts of the article and the relative pronoun; and parts of the present and future indicative active of liquid verbs.  I have adopted the practice of retaining the circumflex in MENW=, -EI=S, -EI=, -OU=SIN and in EI=); of always using a grave accent for the relatives (\H, (\O, O(\I, and A(\I, and an acute for the first syllable of the interrogative pronoun (TI/S, TI/NA, etc.).  These forms are then at once self-explanatory, and the complications of enclitics are avoided.  All other accents have been omitted.

I should dearly love to take the reform one stage further, by the omission of the useless smooth breathing.  Judging by the criterion of antiquity, breathings have no right to inclusion.   Judged by the criterion of utility, ) should be used as an indication of elision or crasis, and nothing else, and the rough breathing would then stand out clearly as the equivalent of h.  The fear that examinees might be penalised for the omission of the smooth breathing has alone deterred me from trying to effect this reform.  I should like to know if other examiners would support this proposal. — J. W. Wenham, Elements of New Testament Greek, pp. vii-viii.

As someone fairly new to Greek, I don’t quite know how to look at this.  If the accents really are largely useless, why have them?  But is it as simple as this?

At the moment I’m working on software to automatically look up Greek words.  In the inscription we were looking at yesterday, most of the words are found in the dictionaries, including Ares; but not “Aphrodite”.  I don’t really believe that the goddess isn’t in the dictionary.  Rather, I suspect that some faulty accentuation means that X\ is not matching X, or the like.  Most of the code that I have seen for use with ancient Greek involves reams of lines trying to overcome this sort of thing; all more or less inept.

Perhaps when I am searching for a word, I should first strip off all its accents, and all smooth breathings except one at the end of a word — e.g. A)LL) would become ALL) — and search using that?  Would I get a load of spurious matches?
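
Here is a minimal sketch of that stripping rule in Python, assuming the word is written in the same transliteration as the examples above: / \ and = for the accents, ) for the smooth breathing and ( for the rough. The function name search_key is made up for illustration. The rough breathing is deliberately kept, since it stands for a real sound.

```python
import re

# Acute (/), grave (\) and circumflex (=) in the transliteration used here.
ACCENTS = re.compile(r"[/\\=]")
# A smooth breathing that is NOT the last character of the word.
INNER_SMOOTH = re.compile(r"\)(?!\Z)")


def search_key(beta_word):
    """Drop accents and non-final smooth breathings; keep ( and a final )."""
    without_accents = ACCENTS.sub("", beta_word)
    return INNER_SMOOTH.sub("", without_accents).upper()
```

So search_key("A)LL)") gives ALL), and TI/S becomes TIS.  One wrinkle: a short word whose breathing happens to fall on its last letter, such as EI), keeps the ) because the rule cannot tell it from an elision mark.  Both EI) and EI=) still reduce to the same key, so the lookup is at least consistent; it just means such words keep a mark they do not strictly need.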

And why do we have this complicated thing, if it is such a burden?  Is perhaps the accentuation thing just a bit of snobbery?  A way to keep the hoi polloi out?  No doubt there is snobbery around, as in all things to do with men and their deeds.  But is that all there is?  Or is there more to it than this?


People willing to type up some ancient Greek wanted

Do you have too much money?  If not, you may be interested in this post by Eric at Archaic Christianity.  He’s prepared to pay people to type in some unicode ancient Greek for him.  Might be a quick way to earn a few bucks, if you’re short of cash and have a bit of spare time.

The resulting text will be made available and public domain, so the effort will benefit everyone.


Greek mercenaries in Egypt used mosquito-nets

When I was in Egypt before Christmas, I got bitten to pieces by mosquitos.  On mentioning this, David Miller tells me that “canopy” is derived from the Greek word for mosquito-net.

The word is “kōnōpeion”.   The derivation is via late Lat. ‘canopeum’, perhaps with a supposed connection to ‘Canopus’.

kōnōps  (??“cone-face”??) = mosquito.

Imagine all those hard-bitten Greek mercenaries working for the late Pharaohs in the Nile Delta getting bitten, eh?


Greek words in the first millennium

This post at Vitruvian Design is very timely to a man trying to write some Greek->English translation software.  I can’t comment on it from behind this firewall, so will comment here.

I am delighted to see someone else interested in getting a master list of Greek words and morphologies for the first thousand years.  I must look into this project that is referred to.  The problem, surely, will be patristic Greek; and the answer would be to turn G.W.H.Lampe’s Patristic Lexicon into an XML file, in the same way that Perseus have done for Liddell and Scott.  Someone would have to argue with Oxford, who own the copyright; but for non-commercial use, I expect a license could be negotiated.  Lampe is out of print anyway.

I think that I know why Liddell and Scott give weird accusatives as an extra entry.  The book is designed for manual use, and someone finding an odd word is liable to look for something in that form, rather than the unknown to them base form.  But such things are unnecessary in a digital file, I agree.

Not all of the files mentioned in the post are known to me.  I know that an XML file of L&S exists in the Perseus Hopper, and also in the Diogenes download.  But I’m not clear where to find the “invaluable list” by Peter Heslin resulting from running the Perseus morphologiser over the TLG disk E.  A morphology file greek.morph.xml is part of the Perseus Hopper download.

The issue of mismatches between this and L&S is quite interesting.  I’d like to follow this more.

But one obvious omission is the New Testament.  The morphology list in MorphGNT is also available; and English meanings in the XML file of Strong’s dictionary.  These too need integrating into the project, I would suggest.

All this work is enormously valuable.  The project is also trying to establish something shockingly fundamental; a list of extant Greek literature!

I’m not sure how I feel about this.  I agree that the task should be undertaken (indeed it’s appallingly hard to find out these things, as I found out when I wanted a list of manuscript traditions), but it seems a digression from the main IT-related task.  They’ve decided to start with poets; again, a minority taste.  I can’t help feeling that this task should be spun off.

The post also introduces me to Epidoc, of which I know little, in the context of converting to and from unicode.  If some way to do this reliably exists, I want it!  More details here.  This is the ‘transcoder’.

All in all, a super post!


G.W.H.Lampe’s “Patristic Lexicon” – could we get it electronically?

As we get XML versions of Liddell and Scott, etc, we inevitably start to wonder about other standard reference tools, such as Lampe.  A PDF of the raw page images doesn’t really do it, although that is better than carrying a book around.

Of course those as rich and privileged as myself have no problem here.  We just buy a dozen printed copies and place one in each of our homes, plus one in the back of the Rolls. Also, we can get our butler to carry it for us.  But this still leaves rather a lot of other people with a problem.  And… if we had it in electronic form, it would be possible to do interesting things with it.

I found this blog post from somewhere unpronounceable which asked the same question.  And I ask: how do we go about getting an XML version of a copyright text?  One that we can all use in our computer programs?

The book was published in 1961, comprises 1600+ pages, and is published by Oxford University Press who presumably own it.

Could Perseus negotiate some deal?  Could Logos?  How would one do this?



Linking electronic Greek words to their English meanings

Ancient Greek is tough for computers, and computer programmers, to work with.  Firstly it’s a dead language, secondly it’s a non-Roman script, and thirdly no-one knows Greek anyway (although a lot of people pretend).

What we need are tools on our computers.  These are appearing, but very slowly.  The problem is the non-availability of data.

Except that data does exist.  For some years the Perseus site has had a very nice electronic edition of Liddell and Scott, and a tool wherein you can put in any Greek word and it will spit out the meaning and the standardised form.  The latter is known as the ‘lemma’, presumably to keep people from understanding. 

Perseus have now made their data available in the Perseus Hopper, which can be downloaded for non-commercial purposes.  Liddell and Scott is in a big XML file. 

Peter Heslin of Durham University has grasped the implications.  Version 3.1 of his Diogenes tool includes this XML file, and another file containing all the possible forms of all the words in the Greek language, their lemma, the part of speech (noun, verb, etc), tense, mood, singular or plural (etc), and most importantly the line number of the full description in the XML file.  This means that you can look up any word, and get a full description; so long as it’s in L&S.  The code is in Perl, and is supplied.  Perl tends to be impenetrable, but this is a relatively well-written example.  So if you want to create your own dictionary program, here are the materials.

But what about post-classical Greek?  Well, there’s the New Testament.  A list of all the words, in order, with part of speech, lemma, etc, was created long ago by James Tauber as MorphGNT.  The site is down at the moment, but the 1Mb text file does exist.

Now this is fine, but useless.  It doesn’t contain the English meaning.  But… Ulrik Sandborg-Petersen has digitised Strong’s dictionary and created an XML file of it.  This contains the Greek Lemma, for all words in the New Testament, plus the English meaning and other bits of info of no present concern.  You can see on his site what the data is, by tapping in his demo example.

MorphGNT also contains the lemma.  So this means that if we join the two together, we get all the possible forms of a word in MorphGNT, and the lemma for them; and the lemma plus the meaning in Strong’s.  Effectively, we now have a dictionary of NT Greek, forms, base form, and meanings.  All we have to do is program it.
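
The join itself is simple enough to sketch in Python.  What follows assumes the two files have already been parsed into (form, lemma) pairs and a lemma-to-meaning table; the real file layouts of MorphGNT and the Strong’s XML are not reproduced here, and the names build_dictionary and missing_lemmas, like the three toy rows, are invented purely for illustration.

```python
import unicodedata
from collections import defaultdict


def nfc(text):
    """Normalise so that identical-looking words compare equal."""
    return unicodedata.normalize("NFC", text)


def build_dictionary(morph_rows, glosses):
    """Join (form, lemma) rows with a lemma -> meaning table.

    Returns a dict: form -> list of (lemma, meaning or None).
    """
    meanings = {nfc(lemma): text for lemma, text in glosses.items()}
    lemmas_by_form = defaultdict(set)
    for form, lemma in morph_rows:
        lemmas_by_form[nfc(form)].add(nfc(lemma))
    return {
        form: [(lemma, meanings.get(lemma)) for lemma in sorted(lemmas)]
        for form, lemmas in lemmas_by_form.items()
    }


def missing_lemmas(morph_rows, glosses):
    """Lemmas that occur in the text but have no meaning in the table."""
    known = {nfc(lemma) for lemma in glosses}
    return {nfc(lemma) for _, lemma in morph_rows} - known


# Toy rows typed by hand for illustration; not copied from either file.
rows = [("λόγου", "λόγος"), ("λόγον", "λόγος"), ("ἦν", "εἰμί")]
glosses = {"λόγος": "word"}
dictionary = build_dictionary(rows, glosses)
```

A form can belong to more than one lemma, which is why each entry is a list.  The second function reports the lemmas for which the meaning table has nothing, which is a cheap way to see how much of a text a given dictionary actually covers.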

What about other, non-classical Greek literature?  Somewhere around is a Septuagint in electronic form, with lemmas.  This can be referenced either against the meaning in Strong’s, or that in Liddell and Scott.  How many words appear in neither?  — I don’t know, but it would be interesting to know.  Mostly names, I would guess. 

Every lemmatized Greek text can now be a source of data to this process of creating as large an electronic Greek dictionary as we like.  And, of course, we need more dictionaries of lemmas-plus-English-meaning.  What others could be done, I wonder? 

I’ve just looked for “lemmatized Greek text” in Google and, among many interesting results, I have found the Lexis site, which claims to be able to help produce lemmatized Greek texts.  It runs on Mac, and I haven’t tried it; but it works with the TLG.  Likewise Hypotyposeis talks about lemmatized searches in TLG.  I think Josephus must be available somewhere in lemmatized form, but where?

What I’m not finding is much Patristic Greek, tho.  What we need, clearly, is G.W.H.Lampe’s Patristic Greek Lexicon in XML.  This was published in 1961, so will be in copyright until all of us are dead.  But… couldn’t someone license an electronic version for non-commercial use?   It’s much too expensive for me to buy just at the moment (although a pirate PDF of the page images does exist, I see; apparently pp.1202-3 are missing, tho).

There is much that I don’t know still, tho.  Interesting to see that there is a blog called Coding Humanist.  Is there anyone out there interested in this stuff too?