Morphologized Coptic texts?

I’ve been working on my translation tool for ancient Greek again.  The calendar of Antiochus of Athens seems like a perfect text to translate using it.  But the deficiencies of the software are still great.  I’ve been adding code to handle numerals today, with modest success.  Much of the trouble is in the Unicode-to-betacode converter.  The mark at the end of a number turns up in three forms: a special Unicode character, an apostrophe, or a tilted accent.  I’ve got the first two working, but not the third, not really.
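A minimal sketch of the numeral problem in Python, not the actual code of the tool. The letter values are the standard alphabetic ones; the list of trailing marks is an assumption about what turns up in real texts, and `parse_greek_numeral` is a made-up name. Thousands (marked with a lower-left stroke) and capital letters are left out.

```python
# Values of the Greek alphabetic numerals (lowercase only in this sketch).
NUMERAL_VALUES = {
    "α": 1, "β": 2, "γ": 3, "δ": 4, "ε": 5, "ϛ": 6, "ζ": 7, "η": 8, "θ": 9,
    "ι": 10, "κ": 20, "λ": 30, "μ": 40, "ν": 50, "ξ": 60, "ο": 70, "π": 80, "ϟ": 90,
    "ρ": 100, "σ": 200, "τ": 300, "υ": 400, "φ": 500, "χ": 600, "ψ": 700, "ω": 800, "ϡ": 900,
}

# Assumed set of marks that may close a numeral.
NUMERAL_MARKS = (
    "\u0374",  # GREEK NUMERAL SIGN, the dedicated character
    "\u02b9",  # MODIFIER LETTER PRIME, what U+0374 becomes after normalisation
    "'",       # plain apostrophe
    "\u2019",  # typographic apostrophe
    "\u0384",  # GREEK TONOS, one candidate for the "tilted accent"
    "\u00b4",  # ACUTE ACCENT, another candidate
)


def parse_greek_numeral(token):
    """Return the value of a token such as 'ιβ' + mark, or None if it is not a numeral."""
    if len(token) < 2 or token[-1] not in NUMERAL_MARKS:
        return None
    total = 0
    for letter in token[:-1]:
        value = NUMERAL_VALUES.get(letter)
        if value is None:
            return None  # an ordinary word that merely ends in an apostrophe
        total += value
    return total
```

Treating every possible closing mark the same way, before anything reaches the betacode step, keeps the three cases from needing three code paths.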

But Coptic is written mostly in Greek letters.  When I was typing some up earlier this week, I was very conscious of this.  Why can’t I add some extra files to the code, and be able to look at Coptic text as well?

For Greek we have things like MorphGNT, where each word is listed in a text file, together with the base form, the part of speech, number, gender, etc.  But I can find no evidence of such a thing for any Coptic text.
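To show the kind of file meant here, a small Python sketch that reads one line of such a text. The column layout (reference, part of speech, parsing code, word, lemma) and the sample line are assumptions for illustration; the real MorphGNT files may be laid out differently, and a Coptic equivalent would need its own codes.

```python
from typing import NamedTuple


class MorphEntry(NamedTuple):
    ref: str    # book/chapter/verse reference (assumed column)
    pos: str    # part-of-speech code (assumed column)
    parse: str  # person, tense, case, number, gender and so on (assumed column)
    word: str   # the word as it stands in the text
    lemma: str  # the base form


def parse_line(line):
    """Split one whitespace-separated line into a MorphEntry."""
    ref, pos, parse, word, lemma = line.split()
    return MorphEntry(ref, pos, parse, word, lemma)


# Hypothetical sample line in the assumed layout.
SAMPLE_LINE = "010101 N- ----NSF- Βίβλος βίβλος"
entry = parse_line(SAMPLE_LINE)
print(entry.word, "->", entry.lemma, entry.pos)
```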

Anyone know what we have, in the way of electronic Coptic texts, and electronic XML Coptic dictionaries?

I can’t help feeling that, if we have the New Testament in Coptic in electronic form (and I think we do), some kind of morphologisation shouldn’t be hard to do.  I wonder if one could hire someone to make such a file?

More on QuickGreek

I’m still stuck at home with a temporarily dodgy leg, so I’ve been looking again at QuickGreek.  This is a bit of software to help people like me, who know Latin, deal with polytonic Ancient Greek text. 

The idea is that you paste a bunch of Unicode Greek into one window and hit Ctrl-T.

[screenshot: qg1]

It reads through the Greek, splitting it up into short bits (i.e. when there is a comma or colon or whatever).  For each bit it parses the individual words, looks up the meaning and displays something underneath the word.
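The splitting step can be sketched in a few lines of Python. This is an illustration only, not the code in QuickGreek: the set of break characters is an assumption (comma, full stop, colon, semicolon, the Greek question mark and both encodings of the raised dot), and the function names are invented.

```python
import re

# Assumed break characters: , . : ; plus U+037E (Greek question mark),
# U+0387 (Greek ano teleia) and U+00B7 (middle dot, often used for it).
BREAKS = ",.:;\u037e\u0387\u00b7"
_SPLITTER = re.compile("([,.:;\u037e\u0387\u00b7])")


def split_into_bits(text):
    """Cut text into short sections, each keeping its closing punctuation."""
    parts = _SPLITTER.split(text)  # alternates: text, punctuation, text, ...
    bits = []
    for i in range(0, len(parts), 2):
        chunk = parts[i].strip()
        punct = parts[i + 1] if i + 1 < len(parts) else ""
        if chunk:
            bits.append(chunk + punct)
    return bits


def words_of(bit):
    """Return the words of one section with the punctuation removed."""
    return [w.strip(BREAKS) for w in bit.split()]


SAMPLE = "ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν, καὶ θεὸς ἦν ὁ λόγος."
for bit in split_into_bits(SAMPLE):
    print(bit, words_of(bit))
```

Each word from `words_of` would then go to the parser and the dictionary lookup.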

The sections and the meanings are interleaved like this:

[screenshot: qg2]

Listing the meanings one after another does not make a sentence, but it’s a start on producing your own.

You then hover the mouse over the Greek word you wish to inspect, and you get a morphology in the bottom left — nominative singular etc — and whatever information I have about the word in the bottom right.

In this way you can build up a translation of short sections, even if you don’t know much Greek at all.  Which is sort of the idea.

I’ve done a little more on the thing today, and I’m quite pleased with what I’ve done and what I’ve got so far.  It needs more work in every area.  The problem is that I can never devote very long to it at any one time, and it takes a while to get back into it.

I might make a version available for download for people to play with.  I think it’s reached the point of serving some purpose.  But I need to play around with texts with wrong or no accentuation now.

An algorithm for matching ancient Greek despite the accents?

I need to do some more work on my translation helper for ancient Greek.  But I have a major problem to overcome.  It’s all to do with accents and breathings.

These foreigners, they just don’t understand the idea of letters.  Instead they insist on trying to stick things above the letters — extra dots, and quotes going left and right, little circumflexes and what have you.  (In the case of the Greeks they foolishly decided to use different letters as well, but that’s another issue).

If you have a dictionary of Latin words, you can bang in “amo” and you have a reasonable chance.  But if you have a dictionary of Greek, the word will have these funny accent things on it.  And people get them wrong, which means that a word isn’t recognised.

Unfortunately sometimes the accents are the only thing that distinguishes two different words.  Most of the time they don’t make a bit of difference.

What you want, obviously, is to search first for a perfect match.  Then you want the system to search for a partial match — all the letters correct, and as many of the accents, breathings, etc as possible.

Anyone got any ideas on how to do that? 

I thought about holding a table of all the words in the dictionary, minus their accents; then taking the word that I am trying to look up, stripping off its accents, and doing a search.  That does work, but gives me way too many matches.  I need to prune down the matches, by whatever accents I have, bearing in mind that some of them may be wrong.
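One way to do the pruning, sketched in Python under some assumptions: Unicode decomposition (NFD) is used to separate letters from accents and breathings, candidates are found by their bare letters, and they are then ranked by how many marks agree with the query, letter by letter. The scoring rule (marks in common minus marks that differ) is only one possibility, and all the names here are invented for the example.

```python
import unicodedata
from collections import defaultdict


def split_marks(word):
    """Return (bare letters, list holding the set of marks on each letter)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    letters, marks = [], []
    for ch in decomposed:
        if unicodedata.combining(ch):
            if marks:
                marks[-1].add(ch)  # accent, breathing or iota subscript
        else:
            letters.append(ch)
            marks.append(set())
    return "".join(letters), marks


def build_index(headwords):
    """Map each accent-free spelling to the headwords that share it."""
    index = defaultdict(list)
    for word in headwords:
        index[split_marks(word)[0]].append(word)
    return index


def score(query, candidate):
    """Marks the two words share, minus marks found on only one of them."""
    _, q_marks = split_marks(query)
    _, c_marks = split_marks(candidate)
    shared = sum(len(q & c) for q, c in zip(q_marks, c_marks))
    differing = sum(len(q ^ c) for q, c in zip(q_marks, c_marks))
    return shared - differing


def lookup(query, index):
    """Exact matches if there are any, otherwise all candidates, best first."""
    wanted = split_marks(query)
    candidates = index.get(wanted[0], [])
    exact = [c for c in candidates if split_marks(c) == wanted]
    if exact:
        return exact
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


# Four headwords that differ only in their marks: ἤ, ἦ, ἡ, ᾗ
HEADWORDS = ["\u1f24", "\u1f26", "\u1f21", "\u1f97"]
index = build_index(HEADWORDS)
print(lookup("\u1f20", index))  # query ἠ: smooth breathing, no accent
```

With the query ἠ, the two headwords that also carry a smooth breathing come out ahead of the two with a rough breathing, so a wrong or missing accent costs a little and a wrong breathing costs more.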

Ideas, anyone?

More on Greek translator

One advantage of translating that fragment from Euthymius Zigabenus a couple of days ago is that it made me look again at my Greek->English translator.  It doesn’t give you a good “translation”; but it did give the tools for any Latinist to get the idea.  So I’m resuming work on it for a bit.  Let’s see where it goes.

Perseus hopper – 157 downloads

The source code for the Perseus site is available for download at Sourceforge, and contains all the data too.  I was mildly surprised to discover that it has been downloaded, according to Sourceforge… 157 times.

That sounds very low indeed.  Only 157 downloads since it went open source?

Admittedly it’s very hard going to make sense of, and won’t run on Windows, but even so.

Unicode Greek font and vowel length

I didn’t realise that doing Ancient Greek on computers was still a problem, but I found out otherwise today.  We all remember a myriad of incompatible fonts, and partial support for obscure characters; and like most people I imagined that Unicode had taken our problems away.  Hah!

Unicode character 0304 is the “combining macron”.  What that means, to you and me, is the horizontal line above a long vowel.  Character 0306 is the “combining breve” – the little bow above a short vowel.  The “combining” bit means that if you stick one after an “A” in a word processor, the display will stick it above the preceding letter.  Both symbols are required to display dictionary material correctly, of course.  Poetry needs this stuff.
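A short Python illustration of what “combining” means in practice, using only the standard `unicodedata` module. The variable names are mine; whether the result displays properly still depends on the font.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON
BREVE = "\u0306"   # COMBINING BREVE

# The mark is a separate character placed AFTER the letter it sits on.
long_alpha = "α" + MACRON
short_alpha = "α" + BREVE

print(unicodedata.name(MACRON), "/", unicodedata.name(BREVE))
print(long_alpha, short_alpha, len(long_alpha))  # two code points each

# For alpha a precomposed form exists, so normalisation joins the pair.
composed = unicodedata.normalize("NFC", long_alpha)
print(composed, len(composed), unicodedata.name(composed))
```

Only some letter-and-mark pairs have such a precomposed form; the rest stay as letter plus combining mark, which is where patchy font support starts to matter.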

Today I find that neither character is supported in quite a range of fonts.  Palatino Linotype, found on every PC, doesn’t support either.  Arial Unicode MS supports both, but of course most people don’t have it (or has that changed?).  The links I give above give you lists of supporting fonts, mostly conspicuous for not being present on most PCs.

This is a bit silly.  Come on, chaps, I thought this was sorted out years ago.

I wonder if I can remember where I met a Microsoft font chap, and suggest to him that Palatino be extended to include these?

An interesting list of fonts tested by the TLG people is here.

Diogenes limitations

I’ve been looking at Peter Heslin’s Diogenes tool, which is really quite extraordinary.  It does things that I do not need, but frankly it’s a marvel, particularly when you realise that he worked out so much of the content himself.

One limitation seems to be that the parsing information for a word does not indicate whether it is a noun, a verb, a participle, or whatever.  It does tell you whether it is singular or plural, masculine or feminine etc; but not whether it is a noun or an adjective.  This is a singular omission, and, for a newcomer, a somewhat frustrating one.

Does anyone have any ideas how this information might be calculated?
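One rough possibility, sketched in Python: guess the part of speech from which features the parse contains. This is a heuristic of my own for illustration, not anything Diogenes does. The feature names (`tense`, `mood`, `case`) and the optional set of genders seen for a lemma are assumptions about how the parsing data might be held.

```python
def guess_pos(features, genders_for_lemma=None):
    """Guess a part of speech from the features present in a parse.

    features: dict such as {"tense": "aorist", "case": "nominative"}
    genders_for_lemma: optional set of every gender attested for the lemma
    """
    has_verbal = "tense" in features or "mood" in features
    has_case = "case" in features
    if has_verbal and has_case:
        return "participle"  # verbal features plus case
    if has_verbal:
        return "verb"
    if has_case:
        if genders_for_lemma is None:
            return "nominal"  # noun or adjective, cannot tell
        if len(genders_for_lemma) > 1:
            return "adjective or pronoun"  # the lemma changes gender
        return "noun"
    return "indeclinable"  # particle, preposition, adverb and the like


print(guess_pos({"tense": "aorist", "case": "nominative"}))
print(guess_pos({"case": "genitive"}, {"masculine"}))
```

The gender test will misfire on nouns of common gender, so the result is a hint for the reader rather than a fact about the word.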

Writing Greek translation software – searching for meaning

One of the problems with using free online sources  — aside from bumptious Germans claiming ownership of the Word of God — is that the data is never quite in the format you would like. 

I’m still working on my software to help translate ancient Greek into English.

I’ve just found a set of morphologies — lists of Greek words, with the tense, mood, voice, etc — which omits to include the part of speech! 

Likewise, for my purposes the meaning would best be a single English word; most dictionary entries are waffly, which looks very odd when you set them against each word!
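A crude way to boil a waffly entry down, sketched in Python. The rules are assumptions of mine (drop anything in brackets, keep the first sense, remove a leading article or “to”), and `short_gloss` is an invented name. The result is a short gloss and not always a single word.

```python
import re

LEADING_WORDS = {"a", "an", "the", "to"}


def short_gloss(definition):
    """Reduce a dictionary definition to a short gloss for display under a word."""
    text = re.sub(r"\([^)]*\)", " ", definition)     # drop bracketed remarks
    first = re.split(r"[,;:]", text, maxsplit=1)[0]  # keep only the first sense
    words = first.split()
    while len(words) > 1 and words[0].lower() in LEADING_WORDS:
        words.pop(0)                                  # "to loose" becomes "loose"
    return " ".join(words)


print(short_gloss("to loose, untie (esp. of bonds); to release"))
print(short_gloss("a word, speech, reason"))
```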

LXX text marked up with part of speech, etc

I was hunting around the web for a morphologised Septuagint text — one with the word, the part of speech (noun, verb, etc) and other details, plus the headword or lemma.  I remember doing this search a few years ago, so I know it exists.  This time I was less lucky.  In general there seemed to be less data available online, not more.

I can’t imagine the labour involved in taking each word of the Greek Old Testament, working out all these details, and creating a text file of it all.  It seems enormous to me.  But… to do it, and then let it just disappear, as if unimportant?  That seems even less believable, if anything.  Whatever is going on?

Somewhere there is a great database of morphologised French.  I can find webpages that refer to it; but the download site is gone.  This was state-funded; yet it too has gone.

Why does this happen?

More notes on QuickGreek

I’m continuing work on a piece of software to help me translate from ancient Greek to English.  One problem has been the startup time.  When it takes 10-20 seconds on startup, just to load the various dictionaries, you quickly weary of it.  If you want to look up two words, you find it very annoying.  Since I am running it repeatedly, to test things, I’ve got very weary of it.

I’ve now got the load time down to a couple of seconds.  This is still rather longer than I like, but obviously a lot better.  The downside of this is that it takes marginally longer to parse each word.  I tend to work with only a few hundred words at a time, so this is not too onerous.  The processing time for my test text (of 386 words) is about a second, which is acceptable if not wonderful.
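I haven’t said how the load time was cut, so the following Python sketch shows just one common way to get the same trade-off, and it is not necessarily what QuickGreek does: leave the dictionary in an indexed SQLite file, fetch each word only when it is needed, and cache the answers. The table name, column names and class name are all invented for the example.

```python
import sqlite3
from functools import lru_cache


def build_db(path, entries):
    """Create (or update) the lexicon table from (form, gloss) pairs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lexicon (form TEXT PRIMARY KEY, gloss TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO lexicon VALUES (?, ?)", entries)
    conn.commit()
    return conn


class LazyLexicon:
    """Look words up one at a time instead of loading everything at startup."""

    def __init__(self, conn):
        self._conn = conn
        # Cache repeated words so that common ones are fetched only once.
        self.lookup = lru_cache(maxsize=4096)(self._lookup)

    def _lookup(self, form):
        row = self._conn.execute(
            "SELECT gloss FROM lexicon WHERE form = ?", (form,)
        ).fetchone()
        return row[0] if row else None


# ":memory:" keeps the demo self-contained; a real tool would pass a file path.
conn = build_db(":memory:", [("λόγος", "word"), ("θεός", "god")])
lexicon = LazyLexicon(conn)
print(lexicon.lookup("λόγος"), lexicon.lookup("ἄνθρωπος"))
```

Startup then costs little more than opening the file, and the price is a small query per new word, which is the same kind of bargain as the one described above.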