This is a bit of a computer-y post, so it will perhaps be of interest to few.
A couple of days ago I started with a list of PDFs of Greek works of Ephraem Graecus from here. I opened it up in Notepad++ and did a global search and replace on it. So this:
became this, by changing <li> to <hr>\r\nGreek Title:
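For anyone who would rather script this step than do it in Notepad++, the same substitution can be sketched in a few lines of Perl. The two list items are invented sample input, not lines from the real list.

```perl
use strict;
use warnings;

# Hypothetical sample input: two list items standing in for the real list.
my $html = "<li>First work</li>\n<li>Second work</li>\n";

# Replace each opening <li> with a rule plus a field label, as in the
# Notepad++ step; "\r\n" is the Windows line ending used there.
(my $out = $html) =~ s/<li>/<hr>\r\nGreek Title:/g;

print $out;
```

The `(my $out = $html) =~ s/.../.../g` form copies the text first, so the original string is left untouched.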
A similar series of changes added blank fields, and the file became this:
Then it was time to type in some of the data, from the CPG, picking up the pages of the Assemani edition. The file became this:
Next came the pages of the Phrantzolas edition:
I carried on, until I ended up with a text file like this:
Now this is well and good, but I really wanted to manipulate the data programmatically.
For one thing I knew that the works were in the same order as the Thesaurus Linguae Graecae entries – 001-156 – which meant that all I needed to do was number them. But I didn’t fancy typing that in.
What I did, therefore, was to turn each row into XML tags, of my own invention. The file became this:
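Once the data is in XML, the numbering can be done by a script rather than by hand. The following is only a minimal sketch, assuming the XML::LibXML library: the tag names `<works>`, `<work>` and `<title>` and the attribute name `tlg` are invented for illustration and are not necessarily those in the real file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical miniature of the file: <works>, <work> and <title> are
# invented names, not necessarily the tags used in the real input.xml.
my $xml = <<'XML';
<works>
  <work><title>First</title></work>
  <work><title>Second</title></work>
</works>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# The works are already in TLG order, so a running counter is enough.
my $n = 0;
for my $work ($doc->findnodes('/works/work')) {
    # Zero-pad to three digits: 001, 002, ... 156.
    $work->setAttribute('tlg', sprintf('%03d', ++$n));
}

print $doc->toString;
```

Because the number comes from the position in the file, inserting or reordering a work later would renumber everything after it; that is only safe while the file order really does match the TLG order.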
Of course it is really easy to get the start and end tags mismatched, so I used a free online validator to check the XML, just pasting it in, and dealing with whatever errors it found.
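The same kind of check can also be run locally. This is a sketch, assuming XML::LibXML is installed; it only tests that the XML is well formed (tags properly nested and closed), which is the sort of error described above. The helper name `xml_error` and the sample tags are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Returns an empty string if the XML is well formed, else the parser's message.
sub xml_error {
    my ($xml) = @_;
    # load_xml dies on malformed input, so trap the exception with eval.
    my $ok = eval { XML::LibXML->load_xml(string => $xml); 1 };
    return $ok ? '' : "$@";
}

# Hypothetical snippets; the tag names are invented for illustration.
my $good = '<work><title>A</title></work>';
my $bad  = '<work><title>A</work></title>';   # end tags in the wrong order

print "good: ", (xml_error($good) eq '' ? "well formed" : "broken"), "\n";
print "bad:  ", (xml_error($bad)  eq '' ? "well formed" : "broken"), "\n";
```

For a real file, the parser's message includes the line number of the first problem, which is usually enough to find the mismatched tag.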
I avoided reformatting the XML in Notepad++, though. I did install the XML plugin, and tried it out; but it made the file much less compact – not a great idea if you are paging down it and filling in blank fields, as it doubles your keystrokes.
I then added in the translations information from the Ephraem Graecus website list of translations. This meant more tags; but of course I could alter the structure as I went along. I jammed in the data, separated with commas, for speed of entry, and got stuff like this:
So far so good. But I was beginning to feel the need to start turning the XML into something that could be used in a web page. That meant coding.
That done, I opened a command window and installed the CPAN libraries using
This done, I looked for a bit of sample code, which I found here, using the XML::LibXML library. This I adapted.
I got a lot of “Wide character in print” messages, which turned out to be Unicode-related. I had to specify in the Perl to use UTF-8, and also that the STDOUT should use it too (see my code below).
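The usual cure for “Wide character in print” is a couple of lines near the top of the script. This is a minimal sketch of that common idiom, not necessarily the exact lines in a.pl; the Greek word is just a sample.

```perl
use strict;
use warnings;
use utf8;    # the script's own source text contains UTF-8 literals

# Encode everything printed to STDOUT as UTF-8; without this, Perl warns
# "Wide character in print" for any character above 0xFF.
binmode(STDOUT, ':encoding(UTF-8)');

my $title = 'Λόγος';    # sample Greek word, not taken from the data
print "$title\n";
```

`use utf8` only affects how Perl reads the script itself; it is the `binmode` line that changes how the output is written.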
When the script ran, the Greek was gibberish. So I changed the Windows console font to “Lucida Console”, and also specified that the code page for it to use was UTF-8 by entering the command “chcp 65001”.
But once I had this running, it was fine!
Of course then I had to decide what I wanted my output to look like. I built it up, a bit at a time. I found there was more than one translation for some works, so I had to create a nested array of translations. Some translations had a URL, because they were online, so I needed somewhere to put it. I had to break up the original <translation> tag above into <info> and <url>. But I managed.
I kept validating the XML file, and I kept running my Perl script.
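To give an idea of the shape of this, here is a hedged sketch of turning several translations per work into an HTML list, with a link only where a URL exists. Only `<translation>`, `<info>` and `<url>` come from the description above; `<work>`, `<title>`, the sample data and the helper `esc` are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: only <translation>, <info> and <url> are named in the
# post; <work>, <title> and the data itself are invented here.
my $xml = <<'XML';
<work>
  <title>Sample work</title>
  <translation><info>English, 2010</info><url>http://example.com/a</url></translation>
  <translation><info>French, in print only</info></translation>
</work>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Escape the characters that are special in HTML text and attributes.
sub esc {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

my @items;
for my $tr ($doc->findnodes('/work/translation')) {
    my $info = esc($tr->findvalue('./info'));
    my $url  = $tr->findvalue('./url');    # empty string when there is no <url>
    push @items, $url ne ''
        ? qq{<li><a href="} . esc($url) . qq{">$info</a></li>}
        : "<li>$info</li>";
}

my $html = "<ul>\n" . join("\n", @items) . "\n</ul>\n";
print $html;
```

Keeping `<info>` and `<url>` as separate tags means the script never has to guess where the description ends and the link begins, which it would if both sat in one comma-separated string.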
At the end, the output file looked like this:
So, if you open it in Chrome (the browser everyone uses for web development), it looks like this:
Not bad! The first entry is a bit messy, but that was a vice of the original data. The Phrantzolas edition doesn’t give a title in Greek for the whole work, only for each of 26 bits. Nor is there one in the CPG. I built the links from the <url> tags that were in my file. I didn’t add much formatting, other than <small> on the editions etc line.
It’s fairly plain HTML. My guess is that it will paste into a WordPress page quite nicely, in the “Text” tab in the editor.
It may need some rejigging, but the code is hardly complex.
Anyway, here are the complete files, as of today:
- ephraim_perl (.zip file)
This contains the script, a.pl (if I have to type “perl a.pl > op.htm” I want as few characters as possible), the xml file input.xml, and a sample output, op.htm.
Of course the bibliography could be extended mightily, but I don’t propose to do this. What I really wanted was the cross-reference between the old Assemani edition, the new Phrantzolas edition, and the CPG, plus any translations that were around. We’ve got more than one translation already for some works.
All this did take a while! But it was worth it.