Import Turnpike Emails into Thunderbird – for free

When I first came onto the web in 1997, I used Demon Internet, and their “Turnpike” software on Windows.  All my email was done that way, safely offline, until about 2012, when I moved to Gmail.  I still have my Turnpike directory on my PC, and, even on Windows 11, Turnpike.exe opens, and all my old emails are still in there.

But it’s pretty hard to search through those for some .doc file from long ago.  How do I import all those emails into somewhere that I can actually use?

If you do an internet search, Google will show you page after page of results from sites, all ending in “.com”, offering a “solution” – to buy some tool.  Thank you, Google.  All that money-grabbing drowns out the real results.  Luckily I found one in an old forum here.

The answer, it seems, is to use Mozilla’s Thunderbird as an intermediary.

I detest these scammers, and you do not need to buy anything.  Turnpike can export to “MBOX” format, a text file; and local email clients like Thunderbird – which is free – can import it.

Here’s how.

Export all your emails and attachments from Turnpike to MBox.

Go into your Turnpike directory, and find turnpike.exe.  In my case this is Turnpike 5.01.

Open it up.  On the menu, choose Window | Mailroom View.  That will show all your emails.  The first one is highlighted.

Select the lot.  For me, I had to click on the first email, hold the shift key, and hit Ctrl-End.

Then do File | Export, and save the resulting mail_001.txt to some directory.  It took a few seconds, but it worked.  In my case the .txt file was almost a gigabyte.  This DOES include all the attachments, all UU-encoded as text.

I then copied the mail_001.txt file and called the new copy “00 Turnpike” (because I wanted all my emails in a folder of that name.  You can use any name that is not already a folder in your email.  Put 00 on the front to make it sort to the top of the folder list, for reasons we will see).

I would strongly suggest that you find an email with an attachment, and just export that on its own.  Try to import that under some name, as below, and check the attachment is imported OK.
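If you want a quick sanity check on the export itself, before going anywhere near Thunderbird, Python’s standard mailbox module can read an mbox file.  Here is a minimal sketch (the file name is just my example; and note that very old, UU-encoded attachments may not show up this way, since they live in the message body rather than as MIME parts):

# Sanity-check an exported mbox file: count the messages and list any
# MIME attachment file names found.  "mail_001.txt" is the Turnpike export.
import mailbox

mbox = mailbox.mbox("mail_001.txt")
print(len(mbox), "messages")

for message in mbox:
    for part in message.walk():
        filename = part.get_filename()
        if filename:
            print(message["subject"], "->", filename)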

Find out where the Thunderbird “Local Folder” is on your disk

Then open Thunderbird.  Scroll down the left panel until you find the Local Folders area (I have a couple of online email accounts connected to Thunderbird so I can read offline, which you see at the top).

As you can see, I already have a local folder named “00 Roger” which I use to back up my emails locally.  But you don’t need that.  I called it “00 Roger” because the local folder is full of junk files, which you mustn’t touch.  So by using the “00”, my folder sorted to the top!  Makes it easier to find.

Right click on “Local Folders” and choose “Settings”.  Select “Local Folders” on the left panel.  This will show you where your local folders are actually held on your hard disk.

As  you can see, I changed the “local directory” from whatever garbage it usually is to somewhere under d:\roger, where I keep all my user files.  It doesn’t matter where it is.

Now take a note of where the local directory is.

Then close down Thunderbird.

Import the Mbox file into the Thunderbird Local Folders directory

Then open that local directory in Windows Explorer.

Copy your small file with the attachment into this directory, right next to the “cert8.db” and all the other files.  Or copy your big “00 Turnpike” file in.

Then restart Thunderbird.

You will now have a new folder in Local Folders.  But … if it’s the biggie, “00 Turnpike”, do wait before expanding the folder.  Allow Thunderbird time to process all those attachments.  For a small file, this won’t take all that long.

Once you feel sure, expand it; your emails will be inside, marked as unread.

If you go back to the local folder in Windows Explorer, you will see your “00 Turnpike” file as you left it, but with a new “.msf” file alongside, which indexes it.

And you’re done.  You have your emails out of Turnpike.

Troubleshooting?  “Where are my attachments?!”  Well, delete the folder in Thunderbird, and try again with a single message.  See if that works.  If it does, then probably you just need to leave Thunderbird open and let it process stuff.

If it all worked OK, then you’re good.

Getting the emails into Gmail

Maybe you want to copy or upload some or all of them into a Gmail account?  Then there are links online that will tell you, like this one.  Basically you just create a connection in Thunderbird to your online email, using IMAP.  This will download your emails to your PC, and create folders etc.  You then just drag the emails from “Local Folders/00 Turnpike” into the folder under your online email account.  But the link will give you a blow-by-blow account of that.  (I didn’t do it myself, tho, because I am increasingly suspicious that anybody who uses Google’s “free” services is about to get a rude awakening, in the shape of unavoidable “low” charges which somehow become very high charges.  See “Monopoly”.)

Likewise if you want a local copy of your online emails in Thunderbird, just copy or drag them from the folder for your Gmail account to a folder under “Local Folders”.

But the point here is that you now can work with your Turnpike emails.

Good luck.


How does “AI translation” work? Some high-level thoughts

The computer world is a high-bullshit industry.   Every computer system consists of nothing more than silicon chips running streams of ones (1) and zeros (0), however grandly this may be dressed-up.  The unwary blindly accept and repeat the words and pictures offered by salesmen with something to sell.  These are repeated by journalists who need something to write about.  Indeed the IT industry is the victim of repeated fads.  These are always hugely oversold, and they come, reach a crescendo, and then wither away.  But anybody doing serious work needs to understand what is going on under the hood.  If you cannot express it in your own words, you don’t understand it, and you will make bad decisions.

“AI” is the latest nonsense term being pumped by the media.  “Are the machines going to take over?!” scream the journalists.  “Your system needs AI,” murmur the salesmen.  It’s all bunk, marketing fluff for the less majestic-sounding “large language model (LLM) with a chatbot on the front.”

This area is the preserve of computer science people, who are often a bit strange, and are always rather mathematical.  But it would seem useful to share my current understanding as to what is going on, culled from a number of articles online.   I guarantee none of this; this is just what I have read.

Ever since Google Translate, machine translation has been done by having a large volume of texts in, say, Latin, a similarly large volume in English, and a large amount of human-written translations of Latin into English.  The “translator” takes a Latin sentence input by a human, searches for a text containing those words in the mass of Latin texts, looks up the existing English translation of the same text, and spits back the corresponding English sentence.  Of course they don’t just have sentences; they have words, and clauses, all indexed in the same way.  There is much more to this, particularly in how material from one language is mapped to material in the other, but that’s the basic principle.  This was known as – jargon alert – “Neural Machine Translation” (NMT).

This process, using existing translations, is why the English translations produced by Google Translate would sometimes drop into Jacobean English for a sentence, or part of it.

The “AI translation” done using an LLM is a further step along the same road, but with added bullshit at each stage.  The jargon word for this technology seems to be “Generative AI”.

A “large language model” (LLM) is a file.  You can download one from GitHub.  It is a file containing numbers, one after another.  Each number represents a word, or part of a word.  The numbers are not random either – they are carefully crafted and generated to tell you how that word fits into the language.  Words relating to similar subjects have numbers which are “closer together”.  So in the sentence “John went skiing in the snow,” both “snow” and “skiing” relate to the same subject, and will have numbers closer together than the number for “John.”
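To make “closer together” concrete, here is a toy sketch in Python.  The numbers are invented purely for illustration – real models use long lists of learned values for each word or word-piece – but the arithmetic is the same sort of thing:

# Toy illustration of "closer together": each word gets a short list of numbers
# (a vector), and related words end up with similar values.  These particular
# values are made up for the example.
import math

toy_vectors = {
    "snow":   [0.9, 0.8, 0.1],
    "skiing": [0.8, 0.9, 0.2],
    "john":   [0.1, 0.2, 0.9],
}

def similarity(a, b):
    # cosine similarity: 1.0 means "pointing the same way", lower means less related
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(similarity(toy_vectors["snow"], toy_vectors["skiing"]))  # about 0.99
print(similarity(toy_vectors["snow"], toy_vectors["john"]))    # about 0.30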

Again you need a very large amount of text on both sides of the language pair.  For each language, these texts are then processed into this mass of numbers.  The numbers tell you whether the word is a verb or a noun, or is a name, or is often found with these words, or never found with those.  The mass of numbers is a “language model”, because it contains vast amounts of information about how the language actually works.  The same English word may have more than one number; “right” in “that’s right” is a different concept to the one in “the politicians of the right.”  The more text you have, the more you can analyse, and the better your model of the language will be.  How many sentences contain both “ski” and “snow”?  And so on.  The model of how words, sentences and so on are actually used, in real language texts, becomes better the more data you put in.  The analysis of the texts starts with human-written code that generates connections; but as you continue to process the data, the process will generate yet more connections.

The end result is these models, which describe the use of the language.  You also end up with a mass of data connecting the two together.  The same number in one side of the language pair will also appear in the other model, pointing to the equivalent word or concept.  So 11050 may mean “love” in English but “am-” in Latin.

As before, there are a lot of steps to this process, which I have jumped over.  Nor is it just a matter of individual words; far from it.

The term used by the AI salesmen for this process is “training the model.”  They use this word to mislead, because it gives to the reader the false impression of a man being trained.  I prefer to say “populating” the model, because it’s just storing numbers in a file.

When we enter a piece of Latin text in an AI Translator, this is encoded in the same way.  The AI system works out the appropriate number for each token – word or part-word – in our text.  This takes quite a bit of time, which is why AI systems hesitate on-screen.  The resulting stream of encoded numbers is then fed into the LLM, which sends back the corresponding English text for those numbers, or numbers which are mathematically “similar”.  Plus a lot of tweaking, no doubt.

But here’s the interesting bit.  The piece of Latin that we put in, and the analysis of it, is not discarded.  This is more raw data for the model.  It is stored in the model itself.

This has two interesting consequences.

The first consequence is that running the same piece of text through the LLM twice will always give different results, and not necessarily better ones.  Because you can never run the same text through the same LLM twice; the LLM is different now, changed to include your text.

The second consequence is even more interesting: you can poison a model by feeding it malicious data, designed to make it give wrong results.  It’s all data, at the end of the day.  The model is just a file.  It doesn’t know anything.  All it is doing is generating the next word, dumbly.  And what happens if the input is itself AI-generated, but is wrong?

In order to create a model of the language and how it is used, you need as much data as possible.  Long ago Google digitised all the books in the world, and turned them into searchable text, even though 80% of them are in copyright.  Google Books merely gives a window on this database.

AI providers need lots of data.  One reason why they have tried to conceal what they are doing is that the data input is nearly all in copyright.  One incautious AI provider did list the sources for its data in an article, and these included a massive pirate archive of books.  But they had to get their data from somewhere.  This is also why there are free tiers on all the AI websites – they want your input.

So… there is no magic.  There is no sinister machine intelligence sitting there.  There is a file full of numbers, and processes.

The output is not perfect.  Even Google Translate could do some odd things.  But AI Translate can produce random results – “hallucinations”.

Further reading


On the typing of Greek

I remember when the pre-unicode SPIonic font was the best way to enter polytonic Greek text.  You typed in a series of characters – “qeo/j”, changed the font, and the same letters now displayed as θεός.  It related very well to the betacode way of doing things, and I think we all got on well with it.  All the same, unicode was definitely a better way of doing things, where the Greekness of the text was encoded in the very characters themselves, and not in their formatting.

Unfortunately typing up unicode is a pain.  It’s so much of a pain that I have a little routine in my elderly HTML editor (MS Frontpage 2003) that takes text entered in the SPIonic way and automatically converts it to unicode.  I’ve probably used this for over a decade.  Indeed I just used it to enter θεός just now.
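For the curious, the core of such a conversion is just a lookup table, plus a rule to turn the accent and breathing characters into combining marks.  Here is a minimal sketch in Python of the general idea – only a handful of characters, and not the actual code in my Frontpage routine:

# Minimal sketch of a SPIonic-style-to-unicode conversion.  Only a few letters
# are mapped; the real scheme has many more rules.
import unicodedata

LETTERS = {
    "q": "θ", "e": "ε", "o": "ο", "j": "ς",   # "j" is final sigma
    "a": "α", "s": "σ", "u": "υ", "w": "ω",
}
MARKS = {
    "/": "\u0301",   # combining acute accent
    "\\": "\u0300",  # combining grave accent
    "(": "\u0314",   # combining rough breathing
    ")": "\u0313",   # combining smooth breathing
}

def spionic_to_unicode(text):
    out = []
    for ch in text:
        if ch in MARKS:
            out.append(MARKS[ch])        # mark applies to the letter just output
        else:
            out.append(LETTERS.get(ch, ch))
    # compose letter + combining mark into single precomposed characters
    return unicodedata.normalize("NFC", "".join(out))

print(spionic_to_unicode("qeo/j"))   # θεός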

But what do you do, if you need to OCR polytonic Greek?  Say in Finereader?  You will need to correct the characters within the editor, with the image of the text right there.  You can’t really use that conversion trick to do it.  You need to be able to type the characters properly.

In Windows 11 there is a polytonic Greek keyboard.  You have to install the Greek language, which will give you a modern Greek keyboard, and you can also install the polytonic alternative.

But the key mappings are a bit mad.  To me, at least, they feel deeply unnatural.  If I press “w”, I expect to get omega, ω.  Instead I get final sigma.  If I type u, I expect to get υ not θ.  And so on it goes.

A bit of googling reveals that you can change these things.  There’s a Microsoft download called MSKLC, Microsoft Keyboard Layout Creator 1.4.  You can start with the standard layout, save it out as a “source file” under some name of your choice, and alter all the mappings.  With considerable labour, of course.  Although the labour gets less if you realise that the “.klc” file produced is just a text file, and you can use Notepad++ to move stuff around.  Then you compile it up, and you can install your new layout.  Apparently uninstalling can be tricky tho: I’m told the trick is to use the same installer to uninstall, rather than the standard Windows Add/Remove process.  But I have yet to try.

I’ve been playing with this, and googling.  It’s a very old utility, and frankly rather outdated and clumsy.  One sign of this is that the characters on the page are teeny-tiny, and the accents are worse!  But it is still perfectly usable.  So far I’ve moved a few keys to where, as an old SPIonic user, I think they should be:

But the next stage is the accents and breathings.  How best to do this?

MSKLC also lets you define “dead keys” – keys that, when you press them, don’t seem to do anything, until you press another key.  So you press a key to give you an acute accent, and nothing happens; then you press alpha, and lo!  You have a single unicode character, an alpha with an acute accent.

Here again the default mapping seems a bit mad.  In SPIonic, you did the breathings using round brackets.  “(” was the rough breathing, “)” was the smooth breathing.  It helped that at least they looked a bit like the breathings.  You did the accents with the forward slash “/” and backslash “\”.  Not so in the default polytonic keyboard.

I think what I will do is to remap the keys so that this happens.

Of course that gives you a problem: what do you do when you need brackets in your Greek text?  But that is unavoidable.

There are legions of weird characters for Greek accents. I’m going to ignore nearly all of them.  If I get something weird, I can pull it out of charmap or something.

Once I have this keyboard, then at least I will be able to correct polytonic Greek text in my OCR tool.  If I get that far, I’ll upload it to GitHub or somewhere.

UPDATE (3 Feb 2024): It’s on GitHub here.


Working with Bauer’s 1783 translation of Bar Hebraeus’ “History of the Dynasties”

Following my last post, I’ve started to look at the PDFs of Bauer’s 1783-5 German translation of Bar Hebraeus’ History of the Dynasties.

It must be said that the Fraktur print is not pleasant to deal with.  But it could be very much worse!  I’ve seen much worse.  Here’s the version from Google Books:

And here is the same page from the MDZ library:

I’ve tried running both through Abbyy Finereader 15 Pro.  Curiously the results are better, on the whole, from the higher resolution MDZ version.  I had expected that the bleed-through from the reverse might cause problems – and it may yet!  Even more oddly, the OCR on the “Plain Text” version of Google Books is better still.

But there is a problem with using Google Books in plain text mode.  There is no way to start part way through the book.  You will always be placed at the very start, and you can only navigate by clicking “Next page” or whatever it is.  This is not good news if you have 100 pages to click through before you get to where you want to be.

The opening portion of these world chronicles is always a version of the biblical narrative about the creation, followed by material from the Old Testament, combined with apocryphal material.  I may be alone here, but I have always found these parts of the narratives unreadable.  When I translated Agapius, I started with the time of Jesus, part way through.  I did the same with Eutychius: I only did the opening chapters last, after I had translated all the way from Jesus to the end of the book.  I recall that it felt like wading through glue.  I might have given up, except that I had already invested so much time in the project.

Starting in the time of Jesus immediately introduces us to familiar figures.  On page 88 of volume 1, the “Sixth Dynasty” starts, with Alexander the Great.  It ends on page 98 with Cleopatra.  Each section starts with a familiar name, one of the Ptolemies in most cases.

On page 99, dynasty 7 begins, after an introduction, with Augustus.  The dynasty ends on p.139 with Justinian.  Each ruler gets a paragraph, often only a few sentences.

It’s all do-able, clearly.  I’m not sure that I want to get into working on this book seriously, with the St Nicholas project still in mid-air.  But it’s not hard work, which is something!


Getting Started With Collatex Standalone

Collatex seems to be the standard collation tool.  Unfortunately I don’t much care for it.  Interestingly, the web site also does not actually tell you how to run it locally!  So here’s a quick note.

Collatex is a Java program, so you must have a Java Runtime Environment (JRE) installed, version 8 or higher.  I think Windows 10 comes with a JRE anyway, but I can’t tell, because long ago I set up a Java development environment which overrides such things.

You download the .jar file for Collatex from here.  Download it somewhere convenient, such as your home directory c:\users\Yourname.

Then hit the Start key, type cmd.exe, and open a command window.  By default this will start in that same directory.

Then run the following command in the command window.

java -jar collatex/collatex-tools-1.8-SNAPSHOT.jar -S

This starts a web server, on port 7369, with error messages to that command window.  (If you just want to start the server and close the window, do “start java …”).

You can then access the GUI interface in your browser on localhost:7369.  This is the same interface as the “Demo” link on the Collatex website.  You can load witnesses, and see the graphical results.

I think it’s best for collating a few sentences.  It’s not very friendly for large quantities of text.

UPDATE: 20 Dec 2022.  Apparently this is just a standalone thing, and is NOT how you use Collatex for real.  It’s actually done by writing little scripts in python.  A couple of links, and a small sketch of the idea below:

  • https://nbviewer.org/github/DiXiT-eu/collatex-tutorial/blob/master/unit5/1_collate-plain-text.ipynb
  • http://interedition.github.io/collatex/pythonport.html
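Going by those tutorials, the pythonic usage looks something like the following.  This is a sketch based on the tutorial, not something I have run myself; the package is installed with “pip install collatex” according to the first link:

# Collate two short witnesses with the python port of CollateX,
# following the DiXiT tutorial linked above.
from collatex import Collation, collate

collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumped over the lazy dog.")
collation.add_plain_witness("B", "The brown fox jumped over the dog.")

# collate() returns an alignment table, which prints as a plain-text grid.
table = collate(collation)
print(table)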

 


A way to compare two early-modern editions of a Latin text

There are three early modern editions of John the Deacon’s Life of St Nicholas.  These are the Mombritius (1498), Falconius (1751) and Mai (1830-ish) editions.  I have already used Abbyy Finereader 15 to create a Word document for each, containing the electronic text.

But how to compare these?  I took a look at Juxta but did not like it, and it is anyway ceasing to be available.  For Collatex I have only been able to use the online version, and I find the output tiring.  But Collatex does allow you to compare more than two witnesses.

The basic problem is that most comparison tools operate on a line-by-line basis.  But in a printed edition the line-breaks are arbitrary.  We just don’t care about them.  I have not found a way to get the Unix diff utility to ignore line breaks.

Today I discovered the existence of dwdiff, available here.  This compares texts word by word, ignoring the line breaks, quite effectively, as this article makes clear.  The downside is that dwdiff is not available for Windows; only for MacOS X, and for Ubuntu Linux.

Fortunately I installed the Windows Subsystem for Linux (WSL) on my Windows 10 PC some time back, with Ubuntu as the Linux variant.    So all I had to do was hit the Start key, and type Ubuntu, then click the App that appeared.  Lo and behold, a Linux Bash-shell command line box appeared.

First, I needed to update Ubuntu; and then install dwdiff.  Finally I ran the man command for dwdiff, to check the installation had worked:

sudo apt-get update -y
sudo apt-get install -y dwdiff
man dwdiff

I then tested it out.  I created the text files in the article linked earlier.  Then I needed to copy them into the WSL area.  Because I have never really used the WSL, I was a bit unsure how to find the “home” directory.  But at the Bash shell, you just type this to get Windows Explorer, and then you can copy files using Windows drag and drop:

explorer.exe .

The space and dot are essential.  This opened an explorer window on “\\wsl$\Ubuntu-20.04\home\roger” (??), and I could get on.  I ran the command:

dwdiff draft1.txt draft2.txt

And got the output, which was a bit of tech gobbledegook:

[-To start with, you-]{+You+} may need to install Tomboy, since it's not yet part of the
stable GNOME release. Most recent distros should have Tomboy packages
available, though they may not be installed by default. On Ubuntu,
run apt-get install tomboy, which should pull down all the necessary [-dependencies ---]
{+dependencies,+} including Mono, if you don't have it installed already.

The [-…] stuff is the text from the first file; the {+…} is the differing text from the second file.  Other text is common to both.

There were also some useful options:

  • dwdiff -c draft1.txt draft2.txt added colours to the output.
  • dwdiff --ignore-case file1 file2 made it treat both files as lower case.
  • dwdiff --no-common file1 file2 caused it to omit the common text.

So I thought I’d have a go.

First I went into Word and saved each file as a .txt file.  I didn’t fiddle with any options.  This gave me a mombritius.txt, a falconius.txt and a mai.txt.

I copied these to the WSL “home”, and I ran dwdiff on the two of them like this:

dwdiff falconius.txt mombritius.txt --no-common -i > op.txt

The files are fairly big, so the output was redirected to a new file, op.txt.  This I opened, in Windows, using the free programmer tool, Notepad++.

The results were interesting, but I found that there were too many trivial differences.  A lot of these were punctuation.  In other cases it was as simple as “cujus” versus “cuius”.

So I opened my falconius.txt in Notepad++ and using Ctrl-H globally replaced the punctuation by a space: the full-stop (.), the colon (:), semi-colon (;), question-mark (?), and two different sorts of brackets – () and [].  Then I saved.

I also changed all the text to lower case (Edit | Convert Case to| lower).

I then changed all the “v” to a “u” and all the “j” to an “i”.

And then, most importantly, I saved the file!  I did the same with the Mombritius.txt file.
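If you have to do this normalisation more than once, it could be scripted rather than done by hand in Notepad++.  Here is a rough Python sketch of the same steps; the file names are mine, and it assumes the .txt files are UTF-8 (if Word saved them as ANSI, you would need encoding="cp1252"):

# Normalise OCR'd Latin text for word-by-word comparison: strip punctuation,
# lower-case everything, and level u/v and i/j.
import re

def normalise(text):
    text = text.lower()
    text = re.sub(r"[.:;?()\[\]]", " ", text)     # punctuation to spaces
    text = text.replace("v", "u").replace("j", "i")
    return text

for name in ("falconius.txt", "mombritius.txt"):
    with open(name, encoding="utf-8") as f:
        cleaned = normalise(f.read())
    with open(name.replace(".txt", "_clean.txt"), "w", encoding="utf-8") as f:
        f.write(cleaned)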

Then I ran the command again, and redirected the results to a text file.  (I found that if I included the common text, it was far easier to work with.)

dwdiff falconius.txt mombritius.txt > myop2.txt

Then I opened myop2.txt in Notepad++.

This produced excellent results.  The only problem was that the result, in myop2.txt, was on very long lines.  But this could easily be fixed in Notepad++ with View | Word Wrap.

The result looked as follows:

Output from dwdiff
Falconius edition vs Mombritius edition

The “[-…]” stuff was Falconius only, the “{+…}” was Mombritius.  (I have no idea why chapter 2 is indented).

That, I think, is rather useful.  It’s not desperately easy to read – it really needs a GUI interface that colours the two kinds of text.  But that would be fairly easy to knock up in Visual Basic, I think.  I might try doing that.

Something not visible in the screen shot was in chapter 13, where the text really gets different.  Also not visible in the screen grab – but very visible in the file – is the end, where there is a long chunk of additional (but spurious) text at the end of the Mombritius.

Here, by the way, is the “no-common” output from the same exercise (with my note on lines 1-2):

dwdiff no-common output

This is quite useful as far as it goes.  There are some things about this which are less than ideal:

  • Using Linux.  Nobody but geeks has Linux.
  • Using an oddball command like dwdiff, instead of a standard utility.  What happens if this ceases to be supported?
  • The output does not display the input.  Rather it displays the text, all lower case, no “j” and “v”, no punctuation.  This makes it harder to relate to the original text.
  • It’s all very techy stuff.  No normal person uses command-line tools and Notepad++.
  • The output is still hard to read – a GUI is needed.
  • Because it relies on both Linux and Windows tools, it’s rather ugly.

Surely a Windows tool with a GUI that does it all could be produced?

The source code for dwdiff is available, but my urge to attempt to port a Linux C++ command line utility to Windows is zero.  If there were a Windows version, that would help a lot.

Maybe this afternoon I will have a play with Visual Basic and see if I can get that output file to display in colour?
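In the meantime, here is a rough sketch – in Python rather than Visual Basic, and untested against my real output – of how the colouring could be done: read the dwdiff output, wrap the [-…-] and {+…+} spans in coloured HTML, and open the result in a browser:

# Turn dwdiff output ( [-first file only-] / {+second file only+} ) into an
# HTML page with the two kinds of text coloured, for easier reading.
import html
import re

def dwdiff_to_html(dwdiff_text):
    escaped = html.escape(dwdiff_text)
    # text only in the first file, e.g. [-cuius-]  ->  red
    escaped = re.sub(r"\[-(.*?)-\]",
                     r'<span style="color:red">\1</span>', escaped, flags=re.S)
    # text only in the second file, e.g. {+cujus+}  ->  blue
    escaped = re.sub(r"\{\+(.*?)\+\}",
                     r'<span style="color:blue">\1</span>', escaped, flags=re.S)
    return "<html><body><pre>" + escaped + "</pre></body></html>"

with open("myop2.txt", encoding="utf-8") as f:
    page = dwdiff_to_html(f.read())
with open("myop2.html", "w", encoding="utf-8") as f:
    f.write(page)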


Copying old floppy disks – an adventure in time!

Yesterday I inherited a couple of cases of old 3.5″ floppy disks.  Most of them were plainly software, of no special relevance.  But it was possible that some contained files and photographs of a deceased relative, which should be preserved.

My first instinct was to use my travelling laptop, which runs Windows 7, and a USB external floppy drive which is branded as Dell but seems to display the label TEAC FD-05PUB in Devices and Printers.  This seems to be the one USB floppy drive available under various names.  But when I inserted the first floppy, Windows told me that the floppy needed to be formatted.  Obviously it could not read the disk, so no good.

At this end of the game, I think I understand why.  The reason seems to be that the floppy was an original 3.5″ 720kb disk, while later 3.5″ disks were formatted as 1.44mb.  The TEAC FD-05PUB driver is badly written and only understands the latter format.  So it supposes that the 720kb disk is not formatted.  This is shoddy work by somebody, and needs to be fixed.

At least the floppy drive does work with Windows 7.  Apparently it often does not work with Windows 10, thanks to an attempt by Microsoft to drop support for it.  There are various workarounds, such as this one.  But it didn’t help me read that disk.

However I still have all the laptops that I have ever bought, since I started freelancing in 1997.  Surely the older ones would have a built-in floppy drive?

A twenty-year old Dell Inspiron 7500 peeks out from under a monitor.

The oldest machine is a Compaq – remember them?  But this refuses to boot, complaining about the date and time.  The internal CMOS battery is long flat, it seems.  Unsure what to do, I leave this.

Next up is a chunky Dell Inspiron 7500.  This too refuses to boot, but – more helpfully – offers to take me into Setup, for the BIOS.  I go in, and, acting on instinct, set the date and time and invite it to continue.  And … it works!  I did have some hard thoughts about whoever decided that a flat battery should prevent Windows booting, mind you!

Anyway it boots up in Windows 98.  A swift shove of the disk into the floppy drive, and … I can see the contents.  In fact the disk does contain some useful files.  I copy them into a folder on the desktop.

Next problem – how do I get the files off the machine and onto something useful?

This proves to be quite a problem!  The machine does not have a built-in CD writer.  It does not have a network port, although it does have serial and parallel ports.  (I had visions at this point of using dear old, slow old Laplink!)  It was once connected to the internet – by dialup!  It does have some PCMCIA card slots.  I toy with seeing if I could get a PCMCIA-to-USB card – they do exist.  PCMCIA is 16 bit, tho.  I think you can do this sort of thing, although not for USB.

Maybe I could get a PCMCIA network card!  They’re all long out of production, of course.  I used to have one, in fact, I vaguely recall.  I also recall throwing it out.  I am not looking forward to trying to configure networking anyway.

I don’t suppose there is a Wifi interface built in?  Not likely.  But anyway I right-click on My Computer, Properties, and look at the Devices tab.  And I forget all about Wifi when I see the magic words … Universal Serial Bus.  Yup – that’s USB!  So there is support there.  But why?  There’s no USB port.  I hunt around the rear once more… and spy… a USB port!!!  Hidden where it won’t be seen!  Yay!

But I am not home yet.  Oh no.  When I stick a USB2 key drive in, it demands a driver!  It seems that Windows 98 did not recognise USB drives by default.  You have to install a driver.  Luckily there is one.  You download nusb36e.exe from the web on your main computer, burn it to a CDR – a normal 700Mb one will do – and then read that in the CD drive that – thankfully – is built in to the machine.  Full instructions are here.  You remove all the existing USB drivers, install the patch, restart, and get an extra USB driver.

I shove a USB2 key drive in, and up it comes as drive E.  Magic!

But I am still not home and dry.  When I click on it, it demands to format it!!  The reason for this is that modern keydrives use the NTFS file system, whereas Win98 was still using the old FAT32 system.  So I go ahead – it’s an empty drive.

Finally it works.  The USB drive opens in Windows explorer, I copy the files, pull the drive out and insert it into my main machine.  And …. I can see the files!!!  Phew!

Now to sift through all those floppies…. yuk!

Pretty painful, I think you’ll admit.  Only just possible.  In a few years those floppies will be useless to anybody but a laboratory.  But they have retained their formatting well, for more than 20 years.

So don’t assume the worst, if you can’t read a floppy in your nice new machine.  It may not be the floppy.


Converting old HTML from ANSI to UTF-8 Unicode

This is a technical post, of interest to website authors who are programmers.  Read on at your peril!

The Tertullian Project website dates back to 1997, when I decided to create a few pages about Tertullian for the nascent world-wide web.  In those days unicode was hardly thought of.  If you needed to be able to include accented characters, like àéü and so forth, you had to do so using “ANSI code pages”.  You may believe that you used “plain text”; but it is not very likely.

If you have elderly HTML pages, they are most likely using ANSI.  This causes phenomenal problems if you try to use Linux command line tools like grep and sed to make global changes.  You need to convert them to Unicode first, before trying anything like that.

What was ANSI anyway?

But let’s have a history lesson.  What are we dealing with here?

In a text file, each byte is a single character.  The byte is in fact a number, from 0 to 255.  Our computers display each value as text on-screen.  In fact you don’t need 256 characters for the symbols that appear on a normal American English typewriter or keyboard.  All these can be fitted into the first 128 values (0-127).  To see which value “means” which character, look up the ASCII table.

The values from 128-255 are not defined in the ASCII table.  Different nations, even different companies, used them for different things.  On an IBM PC these “extended ASCII codes” were used to draw boxes on screen!

The different sets of values were unhelpfully known as “code pages”.  So “code page” 437 was the original IBM PC set – plain ASCII plus those box-drawing characters.  “Code page” 1252 was “Western Europe”, and included just such accents as we need.  You can still see these “code pages” in a Windows console – just type “chcp” and it will tell you what the current code page is; “chcp 1252” will change it to 1252.  In fact Windows used 1252 fairly commonly, and that is likely to be the encoding used in your ANSI text files.  Note that nothing whatever in the file tells you what the author used.  You just have to know (but see below).

So in an ANSI file, the “ü” character will be a single byte.

Then unicode came along.  The encoding of unicode that prevailed was UTF-8, because, for values 0-127, it is identical to ASCII.  So we will ignore the other formats.

In a UTF-8 unicode file, letters like the “ü” character are coded as TWO bytes (and some characters need three or four).  This allows for many thousands of different characters to be encoded.  Most modern text files use UTF-8.  End of the history lesson.
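You can see the difference for yourself at a Python prompt, if you have one handy.  The same “ü” is one byte in code page 1252 but two bytes in UTF-8; and the same single byte means something quite different in another code page:

# One character, different encodings.
text = "ü"
print(text.encode("cp1252"))   # b'\xfc'      - one byte in ANSI (code page 1252)
print(text.encode("utf-8"))    # b'\xc3\xbc'  - two bytes in UTF-8

# The same byte read with two different code pages:
raw = b"\xfc"
print(raw.decode("cp1252"))    # ü  (Western Europe)
print(raw.decode("cp437"))     # ⁿ  (the old IBM PC code page)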

What encoding are my HTML files using?

So how do you know what the encoding is?  Curiously enough, the best way to find out on a Windows box is to download and use the Notepad++ editor.  This simply displays it at the bottom right.  There is also a menu option, “Encoding”, which will indicate all the possibilities, and … drumroll … allow you to change the encoding at a click.

As I remarked earlier, the Linux command line tools like grep and sed simply won’t be serviceable.  The trouble is that these things are written by Americans who don’t really believe anywhere else exists.  Many of them don’t support unicode, even.  I was quite unable to find any that understood ANSI.  I found one tool, ugrep, which could locate the ANSI characters; but it did not understand code pages so could not display them!  After two days of futile pain, I concluded that you can’t even hope to use these until you get away from ANSI.

My attempts to do so produced webpages that displayed with lots of invalid characters!

How to convert multiple ANSI html files to UTF-8.

There is a way to efficiently convert your masses of ANSI files to UTF-8, and I owe my knowledge of it to this StackExchange article here.  You do it in Notepad++.  You can write a script that will drive the editor and just do it.  It runs very fast, it is very simple, and it works.

You install the “Python Script” plugin into Notepad++, which allows you to run a python script.  Then you create a script using Plugins | Python Script | New script.  Save it to the default directory – otherwise it won’t show up in the list when you need to run it.

Mine looked like this:

import os;
import sys;
import re;
# Get the base directory
filePathSrc="d:\\roger\\website\\tertullian.old.wip"

# Get all the fully qualified file names under that directory
for root, dirs, files in os.walk(filePathSrc):

    # Loop over the files
    for fn in files:
    
      # Check last few characters of file name
      if fn[-5:] == '.html' or fn[-4:] == '.htm':
      
        # Open the file in notepad++
        notepad.open(root + "\\" + fn)
        
        # Comfort message
        console.write(root + "\\" + fn + "\r\n")
        
        # Use menu commands to convert to UTF-8
        notepad.runMenuCommand("Encoding", "Convert to UTF-8")
        
        # Do search and replace on strings
        # Charset
        editor.replace("charset=windows-1252", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=iso-8859-1", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=us-ascii", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=unicode", "charset=utf-8", re.IGNORECASE)
        editor.replace("http://www.tertullian", "https://www.tertullian", re.IGNORECASE)
        editor.replace('', '', re.IGNORECASE)

        # Save and close the file in Notepad++
        notepad.save()
        notepad.close()

The indentation with spaces is crucial: python uses it instead of curly brackets.

Also turn on the console: Plugins | Python Script | Show Console.

Then run it with Plugins | Python Script | Scripts | your-script-name.

Of course you run it on a *copy* of your folder…

Then open some of the files in your browser and see what they look like.

And now … now … you can use the Linux command line tools if you like.  Because you’re using UTF-8 files, not ANSI, and, if they support unicode, they will find your characters.

Good luck!

Update: Further thoughts on encoding

I’ve been looking at the output.  Interestingly this does not always work.  I’ve found files converted to UTF-8 where the text has become corrupt.  Doing it manually with Notepad++ works fine.  I’m not sure why this happens.

I’ve always felt that using non-ASCII characters is risky.  It’s better to convert the unicode characters into HTML entities, using &uuml; rather than ü.  I’ve written a further script to do this, in much the same way as above.  The changes need to be case-sensitive, of course.

I’ve now started to run a script in the base directory to add DOCTYPE and charset="utf-8" to all files that do not have them.  It’s unclear how to do the “if” test using Notepad++ and Python, so instead I have used a Bash script running in Git Bash, adapted from one sent in by a correspondent.  Here it is, in abbreviated form:

# This section
# 1) adds a DOCTYPE declaration to all .htm files
# 2) adds a charset meta tag to all .htm files before the title tag.

# Read all the file names using a find and store in an array
files=()
find . -name "*htm" -print0 >tmpfile
while IFS= read -r -d $'\0'; do
      #echo $REPLY - the default variable from the read
      files+=("$REPLY")
done <tmpfile
rm -f tmpfile

# Get a list of files
# Loop over them
for file in "${files[@]}"; do

    # Add DOCTYPE if not present
    if ! grep -q "<!DOCTYPE" "$file"; then
        echo "$file - add doctype"
        sed -i 's|<html>|<!DOCTYPE html>\n<html>|' "$file"
    fi

    # Add charset if not present
    if ! grep -q "meta charset" "$file"; then
        echo "$file - add charset"
        sed -i 's|<title>|<meta charset="utf-8" />\n<title>|I' "$file"
    fi

done

Find non-ASCII characters in all the files

Once you have converted to unicode, you then need to convert the non-ASCII characters into HTML entities.  This I chose to do on Windows in Git Bash.  You can find the duff characters in a file using this:

 grep --color='auto' -P -R '[^\x00-\x7F]' works/de_pudicitia.htm

This highlights any non-ASCII characters in that file.

Of course this is one file.  To get a list of all htm files with characters outside the ASCII range, use this incantation in the base directory, and it will walk the directories (-R) and only show the file names (-l):

grep --color='auto' -P -R -n -l '[^\x00-\x7F]' | grep htm

Convert the non-ASCII characters into HTML entities

I used a python script in Notepad++, and this complete list of HTML entities.  So I had line after line of

editor.replace('Ë','&Euml;')
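Typing out hundreds of those lines is tedious.  If I were doing it again, I might generate them: Python’s html.entities module knows the named entities, so a small throwaway script could print the editor.replace lines ready to paste into the Notepad++ script.  A sketch, which I have not run against my own files:

# Print editor.replace(...) lines for every non-ASCII character that has a
# named HTML entity, ready to paste into a Notepad++ Python Script.
from html.entities import codepoint2name

for codepoint in sorted(codepoint2name):
    if codepoint > 127:                      # skip plain ASCII such as & < >
        char = chr(codepoint)
        name = codepoint2name[codepoint]
        print("editor.replace('%s','&%s;')" % (char, name))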

I shall add more notes here.  They may help me next time.


From my diary

It is Saturday evening here.  I’m just starting to wind down, in preparation for Sunday and a complete day away from the computer, from all the chores and all my hobbies and interests.  I shall go and walk along the seafront instead, and rest and relax and recharge.

Sometimes it is very hard to do these things.  But this custom of always keeping Sunday free from everything has been a lifesaver over the last twenty years.  Most of my interests are quite compelling.  Without this boundary, I would have burned out.

Phase 2 of the QuickLatin conversion from VB6 to VB.Net is complete.  Phase 1 was the process of getting the code converted, so that it compiled.  With Phase 2, I now have some simple phrases being recognised correctly and all the obvious broken bits fixed.  The only exception to this is the copy protection, which I will leave until later.

Phase 3 now lies ahead.  This will consist of creating automated tests for all the combinations of test words and phrases that I have used in the past.  Code like QuickLatin has any number of special cases, which I have yet to exercise.  No doubt some will fail, and I will need to do some fixes.  But when this is done, the stability of the code will be much more certain.  Meanwhile I am trying to resist the insidious temptation to rewrite bits of the code.  That isn’t the objective here.

I began to do a little of this testing over the last few hours.  Something that I missed is code coverage – a tool that tells me visually how much of the code is covered by the tests.  It’s an excellent way to spot edge-cases that you haven’t thought about.

It is quite revealing that Microsoft only include their coverage tool in the Enterprise, maximum-price editions of Visual Studio.  For Microsoft, plainly, it’s a luxury.  But to Java developers like myself, it’s something you use every day.

Of course I can’t afford the expensive corporate editions.  But I think there is a relatively cheap tool that I could use.  I will look.

Once the code is working, then I can set about adding the syntactical stuff that caused me to undertake this in the first place!  I have a small pile of grammars on the floor by my desk which have sat there for a fortnight!

I’m still thinking a bit about the ruins of the Roman fort which lies under the waves at Felixstowe in Suffolk.  This evening I found that another article exists, estimating how far the coast extended and how big the fort was.[1]  It’s not online, but I think a nearby (25 miles away) university will have it.  I’ve sent them a message on twitter, and we’ll see.*

I’ve also continued to monitor archaeological feeds on twitter for items of interest.  I’m starting to build up quite a backlog of things to post about!  I’ll get to them sometime.

* They did not respond.

[1] J. Hagar, “A new plan for Walton Castle Suffolk”, Archaeology Today vol. 8.1 (1987), pp. 22-25.  It seems to be a popular publication, once known as Minerva, but there’s little enough in the literature that it’s worth tracking down.

From my diary

WordPress decided, without my permission, to install version 5.1, complete with their new but deeply unpopular “Gutenberg” editor that nobody wanted or requested.  I can’t downgrade from 5.1, but I’ve managed to get rid of the useless Gutenberg editor.  Let me know if there are any funnies.
