Getting Started With Collatex

Collatex seems to be the standard collation tool.  Unfortunately I don’t much care for it.  Also interestingly, the web site does not actually tell you how to run it locally!  So here’s a quick note.

Collatex is a Java program, so you must have a Java Runtime Environment (JRE) installed, version 8 or higher.  Windows 10 does not, as far as I know, come with a JRE, but I can’t easily check, because long ago I set up a Java development environment which overrides such things.  Typing java -version in a command window will tell you what you have.

Download the .jar file for Collatex from here.  Save it somewhere convenient, such as your home directory, c:\users\Yourname.

Then hit the Start key, type cmd.exe, and open a command window.  By default this will start in that same directory.

Then run the following command in the command window.

java -jar collatex/collatex-tools-1.8-SNAPSHOT.jar -S

This starts a web server on port 7369, with error messages written to that command window.  (If you just want to start the server and close the window, do “start java …”.)

You can then access the GUI in your browser at localhost:7369.  This is the same interface as the “Demo” link on the Collatex website.  You can load witnesses, and see the graphical results.

I think it’s best for collating a few sentences.  It’s not very friendly for large quantities of text.
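For larger texts it may be easier to drive the local server from a script than to paste into the browser.  What follows is only a sketch in Python.  I am assuming that the local server accepts the JSON POST to /collate that the Collatex documentation describes for its web service; the function names and the witness sigla are my own inventions, not anything from Collatex.

```python
import json
import urllib.request

# Assumption: the server started with -S is listening on the port given above,
# and exposes the /collate endpoint described in the Collatex documentation.
COLLATE_URL = "http://localhost:7369/collate"


def build_request_body(witnesses):
    """Turn {"A": "text", "B": "text"} into the JSON body for the request."""
    payload = {
        "witnesses": [
            {"id": sigil, "content": text} for sigil, text in witnesses.items()
        ]
    }
    return json.dumps(payload).encode("utf-8")


def collate(witnesses, url=COLLATE_URL):
    """POST the witnesses to the local server and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_request_body(witnesses),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, something like collate({"A": "first witness text", "B": "second witness text"}) should return the collation as a Python dictionary.  I have not described the shape of the reply here, because it is best inspected on your own installation.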

 


A way to compare two early-modern editions of a Latin text

There are three early modern editions of John the Deacon’s Life of St Nicholas.  These are the Mombritius (1498), Falconius (1751) and Mai (1830-ish) editions.  I have already used Abbyy Finereader 15 to create a Word document for each containing the electronic text.

But how to compare these?  I took a look at Juxta but did not like it, and in any case it is ceasing to be available.  For Collatex I have only been able to use the online version, and I find the output tiring.  But Collatex does allow you to compare more than two witnesses.

The basic problem is that most comparison tools operate on a line-by-line basis.  But in a printed edition the line-breaks are arbitrary.  We just don’t care about them.  I have not found a way to get the Unix diff utility to ignore line breaks.

Today I discovered the existence of dwdiff, available here.  It handles this quite effectively, as this article makes clear.  The downside is that dwdiff is not available for Windows, only for MacOS X and for Ubuntu Linux.

Fortunately I installed the Windows Subsystem for Linux (WSL) on my Windows 10 PC some time back, with Ubuntu as the Linux variant.    So all I had to do was hit the Start key, and type Ubuntu, then click the App that appeared.  Lo and behold, a Linux Bash-shell command line box appeared.

First, I needed to update Ubuntu; and then install dwdiff.  Finally I ran the man command for dwdiff, to check the installation had worked:

sudo apt-get update -y
sudo apt-get install -y dwdiff
man dwdiff

I then tested it out.  I created the text files in the article linked earlier.  Then I needed to copy them into the WSL area.  Because I have never really used the WSL, I was a bit unsure how to find the “home” directory.  But at the Bash shell, you just type this to get Windows Explorer, and then you can copy files using Windows drag and drop:

explorer.exe .

The space and dot are essential.  This opened an explorer window on “\\wsl$\Ubuntu-20.04\home\roger” (??), and I could get on.  I ran the command:

dwdiff draft1.txt draft2.txt

And got the output, which was a bit of tech gobbledegook:

[-To start with, you-]{+You+} may need to install Tomboy, since it's not yet part of the
stable GNOME release. Most recent distros should have Tomboy packages
available, though they may not be installed by default. On Ubuntu,
run apt-get install tomboy, which should pull down all the necessary [-dependencies ---]
{+dependencies,+} including Mono, if you don't have it installed already.

The [-…-] stuff is the text found only in the first file; the {+…+} stuff is the different text in the second file.  Everything else is common to both.
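The notation is easier to follow once you see how a word-by-word comparison produces it.  Here is a minimal sketch in Python, using difflib from the standard library.  It is not dwdiff’s own algorithm, and the function name word_diff is my invention; it simply splits both texts on whitespace, so that line breaks count for nothing, and marks the differences in the same style.

```python
import difflib


def word_diff(first_text, second_text):
    """Compare two texts word by word, ignoring where the lines break."""
    # split() treats a newline like any other whitespace, so the arbitrary
    # line breaks of a printed edition disappear at this point.
    first = first_text.split()
    second = second_text.split()
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)

    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        first_part = " ".join(first[i1:i2])
        second_part = " ".join(second[j1:j2])
        if tag == "equal":
            pieces.append(first_part)
            continue
        piece = ""
        if first_part:   # words only in the first text
            piece += "[-" + first_part + "-]"
        if second_part:  # words only in the second text
            piece += "{+" + second_part + "+}"
        pieces.append(piece)
    return " ".join(pieces)
```

Note that this sketch puts all its output on one line, whereas dwdiff keeps the line breaks of its input.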

There were also some useful options:

  • dwdiff -c draft1.txt draft2.txt added colours to the output.
  • dwdiff --ignore-case file1 file2 made it ignore differences between upper and lower case.
  • dwdiff --no-common file1 file2 caused it to omit the common text.

So I thought I’d have a go.

First I went into Word and saved each file as a .txt file.  I didn’t fiddle with any options.  This gave me a mombritius.txt, a falconius.txt and a mai.txt.

I copied these to the WSL “home”, and I ran dwdiff on two of them like this:

dwdiff falconius.txt mombritius.txt --no-common -i > op.txt

The files are fairly big, so the output was redirected to a new file, op.txt.  This I opened, in Windows, using the free programmer’s editor, Notepad++.

The results were interesting, but I found that there were too many trivial differences.  A lot of these were punctuation.  In other cases it was as simple as “cujus” versus “cuius”.

So I opened my falconius.txt in Notepad++ and, using Ctrl-H, globally replaced the punctuation with a space: the full-stop (.), the colon (:), the semi-colon (;), the question-mark (?), and two different sorts of brackets, () and [].  Then I saved.

I also changed all the text to lower case (Edit | Convert Case to | lowercase).

I then changed all the “v” to a “u” and all the “j” to an “i”.

And then, most importantly, I saved the file!  I did the same with the mombritius.txt file.
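The same clean-up could be scripted rather than done by hand.  This is a rough sketch in Python of the three steps just described: punctuation to spaces, lower case, then “v” to “u” and “j” to “i”.  The function name normalise is my own, and the punctuation list is only the one given above, so commas and anything else are left alone.

```python
import re

# Only the punctuation listed above: . : ; ? ( ) [ ]
PUNCTUATION = re.compile(r"[.:;?()\[\]]")

# Lower-case letters only; the text is lower-cased before this is applied.
V_AND_J = str.maketrans("vj", "ui")


def normalise(text):
    """Return a copy of the text prepared for comparison."""
    text = PUNCTUATION.sub(" ", text)   # punctuation becomes a space
    text = text.lower()                 # everything to lower case
    return text.translate(V_AND_J)      # v -> u, j -> i
```

If you write the result to a new file rather than saving over the original, the untouched text is still there to refer back to.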

Then I ran the command again, and redirected the results to a text file.  (I found that if I included the common text, it was far easier to work with.)

dwdiff falconius.txt mombritius.txt > myop2.txt

Then I opened myop2.txt in Notepad++.

This produced excellent results.  The only problem was that the result, in myop2.txt, was on very long lines.  But this could easily be fixed in Notepad++ with View | Word Wrap.

The result looked as follows:

[Screenshot: output from dwdiff, Falconius edition vs Mombritius edition]

The “[-…-]” stuff was Falconius only, the “{+…+}” was Mombritius.  (I have no idea why chapter 2 is indented.)

That, I think, is rather useful.  It’s not desperately easy to read; it really needs a GUI that colours the two kinds of text.  But that would be fairly easy to knock up in Visual Basic, I think.  I might try doing that.
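Short of a proper GUI, the colouring could be had by turning the dwdiff output into a web page and opening that in a browser.  This is only a sketch, in Python rather than Visual Basic; the function name dwdiff_to_html and the choice of red and green are my own assumptions, and it relies on nothing more than the [-…-] and {+…+} markers in the output file.

```python
import html
import re

# The two kinds of marker that dwdiff writes by default.
FIRST_ONLY = re.compile(r"\[-(.*?)-\]", re.DOTALL)
SECOND_ONLY = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)


def _span(colour):
    """Build a replacement function that wraps the marked text in a colour."""
    def replace(match):
        return '<span style="color:%s">%s</span>' % (colour, match.group(1))
    return replace


def dwdiff_to_html(diff_text, first_colour="red", second_colour="green"):
    """Convert dwdiff output into an HTML fragment with coloured differences."""
    body = html.escape(diff_text)  # protect any <, > or & in the text itself
    body = FIRST_ONLY.sub(_span(first_colour), body)
    body = SECOND_ONLY.sub(_span(second_colour), body)
    # pre-wrap keeps the original spacing but still wraps the very long lines
    return '<pre style="white-space: pre-wrap">' + body + "</pre>"
```

Reading myop2.txt, passing its contents through dwdiff_to_html, and writing the result to an .html file should give something you can open in any browser.  You would need to open the text file with whatever encoding Word used when it saved it.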

Something not visible in the screen shot was in chapter 13, where the text really gets different.  Also not visible in the screen grab – but very visible in the file – is the end, where there is a long chunk of additional (but spurious) text at the end of the Mombritius.

Here, by the way, is the “no-common” output from the same exercise (with my note on lines 1-2):

[Screenshot: dwdiff no-common output]

This is quite useful as far as it goes.  There are some things about this which are less than ideal:

  • Using Linux.  Nobody but geeks has Linux.
  • Using an oddball command like dwdiff, instead of a standard utility.  What happens if this ceases to be supported?
  • The output does not display the input.  Rather it displays the text, all lower case, no “j” and “v”, no punctuation.  This makes it harder to relate to the original text.
  • It’s all very techy stuff.  No normal person uses command-line tools and Notepad++.
  • The output is still hard to read – a GUI is needed.
  • Because it relies on both Linux and Windows tools, it’s rather ugly.
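The third of these complaints, that the output shows the altered text rather than the original, could in principle be met by comparing the words in their simplified form while printing them in their original form.  Here is a hedged sketch of that idea in Python, using difflib from the standard library rather than dwdiff.  The names comparison_key and diff_showing_originals are my own, and the simplification rules are just the ones I applied by hand above.

```python
import difflib
import re

V_AND_J = str.maketrans("vj", "ui")


def comparison_key(word):
    """The form used only for matching: lower case, no punctuation, v=u, j=i."""
    stripped = re.sub(r"[.:;?()\[\]]", "", word.lower())
    return stripped.translate(V_AND_J)


def diff_showing_originals(first_text, second_text):
    """Match on simplified words, but print the words as the editions have them."""
    first = first_text.split()
    second = second_text.split()
    matcher = difflib.SequenceMatcher(
        None,
        [comparison_key(w) for w in first],
        [comparison_key(w) for w in second],
        autojunk=False,
    )
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Where the two agree, show the first edition's spelling.
            pieces.append(" ".join(first[i1:i2]))
            continue
        piece = ""
        if i2 > i1:
            piece += "[-" + " ".join(first[i1:i2]) + "-]"
        if j2 > j1:
            piece += "{+" + " ".join(second[j1:j2]) + "+}"
        pieces.append(piece)
    return " ".join(pieces)
```

So “Cujus” and “cuius” count as the same word, but the reader still sees “Cujus” with its capital and its “j”.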

Surely a Windows tool with a GUI that does it all could be produced?

The source code for dwdiff is available, but my urge to attempt to port a Linux C command line utility to Windows is zero.  If there were a Windows version, that would help a lot.

Maybe this afternoon I will have a play with Visual Basic and see if I can get that output file to display in colour?
