Getting Started With Collatex Standalone

Collatex seems to be the standard collation tool.  Unfortunately I don’t much care for it.  Interestingly, the website does not actually tell you how to run it locally!  So here’s a quick note.

Collatex is a Java program, so you must have a Java Runtime Environment (JRE) installed, version 8 or higher.  I think Windows 10 comes with a JRE anyway, but I can’t tell, because long ago I set up a Java development environment which overrides such things.

You download the .jar file for Collatex from here.  Download it somewhere convenient, such as your home directory c:\users\Yourname.

Then hit the Start key, type cmd.exe, and open a command window.  By default this will start in that same directory.

Then run the following command in the command window.

java -jar collatex/collatex-tools-1.8-SNAPSHOT.jar -S

This starts a web server, on port 7369, with error messages to that command window.  (If you just want to start the server and close the window, do “start java …”).

You can then access the GUI interface in your browser on localhost:7369.  This is the same interface as the “Demo” link on the Collatex website.  You can load witnesses, and see the graphical results.
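
Incidentally, the server answers JSON requests as well as serving the GUI, so you can drive it from a script.  Here is a minimal sketch in Python; note that the /collate endpoint and the payload shape are my recollection of the CollateX documentation, not gospel, so check there if it fails:

import json
import urllib.request

# Two tiny witnesses, in the JSON shape that (I believe) the CollateX
# service expects; the /collate endpoint is an assumption from the docs.
payload = {"witnesses": [
    {"id": "A", "content": "the black cat sat on the mat"},
    {"id": "B", "content": "the white cat sat on a mat"},
]}

req = urllib.request.Request(
    "http://localhost:7369/collate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Accept": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))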

I think it’s best for collating a few sentences.  It’s not very friendly for large quantities of text.

UPDATE: 20 Dec 2022.  Apparently this is just a standalone thing, and is NOT how you use Collatex for real.  The real work is done by writing little scripts in Python.  A couple of links, with a sketch of the idea after them:

  • https://nbviewer.org/github/DiXiT-eu/collatex-tutorial/blob/master/unit5/1_collate-plain-text.ipynb
  • http://interedition.github.io/collatex/pythonport.html
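
For what it’s worth, the gist of those tutorials is only a few lines.  A sketch, assuming the package is installed (pip install collatex); the witness texts are just placeholders:

from collatex import *

# Build a collation of plain-text witnesses and print the alignment table.
collation = Collation()
collation.add_plain_witness("A", "the black cat sat on the mat")
collation.add_plain_witness("B", "the white cat sat on a mat")

table = collate(collation)  # returns a printable alignment table
print(table)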



A way to compare two early-modern editions of a Latin text

There are three early modern editions of John the Deacon’s Life of St Nicholas.  These are the Mombritius (1498), Falconius (1751) and Mai (1830-ish) editions.  I have already used Abbyy Finereader 15 to create a Word document for each, containing the electronic text.

But how to compare these?  I took a look at Juxta but did not like it, and it is anyway ceasing to be available.  For Collatex I have only been able to use the online version, and I find the output tiring.  But Collatex does allow you to compare more than two witnesses.

The basic problem is that most comparison tools operate on a line-by-line basis.  But in a printed edition the line-breaks are arbitrary.  We just don’t care about them.  I have not found a way to get the Unix diff utility to ignore line breaks.
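
(In hindsight, a word-by-word comparison can be roughed out with Python’s standard difflib module, by splitting the text into words first so that the line breaks vanish.  A sketch, with placeholder file names:)

import difflib

# Read each file as a flat list of words, so line breaks play no part.
with open("edition1.txt", encoding="utf-8") as f:
    words1 = f.read().split()
with open("edition2.txt", encoding="utf-8") as f:
    words2 = f.read().split()

# ndiff marks words "- " (first file only), "+ " (second file only),
# and "  " (common to both).
for token in difflib.ndiff(words1, words2):
    if not token.startswith("?"):   # skip ndiff's hint lines
        print(token)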

Today I discovered the existence of dwdiff, available here.  It can do exactly this, quite effectively, as this article makes clear.  The downside is that dwdiff is not available for Windows; only for macOS and Ubuntu Linux.

Fortunately I installed the Windows Subsystem for Linux (WSL) on my Windows 10 PC some time back, with Ubuntu as the Linux variant.  So all I had to do was hit the Start key, type Ubuntu, and click the App that appeared.  Lo and behold, a Linux Bash-shell command line box appeared.

First I needed to update Ubuntu, and then install dwdiff.  Finally I ran the man command for dwdiff, to check that the installation had worked:

sudo apt-get update -y
sudo apt-get install -y dwdiff
man dwdiff

I then tested it out.  I created the text files in the article linked earlier.  Then I needed to copy them into the WSL area.  Because I have never really used the WSL, I was a bit unsure how to find the “home” directory.  But at the Bash shell, you just type this to get Windows Explorer, and then you can copy files using Windows drag and drop:

explorer.exe .

The space and dot are essential.  This opened an explorer window on “\\wsl$\Ubuntu-20.04\home\roger” (??), and I could get on.  I ran the command:

dwdiff draft1.txt draft2.txt

And got the output, which was a bit of tech gobbledegook:

[-To start with, you-]{+You+} may need to install Tomboy, since it's not yet part of the
stable GNOME release. Most recent distros should have Tomboy packages
available, though they may not be installed by default. On Ubuntu,
run apt-get install tomboy, which should pull down all the necessary [-dependencies ---]
{+dependencies,+} including Mono, if you don't have it installed already.

The [-…-] stuff is the text of the first file; the {+…+} is the differing text of the second file.  Other text is common to both.

There were also some useful options:

  • dwdiff -c draft1.txt draft2.txt added colours to the output.
  • dwdiff --ignore-case file1 file2 made it treat both files as lower case.
  • dwdiff --no-common file1 file2 caused it to omit the common text.

So I thought I’d have a go.

First I went into Word and saved each file as a .txt file.  I didn’t fiddle with any options.  This gave me a mombritius.txt, a falconius.txt and a mai.txt.

I copied these to the WSL “home”, and I ran dwdiff on the two of them like this:

dwdiff falconius.txt mombritius.txt --no-common -i > op.txt

The files are fairly big, so the output was redirected to a new file, op.txt.  This I opened, in Windows, using the free programmer’s editor Notepad++.

The results were interesting, but I found that there were too many useless matches.  A lot of these were punctuation.  In other cases it was as simple as “cujus” versus “cuius”.

So I opened my falconius.txt in Notepad++ and using Ctrl-H globally replaced the punctuation by a space: the full-stop (.), the colon (:), semi-colon(;), question-mark (?), and two different sorts of brackets – () and [].  Then I saved.

I also changed all the text to lower case (Edit | Convert Case to | lowercase).

I then changed all the “v” to a “u” and all the “j” to an “i”.

And then, most importantly, I saved the file!  I did the same with the Mombritius.txt file.
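
All that manual find-and-replace could equally well be scripted.  Here is a rough Python equivalent of the same normalisation, as a sketch – it assumes the files were saved as UTF-8, and it overwrites them in place, so work on copies:

import re

def normalise(path):
    # Prepare an OCR'd Latin text for word-level comparison.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    text = re.sub(r"[.:;?()\[\]]", " ", text)        # punctuation to spaces
    text = text.lower()                              # all lower case
    text = text.replace("v", "u").replace("j", "i")  # level u/v and i/j
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

normalise("falconius.txt")
normalise("mombritius.txt")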

Then I ran the command again, and piped the results to a text file.  (I found that if I included the common text, it was far easier to work with.)

dwdiff falconius.txt mombritius.txt > myop2.txt

Then I opened myop2.txt in Notepad++.

This produced excellent results.  The only problem was that the result, in myop2.txt, was on very long lines.  But this could easily be fixed in Notepad++ with View | Word Wrap.

The result looked as follows:

[Screenshot: output from dwdiff, Falconius edition vs Mombritius edition]

The [-…-] stuff was Falconius only, the {+…+} was Mombritius.  (I have no idea why chapter 2 is indented).

That, I think, is rather useful.  It’s not desperately easy to read – it really needs a GUI that colours the two kinds of text.  But that would be fairly easy to knock up in Visual Basic, I think.  I might try doing that.
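
(In fact a crude coloured view needs no GUI at all.  Here is a Python sketch that reads the dwdiff output and prints the [-…-] text in red and the {+…+} text in green, on any terminal that understands ANSI colour codes.  Run it as python colourdiff.py myop2.txt.)

import re
import sys

RED, GREEN, RESET = "\x1b[31m", "\x1b[32m", "\x1b[0m"

# Wrap dwdiff's deletion markers in red and its insertion markers in green.
text = open(sys.argv[1], encoding="utf-8").read()
text = re.sub(r"\[-(.*?)-\]", RED + r"[-\1-]" + RESET, text, flags=re.S)
text = re.sub(r"\{\+(.*?)\+\}", GREEN + r"{+\1+}" + RESET, text, flags=re.S)
print(text)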

Something not visible in the screen shot was in chapter 13, where the text really gets different.  Also not visible in the screen grab – but very visible in the file – is the end, where there is a long chunk of additional (but spurious) text at the end of the Mombritius.

Here by the way is the “no-common” output from the same exercise (with my note on lines 1-2):

[Screenshot: dwdiff --no-common output]

This is quite useful as far as it goes.  There are some things about this which are less than ideal:

  • Using Linux.  Nobody but geeks has Linux.
  • Using an oddball command like dwdiff, instead of a standard utility.  What happens if this ceases to be supported?
  • The output does not display the input.  Rather it displays the normalised text: all lower case, no “j” or “v”, no punctuation.  This makes it harder to relate to the original text.
  • It’s all very techy stuff.  No normal person uses command-line tools and Notepad++.
  • The output is still hard to read – a GUI is needed.
  • Because it relies on both Linux and Windows tools, it’s rather ugly.

Surely a Windows tool with a GUI that does it all could be produced?

The source code for dwdiff is available, but my urge to attempt to port a Linux command-line utility to Windows is zero.  If there were a Windows version, that would help a lot.

Maybe this afternoon I will have a play with Visual Basic and see if I can get that output file to display in colour?


Copying old floppy disks – an adventure in time!

Yesterday I inherited a couple of cases of old 3.5″ floppy disks.  Most of them were plainly software, of no special relevance.  But it was possible that some contained files and photographs of a deceased relative, which should be preserved.

My first instinct was to use my travelling laptop, which runs Windows 7, and a USB external floppy drive which is branded as Dell but displays the label TEAC FD-05PUB in Devices and Printers.  This seems to be the one USB floppy drive available under various names.  But when I inserted the first floppy, Windows told me that the floppy needed to be formatted.  Obviously it could not read the disk, so no good.

At this end of the game, I think I understand why.  The reason seems to be that the floppy was an original 3.5″ 720 KB unit, while later 3.5″ disks were formatted for 1.44 MB.  The TEAC FD-05PUB driver is badly written and only understands the latter format.  So it supposes that the 720 KB disk is not formatted.  This is shoddy work by somebody, and needs to be fixed.

At least the floppy drive does work with Windows 7.  Apparently it often does not work with Windows 10, thanks to an attempt by Microsoft to drop support for it.  There are various workarounds, such as this one.  But it didn’t help me read that disk.

However I still have all the laptops that I have ever bought, since I started freelancing in 1997.  Surely the older ones would have a built-in floppy drive?

[Photo: a twenty-year-old Dell Inspiron 7500 peeks out from under a monitor.]

The oldest machine is a Compaq – remember them?  But this refuses to boot, complaining about the date and time.  The internal CMOS battery is long flat, it seems.  Unsure what to do, I leave this.

Next up is a chunky Dell Inspiron 7500.  This too refuses to boot, but – more helpfully – offers to take me into Setup, for the BIOS.  I go in, and, acting on instinct, set the date and time and invite it to continue.  And … it works!  I did have some hard thoughts about whoever decided that a flat battery should prevent Windows booting, mind you!

Anyway it boots up in Windows 98.  A swift shove of the disk into the floppy drive, and … I can see the contents.  In fact the disk does contain some useful files.  I copy them into a folder on the desktop.

Next problem – how do I get the files off the machine and onto something useful?

This proves to be quite a problem!  The machine does not have a built-in CD writer.  It does not have a network port, although it does have serial and parallel ports.  (I had visions at this point of using dear old, slow old Laplink!)  It was once connected to the internet – by dialup!  It does have some PCMCIA card slots.  I toy with seeing if I could get a PCMCIA-to-USB card – they do exist.  But these slots are the old 16-bit PCMCIA sort, and the USB cards seem to need the later 32-bit CardBus slots.  I think you can do this sort of thing with 16-bit cards, although not for USB.

Maybe I could get a PCMCIA network card!  They’re all long out of production, of course.  I used to have one, in fact, I vaguely recall.  I also recall throwing it out.  I am not looking forward to trying to configure networking anyway.

I don’t suppose there is a Wifi interface built in?  Not likely.  But anyway I right-click on My Computer, choose Properties, and look at the Device Manager tab.  And I forget all about Wifi when I see the magic words … Universal Serial Bus.  Yup – that’s USB!  So there is support there.  But why?  There’s no USB port.  I hunt around the rear once more… and spy… a USB port!!!  Hidden where it won’t be seen!  Yay!

But I am not home yet.  Oh no.  When I stick a USB2 key drive in, it demands a driver!  It seems that Windows 98 did not recognise USB drives by default.  You have to install a driver.  Luckily there is one.  You download nusb36e.exe from the web on your main computer, burn it to a CD-R – a normal 700 MB one will do – and then read that in the CD drive that, thankfully, is built into the machine.  Full instructions are here.  You remove all the existing USB drivers, install the patch, restart, and get an extra USB driver.

I shove a USB2 key drive in, and up it comes as drive E.  Magic!

But I am still not home and dry.  When I click on it, it demands to format it!!  The reason for this is that modern keydrives often come formatted as NTFS, whereas Win98 only understands the old FAT32 system.  So I go ahead – it’s an empty drive.

Finally it works.  The USB drive opens in Windows explorer, I copy the files, pull the drive out and insert it into my main machine.  And …. I can see the files!!!  Phew!

Now to sift through all those floppies…. yuk!

Pretty painful, I think you’ll admit.  Only just possible.  In a few years those floppies will be useless to anybody but a laboratory.  But they have retained their formatting well, for more than 20 years.

So don’t assume the worst, if you can’t read a floppy in your nice new machine.  It may not be the floppy.


Converting old HTML from ANSI to UTF-8 Unicode

This is a technical post, of interest to website authors who are programmers.  Read on at your peril!

The Tertullian Project website dates back to 1997, when I decided to create a few pages about Tertullian for the nascent world-wide web.  In those days Unicode was hardly thought of.  If you needed to include accented characters, like àéü and so forth, you had to do so using “ANSI code pages”.  You may believe that you used “plain text”; but it is not very likely.

If you have elderly HTML pages, they are most likely using ANSI.  This causes phenomenal problems if you try to use Linux command line tools like grep and sed to make global changes.  You need to convert them to Unicode first, before trying anything like that.

What was ANSI anyway?

But let’s have a history lesson.  What are we dealing with here?

In a text file, each byte is a single character.  The byte is in fact a number, from 0 to 255.  Our computers display each value as text on-screen.  In fact you don’t need 256 characters for the symbols that appear on a normal American English typewriter or keyboard.  All of these fit in the first 127 values.  To see which value “means” which character, look up the ASCII table.

The values from 128-255 are not defined in the ASCII table.  Different nations, even different companies, used them for different things.  On an IBM PC these “extended ASCII codes” were used to draw boxes on screen!

The different sets of values were unhelpfully known as “code pages”.  So “code page” 437 was the original IBM PC set: ASCII plus things like those box-drawing characters.  “Code page” 1252 was “Western European”, and included just such accents as we need.  You can still see these “code pages” in a Windows console – just type “chcp” and it will tell you what the current code page is; “chcp 1252” will change it to 1252.  In fact Windows used 1252 fairly commonly, and that is likely to be the encoding used in your ANSI text files.  Note that nothing whatever in the file tells you what encoding the author used.  You just have to know (but see below).

So in an ANSI file, the “ü” character will be a single byte.

Then Unicode came along.  The encoding of Unicode that prevailed was UTF-8, because, for the values 0-127, it is identical to ASCII.  So we will ignore the other encodings.

In a UTF-8 file, letters like the “ü” character are coded as TWO bytes (and rarer characters as three or four).  This allows more than a million different characters to be encoded.  Most modern text files use UTF-8.  End of the history lesson.
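
You can see the difference at a Python prompt:

# "ü" is one byte in code page 1252, but two bytes in UTF-8.
print("ü".encode("cp1252"))      # b'\xfc'
print("ü".encode("utf-8"))       # b'\xc3\xbc'

# Reading the 1252 byte back works; reading it as UTF-8 would not:
print(b"\xfc".decode("cp1252"))  # ü
# b"\xfc".decode("utf-8") raises UnicodeDecodeError.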

What encoding are my HTML files using?

So how do you know what the encoding is?  Curiously enough, the best way to find out on a Windows box is to download and use the Notepad++ editor.  This simply displays the encoding at the bottom right.  There is also a menu option, “Encoding”, which will list all the possibilities, and … drumroll … allow you to alter them at a click.
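
If you have hundreds of files, opening each one in Notepad++ soon palls.  A rough bulk test can be scripted instead: anything that decodes cleanly as UTF-8 is UTF-8 (or plain ASCII, which is the same thing here); whatever fails is probably your ANSI.  A sketch, to be run from the base directory:

import os

# List the HTML files that do not decode as UTF-8: the ANSI suspects.
for root, dirs, files in os.walk("."):
    for fn in files:
        if fn.endswith((".htm", ".html")):
            path = os.path.join(root, fn)
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError:
                print(path)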

As I remarked earlier, the Linux command line tools like grep and sed simply won’t be serviceable.  The trouble is that these things are written by Americans who don’t really believe anywhere else exists.  Many of them don’t even support Unicode.  I was quite unable to find any that understood ANSI.  I found one tool, ugrep, which could locate the ANSI characters; but it did not understand code pages, so could not display them!  After two days of futile pain, I concluded that you can’t even hope to use these until you get away from ANSI.

My attempts to do so produced webpages that displayed with lots of invalid characters!

How to convert multiple ANSI HTML files to UTF-8

There is a way to efficiently convert your masses of ANSI files to UTF-8, and I owe my knowledge of it to this StackExchange article here.  You do it in Notepad++.  You can write a script that drives the editor and just does it.  It runs very fast, it is very simple, and it works.

You install the “Python Script” plugin into Notepad++, which allows you to run a Python script inside the editor.  Then you create a script using Plugins | Python Script | New script.  Save it to the default directory – otherwise it won’t show up in the list when you need to run it.

Mine looked like this:

import os
import sys
import re
# Get the base directory
filePathSrc="d:\\roger\\website\\tertullian.old.wip"

# Get all the fully qualified file names under that directory
for root, dirs, files in os.walk(filePathSrc):

    # Loop over the files
    for fn in files:
    
      # Check last few characters of file name
      if fn[-5:] == '.html' or fn[-4:] == '.htm':
      
        # Open the file in notepad++
        notepad.open(root + "\\" + fn)
        
        # Comfort message
        console.write(root + "\\" + fn + "\r\n")
        
        # Use menu commands to convert to UTF-8
        notepad.runMenuCommand("Encoding", "Convert to UTF-8")
        
        # Do search and replace on strings
        # Charset
        editor.replace("charset=windows-1252", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=iso-8859-1", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=us-ascii", "charset=utf-8", re.IGNORECASE)
        editor.replace("charset=unicode", "charset=utf-8", re.IGNORECASE)
        editor.replace("http://www.tertullian", "https://www.tertullian", re.IGNORECASE)
        editor.replace('', '', re.IGNORECASE)

        # Save and close the file in Notepad++
        notepad.save()
        notepad.close()

In Python the indentation with spaces is crucial: it does the job that curly brackets do in other languages.

Also turn on the console: Plugins | Python Script | Show Console.

Then run it: Plugins | Python Script | Scripts | your-script-name.

Of course you run it on a *copy* of your folder…

Then open some of the files in your browser and see what they look like.

And now … now … you can use the Linux command line tools if you like.  Because you’re using UTF-8 files, not ANSI, and, if they support unicode, they will find your characters.

Good luck!

Update: Further thoughts on encoding

I’ve been looking at the output.  Interestingly, this does not always work.  I’ve found files converted to UTF-8 where the text has become corrupt.  Doing it manually with Notepad++ works fine.  I am not sure why this happens.

I’ve always felt that using non-ASCII characters is risky.  It’s better to convert the Unicode into HTML entities; using &uuml; rather than ü.  I’ve written a further script to do this, in much the same way as above.  The changes need to be case sensitive, of course.

I’ve now started to run a script in the base directory to add DOCTYPE and charset=”utf-8″ to all files that do not have them.  It’s unclear how to do the “if” test using Notepad++ and Python, so instead I have used a Bash script running in Git Bash, adapted from one sent in by a correspondent.  Here it is, in abbreviated form:

# This section
# 1) adds a DOCTYPE declaration to all .htm files
# 2) adds a charset meta tag to all .htm files before the title tag.

# Read all the file names using a find and store in an array
files=()
find . -name "*htm" -print0 >tmpfile
while IFS= read -r -d $'\0'; do
      #echo $REPLY - the default variable from the read
      files+=("$REPLY")
done <tmpfile
rm -f tmpfile

# Get a list of files
# Loop over them
for file in "${files[@]}"; do

    # Add DOCTYPE if not present
    if ! grep -q "<!DOCTYPE" "$file"; then
        echo "$file - add doctype"
        sed -i 's|<html>|<!DOCTYPE html>\n<html>|' "$file"
    fi

    # Add charset if not present
    if ! grep -q "meta charset" "$file"; then
        echo "$file - add charset"
        sed -i 's|<title>|<meta charset="utf-8" />\n<title>|I' "$file"
    fi

done

Find non-ASCII characters in all the files

Once you have converted to unicode, you then need to convert the non-ASCII characters into HTML entities.  This I chose to do on Windows in Git Bash.  You can find the duff characters in a file using this:

 grep --color='auto' -P -R '[^\x00-\x7F]' works/de_pudicitia.htm

This prints the matching lines, with the non-ASCII characters picked out in colour.

Of course this is one file.  To get a list of all htm files with characters outside the ASCII range, use this incantation in the base directory, and it will walk the directories (-R) and only show the file names (-l):

grep --color='auto' -P -R -n -l '[^\x00-\x7F]' | grep htm

Convert the non-ASCII characters into HTML entities

I used a Python script in Notepad++, and this complete list of HTML entities.  So I had line after line of

editor.replace('Ë','&Euml;')
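
Typing those lines by hand is tedious; Python’s own entity table will generate them.  A sketch – run it under ordinary Python 3 and paste the output into the Notepad++ script (the default editor.replace is case sensitive, which is what we want here):

import html.entities

# Emit one editor.replace(...) line per character with a named HTML entity,
# skipping plain ASCII such as & and < which must stay as they are.
for codepoint, name in sorted(html.entities.codepoint2name.items()):
    if codepoint > 127:
        print("editor.replace('%s', '&%s;')" % (chr(codepoint), name))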

I shall add more notes here.  They may help me next time.


From my diary

It is Saturday evening here.  I’m just starting to wind down, in preparation for Sunday and a complete day away from the computer, from all the chores and all my hobbies and interests.  I shall go and walk along the seafront instead, and rest and relax and recharge.

Sometimes it is very hard to do these things.  But this custom of always keeping Sunday free from everything has been a lifesaver over the last twenty years.  Most of my interests are quite compelling.  Without this boundary, I would have burned out.

Phase 2 of the QuickLatin conversion from VB6 to VB.Net is complete.  Phase 1 was the process of getting the code converted, so that it compiled.  With Phase 2, I now have some simple phrases being recognised correctly and all the obvious broken bits fixed.  The only exception to this is the copy protection, which I will leave until later.

Phase 3 now lies ahead.  This will consist of creating automated tests for all the combinations of test words and phrases that I have used in the past.  Code like QuickLatin has any number of special cases, which I have yet to exercise.  No doubt some will fail, and I will need to do some fixes.  But when this is done, the stability of the code will be much more certain.  Meanwhile I am trying to resist the insidious temptation to rewrite bits of the code.  That isn’t the objective here.

I began to do a little of this testing over the last few hours.  Something that I missed is code coverage – a tool that tells me visually how much of the code is covered by the tests.  It’s an excellent way to spot edge-cases that you haven’t thought about.

It is quite revealing that Microsoft only include their coverage tool in the Enterprise, maximum-price editions of Visual Studio.  For Microsoft, plainly, it’s a luxury.  But to Java developers like myself, it’s something you use every day.

Of course I can’t afford the expensive corporate editions.  But I think there is a relatively cheap tool that I could use.  I will look.

Once the code is working, then I can set about adding the syntactical stuff that caused me to undertake this in the first place!  I have a small pile of grammars on the floor by my desk which have sat there for a fortnight!

I’m still thinking a bit about the ruins of the Roman fort which lies under the waves at Felixstowe in Suffolk.  This evening I found another article exists, estimating how far the coast extended and how big the fort was.[1]  It’s not online, but I think a nearby (25 miles away) university will have it.  I’ve sent them a message on twitter, and we’ll see.*

I’ve also continued to monitor archaeological feeds on twitter for items of interest.  I’m starting to build up quite a backlog of things to post about!  I’ll get to them sometime.

* They did not respond.

[1] J. Hagar, “A new plan for Walton Castle Suffolk”, Archaeology Today, vol. 8.1 (1987), pp. 22-25.  It seems to be a popular publication, once known as Minerva, but there’s little enough in the literature that it’s worth tracking down.

From my diary

WordPress decided, without my permission, to install version 5.1, complete with their new but deeply unpopular “Gutenberg” editor that nobody wanted or requested.  I can’t downgrade from 5.1, but I’ve managed to get rid of the useless Gutenberg editor.  Let me know if there are any funnies.


From my diary

This is another highly technical post, so I apologise to those readers with no interest in programming.

This week I have continued the ghastly process of migrating the 27,000 lines of code that make up QuickLatin from Visual Basic 6 to VB.Net 2008.

I found that the “Upgrade Wizard” for VB6 was no longer included in versions of Visual Studio later than 2008.  So I don’t really get a choice on which version of dotNet to use.  That said, I have found that it works quite well, so long as you approach the problem in the right way.  You will, in fact, have to adopt this approach whatever tool you use.

The first thing is to place the existing code under source control.  You will need to change this code a lot, to make it fit to convert.  Sometimes you will get it wrong, and need to revert to the last checked-in version.  Believe me, you will!  I checked in code after each small set of changes.  It was the only way.

You see, the key to converting a VB6 application to VB.Net is to keep the application working at all times.  Don’t simply launch the upgrade wizard and then end up with 50,000 errors in the VB.net version, and then start at one end to fix them all.  You will just give up!

Instead, make changes to the VB6 version of the code.  Make small changes, check it works, check in the change.  Then do another; and another.

We all know the sort of things that don’t get converted OK:

  • Fixed strings in user-defined types (UDTs).  So code these out: convert them to non-fixed strings, a little at a time.  Of course you created them in order to do Win32 API calls?  Well, they won’t convert, and you will have to recode this stuff by hand.  So do a little recoding in VB6.
  • API calls, i.e. stuff that you added in to do more hairy stuff.  Recode to remove them.  You may end up just commenting the contents of the functions out.  I have stuff that does a splash screen.  I don’t need that code in VB.Net, which has a built-in SplashScreen template.  So with other things.
  • Front-end forms stuff.  No need for splitters – VB.NET has its own SplitterContainer.

There are many more.  Some are listed in a Microsoft document “Preparing Your Visual Basic 6.0 Applications for the Upgrade to Visual Basic.NET”  (no doubt this link will fail in a couple of years, thanks to Microsoft’s idiotic policy of moving files around on their website).  If you page down past the general chit-chat, at the bottom is a list of stuff that won’t convert.  Fix these.

If you do this, you can eliminate most of the rubbish, while keeping most of your code still working.

Once you have done this, then do a test of the Upgrade Wizard.  Expect to throw it away; but it will give you a list of failures to address in VB6.

Once I did this, I ended up with some 37 errors in VB.Net.  That was a tiny number!  Most of these I fixed in VB6, and reran the Upgrade Wizard several times.  By the end my VB6 application was rather damaged, and much of the UI didn’t really work.  But the logic engine was still running just fine.

A few things just won’t convert.  But you can fix a few on the other side.

Once your VB.Net application compiles, you can try and run it.  It will fail, of course.  This bit is just slog.  You find out what is wrong, and then consider if you can code it out in VB6 and re-upgrade.  Often you can.

QuickLatin has a lot of file-handling.  VB6 was slow in reading files, so I created a .DLL written in Visual C++, purely to grab a file and squirt it into an area of memory mapped to an array of UDTs.  Needless to say this did not convert to dotNet!  So what I did was write a slow version, in raw VB6, which did the same thing.  I unpicked the optimisation, knowing that I could reoptimise on the other side of the upgrade process.

I ended up recompiling the DLL.  I’m not sure when I wrote that, but it was probably in Visual Studio 6.  It produced a DLL which was around 80 KB in size.  The version of the same code, produced by Visual C++ 2008, was 107 KB.  It did exactly the same; but Microsoft’s lazy compiler developers had bloated it by 25%.  Microsoft was always notorious for code bloat.  I remember that when IBM took over the source code for OS/2, they were horrified at how flabby it all was, and rewrote most of it in assembler.

However I couldn’t get the DLL to work in VB.Net, whatever I did.  So… I eliminated it and accepted the slower load, for now.

I’ve now reached the stage where the code runs, but it isn’t doing it right.  This is not unexpected.  The change from arrays based on 1 to arrays based on 0 was always likely to break something.  But it’s an opportunity to make use of a VB.Net feature, and create unit tests!  I have started to do this, not without pain.

Of course even VB.NET 2008 is now more than a decade old.  At that time the idea of loose coupling and dependency injection was only just coming in.  I gather that even today Visual Studio 2019 doesn’t really support this all that well.  To a professional Java developer like myself, the idea that the DI equivalent of Spring isn’t even on the mental horizon of Microsoft staff is extraordinary.  I’ll manage somehow; but why can’t I just annotate my classes in dotNet, as I do in Java?  Why?

In the course of today, it has become clear to me that Microsoft’s developers never used VB6 themselves, nor do they use VB.Net.  If they had, they would never have created this huge roadblock to upgrading VB6.  There is still a huge amount of VB6 out there in corporations.  But Microsoft’s staff couldn’t care less.  Indeed there always was a lot of it about, which is one reason that I found it expedient to learn some, all those years ago.  Job security consists of having the skills that people want, even if they don’t want to see them on your CV.

Had Microsoft’s developers ever used VB6 internally, they would have collared the VB.Net team and given them a straight talking-to.

Likewise anybody who does the upgrade immediately finds that he can’t do a lot without massive refactoring.  Again, this shows that nobody at Microsoft actually ever went through this process.  Because you don’t want to do a load of refactoring. You have no tests.  You might break stuff.  You can’t easily create tests either for code in Modules, rather than classes.

Microsoft was founded by Bill Gates, who owed his start to writing a Basic Interpreter for some of the early microprocessors.  So Basic was important to him.  It seems that in later years it wasn’t important to anyone else.  This is a shame.

Microsoft was very arrogant in the 90s and early 2000s.  Few were sorry when Google knocked them off their perch.

Oh well.  Onwards.  Thank heavens I have lots of time right now, tho!


From my diary

I’ve been continuing to work on QuickLatin.  The conversion from VB6 to VB.Net is horrible, but I am making real progress.

The key to it is to change the VB6 project, so that it will convert better.  So for instance I have various places at which I make a raw Win32 API call, because VB6 just doesn’t do something.  These must mostly go.  I replace them with slower equivalents using mainstream VB6 features.  In some cases I shall simply have to rewrite the functionality; but this is mainly front-end stuff.

All the same, the key point is to ensure that the VB6 project continues to work.  It is essential not to allow this to fail, or develop bugs.  This is one area where automated unit tests would be invaluable; but of course that concept did not arise until VB6 was long dead.  So I have to run the program manually and do a few simple tests.  This has worked, as far as I can tell.

The objective is to have a VB6 project that converts cleanly, and works out of the box.  It may be slower, it may have reduced functionality in peripheral areas.  But the business logic remains intact – all those hand-crafted thousands of lines of code still work.

It’s going fairly well.  I’ve been working through known problems – arrays that need to be base 0 rather than base 1.  Fixed strings inside user defined types have to go.  There is a list on the Microsoft site of the likely problems.

Today I had my first attempt at running the VB.Net 2008 Upgrade Wizard.  It failed, as I expected it to do.  The purpose was to identify areas in VB6 that needed work.  But the converted code only had 37 errors.  Only 3 of these were in the business logic, rather than the front-end, and all were easily fixed in VB6.  There were also a large number of warnings, nearly all of them about uninitialised structures.  Those can wait.

So my next stage is to do something about the 34 front-end errors.  Probably I shall simply have to comment out functionality.  Splitters are done differently in VB.NET.  The CommonDialog of VB6 no longer exists to handle file opening.  That’s OK… I can cope with rewriting those.

It has reminded me how much I like programming tho.

In the middle of this enormous task, of course, there is no lack of people who decide to email me about some concern of their own.  So … polite refusals to be distracted are now necessary.  I hate writing those.  But a big project like this can’t get done any other way.


From my diary

It’s been an interesting couple of days.

I was working on the Passio of St Valentine, and I really felt that I could do with some help.  So I started browsing grammars.

This caused me to realise that many of the “rules” embedded in them were things that you’d like to have pop up, sort of as an informational message, when you were looking at the sentence in a translation tool.

This in turn reminded me that my own morphologising tool, QuickLatin, was available and a natural candidate for such a thing.

This is written in Visual Basic 6.  I wrote most of it, actually, in Visual Basic for Applications, inside a MS Access database, during 1999.  (The language choice was dictated by the machine that I had available at the time, which had no development tools on it).  I then ported it to Visual Basic 6.  Microsoft then kindly abandoned VB6, without even a migration path, some time in the early 2000s.  This left me, and many others, stuck.  It is not a trivial task to rewrite 24,000 lines of code.

So where was my development environment?  I pulled out the last four laptops that I have used; I have them, because I keep all my old machines.  I found it on my Windows XP machine.  The machine started up OK!  In fact the batteries on the Dell laptops all started to charge, unlike a Sony Vaio which had Windows 7 on it.

The Windows XP machine had a tiny screen and was very old.  Could I perhaps install VB6 on Windows 10 instead?  The answer swiftly proved to be a resounding “no”.  But I gathered a large number of tips from the web while doing so.

Then I tried installing VB onto my travelling laptop, which has Windows 7 on it, using all the info that I had.  The installation failed; but the software seemed to be installed anyway!

Then I tried doing it again on Windows 10.  This time I had a sneaky extra bit of information – to set the SETUP.EXE to run in Windows XP compatibility mode.  And … again it failed; but as with Windows 7, I could in fact still run it!

The process was so fraught that I knew that I’d never remember all the fixes and tips.  So I compiled all the bits together, hastily, into a reference guide on How to Install Visual Basic 6 on Windows 10, for my own use in days to come.

After two days of constant pain, I was at last in a position to work on the code!

But I wasn’t done yet.  I really would rather not work with VB6 any more.  Not that I dislike it; but it is emphatically a dead toolset.  My attempts to convert my code to VB.Net all failed.

But since I last looked, more tools have become available.  My eye was drawn to a commercial product, which Microsoft themselves recommended, by a firm called Mobilize.net.  The tool was VBUC.  You could get a free version which would convert 10,000 lines.  Surely, I naively thought, that would be enough for me?

Anyway I downloaded VBUC, and ran it, and discovered to my horror that I had nearly 30,000 lines of code!  But I set up a tiny test project, with half-a-dozen files borrowed from my main source project, and converted that.  The process of extracting a few files drew my attention to what spaghetti the codebase has become.  It was not trivial to just take a few.  This in turn made me alter the extracted VB code a bit, so that I could use it.

Converting the extract required some manual fixing, but it did work in the end.

I was quite impressed with some of the conversions.  One of the StackOverflow pages had indicated that the firm were charging a couple of hundred dollars for the tool, back in 2010.  So I emailed to ask what they were charging now.

Mobilize.net then got a bit funny on me.  Instead of telling me, they asked me to tell them what I wanted it for.  I replied, briefly.  Then they wanted me to run an analyser tool on my code and send it in.  I did.  Then they wanted more details of what it did.  Quite a few emails to and fro.

By this stage I was getting fed up, and I pushed a bit.  They finally came back with a price, based on lines of code, of around $4,500!  That was ridiculous, and our exchange naturally went no further.

However I had not wasted my time, for the most part.  I could now see what the tool might do.  My code may be elderly, but the bits that need converting are basically the same throughout.  It is quite possible that I could write my own tool to do the limited subset of changes that I need.

One problem was not handled well: QuickLatin loads its dictionaries as binaries, created by another tool of my own.  I found that VB.Net would not handle these, whatever I did.  The dictionaries would need to be regenerated in some other format.

So I spent some time experimenting with an XML format.  I quickly found how slow the VB6 file i/o was.  Reading a 20 MB file using VB native methods took 4 seconds.  Using MSXML to load the file and parse it into a linked list took 1.7!  I didn’t want the linked-list method; but it was clear that the VB native methods were hideously inefficient.

I soon discovered complaints online that the VB.Net i/o did not support the methods used by VB6 and was even slower!  I’ve encountered problems of this sort before, which I got around by dropping into C++ and accessing the files through bare metal.  Clearly I would have to do so again.

Another problem that VBUC showed me was that VB6 fixed-length strings are not really supported by VB.Net.  There was some sort of migration path, but it was horrible.  However there was, in fact, no reason to go that way; the file i/o, for which they were used, will have to change anyway.

I placed my code base under source control, using Git.  Then I started cautiously making changes, checking that “amas” was giving sensible results – for unit tests were unknown in the days of VB6 – and committing regularly.  This proved wise; several times I had to go back to the last commit.

I spent quite a bit of time removing superfluous fixed strings from the code.  This was not trivial, but I made headway.

Something else I did, once I realised that coding lay ahead, was to rig up an external monitor, keyboard and mouse to my laptop.  I would have rigged up two, but there was no way to turn off the laptop screen – when you close the lid, the machine goes to sleep and that’s that.  On a commercial laptop, I’d set it to turn off the laptop screen and stay running.  Most graphics cards will support two monitors; the home laptops won’t support three.  Oh well.  But it was still better for serious work than using the laptop screen and keyboard alone.

Finally I started creating dictionary loading routines that would convert to VB.NET.  They are much slower; but I can optimise them when I get the code into VB.NET.  They have to change, come what may.  The key thing is to keep the program running and working at all times.  Take it slow, little by little.  If I take it apart into a million pieces, it will never get back together again.  Indeed I have made this mistake before.

Back in the 90s, automated unit tests, continuous integration, test-driven development and dependency injection were all unheard of.  I have really missed having a set of tests that I can run to check that the code has not broken in some subtle way.  This again is a reason to migrate to VB.Net, where such things are possible.  I did write test stubs in the original VBA, but there was no way to run them within VB6.  At least I have them still, and they can form the basis for unit tests.

So … it’s been a very busy few days indeed.  Nothing to show for it, to many eyes; but I feel optimistic.

The next challenges will be to change the other dictionaries over to the slow-but-safe method, and then remove all the stuff that supported the other approach.  This should simplify the code mightily.  Once this is done, then it will be time to attempt to convert the code.  Somehow.  All I need is time, and with luck I shall have some of that this week.

It is remarkable how far down the rabbit-hole one must go, just to get a bit of online help!


From my diary – The “upgrade” that destroys your website

WordPress has pretty much conquered the world, as far as blog engines are concerned.  Who uses anything else now?  Fortunately, to the best of my knowledge, WordPress has not adopted the evil practices of other ‘net monopolies and started to censor content for political reasons.  But the monopoly cannot be good for any of us.

I noticed a few days ago that my blog menu no longer works on my Android smartphone.  My theme – underskeleton – did once!  But somewhere along the WordPress update schedule the developers broke it.  Nor is this the first time: I had to move away from my original theme, “unnamed”, for the same reason.  “Underskeleton” has not been updated in a year, so plainly it is time to move.  But to what?

Most WordPress themes these days seem to be aimed at websites, not blogs.  The WordPress standard themes are no better.

I have just spent an hour experimenting with themes until my patience was exhausted.  What I want is simple enough – two columns, my pages not treated as navigation, the side panel accessible on mobile, a header image, and reasonable typography.  But I was unable to find anything I liked.

During the week someone mentioned to me how complicated it is becoming to create web content.  There are a million options, and even those of us who are IT professionals are drowning in the flow of information.  Yet at the same time simple things become impossible.

It’s very like how Microsoft have destroyed Visual Basic.  You just can’t get simple stuff done these days.

Likewise Contact Form 7 is broken.  I’ve used it for years.  But the last update played havoc, and sent me loads of spam.  Why???!  I fell back on my old Tertullian.org feedback form.  This too has had its vicissitudes – the endless upgrades to Perl on the server keep removing support for bits of code that I used when I wrote it.  But mostly I can fix it easily.  WordPress on the other hand is a monster.

I sat down here over an hour ago to write a post on Cotelerius.  Instead I’ve been messing with techno-rubbish.

Thank you, WordPress.
