Archaeology on our own PCs – unravelling old file formats

A good few years – seventeen! – have passed since I left off working for a certain major corporation, stashed a bunch of documents and sometime projects in a directory on my PC, and went off to seek my fortune.  But this week the past came back to me, in the shape of reunion drinks; and I found myself looking for a document that I hadn’t seen in 20 years.

When I found it, I found that it was in a file produced by WordPerfect 4.2.  For DOS!  It was last edited sometime in the late 80s.  Fortunately at the time I had the habit of using “.wp4” etc. as the file suffix, so I knew what the format was.  I found other files, suffixed as “.ws5” – WordStar 5!  There were some “.drw” files, which I knew were vector graphics files, and which proved to belong to Lotus Freelance.  There were bunches of zipped-up directories; but these were “.lzh” archives, produced using the lha.exe archiver, which is now dead.

I know a crux when I see one.  Whether I can retrieve all of this now I do not know; but certainly the problem won’t get better if I leave it.  I once thought these files worth keeping.  But there’s not a lot of point, if I can’t open them.

Dealing with the WordPerfect 4.2 files was relatively straightforward.  Corel bought WordPerfect long ago, and a correspondent showed me that the conv50.exe file at the Corel FTP site, under the WordPerfectDOS 5.0 directory (which you can’t open in IE, but can in Google Chrome), was a self-extracting zip file which contained the convert.exe file used to convert 4.2 to 5.0.  So I got hold of this, and converted my file to WordPerfect 5.0.  You can run convert.exe quite happily in a Windows 7 (64-bit) command window, and it will prompt for input – I put *.wp4 – and output, and it will do all the files in the directory in one go.  This step matters because few utilities indeed will work with WordPerfect for DOS versions earlier than 5.0, although in fact 4.2 was a far more popular and widespread version.

Now that I have a WordPerfect 5.0 file, there is a utility you can obtain, again from Corel, to convert wp5 files to an ancient version of Word.  This may be found in the WordPerfect for Windows 6.1 directory, and is named wp_convert_utility.exe.  This is an installer, actually, which installs a Windows utility in the c:\program files (x86)\corel directory on your PC.  Don’t get creative with installing it, by the way – it plainly is on its last legs.  Here’s a screen grab:

WordPerfect Convert Utility

You can’t actually browse to files anymore – that doesn’t work!  You must type the names in yourself, and choose the right output type.  You want Word 97, which is actually the next item.  This will give you a nice .doc file.  I was then able to double-click on the file and open it in Microsoft Word 2010; whereupon I promptly saved it in some new, shiny, file format.  In the same directory, naturally.

The WordStar files were simpler to deal with.  Long ago Microsoft produced an import filter for all versions of WordStar 3.0-7.0.  They don’t include it any more; but it is out there, on a Microsoft FTP site.  The site is incredibly slow, though.  The file, wdsupcnv.exe, is a self-extracting zip file, which creates a bunch of .cnv files and a readme.  You then copy these into C:\Program Files\Common Files\Microsoft Shared\Textconv.  Once you have done this, you open the .ws5 file (or whatever you called it; if you called it .doc, as was the default, then I don’t know if this confuses Word) by double-clicking and choosing Word 2010 as your application.  It opens, prompts you to confirm the file format, then asks you to say “Yes” to something, and … your file opens.  I then saved it as a modern Word .docx file – again next to the original.

I haven’t yet managed to open the .drw files.  But I gather that Lotus SmartSuite 9.8 Millennium should be able to open them, and save the results in Microsoft PowerPoint format; and copies are available cheaply on eBay, so I have ordered one.  Whether this will work on 64-bit Windows I do not know.

The worst problem that I had was with the collection of .lzh files.  The lha site is gone, and although 7Zip will open these files (although not in the command-line version), that doesn’t help you if you have a couple of hundred.  If you have an old copy of the lha.exe file, you will find that it doesn’t run on Windows 7 (64-bit), because lha.exe is a 16-bit application, and Microsoft thoughtfully ensured that any compatibility layer was only present on the rare 32-bit version of Windows 7.  However I was able to find a clone LHA for Windows, and this worked fine.  I copied the new lha.exe into my directory of files, and adapted a little batch script that I found online to scan for all the .lzh files in a directory, and unpack them to a new subdirectory of the same name:

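@echo off
setlocal enableDelayedExpansion

rem Directory to scan for .lzh archives
set MYDIR=.
rem "delims=" keeps file names that contain spaces in one piece
for /F "delims=" %%x in ('dir /B "%MYDIR%\*.lzh"') do (
  set FILENAME=%%x
  echo Processing !FILENAME! to !FILENAME!.DIR\
  rem Unpack each archive inside a new subdirectory named after it
  md "!FILENAME!.DIR"
  cd "!FILENAME!.DIR"
  D:\MYFILES\lha x "..\!FILENAME!"
  cd ..
)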

And it worked: FRED.LZH was unpacked to a new directory FRED.LZH.DIR, and so on.
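
For anyone who would rather not wrestle with batch syntax, the same loop can be sketched in Python.  This is only a sketch under assumptions: the function names are my own invention, it has not been run against real archives, and it still needs a working lha program, whose location is passed in as lha_exe.

```python
import subprocess
from pathlib import Path


def plan_unpack(directory):
    """Return (archive, target_dir) pairs for every .lzh file in directory."""
    root = Path(directory)
    archives = sorted(p for p in root.iterdir()
                      if p.is_file() and p.suffix.lower() == ".lzh")
    # Same naming as the batch script: FRED.LZH -> FRED.LZH.DIR
    return [(a, a.with_name(a.name + ".DIR")) for a in archives]


def unpack_all(directory, lha_exe="lha"):
    """Unpack each archive into its own subdirectory using an external lha.

    lha_exe is an assumption: point it at wherever your lha clone lives.
    """
    for archive, target in plan_unpack(directory):
        target.mkdir(exist_ok=True)
        print(f"Processing {archive.name} to {target.name}")
        # "x" is the extract command used in the batch script
        subprocess.run([lha_exe, "x", str(archive.resolve())],
                       cwd=target, check=True)
```

Calling plan_unpack(".") first, and looking at the pairs it returns, is a cheap way to see what would happen before anything is actually unpacked.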

It’s been an afternoon of archaeology.  I think that I have now converted all the files (except the .drw) that I have on disk.  I hope that these will go with me into the future.  Unless we are careful, even the past that we have saved carefully and archived will vanish.

From my diary: the evanescent internet

Today, at work, I cast around for a web-based form to point a computer program at, for testing purposes.  I recalled my own feedback form, at Tertullian.org, and decided to use that.  I was having one of those days, you know, when everything goes wrong.  But at least my own website wouldn’t let me down, right?

Wrong.  The form didn’t work.

Clearly it hadn’t worked, for quite some time.  Yet I couldn’t see why.  It was a very simple piece of software, and hadn’t changed in, well, probably a decade.

But of course it wasn’t running on the hardware-software platform of 2004 any more.  Somewhere, sometime, my website provider had upgraded.  It happens all the time.

Some software upgrade had broken it, silently.  The form is written in PHP, and clearly one or the other of the PHP upgrades had silently removed features on which it depends.  It emails me in a distinctive format, and, now I come to think of it, I haven’t seen one in quite some time.  A year?  Two?  How time flies…

I spent a less than pleasant hour this evening, rewriting the way it captures variables.  The new version is considerably more baroque than the old.  It’s longer.  It might be more secure, I don’t know.  But it’s not the same form any more.

Of course this makes me wonder what other PHP scripts are lying around on my website, long forgotten.  I can’t even face looking.

This is how the internet dies.  We all know that it is less than permanent.  What we forget is that software less than a decade old, designed to run and be accessible by the world, is probably only sporadically working.

All those eager-beavers, upgrading and improving constantly, are … leaving a trail of wrecked websites behind them.

I wonder how many of us are actually hosting deadware – scripts that once worked and no longer do?

How to download a book at the German Arachne – DAI site

I had trouble with this, so I am going to document it here!  With pictures.  Because it’s about as user-friendly as a cornered rat; but obvious once you know.

Say you want to download a volume of the Corpus Inscriptionum Latinarum?  These are here.  So go to that link.  You get a page like this (you can switch the language to English somewhere on the site, at top right – may as well).

Click on a volume.

When you get it, there will be a floppy disk icon at top right.  Click on it.

When you do, you will get a pop-up:

IGNORE the “Download” button!!!  All that will give you is some crappy catalogue info.

Instead click on the “Download book as PDF file”.  And … your download will begin.

Be warned: the size of these books is in gigabytes.  Which won’t matter once the internet speeds up a bit, but may make your eyes pop in the meantime!

The decay of digital media

This evening I was looking through some PDFs of a Mithras reference volume, which a correspondent very kindly scanned for me some time back.  I keep a copy on my travelling laptop, so that when I am working away from home, I can work on the site in the evenings in the hotel.  I was, in fact, looking for information on the Nesce Mithraeum, in Latium; and, rather to my surprise, that page was missing.

So I decided to go through the PDF (which I received in parts of a few pages) and check whether any other pages were missing.  A few were, but I can obtain photocopies from a library and patch the PDFs.

But I came to the end of the directory, and double-clicked on a file and … it wouldn’t open.  Adobe informed me that it was corrupt.

This was a surprise.  I knew the file must have been OK once.  All the files in that directory were emailed to me, and I certainly opened them all at least once, and often many more times.  How could it be corrupt?

Now I carry around with me a back-up of my hard disk, on an external hard disk.  It’s kept up to date every weekend.  So I went to that and tried to open the same file.  And … it wouldn’t open.

Somehow the file that I had downloaded to my PC at home had become corrupt, at some point in the past.

In this case there was a happy ending.  I never got around to deleting the email(s) that sent me this book, and so I could just download the piece again.  And, sure enough, that was fine.

But that PDF file has never been anywhere except on my hard disk.  How could it have become corrupt, without any other intervention?

More seriously … I have gigabytes of PDFs of books.  How many of these, I wonder, have silently rotted?

Nor am I the only one.

Today I accessed a website discussing an obscure technical subject.  The article was less than a year old, but the links to samples and bitmaps no longer worked.

It’s not so long ago that I found that the zip files on the Electronic Journal of Mithraic Studies website – which seems pretty much abandoned – no longer unzip.  Somehow, at some point, in their state of neglect, they have rotted.  But how?

We need a way to check the integrity of our collections of electronic books.  There is no manner of use in having them, if they are not there when we need them.

I don’t know how it might be done; but done it needs to be.

Gentlemen … check your files!
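
One common approach, offered here as a minimal sketch rather than a finished tool, is to record a checksum for every file while it is known to be good, and to compare against that record later.  The function names and the JSON manifest layout below are my own assumptions; SHA-256 via Python's hashlib is just one reasonable choice of checksum.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks, so multi-gigabyte PDFs need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def verify(root, manifest):
    """Return the files that are missing or whose contents have changed."""
    root = Path(root)
    problems = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif sha256_of(path) != expected:
            problems.append((name, "changed"))
    return problems


def save_manifest(manifest, manifest_path):
    """Write the manifest as JSON; keep it outside the collection itself."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2),
                                   encoding="utf-8")


def load_manifest(manifest_path):
    """Read back a manifest written by save_manifest."""
    return json.loads(Path(manifest_path).read_text(encoding="utf-8"))
```

A check like this only detects the rot; it cannot repair it.  Its value is in telling you, before the next weekend back-up overwrites the last good copy, that a file has changed when nobody touched it.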

LACE Greek OCR project

On a better note, we live in blessed times where technology and the ancient world are concerned.  The astonishing results of a project to OCR volumes of ancient Greek from Archive.org may now be found online here.  Clicking on the first entry, and then on one of the outputs in it (here), gives astonishingly good results.

Admin: possible changes to the appearance of the site

I may need to change the WordPress theme that I use for this site.  For some reason quoting material – which I do a lot – does not work very well since I upgraded.  My apologies if there is any oddness while I experiment!

UPDATE: OK, I have reverted.  The same problems appeared in the default WordPress theme.  It seems that WordPress 3.6 is broken.

When you press “quote”, quite often it just inserts a new paragraph.  It often does not unquote a quoted passage.  And so on.  Blockquoting is a fundamental issue, and WordPress have broken it.

Google sabotaging Internet Explorer

A new version of Google Mail yesterday; and today I find that it won’t work properly with Internet Explorer 10.  I was forced to use Chrome – which I dislike – in order to reply to an email.  It looks as if it doesn’t work that well with Firefox either.

This is not the first time that Google has broken its products, if used with IE.  If you use Book Search, hitting backspace works in Chrome but not in IE.  It’s a small thing, and I endure it; but it can hardly be accidental, when Google offers its own rival product.

This is the kind of anti-competitive behaviour that requires regulatory action.  Unscrupulous corporations will happily inconvenience their customers for even the possibility of locking them in.

Once Google had a motto, “Don’t be evil”.  How long ago that seems.

Disabling IE10 auto-complete spam

I upgraded to IE10 recently, but have been driven crazy by one ‘feature’.  When I type in the address box a few letters of one of my regular sites, it shows me a whole list of URLs which I have never visited and in which I have no interest.  This infuriating trick must be commercially driven (“pay to join our spam list!”) and will drive a lot of people to Chrome.

Anyway it did it once too often today.  I’ve found a link that tells you how to turn it off.  Basically it’s Tools | Internet Options | Content | Auto-complete, and turn off “suggestions”.

I thought I’d add this, as it is such a nuisance.

It’s things like this that remind us how little power we have.  Still, ’twas ever thus.  The desire for money is the root of all sorts of trouble.  As has been said before.

Attempts to hack the new Mithras pages

When I wrote the PHP scripts that support my Roman cult of Mithras site, I incorporated some code to tell me if anyone was looking at the pages.  Specifically it tells me which pages are popular; information that is useful to me when deciding what to work on.

Each page is accessed using an address like this:

http://www.tertullian.org/rpearse/mithras/display.php?page=XXXX

where XXXX is the name of one of the pages.  So I display the page names and counts like this:

As you may imagine, I was somewhat surprised to find entries appearing that were most certainly not pages on my site.  No link anywhere will produce these.

Here is one example:

Any database programmer will recognise that these are fragments of the database language, SQL.  What’s going on here?

This is — can only be — an attempt to hack my website.  The hacker has theorised that the pages, as in Wikipedia, are actually stored in a database.  He is trying to guess how my site works.

What if, he thinks, the “display.php” script, in the address above, takes the page name, creates an SQL query, and retrieves the page data from this hypothetical database?  Then perhaps the SQL is this:

select * from database_table where pagename = 'PAGE'

where PAGE is the text in “display.php?page=PAGE“?  If so, he thinks, let’s stick a quote in the address box, and add extra code!  Let’s see, he thinks, if we can get somewhere with this!  It failed, however.
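
To make the idea concrete, here is a small sketch of how such an attack works against a database-backed page script, and why a parameterised query defeats it.  Everything in it is hypothetical: the pages table, its contents and the function names are invented for illustration, and it uses Python's built-in sqlite3 module rather than anything my site actually runs.

```python
import sqlite3

# A throwaway in-memory database standing in for the hypothetical page store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (pagename TEXT PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [("intro", "Public text"), ("draft", "Private text")])


def fetch_unsafe(page):
    # Vulnerable: the visitor's text becomes part of the SQL itself
    sql = "SELECT body FROM pages WHERE pagename = '" + page + "'"
    return [row[0] for row in conn.execute(sql)]


def fetch_safe(page):
    # Parameterised: the visitor's text is only ever treated as a value
    sql = "SELECT body FROM pages WHERE pagename = ?"
    return [row[0] for row in conn.execute(sql, (page,))]


# The quote closes the string early, and the rest is run as SQL
attack = "x' OR '1'='1"
print(fetch_unsafe(attack))  # every row in the table comes back
print(fetch_safe(attack))    # no page has that name, so nothing matches
```

The difference is that in the second version the quote character never gets the chance to end the string: the database is handed the whole of the visitor's text as a single value to compare against.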

A few days ago he must have realised that he wasn’t getting anywhere with the SQL injection attack (as it is called).  Here’s what he did next:

The hacker has tried again.  He’s now guessing that perhaps the website uses files on the disk, rather than a database.  He thinks that it is perhaps running on the Linux operating system, as most commercial websites do.  And he is guessing that my code perhaps does something like this:

File Open("PAGE");
File Read;
Display file to screen;

So he thought that perhaps he could get the display.php to display the password file from the Linux machine.  Indeed he tried various permutations of the same idea:

The %2F is the URL encoding (percent-encoding) of a slash character; so he is still trying to get at the passwd file.  None of it worked, thankfully.
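
One way to make this whole family of tricks harmless, sketched here under assumptions, is never to build a file path from the visitor's text at all, but to accept only names that appear on a fixed list.  The page names and the resolve_page function below are hypothetical, and this is not necessarily how display.php does it.

```python
from urllib.parse import unquote

# Hypothetical page names; a real site would list its own
KNOWN_PAGES = {"main", "history", "mithraea"}


def resolve_page(raw_value):
    """Return a safe file name for a requested page, or None to refuse."""
    name = unquote(raw_value)  # %2F becomes "/" once decoded
    # Only exact, known names are accepted, so "../" has nowhere to go
    if name in KNOWN_PAGES:
        return name + ".txt"
    return None
```

With this in place, resolve_page("history") gives a file name to open, while anything containing slashes, encoded or not, is simply refused because it is not on the list.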

Now there is one obvious conclusion here.  This is not an automatic attack, run by machine.  This sort of tinkering requires human input.  No doubt there are hacking engines, built and sold to attack common software packages used to write websites.  But my site doesn’t use these; it’s all hand-made code.

So, somewhere out there, there is a human being, who is trying to gain control of my website.

Who is this person?  Well, I do know a little about him.  Back in 2006, when I last created a website using PHP scripting, such people didn’t exist.  So when I started the site, in December 2012, I didn’t bother with security.  The first version of the new site was promptly hacked.  And what did he do, once he could edit the content?  Well, he deleted it.  The page content was replaced with spam and links to spam sites.  It’s undoubtedly the same person, since he has kept up various attacks ever since.

The only person who could find advantage in that is someone who works for a spammer.  He’s out there, with some knowledge of programming, trying — for money, I presume — to break my site in order to delete it and replace it with rubbish, because someone else pays him to do it.

Nor is he giving up.  The attempts to hack me, using the attack that worked initially, have gone on unceasingly for months.  Indeed he tried the same hack again, two days ago at 22:42 hours.  It’s usually in the middle of the night that the attacks come.  Is he an Australian, perhaps?  Or some low-paid oriental?

It is sobering to see such determination to do harm.  He has put in months and months of effort – far more effort than I have spent to create the site in the first place.  And he keeps right on going.

Possibly all of our websites are under such daily attack.  The quantities of spam “comments” to this blog run into thousands every day; which, thankfully, WordPress deal with.  Most of the time we just don’t even know it is happening.

How many website authors check their logs regularly?  How many of us would recognise an attack if we saw one?  It is pure coincidence that I chose a format for this site, and a reporting method for it, that highlight the attacks very clearly.
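
The reporting idea is simple enough to sketch.  Assuming you can get at a list of the requested addresses (from a log, say), count the value of the page parameter in each, and then list every value that is not one of your real pages.  The known page names and function names here are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical list of the pages that really exist on the site
KNOWN_PAGES = {"main", "history"}


def count_pages(urls):
    """Count how often each value of the page parameter was requested."""
    counts = Counter()
    for url in urls:
        # parse_qs also decodes percent-encoding such as %2F
        values = parse_qs(urlsplit(url).query).get("page", [])
        counts.update(values)
    return counts


def suspicious(counts):
    """Return the requested page names that are not real pages."""
    return {page: n for page, n in counts.items() if page not in KNOWN_PAGES}
```

Real pages then show up in the counts as a popularity table, and everything returned by suspicious() is either a typo or somebody probing.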

I hope, therefore, that this post may assist my fellow web-authors.  It goes to show that these attacks are real.

Yes, it is sobering, and also rather sad.  For this was not how things were in 2006.  I ran the translation project for Jerome’s Chronicle without any security at all.  And I had no trouble.

But now the criminal classes are on the web.  The criminal is he who will wreck anything for any shred of personal convenience, regardless of the harm to others.

Sadly we may have to accept a police force for the web also, in response.