From my diary: the evanescent internet

Today, at work, I cast around for a web-based form to point a computer program at, for testing purposes.  I recalled my own feedback form, at Tertullian.org, and decided to use that.  I was having one of those days, you know, when everything goes wrong.  But at least my own website wouldn’t let me down, right?

Wrong.  The form didn’t work.

Clearly it hadn’t worked, for quite some time.  Yet I couldn’t see why.  It was a very simple piece of software, and hadn’t changed in, well, probably a decade.

But of course it wasn’t running on the hardware-software platform of 2004 any more.  Somewhere, sometime, my website provider had upgraded.  It happens all the time.

Some software upgrade had broken it, silently.  The form is written in PHP, and clearly one or another of the PHP upgrades had removed features on which it depends.  It emails me in a distinctive format, and, now I come to think of it, I haven’t seen one in quite some time.  A year?  Two?  How time flies…

I spent a less than pleasant hour this evening, rewriting the way it captures variables.  The new version is considerably more baroque than the old.  It’s longer.  It might be more secure, I don’t know.  But it’s not the same form any more.

Of course this makes me wonder what other PHP scripts are lying around on my website, long forgotten.  I can’t even face looking.

This is how the internet dies.  We all know that it is less than permanent.  What we forget is that software less than a decade old, designed to run and be accessible to the world, is probably only sporadically working.

All those eager-beavers, upgrading and improving constantly, are … leaving a trail of wrecked websites behind them.

I wonder how many of us are actually hosting deadware – scripts that once worked and no longer do?

How to download a book at the German Arachne – DAI site

I had trouble with this, so I am going to document it here!  With pictures.  Because it’s about as user-friendly as a cornered rat; but obvious once you know.

Say you want to download a volume of the Corpus Inscriptionum Latinarum?  These are here.  So go to that link.  You get a page like this (you can switch the language to English somewhere on the site, at top right – may as well).

Click on a volume.

When it opens, there will be a floppy disk icon at top right.  Click on it.

When you do, you will get a pop-up:

IGNORE the “Download” button!!!  All that will give you is some crappy catalogue info.

Instead click on the “Download book as PDF file”.  And … your download will begin.

Be warned: the size of these books is in gigabytes.  Which won’t matter once the internet speeds up, but may make your eyes pop a bit in the meantime!

The decay of digital media

This evening I was looking through some PDFs of a Mithras reference volume, which a correspondent very kindly scanned for me some time back.  I keep a copy on my travelling laptop, so that when I am working away from home, I can work on the site in the evenings in the hotel.  I was, in fact, looking for information on the Nesce Mithraeum, in Latium; and, rather to my surprise, that page was missing.

So I decided to go through the PDF (which I received in parts of a few pages each) and check whether any other pages were missing.  A few were, but I can obtain photocopies from a library and patch the PDFs.

But I came to the end of the directory, and double-clicked on a file and … it wouldn’t open.  Adobe informed me that it was corrupt.

This was a surprise.  I knew the file must have been OK once.  All the files in that directory were emailed to me, and I certainly opened them all at least once, and often many more times.  How could it be corrupt?

Now I carry around with me a back-up of my hard disk, on an external hard disk.  It’s kept up to date every weekend.  So I went to that and tried to open the same file.  And … it wouldn’t open.

Somehow the file that I had downloaded to my PC at home had become corrupt, at some point in the past.

In this case there was a happy ending.  I never got around to deleting the email(s) that sent me this book, and so I could just download the piece again.  And, sure enough, that was fine.

But that PDF file has never been anywhere except on my hard disk.  How could it have become corrupt, without any other intervention?

More seriously … I have gigabytes of PDFs of books.  How many of these, I wonder, have silently rotted?

Nor am I the only one.

Today I accessed a website discussing an obscure technical subject.  The article was less than a year old, but the links to samples and bitmaps no longer worked.

It’s not so long ago that I found that the zip files on the Electronic Journal of Mithraic Studies website – which seems pretty much abandoned – no longer unzip.  Somehow, at some point, in their state of neglect, they have rotted.  But how?

We need a way to check the integrity of our collections of electronic books.  There is no manner of use in having them, if they are not there when we need them.

I don’t know how it might be done; but done it needs to be.
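
One way it might be done, offered as a minimal sketch rather than a finished tool: record a checksum for every file while the collection is known to be good, and compare against that record later.  The Python below uses SHA-256 from the standard library.  The function names (`sha256_of`, `build_manifest`, `find_changed`) are hypothetical, chosen only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict:
    """Map each file's path (relative to folder) to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def find_changed(folder: Path, manifest: dict) -> list:
    """List files that are missing or whose contents no longer match."""
    changed = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(name)
    return changed
```

The manifest would need to be saved somewhere outside the folder it describes, and rebuilt whenever files are deliberately added or edited.  A mismatch only says that a file has changed, not why; but it does say so before the back-up of the good copy has been overwritten.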

Gentlemen … check your files!

LACE Greek OCR project

On a better note, we live in blessed times where technology and the ancient world are concerned.  The astonishing results of a project to OCR volumes of ancient Greek from Archive.org may now be found online here.  Clicking on the first entry, and one of the outputs in it here gives astonishingly good results.

Admin: possible changes to the appearance of the site

I may need to change the WordPress theme that I use for this site.  For some reason quoting material – which I do a lot – does not work very well since I upgraded.  My apologies if there is any oddness while I experiment!

UPDATE: OK, I have reverted.  The same problems appeared in the default WordPress theme.  It seems that WordPress 3.6 is broken.

When you press “quote”, quite often it just inserts a new paragraph.  It often does not unquote a quoted passage.  And so on.  Blockquoting is a fundamental issue, and WordPress have broken it.

Google sabotaging Internet Explorer

A new version of Google Mail yesterday; and today I find that it won’t work properly with Internet Explorer 10.  I was forced to use Chrome – which I dislike – in order to reply to an email.   (link; link) It looks as if it doesn’t work that well with Firefox either.

This is not the first time that Google has broken its products, if used with IE.  If you use Book Search, hitting backspace works in Chrome but not in IE.  It’s a small thing, and I endure it; but it can hardly be accidental, when Google offers its own rival product.

This is the kind of anti-competitive behaviour that requires regulatory action.  Unscrupulous corporations will happily inconvenience their customers for even the possibility of locking them in.

Once Google had a motto, “Don’t be evil”.  How long ago that seems.

Disabling IE10 auto-complete spam

I upgraded to IE10 recently, but have been driven crazy by one ‘feature’.  When I type in the address box a few letters of one of my regular sites, it shows me a whole list of URLs which I have never visited and in which I have no interest.  This infuriating trick must be commercially driven (“pay to join our spam list!”) and will drive a lot of people to Chrome.

Anyway it did it once too often today.  I’ve found a link that tells you how to turn it off.  Basically it’s Tools | Internet Options | Content | Auto-complete, and turn off “suggestions”.

I thought I’d add this, as it is such a nuisance.

It’s things like this that remind us how little power we have.  Still, ’twas ever thus.  The desire for money is the root of all sorts of trouble.  As has been said before.

Attempts to hack the new Mithras pages

When I wrote the PHP scripts that support my Roman cult of Mithras site, I incorporated some code to tell me if anyone was looking at the pages.  Specifically it tells me which pages are popular; information that is useful to me when deciding what to work on.

Each page is accessed using an address like this:

http://www.tertullian.org/rpearse/mithras/display.php?page=XXXX

where XXXX is the name of one of the pages.  So I display the page names and counts like this:

As you may imagine, I was somewhat surprised to find entries appearing that were most certainly not pages on my site.  No link anywhere will produce these.

Here is one example:

Any database programmer will recognise that these are fragments of the database language, SQL.  What’s going on here?

This is — can only be — an attempt to hack my website.  The hacker has theorised that the pages, as in Wikipedia, are actually stored in a database.  He is trying to guess how my site works.

What if, he thinks, the “display.php” script, in the address above, takes the page name, creates an SQL query, and retrieves the page data from this hypothetical database?  Then perhaps the SQL is this:

select * from database_table where pagename = 'PAGE'

where PAGE is the text in “display.php?page=PAGE”?  If so, he thinks, let’s stick a quote in the address box, and add extra code!  Let’s see, he thinks, if we can get somewhere with this!  It failed, however.
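
To make the hacker's guess concrete, here is a minimal sketch in Python, using the sqlite3 module from the standard library.  It is not the code of the site, which uses no database at all; the `pages` table, its contents, and the function names are all hypothetical.  It shows a query built by pasting the page name into the SQL text, next to one that passes the page name as a bound parameter.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a hypothetical pages table."""
    db = sqlite3.connect(":memory:")
    db.execute("create table pages (pagename text, body text)")
    db.executemany(
        "insert into pages values (?, ?)",
        [("main", "Welcome"), ("drafts", "Not for publication")],
    )
    return db


def fetch_unsafe(db, page):
    """Paste the page name straight into the SQL text, as the attacker hopes."""
    sql = "select body from pages where pagename = '" + page + "'"
    return [row[0] for row in db.execute(sql)]


def fetch_safe(db, page):
    """Pass the page name as a bound parameter, so it is only ever data."""
    sql = "select body from pages where pagename = ?"
    return [row[0] for row in db.execute(sql, (page,))]
```

Given a page name such as `x' or '1'='1`, the first function ends up running a query whose condition is always true, and so hands back every row in the table.  The second simply looks for a page with that odd name and finds nothing.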

A few days ago he must have realised that he wasn’t getting anywhere with the SQL injection attack (as it is called).  Here’s what he did next:

The hacker has tried again.  He’s now guessing that perhaps the website uses files on the disk, rather than a database.  He thinks that it is perhaps running on the Linux operating system, as most commercial websites do.  And he is guessing that my code perhaps does something like this:

File Open("PAGE");
File Read;
Display file to screen;

So he thought that perhaps he could get the display.php to display the password file from the Linux machine.  Indeed he tried various permutations of the same idea:

The %2F is the URL (percent) encoding for a slash character; so he is still trying to get at the passwd file.  None of it worked, thankfully.
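
A common defence against this kind of thing is to accept only page names that match a strict allow-list, and to refuse everything else before any file is opened.  The sketch below, in Python, is an illustration of that idea and not the code of the site; the rule for what counts as a valid page name, and the names `SAFE_PAGE_NAME` and `is_safe_page_name`, are assumptions.

```python
import re
from urllib.parse import unquote

# Hypothetical rule: page names are 1 to 64 letters, digits, "_" or "-".
SAFE_PAGE_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_safe_page_name(raw: str) -> bool:
    """Decode any percent-escapes, then accept only allow-listed names."""
    return SAFE_PAGE_NAME.fullmatch(unquote(raw)) is not None
```

Because dots, slashes and percent signs are not on the list, a request for `..%2F..%2Fetc%2Fpasswd` is rejected whether or not it has already been decoded.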

Now there is one obvious conclusion here.  This is not an automatic attack, run by machine.  This sort of tinkering requires human input.  No doubt there are hacking engines, built and sold to attack common software packages used to write websites.  But my site doesn’t use these; it’s all hand-made code.

So, somewhere out there, there is a human being, who is trying to gain control of my website.

Who is this person?  Well, I do know a little about him.  Back in 2006, when I last created a website using PHP scripting, such people didn’t exist.  So when I started the site, in December 2012, I didn’t bother with security.  The first version of the new site was promptly hacked.  And what did he do, once he could edit the content?  Well, he deleted it.  The page content was replaced with spam and links to spam sites.  It’s undoubtedly the same person, since he has kept up various attacks ever since.

The only person who could find advantage in that is someone who works for a spammer.  He’s out there, with some knowledge of programming, trying — for money, I presume — to break my site in order to delete it and replace it with rubbish, because someone else pays him to do it.

Nor is he giving up.  The attempts to hack me, using the attack that worked initially, have gone on unceasingly for months.  Indeed he tried the same hack again, two days ago at 22:42 hours.  It’s usually in the middle of the night that the attacks come.  Is he an Australian, perhaps?  Or some low-paid oriental?

It is sobering to see such determination to do harm.  He has put in months and months of effort – far more effort than I have spent to create the site in the first place.  And he keeps right on going.

Possibly all of our websites are under such daily attack.  The quantities of spam “comments” to this blog run into thousands every day; which, thankfully, WordPress deal with.  Most of the time we just don’t even know it is happening.

How many website authors check their logs regularly?  How many of us would recognise an attack if we saw one?  It is pure coincidence that I chose a format for this site, and a reporting method for it, that highlight the attacks very clearly.
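
For those who do keep a list of requested page names, a crude filter can do the first pass of recognising.  This is a minimal sketch in Python, assuming the counts are available as a dictionary of page name to hit count.  The list of markers is illustrative only and far from complete, and the function names are hypothetical.

```python
from urllib.parse import unquote

# Illustrative markers only: a quote character, a parent-directory step,
# the Linux password file, and a common SQL injection phrase.
SUSPICIOUS_MARKERS = ("'", "../", "etc/passwd", "union select")


def looks_like_attack(page_name: str) -> bool:
    """True if the decoded, lower-cased page name contains any marker."""
    decoded = unquote(page_name).lower()
    return any(marker in decoded for marker in SUSPICIOUS_MARKERS)


def suspicious_entries(page_counts: dict) -> list:
    """Return the requested page names that look like attacks, sorted."""
    return sorted(name for name in page_counts if looks_like_attack(name))
```

A filter like this will miss anything it has not been told about, so it is an aid to reading the log and no substitute for it.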

I hope, therefore, that this post may assist my fellow web-authors.  It goes to show that these attacks are real.

Yes, it is sobering, and also rather sad.  For this was not how things were in 2006.  I ran the translation project for Jerome’s Chronicle without any security at all.  And I had no trouble.

But now the criminal classes are on the web.  The criminal is he who will wreck anything for any shred of personal convenience, regardless of the harm to others.

Sadly we may have to accept a police force for the web also, in response.

Free ancient Greek OCR – getting started with Tesseract

A correspondent draws my attention to Tesseract, a Google-hosted project to do Optical Character Recognition.  The Tesseract website is here.  Tesseract is a command-line tool, but there are front-ends available.
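
For anyone who would rather skip the front-end, the command line can be driven from a script.  What follows is a minimal sketch in Python, assuming that the `tesseract` program is on the PATH and that the ancient Greek language data (language code `grc`) is installed.  The function names are hypothetical.

```python
import subprocess


def tesseract_command(image_path: str, output_base: str, language: str = "grc") -> list:
    """Build the argument list: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path: str, output_base: str, language: str = "grc") -> str:
    """Run Tesseract, then return the text it wrote to <output base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, language), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

So `run_ocr("page.tif", "page")` would recognise a file called page.tif as ancient Greek and return the contents of page.txt; passing `language="eng"` does the same for English.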

I am a long-term fan of Abbyy Finereader, and, let’s face it, have probably OCR’d more text than most.  So I thought that I would give Tesseract 3.02.02 a go.

First, getting the bits.  I work on Windows 7, so I downloaded:

I double-clicked on the Tesseract installer.  This went smoothly.  It gave me the option to download and install extra languages (English is the default); among others I chose ancient Greek, and German, and German (Fraktur).  The latter is the “gothic” style of lettering fashionable in Germany until 1945.  Curiously the list of languages is not in alphabetical order; French follows German.

Next I clicked on the GImageReader installer.  This ran quickly, and warned that you need a copy of Tesseract installed. It did not create a desktop icon; you have to locate the program in the Start menu.  This would throw some users, I suspect.

I then started GImageReader.  It started with an error; that it was missing the “spellcheck dictionary for Dansk(Frak)”.  Why it looks for this I cannot imagine.  Not a good start, I fear.  I suspect that it expects Tesseract to be installed with all possible languages.

Next I browsed to a tif file containing part of the English translation of Cyril of Alexandria on John.  The file explorer is clunky and non-Windows standard.  The page displayed OK, although if you swap back to another window and then back again it seems to re-render the image.

At the top of the page is the recognition language – set by default to the mysterious Dansk (Frak).  I changed this to English.  I then hit “Recognize all”.  The recognition was quick.

So far, so good, then.  While unpolished, the interface is usable without a lot of stress.

The result of the OCR was not bad.  A window pops open on the right, with ASCII text in it.  It didn’t cope very well with layout issues, nor with small text.  But the basic recognition quality seemed good.

My next choice was a PDF with the text of Severian of Gabala, De pace, in Greek and Latin.  This opened fine! (rather to my surprise).  I held the cursor over the page, and it turned into a cross.  Holding down the left mouse button drew a rectangle around the text I wanted to recognise.  A quick language change to Ellenika and I hit “Recognise selection”.

The result was not bad at all.  Polytonic accents were recognised (although it did not like the two g’s in a)/ggeloi).

There were some UI issues here.  I could zoom the window being read – great!  But annoyingly I could not zoom the text window, nor copy and paste from it to Notepad.  But I could and did save it to a Unicode text file.  The result was this:

1. Οἱ ἄηε).οι τὸν οὐρἀνιον χο-
ρὸ·· συστησἀμενοι εὺηγγελίζοντο
τοῖς ποιμἑσι λἑγοντες· «εὐαγγε-
λιζόμεθα ὑμῖν σήμερον χαρὰ· με-
γάλην, ήτις ἔσται παντὶ τῷ λαῷ».
Παρ’ αὐτῶν τοίνυν τῶν ὰγίων ἐκεί-
νων ὰηέλων καὶ ῆμεῖς δανεισἀ-
μενοι φωνὴν οὐαηελιζόμεθα ὑμῖν
σήμερον, ὅτι σήμερον τὰ τῆς
ὲκπλησίας ἐν γαλή~›η καὶ τὰ τῶν
αἰρετικῶν ἐν ζάλη. Σἡμερον τὸ
οπιάφος τῆς ἑκκλησίας ἐν γαλήνη

Conclusions? I’ve used worse in the past.  I think it looks pretty good.  I suspect that, to use it, one would need to train it a bit more, but you can’t complain about the price!

Well done, those who created the training dictionary.