Time for something less strenuous

A lot of what I do demands a fair bit of concentration.  When I get home at the weekend, I don’t always find myself able to concentrate that much.  This is one reason why my additions to the Early Fathers collection developed; scanning and proofing texts does not require a lot of concentration, and can be quite soothing.

Like most of us, I have books and articles in photocopy form sitting around.  Since I acquired the Fujitsu Scansnap S300, these have looked increasingly inconvenient.

And I hate “inconvenient.”

Well, it’s not that hard to stick one of these books-in-a-pile-of-photocopies through the Scansnap S300.  I’ve just scanned one, which I will need sometime but not now.  It created a PDF.  I then opened the PDF in Abbyy Finereader 9, and ran the text recognition on it.  Then I saved it again, as a searchable PDF.  The latter isn’t as good quality as the first PDF, for some mysterious reason, so I’ll keep both. 

So… I now have a pile of paper to throw away.  If I ever need the book in that form, I can just print off the PDF.

Books that I use all the time are a different matter.  Books that I read through with a glass of something by my side are a different matter.  But books I never look at, and which I retain a copy of because of some idea I may one day work on?  I think not.

I doubt I am alone in this.  All over the world, students must be doing the same with textbooks.



Manuscript digitization in the Wall Street Journal

From the WSJ, some excerpts of a fascinating article by Alexandra Alter.  Note the reference to the manuscript of Michael the Syrian coming online!

One of the most ambitious digital preservation projects is being led, fittingly, by a Benedictine monk. Father Columba Stewart, executive director of the Hill Museum and Manuscript Library at St. John’s Abbey and University in Minnesota, cites his monastic order’s long tradition of copying texts to ensure their survival as inspiration.

His mission: digitizing some 30,000 endangered manuscripts within the Eastern Christian traditions, a canon that includes liturgical texts, Biblical commentaries and historical accounts in half a dozen languages, including Arabic, Coptic and Syriac, the written form of Aramaic. Rev. Stewart has expanded the library’s work to 23 sites, including collections in Syria, Lebanon and Turkey, up from two in 2003. He has overseen the digital preservation of some 16,500 manuscripts, some of which date to the 10th and 11th centuries. Some works photographed by the monastery have since turned up on the black market or eBay, he says.

Among the treasures that Rev. Stewart has digitally captured: a unique Syriac manuscript of a 12th-century account of the Crusades, written by Syrian Christian patriarch Michael the Great. The text, a composite of historical accounts and fables, was last studied in the 1890s by a French scholar who made an incomplete handwritten copy. Western scholars have never studied the complete original, which was locked in a church vault in Aleppo, Syria. Rev. Stewart and his crew persuaded church leaders to let them photograph it last summer. A reproduction will be published this summer, and a digital version will be available through the library’s Web site.

In February, Rev. Stewart traveled to Assyrian and Chaldean Christian communities in Kurdish villages in northern Iraq, where he hopes to soon begin work on collections in ancient monastic libraries. “You have these ancient Christian communities, there since the beginning of Christianity, which are evaporating,” he says. He’s now seeking access to manuscript collections in Iran and Georgia.

With his black monk’s habit, trimmed gray beard and deferential manner, Rev. Stewart has been able to make inroads into closed communities that are often suspicious of Western scholars and fiercely protective of their texts. Armed with 23-megapixel cameras and scanning cradles, he sets up imaging labs on site in monasteries and churches, and trains local people to scan the manuscripts.

For now, curators and conservationists say capturing endangered manuscripts should be a top priority. 

“This could be our only chance,” says Daniel Wallace, executive director of the Center for the Study of New Testament Manuscripts, the Texas-based center that is attempting to digitally photograph 2.6 million pages of Greek New Testament manuscripts scattered in monasteries and libraries around the world. The group has discovered 75 New Testament manuscripts, many with unique commentaries, that were unknown to scholars. Mr. Wallace says one of the rare, 10th century manuscripts they photographed was in a private collection and was later sold, page by page, for $1,000 a piece. Others are simply disintegrating, eaten away by rats and worms, or rotting.


More on the Fujitsu Scansnap S300 scanner

This portable scanner is a funny old thing. But it is rather good, as a way to scan documents (it won’t do books), and photocopies. It’s very small, and very fast.

You really do have to make yourself play with it awhile before trying to use it seriously.  It has quite a few quirks.

It won’t handle more than about 8-9 pages at a go.  By default it presumes that when it gets to the end of these, that’s the end of the document and it creates a PDF (which you can’t then add to!).  This is not sensible behaviour.  What you want it to do is prompt for more sheets, so you can keep feeding your 500 page photocopy into it.  Luckily you can do this.  You need to right-click on the “Scansnap manager” icon on your taskbar, hit “Scan button settings”, and modify the default behaviour.  This one is “continue scanning after current scan is finished” (meaningful, mmm?) on the Scanning tab.  Make sure it is checked.

I’m working with Image quality=faster, colour mode=color (well, disk space isn’t an issue, and it’s plenty fast enough), duplex scan (because I was scanning some stuff that was two-sided).  Set the compression to 1, unless you want fuzzy images.  Disk space, remember, is cheap.  Paper size I left on automatic, and it has handled pay slips etc well.

The “Save” tab has some interesting options.  The “image saving folder” is the one in which the PDF will be deposited on output.  So have a Windows Explorer window open on that folder, and just cut and paste the file out of there before you do much else.

Loading the scanner is an art.  Open the top, and extend the supports in the top of the lid.  Then put ONE sheet in, and one only.  This will fit into the right place.  Then place the others on top of it (face down, of course).  This way you will avoid multiple sheets going through.

The software can bite you.  When you’ve done your scan, it pops up a box asking what to do with it — open Organiser, save to folder, email, etc.  Do NOT hit “save to folder” — your work will just vanish.  Nothing appears on the disk anywhere I could discover.  Hit the “open Organiser” and then grab it using Windows Explorer as above.

The software can perform OCR as part of the scan.  This is painfully slow, so don’t do that.   Tab “File Option”, “Searchable PDF” and uncheck the box.  You get another chance anyway when the Organiser opens.

This weekend I used the unit to scan all my expenses for the last business year.  It did them flawlessly, and I did the lot in a very short time.  I recommend this unit; but just practise with it a bit first, hmm?


Carry your library in your pocket

Let’s face it, we all have too many scholarly books.  We can’t work without them, and we end up with piles of books, often read only once, and piles of photocopies.  When we’re on the road, we can’t access them.  And who has not realised, with a sinking feeling, that some most interesting observation is in that pile of data somewhere, but that we cannot quite recall where?

The answer is to convert our books into PDF files.  Easy to say, I know.  But technology has come on, and what would once have taken forever no longer does.

This afternoon I took three books, each of 200+ pages, and made PDFs of them all.  It took about half an hour each.  How did I do it?

First, you need a modern scanner.  The old ones groaned slowly as they scanned each page.  The modern ones can do a scan in 5 seconds.  I was using a Plustek OpticBook 3600, and even that is not bang-up-to-date.  It’s far faster than my old one, tho.  I controlled it from Abbyy Finereader 8, but really any bit of software would do.  I set the scanner to scan grey-scale, at 300 dpi (quite enough to be readable), and adjusted the page-size down from A4 to whatever the book size was, by trial and error.  I scanned an opening at a time, without splitting the pages.  I set the software to scan multiple pages, so that I didn’t have to hit a key each time (I really didn’t want to hit Ctrl-K 300+ times today!), and I set the interval that the software waits between scans to 5 seconds.  And then I went for it. 
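To see where the half hour goes, here is a rough back-of-envelope sketch.  It assumes that each scan takes about 5 seconds, that the 5 second interval is added on top of that, and that every scan captures one opening (two pages).  The function name and the defaults are mine, purely for illustration.

```python
import math


def estimate_scan_minutes(pages, scan_seconds=5.0, interval_seconds=5.0,
                          pages_per_scan=2):
    """Rough machine time, in minutes, to scan a book one opening at a time.

    Assumption: the waiting interval is added to the scan time for every scan.
    Handling time (turning pages, repositioning the book) is not included.
    """
    scans = math.ceil(pages / pages_per_scan)
    return scans * (scan_seconds + interval_seconds) / 60.0


if __name__ == "__main__":
    # A 200-page book is 100 openings at 10 seconds each.
    print(round(estimate_scan_minutes(200), 1))  # prints 16.7
```

On these assumptions the scanner itself accounts for about 17 minutes of a 200-page book; the rest of the half hour is setting up, finding the page size, and saving.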

The result was a bunch of images of the twin pages.  These I saved as a PDF.  I then passed them through Finereader 9 (which has excellent OCR) to create a PDF with page images and text hidden under the images (because the text won’t be perfectly recognised by the software anyway).  This means that the PDF is now searchable, and that I can search a directory of files for keywords.

I didn’t proof any of the OCR, tho — no time.  The idea is not to upload digital text, but merely to allow me a better chance of finding things.
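As a minimal sketch of what "searching a directory for keywords" can look like, the snippet below assumes that the OCR text has also been saved as plain .txt files in the same folder as the PDFs (reading the hidden text out of the PDFs themselves would need a PDF library, which I leave out here).  The function name `search_texts` is hypothetical.

```python
from pathlib import Path


def search_texts(folder, keyword, pattern="*.txt"):
    """Return (file name, line number, line) for each line containing keyword.

    The match is case-insensitive, which is forgiving of unproofed OCR output.
    Assumption: the OCR text lives in plain-text files matching `pattern`.
    """
    needle = keyword.casefold()
    hits = []
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" keeps going even if the OCR output has odd bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                hits.append((path.name, number, line.strip()))
    return hits
```

Calling `search_texts("scans", "Belisarius")` would list every line in that (hypothetical) folder that mentions the name, with the file and line number to go and look at.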

I used Finereader, but probably other software would be better.  I noted that the PDF sizes varied alarmingly between 200 MB and 20 MB!  So I think Adobe Acrobat would be good for this, from what I have heard.

The end result is that I have three searchable PDFs which I can stick on a key-drive (flash drive), slip into my pocket and look at anywhere.  I can look at them at lunchtime at work, for instance.

Unscrupulous people might be tempted to borrow books from the library, scan those, and save themselves the purchase price.  Of course I can’t advocate that you break the law in this way; still less exchange them online, as I hear some people do.  But we need to be able to manage our own libraries this way, I think.  Paper books have their uses, but scholarly books need this feature, as do their users.  We need a change in approach from copyright holders to make it possible.

I admit that my sympathy for the copyright industry is not as high as it might be, since their sympathy for those who use their products seems non-existent.  Why else do we have laws that criminalise anyone who makes a personal copy of an out-of-print and unavailable book?  Why do we have laws that create copyright for a century, but print-runs of 200, other than to create a dog-in-the-manger?  Why else do they campaign to increase the scope and reach of copyright, year upon year, while making it impossible for scholars to access out-of-print and obscure texts and even 1937 obscure theses? (a sore point, this last one, as regular readers will know).  But really we need better law, and we need better products from textbook manufacturers. 

In the meantime, I hope these notes will help people convert their libraries into a usable form.  The key thing to remember is that we are not trying to produce something perfect; just something usable, and produce it quickly.


Abbyy Finereader 9 is really excellent

In the evenings this week I’ve been scanning some stuff that I can’t really discuss, but I have been very, very impressed with the character recognition quality of Abbyy Finereader 9.  It is very nearly perfect, and such an improvement on previous OCR software.

The only thing that I wish it could handle is English translation with embedded accented characters — strange names like `Abu and words like šeikh (=sheikh).


Michael the Syrian vol. 2 now at Archive.org

I’ve finished scanning the 540-odd pages of vol. 2 of Michael the Syrian and uploaded a PDF of it to Archive.org here.  Archive.org are still using Abbyy Finereader 8 to OCR the text, and Finereader 9 is quite a bit better.  So I have also uploaded the output from that; a Word document, a .txt file, and a .htm file.  These are indicated as *_fr9.*.

Tomorrow I will go down to the library and order volume three, which is the final volume of the translation.  There is a fourth volume, which contains the Syriac.  I’ll worry about that when I get to it.


More Michael the Syrian

A crisp sunny morning, a free afternoon at home, and an email arrives telling me that volume 2 of Michael the Syrian is available for collection at my local library.  Sometimes it all just comes together.  I wonder how much of it I can scan today?

UPDATE: (Early Afternoon) I’d forgotten how HEAVY the volumes are.  The physical labour in picking them up, turning the page, placing it on the scanner, turning it round, etc, is pretty exhausting.  The paper is yellow-ish, which makes for speckling when scanned.  70 pages so far, tho.  The speckling seems to affect the margins most.

It’s an interesting question, whether to trim the margins or not.  Why bulk out the file with speckled white-space? 

UPDATE: (3pm) 123 pages. Groan.  One page had a bit of foxing, which came out as black splotches in black/white scanning.  So I did that page in colour.

UPDATE: (5pm) I’m aiming for 200 pages.  On page 190 at the moment, although I had to stop when the plumber arrived.  Then I can have dinner!  Somewhere in the reign of Justinian at the moment; I saw the name Belisarius a moment ago.


Uploading to Archive.org

Like most people, I have become used to searching Google Books and Archive.org for out-of-copyright scholarly texts.  These are an enormous blessing to us all, since books normally hidden in university rare books rooms can be downloaded as PDFs.

I’ve become aware that it is possible to upload books to Archive.org, and have uploaded a couple of items which I have, and which were not in the archive. 

Of course the first step is to scan the book.  For this I use Abbyy Finereader 8.0, which drives a Plustek Opticbook 3600 scanner at 400 dpi.   This creates images of the pages, and all the pages in the book can be saved as a single PDF file from Finereader.  For optical character recognition, I use Finereader 9.0 (which can only drive the scanner at 300 dpi or 600 dpi, curiously) which has much improved accuracy over Finereader 8.

It is necessary to create an account on Archive.org in order to upload.  Then you get a button ‘Upload’, and can use this to do an upload of a PDF.  This will work fine.  To add extra file formats, use the instructions in the FAQ; edit the item, use the item manager, check out the item (no download is involved in checkout), and then use an FTP interface to add more files.  I was unable to get this to work in Internet Explorer 7 or Firefox 3; but the CuteFTP programme worked fine once I disabled secure-FTP and used simple FTP.

I added to each item a text file output, a Word document with all the formatting, and an HTML file with simple formatting only. 

I would like to encourage readers to look at their shelves and consider which texts might be usefully uploaded.  Every printed item prior to 1st January 1923 is out of copyright in the USA and so can go up.  Copyright laws in the EU and UK require knowledge of the biography of the author, as copyright there absurdly expires 70 years after the death of the author.  But union catalogues of research material like COPAC these days often indicate the birth and death date of authors, making it possible to determine status.
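The two rules of thumb above can be written down as a small sketch.  It is a simplification of the rules as stated here, not a full account of copyright law: the US test is simply "published before 1923", and for the EU/UK I assume the term runs to the end of the calendar year 70 years after the author's death.  The function names are hypothetical.

```python
def out_of_copyright_us(publication_year):
    """US rule as stated above: anything printed before 1 January 1923."""
    return publication_year < 1923


def out_of_copyright_life_plus_70(death_year, current_year):
    """EU/UK rule of thumb: author's death plus 70 years.

    Assumption: protection lasts until the end of the 70th calendar year
    after the death, so the work is free from 1 January of the year after.
    """
    return current_year > death_year + 70


if __name__ == "__main__":
    print(out_of_copyright_us(1899))                  # True
    print(out_of_copyright_life_plus_70(1938, 2009))  # True
    print(out_of_copyright_life_plus_70(1950, 2009))  # False
```

So a book printed in 1899 by an author who died in 1950 could go up under the US rule, but would still be in copyright in the EU and UK in 2009.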