Michael the Syrian part 3 – progress report

I’ve now scanned in images of all the pages (around 600) of this monstrously heavy volume — my forearms will never be the same — using Abbyy Finereader 8 to control the scanner.  I scanned in black-and-white at 400 dpi, which is the best for OCR.

I’ve gone through the batch, turning alternate pages the right way up.  I’m now importing it into Finereader 9, which has better OCR and produces smaller PDFs.
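For anyone who would rather script the "turn alternate pages the right way up" step than do it by hand, here is a minimal sketch of the bookkeeping. It assumes the book was turned round between scans, so every second image is upside down. The function name `rotation_plan` and its parameters are made up for illustration; this is not how Finereader does it.

```python
def rotation_plan(n_pages, first_page_upside_down=False):
    """Return the rotation in degrees (0 or 180) to apply to each scanned page.

    Assumption: the volume was turned round between scans, so the
    images alternate between upright and upside down.
    """
    plan = []
    for index in range(n_pages):
        # Pages at even indices share the orientation of the first page.
        upside_down = (index % 2 == 0) == first_page_upside_down
        plan.append(180 if upside_down else 0)
    return plan


if __name__ == "__main__":
    # Four pages, the first one scanned upright.
    print(rotation_plan(4))
```

The resulting list could then be fed to whatever image tool is to hand, rotating only the pages marked 180.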

UPDATE (16:30): I’ve created a searchable PDF, which is about 33 MB.  Now starting to upload it to Archive.org.  This can be slow and frustrating, and will probably take all evening.  I’ve also exported the text as .htm and .doc, which I’ll probably place there as well.  I haven’t proofed any of the OCR output, but FR9 gives rather better results than FR8, which is what the automatic processes at Archive.org use.

UPDATE (16:36): Good grief.  It uploaded first time.  It’s here: http://www.archive.org/details/MichelLeSyrien3  I’d better add the other formats, then (if it will let me).  It’s not in the searches yet, tho.

UPDATE (16:39): Hmm.  The interface for uploads of extra files has changed.  Somewhat better than it was.  Still very slow, it seems, and not that intuitive.  You can tell it was tested by someone local to the server, and not someone far away from it.


Still scanning Michael the Syrian part 3

And boy is it hard work!  Just lifting and turning the heavy volume itself is tiring.  Just scanned p. 165.  I find that I have to play games with myself, to avoid giving up.  So at the moment I’m saying, “only a couple more to 170; you can pause there.”  When I get to 170, of course I have 171 open.  So I tend to just scan the extra page — just turn the book and lower it on the scanner.  Then, “well, may as well do a couple more.”  And so on.

We tend to take for granted how all those books on Google and Archive.org got scanned.  But it was hard, slow, back-breaking work.  When we grumble about missing pages, perhaps we should think of some low-paid person, very tired.

P. 173 done.  Maybe I’ll just do as far as 180…

UPDATE.  p.269.  Wonder if I can get to 300 tonight?

UPDATE2. p.361.  But I’m missing One Tree Hill!  Still, when the pages are turning and the pain-level is low, you have to keep rolling.


Michael the Syrian vol. 3 has arrived

I scanned volume 1 and volume 2 of the French translation of the Chronicle of Michael the Syrian, the big 12th-century Syriac chronicle, and placed them on Archive.org.  I learned today that after a very long wait, volume 3 has appeared at the local library via ILL.  I shall go and get it tomorrow, and fire up my scanner.


Time for something less strenuous

A lot of what I do demands a fair bit of concentration.  When I get home at the weekend, I don’t always find myself able to concentrate that much.  This is one reason why my additions to the Early Fathers collection developed; scanning and proofing texts does not require a lot of concentration, and can be quite soothing.

Like most of us, I have books and articles in photocopy form sitting around.  Since I acquired the Fujitsu Scansnap S300, these have looked increasingly inconvenient.

And I hate “inconvenient.”

Well, it’s not that hard to stick one of these books-in-a-pile-of-photocopies through the Scansnap S300.  I’ve just scanned one, which I will need sometime but not now.  It created a PDF.  I then opened the PDF in Abbyy Finereader 9, and ran the text recognition on it.  Then I saved it again, as a searchable PDF.  The latter isn’t as good quality as the first PDF, for some mysterious reason, so I’ll keep both. 

So… I now have a pile of paper to throw away.  If I ever need the book in that form, I can just print off the PDF.

Books that I use all the time are a different matter.  Books that I read through with a glass of something by my side are a different matter.  But books I never look at, and which I retain a copy of because of some idea I may one day work on?  I think not.

I doubt I am alone in this.  All over the world, students must be doing the same with textbooks.



More on the Fujitsu Scansnap S300 scanner

This portable scanner is a funny old thing.  But it is rather good as a way to scan loose documents and photocopies (it won’t do books).  It’s very small, and very fast.

You really do have to make yourself play with it awhile before trying to use it seriously.  It has quite a few quirks.

It won’t handle more than about 8-9 pages at a go.  By default it presumes that when it gets to the end of these, that’s the end of the document and it creates a PDF (which you can’t then add to!).  This is not sensible behaviour.  What you want it to do is prompt for more sheets, so you can keep feeding your 500 page photocopy into it.  Luckily you can do this.  You need to right-click on the “Scansnap manager” icon on your taskbar, hit “Scan button settings”, and modify the default behaviour.  This one is “continue scanning after current scan is finished” (meaningful, mmm?) on the Scanning tab.  Make sure it is checked.

I’m working with Image quality=faster, colour mode=color (well, disk space isn’t an issue, and it’s plenty fast enough), duplex scan (because I was scanning some stuff that was two-sided).  Set the compression to 1, unless you want fuzzy images.  Disk space, remember, is cheap.  Paper size I left on automatic, and it has handled pay slips etc well.

The “Save” tab has some interesting options.  The “image saving folder” is the one in which the PDF will be deposited on output.  So have a Windows Explorer window open on that folder, and just cut and paste the file out of there before you do much else.

Loading the scanner is an art.  Open the top, and extend the supports in the top of the lid.  Then put ONE sheet in, and one only.  This will fit into the right place.  Then place the others on top of it (face down, of course).  This way you will avoid multiple sheets going through.

The software can bite you.  When you’ve done your scan, it pops up a box asking what to do with it: open Organiser, save to folder, email, etc.  Do NOT hit “save to folder”; your work will just vanish.  Nothing appears on the disk anywhere I could discover.  Hit “open Organiser” and then grab the file using Windows Explorer as above.

The software can perform OCR as part of the scan.  This is painfully slow, so don’t do that.  Go to the “File Option” tab, find “Searchable PDF”, and uncheck the box.  You get another chance anyway when the Organiser opens.

This weekend I used the unit to scan all my expenses for the last business year.  It did them flawlessly, and I did the lot in a very short time.  I recommend this unit; but just practise with it a bit first, hmm?


Better OCR with Finereader 9

Last night I ran Finereader 9 over a 400-page English translation from 1936 that I had scanned some time ago at 400 dpi.  I then settled down for the onerous task of correcting scanner errors; only to find very few indeed.  There were perhaps a dozen in the whole book!  Probably if I had just exported it to Word and used the spell-checker, I would have found most of them.

I repeated the exercise on another text, with the same result.

FR9 is perceptibly better than FR8 at OCR.  It has some annoyances in the user interface.  Worse, it forces me to use my Plustek Opticbook 3600 at 300 dpi or 600 dpi, when FR8 allowed 400 dpi (the optimal resolution).  But the fact is that there has been a considerable advance here.
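To put the 300 / 400 / 600 dpi choice in numbers, here is a rough sketch of how pixel dimensions and uncompressed image size scale with resolution.  The page size (8.5 x 11 inches) is an assumption, since none is given above, and the sizes are for raw 1-bit black-and-white images, not the compressed files a scanner actually writes.

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_mb(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed image size in megabytes (10**6 bytes)."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return width_px * height_px * bits_per_pixel / 8 / 1_000_000


if __name__ == "__main__":
    # Assumed page size: 8.5 x 11 inches; adjust to the real page.
    for dpi in (300, 400, 600):
        width_px, height_px = scan_pixels(8.5, 11, dpi)
        size = raw_size_mb(8.5, 11, dpi)
        print(f"{dpi} dpi: {width_px} x {height_px} px, {size:.2f} MB raw")
```

Pixel count goes with the square of the resolution, so 600 dpi means four times the data of 300 dpi per page, while 400 dpi is a bit under twice.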

When I look back ten years to the misery of “99% accurate” recognition (i.e. 6 errors a page), it is truly amazing.  Recommended.
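The arithmetic behind "99% accurate" is worth a quick sketch.  Six errors a page at 99% implies about 600 recognised units per page; whether those units are characters or words is not stated, so the 600 below is an assumption read off that figure.  The "dozen errors in 400 pages" comes from the first test above; the function names are just for illustration.

```python
def errors_per_page(units_per_page, accuracy):
    """Expected number of misrecognised units on one page."""
    return units_per_page * (1 - accuracy)


def implied_accuracy(total_errors, pages, units_per_page):
    """Accuracy implied by an observed error count over a whole book."""
    return 1 - total_errors / (pages * units_per_page)


if __name__ == "__main__":
    UNITS_PER_PAGE = 600  # assumption: implied by "6 errors a page" at 99%

    print(errors_per_page(UNITS_PER_PAGE, 0.99))
    # About a dozen errors in a 400-page book, as in the FR9 test.
    print(implied_accuracy(12, 400, UNITS_PER_PAGE))
```

On those assumptions, a dozen errors in 400 pages works out to roughly 99.995% accuracy, or one error every thirty-odd pages instead of six on every page.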