On Saturday I was working on a text file containing the works of Ephraem Graecus, as they appear in the Phrantzolas edition, with CPG numbers and Assemani page numbers. This proved much more difficult than I had at first thought, and I was reduced to opening the PDFs of the Greek text and looking at the opening words in the index of initia in the CPG volume 5.
At various points it became obvious that it would be very helpful if I had a PDF of the CPG that was searchable.
I don’t possess the volumes of the CPG and never have. The price puts them outside the reach of the layman. (I do possess a copy of the CPL, however, because Brepols issued a paperback of it.) So, like most people, I am dependent on PDFs made up of photos taken with a mobile phone by someone or other. These are always askew, and can’t be made searchable.
However… in my directory of CPG files, I discovered a set of 5 PDFs where the images of each double-page were pretty much square on, and also in grey-scale. I never used them, as the grey-scale was faint, and unpleasant to look at. But I started to experiment.
I pulled one volume into Finereader 12, with the options set to split each double-page image automatically into two single pages. To my amazement this worked fine, without need for correction (in subsequent volumes I had to manually split a dozen pages).
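For anyone curious what "splitting a double page" amounts to, here is a minimal sketch in Python. It is not how Finereader does it (I have no idea what happens inside FR12); it simply assumes the gutter between the two pages sits at the middle column, which is only plausible when the photo is square on. The image is represented as a plain list of rows of grey values, and the function name `split_double_page` and its `gutter` parameter are my own inventions for illustration.

```python
def split_double_page(pixels, gutter=None):
    """Split a two-page spread into (left, right) single pages.

    `pixels` is a list of rows, each a list of grey values
    (0 = black, 255 = white). `gutter` is the column where the two
    pages meet; by assumption it defaults to the middle of the image.
    """
    if not pixels:
        return [], []
    width = len(pixels[0])
    if gutter is None:
        gutter = width // 2
    # Everything left of the gutter is one page, the rest is the other.
    left = [row[:gutter] for row in pixels]
    right = [row[gutter:] for row in pixels]
    return left, right


# A tiny made-up "spread", 4 pixels wide and 2 pixels high.
spread = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
]
left_page, right_page = split_double_page(spread)
print(left_page)   # [[10, 20], [50, 60]]
print(right_page)  # [[30, 40], [70, 80]]
```

Where the photo is off-centre, passing an explicit `gutter` column is the equivalent of the dozen pages I had to split by hand.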
The single-page images were still a rubbishy, hard-to-read grey, however. I then tried saving the images out of FR12 to disk as black and white .png files. I hoped that these would be readable and … it worked! The original images were of such high resolution that the black-and-white versions were just fine, and much more readable than the grey originals.
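The conversion from faint grey to black and white can be illustrated with a few lines of Python. This is only a sketch of the general idea, using a single fixed cut-off value; I do not know what method FR12 actually uses, and real tools often pick the cut-off adaptively. The function name `to_black_and_white` and the `threshold` parameter are hypothetical, and the page is again a plain list of rows of grey values.

```python
def to_black_and_white(pixels, threshold=128):
    """Turn a grey-scale page into pure black (0) and white (255).

    Any pixel at or above `threshold` becomes white, anything darker
    becomes black. A fixed global threshold is an assumption made
    for this sketch.
    """
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in pixels
    ]


# A made-up faint page: pale background (about 240-250) with a
# vertical stroke of washed-out grey "ink" (150-160) in the middle.
faint_page = [
    [250, 240, 245],
    [238, 150, 242],
    [249, 160, 251],
]

# With faint ink the cut-off has to sit well above the default,
# otherwise the grey letters would be turned white and vanish.
print(to_black_and_white(faint_page, threshold=200))
# [[255, 255, 255], [255, 0, 255], [255, 0, 255]]
```

The high resolution of the originals matters here: with plenty of pixels per letter, losing the shades of grey does little harm to the letter shapes.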
I then combined all the B/W images into a new PDF file, which became my new volume of the CPG. So now I had a PDF of perfectly readable, square-on, single pages, in black and white.
I wanted to make this searchable. Ideally the Greek should be searchable as Greek, and the Latin as Latin. I am not clear how to do this. One idea would be to pull the black and white images back into FR12, OCR them, and then let FR12 create a searchable PDF. This might well work; but the PDFs created by Finereader tend to be huge. And… would the ancient Greek really work?
What I did instead was to use Adobe Acrobat Pro 9 to OCR the B/W PDFs. This makes the Latin text more or less searchable. It’s a start.
I’ve had to pause work on this for much of today in order to do a job interview, but I am resuming the process for all the volumes tonight. Then I shall return to the Phrantzolas file, with the aid of searchable PDFs.
The job interview was successful, so I may have to go back to work next week! Whatever I am to do must be done now!