I was hunting around the web for an article from an Italian encyclopedia, when I struck lucky. All twelve volumes had been digitised to PDF, and they were available to download from Archive.org. Great news!
Well, I only needed volume nine, so I grabbed that. To my shock, the PDF was over 3 GIGABYTES in size! That would mean the whole encyclopedia would take a massive 40 GB or so out of my disk space. Yet each volume is only 1,200 pages, and I think all of the pages are black and white.
Nor was this the only problem. A 3 GB PDF is such a large file that ABBYY FineReader wouldn’t open it. My anti-virus picked it up and complained about it. My long-outdated copy of Adobe Acrobat Pro 9 wouldn’t extract the three pages that I actually needed. Nor would it print those three pages. I thought about just buying a copy of whatever the latest version of Acrobat Pro might be; but dear old Adobe, an evil company, has quietly removed the option. All you can buy is a monthly subscription.
So what on earth to do? Why was the file so large anyway?
Thankfully I found a free downloadable tool for Windows called PDFSam Basic. This allowed me to split off the first few pages, and then I could work with them in Adobe Acrobat 9 as usual. I extracted the first page to PNG format, and found that that one page alone was more than 3 megabytes in size. That’s the same size as a full-colour photograph on my digital camera. Whoever had made the scans had done so at maximum resolution, in full colour. For black-and-white text pages. [Update: do NOT use PDFSam! It also silently installed its paid-for model, and the uninstall did not work.]
Well, I used PDFSam to chop the 3 GB monster up into three files, and then I used Adobe Acrobat to “save as” these out to PNG, with settings RGB=off, colourspace=Monochrome. This produced a directory full of .png files, one for each page, none larger than 150 KB, and often much less. The first page was no longer 3 MB but 36 KB. Then I gathered these up into a PDF using Adobe Acrobat and… the PDF file for all the pages was now a mere 109 MB. Much more sensible.
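To put those numbers in perspective, here is a rough back-of-envelope sketch in Python. It uses only the figures quoted above, which are themselves approximate ("more than 3 megabytes", "over 3 GIGABYTES"), and it assumes decimal units (1 MB = 1,000 KB), so treat the results as ballpark values rather than measurements. The function name `reduction_factor` is just an illustrative helper.

```python
# Rough arithmetic on the approximate file sizes quoted in the post.
BITS_PER_PIXEL_RGB = 24   # full colour: 8 bits each for red, green and blue
BITS_PER_PIXEL_MONO = 1   # monochrome: one bit per pixel, black or white


def reduction_factor(before_kb: float, after_kb: float) -> float:
    """Return how many times smaller the 'after' size is than the 'before' size."""
    return before_kb / after_kb


# Uncompressed pixel data shrinks by this factor when colour is dropped.
raw_ratio = BITS_PER_PIXEL_RGB / BITS_PER_PIXEL_MONO

# Sizes in KB, assuming 1 MB = 1,000 KB and 1 GB = 1,000,000 KB.
first_page = reduction_factor(3_000, 36)             # about 3 MB -> 36 KB
whole_volume = reduction_factor(3_000_000, 109_000)  # about 3 GB -> 109 MB

print(f"raw pixel data: {raw_ratio:.0f}x smaller")
print(f"first page:     {first_page:.0f}x smaller")
print(f"whole volume:   {whole_volume:.0f}x smaller")
```

On these figures the first page shrank roughly 83-fold and the whole volume roughly 28-fold. Simply going from 24-bit colour to 1-bit pixels accounts for a 24-fold saving in raw data before any compression is applied, so the size of the original file is about what you would expect from scanning text in full colour.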
Only afterwards did it occur to me that this sort of task is what ImageMagick is for. It’s a very powerful command-line tool. But I don’t currently have it installed, because it has so many switches and options that I rarely use it, and working out which option to use takes a while.
Inspecting the new PDF, I saw that the scanning had been done extremely carelessly. The opening pages had a large stain across them.
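For anyone who does have ImageMagick to hand, the following is a minimal, untested sketch of how the same job might be approached. I did not run this myself. It assumes ImageMagick 7 (where the command is `magick`) with Ghostscript installed so that it can read PDFs. The file names `volume9.pdf` and `page-%04d.png`, the page range, and the helper names are all made up for illustration. The `-density` and `-monochrome` options and the `[first-last]` page selector are standard ImageMagick features, but the best settings for any given scan are guesswork. The Python here only assembles the command line, so it can be inspected before anything is actually run.

```python
import subprocess


def build_magick_command(pdf_path, first_page, last_page,
                         out_pattern="page-%04d.png", density=300):
    """Assemble an ImageMagick command that renders a range of PDF pages
    as black-and-white PNG files. Page numbers are zero-based."""
    return [
        "magick",
        "-density", str(density),                   # rendering resolution in DPI
        f"{pdf_path}[{first_page}-{last_page}]",    # only read the pages we need
        "-monochrome",                              # reduce to black and white
        out_pattern,                                # one numbered PNG per page
    ]


def run_conversion(cmd):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(cmd, check=True)


# Hypothetical example: the first three pages of a file called volume9.pdf.
cmd = build_magick_command("volume9.pdf", 0, 2)
print(" ".join(cmd))
# run_conversion(cmd)  # uncomment once ImageMagick and Ghostscript are installed
```

Selecting only the wanted pages matters with a file this size, since rendering all 1,200 pages in one go is likely to be slow and memory-hungry.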
Anybody who has used a photocopier knows that this happens when you haven’t got the page flat on the copier. Sheer carelessness.
And that’s actually the cause of the huge page sizes too. Whoever did the scan didn’t bother to set the copier up correctly. They just scanned at max resolution, full colour, and let the output be whatever size it might be. Whoever did it was NOT the person who was going to need to use the file.
I think we can all guess how this might happen. What sort of person has young people at their disposal to do chores like this?
So… my friends, whenever you get a student to scan a book for you, do please remember that they don’t want to do it, and CHECK the results? Thank you.
Update: Further experiments show that I don’t need to use PDFSam – Acrobat will “save as” the whole file to .png, even if it won’t do much else.


