I was hunting around the web for an article from an Italian encyclopedia, when I struck lucky. All twelve volumes had been digitised to PDF, and they were available to download from Archive.org. Great news!
Well, I only needed volume nine, so I grabbed that. To my shock, the PDF was over 3 GIGABYTES in size! That would mean the whole encyclopedia would take a massive 40 GB out of my disk space. Yet each volume is only 1200 pages, and I think all of the pages are black and white.
Nor was this the only problem. A 3 GB PDF is such a large file that ABBYY FineReader wouldn’t open it. My anti-virus picked it up and complained about it. My long-outdated copy of Adobe Acrobat Pro 9 wouldn’t extract the three pages that I actually needed. Nor would it print those three pages. I thought about just buying a copy of whatever the latest version of Acrobat Pro might be; but dear old Adobe, an evil company, has quietly removed the option. All you can buy is a monthly subscription.
So what on earth to do? Why was the file so large anyway?
Thankfully I found a free downloadable tool for Windows called PDFSam Basic. This allowed me to split off the first few pages, and then I could work with them in Adobe Acrobat 9 as usual. I extracted the first page to PNG format, and found that that one page alone was more than 3 megabytes in size. That’s the same size as a full-colour photograph on my digital camera. Whoever had made the scans had done so at maximum resolution, in full colour. For black-and-white text pages. [Update: do NOT use PDFSam! It also silently installed its paid-for model, and the uninstall did not work.]
Well, I used PDFSam to chop the 3 GB monster up into three files, and then I used Adobe Acrobat to “save as” these out to PNG, with settings RGB=off, colourspace=Monochrome. This produced a directory full of .png files, one for each page, none larger than 150 KB, and often much less. The first page was no longer 3 MB but 36 KB. Then I gathered these up into a PDF using Adobe Acrobat and… the PDF file for all the pages was now a mere 109 MB. Much more sensible.
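For anyone who likes to see the sums, here is a quick back-of-envelope check in Python. It is only a sketch: it treats "over 3 gigabytes" as exactly 3 GB, takes the page count as a round 1200, and the variable names are my own.

```python
# Rough figures from the post. Assumptions: the PDF is exactly 3 GB
# (it was "over 3"), and the volume has exactly 1200 pages.
PAGES_PER_VOLUME = 1200
VOLUMES = 12
ORIGINAL_PDF_MB = 3 * 1024      # 3 GB expressed in megabytes
MONOCHROME_PDF_MB = 109         # size of the rebuilt monochrome PDF


def per_page_kb(total_mb: float, pages: int) -> float:
    """Average size of one page, in kilobytes."""
    return total_mb * 1024 / pages


original_kb = per_page_kb(ORIGINAL_PDF_MB, PAGES_PER_VOLUME)
monochrome_kb = per_page_kb(MONOCHROME_PDF_MB, PAGES_PER_VOLUME)
shrink_factor = ORIGINAL_PDF_MB / MONOCHROME_PDF_MB
full_set_gb = VOLUMES * ORIGINAL_PDF_MB / 1024

print(f"original:   about {original_kb:.0f} KB per page")
print(f"monochrome: about {monochrome_kb:.0f} KB per page")
print(f"the rebuilt PDF is roughly {shrink_factor:.0f} times smaller")
print(f"all {VOLUMES} volumes at the original size: about {full_set_gb:.0f} GB")
```

On those assumptions the original averages around 2.5 MB a page, which is in line with the 3 MB page I extracted, and the monochrome version comes in at under 100 KB a page.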
Only afterwards did it occur to me that this sort of task is what ImageMagick is for. It’s a very powerful command line tool. But I don’t currently have that installed because it has so many switches and options that I use it rarely. And working out what option to use takes a while.
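The same monochrome conversion can also be scripted. What follows is a minimal Python sketch, not what I actually did: it assumes the third-party Pillow library is installed, that the colour pages have already been exported as one PNG per page into a folder, and that a plain cut-off at mid-grey (128) is good enough for printed text. The function names and the threshold are my own invention.

```python
from pathlib import Path


def build_threshold_table(threshold: int = 128) -> list[int]:
    """Map each 8-bit grey level to black (0) or white (255)."""
    return [255 if level >= threshold else 0 for level in range(256)]


def pages_to_monochrome(src_dir: str, dst_dir: str, threshold: int = 128) -> int:
    """Write a 1-bit PNG for every PNG page in src_dir; return the page count."""
    # Third-party dependency (pip install Pillow), imported only when needed.
    from PIL import Image

    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    table = build_threshold_table(threshold)

    count = 0
    for page in sorted(Path(src_dir).glob("*.png")):
        with Image.open(page) as img:
            grey = img.convert("L")            # discard the colour information
        bilevel = grey.point(table, mode="1")  # hard threshold, no dithering
        bilevel.save(out / page.name, optimize=True)
        count += 1
    return count
```

Calling `pages_to_monochrome("colour_pages", "mono_pages")` (both folder names hypothetical) would leave the originals alone and write the small files alongside. Pages with stains or photographs may need a different threshold, so it is worth eyeballing a few results before processing a whole volume.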
Inspecting the new PDF, I saw that the scanning had been done extremely carelessly. The opening pages had a large stain across them:
Anybody who has used a photocopier knows that this happens when you haven’t got the page flat on the copier. Sheer carelessness.
And that’s actually the cause of the huge page sizes too. Whoever did the scan didn’t bother to set the copier up correctly. They just scanned at max resolution, full colour, and let the output be whatever size it might be. Whoever did it was NOT the person who was going to need to use the file.
I think we can all guess how this might happen. What sort of person has young people at their disposal to do chores like this?
So… my friends, whenever you get a student to scan a book for you, do please remember that they don’t want to do it, and CHECK the results? Thank you.
Update: Further experiments show that I don’t need to use PDFSam – Acrobat will “save as” the whole file to .png, even if it won’t do much else.

The disconnect between scanner operator and reader is the curse of the age.
(The trouble with getting a student to do it, on the other hand, is at least as old as Hippocrates.)
My own go-to tool for dealing with unwieldy PDFs is tools.pdf24.org: free, online or desktop, and has never failed me so far.
But with IA files you may be better off downloading one of the “SINGLE PAGE JP2” zip or tar files. These usually contain the source images (“ORIGINAL” file) from which the PDF is made; then you can post-process it to your liking. If the encyclopedia you are using is the one I’m thinking of, there is no “ORIGINAL” file because this was uploaded by a user directly as PDF, but the “PROCESSED” file (images extracted from the uploaded PDF, for display on the IA viewer) is still better, and only 1.9 GB. With books scanned by the IA (“ORIGINAL”) this is really the only option, because the PDF format they use is frankly hideous.
You need to be able to convert jp2 files, which is not that common. I use ImageConverterPlus for that.
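If you would rather script it, something like the following Python sketch may also work. It assumes the third-party Pillow library, and a Pillow build that includes JPEG 2000 (OpenJPEG) support, which not every installation has. The function names and the layout of the zip are guesses for illustration only.

```python
import io
import zipfile
from pathlib import Path


def jp2_members(names):
    """Pick out the .jp2 entries from a list of archive names, in page order."""
    return sorted(name for name in names if name.lower().endswith(".jp2"))


def extract_jp2_as_png(zip_path: str, dst_dir: str) -> int:
    """Convert every JP2 page inside the zip to a PNG in dst_dir; return the count."""
    # Third-party dependency (pip install Pillow); reading JP2 needs OpenJPEG.
    from PIL import Image

    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as archive:
        members = jp2_members(archive.namelist())
        for name in members:
            data = archive.read(name)  # one page at a time, not the whole zip
            with Image.open(io.BytesIO(data)) as img:
                img.save(out / (Path(name).stem + ".png"))
    return len(members)
```

The pages come out at whatever colour depth the JP2 files have, so they would still need reducing to monochrome afterwards if small files are the aim.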
That’s a very good thought about the zip files! Correct. I had not thought of that! There is in fact an option for the zip.