A PDF is a small filesystem packed into a single file. Numbered objects sit inside it; an xref table at the end records where each one lives. Four kinds of object do most of the work:
PDF defines at least five different ways to describe color, and a compressor that conflates them will turn a carefully prepared document into garbage.
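One place where conflating color spaces shows up immediately is the byte count: the same pixel grid needs a different number of bytes depending on the declared space. The following is a minimal sketch, assuming uncompressed samples and only the three device color spaces. The function names are illustrative and are not taken from pdfcompressor.

```python
# Components per pixel for the three device color spaces in PDF.
# Other spaces (ICCBased, Indexed, Separation, ...) need their own handling
# and are deliberately rejected here rather than guessed at.
DEVICE_COMPONENTS = {
    "DeviceGray": 1,
    "DeviceRGB": 3,
    "DeviceCMYK": 4,
}


def raw_image_size(width, height, colorspace, bits_per_component=8):
    """Return the uncompressed sample-data size in bytes for an image."""
    try:
        components = DEVICE_COMPONENTS[colorspace]
    except KeyError:
        raise ValueError(f"unhandled color space: {colorspace}") from None
    # Each row of samples is padded to a whole number of bytes.
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * height


def is_safe_to_reinterpret(data_len, width, height, colorspace):
    """True only if the data length matches what the given space implies."""
    return data_len == raw_image_size(width, height, colorspace)
```

A 100×50 DeviceCMYK image decodes to 20,000 bytes, while the same grid read as DeviceRGB would expect 15,000. A length check like this catches only the crudest mix-ups; it says nothing about whether the colors themselves survive a conversion.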
A font in a text PDF can outweigh all the text typed in it. A hundred-page contract set entirely in Arial commonly carries 400 KB of font data, because the generator embedded the whole font program…
In a typical PDF, 60–90% of the bytes are images. If a file contains even one scan or photo, every other optimization is rounding error.
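The "rounding error" claim is a simple bound, sketched below. The function names and the 50% reduction figures in the usage note are illustrative assumptions, not measurements.

```python
def max_saving_without_images(image_share):
    """Upper bound on total size reduction if images are left untouched.

    image_share is the fraction of file bytes occupied by images (0.0 to 1.0).
    Even shrinking everything else to zero bytes cannot save more than this.
    """
    if not 0.0 <= image_share <= 1.0:
        raise ValueError("image_share must be between 0 and 1")
    return 1.0 - image_share


def total_saving(image_share, image_reduction, other_reduction):
    """Overall fractional saving given per-category reductions (0.0 to 1.0)."""
    return image_share * image_reduction + (1.0 - image_share) * other_reduction
```

With images at 90% of the file, `total_saving(0.9, 0.5, 0.0)` gives about 0.45, while `total_saving(0.9, 0.0, 0.5)` gives about 0.05. Halving the images removes nine times as many bytes as halving everything else.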
Take a scanned page. The letter “o” appears 200 times on it. Each “o” is roughly the same set of dark pixels — call it 20×25, about 500 bits. Store the page naively and those 200 copies cost 100,00…
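The arithmetic in this example can be checked directly. The sketch below compares storing every glyph bitmap separately against storing one bitmap and pointing to it from each occurrence. The 32-bit cost per reference is a hypothetical figure chosen for illustration; it is not taken from the source or from any particular format.

```python
GLYPH_WIDTH = 20
GLYPH_HEIGHT = 25
OCCURRENCES = 200

# Hypothetical: bits spent per occurrence to record which glyph and where.
# This is an illustrative figure, not a property of any real format.
REFERENCE_BITS = 32


def naive_cost_bits(width, height, occurrences):
    """Every occurrence stores its own 1-bit-per-pixel bitmap."""
    return width * height * occurrences


def shared_cost_bits(width, height, occurrences, reference_bits):
    """One stored bitmap plus a fixed-size reference per occurrence."""
    return width * height + occurrences * reference_bits


naive = naive_cost_bits(GLYPH_WIDTH, GLYPH_HEIGHT, OCCURRENCES)
shared = shared_cost_bits(GLYPH_WIDTH, GLYPH_HEIGHT, OCCURRENCES, REFERENCE_BITS)
print(naive, shared)  # prints "100000 6900"
```

Under these assumptions the shared form needs 6,900 bits against 100,000, and the gap widens with every additional repeat of the glyph.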
JPEG was standardized as ITU-T T.81 in 1992 (ISO/IEC 10918-1 in 1994) and lives in every camera and every PDF with photographs. It still has headroom. pdfcompressor uses the modern jpegli implement…
The rules that say “don’t touch this” matter as much as the transformations themselves. They decide whether running the compressor over an archive of millions of PDFs leaves you with usable files o…
After images shrink and fonts get trimmed, the file still contains thousands of small objects: drawing commands, dictionaries, width tables, metadata. Together they weigh less than a single photogr…