What’s inside a PDF and why it gets so big
What a PDF is
A PDF is a small filesystem packed into a single file. Numbered objects sit inside it; an xref table at the end records where each one lives. Four kinds of object do most of the work:
- a page — its dimensions, the resources it references, and the drawing commands that paint it;
- a stream — a block of bytes, usually compressed: an image, a font program, the contents of a page;
- a dictionary — key/value metadata;
- a font — either a complete font program embedded in the file, or a name reference to one of the 14 “standard” fonts every reader is required to know.
Opening a page runs a tiny program: translate the coordinate system, set font X, draw this text, place image Y, fill this rectangle. The instructions are featherweight; the resources they point to are what blow up the file.
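Here is a sketch of what that tiny program looks like on the wire. The operators are real PDF content-stream operators; the resource names /F1 and /Im1 are illustrative, and the stream itself would normally be Flate-compressed inside the file.

```python
# What a page's "tiny program" looks like before compression.
# '%' starts a comment in PDF content streams.
content_stream = b"""
1 0 0 1 72 720 cm       % translate the coordinate system
BT /F1 12 Tf            % begin text, select font F1 at 12 pt
(Hello, world) Tj       % draw a string
ET                      % end text
q 100 0 0 75 300 600 cm % save state, scale/position the image
/Im1 Do                 % place image Im1
Q                       % restore state
0 0 200 50 re f         % fill a rectangle
"""
```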
Where the bloat comes from
Take any 30-megabyte PDF apart and you find some combination of four problems.
First, images stored at capture resolution. A phone camera captures 4032×3024 pixels; a scanner produces 600 DPI. The image renders ten centimeters wide on the page, and most of those pixels never reach a screen or a printer.
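The diagnosis is arithmetic: effective DPI is pixels across divided by inches across on the page, at 72 PDF points per inch. A minimal sketch:

```python
def effective_dpi(pixel_width: int, placed_width_pts: float) -> float:
    """Pixels across, divided by inches across on the page (72 pt = 1 in)."""
    return pixel_width / (placed_width_pts / 72.0)

# The phone photo above, placed ten centimeters (~283 pt) wide:
print(effective_dpi(4032, 283))  # -> ~1026 DPI; 150-300 DPI is plenty
```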
Second, untrimmed fonts. The generator embedded all of Helvetica Neue — 2000 glyphs — for a page that uses 40 letters.
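Subsetting itself is well-trodden ground. A minimal sketch of the idea using fontTools; the paths and the used-text extraction are assumptions, and embedded Type1/CFF fonts need more handling than shown:

```python
from fontTools import subset
from fontTools.ttLib import TTFont

def subset_font(font_path: str, used_text: str, out_path: str) -> None:
    """Keep only the glyphs the document actually draws."""
    font = TTFont(font_path)
    subsetter = subset.Subsetter()
    subsetter.populate(text=used_text)  # glyphs reachable from these characters
    subsetter.subset(font)
    font.save(out_path)
```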
Third, lazy stream compression. Most tools deflate at level 1 or 3 because the writer is tuned for speed, but decompression cost does not depend on compression level: a stream written at level 9 reads just as fast.
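So fixing it is nearly free. A minimal sketch with Python's zlib; real streams also carry /Filter chains and predictor parameters that must be preserved:

```python
import zlib

def recompress_flate(stream: bytes) -> bytes:
    """Inflate, re-deflate at maximum level, keep whichever is smaller."""
    raw = zlib.decompress(stream)
    tighter = zlib.compress(raw, 9)
    return tighter if len(tighter) < len(stream) else stream
```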
Fourth, duplicates and trailing junk: the same image embedded ten times because it appears on ten pages, stale object versions left behind after editing, leftover XFA forms, ICC profiles, metadata, bytes after the final %%EOF.
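The duplicate problem falls to content hashing, which is why the pipeline below runs it as phase 0. A minimal sketch, assuming streams have already been extracted into a map of object id to raw bytes; real code must compare the stream dictionaries too, not just the bytes:

```python
import hashlib

def find_duplicate_streams(streams: dict[int, bytes]) -> dict[int, int]:
    """Map each duplicate object id to the first object carrying
    identical bytes, so references can be rewritten to one copy."""
    first_seen: dict[bytes, int] = {}
    remap: dict[int, int] = {}
    for obj_id, data in streams.items():
        digest = hashlib.sha256(data).digest()
        if digest in first_seen:
            remap[obj_id] = first_seen[digest]
        else:
            first_seen[digest] = obj_id
    return remap
```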
The ten phases of pdfcompressor
The algorithm walks the file in ten steps; the order is load-bearing.
| # | Phase | What happens |
|---|---|---|
| 0 | Early deduplication | Hash streams and merge identical ones before any re-encoding mutates them |
| 1 | Images | Classify, choose codec, downsample |
| 2 | Standard font replacement | Drop embedded Helvetica/Times — readers ship them |
| 3 | Font subsetting | Strip glyphs that never appear in the document |
| 4 | Font deduplication | Especially after CFF conversion exposes identical font programs |
| 5 | Stream recompression | Flate at maximum level |
| 6 | Orphan removal | Objects nothing references |
| 7 | Object streams | Pack many small objects into one compressed block |
| 8 | Content stream | Light cleanup of drawing commands |
| 9 | Stripping | Remove metadata, XFA, ICC, dead annotations |
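Several of these phases are classic graph work. Phase 6, for example, amounts to mark-and-sweep over object references; a minimal sketch, assuming a precomputed reference map:

```python
def reachable_objects(refs: dict[int, list[int]], roots: list[int]) -> set[int]:
    """Walk object references from the trailer's roots (/Root, /Info);
    anything unvisited is an orphan and can be dropped.
    `refs` maps object id -> list of referenced object ids."""
    live: set[int] = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in live:
            continue
        live.add(obj)
        stack.extend(refs.get(obj, []))
    return live
```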
Reduce volume first (images, fonts), then collapse duplicates, then discard what nothing needs. Reverse the order and you lose savings: run font deduplication before CFF conversion and the identical font programs that conversion exposes never get matched; re-encode images before the early dedup pass and byte-identical copies come out of the encoder as distinct streams, with nothing left to merge.
Each line above hides a decision tree. “Recompress images” means deciding whether the stream is a real photograph or a 1-bit stencil mask, what to do with a separation channel destined for a printing press (leave it), what the image’s effective on-page DPI actually is, and whether fine detail justifies bumping JPEG quality from 75 to 85.
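The shape of that tree, sketched; every name and threshold here is illustrative, not pdfcompressor's actual API:

```python
from dataclasses import dataclass

@dataclass
class ImageInfo:
    bits_per_component: int
    colorspace: str          # e.g. "DeviceRGB", "Separation"
    effective_dpi: float
    has_fine_detail: bool    # e.g. an edge-density heuristic

def plan_image(img: ImageInfo):
    """Illustrative shape of the phase-1 decisions described above."""
    if img.colorspace == "Separation":
        return ("keep",)                      # press-bound spot channel: leave it
    if img.bits_per_component == 1:
        return ("bilevel",)                   # stencil mask: lossless bilevel codec
    target_dpi = min(img.effective_dpi, 300)  # downsample cap (assumed threshold)
    quality = 85 if img.has_fine_detail else 75
    return ("jpeg", quality, target_dpi)
```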
On a dataset of 7.9 million PDFs, the algorithm produces output at roughly 41% of input size, with fewer than 0.01% of files showing visible damage and an average SSIM of 0.9996.