
What’s inside a PDF and why it gets so big

[Diagram: a PDF as a filesystem in a single file. Numbered objects connected by references; pages; data streams (images, fonts); dictionaries (metadata); font programs (embedded or a reference to a standard one); an xref table recording object offsets.]

What a PDF is

A PDF is a small filesystem packed into a single file. Numbered objects sit inside it; an xref table at the end records where each one lives. Four kinds of object do most of the work:

- Pages, which name the resources they draw and point at their content.
- Streams, which carry the heavy data: images, content instructions, font programs.
- Dictionaries, which hold metadata and wire everything together with references.
- Fonts, which embed a font program or reference a standard one.

Opening a page runs a tiny program: translate the coordinate system, set font X, draw this text, place image Y, fill this rectangle. The instructions are featherweight; the resources they point to are what blow up the file.
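That "tiny program" is a content stream: operands followed by a postfix operator, over and over. A minimal sketch of the split between featherweight instructions and heavy named resources — the stream text is a made-up example, and the operator set here covers only the operators it uses:

```python
# A page's content stream is a postfix program: operands, then an operator.
# The operators are tiny; the named resources (/F1, /Im1) are the heavy part.
stream = b"""
1 0 0 1 50 700 cm
BT /F1 12 Tf (Hello) Tj ET
/Im1 Do
0 0 100 50 re f
"""

OPERATORS = {b"cm", b"BT", b"Tf", b"Tj", b"ET", b"Do", b"re", b"f"}

def tokenize(data: bytes):
    """Split a content stream into (operands, operator) instructions."""
    operands, program = [], []
    for tok in data.split():
        if tok in OPERATORS:
            program.append((operands, tok.decode()))
            operands = []
        else:
            operands.append(tok.decode())
    return program

for operands, op in tokenize(stream):
    print(op, operands)
```

Eight instructions, a few dozen bytes — while `/F1` may name a multi-megabyte font program and `/Im1` a multi-megabyte image.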

Where the bloat comes from

Take any 30-megabyte PDF apart and you find some combination of four problems.

First, images stored at their original resolution. A phone camera captures 4032×3024; a scanner produces 600 DPI. The page renders at ten centimeters wide. Most of those pixels never make it to a screen or a printer.
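The waste is easy to quantify. Effective on-page DPI is just pixels divided by physical inches, using the numbers from the paragraph above:

```python
def effective_dpi(pixels: int, size_cm: float) -> float:
    """Pixels across the image divided by inches across the page."""
    return pixels / (size_cm / 2.54)

# The 4032-pixel-wide phone photo shown 10 cm wide on the page:
dpi = effective_dpi(4032, 10)
print(round(dpi))          # ~1024 DPI on a page that needs ~150 for screens
```

At roughly 1024 effective DPI against a ~150 DPI screen target, better than six of every seven pixels in each dimension are invisible.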

Second, untrimmed fonts. The generator embedded all of Helvetica Neue — 2000 glyphs — for a page that uses 40 letters.
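The subsetting arithmetic, as a sketch — the page text and glyph count here are illustrative, not taken from a real file:

```python
# Sketch of the subsetting bookkeeping: collect the characters the document
# actually draws, then keep only those glyphs.
page_text = "What a PDF is. A PDF is a small filesystem packed into a single file."

embedded_glyphs = 2000                         # full Helvetica Neue, per the text
used = {c for c in page_text if not c.isspace()}

print(len(used), "glyphs needed of", embedded_glyphs)
print(f"~{100 * (1 - len(used) / embedded_glyphs):.0f}% of the font program is ballast")
```

In practice a library such as fontTools' subset module does the real work: pruning charstrings, rewriting the cmap, and renumbering glyph ids.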

Third, lazy stream compression. Most writers run Flate at level 1 or 3 because they are tuned for speed. Decompression cost does not depend on the compression level, so re-deflating at level 9 is free savings for every future reader.
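Recompression is a few lines with the standard library. A sketch — inflate, re-deflate at maximum level, and keep whichever result is smaller so the pass can never make a stream grow:

```python
import zlib

def recompress(stream: bytes) -> bytes:
    """Inflate, re-deflate at maximum level, keep whichever is smaller."""
    raw = zlib.decompress(stream)
    best = zlib.compress(raw, 9)
    return best if len(best) < len(stream) else stream

# A writer tuned for speed compresses at level 1:
raw = b"0 0 100 50 re f " * 500
fast = zlib.compress(raw, 1)
tight = recompress(fast)
print(len(fast), "->", len(tight))
```

The round-trip is lossless: readers see byte-identical data after inflation, only the stored size changes.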

Fourth, duplicates and trailing junk: the same image embedded ten times because it appears on ten pages, stale object versions left behind after editing, leftover XFA forms, ICC profiles, metadata, bytes after the final %%EOF.

The ten phases of pdfcompressor

The algorithm walks the file in ten steps; the order is load-bearing.

0. Early deduplication: hash streams and merge identical ones before any re-encoding mutates them.
1. Images: classify, choose codec, downsample.
2. Standard font replacement: drop embedded Helvetica/Times — readers ship them.
3. Font subsetting: strip glyphs that never appear in the document.
4. Font deduplication: especially after CFF conversion exposes identical font programs.
5. Stream recompression: Flate at maximum level.
6. Orphan removal: delete objects nothing references.
7. Object streams: pack many small objects into one compressed block.
8. Content streams: light cleanup of drawing commands.
9. Stripping: remove metadata, XFA, ICC profiles, dead annotations.
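Phase 0 is content-addressing: hash each stream's bytes and point every reference at the first identical copy. A minimal sketch — the object model (ids mapped to raw bytes) and the stand-in JPEG bytes are hypothetical:

```python
import hashlib

def dedupe(streams: dict[int, bytes]) -> tuple[dict[int, int], int]:
    """Map each object id to the first object with identical bytes."""
    first_by_hash: dict[str, int] = {}
    remap: dict[int, int] = {}
    for obj_id, data in sorted(streams.items()):
        digest = hashlib.sha256(data).hexdigest()
        remap[obj_id] = first_by_hash.setdefault(digest, obj_id)
    saved = sum(len(streams[i]) for i, c in remap.items() if c != i)
    return remap, saved

# The same logo embedded on three pages as three separate objects:
logo = b"\xff\xd8\xff\xe0..."      # stand-in bytes, not a real JPEG
streams = {4: logo, 9: logo, 12: b"<< /Type /Page >>", 15: logo}
remap, saved = dedupe(streams)
print(remap)
print(saved, "bytes recoverable by rewriting references")
```

Running this before image re-encoding matters: lossy encoders are not deterministic across parameter choices, and once two identical streams have been re-encoded separately they may never hash equal again.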

Reduce volume first (images, fonts), then collapse duplicates, then discard what nothing needs. Reverse the order and you lose savings: run font deduplication before CFF conversion and the pass finds nothing to merge, because the font programs only become byte-identical after conversion.

Each line above hides a decision tree. “Recompress images” means deciding whether the stream is a real photograph or a 1-bit stencil mask, what to do with a separation channel destined for a printing press (leave it), what the image’s effective on-page DPI actually is, and whether fine detail justifies bumping JPEG quality from 75 to 85.
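The image decision tree described above can be sketched as a single dispatch function. The thresholds — a 150 DPI target, the 1.5× downsampling slack, quality 75 versus 85 — are illustrative stand-ins, not the tool's exact values:

```python
# Sketch of the per-image decision tree. Thresholds are illustrative.
def plan(kind: str, effective_dpi: float, fine_detail: bool) -> str:
    if kind == "separation":       # press-bound channel: leave untouched
        return "keep as-is"
    if kind == "stencil":          # 1-bit mask: lossless bilevel encoding
        return "encode as bilevel, no downsampling"
    # A photograph: downsample toward the target, then pick JPEG quality.
    action = "downsample to ~150 DPI, " if effective_dpi > 150 * 1.5 else ""
    quality = 85 if fine_detail else 75
    return f"{action}re-encode JPEG q={quality}"

print(plan("photo", 1024, fine_detail=False))
print(plan("separation", 600, fine_detail=True))
```

The point of the `kind` check coming first: the costliest mistakes are categorical, not parametric — JPEG-encoding a stencil mask or resampling a separation channel damages the file no matter how well the quality knob is tuned.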

On a dataset of 7.9 million PDFs, the algorithm produces output at roughly 41% of input size, with fewer than 0.01% of files showing visible damage and an average SSIM of 0.9996.
