← Blog

Streams, duplicates, and junk

Deduplication: 80 logo copies → 1 + 80 references

  before:
    page 1:  [logo: 24 KB]
    page 2:  [logo: 24 KB]
    page 3:  [logo: 24 KB]
    …
    page 80: [logo: 24 KB]
    total: 1920 KB

  after:
    [logo: 24 KB] stored once
    page 1  → ref logo
    page 2  → ref logo
    …
    page 80 → ref logo
    total: ~25 KB (savings ×75)

After images shrink and fonts get trimmed, the file still contains thousands of small objects: drawing commands, dictionaries, width tables, metadata. Together they weigh less than a single photograph, but this is where the last 5–15% of savings lives, and where commercial optimizers usually beat open-source ones.

Stream recompression

Almost everything inside a PDF is stored in streams — blocks of bytes tagged with the algorithm that compressed them. In 90% of cases that algorithm is Flate (the same deflate/zlib compression that ZIP and gzip use). Flate has compression levels 0–9, and most tools sit at the low or middle end:

pdfcompressor decompresses every stream and recompresses it at level 9. For small streams that yields 5–10%; for large text streams sometimes 20%.

The implementation uses libdeflate, which is roughly twice as fast as standard zlib at the same levels and occasionally finds slightly smaller output thanks to a better repetition search.
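The recompression step can be sketched with the stdlib's zlib (pdfcompressor itself links libdeflate, but the logic is the same): decode each Flate stream back to its raw bytes, re-encode at level 9, and keep the result only if it shrank.

```python
import zlib

def recompress_stream(raw: bytes) -> bytes:
    """Re-encode one Flate stream at maximum effort.

    Sketch using stdlib zlib; pdfcompressor uses libdeflate instead,
    which is faster and occasionally finds smaller output.
    """
    original = zlib.decompress(raw)          # the stream's real bytes
    best = zlib.compress(original, level=9)  # maximum deflate effort
    # Keep the recompressed bytes only if they actually shrank:
    # a producer that already compressed well loses nothing.
    return best if len(best) < len(raw) else raw
```

A stream written at level 1 by a hurried generator typically shrinks by several percent; a stream already at level 9 passes through unchanged.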

Object deduplication

The second large win is removing byte-identical objects. Examples in the wild:

Hash each object and merge the matches. The pitfalls:

Hashing what, exactly

Hashing bytes as they sit in the file (compressed) merges nothing — semantically identical streams end up with different Flate levels, different filter orderings, different padding. So:

  1. Decompress the stream to its original bytes.
  2. Hash those bytes plus the dictionary keys, excluding /Length, /Filter, and /DecodeParms — those describe storage, not meaning.
  3. For dictionaries without streams, hash the contents minus the same housekeeping fields.

Minimum object size to participate is 64 bytes. Below that, lookup overhead exceeds the saving.

Collision check

Two different objects can hash to the same value. When hashes match, the compressor byte-compares the raw content before merging. Collapse two similar-but-not-identical pages and the document is visibly broken.
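The hash-then-verify flow might look like the following sketch. The object shapes here are hypothetical, not pdfcompressor's actual model: housekeeping keys stay out of the hash, sub-64-byte objects skip the table, and a hash match is confirmed byte-for-byte before any merge.

```python
import hashlib

STORAGE_KEYS = {"/Length", "/Filter", "/DecodeParms"}  # storage, not meaning
MIN_SIZE = 64  # smaller objects cost more to look up than they save

def dedup_key(obj_dict, stream_bytes=None):
    """Hash an object's meaning: decoded stream bytes plus dictionary
    entries, excluding the housekeeping fields."""
    h = hashlib.sha256()
    for key in sorted(k for k in obj_dict if k not in STORAGE_KEYS):
        h.update(key.encode())
        h.update(repr(obj_dict[key]).encode())
    if stream_bytes is not None:
        h.update(b"\x00stream\x00")
        h.update(stream_bytes)
    return h.digest()

def merge_duplicates(objects):
    """objects: iterable of (obj_id, dict, decoded_stream_or_None).
    Returns {duplicate_id: canonical_id}. A hash match alone is not
    enough; the content is byte-compared before merging."""
    seen, remap = {}, {}
    for obj_id, d, payload in objects:
        size = len(payload) if payload is not None else len(repr(d))
        if size < MIN_SIZE:
            continue  # too small to be worth merging
        meaning = ({k: v for k, v in d.items() if k not in STORAGE_KEYS},
                   payload)
        match = seen.get(dedup_key(d, payload))
        if match is not None and match[1] == meaning:  # collision check
            remap[obj_id] = match[0]
        else:
            seen[dedup_key(d, payload)] = (obj_id, meaning)
    return remap
```

Note that two streams with different /Length or /Filter entries still merge, because only the decoded bytes and the meaningful keys enter the comparison.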

What can’t be deduplicated

Some objects are dangerous to merge even when their bytes match:

The compressor explicitly excludes both classes from merge candidates.

Two passes

Deduplication runs twice:

Object streams

A PDF contains many tiny dictionary objects: pages, catalogs, links, table-of-contents entries. In the classic file layout, each one carries its own N M obj … endobj wrapper plus a row in the xref table. A 1000-page document with 10 annotations per page produces 11,000 such wrappers.

PDF 1.5 lets the compressor pack many small objects into a single compressed stream (an object stream). The wrapper overhead disappears, and the stream itself gets Flate-compressed as a unit. On office documents the savings often run 10–15% of the whole file.

pdfcompressor performs this repacking in phase 7, after every other optimization has finished mutating object contents.
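The packing itself follows the ObjStm layout from the PDF 1.5 specification: the stream body opens with pairs of object number and byte offset, /First points just past that header, and the whole payload is Flate-compressed as one unit. A minimal sketch, with hypothetical input shapes rather than pdfcompressor's actual API:

```python
import zlib

def pack_object_stream(objects):
    """objects: list of (obj_number, serialized_body_bytes). Only
    dictionary objects without their own streams may live in an ObjStm.
    Returns (stream_dict, compressed_payload)."""
    pairs, bodies, offset = [], [], 0
    for num, body in objects:
        pairs.append(f"{num} {offset}")   # object number, offset past /First
        bodies.append(body)
        offset += len(body)
    header = (" ".join(pairs) + " ").encode()
    payload = header + b"".join(bodies)
    stream_dict = {
        "/Type": "/ObjStm",
        "/N": len(objects),       # how many objects are packed
        "/First": len(header),    # where the first object body starts
        "/Filter": "/FlateDecode",
    }
    return stream_dict, zlib.compress(payload, 9)
```

The per-object `N M obj … endobj` wrappers disappear entirely; the cross-reference data instead records "object i lives in object stream s at index j".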

Page content optimization

The drawing commands that paint letters and lines onto a page form a small language: Tf picks a font, Tj shows a string, re draws a rectangle. PDF generators write this code with varying discipline — extra whitespace, switches to a font that is already active, color changes that set black when black is already current.

Light normalization removes no-op state changes and compacts sequences of small operations. Savings are modest — 1–3% — but they accumulate on large documents.
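The no-op removal can be sketched as a single pass that tracks current graphics state and drops setters that change nothing. The token shapes below are hypothetical, and a real pass must also honor q/Q state saves and string syntax, which this sketch deliberately ignores.

```python
def normalize_content(tokens):
    """Drop no-op state changes from a tokenized content stream.

    Toy model: each token is (operator, operand_string), e.g.
    ('Tf', '/F1 12'). Real content streams also need q/Q handling:
    a state save/restore invalidates this flat tracking.
    """
    out, state = [], {}
    for op, args in tokens:
        if op in ("Tf", "rg", "RG", "g", "G"):  # font / color setters
            if state.get(op) == args:
                continue                         # already current: drop
            state[op] = args
        out.append((op, args))
    return out
```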

Junk at the end of the file

PDF allows appending. When an editor changes a document, it often doesn’t rewrite the whole file; it tacks an incremental update onto the end — new versions of objects plus a new xref table after the old %%EOF. The old objects technically remain, just unreferenced.

This mechanism exists for fast saving and for digital signatures (the signed prefix doesn’t move, so the signature stays valid). For compression, it’s pure dead weight.

pdfcompressor:

  1. Detects whether the file has a tail beyond its last referenced object.
  2. Verifies that removing the tail leaves page count, page metadata, and document structure intact.
  3. Trims it.
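One narrow case of step 3, junk bytes appended after the final %%EOF, can be sketched directly; the structural verification of step 2 is what makes real trimming safe and is only noted in a comment here.

```python
def trim_trailing_junk(pdf_bytes: bytes) -> bytes:
    """Drop stray bytes past the final %%EOF marker.

    Sketch only: before trimming anything, pdfcompressor verifies that
    page count, page metadata, and document structure stay intact.
    """
    end = pdf_bytes.rfind(b"%%EOF")
    if end == -1:
        return pdf_bytes  # no marker at all; don't touch a malformed file
    end += len(b"%%EOF")
    # Preserve the marker's own line ending, if any.
    if pdf_bytes[end:end + 2] == b"\r\n":
        end += 2
    elif pdf_bytes[end:end + 1] in (b"\r", b"\n"):
        end += 1
    return pdf_bytes[:end]
```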

Linearized PDFs — the variant with objects reordered for fast network loading — get separate handling. Their xref table sometimes arrives damaged in transit. The compressor can rebuild the structure by parsing the object stream directly. Without that, around 1% of files in large archives would test as “broken” despite being perfectly readable, just with a corrupted index.

The final touch — stripping

At the end, after everything else has run, the compressor removes content that the document usually doesn’t need:

The order matters: stripping comes after font deduplication, not before. Reverse those and converted fonts that could have merged get deleted as “unreferenced,” after which dedup finds no matches.

What this adds up to across a file

On 1073 office PDFs (~634 MB), this final pass alone — Flate recompression, deduplication, object streams, stripping — saves another 15–20% on top of images and fonts. Combined with the work in the previous articles, the total on this office-heavy dataset comes to about 57.5% of the original at SSIM 0.9996. (On the much larger 7.9M-file SafeDocs run, with its higher share of scan-heavy PDFs where JBIG2 dominates, the average lands closer to 41%.)

Object streams: many small objects in one compressed block

  before: 11,000 wrappers + 11,000 xref rows
    1 0 obj <<…>> endobj
    2 0 obj <<…>> endobj
    3 0 obj <<…>> endobj
    … × 11,000

  after: one compressed stream
    [ObjStm] — 11,000 objects Flate-compressed as one block
    10–15% of the whole file saved