The safe zone — what we don’t touch
The rules that say “don’t touch this” matter as much as the transformations themselves. They decide whether running the compressor over an archive of millions of PDFs leaves you with usable files or with a small but real number of broken ones.
Why the boundary exists
Lossy compression assumes a 2% brightness shift goes unnoticed. For a holiday photo, fine. For a PDF, often not — a single file can contain:
- a contract page where moving a signature a few pixels changes its legal effect;
- a drawing labelled “0.5 mm line thickness” whose stroke disappears if you downsample;
- a map with a color legend, where each shade encodes a numeric concentration;
- a diagram printed in Pantone 2945 C, which no JPEG path can preserve;
- a scan of a historical document where a stain is data, not noise.
The compressor can’t know what’s on any given page, so it relies on formal markers in the PDF and backs off whenever a marker says “be careful.”
What gets recognized as untouchable
Calibrated color spaces
PDF distinguishes DeviceRGB (raw numbers) from CalRGB, Lab, and ICCBased (numbers attached to a color model). Calibrated color isn’t three numbers; it’s a physically defined shade. JPEG works in YCbCr and breaks the precision of any calibration. The compressor sees the colorspace tag and disables JPEG for these images, even when the savings would be substantial.
Separation and DeviceN
Spot-color channels — Pantone, metallic inks, varnish. They tell the press how much of each special ink to lay down. Lossy compression and downsampling are both off. The image passes through unchanged.
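The two color-space rules above can be sketched as a small lookup. This is an illustration, not the compressor's actual code: the names ImagePolicy and policy_for_colorspace are hypothetical, and two behaviours are assumptions of the sketch rather than statements from the text (downsampling stays allowed for calibrated spaces, and any family not listed here is treated as untouchable).

```python
from dataclasses import dataclass

# Color space family names as they appear in PDF files.
DEVICE = frozenset({"DeviceRGB"})
CALIBRATED = frozenset({"CalRGB", "Lab", "ICCBased"})
SPOT = frozenset({"Separation", "DeviceN"})


@dataclass(frozen=True)
class ImagePolicy:
    allow_jpeg: bool
    allow_downsample: bool


def policy_for_colorspace(family: str) -> ImagePolicy:
    """Map a color space family name to what the compressor may do."""
    name = family.lstrip("/")  # accept "/ICCBased" as well as "ICCBased"
    if name in SPOT:
        # Spot-color channels pass through unchanged.
        return ImagePolicy(allow_jpeg=False, allow_downsample=False)
    if name in CALIBRATED:
        # JPEG is off; keeping downsampling on is an assumption of this sketch.
        return ImagePolicy(allow_jpeg=False, allow_downsample=True)
    if name in DEVICE:
        # Raw device numbers: lossy paths stay available.
        return ImagePolicy(allow_jpeg=True, allow_downsample=True)
    # Unrecognized family: assumed here to fall back to "leave it alone".
    return ImagePolicy(allow_jpeg=False, allow_downsample=False)
```

The conservative fallback for unknown families mirrors the general stance of the rules: a marker the code does not understand is treated like a marker that says "be careful."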
ImageMask
A 1-bit stencil. Its pixels say “fill here with the current color”; it isn’t an image, it’s a shape. Drop its DPI from 600 to 200 and edges go visibly jagged. Geometry is preserved in full.
Very small images
Images smaller than 32 pixels on a side don’t go through JPEG — three or four 8×8 blocks across the image is too few to hide quantization artifacts. Images of 1024 pixels or fewer in total go through Flate; the JPEG header alone consumes more bytes than Flate’s entire output.
1-bit indexed and two-color palettes
Almost always engineering drawings or schematics. One-pixel-wide lines are critical. Flate, no resolution change.
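The ImageMask, small-image, and 1-bit rules above can be read as one ordered decision. The sketch below is hypothetical: the function name, the Decision type, and the codec labels are invented for illustration. It also assumes that "smaller than 32 pixels on a side" means the shorter side, that images barred from JPEG fall back to Flate, and that small images are not resampled; the text does not state those details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    codec: str      # "keep", "flate", or "jpeg"
    resample: bool  # whether the resolution may be changed


def choose_image_path(width: int, height: int,
                      bits_per_component: int = 8,
                      palette_size: Optional[int] = None,
                      is_image_mask: bool = False) -> Decision:
    """Apply the geometry-based exclusions in a fixed order."""
    if is_image_mask:
        # A stencil is a shape, not a picture: geometry is preserved in full.
        return Decision("keep", False)
    if bits_per_component == 1 or palette_size == 2:
        # Line art: one-pixel-wide lines must survive, so Flate and no resampling.
        return Decision("flate", False)
    if width * height <= 1024:
        # The JPEG header alone would outweigh the Flate output.
        return Decision("flate", False)
    if min(width, height) < 32:
        # Fewer than four 8x8 blocks across: too few to hide quantization.
        return Decision("flate", False)
    return Decision("jpeg", True)
```

For example, a 32×33 image is the smallest case here that remains eligible for JPEG: it has more than 1024 pixels and neither side is under 32.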
Fonts whose content stream won’t parse
If the parser can’t fully resolve which characters are drawn, the font isn’t subset. Better extra weight than missing letters.
Page content streams and Resources
Never deduplicated, even when bytes match exactly. A later edit of one page would propagate to the others sharing the same object.
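A minimal sketch of deduplication that honours this exclusion, assuming objects are available as (kind, payload) pairs keyed by object id. The kind labels, the PROTECTED_KINDS set, and the find_duplicates name are hypothetical; the hash is only a shortcut for grouping candidates, and a merge is recorded only after a full byte comparison.

```python
import hashlib
from collections import defaultdict

# Kinds that are never merged, even when their bytes match exactly.
PROTECTED_KINDS = frozenset({"page_content", "resources"})


def find_duplicates(objects):
    """Return {duplicate_id: canonical_id} for objects that are safe to merge.

    objects maps an object id to a (kind, payload_bytes) tuple.
    """
    buckets = defaultdict(list)
    for obj_id in sorted(objects):
        kind, payload = objects[obj_id]
        if kind in PROTECTED_KINDS:
            continue  # a shared page object would let one edit leak to other pages
        key = (kind, hashlib.sha256(payload).digest())
        buckets[key].append(obj_id)

    remap = {}
    for ids in buckets.values():
        canonical = ids[0]
        for other in ids[1:]:
            # The hash only nominates candidates; the bytes decide.
            if objects[other][1] == objects[canonical][1]:
                remap[other] = canonical
    return remap
```

With two identical image objects and two identical page content streams as input, only the images end up in the returned mapping.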
What does get removed — with constraints
A handful of things are fair game, but each comes with its own guardrail:
- XMP metadata is removed, except in PDF/A mode, where the standard requires it.
- XFA forms are removed because no current reader does anything useful with them, and the parallel AcroForm remains.
- Incremental updates (the tail bytes after %%EOF) are removed only after verifying that page count and document structure didn’t change.
- Embedded files (PDFs with attachments) are never touched.
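The guardrail on incremental updates follows a "rewrite, verify, otherwise keep the original" pattern, sketched below. Everything here is hypothetical: rewrite and summarize are caller-supplied stand-ins for a real PDF library (one produces a file without the update history, the other returns a comparable summary such as page count and structure), and counting %%EOF markers is only a rough heuristic for spotting appended revisions, since the marker can in principle occur inside stream data.

```python
from typing import Any, Callable


def trim_history(original: bytes,
                 rewrite: Callable[[bytes], bytes],
                 summarize: Callable[[bytes], Any]) -> bytes:
    """Drop incremental-update history only if the document summary is unchanged."""
    if original.count(b"%%EOF") < 2:
        # A single marker suggests there is no appended revision to remove.
        return original
    candidate = rewrite(original)
    if summarize(candidate) != summarize(original):
        # Page count or structure changed: keep the original bytes.
        return original
    return candidate
```

Returning the untouched original on any mismatch means a wrong guess costs some file size, never content.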
What this costs in numbers
Internal test on 7.9 million SafeDocs files:
- share of files with visible damage: under 0.01%;
- average SSIM: 0.9996;
- minimum SSIM: 0.9924 (the worst single output in the entire run);
- compression ratio: 41% of the original on average.
The minimum SSIM is the load-bearing number. At 0.9924 the worst case in the run is still indistinguishable from the original to the naked eye. Pushing for another 5 percentage points of size reduction drops worst-case SSIM toward 0.96 — visible degradation on individual files. We refuse that trade.
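To illustrate why the minimum matters more than the mean, here is a small sketch that summarizes a run and gates on the worst score. The function name and the floor parameter are hypothetical; the text reports an observed minimum of 0.9924 but does not state a formal acceptance threshold, so any value passed as floor is an assumption of the caller.

```python
from statistics import fmean


def ssim_report(scores, floor):
    """Summarize per-file SSIM scores and gate on the worst one."""
    worst = min(scores)
    return {
        "mean": fmean(scores),
        "min": worst,
        # The mean can hide a single bad file; the minimum cannot.
        "accept": worst >= floor,
    }


# Nine near-perfect files and one visibly degraded one.
scores = [0.9999] * 9 + [0.96]
report = ssim_report(scores, floor=0.99)  # 0.99 is an illustrative floor only
```

In this example the mean stays above 0.99 while the run is still rejected, because one file fell to 0.96.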
This is optimization for archives nobody manually reviews. At 41% size and 0.9996 mean SSIM, the compressed files can replace the originals.
The principle behind the rules
Defaults are conservative; optimization is adaptive.
- Base JPEG quality is 75, but rises to 85 wherever the gradient detector flags important detail.
- Target DPI is 200/300, but downsampling kicks in only when an image overshoots the target by a factor of 1.4.
- Standard fonts are removed only when the bytes match the standard.
- Objects are deduplicated only when they survive a full byte comparison.
- Tails are trimmed only when page count is verified unchanged.
Every rule reads “do it if you’re sure; if not, leave it alone.” That’s how 7.9 million files get processed unsupervised, with visible problems in fewer than one in ten thousand.
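The first two rules in the list can be sketched as threshold checks. The function names and signatures are hypothetical. The sketch assumes the gradient detector's verdict arrives as a boolean, that the target DPI is chosen elsewhere and passed in, and that an overshoot of exactly 1.4 counts as triggering; the text does not settle that boundary case.

```python
def jpeg_quality(has_important_detail: bool) -> int:
    """Base quality 75, raised to 85 where the detector flags important detail."""
    return 85 if has_important_detail else 75


def should_downsample(effective_dpi: float, target_dpi: float,
                      overshoot: float = 1.4) -> bool:
    """Downsample only when the image exceeds the target by the overshoot factor."""
    return effective_dpi >= target_dpi * overshoot
```

Under these assumptions a 400 DPI image with a 300 DPI target is left alone (the trigger is 420 DPI), while a 600 DPI image with the same target is downsampled.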