
The safe zone — what we don’t touch

[Figure: What pdfcompressor deliberately does not touch. Calibrated color spaces (CalRGB, Lab, ICCBased): JPEG disabled. Separation / DeviceN spot color (Pantone): never touched. ImageMask (1-bit stencil): no downsampling. Images < 32 px (JPEG 8×8 blocks = artifacts): Flate only. 1-bit indexed (drawings, engineering): no downsampling. Embedded attachments (files inside the PDF): passed through. Rule: "do it if you're sure; if in doubt, leave it alone."]

The rules that say “don’t touch this” matter as much as the transformations themselves. They decide whether running the compressor over an archive of millions of PDFs leaves you with usable files or with a small but real number of broken ones.

Why the boundary exists

Lossy compression assumes a 2% brightness shift goes unnoticed. For a holiday photo, fine. For a PDF, often not — a single file can contain:

The compressor can’t know what’s on any given page, so it relies on formal markers in the PDF and backs off whenever a marker says “be careful.”

What gets recognized as untouchable

Calibrated color spaces

PDF distinguishes DeviceRGB (raw numbers) from CalRGB, Lab, and ICCBased (numbers attached to a color model). Calibrated color isn’t three numbers — it’s a physically defined shade. JPEG works in YCbCr and breaks the precision of any calibration. The compressor sees the colorspace tag and disables JPEG for these images, even when the savings would be substantial.
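As a rough illustration of the color space check described above, here is a minimal Python sketch. The function names and the plain-Python representation of a color space (a bare name, or a list whose first element is the family name) are assumptions for the example, not pdfcompressor's actual code. It only inspects the top-level family and does not look inside wrappers such as Indexed.

```python
# Color space families the text names as calibrated.
CALIBRATED = {"CalRGB", "Lab", "ICCBased"}


def color_space_family(cs):
    """Return the family name of a PDF color space.

    Assumed input shape: either a bare name ("DeviceRGB" or "/DeviceRGB")
    or a sequence whose first element is the family name, for example
    ["ICCBased", <profile stream>].
    """
    if isinstance(cs, (list, tuple)):
        return str(cs[0]).lstrip("/") if cs else ""
    return str(cs).lstrip("/")


def jpeg_allowed(cs):
    """JPEG is only considered when the color space is not calibrated."""
    return color_space_family(cs) not in CALIBRATED


print(jpeg_allowed("DeviceRGB"))            # True
print(jpeg_allowed(["ICCBased", object()]))  # False
```

The check is a deny-list on the tag alone: it never estimates how much JPEG would save, which matches the rule that calibrated images are skipped even when the savings would be substantial.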

Separation and DeviceN

Spot-color channels — Pantone, metallic inks, varnish. They tell the press how much of each special ink to lay down. Lossy compression and downsampling are both off. The image passes through unchanged.

ImageMask

A 1-bit stencil. Its pixels say “fill here with the current color”; it isn’t an image, it’s a shape. Drop its DPI from 600 to 200 and edges go visibly jagged. So masks are never downsampled, and their geometry is preserved in full.

Very small images

Images smaller than 32 pixels on a side don’t go through JPEG — three or four 8×8 blocks across the image is too few to hide quantization artifacts. Images of 1024 pixels or fewer in total go through Flate; the JPEG header alone consumes more bytes than Flate’s entire output.
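The two size thresholds above can be sketched as a small decision function. This is an illustration only: the function name is hypothetical, and reading "smaller than 32 pixels on a side" as "either side under 32" is an assumption.

```python
MIN_JPEG_SIDE = 32        # from the text: under 32 px on a side, no JPEG
FLATE_ONLY_PIXELS = 1024  # from the text: 1024 total pixels or fewer use Flate


def small_image_codec(width, height):
    """Return "flate" when a size rule forces Flate, else None.

    None means the size rules have no opinion and other checks decide.
    Assumption: a single side under 32 px is enough to rule out JPEG.
    """
    if width * height <= FLATE_ONLY_PIXELS:
        return "flate"
    if min(width, height) < MIN_JPEG_SIDE:
        return "flate"
    return None


for size in [(16, 500), (32, 32), (33, 32), (640, 480)]:
    print(size, small_image_codec(*size))
```

A 32×32 image passes the side check but lands exactly on the 1024-pixel limit, so under these assumptions it still goes to Flate.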

1-bit indexed and two-color palettes

Almost always engineering drawings or schematics. One-pixel-wide lines are critical. Flate, no resolution change.
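The rules for spot colors, masks, and 1-bit images boil down to switching individual operations off. A minimal sketch of that idea follows. The `Plan` fields and the dictionary keys describing an image are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    lossy: bool        # may JPEG be used?
    downsample: bool   # may the resolution change?
    passthrough: bool  # is the stream copied unchanged?


SPOT_FAMILIES = {"Separation", "DeviceN"}


def plan_for(image):
    """Pick a plan from formal markers only.

    `image` is a dict with hypothetical keys: "family" (color space family),
    "bits" (bits per component), "image_mask" (bool), "palette_size" (int).
    """
    if image.get("family") in SPOT_FAMILIES:
        # Spot color: lossy and downsampling off, image passes through.
        return Plan(lossy=False, downsample=False, passthrough=True)
    if image.get("image_mask"):
        # Stencil: no resolution change. Whether a lossless re-encode is
        # allowed is not stated in the text; this sketch assumes it is.
        return Plan(lossy=False, downsample=False, passthrough=False)
    if image.get("bits") == 1 or image.get("palette_size") == 2:
        # Line art: Flate, no resolution change.
        return Plan(lossy=False, downsample=False, passthrough=False)
    return Plan(lossy=True, downsample=True, passthrough=False)


print(plan_for({"family": "Separation"}))
print(plan_for({"family": "DeviceRGB", "bits": 8}))
```

Ordering the checks from most to least restrictive means that an image matching several markers always gets the most cautious plan.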

Fonts whose content stream won’t parse

If the parser can’t fully resolve which characters are drawn, the font isn’t subset. Better extra weight than missing letters.
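The fallback described above can be sketched as follows. Everything here is hypothetical scaffolding: `ContentParseError`, the `used_glyphs` callable, and the string "streams" stand in for whatever the real parser provides.

```python
class ContentParseError(Exception):
    """Raised by the (hypothetical) parser when a stream can't be resolved."""


def glyphs_to_keep(font_name, content_streams, used_glyphs):
    """Return the set of glyphs to keep, or None to keep the whole font.

    `used_glyphs(stream, font_name)` is an assumed callable that returns the
    glyph ids drawn with this font, or raises ContentParseError when it
    cannot fully resolve the stream.
    """
    keep = set()
    for stream in content_streams:
        try:
            keep |= used_glyphs(stream, font_name)
        except ContentParseError:
            # In doubt: no subsetting at all, the font stays complete.
            return None
    return keep


def fake_used_glyphs(stream, font_name):
    # Toy parser for the demo: "?" marks a stream it cannot resolve.
    if "?" in stream:
        raise ContentParseError(stream)
    return set(stream)


print(glyphs_to_keep("F1", ["abc", "cd"], fake_used_glyphs))
print(glyphs_to_keep("F1", ["abc", "c?d"], fake_used_glyphs))  # None
```

One unparsable stream is enough to cancel subsetting for the font, even if every other stream parsed cleanly, because the unresolved stream might draw any glyph.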

Page content streams and Resources

Never deduplicated, even when bytes match exactly. If pages shared one object, a later edit to one page would propagate to all the others.
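A sketch of byte-level deduplication with that exclusion, assuming objects arrive as `(obj_id, kind, data)` tuples. The tuple shape and the kind labels `"page_content"` and `"resources"` are invented for the example, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib
from collections import defaultdict

# Kinds that are never merged, even when their bytes are identical.
PROTECTED_KINDS = {"page_content", "resources"}


def duplicate_groups(objects):
    """Group object ids whose bytes match exactly, skipping protected kinds.

    `objects` is an iterable of (obj_id, kind, data) with data as bytes.
    Returns a list of id lists; each list holds two or more duplicates.
    """
    by_digest = defaultdict(list)
    for obj_id, kind, data in objects:
        if kind in PROTECTED_KINDS:
            continue  # leave page content and Resources alone
        key = (kind, hashlib.sha256(data).hexdigest())
        by_digest[key].append(obj_id)
    return [ids for ids in by_digest.values() if len(ids) > 1]


objs = [
    (1, "image", b"same"),
    (2, "image", b"same"),
    (3, "page_content", b"q Q"),
    (4, "page_content", b"q Q"),
    (5, "image", b"other"),
]
print(duplicate_groups(objs))  # [[1, 2]]
```

Objects 3 and 4 are byte-identical but stay separate, so each page keeps its own content stream.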

What does get removed — with constraints

A handful of things are fair game, but each comes with its own guardrail:

What this costs in numbers

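Internal test on 7.9 million SafeDocs files: mean SSIM 0.9996, minimum SSIM 0.9924, output at 41% of original size, and visible degradation in fewer than 0.01% of files.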

The minimum SSIM is the load-bearing number. At 0.9924 the worst case in the run is still indistinguishable from the original to the naked eye. Pushing for another 5 percentage points of size reduction drops worst-case SSIM toward 0.96 — visible degradation on individual files. We refuse that trade.

This is optimization for archives nobody manually reviews. At 41% of the original size and 0.9996 mean SSIM, the compressed files can replace the originals.
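To make the role of the worst case concrete, here is a small sketch of how per-file SSIM scores from a run might be summarized. The function name is hypothetical, and using 0.96 as the floor is an assumption borrowed from the figure the text associates with visible degradation.

```python
from statistics import fmean


def summarize(ssim_scores, floor=0.96):
    """Summarize per-file SSIM scores from one run.

    Returns the mean, the minimum (the number the text calls load-bearing),
    and the share of files scoring below `floor`.
    """
    scores = list(ssim_scores)
    if not scores:
        raise ValueError("no scores to summarize")
    below = sum(1 for s in scores if s < floor)
    return {
        "mean": fmean(scores),
        "min": min(scores),
        "share_below_floor": below / len(scores),
    }


# Toy scores, not data from the test run.
print(summarize([1.0, 0.99, 0.95]))
```

The mean hides outliers: in the toy input above it stays at 0.98 while one file in three falls below the floor, which is why the minimum is the number to watch.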

The principle behind the rules

Defaults are conservative; optimization is adaptive.

Every rule reads “do it if you’re sure; if not, leave it alone.” That’s how 7.9 million files get processed unsupervised, with visible problems in fewer than one file in ten thousand.

[Figure: The price of caution: quality across 7.9M PDFs. Average SSIM 0.9996 (visually indistinguishable); minimum SSIM 0.9924 (worst case across the run); compression to 41% of original size; visible degradation in < 0.01% of files in the run.]