
Images — the heaviest part of a PDF

Figure: image classification determines the codec.
Bitonal (1 bit): text scans, drawings. JBIG2 → CCITT G4 → Flate. DPI: 300.
Indexed palette (up to 256 colors): Flate, lossless only; the palette is critical. DPI: 300.
Color / grayscale: photos, renders, charts. JPEG → Flate (fallback). DPI: 200.

In a typical PDF, 60–90% of the bytes are images. If a file contains even one scan or photo, every other optimization is rounding error.

Classification first

The compressor sorts every image into one of three categories before encoding anything. Every later decision depends on the category.

Bitonal — 1 bit per pixel

Pure black and white: text scans, line drawings, faxes. A pixel is on or off. The right codec is JBIG2, which finds repeated shapes on a page and stores each occurrence as a reference into a shared symbol library. A 300-DPI page of text shrinks from a megabyte to tens of kilobytes.

When JBIG2 isn’t usable — corrupt input, an exotic format — the fallback is CCITT Group 4, the fax codec from 1984, still excellent on binary content. The last resort is plain Flate.

Indexed — palette-based

Each pixel is an index into a palette of up to 256 colors. Color scans, icons, infographics. Lossy codecs are off the table: any pixel-value distortion breaks the index→color mapping and produces nonsense. Only Flate is safe.
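To make the index→color problem concrete, here is a minimal sketch with a made-up four-entry palette (the colors and the `render` helper are illustrative assumptions, not taken from pdfcompressor). Nudging an index by a single step, the kind of small error a lossy codec introduces, swaps the pixel to an unrelated color.

```python
# Hypothetical 4-entry palette; a real one holds up to 256 RGB triples.
PALETTE = [
    (255, 255, 255),  # 0: white
    (200, 0, 0),      # 1: red
    (0, 0, 160),      # 2: blue
    (0, 0, 0),        # 3: black
]


def render(indices):
    """Map palette indices to the RGB colors they stand for."""
    return [PALETTE[i] for i in indices]


original = [1, 1, 1, 1]    # a solid red run
distorted = [1, 2, 1, 0]   # two indices off by one step, as lossy noise might leave them

print(render(original))    # four red pixels
print(render(distorted))   # red, blue, red, white
```

Neighboring index values have no color relationship to each other, which is why a numerically small error is visually arbitrary.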

Color and grayscale

Photos, renders, gradient charts. The workhorse is JPEG, with Flate as the fallback when JPEG can’t apply.
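The three categories and their codec chains can be sketched as a small lookup. This is an illustration under assumptions: the function name, the `Plan` type, and the choice to key on bits per component and the PDF color space name are hypothetical, and the special cases described later in the article are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    category: str
    codecs: tuple  # tried in order; later entries are fallbacks


def classify(bits_per_component: int, color_space: str) -> Plan:
    """Pick a category and codec chain for one image (simplified sketch)."""
    if color_space == "Indexed":
        # Palette images: lossless only.
        return Plan("indexed", ("Flate",))
    if bits_per_component == 1:
        # Pure black and white.
        return Plan("bitonal", ("JBIG2", "CCITT G4", "Flate"))
    # Everything else is treated as a color or grayscale picture.
    return Plan("color/grayscale", ("JPEG", "Flate"))


print(classify(1, "DeviceGray"))
print(classify(8, "Indexed"))
print(classify(8, "DeviceRGB"))
```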

PDF formally supports JPEG 2000 (the JPXDecode filter, added in PDF 1.5 in 2003). It compresses better at high quality and handles more color spaces. In practice, mobile readers stumble on it and the encoder is much slower. pdfcompressor decodes JPEG 2000 when it appears in input, but writes ordinary JPEG for compatibility.

DPI: how many pixels actually reach the eye

A PDF image has an intrinsic resolution in pixels and a placement size on the page in points. Pixels divided by the placed size in inches (one point is 1/72 inch) gives the effective DPI. An image that’s 4000 pixels wide displayed across 10 cm of page renders at roughly 1000 DPI, far more than any output device can use. Screens show maybe 200; quality home printing tops out around 300; office printing sits at 150.
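The arithmetic for the example above, as a short sketch. The function names are illustrative; the only facts used are that a PDF point is 1/72 inch and an inch is 2.54 cm.

```python
POINTS_PER_INCH = 72.0
CM_PER_INCH = 2.54


def cm_to_points(cm: float) -> float:
    """Convert a length on the page from centimetres to PDF points."""
    return cm / CM_PER_INCH * POINTS_PER_INCH


def effective_dpi(pixels: int, placed_points: float) -> float:
    """Pixels along one axis divided by the placed length in inches."""
    return pixels / (placed_points / POINTS_PER_INCH)


width_pt = cm_to_points(10)                   # about 283.5 pt
print(round(effective_dpi(4000, width_pt)))   # 1016
```

An image stretched unevenly has different horizontal and vertical values; which one a given tool compares against its target is a design choice this sketch does not settle.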

Targets in pdfcompressor:

Type        Target DPI
Bitonal     300
Indexed     300
Grayscale   200
Color       200

The color target is lower because color images go through JPEG, which is already lossy; pixels beyond the target carry no detail that survives quantization, they just cost bytes. Bitonal text is stored losslessly and every pixel defines a letter’s shape, so 600→300 DPI is fine but 300→200 starts losing strokes.

Downsampling triggers only when the actual DPI exceeds the target by a factor of 1.4. Without that margin the system would grind 305-DPI images down to 300 for a saving of a few percent while accumulating resampling artifacts.
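The trigger rule can be sketched as follows. The targets and the 1.4 factor come from the text; the function name and the choice to scale exactly to the target once triggered are assumptions made for illustration.

```python
TARGET_DPI = {"bitonal": 300, "indexed": 300, "grayscale": 200, "color": 200}
TRIGGER_FACTOR = 1.4


def downsample_scale(category: str, actual_dpi: float):
    """Return the linear scale factor to apply, or None to leave the image alone."""
    target = TARGET_DPI[category]
    if actual_dpi <= target * TRIGGER_FACTOR:
        return None  # inside the margin: resampling is not worth the artifacts
    return target / actual_dpi  # assumption: resample straight down to the target


print(downsample_scale("bitonal", 305))   # None
print(downsample_scale("bitonal", 600))   # 0.5
print(downsample_scale("color", 1016))    # a bit under 0.2
```

With these numbers, bitonal and indexed images are resampled only above 420 DPI, and grayscale and color images only above 280 DPI.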

GED — why the same image can land at different quality

Setting JPEG quality 75 globally degrades scanned text. Setting 85 globally throws away about 20% of the savings. The compromise in pdfcompressor is GED — Gradient Energy Detection — content-adaptive quality:

  1. Split the image into 16×16 blocks.
  2. For each block, measure how sharply brightness changes between neighboring pixels — the gradient energy.
  3. Sort blocks by that value, take the 95th percentile.
  4. If the 95th percentile is high — meaning the image has plenty of regions with sharp transitions, like text or thin lines — raise JPEG quality from a base of 75 to 85.

The compressor sees that the image contains regions JPEG would visibly damage, and raises the quality of the whole image to protect them. A photo with a smooth background gets quality 75. A scanned page that mixes text and a photograph gets 85.
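The four steps can be sketched in plain Python on a grayscale image stored as a list of rows. Several details here are assumptions, since the article does not specify them: gradient energy is taken as the mean squared difference between horizontally and vertically adjacent pixels inside a block, the percentile uses the nearest-rank method, and the `threshold` of 400 is a placeholder value, not pdfcompressor's.

```python
import math


def block_energy(img, x0, y0, size):
    """Mean squared brightness difference between neighbors within one block."""
    h, w = len(img), len(img[0])
    x1, y1 = min(x0 + size, w), min(y0 + size, h)
    total, count = 0, 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            if x + 1 < x1:  # right-hand neighbor
                total += (img[y][x + 1] - img[y][x]) ** 2
                count += 1
            if y + 1 < y1:  # neighbor below
                total += (img[y + 1][x] - img[y][x]) ** 2
                count += 1
    return total / count if count else 0.0


def ged_quality(img, block=16, threshold=400.0, base=75, raised=85):
    """Pick a JPEG quality from the 95th-percentile block energy (sketch)."""
    h, w = len(img), len(img[0])
    energies = sorted(
        block_energy(img, x, y, block)
        for y in range(0, h, block)
        for x in range(0, w, block)
    )
    rank = max(0, math.ceil(0.95 * len(energies)) - 1)  # nearest-rank percentile
    return raised if energies[rank] >= threshold else base


# A smooth horizontal gradient: neighboring pixels differ by 2.
smooth = [[x * 2 for x in range(64)] for _ in range(64)]
# Flat gray with a band of hard black/white stripes across the top 16 rows.
mixed = [
    [(255 if (x // 2) % 2 else 0) if y < 16 else 128 for x in range(64)]
    for y in range(64)
]

print(ged_quality(smooth))  # 75
print(ged_quality(mixed))   # 85
```

Using a high percentile rather than the mean is what lets a page that is mostly flat but contains some text still qualify for the higher quality.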

What we don’t touch

Several image classes look like images but aren’t, and any “optimization” makes them worse.

Stencil masks (ImageMask) are 1-bit shapes, not pictures: black pixels say “fill here with the current color,” white means “leave alone.” Effectively vectors. Drop their resolution and edges go jagged. Passed through unchanged.

Separation and DeviceN are spot-color channels — Pantone, metallic inks, varnish. JPEG would route them through RGB→YCbCr, destroying the spot-channel information. No lossy operations apply.

Calibrated color spaces — Lab, CalRGB, CalGray. Standard JPEG assumes sRGB-ish input and converts to YCbCr; running Lab through that math discards the calibration. JPEG is disabled for these regardless of any other heuristic.

1-bit indexed images and two-color palettes are almost always engineering drawings with strokes one pixel wide. Downsampling smears those lines into grey haze, so their resolution is left alone.

Images smaller than 32 pixels on either side span at most four 8×8 JPEG blocks along that side. Quantization artifacts become visible to the naked eye, especially across horizontal stripes. Flate is better here.

Truly tiny images, 1024 pixels or fewer in total, also stay on Flate. A baseline JPEG header alone (SOI + JFIF + DQT + SOF0 + DHT + SOS) runs 300–600 bytes; for an image whose Flate output is a few hundred bytes, JPEG is a net loss. On data this small, Flate wins 94.6% of the time.
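Taken together, the exclusions in this section amount to a guard that runs before any codec is chosen. The sketch below is one way to express it; the `Policy` type, the function signature, and the parameter names are hypothetical, and where the text only says JPEG is disabled (calibrated spaces, small images) downsampling is left permitted as an assumption.

```python
from dataclasses import dataclass

SPOT_SPACES = {"Separation", "DeviceN"}
CALIBRATED_SPACES = {"Lab", "CalRGB", "CalGray"}


@dataclass(frozen=True)
class Policy:
    passthrough: bool = False      # copy the image stream unchanged
    allow_jpeg: bool = True
    allow_downsample: bool = True


def policy_for(width, height, color_space, is_image_mask=False, palette_size=None):
    """Decide which operations are permitted for one image (sketch)."""
    if is_image_mask:
        # Stencil masks behave like vectors: leave them untouched.
        return Policy(passthrough=True, allow_jpeg=False, allow_downsample=False)
    if color_space in SPOT_SPACES:
        # Spot-color channels: no lossy operation of any kind.
        return Policy(allow_jpeg=False, allow_downsample=False)
    if palette_size is not None and palette_size <= 2:
        # Two-color palettes are usually line drawings: keep every pixel.
        return Policy(allow_jpeg=False, allow_downsample=False)
    if color_space in CALIBRATED_SPACES:
        return Policy(allow_jpeg=False)
    if min(width, height) < 32 or width * height <= 1024:
        # Too small for JPEG to pay for its own header and block artifacts.
        return Policy(allow_jpeg=False)
    return Policy()


print(policy_for(2000, 1500, "DeviceRGB"))
print(policy_for(24, 600, "DeviceRGB"))
print(policy_for(800, 800, "Separation"))
```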

What this looks like in one document

A single PDF can trigger every branch of the algorithm: JBIG2 cuts a 600-DPI text scan 15×, the line drawing on the next page passes through untouched, one photograph compresses at JPEG quality 75, an adjacent photograph with caption text gets 85, and a 1-bit logo repeated in every page corner is skipped entirely as a mask. The user is never asked to pick a compression mode.

Figure: actual vs. target DPI, and when to downsample.
A 4000 px photo on a 10 cm page has an actual DPI of about 1000.
Targets in pdfcompressor: bitonal 300, indexed 300, grayscale 200, color 200.
Rule: downsample only if actual > target × 1.4; otherwise we oscillate between 305 and 300, accumulating artifacts.