Color in PDF — why “just compress the image” isn’t always possible
PDF defines at least five different ways to describe color, and a compressor that conflates them will turn a carefully prepared document into garbage.
Why several kinds of color exist
Each scenario needs different color information:
- On a screen — RGB values that drive pixel emission.
- Office printer — CMYK values for inks.
- Commercial printing — CMYK plus calibration, so a magazine cover red looks the same on press runs from London, Warsaw, and Singapore.
- Archival storage — color reproducible 50 years from now, even if displays have changed.
- Special inks — “this logo prints in Pantone 186 C, no other red will do.”
Five ways to describe color
DeviceGray, DeviceRGB, DeviceCMYK — just numbers
A pixel is one, three, or four numbers, each 0 to 255 at the usual 8 bits per component, with no tie to a physical color: “red 200” looks bluer on one monitor, yellower on another. This is device-dependent color.
- DeviceGray — single-channel images, grayscale scans.
- DeviceRGB — the standard for ordinary office and web PDFs.
- DeviceCMYK — the standard for print-ready PDFs.
For compression: JPEG fine, no constraints. There’s no calibration to lose.
CalRGB and CalGray — calibrated RGB
The same RGB or gray numbers, plus explicit information about how the device should render them: three white-point coordinates in CIE XYZ space, a gamma, and (for CalRGB) a 3×3 matrix. With that, the reader can convert values into an absolute physical color.
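As a rough illustration of that conversion, here is a minimal sketch of the CalRGB-to-XYZ step as the PDF specification describes it: each component is raised to its own gamma, then mixed through the matrix. The function name and defaults are illustrative, and the white point is left out because it only matters in the later adaptation to an output device.

```python
def cal_rgb_to_xyz(a, b, c,
                   gamma=(1.0, 1.0, 1.0),
                   matrix=(1, 0, 0, 0, 1, 0, 0, 0, 1)):
    """Map CalRGB components (each 0.0 to 1.0) to CIE XYZ.

    gamma  -- the three exponents from the color space's Gamma entry
    matrix -- the nine numbers from its Matrix entry,
              ordered [XA YA ZA XB YB ZB XC YC ZC]
    """
    gr, gg, gb = gamma
    xa, ya, za, xb, yb, zb, xc, yc, zc = matrix
    # Linearize each channel with its own gamma exponent.
    al, bl, cl = a ** gr, b ** gg, c ** gb
    # Mix the linear channels into XYZ.
    x = xa * al + xb * bl + xc * cl
    y = ya * al + yb * bl + yc * cl
    z = za * al + zb * bl + zc * cl
    return (x, y, z)
```

With the default identity matrix and gamma of 1.0 the values pass through unchanged; any other dictionary gives the same three numbers a different physical meaning, which is exactly the information a device-style codec path ignores.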
For compression: be careful. A standard JPEG encoder converts RGB to YCbCr using ITU-R BT.601 coefficients that assume ordinary DeviceRGB. For CalRGB the conversion produces a color shift.
pdfcompressor disables JPEG for images in CalRGB and CalGray, leaving Flate. Larger files, exact shades.
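For reference, the RGB to YCbCr step mentioned above looks roughly like this. It is a sketch of the JFIF full-range transform with BT.601 luma weights, not the code of any particular encoder. The weights are fixed constants, so nothing in this step can take a CalRGB gamma or matrix into account.

```python
def rgb_to_ycbcr(r, g, b):
    """JFIF-style full-range RGB -> YCbCr with BT.601 luma weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round to integers and clamp to the 8-bit range.
    return tuple(max(0, min(255, round(v))) for v in (y, cb, cr))
```

Neutral inputs land on the neutral chroma value 128, for example `rgb_to_ycbcr(128, 128, 128)` gives `(128, 128, 128)`. Chroma is then typically subsampled and quantized, which is where the lossy part happens.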
Lab — physically defined color
Lab (CIE L*a*b*) is device-independent. L* is lightness on a 0–100 scale; a* and b* are signed color-opponent axes (roughly -128 to +127). Its gamut covers every visible color, more than sRGB or Adobe RGB, and it shows up in scientific and archival applications.
For compression: JPEG doesn’t work. JPEG itself is colorspace-agnostic, but the standard encoder pipeline (libjpeg, jpegli) assumes 8-bit unsigned channels and applies an RGB→YCbCr transform. Lab — 0–100 range, signed a/b — doesn’t fit those assumptions; after a naive round-trip the numbers drift, and Lab values come out wrong.
pdfcompressor uses only Flate for Lab. Compression is more modest (3–4× rather than JPEG’s 10–15×), with no color distortion.
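To make the mismatch concrete, here is a small sketch of how 8-bit samples of a Lab image map back to Lab values by linear interpolation. The default a/b range follows the rough figures quoted above and is an assumption; a real file states its own range in the color space dictionary, and the function name is illustrative.

```python
def decode_lab_sample(sample, a_range=(-128.0, 127.0), b_range=(-128.0, 127.0)):
    """Map one 8-bit (L, a, b) sample triple to Lab values."""
    def lerp(s, lo, hi):
        # Sample 0 maps to lo, sample 255 maps to hi.
        return lo + s * (hi - lo) / 255.0

    sl, sa, sb = sample
    return (lerp(sl, 0.0, 100.0), lerp(sa, *a_range), lerp(sb, *b_range))
```

Under these assumptions a stored byte of 0 in the a channel means -128 rather than "no contribution", and neutral sits at byte 128. An encoder that treats the three bytes as R, G, B and mixes them into luma and chroma is blending quantities that were never meant to be mixed.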
ICCBased — same approach with an ICC profile
The most common “correct” model in modern PDFs. The image is described in standard RGB, CMYK, or another space, with an ICC profile attached — a file describing the precise transform into absolute color.
This is what makes pre-press work: the printer receives the PDF, reads its embedded ICC profiles, simulates how the file will look on its presses, and corrects. The profile must be embedded inside the PDF.
For compression:
- If the embedded profile is a widely known one (sRGB, Adobe RGB, ProPhoto), JPEG usually works because its assumptions match.
- If the embedded profile describes a custom space, like a printer’s calibrated CMYK, JPEG can produce shifts.
In pre-press mode (when the file shows signs of being print-ready), pdfcompressor enables JPEG only for standard RGB profiles; everything else uses Flate.
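A first step toward such a check could be reading the profile's header. The sketch below pulls a few fixed-offset fields from the 128-byte ICC header; `jpeg_candidate` is a hypothetical helper, and deciding whether a profile is actually a well-known standard one would need a further comparison (for example against a list of known profiles) that is not shown here.

```python
import struct


def read_icc_header(profile: bytes) -> dict:
    """Extract a few fixed-offset fields from an ICC profile header."""
    if len(profile) < 128:
        raise ValueError("ICC header is 128 bytes; profile is too short")
    if profile[36:40] != b"acsp":
        raise ValueError("missing 'acsp' signature; not an ICC profile")
    (size,) = struct.unpack(">I", profile[0:4])  # big-endian profile size
    return {
        "size": size,
        "device_class": profile[12:16].decode("ascii"),
        # Data color space signature, e.g. 'RGB ', 'CMYK', 'GRAY', 'Lab '.
        "color_space": profile[16:20].decode("ascii").strip(),
    }


def jpeg_candidate(profile: bytes) -> bool:
    """Hypothetical first filter: only RGB profiles are JPEG candidates."""
    return read_icc_header(profile)["color_space"] == "RGB"
```

A CMYK or Lab profile fails this filter immediately; an RGB profile only passes to the next, stricter check.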
Indexed — palette
Pixels are indices into a palette, the palette itself defined in any of the spaces above. Compress with Flate only; no lossy codec is safe, because a decoded index that is off by even one points at an unrelated palette entry.
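A toy example shows why indices cannot tolerate lossy coding. The palette and the "+1 error" below are made up for illustration; a real lossy codec would perturb values less predictably, but the effect is the same.

```python
# A four-entry palette; neighboring indices hold unrelated colors.
palette = [
    (255, 255, 255),  # 0: white
    (200, 16, 46),    # 1: a red
    (0, 0, 0),        # 2: black
    (0, 94, 184),     # 3: a blue
]


def render(indices, palette):
    """Look up each pixel's color in the palette."""
    return [palette[i] for i in indices]


original = [1, 1, 2, 0]
# Simulate a lossy codec nudging each stored index up by one where possible.
damaged = [min(i + 1, len(palette) - 1) for i in original]
```

Here `render(original, palette)` starts with red, while `render(damaged, palette)` starts with black: a one-step numeric error becomes a completely different color.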
Pattern
Not raster data but a fill pattern — either a tiling pattern (a repeating image) or a shading pattern (a gradient), made of drawing commands and possibly embedded rasters. The compressor walks the pattern and applies the usual rules to its contents; the Pattern colorspace itself is not re-encoded.
Spot colors: Separation and DeviceN
The most important category for the print industry:
- A magazine cover prints red text in Pantone 186 C ink.
- That ink isn’t a CMYK mix; it’s a separate ink in a separate can.
- In the PDF the image is described as Separation: a single-channel image whose values are “how densely to lay down that specific ink.”
Push such an image through a YCbCr-based JPEG encoder:
- The single-channel image becomes three-channel.
- The press no longer sees “spot Pantone 186 C.”
- It prints a CMYK approximation of red.
- The shade is wrong. On a magazine cover, that’s a defect.
DeviceN generalizes this to multi-channel images for printing in several spot inks at once — black plus gold plus varnish, for example.
pdfcompressor doesn’t touch Separation or DeviceN at all. No JPEG, no downsampling. Optional Flate recompression at most, always lossless.
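The loss can be sketched in a few lines. The class and function names are illustrative, and the CMYK recipe is a placeholder, not a real Pantone conversion. The point is only that the flattened result no longer carries the ink name.

```python
from dataclasses import dataclass


@dataclass
class SeparationImage:
    ink_name: str  # the spot ink this image addresses
    tints: list    # one value per pixel, 0.0 (no ink) to 1.0 (solid ink)


def flatten_to_cmyk(image, solid_cmyk):
    """Replace each tint with a scaled CMYK approximation of the ink.

    The return value is plain process color: the ink name is gone.
    """
    return [tuple(round(t * ch, 3) for ch in solid_cmyk) for t in image.tints]


logo = SeparationImage("PANTONE 186 C", [0.0, 0.5, 1.0])
# Placeholder recipe for illustration only, NOT a real Pantone conversion.
approx = flatten_to_cmyk(logo, solid_cmyk=(0.0, 1.0, 0.8, 0.05))
```

`logo` tells the press which can of ink to open; `approx` is just a list of four-number tuples that any press will render with its ordinary process inks.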
Rendering intents
ICC-managed PDFs carry a rendering intent — the rule for how the reader should reproduce a color when the source space is richer than the target:
- Perceptual — compress the whole range smoothly, preserving relationships;
- Relative colorimetric — accurate where colors fit, clip the rest;
- Saturation — preserve saturation (for business graphics: bars stay bright even if hue drifts slightly);
- Absolute colorimetric — fully accurate, including white point.
pdfcompressor preserves this metadata unchanged.
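The difference between clipping and smooth compression can be shown with a one-dimensional toy. This is not real ICC gamut mapping, only an illustration of the two strategies on a single made-up "saturation" axis where the target device tops out at 1.0.

```python
def clip_to_range(values, limit):
    """Colorimetric-style: keep in-range values exact, clip the rest."""
    return [min(v, limit) for v in values]


def scale_to_range(values, limit):
    """Perceptual-style: scale everything so the largest value fits."""
    peak = max(values)
    if peak <= limit:
        return list(values)
    return [v * limit / peak for v in values]


# Made-up source values; the last one is outside the target's range.
saturations = [0.2, 0.6, 1.0, 1.25]
clipped = clip_to_range(saturations, 1.0)
scaled = scale_to_range(saturations, 1.0)
```

Clipping leaves the first three values untouched but makes the last two indistinguishable; scaling keeps all four distinct at the cost of shifting every one of them.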
The decision logic
Simplified, per image:
if colorspace ∈ {Separation, DeviceN}:
    leave alone
if colorspace ∈ {Lab, CalRGB, CalGray}:
    Flate only, no downsampling
if colorspace = ICCBased:
    if profile is well-known and standard (sRGB, Adobe RGB):
        JPEG ok
    else:
        Flate
if colorspace ∈ {DeviceRGB, DeviceGray, DeviceCMYK}:
    anything goes
if colorspace = Indexed:
    Flate only
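The same rules as a runnable sketch. The function name, the policy strings, and the profile allowlist are illustrative; the conservative fallback for color spaces the simplified rules do not mention (Pattern, for instance) is an assumption, not something the rules above state.

```python
from typing import Optional

SPOT = {"Separation", "DeviceN"}
CALIBRATED = {"Lab", "CalRGB", "CalGray"}
DEVICE = {"DeviceRGB", "DeviceGray", "DeviceCMYK"}
# Assumed allowlist, taken from the examples in the rules above.
STANDARD_PROFILES = {"sRGB", "Adobe RGB"}


def choose_policy(colorspace: str, profile: Optional[str] = None) -> str:
    """Return the compression policy for one image."""
    if colorspace in SPOT:
        return "leave alone"
    if colorspace in CALIBRATED:
        return "flate only, no downsampling"
    if colorspace == "ICCBased":
        return "jpeg ok" if profile in STANDARD_PROFILES else "flate"
    if colorspace in DEVICE:
        return "anything goes"
    if colorspace == "Indexed":
        return "flate only"
    # Assumption: anything unrecognized is left untouched.
    return "leave alone"
```

For example, `choose_policy("ICCBased", "sRGB")` allows JPEG, while `choose_policy("ICCBased", "Custom press CMYK")` falls back to Flate.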
Numbers
On sets of 1000 documents that use only DeviceRGB, the JPEG path always fires. On graphic and print PDFs (magazine layouts, labels, catalogs), up to 30% of images fall into restricted categories; savings are more modest there, but there are no shade shifts. The same compressor shrinks an office document by 60% and a catalog by 30%; in both cases the document afterwards looks the same as before.