
JBIG2 up close — how black-and-white text compresses 20×

[Illustration: a page on which “o” appears 200 times. The JBIG2 symbol dictionary stores the shape once; the page stream records 200 references like (20, 40) → shape 0, (55, 40) → shape 0, and so on.]

A page of text is a catalog of repeating shapes

Take a scanned page. The letter “o” appears 200 times on it. Each “o” is roughly the same set of dark pixels — call it 20×25, about 500 bits. Store the page naively and those 200 copies cost 100,000 bits before you’ve encoded anything else.

JBIG2 instead notices that this is the same shape with minor variations, saves one canonical “o” once, and records each occurrence as a position (x, y) plus a reference into the catalog. The full algorithm:

  1. Split the page into connected black regions — letters, punctuation, image fragments.
  2. Group similar shapes into a symbol dictionary.
  3. For each region, store position plus a dictionary index.
  4. Compress the dictionary itself with an arithmetic coder driven by a context model — another 2–3× on top of the dictionary savings.
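Steps 1–3 can be sketched in miniature. This is an illustration, not the real JBIG2 segmentation: a toy flood fill finds connected regions, and exact bitmap equality stands in for the encoder's shape matching.

```python
# Toy sketch of JBIG2-style symbol-dictionary encoding (steps 1-3).
# The flood fill and exact-match grouping are simplifications.

def connected_regions(bitmap):
    """Step 1: split a 1-bit page into connected black regions."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if bitmap[y][x] and not seen[y][x]:
                stack, pixels = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and bitmap[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                regions.append(pixels)
    return regions

def encode(bitmap):
    """Step 2: group identical shapes into a dictionary.
       Step 3: store (x, y, dictionary index) per occurrence."""
    dictionary, index_of, refs = [], {}, []
    for pixels in connected_regions(bitmap):
        y0 = min(p[0] for p in pixels)
        x0 = min(p[1] for p in pixels)
        # Normalize to a position-independent shape key.
        shape = frozenset((py - y0, px - x0) for py, px in pixels)
        if shape not in index_of:            # lossless: exact matches only
            index_of[shape] = len(dictionary)
            dictionary.append(shape)
        refs.append((x0, y0, index_of[shape]))
    return dictionary, refs

# Two identical blobs on one page -> one dictionary entry, two references.
page = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
d, r = encode(page)
print(len(d), r)   # 1 [(1, 0, 0), (5, 0, 0)]
```

Each extra occurrence of a shape costs only a position and an index, which is where the 20× comes from.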

A typical 300-DPI page of text drops from 50–100 KB to 5–15 KB. Pages with heavy structural repetition — tables, forms, templates — do better still.

JBIG2’s second mode, generic coding, handles parts of the page without repeating symbols (logos, free-form lines, complex graphics). There a pure arithmetic coder with previous-pixel context still beats Flate by 3–5× on bitonal data.
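The idea behind context-driven coding can be shown without a full arithmetic coder: predict each pixel from its already-decoded neighbours, and an ideal coder would spend -log2(p) bits on it. The 3-pixel neighbourhood below is a simplification of JBIG2's larger context template.

```python
import math

# Sketch of context modeling for generic coding. Adaptive 0/1 counts
# per context estimate p(pixel | neighbours); an ideal arithmetic coder
# spends -log2(p) bits per pixel, which we sum as a size estimate.

def estimated_bits(bitmap):
    h, w = len(bitmap), len(bitmap[0])
    counts = {}                       # context -> (count_of_0, count_of_1)
    bits = 0.0
    for y in range(h):
        for x in range(w):
            def px(yy, xx):
                return bitmap[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
            ctx = (px(y, x - 1), px(y - 1, x), px(y - 1, x - 1))
            c0, c1 = counts.get(ctx, (1, 1))      # Laplace-smoothed counts
            p = (c1 if bitmap[y][x] else c0) / (c0 + c1)
            bits += -math.log2(p)
            counts[ctx] = (c0 + (bitmap[y][x] == 0),
                           c1 + (bitmap[y][x] == 1))
    return bits

# A blank 32x32 strip: 1024 raw bits, but the model adapts almost
# immediately, so the estimate lands at a tiny fraction of that.
print(estimated_bits([[0] * 32 for _ in range(32)]))
```

Bitonal scans are full of such highly predictable neighbourhoods, which is why context coding beats Flate's generic byte matching on this data.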

Lossless versus lossy

On the fidelity axis, JBIG2 has two modes.

Lossless: every shape in the dictionary is an exact copy of the corresponding region of the page. Two “o”s that differ by even one pixel of scanner noise don’t merge — the dictionary stores both, and each occurrence references its own. Compression is still very good, just slightly worse than lossy.

Lossy (“similar symbols are merged”): the encoder decides “these two ’o’s differ by only 3 pixels — call them the same.” One shape lands in the dictionary, both occurrences point to it. Another 20–40% size reduction, but the image changes: the previously-different occurrences now look like the canonical shape.
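The whole lossless/lossy distinction comes down to one comparison. A minimal sketch, with an illustrative pixel-difference metric and threshold (the real encoder's similarity test is more elaborate):

```python
# The merge decision behind lossless vs lossy symbol coding.
# `threshold=3` is an illustrative value, not a JBIG2 constant.

def differ(a, b):
    """Count differing pixels between two equal-sized 1-bit bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def same_symbol(a, b, threshold=3, lossless=True):
    if lossless:
        return differ(a, b) == 0       # exact match only
    return differ(a, b) <= threshold   # lossy: "close enough"

o1 = [[0, 1, 1, 0],
      [1, 0, 0, 1],
      [1, 0, 0, 1],
      [0, 1, 1, 0]]
o2 = [[0, 1, 1, 0],
      [1, 0, 0, 1],
      [1, 0, 0, 1],
      [0, 1, 0, 0]]   # one pixel of scanner noise

print(same_symbol(o1, o2))                  # False: lossless keeps both
print(same_symbol(o1, o2, lossless=False))  # True: lossy merges them
```

Once `same_symbol` returns True, both occurrences point at one canonical shape and the one-pixel difference is gone from the output.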

For ordinary text the difference is invisible. For digits, it caused a scandal.

Xerox 2013: why lossless became mandatory

In 2013 the German researcher David Kriesel discovered that Xerox WorkCentre scanners were silently changing digits in scanned documents when JBIG2 compression was on. A page would contain 85 in one cell and 65 in another; after scanning to PDF, both cells showed 85. Or both showed 65. Or different digits altogether.

The cause was lossy JBIG2. The encoder noticed that 8 and 6 share a closed-loop topology, decided they were similar enough to merge, put one shape in the dictionary, and pointed both occurrences at it. From the algorithm’s point of view, savings. From the customer’s point of view, a document with altered numerical values — in contracts, financial reports, medical prescriptions.

Xerox patched the firmware to disable lossy JBIG2 by default. Since then, in every system that handles documents, lossless JBIG2 is the standard. Smaller savings, integrity guaranteed.

pdfcompressor uses only lossless JBIG2. No “close enough.” Every shape in the dictionary is a bit-for-bit copy of what was on the page.

Where JBIG2 doesn’t fit

JBIG2 works only on 1-bit images — every pixel either black or white. So JBIG2 only fires after explicit classification: this is a 1-bit image and it looks like a document. Everything else uses other codecs.

Fallbacks

When JBIG2 isn’t usable — the encoded output came out larger than the input (which can happen on very small images), or the encoder choked on exotic input — the compressor walks down a fallback chain:

  1. CCITT Group 4. ITU-T Recommendation T.6, the fax codec from 1984. Still bitonal, but built on simpler ideas: run-length coding of pixel rows plus references to the previous row (two-dimensional prediction). On modern scans it gives 3–7× — worse than JBIG2 but reliable, with universal reader support.

  2. Flate. Plain zlib, like everything else in PDF. Modest 2–3× on bitonal data, guaranteed to work.

The choice is settled by output size: whichever encoder produced the smallest stream wins, so the chain never emits a stream larger than the best of the alternatives.
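The smallest-stream rule is a few lines. A sketch of the fallback walk, with only Flate (zlib) actually wired up — the JBIG2 and CCITT entries in a real chain would be real encoders:

```python
import zlib

# Sketch of the fallback chain: run every available encoder,
# keep the smallest successful output.

def compress_bitonal(raw, encoders):
    best = raw                          # worst case: leave the data as-is
    for encode in encoders:
        try:
            out = encode(raw)
        except Exception:
            continue                    # encoder choked on exotic input
        if len(out) < len(best):        # smallest stream wins
            best = out
    return best

data = b"\x00" * 4096                   # a blank bitonal strip
result = compress_bitonal(data, [zlib.compress])
print(len(result) < len(data))          # True: Flate shrank it
```

An encoder that errors out or expands the data simply loses the comparison, which is how "larger than the input" and "choked on exotic input" are both handled by the same loop.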

Compatibility

JBIG2 itself is ITU-T T.88 (2000) and ISO/IEC 14492 (2001). It joined PDF as the JBIG2Decode filter in PDF 1.4 in 2001. Every modern reader handles it.

Some enterprise PDF processors can read JBIG2 but not write it. If maximum downstream compatibility matters, pdfcompressor exposes flags to disable JBIG2. The default leaves it on.

Patents: the MQ-coder at the heart of JBIG2 was patented by IBM and Mitsubishi but licensed royalty-free on request. The foundational patents from 1990s filings expired by 2017 (the standard 20-year term). Open-source implementations — jbig2enc, our own myjbig2 — are freely usable today with no legal exposure.

Numbers for one image

A scan of an A4 page of dense text at 300 DPI:

Format                     Size
Uncompressed 1-bit TIFF    ~1.05 MB
Flate                      ~220 KB
CCITT Group 4              ~140 KB
JBIG2 lossless             ~45 KB

A 500-page document of that kind saves roughly 85 megabytes by choosing JBIG2 over Flate. For scanned documents, JBIG2 is the main source of compression.

[Illustration: the Xerox 2013 failure. On paper, three adjacent table cells read 6, 5, 8; lossy JBIG2 treats “6” and “8” as the same shape, so after scanning the cells read 8, 5, 8 — the “6” silently became an “8”. pdfcompressor uses lossless JBIG2 only, so shapes are reproduced byte-for-byte.]