Fonts — the invisible ballast
A font in a text PDF can outweigh all the text typed in it. A hundred-page contract set entirely in Arial commonly carries 400 KB of font data, because the generator embedded the whole font program — every ideograph, every math symbol, every accented Cyrillic glyph — none of which appears anywhere on the pages.
Why fonts are embedded at all
So the document looks the same on every device. If the PDF only said “use Arial,” a recipient without Arial would get whatever the reader substitutes, and the layout would shift. PDFs that aim for identical rendering everywhere therefore embed their fonts. The trap is that “embed” usually means “copy the whole font file in as-is,” with no thought to what’s actually used.
Three savings techniques
1. Drop the standard 14
Through PDF 1.7 (ISO 32000-1), every reader was required to render 14 standard fonts: Helvetica (4 styles), Times (4), Courier (4), Symbol, and ZapfDingbats. Viewers shipped with them.
PDF 2.0 (2017) lifted that requirement. Most desktop and mobile readers still supply the Base 14 in practice, but some web and mobile readers substitute look-alikes, and dropping the embedded copy can nudge line breaks. So pdfcompressor removes the embedding only when the file isn’t PDF/A and the embedded font matches a standard copy byte-for-byte. Otherwise it leaves the embedding alone.
Generators routinely embed Base 14 fonts out of habit. When the bytes match, the compressor strips the font program and keeps only the name reference. The file shrinks by the size of Helvetica — typically 30–100 KB.
If the document uses a modified standard font (Helvetica with extra currency symbols, say), the bytes don’t match and the embedding stays. The check is on content, not on name.
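The content check can be sketched in a few lines. This is a minimal illustration, not pdfcompressor's actual code: the function name can_drop_embedding, the reference_digests set, and the placeholder byte strings are all assumptions, and a SHA-256 digest comparison stands in for the byte-for-byte match.

```python
import hashlib


def sha256(data: bytes) -> str:
    """Hex digest used as a stand-in for a byte-for-byte comparison."""
    return hashlib.sha256(data).hexdigest()


def can_drop_embedding(font_program: bytes, is_pdfa: bool,
                       reference_digests: set[str]) -> bool:
    """Allow removal only for non-PDF/A files whose embedded bytes
    match a known standard copy exactly. The font's name is never consulted."""
    if is_pdfa:
        return False
    return sha256(font_program) in reference_digests


# Placeholder bytes; real input would be the embedded font program.
stock_helvetica = b"stock helvetica program"
modified_helvetica = stock_helvetica + b" plus extra currency glyphs"

# Hypothetical reference set built from the standard copies.
reference = {sha256(stock_helvetica)}

drop_stock = can_drop_embedding(stock_helvetica, False, reference)
drop_pdfa = can_drop_embedding(stock_helvetica, True, reference)
drop_modified = can_drop_embedding(modified_helvetica, False, reference)
```

Here only drop_stock is true: the PDF/A file and the modified Helvetica both keep their embedding.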
2. Subsetting — drop unused glyphs
The biggest win. A modern font carries hundreds or thousands of glyphs: Latin, Cyrillic, punctuation, ligatures, diacritics for every European language, typographic arrows, math, sometimes whole CJK sets. A typical document uses at most 200 distinct characters. The rest is ballast.
Subsetting works in three steps:
- Walk every page’s contents, every nested form XObject, every template, and collect the exact set of character codes drawn somewhere.
- Use HarfBuzz to rebuild the font, keeping glyphs for those codes and discarding the rest.
- Clean up font tables (CharSet, CIDSet) that still reference glyphs that no longer exist.
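The three steps above can be sketched on a toy model. Everything here is an assumption made for illustration: the Stream class, the dict that stands in for a font, and the single-font simplification. In the real pipeline the rebuild is done by HarfBuzz, not by filtering a dict.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Toy content stream: the codes it draws plus nested form XObjects."""
    codes: list[int]
    children: list["Stream"] = field(default_factory=list)


def collect_used_codes(stream: Stream) -> set[int]:
    """Step 1: walk the stream and everything nested inside it."""
    used = set(stream.codes)
    for child in stream.children:
        used |= collect_used_codes(child)
    return used


def subset_font(glyphs: dict[int, bytes], used: set[int]) -> dict[int, bytes]:
    """Step 2 (stand-in for the HarfBuzz rebuild): keep only used glyphs."""
    return {code: outline for code, outline in glyphs.items() if code in used}


def prune_charset(charset: list[int], kept: dict[int, bytes]) -> list[int]:
    """Step 3: drop table entries that point at discarded glyphs."""
    return [code for code in charset if code in kept]


# A page drawing "He" directly and "lo" inside a nested form XObject.
page = Stream(codes=[72, 101], children=[Stream(codes=[108, 111])])
glyphs = {code: b"outline" for code in range(32, 127)}

used = collect_used_codes(page)
kept = subset_font(glyphs, used)
charset = prune_charset(list(glyphs), kept)
```

The toy font starts with 95 glyphs and ends with 4, and the pruned table lists exactly those 4.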
Savings depend on how comprehensive the original was. Plain Arial (TrueType) carries about 4000 glyphs; Arial Unicode MS holds tens of thousands. If the document uses 80 glyphs, subsetting keeps about 2% of plain Arial’s glyphs (80 of 4000) and a far smaller share of Arial Unicode MS, so only a small fraction of the original size survives. For an already-subsetted font, almost nothing is left to cut.
3. Merging — collapse identical copies
A common pathology: the document embeds the same font multiple times. This happens when a PDF gets assembled from chapters built by different tools, or when each page is a separate XObject with its own resource set. The result is ten copies of the same Times New Roman.
The compressor hashes font programs and merges the duplicates, rewriting all references. The pass runs again after subsetting, because subsetting often reduces what were previously different embeddings to identical subsets — those can now merge.
The effect is even larger after CFF conversion. Compact Font Format normalizes the internal representation, and three “different” fonts frequently turn out to be the same one once they share a canonical encoding.
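A minimal sketch of the hash-and-merge pass, assuming fonts are addressed by object id and usage sites by a label. The function name, the id strings, and the placeholder bytes are hypothetical. The first id in sorted order becomes the canonical copy.

```python
import hashlib


def merge_duplicate_fonts(programs: dict[str, bytes],
                          refs: dict[str, str]):
    """programs: object id -> font program bytes.
    refs: usage site -> object id it currently points at.
    Returns the deduplicated programs and the rewritten references."""
    canonical_by_digest: dict[str, str] = {}
    redirect: dict[str, str] = {}
    for obj_id in sorted(programs):
        digest = hashlib.sha256(programs[obj_id]).hexdigest()
        # The first object seen with this digest becomes the canonical copy.
        redirect[obj_id] = canonical_by_digest.setdefault(digest, obj_id)

    merged = {obj_id: data for obj_id, data in programs.items()
              if redirect[obj_id] == obj_id}
    rewritten = {site: redirect[obj_id] for site, obj_id in refs.items()}
    return merged, rewritten


# Two byte-identical copies of one font plus one unrelated font.
programs = {"12 0 R": b"times", "47 0 R": b"times", "90 0 R": b"arial"}
refs = {"page1/F1": "12 0 R", "page2/F1": "47 0 R", "page3/F2": "90 0 R"}

merged, rewritten = merge_duplicate_fonts(programs, refs)
```

Running the same function again after subsetting is what catches embeddings that only became identical once their unused glyphs were removed.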
What makes this hard
PDF text is a sequence of codes, not letters. An encoding table (Encoding or CMap) maps codes to glyph names or CIDs, which then map to the glyphs themselves. Type0 (CID-keyed) fonts add a third level: code → CID → glyph. Safe subsetting requires parsing the CMap and getting the mapping exactly right. Discard a glyph that’s actually in use and the page gets a blank where a character belongs.
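The code → CID → glyph chain can be illustrated with two lookup tables. This is a sketch under stated assumptions: the table contents are made up, and glyphs_to_keep is a hypothetical helper, not a library call.

```python
def glyphs_to_keep(codes_in_use: set[int],
                   code_to_cid: dict[int, int],
                   cid_to_gid: dict[int, int]) -> set[int]:
    """Resolve every code drawn on a page to the glyph it ultimately selects."""
    keep = {0}  # glyph 0 is .notdef and is always retained
    for code in codes_in_use:
        cid = code_to_cid[code]      # level 1: CMap lookup
        keep.add(cid_to_gid[cid])    # level 2: CID to glyph index
    return keep


# Made-up mappings for three codes.
code_to_cid = {0x41: 34, 0x42: 35, 0x43: 36}
cid_to_gid = {34: 5, 35: 6, 36: 7}

keep = glyphs_to_keep({0x41, 0x43}, code_to_cid, cid_to_gid)
```

A mistake at either level selects the wrong glyph index to keep, which is exactly how an in-use glyph ends up discarded. A code missing from the table raises KeyError here rather than being skipped silently.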
Safety rule: if parsing the content stream fails, the font stays untouched and full-size. An extra 200 KB beats invisible text.
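The rule amounts to a fallback around the parse step. In this sketch the content format (whitespace-separated hex codes), ContentParseError, and both function names are assumptions chosen to keep the example self-contained.

```python
class ContentParseError(Exception):
    """Raised by the toy parser when a content stream is malformed."""


def parse_codes(content: str) -> set[int]:
    """Toy parser: whitespace-separated hex codes, e.g. '48 65 6C'."""
    try:
        return {int(token, 16) for token in content.split()}
    except ValueError as exc:
        raise ContentParseError(str(exc)) from exc


def safe_subset(glyphs: dict[int, bytes], content: str) -> dict[int, bytes]:
    """Subset only when the content parsed cleanly; otherwise return
    the original font object untouched."""
    try:
        used = parse_codes(content)
    except ContentParseError:
        return glyphs
    return {code: outline for code, outline in glyphs.items() if code in used}


font = {0x48: b"H", 0x65: b"e", 0x6C: b"l"}

subsetted = safe_subset(font, "48 65")   # parses cleanly, one glyph dropped
untouched = safe_subset(font, "48 zz")   # malformed, font returned as-is
```

The failure path returns the very same object, so nothing downstream can tell that subsetting was attempted.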
ToUnicode and rights
The Font object carries several auxiliary tables alongside the program itself:
- ToUnicode CMap is the table the reader uses for copy/paste. Without it, Ctrl+C from the PDF returns garbage. Never modified.
- Widths hold per-glyph advance widths and are required for correct text placement. Preserved.
- FontDescriptor carries flags, bounding boxes, font metrics, and sometimes embedding-rights metadata (Embedding Rights, FSType). That field is legally significant, so we don’t change it.
What the numbers look like
On office-document datasets — contracts, reports, statements — font work alone yields 10–25% total size reduction. For heavily illustrated files the share is smaller, but combined with image and stream optimization it still adds up.