2026-04-10T12:00:00+02:00
#compression#benchmark#performance#zstd#brotli#linux

Compression on Real Files: 3,420 Benchmark Runs Across Brotli, Zstd, LZMA2, LZ4, gzip, and bzip2

A deep benchmark of 36 real files and directories across 95 compression presets, with interactive charts for ratio, speed, decompression cost, and category-specific winners.

I wanted a benchmark that used real files instead of synthetic test strings, so I took 36 datasets, ran 95 codec presets against them, and collected 3,420 measured results. The mix includes codebases, single source files, logs, databases, documents, images, audio, video, VM images, package caches, and even a multi-gigabyte LLM model directory.

The short answer is simple: highly structured text still compresses incredibly well, already compressed media usually does not, and the “best” algorithm changes immediately once you decide whether you care about absolute ratio, compression time, decompression time, or some balance between them.

In One Paragraph

If you only want the practical answer, use zstd for most real-world work, keep brotli in mind for dense web-style text, use lzma2/xz only when you truly want to squeeze text-heavy archives as hard as possible, and avoid recompressing formats that are already aggressively encoded such as mp4, mov, hevc, mp3, aac, or many office containers.

The short version

This benchmark covers 36 real files and directories, 29.3 GiB of source material, and 95 codec presets.

Short Answers

  • For mixed real-world data, zstd is the safest default because it stays fast while preserving much more ratio than the old “fast but weak” stereotype suggests.
  • For maximum ratio on text-heavy archives, lzma2 still wins a lot of the time, but the extra compression time becomes expensive very quickly on large directories.
  • For structured web-style text and tiny source files, brotli remains extremely strong. On individual HTML, CSS, SVG, JSON, SQL, and source-file style inputs, it keeps landing near the top.
  • For already compressed media, recompression is mostly wasted CPU. Video was effectively flat in this benchmark, and audio gains were usually tiny.
  • lz4 is exactly what its reputation says: it is very fast, it decompresses extremely quickly, and it gives up ratio to buy that speed.

Practical Recommendations

If I had to collapse the entire benchmark into a few choices, this is where I land:

Use case                                    | What I would pick first | Why
General-purpose archives                    | zstd fast-3 to zstd 3   | Excellent speed-to-ratio balance without absurd compression time
Maximum ratio on code/log/text backups      | lzma2 7 to lzma2 9e     | Usually the strongest ratio, but only worth it when time is cheap
Web text assets                             | brotli 9 to brotli 11   | Still exceptional on structured text
Very fast local compression                 | lz4 1                   | Minimal CPU cost and strong decompression speed
Legacy compatibility                        | gzip 1 or gzip 4        | Ubiquitous support, decent middle ground
MP4, HEVC, MP3, AAC, many office containers | Often nothing at all    | The data is already compressed, so extra work buys little

Interactive Explorer

The static summary is useful, but compression is all about trade-offs. The dashboard below lets you filter by category, limit the view to files or directories, isolate a single dataset, toggle algorithms, switch scatter-plot axes, inspect level curves, and sort grouped leaderboards.


What Stood Out Immediately

1. Text and code are still the easy wins

The cleanest signal in the entire benchmark is that structured text remains extraordinarily compressible.

  • Codebases reached a median best-case space saving of 73.1%.
  • Single source files reached 77.3%.
  • Documents, helped by a few extremely compressible text-heavy samples, reached 80.4% median best-case savings.
  • The SQL dump saved 75.7% at the top result.
  • The JSON sample went past 94% savings at the strongest settings.

None of this is surprising in theory, but the scale still matters in practice. If your backup or transport workload is mostly source code, logs, configs, JSON, HTML, SQL, or similar structured text, compression is one of the highest-leverage optimizations available.
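
The scale of those wins is easy to reproduce with nothing but the Python standard library. The snippet below builds a synthetic, repetitive JSON payload (a stand-in for structured text, not one of the benchmark's actual datasets) and compresses it with gzip and xz:

```python
import gzip
import json
import lzma

# Synthetic, repetitive JSON payload: a stand-in for the structured text
# that dominated the benchmark's easy wins (not a real benchmark dataset).
records = [{"id": i, "level": "INFO", "msg": "request completed", "ms": i % 50}
           for i in range(5000)]
raw = json.dumps(records).encode()

for name, packed in [("gzip 9", gzip.compress(raw, compresslevel=9)),
                     ("xz 6", lzma.compress(raw, preset=6))]:
    saved = 100 * (1 - len(packed) / len(raw))
    print(f"{name}: {len(raw):,} -> {len(packed):,} bytes ({saved:.1f}% saved)")
```

On input like this, even midrange presets land in the 90%-plus savings territory the benchmark reported for JSON.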

2. Already compressed media mostly stayed compressed

At the opposite end, video was basically a wall.

  • The median best-case space saving for the video category was only 0.0096%.
  • Audio improved by a median best-case 1.95%.
  • PNG, RAW, and BMP behaved very differently from one another, which is exactly why treating “images” as one compression class is misleading.

This is the benchmark saying: do not pay large CPU costs to recompress formats that were already built around compression. In many cases you are only moving noise around.
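
A quick way to see why: general-purpose codecs work by removing redundancy, and already compressed data has almost none left. High-entropy random bytes make a reasonable stand-in:

```python
import gzip
import os

# High-entropy bytes stand in for already compressed media: a general-purpose
# codec finds essentially no redundancy to remove.
noise = os.urandom(1_000_000)
packed = gzip.compress(noise, compresslevel=9)
saved = 100 * (1 - len(packed) / len(noise))
print(f"{len(noise):,} -> {len(packed):,} bytes ({saved:+.3f}% saved)")
```

On input like this, gzip's container overhead typically makes the output slightly larger than the input, which mirrors the near-flat video numbers above.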

3. The absolute winners are not the practical winners

If you rank purely by strongest weighted ratio across the full benchmark, the top single-family presets look like this:

  • lzma2 9e reached a weighted ratio of 1.667x.
  • zstd 22 reached 1.633x.
  • brotli 11 reached 1.632x.

That looks close, but the time cost is not close at all. The high-end lzma2, zstd, and brotli presets pay for those last percentage points with a lot more compression time. This is where interactive inspection matters: a small ratio gain near the top often costs a disproportionate amount of CPU.

4. Brotli is not just a web-server story

Brotli is usually framed as “the thing CDNs use for text assets,” but that undersells it. In this benchmark it took 18 out of 36 best-ratio wins overall, especially on small structured inputs and dense text-like formats.

That does not make it the universal default, because higher Brotli levels become painfully slow on bigger archives. It does mean that whenever the payload looks like text and you care about output size more than encode time, Brotli deserves to be in the conversation.

Category-by-Category Read

Codebases and source files

This is where the benchmark becomes most satisfying.

On large codebases, lzma2 9e won the pure ratio contest with a weighted ratio of 4.143x, but the best balanced preset for the category was lz4 1. That sounds contradictory until you look at the actual trade-off: lzma2 9e absolutely crushes source-like data, but it makes you wait; lz4 1 wins back a large chunk of that time while still delivering meaningful savings.

On single source files, brotli 11 was the category ratio leader at 4.261x, while gzip 4 ended up as the best balanced choice. That is a nice reminder that tiny files distort intuition: on small inputs, the absolute time penalty of stronger presets can stay low enough that aggressive codecs remain attractive.

Documents and databases

This was the most uneven category group.

The document category had everything from nearly incompressible office containers to absurd outliers. The single PDF sample compressed by 97.7% at the strongest Brotli setting, which is an enormous result and also a warning not to generalize from one PDF to all PDFs. Some PDFs are mostly compressed images. Others still contain large compressible text or object streams. The format label alone is not enough.

At the category level, lzma2 9e was the strongest ratio preset for documents at 7.956x, while zstd fast-1 was the best balanced preset. That is exactly the kind of split I care about in practice: if I am archiving a text-heavy document corpus for cold storage, I may accept lzma2; if I am repeatedly packing and unpacking it, I want zstd.

The single SQL dump behaved more like code than like a “document,” and Brotli absolutely loved it. That makes sense: SQL dumps are repetitive, verbose, and highly structured.

Binary blobs, model weights, VM images, and package caches

The binary category is where simplistic rules start breaking down.

“Binary” can mean random noise, highly compressible VM disk contents, package caches, or model weights. Those are not equivalent at all.

  • The random 2 GiB binary blob was effectively incompressible.
  • The Ubuntu VM image was very compressible.
  • The Qwen model directory saved more than 30% at the best ratio, but only with very expensive compression presets.
  • The package cache only improved modestly.

The category-level ratio winner was lzma2 9e at 1.473x, but the best balanced preset was lz4 1. That gap is the story of binary data in one line: some binary trees absolutely reward strong compression, but you cannot assume they all will.

Images, audio, and video

These categories need nuance.

The image category includes BMP, RAW, PNG, and SVG, which are not remotely comparable from a compression standpoint. Vector text-like SVG behaved much more like a document. BMP and some RAW content still had real room to compress. PNG packs, unsurprisingly, were far more resistant.

Audio showed a similar split. Lossless or lightly packed material can still move a bit. Strongly compressed formats barely budge. The median best-case result of 1.95% says it clearly: audio recompression is usually not where your big wins live.

Video is the easiest category to summarize because the answer is practically “don’t bother.” The benchmark agrees with common sense there.

The Codec Families in Plain English

Zstd

zstd is the codec I would hand to most people by default.

Its strongest overall preset, zstd 22, landed at 1.633x weighted ratio, basically neck-and-neck with brotli 11 on the whole benchmark and not dramatically behind lzma2 9e. The difference is that lower and midrange Zstd presets stay far more usable. The family also dominated many of the balanced or near-balanced views, especially once decompression speed mattered.

If you want one family that stays credible almost everywhere, this is it.

Brotli

Brotli remains devastating on small, structured, text-heavy inputs.

Its problem is not effectiveness. Its problem is that the upper levels can get expensive fast. That makes Brotli ideal when the data is very text-like and the encode cost is acceptable, but less attractive as a universal archive codec for large mixed trees.

LZMA2 / xz

When the question is “what produces the smallest file,” lzma2 is still hard to ignore.

On codebases and documents it repeatedly took the strongest ratio spots. The problem is that the time curve is harsh. This is a cold-storage, distribution, or “compress once, keep forever” tool, not something I would casually throw into a hot path.
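
In Python's standard library, the "e" in a preset like 9e maps to lzma.PRESET_EXTREME. A minimal sketch on SQL-dump-like text (illustrative data, not the benchmark corpus):

```python
import lzma

# Illustrative SQL-dump-like text: repetitive, verbose, highly structured.
data = b"".join(
    b"INSERT INTO users VALUES (%d, 'user%d@example.com');\n" % (i, i)
    for i in range(20000)
)

fast = lzma.compress(data, preset=1)
# xz's "9e" corresponds to level 9 plus the extreme flag:
best = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)
print(len(data), len(fast), len(best))
```

The extreme preset squeezes out a little more than the fast one, but at a much higher CPU cost per byte, which is exactly the "compress once, keep forever" profile described above.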

LZ4

lz4 is the speed specialist. It usually loses the raw ratio contest, but it rarely surprises you in the wrong direction. If fast local packaging or very fast decompression matters more than shaving every last percent, it does exactly what you expect.

gzip and bzip2

Neither is dead, but neither feels like the first choice anymore unless compatibility or existing tooling forces the decision.

gzip still makes sense when universal support matters. bzip2 occasionally produced strong ratio results, including the best result on the model directory, but its overall speed profile remains hard to justify against newer alternatives.

The Nonlinear Part: Levels Stop Paying Off

One of the easiest mistakes in compression tuning is assuming that moving from level 6 to level 9 is a proportional change. It rarely is.

This benchmark kept showing the same pattern:

  • The first step away from the fastest presets usually buys a meaningful gain.
  • Midrange presets often capture most of the value.
  • The last few levels are where time explodes and the ratio gain shrinks.

That does not mean high levels are pointless. It means they need a reason. If the archive is large, text-heavy, and long-lived, maybe the extra time is justified. If you are constantly creating, unpacking, or transferring it, the sweet spot moves downward quickly.
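
The same diminishing-returns curve shows up even with plain zlib. This sketch sweeps levels 1 through 9 over synthetic log lines (illustrative data, not one of the benchmark's datasets):

```python
import zlib

# Illustrative log-like text with some per-line variety.
data = b"".join(
    b"2026-04-10 12:00:%02d INFO request %d completed in %dms\n"
    % (i % 60, i, i % 900)
    for i in range(50_000)
)

# Compressed size at each zlib level: the curve flattens long before level 9.
sizes = {level: len(zlib.compress(data, level)) for level in range(1, 10)}
for level, size in sizes.items():
    saved = 100 * (1 - size / len(data))
    print(f"level {level}: {size:,} bytes ({saved:.1f}% saved)")
```

Most of the gain arrives in the first few levels; the jump from the default to the maximum level buys comparatively little.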

The level curve in the interactive section is the best way to see this on a dataset-by-dataset basis.

What I Would Actually Use

If I were translating this benchmark into defaults for my own systems, it would look roughly like this:

  1. zstd fast-3, zstd fast-1, or zstd 1-3 for everyday use.
  2. brotli 9-11 for dense web text or small structured files where output size matters a lot.
  3. lzma2 7-9e only for text-heavy archives where compression time is a secondary concern.
  4. lz4 1 when speed and low CPU cost matter more than the smallest result.
  5. No extra compression at all for many already compressed media files.

That list is much closer to how systems are actually operated than a pure “who got the highest ratio” ranking.

Methodology and Caveats

  • The benchmark covers 36 datasets and 95 distinct presets across zstd, brotli, lzma2/xz, lz4, gzip, and bzip2.
  • Both files and directories were included because real-world archive behavior depends heavily on corpus structure, not just on file format labels.
  • I tracked compression ratio, percent space saved, compression time, and decompression time.
  • The interactive article includes both raw runs and grouped views because a weighted category average and a single-dataset winner answer different questions.
  • The “balanced” score in the charts is not a codec-standard metric. It is a normalized benchmark-specific score that rewards ratio, compression speed, and decompression speed together.
  • Some samples are intentionally weird. The PDF result is a perfect example: it is real data, but it should not be treated as “all PDFs behave like this.”
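
To make the "balanced" caveat concrete, here is a hypothetical sketch of such a score: min-max normalize ratio, compression speed, and decompression speed across the presets for one dataset, then average the three. The function names, equal weighting, and all numbers below are illustrative assumptions, not the article's actual formula or measurements:

```python
# Hypothetical sketch of a "balanced" score in the spirit described above.
# Weighting, names, and numbers are illustrative assumptions only.

def minmax(values):
    lo, hi = min(values), max(values)
    return [0.5 if hi == lo else (v - lo) / (hi - lo) for v in values]

def balanced_scores(runs):
    """runs: list of (preset, ratio, compress_mb_s, decompress_mb_s)."""
    ratio = minmax([r[1] for r in runs])
    c_spd = minmax([r[2] for r in runs])
    d_spd = minmax([r[3] for r in runs])
    return {r[0]: (a + b + c) / 3
            for r, a, b, c in zip(runs, ratio, c_spd, d_spd)}

runs = [  # (preset, ratio, MB/s compress, MB/s decompress) -- made-up values
    ("lzma2 9e", 1.67, 2.0, 80.0),
    ("zstd 3", 1.55, 400.0, 1200.0),
    ("lz4 1", 1.25, 700.0, 3500.0),
]
scores = balanced_scores(runs)
```

Even with made-up numbers like these, the strongest-ratio preset does not win the balanced ranking, which is the same leaderboard reshuffling the charts keep showing.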

Final Take

The most useful result here is not that one algorithm “won.” It is that compression outcomes are highly dependent on the structure of the data, and the moment you measure both compression and decompression time the leaderboard reshuffles.

That is exactly why I wanted the article to stay interactive. A storage archive, a CDN asset pipeline, a backup job, a package mirror, and a local developer workflow do not want the same answer. With the filters, scatter plot, heatmap, level curves, and sortable leaderboard above, you can pull the benchmark toward the decision you actually need to make.