[{"data":1,"prerenderedAt":746},["ShallowReactive",2],{"page-\u002Fblog\u002Fcompression-benchmark-real-files-2026":3},{"id":4,"title":5,"body":6,"date":731,"description":732,"extension":733,"featured":734,"image":735,"meta":736,"navigation":734,"path":737,"published":734,"seo":738,"stem":739,"tags":740,"__hash__":745},"blog\u002Fblog\u002Fcompression-benchmark-real-files-2026.md","Compression on Real Files: 3,420 Benchmark Runs Across Brotli, Zstd, LZMA2, LZ4, gzip, and bzip2",{"type":7,"value":8,"toc":700},"minimark",[9,13,16,53,56,61,93,97,100,213,217,220,223,227,232,235,272,275,279,282,299,302,306,309,335,346,350,357,360,364,368,371,390,403,407,410,417,435,438,442,445,448,466,479,483,486,489,495,498,502,505,510,525,528,531,534,537,541,547,550,553,558,562,565,575,579,582,585,596,599,602,606,609,643,646,650,690,694,697],[10,11,12],"p",{},"I wanted a benchmark that used real files instead of synthetic test strings, so I took 36 datasets, ran 95 codec presets against them, and collected 3,420 measured results. 
The mix includes codebases, single source files, logs, databases, documents, images, audio, video, VM images, package caches, and even a multi-gigabyte LLM model directory.",[10,14,15],{},"The short answer is simple: highly structured text still compresses incredibly well, already compressed media usually does not, and the “best” algorithm changes the moment you decide whether you care about absolute ratio, compression time, decompression time, or some balance between them.",[17,18,20],"info",{"title":19},"In One Paragraph",[10,21,22,23,27,28,31,32,35,36,39,40,39,43,39,46,39,49,52],{},"If you only want the practical answer, use ",[24,25,26],"code",{},"zstd"," for most real-world work, keep ",[24,29,30],{},"brotli"," in mind for dense web-style text, use ",[24,33,34],{},"lzma2\u002Fxz"," only when you truly want to squeeze text-heavy archives as hard as possible, and avoid recompressing formats that are already aggressively encoded, such as ",[24,37,38],{},"mp4",", ",[24,41,42],{},"mov",[24,44,45],{},"hevc",[24,47,48],{},"mp3",[24,50,51],{},"aac",", or many office containers.",[54,55],"compression-results-summary",{},[57,58,60],"h2",{"id":59},"short-answers","Short Answers",[62,63,64,71,78,84,87],"ul",{},[65,66,67,68,70],"li",{},"For mixed real-world data, ",[24,69,26],{}," is the safest default because it stays fast while delivering a much better ratio than the old “fast but weak” stereotype suggests.",[65,72,73,74,77],{},"For maximum ratio on text-heavy archives, ",[24,75,76],{},"lzma2"," still wins a lot of the time, but the extra compression time becomes expensive very quickly on large directories.",[65,79,80,81,83],{},"For structured web-style text and tiny source files, ",[24,82,30],{}," remains extremely strong. On individual HTML, CSS, SVG, JSON, SQL, and source-file style inputs, it keeps landing near the top.",[65,85,86],{},"For already compressed media, recompression is mostly wasted CPU. 
Video was effectively flat in this benchmark, and audio gains were usually tiny.",[65,88,89,92],{},[24,90,91],{},"lz4"," is exactly what its reputation says: it is very fast, it decompresses extremely quickly, and it gives up ratio to buy that speed.",[57,94,96],{"id":95},"practical-recommendations","Practical Recommendations",[10,98,99],{},"If I had to collapse the entire benchmark into a few choices, this is where I land:",[101,102,103,119],"table",{},[104,105,106],"thead",{},[107,108,109,113,116],"tr",{},[110,111,112],"th",{},"Use case",[110,114,115],{},"What I would pick first",[110,117,118],{},"Why",[120,121,122,140,156,172,185,202],"tbody",{},[107,123,124,128,137],{},[125,126,127],"td",{},"General-purpose archives",[125,129,130,133,134],{},[24,131,132],{},"zstd fast-3"," to ",[24,135,136],{},"zstd 3",[125,138,139],{},"Excellent speed-to-ratio balance without absurd compression time",[107,141,142,145,153],{},[125,143,144],{},"Maximum ratio on code\u002Flog\u002Ftext backups",[125,146,147,133,150],{},[24,148,149],{},"lzma2 7",[24,151,152],{},"lzma2 9e",[125,154,155],{},"Usually the strongest ratio, but only worth it when time is cheap",[107,157,158,161,169],{},[125,159,160],{},"Web text assets",[125,162,163,133,166],{},[24,164,165],{},"brotli 9",[24,167,168],{},"brotli 11",[125,170,171],{},"Still exceptional on structured text",[107,173,174,177,182],{},[125,175,176],{},"Very fast local compression",[125,178,179],{},[24,180,181],{},"lz4 1",[125,183,184],{},"Minimal CPU cost and strong decompression speed",[107,186,187,190,199],{},[125,188,189],{},"Legacy compatibility",[125,191,192,195,196],{},[24,193,194],{},"gzip 1"," or ",[24,197,198],{},"gzip 4",[125,200,201],{},"Ubiquitous support, decent middle ground",[107,203,204,207,210],{},[125,205,206],{},"MP4, HEVC, MP3, AAC, many office containers",[125,208,209],{},"Often nothing at all",[125,211,212],{},"The data is already compressed, so extra work buys 
little",[57,214,216],{"id":215},"interactive-explorer","Interactive Explorer",[10,218,219],{},"The static summary is useful, but compression is all about trade-offs. The dashboard below lets you filter by category, limit the view to files or directories, isolate a single dataset, toggle algorithms, switch scatter-plot axes, inspect level curves, and sort grouped leaderboards.",[221,222],"compression-results-explorer",{},[57,224,226],{"id":225},"what-stood-out-immediately","What Stood Out Immediately",[228,229,231],"h3",{"id":230},"_1-text-and-code-are-still-the-easy-wins","1. Text and code are still the easy wins",[10,233,234],{},"The cleanest signal in the entire benchmark is that structured text remains extraordinarily compressible.",[62,236,237,245,251,258,265],{},[65,238,239,240,244],{},"Codebases reached a median best-case space saving of ",[241,242,243],"strong",{},"73.1%",".",[65,246,247,248,244],{},"Single source files reached ",[241,249,250],{},"77.3%",[65,252,253,254,257],{},"Documents, helped by a few extremely compressible text-heavy samples, reached ",[241,255,256],{},"80.4%"," median best-case savings.",[65,259,260,261,264],{},"The SQL dump saved ",[241,262,263],{},"75.7%"," at the top result.",[65,266,267,268,271],{},"The JSON sample went past ",[241,269,270],{},"94%"," savings at the strongest settings.",[10,273,274],{},"None of this is surprising in theory, but the scale still matters in practice. If your backup or transport workload is mostly source code, logs, configs, JSON, HTML, SQL, or similar structured text, compression is one of the highest-leverage optimizations available.",[228,276,278],{"id":277},"_2-already-compressed-media-mostly-stayed-compressed","2. 
Already compressed media mostly stayed compressed",[10,280,281],{},"At the opposite end, video was basically a wall.",[62,283,284,290,296],{},[65,285,286,287,244],{},"The median best-case space saving for the video category was only ",[241,288,289],{},"0.0096%",[65,291,292,293,244],{},"Audio improved by a median best-case space saving of ",[241,294,295],{},"1.95%",[65,297,298],{},"PNG, RAW, and BMP behaved very differently from one another, which is exactly why treating “images” as one compression class is misleading.",[10,300,301],{},"This is the benchmark saying: do not pay large CPU costs to recompress formats that were already built around compression. In many cases you are only moving noise around.",[228,303,305],{"id":304},"_3-the-absolute-winners-are-not-the-practical-winners","3. The absolute winners are not the practical winners",[10,307,308],{},"If you rank purely by strongest weighted ratio across the full benchmark, the top single-family presets look like this:",[62,310,311,319,328],{},[65,312,313,315,316,244],{},[24,314,152],{}," reached a weighted ratio of ",[241,317,318],{},"1.667x",[65,320,321,324,325,244],{},[24,322,323],{},"zstd 22"," reached ",[241,326,327],{},"1.633x",[65,329,330,324,332,244],{},[24,331,168],{},[241,333,334],{},"1.632x",[10,336,337,338,39,340,342,343,345],{},"That looks close, but the time cost is not close at all. The high-end ",[24,339,76],{},[24,341,26],{},", and ",[24,344,30],{}," presets pay for those last percentage points with a lot more compression time. This is where interactive inspection matters: a small ratio gain near the top often costs a disproportionate amount of CPU.",[228,347,349],{"id":348},"_4-brotli-is-not-just-a-web-server-story","4. Brotli is not just a web-server story",[10,351,352,353,356],{},"Brotli is usually framed as “the thing CDNs use for text assets,” but that undersells it. 
In this benchmark it took ",[241,354,355],{},"18 out of 36"," best-ratio wins overall, especially on small structured inputs and dense text-like formats.",[10,358,359],{},"That does not make it the universal default, because higher Brotli levels become painfully slow on bigger archives. It does mean that whenever the payload looks like text and you care about output size more than encode time, Brotli deserves to be in the conversation.",[57,361,363],{"id":362},"category-by-category-read","Category-by-Category Read",[228,365,367],{"id":366},"codebases-and-source-files","Codebases and source files",[10,369,370],{},"This is where the benchmark becomes most satisfying.",[10,372,373,374,376,377,380,381,383,384,386,387,389],{},"On large codebases, ",[24,375,152],{}," won the pure ratio contest with a weighted ratio of ",[241,378,379],{},"4.143x",", but the best balanced preset for the category was ",[24,382,181],{},". That sounds contradictory until you look at the actual trade-off: ",[24,385,152],{}," absolutely crushes source-like data, but it makes you wait; ",[24,388,181],{}," gives back a large chunk of time while still delivering meaningful savings.",[10,391,392,393,395,396,399,400,402],{},"On single source files, ",[24,394,168],{}," was the category ratio leader at ",[241,397,398],{},"4.261x",", while ",[24,401,198],{}," ended up as the best balanced choice. That is a nice reminder that tiny files distort intuition: on small inputs, the absolute time penalty of stronger presets can stay low enough that aggressive codecs remain attractive.",[228,404,406],{"id":405},"documents-and-databases","Documents and databases",[10,408,409],{},"This was the most uneven category group.",[10,411,412,413,416],{},"The document category had everything from nearly incompressible office containers to absurd outliers. 
The single PDF sample compressed by ",[241,414,415],{},"97.7%"," at the strongest Brotli setting, which is an enormous result and also a warning not to generalize from one PDF to all PDFs. Some PDFs are mostly compressed images. Others still contain large compressible text or object streams. The format label alone is not enough.",[10,418,419,420,422,423,399,426,429,430,432,433,244],{},"At the category level, ",[24,421,152],{}," was the strongest ratio preset for documents at ",[241,424,425],{},"7.956x",[24,427,428],{},"zstd fast-1"," was the best balanced preset. That is exactly the kind of split I care about in practice: if I am archiving a text-heavy document corpus for cold storage, I may accept ",[24,431,76],{},"; if I am repeatedly packing and unpacking it, I want ",[24,434,26],{},[10,436,437],{},"The single SQL dump behaved more like code than like a “document,” and Brotli absolutely loved it. That makes sense: SQL dumps are repetitive, verbose, and highly structured.",[228,439,441],{"id":440},"binary-blobs-model-weights-vm-images-and-package-caches","Binary blobs, model weights, VM images, and package caches",[10,443,444],{},"The binary category is where simplistic rules start breaking down.",[10,446,447],{},"“Binary” can mean random noise, highly compressible VM disk contents, package caches, or model weights. Those are not equivalent at all.",[62,449,450,453,456,463],{},[65,451,452],{},"The random 2 GiB binary blob was effectively incompressible.",[65,454,455],{},"The Ubuntu VM image was very compressible.",[65,457,458,459,462],{},"The Qwen model directory saved more than ",[241,460,461],{},"30%"," at the best ratio, but only with very expensive compression presets.",[65,464,465],{},"The package cache only improved modestly.",[10,467,468,469,471,472,475,476,478],{},"The category-level ratio winner was ",[24,470,152],{}," at ",[241,473,474],{},"1.473x",", but the best balanced preset was ",[24,477,181],{},". 
That gap is the story of binary data in one line: some binary directory trees absolutely reward strong compression, but you cannot assume they all will.",[228,480,482],{"id":481},"images-audio-and-video","Images, audio, and video",[10,484,485],{},"These categories need nuance.",[10,487,488],{},"The image category includes BMP, RAW, PNG, and SVG, which are not remotely comparable from a compression standpoint. Vector text-like SVG behaved much more like a document. BMP and some RAW content still had real room to compress. PNG packs, unsurprisingly, were far more resistant.",[10,490,491,492,494],{},"Audio showed a similar split. Lossless or lightly packed material can still move a bit. Strongly compressed formats barely budge. The median best-case result of ",[241,493,295],{}," says it clearly: audio recompression is usually not where your big wins live.",[10,496,497],{},"Video is the easiest category to summarize because the answer is practically “don’t bother.” The benchmark agrees with common sense there.",[57,499,501],{"id":500},"the-codec-families-in-plain-english","The Codec Families in Plain English",[228,503,504],{"id":26},"Zstd",[10,506,507,509],{},[24,508,26],{}," is the codec I would hand to most people by default.",[10,511,512,513,515,516,518,519,521,522,524],{},"Its strongest overall preset, ",[24,514,323],{},", landed at ",[241,517,327],{}," weighted ratio, basically neck-and-neck with ",[24,520,168],{}," on the whole benchmark and not dramatically behind ",[24,523,152],{},". The difference is that lower and midrange Zstd presets stay far more usable. The family also dominated many of the balanced or near-balanced views, especially once decompression speed mattered.",[10,526,527],{},"If you want one family that stays credible almost everywhere, this is it.",[228,529,530],{"id":30},"Brotli",[10,532,533],{},"Brotli remains devastating on small, structured, text-heavy inputs.",[10,535,536],{},"Its problem is not effectiveness. 
Its problem is that the upper levels can get expensive fast. That makes Brotli ideal when the data is very text-like and the encode cost is acceptable, but less attractive as a universal archive codec for large mixed trees.",[228,538,540],{"id":539},"lzma2-xz","LZMA2 \u002F xz",[10,542,543,544,546],{},"When the question is “what produces the smallest file,” ",[24,545,76],{}," is still hard to ignore.",[10,548,549],{},"On codebases and documents it repeatedly took the strongest ratio spots. The problem is that the time curve is harsh. This is a cold-storage, distribution, or “compress once, keep forever” tool, not something I would casually throw into a hot path.",[228,551,552],{"id":91},"LZ4",[10,554,555,557],{},[24,556,91],{}," is the speed specialist. It usually loses the raw ratio contest, but it rarely surprises you in the wrong direction. If fast local packaging or very fast decompression matters more than shaving every last percent, it does exactly what you expect.",[228,559,561],{"id":560},"gzip-and-bzip2","gzip and bzip2",[10,563,564],{},"Neither is dead, but neither feels like the first choice anymore unless compatibility or existing tooling forces the decision.",[10,566,567,570,571,574],{},[24,568,569],{},"gzip"," still makes sense when universal support matters. ",[24,572,573],{},"bzip2"," occasionally produced strong ratio results, including the best result on the model directory, but its overall speed profile remains hard to justify against newer alternatives.",[57,576,578],{"id":577},"the-nonlinear-part-levels-stop-paying-off","The Nonlinear Part: Levels Stop Paying Off",[10,580,581],{},"One of the easiest mistakes in compression tuning is assuming that moving from level 6 to level 9 is a proportional change. 
It rarely is.",[10,583,584],{},"This benchmark kept showing the same pattern:",[62,586,587,590,593],{},[65,588,589],{},"The first step away from the fastest presets usually buys a meaningful gain.",[65,591,592],{},"Midrange presets often capture most of the value.",[65,594,595],{},"The last few levels are where time explodes and the ratio gain shrinks.",[10,597,598],{},"That does not mean high levels are pointless. It means they need a reason. If the archive is large, text-heavy, and long-lived, maybe the extra time is justified. If you are constantly creating, unpacking, or transferring it, the sweet spot moves downward quickly.",[10,600,601],{},"The level curve in the interactive section is the best way to see this on a dataset-by-dataset basis.",[57,603,605],{"id":604},"what-i-would-actually-use","What I Would Actually Use",[10,607,608],{},"If I were translating this benchmark into defaults for my own systems, it would look roughly like this:",[610,611,612,623,629,635,640],"ol",{},[65,613,614,39,616,618,619,622],{},[24,615,132],{},[24,617,428],{},", or ",[24,620,621],{},"zstd 1-3"," for everyday use.",[65,624,625,628],{},[24,626,627],{},"brotli 9-11"," for dense web text or small structured files where output size matters a lot.",[65,630,631,634],{},[24,632,633],{},"lzma2 7-9e"," only for text-heavy archives where compression time is a secondary concern.",[65,636,637,639],{},[24,638,181],{}," when speed and low CPU cost matter more than the smallest result.",[65,641,642],{},"No extra compression at all for many already compressed media files.",[10,644,645],{},"That list is much closer to how systems are actually operated than a pure “who got the highest ratio” ranking.",[57,647,649],{"id":648},"methodology-and-caveats","Methodology and Caveats",[62,651,652,675,678,681,684,687],{},[65,653,654,655,658,659,662,663,39,665,39,667,39,669,39,671,342,673,244],{},"The benchmark covers ",[241,656,657],{},"36"," datasets and ",[241,660,661],{},"95"," distinct presets 
across ",[24,664,26],{},[24,666,30],{},[24,668,34],{},[24,670,91],{},[24,672,569],{},[24,674,573],{},[65,676,677],{},"Both files and directories were included because real-world archive behavior depends heavily on corpus structure, not just on file format labels.",[65,679,680],{},"I tracked compression ratio, percent space saved, compression time, and decompression time.",[65,682,683],{},"The interactive article includes both raw runs and grouped views because a weighted category average and a single-dataset winner answer different questions.",[65,685,686],{},"The “balanced” score in the charts is not a codec-standard metric. It is a normalized benchmark-specific score that rewards ratio, compression speed, and decompression speed together.",[65,688,689],{},"Some samples are intentionally weird. The PDF result is a perfect example: it is real data, but it should not be treated as “all PDFs behave like this.”",[57,691,693],{"id":692},"final-take","Final Take",[10,695,696],{},"The most useful result here is not that one algorithm “won.” It is that compression outcomes are highly dependent on the structure of the data, and the moment you measure both compression and decompression time the leaderboard reshuffles.",[10,698,699],{},"That is exactly why I wanted the article to stay interactive. A storage archive, a CDN asset pipeline, a backup job, a package mirror, and a local developer workflow do not want the same answer. 
With the filters, scatter plot, heatmap, level curves, and sortable leaderboard above, you can pull the benchmark toward the decision you actually need to make.",{"title":701,"searchDepth":702,"depth":702,"links":703},"",2,[704,705,706,707,714,720,727,728,729,730],{"id":59,"depth":702,"text":60},{"id":95,"depth":702,"text":96},{"id":215,"depth":702,"text":216},{"id":225,"depth":702,"text":226,"children":708},[709,711,712,713],{"id":230,"depth":710,"text":231},3,{"id":277,"depth":710,"text":278},{"id":304,"depth":710,"text":305},{"id":348,"depth":710,"text":349},{"id":362,"depth":702,"text":363,"children":715},[716,717,718,719],{"id":366,"depth":710,"text":367},{"id":405,"depth":710,"text":406},{"id":440,"depth":710,"text":441},{"id":481,"depth":710,"text":482},{"id":500,"depth":702,"text":501,"children":721},[722,723,724,725,726],{"id":26,"depth":710,"text":504},{"id":30,"depth":710,"text":530},{"id":539,"depth":710,"text":540},{"id":91,"depth":710,"text":552},{"id":560,"depth":710,"text":561},{"id":577,"depth":702,"text":578},{"id":604,"depth":702,"text":605},{"id":648,"depth":702,"text":649},{"id":692,"depth":702,"text":693},"2026-04-10T12:00:00+02:00","A deep benchmark of 36 real files and directories across 95 compression presets, with interactive charts for ratio, speed, decompression cost, and category-specific winners.","md",true,null,{},"\u002Fblog\u002Fcompression-benchmark-real-files-2026",{"title":5,"description":732},"blog\u002Fcompression-benchmark-real-files-2026",[741,742,743,26,30,744],"compression","benchmark","performance","linux","caB9T_9huI82Q_np3Jsr9wjzi2WJ8miPLmHF64o1hbo",1775845178912]