The biggest change is that the Network and Data portions are now in two different SDKs. You may now use either one on its own, or both together. If you are a licensee of both Data and Network, you will get two SDKs. The two SDKs can be installed to the same place; shared files may overwrite each other.
The compressed data bitstreams (for both Network & Data) are not changed from 2.6 to 2.7 ; I've bumped the major version just because of the large API change of splitting the libs.
We've also made some optimizations in the LZ decoders; Mermaid, Kraken & Leviathan are all a wee bit faster to decode.
Perf comparison to previous versions :
Oodle 2.7.0 , Kraken 8 :
PD3D : 4.02:1 , 1.0 enc MB/s , 1128.5 dec MB/s
GTS : 2.68:1 , 1.0 enc MB/s , 1324.1 dec MB/s
Silesia : 4.25:1 , 0.6 enc MB/s , 1062.2 dec MB/s
============================
PD3D :
Kraken8 255 : 3.67:1 , 2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5: 3.72:1 , 1.2 enc MB/s , 1079.9 dec MB/s
Kraken8 260 : 4.00:1 , 1.0 enc MB/s , 1034.7 dec MB/s
GTS :
Kraken8 255 : 2.60:1 , 2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5: 2.63:1 , 1.2 enc MB/s , 1343.8 dec MB/s
Kraken8 260 : 2.67:1 , 1.0 enc MB/s , 1282.3 dec MB/s
Silesia :
Kraken8 255 : 4.12:1 , 1.4 enc MB/s , 982.0 dec MB/s
Kraken8 260 -v5: 4.18:1 , 0.6 enc MB/s , 1018.7 dec MB/s
Kraken8 260 : 4.24:1 , 0.6 enc MB/s , 985.4 dec MB/s
ozip can be used as a command line compressor to create or decode Oodle-compressed files; ozip can also be used to pipe Oodle-compressed streams.
The intention is that ozip should act similarly to gzip (in terms of command line arguments, but with more compression and faster decompression) so it can be dropped in to standard work flows that use gzip-like compressors. For example ozip can be used with tar's "-I" ("--use-compress-program") option to pipe the tar package through ozip.
ozip works with pipes (particularly useful on Unix), so it can be used to pipe compressed data. (eg. with things like "zfs send" to a pipe).
A pre-compiled ozip binary is now distributed with the Oodle SDK.
You can also get and modify the ozip source code on github :
To build ozip from source code, you need the Oodle SDK . (ozip is open source and public domain, Oodle is not).
If you have corrections to ozip, we're happy to take pull requests, particularly with respect to making sure we act like gzip and that it's easy to drop ozip into gzip-like work flows on Unix. When writing ozip, we found that gzip/bzip/xz don't have exactly the same command line argument handling, and yet are treated as interchangeable by various programs. We tried to replicate the common intersection of their behavior.
ozip is a single file compressor, not a package archiver. It does not store file names or metadata.
ozip was written by my brother, James Bloom.
If you have server workflows that involve streaming compressed data, or think you could benefit from Oodle, we'd be happy to hear from you. We're still evolving our solutions in this space.
If you are compressing very large files (not piped streams), you can get much higher performance by using threads and asynchronous IO to overlap IO with compression CPU time. If this is important to you, ask about the "oozi" example in Oodle for reference on how to do that.
Previously the fastest Oodle encode level was "SuperFast" (level 1). The new "HyperFast" levels are below that (level -1 to -4). The HyperFast levels sacrifice some compression ratio to maximize encode speed.
An example of the performance of the new levels (on lzt99, x64, Core i7-3770) :
In the loglog plot, up = higher compression ratio, right = faster encode.
lzt99 : Kraken-z-3 : 1.711 to 1 : 416.89 MB/s
lzt99 : Kraken-z-2 : 1.877 to 1 : 333.28 MB/s
lzt99 : Kraken-z-1 : 2.103 to 1 : 280.09 MB/s
lzt99 : Kraken-z1 : 2.268 to 1 : 167.01 MB/s
lzt99 : Kraken-z2 : 2.320 to 1 : 120.39 MB/s
lzt99 : Kraken-z3 : 2.390 to 1 : 38.85 MB/s
lzt99 : Kraken-z4 : 2.434 to 1 : 24.98 MB/s
lzt99 : Mermaid-z-3 : 1.660 to 1 : 438.89 MB/s
lzt99 : Mermaid-z-2 : 1.793 to 1 : 353.82 MB/s
lzt99 : Mermaid-z-1 : 2.011 to 1 : 277.35 MB/s
lzt99 : Mermaid-z1 : 2.041 to 1 : 261.38 MB/s
lzt99 : Mermaid-z2 : 2.118 to 1 : 172.77 MB/s
lzt99 : Mermaid-z3 : 2.194 to 1 : 97.11 MB/s
lzt99 : Mermaid-z4 : 2.207 to 1 : 40.88 MB/s
lzt99 : Selkie-z-3 : 1.447 to 1 : 627.76 MB/s
lzt99 : Selkie-z-2 : 1.526 to 1 : 466.57 MB/s
lzt99 : Selkie-z-1 : 1.678 to 1 : 370.34 MB/s
lzt99 : Selkie-z1 : 1.698 to 1 : 340.68 MB/s
lzt99 : Selkie-z2 : 1.748 to 1 : 204.76 MB/s
lzt99 : Selkie-z3 : 1.833 to 1 : 107.29 MB/s
lzt99 : Selkie-z4 : 1.863 to 1 : 43.65 MB/s
A quick guide to the Oodle CompressionLevels :
-4 to -1 : HyperFast levels
when you want maximum encode speed
these sacrifice compression ratio for encode time
0 : no compression (memcpy pass through)
1 to 4 : SuperFast, VeryFast, Fast, Normal
these are the "normal" compression levels
encode times are ballpark comparable to zlib
5 to 8 : optimal levels
increasing compression ratio & encode time
levels above 6 can be slow to encode
these are useful for distribution, when you want the best possible bitstream
Note that the CompressionLevel is a dial for encode speed vs. compression ratio. It does not have a
consistent correlation to decode speed. That is, all of these compression levels get roughly the same
excellent decode speed.
Comparing to Oodle 2.6.0 on Silesia :
Oodle 2.6.0 :
Kraken 1 "SuperFast" : 3.12:1 , 147.2 enc MB/s , 920.9 dec MB/s
Kraken 2 "VeryFast" : 3.26:1 , 107.8 enc MB/s , 945.0 dec MB/s
Kraken 3 "Fast" : 3.50:1 , 47.1 enc MB/s , 1043.3 dec MB/s
Oodle 2.6.3 :
Kraken -2 "HyperFast2" : 2.92:1 , 300.4 enc MB/s , 1092.5 dec MB/s
Kraken -1 "HyperFast1" : 3.08:1 , 231.3 enc MB/s , 996.2 dec MB/s
Kraken 1 "SuperFast" : 3.29:1 , 164.6 enc MB/s , 885.0 dec MB/s
Kraken 2 "VeryFast" : 3.40:1 , 109.5 enc MB/s , 967.3 dec MB/s
Kraken 3 "Fast" : 3.61:1 , 45.8 enc MB/s , 987.5 dec MB/s
Note that in Oodle 2.6.3 the normal levels (1-3) have also improved (much higher compression ratios).
I should also note the HyperFast levels are available for Kraken, Mermaid & Selkie. They currently do nothing on Leviathan (they are the same as SuperFast). Leviathan is probably not the right choice if encode speed is a priority for you.
This is another post about careful measurement, how to compare compressors, and about the unique way Oodle works.
(usual caveat: I don't mean to pick on ZStd here; I use it as a reference point because it is excellent, the closest thing to Oodle, and something we are often compared against. ZStd timing is done with lzbench; times are on x64 Core i7-3770)
There are two files in my "gametestset" where ZStd appears to be significantly faster to
decode than Leviathan :
e.dds :
zstd 1.3.3 -22 3.32 MB/s 626 MB/s 403413 38.47%
ooLeviathan8 : 1,048,704 -> 355,045 = 2.708 bpb = 2.954 to 1
decode : 1.928 millis, 6.26 c/b, rate= 544.03 MB/s
Transistor_AudenFMOD_Ambience.bank :
zstd 1.3.3 -22 5.71 MB/s 4257 MB/s 16281301 84.18%
ooLeviathan8 : 19,341,802 ->16,178,303 = 6.692 bpb = 1.196 to 1
decode : 8.519 millis, 1.50 c/b, rate= 2270.48 MB/s
Whoah! ZStd is a lot faster to decode than Leviathan on these files, right? (626 MB/s vs 544.03 MB/s and 4257 MB/s vs 2270.48 MB/s)
No, it's not that simple. Compressor performance is a two axis value of {space,speed}. It's a 2d vector, not a scalar. You can't simply take one component of the vector and just compare speeds at unequal compression.
All compressors are able to hit a range of {space,speed} points by making different decisions. For example with ZStd at level 22 you could forbid length 3 matches and that would bias it more towards decode speed and lower compression ratio.
Oodle is unique in being fundamentally built as a space-speed optimization process. The Oodle encoders can make bit streams that cover a range of compression ratios and decode speeds, depending on what the client asks it to prioritize.
Compressor performance is determined by two things : the fundamental algorithm, and the current settings. The settings will allow you to dial the 2d performance data point to different places. The algorithm places a limit on where those data points can be - it defines a Pareto Frontier. This Pareto curve is a fundamental aspect of the algorithm, while the exact space speed point on that curve is simply a choice of settings.
There is no such thing as "what is the speed of ZStd?". It depends how you have dialed the settings to reach different performance data points. The speed is not a fundamental aspect of the algorithm. The Pareto frontier *is* a fundamental aspect, the limit on where those 2d data points can reach.
One way to compare compression algorithms (as opposed to their current settings) is to plot many points of their 2d performance at different settings, and then inspect how the curves lie. If one curve strictly covers the other, then that algorithm is always better. Or they might cross at some point, which means each algorithm is best in a different performance domain.
Another way to compare compression algorithms is to dial them to find points where one axis is equal (either they decode at the same speed, or they have the same compression ratio); then you can do a simple 1d comparison of the other value. You can also try to find points where one compressor is strictly better on both axes. The inconclusive situation is when one compressor is better on one axis, and the other is better on the other axis.
(note I have been talking about compressor performance as the 2d vector of {decode speed,ratio} , but of course you could also consider encode time, memory use, other factors, and then you might choose other axes, or have a 3d or 4d value to compare. The same principles apply.)
(there is another way to compare 2d compressor performance with 1d scalar; at RAD we internally use the corrected Weissman score . One of the reasons we use the 1d Weissman score is because sometimes we make an improvement to a compressor and one of the axes gets worse. That is, we do some work, and then measure, and we see compression ratio went down. Oh no, WTF! But actually decode speed went up. From the 2d performance vector it can be hard to tell if you made an improvement or not, the 1d scalar Weissman score makes that easier.)
Oodle is an optimizing compiler
Oodle is fundamentally different than other compressors. There is no "Oodle has X performance". Oodle has whatever performance you ask it to have (and the compressed size will vary along the Pareto frontier).
Perhaps an analogy that people are more familiar with is an optimizing compiler.
The Oodle decoder is a virtual machine that runs a "program" to create an output. The compressed data is the program of commands that run in the Oodle interpreter.
The Oodle encoder is a compiler that makes the program to run on that machine (the Oodle decoder). The Oodle compiler tries to create the most optimal program it can, by considering different instruction sequences that can create the same output. Those different sequences may have different sizes and speeds. Oodle chooses them based on how the user has specified to consider the value of time vs. size. (this is a bit like telling your optimizing compiler to optimize for size vs. optimize for speed, but Oodle is much more fine grained).
For example at the microscopic level, Oodle might consider a sequence of 6 bytes. This can be sent as 6 literals, or a pair of length 3 matches, or a literal + a len 4 rep match + another literal. Each possibility is considered and the cost is measured for size & decode time. At the microscopic level Oodle considers different encodings of the command sequences, whether to send some things uncompressed or with different entropy coders, and different bit packings.
Oodle is a market trader
Unlike any other lossless compressor, Oodle makes these decisions based on a cost model.
It has been standard for a long time to make space vs. speed decisions in lossless compressors, but it has in the past always been done with hacky ad-hoc methods. For example, it's common to say something like "if the compressed size is only 1% less than the uncompressed size, then just send it uncompressed".
Oodle does not do that. Oodle considers its compression savings (bytes under the uncompressed size) to be "money". It can spend that money to get decode time. Oodle plays the market, it looks for the best price to spend its money (size savings) to get the maximum gain of decode time.
Oodle does not make ad-hoc decisions to trade speed for size, it makes an effort to get the best possible value for you when you trade size for speed. (it is of course not truly optimal because it uses heuristics and limits the search, since trying all possible encodings would be intractable).
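As a toy illustration of that kind of cost model (invented for illustration, not Oodle's actual code), each candidate encoding of a chunk gets a scalar cost that mixes size and estimated decode time through a single price parameter :

// toy cost-model decision : each candidate encoding of a chunk has a
// compressed size and an estimated decode time ; "lambda" is the price
// of decode time, expressed in output bytes per cycle
typedef struct { double size_bytes; double decode_cycles; } Candidate;

static double candidate_cost(Candidate c, double lambda_bytes_per_cycle)
{
    return c.size_bytes + c.decode_cycles * lambda_bytes_per_cycle;
}

// pick the candidate with the lowest space-speed cost
static int pick_best(const Candidate * cands, int count, double lambda)
{
    int best = 0;
    for (int i = 1; i < count; i++)
        if (candidate_cost(cands[i], lambda) < candidate_cost(cands[best], lambda))
            best = i;
    return best;
}

Dialing lambda toward zero asks for minimum size; raising it spends size savings to buy decode speed.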
Because of this, it's easy to dial Oodle to different performance points to find more fundamental comparisons with other compressors. (see, for example : Oodle tuneability with space-speed tradeoff )
Note that traditional ad-hoc compressors (like ZStd and everyone else) make mistakes in their space-speed decisions. They do not allocate time savings to the best possible files. This is an inevitable consequence of having simple thresholds in decision making (and this flaw is what led us to do a true cost model). That is, Leviathan decode speed is usually, say, 30% faster than ZStd. On some files that ratio goes way up or way down. When that happens, it is often because ZStd is making a mistake. That is, it's not paying the right price to trade size for speed.
Of course this relies on you telling Oodle the truth about whether you want decode speed or size. Since Oodle is aggressively trading the market, you must tell it the way you value speed vs. size. If you use Leviathan at default settings, Oodle thinks your main concern is size, not decode speed. If you actually care more about decode speed, you need to adjust the price (with "spaceSpeedTradeoffBytes") or possibly switch to another compressor (Kraken, Mermaid, or let Hydra switch for you).
Back to the files where ZStd is faster
Armed with our new knowledge, let's revisit those two files :
e.dds : zstd 1.3.3 -22 : 2.600 to 1 : 625.72 MB/s
e.dds : Leviathan -8 : 2.954 to 1 : 544.03 MB/s
Transistor_AudenFMOD_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4257 MB/s
Transistor_AudenFMOD_Ambience.bank : Leviathan -8 : 1.196 to 1 : 2270.48 MB/s
Is ZStd faster on these files? At this point we don't know. These are inconclusive data points. In both cases, Leviathan has more
compression, but ZStd has more speed - the victor on each axis differs and we can't easily say which is really doing better.
To get a simpler comparison we can dial Leviathan to different performance points using Oodle's "spaceSpeedTradeoffBytes" parameter, which sets the relative cost of time vs size in Oodle's decisions.
That is, in both cases Oodle has size savings to spend. It can spend those size savings to get more decode speed.
On e.dds, let's take Leviathan and dial spaceSpeedTradeoffBytes up from the default of 256 in powers of two to favor decode speed more :
e.dds : zstd 1.3.3 -22 : 2.600 to 1 : 625.72 MB/s
e.dds : Leviathan 1 : 3.020 to 1 : 448.30 MB/s
e.dds : Leviathan 256 : 2.954 to 1 : 544.03 MB/s
e.dds : Leviathan 512 : 2.938 to 1 : 577.23 MB/s
e.dds : Leviathan 1024 : 2.866 to 1 : 826.15 MB/s
e.dds : Leviathan 2048 : 2.831 to 1 : 886.42 MB/s
What is the speed of Leviathan? There is no one speed of Leviathan. It can go from 448 MB/s to 886 MB/s depending on what you tell the
encoder you want. The fundamental aspect is what compression ratio can be achieved at each decode speed.
We can see that ZStd is not fundamentally faster on this file; in fact Leviathan can get much more decode speed AND compression ratio at spaceSpeedTradeoffBytes = 1024 or 2048.
Similarly on Transistor_AudenFMOD_Ambience.bank :
Transistor_Aude...D_Ambience.bank : zstd 1.3.3 -22 : 1.188 to 1 : 4275.38 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 256 : 1.196 to 1 : 2270.48 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 512 : 1.193 to 1 : 3701.30 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 1024 : 1.190 to 1 : 4738.83 MB/s
Transistor_Aude...D_Ambience.bank : Leviathan 2048 : 1.187 to 1 : 6193.92 MB/s
zstd 1.3.3 -22 5.71 MB/s 4257 MB/s 16281301 84.18 Transistor_AudenFMOD_Ambience.bank
Leviathan spaceSpeedTradeoffBytes = 2048
ooLeviathan8 : 19,341,802 ->16,290,106 = 6.738 bpb = 1.187 to 1
decode : 3.123 millis, 0.55 c/b, rate= 6193.92 MB/s
In this case we can dial Leviathan to get very nearly the same compressed size, and then just compare speeds
(4275.38 MB/s vs 6193.92 MB/s).
Again ZStd is not actually faster than Leviathan here. If you looked at Leviathan's default setting encode (2270.48 MB/s) you were not seeing ZStd being faster to decode. What you are seeing is that you told Leviathan to choose an encoding that favors size over decode speed.
It doesn't make sense to tell Oodle to make a very small file, and then just compare decode speeds. That's like telling your waiter that you want the cheapest bottle of wine, then complaining that it doesn't taste as good as the $100 bottle. You specifically asked me to optimize for the opposite goal!
(note that in the Transistor bank case, it looks like Oodle is paying a bad price to get a tiny compression savings; going from 6000 MB/s to 2000 MB/s seems like a lot. In fact that is a small time difference, while 1.187 to 1.196 ratio is actually a big size savings. The problem here is that ratio & speed are inverted measures of what we are really optimizing, which is time and size. Internally we always look at bpb (bits per byte) and cpb (cycles per byte) when measuring performance.)
e.dds charts :
See also : The Natural Lambda
I believe what's happened is that many people have read about the dangers of artificial benchmarks. (for example there are some famous papers on the perils of profiling malloc with synthetic workloads, or how profiling threading primitives in isolation is pretty useless).
While those warnings do raise important issues, the right response is not to switch to timing whole operations.
For example, while timing mallocs with bad synthetic data loads is not useful (and perhaps even harmful), similarly timing an entire application run to determine whether a malloc is better or not can also be misleading.
Basically I think the wrong lesson has been learned and people are over simplifying. They have taken one bad practice (time operations by running them in a synthetic test bench over and over), and replaced it with another bad practice (time the whole application).
The reality of profiling is far more complex and difficult. There is no one right answer. There is not a simple prescription of how to do it. Like any scientific measurement of a complex dynamic system, it requires care and study. It requires looking at the specific situation and coming up with the right measurement process. It requires secondary measurements to validate your primary measurements, to make sure you are testing what you think you are.
Now, one of the appealing things of whole-process timing is that in one very specific case, it is the right thing to do.
IF the thing you care about is whole-process time, and the process is always run the same way, and you do timing on the system that the process is run on, and in the same application state and environment, AND, crucially, you are only allowed to make one change to the process - then whole-process timing is right.
Let's first talk about the last issue, which is the "single change" problem.
Quite often a good change can appear to do nothing (or even be negative) for whole process time on its own. By looking at just the whole process time to evaluate the change, you miss a very positive step. Only if another step is taken will the value of that first step be shown.
A common case of this is if your process has other limiting factors that need to be fixed.
For example on the macroscopic level, if your game is totally GPU bound, then anything you do to CPU time will not show up at all if you are only measuring whole frame time. So you might profile a CPU optimization and see no benefit to frame time. You can miss big improvements this way, because they will only show up if you also fix what's causing the process to be GPU bound.
Similarly at a more microscopic level, it's common to have a major limiting factor in a sequence of code. For example you might have a memory read that typically misses cache, or an unpredictable branch. Any improvements you make to the arithmetic instructions in that area may be invisible, because the processor winds up stalling on a very slow cache line fill from memory. If you are timing your optimization work "in situ" to be "realistic" you can completely miss good changes because they are hidden by other bad code.
Another common example, maybe you convert some scalar code to SIMD. You think it should be faster, but you time it in app and it doesn't seem to be. Maybe you're bound in other places. Maybe you're suffering added latency from round-tripping from scalar to SIMD back to scalar. Maybe your data needs to be reformatted to be stored in SIMD friendly ways. Maybe the surrounding code needs to be also converted to SIMD so that they can hand off more smoothly. There may in fact be a big win there that you aren't seeing.
This is a general problem that greedy optimization and trying to look at steps one by one can be very misleading when measuring whole process time. Sometimes taking individual steps is better evaluated by measuring just those steps in isolation, because using whole process time obscures them. Sometimes you have to try a step that you believe to be good even if it doesn't show up in measurements, and see if taking more steps will provide a non-greedy multi-step improvement.
Particular perils of IO timing
A very common problem that I see is trying to measure data loading performance, including IO timing, which is fraught with pitfalls.
If you're doing repeated timings, then you'll be loading data that is already in the system disk cache, so your IO speed may just look like RAM speed. Is what's important to you cold cache timing (user's first run), or hot cache time? Or both?
Obviously there is a wide range of disk speeds, from very slow hard disks (as on consoles) in the 20 MB/s range up to SSD's and NVMe in the GB/s range. Which are you timing on? Which will your user have? Whether you have slow seeks or not can be a huge factor.
Timing on consoles with disk simulators (or worse : host FS) is particularly problematic and may not reflect real world performance at all.
The previously mentioned issue of high latency problems hiding good changes is very common. For example doing lots of small IO calls creates long idle times that can hide other good changes.
Are you timing on a disk that's fragmented, or nearly full? Has your SSD been through lots of write cycles already or does it need rebalancing? Are you timing when other processes are running hitting the disk as well?
Basically it's almost impossible to accurately recreate the environment that the user will experience. And the variation is not small, it can be absolutely massive. A 1 byte read could take anything between 1 nanosecond (eg. data already in disk cache) to 100 milliseconds (slow HD seek + other processes hitting disk).
Because of the uncertainty of IO timing, I just don't do it.
I use a simulated "disk speed" and just set :
disk time = data size / simulated disk speed
Then the question is, well if it's so uncertain, what simulated disk speed do you use? The answer is : all of them. You cannot
say what disk speed the user will experience, there's a huge range, so you need to look at performance over a spectrum of disk speeds.
I do this by making a plot of what the total time for (load + decomp) is over a range of simulated disk speeds. Then I can examine how the performance is affected over a range of possible client systems, without trying to guess the exact disk speed of the client runtime environment. For more on this, see : Oodle LZ Pareto Frontier or Oodle Kraken Pareto Frontier .
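A minimal sketch of that kind of sweep, with made-up ratios and decode speeds standing in for real measurements :

#include <stdio.h>

// sketch : total load time = simulated disk time + decompress time,
// swept over a range of simulated disk speeds ; numbers are placeholders
typedef struct { const char * name; double ratio; double decode_MBps; } Comp;

int main(void)
{
    const double raw_MB = 256.0;
    const Comp comps[2] = { { "A", 2.5, 1000.0 }, { "B", 2.0, 3000.0 } };
    for (double disk_MBps = 20.0; disk_MBps <= 2560.0; disk_MBps *= 2.0)
    {
        printf("disk %6.0f MB/s :", disk_MBps);
        for (int i = 0; i < 2; i++)
        {
            double t = (raw_MB / comps[i].ratio) / disk_MBps   // IO time for the compressed data
                     + raw_MB / comps[i].decode_MBps;          // decompress time
            printf("  %s %.3f s", comps[i].name, t);
        }
        printf("\n");
    }
    return 0;
}

On slow disks the higher-ratio compressor wins; on fast disks the faster decoder wins; the crossover is what the plot makes visible.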
Recall from the tour of mixing :
Geometric mixing is a product of experts. In the binary case, this reduces to linear mixing in logit (codelen difference) domain; this is what's used by PAQ. The coefficients of a geometric mixer are not really "weights" , in that they don't sum to one, and they can be negative.
In fact the combination of experts in geometric mixing is not convex; that is, the mixer does not necessarily interpolate them. Linear mixing stays within the simplex of the original experts, it can't extrapolate (because weights are clamped in [0,1]).
For example, say your best expert always gets the polarity of P right (favors bit 0 or 1 at the right time), but it always predicts P a bit too low. It picks P of 0.7 when it should be 0.8. The linear mixer can't fix that. It can at most give that expert a weight of 100%. The geometric mixer can fix that. It can apply an amplification factor that says - yes I like your prediction, but take it farther.
The geometric mixer coefficients are literally just a scaling of the experts' codelen difference. The gradient descent optimizes that coefficient to make output probabilities that match the observed data; to get there it can apply amplification or suppression of the codelen difference.
Let's see this in a very simple case : just one expert.
The expert here is "geo 5" , (a 1/32 geometric probability update). That's pretty fast for real world use but it looks very slow in these little charts. We apply a PAQ style logit mixer with a *very* fast "learning rate" to exaggerate the effect (1000X faster than typical).
Note the bit sequence here is different than the last post; I've simplified it here to just 30 1's then 10 0's to make the effect more obvious.
The underlying expert adapts slowly : (P(1) in green, codelen difference in blue)
Note that even in the 0000 range, geo 5 is still favoring P(1) , it hasn't forgotten all the 1's at the start. Codelen difference is still positive (L(0) > L(1)).
With the PAQ mixer applied to just a single expert :
In the 111 phase, the mixer "weight" (amplification factor) goes way up; it stabilizes around 4. It's learning that the underlying expert has P(1) on the right side, so our weight should be positive, but its P(1) is way too low, so we're scaling up the codelen difference by 4X.
In the 000 phase, the mixer quickly goes "whoah wtf this expert is smoking crack" and the weight goes *negative*. P(1) goes way down to around 15% even though the underlying expert still has a P(1) > 50%.
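Here's a small simulation sketch of that setup (a geo-5 expert fed through a single-coefficient logit mixer); the learning rate and exact values are illustrative, not the ones used to make the charts above :

#include <stdio.h>
#include <math.h>

// one "geo 5" expert (IIR update, rate 1/32) run through a single-coefficient
// logit-domain mixer with a deliberately huge learning rate ; watch the
// coefficient climb well above 1 during the 1s, then go negative during the 0s
// even though the expert's P(1) stays above 0.5
int main(void)
{
    double P1 = 0.5;    // expert's probability of a 1
    double c  = 1.0;    // mixer coefficient on the expert's logit
    const double expert_rate = 1.0 / 32.0;
    const double mixer_rate  = 0.5;     // unrealistically fast, to exaggerate the effect

    for (int t = 0; t < 40; t++)
    {
        int bit = (t < 30) ? 1 : 0;     // 30 ones then 10 zeros

        double logitP = log(P1 / (1.0 - P1));
        double M = 1.0 / (1.0 + exp(-c * logitP));   // mixed P(1)

        printf("bit %d : expert P1 = %.3f , mixed P1 = %.3f , c = %+.3f\n", bit, P1, M, c);

        c += mixer_rate * (bit - M) * logitP;        // = (1 - M(x)) * logit(P(x)) at the coded x

        if (bit) P1 += (1.0 - P1) * expert_rate;     // expert's own geometric update
        else     P1 -= P1 * expert_rate;
    }
    return 0;
}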
Now in practice this is not how you use mixers. The learning rate in the real world needs to be way lower (otherwise you would be shooting your weights back and forth all the time, overreacting to the most recent coding). In practice the weights converge slowly to an ideal and stay there for long periods of time.
But this amplification compensation property is real, just more subtle (more like 1.1X rather than 4X).
For example, perhaps one of your models is something like a deterministic context (PPM*) model. You find the longest context that has seen any symbols before. That maximum-length context usually has very sparse statistics but can be a good predictor; how good it is exactly depends on the file. So that expert contributes some P for the next symbol based on what it sees in the deterministic context. It has to just make a wild guess because it has limited observations (perhaps it uses secondary statistics). Maybe it guesses P = 0.8. The mixer can learn that no, on this particular file the deterministic model is in fact better than that, so I like you and amplify you even a bit more, your coefficient might converge to 1.1 (on another file, maybe the deterministic expert is not so great, its weight might go to 0.7, you're getting P in the right direction, but it's not as predictable as you think).
I do think it helps to get intuition by actually looking at charts & graphs, rather than just always look at the end result score for something like compression.
We're going to look at binary probability estimation schemes.
Binary probability estimation is just filtering the history of bits seen in some way.
Each bit seen is a digital signal of value 0 or 1. You just want some kind of smoothing of that signal. Boom, that's your probability estimate, P(1). All this nonsense about Poisson and Bernoulli processes and blah blah, what a load of crap. It's just filtering.
For example, the "Laplace" estimator
P(1) = (n1 + c)/(n0 + n1 + 2c)
That's just the average of the entire signal seen so far. Add up all the bits, divide by number.
(countless papers have been written about what c should be (c= 1? c = 1/2?), why not 1/4? or 3/4?
Of course there's no a-priori reason for any particular value of c and in the real world it should just be tweaked to maximize results.)
That's the least history-forgetting estimator of all, it weights everything in the past equally
(we never want an estimator where older stuff is more important). In the other direction
you have lots of options - FIR and IIR filters. You could of course take the last N bits and average them (FIR filter), or take the last N and
average with a weight that smoothly goes from 0 to 1 across the window (sigmoidal or just linear). You can of course take an IIR average,
that's
average <- (average*N + last)/(N+1)
Which is just the simplest IIR filter, aka geometric decay, "exponential smoothing" blah blah.
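As code, two of these estimators might look like the following sketch (a Laplace-style counter, and the geometric IIR written in the usual fixed-point shift form) :

// Laplace-style : count zeros and ones, P(1) = (n1 + c) / (n0 + n1 + 2c)
typedef struct { int n0, n1; } Counter;
static double counter_p1(const Counter * s, double c)
{
    return (s->n1 + c) / (s->n0 + s->n1 + 2.0 * c);
}
static void counter_update(Counter * s, int bit) { if (bit) s->n1++; else s->n0++; }

// geometric IIR ("exponential smoothing") in fixed point with an update shift ;
// P1_16 is a probability of a 1 in [0,65536) ; shift 2 and shift 5 give the
// "geo 2" and "geo 5" behaviors discussed below
static void geo_update(unsigned * P1_16, int bit, int shift)
{
    if (bit) *P1_16 += (65536 - *P1_16) >> shift;
    else     *P1_16 -= *P1_16 >> shift;
}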
Anyway, that's just to get us thinking in terms of filters. Let's look at some common filters and how they behave.
Green bars = probability of a 1 bit after the bit at bottom is processed.
Blue bars = log odds = codelen(0) - codelen(1)
Laplace : count n0 & n1 :
After 10 1's and 10 0's, P is back to 50/50 ; we don't like the way this estimator has no recency bias.
Geometric IIR with updshift 2 (very fast) :
When a fast geometric gets to the 010101 section it wiggles back and forth wildly. If you look at the codelen difference in that region you can see it's wasting something like 1.5 bits on each coding (because it's always wiggled in just the wrong way, eg. biased towards 0 right before a 1 occurs).
Note that geometric probabilities have an exponential decay shape. (specifically, 0.75^n, where 0.75 is (1 - 1/4) and the 4 is because of shift 2). HOWEVER the codelen difference steps are *linear*.
The geometric probability update (in the direction of MPS increasing probability) is very close to just adding a constant to codelen difference (logit). (this just because P *= lambda , so log(P) += log(lambda)). You can see after the 111 region, the first 0 causes a big negative step in codelen difference, and then once 0 becomes the MPS the further 0 steps are the same constant linear step.
Geometric IIR with updshift 5 (fast) :
Shift 5 is still fast by real world standards but looks slow compared to the crazy speed of geo 2. You can see that the lack of an initial adaptation range hurts the ramp up on the 1111 portion. That is, "geo 5" acts like it has 33 symbols in its history; at the beginning it actually has 0, so it has a bogus history of 33 times P = 0.5 which gives it a lot of inertia.
FIR 8-tap window flat weight :
Just take the last 8 and average. (I initialize the window to 01010 which is why it has the two-tap stair step in the beginning). In practice you can't let the probabilities get to 0 or 1 completely, you have to clamp at some epsilon, and in fact you need a very large epsilon here because over-estimating P(1) and then observing a 0 bit is very very bad (codelen of 0 goes to infinity fast as P->1). The "codelen difference" chart here uses a P epsilon of 0.01.
bilateral filter :
It's just filtering, right? Why not? This bilateral filter takes a small local average (4 tap) and weights contribution of those local averages back through time. The weight of each sample is multiplied by e^-dist * e^-value_difference. That is, two terms (hence bilateral), weight goes down as you go back in time, but also based on how far away the sample is from the most recent one in value. ("sample" is the 4-tap local average)
What the bilateral does is that as you get into each region, it quickly forgets the previous region. So as you get into 000 it forgets the 111, and once it gets into 0101 it pretty solidly stabilizes at P = 0.5 ; that is, it's fast in forgetting the past when the past doesn't match the present (fast like geo 2), BUT it's not fast in over-responding to the 0101 wiggle like geo 2.
There are an infinity of different funny impulse responses you could do here. I have no idea if any of them would actually help compression (other than by giving you more parameters to be able to train to your data, which always helps, sort of).
mixed :
Linear mix of geo 2 (super fast) and geo 5 (still quite fast). mixing weight starts at 0.5 and is updated with online gradient descent. The mixer here has an unrealistically fast learning rate to exaggerate the effect.
The weight shown is the weight of geo 2, the faster model. You can see that in the 111 and 000 regions the weight of geo 2 shoots way up (because geo 5 is predicting P near 0.5 still), and then in the 0101 region the weight of geo 2 gradually falls off because the slow geo 5 does better in that region.
The mixer immediately does something that none of the other estimators did - when the first 0 bit occurs, it takes a *big* step down, almost down to 50/50. It's an even faster step than geo 2 did on its own. (same thing with the first 1 after the 000 region).
Something quite surprising popped out to me. The probability steps in the 111 and 000 regions wind up linear. Note that both the underlying estimators (geo 2 and geo 5) has exponential decay curving probabilities, but the interaction with the mixing weight cancels that out and we get linear. I'm not sure if that's a coincidence of the particular learning rate, but it definitely illustrates to us that a mixed probability can behave unlike any of its underlying experts!
I thought I would make some pictures to make that more intuitive.
What I have here is a 4 bit symbol (alphabet of 16). It is coded with 4 binary probabilities in a tree.
That is :
first code a bit for sym >= 8 or not (is it in [0-7] or [8-15])
then go to [0-3] vs [4-7] (for example)
then [4-5] vs [6-7]
lastly [6] vs [7]
One way you might implement this is something like :
U32 p0[16]; // one binary probability per tree node
// sym is given, in [0,15]
sym |= 16; // set a place-value marker bit above the 4 data bits
for(int i=0;i<4;i++)
{
int ctx = sym >> 4;   // tree node index = the marker bit plus the bits coded so far
int bit = (sym>>3)&1; // next bit to code, MSB first
arithmetic_code( p0[ctx] , bit );
sym <<= 1;
}
and note that only 15 p0's are actually used, p0[0] is not accessed; p0[1] is the root probability for
[0-7] vs [8-15] , p0[2] is [0-3] vs [4-7] , etc.
The standard binary probability update is exponential decay (the simplest geometric IIR filter) :
if ( bit ) // a 1 was coded : P0, the probability of a 0 bit, decays toward 0
{
P0 -= P0 * lambda;
}
else // a 0 was coded : P0 moves toward 1
{
P0 += (1.0 - P0) * lambda;
}
So what actually happens to the symbol probabilities when you do this?
Something a bit strange.
When you update symbol [5] (for example), his probability goes up. But so does the probability of his neighbor, [4]. And also his next neighbors [6 and 7]. And so on up the binary tree.
Now the problem is not the binary decomposition of the coding. Fundamentally binary decomposition and full alphabet coding are equivalent and should be able to get the same results if you form the correct correspondence between them.
(aside : For example, if you use a normal n0/n1 count model for probabilities, AND you initialize the counts such that the parents = the sum of the children, then what you get is : visualizing_binary_probability_updates_n0_n1.html - only the symbols you observe increase in probability. Note this is a binary probability tree, not full alphabet, but it acts exactly the same way.)
The problem (as noted in the previous LZMA post) is that the update rate at the top levels is too large.
Intuitively, when a 5 occurs you should update the binary choice [45] vs [67], because a 5 is more likely, so that pair is more likely. The problem is that [4] did not get more likely, and the probability of the group [45] represents the chance of either occurring. One of those things did get more likely, but one did not. The 4 in the group should act as a drag that keeps the group [45] probability from going up so much. Approximately, it should go up at half the speed of the leaf update.
The larger the group, the greater the drag of symbols that did not get more likely. eg. in [0-7] vs [8-15] when a 5 occurs, all of 0123467 did not get more likely, so 7 symbols act as drag and the rate should be around 1/8 the speed.
(see appendix at end for the exact speeds you should use if you wanted to adjust only one probability)
Perhaps its best to just look at the charts. These are histograms of probabilities (in percent). It starts all even, then symbol 5 occurs 20 X, then symbol 12 occurs 20 X. The probabilities are updated with the binary update scheme. lambda here is 1.0/8 , which is rather fast, used to make the visualization more dramatic.
What you should see : when 5 first starts getting incremented, the probabilities of its neighbors go way up, 4, and 6-7, and the whole 0-7 side. After 5 occurs many times, the probabilities of its neighbors go down. Then when 12 starts being seen, the whole 8-15 side shoots up.
Go scroll down through the charts then we'll chat more.
This is a funny thing. Most people in the data compression community think of binary coding symbols this way as just a way to encode an alphabet using binary arithmetic coding. They aren't thinking about the fact that what they're actually doing is a strange sort of group-probability update.
In fact, in many applications if you take a binary arithmetic coder like this and replace it with an N-ary full alphabet coder, results get *worse*. Worse !? WTF !? It's because this weird group update thing that the binary coder is doing is often actually a good thing.
You can imagine scenarios where that could be the case. In some types of data, when a new symbol X suddenly starts occurring (when it had been rare before), that means (X-1) and (X+2) may start to be seen as well. We're getting some kind of complicated modeling where novel symbols imply that their neighbors' probability of being novel should go up. In some types of data (such as BWT post-MTF) the probabilities act very much in binary tree groups (see Fenwick for example). In other types of data that are very bit structured (such as ascii text and executable code), when a symbol with some top 3 bits occurs, then other symbols with those top bits are also more likely. That is, many alphabets actually have a natural binary decomposition where symbol groups in the binary tree do have joint probability.
This is one of those weird things that happens all the time in data compression where you think you're just doing an implementation choice ("I'll use binary arithmetic coding instead of full alphabet") but that actually winds up also doing modeling in ways you don't understand.
The visuals :
(histogram charts : after symbol 5 occurs 1 through 20 times, then after symbol 12 occurs 1 through 20 times)
Appendix :
How to do a binary probability update without changing your neighbor :
Consider the simplest case : 4 symbol alphabet :
[0 1 2 3]
A = P[0 1] vs [2 3]
B = P[0] vs [1]
P(0) = A * B
P(1) = A * (1 - B)
now a 0 occurs
alpha = update rate for A
beta = update rate for B
A' = A + (1-A) * alpha
B' = B + (1-B) * beta
we want P(1)/P(23) to be unchanged by the update
(if that is unchanged, then P(1)/P(2) and P(1)/P(3) are unchanged)
that is, the increase to P(0) should scale down all other P's by the same factor
P(1)/P(23) = A * (1-B) / (1-A)
so require :
A' * (1-B') / (1-A') = A * (1-B) / (1-A)
solve for alpha [...algebra...]
alpha = A * beta / ( 1 - (1 - A) * beta )
that's as opposed to just using the same speed at all levels, alpha = beta.
In the limit of small beta, (slow update), this is just alpha = A * beta.
The update rate at the higher level is decreased by the probability of the updated subsection.
In the special case where all the probabilities start equal, A = 0.5, so this is just alpha = beta / 2 - the update rates should be halved at each level, which was the intuitive rule of thumb that I hand-waved about before.
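A tiny helper implementing the exact-rate formula above; A is the parent node's current probability of the half that contains the observed symbol, beta is the child update rate :

// update rate for the parent binary decision that leaves sibling ratios unchanged
static double parent_update_rate(double A, double beta)
{
    return A * beta / ( 1.0 - (1.0 - A) * beta );
}

For small beta this is just A * beta, and at A = 0.5 it is beta / 2, halving the rate at each level up the tree.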
First we need to talk about how our experts are mixed. There are two primary choices :
Linear mixing :
M(x) = Sum{ w_i P_i(x) } / Sum{ w_i }
or = Sum{ w_i P_i(x) } if the weights are normalized
Geometric mixing :
M(x) = Prod{ P_i(x) ^ c_i } / Sum(all symbols y){ Prod{ P_j(y) ^ c_j } }
(the coefficients c_i do not need to be normalized)
P = probability from each expert
M = mixed probability
require
Sum(all symbols y){ M(y) } = 1
Linear mixing is just a weighted linear combination. Geometric mixing is also known as
"product of expert" and "logarithmic opinion pooling".
Geometric mixing can be expressed as a linear combination of the logs, if normalization is
ignored :
log( M_unnormalized ) = Sum{ c_i * log( P_i ) }
In the large alphabet case, there is no simple logarithmic form of geometric M (with normalization included).
In the special case of a binary alphabet, Geometric mixing has a special simple form
(even with normalization) in log space :
M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ (1 - P_i) ^ c_i } )
M = 1 / ( 1 + Prod{ ((1 - P_i)/P_i) ^ c_i } )
M = 1 / ( 1 + Prod{ e^( c_i * log((1 - P_i)/P_i) ) } )
M = 1 / ( 1 + e^ Sum { c_i * log((1 - P_i)/P_i) } )
M = sigmoid( Sum { c_i * logit(P_i) } )
which we looked at in the previous post, getting some background on logit and sigmoid.
We looked before about how logit space strongly preserves skewed probabilities. We can see the same thing directly in the geometric mixer.
Obviously if an expert has a P near 0, that multiplies into the product and produces an output near 0. Zero times anything = zero. Less obviously, if an expert has a P near 1, the opposite-symbol term in the denominator (the one that isn't the numerator) goes to zero, so the mix becomes 1, regardless of the other P's.
So assume we are using these mixtures. We want to optimize (minimize) code length. The general procedure for
gradient descent is :
the code length cost of a previous coding event is -log2( M(x) )
evaluated at the actually coded symbol x
(in compression the total output file length is just the sum of all these event costs)
To improve the w_i coefficients over time :
after coding symbol x, see how w_i could have been improved to reduce -log2( M(x) )
take a step of w_i in that direction
the direction is the negative gradient :
delta(w_i) ~= - d/dw_i { -log2( M(x) ) }
delta(w_i) ~= (1/M) * d/dw_i{ M(x) }
(~ means proportional; the actual delta we want is scaled by some time step rate)
so to find the steps we should take we just have to do some derivatives.
I won't work through them in ascii, but they're relatively simple.
(the geometric mixer update for the non-binary alphabet case is complex and I won't cover it any more here. For practical reasons we will assume linear mixing for non-binary alphabets, and either choice is available for binary alphabets. For the non-binary geometric step see Mattern "Mixing Strategies in Data Compression")
When you work through the derivatives what you get is :
Linear mixer :
delta(w_i) ~= ( P_i / M ) - 1
Geometric mixer (binary) :
delta(w_i) ~= (1 - M) * logit(P_i)
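A compact sketch of both updates for the binary case (learning rate folded in, weights clamped crudely; fixed-point details omitted) :

#include <math.h>

static double logit(double p)   { return log(p / (1.0 - p)); }
static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

// linear mix of N experts' P(1), then a gradient step on the coded bit
static double linear_mix_and_update(double * w, const double * P, int N, int bit, double rate)
{
    double sw = 0, M = 0;
    for (int i = 0; i < N; i++) { sw += w[i]; M += w[i] * P[i]; }
    M /= sw;                                        // mixed P(1)
    double Mx = bit ? M : 1.0 - M;                  // mixed probability of the coded symbol
    for (int i = 0; i < N; i++)
    {
        double Px = bit ? P[i] : 1.0 - P[i];        // expert i's probability of the coded symbol
        w[i] += rate * (Px / Mx - 1.0);             // delta(w_i) ~= (P_i/M) - 1
        if (w[i] < 1e-6) w[i] = 1e-6;               // keep weights positive
    }
    return M;
}

// geometric (logit-domain) mix, then a gradient step on the coded bit
static double geometric_mix_and_update(double * c, const double * P, int N, int bit, double rate)
{
    double t = 0;
    for (int i = 0; i < N; i++) t += c[i] * logit(P[i]);
    double M = sigmoid(t);                          // mixed P(1)
    for (int i = 0; i < N; i++)
        c[i] += rate * (bit - M) * logit(P[i]);     // = (1 - M(x)) * logit(P_i(x)) at the coded x
    return M;
}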
Let's now spend a little time getting familiar with both of them.
First the linear mixer :
put r_i = P_i(x) / M(x)
ratio of expert i's probability to the mixed probability, on the symbol we just coded (x)
Note : Sum { w_i * r_i } = 1 (if the weights are normalized)
Your goal is to have M = 1
That is, you gave 100% probability to the symbol that actually occurred
The good experts are the ones with high P(x) (x actually occured)
The experts that beat the current mix have P > M
Note that because of Sum { w_i * r_i } = 1
Whenever there is an expert with r_i > 1 , there must be another with r_i < 1
The linear mixer update is just :
delta(w_i) ~= r_i - 1
experts that were better than the current mix get more weight
experts worse than the current mix get less weight
I think it should be intuitive this is a reasonable update.
Note that delta(w) does not preserve normalization. That is :
Sum{ delta(w_i) } != 0 , and != constant
however
Sum{ w_i * delta(w_i) } = Sum{ w_i * P_i / M } - Sum { w_i } = Sum {w_i} - Sum{w_i} = 0
the deltas *do* sum to zero when weighted.
We'll come back to that later.
If we write the linear update as :
delta(w_i) ~= ( P_i - M ) / M
it's obvious that delta is proportional to (1/M). That is, when we got the overall mixed prediction M right, M is near 1, the step is small.
When we got it wrong, M is near 0, the step we take can be very large. (note that P_i does not have to also be near zero when M is near zero,
the w_i for that expert might have been small). In particular, say most of the experts were confused, they
predict the wrong thing, so for the actual coded event x, their P(x) is near zero. One guy said no, what about x! His P(x) was higher.
But everyone ignores the one guy piping up (his weight is low). The
mixed M(x) will be very low, we make a huge codelen (very bad log cost), the step (1/M) will be huge, and it bumps up the one guy who got the
P(x) right.
Now playing with the geometric update :
Geometric mixer (binary) :
delta(w_i) ~= (1 - M) * logit(P_i)
this is evaluated for the actually coded symbol x (x = 0 or 1)
let's use the notation that ! means "for the opposite symbol"
eg. if you coded a 0 , then ! means "for a 1"
eg. !P = 1-P and !M = 1-M
(in the previous post I put up a graph of the relationship between L and !L, codelens of opposing bits)
We can rewrite delta(w_i) as :
delta(w_i) ~= !M * ( !L_i - L_i )
recalling logit is the codelen difference
that's
delta(w_i) ~= (mixed probability of the opposite symbol) * (ith model's codelen of opposite - codelen of coded symbol)
I find this form to be more intuitive; let's talk through it with words.
The step size is proportional to !M , the opposite symbol mixed probability. This is a bit like the (1/M) scaling in the linear mixer. When we got the estimate right, the M(x) of the coded symbol will go to 1, so !M goes to zero, our step size goes to zero. That is, if our mixed probability M was right, we take no step. If it ain't broke don't fix it. If we got the M very wrong, we thought a 0 bit would not occur but then it did, M is near zero, !M goes to one.
While the general direction of scaling here is the same as the linear mixer, the scaling is very different. For example in the case where we were grossly wrong, M -> 0 , the linear mixer steps like (1/M) - it can be a huge step. The geometric mixer steps like (1-M) , it hits a finite maximum.
The next term (!L - L) means weights go up for experts with a large opposite codelen. eg. if the codelen of the actually coded symbol was low, that's a good expert, then the codelen of the opposite must be high, so his weight goes up.
Again this is similar to the linear mixer, which took a step ~ P(x) , so experts who were right go up. Again the scaling is different, and in the opposite way. In the linear mixer, if experts get the P right or wrong, the step just scales in [0,1] ; it varies, but not drastically. Conversely in the geometric case because step is proportional to codelen, it goes to infinity as P gets extremely wrong.
Say you're an expert who gets it all wrong and guesses P(x) = 0 (near zero) then x occurs. In the linear case the step is just (-1). In the geometric case, L(x) -> inf , !L -> 0 , your weight takes a negative infinite step. Experts who get it grossly wrong are not gradually diminished, they are immediately crushed. (in practice we clamp weights to some chosen finite range, or clamp the step to a maximum)
Repeating again what the difference is :
In linear mixing, if the ensemble makes a big joint mistake, the one guy out of the set who was right can get a big weight boost. If the ensemble does okay, individuals experts that were way off do not get big adjustments.
In geometric mixing, individual experts that make a big mistake take a huge weight penalty. This is only mildly scaled by whether the ensemble was good or not.
Another funny difference :
With geometric mixing, experts that predict a 50/50 probability (equal chance of 0 or 1) do not change w at all. If you had a static expert that just always said "I dunno, 50/50?" , his weight would never ever change. But he would also never contribute to the mixture either, since as noted in the previous post equivocal experts simply have no effect on the sum of logits. (you might think that if you kept seeing 1 bits over and over, P(1) should be near 100% and experts that keep saying "50%" should have their weight go to crap; that does not happen in geometric mixing).
In contrast, with linear mixing, 50/50 experts would either increase or decrease in weight depending on if they were better or worse than the previous net mixture. eg. if M was 0.4 on the symbol that occurred, the weight of a P = 0.5 expert would go up, since he's better than the mix and somebody must be worse than him.
Note again that in geometric mixing (for binary alphabets), there is no explicit normalization needed. The w_i's are not really weights, they don't sum to 1. They are not in [0,1] but go from [-inf,inf]. It's not so much "weighting of experts" as "adding of experts" but in delta-codelen space.
Another funny thing about geometric mixing is what the "learning rate" really does.
With the rate explicitly in the steps :
Linear mixer :
w_i += alpha * ( ( P_i / M ) - 1 )
(then w_i normalized)
Geometric mixer (binary) :
c_i += alpha * ( (1 - M) * logit(P_i) )
(c_i not normalized)
In the linear case, alpha really is a learning rate. It controls the speed of how fast a better expert takes weight from a worse
expert. They redistribute the finite weight resource because of normalization.
That's not the case in the geometric mixer. In fact, alpha just acts as an overall scale to all the weights. Because all the
increments to w over time have been scaled by alpha, we can just factor it out :
no alpha scaling :
c_i += ( (1 - M) * logit(P_i) )
factor out alpha :
M = sigmoid( alpha * Sum { c_i * logit(P_i) } )
The "learning rate" in the geometric mixer really just scales how the sum of logits is stretched back to probability. In fact
you can think of it as a normalization factor more than a learning rate.
The geometric mixer has yet another funny property that it doesn't obviously pass through the best predictor. It doesn't even
obviously pass through the case of a single expert. (with linear mixing it's obvious that a crap predictor's weight goes
to zero, and a single expert would just pass through at weight=1)
aside you may ignore; quick proof that it does :
say you have only one expert P
say that P is in fact perfectly correct
0 bits occur at rate P and P is constant
initially you have some mixing coefficient 'c' (combining alpha in with c)
The mixed probability is :
M = sigmoid( c * logit(P) )
which is crap. c is applying a weird scaling we don't want.
c should go to 1, because P is the true probability.
Does it?
delta(c) after one step of symbol 0 is (1-M) * logit(P)
for symbol 1 it's M * logit(1-P)
since P is the true probability, the probabilistic average step is :
average_delta(c) = P * ( (1-M) * logit(P) ) + (1-P) * ( (M) * logit(1-P) )
logit(1-P) = - logit(P)
average_delta(c) = P * logit(P) - M * logit(P) = (P - M) * logit(P)
in terms of l = logit(P)
using P = sigmoid(l)
average_delta(c) = ( sigmoid(l) - sigmoid( c * l ) ) * l
this is a differential equation for c(t)
you could solve it (anyone? anyone?)
but even without solving it's clear that it does go to c = 1 as t -> inf
if c < 1 , average_delta(c) is > 0 , so c goes up
if c > 1 , average_delta(c) is < 0 , so c goes down
in the special case of l small (P near 50/50), sigmoid is near linear,
then the diffeq is easy and c relaxes exponentially toward 1, like c ~= 1 + const * e^-t
Note that this was done with alpha set to 1 (included in c). If we leave alpha separate, then c should
converge to 1/alpha in order to pass through a single predictor. That's what I mean by alpha not really
being a "learning rate" but more of a normalization factor.
So anyway, despite geometric mixing being a bit odd, it works in practice; it also works in theory (the gradient descent can be proved to converge with reasonable bounds and so on). In practice it has the nice property of not needing any divides for normalization (in the binary case). It does of course require evaluation of logit & sigmoid, but those are typically done by table lookup.
Bonus section : Soft Bayes
The "Soft Bayes" method is not a gradient descent, but it is very similar, so I will mention it here. Soft Bayes uses a weighted linear combination of the experts, just like linear gradient descent mixing.
Linear mixer :
d_i = ( P_i / M ) - 1
w_i += rate * d_i
Soft Bayes :
w_i *= ( 1 + rate * d_i )
Instead of adding on the increment, Soft Bayes multiplies by (1 + the increment).
Multiplying by 1 + delta is obviously similar to adding delta and the same intuitive arguments we made before about why this works still apply. Maybe I'll repeat them here :
General principles we want in a mixer update step :
The step should be larger when the mixer produced a bad result
the worse M(x) predicted the observed symbol x, the larger the step should be
this comes from scaling like (1/M) or (1-M)
The step should scale up weights of experts that predicted the symbol well
eg. larger P_i(x) should go up , smaller P_i(x) should go down
a step proportional to (P-M) does this in a balanced way (some go up, some go down)
but a step of (P - 0.5) works too
as does logit(P)
We can see a few properties of Soft Bayes :
Soft Bayes :
w_i *= ( 1 + rate * d_i )
w_i' = w_i + w_i * rate * d_i
w_i += rate * w_i * d_i
This is the same as linear gradient descent step, but each step d_i is scaled by w_i
Soft Bayes is inherently self-normalizing :
Sum{w_i'} = Sum{w_i} + Sum{w_i * rate * d_i}
Sum{ w_i * d_i } = Sum{ ( w_i * P_i / M ) - w_i } = ( M/M - 1 ) * Sum{ w_i } = 0
therefore :
Sum{w_i'} = Sum{w_i}
the sum of weights is not changed
if they started normalized, they stay normalized
The reason that "soft bayes" is so named can be seen thusly :
Soft Bayes :
w_i *= ( 1 + rate * d_i )
d_i = ( P_i / M ) - 1
w_i *= (1-rate) + rate * ( P_i / M )
at rate = 1 :
w_i *= P_i / M
ignoring normalization this is
w_i *= P_i
which is "Beta Weighting" , which is aka "naive Bayes".
So Soft Bayes uses "rate" as a lerp factor to blend in a naive Bayes update. When rate is 1, it's "fully Bayes", when rate is lower it keeps some amount of the previous weight lerped in.
Soft Bayes seems to work well in practice. Soft Bayes could have the bad property that if weights get to zero they can never rise, but in practice we always clamp weights to a non-zero range so this doesn't occur.
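A sketch of the Soft Bayes step as code, with the usual clamp so weights can't stick at zero; Px[i] is expert i's probability of the symbol that was actually coded and Mx is the mixed probability of it :

static void soft_bayes_update(double * w, const double * Px, int N, double Mx, double rate)
{
    for (int i = 0; i < N; i++)
    {
        double d = Px[i] / Mx - 1.0;     // same d_i as the gradient step
        w[i] *= 1.0 + rate * d;          // multiplicative instead of additive
        if (w[i] < 1e-6) w[i] = 1e-6;    // clamp so a weight can come back
    }
    // Sum{w_i} is preserved by the update itself (before the clamp),
    // so no explicit renormalize is needed if the weights started normalized.
}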
A quick aside on why Beta weighting is the naive Bayesian solution :
ignoring normalization through this aside
(also ignoring priors and all the proper stuff you should do if you're Bayesian)
(see papers for thorough details)
Mix experts :
M = Sum{ w_i * P_i }
P_i(x) is the probability of x given expert i ; P(x|i)
The w_i should be the probability of expert i, given past data , that is :
P(x|past) = Sum{ P(i|past) * P(x|i) }
P(i|past) is just ~= P(past|i)
(ignoring prior P(i) and normalizing terms P(past) that factor out in normalization)
Now *if* (big if) the past events are all independent random events, then :
P(past|i) = P(x0|i)*P(x1|i)*P(x2|i)... for past symbols x0,x1,...
since P("ab") = P(a)*P(b) etc.
which you can compute incrementally as :
w_i *= P_i
the Beta Weight update.
In words, the weight for the expert should be the probability of that expert having made the data we have seen so far.
That probability is just the P_i that expert has predicted for all past symbols, multiplied together.
Now this is obviously very wrong. Our symbols are very much not independent, they have strong conditional probability, that's how we get compression out of modeling the stream. The failure of that assumption may be why true beta weighting is poor in practice.
This Whirlwind Tour has been brought to you by the papers of Mattern & Knoll, PAQ of Mahoney, and the radness of RAD.
Small note about practical application :
You of course want to implement this with integers and store things like weights and probabilities in fixed point. I have found that the exact way you handle truncation of that fixed point is very important. In particular you need to be careful about rounding & biasing and try not to just lose small values. In general mixing appears to be very "fiddly"; it has a very small sweet spot of tuned parameters and small changes to them can make results go very bad.
In many cases we've been talking about mixers that work with arbitrary alphabets, but let's specialize now to binary. In that case, each expert just makes one P, say P(0), and implicitly P(1) = 1 - P(0).
We will make use of the logit ("stretch") and logistic sigmoid ("squash"), so let's look at them a bit first.
logit(x) = ln( x / (1-x) )
sigmoid(x) = 1 / (1 + e^-x ) (aka "logistic")
logit is "log odds" of a probability
logit takes [0,1] -> [-inf,inf]
sigmoid takes [-inf,inf] -> [0,1]
logit( sigmoid(x) ) = x (they are inverses of each other)
I'm a little bit fast and loose with the base of the logs and exponents sometimes, because the
difference comes down to scaling factors which wash out in constants in the end. I'll try to be
not too sloppy about it.
You can see plots of them here for example. The "logistic sigmoid" is an annoyingly badly named function; "logistic" is way too similar to "logit" which is very confusing, and "sigmoid" is just a general shape description, not a specific function.
What's more intuitive in data compression typically is the base-2 variants :
log2it(P) = log2( P / (1-P) )
log2istic(t) = 1 / (1 + 2^-t )
which is just different by a scale of the [-inf,inf] range. (it's a difference of whether you measure
your codelen in units of "bits" or "e's").
In data compression there's an intuitive way to look at the logit. It's the difference of codelens
of symbol 0 and 1 arising from the probability P.
log2it(P) = log2( P / (1-P) )
= log2( P ) - log2( 1 - P )
if P = P0
then L0 = codelen of 0 = - log2(P0)
L1 = - log2(1-P0)
log2it(P0) = L1 - L0
which I like to write as :
log2it = !L - L
where !L means the len of the opposite symbol
Working with the logit in data compression is thus very natural and has nice properties :
1. It is antisymmetric under exchange of the symbol alphabet 0->1. In that case P -> 1-P , so logit -> - logit. The magnitude stays the same, so mixers that act on logit behave the same way. eg. you aren't favoring 0 or 1, which of course you shouldn't be. This is why using the *difference* of codelen between L0 and L1 is the right thing to do, not just using L0, since the 0->1 exchange acts on L0 in a strange way.
Any mixer that we construct should behave the same way if we swap 0 and 1 bits. If you tried to construct a valid mixer like that using only functions of L0, it would be a mess. Doing it in the logit = (!L-L) it works automatically.
2. It's linear in codelen. Our goal in compression is to minimize total codelen, so having a parameter that is in the space we want makes that natural.
3. It's unbounded. Working with P's is awkward because of their finite range, which requires clamping and theoretically projection back into the valid cube.
4. It's centered on zero for P = 0.5 and large for skewed P. We noted before that what we really want is to gather models that are skewed, and logit does this naturally.
Let's look at the last property.
Doing mixing in logit domain implicitly weights skewed experts, as we looked at before in "functional weighting".
With logit, skewed P's near 0 or 1 are stretched out to -inf or +inf. This gives them very high implicit "weight",
that is, even after mixing with a P near 0.5, the result is still skewed.
Mix two probabilities with equal weight :
0.5 & 0.9
linear : 0.7
logit : 0.75
0.5 & 0.99
linear : 0.745
logit : 0.90867
0.5 & 0.999
linear : 0.74950
logit : 0.96933
Logit blending is sticky at skewed P.
A little plot of what the L -> !L function looks like :
Obviously it is symmetric around (1,1) where both codelens are 1 bit, P is 0.5 ; as P gets skewed either way one len goes to zero and the other goes to infinity.
logistic sigmoid is the map back from codelen delta to probability. It has the nice property for us that no matter what you give it, the output is a valid normalized probability in [0,1] ; you can scale or screw up your logistics in whatever way you want and you still get a normalized probability (which makes it codeable).
Linear combination of logits (codelen deltas) is equal to geometric (multiplicative) mixing of probabilities :
M = mixed probability
linear combination in logit space, then transform back to probability :
M = sigmoid( Sum{ c_i * logit( P_i ) } )
M = 1 / ( 1 + e ^ - Sum{ c_i * logit( P_i ) } )
e ^ logit(P) = P/(1-P)
e ^ -logit(P) = (1-P)/P
e ^ - c * logit(P) = ((1-P)/P) ^ c
M = 1 / ( 1 + Prod{ (( 1 - P_i ) / P_i)^c_i } )
M = Prod{ P_i ^ c_i } / ( Prod{ P_i ^ c_i } + Prod{ ( 1 - P_i )^c_i } )
What we have is a geometric blend (like a geometric mean, but with arbitrary exponents, which are
the mixing factors), and the denominator is just the normalizing factor so that M0 and M1 sum to one.
Note that the blend factors (c_i) do not need to be normalized here. We get a normalized probability even if the c_i's are not normalized. That is an important property for us in practice.
This is a classic machine learning method called "product of experts". In the most basic form, all the c_i = 1, the powers are all 1, the probabilities from each expert are just multiplied together.
I'm calling the c_i "coefficients" not "weights" to emphasize the fact that they are not normalized.
Note that while we don't need to normalize the c_i , and in practice we don't (and in the "product of experts" usage with all c_i = 1 they are of course not normalized), that has weird consequences and let us not brush it under the table.
For one thing, the overall scale of c_i changes the blend. In many cases we hand-waved about scaling because normalization will wipe out any overall scaling factors of weights. In this case the magnitude scale of the c_i absolutely matters. Furthermore, the scale of the c_i has nothing to do with a "learning rate" or blend strength. It's just an extra exponent applied to your probabilities that does some kind of nonlinear mapping to how they are mixed. e.g. doing c_i *= 2 is like mixing squared probabilities instead of linear probabilities.
For another, with unnormalized c_i - only skewed probabilities contribute. Mixing in more 50/50 experts does *nothing*. It does not bias your result back towards 50/50 at all!
We can see this both in the logit formulation and in the geometric mean formulation :
M = sigmoid( Sum{ c_i * logit( P_i ) } )
Now add on a new expert P_k with some coefficient c_k
M' = sigmoid( Sum{ c_i * logit( P_i ) + c_k * logit( P_k ) } )
if P_k = 50% , then logit(P_k) = 0
M' = M
even if c_k is large.
(if you normalized the c's, the addition of c_k would bring down the coefficients from the other contributions)
The Sum of logits only adds up codelen deltas
Experts with large codelen deltas are adding their scores to the vote
small codelen deltas simply don't contribute at all
In the geometric mean :
Q = Prod{ P_i ^ c_i }
M = Q / ( Q + !Q )
(where ! means "of the opposing symbol; eg. P -> (1-P) )
Add on another expert with P = 0.5 :
Q' = Q * 0.5 ^ c_k
!Q' = !Q * 0.5 ^ c_k
M' = Q * 0.5 ^ c_k / ( Q * 0.5 ^ c_k + !Q * 0.5 ^ c_k)
the same term multiplied around just factors out
M' = M
This style of mixing is like adding experts, not blending them. Loud shouty experts with skewed codelen
deltas add on.
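For concreteness, here's a minimal sketch of this logit-domain (product of experts) mixing for the binary case; the coefficients c_i are left unnormalized, as discussed, and the example values in the comments are just illustrations :

#include <cmath>
#include <vector>

static double logit(double p)   { return std::log(p / (1.0 - p)); }
static double sigmoid(double t) { return 1.0 / (1.0 + std::exp(-t)); }

// M = sigmoid( Sum{ c_i * logit( P_i ) } )
static double mix_logit(const std::vector<double> & P, const std::vector<double> & c)
{
    double t = 0;
    for (size_t i = 0; i < P.size(); i++) t += c[i] * logit(P[i]);
    return sigmoid(t);   // always a valid probability, even though the c_i are not normalized
}

// eg. mix_logit( {0.5, 0.99}, {0.5, 0.5} ) ~= 0.9087  (the linear blend gives 0.745)
// and an expert at P = 0.5 contributes nothing, no matter how large its coefficient.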
Next up, how to use this for gradient descent mixing in compression.
First some background :
We have probabilities output by some "experts" , P_i(x) for the ith expert for symbol x.
We wish to combine these in some way to create a mixed probability.
The experts are a previous stage of probability estimation. In data compression these are usually models; they could be different orders of context model, or the same order at different speeds, or other probability sources.
(you could also just have two static experts, one that says P(0) = 100% and one that says P(1) = 100%, then the way you combine them is actually modeling the binary probability; in that case the mixer is the model. What I'm getting at is there's actually a continuum between the model & the mixer; the mixer is doing some of the modeling work. Another common synthetic case used for theoretical analysis is to go continuous and have an infinite number of experts with P(0) = x for all x in [0,1], then the weight w(x) is again the model.)
Now, more generally, the experts could also output a confidence. This is intuitively obviously something that should be done, but nobody does it, so we'll just ignore it and say that our work is clearly flawed and always has room for improvement. (for example, a context that has seen n0 = 1 and n1 = 0 might give P(0) = 90% , but that is a far less confident expert than one that has n0 = 90 and n1 = 10 , which should get much higher weight). What I'm getting at here is the experts themselves have a lot more information than just P, and that information should be used to feed the mixer. We will not do that. We treat the mixer as independent of the experts and only give it the probabilities.
(you can fix this a bit by using side information (not P) from the model stages as context for the mixer coefficients; more on that later).
Why mixing? A lot of reasons. One is that your data may be switching character over time. It might switch from very-LZ-like to very-order-3-context like to order-0-like. One option would be to try to detect those ranges and transmit switch commands to select models, but that is not "online" (one pass) and the overhead to send the switch commands winds up being comparable to just mixing all those models (much research shows that these alternatives are asymptotically bounded). Getting perfect switch transmission right is actually way harder than mixing; you have to send the switch commands efficiently, you have to find the ideal switch ranges with consideration of the cost of transmitting the switch (which is a variant of a "parse-statistics feedback" problem which is one of the big unsolved bugbears in compression).
Another reason we like mixing is that our models might be good for part of the character of the data, but not all of it. For example it's quite common to have some data that's text-like (has strong order-N context properties) but is also LZ-like (perhaps matches at offset 4 or 8 are more likely, or at rep offset). It's hard to capture both those correlations with one model. So do two models and mix.
On to techniques :
1. Select
The most elementary mixer just selects one of the models. PPMZ used this for LOE (Local Order Estimation). PPMZ's primary scheme was to just select the order with highest MPS (most probable symbol) probability. Specifically if any order was deterministic (MPS P = 100%) it would be selected.
2. Fixed Weights
The simplest mixer is just to combine stages using fixed weights. For example :
P = (3 * P1 + P2)/4
This is used very effectively in BCM, for example, which uses it in funny ways like mixing the probability pre-APM with the probability post-APM
using fixed weight ratios.
Fixed linear combinations like this of multi-speed adaptive counters are not really mixing, but are a way to create higher order IIR filters. (as in the "two speed" counter from BALZ; that really makes a 2-tap IIR from 1-tap IIR geometric decay filters)
3. Per-file/chunk transmitted Fixed Weights
Instead of fixed constants for mixing, you can of course optimize the weights per file (or per chunk) and transmit them.
If you think of the goal of online gradient descent as finding the best fit of parameters and converging to a constant - then this makes perfect sense. Why do silly neural net online learning when we can just statically optimize the weights?
Of course we don't actually want our online learners to converge to constants - we want them to keep adapting; there is no steady state in real world data compression. All the online learning research which assumes convergence and learning rate decreasing over time and various asymptotic guarantees - it's all sort of wrong for us.
M1 (Mattern) uses optimized weights and gets quite high compression. This method is obviously very asymmetric, slow to encode & fast(er) to decode. You can also approximate it by making some pre-optimized weights for various types of data and just doing data type detection to pick a weight set.
4. Functional Weighting
You can combine P's using only functions of the P's, with no separate state (weight variables that are updated over time). The general idea here is that more skewed P's are usually much more powerful (more likely to be true) than P's near 50/50.
If you have one expert on your panel saying "I dunno, everything is equally likely" and another guy saying "definitely blue!" , you are usually best weighting more to the guy who seems sure. (unless the positive guy is a presidential candidate, then you need to treat lack of considering both sides of an argument or admitting mistakes as huge character flaws that make them unfit to serve).
In PPMZ this was used for combining the three SEE orders; the weighting used there was 1/H (entropy).
weight as only a function of P
the idea is to give more weight to skewed P's
W = 1 / H(P) , where H(P) = -( P * log(P) + (1-P) * log(1-P) )
e^ -H would be fine too
very simple skews work okay too :
W = ABS(P - 0.5)
just higher weight for P away from 0.5
then linearly combine models :
P_mixed = Sum{ W_i * P_i } / Sum{ W_i }
This method has generally been superseded by better approaches in most cases, but it is still useful for creating a prior model
for initial seeding of mixers, such as for 2d APM mixing (below).
I'll argue in a future post that Mahoney's stretch/squash (logit/logistic sigmoid) is a form of this (among other things).
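A tiny sketch of what functional weighting looks like in code, using the 1/H weight; the clamp on P is my addition so the entropy never hits zero :

#include <algorithm>
#include <cmath>
#include <vector>

static double entropy2(double p)   // binary entropy in bits
{
    return -(p * std::log2(p) + (1.0 - p) * std::log2(1.0 - p));
}

static double mix_functional(const std::vector<double> & P)
{
    double num = 0, den = 0;
    for (double p : P)
    {
        double pc = std::min(std::max(p, 0.001), 0.999);   // clamp so the entropy never reaches zero
        double W  = 1.0 / entropy2(pc);                     // skewed P -> low entropy -> high weight
        num += W * p;
        den += W;
    }
    return num / den;   // P_mixed = Sum{ W_i * P_i } / Sum{ W_i }
}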
5. Secondary Estimation Mixing / 2d APM Mixing
As mentioned at the end of the last post, you can use a 2d APM for mixing. The idea is very simple, instead of one P as input to the APM, you take two probabilities from different models and look them up as a 2d matrix.
You may want to use the stretch remap and bilinear filtering so that as few bits as possible are needed to index the APM table, in order to preserve as much density as possible. (eg. 6 bits from each P = 12 bit APM table)
This allows you to model completely arbitrary mixing relationships, not just "some weight to model 1, some to model 2". It can do the fully P-input dependent mixing, like if model 1 P is high, it gets most of the influence, but if model 1 P is low, then model 2 takes over.
The big disadvantage of this method is sparsity. It's really only viable for mixing two or three models, you can't do a ton. It's best on large stable data sources, not small rapidly changing data, because the mixing weight takes longer to respond to changes.
One of the big advantages of single-scalar weighted mixing is that it provides a very fast adapting stage to your compressor. Say you have big models with N states, N might be a million, so each individual context only gets an update every millionth byte. Then you might have a (1d) APM stage to adapt the P's; this has 100 entries or so, so it only gets 1/100 updates. But then you blend the models with a scalar mixing weight - that updates after every byte. The scalar mixing weight is always 100% up to date with recent changes in the data, it can model switches very quickly while the rest of the big model takes much more time to respond. 2d APM Mixing loses this advantage.
6. Beta Weighting
Beta Weighting was introduced by CTW ; see also the classic work of Volf on switching, etc. (annoyingly this means something different in stock trading).
Beta Weighting is an incremental weight update. It does :
P_mixed = Sum W_i * P_i / Sum W_i
(linear weighted sum of model probabilities, normalized)
after seeing symbol X
W_i *= P_i(X)
We can understand this intuitively :
If a model had a high probability for the actually occurring symbol (it was a good model),
it will have P_i large
W_i *= 0.9 or whatever
if the model had low probability for the actual symbol (it was wrong, claiming that symbol was unlikely),
it will get scaled down :
W_i *= 0.1 for example
Furthermore, the weights we get this way are just negative exponentials of the cost in each model.
That is :
P = 2^ - L
L = codelen
The weight winds up being the product of all coded probabilities seen so far :
W at time t = P_(t-1) * P_(t-2) * P_(t-3) * P_(t-4) * ...
W = 2^( - ( L_(t-1) + L_(t-2) + L_(t-3) + ... ) )
W = 2^( - codelen if you used this model for all symbols )
W ~ e^ - cost
If you like we can also think of this another way :
since W is normalized, overall scaling is irrelevant
at each update, find the model that produced the lowest codelen (or highest P)
this is the model we would have chosen with an ideal selector
so the cost of any other model is :
cost_i = L_i - L_best
the CTW Beta weight rule is :
W_i *= 2^ - cost_i
(and the best model weight stays the same, its cost is zero)
or
W = 2^ - total_cost
total_cost += cost_i
In the special case of two models we can write :
The normalized weight for model 1 is :
W1 = ( 2^-C1 ) / ( 2^-C1 + 2^-C2 ) = 1 / ( 1 + 2^(C1-C2) ) = log2istic( C2-C1 )
W2 = 1 - W1 = log2istic( C1-C2 )
"log2istic" = logistic sigmoid with base 2 instead of e
We only need to track the delta of costs , dC = C1-C2
P = log2istic(-dC) * P1 + log2istic(dC) * P2
(equivalently, just store W1 and do W1 *= P1/P2)
We'll see in the next post how logistic is a transform from delta-codelens to probabilities, and this will become neater then.
Now Beta weighting is obviously wrong. It weights every coding event equally, no matter how old it is. In compression we always want strong
recency modeling, so a better cost update is something like :
dC' = dC * aging + rate * (C1-C2)
where aging = 0.95 or whatever , something < 1
and rate is not so much the learning rate (it could be normalized out) as it is an exponential modifier to the weighting.
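Here's a minimal sketch of the two-model case in that delta-codelen form, with the aging tweak; the values for aging and rate are placeholders :

#include <cmath>

struct BetaMix2
{
    double dC    = 0;      // running (aged) codelen difference C1 - C2
    double aging = 0.95;   // < 1 : old coding events decay away
    double rate  = 1.0;    // exponential modifier on the weighting, not a true learning rate

    static double log2istic(double t) { return 1.0 / (1.0 + std::exp2(-t)); }

    double mix(double P1, double P2) const
    {
        double W2 = log2istic(dC);          // model 1 has been more expensive -> more weight to model 2
        return (1.0 - W2) * P1 + W2 * P2;   // (1 - W2) = log2istic(-dC) = W1
    }

    // after coding, update with the probability each model gave the observed symbol
    void update(double P1_observed, double P2_observed)
    {
        double L1 = -std::log2(P1_observed);
        double L2 = -std::log2(P2_observed);
        dC = dC * aging + rate * (L1 - L2);
    }
};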
Beta weighting as used in CTW stores a weight per node. Each node weights counts as seen at that level vs. the next deeper level.
In contrast, in context mixers like PAQ the weight for order N vs order N+1 is stored outside of the model, not per context but as a
global property of the data. The PAQ way is much more flexible because you can always use some context for the weight if you like.
Next post we'll get into online gradient descent.
Secondary Estimation began as a way to apply a (small) correction to a (large) primary model, but we will see in our journey that it has transformed into a way of doing heavy duty modeling itself.
The simplest and perhaps earliest form of secondary estimation may have been bias factors for linear predictors.
Say you are predicting some signal like audio. A basic linear predictor is of the form :
to code x[t]
pred = 2*x[t-1] - x[t-2]
delta = x[t] - pred
transmit delta
Now it's common to observe that "delta" has remaining predictability. The simplest case is if it just has
an overall bias; it tends to be > 0. Then you track the average observed delta and subtract it off to
correct the primary predictor. A more interesting case is to use some context to track different corrections.
For example :
Previous delta as context :
context = delta > 0 ? 1 : 0;
pred = 2*x[t-1] - x[t-2]
delta = x[t] - pred
bias = secondary_estimation[ context ]
transmit (delta - bias)
secondary_estimation[ context ].update(delta)
Now we are tracking the average bias that occurs when previous delta was positive or negative, so if they
tend to occur in clumps we will remove that bias. ("update" means "average the observation in, in some way", which we will
perhaps revisit).
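A minimal sketch of that previous-delta-sign secondary estimator; the geometric-decay averaging in the update (and its 0.05 rate) is one possible choice of "update", not the only one :

struct BiasCorrector
{
    double bias[2] = { 0, 0 };   // one tracked bias per context (prev delta <= 0, prev delta > 0)
    int    context = 0;

    // call with delta = x[t] - pred ; returns the corrected residual to transmit
    double correct_and_update(double delta)
    {
        double send = delta - bias[context];               // transmit (delta - bias)
        bias[context] += 0.05 * (delta - bias[context]);   // "update" : geometric-decay average of observed deltas
        context = (delta > 0) ? 1 : 0;                     // previous delta becomes the next symbol's context
        return send;
    }
};

// usage with the linear predictor above :
//   pred  = 2*x[t-1] - x[t-2];
//   delta = x[t] - pred;
//   transmit( corrector.correct_and_update(delta) );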
Another possibility is to use local shape information, something like :
for last 4 values , track if delta was > 0
each of those is a bit
this makes a 4 bit context
look up bias using this context
Now you can detect patterns like the transition from flats to edges and apply different biases in each case.
The fundamental thing that we are trying to do here, which is a common theme is :
The primary model is a mathematical predictor, which extracts some fundamental information about the data.
The secondary estimator tracks how well that primary model is fitting the current observation set and
corrects for it.
In CALIC a form of secondary estimation very similar to this is used. CALIC uses a shape-adaptive gradient predictor (DPCM lossless image coding). This is the primary model; essentially it tries to fit the local neighborhood into one of a few classes (smooth, 45 degree edge, 90 degree edge, etc.) and uses different linear predictors for each case. The prediction error from that primary model is then corrected using a table lookup in a shape context. The table lookup tracks the average bias in that shape context. While the primary model uses gradient predictors that are hard-coded (don't vary from image to image), the secondary model can correct for how well those gradient predictors fit the current image.
I'm going to do another hand wavey example before we get into modern data compression.
A common form of difficult prediction that everyone is familiar with is weather prediction.
Modern weather prediction is done using ensembles of atmospheric models. They take observations of various atmospheric variables (temperature, pressure, vorticity, clouds, etc.) and do a physical simulation to evolve the fluid dynamics equations forward in time. They form a range of possible outcomes by testing how variations in the input lead to variations in the output; they also create confidence intervals by using ensembles of slightly different models. The result is a prediction with a probability for each possible outcome. The layman is only shown a simplified prediction (eg. predict 80 degrees Tuesday) but the models actually assign a probability to each possibility (80 degrees at 30% , 81 degrees at 15%, etc.)
Now this physically based simulation is a primary model. We can adjust it using a secondary model.
Again the simplest form would be if the physical simulation just has a general bias. eg. it predicts
0.5 degrees too warm on average. A more subtle case would be if there is a bias per location.
primary = prediction from primary model
context = location
track (observed - primary)
store bias per context
secondary = primary corrected by bias[context]
Another correction we might want is the success of yesterday's prediction. You can imagine that
the prediction accuracy is usually better if yesterday's prediction was spot on. If we got yesterday
grossly wrong, it's more likely we will do so again. Now in the primary physical model, we might
be feeding in yesterday's atmospheric conditions as context already, but it is not using yesterday's
prediction. We are doing secondary estimation using side data which is not in the primary model.
You might also imagine that there is no overall bias to the primary prediction, but there is bias
depending on what value it output. And furthermore that bias might depend on location. And it might
depend on how good yesterday's prediction was. Now our secondary estimator is :
make primary prediction
context = value of primary, quantized into N bits
(eg. maybe we map temperature to 5 bits, chance of precipitation to 3 bits)
context += location
context += success of yesterday's prediction
secondary = primary corrected by bias[context]
Now we can perform quite subtle corrections. Say we have a great overall primary predictor.
But in Seattle, when the primary model predicts a temperature of 60 and rain at 10% , it tends to
actually rain 20% of the time (and more likely if we got yesterday wrong).
SEE in PPMZ
I introduced Secondary Estimation (for escapes) to context modeling in PPMZ.
(See "Solving the Problems of Context Modeling" for details; I was obviously quite modest about my work back then!)
Let's start with a tiny bit of background on PPM and what was going on in PPM development back then.
PPM takes a context of some previous characters (order-N = N previous characters) and tracks what characters have been seen in that context. eg. after order-4 context "hell" we may have seen space ' ' 3 times and 'o' 2 times (and if you're a bro perhaps 'a' too many times).
Classic PPM starts from deep high order contexts (perhaps order-8) and if the character to be coded has not been seen in that context order, it sends an "escape" and drops down to the next lower context.
The problem of estimating the escape probability was long a confounding one. The standard approach was to
use the count of characters seen in the context to form the escape probability.
A nice summary of classical escape estimators from
"Experiments on the Zero Frequency Problem" :
n = the # of tokens seen so far
u = number of unique tokens seen so far
ti = number of unique tokens seen i times
PPMA : 1 / (n+1)
PPMB : (u - t1) / n
PPMC : u / (n + u)
PPMD : (u/2) / n
PPMP : t1/n - t2/n^2 - t3/n^3 ...
PPMX : t1 / n
PPMXC : t1 / n if t1 < n
u / (n + u) otherwise
eg. for PPMA you just leave the escape count at 1. For PPMC, you increment the escape count each time a
novel symbol is seen. For PPMD, you increment the escape count by 1/2 on a novel symbol (and also set
the novel characters count to 1/2) so the total always goes up by 1.
Now some of these have theoretical justifications based on Poisson process models and Laplacian priors and so on, but the correspondence of those priors to real world data is generally poor (at low counts).
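For reference, a few of those estimators written out directly as code (n, u, t1 exactly as defined above) :

// n = total tokens seen in the context, u = unique tokens, t1 = tokens seen exactly once
static double esc_ppma(int n)         { return 1.0 / (n + 1); }
static double esc_ppmc(int n, int u)  { return (double)u / (n + u); }
static double esc_ppmd(int n, int u)  { return (u * 0.5) / n; }
static double esc_ppmx(int n, int t1) { return (double)t1 / n; }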
The big difficulty with PPM (and data compression in general) is that low counts is where we do our business. We can easily get contexts with high counts (sufficient statistics) and accurate estimators by dropping to lower orders. eg. if you did PPM and limit your max order to order-2 , you wouldn't have this problem. But we generally want to push our orders to the absolute highest breaking point, where our statistical density goes to shit. Big compression gains can be found there, so we need to be able to work where we may have only seen 1 or 2 events in a context before.
Around the time I was doing PPMZ, the PPM* and PPMD papers by Teahan came out ("Unbounded length contexts for PPM" and "Probability estimation for PPM"). PPMD was a new better estimator, but to me it just screamed that something was wrong. All these semi-justified mathematical forms for the escape probability seemed to just be the wrong approach. At the same time, PPM* showed the value of long deterministic contexts, where by definition "u" (the number of unique symbols) is always 1 and "n" (total) count is typically low.
We need to be able to answer a very important and difficult question. In a context that has only seen one symbol one time, what should the escape probability be? (equivalently, what is the probability that one symbol is right?). There is no simple formula of observed counts in that context that can solve that.
Secondary Estimation is the obvious answer. We want to know :
P(esc) when only 1 symbol has been seen in a context
We might first observe that this P(esc) varies between files. On more compressible files, it is much
lower. On near-random files, it can be quite high. This is a question of to what extent is this
novel deterministic context a reflection of the end statistics, or just a mirage due to sparse statistics.
That is,
In an order-8 context like "abcdefgh" we have only ever seen the symbol 'x' once
Is that a reflection of a true pattern? That "abcdefgh" will always predict 'x' ?
Or is the following character in fact completely random, and if we saw more symbols we
would see all of them with equal probability?
Again you don't have the information within the context to tell the difference. Using an average
across the whole file is a very rough guess, but obviously you could do better.
You need to use out-of-context observations to augment the primary model. Rather than just tracking the real P(esc) for the whole file we can obviously use some context to find portions of the data where it behaves differently. For example P(esc) (when only 1 symbol has been seen in a context) might be 75% if the parent context has high entropy; it might be 50% if the order is 6, it might be 10% if order is 8 and the previous symbols were also coded from high order with no escape.
In PPMZ, I specifically handled only low escape counts and totals; that is, it was specifically trying to solve this sparse statistics problem. The more modern approach (APM's, see later) is to just run all probabilities through secondary statistics.
The general approach to secondary estimation in data compression is :
create a probability from the primary model
take {probability + some other context} and look that up in a secondary model
use the observed statistics in the secondary model for coding
The "secondary model" here is typically just a table. We are using the observed statistics in one model as the input to another model.
In PPMZ, the primary model is large and sparse while the secondary model is small and dense, but that need not always be the case.
The secondary model is usually initialized so that the input probabilities pass through if nothing has been observed yet. That is,
initially :
secondary_model[ probability + context ] = probability
Secondary estimation in cases like PPM can be thought of as a way of sharing information across sparse contexts that are not
connected in the normal model.
That is, in PPM contexts that are suffixes have a parent-child relationship and can easily share information; eg. "abcd" and "bcd" and "cd"
are connected and can share observations. But some other context "xyzw" is totally unconnected in the tree. Despite that, by having
the same statistics they may be quite related! That is
"abcd" has seen 3 occurances of symbol 'd' (and nothing else)
"xyzw" has seen 3 occurances of symbol 'd' (and nothing else)
these are probably very similar contexts even though they have no connection in the PPM tree. The next coding event that happens in either
context, we want to communicate to the other. eg. say a symbol 'e' now occurs in "abcd" - that makes it more likely that "xyzw" will have
an escape. That is, 3-count 'd' contexts are now less likely to be deterministic.
The secondary estimation table allows us to accumulate these disparate contexts together and merge their observations.
The PPMZ SEE context is made from the escape & total count, as well as three different orders of previous symbol bits. The three orders are then blended together (an early form of weighted mixing!). I certainly don't recommend the exact details of how PPMZ does it today. The full details of the PPMZ SEE mechanism can be found in PPMZ (Solving the Problems of Context Modeling) (PDF) . You may also wish to consult the later work by Shkarin "PPM: one step to practicality" (PPMd/PPmonstr/PPMii) which is rather more refined. Shkarin adds some good ideas to the context used for SEE, such as using the recent success so that switches between highly compressible and less compressible data can be modeled.
Let me repeat myself to be clear. Part of the purpose & function of secondary estimation is to find patterns where the observed counts in a state do not linearly correspond to the actual probability in any way.
That is, secondary estimation can model things like :
Assuming a binary coder
The coder sees 0's and 1's
Each context state tracks n0 and n1 , the # of 0's and 1's seen in that context
On file A :
if observed { n0 = 0, n1 = 1 } -> actual P1 is 60%
if observed { n0 = 0, n1 = 2 } -> actual P1 is 70%
if observed { n0 = 0, n1 = 3 } -> actual P1 is 90%
On file B :
if observed { n0 = 0, n1 = 1 } -> actual P1 is 70%
if observed { n0 = 0, n1 = 2 } -> actual P1 is 90%
if observed { n0 = 0, n1 = 3 } -> actual P1 is 99%
The traditional models are of the form :
P1 = (n1 + c) / (n0 + n1 + 2c)
and you can make some Bayesian prior arguments to come up with different values of c (1/2 and 1 are common)
The point is there is *no* prior that gets it right. If the data actually came from a Laplacian or Poisson source, then observing counts
you could make estimates of the true P in this way. But context observations do not work that way.
There are lots of different effects happening. One is that there are multiple "sources" in a file. There might be one source that's pure random, one source is purely predictable (all 0's or all 1's), another source is in fact pretty close to a Poisson process. When you have only a few observations in a context, part of the uncertainty is trying to guess which of the many sources in the file that context maps to.
Adaptive Probability Map (APM) ; secondary estimation & beyond.
Matt Mahoney's PAQ (and many other modern compressors) make heavy use of secondary estimation, not just for escapes, but for every coding event. Matt calls it an "APM" , and while it is essentially the same thing as a secondary estimation table, the typical usage and some details are a bit different, so I will call them APMs here to distinguish.
PPM-type compressors hit a dead end in their compression ratio due to the difficulty of doing things like mixing and secondary estimation in character alphabets. PAQ, by using exclusively binary coding, only has one probability to work with (the probability of a 0 or 1, and then the other is inferred), so it can be easily transformed. This allows you to apply secondary estimation not just to the binary escape event, but to all symbol coding.
An APM is a secondary estimation table. You take a probability from a previous stage, look it up in a table (with some additional context, optionally), and then use the observed statistics in the table, either for coding or as input to another stage.
By convention, there are some differences in typical implementation choices. I'll describe a typical APM implementation :
probability P from previous stage
P is transformed nonlinearly with a "stretch"
[stretch(P) + context] is looked up in APM table
observed P there is passed to next stage
optionally :
look up both floor(P) and ceil(P) in some fixed point to get two adjacent APM entries
linearly interpolate them using fractional bits
A lot of the details here come from the fact that we are passing general P's through the APM, rather than restricting to only low counts
as in PPMZ SEE. That is, we need to handle a wide range of P's.
Thus, you might want to use a fixed point to index the APM table and then linearly interpolate (standard technique for tabulated function lookup); this lets you use smaller, denser tables and still get fine precision.
The "stretch" function is to map P so that we quantize into buckets where we need them. That is, if you think in terms of P in [0,1] we
want to have smaller buckets at the endpoints. The reason is that P steps near 0 or 1 make a very large difference in codelen, so getting
them right there is crucial. That is, a P difference of 0.90 to 0.95 is much more important than from 0.50 to 0.55 ; instead of having variable
bucket sizes (smaller buckets at the end) we use uniform quantization but stretch P first (the ends spread out more). One nice option for
stretch is the "logit" function.
LPS symbol codelen (LPS = less probable symbol)
steps of P near 0.5 produce small changes in codelen :
P = 0.5 : 1 bit
P = 0.55 : 1.152 bits
steps of P near 1.0 produce big changes in codelen :
P = 0.9 : 3.32193 bits
P = 0.95 : 4.32193 bits
The reason why you might want to do this stretch(P) indexing to the APM table (rather than just look up P)
is again density. With the stretch(P) distortion, you can get away with only 5 or 6 bits of index, and
still get good resolution where you need it. With linear P indexing you might need a 10 bit table size for
the same quality. That's 32-64 entries instead of 1024 which is a massive increase in how often each slot
gets updated (or a great decrease in the average age of statistics; we want them to be as fresh as possible).
With such coarse indexing of the APM, the two adjacent table slots should be used (above and below the input P)
and linearly interpolated.
APM implementations typically update the observed statistics using the "probability shift" method (which is equivalent to constant total count or geometric decay IIR).
An APM should be initialized such that it passes through the input probability. If you get a probability from the previous stage, and nothing has yet been observed in the APM's context, it passes through that probability. Once it does observe something it shifts the output probability towards the observed statistics in that context.
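Putting those pieces together, here's a minimal sketch of an APM in that style : stretch(P) indexing, two adjacent slots with linear interpolation, initialization that (approximately) passes the input through, and a simple shift-toward-observed update. The table size, stretch clamp range, and update rate are made-up parameters, and the extra context bits are omitted for brevity :

#include <algorithm>
#include <cmath>
#include <vector>

struct APM
{
    enum { BITS = 6 };                  // 6-bit stretched index -> 65 interpolated entries
    std::vector<double> table;

    static double stretch(double p) { return std::log(p / (1.0 - p)); }    // logit
    static double squash (double t) { return 1.0 / (1.0 + std::exp(-t)); }

    static double index_to_stretch(int i)    { return (i * 16.0 / (1 << BITS)) - 8.0; }   // index -> [-8,8]
    static double stretch_to_index(double p)                                              // P -> fractional index
    {
        double s = std::min(std::max(stretch(p), -8.0), 8.0);
        return (s + 8.0) * (1 << BITS) / 16.0;
    }

    APM() : table((1 << BITS) + 1)
    {
        // initialize so the APM passes the input probability through (up to interpolation error)
        for (int i = 0; i <= (1 << BITS); i++)
            table[i] = squash(index_to_stretch(i));
    }

    // look up the adapted probability; remembers which slots were touched (lo, frac) for the update
    double lookup(double p, int & lo, double & frac) const
    {
        double fi = stretch_to_index(p);
        lo   = (int)fi;
        frac = fi - lo;
        if (lo >= (1 << BITS)) { lo = (1 << BITS) - 1; frac = 1.0; }   // clamp the top edge
        return table[lo] * (1.0 - frac) + table[lo + 1] * frac;
    }

    // after coding, shift the two touched entries toward the observed bit
    void update(int lo, double frac, int bit, double rate = 0.02)
    {
        double target = bit ? 1.0 : 0.0;
        table[lo]     += rate * (1.0 - frac) * (target - table[lo]);
        table[lo + 1] += rate * frac         * (target - table[lo + 1]);
    }
};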
We can see something interesting that APM's can do already :
I wrote before about how secondary estimation is crucial in PPM to handle sparse contexts with very few events. It allows you to track the actual observed rates in that class of contexts. But we thought of SEE as less important in dense contexts.
That is true only in non-time-varying data. In the real world almost all data is time varying. It can shift character, or go through different modes or phases. In that sense all contexts are sparse, even if they have a lot of observed counts, because they are sparse in observation of recent events. That is, some order-4 context in PPM might have 100 observed counts, but most of those were long in the past (several megabytes ago). Only a few are from the last few bytes, where we may have shifted to a different character of data.
Running all your counts through an APM picks this up very nicely.
During stable phase :
look up primary model
secondary = APM[ primary ]
when primary is dense, secondary == primary will pass through
Now imagine we suddenly transition to a chunk of data that is nearly random
the primary model still has all the old counts and will only slowly learn about the new random data
but the APM sees every symbol coded, so it can learn quickly
APM[] will quickly tend towards APM[P] = 0.5 for all P
that is, the input P will be ignored and all states will have 50/50 probability
This is quite an extreme example, but more generally the APM can pick up regions where we move to
more or less compressible data. As in the PPM SEE case, what's happening here is sharing of
information across parts of the large/sparse primary model that might not otherwise communicate.
Let me repeat that to be clear :
Say you have a large primary model, with N context nodes. Each node is only updated at a rate of (1/N) times per byte processed. The APM on the other hand is updated every time. It can adapt much faster - it's sharing information across all the nodes of the primary model. You may have very mature nodes in the primary model with counts like {n0=5,n1=10} (which we would normally consider to be quite stable). Now you move to a region where the probability of a 1 is 100%. It will take *tons* of steps for the primary model to pick that up and model it, because the updates are scattered all over the big model space. Any one node only gets 1/N of the updates, so it takes ~N*10 bytes to get good statistics for the change.
An interesting common way to use APM's is in a cascade, rather than a single step with a large context.
That is, you want to take some P from the primary model, and you have additional contexts C1 and C2 to condition the APM step.
You have two options :
Single step, large context :
P_out = APM[ {C1,C2,P} ]
Two steps, small context, cascade :
P1 = APM1[ {C1,P} ]
P_out = APM2[ {C2,P1} ]
That is, feed P through an APM with one context, then take that P output and pass it through another APM with a different context,
as many times as you like.
Why use a cascade instead of a large context? The main reason is density.
If you have N bits of extra context information to use in your APM lookup, the traditional big table approach dilutes your statistics by a factor of 2^N. The cascade approach only dilutes by a factor of N.
The smaller tables of the APM cascade mean that it cannot model certain kinds of correlations. It can only pick up ways that each context
correlated to biases on P. It cannot pick up joint context correlations.
The APM cascade can model things like :
if C1 = 0 , high P's tend to skew higher (input P of 0.7 outputs 0.9)
if C2 = 1 , low P's tend to skew lower
But it can't model joint things like :
if C1 = 0 and C2 = 1 , low P's tend towards 0.5
One nice thing about APM cascades is that they are nearly a NOP when you add a context that doesn't help. (as opposed to the single table
large context method, which has a negative diluting effect when you add context bits that don't help). For example if C2 is just not
correlated to P, then APM2 will just pass P through unmodified. APM1 can model the correlation of C1 and P without being molested by the
mistake of adding C2. They sort of work independently and stages that aren't helping turn themselves off.
One funny thing you can do with an APM cascade is to just use it for modeling with no primary model at all. This looks like :
Traditional usage :
do order-3 context modeling to generate P
modify observed P with a single APM :
P = APM[P]
APM Cascade as the model :
P0 = order-0 frequency
P1 = APM1[ o1 | P0 ];
P2 = APM2[ o2 | P1 ];
P3 = APM3[ o3 | P2 ];
That is, just run the probability from each stage through the next to get order-N modeling.
We do this from low order to high, so that each stage can add its observation to the extent it has seen events. Imagine that the current order2 and order3 contexts have not been observed at all yet. Then we get P1 from APM1, pass it through APM2, since that has seen nothing it just passes P1 through, then APM3 does too. So we wind up falling back to the deepest stage that has actually observed something.
This method of using an APM cascade as the primary model is used by Mahoney in BBB.
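A minimal self-contained sketch of the cascade-as-model idea; TinyAPM here is my own deliberately crude APM (linear P quantization, no interpolation), and o1/o2/o3 stand for whatever previous-symbol contexts you use :

#include <vector>

struct TinyAPM
{
    std::vector<double> p;      // adapted probability per {context, quantized P} slot
    std::vector<int>    seen;   // observation count per slot
    int pbits;

    TinyAPM(int num_contexts, int pbits_) : p((size_t)num_contexts << pbits_), seen(p.size(), 0), pbits(pbits_) { }

    int slot(int ctx, double pin) const
    {
        int q = (int)(pin * ((1 << pbits) - 1) + 0.5);    // crude linear quantization of the input P
        return (ctx << pbits) + q;
    }

    double lookup(int ctx, double pin) const
    {
        int s = slot(ctx, pin);
        return seen[s] ? p[s] : pin;                      // pass the input through until this slot has data
    }

    void update(int ctx, double pin, int bit)
    {
        int s = slot(ctx, pin);
        if (!seen[s]) p[s] = pin;                         // start from the pass-through value
        seen[s]++;
        p[s] += 0.05 * ((bit ? 1.0 : 0.0) - p[s]);        // geometric-decay update toward the observed bit
    }
};

// cascade as the model :
//   P0 = order-0 frequency estimate
//   P1 = apm1.lookup(o1, P0);
//   P2 = apm2.lookup(o2, P1);
//   P3 = apm3.lookup(o3, P2);    // code the bit with P3
//   apm1.update(o1, P0, bit);  apm2.update(o2, P1, bit);  apm3.update(o3, P2, bit);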
A 2d APM can also be used as a mixer, which I think I will leave for the next post.
TLDR : don't call eof on pipes in Windows! Use read returning zero instead to detect eof.
We observed that semi-randomly pipes would report EOF too soon and we'd get a truncated stream. (this is not the issue with ctrl-Z ; of course you have to make sure your pipe is set to binary).
To reproduce the issue you may use this simple piper :
// piper :
#include <stdio.h>
#include <fcntl.h>
#include <io.h>
#include <errno.h>
int main(int argc,char *argv[])
{
int f_in = _fileno(stdin);
int f_out = _fileno(stdout);
// f_in & out are always 0 and 1
_setmode(f_in, _O_BINARY);
_setmode(f_out, _O_BINARY);
char buf[4096];
// NO : sees early eof!
// don't ask about eof until _read returns 0
while ( ! _eof(f_in) )
// YES : just for loop here
//for(;;)
{
int n = _read(f_in,buf,sizeof(buf));
if ( n > 0 )
{
_write(f_out,buf,n);
}
else if ( n == 0 )
{
// I think eof is always true here :
if ( _eof(f_in) )
break;
}
else
{
int err = errno;
if ( err == EAGAIN )
continue;
// respond to other errors?
if ( _eof(f_in) )
break;
}
}
return 0;
}
the behavior of piper is this :
vc80.pdb 1,044,480
redirect input & output :
piper.exe < vc80.pdb > r:\out
r:\out 1,044,480
consistently copies whole stream, no problems.
Now using a pipe :
piper.exe < vc80.pdb | piper > r:\out
r:\out 16,384
r:\out 28,672
r:\out 12,288
semi-random output sizes due to hitting eof early
If the eof check marked with "NO" in the code is removed, and the for loop is used instead, piping
works fine.
I can only guess at what's happening, but here's a shot :
If the pipe reader asks about EOF and there is nothing currently pending to read in the pipe, eof() returns true.
Windows anonymous pipes are unbuffered. That is, they are lock-step between the reader & writer. When the reader calls read() it blocks until the writer puts something, and vice-versa. The bytes are copied directly from writer to reader without going through an internal OS buffer.
In this context, what this means is if the reader drains out everything the writer had to put, and then races ahead and
calls eof() before the writer puts anything, it sees eof true. If the writer puts something first, it sees eof false.
It's just a straight up race.
time   reader                       writer
-----------------------------------------------------------
0                                   _write blocks waiting for reader
1      _eof returns false
2      _read consumes from writer
3                                   _write blocks waiting for reader
4      _eof returns false
5      _read consumes from writer
6      _eof returns true
7                                   _write blocks waiting for reader
at times 1 and 4, the eof check returns false because the writer had gotten to the pipe first. At time 6 the
reader runs faster and checks eof before the writer can put anything, and now it sees eof as true.
As a check of this hypothesis : if you add a Sleep(1) call immediately before the _eof check (in the loop), there is no early eof observed, because the writer always gets a chance to put data in the pipe before the eof check.
Having this behavior be a race is pretty nasty. To avoid the problem, never ask about eof on pipes in Windows, instead use the return value of read(). The difference is that read() blocks on the writer process, waiting for it to either put some bytes or terminate. I believe this is a bug. Pipes should never be returning EOF as long as the process writing to them is alive.
Recall
K = Sum{ 2^-L_i }
K <= 1 is prefix codeable
you can think of K as the sum of effective coding probabilities (P = 2^-L), so K over 1 is a total probability over 100%.
When we initially make Huffman codelens and then apply a limit, we use too much code space. That corresponds to K > 1.
If you write K as a binary decimal, it will be something like :
K = 1.00101100
Those 1 bits that are below K = 1.0 are exactly the excess codelens that we need to correct.
That is, if you have an extra symbol of length L, that is K too high by 2^-L , that's a 1 bit in the binary at position L to the right of the decimal.
codelen set of {1,2,2,3}
a : len 1
b : len 2
c : len 2
d : len 3
K = 2^-1 + 2^-2 + 2^-2 + 2^-3
K = 0.100 +
0.010 +
0.010 +
0.001
K = 1.001
take (K-1) , the part below the decimal :
K-1 = .001
we have an excess K of 2^-3 ; that's one len 3 too many.
To fix that we can change a code of len 2 to len 3
that does
-= 0.010
+= 0.001
same as
-= 0.001
That is, when (K-1) has 2 leading zeros, you correct it by promoting a code of len 2 to len 3
so if we compute K in fixed point (integer), we can just take K - one and do a count leading zeros on it, and it tells you which code len to
correct. The bits that are on in K tell us exactly what's wrong with our code.
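In code, the fixed-point K and the bit-reading of the excess look something like this (a sketch; "one" is 1 << limit and the codelens are assumed to already be in [1,limit]) :

#include <stdint.h>

// K in fixed point, with "one" = 1 << limit ; a codelen L contributes 1 << (limit - L)
static uint32_t kraft_K(const int * codelens, int num_syms, int limit)
{
    uint32_t K = 0;
    for (int i = 0; i < num_syms; i++)
        K += 1u << (limit - codelens[i]);
    return K;   // K <= (1u << limit) means the lengths are prefix codeable
}

// when K > one, the excess (K - one) read as a binary fraction says exactly what is wrong :
//   a set bit at position b is an excess of 2^-(limit - b), i.e. one code of length (limit - b) too many;
//   promoting a code from length (limit - b - 1) to (limit - b) removes exactly that bit.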
Now, similarly, we can think of the compound operations in the same way.
Any time we need to do a correction of K by 2^-L we could do one code from (L-1) -> L , or we could do two codes from L -> (L+1), or ... that can be seen as just an expression of the mathematics of how you can add bits together to make the desired delta.
That is :
You want 0.001 (increment codelen 2 to 3)
that can be :
0.001 = 0.0001
+0.0001
(increment two lower lens)
or :
0.001 = 0.01
-0.001
increment a lower codelen then decrement one at your level
Now, thinking about it this way we can try to enumerate all the possible moves.
To reduce the space of all possible moves, we need a few assumptions :
1. No non-len-group changing moves are profitable. That is, the set of symbols for the current len groups are the best possible set. eg. it's not profitable to do something like { symbol a from len 2 to 3 and symbol b from len 3 to 2 } . If there are any profitable moves like that, do them separately. What this means is the counts are sorted; eg. if a symbol is at a higher codelen, its count is less than or equal to the count of any symbol at lower codelen.
2. I only need to enumerate the moves that can be the cheapest (in terms of total code len).
In that case I think that you can enumerate all the moves thusly :
a change of 2^-L can be accomplished via
inc(L-1) (that means increment a codelen at L-1)
inc(L) can be substituted with { inc(L+1), inc(L+1) }
And each of those inc(L+1) can also be substituted with a pair at L+2.
You take either a codelen at L or two at the next higher len,
whichever has a smaller effect on CL (total codelen).
a change of 2^-L can also be accomplished via :
inc(L-2) and dec(L-1)
OR
inc(L-3) and dec(L-2) and dec(L-1)
again these can be understood as binary decimals :
0.0001 = 0.0010 - 0.0001
0.0001 = 0.0100 - 0.0011
0.0001 = 0.1000 - 0.0111
and finally the decs are also a tree of pairs :
dec(L) can instead be { dec(L+1), dec(L+1) }
this seems like a lot, but compared to all possible ways to make the number X from adds & subtractions of any power of two, it's quite small.
The reason we can consider such a reduced set of moves is because we only need the one best way to toggle a bit of K, not all possible ways.
Really we just do :
enumerate the position of the lowest codelen to inc
between 1 and (L-1)
decrement at all codelens below the one you incremented
down to (L-1)
this produces the desired change in K of 2^-L
each "inc" & "dec" can either be at that code len, or a pair of next codelen
(somebody smarter than me : prove that these are in fact all the necessary moves)
Let's look at how dynamic programming reduces the amount of work we have to do.
Say we need to do an inc() at L = 3
(inc means increment a codelen, that decreases K by 2^-4)
We can either increment a single symbol at L = 3
or a pair from L = 4
(this is just the same kind of operation you do in normal Huffman tree building)
The best symbol at L = 3 is just the lowest count symbol (if any exists)
Ask for the best two nodes at L = 4
Those can also be a single symbol, or a pair from L = 5
When you ask for the first node at L = 4, it gets the best two at L = 5
but then imagine the single symbol at L = 4 was lower count and is taken
Now you ask for the second node at L = 4, it again needs the best two at L = 5
we already have them, no more work is needed.
Any time you chose a symbol rather than a pair of higher len, the 2^n tree stops growing.
[3] -> { [4a], [4b] }
[4a] -> sym(4) or { [5a], [5b] }
[4a] takes sym(4)
[4b] -> sym2(4) or { [5a], [5b] }
[4b] doesn't need a new evaluation at level 5
Another issue I need to mention is that as you increment and decrement codelens, they move between
the lists, so the cached dynamic programming lists cannot be retained, or can they?
(for example, you want to keep the symbols at each len sorted by count)
In fact the accounting for symbols moving is simple and doesn't need to invalidate the cached lists.
When you do an inc(L) , that symbol moves to L+1 and is now available for a further inc(L+1)
(this does not occur with dec(L) since it moves in the opposite direction)
Say you wanted an inc(3) ; you consider doing a pair of { inc(4), inc(4) }
One of the inc(4)'s can be a pair of inc(5)'s , and one of those len 5 symbols can be the one you did inc 4 on.
That is, say you have 'A' at len 4 and 'B' at len 5
inc(3) <- { inc( 'A' @ 4 ) , { inc( 'A' @ 5 ) , inc( 'B' @ 5 ) } }
This is a legal move and something you have to consider.
But the rule for it is quite simple - if a symbol occurs earlier in the list of chosen increments, it is
available at the next level.
If you're familiar with the way Package Merge makes its lists, this is quite similar. It just means
when you choose the lowest count symbol at the current level, you can also draw from the previous
increments in your list if they have lower count.
These queues we are building are exactly the same thing you would need to do the full package merge algorithm. The difference is, in traditional Package Merge you would start with all the symbols at codelen = max (K is too low), and then incrementally apply the best decrements to increase K. Here we are starting with K pretty close to 1 , with K greater than 1. The result is that in many cases we can do far fewer package merge steps. I call this Incremental Package Merge. It allows you to start from a nearly-Kraft set of codelens and get to the same optimal solution as if you did full package merge.
Let's look at a concrete example or two :
codelen : symbol+count
len 3 : a+7
len 4 : b+3 , c+3
len 5 : d+1 , e+1
you need an inc(3) to get K -= 2^-4
you can :
inc(a) ; cost 7
inc(b + c) ; cost 6
inc(b + { d + e } ) ; cost 5
The best inc(4) is {b,d,e}
Another example :
len 3 : a+7
len 4 : b+2
len 5 : c+1
again say you want an inc(3)
the best is
inc( b + { b + c } ) ; cost 5
here the best option is to inc b twice
And finally let's think again about how Package Merge is related to the "numismatic" or "coin collector
problem".
If you play with this you will see what we're really doing is working on a two-cost accountancy. Each symbol has a cost in K which is determined only by its current codelen (2^-L). It also has a cost in total codelen CL which is determined only by its symbol count. We are trying to pay off a debt in K (or spend a credit in K) and maximize the value we get in CL.
Engel Coding is a fast/approximate way of forming length limited prefix code lengths.
I'm going to back up first and remind us what a prefix code is and the use of the Kraft inequality.
We want to entropy code some alphabet using integer code lengths. We want those codes to be decodeable
without side information, eg. by only seeing the bits themselves, not transmitting the length in bits.
0 : a
10 : b
11 : c
is a prefix code. If you see the sequence "11100" it can only decode to "11=c,10=b,0=a".
0 : a
1 : b
11 : c
is not a prefix code. Any occurrence of "11" can either be two b's or one c. They can't be resolved without
knowing the length of the code. There is no bit sequence you can possibly assign to c that works.
The fact that this is impossible is a function of the code lengths.
You can construct a prefix code from code lengths if and only if the lengths satisfy the Kraft inequality :
Sum{ 2^-L_i } <= 1
It's pretty easy to understand this intuitively if you think like an arithmetic coder. 2^-L is the effective
probability for a code length, so this is just saying the probabilities must sum to <= 100%
That is, think of the binary code as dividing the range [0,1] like an arithmetic coder does. The first bit divides it in half, so a single bit code would take half the range. A two bit code takes half of that, so a quarter of the original range, eg. 2^-L.
The two numbers that we care about are the Kraft code space used by the code lengths, and the total code length
of the alphabet under this encoding :
Kraft code space :
K = Sum{ 2^-L_i }
Total code length :
CL = Sum{ C_i * L_i }
L_i = code length in bits of symbol i
C_i = count of symbol i
Minimize CL subject to K <= 1 (the "Kraft inequality")
We want the minimum total code length subject to the prefix code constraint.
The well known solution to this problem is Huffman's algorithm. There are of course lots of other ways to make prefix code lengths which do not minimize CL. A famous historical one is Shannon-Fano coding, but there have been many others, particularly in the early days of data compression before Huffman's discovery.
Now for a length-limited code we add the extra constraint :
max{ L_i } <= limit
now Huffman's standard algorithm can't be used. Again the exact solution is known; to minimize CL under the
two constraints of the Kraft inequality and the maximum codelength limit, the algorithm is "Package Merge".
In Oodle we (uniquely) actually use Package Merge at the higher compression levels, but it is too slow and complicated to use when you want fast encoding, so at the lower compression levels we use a heuristic.
The goal of the heuristics is to find a set of code lengths that satisfy the constraints and get CL reasonably close to the minimum (what Package Merge would find).
The Oodle heuristic works by first finding the true Huffman code lengths, then if any are over the limit, they are changed to equal the limit. This now violates the Kraft inequality (they are not prefix decodeable), so we apply corrections to get them to K = 1. ZStd uses a similar method (and I imagine lots of other people have in the past; this is pretty much how length-limited near-Huffman is done). My previous post on the heuristic length limited code is below, with some other Huffman background :
cbloom rants: 07-03-10 - Length-Limitted Huffman Codes Heuristic
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 08-12-10 - The Lost Huffman Paper
cbloom rants Huffman Performance
cbloom rants Huffman Correction
(Engel correctly points out that most of the places where I say "Huffman coding" I should really be saying "prefix coding". The decoding methods and canonical code assignment and so on can be done with any prefix code. A Huffman code is only the special case of a prefix code with optimal lengths. That is, Huffman's algorithm is only the part about code length assignment; the rest is just prefix coding.)
So Engel's idea is : if we're going to limit the code lengths and muck them up with some heuristic anyway, don't bother with first finding the optimal non-length-limited Huffman code lengths. Just start with heuristic code lengths.
His heuristic is (conceptually) :
L_i = round( -log2( P_i ) )
which is intuitively a reasonable place to start. If your code lengths didn't need to be an integer number of bits, then you would
want them to be as close to -log2(P) as possible.
Then apply the limit and fix the lengths to satisfy the Kraft inequality. Note that in this case the tweaking of lens to satisfy Kraft is not just caused by lens that exceed the limit. After the heuristic codelens are made, even if they are short, they might not be Kraft. eg. you can get code lengths like { 2,3,3,3,3,3,4,4,4 } which are not prefix (one of the 3's needs to be changed to a 4). The idea is that unlike Huffman or Shannon-Fano which explicitly work by creating a prefix code by construction, Engel coding instead makes code lengths which could be non-prefix and relies on a fix up phase.
When Joern told me about this it reminded me of "Polar Coding" (Andrew Polar's, not the more common use of the term for error correction). Andrew Polar's code is similar in the sense that it tries to roughly assign log2(P) codelens to symbols, and then uses a fix-up phase to make them prefix. The details of the heuristic are not the same. (I suspect that there are lots of these heuristic entropy coders that have been invented over the years and usually not written down).
Obviously you don't actually want to do a floating-point log2; for the details of Engel's heuristic see his blog.
But actually the details of the initial codelen guess are not very important to Engel coding. His codelen adjustment phase is what actually determines the codelens. You can start the codelens all at len 1 and let the adjustment phase do all the work to set them, and in fact that gives the same final codelens!
I tried four methods of initial codelen assignment and they all produced the exact same final codelens.
The only difference is how many steps of the iterative refinement were needed to get them to Kraft equality.
all initial codelens = 1 : num_adjustment_iterations = 2350943
codelens = floor( -log2(P) ) : num_adjustment_iterations = 136925
codelens = round( -log2(P) ) : num_adjustment_iterations = 25823
Engel Coding heuristic : num_adjustment_iterations = 28419
The crucial thing is how the refinement is done.
To get to the refinement, let's go over some basics. I'll first describe the way we were actually doing the length limit heuristic in Oodle (which is not the same as what I described in the old blog post above).
In the Oodle heuristic, we start with Huffman, then clamp the lens to the limit. At this point, the Kraft K
is too big. That is, we are using more code space than we are allowed to. We need to raise some codelens
somewhere to free up more code space. But raising codelens increases the total codelen (CL). So the goal
is to bump up some codelens to get K = 1, with a minimum increase to CL.
If you do L_i ++
K -= 2^(-L_i)
K += 2^(-(L_i+1))
for a net change of :
K -= 2^(-(L_i+1))
(shorter codelen symbols make a bigger change to K)
and CL does :
CL -= C_i * L_i
CL += C_i * (L_i + 1)
net :
CL += C_i
(lower count symbols hurt total codelen less)
K_delta_i = 2^(-(L_i+1))
CL_delta_i = C_i
To get under the K budget, we want to find the lowest CL_delta with the maximum K_delta.
That is, code space (K) is the resource you want to buy, and code len (CL) is the currency you use to
pay for that code space. You want the best price :
price = CL_delta / K_delta
price = C_i * 2^(L_i+1)
What I was doing in Oodle was taking the step with the best "price" that didn't overshoot the target of K = 1.
If your symbols are sorted by count (which they usually are for Huffman codelen algorithms), then you don't need to compute "price" for all your symbols. The minimum price at each codelen always occurs at the lowest count symbol of that codelen (first in the sorted list). So rather than making a full heap of up to 256 symbols (or whatever your alphabet size is), you only need a heap of up to 16 entries (or whatever your codelen limit is) : the lowest count symbol at each codelen.
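Here is a sketch of that non-overshooting fix-up, written naively with a full scan per step rather than the per-codelen heap just described (the names are mine, this is not Oodle's code) :

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the "Huffman then limit" fix-up.  lens[] arrive as true Huffman
// code lengths clamped to len_limit, which can leave the Kraft sum over
// budget.  Work in fixed point : "one" = 1 << len_limit, so a symbol of
// codelen L contributes (one >> L).  While over budget, bump up the codelen
// of the symbol with the cheapest price (CL gained per unit of code space
// freed), taking only steps that do not overshoot the target.
void kraft_fixup_monotonic(std::vector<int> & lens,
                           const std::vector<uint64_t> & counts,
                           int len_limit)
{
    const uint64_t one = uint64_t(1) << len_limit;
    uint64_t K = 0;
    for (size_t i = 0; i < lens.size(); i++)
        if (counts[i] != 0)
            K += one >> lens[i];                     // 2^-L_i in fixed point

    while (K > one)
    {
        size_t best = SIZE_MAX;
        double best_price = 0.0;
        for (size_t i = 0; i < lens.size(); i++)     // naive full scan per step
        {
            if (counts[i] == 0 || lens[i] >= len_limit) continue;
            uint64_t K_delta = one >> (lens[i] + 1); // code space freed by L_i++
            if (K - K_delta < one) continue;         // would overshoot K = one
            // price is proportional to C_i * 2^(L_i+1) = CL_delta / K_delta
            double price = double(counts[i]) / double(K_delta);
            if (best == SIZE_MAX || price < best_price) { best = i; best_price = price; }
        }
        if (best == SIZE_MAX)
            break;   // no exact-fitting step left; a real implementation falls back here
        K -= one >> (lens[best] + 1);
        lens[best] += 1;
    }
}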
The big improvement in Engel's refinement heuristic is that it allows overshooting K if the absolute value of the distance to K decreases.
Consider K in fixed point with a 12 bit codelen limit. Then "one" is at K = 4096. Say you had K = 4099. It's 3 too big. My heuristic could only consider K steps of -=2 and -=1 (only power of two steps are possible). Engel can also take a step of -= 4 , changing K to 4095. It's now 1 too small (codeable but wasteful) and rather than increasing codelens to fit in the code space, we can decrease a symbol codelen somewhere to gain some total codelen.
Engel converges to K = one by (possibly) taking successively smaller overshooting steps, so K wiggles around the target, delta going positive & negative. This does not always converge, so a bail out to a simpler approach is needed. This overshooting lets it get to K by doing a combination of positive and negative steps (eg. 3 = 4 - 1 , not just 3 = 1 + 2), which is a little bit of a step towards Package Merge (the difference being that package merge finds the cheapest path to get the desired sum, while Engel's heuristic is greedy, taking the single cheapest step each time).
In practice this turns out to be much better than only taking non-overshooting steps.
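In code terms the change is just the acceptance test for a candidate step : rather than rejecting any step that pushes K below the target, accept a step (in either direction) whenever it strictly reduces the distance to the target. A small helper in the spirit of the sketch above :

#include <cstdint>

// Engel-style acceptance test : a candidate codelen change taking the
// fixed-point Kraft sum from K to K_after is allowed if it strictly
// reduces the distance to the target "one", even if it overshoots.
// The step search stays greedy on price ; only this test changes.
inline bool step_acceptable(uint64_t K, uint64_t K_after, uint64_t one)
{
    uint64_t dist_before = (K > one) ? (K - one) : (one - K);
    uint64_t dist_after  = (K_after > one) ? (K_after - one) : (one - K_after);
    return dist_after < dist_before;
}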
Time to look at the results on some real data :
"seven" test set, cut into 64k chunks, order 0 entropy coded
comparing code len to package merge (ideal)
The length in excess (percent) is :
excess percent = (codelen(heuristic) - codelen(packagemerge))*100/codelen(packagemerge)
Huffman then limit monotonic mean : 0.027% max : 2.512%
Huffman then limit overshooting mean : 0.002% max : 0.229%
Engel coding mean : 0.016% max : 10.712%
(codelen limit of 12, 256 symbol byte alphabet)
The heuristic (overshooting) limit is really very good, extremely close to package merge and even the maximum excess len is
small. Engel coding (non-Huffman initial code lengths) works fine on average but does have (very rare) bad
cases. This is not surprising; there's a reason we use Huffman's algorithm to get the code lengths right.
In that bad 10.71% excess case, the package merge average code len is 1.604 but Engel coding produces an average code length of 1.776
Note that many of the blocks in this test did not hit the codelen limit; in that case "Huffman then limit" produces the best possible codelens, but Engel coding might not.
For most applications, it's probably best to make the true Huffman code and then limit the lengths with a heuristic. The time saved from the approximate initial code lengths is pretty small compared to other operations needed to do entropy coding (histogramming for example is annoyingly slow). Nevertheless I found this technique to be an interesting reminder to keep an open mind about approximations and understand where our algorithms came from and why we use the ones we do.
Another thing I find interesting is how Engel Coding points back to Package Merge again.
First there's the fact that you can start Engel Coding with just all the codelens set at 1 , and let the Kraft fixup make the codelens. That's how Package Merge works. It starts all codelens at 1, and then increments the cheapest ones until it gets up to Kraft. The log2ish starting guess for Engel Coding is just a way of jump-starting the codelens closer to the final answer to avoid lots of steps.
Engel Coding's overshooting heuristic improves on the monotonic heuristic by allowing you to take some +- steps. That is, increment one codelen and decrement another. In particular it can do things like : rather than increment a len 3 codelen , instead increment a len 2 codelen and decrement a len 3 codelen. This is the kind of move that you need to make to get to real optimal code lens.
The key missing element is considering the costs of all possible steps and finding a path to the desired K. That is, Engel coding takes greedy (locally cheapest price) steps, which may not give the optimal path overall. The way to turn this greedy algorithm into an optimal one is dynamic programming. Lo and behold, that's what Package Merge is.
The Android APK package is just a zip (thanks to them for just using zip and not changing the header so that it can be easily manipulated with standard tools).
I chose the list of games from this article :
Google's instant app tech now lets you try games before you buy
which is :
Clash Royale, Words With Friends 2, Bubble Witch 3 Saga, Final Fantasy XV: A New Empire, Mighty Battles and -- of course -- Solitaire
I discovered that "Mighty Battles" internally contains a large pre-compressed pak file. (it's named "content.mp3" but is not really an mp3, it's some sort of compressed archive. They use the mp3 extension to get the APK package system to store it without further zlib compression.) Because of that I excluded Mighty Battles from the test; it would be about the same size with every compressor, and is not reflective of how it should be compressed (their internal package should be uncompressed if we're testing how well the outer compressor does). Later I also saw that "Clash Royale" is also mostly pre-compressed content. Clash Royale has its content in ".sc" files that are opaque compressed data. I left it in the test set, but it should also have those files uncompressed for real use with an outer compressor. I wasn't sure which Solitaire to test; I chose the one by Zynga.
The "tar" is made by unpacking the APK zip and concatenating all the files together. I also applied PNGz0 to turn off zlib compression on any PNGs. I then tested various compressors on the game tars.
| game | original | tar | zlib | Leviathan |
| BubbleWitch3 | 78,032,875 | 304,736,621 | 67,311,666 | 54,443,823 |
| ClashRoyale | 101,702,690 | 124,031,098 | 98,386,824 | 93,026,161 |
| FinalFantasyXV | 58,933,554 | 144,668,500 | 57,104,802 | 41,093,459 |
| Solitaire | 14,814,888 | 139,177,140 | 14,071,999 | 8,337,863 |
| WordsWithFriends2 | 78,992,339 | 570,621,614 | 78,784,623 | 53,413,494 |
| total | 332,476,346 | 1,283,234,973 | 315,659,914 | 250,314,800 |
original = size of the source APK (per-file zip with some files stored uncompressed)
tar = unzipped files, with PNGz0, concatenated together
zlib = zip -9 applied to the tar ; slightly smaller than original
Leviathan = Oodle Leviathan level 8 (Optimal4) applied to the tar
You can see that Clash Royale doesn't change much because it contains large amounts of pre-compressed data internally. The other games all get much smaller with Leviathan on a tar (relative to the original APK, or zlib on the tar). eg. BubbleWitch3 was 78 MB, Leviathan can send it in 54.4 MB ; Solitaire can be sent in almost half the size.
Leviathan is very fast to decode on ARM. Despite getting much more compression than zlib, it is faster to decode. More modern competitors (ZStd, brotli, LZMA) are also slower to decode than Leviathan on ARM, and get less compression.
For reference, here is the performance on this test set of a few compressors (speeds on Windows x64 Core i7-3770) :
Note that some of the wins here are not accessible to game developers. When a mobile game developer uses Oodle on Android, they can apply Oodle to their own content and get the size and time savings there. But they can't apply Oodle to their executable or Java files. The smaller they reduce their content, the larger the proportion of their APK becomes that is made up of files they can't compress. To compress all the content in the APK (both system and client files, as well as cross-file tar compression) requires support from the OS or transport layer.
I'll also take this chance to remind clients that when using Oodle, you should always try to turn off any previous compression on your data. For example, here we didn't just try the compressors directly on the APK files (which are zip archives and have previous zlib compression), we first unpacked them. We then further took the zlib compression off the PNGs so that the outer compressors in the test could have a chance to compress that data better. The internal compressors used on Clash Royale and Mighty Battles should ideally also have been turned off to maximize compression. On the other hand, turning off previous compression does not apply to data-specific lossy compressors such as audio, image, and video codecs. That type of data should be passed through with no further compression.
(by "block compressed textures" I mean BCn, ETC1, ETC2, etc. textures in fixed size blocks for use with GPU's. I do *not* mean already compressed textures such as JPEG, PNG, or BCn that has already been compressed with crunch. You should not be applying Oodle or any other generic compressor on top of already compressed textures of that type. If you have a lot of PNG data consider PNG without ZLib or look for the upcoming Oodle Lossless Image codec.)
See the Appendix at the bottom for a comparison of modern LZ compressors on BCn data. Oodle LZ gets more compression and/or much faster decode speeds on BCn data.
So you can certainly just create your texture as usual (at maximum quality) and compress it with Oodle. That's fine and gives you the best visual quality.
If you need your texture data to be smaller for some reason, you can use a data-specific lossy compressor like crunch (or Basis), or you could use RDO texture creation followed by Oodle LZ compression.
(I've written about this before, here : Improving the compression of block-compressed textures , but I'm trying to do a rather cleaner more thorough job this time).
RDO texture creation is a modification of the step that creates the block compressed texture (BCn or whatever) from the original (RGBA32 or whatever). Instead of simply choosing the compressed texture blocks that minimize error, blocks are chosen to minimize rate + distortion. That is, sometimes larger error is intentionally chosen when it improves rate. In this case, we want to minimize the rate *after* a following LZ compressor. The block compressed textures always have the same size, but some choices are more compressible than others. The basic idea is to choose blocks that have some relation to preceding blocks, thereby making them more compressible. Common examples are trying to reuse selector bits, or to choose endpoints that match neighbors.
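As a minimal sketch of the core decision only (not crunch's or Oodle's actual algorithm; every name here is hypothetical) : among the candidate encodings for one block, pick the one minimizing distortion + lambda * estimated rate, where the rate estimate rewards choices the following LZ can exploit, such as reusing the previous block's endpoints or selectors :

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical candidate encoding of one BCn block : its bytes, its
// distortion vs the original pixels, and an estimated post-LZ rate
// (eg. cheaper if endpoints/selectors match the previous block).
struct BlockCandidate
{
    uint8_t bytes[8];     // a BC1 block is 8 bytes
    double  distortion;   // error vs the original RGBA pixels
    double  est_rate;     // estimated compressed bits after the LZ stage
};

// Choose the candidate minimizing D + lambda * R.  lambda = 0 gives the
// usual max-quality encoder; larger lambda trades error for compressibility.
// Assumes cands is non-empty.
size_t pick_rdo_block(const std::vector<BlockCandidate> & cands, double lambda)
{
    size_t best = 0;
    double best_score = cands[0].distortion + lambda * cands[0].est_rate;
    for (size_t i = 1; i < cands.size(); i++)
    {
        double score = cands[i].distortion + lambda * cands[i].est_rate;
        if (score < best_score) { best = i; best_score = score; }
    }
    return best;
}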
RDO encoding of block compressed textures should always be done from the original non-compressed version of the texture, *not* from a previous block compressed encoding. eg. don't take something already in BC1 and try to run RDO to shrink it further. Doing that would cause the errors to add up, a bit like taking a JPEG and lowering its "quality" setting to make it smaller - that should always be done from original data.
Now, block compressed textures are already lossy. BC1 is quite bad; BC7 and ASTC less so. So adding more error may not be acceptable at all. If large amounts of error are acceptable in your texture, that may be a sign you are never actually seeing its largest mip levels. Sending mip levels that are too large and never visible is a *far* larger waste of size than anything we do here, so it's important to have a process in your game to find those textures and shrink them.
The best tool I know of at the moment to do RDO texture creation is
crunch by Rich Geldreich / Binomial.
I'm told that their newer Basis product has an improved RDO-for-LZ but I don't have a copy to test.
What I actually run is
Unity's improvement to crunch.
The way you use it is something like :
crunch_x64_unity.exe -DXT1 -fileformat dds -file input.png -maxmips 1 -quality 200 -out output.dds
That is, tell it to make fileformat DDS, it will do normal block texture compression, but with rate-favoring decisions.
NOTE : we're talking about lossy compression here, which is always a little subtle to investigate because there are two axes of performance : both size and quality. Furthermore "quality" is hard to measure well, and there is no substitute for human eyes examining the images to decide what level of loss is acceptable. Here I am reporting "imdiff" scores with my "combo" metric. The "imdiff" scores are not like an RMSE; they're roughly on a scale of 0-100 where 0 = no difference and 100 = complete garbage, like a percent difference (though not really).
Some results :
act3cracked_colour
1024x1024
non-RDO fast BC1 : 524,416 bytes
then Leviathan : -> 416,981
imdiff : 33.26
crunch RDO quality 200 , then Leviathan : -> 354,203
imdiff : 36.15
file size 85% of non-RDO
error 109% of non-RDO
adventurer_colour
1024x1024
non-RDO fast BC1 : 524,416 bytes
then Leviathan : -> 409,874
imdiff : 32.96
crunch RDO quality 200 , then Leviathan : -> 334,342
imdiff : 33.48
file size 81% of non-RDO
error 102% of non-RDO
Personally I like crunch's RDO DDS at these high quality levels, 200 or above. It introduces relatively little
error and the file size savings are still significant.
At lower quality levels use of crunch can be problematic in practice. Unfortunately it's hard to control how much error it introduces. You either have to manually inspect textures for damage, or run an external process to measure quality and feed that back into the crunch settings. Another problem is that crunch's quality setting doesn't scale with texture size; smaller size textures get less error and larger size textures get more error at the same "quality" setting, which means you need to choose a quality setting per texture size. (I think the issue is that crunch's codebook size doesn't scale with texture size, which makes it particularly bad for textures at 2048x2048 or above, or for large texture atlases).
Your other option besides doing RDO texture creation followed by LZ is to just use crunch's native "crn" format for textures.
Let's compare RDO+LZ vs crn for size. I will do this by dialing the quality setting until they get the same
imdiff "combo" score, so we are comparing a line of equal distortion (under one metric).
act3cracked_colour
1024x1024
crunch crn 255 : -> 211,465
imdiff : 42.33
crunch rdo dds 95 : -> 264,206
imdiff : 42.36
adventurer_colour
1024x1024
crunch crn 255 : -> 197,644
imdiff : 38.48
crunch rdo dds 101 : -> 244,402
imdiff : 38.67
The native "crn" format is about 20% smaller than RDO + LZ on both of these textures. It is to be expected that custom compressors, well designed for one type of data, should beat general purpose compressors. Note that comparing "crn" sizes to just doing BCn + LZ (without RDO) is not a valid comparison, since they are at different error levels.
If you look at the quality settings, the "crn" mode at maximum quality is still introducing a lot of error. That "quality" setting is not on the same scale for crn mode and dds mode. Maximum quality (255) in crn mode is roughly equivalent to quality = 100 in dds mode. Unfortunately there seems to be no way to get higher quality in the crn mode.
This has been an attempt to provide some facts to help you make a good choice. There are three options : BCn (without RDO) + LZ , RDO BCn + LZ, or a custom compressed texture format like CRN. They have different pros and cons and the right choice depends on your app and pipeline.
Now we haven't looked at decode speed in this comparison. I've never measured crunch's decode speed (of CRN format), but I suspect that Oodle's LZ decoders are significantly faster. Another possible speed advantage for Oodle LZ is that you can store your BCn data pre-swizzled for the hardware, which may let you avoid more CPU work. I should also note that you should never LZ decompress directly into uncached graphics memory. You either need to copy it over after decoding (which is very fast and recommended) or start the memory as cached for LZ decoding and then change it to uncached GPU memory after the decode is done.
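A sketch of that staging pattern (decode_lz and gpu_dest are hypothetical stand-ins; the point is only decode-into-cached-memory, then one copy) :

#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical decode stand-in; in practice this would be your LZ decode call.
size_t decode_lz(const void * comp, size_t compLen, void * raw, size_t rawLen);

// Recommended pattern : decode into ordinary cached memory, then copy the
// result into the uncached / write-combined GPU destination in one memcpy.
// Decoding directly into uncached memory is very slow.
void decode_then_upload(const void * comp, size_t compLen,
                        size_t rawLen, void * gpu_dest)
{
    std::vector<unsigned char> staging(rawLen);          // cached CPU buffer
    decode_lz(comp, compLen, staging.data(), rawLen);    // LZ decode at full speed
    std::memcpy(gpu_dest, staging.data(), rawLen);       // single fast copy to GPU memory
}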
Comparison of Oodle to some other compressors on samples of game texture data.
Repeating the "Game BCn" test from Oodle 2.6.0 : Leviathan detailed performance report : A mix of BCn textures from a game (mostly BC1, BC4, and BC7) :
"Game BCn" :
lzma 16.04 -9 : 3.692 to 1 : 64.85 MB/s
brotli24 2017-12-12 -11 : 3.380 to 1 : 237.78 MB/s
zlib 1.2.11 -9 : 2.720 to 1 : 282.78 MB/s
zstd 1.3.3 -22 : 3.170 to 1 : 485.97 MB/s
Kraken8 : 3.673 to 1 : 880.99 MB/s
Leviathan8 : 3.844 to 1 : 661.93 MB/s
A different set : "test_data\image\dds" is mostly BC1 with some BC3 and some RGBA32
test_data\image\dds :
lzma 16.04 -9 : 2.354 to 1 : 39.53 MB/s
brotli24 2017-12-12 -11 : 2.161 to 1 : 161.40 MB/s
zlib 1.2.11 -9 : 1.894 to 1 : 222.70 MB/s
zstd 1.3.3 -22 : 2.066 to 1 : 443.96 MB/s
Kraken8 : 2.320 to 1 : 779.84 MB/s
Leviathan8 : 2.386 to 1 : 540.90 MB/s
(note this is lzma with default settings; lzma with settings tweaked for BCn can sometimes get more compression than Leviathan)
While brotli and ZStd are competitive with Kraken's compression ratio on text (and text-like) files, they lag behind on many types of binary data, such as these block compressed textures.
Once installed in your source tree, the Oodle Data Compression integration is transparent. Simply create compressed pak files the way you usually do, and instead of compressing with zlib they will be compressed with Oodle. At runtime, the engine automatically decodes with Oodle or zlib as specified in the pak.
Oodle can compress game data much smaller than zlib. Oodle also decodes faster than zlib. With less data to load from disk and faster decompression, you speed up data loading in two ways.
For example, on the ShooterGame sample game, the pak file sizes are :
ShooterGame-WindowsNoEditor.pak
uncompressed : 1,131,939,544 bytes
Unreal default zlib : 417,951,648 2.708 : 1
Oodle Kraken 4 : 372,845,225 3.036 : 1
Oodle Leviathan 8 : 330,963,205 3.420 : 1
Oodle Leviathan makes the ShooterGame pak file 87 MB smaller than the Unreal default zlib compression!
NOTE this is new and separate from the Oodle Network integration which has been in Unreal for some time. Oodle Network provides compression of network packets to reduce bandwidth in networked games.
The Oodle Data Compression integration is provided for Unreal 4.18 and 4.19. It will also work in other versions, but may require some modification of source code to integrate the diffs.
As before, this is Windows x64 on a Core i7-3770, and we're looking at compression ratio vs. decode speed, not considering encode speed, and running all compressors in their highest compression level.
BTW the reason I include zlib in all these tests is not because I imagine anyone should really be comparing against zlib. It's because zlib is the codec everyone knows and has run themselves, so it serves as a common reference point for calibrating these results against other tests and machines.
On the "seven" testset, compressing each file independently, then summing decode time and size for each file :
The loglog chart shows log compression ratio on the Y axis and log decode speed on the X axis (the numeric labels show the pre-log values). The upper right is the Pareto frontier.
The raw numbers are :
total : lzma 16.04 -9 : 3.186 to 1 : 52.84 MB/s
total : lzham 1.0 -d26 -4 : 2.932 to 1 : 149.57 MB/s
total : brotli24 2017-12-12 -11 : 2.958 to 1 : 203.91 MB/s
total : zlib 1.2.11 -9 : 2.336 to 1 : 271.61 MB/s
total : zstd 1.3.3 -22 : 2.750 to 1 : 474.47 MB/s
total : lz4hc 1.8.0 -12 : 2.006 to 1 : 2786.78 MB/s
total : Leviathan8 : 3.251 to 1 : 675.92 MB/s
total : Kraken8 : 3.097 to 1 : 983.74 MB/s
total : Mermaid8 : 2.846 to 1 : 1713.53 MB/s
total : Selkie8 : 2.193 to 1 : 3682.88 MB/s
Isolating just Kraken, Leviathan, and ZStd on the same test (ZStd is the closest non-Oodle codec), we can look at file-by-file performance :
The loglog shows each file with a different symbol, colored by the compressor.
The speed advantage of Oodle is pretty uniform, but the compression advantage varies by file type. Some files simply have more air that Oodle can squeeze out beyond what other LZ's find. For example if you only looked at enwik7 (xml/text), then all of the modern LZ's considered here (ZStd,Oodle,Brotli,LZHAM) would get almost exactly the same compression; there's just not a lot of room for them to differentiate themselves on text.
Runs on a couple other standard files :
On the Silesia/Mozilla file :
mozilla : lzma 16.04 -9 : 3.832 to 1 : 63.79 MB/s
mozilla : lzham 1.0 -d26 -4 : 3.570 to 1 : 168.96 MB/s
mozilla : brotli24 2017-12-12 -11 : 3.601 to 1 : 246.68 MB/s
mozilla : zlib 1.2.11 -9 : 2.690 to 1 : 275.40 MB/s
mozilla : zstd 1.3.3 -22 : 3.365 to 1 : 503.44 MB/s
mozilla : lz4hc 1.8.0 -12 : 2.327 to 1 : 2509.82 MB/s
mozilla : Leviathan8 : 3.831 to 1 : 691.37 MB/s
mozilla : Kraken8 : 3.740 to 1 : 985.35 MB/s
mozilla : Mermaid8 : 3.335 to 1 : 1834.49 MB/s
mozilla : Selkie8 : 2.793 to 1 : 3145.63 MB/s
On win81 :
win81 : lzma 16.04 -9 : 2.922 to 1 : 51.87 MB/s
win81 : lzham 1.0 -d26 -4 : 2.774 to 1 : 156.44 MB/s
win81 : brotli24 2017-12-12 -11 : 2.815 to 1 : 214.02 MB/s
win81 : zlib 1.2.11 -9 : 2.207 to 1 : 253.68 MB/s
win81 : zstd 1.3.3 -22 : 2.702 to 1 : 472.39 MB/s
win81 : lz4hc 1.8.0 -12 : 1.923 to 1 : 2408.91 MB/s
win81 : Leviathan8 : 2.959 to 1 : 757.36 MB/s
win81 : Kraken8 : 2.860 to 1 : 949.07 MB/s
win81 : Mermaid8 : 2.618 to 1 : 1847.77 MB/s
win81 : Selkie8 : 2.142 to 1 : 3467.36 MB/s
And again on the "seven" testset, but this time with the files cut into 32 kB chunks :
total : lzma 16.04 -9 : 2.656 to 1 : 43.25 MB/s
total : lzham 1.0 -d26 -4 : 2.435 to 1 : 76.36 MB/s
total : brotli24 2017-12-12 -11 : 2.581 to 1 : 178.25 MB/s
total : zlib 1.2.11 -9 : 2.259 to 1 : 255.18 MB/s
total : zstd 1.3.3 -22 : 2.363 to 1 : 442.42 MB/s
total : lz4hc 1.8.0 -12 : 1.849 to 1 : 2717.30 MB/s
total : Leviathan8 : 2.731 to 1 : 650.23 MB/s
total : Kraken8 : 2.615 to 1 : 975.69 MB/s
total : Mermaid8 : 2.455 to 1 : 1625.69 MB/s
total : Selkie8 : 1.910 to 1 : 4097.27 MB/s
By cutting into 32 kB chunks, we remove the window size disadvantage suffered by zlib and LZ4. Now all the codecs have the same match window, and the compression difference only comes from what additional features they provide. The small chunk also stresses startup overhead time and adaptation speed.
The Oodle codecs generally do even better (vs the competition) on small chunks than they do on large files. For example LZMA and LZHAM both have large models that really need a lot of data to get up to speed. All the non-Oodle codecs slow down more on small chunks than the Oodle codecs do.
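The chunked methodology itself is trivial to reproduce; a sketch (compress_chunk is a hypothetical stand-in for whichever codec is being measured) :

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>

// Cut the input into independent 32 kB chunks, compress each on its own
// (no shared window or history between chunks), and sum the compressed sizes.
size_t chunked_compressed_size(const uint8_t * data, size_t len,
                               const std::function<size_t(const uint8_t *, size_t)> & compress_chunk,
                               size_t chunk_size = 32 * 1024)
{
    size_t total = 0;
    for (size_t pos = 0; pos < len; pos += chunk_size)
    {
        size_t this_chunk = std::min(chunk_size, len - pos);
        total += compress_chunk(data + pos, this_chunk);
    }
    return total;
}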
Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :
Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility
or visit RAD to read more about the Oodle SDK
One of the unique things about Oodle is the fact that its compressors are optimizing for a space-speed goal (not just for size), and the user has control over how that combined score is priced.
This is a crucial aspect of Leviathan. Oodle Leviathan considers many options for the encoded data, it rates those options for space-speed and chooses the bit stream that optimizes that goal. This means that Leviathan can consider slower-to-decode bit stream options, and only use them when they provide enough of a benefit that they are worth it.
That is, Leviathan decompression speed varies a bit depending on the file, as all codecs do. However, other codecs will sometimes get slower for no particular benefit. They may choose a slower mode, or simply take a lot more slow-to-decode encoding options (short matches, or frequent literal-match switches), even if it isn't a big benefit to file size. Leviathan only chooses slower encoding modes when they provide a benefit to file size that meets the goal the user has set for space-speed tradeoff.
Each of the new Oodle codecs (Leviathan, Kraken, Mermaid & Selkie) has a different default space-speed goal, which we set to be near their "natural lambda", the place that their fundamental structure works best. Clients can dial this up or down to bias their decisions more for size or speed.
The flexibility of these Oodle codecs allows them to cover a nearly continuous range of compression ratio vs decode speed points.
Here's an example of the Oodle codecs targeting a range of space-speed goals on the file "TC_Foreground_Arc.clb" (Windows x64):
Now you may notice that at the highest compression setting of Kraken (-zs64) it is strictly worse than the fastest setting of Leviathan (-zs1024). If you want that speed-compression tradeoff point, then using Kraken is strictly wrong - you should switch to Leviathan there.
That's what Oodle Hydra does for you. Hydra (the many headed beast) is a meta compressor which chooses between the other Oodle compressors to make the best space-speed encoding. Hydra does not just choose per file, but per block, which means it can do finer grain switching to find superior encodings.
Oodle Hydra on the file "TC_Foreground_Arc.clb" (Windows x64):
When using Hydra you don't explicitly choose the encoder at all. You set your space-speed goal and you trust in Oodle to make the choice for you. It may use Leviathan, Kraken, or Mermaid, so you may get faster or slower decoding on any given chunk, but you do know that when it chooses a slower decoder it was worth it. Hydra also sometimes gets free wins; if you wanted high compression and so would've gone with Leviathan, there are cases where Kraken compresses nearly the same (or even does better), and switching down to Kraken is just a decode speed win for free (no compression ratio sacrificed).
Of course the disadvantage of Hydra is slow encoding, because it has to consider even more encoding options than Oodle already does. It is ideal for distribution scenarios where you encode once and decode many times.
Another way we can demonstrate Oodle's space-speed encoding is to disable it.
We can run Oodle in "max compression only" mode by setting the price of time to zero. That is, when it scores a decision for space-speed, we consider only size. (I never actually set the price of time to exactly zero; it's better to just make it very small so that ties in size are still broken in favor of speed; specifically we will set spaceSpeedTradeoffBytes = 1).
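For reference, this is a one-field change on the compress options. A rough sketch only : the struct and getter names below are my assumption of what is in oodle2.h and should be verified against your SDK headers; only the spaceSpeedTradeoffBytes field and the values 256 / 1 come from this post.

// ASSUMED names - verify against your oodle2.h ; only spaceSpeedTradeoffBytes
// and its values are taken from this post.
OodleLZ_CompressOptions opts = *OodleLZ_CompressOptions_GetDefault();
opts.spaceSpeedTradeoffBytes = 1;   // "max compress" : time priced near zero
                                    // (default is 256 ; ties in size still break toward speed)
// then pass &opts to the usual OodleLZ compress call (Leviathan here).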
Here's Oodle Leviathan on the Silesia standard test corpus :
Leviathan default space-speed score (spaceSpeedTradeoffBytes = 256) :
total : 211,938,580 ->48,735,197 = 1.840 bpb = 4.349 to 1
decode : 264.501 millis, 4.25 c/B, rate= 801.28 MB/s
Leviathan max compress (spaceSpeedTradeoffBytes = 1) :
total : 211,938,580 ->48,592,540 = 1.834 bpb = 4.362 to 1
decode : 295.671 millis, 4.75 c/B, rate= 716.81 MB/s
By ignoring speed in decisions, we've gained 140 kB , but have lost 30 milliseconds in decode time.
Perhaps a better way to look at it is the other way around - by making good space-speed decisions, the default Leviathan setting saves 0.50 cycles per byte in decode time, at a cost of only 0.006 bits per byte of compressed size.
Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :
Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility
or visit RAD to read more about the Oodle SDK
Let's have a look at a few concrete tests of Leviathan's performance.
All of the tests in this post are run on Windows, x64, on a Core i7-3770. For performance on some other game platforms, see : Leviathan performance on PS4, Xbox One, and Switch
All the compressors in this test are run at their slowest encoding level. We're looking for high compression and fast decompression; we're not looking at encode speed here. See Everything new and tasty in Oodle 2.6.0 for a look at encode times.
Tests are on groups of files. The results show the total compressed size & total decode time over the group of files, but each file is encoded independently. The maximum window size is used for all compressors.
The compressors tested here are :
Kraken8 : Oodle Kraken at level 8 (Optimal4)
Leviathan8 : Oodle Leviathan at level 8 (Optimal4)
lzma_def9 : LZMA -mx9 with default settings (lc3lp0pb2fb32) +d29
lzmalp3pb3fb1289 : LZMA -mx9 with settings that are better on binary and game data (lc0lp3pb3fb128) +d29
zlib9 : zlib 1.2.8 at level 9
LZMA and zlib here were built with MSVC; I also checked their speed in a GCC build and confirmed it is nearly identical.
Two settings for LZMA are tested to try to give it the best chance of competing with Leviathan. LZMA's default settings tend to be good on text but not great on binary (often even Kraken can beat "lzma_def" on binary structured data). I've also increased LZMA's fast bytes to 128 in the non-default options; the default value of 32 is a bit of a detriment to compression ratio, and most of the competition (ZStd, Brotli, LZHAM) use a value more like 128. I want to give LZMA a great chance to compete; we don't need to play any games with selective testing to make Leviathan look good.
Let the tests begin!
Kraken8 : 4.02:1 , 1.0 enc MB/s , 1013.3 dec MB/s
Leviathan8 : 4.21:1 , 0.5 enc MB/s , 732.8 dec MB/s
zlib9 : 2.89:1 , 5.2 enc MB/s , 374.5 dec MB/s
lzmalp3pb3fb1289: 4.08:1 , 3.4 enc MB/s , 65.4 dec MB/s
lzma_def9 : 3.97:1 , 4.3 enc MB/s , 65.4 dec MB/s
Kraken8 : 3.67:1 , 0.7 enc MB/s , 881.0 dec MB/s
Leviathan8 : 3.84:1 , 0.4 enc MB/s , 661.9 dec MB/s
zlib9 : 2.72:1 , 6.2 enc MB/s , 321.5 dec MB/s
lzmalp3pb3fb1289: 3.67:1 , 2.0 enc MB/s , 66.9 dec MB/s
lzma_def9 : 3.68:1 , 2.3 enc MB/s , 67.0 dec MB/s
Kraken8 : 1.15:1 , 1.7 enc MB/s , 1289.6 dec MB/s
Leviathan8 : 1.17:1 , 1.0 enc MB/s , 675.8 dec MB/s
zlib9 : 1.10:1 , 18.9 enc MB/s , 233.8 dec MB/s
lzma_def9 : 1.17:1 , 6.2 enc MB/s , 21.4 dec MB/s
lzmalp3pb3fb1289: 1.19:1 , 6.1 enc MB/s , 21.4 dec MB/s
Kraken8 : 2.68:1 , 1.0 enc MB/s , 1256.2 dec MB/s
Leviathan8 : 2.83:1 , 0.6 enc MB/s , 915.0 dec MB/s
zlib9 : 1.99:1 , 8.3 enc MB/s , 332.4 dec MB/s
lzmalp3pb3fb1289: 2.76:1 , 3.4 enc MB/s , 46.5 dec MB/s
lzma_def9 : 2.72:1 , 4.2 enc MB/s , 46.3 dec MB/s
Kraken8 : 4.25:1 , 0.6 enc MB/s , 996.0 dec MB/s
Leviathan8 : 4.35:1 , 0.4 enc MB/s , 804.5 dec MB/s
zlib9 : 3.13:1 , 8.4 enc MB/s , 351.3 dec MB/s
lzmalp3pb3fb1289: 4.37:1 , 1.9 enc MB/s , 80.2 dec MB/s
lzma_def9 : 4.27:1 , 2.6 enc MB/s , 79.0 dec MB/s
brotli24 -11 : 4.21:1 , 310.9 dec MB/s
zstd 1.3.3 -22 : 4.01:1 , 598.2 dec MB/s
(zstd and brotli run in lzbench) (brotli is 2017-12-12)
Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :
Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility
or visit RAD to read more about the Oodle SDK
Oodle's speed on the Sony PS4 (and Microsoft Xbox One) and Nintendo Switch is superb. With the slower processors in these consoles (compared to a modern PC), the speed advantage of Oodle makes a big difference in total load time or CPU use.
These are run on the private test file "lzt99". I'm mainly looking at the speed numbers here, not the compression ratio (compression wise, we do so well on lzt99 that it's a bit silly, and also not entirely fair to the competition).
On the Nintendo Switch (clang ARM-A57 AArch64 1.02 GHz) :
Oodle 2.6.0 -z8 :
Leviathan : 2.780 to 1 : 205.50 MB/s
Kraken : 2.655 to 1 : 263.54 MB/s
Mermaid : 2.437 to 1 : 499.72 MB/s
Selkie : 1.904 to 1 : 957.60 MB/s
zlib from nn_deflate
zlib : 1.883 to 1 : 74.75 MB/s
And on the Sony PS4 (clang x64 AMD Jaguar 1.6 GHz) :
Oodle 2.6.0 -z8 :
Leviathan : 2.780 to 1 : 271.53 MB/s
Kraken : 2.655 to 1 : 342.49 MB/s
Mermaid : 2.437 to 1 : 669.34 MB/s
Selkie : 1.904 to 1 :1229.26 MB/s
non-Oodle reference (2016) :
brotli-11 : 2.512 to 1 : 77.84 MB/s
miniz : 1.883 to 1 : 85.65 MB/s
brotli-9 : 2.358 to 1 : 95.36 MB/s
zlib-ng : 1.877 to 1 : 109.30 MB/s
zstd : 2.374 to 1 : 133.50 MB/s
lz4hc-safe : 1.669 to 1 : 673.62 MB/s
LZSSE8 : 1.626 to 1 : 767.11 MB/s
The Microsoft XBox One has similar performance to the PS4. Mermaid & Selkie can decode faster than the
hardware DMA compression engine in the PS4 and Xbox One, and usually compress more if they aren't limited
to small chunks like the hardware DMA engine needs.
Note that the PS4 non-Oodle reference data is from my earlier runs back in 2016 : Oodle Mermaid and Selkie on PS4 and PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle . They should be considered only rough reference points; I imagine some of those codecs are slightly different now, but does even a 10 or 20 or 50% improvement really make much difference? (also note that there's no true zlib reference in that PS4 set; miniz is close but a little different, and zlib-ng is faster than standard zlib).
Leviathan is in a different compression class than any of the other options, and is still 2-3X faster than zlib.
Something I spotted while gathering the old numbers that I think is worth talking about:
If you look at the old Kraken PS4 numbers from
Oodle 2.3.0 : Kraken Improvement
you would see :
PS4 lzt99
old :
Oodle 2.3.0 -z6 : 2.477 to 1 : 389.28 MB/s
Oodle 2.3.0 -z7 : 2.537 to 1 : 363.70 MB/s
vs new :
Oodle 2.6.0 -z8 : 2.655 to 1 : 342.49 MB/s
(-z8 encode level didn't exist back then)
Oh no! Oodle's gotten slower to decode!
Well no, it hasn't. But this is a good example of how looking at just space or speed on their own can be misleading.
Oodle's encoders are always optimizing for a space-speed goal. There are a range of solutions to that problem which have nearly the same space-speed score, but have different sizes or speeds.
So part of what's happened here is that Oodle 2.6.0 is just hitting a slightly different spot in the space-speed solution space than Oodle 2.3.0 is. It's finding a bit stream that is smaller, and trades off some decode speed for that. With its space-speed cost model, it measures that tradeoff as being a good value. (the user can set the relative value of time & bytes that Oodle uses in its scoring via the spaceSpeedTradeoffBytes parameter).
But something else has also happened - Oodle 2.6.0 has just gotten much better. It hasn't just stepped along the Pareto curve to a different but equally good solution - it has stepped perpendicularly to the old Pareto curve and is finding better solutions.
At RAD we measure that using the "correct" Weissman score which provides a way of combining a space-speed point into a single number that can be used to tell whether you have made a real Pareto improvement or just a tangential step.
The easiest way to see that you have definitely made an improvement is to run Oodle 2.6.0 with a different
spaceSpeedTradeoffBytes price so that it provides a simpler relationship :
PS4 lzt99
new, with spaceSpeedTradeoffBytes = 1024
Oodle 2.6.0 -z8 : 2.495 to 1 : 445.56 MB/s
vs old :
Oodle 2.3.0 -z6 : 2.477 to 1 : 389.28 MB/s
Now we have higher compression and higher speed, so there's no question of whether we lost anything.
In general the Oodle 2.6.0 Kraken & Mermaid encoders are making decisions that slightly bias for higher compression (and slower decode; though often the decode speed is very close) than before 2.6.0. If you find you've lost a little decode speed and want it back, increase spaceSpeedTradeoffBytes (try 400).
Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :
Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility
or visit RAD to read more about the Oodle SDK
A quick run down of all the exciting new stuff in Oodle 2.6.0 :
1. Leviathan!
2. The Kraken, Mermaid & Selkie fast-level encoders are now much faster.
3. Kraken & Mermaid's optimal level encoders now get more compression.
4. Kraken & Mermaid have new bit stream options which allow them to reach even higher compression.
5. Kraken and Mermaid are now more tuneable to different compression ratios and decode speeds.
1. Leviathan!
See Leviathan detailed performance report
2. The Kraken, Mermaid & Selkie fast-level encoders are now much faster.
The non-optimal-parsing encoder levels that are intended for realtime use in Oodle are levels 1-4 aka SuperFast, VeryFast, Fast & Normal.
Their decode speed was always best in class, but previously their encode speed was slightly off the Pareto frontier
(the best possible trade off of encode speed vs compression ratio). No longer.
"win81" test file (Core i7-3770 Windows x64)
Oodle 255 :
Kraken1 : 2.33:1 , 83.3 enc MB/s , 911.8 dec MB/s
Kraken2 : 2.39:1 , 68.5 enc MB/s , 938.0 dec MB/s
Kraken3 : 2.51:1 , 24.4 enc MB/s , 1005.5 dec MB/s
Kraken4 : 2.57:1 , 17.3 enc MB/s , 997.3 dec MB/s
Oodle 260 :
Kraken1 : 2.33:1 , 135.1 enc MB/s , 928.1 dec MB/s
Kraken2 : 2.41:1 , 94.0 enc MB/s , 937.4 dec MB/s
Kraken3 : 2.52:1 , 38.3 enc MB/s , 1020.6 dec MB/s
Kraken4 : 2.60:1 , 23.0 enc MB/s , 1022.9 dec MB/s
Oodle 255 :
Mermaid1 : 2.10:1 , 106.4 enc MB/s , 2079.4 dec MB/s
Mermaid2 : 2.15:1 , 78.9 enc MB/s , 2161.1 dec MB/s
Mermaid3 : 2.24:1 , 26.4 enc MB/s , 2294.0 dec MB/s
Mermaid4 : 2.29:1 , 26.6 enc MB/s , 2341.1 dec MB/s
Oodle 260 :
Mermaid1 : 2.14:1 , 161.1 enc MB/s , 2012.1 dec MB/s
Mermaid2 : 2.22:1 , 104.1 enc MB/s , 2007.7 dec MB/s
Mermaid3 : 2.29:1 , 39.0 enc MB/s , 2181.7 dec MB/s
Mermaid4 : 2.32:1 , 32.1 enc MB/s , 2294.0 dec MB/s
Oodle 255 :
Selkie1 : 1.76:1 , 146.8 enc MB/s , 3645.0 dec MB/s
Selkie2 : 1.85:1 , 100.5 enc MB/s , 3565.0 dec MB/s
Selkie3 : 1.98:1 , 28.2 enc MB/s , 3533.2 dec MB/s
Selkie4 : 2.04:1 , 28.9 enc MB/s , 3675.9 dec MB/s
Oodle 260 :
Selkie1 : 1.78:1 , 181.7 enc MB/s , 3716.0 dec MB/s
Selkie2 : 1.87:1 , 114.7 enc MB/s , 3595.7 dec MB/s
Selkie3 : 1.98:1 , 42.1 enc MB/s , 3653.1 dec MB/s
Selkie4 : 2.02:1 , 34.9 enc MB/s , 3818.3 dec MB/s
The speed of the fastest encoder (level 1 = "SuperFast") is up by about 60% in Kraken & Mermaid.
Kraken's encode speed vs ratio is now competitive with ZStd, which has long been the best codec for
encode speed tradeoff.
For example, matching Kraken1 to the closest comparable ZStd levels on the same machine :
at similar encode speed, you can compare the compression ratios :
Kraken1 : 2.33:1 , 135.1 enc MB/s , 928.1 dec MB/s
zstd 1.3.3 -4 : 2.22:1 , 136 enc MB/s , 595 dec MB/s
or you can look at equal file sizes and compare the encode speed :
Kraken1 : 2.33:1 , 135.1 enc MB/s , 928.1 dec MB/s
zstd 1.3.3 -6 : 2.33:1 , 62 enc MB/s , 595 dec MB/s
Of course ZStd does have faster encode levels (1-3); Oodle does not provide anything in that domain.
3. Kraken & Mermaid's optimal level encoders now get more compression. (even with 2.5 compatible bit streams)
We improved the ability of the optimal parse encoders to make good decisions and find smaller encodings.
This is at level 8 (Optimal4) our maximum compression level with slow encoding.
PD3D : (public domain 3D game data test set)
Kraken8 255 : 3.67:1 , 2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5: 3.72:1 , 1.2 enc MB/s , 1079.9 dec MB/s
GTS : (private game data test set)
Kraken8 255 : 2.60:1 , 2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5: 2.63:1 , 1.2 enc MB/s , 1343.8 dec MB/s
Silesia : (standard Silesia compression corpus)
Kraken8 255 : 4.12:1 , 1.4 enc MB/s , 982.0 dec MB/s
Kraken8 260 -v5: 4.18:1 , 0.6 enc MB/s , 1018.7 dec MB/s
(speeds on Core i7-3770 Windows x64)
(-v5 means encode in v5 (2.5.x) backwards compatibility mode)
Compression ratio improvements around 1% might not sound like much, but when you're already on the Pareto frontier,
finding another 1% without sacrificing any decode speed or changing the bit stream is quite significant.
4. Kraken & Mermaid have new bit stream options which allow them to reach even higher compression.
PD3D :
Kraken8 255 : 3.67:1 , 2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5: 3.72:1 , 1.2 enc MB/s , 1079.9 dec MB/s
Kraken8 260 : 4.00:1 , 1.0 enc MB/s , 1034.7 dec MB/s
GTS :
Kraken8 255 : 2.60:1 , 2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5: 2.63:1 , 1.2 enc MB/s , 1343.8 dec MB/s
Kraken8 260 : 2.67:1 , 1.0 enc MB/s , 1282.3 dec MB/s
Silesia :
Kraken8 255 : 4.12:1 , 1.4 enc MB/s , 982.0 dec MB/s
Kraken8 260 -v5: 4.18:1 , 0.6 enc MB/s , 1018.7 dec MB/s
Kraken8 260 : 4.24:1 , 0.6 enc MB/s , 985.4 dec MB/s
Kraken in Oodle 2.6.0 now gets Silesia to 50,006,565 bytes at the default space-speed tradeoff target.
Kraken in max-compression space-speed setting gets Silesia to 49,571,429 bytes (and is still far faster
to decode than anything close).
If we look back at where Kraken started in April of 2016 , it was getting 4.05 to 1 on Silesia , now 4.24 to 1.
Kraken now usually gets more compression than anything remotely close to its decode speed.
Looking back at the old
Performance of Oodle Kraken ,
Kraken only got 2.70:1 on win81. On some files, Kraken has always out-compressed the competition, but win81 was one
where it lagged. It does better now :
"win81" test file (Core i7-3770 Windows x64)
in order of decreasing decode speed :
Kraken8 : 2.86:1 , 0.6 enc MB/s , 950.9 dec MB/s
old Kraken 215: 2.70:1 , 1.0 enc MB/s , 877.0 dec MB/s
Leviathan8 : 2.96:1 , 0.4 enc MB/s , 758.1 dec MB/s
zstd 1.3.3 -22 3.35 enc MB/s 473 dec MB/s 38804423 37.01 win81 = 2.702:1
zlib 1.2.11 -9 8.59 enc MB/s 254 dec MB/s 47516720 45.32 win81 = 2.206:1
brotli24 2017-12-12 -11 0.39 enc MB/s 214 dec MB/s 37246857 35.52 win81 = 2.815:1
lzham 1.0 -d26 -4 1.50 enc MB/s 158 dec MB/s 37794766 36.04 win81 = 2.775:1
lzma 16.04 -9 2.75 enc MB/s 51 dec MB/s 35890919 34.23 win81 = 2.921:1
At the time of Kraken's release, it was a huge decode speed win vs comparable compressors, but it sometimes
lagged a bit in compression ratio. No longer.
NOTE : Oodle 2.6.0 by default makes bit streams that are decodable by version >= 2.6.0 only. If you need bit streams that can be read by earlier versions, you must set the backward compatible version number that you need. See the Oodle FAQ on backward compatibility.
5. Kraken and Mermaid are now more tuneable to different compression ratios and decode speeds.
The new v6 bit stream has more options, which allows them to smoothly trade off compression ratio for decode speed. The user can set this goal with a space-speed tradeoff parameter.
All the Oodle codecs have a compression level setting (similar to the familiar zip 1-9 level) that trades encode time for compression ratio. Unlike many other codecs, Oodle's compressors do not lose *decode* speed at higher encode effort levels. We are not finding more compact encodings by making the decoder slower. Instead you can dial decode speed vs ratio with a separate parameter that changes how the encoder scores decisions.
See Oodle Hydra and space-speed flexibility
Almost two years ago, we released Oodle Kraken. Kraken roared onto the scene with high compression ratios and crazy fast decompression (over 1 GB/s on a Core i7 3770). The performance of Kraken immediately made lots of codecs obsolete. We could also see that something was wrong in the higher compression domain.
Kraken gets high compression (much more than zlib; usually more than things like RAR, ZStd and Brotli), but it gets a little less than 7z/LZMA. But to get that small step up in compression ratio, you had to accept a 20X decode speed penalty.
For example, on the "seven" testset (a private test set with a variety of data types) :
LZMA-9 : 3.18:1 , 3.0 enc MB/s , 53.8 dec MB/s
Kraken-7 : 3.09:1 , 1.3 enc MB/s , 948.5 dec MB/s
zlib-9 : 2.33:1 , 7.7 enc MB/s , 309.0 dec MB/s
LZMA gets 3% more compression than Kraken, but decodes almost 20X slower.
That's not the right price to pay for that compression gain. There had to be a better way to
take Kraken's great space-speed performance and extend it into higher compression without giving up
so much speed.
It's easy to see that there was a big gap in a plot of decode speed vs compression ratio :
(this is a log-log plot on the seven test set, on a Core i7 3770. We're not looking at encode speed here at all, we're running the compressors in their max compression mode and we care about the tradeoff of size vs decode speed.)
We've spent the last year searching in that mystery zone and we have found the great beast that lives there and fills the gap : Leviathan.
Leviathan gets to high compression ratios with the correct amount of speed decrease from Kraken (about 33%).
This means that even with better than LZMA ratios on the seven test set, it is still over 2X faster to
decode than zlib :
Leviathan-7: 3.23:1 , 0.9 enc MB/s , 642.4 dec MB/s
LZMA-9 : 3.18:1 , 3.0 enc MB/s , 53.8 dec MB/s
Kraken-7 : 3.09:1 , 1.3 enc MB/s , 948.5 dec MB/s
Leviathan doesn't always beat LZMA's compression ratio (they have slightly different strengths) but they are comparable. Leviathan is 7-20X faster to decompress than LZMA (usually around 10X).
Leviathan is ideal for distribution use cases, in which you will compress once and decompress many times. Leviathan allows you to serve highly compressed data to clients without slow decompression. We do try to keep Leviathan encode's time reasonable, but it is not a realtime encoder and not the right answer where fast encoding is needed.
Leviathan is a game changer. It makes high ratio decompression possible where it wasn't before. It can be used on video game consoles and mobile devices where other decoders took far too long. It can be used for in-game loading where CPU use needs to be minimized. It can be used to keep data compressed on disk with no install, because its decompression is fast enough to do on every load.
Leviathan now makes up a part of the Oodle Kraken-Mermaid-Selkie family. These codecs now provide excellent solutions over a wider range of compression needs.
Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :
Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility
or visit RAD to read more about the Oodle SDK
In 2.6.0 the older pre-Kraken codecs (listed below) will still be available for both encoding & decoding. However, they will be marked "deprecated" to discourage their use. I'm doing this two ways.
One : if you encode with them, they will log a warning through the new Oodle "UsageWarning" system. Decoding with them will not log a warning. This warning can be disabled by calling Oodle_SetUsageWarnings(false). (Of course in shipping you might also have set the Oodle log function to NULL via OodlePlugins_SetPrintf, which will also disable the warning log). (Also note that in MSVC builds of Oodle the default log function only goes to OutputDebugString so you will not see anything unless you either have a debugger attached, change the log function, or use OodleX to install the OodleX logger).
Two : the enum for the old compressor names will be hidden in oodle2.h by default. You must define OODLE_ALLOW_DEPRECATED_COMPRESSORS before including oodle2.h to enable their definition.
As noted, decoding with the old codecs will not log a usage warning, and can be done without setting OODLE_ALLOW_DEPRECATED_COMPRESSORS. That is, Oodle 2.6.0 requires no modifications to decode old codecs and will not log a warning. You only need to consider these steps if you want to *encode* with the old codecs.
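Concretely, opting back in to encoding with a deprecated codec looks something like this sketch (the define and the warning call are the ones described above; nothing else is assumed) :

// Opting in to *encoding* with a deprecated codec in 2.6.0 :
// the enum is hidden unless this is defined before including the header.
#define OODLE_ALLOW_DEPRECATED_COMPRESSORS
#include "oodle2.h"

// Optionally silence the UsageWarning that deprecated-codec encodes log :
//   Oodle_SetUsageWarnings(false);
// (decoding old data needs neither the define nor the warning suppression)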
In some future version the encoders for the old codecs will be removed completely. All decoders will continue to be shipped for the time being.
I'm intentionally making it difficult to encode with the old codecs so that you transition to one of the new codecs.
Long term support codecs :
Kraken, Mermaid, Selkie (and Hydra)
codecs being deprecated :
LZNA, BitKnit, LZA, LZH, LZHLW, LZNIB, LZB16, LZBLW
The new codecs simply obsolete the old codecs and they should not be used any more. The old codecs are also a mix of some fuzz safe, some not. The new codecs are all fuzz safe and we want to remove support for non-fuzz-safe decoding so that it's not possible to do by accident.
In pursuit of that last point, another change in 2.6.0 is removing the default argument value for OodleLZ_FuzzSafe in the OodleLZ_Decompress call.
Previously that argument had a default value of OodleLZ_FuzzSafe_No , so that it would allow all the codecs to decompress and was backwards compatible when it was introduced.
If you are not explicitly passing something there, you will get a compiler error and need to pass something.
If possible, you should pass OodleLZ_FuzzSafe_Yes. This will ensure the decode is fuzz safe (ie. won't crash if given invalid data to decode).
The only reason that you would not pass FuzzSafe_Yes is if you need to decode some of the older non-fuzz-safe
codecs. We recommend moving away from those codecs to the new post-Kraken codecs (which are all fuzz safe).
Fuzz safe codecs :
Kraken, Mermaid, Selkie (and Hydra)
BitKnit, LZB16, LZH, LZHLW
Non-fuzz-safe codecs :
LZNA, LZA, LZNIB, LZBLW
If you need to decode one of the non-fuzz-safe codecs, you must pass FuzzSafe_No. If you pass FuzzSafe_Yes, and
the decoder encounters data made by that codec, it will return failure.
Fuzz safety in Oodle means that unexpected data will not crash the decoder. It will not necessarily be detected; the decode might still return success. For full safety, your systems that consume the data post decompression must all be fuzz safe.
Another semi-deprecation coming in Oodle is removing high-performance optimization for 32-bit builds (where 64-bit is available).
What I mean is, we will continue to support & work in 32-bit, but it will no longer be an optimization priority. We may allow it to get slower than it could be. (for example in some cases we just run the 64-bit code that requires two registers per qword; not the ideal way to write the 32-bit version, but it saves us from making yet more code paths to optimize and test).
We're not aware of any games that are still shipping in 32-bit. We have decided to focus our time budget on 64-bit performance. We recommend evaluating Oodle and always running your tools in 64-bit.
If you're a user who believes that 32-bit performance is important, please let us know by emailing your contact at RAD.
Excluding compressors like cmix that are purely aiming for the smallest size, every other compressor is balancing ratio vs. complexity in some way. The author has made countless decisions about space-speed tradeoffs in the design of their codec; are they using huffman or ANS or arithmetic coding? are they using LZ77 or ROLZ or PPM? are they optimal parsing? etc. etc.
There are always ways to spend more CPU time and get more compression, so you choose some balance that you are targeting and try to make decisions that optimize for that balance. (as usual when I talk about "time" or "speed" here I'm talking about decode time ; you can of course also look at balancing encode time vs. ratio, or memory use vs. ratio, or code size, etc.)
Assuming you have done everything well, you may produce a compressor which is "pareto optimal" over some range of speeds. That is, if you do something like this : 03-02-15 - Oodle LZ Pareto Frontier , making a speedup graph over various simulated disk speeds, you are on the pareto frontier if your compressor's curve is the topmost one for some range. If your compressor is nowhere topmost (eg. see the graph with zlib at the bottom in Introducing Oodle Mermaid and Selkie), then go back and try again before proceeding.
The range over which your compressor is pareto optimal is the "natural range" for it. That is a range where it is working better than any other compressor in the world.
Outside of that range, the optimal way to use your compressor is to *not* use it. A client that wants optimal behavior outside of that range should simply switch to a different compressor.
As an example of this - people often take a compressor which is designed for a certain space-speed performance, and then ask it questions that are outside of its natural range, such as "well if I change the settings, just how small of a file can it produce?" , or "just how fast can it decode at maximum speed?". These are bogus questions because they are outside of the operating range. It's like asking how delicate an operation can you do with a hammer, or how much weight can you put on an egg. You're using the tool wrong and should switch to a different tool.
(perhaps a better analogy that's closer to home is taking traditional JPEG Huff and trying to use it for very low bit rates (as many ass-hat authors do when trying to show how much better they are than JPEG), like 0.1 bpp, or at very high bit rates trying to get near-lossless quality. It's simply not built to run outside of its target range, and any examination outside of that range is worse than useless. (worse because it can lead you to conclusions that are entirely wrong))
The "natural lambda" for a compressor is at the center of its natural range, where it is pareto optimal. This is the space-speed tradeoff point where the compressor is working its best, where all those implicit decisions are right.
Say you chose to do an LZ compressor with Huffman and to forbid overlapping matches with offset less than 16 (as in Kraken) - those decisions are wrong at some lambda. If you are lucky they are right decisions at the natural lambda.
Of course if you have developed a compressor in an ad-hoc way, it is inevitable that you have made many decisions wrong. The way this has been traditionally done in the past is for the developer to just try some ideas, often they don't really even measure the alternatives. (for example they might choose to pack their LZ offsets in a certain way, but don't really try other ways). If they do seriously try alternatives, they just look at the numbers for compression & speed and sort of eye ball them and make a judgement call. They do not actually have a well defined space-speed goal and a way to measure if they are improving that score.
The previously posted "correct" Weissman Score provides a systematic way of identifying the space-speed range you wish to target and thus identifying if your decisions are right in a space-speed sense.
Most ad-hoc compressors have illogical, contradictory decisions, in the sense that they have chosen to do something that makes them slower (to decode), but have failed to do something else which would be a better space-speed step. That is, say you have a compressor, and there are various decisions you face as an implementor. The implicit "natural lambda" for your compressor gives you a desired slope for space-speed tradeoffs. Any decision you can make (either adding a new mode, or disabling a current mode; for example adding context mixing, or disabling arithmetic coding) should be measured against that lambda.
Over the past few years, we at RAD have been trying to go about compressor development in this new systematic way. People often ask us for tips on their compressor, hoping that we have some magical answers, like "oh you just do this and that makes it work perfectly". They're disappointed when we don't really have those answers. Often my advice leads nowhere; I suggest "try this and try that" and they don't really pan out and they think "this guy doesn't know anything". But that's what we do - we just try this and that, and usually they don't work, so we try something else. What we have is a way of working that is careful and methodical and thorough. For every decision, you legitimately try it both ways with well-optimized implementations. You measure that decision with a space-speed score that is targeted at the correct range for your compressor. You measure on a variety of test sets and look at randomized subsets of those test sets, not averages or totals. There's no magic, there's just doing the work to try to make these decisions correctly.
Usually the way that compressor development works starts from a spark of inspiration.
That is, you don't set out by saying "I want to develop a compressor for the Weissman range of 10 - 100 mb/s" and then proceed to make decisions that optimize that score. Instead it starts with some grain of an idea, like "what if I send matches with LZSA and code literals with context mixed RANS". So you start going about that grain of idea and get something working. Thus all compressor development starts out ad hoc.
At some point (if you work at RAD), you need to transition to having a more defined target range for your compressor so you can begin making well-justified decisions about its implementation. Here are a few ways that I have used to do that in the past.
One is to look at a space-speed "speedup" curve, such as the "pareto charts" I linked to above. You can visually pick out the range where your compressor is performing well. If you are pareto optimal vs the competition, you can start from that range. Perhaps early in development you are not yet pareto optimal, but you can still see where you are close, you can spot the area where if you improved a bit you would be pareto optimal. I then take that range and expand it slightly to account for the fact that the compressor is likely to be used slightly outside of its natural range. eg. if it was optimal from 10 - 100 mb/s , I might expand that to 5 - 200 mb/s.
Once you have a Weissman range you can now measure your compressor's "correct" Weissman score. You then construct your compressor to make decisions using a Lagrange parameter, such as :
J = size + lambda * time
For each decision, the compressor tries to minimize J. But you don't know what lambda to use. One way to find it is to numerically optimize for the lambda that maximizes the Weissman score over your desired performance interval.
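To make that concrete, here's a tiny illustration of the Lagrange decision (not Oodle code, just the idea) :

// Minimal illustration (not Oodle code) of a Lagrangian encoder decision.
// Each candidate encoding of a block has a (measured or estimated) compressed
// size and decode time.
struct Candidate { double sizeBytes; double decodeTime; };

// Lower J is better. lambda is in bytes per unit of decode time, ie. how many
// bytes of compressed size you're willing to spend to save one unit of time.
double J(const Candidate &c, double lambda) {
    return c.sizeBytes + lambda * c.decodeTime;
}

// Pick whichever candidate minimizes J at the chosen lambda.
Candidate chooseEncoding(const Candidate &a, const Candidate &b, double lambda) {
    return (J(a, lambda) <= J(b, lambda)) ? a : b;
}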
When I first developed Kraken, Mermaid & Selkie, I found their lambdas by using the method of transition points.
Kraken, Mermaid & Selkie are all pareto optimal over large ranges. At some point as you dial lambda on Kraken to favor more speed, you should instead switch to Mermaid. As you dial lambda further towards speed in Mermaid, you should switch to Selkie. We can find what this lambda is where the transition point occurs.
It's simply where the J for each compressor is equal. That is, an outer meta-compressor (like Oodle Hydra ) which could choose between them would have an even choice at this point.
how to set the K/M/S lambdas
measure space & speed of each
find the lambda point at which J(Kraken) = J(Mermaid)
that's the switchover of compressor choice
call that lambda_KM
size_K + lambda_KM * time_K = size_M + lambda_KM * time_M
lambda_KM = (size_M - size_K) / (time_K - time_M)
same for lambda_MS between Mermaid and Selkie
choose Mermaid to be at the geometric mean of the two transitions :
lambda_M = sqrt( lambda_KM * lambda_MS )
then set lambda_K and lambda_S such that the transitions are at the geometric mean
between the points, eg :
lambda_KM = sqrt( lambda_K * lambda_M )
lambda_K = lambda_KM^2 / lambda_M = lambda_KM * sqrt(lambda_KM / lambda_MS)
lambda_S = lambda_MS^2 / lambda_M = lambda_MS * sqrt(lambda_MS / lambda_KM)
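Here's that derivation turned into code ; just a sketch, where the sizes & times are whatever you measured for Kraken/Mermaid/Selkie on your test set :

#include <cmath>

// Sketch of the transition-point method described above.
struct Perf { double size; double time; };   // measured on the same data & machine

// lambda at which two compressors have equal J = size + lambda*time
// (a = the smaller/slower one, b = the larger/faster one)
double transitionLambda(const Perf &a, const Perf &b) {
    return (b.size - a.size) / (a.time - b.time);
}

void setKMSLambdas(const Perf &K, const Perf &M, const Perf &S,
                   double &lambda_K, double &lambda_M, double &lambda_S)
{
    double lambda_KM = transitionLambda(K, M);
    double lambda_MS = transitionLambda(M, S);
    lambda_M = std::sqrt(lambda_KM * lambda_MS);     // geometric mean of the two transitions
    lambda_K = lambda_KM * lambda_KM / lambda_M;     // so that sqrt(lambda_K*lambda_M) == lambda_KM
    lambda_S = lambda_MS * lambda_MS / lambda_M;     // so that sqrt(lambda_S*lambda_M) == lambda_MS
}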
Of course always beware that these depend on a size & time measurement on a specific file on a specific machine, so gather a bunch of different samples and take the median, discard outliers, etc.
(the geometric mean seems to work out well because lambdas seem to like to go in a logarithmic way. This is because the time that we spend to save a byte seems to increase in a power scale way. That is, saving each additional byte in a compressor is 2X harder than the last, so compressors that decrease size more by some linear amount will have times that increase by an exponential amount. Put another way, if J was instead defined to be J = size + lambda * log(time) , with logarithmic time, then lambda would be linear and arithmetic mean would be appropriate here.)
It's easy to get these things wrong, so it's important to do many and frequent sanity checks.
An important one is that : more options should always make the compressor better. That is, giving the encoder more choices of encoding, if it makes a space-speed decision with the right lambda, if the rate measurement is correct, if the speed estimation is correct, if there are no unaccounted for side effects - having that choice should lead to a better space-speed performance. Any time you enable a new option and performance gets worse, something is wrong.
There are many ways this can go wrong. Non-local side effects of decisions (such as parse-statistics feedback) are a confounding example. Another common problem is if the speed estimate doesn't match the actual performance of the code, or if it doesn't match the particular machine you are testing on.
An important sanity check is to have macro-scale options. For example if you have other compressors to switch to and consider J against them - if they are being chosen in the range where you think you should be optimal, something is wrong. One "compressor" that should always be included in this is the "null" compressor (memcpy). If you're targeting high speed, you should always be space-speed testing against memcpy and if you can't beat that, either your compressor is not working well or your supposed speed target range is not right.
In the high compression range, you can sanity check by comparing to preprocessors. eg. rather than spend N more cycles on making your compressor slower, compare to other ways of trading off time for compression, such as perhaps word-dictionary preprocessing on text, delta filtering on wav, etc. If those are better options, then they should be done instead.
One of the things I'm getting at is that every compressor has a "natural lambda" , even ones that are not conscious of it. The fundamental way they function implies certain decisions about space-speed tradeoffs and where they are suitable for use.
You can deduce the natural lambda of a compressor by looking at the choices and alternatives in its design.
For example, you can take a compressor like ZStd. ZStd entropy codes literals with Huffman and other things (ML, LRL, offset log2) with TANS (FSE). You can look at switching those choices - switch the literals to coding with TANS. That will cost you some speed and gain some compression. There is a certain lambda where that decision is moot - either way gives equal J. If ZStd is logically designed, then its natural lambda must lie to the speed side of that decision. Similarly try switching coding ML to Huffman, again you will find a lambda where that decision is moot, and ZStd's natural lambda should lie on the slower side of that decision.
You can look at all the options in a codec and consider alternatives and find this whole set of lambda points where that decision makes sense. Some will constrain your natural lambda from above, some below, and you should lie somewhere in the middle there.
This is true and possible even for codecs that are not pareto. For however good they are, their construction implies a set of decisions that target an implicit natural lambda.
ADD 03-06-18 :
This isn't just about compressors. This is about any time you design an algorithm that solves a problem imperfectly (intentionally) because you want to trade off speed, or memory use, or whatever, for an approximate solution.
Each decision you make in algorithm design has a cost (solving it less perfectly), and a benefit (saving you cpu time, memory, whatever). Each decision has a slope of cost vs benefit. You should be taking the decisions with the best slope.
Most algorithm design is done without considering this properly and it creates algorithms that are fundamentally wrong. That is, they are built on contradictory decisions. That is, there is NO space-speed tradeoff point where all the decisions are correct. Each decision might be correct in a different tradeoff point, but they are mutually exclusive.
I'll do another example in compression because it's what I know best.
LZ4 does no entropy coding. LZ4 does support overlapping match copies. These are contradictory design decisions (*).
That is, if you just wanted to maximize compression, you would do both. If you wanted to maximize decode speed, you would do
neither. In the middle, you should first turn on the feature with the best slope. Entropy coding has more benefit per cost than
overlap matches, so it should be turned on first. That is :
High compression domain : { Huffman coding + overlap matches }
Middle tradeoff domain : { Huffman coding + no overlap matches }
High speed domain : { no Huffman coding + no overlap matches }
Contradictory : { no Huffman coding + overlap matches }
There are logical options for tradeoffs in different domains, and there are choices that are correct nowhere.
(* = caveat: this is considering only the tradeoff of size vs. decode speed; if you consider other factors, such as encode time or code complexity then the design decision of LZ4 could make sense)
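Here's a toy sketch of the "take the best slope first" rule ; the Feature struct and any numbers you'd feed it are invented, it's just to show the ordering :

#include <algorithm>
#include <vector>

// Toy sketch of "enable design features in order of best slope".
// sizeSaved = compression benefit of enabling the feature,
// timeCost  = extra decode time it costs (assumed positive).
struct Feature { const char *name; double sizeSaved; double timeCost; };

// Sort by benefit per unit cost, best slope first. At a given lambda you
// enable features from the top of the list down, stopping at the first one
// where sizeSaved < lambda * timeCost ; nothing below that point should be on.
void orderBySlope(std::vector<Feature> &features) {
    std::sort(features.begin(), features.end(),
              [](const Feature &a, const Feature &b) {
                  // compares slopes a.sizeSaved/a.timeCost vs b.sizeSaved/b.timeCost
                  return a.sizeSaved * b.timeCost > b.sizeSaved * a.timeCost;
              });
}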
This is in no way unique to compression. It's extremely common when programmers heuristically approximate numerical algorithms. Perhaps on the high end you use doubles with careful summation, exact integration of steps, handle degeneracies. Then you wish to approximate it for speed, you can use floats, ignore degenerate cases, take single coarse steps, etc. Each of those decisions has a different speed vs error cost and should be evaluated. If you have 10 decisions about how to design your algorithm, the ones that provide the best slope should be taken before other ones.
The "natural lambda" in this more general case corresponds to the fundamental structure of your algorithm design. Say you're doing 3d geometry overlap testing using shaped primitives like OBB's and lozenges. That is a fundamental decision that sets your space-speed tradeoff domain. You have chosen not to do the simpler/faster thing (always use AABB's or spheres), and you have chosen not to do the more complex thing (use convex hulls or exact geometry overlap tests). Therefore you have a natural lambda, a fundamental range where your decision makes sense, which is between those two. If you then do your OBB/Lozenges very carefully using slow exact routines, that is fundamentally wrong - you should have gone to convex hulls. Or if you really approximate your OBB/Lozenges with gross approximations - that is also fundamentally wrong - you should have just use AABB's/spheres instead.
"speed" here always refers to decode speed - this is about the encoder making choices about how it forms the compressed bit stream.
This parameter allows the encoders to make decisions that optimize for a space-speed goal which is of your choosing. You can make those decisions favor size more, or you can favor decode speed more.
If you like, a modern compressor is a bit like a compiler. The compressed data is a kind of program in bytecode, and the decompressor is just an interpreter that runs that bytecode. An optimal parser is like an optimizing compiler; you're considering different programs that produce the same output, and trying to find the program that maximizes some metric. The "space-speed tradeoff" parameter is a bit like -Ox vs -Os, optimize for speed vs size in a compiler.
Oodle of course includes Hydra (the many headed beast) which can tune performance by selecting compressors based on their space-speed performance.
But even without Hydra the individual compressors are tuneable, none more so than Mermaid. Mermaid can stretch itself from Selkie-like (LZ4 domain) up to standard LZH compression (ZStd domain).
I thought I would show an example of how flexible Mermaid is. Here's Mermaid level 4 (Normal)
with some different space-speed tradeoff parameters :
sstb = space speed tradeoff bytes
sstb 32 : ooMermaid4 : 2.29:1 , 33.6 enc MB/s , 1607.2 dec MB/s
sstb 64 : ooMermaid4 : 2.28:1 , 33.8 enc MB/s , 1675.4 dec MB/s
sstb 128: ooMermaid4 : 2.23:1 , 34.1 enc MB/s , 2138.9 dec MB/s
sstb 256: ooMermaid4 : 2.19:1 , 33.9 enc MB/s , 2390.0 dec MB/s
sstb 512: ooMermaid4 : 2.05:1 , 34.3 enc MB/s , 2980.5 dec MB/s
sstb 1024: ooMermaid4 : 1.89:1 , 34.4 enc MB/s , 3637.5 dec MB/s
compare to : (*)
zstd9 : 2.18:1 , 37.8 enc MB/s , 590.2 dec MB/s
lz4hc : 1.67:1 , 29.8 enc MB/s , 2592.0 dec MB/s
(* MSVC build of ZStd/LZ4 , not a fair speed measurement (they're faster in GCC), just use as a general reference point)
Point being - not only can Mermaid span a large range of performance but it's *good* at both ends of that range, it's not getting terrible as it goes out of its comfort zone.
You may notice that as sstb goes below 128 you're losing a lot of decode speed and not gaining much size. The problem is you're trying to squeeze a lot of ratio out of a compressor that just doesn't target high ratio. As you get into that domain you need to switch to Kraken. That is, there comes a point where the space-speed benefit of squeezing the last drop out of Mermaid is harder than just making the jump to Kraken. And that's where Hydra comes in, it will do that for you at the right spot.
ADD : Put another way, in Oodle there are *two* speed-ratio tradeoff dials. Most people are just
familiar with the compression "level" dial, as in Zip, where higher levels = slower to encode, but
more compression ratio. In Oodle you have that, but also a dial for decode time :
CompressionLevel = trade off encode time for compression ratio
SpaceSpeedTradeoffBytes = trade off decode time for compression ratio
Perhaps I'll show some sample use cases :
Default initial setting :
CompressionLevel = Normal (4)
SpaceSpeedTradeoffBytes = 256
Reasonably fast encode & decode. This is a balance between caring about encode time, decode time,
and compression ratio. Tries to do a decent job of all 3.
To maximize compression ratio, when you don't care about encode time or decode time :
CompressionLevel = Optimal4 (8)
SpaceSpeedTradeoffBytes = 1
You want every possible byte of compression and you don't care how much time it costs you to encode or
decode. In practice this is a bit silly, rather like the "placebo" mode in x264. You're spending
potentially a lot of CPU time for very small gains.
(for maximum compression, also set maxLocalDictionarySize to 2^30)
A more reasonable very high compression setting :
CompressionLevel = Optimal3 (7)
SpaceSpeedTradeoffBytes = 16
This still says you strongly value ratio over encode time or decode time, but you don't want to chase
tiny gains in ratio that cost a huge amount of decode time.
If you care about decode time but not encode time :
CompressionLevel = Optimal4 (8)
SpaceSpeedTradeoffBytes = 256
Crank up the encode level to spend lots of time making the best possible compressed stream, but make
decisions in the encoder that balance decode time.
etc.
The SpaceSpeedTradeoffBytes is a number of bytes that Oodle must be able to save in order to accept a certain time increase in the decoder. In Kraken that unit of time is 25600 cycles on the artificial machine model that we use. (that's 8.53 microseconds at 3 GHz). So at the default value of 256, it must save 1 byte in compressed size to take an increased time of 100 cycles.
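So the encoder's accept/reject test for a slower-but-smaller encoding choice is roughly (a sketch of the rule as just stated, not Oodle source) :

// sstb = SpaceSpeedTradeoffBytes : bytes that must be saved per 25600 cycles
// of added decode time, on the encoder's machine model.
bool acceptSlowerChoice(double bytesSaved, double cyclesAdded, double sstb)
{
    // equivalent to : bytesSaved >= cyclesAdded * (sstb / 25600.0)
    return bytesSaved * 25600.0 >= cyclesAdded * sstb;
}
// eg. at the default sstb = 256, saving 1 byte justifies at most 100 extra cycles.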
Most of the time Kraken gets better ratio than ZStd, but there were exceptions to that (mainly text), and it always kind of bothered me. Since Kraken is roughly a superset of ZStd (not exactly), and the differences are small, ZStd shouldn't have been winning by more than 1% (which is the variation I'd expect due to small differences). On text files I have no edge over ZStd, all my advantages are moot, so we're reduced to both being pretty basic LZ-Huffs; so we should be equal, but I was losing. So I dug in to see what was going on.
Thanks of course to Yann for making his great work open source so that I'm able to look at it; open source and sharing code is a wonderful and helpful thing when people choose to do so voluntarily, not so nice when your work is stolen from you against your will and shown to the world like phone-hacked dick-pics *cough* *assholes*. Since I'm learning from open source, I figured I should give back, so I'm posting what I learned.
A lot of the differences are a question of binary vs. text focus. ZStd has some tweaking that clearly comes from testing on text and corpora with a lot of text (like silesia). On the other hand, I've been focusing very much on binary and that has caused me to miss some important things that only show up when you look closely at text performance.
This is what I found :
Long hashes are good for text, bad for binary
ZStd non-optimal levels use hash lengths of 5 or even 6 or 7 at the fastest levels. This helps on text because text has many long matches, so it's important to have a hash long enough that it can differentiate between "boogie" and "booger" and put them in different hash table bins. (this is most important at the fastest levels, which use a cache table with no ways).
On binary you really want to hash len 4 because there are important matches of exactly len 4, and longer hashes
can make you miss them.
zstd2 hash len 6 :
PD3D : zstd2 : 31,941,800 ->11,342,055 = 2.841 bpb = 2.816 to 1
zstd2 hash len 4 :
PD3D : zstd2 : 31,941,800 ->10,828,309 = 2.712 bpb = 2.950 to 1
zstd2 hash len 6 :
dickens : zstd2 : 10,192,446 -> 3,909,882 = 3.069 bpb = 2.607 to 1
zstd2 hash len 4 :
dickens : zstd2 : 10,192,446 -> 4,387,536 = 3.444 bpb = 2.323 to 1
Longer hashes help the fast modes a *lot* on text. If you care about fast compression of text
you really want those longer hashes.
This is a big issue and because of it ZStd fast modes will continue to be better than Oodle on text (and Oodle will be better on binary); or we have to find a good way to detect the data type and tune the hash length to match.
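For reference, the kind of hash I'm talking about is something like this (a generic multiplicative-hash sketch, not the exact ZStd or Oodle hash functions) :

#include <stdint.h>
#include <string.h>

// Hash the next 4 or 6 bytes at p into a tableBits-bit hash table index.
static inline uint32_t hash4(const uint8_t *p, int tableBits) {
    uint32_t x;
    memcpy(&x, p, 4);
    return (x * 2654435761u) >> (32 - tableBits);
}

static inline uint32_t hash6(const uint8_t *p, int tableBits) {
    uint64_t x = 0;
    memcpy(&x, p, 6);
    return (uint32_t)((x * 0x9E3779B185EBCA87ull) >> (64 - tableBits));
}

// With hash6, "boogie" and "booger" land in different bins ; with hash4 they
// collide on "boog", which is what you want on binary (len-4 matches matter)
// but wastes your one cache-table slot on text.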
lazy2 is helpful on text
Standard lazy parsing looks for a match at ptr, if one is found it also looks at ptr+1 to see if something better is there. Lazy2 also looks at ptr+2.
I wasn't doing 2-ahead lazy parsing, because on binary it doesn't help much. But on text it's
a nice little win :
Zstd level 9 has 2-step lazy normally :
zstd9 : 41,458,703 ->10,669,424 = 2.059 bpb = 3.886 to 1
disabled : (1-step lazy) :
zstd9 : 41,458,703 ->10,825,637 = 2.089 bpb = 3.830 to 1
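In sketch form, lazy2 is just one more position probed before committing to a match. (simplified ; findBestMatch / isBetter / emitLiteral / emitMatch are hypothetical stand-ins for your own match finder and output, this is not actual ZStd or Oodle code) :

struct Match { int len; int offset; bool found; };

Match findBestMatch(const unsigned char *p);           // match finder stand-in
bool  isBetter(const Match &cand, const Match &prev);  // eg. enough longer to pay for the extra literal
void  emitLiteral(const unsigned char *p);
void  emitMatch(const Match &m);

void lazyParse(const unsigned char *p, const unsigned char *end, bool lazy2)
{
    while (p < end) {
        Match m0 = findBestMatch(p);
        if (!m0.found) { emitLiteral(p); p++; continue; }

        Match m1 = findBestMatch(p + 1);                 // standard lazy : peek one ahead
        Match m2 = lazy2 ? findBestMatch(p + 2)          // lazy2 : also peek two ahead
                         : Match{0, 0, false};

        if (m2.found && isBetter(m2, m1) && isBetter(m2, m0)) {
            emitLiteral(p); emitLiteral(p + 1);          // two literals, then the better match
            p += 2; emitMatch(m2); p += m2.len;
        } else if (m1.found && isBetter(m1, m0)) {
            emitLiteral(p);                              // one literal, then the better match
            p += 1; emitMatch(m1); p += m1.len;
        } else {
            emitMatch(m0); p += m0.len;
        }
    }
}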
optimal parser all len reductions helps on text
I once wrote that in codecs that do strong rep0 exclusion (a rep0len1 literal can't occur immediately after a match), you can just always send max-length matches, and not have to consider match length reductions. (because max-length matches maintain the rep0 exclusion but shorter ones violate it).
That is not quite right. It tends to be true on binary, but is wrong on text. The issue is that you only get the rep0 exclusion benefit if you actually send a literal after the match.
That happens often on binary. Binary frequently goes match-literal-match-literal , with some near-random bytes between predictable regions. Text has very few literals. Many text files go match-match-match which means the rep0 literal exclusion does nothing for you.
On text files you often have many short & medium length overlapping matches, and trying len reductions is
important to find the parse that traces through them optimally.
AAAADDDGGGGJJJJ
BBBBBFFFHHHHHH
CCCEEEEEIII
and the optimal parse might be
AAABBBFFFHHHHHH
which you would only find if you tried the len reduction of A
this kind of thing. Text is all about making the best normal-match decisions.
with all len reductions :
zstd22 : 10,000,000 -> 2,800,209 = 2.240 bpb = 3.571 to 1
without :
zstd22 : 10,000,000 -> 2,833,168 = 2.267 bpb = 3.530 to 1
Getting len 3 matches right in the optimal parser is really important on text
Part of the "text is all matches" issue. My codecs are mostly MML 4 in the non-optimal modes, then I switch to MML3 at level 7 (Optimal3). Adding MML3 generally lets you get a bit more compression ratio, but hurts decode speed a bit.
(BTW MML3 in the non-optimal modes generally *hurts* compression ratio, because they can't make the decision correctly about when to use it. A len 3 match is always marginal, it's only slightly cheaper than 3 literals (depending on the literals), and you probably don't want it if you can find any longer match within those next 3 bytes. Non-optimal parsers just make these decisions wrong and muck it all up, they do better with MML 4 or even higher sometimes. (there are definitely files where you can crank up MML to 6 or 8 and improve ratio))
So, I was doing that *but* I was using the statistics from a greedy pre-pass to seed the optimal parse decisions, and the greedy pre-pass was MML 4, which was biasing the optimal against len 3 matches. It was just a fuckup, and it wasn't hurting me on binary, but when I compared to ZStd's optimal parse on text I could immediately see it had a lot more len 3 matches than me.
(this is also an example of the parse-statistics feedback problem, which I believe is the most important problem in LZ compression)
dickens
zstd22 : 10,192,446 -> 2,856,038 = 2.242 bpb = 3.569 to 1
before :
ooKraken7 : 10,192,446 -> 2,905,719 = 2.281 bpb = 3.508 to 1
after :
ooKraken7 : 10,192,446 -> 2,862,710 = 2.247 bpb = 3.560 to 1
ZStd is full of small clever bits
There's lot of little clever nuggets that are hard to see. They aren't generally commented and they're buried in chunks of copy-pasted code that all looks the same so it's easy to gloss over the variations.
I looked over this code many times :
if ((offset_1 > 0) & (MEM_read32(ip+1-offset_1) == MEM_read32(ip+1))) {
    mLength = ZSTD_count(ip+1+4, ip+1+4-offset_1, iend) + 4;
    ip++;
    ZSTD_storeSeq(seqStorePtr, ip-anchor, anchor, 0, mLength-MINMATCH);
} else {
    U32 offset;
    if ( (matchIndex <= lowestIndex) || (MEM_read32(match) != MEM_read32(ip)) ) {
        ip += ((ip-anchor) >> g_searchStrength) + 1;
        continue;
    }
    // [ got match etc... ]
and I thought - okay, look for a 4 byte rep match, if found take it unconditionally and don't look for
normal match. That's the same thing I do (I think it came from me?), no biggie.
But there's a wrinkle. The rep check is not at the same position as the normal match. It's at pos+1.
This is actually a mini-lazy-parse. It doesn't do a full match & rep find at pos & (pos+1). It's just scanning through, at each pos it only does one rep find and one match find, but the rep find is offset forward by +1. That means it will take {literal + rep} even if match is available, which a normal non-lazy parser can't do.
(aside : you might think that this misses a rep find : when a literal run starts right after a match, the first rep find is at pos+1, so there's one spot where it does no rep find. But that spot is where the rep0 exclusion applies - there can be no rep there, so it's all good!)
This is a solid win and it's totally for free, so very cool.
Seven testset
with rep-ahead search :
total : zstd3 : 80,000,000 ->34,464,878 = 3.446 bpb = 2.321 to 1
with rep at same pos as match :
total : zstd3 : 80,000,000 ->34,521,261 = 3.452 bpb = 2.317 to 1
The end.
ADD : a couple more notes on ZStd (that aren't from the recent investigation) while I'm at it :
ZStd uses a unique approach to the lrl0-rep0 exclusion
After a match (of full length), that same offset cannot match again. If your offsets are in a rep match cache, the most recently used offset is the top (0th) entry, rep0. This is the lrl0-rep0 exclusion.
rep0 is usually the most likely match, so it will get the largest share of the entropy coder probability space. Therefore if you're in an exclusion where that symbol is impossible, you're wasting a lot of bits.
There are two ways that I would call "traditional" or straightforward data compression ways to model the lrl0-rep0 exclusion. One is to use a single bit for (lrl == 0) as context for the rep-index coding event. eg. you have two entropy coding states for offsets, one for lrl == 0 and one for lrl != 0. The other classical method would be to combine lrl with rep-index in a larger alphabet, which allows you to model their correlation using only order-0 entropy coding. The minimum alphabet size here is only 2 bits, 1 bit for (lrl == 0) or not, and one for (match == rep0) or not.
ZStd does not use either of these methods. Instead it shifts the rep index by (lrl == 0). That is, ZStd has 3 reps, and normally they are in match offset slots 0,1,2. But right after the end of a match (when lrl is 0) those offset values change to mean rep 1,2,3 ; and there is no rep3, that's a virtual offset equal to (rep0 - 1).
The ZStd format documentation is a good reference for these things.
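In decoder terms the remapping is something like this (my sketch of the rule just described ; the rep-history update after the match is not shown) :

// reps[0..2] are the three most recent offsets, offsetSlot is the decoded 0/1/2.
static int resolveRepOffset(const int reps[3], int offsetSlot, int litLen)
{
    if (litLen != 0) {
        // normal case : slots 0,1,2 mean rep0,rep1,rep2
        return reps[offsetSlot];
    } else {
        // lrl == 0 : rep0 is excluded, so the slots shift to mean rep1,rep2,"rep3",
        // where the virtual rep3 is (rep0 - 1)
        if (offsetSlot == 0) return reps[1];
        if (offsetSlot == 1) return reps[2];
        return reps[0] - 1;
    }
}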
I can't say how well the ZStd method here compares to the alternatives as it's a bit more effort to check than I'd like to do. (if you want to try it, you could double the size of ZStd's offset coding alphabet to put 1 bit of lrl == 0 into the offset coding; then the decode sequence grabs an offset and only pulls an lrl code if the offset bit says so).
ZStd uses TANS in a limited and efficient way
ZStd does not use TANS (FSE) on its literals, which are the largest class of entropy coded symbols. Presumably Yann found, like us, that the compression gains on literals (over Huffman) are small, and the speed cost is not worth it. ZStd only uses TANS on the LZ match components - LRL, offset, ML.
Each of these has a small alphabet (52,35,28), and therefore can use a small # of bits for the TANS tables (9,9,8). This is a sweet spot for TANS, so it works well in ZStd.
For large alphabets (eg. 256 for literals), TANS needs a higher # of bits for its code tables (at least 11), which means 2048 entries being filled. This makes the table setup time rather large. By cutting the table size to 8 or 9 bits you cut that down by 4-8X. With large alphabets you also may as well just go Huff. But with small alphabets, Huff gets worse and worse. Consider the extreme - in an alphabet of 2 symbols Huff becomes no compression at all, while TANS can still do entropy coding. With small alphabets to use Huffman you need to combine symbols (eg. in a 2-bit alphabet you would code 4 at once as an 8-bit symbol). BUT that means going up to big decoder tables again, which adds to your constant overhead.
FSE uses the prime-scatter method to fill the TANS decode table. (this is using a relatively-prime step to just walk around the circular array, using the property that you can just keep stepping that way and you will eventually hit every slot once and only once). I evaluated the prime-scatter method before and concluded that the compression penalty was unacceptably large. I was mistaken. I had just implemented it wrong, so my results were much worse than they should be.
(the mistake I made was that I did the prime-scatter in one pass; for each symbol, take the steps and fill table entries, increment "from_state" as you step, "to_state" steps around with the prime-modulo. This causes a non-monotonic relationship between from_state and to_state which is very bad. The right way to do it (the way ZStd/FSE does it) is to use some kind of two-pass scheme, so that you do the shuffle-scatter first (which can step around the loop non-monotonically) but then assign the from_state relationship in a second pass which ensures the monotonic relationship).
With a correct implementation, prime-scatter's compression ratio is totally fine (*). The two-pass method that ZStd/FSE uses would be slow for large alphabets or large L, but ZStd only uses FSE for small alphabets and small L. The entropy coder and application are well matched. (* = if you special case singletons, as below)
The worst case for prime-scatter is low counts, and counts of 1 are the worst. ZStd/FSE uses a special case for counts of 1 that are "below 1". Back in the "Understanding TANS" series I looked at the "precise sort" method of table building and found that artificially skewing the bias to put counts of 1 at the end was a big win in practice. The issue there is that the counts we see at that point are normalized, and zeros were forced up to 1 for codeability. The true count might be much lower. Say you're coding an array of size 64k and symbol 'x' only occurs 1 time. If you have a TANS L of 1024 , the true probability should be 1/64k , but normalized forces it up to 1/1024. Putting the singleton counts at the end of the TANS array gives them the maximum codelen (end of the array has maximum fractional bits). The sort bias I did before was a hack that relies on the fact that most singleton counts come from below-1 normalized probabilities. ZStd/FSE explicitly signals the difference, it can send a "true 1" (eg. closest normalized probability really is 1/1024 ; eg. in the 64k array, count is near 64), or a "below 1" , some very low count that got forced up to 1. The "below 1" symbols are forced to the end of the TANS array while the true 1's are allowed to prime-scatter like other symbols.
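For reference, the scatter pass looks something like this (a simplified sketch in the spirit of FSE, not the actual FSE source) :

#include <stdint.h>
#include <vector>

// normCount[s] are the normalized counts, summing to tableSize (a power of 2).
// Real FSE first places the "below 1" symbols at the very end of the table,
// then scatters the rest ; that special case is omitted here.
void spreadSymbols(const std::vector<int> &normCount, int tableSize,
                   std::vector<uint8_t> &tableSymbol)
{
    tableSymbol.resize(tableSize);
    // the classic FSE step ; for these table sizes it's odd, hence relatively
    // prime to the power-of-2 table size, so repeatedly adding it (mod tableSize)
    // visits every slot exactly once
    int step = (tableSize >> 1) + (tableSize >> 3) + 3;
    int pos = 0;
    // pass 1 : scatter each symbol's normCount[s] slots around the table
    for (int s = 0; s < (int)normCount.size(); s++) {
        for (int i = 0; i < normCount[s]; i++) {
            tableSymbol[pos] = (uint8_t)s;
            pos = (pos + step) & (tableSize - 1);
        }
    }
    // pass 2 (not shown) : walk the table in increasing state order and hand out
    // each symbol's "from" states in increasing order too, which keeps the
    // from_state -> to_state relationship monotonic per symbol.
}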
The end.
Some days, the sky is full of smoke so thick that the sun is only a dull orange disc. The major highway, I84, is closed, making it hard to get to Portland or the coast, and putting heavy traffic on the only other road through the gorge. We can't go outside because the air burns our eyes and lungs. We've been stuck inside with the windows closed, trying to wait it out, but it seems unlikely to end any time soon. Big forest fires like this are never put out by man, they're just contained and you have to wait for them to burn out or a big rain to fall; it seems likely that all these big fires in the West will keep burning until the fall rainy season arrives, which could be a month or two away. We've had a lot of nasty smokey days even before the Eagle Creek Fire started. There was the small Indian Creek Fire right around the same area here in the Gorge (just West of the Hood River valley, now merged with the Eagle Creek Fire). More significantly there have been big fires in the Sisters area that have given us toxic smokey days all summer (and perhaps also the bigger ones farther away like Chetco Bar and the Montana fires? I don't really know if that smoke comes to us).
The story of how the Eagle Creek Fire started is particularly galling - Woman Witnessed Teen Tossing Firecrackers Into Gorge: “There Was a Whole Group of Kids Who Found It Funny To Do This”
A group of kids were throwing fireworks into the tinderbox of a dry forest (with other fires already burning all over the area, so there can't have been any doubt about the consequences), and of course they were video taping it on cell phones. They had a totally flippant "whatever bro" kind of attitude, fuck responsibility and consequences. To me it seems to sum up our era perfectly. Who cares how this is going to hurt the rest of humanity, this is going to get me Youtube views!
This bug was present from Oodle 2.5.0 to 2.5.4 ; if you use those versions you should update to 2.5.5
When the bug occurs, the OodleLZ_Compress call returns success, thinking it made valid compressed data, but it has actually made a damaged bit stream. When you call Decompress it might return failure, or it might return success but produce decompressed output that does not match the original bits.
Any compressed data that you have made which decodes successfully (and matches the original uncompressed data) is fine. The presence of the bug can only be detected by attempting to decode compressed data and checking that it matches the original uncompressed data.
The decoder is not affected by this bug, so if you have shipped user installations that only do decoding, they don't need to be updated. If you have compressed files which were made incorrectly because of this bug, you can patch only those individual compressed files.
Technical details :
This bug was caused by one of the internal bit stream write pointers writing past the end of its bits, potentially over-writing another previously written bit stream. This caused some of the previously written bits to become garbage, causing them to decode into something other than what they had been encoded from.
This only occurred with 64-bit encoders. Any data written by 32-bit encoders is not affected by this bug.
This bug could in theory occur on any Kraken & Mermaid compressed data. In practice it's very rare and I've only seen it in one particular case - "whole huff chunks" on data that is only getting a little bit of compression, with uncompressed data that has a trinary byte structure (such as 24-bit RGB). It's also much more likely in pre-2.3.0 compatibility mode (eg. with OodleLZ_BackwardsCompatible_MajorVersion=2 or lower).
BTW it's probably a good idea in general to decode and verify the data after every compress.
I don't do it automatically in Oodle because it would add to encode time, but on second thought that might be a mistake.
Pretty much all the Oodle codecs are so asymmetric, that doing a full decode every time wouldn't add much to
the encode time. For example :
Kraken Normal level encodes at 50 MB/s
Kraken decodes at 1000 MB/s
To encode 1 MB is 0.02 s
To decode 1 MB is 0.001 s
To decode after every encode changes the encode time to 0.021 s = 47.6 MB/s
it's not a very significant penalty to encode time, and it's worth it to verify that your data definitely
decodes correctly. I think it's a good idea to go ahead and add this to your tools.
I may add a "verify" option to the Compress API in the future to automate this.
Oodle for Windows UWP comes with only the "core" library that does memory to memory compression. The Oodle Core library uses no threads, has minimal dependencies (just the CRT), no funny business, making it very portable.
For full details see the Oodle Change Log
Much of the glory of this area is due to the remarkably sane preservation of its beauty by the Columbia River Gorge National Scenic Area.
Some background reading :
On Stranger we used exceptions as a last gasp measure during dev to try to keep the game running for our content creation team. It worked great and I think everyone should use a similar system in game development.
We did not ship with exceptions. They were only used during development. To be clear, what we did NOT do :
We did NOT :
Use C++ exceptions (we used SEH with __try , __throw , __except)
Try to do proper "exception-safe" C++ ala Herb Sutter
(this is a bizarre and very tricky way of writing C++ that requires
doing everything in a different way than the normal linear imperative code; it
uses lots of swaps and temp objects)
Return every error with exceptions ; most errors were via return value
Try to unwind exceptions cleanly/robustly
Just kill the game on exceptions
Any error that we expected to happen, or could happen in ship, such as files not found, media errors, etc. were handled with return codes. The standard way of functions returning codes and the calling code checking it and handling it somehow.
Also, errors that we could detect and just fix immediately, without returning a code, we would just fix. So, say you tried to create an Actor and the pref file describing that Actor didn't exist : we'd just print an error (and automatically email it to dev@oddworld) and not create that Actor. Hey, shit's wrong, but you can continue.
The principle is : don't block artist A from working on their stuff just because the programmers or some other artist checked in other broken stuff. If possible, just disable the broken stuff, because artist A can probably continue.
Say the guys working on particle systems keep checking in broken stuff that would crash the game or cause lots of errors - fine. The rest of the art team can still be syncing to new builds, and they will just see an error printed about "particle system XX failed ; disabled" and then they can continue working on their other stuff.
Blocking the art/design team (potentially a lot of people) for even 5 minutes while you try to roll things back or whatever to fix it is really a HUGE HUGE disaster and should never ever happen.
Any time your artists/designers have to get up and go get coffee/snacks in the kitchen because things are broken and they can't work - you massively fucked up and you should endeavor to never do that again.
But inevitably there are problems where we couldn't just detect the issue and disable the object (like the pref-not-found case above). Maybe you just get a crash in some code due to an array ref out of bounds, or somewhere deep in the code you detect a bad fault that you can't fix.
So, as a catch of last measure we used exceptions. The way we did it was to wrap a try/catch around each game
object creation & update, and if it caught an exception, that object was removed.
for each object O in the world list
{
    __try
    {
        O->Update();
    }
    __except( EXCEPTION_EXECUTE_HANDLER )
    {
        // show & email error about O throwing
        // remove O from world list
        // don't delete the object O since it could still be pointed at by the world, or could be corrupt
    }
}
Removing O prevents it from trying to Update again and thus spamming. We assume that once it throws, something is badly broken there and we'll just get rid of it.
As I said before, this is NOT trying to catch every error and handle it in a robust way. Obviously O may have been partway through its Update and left the world in a weird state; just removing O may not keep the game from crashing; there are lots of possible problems and we don't try to handle them. It's "optimistic" in the sense that we sort of expect it to fail and cause problems, but if it ever does work, then awesome, great, it saved an artist from crashing. In practice it actually works fine 90% of the time.
We specifically do *not* want to be robust, because writing fully robust exception-safe code (that would have to roll back all the partial changes to the world if there was a throw somewhere through the update) is too onerous. The idea of this system is that it imposes *zero* extra work on programmers writing normal game code.
We could also manually __throw in some places where appropriate. The criterion for doing that is : not an error you should ever get in the final game, it's a spot where you can't return an error code or just show a failure message and do some kind of default fallback. You also don't need to __throw if it's a spot where the CPU will throw an interrupt for you.
For example, places where we might manually __throw : inside a vector push_back if the malloc to extend failed. In an array out of bounds deref. In the smart-pointer deref if the pointer is null.
Places where we don't __throw : trying to normalize a zero vector or orthonormalize a degenerate frame. These are better to detect, log an error message, and just stuff in unitZ or something, because while that is broken it's a better way to break than the throw-catch mechanism which should only be used if there's no nicer way to stub-out the broken behavior.
Some (not particularly related) links :
cbloom rants 02-04-09 - Exceptions
cbloom rants 06-07-10 - Exceptions
cbloom rants 11-09-11 - Weird shite about Exceptions in Windows
Chroma downsampling (as in standard JPEG YCbCr 420) is a big ugly hammer. It just throws away a ton of bits of information. That's pretty much always a bad thing in compression.
Now, preferring to throw away chroma detail before luma detail is valid and good. So if you are not chroma subsampling, then you need a perceptually optimizing encoder that knows to give fewer bits to high frequency chroma. You have much more control doing this through bit allocation than you do by just smashing the chroma planes. (for example, on blocks where there is almost no luma signal, then you might keep more of the high frequency chroma, on blocks with luma masking you can throw away lots of the high chroma AC bits - you just have way more precise control).
The chroma subsample is just a convenient way to get decent perceptual tradeoffs in a *non* optimizing encoder.
Chroma subsample is of course an R-D choice. It throws away signal, giving a certain distortion, in exchange it saves you some rate. This particular move is a good R-D choice at some tradeoff zone. Generally at high bit rate, it's a bad move. In most encoders, it becomes a good move at some lower quality. (in JPEG the tradeoff point is usually somewhere around 85). Measuring this D in RMSE is easy, but measuring it perceptually is rather tricky (when luma masking is present it may be near zero D perceptually, but without luma masking it can be much worse).
There are other issues.
In non-subsampled world, the approximate importance weights for YCbCr are something like {0.7,0.13,0.17} . If you do subsample, then the chroma weights per-pixel need to go up by 4X , in which case they become pretty close to all being the same.
Many people mistakenly say the "eye does not see blue levels well". Not true, the eye can see the overall level of blue perfectly well, just as well as red or green. (eg for images where the whole thing is a single solid color). What the eye has is very poor spatial resolution in blue.
One of the issues is that chroma subsample is a nice win for *speed*. It gives you 4X less pixels to work on in two of your planes, half as many pixels overall. This means that subsampled chroma images are almost 2X faster to decode.
I used to be anti-chroma-subsample in my early years. For example in wavelets it's much neater to keep all your color planes full res, but send your bitplanes in [YUV] order. That way when you truncate the bottom bit planes, you drop the highest frequency chroma first. But then I realized that the 2X speedup from chroma subsample was nothing to sneeze at, and in general I'm now pro-chroma-subsample.
Another reminder : if you don't chroma subsample, then you may as well do a KLT on the color planes, rather than just use YUV or whatever. (maybe even KLT per region). The advantage of using standard YUV is that the chroma are known to be perceptually okay to downsample (you can't downsample the two minor components of the KLT transformed planes because you have no guarantee that they are of a type that the eye can't perceive high frequency data).
You can obviously construct adversarial images where the detail is all in chroma (the whole image has a constant luma). In that case chroma downsampling looks really bad and is perceptually a big mistake.
Chroma-from-luma in the decoder fixes all the color fringing that people associate with JPEG, but obviously it doesn't help in the adversarial cases where there is no luma detail to boost the chroma with.
I should also note while I'm at it that there are many codecs out there that just have bugs and/or mistakes in their downsamplers and/or upsamplers that cause this operation to produce way more error than it should.
ADD : Won sent me an email with an interesting idea I'd never thought about. Rather than just jumping between not downsampling chroma and doing 2x2 downsample, you could take more progressive steps, such as going to a checkerboard of chroma (half as many pixels) or a Bayer pattern. It's probably too complex to support these well and make good encoder decisions in practice, but they're interesting in theory.
Your dictionary is like the probability interval. You start with a full interval [0,1]. You put in all
the single-character words, with P(c) for each char, so that all sums to one. You iteratively split the
largest interval, and subdivide the range.
In binary the "split" operation is :
W -> W0 , W1
P(W) = P(W)*P(0) + P(W)*P(1)
In N-ary the split is :
W -> Wa , W[b+]
P(W) = P(W)*P(a) + P(w)*Ptail(b)
W[b+] means just the word W, but in the state "b+" (aka state "1"), following sym must be >= b
(P(w) here means P_naive(w), just the char probability product)
W[b+] -> Wb , W[c+]
P(w)*Ptail(b) = P(w)*P(b) + P(w)*Ptail(c)
(recall Ptail(c) = tail cumprob, sum of P(char >= c))
(Ptail(a) = 1.0)
So we can draw a picture like an arithmetic coder does, splitting ranges and specifying cumprob intervals to choose our string :
You just keep splitting the largest interval until you have a # of intervals = to the desired number of codes. (8 here for 3-bit Tunstall).
At that point, you still have an arithmetic coder - the intervals are fractional sizes and are correctly sized to the probabilities of each code word. eg. the interval for 01 is P(0)*P(1).
In the final step, each interval is just mapped to a dictionary codeword index. This gives each codeword an equal 1/|dictionary_size| probability interval, which in general is not quite right.
This is where the coding inefficiency comes from - it's the difference between the correct interval sizes and the size that we snap them to.
(ADD : in the drawing I wrote "snap to powers of 2" ; that's not the best way of describing that; they're just snapping to the subsequent i/|dictionary_size|. In this case with dictionary_size = 8 those points are {0,1/8,1/4,3/8,1/2,..} which is why I was thinking about powers of 2 intervals.)
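If you want the splitting construction as code, here's a sketch (naive word probabilities, no Marlin adjustment ; not my production code, just the picture above turned into an algorithm) :

#include <queue>
#include <string>
#include <vector>

struct Node {
    std::string word;   // the characters of the word
    int state;          // children < state have already been split off
    double pWord;       // P_naive(word) = product of its char probabilities
    double prob;        // remaining interval = pWord * Ptail(state)
};

std::vector<Node> buildPluralTunstall(const std::vector<double> &P, int dictSize)
{
    int n = (int)P.size();
    std::vector<double> Ptail(n + 1, 0.0);       // Ptail[c] = sum of P[x] for x >= c
    for (int c = n - 1; c >= 0; c--) Ptail[c] = Ptail[c + 1] + P[c];

    auto cmp = [](const Node &a, const Node &b) { return a.prob < b.prob; };
    std::priority_queue<Node, std::vector<Node>, decltype(cmp)> heap(cmp);

    for (int c = 0; c < n; c++)                  // start with the single-character words
        heap.push({ std::string(1, (char)c), 0, P[c], P[c] });
    int count = n;

    while (count < dictSize) {
        Node w = heap.top(); heap.pop();         // the largest remaining interval
        // split : W -> W+char(state) , and W moves on to state+1
        Node child { w.word + (char)w.state, 0, w.pWord * P[w.state], w.pWord * P[w.state] };
        heap.push(child);
        w.state++;
        if (w.state < n) {
            w.prob = w.pWord * Ptail[w.state];   // parent keeps its residual interval
            heap.push(w);
            count++;                             // net gain of one interval
        }
        // else : every child of W has been split off, so W stops being a codeword
        // and the count is unchanged (the child simply replaced it)
    }

    std::vector<Node> dict;                      // dictSize intervals -> codeword indices
    while (!heap.empty()) { dict.push_back(heap.top()); heap.pop(); }
    return dict;
}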
Plural Tunstall VTF coding in general is extremely fast to decode. It works best with 12-bit tables (must stay in L1), which means it only works well at entropy <= 4 bpb.
Marlin introduces an improved word probability model that captures the way the first letter probability is skewed by the longer-string exclusion. (this is just like the LAM exclusion in LZ)
That is, after coding with word W, the chars that have child words in the dictionary (Wa,Wb) cannot be the first character of the next word, so the probability of the next word's first char being >= c is increased.
The Marlin word probability model improves compression by 1-4% over naive plural Tunstall.
The simplest implementation of the Marlin probability adjustment is like this :
P_naive(W) = Prod[chars c in W] P('c')
P_word(W) = P_scale_first_char( W[0] ) * P_naive(W) * Ptail( num_children(W) )
(where Ptail is the tail-cumulative-proability :
Ptail(x) = Sum[ c >= x ] P('c') (sum of character probabilities to end)
)
(instead of scaling by Ptail you can subtract off the child node probabilities as you make them)
on the first iteration of dictionary building, set P_scale_first_char() = 1.0
this is "naive plural Tunstall"
after building the dictionary, you now have a set of words and P(W) for each
compute :
P_state(i) = Sum[ words W with i children ] P(W)
(state i means only chars >= i can follow)
(iterating P_state -> P(W) a few times here is optional ; in practice it doesn't help)
P_scale_first_char(c) = Sum[ i <= c ] P_state(i) / P_tail(i)
(P_scale_first_char = P_state_tail)
then repeat dic building one more time
(optionally repeat more but once seems fine)
And that's it!
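In code the scale table is just a prefix sum. A sketch of my reading of it (not the Marlin reference code ; assumes every char has nonzero probability) :

#include <vector>

//   P_state[i] : sum of P(W) over dictionary words W with exactly i children
//   Ptail[i]   : sum of P[x] for x >= i   (so Ptail[0] == 1.0)
std::vector<double> makeScaleFirstChar(const std::vector<double> &P_state,
                                       const std::vector<double> &Ptail)
{
    size_t n = Ptail.size();                 // alphabet size
    std::vector<double> scale(n, 0.0);
    double sum = 0.0;
    for (size_t c = 0; c < n; c++) {
        sum += P_state[c] / Ptail[c];        // Sum[ i <= c ] P_state(i) / Ptail(i)
        scale[c] = sum;
    }
    return scale;
}
// then on the next dictionary-building pass :
//   P_word(W) = scale[ W[0] ] * P_naive(W) * Ptail[ num_children(W) ]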
What do these values actually look like? I thought it might be illuminating to dump them.
This is on a pretty skewed file so the effect is large, the larger the MPS probability the bigger
the effect.
R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
P of chars =
[0] 0.46841475525106835 double
[1] 0.11553621994280693 double
[2] 0.11508546535801611 double
[3] 0.059216055763873253 double
[4] 0.058911526220693004 double
[5] 0.036597584870921435 double
[6] 0.036475518749229136 double
[7] 0.018035269480036465 double
[8] 0.017757441900976400 double
[9] 0.010309501194595012 double
[10] 0.0097379520102128646 double
P_state =
[0] 0.62183816678155190 double
[1] 0.15374679894466811 double
[2] 0.032874234239563829 double
[3] 0.063794018822874776 double
[4] 0.026001940955215786 double
[5] 0.011274295764837820 double
[6] 0.028098911350290755 double
[7] 0.012986055279597277 double
[8] 0.0013397794289329405 double
.. goes to zero pretty fast ..
P_scale_first_char =
[0] 0.62183816678155202 double
[1] 0.91106139196208336 double
[2] 0.99007668169839014 double
[3] 1.2020426052137052 double
[4] 1.3096008656256881 double
[5] 1.3712643080249607 double
[6] 1.5634088663186672 double
[7] 1.6817189544649160 double
[8] 1.6963250203077103 double
[9] 1.9281295477496172 double
[10] 1.9418127462353780 double
[11] 2.0234438458996773 double
[12] 2.0540542047979415 double
[13] 2.1488636999462676 double
[14] 2.2798060244386895 double
[15] 2.2798060244386895 double
[16] 2.2798060244386895 double
[17] 2.3660205062350039 double
[18] 2.3840557757150402 double
[19] 2.3840557757150402 double
[20] 2.3840557757150402 double
[21] 2.4066061022686838 double
[22] 2.5584098628550294 double
[23] 2.5584098628550294 double
[24] 2.6690752676624321 double
.. gradually goes up to 4.14
estimate real first char probability
= P(c) * P_scale_first_char(c) =
[0] 0.29127817269875372 double
[1] 0.10526058936313110 double
[2] 0.11394343565337962 double
[3] 0.071180221940886246 double
[4] 0.077150585733949978 double
[5] 0.050184961893408854 double
[6] 0.057026149416117611 double
[7] 0.030330254533459933 double
the effect is to reduce the skewing of the probabilities in the post-word alphabet.
I think this problem space is not fully explored yet and I look forward to seeing more work in this domain in the future.
I'm not convinced that the dictionary building scheme here is optimal. Is there an optimal plural VTF dictionary?
I think maybe there's some space between something like RANS and Tunstall. Tunstall inputs blocks of N bits and doesn't carry any state between them. RANS pulls blocks of N bits and carries full state of unused value range between them (which is therefore a lot slower because of dependency chains). Maybe there's something in between?
Another way to think about this Marlin ending state issue is a bit like an arithmetic coder. When you send the index of a word that has children, you have not only sent that word, you've also sent information about the *next* word. That is, some fractional bits of information that you'd like to carry forward.
Say you have {W,Wa,Wb} in the dictionary and you send a W. You've also sent that the next word start with >= c. That's like saying your arithmetic coder cumprob is >= (Pa+Pb). You've refined the range of the next interval.
This could be done with Tunstall by doing the multi-table method (separate tables for state 0,1, and 2+), but unfortunately that doesn't fit in L1.
BTW you can think of Tunstall in an arithmetic codey way. Maybe I'll draw a picture because it's easier to show that way...
First, there's no point in making the T_ij transition matrix they talk about.
Back in "Understanding Marlin" you may recall I presented the algorithm thusly :
P_state(i) is given from a previous iteration and is constant
build dictionary using Marlin word model
we now have P(W) for all words in our dic
Use P(W) to compute new P_state(i)
optionally iterate a few times (~ 10 times) :
use that P_state to compute adjusted P(W)
use P(W) to compute new P_state
iterate dictionary building again (3-4 times)
in the paper (and code) they do it a bit differently. They compute the state transition matrix,
which is :
T_ij = Sum[ all words W that end in state S_i ] P(W|S_j)
this is the probability that if you started in state j you will wind up in state i
instead of iterating P_state -> P(W) , they iterate :
T <- T * T
and then P_state(i) = T_i0
I tested both ways and they produce the exact same result, but just doing it through the P(W) computation is far simpler and faster. The matrix multiply is O(alphabet^3) while the P way is only O(alphabet+dic_size)
Also for the record - I have yet to find a case where iterating to convergence here actually helps.
If you just make P_State from PW once and don't iterate, you get 99% of the win. eg :
laplacian distribution :
no iteration :
0.67952 : 1,000,000 -> 503,694 = 4.030 bpb = 1.985 to 1
iterate 10X :
0.67952 : 1,000,000 -> 503,688 = 4.030 bpb = 1.985 to 1
You *do* need to iterate the dictionary build. I do it 4 times. 3 times would be fine though,
heck 2 is probably fine.
4: 0.67952 : 1,000,000 -> 503,694 = 4.030 bpb = 1.985 to 1
3: 0.67952 : 1,000,000 -> 503,721 = 4.030 bpb = 1.985 to 1
2: 0.67952 : 1,000,000 -> 503,817 = 4.031 bpb = 1.985 to 1
The first iteration builds a "naive plural Tunstall" dictionary; the P_state is made from that, second
iteration does the first "Marlin" dictionary build.
In general I think they erroneously come to the conclusion that plural Tunstall dictionaries are really slow to create. They're only 1 or 2 orders of magnitude slower than building a Huffman tree, certainly not slow compared to many encoder speeds. Sure sure if you want super fast encoding you wouldn't want to do it, but otherwise it's totally possible to build the dictionaries for each use.
There's a lot of craziness in the Marlin code that makes their dic build way slower than it should be.
Some is just over-C++ madness, some is failure to factor out constant expressions.
the word is :
struct Word : public std::vector<uint8_t> { };
and the dictionary is :
std::vector<Word> W;
(with no reserves anywhere)
yeah that's a problem. May I suggest :
struct Word {
uint64 chars;
int len;
};
also reserve() is good and calling size() in loops is bad.
This is ouchy :
virtual double phi(const Word &) const = 0;
and this is even more ouchy :
virtual std::vector<Word> split(const Word &) const = 0;
The P(w) function that's used frequently is bad.
The key part is :
for (size_t t = 0; t<=w[0]; t++) {
double p = PcurrState[t];
p *= P[w[0]]/PnextState[t];
ret += p;
}
which you can easily factor out the common P[w[0]] from :
int c0 = w[0];
for (size_t t = 0; t<= c0; t++) {
ret += PcurrState[t]/PnextState[t];
}
ret *= P[c0];
but even more significant would be to realize that PcurrState (my P_state) and
PnextState (my P_tail) are not updated during dic building at all! They're constant
during that phase, so that whole thing can be precomputed and put in a table.
Then this is just :
int c0 = w[0];
ret = PcurrState_over_PnextState_partial_sum[c0];
ret *= P[c0];
that also gives us a chance to state clearly (again) the difference between "Marlin" and
naive plural Tunstall. It's that one table right there.
int c0 = w[0];
ret = 1.0;
ret *= P[c0];
this is naive plural Tunstall. It comes down to a modified probability table for the first letter in
the word.
Recall that :
P_naive(W) = Prod[ chars c in W ] P(c)
simple P_word(W) = P_naive(W) * Ptail( num_children(W) )
Reading Yamamoto and Yokoo "Average-Sense Optimality and Competitive Optimality for Almost Instantaneous VF Codes".
They construct the naive plural Tunstall VF dictionary. They are also aware of the Marlin-style state transition problem. (that is, partial nodes leave you in a state where some symbols are excluded).
They address the problem by constructing multiple parse trees, one for each initial state S_i. In tree T_i you know that the first character is >= i so all words that start with lower symbols are excluded.
This should give reasonably more compression than the Marlin approach, obviously with the cost of having multiple dictionaries.
In skewed alphabet cases where the MPS is very probable, this should be significant because words that start with the MPS dominate the dictionary, but in all states S_1 and higher those words cannot be used. In fact I conjecture that even having just 2 or 3 trees should give most of the win. One tree for state S_0, one for S_1 and the last for all states >= S_2. In practice this is problematic because the multiple code sets would fall out of cache and it adds a bit of decoder complexity to select the following tree.
There's also a continuity between VTF codes and blocked arithmetic coders. The Yamamoto-Yokoo scheme is like a way of carrying the residual information between blocked transmissions, similar to multi-table arithmetic coding schemes.
Phhlurge.
I just went and got the marlin code to compile in VS 2015. Bit of a nightmare. I wanted to confirm I didn't screw anything up in my implementation.
two-sided Laplacian distribution centered at 0
(this is what the Marlin code assumes)
r = 0.67952
H = 3.798339
my version of Marlin-probability plural Tunstall :
0.67952 : 1,000,000 -> 503,694 = 4.030 bpb = 1.985 to 1
Marlin reference code : 1,000,000 -> 507,412
naive plural Tunstall :
0.67952 : 1,000,000 -> 508,071 = 4.065 bpb = 1.968 to 1
I presume the reason they compress worse than my version is because they make dictionaries for a handful of Laplacian distributions and then pick the closest one. I make a dictionary for the actual char counts in the array, so their dictionary is slightly mismatched to the actual distribution.
Marlin :
loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals : 1,111,673 -> 1,286,166 = 9.256 bpb = 0.864 to 1
decode_time2 : seconds:0.0022 ticks per: 3.467 b/kc : 288.41 MB/s : 498.66
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp: 1,572,918 -> 618,447 = 3.145 bpb = 2.543 to 1
decode_time2 : seconds:0.0012 ticks per: 1.281 b/kc : 780.92 MB/s : 1350.21
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1 : 53,161 -> 35,763 = 5.382 bpb = 1.486 to 1
decode_time2 : seconds:0.0001 ticks per: 1.988 b/kc : 503.06 MB/s : 869.78
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC : 513,216 -> 140,391 = 2.188 bpb = 3.656 to 1
decode_time2 : seconds:0.0002 ticks per: 0.800 b/kc : 1250.71 MB/s : 2162.48
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab : 190,428 -> 68,511 = 2.878 bpb = 2.780 to 1
decode_time2 : seconds:0.0001 ticks per: 1.031 b/kc : 969.81 MB/s : 1676.80
total bytes out : 1974785
naive plural Tunstall :
loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals : 1,111,673 -> 1,290,015 = 9.283 bpb = 0.862 to 1
decode_time2 : seconds:0.0022 ticks per: 3.443 b/kc : 290.45 MB/s : 502.18
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp: 1,572,918 -> 627,747 = 3.193 bpb = 2.506 to 1
decode_time2 : seconds:0.0012 ticks per: 1.284 b/kc : 779.08 MB/s : 1347.03
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1 : 53,161 -> 35,934 = 5.408 bpb = 1.479 to 1
decode_time2 : seconds:0.0001 ticks per: 1.998 b/kc : 500.61 MB/s : 865.56
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC : 513,216 -> 145,980 = 2.276 bpb = 3.516 to 1
decode_time2 : seconds:0.0002 ticks per: 0.826 b/kc : 1211.09 MB/s : 2093.97
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab : 190,428 -> 74,169 = 3.116 bpb = 2.567 to 1
decode_time2 : seconds:0.0001 ticks per: 1.103 b/kc : 906.80 MB/s : 1567.86
total bytes out : 1995503
About the files :
lzt24.literals are the literals left over after LZ-parsing (LZQ1) lzt24
like all LZ literals they are high entropy and thus do terribly in Tunstall
monarch.tga.rrz_filtered.bmp is the image residual after filtering with my DPCM
(it actually has a BMP header on it which is giving Tunstall a harder time
than if I stripped the header)
paper1 & pic are standard
tabdir.tab is a text file of a dir listing with lots of tabs in it
For speed comparison, this is the Oodle Huffman on the same files :
loading file (0/5) : lzt24.literals
ooHuffman1 : ed...........................................................
ooHuffman1 : 1,111,673 -> 1,036,540 = 7.459 bpb = 1.072 to 1
encode : 8.405 millis, 13.07 c/b, rate= 132.26 mb/s
decode : 1.721 millis, 2.68 c/b, rate= 645.81 mb/s
ooHuffman1,1036540,8405444,1721363
loading file (1/5) : monarch.tga.rrz_filtered.bmp
ooHuffman1 : ed...........................................................
ooHuffman1 : 1,572,918 -> 586,839 = 2.985 bpb = 2.680 to 1
encode : 7.570 millis, 8.32 c/b, rate= 207.80 mb/s
decode : 2.348 millis, 2.58 c/b, rate= 669.94 mb/s
ooHuffman1,586839,7569562,2347859
loading file (2/5) : paper1
ooHuffman1 : 53,161 -> 33,427 = 5.030 bpb = 1.590 to 1
encode : 0.268 millis, 8.70 c/b, rate= 198.67 mb/s
decode : 0.080 millis, 2.60 c/b, rate= 665.07 mb/s
ooHuffman1,33427,267579,79933
loading file (3/5) : PIC
ooHuffman1 : 513,216 -> 106,994 = 1.668 bpb = 4.797 to 1
encode : 2.405 millis, 8.10 c/b, rate= 213.41 mb/s
decode : 0.758 millis, 2.55 c/b, rate= 677.32 mb/s
ooHuffman1,106994,2404854,757712
loading file (4/5) : tabdir.tab
ooHuffman1 : 190,428 -> 58,307 = 2.450 bpb = 3.266 to 1
encode : 0.926 millis, 8.41 c/b, rate= 205.70 mb/s
decode : 0.279 millis, 2.54 c/b, rate= 681.45 mb/s
ooHuffman1,58307,925742,279447
Tunstall is crazy fast. And of course that's a rather basic implementation of the decoder, I'm sure it could get faster.
Is there an application for plural Tunstall? I'm not sure. I tried it back in 2015 as an idea for literals in Mermaid/Selkie and abandoned it as not very relevant there. It works on low-entropy order-0 data (like image prediction residuals).
Of course if you wanted to test it against the state of the art you should consider SIMD Ryg RANS or GPU RANS. You should consider something like TANS with multiple symbols in the output table. You should consider merged-symbol codes, perhaps using escapes, perhaps runlen transforms. See for example "crblib/huffa.c" for a survey of Huffman ideas from 1996 (pre-runtransform, blocking MPS's, order-1-huff, multisymbol output, etc.)
Geometric distribution , P(n) = r^n
I am comparing "Marlin" = plural Tunstall with P_state word probability model vs. naive plural Tunstall (P_word = P_naive). In both cases, words up to 8 bytes and 12-bit codes.
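For reference, a sketch of generating this kind of test data (a hypothetical generator, not the actual test harness); the geometric distribution is truncated to the 256-symbol alphabet and normalized :

#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

std::vector<uint8_t> make_geometric_bytes(double r, size_t count, uint32_t seed = 1234)
{
    std::vector<double> weights(256);
    for (int n = 0; n < 256; n++) weights[n] = std::pow(r, n);            // P(n) proportional to r^n
    std::discrete_distribution<int> dist(weights.begin(), weights.end()); // normalizes the weights
    std::mt19937 rng(seed);
    std::vector<uint8_t> out(count);
    for (uint8_t & b : out) b = (uint8_t) dist(rng);
    return out;
}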
Marlin :
filelen = 1000000
H = 7.658248
sym_count = 256
r=0.990 : 1,000,000 -> 1,231,686 = 9.853 bpb = 0.812 to 1
decode_time2 : seconds:0.0018 ticks per: 3.064 b/kc : 326.42 MB/s : 564.38
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.985 : 1,000,000 -> 1,126,068 = 9.009 bpb = 0.888 to 1
decode_time2 : seconds:0.0016 ticks per: 2.840 b/kc : 352.15 MB/s : 608.87
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.978 : 1,000,000 -> 990,336 = 7.923 bpb = 1.010 to 1
decode_time2 : seconds:0.0014 ticks per: 2.497 b/kc : 400.54 MB/s : 692.53
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.967 : 1,000,000 -> 862,968 = 6.904 bpb = 1.159 to 1
decode_time2 : seconds:0.0013 ticks per: 2.227 b/kc : 449.08 MB/s : 776.45
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.950 : 1,000,000 -> 779,445 = 6.236 bpb = 1.283 to 1
decode_time2 : seconds:0.0012 ticks per: 2.021 b/kc : 494.83 MB/s : 855.57
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.927 : 1,000,000 -> 701,049 = 5.608 bpb = 1.426 to 1
decode_time2 : seconds:0.0011 ticks per: 1.821 b/kc : 549.09 MB/s : 949.37
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.892 : 1,000,000 -> 611,238 = 4.890 bpb = 1.636 to 1
decode_time2 : seconds:0.0009 ticks per: 1.577 b/kc : 633.93 MB/s : 1096.07
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.842 : 1,000,000 -> 529,743 = 4.238 bpb = 1.888 to 1
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.773 : 1,000,000 -> 450,585 = 3.605 bpb = 2.219 to 1
decode_time2 : seconds:0.0007 ticks per: 1.237 b/kc : 808.48 MB/s : 1397.86
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.680 : 1,000,000 -> 373,197 = 2.986 bpb = 2.680 to 1
decode_time2 : seconds:0.0006 ticks per: 1.053 b/kc : 950.07 MB/s : 1642.67
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.560 : 1,000,000 -> 298,908 = 2.391 bpb = 3.346 to 1
decode_time2 : seconds:0.0005 ticks per: 0.891 b/kc : 1122.53 MB/s : 1940.85
vs.
plural Tunstall :
filelen = 1000000
H = 7.658248
sym_count = 256
r=0.99000 : 1,000,000 -> 1,239,435 = 9.915 bpb = 0.807 to 1
decode_time2 : seconds:0.0017 ticks per: 2.929 b/kc : 341.46 MB/s : 590.39
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.98504 : 1,000,000 -> 1,130,025 = 9.040 bpb = 0.885 to 1
decode_time2 : seconds:0.0016 ticks per: 2.814 b/kc : 355.36 MB/s : 614.41
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.97764 : 1,000,000 -> 990,855 = 7.927 bpb = 1.009 to 1
decode_time2 : seconds:0.0014 ticks per: 2.416 b/kc : 413.96 MB/s : 715.73
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.96665 : 1,000,000 -> 861,900 = 6.895 bpb = 1.160 to 1
decode_time2 : seconds:0.0012 ticks per: 2.096 b/kc : 477.19 MB/s : 825.07
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.95039 : 1,000,000 -> 782,118 = 6.257 bpb = 1.279 to 1
decode_time2 : seconds:0.0011 ticks per: 1.898 b/kc : 526.96 MB/s : 911.12
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.92652 : 1,000,000 -> 704,241 = 5.634 bpb = 1.420 to 1
decode_time2 : seconds:0.0010 ticks per: 1.681 b/kc : 594.73 MB/s : 1028.29
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.89183 : 1,000,000 -> 614,061 = 4.912 bpb = 1.629 to 1
decode_time2 : seconds:0.0008 ticks per: 1.457 b/kc : 686.27 MB/s : 1186.57
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.84222 : 1,000,000 -> 534,300 = 4.274 bpb = 1.872 to 1
decode_time2 : seconds:0.0007 ticks per: 1.254 b/kc : 797.33 MB/s : 1378.58
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.77292 : 1,000,000 -> 454,059 = 3.632 bpb = 2.202 to 1
decode_time2 : seconds:0.0006 ticks per: 1.078 b/kc : 928.04 MB/s : 1604.58
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.67952 : 1,000,000 -> 377,775 = 3.022 bpb = 2.647 to 1
decode_time2 : seconds:0.0005 ticks per: 0.935 b/kc : 1069.85 MB/s : 1849.77
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.56015 : 1,000,000 -> 304,887 = 2.439 bpb = 3.280 to 1
decode_time2 : seconds:0.0004 ticks per: 0.724 b/kc : 1381.21 MB/s : 2388.11
Very very small difference. eg :
plural Tunstall :
H = 3.986386
sym_count = 78
r=0.84222 : 1,000,000 -> 534,300 = 4.274 bpb = 1.872 to 1
Marlin :
H = 3.986386
sym_count = 78
r=0.842 : 1,000,000 -> 529,743 = 4.238 bpb = 1.888 to 1
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51
Yes the Marlin word probability estimator helps a little bit, but it's not massive.
I'm not surprised but a bit sad to say that once again the Marlin paper compares to ridiculous straw men and doesn't compare to the most obvious, naive, and well known (see Savari for example, or Yamamoto & Yokoo) similar alternative - just doing plural Tunstall/VTF without the Marlin word probability model.
Entropy above 4 or so is terrible for 12-bit VTF codes.
The Marlin paper uses a "percent efficiency" scale which I find rather misleading. For example, this :
H = 3.986386
sym_count = 78
r=0.842 : 1,000,000 -> 529,743 = 4.238 bpb = 1.888 to 1
is what I would consider pretty poor entropy coding. Entropy of 3.98 -> 4.23 bpb is way off.
But as a "percent efficiency" it's 94% , which is really high on their graphs.
The more standard and IMO useful way to show this is a delta of your output bits minus the entropy, eg.
excess = 4.238 - 3.986 = 0.252
about a quarter of a bit per byte wasted. A true arithmetic coder has an excess around 0.001 bpb typically.
The worst you can ever do is an excess of 1.0, which occurs in any integer-bit entropy coder as the probability of the MPS goes towards 1.0.
Part of my hope / curiosity in investigating this was wondering whether the Marlin procedure would help at all with the way Tunstall VTF codes really collapse in the H > 4 range , and the answer is - no , it doesn't help with that problem at all.
Anyway, on to more results.
The classical Tunstall algorithm constructs VTF (variable to fixed) codes for binary memoryless (order-0) sources. It constructs the optimal code.
You start with dictionary = { "0","1" } , the single bit binary strings. (or dictionary = the null string if you prefer)
You then split one word W in the dictionary to make two new words "W0" and "W1" ; when you split W, it is removed since all possible following symbols now have words in the dictionary.
The algorithm is simple and iterative :
while dic size < desired
{
find best word W to split
remove W
add W0 and W1
}
each step increments dictionary size by +1
What is the best word to split?
Our goal is to maximize average code length :
A = Sum[words] P(W) * L(W)
under the split operation, what happens to A ?
W -> W0, W1
delta(A) = P(W0) * L(W0) + P(W1) * L(W1) - P(W) * L(W)
P(W0) = P(W)*P(0)
P(W1) = P(W)*P(1)
L(W0) = L(W)+1
.. simplify ..
delta(A) = P(W)
so to get the best gain of A, you just split the word with maximum probability.
Note of course this is just greedy optimization of A, which might not reach the true optimum; in fact it does, and the proof is pretty neat, but I won't do it here.
You can naively build the optimal Tunstall code in NlogN time with a heap, or slightly more cleverly you can use two linear queues for left and right children and do it in O(N) time.
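For reference, a minimal sketch of the NlogN heap version (hypothetical code, binary alphabet only) :

#include <queue>
#include <string>
#include <vector>

struct TWord { std::string bits; double prob; };
struct LowerProb { bool operator()(const TWord & a, const TWord & b) const { return a.prob < b.prob; } };

// p1 = probability of a 1 bit ; dicSize = desired number of dictionary words
std::vector<TWord> build_binary_tunstall(double p1, size_t dicSize)
{
    double p0 = 1.0 - p1;
    std::priority_queue<TWord, std::vector<TWord>, LowerProb> heap;
    heap.push({"0", p0});
    heap.push({"1", p1});
    while (heap.size() < dicSize)             // each split grows the dictionary by +1
    {
        TWord w = heap.top(); heap.pop();     // split the most probable word
        heap.push({w.bits + "0", w.prob * p0});
        heap.push({w.bits + "1", w.prob * p1});
    }
    std::vector<TWord> dic;
    while (!heap.empty()) { dic.push_back(heap.top()); heap.pop(); }
    return dic;
}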
Easy peasy, nice and neat. But this doesn't work the same way for the large-alphabet scenario.
Now onto something that is a bit messy that I haven't figured out.
For "plural Tunstall" we aren't considering adding all children, we're only considering adding the next child.
A "split" operation is like :
start with word W with no children
W ends in state 0 (all chars >= 0 are possible)
the next child of W to consider is "W0"
(symbols sorted so most probable is first)
if we add "W0" then W goes to state 1 (only chars >= 1 possible)
W_S0 -> "W0" + W_S1
W_S1 -> "W1" + W_S2
etc.
again, we want to maximize A, the average codelen. What is delta(A) under a split operation?
delta(A) = P("W0") * L("W0") + P(W_S1) * L(W) - P(W_S0) * L(W)
delta(A) = P("W0") + (P("W0") + P(W_S1) - P(W_S0)) * L(W)
P("W0") + P(W_S1) - P(W_S0) = 0
so
delta(A) = P("W0")
it seems like in plural Tunstall you should "split" the word that has maximum P("W0") ; that is maximize the
probability of the word you *create* not the one you *remove*. This difference arises from the fact that we are only
making one child of longer length - the other "child" in the pseudo-split here is actually the same parent node again,
just with a reduced exit state.
In practice that doesn't seem to be so. I experimentally measured that choosing to split the word with maximum P(W) is better than splitting the word with maximum P(child).
I'm not sure what's going wrong with this analysis. In the Marlin code they just split the word with maximum P(W) by analogy to true Tunstall, which I'm not convinced is well justified in plural Tunstall.
While I'm bringing up mysteries, I tried optimal-parsing plural Tunstall. Obviously with "true Tunstall" or any prefix-free code that's silly, the greedy parse is the only parse. But with plural Tunstall, you might have "aa" and also "aaa" in the tree. In this scenario, by analogy to LZ, the greedy parse is usually imperfect because it is sometimes better to take a shorter match now to get a longer one on the next word. So maybe some kind of lazy parse, or heck, a full optimal parse. (the trivial LZSS backward parse works well here).
Result : optimal-parsed plural Tunstall is identical to greedy. Exactly, so it must be provable. I don't see an easy way to show that the greedy parse is optimal in the plural case. Is it true for all plural dictionaries? (I doubt it) What are the conditions on the dictionary that guarantee it?
I think that this is because for any string in the dictionary, all shorter substrings of that string are in the dictionary too. This makes optimal parsing useless. But I think that property is a coincidence/bug of how Marlin and I did the dictionary construction, which brings me to :
Marlin's dictionary construction method and the one I was using before, which is slightly different, both have the property that they never remove parent nodes when they make children. I believe this is wrong but I haven't been able to make it work a different way.
The case goes like this :
you have word W in the dictionary with no children
you have following chars a,b,c,d. a and b are very probable, c and d are very rare.
P(W) initially = P_init(W)
you add child 'a' ; W -> Wa , W(b+)
P(W) -= P(Wa)
add child 'b'
W(b+) -> Wb , W(c+)
P(W) -= P(Wb)
now the word W in the dictionary has
P(W) = P(Wc) + P(Wd)
these are quite rare, so P(W) now is very small
W is no longer a desirable dictionary entry.
We got all the usefulness of W out in Wa and Wb, we don't want to keep W in the dictionary just to be able to code it with rare
following c's and d's - we'd like to now remove W.
In particular, if the current P(W) of the parent word is now lower than a child we could make somewhere else by splitting, remove W and split the other node. Or something like that - here's where I haven't quite figured out how to make this idea work in practice.
So I believe that both Marlin and my code are NOT making optimal general VTF plural dictionaries, they are making them under the (unnecessary) constraint of the shorter-substring-is-present property.
I label the symbols 'a','b','c' from most probable to least probable. I will use single quotes for symbols, and double quotes for words
in the dictionary. So :
P('a') , P('b'), etc. are given
P('a') >= P('b') >= P('c') are ordered
I will use the term "Marlin" to describe the way they estimate the probability of dictionary words. (everything else in the paper is
either obvious or well known (eg. the way the decoder works), so the innovation and interesting part is the word probability estimation,
so that is what I will call "Marlin" , the rest is just "Tunstall").
Ok. To build a Tunstall dictionary your goal is to maximize the average input length, which is :
average input length = Sum[words] { P(word) * L(word) }
since the output length is fixed, maximizing the input length maximizes compression ratio.
In the original Tunstall algorithm on binary input alphabet, this is easily optimized by splitting the most probable word, and adding its two children. This can be done in linear time using two queues for left and right (0 and 1) children.
The Marlin algorithm is all about estimating P(word).
The first naive estimate (what I did in my 12/4/2015 report) is just to multiply the character probabilities :
P(word) = Prod[c in word] P('c')
that is
P("xyz") = P('x') * P('y') * P('z')
but that's obviously not right. The reason is that the existence of words in the dictionary affects the probability of
other words.
In particular, the general trend is that the dictionary will accumulate words with the most probable characters (a,b,c) which will make the effective probability of the other letters in the remainder greater.
For example :
Start the dictionary with the 256 single-letter words
At this point the naive probabilities are exact, that is :
P("a") (the word "a") = P('a') (the letter 'a')
Now add the most probable bigram "aa" to the dictionary.
We now have a 257-word dictionary. What are the probabilities when we code from it ?
Some of the occurrences of the letter 'a' will now be coded with the word "aa"
That means P("a") in the dictionary is now LESS than P('a')
Now add the next most probable words, "ab" and "ba"
The probability of "a" goes down more, as does P("b")
Now if we consider the choice of what word to add next - is it "ac" or "bb" ?
The fact that some of the probability of those letters has been used by words in the dictionary affects
our estimate, which affects our choice of how to build the dictionary.
so that's the intuition of the problem, and the Marlin algorithm is one way to solve it.
Let's do it intuitively again in a bit more detail.
There are two issues : the way the probability of a shorter word is reduced by the presence of longer words, and the way the probability of raw characters that start words is changed by the probability of them coming after words in the dictionary.
Say you have word W in your dictionary
and also some of the most probable children.
W, Wa, Wb are in dictionary
Wc, Wd are not
We'll say word "W" has 2 children (Wa and Wb).
So word "W" will only be used from the dictionary if the children are NOT a or b
(since in that case the longer word would be used).
So if you have seen word "W" so far, to use word W, the next character must be terminal,
eg. one that doesn't correspond to another child.
So the probability of word W should be adjusted by :
P(W) *= P(c) + P(d)
Because we are dealing with sorted probability alphabets, we can describe the child set with just one integer to
indicate which ones are in the dictionary. In Marlin terminology this is c(w), and corresponds to the state Si.
If we make the tail cumulative probability sum :
Ptail(x) = Sum[ c >= x ] P('c') (sum of character probabilities to end)
Ptail(255) = P(255)
Ptail(254) = P(254) + P(255)
Ptail(0) = sum of all P('c') = 1.0
then the adjustment is :
P(W) *= P(c) + P(d)
P(W) *= Ptail('c')
P(W) *= Ptail(first char that's not a child)
P(W) *= Ptail( num_children(W) )
(I'm zero-indexing, so no +1 here as in the Marlin paper, they 1-base-index)
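A minimal sketch of building that cumulative tail table (it's just the definition above in code) :

#include <vector>

// Ptail[x] = Sum[ c >= x ] P(c) , zero-indexed ; Ptail[0] == 1.0 for a normalized alphabet
std::vector<double> make_ptail(const std::vector<double> & P)
{
    std::vector<double> Ptail(P.size() + 1, 0.0);
    for (int c = (int)P.size() - 1; c >= 0; c--)
        Ptail[c] = Ptail[c + 1] + P[c];
    return Ptail;
}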
ADD : I realized there's a simpler way to think about this. When you add a child word, you simply remove that
probability from the parent. That is :
let P_init(W) be the initial probability of word W
when it is first added to the dictionary and has no children
track running estimate of P(W)
when you add child 'x' making word "Wx"
The child word probability is initialized from the parent's whole probability :
P_init(Wx) = P_init(W) * P('x')
And remove that from the running P(W) :
P(W) -= P_init(Wx)
That is you just make P(W) the probability of word W, excluding children that exist.
Once you add all children, P(W) will be zero and the word is useless.
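A minimal sketch of that bookkeeping (hypothetical struct and field names, not the Marlin code) :

struct DicWord
{
    double p_init       = 0.0; // probability when first added, before any children existed
    double p            = 0.0; // running P(W) : excludes children that are now in the dictionary
    int    num_children = 0;   // how many child words have been added so far
};

// adding child character 'x' (with probability Px) under parent word W :
void add_child(DicWord & parent, DicWord & child, double Px)
{
    child.p_init  = parent.p_init * Px;  // P_init(Wx) = P_init(W) * P('x')
    child.p       = child.p_init;
    parent.p     -= child.p_init;        // remove that mass from the running P(W)
    parent.num_children++;
}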
Okay, so that does the first issue (probability of words reduced by the presence of longer words). Now the next issue.
Consider the same simple example case first :
W,Wa,Wb are in dictionary, Wc,Wd are not
(no longer children of W are either)
Say you reach node "Wa"
there are no children of "Wa" in the dictionary, so all following characters are equally likely
This means that starting the next word, the character probabilities are equal to their original true probabilities
But say you reach node "W" and leave via 'c' or 'd'
In that case the next character must be 'c' or 'd' , it can never be 'a' or 'b'
So the probability of the next character being 'c' goes up by the probability of using word "W" and the
probability of being a 'c' after "W" , that's :
estimate_P('c') += P("W") * P('c') / ( P('c') + P('d') )
or
estimate_P('c') += P("W") * P('c') / Ptail( num_children(W) )
now you have to do this for all paths through the dictionary. But all ways to exit with a certain child count are similar,
so you can merge those paths to reduce the work. All words with 2 children will be in the same exit probability state
('a' and 'b' can't occur but chars >= 'c' can).
This is the Marlin state S_i. S_i means that character is >= i. It happens because you left the tree with a word that had i children.
When you see character 2 that can happen from state 0 or 1 or 2 but never states >= 3.
for estimating probability of word W
W can only occur in states where the first character W[0] is possible
that is state S_i with i <= W[0]
When character W[0] does occur in state S_i , the probability of that character is effectively higher,
because we know that chars < i can't occur.
Instead of just being P(char W[0]) , it's divided by Ptail(i)
(as in estimate_P('c') += P("W") * P('c') / Ptail( num_children(W) ) above)
So :
P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )
the probability of W is the sum of the probability of states it can start from
(recall states = certain terminal character sets)
times the probability of W given that state
let
P_naive(W) = Product[ chars c in W ] P(char 'c')
be the naive word probability, then :
P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )
is what we need. This is equation (1) in the Marlin paper.
The first term increases the probability of W for higher chars, because we know the more probable
lower chars can't occur in this state (because they found longer words in the dictionary)
The last term decreases the probability of W because it will only be used when the following character doesn't cause a longer word in the dictionary to be used
Now of course there's a problem. This P(W) probability estimate for words requires the probability of starting the word
from state S_i, which we don't know. If we had the P(W) then the P of states is just :
P(S_i) = Sum[ words W that have i children ] P(W)
so to solve this you can just iterate. Initialize the P(S_i) to some guess; the Marlin code just does :
P(state 0) = 1.0
P(all others) = 0.0
(recall state 0 is the state where all chars are possible, no exclusions, so
characters just have their original order-0 probability)
feeds that in to get P(W), feeds that to update P(S_i), and repeats to convergence.
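The state update step itself is trivial; a sketch, assuming you keep the current P(W) estimate and child count for each word (purely illustrative parallel arrays, not the Marlin code) :

#include <vector>

// P(S_i) = Sum[ words W that have i children ] P(W)
std::vector<double> update_state_probs(const std::vector<double> & word_probs,
                                       const std::vector<int> & word_children,
                                       int alphabet_size)
{
    std::vector<double> P_state(alphabet_size, 0.0);
    for (size_t w = 0; w < word_probs.size(); w++)
        P_state[word_children[w]] += word_probs[w];
    return P_state;
}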
To build the dictionary you simply find the word W with highest P(W) and split it (adding its next most probable child and increasing its child count by 1).
The Marlin code does this :
seed state probabilities
iterate
{
build dictionary greedily using fixed state probabilities
update state probabilities
}
That is, during the dictionary creation, word probabilities are estimated using state probabilities from the previous iteration. They hard-code
this to 3 iterations.
There is an alternative, which is to update the state probabilities as you go. Any time you do a greedy word split, you're changing 3 state probabilities, so that's not terrible. But changing the state probabilities means all your previous word estimate probabilities are now wrong, so you have to go back through them and recompute them. This makes it O(N^2) in the dictionary size, which is bad.
For reference, combining our above equations to make the word probability estimate :
P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )
P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )
so:
P(W) = P_naive(W) * Ptail( num_children(W) ) * Sum[ i <= W[0] ] P( state S_i ) / Ptail(i)
the second half of that can be tabulated between iterations :
P_state_tail(j) = Sum[ i <= j ] P( state S_i ) / Ptail(i)
so :
P(W) = P_naive(W) * Ptail( num_children(W) ) * P_state_tail( W[0] )
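Putting it together, a minimal sketch of the word probability estimate (P, Ptail, and P_state_tail are the per-symbol tables described above; names follow the text, not the Marlin source) :

#include <cstdint>
#include <vector>

double estimate_P_word(const std::vector<uint8_t> & W, int num_children,
                       const std::vector<double> & P,
                       const std::vector<double> & Ptail,
                       const std::vector<double> & P_state_tail)
{
    double p_naive = 1.0;
    for (uint8_t c : W) p_naive *= P[c];   // P_naive(W) = product of character probabilities
    return p_naive
         * Ptail[num_children]             // only used when the next char exits the tree at this word
         * P_state_tail[W[0]];             // Sum[ i <= W[0] ] P(S_i) / Ptail(i)
}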
you can see the "Marlin" aspect of all this is just in using this P(W) rather than P_naive(W) . How important
is it exactly to get this P(W) right? (and is it right?) We'll find out next time...
Tunstall was originally designed with binary alphabets in mind; variable to fixed on *binary*. In that context, doing full child trees (so dictionaries are full prefix trees and encoding is unique) makes sense. As soon as you do variable-to-fixed (hence VTF) on large alphabets, "plural" trees are obviously better and have been written about much in the past. With plural trees, encoding is not unique ("a" and "ab" are both in the dictionary).
There's a big under-specified distinction between VTF dictionaries that model higher level correlation and those that don't. eg. does P("ab") = P(a)*P(b) or is there order-1 correlation?
I looked at Tunstall codes a while ago (TR "failed experiment : Tunstall Codes" 12/4/2015). I didn't make this clear but in my experiment
I was looking at a specific scenario :
symbols are assumed to have only order-0 entropy
(eg. symbol probabilities describe their full statistics)
encoder transmits symbol probabilities (or the dictionary of words)
but there are other possibilities that some of the literature addresses. There are a lot of papers on "improved Tunstall" that use
the order-N probabilities (the true N-gram counts for the words rather than multiplying the probability of each character). Whether or
not this works in practice depends on context, eg. on LZ literals the characters are non-adjacent in the source so this might not make
sense.
There's a fundamental limitation with Tunstall in practice and a very narrow window where it makes sense.
On current chips, 12-bit words is ideal (because 4096 dwords = 16k = fits in L1). 16 bit can sometimes give much better compression, but falling out of L1 is a disaster for speed.
12-bit VTF words work great if the entropy of the source is <= 5 bits or so. As it goes over 5, you have too many bigrams that don't pack well into 12 bits, and the compression ratio starts to suffer badly (and decode speed suffers a bit).
I was investigating Tunstall in the case of normal LZ literals, where entropy is always in the 6-8 bpc range (because any more compressibility has been removed by the string-match portion of the LZ). In that case Tunstall just doesn't work.
Tunstall is best when entropy <= 3 bits or so. Not only do you get compression closer to entropy, you also get more decode speed.
Now for context, that's a bit of a weird place to just do entropy coding. Normally in low-entropy scenarios, you would have some kind of coder before just tossing entropy coding at it. eg. take DCT residuals, or any image residual situation. You will have lots of 0's and 1's so it looks like a very low entropy scenario for order-0 entropy, but typically you would remove that by doing RLE or something else so that the alphabet you hand to the entropy coder is higher entropy. (eg. JPEG does RLE of 0's and EOB).
Even if you did just entropy code on a low-entropy source, you might instead use a kind of cascaded coder. Again assuming something like
prediction residuals where the 0's and 1's are very common, you might make a two-stage alphabet that's something like :
alphabet 1 : {0,1,2,3+}
alphabet 2 : {3,4,...255}
Then with alphabet 1 you could pack 4 symbols per byte and do normal Huffman. Obviously a Huffman decode is a little slower than Tunstall, but you always get 4 symbols per decode so the output length is not variable, and the compression ratio is better.
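A minimal sketch of that split (hypothetical helper; both output streams would then go to a normal entropy coder) :

#include <cstdint>
#include <vector>

// stream "packed" holds four symbols from {0,1,2,escape} per byte ; stream "escapes" holds the raw bytes >= 3
void split_two_stage(const std::vector<uint8_t> & in,
                     std::vector<uint8_t> & packed, std::vector<uint8_t> & escapes)
{
    uint8_t cur = 0; int count = 0;
    for (uint8_t c : in)
    {
        uint8_t sym = (c < 3) ? c : 3;       // alphabet 1 : {0,1,2,3+}
        if (sym == 3) escapes.push_back(c);  // alphabet 2 : {3..255}
        cur = (uint8_t)((cur << 2) | sym);
        if (++count == 4) { packed.push_back(cur); cur = 0; count = 0; }
    }
    if (count) packed.push_back((uint8_t)(cur << (2 * (4 - count)))); // pad the last partial byte
}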
Tunstall for LZ literals might be interesting in a very fast LZ with MML 8 or so. (eg. things like ZStd level 1, which is also where multi-symbol-output Huff works well).
Point is - the application window here is pretty narrow, and there are other techniques that also address the same problem.
It would be very easy to make huge steps that contribute substantially to this problem.
1. Devise a new set of human perceptual quality rating experiments. Get vision scientists, statisticians, survey experiments, etc. involved to make the experiment well founded. Test lots of images and gather lots of score. Make half of them public and keep half private.
2. Do a public competition Netflix-prize style for a perceptual metric that can match the human scores. Let people submit code to score images and run that on your secret private test set.
Unlike Netflix prize, require that all submissions be open-source BSD/public-domain after the competition is over, with no patent encumbrance. Also set a maximum run time & memory use on the metrics so that they are constrained to be practical. The goal is to get actually usable code out of this.
3. Make a competition to take various lossy image compressor submissions and score the results perceptually, not from algorithmic metrics but by showing them to humans.
What we need is not more random algorithms (I'm looking at you, JPEG-XR, webp-lossy, Guetzli, etc.) which are not clearly better. We desperately need more data and solid ways to test "this is better or not".
There are tons of great data compression developers out there in the world (many of whom are not getting paid for their work) that could make great advances in an open competition.
Designing a perceptual test is very hard.
For example, when I'm allowed to A-B images (toggle between them) to look for flaws, the flaws that I see that way are very different than when I'm just given a compressed image without knowing what the original was.
The way you score things when you can compare to original (side-by-side or slow-fade toggle) is very different than if you're just given the lossy image and asked to make a no-reference score.
I suspect that the "click on worst artifact" in the Guetzli test biases the test in a certain way.
Of course you want to consider different monitors, different viewing conditions, etc.
As I wrote over in the rambles, I spent some time on perceptual metrics myself but aborted it because there's a major problem. With any perceptual test database that we currently have, you can easily beat existing metrics by over-training to the test set. Perceptual metrics all have lots of coefficients for various options (and even if they don't look like numerical coefficients, there are ways to over-train with things like how you do your "pooling"). If you go and "train your metric" what you are actually doing is baking in knowledge about the particular test images, the particular test conditions. That's crap.
1. Simple simple simple. The decoder should be implementable in ~5000 lines as a single file stb.h style header. Keep it simple!
2. It should be losslessly transcodable from JPEG , ala packJPG/Lepton. That is, JPEG1 should be contained as a subset. (this just means having 8x8 DCT mode, quantization matrix). You could have other block modes in JPEG2 that simply aren't used when you transcode JPEG. You replace the entropy coded back-end with JPEG2 and should get about 20% file size reduction.
IMO this is crucial for rolling out a new format; nobody should ever be transcoding existing JPEGs and thereby introducing new error.
3. Reasonably fast to decode. Slower than JPEG1 by maybe 2X is okay, but not by 10X. eg. JPEG-ANS is okay, JPEG-Ari is probably not okay. Also think about parallelism and GPU decoding for huge images (100 MP). Keeping decoding local is important (eg. each 32x32 block or so should be independently decodable).
4. Decent quality encoding without crazy optimizing encoders. The straightforward encode without big R-D optimizing searches should still beat JPEG.
5. Support for per-block Q , so that sophisticated encoders can do bit rate allocation.
6. Support alpha, HDR. Make a clean definition of color space and gamma. But *don't* go crazy with supporting ICC profiles and lots of bit depths and so on. It needs to be the smallest set of features here. You don't want to get into the common situation where the format is too complex, nobody actually supports it right in practice, and you end up with a "spec standard" and a "de-facto standard" that don't handle lots of the optional modes correctly.
7. Support larger blocks & non-square blocks; certainly 16x16 , maybe 32x32 ? Things like 16x8 , etc. This is important for increasingly large images.
Most of all keep it simple, keep it close to JPEG, because JPEG actually works and basically everything else in lossy image compression doesn't.
Anything that's not just DCT + quantize + entropy is IMO a big mistake, very suspicious and likely to be vaporware in the sense that you can make it look good on paper but it won't work well in reality.
ADD :
I have in the past posted many times about how plain old baseline JPEG + decent back-end entropy (eg. packJPG/Lepton) is surprisingly competitive with almost every modern image codec.
That's actually quite surprising.
The issue is that baseline JPEG is doing *zero* R-D optimization. Even if you use something like mozjpeg which is doing a bit of R-D optimization, it's doing it for the *wrong* rate model (assuming baseline JPEG coding, not the packjpg I then actually use).
It's well known that doing R-D optimization correctly (with the right rate model) provides absolutely enormous wins in lossy compression, so the fact that baseline JPEG + packJPG without any R-D at all can perform so well is really an indictment of everything it beats. This tells us there is a lot of room for easy improvement.
(Charts for the test images : PDI_1200, blinds, my_stress_test, my_soup, porsche640.)
Key :
jpg_h = IJG encoder + decoder (progressive and -optimize of course)
jpg_m = mozjpeg encoder + IJG baseline decoder
guetzli = guetzli encoder + IJG baseline decoder
jpg_pack = IJG encoder + decoder + packjpg entropy coded file format
jpg_pdec = jpg_pack with my jpeg decoder
Some notes on interpretation :
What I see is MozJPEG does very well on RMSE (it does as well as jpg_pack ! this despite mozjpeg being baseline decoded while jpg_pack gets a custom entropy back end). On the two perceptual metrics, mozjpeg is roughly tied with jpeg-huff, just very slightly better in Combo.
To my eyes (looking at the images myself (take with salt)), MozJPEG is a definite win over jpeg-huff. This is why I show the three different metrics, and no one of them can be trusted on its own, you have to look at the full picture. In this case - tied on two metrics and a big win on the other is pretty strong evidence that it is in fact better.
If we ran at lower bit-rates where the Huffman starts to be a huge disadvantage, packjpg would show a bigger win over mozjpeg. But that's outside of the functional range of JPEG anyway. This test shows quality starting at 40 only.
(BTW yes I know at high bit rate my jpeg decoder (jpeg_pdec) is flawed; it's doing too much deblocking & deringing there, it needs to get weaker at high bit rate. (in the graphs you can see it actually hurts RMSE quite a lot at the highest bit rate, but helps at low bit rate).)
Guetzli is run at quality 85-90-95. Guetzli does consistently poorly, so much so that its poor showing can't be justified merely by saying it targets a different metric.
First practical stuff. If you (you being Google or any of the other big web companies serving images (eg. Instagram, Snap, Tumblr, etc.)) want to improve the quality of JPEGs for consumers, IMO the best place is in the *decoder* not the encoder. Deblocking, deringing, and chroma-from-luma are all pretty straightforward and provide huge quality wins in the low-bit-rate range where JPEG needs the most help.
Decoder-side fixes also work on the huge body of existing JPEGs in the wild, and don't tempt people to do awful things like recompress from an existing JPEG to a new lossy format.
If you want to reduce the size of transmitted JPEGs (assuming baseline decoder), mozjpeg is good.
If you can change the format (eg. if you were willing to push a new format like you did with webp-lossy, and you have control over both the servers and the client, as Google is now in a unique position to do, since they control Chrome, Android, and also many servers) - then Lepton (based on packJPG) is awesome.
The great thing about Lepton/packJPG is that you get big gains in size at the *exact same* quality. You can transcode existing JPEGs - you don't have to find the uncompressed original or transcode from existing JPEG with introduction of new loss. You get a very simple decision - smaller files, same content. It's not ambiguous or questionable or involves any user evaluation or drawbacks.
Links to Lepton :
uncmpJPG packJPG
JPEG Open Source Package packJPG
Lepton image compression: saving 22% losslessly from images at 15MB/s - Dropbox Tech Blog
GitHub - dropbox/lepton : Lepton is a tool and file format for losslessly compressing JPEGs by an average of 22%.
packjpg (Matthias Stirner) · GitHub
Lepton 1.2 free download - Software reviews, downloads, news, free trials, freeware and full commercial software - Downloadc
A bit of a historical ramble about JPEG optimization.
The Guetzli paper doesn't describe the implementation in much detail, and I haven't looked in the code so I don't know exactly what's in there. The basic ideas of how they optimize JPEGs are ancient (quant matrix optimization and spatial-adaptive quantization by zeroing tails). The ideas go back to 1993 and the very introduction of the JPEG standard. Some of the classic work is Watson's DCTune and "RD-OPT" (Ratnakar and Livny), which optimize quantization matrices per-image. Trellis quantization and block truncation were also common. In fact the idea of being able to optimize the quantization matrix per image is why the matrix is transmitted in the format, rather than just sending a scalar quality (which would be smaller). The designers of the JPEG standard had all this in mind.
There are tons of papers on this stuff, it was a popular area of research for years. Data compression practitioners were excited about it and many big claims were made. But it never caught on, and we mostly stopped working on it, it was a bit of a dead end. The problem is the perceptual metric. These optimizers would go off and fiddle with the image and report they made it better by some metric, and sometimes it would look better, but sometimes it wouldn't.
For any given metric, you can write an optimizer that goes off and optimizes to that metric. And the result is .... ? better ? sometimes ? you hope ?
Back in the long ago, people ran DCTune and RD-OPT and similar coders, and some metric would say they succeeded, but you'd look at the images and not be convinced. But worse than that, perceptual metrics assume certain viewing conditions, certain brightness, certain image size, so maybe it looked good in testing, but then when people actually used it, it looked worse.
We (data compression researchers) wound up just throwing up our hands and punting, saying "we'll come back to this when there's a better metric". If we had a perceptual metric we could trust, then sure lots of optimization could be done in the encoder.
My "porsche640.bmp" test image. I ran guetzli at quality 85 to set a target size, because :
Guetzli should be called with quality >= 84, otherwise the
output will have noticeable artifacts. If you want to
proceed anyway, please edit the source code.
My normal imdiff run would be at a variety of qualities, but Guetzli doesn't work except in a very small range of qualities, so let's just take the end of that range to stress everything as much as possible.
Then I ran baseline JPEG (IJG cjpeg) and mozjpeg and adjusted their quality to get sizes as close as possible.
The sizes are :
guetzli 85 : 62544
jpeg 77 : 61983
jpeg 78 : 63662
mozjpeg 84 : 63046
then I ran them through my imdiff.
Imdiff reports "imdiff" as the raw diff score (the scale of this score, and whether higher or lower is better, depends on the diff type), and "fit_imdiff" is always a 0-10 quality where higher is better.
RMSE :
imDiff Type : 0 : RMSE_RGB
got option : html output to r:\id0.html
recurse on dir : porsche640.bmp_guetzli
loading : porsche640.bmp_guetzli\guetzli_000062544.bmp
compSize : 62544 , bpp : 1.629 , logbpp : 0.704
imdiff : 8.633
fit_imdiff : 5.837
loading : porsche640.bmp_guetzli\jpeg_000061983.bmp
compSize : 61983 , bpp : 1.614 , logbpp : 0.691
imdiff : 8.555
fit_imdiff : 5.850
loading : porsche640.bmp_guetzli\jpeg_000063662.bmp
compSize : 63662 , bpp : 1.658 , logbpp : 0.729
imdiff : 8.433
fit_imdiff : 5.870
loading : porsche640.bmp_guetzli\mozj_000063046.bmp
compSize : 63046 , bpp : 1.642 , logbpp : 0.715
imdiff : 7.640
fit_imdiff : 6.005
Guetzli has slightly higher RMSE (eg. worse) than baseline JPEG. Mozjpeg does much better. Perhaps this is not surprising if
Guetzli is perceptually optimized.
MS-SSIM-IW-Y :
imDiff Type : 4 : MS_SSIM_IW_Y
got option : html output to r:\id4.html
recurse on dir : porsche640.bmp_guetzli
loading : porsche640.bmp_guetzli\guetzli_000062544.bmp
compSize : 62544 , bpp : 1.629 , logbpp : 0.704
imdiff : 0.997
fit_imdiff : 6.022
loading : porsche640.bmp_guetzli\jpeg_000061983.bmp
compSize : 61983 , bpp : 1.614 , logbpp : 0.691
imdiff : 0.998
fit_imdiff : 6.237
loading : porsche640.bmp_guetzli\jpeg_000063662.bmp
compSize : 63662 , bpp : 1.658 , logbpp : 0.729
imdiff : 0.998
fit_imdiff : 6.260
loading : porsche640.bmp_guetzli\mozj_000063046.bmp
compSize : 63046 , bpp : 1.642 , logbpp : 0.715
imdiff : 0.998
fit_imdiff : 6.287
the raw score is an SSIM so it's like a cosine (1.0 is perfect); my fit to a more perceptually uniform score
uses an acos. Mozjpeg is not much better than baseline JPEG
by this metric. Guetzli does poorly.
Combo :
imDiff Type : 8 : Combo
got option : html output to r:\id8.html
recurse on dir : porsche640.bmp_guetzli
loading : porsche640.bmp_guetzli\guetzli_000062544.bmp
compSize : 62544 , bpp : 1.629 , logbpp : 0.704
imdiff : 3.013
fit_imdiff : 6.061
loading : porsche640.bmp_guetzli\jpeg_000061983.bmp
compSize : 61983 , bpp : 1.614 , logbpp : 0.691
imdiff : 2.812
fit_imdiff : 6.266
loading : porsche640.bmp_guetzli\jpeg_000063662.bmp
compSize : 63662 , bpp : 1.658 , logbpp : 0.729
imdiff : 2.788
fit_imdiff : 6.290
loading : porsche640.bmp_guetzli\mozj_000063046.bmp
compSize : 63046 , bpp : 1.642 , logbpp : 0.715
imdiff : 2.731
fit_imdiff : 6.348
By the Combo metric (my perceptual metric, a combination of MS-SSIM-IW-Y, SCIELAB delta, and PSNR-HVS-M), MozJPEG distinguishes itself from baseline JPEG. Guetzli still
does poorly.
I assume that Guetzli would win when compared under the Butteraugli metric (I hope?) but it's way behind in my imdiff metrics.
Personally eyeballing the images, I see some places where Guetzli is better than baseline JPEG, and some places where it's worse. (Guetzli is way better on the ringing of the flower over the canoe, but it adds a bunch of distortion on the car that baseline JPEG doesn't have; the worst is on the driver's side rear wheel). To my eyes MozJPEG is clearly vastly superior to either. Well done mozjpeg team!
Get the images :
porsche640_guetzli_compare.zip at tinyupload.com
Links :
[1703.04421] Guetzli Perceptually Guided JPEG Encoder
Performance Calendar » MozJPEG 3.0
GitHub - google/guetzli : Perceptual JPEG encoder
Releases · google/guetzli · GitHub
GitHub - mozilla/mozjpeg : Improved JPEG encoder.
mozjpeg codelove
binaries - mozjpeg codelove
Research Blog Announcing Guetzli A New Open Source JPEG Encoder
mozjpeg 3.1 Compiled for Windows (.exe) - Thomas Coward
There is reason to be concerned about running a lot of Kraken (or Mermaid/Selkie) decodes simultaneously. On most modern systems, like the PS4, the many cores share caches, and perhaps share memory busses or TLBs. That means while you have N times the compute performance, you may have cache conflicts, and you could wind up bottlenecking on some part of the memory subsystem. (generally we don't run into bandwidth bottlenecks, but there are lots of other limited resources, like queue sizes, etc.)
Anyhoo, onto the testing -
I ran N threaded decodes of the same file. The buffers are copied for each thread so they can't share any cache for input or output buffers. Wiped caches before runs. I then wait on all N decodes being done and time that.
The graphs show total time for all N decodes, and time per decode (total/N).
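Roughly, the harness looks like this (a sketch; decode_one stands in for the actual Oodle decode call, and the cache wiping and thread affinity are omitted) :

#include <chrono>
#include <functional>
#include <thread>
#include <vector>

double time_n_decodes(int N, const std::vector<unsigned char> & comp, size_t rawLen,
                      const std::function<void(const std::vector<unsigned char> &,
                                               std::vector<unsigned char> &)> & decode_one)
{
    std::vector<std::vector<unsigned char>> comps(N, comp);  // per-thread buffer copies, nothing shared
    std::vector<std::vector<unsigned char>> raws(N, std::vector<unsigned char>(rawLen));

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < N; i++)
        threads.emplace_back([&, i] { decode_one(comps[i], raws[i]); });
    for (auto & t : threads) t.join();                       // wait on all N decodes being done
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();   // total time for all N
}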
If you had infinite compute resources, then "total time" (orange) would be a flat line. Any number of threads would take the same total time, it would not change.
Once you hit the limits of the system, the "time per" (blue) should be constant, and total time should then go up linearly. (actually not quite, because when the thread count isn't a multiple of the core count, the threads don't all complete at the same time, so you get wasted idle time; see the jump on lappy from 4-6 cores then how flat it is from 6-8, same on PS4 from 6-8 cores then flat from 9-12). If you have the threads to spare, then you can maximize throughput by minimizing "time per".
Conclusion :
No problem with lots of simultaneous Kraken decodes. Even when heavily over-subscribed, there's no major perf inversion due to overloading cache or memory subsystems.
Kraken on PS4 has near perfect threading up to 6 threads (total time goes from 0.0099 - 0.0111) ; on lappy it's not as good but still provides benefit to the time per decode up to 4 threads (total time from 0.0060 - 0.0095).
It's a surprise to me that the PS4 scales so well despite sharing cache & memory bus for the first 4 cores. It's also a surprise that lappy scales less well, I thought it would be near perfect on the first 4 cores, but maybe that's just Windows not giving me the whole machine? That was backward from my expectation.
Charts :
Kraken on PS4 (6 cores; 4 cores per 2MB L2) :
lzt24 :
lzt99:
Almost perfect threading from 1-6 cores (total time constant) even with large binary file.
webster:
webster is a large text file that uses a lot of long distance matches (offset > 1M). Text files have very different character than binary files like the lzt's. We can see that the large hot memory region used by webster does put some stress on the shared L2, there's falloff in perf from 1-4 cores.
webster Selkie :
Selkie is much faster than Kraken (2.75X faster on webster PS4) so all else being equal it should be affected a lot more by thread contention hurting memory latency. But, Selkie has some unique cleverness that makes it immune to this drawback. Threading even on webster from 1-6 cores is near perfect.
Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) no turbo :
lzt24 :
lzt99 :
webster :
Similar to PS4, lappy has almost perfect threading on binary files from 1-4 cores. On webster there is falloff in perf due to the large hot memory region used by webster putting stress on the shared cache, as on the PS4.
Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) WITH TURBO :
lzt24 :
lzt99 :
I initially mistakenly posted lappy timings with turbo enabled. I usually turn it off for perf testing on my laptop so that timings are more reliable. I think it's interesting actually to look at how the perf falloff is different with turbo.
Without turbo, total time is constant on lzt24 and lzt99 from 1-4 cores, but with turbo it steadily falls off, as adding more cores causes the laptop to reduce its clock rate. Despite that there's still a solid gain to throughput (the blue "time per" is going down despite the clock rate also going down).
raw data : (lzt24)
(columns : N threads , total time , time per decode)

lappy : no turbo : (*1000)
1, 9.1360, 9.1360
2, 9.5523, 4.7761
3, 9.7850, 3.2617
4, 10.1901, 2.5475
5, 14.6867, 2.9373
6, 16.6759, 2.7793
7, 19.1105, 2.7301
8, 20.1687, 2.5211
9, 23.6391, 2.6266
10, 25.9279, 2.5928
11, 27.7395, 2.5218
12, 27.6459, 2.3038
13, 30.7935, 2.3687
14, 31.8541, 2.2753
15, 33.7883, 2.2526
16, 34.8252, 2.1766

lappy : with turbo :
1, 0.0060, 0.0060
2, 0.0070, 0.0035
3, 0.0087, 0.0029
4, 0.0095, 0.0024 <- 4
5, 0.0133, 0.0027
6, 0.0170, 0.0028
7, 0.0175, 0.0025
8, 0.0193, 0.0024 <- 8
9, 0.0228, 0.0025
10, 0.0252, 0.0025
11, 0.0262, 0.0024
12, 0.0278, 0.0023 <- 12
13, 0.0318, 0.0024
14, 0.0310, 0.0022
15, 0.0325, 0.0022
16, 0.0346, 0.0022 <- 16

PS4 :
1, 0.0099, 0.0099
2, 0.0102, 0.0051
3, 0.0104, 0.0035
4, 0.0106, 0.0027
5, 0.0110, 0.0022
6, 0.0111, 0.0018 <- min
7, 0.0147, 0.0021
8, 0.0180, 0.0022
9, 0.0204, 0.0023
10, 0.0214, 0.0021
11, 0.0217, 0.0020
12, 0.0220, 0.0018 <- same min again
13, 0.0257, 0.0020
14, 0.0297, 0.0021
15, 0.0310, 0.0021
16, 0.0319, 0.0020

comparing just lappy turbo to no-turbo :

lappy : no turbo :
1, 9.1360, 9.1360
2, 9.5523, 4.7761
3, 9.7850, 3.2617
4, 10.1901, 2.5475

lappy : with turbo :
1, 6.0, 6.0
2, 7.0, 3.5
3, 8.7, 2.9
4, 9.5, 2.4

You can see that with only 1 core, turbo is 1.5X faster (9.13/6.0) than no turbo.
With 4 cores they are getting close to the same speed (10.2 vs 9.5); the turbo has almost completely clocked down.
The customer's actual issue was decoding into write-combined graphics memory. This is an absolute killer for decoder perf because Kraken (like any LZ decoder) needs to read back the buffers it writes.
On the PS4 I think the best way to decode to graphics memory (garlic) is to allocate the memory as writeback onion, do the decompress, then change it to wb_garlic with sceKernelBatchMap (which will cause a CPU cache flush; several of these changes could be combined together, eg. for level loading you only need to do it once at the end of all the resource decoding, don't do it per resource).
The larger the block, the more compression you get, and larger blocks can also help throughput (decode speed).
Obviously larger block = longer latency (to load & decode one whole block).
(though you can get data out incrementally, you don't have to wait for the whole decode to get the first byte out; but if you only needed the last byte of the block, it's strictly longer latency).
If you need fine grain paging, you have to trade off the desire to get precise control of your loading with small blocks & the benefits of larger blocks.
(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)
As a reference point, here's Kraken on Silesia with various chunk sizes :
Silesia : (Kraken Normal -z4)
16k : ooKraken : 211,938,580 ->75,624,641 = 2.855 bpb = 2.803 to 1
16k : decode : 264.190 millis, 4.24 c/b, rate= 802.22 mb/s
32k : ooKraken : 211,938,580 ->70,906,686 = 2.676 bpb = 2.989 to 1
32k : decode : 217.339 millis, 3.49 c/b, rate= 975.15 mb/s
64k : ooKraken : 211,938,580 ->67,562,203 = 2.550 bpb = 3.137 to 1
64k : decode : 195.793 millis, 3.14 c/b, rate= 1082.46 mb/s
128k : ooKraken : 211,938,580 ->65,274,250 = 2.464 bpb = 3.247 to 1
128k : decode : 183.232 millis, 2.94 c/b, rate= 1156.67 mb/s
256k : ooKraken : 211,938,580 ->63,548,390 = 2.399 bpb = 3.335 to 1
256k : decode : 182.080 millis, 2.92 c/b, rate= 1163.99 mb/s
512k : ooKraken : 211,938,580 ->61,875,640 = 2.336 bpb = 3.425 to 1
512k : decode : 182.018 millis, 2.92 c/b, rate= 1164.38 mb/s
1024k: ooKraken : 211,938,580 ->60,602,177 = 2.288 bpb = 3.497 to 1
1024k: decode : 181.486 millis, 2.91 c/b, rate= 1167.80 mb/s
files: ooKraken : 211,938,580 ->57,451,361 = 2.169 bpb = 3.689 to 1
files: decode : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s
16k : 2.80:1 , 15.7 enc MB/s , 802.2 dec MB/s
32k : 2.99:1 , 19.7 enc MB/s , 975.2 dec MB/s
64k : 3.14:1 , 22.8 enc MB/s , 1082.5 dec MB/s
128k : 3.25:1 , 24.6 enc MB/s , 1156.7 dec MB/s
256k : 3.34:1 , 25.5 enc MB/s , 1164.0 dec MB/s
512k : 3.43:1 , 25.4 enc MB/s , 1164.4 dec MB/s
1024k : 3.50:1 , 24.6 enc MB/s , 1167.8 dec MB/s
files : 3.69:1 , 18.9 enc MB/s , 1027.3 dec MB/s
(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor between files.)
You may have noticed that the chunked files (once you get past the very small 16k,32k) are somewhat faster to decode. This is due to keeping match references in the CPU cache in the decoder.
Limiting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit of staying in cache, but with a smaller compression penalty (than chunking).
window 128k : ooKraken : 211,938,580 ->61,939,885 = 2.338 bpb = 3.422 to 1
window 128k : decode : 181.967 millis, 2.92 c/b, rate= 1164.71 mb/s
window 256k : ooKraken : 211,938,580 ->60,688,467 = 2.291 bpb = 3.492 to 1
window 256k : decode : 182.316 millis, 2.93 c/b, rate= 1162.48 mb/s
window 512k : ooKraken : 211,938,580 ->59,658,759 = 2.252 bpb = 3.553 to 1
window 512k : decode : 184.702 millis, 2.97 c/b, rate= 1147.46 mb/s
window 1M : ooKraken : 211,938,580 ->58,878,065 = 2.222 bpb = 3.600 to 1
window 1M : decode : 184.912 millis, 2.97 c/b, rate= 1146.16 mb/s
window 2M : ooKraken : 211,938,580 ->58,396,432 = 2.204 bpb = 3.629 to 1
window 2M : decode : 182.231 millis, 2.93 c/b, rate= 1163.02 mb/s
window 4M : ooKraken : 211,938,580 ->58,018,936 = 2.190 bpb = 3.653 to 1
window 4M : decode : 182.950 millis, 2.94 c/b, rate= 1158.45 mb/s
window 8M : ooKraken : 211,938,580 ->57,657,484 = 2.176 bpb = 3.676 to 1
window 8M : decode : 189.241 millis, 3.04 c/b, rate= 1119.94 mb/s
window 16M: ooKraken : 211,938,580 ->57,525,174 = 2.171 bpb = 3.684 to 1
window 16M: decode : 202.384 millis, 3.25 c/b, rate= 1047.21 mb/s
files : ooKraken : 211,938,580 ->57,451,361 = 2.169 bpb = 3.689 to 1
files : decode : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s
window 128k: 3.42:1 , 20.1 enc MB/s , 1164.7 dec MB/s
window 256k: 3.49:1 , 20.1 enc MB/s , 1162.5 dec MB/s
window 512k: 3.55:1 , 20.1 enc MB/s , 1147.5 dec MB/s
window 1M : 3.60:1 , 20.0 enc MB/s , 1146.2 dec MB/s
window 2M : 3.63:1 , 19.7 enc MB/s , 1163.0 dec MB/s
window 4M : 3.65:1 , 19.3 enc MB/s , 1158.5 dec MB/s
window 8M : 3.68:1 , 18.9 enc MB/s , 1119.9 dec MB/s
window 16M : 3.68:1 , 18.8 enc MB/s , 1047.2 dec MB/s
files : 3.69:1 , 18.9 enc MB/s , 1027.3 dec MB/s
WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend
fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has
a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.
Comparing chunked vs windowed :
chunked :
128k : 3.25:1 , 24.6 enc MB/s , 1156.7 dec MB/s
256k : 3.34:1 , 25.5 enc MB/s , 1164.0 dec MB/s
windowed :
128k : 3.42:1 , 20.1 enc MB/s , 1164.7 dec MB/s
256k : 3.49:1 , 20.1 enc MB/s , 1162.5 dec MB/s
If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal intended operating range) here are a few tips to consider :
1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls. This avoids doing a malloc per call, which may or may not be significant overhead.
2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes. Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.
3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes that must be saved from the compressed output size before the encoder will choose a slower decode mode. eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is 256 ; I recommend trying 128 to see what the effect is.
1. A light weight high res timer or cycle counter.
ADD : okay, yeah there's cntpct_el0. Lots of weird stuff with it though that makes it less than ideal. There are access bits so the OS might deny you access (why!?). It seems like on Linux/Android access to cntpct_el0 is denied but cntvct_el0 is allowed? Getting the frequency doesn't always seem to work; and how does the frequency of the timer relate to the frequency of the CPU? It's all a bit nastier than it should be.
2. A fast query for which core we are on, and a way to map that core id to processor information (eg. is it an A53 or A15 or whatever). This has to be fast enough to do frequently, because tasks get moved around cores so you can't store what core you think you're on.
3. The ability to lock the clock rate and stop the thermal insanity. Even if it was locked at the lowest clock rate, that would be better than nothing (though it would cause other anomalies in timing, like CPU to RAM relative speeds might not be the same as in typical use). Just anything to make measuring perf not so random.
(and while I'm at it : command line args and host filesystem mapping. (you too Durango!))
1. Use the new compressors (Kraken/Mermaid/Selkie/Hydra). Aside from being great performance, they have the most robust and well-tested fuzz safety.
2. Pass FuzzSafe_Yes to OodleLZ_Decompress. The KMS decoders are always internally fuzz safe. What the FuzzSafe_Yes option does is add some checks to the initial header decode which will cause the decode to fail if the header byte has been tampered with to change it to a non-KMS compressor.
3. Use Oodle Core lib only, not Ext. Core lib is very light and tight. Core lib makes no threads, requires no init, and will do no allocations to decode (if you pass in the memory).
4. Disable Oodle's callbacks :
OodlePlugins_SetAllocators(NULL,NULL);
OodlePlugins_SetAssertion(NULL);
OodlePlugins_SetPrintf(NULL);
(note that disabling the allocators only works because you are doing decoding only (no encoding) and because
you do #5 below - pass in scratch memory for the decoder)
5. Because you disabled Oodle's access to an allocator, pass in the memory needed instead. You can query the compressed data for the compressor used if you don't already know it.
compressor = OodleLZ_GetChunkCompressor(in_comp,NULL);
decoderMemorySize = OodleLZDecoder_MemorySizeNeeded(compressor,in_raw_length);
decoderMemory = malloc( decoderMemorySize );
6. If possible keep the Decoder memory around or pull from some shared scratch space rather than allocating & freeing it every time.
7. If you're on a platform where Oodle is in a DLL or .so , sign it with a mechanism like authentisign to ensure it isn't tampered with.
8. If possible do async IO and use double-buffering to incrementally load data and decompress, so that you are maximally using the IO bus and the CPU at the same time. If loading and decoding large buffers, consider running the decoders "ThreadPhased".
9. If possible use Thread-Phased decoding on large buffers that can't be broken into multiple decodes. The most efficient use of threading always comes from running multiple unrelated encode/decodes simultaneously, rather than trying to multi-thread a single encode/decode.
10. Use a mix of compressors to put your decode time where it helps the most. eg. use the slower compressors only when they get big size wins, and perhaps tune the compression to the game's latency needs - random access data that needs to get loaded as fast as possible, use faster decoders, background data that streams in slowly can use slower decoders. Hydra is the best way to do this, it will automatically measure decode speed and tune to maximize the space-speed tradeoff.
11. Don't load tiny buffers. Combine them into larger paging units.
12. Consider in-memory compressed data as a paging cache.
As always, feel free to contact me and ask questions.
Some little notes :
strtod in many compilers & standard libraries is broken. (see the excellent pages at exploring binary for details)
(to clarify "broken" : they all round-trip doubles correctly; if you print a double and scan it using the same compiler, you will get the same double; that's quite easy. They're "broken" in the sense of not converting a full-precision string to the closest possible double, and they're "broken" in the sense of not having a single consistent rule that tells you how any given full precision numeral string will be converted)
The default rounding rule for floats is round-nearest (banker). What that means is when a value is exactly at the midpoint of either round-up or round-down (the bits below the bottom bit are 100000..) , then round up or down to make the bottom bit of the result zero. This is sometimes called "round to even" but crucially it's round to even *mantissa* not round to even *number*.
Our goal for strtod should be to take a full numeral string and convert it to the closest possible double using banker rounding.
The marginal rounding value is like this in binary :
1|101010....101010|10000000000000000
^ ^ ^ ^- bits below mantissa are exactly at rounding threshold
| | |
| | +- bottom bit of mantissa = "banker bit"
| |
| +- 52 bits of mantissa
|
+- implicit 1 bit at top of double
in this case the "banker bit" (bottom bit of mantissa bit range) is off, the next bit is on.
If this value was exact, you should round *down*. (if the banker bit was on, you should round up). If this value is not exact by even the tiniest bit (as in, there are more significant bits below the range we have seen, eg. it was 1000....001), it changes to rounding up.
If you think of the infinity of the real numbers being divided into buckets, each bucket corresponds to one of the doubles that covers that range, this value is the boundary of a bucket edge.
In practice the way to do strtod is to work in ints, so if you have the top 64 bits of the (possibly very long) value in an int, this is :
hi = a u64 with first 64 bits
top bit of "hi" is on
(hi>>11) is the 52 bits of mantissa + implicit top bit
hi & 0x400 is the "rounding bit"
hi & 0x800 is the "banker bit"
hi & 0x3ff are the bits under the rounding bit
hi & 0xfff == 0x400 is low boundary that needs more bits
hi & 0xfff == 0xBFF is high boundary that needs more bits
At 0x400 :
"rounding bit" is on, we're right on threshold
"banker bit" is off, so we should round down if this value is exact
but any more bits beyond the first 64 will change us to rounding up
so we need more bits
- just need to see if there are any bits at all that could be on
At 0xBFF :
"banker bit" is on, so if we were at rounding threshold, we should round up
(eg. 0xC00 rounds up, and doesn't need any more bits, we know that exactly)
the bits we have are just below rounding threshold
if the remaining bits are all on (FFFF)
OR if they generate a carry into our bottom bit
then we will round up
- need enough of the remaining value to know that it can't push us up to 0xC00
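To make those two boundary patterns concrete, here's a little sketch (my own illustration, not dtoa's code) that just transcribes the rules above; it assumes "hi" holds the 64 bits computed from the digits read so far and "more_bits" says whether any nonzero contribution remains beyond them :
typedef unsigned long long U64;
enum RoundDecision { ROUND_DOWN, ROUND_UP, NEED_MORE_BITS };
enum RoundDecision classify_rounding(U64 hi, int more_bits)
{
    U64 low = hi & 0xFFF;   // banker bit = 0x800, rounding bit = 0x400, 10 bits below
    if (low == 0x400)                      // exactly at threshold, banker bit off
        return more_bits ? ROUND_UP : ROUND_DOWN;
    if (low == 0xBFF && more_bits)         // just below threshold; the remaining value
        return NEED_MORE_BITS;             //  could carry us up to 0xC00
    if (hi & 0x400)                        // rounding bit on
        return ((hi & 0x3FF) || more_bits || (hi & 0x800)) ? ROUND_UP : ROUND_DOWN;
    return ROUND_DOWN;                     // rounding bit off : below threshold
}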
The standard approach to doing a good strtod is to start by reading the first 19 digits into a U64. You just use *= 10 integer multiplies to do the base-10 to base-2 conversion, and then you have to adjust the place-value of the right hand side of those digits using a table to find a power of 10. (the place-value is obviously adjusted based on where the decimal was and the exponent provided in the input string ; so if you are given "70.23e12" you read "7023" as an int and then adjust pow10 exponent +10).
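A minimal sketch of that digit-reading step (mine, simplified : no sign, no exponent suffix, no overflow handling) :
typedef unsigned long long U64;
// read up to 19 decimal digits into a U64; also return the power-of-10
// adjustment for digits we skipped or that fell after the decimal point
U64 read19(const char * s, int * exp10_adjust, const char ** end)
{
    U64 v = 0;
    int digits = 0, exp10 = 0, seen_dot = 0;
    for (; *s; s++) {
        if (*s == '.') { if (seen_dot) break; seen_dot = 1; continue; }
        if (*s < '0' || *s > '9') break;
        if (digits < 19) { v = v*10 + (U64)(*s - '0'); digits++; if (seen_dot) exp10--; }
        else if (!seen_dot) exp10++;   // extra integer digits just raise the place value
    }
    *exp10_adjust = exp10;
    *end = s;
    return v;
}
// eg. read19("70.23e12",...) gives 7023 with exp10_adjust = -2 ;
// adding the explicit e12 exponent gives the +10 mentioned above.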
Once you have the first 19 digits in the U64, you know that the full string must correspond to an interval :
lo = current u64 from first 19 digits
hi = lo + (all remaining digits = 99999)*(place value) + (bias for placevalue mult being too low)
final value is in [lo,hi]
so the difficult cases that need a lot of digits to discriminate are the ones where the first 19 digits put you just below the
threshold of rounding, but the hi end of the interval is above it.
Older MSVC only used the first 17 digits so fails sooner. For example :
34791611969279740608512
=
111010111100000111010011011110000000011001100010011101000000000000000000000
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
this is right on a rounding threshold; because the banker bit is off it should round down
the correct closest double is
34791611969279739000000.000000
but if you bias that up beyond the digits that MSVC reads :
34791611969279740610310
=
111010111100000111010011011110000000011001100010011101000000000011100000110
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
this should round up; the closest double is
34791611969279743000000.000000
but old MSVC gets it wrong
-----------
Another example :
2022951805990391198363682
=
110101100011000000110111111101000101100100011110011001000000000000000000000000000
is the double threshold
2022951805990391198666718
=
110101100011000000110111111101000101100100011110011001000000000100100111111011110
is definitely above the threshold, should round up
but if you only read the first 19 base-10 digits :
2022951805990391198000000
=
110101100011000000110111111101000101100100011110011000111111110000010001110000000
you see a value that is below the threshold and would round down
You have to look at the interval of uncertainty -
2022951805990391198000000
2022951805990391199000000
and see that you straddle a boundary and must get more digits.
There are two things that need refinement : getting more digits & doing your place value scaling to more bits of precision. You can iteratively increase the precision of each, which refines your interval smaller and smaller, until you either know which side of the rounding barrier to take, or you have got all the bits of precision.
BTW this needing lots of precision case is exactly the same as the arithmetic coder "underflow" case , where
your coding interval looks like, in binary :
hi = 101101010101001 100000000000000.110101010101
lo = 101101010101001 011111111111111.01010101010
^ ^ ^- active coding zone, new symbols add at this fixed point position
| |
| +-- bad underflow zone! can't yet tell if this bit should be a 1 or 0
|
+-- hi and lo the same in top bits, these bits have been sent & renormalized out
anyway, just a funny parallel to me.
There are obviously nasty issues to be aware of with the FPU rounding mode, control word, and precision, or with the optimizer doing funny things to your FPU math (and of course be aware of x86 being on the x87 FPU and x64 being in SSE). The best solution to all that is to just not use floats; use ints and construct the output floats by hand.
(there is an optimization possible if you can trust the FPU ; numbers that fit completely in an int you can just use the FPU's int-to-float to do all the work for you, if the int-to-float rounding mode matches the rounding mode you want from your strtod)
Now, you may think "hey who cares about these boundary cases - I can round-trip doubles perfectly without them". In fact, round-tripping doubles is easy, since they don't ever have any bits on below the (52+1) in the mantissa, you never get this problem of bits beyond the last one! (you only need 17 digits to round-trip doubles).
Personally I think that these kinds of routines should have well-defined behavior for all inputs. There should be a single definite right answer, and a conforming routine should get that right answer. Then nutters can go off and optimize it and fiddle around, and you can run it through a test suite and verify it *if* you have exactly defined right answers. This is particularly true for things like the CRT. (this also comes up in how the compiler converts float and double constants). The fact that so many compilers fucked this up for so long is a bit of a head scratcher (especially since good conversion routines have been around since the Fortran days). (the unwillingness of people to just use existing good code is so bizarre to me, they go off and roll their own and invariably fuck up some of the difficult corner cases).
Anyway, rolling your own is a fun project, but fortunately there's good code to do this :
"dtoa" (which contains a strtod) by David M Gay :
dtoa.c from http://www.netlib.org/fp/
(BTW lots of people seem to have different versions of dtoa with various changes/fixes; for example gcc, mozilla, chromium; I can't find a good link or home page for the best/newest version of dtoa; help?)
dtoa is excellent code in the sense of working right, using good algorithms, and being fast. It's terrible code in the sense of being some appalling spaghetti. This is actual code from dtoa :
if (k &= 0x1f) {
k1 = 32 - k;
z = 0;
do {
*x1++ = *x << k | z;
z = *x++ >> k1;
}
while(x < xe);
if ((*x1 = z))
++n1;
}
#else
if (k &= 0xf) {
k1 = 16 - k;
z = 0;
do {
*x1++ = *x << k & 0xffff | z;
z = *x++ >> k1;
}
holy order-of-operations reliance batman! Jeebus.
To build dtoa I use :
#define IEEE_8087
#pragma warning(disable : 4244 4127 4018 4706 4701 4334 4146)
#define NO_ERRNO
Note that dtoa can optionally check the FLT_ROUNDS setting but does not round correctly
unless it is 1 (nearest). Furthermore, in MSVC the FLT_ROUNDS value in the header is broken in some versions
(always returns 1 even if fesetenv has changed it). So, yeah. Don't mess with the float rounding mode please.
dtoa is partly messy because it supports lots of wacky stuff from rare machines (FLT_RADIX != 2 for example). (though apparently a lot of the weird float format modes are broken).
An alternative I looked at is "floatscan" from MUSL. Floatscan produces correct results, but is very slow.
Here's my timing on random large (20-300 decimal digits) strings :
strtod dtoa.c      : ticks =   353
strtod mine        : ticks =   267
MSVC 2005 CRT atof : ticks =  1921
MUSL floatscan     : ticks = 11445
My variant here ("mine") is a little bit faster than dtoa. That basically just comes from using the modern 64-bit CPU. For example I use mulq to do 64x64 to 128 multiply , whereas he uses only 32*32 multiplies and simulates large words by doing the carries manually. I use clz to left-justify binary, whereas he uses a bunch of if's. Stuff like that.
The MUSL floatscan slowness is, hmm, unfortunate. It seems to be O(N) in the # of digits even when the first 19 digits can unambiguously resolve the correct answer. Basically it just goes straight into bigint (it makes an array of U32's which each hold 9 base10 digits) which is unnecessary most of the time.
Assume that we are writing a compressor with only order-0 modeling, and that we are working on a binary alphabet, so we are just modeling the count of 0's and 1's. Maybe we have some binary data that we believe only has order-0 correlation in it, or maybe this is the back-end of some other stage of a compressor.
If the data is in fact stationary (the probabilities don't change over time) and truly order-0, then the best we can do is to count the # of 0's and 1's in the whole sequence to get the best possible estimate of the true probability of 0's and 1's in the source.
The first option is a static coder : (using the nomenclature of "static huffman" vs "adaptive huffman" ; eg.
static means non-streaming, probabilities or counts transmitted at the start of the buffer)
Encoder counts n0 and n1 in the whole sequence
Encoder transmits n0 and n1 exactly
Encoder & Decoder both make
p0 = n0 / (n0+n1)
p1 = n1 / (n0+n1)
Lots of little notes here already. We didn't have to do any +1's to ensure we had non-zero probabilities
as you often see, because we have the exact count of the whole stream. eg. if n0 is 0, that's okay because
we won't have any 0's to code so it's okay that they're impossible to code.
Now, how do you do your entropy coding? You could feed p0&p1 to arithmetic coding, ANS, to an enumerative coder (since we know n0 and n1, we are just selecting one of the arrangements of those bits, of which there are (n0+n1)!/n0!n1! , and those are all equally likely, so just send an integer that selects one of those), you could group up bits and use huffman. For now we don't care how the back end works, we're just trying to model the probability to feed to the back end.
If n0 and n1 are large, they are probably specifying more precision than the coder can use, which is wasting bits. So maybe you want to send an approximation of just p0 in 14 bits or whatever your back-end can use.
If you do send n0 and n1 exactly, then obviously you don't need to send the file length (it's n0+n1), and furthermore you can gain some efficiency by decrementing n0 or n1 as you go, so that the last symbol you see is known exactly.
Okay, so moving on to adaptive estimators. Instead of transmitting p0 up front, we will start with no a-priori knowledge
of the stream (hence p0 = 50%), and as we encounter symbols, we will update p0 to make it the best estimate based on
what we've seen so far. The standard solution is :
Encoder & Decoder start with n0 and n1 = 0
Encoder & Decoder form a probability from the n0 and n1 seen so far
p0 = (n0 + B)/(n0+n1 + 2B)
symbols are coded with the current estimate of p0
after which n0 or n1 is incremented and a new p0 is formed
B is a constant bias factor
if B = 1/2 this is a KT estimator (optimal in a specific synthetic case, irrelevant)
if B = 1 this is a laplace estimator
Note that the bias B must be > 0 so that we can encode a novel symbol, eg. coding the first 0 bit when n0 = 0.
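In code the whole estimator is tiny; a sketch (binary alphabet, B passed in; B=1 gives Laplace, B=1/2 gives KT) :
typedef struct { unsigned n0, n1; } Counts;   // events seen so far

double estimate_p0(const Counts * c, double B)
{
    return (c->n0 + B) / (c->n0 + c->n1 + 2.0*B);
}

void update_counts(Counts * c, int bit)
{
    if (bit) c->n1++; else c->n0++;
}
// code the bit with estimate_p0() first, then call update_counts()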
There's stuff in the literature about "optimal estimators" but it's all a bit silly, because the optimal estimator depends on the source and what the distribution of possible sources is.
That is, say you actually are getting bits from a stationary source that has a true (unknown) probability of 0 bits, T0. You could see a wide variety of sources with different values of T0, which occur with probability P(T0). After you see some bits, n0 and n1, you wish to compute a p0 which minimizes the expected codelen of the next symbol you see. To do that, you can compute the relative probability of seeing n0 and n1 events from a source of probability T0. But to form the correct final estimate you must have the information about the a-priori likelihood of each source P(T0) which in practice you never have.
So we have these estimators for stationary sources, but in the real world you almost never have a stationary source. So let's start looking at estimators we might actually want to use in the real world.
(it may actually be a pretty stationary source, but it could be stationary only under a more complex model, and any time you are not fully modeling the data, stationary sources appear to be dynamic. This is like flatlanders in 2d watching a 3d object move through their plane - it may actually be a rigid body in a higher dimension, but it looks dynamic when you have an incomplete view of it. For example data that has Order-1 correlation (probability depends on the previous symbol) will appear to have dynamic statistics under only an order-0 model (the probabilities will seem to change after each symbol is coded))
Let's start with the "static" case, transmitting p0 or n0/n1. We can improve it by just breaking the source into chunks and transmitting a model on each chunk, rather than a single count for the whole buffer. These chunks could be fixed size, but there are sometimes large wins by finding the ideal places to put the chunk boundaries. This is an unsolved problem in general, I don't know of any algorithm to do it optimally (other than brute force, which is O(N!) or something evil like that), we use hacky heuristics. Obviously chunks have a cost in that you must spend bits to indicate where the chunk boundaries are, and what the probabilities are in each chunk, so you must count the cost to send the chunk information vs. the bits saved by coding with different probabilities.
(the most extreme case is a buffer that has n0=n1, which would take n0+n1 bits to send as a single chunk, but if in fact all the 0's are at the start, and all the 1's are the end, then you can cut it into two chunks, in the first chunk p0=100% so the bits are sent in zero bits, in the second chunk p0=0% , so the total size is only the overhead of specifying the chunks and probabilities)
A slightly more sophisticated version of this scheme is to have several probability groups and to be able to switch between
them from chunk to chunk, that is :
send the # of models, M
send the models
in the binary case, p0 or n0/n1 for each model
send the # of chunks
for each chunk :
send its length
send a model selection m in [0,M)
send the data in that chunk using model m
In a binary coder this is a bit silly, but in a general alphabet coder, the model might be very large (100 bytes or so), so sending the model
selection m is much cheaper than sending the whole model. This method allows you to switch rapidly between models at a lower cost.
eg. if your data is like 000000111111111100000001111111000000 - the runs of different-character data are best coded by switching between
models. (we're still assuming we can only use order-0 coding). (this is what Brotli does)
Now moving on to adaptive estimators.
The basic approach is that instead of forming an estimate of future probabilities by counting all n0 and n1 events we have seen in the past, we will count based on what we've seen in the recent past, or weight more recent events higher than old ones.
This is rarely done in practice, but you can simply count the # of each symbol in a finite window and update it incrementally :
at position p
code bit[p]
p0 = (n0 + B)/(n0+n1 + 2B)
after coding, increment n0 or n1
if (n0+n1) = T , desired maximum total
remove the bit b[p - T]
by subtracting one from n0 or n1
this has the advantage of keeping the sum constant (once the sum reaches T), which you could use to make the sum power of 2. But it
requires you actually have the previous T bits, which you usually don't if you are using the adaptive coder as part of a larger model.
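A sketch of that finite-window counter (window size T chosen arbitrarily here), keeping the last T bits in a ring buffer so the oldest event can be subtracted out :
#define WINDOW_T 256

typedef struct {
    unsigned char hist[WINDOW_T];   // the last T bits, as a ring buffer
    int pos, n0, n1;
} WindowCounts;                     // zero-initialize before use

void window_update(WindowCounts * m, int bit)
{
    if (m->n0 + m->n1 == WINDOW_T) {            // window full : forget the oldest bit
        if (m->hist[m->pos]) m->n1--; else m->n0--;
    }
    m->hist[m->pos] = (unsigned char)bit;
    m->pos = (m->pos + 1) % WINDOW_T;
    if (bit) m->n1++; else m->n0++;
}
// p0 is then formed exactly as before : (n0 + B)/(n0 + n1 + 2B)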
This does illustrate a problem we will face with many of these adaptive estimators. There's an initial run-up phase. They start empty with no symbols seen, then count normally up to T, at which point they reach steady state.
A common old-fashioned approach is to renormalize the total to T/2 once it reaches T. This was originally done as a way of limiting the sum T inside the range allowed by the entropy coder (eg. it must fit in 14 bits in old arithmetic coders so that the multiplies fit in 32 bits). It was found that applying limits like this didn't hurt compression; they in fact help in practice, because they make the statistics more adaptive to local changes.
after coding increment n0 or n1
if (n0+n1) = T
n0 /= 2 , n1 /= 2;
This is actually the same as a piecewise-linear approximation of geometric falloff of counts. A true geometric update is like this :
once steady state is reached :
n0+n1 == T always
after coding
n0 or n1 += 1
n0+n1 == T+1 now
n0 *= T/(T+1)
n1 *= T/(T+1)
now n0+n1 == T again
this is equivalent to doing :
n0 or n1 += inc
inc *= (T+1)/T
let G = (T+1)/T is the geometric growth factor
events contribute with weights :
1,G,G^2,G^3,etc..
now, nobody does a geometric update quite like this because it requires high precision counts (though you can do piecewise linear
approximations of this and fixed point versions, which can be interesting). There is a way to do a geometric update in finite
precision that's extremely common :
p0 probability is fixed point (12-14 bits is common)
at steady state
after coding a 1 : p0 -= p0 >> updshift
after coding a 0 : p0 += (one - p0) >> updshift
this is equivalent to the "renorm every step to keep n0+n1 = T" with T = 1<<updshift
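As code, the whole thing is just (a sketch; the 14-bit fixed point and updshift = 5 are arbitrary choices here) :
#define PROB_BITS  14
#define PROB_ONE   (1<<PROB_BITS)
#define UPDSHIFT   5

// p0 = P(bit==0) in fixed point, start at PROB_ONE/2
void shift_update(int * p0, int bit)
{
    if (bit) *p0 -= *p0 >> UPDSHIFT;               // saw a 1 : p0 goes down
    else     *p0 += (PROB_ONE - *p0) >> UPDSHIFT;  // saw a 0 : p0 goes up
}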
This gives an efficient way to do a very recency-biased (geometric) estimator. For most of the estimators I'm talking about, the non-binary alphabet extension is obvious, and I'm just doing binary here for simplicity, but in this case the non-binary alphabet version is non trivial. Fabian works it out here :
Mixing discrete probability distributions , and
Models for adaptive arithmetic coding .
For people familiar with filtering, it should be obvious that what we're really doing here is running filters over the previous events. The "window" estimator is a simple FIR filter with a box response. The geometric estimator is the simplest IIR filter.
In all our (adaptive) estimators, we have ensured that P0 and P1 are never zero - we need to be able to code either bit even if we've never seen one before.
To do this, we often add on a count to n0 and n1 (+B above), or ensure it's non-zero.
In the binary updshift case, the minimum of p0 is where (p0 >> updshift) is zero, that's
p0min = (1 << updshift) - 1
which in practice is actually quite a large minimum probability of the novel symbol. That turns out to be desirable in very local
fast-adaptive estimators. What you want is if the last 4 events were all 1 bits, you want the probability P1 to go very high very fast -
but you don't want to be over-confident about that local model matching future bits, so you want P0 to stay at some floor.
Essentially what we are doing here is blending in the unknown or "flat" model (50/50 probability of 0 or 1 bit) with some desired weight. So you might have a very jerky strongly adapting local model, but then you also blend in the flat model as a hedge.
The geometric update can be extended to "two speed" :
track two running estimators, p0_a and p0_b
make p0 = (p0_a + p0_b)/2
use p0 for coding
after the event is seen, update each with different speeds :
after coding a 1 : p0_a -= p0_a >> updshift_a
after coding a 0 : p0_a += (one - p0_a) >> updshift_a
and p0_b with updshift_b
eg. you might use
updshift_a = 4 (a very fast model)
updshift_b = 8 (a slower model)
(with one = 1<<14)
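A sketch of the two-speed version (again my own toy code, with one = 1<<14 and both estimators starting at one/2) :
typedef struct { int p0_a, p0_b; } TwoSpeed;

int twospeed_p0(const TwoSpeed * m) { return (m->p0_a + m->p0_b) / 2; }

void twospeed_update(TwoSpeed * m, int bit, int updshift_a, int updshift_b)
{
    if (bit) {
        m->p0_a -= m->p0_a >> updshift_a;              // fast model
        m->p0_b -= m->p0_b >> updshift_b;              // slow model
    } else {
        m->p0_a += ((1<<14) - m->p0_a) >> updshift_a;
        m->p0_b += ((1<<14) - m->p0_b) >> updshift_b;
    }
}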
Naively this looks like an interesting blend of two models. Actually since it's all just linear, it's in fact still just an IIR filter.
It's simply a slightly more general IIR filter; the previous one was a one-tap filter (previous total and new event), this one is a
two-tap filter (two previous totals and new event).
But this leads us to something that is interesting, which is more general blending.
You could have something like 3 models : flat (all symbols likely), a very fast estimator that strongly adapts to local statistics, and a slow estimator (perhaps n0/n1 counts for the whole file) that is more accurate if the file is in fact stationary.
Then blend the 3 models based on local performance. The blend weight for a simple log-loss system is simply the product of the probabilities that model assigned to the preceding symbols.
Now, a common problem with these IIR type filters is that they assume steady state. You may recall previously we talked about
the renormalization-based adaptive coder that has two phases :
track n0,n1
ramp-up phase , while (n0+n1) < T
initialize n0=n1=0
n0 or n1 += 1
steady-state :
when n0+n1 = T , renorm total to T/2
n0 or n1 += 1
If you're doing whole-file entropy coding (eg. lots of events) then maybe the ramp-up phase is not important to you and you can just
ignore it, but if you're doing context modeling (lots of probability estimators in each node of the tree, which might not see very many
events), then the ramp-up phase is crucial and can't be ignored.
If you want something efficient (like the updshift geometric model), but that accounts for ramp-up vs steady state, the answer is table lookups. (the key difference in the ramp-up phase is that adaptation early on is much faster than once you reach steady state)
This actually goes back to the ancient days of arithmetic coding, in the work of people like Howard & Vitter, and things like the Q-coder from IBM.
The idea is that you have a compact state variable which is your table index. It starts at an index for no events (n0=0,n1=0),
and counts up through the ramp-up phase. Then once you reach steady state the index ticks up and down on a line like the p0
in updshift. Each index has a state transition for "after a 0" and "after a 1" to adapt. Something like :
ramp-up :
0: {0,0} -> 1 or 2
1: {1,0} -> 3 or 4
2: {0,1} -> 3 or 5
3: {1,1} -> 6 or 7
4: {2,0} ->
5: {0,2} ->
etc.
then say T = 16 is steady state, you have
{0,16} {1,15} {2,14} ... {16,0}
that just transitions up and down
And obviously you don't need to actually store {n0,n1}, you just store p0 in fixed point so you can do divide-free arithmetic coding.
So there's like a tree of states for the ramp-up phase, then just a line back and forth at steady state.
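Concretely, each table entry just needs a probability and two next-state links; something like this sketch (sizes and fields are arbitrary; this is not the Q-coder's actual table) :
typedef struct {
    unsigned short p0;      // fixed-point probability of a 0 bit in this state
    unsigned short next0;   // state to transition to after coding a 0
    unsigned short next1;   // state to transition to after coding a 1
} EstimatorState;
// coding loop : code bit with table[s].p0 , then s = bit ? table[s].next1 : table[s].next0;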
And those states are not actually what you want at steady state. Actually finding the ideal probabilities for steady state is complex and in the end can only be solved by iteration. I won't go into the details but just quickly touch on the issues.
You might start with a mid point at p0=0.5 , at simulated T=16 that corresponds to {8,8} , so you consider stepping to {8,9} after seeing a 1 and renormalize to T=16, that gives p0=8/17 = 0.47059 ; that corresponds to a geometric update with scaling factor G = 17/16. If you keep seeing 1's, then p0 keeps going down like that. But if you saw a 0, then p0 -> p0 + (1 - p0) * (1 - 1/G) , so 0.47059 -> 0.50173 , which is not back to where you were.
This should be intuitive because with geometric recency, if you see a 1 bit then a 0 bit, the 0 you just saw counts a bit more than the 1 before, so you don't get back to the midpoint. With geometric recency the p0 estimated after seeing bits 01 is not the same as after seeing 10 - the order matters. This is also good intuition for why simple counting estimators like KT are not very useful in data compression - the probability of 0 after seeing "11110000" is most likely not the same as after seeing "00001111". Now you might argue that we're asking our order-0 estimator to do non-order-0 things, we aren't giving it a memoryless bit, we should have used some higher order statistics or a transform or something first, but in practice that's not helpful.
The short answer is you just need lots of states on the steady-state line, and you have to numerically optimize what the probability in each state is by simulating what the desired probability is when you arrive there in various ways and averaging them; a kind of k-means quantization type of thing.
Another issue is how you do the state transition graph on the steady-state line. When you are out at the ends, say very low p0 so a 1 bit is highly predicted - if you see another 1 bit, then p0 does not change very much, but if you see a 0 bit (unexpected), then p0 should change a lot. This is actually information theory in a microcosm - when you see the expected events, they don't jar your model very much, because they are what you expected, they contain little new information, when you see events that had very low probability, that's huge information and jars p0 a lot.
(there's some ancient code for a coder like this and a table in crblib ; "rungae.c" / "ladder.c")
You could store the # of steps to take up or down after seeing a 0 or 1 bit. One of them could be implicit. For example when you see a more probable symbol, always take 1 step, when you see a less probable symbol, take many steps (maybe 3). Another clever way to do it is used in the Q-coder (and QM and MQ). They have a steady state line of states, but only change state when the arithmetic coder outputs a bit. This means you have to see 1/log2(P) events before you change states, which is exactly what you want - when P is very high, log2(P) is tiny and you won't step until you see several. This method cannot be used in modern arithmetic coders that output byte by byte, it requires bitwise renormalization. It's neat because it lets you use a very tiny table (53 states) and you can put the density where you need it (mostly around p0=0.5) but still have states way out at the extreme probabilities to code them efficiently.
The next step in the evolution is secondary statistics.
If you have this {n0,n1} state transition table in the last section, that's a state index. The straightforward way to do it is that each state has a p0 precomputed that corresponds to n0,n1 and you use that for coding.
With secondary statistics, instead of using the p0 that you *expected* to observe for given past counts, you use the p0 that you
actually *observed* in that same state in the past.
Say you're in a given state S after seeing bits 0100
(n0 =3, n1 =1 , but order matters too)
You could compute the p0 that should be seen after that sequence with some standard estimator
(geometric or KT or whatever)
Or, screw them. Instead use S as a lookup to a secondary model.
SecondaryStatistics[S]
contains the n0 and n1 actually coded from the state S
previous times that you were in state S
This was the SEE idea from PPMZ (then Shkarin's PPMD (different from Teahan's PPMD) and Mahoney's PAQ
(sometimes called APM there)).
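A sketch of the secondary-statistics lookup (my own toy version; the table size and halving threshold are arbitrary) :
#define SEE_STATES 4096

typedef struct { unsigned short n0, n1; } SEEEntry;
static SEEEntry see_table[SEE_STATES];

double see_p0(int S)     // S = the small state index (packed counts/history)
{
    return (see_table[S].n0 + 0.5) / (see_table[S].n0 + see_table[S].n1 + 1.0);
}

void see_update(int S, int bit)
{
    if (bit) see_table[S].n1++; else see_table[S].n0++;
    if (see_table[S].n0 + see_table[S].n1 >= 60000) {   // keep counts in 16 bits
        see_table[S].n0 >>= 1; see_table[S].n1 >>= 1;
    }
}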
In the real world there are weird nonlinearities in the actual
probabilities of states that can't be expressed well with simple estimators. Furthermore, those change
from file to file, so you can't just tabulate them, you need to observe them.
A common hacky thing to do is to use a different estimator if n0=0 or n1=0 ; eg. if one of the possible symbols has never been seen at all, special case it and don't use something like a standard KT estimator that gives it a bias to non-zero probability. This is done because in practice it's been observed that deterministic contexts have very different statistics. Really this is just a special case version of something more general like secondary statistics.
The other big step you could take is mixing. But that's rather going beyond simple order-0 estimators so I think it's time to stop.
Hydra is a meta-compressor which selects Kraken, Mermaid, or Selkie per block. It uses the speed fit model of each compressor to do a lagrangian space-speed optimization decision about which compressor is maximizing the desired lagrange cost (size + lambda*time).
It turns out to be quite interesting.
(this is of course in addition to each of those compressors internally making space-speed decisions; each of them can enable or disable internal processing modes using the same lagrange optimization model. (eg. they can turn on and off entropy coding for various streams). And there are additional per-block implicit decisions such as choosing uncompressed blocks and huff-only blocks.)
Hydra is a single entry point to all the Oodle compressors. You simply choose how much you care about size vs. decode speed, that corresponds to a certain lagrange lambda. In Oodle this is called "spaceSpeedTradeoffBytes". It's the # of bytes that compression must save in order to take up N cycles more of decode time. You then no longer think about "do I want Kraken or Mermaid" , Oodle makes the right decision for you that optimizes the goal.
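The per-block decision itself is simple; here's a sketch of the idea (not Oodle's actual code or API, just the lagrangian choice described above, with lambda in bytes per cycle) :
typedef struct {
    unsigned comp_size;     // compressed size of this block with this compressor
    double decode_cycles;   // predicted decode time from the speed-fit model
} BlockCandidate;

int pick_compressor(const BlockCandidate * c, int count, double lambda)
{
    int best = 0;
    double best_cost = c[0].comp_size + lambda * c[0].decode_cycles;
    for (int i = 1; i < count; i++) {
        double cost = c[i].comp_size + lambda * c[i].decode_cycles;  // size + lambda*time
        if (cost < best_cost) { best_cost = cost; best = i; }
    }
    return best;   // index of the candidate (Kraken/Mermaid/Selkie) to use for this block
}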
Hydra can interpolate the performance of Kraken & Mermaid to create a meta-compressor that targets the points in between. That in itself is a somewhat surprising result. Say Kraken is at 1000 mb/s , Mermaid is at 2000 mb/s decode speed, but you really want a compressor that's around 1500 mb/s with compression between the two. We don't know of a Pareto-optimal compressor that is between Kraken and Mermaid, so you're sunk, right? No, you can use Hydra.
I should note that Hydra is very much about *whole corpus* performance. That is, if your target is 1500 mb/s, you may not hit that on any one file - that file could go either all-Kraken or all-Mermaid. The target is hit overall. This is intentional and good, but if for whatever reason you are trying to hit a specific speed for an individual file then Hydra is not the way to do that.
It leads to an idea that I've tried to advocate for before : corpus lagrange optimization for bit rate allocation. If you are dealing with a limited resource that you want to allocate well, such as disk size or download size or time to load - you want to allocate that resource to the data that can make the best use of it. eg. spend your decode time where it makes the biggest size difference. (I encourage this for lossy bit rate allocation as well). So with Hydra some files decode slower and some decode faster, but when they are slower it's because the time was worth it.
And now some reports. We're going to look at 3 corpora. On Silesia and gametestset, Hydra interpolates as expected. But then on PD3D, something magic happens ...
(Oodle 2.4.2 , level 7, Core i7-3770 x64)
Silesia :
(loglog chart : compression ratio vs. decode speed)
total : Kraken    : 4.106 to 1 :  994.036 MB/s
total : Mermaid   : 3.581 to 1 : 1995.919 MB/s
total : Hydra200  : 4.096 to 1 : 1007.692 MB/s
total : Hydra288  : 4.040 to 1 : 1082.211 MB/s
total : Hydra416  : 3.827 to 1 : 1474.452 MB/s
total : Hydra601  : 3.685 to 1 : 1780.476 MB/s
total : Hydra866  : 3.631 to 1 : 1906.823 MB/s
total : Hydra1250 : 3.572 to 1 : 2002.683 MB/s
gametestset :
(loglog chart : compression ratio vs. decode speed)
total : Kraken    : 2.593 to 1 : 1309.865 MB/s
total : Mermaid   : 2.347 to 1 : 2459.442 MB/s
total : Hydra200  : 2.593 to 1 : 1338.429 MB/s
total : Hydra288  : 2.581 to 1 : 1397.465 MB/s
total : Hydra416  : 2.542 to 1 : 1581.959 MB/s
total : Hydra601  : 2.484 to 1 : 1836.988 MB/s
total : Hydra866  : 2.431 to 1 : 2078.516 MB/s
total : Hydra1250 : 2.366 to 1 : 2376.828 MB/s
PD3D :
(loglog chart : compression ratio vs. decode speed)
total : Kraken    : 3.678 to 1 : 1054.380 MB/s
total : Mermaid   : 3.403 to 1 : 1814.660 MB/s
total : Hydra200  : 3.755 to 1 : 1218.745 MB/s
total : Hydra288  : 3.738 to 1 : 1249.838 MB/s
total : Hydra416  : 3.649 to 1 : 1381.570 MB/s
total : Hydra601  : 3.574 to 1 : 1518.151 MB/s
total : Hydra866  : 3.487 to 1 : 1666.634 MB/s
total : Hydra1250 : 3.279 to 1 : 1965.039 MB/s
Whoah! Magic!
On PD3D, Hydra finds big free wins - it not only compresses more than Kraken, it decodes significantly faster, repeating the
above to point it out :
total : Kraken : 3.678 to 1 : 1054.380 MB/s
total : Hydra288 : 3.738 to 1 : 1249.838 MB/s
(Kraken's compression ratio falls in between these two Hydra points, where Hydra is decoding around 1300 MB/s)
total : Hydra416 : 3.649 to 1 : 1381.570 MB/s
You can see it visually in the loglog plot; if you draw a line between Kraken & Mermaid, the Hydra data points are above that
line (more compression) and to the right (faster).
What's happening is that once in a while there's a block where Mermaid gets the same or more compression than Kraken. While that's rare, when it does happen you just get a big free win from switching to Mermaid on that block. More often, Mermaid only gets a little bit less compression than Kraken but a lot less decode time, so switching is advantageous in the space-speed lagrange cost.
Crucial to Hydra is having a decoder speed fit for every compressor that can simulate decoding a block and count cycles needed to decode on an imaginary machine. You need a model because you don't want to actually measure the time by running the decoder on the current machine - it would take lots of runs to get reliable timing, and it would mean that you are optimizing for the exact machine that you are encoding on. I currently use a single virtual machine that is a blend of various real platforms; in the future I might expose the ability to use virtual machines that simulate specific target machines (because Hydra might make decisions differently if it knows it is targeting PC-x64 vs. Jaguar-x64 vs. Aarch64-on-A57 , etc.).
Hydra is exciting to me as a general framework for the future of Oodle. It provides a way to add in new compression modes and be sure that they are never worse. That is, you always can start with Kraken per block, and then new modes could be picked block by block only when they are known to beat Kraken (in a space-speed sense). It lets you mix in compressors that you specifically don't expect to be good in general on all data, but that might be amazing once in a while on certain data.
(Hydra requires compressors that carry no state across blocks, so you can't naively mix in something like PPM or CM/PAQ. To optimize a switching choice with compressors that carry state requires a trellis-quantization like lattice dynamic programming optimization and is rather more complex to do quickly)
The video of the conversation is on Youtube
(also see the preceding chat with Fabian & Jeff at Handmade Con on Youtube )
The way I think about compression is the way I approach all technical problems. I come from a math/science background, and I like to understand the theory behind things. So in the background I'm always thinking about basic coding theory, entropy, probabilities, conditional probability, etc.
When I start on a new problem, or am exploring a new data compression space, I usually start by writing a compressor in the most obvious way, not worrying about efficiency, using standard model-coder patterns.
I then start looking at what is going on in the data. One way to do that is by making charts and gathering statistics. Another way is simply to try adding or removing things from the model and checking your compression.
Usually I will keep adding things to the model, trying to find things that are helpful correlations. Does the previous byte help? Does the position help? Does the distance to the last byte with the same value help? Keep tossing them in.
Then I will start to try to make things more efficient. I start cutting things out of the model that only help a little bit. Try to bake it down to the simplest thing that captures most of the structure of the data. eg. if I code the value as log2 + remainder bits does that hurt?
Once I have an idea of which factors make a big difference and which don't, I then throw all that work away and start over. Writing a final efficient implementation may actually be totally different, but starting from a simple model-coder allowed a clean way to get an understanding of the problem space.
Principle : Algorithm options, question everything
This is a principle I follow in data compression, but also in all algorithm implementation. There are lots of places in the algorithm where you have choices. Many of the things are arbitrary or not well justified. Any time I'm reading a paper on data compression and they say "we did X this way" I think "why?" and "is there a better way?" or "what about other options?".
When you increment the count of your statistics after seeing a symbol, why do +1? Why not +2? Why not + (previous_count*0.10)+1 ? Why not a file-specific count?
Unless something is rigorously set in stone that it mathematically must be a certain way, then it's something I will explore. As I go through understanding an algorithm, I keep popping these things on my stack, todo : try other options. Obviously some intuition is required here to guess what is fruitful to test and what is not.
Data compression models are designed to learn the structure of the data. However, the choice of data compression model *type* is also a modeling step. That is, you must know something about the data to choose the model that fits the data. For example, PPM can only learn a byte-oriented finite-context model of the data. Aside from just trying to make the PPM as good as possible, it will only capture that one type of model of the data and could totally miss other things.
The implicit model of LZ77 is that it predicts strings are more likely to occur again, proportionally to how often they have occurred in the past. The exact model in LZ77 is a little subtle (see Langdon), but the variant LZSA is very easy to show an exact model correspondence.
Most compressors are very generic and find their compression from simple assumptions about what data is likely (data that's likely has some bytes that occur more often than others, bytes are likely to repeat in their neighborhood, etc.). If you know something specific about your data you can of course make a better model and get better compression than a generic solution.
One thing I think about a lot is *what* do I want an encoder to optimize. Not how (which is also a big problem), but what is the goal of the optimization. Many data compression encoders are optimizers, or searches in a big space. They are trying to optimize some metric. It's easy to say that you want to minimize error in a lossy codec, or that you want to minimize size in a lossless codec, but that's often wrong. In a lossy codec, what is the right way to quantify distortion? The answer depends partly on how the content will be used. How do approximations of distortion affect the search and result? In lossless codecs, you might want to consider speed or memory use or other factors in the optimization.
Principle : decoder first design.
For performance, we design codecs decoder first. It starts by thinking about how you want to arrange the execution flow for the decoder, and then how can the encoder get data into the form that the decoder wants.
Principle : try the exact solution first.
I almost never just start hacking away with heuristics and approximations. I like to know what is the exact answer to the problem first. If it's something that is solveable in reasonable time, I'll write an exact solver before I go messing around with approximations.
For example in lossy coding, there is usually one place in the codec where you are intentionally introducing loss in a controlled way (usually a quantization step, possibly in DCT domain). But there are also other places where you may be unintentionally introducing loss, perhaps in your colorspace transform, or your downsample-upsample pass, or by approximating the DCT for speed, or in scaling to stay in small integers, possibly in other places you aren't aware of. I like to do a first implementation where those losses are as low as possible, do everything right with zero cost.
This gives a reference to know how good your quality should be, and also how fast the exact answer is. Then when you start making approximations and using heuristics, you can compare and say - exactly how much is this costing me (in quality) and how much speed am I gaining? Is it worth it?
Quick performance test vs. the software zlib (1.2.8) provided in the Nintendo SDK :
ADD : Update with new numbers from Oodle 2.6.0 pre-release (11-20-2017) :
file : compressor : ratio : decode speed
lzt99 : nn_deflate : 1.883 to 1 : 74.750 MB/s
lzt99 : Kraken -z8  : 2.615 to 1 : 275.75 MB/s (threadphased 470.13 MB/s)
lzt99 : Kraken -z6  : 2.527 to 1 : 289.06 MB/s
lzt99 : Hydra 300 z6: 2.571 to 1 : 335.68 MB/s
lzt99 : Hydra 800 z6: 2.441 to 1 : 458.66 MB/s
lzt99 : Mermaid -z6 : 2.363 to 1 : 556.85 MB/s
lzt99 : Selkie -z6  : 1.939 to 1 : 988.04 MB/s
Kraken (z6) is 3.86X faster to decode than zlib, with way more compression (35% more).
Selkie gets a little more compression than zlib and is 13.35 X faster to decode.
All tests single threaded, 64-bit. (except "threadphased" which uses 2 threads to decode)
I've included Hydra at a space-speed tradeoff value between Kraken & Mermaid (sstb=300). It's a bit subtle, perhaps you can see it best in the loglog chart (below), but Hydra here is not just interpolating between Kraken & Mermaid performance, it's actually beating both of them in a Pareto frontier sense.
OLD :
This post was originally done with a pre-release version of Oodle 2.4.2 when we had just gotten Oodle running on the NX. There was still a lot of work to be done to get it running really properly.
(chart : compression ratio vs. decode speed)
lzt99 : nn_deflate : 1.883 to 1 :  74.750 MB/s
lzt99 : LZNA       : 2.723 to 1 :  24.886 MB/s
lzt99 : Kraken     : 2.549 to 1 : 238.881 MB/s
lzt99 : Hydra 300  : 2.519 to 1 : 274.433 MB/s
lzt99 : Mermaid    : 2.393 to 1 : 328.930 MB/s
lzt99 : Selkie     : 1.992 to 1 : 660.859 MB/s
The global feedback problem is the problem that local decisions in the coder affect future coding, in a way that is too complex to track. Modern coders often have choices of how to code the current event, and evaluate those coding choices by computing a cost for them. What we do is to just pretend that that decision is the last, that we don't care about how it affects the future.
All lossy codecs have this problem, because in lossy codecs you always have coding choices (you can introduce more or less loss; even in codecs that don't seem to have multiple ways to code a block (eg. perhaps a wavelet codec, not MPEG style) you always actually do have choices, you can choose to degrade the current pixel, eg. perhaps zero some small high-frequency wavelet). Lossless codecs that are over-complete (eg. provide multiple ways to send the same data) (such as LZ) also have this problem. Lossless codecs that are strongly constrained (only one way to send a given coding event) do not have this problem.
In some cases we do deal with the global feedback on some limited tractable portion of the coding choice. For example in LZ we do "optimal parsing" ; this accounts only for the very limited feedback issue that the location of the termination of the current codeword affects future coding costs, but in LZ optimal parsing does not account for the way current choices affect the future offset cache, or the future entropy coder state. In lossy DCT codecs we might do "trellis quantization" which is a simple model of a quantizer where each coefficient coding only depends on the one previous (in its simplest form it only depends on whether the previous was zero or not, so only two states). The large benefits of optimal parsing & trellis quantization are demonstrations of how significant this global optimization problem is.
Put another way, the issue is that the current decision is part of the history for future decisions, and many modern codecs have very long histories that are used as part of the coding of current decisions. While it is possible to efficiently evaluate the current decision based on a fixed history, it's not possible to consider how the current decision affects the future by giving them different histories.
Obviously this effect is stronger at the beginning of each coding block, and less important near the end. The very last decision can be made purely locally.
Techniques :
1. Guiding to expectation.
The general technique here is to bias local decisions towards what we expect to be more useful as history in the future. This is generally done using a-priori knowledge of how we expect good coded streams to behave.
This can be done by biasing code costs to favor choices that match the guide. Another way is to seed or limit searches to only explore the area around the neighborhood of the expected good choice.
Sometimes the guidance strength is adaptive; it may start higher at the beginning of coding and decrease (even to zero) once the model is sufficiently strong to act as enough guidance on its own.
Quite often codecs do guidance without being explicit about it (they might not even be aware of what they are doing). For example most LZ codecs have lots of simple heuristics like - if a match is found of length >= 4, then never choose a literal, or if two matches are found with the same length, always choose the lower offset. These heuristics might be intended as simplifications or for speed, but in fact they strongly guide the coding towards a certain a-priori expectation. (another common example of unintentional guidance is local motion vector search)
Another common form of guidance is "fudge factors". Many who have experimented in data compression have seen that if they get the cost estimate for local decisions slightly wrong, it makes the result better. The reason is that the wrong cost acts as a fudge factor to bias the local decision to one that is worse if that decision is the last, but that provides a better global solution. For example in LZ it's common to fudge the costs so that rep-matches appear cheaper than they really are (normal matches & literals appear more expensive) or that longer matches appear cheaper. Note that this fudging is also faking a lagrangian space-speed optimization cost.
2. Iteration and model pre-conditioning.
One general approach to this problem is iteration. Run the coding once, then save the model (history) that was observed in the first coding, and use it as history in the next coding pass.
The idea here is that once the model is well established, it provides guidance towards coding that works well within that model. The problem is that at the beginning of coding, the model is empty, so all choices seem to cost the same, and there's no particular reason for coding to proceed in any particular direction. By using the model from the end of coding at the beginning of the next iteration, it makes the choices in the beginning direct towards ones that we used last time at the end, so we expect we will want to use those choices again at the end, and by making them at the beginning on the next pass we decrease their cost at the end.
A related variant of this is to use pre-conditioned models, based on expectation or measurement on a corpus. This builds up an expected or average model, which we use to seed the coder, and as long as the data to code is structurally similar to the expectation, the pre-conditioned model will provide direction towards a good global solution. Of course pre-conditioning can also reduce the cost of early coding events (relative to starting with a blank model), but here we are not so much concerned with the net savings in coding cost that come from preconditioning as we are concerned with the fact that pre-conditioning makes some decisions relatively cheaper than others and thus acts as guidance.
Examples :
1. LZ
In modern LZ there are many ways to code every stream. At each coding event you might have options for literal, rep-match, or normal-match. There are also parse choices which can make any given point in the stream either be a participating coding event or not. We won't discuss parse here.
Local decisions act as history for future decisions through the rep match cache and through the entropy coder.
We have a-priori ideas about what well-behaved LZ streams look like. rep-matches are the cheapest way to code things, long normal matches are next best, and least desirable are short normal matches and literals. We expect low offsets to be more frequent - this has an effect not only on the entropy coder feedback but also on the rep match cache (you want to bias towards offsets that will be useful again in the future; simply preferring lower offsets accomplishes much here).
2. Motion Vectors @@
3. Adaptive VQ Codebooks @@ guiding (fudging) iteration
.. bleh got bored of this post .. guiding (fudging) iteration
The slides and talk are available :
It's a very thorough talk for people in the game industry who want some background on data compression. Thanks to Dietmar for including Oodle!
PNG's are internally compressed with Zlib. When you run another compressor (such as Oodle) on an already-compressed file like PNG, it won't be able to do much with it. It might get a few bytes out of the headers, but typically the space-speed tradeoff decision in Oodle will not think that gain is worth bothering with, so the PNG will just be sent uncompressed.
There are a few reasons why you might want to use an Oodle compressor rather than the Zlib inside PNG. One is to reduce size; some of the Oodle compressors can make the files smaller than Zlib can. Another is for speed, if you use Kraken or Mermaid the decoder is much faster than the Zlib decompression in PNG.
Now obviously if you want the smallest possible lossless image, you should use an image-specific codec like webp-ll , but we will assume here that that isn't an option.
You could of course just decode the PNG to BMP or TGA or some kind of simple sample format, but that is not desirable. For one thing it changes the format, and your end usage loader might be expecting PNG. Your PNG's might be using PNG-specific features like borders or transparency or whatever that is hard to translate to other formats.
But another is that we want the PNG to keep doing its filtering. Filtered image samples from PNG will usually be more compressible by the back-end compressor than the raw samples in a BMP.
The easy way to do this all is just to take an existing PNG and set its ZLib compression level to 0 (just store). You keep all the PNG headers, and you still get the pixel filtering. But the samples are now uncompressed, so the back-end compressor (Oodle or whatever) gets to work on them instead of passing through already-ZLibbed data.
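If you're doing this in your own pipeline with libpng, the key call is png_set_compression_level; a rough sketch of a write path (my own sketch, with error handling and pixel-format handling omitted) :
#include <png.h>
#include <stdio.h>

void write_png_zlib_level0(FILE * fp, png_bytep * rows, int width, int height)
{
    png_structp png  = png_create_write_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
    png_infop   info = png_create_info_struct(png);
    png_init_io(png, fp);
    png_set_compression_level(png, 0);   // zlib "store" : filtered but uncompressed IDAT
    png_set_IHDR(png, info, width, height, 8, PNG_COLOR_TYPE_RGB,
                 PNG_INTERLACE_NONE, PNG_COMPRESSION_TYPE_DEFAULT, PNG_FILTER_TYPE_DEFAULT);
    png_write_info(png, info);
    png_write_image(png, rows);
    png_write_end(png, info);
    png_destroy_write_struct(&png, &info);
}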
pngcp
pngcp is a utility from the official libpng distribution. It reads & writes a png and can change some options.
Usage for what we want is :
pngcp --level=0 --text-level=0 from.png to.png
I have made a Win32 build with static libs of pngcp for your convenience :
I also added a --help option ; run "pngcp --help". The official pngcp seems to have no help or readme at all that explains usage.
I *think* that pngcp preserves headers & options & pixel formats BUT I'M NOT SURE, it's not my code, YMMV, don't go fuck up your pngs without testing it. If it doesn't work - hey you can get pngcp from the official distro and fix it.
I used libpng 1.6.24. The vc7.1 project in libpng worked fine for me. pngcp needed a little bit of de-unixification to build in VC but it was straightforward. You need zlib ; I used 1.2.8 and it worked fine; you need to make a dir named "zlib" at the same level as libpng. I did "mklink /j zlib zlib-1.2.8".
* CAVEAT : this isn't really the way I'd like to do this. pngcp loads the PNG and then saves it out again, which introduces the possibility of losing metadata that was stuffed in the file or just screwing it up somehow. I'd much rather do this conversion without ever actually loading it as an image. That is, take the PNG file as just a binary blob, find the zlib streams and unpack them, store them with a level 0 header, and pass through the PNG headers totally untouched. That would be a much more robust way to ensure you don't lose anything.
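For the curious, the chunk walk for that binary-blob approach is simple. A rough sketch (append_bytes and copy_chunk_verbatim are hypothetical helpers; you'd still need to emit the new IDAT with a correct length and CRC, and an IEND at the end) :

/* PNG = 8-byte signature, then chunks of { u32 length (big-endian), 4-byte type, data, u32 CRC } */
const unsigned char * p   = png_file + 8;
const unsigned char * end = png_file + png_size;
while (p + 12 <= end)
{
    unsigned len = ((unsigned)p[0]<<24) | (p[1]<<16) | (p[2]<<8) | p[3];
    const unsigned char * type = p + 4;
    const unsigned char * data = p + 8;
    if (memcmp(type, "IDAT", 4) == 0)
        append_bytes(&idat_stream, data, len);      /* gather the zlib stream across all IDAT chunks */
    else if (memcmp(type, "IEND", 4) != 0)
        copy_chunk_verbatim(&out, p, len + 12);     /* pass every other chunk through untouched */
    p = data + len + 4;                             /* skip data + CRC to the next chunk */
}
/* then inflate idat_stream, re-deflate it at level 0 (stored blocks), emit it as new IDAT
   chunk(s) with fresh lengths/CRCs, and finish with an IEND chunk. */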
cbpngz0
cbpngz0 usage :
cbpngz0 from to
cbpngz0 uses the cblib loaders, so it can load bmp,tga,png,jpeg and so on. It writes a PNG at zlib level 0.
Unlike pngcp, cbpngz0 does NOT support lots of weird formats; it only writes 8-bit gray, 24-bit RGB, and 32-bit RGBA. This is not a general purpose PNG zlib level changer!! Nevertheless I find it useful because of the wider range of formats it can load.
cbpngz0 is an x64 exe and uses the DLLs included.
Some sample results.
I take an original PNG, then try compressing it with Oodle two ways. First, convert it to a BMP and compress the BMP. Second, convert to a Zlib level 0 PNG (the "_z0.png") and then compress with Oodle. The difference between the two is that the _z0.png gets the PNG filters, and of course stays a PNG if that's what your loader expects. If you give the original PNG to Oodle, it passes it through uncompressed.
porsche640.png 529,821
porsche640.bmp 921,654
porsche640.bmp.ooz 711,273
porsche640_z0.png.ooz 508,091
-------------
blinds.png 328,754
blinds.bmp 1,028,826
blinds.bmp.ooz 193,130
blinds_z0.png.ooz 195,558
-------------
xxx.png 420,149
xxx.bmp 915,054
xxx.bmp.ooz 521,861
xxx_z0.png.ooz 409,311
The ooz files are made with Oodle LZNA -z6 (level Optimal2).
You can see there are some big gains possible with replacing Zlib (on "blinds"). On normal photographic continuous tone images Zlib does okay so the gains are small. On those images, compressing the BMP without filters is very bad.
Another small note : if your end usage PNG loader supports the optional MNG format LOCO color transform, that usually helps compression.
ADD : Chris Maiwald points out that he gets better PNG filter choice by using "Z_FIXED" (which is the zlib option for fixed huffman tables instead of per-file huffman). A bit weird, but perhaps it biases the filter choice to be more consistent?
I wonder if choosing a single PNG filter for the whole image would be better than letting PNG do its per-row thing? (to try to make the post-filter residuals more consistent for the back end modeling stage). For max compression you would use something like a png optimizer that tried various filter strategies, but instead of rating them using zlib, rate with the back-end of your choice.
OodleLZ_CompressionLevel_Optimal2 - (level 6) this is my default, goto high compress setting.
Most users of Kraken want max compression ratio and max decode speed. This should be your first choice for that. Encode speed and memory usage at encode time can be quite high.
Oodle's Optimal2 is pretty comparable to lzma -mx9 or ZStd max compression or any other optimal parse LZ with a strong string matcher.
For making your distributions I recommend Optimal2.
If Optimal2 is basically what you want, you might also try Optimal1 and Optimal3 which offer certain advantages :
OodleLZ_CompressionLevel_Optimal3 - (level 7) favors small size a bit more. Optimal3 can make measurably smaller files than Optimal2, so if the main thing you care about is size, not encode or decode time, then go with Optimal3.
Optimal3 not only works a little harder at encode time (thus takes a bit more time), it also enables some modes in the decoder which cause decodes to run a little slower. (maybe 10% slower to decode, eg. 950 mb/s instead of 1050 mb/s - still way way faster than anything else).
Because these modes slow down decode they aren't in Optimal2. The goal of Optimal 1 and 2 is to decode just as fast as the lower compress modes and maximize ratio under that constraint of preserving decode speed.
OodleLZ_CompressionLevel_Optimal1 - (level 5) faster to encode high compress mode.
Optimal1 is comparable to lzma's -mx5 mode, in both cases it's the fastest level that does an optimal parse.
Kraken's Optimal1 is a bit of a funny compromise. I don't generally recommend it. But if Optimal2 is too slow for you, then maybe Optimal1 does the trick. One thing you might like about Optimal1 is that it does very few memory allocations and is decent about limiting memory use, whereas Optimal2 is quite heavy on the memory subsystem.
(Optimal2 and higher can of course get very very slow if they run out of memory and start going to swap; if you have less than 8 GB of RAM, stick to Optimal1 and lower on very large files (over 100MB))
OodleLZ_CompressionLevel_Normal - (level 4) the default non-optimal parse mode.
Normal is memory limited and decently fast. Its encode speed is similar to ZLib.
Basically if you tried the Optimal modes and want something faster, your next step is to try Normal.
OodleLZ_CompressionLevel_Fast & OodleLZ_CompressionLevel_VeryFast - (level 3 and 2) ; these are for when you tried _Normal and want something a little faster.
We don't really have super-fast low-compression modes in Kraken yet (nothing comparable to ZStd's super fast low-compress modes, for example). All the Kraken levels, even VeryFast, are pretty high compression.
These modes can be useful for faster turnaround of daily iterative work, like when artists make new content and want to preview it in game or whatever.
The other thing you can play with in Kraken is the "spaceSpeedTradeoffBytes" option in the CompressOptions.
The CompressionLevel is generally a tradeoff of encoder time vs. compressed size - it tries to maintain decode speed (except for the aforementioned exception at Optimal3).
If you are willing to give up some decode speed to get smaller sizes, you do that with spaceSpeedTradeoffBytes.
The default is 256. To make smaller compressed files, make this number lower. Try 200 or 128. I don't think there's much reason to go below 128, you start giving up a lot of time for not much size gain.
I don't recommend making spaceSpeedTradeoffBytes higher than 300 or so with Kraken. The reason is that if you want more speed, you should at some point be switching to Mermaid. In Oodle 2.4 you can do that automatically by using "Hydra" (the many-headed beast) which will automatically select Kraken/Mermaid/Selkie based on your space-speed tradeoff. When you use Hydra, if you set spaceSpeedTradeoffBytes to 512, you might get a little of Kraken and a little of Mermaid.
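In code this is just a field on the options struct. Rough sketch only - the exact OodleLZ_Compress signature and options plumbing differ a bit between Oodle versions, so take this as pseudo-code and check your own oodle2.h :

OodleLZ_CompressOptions opts = *OodleLZ_CompressOptions_GetDefault();
opts.spaceSpeedTradeoffBytes = 200;      /* below the default 256 : smaller files, a bit slower to decode */

OO_SINTa compLen = OodleLZ_Compress(
    OodleLZ_Compressor_Kraken,           /* or Hydra, to let it pick Kraken/Mermaid/Selkie for you */
    rawBuf, rawLen, compBuf,
    OodleLZ_CompressionLevel_Optimal2,
    &opts /* , ... remaining args per your SDK version */ );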
We try to be very generous with our knowledge here. We write lots of articles about our technology discoveries. We don't patent anything because we believe patents are generally bullshit and other researchers should be able to make the same discovery.
We haven't written much about Kraken because it's mainly "engineering" not "science". It's very careful, sometimes very clever, very hard-fought engineering, but it's mostly implementation details. So when someone can just disassemble and steal those engineering details, it's incredibly disheartening.
The idea of not patenting our inventions is that independent researchers should be able to come up with the same inventions on their own, and we shouldn't have ownership of that idea. But that assumes a kind of honor code that those people are doing their own research, not just taking apart your creation to see how it works.
In the end of course Kraken will be taken apart and the ideas will get out, we can't stop that. There are lots of obvious clues to what it's doing that don't require disassembly. We always knew that the ideas would get out - just the existence of Kraken and the possibility is in itself a valuable clue for researchers that there's something worth looking at in that space.
I think that we've done some pretty interesting research over the years, and we need to sell software to be able to live and support that research. It's an amazing environment at RAD that has supported us in doing full-time development of various ideas that don't always immediately pay off. It seems there are people who are opposed to the fact that we're trying to sell software, and I think that's fucking ridiculous.
ADD :
Disassembling the Oodle lib is obviously illegal (*). This is not the same as legitimate reverse engineering (which you certainly could do, such as by passing in known buffers and looking at the output bytes from the compressor, or something like that). Without patents, our mechanism to enforce this is the evaluation agreement (which someone clearly violated) and copyright.
The disassembly is obvious enough that we could probably win against that specific copy of the code. But that's not really the problem.
We also don't really care about anyone figuring out the Kraken bitstream and having their own decoders. That was always bound to happen (as modders and hackers will want to get access to Kraken-encoded game data files).
What is disturbing is that there are now sure to be a raft of 2nd-generation codecs based on the stolen Kraken disassembly. They will come from people who have looked at the disassembly and seen the ideas, and then write their "own" version of those ideas from scratch. So in the future we might get a bunch of "new" "original" LZ codecs coming out with Kraken-like performance that are just based on what they stole from Kraken. That sucks.
(* = there appear to be many parts of the world where disassembly for "research purposes" is allowed. In those parts of the world, the law that allows disassembly for research trumps the T&C or EULA that forbids disassembly. IANAL but I suspect that even in those places, wholesale disassembly and publishing of that code is still illegal, otherwise we would see it done much more often. In any case, trying to protect an implementation with just copyright is very difficult, which is why most people use patents.)
ADD : this has in fact now happened. (codecs that are entirely stolen Oodle code being passed off as original work)
I find many things about this experience to be incredibly painful.
I think RAD in general has tried about as hard as possible to be one of the "good guys" that gives back to the community. We don't patent anything, we share our ideas, we've written extensively and put lots of great code in the public domain (such as on my blog & Fabian's).
Most of the money that's made off of compression is done in the most disgusting of ways, by getting patents on trivial obvious shit, then getting those patents to apply to an open standard that gets into DVDs or web browsers or whatever, so you get rich with no fucking real contribution to the world. In contrast, we actually invent major algorithms, we don't patent them, we try to make money by charging a one time fee to our clients with no strings attached. It's just about the hardest possible way to make money on compression.
Not only has RAD done about as much as possible to give back to the community, I've personally given huge huge amounts of my life to compression. Starting 20 years ago when I made LZP and PPMZ and order-1-huffman and secondary statistics and etc. etc. and gave it all away. I wrote about it, gave out source code, taught lots of people, answered questions in emails. I've basically never made a penny on any of that. More recently we've written about most of the things we've discovered in modern LZ, like optimal parse strategies, literals-after-match and LZ-sub, rep match cache strategies, ANS coders, etc. etc.
And the thanks we get is that when we invent something pretty special that we want to keep secret for a couple of years so we can sell it, it just gets stolen.
One of the extra disgusting things about the thieves is that they are so self-righteous, so victim-blaming, that somehow some aspect of our behavior makes us deserve it, or that stealing someone's livelihood, stealing someone's hard work and their fucking sweat and years of effort is somehow a benevolent act.
1. You make one consistent hash value. You don't get to have a hash function for 32-bit and one for 64-bit and make different values. Similarly you don't get to make different values in your SIMD hash.
It is very useful to have different code for 32-bit/64-bit/SIMD so that you can run well on different machines. But you have to always make the same value.
At the moment, hashes that rely on 64-bit maths are marginal IMO. There are still quite a few 32-bit devices out there (older mobile phones for example) and the hit of running something like 64-bit multiplies on those is too great.
And, crucially, the idea of speeding up the hash by using 64-bit scalar code is just wrong. Every mainstream processor that has 64-bit scalar also has SIMD, and running 32-bit hashes in SIMD is better.
2. I don't understand people who use SIMD to try to speed up linear code, or try to take linear code and just enable the advanced instructions in the compiler and think something magic will happen.
You don't just take some hash code that linearly scans a buffer and is full of dependencies and enable arch:avx2 in the compiler and expect anything magic. That's not how SIMD works. You need independent data flows and execution sequences that are fully parallel.
That is, you do 4 independent 32-bit hashes at once, you don't do one hash using SIMD.
3. Why are you trying to use the same code on tiny buffers and huge buffers? They're totally different problems. How about
if ( len < 1024 )
// use short hash
else
// use long hash
??
It's totally fine for your high-throughput long hash to have some spin-up/down time. That is not a liability on short strings - you don't use it on short strings!
In the case of hashing it's just so easy and obvious how to do it properly and I don't understand why nobody else seems to get it. (I'm talking about the long length case; for the short len case you just use FNV or something super simple with minimal rollup) (and in short len, small details can dominate, like whether you can inline the call and so on)
You hash with 32-bit registers. You do 4 (or 8) or whatever at a time independently. You always do 8 streams even in scalar mode, so that the value always comes out the same.
Then you have a final combine to mix the independent stream hashes.
There are two ways to do the multiple streams.
A. Interleaved. Each 32-bit value in order goes to a separate stream, like :
000011112222...777700001111....
For 8-way. In this case the maximum SIMD width must be decided in advance so that the value can be consistent. (I would probably go with 8-wide for the future even though it would usually run 4-wide).
B. Chunked. Chunks of 1k or 4k or whatever bytes each get an independent hash. Like :
0000..[4k]..1111....2222....
With the chunked method you can hash every chunk independently, so the SIMD width can be decided at runtime and still make the same hash value.
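Here's a toy sketch of the interleaved scheme with the lane count fixed at 8 (FNV-style lane mixing is just a stand-in for a real mixer; the point is the fixed lane assignment, which lets a 4-wide or 8-wide SIMD version produce the identical value) :

#include <stddef.h>
#include <string.h>

#define HASH_LANES 8

unsigned int interleaved_hash(const unsigned char * buf, size_t len)
{
    unsigned int lane[HASH_LANES];
    for (int i = 0; i < HASH_LANES; i++)
        lane[i] = 2166136261u + (unsigned int)i;          /* per-lane seed */

    size_t nwords = len / 4;
    for (size_t w = 0; w < nwords; w++)
    {
        unsigned int v;
        memcpy(&v, buf + w*4, 4);                         /* word w always goes to lane (w % 8) */
        lane[w % HASH_LANES] = (lane[w % HASH_LANES] ^ v) * 16777619u;   /* 32-bit math only */
    }

    unsigned int h = 2166136261u;
    for (size_t t = nwords*4; t < len; t++)               /* tail bytes */
        h = (h ^ buf[t]) * 16777619u;
    for (int i = 0; i < HASH_LANES; i++)                  /* final combine of the 8 streams */
        h = (h ^ lane[i]) * 16777619u;
    return h;
}

A SIMD version just keeps the 8 lanes in two SSE registers (or one AVX2 register) and does the xor/multiply per vector; because the lane assignment is fixed at 8, it returns exactly the same value as this scalar loop.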
That's not to say that the recent hash work isn't awesome, and there are some crazy-fast good (non-cryptographic but highly likely to detect corruption) hashes out there.
total : Kraken : 2.914 to 1 : 1053.961 MB/s
total : lzma : 3.186 to 1 : 52.660 MB/s
(Win64 Core i7-3770 3.4 GHz)
Kraken is around 20X faster than lzma, but lzma compresses better (about 9%). That's already a tip that something is horribly wrong; you have a 2000% speed difference and a 9% size difference.
If we look at the total time to load compressed from disk + decompress, we can make these speedup factor curves :
At very low disk speeds, the higher compression of lzma provides a speedup over Kraken. But how slow does the disk have to be? You can see the intersection of the curves is between 0 and 1 on the log scale, that's 1-2 MB/s !!
For any disk faster than 2 MB/s , load+decomp is *way* faster with Kraken. At a disk speed of 16 MB/s or so (log scale 4) the full load+decomp for Kraken is around 2X faster than with lzma. And that's still a very slow disk (around optical speed).
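You can check that crossover directly from the Seven numbers above; a little throwaway calculation (ignoring any IO/decode overlap) :

#include <stdio.h>

int main(void)
{
    /* load-then-decode time per MB of raw data = (1/ratio)/diskSpeed + 1/decodeSpeed */
    double kr_ratio = 2.914, kr_dec = 1053.961;   /* Kraken on Seven */
    double lz_ratio = 3.186, lz_dec = 52.660;     /* lzma on Seven   */

    /* disk speed where the two are equal : */
    double crossover = (1.0/kr_ratio - 1.0/lz_ratio) / (1.0/lz_dec - 1.0/kr_dec);
    printf("break-even disk speed = %.2f MB/s\n", crossover);   /* ~ 1.6 MB/s */
    return 0;
}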
Now, this is a speedup factor for load *then* decomp. If you are fully overlapping IO with decompression, then some of the decode time is hidden.
*But* that also assumes that you have a whole core to give to decompression. And it assumes you have no other CPU to work to do after loading.
The idea that you can hide decompressor time in IO time only works if you have enough independent loads so that there's lots to overlap (because if you don't, then the first IO and last decompress will never overlap anything), and it assumes you have no other CPU work to do.
In theory I absolutely love the idea that you just load pre-baked data which is all ready to go, and you just point at it, so there's no CPU work in loading other than decompression, but in practice that is almost never the case. eg. for loading compressed web pages, there's tons of CPU work that needs to be done to parse the HTML or JS or whatever, so the idea that you can hide the decompressor time in the network latency is a lie - the decompressor time adds on to the later processing time and adds directly onto total load latency.
The other factor that people often ignore is the fact that loading these days is heterogeneous.
What you actually encounter is something like this :
Download from internet ~ 1 MB/s
Load from optical disc ~ 20 MB/s
Load from slow HDD ~ 80 MB/s
Fast SSD ~ 500 MB/s
NVMe drive on PCIe ~ 1-2 GB/s
Load from cache in RAM ~ 8 GB/s
We have very heterogeneous loading - even for a single file loaded by the same application.
The first time you load it, maybe you download from the internet, and in that case a slow decompressor like lzma might be best. But the next time you load it's from the cache in RAM. And the time after that it's from HDD. In those cases, using lzma is a disaster (in the sense that the loading is now nearly instant, but you spend seconds decoding; or in the sense that just loading uncompressed would have been way faster).
One issue that I think is not considered is that making the right choice in the slow-disk zone is not that big of a deal. On a 1 MB/s disk, the difference in "speedup" between lzma and Kraken is maybe 2% in favor of lzma. But on a 100 MB/s disk it's something like 400% in favor of Kraken.
Now in theory maybe it would be nice to have different compressors for download & disk storage; like you use something like lzma for downloadable, and then decode and re-encode to ZStd for HDD loading. In practice nobody does that and the advantage over just using ZStd all the time is very marginal.
Also in theory it would be nice if the OS cache would cache the decompressed data rather than caching the compressed data.
TODO : time lzma on PS4. Usually PS4 is 2-4X slower than my PC, so that puts lzma somewhere in the 10-25 mb/s range, which is very borderline for keeping up with the optical disc.
DVD 16x is ~ 20 MB/s (max)
PS4 Blu-Ray is 6x ~ 27 MB/s (max)
PS4 transparently caches Blu-Ray to HDD
Of course because of the transparent caching to HDD, if you actually keep files in lzma on the disc, and they are cached to HDD, loading them from HDD is a huge mismatch and makes lzma the bottleneck.
That is, in practice on the PS4 when you load files from disc, they are sometimes actually coming from the HDD transparent cache, so you sometimes get 20 MB/s speeds, and sometimes 100 MB/s.
Now of course we'd love to have a higher-ratio compressor than Kraken which isn't so slow. Right now, we just don't have it. We have Kraken at 1000 MB/s , LZNA at 120 MB/s , lzma at 50 MB/s - it's too big of a step down in speed, even for LZNA.
In order for the size gain of lzma/LZNA to be worth it, it needs to run a *lot* faster, more like 400 mb/s. There needs to be a new algorithmic step in the high compress domain to get there.
At the moment the only reason to use the slower decoders than Kraken is if you simply must get smaller files and damn the speed; like if you have a downloadable app size hard limit and just have to fit in 100 MB, or if you are running out of room on an optical disc or whatever.
Silesia :
Game Test Set :
Seven, total :
Seven, all files :
If you're working from existing BCn data (not RGB originals), there's not a lot you can do. One small thing you can do is to de-interleave the end points and indexes.
In BC1/DXT1 for example each block is 4 bytes of end points then 4 bytes of index data. You can take each alternating 4 bytes out and instead put all the end points together, then all the index data together. Sometimes this improves compression, sometimes it doesn't, it depends on the compressor and the data. When it does help it's in the 5-10% range.
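A minimal sketch of that de-interleave (names are just for illustration; remember to re-interleave at load time before handing the blocks to the GPU) :

#include <stddef.h>
#include <string.h>

/* split a buffer of 8-byte BC1 blocks into all-endpoints followed by all-indices */
void bc1_deinterleave(const unsigned char * blocks, size_t numBlocks, unsigned char * out)
{
    unsigned char * endpoints = out;                   /* numBlocks * 4 bytes of endpoint pairs */
    unsigned char * indices   = out + numBlocks * 4;   /* numBlocks * 4 bytes of index dwords   */
    for (size_t i = 0; i < numBlocks; i++)
    {
        memcpy(endpoints + i*4, blocks + i*8,     4);
        memcpy(indices   + i*4, blocks + i*8 + 4, 4);
    }
}

Then compress the de-interleaved buffer and compare against the original layout on your own data; keep whichever wins.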
If you're using mips, then you can convert the BCn endpoint colors into deltas from the parent mip. (that will only be a good idea if your BCn encoder is one that is *not* aggressive about finding endpoints outside of the original image colors)
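A crude sketch of the shape of that transform (byte-wise wrap-around deltas just to show the idea; a real version would probably delta the 5/6/5 endpoint channels separately, and of course you undo this at load time) :

#include <stddef.h>

/* replace each mip's BC1 endpoint bytes with deltas from the corresponding parent-mip block */
void bc1_endpoint_deltas(unsigned char * mip, int blocksWide, int blocksHigh,
                         const unsigned char * parent, int parentBlocksWide)
{
    for (int y = 0; y < blocksHigh; y++)
    for (int x = 0; x < blocksWide; x++)
    {
        unsigned char *       b = mip    + (size_t)(y*blocksWide + x)*8;
        const unsigned char * p = parent + (size_t)((y/2)*parentBlocksWide + (x/2))*8;
        for (int i = 0; i < 4; i++)                 /* first 4 bytes of each block = the two 565 endpoints */
            b[i] = (unsigned char)(b[i] - p[i]);    /* wrap-around byte delta */
    }
}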
If you have original RGB data, you can make new BCn data that's designed to be more compressible. This opens up a whole new world of powerful possibilities : R/D decisions in the BCn encoding.
There are some obvious basic ideas, like - instead of aggressive end point searches, only choose end points that occur or are near the colors in the block; try to reuse previous index dwords rather than make new ones; try to use completely flat color blocks with the whole index dword = 0, etc.
See for example Jon Olick's articles on doing this .
But unless you really need to do it manually for some reason you should probably just use Rich Geldreich's cool tool crunch .
Crunch can make its own "crn" compressed texture format, which you would need to load with the crn-lib. crn-lib would decode the crn file back to BC1 at load time. That may be an awesome thing to do, I really can't comment because I haven't looked into it in detail.
Let's assume for the moment that you don't want to use the "crn" custom compressed format. You just want DDS or raw BCn data that you will run through your normal compression pipeline (Oodle or whatever). Crunch can also make BCn data that's more compressible by reducing its complexity, choosing encodings that are lower quality but less complex.
You tell it to make BCn and output DDS and you can specify a "quality". Then when you run it through your back-end compressor you get smaller files :
lena 512x512 RGB (absolutely terrible as a representative game texture)
DXT1 DDS quality 255
rmse : 7.6352 psnr : 30.5086
Oodle LZNA : 131,200 -> 102,888 = 6.274 bpb = 1.275 to 1
Oodle Kraken : 131,200 -> 107,960 = 6.583 bpb = 1.215 to 1
DXT1 DDS quality 200
rmse : 8.2322 psnr : 29.8547
Oodle LZNA : 131,200 -> 80,264 = 4.894 bpb = 1.635 to 1
Oodle Kraken : 131,200 -> 85,268 = 5.199 bpb = 1.539 to 1
CRN quality 255
rmse : 8.2699 psnr : 29.8150
crunched.crn 74,294
DXT1 DDS quality 128
rmse : 9.0698 psnr : 29.0131
Oodle LZNA : 131,200 -> 62,960 = 3.839 bpb = 2.084 to 1
Oodle Kraken : 131,200 -> 66,628 = 4.063 bpb = 1.969 to 1
CRN quality 160
rmse : 9.0277 psnr : 29.0535
crunched.crn 53,574
DXT1 DDS quality 80
rmse : 10.2521 psnr : 27.9488
Oodle LZNA : 131,200 -> 50,983 = 3.109 bpb = 2.573 to 1
Oodle Kraken : 131,200 -> 54,096 = 3.299 bpb = 2.425 to 1
CRN quality 100
rmse : 10.1770 psnr : 28.0126
crunched.crn 41,533
So going from rmse 7.64 to 10.26 we reduced the Oodle-compressed DDS files to about half their size! Pretty cool.
The CRN format files are even smaller at equal error. (unfortunately the -quality setting is not a direct comparison, you have to hunt around to find qualities that give equal rmse to do a comparison).
For my reference :
@echo test.bat [file] [quality]
@echo quality in 1-255 optional 128 default
set file=%1
if "%file%"=="" end.bat
set q=%2
if "%q%"=="" set q=128
crunch_x64.exe -DXT1 -fileformat dds -file %file% -maxmips 1 -quality %q% -out crunched.dds
@REM -forceprimaryencoding ??
@REM verified same :
@REM radbitmaptest64 copy crunched.dds crunched_dds_un.bmp
crunch_x64.exe -R8G8B8 -file crunched.dds -out crunched_dds_un.bmp -fileformat bmp
crunch_x64.exe -DXT1 -file %file% -maxmips 1 -quality %q% -out crunched.crn
crunch_x64.exe -R8G8B8 -file crunched.crn -out crunched_crn_un.bmp -fileformat bmp
call bmp mse %file% crunched_dds_un.bmp
@echo --------------------------
call ooz crunched.dds -zc7 -zl8
call ooz crunched.dds -zc8 -zl8
call bmp mse %file% crunched_crn_un.bmp
@echo --------------------------
call d crunched.crn
After some more months with this I have a better view of it.
ZStd (and to some extent LZ4 as well) are way faster with modern GCC -O3 and vectorizer (than they are with MSVC or older/disabled GCC ; and they benefit a lot more from those things than my compressors, or most code in general). They are written in a simple clean way that really works well with a compiler that turns that code into good implementations for the machine.
Most of the timings I post are of ZStd/LZ4 in non-ideal compilations which is a bit unfair to them.
ZStd in the optimal compression levels has MML 3 and is slower to decode (than ZStd at low compression level).
IMO the sweet spot for ZStd is in the faster compression levels (-2 to -9) particularly if you care about round-trip time. It's quite excellent as a mix of encode & decode speed and is able to get that with surprisingly simple code.
ADD : in 2018 I try to post all my timings of LZ4 and ZStd using the standard lzbench GCC compile.
Original :
Yann wrote to ask about a possible anomaly in my measurement of ZStd's performance, so I did a little digging.
I do have a bit of a funny mix of compilers in my Windows tests, which is maybe slightly unfair, though my testing indicates it's not a big factor.
The PS4 report is always one of the best to look at for absolute numbers; everything is built with the same compiler (clang 3.6.1) with the same options, run on standard hardware.
On Windows, I run Oodle from the shipping DLL which is built with MSVC 2012. The non-Oodle codecs that I test are mostly built with MSVC 2005 , which is just because my personal dev machine that I build that test on uses MSVC 2005, while the build machine that makes Oodle is on 2012. In my tests this difference is less than 5%.
For example when I post results like :
silesia : Kraken : 4.082 to 1 : 1004.014 MB/s
silesia : Mermaid : 3.571 to 1 : 2002.079 MB/s
silesia : Selkie : 3.053 to 1 : 2889.536 MB/s
silesia : lz4hc : 2.723 to 1 : 2269.788 MB/s
silesia : zlib9 : 3.128 to 1 : 358.593 MB/s
silesia : lzma : 4.369 to 1 : 78.655 MB/s
K,M,S are from the Oodle DLL with MSVC 2012
lz4 and lzma are built from source with MSVC 2005
zlib on Windows I run from zlib1x64.dll that somebody else built long ago
when I post results on non-Windows platforms, everything is built from source with the same compiler as Oodle.
First let's look at the effect of the MSVC version on Windows :
MSVC 2005 :
(actually zlib1x64.dll , not my build)
zlib9 : 24,700,820 ->13,115,250 = 4.248 bpb = 1.883 to 1
decode : 79.907 millis, 11.01 c/b, rate= 309.12 mb/s
lz4hc : 24,700,820 ->14,801,510 = 4.794 bpb = 1.669 to 1
decode : 9.499 millis, 1.31 c/b, rate= 2600.32 mb/s
zstdmax : 24,700,820 ->10,401,235 = 3.369 bpb = 2.375 to 1
decode : 58.505 millis, 8.06 c/b, rate= 422.20 mb/s
Win64 MSVC 2012 :
miniz : 24,700,820 ->13,120,668 = 4.249 bpb = 1.883 to 1
miniz_decompress_time : 77.631 millis, 10.70 c/b, rate= 318.18 mb/s
(miniz != zlib but the times are quite comparable anyway)
zstd : 24,700,820 ->10,403,228 = 3.369 bpb = 2.374 to 1
zstd_decompress_time : 57.932 millis, 7.98 c/b, rate= 426.37 mb/s
Zstd : 422.20 -> 426.37 mb/s
lz4hc : 24,700,820 ->14,801,510 = 4.794 bpb = 1.669 to 1
LZ4_decompress_safe_time : 9.214 millis, 1.27 c/b, rate= 2680.92 mb/s
LZ4 : 2600.32 -> 2680.92 mb/s
Oodle MSVC 2012 :
ooKraken : 2.48:1 , 2.4 enc MB/s , 1178.4 dec MB/s
ooMermaid : 2.31:1 , 2.1 enc MB/s , 2166.4 dec MB/s
ooSelkie : 1.94:1 , 2.6 enc MB/s , 3838.2 dec MB/s
so VC 2012 (vs 2005) is a small speed gain for the competition but it's not a huge factor.
Let's try another platform entirely, with a standardized compiler :
Mac x64 :
Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Apple LLVM version 7.3.0 (clang-703.0.29)
Kraken : 24,700,820 -> 9,970,882 = 3.229 bpb = 2.477 to 1
decode only : 21.248 millis, 2.23 c/b, rate= 1162.51 mb/s
Mermaid : 24,700,820 ->10,838,455 = 3.510 bpb = 2.279 to 1
decode only : 10.957 millis, 1.15 c/b, rate= 2254.29 mb/s
Selkie : 24,700,820 ->12,752,506 = 4.130 bpb = 1.937 to 1
decode only : 6.517 millis, 0.68 c/b, rate= 3790.49 mb/s
miniz : 24,700,820 ->13,120,668 = 4.249 bpb = 1.883 to 1
miniz_decompress_time : 90.072 millis, 9.46 c/b, rate= 274.23 mb/s
zstd : 24,700,820 ->10,403,228 = 3.369 bpb = 2.374 to 1
zstd_decompress_time : 54.176 millis, 5.69 c/b, rate= 455.94 mb/s
brotli-9 : 24,700,820 ->10,473,560 = 3.392 bpb = 2.358 to 1
brotli_decompress_time : 99.937 millis, 10.50 c/b, rate= 247.16 mb/s
brotli-11 : 24,700,820 -> 9,848,721 = 3.190 bpb = 2.508 to 1
brotli_decompress_time : 133.675 millis, 14.04 c/b, rate= 184.78 mb/s
lz4hc : 24,700,820 ->14,801,510 = 4.794 bpb = 1.669 to 1
LZ4_decompress_safe_time : 8.662 millis, 0.91 c/b, rate= 2851.75 mb/s
And another compiler, everyone built the same way, but on the same hardware as my Windows tests :
Linux x64 :
Core i7-3770 3.4 GHz
gcc-4.7.2 -O2 (*)
Kraken : 24,700,820 -> 9,970,882 = 3.229 bpb = 2.477 to 1
decode only : 23.059 millis, 3.64 c/b, rate= 1071.20 mb/s
Mermaid : 24,700,820 ->10,838,455 = 3.510 bpb = 2.279 to 1
decode only : 11.565 millis, 1.83 c/b, rate= 2135.83 mb/s
Selkie : 24,700,820 ->12,752,506 = 4.130 bpb = 1.937 to 1
decode only : 7.178 millis, 1.13 c/b, rate= 3441.18 mb/s
miniz : 24,700,820 ->13,120,668 = 4.249 bpb = 1.883 to 1
miniz_decompress_time : 85.327 millis, 13.47 c/b, rate= 289.48 mb/s
zstd : 24,700,820 ->10,403,228 = 3.369 bpb = 2.374 to 1
zstd_decompress_time : 66.987 millis, 10.58 c/b, rate= 368.74 mb/s
brotli-9 : 24,700,820 ->10,473,560 = 3.392 bpb = 2.358 to 1
brotli_decompress_time : 93.457 millis, 14.76 c/b, rate= 264.30 mb/s
brotli-11 : 24,700,820 -> 9,828,093 = 3.183 bpb = 2.513 to 1
brotli_decompress_time : 134.560 millis, 21.25 c/b, rate= 183.57 mb/s
lz4hc : 24,700,820 ->14,801,510 = 4.794 bpb = 1.669 to 1
LZ4_decompress_safe_time : 14.070 millis, 2.22 c/b, rate= 1755.57 mb/s
(* = I have to use a bad old GCC on Linux because of the nightmare of Linux binary lib compatibility; I use the oldest GCC possible to have maximum compatibility. This GCC also has a code gen bug in -O3 that creates crashing code due to the vectorizer F'ing up, so I have to use -O2. These factors seem to hurt ZStd and LZ4 a *lot*.)
Win Linux Mac
ooKraken : 1178.4 1071.20 1162.51
ooMermaid : 2166.4 2135.83 2254.29
ooSelkie : 3838.2 3441.18 3790.49
miniz : 318.18 289.48 274.23
zstdmax : 426.37 368.74 455.94
lz4hc : 2680.9 1755.57 2851.75
Win = MSVC 2012 Core i7-3770 3.4 GHz
Linux = gcc-4.7.2 -O2 Core i7-3770 3.4 GHz
Mac = Apple LLVM version 7.3.0 , Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
The Oodle timings are pretty consistent (Selkie takes a hit on the bad old Linux GCC) even though the Mac is a different CPU.
ZStd and LZ4 vary a lot!
There's another reference point I have for Windows performance. I can run the tasty "lzbench" which has lots of compressors.
lzbench runs are not directly comparable to mine. The biggest differences are that lzbench sets the process to "realtime" priority, and it doesn't invalidate the cache between runs. These two factors should mean that lzbench measures speeds slightly faster than I do in general. But that shouldn't be a huge factor.
Win64 :
Core i7-3770 3.4 GHz
lzbench (realtime prio, no cache invalidate)
GCC 5.3.0
lzbench 1.2 (64-bit Windows) Assembled by P.Skibinski
r:\z>lzbench.exe -ebrotli,9 -j40 r:\testsets\big\lzt99
r:\z>lzbench.exe -ezstd,99 -j40 r:\testsets\big\lzt99
r:\z>lzbench.exe -ezlib,9 -j40 r:\testsets\big\lzt99
r:\z>lzbench.exe -elz4hc,99 -j40 r:\testsets\big\lzt99
Compressor name Compress. Decompress. Compr. size Ratio Filename
brotli 0.4.0 -9 4.25 MB/s 262 MB/s 10481262 42.43 lzt99
zstd 0.7.1 -99 3.62 MB/s 560 MB/s 10401235 42.11 lzt99
zlib 1.2.8 -9 5.89 MB/s 271 MB/s 13063244 52.89 lzt99
lz4hc r131 -99 10 MB/s 2801 MB/s 14801510 59.92 lzt99
(note that this zlib is != my zlib dll, and != miniz , but is nonetheless similar speed)
What you should notice is that the zlib and brotli & lz4 times are about the same in lzbench, but ZStd got way faster.
My conclusion is that ZStd speed is very strongly affected by compiler & options. On Windows, with GCC -O3 it looks like ZStd can be around 25% faster than I've been reporting.
On the same hardware : (Core i7-3770 3.4 GHz)
zlib dll : 309.12 mb/s
zstdmax MSVC 2005: 422.20 mb/s
lz4hc MSVC 2005 : 2600.32 mb/s
miniz MSVC 2012 : 318.18 mb/s
zstd MSVC 2012 : 426.37 mb/s
lz4hc MSVC 2012 : 2680.92 mb/s
miniz gcc-4.7.2 -O2 : 289.48 mb/s
zstd gcc-4.7.2 -O2 : 368.74 mb/s
brotli9 gcc-4.7.2 -O2 : 264.30 mb/s
lz4hc gcc-4.7.2 -O2 : 1755.57 mb/s
brotli 0.4.0 GCC 5.3.0 : 262 MB/s
zstd 0.7.1 -99 GCC 5.3.0 : 560 MB/s
zlib 1.2.8 -9 GCC 5.3.0 : 271 MB/s
lz4hc r131 -99 GCC 5.3.0 : 2801 MB/s
Another note of fairness to ZStd :
I've run all the compressors at max / optimal encode setting, just to remove that variable so I don't have to track another axis of encoder and try to make it as apples-to-apples as possible. These tests have all been looking at decode speed and max compression ratio.
But ZStd gets quite a lot faster to decode at lower encode settings. The Oodle compressors are much more constant in decode speed (we work hard to make our optimal parsers not produce files that are slower to decode; if you want slower decodes you have the "spaceSpeedTradeoff" setting; in fact a concerted focus of my last few revs has been to separate that axis of variation, so that increasing the encoder Level in Oodle should just improve compression ratio without hurting decode speed).
For example at level 9 ZStd gets a lot faster :
lzt99
lzbench :
zstd 0.7.1 -9 34 MB/s 737 MB/s 11344386 45.93 lzt99
MSVC 2005 :
zstd9 : 24,700,820 ->11,344,386 = 3.674 bpb = 2.177 to 1
decode : 43.955 millis, 6.06 c/b, rate= 561.95 mb/s
If you just look at that 737 MB/s , it's getting closer to Kraken.
Though of course at level 9, ZStd's compression ratio is way down, so it's not really comparable any more. In fact it's compressing worse than Mermaid at that point.
ooMermaid : 2.31:1 , 2.1 enc MB/s , 2166.4 dec MB/s
So anyway.
ADD : of course this is why we (RAD) ship compiled libs, and it's the kind of thing we spend a lot of time on. Like, oh crap I updated my compiler and all of a sudden my speed is down by 20%. So go look at the disasm and see WTF the compiler is doing differently and try to poke it to make it generate the right code again, etc.
Thread-phased decoding is not as big a benefit for Mermaid as it is for Kraken. (Kraken gets around 1.5X, Mermaid much less).
There is a little neat thing about thread-phased Mermaid though, and that's the ability to run Mermaid+ at almost the exact same speed as Mermaid.
Mermaid+ is a hybrid option that's hidden in Oodle 2.3.0 and will be exposed in the next release. It gets compression between Mermaid & Kraken, with a small speed hit vs Mermaid.
Seven :
ooSelkie : 2.19:1 , 3.0 enc MB/s , 3668.0 dec MB/s
ootp2Mermaid: 2.46:1 , 2.3 enc MB/s , 3046.1 dec MB/s
lz4hc : 2.00:1 , 12.8 enc MB/s , 2532.4 dec MB/s
ooMermaid : 2.46:1 , 2.3 enc MB/s , 2364.4 dec MB/s
ootp2Kraken : 2.91:1 , 2.6 enc MB/s , 1660.9 dec MB/s
ooKraken : 2.91:1 , 2.6 enc MB/s , 1049.6 dec MB/s
zlib9 : 2.33:1 , 7.9 enc MB/s , 315.1 dec MB/s
ootp2Merm+ : 2.64:1 , 2.3 enc MB/s , 3042.4 dec MB/s
ooMermaid+ : 2.64:1 , 2.3 enc MB/s , 2044.5 dec MB/s
Silesia :
ooSelkie : 3.05:1 , 1.3 enc MB/s , 2878.4 dec MB/s
ootp2Mermaid: 3.57:1 , 1.1 enc MB/s , 2600.8 dec MB/s
lz4hc : 2.72:1 , 13.6 enc MB/s , 2273.5 dec MB/s
ooMermaid : 3.57:1 , 1.1 enc MB/s , 1994.2 dec MB/s
ootp2Kraken : 4.08:1 , 1.2 enc MB/s , 1434.4 dec MB/s
ooKraken : 4.08:1 , 1.2 enc MB/s , 1000.5 dec MB/s
zlib9 : 3.13:1 , 8.3 enc MB/s , 358.4 dec MB/s
ootp2Merm+ : 3.58:1 , 1.1 enc MB/s , 2583.9 dec MB/s
ooMermaid+ : 3.58:1 , 1.1 enc MB/s , 1986.8 dec MB/s
Game Test Set :
ooSelkie : 2.03:1 , 3.3 enc MB/s , 4548.0 dec MB/s
lz4hc : 1.78:1 , 14.0 enc MB/s , 3171.1 dec MB/s
ootp2Mermaid: 2.28:1 , 2.7 enc MB/s , 3099.4 dec MB/s
ooMermaid : 2.28:1 , 2.7 enc MB/s , 2622.8 dec MB/s
ootp2Kraken : 2.57:1 , 2.9 enc MB/s , 1812.7 dec MB/s
ooKraken : 2.57:1 , 2.9 enc MB/s , 1335.9 dec MB/s
zlib9 : 1.99:1 , 8.3 enc MB/s , 337.2 dec MB/s
ootp2Merm+ : 2.35:1 , 2.7 enc MB/s , 3034.9 dec MB/s
ooMermaid+ : 2.35:1 , 2.7 enc MB/s , 2409.0 dec MB/s
Pulling out just the relevant numbers on Seven, you can see Mermaid+ is between Mermaid and Kraken, but thread-phased it runs at full Mermaid speed :
Seven :
ooMermaid : 2.46:1 , 2.3 enc MB/s , 2364.4 dec MB/s
ooMermaid+ : 2.64:1 , 2.3 enc MB/s , 2044.5 dec MB/s
ooKraken : 2.91:1 , 2.6 enc MB/s , 1049.6 dec MB/s
ootp2Mermaid: 2.46:1 , 2.3 enc MB/s , 3046.1 dec MB/s
ootp2Merm+ : 2.64:1 , 2.3 enc MB/s , 3042.4 dec MB/s
ootp2Kraken : 2.91:1 , 2.6 enc MB/s , 1660.9 dec MB/s
Showing speed & ratio here, higher is better.
As usual the total on a test set is total size of all individually compressed files, and total time.
I think the scatter plot most clearly shows the way Kraken, Mermaid & Selkie are just on a whole new Pareto Frontier than the older compressors. You can connect the dots of K-M-S performance for each test set and they form a very consistent space-speed tradeoff curve that's way above the previous best.
The raw numbers :
gametestset : Kraken : 2.566 to 1 : 1363.283 MB/s
gametestset : Mermaid : 2.284 to 1 : 2711.458 MB/s
gametestset : Selkie : 2.030 to 1 : 4870.413 MB/s
gametestset : lz4hc : 1.776 to 1 : 3223.279 MB/s
gametestset : zlib9 : 1.992 to 1 : 338.063 MB/s
gametestset : lzma : 2.756 to 1 : 43.782 MB/s
pd3d : Kraken : 3.647 to 1 : 1072.833 MB/s
pd3d : Mermaid : 2.875 to 1 : 2299.860 MB/s
pd3d : Selkie : 2.379 to 1 : 3784.850 MB/s
pd3d : lz4hc : 2.238 to 1 : 2370.193 MB/s
pd3d : zlib9 : 2.886 to 1 : 382.226 MB/s
pd3d : lzma : 4.044 to 1 : 63.878 MB/s
seven : Kraken : 2.914 to 1 : 1053.961 MB/s
seven : Mermaid : 2.462 to 1 : 2374.796 MB/s
seven : Selkie : 2.194 to 1 : 3717.074 MB/s
seven : lz4hc : 2.000 to 1 : 2522.824 MB/s
seven : zlib9 : 2.329 to 1 : 315.344 MB/s
seven : lzma : 3.186 to 1 : 52.660 MB/s
silesia : Kraken : 4.082 to 1 : 1004.014 MB/s
silesia : Mermaid : 3.571 to 1 : 2002.079 MB/s
silesia : Selkie : 3.053 to 1 : 2889.536 MB/s
silesia : lz4hc : 2.723 to 1 : 2269.788 MB/s
silesia : zlib9 : 3.128 to 1 : 358.593 MB/s
silesia : lzma : 4.369 to 1 : 78.655 MB/s
See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie . For more about Oodle visit RAD Game Tools
The full report is here :
oodle_arm_report on cbloom.com
It's a thorough test on many devices and several corpora. See the full details there.
Cliff notes is : Oodle's great on ARM.
For example, see the iPadAir2 64-bit results on Silesia in the full report.
We found that the iOS devices are generally very good and easy to program for. They're more like desktop Intel chips; they don't have any terrible performance cliffs. The Android ARM devices we tested on were rather more difficult. For one thing they have horrible thermal saturation problems that make testing on them very difficult. They also have some odd performance quirks.
I'm sure we could get a lot more speed on ARM, but it's rather nasty to optimize for. For one thing the thermal problems mean that iterating and getting good numbers is a huge pain. It's hard to tell if a change helped or not. For another, there's a wide variety of devices and it's hard to tell which to optimize for, and they have different performance shortfalls. So there's definitely a lot left on the table here.
Mermaid & Selkie are quite special on ARM. Many of these devices have small caches (as small as 512k L2) and very slow main memory (slow wrst latency; they often have huge bandwidth, but latency is what I need). Mermaid & Selkie are able to use unbounded windows for LZ without suffering a huge speed hit, due to the unique way they are structured. Kraken doesn't have the same magic trick so it benefits from a limited window, as demonstrated in the report.
See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie . For more about Oodle visit RAD Game Tools
Selkie is all about decode speed, it aims to be the fastest mainstream decompressor in the world, and still gets more compression than anything in the high-speed domain.
Selkie does not currently have a super fast encoder. It's got good optimal parse encoders that produce carefully tuned encoded files which offer excellent space-speed tradeoff.
The closest compressors to Selkie are the fast byte-wise small-window coders like LZ4 and LZSSE (and Oodle's LZB16). These are all just obsolete now (in terms of ratio vs decode speed), Selkie gets a lot more compression (sometimes close to Zlib compression levels!) and is also much faster.
Selkie will not compress tiny buffers, or files that only compress a little bit. For example if you give Selkie something like an mp3, it might be able to compress it to 95% of its original size, saving a few bytes. Selkie will refuse to do that and just give you the original uncompressed file. If you wanted that compression, that means you wanted to save only a few bytes at a large time cost, which means you don't actually want a fast compressor like Selkie. You in fact wanted a compressor that was more willing to trade time for bytes, such as Mermaid or Kraken. Selkie will not abide logical inconsistency.
Selkie generally beats LZ4 compression even on small files (under 64k) but really gets ahead on files larger than 64k where the unbounded match distances can find big wins.
As usual, I'm not picking on LZ4 here because it's bad; I'm comparing to it because it's the best of the rest, and it's widely known. Both decompressors are run fuzz-safe.
Tests on Win64 (Core i7-3770 3.4 GHz) :
(for reference, this machine runs memcpy at roughly 8 GB/s)
(total of time & size on each test set)
gametestset : ooSelkie : 143,579,361 ->70,716,380 = 3.940 bpb = 2.030 to 1
gametestset : decode : 29.239 millis, 0.69 c/b, rate= 4910.61 mb/s
gametestset : lz4hc : 143,579,361 ->80,835,018 = 4.504 bpb = 1.776 to 1
gametestset : decode : 44.495 millis, 1.05 c/b, rate= 3226.89 mb/s
pd3d : ooSelkie : 31,941,800 ->13,428,298 = 3.363 bpb = 2.379 to 1
pd3d : decode : 8.381 millis, 0.89 c/b, rate= 3811.29 mb/s
pd3d : lz4hc : 31,941,800 ->14,273,195 = 3.575 bpb = 2.238 to 1
pd3d : decode : 13.479 millis, 1.44 c/b, rate= 2369.67 mb/s
seven : ooSelkie : 80,000,000 ->36,460,084 = 3.646 bpb = 2.194 to 1
seven : decode : 21.458 millis, 0.91 c/b, rate= 3728.26 mb/s
seven : lz4hc : 80,000,000 ->39,990,656 = 3.999 bpb = 2.000 to 1
seven : decode : 31.730 millis, 1.35 c/b, rate= 2521.30 mb/s
silesia : ooSelkie : 211,938,580 ->69,430,966 = 2.621 bpb = 3.053 to 1
silesia : decode : 72.340 millis, 1.16 c/b, rate= 2929.77 mb/s
silesia : lz4hc : 211,938,580 ->77,841,566 = 2.938 bpb = 2.723 to 1
silesia : decode : 93.488 millis, 1.50 c/b, rate= 2267.02 mb/s
The edge that Selkie has over LZ4 is even greater on more difficult platforms like the PS4.
To get a better idea of the magic of Selkie it's useful to look at the other Oodle compressors that are similar to Selkie.
LZB16 is Oodle's LZ4 variant; it gets slightly more compression and slightly more decode speed, but they're roughly equal. It's included here for comparison to LZBLW.
Oodle's LZBLW is perhaps the most similar compressor to Selkie. It's like LZB16 (LZ4) but adds large-window matches. That ability to do long-distance matches hurts speed a tiny bit (2873 mb/s -> 2596 mb/s), but helps compression a lot.
Oodle's LZNIB is nibble-wise, with unbounded offsets and a rep match. It gets good compression, generally better than Zlib, with speed much higher than any LZ-Huff. LZNIB is in a pretty unique space speed tradeoff zone without much competition outside of Oodle.
lz4hc : 24,700,820 ->14,801,510 = 4.794 bpb = 1.669 to 1
decode : 9.481 millis, 1.31 c/b, rate= 2605.37 mb/s
ooLZB16 : 24,700,820 ->14,754,643 = 4.779 bpb = 1.674 to 1
decode : 8.597 millis, 1.18 c/b, rate= 2873.17 mb/s
ooLZNIB : 24,700,820 ->12,014,368 = 3.891 bpb = 2.056 to 1
decode : 17.420 millis, 2.40 c/b, rate= 1417.93 mb/s
ooLZBLW : 24,700,820 ->13,349,800 = 4.324 bpb = 1.850 to 1
decode : 9.512 millis, 1.31 c/b, rate= 2596.80 mb/s
ooSelkie : 24,700,820 ->12,752,506 = 4.130 bpb = 1.937 to 1
decode : 6.410 millis, 0.88 c/b, rate= 3853.57 mb/s
LZNIB and LZBLW were both pretty cool before Selkie, but now they're just obsolete.
LZBLW gets a nice compression gain over LZB16, but Selkie gets even more, and is way faster!
LZNIB beats Selkie compression, but is way slower, around 3X slower, in fact it's slower than Mermaid (2283.28 mb/s and compresses to 10,838,455 = 3.510 bpb = 2.279 to 1).
You can see from the curves that Selkie just completely covers the curves of LZB16,LZBLW, and LZ4. When a curve is completely covered like that, it means that it was beaten for both space and speed, so there is no domain where that compressor is ever better. LZNIB just peeks out of the Selkie curve because it gets higher compression (albeit at lower speed), so there is a domain where it is the better choice - but in that domain Mermaid just completely dominates LZNIB, so it too is obsolete.
See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie . For more about Oodle visit RAD Game Tools
There's really nothing even close. It's way beyond what was previously thought possible.
Mermaid supports unbounded distance match references. This is part of how it gets such high compression. It does so in a new way which reduces the speed penalty normally incurred by large-window LZ's.
Mermaid almost always compresses better than ZLib. The only exception is on small files, less than 32k or so. The whole Oceanic Bestiary family is best suited to files over 64k. They work fine on smaller files, but they lose their huge advantage. It's always best to combine small files into larger units for compression, particularly so with these compressors.
There's not really any single compressor to compare Mermaid to. What we can do is compare vs. Zlib's compression ratio and LZ4's speed. A kind of mythological hybrid like a Chimera, the head of a Zlib and the legs of an LZ4.
Tests on Win64 (Core i7-3770 3.4 GHz) :
Silesia :
On Silesia, Mermaid is just slightly slower than LZ4 but compresses much more than Zlib !!
PD3D :
On PD3D, Mermaid gets almost exactly the compression level of ZLib but the decode speed of LZ4. Magic! It turns out you *can* have your cake and eat it too.
Game Test Set :
lzt99 :
Mermaid really compresses well on lzt99 ; not only does it kill Zlib, it gets close to high compression LZ-Huffs like RAR. (RAR gets 10826108 , Mermaid 10838455 bytes).
Seven :
Because of the space-speed optimizing nature of Mermaid, it will make decisions to be slower than LZ4 when it can find big compression gains. For example if you look at the individual files of the "Seven" test set below - Mermaid is typically right around the same speed as LZ4 or even faster (baby7,dds7,exe7,game7,wad7 - all same speed or faster than LZ4). On a few files it decides to take an encoding slower to decode than LZ4 - model7,enwik7, and records7. The biggest differences are enwik7 and records7, but if you look at the compression ratios - those are all the files where it found huge size differences over LZ4. It has an internal exchange rate for time vs. bytes that it must meet in order to take that encoding, trying to optimize for its space-speed target usage.
Seven files :
Silesia : Mermaid : 3.571 to 1 : 2022.038 MB/s
Silesia : lz4hc : 2.723 to 1 : 2267.021 MB/s
Silesia : zlib9 : 3.128 to 1 : 358.681 MB/s
GameTestSet : Mermaid : 2.284 to 1 : 2718.095 MB/s
GameTestSet : lz4hc : 1.776 to 1 : 3226.887 MB/s
GameTestSet : zlib9 : 1.992 to 1 : 337.986 MB/s
lzt99 : Mermaid : 2.279 to 1 : 2283.278 MB/s
lzt99 : lz4hc : 1.669 to 1 : 2605.366 MB/s
lzt99 : zlib9 : 1.883 to 1 : 309.304 MB/s
PD3D : Mermaid : 2.875 to 1 : 2308.830 MB/s
PD3D : lz4hc : 2.238 to 1 : 2369.666 MB/s
PD3D : zlib9 : 2.886 to 1 : 382.349 MB/s
Seven : Mermaid : 2.462 to 1 : 2374.212 MB/s
Seven : lz4hc : 2.000 to 1 : 2521.296 MB/s
Seven : zlib9 : 2.329 to 1 : 315.370 MB/s
See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie . For more about Oodle visit RAD Game Tools
Everything is slow on the PS4 in absolute terms (it's a slow chip and difficult to optimize for). The Oodle compressors do very well, even better in relative terms on PS4 than on typical PC's.
Kraken is usually around ~2X faster than ZStd on PC's, but is 3X faster on PS4. Mermaid is usually just slightly slower than LZ4 on PC's, but is solidly faster than LZ4 on PS4.
lzt99 :
Kraken : 2.477 to 1 : 390.582 MB/s
Mermaid : 2.279 to 1 : 749.896 MB/s
Selkie : 1.937 to 1 : 1159.064 MB/s
zstd : 2.374 to 1 : 133.498 MB/s
miniz : 1.883 to 1 : 85.654 MB/s
lz4hc-safe : 1.669 to 1 : 673.616 MB/s
LZSSE8 : 1.626 to 1 : 767.106 MB/s
Mermaid is faster than LZ4 on PS4 !! Wow! And the compression level is in a totally different domain than other super-fast decompressors like LZ4 or LZSSE.
lzt99 is a good case for Selkie & Mermaid. Selkie beats zlib compression ratio while being 75% faster than LZ4.
All compressors here are fuzz-safe, and run in safe mode if they have optional safe/unsafe modes.
Charts : (showing time and size - lower is better!)
lzt99 :
the raw data :
PS4 : Oodle 230 : (-z6)
inName : lzt:/lzt99
reference :
miniz : 24,700,820 ->13,120,668 = 4.249 bpb = 1.883 to 1
miniz_decompress_time : 288.379 millis, 18.61 c/b, rate= 85.65 mb/s
zstd : 24,700,820 ->10,403,228 = 3.369 bpb = 2.374 to 1
zstd_decompress_time : 185.028 millis, 11.94 c/b, rate= 133.50 mb/s
lz4hc : 24,700,820 ->14,801,510 = 4.794 bpb = 1.669 to 1
LZ4_decompress_safe_time : 36.669 millis, 2.37 c/b, rate= 673.62 mb/s
LZSSE8 : 24,700,820 ->15,190,395 = 4.920 bpb = 1.626 to 1
decode_time : 32.200 millis, 2.08 c/b, rate= 767.11 mb/s
Oodle :
Kraken : 24,700,820 -> 9,970,882 = 3.229 bpb = 2.477 to 1
decode : 63.241 millis, 4.08 c/b, rate= 390.58 mb/s
Mermaid : 24,700,820 ->10,838,455 = 3.510 bpb = 2.279 to 1
decode : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s
Selkie : 24,700,820 ->12,752,506 = 4.130 bpb = 1.937 to 1
decode : 21.311 millis, 1.38 c/b, rate= 1159.06 mb/s
BTW for reference, the previous best compressor in Mermaid's domain was LZNIB. Before these new compressors, LZNIB was quite unique in that it got good decode speeds, much faster than the LZ-Huffs of the time (eg. 3X faster than ZStd) but with compression usually better than ZLib. Well, LZNIB is still quite good compared to other competition, but it's just clobbered by the new Oceanic Bestiary compressors. The new compressor in this domain is Mermaid and it creams LZNIB for both size and speed :
LZNIB -z6 : 24,700,820 ->12,015,591 = 3.892 bpb = 2.056 to 1
decode : 58.710 millis, 3.79 c/b, rate= 420.73 mb/s
Mermaid : 24,700,820 ->10,838,455 = 3.510 bpb = 2.279 to 1
decode : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s
See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie . For more about Oodle visit RAD Game Tools
Mermaid and Selkie are the super-fast-to-decode distant relatives of Kraken. They use some of the same ideas and technology as Kraken, but are independent compressors targeted at even higher speed and lower compression. Mermaid & Selkie make huge strides in what's possible in compression in the high-speed domain, the same way that Kraken did in the high-compression domain.
Mermaid is about twice as fast as Kraken, but with compression around Zlib levels.
Selkie is one of the fastest decompressors in the world, and also gets much more compression than other very-high-speed compressors.
( Oodle is my data compression library that we sell at RAD Game Tools , read more about it there )
Kraken, Mermaid, and Selkie all use an architecture that makes space-speed decisions in the encoder to give the best tradeoff of compressed size vs decoding speed. The three compressors have different performance targets and make decisions suited for each one's usage domain (Kraken favors more compression and will give up some speed, Selkie strongly favors speed, Mermaid is in between).
For detailed information about the new Mermaid and Selkie I've written a series of posts :
cbloom rants Introducing Oodle Mermaid and Selkie
cbloom rants Oodle 2.3.0 All Test Sets
cbloom rants Oodle 2.3.0 ARM Report
cbloom rants Oodle Mermaid and Selkie on PS4
cbloom rants Oodle Mermaid
cbloom rants Oodle Selkie
RAD Game Tools - Oodle Network and Data Compression
Here are some representative numbers on the Silesia test set : (sum of time and size on individual files)
Oodle 2.3.0 Silesia -z6
Kraken : 4.082 to 1 : 999.389 MB/s
Mermaid : 3.571 to 1 : 2022.038 MB/s
Selkie : 3.053 to 1 : 2929.770 MB/s
zstdmax : 4.013 to 1 : 468.497 MB/s
zlib9 : 3.128 to 1 : 358.681 MB/s
lz4hc : 2.723 to 1 : 2267.021 MB/s
on Win64 (Core i7-3770 3.4 GHz)
On Silesia, Mermaid is 5.65X faster to decode than zlib, and gets 14% more compression.
Selkie is 1.3X faster to decode than LZ4 and gets 12% more compression.
Charts on Silesia total : (charts show time and size - lower is better!)
And the speedup chart on Silesia, which demonstrates the space-speed efficiency of a compressor in different usage domains.
Kraken was a huge step in the Pareto frontier that pushed the achievable speedup factor way up beyond what other compressors were doing. There's a pre-Kraken curve where we thought the best possible tradeoff existed, that most other compressors in the world roughly lie on (or under). Kraken set a new frontier way up on its own with nobody to join it; Mermaid & Selkie are the partners on that new curve that have their peaks at higher speeds than Kraken.
You can also see this big jump of the new family very easily in scatter plots, which we'll see in later posts .
There were two major factors in the gains. One was just some more time optimizing some inner loops (including some new super-tight pathways from Fabian).
The other was more rigorous analysis of the space-speed tradeoff decisions inside Kraken. One of the fundamental things that makes Kraken work is the fact that it considers space-speed when making its internal decisions, but before 230 those decisions were made in a rather ad-hoc way. Making those decisions better means that even with the same decoder, the new encoder is able to create files that are the same size but decode faster.
The tradeoff point (technically, the lagrange lambda, or the exchange rate from time to bytes) that's used by Oodle to make space-speed decisions is exposed to the client in the OodleLZ_CompressOptions so you can adjust it to bias for compression or decode speed. Each compressor sets what I believe to be a reasonable default for its usage domain, so adjustments to this value should typically be small. You can't massively change behavior with it; Kraken won't start arithmetic coding things if you set the tradeoff really small, for example. There's a small window where the compressor works well and you can just bias slightly within that window.
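The decision the encoder is making is the usual Lagrangian one; roughly like this (the labels candidate/lambda/numCandidates are mine for illustration, not the actual Oodle internals) :

/* choose the candidate encoding that minimizes
       J = compressedBytes + lambda * estimatedDecodeTime
   lambda is the exchange rate from time to bytes : a bigger lambda makes decode time expensive
   (keep things fast, slightly bigger files), a smaller lambda lets the encoder spend decode
   time to save bytes. */
int best = 0;
double bestJ = 1e300;
for (int i = 0; i < numCandidates; i++)
{
    double J = (double) candidate[i].bytes + lambda * candidate[i].decodeTime;
    if (J < bestJ) { bestJ = J; best = i; }
}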
Some dry numbers for reference :
On PS4 :
Oodle 230 Kraken -zl4 : 24,700,820 ->10,377,556 = 3.361 bpb = 2.380 to 1
decode only : 65.547 millis, 4.23 c/b, rate= 376.84 mb/s
Oodle 230 Kraken -zl6 : 24,700,820 -> 9,970,882 = 3.229 bpb = 2.477 to 1
decode : 63.453 millis, 4.09 c/b, rate= 389.28 mb/s
Oodle 230 Kraken -zl7 : 24,700,820 -> 9,734,771 = 3.153 bpb = 2.537 to 1
decode : 67.915 millis, 4.38 c/b, rate= 363.70 mb/s
Oodle 220 Kraken -zl4 : 24,700,820 ->10,326,584 = 3.345 bpb = 2.392 to 1
decode only : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s
Oodle 220 Kraken -zl6 : 24,700,820 ->10,011,486 = 3.242 bpb = 2.467 to 1
decode : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s
Oodle 220 Kraken -zl7 : 24,700,820 -> 9,773,112 = 3.165 bpb = 2.527 to 1
decode : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s
On Win64 (Core i7-3770 3.4 GHz) :
Oodle 2.3.0 :
Silesia Kraken -z6
total : 211,938,580 ->51,918,269 = 1.960 bpb = 4.082 to 1
decode : 210.685 millis, 3.38 c/b, rate= 1005.95 mb/s
Weissman 1-256 : [8.575]
mozilla : 51,220,480 ->14,410,181 = 2.251 bpb = 3.554 to 1
decode only : 51.280 millis, 3.41 c/b, rate= 998.83 mb/s
lzt99 : 24,700,820 -> 9,970,882 = 3.229 bpb = 2.477 to 1
decode only : 20.943 millis, 2.89 c/b, rate= 1179.44 mb/s
win81 : 104,857,600 ->38,222,311 = 2.916 bpb = 2.743 to 1
decode only : 108.344 millis, 3.52 c/b, rate= 967.82 mb/s
Oodle 2.2.0 :
Silesia Kraken -z6
total : 211,938,580 ->51,857,427 = 1.957 bpb = 4.087 to 1
decode : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s
Weissman 1-256 : [8.431]
"silesia_mozilla"
Kraken 230 : 3.55:1 , 998.8 dec mb/s
Kraken 220 : 3.60:1 , 896.5 dec mb/s
Kraken 215 : 3.51:1 , 928.0 dec mb/s
"lzt99"
Kraken 230 : 2.48:1 , 998.8 dec mb/s
Kraken 220 : 2.53:1 , 912.0 dec mb/s
Kraken 215 : 2.46:1 , 957.1 dec mb/s
"win81"
Kraken 230 : 2.74:1 , 967.8 dec mb/s
Kraken 220 : 2.77:1 , 818.0 dec mb/s
Kraken 215 : 2.70:1 , 877.0 dec mb/s
NOTE : Oodle 2.3.0 Kraken data cannot be read by Oodle 2.2.0 or earlier. Oodle 230 can load all old Oodle data (new versions of Oodle can always load all data created by older versions). If you need to make data with Oodle 230 that can be loaded by an older version, you can set the minimum decoder version to something lower (by default it's the current version). Contact Oodle support for details.
Some of the biggest gains were found on ARM, which I'll post about more in the future.
A two-step parse is an enhancement to a forward-arrivals parse.
(background : forward-arrivals parse stores the minimum cost from head at each position, along with information on the path taken to get there. At each pos P, it takes the best incoming arrival and considers all ways to go further into the parse (literal/match/rep/etc.). At each destination point it stores arrival_cost[P] + step cost. In simple cases (no carried state, no entropy coding, like LZSS) the forward-arrivals parse is a perfect parse just like the backward dynamic-programming parse. In modern LZ with carried state such as a rep set or markov state, the forward parse is preferable.)
A two-step parse extends the standard forward-arrivals parse by being able to store an arrival from a single coding step, or from two coding steps. The standard usage (as in LZMA/7zip) is to be able to store a two-step arrival from the sequence { normal match, some literals, rep match }. This multi-step arrival is stored with the cost of the whole sequence at the end point of the sequence.
If you stored *all* arrivals (not just the cheapest), you would not need two-step parse. You could just store the first step, and then when your parse origin point advanced to the end of the first step, it would find the second step and be able to choose it as an option.
But obviously you don't store all arrivals at each position, since the number would massively explode, even with reduction by symmetries. (see, eg. previous articles on A* parse)
The problem arises when you have a cheap multi-step sequence, but the first step is expensive. Then the first step might be replaced (or never filled in the first place) and the parse will not be able to find the second step cheap option.
Let's consider a concrete example for clarity.
Parser is at pos P, considering all ways to continue
At pos P there's a length 4 normal match available at offset O
It stores an arrival at [P+4] that's rather expensive
(because it has to send offset O).
At pos P+1 the parser finds a length 3 rep match
The exit from (P+1) length 3 also lands at [P+4]
This is a cheaper way to arrive at [P+4] , so the previous arrival from P via O is replaced
When the parser reaches P+4 it sees the incoming arrival as
being a rep match from P+1
But we missed something !
At pos P+5 (one step after the arrival) there are 2 bytes that match at offset O
if we had chosen the normal match to arrive at P+4 , we could now code a rep match
but we lost it, so we don't see the rep as an option.
Two-step to the rescue!
Back at pos P , we consider the one-step arrival :
{match len 4, offset O} to arrive at P+4
We also look after the end of that for cheap rep matches and find one.
So we store a two-step arrival :
{match len 4, offset O, 1 literal, rep len 2} to arrive at P+7
Now at pos P+1 the arrival at P+4 is stomped
but the arrival at P+7 remains! So we are able to find that in the future.
The options look like :
   P   P+4
   V   V
1. MMMMLRR
2. LRRRLLL
Option 2 is cheaper at P+4
but Option 1 is cheaper at P+7
This is the primary application of two-step parse.
It's a (very limited) way of finding non-local minima in the parse search space.
The other option is "multi-parse", which stores multiple arrivals at each position (something like 4 is typical). Multi-parse and two-step provide diminishing returns when used together, so they usually aren't combined. Two-step is generally faster and provides more win per unit of CPU time; multi-parse is able to find longer-range non-local-minimum moves and so provides more compression.
All good modern LZ's need some kind of non-local-minimum parse, because to get into a good state for the future (typically by getting the right offset into the rep offset cache) you may need to make a more expensive initial step.
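To make the mechanics concrete, here's a little sketch (not Oodle's actual code) of how a forward-arrivals parse can record the two-step arrival alongside the normal one-step arrival. The costing and match-finding inputs are stand-ins; a real parser also carries rep-offset state and enough path information to walk the parse backwards at the end.

    #include <vector>

    struct Arrival
    {
        int cost;     // cheapest known cost from head to this position
        int fromPos;  // where the arrival came from (plus offsets/lengths/rep state in a real parser)
    };

    // called when the parser at 'pos' finds a normal match of length matchLen costing matchCost,
    // and (looking past the end of that match) a rep match of length repLenAfter one literal later
    void ConsiderMatch(std::vector<Arrival> & arrivals, int pos,
                       int matchLen, int matchCost,
                       int repLenAfter, int repCostAfter, int literalCost)
    {
        // one-step arrival at the end of the match
        int oneStepEnd  = pos + matchLen;
        int oneStepCost = arrivals[pos].cost + matchCost;
        if ( oneStepCost < arrivals[oneStepEnd].cost )
        {
            arrivals[oneStepEnd].cost    = oneStepCost;
            arrivals[oneStepEnd].fromPos = pos;
        }

        // two-step arrival : { match , one literal , rep match } stored at the end
        // of the whole sequence.  Even if the one-step arrival above later gets
        // stomped by a cheaper arrival at oneStepEnd, this one survives, so the
        // parse can still find the cheap rep that only exists because the normal
        // match was taken first.
        if ( repLenAfter > 0 )
        {
            int twoStepEnd  = oneStepEnd + 1 + repLenAfter;
            int twoStepCost = oneStepCost + literalCost + repCostAfter;
            if ( twoStepCost < arrivals[twoStepEnd].cost )
            {
                arrivals[twoStepEnd].cost    = twoStepCost;
                arrivals[twoStepEnd].fromPos = pos;
            }
        }
    }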
The original Weissman score, ignoring constants, is just W = r/logT
That's just wrong. You don't take a logarithm of something with units. But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a logarithm of time should be involved. Just not like that.
I present a formula which I call the correct Weissman Score :
W = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) ) -
comp_ratio * log10( 1 + speed/(disk_speed_hi *comp_ratio) )
or
W = comp_ratio * log10( ( comp_ratio + speed/disk_speed_lo ) / ( comp_ratio + speed/disk_speed_hi ) )
You can have a Weissman score for encode speed or decode speed. It's a measure of space-speed tradeoff success.
I suggest the range should be 1-256. disk_speed_lo = 1 MB/s (to evaluate performance on very slow channels, favoring small files), disk_speed_hi = 256 MB/s (to evaluate performance on very fast disks, favoring speed). And because 1 and 256 are amongst programmers' favorite numbers.
You could also just let the hi range go to infinity. Then you don't need a hi disk speed parameter and you get :
Weissman-infinity = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )
with disk_speed_lo = 1 MB/s ; which is neater, though this favors fast compressors more than you might like.
While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses the score more on the area that most people care about.
I came up with this formula because I started thinking about summarizing a score from the Pareto charts I've made . What if you took the speedup value at several (log-scale) disk speeds; like you could take the speedup at 1 MB/s,2 MB/s,4 MB/s, and just average them? speedup is a good way to measure a compressor even if you don't actually care about speed. Well, rather than just average a bunch of points, what if I average *all* points? eg. integrate to get the area under the curve? Note that we're integrating in log-scale of disk speed.
Turns out you can just do that integral :
speedup = (time to load uncompressed) / (time to load compressed + decompress)
speedup = (raw_size/disk_speed) / (comp_size/disk_speed + raw_size/ decompress_speed)
speedup = (1/disk_speed) / (1/(disk_speed*compression_ratio) + 1 / decompress_speed)
speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
speedup = compression_ratio / (1 + exp( log_disk_speed ) * compression_ratio/decompress_speed)
speedup = compression_ratio * 1 / (1 + exp( log_disk_speed + log(compression_ratio/decompress_speed)))
speedup is a sigmoid :
y = 1 / (1 + e^-x )
Integral{y} = ln( 1 + e^x )
x = - ( log_disk_speed + log(compression_ratio/decompress_speed) )
so substitute some variables and you get the above formula for the Weissman score.
In the final formula, I changed from natural log to log-base-10, which is just a constant scaling factor.
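For reference, the formula is trivial to drop into a benchmark harness. A minimal sketch in C-style code (speeds in MB/s; pass a huge disk_speed_hi to get the Weissman-infinity variant) :

    #include <math.h>

    // "correct" Weissman score as defined above
    double WeissmanScore(double comp_ratio, double speed,
                         double disk_speed_lo, double disk_speed_hi)
    {
        return comp_ratio * log10( 1.0 + speed / (disk_speed_lo * comp_ratio) )
             - comp_ratio * log10( 1.0 + speed / (disk_speed_hi * comp_ratio) );
    }

    // eg. Kraken on Silesia above : WeissmanScore(4.082, 1005.95, 1.0, 256.0)
    // gives about 8.57, matching the [8.575] quoted earlier.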
The Weissman (decode Core i7-3770 3.4 GHz; 1-256 range) scores on Silesia are :
lz4hc : 6.243931
zstdmax : 7.520236
lzham : 6.924379
lzma : 5.460073
zlib9 : 5.198510
Kraken : 8.431461
Weissman-infinity scores are :
lz4hc : 7.983104
zstdmax : 8.168544
lzham : 7.277707
lzma : 5.589155
zlib9 : 5.630476
Kraken : 9.551152
Goal : beat 10.0 !
ADD : this post was written not-sure-if-joking. But I actually think it's useful. I find it useful anyway.
When you're trying to tweak out some space-speed tradeoff decisions, you get different sizes and speeds, and it can be hard to tell if that tradeoff was good. You can do things like plot all your options on a space-speed graph and try to guess the pareto frontier and take those. But when iterating an optimization of a parameter you want just a simple score.
This corrected Weissman score is a nice way to do that. You have to choose what domain you're optimizing for, size-dominant slower compressors should use Weissman 1-256 , for balance of space and super speed use Weissman 1-inf (or 40-800), for the fast domain (LZ4-ish) use a range like 100-inf. Then you can just iterate to maximize that number!
For whatever space-speed tradeoff domain you're interested in, there exists a Weissman score range (lo-hi disk speed parameters) such that maximizing the Weissman score in that range gives you the best space-speed tradeoff in the domain you wanted. The trick is choosing what that lo-hi range is (it doesn't necessarily directly correspond to actual disk or channel speeds; there are other factors to consider like latency, storage usage, etc. that might cause you to bias the lo-hi away from the actual channel speeds in some way; for example high speed decoders should always set the upper speed to infinity, which corresponds to the use case that the compressed data might be already resident in RAM so it has zero time to load).
Advantages of the PS4 : consistent well-defined hardware for reproducible testing. Slow platform that game developers care about being fast on. Builds with clang which is the target of choice for some compression libraries that don't build so easily on MSVC.
After the initial version of this post, I went and fuzz-safed LZB16, so they're all directly comparable.
To compare apples, look at LZ4_decompress_safe. I also include LZ4_decompress_fast for reference.
LZ4_decompress_safe : fuzz safe (*)
LZ4_decompress_fast : NOT fuzz safe
LZSSE8_Decode : fuzz safe
Oodle LZB16 : fuzz safe
LZSSE and Oodle both use multiple copies of the core loop to minimize the pain of fuzz safing. LZ4's code is much simpler, it doesn't do a ton of macro or .inl nastiness. (* = this is the only one that I would actually trust to put in a satellite or a bank or something critical, it's just so simple, it's way easier to be sure that it's correct)
Conclusion :
Comparing the two Safe open-source options, LZ4_safe vs. LZSSE8 : LZSSE8 is pretty consistently faster than LZ4_Safe on PS4 (though the difference is small). PS4 is a better platform for LZSSE than x64 (PS4 is actually a pretty bad platform for LZ4; there're a variety of issues; see GDC "Taming the Jaguar" slides ; but particular issues for LZ4 are the front-end bottleneck and cache latency). When I tested on x64 before, it was much more mixed, sometimes LZ4 was faster.
I was surprised to find that Oodle LZB16 is quite a lot faster than LZ4 on PS4. (for example, that's not the case on Windows/x64, it's much closer there). I've never run third party codecs on PS4 before. I suppose this reflects the per-platform tweaking that we spend so much time on, and I'm sure LZ4 would catch up with some time spent fiddling with the PS4 codegen.
The compression ratios are mostly close enough to not care, though LZSSE8 does a bit worse on some of the DXTC/BCn files (lightmap.bc3 and d.dds).
On some files I include Oodle LZBLW numbers (LZB-bytewise-large-window). Sometimes Oodle LZBLW is a pretty big free compression win at about the same speed. Sometimes it gets worse ratio, sometimes much worse speed. If I was a client using this, I might try LZBLW and drop down to LZB16 any time it's not a good tradeoff.
Full data :
REMINDER : LZ4_decompress_fast is not directly comparable to the others, it's not fuzz safe, the others are!
PS4 clang-3.6.1
------------------
lzt99 :
lz4hc : 24,700,820 ->14,801,510 = 4.794 bpb = 1.669 to 1
LZ4_decompress_safe_time : 0.035 seconds, 440.51 b/kc, rate= 702.09 mb/s
LZ4_decompress_fast_time : 0.032 seconds, 483.55 b/kc, rate= 770.67 mb/s
LZSSE8 : 24,700,820 ->15,190,395 = 4.920 bpb = 1.626 to 1
LZSSE8_Decode_Time : 0.033 seconds, 467.32 b/kc, rate= 744.81 mb/s
Oodle LZB16 : lzt99 : 24,700,820 ->14,754,643 = 4.779 bpb = 1.674 to 1
decode : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s
Oodle LZBLW : lzt99 : 24,700,820 ->13,349,800 = 4.324 bpb = 1.850 to 1
decode : 0.033 seconds, 470.39 b/kc, rate= 749.74 mb/s
------------------
texture.bc1
lz4hc : 2,188,524 -> 2,068,268 = 7.560 bpb = 1.058 to 1
LZ4_decompress_safe_time : 0.004 seconds, 322.97 b/kc, rate= 514.95 mb/s
LZ4_decompress_fast_time : 0.004 seconds, 353.08 b/kc, rate= 562.89 mb/s
LZSSE8 : 2,188,524 -> 2,111,182 = 7.717 bpb = 1.037 to 1
LZSSE8_Decode_Time : 0.004 seconds, 360.21 b/kc, rate= 574.42 mb/s
Oodle LZB16 : texture.bc1 : 2,188,524 -> 2,068,823 = 7.562 bpb = 1.058 to 1
decode : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s
------------------
lightmap.bc3
lz4hc : 4,194,332 -> 632,974 = 1.207 bpb = 6.626 to 1
LZ4_decompress_safe_time : 0.005 seconds, 521.54 b/kc, rate= 831.38 mb/s
LZ4_decompress_fast_time : 0.005 seconds, 564.63 b/kc, rate= 900.46 mb/s
LZSSE8 encode : 4,194,332 -> 684,062 = 1.305 bpb = 6.132 to 1
LZSSE8_Decode_Time : 0.005 seconds, 551.85 b/kc, rate= 879.87 mb/s
Oodle LZB16 : lightmap.bc3 : 4,194,332 -> 630,794 = 1.203 bpb = 6.649 to 1
decode : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s
------------------
silesia_mozilla
lz4hc : 51,220,480 ->22,062,995 = 3.446 bpb = 2.322 to 1
LZ4_decompress_safe_time : 0.083 seconds, 385.47 b/kc, rate= 614.35 mb/s
LZ4_decompress_fast_time : 0.075 seconds, 427.14 b/kc, rate= 680.75 mb/s
LZSSE8 : 51,220,480 ->22,148,366 = 3.459 bpb = 2.313 to 1
LZSSE8_Decode_Time : 0.070 seconds, 461.53 b/kc, rate= 735.59 mb/s
Oodle LZB16 : silesia_mozilla : 51,220,480 ->22,022,002 = 3.440 bpb = 2.326 to 1
decode : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s
Oodle LZBLW : silesia_mozilla : 51,220,480 ->20,881,772 = 3.261 bpb = 2.453 to 1
decode : 0.112 seconds, 285.68 b/kc, rate= 455.30 mb/s
------------------
breton.dds
lz4hc : 589,952 -> 116,447 = 1.579 bpb = 5.066 to 1
LZ4_decompress_safe_time : 0.001 seconds, 568.65 b/kc, rate= 906.22 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 624.81 b/kc, rate= 996.54 mb/s
LZSSE8 encode : 589,952 -> 119,659 = 1.623 bpb = 4.930 to 1
LZSSE8_Decode_Time : 0.001 seconds, 604.14 b/kc, rate= 962.40 mb/s
Oodle LZB16 : breton.dds : 589,952 -> 113,578 = 1.540 bpb = 5.194 to 1
decode : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s
Oodle LZBLW : breton.dds : 589,952 -> 132,934 = 1.803 bpb = 4.438 to 1
decode : 0.001 seconds, 396.04 b/kc, rate= 630.96 mb/s
------------------
d.dds
lz4hc encode : 1,048,704 -> 656,706 = 5.010 bpb = 1.597 to 1
LZ4_decompress_safe_time : 0.001 seconds, 554.69 b/kc, rate= 884.24 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 587.20 b/kc, rate= 936.34 mb/s
LZSSE8 encode : 1,048,704 -> 695,583 = 5.306 bpb = 1.508 to 1
LZSSE8_Decode_Time : 0.001 seconds, 551.13 b/kc, rate= 879.05 mb/s
Oodle LZB16 : d.dds : 1,048,704 -> 654,014 = 4.989 bpb = 1.603 to 1
decode : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s
------------------
all_dds
lz4hc : 79,993,099 ->47,848,680 = 4.785 bpb = 1.672 to 1
LZ4_decompress_safe_time : 0.158 seconds, 316.67 b/kc, rate= 504.70 mb/s
LZ4_decompress_fast_time : 0.143 seconds, 350.66 b/kc, rate= 558.87 mb/s
LZSSE8 : 79,993,099 ->47,807,041 = 4.781 bpb = 1.673 to 1
LZSSE8_Decode_Time : 0.140 seconds, 358.61 b/kc, rate= 571.54 mb/s
Oodle LZB16 : all_dds : 79,993,099 ->47,683,003 = 4.769 bpb = 1.678 to 1
decode : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s
----------
baby_robot_shell.gr2
lz4hc : 58,788,904 ->32,998,567 = 4.490 bpb = 1.782 to 1
LZ4_decompress_safe_time : 0.090 seconds, 412.04 b/kc, rate= 656.71 mb/s
LZ4_decompress_fast_time : 0.080 seconds, 460.55 b/kc, rate= 734.01 mb/s
LZSSE8 : 58,788,904 ->33,201,406 = 4.518 bpb = 1.771 to 1
LZSSE8_Decode_Time : 0.076 seconds, 485.14 b/kc, rate= 773.20 mb/s
Oodle LZB16 : baby_robot_shell.gr2 : 58,788,904 ->32,862,033 = 4.472 bpb = 1.789 to 1
decode : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s
Oodle LZBLW : baby_robot_shell.gr2 : 58,788,904 ->30,207,635 = 4.111 bpb = 1.946 to 1
decode : 0.090 seconds, 409.88 b/kc, rate= 653.26 mb/s
After posting the original version with non-fuzz-safe LZB16, I decided to just go and do the fuzz-safing for LZB16.
LZB16, PS4 clang-3.6.1
post-fuzz-safing :
lzt99 : 24,700,820 ->14,754,643 = 4.779 bpb = 1.674 to 1
decode : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s
texture.bc1 : 2,188,524 -> 2,068,823 = 7.562 bpb = 1.058 to 1
decode : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s
lightmap.bc3 : 4,194,332 -> 630,794 = 1.203 bpb = 6.649 to 1
decode : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s
silesia_mozilla : 51,220,480 ->22,022,002 = 3.440 bpb = 2.326 to 1
decode : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s
breton.dds : 589,952 -> 113,578 = 1.540 bpb = 5.194 to 1
decode : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s
d.dds : 1,048,704 -> 654,014 = 4.989 bpb = 1.603 to 1
decode : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s
all_dds : 79,993,099 ->47,683,003 = 4.769 bpb = 1.678 to 1
decode : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s
baby_robot_shell.gr2 : 58,788,904 ->32,862,033 = 4.472 bpb = 1.789 to 1
decode : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s
pre-fuzz-safing reference :
lzt99 912.92 mb/s
texture.bc1 598.61 mb/s
lightmap.bc3 876.19 mb/s
silesia_mozilla 794.72 mb/s
breton.dds 1078.52 mb/s
d.dds 888.73 mb/s
all_dds 701.45 mb/s
baby_robot_shell.gr2 877.81 mb/s
Mild speed penalty on most files.
Everything run at max compression options, level 99, max dict size. All libs are the latest on github, downloaded today. Zlib-NG has the arch/x86 stuff enabled. PS4 is AMD Jaguar , x64.
I'm going to omit encode speeds on the per-file results for simplicity; these are pretty representative :
aow3_skin_giants.clb :
zlib-ng encode : 2.699 seconds, 1.65 b/kc, rate= 2.63 mb/s
miniz encode : 2.950 seconds, 1.51 b/kc, rate= 2.41 mb/s
zstd encode : 5.464 seconds, 0.82 b/kc, rate= 1.30 mb/s
brotli-9 encode : 23.110 seconds, 0.19 b/kc, rate= 307.44 kb/s
brotli-10 encode : 68.072 seconds, 0.07 b/kc, rate= 104.38 kb/s
brotli-11 encode : 79.844 seconds, 0.06 b/kc, rate= 88.99 kb/s
Results :
PS4 clang-3.5.0
-------------
lzt99 :
MiniZ : 24,700,820 ->13,120,668 = 4.249 bpb = 1.883 to 1
miniz_decompress_time : 0.292 seconds, 53.15 b/kc, rate= 84.71 mb/s
zlib-ng : 24,700,820 ->13,158,385 = 4.262 bpb = 1.877 to 1
zlib_ng_decompress_time : 0.226 seconds, 68.58 b/kc, rate= 109.30 mb/s
ZStd : 24,700,820 ->10,403,228 = 3.369 bpb = 2.374 to 1
zstd_decompress_time : 0.184 seconds, 84.12 b/kc, rate= 134.07 mb/s
Brotli-9 : 24,700,820 ->10,473,560 = 3.392 bpb = 2.358 to 1
brotli_decompress_time : 0.259 seconds, 59.83 b/kc, rate= 95.36 mb/s
Brotli-10 : 24,700,820 -> 9,949,740 = 3.222 bpb = 2.483 to 1
brotli_decompress_time : 0.319 seconds, 48.54 b/kc, rate= 77.36 mb/s
Brotli-11 : 24,700,820 -> 9,833,023 = 3.185 bpb = 2.512 to 1
brotli_decompress_time : 0.317 seconds, 48.84 b/kc, rate= 77.84 mb/s
Oodle Kraken -zl4 : 24,700,820 ->10,326,584 = 3.345 bpb = 2.392 to 1
encode only : 4.139 seconds, 3.74 b/kc, rate= 5.97 mb/s
decode only : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s
Oodle Kraken -zl6 : 24,700,820 ->10,011,486 = 3.242 bpb = 2.467 to 1
decode : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s
Oodle Kraken -zl7 : 24,700,820 -> 9,773,112 = 3.165 bpb = 2.527 to 1
decode : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s
Oodle LZNA : lzt99 : 24,700,820 -> 9,068,880 = 2.937 bpb = 2.724 to 1
decode : 0.643 seconds, 24.12 b/kc, rate= 38.44 mb/s
-------------
normals.bc1 :
miniz : 524,316 -> 291,697 = 4.451 bpb = 1.797 to 1
miniz_decompress_time : 0.008 seconds, 39.86 b/kc, rate= 63.53 mb/s
zlib-ng : 524,316 -> 292,541 = 4.464 bpb = 1.792 to 1
zlib_ng_decompress_time : 0.007 seconds, 47.32 b/kc, rate= 75.41 mb/s
zstd : 524,316 -> 273,642 = 4.175 bpb = 1.916 to 1
zstd_decompress_time : 0.007 seconds, 49.64 b/kc, rate= 79.13 mb/s
brotli-9 : 524,316 -> 289,778 = 4.421 bpb = 1.809 to 1
brotli_decompress_time : 0.010 seconds, 31.70 b/kc, rate= 50.52 mb/s
brotli-10 : 524,316 -> 259,772 = 3.964 bpb = 2.018 to 1
brotli_decompress_time : 0.011 seconds, 28.65 b/kc, rate= 45.66 mb/s
brotli-11 : 524,316 -> 253,625 = 3.870 bpb = 2.067 to 1
brotli_decompress_time : 0.011 seconds, 29.74 b/kc, rate= 47.41 mb/s
Oodle Kraken -zl6 : 524,316 -> 247,217 = 3.772 bpb = 2.121 to 1
decode : 0.002 seconds, 135.52 b/kc, rate= 215.95 mb/s
Oodle Kraken -zl7 : 524,316 -> 238,844 = 3.644 bpb = 2.195 to 1
decode : 0.003 seconds, 123.96 b/kc, rate= 197.56 mb/s
Oodle BitKnit : 524,316 -> 225,884 = 3.447 bpb = 2.321 to 1
decode only : 0.010 seconds, 31.67 b/kc, rate= 50.47 mb/s
-------------
lightmap.bc3 :
miniz : 4,194,332 -> 590,448 = 1.126 bpb = 7.104 to 1
miniz_decompress_time : 0.025 seconds, 105.14 b/kc, rate= 167.57 mb/s
zlib-ng : 4,194,332 -> 584,107 = 1.114 bpb = 7.181 to 1
zlib_ng_decompress_time : 0.019 seconds, 137.77 b/kc, rate= 219.56 mb/s
zstd : 4,194,332 -> 417,672 = 0.797 bpb = 10.042 to 1
zstd_decompress_time : 0.014 seconds, 182.53 b/kc, rate= 290.91 mb/s
brotli-9 : 4,194,332 -> 499,120 = 0.952 bpb = 8.403 to 1
brotli_decompress_time : 0.022 seconds, 118.64 b/kc, rate= 189.09 mb/s
brotli-10 : 4,194,332 -> 409,907 = 0.782 bpb = 10.232 to 1
brotli_decompress_time : 0.021 seconds, 125.20 b/kc, rate= 199.54 mb/s
brotli-11 : 4,194,332 -> 391,576 = 0.747 bpb = 10.711 to 1
brotli_decompress_time : 0.021 seconds, 127.12 b/kc, rate= 202.61 mb/s
Oodle Kraken -zl6 : 4,194,332 -> 428,737 = 0.818 bpb = 9.783 to 1
decode : 0.009 seconds, 308.45 b/kc, rate= 491.60 mb/s
Oodle BitKnit : 4,194,332 -> 416,208 = 0.794 bpb = 10.077 to 1
decode only : 0.021 seconds, 122.59 b/kc, rate= 195.39 mb/s
Oodle LZNA : 4,194,332 -> 356,313 = 0.680 bpb = 11.771 to 1
decode : 0.033 seconds, 79.51 b/kc, rate= 126.71 mb/s
----------------
aow3_skin_giants.clb
Miniz : 7,105,158 -> 3,231,469 = 3.638 bpb = 2.199 to 1
miniz_decompress_time : 0.070 seconds, 63.80 b/kc, rate= 101.69 mb/s
zlib-ng : 7,105,158 -> 3,220,291 = 3.626 bpb = 2.206 to 1
zlib_ng_decompress_time : 0.056 seconds, 80.14 b/kc, rate= 127.71 mb/s
Zstd : 7,105,158 -> 2,700,034 = 3.040 bpb = 2.632 to 1
zstd_decompress_time : 0.050 seconds, 88.69 b/kc, rate= 141.35 mb/s
brotli-9 : 7,105,158 -> 2,671,237 = 3.008 bpb = 2.660 to 1
brotli_decompress_time : 0.080 seconds, 55.84 b/kc, rate= 89.00 mb/s
brotli-10 : 7,105,158 -> 2,518,315 = 2.835 bpb = 2.821 to 1
brotli_decompress_time : 0.098 seconds, 45.54 b/kc, rate= 72.58 mb/s
brotli-11 : 7,105,158 -> 2,482,511 = 2.795 bpb = 2.862 to 1
brotli_decompress_time : 0.097 seconds, 45.84 b/kc, rate= 73.05 mb/s
Oodle Kraken -zl6 : aow3_skin_giants.clb : 7,105,158 -> 2,638,490 = 2.971 bpb = 2.693 to 1
decode : 0.023 seconds, 195.25 b/kc, rate= 311.19 mb/s
Oodle BitKnit : 7,105,158 -> 2,623,466 = 2.954 bpb = 2.708 to 1
decode only : 0.095 seconds, 47.11 b/kc, rate= 75.08 mb/s
Oodle LZNA : aow3_skin_giants.clb : 7,105,158 -> 2,394,871 = 2.696 bpb = 2.967 to 1
decode : 0.170 seconds, 26.26 b/kc, rate= 41.85 mb/s
--------------------
silesia_mozilla
MiniZ : 51,220,480 ->19,141,389 = 2.990 bpb = 2.676 to 1
miniz_decompress_time : 0.571 seconds, 56.24 b/kc, rate= 89.63 mb/s
zlib-ng : 51,220,480 ->19,242,520 = 3.005 bpb = 2.662 to 1
zlib_ng_decompress_time : 0.457 seconds, 70.31 b/kc, rate= 112.05 mb/s
zstd : malloc failed
brotli-9 : 51,220,480 ->15,829,463 = 2.472 bpb = 3.236 to 1
brotli_decompress_time : 0.516 seconds, 62.27 b/kc, rate= 99.24 mb/s
brotli-10 : 51,220,480 ->14,434,253 = 2.254 bpb = 3.549 to 1
brotli_decompress_time : 0.618 seconds, 52.00 b/kc, rate= 82.88 mb/s
brotli-11 : 51,220,480 ->14,225,511 = 2.222 bpb = 3.601 to 1
brotli_decompress_time : 0.610 seconds, 52.72 b/kc, rate= 84.02 mb/s
Oodle Kraken -zl6 : 51,220,480 ->14,330,298 = 2.238 bpb = 3.574 to 1
decode : 0.200 seconds, 160.51 b/kc, rate= 255.82 mb/s
Oodle Kraken -zl7 : 51,220,480 ->14,222,802 = 2.221 bpb = 3.601 to 1
decode : 0.201 seconds, 160.04 b/kc, rate= 255.07 mb/s
Oodle LZNA : silesia_mozilla : 51,220,480 ->13,294,622 = 2.076 bpb = 3.853 to 1
decode : 1.022 seconds, 31.44 b/kc, rate= 50.11 mb/s
I tossed in tests of BitKnit & LZNA in some cases after I realized that the Brotli decode speeds are more comparable to BitKnit than Kraken, and even LZNA isn't that far off (usually less than a factor of 2). eg. you could do half your files in LZNA and half in Kraken and that would be about the same total time as doing them all in Brotli.
Here are charts of the above data :
(silesia_mozilla omitted due to lack of zstd results)
(I'm trying an experiment and showing inverted scales, which are more proportional to what you care about. I'm showing seconds per gigabyte, and compressed size as a percent of the original size, which are proportional to *time* not speed, and *size* not ratio. So, lower is better.)
log-log speed & ratio :
[charts]
Time and size are just way better scales. Looking at "speed" and "ratio" can be very misleading, because big differences in speed at the high end (eg. 2000 mb/s vs 2200 mb/s) don't translate into a very big time difference, and *time* is what you care about. On the other hand, small differences in speed at the low end *are* important - (eg. 30 mb/s vs 40 mb/s) because those mean a big difference in time.
I've been doing mostly "speed" and "ratio" because it reads better to the novice (higher is better! I want the one with the biggest bar!), but it's so misleading that I think going to time & size is worth it.
There are some cases where it's not the answer. Those are easier to enumerate, so I'll do that, mainly with the Oodle compressors you might want instead.
*. If you want the smallest possible files and don't care much about decode speed or decode-time CPU usage, then you might prefer Oodle LZNA.
*. If you want maximum decode speed and don't care much about compression ratio, you might prefer Oodle's LZB16, LZBLW, or LZNIB. (or maybe you don't want compression at all)
*. If you want to compress tiny files independently, Kraken is probably not the best choice. Typically for loading or unpacking, tiny files should be stuck together in a larger loading unit, so that you read and decompress several of them per unit. But if for some reason you need independent tiny files, Kraken won't do great under 16k or so. The best alternatives in Oodle that work well on tiny files are LZNA (high compression) and LZNIB (high speed).
One common case that's equivalent to "tiny files" is communication buffers like network packets; Oodle has special LZ modes for this problem and the best compressors for it are LZNA and LZNIB. Of course you could also use the specialized Oodle Network for packets; contact oodle support for more information on special uses like this.
Similarly, Kraken doesn't do "streaming" or incremental encoding or decoding. It needs large chunks or whole buffers. 90% of the time when people are using streaming, they just shouldn't. For example, if you're trying to persist a save game, and you are streaming out bytes from your objects, you shouldn't be passing them one by one to the encoding layer, you should just be doing *ptr++ to fill a buffer, and then encode that buffer all at once when you're done. Most of the time when you think "I need streaming", you don't. The exception is when you actually need to flush out small independent atoms, like the network packet case above, and then Kraken is not suitable.
*. If you're extremely memory-use constrained, you might not want Kraken. Kraken needs around 256k of memory in addition to the output buffer. This is the largest decoder memory requirement in Oodle, so if for some reason that's too much, there are much smaller overhead decoders available, such as LZNA (around 12k) and LZNIB (zero, or around 1k of stack usage).
*. If you have a lot of specialized data types, and care mainly about compression ratio, you probably want something data-type specific. This might be just some kind of preprocessor, or a whole specialized compressor, but with specialized data (lots of text, or DNA, or whatever) you can always do better with a customized compressor than a generic one.
And some other junk.
From the unpack code I can't see anything about how filters are chosen, obviously (which is the interesting part). RAR filters have a start & length, so they can apply to fine-grained portions of the file. There can be N filters per file, and they can overlap, so there could be multiple filters on any given byte. They're applied in a defined order. There's a standard E8E9/BCJ filter. These are the others :

unpack50.cpp :

RAR FILTER_DELTA is just byte delta. It can have N channels; the delta is at channel stride (from byte -N), and it de-interleaves the channels, so eg. if you used it on RGB it would be channels=3 and it would produce output like RRRRGGGGBBB.

The older versions of RAR (30 ?) had much more complex filters : rarvm.cpp could send arbitrary filters in theory using the VM script. I have no idea if this was actually done in normal use. The VM filters also have special hard-coded modes in C :

VMSF_DELTA : same as the v50 DELTA.

VMSF_RGB : special image filter. Transmits the width of scan lines; hard-coded to 24-bit RGB (odd because it does "for i to Channels", but Channels is just a const int = 3) (they never heard of 8-bit or 32-bit image data?). Uses a Paeth predictor to make a residual (N, W, or NW, depending on which is closest to grad = N+W-NW).

VMSF_AUDIO : special audio/WAV filter. Sends # of channels and does de-interleaving. Crazy complicated adaptive-weight linear predictor : does just delta from neighbor, but biases that delta by three different slopes, and adjusts the weight of each slope by which was the best predictor over recent data (or something like that).
The important one is just DELTA which is very simple.
The trick bit is not the filter, but finding ranges to apply it on (without just brute-force trying lots of options and seeing which produces the best result - which RAR obviously doesn't do because they sometimes get it so wrong, as demonstrated in the earlier post).
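For the curious, the DELTA transform described above is tiny. Here's a sketch of the filter direction (not RAR's actual source; the real thing also handles the start/length windowing and the inverse on decode) :

    #include <stdint.h>
    #include <stddef.h>

    // de-interleave N byte channels and replace each byte with its delta from the
    // previous byte of the same channel (stride -N in the original data)
    void DeltaFilter(const uint8_t * src, uint8_t * dst, size_t size, int channels)
    {
        size_t out = 0;
        for (int c = 0; c < channels; c++)
        {
            uint8_t prev = 0;
            for (size_t i = (size_t)c; i < size; i += (size_t)channels)
            {
                dst[out++] = (uint8_t)(src[i] - prev);
                prev = src[i];
            }
        }
    }

    // eg. channels=3 on interleaved RGB produces RRRR...GGGG...BBBB... of deltas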
Oodle Kraken can decode its normal compressed data on multiple threads.
This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.
The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.
The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.
For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread :
(with some other compressors for reference)
win81 :
Kraken 2-thread : 104,857,600 ->37,898,868 = 2.891 bpb = 2.767 to 1
decode : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s
Kraken : 104,857,600 ->37,898,868 = 2.891 bpb = 2.767 to 1
decode : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s
zstdmax : 104,857,600 ->39,768,086 = 3.034 bpb = 2.637 to 1
decode : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s
lzham : 104,857,600 ->37,856,839 = 2.888 bpb = 2.770 to 1
decode : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s
lzma : 104,857,600 ->35,556,039 = 2.713 bpb = 2.949 to 1
decode : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s
Charts on a few files :
[charts]
Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.
Performance on the Silesia set for reference :
Silesia total :
Oodle Kraken -z6 : 211,938,580 ->51,857,427 = 1.957 bpb = 4.087 to 1
single threaded decode : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s
two threaded decode : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s
Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers, it is a *latency* reduction for decoding individual blocks, not just a *throughput* improvement. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X, but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.
I think we'll continue to find improvements in the optimal parsers over the coming months (optimal parsing is hard!) which should lead to some more tiny gains in the compression ratio in the slow encoder modes.
Silesia , sum of all files
uncompressed : 211,938,580
Kraken 2.1.5 -z6 : 52,366,897
Kraken 2.2.0 -z6 : 51,857,427
Kraken 2.2.0 -z7 : 51,625,488
Oodle Kraken 2.1.5 topped out at -z6 (Optimal2). There's a new -z7 (Optimal3) mode which gets a bit more compression at the cost of a bit of speed, which is why it's a separate option instead of just part of -z6.
Results on some individual files (Kraken 220 is -z7) :
-------------------------------------------------------
"silesia_mozilla"
by ratio:
lzma : 3.88:1 , 2.0 enc mb/s , 63.7 dec mb/s
Kraken 220 : 3.60:1 , 1.1 enc mb/s , 896.5 dec mb/s
lzham : 3.56:1 , 1.5 enc mb/s , 186.4 dec mb/s
Kraken 215 : 3.51:1 , 1.2 enc mb/s , 928.0 dec mb/s
zstdmax : 3.24:1 , 2.8 enc mb/s , 401.0 dec mb/s
zlib9 : 2.51:1 , 12.4 enc mb/s , 291.5 dec mb/s
lz4hc : 2.32:1 , 36.4 enc mb/s , 2351.6 dec mb/s
-------------------------------------------------------
"lzt99"
by ratio:
lzma : 2.65:1 , 3.1 enc mb/s , 42.3 dec mb/s
Kraken 220 : 2.53:1 , 2.0 enc mb/s , 912.0 dec mb/s
Kraken 215 : 2.46:1 , 2.3 enc mb/s , 957.1 dec mb/s
lzham : 2.44:1 , 1.9 enc mb/s , 166.0 dec mb/s
zstdmax : 2.27:1 , 3.8 enc mb/s , 482.3 dec mb/s
zlib9 : 1.77:1 , 13.3 enc mb/s , 286.2 dec mb/s
lz4hc : 1.67:1 , 30.3 enc mb/s , 2737.4 dec mb/s
-------------------------------------------------------
"all_dds"
by ratio:
lzma : 2.37:1 , 2.1 enc mb/s , 40.8 dec mb/s
Kraken 220 : 2.23:1 , 1.0 enc mb/s , 650.6 dec mb/s
Kraken 215 : 2.18:1 , 1.0 enc mb/s , 684.6 dec mb/s
lzham : 2.17:1 , 1.3 enc mb/s , 127.7 dec mb/s
zstdmax : 2.02:1 , 3.3 enc mb/s , 289.4 dec mb/s
zlib9 : 1.83:1 , 13.3 enc mb/s , 242.9 dec mb/s
lz4hc : 1.67:1 , 20.4 enc mb/s , 2226.9 dec mb/s
-------------------------------------------------------
"baby_robot_shell.gr2"
by ratio:
lzma : 4.35:1 , 3.1 enc mb/s , 59.3 dec mb/s
Kraken 220 : 3.82:1 , 1.4 enc mb/s , 837.2 dec mb/s
Kraken 215 : 3.77:1 , 1.5 enc mb/s , 878.3 dec mb/s
lzham : 3.77:1 , 1.6 enc mb/s , 162.5 dec mb/s
zstdmax : 2.77:1 , 5.7 enc mb/s , 405.7 dec mb/s
zlib9 : 2.19:1 , 13.9 enc mb/s , 332.9 dec mb/s
lz4hc : 1.78:1 , 40.1 enc mb/s , 2364.4 dec mb/s
-------------------------------------------------------
"win81"
by ratio:
lzma : 2.95:1 , 2.5 enc mb/s , 51.9 dec mb/s
lzham : 2.77:1 , 1.6 enc mb/s , 177.6 dec mb/s
Kraken 220 : 2.77:1 , 1.0 enc mb/s , 818.0 dec mb/s
Kraken 215 : 2.70:1 , 1.0 enc mb/s , 877.0 dec mb/s
zstdmax : 2.64:1 , 3.5 enc mb/s , 417.8 dec mb/s
zlib9 : 2.07:1 , 16.8 enc mb/s , 269.6 dec mb/s
lz4hc : 1.91:1 , 28.8 enc mb/s , 2297.6 dec mb/s
-------------------------------------------------------
"enwik7"
by ratio:
lzma : 3.64:1 , 1.8 enc mb/s , 79.5 dec mb/s
lzham : 3.60:1 , 1.4 enc mb/s , 196.5 dec mb/s
zstdmax : 3.56:1 , 2.2 enc mb/s , 394.6 dec mb/s
Kraken 220 : 3.51:1 , 1.4 enc mb/s , 702.8 dec mb/s
Kraken 215 : 3.49:1 , 1.5 enc mb/s , 789.7 dec mb/s
zlib9 : 2.38:1 , 22.2 enc mb/s , 234.3 dec mb/s
lz4hc : 2.35:1 , 27.5 enc mb/s , 2059.6 dec mb/s
-------------------------------------------------------
You can see that encode & decode speed is slightly worse at level -z7 , and compression ratio is improved. (Most of the other compression levels have roughly the same decode speed; -z7 enables some special options that can hurt decode speed a bit.) Of course even at -z7 Kraken is way faster than anything else comparable!
1. Time only the compressor.
Place your time measurements only around the compressor. Not IO, not your parsing, not mallocs, just the compress or decompress calls. I understand that in the end what you care about is total time to load, but there can be a lot of issues there that need fixing, and they can cloud the comparison of just the compression part. eg. if your parsing is really slow, that will dominate the CPU time and hide the differences between the compressors.
A common problem is that your app-loading takes a large amount of CPU independent of decompression. In that case, you care about how much CPU the decompressor uses, regardless of total load latency, because it runs into the CPU usage of your post-decompress loading code. Another problem is that accurately timing the disk load time is very difficult; it strongly depends on the exact hardware, disk cache usage, file layout and packaging for seek times; etc. It's usually better to simulate disk load times rather than measure it, because good quality measurements require a wide variety of systems to get a spectrum of results. (it's a bit like doing a medical trial on one person otherwise)
There are lots of reasons why you shouldn't just put your timing around your IO to get a "total load time". IO speeds differ massively these days, from network loads at less than 1 MB/s to persistent flash that's getting ever closer to RAM speed (1 GB/s !). You would need to time across a massive range of devices. Even if you fix an average HD speed, are you timing first load (not in cache) or second load (in cache, disk speed = RAM speed)? You might decide that LZMA/7zip is appropriate for network loads, but then it's massively inappropriate on a fast SSD and totally catastrophic when the files are in disk cache. Is your IO async and overlapping with CPU work, or are you needlessly stalling threads on IO? Is your data loaded with just a big binary splat that you point at, or are you crazily parsing byte by byte? There are too many variables for this to be compared reasonably.
2. Time what you actually care about.
If you care about decode time, time the decompression. If you care about encode time, time compression. If you care about round-trip time, add the two times. Compressors are not just "fast" or "slow" at both ends, you can't time encoding and decide that it's a fast or slow compressor if what you care about is decoding.
3. Choose the right options.
Most compressors have the ability to target slightly different use cases. The most common option is the ability to trade off encode time vs. compression ratio. So, if what you care about is smallest size, then run the compressor at its highest encode effort level. It can be tricky to get the options right in most compression libraries; we are woefully non-standardized and not well documented. Aside from the simple "level" parameter, there may be other options that are relevant to your goals, perhaps trading off decompressor memory usage, or decompression speed. With Oodle the best option is always to email us and ask what options will best suit your goals.
4. Run apples-to-apples (threads-to-threads) comparisons.
It can be tricky to compare compressors fairly. As much as possible they should be run in the same way, and they should be run in the way that you will actually use them in your final application. Don't profile them with threads if you will not use them threaded in your shipping application.
Threads are a common problem. Compressors should either be tested all threaded (if you will use threads in your final application), or all non-threaded. Unfortunately the defaults are not the same. "lzma" (7z) and LZHAM create threads by default. You have to change their options to tell them to *not* create threads. The normal Oodle_Compress calls will not use threads by default, you have to specifically call one of the _Async threaded routines. (my personal preference is to benchmark everything without threads to compare single-threaded performance, and you can always add threads for production use)
5. Take the MIN of N run times.
To get reliable timing, you need to run the loop many times, and take the MIN of all times. The min will give you the time it takes when the OS isn't interrupting you with task switches, the CPU isn't clocking-down for speedstep, etc. I usually do 30 *per core* but you can probably get away with a bit less.
6. Wipe the cache.
Assuming you are now doing N loops, you need to invalidate the cache between iterations. If you don't, you will be running the compressor in a "hot cache" scenario, with some buffers already in cache.
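A minimal sketch of points 5 and 6 together. Decompress() is a stand-in for whatever call you're actually timing, and the cache-wiping buffer just needs to be bigger than your last-level cache :

    #include <chrono>
    #include <cstring>
    #include <vector>

    void Decompress(const void * comp, size_t compLen, void * raw, size_t rawLen); // your decoder here

    double TimeDecompressSeconds(const void * comp, size_t compLen, void * raw, size_t rawLen,
                                 int runs = 30)
    {
        static std::vector<char> wiper(64 * 1024 * 1024);   // larger than the last-level cache

        double best = 1e30;
        for (int r = 0; r < runs; r++)
        {
            std::memset(wiper.data(), r, wiper.size());     // wipe the cache between iterations

            auto t0 = std::chrono::high_resolution_clock::now();
            Decompress(comp, compLen, raw, rawLen);         // the only thing inside the timer
            auto t1 = std::chrono::high_resolution_clock::now();

            double secs = std::chrono::duration<double>(t1 - t0).count();
            if (secs < best) best = secs;                   // take the MIN over all runs
        }
        return best;
    }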
7. Don't pack a bunch of files together in a tar if that's not how you load.
It may seem like a good way to test to grab your bunch of test files and pack them together in a tar (or zip -0 or similar package) and run the compression tests on that tar. That's a fine option if that's really how you load data in your final application - as one big contiguous chunk that must be loaded in one big blob. But most people don't. You need to test the compressors in the same way they will be used in the final application. If you load whole file at a time, test the compressors on whole file units. Many people do loading on some kind of paging unit, like perhaps 1 MB chunks. If you do that, then test the compressor on the same thing.
8. Choose your test set.
If you could test on the entire set of buffers that your final application will load, that would be an accurate test. (Though actually, even that is a bit subtle, since some buffers are more latency sensitive than others; for example you might care more about the first few things you load, to get into a running application as quickly as possible.) That's probably not practical, so you want to choose a set that is representative of what you will actually load. Don't exclude things like already-compressed files (JPEGs and so on) *if* you will be running them through the compressor. (Though consider *not* running them through your compressed-file loading path, in which case you should exclude them from testing.) It's pretty hard to get an accurate representative sample, so it's generally best to just get a variety of files and look at individual per-file results.
9. Look at the spectrum of results, not the sum.
After you run on your test set, don't just add up the compressed sizes and times to make a "total" result. Sums can be misleading. One issue is that a few large incompressible files can hide the differences on the more compressible files. But a bigger and more subtle trap is the way that sums weight the combination of results. A sum is a weighting by the size of each file in the test set. That's fine if your test set is all of your data, or is a perfectly proportionally representative sampling of all of your data (a subset which acts like the whole). But most likely it's not. It's best to keep the results per file separate and just have a look at individual cases to see what's going on, how the results differ, and try not to simplify to just looking at the sum.
10. If you do sum, sum *time* not speed, sum *size* not ratio.
Speed (like mb/s) and ratio (raw size/comp size) are inverted measures and shouldn't be summed. What you actually care about is total compressed size, and total time to decode. So if you run over a set of files, don't look at "average speed" or "average ratio", because those are inverted measures that will oddly weight the accumulation. Instead accumulate total time to decode, total raw size, and total compressed size, and then if you like you can make "overall speed" and "overall ratio" from those totals.
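In code form the bookkeeping is trivial; a sketch of point 10 : accumulate totals, then derive the overall numbers from the totals at the end :

    struct Totals { double rawBytes = 0, compBytes = 0, decodeSeconds = 0; };

    void Accumulate(Totals & t, double rawBytes, double compBytes, double decodeSeconds)
    {
        t.rawBytes      += rawBytes;
        t.compBytes     += compBytes;
        t.decodeSeconds += decodeSeconds;
    }

    // derived only at the end, from the totals :
    double OverallRatio(const Totals & t)    { return t.rawBytes / t.compBytes; }
    double OverallSpeedMBs(const Totals & t) { return (t.rawBytes / 1.e6) / t.decodeSeconds; }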
11. Try not to malloc in the timing loop.
Your malloc might be fast, it might be slow, it's best to not have that as a variable in the timing. In general try to allocate the memory for the compressor or decompressor outside of the timing loop. (In Oodle this is done by passing in your own pointer for the "decoderMemory" argument of OodleLZ_Decompress). That would be an unfair test if you didn't also do that in the final application - so do it in the final application too! (similarly, make sure there's no logging inside the timing loop).
12. Consider excluding almost-incompressible files.
This is something you should consider for final shipping application, and if you do it in your shipping application, then you should do it for the benchmark too. The most common case is already-compressed files like JPEG images and MP3 audio. These files can usually be compressed slightly, maybe saving 1% of their size, but the time to decode them is not worth it overall - you can get more total size savings by running a more powerful compressor on other files. So it's most efficient to just send them uncompressed.
13. Tiny files should either be excluded or packed together.
There's almost never a use case where you really want to compress tiny files (< 16k bytes or so) as independent units. There's too much per-unit overhead in the compressor, and more importantly there's too much per-unit overhead in IO - you don't want to eat a disk seek to just to get one tiny file. So in a real application tiny files should always be grouped into paging units that are 256k or more, a size where loading them won't just be a total waste of disk seek time. So, when benchmarking compressors you also shouldn't run them on tiny independent files, because you will never do that in a shipping application (I hope). And of course don't just do this for the benchmark, do it in the final app too.
To my knowledge I was the first person to write about it (in "New Techniques in Context Modeling and Arithmetic Encoding" (PDF) ) but it's one of those simple ideas that probably a lot of people had and didn't write about (like Deferred Summation). It's also one of those ideas that keeps being forgotten and rediscovered over the years.
(I don't know much about the details of how Brotli does this, it may differ. I'll be talking about how I did it).
(also by "Huffman" I pretty much always mean "static Huffman" where you measure the histogram of a block and transmit the code lengths, not "adaptive Huffman" (modifying codelens per symbol (bleck)) or "deferred summation Huffman" (codelens computed from histogram of previous data with no explicit codelen transmission))
Let's start with just the case of order-1 8 bit literals. So you're coding a current 8-bit symbol with an 8-bit previous symbol as context. You can do this naively by just having 256 arrays, one for each 8-bit context. The decoder looks like this :
256 times :
read codelens from file
build huffman decode table
per symbol :
o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , huff_table[o1] );
and on a very large file (*) that might be okay.
(* actually it's only okay on a very large file with completely stable statistics, which never happens in practice. In the real world "very large files" don't usually exist; instead they act like a sequence of small/medium files tacked together. That is, you want a decoder that works well on small files, and you want to be able to reset it periodically (re-transmit huffman codelens in this case) so that it can adapt to local statistics).
On small files, it's disastrous. You're sending 256 sets of codelens, which can be a lot of wasted data. Worst of all it's a huge decode time overhead to parse out the codelens and build the decode tables if you're only going to get a few symbols in that context.
So you want to reduce the count of huffman tables. A rough guideline is to make the number of tables proportional to the number of bytes. Maybe 1 table per 1024 bytes is tolerable to you, it depends.
(another goal for reduction might be to get all the huff tables to fit in L2 cache)
So we want to merge the Huffman tables. You want to find pairs of contexts that have the most similar statistics and merge those.
If you don't mind the poor encoder-time speed, a good solution is a best-first merge :
for each pair {i,j} (i<j) :
  merge_cost(i,j) = Huffman_Cost( symbols_i + symbols_j ) - Huffman_Cost( symbols_i ) - Huffman_Cost( symbols_j )

Huffman_Cost( symbols ) = bits to send codelens + bits to encode symbols using those codelens

while # of contexts > target , and/or merge cost < target :
  pop lowest merge_cost
  merge context j onto i
  delete all merge costs involving j
  recompute all merge costs involving i

(If the cost was just entropy (H) instead of Huffman_Cost , then a merge_cost would always be strictly >= 0 (separate statistics are always cheaper than combining). But since the Huffman codelen transmission is not free, the first merges will actually reduce encoded size. So you should always do merges that are free or beneficial, even if the huffman table count is already low enough.)
So contexts with similar statistics will get merged together, since coding them with a combined set of codelens either doesn't hurt or hurts only a little (or helps, with the cost of codelen transmission counted). In this way contexts where it wasn't really helping to differentiate them will get reduced.
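Here's what that merge loop might look like in practice; a sketch, not Oodle's or Brotli's actual code. Huffman_Cost() is an assumed helper (codelen transmission bits plus bits to code the histogram with those codelens), and stale heap entries are skipped with a per-context generation counter instead of being deleted :

    #include <cstdint>
    #include <queue>
    #include <vector>

    typedef std::vector<uint32_t> Histo;   // [256] symbol counts for one context

    double Huffman_Cost(const Histo & h);  // assumed provided : codelen bits + coded bits

    // histos : one histogram per context, merged in place
    // remap  : on return, remap[context] = index of the surviving histogram it uses
    void MergeContexts(std::vector<Histo> & histos, std::vector<int> & remap,
                       int targetCount, double maxMergeCost)
    {
        int n = (int)histos.size();
        remap.resize(n);
        for (int i = 0; i < n; i++) remap[i] = i;

        std::vector<bool>   alive(n, true);
        std::vector<int>    gen(n, 0);     // bumped whenever a histogram changes
        std::vector<double> solo(n);
        for (int i = 0; i < n; i++) solo[i] = Huffman_Cost(histos[i]);

        struct Cand { double cost; int i, j, gi, gj; };
        auto cmp = [](const Cand & a, const Cand & b) { return a.cost > b.cost; };
        std::priority_queue<Cand, std::vector<Cand>, decltype(cmp)> q(cmp);

        auto push = [&](int i, int j) {
            Histo m = histos[i];
            for (int s = 0; s < 256; s++) m[s] += histos[j][s];
            q.push({ Huffman_Cost(m) - solo[i] - solo[j], i, j, gen[i], gen[j] });
        };
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                push(i, j);

        int aliveCount = n;
        while (!q.empty() && aliveCount > 1)
        {
            Cand c = q.top(); q.pop();
            if (!alive[c.i] || !alive[c.j]) continue;            // involves a dead context
            if (gen[c.i] != c.gi || gen[c.j] != c.gj) continue;  // stale cost
            if (aliveCount <= targetCount && c.cost > maxMergeCost) break;

            // merge context j onto i
            for (int s = 0; s < 256; s++) histos[c.i][s] += histos[c.j][s];
            solo[c.i] = Huffman_Cost(histos[c.i]);
            gen[c.i]++;
            alive[c.j] = false;
            aliveCount--;
            for (int k = 0; k < n; k++) if (remap[k] == c.j) remap[k] = c.i;

            // recompute merge costs involving i against all remaining contexts
            for (int k = 0; k < n; k++)
                if (alive[k] && k != c.i)
                    push(c.i, k);
        }
    }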
Once this is done, the decoder becomes :
get n = number of huffman tables
n times :
read codelens from file
build huffman decode table
256 times :
read tableindex from file
merged_huff_table_ptr[i] = huff_table[ tableindex ]
per symbol :
o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , merged_huff_table_ptr[o1] );
So merged_huff_table_ptr[] is still a [256] array, but it points at only [n] unique Huffman tables.
That's order-1 Huffman!
In the modern world, we know that o1 = the previous literal is not usually the best use of an 8-bit context. You might do something like top 3 bits of ptr[-1], top 2 bits of ptr[-2], 2 bits of position, to make a 7-bit context.
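eg. something like this (purely illustrative, not a context Oodle actually uses) :

    // 3 + 2 + 2 = 7-bit context from two previous bytes and the position
    int Context7(const unsigned char * ptr, int pos)
    {
        return ((ptr[-1] >> 5) << 4)    // top 3 bits of the previous byte
             | ((ptr[-2] >> 6) << 2)    // top 2 bits of the byte before that
             | (pos & 3);               // 2 bits of position
    }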
One of the cool things order-1 Huffman can do is to adaptively figure out the context for you.
For example with LZMA you have the option of the # of literal context bits (lc) and literal pos bits (lp). You want them to be as low as possible for better statistics, and there's no good way to choose them per file. (usually lc=2 or lp=2 , usually just one or the other, not both)
With order-1 Huffman, you just make a context with 3 bits of lc and 3 bits of lp, so you have a [64] 6-bit context. Then you let the merger throw away states that don't help. For example if it's a file where pos-bits are irrelevant (like text), they will just get merged out, all the lc contexts that have different lp values will merge together.
The "signed int" context takes the previous two bytes and forms a 6-bit context from them thusly :
Context = (Lut2[b2]<<3) | Lut2[b1];
Lut2 :=
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7
So, it's roughly categorizing the two values into ranges, which means it can act as a kind of linear predictor (if that fits the data), eg. two preceding values in group 4 = my value probably is too, or if b2 is a "3" and b1 is a "4" then I'm likely a "5". Or not, if the data isn't linear like that. Or maybe there's only correlation to b1 and b2 gets ignored, which the order-1-huff can also model.
One thing I like is holding out values 00 and FF as special cases that get a unique bucket. This lets you detect the special cases of last two bytes = 0000,FFFF,FF00,00FF , which can be pretty important on binary.
I think that for the type of data we get in games that often has floats, it might be worth it to single out 7F and 80 as well, something like :
0,1111....
1..
22...
2........2,3
4,5555...
5...
66.....
6........6,7
but who knows, would have to test to see.
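A concrete (and equally untested) rendering of that sketch as a bucket function, with 00, 7F, 80 and FF each held out in their own bucket and the ranges in between sharing the rest :

    // 3-bit bucket per byte ; context = (Bucket(b2)<<3) | Bucket(b1) , same form as above
    static int Bucket(unsigned char b)
    {
        if (b == 0x00) return 0;
        if (b == 0x7F) return 3;
        if (b == 0x80) return 4;
        if (b == 0xFF) return 7;
        if (b <  0x40) return 1;   // 01..3F
        if (b <  0x7F) return 2;   // 40..7E
        if (b <  0xC0) return 5;   // 81..BF
        return 6;                  // C0..FE
    }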
It's a shame it never got a good mainstream implementation. It could/should have been the LZ we all used for the past 10 years.
One of the little mistakes in LZX was the 21 bit offset limit. This must have seemed enormous back on the Amiga in 1995, but very quickly became a major handicap against LZs with unlimited windows.
LZX with unlimited window (eg. on files less than 2 MB) is competitive with any modern LZ, especially on binary structured data where it really shines. In hindsight, LZX is the clear ancestor to LZMA and it was way ahead of its time. We're only clearly beating it in the past year or two (!!).
2. RAR. The primary LZ in RAR is a pretty straightforward LZ-Huff (I believe). It's fine, it's nothing bad or special.
What makes RAR special is the filters. It still has the best filters of any compressor I know.
RAR+filters often *beats* LZMA and other very slow high ratio compressors.
The special thing about the RAR filters is that they aren't like most of the "precomp" solutions that just try to recognize WAV headers and things like that - RAR may do some of that (I have no idea) - but it also definitely finds filters that work on headerless data. Like, you can take a BMP or WAV and strip off the header and RAR will still figure out that there's data to filter in there; it must have some analysis heuristics, and they're better than anything else I've seen.
As an example of when RAR filters do magic, here's a 24-bit RGB BMP with the first 100k stripped, so it's headerless and not easily recognized by file-type-detection filters :
PDI_1200_bmp_no_header.zl8.LZNA,1241369
PDI_1200_bmp_no_header.nz,1264747
PDI_1200_bmp_no_header.BitKnit,1268120 // <- wow BitKnit !
PDI_1200_bmp_no_header.LZNA,1306670
PDI_1200_bmp_no_header.rar,1312621 // <- RAR filters!
PDI_1200_bmp_no_header.7z,1377603
PDI_1200_bmp_no_header.brotli10,1425996
PDI_1200_bmp_no_header_lp2.7z,1469079
PDI_1200_bmp_no_header.Kraken,1506915
PDI_1200_bmp_no_header.lzx21,1580593
PDI_1200_bmp_no_header.zstd060,1619920
PDI_1200_bmp_no_header.mc-.rar,1631419 // <- RAR unfiltered
PDI_1200_bmp_no_header.brotli9,1635105
PDI_1200_bmp_no_header.z9.zip,1708589
PDI_1200_bmp_no_header.lz4xc4,1835955
PDI_1200_bmp_no_header.raw,2500000
That said, it does make mistakes.
Sometimes filters can make things way worse if they make a wrong decision. They don't have a "filters must help" safety check.
This is easy to prevent: you just also run with no filter and make sure it helped. But they seem to not do that (presumably to save the encode time) and the results can be disastrous :
lightmap.bc3.LZNA,361185
lightmap.bc3.7z,373909
lightmap.bc3_lp2.7z,387590
lightmap.bc3.brotli10,391602
lightmap.bc3.BitKnit,416208
lightmap.bc3.zstd060,417956
lightmap.bc3.Kraken,431476
lightmap.bc3.lzx21,441669
lightmap.bc3.mc-.rar,457893 // <- RAR with disabled filters
lightmap.bc3.brotli9,498802
lightmap.bc3.z9.zip,583178
lightmap.bc3.rar,778363 // <- RAR with filters huge fuckup !!
lightmap.bc3.raw,4194332
RAR filters fucking up on DXTC (BCn) is pretty consistent :
c.dds.nz,371610
c.dds.7z,371749
c.dds.zl8.LZNA,371783
c.dds.LZNA,373245
c.dds_lp2.7z,375384
c.dds.lzx21,395866
c.dds.brotli10,399674
c.dds.BitKnit,400528
c.dds.Kraken,405563
c.dds.mc-.rar,408363 // <- unfiltered is okay
c.dds.zstd060,411515
c.dds.brotli9,426948
c.dds.z9.zip,430952
c.dds.rar,438070 // <- oops!
c.dds.raw,524416
Sometimes it does magic :
horse.vipm_lp2.7z,925996
horse.vipm.LZNA,942950
horse.vipm.7z,945707
horse.vipm.brotli10,955363 // <- brotli10 big step
horse.vipm.rar,971716 // <- RAR with filters does magic
horse.vipm.BitKnit,1017740
horse.vipm.lzx21,1029541
horse.vipm.mc-.rar,1066205 // <- RAR with disabled filters
horse.vipm.zstd060,1100219
horse.vipm.Kraken,1106081
horse.vipm.brotli9,1108858
horse.vipm.z9.zip,1155056
horse.vipm.raw,1573070
Here's an XRGB dds where the RAR filters do magic :
d.dds.zl8.LZNA,352232
d.dds.nz,356649
d.dds.BitKnit,360220 // (at zl6 BitKnit beats LZNA ! crushes 7z! wow)
d.dds.LZNA,381250
d.dds.rar,382282 // <- RAR filter crushes 7z
d.dds_lp2.7z,427395
d.dds.7z,452898
d.dds.brotli10,471413
d.dds.Kraken,480257
d.dds.lzx21,520632
d.dds.mc-.rar,534913 // <- RAR unfiltered is poor
d.dds.brotli9,542792
d.dds.zstd060,545583
d.dds.z9.zip,560708
d.dds.raw,1048704
happy.zl8.LZNA,949709
happy.LZNA,955700
happy_lp2.7z,974550
happy.BitKnit,979832
happy.7z,1004359
happy.cOO.nz,1015048
happy.co.nz,1028196
happy.Kraken,1109748
happy.brotli10,1135252
happy.lzx21,1168220
happy.mc-.rar,1177426 // <- RAR unfiltered is okay
happy.zstd060,1199064
happy.brotli9,1219174
happy.rar,1354649 // <- RAR filters fucks up
happy.z9.zip,1658789
happy.lz4xc4,2211700
happy.raw,4155083
Not about RAR, but for historical comparison,
lzt24 is another mesh (the "struct72" file here )
lzt24.zl8.LZNA,1164216
lzt24.LZNA,1177160
lzt24.nz,1206662
lzt24_lp2.7z,1221821
lzt24.BitKnit,1224524
lzt24.7z,1262013
lzt24.Kraken,1307691
lzt24.brotli10,1323486
lzt24.brotli9,1359566
lzt24.lzx21,1475776
lzt24.zstd060,1498401
lzt24.mc-.rar,1612286
lzt24.rar,1612286
lzt24.z9.zip,2290695
lzt24.raw,3471552
Found another weird one where RAR filters do magic; lzt25 is super-structured 13-byte structs :
lzt25.rar,40024 // <- WOW RAR filters!
lzt25.nz,45397
lzt25.7z,51942
lzt25_lp2.7z,52579
lzt25.LZNA,58903
lzt25.zl8.LZNA,61582 // <- zl8 LZNA worse than zl6 - weird file
lzt25.lzx21,63198
lzt25.zstd060,64550 // <- ZStd does surprisingly well here, I thought you needed more reps on this file
lzt25.brotli9,67856
lzt25.Kraken,67986
lzt25.brotli10,68472 // <- brotli10 worse than brotli9 !
lzt25.BitKnit,92940 // <- BitKnit oddly struggling
lzt25.mc-.rar,106423 // <- unfiltered RAR is the worst of the LZ's
lzt25.z9.zip,209811
lzt25.lz4xc4,324125
lzt25.raw,1029744
A lot of interesting things to pick out in those reports. (just saying, I'm not gonna address them all)
One general thing is that the performance of these LZ's is in no way consistent. You can't just say that "X LZ is 5% better than Y" ; there's no really consistent pattern, they have wildly variable relative performance.
There's a family of sort of normal LZ's - LZX, Brotli9, ZStd, & unfiltered RAR. Then there's the family of the high-compress LZ's, LZNA, 7z, nz. Those are pretty consistently together, and form two end-points.
But then there are the floaters. BitKnit, Kraken, Filtered RAR, and Brotli10 can jump around between the "normal LZ" and "high-compress LZ" region. BitKnit and Brotli10 are the most variable - they both can jump right up to the high-compress LZ's like 7z, but on other files they drop right down into the pack of normal LZ's (LZX, etc.).
I have a guess about what's happening with Brotli. I haven't looked at the code at all, but my guess is that between level 9 and 10 the order-1 context optimization is turned on. In particular, there's this "signed int" context mode which I believe is what does the magic for brotli on things like horse.vipm (for example it has contexts for the case of last two bytes = 0x0000 , or last two bytes = 0xFFFF , which are pretty common on horse). My guess is that this mode is just not even tried at all at level 9, and at level 10 it turns on the code to pick the best context mode, and finds the signed int mode which is great on these files. Not sure.
So how does it do?
To visualize the size-decodespeed Pareto frontier, I like to use an imaginary loading problem. You want to load compressed data and then decompress it. One option is the null compressor (or "memcpy") that just loads uncompressed data and then memcpys it. Or you can use compressors that give smaller file sizes, thus quicker loads, but take more time to decompress. By dialing the virtual disk speed, you can see which compressor is best at different tradeoffs of space vs. speed.
Of course you may not actually be just trying to optimize load time, but you can still use this imaginary loading problem as a way to study the space-speed tradeoff. If you care more about size, you look at lower imaginary disk speeds; if you care more about CPU use, you look at higher imaginary disk speeds.
I like to show "speedup" which is the factor of increase in speed using a certain compressor to do load+decomp vs. the baseline (memcpy). So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff. (By "show" I mean "is linearly proportional to, so you can actually look at the ratio of Y-intercepts in the Pareto curve and it tells you compression ratio on the left and decompression speed on the right).
03-02-15 - Oodle LZ Pareto Frontier and 05-13-15 - Skewed Pareto Chart
The chart showing "millis" shows time, so obviously lower is better. I show a few ways to combine load & decomp. Sum is the time to load then decomp sequentially. Max is the larger of the two, which is the time if they were perfectly overlapped (not usually possible). Personally I think "l2 sum" is most useful. This is the sqrt of sum of squares, which is a kind of sum that biases towards the larger; it's kind of like a mix of "sum" and "max", it means you want the sum to be small, but you also want them to similar times.
Kraken tops the Pareto curve for decodespeed vs size for a huge range of virtual disk speed; from around 2 mb/s up to around 300 mb/s.
Of course the Pareto curve doesn't tell the whole picture. For one thing I don't have encode speed in the equation here at all (and sometimes you need to look at the memory use tradeoff too, so it's really a size-decodespeed-encodespeed-memoryuse surface). For another you sometimes have strict requirements, like I must hit a certain file size, and then you just pick the fastest decoder that meets that requirement. One thing the curves do clearly tell you is when a curve just completely covers another (that is, is greater in Y for all values of X), then that compressor is just never needed (in space & decodespeed).
(BTW for anyone trying to compare new charts to old ones on my blog - "Game Test Set" is not a static set. It changes all the time as I get more customer data and try to make it match the most representative data I have for games.)
Just for comparison to earlier posts on my Blog about the other Oodle compressors, here's a run of lzt99 with the full Oodle compressor set.
I've included ZStd for reference here just to show that Kraken jumping way up is not because the previous Oodle compressors were bad - far from it, they were already Pareto optimal. (I like to compare against ZStd not because it's bad, but because it's excellent; it's the only other compressor in the world I know of that's even close to Oodle in terms of good compression ratio and high speed. ZStd also has nice super-fast encode speeds, and it's targeted slightly differently - it's better on text and Oodle is better on binary - so it's not a direct apples-to-apples comparison.)
You can see that LZHLW is just completely covered by Kraken, and so is now deprecated in Oodle. Even LZNIB and BitKnit are barely peeking out of the curve, so the range where they are the right answer is greatly reduced to more specialized needs. (for example BitKnit still romps on strongly structured data, and is useful if you just need a size smaller than Kraken)
First some general notes about Oodle before the big dump of numbers. (skip to the bottom for charts)
Oodle is not intended to solve every problem in data compression. Oodle Data Compression is mainly designed for compress-once load-many usage patterns, where the compressed size and decode speed is important, but encode speed is not as important. Things like distribution, packaging, when you bake content into a compressed archive and then serve it to many people. I do consider it a requirement to keep the encoders faster than 1 MB/s on my ancient laptop, because less than that is just too slow even for content baking.
Like most data compressors, Oodle has various compression level options that trade off encoder speed for compressed size. So there are faster and slower encoders; some of the fast modes are good tradeoffs for space-speed. Kraken turns out to be pretty good at getting high compression even in the fast modes. I'll do a post that goes into this in the future.
Oodle is mainly intended for packing binary data. Because Oodle is built for games, we focus on the types of data that games ship, such as textures, models, animations, levels, and other binary structured data that often contains various types of packed numeric data. Oodle is good on text but is not really built for text. In general if you are focusing on a specific type of data (eg. text/html/xml) you will get the best results with domain-specific preprocessors, such as xwrt or wbpe. Oodle+wbpe is an excellent text compressor.
Oodle Kraken is intended for high compression use cases, where you care about size. There's a whole family of compressors now that live in the "just below lzma (7zip)" compression ratio domain, such as LZHAM, Brotli, Oodle BitKnit, and ZStd. In general the goal of these is to get as close to lzma compression ratio as possible while providing much better speed. This is where Kraken achieves far more than anyone has before, far more than we thought possible.
Today I will be focusing on high compression modes, looking at decode speed vs. compression ratio. All compressors will generally be run in their max compression mode, without going into the "ridiculously slow" below 1 MB/s range. (for Oodle this means running -zl6 Optimal2 but not -zl7 or higher).
Okay, on to lots of numbers.
Rather than post an average on a test set, which can be misleading due to the selection of the test set
or the way the results are averaged (total compressed size or average ratio?), I'll post results on a
selection of individual files :
-------------------------------------------------------
"silesia_mozilla"
by ratio:
lzma : 3.88:1 , 2.0 enc mb/s , 63.7 dec mb/s
lzham : 3.56:1 , 1.5 enc mb/s , 186.4 dec mb/s
Oodle Kraken: 3.51:1 , 1.2 enc mb/s , 928.0 dec mb/s
zstdmax : 3.24:1 , 2.8 enc mb/s , 401.0 dec mb/s
zlib9 : 2.51:1 , 12.4 enc mb/s , 291.5 dec mb/s
lz4hc : 2.32:1 , 36.4 enc mb/s , 2351.6 dec mb/s
by encode speed:
lz4hc : 2.32:1 , 36.4 enc mb/s , 2351.6 dec mb/s
zlib9 : 2.51:1 , 12.4 enc mb/s , 291.5 dec mb/s
zstdmax : 3.24:1 , 2.8 enc mb/s , 401.0 dec mb/s
lzma : 3.88:1 , 2.0 enc mb/s , 63.7 dec mb/s
lzham : 3.56:1 , 1.5 enc mb/s , 186.4 dec mb/s
Oodle Kraken: 3.51:1 , 1.2 enc mb/s , 928.0 dec mb/s
by decode speed:
lz4hc : 2.32:1 , 36.4 enc mb/s , 2351.6 dec mb/s
Oodle Kraken: 3.51:1 , 1.2 enc mb/s , 928.0 dec mb/s
zstdmax : 3.24:1 , 2.8 enc mb/s , 401.0 dec mb/s
zlib9 : 2.51:1 , 12.4 enc mb/s , 291.5 dec mb/s
lzham : 3.56:1 , 1.5 enc mb/s , 186.4 dec mb/s
lzma : 3.88:1 , 2.0 enc mb/s , 63.7 dec mb/s
-------------------------------------------------------
"lzt99"
by ratio:
lzma : 2.65:1 , 3.1 enc mb/s , 42.3 dec mb/s
Oodle Kraken: 2.46:1 , 2.3 enc mb/s , 957.1 dec mb/s
lzham : 2.44:1 , 1.9 enc mb/s , 166.0 dec mb/s
zstdmax : 2.27:1 , 3.8 enc mb/s , 482.3 dec mb/s
zlib9 : 1.77:1 , 13.3 enc mb/s , 286.2 dec mb/s
lz4hc : 1.67:1 , 30.3 enc mb/s , 2737.4 dec mb/s
by encode speed:
lz4hc : 1.67:1 , 30.3 enc mb/s , 2737.4 dec mb/s
zlib9 : 1.77:1 , 13.3 enc mb/s , 286.2 dec mb/s
zstdmax : 2.27:1 , 3.8 enc mb/s , 482.3 dec mb/s
lzma : 2.65:1 , 3.1 enc mb/s , 42.3 dec mb/s
Oodle Kraken: 2.46:1 , 2.3 enc mb/s , 957.1 dec mb/s
lzham : 2.44:1 , 1.9 enc mb/s , 166.0 dec mb/s
by decode speed:
lz4hc : 1.67:1 , 30.3 enc mb/s , 2737.4 dec mb/s
Oodle Kraken: 2.46:1 , 2.3 enc mb/s , 957.1 dec mb/s
zstdmax : 2.27:1 , 3.8 enc mb/s , 482.3 dec mb/s
zlib9 : 1.77:1 , 13.3 enc mb/s , 286.2 dec mb/s
lzham : 2.44:1 , 1.9 enc mb/s , 166.0 dec mb/s
lzma : 2.65:1 , 3.1 enc mb/s , 42.3 dec mb/s
-------------------------------------------------------
"all_dds"
by ratio:
lzma : 2.37:1 , 2.1 enc mb/s , 40.8 dec mb/s
Oodle Kraken: 2.18:1 , 1.0 enc mb/s , 684.6 dec mb/s
lzham : 2.17:1 , 1.3 enc mb/s , 127.7 dec mb/s
zstdmax : 2.02:1 , 3.3 enc mb/s , 289.4 dec mb/s
zlib9 : 1.83:1 , 13.3 enc mb/s , 242.9 dec mb/s
lz4hc : 1.67:1 , 20.4 enc mb/s , 2226.9 dec mb/s
by encode speed:
lz4hc : 1.67:1 , 20.4 enc mb/s , 2226.9 dec mb/s
zlib9 : 1.83:1 , 13.3 enc mb/s , 242.9 dec mb/s
zstdmax : 2.02:1 , 3.3 enc mb/s , 289.4 dec mb/s
lzma : 2.37:1 , 2.1 enc mb/s , 40.8 dec mb/s
lzham : 2.17:1 , 1.3 enc mb/s , 127.7 dec mb/s
Oodle Kraken: 2.18:1 , 1.0 enc mb/s , 684.6 dec mb/s
by decode speed:
lz4hc : 1.67:1 , 20.4 enc mb/s , 2226.9 dec mb/s
Oodle Kraken: 2.18:1 , 1.0 enc mb/s , 684.6 dec mb/s
zstdmax : 2.02:1 , 3.3 enc mb/s , 289.4 dec mb/s
zlib9 : 1.83:1 , 13.3 enc mb/s , 242.9 dec mb/s
lzham : 2.17:1 , 1.3 enc mb/s , 127.7 dec mb/s
lzma : 2.37:1 , 2.1 enc mb/s , 40.8 dec mb/s
-------------------------------------------------------
"baby_robot_shell.gr2"
by ratio:
lzma : 4.35:1 , 3.1 enc mb/s , 59.3 dec mb/s
Oodle Kraken: 3.77:1 , 1.5 enc mb/s , 878.3 dec mb/s
lzham : 3.77:1 , 1.6 enc mb/s , 162.5 dec mb/s
zstdmax : 2.77:1 , 5.7 enc mb/s , 405.7 dec mb/s
zlib9 : 2.19:1 , 13.9 enc mb/s , 332.9 dec mb/s
lz4hc : 1.78:1 , 40.1 enc mb/s , 2364.4 dec mb/s
by encode speed:
lz4hc : 1.78:1 , 40.1 enc mb/s , 2364.4 dec mb/s
zlib9 : 2.19:1 , 13.9 enc mb/s , 332.9 dec mb/s
zstdmax : 2.77:1 , 5.7 enc mb/s , 405.7 dec mb/s
lzma : 4.35:1 , 3.1 enc mb/s , 59.3 dec mb/s
lzham : 3.77:1 , 1.6 enc mb/s , 162.5 dec mb/s
Oodle Kraken: 3.77:1 , 1.5 enc mb/s , 878.3 dec mb/s
by decode speed:
lz4hc : 1.78:1 , 40.1 enc mb/s , 2364.4 dec mb/s
Oodle Kraken: 3.77:1 , 1.5 enc mb/s , 878.3 dec mb/s
zstdmax : 2.77:1 , 5.7 enc mb/s , 405.7 dec mb/s
zlib9 : 2.19:1 , 13.9 enc mb/s , 332.9 dec mb/s
lzham : 3.77:1 , 1.6 enc mb/s , 162.5 dec mb/s
lzma : 4.35:1 , 3.1 enc mb/s , 59.3 dec mb/s
-------------------------------------------------------
"win81"
by ratio:
lzma : 2.95:1 , 2.5 enc mb/s , 51.9 dec mb/s
lzham : 2.77:1 , 1.6 enc mb/s , 177.6 dec mb/s
Oodle Kraken: 2.70:1 , 1.0 enc mb/s , 877.0 dec mb/s
zstdmax : 2.64:1 , 3.5 enc mb/s , 417.8 dec mb/s
zlib9 : 2.07:1 , 16.8 enc mb/s , 269.6 dec mb/s
lz4hc : 1.91:1 , 28.8 enc mb/s , 2297.6 dec mb/s
by encode speed:
lz4hc : 1.91:1 , 28.8 enc mb/s , 2297.6 dec mb/s
zlib9 : 2.07:1 , 16.8 enc mb/s , 269.6 dec mb/s
zstdmax : 2.64:1 , 3.5 enc mb/s , 417.8 dec mb/s
lzma : 2.95:1 , 2.5 enc mb/s , 51.9 dec mb/s
lzham : 2.77:1 , 1.6 enc mb/s , 177.6 dec mb/s
Oodle Kraken: 2.70:1 , 1.0 enc mb/s , 877.0 dec mb/s
by decode speed:
lz4hc : 1.91:1 , 28.8 enc mb/s , 2297.6 dec mb/s
Oodle Kraken: 2.70:1 , 1.0 enc mb/s , 877.0 dec mb/s
zstdmax : 2.64:1 , 3.5 enc mb/s , 417.8 dec mb/s
zlib9 : 2.07:1 , 16.8 enc mb/s , 269.6 dec mb/s
lzham : 2.77:1 , 1.6 enc mb/s , 177.6 dec mb/s
lzma : 2.95:1 , 2.5 enc mb/s , 51.9 dec mb/s
-------------------------------------------------------
"enwik7"
by ratio:
lzma : 3.64:1 , 1.8 enc mb/s , 79.5 dec mb/s
lzham : 3.60:1 , 1.4 enc mb/s , 196.5 dec mb/s
zstdmax : 3.56:1 , 2.2 enc mb/s , 394.6 dec mb/s
Oodle Kraken: 3.49:1 , 1.5 enc mb/s , 789.7 dec mb/s
zlib9 : 2.38:1 , 22.2 enc mb/s , 234.3 dec mb/s
lz4hc : 2.35:1 , 27.5 enc mb/s , 2059.6 dec mb/s
by encode speed:
lz4hc : 2.35:1 , 27.5 enc mb/s , 2059.6 dec mb/s
zlib9 : 2.38:1 , 22.2 enc mb/s , 234.3 dec mb/s
zstdmax : 3.56:1 , 2.2 enc mb/s , 394.6 dec mb/s
lzma : 3.64:1 , 1.8 enc mb/s , 79.5 dec mb/s
Oodle Kraken: 3.49:1 , 1.5 enc mb/s , 789.7 dec mb/s
lzham : 3.60:1 , 1.4 enc mb/s , 196.5 dec mb/s
by decode speed:
lz4hc : 2.35:1 , 27.5 enc mb/s , 2059.6 dec mb/s
Oodle Kraken: 3.49:1 , 1.5 enc mb/s , 789.7 dec mb/s
zstdmax : 3.56:1 , 2.2 enc mb/s , 394.6 dec mb/s
zlib9 : 2.38:1 , 22.2 enc mb/s , 234.3 dec mb/s
lzham : 3.60:1 , 1.4 enc mb/s , 196.5 dec mb/s
lzma : 3.64:1 , 1.8 enc mb/s , 79.5 dec mb/s
-------------------------------------------------------
In chart form :
(lz4 decode speed is off the top of the chart)
I'm not including the other Oodle compressors here just to keep things as simple as possible. If you do want more compression than Kraken, and care about decode speed, then Oodle LZNA or BitKnit are much faster to decode than lzma (7zip) at comparable or better compression ratios.
Visit radgametools.com to learn more about Kraken and Oodle.
I'll be doing some more expert-oriented techy followup posts here at rants.html
Finnish was by some guys that I assume were from Finland. If anybody knows the correct attribution please let me know.
I was thinking about it the other day because we talked about the old segment register trick that we used to do, and I always thought this was such a neat little bit of code. It also uses the byte-regs as part of word-reg tricks.
Finnish :
; es = CharTable
; bx = hash index
; dl = control bits
; ds[si] = input
; ds[di] = output
; ax/al/ah = input char
; bp = control ptr
ProcessByte macro SourceReg,BitVal
cmp SourceReg, es:[bx]
je ProcessByte_done
or dl, BitVal
mov es:[bx], SourceReg
mov ds:[di], SourceReg
inc di
ProcessByte_done: mov bh, bl
mov bl, SourceReg
endm
ProcessBlockLoop:
mov bp, di ; ControlPtr = CompPtr++;
inc di
xor dl, dl ; Control = 0;
lodsw ; AX = ds[si] , si += 2
ProcessByte al, 80h
ProcessByte ah, 40h
lodsw
ProcessByte al, 20h
ProcessByte ah, 10h
lodsw
ProcessByte al, 08h
ProcessByte ah, 04h
lodsw
ProcessByte al, 02h
ProcessByte ah, 01h
mov ds:[bp], dl ; *ControlPtr = Control
I had :
if ( bits >= huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
Consume( table[peek].codeLen );
return table[peek].symbol;
}
it should have been :
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
Consume( table[peek].codeLen );
return table[peek].symbol;
}
it's corrected now.
In my convention, branchCodeLeftAligned is the left-aligned bitbuff value that means you must go to a higher codelen.
I thought for clarity I'd go ahead and post the example I did with him :
You have this alphabet :
symbol_id, codeLen, code:
0 ; 2 ; 00
1 ; 3 ; 010
2 ; 3 ; 011
3 ; 3 ; 100
4 ; 4 ; 1010
5 ; 4 ; 1011
6 ; 4 ; 1100
7 ; 4 ; 1101
8 ; 4 ; 1110
9 ; 5 ; 11110
10; 5 ; 11111
baseCode[n] = first code of len n - # of codes of lower len
baseCode,
[2] 0
[3] 1 = 010 - 1
[4] 6 = 1010 - 4
[5] 21 = 11110 - 9
huff_branchCodeLeftAligned
[2] 0x4000000000000000 010000...
[3] 0xa000000000000000 101000...
[4] 0xf000000000000000 111100...
[5] 0xffffffffffffffff 111111...
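For reference, here's a minimal sketch of how baseCode and huff_branchCodeLeftAligned could be built from a count of codes per length (just illustrative, assuming a numCodesOfLen[] histogram and canonical MSB-first code assignment; not the actual Oodle table builder) :

#include <stdint.h>

#define WORD_SIZE   64
#define MAX_CODELEN 5   // for the example alphabet above

// canonical codes are assigned in order of increasing code length, so :
//   baseCode[n] = (first code of len n) - (# of codes of length < n)
//   huff_branchCodeLeftAligned[n] = (one past the last code of len n), left-aligned
//     = the smallest left-aligned bitbuff value that can NOT be decoded at length n
static void build_tables(const int numCodesOfLen[MAX_CODELEN+1],
                         uint32_t baseCode[MAX_CODELEN+1],
                         uint64_t huff_branchCodeLeftAligned[MAX_CODELEN+1])
{
    uint32_t code = 0;      // first canonical code of the current length
    uint32_t numLower = 0;  // # of codes of strictly lower length
    for (int len = 1; len <= MAX_CODELEN; len++)
    {
        baseCode[len] = code - numLower;
        uint32_t codeEnd = code + (uint32_t)numCodesOfLen[len]; // one past last code of this len
        // left-align; if this length uses up the whole code space there are no longer codes,
        // so saturate to all-ones (the bitbuff=~0 case mentioned below)
        huff_branchCodeLeftAligned[len] = (codeEnd >= (1u << len)) ?
            ~(uint64_t)0 : ((uint64_t)codeEnd << (WORD_SIZE - len));
        code = codeEnd << 1;               // first code of the next length
        numLower += (uint32_t)numCodesOfLen[len];
    }
}

// for the alphabet above : numCodesOfLen[] = { 0, 0, 1, 3, 5, 2 }
// gives baseCode[2..5] = { 0, 1, 6, 21 } and the four branch values listed.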
My decode loop is :
for(int codeLen=1;;codeLen++) // actually unrolled, not a loop
{
if ( bitbuff < huff_branchCodeLeftAligned[codeLen] ) return symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];
}
or
int codeLen = minCodeLen;
while ( bitbuff >= huff_branchCodeLeftAligned[codeLen] ) codeLen++;
sym = symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];
so if bitbuff is
11010000...
codeLen starts at 2
we check
if ( 11010000.. < 0x4000... ) - false
if ( 11010000.. < 0xa000... ) - false
if ( 11010000.. < 0xf000... ) - true
return ( 1101 - baseCode[4] ); = 13 - 6 = 7
And a full table-accelerated decoder for this code might be :
// 3-bit acceleration table :
#define TABLE_N_BITS 3
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
Consume( table[peek].codeLen );
return table[peek].symbol;
}
if ( bitbuff < huff_branchCodeLeftAligned[4] ) return symbolUnsort[ getbits(4) - baseCode[4] ];
// 5 is max codelen
// this compare is not always true (because of the bitbuff=~0 problem), but we take it anyway
//if ( bitbuff < huff_branchCodeLeftAligned[5] )
return symbolUnsort[ getbits(5) - baseCode[5] ];
And there you go. MSB-first Huffman that supports long code lengths that exceed the acceleration table size.
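For completeness, the little acceleration table used above could be filled like this (again just a sketch, assuming the symbols are given in canonical order with their code lengths; slots whose codes are longer than TABLE_N_BITS are never read, because of the branchCodeLeftAligned check, so they can be left alone) :

#include <stdint.h>

#define TABLE_N_BITS 3   // as in the decoder above

typedef struct { uint8_t codeLen; uint8_t symbol; } HuffTableEntry;

// codeLens[] / symbols[] are in canonical order (sorted by increasing code length)
static void build_accel_table(const uint8_t *codeLens, const uint8_t *symbols, int numSymbols,
                              HuffTableEntry table[1 << TABLE_N_BITS])
{
    uint32_t code = 0;              // current canonical code, MSB-first
    int curLen = codeLens[0];
    for (int i = 0; i < numSymbols; i++)
    {
        code <<= (codeLens[i] - curLen);   // lengthen the code when the codelen steps up
        curLen = codeLens[i];
        if (curLen <= TABLE_N_BITS)
        {
            // every table slot whose top "curLen" bits equal this code decodes to this symbol
            int shift = TABLE_N_BITS - curLen;
            for (int fill = 0; fill < (1 << shift); fill++)
            {
                table[(code << shift) + fill].codeLen = (uint8_t)curLen;
                table[(code << shift) + fill].symbol  = symbols[i];
            }
        }
        code++;                     // next canonical code of this length
    }
}

// for the example code : symbols 0..3 (lens 2,3,3,3) fill slots 000-100 ;
// slots 101,110,111 correspond to longer codes and fall through to the non-table path.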
(In particular I was curious if the slow encode speeds for level 10 & 11 were inherent. My conclusion is basically they are. Brotli gets its high compression from the crazy flexibility of Huffman context assignment and table switching, which is an inherently slow thing to encode. It's a very asymmetric format.)
Shout out for order-1 Huffman with merge tables. I'm not sure I've seen that used in production since my paper - "New Techniques in Context Modeling and Arithmetic Encoding" (PDF) (twenty years ago, ZOMG).
Highlighting the things I think are significant or interesting :
I was remembering how modern LZ's like LZMA (BitKnit, etc.) that (can) do pos&3 for literals might like bitmaps in XRGB rather than 24-bit RGB.
In XRGB, each color channel gets its own entropy coding. Also offset bottom bits works if the offsets are whole pixel steps (the off&3 will be zero). In 24-bit RGB that stuff is all mod-3 which we don't do.
(in general LZMA-class compressors fall apart a bit if the structure is not the typical 4/8/pow2)
In compressors it's generally terrible to stick extra bytes in and give the compressor more work to do. In this case we're injecting a 0 in every 4th byte, and the compressor has to figure out those are all redundant just to get back to its original size.
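(The padding itself is trivial; a minimal sketch of the pad-byte injection, just for concreteness - this is not the radbitmaptest code :)

#include <stdint.h>
#include <stddef.h>

// expand packed 24-bit RGB to 32-bit XRGB by injecting a 0 pad byte per pixel,
// so pixels (and match offsets between them) land on power-of-2 boundaries
static void rgb24_to_xrgb32(const uint8_t *rgb, uint8_t *xrgb, size_t numPixels)
{
    for (size_t i = 0; i < numPixels; i++)
    {
        xrgb[4*i + 0] = rgb[3*i + 0];
        xrgb[4*i + 1] = rgb[3*i + 1];
        xrgb[4*i + 2] = rgb[3*i + 2];
        xrgb[4*i + 3] = 0;             // the redundant pad byte the compressor must learn to ignore
    }
}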
Anyway, this is an old idea, but I don't think I ever actually tried it. So :
PDI_1200.bmp
LZNA :
24-bit RGB : LZNA : 2,760,054 -> 1,376,781
32-bit XRGB: LZNA : 3,676,818 -> 1,311,502
24-bit RGB with DPCM filter : LZNA : 2,760,054 -> 1,022,066
32-bit XRGB with DPCM filter : LZNA : 3,676,818 -> 1,015,379 (MML8 : 1,012,988)
webpll : 961,356
paq8o8 : 1,096,342
moses.bmp
24-bit RGB : LZNA : 6,580,854 -> 3,274,757
32-bit XRGB: LZNA : 8,769,618 -> 3,022,320
24-bit RGB with DPCM filter : LZNA : 6,580,854 -> 2,433,246
32-bit XRGB with DPCM filter : LZNA : 8,769,618 -> 2,372,921
webpll : 2,204,444
gralic111d : 1,822,108
other compressors :
32-bit XRGB with DPCM filter : LZA : 8,769,618 -> 2,365,661 (MML8 : 2,354,434)
24-bit RGB no filter : BitKnit : 6,580,854 -> 3,462,455
32-bit XRGB no filter : BitKnit : 8,769,618 -> 3,070,141
32-bit XRGB with DPCM filter : BitKnit : 8,769,618 -> 2,601,463
32-bit XRGB: LZNA : 8,769,618 -> 3,022,320
32-bit XRGB: LZA : 8,769,618 -> 3,009,417
24-bit RGB: LZMA : 6,580,854 -> 3,488,546 (LZMA lc=0,lp=2,pb=2)
32-bit XRGB: LZMA : 8,769,618 -> 3,141,455 (LZMA lc=0,lp=2,pb=2)
repro:
bmp copy moses.bmp moses.tga 32
V:\devel\projects\oodle\radbitmap\radbitmaptest
radbitmaptest64 rrz -z0 r:\moses.tga moses.tga.rrz -f8 -l1
Key observations :
1. On "moses" unfiltered : padding to XRGB does help a solid amount (3,274,757 to 3,022,320 for LZNA) , despite the source being 4/3 bigger. I think that proves the concept. (BitKnit & LZMA even bigger difference)
2. On filtered data, padding to XRGB still helps, but much (much) less. Presumably this is because post-filter data is just a bunch of low values, so the 24-bit RGB data is not so multiple-of-three structured (it's a lot of 0's, +1's, and -1's, less coherent, less difference between the color channels, etc.)
3. On un-filtered data, "sub" literals might be helping BitKnit (it beats LZMA on 32-bit unfiltered, and hangs with LZNA). On filtered data, the sub-literals don't help (might even hurt) and BK falls behind. We like the way sub literals sometimes act as an automatic structure stride and delta filter, but they can't compete with a real image-specific DPCM.
Now, XRGB padding is an ugly way to do this. You'd much rather stick with 24-bit RGB and have an LZ that works inherently on 3-byte items.
The first step is :
LZ that works on "items"
(eg. item = a pixel)
LZ matches (offsets and lens) are in whole items
(the more analogous to bottom-bits style would be to allow whole-items and "remainders";
that's /item and %item, and let the entropy coder handle it if remainder==0 always;
but probably best to just force remainders=0)
When you don't match (literal item)
each byte in the item gets it own entropy stats
(eg. color channels of pixels)
which maybe is useful on things other than just images.
The other step is something like :
Offset is an x,y delta instead of linear
(this replaces offset bottom bits)
could be generically useful in any kind of row/column structured data
Filtering for values with x-y neighbors
(do you do the LZ on un-filtered data, and only filter the literals?)
(or do you filter everything and do the LZ on filter residuals?)
and a lot of this is just webp-ll
records7
granny7
game7
exe7
enwik7
dds7
audio7
Note on the test :
This is running the non-Oodle compressors via my build of their lib (*). Brotli not included because it's too hard to build in MSVC (before 2010). "oohc" here is "Optimal2" level (originally posted with Optimal1 level, changed to Optimal2 for consistency with previous post).
The sorting of the labels on the right is by compressed size.
Report on total of all files :
-------------------------------------------------------
by ratio:
oohcLZNA : 2.37:1 , 2.9 enc mb/s , 125.5 dec mb/s
lzma : 2.35:1 , 2.7 enc mb/s , 37.3 dec mb/s
oohcBitKnit : 2.27:1 , 4.9 enc mb/s , 258.0 dec mb/s
lzham : 2.23:1 , 1.9 enc mb/s , 156.0 dec mb/s
oohcLZHLW : 2.16:1 , 3.4 enc mb/s , 431.9 dec mb/s
zstdmax : 1.99:1 , 4.6 enc mb/s , 457.5 dec mb/s
oohcLZNIB : 1.84:1 , 7.2 enc mb/s , 1271.4 dec mb/s
by encode speed:
oohcLZNIB : 1.84:1 , 7.2 enc mb/s , 1271.4 dec mb/s
oohcBitKnit : 2.27:1 , 4.9 enc mb/s , 258.0 dec mb/s
zstdmax : 1.99:1 , 4.6 enc mb/s , 457.5 dec mb/s
oohcLZHLW : 2.16:1 , 3.4 enc mb/s , 431.9 dec mb/s
oohcLZNA : 2.37:1 , 2.9 enc mb/s , 125.5 dec mb/s
lzma : 2.35:1 , 2.7 enc mb/s , 37.3 dec mb/s
lzham : 2.23:1 , 1.9 enc mb/s , 156.0 dec mb/s
by decode speed:
oohcLZNIB : 1.84:1 , 7.2 enc mb/s , 1271.4 dec mb/s
zstdmax : 1.99:1 , 4.6 enc mb/s , 457.5 dec mb/s
oohcLZHLW : 2.16:1 , 3.4 enc mb/s , 431.9 dec mb/s
oohcBitKnit : 2.27:1 , 4.9 enc mb/s , 258.0 dec mb/s
lzham : 2.23:1 , 1.9 enc mb/s , 156.0 dec mb/s
oohcLZNA : 2.37:1 , 2.9 enc mb/s , 125.5 dec mb/s
lzma : 2.35:1 , 2.7 enc mb/s , 37.3 dec mb/s
-------------------------------------------------------
How to for my reference :
type test_slowies_seven.bat
@REM test each one individially :
spawnm -n external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven\* -cr:\seven_csvs\@f.csv
@REM test as a set :
external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven
dele r:\compressorspeeds.*
@REM testproj compressorspeedchart
spawnm c:\src\testproj\x64\debug\TestProj.exe r:\seven_csvs\*.csv
ed r:\compressorspeeds.*
(* = I use code or libs to test speeds, never exes; I always measure speed memory->memory, single threaded, with cold caches)
The goal here is not to show the total or who does best overall (that relies on how you weight each type of file and whether you think this selection is representative of the occurrence ratios in your data), but rather to show how each compressor does on different types of data, to highlight their different strengths.
Showing compression factor (eg. N:1 , higher is better) :
run details :
ZStd is 0.5.1 at level 21 (optimal)
LZMA is 7z -mx9 -m0=lzma:d24
Brotli is bro.exe by Sportman --quality 9 --window 24 (*)
Oodle is v2.13 at -z6 (Optimal2)
All competitors run via their provided exe
Some takeaways :
Binary structured data is really where the other compressors leave a lot of room to beat them. ("granny" and "records"). The difference in sizes on all the other files is pretty meh.
BitKnit does its special thang on granny - close to LZNA but 2X faster to decode (and ~ 6X faster than LZMA). Really super space-speed. BitKnit drops down to more like LZHLW levels on the non-record files (LZNA/LZMA has a small edge on them).
I was really surprised by ZStd vs Brotli. I actually went back and double checked my CSV to make sure I hadn't switched the columns by accident. In particular - Brotli does poorly on enwik7 (huh!?) but it does pretty well on "granny", and surprisingly ZStd does quite poorly on "granny" & "records". Not what I expected at all. Brotli is surprisingly poor on text/web and surprisingly good on binary record data.
LZHLW is still an excellent choice after all these years.
(* = Brotli quality 10 takes an order of magnitude longer than any of the others. I got fed up with waiting for it. Oodle also has "super" modes at -z8 that aren't used here. (**))
(for concreteness : Brotli 11 does pretty well on granny7 ; (6.148:1 vs 4.634:1 at q9) but it runs at 68 kb/s (!!) (and still not LZMA-level compression))
(** = I used to show results in benchmarks that required really slow encoders (for example the old LZNIB optimal "super parse" was hella slow); that can result in very small sizes and great decode speed, but it's a form of cheating. Encoders slower than 1 mb/s just won't be used, they're too slow, so it's reporting a result that real users won't actually see, and that's BS. I'm trying to be more legit about this now for my own stuff. Slow encoders are still interesting for research purposes because they show what should be possible, so you can try to get that result back in a faster way. (this in fact happened with LZNIB and is a Big Deal))
2. The best case for bit input is when the length that you consume is not very variable. eg. in the Huffman case, where you consume 1-12 bits, there's a reasonably low limit. The worst case is when it has a high max and is quite random. Then you can't avoid refill checks, and they're quite unpredictable (if you do the branchy case).
3. If your refills have a large maximum, but the average is low, branchy can be faster than branchless. Because the maximum is high (eg. maybe a max of 32 bits consumed), you can only do one decode op before checking refill. Branchless will then always refill. Branchy can skip the refill if the average is low - particularly if it's predictably low.
4. If using branchy refills, try to make it predictable. An interesting idea is to use multiple bit buffers so that each
consumption spot gets its own buffer, and then can create a pattern. A very specific case is consuming a fixed number of bits.
something like :
bitbuffer
if ( random )
{
consume 4 bits from bitbuffer
if bitbuffer out -> refill
}
else
{
consume 6 bits from bitbuffer
if bitbuffer out -> refill
}
these branches (for bitbuffer refill) will be very random because of the two different sites that consume different amounts. However, this :
bitbuffer1, bitbuffer2
if ( random )
{
consume 4 bits from bitbuffer1
if bitbuffer1 out -> refill
}
else
{
consume 6 bits from bitbuffer2
if bitbuffer2 out -> refill
}
these branches for refill are now perfectly predictable in a pattern (they are taken every Nth time exactly).
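Here's a toy sketch of that pattern in C (my own illustration of the idea, not code from any real decoder) : each consume site owns its own bit buffer, so each refill branch settles into a fixed rhythm regardless of how the sites interleave.

#include <stdint.h>
#include <stddef.h>

typedef struct
{
    uint64_t bits;        // MSB-aligned bit buffer
    int      count;       // number of valid bits in "bits"
    const uint8_t *ptr;   // next input byte
} BitReader;

static void refill(BitReader *br)
{
    // branchy byte-at-a-time refill
    while (br->count <= 56)
    {
        br->bits |= (uint64_t)(*br->ptr++) << (56 - br->count);
        br->count += 8;
    }
}

static uint32_t getbits(BitReader *br, int n)
{
    uint32_t v = (uint32_t)(br->bits >> (64 - n));
    br->bits <<= n;
    br->count -= n;
    return v;
}

// site A always takes 4 bits from br1, site B always takes 6 bits from br2, so each
// reader's refill branch is taken on a fixed, predictable cadence for that site
void decode_example(BitReader *br1, BitReader *br2, uint32_t *out, size_t n, const uint8_t *flags)
{
    for (size_t i = 0; i < n; i++)
    {
        if (flags[i])
        {
            if (br1->count < 4) refill(br1);
            out[i] = getbits(br1, 4);
        }
        else
        {
            if (br2->count < 6) refill(br2);
            out[i] = getbits(br2, 6);
        }
    }
}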
5. Bit buffer work is slow, but it's "mathy". On modern processors that are typically math-starved, it can be cheap *if* you have enough ILP to fully use all the execution units. The problem is a single bit buffer on its own is super serial work, so you need multiple bit buffers running simultaneously, or enough other work.
For example, it can actually be *faster* than byte-aligned input (using something like "EncodeMod") if the byte-input does a branch, and that branch is unpredictable (in the bad 25-75% randomly taken range).
1. SIMD processing of control words.
All LZ-Bytewises do a little bit of shifts and masks to pull out fields and flags from the control word. Stuff like lrl = (control>>4) and numbytesout = lrl+ml;
This work is pretty trivial, and it's fast already in scalar. But if you can do it N at a time, why not.
A particular advantage here is that SSE instruction sets are somewhat better at branchless code than scalar - it's a bit easier to make masks from conditions and such-like - so that can be a win. It also helps if you're front-end-bound, since decoding one instruction to do an 8-wide shift is less work than 8 instructions. (It's almost impossible for a data compressor to be back-end bound on simple integer math ops, there are just so many execution units; that's rare. It's far more likely to hit instruction decode limits.)
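For example, pulling the fields out of 16 byte-sized control words at once might look something like this (a minimal SSE2 sketch of the idea, not LZSSE's actual code) :

#include <emmintrin.h>  // SSE2

// extract lrl = control>>4 and ml = control&0xF for 16 control bytes at once,
// plus numbytesout = lrl + ml , all without any branches
static void controls_x16(const unsigned char *controls,
                         __m128i *lrl, __m128i *ml, __m128i *numbytesout)
{
    __m128i c   = _mm_loadu_si128((const __m128i *)controls);
    __m128i low = _mm_set1_epi8(0x0F);
    // there is no 8-bit shift in SSE2, so shift 16-bit lanes and mask off the bits
    // that crossed over from the neighboring byte :
    *lrl = _mm_and_si128(_mm_srli_epi16(c, 4), low);
    *ml  = _mm_and_si128(c, low);
    *numbytesout = _mm_add_epi8(*lrl, *ml);
}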
2. Using SSE in scalar code to splat out match or LRL.
LZSSE parses the control words SIMD (wide) but the actual literal or match copy is scalar, in the sense that only one is done at a time. It still uses SSE to fetch those bytes, but in a scalar way. Most LZ's can do this (many may do it already without being aware of it; eg. if you use memcpy(,16) you might be doing an SSE splat).
3. Limited LRL and ML in control word with no excess. Outer looping on control words only, no looping on LRL/ML.
To output long LRL's, you have to output a series of control words, each with short LRL. To output long ML's, you have to output a series of control words.
This I think is the biggest difference in LZSSE vs. something like LZ4. You can make an LZ4 variant that works like this, and in fact it's an interesting thing to do, and is sometimes fast. In an LZ4 that does strictly alternating LRL-ML, to do this you need to be able to send ML==0 so that long literal runs can be continued as a sequence of control words.
Traditional LZ4 decoder :
{
lrl = control>>4;
ml = (control&0xF)+4;
off = get 2 bytes; comp += 2;
// get excess if flagged with 0xF in control :
if ( lrl == 0xF ) lrl += *comp++; // and maybe more
if ( ml == 19 ) ml += *comp++; // and maybe more
copy(out,comp,lrl); // <- may loop on lrl
out += lrl; comp += lrl;
copy(out,out-off,ml); // <- may loop on ml
out += ml;
}
non-looping LZ4 decoder : (LZSSE style)
{
lrl = control>>4;
ml = control&0xF; // <- no +4 , 0 possible
off = get 2 bytes; comp += 2; // <- * see below
// no excess
copy(out,comp,16); // <- unconditional 16 byte copy, no loop
out += lrl; comp += lrl;
copy(out,out-off,16); // <- unconditional 16 byte copy, no loop
out += ml;
}
(* = the big complication in LZSSE comes from trying to avoid sending the offset again when you're continuing a match;
something like if previous control word ml == 0xF that means a continuation so don't get offset)
(ignoring the issue of overlapping matches for now)
This non-looping decoder is much less branchy, no branches for excess lens, no branches for looping copies. It's much faster than LZ4 *if* the data doesn't have long LRL's or ML's in it.
4. Flagged match/LRL instead of strictly alternating LRL-ML. This is probably a win on data with lots of short matches, where matches often follow matches with no LRL in between, like text.
If you have to branch for that flag, it's a pretty huge speed hit (see, eg. LZNIB). So it's only viable in a fast LZ-Bytewise if you can do it branchless like LZSSE.
(LZSSE Latest commit c22a696 ; fetched 03/06/2016 ; test machine Core i7-3770 3.4 GHz ; built MSVC 2012 x64 ; LZSSE2 and 8 optimal parse level 16)
Basically LZSSE is in fact great on text, faster than LZ4 and much better compression.
On binary, LZSSE2 is quite bad, but LZSSE8 is roughly on par with LZ4. It looks like LZ4 is maybe slightly better on binary than LZSSE8, but it's close.
In general, LZ4 does well on files that tend to have long LRL's and long ML's. Files with lots of short (or zero) LRL's and short ML's are bad for LZ4 (eg. text) and not bad for LZSSE.
(LZB16 is Oodle's LZ4 variant; 64k window like LZSSE; LZNIB and LZBLW have large windows)
Some results :
enwik8 LZSSE2 : 100,000,000 ->38,068,528 : 2866.17 mb/s
enwik8 LZSSE8 : 100,000,000 ->38,721,328 : 2906.29 mb/s
enwik8 LZB16 : 100,000,000 ->43,054,201 : 2115.25 mb/s
(LZSSE kills on text)
lzt99 LZSSE2 : 24,700,820 ->15,793,708 : 1751.36 mb/s
lzt99 LZSSE8 : 24,700,820 ->15,190,395 : 2971.34 mb/s
lzt99 LZB16 : 24,700,820 ->14,754,643 : 3104.96 mb/s
(LZSSE2 really slows down on heterogenous binary file lzt99)
(LZSSE8 does okay, but slightly worse than LZ4/LZB16 in size & speed)
mozilla LZSSE2: 51,220,480 ->22,474,508 : 2424.21 mb/s
mozilla LZSSE8: 51,220,480 ->22,148,366 : 3008.33 mb/s
mozilla LZB16 : 51,220,480 ->22,337,815 : 2433.78 mb/s
(all about the same size on silesia mozilla)
(LZSSE8 definitely fastest)
lzt24 LZB16 : 3,471,552 -> 2,379,133 : 4435.98 mb/s
lzt24 LZSSE8 : 3,471,552 -> 2,444,527 : 4006.24 mb/s
lzt24 LZSSE2 : 3,471,552 -> 2,742,546 : 1605.62 mb/s
lzt24 LZNIB : 3,471,552 -> 1,673,034 : 1540.25 mb/s
(lzt24 (a granny file) really terrible for LZSSE2; it's as slow as LZNIB)
(LZSSE8 fixes it though, almost catches LZB16, but not quite)
------------------
Some more binary files. LZSSE2 is not good on any of these, so omitted.
win81 LZB16 : 104,857,600 ->54,459,677 : 2463.37 mb/s
win81 LZSSE8 : 104,857,600 ->54,911,633 : 3182.21 mb/s
all_dds LZB16 : 79,993,099 ->47,683,003 : 2577.24 mb/s
all_dds LZSSE8: 79,993,099 ->47,807,041 : 2607.63 mb/s
AOW3_Skin_Giants.clb
LZB16 : 7,105,158 -> 3,498,306 : 3350.06 mb/s
LZSSE8 : 7,105,158 -> 3,612,433 : 3548.39 mb/s
baby_robot_shell.gr2
LZB16 : 58,788,904 ->32,862,033 : 2968.36 mb/s
LZSSE8 : 58,788,904 ->33,201,406 : 2642.94 mb/s
LZSSE8 vs LZB16 is pretty close.
LZSSE8 is maybe more consistently fast; its decode speed has less variation than LZ4. Slowest LZSSE8 was all_dds at 2607 mb/s ; LZ4 went down to 2115 mb/s on enwik8. Even excluding text, it was down to 2433 mb/s on mozilla. LZB16/LZ4 had a slightly higher max speed (on lzt24).
Conclusion :
On binary-like data, LZ4 and LZSSE8 are pretty close. On text-like data, LZSSE8 is definitely better. So for general data, it looks like LZSSE8 is a definite win.
Some good stuff.
Basically this is a nibble control word LZ (like LZNIB). The nibble has a threshold value T, < T is an LRL (literal run len), >= T is a match length. LZSSET are various threshold variants. As Conor noted, ideally T would be variable, optimized per file (or even better - per quantum) to adapt to different data better.
LZSSE has a 64k window (like LZ4/LZB16) but unlike them supports MML (minimum match length) of 3. MML 3 typically helps compression a little, but in scalar decoders it really hurts speed.
I think the main interesting idea (other than implementation details) is that by limiting the LRL and ML, with no excess/overflow support (ML overflow is handled with continue-match nibbles), you can do a non-looping output of 8/16 bytes. You get long matches or LRL's by reading more control nibbles.
That is, a normal LZ actually has a nested looping structure :
loop on controls from packed stream
{
control specifies lrl/ml
loop on lrl/ml
{
output bytes
}
}
LZSSE only has *one* outer loop on controls.
There are some implementation problems at the moment. The LZSSE2 optimal parse encoder is just broken. It's unusably slow and must have some bad N^2 degeneracy. This can be fixed, it's not a problem with the format.
Another problem is that LZSSE2 expands incompressible data too much. Real world data (particularly in games) often has incompressible data mixed with compressible. The ideal fix would be to have the generalized LZSSET and choose T per quantum. A simpler fix would be to do something like cut files into 16k or 64k quanta, and to select the best of LZSSE2/4/8 per-quantum and also support uncompressed quanta to prevent expansion.
I will take this moment to complain that the test sets everyone is using are really shit. Not Conor's fault, but enwiks and Silesia are grossly not at all representative of data that we see in the real world. Silesia is mostly text and weird highly-compressible data; the file I like best in there for my own testing is "mozilla" (though BTW mozilla also contains a bunch of internal zlib streams; it benefits enormously from precomp). We need a better test corpus!!!
string_match_stress_tests.7z (60,832 bytes)
Consists of :
paper1_twice
stress_all_as
stress_many_matches
stress_search_limit
stress_sliding_follow
stress_suffix_forward
An optimal parse matcher (matching at every position in each file against all previous bytes within that file) should get these average
match lengths :
(min match length of 4, and no matches searched for in the last 8 bytes of each file)
paper1_twice : 13294.229727
stress_all_as : 21119.499148
stress_many_matches : 32.757760
stress_search_limit : 823.341331
stress_sliding_follow : 199.576550
stress_suffix_forward : 5199.164464
total ml : 2896554306
total bytes : 483870
Previous post on the same test set : 09-27-11 - String Match Stress Test
And these were used in the String Match Test post series , though there I used "twobooks" instead of "paper1_twice".
These stress tests are designed to make imperfect string matchers show their flaws. Correct implementations of Suffix Array or Suffix Tree searchers should find this total match length without ever going into bad N^2 slowdowns (their speed should be roughly constant). Other matchers like hash-link, LzFind (hash-btree) and MMC will either find lower total match length (due to an "amortize" step limit) or will fall into bad N^2 (or worse!) slowdowns.
1. Oodle Network speed is very cache sensitive.
Oodle Network uses a shared trained model. This is typically 4 - 8 MB. As it compresses or decompresses, it needs to access random bits of that memory.
If you compress/decompress a packet when that model is cold (not in cache), every access will be a cache miss and performance can be quite poor.
In synthetic test, coding packets over and over, the model is as hot as possible (in caches). So performance can seem better in synthetic test loops than in the real world.
In real use, it's best to batch up all encoding/decoding operations as much as possible.
Rather than do :
decode one packet
apply packet to world
do some other stuff
decode one packet
apply packet to world
do some other stuff
...
try to group all the Oodle Network encoding & decoding together :
gather up all my packets to send
receive all packets from network stack
encode all my outbound packets
decode all my inbound packets
now act on inbound packets
this puts all the usage of the shared model together as close as possible to try to maximize the
amount that the model is found in cache.
2. Oodle Network should not be used on already compressed data. Oodle Network should not be used on large packets.
Most games send pre-compressed data of various forms. Some send media files such as JPEGs that are already compressed. Some send big blobs that have been packed with zlib. Some send audio data that's already been compressed.
This data should be excluded from the Oodle Network path and sent without going through the compressor. Oodle Network won't get any compression on it and will just burn CPU time. (You could send it as a packet with complen == rawlen, which is a flag for "raw data" in Oodle Network.)
More importantly, these packets should NOT be included in the training set for building the model. They are essentially random bytes and will just crud up the model. It's a bit like if you're trying to memorize the digits of Pi and someone keeps yelling random numbers in your ear. (Well, actually it's not like that at all, but those kind of totally bullshit analogies seem very popular, so there you are.)
On large packets that are not precompressed, Oodle Network will work, but it's just not the best choice. It's almost always better to use an Oodle LZ data compressor (BitKnit, LZNIB, whatever, depending on your space-speed tradeoff desired).
The vast majority of games have a kind of bipolar packet distribution :
A. normal frame update packets < 1024 bytes
B. occasional very large packets > 4096 bytes
it will work better to only use Oodle Network on the type A packets (smaller, standard updates) and to
use Oodle LZ on the type B packets (rarer, large data transfers).
For example some games send the entire state of the level in the first few packets, and then afterward send only deltas from that state. In that style, the initial big level dump should be sent through Oodle LZ, and then only the smaller deltas go through Oodle Network.
Not only will Oodle LZ do better on the big packets, but by excluding them from the training set for Oodle Network, the smaller packets will be compressed better because the data will all have similar structure.
Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\testsets\big\lzt99
got arg : num_repeats=5
lz test loading: r:\testsets\big\lzt99
uncompressed size : 24700820
---------------------------------------------------------------
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
---------------------------------------------------------------
| VeryFast | Fast | Normal | Optimal1 |
LZB16 |1.51:517:2988|1.57:236:2971|1.62:109:2964|1.65: 37:3003|
LZBLW |1.64:249:2732|1.74: 80:2682|1.77: 24:2679|1.85:1.6:2708|
LZNIB |1.80:264:1627|1.92: 70:1557|1.94: 23:1504|2.04: 12:1401|
LZHLW |2.16: 67: 424|2.30: 20: 447|2.33:7.2: 445|2.35:5.4: 445|
BitKnit|2.43: 28: 243|2.47: 20: 245|2.50: 13: 249|2.54:6.4: 249|
LZNA |2.36: 24: 115|2.54: 18: 119|2.58: 13: 120|2.69:4.9: 120|
---------------------------------------------------------------
compression ratio:
| VeryFast | Fast | Normal | Optimal1 |
LZB16 | 1.510 | 1.569 | 1.615 | 1.654 |
LZBLW | 1.636 | 1.739 | 1.775 | 1.850 |
LZNIB | 1.802 | 1.921 | 1.941 | 2.044 |
LZHLW | 2.161 | 2.299 | 2.330 | 2.355 |
BitKnit| 2.431 | 2.471 | 2.499 | 2.536 |
LZNA | 2.363 | 2.542 | 2.584 | 2.686 |
---------------------------------------------------------------
encode speed (mb/s):
| VeryFast | Fast | Normal | Optimal1 |
LZB16 | 517.317 | 236.094 | 108.555 | 36.578 |
LZBLW | 248.537 | 80.299 | 23.663 | 1.610 |
LZNIB | 263.950 | 69.930 | 22.617 | 11.735 |
LZHLW | 67.154 | 20.019 | 7.200 | 5.425 |
BitKnit| 28.203 | 20.223 | 12.672 | 6.371 |
LZNA | 24.192 | 18.423 | 12.883 | 4.907 |
---------------------------------------------------------------
decode speed (mb/s):
| VeryFast | Fast | Normal | Optimal1 |
LZB16 | 2988.429 | 2971.339 | 2963.616 | 3003.187 |
LZBLW | 2731.951 | 2681.796 | 2678.558 | 2707.534 |
LZNIB | 1626.806 | 1557.309 | 1504.097 | 1400.654 |
LZHLW | 423.936 | 446.990 | 444.832 | 445.040 |
BitKnit| 242.916 | 245.409 | 248.812 | 248.972 |
LZNA | 114.791 | 119.369 | 119.994 | 120.362 |
---------------------------------------------------------------
Another test :
Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\game_testset_m0.7z
got arg : num_repeats=5
lz test loading: r:\game_testset_m0.7z
uncompressed size : 79290970
---------------------------------------------------------------
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
---------------------------------------------------------------
| VeryFast | Fast | Normal | Optimal1 | Optimal2 |
LZB16 |1.4:1039:4304|1.41:438:4176|1.42:184:4202|1.44: 52:4293|1.44:4.5:4407|
LZBLW |1.51:380:3855|1.55:124:3778|1.56: 26:3774|1.62:1.0:3862|1.62:1.0:3862|
LZNIB |1.56:346:2406|1.59: 84:2398|1.62: 24:2054|1.67: 15:2048|1.67: 10:2053|
LZHLW |1.67: 85: 647|1.74: 25: 679|1.75:6.5: 635|1.77:3.3: 613|1.79:1.5: 618|
BitKnit|1.83: 24: 395|1.90: 18: 409|1.90: 12: 408|1.91:7.1: 402|1.91:6.5: 401|
LZNA |1.78: 22: 171|1.84: 18: 178|1.88: 12: 185|1.93:5.6: 167|1.93:1.5: 167|
---------------------------------------------------------------
compression ratio:
| VeryFast | Fast | Normal | Optimal1 | Optimal2 |
LZB16 | 1.390 | 1.408 | 1.424 | 1.436 | 1.442 |
LZBLW | 1.509 | 1.548 | 1.558 | 1.615 | 1.615 |
LZNIB | 1.557 | 1.593 | 1.622 | 1.669 | 1.668 |
LZHLW | 1.669 | 1.745 | 1.754 | 1.767 | 1.790 |
BitKnit| 1.825 | 1.897 | 1.905 | 1.913 | 1.915 |
LZNA | 1.781 | 1.838 | 1.878 | 1.927 | 1.932 |
---------------------------------------------------------------
encode speed (mb/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal2 |
LZB16 | 1038.910 | 437.928 | 184.457 | 52.008 | 4.465 |
LZBLW | 380.030 | 123.621 | 26.028 | 0.973 | 0.973 |
LZNIB | 345.905 | 83.577 | 24.299 | 14.544 | 10.444 |
LZHLW | 84.519 | 25.218 | 6.542 | 3.256 | 1.547 |
BitKnit| 24.116 | 17.944 | 12.476 | 7.052 | 6.464 |
LZNA | 21.859 | 18.034 | 11.767 | 5.602 | 1.465 |
---------------------------------------------------------------
decode speed (mb/s):
| VeryFast | Fast | Normal | Optimal1 | Optimal2 |
LZB16 | 4304.144 | 4175.854 | 4202.491 | 4292.925 | 4406.853 |
LZBLW | 3855.255 | 3777.826 | 3774.093 | 3861.922 | 3861.582 |
LZNIB | 2406.379 | 2397.753 | 2054.429 | 2048.329 | 2053.340 |
LZHLW | 646.796 | 679.173 | 635.035 | 613.051 | 617.994 |
BitKnit| 394.599 | 408.539 | 408.044 | 402.239 | 401.352 |
LZNA | 171.111 | 177.565 | 184.677 | 167.439 | 166.904 |
---------------------------------------------------------------
vs LZMA :
ratio: 1.901
enc : 2.70 mb/s
dec : 30.27 mb/s
On this file, BitKnit is 13X faster to decode than LZMA, and gets more compression.
(or at "Normal" level, the ratio is similar and BitKnit is 4.6X faster to encode).
Usage points for me :
0. My god, adaptive coding is sweet. I've been doing static Huffman and TANS for a while, so I sort of got used to them, and I forgot how nice it is to not have to deal with that shit. (optimal parsing with static entropy coders has the horrible feedback loop and iterations required, you have to find optimal chunking/transmission points, you have to tweak out your codelen/probability transmission to send them compactly and quickly, blah blah). In comparison, adaptive coding is just so simple. It's literally 10X fewer lines of code.
1. For binary coding, RANS is no win over arithmetic. (Jarek calls "binary ANS" "ABS" but I see no need for another acronym; let's just say "binary ANS"). (and given the option, you'd rather have the FIFO arithmetic coding)
2. For multi-symbol, power-of-2 cumulative probability sum, adaptive RANS is really good.
3. If your model is really complex, like an N-ary Fenwick tree or anything crazy, or if you have to do binary search in cumprobs to do decoding, the difference between RANS and arithmetic can be hidden.
To make adaptive RANS really shine, you need a model specialized for power-of-2 totals, such as the classic "deferred summation" or the new nibble model (cumprob blending), or other. It's only for models that are quite fast that the speed difference between RANS and arithmetic becomes dramatic.
What I'm trying to get at is if you just take an existing compressor based on arithmetic coding, which probably uses binary arithmetic coding, or a rather complex N-ary coder, and just replace the arithmetic part with ANS - you might not see much benefit at all. There are three big stalls - divides, cache misses, branches - and they can hide each other, so if you just eliminate one of the three stalls, it doesn't help much.
ADD : 4. A lot of the recent work (TANS, Yann's Huff work, some of the RANS encoders, etc.) that are very fast also use rather a lot of memory. They make use of tables to speed up coding. That's fine for order-0 coding, or if you have very few contexts (perhaps 2 pos bits), but for order-1 coding or larger contexts, it becomes a problem. Of course your memory use becomes high, but the bigger problem is that your tables no longer fit in cache. Fast table-based coding doesn't make any sense if accessing the tables is a cache miss.
Fabian's BitKnit is coming to Oodle. BitKnit is a pretty unique LZ; it makes clever use of the properties of RANS to hit a space-speed tradeoff point that nothing else does. It gets close to LZMA compression levels (sometimes more, sometimes less) while being more like zlib speed.
LZNA and LZNIB are also much improved. The bit streams are the same, but we found some little tweaks in the encoders & decoders that make significant difference. (5-10%, but that's a lot in compression, and they were already world-beating, so the margin is just bigger now). The biggest improvement came from some subtle issues in the parsers.
As usual, I'm trying to be as fair as possible to the competition. Everything is run single threaded. LZMA and LZHAM are run at max compression with context bits at their best setting. Compressors like zlib that are just not even worth considering are not included, I've tried to include the strongest competition that I know of now. This is my test of "slowies" , that is, all compressors set at high (not max) compression levels. ("oohc" is Oodle Optimal1 , my compression actually goes up quite a bit at higher levels, but I consider anything below 2 mb/s to encode to be just too slow to even consider).
The raw data : ("game test set")
by ratio:
oohcLZNA : 2.88:1 , 5.3 enc mb/s , 135.0 dec mb/s
lzma : 2.82:1 , 2.9 enc mb/s , 43.0 dec mb/s
oohcBitKnit : 2.76:1 , 6.4 enc mb/s , 273.3 dec mb/s
lzham : 2.59:1 , 1.8 enc mb/s , 162.9 dec mb/s
oohcLZHLW : 2.38:1 , 4.2 enc mb/s , 456.3 dec mb/s
zstdhc9 : 2.11:1 , 29.5 enc mb/s , 558.0 dec mb/s
oohcLZNIB : 2.04:1 , 11.5 enc mb/s , 1316.4 dec mb/s
by encode speed:
zstdhc9 : 2.11:1 , 29.5 enc mb/s , 558.0 dec mb/s
oohcLZNIB : 2.04:1 , 11.5 enc mb/s , 1316.4 dec mb/s
oohcBitKnit : 2.76:1 , 6.4 enc mb/s , 273.3 dec mb/s
oohcLZNA : 2.88:1 , 5.3 enc mb/s , 135.0 dec mb/s
oohcLZHLW : 2.38:1 , 4.2 enc mb/s , 456.3 dec mb/s
lzma : 2.82:1 , 2.9 enc mb/s , 43.0 dec mb/s
lzham : 2.59:1 , 1.8 enc mb/s , 162.9 dec mb/s
by decode speed:
oohcLZNIB : 2.04:1 , 11.5 enc mb/s , 1316.4 dec mb/s
zstdhc9 : 2.11:1 , 29.5 enc mb/s , 558.0 dec mb/s
oohcLZHLW : 2.38:1 , 4.2 enc mb/s , 456.3 dec mb/s
oohcBitKnit : 2.76:1 , 6.4 enc mb/s , 273.3 dec mb/s
lzham : 2.59:1 , 1.8 enc mb/s , 162.9 dec mb/s
oohcLZNA : 2.88:1 , 5.3 enc mb/s , 135.0 dec mb/s
lzma : 2.82:1 , 2.9 enc mb/s , 43.0 dec mb/s
-----------------------------------------------------------------
Log opened : Fri Dec 18 17:56:44 2015
total : oohcLZNIB : 167,495,105 ->81,928,287 = 3.913 bpb = 2.044 to 1
total : encode : 14.521 seconds, 3.39 b/kc, rate= 11.53 M/s
total : decode : 0.127 seconds, 386.85 b/kc, rate= 1316.44 M/s
total : encode+decode : 14.648 seconds, 3.36 b/kc, rate= 11.43 M/s
total : oohcLZHLW : 167,495,105 ->70,449,624 = 3.365 bpb = 2.378 to 1
total : encode : 40.294 seconds, 1.22 b/kc, rate= 4.16 M/s
total : decode : 0.367 seconds, 134.10 b/kc, rate= 456.33 M/s
total : encode+decode : 40.661 seconds, 1.21 b/kc, rate= 4.12 M/s
total : oohcLZNA : 167,495,105 ->58,242,995 = 2.782 bpb = 2.876 to 1
total : encode : 31.867 seconds, 1.54 b/kc, rate= 5.26 M/s
total : decode : 1.240 seconds, 39.68 b/kc, rate= 135.04 M/s
total : encode+decode : 33.107 seconds, 1.49 b/kc, rate= 5.06 M/s
total : oohcBitKnit : 167,495,105 ->60,763,350 = 2.902 bpb = 2.757 to 1
total : encode : 26.102 seconds, 1.89 b/kc, rate= 6.42 M/s
total : decode : 0.613 seconds, 80.33 b/kc, rate= 273.35 M/s
total : encode+decode : 26.714 seconds, 1.84 b/kc, rate= 6.27 M/s
total : zstdhc9 : 167,495,105 ->79,540,333 = 3.799 bpb = 2.106 to 1
total : encode : 5.671 seconds, 8.68 b/kc, rate= 29.53 M/s
total : decode : 0.300 seconds, 163.98 b/kc, rate= 558.04 M/s
total : encode+decode : 5.971 seconds, 8.24 b/kc, rate= 28.05 M/s
total : lzham : 167,495,105 ->64,682,721 = 3.089 bpb = 2.589 to 1
total : encode : 93.182 seconds, 0.53 b/kc, rate= 1.80 M/s
total : decode : 1.028 seconds, 47.86 b/kc, rate= 162.86 M/s
total : encode+decode : 94.211 seconds, 0.52 b/kc, rate= 1.78 M/s
total : lzma : 167,495,105 ->59,300,023 = 2.832 bpb = 2.825 to 1
total : encode : 57.712 seconds, 0.85 b/kc, rate= 2.90 M/s
total : decode : 3.898 seconds, 12.63 b/kc, rate= 42.97 M/s
total : encode+decode : 61.610 seconds, 0.80 b/kc, rate= 2.72 M/s
-------------------------------------------------------
EncodeMod is just the idea that you send each token (byte, word, nibble, whatever) with two ranges; in one range the values are terminal (no more tokens), while in the other range it means "this is part of the value" but more tokens follow. You can then optimize the division point for a wide range of applications.
In my original pseudo-code I was writing the ranges with the "more tokens follow" range at the bottom, and terminal values at the top. That is, specifically for the case of byte tokens and pow2 mod :

mod = 1<<bits
in each token we send "bits" of values that don't currently fit
upper = 256 - mod
"upper" is the number of terminal values we can send in the current token

I was writing :

[0,mod) = bits of value + more tokens follow
[mod,256) = terminal value

Fabian spotted that the code is slightly simpler if you switch the ranges. Use the low range
[0,upper) for terminal values and [upper,256) for non-terminal values. The ranges are the same, so you
get the same encoded lengths.
(BTW it also occurred to me when learning about ANS that EncodeMod is reminiscent of simple ANS. You're trying to send a bit - "do more bytes follow". You're putting that bit in a token, and you have some extra information you can send with that bit - so just put some of your value in there. The number of slots for bit=0 and 1 should correspond to the probability of each event.)
The switched encodemod is :
U8 *encmod(U8 *to, int val, int bits)
{
const int upper = 256 - (1<<bits); // binary, this is 1110000 or similar (8-bits ones, bits zeros)
while (val >= upper)
{
*to++ = (U8) (upper | val);
val = (val - upper) >> bits;
}
*to++ = (U8) val;
return to;
}
const U8 *decmod(int *outval, const U8 *from, int bits)
{
const int upper = 256 - (1<<bits);
int shift = 0;
int val = 0;
for (;;)
{
int byte = *from++;
val += byte << shift;
if (byte < upper)
break;
shift += bits;
}
*outval = val;
return from;
}
The simplification of the encoder here :

*to++ = (U8) (upper | val);
val = (val - upper) >> bits;

written in long-hand is :

low = val & ((1<<bits)-1);
*to++ = upper + low; // (same as upper | low, same as upper | val)
val -= upper;
val >>= bits;

or

val -= upper;
low = val & ((1<<bits)-1);
*to++ = upper + low; // (same as upper | low, same as upper | val)
val >>= bits;

and the val -= upper can be done early or late; because val >= upper, it doesn't touch "low".

Basically by using "upper" like this, the mask of low bits and add of upper is done in one op.
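For concreteness, here's a hypothetical round trip through the switched encmod/decmod above (bits = 4, so upper = 240); just a sketch for illustration, not from the original code :

U8 buf[16];
U8 * end = encmod(buf, 1000, 4); // writes 248 (= 240 | low nibble 8), then 47 ; end == buf+2
int out;
decmod(&out, buf, 4); // accumulates 248 + (47<<4) = 1000 ; stops because 47 < 240
// out == 1000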
Background : 64-bit mode. 12-bit lookahead table, and 12-bit codelen limit, so there's no out-of-table case to handle.
Here's conditional bit buffer refill, 32-bits refilled at a time, aligned refill.
Always >= 32 bits in buffer so you can do two decode ops per refill :
loop
{
uint64 peek; int cl,sym;
peek = decode_bits >> (64 - CODELEN_LIMIT);
cl = codelens[peek];
sym = symbols[peek];
decode_bits <<= cl; thirtytwo_minus_decode_bitcount += cl;
*decodeptr++ = (uint8)sym;
peek = decode_bits >> (64 - CODELEN_LIMIT);
cl = codelens[peek];
sym = symbols[peek];
decode_bits <<= cl; thirtytwo_minus_decode_bitcount += cl;
*decodeptr++ = (uint8)sym;
if ( thirtytwo_minus_decode_bitcount > 0 )
{
uint64 next = _byteswap_ulong(*decode_in++);
decode_bits |= next << thirtytwo_minus_decode_bitcount;
thirtytwo_minus_decode_bitcount -= 32;
}
}
325 mb/s.
(note that removing the bswap to have a little-endian u32 stream does almost nothing for performance, less than 1 mb/s)
The next option is : branchless refill, unaligned 64-bit refill. You always have >= 56 bits in buffer, now you can do 4 decode ops per
refill :
loop
{
// refill :
uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
bits |= next >> bitcount;
int bytes_consumed = (64 - bitcount)>>3;
decode_in += bytes_consumed;
bitcount += bytes_consumed<<3;
uint64 peek; int cl; int sym;
#define DECONE() \
peek = bits >> (64 - CODELEN_LIMIT); \
cl = codelens[peek]; sym = symbols[peek]; \
bits <<= cl; bitcount -= cl; \
*decodeptr++ = (uint8) sym;
DECONE();
DECONE();
DECONE();
DECONE();
#undef DECONE
}
373 mb/s
These so far have both been "traditional Huffman" decoders. That is, they use the next 12 bits from the bit buffer to look up the Huffman decode table, and they stream bits into that bit buffer.
There's another option, which is "ANS style" decoding. To do "ANS style" you keep the 12-bit "peek" as a separate variable, and you stream bits from the bit buffer into the peek variable. Then you don't need to do any masking or shifting to extract the peek.
The naive "ANS style" decode looks like this :
loop
{
// refill bits :
uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
bits |= next >> bitcount;
int bytes_consumed = (64 - bitcount)>>3;
decode_in += bytes_consumed;
bitcount += bytes_consumed<<3;
int cl; int sym;
#define DECONE() \
cl = codelens[state]; sym = symbols[state]; \
state = ((state << cl) | (bits >> (64 - cl))) & ((1 << CODELEN_LIMIT)-1); \
bits <<= cl; bitcount -= cl; \
*decodeptr++ = (uint8) sym;
DECONE();
DECONE();
DECONE();
DECONE();
#undef DECONE
}
332 mb/s
But we can use an analogy to the "next_state" of ANS. In ANS, the next_state is a complex thing with
certain rules (as we covered in the past). With Huffman it's just this bit of math :
next_state[state] = (state << cl) & ((1 << CODELEN_LIMIT)-1);
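Building that table is just a loop over all peek values (a sketch; codelens[] is the same 12-bit table used by the decoders above) :

for (int state = 0; state < (1 << CODELEN_LIMIT); state++)
{
int cl = codelens[state];
next_state_table[state] = (state << cl) & ((1 << CODELEN_LIMIT)-1);
}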
So we can build that table, and use a "fully ANS" decoder :
loop
{
// refill bits :
uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
bits |= next >> bitcount;
int bytes_consumed = (64 - bitcount)>>3;
decode_in += bytes_consumed;
bitcount += bytes_consumed<<3;
int cl; int sym;
#define DECONE() \
cl = codelens[state]; sym = symbols[state]; \
state = next_state_table[state] | (bits >> (64 - cl)); \
bits <<= cl; bitcount -= cl; \
*decodeptr++ = (uint8) sym;
DECONE();
DECONE();
DECONE();
DECONE();
#undef DECONE
}
415 mb/s
Fastest! It seems the fastest Huffman decoder is a TANS decoder. (*1)
(*1 = well, on this machine anyway; these are all so close that architecture and exact usage matter massively; in particular we're relying heavily on fast unaligned reads, and doing four unrolled decodes in a row isn't always useful)
Note that this is a complete TANS decoder save one small detail - in TANS the "codelen" (previously called
"numbits" in my TANS code) can be 0. The part where you do :
(bits >> (64 - cl))
can't be used if cl can be 0. In TANS you either have to check for zero, or you have to use the method of
((bits >> 1) >> (63 - cl))
which makes TANS a tiny bit slower - 370 mb/s for TANS on the same file on my machine.
(all times reported are non-interleaved, and without table build time; Huffman is definitely faster to build tables, and faster to decode packed/transmitted codelens as well)
NOTE : earlier version of this post had a mistake in bitcount update and worse timings.
Some tiny caveats :
1. The TANS way means you can't (easily) mix different peek amounts. Say you're doing an LZ, you might want an 11-bit peek for literals, but for the 4 bottom bits you only need an 8-bit peek. The TANS state has the # of bits to peek baked in, so you can't just use that. With the normal bit-buffer style Huffman decoders you can peek any # of bits you want. (though you could just do the multi-state interleave thing here, keeping with the TANS style).
2. Doing Huffman decodes without a strict codelen limit the TANS way is much uglier. With the bits-at-top bitbuffer method there are nice ways to do that.
3. Getting raw bits the TANS way is a bit uglier. Say you want to grab 16 raw bits; you could get 12 from the "state" and then 4 more from the bit buffer. Or just get 16 directly from the bit buffer which means they need to be sent after the next 12 bits of Huffman in a weird TANS interleave style. This is solvable but ugly.
4. For the rare special case of an 8 or 16-bit peek-ahead, you can do even faster than the TANS style by using a normal bit buffer with the next bits at bottom. (either little endian or big-endian but rotated around). This lets you grab the peek just by using "al" on x86.
X. People will just copy-paste your example code.
This is obvious but is something to keep in mind. Example code should never be sketches. It should be production ready. People will not read the comments. I had lots of spots in example code where I would write comments like "this is just a sketch and not ready for production; production code needs to check error returns and handle failures and be endian-independent" etc.. and of course people just copy-pasted it and didn't change it. That's not their fault, that's my fault. Example code is one of the main ways people get into your library.
X. People will not read the docs.
Docs are almost useless. Nobody reads them. They'll read a one page quick start, and then they want to just start digging in writing code. Keep the intros very minimal and very focused on getting things working.
Also be aware that if you feel you need to write a lot of docs about something, that's a sign that maybe things are too complicated.
X. Peripheral helper features should be cut.
Cut cut cut. People don't need them. I don't care how nice they are, how proud of them you are. Pare down mercilessly. More features just confuse and crud things up. This is like what a good writer should do. Figure out what your one core function really is and cut down to that.
If you feel that you really need to include your cute helpers, put them off on the side, or put them in example code. Or even just keep them in your pocket at home so that when someone asks about "how I do this" you can email them out that code.
But really just cut them. Being broad is not good. You want to be very narrow. Solve one clearly defined problem and solve it well. Nobody wants a kitchen sink library.
X. Simplicity is better.
Make everything as simple as possible. Fewer arguments on your functions. Remove extra functions. Cut everywhere. If you sacrifice a tiny bit of possible efficiency, or lose some rare functionality, that's fine. Cut cut cut.
For example, plugging in an allocator for Oodle used to require 7 function pointers : { Malloc, Free, MallocAligned, FreeSized, MallocPage, FreePage, PageSize }. (FreeSized for efficiency, and the Page stuff because async IO needs page alignment). It's now down to just 2 : { MallocAligned, Free }. Yes it's a tiny bit slower but who cares. (and the runtime can work without any provided allocators)
X. Micro-efficiency is not important.
Yes, being fast and lean is good, but not when it makes things too complex or difficult to use. There's a danger of a kind of mental-masturbation that us RAD-type guys can get caught in. Yes, your big stream processing stuff needs to be competitive (eg. Oodle's LZ decompress, or Bink's frame decode time). But making your Init() call take 100 clocks instead of 10,000 clocks is irrelevant to everyone but you. And if it requires funny crap from the user, then it's actually making things worse, not better. Having things just work reliably and safely and easily is more important than micro-efficiency.
For example, one mistake I made in Oodle is that the compressed streams are headerless; they don't contain the compressed or decompressed size. The reason I did that is because often the game already has that information from its own headers, so if I store it again it's redundant and costs a few bytes. But that was foolish - to save a few bytes of compressed size I sacrifice error checking, robustness, and convenience for people who don't want to write their own header. It's micro-efficiency that costs too much.
Another one I realized is a mistake : to do actual async writes on Windows, you need to call SetFileValidData on the newly enlarged file region. That requires admin privileges. It's too much trouble, and nobody really cares. It's not worth the mess. So in Oodle2 I just don't do that, and writes are no longer async. (everyone else who thinks they're doing async writes isn't actually, and nobody else actually checks on their threading the way I do, so it just makes me more like everyone else).
X. It should just work.
Fragile is bad. Any API's that have to go in some complicated sequence, do this, then this, then this. That's bad. (eg. JPEGlib and PNGlib). Things should just work as simply as possible without requirements. Operations should be single function calls when possible. Like if you take pointers in and out, don't require them to be aligned in a certain way or padded or allocated with your own allocators. Make it work with any buffer the user provides. If you have options, make things work reasonably with just default options so the user can ignore all the option setup if they want. Don't require Inits before your operations.
In Oodle2 , you just call Decompress(pointer,size,pointer) and it should Just Work. Things like error handling and allocators now just fall back to reasonable light weight defaults if you don't set up anything explicitly.
X. Special case stuff should be external (and callbacks are bad).
Anything that's unique to a few users, or that people will want to be different should be out of the library. Make it possible to do that stuff through client-side code. As much as possible, avoid callbacks to make this work, try to do it through imperative sequential code.
eg. if they want to do some incremental post-processing of data in place, it should be possible via : { decode a bit, process some, decode a bit , process some } on the client side. Don't do it with a callback that does decode_it_all( process_per_bit_callback ).
Don't crud up the library feature set trying to please everyone. Some of these things can go in example code, or in your "back pocket code" that you send out as needed.
X. You are writing the library for evaluators and new users.
When you're designing the library, the main person to think about is evaluators and new users. Things need to be easy and clear and just work for them.
People who actually license or become long-term users are not a problem. I don't mean this in a cruel way, we don't devalue them and just care about sales. What I mean is, once you have a relationship with them as a client, then you can talk to them, help them figure out how to use things, show them solutions. You can send them sample code or even modify the library for them.
But evaluators won't talk to you. If things don't just work for them, they will be frustrated. If things are not performant or have problems, they will think the library sucks. So the library needs to work well for them with no help from you. And they often won't read the docs or even use your examples. So it needs to go well if they just start blindly calling your APIs.
(this is a general principle for all software; also all GUI design, and hell just design in general. Interfaces should be designed for the novice to get into it easy, not for the expert to be efficient once they master it. People can learn to use almost any interface well (*) once they are used to it, so you don't have to worry about them.)
(* = as long as it's low latency, stateless, race free, reliable, predictable, which nobody in the fucking world seems to understand any more. A certain sequence of physical actions that you develop muscle memory for should always produce the same result, regardless of timing, without looking at the device or screen to make sure it's keeping up. Everyone who fails this (eg. everyone) should be fucking fired and then shot. But this is a bit off topic.)
X. Make the default log & check errors. But make the default reasonably fast.
This is sort of related to the evaluator issue. The defaults of the library need to be targeted at evaluators and new users. Advanced users can change the defaults if they want; eg. to ship they will turn off logging & error checking. But that should not be how you ship, or evaluators will trigger lots of errors and get failures with no messages. So you need to do some amount of error checking & logging so that evaluators can figure things out. *But* they will also measure performance without changing the settings, so your default settings must also be fast.
X. Make easy stuff easy. It's okay if complicated stuff is hard.
Kind of self explanatory. The API should be designed so that very simple uses require tiny bits of code. It's okay if something complicated and rare is a pain in the ass, you don't need to design for that; just make it possible somehow, and if you have to help out the rare person who wants to do a weird thing, that's fine. Specifically, don't try to make very flexible general APIs that can do everything & the kitchen sink. It's okay to have a super simple API that covers 99% of users, and then a more complex path for the rare cases.
This is an internal email I sent on 05-13-2015 :
Cliff notes : there's a good reason why OS'es use thread pools and fibers to solve this problem.
I understand progress has to happen and so on, and older APIs have to get retired sometimes. Well, no not really; that's not actually what happens in the modern world. I'm sick of the slap-dash upgrading and deprecating that has nothing to do with necessity and is just random chaos. You all can fuck around with it if you want. Not me. I love algorithms. I love programming when the problems are inherent, mathematical problems. Not problems like this fucking API doesn't do what it says it does, or this shit does different things in different versions so I have to detect that and hack it and, oh crap my platform sdk updated and nothing works any more and fuck me.
I think this blog (on Blogger) is probably dead. I can't be bothered to fix my poster (working in C# is a nightmare). Goodbye.
You can read the raw text blog at http://www.cbloom.com/rants.html
also Hello!
For a little while I've been writing a new blog.
It was inspired by el trastero | de Iñigo Quilez which is a fabulous blog. el trastero has some little technical thoughts some times, but also personal stuff, and lots of humanity. I love it. It's how my blog started, and somewhere along the way I lost the point. So I started writing a new blog, inspired by my old blog.
It's called "rambles", and it's here : http://www.cbloom.com/rambles.html
It's personal and inappropriate and you probably shouldn't read it.
Goodbye and hello.
The encode speeds on lzt99 :
single-threaded :
==============
LZNA :
-z5 (Optimal1) :
24,700,820 -> 9,207,584 = 2.982 bpb = 2.683 to 1
encode : 10.809 seconds, 1.32 b/kc, rate= 2.29 mb/s
decode : 0.318 seconds, 44.87 b/kc, rate= 77.58 mb/s
-z6 (Optimal2) :
24,700,820 -> 9,154,343 = 2.965 bpb = 2.698 to 1
encode : 14.727 seconds, 0.97 b/kc, rate= 1.68 mb/s
decode : 0.313 seconds, 45.68 b/kc, rate= 78.99 mb/s
-z7 (Optimal3) :
24,700,820 -> 9,069,473 = 2.937 bpb = 2.724 to 1
encode : 20.473 seconds, 0.70 b/kc, rate= 1.21 mb/s
decode : 0.317 seconds, 45.06 b/kc, rate= 77.92 mb/s
=========
LZMA :
lzmahigh : 24,700,820 -> 9,329,982 = 3.022 bpb = 2.647 to 1
encode : 11.373 seconds, 1.26 b/kc, rate= 2.17 M/s
decode : 0.767 seconds, 18.62 b/kc, rate= 32.19 M/s
=========
LZHAM BETTER :
lzham : 24,700,820 ->10,140,761 = 3.284 bpb = 2.436 to 1
encode : 16.732 seconds, 0.85 b/kc, rate= 1.48 M/s
decode : 0.242 seconds, 59.09 b/kc, rate= 102.17 M/s
LZHAM UBER :
lzham : 24,700,820 ->10,097,341 = 3.270 bpb = 2.446 to 1
encode : 18.877 seconds, 0.76 b/kc, rate= 1.31 M/s
decode : 0.239 seconds, 59.73 b/kc, rate= 103.27 M/s
LZHAM UBER + EXTREME :
lzham : 24,700,820 -> 9,938,002 = 3.219 bpb = 2.485 to 1
encode : 185.204 seconds, 0.08 b/kc, rate= 133.37 k/s
decode : 0.245 seconds, 58.28 b/kc, rate= 100.77 M/s
===============
LZNA -z5 threaded :
24,700,820 -> 9,211,090 = 2.983 bpb = 2.682 to 1
encode only : 8.523 seconds, 1.68 b/kc, rate= 2.90 mb/s
decode only : 0.325 seconds, 43.96 b/kc, rate= 76.01 mb/s
LZMA threaded :
lzmahigh : 24,700,820 -> 9,329,925 = 3.022 bpb = 2.647 to 1
encode : 7.991 seconds, 1.79 b/kc, rate= 3.09 M/s
decode : 0.775 seconds, 18.42 b/kc, rate= 31.85 M/s
LZHAM BETTER threaded :
lzham : 24,700,820 ->10,198,307 = 3.303 bpb = 2.422 to 1
encode : 7.678 seconds, 1.86 b/kc, rate= 3.22 M/s
decode : 0.242 seconds, 58.96 b/kc, rate= 101.94 M/s
I incorrectly said in the original version of the LZNA post (now corrected) that "LZHAM UBER is too slow". It's actually the "EXTREME" option that's too slow.
Also, as I noted last time, LZHAM is the best threaded of the three, so even though BETTER is slower than LZNA -z5 or LZMA in single-threaded encode speed, it's faster threaded. (Oodle's encoder threading is very simplistic (chunking) and really needs a larger file to get full parallelism; it doesn't use all cores here; LZHAM is much more micro-threaded so can get good parallelism even on small files).
Nothing happens. Wait.
Click "close tab" again on the same tab.
Two tabs close.
ARGGGGG MOTHER FUCKER WTF WTF.
On Android it's even worse. Maybe the worst sin of Android UI design (hard to choose) is that the back/home/panes bar at the bottom is not always there. Some apps make it roll out of the way and put other buttons there. So depending on timing and input races you can be just trying to get back to the home screen and instead hit some other shite.
Fucking basic GUI design principles :
User should get *immediate* acknowledgement of their action. I mean, for fuck's sake, you should be able to actually just DO it immediately, computers are fast. But if you can't, you still need to acknowledge it.
Buttons should not move! They should not roll in and out. The more important the button, the more it should just stay in the same place all the time.
A sequence of input actions should always lead to the same outcome. There should be no "input races" that make outcome dependent on processing time.
A power user should be able to use the application without looking at it. You should be able to develop muscle memory of sequences that do what you want, and shouldn't have to be tracking "is the app responding" to each of the actions in the sequence.
As much as possible buttons should be *stateless*. The same click gives the same action all the time. Modal UI is a necessary evil.
I have a kitchen timer that has these buttons :
Stop/Start : toggles running or not
Reset : resets timer to 0.
Reset only does anything if the timer is not running. So. How do you make the timer be in the "running from zero" state?
You have to look at it. There is no reliable key sequence to make it be running. If it's already running, you have to hit stop, then reset, then start.
The problem is that the action of the buttons is stateful. Dumb. A better design is something like :
Stop/Start : toggle running
Stop&Reset : stop and reset timer to 0
So you can press "Stop&Reset" + "Stop/Start" to always get into the "running from zero" state without looking at it. There are other options. The point is that people just don't understand fucking usability and good UI design.
A tool should become an extension of yourself, that you can use without thinking about it. That you can use without checking in on it. Oops, I hit a bunch of nails while my hammer was in un-nail mode and now my house is falling down.
Of course this is even more crucial in cars. There needs to be a law that every function in a car can be used by a blind person. There should never be a button/shuttle/touch-screen that you have to look at to use. It should all be possible to do without looking, by sense of location and touch. This is just good UI design in general, but even more important when safety is involved. (and the answer is not a million fucking buttons like an airplane cockpit, it's less fucking unnecessary functions)
Today I woke up and tried to start working, and my VC plugin NiftyPerforce is crashing. It's worked for years and I haven't touched anything and all of a sudden it's crashing. So.. try to debug it.. OMG it's fucking Managed C# I forgot about this nonsense. Hmm, some exception. Turn on exception breaks. OMG fucking C# throws exceptions all the damn time that are benign and you have to ignore. OMG the problem is down in some app pref XML persistence thing, why the fuck is this crashing now and how the fuck do I figure out what's going on. ARG ARG ARG.
Give me a fucking "main()". I want only my code. Nothing happens before main starts. No threads, no fucking binding. No automatically downloading packages!? WTF are you doing? Arg.
I wish I'd written my own text/code editor. It sucks having my primary daily interface with the machine be someone else's code that sucks.
The Anti-Patent Patent Pool is an independent patent licensing organization. (Hence APPP)
One option would be to just allow anyone to use those patents free of charge.
A more aggressive option would be a viral licensing model. (like the GPL, which has completely failed, so hey, maybe not). The idea of the viral licensing model is like this :
Anyone who owns no patents may use any patent in the APPP for free (if you currently own patents, you may donate them to the APPP).
If you wish to own patents, then you must pay a fee to license from the APPP. That fee is used to fund the APPP's activities, the most expensive being legal defense of its own patents, and legal attacks on other patents that it deems to be illegal or too broad.
(* = we'd have to be aggressive about going after companies that make a subsidiary to use APPP patents while still owning patents in the parent corporation)
The tipping point for the APPP would be to get a few patents that are important enough that major players need to either join the APPP (donate all their patents) or pay a large license.
The APPP provides a way for people who want their work to be free to ensure that it is free. In the current system this is hard to do without owning a patent, and owning a patent and enforcing it is hard to do without money.
The APPP pro-actively watches all patent submissions and objects to ones that cover prior art, are obvious and trivial, or excessively broad. It greatly reduces the issuance of junk patents, and fights ones that are mistakenly issued. (the APPP maintains a public list of patents that it believes to be junk, which it will help you fight if you choose to use the covered algorithms). (Obviously some of these activities have to be phased in over time as the APPP gets more money).
The APPP provides a way for small companies and individuals that cannot afford the lawyers to defend their work to be protected. When some evil behemoth tries to stop you from using algorithms that you believe you have a legal right to, rather than fight it yourself, you simply donate your work to the APPP and they fight for you.
Anyone who simply wants to ensure that they can use their own inventions could use the APPP.
Once the APPP has enough money, we would employ a staff of patent writers. They would take idea donations from the groundswell of developers, open-source coders, hobbyists. Describe your idea, the patent writer would make it all formal and go through the whole process. This would let us tap into where the ideas are really happening, all the millions of coders that don't have the time or money to pursue getting patents on their own.
In the current system, if you just want to keep your idea free, you have to constantly keep an eye on all patent submissions to make sure noone is slipping in and patenting it. It's ridiculous. Really the only safe thing to do is to go ahead and patent it yourself and then donate it to the APPP. (the problem is if you let them get the patent, even if it's bogus it may be expensive to fight, and what's worse is it creates a situation where your idea has a nasty asterisk on it - oh, there's this patent that covers this idea, but we believe that patent to be invalid so we claim this idea is still public domain. That's a nasty situation that will scare off lots of users.)
Some previous posts :
cbloom rants 02-10-09 - How to fight patents
cbloom rants 12-07-10 - Patents
cbloom rants 04-27-11 - Things we need
cbloom rants 05-19-11 - Nathan Myhrvold
Some notes :
1. I am not interested in debating whether patents are good or not. I am interested in providing a mechanism for those of us who hate patents to pursue our software and algorithm development in a reasonable way.
2. If you are thinking about the patent or not argument, I encourage you to think not of some ideal theoretical argument, but rather the realities of the situation. I see this on both sides of the fence; those who are pro-patent because it "protects inventors" but choose to ignore the reality of the ridiculous patent system, and those on the anti-patent side who believe patents are evil and they won't touch them, even though that may be the best way to keep free ideas free.
3. I believe part of the problem with the anti-patent movement is that we are all too fixated on details of our idealism. Everybody has slightly different ideas of how it should be, so the movement fractures and can't agree on a unified thrust. We need to compromise. We need to coordinate. We need to just settle on something that is a reasonable solution; perhaps not the ideal that you would want, but some change is better than no change. (of course the other part of the problem is we are mostly selfish and lazy)
4. Basically I think that something like the "defensive patent license" is a good idea as a way to make sure your own inventions stay free. It's the safest way (as opposed to not patenting), and in the long run it's the least work and maintenance. Instead of constantly fighting and keeping aware of attempts to patent your idea, you just patent it yourself, do the work up front and then know it's safe long term. But it doesn't go far enough. Once you have that patent you can use it as a wedge to open up more ideas that should be free. That patent is leverage, against all the other evil. That's where the APPP comes in. Just making your one idea free is not enough, because on the other side there is massive machinery that's constantly trying to patent every trivial idea they can think of.
5. What we need is for the APPP to get enough money so that it can be stuffing a deluge of trivial patents down the patent office's throat, to head off all the crap coming from "Intellectual Ventures" and its many brothers. We need to be getting at least as many patents as them and making them all free under the APPP.
Some links :
en.swpat.org - The Software Patents Wiki
Patent Absurdity — How software patents broke the system
Home defensivepatentlicense
FOSS Patents U.S. patent reform movement lacks strategic leadership, fails to leverage the Internet
PUBPAT Home
delta_literal = get_sub_literal();
if ( delta_literal != 0 )
{
*ptr++ = delta_literal + ptr[-lastOffset];
}
else // delta_literal == 0
{
if ( ! get_offset_flag() )
{
*ptr++ = ptr[-lastOffset];
}
else if ( get_lastoffset_flag() )
{
int lo_index = get_lo_index();
lastOffset = last_offsets[lo_index];
// do MTF or whatever using lo_index
*ptr++ = ptr[-lastOffset];
// extra 0 delta literal implied :
*ptr++ = ptr[-lastOffset];
}
else
{
lastOffset = get_offset();
// put offset in last_offsets set
*ptr++ = ptr[-lastOffset];
*ptr++ = ptr[-lastOffset];
// some automatic zero deltas follow for larger offsets
if ( lastOffset > 128 )
{
*ptr++ = ptr[-lastOffset];
if ( lastOffset > 16384 )
{
*ptr++ = ptr[-lastOffset];
}
}
}
// each single zero is followed by a zero runlen
// (this is just a speed optimization)
int zrl = get_zero_runlen();
while(zrl--)
*ptr++ = ptr[-lastOffset];
}
This is basically LZMA. (sub literals instead of bitwise-LAM, but structurally the same) (also I've reversed the implied structure here; zero delta -> offset flag here, whereas in normal LZ you do offset flag -> zero delta)
This is what a modern LZ is. You're sending deltas from the prediction. The prediction is the source of the match. In the "match" range, the delta is zero.
The thing about modern LZ's (LZMA, etc.) is that the literals-after-match (LAMs) are very important too. These are the deltas after the zero run range. You can't really think of the match as just applying to the zero-run range. It applies until you send the next offset.
You can also of course do a simpler & more general variant :
Generalized-LZ-Sub decoder :
if ( get_offset_flag() )
{
// also lastoffset LRU and so on not shown here
lastOffset = get_offset();
}
delta_literal = get_sub_literal();
*ptr++ = delta_literal + ptr[-lastOffset];
Generalized-LZ-Sub just sends deltas from prediction. Matches are a bunch of zeros. I've removed the
acceleration of sending zeros as a runlen for simplicity, but you could still do that.
The main difference is that you can send offsets anywhere, not just at certain spots where there are a bunch of zero deltas generated (aka "min match lengths").
This could be useful. For example when coding images/video/sound , there is often not an exact match that gives you a bunch of exact zero deltas, but there might be a very good match that gives you a bunch of small deltas. It would be worth sending that offset to get the small deltas, but normal LZ can't do it.
Generalized-LZ-Sub could also give you literal-before-match. That is, instead of sending the offset at the run of zero deltas, you could send it slightly *before* that, where the deltas are not zero but are small.
(when compressing text, "sub" should be replaced with some kind of smart lexicographical distance; for each character precompute a list of its most likely substitution character in order of probability.)
LZ is a bit like a BWT, but instead of the contexts being inferred by the prefix sort, you transmit them explicitly by sending offsets to prior strings. Weird.
Today they sent me back an email saying :
"I need your email address so I can send you the documents you need to sign"
Umm... you are not inspiring great confidence in your abilities.
Also, pursuant to my last post about spam - pretty much all my correspondence with lawyers over the past few months, Google decides to put in the spam folder. I keep thinking "WTF why didn't this lawyer get back to me - oh crap, go check the spam". Now, I'm totally down with the comic social commentary that Google is making ("ha ha, all email from lawyers is spam, amirite? lol"). But WTF your algorithms are insanely broken. I mean, fucking seriously you suck so bad.
By these tards :
Someone in the UK go over and punch them in the balls.
For those not aware of the background, ANS is probably the biggest invention in data compression in the last 20 years. Its inventor (Jarek Duda) has explicitly tried to publish it openly and make it patent-free, because he's awesome.
In the next 10 years I'm sure we will get patents for "using ANS with string-matching data compression", "using ANS with block mocomp data compression", "using ANS as a replacement for Huffman coding", "deferred summation with ANS", etc. etc. Lots of brilliant inventions like that. Really stimulating for innovation.
(as has happened over and over in data compression, and software in general in the past; hey let's take two obvious previously existing things; LZ string matching + Huffman = patent. LZ + hash table = patent. JPEG + arithmetic = patent. Mocomp + Huffman = patent. etc. etc.)
(often glossed over in the famous Stac-Microsoft suit story is the question of WHAT THE FUCK the LZS patent was supposed to be for? What was the invention there exactly? Doing LZ with a certain fixed bit encoding? Umm, yeah, like everyone does?)
Our patent system is working great. It obviously protects and motivates the real inventors, and doesn't just act as a way for the richest companies to lock in semi-monopolies of technologies they didn't even invent. Nope.
Recently at RAD we've made a few innovations related to ANS that are mostly in the vein of small improvements or clever usages, things that I wouldn't even imagine to patent, but of course that's wrong.
I've also noticed in general a lot of these vaporware companies in the UK. We saw one at RAD a few years ago that claimed to use "multi-dimensional curve interpolation for data compression" or some crackpot nonsense. There was another one that used alternate numeral systems (not ANS, but p-adic or some such) for god knows what. A few years ago there were lots of fractal-image-compression and other fractal-nonsense startups that did ... nothing. (this was before the VC "pivot" ; hey we have a bunch of fractal image patents, let's make a text messaging app)
They generally get some PhD's from Cambridge or whatever to be founders. They bring a bunch of "industry luminaries" on the board. They patent a bunch of nonsense. And then ...
... profit? There's a step missing where they actually ever make anything that works. But I guess sometimes they get bought for their vapor, or they manage to get a bullshit patent that's overly-general on something they didn't actually invent, and then they're golden.
I wonder if these places are getting college-backed "incubation" incentives? Pretty fucking gross up and down and all around. Everyone involved is scum.
(In general, universities getting patents and incubating startups is fucking disgusting. You take public funding and student's tuition, and you use that to lock up ideas for private profit. Fucking rotten, you scum.)
On a more practical note, if anyone knows the process for objecting to a patent in the UK, chime in.
Also, shame on us all for not doing more to fight the system. All our work should be going in the Anti-Patent Patent Pool.
Under the current first-to-file systems, apparently we are supposed to sit around all day reading every patent that's been filed to see if it covers something that we have already invented or is "well known" / public domain / prior art.
It's really a system that's designed around patents. It assumes that all inventions are patented. It doesn't really work well with a prior invention that's just not patented.
Which makes something like the APPP even more important. We need a way to patent all the free ideas just as a way to keep them legally free and not have to worry about all the fuckers who will rush in and try to patent our inventions as soon as we stop looking.
Not in a nefarious way, like haha we're going to send your good mails to "spam" and let the crap through! Take that!
But actually in a sort of more deeply evil way. A capitalist way. They specifically *want* to allow through mass-mailings from corporations that they do not consider spam.
In my opinion, those are all spam. There is not a single corporate mass-mailing that I ever intentionally subscribed to.
Basically there's a very very easy spam filtering problem :
Easy 1. Reject all mass-mailings. Reject all mailings about sales, products, offers. Reject all mailings about porn or penises or nigerian princes.
Easy 2. Allow through all mail that's hand-written by a human to me. Particularly to one that I have written to in the past.
That would be fine with me. That would get 99.99% of it right for me.
They don't want to solve that problem. Instead they try to solve the much-harder problem of allowing through viagra offers that are for some reason not spam. For the email user who *wants* to get mass-mail offers of 50% off your next order.
I just don't understand how "yeah, let's go out to dinner" from my friend, who is responding with quote to a fucking mail that I sent, goes in the Spam box, but "get direct email mass-marketing secrets to double your business!" goes in my inbox. How can it be so bad, I just really don't understand it. Fucking the most basic keyword include/exclude type of filter could do better.
I should have just written my own, because it's the kind of problem that you want to be constantly tweaking on. Every time a mail is misclassified, I want to run it through my system and see why that happened and then try to fix it.
It would be SOOO fucking easy for them. Being in a position as a central mail processor, they can tell which mails are unique and which are mass-sent, and just FUCKING BLOCK ALL THE MASS-SENT MAIL. God dammit. You are fucking me up and I know you're doing it intentionally. I hate you.
I mean, fuck. It's ridiculous.
They are responding to a mail I sent. The mail I sent is fucking quoted right there. I sent the fucking mail from gmail so you can confirm it's for real. I sent to their address with gmail. AND YOU PUT THEIR REPLY IN SPAM. WTF WTF WTF
But this is not spam :
Report: creative teamwork is easier with cloud-based apps
Businesses Increase Screening of Facebook, Twitter Before Hiring
Trying to solve the Prospecting Paradox?
I'd like to add you to my professional network on LinkedIn
Maybe I'm being a bit overly simplistic and harsh. Maybe there are mass-mailings that look spammish, but you actually want to get? Like, your credit card bill is due?
I'm not sure. I'm not sure that I ever need to get any of that. I don't need those "shipping confirmation" emails from Amazon. If they just all got filed to the "mass mail" folder, I could go look for them when I need them.
I want to make my own private internet. And then not allow anyone else to use it because you'd all just fuck it up.
The consumer side decs the semaphore, and waits on the count being positive.
The producer side incs the semaphore, and can wait on the count being a certain negative value (some number of waiting consumers).
Monitored semaphore solves a specific common problem :
In a worker thread system, you may need to wait on all work being done. This is hard to do in a race-free way using normal primitives. Typical ad-hoc solutions may miss work that is pushed during the wait-for-all-done phase. This is hard to enforce, ugly, and makes bugs. (it's particularly bad when work items may spawn new work items).
I've heard of many ad-hoc hacky ways of dealing with this. There's no need to muck around with that, because there's a simple and efficient way to just get it right.
The monitored semaphore also provides a race-free way to snapshot the state of the work system - how many work items are available, how many workers are sleeping. This allows you to wait on the joint condition - all workers are sleeping AND there is no work available. Any check of those two using separate primitives is likely a race.
The implementation is similar to the fastsemaphore I posted before.
"fastsemaphore" wraps some kind of underlying semaphore which actually provides the OS waits. The underlying semaphore is only used when the count goes negative. When count is positive, pops are done with simple atomic ops to avoid OS calls. eg. we only do an OS call when there's a possibility it will put our thread to sleep or wake a thread.
"fastsemaphore_monitored" uses the same kind atomic variable wrapping an underlying semaphore, but adds an eventcount for the waiter side to be triggered when enough workers are waiting. (see who ordered event count? )
Usage is like this :
To push a work item :
push item on your queue (MPMC FIFO or whatever)
fastsemaphore_monitored.post();
To pop a work item :
fastsemaphore_monitored.wait();
pop item from queue
To flush all work :
fastsemaphore_monitored.wait_for_waiters(num_worker_threads);
NOTE : in my implementation, post & wait can be called from any thread, but wait_for_waiters must be
called from only one thread. This assumes you either have a "main thread" that does that wait, or
that you wrap that call with a mutex.
template
<typename t_base_sem>
class fastsemaphore_monitored
{
atomic<S32> m_state;
eventcount m_waiters_ec;
t_base_sem m_sem;
enum { FSM_COUNT_SHIFT = 8 };
enum { FSM_COUNT_MASK = 0xFFFFFF00UL };
enum { FSM_COUNT_MAX = ((U32)FSM_COUNT_MASK>>FSM_COUNT_SHIFT) };
enum { FSM_WAIT_FOR_SHIFT = 0 };
enum { FSM_WAIT_FOR_MASK = 0xFF };
enum { FSM_WAIT_FOR_MAX = (FSM_WAIT_FOR_MASK>>FSM_WAIT_FOR_SHIFT) };
public:
fastsemaphore_monitored(S32 count = 0)
: m_state(count<<FSM_COUNT_SHIFT)
{
RL_ASSERT(count >= 0);
}
~fastsemaphore_monitored()
{
}
public:
inline S32 state_fetch_add_count(S32 inc)
{
S32 prev = m_state($).fetch_add(inc<<FSM_COUNT_SHIFT,mo_acq_rel);
S32 count = ( prev >> FSM_COUNT_SHIFT );
RR_ASSERT( count < 0 || ( (U32)count < (FSM_COUNT_MAX-2) ) );
return count;
}
// warning : wait_for_waiters can only be called from one thread!
void wait_for_waiters(S32 wait_for_count)
{
RL_ASSERT( wait_for_count > 0 && wait_for_count < FSM_WAIT_FOR_MAX );
S32 state = m_state($).load(mo_acquire);
for(;;)
{
S32 cur_count = state >> FSM_COUNT_SHIFT;
if ( (-cur_count) == wait_for_count )
break; // got it
S32 new_state = (cur_count<<FSM_COUNT_SHIFT) | (wait_for_count << FSM_WAIT_FOR_SHIFT);
S32 ec = m_waiters_ec.prepare_wait();
// double check and signal what we're waiting for :
if ( ! m_state.compare_exchange_strong(state,new_state,mo_acq_rel) )
continue; // retry ; state was reloaded
m_waiters_ec.wait(ec);
state = m_state($).load(mo_acquire);
}
// now turn off the mask :
for(;;)
{
S32 new_state = state & FSM_COUNT_MASK;
if ( state == new_state ) return;
if ( m_state.compare_exchange_strong(state,new_state,mo_acq_rel) )
return;
// retry ; state was reloaded
}
}
void post()
{
if ( state_fetch_add_count(1) < 0 )
{
m_sem.post();
}
}
void wait_no_spin()
{
S32 prev_state = m_state($).fetch_add((-1)<<FSM_COUNT_SHIFT,mo_acq_rel);
S32 prev_count = prev_state>>FSM_COUNT_SHIFT;
if ( prev_count <= 0 )
{
S32 waiters = (-prev_count) + 1;
RR_ASSERT( waiters >= 1 );
S32 wait_for = prev_state & FSM_WAIT_FOR_MASK;
if ( waiters == wait_for )
{
RR_ASSERT( wait_for >= 1 );
m_waiters_ec.notify_all();
}
m_sem.wait();
}
}
void post(S32 n)
{
RR_ASSERT( n > 0 );
for(S32 i=0;i<n;i++)
post();
}
bool try_wait()
{
// see if we can dec count before preparing the wait
S32 state = m_state($).load(mo_acquire);
for(;;)
{
if ( state < (1<<FSM_COUNT_SHIFT) ) return false;
// dec count and leave the rest the same :
//S32 new_state = ((c-1)<<FSM_COUNT_SHIFT) | (state & FSM_WAIT_FOR_MASK);
S32 new_state = state - (1<<FSM_COUNT_SHIFT);
RR_ASSERT( (new_state>>FSM_COUNT_SHIFT) >= 0 );
if ( m_state($).compare_exchange_strong(state,new_state,mo_acq_rel) )
return true;
// state was reloaded
// loop
// backoff here optional
}
}
S32 try_wait_all()
{
// see if we can dec count before preparing the wait
S32 state = m_state($).load(mo_acquire);
for(;;)
{
S32 count = state >> FSM_COUNT_SHIFT;
if ( count <= 0 ) return 0;
// swap count to zero and leave the rest the same :
S32 new_state = state & FSM_WAIT_FOR_MASK;
if ( m_state($).compare_exchange_strong(state,new_state,mo_acq_rel) )
return count;
// state was reloaded
// loop
// backoff here optional
}
}
void wait()
{
int spin_count = rrGetSpinCount();
while(spin_count--)
{
if ( try_wait() )
return;
}
wait_no_spin();
}
};
LAMs are weird.
LAM0 , the first literal after a match, has the strong exclusion property (assuming maximum match lengths). LAM0 is strictly != lolit. (lolit = literal at last offset).
LAM1, the next literal after end of match, has the exact opposite - VERY strong prediction of LAM1 == lolit. This prediction continues but weakens as you go to LAM2, LAM3, etc.
In Oodle LZNA (and in many other coders), I send a flag for (LAM == lolit) as a separate event. That means in the actual literal coding path you still have LAM1 != lolit. (the LAM == lolit flag should be context-coded using the distance from the end of the match).
In all cases, even though you know LAM != lolit, lolit is still a very strong predictor for LAM. Most likely LAM is *similar* to lolit.
LAM is both an exclude AND a predictor!
What similar means depends on the file type. In text it means something like vowels stay vowels, punctuation stays punctuation. lolit -> LAM is sort of like substituting one character change. In binary, it often means that they are numerically close. This means that the delta |LAM - lolit| is never zero, but is often small.
One of the interesting things about the delta is that it gives you a data-adaptive stride for a delta filter.
On some files, you can get huge compression wins by running the right delta filter. But the ideal delta distance is data-dependent (*). The sort of magic thing that works out is that the LZ match offsets will naturally pick up the structure & word sizes. In a file of 32-byte structs made of DWORDs, you'll get offsets of 4,8,12,32,etc. So you then take that offset, and forming the LAM sub is just a way of doing a delta with that deduced stride. On DWORD or F32 data, you tend to get a lot of offset=4, so LAM tends to just be doing delta from the previous word (note of course this is a bytewise delta, not a proper dword delta).
(* = this is a huge thing that someone needs to work on; automatic detection of delta filters for arbitrary data; deltas could be byte,word,dword, other, from immediate neighbors or from struct/row strides, etc. In a compression world where we are fighting over 1% gains, this can be a 10-20% jump.)
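For reference, forming the "sub" literal is just the byte delta against the prediction at the last offset; a minimal sketch in the style of the LZ-Sub pseudocode above, with hypothetical names :

sub_literal = (U8)( ptr[0] - ptr[-lastOffset] ); // encode side : delta from prediction
ptr[0] = (U8)( sub_literal + ptr[-lastOffset] ); // decode side : reconstruct from the delta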
Experimentally we have observed that LAMs are very rapidly changing. They benefit greatly from very quickly adapting models. They like geometric adaptation rates (more recent events are much more important). They cannot be modeled with large contexts (without very sophisticated handling of sparsity and fast adaptation), they need small contexts to get lots of events and statistical density. They seem to benefit greatly from modeling in groups (eg. bitwise or nibblewise or other), so that events on one symbol also affect other probabilities for faster group learning. Many of these observations are similar for post-BWT data. LAM sub literals does seem to behave like post-BWT data to some extent, and similar principles of modeling apply.
So, for example, just coding an 8-bit symbol using the 8-bit lolit as context is a no-go. In theory this would give you full modeling of the effects of lolit on the current symbol. In practice it dilutes your statistics way too much. (in theory you could do some kind of one-count boosts other counts thing (or a secondary coding table ala PPMZ SEE), but in practice that's a mess). Also as noted previously, if you have the full 8-bit context, then whether you code symbol raw or xor or sub is irrelevant, but if you do not have the full context then it does change things.
Related posts :
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 03-10-13 - Two LZ Notes
cbloom rants 06-12-14 - Some LZMA Notes
cbloom rants 06-16-14 - Rep0 Exclusion in LZMA-like coders
cbloom rants 03-15-15 - LZ Literal Correlation Images
The obvious fix is just to magnify the right side. This is a linear scaling of the data; *1 on the far left, *10 on the far right :
The far-left is still proportional to the compression ratio, the far right is proportional to the decompression speed. The compressor lines are still speedups vs. memcpy, but the memcpy baseline is now sloped.
I'm not really sure how I feel about the warped chart vs unwarped.
The Pareto curves are in fact sigmoids (tanh's).
speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
(here they're warped sigmoids because of the magnification; the ones
back here in the LZNA post are true sigmoids)
I believe (but have not proven) that a principle of the Pareto Frontier is that the maximum of all compressors should also be a sigmoid.
max_speedup(disk_speed) = MAX{c}( speedup[compressor c](disk_speed) );
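Those formulas as a tiny code sketch (hypothetical names; disk_speed and decompress_speed in the same units, eg. MB/s) :

double speedup(double compression_ratio, double decompress_speed, double disk_speed)
{
// the sigmoid from above : 1 / (1/ratio + disk_speed/decomp_speed)
return 1.0 / ( 1.0/compression_ratio + disk_speed/decompress_speed );
}

double max_speedup(int num_compressors, const double * ratio, const double * decomp, double disk_speed)
{
// Pareto frontier : best speedup over all compressors at this disk speed
double best = 0;
for (int c = 0; c < num_compressors; c++)
{
double s = speedup(ratio[c], decomp[c], disk_speed);
if ( s > best ) best = s;
}
return best;
}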
One of the nice things about these charts is it makes it easy to see where some compressors are not as good as possible. If we fit a sigmoid over the top of all the curves :
We can easily see that LZHLW and LZNIB are not touching the curve. They're not as good as they should be in space/speed. Even though nothing beats them at the moment (that I know of), they are algorithmically short of what's possible.
There are two things that constrain compressors from being better in a space/speed way. There's 1. what is our current best known algorithm. And then there's 2. what is possible given knowledge of all possible algorithms. #2 is the absolute limit and eventually it runs into a thermodynamic limit. In a certain amount of cpu time (cpu bit flips, which increase entropy), how much entropy can you take out of a given data stream? You can't beat that limit no matter how good your algorithm is. So our goal in compression is always to just find improvements in the algorithms to edge closer to that eventual limit.
Anyway. I think I know how to fix them, and hopefully they'll be up at the gray line soon.
ANS (TANS or RANS) in the straightforward implementation writes a large minimum number of bytes.
To be concrete I'll consider a particular extremely bad case : 64-bit RANS with 32-bit renormalization.
The standard coder is :
initialize encoder (at end of stream) :
x = 1<<31
renormalize so x stays in the range x >= (1<<31) and x < (1<<63)
flush encoder (at the beginning of the stream) :
output all 8 bytes of x
decoder initializes by reading 8 bytes of x
decoder renormalizes via :
if ( x < (1<<31) )
{
x <<= 32; x |= get32(ptr); ptr += 4;
}
decoder terminates and can assert that x == 1<<31
this coder outputs a minimum of 8 bytes, which means it wastes up to 7 bytes on low-entropy data
(assuming 1 byte minimum output and that the 1 byte required to byte-align output is not "waste").
In contrast, it's well known how to do minimal flush of arithmetic coders. When the arithmetic coder reaches the end, it has a "low" and "range" specifying an interval. "low" might be 64-bits, but you don't need to output them all, you only need to output enough such that the decoder will get something in the correct interval between "low" and "low+range".
Historically people often did arithmetic coder minimum flush assuming that the decoder would read zero-valued bytes after EOF. I no longer do that. I prefer to do a minimum flush such that decoder will get something in the correct interval no matter what byte follows EOF. This allows the decoder to just read past the end of your buffer with no extra work. (the arithmetic coder reads some # of bytes past EOF because it reads enough to fill "low" with bits, even though the top bits are all that are needed at the end of the stream).
The arithmetic coder minimum flush outputs a number of bytes proportional to log2(1/range) , which is the number of bits of information that are currently held pending in the arithmetic coder state, which is good. The excess is at most 1 byte.
So, to make ANS as clean as arithmetic coding we need a minimal flush. There are two sources of the waste in the normal ANS procedure outlined above.
One is the initial value of x (at the end of the stream). By setting x to (1<<31), the low end of the renormalization interval, we have essentially filled it with bits it has to flush. (the number of pending bits in x is log2(x)). But those bits don't contain anything useful (except a value we can check at the end of decoding). One way to remove that waste is to stuff some other value in the initial state which contains bits you care about. Any value you initialize x with, you get back at the end of decoding, so then those bits aren't "wasted". But it can be annoying to find something useful to put in there, since you don't get that value out until the end of decoding.
The other source of waste is the final flush of x (at the beginning of the stream). This one is obvious - the # of pending bits stored in x at any time is log2(x). Clearly we should be flushing the final value of x in a # of bits proportional to log2(x).
So to do ANS minimal flush, here's one way :
initialize encoder (at end of stream) :
x = 0
renormalize so x stays in the range x < (1<<63)
flush encoder (at the beginning of the stream) :
output # of bytes with bits set in x, and those bytes
decoder initializes by reading variable # of bytes of x
decoder renormalizes via :
if ( x < (1<<31) )
{
if ( ptr < ptrend )
{
x <<= 32; x |= get32(ptr); ptr += 4;
}
}
decoder terminates and can assert that x == 0
This ANS variant will output only 1 byte on very-low-entropy data.
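The variable-length flush of x at the head can be done like this (a sketch; put8/get8 are hypothetical byte IO on the compressed stream) :

// encoder flush : count the significant bytes of x, write the count, then those bytes :
int nbytes = 0;
uint64 t = x;
while ( t ) { nbytes++; t >>= 8; }
put8(nbytes);
for (int i = nbytes-1; i >= 0; i--)
put8( (U8)(x >> (i*8)) );

// decoder init : read the count, then that many bytes into x :
x = 0;
int nbytes = get8();
for (int i = 0; i < nbytes; i++)
x = (x << 8) | get8();

(when x == 0 the count byte is the only thing written, which is the 1-byte minimum mentioned above)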
There are now two phases of the coder. In the beginning of encoding (at the end of the stream), x is allowed to be way below the renormalization range. During this phase, encoding just puts information into x, and the value of x grows. (note that x can actually stay 0 and never hold any bits if your data consists entirely of the bottom symbol in RANS). Once x grows up into the renormalization interval, you enter the next phase where bits of x are pushed to the output to keep x in the renormalization interval. Decoding, in the first phase you read bytes from the stream to fill x with bits and keep it in the renormalization interval. Once the decoder read pointer hits the end, you switch to the second phase, and now x is allowed to shrink below the renormalization minimum and you can continue to decode the remaining information held in it.
This appears to add an extra branch to the decoder renormalization, but that can be removed by duplicating your decoder into "not near the end" and "near the end" variants.
The #sigbit output of x at the head is just the right thing and should always be done in all variants of ANS.
The checking ptr vs. ptrend and starting x = 0 is the variant that I call "minimal ANS".
Unfortunately "minimal ANS" doesn't play well with the ILP multi-state interleaved ANS. To do interleaved ANS like this you would need an EOF marker for each state. That's possible in theory (and could be done compactly in theory) but is a pain in the butt in practice.
LZNA is a high compression LZ (usually a bit more than 7z/LZMA) with better decode speed. Around 2.5X faster to decode than LZMA.
Anyone who needs LZMA-level compression and higher decode speeds should consider LZNA. Currently LZNA requires SSE2 to be fast, so it only runs full speed on modern platforms with x86 chips.
LZNA gets its speed from two primary changes. 1. It uses RANS instead of arithmetic coding. 2. It uses nibble-wise coding instead of bit-wise coding, so it can do 4x fewer coding operations in some cases. The magic sauce that makes these possible is Ryg's realization about mixing cumulative probability distributions . That lets you do the bitwise-style shift update of probabilities (keeping a power of two total), but on larger alphabets.
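To make that concrete, here's a tiny sketch of that kind of shift-update on a 16-symbol cumulative distribution (my own construction following the idea, not LZNA's actual code; the table size and adaptation rate are assumptions, and it assumes arithmetic shift of negative ints as on typical compilers) :

#include <stdint.h>

#define NIBBLE_CDF_BITS  14
#define NIBBLE_CDF_TOTAL (1 << NIBBLE_CDF_BITS)
#define NIBBLE_RATE      5    // adaptation speed (assumption)

// cum[i] = scaled probability of (symbol < i) ; cum[0] = 0 , cum[16] = total
typedef struct { uint16_t cum[17]; } NibbleModel;

static void NibbleModel_Init(NibbleModel * m)
{
    for (int i = 0; i <= 16; i++) m->cum[i] = (uint16_t)( i * (NIBBLE_CDF_TOTAL / 16) );
}

static void NibbleModel_Update(NibbleModel * m, int sym)
{
    // lerp the whole CDF toward a CDF concentrated on 'sym'
    // (one count stays reserved for every other symbol, so no frequency hits zero)
    for (int i = 1; i < 16; i++)
    {
        int mixin = (i <= sym) ? i : (NIBBLE_CDF_TOTAL - 16 + i);
        m->cum[i] = (uint16_t)( m->cum[i] + ((mixin - (int)m->cum[i]) >> NIBBLE_RATE) );
    }
    // cum[0] stays 0 and cum[16] stays NIBBLE_CDF_TOTAL, so the total remains a power of two
}

Decoding with this model is the usual rANS/arithmetic nibble decode : take the slot in [0,total), find sym such that cum[sym] <= slot < cum[sym+1], then update.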
LZNA usually beats LZMA compression on binary, slightly worse on text. LZNA is closer to LZHAM decompress speeds.
Some results :
lzt99
LZNA -z6 : 24,700,820 -> 9,154,248 = 2.965 bpb = 2.698 to 1
decode only : 0.327 seconds, 43.75 b/kc, rate= 75.65 mb/s
LZMA : 24,700,820 -> 9,329,925 = 3.021 bpb = 2.647 to 1
decode : 0.838 seconds, 58.67 clocks, rate= 29.47 M/s
LZHAM : 24,700,820 ->10,140,761 = 3.284 bpb = 2.435 to 1
decode : 0.264 seconds, 18.44 clocks, rate= 93.74 M/s
(note on settings : LZHAM is run at BETTER because UBER is too slow. LZHAM BETTER is comparable to Oodle's -z6 ; UBER is similar to my -z7 (not quite right; see later post "LZNA encode speed addendum"). LZMA is run at the best compression setting I can find; -m9 and lc=0,lp=2,pb=2 for binary data; with LZHAM I don't see a way to set the context bits. This is the new LZHAM 1.0, slightly different than my previous tests of LZHAM. All 64-bit, big dictionaries.)
baby_robot_shell
LZNA -z6 : 58,788,904 ->12,933,907 = 1.760 bpb = 4.545 to 1
decode only : 0.677 seconds, 50.22 b/kc, rate= 86.84 mb/s
LZMA : 58,788,904 ->13,525,659 = 1.840 bpb = 4.346 to 1
decode : 1.384 seconds, 40.70 clocks, rate= 42.49 M/s
LZHAM : 58,788,904 ->15,594,877 = 2.122 bpb = 3.769 to 1
decode : 0.582 seconds, 17.12 clocks, rate= 100.97 M/s
I'm not showing encode speeds because they're all running different amounts of threading. It would be complicated to show fairly. LZHAM is the most aggressively threaded, and also the slowest without threading.
My "game testset" total sizes, from most compression to least :
Oodle LZNA -z8 : 57,176,229
Oodle LZNA -z5 : 58,318,469
LZMA -mx9 d26:lc0:lp2:pb3 : 58,884,562
LZMA -mx9 : 59,987,629
LZHAM -mx9 : 62,621,098
Oodle LZHLW -z6 : 68,199,739
zip -9 : 88,436,013
raw : 167,495,105
Here's the new Pareto chart for Oodle. See the previous post on these charts.
This is load+decomp speedup relative to memcpy : (lzt99)
The left-side Y-intercept is the compression ratio. The right-side Y-intercept is the decompression speed. In between you can see the zones where each compressor is the best tradeoff.
With LZMA and LZHAM : (changed colors)
lzt99 is bad for LZHAM, perhaps because it's heterogeneous and LZHAM assumes pretty stable data. (LZHAM usually beats LZHLW for compression ratio). Here's a different example :
load+decomp speedup relative to memcpy : (baby_robot_shell)
Bernie Sanders is fucking amazing. If you haven't had the great pleasure of hearing him talk at length, go do it now. He is the best politician since I don't even know fucking who (*). I have never in my lifetime heard a single politician that actually speaks honestly and intelligently about the issues. Not talking points. Not just a bunch of bullshit promises. Not just verbal gymnastics to avoid the point. Actually directly talks about the issue in a realistic and pragmatic way.
(* = I have seen video of things like the Nixon-Kennedy debates in which politicians actually talk about issues and try to pin each other down on points of policy, rather than scoring "gotchas" and "applause points". I understand that back in the olden days, pandering to cheap emotional vagaries would get you pilloried in the press as "evasive" or "not serious". I've never seen it in my life.)
Even if you're conservative and don't agree with his views, it should be a fucking breath of fresh air for anyone with any fucking scrap of humanity and decency and intelligence to see a politician that refuses to play the games, refuses to pander to the shitty mass-applause points and the special-interest hate groups and most of all corporate money.
Even though Bernie won't win, I want every single debate to be just Bernie. I don't want to hear a single fucking vapid scripted garbage word from any of the other shit-heads. Just let Bernie talk the whole time.
I disagree with Bernie on some points. He tends to be rather more traditional liberal pro-union pro-manufacturing than I necessarily think is wise. But it's easy to get distracted by the disagreement and miss the point - here is a politician that's actually talking about issues and facts. Even if he gets them wrong sometimes at least he's trying to address real solutions. (this is one of the tricks of most other politicians - don't ever actually propose anything concrete, because then it can be attacked, instead just talk about "hope" or "liberty" or say "America fuck yeah!")
It's classic Bernie that he chose to run as a democrat so that he wouldn't be a spoiler ala Nader. It's just totally realistic, pragmatic, to the point. I fucking love Bernie Sanders.
Watch some Brunch with Bernie
Get corporate money out of politics! Elections should be publicly funded only! Stop lobbyists from writing laws! Free trade agreements should not supercede national laws! Corporations are not human beings! Government exists for the service of its citizens! Stop privatizing profit while leaving the government on the hook for risks and long term consequences! And so on.
Granted, the first one is actually labelled "water resistant" or "water repellant" or some such nonsense which actually means "fucking useless". But the other two are actually described as "waterproof". And they just aren't.
They seem waterproof at first. Water beads up and runs off and nothing goes through. But over time in a rain they start to get saturated, and eventually they soak through and then they just wick water straight through.
The problem is they're all some fancy waterproof/breathable technical fabric.
IT DOESN'T FUCKING WORK!
Ooo we have this new fancy fabric. NO! No you don't. You have bullshit that doesn't fucking work.
Job #1 : Actually be waterproof.
But it's lighter!, you say. Nope! Zip it! But it's breathable. Zip! Shush. But it's recycled, and rip-stop. Zip. Nope. Is it waterproof? Is it actually fucking waterproof? Like if I stand out in a rain. No, it isn't. Throw it out. It doesn't work. You're fired. Back to the drawing board.
If you want to get fancy, you could use your breathable/not-actually-waterproof material in areas that don't get very wet, such as the inside of the upper arm and the sides of the torso.
At the very least the tops of the shoulders and the upper chest need to be just plastic. Just fucking plastic like a slicker from the 50's. (it could be a separate overhanging shelf of plastic, like a duster type of thing)
Any time I'm browsing REI or whatever these days and see a jacket labelled waterproof, I think "like hell it is".
I'm at the point I always hit where I lose steam. I can play some basic stuff, but not anything too difficult. The problem is I have trouble finding fun songs to play that aren't too hard, or finding songbooks or teach-yourself books that are both fun and not too hard.
What I really want, and what I think is the right way to teach guitar to a dabbler like me is :
A book of songs, in progression of difficulty
The songs need to be modern (post-60's), fun, familiar
(classic rock is pretty safe)
The songs need to be the *actual* songs. Not simplified versions. Not transposed versions.
Not just the chords when the real song is much more complex.
When I play it, it needs to sound like the actual recording.
No funny tunings. I can't be bothered with that.
and so far as I know nothing like that exists.
I've got a bunch of "easy rock guitar songbooks" and they all fucking suck.
There are lots of good tabs on the internet, and I've found some good stuff to learn that way, but fuck that. The last thing I want to be doing in my relaxing time is browsing the internet trying to decide which of the 400 fucking versions of the "Heartbreaker" tab is the right one I should try to learn.
(in the past I taught myself some classical guitar, and in contrast there are lots of great classical, and even finger-picking folk guitar song books and instructional progressions that give you nice songs to learn that are actually fun to play and sound like something)
I've tried taking lessons a few times in the past and the teachers always sucked. Maybe they were good in terms of getting you to be a better player, but they were awful at making it fun.
I had a teacher who wanted me to sit with a metronome and just pick the same note over and over to the metronome to work on my meter. WTF. You're fired. All teachers want you to play scales. Nope. And then for the "fun" part they want to teach some basic rock rhythm, A-A-E-E,A-A-E-E. Nope, I'm bored. You're fired. Then I get to learn a song and it's like some basic blues I've never heard of or some fucking John Denver or something. (*)
They seem to completely fail to understand that it has to keep the student interested. That's part of your job being a teacher. I'm not trying to become a professional musician. I don't have some great passionate motivation that's going to keep me going through the boring shit and drudgery of your lessons. You have to make it fun all the time.
(* = there is some kind of weird thing where people who play music generally have horrible taste in music. It's like they pay too much attention to either the notes/chord/key or to the technical playing, neither of which actually matter much. It's the feeling, man.)
It was interesting for me to see this in ceramics. I was lucky to find a really great teacher (Bill Wilcox) who understood that it had to be fun, and that this was a bit of a lark for most of us, and some were more serious than others. Maybe he wasn't the most efficient teacher in terms of conveying maximum learning in a set time period - but he kept you coming back. We occasionally had different guest teachers, and they were way more regimented and methodical and wanted you to do drills (pull a cylinder 20 times and check the walls for evenness), and you could see half the class thinking "fuck this"
I suppose this is true of all learning for some kids. Some kids are inherently motivated, I'm going to learn because I'm supposed to, or to get into a good college, or "for my future", or to be smarter than everyone else, or whatever. But other kids see a cosine and think "wtf is that for, who cares".
I've always thought the right way to teach anything is with a goal in mind. Not just "hey learn this because you're supposed to". But "we want to build a catapult and fire it and hit a target. Okay, we're going to need to learn about angles and triangles and such...". The best/easiest/deepest learning is what you learn because you need to learn it to accomplish something that you want to do.
What I need in guitar is a series of mini goals & accomplishments. Hey here's this new song, and it's a little bit too hard for me, but it's a fucking cool song so I actually want to play it. So I practice for a while and get better, and then I can play it, yay! Then I move on to the next one. Just like good game design. And WTF it just doesn't seem to exist.
Hello, yes blah blah some stuff. I need to know these points : 1. What about A? 2. There is also B? 3. and finally C?
and I usually get a response like :
Yep, great!
Umm. WTF. You are fucking fired from your job, from life, from the planet, go away.
Yep to what? There were THREE questions in there. And none of them was really a yes/no question anyway. WTF.
So I'll try to be polite and send back something like -
Thanks for the response; yes, to what exactly? Did you mean yes to A?
Also I still need to know about B & C.
and then I'll get a response like :
Ok, on B we do this and that.
Umm. Okay, that's better. We got one answer, but THERE WERE THREE FUCKING QUESTIONS.
I fucking numbered them so you could count them. That means I need three answers.
Sometimes I'll get a response like :
Ramble ramble, some unrelated stuff, sort of answer maybe A and C but not exactly, some other rambling.
Okay. Thanks for writing a lot of words, but I HAD SPECIFIC FUCKING QUESTIONS.
This is basic fucking professionalism.
Jesus christ.
Everybody knows when you move to Seattle and get SAD you have to take vitamin D. So all these years I've been taking a few Vit D pills every day in the winter.
Recently my hair has been falling out, which has never happened to me before. I was pretty sure it was just stress, but I thought hey WTF may as well get a blood test and see if anything is wrong. So I get a blood test. Everything is normal, except -
My vit D levels were way way below normal. There's a normal range (3-7) and I was like a 1.
I was like, WTF? I take 1-2 vit D pills every day.
Turns out those pills are 1000 IU each, so I was taking 1-2000 IU. I thought that was a hell of a lot (it's ten million percent of the US RDA). Nope, not a lot. My doc said I could take 8000-10,000 IU to get my levels back up to normal, then maintenance was more like 5000 IU.
So, I started taking big doses, and BOOM instant happiness. More energy, less depression.
I still fucking hate the gray and the wet. (I recently got some foot fungus from hiking on a rainy day. I hate the fucking wet. I'd like to live on Arrakis and never see a single drop of rain again in my life.) But hey with proper vit D dosing I don't want to kill myself every day. Yay.
Pound that D, yo!
This is the core of Chameleon's encoder :
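// per input u32 : hash it ; if the hash table already holds this exact value,
// set a flag bit and emit just the 16-bit hash index ; otherwise update the
// table and emit the raw 4 bytes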
cur = *fm32++; h = CHAMELEON_HASH(cur); flags <<= 1;
if ( c->hash[h] == cur ) { flags ++; *to16++ = (uint16) h; }
else { c->hash[h] = cur; *((uint32 *)to16) = cur; to16 += 2; }
This is the decoder :
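// per output u32 : test the next flag bit (consumed from the top of 'flags') ;
// if set, fetch the value from the hash table by the transmitted 16-bit index ;
// otherwise read the raw 4 bytes and update the table the same way the encoder did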
if ( (int16)flags < 0 ) { cur = c->hash[ *fm16++ ]; }
else { cur = *((const uint32 *)fm16); fm16 += 2; c->hash[ CHAMELEON_HASH(cur) ] = cur; }
flags <<= 1; *to32++ = cur;
I thought it deserved a super-simple STB-style header-only dashfuly-described implementation :
My Chameleon.h is not portable or safe or any of that jizzle. Maybe it will be someday. (Update : now builds on GCC & clang. Tested on PS4. Still not Endian-invariant.)
// Usage :
#define CHAMELEON_IMPL
#include "Chameleon.h"
Chameleon c;
Chameleon_Reset(&c);
size_t comp_buf_size = CHAMELEON_MAXIMUM_OUTPUT_SIZE(in_size);
void * comp_buf = malloc(comp_buf_size);
size_t comp_len = Chameleon_Encode(&c, comp_buf, in_buf, in_size );
Chameleon_Reset(&c);
Chameleon_Decode(&c, out_buf, in_size, comp_buf );
int cmp = memcmp(in_buf,out_buf,in_size);
assert( cmp == 0 );
ADD : Chameleon2 SIMD prototype now posted : (NOTE : this is not good, do not use)
Chameleon2.h - experimental SIMD wide Chameleon
both Chameleons in a zip
The SIMD encoder is not fast. Even on SSE4 it only barely beats scalar Chameleon. So this is a dead end. Maybe some day when we get fast hardware scatter/gather it will be good (*).
(* = though use of hardware scatter here is always going to be treacherous, because hashes may be repeated, and the order in which collisions resolve must be consistent)
Density contains 3 algorithms, from super fast to slower : Chameleon, Cheetah, Lion.
They all attain speed primarily by working on U32 quanta of input, rather than bytes. They're sort of LZPish type things that work on U32's, which is a reasonable way to get speed in this modern world. (Cheetah and Lion are really similar to the old LZP1/LZP2 with bit flags for different predictors, or to some of the LZRW's that output forward hashes; the main difference is working on U32 quanta and no match lengths)
The compression ratio is very poor. The highest compression option (Lion) is around LZ4-fast territory, not as good as LZ4-hc. But, are they Pareto? Is it a good space-speed tradeoff?
Well, I can't build Density (I use MSVC) so I can't test their implementation for space-speed.
Compressed sizes :
lzt99 :
uncompressed 24,700,820
density :
c0 Chameleon 19,530,262
c1 Cheetah 17,482,048
c2 Lion 16,627,513
lz4 -1 16,193,125
lz4 -9 14,825,016
Oodle -1 (LZB) 16,944,829
Oodle -2 (LZB) 16,409,913
Oodle LZNIB 12,375,347
(lz4 -9 is not competitive for encode time, it's just to show the level of compression you could get at very fast decode speeds if you don't care about encode time ; LZNIB is an even more extreme case of the same thing - slow to encode, but decode time comparable to Chameleon).
To check speed I did my own implementation of Chameleon (which I believe to be faster than Density's, so it's a fair test). See the next post to get my implementation.
The results are :
comp_len = 19492042
Chameleon_Encode_Time : seconds:0.0274 ticks per: 1.919 mb/s : 901.12
Chameleon_Decode_Time : seconds:0.0293 ticks per: 2.050 mb/s : 843.31
round trip time = 0.05670
I get a somewhat smaller file size than Density's version for unknown reason.
Let's compare to Oodle's LZB (an LZ4ish) :
Oodle -1 :
24,700,820 ->16,944,829 = 5.488 bpb = 1.458 to 1
encode : 0.061 seconds, 232.40 b/kc, rate= 401.85 mb/s
decode : 0.013 seconds, 1071.15 b/kc, rate= 1852.17 mb/s
round trip time = 0.074
Oodle -2 :
24,700,820 ->16,409,913 = 5.315 bpb = 1.505 to 1
encode : 0.070 seconds, 203.89 b/kc, rate= 352.55 mb/s
decode : 0.014 seconds, 1008.76 b/kc, rate= 1744.34 mb/s
round trip time = 0.084
lzt99 is a collection of typical game data files.
We can test on enwik8 (text/html) too :
Chameleon :
enwik8 :
Chameleon_Encode_Time : seconds:0.1077 ticks per: 1.862 mb/s : 928.36
Chameleon_Decode_Time : seconds:0.0676 ticks per: 1.169 mb/s : 1479.08
comp_len = 61524068
Oodle -1 :
enwik8 :
100,000,000 ->57,267,299 = 4.581 bpb = 1.746 to 1
encode : 0.481 seconds, 120.17 b/kc, rate= 207.79 mb/s
decode : 0.083 seconds, 697.58 b/kc, rate= 1206.19 mb/s
Here Chameleon is much more compelling. It's competitive for size & decode speed, not just encode speed.
Commentary :
Any time you're storing files on disk, this is not the right algorithm. You want something more asymmetric (slow compress, fast decompress).
I'm not sure if Cheetah and Lion are Pareto for round trip time. I'd have to test speed on a wider set of sample data.
When do you actually want a compressor that's this fast and gets so little compression? I'm not sure.
I'm showing literal correlation by making an image of the histogram.
That is, given an 8-bit predictor, you tally each event :
int histo[256][256]
histo[predicted][value] ++
then I scale the histo so the max is at 255 and make it into an image.
Most of the images that I show are in log scale, otherwise all the detail is too dark, dominated by a few peaks. I also sometimes remove the predicted=value line, so that the off axis detail is more visible.
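For concreteness, here's a sketch of building one such image (my own code, not exactly what was used for the images below) :

#include <stdint.h>
#include <string.h>
#include <math.h>

// tally (predicted,value) pairs, then scale counts to 0..255 pixels ;
// optionally log scale, optionally remove the predicted==value diagonal
static void make_histo_image(const uint8_t * pred, const uint8_t * val, size_t n,
                             uint8_t out[256][256], int log_scale, int remove_diagonal)
{
    static uint32_t histo[256][256];
    memset(histo, 0, sizeof(histo));
    for (size_t i = 0; i < n; i++)
        histo[ pred[i] ][ val[i] ] ++;

    if ( remove_diagonal )
        for (int i = 0; i < 256; i++) histo[i][i] = 0;

    double maxv = 0;
    for (int y = 0; y < 256; y++)
    for (int x = 0; x < 256; x++)
    {
        double v = log_scale ? log2(1.0 + histo[y][x]) : (double) histo[y][x];
        if ( v > maxv ) maxv = v;
    }

    for (int y = 0; y < 256; y++)
    for (int x = 0; x < 256; x++)
    {
        double v = log_scale ? log2(1.0 + histo[y][x]) : (double) histo[y][x];
        out[y][x] = (uint8_t)( maxv > 0 ? (v * 255.0 / maxv + 0.5) : 0 );
    }
}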
Let's stop a moment and look at what we can see in these images.
This is a literal histo of "lzt99" , using predicted = lolit (last offset literal; the rep0len1 literal). This is in log scale, with the diagonal removed :
In my images y = prediction and x = current value. x=0, y=0 is in the upper left instead of the lower left where it should be.
The order-0 probability is the vertical line sum for each x. So any vertical lines indicate just strong order-0 correlations.
Most files are a mix of different probability sources, which makes these images look like a sum of different contributing factors.
The most obvious factor here is the diagonal line at x=y. That's just a strong value=predicted generator.
The red blob is a cluster of events around x and y = 0. This indicates a probability event that's related to |x+y| being small. That is, the sum, or length, or something tends to be small.
The green shows a square of probabilities. A square indicates that for a certain range of y's, all x's are equally likely. In this case the range is 48-58. So if y is in 48-58, then any x in 48-58 is equally likely.
There are similar weaker squarish patterns all along the diagonal. Surprisingly these are *not* actually at the binary 8/16 points you might expect. They're actually in steps of 6 & 10.
The blue blobs are at x/y = 64/192. There's a funny very specific strong asymmetric pattern in these. When y = 191 , it predicts x=63,62,61,60 - but NOT 64,65,66. Then at y=192, predict x=64,65,66, but not 63.
In addition to the blue blobs, there are weak dots at all the 32 multiples. This indicates that when y= any multiple of 32, there's a generating event for x = any multiple of 32. (Note that in log scale, these dots look more important than they really are.). There are also some weak order-0 generators at x=32 and so on.
There's some just general light gray background - that's just uncompressible random data (as seen by this model anyway).
Here's a bunch of images : (click for hi res)
[image grid : one histogram image per cell]
rows : Fez LO , Fez O1 , lzt24 LO , lzt24 O1 , lzt99 LO , lzt99 O1 , enwik7 LO , enwik7 O1
columns : raw , sub , xor ; each shown in log , logND , linND
details :
LO means y axis (predictor) is last-offset-literal , in an LZ match parse. Only the literals coded by the LZ are shown.
O1 means y axis is order1 (previous byte). I didn't generate the O1 from the LZ match parse, so it's showing *all* bytes in the file, not just the literals from the LZ parse.
"log" is just log-scale of the histo. An octave (halving of probability) is 16 pixel levels.
"logND" is log without the x=y diagonal. An octave is 32 pixel levels.
"linND" is linear, without the x=y diagonal.
"raw" means the x axis is just the value. "xor" means the x axis is value^predicted. "sub" means the x axis is (value-predicted+127).
Note that raw/xor/sub are just permutations of the values along a horizontal axis, they don't change the values.
Discussion :
The goal of a de-correlating transform is to create vertical lines. Vertical lines are order-0 probability peaks and can be coded without using the predictor as context at all.
If you use an order-0 coder, then any detail which is not in a vertical line is an opportunity for compression that you are passing up.
"Fez" is obvious pure delta data. "sub" is almost a perfect model for it.
"lzt24" has two (three?) primary probability sources. One is almost pure "sub" x is near y data.
The other sources, however, do not do very well under sub. They are pure order-0 peaks at x=64 and 192 (vertical lines in the "raw" image), and also those strange blobs of correlation at (x/y = 64 and 192). The problem is "sub" turns those vertical lines into diagonal lines, effectively smearing them all over the probability spectrum.
A compact but full model for the lzt24 literals would be like this :
is y (predictor) near 64 or 192 ?
if so -> strongly predict x = 64 or 192
else -> predict x = y or x = 64 or 192 (weaker)
lzt99, being more heterogeneous, has various sources.
"xor" takes squares to squares. This works pretty well on text.
In general, the LO correlation is easier to model than O1.
The lzt99 O1 histo in particular has lots of funny stuff. There are a bunch of non-diagonal lines, indicating things like x=y/4 patterns, which is odd.
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 11-02-11 - StringMatchTest Release
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
From fast to slow :
All my fast matchers now use "cache tables". In fact I now use cache tables all the way up to my "Normal" level (default level; something like zip -7).
With cache tables you have a few primary parameters :
hash bits
hash ways
2nd hash
very fastest :
  0 : hash ways = 1 , 2nd hash = off
  1 : hash ways = 2 , 2nd hash = off
  2 : hash ways = 2 , 2nd hash = on
  ...
      hash ways = 16 , 2nd hash = on
The good thing about cache tables is the cpu cache coherency. You look up by the hash, and then all your possible matches are right there in one cache line. (there's an option of whether you store the first U32 of each match right there in the cache table to avoid a pointer chase to check the beginning of the match).
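A sketch of the idea (mine, not Oodle's; the hash function, table size, and replacement policy are just placeholders) :

#include <stdint.h>
#include <string.h>

#define CT_HASH_BITS 17
#define CT_WAYS      4

typedef struct
{
    uint32_t pos  [1 << CT_HASH_BITS][CT_WAYS];   // candidate match positions
    uint32_t check[1 << CT_HASH_BITS][CT_WAYS];   // first 4 bytes at each position
} CacheTable;

static uint32_t ct_hash(uint32_t x) { return (x * 2654435761u) >> (32 - CT_HASH_BITS); }

// look up matches for the bytes at buf[pos] ; returns best length found (0 if none)
// caller ensures pos+4 <= buf_len ; a real version also handles table init / stale entries
static int cachetable_find(CacheTable * t, const uint8_t * buf, uint32_t pos,
                           uint32_t buf_len, uint32_t * pmatch_pos)
{
    uint32_t cur; memcpy(&cur, buf + pos, 4);
    uint32_t h = ct_hash(cur);
    int best_len = 0;

    for (int w = 0; w < CT_WAYS; w++)
    {
        if ( t->check[h][w] == cur )   // cheap filter : first 4 bytes match, no pointer chase
        {
            uint32_t mp = t->pos[h][w];
            int len = 4;
            while ( pos + len < buf_len && buf[mp + len] == buf[pos + len] ) len++;
            if ( len > best_len ) { best_len = len; *pmatch_pos = mp; }
        }
    }

    // insert the current position, pushing older entries down one way
    for (int w = CT_WAYS-1; w > 0; w--)
    {
        t->pos[h][w]   = t->pos[h][w-1];
        t->check[h][w] = t->check[h][w-1];
    }
    t->pos[h][0] = pos; t->check[h][0] = cur;

    return best_len;
}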
Cache tables are a superb space-speed tradeoff up until ways hits around 16, and then they start to lose to hash-link.
Hash-link :
Hash link is still good, but can be annoying to make fast (*) and does have bad degenerate case behavior (when you have a bad hash collision, the links on that chain get overloaded with crap).
(* = you have to do dynamic amortization and shite like that which is not a big deal, but ugly; this is to handle the incompressible-but-lots-of-hash-collisions case, and to handle the super-compressible-lots-of-redundant-matches case).
The good thing about hash-link is that you are strictly walking matches in increasing offset order. This means you only need to consider longer lengths, which helps break the O(N^2) problem in practice (though not in theory). It also gives you a very easy way to use a heuristic to decide if a match is better or not. You're always doing a simple compare :
previous best match vs.
new match with
higher offset
longer length
which is a lot simpler than something like the cache table case where you see your matches in random order.
Being rather redundant : the nice thing about hash-link is that any time you find a match length, you know absolutely that you have the lowest offset occurrence of that match length.
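For contrast, a sketch of hash-link (again mine, not Oodle's; the fixed walk limit stands in for the dynamic amortization mentioned above, and head[] must be initialized to -1) :

#include <stdint.h>
#include <string.h>

#define HL_HASH_BITS 16

typedef struct
{
    int32_t   head[1 << HL_HASH_BITS];  // most recent position per hash (-1 = empty)
    int32_t * link;                     // one entry per buffer position ; previous pos with same hash
} HashLink;

static uint32_t hl_hash(const uint8_t * p)
{
    uint32_t x; memcpy(&x, p, 4);
    return (x * 2654435761u) >> (32 - HL_HASH_BITS);
}

// walk the chain at buf[pos] ; candidates come in increasing offset order,
// so a candidate is only interesting if it beats the current best length
static int hashlink_find(HashLink * hl, const uint8_t * buf, int pos, int buf_len,
                         int max_walk, int * pmatch_pos)
{
    uint32_t h = hl_hash(buf + pos);
    int best_len = 0;

    for (int cand = hl->head[h]; cand >= 0 && max_walk-- > 0; cand = hl->link[cand])
    {
        // quick reject : can this candidate possibly be longer than best_len ?
        if ( best_len > 0 && ( pos + best_len >= buf_len ||
                               buf[cand + best_len] != buf[pos + best_len] ) )
            continue;
        int len = 0;
        while ( pos + len < buf_len && buf[cand + len] == buf[pos + len] ) len++;
        if ( len > best_len ) { best_len = len; *pmatch_pos = cand; }
    }

    // add the current position at the head of its chain
    hl->link[pos] = hl->head[h];
    hl->head[h]   = pos;

    return best_len;
}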
I'm not so high on Suffix Tries any more.
*if* your LZ just needs the longest length at each position, they're superb. If you actually need the best match at every position (traditional optimal parse), they're superb. eg. if you were doing LZSS with large fixed-size offsets, you just want to find the longest match all the time - boom Suffix Trie is the answer. They have no bad degenerate case, that's great.
But in practice on a modern LZ they have problems.
The main problem is that a Suffix Trie is actually quite bad at finding the lowest offset occurrence of a short match. And that's a very important thing to be good at for LZ. The problem is that a proper ST with follows is doing its updates way out deep in the leaves of the tree, but the short matches are up at the root, and they are pointing at the *first* occurrence of that substring. If you update all parents to the most recent pointer, you lose your O(N) completely.
(I wrote about this problem before : cbloom rants 08-22-13 - Sketch of Suffix Trie for Last Occurance )
You can do something ugly like use a suffix trie to find long matches and a hash->link with a low walk limit to find the most recent occurrence of short matches. But bleh.
And my negativity about Suffix Tries also comes from another point :
Match finding is not that important. Well, there are a lot of caveats on that. On structured data (not text), with a pos-state-lastoffset coder like LZMA - match finding is not that important. Or rather, parsing is more important. Or rather, parsing is a better space-speed tradeoff.
It's way way way better to run an optimal parse with a crap match finder (even cache table with low ways) than to run a heuristic parse with great match finder. The parse is just way more important, and per CPU cost gives you way more win.
And there's another issue :
With a forward optimal parse, you can actually avoid finding matches at every position.
There are a variety of ways to skip ahead in a forward optimal parse :
Any time you find a very long match -
  just take it and skip ahead
  (eg. fast bytes in LZMA)
  this can reduce the N^2 penalty of a bad match finder

When you are not finding any matches -
  start taking multi-literal steps
  using something like (failedMatches>>6) heuristic

When you find a long enough rep match -
  just take it
  and this "long enough" can be much less than "fast bytes"
  eg. fb=128 for skipping normal matches
  but you can just take a rep match at length >= 8
  which occurs much more often
the net result is lots of opportunity for more of a "greedy" type of match finding in your optimal parser, where you don't need every match.
This means that good greedy-parse match finders like hash-link and Yann's MMC ( my MMC note ) become interesting again (even for optimal parsing).
I'm showing the "total time to load" (time to load off disk at a simulated disk speed + time to decompress). You always want lower total time to load - smaller files make the simulated load time less, faster decompression make the decompress time less.
total_time_to_load = compressed_size / disk_speed + decompress_time
It looks neatest in the form of "speedup". "speedup" is the ratio of the effective speed
vs. the speed of the disk :
effective_speed = raw_size / total_time_to_load
speedup = effective_speed / disk_speed
By varying disk speed you can see the tradeoff of compression ratio vs. cpu usage that makes different compressors better in different application domains.
If we write out what speedup is :
speedup = raw_size / (compressed_size + decompress_time * disk_speed)
speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
speedup ~= harmonic_mean( compression_ratio , decompress_speed / disk_speed )
we can see that it's a "harmonic lerp" between compression ratio on one end and
decompress speed on the other end, with the simulated disk speed as lerp factor.
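For example (made-up numbers, just to show the formula at work) : a compressor with a 2:1 ratio and 1000 mb/s decode speed, on a 100 mb/s disk, gives speedup = 1 / (1/2 + 100/1000) = 1/0.6 ~= 1.67 , ie. the load finishes about 1.67X faster than loading the raw file.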
These charts show "speedup" vs. log of disk_speed :
(the log is log2, so 13 is a disk speed of 8192 mb/s).
On the left side, the curves go flat. At the far left (x -> -infinity, disk speed -> 0) the height of each curve is proportional to the compression ratio. So just looking at how they stack up on the far left tells you the compression ratio performance of each compressor. As you go right more, decompression speed becomes more and more important and compressed size less so.
With ham-fisted-shading of the regions where each compressor is best :
The thing I thought was interesting is that there's a very obvious Pareto frontier. If I draw a tangent across the best compressors :
Note that at the high end (right), the tangent goes from LZB to "memcpy" - not to "raw". (raw is just the time to load the raw file from disk, and we really have to compare to memcpy because all the compressors fill a buffer that's different from the IO buffer). (actually the gray line I drew on the right is not great, it should be going tangent to memcpy; it should be shaped just like each of the compressors' curves, flat on the left (compressed size dominant) and flat on the right (compress time dominant))
You can see there are gaps where these compressors don't make a complete Pareto set; the biggest gap is between LZA and LZH which is something I will address soon. (something like LZHAM goes there) And you can predict what the performance of the missing compressor should be.
It's also a neat way to test that all the compressors are good tradeoffs. If the gray line didn't make a nice smooth curve, it would mean that some compressor was not doing a good job of hitting the maximum speed for its compression level. (of course I could still have a systematic inefficiency; like quite possibly they're all 10% worse than they should be)
ADDENDUM :
If instead of doing speedup vs. loading raw you do speedup vs. loading raw + memcpy, you get this :
The nice thing is the right-hand asymptotes are now constants, instead of decaying like 1/disk_speed.
So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts (disk speed -> inf) show the decompression speed, and in between shows the tradeoff.
When I did the Oodle LZH I made a mistake. I used a zip-style combined codeword. Values 0-255 are a literal, and 256+ contain the log2ish of length and offset. The advantage of this is that you only have one Huff table and just do one decode, then if it's a match you also fetch some raw bits. It also models length-offset correlation by putting them in the codeword together. (this scheme is missing a lot of things that you would want in a more modern high compression LZ, like pos-state patterns and so on).
Then I added "rep matches" and just stuck them in the combined codeword as special offset values.
So the codeword was :
{
256 : literal
4*L : 4 rep matches * L length slots (L=8 or whatever = 32 codes)
O*L : O offset slots * L length slots (O=14 and L = 6 = 84 codes or whatevs)
= 256+32+84 = 372
}
The problem is that rep-match-0 can never occur after a match. (assuming you write matches of maximum length). Rep-match-0 is quite important, on binary/structured files it can have very high probability. By using a single codeword which contains rep-match-0 for all coding events, you are incorrectly mixing the statistics of the after match state (where rep-match-0 has zero probability) and after literal state (where rep-match-0 has high probability).
A quick look at the strategies for fixing this :
1. Just use separate statistics. Keep the same combined codeword structure, but have two entropy tables, one for after-match and one for after-literal. This would also let you code the literal-after-match as an xor literal with separate statistics for that.
Whether you do xor-lit or not, there will be a lot of shared probability information between the two entropy tables, so if you do static Huffman or ANS probability transmission, you may need to use the cross-two-tables-similarity in that transmission.
In a static Huffman or ANS entropy scheme if rep-match-0 never occurs in the after-match code table, it will be given a codelen of 0 (or impossible) and won't take any code space at all. (I guess it does take a little code space unless you also explicitly special case the knowledge that it must have codelen 0 in your codelen transmitter)
This is the simplest version of the more general case :
2. Context-code the rep-match event using match history. As noted just using "after match" or "after literal" as the context is the simplest version of this, but more detailed history will also affect the rep match event. This is the natural way to fix it in an adaptive arithmetic/ANS coder which uses context coding anyway. eg. this is what LZMA does.
Here we aren't forbidding rep-match after match, we're just using the fact that it never occurs to make its probability go to 0 (adaptively) and thus it winds up taking nearly zero code space. In LZMA you actually can have a rep match after match because the matches have a max length of 273, so longer matches will be written as rep matches. Ryg pointed out that after a match that's been limited by max-length, LZMA should really consider the context for the rep-match coding to be like after-literal, not after-match.
In Oodle's LZA I write infinite match lengths, so this is simplified. I also preload the probability of rep-match in the after-match contexts to be near zero. (I actually can't preload exactly zero because I do sometimes write a rep-match after match due to annoying end-of-buffer and circular-window edge cases). Preconditioning the probability saves the cost of learning that it's near zero, which saves 10-100 bytes.
3. Use different code words.
Rather than relying on statistics, you can explicitly use different code words for the after-match and after-literal case. For example in something like an LZHuf as described above, just use a codeword in the after-match case that omits the rep0 codes, and thus has a smaller alphabet.
This is most clear in something like LZNib. LZNib has three events :
LRL
Match
Rep Match
So naively it looks like you need to write a trinary decision at each coding (L,M,R). But in fact only two of them are ever possible :
After L - M or R
cannot write another L because we would've just made LRL longer
After M or R - L or M
cannot write an R because we would've just made the match longer
So LZNib writes the binary choices (M/R) after L and (L/M) after M or R. Because they're always binary choices, this allows LZNib to use the simple single-divider method of encoding values in bytes.
4. Use a combined code word that includes the conditioning state.
Instead of context modeling, you can always take previous events that you need context from and make a combined codeword. (eg. if you do Huffman on the 16-bit bigram literals, you get order-1 context coding on half the literals).
So we can make a combined codeword like :
{
(LRL 0,1,2,3+) * (rep matches , normal matches) * (lengths slots)
= 4 * (4 + 14) * 8 = 576
}
Which is a pretty big alphabet, but also combined length slots so you get lrl-offset-length correlation modeling as well.
In a combined codeword like this you are always writing a match, and any literals that precede it are written with an LRL (may be 0). The forbidden codes are the ones with LRL=0 and match = rep0, so you can either just let those get zero probabilities, or explicitly remove them from the codeword to reduce the alphabet. (there are also other forbidden codes in normal LZ parses, such as low-length high-offset codes, so you would similarly remove or adjust those)
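A tiny sketch of packing/unpacking such a combined codeword (slot counts as in the example above; the helper names are mine, purely illustrative) :

enum
{
    NUM_LRL_SLOTS    = 4,                     // LRL 0,1,2,3+
    NUM_REP          = 4,
    NUM_OFFSET_SLOTS = 14,
    NUM_MATCH_KINDS  = NUM_REP + NUM_OFFSET_SLOTS,
    NUM_LEN_SLOTS    = 8,
    NUM_CODES        = NUM_LRL_SLOTS * NUM_MATCH_KINDS * NUM_LEN_SLOTS   // = 576
};

static int pack_code(int lrl_slot, int match_kind, int len_slot)
{
    return ( lrl_slot * NUM_MATCH_KINDS + match_kind ) * NUM_LEN_SLOTS + len_slot;
}

static void unpack_code(int code, int * lrl_slot, int * match_kind, int * len_slot)
{
    *len_slot   = code % NUM_LEN_SLOTS;   code /= NUM_LEN_SLOTS;
    *match_kind = code % NUM_MATCH_KINDS; code /= NUM_MATCH_KINDS;
    *lrl_slot   = code;
}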
A more minimal codeword is just
{
(LRL 0,1+) * (rep-match-0, any other match)
= 2 * 2 = 4
}
which is enough to get the rep-match-0 can't occur after LRL 0 modeling. Or you can do anything between those two extremes to choose an alphabet size.