This is the core of Chameleon's encoder :

cur = *fm32++; h = CHAMELEON_HASH(cur); flags <<= 1; if ( c->hash[h] == cur ) { flags ++; *to16++ = (uint16) h; } else { c->hash[h] = cur; *((uint32 *)to16) = cur; to16 += 2; }

This is the decoder :

if ( (int16)flags < 0 ) { cur = c->hash[ *fm16++ ]; } else { cur = *((const uint32 *)fm16); fm16 += 2; c->hash[ CHAMELEON_HASH(cur) ] = cur; } flags <<= 1; *to32++ = cur;

I thought it deserved a super-simple STB-style header-only dashfuly-described implementation :

My Chameleon.h is not portable or safe or any of that jizzle. Maybe it will be someday. (Update : now builds on GCC & clang. Tested on PS4. Still not Endian-invariant.)

// Usage : #define CHAMELEON_IMPL #include "Chameleon.h" Chameleon c; Chameleon_Reset(&c); size_t comp_buf_size = CHAMELEON_MAXIMUM_OUTPUT_SIZE(in_size); void * comp_buf = malloc(comp_buf_size); size_t comp_len = Chameleon_Encode(&c, comp_buf, in_buf, in_size ); Chameleon_Reset(&c); Chameleon_Decode(&c, out_buf, in_size, comp_buf ); int cmp = memcmp(in_buf,out_buf,in_size); assert( comp == 0 );

ADD : Chameleon2 SIMD prototype now posted : (NOTE : this is not good, do not use)

Chameleon2.h - experimental SIMD wide Chameleon

both Chameleons in a zip

The SIMD encoder is not fast. Even on SSE4 it only barely beats scalar Chameleon. So this is a dead end. Maybe some day when we get fast hardware scatter/gather it will be good (*).

(* = though use of hardware scatter here is always going to be treacherous, because hashes may be repeated, and the order in which collisions resolve must be consistent)

Density contains 3 algorithms, from super fast to slower : Chameleon, Cheetah, Lion.

They all attain speed primarily by working on U32 quanta of input, rather than bytes. They're sort of LZPish type things that work on U32's, which is a reasonable way to get speed in this modern world. (Cheetah and Lion are really similar to the old LZP1/LZP2 with bit flags for different predictors, or to some of the LZRW's that output forward hashes; the main difference is working on U32 quanta and no match lengths)

The compression ratio is very poor. The highest compression option (Lion) is around LZ4-fast territory, not as good as LZ4-hc. But, are they Pareto? Is it a good space-speed tradeoff?

Well, I can't build Density (I use MSVC) so I can't test their implementation for space-speed.

Compressed sizes :

lzt99 : uncompressed 24,700,820 density : c0 Chameleon 19,530,262 c1 Cheetah 17,482,048 c2 Lion 16,627,513 lz4 -1 16,193,125 lz4 -9 14,825,016 Oodle -1 (LZB) 16,944,829 Oodle -2 (LZB) 16,409,913 Oodle LZNIB 12,375,347

(lz4 -9 is not competitive for encode time, it's just to show the level of compression you could get at very fast decode speeds if you don't care about encode time ; LZNIB is an even more extreme case of the same thing - slow to encode, but decode time comparable to Chameleon).

To check speed I did my own implementation of Chameleon (which I believe to be faster than Density's, so it's a fair test). See the next post to get my implementation.

The results are :

comp_len = 19492042 Chameleon_Encode_Time : seconds:0.0274 ticks per: 1.919 mbps : 901.12 Chameleon_Decode_Time : seconds:0.0293 ticks per: 2.050 mbps : 843.31 round trip time = 0.05670I get a somewhat smaller file size than Density's version for unknown reason.

Let's compare to Oodle's LZB (an LZ4ish) :

Oodle -1 : 24,700,820 ->16,944,829 = 5.488 bpb = 1.458 to 1 encode : 0.061 seconds, 232.40 b/kc, rate= 401.85 mb/s decode : 0.013 seconds, 1071.15 b/kc, rate= 1852.17 mb/s round trip time = 0.074 Oodle -2 : 24,700,820 ->16,409,913 = 5.315 bpb = 1.505 to 1 encode : 0.070 seconds, 203.89 b/kc, rate= 352.55 mb/s decode : 0.014 seconds, 1008.76 b/kc, rate= 1744.34 mb/s round trip time = 0.084

lzt99 is a collection of typical game data files.

We can test on enwik8 (text/html) too :

Chameleon : enwik8 : Chameleon_Encode_Time : seconds:0.1077 ticks per: 1.862 mbps : 928.36 Chameleon_Decode_Time : seconds:0.0676 ticks per: 1.169 mbps : 1479.08 comp_len = 61524068 Oodle -1 : enwik8 : 100,000,000 ->57,267,299 = 4.581 bpb = 1.746 to 1 encode : 0.481 seconds, 120.17 b/kc, rate= 207.79 mb/s decode : 0.083 seconds, 697.58 b/kc, rate= 1206.19 mb/shere Chameleon is much more compelling. It's competitive for size & decode speed, not just encode speed.

Commentary :

Any time you're storing files on disk, this is not the right algorithm. You want something more asymmetric (slow compress, fast decompress).

I'm not sure if Cheetah and Lion are Pareto for round trip time. I'd have to test speed on a wider set of sample data.

When do you actually want a compressor that's this fast and gets so little compression? I'm not sure.

I'm showing literal correlation by making an image of the histogram. That is, given an 8-bit predictor, you tally of each event :

int histo[256][256] histo[predicted][value] ++then I scale the histo so the max is at 255 and make it into an image.

Most of the images that I show are in log scale, otherwise all the detail is too dark, dominated by a few peaks. I also sometimes remove the predicted=value line, so that the off axis detail is more visible.

Let's stop a moment and look t what we can see in these images.

This is a literal histo of "lzt99" , using predicted = lolit (last offset literal; the rep0len1 literal). This is in log scale, with the diagonal removed :

In my images y = prediction and x = current value. x=0, y=0 is in the upper left instead of the lower left where it should be because fucking bitmaps are retarded (everyone is fired, left handed coordinate systems my ass).

The order-0 probability is the vertical line sum for each x. So any vertical lines indicate just strong order-0 correlations.

Most files are a mix of different probability sources, which makes these images look a sum of different contibuting factors.

The most obvious factor here is the diagonal line at x=y. That's just a strong value=predicted generator.

The red blob is a cluster of events around x and y = 0. This indicates a probability event that's related to |x+y| being small. That is, the sum, or length, or something tends to be small.

The green shows a square of probabilities. A square indicates that for a certain range of y's, all x's are equally likely. In this case the range is 48-58. So if y is in 48-58, then any x in 48-58 is equally likely.

There are similar weaker squarish patterns all along the diagonal. Surprisingly these are *not* actually at the binary 8/16 points you might expect. They're actually in steps of 6 & 10.

The blue blobs are at x/y = 64/192. There's a funny very specific strong asymmetric pattern in these. When y = 191 , it predicts x=63,62,61,60 - but NOT 64,65,66. Then at y=192, predict x=64,65,66, but not 63.

In addition to the blue blobs, there are weak dots at all the 32 multiples. This indicates that when y= any multiple of 32, there's a generating event for x = any multiple of 32. (Note that in log scale, these dots look more important than they really are.). There are also some weak order-0 generators at x=32 and so on.

There's some just general light gray background - that's just uncompressible random data (as seen by this model anyway).

Here's a bunch of images : (click for hi res)

raw | raw | raw | sub | sub | sub | xor | xor | xor | |

log | logND | linND | log | logND | linND | log | logND | linND | |

Fez LO | |||||||||

Fez O1 | |||||||||

lzt24 LO | |||||||||

lzt24 O1 | |||||||||

lzt99 LO | |||||||||

lzt99 O1 | |||||||||

enwik7 LO | |||||||||

enwik7 O1 |

details :

LO means y axis (predictor) is last-offset-literal , in an LZ match parse. Only the literals coded by the LZ are shown.

O1 means y axis is order1 (previous byte). I didn't generate the O1 from the LZ match parse, so it's showing *all* bytes in the file, not just the literals from the LZ parse.

"log" is just log-scale of the histo. An octave (halving of probability) is 16 pixel levels.

"logND" is log without the x=y diagonal. An octave is 32 pixel levels.

"linND" is linear, without the x=y diagonal.

"raw" means the x axis is just the value. "xor" means the x axis is value^predicted. "sub" means the x axis is (value-predicted+127).

Note that raw/xor/sub are just permutations of the values along a horizontal axis, they don't change the values.

Discussion :

The goal of a de-correlating transform is to create vertical lines. Vertical lines are order-0 probability peaks and can be coded without using the predictor as context at all.

If you use an order-0 coder, then any detail which is not in a vertical line is an opportunity for compression that you are passing up.

"Fez" is obvious pure delta data. "sub" is almost a perfect model for it.

"lzt24" has two (three?) primary probability sources. One is almost pure "sub" x is near y data.

The other sources, however, do not do very well under sub. They are pure order-0 peaks at x=64 and 192 (vertical lines in the "raw" image), and also those strange blobs of correlation at (x/y = 64 and 192). The problem is "sub" turns those vertical lines into diagonal lines, effectively smearing them all over the probability spectrum.

A compact but full model for the lzt24 literals would be like this :

is y (predictor) near 64 or 192 ? if so -> strongly predict x = 64 or 192 else -> predict x = y or x = 64 or 192 (weaker)

lzt99, being more heterogenous, has various sources.

"xor" takes squares to squares. This works pretty well on text.

In general, the LO correlation is easier to model than O1.

The lzt99 O1 histo in particular has lots of funny stuff. There are bunch of non-diagonal lines, indicating things like x=y/4 patterns, which is odd.

cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion

cbloom rants 11-02-11 - StringMatchTest Release

cbloom rants 09-24-12 - LZ String Matcher Decision Tree

From fast to slow :

All my fast matchers now use "cache tables". In fact I now use cache tables all the way up to my "Normal" level (default level; something like zip -7).

With cache tables you have a few primary parameters :

hash bits hash ways 2nd hash very fastest : 0 : hash ways =1 2nd hash = off 1 : hash ways = 2 2nd hash = off 2 : hash ways = 2 2nd hash = on ... hash ways = 16 2nd hash = onThe good thing about cache tables is the cpu cache coherency. You look up by the hash, and then all your possible matches are right there in one cache line. (there's an option of whether you store the first U32 of each match right there in the cache table to avoid a pointer chase to check the beginning of the match).

Cache tables are superb space-speed tradeoff up until ways hits around 16, and then they start to lose to hash-link.

Hash-link :

Hash link is still good, but can be annoying to make fast (*) and does have bad degenerate case behavior (when you have a bad hash collision, the links on that chain get overloaded with crap).

(* = you have to do dynamic amortization and shite like that which is not a big deal, but ugly; this is to handle the incompressible-but-lots-of-hash-collisions case, and to handle the super-compressible-lots-of- redundant-matches case).

The good thing about hash-link is that you are strictly walking matches in increasing offset order. This means you only need to consider longer lengths, which helps break the O(N^2) problem in practice (though not in theory). It also gives you a very easy way to use a heuristic to decide if a match is better or not. You're always doing a simple compare :

previous best match vs. new match with higher offset longer lengthwhich is a lot simpler than something like the cache table case where you see your matches in random order.

Being rather redundant : the nice thing about hash-link is that any time you find a match length, you know absolutely that you have the lowest offset occurance of that match length.

I'm not so high on Suffix Tries any more.

*if* your LZ just needs the longest length at each position, they're superb. If you actually need the best match at every position (traditional optimal parse), they're superb. eg. if you were doing LZSS with large fixed-size offsets, you just want to find the longest match all the time - boom Suffix Trie is the answer. They have no bad degenerate case, that's great.

But in practice on a modern LZ they have problems.

The main problem is that a Suffix Trie is actually quite bad at finding the lowest offset occurance of a short match. And that's a very important thing to be good at for LZ. The problem is that a proper ST with follows is doing its updates way out deep in the leaves of the tree, but the short matches are up at the root, and they are pointing at the *first* occurance of that substring. If you update all parents to the most recent pointer, you lose your O(N) completely.

(I wrote about this problem before : cbloom rants 08-22-13 - Sketch of Suffix Trie for Last Occurance )

You can do something ugly like use a suffix trie to find long matches and a hash->link with a low walk limit to find the most recent occurance of short matches. But bleh.

And my negativity about Suffix Tries also comes from another point :

Match finding is not that important. Well, there are a lot of caveats on that. On structured data (not text), with a pos-state-lastoffset coder like LZMA - match finding is not that important. Or rather, parsing is more important. Or rather, parsing is a better space-speed tradeoff.

It's way way way better to run an optimal parse with a crap match finder (even cache table with low ways) than to run a heuristic parse with great match finder. The parse is just way more important, and per CPU cost gives you way more win.

And there's another issue :

With a forward optimal parse, you can actually avoid finding matches at every position.

There are a variety of ways to get to skip ahead in a forward optimal parse :

Any time you find a very long match - just take it and skip ahead (eg. fast bytes in LZMA) this can reduce the N^2 penalty of a bad match finder When you are not finding any matches - start taking multi-literal steps using something like (failedMatches>>6) heuristic When you find a long enough rep match - just take it and this "long enough" can be much less than "fast bytes" eg. fb=128 for skipping normal matches but you can just take a rep match at length >= 8 which occurs much more oftenthe net result is lots of opportunity for more of a "greedy" type of match finding in your optimal parser, where you don't need every match.

This means that things like hash-link and Yann's MMC ( my MMC note ) become interesting again (even for optimal parsing).

I'm showing the "total time to load" (time to load off disk at a simulated disk speed + time to decompress). You always want lower total time to load - smaller files make the simulated load time less, faster decompression make the decompress time less.

total_time_to_load = compressed_size / disk_speed + decompress_time

It looks neatest in the form of "speedup". "speedup" is the ratio of the effective speed vs. the speed of the disk :

effective_speed = raw_size / total_time_to_load speedup = effective_speed / disk_speedBy varying disk speed you can see the tradeoff of compression ratio vs. cpu usage that makes different compressors better in different application domains.

If we write out what speedup is :

speedup = raw_size / (compressed_size + decompress_time * disk_speed) speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed) speedup ~= harmonic_mean( compression_ratio , decompress_speed / disk_speed )we can see that it's a "harmonic lerp" between compression ratio on one end and decompress speed on the other end, with the simulated disk speed as lerp factor.

These charts show "speedup" vs. log of disk_speed :

(the log is log2, so 13 is a disk speed of 8192 mb/s).

On the left side, the curves go flat. At the far left (x -> -infinity, disk speed -> 0) the height of each curve is proportional to the compression ratio. So just looking at how they stack up on the far left tells you the compression ratio performance of each compressor. As you go right more, decompression speed becomes more and more important and compressed size less so.

With ham-fisted-shading of the regions where each compressor is best :

The thing I thought was interesting is that there's a very obvious Pareto frontier. If I draw a tangent across the best compressors :

Note that at the high end (right), the tangent goes from LZB to "memcpy" - not to "raw". (raw is just the time to load the raw file from disk, and we really have to compare to memcpy because all the compressors fill a buffer that's different from the IO buffer). (actually the gray line I drew on the right is not great, it should be going tangent to memcpy; it should be shaped just like each of the compressors' curves, flat on the left (compressed size dominant) and flat on the right (compress time dominant))

You can see there are gaps where these compressors don't make a complete Pareto set; the biggest gap is between LZA and LZH which is something I will address soon. (something like LZHAM goes there) And you can predict what the performance of the missing compressor should be.

It's also a neat way to test that all the compressors are good tradeoffs. If the gray line didn't make a nice smooth curve, it would mean that some compressor was not doing a good job of hitting the maximum speed for its compression level. (of course I could still have a systematic inefficiency; like quite possibly they're all 10% worse than they should be)

ADDENDUM :

If instead of doing speedup vs. loading raw you do speedup vs. loading raw + memcpy, you get this :

The nice thing is the right-hand asymptotes are now constants, instead of decaying like 1/disk_speed.

So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff.

When I did the Oodle LZH I made a mistake. I used a zip-style combined codeword. Values 0-255 are a literal, and 256+ contain the log2ish of length and offset. The advantage of this is that you only have one Huff table and just do one decode, then if it's a match you also fetch some raw bits. It also models length-offset correlation by putting them in the codeword together. (this scheme is missing a lot of things that you would want in a more modern high compression LZ, like pos-state patterns and so on).

Then I added "rep matches" and just stuck them in the combined codeword as special offset values. So the codeword was :

{ 256 : literal 4*L : 4 rep matches * L length slots (L=8 or whatever = 32 codes) O*L : O offset slots * L length slots (O=14 and L = 6 = 84 codes or whatevs) = 256+32+84 = 372 }

The problem is that rep-match-0 can never occur after a match. (assuming you write matches of maximum length). Rep-match-0 is quite important, on binary/structured files it can have very high probability. By using a single codeword which contains rep-match-0 for all coding events, you are incorrectly mixing the statistics of the after match state (where rep-match-0 has zero probability) and after literal state (where rep-match-0 has high probability).

A quick look at the strategies for fixing this :

1. Just use separate statistics. Keep the same combined codeword structure, but have two entropy tables, one for after-match and one for after-literal. This would also let you code the literal-after-match as an xor literal with separate statistics for that.

Whether you do xor-lit or not, there will be a lot of shared probability information between the two entropy tables, so if you do static Huffman or ANS probability transmission, you may need to use the cross-two-tables-similary in that transmission.

In a static Huffman or ANS entropy scheme if rep-match-0 never occurs in the after-match code table, it will be given a codelen of 0 (or impossible) and won't take any code space at all. (I guess it does take a little code space unless you also explicitly special case the knowledge that it must have codelen 0 in your codelen transmitter)

This is the simplest version of the more general case :

2. Context-code the rep-match event using match history. As noted just using "after match" or "after literal" as the context is the simplest version of this, but more detailed history will also affect the rep match event. This is the natural way to fix it in an adaptive arithmetic/ANS coder which uses context coding anyway. eg. this is what LZMA does.

Here we aren't forbidding rep-match after match, we're just using the fact that it never occur to make its probability go to 0 (adaptively) and thus it winds up taking nearly zero code space. In LZMA you actually can have a rep match after match because the matches have a max length of 273, so longer matches will be written as rep matches. Ryg pointed out that after a match that's been limitted by max-length, LZMA should really consider the context for the rep-match coding to be like after-literal , not after-match.

In Oodle's LZA I write infinite match lengths, so this is simplified. I also preload the probability of rep-match in the after-match contexts to be near zero. (I actually can't preload exactly zero because I do sometimes write a rep-match after match due to annoying end-of-buffer and circular-window edge cases). Preconditioning the probability saves the cost of learning that it's near zero, which saves 10-100 bytes.

3. Use different code words.

Rather than relying on statistics, you can explicitly use different code words for the after-match and after-literal case. For example in something like an LZHuf as described above, just use a codeword in the after-match case that omits the rep0 codes, and thus has a smaller alphabet.

This is most clear in something like LZNib. LZNib has three events :

LRL Match Rep MatchSo naively it looks like you need to write a trinary decision at each coding (L,M,R). But in fact only two of them are ever possible :

After L - M or R cannot write another L because we would've just made LRL longer After M or R - L or M cannot write an R because we wouldn't just made match longerSo LZNib writes the binary choices (M/R) after L and (L/M) after M or R. Because they're always binary choices, this allows LZNib to use the simple single-divider method of encoding values in bytes .

4. Use a combined code word that includes the conditioning state.

Instead of context modeling, you can always take previous events that you need context from and make a combined codeword. (eg. if you do Huffman on the 16-bit bigram literals, you get order-1 context coding on half the literals).

So we can make a combined codeword like :

{ (LRL 0,1,2,3+) * (rep matches , normal matches) * (lengths slots) = 4 * (4 + 14) * 8 = 576 }

Which is a pretty big alphabet, but also combined length slots so you get lrl-offset-length correlation modeling as well.

In a combined codeword like this you are always writing a match, and any literals that precede it are written with an LRL (may be 0). The forbidden codes are LRL=0 and match =rep0 , so you can either just let those get zero probabilities, or explicitly remove them from the codeword to reduce the alphabet. (there are also other forbidden codes in normal LZ parses, such as low-length high-offset codes, so you would similarly remove or adjust those)

A more minimal codeword is just

{ (LRL 0,1+) * (rep-match-0, any other match) = 2 * 2 = 4 }

which is enough to get the rep-match-0 can't occur after LRL 0 modeling. Or you can do anything between those two extremes to choose an alphabet size.

04/2012 to 02/2015

06/2011 to 04/2012

01/2011 to 06/2011

10/2010 to 01/2011

01/2010 to 10/2010

01/2009 to 12/2009

10/2008 to 01/2009

08/2008 to 10/2008

03/2008 to 08/2008

11/2007 to 03/2008

07/2006 to 11/2007

12/2005 to 07/2006

06/2005 to 12/2005

01/1999 to 06/2005