05-17-16 | The Weissman Score

Wikipedia suggests the Weissman score should be

which ignoring constants is just W = r/logT

That's just wrong. You don't take a logarithm of something with units. But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a logarithm of time should be involved. Just not like that.

I present a formula which I call the correct Weissman Score :

W = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )  -
    comp_ratio * log10( 1 + speed/(disk_speed_hi *comp_ratio) )


W = comp_ratio * log10( ( comp_ratio + speed/disk_speed_lo ) / ( comp_ratio + speed/disk_speed_hi ) )

You can have a Weissman score for encode speed or decode speed. It's a measure of space-speed tradeoff success.

I suggest the range should be 1-256. disk_speed_lo = 1 MB/s (to evaluate performance on very slow channels, favoring small files), disk_speed_hi = 256 MB/s (to evalue performance on very fast disks, favoring speed). And because 1 and 256 are amongst programmer's favorite numbers.

You could also just let the hi range go to infinity. Then you don't need a hi disk speed parameter and you get :

Weissman-infinity = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )

with disk_speed_lo = 1 MB/s ; which is neater, though this favors fast compressors more than you might like. While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses the score more on the area that most people care about.

I came up with this formula because I started thinking about summarizing a score from the Pareto charts I've made . What if you took the speedup value at several (log-scale) disk speeds; like you could take the speedup at 1 MB/s,2 MB/s,4 MB/s, and just average them? speedup is a good way to measure a compressor even if you don't actually care about speed. Well, rather than just average a bunch of points, what if I average *all* points? eg. integrate to get the area under the curve? Note that we're integrating in log-scale of disk speed.

Turns out you can just do that integral :

    speedup = (time to load uncompressed) / (time to load compressed + decompress)
    speedup = (raw_size/disk_speed) / (comp_size/disk_speed + raw_size/ decompress_speed)
    speedup = (1/disk_speed) / (1/(disk_speed*compression_ratio) + 1 / decompress_speed)
    speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
    speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
    speedup = compression_ratio / (1 + exp( log_disk_speed ) * compression_ratio/decompress_speed)
    speedup = compression_ratio * 1 / (1 + exp( log_disk_speed + log(compression_ratio/decompress_speed)))

speedup is a sigmoid :

    y = 1 / (1 + e^-x ) 
    Integral{y} = ln( 1 + e^x )

    x = - ( log_disk_speed + log(compression_ratio/decompress_speed) )

so substitute some variables and you get the above formula for the Weissman score.

I changed from natural log to log-base-10, which is just a constant scaling factor.

the Weissman (decode Core i7-3770 3.4 GHz) scores on Silesia are :

lz4hc    : 6.243931
zstdmax  : 7.520236
lzham    : 6.924379
lzma     : 5.460073
zlib9    : 5.198510
Kraken   : 8.431461

Weissman-infinity scores are :

lz4hc    : 7.983104
zstdmax  : 8.168544
lzham    : 7.277707
lzma     : 5.589155
zlib9    : 5.630476
Kraken   : 9.551152

Goal : beat 10.0 !

05-15-16 | PS4 Battle : LZ4 vs LZSSE vs Oodle

PS4 is the arena. Three compressors enter. (revised 05-16)

Advantages of the PS4 : consistent well-defined hardware for reproducible testing. Slow platform that game developers care about being fast on. Builds with clang which is the target of choice for some compression libraries that don't build so easily on MSVC.

After the initial version of this post, I went and fuzz-safed LZB16, so they're all directly comparable. To compare apples, look at LZ4_decompress_safe. I also include LZ4_decompress_fast for reference.

LZ4_decompress_safe : fuzz safe (*)
LZ4_decompress_fast : NOT fuzz safe
LZSSE8_Decode : fuzz safe
Oodle LZB16 : fuzz safe
LZSSE and Oodle both use multiple copies of the core loop to minimize the pain of fuzz safing. LZ4's code is much simpler, it doesn't do a ton of macro or .inl nastiness. (* = this is the only one that I would actually trust to put in a satellite or a bank or something critical, it's just so simple, it's way easier to be sure that it's correct)

Conclusion :

Comparing the two Safe open-source options, LZ4_safe vs. LZSSE8 : LZSSE8 is pretty consistently faster than LZ4_Safe on PS4 (though the difference is small). PS4 is a better platform for LZSSE than x64 (PS4 is actually a pretty bad platform for LZ4; there're a variety of issues; see GDC "Taming the Jaguar" slides ; but particular issues for LZ4 are the front-end bottleneck and cache latency). When I tested on x64 before, it was much more mixed, sometimes LZ4 was faster.

I was surprised to find that Oodle LZB16 is quite a lot faster than LZ4 on PS4. (for example, that's not the case on Windows/x64, it's much closer there). I've never run third party codecs on PS4 before. I suppose this reflects the per-platform tweaking that we spend so much time on, and I'm sure LZ4 would catch up with some time spent fiddling with the PS4 codegen.

The compression ratios are mostly close enough to not care, though LZSSE8 does a bit worse on some of the DXTC/BCn files (lightmap.bc3 and d.dds).

On some files I include Oodle LZBLW numbers (LZB-bytewise-large-window). Sometimes Oodle LZBLW is a pretty big free compression win at about the same speed. Sometimes it gets worse ratio, sometimes much worse speed. If I was a client using this, I might try LZBLW and drop down to LZB16 any time it's not a good tradeoff.

Full data :

REMINDER : LZ4_decompress_fast is not directly comparable to the others, it's not fuzz safe, the others are!

PS4 clang-3.6.1


lzt99 :

lz4hc : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 0.035 seconds, 440.51 b/kc, rate= 702.09 mb/s
LZ4_decompress_fast_time : 0.032 seconds, 483.55 b/kc, rate= 770.67 mb/s

LZSSE8  : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
LZSSE8_Decode_Time : 0.033 seconds, 467.32 b/kc, rate= 744.81 mb/s

Oodle LZB16 : lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s

Oodle LZBLW : lzt99 : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode           : 0.033 seconds, 470.39 b/kc, rate= 749.74 mb/s



lz4hc : 2,188,524 -> 2,068,268 =  7.560 bpb =  1.058 to 1
LZ4_decompress_safe_time : 0.004 seconds, 322.97 b/kc, rate= 514.95 mb/s
LZ4_decompress_fast_time : 0.004 seconds, 353.08 b/kc, rate= 562.89 mb/s

LZSSE8 : 2,188,524 -> 2,111,182 =  7.717 bpb =  1.037 to 1
LZSSE8_Decode_Time : 0.004 seconds, 360.21 b/kc, rate= 574.42 mb/s

Oodle LZB16 : texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s



lz4hc : 4,194,332 ->   632,974 =  1.207 bpb =  6.626 to 1
LZ4_decompress_safe_time : 0.005 seconds, 521.54 b/kc, rate= 831.38 mb/s
LZ4_decompress_fast_time : 0.005 seconds, 564.63 b/kc, rate= 900.46 mb/s

LZSSE8 encode    :  4,194,332 ->   684,062 =  1.305 bpb =  6.132 to 1
LZSSE8_Decode_Time : 0.005 seconds, 551.85 b/kc, rate= 879.87 mb/s

Oodle LZB16 : lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s



lz4hc : 51,220,480 ->22,062,995 =  3.446 bpb =  2.322 to 1
LZ4_decompress_safe_time : 0.083 seconds, 385.47 b/kc, rate= 614.35 mb/s
LZ4_decompress_fast_time : 0.075 seconds, 427.14 b/kc, rate= 680.75 mb/s

LZSSE8 : 51,220,480 ->22,148,366 =  3.459 bpb =  2.313 to 1
LZSSE8_Decode_Time : 0.070 seconds, 461.53 b/kc, rate= 735.59 mb/s

Oodle LZB16 : silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s

Oodle LZBLW : silesia_mozilla : 51,220,480 ->20,881,772 =  3.261 bpb =  2.453 to 1
decode           : 0.112 seconds, 285.68 b/kc, rate= 455.30 mb/s



lz4hc :   589,952 ->   116,447 =  1.579 bpb =  5.066 to 1
LZ4_decompress_safe_time : 0.001 seconds, 568.65 b/kc, rate= 906.22 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 624.81 b/kc, rate= 996.54 mb/s

LZSSE8 encode    :  589,952 ->   119,659 =  1.623 bpb =  4.930 to 1
LZSSE8_Decode_Time : 0.001 seconds, 604.14 b/kc, rate= 962.40 mb/s

Oodle LZB16 : breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s

Oodle LZBLW : breton.dds :    589,952 ->   132,934 =  1.803 bpb =  4.438 to 1
decode           : 0.001 seconds, 396.04 b/kc, rate= 630.96 mb/s



lz4hc encode     :  1,048,704 ->   656,706 =  5.010 bpb =  1.597 to 1
LZ4_decompress_safe_time : 0.001 seconds, 554.69 b/kc, rate= 884.24 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 587.20 b/kc, rate= 936.34 mb/s

LZSSE8 encode    :  1,048,704 ->   695,583 =  5.306 bpb =  1.508 to 1
LZSSE8_Decode_Time : 0.001 seconds, 551.13 b/kc, rate= 879.05 mb/s

Oodle LZB16 : d.dds :  d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s



lz4hc : 79,993,099 ->47,848,680 =  4.785 bpb =  1.672 to 1
LZ4_decompress_safe_time : 0.158 seconds, 316.67 b/kc, rate= 504.70 mb/s
LZ4_decompress_fast_time : 0.143 seconds, 350.66 b/kc, rate= 558.87 mb/s

LZSSE8 : 79,993,099 ->47,807,041 =  4.781 bpb =  1.673 to 1
LZSSE8_Decode_Time : 0.140 seconds, 358.61 b/kc, rate= 571.54 mb/s

Oodle LZB16 : all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s



lz4hc : 58,788,904 ->32,998,567 =  4.490 bpb =  1.782 to 1
LZ4_decompress_safe_time : 0.090 seconds, 412.04 b/kc, rate= 656.71 mb/s
LZ4_decompress_fast_time : 0.080 seconds, 460.55 b/kc, rate= 734.01 mb/s

LZSSE8 : 58,788,904 ->33,201,406 =  4.518 bpb =  1.771 to 1
LZSSE8_Decode_Time : 0.076 seconds, 485.14 b/kc, rate= 773.20 mb/s

Oodle LZB16 : baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

Oodle LZBLW : baby_robot_shell.gr2 : 58,788,904 ->30,207,635 =  4.111 bpb =  1.946 to 1
decode           : 0.090 seconds, 409.88 b/kc, rate= 653.26 mb/s

After posting the original version with non-fuzz-safe LZB16, I decided to just go and do the fuzz-safing for LZB16.

LZB16, PS4 clang-3.6.1

post-fuzz-safing :

    lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
    decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s
    texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
    decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s
    lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
    decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s
    silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
    decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s
    breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
    decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s
    d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
    decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s
    all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
    decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s
    baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
    decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

pre-fuzz-safing reference :

    lzt99 912.92 mb/s
    texture.bc1 598.61 mb/s
    lightmap.bc3 876.19 mb/s
    silezia_mozilla 794.72 mb/s
    breton.dds 1078.52 mb/s
    d.dds 888.73 mb/s
    all_dds 701.45 mb/s
    baby_robot_shell.gr2 877.81 mb/s

Mild speed penalty on most files.

05-15-16 | PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle

What a sucky way I've spent my work day today. Running other people's code. Yeurg. I have a headache and tunnel vision.

Everything run at max compression options, level 99, max dict size. All libs are the latest on github, downloaded today. Zlib-NG has the arch/x86 stuff enabled. PS4 is AMD Jaguar , x64.

I'm going to omit encode speeds on the per-file results for simplicity, these are pretty representative :

aow3_skin_giants.clb :
zlib-ng encode   : 2.699 seconds, 1.65 b/kc, rate= 2.63 mb/s
miniz encode     : 2.950 seconds, 1.51 b/kc, rate= 2.41 mb/s
zstd encode      : 5.464 seconds, 0.82 b/kc, rate= 1.30 mb/s
brotli-9  encode    : 23.110 seconds, 0.19 b/kc, rate= 307.44 kb/s
brotli-10 encode    : 68.072 seconds, 0.07 b/kc, rate= 104.38 kb/s
brotli-11 encode    : 79.844 seconds, 0.06 b/kc, rate= 88.99 kb/s

Results :

PS4 clang-3.5.0


lzt99 :

MiniZ : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 0.292 seconds, 53.15 b/kc, rate= 84.71 mb/s

zlib-ng : 24,700,820 ->13,158,385 =  4.262 bpb =  1.877 to 1
miniz_decompress_time : 0.226 seconds, 68.58 b/kc, rate= 109.30 mb/s

ZStd : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 0.184 seconds, 84.12 b/kc, rate= 134.07 mb/s

Brotli-9 : 24,700,820 ->10,473,560 =  3.392 bpb =  2.358 to 1
brotli_decompress_time : 0.259 seconds, 59.83 b/kc, rate= 95.36 mb/s

Brotli-10 : 24,700,820 -> 9,949,740 =  3.222 bpb =  2.483 to 1
brotli_decompress_time : 0.319 seconds, 48.54 b/kc, rate= 77.36 mb/s

Brotli-11 : 24,700,820 -> 9,833,023 =  3.185 bpb =  2.512 to 1
brotli_decompress_time : 0.317 seconds, 48.84 b/kc, rate= 77.84 mb/s

Oodle Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
encode only      : 4.139 seconds, 3.74 b/kc, rate= 5.97 mb/s
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

Oodle LZNA : lzt99 : 24,700,820 -> 9,068,880 =  2.937 bpb =  2.724 to 1
decode           : 0.643 seconds, 24.12 b/kc, rate= 38.44 mb/s


normals.bc1 :

miniz :   524,316 ->   291,697 =  4.451 bpb =  1.797 to 1
miniz_decompress_time : 0.008 seconds, 39.86 b/kc, rate= 63.53 mb/s

zlib-ng :   524,316 ->   292,541 =  4.464 bpb =  1.792 to 1
zlib_ng_decompress_time : 0.007 seconds, 47.32 b/kc, rate= 75.41 mb/s

zstd :   524,316 ->   273,642 =  4.175 bpb =  1.916 to 1
zstd_decompress_time : 0.007 seconds, 49.64 b/kc, rate= 79.13 mb/s

brotli-9 :   524,316 ->   289,778 =  4.421 bpb =  1.809 to 1
brotli_decompress_time : 0.010 seconds, 31.70 b/kc, rate= 50.52 mb/s

brotli-10 :   524,316 ->   259,772 =  3.964 bpb =  2.018 to 1
brotli_decompress_time : 0.011 seconds, 28.65 b/kc, rate= 45.66 mb/s

brotli-11 :   524,316 ->   253,625 =  3.870 bpb =  2.067 to 1
brotli_decompress_time : 0.011 seconds, 29.74 b/kc, rate= 47.41 mb/s

Oodle Kraken -zl6 :    524,316 ->   247,217 =  3.772 bpb =  2.121 to 1
decode           : 0.002 seconds, 135.52 b/kc, rate= 215.95 mb/s

Oodle Kraken -zl7 :    524,316 ->   238,844 =  3.644 bpb =  2.195 to 1
decode           : 0.003 seconds, 123.96 b/kc, rate= 197.56 mb/s

Oodle BitKnit :    524,316 ->   225,884 =  3.447 bpb =  2.321 to 1
decode only      : 0.010 seconds, 31.67 b/kc, rate= 50.47 mb/s


lightmap.bc3 :

miniz :  4,194,332 ->   590,448 =  1.126 bpb =  7.104 to 1 
miniz_decompress_time : 0.025 seconds, 105.14 b/kc, rate= 167.57 mb/s

zlib-ng : 4,194,332 ->   584,107 =  1.114 bpb =  7.181 to 1
zlib_ng_decompress_time : 0.019 seconds, 137.77 b/kc, rate= 219.56 mb/s

zstd :  4,194,332 ->   417,672 =  0.797 bpb = 10.042 to 1 
zstd_decompress_time : 0.014 seconds, 182.53 b/kc, rate= 290.91 mb/s

brotli-9 : 4,194,332 ->   499,120 =  0.952 bpb =  8.403 to 1 
brotli_decompress_time : 0.022 seconds, 118.64 b/kc, rate= 189.09 mb/s

brotli-10 : 4,194,332 ->   409,907 =  0.782 bpb = 10.232 to 1 
brotli_decompress_time : 0.021 seconds, 125.20 b/kc, rate= 199.54 mb/s

brotli-11 : 4,194,332 ->   391,576 =  0.747 bpb = 10.711 to 1 
brotli_decompress_time : 0.021 seconds, 127.12 b/kc, rate= 202.61 mb/s

Oodle Kraken -zl6 :   4,194,332 ->   428,737 =  0.818 bpb =  9.783 to 1 
decode           : 0.009 seconds, 308.45 b/kc, rate= 491.60 mb/s

Oodle BitKnit :   4,194,332 ->   416,208 =  0.794 bpb = 10.077 to 1
decode only      : 0.021 seconds, 122.59 b/kc, rate= 195.39 mb/s

Oodle LZNA :  4,194,332 ->   356,313 =  0.680 bpb = 11.771 to 1 
decode           : 0.033 seconds, 79.51 b/kc, rate= 126.71 mb/s



Miniz : 7,105,158 -> 3,231,469 =  3.638 bpb =  2.199 to 1
miniz_decompress_time : 0.070 seconds, 63.80 b/kc, rate= 101.69 mb/s

zlib-ng : 7,105,158 -> 3,220,291 =  3.626 bpb =  2.206 to 1
zlib_ng_decompress_time : 0.056 seconds, 80.14 b/kc, rate= 127.71 mb/s

Zstd : 7,105,158 -> 2,700,034 =  3.040 bpb =  2.632 to 1
zstd_decompress_time : 0.050 seconds, 88.69 b/kc, rate= 141.35 mb/s

brotli-9 :  7,105,158 -> 2,671,237 =  3.008 bpb =  2.660 to 1
brotli_decompress_time : 0.080 seconds, 55.84 b/kc, rate= 89.00 mb/s

brotli-10 : 7,105,158 -> 2,518,315 =  2.835 bpb =  2.821 to 1
brotli_decompress_time : 0.098 seconds, 45.54 b/kc, rate= 72.58 mb/s

brotli-11 : 7,105,158 -> 2,482,511 =  2.795 bpb =  2.862 to 1
brotli_decompress_time : 0.097 seconds, 45.84 b/kc, rate= 73.05 mb/s

Oodle Kraken -zl6 : aow3_skin_giants.clb :  7,105,158 -> 2,638,490 =  2.971 bpb =  2.693 to 1
decode           : 0.023 seconds, 195.25 b/kc, rate= 311.19 mb/s

Oodle BitKnit : 7,105,158 -> 2,623,466 =  2.954 bpb =  2.708 to 1
decode only      : 0.095 seconds, 47.11 b/kc, rate= 75.08 mb/s

Oodle LZNA : aow3_skin_giants.clb :  7,105,158 -> 2,394,871 =  2.696 bpb =  2.967 to 1
decode           : 0.170 seconds, 26.26 b/kc, rate= 41.85 mb/s



MiniZ : 51,220,480 ->19,141,389 =  2.990 bpb =  2.676 to 1
miniz_decompress_time : 0.571 seconds, 56.24 b/kc, rate= 89.63 mb/s

zlib-ng : 51,220,480 ->19,242,520 =  3.005 bpb =  2.662 to 1
zlib_ng_decompress_time : 0.457 seconds, 70.31 b/kc, rate= 112.05 mb/s

zstd : malloc failed

brotli-9 : 51,220,480 ->15,829,463 =  2.472 bpb =  3.236 to 1
brotli_decompress_time : 0.516 seconds, 62.27 b/kc, rate= 99.24 mb/s

brotli-10 : 51,220,480 ->14,434,253 =  2.254 bpb =  3.549 to 1
brotli_decompress_time : 0.618 seconds, 52.00 b/kc, rate= 82.88 mb/s

brotli-11 : 51,220,480 ->14,225,511 =  2.222 bpb =  3.601 to 1
brotli_decompress_time : 0.610 seconds, 52.72 b/kc, rate= 84.02 mb/s

Oodle Kraken -zl6 : 51,220,480 ->14,330,298 =  2.238 bpb =  3.574 to 1
decode           : 0.200 seconds, 160.51 b/kc, rate= 255.82 mb/s

Oodle Kraken -zl7 : 51,220,480 ->14,222,802 =  2.221 bpb =  3.601 to 1
decode           : 0.201 seconds, 160.04 b/kc, rate= 255.07 mb/s

Oodle LZNA : silesia_mozilla : 51,220,480 ->13,294,622 =  2.076 bpb =  3.853 to 1
decode           : 1.022 seconds, 31.44 b/kc, rate= 50.11 mb/s

I tossed in tests of BitKnit & LZNA in some cases after I realized that the Brotli decode speeds are more comparable to BitKnit than Kraken, and even LZNA isn't that far off (usually less than a factor of 2). eg. you could do half your files in LZNA and half in Kraken and that would be about the same total time as doing them all in Brotli.

Here are charts of the above data :

(silesia_mozilla omitted due to lack of zstd results)

(I'm trying an experiment and showing inverted scales, which are more proportional to what you care about. I'm showing seconds per gigabyte, and percent out of output size, which are proportional to *time* not speed, and *size* not ratio. So, lower is better.)

log-log speed & ratio :

Time and size are just way better scales. Looking at "speed" and "ratio" can be very misleading, because big differences in speed at the high end (eg. 2000 mb/s vs 2200 mb/s) don't translate into a very big time difference, and *time* is what you care about. On the other hand, small differences in speed at the low end *are* important - (eg. 30 mb/s vs 40 mb/s) because those mean a big difference in time.

I've been doing mostly "speed" and "ratio" because it reads better to the novice (higher is better! I want the one with the biggest bar!), but it's so misleading that I think going to time & size is worth it.

05-13-16 | When you don't want Oodle Kraken

For the vast majority of data compression problems, the answer is Oodle Kraken. What should I use to pack my distribution? -> Kraken. What should I use for streaming/paging -> Kraken? What should I use for text? for binary? -> Kraken, Kraken, Kraken.

There are some cases where it's not the answer. Those are easier to enumerate, so I'll do that, mainly with the Oodle compressors you might want instead.

*. If you want the smallest possible files and don't care much about decode speed or decode-time CPU usage, then you might prefer Oodle LZNA.

*. If you want maximum decode speed and don't care much about compression ratio, you might prefer Oodle's LZB16, LZBLW, or LZNIB.

*. If you want to compress tiny files independently, Kraken is probably not the best choice. Typically for loading or unpacking, tiny files should be stuck together in a larger loading unit, so that you read and decompress several of them per unit. But if for some reason you need independent tiny files, Kraken won't do great under 16k or so. The best alternatives in Oodle that work well on tiny files are LZNA (high compression) and LZNIB (high speed).

(one common case that's equivalent to "tiny files" is communication buffers like network packets; Oodle has special LZ modes for this problem and the best compressors for it are LZNA and LZNIB. Of course you could also use the specialized Oodle Network for packets; contact oodle support for more information on special uses like this).

*. If you're extremely memory-use constrained, you might not want Kraken. Kraken needs around 256k of memory in addition to the output buffer. This is the largest decoder memory requirement in Oodle, so if for some reason that's too much, there are much smaller overhead decoders available, such as LZNA (around 12k) and LZNIB (zero, or around 1k of stack usage).

*. If you have a lot of specialized data types, and care mainly about compression ratio, you probably want something data-type specific. This might be just some kind of preprocessor, or a whole specialized compressor, but with specialized data (lots of text, or DNA, or whatever) you can always do better with a customized compressor than a generic one.

And some other junk.

05-11-16 | RAR Filters

I had a poke around the RAR unpack code to have a look at filters. This is what I found :

From the unpack code I can't see anything about how filters are chosen, obviously.
(which is the interesting part)

RAR filters have a start & length, so they can apply to fine-grained portions of the file.

There can be N filters per file, and they can overlap, so there could be multiple filters
on any given byte.  They're applied in a defined order.

There's a standard E8E9/BCJ filter.

These are the others :



is just byte delta
it can have N channels
delta is at channel stride (from byte -N)
and it de-interleaves the channels

so eg. if you used it on RGB
it would be channels=3
and it would produce output like RRRRGGGGBBB

The older versions of RAR (30 ?) had much more complex filters :


could send arbitrary filters in theory using the VM script
I have no idea if this was actually done in normal use

The VM filters also have special hard-coded modes in C :


same as v50 DELTA


special image filter
transmits the Width of scan lines
hard-coded to 24-bit RGB
  (odd because it does for i to Channels, but Channels is just a const int = 3)
  (they never heard of 8-bit or 32-bit image data?)
uses a Paeth predictor to make a residual
  (N,W,NW depending on which is closest to grad = N+W-NW)


special audio/WAV filter
sends # of channels
does de-interleaving
crazy complicated adaptive weight linear predictor
does just delta from neighbor
but biases that delta by three different slopes
adjusts the weight of the slope by which was the best predictor over recent data
(or something like that)

The important one is just DELTA which is very simple.

The trick bit is not the filter, but finding ranges to apply it on (without just brute-force trying lots of options and seeing which produces the best result - which RAR obviously doesn't do because they sometimes get it so wrong, as demonstrated in the earlier post).

05-11-16 | Oodle Kraken Thread-Phased Decoding

Oodle Kraken is already by far the fastest-to-decode high-compression compressor in the world (that's a mouthful!). But it's got a magic trick that makes it even faster :

Oodle Kraken can decode its normal compressed data on multiple threads.

This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.

The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.

The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.

For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread : (with some other compressors for reference)

win81 :

Kraken 2-thread  : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s

Kraken           : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s

zstdmax          : 104,857,600 ->39,768,086 =  3.034 bpb =  2.637 to 1 
decode           : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s

lzham            : 104,857,600 ->37,856,839 =  2.888 bpb =  2.770 to 1 
decode           : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s

lzma             : 104,857,600 ->35,556,039 =  2.713 bpb =  2.949 to 1 
decode           : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s

Charts on a few files :

Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.

Performance on the Silesia set for reference :

Silesia total :

Oodle Kraken -z6 : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1

single threaded decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s

two threaded decode      : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s

Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers that means it is a *latency* reduction for decoding individual blocks, not just a *throughput* reduction. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.

05-10-16 | Oodle 2.2.0 Kraken Optimal Parse improvements

Oodle 2.2.0 is about to ship, with some improvements to the Kraken optimal parse compression ratios. Compressed size is improved by around 1%. Speed is approximately the same at -z6 (previous max level for Kraken) but there's a new -z7 mode that's slightly slower and even higher compression.

I think we'll continue to find improvements in the optimal parsers over the coming months (optimal parsing is hard!) which should lead to some more tiny gains in the compression ratio in the slow encoder modes.

Silesia , sum of all files

uncompressed : 211,938,580

Kraken 2.1.5 -z6 : 52,366,897
Kraken 2.2.0 -z6 : 51,857,427
Kraken 2.2.0 -z7 : 51,625,488

Oodle Kraken 2.1.5 topped out at -z6 (Optimal2). There's a new -z7 (Optimal3) mode which gets a bit more compression at the cost of a bit of speed, which is why it's on a separate option instead of just part of -z6.

Results on some individual files (Kraken 220 is -z7) :


by ratio:
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
Kraken 220  :  3.60:1 ,    1.1 enc mb/s ,  896.5 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Kraken 215  :  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s


by ratio:
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Kraken 220  :  2.53:1 ,    2.0 enc mb/s ,  912.0 dec mb/s
Kraken 215  :  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s


by ratio:
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
Kraken 220  :  2.23:1 ,    1.0 enc mb/s ,  650.6 dec mb/s
Kraken 215  :  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s


by ratio:
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
Kraken 220  :  3.82:1 ,    1.4 enc mb/s ,  837.2 dec mb/s
Kraken 215  :  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb


by ratio:
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Kraken 220  :  2.77:1 ,    1.0 enc mb/s ,  818.0 dec mb/s
Kraken 215  :  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s


by ratio:
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
Kraken 220  :  3.51:1 ,    1.4 enc mb/s ,  702.8 dec mb/s
Kraken 215  :  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s

You can see that encode & decode speed is slightly worse at level -z7 , and compression ratio si improved. (most of the other compression levels have roughly the same decode speed; -z7 enables some special options that can hurt decode speed a bit). Of course even at -z7 Kraken is way faster than anything else comparable!

05-10-16 | Tips for benchmarking a compressor

You're about to evaluate Oodle (thanks for having a look!) or some other compressor. Before you start, consider these tips :

1. Time only the compressor.

Place your time measurements only around the compressor. Not IO, not your parsing, not mallocs, just the compress or decompress calls. I understand that in the end what you care about is total time to load, but there can be a lot of issues there that need fixing, and they can cloud the comparison of just the compression part. eg. if your parsing is really slow, that will dominate the CPU time and hide the differences between the compressors.

A common problem is that your app-loading takes a large amount of CPU independent of decompression. In that case, you care about how much CPU the decompressor uses, regardless of total load latency, because it runs into the CPU usage of your post-decompress loading code. Another problem is that accurately timing the disk load time is very difficult; it strongly depends on the exact hardware, disk cache usage, file layout and packaging for seek times; etc. It's usually better to simulate disk load times rather than measure it, because good quality measurements require a wide variety of systems to get a spectrum of results. (it's a bit like doing a medical trial on one person otherwise)

2. Time what you actually care about.

If you care about decode time, time the decompression. If you care about encode time, time compression. If you care about round-trip time, add the two times. Compressors are not just "fast" or "slow" at both ends, you can't time encoding and decide that it's a fast or slow compressor if what you care about is decoding.

3. Choose the right options.

Most compressors have the ability to target slightly different use cases. The most common option is the ability to trade off encode time vs. compression ratio. So, if what you care about is smallest size, then run the compressor at its highest encode effort level. It can be tricky to get the options right in most compression libraries; we are woefully non-standardized and not well documented. Aside from the simple "level" parameter, there may be other options that are relevant to your goals, perhaps trading off decompressor memory usage, or decompression speed. With Oodle the best option is always to email us and ask what options will best suit your goals.

4. Run apples-to-apples (threads-to-threads) comparisons.

It can be tricky to compare compressors fairly. As much as possible they should be run in the same way, and they should be run in the way that you will actually use them in your final application. Don't profile them with threads if you will not use them threaded in your shipping application.

Threads are a common problem. Compressors should either be tested all threaded (if you will use threads in your final application), or all non-threaded. Unfortunately the defaults are not the same. "lzma" (7z) and LZHAM create threads by default. You have to change their options to tell them to *not* create threads. The normal Oodle_Compress calls will not use threads by default, you have to specifically call one of the _Async threaded routines. (my personal preference is to benchmark everything without threads to compare single-threaded performance, and you can always add threads for production use)

5. Take the MIN of N run times.

To get reliable timing, you need to run the loop many times, and take the MIN of all times. The min will give you the time it takes when the OS isn't interrupting you with task switches, the CPU isn't clocking-down for speedstep, etc. I usually do 30 *per core* but you can probably get a way with a bit less.

6. Wipe the cache.

Assuming you are now doing N loops, you need to invalidate the cache between iterations. If you don't, you will be running the compressor in a "hot cache" scenario, with some buffers already in cache.

7. Don't pack a bunch of files together in a tar if that's not how you load.

It may seem like a good way to test to grab your bunch of test files and pack them together in a tar (or zip -0 or similar package) and run the compression tests on that tar. That's a fine option if that's really how you load data in your final application - as one big contiguous chunk that must be loaded in one big blob. But most people don't. You need to test the compressors in the same way they will be used in the final application. If you load whole file at a time, test the compressors on whole file units. Many people do loading on some kind of paging unit, like perhaps 1 MB chunks. If you do that, then test the compressor on the same thing.

8. Choose your test set.

If you could test on the entire set of buffers that your final application will load, that would be an accurate test. (though actually, even that is a bit subtle, since some buffers are more latency sensitive than others, so for example you might care more about the first few things you load to get into a running application as quickly as possible). That's probably not practical, so you want to choose a set that is representative of what you will actually load. Don't exclude things like already-compressed files (JPEGs and so on) *if* you will be running them through the compressor. (though consider *not* running them your compressed-file loading path, in which case you should exclude them from testing). It's pretty hard to get an accurate representative sample, so it's generally best to just get a variety of files and look at individual per-file results.

9. Look at the spectrum of results, not the sum.

After you run on your test set, don't just add up the compressed sizes and times to make a "total" result. Sums can be misleading. One issue is there are some large incompressible files, they can hide the differences on the more compressible files. But a bigger and more subtle trap is the way that sums weight the combination of results. A sum is a weighting by the size of each file in the test set. That's fine if your test set is all of your data, or is a perfectly proportionally representative sampling of all of your data (a subset which acts like the whole). But most likely it's not. It's best to keep the results per file separate and just have a look at individual cases to see what's going on, how the results differ, and try not to simplify to just looking at the sum.

10. If you do sum, sum *time* not speed, sum *size* not ratio.

Speed (like mb/s) and ratio (raw size/comp size) are inverted measures and shouldn't be summed. What you actually care about is total compressed size, and total time to decode. So if you run over a set of files, don't look at "average speed" or "average ratio" , because those are inverted meaures that will oddly weight the accumulation. Instead accumulate total time to decode, total raw size, and total compressed size, and then if you like you can make "overall speed" and "overall ratio" from those total.

11. Try not to malloc in the timing loop.

Your malloc might be fast, it might be slow, it's best to not have that as a variable in the timing. In general try to allocate the memory for the compressor or decompressor outside of the timing loop. (In Oodle this is done by passing in your own pointer for the "decoderMemory" argument of OodleLZ_Decompress). That would be an unfair test if you didn't also do that in the final application - so do it in the final application too! (similarly, make sure there's no logging inside the timing loop).

12. Consider excluding almost-incompressible files.

This is something you should consider for final shipping application, and if you do it in your shipping application, then you should do it for the benchmark too. The most common case is already-compressed files like JPEG images and MP3 audio. These files can usually be compressed slightly, maybe saving 1% of their size, but the time to decode them is not worth it overall - you can get more total size savings by running a more powerful compressor on other files. So it's most efficient to just send them uncompressed.

13. Tiny files should either be excluded or packed together.

There's almost never a use case where you really want to compress tiny files (< 16k bytes or so) as independent units. There's too much per-unit overhead in the compressor, and more importantly there's too much per-unit overhead in IO - you don't want to eat a disk seek to just to get one tiny file. So in a real application tiny files should always be grouped into paging units that are 256k or more, a size where loading them won't just be a total waste of disk seek time. So, when benchmarking compressors you also shouldn't run them on tiny independent files, because you will never do that in a shipping application (I hope). And of course don't just do this for the benchmark, do it in the final app too.

05-08-16 | Order-1 Huffman

This is a simple idea that's rarely written down, so I thought I'd do a quick summary.

To my knowledge I was the first person to write about it (in "New Techniques in Context Modeling and Arithmetic Encoding" (PDF) ) but it's one of those simple ideas that probably a lot of people had and didn't write about (like Deferred Summation). It's also one of those ideas that keeps being forgotten and rediscovered over the years.

(I don't know much about the details of how Brotli does this, it may differ. I'll be talking about how I did it).

(also by "Huffman" I pretty much always mean "static Huffman" where you measure the histogram of a block and transmit the code lengths, not "adaptive Huffman" (modifying codelens per symbol (bleck)) or "deferred summation Huffman" (codelens computed from histogram of previous data with no explicit codelen transmission))

Let's start with just the case of order-1 8 bit literals. So you're coding a current 8-bit symbol with an 8-bit previous symbol as context. You can do this naively by just have 256 arrays, one for each 8-bit context. The decoder looks like this :

256 times :
read codelens from file
build huffman decode table

per symbol :

o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , huff_table[o1] );

and on a very large file (*) that might be okay.

(* actually it's only okay on a very large file with completely stable statistics, which never happens in practice. In the real world "very large files" don't usually exist; instead they act like a sequence of small/medium files tacked together. That is, you want a decoder that works well on small files, and you want to be able to reset it periodically (re-transmit huffman codelens in this case) so that it can adapt to local statistics).

On small files, it's disastrous. You're sending 256 sets of codelens, which can be a lot of wasted data. Worst of all it's a huge decode time overhead to parse out the codelens and build the decode tables if you're only going to get a few symbols in that context.

So you want to reduce the count of huffman tables. A rough guideline is to make the number of tables proportional to the number of bytes. Maybe 1 table per 1024 bytes is tolerable to you, it depends.

(another goal for reduction might be to get all the huff tables to fit in L2 cache)

So we want to merge the Huffman tables. You want to find pairs of contexts that have the most similar statistics and merge those. If you don't mind the poor encoder-time speed, a good solution is a best-first merge :

for each pair {i,j} (i<j)
merge_cost(i,j) = Huffman_Cost( symbols_i + symbols_j ) - Huffman_Cost( symbols_i ) - Huffman_Cost( symbols_j )

Huffman_Cost( symbols ) = bits to send codelens + bits to encode symbols using those codelens

while # of contexts > target , and/or merge cost < target
pop lowest merge_cost
merge context j onto i
delete all merge costs involving j
recompute all merge costs involving i

if the cost was just entropy (H) instead of Huffman_Cost , then a merge_cost would always be strictly >= 0 (separate statistics are always cheaper than combining). But since the Huffman codelen transmission is not free, the first merges will actually reduce encoded size. So you should always do merges that are free or beneficial, even if huffman table count is low enough.

So contexts with similar statistics will get merged together, since coding them with a combined set of codelens either doesn't hurt or hurts only a little (or helps, with the cost of codelen transmission counted). In this way contexts where it wasn't really helping to differentiate them will get reduced.

Once this is done, the decoder becomes :

get n = number of huffman tables

n times :
read codelens from file
build huffman decode table

256 times :
read tableindex from file
merged_huff_table_ptr[i] = huff_table[ tableindex ]

per symbol :

o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , merged_huff_table_ptr[o1] );

So merged_huff_table_ptr[] is still a [256] array, but it points at only [n] unique Huffman tables.

That's order-1 Huffman!

In the modern world, we know that o1 = the previous literal is not usually the best use of an 8-bit context. You might do something like top 3 bits of ptr[-1], top 2 bits of [ptr-2], 2 bits of position, to make a 7-bit context.

One of the cool things order-1 Huffman can do is to adaptively figure out the context for you.

For example with LZMA you have the option of the # of literal context bits (lc) and literal pos bits (lp). You want them to be as low as possible for better statistics, and there's no good way to choose them per file. (usually lc=2 or lp=2 , usually just one or the other, not both)

With order-1 Huffman, you just make a context with 3 bits of lc and 3 bits of lp, so you have a [64] 6-bit context. Then you let the merger throw away states that don't help. For example if it's a file where pos-bits are irrelevant (like text), they will just get merged out, all the lc contexts that have different lp values will merge together.

05-03-16 | Brotli signed int mode

Brotli signed int context mode. Looks like a good idea. My guess is this is what's helping Brotli10 on the binary files I wrote about in the previous post (horse.vipm and so on)

Signed int takes the previous two bytes and forms a 6-bit context from them thusly :

  Context = (Lut2[b2]<<3) | Lut2[b1];

      Lut2 :=
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

So, it's roughly categorizing the two values into ranges, which means it can act as a kind of linear predictor (if that fits the data), eg. two preceding values in group 4 = my value probably is too, or if b2 is a "3" and b1 is a "4" then I'm likely a "5". Or not, if the data isn't linear like that. Or maybe there's only correlation to b1 and b2 gets ignored, which the order-1-huff can also model.

One thing I like is holding out values 00 and FF and special cases that get a unique bucket. This lets you detect the special cases of last two bytes = 0000,FFFF,FF00,00FF , which can be pretty important on binary.

I think that for the type of data we get in games that often has floats, it might be worth it to single out 7F and 80 as well, something like :

but who knows, would have to test to see.

05-03-16 | Underappreciated Compressors

1. LZX. LZX is crazy good. It was such a huge step at the time, and nobody really recognized it. (I sure didn't) (I guess someone at MS did because they bought it).

It's a shame it never got a good mainstream implementation. It could/should have been the LZ we all used for the past 10 years.

One of the little mistakes in LZX was the 21 bit offset limit. This must have seemed enormous back on the Amiga in 1995, but very quickly became a major handicap against LZs with unlimited windows.

LZX with unlimited window (eg. on files less than 2 MB) is competitive with any modern LZ, especially on binary structured data where it really shines. In hindsight, LZX is the clear ancestor to LZMA and it was way ahead of its time. We're only clearly beating it in the past year or two (!!).

2. RAR. The primary LZ in RAR is a pretty straightforward LZ-Huff (I believe). It's fine, it's nothing bad or special.

What makes RAR special is the filters. It still has the best filters of any compressor I know.

RAR+filters often *beats* LZMA and other very slow high ratio compressors.

The special thing about the RAR filters is that they aren't like most of the "precomp" solutions that just try to recognize WAV headers and things like that - RAR may do some of that (I have no idea) - but it also definitely finds filters that work on headerless data. Like, you can take a BMP or WAV and strip off the header and RAR will still figure out that there's data to filter in there; it must have some analysis heuristics, and they're better than anything else I've seen.

As an example of when RAR filters do magic, here's a 24-bit RGB BMP with the first 100k stripped, so it's headerless and not easily recognized by file-type-detection filters :

PDI_1200_bmp_no_header.BitKnit,1268120  // <- wow BitKnit !
PDI_1200_bmp_no_header.rar,1312621  // <- RAR filters!
PDI_1200_bmp_no_header.mc-.rar,1631419 // <- RAR unfiltered

That said, it does make mistakes. Sometimes filters can make things way worse if they make a wrong decision. They don't have a "filters must help" safety check. This is easy to prevent, you just run also with no filter and make sure it helped, but they seem to not do that (presumably to save the encode time) and the results can be disastrous :

lightmap.bc3.mc-.rar,457893  // <- RAR with disabled filters
lightmap.bc3.rar,778363  // <- RAR with filters huge fuckup !!
RAR filters fucking up on DXTC (BCn) is pretty consistent :
c.dds.mc-.rar,408363 // <- unfiltered is okay
c.dds.rar,438070     // <- oops!
Sometimes it does magic :
horse.vipm.brotli10,955363 // <- brotli10 big step
horse.vipm.rar,971716 // <- RAR with filters does magic
horse.vipm.mc-.rar,1066205 // <- RAR with disabled filters
Here's an XRGB dds where the RAR filters do magic :
d.dds.BitKnit,360220  // (at zl6 BitKnit beats LZNA ! crushes 7z! wow)
d.dds.rar,382282  // <- RAR filter crushes 7z
d.dds.mc-.rar,534913 // <- RAR unfiltered is poor
happy.mc-.rar,1177426 // <- RAR unfiltered is okay
happy.rar,1354649 // <- RAR filters fucks up
Not about RAR, but for historical comparison, lzt24 is another mesh (the "struct72" file here )
Found another weird one where RAR filters do magic; lzt25 is super-structured 13-byte structs :
lzt25.rar,40024  // <- WOW RAR filters!
lzt25.zl8.LZNA,61582  // <- zl8 LZNA worse than zl6 - weird file
lzt25.zstd060,64550  // <- ZStd does surprisingly well here, I thought you needed more reps on this file
lzt25.brotli10,68472 // <- brotli10 worse than brotli9 !
lzt25.BitKnit,92940  // <- BitKnit oddly struggling
lzt25.mc-.rar,106423 // <- unfiltered RAR is the worst of the LZ's

A lot of interesting things to pick out in those reports. (just saying, I'm not gonna address them all)

One just general thing is that the performance of these LZ's is in no way consistent. You can't just say that "X LZ is 5% better than Y", there's no really consistent pattern, they have wildly variable relative performance.

There's a family of sort of normal LZ's - LZX, Brotli9, ZStd, & unfiltered RAR. Then there's the family of the high-compress LZ's, LZNA, 7z, nz. Those are pretty consistently together, and form two end-points.

But then there are the floaters. BitKnit, Kraken, Filtered RAR, and Brotli10 can jump around between the "normal LZ" and "high-compress LZ" region. BitKnit and Brotli10 are the most variable - they both can jump right up to the high-compress LZ's like 7z, but on other files they drop right down into the pack of normal LZ's (LZX, etc.).

I have a guess about what's happening with Brotli. I haven't looked at the code at all, but my guess is that between level 9 and 10 the order-1 context optimization is turned on. In particular, there's this "signed int" context mode which I believe is what does the magic for brotli on things like horse.vipm (for example it has contexts for the case of last two bytes = 0x0000 , or last two bytes = 0xFFFF , which are pretty common on horse). My guess is that this mode is just not even tried at all at level 9, and at level 10 it turns on the code to pick the best context mode, and finds the signed int mode which is great on these files. Not sure.

04-30-16 | Oodle Kraken Pareto Frontier

The whole idea of Kraken is to excel in the size-decodespeed tradeoff. From the beginning of its design, every decision was considered in terms of the size vs decodespeed.

So how does it do?

To visualize the size-decodespeed Pareto frontier, I like to use an imaginary loading problem. You want to load compressed data and then decompress it. One option is the null compressor (or "memcpy") that just loads uncompressed data and then memcpys it. Or you can use compressors that give smaller file sizes, thus quicker loads, but take more time to decompress. By dialing the virtual disk speed, you can see which compressor is best at different tradeoffs of space vs. speed.

Of course you may not actually be just trying to optimize load time, but you can still use this imaginary loading problem as a way to study the space-speed tradeoff. If you care more about size, you look at lower imaginary disk speeds, if you core more about CPU use, you look at higher imaginary disk speeds.

I like to show "speedup" which is the factor of increase in speed using a certain compressor to do load+decomp vs. the baseline (memcpy). So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff. (By "show" I mean "is linearly proportional to, so you can actually look at the ratio of Y-intercepts in the Pareto curve and it tells you compression ratio on the left and decompression speed on the right).

03-02-15 - Oodle LZ Pareto Frontier and 05-13-15 - Skewed Pareto Chart

The chart showing "millis" shows time, so obviously lower is better. I show a few ways to combine load & decomp. Sum is the time to load then decomp sequentially. Max is the larger of the two, which is the time if they were perfectly overlapped (not usually possible). Personally I think "l2 sum" is most useful. This is the sqrt of sum of squares, which is a kind of sum that biases towards the larger; it's kind of like a mix of "sum" and "max", it means you want the sum to be small, but you also want them to similar times.

Kraken tops the Pareto curve for decodespeed vs size for a huge range of virtual disk speed; from around 2 mb/s up to around 300 mb/s.

Of course the Pareto curve doesn't tell the whole picture. For one thing I don't have encode speed in the equation here at all (and sometimes you need to look at the memory use tradeoff too, so it's really a size-decodespeed-encodespeed-memoryuse surface). For another you sometimes have strict requirements, like I must hit a certain file size, and then you just pick the fastest decoder that meets that requirement. One thing the curves do clearly tell you is when a curve just completely covers another (that is, is greater in Y for all values of X), then that compressor is just never needed (in space & decodespeed).

Silesia :

Game Test Set :

(BTW for anyone trying to compare new charts to old ones on my blog - "Game Test Set" is not a static set. It changes all the time as I get more customer data and try to make it match the most representative data I have for games.)

Just for comparison to earlier posts on my Blog about the other Oodle compressors, here's a run of lzt99 with the full Oodle compressor set.

I've included ZStd for reference here just to show that Kraken jumping way up is not because the previous Oodle compressors were bad - far from it, they were already Pareto optimal. (I like to compare against ZStd not because it's bad, but because it's excellent; it's the only other compressor in the world I know of that's even close to Oodle in terms of good compression ratio and high speed (ZStd also has nice super-fast encode speeds, and it's targetted slightly differently, it's better on text and Oodle is better on binary, it's not a direct apples comparison)).

You can see that LZHLW is just completely covered by Kraken, and so is now deprecated in Oodle. Even LZNIB and BitKnit are barely peeking out of the curve, so the range where they are the right answer is greatly reduced to more specialized needs. (for example BitKnit still romps on strongly structured data, and is useful if you just need a size smaller than Kraken)

04-28-16 | Performance of Oodle Kraken

A continuing series reporting on the performance of recently released Oodle Kraken

First some general notes about Oodle before the big dump of numbers. (skip to the bottom for charts)

Oodle is not intended to solve every problem in data compression. Oodle Data Compression is mainly designed for compress-once load-many usage patterns, where the compressed size and decode speed is important, but encode speed is not as important. Things like distribution, packaging, when you bake content into a compressed archive and then serve it to many people. I do consider it a requirement to keep the encoders faster than 1 MB/s on my ancient laptop, because less than that is just too slow even for content baking.

Like most data compressors, Oodle has various compression level options that trade off encoder speed for compressed size. So there are faster and slower encoders; some of the fast modes are good tradeoffs for space-speed. Kraken turns out to be pretty good at getting high compression even in the fast modes. I'll do a post that goes into this in the future.

Oodle is mainly intended for packing binary data. Because Oodle is built for games, we focus on the types of data that games ship, such as textures, models, animations, levels, and other binary structured data that often contains various types of packed numeric data. Oodle is good on text but is not really built for text. In general if you are focusing on a specific type of data (eg. text/html/xml) you will get the best results with domain-specific preprocessors, such as xwrt or wbpe. Oodle+wbpe is an excellent text compressor.

Oodle Kraken is intended for high compression use cases, where you care about size. There's a whole family of compressors now that live in the "just below lzma (7zip)" compression ratio domain, such LZHAM, Brotli, Oodle BitKnit, and ZStd. In general the goal of these is to get as close to lzma compression ratio as possible while providing much better speed. This is where Kraken achieves far more than anyone has before, far more than we thought possible.

Today I will be focusing on high compression modes, looking at decode speed vs. compression ratio. All compressors will generally be run in their max compression mode, without going into the "ridiculously slow" below 1 MB/s range. (for Oodle this means running -zl6 Optimal2 but not -zl7 or higher).

Okay, on to lots of numbers.

Rather than post an average on a test set, which can be misleading due to the selection of the test set or the way the results are averaged (total compressed size or average ratio?), I'll post results on a selection of individual files :


by ratio:
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Oodle Kraken:  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s

by encode speed:
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Oodle Kraken:  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s

by decode speed:
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s
Oodle Kraken:  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s

by ratio:
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Oodle Kraken:  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s

by encode speed:
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Oodle Kraken:  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s

by decode speed:
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s
Oodle Kraken:  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s

by ratio:
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
Oodle Kraken:  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s

by encode speed:
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
Oodle Kraken:  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s

by decode speed:
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s
Oodle Kraken:  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s

by ratio:
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
Oodle Kraken:  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb/s

by encode speed:
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
Oodle Kraken:  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s

by decode speed:
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb/s
Oodle Kraken:  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s

by ratio:
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Oodle Kraken:  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s

by encode speed:
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Oodle Kraken:  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s

by decode speed:
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s
Oodle Kraken:  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s

by ratio:
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
Oodle Kraken:  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s

by encode speed:
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
Oodle Kraken:  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s

by decode speed:
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s
Oodle Kraken:  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s

In chart form :

(lz4 decode speed is off the top of the chart)

I'm not including the other Oodle compressors here just to keep things as simple as possible. If you do want more compression than Kraken, and care about decode speed, then Oodle LZNA or BitKnit are much faster to decode than lzma (7zip) at comparable or better compression ratios.

Visit radgametools.com to learn more about Kraken and Oodle.

04-27-16 | Introducing Kraken

Kraken is out. Release notice on my blogger blog :

Release the Kraken!

I'll be doing some more expert-oriented techy followup posts here at rants.html

04-25-16 | Data Compression History : Finnish

Finnish was perhaps the fastest (non-nop) compressor in the world around 1995 (? not sure on the exact year. Definitely before P-Pro and CMOV and branch penalties and such; this is a pre-Pentium-era optimized compressor; it definitely existed before LZP1 (1996) since it was one of the things I benchmarked against). (heck it might be a 286-era compressor, seeing as it's all 16-bit!)

Finnish was by some guys that I assume were from Finland. If anybody knows the correct attribution please let me know.

I was thinking about it the other day because we talked about the old segment register trick that we used to do, and I always thought this was such a neat little bit of code. It also uses the byte-regs as part of word-reg tricks.

Finnish :

; es = CharTable
; bx = hash index
; dl = control bits
; ds[si] = input
; ds[di] = output
; ax/al/ah = input char
; bp = control ptr

ProcessByte     macro SourceReg,BitVal
                        cmp     SourceReg, es:[bx]
                        je      ProcessByte_done
                        or      dl, BitVal
                        mov     es:[bx], SourceReg
                        mov     ds:[di], SourceReg
                        inc     di
ProcessByte_done:       mov     bh, bl
                        mov     bl, SourceReg

                        mov bp, di           ; ControlPtr = CompPtr++;
                        inc di
                        xor dl, dl           ; Control = 0;

                        lodsw                ; AX = ds[si] , si += 2
                        ProcessByte al, 80h
                        ProcessByte ah, 40h
                        ProcessByte al, 20h
                        ProcessByte ah, 10h
                        ProcessByte al, 08h
                        ProcessByte ah, 04h
                        ProcessByte al, 02h
                        ProcessByte ah, 01h

                        mov     ds:[bp], dl  ; *ControlPtr = Control

04-25-16 | Huffman Correction

A reader pointed out an error in my blog post - 08-12-10 - The Lost Huffman Paper

I had :

if ( bits >= huff_branchCodeLeftAligned[TABLE_N_BITS] )
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;

it should have been :

if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;

it's corrected now.

In my convention, branchCodeLeftAligned is the left-aligned bitbuff value that means you must go to a higher codelen.

I thought for clarity I'd go ahead and post the example I did with him :

You have this alphabet :

symbol_id, codeLen, code:
0 ; 2 ; 00
1 ; 3 ; 010
2 ; 3 ; 011
3 ; 3 ; 100
4 ; 4 ; 1010
5 ; 4 ; 1011
6 ; 4 ; 1100
7 ; 4 ; 1101
8 ; 4 ; 1110
9 ; 5 ; 11110
10; 5 ; 11111

baseCode[n] = first code of len n - # of codes of lower len

    [2]     0
    [3]     1       = 010 - 1
    [4]     6       = 1010 - 4
    [5]     21      = 11110 - 9

    [2]   0x4000000000000000      010000...
    [3]   0xa000000000000000      101000...
    [4]   0xf000000000000000      111100...
    [5]   0xffffffffffffffff      111111...

My decode loop is :

for(int codeLen=1;;codeLen++) // actually unrolled, not a loop

if ( bitbuff < huff_branchCodeLeftAligned[codeLen] ) return symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];



int codeLen = minCodeLen;
while ( bitbuff >= huff_branchCodeLeftAligned[codeLen] ) codeLen++;
sym = symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];

so if bitbuff is


codeLen starts at 2
we check

if ( 11010000.. < 0x4000... ) - false
if ( 11010000.. < 0xa000... ) - false
if ( 11010000.. < 0xf000... ) - true
  return ( 1101 - baseCode[4] ); = 13 - 6 = 7

And a full table-accelerated decoder for this code might be :

// 3-bit acceleration table :
#define TABLE_N_BITS 3
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;

if ( bitbuff < huff_branchCodeLeftAligned[4] ) return symbolUnsort[ getbits(4) - baseCode[4] ];

// 5 is max codelen
// this compare is not always true (because of the bitbuff=~0 problem), but we take it anyway
//if ( bitbuff < huff_branchCodeLeftAligned[5] )

return symbolUnsort[ getbits(5) - baseCode[5] ];

And there you go. MSB-first Huffman that supports long code lengths that exceed the acceleration table size.

04-22-16 | Understanding Brotli

I got curious just now and went and had a read of the Brotli spec to see what's going on in there. My cliff notes on how Brotli works.

(In particular I was curious if the slow encode speeds for level 10 & 11 were inherent. My conclusion is basically they are. Brotli gets its high compression from the crazy flexibility of Huffman context assignment and table switching, which is an inherently slow thing to encode. It's a very asymmetric format.)

Shout out for order-1 Huffman with merge tables. I'm not sure I've seen that used in production since my paper - "New Techniques in Context Modeling and Arithmetic Encoding" (PDF) (twenty years ago, ZOMG).

Highlighting the things I think are significant or interesting :

Brotli is LZ+Huff Mostly it's LZX (reps, bottom bits, huff resets) plus order-1-huff plus some details The main stream is {packet (LRL+ML) , literals , offset} Brotli has 16 rep codes : 0: last distance 1: second-to-last distance 2: third-to-last distance 3: fourth-to-last distance 4: last distance - 1 5: last distance + 1 6: last distance - 2 7: last distance + 2 8: last distance - 3 9: last distance + 3 10: second-to-last distance - 1 11: second-to-last distance + 1 12: second-to-last distance - 2 13: second-to-last distance + 2 14: second-to-last distance - 3 15: second-to-last distance + 3 (? I'm curious how much this helps; how would Brotli do with just the normal 4 reps?) Brotli does "bottom bits" for offsets, and sends the # of bottom bits in a header (NPOSTFIX) so it's variable per file. Brotli uses an {LRL,ML} , but instead of linear up to excess flag (like LZ4) it uses a Zlib style log2ish + extrabits encoding. This packet has a rather large alphabet (704 codes) Brotli has very flexible capability for Huffman resets. It can send Huff resets anywhere, for each Huffman category. It can also switch *back* which is kind of interesting. At any byte, for each type of entropy coding, it can send a Huffman code index ("block type") and the number of bytes to code in that type. So it could do for literals Huff type i ; use Huff table 0 for the next 200 bytes then use Huff table 1 for 100 bytes, then back to table 0 for 100 bytes etc. Packets are sent order-0 Literals and offsets are sent with order-1 Huffman The order-1 Huffman can have separate tables per context, or they can be shared. So the encoder combines context buckets with similar statistics. This is sent as a "context map". Crazily, there is a separate "context map" per block type! So "block type" is actually selecting a possibly different context map assignment of the prefix code tables. The context for offsets is a 2-bit context from the match length of the current match. {ML=2,3,4,5+} (models the ML-offset correlation) Literals are context-coded using the previous 2 literals. It's not really order-2 in that there's no escape down. It just takes the 2 literals and forms a single context value, which is always 6 bits. There are four methods, called context modes, to compute the Context ID: * LSB6, where the Context ID is the value of six least-significant bits of p1, * MSB6, where the Context ID is the value of six most-significant bits of p1, * UTF8, where the Context ID is a complex function of p1, p2, optimized for text compression, and * Signed, where Context ID is a complex function of p1, p2, optimized for compressing sequences of signed integers. (note that order1 huff bin merging can effectively reduce the bit count; eg. MSB6 can become MSB2 if the lower bits are not very helpful, by sending them all as the same huffman code index)

03-14-16 | XRGB Bitmap Test

This is obvious and I think it's been done before, but hey.

I was remembering how modern LZ's like LZMA (BitKnit, etc.) that (can) do pos&3 for literals might like bitmaps in XRGB rather than 24-bit RGB.

In XRGB, each color channel gets its own entropy coding. Also offset bottom bits works if the offsets are whole pixel steps (the off&3 will be zero). In 24-bit RGB that stuff is all mod-3 which we don't do.

(in general LZMA-class compressors fall apart a bit if the structure is not the typical 4/8/pow2)

In compressors it's generally terrible to stick extra bytes in and give the compressor more work to do. In this case we're injecting a 0 in every 4th byte, and the compressor has to figure out those are all redundant just to get back to its original size.

Anyway, this is an old idea, but I don't think I ever actually tried it. So :



24-bit RGB : LZNA : 2,760,054 -> 1,376,781
32-bit XRGB: LZNA : 3,676,818 -> 1,311,502

24-bit  RGB with DPCM filter : LZNA : 2,760,054 -> 1,022,066
32-bit XRGB with DPCM filter : LZNA : 3,676,818 -> 1,015,379  (MML8 : 1,012,988)

webpll : 961,356
paq8o8 : 1,096,342


24-bit RGB : LZNA : 6,580,854 -> 3,274,757
32-bit XRGB: LZNA : 8,769,618 -> 3,022,320

24-bit  RGB with DPCM filter : LZNA : 6,580,854 -> 2,433,246
32-bit XRGB with DPCM filter : LZNA : 8,769,618 -> 2,372,921

webpll : 2,204,444
gralic111d : 1,822,108

other compressors :

32-bit XRGB with DPCM filter : LZA  : 8,769,618 -> 2,365,661 (MML8 : 2,354,434)

24-bit  RGB no filter : BitKnit : 6,580,854 -> 3,462,455
32-bit XRGB no filter : BitKnit : 8,769,618 -> 3,070,141
32-bit XRGB with DPCM filter : BitKnit : 8,769,618 -> 2,601,463

32-bit XRGB: LZNA : 8,769,618 -> 3,022,320
32-bit XRGB: LZA  : 8,769,618 -> 3,009,417

24-bit  RGB: LZMA : 6,580,854 -> 3,488,546 (LZMA lc=0,lp=2,pb=2)
32-bit XRGB: LZMA : 8,769,618 -> 3,141,455 (LZMA lc=0,lp=2,pb=2)


bmp copy moses.bmp moses.tga 32
radbitmaptest64 rrz -z0 r:\moses.tga moses.tga.rrz -f8 -l1

Key observations :

1. On "moses" unfiltered : padding to XRGB does help a solid amount (3,274,757 to 3,022,320 for LZNA) , despite the source being 4/3 bigger. I think that proves the concept. (BitKnit & LZMA even bigger difference)

2. On filtered data, padding to XRGB still helps, but much (much) less. Presumably this is because post-filter data is just a bunch of low values, so the 24-bit RGB data is not so multiple-of-three structured (it's a lot of 0's, +1's, and -1's, less coherent, less difference between the color channels, etc.)

3. On un-filtered data, "sub" literals might be helping BitKnit (it beats LZMA on 32-bit unfiltered, and hangs with LZNA). On filtered data, the sub-literals don't help (might even hurt) and BK falls behind. We like the way sub literals sometimes act as an automatic structure stride and delta filter, but they can't compete with a real image-specific DPCM.

Now, XRGB padding is an ugly way to do this. You'd much rather stick with 24-bit RGB and have an LZ that works inherently on 3-byte items.

The first step is :

LZ that works on "items"

(eg. item = a pixel)

LZ matches (offsets and lens) are in whole items

(the more analogous to bottom-bits style would be to allow whole-items and "remainders";
that's /item and %item, and let the entropy coder handle it if remainder==0 always;
but probably best to just force remainders=0)

When you don't match (literal item)
each byte in the item gets it own entropy stats
(eg. color channels of pixels)

which maybe is useful on things other than just images.

The other step is something like :

Offset is an x,y delta instead of linear
(this replaces offset bottom bits)

could be generically useful in any kind of row/column structured data

Filtering for values with x-y neighbors

(do you do the LZ on un-filtered data, and only filter the literals?)
(or do you filter everything and do the LZ on filter residuals?)

and a lot of this is just webp-ll

03-11-16 | Seven Test Space-Speeds

Showing decompress time space-speed tradeoff on the different files of "seven test" :








Note on the test :

This is running the non-Oodle compressors via my build of their lib (*). Brotli not included because it's too hard to build in MSVC (before 2010). "oohc" here is "Optimal2" level (originally posted with Optimal1 level, changed to Optimal2 for consistency with previous post).

The sorting of the labels on the right is by compressed size.

Report on total of all files :

by ratio:
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s

by encode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s

by decode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s

How to for my reference :

type test_slowies_seven.bat
@REM test each one individially :
spawnm -n external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven\* -cr:\seven_csvs\@f.csv
@REM test as a set :
external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven

dele r:\compressorspeeds.*
@REM testproj compressorspeedchart
spawnm c:\src\testproj\x64\debug\TestProj.exe r:\seven_csvs\*.csv
ed r:\compressorspeeds.*

(* = I use code or libs to test speeds, never exes; I always measure speed memory->memory, single threaded, with cold caches)

03-11-16 | Seven Test

I made a new test set called "sevens", taking the lead from enwik7, the size of each file is 10 MB (10^7).

The goal here is not to show the total or who does best overall (that relies on how you weight each type of file and whether you think this selection is representative of the occurance ratios in your data), rather to show how each compressor does on different types of data, to highlight their different strengths.

Showing compression factor (eg. N:1 , higher is better) :

run details :

ZStd is 0.5.1 at level 21 (optimal)
LZMA is 7z -mx9 -m0=lzma:d24
Brotli is bro.exe by Sportman --quality 9 --window 24 (*)
Oodle is v2.13 at -z6 (Optimal2)

All competitors run via their provided exe

Some takeaways :

Binary structured data is really where the other compressors leave a lot of room to beat them. ("granny" and "records"). The difference in sizes on all the other files is pretty meh.

BitKnit does its special thang on granny - close to LZNA but 2X faster to decode (and ~ 6X faster than LZMA). Really super space-speed. BitKnit drops down to more like LZHLW levels on the non-record files (LZNA/LZMA has a small edge on them).

I was really surprised by ZStd vs Brotli. I actually went back and double checked by CSV to make sure I hadn't switched the columns by accident. In particular - Brotli does poorly on enwik7 (huh!?) but it does pretty well on "granny", and surprisingly ZStd does quite poorly on "granny" & "records". Not what I expected at all. Brotli is surprising poor on text/web and surprisingly good on binary record data.

LZHLW is still an excellent choice after all these years.

(* = Brotli quality 10 takes an order of magnitude longer than any of the others. I got fed up with waiting for it. Oodle also has "super" modes at -z8 that aren't used here. (**))

(for concreteness : Brotli 11 does pretty well on granny7 ; (6.148:1 vs 4.634:1 at q9) but it runs at 68 kb/s (!!) (and still not LZMA-level compression))

(** = I used to show results in benchmarks that required really slow encoders (for example the old LZNIB optimal "super parse" was hella slow); that can result in very small sizes and great decode speed, but it's a form of cheating. Encoders slower than 1 mb/s just won't be used, they're too slow, so it's reporting a result that real users won't actually see, and that's BS. I'm trying to be more legit about this now for my own stuff. Slow encoders are still interesting for research purposes because they show what should be possible, so you can try to get that result back in a faster way. (this in fact happened with LZNIB and is a Big Deal))

02-29-16 | Bit Input Notes

1. The big win of U64 branchless bit input is having >= 56 bits (or 57) after refill. The basic refill operation itself is not faster than branchy 32-at-a-time refills, but that only has >= 32 (or 33) bits after refill. The advantage comes if you can unconditionally consume bits knowing that count. eg. if you have a 12-bit limitted Huffman, you can consume 4 symbols without needing to refill.

2. The best case for bit input is when the length that you consume is not very variable. eg. in the Huffman case, 1-12 bits, has a reasonably low limit. The worst case is when it has a high max and is quite random. Then you can't avoid refill checks, and they're quite unpredictable (if you do the branchy case)

3. If your refills have a large maximum, but the average is low, branchy can be faster than branchless. Because the maximum is high (eg. maybe a max of 32 bits consumed), you can only do one decode op before checking refill. Branchless will then always refill. Branchy can skip the refill if the average is low - particularly if it's predictably low.

4. If using branchy refills, try to make it predictable. An interesting idea is to use multiple bit buffers so that each consumption spot gets its own buffer, and then can create a pattern. A very specific case is consuming a fixed number of bits. something like :


if ( random )
  consume 4 bits from bitbuffer
  if bitbuffer out -> refill
  consume 6 bits from bitbuffer
  if bitbuffer out -> refill

these branches (for bitbuffer refill) will be very random because of the two different sites that consume different amounts. However, this :

bitbuffer1, bitbuffer2

if ( random )
  consume 4 bits from bitbuffer1
  if bitbuffer1 out -> refill
  consume 6 bits from bitbuffer2
  if bitbuffer2 out -> refill

these branches for refill are now perfectly predictable in a pattern (they are taken every Nth time exactly).

5. Bit buffer work is slow, but it's "mathy". On modern processors that are typically math-starved, it can be cheap *if* you have enough ILP to fully use all the execution units. The problem is a single bit buffer on its own is super serial work, so you need multiple bit buffers running simultaneously, or enough other work.

For example, it can actually be *faster* than byte-aligned input (using something like "EncodeMod") if the byte-input does a branch, and that branch is unpredictable (in the bad 25-75% randomly taken range).

02-29-16 | LZSSE Notes

There are a few things that I think are interesting in LZSSE. And really very little of it is about the SIMD-ness.

1. SIMD processing of control words.

All LZ-Bytewises do a little bit of shifts and masks to pull out fields and flags from the control word. Stuff like lrl = (control>>4) and numbytesout = lrl+ml;

This work is pretty trivial, and it's fast already in scalar. But if you can do it N at a time, why not.

A particular advantage here is that SSE instruction sets are somewhat better at branchless code than scalar, it's a bit easier to make masks from conditions and such-like, so that can be a win. Also helps if you're front-end-bound, since decoding one instruction to do an 8-wide shift is less work than 8 instructions. (it's almost impossible for a data compressor to be back-end bound on simple integer math ops, there are just so many execution units; that's rare, it's much possible to hit instruction decode limits)

2. Using SSE in scalar code to splat out match or LRL.

LZSSE parses the control words SIMD (wide) but the actual literal or match copy is scalar, in the sense that only one is done at a time. It still uses SSE to fetch those bytes, but in a scalar way. Most LZ's can do this (many may do it already without being aware of it; eg. if you use memcpy(,16) you might be doing an SSE splat).

3. Limitted LRL and ML in control word with no excess. Outer looping on control words only, no looping on LRL/ML.

To output long LRL's, you have to output a series of control words, each with short LRL. To output long ML's, you have to output a series of control words.

This I think is the biggest difference in LZSSE vs. something like LZ4. You can make an LZ4 variant that works like this, and in fact it's an interesting thing to do, and is sometimes fast. In an LZ4 that does strictly alternating LRL-ML, to do this you need to be able to send ML==0 so that long literal runs can be continued as a sequence of control words.

Traditional LZ4 decoder :

lrl = control>>4;
ml = (control&0xF)+4;
off = get 2 bytes;  comp += 2;

// get excess if flagged with 0xF in control :
if ( lrl == 0xF ) lrl += *comp++; // and maybe more
if ( ml == 19 ) ml += *comp++; // and maybe more

copy(out,comp,lrl); // <- may loop on lrl
out += lrl; comp += lrl;

copy(out,out-off,ml); // <- may loop on ml
out += ml;

non-looping LZ4 decoder : (LZSSE style)

lrl = control>>4;
ml = control&0xF; // <- no +4 , 0 possible
off = get 2 bytes;  comp += 2;  // <- * see below

// no excess

copy(out,comp,16); // <- unconditional 16 byte copy, no loop
out += lrl; comp += lrl;

copy(out,out-off,16); // <- unconditional 16 byte copy, no loop
out += ml;

(* = the big complication in LZSSE comes from trying to avoid sending the offset again when you're continuing a match; something like if previous control word ml == 0xF that means a continuation so don't get offset)

(ignoring the issue of overlapping matches for now)

This non-looping decoder is much less branchy, no branches for excess lens, no branches for looping copies. It's much faster than LZ4 *if* the data doesn't have long LRL's or ML's in it.

4. Flagged match/LRL instead of strictly alternating LRL-ML. This is probably a win on data with lots of short matches, where matches often follow matches with no LRL in between, like text.

If you have to branch for that flag, it's a pretty huge speed hit (see, eg. LZNIB). So it's only viable in a fast LZ-Bytewise if you can do it branchless like LZSSE.

02-29-16 | LZSSE Results

Quick report of my results on LZSSE. (updated 03/06/2016)

(LZSSE Latest commit c22a696 ; fetched 03/06/2016 ; test machine Core i7-3770 3.4 GHz ; built MSVC 2012 x64 ; LZSSE2 and 8 optimal parse level 16)

Basically LZSSE is in fact great on text, faster than LZ4 and much better compression.

On binary, LZSSE2 is quite bad, but LZSSE8 is roughly on par with LZ4. It looks like LZ4 is maybe slightly better on binary than LZSSE8, but it's close.

In general, LZ4 is does well on files that tend to have long LRL's and long ML's. Files with lots of short (or zero) LRL's and short ML's are bad for LZ4 (eg. text) and not bad for LZSSE.

(LZB16 is Oodle's LZ4 variant; 64k window like LZSSE; LZNIB and LZBLW have large windows)

Some results :

enwik8 LZSSE2 : 100,000,000 ->38,068,528 : 2866.17 mb/s
enwik8 LZSSE8 : 100,000,000 ->38,721,328 : 2906.29 mb/s
enwik8 LZB16  : 100,000,000 ->43,054,201 : 2115.25 mb/s

(LZSSE kills on text)

lzt99  LZSSE2 : 24,700,820 ->15,793,708  : 1751.36 mb/s
lzt99  LZSSE8 : 24,700,820 ->15,190,395  : 2971.34 mb/s
lzt99  LZB16  : 24,700,820 ->14,754,643  : 3104.96 mb/s

(LZSSE2 really slows down on heterogenous binary file lzt99)
(LZSSE8 does okay, but slightly worse than LZ4/LZB16 in size & speed)

mozilla LZSSE2: 51,220,480 ->22,474,508 : 2424.21 mb/s
mozilla LZSSE8: 51,220,480 ->22,148,366 : 3008.33 mb/s
mozilla LZB16 : 51,220,480 ->22,337,815 : 2433.78 mb/s

(all about the same size on silesia mozilla)
(LZSSE8 definitely fastest)

lzt24  LZB16  : 3,471,552 -> 2,379,133 : 4435.98 mb/s
lzt24  LZSSE8 : 3,471,552 -> 2,444,527 : 4006.24 mb/s
lzt24  LZSSE2 : 3,471,552 -> 2,742,546 : 1605.62 mb/s
lzt24  LZNIB  : 3,471,552 -> 1,673,034 : 1540.25 mb/s

(lzt24 (a granny file) really terrible for LZSSE2; it's as slow as LZNIB)
(LZSSE8 fixes it though, almost catches LZB16, but not quite)


Some more binary files.  LZSSE2 is not good on any of these, so omitted.

win81  LZB16  : 104,857,600 ->54,459,677 : 2463.37 mb/s
win81  LZSSE8 : 104,857,600 ->54,911,633 : 3182.21 mb/s

all_dds LZB16 : 79,993,099 ->47,683,003 : 2577.24 mb/s
all_dds LZSSE8: 79,993,099 ->47,807,041 : 2607.63 mb/s

LZB16  :  7,105,158 -> 3,498,306 : 3350.06 mb/s
LZSSE8 :  7,105,158 -> 3,612,433 : 3548.39 mb/s

LZB16  : 58,788,904 ->32,862,033 : 2968.36 mb/s
LZSSE8 : 58,788,904 ->33,201,406 : 2642.94 mb/s

LZSSE8 vs LZB16 is pretty close.

LZSSE8 is maybe more consistently fast; its decode speed has less variation than LZ4. Slowest LZSSE8 was all_dds at 2607 mb/s ; LZ4 went down to 2115 mb/s on enwik8. Even excluding text, it was down to 2433 mb/s on mozilla. LZB16/LZ4 had a slightly higher max speed (on lzt24).

Conclusion :

On binary-like data, LZ4 and LZSSE8 are pretty close. On text-like data, LZSSE8 is definitely better. So for general data, it looks like LZSSE8 is a definite win.

02-17-16 | LZSSE

An LZ Codec Designed for SSE Decompression

LZSSE code

Some good stuff.

Basically this is a nibble control word LZ (like LZNIB). The nibble has a threshold value T, < T is an LRL (literal run len), >= T is a match length. LZSSET are various threshold variants. As Conor noted, ideally T would be variable, optimized per file (or even better - per quantum) to adapt to different data better.

LZSSE has a 64k window (like LZ4/LZB16) but unlike them supports MML (minimum match length) of 3. MML 3 typically helps compression a little, but in scalar decoders it really hurts speed.

I think the main interesting idea (other than implementation details) is that by limitting the LRL and ML, with no excess/overflow support (ML overflow is handled with continue-match nibbles), it means that you can do a non-looping output of 8/16 bytes. You get long matches or LRL's by reading more control nibbles.

That is, a normal LZ actually has a nested looping structure :

loop on controls from packed stream
 control specifies lrl/ml

 loop on lrl/ml
   output bytes

LZSSE only has *one* outer loop on controls.

There are some implementation problems at the moment. The LZSSE2 optimal parse encoder is just broken. It's unusably slow and must have some bad N^2 degeneracy. This can be fixed, it's not a problem with the format.

Another problem is that LZSSE2 expands incompressible data too much. Real world data (particularly in games) often has incompressible data mixed with compressible. The ideal fix would be to have the generalized LZSSET and choose T per quantum. A simpler fix would be to do something like cut files into 16k or 64k quanta, and to select the best of LZSSE2/4/8 per-quantum and also support uncompressed quanta to prevent expansion.

I will take this moment to complain that the test sets everyone is using are really shit. Not Conors fault, but enwiks and Silesia are grossly not at all representative of data that we see in the real world. Silesia is mostly text and weird highly-compressible data; the file I like best in there for my own testing is "mozilla" (though BTW mozilla also contains a bunch of internal zlib streams; it benefits enormously from precomp). We need a better test corpus!!!

02-11-16 | String Match Stress Test Files

A gift. My string match stress test set :

string_match_stress_tests.7z (60,832 bytes)

Consists of :


An optimal parse matcher (matching at every position in each file against all previous bytes within that file) should get these average match lengths : (min match length of 4, and no matches searched for in the last 8 bytes of each file)

paper1_twice : 13294.229727
stress_all_as : 21119.499148
stress_many_matches : 32.757760
stress_search_limit : 823.341331
stress_sliding_follow : 199.576550
stress_suffix_forward : 5199.164464

total ml : 2896554306
total bytes : 483870

Previous post on the same test set : 09-27-11 - String Match Stress Test

And these were used in the String Match Test post series , though there I used "twobooks" instead of "paper1_twice".

These stress tests are designed to make imperfect string matchers show their flaws. Correct implementations of Suffix Array or Suffix Tree searchers should find this total match length without ever going into bad N^2 slowdowns (their speed should be roughly constant). Other matchers like hash-link, LzFind (hash-btree) and MMC will either find lower total match length (due to an "amortize" step limit) or will fall into bad N^2 (or worse!) slowdowns.

01-29-16 | Oodle Network Usage Notes

Two things I thought to write down.

1. Oodle Network speed is very cache sensitive.

Oodle Network uses a shared trained model. This is typically 4 - 8 MB. As it compresses or decompresses, it needs to access random bits of that memory.

If you compress/decompress a packet when that model is cold (not in cache), every access will be a cache miss and performance can be quite poor.

In synthetic test, coding packets over and over, the model is as hot as possible (in caches). So performance can seem better in synthetic test loops than in the real world.

In real use, it's best to batch up all encoding/decoding operations as much as possible. Rather than do :

decode one packet
apply packet to world
do some other stuff

decode one packet
apply packet to world
do some other stuff


try to group all the Oodle Network encoding & decoding together :

gather up all my packets to send

receive all packets from network stack

encode all my outbound packets
decode all my inbound packets

now act on inbound packets

this puts all the usage of the shared model together as close as possible to try to maximize the amount that the model is found in cache.

2. Oodle Network should not be used on already compressed data. Oodle Network should not be used on large packets.

Most games send pre-compressed data of various forms. Some send media files such as JPEGs that are already compressed. Some send big blobs that have been packed with zlib. Some send audio data that's already been compressed.

This data should be excluded from the Oodle Network path and send without going through the compressor. It won't get any compression on them and will just take CPU time. (you could send them as a packet with complen == rawlen, which is a flag for "raw data" in Oodle Network).

More importantly, these packets should NOT be included in the training set for building the model. They are essentially random bytes and will just crud up the model. It's a bit like if you're trying to memorize the digits of Pi and someone keeps yelling random numbers in your ear. (Well, actually it's not like that at all, but those kind of totally bullshit analogies seem very popular, so there you are.)

On large packets that are not precompressed, Oodle Network will work, but it's just not the best choice. It's almost always better to use an Oodle LZ data compressor (BitKnit, LZNIB, whatever, depending on your space-speed tradeoff desired).

The vast majority of games have a kind of bipolar packet distribution :

A. normal frame update packets < 1024 bytes

B. occasional very large packets > 4096 bytes

it will work better to only use Oodle Network on the type A packets (smaller, standard updates) and to use Oodle LZ on the type B packets (rarer, large data transfers).

For example some games send the entire state of the level in the first few packets, and then afterward send only deltas from that state. In that style, the initial big level dump should be sent through Oodle LZ, and then only the smaller deltas go through Oodle Network.

Not only will Oodle LZ do better on the big packets, but by excluding them from the training set for Oodle Network, the smaller packets will be compressed better because the data will all have similar structure.

01-16-16 | Oodle 2.1.2

Oodle 2.1.2 is out. Oodle - now with more BitKnit!

Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\testsets\big\lzt99
got arg : num_repeats=5
lz test loading: r:\testsets\big\lzt99
uncompressed size : 24700820
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |1.51:517:2988|1.57:236:2971|1.62:109:2964|1.65: 37:3003|
LZBLW  |1.64:249:2732|1.74: 80:2682|1.77: 24:2679|1.85:1.6:2708|
LZNIB  |1.80:264:1627|1.92: 70:1557|1.94: 23:1504|2.04: 12:1401|
LZHLW  |2.16: 67: 424|2.30: 20: 447|2.33:7.2: 445|2.35:5.4: 445|
BitKnit|2.43: 28: 243|2.47: 20: 245|2.50: 13: 249|2.54:6.4: 249|
LZNA   |2.36: 24: 115|2.54: 18: 119|2.58: 13: 120|2.69:4.9: 120|
compression ratio:
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |    1.510    |    1.569    |    1.615    |    1.654    |
LZBLW  |    1.636    |    1.739    |    1.775    |    1.850    |
LZNIB  |    1.802    |    1.921    |    1.941    |    2.044    |
LZHLW  |    2.161    |    2.299    |    2.330    |    2.355    |
BitKnit|    2.431    |    2.471    |    2.499    |    2.536    |
LZNA   |    2.363    |    2.542    |    2.584    |    2.686    |
encode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |    517.317  |    236.094  |    108.555  |     36.578  |
LZBLW  |    248.537  |     80.299  |     23.663  |      1.610  |
LZNIB  |    263.950  |     69.930  |     22.617  |     11.735  |
LZHLW  |     67.154  |     20.019  |      7.200  |      5.425  |
BitKnit|     28.203  |     20.223  |     12.672  |      6.371  |
LZNA   |     24.192  |     18.423  |     12.883  |      4.907  |
decode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |   2988.429  |   2971.339  |   2963.616  |   3003.187  |
LZBLW  |   2731.951  |   2681.796  |   2678.558  |   2707.534  |
LZNIB  |   1626.806  |   1557.309  |   1504.097  |   1400.654  |
LZHLW  |    423.936  |    446.990  |    444.832  |    445.040  |
BitKnit|    242.916  |    245.409  |    248.812  |    248.972  |
LZNA   |    114.791  |    119.369  |    119.994  |    120.362  |

Another test :

Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\game_testset_m0.7z
got arg : num_repeats=5
lz test loading: r:\game_testset_m0.7z
uncompressed size : 79290970
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |1.4:1039:4304|1.41:438:4176|1.42:184:4202|1.44: 52:4293|1.44:4.5:4407|
LZBLW  |1.51:380:3855|1.55:124:3778|1.56: 26:3774|1.62:1.0:3862|1.62:1.0:3862|
LZNIB  |1.56:346:2406|1.59: 84:2398|1.62: 24:2054|1.67: 15:2048|1.67: 10:2053|
LZHLW  |1.67: 85: 647|1.74: 25: 679|1.75:6.5: 635|1.77:3.3: 613|1.79:1.5: 618|
BitKnit|1.83: 24: 395|1.90: 18: 409|1.90: 12: 408|1.91:7.1: 402|1.91:6.5: 401|
LZNA   |1.78: 22: 171|1.84: 18: 178|1.88: 12: 185|1.93:5.6: 167|1.93:1.5: 167|
compression ratio:
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |    1.390    |    1.408    |    1.424    |    1.436    |    1.442    |
LZBLW  |    1.509    |    1.548    |    1.558    |    1.615    |    1.615    |
LZNIB  |    1.557    |    1.593    |    1.622    |    1.669    |    1.668    |
LZHLW  |    1.669    |    1.745    |    1.754    |    1.767    |    1.790    |
BitKnit|    1.825    |    1.897    |    1.905    |    1.913    |    1.915    |
LZNA   |    1.781    |    1.838    |    1.878    |    1.927    |    1.932    |
encode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |   1038.910  |    437.928  |    184.457  |     52.008  |      4.465  |
LZBLW  |    380.030  |    123.621  |     26.028  |      0.973  |      0.973  |
LZNIB  |    345.905  |     83.577  |     24.299  |     14.544  |     10.444  |
LZHLW  |     84.519  |     25.218  |      6.542  |      3.256  |      1.547  |
BitKnit|     24.116  |     17.944  |     12.476  |      7.052  |      6.464  |
LZNA   |     21.859  |     18.034  |     11.767  |      5.602  |      1.465  |
decode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |   4304.144  |   4175.854  |   4202.491  |   4292.925  |   4406.853  |
LZBLW  |   3855.255  |   3777.826  |   3774.093  |   3861.922  |   3861.582  |
LZNIB  |   2406.379  |   2397.753  |   2054.429  |   2048.329  |   2053.340  |
LZHLW  |    646.796  |    679.173  |    635.035  |    613.051  |    617.994  |
BitKnit|    394.599  |    408.539  |    408.044  |    402.239  |    401.352  |
LZNA   |    171.111  |    177.565  |    184.677  |    167.439  |    166.904  |

vs LZMA :
ratio: 1.901
enc  : 2.70 mb/s
dec  : 30.27 mb/s

On this file, BitKnit is 13X faster to decode than LZMA, and gets more compression. (or at "Normal" level, the ratio is similar and BitKnit is 4.6X faster to encode).

12-24-15 | RANS in practice

RANS in practice

Usage points for me :

0. My god, adaptive coding is sweet. I've been doing static Huffman and TANS for a while, so I sort of got used to them, and I forgot how nice it is to not have to deal with that shit. (optimal parsing with static entropy coders has the horrible feedback loop and iterations required, you have to find optimal chunking/transmission points, you have to tweak out your codelen/probability transmission to send them compactly and quickly, blah blah). In comparison, adaptive coding is just so simple. It's literally 10X fewer lines of codes.

1. For binary coding, RANS is no win over arithmetic. (Jarek calls "binary ANS" "ABS" but I see no need for another acronym; let's just say "binary ANS"). (and given the option, you'd rather have the FIFO arithmetic coding)

2. For multi-symbol, power-of-2 cumulative probability sum, adaptive RANS is really good.

3. If your model is really complex, like an N-ary Fenwick tree or anything crazy, or if you have to do binary search in cumprobs to do decoding, the difference between RANS and arithmetic can be hidden.

To make adaptive RANS really shine, you need a model specialized for power-of-2 totals, such as the classic "deferred summation" or the new nibble model (cumprob blending), or other. It's only for models that are quite fast that the speed difference between RANS and arithmetic becomes dramatic.

What I'm trying to get at is if you just take an existing compressor based on arithmetic coding, which probably uses binary arithmetic coding, or a rather complex N-ary coder, and just replace the arithmetic part with ANS - you might not see much benefit at all. There are three big stalls - divides, cache misses, branches - and they can hide each other, so if you just eliminate one of the three stalls, it doesn't help much.

ADD : 4. A lot of the recent work (TANS, Yann's Huff work, some of the RANS encoders, etc.) that are very fast also use rather a lot of memory. They make use of tables to speed up coding. That's fine for order-0 coding, or if you have very few contexts (perhaps 2 pos bits), but for order-1 coding or larger contexts, it becomes a problem. Of course your memory use becomes high, but the bigger problem is that your tables no longer fit in cache. Fast table-based coding doesn't make any sense if accessing the tables is a cache miss.

12-23-15 | Oodle Results Update

Major improvements coming in Oodle 2.1.2

Fabian's BitKnit is coming to Oodle. BitKnit is a pretty unique LZ; it makes clever use of the properties of RANS to hit a space-speed tradeoff point that nothing else does. It gets close to LZMA compression levels (sometimes more, sometimes less) while being more like zlib speed.

LZNA and LZNIB are also much improved. The bit streams are the same, but we found some little tweaks in the encoders & decoders that make significant difference. (5-10%, but that's a lot in compression, and they were already world-beating, so the margin is just bigger now). The biggest improvement came from some subtle issues in the parsers.

As usual, I'm trying to be as fair as possible to the competition. Everything is run single threaded. LZMA and LZHAM are run at max compression with context bits at their best setting. Compressors like zlib that are just not even worth considering are not included, I've tried to include the strongest competition that I know of now. This is my test of "slowies" , that is, all compressors set at high (not max) compression levels. ("oohc" is Oodle Optimal1 , my compression actually goes up quite a bit at higher levels, but I consider anything below 2 mb/s to encode to be just too slow to even consider).

The raw data : ("game test set")

by ratio:
oohcLZNA    :  2.88:1 ,    5.3 enc mb/s ,  135.0 dec mb/s
lzma        :  2.82:1 ,    2.9 enc mb/s ,   43.0 dec mb/s
oohcBitKnit :  2.76:1 ,    6.4 enc mb/s ,  273.3 dec mb/s
lzham       :  2.59:1 ,    1.8 enc mb/s ,  162.9 dec mb/s
oohcLZHLW   :  2.38:1 ,    4.2 enc mb/s ,  456.3 dec mb/s
zstdhc9     :  2.11:1 ,   29.5 enc mb/s ,  558.0 dec mb/s
oohcLZNIB   :  2.04:1 ,   11.5 enc mb/s , 1316.4 dec mb/s

by encode speed:
zstdhc9     :  2.11:1 ,   29.5 enc mb/s ,  558.0 dec mb/s
oohcLZNIB   :  2.04:1 ,   11.5 enc mb/s , 1316.4 dec mb/s
oohcBitKnit :  2.76:1 ,    6.4 enc mb/s ,  273.3 dec mb/s
oohcLZNA    :  2.88:1 ,    5.3 enc mb/s ,  135.0 dec mb/s
oohcLZHLW   :  2.38:1 ,    4.2 enc mb/s ,  456.3 dec mb/s
lzma        :  2.82:1 ,    2.9 enc mb/s ,   43.0 dec mb/s
lzham       :  2.59:1 ,    1.8 enc mb/s ,  162.9 dec mb/s

by decode speed:
oohcLZNIB   :  2.04:1 ,   11.5 enc mb/s , 1316.4 dec mb/s
zstdhc9     :  2.11:1 ,   29.5 enc mb/s ,  558.0 dec mb/s
oohcLZHLW   :  2.38:1 ,    4.2 enc mb/s ,  456.3 dec mb/s
oohcBitKnit :  2.76:1 ,    6.4 enc mb/s ,  273.3 dec mb/s
lzham       :  2.59:1 ,    1.8 enc mb/s ,  162.9 dec mb/s
oohcLZNA    :  2.88:1 ,    5.3 enc mb/s ,  135.0 dec mb/s
lzma        :  2.82:1 ,    2.9 enc mb/s ,   43.0 dec mb/s

Log opened : Fri Dec 18 17:56:44 2015

total : oohcLZNIB   : 167,495,105 ->81,928,287 =  3.913 bpb =  2.044 to 1 
total : encode           : 14.521 seconds, 3.39 b/kc, rate= 11.53 M/s
total : decode           : 0.127 seconds, 386.85 b/kc, rate= 1316.44 M/s
total : encode+decode    : 14.648 seconds, 3.36 b/kc, rate= 11.43 M/s
total : oohcLZHLW   : 167,495,105 ->70,449,624 =  3.365 bpb =  2.378 to 1 
total : encode           : 40.294 seconds, 1.22 b/kc, rate= 4.16 M/s
total : decode           : 0.367 seconds, 134.10 b/kc, rate= 456.33 M/s
total : encode+decode    : 40.661 seconds, 1.21 b/kc, rate= 4.12 M/s
total : oohcLZNA    : 167,495,105 ->58,242,995 =  2.782 bpb =  2.876 to 1 
total : encode           : 31.867 seconds, 1.54 b/kc, rate= 5.26 M/s
total : decode           : 1.240 seconds, 39.68 b/kc, rate= 135.04 M/s
total : encode+decode    : 33.107 seconds, 1.49 b/kc, rate= 5.06 M/s
total : oohcBitKnit : 167,495,105 ->60,763,350 =  2.902 bpb =  2.757 to 1 
total : encode           : 26.102 seconds, 1.89 b/kc, rate= 6.42 M/s
total : decode           : 0.613 seconds, 80.33 b/kc, rate= 273.35 M/s
total : encode+decode    : 26.714 seconds, 1.84 b/kc, rate= 6.27 M/s
total : zstdhc9     : 167,495,105 ->79,540,333 =  3.799 bpb =  2.106 to 1 
total : encode           : 5.671 seconds, 8.68 b/kc, rate= 29.53 M/s
total : decode           : 0.300 seconds, 163.98 b/kc, rate= 558.04 M/s
total : encode+decode    : 5.971 seconds, 8.24 b/kc, rate= 28.05 M/s
total : lzham       : 167,495,105 ->64,682,721 =  3.089 bpb =  2.589 to 1 
total : encode           : 93.182 seconds, 0.53 b/kc, rate= 1.80 M/s
total : decode           : 1.028 seconds, 47.86 b/kc, rate= 162.86 M/s
total : encode+decode    : 94.211 seconds, 0.52 b/kc, rate= 1.78 M/s
total : lzma        : 167,495,105 ->59,300,023 =  2.832 bpb =  2.825 to 1 
total : encode           : 57.712 seconds, 0.85 b/kc, rate= 2.90 M/s
total : decode           : 3.898 seconds, 12.63 b/kc, rate= 42.97 M/s
total : encode+decode    : 61.610 seconds, 0.80 b/kc, rate= 2.72 M/s

11-13-15 | Flipped encodemod

A while ago I wrote a series on Encoding Values in Bytes in which I talk about the "EncodeMod" varint encoding.

EncodeMod is just the idea that you send each token (byte, word, nibble, whatever) with two ranges; in one range the values are terminal (no more tokens), while in the other range it means "this is part of the value" but more tokens follow. You can then optimize the division point for a wide range of applications.

In my original pseudo-code I was writing the ranges with the "more tokens" follow at the bottom, and terminal values at the top. That is :

Specifically for the case of byte tokens and pow2 mod

mod = 1<<bits

in each token we send "bits" of values that don't currently fit

upper = 256 - mod

"upper" is the number of terminal values we can send in the current token

I was writing

[0,mod) = bits of value + more tokens follow
[mod,256) = terminal value

Ryg spotted that the code is slightly simpler if you switch the ranges. Use the low range [0,upper) for terminal values and [upper,256) for non-terminal values. The ranges are the same, so you get the same encoded lengths.

(BTW it also occurred to me when learning about ANS that EncodeMod is a form of ANS. You're trying to send a bit - "do more bytes follow". You're putting that bit in a token, and you have some extra information you can send with that bit - so just put some of your value in there. The number of slots for bit=0 and 1 should correspond to the probability of each event.)

The switched encodemod is :

U8 *encmod(U8 *to, int val, int bits)
    const int upper = 256 - (1<<bits); // binary, this is 1110000 or similar (8-bits ones, bits zeros)
    while (val >= upper)
        *to++ = (U8) (upper | val);
        val = (val - upper) >> bits;

    *to++ = (U8) val;
    return to;

const U8 *decmod(int *outval, const U8 *from, int bits)
    const int upper = 256 - (1<<bits);
    int shift = 0;
    int val = 0;

    for (;;)
        int byte = *from++;
        val += byte << shift;
        if (byte < upper)
        shift += bits;

    *outval = val;
    return from;

The simplification of the encoder here :

    *to++ = (U8) (upper | val);
    val = (val - upper) >> bits;

written in long-hand is :

    low = val & ((1<<bits)-1);
    *to++ = upper + low;  // (same as upper | low, same as upper | val)
    val -= upper;
    val >>= bits;


    val -= upper;
    low = val & ((1<<bits)-1);
    *to++ = upper + low;  // (same as upper | low, same as upper | val)
    val >>= bits;

and the val -= upper can be done early or late because val >= upper it doesn't touch "low"

Basically by using "upper" like this, the mask of low bits and add of upper is done in one op.

10-17-15 | Huffman Performance

I'm following Yann Collet's nice blog series on Huffman. I thought I'd have my own look.

Background : 64-bit mode. 12-bit lookahead table, and 12-bit codelen limit, so there's no out-of-table case to handle.

Here's conditional bit buffer refill, 32-bits refilled at a time, aligned refill. Always >= 32 bits in buffer so you can do two decode ops per refill :

            uint64 peek; int cl,sym;
            peek = decode_bits >> (64 - CODELEN_LIMIT);
            cl = codelens[peek];
            sym = symbols[peek];
            decode_bits <<= cl; thirtytwo_minus_decode_bitcount += cl;
            *decodeptr++ = (uint8)sym;
            peek = decode_bits >> (64 - CODELEN_LIMIT);
            cl = codelens[peek];
            sym = symbols[peek];
            decode_bits <<= cl; thirtytwo_minus_decode_bitcount += cl;
            *decodeptr++ = (uint8)sym;
            if ( thirtytwo_minus_decode_bitcount > 0 )
                uint64 next = _byteswap_ulong(*decode_in++);
                decode_bits |= next << thirtytwo_minus_decode_bitcount;
                thirtytwo_minus_decode_bitcount -= 32;

325 mb/s.

(note that removing the bswap to have a little-endian u32 stream does almost nothing for performance, less than 1 mb/s)

The next option is : branchless refill, unaligned 64-bit refill. You always have >= 56 bits in buffer, now you can do 4 decode ops per refill :

            // refill :
            uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
            bits |= next >> bitcount;
            int bytes_consumed = (64 - bitcount)>>3;
            decode_in += bytes_consumed;
            bitcount += bytes_consumed<<3;
            uint64 peek; int cl; int sym;
            #define DECONE() \
            peek = bits >> (64 - CODELEN_LIMIT); \
            cl = codelens[peek]; sym = symbols[peek]; \
            bits <<= cl; bitcount -= cl; \
            *decodeptr++ = (uint8) sym;
            #undef DECONE
373 mb/s

These so far have both been "traditional Huffman" decoders. That is, they use the next 12 bits from the bit buffer to look up the Huffman decode table, and they stream bits into that bit buffer.

There's another option, which is "ANS style" decoding. To do "ANS style" you keep the 12-bit "peek" as a separate variable, and you stream bits from the bit buffer into the peek variable. Then you don't need to do any masking or shifting to extract the peek.

The naive "ANS style" decode looks like this :

            // refill bits :
            uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
            bits |= next >> bitcount;
            int bytes_consumed = (64 - bitcount)>>3;
            decode_in += bytes_consumed;
            bitcount += bytes_consumed<<3;
            int cl; int sym;
            #define DECONE() \
            cl = codelens[state]; sym = symbols[state]; \
            state = ((state << cl) | (bits >> (64 - cl))) & ((1 << CODELEN_LIMIT)-1); \
            bits <<= cl; bitcount -= cl; \
            *decodeptr++ = (uint8) sym;
            #undef DECONE

332 mb/s

But we can use an analogy to the "next_state" of ANS. In ANS, the next_state is a complex thing with certain rules (as we covered in the past). With Huffman it's just this bit of math :

    next_state[state] = (state << cl) & ((1 << CODELEN_LIMIT)-1);

So we can build that table, and use a "fully ANS" decoder :

            // refill bits :
            uint64 next = _byteswap_uint64(*((uint64 *)decode_in));
            bits |= next >> bitcount;
            int bytes_consumed = (64 - bitcount)>>3;
            decode_in += bytes_consumed;
            bitcount += bytes_consumed<<3;
            int cl; int sym;
            #define DECONE() \
            cl = codelens[state]; sym = symbols[state]; \
            state = next_state_table[state] | (bits >> (64 - cl)); \
            bits <<= cl; bitcount -= cl; \
            *decodeptr++ = (uint8) sym;
            #undef DECONE

415 mb/s

Fastest! It seems the fastest Huffman decoder is a TANS decoder. (*1)

(*1 = well, on this machine anyway; these are all so close that architecture and exact usage matters massively; in particular we're relying heavily on fast unaligned reads, and doing four unrolled decodes in a row isn't always useful)

Note that this is a complete TANS decoder save one small detail - in TANS the "codelen" (previously called "numbits" in my TANS code) can be 0. The part where you do :

(bits >> (64 - cl))

can't be used if cl can be 0. In TANS you either have to check for zero, or you have to use the method of

((bits >> 1) >> (63 - cl))

which makes TANS a tiny bit slower - 370 mb/s for TANS on the same file on my machine.

(all times reported are non-interleaved, and without table build time; Huffman is definitely faster to build tables, and faster to decode packed/transmitted codelens as well)

NOTE : earlier version of this post had a mistake in bitcount update and worse timings.

Some tiny caveats :

1. The TANS way means you can't (easily) mix different peek amounts. Say you're doing an LZ, you might want an 11-bit peek for literals, but for the 4 bottom bits you only need an 8-bit peek. The TANS state has the # of bits to peek baked in, so you can't just use that. With the normal bit-buffer style Huffman decoders you can peek any # of bits you want. (though you could just do the multi-state interleave thing here, keeping with the TANS style).

2. Doing Huffman decodes without a strict codelen limit the TANS way is much uglier. With the bits-at-top bitbuffer method there are nice ways to do that.

3. Getting raw bits the TANS way is a bit uglier. Say you want to grab 16 raw bits; you could get 12 from the "state" and then 4 more from the bit buffer. Or just get 16 directly from the bit buffer which means they need to be sent after the next 12 bits of Huffman in a weird TANS interleave style. This is solvable but ugly.

4. For the rare special case of an 8 or 16-bit peek-ahead, you can do even faster than the TANS style by using a normal bit buffer with the next bits at bottom. (either little endian or big-endian but rotated around). This lets you grab the peek just by using "al" on x86.

09-19-15 | Library Writing Realizations

Some learnings about library writing, N years on.

X. People will just copy-paste your example code.

This is obvious but is something to keep in mind. Example code should never be sketches. It should be production ready. People will not read the comments. I had lots of spots in example code where I would write comments like "this is just a sketch and not ready for production; production code needs to check error returns and handle failures and be endian-independent" etc.. and of course people just copy-pasted it and didn't change it. That's not their fault, that's my fault. Example code is one of the main ways people get into your library.

X. People will not read the docs.

Docs are almost useless. Nobody reads them. They'll read a one page quick start, and then they want to just start digging in writing code. Keep the intros very minimal and very focused on getting things working.

Also be aware that if you feel you need to write a lot of docs about something, that's a sign that maybe things are too complicated.

X. Peripheral helper features should be cut.

Cut cut cut. People don't need them. I don't care how nice they are, how proud of them you are. Pare down mercilessly. More features just confuse and crud things up. This is like what a good writer should do. Figure out what your one core function really is and cut down to that.

If you feel that you really need to include your cute helpers, put them off on the side, or put them in example code. Or even just keep them in your pocket at home so that when someone asks about "how I do this" you can email them out that code.

But really just cut them. Being broad is not good. You want to be very narrow. Solve one clearly defined problem and solve it well. Nobody wants a kitchen sink library.

X. Simplicity is better.

Make everything as simple as possible. Fewer arguments on your functions. Remove extra functions. Cut everywhere. If you sacrifice a tiny bit of possible efficiency, or lose some rare functionality, that's fine. Cut cut cut.

For example, to plug in an allocator for Oodle used to require 7 function pointers : { Malloc, Free, MallocAligned, FreeSized, MallocPage, FreePage, PageSize }. (FreeSized for efficiency, and the Page stuff because async IO needs page alignment). It's now down just 2 : { MallocAligned, Free }. Yes it's a tiny bit slower but who cares. (and the runtime can work without any provided allocators)

X. Micro-efficiency is not important.

Yes, being fast and lean is good, but not when it makes things too complex or difficult to use. There's a danger of a kind of mental-masturbation that us RAD-type guys can get caught in. Yes, your big stream processing stuff needs to be competitive (eg. Oodle's LZ decompress, or Bink's frame decode time). But making your Init() call take 100 clocks instead of 10,000 clocks is irrelevant to everyone but you. And if it requires funny crap from the user, then it's actually making things worse, not better. Having things just work reliably and safely and easily is more important than micro-efficiency.

For example, one mistake I made in Oodle is that the compressed streams are headerless; they don't contain the compressed or decompressed size. The reason I did that is because often the game already has that information from its own headers, so if I store it again it's redundant and costs a few bytes. But that was foolish - to save a few bytes of compressed size I sacrifice error checking, robustness, and convenience for people who don't want to write their own header. It's micro-efficiency that costs too much.

Another one I realized is a mistake : to do actual async writes on Windows, you need to call SetFileValidData on the newly enlarged file region. That requires admin privileges. It's too much trouble, and nobody really cares. It's no worth the mess. So in Oodle2 I just don't do that, and writes are no longer async. (everyone else who thinks they're doing async writes isn't actually, and nobody else actually checks on their threading the way I do, so it just makes me more like everyone else).

X. It should just work.

Fragile is bad. Any API's that have to go in some complicated sequence, do this, then this, then this. That's bad. (eg. JPEGlib and PNGlib). Things should just work as simply as possible without requirements. Operations should be single function calls when possible. Like if you take pointers in and out, don't require them to be aligned in a certain way or padded or allocated with your own allocators. Make it work with any buffer the user provides. If you have options, make things work reasonably with just default options so the user can ignore all the option setup if they want. Don't require Inits before your operations.

In Oodle2 , you just call Decompress(pointer,size,pointer) and it should Just Work. Things like error handling and allocators now just fall back to reasonable light weight defaults if you don't set up anything explicitly.

X. Special case stuff should be external (and callbacks are bad).

Anything that's unique to a few users, or that people will want to be different should be out of the library. Make it possible to do that stuff through client-side code. As much as possible, avoid callbacks to make this work, try to do it through imperative sequential code.

eg. if they want to do some incremental post-processing of data in place, it should be possible via : { decode a bit, process some, decode a bit , process some } on the client side. Don't do it with a callback that does decode_it_all( process_per_bit_callback ).

Don't crud up the library feature set trying to please everyone. Some of these things can go in example code, or in your "back pocket code" that you send out as needed.

X. You are writing the library for evaluators and new users.

When you're designing the library, the main person to think about is evaluators and new users. Things need to be easy and clear and just work for them.

People who actually license or become long-term users are not a problem. I don't mean this in a cruel way, we don't devalue them and just care about sales. What I mean is, once you have a relationship with them as a client, then you can talk to them, help them figure out how to use things, show them solutions. You can send them sample code or even modify the library for them.

But evaluators won't talk to you. If things don't just work for them, they will be frustrated. If things are not performant or have problems, they will think the library sucks. So the library needs to work well for them with no help from you. And they often won't read the docs or even use your examples. So it needs to go well if they just start blindly calling your APIs.

(this is a general principle for all software; also all GUI design, and hell just design in general. Interfaces should be designed for the novice to get into it easy, not for the expert to be efficient once they master it. People can learn to use almost any interface well (*) once they are used to it, so you don't have to worry about them.)

(* = as long as it's low latency, stateless, race free, reliable, predictable, which nobody in the fucking world seems to understand any more. A certain sequence of physical actions that you develop muscle memory for should always produce the same result, regardless of timing, without looking at the device or screen to make sure it's keeping up. Everyone who fails this (eg. everyone) should be fucking fired and then shot. But this is a bit off topic.)

X. Make the default log & check errors. But make the default reasonably fast.

This is sort of related to the evaluator issue. The defaults of the library need to be targetted at evaluators and new users. Advanced users can change the defaults if they want; eg. to ship they will turn off logging & error checking. But that should not be how you ship, or evaluators will trigger lots of errors and get failures with no messages. So you need to do some amount of error checking & logging so that evaluators can figure things out. *But* they will also measure performance without changing the settings, so your default settings must also be fast.

X. Make easy stuff easy. It's okay if complicated stuff is hard.

Kind of self explanatory. The API should be designed so that very simple uses require tiny bits of code. It's okay if something complicated and rare is a pain in the ass, you don't need to design for that; just make it possible somehow, and if you have to help out the rare person who wants to do a weird thing, that's fine. Specifically, don't try to make very flexible general APIs that can do everything & the kitchen sink. It's okay to have a super simple API that covers 99% of users, and then a more complex path for the rare cases.

07-26-15 | The Wait on Workers Problem

I'd like to open source my Oodle threading stuff. There's some cool stuff. Some day. Sigh.

This is an internal email I sent on 05-13-2015 :

Cliff notes : there's a good reason why OS'es use thread pools and fibers to solve this problem.

There's this problem that I call the "wait on workers problem". You have some worker threads. Worker threads pop pending work from a queue, do it, then post a completion event. You can't ever call Wait (Wait checks a condition, and if not set, puts the thread to sleep pending that condition) on them, because it could possibly deadlock you (no progress possible) since they could all go to sleep in waits, with work still pending and noone to do it. The most obvious example is just to imagine you only have 1 worker thread. Your worker thread does something like : { stuff spawn work2 Wait(work2); more stuff } Oh crap, work2 never runs because the Wait put me to sleep and there's no worker to do it. In Oodle the solution I use is that you should never do a real Wait on a worker, instead you have to "Yield". What Yield does is change your current work item back to Pending, but with the specified handle as a condition to being run. Then it returns back to the work dispatcher loop. So the above example becomes : [worker thread dispatch loop pops Work1] Work1: { stuff spawnm work2 Yield(work2); } [Work1 is put back on the pending list, with work2 as a condition] [worker thread dispatch loop pops Work2] Work2 Work2 posts completion [worker thread dispatch loop pops Work1] { more stuff } So. The Yield solution works to an extent, but it runs into problems. 1. I only have "shallow yield" (non-stack-saving yield), so the worker must manually save its state or stack variables to be able to resume. I don't have "deep yield" that can yield from deep within a series of calls, that would save the execution location and stack. This can be a major problem in practice. It means you can only yield from the top level, you can't ever be down inside some function calls and logic and decide you need to yield. It means all your threading branching has to be very linear and mapped out at the top level of the work function. It works great for simple linear processing like do an IO then yield on it, then process the results of the IO. It doesn't work great for more complicated general parallelism. 2. Because Yield is different from Wait, you can't share code, and you can still easily accidentally break the system by calling Wait. For example if you have a function like DoStuffInParallel , if you run that on a non-worker thread, it can launch some work items then Wait on them. You can't do that from a worker. You must rewrite it for being run from a worker to launch items then return a handle to yield on them (don't yield internally). It creates an ugly and difficult heterogeneity between worker threads and non-worker threads. So, we'd like to fix this. What we'd like is essentially "deep yield" and we want it to just be like an OS Wait, so that functions can be used on worker threads or non-worker threads without changing them. So my first naive idea was : "Wait on Workers" can be solved by making Wait a dispatch. Any time you call Wait, the system checks - am I a worker thread, and if so, instead of actually going into an OS wait, it pops and runs any runnable work. After completing each work item, it rechecks the wait condition and if it's set, stops dispatching and returns to the Wait-caller. If there is no runnable work, you go into an OS wait on either the original wait condition OR runnable work available. So the original example becomes : { stuff spawn work2 Wait(work2); [Wait sees we're a worker and runs the work dispatcher] [work dispatcher pops work2] { Work2 } [work dispatcher return sees work1 now runnable and returns] more stuff } Essentially this is using the actual stack to do stack-saving. Rather than trying to save the stack and instruction pointer, you just use the fact that they are saved by a normal function call & return. This method has minor disadvantages in that it can require a very large amount of stack if you go very deep. But the real problem is it can easily deadlock. It only works for tree-structured work, and Waits that are only on work items. If you have non-tree wait cycles, or waits on non-work-items, it can deadlock. Here's one example : Work1 : { stuff1 Wait on IO stuff2 } Work2 : { stuff1 Wait on Work1 stuff2 } with current Oodle system, you can make work like this, and it will complete. (*) In any system, if Work1 and Work2 get separate threads, they will complete. But in a Dispatch-on-Wait system, if the Wait on IO in Work1 runs Work2, it will deadlock. (* = the Oodle system ensures completability by only giving you a waitable handle to a work item when that work is enqueued to run. So it's impossible to make loops. But you can make something like the above by doing h1 = Run(Work1) Work2.handle = h1; Run(Work2); *) Once you're started Work2 on your thread, you're hosed, you can't recover from that, because you already have Work1 in progress. Dispatch-on-Wait really only works for a very limited work pattern : you only Wait on work that you made yourself. None of the work you make yourself can Wait on anything but work they make themselves. Really it only allows you to run tree-structured child work, not general threading. So, one option is use Dispatch-on-Wait but with a rule that if you're on a worker you can only use it for tree-strcutured-child-work. If you need to do more general waits, you still do the coroutine Yield. Or you can try to solve the general problem. In hindsight the solution is obvious, since it's what the serious OS people do : thread pools. You want to have 4 workers running on a 4 core system. You actually have a thread pool of 32 worker threads (or whatever) and try to keep at least 4 running at all times. Any time you Wait on a worker, you first Wake a thread from the pool, then put your thread to sleep. Any time a worker completes a work item it checks how many worker threads are awake, and if it's too many it goes to sleep. This is just a way of using the thread system to do the stack-saving and instruction-pointer saving that you need for "deep yield". The Wait() is essentially doing that deep return back up to the Worker dispatch loop, but it does it by sleeping the current thread and waking another that can start from the dispatch loop. This just magically fixes all the problems. You can wait on arbitrary things, you can deep-wait anywhere, you don't get deadlocks. The only disadvantage is the overhead of the thread switch. If you really want the micro-efficiency, you could still provide a "WaitOnChildWork" that runs the work dispatch loop, which is to be used only for the tree-structured work case. This lets you avoid the thread pool work and is a reasonably common case.

06-04-15 | Goodbye and Hello

Well, fuck me. Google has broken the older GData API for posting to blogger.

I understand progress has to happen and so on, and older APIs have to get retired sometimes. Well, no not really; that's not actually what happens in the modern world. I'm sick of the slap-dash upgrading and deprecating that has nothing to do with necessity and is just random chaos. You all can fuck around with it if you want. Not me. I love algorithms. I love programming when the problems are inherent, mathematical problems. Not problems like this fucking API doesn't do what it says it does, or this shit does different things in different versions so I have to detect that and hack it and, oh crap my platform sdk updated and nothing works any more and fuck me.

I think this blog (on Blogger) is probably dead. I can't be bothered to fix my poster (working in C# is a nightmare). Goodbye.

You can read the raw text blog at http://www.cbloom.com/rants.html

also Hello!

For a little while I've been writing a new blog.

It was inspired by el trastero | de Iņigo Quilez which is a fabulous blog. el trastero has some little technical thoughts some times, but also personal stuff, and lots of humanity. I love it. It's how my blog started, and somewhere along the way I lost the point. So I started writing a new blog, inspired by my old blog.

It's called "rambles", and it's here : http://www.cbloom.com/rambles.html

It's personal and inappropriate and you probably shouldn't read it.

Goodbye and hello.

06-04-15 | LZNA encode speed addendum

Filling in a gap in the previous post : cbloom rants 05-09-15 - Oodle LZNA

The encode speeds on lzt99 :

single-threaded :



-z5 (Optimal1) :
24,700,820 -> 9,207,584 =  2.982 bpb =  2.683 to 1
encode           : 10.809 seconds, 1.32 b/kc, rate= 2.29 mb/s
decode           : 0.318 seconds, 44.87 b/kc, rate= 77.58 mb/s

-z6 (Optimal2) :
24,700,820 -> 9,154,343 =  2.965 bpb =  2.698 to 1
encode           : 14.727 seconds, 0.97 b/kc, rate= 1.68 mb/s
decode           : 0.313 seconds, 45.68 b/kc, rate= 78.99 mb/s

-z7 (Optimal3) :
24,700,820 -> 9,069,473 =  2.937 bpb =  2.724 to 1
encode           : 20.473 seconds, 0.70 b/kc, rate= 1.21 mb/s
decode           : 0.317 seconds, 45.06 b/kc, rate= 77.92 mb/s



lzmahigh : 24,700,820 -> 9,329,982 =  3.022 bpb =  2.647 to 1
encode           : 11.373 seconds, 1.26 b/kc, rate= 2.17 M/s
decode           : 0.767 seconds, 18.62 b/kc, rate= 32.19 M/s



lzham : 24,700,820 ->10,140,761 =  3.284 bpb =  2.436 to 1
encode           : 16.732 seconds, 0.85 b/kc, rate= 1.48 M/s
decode           : 0.242 seconds, 59.09 b/kc, rate= 102.17 M/s


lzham : 24,700,820 ->10,097,341 =  3.270 bpb =  2.446 to 1
encode           : 18.877 seconds, 0.76 b/kc, rate= 1.31 M/s
decode           : 0.239 seconds, 59.73 b/kc, rate= 103.27 M/s


lzham : 24,700,820 -> 9,938,002 =  3.219 bpb =  2.485 to 1
encode           : 185.204 seconds, 0.08 b/kc, rate= 133.37 k/s
decode           : 0.245 seconds, 58.28 b/kc, rate= 100.77 M/s


LZNA -z5 threaded :
24,700,820 -> 9,211,090 =  2.983 bpb =  2.682 to 1
encode only      : 8.523 seconds, 1.68 b/kc, rate= 2.90 mb/s
decode only      : 0.325 seconds, 43.96 b/kc, rate= 76.01 mb/s

LZMA threaded :

lzmahigh : 24,700,820 -> 9,329,925 =  3.022 bpb =  2.647 to 1
encode           : 7.991 seconds, 1.79 b/kc, rate= 3.09 M/s
decode           : 0.775 seconds, 18.42 b/kc, rate= 31.85 M/s

LZHAM BETTER threaded :

lzham : 24,700,820 ->10,198,307 =  3.303 bpb =  2.422 to 1
encode           : 7.678 seconds, 1.86 b/kc, rate= 3.22 M/s
decode           : 0.242 seconds, 58.96 b/kc, rate= 101.94 M/s

I incorrectly said in the original version of the LZNA post (now corrected) that "LZHAM UBER is too slow". It's actually the "EXTREME" option that's too slow.

Also, as I noted last time, LZHAM is the best threaded of the three, so even though BETTER is slower than LZNA -z5 or LZMA in single-threaded encode speed, it's faster threaded. (Oodle's encoder threading is very simplistic (chunking) and really needs a larger file to get full parallelism; it doesn't use all cores here; LZHAM is much more micro-threaded so can get good parallelism even on small files).

06-03-15 | Computers Are Evil

Click "close tab".

Nothing happens. Wait.

Click "close tab" again on the same tab.

Two tabs close.


On Android it's even worse. Maybe the worst sin of Android UI design (hard to choose) is that the back/home/panes bar at the bottom is not always there. Some apps make it roll out of the way and put other buttons there. So depending on timing and input races you can be just trying to get back to the home screen and instead hit some other shite.

Fucking basic GUI design principles :

User should get *immediate* acknolwedgement of their action. I mean, for fucks sake you should be able to actually just DO it immediately, computers are fast. But if you can't you still need to acknowledge it.

Buttons should not move! They should not roll in and out. The more important the button, the more it should just stay in the same place all the time.

A sequence of input actions should always lead to the same outcome. There should be no "input races" that make outcome dependent on processing time.

A power user should be able to use the application without looking at it. You should be able to develop muscle memory of sequences that do what you want, and shouldn't have to be tracking "is the app responding" to each of the actions in the sequence.

As much as possible buttons should be *stateless*. The same click gives the same action all the time. Modal UI is a necessary evil.

I have a kitchen timer that has these buttons :

Stop/Start : toggles running or not

Reset : resets timer to 0.

Reset only does anything if the timer is not running. So. How do you make the timer be in the "running from zero" state?

You have to look at it. There is no reliable key sequence to make it be running. If it's already running, you have to hit stop, then reset, then start.

The problem is that the action of the buttons is stateful. Dumb. A better design is something like :

Stop/Start : toggle running

Stop&Reset : stop and reset timer to 0

So you can press "Stop&Reset" + "Stop/Start" to always get into the "running from zero" state without looking at it. There are other options. The point is that people just don't understand fucking usability and good UI design.

A tool should become an extension of yourself, that you can use without thinking about it. That you can use without checking in on it. Oops, I hit a bunch of nails while my hammer was in un-nail mode and now my house is falling down.

Of course this is even more crucial in cars. There needs to be a law that every function in a car can be used by a blind person. There should never be a button/shuttle/touch-screen that you have to look at to use. It should all be possible to do without looking, by sense of location and touch. This is just good UI design in general, but even more important when safety is involved. (and the answer is not a million fucking buttons like an airplane cockpit, it's less fucking unnecessary functions)

Today I woke up and tried to start working, and my VC plugin NiftyPerforce is crashing. It's worked for years and I haven't touched anything and all of a sudden it's crashing. So.. try to debug it.. OMG it's fucking Managed C# I forgot about this nonsense. Hmm, some exception. Turn on exception breaks. OMG fucking C# throws exceptions all the damn time that are benign and you have to ignore. OMG the problem is down in some app pref XML persistence thing, why the fuck is this crashing now and how the fuck do I figure out what's going on. ARG ARG ARG.

Give me a fucking "main()". I want only my code. Nothing happens before main starts. No threads, no fucking binding. No automatically downloading packages!? WTF are you doing? Arg.

I wish I'd written my own text/code editor. It sucks having my primary daily interface with the machine be someone else's code that sucks.

05-25-15 | The Anti-Patent Patent Pool

The idea of the Anti-Patent Patent Pool is to destroy the system using the system.

The Anti-Patent Patent Pool is an independent patent licensing organization. (Hence APPP)

One option would be to just allow anyone to use those patents free of charge.

A more aggressive option would be a viral licensing model. (like the GPL, which has completely failed, so hey, maybe not). The idea of the viral licensing model is like this :

Anyone who owns no patents may use any patent in the APPP for free (if you currently own patents, you may donate them to the APPP).

If you wish to own patents, then you must pay a fee to license from the APPP. That fee is used to fund the APPP's activities, the most expensive being legal defense of its own patents, and legal attacks on other patents that it deems to be illegal or too broad.

(* = we'd have to be aggressive about going after companies that make a subsidiary to use APPP patents while still owning patents in the parent corporation)

The tipping point for the APPP would be to get a few patents that are important enough that major players need to either join the APPP (donate all their patents) or pay a large license.

The APPP provides a way for people who want their work to be free to ensure that it is free. In the current system this is hard to do without owning a patent, and owning a patent and enforcing it is hard to do without money.

The APPP pro-actively watches all patent submissions and objects to ones that cover prior art, are obvious and trivial, or excessively broad. It greatly reduces the issuance of junk patents, and fights ones that are mistakenly issued. (the APPP maintains a public list of patents that it believes to be junk, which it will help you fight if you choose to use the covered algorithms). (Obviously some of these activities have to be phased in over time as the APPP gets more money).

The APPP provides a way for small companies and individuals that cannot afford the lawyers to defend their work to be protected. When some evil behemoth tries to stop you from using algorithms that you believe you have a legal right to, rather than fight it yourself, you simply donate your work to the APPP and they fight for you.

Anyone who simply wants to ensure that they can use their own inventions could use the APPP.

Once the APPP has enough money, we would employ a staff of patent writers. They would take idea donations from the groundswell of developers, open-source coders, hobbyists. Describe your idea, the patent writer would make it all formal and go through the whole process. This would let us tap into where the ideas are really happening, all the millions of coders that don't have the time or money to pursue getting patents on their own.

In the current system, if you just want to keep your idea free, you have to constantly keep an eye on all patent submissions to make sure noone is slipping in and patenting it. It's ridiculous. Really the only safe thing to do is to go ahead and patent it yourself and then donate it to the APPP. (the problem is if you let them get the patent, even if it's bogus it may be expensive to fight, and what's worse is it creates a situation where your idea has a nasty asterisk on it - oh, there's this patent that covers this idea, but we believe that patent to be invalid so we claim this idea is still public domain. That's a nasty situation that will scare off lots of users.)

Some previous posts :

cbloom rants 02-10-09 - How to fight patents
cbloom rants 12-07-10 - Patents
cbloom rants 04-27-11 - Things we need
cbloom rants 05-19-11 - Nathan Myhrvold

Some notes :

1. I am not interested in debating whether patents are good or not. I am interested in providing a mechanism for those of us who hate patents to pursue our software and algorithm development in a reasonable way.

2. If you are thinking about the patent or not argument, I encourage you to think not of some ideal theoretical argument, but rather the realities of the situation. I see this on both sides of the fence; those who are pro-patent because it "protects inventors" but choose to ignore the reality of the ridiculous patent system, and those on the anti-patent side who believe patents are evil and they won't touch them, even though that may be the best way to keep free ideas free.

3. I believe part of the problem with the anti-patent movement is that we are all too fixated on details of our idealism. Everybody has slightly different ideas of how it should be, so the movement fractures and can't agree on a unified thrust. We need to compromise. We need to coordinate. We need to just settle on something that is a reasonable solution; perhaps not the ideal that you would want, but some change is better than no change. (of course the other part of the problem is we are mostly selfish and lazy)

4. Basically I think that something like the "defensive patent license" is a good idea as a way to make sure your own inventions stay free. It's the safest way (as opposed to not patenting), and in the long run it's the least work and maintenance. Instead of constantly fighting and keeping aware of attempts to patent your idea, you just patent it yourself, do the work up front and then know it's safe long term. But it doesn't go far enough. Once you have that patent you can use it as a wedge to open up more ideas that should be free. That patent is leverage, against all the other evil. That's where the APPP comes in. Just making your one idea free is not enough, because on the other side there is massive machinery that's constantly trying to patent every trivial idea they can think of.

5. What we need is for the APPP to get enough money so that it can be stuffing a deluge of trivial patents down the patent office's throat, to head off all the crap coming from "Intellectual Ventures" and its many brothers. We need to be getting at least as many patents as them and making them all free under the APPP.

Some links :

en.swpat.org - The Software Patents Wiki
Patent Absurdity — How software patents broke the system
Home defensivepatentlicense
FOSS Patents U.S. patent reform movement lacks strategic leadership, fails to leverage the Internet

05-21-15 | LZ-Sub

LZ-Sub decoder :

delta_literal = get_sub_literal();

if ( delta_literal != 0 )
    *ptr++ = delta_literal + ptr[-lastOffset];
else // delta_literal == 0
    if ( ! get_offset_flag() )
        *ptr++ = ptr[-lastOffset];
    else if ( get_lastoffset_flag() )
        int lo_index = get_lo_index();
        lastOffset = last_offsets[lo_index];
        // do MTF or whatever using lo_index
        *ptr++ = ptr[-lastOffset];
        // extra 0 delta literal implied :
        *ptr++ = ptr[-lastOffset];
        lastOffset = get_offset();
        // put offset in last_offsets set
        *ptr++ = ptr[-lastOffset];
        *ptr++ = ptr[-lastOffset];
        // some automatic zero deltas follow for larger offsets
        if ( lastOffset > 128 )
            *ptr++ = ptr[-lastOffset];
            if ( lastOffset > 16384 )
                *ptr++ = ptr[-lastOffset];

    // each single zero is followed by a zero runlen
    //  (this is just a speed optimization)
    int zrl = get_zero_runlen();
        *ptr++ = ptr[-lastOffset];

This is basically LZMA. (sub literals instead of bitwise-LAM, but structurally the same) (also I've reversed the implied structure here; zero delta -> offset flag here, whereas in normal LZ you do offset flag -> zero delta)

This is what a modern LZ is. You're sending deltas from the prediction. The prediction is the source of the match. In the "match" range, the delta is zero.

The thing about modern LZ's (LZMA, etc.) is that the literals-after-match (LAMs) are very important too. These are the deltas after the zero run range. You can't really think of the match as just applying to the zero-run range. It applies until you send the next offset.

You can also of course do a simpler & more general variant :

Generalized-LZ-Sub decoder :

if ( get_offset_flag() )
    // also lastoffset LRU and so on not shown here
    lastOffset = get_offset();

delta_literal = get_sub_literal();

*ptr++ = delta_literal + ptr[-lastOffset];

Generalized-LZ-Sub just sends deltas from prediction. Matches are a bunch of zeros. I've removed the acceleration of sending zero's as a runlen for simplicity, but you could still do that.

The main difference is that you can send offsets anywhere, not just at certain spots where there are a bunch of zero deltas generated (aka "min match lengths").

This could be useful. For example when coding images/video/sound , there is often not an exact match that gives you a bunch of exact zero deltas, but there might be a very good match that gives you a bunch of small deltas. It would be worth sending that offset to get the small deltas, but normal LZ can't do it.

Generalized-LZ-Sub could also give you literal-before-match. That is, instead of sending the offset at the run of zero deltas, you could send it slightly *before* that, where the deltas are not zero but are small.

(when compressing text, "sub" should be replaced with some kind of smart lexicographical distance; for each character precompute a list of its most likely substitution character in order of probability.)

LZ is a bit like a BWT, but instead of the contexts being inferred by the prefix sort, you transmit them explicitly by sending offsets to prior strings. Weird.

05-21-15 | Umm

I sent a lawyer an email yesterday.

Today they sent me back an email saying :

"I need your email address so I can send you the documents you need to sign"

Umm... you are not inspiring great confidence in your abilities.

Also, pursuant to my last post about spam - pretty much all my correspondence with lawyers over the past few months, Google decides to put in the spam folder. I keep thinking "WTF why didn't this lawyer get back to me - oh crap, go check the spam". Now, I'm totally down with the comic social commentary that Google is making ("ha ha, all email from lawyers is spam, amirite? lol"). But WTF your algorithms are insanely broken. I mean, fucking seriously you suck so bad.

05-21-15 | Software Patents are Fucking Awesome

Awesome. Spotted on encode.ru. It was inevitable I suppose :

"System and method for compressing data using asymmetric numeral systems with probability distributions"

By these tards :


Someone in the UK go over and punch them in the balls.

For those not aware of the background, ANS is probably the biggest invention in data compression in the last 20 years. Its inventor (Jarek Duda) has explicitly tried to publish it openly and make it patent-free, because he's awesome.

In the next 10 years I'm sure we will get patents for "using ANS with string-matching data compression", "using ANS with block mocomp data compression", "using ANS as a replacement for Huffman coding", "deferred summation with ANS", etc. etc. Lots of brilliant inventions like that. Really stimulating for innovation.

(as has happened over and over in data compression, and software in general in the past; hey let's take two obvious previously existing things; LZ string matching + Huffman = patent. LZ + hash table = patent. JPEG + arithmetic = patent. Mocomp + Huffman = patent. etc. etc.)

(often glossed over in the famous Stac-Microsoft suit story is the question of WHAT THE FUCK the LZS patent was supposed to be for? What was the invention there exactly? Doing LZ with a certain fixed bit encoding? Umm, yeah, like everyone does?)

Our patent system is working great. It obviously protects and motivates the real inventors, and doesn't just act as a way for the richest companies to lock in semi-monopolies of technologies they didn't even invent. Nope.

Recently at RAD we've made a few innovations related to ANS that are mostly in the vein of small improvements or clever usages, things that I wouldn't even imagine to patent, but of course that's wrong.

I've also noticed in general a lot of these vaporware companies in the UK. We saw one at RAD a few years ago that claimed to use "multi-dimensional curve interpolation for data compression" or some crackpot nonsense. There was another one that used alternate numeral systems (not ANS, but p-adic or some such) for god knows what. A few years ago there were lots of fractal-image-compression and other fractal-nonsense startups that did ... nothing. (this was before the VC "pivot" ; hey we have a bunch of fractal image patents, let's make a text messaging app)

They generally get some PhD's from Cambridge or whatever to be founders. They bring a bunch of "industry luminaries" on the board. They patent a bunch of nonsense. And then ...

... profit? There's a step missing where they actually ever make anything that works. But I guess sometimes they get bought for their vapor, or they manage to get a bullshit patent that's overly-general on something they didn't actually invent, and then they're golden.

I wonder if these places are getting college-backed "incubation" incentives? Pretty fucking gross up and down and all around. Everyone involved is scum.

(In general, universities getting patents and incubating startups is fucking disgusting. You take public funding and student's tuition, and you use that to lock up ideas for private profit. Fucking rotten, you scum.)

On a more practical note, if anyone knows the process for objecting to a patent in the UK, chime in.

Also, shame on us all for not doing more to fight the system. All our work should be going in the Anti-Patent Patent Pool.

Under the current first-to-file systems, apparently we are supposed to sit around all day reading every patent that's been filed to see if it covers something that we have already invented or is "well known" / public domain / prior art.

It's really a system that's designed around patents. It assumes that all inventions are patented. It doesn't really work well with a prior invention that's just not patented.

Which makes something like the APPP even more important. We need a way to patent all the free ideas just as a way to keep them legally free and not have to worry about all the fuckers who will rush in and try to patent our inventions as soon as we stop looking.

05-17-15 | The Google Spam Filter is Intentionally Bad

I'm convinced at this point that Google intentionally filters spam wrong.

Not in a nefarious way, like haha we're going to send your good mails to "spam" and let the crap through! Take that!

But actually in a sort of more deeply evil way. A capitalist way. They specifically *want* to allow through mass-mailings from corporations that are they do not consider spam.

In my opinion, those are all spam. There is not a single corporate mass-mailing that I ever intentionally subscribed to.

Basically there's a very very easy spam filtering problem :

Easy 1. Reject all mass-mailings. Reject all mailings about sales, products, offers. Reject all mailings about porn or penises or nigerian princes.

Easy 2. Allow through all mail that's hand-written by a human to me. Particularly to one that I have written to in the past.

That would be fine with me. That would get 99.99% of it right for me.

They don't want to solve that problem. Instead they try to solve the much-harder problem of allowing through viagra offers that are for some reason not spam. For the email user who *wants* to get mass-mail offers of 50% off your next order.

I just don't understand how "yeah, let's go out to dinner" from my friend, who is responding with quote to a fucking mail that I sent goes in the in the Spam box, but "get direct email mass-marketing secrets to double your business!" goes in my inbox. How can it be so bad, I just really don't understand it. Fucking the most basic keyword include/exclude type of filter could do better.

I should have just written my own, because it's the kind of problem that you want to be constantly tweaking on. Every time a mail is misclassified, I want to run it through my system and see why that happened and then try to fix it.

It would be SOOO fucking easy for them. Being in a position as a central mail processor, they can tell which mails are unique and which are mass-sent, and just FUCKING BLOCK ALL THE MASS-SENT MAIL. God dammit. You are fucking me up and I know you're doing it intentionally. I hate you.

I mean, fuck. It's ridiculous.

They are responding to a mail I sent. The mail I sent is fucking quoted right there. I sent the fucking mail from gmail so you can confirm it's for real. I sent to their address with gmail. AND YOU PUT THEIR REPLY IN SPAM. WTF WTF WTF

But this is not spam :

Report: creative teamwork is easier with cloud-based apps

Businesses Increase Screening of Facebook, Twitter Before Hiring

Trying to solve the Prospecting Paradox?

I'd like to add you to my professional network on LinkedIn

Maybe I'm being a bit overly simplistic and harsh. Maybe there are mass-mailings that look spammish, but you actually want to get? Like, your credit card bill is due?

I'm not sure. I'm not sure that I ever need to get any of that. I don't need those "shipping confirmation" emails from Amazon. If they just all got filed to the "mass mail" folder, I could go look for them when I need them.

I want to make my own private internet. And then not allow anyone else to use it because you'd all just fuck it up.

05-16-15 | Threading Primitive : monitored semaphore

A monitored semaphore allows two-sided waiting :

The consumer side decs the semaphore, and waits on the count being positive.

The producer side incs the semaphore, and can wait on the count being a certain negative value (some number of waiting consumers).

Monitored semaphore solves a specific common problem :

In a worker thread system, you may need to wait on all work being done. This is hard to do in a race-free way using normal primitives. Typical ad-hoc solutions may miss work that is pushed during the wait-for-all-done phase. This is hard to enforce, ugly, and makes bugs. (it's particularly bad when work items may spawn new work items).

I've heard of many ad-hoc hacky ways of dealing with this. There's no need to muck around with that, because there's a simple and efficient way to just get it right.

The monitored semaphore also provides a race-free way to snapshot the state of the work system - how many work items are available, how many workers are sleeping. This allows you to wait on the joint condition - all workers are sleeping AND there is no work available. Any check of those two using separate primitives is likely a race.

The implementation is similar to the fastsemaphore I posted before.

"fastsemaphore" wraps some kind of underlying semaphore which actually provides the OS waits. The underlying semaphore is only used when the count goes negative. When count is positive, pops are done with simple atomic ops to avoid OS calls. eg. we only do an OS call when there's a possibility it will put our thread to sleep or wake a thread.

"fastsemaphore_monitored" uses the same kind atomic variable wrapping an underlying semaphore, but adds an eventcount for the waiter side to be triggered when enough workers are waiting. (see who ordered event count? )

Usage is like this :

To push a work item :

push item on your queue (MPMC FIFO or whatever)

To pop a work item :

pop item from queue

To flush all work :


NOTE : in my implementation, post & wait can be called from any thread, but wait_for_waiters must be called from only one thread. This assumes you either have a "main thread" that does that wait, or that you wrap that call with a mutex.

template <typename t_base_sem>
class fastsemaphore_monitored
    atomic<S32> m_state;
    eventcount m_waiters_ec;
    t_base_sem m_sem;

    enum { FSM_COUNT_SHIFT = 8 };
    enum { FSM_COUNT_MASK = 0xFFFFFF00UL };
    enum { FSM_WAIT_FOR_SHIFT = 0 };
    enum { FSM_WAIT_FOR_MASK = 0xFF };

    fastsemaphore_monitored(S32 count = 0)
    :   m_state(count<<FSM_COUNT_SHIFT)
        RL_ASSERT(count >= 0);



    inline S32 state_fetch_add_count(S32 inc)
        S32 prev = m_state($).fetch_add(inc<<FSM_COUNT_SHIFT,mo_acq_rel);
        S32 count = ( prev >> FSM_COUNT_SHIFT );
        RR_ASSERT( count < 0 || ( (U32)count < (FSM_COUNT_MAX-2) ) );
        return count;

    // warning : wait_for_waiters can only be called from one thread!
    void wait_for_waiters(S32 wait_for_count)
        RL_ASSERT( wait_for_count > 0 && wait_for_count < FSM_WAIT_FOR_MAX );
        S32 state = m_state($).load(mo_acquire);
            S32 cur_count = state >> FSM_COUNT_SHIFT;

            if ( (-cur_count) == wait_for_count )
                break; // got it
            S32 new_state = (cur_count<<FSM_COUNT_SHIFT) | (wait_for_count << FSM_WAIT_FOR_SHIFT);
            S32 ec = m_waiters_ec.prepare_wait();
            // double check and signal what we're waiting for :
            if ( ! m_state.compare_exchange_strong(state,new_state,mo_acq_rel) )
                continue; // retry ; state was reloaded
            state = m_state($).load(mo_acquire);
        // now turn off the mask :
            S32 new_state = state & FSM_COUNT_MASK;
            if ( state == new_state ) return;
            if ( m_state.compare_exchange_strong(state,new_state,mo_acq_rel) )
            // retry ; state was reloaded

    void post()
        if ( state_fetch_add_count(1) < 0 )

    void wait_no_spin()
        S32 prev_state = m_state($).fetch_add((-1)<<FSM_COUNT_SHIFT,mo_acq_rel);
        S32 prev_count = prev_state>>FSM_COUNT_SHIFT;
        if ( prev_count <= 0 )
            S32 waiters = (-prev_count) + 1;
            RR_ASSERT( waiters >= 1 );
            S32 wait_for = prev_state & FSM_WAIT_FOR_MASK;
            if ( waiters == wait_for )
                RR_ASSERT( wait_for >= 1 );
    void post(S32 n)
        RR_ASSERT( n > 0 );
        for(S32 i=0;i<n;i++)
    bool try_wait()
        // see if we can dec count before preparing the wait
        S32 state = m_state($).load(mo_acquire);
            if ( state < (1<<FSM_COUNT_SHIFT) ) return false;
            // dec count and leave the rest the same :
            //S32 new_state = ((c-1)<<FSM_COUNT_SHIFT) | (state & FSM_WAIT_FOR_MASK);
            S32 new_state = state - (1<<FSM_COUNT_SHIFT);
            RR_ASSERT( (new_state>>FSM_COUNT_SHIFT) >= 0 );
            if ( m_state($).compare_exchange_strong(state,new_state,mo_acq_rel) )
                return true;
            // state was reloaded
            // loop
            // backoff here optional
    S32 try_wait_all()
        // see if we can dec count before preparing the wait
        S32 state = m_state($).load(mo_acquire);
            S32 count = state >> FSM_COUNT_SHIFT;
            if ( count <= 0 ) return 0;
            // swap count to zero and leave the rest the same :
            S32 new_state = state & FSM_WAIT_FOR_MASK;
            if ( m_state($).compare_exchange_strong(state,new_state,mo_acq_rel) )
                return count;
            // state was reloaded
            // loop
            // backoff here optional
    void wait()
        int spin_count = rrGetSpinCount();
            if ( try_wait() ) 


05-16-15 | LZ literals after match

Some vague rambling about LAMs.

LAMs are weird.

LAM0 , the first literal after a match, has the strong exclusion property (assuming maximum match lengths). LAM0 is strictly != lolit. (lolit = literal at last offset).

LAM1, the next literal after end of match, has the exact opposite - VERY strong prediction of LAM1 == lolit. This prediction continues but weakens as you go to LAM2, LAM3, etc.

In Oodle LZNA (and in many other coders), I send a flag for (LAM == lolit) as a separate event. That means in the actual literal coding path you still have LAM1 != lolit. (the LAM == lolit flag should be context-coded using the distance from the end of the match).

In all cases, even though you know LAM != lolit, lolit is still a very strong predictor for LAM. Most likely LAM is *similar* to lolit.

LAM is both an exclude AND a predictor!

What similar means depends on the file type. In text it means something like vowels stay vowels, punctuation stays punctuation. lolit -> LAM is sort of like substituting one character change. In binary, it often means that they are numerically close. This means that the delta |LAM - lolit| is never zero, but is often small.

One of the interesting things about the delta is that it gives you a data-adaptive stride for a delta filter.

On some files, you can get huge compression wins by running the right delta filter. But the ideal delta distance is data-dependent (*). The sort of magic thing that works out is that the LZ match offsets will naturally pick up the structure & word sizes. In a file of 32-byte structs made of DWORDs, you'll get offsets of 4,8,12,32,etc. So you then take that offset and forming the LAM sub is just a way of doing a delta with that deduced stride. On DWORD or F32 data, you tend to get a lot of offset=4, so LAM tends to just be doing delta from the previous word (note of course this bytewise delta, not a proper dword delta).

(* = this is a huge thing that someone needs to work on; automatic detection of delta filters for arbitrary data; deltas could be byte,word,dword, other, from immediate neighbors or from struct/row strides, etc. In a compression world where we are fighting over 1% gains, this can be a 10-20% jump.)

Experimentally we have observed that LAMs are very rapidly changing. They benefit greatly from very quickly adapting models. They like geometric adaptation rates (more recent events are much more important). They cannot be modeled with large contexts (without very sophisticated handling of sparsity and fast adaptation), they need small contexts to get lots of events and statistical density. They seem to benefit greatly from modeling in groups (eg. bitwise or nibblewise or other), so that events on one symbol also affect other probabilities for faster group learning. Many of these observations are similar for post-BWT data. LAM sub literals does seem to behave like post-BWT data to some extent, and similar principles of modeling apply.

So, for example, just coding an 8-bit symbol using the 8-bit lolit as context is a no-go. In theory this would give you full modeling of the effects of lolit on the current symbol. In practice it dilutes your statistics way too much. (in theory you could do some kind of one-count boosts other counts thing (or a secondary coding table ala PPMZ SEE), but in practice that's a mess). Also as noted previously, if you have the full 8-bit context, then whether you code symbol raw or xor or sub is irrelevant, but if you do not have the full context then it does change things.

Related posts :

cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 03-10-13 - Two LZ Notes
cbloom rants 06-12-14 - Some LZMA Notes
cbloom rants 06-16-14 - Rep0 Exclusion in LZMA-like coders
cbloom rants 03-15-15 - LZ Literal Correlation Images

05-13-15 | Skewed Pareto Chart

It's hard to see just the decomp speed in the normal Pareto Chart. It gets squished down over at the far-right Y-intercept.

The obvious fix is just to magnify the right side. This is a linear scaling of the data; *1 on the far left, *10 on the far right :

The far-left is still proportional to the compression ratio, the far right is proportional to the decompression speed. The compressor lines are still speedups vs. memcpy, but the memcpy baseline is now sloped.

I'm not really sure how I feel about the warped chart vs unwarped.

The Pareto curves are in fact sigmoids (tanh's).

speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)

speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)

(here they're warped sigmoids because of the magnification; the ones back here in the LZNA post are true sigmoids)

I believe (but have not proven) that a principle of the Pareto Frontier is that the maximum of all compressors should also be a sigmoid.

max_speedup(disk_speed) = MAX{c}( speedup[compressor c](disk_speed) );

One of the nice things about these charts is it makes it easy to see where some compressors are not as good as possible. If we fit a sigmoid over the top of all the curves :

We can easily see that LZHLW and LZNIB are not touching the curve. They're not as good as they should be in space/speed. Even thought nothing beats them at the moment (that I know of), they are algorithmically short of what's possible.

There are two things that constrain compressors from being better in a space/speed way. There's 1. what is our current best known algorithm. And then there's 2. what is possible given knowledge of all possible algorithms. #2 is the absolute limit and eventually it runs into a thermodynamic limit. In a certain amount of cpu time (cpu bit flips, which increase entropy), how much entropy can you take out of a a given data stream. You can't beat that limit no matter how good your algorithm is. So our goal in compression is always to just find improvements in the algorithms to edge closer to that eventual limit.

Anyway. I think I know how to fix them, and hopefully they'll be up at the gray line soon.

05-11-15 | ANS Minimal Flush

A detail for the record :

ANS (TANS or RANS) in the straightforward implementation writes a large minimum number of bytes.

To be concrete I'll consider a particular extremely bad case : 64-bit RANS with 32-bit renormalization.

The standard coder is :

initialize encoder (at end of stream) :

x = 1<<31

renormalize so x stays in the range x >= (1<<31) and x < (1<<63)

flush encoder (at the beginning of the stream) :

output all 8 bytes of x

decoder initializes by reading 8 bytes of x

decoder renormalizes via :

if ( x < (1<<31) )
  x <<= 32;  x |= get32(ptr); ptr += 4;

decoder terminates and can assert that x == 1<<31

this coder outputs a minimum of 8 bytes, which means it wastes up to 7 bytes on low-entropy data (assuming 1 byte minimum output and that the 1 byte required to byte-align output is not "waste").

In contrast, it's well known how to do minimal flush of arithmetic coders. When the arithmetic coder reaches the end, it has a "low" and "range" specifying an interval. "low" might be 64-bits, but you don't need to output them all, you only need to output enough such that the decoder will get something in the correct interval between "low" and "low+range".

Historically people often did arithmetic coder minimum flush assuming that the decoder would read zero-valued bytes after EOF. I no longer do that. I prefer to do a minimum flush such that decoder will get something in the correct interval no matter what byte follows EOF. This allows the decoder to just read past the end of your buffer with no extra work. (the arithmetic coder reads some # of bytes past EOF because it reads enough to fill "low" with bits, even though the top bits are all that are needed at the end of the stream).

The arithmetic coder minimum flush outputs a number of bytes proportional to log2(1/range) , which is the number of bits of information that are currently held pending in the arithmetic coder state, which is good. The excess is at most 1 byte.

So, to make ANS as clean as arithmetic coding we need a minimal flush. There are two sources of the waste in the normal ANS procedure outlined above.

One is the initial value of x (at the end of the stream). By setting x to (1<<31) , the low end of the renormalization interval, we have essentually filled it with bits it has to flush. (the pending bits in x is log2(x)). But those bits don't contain anything useful (except a value we can check at the end of decoding). One way to remove that waste is to stuff some other value in the initial state which contains bits you care about. Any value you initialize x with, you get back at the end of decoding, so then those bits aren't "wasted". But this can be annoying to find something useful to put in there, since you don't get that value out until the end of decoding.

The other source of waste is the final flush of x (at the beginning of the stream). This one is obvious - the # of pending bits stored in x at any time is log2(x). Clearly we should be flushing the final value of x in a # of bits proportional to log2(x).

So to do ANS minimal flush, here's one way :

initialize encoder (at end of stream) :

x = 0

renormalize so x stays in the range x < (1<<63)

flush encoder (at the beginning of the stream) :

output # of bytes with bits set in x, and those bytes

decoder initializes by reading variable # of bytes of x

decoder renormalizes via :

if ( x < (1<<31) )
  if ( ptr < ptrend )
    x <<= 32;  x |= get32(ptr); ptr += 4;

decoder terminates and can assert that x == 0

This ANS variant will output only 1 byte on very-low-entropy data.

There are now two phases of the coder. In the beginning of encoding (at the ending of the stream), x is allowed to be way below the renormalization range. During this phase, encoding just puts information into x, and the value of x grows. (note that x can actually stay 0 and never hold any bits if your consists of entirely the bottom symbol in RANS). Once x grows up into the renormalization interval, you enter the next phase where bits of x are pushed to the output to keep x in the renormalization interval. Decoding, in the first phase you read bytes from the stread to fill x with bits and keep it in the renormalization interval. Once the decoder read pointer hits the end, you switch to the second phase, and now x is allowed to shrink below the renormalization minimum and you can continue to decode the remaining information held in it.

This appears to add an extra branch to the decoder renormalization, but that can be removed by duplicating your decoder into "not near the end" and "near the end" variants.

The #sigbit output of x at the head is just the right thing and should always be done in all variants of ANS.

The checking ptr vs. ptrend and starting x = 0 is the variant that I call "minimal ANS".

Unfortunately "minimal ANS" doesn't play well with the ILP multi-state interleaved ANS. To do interleaved ANS like this you would need an EOF marker for each state. That's possible in theory (and could be done compactly in theory) but is a pain in the butt in practice.

05-10-15 | Did I ever mention that I fucking hate the fucking web?

(I might be a bit cranky today. Too much work and not enough sex. I should probably just go to a bar and talk about how I love Obama and taxes so I can get in a fight. Instead I'll rage about the fucking web.)

I'm trying to get together the photos of my baby to share with my mom. What a fucking nightmare. They're mostly on my phone, and auto-backed up to Google Photos. Should be easy, right?

The Google Photos web interface is fucking wrist-slashing insanity. It's SO FUCKING SLOW. It should not take so long to show me a few little thumbnails. Fucking quit all the fucking AJAX fancy bullshit whatever the fuck you're fucking doing oh my god.

It always only wants to show me "highlights". Who told you to fucking do that? I have never highlighted anything so I'm not sure how you decided what was a highlight and what wasn't. You fucking dicks.

Simple shit like making an album and trying to put the correct photos in the album just has no decent workflow. FUCK.

So I'm going to just download them and do it on my computer. Fine.

There's no download all. I'm supposed to what, click each fucking one and download? (which is a frustrating nightmare because the click is some super slow awful web popup).

Okay, I can use Google Takeouts to just get the whole thing. Hmm. Why are my photos fucking 8 GB? Oh, because it's giving me all my videos too. FUCK FUCK FUCK. I just want the photos not the videos. Nope, Takeouts gives you everything.

Okay, I'll just download the 8 GB. Oh awesome the download the failed. Oh awesome it failed again.

Okay, I'll get the download URL and give it to DownloadThemAll which is good and can do resumes and so on and the main reason I cling to Firefox.

NOPE the fucking download link is not an actual file it's some fucking redirect login bullshit that DTA can't handle. ARG ARG ARG.

And now fuck my fucking baby photos and fuck my mom (sorry mom) I'm not fucking dealing with this shit and I fucking hate the fucking web god dammit.

For some time I've been using Google Classic Maps ("https://maps.google.com/maps?output=classic&dg=opt"). And now it's been killed. Maybe I'll switch to Bing? Or fuck that. Maybe I should just buy a good set of paper maps. I'm not sure that even exists any more. Ever since the Thomas Guide switched to computer-generated maps they really suck, they're ugly and the layout is no good and hard to read.

The reason I saw on the web for killing it (Google Classic Maps) was that too many people were opting out of new maps. You killed it because people liked it. I don't know if that's true, but it is awesomely in character.

For a while I was on the Google Forums complaining about Blogger. Just about everyone who runs a blog at some point gets a troll and realizes that they need the ability to just ban an individual. Can't do it. So they go on the Google product forum and say hey can we get black listing and white listing? The Google response was "we know you want that, and fuck you".

REMINDER TO SELF : always download all the images made by Google Charts because that service will die at some point. (this would be good practice even if Google didn't randomly chop off its own limbs on a regular basis)

I don't keep any cookies or browse history. With everyone going to fucking two-phase login this is starting to get annoying. To login I now have to get a text code to my phone and enter that. It's tedious.

But the thing that really kills me is this stupid detail :

I get the numeric code sent to me. I go to Google Voice on my computer (because actually ever touching the phone is to be avoided at all cost). I double-click the number to copy it. I paste it in the two-phase entry.

It fails. Wrong code.

I try again. It fails.

The fucking double-click is selecting the space after the number, and the fucking login doesn't ignore the trailing space. It's lazy bad programmer shit like that which makes me furious.

Another one I hit often is using online payment thingies. I'll copy-paste the amount from my bill, something like "$1,234" and hit okay and I get

"invalid entry, please enter a numeric value"

IT's A FUCKING PAYMENT ENTRY BOX. You can fucking strip the leading $ and commas you piece of shit mother fucking asshole terrible programmers.

I'm trying to login to Skype on my phone.

(side note: summary of every Skype sessions I've ever had : "I can see you, can you see me? I can't hear you. Oh, you're upside down. Let me log off and disconnect. Now you're black. Try again. It's real glitchy, let's restart it. Hey, it's working! Hi! Hi! Okay, gotta go now.")

It says login with your skype account or your microsoft account.

So I tediously enter my microsoft account login which has a password like fucking @#$ASD@!#$<:22 and is fucking awful to type (and is starred out you fucking fucks the fucking fuck).

Skype says "oh, it looks like you entered a microsoft account, redirecting..."


I'm so fucking sick of loading web pages and seeing "connecting to blah.. connecting to blah.." and seeing shit popping in slowly and reflowing and the focus popping and all this fucking shit.

Hey, fucking remedial loading school. You put all the content needed for the page in one package. Send me the one package. BOOM it loads.

Incremental is bullshit.

Back in the 90's some time, I worked for Eclipse on streaming 3d for the web. One of the things I did was a progressive wavelet image compressor so we could do things like send the first 5k of each image, then the next 10k, and because of the truncation property of bitplane-coded wavelets those were good low quality versions of the image that could just be tacked together.

So we tried to test it and demo it.

Everything just instantly loaded and you couldn't see the progressive wavelet load at all.

Because if you're not a fucking moron and you package together your content and just have a single download bundle to get your content, hey the internet is actually really fucking fast (even back in the 90's !!).

To show it off I put in a bunch of fake delays on the downloader to simulate slow hosts, so that you could see the wavelets gradually getting better, and that's what we showed to VC's or whatever.

I guess I could have just taken all the files and scattered them on different hosts around the world, THE WAY FUCKING NORMAL WEB PAGES DO. It's like they have very carefully gone through this process of intentionally slowing down the internet for no reason.

Sometimes I wish that I was like an air-cooled Porsche mechanic or something very stable and non-computer related, so I could just work away in my shop and not have to ever touch this fucking demon box.

05-09-15 | Oodle LZNA

Oodle 1.45 has a new compressor called LZNA. (LZ-nibbled-ANS)

LZNA is a high compression LZ (usually a bit more than 7z/LZMA) with better decode speed. Around 2.5X faster to decode than LZMA.

Anyone who needs LZMA-level compression and higher decode speeds should consider LZNA. Currently LZNA requires SSE2 to be fast, so it only runs full speed on modern platforms with x86 chips.

LZNA gets its speed from two primary changes. 1. It uses RANS instead of arithmetic coding. 2. It uses nibble-wise coding instead of bit-wise coding, so it can do 4x fewer coding operations in some cases. The magic sauce that makes these possible is Ryg's realization about mixing cumulative probability distributions . That lets you do the bitwise-style shift update of probabilities (keeping a power of two total), but on larger alphabets.

LZNA usually beats LZMA compression on binary, slightly worse on text. LZNA is closer to LZHAM decompress speeds.

Some results :


LZNA -z6 : 24,700,820 -> 9,154,248 =  2.965 bpb =  2.698 to 1
decode only      : 0.327 seconds, 43.75 b/kc, rate= 75.65 mb/s

LZMA : 24,700,820 -> 9,329,925 =  3.021 bpb =  2.647 to 1
decode           : 0.838 seconds, 58.67 clocks, rate= 29.47 M/s

LZHAM : 24,700,820 ->10,140,761 =  3.284 bpb =  2.435 to 1
decode           : 0.264 seconds, 18.44 clocks, rate= 93.74 M/s

(note on settings : LZHAM is run at BETTER because UBER is too slow. LZHAM BETTER is comparable to Oodle's -z6 ; UBER is similar to my -z7. (not quite right; see later post "LZNA encode speed addendum"). LZMA is run at the best compression setting I can find; -m9 and lc=0,lp=2,pb=2 for binary data; with LZHAM I don't see a way to set the context bits. This is the new LZHAM 1.0, slightly different than my previous tests of LZHAM. All 64-bit, big dictionaries.).


LZNA -z6 : 58,788,904 ->12,933,907 =  1.760 bpb =  4.545 to 1
decode only      : 0.677 seconds, 50.22 b/kc, rate= 86.84 mb/s

LZMA : 58,788,904 ->13,525,659 =  1.840 bpb =  4.346 to 1
decode           : 1.384 seconds, 40.70 clocks, rate= 42.49 M/s

LZHAM : 58,788,904 ->15,594,877 =  2.122 bpb =  3.769 to 1
decode           : 0.582 seconds, 17.12 clocks, rate= 100.97 M/s

I'm not showing encode speeds because they're all running different amounts of threading. It would be complicated to show fairly. LZHAM is the most aggressively threaded, and also the slowest without threading.

My "game testset" total sizes, from most compression to least :

Oodle LZNA -z8 :            57,176,229
Oodle LZNA -z5 :            58,318,469

LZMA -mx9 d26:lc0:lp2:pb3 : 58,884,562
LZMA -mx9 :                 59,987,629

LZHAM -mx9 :                62,621,098

Oodle LZHLW -z6 :           68,199,739

zip -9 :                    88,436,013

raw :                       167,495,105

Here's the new Pareto chart for Oodle. See previous post on these charts

This is load+decomp speedup relative to memcpy : (lzt99)

The left-side Y-intercept is the compression ratio. The right-side Y-intercept is the decompression speed. In between you can see the zones where each compressor is the best tradeoff.

With LZMA and LZHAM : (changed colors)

lzt99 is bad for LZHAM, perhaps because it's heterogeneous and LZHAM assumes pretty stable data. (LZHAM usually beats LZHLW for compression ratio). Here's a different example :

load+decomp speedup relative to memcpy : (baby_robot_shell)

05-08-15 | Bernie Sanders For President

I just watched the Daily Show where Jon just makes fun of Bernie for being unknown and looking/sounding a bit like an older Larry David, without mentioning anything about the fact that Bernie is fucking awesome. Bernie's announcement of "okay let's keep this short, I have to get back to work" is fucking awesome typical Bernie and it's what our politicians should be saying. They should be doing fucking work instead of making elaborate announcement pageants. They should actually be reading and writing laws instead of letting lobbyists and aides do it all for them.

Bernie Sanders is fucking amazing. If you haven't had the great pleasure of hearing him talk at length, go do it now. He is the best politician since I don't even know fucking who (*). I have never in my lifetime heard a single politician that actually speaks honestly and intelligently about the issues. Not talking points. Not just a bunch of bullshit promises. Not just verbal gymnastics to avoid the point. Actually directly talks about the issue in a realistic and pragmatic way.

(* = I have seen video of things like the Nixon-Kennedy debates in which politicians actually talk about issues and try to pin each other down on points of policy, rather than scoring "gotchas" and "applause points". I understand that back in the olden days, pandering to cheap emotional vagaries would get you pilloried in the press as "evasive" or "not serious". I've never seen it in my life.)

Even if you're conservative and don't agree with his views, it should be a fucking breath of fresh air for anyone with any fucking scrap of humanity and decency and intelligence to see a politician that refuses to play the games, refuses to pander to the shitty mass-applause points and the special-interest hate groups and most of all corporate money.

Even though Bernie won't win, I want every single debate to be just Bernie. I don't want to hear a single fucking vapid scripted garbage word from any of the other shit-heads. Just let Bernie talk the whole time.

I disagree with Bernie on some points. He tends to be rather more traditional liberal pro-union pro-manufacturing than I necessarily think is wise. But it's easy to get distracted by the disagreement and miss the point - here is a politician that's actually talking about issues and facts. Even if he gets them wrong sometimes at least he's trying to address real solutions. (this is one of the tricks of most other politicians - don't ever actually propose anything concrete, because then it can be attacked, instead just talk about "hope" or "liberty" or say "America fuck yeah!")

It's classic Bernie that he chose to run as a democrat so that he wouldn't be a spoiler ala Nader. It's just totally realistic, pragmatic, to the point. I fucking love Bernie Sanders.

Watch some Brunch with Bernie

Support Bernie here

Get corporate money out of politics! Elections should be publicly funded only! Stop lobbyists from writing laws! Free trade agreements should not supercede national laws! Corporations are not human beings! Government exists for the service of its citizens! Stop privatizing profit while leaving the government on the hook for risks and long term consequences! And so on.

05-03-15 | Waterproof

I have 3 rain jackets. They all let rain through.

Granted, the first one is actually labelled "water resistant" or "water repellant" or some such nonsense which actually means "fucking useless". But the other two are actually described as "waterproof". And they just aren't.

They seem waterproof at first. Water beads up and runs off and nothing goes through. But over time in a rain they start to get saturated, and eventually they soak through and then they just wick water straight through.

The problem is they're all some fancy waterproof/breathable technical fabric.


Ooo we have this new fancy fabric. NO! No you don't. You have bullshit that doesn't fucking work.

Job #1 : Actually be waterproof.

But it's lighter!, you say. Nope! Zip it! But it's breathable. Zip! Shush. But it's recycled, and rip-stop. Zip. Nope. Is it waterproof? Is it actually fucking waterproof? Like if I stand out in a rain. No, it isn't. Throw it out. It doesn't work. You're fired. Back to the drawing board.

If you want to get fancy, you could use your breathable/not-actually-waterproof material in areas that don't get very wet, such as the inside of the upper arm and the sides of the torso.

At the very least the tops of the shoulders and the upper chest need to be just plastic. Just fucking plastic like a slicker from the 50's. (it could be a separate overhanging shelf of plastic, like a duster type of thing)

Any time I'm browsing REI or whatever these days and see a jacket labelled waterproof, I think "like hell it is".

04-28-15 | Guitar and Teaching

I'm sort of vaguely trying to learn guitar again as a late night alternative to TV.

I'm at the point I always hit where I lose steam. I can play some basic stuff, but not anything too difficult. The problem is I have trouble finding fun songs to play that aren't too hard, or finding songbooks or teach-yourself books that are both fun and not too hard.

What I really want, and what I think is the right way to teach guitar to a dabbler like me is :

A book songs, in progression of difficulty

The songs need to be modern (post-60's), fun, familiar
(classic rock is pretty safe)

The songs need to be the *actual* songs.  Not simplified versions.  Not transposed versions.
Not just the chords when the real song is much more complex.

When I play it, it needs to sound like the actual recording.

No funny tunings.  I can't be bothered with that.

and so far as I know nothing like that exists.

I've got a bunch of "easy rock guitar songbooks" and they all fucking suck.

There are lots of good tabs on the internet, and I've found some good stuff to learn that way, but fuck that. The last thing I want to be doing in my relaxing time is browsing the internet trying to decide which of the 400 fucking versions of the "Heartbreaker" tab is the right one I should try to learn.

(in the past I taught myself some classical guitar, and in contrast there are lots of great classical, and even finger-picking folk guitar song books and instructional progressions that give you nice songs to learn that are actually fun to play and sound like something)

I've tried taking lessons a few times in the past and the teachers always sucked. Maybe they were good in terms of getting you to be a better player, but they were awful at making it fun.

I had a teacher who wanted me to sit with a metronome and just pick the same note over and over to the metronome to work on my meter. WTF. You're fired. All teachers want you to play scales. Nope. And then for the "fun" part they want to teach some basic rock rhythm, A-A-E-E,A-A-E-E. Nope, I'm bored. You're fired. Then I get to learn a song and it's like some basic blues I've never heard of or some fucking John Denver or something. (*)

They seem to completely fail to understand that it has to keep the student interested. That's part of your job being a teacher. I'm not trying to become a professional musician. I don't have some great passionate motivation that's going to keep me going through the boring shit and drudgery of your lessons. You have to make it fun all the time.

(* = there is some kind of weird thing where people who play music generally have horrible taste in music. It's like they pay too much attention to either the notes/chord/key or to the technical playing, neither of which actually matter much. It's the feeling, man.)

It was interesting for me to see this in ceramics. I was lucky to find a really great teacher (Bill Wilcox) who understood that it had to be fun, and that this was a bit of a lark for most of us, and some were more serious than others. Maybe he wasn't the most efficient teacher in terms of conveying maximum learning in a set time period - but he kept you coming back. We occasionally had different guest teachers, and they were way more regimented and methodical and wanted you to do drills (pull a cylinder 20 times and check the walls for evenness), and you could see half the class thinking "fuck this"

I suppose this is true of all learning for some kids. Some kids are inherently motivated, I'm going to learn because I'm supposed to, or to get into a good college, or "for my future", or to be smarter than everyone else, or whatever. But other kids see a cosine and think "wtf is that for, who cares".

I've always thought the right way to teach anything is with a goal in mind. Not just "hey learn this because you're supposed to". But "we want to build a catapult and fire it and hit a target. Okay, we're going to need to learn about angles and triangles and such...". The best/easiest/deepest learning is what you learn because you need to learn it to accomplish something that you want to do.

What I need in guitar is a series of mini goals & accomplishments. Hey here's this new song, and it's a little bit too hard for me, but it's a fucking cool song so I actually want to play it. So I practice for a while and get better, and then I can play it, yay! Then I move on to the next one. Just like good game design. And WTF it just doesn't seem to exist.

04-24-15 | Typical Email Experience

I write very careful emails with clear points and specific questions, something like :

Hello, yes blah blah some stuff.  I need to know these points :

1. What about A?

2. There is also b?

3. and finally C?

and I usually get a response like :

Yep, great!

Umm. WTF. You are fucking fired from your job, from life, from the planet, go away.

Yep to what? There were THREE questions in there. And none of them was really a yes/no question anyway. WTF.

So I'll try to be polite and send back something like -

Thanks for the response; yes, to what exactly?  Did you mean yes to A?

Also I still need to know about B & C.

and then I'll get a response like :

Ok, on B we do this and that.

Umm. Okay, that's better. We got one answer, but THERE WERE THREE FUCKING QUESTIONS. I fucking numbered them so you could count them. That means I need three answers.

Sometimes I'll get a response like :

Ramble ramble, some unrelated stuff, sort of answer maybe A and C but not exactly, some
other rambling.

Okay. Thanks for writing a lot of words, but I HAD SPECIFIC FUCKING QUESTIONS.

This is basic fucking professionalism.

Jesus christ.

04-20-15 | Vitamin D

This post is a month or two too late, since we're now into the sun-times (I hope, fingers crossed), but anyway.

Everybody knows when you move to Seattle and get SAD you have to take vitamin D. So all these years I've been taking a few Vit D pills every day in the winter.

Recently my hair has been falling out, which has never happened to me before. I was pretty sure it was just stress, but I thought hey WTF may as well get a blood test and see if anything is wrong. So I get a blood test. Everything is normal, except -

My vit D levels were way way below normal. There's a normal range (3-7) and I was like a 1.

I was like, WTF? I take 1-2 vit D pills every day.

Turns out those pills are 1000 IU each, so I was taking 1-2000 IU. I thought that was a hell of a lot (it's ten million percent of the US RDA). Nope, not a lot. My doc said I could take 8000-10,000 IU to get my levels back up to normal, then maintenance was more like 5000 IU.

So, I started taking big doses, and BOOM instant happiness. More energy, less depression.

I still fucking hate the gray and the wet. (I recently got some foot fungus from hiking on a rainy day. I hate the fucking wet. I'd like to live on Arrakis and never see a single drop of rain again in my life.) But hey with proper vit D dosing I don't want to kill myself every day. Yay.

Pound that D, yo!

04-14-15 | Bend Oregon

Bend in a nutshell :

Restaurants here write "sando" on the menu when describing sandwiches.

I'm not talking about casual order-at-the-counter places with "crazy" graphics. I mean fancier places with good food. Not conversation, written word. "House-made roast beef sando with blue cheese and carmelized onions". Sando.

Bend has two primary demographics :

The rednecks, which unfortunately still infest so much of Oregon (including, surprisingly, even quite a lot of Portland, mainly on the outside, like a nasty fiery red infection around the anus of liberal Portland). The rednecks wear flannel and baseball caps, drive big trucks, they like off-roading, beer and dogs.

The snowboarders. They wear flannel, baseball caps, drive big trucks, like snowboarding, beer and dogs.

They're actually easy to tell apart. The snowboarders wear $100 designer Helly Hansen flannel, the rednecks wear cheap Walmart flannel.

No, actually you can tell them apart because the rednecks are all so damn ugly. They're constantly angry, they don't smile at you, they stomp around and have bad posture. The contrast is severe, which brings us to the next point :

Bend is the happiest place I have ever seen in my life. It's fucking ridiculous.

For Oddworlders, it's like a town full of Bonnies. For Tash, it's a town full of Bagleys.

Everyone is clear-eyed, that bright clear eye sweetness you get from lots of exercise and being outdoors, that gives you inner peace and patience and just fixes everything. Everyone is so sweet and friendly in a real way, not in that syrupy phoney Southern way.

Kids just walk down the street. There were sweet little kids playing everywhere. People ride their bikes to the corner restaurant and just leave it on the rack unlocked.

It's a fucking utopia.

It's horrible being around those positive, sweet, wholesome people who have great life priorities and friends and seem to have fun doing anything. It makes me sick. You are everything I should be in life and am not. They're the kind of people who buy their friends presents because they actually want to. The kind of people who stand up and dance in a bar when noone else is, and just don't even think about it and laugh and have a great time. I hate you so much. It must be the way ugly girls feel when they visit LA.

Everyone in Bend is a massive alcoholic. When you order your coffee in the morning, they ask if you want a craft brewed IPA with that. There's a brewery per person.

On the down side, there is no road biking around Bend. It's supposed to be this biking mecca, but I guess just for mountain bikes. I think road biking in Oregon is probably shit everywhere. There are no nice dis-used windy roads. God I miss the California biking. I can name like 20 truly epic rides on the west coast of the US and every single one of them is in CA. (roads with lots of traffic and/or no shoulder and/or bad pavement are all disqualified; morons who put things like Hwy 1 on the "good rides" list are morons).

So many fucking stinky trucks. I love to drive around and get that fresh piney mountain air, but every fucking place you go there's some damn F350 enormous monster belching soot in your face. It seems like maybe people in Oregon like diesel more than most of the US? I dunno, I've never noticed so many damn stinky trucks in my life. I guess in Texas I wouldn't even try to open my window.

Pros :



Cheap housing (relative to Seattle & Portland and of course CA)

Gorgeous (I love that dry piney stuff; better than the swampy crap west of the Cascades)

Ski all winter!

Sweet for kids (up until high school age or so, then toxic)

Great mountain biking

Cons :


No road biking

Nice people


Low population = less people to choose from

03-25-15 | My Chameleon

I did my own implementation of the Chameleon compression algorithm. (the original distribution is via the density project)

This is the core of Chameleon's encoder :

    cur = *fm32++; h = CHAMELEON_HASH(cur); flags <<= 1;
    if ( c->hash[h] == cur ) { flags ++; *to16++ = (uint16) h; }
    else { c->hash[h] = cur; *((uint32 *)to16) = cur; to16 += 2; }

This is the decoder :

    if ( (int16)flags < 0 ) { cur = c->hash[ *fm16++ ]; }
    else { cur = *((const uint32 *)fm16); fm16 += 2; c->hash[ CHAMELEON_HASH(cur) ] = cur; }
    flags <<= 1; *to32++ = cur;

I thought it deserved a super-simple STB-style header-only dashfuly-described implementation :


My Chameleon.h is not portable or safe or any of that jizzle. Maybe it will be someday. (Update : now builds on GCC & clang. Tested on PS4. Still not Endian-invariant.)

// Usage :

#include "Chameleon.h"

Chameleon c;


size_t comp_buf_size = CHAMELEON_MAXIMUM_OUTPUT_SIZE(in_size);

void * comp_buf = malloc(comp_buf_size);

size_t comp_len = Chameleon_Encode(&c, comp_buf, in_buf, in_size );


Chameleon_Decode(&c, out_buf, in_size, comp_buf );

int cmp = memcmp(in_buf,out_buf,in_size);
assert( comp == 0 );

ADD : Chameleon2 SIMD prototype now posted : (NOTE : this is not good, do not use)

Chameleon2.h - experimental SIMD wide Chameleon
both Chameleons in a zip

The SIMD encoder is not fast. Even on SSE4 it only barely beats scalar Chameleon. So this is a dead end. Maybe some day when we get fast hardware scatter/gather it will be good (*).

(* = though use of hardware scatter here is always going to be treacherous, because hashes may be repeated, and the order in which collisions resolve must be consistent)

03-25-15 | Density - Chameleon

Casey pointed me at Density .

Density contains 3 algorithms, from super fast to slower : Chameleon, Cheetah, Lion.

They all attain speed primarily by working on U32 quanta of input, rather than bytes. They're sort of LZPish type things that work on U32's, which is a reasonable way to get speed in this modern world. (Cheetah and Lion are really similar to the old LZP1/LZP2 with bit flags for different predictors, or to some of the LZRW's that output forward hashes; the main difference is working on U32 quanta and no match lengths)

The compression ratio is very poor. The highest compression option (Lion) is around LZ4-fast territory, not as good as LZ4-hc. But, are they Pareto? Is it a good space-speed tradeoff?

Well, I can't build Density (I use MSVC) so I can't test their implementation for space-speed.

Compressed sizes :

lzt99 :
uncompressed       24,700,820

density :
c0 Chameleon       19,530,262
c1 Cheetah         17,482,048
c2 Lion            16,627,513

lz4 -1             16,193,125
lz4 -9             14,825,016

Oodle -1 (LZB)     16,944,829
Oodle -2 (LZB)     16,409,913

Oodle LZNIB        12,375,347

(lz4 -9 is not competitive for encode time, it's just to show the level of compression you could get at very fast decode speeds if you don't care about encode time ; LZNIB is an even more extreme case of the same thing - slow to encode, but decode time comparable to Chameleon).

To check speed I did my own implementation of Chameleon (which I believe to be faster than Density's, so it's a fair test). See the next post to get my implementation.

The results are :

comp_len = 19492042
Chameleon_Encode_Time : seconds:0.0274 ticks per: 1.919 mb/s : 901.12
Chameleon_Decode_Time : seconds:0.0293 ticks per: 2.050 mb/s : 843.31

round trip time = 0.05670
I get a somewhat smaller file size than Density's version for unknown reason.

Let's compare to Oodle's LZB (an LZ4ish) :

Oodle -1 :

24,700,820 ->16,944,829 =  5.488 bpb =  1.458 to 1
encode           : 0.061 seconds, 232.40 b/kc, rate= 401.85 mb/s
decode           : 0.013 seconds, 1071.15 b/kc, rate= 1852.17 mb/s

round trip time = 0.074

Oodle -2 :

24,700,820 ->16,409,913 =  5.315 bpb =  1.505 to 1 
encode           : 0.070 seconds, 203.89 b/kc, rate= 352.55 mb/s
decode           : 0.014 seconds, 1008.76 b/kc, rate= 1744.34 mb/s

round trip time = 0.084

lzt99 is a collection of typical game data files.

We can test on enwik8 (text/html) too :

Chameleon :

enwik8 :
Chameleon_Encode_Time : seconds:0.1077 ticks per: 1.862 mb/s : 928.36
Chameleon_Decode_Time : seconds:0.0676 ticks per: 1.169 mb/s : 1479.08
comp_len = 61524068

Oodle -1 :

enwik8 : 
100,000,000 ->57,267,299 =  4.581 bpb =  1.746 to 1 
encode           : 0.481 seconds, 120.17 b/kc, rate= 207.79 mb/s
decode           : 0.083 seconds, 697.58 b/kc, rate= 1206.19 mb/s

here Chameleon is much more compelling. It's competitive for size & decode speed, not just encode speed.

Commentary :

Any time you're storing files on disk, this is not the right algorithm. You want something more asymmetric (slow compress, fast decompress).

I'm not sure if Cheetah and Lion are Pareto for round trip time. I'd have to test speed on a wider set of sample data.

When do you actually want a compressor that's this fast and gets so little compression? I'm not sure.

03-15-15 | LZ Literal Correlation Images

I made some pictures.

I'm showing literal correlation by making an image of the histogram. That is, given an 8-bit predictor, you tally of each event :

int histo[256][256]

histo[predicted][value] ++

then I scale the histo so the max is at 255 and make it into an image.

Most of the images that I show are in log scale, otherwise all the detail is too dark, dominated by a few peaks. I also sometimes remove the predicted=value line, so that the off axis detail is more visible.

Let's stop a moment and look t what we can see in these images.

This is a literal histo of "lzt99" , using predicted = lolit (last offset literal; the rep0len1 literal). This is in log scale, with the diagonal removed :

In my images y = prediction and x = current value. x=0, y=0 is in the upper left instead of the lower left where it should be because fucking bitmaps are retarded (everyone is fired, left handed coordinate systems my ass).

The order-0 probability is the vertical line sum for each x. So any vertical lines indicate just strong order-0 correlations.

Most files are a mix of different probability sources, which makes these images look a sum of different contibuting factors.

The most obvious factor here is the diagonal line at x=y. That's just a strong value=predicted generator.

The red blob is a cluster of events around x and y = 0. This indicates a probability event that's related to |x+y| being small. That is, the sum, or length, or something tends to be small.

The green shows a square of probabilities. A square indicates that for a certain range of y's, all x's are equally likely. In this case the range is 48-58. So if y is in 48-58, then any x in 48-58 is equally likely.

There are similar weaker squarish patterns all along the diagonal. Surprisingly these are *not* actually at the binary 8/16 points you might expect. They're actually in steps of 6 & 10.

The blue blobs are at x/y = 64/192. There's a funny very specific strong asymmetric pattern in these. When y = 191 , it predicts x=63,62,61,60 - but NOT 64,65,66. Then at y=192, predict x=64,65,66, but not 63.

In addition to the blue blobs, there are weak dots at all the 32 multiples. This indicates that when y= any multiple of 32, there's a generating event for x = any multiple of 32. (Note that in log scale, these dots look more important than they really are.). There are also some weak order-0 generators at x=32 and so on.

There's some just general light gray background - that's just uncompressible random data (as seen by this model anyway).

Here's a bunch of images : (click for hi res)

rawrawraw subsubsub xorxorxor
loglogNDlinND loglogNDlinND loglogNDlinND
Fez LO
Fez O1
lzt24 LO
lzt24 O1
lzt99 LO
lzt99 O1
enwik7 LO
enwik7 O1

details :

LO means y axis (predictor) is last-offset-literal , in an LZ match parse. Only the literals coded by the LZ are shown.

O1 means y axis is order1 (previous byte). I didn't generate the O1 from the LZ match parse, so it's showing *all* bytes in the file, not just the literals from the LZ parse.

"log" is just log-scale of the histo. An octave (halving of probability) is 16 pixel levels.

"logND" is log without the x=y diagonal. An octave is 32 pixel levels.

"linND" is linear, without the x=y diagonal.

"raw" means the x axis is just the value. "xor" means the x axis is value^predicted. "sub" means the x axis is (value-predicted+127).

Note that raw/xor/sub are just permutations of the values along a horizontal axis, they don't change the values.


Discussion :

The goal of a de-correlating transform is to create vertical lines. Vertical lines are order-0 probability peaks and can be coded without using the predictor as context at all.

If you use an order-0 coder, then any detail which is not in a vertical line is an opportunity for compression that you are passing up.

"Fez" is obvious pure delta data. "sub" is almost a perfect model for it.

"lzt24" has two (three?) primary probability sources. One is almost pure "sub" x is near y data.

The other sources, however, do not do very well under sub. They are pure order-0 peaks at x=64 and 192 (vertical lines in the "raw" image), and also those strange blobs of correlation at (x/y = 64 and 192). The problem is "sub" turns those vertical lines into diagonal lines, effectively smearing them all over the probability spectrum.

A compact but full model for the lzt24 literals would be like this : <FONT COLOR=GREEN><PRE> is y (predictor) near 64 or 192 ? if so -> strongly predict x = 64 or 192 else -> predict x = y or x = 64 or 192 (weaker)

lzt99, being more heterogenous, has various sources.

"xor" takes squares to squares. This works pretty well on text.

In general, the LO correlation is easier to model than O1.

The lzt99 O1 histo in particular has lots of funny stuff. There are bunch of non-diagonal lines, indicating things like x=y/4 patterns, which is odd.

03-04-15 | LZ Match Finding Redux

Some time ago I did a study of LZ match finders. Since then I've somewhat changed my view. This will be a bit hand wavey. Previous posts :

cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 11-02-11 - StringMatchTest Release
cbloom rants 09-24-12 - LZ String Matcher Decision Tree

From fast to slow :

All my fast matchers now use "cache tables". In fact I now use cache tables all the way up to my "Normal" level (default level; something like zip -7).

With cache tables you have a few primary parameters :

hash bits
hash ways
2nd hash

very fastest :
0 :
hash ways =1
2nd hash = off

1 :
hash ways = 2
2nd hash = off

2 :
hash ways = 2
2nd hash = on


hash ways = 16
2nd hash = on

The good thing about cache tables is the cpu cache coherency. You look up by the hash, and then all your possible matches are right there in one cache line. (there's an option of whether you store the first U32 of each match right there in the cache table to avoid a pointer chase to check the beginning of the match).

Cache tables are superb space-speed tradeoff up until ways hits around 16, and then they start to lose to hash-link.

Hash-link :

Hash link is still good, but can be annoying to make fast (*) and does have bad degenerate case behavior (when you have a bad hash collision, the links on that chain get overloaded with crap).

(* = you have to do dynamic amortization and shite like that which is not a big deal, but ugly; this is to handle the incompressible-but-lots-of-hash-collisions case, and to handle the super-compressible-lots-of- redundant-matches case).

The good thing about hash-link is that you are strictly walking matches in increasing offset order. This means you only need to consider longer lengths, which helps break the O(N^2) problem in practice (though not in theory). It also gives you a very easy way to use a heuristic to decide if a match is better or not. You're always doing a simple compare :

previous best match vs.
new match with
higher offset
longer length
which is a lot simpler than something like the cache table case where you see your matches in random order.

Being rather redundant : the nice thing about hash-link is that any time you find a match length, you know absolutely that you have the lowest offset occurance of that match length.

I'm not so high on Suffix Tries any more.

*if* your LZ just needs the longest length at each position, they're superb. If you actually need the best match at every position (traditional optimal parse), they're superb. eg. if you were doing LZSS with large fixed-size offsets, you just want to find the longest match all the time - boom Suffix Trie is the answer. They have no bad degenerate case, that's great.

But in practice on a modern LZ they have problems.

The main problem is that a Suffix Trie is actually quite bad at finding the lowest offset occurance of a short match. And that's a very important thing to be good at for LZ. The problem is that a proper ST with follows is doing its updates way out deep in the leaves of the tree, but the short matches are up at the root, and they are pointing at the *first* occurance of that substring. If you update all parents to the most recent pointer, you lose your O(N) completely.

(I wrote about this problem before : cbloom rants 08-22-13 - Sketch of Suffix Trie for Last Occurance )

You can do something ugly like use a suffix trie to find long matches and a hash->link with a low walk limit to find the most recent occurance of short matches. But bleh.

And my negativity about Suffix Tries also comes from another point :

Match finding is not that important. Well, there are a lot of caveats on that. On structured data (not text), with a pos-state-lastoffset coder like LZMA - match finding is not that important. Or rather, parsing is more important. Or rather, parsing is a better space-speed tradeoff.

It's way way way better to run an optimal parse with a crap match finder (even cache table with low ways) than to run a heuristic parse with great match finder. The parse is just way more important, and per CPU cost gives you way more win.

And there's another issue :

With a forward optimal parse, you can actually avoid finding matches at every position.

There are a variety of ways to get to skip ahead in a forward optimal parse :

Any time you find a very long match -
 just take it and skip ahead
 (eg. fast bytes in LZMA)

 this can reduce the N^2 penalty of a bad match finder

When you are not finding any matches -
 start taking multi-literal steps
 using something like (failedMatches>>6) heuristic

When you find a long enough rep match -
 just take it
 and this "long enough" can be much less than "fast bytes"
 eg. fb=128 for skipping normal matches
 but you can just take a rep match at length >= 8
 which occurs much more often

the net result is lots of opportunity for more of a "greedy" type of match finding in your optimal parser, where you don't need every match.

This means that good greedy-parse match finders like hash-link and Yann's MMC ( my MMC note ) become interesting again (even for optimal parsing).

03-02-15 | Oodle LZ Pareto Frontier

I made a chart.

I'm showing the "total time to load" (time to load off disk at a simulated disk speed + time to decompress). You always want lower total time to load - smaller files make the simulated load time less, faster decompression make the decompress time less.

total_time_to_load = compressed_size / disk_speed + decompress_time

It looks neatest in the form of "speedup". "speedup" is the ratio of the effective speed vs. the speed of the disk :

effective_speed = raw_size / total_time_to_load

speedup = effective_speed / disk_speed

By varying disk speed you can see the tradeoff of compression ratio vs. cpu usage that makes different compressors better in different application domains.

If we write out what speedup is :

speedup = raw_size / (compressed_size + decompress_time * disk_speed)

speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)

speedup ~= harmonic_mean( compression_ratio , decompress_speed / disk_speed )

we can see that it's a "harmonic lerp" between compression ratio on one end and decompress speed on the other end, with the simulated disk speed as lerp factor.

These charts show "speedup" vs. log of disk_speed :

(the log is log2, so 13 is a disk speed of 8192 mb/s).

On the left side, the curves go flat. At the far left (x -> -infinity, disk speed -> 0) the height of each curve is proportional to the compression ratio. So just looking at how they stack up on the far left tells you the compression ratio performance of each compressor. As you go right more, decompression speed becomes more and more important and compressed size less so.

With ham-fisted-shading of the regions where each compressor is best :

The thing I thought was interesting is that there's a very obvious Pareto frontier. If I draw a tangent across the best compressors :

Note that at the high end (right), the tangent goes from LZB to "memcpy" - not to "raw". (raw is just the time to load the raw file from disk, and we really have to compare to memcpy because all the compressors fill a buffer that's different from the IO buffer). (actually the gray line I drew on the right is not great, it should be going tangent to memcpy; it should be shaped just like each of the compressors' curves, flat on the left (compressed size dominant) and flat on the right (compress time dominant))

You can see there are gaps where these compressors don't make a complete Pareto set; the biggest gap is between LZA and LZH which is something I will address soon. (something like LZHAM goes there) And you can predict what the performance of the missing compressor should be.

It's also a neat way to test that all the compressors are good tradeoffs. If the gray line didn't make a nice smooth curve, it would mean that some compressor was not doing a good job of hitting the maximum speed for its compression level. (of course I could still have a systematic inefficiency; like quite possibly they're all 10% worse than they should be)


If instead of doing speedup vs. loading raw you do speedup vs. loading raw + memcpy, you get this :

The nice thing is the right-hand asymptotes are now constants, instead of decaying like 1/disk_speed.

So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff.

03-02-15 | LZ Rep-Match after Match Strategies

Tiny duh note.

When I did the Oodle LZH I made a mistake. I used a zip-style combined codeword. Values 0-255 are a literal, and 256+ contain the log2ish of length and offset. The advantage of this is that you only have one Huff table and just do one decode, then if it's a match you also fetch some raw bits. It also models length-offset correlation by putting them in the codeword together. (this scheme is missing a lot of things that you would want in a more modern high compression LZ, like pos-state patterns and so on).

Then I added "rep matches" and just stuck them in the combined codeword as special offset values. So the codeword was :

256 : literal
4*L : 4 rep matches * L length slots (L=8 or whatever = 32 codes)
O*L : O offset slots * L length slots (O=14 and L = 6 = 84 codes or whatevs)
= 256+32+84 = 372

The problem is that rep-match-0 can never occur after a match. (assuming you write matches of maximum length). Rep-match-0 is quite important, on binary/structured files it can have very high probability. By using a single codeword which contains rep-match-0 for all coding events, you are incorrectly mixing the statistics of the after match state (where rep-match-0 has zero probability) and after literal state (where rep-match-0 has high probability).

A quick look at the strategies for fixing this :

1. Just use separate statistics. Keep the same combined codeword structure, but have two entropy tables, one for after-match and one for after-literal. This would also let you code the literal-after-match as an xor literal with separate statistics for that.

Whether you do xor-lit or not, there will be a lot of shared probability information between the two entropy tables, so if you do static Huffman or ANS probability transmission, you may need to use the cross-two-tables-similary in that transmission.

In a static Huffman or ANS entropy scheme if rep-match-0 never occurs in the after-match code table, it will be given a codelen of 0 (or impossible) and won't take any code space at all. (I guess it does take a little code space unless you also explicitly special case the knowledge that it must have codelen 0 in your codelen transmitter)

This is the simplest version of the more general case :

2. Context-code the rep-match event using match history. As noted just using "after match" or "after literal" as the context is the simplest version of this, but more detailed history will also affect the rep match event. This is the natural way to fix it in an adaptive arithmetic/ANS coder which uses context coding anyway. eg. this is what LZMA does.

Here we aren't forbidding rep-match after match, we're just using the fact that it never occur to make its probability go to 0 (adaptively) and thus it winds up taking nearly zero code space. In LZMA you actually can have a rep match after match because the matches have a max length of 273, so longer matches will be written as rep matches. Ryg pointed out that after a match that's been limitted by max-length, LZMA should really consider the context for the rep-match coding to be like after-literal , not after-match.

In Oodle's LZA I write infinite match lengths, so this is simplified. I also preload the probability of rep-match in the after-match contexts to be near zero. (I actually can't preload exactly zero because I do sometimes write a rep-match after match due to annoying end-of-buffer and circular-window edge cases). Preconditioning the probability saves the cost of learning that it's near zero, which saves 10-100 bytes.

3. Use different code words.

Rather than relying on statistics, you can explicitly use different code words for the after-match and after-literal case. For example in something like an LZHuf as described above, just use a codeword in the after-match case that omits the rep0 codes, and thus has a smaller alphabet.

This is most clear in something like LZNib. LZNib has three events :

Rep Match
So naively it looks like you need to write a trinary decision at each coding (L,M,R). But in fact only two of them are ever possible :

After L - M or R
  cannot write another L because we would've just made LRL longer

After M or R - L or M
  cannot write an R because we wouldn't just made match longer

So LZNib writes the binary choices (M/R) after L and (L/M) after M or R. Because they're always binary choices, this allows LZNib to use the simple single-divider method of encoding values in bytes .

4. Use a combined code word that includes the conditioning state.

Instead of context modeling, you can always take previous events that you need context from and make a combined codeword. (eg. if you do Huffman on the 16-bit bigram literals, you get order-1 context coding on half the literals).

So we can make a combined codeword like :

(LRL 0,1,2,3+) * (rep matches , normal matches) * (lengths slots)
= 4 * (4 + 14) * 8 = 576

Which is a pretty big alphabet, but also combined length slots so you get lrl-offset-length correlation modeling as well.

In a combined codeword like this you are always writing a match, and any literals that precede it are written with an LRL (may be 0). The forbidden codes are LRL=0 and match =rep0 , so you can either just let those get zero probabilities, or explicitly remove them from the codeword to reduce the alphabet. (there are also other forbidden codes in normal LZ parses, such as low-length high-offset codes, so you would similarly remove or adjust those)

A more minimal codeword is just

(LRL 0,1+) * (rep-match-0, any other match)
= 2 * 2 = 4

which is enough to get the rep-match-0 can't occur after LRL 0 modeling. Or you can do anything between those two extremes to choose an alphabet size.

More :

04/2012 to 02/2015
06/2011 to 04/2012
01/2011 to 06/2011
10/2010 to 01/2011
01/2010 to 10/2010
01/2009 to 12/2009
10/2008 to 01/2009
08/2008 to 10/2008
03/2008 to 08/2008
11/2007 to 03/2008
07/2006 to 11/2007
12/2005 to 07/2006
06/2005 to 12/2005
01/1999 to 06/2005

back to cbloom.com