cbloom.com

Go to the new cbloom rants @ blogspot


08-27-14 | LZ Match Length Redundancy

A quick note on something that does not work.

I've written before about the redundancy in LZ77 codes. ( for example ). In particular the issue I had a look at was :

Any time you code a match, you know that it must be longer than any possible match at lower offsets.

eg. you won't send a match of length 3 to offset 30514 if you could have sent offset 1073 instead. You always choose the lowest possible offset that gives you a given match length.

The easy way to exploit this is to send match lengths as the delta from the next longest match length at lower offset. You only need to send the excess, and you know the excess is greater than zero. So if you have an ML of 3 at offset 1073, and you find a match of length 4 at offset 30514, then you send {30514,+1}

Implementing this in the encoder is straightforward. If you walk your matches in order from lowest offset to highest offset, then you know the current best match length as you go.
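
In rough code (a sketch only; the match list and field names are made up, not real code) :

// walk candidate matches from lowest offset to highest; matches[] is
// sorted by increasing offset and filtered so lengths strictly increase
int baseML = MIN_MATCH_LEN - 1;
for(int i=0; i<numMatches; i++)
{
    ASSERT( matches[i].ml > baseML );
    matches[i].excess = matches[i].ml - baseML; // always >= 1 ; send this
    baseML = matches[i].ml;
}
// the decoder reconstructs baseML for the sent offset by running the same
// match walk over its already-decoded stream, then adds the excess back on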

The same principle applies to the "last offsets" ; you don't send LO2 if you could send LO0 at the same length, so the higher index LO matches must be of greater length. And the same thing applies to ROLZ.

I tried this in all 3 cases (normal LZ matches, LO matches, ROLZ). No win. Not even a tiny one; the gain was essentially zero.

Part of the problem is that match lengths are just not where the redundancy is. But I assume that part of what's happening is that match lengths have patterns that the delta-ing ruins. For example binary files will have patterns of 4 or 8 long matches, or in an LZMA-like coder you'll have certain patterns show up, like at certain pos&3 intervals after a literal you get a 3-long match, etc.

I tried some obvious ideas like using the next-lowest-length as part of the context for coding the delta-length. In theory you should be able to recapture something like : a next-lowest of 3 predicts a delta of 1 in places where an ML of 4 is likely. But I couldn't find a win there.

I believe this is a dead end. Even if you could find a small win, it's too slow in the decoder to be worth it.


07-15-14 | I'm back

Well, the blog took a break, and now it's back. I'm going to try moderated comments for a while and see how that goes.

I also renamed the VR post to break the links from reddit and twitter, but it's still there.


07-14-14 | Suffix-Trie Coded LZ

Idea : Suffix-Trie Coded LZ :

You are doing LZ77-style coding (eg. matches in the prior stream or literals), but send the matches in a different way.

You have a Suffix Trie built on the prior stream. To find the longest match for a normal LZ77 you would take the current string to code and look it up by walking it down the Trie. When you reach the point of deepest match, you see what string in the prior stream made that node in the Trie, and send the position of that string as an offset.

Essentially what the offset does is encode a position in the tree.

But there are many redundancies in the normal LZ77 scheme. For example if you only encode a match of length 3, then the offsets that point to "abcd.." and "abce.." are equivalent, and shouldn't be distinguished by the encoding. The fact that they both take up space in the numerical offset is a waste of bits. You only want to distinguish offsets that actually point at something different for the current match length.

The idea in a nutshell is that instead of sending an offset, you send the descent into the trie to find that string.

At each node, first send a single bit for whether the next byte in the string matches any of the children. (This is equivalent to a PPM escape). If not, then you're done matching. If you like, this is like sending the match length in unary : 1 bits as long as you're in a node that has a matching child, then a 0 bit when you run out of matches. (alternatively you could send the entire match length up front with a different scheme).

When one of the children matches, you must encode which one. This is just an encoding of the next character, selected from the previously seen characters in this context. If all offsets are equally likely (they aren't) then the correct thing is just Probability(child) = Trie_Leaf_Count(child) , because the number of leaves under a node is the number of times we've seen this substring in the past.

(More generally the probability of offsets is not uniform, so you should scale the probability of each child using some modeling of the offsets. Accumulate P(child) += P(offset) for each offset under a child. Ugly. This is unfortunately very important on binary data where the 4-8-struct offset patterns are very strong.)

Ignoring that aside - the big coding gain is that we are no longer uselessly distinguishing offsets that only differ at higher match length, AND instead of just wasting those bits, we instead use them to make those offsets code smaller.

For example : say we've matched "ab" so far. The previous stream contains "abcd","abce","abcf", and "abq". Pretend that somehow those are the only strings. Normal LZ77 needs 2 bits to select from them - but if our match len is only 3 that's a big waste. This way we would say the next char in the match can either be "c" or "q" and the probabilities are 3/4 and 1/4 respectively. So if the length-3 match is a "c" we send that selection in only log2(4/3) bits = 0.415
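
In rough code, the descent coding is something like this (a sketch; the TrieNode fields and coder calls are made-up names) :

// hedged sketch of the descent coder
const TrieNode * node = deepest_node_for_current_context;
for(;;)
{
    int c = *ptr; // next byte we want to match
    const TrieNode * child = node->FindChild(c);

    // one bit : does any child continue the match ? (the PPM escape)
    // this is the unary match length
    EncodeBit( arith, child != NULL, &node->continue_prob );
    if ( child == NULL )
        break; // done matching

    // code which child ; under the uniform-offset assumption,
    // P(child) = child->leaf_count / node->leaf_count
    EncodeChildByLeafCount( arith, node, child );

    node = child;
    ptr++;
}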

And the astute reader will already be thinking - this is just PPM! In fact it is exactly a kind of PPM, in which you start out at low order (min match length, typically 3 or so) and your order gets deeper as you match. When you escape you jump back to order 3 coding, and if that escapes it jumps back to order 0 (literal).

There are several major problems :

1. Decoding is slow because you have to maintain the Suffix Trie for both encode and decode. You lose the simple LZ77 decoder.

2. Modern LZ's benefit a lot from modeling the numerical value of the offset in binary files. That's ugly & hard to do in this framework. This method is a lot simpler on text-like data that doesn't have numerical offset patterns.

3. It's not Pareto. If you're doing all this work you may as well just do PPM.

In any case it's theoretically interesting as an ideal of how you would encode LZ offsets if you could.

(and yes I know there have been many similar ideas in the past; LZFG of course, and Langdon's great LZ-PPM equivalence proof)


07-03-14 | Oodle 1.41 Comparison Charts

I did some work for Oodle 1.41 on speeding up compressors. Mainly the Fast/VeryFast encoders got faster. I also took a pass at trying to make sure the various options were "Pareto", that is the best possible space/speed tradeoff. I had some options that were off the curve, like much slower than they needed to be, or just worse with no benefit, so it was just a mistake to use them (LZNib Normal was particularly bad).

Oodle 1.40 got the new LZA compressor. LZA is a very high compression arithmetic-coded LZ. The goal of LZA is as much compression as possible while retaining somewhat reasonable (or at least tolerable) decode speeds. My belief is that LZA should be used for internet distribution, but not for runtime loading.

The charts :

compression ratio : (raw/comp ratio; higher is better)
compressor  VeryFast  Fast    Normal  Optimal1  Optimal2
LZA         2.362     2.508   2.541   2.645     2.698
LZHLW       2.161     2.299   2.33    2.352     2.432
LZH         1.901     1.979   2.039   2.121     2.134
LZNIB       1.727     1.884   1.853   2.079     2.079
LZBLW       1.636     1.761   1.833   1.873     1.873
LZB16       1.481     1.571   1.654   1.674     1.674

lzmamax  : 2.665 to 1
lzmafast : 2.314 to 1
zlib9 : 1.883 to 1 
zlib5 : 1.871 to 1
lz4hc : 1.667 to 1
lz4fast : 1.464 to 1

encode speed : (mb/s)
compressor  VeryFast  Fast    Normal  Optimal1  Optimal2
LZA         23.05     12.7    6.27    1.54      1.07
LZHLW       59.67     19.16   7.21    4.67      1.96
LZH         76.08     17.08   11.7    0.83      0.46
LZNIB       182.14    43.87   10.76   0.51      0.51
LZBLW       246.83    49.67   1.62    1.61      1.61
LZB16       511.36    107.11  36.98   4.02      4.02

lzmamax  : 5.55
lzmafast : 11.08
zlib9 : 4.86
zlib5 : 25.23
lz4hc : 32.32
lz4fast : 606.37

decode speed : (mb/s)
compressor  VeryFast  Fast     Normal   Optimal1  Optimal2
LZA         34.93     37.15    37.76    37.48     37.81
LZHLW       363.94    385.85   384.83   391.28    388.4
LZH         357.62    392.35   397.72   387.28    383.38
LZNIB       923.66    987.11   903.21   1195.66   1194.75
LZBLW       2545.35   2495.37  2465.99  2514.48   2515.25
LZB16       2752.65   2598.69  2687.85  2768.34   2765.92

lzmamax  : 42.17
lzmafast : 40.22
zlib9 : 308.93
zlib5 : 302.53
lz4hc : 2363.75
lz4fast : 2288.58

While working on LZA I found some encoder speed wins that I ported back to LZHLW (mainly in Fast and VeryFast). A big one is to early out for last offsets; when I get a last offset match > N long, I just take it and don't even look for non-last-offset matches. This is done in the non-Optimal modes, and surprisingly hurts compression almost not at all while helping speed a lot.
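
The early out itself is trivial; something like this (sketch; the helper names and the threshold constant are made up) :

// last-offset early out, done in Fast/VeryFast modes only
int loML = LongestMatchAtLastOffsets( ptr, lastOffsets );
if ( loML >= EARLY_OUT_ML ) // "N" , some tuned constant
{
    OutputLastOffsetMatch( loML );
    ptr += loML;
    continue; // skip the normal match finder entirely
}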

Four of the compressors are now in pretty good shape (LZA,LZHLW,LZNIB, and LZB16). There are a few minor issues to fix someday (someday = never unless the need arises) :

LZA decoder should be a little faster (currently lags LZMA a tiny bit).
LZA Optimal1 would be better with a semi-greedy match finder like MMC (LZMA is much faster to encode than me at the same compression level; perhaps a different optimal parse scheme is needed too).
LZA Optimal2 should seed with multi-parse.
LZHLW Optimal could be faster.
LZNIB Normal needs much better match selection heuristics; the ones I have are really just not right.
LZNIB Optimal should be faster; needs a better way to do threshold-match-finding.
LZB16 Optimal should be faster; needs a better 64k-sliding-window match finder.

The LZH and LZBLW compressors are a bit neglected and you can see they still have some of the anomalies in the space/speed tradeoff curve, like the Normal encode speed for LZBLW is so bad that you may as well just use Optimal. Put aside until there's a reason to fix them.


If another game developer tells me that "zlib is a great compromise and you probably can't beat it by much" I'm going to murder them. For the record :

zlib -9 :
4.86 MB/sec to encode
308.93 MB/sec to decode
1.883 to 1 compression

LZHLW Optimal1 :
4.67 MB/sec to encode
391.28 MB/sec to decode
2.352 to 1 compression
come on! The encoder is slow, the decoder is slow, and it compresses poorly.

LZMA in very high compression settings is a good tradeoff. In its low compression fast modes, it's very poor. zlib has the same flaw - they just don't have good encoders for fast compression modes.

LZ4 I have no issues with; in its designed zone it offers excellent tradeoffs.


In most cases the encoder implementations are :


VeryFast =
cache table match finder
single hash
greedy parse

Fast = 
cache table match finder
hash with ways
second hash
lazy parse
very simple heuristic decisions

Normal =
varies a lot for the different compressors
generally something like a hash-link match finder
or a cache table with more ways
more lazy eval
more careful "is match better" heuristics

Optimal =
exact match finder (SuffixTrie or similar)
cost-based match decision, not heuristic
backward exact parse of LZB16
all others have "last offset" so require an approximate forward parse

I'm mostly ripping out my Hash->Link match finders and replacing them with N-way cache tables. While the cache table is slightly worse for compression, it's a big speed win, which makes it better on the space-speed tradeoff spectrum.
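
For reference, an N-way cache table match finder is something like this (a sketch only; the sizes, hash, and replacement policy are just illustrative, not what Oodle does) :

#include <string.h>

typedef unsigned char U8; typedef unsigned int U32;

enum { HASH_BITS = 16, HASH_SIZE = 1<<HASH_BITS, NUM_WAYS = 4 };
static U32 s_table[HASH_SIZE][NUM_WAYS]; // positions ; 0 means empty

static U32 Hash4(const U8 * p) // hash the first 4 bytes at p
{
    U32 x; memcpy(&x,p,4);
    return (x * 2654435761u) >> (32 - HASH_BITS);
}

// returns best match length found (0 if none) and fills *pMatchPos
static int FindAndInsert(const U8 * window, U32 pos, U32 avail, U32 * pMatchPos)
{
    U32 h = Hash4(window+pos);
    int bestLen = 0;
    for(int w=0; w<NUM_WAYS; w++)
    {
        U32 prev = s_table[h][w];
        if ( prev == 0 || prev >= pos ) continue;
        int len = 0;
        while ( (U32)len < avail && window[prev+len] == window[pos+len] )
            len++;
        if ( len > bestLen ) { bestLen = len; *pMatchPos = prev; }
    }
    // insertion just overwrites one way (rotating by position here); old
    // positions simply fall out - slightly worse matches than a hash-link
    // chain, but no chain walking at all
    s_table[h][pos & (NUM_WAYS-1)] = pos;
    return bestLen;
}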

I don't have a good solution for windowed optimal parse match finding (such as LZB16-Optimal). I'm currently using overlapped suffix arrays, but that's not awesome. Sliding window SuffixTrie is an engineering nightmare but would probably be good for that. MMC is a pretty good compromise in practice, though it's not exact and does have degenerate case breakdowns.


LZB16's encode speed is very sensitive to the hash table size.


-h12
24,700,820 ->16,944,823 =  5.488 bpb =  1.458 to 1
encode           : 0.045 seconds, 161.75 b/kc, rate= 550.51 mb/s
decode           : 0.009 seconds, 849.04 b/kc, rate= 2889.66 mb/s

-h13
24,700,820 ->16,682,108 =  5.403 bpb =  1.481 to 1
encode           : 0.049 seconds, 148.08 b/kc, rate= 503.97 mb/s
decode           : 0.009 seconds, 827.85 b/kc, rate= 2817.56 mb/s

-h14
24,700,820 ->16,491,675 =  5.341 bpb =  1.498 to 1
encode           : 0.055 seconds, 133.07 b/kc, rate= 452.89 mb/s
decode           : 0.009 seconds, 812.73 b/kc, rate= 2766.10 mb/s

-h15
24,700,820 ->16,409,957 =  5.315 bpb =  1.505 to 1
encode           : 0.064 seconds, 113.23 b/kc, rate= 385.37 mb/s
decode           : 0.009 seconds, 802.46 b/kc, rate= 2731.13 mb/s

If you accidentally set it too big you get a huge drop-off in speed. (The charts above show -h13 ; -h12 is more comparable to lz4fast (which was built with HASH_LOG=12)).

I stole an idea from LZ4 that helped the encoder speed a lot. (lz4fast is very good!) Instead of doing the basic loop like :


while(!eof)
{
  if ( match )
    output match
  else
    output literal
}

instead do :

while(!eof)
{
  while( ! match )
  {
    output literal
  }

  output match
}

This lets you make a tight loop just for outputting literals. It makes it clearer to you as a programmer what's happening in that loop and you can save work and simplify things. It winds up being a lot faster. (I've been doing the same thing in my decoders forever but hadn't done it in the encoder).

My LZB16 is very slightly more complex to encode than LZ4, because I do some things that let me have a faster decoder. For example my normal matches are all no-overlap, and I hide the overlap matches in the excess-match-length branch.


06-26-14 | VR Impressions

NOTE : changed post title to break the link.

Yesterday I finally went to Valve and saw "The Room". This is a rather rambly post about my thoughts after experiencing it.

For those who have been under a rock (like me), Valve has got this amazing VR demo. It uses unique prototype hardware that provides very good positional head tracking and very low latency graphics. It's in a calibrated room with registration spots all over the walls. It's way way better than any other VR, it's the real thing.

There is this magic thing that happens, it does tickle your brain intuitively. Part of you thinks that you're there. I had the same experiences that I've heard other people recount - your body starts reacting; like when a sphere moves towards you, you flinch and try to dodge it without thinking.

Part of the magic is that it's good enough that you *want* to believe it. It's not actually good enough that it seems real. Even in the carefully calibrated Valve room, it's glitchy and things pop a bit, and you always know you're in a simulation. But you choose to ignore the problems. It felt like when you're watching a good movie, and if you were being rational you would say that this is all illogical and the green screening looks fucking terrible and that is physically impossible what he just did, but if it's good you just choose to ignore all that and go along for the ride. VR felt like that to me.

One of the cool things about VR is that there is an absolute sense of scale, because you are always the size of you. This gives you scale reference in a way that you never have in games. Which is also a problem. It's wonderful if you're making games where you play as a human, but you can't play as a giant (if you just scale down everything else, it feels like you're you in a world where everything else is tiny, not that you're bigger; scale is no longer relative, you are always you). You can't make the characters run at 60 mph the way we usually do in games.

As cool as it is, I don't see how you actually make games with it.

For one thing there are massive short term technical problems. The headset is heavy and uncomfortable. The lenses have to be perfectly aligned to your eyes or you get sick. The registration is very easy to mess up. I'm sure these will be resolved over time. The headset has a cable which is always in danger of tripping or strangling you, which is a major problem and technically hard to get rid of, but perhaps possible some day.

But there are more fundamental inherent problems. When I stepped off the ledge, I wanted to fall. But of course I never actually can. You make my character fall, but not my body? That's weird. Heck if my character steps up on something, I want to step up myself. You can only make games where you basically stand still. In the room with the pipes, I want to climb on the pipes. Nope, you can't - and probably never can. Why would I want to be in a virtual world if I can't do anything there? I don't know how you even walk around a world without it feeling bizarre. All the Valve demos are basically you stuck in a tiny box, which is going to get old.

How do you ever make a game where the player character is moved without their own volition? If an NPC bumps me and pushes my avatar, what happens? You can't push my real human body, so it breaks the illusion. It seems to me that as soon as your viewpoint has a physical reaction with the virtual world and isn't just a viewer with no collision detection, it just doesn't work.

There's this fundamental problem that the software cannot move the player's viewpoint. The player must always get to move their own viewpoint with their head, or the illusion is broken (or worse, you get sick). This is just such a huge problem for games, it means the player can only be a ghost, or an omniscient observer in an RTS game, or other such things. Sure you can make games where you stand over an RTS world map and poke at it. Yay, it's a board game with fancy graphics. I see how it could be great as a sculpting or design tool. I see how it would be great for The Witness and similar games.

For me personally, it's so disappointing that you can't actually physically be in these worlds. The most exciting moments for me were some of the outdoor scenes, or the pipe room, where I just felt viscerally - "I want to run around in this world". What would be amazing for me would be to go in the VR world to alien planets with crazy strange plants and geology, and be able to run around it and climb on it. And I just don't see how that ever works. You can't walk around your living room, you'll trip on things or run into the wall. You can't ever step up or down anything, you have to be on perfectly flat ground all the time. You can't ever touch anything. (It's like a strip club; hey this is so exciting! can I interact with it? No? I have to just sit here and not move or touch anything? How fucking boring and frustrating. I'm leaving.)

At the very minimum you need gloves with amazing force feedback to give you some kind of tactile experience of the VR world, but even then it's just good for VR surgery and VR board games and things where you stand still and touch things. (and we all know the real app is VR fondling).

You could definitely make a killer car racing game. Put me in a seat with force feedback, and that solves all the physical interaction problems. (or, similarly, I'm driving a mech or a space ship or whatever; basically lock the player in a seat so you don't have to address the hard problems for now).

There are also huge huge software problems. Collision detection has to be polygon-perfect ; coarse collision proxies are no longer acceptable. Physics and animation have to be way better. Texture mapping and normal mapping just don't work. Billboard cards just don't work. We basically can't have trees or smoke or anything soft or complex for a long time, it's going to be a lot of super simple rigid objects. Skinned characters and painted on clothing (and just using textures to paint on geometry), none of it works. Flat shaded simple stuff is totally fine, but all the hacks we've used for so long are out the window.

I certainly see the appeal (for a software engineer) of starting from scratch on so many issues and working on the hard problems. Fun.

Socially I find VR rather scary.

One issue is the addictive nature of living in a VR world. Yes yes people are already addicted to their phones and facebook and WoW and whatever, but this is a whole new level. Plus it's even more disengaged from reality; it's one thing for everyone in a coffee shop these days to be staring at their laptops (god I hate you) but when they're all in headsets then interaction in the real world is completely over. I have no doubt that there will be a large class of people that live in the VR world and never leave their living room; Facebook will provide a "deliver pizza" button so that you don't even have to exit the simulation. It will be bad.

Perhaps more disturbing to me is how real and scary it can be. Just having a cube move into me was a kind of real physical fright that I haven't felt in a game. I think that being in a realistic VR world with people shooting each other would be absolutely terrifying and disgusting and really would do bad things to the brains of the players.

And if we wind up with evil overlords like Facebook or Apple or whoever controlling our VR world, that is downright dystopian. We all had our chance to say "no" to the rise of closed platforms when the Apple shit started to take off, and we all fucking dropped the ball (well, you did). Hell we did the same thing with the PATRIOT act. We're all just lying down and getting raped and not doing a damn thing about it and the future of freedom is very bleak indeed. (wow that rant went off the rails)

Anyway, I look forward to trying it again and seeing what people come up with. It's been a long time since I saw anything in games that made me say "yes! I want to play that!" so in that sense VR is a huge win.


Saved comments :

Tom Forsyth said... Playing as a giant is OK - grow the player's height, but don't move their eyes further apart. So the scale is unchanged, but the eyes are higher off the ground. July 3, 2014 at 7:45 PM

brucedawson said... Isn't a giant just somebody who is way taller than everybody else? So yeah, maybe if you 'just' scale down everyone else then you'll still feel normal size. But you'll also feel like you can crush the tiny creatures like bugs! Which is really the essence of being a giant. And yes, I have done the demo. July 3, 2014 at 8:56 PM

Grzegorz Adam Hankiewicz said... I don't understand how you say a steering wheel with force feedback solves any VR problem when the main reason I know I'm driving fast is how forces are being applied to my whole body, not that I'm holding something round instead of a gamepad. You mention it being jarring not being able to climb, wouldn't it be jarring to jump on a terrain bump inside your car and not feel gravity changes? Maybe the point of VR is not replicating dull life but simulating what real life can't possibly give us ever? July 4, 2014 at 3:08 AM

cbloom said... @GAH - not a wheel with force feedback (they all suck right now), but a *seat* like the F1 simulators use. They're quite good at faking short-term forces (terrain bumps and such are easy). I certainly don't mean that that should be the goal of VR. In fact it's quite disappointing that that is the only thing we have any hope of doing a good job of in the short term. July 4, 2014 at 7:33 AM

Stu said... I think you're being a bit defeatist about it, and unimaginative about how it can be used today. Despite being around 30 years old, the tech has only just caught up to the point whereby it can begin to travel down the path towards full immersion, Matrix style brain plugs, holodeck etc. This shit's gotta start somewhere, and can still produce amazing gaming - an obvious killer gaming genre is in any vehicular activity, incl. racing, normal driving, flying, space piloting, etc. Let the other stuff slowly evolve towards your eventual goal - we're in the 'space invaders' and 'pacman' era for VR now, and it works as is for a lot of gaming. July 4, 2014 at 9:11 AM

cbloom said... I'd love to hear any ideas for how game play with physical interaction will ever work. Haven't heard any yet. Obviously the goal should be physical interaction that actually *feels* like physical interaction so that it doesn't break the illusion of being there. That's unattainable for a long time. But even more modest is just how do you do something like walking around a space that has solid objects in it, or there are NPC's walking around. How do you make that work without being super artificial and weird and breaking the magic? In the short term we're going to see games that are basically board games, god games, fucking "experiences" where flower petals fall on you and garbage like that. We're going to see games that are just traditional shitty computer games, where you slide around a fucking lozenge collision proxy using a gamepad, and the VR is just a viewer in that game. That is fucking lame. What I would really hate to see is for the current trend in games to continue into VR - just more of the same shit all the time with better graphics. If people just punt on actually solving VR interaction and just use it as a way to make amazing graphics for fucking Call of Doody or Retro Fucking Mario Indie Bullshit Clone then I will be sad. When the top game is fucking VR Candy Soul-Crush then I will be sad. What is super magical and amazing is the feeling that you actually are somewhere else, and your body gets involved in a way it never has before, you feel like you can actually move around in this VR world. And until games are actually working in that magic it's just bullshit. July 4, 2014 at 9:36 AM

cbloom said... If you like, this is an exhortation to not cop the fuck out on VR the way we have in games for the last 20 years. The hard problems we should be solving in games are AI, animation/motion, physics. But we don't. We just make the same shit and put better graphics on it. Because that sells, and it's easy. Don't do that to VR. Actually work on how people interact with the simulation, and how the simulation responds to them. July 4, 2014 at 10:03 AM

Dillon Robinson said... Son, Bloom, kiddo, you've talking out of your ass again. Think before you speak.

.. and then it really went downhill.


06-21-14 | Suffix Trie Note

A small note on Suffix Tries for LZ compression.

See previously :

Sketch of Suffix Trie for Last Occurance

So. Reminder to myself : Suffix Tries for optimal parsing are clean and awesome. But *only* for finding the length of the longest match. *not* for finding the lowest offset of that match. And *not* for finding the lowest offset at each shorter match length.

I wrote before about the heuristic I currently use in Oodle to solve this. I find the longest match in the trie, then I walk up to parent nodes and see if they provide lower offset / shorter length matches, because those may also be interesting to consider in the optimal parser.

(eg. for clarity, the situation you need to consider is something like a match of length 8 at offset 482313 vs. a match of length 6 at offset 4 ; it's important to find that lower-length lower-offset match so that you can consider the cost of it, since it might be much cheaper)

Now, I tested the heuristic of just doing parent-gathers and limited updates, and it performed well *in my LZH coder*. It does *not* necessarily perform well with other coders.

It can miss out on some very low offset short matches. You may need to supplement the Suffix Trie with an additional short range matcher, like even just a 1024 entry hash-chain matcher. Or maybe a [256*256*256] array of the last occurrence location of a trigram. Even just checking at offset=1 for the RLE match is helpful. Whether or not they are important depends on the back end coder, so you just have to try it.
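
The trigram array is about as simple as a match finder gets; something like this (sketch; ConsiderMatch and MatchLen are made-up helpers) :

static U32 s_lastSeen[256*256*256]; // 16M entries * 4 bytes = 64 MB

// at each position pos (with p = window + pos) :
U32 t = ((U32)p[0]<<16) | (p[1]<<8) | p[2];
U32 prev = s_lastSeen[t]; // most recent spot this trigram occurred
s_lastSeen[t] = pos;
if ( prev != 0 )
    ConsiderMatch( pos - prev, MatchLen(p, window + prev) ); // len >= 3, low offset
// and the offset=1 RLE check is just :
if ( pos > 0 && p[-1] == p[0] )
    ConsiderMatch( 1, MatchLen(p, p-1) );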

For LZA I ran into another problem :

The Suffix Trie exactly finds the length of the longest match in O(N). That's fine. The problem is when you go up to the parent nodes - the node depth is *not* the longest match length with the pointer there. It's just the *minimum* match length. The true match length might be anywhere up to *and including* the longest match length.

In LZH I was considering those matches with the node depth as the match length. And actually I re-tested it with the correct match length, and it makes very little difference.

Because LZA does LAM exclusion, it's crucial that you actually find what the longest ML is for that offset.

(note that the original LZMA exclude coder is actually just a *statistical* exclude coder; it is still capable of coding the excluded character, it just has very low probability. My modified version that only codes 7 bits instead of 8 is not capable of coding the excluded character, so you must not allow this.)

One bit of ugliness is that extending the match to find its true length is not part of the neat O(N) time query.

In any case, I think this is all a bit of a dead-end for me. I'd rather move my LZA parser to be forward-only and get away from the "find a match at every position" requirement. That allows you to take big steps when you find long matches and makes the whole thing faster.


06-21-14 | The E46 M3

Fuck Yeah.

Oh my god. It's so fucking good.

When I'm working in my little garage office, I can feel her behind me, trying to seduce me. Whispering naughty thoughts to me. "Come on, let's just go for a little spin".

On the road, I love the way you can just pin the throttle on corner exit; the back end gently slides out, just a little wiggle. You actually just straighten the steering early, it's like half way through the corner you go boom throttle and straighten the lock and the car just glides out to finish the turn. Oh god it's like sex. You start the turn with your hands and then finish it with your foot, and it's amazing, it feels so right.

On the track there's a whole other feeling, once she's up to speed at the threshold of grip, on full tilt. My favorite thing so far is the chicane at T5 on the back side of pacific. She just dances through there in such a sweet way. You can just lift off the throttle to get a little engine braking and set the nose, then back on throttle to make the rear end just lightly step out and help ease you around the corner. The weight transfer and grip front to back just so nicely goes back and forth, it's fucking amazing. She feels so light on her feet, like a dancer, like a boxer, like a cat.

There are a few things I miss about the 911. The brakes certainly, the balance under braking and the control under trail-braking yes, the steering feel, oh god the steering feel was good and it fucking SUCKS in the M3, the head room in the 911 was awesome, the M3 has shit head room and it's really bad with a helmet, the visibility - all that wonderful glass and low door lines, the feeling of space in the cabin. Okay maybe more than a few things.

But oh my god the M3. I don't care that I have to sit slightly twisted (WTF); I don't care that there are various reliability problems. I don't care that it requires super expensive annual valve adjustments. I forgive it all. For that engine, so eager, so creamy, screaming all the way through the 8k rev range with not a single dip in torque, for the quick throttle response and lack of electronic fudging, for the chassis balance, for the way you can trim it with the right foot. Wow.


06-18-14 | Oodle Network Test Results

Well, it's been several months now that we've been testing out the beta of Oodle Network Compression on data from real game developers.

Most of the tests have been UDP, with a few TCP. We've done a few tests on data from the major engines (UE3/4, Source) that do delta property bit-packing. Some of the MMO type games with large packets were using zlib on packets.

This is a summary of all the major tests that I've run. This is not excluding any tests where we did badly. So far we have done very well on every single packet capture we've seen from game developers.


MMO game :
427.0 -> 182.7 average = 2.337:1 = 57.21% reduction
compare to zlib -5 : 427.0 -> 271.9 average


MMRTS game :
122.0 -> 75.6 average = 1.615:1 = 38.08% reduction


Unreal game :
220.9 -> 143.3 average = 1.542:1 = 35.15% reduction


Tiny packet game :
21.9 -> 15.6 average = 1.403:1 = 28.72% reduction


Large packet Source engine game :
798.2 -> 519.6 average = 1.536:1 = 34.90% reduction

Some of the tests surprised even me, particularly the tiny packet one. When I saw the average size was only 22 bytes I didn't think we had much room to work with, but we did!



06-16-14 | Rep0 Exclusion in LZMA-like coders

For reference on this topic : see the last post .

I believe there's a mistake in LZMA. I could be totally wrong about that because reading the 7z code is very hard; in any case I'm going to talk about Rep0 exclusion. I believe that LZMA does not do this the way it should, and perhaps this change should be integrated into a future version. In general I have found LZMA to be very good and most of its design decisions have been excellent. My intention is not to nitpick it, but to give back to a very nice algorithm which has been generously made public by its author, Igor Pavlov.

LZMA does "literal-after-match" exclusion. I talked a bit about this last time. Basically, after a match you know that the next literal cannot be the one that would have continued the match. If it was you would have just written a longer match. This relies on always writing the maximum length for matches.

To model "LAM" exclusion, LZMA uses a funny way of doing the binary arithmetic model for those symbols. I wrote a bit about that last time, and the LZMA way to do that is good.

LZMA uses LAM exclusion on the first literal after a match, and then does normal 8-bit literal coding if there are more literals.

That all seems fine, and I worked on the Oodle LZ-Arith variant for about month with a similar scheme, thinking it was right.

But there's a wrinkle. LZMA also does "rep0len1" coding.

For background, LZMA, like LZX before it, does "repeat match" coding. A rep match means using one of the last N offsets (usually N = 3) and you flag that and send it in very few bits. I've talked about the surprising importance of repeat matches before (also here and other places ).

LZMA, like LZX, codes rep matches with MML of 2.

But LZMA also has "rep0len1". rep0len1 codes a single symbol at the 0th repeat match offset. That's the last offset matched from. That's the same offset that provides the LAM exclusion. In fact you can state the LAM exclusion as "rep0len1 cannot occur on the symbol after a match". (and in fact LZMA gets that right and doesn't try to code the rep0len1 bit after a match).

rep0len1 is not a win on text, but it's a decent little win on binary structured data (see example at bottom of this post ). It lets you get things like the top byte of a series of floats (off=4 len1 match, then 3 literals).

The thing is, if you're doing the rep0len1 coding, then normal literals also cannot be the rep0len1 symbol. If they were, then you would just code them with the rep0len1 flag.

So *every* literal should be coded with rep0 exclusion. Not just the literal after a match. And in fact the normal 8-bit literal coding path without exclusion is never used.

To be concrete, coding a literal in LZMA should look like this :


cur_lit = buffer[ pos ];

rep0_lit = buffer[ pos - rep0_offset ];

if ( after match )
{
    // LAM exclusion means cur_lit should never be = rep0_lit
    ASSERT( cur_lit != rep0_lit );
}
else
{
    if ( cur_lit == rep0_lit )
    {
        // lit is rep0, send rep0len1 :
        ... encode rep0len1 flag ...

        // do not drop through
        return;
    }
}

// either way, we now have exclusion :
ASSERT( cur_lit != rep0_lit );

encode_with_exclude( cur_lit, rep0_lit );

and that provides a pretty solid win. Of all the things I did to try to beat LZMA, this was the only clear winner.


ADDENDUM : some notes on this.

Note that the LZMA binary-exclude coder is *not* just doing exclusion. It's also using the exclude symbol as modelling context. Pure exclusion would just take the probability of the excluded symbol and distribute it to the other symbols, in proportion to their probability.

It turns out that the rep0 literal is an excellent predictor, even beyond just exclusion.

That is, say you're doing normal 8-bit literal coding with no exclusion. You are allowed to use an 8-bit literal as context. You can either use the order-1 literal (that's buffer[pos-1]) or the rep0 literal (that's buffer[pos-rep0_offset]).

It's better to use the rep0 literal!

Of course the rep0 literal becomes a weaker predictor as you get away from the end of the match. It's very good on the literal right after a match (lam exclusion), and still very good on the next literal, and then steadily weaker.

It turns out the transition point is 4-6 literals away from the end of the match; that's the point at which the o1 symbol becomes more correlated to the current symbol than the rep0 lit.

One of the ideas that I had for Oodle LZA was to remove the rep0len1 flag completely and instead get the same effect from context modeling. You can instead take the rep0 lit and use it as an 8-bit context for literal coding, and should get the same benefit. (the coding of the match flag is implicit in the probability model).

I believe the reason I couldn't find a win there is because it turns out that LZ literal coding needs to adapt very fast. You want very few context bits, you want super fast adaptation of the top bits. Part of the issue is that you don't see LZ literals very often; there are big gaps where you had matches, so you aren't getting as much data to adapt to the file. But I can't entirely explain it.

You can intuitively understand why the rep0 literal is such a strong predictor even when it doesn't match. You've taken a string from earlier in the file, and blacked out one symbol. You're told what the symbol was before, and you're told that in the new string it is not that symbol. It's something like :


"Ther" matched before
'r' is to be substituted
"The?"
What is ? , given that it was 'r' but isn't 'r' here.

Given only the o1 symbol ('e') and the substituted symbol ('r'), you can make a very good guess of what should be there ('n' probably, maybe 'm', or 's', ...). Obviously more context would help, but with limited information, the substituted symbol (rep0 lit) sort of gives you a hint about the whole area.

An even simpler case : given just the fact that the rep0 lit is upper or lower case, you're likely to substitute it with a character of the same case. Similarly if it's a vowel or consonant, you're likely to substitute with one of the same. etc. and of course I'm just using English text because it's intuitive; it works just as well on binary structured data.


There's another very small flaw in LZMA's exclude coder, which is more of a minor detail, but I'll include it here.

The LZMA binary exclude coder is equivalent to this clearer version that I posted last time :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 256 );
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}

This codes up to 8 bits while the bits of "val" match the bits of "exclude", and up to 8 bits while the bits of "val" don't match.

Now, obviously the very first bit can never be coded in the unmatched phase. So we could eliminate that from the unmatched bins. But that only saves us one slot.

(and actually I'm already wasting a slot intentionally; a "ctx" with a place-holder top bit like this is always >= 1, so you should be indexing at "ctx-1" if you want a tightly packed array. I intentionally don't do that, so that I have 256 bins instead of 255, because it makes the addressing work with "<<8" instead of "*255")

More importantly, in the "matched" phase, you don't actually need to code all 8 bits. If you code 7 bits, then you know that "val" and "exclude" match in all top 7 bits, so it must be that val == exclude^1. That is, it's just one bit flip away; the decoder will also know that so you can just not send it.

The fixed encoder is :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        ASSERT( ctx < 128 );
        m_bins[256 + ctx + (exclude_bit<<7)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( bit != exclude_bit )
            break;
        if ( ctx >= 128 )
        {
            // I've coded 7 bits
            // and they all match
            // no need to code the last one
            return;
        }
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}

Note that now ctx in the matched phase can only reach 128. That means this coder actually only needs 2*256 bins, not 3*256 bins as stated last time.

This is a little speed savings (tiny because we only get one less arith coding event on a rare path), a little compression savings (tiny because that bottom bit models very well), and a little memory use savings.


06-12-14 | Some LZMA Notes

I've been working on an LZ-Arith for Oodle, and of course the benchmark to beat is LZMA, so I've had a look at a few things.

Some previous posts related to things I'll discuss today :

cbloom rants 09-27-08 - 2 On LZ and ACB
cbloom rants 10-01-08 - 2 First Look at LZMA
cbloom rants 10-05-08 - 5 Rant on New Arithmetic Coders
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions

Some non-trivial things I have noticed :

1. The standard way of coding literals with a binary arithmetic coder has a subtle quirk to it.

LZMA uses the now standard fractional update method for binary probability modeling. That's p0 -= (p0 >> updshift) and so on. See for example : 10-05-08 - 5 : Rant on New Arithmetic Coders .
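
Spelled out, the update is (sketch; PROB_BITS is whatever probability precision the coder uses) :

// fractional-shift binary model update
// p0 is P(bit==0) stored in [0, 1<<PROB_BITS)
if ( bit )
    p0 -= p0 >> updshift;                       // P0 shrinks
else
    p0 += ( (1<<PROB_BITS) - p0 ) >> updshift;  // P0 grows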

The fractional update method is an approximation of a standard {num0,num1} binary model in which you are kept right at the renormalization threshold. That is, a counting model does :

P0 = num0 / (num0+num1);
... do coding ...
if ( bit ) num1++;
else num0++;
if ( (num0+num1) > renorm_threshold )
{
  // scale down somehow; traditionally num0 >>= 1; num1 >>= 1;
}

The fractional shift method is equivalent to :

num0 = P0;
num1 = (1<<frac_tot) - P0;
if ( bit ) num1++;
else num0++;

// num0+num1 is now ((1<<frac_tot)+1); rescale :
P0 = num0 * (1<<frac_tot)/((1<<frac_tot)+1);

That is, it assumes you're right at renormalization threshold and keeps you there.

The important thing about this is adaptation speed.

A traditional {num0,num1} model adapts very quickly at first. Each observed bit causes a big change to P0 because total is small. As total gets larger, it becomes more stable, it has more inertia and adapts more slowly. The renorm_threshold sets a minimum adaptation speed; that is, it prevents the model from becoming too full of old data and too slow to respond to new data.

Okay, that's all background. Now let's look at coding literals.

The standard way to code an N bit literal using a binary arithmetic coder is to code each bit one by one, either top down or bottom up, and use the previous coded bits as context, so that each subtree of the binary tree gets its own probability models. Something like :


ctx = 1;
while( ctx < 256 ) // 8 codings
{
    int bit = (val >> 7)&1; // get top bit
    val <<= 1; // slide val for next coding
    BinaryCode( bit, p0[ctx-1] );
    // put bit in ctx for next event
    ctx = (ctx<<1) | bit;
}

Okay.

Now first of all there is a common misconception that binary coding is somehow different than N-ary arithmetic coding, or that it will work better on "binary data" that is somehow organized "bitwise" vs text-like data. That is not strictly true.

If we use a pure counting model for our N-ary code and our binary code, and we have not reached the renormalization threshold, then they are in fact *identical*. Exactly identical.

For example, say we're coding two-bit literals :


The initial counts are :

0: 3
1: 1
2: 5
3: 4
total = 13

we code a 2 with probability 5/13 in log2(13/5) bits = 1.37851
and its count becomes 6

With binary modeling the counts are :

no ctx
0: 4
1: 9

ctx=0
0: 3
1: 1

ctx=1
0: 5
1: 4

to code a "2"
we first code a 1 bit with no context
with probability 9/13 in log2(13/9) bits = 0.53051
and the counts become {4,10}

then we code a 0 bit with a 1 context
with probability 5/9 in log2(9/5) bits = 0.84800
and the counts become {6,4}

And of course 1.37851 = 0.53051 + 0.84800

The coding is exactly the same. (and furthermore, binary coding top down or bottom up is also exactly the same).

However, there is a difference, and this is where the quirk of LZMA comes in. Once you start hitting the renormalization threshold, so that the adaptation speed is clamped, they do behave differently.

In a binary model, you will see many more events at the top bit. The exact number depends on how spread your statistics are. If all 256 symbols are equally likely, then the top bit is coded 128X more often than the bottom bits (and each of the next bits is coded 64X, etc.). If only one symbol actually occurs then all the bit levels will be coded the same number of times. In practice it's somewhere in between.

If you were trying to match the normal N-ary counting model, then the binary model should have much *slower* adaptation for the top bit than it does for the bottom bit. With a "fractional shift" binary arithmetic coder that would mean using a different "update shift".

But LZMA, like most code I've seen that implements this kind of binary coding of literals, does not use different adaptation rates for each bit level. Instead they just blindly use the same binary coder for each bit level.

This is wrong, but it turns out to be right. I tested a bunch of variations and found that the LZMA way is best on my test set. It seems that having much faster adaptation of the top bits is a good thing.

Note that this is a consequence of using unequal contexts for the different bit levels. The top bit has 0 bits of context, while the bottom bit has 7 bits of context, which means its statistics are diluted 128X (or less). If you do an order-1 literal coder this way, the top bit has 8 bits of context while the bottom bit gets 15 bits.
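
eg. an order-1 literal coder done this way is something like (a sketch, following the loop above; o1 is the previous byte) :

// the p0 table is indexed by { o1 byte , tree position } ; the top
// bit's bin is selected by only the 8 bits of o1, while the bottom
// bit's bin sees o1 plus the 7 already-coded bits = 15 context bits
int ctx = 1;
while( ctx < 256 ) // 8 codings
{
    int bit = (val >> 7)&1;
    val <<= 1;
    BinaryCode( bit, p0[ (o1<<8) + ctx - 1 ] ); // 256 bins per o1 value
    ctx = (ctx<<1) | bit;
}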

2. The LZMA literal-after-match coding is just an exclude

I wrote before (here : cbloom rants 08-20-10 - Deobfuscating LZMA ) about the "funny xor thing" in the literal-after-match coder. Turns out I was wrong, it's not really funny at all.

In LZ coding, there's a very powerful exclusion that can be applied. If you always output matches of the maximum length (more on this later), then you know that the next symbol cannot be the one that followed in the match. Eg. if you just copied a match from "what" but only took 3 symbols, then you know the next symbol cannot be "t", since you would have just done a length-4 match in that case.

This is a particularly good exclusion because the symbol that followed in the match is what you would predict to be the most probable symbol at that spot!

That is, say you need to predict the MPS (most probable symbol) at any spot in the file. Well, what you do is look at the preceding context of symbols and find the longest previous match of the context, and take the symbol that follows that context. This is "PPM*" essentially.

So when you code a literal after a match in LZ, you really want to do exclusion of the last-match predicted symbol. In a normal N-ary arithmetic coder, you would simply set the count of that symbol to 0. But it's not so simple with the binary arithmetic coder.

With a binary arithmetic coder, let's say you have the same top 7 bits as the exclude symbol. Well then, you know exactly what your bottom bit must be without doing any coding at all - it must be the bit that doesn't match the exclude symbol. At the next bit level above that, you can't strictly exclude, but you can probabilistically exclude. That is :


Working backwards from the bottom :

At bit level 0 :

if symbol top 7 bits == exclude top 7 bits
then full exclude

that is, probability of current bit == exclude bit is zero

At bit level 1 :

if symbol top 6 bits == exclude top 6 bits
then

if symbol current bit matches exclude current bit, I will get full exclusion in the next level
so chance of that path is reduced but not zero

the other binary path is unaffected

that is, we're currently coding to decide between 4 symbols.  Something like :

0 : {A,B}
1 : {C,D}

we should have P0 = (PA+PB)/(PA+PB+PC+PD)

but we exclude one; let's say B, so instead we want to code with P0 = PA/(PA+PC+PD)

etc..

That is, the exclude is strongest at the bottom bit level, and becomes less strong as you go back up to higher bit levels, because there are more symbols on each branch than just the exclude symbol.

The LZMA implementation of this is :


  static void LitEnc_EncodeMatched(CRangeEnc *p, CLzmaProb *probs, UInt32 symbol, UInt32 matchByte)
  {
    UInt32 offs = 0x100;
    symbol |= 0x100;
    do
    {
      matchByte <<= 1;
      RangeEnc_EncodeBit(p, probs + (offs + (matchByte & offs) + (symbol >> 8)), (symbol >> 7) & 1);
      symbol <<= 1;
      offs &= ~(matchByte ^ symbol);
    }
    while (symbol < 0x10000);
  }

I rewrote it to understand it; maybe this is clearer :

void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    // same thing but maybe clearer :
    bool matched = true;        
    val |= 0x100; // place holder top bit

    for(int i=0;i<8;i++) // 8 bit literal
    {
        int exclude_bit = (exclude >> (7-i)) & 1;
        int bit = (val >> (7-i)) & 1;

        int context = val >> (8-i);
        if ( matched )
            context += exclude_bit?512:256;

        m_probs[context].encode(arith,bit);

        if ( bit != exclude_bit )
            matched = false;
    }
}

We're tracking a running flag ("matched" or "offs") which tells us if we are on the same path of the binary tree as the exclude symbol. That is, do all prior bits match. If so, that steps us into another group of contexts, and we add the current bit from the exclude symbol to our context.

Now of course "matched" always starts true, and only turns to false once, and then stays false. So we can instead implement this as two loops with a break :


void BinaryArithCodeWithExclude( ArithEncoder * arith, int val, int exclude )
{
    int ctx = 1; // place holder top bit

    // first loop in the "matched" part of the tree :
    for(;;)
    {
        int exclude_bit = (exclude >> 7) & 1; exclude <<= 1;
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[256 + ctx + (exclude_bit<<8)].encode(arith,bit);
        ctx = (ctx<<1) | bit;
        if ( ctx >= 256 )
            return;
        if ( bit != exclude_bit )
            break;
    }
    
    // then finish bits that are unmatched :
    // non-matched
    do
    {
        int bit = (val >> 7) & 1; val <<= 1;
        m_bins[ctx].encode(arith,bit);
        ctx = (ctx<<1) | bit;
    }
    while( ctx < 256 );
}

It's actually not weird at all, it's just the way to do symbol exclusion with a binary coder.

ADDENDUM : maybe I'm going too far saying it's not weird. It is a bit weird, sort of like point 1, it's actually not right, but in a good way.

The thing that's weird is that when coding the top bits, it's only using the bits seen so far of the exclude symbol. If you wanted to do a correct probability exclusion, you need *all* the bits of the exclude symbol, so that you can see exactly what symbol it is, how much probability it contributes to that side of the binary tree.

The LZMA way appears to work significantly better than doing the full exclude.

That is, it's discarding some bits of the exclude as context, and that seems to help due to some issue with sparsity and adaptation rates. The LZMA way uses 3*256 binary probabilities, while full exclusion uses 9*256. (though in both cases, not all probs are actually used; eg. the first bit is always coded from the "matched" probs, not the "un-matched").

ADDENDUM2 : Let me say it again perhaps clearer.

The way to code a full exclude using binary modeling is :


coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use 8 bits of exclude as context
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

The LZMA way is :

coding "val" with exclusion of "exclude"

while bits of val coded so far match bits of exclude coded so far :
{
  N bits coded so far
  use N+1 bits of exclude as context   // <- only difference is here
  code current bit of val
  if current bit of val != same bit of exclude
    break;
}

while there are bits left to code in val
{
  N bits coded so far
  use N bits of val as context
  code current bit of val
}

I also tried intermediate schemes like using N+2 bits of exclude (past bits+current bit+one lower bit) which should help a little to identify the exclusion probability without diluting statistics too much - they all hurt.

3. Optimal parsing and exclusions are either/or and equal

There are two major options for coding LZ-arith :

I. Do literal-after-match exclusion and always output the longest match. Use a very simplified optimal parser that only considers literal vs match (and a few other things). Essentially just a fancier lazy parse (sometimes called a "flexible parse").

II. Do not do literal-after-match exclusion, and consider many match lengths in an optimal parser.

It turns out that these give almost identical compression.

Case II has the simpler code stream because it doesn't require the literal-after-match special coder, but it's much much slower to encode at high compression because the optimal parser has to work much harder.

I've seen this same principle many times and it always sort of intrigues me. Either you can make a code format that explicitly avoids redundancy, or you can exploit that redundancy by writing an encoder that aggressively searches the coding space.

In this case the coding of exclude-after-match is quite simple, so it's definitely preferable to do that and not have to do the expensive optimal parse.

4. LZMA is very Pareto

I can't really find any modification to it that's a clear win. Obviously you can replace the statistical coders with either something faster (ANS) or something that gives more compression (CM) and you can move the space/speed tradeoff, but not in a clearly beneficial way.

That is, on the compression_ratio / speed / memory_use three-way tradeoff, if you hold any two of those constant, there's no improvement to be had in the other.

.. except for one flaw, which we'll talk about in the next post.


03-30-14 | Decoding GIF

So I'm writing a little image viewer for myself because I got fed up with ACDSee sucking so bad. Anyway, I had almost every image format except GIF, so I've been adding that.

It's mostly straightforward except for a few odd quirks, so I'm writing them down.

Links :

spec of gif89a
A breakdown of a GIF decoder
The GIFLIB Library
Frame Delay Times for Animated GIFs by humpy77 on deviantART
gif_timing test
ImageMagick - Animation Basics -- IM v6 Examples
(Optional) Building zlib, libjpeg, libpng, libtiff and giflib — Leptonica & Visual Studio 2008
theimage.com gif Disposal Methods 2
theimage.com GIF Table of Contents

My notes :

A GIF is a "canvas" and the size of the canvas is in the file header as the "screen width/height". There are then multiple "images" in the file drawn into the canvas.

In theory multiple images could be used even for non-animated gifs. Each image can have its own palette, which lets you do true color gifs by assigning different palettes to different parts of the image. So you should not assume that a GIF decodes into an 8-bit palettized image. I have yet to see any GIF in the wild that does this. (and if you had one, most viewers would interpret it as an animated gif, since delay=0 is not respected literally)

(one hacky way around that which I have seen suggested elsewhere : a gif with multiple images but no GCE blocks should be treated as compositing to form a single still image, whereas GCE blocks even with delay of 0 must be interpreted as animation frames)

Note that animation frames which only update part of the image *are* common. Also the transparency in a GIF must be used when drawing frames onto the canvas - it does not indicate that the final pixel should be transparent. That is, an animation frame may mark some pixels as transparent, and that means don't update those pixels.

There is an (optional) global palette and an (optional) per-image palette. In the global header there is a "background color". That is an index to the global palette, if it exists. The background color will be visible in parts of the canvas where there is no image rectangle, and also where images are transparent all the way down to the canvas. However, the ImageMagick link above has this note :


        There is some thinking that rather than clearing the overlaid area to
        the transparent color, this disposal should clear it to the 'background'
        color meta-data setting stored in the GIF animation. In fact the old
        "Netscape" browser (version 2 and 3), did exactly that. But then it also
        failed to implement the 'Previous' dispose method correctly.

        On the other hand the initial canvas should also be set from the formats
        'background' color too, and that is also not done. However all modern
        web browsers clear just the area that was last overlaid to transparency,
        as such this is now accepted practice, and what IM now follows. 
        
which makes me think that many decoders (eg. web browsers) ignore background and just make those pixels transparent.

(ADD : I've seen quite a few cases where the "background" value is larger than the global palette. eg. global palette has 64 colors, and "background" is 80 or 152.)

In the optional GCE block, each image can have a transparent color set. This is a palette index which acts as a color-key transparency. Transparent pixels should show whatever was beneath them in the canvas. That is, they do not necessarily result in transparent pixels in the output if there was a solid pixel beneath them in the canvas from a previous image.

Each image has a "delay" time and "dispose" setting in the optional GCE block (which occurs before the image data). These apply *after* that frame.

Delay is the frame time; it can vary per frame, there is no global constant fps. Delay is in centiseconds, and the support for delay < 10 is nasty. In practice you must interpret a delay of 0 or 1 centiseconds to mean "not set" rather than to mean they actually wanted a delay of 0 or 1. (people who take delay too literally are why some gifs play way too fast in some viewers).

Dispose is an action to take on the image after it has been displayed and held for delay time. Note that it applies to the *image* not the canvas (the image may be just a rectangular portion of the canvas). It essentially tells how that image's data will be committed to the canvas for future frames. Unfortunately the dispose value of 0 for "not specified" is extremely common. It appears to be correct to treat that the same as a value of 1 (leave in place).

(ADD : I've seen several cases of a dispose value of "5". Disposal is supposed to be a 3 bit value, of which only the values 0-3 are defined (and fucking 0 means "undefined"). Values 4-7 are supposed to be reserved.)
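
In my decoder these quirks boil down to a couple of trivial sanitizers, something like the below. (The 10-centisecond default for unset delays is my own choice, not anything in the spec.)

static int gif_sanitize_delay_centisecs( int delay )
{
    // 0 or 1 means "not set" in practice; substitute a sane default (10 cs = 100 ms)
    if ( delay <= 1 ) return 10;
    return delay;
}

static int gif_sanitize_dispose( int dispose )
{
    // 0 = "not specified" is extremely common; treat as 1 (leave in place)
    // values > 3 are reserved (but occur in the wild); also treat as 1
    if ( dispose == 0 || dispose > 3 ) return 1;
    return dispose;
}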

The ImageMagick and "theimage.com" links above are very handy for testing disposal and other funny animation details.

It's a shame that zero-delay is so widely mis-used and not supported, because it is the most interesting feature in GIF for encoder flexibility.


03-25-14 | deduper

So here's my little dedupe :

dedupe.zip (x64 exe)

This is a file level deduper (not a block or sector deduper) eg. it finds entire files that are identical.

dedupe.exe does not delete dupes or do anything to them. Instead it outputs a batch file which contains commands to do something to the dupes. The intention is that you then go view/edit that batch file and decide which files you actually want to delete or link or whatever.

It runs on Oodle, using multi-threaded dir enumeration and all that. (see also tabdir )

It finds possible dedupes first by checking file sizes. For any file size where there is more than one file of that size, it then hashes and collects possible dupes that have the same hash. Those are then verified to be actual byte-by-byte dupes.
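
For reference, the basic funnel looks something like this - a single-threaded C++ sketch with a stand-in FNV-1a hash; the real thing runs on Oodle with threaded enumeration and does a final byte-by-byte verify (elided here) :

#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <map>
#include <vector>

static uint64_t hash_file( const std::filesystem::path & p )
{
    std::ifstream f( p, std::ios::binary );
    uint64_t h = 14695981039346656037ull; // FNV-1a 64-bit
    char buf[65536];
    while ( f.read( buf, sizeof(buf) ), f.gcount() > 0 )
        for ( std::streamsize i = 0; i < f.gcount(); i++ )
            h = ( h ^ (unsigned char)buf[i] ) * 1099511628211ull;
    return h;
}

int main( int argc, char ** argv )
{
    namespace fs = std::filesystem;

    // pass 1 : bucket by file size; only equal sizes can possibly be dupes
    std::map< uintmax_t, std::vector<fs::path> > by_size;
    for ( auto & e : fs::recursive_directory_iterator( argv[1] ) ) // argv[1] = root dir
        if ( e.is_regular_file() )
            by_size[ e.file_size() ].push_back( e.path() );

    // pass 2 : within each size bucket, bucket by hash
    for ( auto & [ size, paths ] : by_size )
    {
        if ( paths.size() < 2 ) continue;
        std::map< uint64_t, std::vector<fs::path> > by_hash;
        for ( auto & p : paths )
            by_hash[ hash_file( p ) ].push_back( p );

        // pass 3 : a real deduper verifies each group byte-by-byte here
        for ( auto & [ h, dupes ] : by_hash )
            for ( size_t i = 1; i < dupes.size(); i++ )
                printf( "REM \"%s\" == \"%s\"\n",
                    dupes[i].string().c_str(), dupes[0].string().c_str() );
    }
    return 0;
}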

Obviously deleting a bunch of dupes is dangerous, be careful, it's not my fault, etc.

Possible todos to make it faster :

1. Use an (optional) cache of hashes and modtimes to accelerate re-runs. One option is to store the hash of each file in an NTFS alternate data stream on that file (which allows the file to be moved or renamed and retain its hash); probably better just to have an explicit cache file and not muck about with rarely-used OS features.

2. If there are only 2 or 3 files of a given size, I should just jump directly to the byte-by-byte compare. Currently they are first opened and hashed, and then if the hashes match they are byte-by-byte compared, which means an extra scan of those files.


03-24-14 | GDC 2014 Aftermath

The Saturday after GDC, I took BART to Richmond to get on an Amtrak train.

(I got on the wrong train; lol; Amtrak trains have no signs or anything indicating which train they are. So I'm standing on the platform waiting, and at about the right time a train pulls up. Everyone on the platform gets on, and there's no announcement or anything so I just hop on. Nobody takes tickets at the door. Eventually a train guy walks through to check tickets and tells me I'm on the wrong train. Oops. It was my first time on Amtrak I think; it was pretty dang nice actually; if they had car-carrying trains I would use that to avoid long freeway treks).

Got on the right train and took it to Sacramento and just got to a bank before 1 PM. Bought an E46 M3. Immediately pointed it north on the 5 and drove all the way to Seattle.

(took one tiny detour to drive some mountain roads near Shasta; just had to open her up a tiny bit. Yum yum fucking yum. What a car.)

Woot. So exhausted from the combined GDC + drive, but happy to be home with the wonderful family, spring flowers blooming, and my dream car in the garage.

... and back to work. Sigh.


03-20-14 | GDC 2014 Intermezzo

The logical conclusion :

Congratulations on your new baby girl! Your delivery medical costs are free, but your baby has been implanted with advertisement-delivering contacts.

Congratulations on your new baby girl! Your delivery medical costs are free, but we will take 1% of all your child's pay forever.

Of course it is your choice. No one is forcing you. It's a free market system. You can opt out if you pay $10 million immediately.

Lots of people are talking about VR, but I have yet to hear anybody talk about what will actually sell VR, which is porn. I'm also interested in VR skype. The easiest way to do remote VR would be to have a robotic camera at the remote site that mimics your head movements. eg. I could have a robotic camera with my baby, and when I move around my room, the camera over there moves the same way, so it is as if I am there. (I miss my baby!) But there are practical problems with that (oops the robotic arm camera killed my baby again) so better solutions are needed. I'm sure you could record "VR video" in a room with all the walls covered in cameras (perhaps plenoptic cameras, or z-cameras, but mainly just lots of them).

We came up with this : "Oodle puts your data on a diet. Effective from the first byte!" (good to the last dropped packet, etc.)

Official cbloom rants disclaimer :

Despite the constant air of incontrovertibility, some of my rants are well thought out, tested, and fully-cooked. Others are not-quite-cooked musings. Do not base a major game production pipeline on my half-baked ranting! (!! (there are not enough exclamation points in the universe) !!)

In related news, coming to GDC 2015 panel session : How cbloom ruined my engine and delayed my game. Speakers who wish to contribute should contact yo momma.

Oh, also -

Can we just fucking abolish shaking hands as a culture? Not just at GDC but in general. My main concern at GDC is the fact that I get deathly ill every year because some fucker with a cold shakes my hand instead of saying "sorry I'm sick I won't shake your hand" (you asshole, wtf are you thinking, I can see you're obviously deathly ill and you're still holding out your hand to me?).

But even aside from that, hand shaking just sucks. It far too often goes bad.

No I don't need to feel your gross pudgy round blob of a hand. No I don't want to feel your clammy sweaty palm. If you feel the need to wipe your hand on your pants before offering it, then maybe just don't offer it. No I don't want to feel your sticky food encrusted hand. No I'm not impressed by how strong you are (stop crushing my hand you fucking small-dick-having macho low-self-esteem puffed up loser), and also no I don't enjoy your limp fingers-only handshake either.

There is a happy middle ground for a correct hand shake, and then we can pat ourselves on the back and feel good that we didn't epically fail at basic motor control and hygiene. Yay! But that rare reward is in no way worth all the bad times.

Fist bump?

(and no, the fists do *not* explode when they touch!! no no no!)


03-19-14 | GDC 2014

I'm at GDC on Thursday and Friday.

I'll be giving a talk on using luxury branding techniques for upselling in microtransactions, and player retention across advertisements in free-to-play . See the schedule link for details.

Feel free to do a drive-by of the RAD booth and yell "Oodle sucks" at me.

On an actual serious note though, we're showing Oodle's new compressors for network packets . They use a trained model to compress TCP or UDP packets, and give much more compression than previous solutions.


03-15-14 | Bit IO of Elias Gamma

Making a record of something not new :

Elias Gamma codes are made by writing the position of the top bit using unary, and then the lower bits raw. eg to send 30, 30 = 11110 , the top bit is in position 4 so you send that as "00001" and then the bottom 4 bits "1110". The first few values (starting at 1) are :


1 =   1 : 1
2 =  10 : 01 + 0
3 =  11 : 01 + 1
4 = 100 : 001 + 00
5 = 101 : 001 + 01
...

The naive way to send this code is :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutUnary( BITOUT, topbit );

    val -= (1<<topbit); // or &= (1<<topbit)-1;

    PutBits( BITOUT, val, topbit );
} 

But it can be done more succinctly.

We should notice two things. First, PutUnary is very simple :


void PutUnary( BITOUT, unsigned val )
{
    PutBits( BITOUT , 1, val+1 );
}

That is, it's just putting the value 1 in a variable number of bits. This gives you 'val' leading zero bits and then a 1 bit, which is the unary encoding.

The second is that the 1 from the unary is just the same as the 1 we remove from the top position of 'val'. That is, we can think of the bits thusly :


5 = 101 : 001 + 01

unary of two + remaining 01
or

5 = 101 : 00 + 101

two zeros + the value 5

The result is a much simplified elias gamma coder :

void PutEliasGamma( BITOUT, unsigned val )
{
    ASSERT( val >= 1 );

    int topbit = bsr(val);

    ASSERT( val >= (1<<topbit) && val < (2<<topbit) );

    PutBits( BITOUT, val, 2*topbit+1 );
} 

note that if your bit IO is backwards then this all works slightly differently. (I'm assuming you can combine two PutBits into one, with the first PutBits in the top of the second; that is,
PutBits(a,na)+PutBits(b,nb) == PutBits((a<<nb)|b,na+nb) .)

Perhaps more importantly, we can do a similar transformation on the reading side.

The naive reader is :


int GetEliasGamma( BITIN )
{
    int bits = GetUnary( BITIN );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

(assuming your GetBits can handle getting zero bits; note the result here is always >= 1). The naive unary reader is :

int GetUnary( BITIN )
{
    int ret = 0;
    while( GetOneBit( BITIN ) == 0 )
    {
        ret++;
    }
    return ret;
}

but if your bits are top-justified in your bit input word (as in ans_fast for example, or see the end of this post for a reference implementation), then you can use count_leading_zeros to read unary :

int GetUnary( BITIN )
{
    int clz = count_leading_zeros( BITIN );

    ASSERT( clz < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, clz+1 );
    ASSERT( one == 1 );

    return clz;
}

here the GetBits is just consuming the zeros and the single on bit of the unary code. Just like in the Put case, the key thing is that the trailing 1 bit of the unary is the same as the top bit value ( "(1<<bits)" ) that we added in the naive reader. That is :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int one = GetBits( BITIN, bits+1 );
    ASSERT( one == 1 );

    int ret = (1<<bits) + GetBits( BITIN, bits );

    return ret;
}

can be simplified to combine the GetBits :

int GetEliasGamma( BITIN )
{
    int bits = count_leading_zeros( BITIN );

    ASSERT( bits < NumBitsAvailable(BITIN) );

    int ret = GetBits( BITIN, 2*bits + 1 );

    ASSERT( ret >= (1<<bits) && ret < (2<<bits) );

    return ret;
}

again assuming that your GetBits combines like big-endian style.

You can do the same for "Exp Golomb" of course, which is just Elias Gamma + some raw bits. (Exp Golomb is the special case of Golomb codes with a power of two divisor).

Summary :
//===============================================================================

// Elias Gamma works on vals >= 1
// these assume that the value fits in your bit word
// and your bit reader is big-endian and top-justified

#define BITOUT_PUT_ELIASGAMMA(bout_bits,bout_numbits,val) do { \
    ASSERT( val >= 1 ); \
    uint32 topbit = bsr64(val); \
    BITOUT_PUT(bout_bits,bout_numbits, val, 2*topbit + 1 ); \
    } while(0)

#define BITIN_GET_ELIASGAMMA(bitin_bits,bitin_numbits,val) do { \
    uint32 nlz = clz64(bitin_bits); \
    uint32 nbits = 2*nlz+1; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    } while(0)

//===============================================================================
// MSVC implementations of bsr and clz :

static inline uint32 bsr64( uint64 val )
{
    ASSERT( val != 0 );
    unsigned long b = 0;
    _BitScanReverse64( &b, val );
    return b;
}

static inline uint32 clz64(uint64 val)
{
    return 63 - bsr64(val);
}

//===============================================================================
// and for completeness, reference bitio that works with those functions :
//  (big endian ; bit input word top-justified)

#define BITOUT_VARS(bout_bits,bout_numbits,bout_ptr) \
    uint64 bout_bits; \
    int64 bout_numbits; \
    uint8 * bout_ptr;

#define BITOUT_START(bout_bits,bout_numbits,bout_ptr,buf) do { \
        bout_bits = 0; \
        bout_numbits = 0; \
        bout_ptr = (uint8 *)buf; \
    } while(0)

#define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        ASSERT( (val) < (1ULL<<(nb)) ); \
        bout_numbits += nb; \
        bout_bits |= ((uint64)(val)) << (64 - bout_numbits); \
    } while(0)
    
#define BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr) do { \
        *((uint64 *)bout_ptr) = _byteswap_uint64( bout_bits ); \
        bout_bits <<= (bout_numbits&~7); \
        bout_ptr += (bout_numbits>>3); \
        bout_numbits &= 7; \
    } while(0)
    
#define BITOUT_END(bout_bits,bout_numbits,bout_ptr) do { \
        BITOUT_FLUSH(bout_bits,bout_numbits,bout_ptr); \
        if ( bout_numbits > 0 ) bout_ptr++; \
    } while(0)

//===============================================================

#define BITIN_VARS(bitin_bits,bitin_numbits,bitin_ptr) \
    uint64 bitin_bits; \
    int64 bitin_numbits; \
    uint8 * bitin_ptr;

#define BITIN_START(bitin_bits,bitin_numbits,bitin_ptr,begin_ptr) do { \
        bitin_ptr = (uint8 *)begin_ptr; \
        bitin_bits = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += 8; \
        bitin_numbits = 64; \
    } while(0)

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { if ( bitin_numbits <= 56 ) { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        bitin_ptr += bytesToGet; \
        bitin_bits |= next8 >> bitin_numbits; \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } } while(0)

#define BITIN_GET(bitin_bits,bitin_numbits,nb,ret) do { \
        ASSERT( nb <= bitin_numbits ); \
        ret = (bitin_bits >> 1) >> (63 - nb); \
        bitin_bits <<= nb; \
        bitin_numbits -= nb; \
    } while(0)

//=========================================================

and yeah yeah I know this bitio could be faster. It's a reference implementation that's trying to avoid obfuscations. GTFO.

Added exp-golomb. The naive put is :


PutEliasGamma( (val >> r) + 1 ); // +1 because Elias Gamma needs vals >= 1
PutBits( val & ((1<<r)-1) , r );

but you do various reductions and get to :
//===============================================================================

// this Exp Golomb is for val >= 0
// Exp Golomb is Elias Gamma + 'r' raw bits

#define BITOUT_PUT_EXPGOLOMB(bout_bits,bout_numbits,r,val) do { \
    ASSERT( val >= 0 ); \
    uint64 up = (val) + (1ULL<<(r)); \
    uint32 topbit = bsr64(up); \
    ASSERT( topbit >= (uint32)(r) ); \
    BITOUT_PUT(bout_bits,bout_numbits, up, 2*topbit + 1 - r ); \
    } while(0)
    
#define BITIN_GET_EXPGOLOMB(bitin_bits,bitin_numbits,r,val) do { \
    uint32 nbits = 2*clz64(bitin_bits)+1+r; \
    BITIN_GET(bitin_bits,bitin_numbits,nbits,val); \
    ASSERT( val >= (1ULL<<r) ); \
    val -= (1ULL<<r); \
    } while(0)

//=========================================================
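
And a tiny smoke test of the macros (MSVC as above; flushing after every put just keeps the sketch simple) :

static void test_eliasgamma( void )
{
    uint8 buf[256] = { 0 };

    BITOUT_VARS(bb,bn,bp);
    BITOUT_START(bb,bn,bp,buf);
    for ( uint64 v = 1; v <= 20; v++ )
    {
        BITOUT_PUT_ELIASGAMMA(bb,bn,v);
        BITOUT_FLUSH(bb,bn,bp);
    }
    BITOUT_END(bb,bn,bp);

    BITIN_VARS(ib,in,ip);
    BITIN_START(ib,in,ip,buf);
    for ( uint64 v = 1; v <= 20; v++ )
    {
        uint64 got;
        BITIN_REFILL(ib,in,ip);
        BITIN_GET_ELIASGAMMA(ib,in,got);
        ASSERT( got == v );
    }
}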


03-14-14 | Fold Up Negatives

Making a record of something not new :

Say you want to take the integers {-inf,inf} and map them to just the non-negatives {0,1,..inf}. (and/or vice-versa)

(this is common for example when you want to send a signed value using a variable length code, like unary or golomb or whatever; yes yes there are other ways, for now assume you want to do this).

We need to generate a number line with the negatives "folded up" and interleaved with the positives, like


0,-1,1,-2,2,...

The naive way is :

// fold_up makes positives even
//   and negatives odd

unsigned fold_up_negatives(int i)
{
    if ( i >= 0 )
        return i+i;
    else
        return (unsigned)(-i-i-1); 
}

int unfold_negatives(unsigned i)
{
    if ( i&1 ) 
        return - (int)((i+1)>>1);
    else
        return (i>>1);
}

Now we want to do it branchless.

Let's start with folding up. What we want to achieve mathematically is :


fold_up_i = 2*abs(i) - 1 if i is negative

To do this we will use some tricks on 2's complement integers.

The first is getting the sign. Assuming 32-bit integers for now, we can use :


minus_one_if_i_is_negative = (i >> 31);

= 0 if i >= 0
= -1 if i < 0

which works by taking the sign bit and replicating it down. (make sure to use signed right shift, and yes this is probably undefined blah blah gtfo etc).

The other trick is to use the way a negative is made in 2's complement.


(x > 0)

-x = (x^-1) + 1

or

-x = (x-1)^(-1)

and of course x^-1 is the same as (~x), that is flip all the bits. This also gives us :

x^-1 = -x -1

And it leads obviously to a branchless abs :


minus_one_if_i_is_negative = (i >> 31);
abs_of_i = (i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

since if i is negative this is

-x = (x^-1) + 1

and if i is positive it's

x = (x^0) + 0

So we can plug this in :

fold_up_i = 2*abs(i) - 1 if i is negative

fold_up_i = abs(2i) - 1 if i is negative

minus_one_if_i_is_negative = (i >> 31);
abs(2i) = (2i ^ minus_one_if_i_is_negative) - minus_one_if_i_is_negative;

fold_up_i = abs(2i) + minus_one_if_i_is_negative

fold_up_i = (2i) ^ minus_one_if_i_is_negative

or in code :

unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

For unfold we use the same tricks. I'll work it backwards from the answer for variety and brevity. The answer is :

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}

and let's prove it's right :

if i is even

half_i = i>>1;
sign_i = 0;

return half_i ^ 0 = i/2;
// 0,2,4,... -> 0,1,2,...

if i is odd

half_i = i>>1; // i is odd, this truncates
sign_i = -1;

return half_i ^ -1 
 = -half_i -1
 = -(i>>1) -1
 = -((i+1)>>1)
// 1,3,5,... -> -1,-2,-3,...

which is what we wanted.

Small note : on x86 you might rather use cdq to get the replicated sign bit of an integer rather than >>31 ; there are probably similar instructions on other architectures. Is there a neat way to make C generate that? I dunno. Not sure it ever matters. In practice you should use an "int32" type or compiler_assert( sizeof(int) == 4 );

Summary :

unsigned fold_up_negatives(int i)
{
    unsigned two_i = ((unsigned)i) << 1;
    int sign_i = i >> 31;
    return two_i ^ sign_i;
}

int unfold_negatives(unsigned i)
{
    unsigned half_i = i >> 1;
    int sign_i = - (int)( i & 1 );
    return half_i ^ sign_i;
}
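
and a quick sanity check of the round trip and the interleaving order :

static void test_fold_up( void )
{
    // 0,-1,1,-2,2,... <-> 0,1,2,3,4,...
    ASSERT( fold_up_negatives( 0) == 0 );
    ASSERT( fold_up_negatives(-1) == 1 );
    ASSERT( fold_up_negatives( 1) == 2 );
    ASSERT( fold_up_negatives(-2) == 3 );

    for ( int i = -1000; i <= 1000; i++ )
        ASSERT( unfold_negatives( fold_up_negatives(i) ) == i );
}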


03-13-14 | Hilbert Curve Testing

So I stumbled on this blog post about counting the rationals which I found rather amusing.

The snippet that's relevant here is that if you iterate through the rationals naively by doing something like

1/1 2/1 3/1 4/1 ...
1/2 2/2 3/2 4/2 ...
1/3 2/3 ...
...
then you will never even reach 1/2 because the first line is infinitely long. But if you take a special kind of path, you can reach any particular rational in a finite number of steps. Much like the way a Hilbert curve lets you walk the 2d integer grid using only a 1d path.

It reminded me of something that I've recently changed in my testing practices.

When running code we are always stepping along a 1d path, which you can think of as discrete time. That is, you run test 1, then test 2, etc. You want to be able to walk over some large multi-dimensional space of tests, but you can only do so by stepping along a 1d path through that space.

I've had a lot of problems testing Oodle, because there are lots of options on the compressors, lots of APIs, lots of compression levels and different algorithms - it's impossible to test all combinations, and particularly impossible to test all combinations on lots of different files. So I keep getting bitten by some weird corner case.

(Total diversion - actually this is a good example of why I'm opposed to the "I tested it" style of coding in general. You know when you stumble on some code that looks like total garbage spaghetti, and you ask the author "wtf is going on here, do you even understand it?" and they say "well it works, I tested it". Umm, no, wrong answer. Maybe it passed the test, but who knows how it's going to be used down the road and fail in mysterious ways? Anyway, that's an old school cbloom coding practices rant that I don't bother with anymore ... and I haven't been great about following my own advice ...)

If you try to set up a big test by just iterating over each option :

for each file
  for each compressor
    for each compression_level
      for each chunking
        for each parallel branching
          for each dictionary size
            for each sliding window size
              ...

then you'll never even get to the second file.

The better approach is to get a broad sampling of options. An approach I like is to enumerate all the tests I want to run, using a loop like the above, put them all in a list, then randomly permute the list. Because it's just a permutation, I will still only run each test once, and will cover all the cases in the enumeration, but by permuting I get a broader sampling more quickly.

(you could also add the nice technique that we worked out here long ago - generate a consistent permutation using a pseudorandom generator with known seed, and save your place in the list with a file or the registry or something. That way when you stop and start the tests, you will resume where you left off, and eventually cover more of the space (also when a test fails you will automatically repeat the same test if you restart)).
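
A sketch of that scheme - std::mt19937 with a fixed seed gives the consistent permutation, and progress goes in a little text file (the names here are made up) :

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

void run_all_tests( int num_tests ) // num_tests = product of all the option counts
{
    std::vector<int> order( num_tests );
    std::iota( order.begin(), order.end(), 0 );

    std::mt19937 rng( 12345 ); // known seed -> same permutation every run
    std::shuffle( order.begin(), order.end(), rng );

    int start = 0; // resume where we left off
    if ( FILE * f = fopen( "test_progress.txt", "r" ) ) { fscanf( f, "%d", &start ); fclose( f ); }

    for ( int i = start; i < num_tests; i++ )
    {
        // run_test( order[i] ); // decode order[i] back into the option tuple
        if ( FILE * f = fopen( "test_progress.txt", "w" ) ) { fprintf( f, "%d\n", i+1 ); fclose( f ); }
    }
}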

The other trick that's useful in practice is to front-load the quicker tests. You do want to have a test where you run on 8 GB files to make sure that works, but if that's your first test you have to wait forever to get a failure. This is particularly for the case that something dumb is broken, it should show up as quickly as possible so you can just cancel the test and fix it. So you want an initial broad sampling of options on very quick tests.


03-03-14 | Windows Links

I wrote a deduper in Oodle today. I was considering making the default action be to replace duplicate files with a link to the original.

I wasn't sure whether to use "hard" or "soft" links, so I did a little research.

In Windows a "hard link" means that multiple file names all point at the same file data. It's a bit of a misnomer, it's not really a "link" to the original. There is no "original" or base file - all instances of a hard link are equal peers.

A "soft link" is just an OS-level shortcut. There is an original base file, and the soft links point at it.

Both are ridiculously broken concepts and IMO should almost never be used.

With "hard links" the problem is that if you accidentally edit any of the links, you have editted the master data. If you did not intend that, you may have fucked up something severely.

Hard links are reasonable *if* the files linked are read-only (and somehow actually kept read-only, not just attrib'ed away).

The problem with "soft links" is that the links are not protected; if you rename or move or delete the original file, all the links are broken, and again you may have just severely fucked something up.

The big problem is that you get no warning in either case. Clearly what you want when you try to rename a file which has a bunch of soft links pointing at it is some kind of dialog that says "hey all these links point at this file, do you really want to rename it and break the links? or should I update the links to point at the new name?". Similarly with hard links, obviously what you want is some prompt like "hey if you modify this, do you want these hard links to see the new version or the old version?".

Now obviously you can't solve this problem in general without user prompts. But I believe that a refcounted copy-on-write link would have been a much more robust and safe solution. Open for write should have done a COW by default unless you passed a special flag to indicate you intend to edit shared data.

Even ignoring the fundamental logic of how links work, there are some very bad practical issues for links in windows.

1. Soft links show a file size of 0 in the dir enumeration file info. This breaks the assumption that most apps make that the file size they get from the dir enumeration will be the same as the file size they get if they open that file handle and ask for its size. It can also screw up enumerations that are just trying to skip zero-size files.

Hard link file sizes are out of date. If the file data is modified, only the directory entry for the one that was used to modify the data is updated. All other links still have the old file sizes, mod times, etc.

2. Hard links break the assumption that saving to a file is the same as saving to a temp and then renaming onto the file. Any given app may or may not use the "write to temp then rename" pattern, and with hard links the two give massively different results in a very unexpected way.

3. Mod times are hosed. In general attributes are hosed; neither type of link reflects the attributes of the actual file data in the link - until they are opened, then they get updated. Mod times are particularly bad because many apps use them to detect changes, and with links the file data can be changed but the mod time won't reflect it.

Dear lord. So non-robust.


02-25-14 | ANS Applications

Some rambling on where I see ANS being useful.

In brief - anywhere you used Huffman in the past, you can use ANS instead.

ANS (or ABS) are not very useful for high end compressors. They are awkward for adaptive modeling. Even if all you care about is decode speed (so you don't mind buffering up the models to make the encode work backwards) it's just not a big win over arithmetic coding. Things like PAQ/CM , LZMA, H264, all the high compression cases that use adaptive context models, there's no real win from ANS/ABS.

Some specific cases where I see ANS being a good win :

JPEG-ANS obviously. Won't be competitive with sophisticated coders like "packjpg" but will at least fix the cliff at low bit rate caused by Huffman coding.

JPEGNEXT-ANS. I've been thinking for a long time about writing a "JPEGNEXT". Back end coefficient classification; coefficients in each group sent by value with ANS. Front end 32x32 macroblocks with various DCT sizes. Try to keep it as simple as possible but be up to modern quality levels.

LZ-ANS. An "LZX class" (which means large window, "repeat match", entropy resets) LZ with ANS back end should be solid. Won't be quite LZMA compression levels, but way way faster.

Lossless image DPCM. ANS on prediction residual values is a clear killer. Should have a few statistics groups with block classification. No need to transmit the ANS counts if you use a parametric model ala TMW etc. Should be easy, fast, and good.

blocksort-ANS. Should replace bzip. Fast to decode.

MPEG-class video coders. Fast wavelet coders. Anywhere you are using minimal context modelling (only a few contexts) and are sending values by their magnitude (not bit plane coders).

Other?


02-25-14 | WiFi

So our WiFi stopped working recently, and I discovered a few things which I will now write down.

First of all, WiFi is fucked. 2.4 GHz is way overcrowded and just keeps getting more crowded. Lots of fucking routers now are offering increased bandwidth by using multiple channels simultaneously, etc. etc. It's one big interference fest.

The first issue I found was baby monitors. Baby monitors, like many wireless devices, are also in the 2.4 GHz band and just crap all over your wifi. Turning them off helped our signal a bit, but we were still getting constant problems.

Next issue is interference from neighbors'ses wifises. This is what inSSIDer looks like at my house :

We are *way* away from any neighbors, at least 50 feet in every direction, and we still get this amount of shit from them. Each of my cock-ass-fuck neighbors seems to have four or five wifi networks. Good job guys, way to fix your signal strength issues by just piling more shit in the spectrum.

What you can't see from the static image is that lots of the fucking neighbor wifis are not locked to a specific channel; many of them are constantly jumping around trying to find a clear channel, which just cruds everything up even more.

(I'd love to get some kind of super industrial strength wifi for my house and just crank it up to infinity and put it on every channel so that nobody for a mile around gets any wifi)

I've long had our WiFi on channel 8 because it looked like the most clear spot to be. Well, it turns out that was a classic newb mistake. Apparently it's worse to be slightly offset from a busy channel than it is to be right on it. When you're offset, you get signal leakage from the other channel that just looks like noise; when you're right on the channel you're fighting with other people, but at least you are seeing their data as real data that you can ignore. Anyway, switching our network to channel 11 fixed it.

It looks like in practice channels 6 and 11 are the only usable ones in noisy environments (eg. everywhere).

The new 802.11ac on 5 GHz should be a nice clean way to go for a few years until it too gets crudded up.


02-18-14 | ans_fast implementation notes

Some notes about the ans_fast I posted earlier .

ans_fast contains a tANS (table-ANS) implementation and a rANS (range-ANS) implementation.

First, the benchmarking. You can compare to the more naive implementations I posted earlier . However, do not compare this tANS impl to Huffman or arithmetic and conclude "ANS is faster" because the tANS impl here has had rather more loving than those. Most of the tricks used on "ans_fast" can be equally used for other algorithms (though not all).

Here L=4096 to match the 12-bits used in the previous test. This is x64 on my lappy (1.7 GHz Core i7 with turbo disabled). Compressed sizes do not include sending the counts. Time "withtable" includes all table construction but not histogramming or count normalization (that affects encoder only). ("fse" and "huf" on the previous page included table transmission and histogramming time)


book1

tANS 768771 -> 435252.75

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

rANS 768771 -> 435980 bytes (v2)

ticks to encode: 6.97 decode: 5.06
mbps encode: 248.02 decode: 341.63

withtable ticks to encode: 6.97 decode: 5.07
withtable mbps encode: 247.92 decode: 341.27

pic

tANS 513216 -> 78856.88

ticks to encode: 4.53 decode: 3.47
mbps encode: 382.02 decode: 497.75

withtable ticks to encode: 4.62 decode: 3.56
withtable mbps encode: 374.45 decode: 485.40

rANS 513216 -> 79480 bytes (v2)

ticks to encode: 5.62 decode: 3.53
mbps encode: 307.78 decode: 490.32

withtable ticks to encode: 5.63 decode: 3.54
withtable mbps encode: 307.26 decode: 488.88

First a note on file sizes : rANS file sizes are a few bytes larger than the "rans 12" posted last time. That's because that was a 32-bit impl. The rANS here is 64-bit and dual-state so I have to flush 16 bytes instead of 4. There are ways to recover some of those bytes.

The tANS file sizes here are smaller than comparable coders. The win comes from the improvements to normalizing counts and making the sort order. In fact, the +1 bias heuristic lets me beat "arith 12" and "rans 12" from the last post, which were coding nearly perfectly to the expected codelen of the normalized counts.

If you run "ans_learning" you will often see that the written bits are less than the predicted codelen :

H = 1.210176
CL = 1.238785
wrote 1.229845 bpb
This is because the +1 bias heuristic lets the codelens match the data better than the normalized counts do.

Okay, so on to the speed.

The biggest thing is that the above reported speeds are for 2x interleaved coders. That is, two independent states encoding the single buffer to a single compressed stream. I believe ryg will talk about this more soon. You can read his paper on arxiv now. Note that this is not just unrolling. Because the states are independent they allow independent execution chains to be in flight at the same time.

The speedup from interleaving is pretty huge (around 1.4X) :


book1

rANS non-interleaved (v1)

ticks to encode: 26.84 decode: 7.33
mbps encode: 64.41 decode: 235.97

withtable ticks to encode: 26.85 decode: 7.38
withtable mbps encode: 64.41 decode: 234.19

rANS 2x interleaved (v1)

ticks to encode: 17.15 decode: 5.16
mbps encode: 100.84 decode: 334.95

withtable ticks to encode: 17.15 decode: 5.22
withtable mbps encode: 100.83 decode: 331.31


tANS non-interleaved

ticks to encode: 6.43 decode: 4.68
mbps encode: 269.10 decode: 369.44

withtable ticks to encode: 6.48 decode: 4.73
withtable mbps encode: 266.86 decode: 365.39

tANS 2x interleaved

ticks to encode: 4.64 decode: 3.39
mbps encode: 372.92 decode: 509.63

withtable ticks to encode: 4.69 decode: 3.44
withtable mbps encode: 368.65 decode: 501.95

But even non-interleaved it's fast. (note that interleaved tANS is using only a single shared bit buffer). The rest of the implementation discussion will use the non-interleaved versions for simplicity.

The tANS implementation is pretty straightforward.

Decoding one symbol is :


    struct decode_entry { uint16 next_state; uint8 num_bits; uint8 sym; };

    decode_entry * detable = table - L;

    #define DECODE_ONE() do { \
        de = detable + state; \
        nb = de->num_bits; \
        state = de->next_state; \
        BITIN_OR(bitin_bits,bitin_numbits,nb,state); \
        *toptr++ = (uint8) de->sym; \
    } while(0)

where BITIN_OR reads "nb" bits and ors them onto state.

With a 64-bit bit buffer, I can ensure >= 56 bits are in the buffer. That means with L up to 14 bits, I can do four decodes before checking for more bits needed. So the primary decode loop is :


        // I know >= 56 bits are available  
        // each decode consumes <= 14 bits

        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();
        DECODE_ONE();
            
        BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr);
        // now >= 56 bits again
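
The 2x interleaved variant is schematically the same loop alternating two states on the one shared bit buffer. DECODE_ONE_STATE here is just a hypothetical version of DECODE_ONE parameterized on the state variable :

    #define DECODE_ONE_STATE(st) do { \
        de = detable + (st); \
        nb = de->num_bits; \
        (st) = de->next_state; \
        BITIN_OR(bitin_bits,bitin_numbits,nb,(st)); \
        *toptr++ = (uint8) de->sym; \
    } while(0)

        // two independent dependency chains in flight :
        DECODE_ONE_STATE(state0);
        DECODE_ONE_STATE(state1);
        DECODE_ONE_STATE(state0);
        DECODE_ONE_STATE(state1);

        BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr);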

The fastest way I could find to do the bit IO was "big endian style". That is, the next bits to read are at the top of the word, and bits in the word are in the same order as bits in the file. This lets you unconditionally grab the next 8 bytes to refill, but requires a bswap (on little endian platforms). eg :

#define BITIN_REFILL(bitin_bits,bitin_numbits,bitin_ptr) do { \
        ASSERT( bitin_numbits > 0 && bitin_numbits <= 64 ); \
        int64 bytesToGet = (64 - bitin_numbits)>>3; \
        uint64 next8 = _byteswap_uint64( *( (uint64 *)bitin_ptr ) ); \
        bitin_ptr += bytesToGet; \
        bitin_bits |= (next8 >> 1) >> (bitin_numbits-1); \
        bitin_numbits += bytesToGet<<3; \
        ASSERT( bitin_numbits >= 56 && bitin_numbits <= 64 ); \
    } while(0)

The other nice thing about the bits-at-top style is that the encoder can put bits in the word without any masking. The encoder is :

    #define ENCODE_ONE() do { \
        sym = *bufptr--; ee = eetable+sym; \
        msnb = ee->max_state_numbits; \
        msnb += ( state >= ee->max_state_thresh ); \
        BITOUT_PUT(bout_bits,bout_numbits, state,msnb); \
        state = ee->packed_table_ptr[ state>>msnb ]; \
        } while(0)

    #define BITOUT_PUT(bout_bits,bout_numbits,val,nb) do { \
        ASSERT( (bout_numbits+nb) <= 64 ); \
        bout_bits >>= nb; \
        bout_bits |= ((uint64)val) << (64 - nb); \
        bout_numbits += nb; \
    } while(0)

the key interesting part being that the encoder just does BITOUT_PUT with "state", and by shifting it up to the top of the word for the bitio, it gets automatically masked. (and rotate-right is a way to make that even faster).

Similarly to the decoder, the encoder can write 4 symbols before it has to check if the bit buffer needs any output.

The other crucial thing for fast tANS is the sort order construction. I do a real sort, using a radix sort. I do the first step of radix sorting (generating a histogram), and then I directly build the tables from that, reading out of the radix histogram. There's no need to explicitly generate the sorted symbol list as an intermediate step. I use only an 8-bit radix here (256 entries) but it's not significantly different (in speed or compression) than using a larger radix table.

The rANS implementation is pretty straightforward and I didn't spend much time on it, so it could probably be faster (particularly encoding which I didn't spend any time on (ADDENDUM : v2 rANS now sped up and encoder uses fast reciprocals)). I use a 64-bit state with 32-bit renormalization. The basic decode operation is :


        uint64 xm = x & mask;   
        const rans_decode_table::entry & e = table[xm];
            
        x = e.freq * (x >> cumprobtot_bits) + e.xm_minus_low;
    
        buffer[i] = (uint8) e.sym;
        
        if ( x < min_x )
        {
            x <<= 32;
            x |= *((uint32 *)comp_ptr);
            comp_ptr += 4;
        }

One thing I should note is that my rANS decode table is 2x bigger than the tANS decode table. I found it was fastest to use an 8-byte decode entry for rANS :

    // 8-byte decode entry
    struct entry { uint16 freq; uint16 xm_minus_low; uint8 sym; uint16 pad; };
obviously you can pack that a lot smaller (32 bits from 12+12+8) but it hurts speed.

For both tANS and rANS I make the encoder write backwards and the decoder read forwards to bias in favor of decoder speed. I make "L" a variable, not a constant, which hurts speed a little.


02-18-14 | Understanding ANS - Conclusion

I think we can finally say that we understand ANS pretty well, so this series will end. I may cover some more ANS topics but they won't be "Understanding ANS".

Here is the index of all posts on this topic :

cbloom rants 1-30-14 - Understanding ANS - 1
cbloom rants 1-31-14 - Understanding ANS - 2
cbloom rants 02-01-14 - Understanding ANS - 3
cbloom rants 02-02-14 - Understanding ANS - 4
cbloom rants 02-03-14 - Understanding ANS - 5
cbloom rants 02-04-14 - Understanding ANS - 6
cbloom rants 02-05-14 - Understanding ANS - 7
cbloom rants 02-06-14 - Understanding ANS - 8
cbloom rants 02-10-14 - Understanding ANS - 9
cbloom rants 02-11-14 - Understanding ANS - 10
cbloom rants 02-14-14 - Understanding ANS - 11
cbloom rants 02-18-14 - Understanding ANS - 12

And here is some source code for my ANS implementation : (v2 02/21/2014)

ans_learning.cpp
ans_fast.cpp
ans.zip - contains ans_fast and ans_learning
cblib.zip is required to build my code

My home code is MSVC 2005/2008. Port if you like. Email me if you need help.

NOTE : this release is not a library you just download and use. It is intended as documentation of research. If you want some ANS code you can just use off the shelf, go get FiniteStateEntropy . You may also want ryg_rans .

I think I'll do a followup post with the performance of ans_fast and some optimization notes so it doesn't crud up this index post. Please put implementation speed discussion in that followup post .


02-18-14 | Understanding ANS - 12

A little note about sorts and tables.

AAAAAAABBBBBBBBBBBBBCCCCCCCCCCCD

What's wrong with that sort?

(That's the naive rANS sort order; it's just a "cum2sym" table. It's each symbol Fs times in consecutive blocks. It has M=32 entries. M = sum{Fs} , L = coding precision)

(here I'm talking about a tANS implementation with L=M ; the larger (L/M) is, the more you preserve the information in the state x)

Think about what the state variable "x" does as you are coding. In the renormalization range it's in [32,63]. Its position in that range is a slider for the number of fraction bits it contains. At the bottom of the range, log2(x) is 5, at the top log2(x) is 6.

Any time you want to encode a "D" you must go back to a singleton precursor state, Is = [1]. That means you have to output all the bits in x, so all fractional bits are thrown away. All information about where you were in that I range is gone. Then from that singleton Is range you jump to the end of the I range.

(if Fs=2 , then you quantize the fractional bits up to 0.5 ; if Fs=3, you quantize to 1/3 of a bit, etc.)

Obviously the actual codelen for a "D" is longer than that for an "A". But so is the codelen for a "C", and the codelen for "A" is too short. Another way to think of it is that you're taking an initial state x that spans the whole interval [32,63] and thus has variable fractional bits, and you're mapping it into only a portion of the interval.

In order to preserve the fractional bit state size, you want to map from the whole interval back to the whole interval. In the most extreme case, something like :

ACABACACABACABAD

(M=16) , when you encode an A you go from [16,31] to [8,15] and then back to the A's in that string. The net result is that state just lost its bottom bit. That is, x &= ~1. You still have the full range of possible fractional bits from [0,1] , you just lost the bottom bit of precision.

I was thinking about this because I was making some weird alternative tANS tables. In fact I suppose not actually ANS tables, but more general coding tables.

For background, you can make one of the heuristic tANS tables thusly :


shuffle(s) = some permutation function
shuffle is one-to-one over the range [0,L-1]
such as Yann's stepping prime-mod-L
or bit reverse
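
For example, bit reversal over a power-of-2 L is a valid shuffle (and it's self-inverting, which matters below). A sketch, with L_bits = log2(L) :

static uint32 shuffle_bitreverse( uint32 i, uint32 L_bits )
{
    uint32 r = 0;
    for ( uint32 b = 0; b < L_bits; b++ )
    {
        r = (r<<1) | (i&1);
        i >>= 1;
    }
    return r;
}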

make_tans_shuffle()
{
    int next_state[256];    
    uint8 permutation[MAX_L];
    
    // make permutation :
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        
        next_state[sym] = count;
        
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;
            
            permutation[index] = (uint8)sym;
        }
    }
    ASSERT( cumulative_count == (uint32)L );

    // permutation is now our "output string"   

    for LOOP(i,L) // iterating over destination state
    {
        int sym = permutation[i];
        
        // step through states for this symbol
        int from_state = next_state[sym];
        next_state[sym] ++;
                
        int to_state = L + i;
                    
        encode_packed_table_ptr[sym][from_state] = to_state;
    }
}

which is all well and good. But I started thinking - can I eliminate the intermediate permutation[] table entirely? Well, yes. There are a few ways.

If you have a "cum2sym" table already handy, then you can just use shuffle() to look up directly into cum2sym[], and that is identical to the above. But you probably don't have cum2sym.

Well what if we just use shuffle() to make the destination state? Note that this is calling it in the opposite direction (from cum2sym index to to_state , rather than from to_state to cum2sym). If your shuffle is self-inverting like bit reversal is, then it's the same.

It gives you a very simple table construction :


make_tans_shuffle_direct()
{
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
                
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }
    }
    ASSERT( cumulative_count == (uint32)L );
}

make_tans_shuffle_direct walks the Fs in a kind of cum2sym order and then scatters those symbols out to semi-random target locations using the shuffle() function.

It doesn't work. Or rather, it works, it encodes & decodes data correctly, but the total coded size is worse.

The problem is that the encode table is no longer monotonic. That is, as "from_state" increases, "to_state" does not necessarily increase. The Fs encode table entries for each symbol are not numerically in order.

In the images we've been picturing from earlier in the post we can see the problem. Some initial state x is renormalized down to the Is coding range. We then follow the state transition back to the I range - but we go somewhere random. We don't go to the same neighborhood where we started, so we randomly get more or less fractional bits.

You can fix it thusly :


make_tans_shuffle_direct_fixed()
{
    uint32 cumulative_count = 0;    
    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
                
        for LOOP(c,(int)count)
        {
            uint32 index = shuffle(cumulative_count);
            cumulative_count++;

            uint32 to_state = index + L;
            int from_state = count + c; 

            encode_packed_table_ptr[sym][from_state] = to_state;
        }

        // fix - to_states not monotonic
        // sort the destination states for this symbol :
        std::sort( encode_packed_table_ptr[sym]+count, encode_packed_table_ptr[sym]+2*count );
    }
    ASSERT( cumulative_count == (uint32)L );
}

and then it is identical to "make_tans_shuffle" (identical if shuffle is self-inverting, and if not then it's different but equal, since shuffle is really just a random number generator so running it backwards doesn't hurt compression).

For the record the compression penalty for getting the state transition order wrong is 1-2% :


CCC total bytes out :

correct sort : 1788631
shuffle fixed: 1789655
shuffle bad  : 1813450


02-14-14 | Understanding ANS - 11

I want to do some hand waving about the different ways you can conceptually look at ANS.

Perhaps the best way to understand ANS mathematically is via the analogy with arithmetic coding . While ANS is not actually building up an arithmetic coder interval for the file, each step is very much like a LIFO arithmetic coder, and the (x/P) scaling is what makes x grow the right amount for each symbol. This is the most useful way to think about rANS or uANS, I find.

But there are other ways to think about it.

One is Duda's "asymmetric numeral system", which is how he starts the explanation in the paper, and really confused me to begin with. Now that we've come at ANS from the back way we can understand what he was on about.

The fundamental op in ANS is :


integer x contains some previous value

make x' = x scaled up in some way + new value 

with a normal "symmetric numeral system" , you would just do base-b math :

x' = x * b + v

which gives you an x' where the old value (x) is distributed evenly, and the v's just cycle :

b = 3 for example

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  0  1  1  1  2  2  2
v :  0  1  2  0  1  2  0  1  2

x' is a way of packing the old value x and the new value v together. This symmetric packing corresponds to the output string "012" in the parlance of this series. The growth factor (x'/x) determines the number of bits required to send our value, and it's uniform.

But it doesn't need to be uniform.


0102 :

x':  0  1  2  3  4  5  6  7  8 ... 
x :  0  0  1  0  2  1  3  1  4 
v :  0  1  0  2  0  1  0  2  0

Intuitively, the more often a symbol occurs in the output string, the more slots there are for the previous value (x) to get placed; that is, more bits of x can be sent in lower values of x' when the symbol occurs in many slots. Hence x' grows less. If you're thinking in terms of normalized x's, growing less means you have to output fewer bits to stay in the renormalization range.

You can draw these asymmetric numeral lines in different ways, which Duda does in the paper. For example :


input x as the axis line,
output x' in the table :

"AB"
  0 1 2 3 4 5 6  x
A 0 2 4          x'
B 1 3 5

"AAB"

  0 1 2 3 4 5 6  x
A 0 1 3 4 6 7 9  x'
B 2 5 8 11

output x' as the axis line
input x in the table :

"AB"
  0 1 2 3 4 5 6  x'
A 0   1   2   3  x
B   0   1   2

"AAB"
  0 1 2 3 4 5 6  x'
A 0 1   2 3   4  x
B     0     1

output x' line implicit
show x and output symbol :

"AB"
0 0 1 1 2 2 3
A B A B A B A

"AAB"
0 1 0 2 3 1 4
A A B A A B A

That is, it's a funny way of just doing base-b math; we're shifting up the place value and adding our value in, but we're in an "asymmetric numeral system", so the base is nonuniform. I find this mental image not very useful when thinking about how the coding actually works.

There's another way to think about tANS in particular (tANS = table-based ANS), which is what Yann is talking about .

To get there mentally, we actually need to optimize our tANS code.

When I covered tANS encoding before , I described it something like this :


x is the current state
x starts in the range I = [L,2L-1]

to encode the next symbol s
we need to reach the precursor range Is = [Fs,2Fs-1]

to do that, output bits from x
b = x&1; x >>= 1;
until x is lowered to reach Is

then take the state transition C()
this takes x back into I

this should be familiar and straightforward.

To optimize, we know that x always starts in a single power of 2 interval [L,2L-1] , and it always lands in a power of 2 interval [Fs,2Fs-1]. That means the minimum number of bits we ever output is from L to 2Fs-1 , and the maximum number of bits is only 1 more than that. So the renormalization can be written as :


precompute :

max_state = 2Fs - 1;
min_num_bits = floor(log2(L/Fs));

to renormalize :

x in [L,2L-1]
output min_num_bits from x
x >>= min_num_bits

now ( x >= Fs && x < 2*max_state );

if ( x > max_state ) output 1 more bit; x>>= 1;

now x in [Fs,2Fs-1]

But you can move the check for the extra output bit earlier, before shifting x down :

precompute :

min_num_bits = log2(L) - log2ceil(Fs);  // if L is power of 2
threshold = (2*Fs)<<min_num_bits;

to renormalize :

x in [L,2L-1]
num_bits = min_num_bits;
if ( x >= threshold ) num_bits ++;
output num_bits from x
x >>= num_bits

x in [Fs,2Fs-1]

and then use C(x) since x is now in Is.

It's just straightforward optimization, but it actually allows us to picture the whole process in a different way. Let's write the same encoder, but just in terms of a table index :


let t = x - L
t in [0,L-1]

t is a table index.


to encode s :

num_bits = min_num_bits[s] + ( t >= threshold[s] );
bitoutput( t, num_bits );
t = encode_state_table[s][ (t+L)>>num_bits ];

That is, we're going from a table index to another table index. We're no longer thinking about going back to the [Fs,2Fs-1] precursor range at all.

Before we got our desired code len by the scaling of the intervals [L,2L)/[Fs,2Fs) , now the code len is the stored number of bits. We can see that we get fractional bits because sometimes we output one more.

Let's revisit an example that we went through previously , but with this new image.


L = 8
Fs = {3,3,2}
output string = ABCABCAB

We can see right away that our table index t is 3 bits. To encode a 'C' there will be only two slots on our numeral line that correspond to a lower digit of C, so we must output 2 bits and keep 1 bit of t. To encode an 'A' we can keep 3 values, so we can output 1 bit for t in [0,3] and 2 bits for t in [4,7] ; that will give us 2 retained values in the first region and 1 retained value in the second.

Explicitly :


t in [0,7]
I want to encode an A
so I want to reach {AxxAxxAx}

t in [0,3]
  output t&1
  index = (t+L)>>1 = 4 or 5
  take the last two A's {xxxAxxAx}
  so state -> 3 or 6

t in [4,7]
  output t&3
  index = (t+L)>>2 = 3
  take the first A {Axxxxxxx}
  state -> 0

Note that the way we're doing it, high states transition to low states, and vice versa. This comes up because of the +L sentry bit method to separate the subranges produced by the shift.

The tANS construction creates this encode table :


encode:
A : b=1+(t>=4) : {0,3,6}
B : b=1+(t>=4) : {1,4,7}
C : b=2+(t>=8) : {2,5}

It should be obvious that we can now drop all our mental ideas about "ANS" and just make these coding tables directly. All you need is an output string, and you think about doing these kinds of mapping :

t in [0,7]

I want to encode a B

[xxxxxxxx] -> [xBxxBxxB]

output bits to reduce the 3 values
and transition to one of the slots with a B

The decode table is trivial to make from the inverse :

decode:
 0: A -> 4 + getbits(2)
 1: B -> 4 + getbits(2)
 2: C -> 0 + getbits(2)
 3: A -> 0 + getbits(1)
 4: B -> 0 + getbits(1)
 5: C -> 4 + getbits(2)
 6: A -> 2 + getbits(1)
 7: B -> 2 + getbits(1)

Note that each symbol's decode covers the entire origin state range :

decode:
 0: A -> 4 + getbits(2)  from [4,7]
 3: A -> 0 + getbits(1)  from [0,1]
 6: A -> 2 + getbits(1)  from [2,3]

 1: B -> 4 + getbits(2)  from [4,7]
 4: B -> 0 + getbits(1)  from [0,1]
 7: B -> 2 + getbits(1)  from [2,3]

 2: C -> 0 + getbits(2)  from [0,3]
 5: C -> 4 + getbits(2)  from [4,7]

During decode we can think about our table index 't' as containing two pieces of information : one is the current symbol to output, but there's also some information about the range where t will be on the next step. That is, the current t contains some bits of the next t. The number of bits depends on where we are in the table. eg. in the example above; when t=4 we specify a B, but we also specify 2 bits worth of the next t.

Doing another example from that earlier post :


Fs={7,6,3}

ABCABABACBABACBA

encode:
A : b=1+(t>=12) : {0,3,5,7,10,12,15}
B : b=1+(t>=8) : {1,4,6,9,11,14}
C : b=2+(t>=8) : {2,8,13}

decode:
 0: A -> 12 + getbits(2)
 1: B -> 8 + getbits(2)
 2: C -> 8 + getbits(3)
 3: A -> 0 + getbits(1)
 4: B -> 12 + getbits(2)
 5: A -> 2 + getbits(1)
 6: B -> 0 + getbits(1)
 7: A -> 4 + getbits(1)
 8: C -> 0 + getbits(2)
 9: B -> 2 + getbits(1)
10: A -> 6 + getbits(1)
11: B -> 4 + getbits(1)
12: A -> 8 + getbits(1)
13: C -> 4 + getbits(2)
14: B -> 6 + getbits(1)
15: A -> 10 + getbits(1)

and this concludes our conception of tANS in terms of just a [0,L-1] table.

I'm gonna be super redundant and repeat myself some more. I think it's intriguing that we went through all this ANS entropy coder idea, scaling values by (x/P) and so on, and from that we constructed tANS code. But you can get to the exact same tANS directly from the idea of the output string!

Let's invent tANS our new way, starting from scratch.

I'm given normalized frequencies {Fs}. Sum{Fs} = L. I want a state machine with L entries. Take each symbol and scatter it into our output string in some way.

To encode each symbol, I need to map the state machine index t in [0,L-1] to one of its occurrences in the output string.


There are Fs occurrences in the output string

I need to map an [0,L-1] value to an [0,Fs-1] value
by outputting either b or b+1 bits

now clearly if (L/Fs) is a power of 2, then the log2 of that is just b and we always output that many bits. (eg L=16, Fs=4, we just output 2 bits). In general if (L/Fs) is not a power of 2, then

b = floor(log2(L/Fs))
b+1 = ceil(log2(L/Fs))

so we just need two sub-ranges of L such that the total adds up to Fs :

threshold T
values < T output b bits
values >= T output b+1 bits

total of both ranges after output should equal Fs :

(T>>b) + ( (L-T)>>(b+1) ) = Fs

(2T + L-T)>>(b+1) = Fs

L+T = Fs<<(b+1)

T = (Fs<<(b+1)) - L

and that's it! We've just made a tANS encoder without talking about anything related to the ANS ideas at all.
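
To make that concrete, here's a minimal sketch of building the per-symbol encode parameters this way (names follow the pseudocode above, not any particular implementation; F[] are the normalized counts, Sum{F} = L, L a power of 2) :


for (int s = 0; s < alphabet; s++)
{
    if ( F[s] == 0 ) continue;

    // b = floor(log2(L/F[s])) : the minimum number of output bits for s
    int b = 0;
    while ( (F[s] << (b+1)) <= L ) b++;

    min_num_bits[s] = b;

    // t >= threshold outputs b+1 bits :
    threshold[s] = (F[s] << (b+1)) - L;
}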

The funny thing to me is that we got the exact same condition before from "b-uniqueness". That is, in order to be able to encode symbol s from any initial state, we worked out that the only valid precursor range was Is = [Fs,2*Fs-1] . That leads us to the renormalization loop :


while x > (2*Fs-1)
  output a bit from x; x>>= 1;

And from that we computed a minimum number of output bits, and a threshold state for one more. That threshold we computed was

(max_state + 1)<<min_num_bits

= (2*Fs-1 + 1)<<b
= Fs<<(b+1)

which is the same.


02-11-14 | Understanding ANS - 10

Not really an ANS topic, but a piece you need for ANS so I've had a look at it.

For ANS and many other statistical coders (eg. arithmetic coding) you need to create scaled frequencies (the Fs in ANS terminology) from the true counts.

But how do you do that? I've seen many heuristics over the years that are more or less good, but I've never actually seen the right answer. How do you scale to minimize total code len? Well let's do it.

Let's state the problem :


You are given some true counts Cs

Sum{Cs} = T  the total of true counts

the true probabilities then are

Ps = Cs/T

and the ideal code lens are log2(1/Ps)

You need to create scaled frequencies Fs
such that

Sum{Fs} = M

for some given M.

and our goal is to minimize the total code len under the counts Fs.

The ideal entropy of the given counts is :

H = Sum{ Ps * log2(1/Ps) }

The code len under the counts Fs is :

L = Sum{ Ps * log2(M/Fs) }

The code len is strictly worse than the entropy

L >= H

We must also meet the constraint

if ( Cs != 0 ) then Fs > 0

That is, all symbols that exist in the set must be codeable. (note that this is not actually optimal; it's usually better to replace all rare symbols with a single escape symbol, but we won't do that here).

The naive solution is :


Fs = round( M * Ps )

if ( Cs > 0 ) Fs = MAX(Fs,1);

which is just scaling up the Ps by M. This has two problems - one is that Sum{Fs} is not actually M. The other is that just rounding the float does not actually distribute the integer counts to minimize codelen.

The usual heuristic is to do something like the above, and then apply some fix to make the sum right.

So first let's address how to fix the sum. We will always have issues with the sum being off M because of integer rounding.

What you will have is some correction :


correction = M - Sum{Fs}

that can be positive or negative. This is a count that needs to be added onto some symbols. We want to add it to the symbols that give us the most benefit to L, the total code len. Well that's simple, we just measure the effect of changing each Fs :

correction_sign = correction > 0 ? 1 : -1;

Ls_before = Ps * log2(M/Fs)
Ls_after = Ps * log2(M/(Fs + correction_sign))

Ls_delta = Ls_after - Ls_before
Ls_delta = Ps * ( log2(M/(Fs + correction_sign)) - log2(M/Fs) )
Ls_delta = Ps * log2(Fs/(Fs + correction_sign))

so we need to just find the symbol that gives us the lowest Ls_delta. This is either an improvement to total L, or the least increase in L.

We need to apply multiple corrections. We don't want a solution that's O(alphabet*correction) , since that can be 256*256 in bad cases. (correction is <= alphabet and typically in the 1-50 range for a typical 256-symbol file). The obvious solution is a heap. In pseudocode :


For all s
    push_heap( Ls_delta , s )

For correction
    s = pop_heap
    adjust Fs
    compute new Ls_delta for s
    push_heap( Ls_delta , s )

note that after we adjust the count we need to recompute Ls_delta and repush that symbol, because we might want to choose the same symbol again later.

In STL+cblib this is :

to[] = Fs
from[] = original counts

struct sort_sym
{
    int sym;
    float rank;
    sort_sym() { }
    sort_sym( int s, float r ) : sym(s) , rank(r) { }
    bool operator < (const sort_sym & rhs) const { return rank < rhs.rank; }
};

---------

    if ( correction != 0 )
    {
        //lprintfvar(correction);
        int32 correction_sign = (correction > 0) ? 1 : -1;

        vector<sort_sym> heap;
        heap.reserve(alphabet);

        for LOOP(i,alphabet)
        {
            if ( from[i] == 0 ) continue;
            ASSERT( to[i] != 0 );
            if ( to[i] > 1 || correction_sign == 1 )
            {
                // negated : the std max-heap then pops the lowest Ls_delta first
                double change = - log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
            
                heap.push_back( sort_sym(i,change) );
            }
        }
        
        std::make_heap(heap.begin(),heap.end());
        
        while( correction != 0 )
        {
            ASSERT_RELEASE( ! heap.empty() );
            std::pop_heap(heap.begin(),heap.end());
            sort_sym ss = heap.back();
            heap.pop_back();
            
            int i = ss.sym;
            ASSERT( from[i] != 0 );
            
            to[i] += correction_sign;
            correction -= correction_sign;
            ASSERT( to[i] != 0 );
        
            if ( to[i] > 1 || correction_sign == 1 )
            {
                // negated : the std max-heap then pops the lowest Ls_delta first
                double change = - log( (double) to[i] / (to[i] + correction_sign) ) * from[i];
            
                heap.push_back( sort_sym(i,change) );
                std::push_heap(heap.begin(),heap.end());
            }               
        }
    
        ASSERT( cb::sum(to,to+alphabet) == (uint32)to_sum_desired );
    }

You may have noted that the above code is using natural log instead of log2. The difference is only a constant scaling factor, so it doesn't affect the heap order; you may use whatever log base is fastest.

Errkay. So our first attempt is to just use the naive scaling Fs = round( M * Ps ) and then fix the sum using the heap correction algorithm above.

Doing round+correct gets you 99% of the way there. I measured the difference between the total code len made that way and the optimal, and they are less than 0.001 bpb different on every file I tried. But it's still not quite right, so what is the right way?

To guide my search I had a look at the cases where round+correct was not optimal. When it's not optimal it means there is some symbol a and some symbol b such that { Fa+1 , Fb-1 } gives a better total code len than {Fa,Fb}. An example of that is :


count to inc : (1/1024) was (1866/1286152 = 0.0015)
count to dec : (380/1024) was (482110/1286152 = 0.3748)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43
to dec; cl before : 1.43 cl after : 1.43 , true cl : 1.42

The key point is on the 1 count :

count to inc : (1/1024) was (1866/1286152 = 0.0015)
to inc; cl before : 10.00 cl after : 9.00 , true cl : 9.43

1024*1866/1286152 = 1.485660
round(1.485660) = 1

so Fs = 1 , which is a codelen of 10

but Fs = 2 gives a codelen (9) closer to the true codelen (9.43)

And this provided the key observation : rather than rounding the scaled count, what we should be doing is either floor or ceil of the fraction, whichever gives a codelen closer to the true codelen.

BTW before you go off hacking a special case just for Fs==1, it also happens with higher counts :


count to inc : (2/1024) was (439/180084) scaled = 2.4963
to inc; cl before : 9.00 cl after : 8.42 , true cl : 8.68

count to inc : (4/1024) was (644/146557) scaled = 4.4997
to inc; cl before : 8.00 cl after : 7.68 , true cl : 7.83

though obviously the higher Fs, the less likely it is because the rounding gets closer to being perfect.

So it's easy enough just to solve exactly, simply pick the floor or ceil of the ratio depending on which makes the closer codelen :


Ps = Cs/T from the true counts

down = floor( M * Ps )
down = MAX( down,1)

Fs = either down or (down+1)

true_codelen = -log2( Ps )
down_codelen = -log2( down/M )
  up_codelen = -log2( (down+1)/M )

if ( |down_codelen - true_codelen| < |up_codelen - true_codelen| )
  Fs = down
else
  Fs = down+1

And since all we care about is the inequality, we can do some maths and simplify the expressions. I won't write out all the algebra to do the simplification because it's straightforward, but there are a few key steps :

| log(x) | = log( MAX(x,1/x) )

log(x) >= log(y)  is the same as x >= y

down <= M*Ps
down+1 >= M*Ps

the result of the simplification in code is :

from[] = original counts (Cs) , sum to T
to[] = normalized counts (Fs) , will sum to M

    double from_scaled = from[i] * M/T;

    uint32 down = (uint32)( from_scaled );
                
    to[i] = ( from_scaled*from_scaled <= down*(down+1) ) ? down : down+1;

Note that there's no special casing needed to ensure that (from_scaled < 1) gives you to[i] = 1 , we get that for free with this expression.

I was delighted when I got to this extremely simple final form.

And that is the conclusion. Use that to find the initial scaled counts. There will still be some correction that needs to be applied to reach the target sum exactly, so use the heap correction algorithm above.
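
Putting the pieces together, a rough sketch of the whole normalization (same to[]/from[] conventions as above; heap_correct() here is a stand-in name for the heap correction code earlier) :


void normalize_counts(uint32 * to, const uint32 * from, int alphabet, uint32 T, uint32 M)
{
    for (int i = 0; i < alphabet; i++)
    {
        if ( from[i] == 0 ) { to[i] = 0; continue; }

        double from_scaled = from[i] * (double)M / T;
        uint32 down = (uint32)from_scaled;

        // floor or ceil , whichever codelen is closer to the true codelen :
        to[i] = ( from_scaled*from_scaled <= (double)down*(down+1) ) ? down : down+1;
        // (from_scaled < 1 gives down = 0 , and the test always rounds up to 1)
    }

    // the sum is still not exactly M ; fix the remainder with the heap :
    heap_correct(to, from, alphabet, M);
}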

As a final note, if we look at the final expression :


to[i] = ( from_scaled*from_scaled <= down*(down+1) ) ? down : down+1;

to[i] = ( test < 0 ) ? down : down+1;

test = from_scaled*from_scaled - down*(down+1); 

from_scaled = down + frac

test = (down + frac)^2 - down*(down+1);

solve for frac where test = 0

frac = sqrt( down^2 + down ) - down

That gives you the fractional part of the scaled count where you should round up or down. It varies with floor(from_scaled). The actual values are :

1 : 0.414214
2 : 0.449490
3 : 0.464102
4 : 0.472136
5 : 0.477226
6 : 0.480741
7 : 0.483315
8 : 0.485281
9 : 0.486833
10 : 0.488088
11 : 0.489125
12 : 0.489996
13 : 0.490738
14 : 0.491377
15 : 0.491933
16 : 0.492423
17 : 0.492856
18 : 0.493242
19 : 0.493589

You can see as Fs gets larger, it goes to 0.5 , so just using rounding is close to correct. It's really in the very low values where it's quite far from 0.5 that errors are most likely to occur.


02-10-14 | Understanding ANS - 9

If you just want to understand the basics of how ANS works, you may skip this post. I'm going to explore some unsolved issues about the sort order.

Some issues about constructing the ANS sort order are still mysterious to me. I'm going to try to attack a few points.

One thing I wrote last time needs some clarification - "Every slot has an equal probability of 1/M."

What is true is that every character of the output string is equiprobable (assuming again that the Fs are the true probabilities). That is, if you have the string S[] with L symbols, each symbol s occurs Fs times, then you can generate symbols with the correct probability by just drawing S[i] with random i.

The output string S[] also corresponds to the destination state of the encoder in the renormalization range I = [L,2L-1]. What is not true is that all states in I are equally probable.

To explore this I did 10,000 random runs of encoding 10,000 symbols each time. I used L=1024 each time, and gathered stats from all the runs.

This is the actual frequency of the state x having each value in [1024,2047] (scaled so that the average is 1000) :

The low states (x near 1024) are the most probable, and have roughly 2X the frequency of the high, least probable states (x near 2047).

Note : this data was generated using Duda's "precise initialization" (my "sort by sorting" with 0.5 bias). Different table constructions will create different utilization graphs. In particular the various heuristics will have some weird bumps. And we'll see what different bias does later on.

This is the same data with 1/X through it :

This probability distribution (1/X) can be reproduced just from doing this :


            x = x*b + irandmod(b); // push a random digit in any base b ; irandmod(b) is uniform in [0,b)
            
            while( x >= 2*K ) x >>= 1; // renormalize back down into [K,2K) ; here K = L = 1024
            
            stats_count[x-K] ++;            

though I'd still love to see an analytic proof and understand that better.

So, the first thing I should correct is : final states (the x' in I) are not equally likely.

How that should be considered in sort construction, I do not know.

The other thing I've been thinking about was why did I find that the + 1.0 bias is better in practice than the + 0.5 bias that Duda suggests ("precise initialization") ?

What the +1 bias does is push low probability symbols further towards the end of the sort order. I've been contemplating why that might be good. The answer is not that the end of the sort order makes longer codelens, because that kind of issue has already been accounted for.

My suspicion was that the +1 bias was beating the +0.5 bias because of the difference between normalized counts and unnormalized original counts.

Recall that to construct the table we had to make normalized frequences Fs that sum to L. These, however, are not the true symbol frequencies (except in synthetic tests). The true symbol frequencies had to be scaled to sum to L to make the Fs.

The largest coding error from frequency scaling is on the least probable symbols. In fact the very worst case is symbols that occur only once in a very large file. eg. in a 1 MB file a symbol occurs once; its true probability is 2^-20 and it should be coded in 20 bits. But when we scale the frequencies to sum to 1024 (for example), it still must get a count of 1, so it's coded in 10 bits.

What the +1 bias does is take the least probable symbols and push them to the end of the table, which maximizes the number of bits they take to code. If the {Fs} were the true frequencies, this would be bad, and the + 0.5 bias would be better. But the {Fs} are not the true frequencies.

This raises the question - could we make the sort order from the true frequencies instead of the scaled ones? Yes, but you would then have to either transmit the true frequencies to the decoder, or transmit the sort order. Either way takes many more bits than transmitting the scaled frequencies. (in fact in the real world you may wish to transmit even approximations of the scaled frequencies). You must ensure the encoder and decoder use the same frequencies so they build the same sort order.

Anyway, I tested this hypothesis by making buffers synthetically by drawing symbols from the {Fs} random distribution. I took my large testset, for each file I counted the real histogram, made the scaled frequencies {Fs}, then regenerated the buffer from the frequencies {Fs} so that the statistics match the data exactly. I then ran tANS on the synthetic buffers and on the original file data :


synthetic data :

total bytes out : 146068969.00  bias=0.5
total bytes out : 146117818.63  bias=1.0

real data :

total bytes out : 144672103.38  bias=0.5
total bytes out : 144524757.63  bias=1.0

On the synthetic data, bias=0.5 is in fact slightly better. On the real data, bias=1.0 is slightly better. This confirms that the difference between the normalized counts & unnormalized counts is in fact the origin of 1.0's win in my previous tests, but doesn't necessarily confirm my guess for why.

An idea for an alternative to the bias=1 heuristic is you could use bias=0.5 , but instead of using the Fs for the sort order, use the estimated original count before normalization. That is, for each Fs you can have a probability model of what the original count was, and select the maximum-likelihood count from that. This is exactly analogous to restoring to expectation rather than restoring to middle in a quantizer.

Using bias=1.0 and measuring state occurrence counts, we get this :

Which mostly has the same 1/x curve, but with a funny tail at the end. Note that these graphs are generated on synthetic data.

I'm now convinced that the 0.5 bias is "right". It minimizes measured output len on synthetic data where the Fs are the true frequencies. It centers each symbol's occurrences in the output string. It reproduces the 1/x distribution of state frequencies. However there is still the missing piece of how to derive it from first principles.


BTW

While I was at it, I gathered the average number of bits output when coding from each state. If you're following along with Yann's blog he's been explaining FSE in terms of this. tANS outputs bits to get the state x down into the coding range Is for the next symbol. The Is are always lower than I (L), so you have to output some bits to scale down x to reach the Is. x starts in [L,2L) and we have to output bits to reach [Fs,2Fs) ; the average number of bits required is like log2(L/Fs) which is log2(1/P) which is the code length we want. Because our range is [L,2L) we know the average output bit count from each state must differ by 1 from the top of the range to the bottom. In fact it looks like this :

Another way to think about it is that at state=L , the state is empty. As state increases, it is holding some fractional bits of information in the state variable. That number of fraction bits goes from 0 at L up to 1 at 2L.


02-06-14 | Understanding ANS - 8

Time to address an issue that we've skirted for some time - how do you make the output string sort order?

Recall : The output string contains Fs occurrences of each symbol. For naive rANS the output string is just in alphabetical order (eg. "AAABBBCCC"). With tANS we can use any permutation of that string.

So what permutation should we use? Well, the output string determines the C() and D() encode and decode tables. It is in fact the only degree of freedom in table construction (assuming the same constraints as last time, b=2 and L=M). So we should choose the output string to minimize code length.

The guiding principle will be (x/P). That is, we achieve minimum total length when we make each code length as close to log2(1/P) as possible. We do that by making the input state to output state ratio (x'/x) as close to (1/P) as possible.

(note for the record : if you try to really solve to minimize the error, it should not just be a distance between (x'/x) and (1/P) , it needs to be log-scaled to make it a "minimum rate" solution). (open question : is there an exact solution for table building that finds the minimum rate table that isn't NP (eg. not just trying all permutations)).

Now we know that the source states always come from the precursor ranges Is, and we know that


destination range :
I = [ M , 2*M - 1]

source range :
Is = [ Fs, 2*Fs - 1 ] for each symbol s

and Ps = Fs/M

so the ideal target for the symbols in each source range is :

target in I = (1/Ps) * (Is) = (M/Fs) * [ Fs, 2*Fs - 1 ] = { M , M + M*(1/Fs) , M + M*(2/Fs) , ... }

and taking off the +M bias to make it a string index in the range [0,M-1] :

Ts = target in string = target in I - M

Ts = { 0 , M * (1/Fs) , M * (2/Fs) , ... }

Essentially, we just need to take each symbol and spread its Fs occurrences evenly over the output string.

Now there's a step that I don't know how to justify without waving my hands a bit. It works slightly better if we imagine that the source x was not just an integer, but rather a bucket that covers the unit range of that integer. That is, rather than starting exactly at the value "x = Fs" you start in the range [Fs,Fs+1]. So instead of just mapping up that integer by 1/P we map up the range, and we can assign a target anywhere in that range. In the paper Duda uses a bias of 0.5 for "precise initialization" , which corresponds to assuming the x's start in the middle of their integer buckets. That is :


Ts = { M * (b/Fs), M* (1+b)/Fs, M * (2+b)/Fs , ... }

with b = 0.5 for Duda's "precise initialization". Obviously b = 0.5 makes T centered on the range [0,M] , but I see no reason why that should be preferred.

Now assuming we have these known target locations, you can't just put all the symbols into the target slots that they want, because lots of symbols want the same spot.

For example :


M=8
Fs={3,3,2}

T_A = { 8 * 0.5/3 , 8 * 1.5 / 3 , 8 * 2.5 / 3 } = { 1 1/3 , 4 , 6 2/3 }
T_B = T_A
T_C = { 8 * 0.5/2 , 8 * 1.5/2 } = { 2 , 6 }

One way to solve this problem is to start assigning slots, and when you see that one is full you just look in the neighbor slot, etc. So you might do something like :

initial string is empty :

string = "        "

put A's in 1,4,6

string = " A  A A "

put B's in 1,4,6 ; oops they're taken, shift up one to find empty slots :

string = " AB ABAB"

put C's in 2,6 ; oops they're taken, hunt around to find empty slots :

string = "CABCABAB"

now obviously you could try to improve this kind of algorithm, but there's no point. It's greedy so it makes mistakes in the overall optimization problem (it's highly dependent on order). It can also be slow because it spends a lot of time hunting for empty slots; you'd have to write a fast slot allocator to avoid degenerate bad cases. There are other ways.

Another thing I should note is that when doing these target slot assignments, there's no reason to prefer the most probable symbol first, or the least probable or whatever. The reason is every symbol occurrence is equally probable. That is, symbol s has frequency Fs, but there are Fs slots for symbol s, so each slot has a frequency of 1. Every slot has an equal probability of 1/M.

An alternative algorithm that I have found to work well is to sort the targets. That is :


make a sorting_array of size M

add { Ts, s } to sorting_array for each symbol  (that's Fs items added)

sort sorting_array by target location

the symbols in sorting_array are in output string order

I believe that this is identical to Duda's "precise initialization" which he describes using push/pop operations on a heap; the result is the same - assigning slots in the order of desired target location.

Using the sort like this is a little weird. We are no longer explicitly trying to put the symbols in their target slots. But the targets (Ts) span the range [0, M] and the sort is an array of size M, so they wind up distributed over that range. In practice it works well, and it's fast because sorting is fast.

A few small notes :

You want to use a "stable" sort, or bias the target by some small number based on the symbol. The reason is you will have lots of ties, and you want the ties broken consistently. eg. for "AABBCC" you want "ABCABC" or "CBACBA" but not "ABCCAB". One way to get a stable sort is to make the sorting_array work on U32's, and pack the sort rank into the top 24 bits and the symbol id into the bottom 8 bits.

The bias = 0.5 that Duda uses is not strongly justified, so I tried some other numbers. bias = 0 is much worse. It turns out that bias = 1.0 is better. I tried a bunch of values on a large test set and found that bias = 1 is consistently good.

One very simple way to get a decent sort is to bit-reverse the rANS indexes. That is, start from a rANS/alphabetical order string ("AABB..") and take the index of each element, bit-reverse that index (so 0001 -> 1000) , and put the symbol in the bit reversed slot. While this is not competitive with the proper sort, it is simple and one pass.
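
For reference, a possible bitreverse helper for that heuristic (just the obvious way to write it; the name matches the code snippet further down) :


static uint32 bitreverse(uint32 x, int numbits)
{
    uint32 r = 0;
    for (int i = 0; i < numbits; i++)
    {
        r = (r << 1) | (x & 1); // pull bits off the bottom of x
        x >>= 1;                //  and push them onto the bottom of r
    }
    return r;
}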

Another possible heuristic is to just scatter the symbols by doing steps that are prime with M. This is what Yann does in fse.c


All the files in Calgary Corpus :
(compression per file; sum of output sizes)

M = 1024

rANS/alphabetical : 1824053.75

bit reverse : 1805230.75

greedy search for empty slots : 1801351

Yann's heuristic in fse.c : 1805503.13

sort , bias = 0.0 : 1817269.88

sort , bias = 0.5 : 1803676.38  (Duda "precise")

sort , bias = 1.0 : 1798930.75

Before anyone throws a fit - yes, I tested on my very large test set, not just Calgary. The results were consistent on all the test sets I tried. I also tested with larger M (4096) and the results were again the same, though the differences are smaller the larger you make M.

For completeness, here is what the sorts actually do :


rANS/alphabetical : AAAAAAABBBBBBCCC

bit reverse :   ABABABACABACABBC

greedy search : CABABACABABACABB

greedy search, LPS first :  ABCABAACBABACBAB

Yann fse :          AAABBCAABBCAABBC

sort , bias = 0.0 : ABCABABCABABCABA

sort , bias = 0.5 : ABCABABACBABACBA

sort , bias = 1.0 : ABABCABABCABAABC

but I caution against judging sorts by whether they "look good" since that criterion does not seem to match coding performance.

Finally for clarity, here's the code for the simpler sorts :


void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );
    
    const int fse_step = (sort_count>>1) + (sort_count>>3) + 1;
    
    int fse_pos = 0;
    int s = 0;
    for LOOP(a,alphabet)
    {
        int count = normalized_counts[a];

        for LOOP(c,count)
        {
            // choose one :

            // rANS :
            sorted_syms[s] = a;

            // fse :
            sorted_syms[fse_pos] = a;
            fse_pos = (fse_pos + fse_step) % sort_count;

            // bitreverse :
            sorted_syms[ bitreverse(s, numbits(sort_count)) ] = a;

            s++;
        }
    }
}

and the code for the actual sorting sort (recommended) :

struct sort_sym
{
    int sym;
    float rank;
    bool operator < (const sort_sym & rhs) const
    {
        return rank < rhs.rank;
    }
};

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    vector<sort_sym> sort_syms;
    sort_syms.resize(sort_count);

    int s = 0;

    for LOOP(sym,alphabet)
    {
        uint32 count = normalized_counts[sym];
        if ( count == 0 ) continue;
        
        float invp = 1.f / count;
        
        float base =  1.f * invp; // bias = 1.0 ; use 0.5f*invp for Duda "precise"

        for LOOP(c,(int)count)
        {
            sort_syms[s].sym = sym;
            sort_syms[s].rank = base + c * invp;
            s++;
        }
    }
    
    ASSERT_RELEASE( s == sort_count );
    
    std::stable_sort(sort_syms.begin(),sort_syms.end());
    
    for LOOP(s,sort_count)
    {
        sorted_syms[s] = sort_syms[s].sym;
    }
}

and for the greedy search :

void make_sort(int * sorted_syms, int sort_count, const uint32 * normalized_counts, int alphabet)
{
    ASSERT( (int) cb::sum(normalized_counts,normalized_counts+alphabet) == sort_count );

    // make all slots empty :
    for LOOP(s,sort_count)
    {
        sorted_syms[s] = -1;
    }
    
    for LOOP(a,alphabet)
    {
        uint32 count = normalized_counts[a];
        if ( count == 0 ) continue;
        
        uint32 step = (sort_count + (count/2) ) / count;
        uint32 first = step/2;
        
        for LOOP(c,(int)count)
        {
            uint32 slot = (first + step * c) % sort_count; // wrap : first + step*c can reach past sort_count
            
            // find an empty slot :
            for(;;)
            {
                if ( sorted_syms[slot] == -1 )
                {
                    sorted_syms[slot] = a;
                    break;
                }
                slot = (slot + 1)%sort_count;
            }
        }
    }
}

small note : the reported results use a greedy search that searches away from slot using +1,-1,+2,-2 , instead of the simpler +1,+2 in this code snippet. This simpler version is very slightly worse.


02-05-14 | Understanding ANS - 7

And we're ready to cover table-based ANS (or "tANS") now.

I'm going to be quite concrete and consider a specific choice of implementation, rather than leaving everything variable. But extrapolation to the general solution is straightforward.

You have integer symbol frequencies Fs. They sum to M. The cumulative frequencies are Bs.

I will stream the state x in bits. I will use the smallest possible renormalization range for this example , I = [ M , 2*M - 1]. You can always use any integer multiple of M that you want (k*M, any k), which will give you more coding resolution (closer to entropy). This is equivalent to scaling up all the F's by a constant factor, so it doesn't change the construction here.

Okay. We will encode/decode symbols using this procedure :


ENCODE                      DECODE

|                           ^
V                           |

stream out                  stream in

|                           ^
V                           |

C(s,x) coding function      D(x) decoding function

|                           ^
V                           |

x'                          x'

We need tables for C() and D(). The constraints are :

D(x') = { x , s }  outputs a state and a symbol

D(x) must be given for x in I = [ M , 2*M - 1 ]

D(x) in I must output each symbol s Fs times

that is, D(x in I) must be an output string made from a permutation of "AA..BB.." , each symbol Fs times

D( C( s, x ) ) = { x , s }  decode must invert coding

C(s,x) = x'  outputs the following state

C(s,x) must be given for x' in I
 that's x in Is

The precursor ranges Is = { x : C(s,x) is in I }
must exist and be of the form Is = [ k , 2k-1 ] for some k

Now, if we combine the precursor range requirement and the invertability we can see :

D(x in I) outputs each s Fs times

C(s,x) with x' in I must input each s Fs times

the size of Is must be Fs

the precursor ranges must be Is = [ Fs, 2*Fs - 1 ]

C(s,x) must be given in M slots

And I believe that's it; those are the necessary and sufficient conditions to make a valid tANS system. I'll go over some more points and fill in some details.

Here's an example of the constraint for an alphabet of "ABC" and M = 8 :

Now, what do you put in the shaded area? You just fill in the output states from 8-15. The order you fill them in corresponds to the output string. In this case the output string must be some permutation of "AAABBBCC".

Here's one way : (and in true Duda style I have confusingly used different notation in the image, since I drew this a long time ago before I started this blog series. yay!)

In the image above I have also given the corresponding output string and the decode table. If you're following along in Duda's paper arxiv 1311.2540v2 this is figure 9 on page 18. What you see in figure 9 is a decode table. The "step" part of figure 9 is showing one method of making the sort string. The shaded bars on the right are showing various permutations of an output string, with a shading for each symbol.

Before I understood ANS I was trying tables like this :


M=16
Fs = {7,6,3}

 S |  0|  1|  2
---|---|---|---
  1|  2|  3|  4
  2|  5|  6| 10
  3|  7|  8| 15
  4|  9| 11| 20
  5| 12| 13| 26
  6| 14| 16| 31
  7| 17| 19|   
  8| 18| 22|   
  9| 21| 24|   
 10| 23| 27|   
 11| 25| 29|   
 12| 28|   |   
 13| 30|   |   

This table does not work. If you're in state x = 7 and you want to encode symbol 2, you need to stream out bits to get into the precursor range I2. So you stream out from x=7 and get to x=3. Now you look in the table and you are going to state 15 - that's not in the range I=[16,31]. No good!

A correct table for those frequencies is :


 S |  0|  1|  2
---|---|---|---
  3|   |   | 18
  4|   |   | 24
  5|   |   | 29
  6|   | 17|   
  7| 16| 20|   
  8| 19| 22|   
  9| 21| 25|   
 10| 23| 27|   
 11| 26| 31|   
 12| 28|   |   
 13| 30|   |   

Building the decode table from the encode table is trivial.
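
Something like this (a sketch; C(s,x) is your encode table in whatever form, and the two decode arrays are illustrative names) :


// invert the encode table ; the precursor range for s is Is = [ Fs , 2*Fs-1 ]
// and C(s,x) lands in I = [ M , 2*M-1 ]
for (int s = 0; s < alphabet; s++)
{
    for (int x = F[s]; x <= 2*F[s] - 1; x++)
    {
        int xp = C(s,x);            // destination state , in [M, 2*M-1]
        decode_sym  [xp - M] = s;   // D(xp) = { s , x }
        decode_state[xp - M] = x;
    }
}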

Note that the decode table D(x) only has to be defined for x in I - that's M entries.

C(x,s) also only has M entries. If you made it naively as a 2d array, it would be |alphabet|*M . eg. something like (256*4096) slots, but most of it would be empty. Of course you don't want to do that.

The key observation is that C(x,s) is only defined over consecutive ranges of x for each s. In fact it's defined over [Fs, 2*Fs-1]. So, we can just pack these ranges together. The starting point in the packed array is just Bs - the cumulative frequency of each symbol. That is :


PC = packed coding table
PC has M entries

C(x,s) = PC[ Bs + (x - Fs) ]


eg. for the {3,3,2} table shown in the image above :

PC = { 8,11,14, 9,12,15, 10,13 }

this allows you to store the coding table also in an array of size M.
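
For concreteness, a sketch of building PC[] straight from the output string S[] (the k'th occurrence of symbol s in the string is the destination for x = Fs + k; this reproduces the PC given above for the {3,3,2} example) :


vector<int> occ(alphabet, 0);
for (int i = 0; i < M; i++)
{
    int s = S[i];                   // i'th slot of the output string
    PC[ B[s] + occ[s] ] = M + i;    // C( s , F[s] + occ[s] ) = M + i
    occ[s]++;
}

// and then the coding function is just :
// C(s,x) = PC[ B[s] + (x - F[s]) ]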

There are a few topics on tANS left to cover but I'll leave them for the next post.


02-04-14 | Understanding ANS - 6

Okay, let's get to streaming.

For illustration let's go back to the simple example of packing arbitrary base numbers into an integer :


// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}

as you encode, state grows, and eventually gets too big to fit in an integer. So we need to flush out some bits (or bytes).

But we can't just stream out bits. The problem is that the decoder does a modulo to get the next value. If we stream in and out high bits, that's equivalent to doing something like +65536 on the value. When you do a mod-3 (or whatever) on that, you have changed what you decode.

If you only ever did mod-pow2's, you could stream bits out of the top at any time, because the decoding of the low bits is not affected by the high bits. This is how the Huffman special case of ANS works. With Huffman coding you can stream in and out any bits that are above the current symbol, because they don't affect the mask at the bottom.

In general we want to stream bits (base 2) or bytes (base 256). To do ANS in general we need to mod and multiply by arbitrary values that are not powers of 2 (or 256).

To ensure that we get decodability, we have to stream such that the decoder sees the exact value that the encoder made. That is :


ENCODE                      DECODE

|                           ^
V                           |

stream out                  stream in

|                           ^
V                           |

C(s,x) coding function      D(x) decoding function

|                           ^
V                           |

x'                          x'

The key thing is that the value of x' that C(s,x) makes is exactly the same that goes into D(x).

This is different from Huffman, as noted above. It's also different than arithmetic coding, which can have an encoder and decoder that are out of sync. An arithmetic decoder only uses the top bits, so you can have more or less of the rest of the stream in the low bits. While the basic ANS step (x/P + C) is a kind of arithmetic coding step, the funny trick we did to take some bits of x and mod it back down to the low bits (see earlier posts) means that ANS is *not* making a continuous arithmetic stream for the whole message that you can jump into anywhere.

Now it's possible there are multiple streaming methods that work. For example with M = a power of 2 in rANS you might be able to stream high bytes. I'm not sure, and I'm not going to talk about that in general. I'm just going to talk about one method of streaming that does work, which Duda describes.

To ensure that our encode & decode streaming produce the same value of x', we need a range to keep it in. If you're streaming in base b, this range is of the form [L, b*L-1] . So, I'll use Duda's terminology and call "I" the range we want x' to be in for decoding, that is


I = [L, b*L-1]

Decoder streams into x :

x <- x*b + get_base_b();

until x is in I

but the encoder must do something a bit funny :

stream out from x

x' = C(s,x)  , coding function

x' now in I

that is, the stream out must be done before the coding function, and you must wind up in the streaming range after the coding function. x' in the range I ensures that the encoder and decoder see exactly the same value (because any more streaming ops would take it out of I).

To do this, we must know the "precursor" ranges for C(). That is :


Is = { x : C(s,x) is in I }

that is, the values of x such that after coding with x' = C(s,x), x' is in I

these precursor ranges depend on s. So the encoder streaming is :

I'm about to encode symbol s

stream out bits from x :

put_base_b( x % b )
x <- x/b

until x is in Is

so we get into the precursor range, and then after the coding step we are in I.

Now this is actually a constraint on the coding function C (because it determines what the Is are). You must be able to encode any symbol from any state. That means you must be able to reach the Is precursor range for any symbol from any x in the output range I. For that to be true, the Is must span a full multiplicative factor of b, just like "I" does. That is,


all Is must be of the form

Is = [ K, b*K - 1 ]

for some K

eg. to be concrete, if b = 2, we're streaming out bits, then Is = { 3,4,5 } is okay, you will be able to get there from any larger x by streaming out bits, but Is = {4,5,6} is not okay.


I = [8, 15]

Is = {4,5,6}

x = 14

x is out of Is, so stream out a bit ; 14 -> 7

x is out of Is, so stream out a bit ; 7 -> 3

x is below Is!  crap!

this constraint will be our primary guide in building the table-based version of ANS.
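
To close, here's roughly what those paired loops look like in code for b=2 (a sketch : K[s] is the bottom of Is, C() is the coding function, the decode tables D_sym/D_state are illustrative names, and remember the bit stream is LIFO, so the decoder consumes these bits in the reverse order the encoder produced them) :


// encode symbol s :
while ( x > 2*K[s] - 1 )   // until x is in Is = [ K[s] , 2*K[s]-1 ]
{
    putbit( x & 1 );
    x >>= 1;
}
x = C(s,x);                // coding step ; x lands in I = [ L , 2*L-1 ]

// decode one symbol :
int s = D_sym[x];          // D(x) = { s , x }
x = D_state[x];
while ( x < L )            // stream back in until x is in I
{
    x = (x << 1) | getbit();
}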


02-03-14 | Understanding ANS - 5

First in case you aren't following them already, you should follow along with ryg and Yann as we all go through this :
RealTime Data Compression A comparison of Arithmetic Encoding with FSE
rANS notes The ryg blog

Getting back to my slow exposition track.

We talked before about how strings like "ABC" specify an ANS state machine. The string is the symbols that should be output by the decoder in each state, and there's an implicit cyclic repeat, so "ABC" means "ABCABC..". The cyclic repeat corresponds to only using some modulo of state in the decoder output.

Simple enumerations of the alphabet (like "ABC") are just flat codes. We saw before that power of two binary-distributed strings like "ABAC" are Huffman codes.

What about something like "AAB" ? State 0 and 1 both output an A. State 2 outputs a B. That means A should have twice the probability of B.

How do we encode a state like that? Putting in a B is obvious, we need to make the bottom of state be a 2 (mod 3) :


x' = x*3 + 2

but to encode an A, if we just did the naive op :

x' = x*3 + 0
or
x' = x*3 + 1

we're wasting a value. Either a 0 or 1 at the bottom would produce an A, so we have a free bit there. We need to make use of that bit or we are wasting code space. So we need to find a random bit to transmit to make use of that freedom. Fortunately, we have a value sitting around that needs to be transmitted that we can pack into that bit - x!

take a bit off x :
b = x&1
x >>= 1


x' = x*3 + b

or :

x' = (x/2)*3 + (x%2)

more generally if the output string is of length M and symbol s occurs Fs times, you would do :

x' = (x/Fs)*M + (x%Fs) + Bs

which is the formula for rANS.
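
In code the rANS step is tiny. A sketch with F[]/B[]/M as above, and no renormalization yet, so x just grows without bound :


void rans_encode(uint32 & x, int s)
{
    x = (x / F[s]) * M + (x % F[s]) + B[s];
}

int rans_decode(uint32 & x)
{
    uint32 f = x % M;     // the bottom M-ary digit specifies the symbol
    int s = 0;
    while ( B[s] + F[s] <= f ) s++;  // find s : B[s] <= f < B[s]+F[s]
    x = F[s] * (x / M) + f - B[s];
    return s;
}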

Now, note that rANS always makes output strings where the symbols are not interleaved. That is, it can make "AAB" but it can't make "ABA". The states that output the same symbol are in consecutive blocks of length Fs.

This is actually not what we want, it's an approximation in rANS.

For example, consider a 3-letter alphabet and M=6. rANS corresponds to an output string of "AABBCC". We'd prefer "ABCABC". To see why, recall the arithmetic coding formula that we wanted to use :


x' = (x/P) + C

the important part being the (x/P). We want x to grow by that amount, because that's what gives us compression to the entropy. If x grows by too much or too little, we aren't coding with codelens that match the probabilities, so there will be some loss.

P = F/M , and we will assume for now that the probabilities are such that the rational expression F/M is the true probability. What we want is to do :


x' = (x*M)/F + C

to get a more accurate scaling of x. But we can't do that because in general that division will cause x' to not fall in the bucket [Cs,Cs+1) , which would make decoding impossible.

So instead, in rANS we did :


x' = floor(x/F)*M + C + (x%F)

the key part here being that we had to do floor(x/F) instead of (x/P), which means the bottom bits of x are not contributing to the 1/P scaling the way we want them to.


eg.

x = 7
F = 2
M = 6
P = 1/3

should scale like

x -> x/P = 21

instead scales like

x -> (x/F)*M + (x%F) = (7/2)*6 + (7%2) = 3*6 + 1 = 19

too low
because we lost the bottom bit of 7 when we did (7/2)

In practice this does in fact make a difference when the state value (x) is small. When x is generally large (vs. M), then (x/F) is a good approximation of the correct scaling. The closer x is to M, the worse the approximation.

In practice with rANS, you should use something like x in 32-bits and M < 16-bits, so you have decent resolution. For tANS we will be using much lower state values, so getting this right is important.

As a concrete example :


alphabet = 3
1000 random symbols coded
H = 1.585
K = 6
6 <= x < 12

output string "ABCABC"
wrote 1608 bits

output string "AABBCC"
wrote 1690 bits

And a drawing of what's happening :

I like the way Jarek Duda called rANS the "range coder" variant of ANS. While the analogy is not perfect, there is a similarity in the way it approximates the ideal scaling and gains speed.

The crucial difference between a "range coder" and prior (more accurate) arithmetic coders is that the range coder does a floor divide :


range coder :

symhigh * (range / symtot)

CACM (and such) :

(symhigh * range) / symtot

this approximation is good as long as range is large and symtot is small, just like with rANS.


02-02-14 | Understanding ANS - 4

Another detour from the slow exposition to mention something that's on my mind.

Let's talk about arithmetic coding.

Everyone is familiar with the standard simplified idea of arithmetic coding. Each symbol has a probability P(s). The sum of all preceding probability is the cumulative probability, C(s).

You start with an interval in [0,1]. Each symbol is specified by a range equal to P(s) located at C(s). You reduce your interval to the range of the current symbol, then put the next symbol within that range, etc. Like this :

As you go, you are making a large number < 1, with more and more bits of precision being added at the bottom. In the end you have to send enough bits so that the stream you wanted is specified. You get compression because more likely streams have larger intervals, and thus can be specified with fewer bits.

In the end we just made a single number that we had to send :


x = C0 + P0 * ( C1 + P1 * (C2 ...

in order to make that value in a FIFO stream, we would have to use the normal arithmetic coding style of tracking a current low and range :

currently at [low,range]
add in Cn,Pn

low <- low + range * Cn
range <- range * Pn

and of course for the moment we're assuming we get to use infinite precision numbers.

But you can make the same final value x another way. Start at the end of the stream and work backwards, LIFO :


LIFO :

x contains all following symbols already encoded
x in [0,1]

x' <- Cn + Pn * x

there's no need to track two variables [low,range], you work from back to front and then send x to specify the whole stream. (This in fact is an ancient arithmetic coder. I think it was originally described by Elias even before the Pasco work. I mention this to emphasize that single variable LIFO coding is nothing new, though the details of ANS are in fact quite new. Like "range coding" vs prior arithmetic coders , it can be the tiny details that make all the difference.) (umm, holy crap, I just noticed the ps. at the bottom of that post ("ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand")).

You can decode an individual step thusly :


x in [0,1]
find s such that C(s) <= x < C(s+1)

x' <- (x - Cs)/Ps
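
As a toy illustration of the LIFO form (doubles standing in for the infinite precision reals, so this only survives a handful of symbols before rounding kills it) :


double encode_lifo(const int * syms, int n, const double * P, const double * Cum)
{
    double x = 0.0;                  // x in [0,1] holds the later symbols
    for (int i = n-1; i >= 0; i--)   // walk the stream backwards
        x = Cum[syms[i]] + P[syms[i]] * x;
    return x;                        // send this one number
}

int decode_lifo_step(double & x, const double * P, const double * Cum, int alphabet)
{
    int s = alphabet-1;
    while ( Cum[s] > x ) s--;        // find s : Cum[s] <= x < Cum[s+1]
    x = (x - Cum[s]) / P[s];
    return s;
}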

Now let's start thinking about doing this in finite precision, or at least fixed point.

If we think of our original arithmetic coding image, growing "x" up from 0, instead of keeping x in [0,1] the whole time, let's keep the active interval in [0,1]. That is, as we go we rescale so that the bottom range is [0,1] :

That is, instead of keeping the decimal at the far left and making a fraction, we keep the decimal at the far right of x and we grow the number upwards. Each coding step is :


x' <- x/Ps + Cs

in the end we get a large number that we have to send using log2(x) bits. We get compression because highly probable symbols will make x grow less than improbable symbols, so more compressable streams will make smaller values of x.

(x = the value before the current step, x' = the value after the current step)

We can decode each step simply if the (x/Ps) are integers, then the Cs is the fractional part, so we just do :


f = frac(x)
find s such that C(s) <= f < C(s+1)

x' <- (x - Cs)*Ps

that is, we think of our big number x as having a decimal point with the fractional part Cs on the right, and the rest of the stream is in a big integer on the left. That is :

[ (x/Ps) ] . [ Cs ]

Of course (x/Ps) is not necessarily an integer, and we don't get to do infinite precision math, so let's fix that.

Let :


Ps = Fs / M

P is in [0,1] , a symbol probability
F is an integer frequency
M is the frequency denominator

Sum{F} = M

Cs = Bs / M

B is the cumulative frequency

now we're going to keep x an integer, and our "decimal" that separates the current symbol from the rest of the stream is a fixed point in the integer. (eg. if M = 2^12 then we have a 12 bit fixed point x).

Our coding step is now :


x' = x*M/Fs + Bs

and we can imagine this as a fixed point :

[ (x*M/Fs) ] . [ Bs ]

in particular the bottom M-ary fraction specifies the current symbol :

( x' mod M ) = Bs

the crucial thing for decoding is that the first part, the (x/P) part which is now (x*M/F) shouldn't mess up the bottom M-ary fraction.

But now that we have it written like this, it should be obvious how to do that, if we just write :


x*M/F -> floor(x/F)*M

then the (mod M) operator on that gives you 0, because it has an explicit *M

so we've made the right (x/P) scaling, and made something that doesn't mess up our bottom mod M for decodability.

But we've lost some crucial bits from x, which contains the rest of the stream. When we did floor(x/F) we threw out some bottom bits of x that we can't get rid of. So we need that (x mod F) back.

Fortunately we have the perfect place to put it. We can specify the current symbol not just with Bs, but with anything in the interval [Bs , Bs + Fs) ! So we can do :


x' = M*floor(x/Fs) + Bs + (x mod Fs)

which is :

x' = [ floor(x/Fs) ] . [ Bs + (x mod Fs) ]

with the integer part growing on the left and the base-M fractional part on the right specifying the current symbol s

and this is exactly the rANS encoding step !

As we encode x grows by (1/P) with each step. We wind up sending x with log2(x) bits, which means the code length of the stream is log2( 1/(P0*P1*...) ) , which is what we want.

For completeness, decoding is straightforwardly undoing the encode step :


f = M-ary fraction of x  (x mod M)
find s such that Bs <= f < Bs+1

x' = Fs * (x/M) + (x mod M) - Bs

or 

x' = Fs * intM(x) + fracM(x) - Bs

and we know

(fracM(x) - Bs) is in [0, Fs)

which is the same as the old arithmetic decode step : x' = (x - Cs)/Ps

Of course we still have to deal with the issue of keeping x in fixed width integers and streaming, which we'll come back to.


02-01-14 | Understanding ANS - 3

I'm gonna take an aside from the slow exposition and jump way ahead to some results. Skip to the bottom for summary.

There have been some unfortunate claims made about ANS being "faster than Huffman". That's simply not true. And in fact it should be obvious that it's impossible for ANS to be faster than Huffman, since ANS is a strict superset of Huffman. You can always implement your Huffman coder by putting the Huffman code tree into your ANS coder, therefore the speed of Huffman is strictly >= ANS.

In practice, the table-based ANS decoder is so extremely similar to a table-based Huffman decoder that they are nearly identical in speed, and all the variation comes from minor implementation details (such as how you do your bit IO).

The "tANS" (table ANS, aka FSE) decode is :


{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = decodeTable[state].nbBits;
  state = decodeTable[state].newState + getbits(nBits);
}

while a standard table-based Huffman decode is :

{
  int sym = decodeTable[state].symbol;
  *op++ = sym;
  int nBits = codelen[sym];
  state = ((state<<nBits)&STATE_MASK) + getbits(nBits);  
}

where for similarity I'm using a Huffman code with the first bits at the bottom. In the Huffman case, "state" is just a portion of the bit stream that you keep in a variable. In the ANS case, "state" is a position in the decoder state machine that has memory; this allows it to carry fractional bits forward.

If you so chose, you could of course put the Huffman codelen and next state into decodeTable[] just like for ANS and they would be identical.

So, let's see some concrete results comparing some decent real world implementations.

I'm going to compare four compressors :


huf = order-0 static Huffman

fse = Yann's implementation of tANS

rans = ryg's implementation of rANS

arith = arithmetic coder with static power of 2 cumulative frequency total and decode table

For fse, rans, and arith I use a 12-bit table (the default in fse.c)
huf uses a 10-bit table and does not limit code length

Runs are on x64 code, but the implementations are 32 bit. (no 64-bit math used)

Some notes on the four implementations will follow. First the raw results :


inName : book1
H = 4.527

arith 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
arith encode     : 0.006 seconds, 69.44 b/kc, rate= 120.08 mb/s
arith decode     : 0.011 seconds, 40.35 b/kc, rate= 69.77 mb/s

"rans 12:   768,771 ->   435,378 =  4.531 bpb =  1.766 to 1 
rans encode      : 0.010 seconds, 44.04 b/kc, rate= 76.15 mb/s
rans decode      : 0.006 seconds, 80.59 b/kc, rate= 139.36 mb/s

fse :   768,771 ->   435,981 =  4.537 bpb =  1.763 to 1 
fse encode       : 0.005 seconds, 94.51 b/kc, rate= 163.44 mb/s
fse decode       : 0.003 seconds, 166.95 b/kc, rate= 288.67 mb/s

huf :   768,771 ->   438,437 =  4.562 bpb =  1.753 to 1 
huf encode       : 0.003 seconds, 147.09 b/kc, rate= 254.34 mb/s
huf decode       : 0.003 seconds, 163.54 b/kc, rate= 282.82 mb/s
huf decode       : 0.003 seconds, 175.21 b/kc, rate= 302.96 mb/s (*1)


inName : pic
H = 1.210

arith 12:   513,216 ->    79,473 =  1.239 bpb =  6.458 to 1 
arith encode     : 0.003 seconds, 91.91 b/kc, rate= 158.90 mb/s
arith decode     : 0.007 seconds, 45.07 b/kc, rate= 77.93 mb/s

rans 12:   513,216 ->    79,474 =  1.239 bpb =  6.458 to 1 
rans encode      : 0.007 seconds, 45.52 b/kc, rate= 78.72 mb/s
rans decode      : 0.003 seconds, 96.50 b/kc, rate= 166.85 mb/s

fse :   513,216 ->    80,112 =  1.249 bpb =  6.406 to 1 
fse encode       : 0.003 seconds, 93.86 b/kc, rate= 162.29 mb/s
fse decode       : 0.002 seconds, 164.42 b/kc, rate= 284.33 mb/s

huf :   513,216 ->   106,691 =  1.663 bpb =  4.810 to 1 
huf encode       : 0.002 seconds, 162.57 b/kc, rate= 281.10 mb/s
huf decode       : 0.002 seconds, 189.66 b/kc, rate= 328.02 mb/s

And some conclusions :

1. "tANS" (fse) is almost the same speed to decode as huffman, but provides fractional bits. Obviously this is a huge win on skewed files like "pic". But even on more balanced distributions, it's a decent little compression win for no decode speed hit, so probably worth doing everywhere.

2. Huffman encode is still significantly faster than tANS encode.

3. "rANS" and "arith" almost have their encode and decode speeds swapped. Round trip time is nearly identical. They use identical tables for encode and decode. In fact they are deeply related, which is something we will explore more in the future.

4. "tANS" is about twice as fast as "rANS". (at this point)

And some implementation notes for the record :


"fse" and "rans" encode the array by walking backwards.  The "fse" encoder output bits forwards and
consume them backwards, while the "rans" encoder writes bits backwards and consumes them forwards.

"huf" and "fse" are transmitting their code tables.  "arith" and "rans" are not.  
They should add about 256 bytes of header to be fair.

"arith" is a standard Schindler range coder, with byte-at-a-time renormalization

"arith" and "rans" here are nearly identical, both byte-at-a-time, and use the exact same tables
for encode and decode.

All times include their table-build times, and the time to histogram and normalize counts.
If you didn't include those times, the encodes would appear faster.

"huf" here is not length-limited.  A huf decoder with a 12-bit table and 12-bit length limitted
codes (like "fse" uses) should be faster.
(*1 = I did a length-limited version with a non-overflow handling decoder)

"huf" here is was implemented with PowerPC and SPU in mind.  A more x86/x64 oriented version should be
a little faster.  ("fse" is pretty x86/x64 oriented).

and todo : compare binary rANS with a comparable binary arithmetic coder.


1-31-14 | Understanding ANS - 2

So last time I wrote about how a string of output symbols like "012" describes an ANS state machine.

That particular string has all the values occurring the same number of times, in as close to the same slots as possible. So they are encoded in nearly the same code length.

But what if they weren't all the same? eg. what if the decode string was "0102" ?

Then to decode, we could take (state % 4) and look it up in that array. For two of the four values we would output a 0.

Alternatively we could say -


if the bottom bit is 0, we output a 0

if the bottom bit is 1, we need another bit to tell if we should output a 1 or 2

So the interesting thing is that now, to encode a 0, we don't need to do state *= 4. Our encode can be :

void encode(int & state, int val)
{
    ASSERT( val >= 0 && val < 3 );
    if ( val == 0 )
    {
        state = state*2;
    }
    else
    {
        state = state*4 + (val-1)*2 + 1;
    }
}

When you encode a 0, the state grows less. In the end, state must be transmitted using log2(state) bits, so when state grows less you send a value in fewer bits.

Note that when you decode you are doing (state % 4), but to encode you only did state *= 2. That means when you decode you will see some bits from previously encoded symbols in your state. That's okay because those different values for state all correspond to the same output. This is why when a symbol occurs more often in the output descriptor string it can be sent in fewer bits.

Now, astute readers may have noticed that this is a Huffman code. In fact Huffman codes are a subset of ANS, so let's explore that subset.

Say we have some Huffman codes, specified by code[sym] and codelen[sym]. The codes are prefix codes in the normal top-bit first sense. Then we can encode them thusly :


void encode(int & state, int sym)
{
    state <<= codelen[sym];
    state |= reversebits( code[sym] ,  codelen[sym] );
}

where reversebits reverses the bits so that it is a prefix code from the bottom bit. Then you can decode either by reading bits one by one to get the prefix code, or with a table lookup :

int decode(int & state)
{
    int bottom = state & ((1<<maxcodelen)-1);
    int val = decodetable[bottom];
    state >>= codelen[val];
    return val;
}

where decodetable[] is the normal huffman fast decode table lookup, but it looks up codes that have been reversed.
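
One brute-force way to build that table (a sketch; fine for setup code since it only runs once per table) :


for (int v = 0; v < (1 << maxcodelen); v++)
{
    // the low codelen[s] bits of v match the reversed code of exactly one s :
    for (int s = 0; s < alphabet; s++)
    {
        int rc = (int) reversebits( code[s], codelen[s] );
        if ( ( v & ((1 << codelen[s]) - 1) ) == rc )
        {
            decodetable[v] = s;
            break;
        }
    }
}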

So, what does this decodetable[] look like? Well, consider the example we did above. That corresponds to a Huffman code like this :


normal top-bit prefix :

0: 0
1: 10
2: 11

reversed :

0:  0
1: 01
2: 11

so the maxcodelen is 2. We enumerate all the 2-bit numbers and how they decode :

00 : 0
01 : 1
10 : 0
11 : 2

decodetable[] = { 0,1,0,2 }

So decodetable[] is the output state string that we talked about before.
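
To see the whole thing work end to end (a little sketch; note that decode pops symbols in LIFO order, and state starts at a nonzero sentinel so we can tell when we're done) :

int state = 1;                  // sentinel
encode(state,0);                // state = 10 (binary)
encode(state,2);                // state = 1011
encode(state,1);                // state = 101101
ASSERT( decode(state) == 1 );
ASSERT( decode(state) == 2 );
ASSERT( decode(state) == 0 );
ASSERT( state == 1 );           // back to the sentinel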

Huffman codes create one restricted set of ANS codes with integer bit length encodings of every symbol. But this same kind of system can be used with more general code lengths, as we'll see later.


1-30-14 | Understanding ANS - 1

I'm trying to really understand Jarek Duda's ANS (Asymmetric Numeral System). I'm going to write a bit as I figure things out. I'll probably make some mistakes.

For reference, my links :

RealTime Data Compression Finite State Entropy - A new breed of entropy coder
Asymmetric Numeral System - Polar
arxiv [1311.2540] Asymmetric numeral systems entropy coding combining speed of Huffman coding with compression rate of arithmetic
encode.ru - Asymetric Numeral System
encode.ru - FSE
Large text benchmark - fpaqa ans
New entropy coding faster than Huffman, compression rate like arithmetic - Google Groups

I actually found Polar's page & code the easiest to follow, but it's also the least precise and the least optimized. Yann Collet's fse.c is very good but contains various optimizations that make it hard to jump into and understand exactly where those things came from. Yann's blog has some good exposition as well.

So let's back way up.

ANS adds a sequence of values into a single integer "state".

The most similar thing that we're surely all familiar with is the way that we pack integers together for IO or network transmission. eg. when you have a value that can be in [0,2] and one in [0,6] and one in [0,11] you have a range of 3*7*12 = 252 so you can fit those all in one byte, and you use packing like :


// encode : put val into state
void encode(int & state, int val, int mod)
{
    ASSERT( val >= 0 && val < mod );
    state = state*mod + val;
}

// decode : remove a value from state and return it
int decode(int & state, int mod )
{
    int val = state % mod;
    state /= mod;
    return val;
}
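
As a quick sanity check, packing the example values above (ranges 3, 7, and 12) and then unpacking them in LIFO order :

int state = 0;
encode(state,2,3);      // the [0,2] value
encode(state,5,7);      // the [0,6] value
encode(state,11,12);    // the [0,11] value
ASSERT( state == 239 ); // fits in one byte, since 239 < 252
ASSERT( decode(state,12) == 11 );
ASSERT( decode(state,7) == 5 );
ASSERT( decode(state,3) == 2 );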

Obviously at this point we're just packing integers, there's no entropy coding, we can't do unequal probabilities. The key thing that we will keep using in ANS is in the decode - the current "state" has a whole sequence of values in it, but we can extract our current value by doing a mod at the bottom.

That is, say "mod" = 3, then this decode function can be written as a transition table :


state   next_state  val
0       0           0
1       0           1
2       0           2
3       1           0
4       1           1
5       1           2
6       2           0
...

In the terminology of ANS we can describe this as "0120120120..." or just "012" and the repeating is implied. That is, the bottom bits of "state" tell us the current symbol by looking up in that string, and then those bottom bits are removed and we can decode more symbols.

Note that encode/decode is LIFO. The integer "state" is a stack - we're pushing values into the bottom and popping them from the bottom.

This simple encode/decode is also not streaming. That is, to put an unbounded number of values into state we would need an infinite length integer. We'll get back to this some day.


12-31-13 | Statically Linked DLL

Oodle for Windows is shipped as a DLL.

I have to do this because the multiplicity of incompatible CRT libs on Windows has made shipping libs for Windows an impossible disaster.

(seriously, jesus christ, stop adding features to your products and make it so that we can share C libs. Computers are becoming an increasingly broken disaster of band-aided together non-functioning bits.)

The problem is that clients (reasonably so) hate DLLs. Because DLLs are also an annoying disaster on Windows (having to distribute multiple files, accidentally loading from an unexpected place, and what if you have multiple products that rely on different versions of the same DLL, etc.).

Anyway, it seems to me that the best solution is actually a "statically linked DLL".

The DLL is the only way on Windows to combine code packages without mashing their CRT together, and being able to have some functions publicly linked and others resolved privately. So you want that. But you don't want an extra file dangling around that causes problems, you just want it linked into your EXE.

You can build your DLL as DelayLoad, and do the LoadLibrary for it yourself, so the client still sees it like a normal import lib, but you actually get the DLL from inside the EXE. The goal is to act like a static lib, but avoid all the link conflict problems.

The most straightforward way to do it would be to link the DLL into your EXE as bytes, and at startup write it out to a temp dir, then LoadLibrary that file.
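
The straightforward version is only a few lines; a sketch, assuming the DLL image got linked into the EXE as a byte array (g_dll_bytes here is hypothetical, and all error handling is omitted) :

#include <windows.h>
#include <stdio.h>
#include <string.h>

extern const unsigned char g_dll_bytes[];   // the DLL image, linked into the EXE
extern const size_t g_dll_bytes_size;

HMODULE LoadEmbeddedDll()
{
    char path[MAX_PATH];
    GetTempPathA(MAX_PATH,path);
    strcat(path,"my_embedded.dll");

    // write the embedded image to a temp file :
    FILE * f = fopen(path,"wb");
    fwrite(g_dll_bytes,1,g_dll_bytes_size,f);
    fclose(f);

    // then load it like any normal DLL :
    return LoadLibraryA(path);
}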

The better way is to write your own "LoadLibraryFromMem". A quick search finds some leads on that :

Loading Win3264 DLLs manually without LoadLibrary() - CodeProject
Loading a DLL from memory » ~magogpublic
LoadLibrary replacement - Source Codes - rohitab.com - Forums

Crazy or wise?


12-12-13 | Call for Game Textures

... since I've had SOooo much luck with these calls for data in the past. Anyway, optimism...

I need game textures! They must be from a recent game (eg. modern resolutions), and I need the original RGB (not the shipped DXTC or whatever).

If you can provide some, please contact me.


11-25-13 | Oodle and the real problems in games

When I started work on Oodle, I specifically didn't want to do a compression library. For one thing, I had done a lot of compression and don't like to repeat the same work, I need new topics and things to learn. For another, I doubted that we could sell a compression library; selling compression is notoriously hard, because there's so much good free stuff out there, and even if you do something amazing it will only be 10% better than the free stuff due to the diminishing returns asymptote; customers also don't understand things like space-speed tradeoffs. But most of all I just didn't think that a compression library solved important problems in games. Any time I do work I don't like to go off into esoteric perfectionism that isn't really helping much, I like to actually attack the important problems.

That's why Oodle was originally a paging / packaging / streaming / data management product. I thought that we had some good work on that at Oddworld and it seemed natural to take those concepts and clean them up and sell that to the masses. It also attacks what I consider to be very important problems in games.

Unfortunately it became pretty clear that nobody would buy a paging product. Game companies are convinced that they "already have" that, or that they can roll it themselves easily (in fact they don't have that, and it's not easy). On the other hand we increasingly saw interest in a compression library, so that's the direction we went.

(I don't mean to imply that the clients are entirely at fault for wanting the wrong thing; it's sort of just inherently problematic to sell a paging library, because it's too fundamental to the game engine. It's something you want to write yourself and have full control over. Really the conception of Oodle was problematic from the beginning, because the ideal RAD product is a very narrow API that can be added at the last second and does something that is not too tied to the fundamental operation of the game, and also that game coders don't want to deal with writing themselves)

The two big problems that I wanted to address with Original Oodle were -

1. Ridiculous game level load times.

2. Ridiculous artist process ; long bake times ; no real previews, etc.

These are very different problems - one is the game runtime, one is in the dev build and tools, but they can actually be solved at the same time by the same system.

Oodle was designed to offer per-resource paging; async IO and loading, background decompression. Resources could be grouped into bundles; the same resource might go into several bundles to minimize loads and seeks. Resources could be stored in various levels of "optimization" and the system would try to load the most-optimized. Oodle would store hashes and mod times so that old/incompatible data wouldn't be loaded. By checking times and hashes you can do a minimal incremental rebuild of only the bundles that need to be changed.

The same paging system can be used for hot-loading, you just page out the old version and page in the new one - boom, hot loaded content. The same system can provide fast in-game previews. You just load an existing compiled/optimized level, and then page in a replacement of the individual resource that the artist is working on.

The standard way to use such a system is that you still have a nightly content build that makes the super-optimized bundles of the latest content, but then throughout the day you can make instant changes to any of the content, and the newer versions are automatically loaded instead of the nightly version. It means that you're still loading optimized bakes for 90% of the content (thus load times and such aren't badly affected) but you get the latest all day long. And if the nightly bake ever fails, you don't stop the studio, people just keep working and still see all the latest, it just isn't all fully baked.

These are important problems, and I still get passionate about them (aka enraged) when I see how awful the resource pipeline is at most game companies.

(I kept trying to add features to the paging product to make it something that people would buy; I would talk to devs and say "what does it need to do for you to license it", and everybody would say something different (and even if it had that feature they wouldn't actually license it). That was a bad road to go down; it would have led to huge feature bloat, been impossible to maintain, and made a product that wasn't as lean and focused as it should be; customers don't know what they want, don't listen to them!)

Unfortunately, while compression is very interesting theoretically, making a compressor that's 5% better than an alternative is just not that compelling in terms of the end result that it has.


11-14-13 | Oodle Packet Compression for UDP

Oodle now has compressors for UDP (unordered / stateless) packets. Some previous posts on this topic :

cbloom rants 05-20-13 - Thoughts on Data Compression for MMOs
cbloom rants 08-08-13 - Oodle Static LZP for MMO network compression
cbloom rants 08-19-13 - Sketch of multi-Huffman Encoder

What I'm doing for UDP packets is static model compression. That is, you pre-train some model based on a capture of typical network data. That model is then const and can be just written out to a file for use in your game. At runtime, you read the model from disk, then it is const and shared by all network channels. This is particularly desirable for large scale servers because there is no per-channel overhead, either in channel startup time or memory use.
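
The pattern in toy form (this is just an illustration of "train offline, const shared model at runtime", with a trivial order-0 model; it is not the Oodle API) :

#include <math.h>

struct StaticModel { double codelen[256]; };    // bits to code each byte value

// offline, once : train from a capture of typical packets, then write to disk
void Train(StaticModel * m, const unsigned char * capture, int n)
{
    int counts[256] = { 0 };
    for (int i=0;i<n;i++) counts[capture[i]]++;
    for (int s=0;s<256;s++)
        m->codelen[s] = -log2( (counts[s]+1) / (double)(n+256) );
}

// runtime : model is const, shared by all channels, usable from any thread
double CompressedBits(const StaticModel * m, const unsigned char * packet, int len)
{
    double bits = 0;
    for (int i=0;i<len;i++) bits += m->codelen[packet[i]];
    return bits;
}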

(ASIDE : Note that there is an alternative for UDP, which is to build up a consistent history between the encoder and decoder by having the decoder send back "acks", and then making sure the encoder uses only ack'ed packets as history, etc. etc. An alternative is to have the encoder mark packets with a description of the history used to encode them, and then when the decoder gets them if it doesn't have the necessary history it drops the packet and requests it be resent or something. I consider these a very bad idea and Oodle won't do them; I'm only looking at UDP compression that uses no transmission history.)

Call for test data! I currently only have a large network capture from one game, which obviously skews my results. If you make a networked game and can provide real-world sample data, please contact me.

Now for a mess of numbers comparing the options.


UDP compression of packets (packet_test.bin)

order-0 static huffman :  371.1 -> 234.5 average
(model takes 4k bytes)

order-0 static multi-huffman (32 huffs) : 371.1 -> 209.1 average
(model takes 128k bytes)

order-2 static arithmetic model : 371.0 -> 171.1 average
(model takes 549444 bytes)

OodleStaticLZP for UDP : 371.0 -> 93.4 average
(model takes 13068456 bytes)

In all cases there is no per-channel memory use. OodleStaticLZP is the recommended solution.

For comparison, the TCP compressors get :


LZB16 models take : 131072 bytes per channel
LZB16 [sw16|ht14] : 371.0 -> 122.6 average

LZNib models take : 1572864 bytes per channel
LZnib [sw19|ht18] : 371.0 -> 90.8 average

LZP models take : 104584 bytes per channel, 12582944 bytes shared
LZP [8|19] : 371.0 -> 76.7 average

zlib uses around 400k per channel
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

For MMO type scenarios (large number of connections, bandwidth is important), LZP is a huge win. It gets great compression with low per-channel memory use. The other compelling use case is LZNib when you are sending large packets (so per-byte speed is important) and have few connections (so per-channel memory use is not important); the advantage of LZNib is that it's quite fast to encode (faster than zlib-3 for example) and gets pretty good compression.

To wrap up, logging the variation of compression under some options.

LZPUDP can use whatever size of static dictionary you want. More dictionary = more compression.


LZPUDP [dic mb | hashtable log2]

LZPUDP [4|18] : 595654217 -> 165589750 = 3.597:1
1605378 packets; 371.0 -> 103.1 average
LZPUDP [8|19] : 595654217 -> 154353229 = 3.859:1
1605378 packets; 371.0 -> 96.1 average
LZPUDP [16|20] : 595654217 -> 139562083 = 4.268:1
1605378 packets; 371.0 -> 86.9 average
LZPUDP [32|21] : 595654217 -> 113670899 = 5.240:1
1605378 packets; 371.0 -> 70.8 average

And MultiHuffman can of course use any number of huffmans.

MultiHuffman [number of huffs | number of random trials]

MultiHuffman [1|8] : 66187074 -> 41830922 = 1.582:1
178376 packets; 371.1 -> 234.5 average, H = 5.056
MultiHuffman [2|8] : 66187074 -> 39869575 = 1.660:1
178376 packets; 371.1 -> 223.5 average, H = 4.819
MultiHuffman [4|8] : 66187074 -> 38570016 = 1.716:1
178376 packets; 371.1 -> 216.2 average, H = 4.662
MultiHuffman [8|8] : 66187074 -> 38190760 = 1.733:1
178376 packets; 371.1 -> 214.1 average, H = 4.616
MultiHuffman [16|8] : 66187074 -> 37617159 = 1.759:1
178376 packets; 371.1 -> 210.9 average, H = 4.547
MultiHuffman [32|8] : 66187074 -> 37293713 = 1.775:1
178376 packets; 371.1 -> 209.1 average, H = 4.508

On the test data that I have, the packets are pretty homogenous, so more huffmans is not a huge win. If you had something like N very different types of packets, you would expect to see big wins as you go up to N and then pretty flat after that.


Public note to self : it would amuse me to try ACB for UDP compression. ACB with dynamic dictionaries is not Pareto because it's just too slow to update that data structure. But with a static precomputed suffix sort, and optionally dynamic per-channel coding state, it might be good. It would be slower & higher memory use than LZP, but more compression.


10-14-13 | Oodle's Fast LZ4

Oodle now has a compressor called "LZB" (LZ-Bytewise) which is basically a variant of Yann's LZ4 . A few customers were testing LZNib against LZ4 and snappy and were bothered that it was slower to decode than those (though it offers much more compression and I believe it is usually a better choice of tradeoffs). In any case, OodleLZB is now there for people who care more about speed than ratio.

Oodle LZB has some implementation tricks that could also be applied to LZ4. I thought I'd go through them as an illustration of how you make compressors fast, and to give back to the community since LZ4 is nicely open source.

OodleLZB is 10-20% faster to decode than LZ4. (1900 mb/s vs. 1700 mb/s on lzt99, and 1550 mb/s vs. 1290 mb/s on all_dds). The base LZ4 implementation is not bad, it's in fact very fast and has the obvious optimizations like 8-byte word copies and so on. I'm not gonna talk about fiddly junk like do you do ptr++ or ++ptr , though that stuff can make a difference on some platforms. I want to talk about how you structure a decode loop.

The LZ4 decoder is like this :


U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

    ip += lrl;
    op += lrl;

    if ( EOF ) break;

    // get offset
    int offset = *((U16 *)ip);
    ip+=2;

    int ml = ml_control + 4;
    if ( ml_control == 0xF ) AddExcess(ml,ip);

    // copy match :
    if ( overlap )
    {
        copy_match_overlap(op, op - offset, ml );
    }
    else
    {
        copy_match_nooverlap(op, op - offset, ml );
    }

    op += ml;

    if ( EOF ) break;
}

and AddExcess is :

// keep adding excess bytes to val until a non-0xFF byte terminates the run :
#define AddExcess(val,cp)   do { int b = *cp++; val += b; if ( b != 0xFF ) break; } while(1)

So, how do we speed this up?

The main thing we want to focus on is branches. We want to reduce the number of branches on the most common paths, and we want to make maximum use of the branches that we can't avoid.

There are four branches we want to pay attention to :

1. the checks for control nibbles = 0xF
2. the check for match overlap
3. the check of LRL and match len inside the match copiers
4. the EOF checks

So last one first; the EOF check is the easiest to eliminate and also the least important. On modern chips with good branch prediction, the highly predictable branches like that don't cost much. If you know that your input stream is not corrupted (because you've already checked a CRC of some kind), then you can put the EOF check under one of the rare code cases, like perhaps LRL control = 0xF. Just make the encoder emit that rare code when it hits EOF. On Intel chips that makes almost no speed difference. (if you need to handle corrupted data without crashing, don't do that).
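
Concretely, hiding EOF under the rare lrl path looks something like this (a sketch; EOF_EXCESS is a hypothetical reserved excess value that the encoder only emits at end of stream) :

    // get excess lrl
    if ( lrl == 0xF )
    {
        AddExcess(lrl,ip);

        // we're on the rare path anyway, so this compare is nearly free :
        if ( lrl == EOF_EXCESS )
            break;
    }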

On to the substantial ones. Note that branch #3 is hidden inside "copy_literals" and "copy_match". copy_literals is something like :


for(int i=0;i<lrl;i+=8)
{
    *((U64 *)(dest+i)) = *((U64 *)(src+i));
}

(the exact way to do copy_literals quickly depends on your platform; in particular are offset-addresses free? and are unaligned loads fast? Depending on those, you would write the loop in different ways.)

We should notice a couple of things about this. One is that the first branch on lrl is the most important. Later branches are rare, and we're also covering a lot of bytes per branch in that case. When lrl is low, we're getting a high number of branches per byte. Another issue about that is that the probability of taking the branch is very different the first time, so you can help the branch-prediction in the chip by doing some explicit branching, like :


if ( lrl > 0 )
{
    *((U64 *)(dest)) = *((U64 *)(src));
    if ( lrl > 8 )
    {
        *((U64 *)(dest+8)) = *((U64 *)(src+8));
        if ( lrl > 16 )
        {
            // .. okay now do a loop here ...
        }
    }
}

We'll also see later that the branch on ( lrl > 0 ) is optional, and in fact it's best to not do it - just always unconditionally/speculatively copy the first 8 bytes.

The next issue we should see is that the branch for lrl > 16 there for the copier is redundant with the check for the control value (lrl == 0xF). So we should merge them :


change this :

    // get excess lrl
    if ( lrl == 0xF ) AddExcess(lrl,ip);

    // copy literals :
    copy_literals(op, ip, lrl);

to :

    // get excess lrl
    if ( lrl == 0xF )
    {
        AddExcess(lrl,ip);

        // lrl >= 15
        copy_literals_long(op, ip, lrl);
    }
    else
    {
        // lrl < 15
        copy_literals_short(op, ip, lrl);
    }

this is the principle that we can't avoid the branch on the LRL escape code, so we should make maximum use of it. That is, it carries extra information - the branch tells us something about the value of LRL, and any time we take a branch we should try to make use of all the information it gives us.

But we can do more here. If we gather some stats about LRL we see something like :

% of LRL >= 0xF : 8%
% of LRL > 8 : 15%
% of LRL <= 8 : 85%

The vast majority of LRL's are short. So we should first detect and handle that case :

    if ( lrl <= 8 )
    {
        if ( lrl > 0 )
        {
            *((U64 *)(op)) = *((U64 *)(ip));
        }
    }
    else
    {
        // lrl > 8
        if ( lrl == 0xF )
        {
            AddExcess(lrl,ip);

            // lrl >= 15
            copy_literals_long(op, ip, lrl);
        }
        else
        {
            // lrl > 8 && < 15
            *((U64 *)(op)) = *((U64 *)(ip));
            *((U64 *)(op+8)) = *((U64 *)(ip+8));
        }
    }

which we should then rearrange a bit :

    // always immediately speculatively copy 8 :
    *((U64 *)(op)) = *((U64 *)(ip));

    if_unlikely( lrl > 8 )
    {
        // lrl > 8
        // speculatively copy next 8 :
        *((U64 *)(op+8)) = *((U64 *)(ip+8));

        if_unlikely ( lrl == 0xF )
        {
            AddExcess(lrl,ip);

            // lrl >= 15
            // can copy first 16 without checking lrl
            copy_literals_long(op, ip, lrl);
        }
    }
    
    op += lrl;
    ip += lrl;

and we have a faster literal-copy path.

Astute readers may notice that we can now be stomping past the end of the buffer, because we always do the 8 byte copy regardless of LRL. There are various ways to prevent this. One is to make your EOF check test for (end - 8), and when you break out of that loop, then you have a slower/safer loop to finish up the end. (ADD : not exactly right, see notes in comments)

Obviously we should do the exact same procedure with the ml_control. Again the check for spilling the 0xF nibble tells us something about the match length, and we should use that information to our advantage. And again the short matches are by far the most common, so we should detect that case and make it as quick as possible.

The next branch we'll look at is the overlap check. Again some statistics should be our guide : less than 1% of all matches are overlap matches. Overlap matches are important (well, offset=1 overlaps are important) because they are occasionally very long, but they are rare. One option is just to forbid overlap matches entirely, and really that doesn't hurt compression much. We won't do that. Instead we want to hide the overlap case in another rare case : the excess ML 0xF nibble case.

The way to make compressors fast is to look at the code you have to write, and then change the format to make that code fast. That is, write the code the way you want it and the format follows - don't think of a format and then write code to handle it.

We want our match decoder to be like this :


if ( ml_control < 0xF )
{
    // 4-18 byte short match
    // no overlap allowed
    // ...
}
else
{
    // long match OR overlap match
    if ( offset < 8 )
    {
        ml = 4; // back up to MML
        AddExcess(ml,ip);

        // do offset < 8 possible overlap match
        // ...
    }
    else
    {
        AddExcess(ml,ip);

        // do offset >= 8 long match
        // ...
    }
}

so we change our encoder to make data like that.

Astute readers may note that overlap matches now take an extra byte to encode, which does cost a little compression ratio. If we like we can fix that by reorganizing the code stream a little (one option is to send one ml excess byte before sending offset and put a flag value in there; since this is all in the very rare pathway it can be more complex in its encoding), or we can just ignore it, it's around a 1% hit.

That's it.

One final note, if you want to skip all that, there is a short cut to get much of the win.

The simplest case is the most important - no excess lrl, no excess ml - and it occurs around 40% of all control bytes. So we can just detect it and fastpath that case :


U8 * ip; // = compressed input
U8 * op; // = decompressed output

for(;;)
{
    int control = *ip++;

    int lrl = control>>4;
    int ml_control = control&0xF;

    if ( (control & 0x88) == 0 )
    {
        // lrl < 8 and ml_control < 8

        // copy literals :
        *((U64 *)(op)) = *((U64 *)(ip));
        ip += lrl;
        op += lrl;

        // get offset
        int offset = *((U16 *)ip);
        ip+=2;
    
        int ml = ml_control + 4;    

        // copy match; 4-11 bytes :
        U8 * from = op - offset;
        *((U64 *)(op)) = *((U64 *)(from));
        *((U32 *)(op+8)) = *((U32 *)(from+8));

        op += ml;

        continue;
    }

    // ... normal decoder here ...
}

This fastpath doesn't help much with all the other improvements we've done here, but you can just drop it on the original decoder and get much of the win. Of course beware the EOF handling. Also this fastpath assumes that you have forbidden overlaps from the normal codes and are sending the escape match code (0xF) for overlap matches.

ADDENDUM : A few notes :

In case it's not clear, one of the keys to this type of fast decoder is that there's a "hot region" in front of the output pointer that contains undefined data, which we keep stomping on over and over. eg. when we do the 8-byte literal blasts regardless of lrl. This has a few consequences you must be aware of.

One is if you're trying to decode in a minimum size circular window (eg. 64k in this case). Offsets to matches that are near window size (like 65534) are actually wrapping around to be just *ahead* of the current output pointer. You cannot allow those matches in the hot region because those bytes are undefined. There are a few fixes - don't use this decoder for 64k circular windows, or don't allow your encoder to output offsets that cause that problem.

A similar problem arises with parallelizing the decode. If you're decoding chunks of the stream in parallel, you cannot allow the hot region of one thread to be stomping past the end of its chunk, which another thread might be reading from. To handle this Oodle defines "seek points" which are places that you are allowed to parallelize the decode, and the coders ensure that the hot regions do not cross seek points. That is, within a chunk up to the seek point, the decoder is allowed to do these wild blast-aheads, but as it gets close to a seek point it must break out and fall into a safer mode (this can be done with a modified end condition check).


10-09-13 | Urgh ; Threads and Memory

This is a problem I've been trying to avoid really facing, so I keep hacking around it, but it keeps coming back to bite me every few months.

Threads/Jobs and memory allocation is a nasty problem.

Say you're trying to process some 8 GB file on a 32-bit system. You'd like to fire up a bunch of threads and let them all crank on chunks of the file simultaneously. But how big of a chunk can each thread work on? And how many threads can you run?

The problem is those threads may need to do allocations to do their processing. With free-form allocations you don't necessarily know in advance how much they need to allocate (it might depend on processing options or the data they see or whatever). With a multi-process OS you also don't know in advance how much memory you have available (it may reduce while you're running). So you can't just say "I have the memory to run 4 threads at a time". You don't know. You can run out of memory, and you have to abort the whole process and try again with less threads.

In case it's not obvious, you can't just try running 4 threads, and when one of them runs out of memory you pause that thread and run others, or kill that thread, because the thread may do work and allocations incrementally, like work,alloc,work,alloc,etc. so that by the time an alloc fails, you're already holding a bunch of other allocs and no other thread may be able to run.

To be really clear, imagine you have 2 MB free and your threads do { alloc 1 MB, work A, alloc 1 MB, work B }. You try to run 2 threads, and they both get up to work A. Now neither thread can continue because you're out of memory.

The real solution is for each Job to pre-declare its resource requirements. Like "I need 80 MB" to run. Then it becomes the responsibility of the Job Manager to do the allocation, so when the Job is started, it is handed the memory and it knows it can run; all allocations within the Job then come from the reserved pool, not from the system.
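
A sketch of the idea (names hypothetical; a real Job Manager would reserve from a pool rather than calling malloc, but the shape is the same) :

#include <stdlib.h>

struct Job
{
    size_t memory_needed;       // pre-declared by the job's author
    void (*run)(void * pool, size_t pool_size); // all allocs come from pool
};

// the Job Manager only starts a job once it already holds the job's memory :
bool TryStartJob(Job * job)
{
    void * pool = malloc(job->memory_needed);
    if ( ! pool )
        return false;   // don't start; go run a job that needs less memory
    job->run(pool, job->memory_needed);
    free(pool);
    return true;
}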

(there are of course other solutions; for example you could make all your jobs rewindable, so if one fails an allocation it is aborted (and any changes to global state undone), or similarly all your jobs could work in two stages, a "gather" stage where allocs are allowed, but no changes to the global state are allowed, and a "commit" phase where the changes are applied; the job can be aborted during "gather" but must not fail during "commit").

So the Job Manager might try to allocate memory for a job, fail, and run some other jobs that need less memory. eg. if you have jobs that take { 10, 1, 10, 1 } of memories, and you have only 12 memories free, you can't run the two 10's at the same time, but you can run the 1's while a 10 is running.

While you're at it, you may as well put some load-balancing in your Jobs as well. You could have each Job mark up to what extent it needs CPU, GPU, or IO resources (in addition to memory use). Then the Job Manager can try to run jobs that are of different types (eg. don't run two IO-heavy jobs at the same time).

If you want to go even more extreme, you could have Jobs pre-declare the shared system resources that they need locks on, and the Job Manager can try to schedule jobs to avoid lock contention. (the even super extreme version of this is to pre-declare *all* your locks and have the Job Manager take them for you, so that you are guaranteed to get them; at this point you're essentially making Jobs into snippets that you know cannot ever fail and cannot even ever *block*, that is they won't even start unless they can run straight to completion).

I haven't wanted to go down this route because it violates one of my Fundamental Theorems of Jobs, which is that job code should be the same as main-thread code, not some weird meta-language that requires lots of markup and is totally different code from what you would write in the non-threaded case.

Anyway, because I haven't properly addressed this, it means that in low-memory scenarios (eg. any 32-bit platform), the Oodle compressors (at the optimal parse level) can run out of memory if you use too many worker threads, and it's hard to really know that's going to happen in advance (since the exact memory use depends on a bunch of options and is hard to measure). Bleh.

(and obviously what I need to do for Oodle, rather than solving this problem correctly and generally, is just to special case my LZ string matchers and have them allocate their memory before starting the parallel compress, so I know how many threads I can run)


10-04-13 | The Reality of Coding

How you actually spend time.

4 hours - think about a problem, come up with a new algorithm or variation of algorithm, or read papers to find an algorithm that will solve your problem. This is SO FUCKING FUN and what keeps pulling me back in.

8 hours - do initial test implementation to prove concept. It works! Yay!

And now the fun part is over.

50 hours - do more careful implementation that handles all the annoying corner cases; integrate with the rest of your library; handle failures and so on. Provide lots of options for slightly different use cases that massively increase the complexity.

50 hours - adding features and fixing rare bugs; spread out over the next year

20 hours - have to install new SDKs to test it; inevitably they've broken a bunch of APIs and changed how you package builds so waste a bunch of time on that

10 hours - some stupid problem with Win8 loading the wrong drivers; or the linux dir my test is writing to is no longer writeble by my user account; whatever stupid computer problem; chase that around for a while

10 hours - the p4 server is down / vpn is down / MSVC has an internal compiler error / my laptop is overheating / my hard disk is full, whatever stupid problem always holds you up.

10 hours - someone checks in a breakage to the shared library; it would take a minute just to fix it, but you can't do that because it would break them so you have to do meetings to agree on how it should be fixed

10 hours - some OS API you were using doesn't actually behave the way you expected, or has a bug; some weird corner case or undocumented interaction in the OS API that you have to chase down

40 hours - writing docs and marketing materials, teaching other members of your team, internal or external evangelizing

30 hours - some customer sees a bug on some specific OS or SDK version that I no longer have installed; try to remote debug it, that doesn't work, try to find a repro, that doesn't work, give up and install their setup; in the end it turns out they had bad RAM or something silly.

The reality is that as a working coder, the amount of time you actually get to spend working on the good stuff (new algorithms, hacking, clever implementations) is vanishingly small.


10-03-13 | SetLastError(0)

Public reminder to myself about something I discovered a while ago.

If you want to do IO really robustly in Windows, you can't just assume that your ReadFile / WriteFile will succeed under normal usage. There are lots of nasty cases where you need to retry (perhaps with smaller IO sizes, or just after waiting a bit).

In particular you can see these errors in normal runs :


ERROR_NOT_ENOUGH_MEMORY = too many AsyncIO 's pending

ERROR_NOT_ENOUGH_QUOTA  = single IO call too large
    not enough process space pages available
    -> SetProcessWorkingSetSize

ERROR_NO_SYSTEM_RESOURCES = 
    failure to alloc pages in the kernel address space for the IO
    try again with smaller IOs  

ERROR_IO_PENDING = 
    normal async IO return value

ERROR_HANDLE_EOF = 
    sometimes normal EOF return value

anyway, this post is not about the specifics of IO errors. (random aside : I believe that some of these annoying errors were much more common in 32-bit windows; the failure to get address space to map IO pages was a bigger problem in 32-bit (I saw it most often when running with the /3GB option which makes the kernel page space a scarce commodity), I don't think I've seen it in the field in 64-bit windows)
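
As an example of the kind of retry loop I mean, a sketch for the synchronous case (a real version would also want waits/backoff and async handling) :

BOOL RobustRead(HANDLE h, void * buf, DWORD size)
{
    DWORD chunk = size;
    DWORD done = 0;
    while ( done < size )
    {
        DWORD want = ( chunk < (size - done) ) ? chunk : (size - done);
        DWORD got = 0;
        SetLastError(0); // see the main point of this post, below
        if ( ReadFile(h, (char *)buf + done, want, &got, NULL) )
        {
            if ( got == 0 ) return FALSE; // EOF
            done += got;
        }
        else
        {
            DWORD err = GetLastError();
            if ( err == ERROR_NO_SYSTEM_RESOURCES && chunk > 4096 )
                chunk /= 2; // kernel couldn't map the IO ; retry with smaller pieces
            else
                return FALSE;
        }
    }
    return TRUE;
}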

I discovered a while ago that ReadFile and WriteFile can fail (return false) but not set last error to anything. That is, you have something like :


SetLastError(77); // something bogus

if ( ! ReadFile(....) )
{
    // failure, get code :
    DWORD new_error = GetLastError();

    // new_error should be the error info about ReadFile failing
    // but sometimes it's still 77
    ...
}

and *sometimes* new_error is still 77 (or whatever; that is, it wasn't actually set when ReadFile failed).

I have no idea exactly what situations make the error get set or not. I have no idea how many other Win32 APIs are affected by this flaw, I only have empirical proof of ReadFile and WriteFile.

Anyhoo, the conclusion is that best practice on Win32 is to call SetLastError(0) before invoking any API where you need to know for sure that the error code you get was in fact set by that call. eg.


SetLastError(0);
if ( ! SomeWin32API(...) )
{
    DWORD hey_I_know_this_error_is_legit = GetLastError();

}

That is, Win32 APIs returning failure does *not* guarantee that they set LastError.


ADD : while I'm at it :

$err
$err,hr
in the VC watch window is pretty sweet.

GetLastError is :

*(DWORD *)($tib+0x34)

or *(DWORD *)(FS:[0x34]) on x86


09-27-13 | Playlist for Rainy Seattle

Who the fuck turned the lights out in the world? Hello? I'm still here, can you turn the lights back on please?

Playlist for the gray :


(Actually now that I think about it, "playlist for the gray" is really just the kind of music I listened to when I was young (and, apparently, depressed). It reminds me of sitting in the passenger seat of a car, silent, looking out the window, it's raining, the world passing.)


Music request :

Something I've been seeking for a while and am having trouble finding : really long repetitive tracks. Preferably analog, not techno or dance music. Like some guitar strumming and such. Not pure drone, not just like one phrase repeated over and over, but a song, a proper song, just a really long version of that song. And not some awful Phish crap either; I don't mean "jazz" or anything with long improvs that get away from the basic song structure. I don't want big Mogwai walls of sound or screeching anything either; nothing experimental; I hate songs that build to a big noise crescendo, no no, just keep the steady groove going. Just regular songs, but that go on and on.

Any pointers appreciated. Particularly playlists or "best of" lists on music sites. There must be some DJ kid doing super-long remixes of classic rock songs, right? Where is he?

Some sort of in the right vein examples :

Neu! - Hallogallo (10 min)
Traffic - Mr Fantasy Live (11 min) (maybe not a great example, too solo-y)
YLT - Night Falls on Hoboken (17 min)


09-20-13 | House Design

Blah blah blah cuz it's on my mind.

First a general rant. There's nothing worse than "designers". The self-important sophomoric egotism of them is just mind boggling. Here's this product (or in this case, house plan) that has been refined over 1000's of years by people actually using it. Oh no, I know better. I'm so fucking great that I can throw all that out and come up with something off the top of my head and it will be an improvement. I don't have to bother with prototyping or getting evaluations from the people who actual use this stuff every day because my half-baked ideas are so damn good. I don't need to bother learning about how this product or layout has been fiddled with and refined in the past because my idea *must* be brand new, no one could have possibly done the exact same thing before and proved that it was a bad idea.

And onto the random points -


X. Of course the big fallacy is that a house is going to make your life better or fix anything. One of the most hilarious variants of this is people who put in a specific room for a gym or a bar or a disco, because of course in their wonderful new house they'll be working out and having parties and lots of friends. Not sitting in front of the TV browsing donkey porn like they do in their old house.

X. Never use new technology. It won't work long term, or it will be obsolete. You don't want a house that's like a computer and needs to be replaced in a few years. Use old technology that works. That goes for both the construction itself and for any integrated gadgets. Like if you get some computer-controlled power and lighting system; okay, good luck with that, I hope it's not too hard to rip out of the walls in five years when it breaks and the replacement has gone out of production. Haven't you people ever used electronics in your life? How do you not know this?

X. Living roof? WTF are you thinking? What a nightmare of maintenance. And it's just a huge ceiling leak inevitably. Yeah, I'm sure that rubber membrane that had several leaks during construction is gonna be totally fine for the next 50 years.

X. Assume that all caulks, all rubbers, all glues, all plastics will fail at some point. Make them easy to service and don't rely on them for crucial waterproofing.

X. Waterproofing should be primarily done with the "shingle principle". That is, mechanical overlaps - not glues, caulks, gaskets, coatings. Water should have to run upwards in order to get somewhere you don't want it.

X. Lots of storage. People these days want to eliminate closets to make rooms bigger. WTF do you need those giant rooms for? Small rooms are cosy. Storage everywhere makes it easy to keep the rooms uncluttered. So much nicer to have neat clean small rooms. The storage needs to be in every single room, not centralized, so you aren't walking all over the house every time you want something.

X. Rooms are good. Small rooms. I feel like there are two extreme trends going on these days that are both way off the ideal - you have the McMansion types that are making 5000 sqft houses, and then the "minimalist" types trying to live in 500 sqft to prove some stupid point. Both are retarded. I think the ideal amount of space for two people is around 1200-1500 sqft. For a family of three it's 1500-2000 or so.

X. Doors are good. Lofts are fucking retarded. Giant open single spaces are awful. Yeah it's okay if you're single, but if there are two people in the house you might just want to do different things and not hear each other. Haven't you ever lived in a place like that? It's terrible. Doors and separated spaces are wonderful things. (I like traditional Japanese interiors with the sliding screens so that you can rearrange spaces; maybe an open living-dining room, but with a sliding door through the middle to make it into two rooms when you want that? Not sure.)

X. Shadow gaps, weird kitchen islands, architectural staircases, sunken areas, multiple levels. Bleh. You've got to think about the cleaning. These things might look okay when it's first built, but they're a nightmare for maintenance.

X. Use trim. The popular thing these days seems to be trim-less walls. (part of the sterile awful "I live in a museum" look). In classic construction trim is partly used to hide the boundary between two surfaces that might not have a perfect joint, or that are different materials and thus might move against each other over time. With fancy modern construction the idea is that you don't have a bad joint that you have to hide, so you can do away with the baseboards and crown molding for a cleaner look. Blah, no, wrong. Baseboards are not just for hiding the joint, they serve a purpose. You can clean them, you can kick them, and they protect the bottom of your walls. They also provide a visual break from the floor to the wall, though that's a matter of taste.

X. I don't see anybody do the things that are so fucking obviously necessary these days. One is that all your wiring should be accessible for redoing, because we're going to continue to get new internet and cabling needs. Really you should run PVC tubes through your walls with fish-wires in them so that you can just easily pull new wiring through your house. (or of course if you make one of those god-awful modern places you should just do exposed pipes and wires; it's one of the few advantages of modern/industrial style, wtf. don't take the style and then reject the advantage of it). You should have charging stations that are just cupboards with outlets inside the cupboard so that you can hide all your charger cords and devices. There should be tons of outlets and perhaps they should be hidden; you could make them styled architectural features in some cases, or hide them in some wood trim or something. Another option would be to have retractable power outlets that coil up inside the wall and you can pull out as needed. Another idea is your baseboards could have a hollow space behind them, so you could snap them off and run cords around the room hidden behind the baseboards.


It would have been fun to be an architect. I have a lot of ideas about design, and I appreciate being in physical spaces that do something special to your experience. I love making physical objects that you can see and touch and show other people; it's so frustrating working in code and never making anything real.

But I'm sure I would have failed. For one thing, being an architect requires a lot of salesmanship and bullshitting. Particularly at the top levels, it's more about being a celebrity persona than about your work (just like art, cooking, etc.). To make any money or get the good commissions as an architect you have to have a bit of renown; you need to get paid extra because you're a name that's getting magazine attention.

But it's really more about following trends than about doing what's right. I suppose that's just like cooking too. Magazines (and blogs now) have a preconceived idea of what is "current" or "cutting edge" (which is not actually cutting edge at all, because it's a widespread cliche by that time), and they look for people that fit their preconceived idea. If you're a cook that does fucking boring ass generic "farm to table" and "sustainably raised" shit, then you're current and people will feature you in the news. If you just stick to what you know is delicious and ignore the stupid fads, then you're ignored.


09-18-13 | Per-Thread Global State Overrides

I wrote about this before ( cbloom rants 11-23-12 - Global State Considered Harmful ) but I'm doing it again because I think almost nobody does it right, so I'm gonna be really pedantic.

For concreteness, let's talk about a Log system that is controlled by bit-flags. So you have a "state" variable that is an or of bit flags. The flags are things like where do you output to (LOG_TO_FILE, LOG_TO_OUTPUTDEBUGSTRING, etc.) and maybe things like subsection enablements (LOG_SYSTEM_IO, LOG_SYSTEM_RENDERER, ...) or verbosity (LOG_V0, LOG_V1, ...). Maybe some bits of the state are an indent level. etc.

So clearly you have a global state where the user/programmer have set the options they want for the log.

But you also need a TLS state. You want to be able to do things like disable the log in scopes :


..

U32 oldState = Log_SetState(0);

FunctionThatLogsTooMuch();

Log_SetState(oldState);

..

(and in practice it's nice to use a scoper-class to do that for you). If you do that on the global variable, your thread is fucking up the state of other threads, so clearly it needs to be per-thread, eg. in the TLS. (similarly, you might want to inc the indent level for a scope, or change the verbosity level, etc.).

(note of course this is the "system has a stack of states which is implemented in the program stack").

So clearly, those need to be Log_SetLocalState. Then the functions that are used to set the overall options should be something like Log_SetGlobalState.

Now some notes on how the implementation works.

The global state has to be thread safe. It should just be an atomic var :


static U32 s_log_global_state;

U32 Log_SetGlobalState( U32 state )
{
    // set the new state and return the old; this must be an exchange

    U32 ret = Atomic_Exchange(&s_log_global_state, state , mo_acq_rel);

    return ret;
}

U32 Log_GetGlobalState( )
{
    // probably could be relaxed but WTF let's just acquire

    U32 ret = Atomic_Load(&s_log_global_state, mo_acquire);

    return ret;
}

(note that I sort of implicitly assume that there's only one thread (a "main" thread) that is setting the global state; generally it's set by command line or .ini options, and maybe from user keys in a HUD; the global state is not being fiddled with by lots of threads as the program runs, because that creates races. eg. if you wanted to do something like turn on the LOG_TO_FILE bit, it should be done with a CAS loop or an Atomic OR, not by doing a _Get and then _Set).

Now the Local functions need to set the state in the TLS and *also* which bits are set in the local state. So the actual function is like :


per_thread U32_pair tls_log_local_state;

U32_pair Log_SetLocalState( U32 state , U32 state_set_mask )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    // write TLS :

    tls_log_local_state = U32_pair( state, state_set_mask );

    return ret;
}

U32_pair Log_GetLocalState( )
{
    // read TLS :

    U32_pair ret = tls_log_local_state;

    return ret;
}

Note obviously no atomics or mutexes are need in per-thread functions.
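
(As an aside, the scoper class mentioned at the top would be built on these; a sketch :)

struct LogLocalStateScoper
{
    U32_pair m_saved;

    LogLocalStateScoper( U32 state, U32 mask )
    {
        m_saved = Log_SetLocalState( state, mask );
    }
    ~LogLocalStateScoper()
    {
        // restore whatever local state was active before this scope :
        Log_SetLocalState( m_saved.first, m_saved.second );
    }
};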

So now we can get the effective combined state :


U32 Log_GetState( )
{
    U32_pair local = Log_GetLocalState();
    U32 global = Log_GetGlobalState();

    // take local state bits where they are set, else global state bits :

    U32 state = (local.first & local.second) | (global & (~local.second) );

    return state;
}

So internally to the log's operation you start every function with something like :

static bool NoState( U32 state )
{
    // if all outputs or all systems are turned off, no output is possible
    return ((state & LOG_TO_MASK) == 0) ||
        ((state & LOG_SYSTEM_MASK) == 0);
}

void Log_Printf( const char * fmt, ... )
{
    U32 state = Log_GetState();

    if ( NoState(state) )
        return;

    ... more here ...

}

So note that up to the "... more here ..." we have not taken any mutexes or in any way synchronized the threads against each other. So when the log is disabled we just exit there before doing anything painful.

Now the point of this post is not about a log system. It's that you have to do this any time you have global state that can be changed by code (and you want that change to only affect the current thread).

In the more general case you don't just have bit flags, you have arbitrary variables that you want to be per-thread and global. Here's a helper struct to do a global atomic with thread-overridable value :

            
struct tls_intptr_t
{
    int m_index;
    
    tls_intptr_t()
    {
        m_index = TlsAlloc();
        ASSERT( get() == 0 );
    }
    
    intptr_t get() const { return (intptr_t) TlsGetValue(m_index); }

    void set(intptr_t v) { TlsSetValue(m_index,(LPVOID)v); }
};

struct intptr_t_and_set
{
    intptr_t val;
    intptr_t set; // bool ; is "val" set
    
    intptr_t_and_set(intptr_t v,intptr_t s) : val(v), set(s) { }
};
    
struct overridable_intptr_t
{
    atomic<intptr_t>    m_global;
    tls_intptr_t    m_local;    
    tls_intptr_t    m_localset;
        
    overridable_intptr_t(intptr_t val = 0) : m_global(val)
    {
        ASSERT( m_localset.get() == 0 );
    }       
    
    //---------------------------------------------
    
    intptr_t set_global(intptr_t val)
    {
        return m_global.exchange(val,mo_acq_rel);
    }
    intptr_t get_global() const
    {
        return m_global.load(mo_acquire);
    }
    
    //---------------------------------------------
    
    intptr_t_and_set get_local() const
    {
        return intptr_t_and_set( m_local.get(), m_localset.get() );
    }
    intptr_t_and_set set_local(intptr_t val, intptr_t set = 1)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(set);
        if ( set )
            m_local.set(val);
        return old;
    }
    intptr_t_and_set set_local(intptr_t_and_set val_and_set)
    {
        intptr_t_and_set old = get_local();
        m_localset.set(val_and_set.set);
        if ( val_and_set.set )
            m_local.set(val_and_set.val);
        return old;
    }
    intptr_t_and_set clear_local()
    {
        intptr_t_and_set old = get_local();
        m_localset.set(0);
        return old;
    }
    
    //---------------------------------------------
    
    intptr_t get_combined() const
    {
        intptr_t_and_set local = get_local();
        if ( local.set )
            return local.val;
        else
            return get_global();
    }
};

//=================================================================         

// test code :  

static overridable_intptr_t s_thingy;

int main(int argc,char * argv[])
{
    argc; argv;
    
    s_thingy.set_global(1);
    
    s_thingy.set_local(2,0);
    
    ASSERT( s_thingy.get_combined() == 1 );
    
    intptr_t_and_set prev = s_thingy.set_local(3,1);
    
    ASSERT( s_thingy.get_combined() == 3 );

    s_thingy.set_global(2);
    
    ASSERT( s_thingy.get_combined() == 3 );
    
    s_thingy.set_local(prev);
    
    ASSERT( s_thingy.get_combined() == 2 );
        
    return 0;
}

Or something.

Of course this whole post is implicitly assuming that you are using the "several threads that stay alive for the length of the app" model. An alternative is to use micro-threads that you spin up and down, and rather than inheriting from a global state, you would want them to inherit from the spawning thread's current combined state.


09-18-13 | Fast TLS on Windows

For the record; don't use this blah blah unsafe unnecessary blah blah.


extern "C" DWORD __cdecl FastTlsGetValue_x86(int index)
{
  __asm
  {
    mov     eax,dword ptr fs:[00000018h]
    mov     ecx,index

    cmp     ecx,40h // 40h = 64
    jae     over64  // Jump if above or equal 

    // return Teb->TlsSlots[ dwTlsIndex ]
    // +0xe10 TlsSlots         : [64] Ptr32 Void
    mov     eax,dword ptr [eax+ecx*4+0E10h]
    jmp     done

  over64:   
    mov     eax,dword ptr [eax+0F94h]
    mov     eax,dword ptr [eax+ecx*4-100h]

  done:
  }
}

DWORD64 FastTlsGetValue_x64(int index)
{
    if ( index < 64 )
    {
        return __readgsqword( 0x1480 + index*8 );
    }
    else
    {
        DWORD64 * table = (DWORD64 *)  __readgsqword( 0x1780 );
        return table[ index - 64 ];
    }
}

the ASM one is from nynaeve originally. ( 1 2 ). I'd rather rewrite it in C using __readfsdword but haven't bothered.
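
For reference, the C version would be something like this (an untested sketch mirroring the asm above; needs <intrin.h>) :

DWORD FastTlsGetValue_x86_c(int index)
{
    BYTE * teb = (BYTE *) __readfsdword( 0x18 ); // fs:[0x18] = TEB self pointer
    if ( index < 64 )
    {
        // Teb->TlsSlots[index] at +0xE10 :
        return *(DWORD *)( teb + 0xE10 + index*4 );
    }
    else
    {
        // Teb->TlsExpansionSlots at +0xF94 :
        DWORD * table = *(DWORD **)( teb + 0xF94 );
        return table[ index - 64 ];
    }
}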

Note that these may cause a bogus failure in MS App Verifier.

Also, as noted many times in the past, you should just use the compiler __declspec thread under Windows when that's possible for you. (eg. you're not in a DLL pre-Vista).


09-17-13 | Travel with Baby

We did our first flight with baby over the weekend.

It went totally fine. All the things that people worry about (getting through security, baby ears popping, whatever) are no problem. Sure she cried on the plane some, but mostly she was really good, and anybody on the plane who was bothered can go to hell.

(hey dumbass who sat next to us - when you see a couple with a baby and they say "would you like to move to another seat" you should fucking move. And if you don't move, then you need to be cool about it. Jesus christ you all suck so bad, I'm fucking holding your hand helping you be a decent person, you don't even need to take any initiative, I know you all suck too bad to speak up and take the lead at being decent, but even when I open the door for you, you still can't manage to take the easy step. Amazing.)

Despite it being totally fine, it made me feel like I don't really need to do that again.

You just wind up spending the whole trip staring at baby anyway. You wind up spending a lot of time stuck in the hotel room, because she needs to nap, or you have to go back to feed her and get more toys, or get her out of the sun, or whatever. Hotel rooms are god awful. There's this weird romanticism about hotels, but the reality is they're almost uniformly dreary, in the standard shoebox design with light at only one end. My house is so much fucking better.

Like I'm in another part of the world, but I'm still just doing "goo-goos" and shaking the rattle, why do I need to bother with the travel if this is all I'm doing? And of course it's much harder because I don't have all my handy baby accessories and her comfy bed and all that. It made me think of those cheesy photo series with the baby always in the foreground of the photo and all kinds of different world locations in the background.


09-17-13 | A Day in the Life

Wake up at 730 with the baby.

Show her some toys, shake them around, let her chew on them. Practice rolling over, she gets frustrated, help her out. She wants to walk around a bit, so pick her up and hold her while she walks. Ugh this is hard on my back. She swats at things, I move away the dangerous knives and bring close some jars for her to play with. She's getting frustrated. Show her the mirror baby, she flirts for a minute. She gets fussy; check diaper, yep it's dirty; go change it. Lay her down and make faces and do motorboats and stuff. Try to read her a book; she just wants to eat the pages and gets frustrated. Walk her around outside and show her some plants, let her grab some leaves. She wants to walk, help her walk in the grass. Ugh this is hard on my back. Getting tired now. Take her in and put her in the walker; put some toys on the walker, she swats them off, pick them up and put them back. She's getting bored of that. Show her some running water, let her suck my finger. She's getting fussy again.

God I'm tired. Check the clock.

It's 800.

Twelve more hours until she sleeps. ZOMG.


09-12-13 | Health Insurance

We've got a bunch of health insurance bills from baby and related care, and several of them have fraudulent up-coding or double-billing. But it's sort of a damned-if-you-whatever situation. Either you :

1. Fight it, spend lots of time on hold on the phone, filling out forms, talking to assholes who won't help you. Probably get nowhere and be stressed and angry the whole time.

2. Just pay it and try to "let it go". The money's not that huge and peace of mind is worth it. But in fact feel defeated and like a pussy for letting them get away with robbing you. Become less of a man day by day as you are crushed by the indefeatable awfulness of the world.

Though I suppose those are generally your only two options when fighting bureaucracy. It's just that health care is more important to our lives, our wallets, and generally the health of the entire economy as it becomes an increasing tax on all of us.

We walked through some local street fair a few weeks ago, and saw one of the doctors who's fraudulently billed us; he was being all smiley, oo look at me I'm part of the community, I'm your nice neighborhood doctor man. I wanted to just punch him in the face.

Also : Republicans are retarded pieces of shit. How could you possibly be seriously opposed to tearing apart our entire health care sector and rebuilding from scratch with cost controls and a public option? They're either morons (unaware of their evil), or assholes (aware and intentionally evil). Oh yes, it's wonderful that we have choice and competition in this robust and functioning health care economy. Oh no, wait, actually we don't have that at all. We have a corrupt government-private conspiracy, which you have intentionally created, which is screwing over everyone in America except for the heads of the health care industry (and the politicians that take their money). Sigh. Time to go back to pretending that nothing exists outside my home, because it's all just too depressing.


09-10-13 | Grand Designs

I've been watching a lot of "Grand Designs" lately; it's my new go-to soothing background zone-out TV. It's mostly inoffensive (*) and slightly interesting.

(* = the incredibly predictably saccharine sum-up by Kevin at the end is pretty offensive; all throughout he's raising very legitimate concerns, and then at the end every time he just loves it. There are a few episodes where you can see the builder/client is just crushed and miserable at the end, but they never get into anything remotely honest like that, it's all very superficial, ooh isn't excess consumerism wonderful!)

I'm not even going to talk about the house designs really. I think most of them are completely retarded and abysmal generic pseudo-modern crap. (IMO in general modernism just doesn't work for homes. Modernism is beautiful when it's pure, unforgiving, really strictly adhered to. But nobody can live in a home like that. So they water it down and add some natural materials and normal cosy furniture and storage and so on, and that completely ruins it and turns it into "condo pseudo-modernism" which is just the tackiest of all types of housing. Modernism should be reserved for places where it's realistic to keep the severe empty minimalism that makes it beautiful, like museums).

Thoughts in sections :


What makes a house.

One of the most amusing things is noticing what people actually say at the sum-up at the end. Every time Kevin sits down and talks with the couple when the house is done and asks what they really love about it, the comments are things like :

"We're just glad it's done / we're just glad to have a roof over our head".
"It's so nice to just be together as a family"
"The views are amazing."
"The location is so special."

etc. it always strikes me that none of the pros has anything to do with the expensive house they just made. They never say anything like : "the architecture is incredibly beautiful and it's a joy to just appreciate the light and the spaces". Or "we're really glad we made a ridiculous 4000 sq ft living room because we often have 100 friends over for huge orgies". Because they don't actually care about the house at all (and it sucks).

Quite often I see the little cramped bungalow that people start out with and think "that looks just charming, why are you building?". It's all cosy and full of knick-knacks. It's got colorful paint schemes and is appropriately small and cheap and homey. Then they build some awful huge cavernous cold white box.

The children in particular always suffer. Often they're in a room together before the build, and the family is building to get more space so the kids can all have their own room. But the kids in the shared room are all laughing and wrestling, piled up on each other and happy. Kids are meant to be with other kids. In fact, humans are meant to be with other humans. We spend all this money to get big empty lonely spaces, and are worse off for it. Don't listen to what people say they want, it's not good for them.

In quite a few of the episodes, the couple at the beginning is full of energy, really loving to each other, bright eyed. At the end of the episode, they look crushed, exhausted, stressed out. Their lips are thin and they're all tense and dead inside.

Even in the disasters they're still saying how wonderful it is and how they'd do it all over (because people are so awfully boringly cowardly dishonest and never admit regret about major life decisions until after they've unwound them (like every marriage is "wonderful" right up until the day of the divorce)), but you can see it in their eyes. Actually the final interviews of Grand Designs are an interesting study in non-verbal communication, because the shlock nonsense that they say with their mouths has absolutely zero information content (100% predictable = zero information), so you have to get all your information from what they're not saying.

It's so weird the way some people get some ridiculous idea in their head and stick to it no matter how inconvenient and costly and more and more obviously foolish it is. Like I absolutely have to build on this sloping lot that has only mud for soil, or I have to build in this old water tower that's totally impractical. They incur huge costs, for what? You could have just bought a normal practical site, and you would have been just as happy, probably much happier.

In my old age I am appreciating that all opinions are coincidences and taste is an illusion. Why in the world would you butt your head against some obviously difficult and costly and impractical problem. You didn't actually want that thing anyway. You just thought you wanted it because you are brainwashed by the media and your upbringing. Just get something else that's easier. All things are equal.


Eco Houses.

The "eco houses" are some of the most offensive episodes to me. The purported primary focus of these "eco houses" is reducing their long-term carbon footprint, with the primary focus being on insulation to reduce heating costs. That's all well and good, but it's a big lie built on fallacies.

They completely ignore the initial eco cost of the build. Often they tear down some perfectly good house to start the build, which is absurd. Then they build some giant over-sized monstrosity out of concrete. They ignore all that massive energy waste and pollution because "long term the carbon savings will make up for it". Maybe, maybe not. Correctly doing long term costs is very hard.

Obviously anyone serious about an eco house should build it as small as reasonable. Not only does a small house use less material to make and use less energy to heat and light, there's less maintenance over its whole life, you fill it with less furniture, it's smaller in the landfill when inevitably torn down, etc.

Building a ridiculously huge concrete "eco house" is just absurd on the face of it; it's so hypocritical that it kind of blows your sensor out and you can't even measure it any more. It's sort of like making a high performance electric sports car and pretending that's "eco", it's just an absurd transparently illogical combination; oh wait...

One of the big fallacies of the eco house is the "long term payoff". There might be new building technology in 5 years that makes your house totally obsolete. Over-engineering with anything technical is almost always a mistake, because the cost (and environment cost in this case) is going down so fast.

Your house might go out of style. Houses now are not permanent, venerated monuments. They are fashion accessories. You can see it in the way the eco people so happily tear down the houses from the 50's and 70's as if they were garbage. In 20 years if your house isn't fashionable, someone will buy it and tear it down. And you're using lots of experimental new technology, which greatly reduces the chance of your house actually lasting for the long term; things like burying a house in the ground with a neoprene waterproofing layer make that probability very small.

Perhaps the biggest issue is that they assume that the carbon cost of energy is constant. In fact it's going down rapidly. The whole idea of the carbon savings (for things like using massive amounts of concrete) is that the alternative (a timber house with normal insulation and more energy use) is polluting a lot through its energy use. But if its heat energy comes from solar or other clean sources, then your passive house is not a win. The technology of energy production will improve massively in the next 20-50 years, so saying your house is a win over the long (50-100 year) term is insane.

As usual in these phony arguments, they use a straw man alternative. They compare building a new $500k eco house vs. just leaving the old house alone. That's absurd. Of course if you want to say that the tear-down-and-build is more "eco" you should compare your new house vs. spending $500k on the old house or other eco purposes. What if you just left the old house and spent $100k to add insulation and solar panels and better windows? Then you could spend $400k on preserving forest land or something. A fair comparison has to be along an iso-line of constant cost, and doing the best you can per dollar in each case.

I'm sure the reality in most cases is just that people *want* a new house and are rationalizing and making up excuses why it's okay to do it. I'd like it so much better if they just said "yeah, we fucking want a new house that uses tons of concrete and we don't give a shit about the eco, but we're going to make it passive so that we can feel smug and show off to our friends".


European building vs. American.

Holy crap European building quality is ridiculously good.

In one of the episodes somebody puts softwood cladding on the house and the host is like "but that will rot in 10 years!" and the builder feels all guilty about it. (it's almost vanishingly rare to have anything but softwood cladding in America (*), and yes in fact it does rot almost immediately). (* = you might get plastic or aluminum, or these days we have various fake wood cement fiber-board options, but you would never ever use hardwood, yegads).

Granted the houses on the show are on the high end; I'm sure low end European houses are shittier. Still.

Almost every house is post-and-beam, either timber or steel frames. The timber frames are fucking oak which just blew my mind the first time I saw it. Actual fucking carpenters cutting joints. And real fucking wood. We have nothing like that. "skilled tradesman" isn't even an occupation in America anymore. All we have is "day laborer who is using a nail gun for the first time ever".

An American-style asphalt shingle roof is looked down upon as ridiculously crappy. Everything is slate or tile or metal. Their roofs last longer than our entire houses.

One funny thing I noticed is that the cost of stonemasons and proper carpenters seems to be incredibly low in the UK. There are some houses with amazing hand-done stonework and carpentry, and they cost about the same as the fucking awful modern boxes that are all prefab glass and drywall. Why in the world would you get a horrible cold cheapo-condo-looking modern box when you could have something made of natural materials cut by hand? The stone work in particular shocked me how cheap it was.

Another odd one is the widespread use of "blockwork" (concrete block walls). That's something we almost never do for homes in America, and I'm not sure why not. It's very quick and cheap, and makes a very solid wall. We associate it with prisons and prison-like schools and such, but if you put plaster over it, it's a perfectly nice wall and feels just like a stone house. I guess even blockwork is expensive compared to the ridiculously cheap American stick-framing method.

Another difference that shocked me is the "fixed price contract". Apparently in the UK you can get a builder to bid a price, and then if there are any overruns *they* have to cover it. OMG how fucking awesome is that, I would totally consider building a house if you could do that.

Oh yeah, and of course the planning regulations are insane. Necessary evil I suppose. It's why Europe is beautiful (despite heavy human modification of absolutely everything) and America looks like a fucking pile of vomit anywhere it's been touched by the hand of man. (though a lot of the stuff that gets allowed on the show in protected areas is pretty awful modern crap that doesn't fit in or hide well at all, so I'm not sure the planners are really doing a great job. It seems like if you spend enough time and money they will eventually let you build an eyesore).


It's interesting to watch how people (the clients) handle the building process.

A few people get completely run over by their builder or architect, pushed into building something they don't want, and forced to eat delays and overruns and shitty quality, and that's sad to see. But it's also unforgivably pathetic of them to let it happen. YOU HAVE A FUCKING CH4 CAMERA CREW! It's the easiest time ever to make your builder be honest and hard-working. Just go confront them when the cameras are there and make them explain themselves on camera. WTF how can you be such a pussy that you don't stand up for yourself even when you have this amazing backup. But no, they'll say "oh, I don't know, what can I do about it?". You can bloody well do more than you are doing.

A few people are asshole micro-managers totally hovering over the crew all the time. The crew hate them and complain about them. But they also do seem to work harder. In this sad awful life of ours, being an annoying nag really does work great, because most people just don't want to deal with it and so will do what they have to in order to not get nagged.

Building a house is one of those situations where you can really see the difference between people who just suck it up and go with the flow "oh, I guess that's just what it costs", vs. people who are always scrapping and fighting and getting more for themselves. You can see some rich old pussy fart who doesn't fight might spend $1M on a build, and some other poor immigrant guy who knows how to deal and cajole and hustle might spend $200k on the exact same build. You can be bigger than your money or your intellect if you just fight for it.

The ones that are most impressive to me are the self-builds. It just astounds me how hard they work. And how wonderful to put 2 years or so of your life into just building one thing, that afterward you can go "I made this". Amazing, I'd love to do that. It's also the only time that I really see the people enjoying the process, and being happy afterward. (I particularly like the couple in Scotland that does the gut-rehab of an old stone house all by themselves with no experience).

There are a few episodes with the classic manipulative architect. The architect-client relationship is usually semi-adversarial. Architects don't just want to make you the nice house you want, that suits you and is cheap and easy to build. They want to build something that will get them featured in a magazine; they want to build something that is cutting edge, or they have some bee in their bonnet that they want to try out. They want to use expensive and experimental methods and make you take all the risk for it. In order to get you to do that, they will lie to you about how risky and expensive it really is. I don't necessarily begrudge the architects for that, it's what they have to do in order to get something interesting built. But it's amazing how naive and trusting some of the clients are. And it's a sort of inherently shady situation. Any time one person gets the upside (eg. the architect benefits if it goes well) and someone else gets the downside (the client has to eat the cost overrun and delays and live in the shitty house if it doesn't go well), that's a big problem for morality. You're relying solely on their ethics to treat you well, and that is an iffy thing to rely on.


09-02-13 | DEET

About to go camping for a few days. Discovered that the DEET has eaten its way through the original tube it came in, through a few layers of ziplocs, and out into my tub of camping stuff where it gladly ate a hole in my thermarest. Fucking DEET !

I guess after every trip I need to take the deet out and put it in a glass jar by itself. Or in a toxic waste containment facility or some shit. It's nasty stuff. Still better than mosquitos.


08-28-13 | Crying Baby Algorithm

Diagnosis and solution of crying baby problem.

1. Check diaper. This is always the first check because it's so easy to do and is such a conclusive result if positive. Diaper is changed, proceed to next step.

2. Check for boredom. Change what you're doing, take baby outside and play a new game. If this works, hunger or fatigue are still likely, you're just temporarily averting them.

3. Check hunger. My wife disagrees about this being so early in the order of operations, but testing it early seems like a clear win to me. After diaper this is the test with the most reliable feedback - if baby eats, that's a positive result and you're done. You can do an initial test for hungry-sucking with a finger in the mouth (hungry-sucking is much harder and more aggressive than comfort-sucking); hungry-sucking does not 100% of the time indicate actual hunger, nor does lack of hungry-sucking mean no hunger, but there is strong correlation. Doing a test feed is much easier with breastfed babies than with bottles (in which case you have to warm them up and find a clean nipple and so on). If baby is extremely upset, then failure to respond to a test-feed does not mean it is not hungry, you may have to repeat with stronger measures (see below).

4. Check for obvious stomach pain. Crying from stomach pain (gas, burps, acid, whatever) can be hard to diagnose once it becomes constant or severe (we'll get to that case below). But in the early/mild stages it's pretty easy to detect by testing body positions. If baby cries like crazy on its back but calms down upright or on its stomach (in a football hold), stomach pain is likely. If baby quiets down with abdomen pressure and back-patting, stomach pain is likely.

5. Look for obvious fatigue. The signs of this are sometimes clear - droopy eyes, yawning, etc. Proceed to trying to make baby sleep.

This is the end of the easy quick checks. If none of these work you're getting into the problem zone, where there may be multiple confounding factors, or the factors may have gone on so long that baby no longer responds well to them (eg. hungry but won't eat, gassy and patting doesn't stop the crying), so you will have to do harder checks.

Before proceeding, go back to #1 and do the easy checks again. Often it's the case that baby was just hungry, but after you did #1 it pooped and so wouldn't eat. It also helps to have a different person do the re-check; baby may not have responded to the first person but will respond to the next person doing exactly the same check the second time.

This is worth repeating because it's something that continues to be a stumbling block for me to this day - just because you checked something earlier in the crying process doesn't mean it's not the issue. You tested hunger, she wouldn't eat. That doesn't mean you never test hunger again! You have to keep working on issues and keep re-checking. You have not ruled out anything! (except dirty diaper, stick your finger in there again to make sure, nope that's not it).

If that still doesn't work, proceed to the longer checks :

6. Check for overstimulation. This is pretty straightforward but can take a while. Take baby to a dark room, pat and rock, make shushes or sing to baby, perhaps swaddle. Just general calming. Crying may continue for quite a while even if this is the right solution so it takes some patience. (and then do 1-5 again)

7. Check for fatigue but won't go to sleep. The usual technique here is a stroller walk. (and then do 1-5 again)

8. Check for hungry but won't eat. You will have to calm baby and make it feel okay in order to eat. Proceed as per #6 (overstimulation) but stick baby on a nipple, in its most secure eating position. For us it helps to do this in the bedroom where baby sleeps and eats at night, that seems to be the most comforting place. Swaddling also helps here. Baby may even turn away and reject eating several times before it works. (and then do 1-5 again)

9. Assume unsignalled stomach pain. Failing all else, assume the problem is gastrointestinal, despite the lack of a clear GI relief test result (#4). So just walk baby around and pat its back, there's nothing else to do. (and keep repeating 1-5 again)

10. If you reach this point, your baby might just be a piece of shit. Throw it out and get a better one.


08-28-13 | How to Crunch

Baby is like the worst crunch ever. Anyway it's got me thinking about things I've learned about how to cope with crunch.

1. There is no end date. Never push yourself at an unsustainable level, assuming it's going to be over soon. Oh, the milestone is in two weeks, I'll just go really hard and then recover after. No no no, the end date is always moving, there's always another crunch looming, never rely on that. The proper way to crunch is to find a way to lift your output to the maximum level that you can hold for an indeterminate amount of time. Never listen to anyone telling you "it will be over on day X, let's just go all-out for that", just smile and nod and quietly know that you will have the energy to keep going if necessary.

2. Don't stop taking care of yourself. Whatever you need to do to feel okay, you need to keep doing. Don't cut it because of crunch. It really doesn't take that much time, you do have 1-2 hours to spare. I think a lot of people impose a kind of martyrdom on themselves as part of the crunch. It's not just "let's work a lot" it's "let's feel really bad". If you need to go to the gym, have a swim, have sex, do yoga, whatever it is, keep doing it. Your producers and coworkers who are all fucking stupid assholes will give you shit about it with passive aggressive digs; "ooh I'm glad our crunch hasn't cut into your workout time, none of the rest of us are doing that". Fuck you you retarded inefficient less-productive martyr pieces of crap. Don't let them peer pressure you into being stupid.

3. Resist the peer pressure. Just decided this is worth its own point. There's a lot of fucking retarded peer pressure in crunches. Because others are suffering, you have to also. Because others are putting in stupid long hours at very low productivity, you have to also. A classic stupid one is the next point -

4. Go home. One of the stupidest ideas that teams get in crunches is "if someone on the team is working, we should all stay for moral support". Don't be an idiot. You're going to burn out your whole team because one person was browsing the internet a month ago when they should have been working and is therefore way behind schedule? No. Everyone else GO HOME. If you aren't on the critical path, go sleep, you might be critical tomorrow. Yes the moral support is nice, and in rare cases I do advocate it (perhaps for the final push of the game if the people on the critical path are really hitting the wall), but almost never. Unfortunately as a lead you do often need to stick around if anyone on your team is there, that's the curse of the lead.

5. Sleep. As crunch goes on, lack of sleep will become a critical issue. You've got to anticipate this and start actively working on it from the beginning. That doesn't just mean making the time to lie in bed, it means preparing and thinking about how you're going to ensure you are able to get the sleep you need. Make rules for yourself and then be really diligent about it. For me a major issue is always that the stress of crunch leads to insomnia and the inability to fall asleep. For me the important rules are things like - always stop working by 10 PM in order to sleep by 12 (that means no computer at all, no emails, no nothing), no coffee after 4 PM, get some exercise in the afternoon, take a hot shower or bath at night, no watching TV in bed, etc. Really be strict about it; your sleep rules are part of your todo list, they are tasks that have to be done every day and are not something to be cut. I have occasionally fallen into the habit of using alcohol to help me fall asleep in these insomnia periods; that's a very bad idea, don't do that.

6. Be smart about what you cut out of your life. In order to make time for the work crunch you will have to sacrifice other things you do with your life. But it's easy to cut the wrong things. I already noted don't cut self care. (also don't cut showering and teeth brushing, for the love of god, you still have time for those). Do cut non-productive computer and other electronics time. Do cut anything that's similar to work but not work, anything where you are sitting inside, thinking hard, on a computer, not exercising. Do cut "todos" that are not work or urgent; stuff like house maintenance or balancing your checkbook, all that kind of stuff you just keep piling up until crunch is over. Do cut ways that you waste time that aren't really rewarding in any way (TV, shopping, whatever). Try not to cut really rewarding pleasure time, like hanging out with friends or lovers, you need to keep doing that a little bit (for me that is almost impossible in practice because I get so stressed I can't stop thinking about working for a minute, but in theory it sounds like a good idea).

7. Be healthy. A lot of people in crunch fall into the trap of loading up on sugar and caffeine, stopping exercising, generally eating badly. This might work for a few days or even weeks, but as we noted before crunch is always indeterminate, and this will fuck you badly long term. In fact crunch is the most critical time to be careful with your body. You need it to be healthy so you can push hard, in fact you should be *more* careful about healthy living than you were before crunch. It's a great time to cut out all sugary snacks, fast food, and alcohol.


08-27-13 | Email Woes

I'm having some problem with emails. Somewhere between gmail and my verio POP host, some emails are just not getting through.

If you send me an important email and I don't reply, you should assume I didn't get it. Maybe send it again with return-receipt to make sure. (if you send me a "funny" link and I don't reply, GTFO)

Fucking hell, I swear everyone is fired. How is this so hard. Sometimes I think that the project I should really be working on (something that would actually be important and make a difference in the world to a lot of people) is making a new internet from scratch. Fully encrypted and peer-to-peer so governments can never monitor it, and no corporate overlord controls the backbone, and text-only so fucking flash or whatever can never ruin it. Maybe no advertising either, like the good old days. But even if I built it, no one would come because it wouldn't have porn or lolcats or videos of nut-shots.


08-26-13 | OMG E46 M3's have the best ads

Everyone knows that E46 M3's attract an amazing demographic. I'm still occasionally keeping my eye out for them (I can't face the minivan!), and you come across the most incredible ads for these things.

Check this one out :

2004 E46 Bmw Laguna Blue M3

OMG it starts so good and then just gets better and better. Like you can't believe that he can top his last sentence, and then he just blows it away. Amazing. (if it's fake, it's fucking brilliant)


In white for when that expires (fucking useless broken ass piece of shit internet can't even hold some text permanently, what a fakakta load of crap everything in this life is, oy) :

Hello i got my baby blue BMW m3. Finally decided to sell. You will never find a cleaner BMW i promise. Best deal for best car. -black on black -6 speed manual -66,435 miles (all freeway) (half city)(1/2country) never racetracked. -Rebuilt title( -fixed by my brother Yuri with quality shop, he did good job). -19 inch wheels perfect for race, burnout, drift. I never do.. I promise. -AC blows hard like Alaska. -333hp but i put intake and its now 360hp at least, trust me. Powerfull 3,2 engine almost as fast as Ferarri 360. You dont believe? Look it up, youtube i have video! I can show you in the test drive.. You will feel like king on the roads... Trust me. This car puts you in extasy feeling. The pistons push hard. sometimes i say, "Big mistake- big payment, sucks job". But for you this the best because i believe ur serious. I keep it very clean, oil change every 20,000 miles. Ladies like it. My wife is mad sell, because she drives to work everyday, in seattle, and likes the power. I say we buy toyota and go vacation to Hawai. CASH ONLY( no credit, debit , fake checks, paypal, ebay, trade only for mercedes AMG, i like power. No lowball, serious buyers only, im tired of low balls offers, so please serious. I have pictures here. Thank you. I love the car so i could keep it if you dont want it. And go to casino to make payment. Im good with blackjack.

You dont believe? Look it up, youtube i have video!


08-22-13 | Sketch of Suffix Trie for Last Occurrence

I don't usually like to write about algorithms that I haven't actually implemented yet, but it seems in my old age that I will not actually get around to doing lots of things that I think about, so here goes one.

I use a Suffix Trie for string matching for my LZ compressors when optimal parsing.

reminder: Suffix Tries are really the super-awesome solution, but only for the case that you are optimal parsing not greedy parsing, so you are visiting every byte, and for large windows (sliding window Suffix Tries are not awesome). see : LZ String Matcher Decision Tree (w/ links to Suffix Trie posts)

Something has always bothered me about it. Almost the entire algorithm is this sweet gem of computer science perfection with no hackiness and a perfect O(N) running time. But there's one problem.

A Suffix Trie really just gives you the longest matching substring in the window. It's not really about the *location* of that substring. In particular, the standard construction using pointers to the string that was inserted will give you the *first* occurrence of each substring. For LZ compression what you want is the *last* occurrence of each substring.

(I'm assuming throughout that you use path compression and your nodes have pointers into the original window. This means that each step along the original window adds one node, and that node has the pointer to the insertion location.)

In order to get the right answer, whenever you do a suffix query and find the deepest node that you match, you should then visit all children and see if any of them have a more recent pointer. Say you're at depth D, all children at depth > D are also substring matches of the same first D bytes, so those pointers are equally valid string matches, and for LZ you want the latest one.

An equivalent alternative is, instead of searching all children on query, to update all parents on insertion. Any time you insert a new node, go back to all your parents and change their pointers to your current pointer, because your pointer must match them all up to their depth, and it's a more recent (larger) value.

Of course this ruins the speed of the suffix trie so you can't do that.

In Oodle I use limited parent updates to address this issue. Every time I do a query/insert (they always go together in an optimal parse, and the insert is always directly under the deepest match found), I take the current pointer and update N steps up the parent links. I tested various values of N against doing full updates and found that N=32 gave me indistinguishable compression ratios and very little speed hit.

(any fixed step count preserves the O(N) of the suffix trie, it's just a constant multiplier). (you need to walk up to parents anyway if you want to find shorter matches at lower offsets; the normal suffix lookup just gives you the single longest match).
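
Something like this sketch (not the actual Oodle code; names and struct layout are hypothetical) :

struct TrieNode
{
    TrieNode * parent; // path-compressed suffix trie parent
    int        ptr;    // window position of the string that created this node
};

// after the query+insert at window position "pos", push "pos" up a bounded
// number of parent links; each parent matches "pos" to at least its own depth,
// and "pos" is more recent, so it's a strictly better LZ offset :
static void LimitedParentUpdate( TrieNode * node, int pos, int maxSteps = 32 )
{
    for ( int i = 0; i < maxSteps && node != NULL; i++ )
    {
        node->ptr = pos;
        node = node->parent;
    }
}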

So anyway, that heuristic seems to work okay, but it just bothers me because everything else about the Suffix Trie is so pure with no tweak constants in it, and then there's this one hack. So, can we solve this problem exactly?

I believe so, but I don't quite see the details yet. The idea goes like this :

I want to use the "push pointer up to parents method". But I don't actually want to update all parents for each insertion. The key to being fast is that many of the nodes of the suffix trie will never be touched again, so we want to kind of virtually mark those nodes as dirty, and they can update themselves if they are ever visited, but we don't do any work if they aren't visited. (BTW by "fast" here I mean the entire parse should still be O(N) or O(NlogN) but not fall to O(N^2) which is what you get if you do full updates).

In particular in the degenerate match cases, you spend all your time way out at the leaves of the suffix trie chasing the "follows" pointer, you never go back to the root, and many of the updates overwrite each other in a trivial way. That is, you might do substring "xxaaaa" at "ptr", and then "xxaaaaa" at "ptr+1" ; the update of "ptr" back up the tree will be entirely overwrittten by the update from "ptr+1" (since it also serves as an "xxaa" match and is later), so if you just delay the update it doesn't need to be done at all.

(in the end this whole problem boils down to a very simple tree question : how do you mark a walk from a leaf back to the root with some value, such that any query along that walk will get the value, but without actually doing O(depth) work if those nodes are not touched? Though it's not really that question in general, because in order to be fast you need to use the special properties of the Suffix Trie traversal.)

My idea is to use "sentries". (this is a bit like skip-lists or something). In addition to the "parent" pointer, each node has a pointer to the preceding "sentry". Sentry steps take you >= 1 step toward root, and the step distance increases. So stepping up the sentry links might take you 1,1,2,4,.. steps towards root. eg. you reach root in log(depth) steps.
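
One possible way to set up those links (a Fenwick / skip-list style sketch; this extends the hypothetical TrieNode from the sketch above with a sentry pointer, and is as untested as the rest of this idea) :

// sentry jump = lowest set bit of the node's depth, so walking sentry links
// from depth d reaches the root in popcount(d) <= log2(d)+1 jumps, with the
// jump distances increasing as you go up :
static void SetSentry( TrieNode * node, int nodeDepth ) // nodeDepth >= 1
{
    int step = nodeDepth & (-nodeDepth); // lowest set bit
    TrieNode * up = node;
    for ( int i = 0; i < step; i++ ) up = up->parent;
    node->sentry = up; // (TrieNode extended with : TrieNode * sentry; plus a pending-update slot)
}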

When you insert a new node, instead of walking all parents and changing them to your pointer, you walk all sentries and store your pointer as a pending update.

When you query a node, you walk to all sentries and see if any of them has a lower pointer. This effectively finds if any of your children did an update that you need to know about.

The pointer that you place in the sentry is really a "pending update" marker. It means that update needs to be applied from that node up the tree to the next sentry (ADD: I think you also need to store the range that it applies to, since a large-step range can get broken down to smaller ranges by updates). You know what branch of the tree it applies to because the pointer is the string and the string tells you what branch of the tree to follow.

The tricky bit happens when you set the pointer in the sentry node, there may be another pointer there from a previous insertion that is still pending update. You need to apply the previous pending update before you store your new pointer in the pending update slot.

Say a node contains a pending update with the pointer "a", and you come in and want to mark it with "b". You need to push the "a" update into the range that it applies to, so that you can set that node to be pending with a "b".

The key to speed is that you only need to push the "a" update where it diverges from "b". For example if the substring of "a" and "b" is the same up to a deeper sentry that contains "b" then you can just throw away the "a" pending update, the "b" update completely replaces it for that range.

Saying it all again :

You have one pointer update "a" that goes down a branch of the tree. You don't want to actually touch all those nodes, so you store it as applying to the whole range. You do a later pointer update "b" that goes down a branch that partially overlaps with the "a" branch. The part that is "a" only you want to leave as a whole range marking, and you do a range-marking for "b". You have to find the intersection of the two branches, and then the area where they overlap is again range-marked with "b" because it's newer and replaces "a". The key to speed is that you're marking big ranges of nodes, not individual nodes. My proposal for marking the ranges quickly is to use power-of-2 sentries, to mark a range of length 21 you would mark spans of length 16+4+1 kind of a thing.

Maybe some drawings are clearer. Here we insert pointer "a", and then later do a query with pointer "b" that shares some prefix with "a", and then insert "b".

The "b" update to the first sentry has to push the "a" update that was there up until the substrings diverge. The update back to the root sees that "a" and "b" are the same substring for that entire span and so simply replaces the pending update of "a" with a pending update of "b".

Let's see, finishing up.

One thing that is maybe not clear is that within the larger sentry steps the smaller steps are also there. That is, if you're at a deep leaf you walk back to the root with steps that go 1,1,2,4,8,16,32. But in that last big 32 step, that doesn't mean that's one region of 32 nodes with no other sentries. Within there are still 1,2,4 type steps. If you have to disambiguate an update within that range, it doesn't mean you have to push up all 32 nodes one by one. You look and see hey I have a divergence in this 32-long gap, so can I just step up 16 with "a" and "b" being the same? etc.

I have no idea what the actual O() of this scheme is. It feels like O(NlogN) but I certainly don't claim that it is without doing the analysis.

I haven't actually implemented this so there may be some major error in it, or it might be no speed win at all vs. always doing full updates.

Maybe there's a better way to mark tree branches lazily? Some kind of hashing of the node id? Probabilistic methods?


08-19-13 | Sketch of multi-Huffman Encoder

Simple way to do small-packet network packet compression.

Train N different huffman code sets. Encoder and decoder must have a copy of the static N code sets.

For each packet, take a histogram. Use the histogram to measure the transmitted length with each code set, and choose the smallest. Transmit the selection and then the encoding under that selection.
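
A minimal sketch of the per-packet selection (my names; it assumes the N static code sets are already trained, with codeLens[s][sym] = code length in bits of byte sym under set s) :

#include <stdint.h>

static int ChooseCodeSet( const uint8_t * packet, int packetLen,
                          const uint8_t codeLens[][256], int numSets )
{
    uint32_t histo[256] = { 0 };
    for ( int i = 0; i < packetLen; i++ )
        histo[ packet[i] ] ++;

    int best = 0;
    uint64_t bestBits = (uint64_t)-1;
    for ( int s = 0; s < numSets; s++ )
    {
        uint64_t bits = 0;
        for ( int sym = 0; sym < 256; sym++ )
            bits += (uint64_t) histo[sym] * codeLens[s][sym];
        if ( bits < bestBits ) { bestBits = bits; best = s; }
    }
    return best; // transmit this selection, then the packet coded with set "best"
}

The cost of transmitting the selection itself is the same for every candidate set, so it can be left out of the comparison.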

All the effort is in the offline training. Sketch of training :

Given a large number of training packets, do this :


for - many trials - 

select 1000 or so seed packets at random
(you want a number 4X N or so)

each of the seeds is a current hypothesis of a huffman code set
each seed has a current total histogram and codelens

for each packet in the training set -
add it to one of the seeds
the one which has the most similar histogram
one way to measure that is by counting the huffman code length

after all packets have been accumulated onto all the seeds,
start merging seeds

each seed has a cost to transmit; the size of the huffman tree
plus the data in the seed, huffman coded

merge seeds with the lowest cost to merge
(it can be negative when merging makes the total cost go down)

keep merging the best pairs until you are down to N seeds

once you have the N seeds, reclassify all the packets by lowest-cost-to-code
and rebuild the histograms for the N trees using only the new classification

those are your N huffman trees

measure the score of this trial by encoding some other training data with those N trees.

It's just k-means with random seeds and bottom-up cluster merging. Very heuristic and non-optimal but provides a starting point anyway.
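
For concreteness, the merge-cost measure might look like this (a sketch; entropy stands in for the true huffman code lengths, and the ~150 byte table header cost is an assumption) :

#include <stdint.h>
#include <math.h>

struct Histo { uint32_t counts[256]; uint64_t total; };

// approximate bits to transmit one seed : flat table-header cost plus the
// entropy of its histogram :
static double CostBits( const Histo & h )
{
    double bits = 150.0 * 8.0; // assumed ~150 byte huffman table header
    for ( int s = 0; s < 256; s++ )
        if ( h.counts[s] )
            bits += h.counts[s] * log2( (double) h.total / h.counts[s] );
    return bits;
}

// cost change from merging two seeds; negative means merging is a win
// (you save one table header but pay for the blended statistics) :
static double MergeCostDelta( const Histo & a, const Histo & b )
{
    Histo ab;
    ab.total = a.total + b.total;
    for ( int s = 0; s < 256; s++ )
        ab.counts[s] = a.counts[s] + b.counts[s];
    return CostBits( ab ) - CostBits( a ) - CostBits( b );
}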

The compression ratio will not be great on most data. The advantage of this scheme is that there is zero memory use per channel. The huffman trees are const and shared by all channels. For N reasonable (4-16 would be my suggestion) the total shared memory use is quite small as well (less than 64k or so).

Obviously there are many possible ways to get more compression at the cost of more complexity and more memory use. For packets that have dword-aligned data, you might do a huffman per mod-4 byte position. For text-like stationary sources you can do order-1 huffman (that is, 256 static huffman trees, select by the previous byte), but this takes rather more const shared memory. Of course you can do multi-symbol huffman, and there are lots of ways to do that. If your data tends to be runny, an RLE transform would help. etc. I don't think any of those are worth pursuing in general, if you want more compression then just use a totally different scheme.


Oh yeah this also reminds me of something -

Any static huffman encoder in the vernacular style (eg. does periodic retransmission of the table, more or less optimized in the encoder (like Zip)) can be improved by keeping the last N huffman tables. That is, rather than just throwing away the history when you send a new one, keep them. Then when you do retransmission of a new table, you can just send "select table 3" or "send new table as delta from table 5".

This lets you use locally specialized tables far more often, because the cost to send a table is drastically reduced. That is, in the standard vernacular style it costs something like 150 bytes to send the new table. That means you can only get a win from sending new tables every 4k or 16k or whatever, not too often because there's big overhead. But there might be little chunks of very different data within those ranges.

For example you might have one Huffman table that only has {00,FF,7F,80} as literals (or whatever, specific to your data). Any time you encounter a run where those are the only characters, you send a "select table X" for that range, then when that range is over you go back to using the previous table for the wider range.


08-13-13 | Dental Biofeedback

Got my teeth cleaned the other morning, and as usual the dental hygienist cut the hell out of my gums with her pointy stick.

It occurred to me that the real problem is that she can't feel what I feel; when the tip of the tool starts to go into my gum, she doesn't get the pain message. It's incredibly hard to do sensitive work on other people because you don't have their self-sensory information.

But there's no reason for it to be that way with modern technology. You should be able to put a couple of electrodes on the patient to pick up the pain signal (or perhaps easier is to detect the muscles of the face clenching) hooked up to like a vibrating pad in the hygienist's chair. Then the hygienist can poke away, and get haptic feedback to guide their pointy stick.

I should make a medical device so I can get in on the gold-rush which is the health care raping of America. Yeeeaaaah!


This reminds me how incredibly shitty all the physical therapists that I saw for my shoulder were.

One of my really tough lingering problems is that after my injury, my serratus anterior basically stopped firing, which has given me a winged scapula. I went and got nerve conduction testing, as long thoracic nerve dysfunction is a common cause of this problem, but it seems the nerve is fine, it's just that my brain is no longer telling that muscle to fire. When I do various movements that should normally be recruiting the SA, instead my brain tells various other muscles to fire and I do the movement in a weird way.

Anyway I did lots of different types of PT with various types of quacks who never really listened to me properly or examined the problem properly, they just started doing their standard routine that didn't quite apply to my problem. (by far the worst was the head of Olympic PT, who started me on his pelvic floor program; umm, WTF; I guess the guy is just a molester who likes to mess around with pelvic floors, I could see half of the PT office doing pelvic floor work). When someone has this problem of firing the wrong muscles to do movements, you can't just tell them to go do 100 thera-band external rotations, because they will wind up losing focus and doing them with the wrong muscles. Just having the patient do exercises is entirely the wrong prescription.

Not one of them did the correct therapy for this problem, which is biofeedback. The patient should be connected to electrodes that detect the firing of the non-functioning muscle. They should then be told to do a series of movements that in a normal human would involve the firing of that muscle. The patient should be shown a monitor or given a buzzer that lets them know if they are firing it correctly. The direct sensory feedback is the best way to retrain the muscle control part of the brain.

A very simple but effective way to do this is for the therapist to put a finger on the muscle that should be firing. By just applying slight pressure with the finger it makes the patient aware of whether that muscle is contracting or not and can guide them to do the movement with the correct muscles. (it's a shame that our society is so against touching, because it's such an amazing aid to have your personal trainer or your yoga teacher or whatever put their hands on the part of the body they are trying to get you to concentrate on, but nobody ever does that in stupid awful don't-molest-me America).

Everyone is fired.

A related note : I've tried various posture-correction braces over the years and I believe they all suck or are even harmful. In order to actually pull your shoulders back against your body weight, you have to strap yourself into a very tight device, and none of them really work. And even if they do work, having a device support your posture is contrary to building muscle and muscle-memory to train the patient to do it themselves. I always thought that the right thing was some kind of feedback buzzer system. Give the patient some kind of compression shirt with wiring on the front and back that can detect the length of the front of the shirt and the length of the back. Have them establish a baseline of correct posture. The shirt then has a little painful electric zapper in it, so if they are curving forward, which stretches the back of the shirt and compresses the front, they get a zap. The main problem with people with posture problems is just that they forget about it and fall into bad old habits, so you need to give them this constant monitoring and immediate feedback to fix it.


08-13-13 | Subaru WRX Mini Review

We got a WRX Wagon a while ago to be our primary family car. It's basically superb, I absolutely love it.

Above all else I love that it is simple and functional. It feels like the best car that Subaru could have possibly built for the money. It has no extra frills. It's not trying to be "luxury" in any way, and I fucking love that. But where it counts, the engine, the drivetrain, the reliability, it is totally quality, superb quality, much better than the so-called German masters of engineering. And actually I like the feel of the interior plastics (for the most part); they feel functional and solid and unpretentious. Way better than the gross laminated wood in modern BMWs and the blingy rhinestone-covered interiors of modern Audis. So-called "nice" cars have all become like tacky McMansions catering to the Chelsea tastes, all faux-marble and oversized and so on. Trying to appear better than they actually are, whereas the Subaru is better quality than it looks.

It feels so crazy fast on the street. To be so sporty and also be a good family wagon is just amazing. Sometimes I wish I'd spent the extra money for the STI (and then spent some more to de-stupidify the ridiculously hard and low suspension of the STI) just for the diffs. The base WRX (non STI) has a totally non-functional viscous 4wd system that ruins the ability to throttle steer in a really predictable fun way. But realistically the number of times that I'm pushing the car that hard is vanishing and it would need lots of other mods to be a Ken Block trick machine.

Given that I basically love it and highly recommend it (*), here are a few things that I don't like about it :

(* = I believe that everyone should buy a WRX. If you want a car for X purpose, no matter what X is, the answer is WRX. The only reasons I see to buy anything but a WRX are 1. the gas mileage is not great, and 2. if you really need more space, like if you have 8 children and 4 great danes or something ridiculous like that).

Sport Seats. God sports seats are so fucking retarded and are ruining cars. They are entirely for looks, put in by the marketing department to attract dumb teenage boys (the same dumb boys that like low profile tires and jutting hips). You can make a non-sports seat that holds you in place perfectly fine, and if you are actually doing performance driving you just need a harness (or a lap-belt lock). The specific problem in the WRX is the big solid head rests that are part of the seat - they block visibility really badly for backing up, and they also get in the way of the folding rear seats, so to fold the back seats up and down you have to go fold the front seats up first. It's lame.

Hill-start Assist. I hate hill-start assist. It would be okay if it was optional, actually I would like it if it was optional, on a button I could turn on and off so I would only get it when I actually wanted it, but no it's on all the time. I can start a car on a hill just fine by myself, thank you (where were you 15 years ago when I couldn't ?). The main result of hill-start assist is that I press the gas to get going and the car doesn't move and it feels like the brake is holding the car and I'm like "wtf is going on, the car is broken god dammit" and then it finally lets go and I realize it's the dumb HSA. Sucky. Stop trying to help me drive.

Variable Power Steering. The PS is variably boosted and *way* too variable. It's so boosted at low speed that the wheel feels all swimmy with no response. Variable PS in general hurts your ability to develop muscle-memory for turning. If you are going to do variable PS, it should be a very minor difference, not like a completely different car the way it is in the WRX; basically an ideal variable PS system should be so minor that the driver doesn't consciously know it's there at all, it should just feel right. At least it's still hydraulic, not that god-awful electric crap.

Hatch blockage. One of the big problems with the car is that the wagon back is not as useful as it should be. It's a decent amount of space, but there's a lot of weird bumps and lips that make it hard to get large things inside. The hatch opening in particular is a lot smaller than it should be, and it shouldn't be so sloped in the rear; if the roof was longer and the rear glass more vertical, you could actually get a small couch in the back (fast-backs in general are retarded; they are drawn by car designers because they *look* aerodynamic, but a hard edge is actually more aero and provides way more interior space). As is, it's unfortunately terrible for cargo carrying and just not a very good small station wagon.

Summer Tires. God damn summer tires are such a stupid scam. They're annoyingly low profile too. It just sucks that almost any car you buy these days you have to immediately buy new wheels and tires to undo the stupidity, which is again marketing department tinkering. Cars should be sold on all-season tires, and we need to stop this low-profile retardation. Rubber is amazing wonderful stuff, it's what makes wheels work well. (part of the reason cars are all sold on summer tires now is to make the magazine tests "on stock tires" look better. Those "on stock tires" tests have always been retarded, because there are a few sports cars sold on autocross tires (street legal track tires) which is totally unfair, and a few actually good-to-their-customers car companies sell their cars on all-seasons, so the magazines' claim that by testing all cars on stock tires is a fair way to compare is just bananas.)

Jack plates & hitch. The jack points on these cars absolutely suck (pinch welds with the actually reinforced part inboard of the pinch weld). What would it cost the factory to put a decent jack plate there, maybe $10? Come on. A factory hitch option would have been nice too.

Plastic bumpers. The body skirts are super low (too low, they don't clear a parking spot curb thingy) and what's worse is they're made from crunchy hard plastic that cracks or pops out of the weak plastic press-fittings rather than bends. I wish bumpers and skirts were still steel with rubber, or even soft polyurethane plastic that bends on small impacts would be fine. As usual with metric over-training, the fact that the official crash tests don't rate the damage to the car means that modern cars are very good at protecting the occupants, but self-destruct on even the slightest impact. These really aren't even "bumpers", the actual impact absorbing part is all hidden underneath; these are plastic skins over the bumpers, and they really should be designed so that you can have a 1 mph impact without cracking them or popping the fittings out.

But really the only serious complaint is the stupid shape that ruins its cargo capacity :

I just hate it when marketing people ruin the functionality of things for no good reason. First you make it as functional as possible. Then the marketing people can play with the look or whatever but *only* in ways that don't change the functionality. Almost all modern cars are really "haunchy" these days, with wheels jutting out from the body, and bodies that are cut in at the top and back to be "sleek". No it is *not* for handling or aerodynamics. It's entirely an aesthetic thing. It's making me long for something like the Honda Element that at least has not compromised function for stupid vanity.


08-13-13 | Kinesis Freestyle 2 Mini Review

Aaron got a KF2 so I tried it out. Very quick review :

(on the right side I've failed to circle the Shift and Control keys which are perhaps the worst part)

The key action is not awesome but is totally acceptable. It would not stop me from using it.

Even in brief testing I could feel that having my hands further apart was pretty cool. Getting your elbows back tucked into your ribs with forearms externally rotated helps reduce the way that normal computer use posture has the weight of the arms collapsing the chest, causing kyphosis.

The fact that it sits flat and is totally old-skool keyboard style is pretty retarded. Who is buying an ergonomic split keyboard that wants a keyboard that could have been made in the 80's ? You need to start from a shape like the MS Natural and *then* cut it in half and split it. You don't start with a keyboard that came on your Packard Bell BBS-hosting 8086 machine. In other words, each half should have a curve and should be lifted and tilted. (the thumbs should be much higher than the pinkies).

(ideally each half should sit on something like a really nice high-friction camera tripod mount, so that you can adjust the tilt to any angle you want; the perfectly-flat or perfectly-vertical is no good).

The real absolute no-go killer is the fucking retarded arrow key layout. It's not a fucking laptop (and even on a laptop there's no excuse for that). What are you thinking? There is *one* way to do arrow keys and the pgup/home set, you do not change that. Also cramming the arrows in there makes the shift key too small and puts the right-control in the wrong place. Totally unacceptable, why are they trying so hard to save on plastic? There should be a normal proper offset arrow key set over on the right side.

And get rid of all those custom-function buttons on the left side. They serve no purpose, and they push my left-hand mouse out further left than it needs to be.

Why is everyone but me so retarded? Why am I not hired to design every product in the world? Clearly no one else is capable. Sheesh.


08-12-13 | Cuckoo Cache Tables - Failure Report

This is a report on a dead end, which I wish people would do more often.

Ever since I read about Cuckoo Hashing I thought, hmm even if it's not the win for hash tables, maybe it's a win for "cache tables" ?

(A cache table is like a hash table, but it never changes size, and inserts may overwrite previous entries (or not insert the new entry, though that's unusual). Lookups may use a single probe or multiple probes.)

Let me introduce it as a progression :

1. Single hash cache table with no hash check :

This is the simplest. You hash a key and just look it up in a table to get the data. There is no check to ensure that you get the right data for your key - if you have collisions you may just get the wrong data back from lookup, and you will just stomp other people's data when you write.


Data table[HASH_SIZE];

lookup :

hash = hash_func(key);
Data & found = table[hash]; // might be some other key's data - no way to tell

insert :

table[hash] = data; // blindly stomps whatever was there

This variant was used in LZP1 ; it's a good choice in very memory-limited situations where collisions are either unlikely or not that big a deal (eg. in data compression, a collision just means you code from the wrong statistics, it doesn't break your algorithm).

2. Single hash cache table with check :

We add some kind of hash-check value to our hash table to try to ensure that the entry we get actually was from our key :


Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash = hash_func(key);
check = alt_hash_func(key);
if ( table_check[hash] == check )
{
  Data & found = table[hash];
}

insert :

table_check[hash] = check;
table[hash] = data;

In practice, hash_func and alt_hash_func are usually actually the same hash function, and you just grab different bit ranges from it. eg. you might do a 64-bit hash func and grab the top and bottom 32 bits.

In data compression, the check hash value can be quite small (8 bits is common), because as noted above collisions are not catastrophic, so just reducing the probability of an undetected collision to 1/256 is good enough.
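
For concreteness, here's a minimal sketch of pulling both values out of one 64-bit hash (my illustration; HASH_BITS and hash_key are made-up names, not from any particular codebase) :

#include <cstdint>

const int HASH_BITS = 18;
const uint32_t HASH_SIZE = 1u << HASH_BITS;

struct HashPair { uint32_t index; uint8_t check; };

HashPair hash_key(uint64_t h64) // h64 = any good 64-bit hash of the key
{
    HashPair hp;
    hp.index = (uint32_t)(h64 & (HASH_SIZE-1)); // low bits -> table index
    hp.check = (uint8_t)(h64 >> 56);            // top 8 bits -> check value
    return hp;
}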

3. Double hash cache table with check :

Of course since you are now making two hashes, you could look up two spots in your table. We're basically running the primary hash and alt_hash above, but instead of unconditionally using only one of them as the lookup hash and one as the check, we can use either one.


Data table[HASH_SIZE];
int table_check[HASH_SIZE]; // obviously not actually a separate array in practice

lookup :

hash1 = hash_func(key);
hash2 = alt_hash_func(key);
if ( table_check[hash1] == hash2 )
{
  Data & found = table[hash1];
}
else if ( table_check[hash2] == hash1 )
{
  Data & found = table[hash2];
}

insert :

if ( quality(table[hash1]) <= quality(table[hash2]) ) // overwrite the lower quality slot
{
    table_check[hash1] = hash2;
    table[hash1] = data;
}
else
{
    table_check[hash2] = hash1;
    table[hash2] = data;
}

Where we now need some kind of quality function to decide which of our two possible insertion locations to use. The simplest form of "quality" just checks if one of the slots is unused. More complex would be some kind of recency measure, or whatever is appropriate for your data. Without any quality rating you could still just use a random bool there or a round-robin, and you essentially have a hash with two ways, but where the ways are overlapping in a single table.
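
For example, one trivial quality() (an assumption on my part; the right metric depends on your data) is a recency stamp with zero meaning empty, so empty slots always lose, and otherwise the least recently touched entry gets overwritten :

#include <cstdint>

struct Slot { uint32_t check; uint32_t last_touch; /* Data data; */ };

// 0 = never used = lowest possible quality
uint32_t quality(const Slot & s)
{
    return s.last_touch;
}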

Note that here I'm showing the check as using the same number of bits as the primary hash, but it's not required for this type of usage, it could be fewer bits.

(also note that it's probably better just to use hash1 and hash1+1 as your two hash check locations, since it's so much better for speed, but we'll use hash1 and hash2 here as it leads more directly to the next -)

4. Double hash cache table with Cuckoo :

Once you get to #3 the possibility of running a Cuckoo is obvious.

That is, every entry has two possible hash table indices. You can move an entry to its alternate index and it will still be found. So when you go to insert a new entry, instead of overwriting, you can push what's already there to its alternate location. Lookup is as above, but insert is something like :


insert :

PushCuckoo(table,hash1); // make room at hash1
table_check[hash1] = hash2;
table[hash1] = data;



PushCuckoo(table,hash1)
{
// I want to write at hash1; kick out whatever is there

if ( table[hash1] is empty ) return;

// the occupant's check value is its alternate table index;
// recursively make room there, then move the occupant over
hash2 = table_check[hash1];
PushCuckoo(table,hash2);

table[hash2] = table[hash1];
table_check[hash2] = hash1;

}

Now of course that's not quite right because this is a cache table, not a hash table. As written above you have a guaranteed infinite loop because cache tables are usually run with more unique insertions than slots, so PushCuckoo will keep trying to push things and never find an empty slot.

For cache tables you just want to do a small limited number of pushes (maybe 4?). Hopefully you find an empty slot in that search, and if not you lose the entry that had the lowest "quality" in the sequence of steps you did. That is, remember the slot with lowest quality, and do all the cuckoo-pushes that precede that entry in the walk (there's a code sketch after the example below).

For example, if you have a sequence like :


I want to fill index A

hash2[A] = B
hash2[B] = C
hash2[C] = D
hash2[D] = E

none are empty

entry C has the lowest quality of A-E

Then push :

B -> C
A -> B
insert at A

That is,

table[C] = table[B]
hash2[C] = B
table[B] = table[A]
hash2[B] = A
table[A],hash2[A] = new entry

The idea is that if you have some very "high quality" entries in your cache table, they won't be destroyed by bad luck (some unimportant event which happens to have the same hash value and thus overwrites your high quality entry).
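
Here's a sketch of that bounded-push insert (my code, for illustration only; Slot and the quality-as-recency-stamp convention are assumptions carried over from above, and note that the check value must be the full alternate table index, as discussed below) :

#include <cstdint>
#include <vector>

struct Slot { uint32_t check; uint32_t quality; /* Data data; */ };

const int MAX_PUSHES = 4; // small bounded walk

void cuckoo_insert(std::vector<Slot> & table,
                   uint32_t hash1, uint32_t hash2, uint32_t new_quality)
{
    // walk the chain of alternate locations, remembering the worst slot seen
    // (the walk can revisit slots; with a small bound that's harmless)
    uint32_t walk[MAX_PUSHES+1];
    walk[0] = hash1;
    int worst = 0;
    int n = 1;
    while ( table[walk[n-1]].quality != 0 && n <= MAX_PUSHES )
    {
        walk[n] = table[walk[n-1]].check; // occupant's alternate location
        if ( table[walk[n]].quality < table[walk[worst]].quality )
            worst = n;
        n++;
    }
    if ( table[walk[n-1]].quality == 0 )
        worst = n-1; // found an empty slot : evict nothing

    // push entries down the walk; the worst slot gets overwritten
    for (int i = worst; i > 0; i--)
    {
        table[walk[i]] = table[walk[i-1]];
        table[walk[i]].check = walk[i-1]; // moved entry's alternate is its old slot
    }

    table[hash1].check = hash2;
    table[hash1].quality = new_quality;
    // table[hash1].data = the new data
}

With MAX_PUSHES at zero this degrades to the plain overwrite of #3; the test below used a max of 10 pushes.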

So, I have tried this and in my experiments it's not a win.

To test it I wrote a simple symbol-rank compressor. My SR is order-5-4-3-2-1 with only 4 symbols ranked in each context. (I chose an SR just because I've been working on SR for RAD recently; otherwise there's not much reason to pursue SR, it's generally not Pareto). Contexts are hashed and looked up in a cache table. I compressed enwik8. I tweaked the compressor just enough so that it's vaguely competitive with state of the art (for example, I use a very simple secondary statistics table for coding the SR rank), because otherwise it's not a legitimate test.

For Cuckoo Caching, the hash check value must be the same size as the hash table index, so that's what I've done for most of the testing. (in all the other variants you are allowed to set the size of the check value freely). I also tested an 8-bit check value for the single lookup case.

I'm interested in low memory use and really stressing the cache table, so most of the testing was at 18-bits of hash table index. Even at 20 bits the difference between Cuckoo and no-Cuckoo disappears.

The results :


18 bit hash :

Single hash ; no confirmation :
Compress : 100000000 -> 29409370 : 2.353

Single hash ; 8 bit confirmation :
Compress : 100000000 -> 25169283 : 2.014

Single hash ; hash_bits size confirmation :
Compress : 100000000 -> 25146207 : 2.012

Dual Hash ; hash_bits size confirmation :
Compress : 100000000 -> 24933453 : 1.995

Cuckoo : (max of 10 pushes)
Compress : 100000000 -> 24881931 : 1.991

Conclusion : Cuckoo Caching is not compelling for data compression. Having some confirmation hash is critical, but even 8 bits is plenty. Dual hashing is a good win over single hashing (and surprisingly there's very little speed penalty (with small cache tables, anyway, where you are less likely to pay bad cache miss penalties)).

For the record :


variation of compression with hash table size :

two hashes, no cuckoo :

24 bit o5 hash : (24,22,20,16,8)
Compress : 100000000 -> 24532038 : 1.963
20 bit o5 hash : (20,19,18,16,8)
Compress : 100000000 -> 24622742 : 1.970
18 bit o5 hash : (18,17,17,16,8)
Compress : 100000000 -> 24933453 : 1.995

Also, unpublished result : noncuckoo-dual-hashing is almost as good with the 2nd hash kept within cache page range of the 1st hash. That is, the good thing to do is lookup at [hash1] and [hash1 + 1 + (hash2&0xF)] , or some other form of semi-random nearby probe (as opposed to doing [hash1] and [hash2] which can be quite far apart). Just doing [hash1] and [hash1+1] is not as good.
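
eg. something like this for the probe pair (my sketch; the 16-slot neighborhood comes from the 0xF mask above) :

#include <cstdint>

const int HASH_BITS = 18;
const uint32_t HASH_SIZE = 1u << HASH_BITS;

// second probe derived from hash2, but kept within 16 slots of the first,
// so both lookups usually touch the same cache page
uint32_t nearby_probe2(uint32_t hash1, uint32_t hash2)
{
    return (hash1 + 1 + (hash2 & 0xF)) & (HASH_SIZE-1);
}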


08-08-13 | Oodle Static LZP for MMO network compression

Followup to my post 05-20-13 - Thoughts on Data Compression for MMOs :

So I've tried a few things, and Oodle is now shipping with a static dictionary LZP compressor.

OodleStaticLZP uses a static dictionary and hash table which is const and shared by all network channels. The size is set by the user. There is an adaptive per-channel arithmetic coder so that the match length and literal statistics can adapt to the channel a bit (this was a big win vs. using any kind of static models).

What I found from working with MMO developers is that per-channel memory use is one of the most important issues. They want to run lots of connections on the same server, which means the limit for per-channel memory use is something like 512k. Even a zlib encoder at 400k is considered rather large. OodleStaticLZP has 182k of per-channel state.

On the server, a large static dictionary is no problem. They're running 16GB servers with 10,000 connections, they really don't care if the static dictionary is 64MB. However, that same static dictionary also has to be on the client, so the limit on how big a static dictionary you can use really comes from the client side. I suspect that something in the 8MB - 16MB range is reasonable. (and of course you can compress the static dictionary; it's only something like 2-4 MB that you have to distribute and load).

(BTW you don't necessarily need an adaptive compression state for every open channel. If some channels tend to go idle, you could drop their state. When the channel starts up again, grab a fresh state (and send a reset message to the client so it wipes its adaptive state). You could do something like have a few thousand compression states which you cycle in an LRU for an unbounded number of open channels. Of course the problem with that is if you actually get a higher number of simultaneous active connections you would be recycling states all the time, which is just the standard cache over-commit problem that causes nasty thrashing, so YMMV etc.)

This is all only for downstream traffic (server->client). The amount of upstream traffic is much less, and the packets are tiny, so it's not worth the memory cost of keeping any memory state per channel for the upstream traffic. For upstream traffic, I suggest using a static huffman encoder with a few different static huffman models; first send a byte selecting the huffman table (or uncompressed) and then the packet huffman coded.
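
A sketch of how the encode side of that could look (my illustration, not the Oodle API; each static model here is just a table of Huffman code lengths per byte value, and the encoder picks whichever model gives the smallest packet) :

#include <cstdint>
#include <cstddef>

// size in bits of a packet under one static Huffman model
static size_t encoded_bits(const uint8_t codelen[256], const uint8_t * packet, size_t len)
{
    size_t bits = 0;
    for (size_t i = 0; i < len; i++)
        bits += codelen[packet[i]];
    return bits;
}

// returns the selector byte to prepend : a model index, or 0xFF for uncompressed
int choose_model(const uint8_t codelens[][256], int num_models,
                 const uint8_t * packet, size_t len)
{
    size_t best_bits = len * 8; // cost of sending it uncompressed
    int best = 0xFF;
    for (int m = 0; m < num_models; m++)
    {
        size_t bits = encoded_bits(codelens[m], packet, len);
        if ( bits < best_bits ) { best_bits = bits; best = m; }
    }
    return best;
}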

I also tried a static dictionary / adaptive statistics LZA (LZA = LZ77+arith) (and a few other options, like a static O3 context coder and some static fixed-length string matchers, and static longer-word huffman coders, but all those were much worse than static LZA or LZP). The static dictionary LZA was much worse than the LZP.

I could conjecture that the LZP does better on static dictionaries than LZA because LZP works better when the dictionary mismatches the data. The reason being that LZP doesn't even try to code a match unless it finds a context, so it's not wasting code space for matches when they aren't useful. LZ77 is always trying to code matches, and will often find 3-byte matches just by coincidence, but the offsets will be large so they're barely a win vs literals.

But I don't think that's it. I believe the problem with static LZA is simply that for an offset-coded LZ (as I was using), it's crucial to put the most useful data at low offset. That requires a very cleverly made static dictionary. You can't just put the most common short substrings at the end - you have to also be smart about how those substrings run together to make the concatenation of them also useful. That would be a very interesting hard algorithm to work on, but without that work I find that static LZA is just not very good.

There are obvious alternatives to optimizing the LZA dictionary; for example you could take the static dictionary and build a suffix trie. Then instead of sending offsets into the window, forget about the original linear window and just send substring references in the suffix trie directly, ala the ancient Fiala & Green. This removes the ugly need to optimize the ordering of the linear window. But that's a big complex bowl of worms that I don't really want to go into.

Some results on some real packet data from a game developer :


downstream packets only
1605378 packets taking 595654217 bytes total
371.0 bytes per packet average


O0 static huff : 371.0 -> 233.5 average

zlib with Z_SYNC_FLUSH per packet (32k window)
zlib -z3 : 371.0 -> 121.8 average
zlib -z6 : 371.0 -> 111.8 average

OodleLZH has a 128k window
OodleLZH Fast :
371.0 -> 91.2 average

OodleLZNib Fast lznib_sw_bits=19 , lznib_ht_bits=19 : (= 512k window)
371.0 -> 90.6 average

OodleStaticLZP [mb of static dic|bits of hash]

LZP [ 4|18] : 371.0 -> 82.8 average
LZP [ 8|19] : 371.0 -> 77.6 average
LZP [16|20] : 371.0 -> 69.8 average
LZP [32|21] : 371.0 -> 59.6 average

Note of course that LZP would also benefit from dictionary optimization. Later occurrences of a context replace earlier ones, so more useful strings should be later in the window. Also just getting the most useful data into the window will help compression. These results are without much effort to optimize the LZP dictionary. Clients can of course use domain-specific knowledge to help make a good dictionary.

TODOS : 1. optimization of LZP static dictionary selection. 2. mixed static-dynamic LZP with a small (32k?) per-channel sliding window.


08-08-13 | Selling my Porsche 911 C4S

Well, it's time to sell my car (a 2006 911 C4S). Unfortunately I missed the best selling season (early summer) because of baby's arrival, so it may take a while. It's so horrible having to work with these web sites, they're so janky and broken, jesus christ the web is such total balls, you misclick the back button and lose all your work or hit enter and some required field wasn't entered so they wipe out all your entries. (Whenever I remember to, it's best practice to always do your writing into a text editor and then copy-paste it to the web site, don't edit directly in the web site). So far I'm just getting lots of emails from spammers and dealers and other lowballers, so it's a great waste of my time. (and a lot of the lowballers (guys will offer things like $30k for the car I'm asking $48k for) are delightful human beings who insult me when I politely tell them they're out of their mind). Anyhoo...

Why am I selling it? Primarily I just don't drive it anymore. Between baby and everything else I have no time for it. Also Seattle is just a horrible place for driving. With my back problems I just really can't sit in a sports car for long periods of time (even though the 911 is by far the best of any of the cars I tried, it's super roomy inside which is really awesome), I'm gonna have to get an SUV (*shudder*) or something more upright, I no longer bend at the waist. I don't have the cash to sit on a depreciating car that I'm not driving. I also have concluded that I should not own a car that I'm worried about scratching, it means I can't really enjoy it and I'm kind of tense about it all the time. I need a piece of junk that I can crash into things and not care. (I feel the same way about nice clothes; if it's expensive enough that I am worried about spilling my food on it, then I don't want to wear it).

I'm gonna take a huge value loss on the car even if I get a good price, because it's in great shape mechanically, but has a few cosmetic flaws. Porsche buyers would much rather have something that's spotless but a disaster in the engine bay. That is, there's a lot of ownership value in the car which just evaporates for me when I put it on sale. Oh well.

Autotrader ad here with pictures.

Here's the ad text properly formatted :



07-25-13 | Baby Baby

Some rambling.

We've been going through a pretty hard time. Baby has had colic for the past few weeks (colic = mysterious crying, probably due to GI pain). It's pretty reliable in the evening, usually about 2 hours, but on really bad days it can be 5 hours. It's brutal and exhausting. The only thing that soothes the crying is constant walking around, patting, cooing, various distraction techniques. It just takes massive amounts of energy. Wifey and I both get super exhausted doing it and then we start snapping at each other.

Some of the silly books say "start to establish a going-to-bed ritual". Okay, got it. Cry inconsolably. Parents start thinking "fucking baby I'm going to kill you". After 2-3 hours of crying, go to bed. Yay, we've got a ritual! On the plus side, after the big cry session, she is exhausted and sleeps pretty solidly through the night.

But it's getting better I think. She's recently learned how to get burps out mostly on her own and without a huge fuss about it, and I think that's relieving a lot of it. We're also getting better at learning the soothing techniques. We've had a few days with hardly any of the crazy colic crying at all, and those are like "wow, this isn't that bad". (addendum : in the weeks since I wrote this and haven't posted it, it's continued to get better and she seems to be down to basically just fussy baby crying, not so much the evil colic stuff).

Just in the past few days she's started reaching for things and grasping. She still has zero coordination, so the reaching is like random flailing of tentacles, and if one of them happens to hit the target then it locks on. She does pretty amazing smiling laughing now, which is adorable (the cutest ones are when she gets so happy that she just can't handle it and explodes with this flailing of arms and has to turn her head away like a shy Japanese schoolgirl). We have nice little "talking" sessions of err-ga's and such. I rhetorically wonder if she has any concept that the sounds I make at her (or the sounds she makes back) have any meaning (or any possibility of meaning) at all, or are they just random sounds?

My mom visited and helped out for a while, which was amazing. It was so nice just to be able to eat dinner together (wifey and I), or work in the garden, or just generally have some time when Wifey and I could both be baby free. To all of you baby owners who had parents nearby to help out with yours : suck it. (or, you know, thank them).

I'm so looking forward to when my kid(s) are like 5-10 years old and can actually do stuff with me. I took baby on a walk by the lake the other day, and saw families with their little kids swimming in the lake, and damn that is the part that I want. It's what I've always wanted; I want to swim and ride bikes and play in the park. I've been doing it alone as a creepy old single man in the park ("hi, cute kid you've got there") for the past 10 years and I'm ready to do it with my own kids (not the creepy part, just the playing in the park part).

I think the 5-10 years old phase might be the most unconditionally happy time in life. I mean yeah things are better in many ways when you're college age and finally free and having sex and all that, but in those early years (assuming you have a good family) you're just totally free of worry, life is so care free, it's before the horrible teenage years where you start feeling the societal pressures and stresses of having to do well and so on. You can just live in the moment; like hey I'm at the park, I'm gonna run around the field and there are zero thoughts of what I have to accomplish, the fucking health insurance up-coding I have to fight, my excessive job todo list, etc.

Hmm, Charles's life map :


baby : who fucking cares you can't remember it
adolescent : playing, making bows and arrows, sweet parents, yay fun times!
teen : oo I'm so mopey, life sucks, listen to emo music and pout
college : I'm free! do drugs, have sex, go clubbing, party time oh yeah!
post-college : ugh wtf do I do with my life, I'm so stressed out, work sucks, new baby sucks
child is adolescent : playing, making bows and arrows, sweet parents, yay fun times!
child is teen : fucking ungrateful annoying mopey bastard kid, I'm so stressed out, work sucks
child goes away to college : we're free! do prescription drugs, drink wine all day, oh yeah!

or something.

We've now tried almost every single baby carrier that's made, and none of them are working for us. Thankfully we inherited most of them. We have a Peanut Shell, a Moby wrap, a Babyhawk Mei Tai, a traditional Mei Tai, and an Ergo. Baby pretty much hates all of them. It is possible to get her in the Moby in positions she likes, but it takes about five hours so by the time you get it all tied up she has a dirty diaper and wants to eat again so you have to take her out. We've got a few difficulties with them. One is that she hates not being able to see the world, and the vast majority of them stick her face directly in your chest which she won't tolerate. The other is that none of them work very well with small baby's feet/legs. She's too young to really wrap her legs around our torsos, but the alternative is to just fold up her legs inside and they get all crushed (I suspect all the carriers will work better once she gets to the wrap-legs-around stage). It's been a frustrating ordeal, trying over and over to get her to tolerate a carrier only to have to bail out after five minutes because she starts screaming.

(addendum : we finally caved and got a Baby Bjorn to use front-facing, even though you're "not supposed to" because it's perhaps bad for her hips (*). Yep, it works. It's the only carrier she'll tolerate for more than a few minutes when she's awake).

(* = the "not supposed to" being the standard line from the internet clique of chattering moms who spew a lot of nonsense that's not based on any facts and then repeat it over and over as if it came from some solid source. The anti-rational-thought-basis of the modern internet mob is pretty sickening to me.)

One of the things that's been really difficult for me personally is adjusting my night schedule. I used to use dinner as my transition point from hard-working-mode to relaxed-day-is-done-mode. I loved to have a very slow dinner, lying down and nibbling for a while like a Roman, having a glass of wine, taking some deep breaths. It's great for the digestion to just really relax and eat slowly. Then there would be no more work for the rest of the day. No longer can I do that; dinner is a frantic rush to cram some food in my mouth while we take turns holding baby, and the work day is not done until we get her down to sleep at night, so I can't really start relaxing until then. It all means that my "on" time is so much longer, it takes more endurance to stick it out, and then after we get her down I have stay up for a few more hours to get that relaxation to get to sleep. Oh well; life changes.

Just recently I started working near full time again, and it's such a change. For one thing it's actually easier. Work is a relief compared to baby care. (though the hardest thing of all is trying to squeeze in bits of work and still do masses of baby care and try to keep a good attitude about it). On the negative side, I could feel my closeness with baby slip away almost immediately.

In the first few weeks when I was baby-caring full time, I felt almost as close to baby as mom was. Obviously she had the breastfeeding advantage, but I could soothe baby just as well, and interpret her signals, and I didn't feel like there was any huge gap. As soon as I started working full time and I was away from baby for long stretches I started seeing a difference. Suddenly baby wanted mom more than me; mom could read signals I couldn't read; baby would sometimes get fussy and I couldn't console her, but a hand-off to mom instantly calmed her down.


07-16-13 | Vaccines

Every parent these days has to face the question of vaccines, and whether they will blindly follow the standard CDC schedule or choose any modifications.

Obviously most of the anti-vaccine movement is total nutjobs, based on no science. They're an odd mix of the crazy christians and the crazy liberal/organic types who are part of a general insane granola-y anti-science movement. The scaremongering has become so widespread that even the sane parent has to ask themselves "is there anything to this?".

Unfortunately, the pro-vaccine side is not really without faults. They also are dishonest and misrepresent the facts, and make lots of arguments that aren't helping their side.

A few links on the pro-vaccine (anti-Dr. Bob) side :

The Problem With Dr Bob's Alternative Vaccine Schedule
CDC - Vaccines - Immunization Schedules for Children in Easy-to-read Formats
Cashing In On Fear The Danger of Dr. Sears « Science-Based Medicine

Let me make a few points :

1. Some vaccines may be in the best interest of society, but not of the child. A parent making a purely selfish decision for the best interest of their child would logically choose not to get that vaccine.

The pro-vaccine people really don't want you to know this, so they lie about it or try to hide it.

Let's consider the MMR shot. The pro-vaccine people will say something like : the rate of severe complications (mainly encephalopathy and other high-fever side effects) from the MMR vaccine is something like 1-3 per million (*). However if you actually do get measles the rate of severe side effects (eg. death) is around 1-3 per thousand.

(* = there is some debate about whether the rate is actually higher than placebo, but let's ignore that question for the purpose of this point)

That's a slam dunk for vaccines, right? One per thousand vs one per million! So they claim, but no it is not. The problem is that you are getting a vaccine 100% of the time, but assuming that everyone else continues to get the MMR shot so that the diseases remain vanishingly rare (**), the chance of you getting measles is only something like 1 in a million. So in fact the chance that you will die from measles in your lifetime is only 1 in a billion, much lower than the complication rate of the vaccine. (of course the complication is worse and you have to weigh that somehow)

(** = I know perfectly well there have been recent outbreaks due to the non-vaccinating nutters, but it's still vanishingly rare at this point)

Now, obviously if you choose to fuck over society for such a marginal +EV for your child, you're an asshole. But that is the American Way. It could practically be on our national seal - "Take what you want and fuck the greater good".

I think this is the most interesting point of the whole post; theoretically let us imagine there is a vaccine that is actually on average harmful for each individual, but as a whole provides a massive benefit to society. Should you get it? Would people actually get it? I believe that the demonstrated behavior of the modern vaccine-abstainer movement is that no, people would in fact not get it. So then, should it be required by law?

BTW I suspect that there probably is not a currently existing vaccine (of a major disease; I'm not talking about chickenpox, hpv, or flu, which are in a separate category) where it is in fact +EV for the child to abstain, but I think it's close in a few cases and it's an interesting theoretical question. (the main question for it being "close" is that I suspect the supposed MMR side effects, and the settlements under the table injury law, are mostly not actually MMR side effects (*!*)).

(*!* = The primary question is about the cases of encephalopathy that have been compensated under the NCVIA based solely on them occurring shortly after the injection. In these cases, the government pays a settlement automatically without admitting fault or requiring any proof that the encephalopathy was caused by the vaccine. Some on the anti-vaccine side incorrectly interpret these settlements as courts finding that the vaccines did harm, but that is not the case. It's unclear whether the rate of encephalopathy following MMR is in fact statistically significant compared to the rate following a placebo shot; there are lots of papers on this in case you want to waste a day).

2. The pro-vaccine people claim that combo shots are perfectly safe and there's no reason to separate them. However, we know that quite a few of the combo shots that have been sold by the major pharmaceutical companies in the last 40 years have in fact been *not* safe. And during that time they were saying "oh yeah sure ProQuad (or whatever) is totally safe, trust us". So we're supposed to believe that despite the safety net failing repeatedly in the past, it's worked now and the current set of shots is fine.

My opinion on medication in general is to not trust anything that's new. If it's new, not only has it not been tested much in the field (and for many reasons you should not trust the manufacturer's own tests), but more than that there's a profit motive. The generic combinations that have been in use for a long time are no longer cash cows, so they try to bring out some new combination that puts even more together, and when there is a profit opportunity there is often pharmaceutical companies making people sick. I'd much rather get a 20 year old drug than a new one that's supposed to be better (though it's hard to find doctors that go along with this).

And the whole pro-combo-shot argument seems illogical to me on the face of it. What they generally argue is that the number of antigens in the vaccines is low, in fact very low compared to the number of antigens that babies get environmentally all the time (***). They also contend that babies' immune systems are perfectly capable of handling the load. Sure sure. But in fact vaccines do sometimes trigger high level immune responses, a very small fraction of the time. Each separate type of antigen is capable of doing that, and if you get unlucky and the body responds badly to several at the same time, that's more likely to be a higher and longer lasting fever.

(*** = and of course that's a specious argument; the daily environmental antigen exposure is obviously not the same thing as an injection of very specific antigens related to a major disease. They're not the same kind of antigens, your body doesn't have the same reaction, they aren't introduced the same way, it's just a retarded argument. The way the pro-vaccine group constantly tries to make its argument "stronger" by adding more points that aren't quite right doesn't help).

They put up this absurd straw man argument, claiming the objection to combos is that the infant immune system will be "overwhelmed", and in fact it will not be, QED combos are fine. Umm, what?

3. The "Aluminum is safe" arguments are weak. There is no good data on whether the Aluminum in the new vaccines is safe or not long term. It simply hasn't been used long enough to know if there is a low level long term toxicity.

In order to contend that it is safe, they compare it to the ingestable aluminum recommended limits and slow exposure limits and things like that which are not the same thing. Of course there's never any long term testing of any new medicine, so any new drug you take could have bad long term effects. And even for drugs that are on the market a long time, it can be very hard to attribute long term complications to them (eg. the mercury carriers that were used before aluminum probably were in fact perfectly safe, but it's hard to tell for sure).

It sort of reminds me of when dental fillings all switched from mercury amalgam over to epoxy resins. There was zero evidence that the mercury fillings were actually harmful, but because they have the scary word "mercury" in the name, people thought they must be toxic. So instead we all of a sudden get some new chemical reaction happening inside our mouths, which, being new, of course has no long term health record. So essentially you're exposing your whole population to a new risk for no good reason (eg. dental resins release various VOCs during initial curing, which may or may not have health consequences).

4. The CDC schedule is in fact not made with the health of your infant in mind.

This is one of the big dishonesties in the pro-vaccine camp that bothers me. Lots of the pro-vaccine people say "follow the CDC, they're the experts, the schedule is made in the best way". Well, yes, sort of, but they made the schedule with several different factors in mind. One is the best interest of your child. One is herd immunity. Another one is protecting babies from parents that are dishonest or cheaters; eg. the early Hep B vaccine is important even for parents who claim they are clean, because many parents are either secret drug users or having secret extramarital sex. If you know you are not doing those things, it's probably fine to skip it.

Another major factor in the CDC schedule is compliance. They want you to get all the vaccines early and in few appointments, because they know that is the only chance to get most people to see their doctors. If it took lots of appointments, and continued into later childhood, there would be huge compliance failures. A large part of the CDC schedule is behavioral engineering. In fact the best schedule for your child's health is probably slower and more sparse than the CDC schedule (assuming you would actually stick to the slower schedule).

Shots like Polio are given early not because the infant needs them at that age, but just to make sure that person gets them *ever*, because the early shots are the least likely to be missed. It's probably better to get those shots later (though only microscopically since the harm of getting them early is minimal (in fact so minimal that I suspect the extra doctors visits of a spaced out schedule like Dr. Bob's are probably more harmful than just going ahead and getting all the shots early even though you don't need them early)).

Unfortunately the pro-vaccine people don't want to admit this fact and provide a well researched science based slower schedule that is designed with the best interest of the child in mind, so parents are left with things like the non-scientific ad hoc Dr. Bob schedule.

5. Lots of pro-vaccine people in the end resort to "they're the experts, they know more than you, trust them". LOL yeah right, because pharmaceutical companies have never tried to sell us poison, and our government has always given us good science-based health policies.

If you are in fact right, you should be able to argue the facts without resorting to "because we know best". Also the argument that "it's the only schedule that's been tested" is a cheap way out, since you don't allow any other schedule to be tested.


BTW in case it's not clear, my personal belief is that of course you should vaccinate (don't be ridiculous; vaccines are one of the very best things that medical science has ever done (in close competition with antibiotics and antiseptics)), and lacking any better information you should probably just use the CDC schedule. Any -EV in it for your child is very very small, and the risk of trying to make up your own schedule that would be better for your child is greater than any potential benefit.

Of course good decision making also considers the meta-decision of "should I spend my time thinking about this" and I believe in this case the answer is "no".


07-02-13 | Baby Baby Baby

Bleck it's so hard to get any work done. I've been going into the office a bit recently to try to get some concentration, but it's not helping a lot. Part of the problem is I'm not used to the office so it feels weird and uncomfortable being here. A lot of the problem is I just hate commuting so very much; by the time I get here I'm enraged and exhausted and need a lot of time to calm down.

(one thing that I've finally realized recently is that if you spend a lot of energy trying to do every little trivial thing in your life well (like driving, or loading the dishwasher, or enqueueing your laundry), it takes away energy that you could spend on something more important, like your social life or your work. I can kind of see the merit of being a total non-mindful fuckup when you're doing the trivial stuff of daily life, like the way people will walk straight into me when walking down the sidewalk, or let their leashed dog go on the opposite side of me, or leave their grocery cart right in the middle of the aisle; I always thought "jesus christ what a fucking asshole shit-for-brains", but maybe they are just saving their mental energy for more important things. More realistically I now see that the average McDonalds employee who is obviously not putting any effort into doing their job well is actually doing the right thing; why should they? of course they should just be as lazy as possible and save their energy for the fun glue-huffing party that night).

Baby has started to get a little easier. She's sleeping a bit better, and the severe colic we were occasionally getting is perhaps decreasing. I now suspect that the worst colic was due to foremilk/hindmilk imbalance, so we're trying some steps to address that and perhaps it's working. It's pretty hard and indirect to diagnose these kind of issues, so we just sort of stab in the dark and see if things get better (and of course when things do get better that is no proof that you were right (classic "entrepreneur's fallacy" (*))). We're also just learning how to deal with it better; when she gets into the once a day fussy time, we just have to walk her back and forth for hours, keep patting her back or bouncing her, show her things to keep her amused, and just wait it out until she settles down again.

(* "entrepreneur's fallacy" is my own coinage (I dunno if there's a more standard name for this). Basically it's the belief that because you were successful, everything you did was right and part of that success. It is the unfounded cause-effect association of every decision you made with the observed result of your success.) (I suppose this is just the "post hoc" fallacy, but I enjoy the pejorative implied in my nomenclature)

It's still exhausting and we're running on very little sleep. I'm a little bothered by the frequent advice we get to "ask for help". Ask who exactly? And WTF are they going to do? Are they going to sleep with baby and feed her so wifey can get a decent night's sleep for once? Are they going to clean the house? Of course not. It's like the advisers think the problem is that we're foolish martyrs who are choosing to take on more than we should. Uh no, we'd love to have help. There is no fucking help. Not for anything in this life. I've had the same advice in different arenas - at work, in social life - "make sure you ask for help if you need it". Bullshit. In my experience when you go to your boss or producer in a job and say "I really have too much work, you need to offload me or get another person on this" what you get is not help, but rather a condescending lesson on time management or prioritizing, like "well let's look through your task list and see what time estimates you've got and perhaps we can reduce some of those". (but of course in the work place it is very important to ask for help even though you won't get it, and in fact very important to make sure it's in writing (really every communication with your boss/producer needs to be in writing so you have a record), so that when you start missing tasks they can't say "you should've asked for help"). I've often asked friends for help with work or life issues or whatever (partly just because I think it's nice in this world to get and give help, and I like to have that relationship with people when I can), and the majority of the time if it is at all inconvenient or just not totally trivial for them to help, or not in their own personal best interest, the result is no help (with exceptions that I am grateful for).

She's now doing some simple two syllable "uh-goos" and "err-kch". One of my favorite times with her is right after she eats (so she's in a really good mood and relaxed, not eager to leave), I'll sit with her in my lap and we'll stare into each other's eyes and talk to each other. I make sounds and she smiles at me and occasionally makes them back. She loves textiles; anything with a complex pattern she'll stare at for minutes totally enthralled.

I'm dreading the upcoming 4th of July. There are already random fireworks being set off in our neighborhood and baby hates them. If she's sleeping, they wake her right up. The night of the 4th is going to suck bad. It's been a major heat wave recently so keeping all the windows closed is not really an option.

Last weekend when we had this big hot spike we went down to the lake for some relief. We sat on a blanket with baby and it was pretty sweet. It was interesting to me to think about how it would have been different doing the same thing before baby, since hanging out by the lake in the heat was one of my favorite activities. With baby we were basically walking her back and forth the whole time to keep her content, occasionally feeding her. Without baby I would have been sunning, swimming, reading, perhaps boozing. It was okay, I didn't miss it much (part of the "everything sucks anyway so it's not too bad to lose it" principle). Single times in the sun at the lake were a joy, but also sort of unsatisfying, haunted by that feeling of "is this all there is?" or "shouldn't I be living it up more somehow?". The thing that I really missed during the hot spike was being out at night. One of my favorite things from the old days was on those heat wave nights, getting out on the bike and feeling the night air, or going to an outdoor cafe to feel the buzzing urban heat wave night energy. Actually some of my first dates with N were heat wave nights, and they were magical. Oh well, sayonara.

(oh and tangential by-the-lake rant : fucking boats and motorcycles with excessively loud motors make the lake a fucking din of growling engines all weekend. Some asshole owners do that intentionally, but the real problem is the law; we need noise limits on boat motors. WTF. They should have to be even quieter than cars, because the sound travels so clearly over the lake, and the fucking lake should be a beautiful peaceful place. If you want a fucking motorboat speedway, go to Chelan or a similar lake that's more rednecky. I'm okay with there being a handful of lakes where the drunks can run each other over, but the majority should be free of that awfulness).

We're trying to hire a "mother's helper" or housekeeper who also helps with baby a bit. They're so fucking cheap, why the hell not? If we can get some relief (and I can get some more work done) for $150/week of course it's worth it. Well, it's not so easy. So far the hiring process has been a frustrating waste of time. Kids are always complaining about how there're no jobs these days; well let me tell you why you can't get a job, it's because you're either a total irresponsible fuck-up or a spazzy freak show. You only need to have just the most minimal level of basic professionalism, like if we set an appointment time for a phone interview, fucking answer your phone when I call. If you come to our house, be on time, clean, and considerate. If you send me an email application, check your email often so that you can follow up on my reply within 24 hours. COME ON!

(in general everyone these days seems to be such a moron that it's a bit risky to let anyone in your house; they do things like flush paper towels down the toilet, put onion peels down the garbage disposal, etc. you've got to keep an eye on them constantly or they're going to do annoying or expensive damage to your house).

We should all know by now that humanity is just fucking vile and horrible and dumb and selfish and mean. But we've started taking baby on walks recently and my opinion of the human race has gone down another notch. People are fucking asshole retards behind the wheel of a car, that are generally irresponsible, inconsiderate, dicky, selfish, dangerous, and just generally stupid, but I always thought that most of those people would still be reasonable around babies. Nope. Just about every time we take her on a walk in the stroller, some asshole tries to speed by us as close as possible. There are several intersections near our house where the cars basically don't stop at the stop signs (perhaps they slow down and roll through if they're one of the more considerate ones). I assumed that hey if I'm crossing at the stop sign and I'm halfway through the intersection with a stroller, they're going to actually stop now, right? right? Nope. If I've got my car parked with the door open and I'm taking baby from stroller to car seat, people passing are going to slow down a bit or pull out a bit wider, right? Nope, full speed right past. Unbelievable, so depressing. It's so deeply sad to me every time I leave my home and see how awful the world is, it makes me never want to leave home.


06-27-13 | Some Media Reviews

Light Asylum - umm, yeah, amazing. Light Asylum is a modern band that recreates the 80's goth industrial sound. It's perfect, tinny synths, that bad operatic singing, it's exactly like what the kid who painted his finger nails black listened to. So, long story: every few months I just go see what the hip kid websites are recommending and download all their favorite stuff, then I gradually get around to listening to it. About a year or so ago I got Light Asylum and put it in my playlist. The first time it played I was like "WTF this is awful" and skipped it. But I was lazy and left it in my playlist. Then over the next few months, once in a while I would listen to something else (mainly "Hooray for Earth" and Grizzly Bear's Shields) and Light Asylum was after those in the PL order so it would drift into there. I would be in the other room and not skip it immediately and I started thinking "this is hilariously bad, but kind of addictive". Flash forward and now I can't get enough of it, I'm listening to it over and over. In every objective way it's just awful; the beats are repetitive, the songs are very basic and don't go anywhere, the singing is terrible, but somehow that's all just right. (I think "IP" is the best example; it's so boiled down and repetitive, and the "25 to liiiiife" is just awful, but wow it works). Amazing.

Top of the Lake - great. My first impression was "bleh it's just The Killing in New Zealand, not this shit again". But it's much better than that. It's intense, the character development is superb, it's hard to watch, you really hate some people and are afraid of what might happen, which means they're doing something right. It is also a bit disappointing in the end; there are some weird Lynchian tones hinted at early on, making me think it might drift into a semi-Twin-Peaks territory and it never develops that thread. And the last episode really sucks; all the episodes before the last are slow and develop things gradually and beautifully, and the last one just wraps everything up neatly in a rushed way. Totally worth watching. It did all feel a bit disjointed, as most modern TV shows do, like they were writing it as they went along without a great overall plan, and it had a lot more promise than it delivered, but still just way more real and powerful than almost anything else on TV.

The Fall - meh, good. Totally straightforward BBC-style crime serial. Not really anything interesting about it, hey there's some crime and some detectives, with no particular twist or local character, but it's very well made, the acting is good, it looks beautiful. Watchable.

Nobody Walks - underrated; simple little obvious movie, but nicely done; it flows well, some good little moments of human interaction. It's right in the early-Lena-Dunham wheelhouse of dysfunctional upper class intellectual families.

BBC Storyville - "The Road" - really well done slice of life doc. I love this kind of thing; reminds me of "The Tower". Sad and beautiful, this world.

Endeavour - old-school BBC style detective show, in the sense that it's sort of charming but the "mysteries" are totally retarded, the local characters are shallow stereotypes, and it sets you up from the beginning to let you know what you're going to get and then gives it to you. It's like a warm bowl of milk and a cardigan, very comforting. I think it's great, carried by the delightful Shaun Evans and Roger Allam.

Out of the Wild - trash reality show survival thing, but I'm a sucker for this and found it addicting. Better than Survivorman or Man vs Wild for my taste. The group dynamic is pretty interesting and much nicer than the typical vote-them-off reality format.

Crap : Orphan Black, The White Queen, Silver Linings Playbook, Rectify, lots of other crap that doesn't bear mentioning.

Lots of good food docs on Netflix right now. "Three Stars" is the best, really superb, but I also enjoyed "Step Up to the Plate", and "A Matter of Taste: Serving Up Paul Liebrandt", both interesting.


06-19-13 | Baby Blues

I had my first real day of "baby blues" a few days ago.

Baby gets these bouts of colic that I believe are mostly from gas. When she has it, she can't stand to be horizontal, and really just wants to be held over the shoulder and patted. That's okay for a while, but sometimes it goes for an hour, which is exhausting. Most days it only happens once, but rarely it occurs over and over throughout the day.

We had a pretty bad day, and I found myself just losing my mind. You get so hungry and tired, but you can't take a break, and you start thinking "shut the fuck up! WTF do you want god dammit". In a real bad moment I started getting these weird impulses "like maybe if I throw the baby on the floor it will shut up" or "maybe shaking a baby isn't that bad". And then you just have to go whoah, keep it together, calm down.

It made me realize that if I was somewhat more irresponsible, more prone to rage, or less in control of myself, I easily could shake a baby, or one of those awful other things that people do (just lock it in a room to cry itself out, or give it booze, or whatever).

In fact it made me sort of understand those moms that kill their children, or the dads that go out for cigarettes and never come back. There's this feeling that these fucking kids are ruining your life and you can't do anything about it and you're going to be stuck with them for the next 18 years and there's this sudden feeling of helpless desperation. I can sympathize with the impulse, but of course that's where being an adult with some self control comes in.

I'm in awe of the women who had to take care of their kids all alone, with no help from their selfish misogynist husband that wouldn't touch a diaper or cook for the family, and appalled at those husbands.

Part of my thinking in having a baby was that I understood perfectly well that I would lose going out to eat, and travel, and pretty much every activity out of the house. And I'm okay with that, because those things fucking suck anyway. I don't understand what old single people do. I feel like I've done pretty much everything (*) there is to do in this life, and I don't need to keep doing the same shit over and over. I'm fine with losing all of that. But I do miss the ability to just have a quiet moment for myself, especially at the end of the day when I'm exhausted and frazzled.

(* = obviously not actually everything, nor everything that I want to do; what I have done is everything that's *easy* to do, which is all that normal people ever do. Things like going to restaurants and driving cars on race tracks and skydiving and travel and watching movies - how fucking boring and pathetic is all of that, don't you have any creativity? That's just the easy default consumerist way to waste your time - pay some money and get some superficial kicks. The actual good activities are things like : make your own internet comedy, assassinate an evil politician, find sponsors and be a motorcycle racer, go horseback camping on your own across the Russian steppe. But everyone's just too lazy and boring to ever do anything good with their life, and so am I. So just have a kid.)

One thing I've realized is that a good parent is never annoyed; a good parent never says "not now". You need to be always able to drop what you're doing, or get up off the couch and help your kid. I've always been the kind of person who has moments where I'll socialize and other times when I really want to be left alone, and if someone tries to talk to me during the left alone time I'm really pissy at them. That's not okay for a parent, you can't be pissy at your kid because they talked to you when you're tired or hungry or trying to work.

(I suppose this is related to a realization I had some time ago, which is if you believe you are a nice person except when you are tired or hungry or cranky or busy or having a bad day - you're not a nice person. Your character is how you behave in the difficult times.)

A good parent doesn't only love their children when they're behaving well. If you only like them when they're being quiet and happy, you're an asshole. A good dad has unconditional love that is not taken away when they misbehave; if anything you need to have more kindness in your heart when they're bad, because that's when they need it from you the most. Don't be a pissy little selfish whiney baby who's like "whaah the kids are being jerks to me so I'm justified in yelling at them or just running away to my office". You're the adult, you're supposed to be the bigger person than them. (in fact a good adult treats other adults the same way)

I really miss sleeping. I'm getting way more sleep than my wife, but it's still just not enough; I suppose the draining hard work of caring for baby is part of the problem, I really need even more sleep than usual and instead I'm getting less. I can feel my brain is not working the way it used to, and that feels horrible.

Maybe most of all I miss being able to have a moment where I know I don't have to do anything. So I can finally stop working, so I can let down my guard and just relax and know I'm not going to have to start up the diesel engine again. I may never have that again in my life, because kid issues can occur at any time; you can't knock yourself out on heroin anymore; you never get that time when you know you can just shut off your brain. Parents have to be always-on, and that's just exhausting.

I guess one of the problems for me is that I give my all to work; I don't stop working for the day until I feel completely drained, like I have nothing left, and then I just can't deal with any more tasks, I need to crash, be left alone. That's been a shitty thing for me to do my whole life; it's been shitty to my girlfriends and wife that I get home from work and just have no energy for them, but perhaps even more so it's been shitty to myself. Even morons (smarter than me) who work retail or whatever know that you shouldn't give all your energy to the stupid job; of course you should be texting your friends while you work, planning your night's activities, and when you get off work you shouldn't be drained, you should have energy to hang out, be nice to people, have a life outside of work. Anyway, when there's a whole mess of baby work to be done when you get off work, you really can't afford to work 'til you drop.

In other shitty-but-true news : if I was hiring programmers, and I had the choice between a dad and a single guy, I would not hire the dad. They would have to be massively better to compensate for the drain of children. Only young single guys will stupidly throw away their lives on work the way a manager really wants.


06-18-13 | How to Work

Reminder to myself, because I've gotten out of the habit. In the months before baby I had developed a pretty good work pattern and I want it back.

There is only full-on hard work, and away-from-computer restorative time. Nothing in between.

1. When working, disable internet. No browsing around. If you have a long test run or something, usually it's not actually blocking and you can work on something else while it goes, but if it is blocking then just walk away from the computer, get your hands off the machine, do some stretching.

2. No "easing into it". This is a self-indulgence that I can easily fall into, letting myself start slowly in the morning, and before I know it it's close to noon. When you start you just fucking start.

3. If you're tired and can't keep good posture when working, stop working. Go sleep. Work again when you're fresh.

4. Whenever you aren't working, don't do anything that's similar to work. No computer time. Fuck computers, there's nothing good to see on there anyway. Just walk away. Avoid any activity that has your hands in front of your body. Try to spend time with your arms overhead and/or your hands behind your back.

5. When you feel like you need to work but can't really focus, don't substitute shitty work like paying bills or online shopping or fiddling around cleaning up code pointlessly, or whatever else makes you feel like you're doing something productive. You're not. Either fucking get over it and work anyway, or if you really can't then walk away.

6. Be defensive of your good productive time. For me it's first thing in the morning. Lots of forces will try to take this away from you; you need to hold baby, you need to commute to go to the office. No no no, this is the good work time, go work.

7. Never ever work at a laptop. Go to your workstation. If you feel your ergonomics are bad, do not spend one second working in the bad position. Fix it and then continue.

8. Set goals for the day; like "I'm going to get X done" not just "I'm going to work on X for a while" which can easily laze into just poking at X without making much progress.

9. When you decide to stop working for the day, be *done*. No more touching the computer. Don't extend your work hours into your evening with pointless trickles of super-low-productivity work. This is easier if you don't use any portable computing device, so just step away from the computer and that's it.

10. Avoid emotional disturbances. Something like checking email in the morning should be benign, but if there's a 10% chance it makes you pissed off, that's a big negative because it lingers as a distraction for a while. I've basically stopped reading any news, and I think it's a big life +EV and certainly productivity +EV.


06-06-13 | Baby Misc

You're old when it takes you a while to remember how old you are.
You're older when you have to do the math from your birth year to figure it out.
You're really old when you have to do the math and get it wrong.
You're really really old when you do the math, get it wrong, and insist you're right despite everyone else in the room telling you a different number.

Pooping baby looks like an alternating sequence of Angry Andy Richter and O-face Gollum.

(baby made me realize that Andy Richter looks just like a giant baby)

On TV you always hear people gushing about that "great baby smell" , like mmm let me stick my nose in and smell that baby. WTF there's no great baby smell. I suppose what those people like is the smell of Johnson&Johnson shampoo and baby powder (both of which are rather out of fashion now). In our house we always try to avoid scented products (the better to smell you by), so our baby gets none of those. The real natural baby smell is mainly sour milk. Milk vomit, milk spitup, milk poos, spilled milk. Yum. It's mixed in with a faint whiff of that really nasty toe-jam funk, because babies have all these awkward fat folds and no matter how thoroughly we bathe her, we seem to miss some fold or other that accumulates crud.

One of the baby diaper-changing suggestions is "make sure to wipe front to back so that you don't spread poo towards the genitals". Umm, have you ever seen a baby diaper? It's like someone set off a poo grenade in there; there's poo everywhere; it's leaking out the top of the diaper, it's way deep inside the vagina. Oh, let me carefully wipe all that poo from front to back, ok, that'll make it all fine.

Baby is finally starting to spend some time awake that's not just eating or burping. That's kind of nice; she is starting to make some more expressive "eh"s, and the other day, for the first time, she head-tracked her mom across the room when she was hungry.

Before baby was born IC made this note to me that "babies are boring", that I thought was charmingly honest. Yep, it's true, babies are boring as fuck. Sure they're cute and all, but there's just endless hours of feeding, burping, rocking; yeah, yeah, baby I've seen your cute arm waving before, just go to sleep already so I can do something else. At first I was kind of trying not to watch TV while holding baby or just put her in a mechanical rocker, I thought it was better to engage and talk to her and play with her and such, but now I say fuck it, there's just too much time to kill.

There's a sort of Stockholm Syndrome with babies. They're so hard in the first few weeks, constantly demanding attention, that it makes you grateful when they do the most basic things. Like, oh great baby overlord, thank you so much for sleeping for 3 hours in the evening, we are so grateful for your kindness. It's like the classic dick boss/dad trick of being really shitty to people so that if you are just halfway decent they love you for it.


06-04-13 | Reader Replacement

Can anyone suggest a good Reader replacement? (WTF Google, seriously).

I tried a few of the Win32 RSS Readers and absolutely hated them; they all tried to be too fancy and out-feature each other. I don't want anything that has its own built-in html renderer. I certainly don't want you to recommend related feeds to me or anything like that. I just want a list of my RSS subscriptions and show me the ones with unread updates, then I'll click it and you can open the original page in Firefox. (even Google Reader is too fancy for my taste; I don't like the in-Reader view of the feed, which often renders wrong; just open the source page in a new tab).

(actually I suppose I don't really care for RSS at all; don't send me the text of an update, all I want is a date stamp of last update so I can see when it changes).

Anyway, suggestions please.

Also if someone knows a webmail + spam filter that can integrate with a POP3 reader, I would drop gmail too, and be happy about my pointless solitary boycott of Google.


06-02-13 | Sport

I've been watching F1 for the past few years. There are some good things about it; for me it satisfies a certain niche of mild entertainment that I call "drone". It's a kind of meditation, when you are too tired to work or exercise, but you don't really want to watch anything either, you just put it on in the background and it sort of lets you zone out, your mind goes into the buzz of the motors and the repetitive monotony of the cars going around over and over in the same order. I particularly like the onboards, you can get your head into imagining that it's you behind the wheel and then sort of go into a trance watching the left-right-left-right sweeping turns.

One thing that clicked for me watching F1 is just how active it is in the cockpit. When we drive on the street we're mostly just sitting there doing nothing, then there's a turn, you are active for a second or two, then it's back to doing nothing. Even with my limited experience on track, I'm so far below the capability of my car that I'm still getting breaks between the corners. A proper race driver lengthens every corner - that's what the "racing line" is all about - you use the whole track to make the corners less sharp, and you extend them so that one runs in the next; the result is that except on the rare long straight, you are actively working the car every second. Also, F1 cars are actually slipping around all the time; you don't really notice it from the outside because the drivers are so good; from the outside the cars seem to just drive around the corner, but they are actually constantly catching slides. The faster you drive a car, the less grip it has; you keep going faster and faster until the lack of grip limits you; every car driven on the absolute limit is slippy (and thus fun). I've been annoyed by how damn grippy my car is, but that's just because I'm not driving it fast enough.

F1 has been super broken for many years now. I suppose the fundamental thing that ruined it was aerodynamics. Aero is great for speed, but horrible for racing. In a very simplified summary, the primary problem is that aero makes it a huge advantage to be the front-runner, and it makes it a huge disadvantage to be behind another car, which makes it almost impossible to make "natural" passes. (more on natural passing shortly). 10 years or so ago before KERS and DRS and such, F1 was completely unwatchable; a car would qualify on pole and then would easily lead the whole race. Any time a faster car got up behind a car it wanted to pass, aero would make it slower and unable to make the pass. It was fucked. So they added KERS and DRS, which sort of work to allow passing despite fundamentally broken racing, and that's how it's been the last few years, but it's weird and unnatural and not actually fun to watch a DRS pass, there's no cleverness or bravery or beauty about it. The horribly designed new tracks have not helped either (bizarre how one firm can become the leading track designer in the world and do almost all the new tracks, and yet be so totally incompetent about how to make a track that promotes natural passing; it's a great example of the way that the quality of your work is almost irrelevant to whether you'll get hired again or not).

(the thing that's saved F1 for a while is the combined brilliance of Raikkonen, Alonso, Vettel, and Hamilton. They continue to find surprising places to pass; long high speed passes in sweeping corners where passing shouldn't be possible, or diving through tiny holes. It's a shame they don't have a better series to race in, those guys are amazing. Vettel is often foolishly criticized as only being able to lead, but actually some of the best races have been when he gets a penalty or a mechanical fault so that he has to start way back in the pack; he charges up more ferociously than anyone else, just man-handling his way up the order despite the RB not being a great car for racing from behind)

Anyway this year I just can't watch any more. The new tires are just so fucked; they take a series that already felt weird and artificial and make it even more so. The whole series is a squabble about regulations and political wrangling about the tires and blah blah I'm done with it.

Searching for something else, I stumbled on MotoGP. I'd seen the Mark Neale documentaries a few years ago ("Faster" etc) and thought they were pretty great, but never actually watched a season. Holy crap MotoGP has been amazing this year. There are three guys who all have legitimate shots at the title (Marquez, Pedrosa, Lorenzo). Rossi is always exciting to watch. Marquez is an incredible new star; I can't help thinking it will end badly for him, he seems too fearless, but until then he is a threat to win any race.

The best thing about MotoGP is there's no aero. No fucking stupid aero. So of course you don't need artificial crap like DRS. The result is that passing is entirely "natural", and it is a beautiful thing to watch; it's a sort of dance, it's smooth and swooping. The bikes are just motors and tires and drivers, the way racing should be. (actually without aero, it's a slight advantage to be behind because you get slipstream; giving an advantage to the follower is good, that's how you would design it as a videogame if real world physics were not a constraint; giving an advantage to the driver in front is totally retarded game design).

Natural passing is almost always done by braking late and taking an inside line to a corner. The inside line lets you reach the apex sooner, so you are ahead of the person you want to pass, but you are then badly set up for the corner exit, so that the person you passed will often have a chance to get you back on the exit; you therefore have to take a blocking line through corner exit while you are at a speed disadvantage due to your inside line. It's how racing passing should be; it's an absolutely beautiful thing to behold; it requires courage to brake late and skill to take the right line on exit and intelligence to set it up well.

Watching the guys ride around on the MotoGP bikes, I wish I could have that feeling. Puttering around on a cruiser bike (sitting upright, in traffic, ugh) never really grabbed my fancy, but to take this beast of a bike and have to grab it and manhandle it and pull it over to the side to get it to turn, it's like riding a bull, it really is like being a jockey (you're never sitting on your butt, you're posted up on your legs and balancing and adjusting your weight all the time), it's a physical athletic act, and yeah fuck yeah I'd like to do that.

I believe the correct way to fix F1 is to go back to the 70's or 80's. Ban aero. Make a standard body shell; let the teams do the engine, suspension, chassis, whatever internals, but then they have to put on a standard-shaped exterior skin (which should also be some material other than carbon so that it can take a tiny bit of contact without shattering). Design the shape of the standard skin such that being behind another car is an advantage (due to slipstream) not a disadvantage. Then no more DRS, maybe leave KERS. Get rid of all the stupid intentionally bad tires and just let the tire maker provide the best tires they can make. Of course none of that will happen.

I've also been watching a bit of Super Rugby. You have to be selective about what teams you watch, but if you are then the matches can be superb, as good or better than internationals. There have been a couple of great experimental rule changes for Super Rugby this year and I hope they get more widely adopted.

1. Time limit on balls not in hand (mainly at the back of a ruck). The ref will call "use it" and then you have 5 seconds to get the ball in play. No more scrumhalves standing around at the ruck doing nothing with the ball at their feet.

2. Limiting scrum resets, and just generally trying to limit time spent on scrums. The refs are instructed not to endlessly reset bad scrums; either let them collapse and play on if the ball is coming out, or call a penalty on the side that's losing ground.

The result is the play is much more ball-in-hand running, which is the good stuff kids go for.

If you want to watch a game, these are the teams I recommend watching, because they are skilled and also favor an attacking ball-in-hand style : Chiefs, Waratahs, Rebels, Blues (Rene Ranger is amazing), Brumbies (too much strategic field position play, but very skilled), Cheetahs, Crusaders. The Reds and Hurricanes are good for occasional flashes of brilliance. The Bulls are an incredibly skilled forward-based team, but not huge fun to watch unless they're playing against one of the above.

The Chiefs play an incredible team attack that I've never seen the like of in international rugby. The thing about the international teams is that while they are the cream of the talent, they don't practice together very much, so they are often not very coordinated. (international matches also tend to be conservative and defensive field-position battles, yawn). The Chiefs crash into every breakdown and recycle ball so efficiently, with everybody working and knowing their part; they go through the phases really fast and are always running vertical, it's fantastic.

Quade Cooper is actually amazing on the Reds. I'd only seen him before in some Wallaby matches where he single-handedly threw away the game for his side, so I always thought of him as a huge talent that unfortunately thought his talent was even bigger than it really is. He plays too sloppy, makes too many mistakes, tries to force moves that aren't there. But on the Reds, it occasionally all works; perhaps because he has more practice time with the teammates so they know where to be when he makes a wild pass out of contact. I saw a couple of quick-hands knock passes by him that just blew me away.

I'm continually amazed by how great rugby refereeing is. It occurred to me that the fundamental difference is that rugby is played with the intention of not having any penalties. That is, both the players and the refs are generally trying to play without penalties. That is completely unlike any other sport. Basketball is perhaps the worst, in that penalties are essentially just part of the play, they are one of the moves in your arsenal and you use them quite intentionally. Football is similar in that you are semi-intentionally doing illegal stuff all the time (holding, pass interference, etc.) and the cost of the foul is often less than the cost of not doing it, like if a receiver gets away and would score, of course you just grab him and pull him down. That doesn't happen in rugby - if it did they would award a penalty try and you would get a card. If someone is committing the same foul again, particularly one that is impeding the opponent from scoring, the ref will take them aside and say "if you do that again it's a card". It's a totally different attitude to illegal play. In most sports, it's up to the player to make a decision about whether the illegal play is beneficial to them or not. I think it reflects the general American attitude to rules and capitalism - there's no "I play fair because that's right" , it's "I'll cheat if I won't get caught" or "I'll weigh the pros and cons and decide based on the impact on me".


05-30-13 | Well Crap

The predictable happened - baby threw out my back. Ever since we had her I kept thinking "oh shit, this position is really not good for me, this is not going to end well", and then yesterday blammo excruciating pain, can't stand up, etc. It's an episode like I haven't had in several years. I've been trying to forestall it, trying to keep up my stretching regimen and so on, but combined with the fatigue of not sleeping it was inevitable. Fatigue is the real enemy of my spine, because it leads to lazy positions, slouching, resting on my bones and so on.

I've been really happy in the last 6 months or so because I've finally been able to work out again after something like 5 years of not being able to properly. Ever since SF I've been doing nothing but light weight therapy work; I kept trying to slowly ramp back into real workouts and would immediately aggravate an injury, have to bail out and start over again with only light PT. I felt like I finally turned the corner, it's the first time I've been able to do basic exercises like squats and deadlifts without immediately re-injuring myself, and it's been wonderful. I still have to be super careful; I only do pulls, never pushes, I don't do any jerky violent moves, I keep the weight well below my max so that I can be really careful with form; perhaps most importantly I spend ages warming up before lifting, which is so damn boring but necessary. And now I'm fucked again.

I used to have all these ambitions (discover a scientific principle and have a theorem named after me, etc. naive child!) but now my one great dream is just to be able to do some basic human activities like lie in a bed or throw a ball without being in great pain.

Sometimes I wonder how much of my sourpuss personality is just because I'm in pain all the time. Like those kids who struggle in school and it turns out they're just slightly deaf or blind or whatever. Often things that you think are fundamental about yourself are actually just because of some surface issue that you should just fix. (and of course my physical problems are totally trivial compared to what lots of people go through)

(like for example I know that part of my problem with socializing is I have some kind of crowd deafness issue; if I'm in a conversation with more than one person, I find it hard to follow the thread, and if more than one person at a time is speaking the words kind of get all jumbled up; sometimes it's so frustrating that I just give up on trying to listen in groups and just check out and zone out. I also avoid a lot of social situations like dinners and movies because I know they'll mean I'm stuck sitting for a long time, which is inevitable severe back pain (and dinners are often intestinal pain); I think a lot of those people who just seem so happy and well-adjusted are that way not because of any mental difference but because they lucked out and don't have any major physical problems)

I think if I had a pool things would be much better. Swimming is amazing for the body. I have a dream; it's to live somewhere sunny with my own pool. I'll lie in the sunshine and swim, and lie in the sunshine and swim. I'm really looking forward to being an excessively tan old man shamelessly swimming (and just walking around) in a tiny speedo.

I often see those guys on the beach in Hawaii or where-ever, shamelessly strutting about with their leathery sunburns and tiny speedos under droopy bellies, and think "I hope that's me in 20 years, and I can't wait!".


05-27-13 | Marsupial Sapiens

I'm convinced that the human being is actually a marsupial that just hasn't developed a pouch yet.

The human baby is the least developed of any mammal. There are various theories why the human baby is born at such an early stage of development (all humans, like marsupials, are essentially born 3 months before they're ready); the naive guess is that a more developed baby wouldn't fit through the mother, but modern theories are different (metabolic or developmental).

A baby is pretty crap as a human, but it's pretty good as an infant marsupial. It wants to just lie on the mother's chest and sleep and eat. Once you think of it as a marsupial lots of other things are just obvious, like it needs low light and not very much stimulation. If you try to make it do anything other than what a marsupial wants (like sleep without skin contact) it gets upset, of course. It really struck me when watching our baby do a sort of proto-crawl (really just a random moving of the limbs) and wiggling around trying to get from the chest to the nipple; that evolved proto-crawl is useless for moving along the ground to the mother, the only thing it can do is move the baby around inside the marsupial pouch.

The Karp method is at its core one sentence of information - "babies are soothed by recreating a womb-like environment". But it's even more accurate to say that what you want to do is create a marsupial-pouch-like environment (eg. you aren't putting the baby in total darkness and immersing it in fluid, nor are you feeding it through a tube).

As is often the case with childrearing, ancient man does a better job than modern man. I think the ideal way to deal with a newborn is just to tuck it in mama's shirt and leave it there. It sleeps on the chest, suckles whenever it wants, and bonds to mama in a calm, sheltered, warm environment. Being tool-using animals, we make our missing marsupial pouch from some cloth. The modern baby carrier things are okay (especially the ones that have the baby facing you on the chest), but they're wrong in a major aspect - they're worn outside the clothing, the baby should really be right on your skin.

It's really incredible how badly we fucked up childbirth and rearing in the western world in the last 100-200 years. Granted it was for good reason - babies and moms used to die all the time. I don't romanticize the old days of home birth with all its lack of sanitation and high mortality rates, the way some of the more foolish hippy homebirth types do. Birth was a major medical problem that needed to be fixed, it's just that we fixed it in massively wrong ways.

I'm trying as much as possible to not read any baby books or crap on the internet. I don't want to be influenced by all the trends and crap advice out there. I want to just observe my baby and do what seems obviously right to me. So for example, I was aware of "attachment parenting" or "aware parenting" or whatever the fuck the cult is called now, but I hadn't made up my mind that we would try that. But once we had the baby and I just looked at it, it was totally obvious that that's what you should do.

If you just pay attention to your baby, it's so clearly trying to tell you "I'm hungry" or "I'm sleepy" or "the light is too bright" or whatever. The only time that a healthy baby cries is when you have ignored its communication for quite a long time (5-10 minutes) and it's gotten frustrated and fed up; a baby crying is it saying "you are neglecting me, please please help me I can't help myself, you fucker, I already asked you nicely". (of course it's not the same in unhealthy babies; and we have some issues like acid reflux and/or gas that lead to some amount of crying inevitably; it's not that all crying is necessarily bad parenting, but there are enough cases of crying that are a result of not listening to the earlier gentle communication that it just seems obvious that you should take care of the baby's needs before it gets to the point of crying, if possible). You've created this helpless little creature, and it gets hungry or uncomfortable and is begging for your help, and to ignore it for so long that it has to yell at you is pretty monstrous.

It's crazy to me that people for so many years thought it was right to just let a baby cry, that it was even good for it to work its lungs or develop independence or not get spoiled. Society gets so stuck in these group-thinks; of course we're all just in a new one now, and it's impossible for me to ever know if I am actually thinking clearly based on what I see, or if I have been subconsciously brainwashed by my awareness of the societal group-think.

(Society considered harmful) (only fools think that they could ever have an independent idea, or any taste or opinion that is actually their own).


05-24-13 | Hunger Cues

I'm so exhausted that I've been eating from fatigue, which is terrible, so a note for myself. I very much believe in listening to your body, but you have to know what to listen to, and hunger cues can be confusing.

1. Tiredness is not a hunger cue. Yes, sure popping some sugar will give you a boost, but that is not the right solution. When you are tired you need sleep, not food. This is always a huge problem for me in work crunches; I start jamming candy down my throat to try to keep my energy up, and I'm tempted to do it now for baby, so hey self - no, sleepiness is not solved by eating. Go sleep.

2. Your belly is actually not a great cue. Sure extreme belly ache and rumbling means you need to eat, but a slight hollow empty feeling, which most people take as an "I must eat now", is not actually a good hunger cue. Humans are not meant to feel full in the belly all the time, but in our culture we get used to that feeling, and so it feels strange when it's gone and you think something is wrong that you have to fix by putting more food in. It's really not; you need to try to detach the mental association of "belly empty" with "eat now".

3. The actual correct hunger cue is light headedness, dizziness, or weakness. That means you really do need to eat something now, but perhaps not a lot. (getting quantities right for yourself takes some experimentation over time to figure out)

I believe that the primary goal of food consumption portioning and scheduling should be to eat as little as possible, without ever going into that red zone of critically low blood sugar. (of course I'm assuming that you want to be near your "ideal" body weight, with "ideal" being the modern standard of trim; if your ideal is to be as large as possible then you would do something different). Note that belly feelings show up nowhere in the "primary goal", you just ignore them. Perhaps even learn to enjoy that slight hollow feeling in your gut, which gives you a bit of a hungry wolf feeling, it's sort of energizing. (if I'm slightly hungry, slightly horny, and slightly angry, good god, get out of the way!)

I'm convinced that the right way to eat is something like 5 small meals throughout the day. Long ago when I was single and quite fit, I ate like that and was quite successful at meeting the "primary goal of food consumption portioning and scheduling" (minimal eating without going into the red zone). It's very hard to keep that up in a relationship, because eating a big meal is such a key part of our social conventions. When I was single I would very rarely eat a proper dinner; I would eat a sandwich at 4 PM or something so that I wouldn't really be too hungry at 8 when Fifth Meal came around, so I might just eat a salad and some canned tuna. It is possible to do in a relationship, and I'm sort of trying it now. You have to just make sure that you eat less at the standard meal times, or eat more low-cal food like cooked veg and salads, and then go ahead and eat the intermediate meals yourself. (it's particularly hard when someone else cooks and you feel compelled to eat a large amount to show that you like it; it's also hard at restaurants where the portions are always huge and you feel like you have to eat it to "get your money's worth"; eating around other men is also a problem, because of the stupid macho "let's stuff our faces" competition)


05-22-13 | Baby Work

Jesus christ it's a lot of work. I was hoping to get back to doing a little bit of RAD work by Monday (two days ago), but it's just not possible yet. I'm doing all the work for Tasha and baby and it's completely swamping me. I get little moments of free time (that I oddly use to write things like this), but never a solid enough block to actually do programming work, and you can never predict when that solid block of time will be, which is so hard for programming. A few times I tried to get going on some actual coding, and then baby wakes up and I have to stop and run around changing diapers and getting mom snacks. I give up.

(even when I do get a bit of solid time, I'm just too tired to be productive; I stare blankly at the code. Actually I've had a few funny "tired dad" moments already. I went to the grocery store and was shopping along, and all of a sudden I noticed my cart was full of stuff that wasn't mine. Oh crap I took someone else's cart at some point because I was so asleep on my feet. I remember all through my childhood my mom would make shopping mistakes that I found so infuriating (Mom, I said corn chex and you got wheat chex, omg mom how could you be so daft!) and now I finally have some sympathy; you just get so exhausted that you can't even perform the most rudimentary of tasks without spacing out and making mistakes).

If you haven't had your baby yet, get some help, don't try to do it alone. (our relief should arrive tomorrow, and it's getting easier each day anyway; the hardest days were the beginning when we were still exhausted from labor and cleaning up after the home birth, and baby hadn't figured out nursing very well yet, that was some serious crunch time). We thought it would be sweet to have some time for just the three of us, and it has been, actually it's been really nice just being alone together, but it's too much work, I don't recommend it.

(we've been incredibly fortunate so far that our baby is super easy, really well behaved, a good sleeper and nurser; we just have none of the problems that you hear about (*). Of course that's not entirely luck (though perhaps mostly luck), we're both super healthy people and we've done what we believe is right to make a happy newborn (singing to the womb, immediate mommy skin contact after birth, never separating baby from parents, cosleeping, breastfeeding, etc.). But it's been so hard even with a well-behaved baby that I now have new respect for the parents that go through a baby with colic or feeding difficulties or one that doesn't sleep, you guys are heroes).

(* = of course we're struggling with some acid reflux problems (what used to be called "colic" back when parents were awful and thought babies just cried because they were a nuisance, rather than because they were in serious pain that should be addressed) and forceful letdown and latching difficulties etc. etc., some other typical minor baby fussiness struggles, but that's all just normal baby stuff that I can't complain about, not a serious hardship like a baby with an actual health problem or disability)

House work is so annoying in the way that you can't just get it all done at once. Even during this time when the house work is so much harder than usual, it's still only something like 4-6 hours total of work, but it's all spread out over the day; you work for a while, then you have a 15 minute break (waiting for laundry to finish or the baby to poop again), then you do more work, then another tiny break, etc. I don't do well with work like that; I'm a work sprinter (actually I'm an everything sprinter), I want to get all my tasks on a list and then I'm just gonna strap on the gusto and knock it out as fast as possible so that I can be done and have a deep rest.

I'm sad that I'm so busy running around doing chores that I can't just lie in bed with mom and baby very much. I've always used doing work for people as a way of showing love, and it's fucking retarded. It's not what they really want, and it's not perceived as love. I'm sure that if I had someone else do all the work and I just spent more time cuddling, I would be a better dad.

At family gatherings, there are always those people who disappear from the group to help out in the kitchen; they might help cook, then help set up, help clean up; they do a lot of work and show their love for the family that way. There are other people who just hang out with the group the whole time and chat and smile and are more overtly interested in being there. Of course the hard worker is subconsciously perceived negatively, as standoffish, while the smiler is a "good family man" that everyone enjoys. In my youth I would rage about the unfairness of it all. I'm past that and can see things more clearly :

1. If you have a choice, then of course it's better to be the smiler who does no work and everyone loves. There is no reward in the real world for being a hard worker; not in social situations nor in the workplace. It's much better to be perceived as nice than to actually do deeply giving nice things.

2. Being a friendly, funny social creature is of course a type of "work" that's contributing value to the social situation. Think of them as an MC if you like; they're providing a service to the group, telling stories or laughing or whatever. That's valuable work as well.

3. The people who are really stealing from the group are the ones that don't do the work, and also don't provide laughs and good vibes. They are energy vampires and you should minimize your contact with them.

4. If you're a "smiler", the hard-working types will give you dirty looks or even drop passive aggressive shitty remarks about how "some people don't contribute" or whatever. Fuck them. They're just morons who have chosen a bad path for themselves and are trying to bring you down. Just laugh at them inside your head. Foolish hard-workers, your judgement has no sway over me, I don't need your approval. If someone else wants to do all the hard work for you, and then make themselves feel all sour about the fact that you didn't do the work, fantastic.

5. If you're a hard-worker, don't despair about the unfair world. You have found your lot in life. Maybe you are simply incapable of being a smiler. That's too bad for you, but we all have our place, and it's better to accept it and be content than to rage about what cannot be yours.

In my youth I used to struggle with trying it both ways. One of the nice things about age is you figure out your lot in life and just accept it; some years ago I concluded that I would never really contribute great social energy to groups, so I should just be a dish-doer in order to avoid being an energy vampire.

Anyway.

I'm a bit worried that I will never be able to really concentrate in my home office the way I have in the past. It's been a wonderful sanctuary of peace and alone time for me, where I can dive into my work and there's nobody making noise or peeking in at me (the way people do in offices). But now even just knowing that my baby is in the house, my mind is partly on the baby; is she okay, should I go check on her now? I'm sure that will decrease over time, but never go away. And of course once the child is running around making noise that will be a new distraction. Oh well, I guess we'll see how it goes.

Programming is such hard work to mix with anything else because you really need that solid block of uninterrupted time. You can't just pause and resume; or I can't anyway, I need to get into this sort of trance, which takes a while to achieve, and is quite draining. I feel a bit like a wizard in a fantasy novel; I can cast this amazing spell ("write code"), but to do so drains a bit of my life force, and if I do it more than once per day I bring myself close to death; if you're interrupted in coding, the spell is cancelled. I can write code without the spell, but then progress is slow and difficult, just like a normal human trying to write code; it's so frustrating for me to write code without the spell that I don't like to do it at all.

Anyway.

The actual point of this post is that I feel this need to get back to doing RAD work right away, and it's making me angry. Why do I have to feel that way? Fuck RAD work, I need to be with family. But my god damn hard-working WASPy martyr upbringing makes me feel like I can't ever take anything for myself, that I need to go and sacrifice and work for other people.

The whole time Tasha was pregnant I was crunching like crazy trying to finish Oodle 1.1 and get the real public release done, and to just get as much work done then as I could so that I would feel better about taking time off after the birth. I neglected Tasha when she needed me and she was really upset at me for it. I wanted to get ahead on my schedule, and I did, but now that I'm here my brain won't let me have that and wants me to go back to work.

One of the things I've really struggled with at RAD is the lack of structure and the self-scheduling. There's never a point where I can get ahead of an official schedule; I can't hit all my milestones for the month and then feel okay about taking it easy for a while. Any time I do take it easy because I just need a break, I feel bad about it. In general, my stupid brain makes me productive as a programmer, but also miserable as a programmer.

People who have a job where they just have a list of things to do and they can actually do them all and then go "I'm done, I'm going home" are very fortunate. In programming, the todo list is always effectively infinite (it's finite, but always longer than what you can ever accomplish). You might make a schedule and set a target set of tasks for a given month, but if you get them done sooner you don't go "great, I'm done for the month, time for a few days off", you go "oh, I went faster than expected, I guess I'll adjust the schedule and start on next month's tasks".

Even in a structured programming work environment, if you do your tasks faster than scheduled, you don't get sent home - you get given more tasks. In the traditional producer/team work system, your reward for being the fastest on the team is not more free time or even more pay, it's more work. Yay. Cynical "realist" programmers learn this at some point and many of them start to sandbag. They might finish their tasks quickly, but don't report it to production until their previously allotted time. Or they will intentionally take "slow work" breaks, like browsing the web or watching videos while working.

I used to use my speed as a way to work on features I wasn't supposed to; mainly in the pre-Oddworld days, I would sprint and do my assigned tasks, and then not tell anyone I had finished much faster than scheduled, so that I could work on VIPM or secretly rewriting the DX render layer or some other task that had been decided by production was "low priority". Oo, what a rebel I was, secretly giving my employer masses of value for free, great way to use your youth cbloom.

Anyway. I guess this post is all just my way of trying to convince myself that it's okay for me to take a few more days off.


05-20-13 | Thoughts on Data Compression for MMOs

I've been thinking about this a bit and thought I'd scribble some ideas publicly. (MMO = not necessarily just MMOs but any central game server with a very large number of clients, that cares about total bandwidth).

The situation is roughly this : you have a server and many clients (let's say 10k clients per server just to be concrete). Data is mostly sent server->client , not very much is client->server. Let's say 10k bytes per second per channel from server->client, and only 1k bytes per second from client->server. So the total data rate from the server is high (100 MB/s) but the data rate on any one channel is low. The server must send packets more than once per second for latency reasons; let's say 10 times per second, so packets are only 1k on average; server sends 100k packets/second. You don't want the compressor to add any delay by buffering up packets.

I'm going to assume that you're using something like TCP so you have guaranteed packet order and no loss, so that you can use previous packets from any given channel as encoder history on that channel. If you do have an error in a connection you have to reset the encoder.

This is a rather unusual situation for data compression, and the standard LZ77 solutions don't work great. I'm going to talk only about the server->client transmission for now; you might use a completely different algorithm for the other direction. Some properties of this situation :

1. You care more about encode time than decode time. CPU time on the server is one of the primary limiting factors. The client machine has almost no compression work to do, so decode time could be quite slow.

2. Per-call encode time is very important (not just per-byte time). Packets are small and you're doing lots of them (100k packets/sec), so you can't afford long startup/shutdown times for the encoder. This is mostly just an annoyance for coding, it means you have to be really careful about your function setup code and such.

3. Memory use on the server is a bit limited. Say you allow 4 GB for encoder states; that's only 400k per client. (this is assuming that you're doing per-client encoder state, which is certainly not the only choice).

4. Cache misses are much worse than a normal compression encoder scenario. Say you have something like a 256k hash table to accelerate match finding. Normally when you're compressing you get that whole table into L2 so your hash lookups are in cache. In this scenario you're jumping from one state to another all the time, so you must assume that every memory lookup is a cache miss.

5. The standard LZ77 thing of not allowing matches at the beginning or end is rather more of a penalty. In general all those inefficiencies that you normally have on tiny buffers are more important than usual.

6. Because clients can be added at any time and connections can be reset, encoder init/reset time can't be very long. This is another reason aside from memory use that encoder state must be small.

7. The character of the data being sent doesn't really vary much from client to client. This is one way in which this scenario differs from a normal web server type of situation (in which case, different clients might be receiving vastly different types of data). The character of the data can change from packet to packet; there are sort of a few different modes of the data and the stream switches between them, but it's not like one client is usually receiving text and another one is receiving images. They're all generally receiving bit-packed 3d positions and the same type of thing.

And now some rambling about what encoder you might use that suits this scenario :

A. It's not clear that adaptive encoding is a win here. You have to do the comparison with CPU use held constant; if you just compare an encoder running adaptively vs the same encoder with a static model, that's not fair, because the static model is so much faster that you could afford a more sophisticated encoder. The static model can also use vastly more memory. Maybe not a whole 4G, but a lot more than 400k.

B. LZ77 is not great here. The reason we love LZ77 is the fast, simple decoder. We don't really care about that here. An LZP or ROLZ variant would be better; those have a slightly slower and more memory-hungry decoder, but a simpler/faster encoder.
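
To make the LZP idea concrete, here's a minimal sketch of the core predict-and-match step (my illustration, with made-up table sizes, not from any real codebase; this is the generic adaptive-table LZP, where the decoder keeps the same table, so only a match length needs to be transmitted) :

    #include <stdint.h>
    #include <string.h>

    struct LZPPredictor
    {
        enum { HASH_BITS = 16 };
        uint32_t m_table[1<<HASH_BITS]; // context hash -> last pos seen with that context
                                        // zero-init; pos 0 acts as the cold prediction

        static uint32_t Hash( const uint8_t * p ) // hash of the 4 bytes preceding p
        {
            uint32_t c; memcpy(&c,p-4,4);
            return (c * 2654435761u) >> (32 - HASH_BITS);
        }

        // caller guarantees pos >= 4 ; returns predicted match length (0 = miss)
        // and updates the table either way
        int PredictMatch( const uint8_t * buf, int pos, int size )
        {
            uint32_t h = Hash(buf+pos);
            int pred = (int) m_table[h];
            m_table[h] = (uint32_t) pos;
            int len = 0;
            while ( pos+len < size && buf[pred+len] == buf[pos+len] )
                len++;
            return len;
        }
    };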

C. There are some semi-static options. Perhaps a static match dictionary with something like LZP, and then an adaptive simple context model per channel. That makes the per-channel adaptive part small in memory, but still allows some local adaptation for packets of different character. Another option would be a switching static-model scheme. Do something like train 4 different static models for different packet types, and send 2 bits to pick the model then encode the packet with that static model.
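
A sketch of the switching-static-model selector (everything here is hypothetical : the 4-model split and the cost tables, which would be trained offline on captured packets) :

    #include <stdint.h>

    // s_modelCost[m][b] = scaled -log2(P(b)) of byte b under static model m
    extern const uint16_t s_modelCost[4][256];

    // pick the cheapest static model for this packet; the encoder sends the
    // 2-bit result, then entropy-codes the packet with the chosen model
    int PickStaticModel( const uint8_t * packet, int size )
    {
        int best = 0;
        uint32_t bestCost = 0xFFFFFFFFu;
        for (int m=0;m<4;m++)
        {
            uint32_t cost = 0;
            for (int i=0;i<size;i++)
                cost += s_modelCost[m][packet[i]];
            if ( cost < bestCost ) { bestCost = cost; best = m; }
        }
        return best;
    }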

D. Static context mixing is kind of appealing. You can have static hash tables and a static mixing scheme, which eliminates a lot of the slowness of CM. Perhaps the order-0 model is adaptive per channel, and perhaps the secondary-statistics table is adaptive per channel. Hitting 100 MB/s might still be a challenge, but I think it's possible. One nice thing about CM here is that it can have the idea of packets of different character implicit in the model.
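
The mixing part can be completely static too; a tiny sketch of a fixed-weight logistic mix (standard stretch/squash; assuming the weights are trained offline - my assumption, not a description of any particular coder) :

    #include <math.h>

    static double Stretch( double p ) { return log( p / (1.0 - p) ); }
    static double Squash ( double x ) { return 1.0 / (1.0 + exp(-x)); }

    // pStatic  = P(bit=1) from the big shared static model
    // pChannel = P(bit=1) from the small per-channel adaptive model
    // w1,w2    = fixed mixing weights, trained offline
    double MixedP1( double pStatic, double pChannel, double w1, double w2 )
    {
        return Squash( w1 * Stretch(pStatic) + w2 * Stretch(pChannel) );
    }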

E. For static-dictionary LZ, the normal linear offset encoding doesn't really make a lot of sense. Sure, you could try to optimize a dictionary by laying out the data in it such that more common data is at lower offsets, but that seems like a nasty indirect way of getting at the solution. Off the top of my head, it seems like you could use something like an LZFG encoding. That is, make a Suffix Trie and then send match references as node or leaf indexes; leaves all have equal probability, nodes have a child count which is proportional to their probability (relative to siblings).
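
For illustration, one natural weighting for that descent is by subtree leaf count (hypothetical trie node; a real one would store children sparsely) :

    #include <math.h>

    struct TrieNode
    {
        int        m_leafCount;     // number of dictionary suffixes below this node
        TrieNode * m_children[256]; // dense here only for clarity
    };

    // bits to code the descent from parent to one of its children, with
    // P(child) = child->m_leafCount / parent->m_leafCount ;
    // strings that repeat a lot in the dictionary are cheap to reference
    double ChildDescentBits( const TrieNode * parent, const TrieNode * child )
    {
        return -log2( (double) child->m_leafCount / (double) parent->m_leafCount );
    }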

F. Surely the ideal solution is a blended static/dynamic coder. That is, you have some large trained static model (like a CM or PPM model, or a big static dictionary for LZ77) and then you also run a local adaptive model in a circular window for each channel. Then you compress using a mix of the two models. There are various options on how to do this. For LZ77 you might send 0-64k offsets in the local adaptive window, and then 64k-4M offsets as indexes into the static dictionary. Or you could more explicitly code a selector bit to pick one of the two and then an offset. For CM it's most natural, you just mix the result of the static model and the local adaptive model.
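
A sketch of the split offset space on the decode side (the 64k boundary is just the hypothetical constant from above) :

    #include <stdint.h>

    enum { LOCAL_WINDOW_SIZE = 1<<16 }; // offsets below this are channel-local

    // map a transmitted offset to a match source pointer : low offsets
    // reference the per-channel adaptive window, high offsets index the
    // big shared static dictionary
    const uint8_t * ResolveMatchSource( uint32_t offset,
                                        const uint8_t * channelWindowEnd,
                                        const uint8_t * staticDict )
    {
        if ( offset < LOCAL_WINDOW_SIZE )
            return channelWindowEnd - offset;
        else
            return staticDict + (offset - LOCAL_WINDOW_SIZE);
    }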

G. What is not awesome is model preconditioning (and it's what most people do, because it's the only thing available with off-the-shelf compressors like zlib or whatever). Model preconditioning means taking an adaptive coder and initially loading its model (eg. an LZ dictionary) from some static data; then you encode packets adaptively. This might actually offer excellent compression, but it has bad channel startup time, and high memory use per channel, and it doesn't allow you to use more efficient algorithms that are possible with fully static models (such as different types of data structures that provide fast lookup but not fast update).
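
For reference, this is roughly what preconditioning looks like with stock zlib (deflateSetDictionary is the real zlib call; the wrapper around it is just a sketch). Note the per-channel z_stream and the per-connection setup cost, which is exactly the problem described above :

    #include <string.h>
    #include <zlib.h>

    // per-channel encoder preconditioned with a trained dictionary; the
    // decoder must prime its inflate stream with the same bytes via
    // inflateSetDictionary
    int InitChannelEncoder( z_stream * zs, const unsigned char * dict, unsigned dictLen )
    {
        memset(zs,0,sizeof(*zs)); // zalloc/zfree/opaque = Z_NULL
        if ( deflateInit(zs, Z_DEFAULT_COMPRESSION) != Z_OK )
            return -1;
        return ( deflateSetDictionary(zs, dict, dictLen) == Z_OK ) ? 0 : -1;
    }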

I believe if you're doing UDP or some other unreliable packet scheme, then static-model compression is the only way to go (rather than trying to track the different received and transmitted states to use for a dynamic model). Also if clients are very frequently joining and leaving or moving servers, then they will never build up much channel history, so static model is the way to go there as well. If streams vary vastly in size, like if they're usually less than 1k but occasionally you do large 1M+ transfers (like for content serving as opposed to updating game state) you would use a totally different scheme for the large transfers.

I'd like to do some tests. If you work on an MMO or similar game situation and can give me some real-world test data, please be in touch.


05-17-13 | Cloth Diapering

Oh yes indeed, you are in for a spate of baby-related blogging.

I'm pretty sure cloth diapers are bullshit. I'm about to cancel my diaper service. In this first week I've been using a semi-alternating mix of cloth and disposable. I assumed that I would start out with disposables just for ease in the first few days and then switch to cloth because it's "better", but I don't think I will.

(I make all my decisions now based only on 1. personal observations and 2. serious scientific studies where I can read the original papers. I try to avoid and discount 3. journalism 4. hearsay 5. the internet 6. mass-market nonfiction. I think they are garbage and mental poison.)

What I'm seeing is :

Disposable diapers actually work the way they claim to. The seal around the borders is good. The entire diaper itself has a nice low profile so is not too bulky or uncomfortable. But most importantly, they actually do trap and absorb moisture. When baby has a heavy pee in a disposable diaper, the moisture stays right in one little spot and doesn't spread all over. When I remove the diaper I can feel her skin all over the nether regions is pretty dry.

Cloth diapers don't. The worst aspect is that when baby has a heavy pee, the cloth soaks it up, and because it's cloth and wicks moisture, the pee is spread all over her entire lower parts. When I get the diaper off, she's soaking wet all over. (and yes of course I'm changing her almost instantly after peeing because at this point we're watching her constantly). That alone is enough to turn me off cloth diapers, but there's lots more that sucks about them. It's really hard to get the diaper cover on such that it actually makes a water-tight seal, so leakage is much more likely (and if you do try to make it water tight, it's easy to make it too tight and cut off circulation, which I accidentally did once). The cloth diaper alone looks pretty comfortable on her, but the diaper cover is much rougher and more bulky than a disposable; the result is that she has this huge awkward thing on.

When you add the inconvenience of cloth diapers (longer changing times, having to store poop in your house, taking the pail in and out for pickup), it just seems like a massive lose.

The only possible argument pro-cloth that makes sense to me is the reduction of the landfill load. Now, environmental arguments are always complicated; there are arguments for the other side based on the environmental cost of washing (though I think they're bogus). But even assuming that the environmental case is clear, being a hypocritical liberal I wouldn't actually inconvenience myself and discomfort my baby for the benefit of the landfill.

Eh, actually I take back that false self-accusation. That's a retarded Fox News style "gotcha" that's based on misrepresentation and not understanding. I've never advocated the standard liberal martyrdom (and if I once did, I certainly don't now). I don't believe in choosing to undermine yourself because you believe the world would be better if everyone did it. I believe in changing the laws such that they encourage you to make the choice that is better for the world. eg. people who don't drive because they believe it's evil, even if it would be much to their benefit, are just being dumb martyrs. The US government massively subsidizes driving, so if you don't take advantage of that you are essentially paying for other people to drive. I would love it if the government would subsidize *not driving* rather than the other way around, but until they do I'm driving up a storm. (tangent : the massive subsidies for Teslas are a great example of the way that Dems and Reps are in fact both really working for the same cause : creating loopholes and kickbacks so that they can give money to rich people).

I'm a big tangent wanderer. My political philosophy in a nutshell :

Government's role is to create a market structure (through laws, regulation, the Fed, direct market action, etc) such that when each actor maximizes their own personal utility, the net result is as good for the entire world (nation) as possible.

(if you're out of high school (or the 18th century) you should know that a free market does not do that on its own)

(And crucially, "good for" must be defined on something like a sum-of-logs scale, or perhaps just maximize the median, or minimize the number in poverty; if you maximize the sum (basically GDP) then giving huge profits to Larry Ellison and fucking everyone else looks like it's "good for the world")

And, uh, oh yeah, cloth diapers suck.


05-15-13 | Baby

I suppose this is the easiest way to announce to various friends and semi-friends rather than trying to mass-email. I have a new baby, yay! No pictures, you paparazzos. She's adorable and healthy. I love how simple and direct her communication is. Suckling lips = needs to nurse. Squirming = needs a diaper change. Fussing = cold or gassy. Everything else = needs to sleep. I wish all humans communicated so clearly.

I want to write about the wonderful experience of having a home birth (see *2), but don't want to intrude on Tasha's privacy. Suffice it to say it was really good, so good to be home and have everything at hand to make Tasha comfortable, and then be able to take baby in our arms and settle into bed right away. We spent the first 36 hours after birth all in bed together and I think that time was really important.

I've always wanted to have kids, but I'm (mostly) glad that I waited this long. For one thing Tasha is a wonderful mom and I'm glad I found her. But also, I realize now that I wasn't ready in my twenties. I've changed a lot in the last five years and I'm a much better person now. I've learned important lessons that are helping me a lot in this challenging time, like to do hard work correctly you have to not only complete the task but also keep a good attitude and be nice to the people around you while you do it. And that when you are tired and hungry is when you can really show your character; anyone can have a good attitude when they're fresh, but if you get nasty when the going gets tough then you are nasty. etc. standard cbloom slowly figures out things that most people learned in their teens.

Now for some old-style ranting.

1. "We had a baby". No you fucking did not. Your wife had a baby. If you were a really good husband, you held her hand and got her snacks. She squeezed a watermelon out of her vagina. You do not get to take any credit for that act, it was all her. It's a bit like Steve Jobs saying "we invented" anything; no you did not you fucking credit-stealing douchebag, your company didn't even invent it, much less you.

(tangent : I can't stand the political correctness in sport post-game interviews these days; they're all so sanitized and formulaic. They must go to interview coaching classes or something because everyone says exactly the same things. Of course it's not the athlete's fault, they would love to have emotional honest outbursts, it's the god damn stupid public who throw a conniption if anybody says anything remotely true. In particular this post reminds me of how athletes always immediately go "it wasn't just me, it was the team"; no it was not, Kobe, you just had an 80 point game, it was all fucking you, don't give me this bullshit credit to the team stuff. Be a man and say "*I* won this game".)

2. People are busy-body dicks. When we would tell acquaintances about our plans to have a home birth, a good 25% would feel like they had to tell us what a bad idea that was and nag us about the dangers of childbirth. Shut the fuck up you stupid asshole. First of all, don't you think that maybe we've researched that more than you before making our decision, so you don't know WTF you're talking about? Second of all, we're not going to change our mind because of your nagging, so all you're doing is being nasty about something you're not going to change. We didn't ask for your opinion, you can just stay the hell out of it. (The doctors that we would occasionally see for tests were often negative and naggy as well, which only made us more confident in our choice).

It's a bit like if a friend tells you they're marrying someone and you go "her?". Even if the marriage is a questionable choice, they're not going to stop it due to your misgivings, so all you're doing is adding some unpleasantness to their experience.

You always run into these idiots when you do software reviews or brainstorming sessions. You'll call a meeting to discuss revisions to the boss fight sequence, and some asshole will always chime in with "I really think the whole idea of boss fights sucks and we should start over". Umm, great, thanks, very helpful. We're not going to tear up the whole design of the game a few months from shipping, so maybe you could stick to the topic at hand and get some kind of clue about what things are reasonable to change and which need to be taken as a given and worked within as constraints.

Like when I'd ask for reviews of Oodle, a few of the respondents would give me something awesomely unhelpful like "I don't like the entire style of the API, and I'd throw it out and do a new one" , or "actually I think a paging + data compression library is a bad idea and I'd just start over on something else". Great, thanks; I might agree with you but obviously you must know that that is not going to happen and it's not what I was asking for, so if you don't want to say anything helpful then just say "no".

ADDENDUM : a few notes on home birth and midwives.

Even if you are planning to do home birth (without a doctor present), you should get an OB and do a prenatal visit with them to "establish care". That way you are officially their patient, even if you don't see them again. In the US health care system, if you do wind up having a problem, or even just a question, and you have not "established care" with a certain practice, you are just fucked. You would wind up at the ER and that's never good.

While the midwives seemed reasonably competent at popping out a healthy baby from a healthy mother with no complications, I certainly would not do it if there were any major risk factors. They also are less than thorough at the prenatal and postnatal exams, so it's probably worth seeing a regular doc for those at least once (probably only once).


05-08-13 | A Lock Free Weak Reference Table

It's very easy (almost trivial (*)) to make the table-based {index/guid} style of weak reference lock free.

(* = obviously not trivial if you're trying to minimize the memory ordering constraints, as evidenced by the revisions to this post that were required; it is trivial if you just make everything seq_cst)

Previous writings on this topic :

Smart & Weak Pointers - valuable tools for games - 03-27-04
cbloom rants 03-22-08 - 6
cbloom rants 07-05-10 - Counterpoint 2
cbloom rants 08-01-11 - A game threading model
cbloom rants 03-05-12 - Oodle Handle Table

The primary ops conceptually are :


Add object to table; gives it a WeakRef id

WeakRef -> OwningRef  (might be null)

OwningRef -> naked pointer

OwningRef construct/destruct = ref count inc/dec

The full code is in here : cbliblf.zip , but you can get a taste for how it works from the ref count maintenance code :


    // IncRef looks up the weak reference; returns null if lost
    //   (this is the only way to resolve a weak reference)
    Referable * IncRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];

        handle_type guid = handle_get_guid(h);

        // this is just an atomic inc of state
        //  but checking guid each time to check that we haven't lost this slot
        handle_type state = s->m_state.load(mo_acquire);
        for(;;)
        {
            if ( state_get_guid(state) != guid )
                return NULL;
            // assert refcount isn't hitting max
            LF_OS_ASSERT( state_get_refcount(state) < state_max_refcount );
            handle_type incstate = state+1;
            if ( s->m_state.compare_exchange_weak(state,incstate,mo_acq_rel,mo_acquire) )
            {
                // did the ref inc
                return s->m_ptr;
            }
            // state was reloaded, loop
        }
    }

    // IncRefRelaxed can be used when you know a ref is held
    //  so there's no chance of the object being gone
    void IncRefRelaxed( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        handle_type state_prev = s->m_state.fetch_add(1,mo_relaxed);
        state_prev; // touch to quiet the unused-variable warning when asserts compile out
        // make sure we were used correctly :
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 0 );
        LF_OS_ASSERT( state_get_refcount(state_prev) < state_max_refcount );
    }

    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state.fetch_add((handle_type)-1,mo_release);
        LF_OS_ASSERT( handle_get_guid(h) == state_get_guid(state_prev) );
        LF_OS_ASSERT( state_get_refcount(state_prev) >= 1 );
        if ( state_get_refcount(state_prev) == 1 )
        {
            // I took refcount to 0
            //  slot is not actually freed yet; someone else could IncRef right now
            //  the slot becomes inaccessible to weak refs when I inc guid :
            // try to inc guid with refcount at 0 :
            handle_type old_guid = handle_get_guid(h);
            handle_type old_state = make_state(old_guid,0); // == state_prev-1
            handle_type new_state = make_state(old_guid+1,0); // == old_state + (1<<handle_guid_shift)
  
            if ( s->m_state.compare_exchange_strong(old_state,new_state,mo_acq_rel,mo_relaxed) )
            {
                // I released the slot
                // cmpx provides the acquire barrier for the free :
                FreeSlot(s);
                return;
            }
            // somebody else mucked with me
        }
    }

The maintenance of ref counts only requires relaxed atomic increment & release atomic decrement (except when the pointed-at object is initially made and finally destroyed, then some more work is required). Even just the relaxed atomic incs could get expensive if you did a ton of them, but my philosophy for how to use this kind of system is that you inc & dec refs as rarely as possible. The key thing is that you don't write functions that take owning refs as arguments, like :


void bad_function( OwningRefT<Thingy> sptr )
{
    more_bad_funcs(sptr);
}

void Stuff::bad_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    if ( sptr != NULL )
    {
        bad_function(sptr);
    }
}

hence doing lots of inc & decs on refs all over the code. Instead you write all your code with naked pointers, and only use the smart pointers where they are needed to ensure ownership for the lifetime of usage. eg. :

void good_function( Thingy * ptr )
{
    more_good_funcs(ptr);
}

void Stuff::good_caller()
{
    OwningRefT<Thingy> sptr( m_weakRef );
    Thingy * ptr = sptr.GetPtr();
    if ( ptr != NULL )
    {
        good_function(ptr);
    }
}

If you like formal rules, they're something like this :


1. All stored variables are either OwningRef or WeakRef , depending on whether it's
an "I own this" or "I see this" relationship.  Never store a naked pointer.

2. All variables in function call args are naked pointers, as are variables on the
stack and temp work variables, when possible.

3. WeakRef to pointer resolution is only provided as WeakRef -> OwningRef.  Naked pointers
are only retrieved from OwningRefs.

And obviously there are lots of enhancements to the system that are possible. A major one that I recommend is to put more information in the reference table state word. If you use a 32-bit weak reference handle, and a 64-bit state word, then you have 32-bits of extra space that you can check for free with the weak reference resolution. You could put some mutex bits in there (or an rwlock) so that the state contains the lock for the object, but I'm not sure that is a big win (the only advantage of having the lock built into the state is that you could atomically get a lock and inc refcount in a single op). A better usage is to put some object information in there that can be retrieved without chasing the pointer and inc'ing the ref and so on.

For example in Oodle I store the status of the object in the state table. (Oodle status is a progression through Invalid->Pending->Done/Error). That way I can take a weak ref and query status in one atomic load. I also store some lock bits, and you aren't allowed to get back naked pointers unless you have a lock on them.
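A little sketch of what that might look like; the field layout and names here are hypothetical (the real bit assignments are in cbliblf.zip), and state_get_status / OodleStatus are made-up helpers in the style of state_get_guid :


    // hypothetical 64-bit state layout : [ status | guid | refcount ]
    enum { state_status_shift = 48 }; // assumption : top 16 bits hold status

    handle_type state_get_status( handle_type state )
    {
        return state >> state_status_shift;
    }

    // query object status from a weak ref in one atomic load,
    //  no ref inc and no pointer chase needed :
    OodleStatus QueryStatus( handle_type h )
    {
        Slot * s = &s_slots[ handle_get_index(h) ];
        handle_type state = s->m_state.load(mo_acquire);
        if ( state_get_guid(state) != handle_get_guid(h) )
            return OodleStatus_Invalid; // weak ref was lost
        return (OodleStatus) state_get_status(state);
    }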

The code for the weak ref table is now in the cbliblf.zip that I made for the last post. Download : cbliblf.zip

( The old cblib has a non-LF weak reference table that's similar for comparison. It's also more developed with helpers and fancier templates and such that could be ported to this version. Download : cblib.zip )

ADDENDUM : alternative DecRef that uses CAS instead of atomic decrement. Removes the two-atomic free path. Platforms that implement atomic add as a CAS loop should probably just use this form. Platforms that have true atomic add should use the previously posted version.


    // DecRef
    void DecRef( handle_type h )
    {
        handle_type index = handle_get_index(h);
        LF_OS_ASSERT( index >= 0 && index < c_num_slots );
        Slot * s = &s_slots[index];
        
        // no need to check guid because I must own a ref
        handle_type state_prev = s->m_state($).load(mo_relaxed);
        
        handle_type old_guid  = handle_get_guid(h);

        for(;;)
        {
            // I haven't done my dec yet, guid must still match :
            LF_OS_ASSERT( state_get_guid(state_prev) == old_guid );
            // check current refcount :
            handle_type state_prev_rc = state_get_refcount(state_prev);
            LF_OS_ASSERT( state_prev_rc >= 1 );
            if ( state_prev_rc == 1 )
            {
                // I'm taking refcount to 0
                // also inc guid, which releases the slot :
                handle_type new_state = make_state(old_guid+1,0);

                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_acq_rel,mo_relaxed) )
                {
                    // I released the slot
                    // cmpx provides the acquire barrier for the free :
                    FreeSlot(s);
                    return;
                }
            }
            else
            {
                // this is just a decrement
                // but have to do it as a CAS to ensure state_prev_rc doesn't change on us
                handle_type new_state = state_prev-1;
                LF_OS_ASSERT( new_state == make_state(old_guid,  state_prev_rc-1) );
                
                if ( s->m_state($).compare_exchange_weak(state_prev,new_state,mo_release,mo_relaxed) )
                {
                    // I dec'ed a ref
                    return;
                }
            }
        }
    }


05-02-13 | Simple C++0x style LF structs and adapter for x86-Windows

I've seen a lot of lockfree libraries out there that are total crap. Really weird non-standard ways of doing things, or overly huge and complex.

I thought I'd make a super simple one in the correct modern style. Download : cbliblf.zip

(If you want a big fully functional much-more-complete library, Intel TBB is the best I've seen. The problem with TBB is that it's huge and entangled, and the license is not clearly free for all use).

There are two pieces here :

"cblibCpp0x.h" provides atomic and such in C++0x style for MSVC/Windows/x86 compilers that don't have real C++0x yet. I have made zero attempt to make this header syntatically identical to C++0x, there are various intentional and unintentional differences.

"cblibLF.h" provides some simple lockfree utilities (mostly queues) built on C++0x atomics.

"cblibCpp0x.h" is kind of by design not at all portable. "cblibLF.h" should be portable to any C++0x platform.

WARNING : this stuff is not super well tested because it's not what I use in Oodle. I've mostly copy-pasted this from my Relacy test code, so it should be pretty strong but there may have been some copy-paste errors.

ADDENDUM : In case it's not clear, you do not *want* to use "cblibCpp0x.h". You want to use real Cpp0x atomics provided by your compiler. This is a temporary band-aid so that people like me who use old compilers can get a cpp0x stand-in, so that they can do work using the modern syntax. If you're on a gcc platform that has the __atomic extensions but not C1X, use that.

You should be able to take any of the C++0x-style lockfree code I've posted over the years and use it with "cblibCpp0x.h" , perhaps with some minor syntactic fixes. eg. you could take the fastsemaphore wrapper and put the "semaphore" from "cblibCpp0x.h" in there as the base semaphore.

Here's an example of what the objects in "cblibLF.h" look like :


//=================================================================         
// spsc fifo
//  lock free for single producer, single consumer
//  requires an allocator
//  and a dummy node so the fifo is never empty
template <typename t_data>
struct lf_spsc_fifo_t
{
public:

    lf_spsc_fifo_t()
    {
        // initialize with one dummy node :
        node * dummy = new node;
        m_head = dummy;
        m_tail = dummy;
    }

    ~lf_spsc_fifo_t()
    {
        // should be one node left :
        LF_OS_ASSERT( m_head == m_tail );
        delete m_head;
    }

    void push(const t_data & data)
    {
        node * n = new node(data);
        // n->next == NULL from constructor
        m_head->next.store(n, memory_order_release); 
        m_head = n;
    }

    // returns true if a node was popped
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        // we're going to take the data from m_tail->next
        //  and free m_tail
        node* t = m_tail;
        node* n = t->next.load(memory_order_acquire);
        if ( n == NULL )
            return false;
        *pdata = n->data; // could be a swap
        m_tail = n;
        delete t;
        return true;
    }

private:

    struct node
    {
        atomic<node *>      next;
        nonatomic<t_data>   data;
        
        node() : next(NULL) { }
        node(const t_data & d) : next(NULL), data(d) { }
    };

    // head and tail are owned by separate threads,
    //  make sure there's no false sharing :
    nonatomic<node *>   m_head;
    char                m_pad[LF_OS_CACHE_LINE_SIZE];
    nonatomic<node *>   m_tail;
};
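
A minimal usage sketch, assuming exactly one producer thread and one consumer thread (any more of either and this fifo is not safe) :


lf_spsc_fifo_t<int> fifo;

// producer thread :
fifo.push(7);

// consumer thread :
int x;
if ( fifo.pop(&x) )
{
    // popped successfully, x == 7
}
// else : fifo was empty, try again later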

Download : cbliblf.zip


04-30-13 | Packing Values in Bits : Flat Codes

One of the very simplest forms of packing values in bits is simply to store a value with non-power-of-2 range and all values of equal probability.

You have a value that's in [0,N). Ideally all code lengths would be the same ( log2(N) ) which is fractional for N not a power of 2. With just bit output, we can't write fractional bits, so we will lose some efficiency. But how much exactly?

You can of course trivially write a symbol in [0,N) by using log2ceil(N) bits. That's just going up to the next integer bit count. But you're wasting values in there, so you can take each wasted value and use it to reduce the length of a code that you need. eg. for N = 5 , start with log2ceil(N) bits :

0 : 000
1 : 001
2 : 010
3 : 011
4 : 100
x : 101
x : 110
x : 111
The first five codes are used for our values, and the last three are wasted. Rearrange to interleave the wasted codewords :
0 : 000
x : 001
1 : 010
x : 011
2 : 100
x : 101
3 : 110
4 : 111
now since we have adjacent codes where one is used and one is not used, we can reduce the length of those codes and still have a prefix code. That is, if we see the two bits "00" we know that it must always be a value of 0, because "001" is wasted. So simply don't send the third bit in that case :
0 : 00
1 : 01
2 : 10
3 : 110
4 : 111

(this is a general way of constructing shorter prefix codes when you have wasted values). You can see that the number of wasted values we had at the top is the number of codes that can be shortened by one bit.

A flat code is written thusly :


void OutputFlat(int sym, int N)
{
    ASSERT( N >= 2 && sym >= 0 && sym < N );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;
    // T is the number of "wasted values"
    if ( sym < T )
    {
        // write in B-1 bits
        PutBits(sym, B-1);
    }
    else
    {
        // write in B bits
        // push value up by T
        PutBits(sym+T, B);
    }
}

int InputFlat(int N)
{
    ASSERT( N >= 2 );

    int B = intlog2ceil(N);
    int T = (1<<B) - N;

    int sym = GetBits(B-1);
    if ( sym < T )
    {
        return sym;
    }
    else
    {
        // need one more bit :
        int ret = (sym<<1) - T + GetBits(1);        
        return ret;
    }
}

That is, we write (T) values in (B-1) bits, and (N-T) in (B) bits. The intlog2ceil can be slow, so in practice you would want to precompute that or pass it in as a parameter.
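For reference, a minimal intlog2ceil ; this is a sketch, in practice you'd use a clz / bit-scan intrinsic :


int intlog2ceil(int N)
{
    ASSERT( N >= 1 );
    int B = 0;
    while( (1<<B) < N ) B++;
    return B; // smallest B such that N <= 2^B
}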

So, what is the loss vs. ideal, and where does it occur? Let's work it out :


H = log2(N)  is the ideal (fractional) entropy

N is in (2^(B-1),2^B]
so H is in (B-1,B]

The number of bits written by the flat code is :

L = ( T * (B-1) + (N-T) * B ) / N

with T = 2^B - N

Let's set

N = f * 2^B

with f in (0.5,1] our fractional position in the range.

so T = 2^B * (1 - f)

At f = 0.5 and 1.0 there's no loss, so there must be a maximum in that interval.

Doing some simplifying :

L = (T * (B-1) + (N-T) * B)/N
L = (T * B - T + N*B - T * B)/N
L = ( N*B - T)/N = B - T/N
T/N = (1-f)/f = (1/f) - 1
L = B - (1/f) + 1

The excess bits is :

E = L - H

H = log2(N) = log2( f * 2^B ) = B + log2(f)

E = (B - (1/f) + 1) - (B + log2(f))
E = 1 - (1/f) - log2(f)

so find the maximum of E by taking a derivative :

d/df(E) = 0
d/df(E) = 1/f^2 - (1/f)/ln2
1/f^2 = (1/f)/ln2
1/f = 1/ln(2)
f = ln(2)
f = 0.6931472...

and at that spot the excess is :

E = 1 - (1/ln2) - ln(ln2)/ln2
E = 0.08607133...

The worst case is 8.6% of a bit per symbol excess. The worst case appears periodically, once for each power of two.

The actual excess bits output for some low N's :

The worst case actually occurs as N->large, because at higher N you can get f closer to that worst case fraction (ln(2)). At lower N, the integer steps mean you miss the worst case and so waste less. This is perhaps a bit surprising, you might think that the worst case would be at something like N = 3.

In fact for N = 3 :


H = log2(3) = 1.584962 ...

L = average length written by OutputFlat

L = (1+2+2)/3 = 1.66666...

E = L - H = 0.08170421...

(obviously if you measure the loss as a percentage of the output length, the worst case is at N=3, and there it's 5.155% of the entropy).


04-21-13 | How to grow Linux disk under VMWare

There are a lot of these guides around the net, but I found them all a bit confusing to follow, so here's my experience : It seems to me it's a good idea to leave it this way - BIOS is set to boot first from CD, but the VM is set with no CD hardware enabled. This makes it easy to change the ISO and just check that box any time you want to boot from an ISO, rather than having to go into that BIOS nightmare again.


More generally, what have I learned about multi-platform development from working at RAD ?

That it's horrible, really horrible, and I pray that I never have to do it again in my life. Ugh.

Just writing cross-platform code is not the issue (though that's horrible enough, solely due to stupid non-fundamental issues like the fact that struct packing isn't standardized, adding signed ints isn't standardized, restrict/noalias isn't standardized, inline linkage varies greatly, etc. urg wtf etc etc). If you're just releasing some code on the net and offering it for many platforms (leaving it up to the downloaders to actually build it and test it), your life is easy. The horrible part is if you actually have to maintain machines and build systems for all those platforms, test them, be able to debug on them, keep all the sdk's up to date, etc. etc.

(in general coding is easy when you don't actually test your code and make sure it works well, which a surprising number of people think is "done"; hey it compiles, I'm done! umm, no...)

(I guess that's a more general life thing; I observe a lot of people who just do things and don't actually measure whether the "doing" was successful or done well, but they just move on and are generally happy. People who stress over whether what they're doing is actually a good job or not are massively less happy but also actually do good work.)

I feel like I spend 90% of my time on stupid fucking non-algorithmic issues like this Linux partition resizing shit (probably more like 20%, but that's still frustratingly high). The regression tests are failing on Linux, okay have to figure out why, oh it's because the VM disk is too small, okay how do I fix that; or the PS4 compiler has a bug I have to work around, or the system software on this system has a bug, or the Mac clang wants to spew pointless warnings about anonymous namespaces, or my tests aren't working on Xenon .. spend some time digging .. oh the console is just turned off, or the IP changed or it got reflashed and my SDK doesn't work anymore, and blah blah fucking blah. God dammit I just want to be able to write algorithms. I miss coding, I miss thinking about hard problems. Le sigh.

I've written before about how in my imagination I could hire some kid for $50k to do all this shit work for me and it would be a huge win for me overall. But I'm afraid it's not that easy in reality.

What really should exist is a "coder cloud" service. There should be a bunch of VMs of different OS'es with various compilers and SDKs installed, so I can just say "build my shit for X with Y". Of course you need to be able to run tests on that system as well, and if something goes wrong you need remote desktop for interactive debugging. It's got to have every platform, including things like game consoles where you need license agreements, which is probably a no-go in reality because corporations are jerks. There's got to be superb customer service, because if I can't rely on it for builds at every moment of every day then it's a no-go. Unfortunately, programmers are almost uniformly moronic about this kind of thing (in that they massively overestimate their own ability to manage these things quickly) so wouldn't want to pay what it costs to run that service.


04-10-13 | Waitset Resignal Notes

I've been redoing my low level waitset and want to make some notes. Some previous discussion of the same issues here :

cbloom rants 11-28-11 - Some lock-free rambling
cbloom rants 11-30-11 - Some more Waitset notes
cbloom rants 12-08-11 - Some Semaphores

In particular, two big things occurred to me :

1. I talked before about the "passing on the signal" issue. See the above posts for more in depth details, but in brief the issue is if you are trying to do NotifyOne (instead of NotifyAll), and you have a double-check waitset like this :


{
waiter = waitset.prepare_wait(condition);

if ( double check )
{
    waiter.cancel();
}
else
{
    waiter.wait();
    // possibly loop and re-check condition
}

}

then if you get a signal between prepare_wait and cancel, you didn't need that signal, so a wakeup of another thread that did need that signal can be missed.

Now, I talked about this before as an "ugly hack", but over time thinking about it, it doesn't seem so bad. In particular, if you put the resignal inside the cancel() , so that the client code looks just like the above, it doesn't need to know about the fact that the resignal mechanism is happening at all.

So, the new concept is that cancel atomically removes the waiter from the waitset and sees if it got a signal that it didn't consume. If so, it just passes on that signal. The fact that this is okay and not a hack came to me when I thought about under what conditions this actually happens. If you recall from the earlier posts, the need for resignal comes from situations like :


T0 posts sem , and signals no one
T1 posts sem , and signals T3
T2 tries to dec count and sees none, goes into wait()
T3 tries to dec count and gets one, goes into cancel(), but also got the signal - must resignal T2

the thing is this can only happen if all the threads are awake and racing against each other (it requires a very specific interleaving); that is, the T3 in particular that decs count and does the resignal had to be awake anyway (because its first check saw no count, but its double check did dec count, so it must have raced with the sem post). It's not like you wake up a thread you shouldn't have and then pass it on. The thread wakeup scheme is just changed from :

T0 sem.post --wakes--> T2 sem.wait
T1 sem.post --wakes--> T3 sem.wait

to :

T0 sem.post
T1 sem.post --wakes--> T3 sem.wait --wakes--> T2 sem.wait

that is, one of the consumer threads wakes its peer. This is a tiny performance loss, but it's a pretty rare race, so really not a bad thing.

The whole "double check" pathway in waitset only happens in a race case. It occurs when one thread sets the condition you want right at the same time that you check it, so your first check fails and after you prepare_wait, your second check passes. The resignal only occurs if you are in that race path, and also the setting thread sent you a signal between your prepare_wait and cancel, *and* there's another thread waiting on that same signal that should have gotten it. Basically this case is quite rare, we don't care too much about it being fast or elegant (as long as it's not disastrously slow), we just need behavior to be correct when it does happen - and the "pass on the signal" mechanism gives you that.

The advantage of being able to do just a NotifyOne instead of a NotifyAll is so huge that it's worth adopting this as standard practice in waitset.

2. It then occurred to me that the waitset PrepareWait and Cancel could be made lock-free pretty trivially.

Conceptually, they are made lock free by turning them into messages. "Notify" is now the receiver of messages and the scheme is now :


{
waiter w;
waitset : send message { prepare_wait , &w, condition };

if ( double check )
{
    waitset : send message { cancel , &w };
    return;
}

w.wait();
}

-------

{
waitset Notify(condition) :
first consume all messages
do prepare_wait and cancel actions
then do the normal notify
eg. see if there are any waiters that want to know about "condition"
}

The result is that the entire wait-side operation is lock free. The notify-side still uses a lock to ensure the consistency of the wait list.

This greatly reduces contention in the most common usage patterns :


Mutex :

only the mutex owner does Notify
 - so contention of the waitset lock is non-existant
many threads may try to lock a mutex
 - they do not have any waitset-lock contention

Semaphore :

the common case of one producer and many consumers (lots of threads do wait() )
 - zero contention of the waitset lock

the less common case of many producers and few consumers is slow

Another way to look at it is instead of doing little bits of waitlist maintenance in three different places (prepare_wait,notify,cancel) which each have to take a lock on the list, all the maintenance is moved to one spot.

Now there are some subtleties.

If you used a fresh "waiter" every time, things would be simple. But for efficiency you don't want to do that. In fact I use one unique waiter per thread. There's only one OS waitable handle needed per thread and you can use that to implement every threading primitive. But now you have to be able to recycle the waiter. Note that you don't have to worry about other threads using your waiter; the waiter is per-thread so you just have to worry about when you come around and use it again yourself.

If you didn't try to do the lock-free wait-side, recycling would be easy. But with the lock-free wait side there are some issues.

First is that when you do a prepare-then-cancel , your cancel might not actually be done for a long time (it was just a request). So if you come back around on the same thread and call prepare() again, prepare has to check if that earlier cancel has been processed or not. If it has not, then you just have to force the Notify-side list maintenance to be done immediately.

The second related issue is that the lock-free wait-side can give you spurious signals to your waiter. Normally prepare_wait could clear the OS handle, so that when you wait on it you know that you got the signal you wanted. But because prepare_wait is just a message and doesn't take the lock on the waitlist, you might actually still be in the waitlist from the previous time you used your waiter. Thus you can get a signal that you didn't want. There are a few solutions to this; one is to allow spurious signals (I don't love that); another is to detect that the signal is spurious and wait again (I do this). Another is to always just grab the waitlist lock (and do nothing) in either cancel or prepare_wait.


Ok, so we now have a clean waitset that can do NotifyOne and guarantee no spurious signals. Let's use it.

You may recall we've looked at a simple waitset-based mutex before :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,1) != 0 )
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel();
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 )
    {
        waitset.NotifyAll( &thinlock );
    }
}
This mutex is non-recursive, and of course you should spin doing some TryLocks before going into the wait loop for efficiency.

This was an okay way to build a mutex on waitset when all you had was NotifyAll. It only does the notify if there are waiters, but the big problem with it is if you have multiple waiters, it wakes them all and then they all run in to try to grab the mutex, and all but one fail and go back to sleep. This is a common type of unnecessary-wakeup thread-thrashing pattern that sucks really bad.

(any time you write threading code where the wakeup means "hey wakeup and see if you can grab an atomic" (as opposed to "wakeup you got it"), you should be concerned (particularly when the wake is a broadcast))

Now that we have NotifyOne we can fix that mutex :


U32 thinlock;

Lock :
{
    // first check :
    while( Exchange(&thinlock,2) != 0 ) // (*1)
    {
        waiter w; // get from TLS
        waitset.PrepareWait( &w, &thinlock );

        // double check and put in waiter flag :
        if ( Exchange(&thinlock,2) == 0 )
        {
            // got it
            w.Cancel(waitset_resignal_no); // (*2)
            return;
        }

        w.Wait();
    }
}

Unlock :
{
    if ( Exchange(&thinlock,0) == 2 ) // (*3)
    {
        waitset.NotifyOne( &thinlock );
    }
}
We changed the NotifyAll to NotifyOne , but a few bits are worth noting :

(*1) : we must now immediately exchange in the waiter-flag here; in the NotifyAll case it worked to put a 1 in there for funny reasons (see cbloom rants 07-15-11 - Review of many Mutex implementations , where this type of mutex is discussed as "Three-state mutex using Event" ), but it doesn't work with the NotifyOne.

(*2) : with a mutex you do not need to pass on the signal when you stole it and cancelled. The reason is just that there can't possibly be any more mutex for another thread to consume. A mutex is a lot like a semaphore with a maximum count of 1 (actually it's exactly like it for non-recursive mutexes); you only need to pass on the signal when it's possible that some other thread needs to know about it.

(*3) : you might think the check for == 2 here is dumb because we always put in a 2, but there's code you're not seeing. TryLock should still put in a 1, so in the uncontended case the thinlock will have a value of 1 and no Notify is done. The thinlock only goes to a 2 if there is some contention, and then the value stays at 2 until the last unlock of that contended sequence.

Okay, so that works, but it's kind of silly. With the mechanism we have now we can do a much neater mutex :


U32 thinlock; // = 0 initializes thinlock

Lock :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinlock );

    if ( Fetch_Add(&thinlock,1) == 0 )
    {
        // got the lock (no need to resignal)
        w.Cancel(waitset_resignal_no);
        return;
    }

    w.Wait();
    // woke up - I have the lock !
}

Unlock :
{
    if ( Fetch_Add(&thinlock,-1) > 1 )
    {
        // there were waiters
        waitset.NotifyOne( &thinlock );
    }
}
The mutex is just a wait-count now. (as usual you should TryLock a few times before jumping in to the PrepareWait). This mutex is more elegant; it also has a small performance advantage in that it only calls NotifyOne when it really needs to; because its gate is also a wait-count it knows if it needs to Notify or not. The previous Mutex posted will always Notify on the last unlock whether or not it needs to (eg. it will always do one Notify too many).

This last mutex is also really just a semaphore. We can see it by writing a semaphore with our waitset :


U32 thinsem; // = 0 initializes thinsem

Wait :
{
    waiter w; // get from TLS
    waitset.PrepareWait( &w, &thinsem );

    if ( Fetch_Add(&thinsem,-1) > 0 )
    {
        // got a dec on count
        w.Cancel(waitset_resignal_yes); // (*1)
        return;
    }

    w.Wait();
    // woke up - I got the sem !
}

Post :
{
    if ( Fetch_add(&thinsem,1) < 0 )
    {
        waitset.NotifyOne( &thinsem );
    }
}

which is obviously the same. The only subtle change is at (*1) - with a semaphore we must do the resignal, because there may have been several posts to the sem (contrast with mutex where there can only be one Unlock at a time; and the mutex itself serializes the unlocks).


Oh, one very subtle issue that I only discovered due to relacy :

waitset.Notify requires a #StoreLoad between the condition check and the notify call. That is, the standard pattern for any kind of "Publish" is something like :


Publish
{
    change shared variable
    if ( any waiters )
    {
        #StoreLoad
        waitset.Notify()
    }
}

Now, in most cases, such as the Sem and Mutex posted above, the Publish uses an atomic RMW op. If that is the case, then you don't need to add any more barriers - the RMW synchronizes for you. But if you do some kind of more weakly ordered primitive, then you must force a barrier there.
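For example, in C++0x-style atomics the forced barrier looks like this; a sketch, assuming waiter_count is however you track "any waiters" :


// publish with a weakly ordered store; must force the #StoreLoad by hand :
shared_var.store( new_value, memory_order_release );
atomic_thread_fence( memory_order_seq_cst ); // the #StoreLoad
if ( waiter_count.load( memory_order_relaxed ) > 0 )
{
    waitset.Notify( &shared_var );
}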

This is the exact same issue that I've run into before and forgot about again :

cbloom rants 07-31-11 - An example that needs seq_cst -
cbloom rants 08-09-11 - Threading Links (see threads on eventcount)
cbloom rants 06-01-12 - On C++ Atomic Fences Part 3


04-04-13 | Tabdir

I made a fresh build of "tabdir" my old ( old ) utility that does a recursive dirlisting in tabbed-text format for "tabview".

Download : tabdir 320k zip

tabdir -?
usage : tabdir [opts] [dir]
options:
 -v  : view output after writing
 -p  : show progress of dir enumeration (with interactive keys)
 -w# : set # of worker threads
 -oN : output to name N [r:\tabdir.tab]

This new tabdir is built on Oodle so it has a multi-threaded dir lister for much greater speed. (*)

Also note to self : I fixed tabview so it works as a shell file association. I hit this all the time and always forget it : if something works on the command line but not as a shell association, it's probably because the shell passes you quotes around file names, so you need a little code to strip quotes from args.

Someday I'd like to write an even faster tabdir that reads the NTFS volume directory information directly, but chances are that will never happen.

One odd thing I've spotted with this tabdir is that the Windows SxS Assembly dirs take a ton of time to enumerate on my machine. I dunno if they're compressed or WTF the deal is with them (I pushed it on the todo stack to investigate), but they're like 10X slower than any other dir. (could just be the larger number of files in there; but I mean it's slow *per file*)

I never did this before because I didn't expect multi-threaded dir enumeration to be a big win; I thought it would just cause seek thrashing, and if you're IO bound anyway then multi-threading can't help, can it? Well, it turns out the Win32 dir enum functions have quite a lot of CPU overhead, so multi-threading does in fact help a bit :

nworkers| elapsed time
1       | 12.327
2       | 10.450
3       | 9.710
4       | 9.130

(* = actually the big speed win was not multi-threading, it's that the old tabdir did something rather dumb in the file enum. It would enum all files, and then do GetInfo() on each one to get the file sizes. The new one just uses the file infos that are returned as part of the Win32 enumeration, which is massively faster).


04-04-13 | Worker Thread Forward Permit Delay-Kicks

I've had this small annoying issue for a while, and finally thought of a pretty simple solution.

You may recall, I use a worker thread system with forward "permits" (reversed dependencies). When any handle completes it sees if that completion should trigger any followup handles, and if so those are then launched. Handles may be SPU jobs or IOs or CPU jobs or whatever. The problem I will talk about occurred when the predecessor and the followup were both CPU jobs.

I'll talk about a specific case to be concrete : decoding compressed data while reading it from disk.

To decode each chunk of LZ data, a chunk-decompress job is made. That job depends on the IO(s) that read in the compressed data for that chunk. It also depends on the previous chunk if the chunk is not a seek-reset point. So in the case of a non-reset chunk, you have a dependency on an IO and a previous CPU job. Your job will be started by one or the other, whichever finishes last.

Now, when decompression was IO bound, then the IO completions were kicking off the decompress jobs, and everything was fine.

In these timelines, the second line is IO and the bottom four are workers. (click images for higher res)

LZ Read and Decompress without seek-resets, IO bound :

You can see the funny fans of lines that show the dependency on the previous decompress job and also the IO. Yellow is a thread that's sleeping.

You may notice that the worker threads are cycling around. That's not really ideal, but it's not related to the problem I'm talking about today. (that cycling is caused by the fact that the OS semaphore is FIFO. For something like worker threads, we'd actually rather have a LIFO semaphore, because it makes it more likely that you get a thread with something useful still hot in cache. Someday I'll replace my OS semaphore with my own LIFO one, but for now this is a minor performance bug). (Win32 docs say that they don't guarantee any particular order, but in my experience threads of equal priority are always FIFO in Win32 semaphores)

Okay, now for the problem. When the IO was going fast, so we were CPU bound, it's the prior decompress job that triggers the followup work.

But something bad happened due to the forward permit system. The control flow was something like this :


On worker thread 0

wake from semaphore
do on an LZ decompress job
mark job done
completion change causes a permits check
permits check sees that there is a pending job triggered by this completion
  -> fire off that handle
   handle is pushed to worker thread system
   no worker is available to do it, so wake a new worker and give him the job
finalize (usually delete) job I just finished
look for more work to do
   there is none because it was already handed to a new worker

And it looked like this :

LZ Read and Decompress without seek-resets, CPU bound, naive permits :

You can see each subsequent decompress job is moving to another worker thread. Yuck, bad.

So the fix in Oodle is to use the "delay-kick" mechanism, which I'd already been using for coroutine refires (which had a similar problem; the problem occurred when you yielded a coroutine on something like an IO, and the IO was done almost immediately; the coroutine would get moved to another worker thread instead of just staying on the same one and continuing from the yield as if it wasn't there).

The scheme is something like this :


On each worker thread :

Try to pop a work item from the "delay kick queue"
  if there is more than one item in the DKQ,
    take one for myself and "kick" the remainder
    (kick means wake worker threads to do the jobs)

If nothing on DKQ, pop from the main queue
  if nothing on main queue, wait on work semaphore

Do your job

Set "delay kick" = true
  ("delay kick" has to be in TLS of course)
Mark job as done
Permit system checks for successor handles that can now run 
  if they exist, they are put in the DKQ instead of immediately firing
Set "delay kick" = false

Repeat

In brief : work that is made runnable by the completion of work is not fired until the worker thread that did the completion gets its own shot at grabbing that new work. If the completion made 4 jobs runnable, the worker will grab 1 for itself and kick the other 3. The kick is no longer in the completion phase, it's in the pop phase.

And the result is :

LZ Read and Decompress without seek-resets, CPU bound, delay-kick permits :

Molto superiore.

These last two timelines are on the same time scale, so you can see just from the visual that eliminating the unnecessary thread switching is about a 10% speedup.

Anyway, this particular issue may not apply to your worker thread system, or you may have other solutions. I think the main take-away is that while worker thread systems seem very simple to write at first, there's actually a huge amount of careful fiddling required to make them really run well. You have to be constantly vigilant about doing test runs and checking threadprofile views like this to ensure that what's actually happening matches what you think is happening. Err, oops, I think I just accidentally wrote an advertisement for Telemetry .


04-04-13 | Oodle Compression on BC1 and WAV

I put some stupid filters in, so here are some results for the record and my reference.

BC1 (DXT1/S3TC) DDS textures :

All compressors run in max-compress mode. Note that it's not entirely fair because Oodle has the BC1 swizzle and the others don't.

Some day I'd like to do a BC1-specific encoder. Various ideas and possibilities there. Also RD-DXTC.

I also did a WAV filter. This one is particularly ridiculous because nobody uses WAV, and if you want to compress audio you should use a domain-specific compressor, not just OodleLZ with a simple delta filter. I did it because I was annoyed that RAR beat me on WAVs (due to its having a multimedia filter), and RAR should never beat me.

WAV compression :

See also : same chart with 7z (not really fair cuz 7z doesn't have a WAV filter)

Happy to see that Oodle-filt handily beats RAR-filt. I'm using just a trivial linear gradient predictor :


out[i] = in[i] - 2*in[i-1] + in[i-2]

this could surely be better, but whatever, WAV filtering is not important.

I also did a simple BMP delta filter and EXE (BCJ/relative-call) transform. I don't really want to get into the business of offering all kinds of special case filters the way some of the more insane modern archivers do (like undoing ZLIB compression so you can recompress it, or WRT), but anyhoo there's a few.


ADDED : I will say something perhaps useful about the WAV filter.

There's a bit of a funny issue because the WAV data is 16 bit (or 24 or 32), and the back-end entropy coder in a simple LZ is 8 bit.

If you just take a 16-bit delta and put it into bytes, then most of your values will be around zero, and you'll make a stream like :

[00 00] [00 01] [FF FF] [FF F8] [00 04] ...
The bad thing you should notice here is that the high bytes are switching between 00 and FF even though the values have quite a small range. (Note that the common thing of centering the values with +32768 doesn't change this at all).

You can make this much better just by doing a bias of +128. That makes it so the most important range of values (around zero (specifically [-128,127])) all have the same top byte.

I think it might be even slightly better to do a "folded" signed->unsigned map, like

{ 0,-1,1,-2,2,-3,...,32767,-32768 }
The main difference being that values like -129 and +128 get the same high byte in this mapping, rather than two different high bytes in the simple +128 bias scheme.

Of course you really want a separate 8-bit huffman for alternating pairs of bytes. One way to get that is to use a few bottom bits of position as part of the literal context. Also, the high byte should really be used as context for the low byte. But both of those are beyond the capabilities of my simple LZ-huffs so I just deinterleave the high and low bytes to two streams.
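
Putting the pieces together, a minimal sketch of the whole filter (the gradient predictor, the folded map, and the byte deinterleave); 16-bit mono is assumed, and the wrap-around arithmetic is fine because the unfilter wraps the same way :


U16 fold( S16 delta ) // 0,-1,1,-2,2,... -> 0,1,2,3,4,...
{
    return ( delta >= 0 ) ? (U16)(2*delta) : (U16)(-2*delta - 1);
}

void filter_wav16( const S16 * in, int count, U8 * lo, U8 * hi )
{
    for(int i=0;i<count;i++)
    {
        // linear gradient predictor : pred = 2*in[i-1] - in[i-2]
        S16 prev1 = ( i >= 1 ) ? in[i-1] : 0;
        S16 prev2 = ( i >= 2 ) ? in[i-2] : 0;
        S16 delta = (S16)( in[i] - 2*prev1 + prev2 );
        U16 u = fold( delta );
        // deinterleave high and low bytes to two streams :
        lo[i] = (U8)( u & 0xFF );
        hi[i] = (U8)( u >> 8 );
    }
}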


04-02-13 | The Method of Random Holdouts

Quick and dirty game-programming style hacky way to make fitting model parameters somewhat better.

You have some testset {T} of many items, and you wish to fit some heuristic model M over T which has some parameters. There may be multiple forms of the model and you aren't sure which is best, so you wish to compare models against each other.

For concreteness, you might imagine that T is a bunch of images, and you are trying to make a perceptual DXTC coder; you measure block error in the encoder as something like (SSD + a * SATD ^ b + c * SSIM_8x8 ) , and the goal is to minimize the total image error in the outer loop, measured using something complex like IW-MS-SSIM or "MyDCTDelta" or whatever. So you are trying to fit the parameters {a,b,c} to minimize an error.

For reference, the naive training method is : run the model on all data in {T}, optimize parameters to minimize error over {T}.

The method of random holdouts goes like this :


Run many trials

On each trial, take the testset T and randomly separate it into a training set and a verification set.
Typically training set is something like 75% of the data and verification is 25%.

Optimize the model parameters on the {training set} to minimize the error measure over {training set}.

Now run the optimized model on the {verification set} and measure the error there.
This is the error that will be used to rate the model.

When you make the average error, compensate for the size of the model thusly :
average_error = sum_error / ( [num in {verification set}] - [dof in model] )

Record the optimal parameters and the error for that trial

Now you have optimized parameters for each trial, and an error for each trial. You can take the average over all trials, but you can also take the sdev. The sdev tells you how well your model is really working - if it's not close to zero then you are missing something important in your model. A term with a large sdev might just be a random factor that's not useful in the model, and you should try again without it.
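A sketch of the trial loop in C++ ; Item, Params, FitParams, MeasureError, and the dof member are all hypothetical stand-ins for your data type, model, trainer, and error measure :


#include <vector>
#include <random>
#include <algorithm>

std::vector<double> RunRandomHoldouts( const std::vector<Item> & T, int num_trials )
{
    std::vector<double> trial_errors;
    std::mt19937 rng(12345);
    std::vector<size_t> order(T.size());
    for(size_t i=0;i<order.size();i++) order[i] = i;

    for(int t=0;t<num_trials;t++)
    {
        // randomly split ~75% train / ~25% verify :
        std::shuffle(order.begin(), order.end(), rng);
        size_t num_train = (T.size()*3)/4;

        std::vector<Item> train, verify;
        for(size_t i=0;i<order.size();i++)
            ( i < num_train ? train : verify ).push_back( T[order[i]] );

        // optimize on the training set, rate on the verification set :
        Params p = FitParams(train);
        double sum_error = MeasureError(p, verify);

        // compensate for the model's degrees of freedom :
        double avg_error = sum_error / ( (double)verify.size() - (double)p.dof );

        trial_errors.push_back(avg_error);
    }
    return trial_errors; // take mean and sdev of these
}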

The method of random holdouts reduces over-training risk, because in each run you are measuring error only on data samples that were not used in training.

The method of random holdouts gives you a decent way to compare models which may have different numbers of DOF. If you just use the naive method of training, then models with more DOF will always appear better, because they are just fitting your data set.

That is, in our example say that (SSD + a * SATD ^ b) is actually the ideal model and the extra term ( + c * SSIM_8x8 ) is not useful. As long as it's not just a linear combo of other terms, then naive training will find a "c" such that that term is used to compensate for variations in your particular testset. And in fact that incorrect "c" can be quite a large value (along with a negative "a").

This kind of method can also be used for fancier stuff like building complex models from ensembles of simple models, "boosting" models, etc. But it's useful even in this case where we wind up just using a simple linear model, because you can see how it varies over the random holdouts.


03-31-13 | Market Equilibrium

I'm sure there's some standard economic theory of all this but hey let's reinvent the wheel without any background.

There's a fundamental principle of any healthy (*) market that the reward for some labor is equal across all fields - proportional only to standard factors like the risk factor, the scarcity of labor, the capital required for entry, etc. (* = more on "healthy" later). The point is that those factors have *nothing* to do with the details of the field.

The basic factor at play is that if some field changes and suddenly becomes much more profitable, then people will flood into that field, and the risk-equal-capital-return will keep going down until it becomes equal to other fields. Water flows downhill, you know.

When people like Alan Greenspan try to tell you that oh this new field is completely unlike anything we've seen in the past because of blah blah - it doesn't matter, they may have lots of great points that seem reasonable in isolation, but the equilibrium still applies. The pay of a computer programmer is set by the pay of a farmer, because if the difference were out of whack, the farmer would quit farming and start programming; the pay of programmers will go down and the wages of farmers will go up, then the price of lettuce will go up, and in the end a programmer won't be able to buy any more lettuce than anyone else in a similar job. ("similar" only in terms of risk, ease of entry, rarity of talent, etc.)

We went through a drive-through car wash yesterday and Tasha idly wondered how much the car wash operator makes from an operation like that. Well, I bet it's about the same as a quick-lube place makes, and that's about the same as a dry cleaner, and it's about the same as a pizza place (which has less capital outlay but more risk), because if one of them was much more profitable, there would be more competition until equilibrium was reached.

Specifically I've been thinking about this because of the current indie game boom on the PC, which seems to be a bit of a magic gold rush at the moment. That almost inevitably has to die out, it's just a question of when. (so hurry up and get your game out before it does!).

But of course that leads us into the issue of broken markets, since all current game sales avenues are deeply broken markets.

Equilibrium (like most naive economic theory) only applies to markets where there's fluidity, robust competition, no monopolistic control, free information, etc. And of course those don't happen in the real world.

Whenever a market is not healthy, it provides an opportunity for unbalanced reward, well out of equilibrium.

Lack of information can be particularly be a factor in small niches. There can be a company that does something random like make height-adjustable massage tables. If they're a private operation and nobody really pays attention to them, they can have super high profit levels for something that's not particularly difficult - way out of equilibrium. If other people knew how easy that business was, lots of others would enter, but due to lack of information they don't.

Patents and other such mechanisms create legally enforced distortions of the market. Of course things like the cable and utility systems are even worse.

On a large scale, government distortion means that huge fields like health care, finance, insurance, oil, farming, etc. are all more profitable than they should be.

Perhaps the biggest issue in downloadable games is the oligopoly of App Store and Steam. This creates an unhealthy market distortion and it's hard to say exactly what the long term effect of that will be. (of course you don't see it as "unhealthy" if you are the one benefiting from the favor of the great powers; it's unhealthy in a market fluidity and fair competition sense, and may slow or prevent equilibrium)

Of course new fields are not yet in equilibrium, and one of the best ways to "get rich quick" is to chase new fields. Software has been out of equilibrium for the past 50 years, and is only recently settling down. Eventually software will be a very poorly paid field, because it requires very little capital to become a programmer, it's low risk, and there are lots of people who can do it.

Note that in *every* field the best will always rise to the top and be paid accordingly.

Games used to be a great field to work in because it was a new field. New fields are exciting, they offer great opportunities for innovation, and they attract the best people. Mature industries are well into equilibrium and the only chances for great success are through either big risk, big capital investment, or crookedness.


03-31-13 | Some GDC Observations

From my very limited view of GDC standing at the RAD booth.

1. Programming is dead. There were basically zero programming talks at GDC this year. That's sad, but also perfectly reasonable since programming is not the problem any more (*). (* = assuming that you just want to make the same old shit with different graphics)

2. Piece of shit mobile games that people have thrown together in a month look better than AAA games 10 years ago. It's not just that GPU's are so much better, but the free engines are really amazing these days, and the content pipes are so much better, and there are so many more decent 3d artists that can just make tons of content.

3. Game developers look like human beings now. If you looked at a GDC when I first started going, we were all classic troglodyte nerds; unwashed sweatshirts and open backpacks with slide-rules falling out. We were all vampirically pale from being locked in a dark box surrounded by our giant CRTs. (more generally I'm noticing that the average fitness level (on the west coast anyway) is way up in the past 5 years or so).

4. Mobile is dead, downloadable is king. I do an unscientific random sampling every year just by asking the people who stop by the RAD booth what they're working on. For the past few years it has been mobile mobile "we're making a game for ios and android", tons of kids and startups and indies trying to get into mobile. That seems to be gone, and the new gold rush is "downloadable" (PC, XBLA, etc).

5. Games are tacky and tasteless. One of the worst things for me standing at the booth is just hearing and seeing games all day. I don't play games much, I never watch TV with commercials, and I never watch things like cable news with all the excessive HUD and overstimulation, I find all that stuff abusive of my senses. Games are stuck in this awful "bling bling whoosh blammo" flashing and fast-cuts and just really tacky aesthetic. It's just like TV ads, or a bit like standing in the slot machine section of a casino (which is surely some level of hell).

6. I saw one really amazing game at GDC that stood out from the rest. It had all the players instantly smiling and laughing. It was fun for kids and adults. It created a feeling of group affinity. Everyone around wanted to join in. It was even beneficial to the body. It was an inflatable ball. Personally I had the "holy shit what we make is total crap" (actually worse than crap, because it's actively harmful to the body and mind) epiphany some 10+ years ago, but it just struck me so hard standing there with all these shit games around and people having so much more fun in the most basic game in the non-electronic world.


03-31-13 | Index - Game Threading Architecture

Gathering the series for an index post :

cbloom rants 08-01-11 - A game threading model
cbloom rants 12-03-11 - Worker Thread system with reverse dependencies
cbloom rants 03-05-12 - Oodle Handle Table
cbloom rants 03-08-12 - Oodle Coroutines
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts
cbloom rants 12-18-12 - Async-Await ; Microsoft's Coroutines
cbloom rants 12-21-12 - Coroutine-centric Architecture
cbloom rants 12-21-12 - Coroutines From Lambdas
cbloom rants 12-06-12 - Theoretical Oodle Rewrite Continued
cbloom rants 02-23-13 - Threading - Reasoning Behind Coroutine Centric Design

I believe this is a good architecture, using the techniques that we currently have available, without doing anything that I consider bananas like writing your own programming language (*). Of course if you are platform-specific or know you can use C++11 there are small ways to make things more convenient, but the fundamental architecture would be about the same (and assuming that you will never need to port to a broken platform is a mistake I know well).

(* = a lot of people that I consider usually smart seem to think that writing a custom language is a great solution for lots of problems. Whenever we're talking about "oh reflection in C is a mess" or "dependency analysis should be automatic", they'll throw out "well if you had the time you would just write a custom language that does all this better". Would you? I certainly wouldn't. I like using tools that actually work, that new hires are familiar with, etc. etc. I don't have to list the pros of sticking with standard languages. In my experience every clever custom language for games is a huge fucking disaster and I would never advocate that as a good solution for any problem. It's not a question of limited dev times and budgets.)


03-31-13 | Endian-Independent Structs

I dunno, maybe this is common practice, but I've never seen it before.

The easy way to load many file formats (I'll use a BMP here to be concrete) is just to point a struct at it :


struct BITMAPFILEHEADER
{
    U16 bfType; 
    U32 bfSize; 
    U16 bfReserved1; 
    U16 bfReserved2; 
    U32 bfOffBits; 
} __attribute__ ((__packed__));


BITMAPFILEHEADER * bmfh = (BITMAPFILEHEADER *)data;

if ( bmfh->bfType != 0x4D42 )
    ERROR_RETURN("not a BM",0);

etc..

but of course this doesn't work cross platform.

So people do all kinds of convoluted things (which I have usually done), like changing to a method like :


U16 bfType = Get16LE(&ptr);
U32 bfSize = Get32LE(&ptr);

or they'll do some crazy struct-parse fixup thing which I've always found to be bananas.

But there's a super trivial and convenient solution :


struct BITMAPFILEHEADER
{
    U16LE bfType; 
    U32LE bfSize; 
    U16LE bfReserved1; 
    U16LE bfReserved2; 
    U32LE bfOffBits; 
} __attribute__ ((__packed__));

where U16LE is just U16 on little-endian platforms and is a class that does bswap on itself on big-endian platforms.
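
For concreteness, here's a minimal sketch of how that can look (PLATFORM_BIG_ENDIAN standing in for whatever endian detection your meta-language provides) :


#if PLATFORM_BIG_ENDIAN

struct U16LE
{
    U16 raw; // bytes are little-endian in memory regardless of platform

    operator U16 () const { return (U16)((raw>>8)|(raw<<8)); } // bswap on read
    U16LE & operator = (U16 v) { raw = (U16)((v>>8)|(v<<8)); return *this; } // bswap on write
} __attribute__ ((__packed__));

#else

typedef U16 U16LE; // little-endian platform : plain U16, zero cost

#endif

(U32LE is the same idea with a 4-byte swap.)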

Then you can still just use the old struct-pointing method and everything just works. Duh, I can't believe I didn't think of this earlier.

Similarly, here's a WAV header :


struct WAV_header_LE
{
    U32LE FOURCC_RIFF; // RIFF Header 
    U32LE riffChunkSize; // RIFF Chunk Size 
    U32LE FOURCC_WAVE; // WAVE Header 
    U32LE FOURCC_FMT; // FMT header 
    U32LE fmtChunkSize; // Size of the fmt chunk 
    U16LE audioFormat; // Audio format 1=PCM,6=mulaw,7=alaw, 257=IBM Mu-Law, 258=IBM A-Law, 259=ADPCM 
    U16LE numChan; // Number of channels 1=Mono 2=Stereo 
    U32LE samplesPerSec; // Sampling Frequency in Hz 
    U32LE bytesPerSec; // bytes per second 
    U16LE blockAlign; // normally NumChan * bytes per sample
    U16LE bitsPerSample; // Number of bits per sample 
}  __attribute__ ((__packed__));

easy.

For file-input type structs, you just do this and there's no penalty. For structs you keep in memory you wouldn't want to eat the bswap all the time, but even in that case this provides a simple way to get the swizzle into native structs by just copying all the members over.

Of course if you have the Reflection-Visitor system that I'm fond of, that's also a good way to go. (cursed C, give me a "do this macro on all members").


03-30-13 | Error codes

Some random rambling on the topic of returning error codes.

Recently I've been fixing up a bunch of code that does things like

void  MutexLock( Mutex * m )
{
    if ( ! m ) return;
    ...

yikes. Invalid argument and you just silently do nothing. No thank you.

We should all know that silently nopping in failure cases is pretty horrible. But I'm also dealing with a lot of error code returns, and it occurs to me that returning an error code in that situation is not much better.

Personally I want unexpected or unhandleable errors to just blow up my app. In my own code I would just assert; unfortunately that's not viable in OS code or perhaps even in a library.

The classic example is malloc. I hate mallocs that return null. If I run out of memory, there's no way I'm handling it cleanly and reducing my footprint and carrying on. Just blow up my app. Personally whenever I implement an allocator if it can't get memory from the OS it just prints a message and exits (*).

(* = aside : even better is "functions that don't fail" which I might write more about later; basically the idea is the function tries to handle the failure case itself and never returns it out to the larger app. So in the case of malloc it might print a message like "tried to alloc N bytes; (a)bort/(r)etry/return (n)ull?". Another common case is when you try to open a file for write and it fails for whatever reason, it should just handle that at the low level and say "couldn't open X for write; (a)bort/(r)etry/change (n)ame?" )

I think error code returns are okay for *expected* and *recoverable* errors.

On functions that you realistically expect to always succeed and will not check error codes for, they shouldn't return error codes at all. I wrote recently about wrapping system APIs for portable code ; an example of the style of level 2 wrapping that I like is to "fix" the error returns.

(obviously this is not something the OS should do, they just have to return every error; it requires app-specific knowledge about what kind of errors your app can encounter and successfully recover from and continue, vs. ones that just mean you have a catastrophic unexpected bug)

For example, functions like lock & unlock a mutex shouldn't fail (in my code). 99% of the user code in the world that locks and unlocks mutexes doesn't check the return value, they just call lock and then proceed assuming the lock succeeded - so don't return it :


void mypthread_mutex_lock(mypthread_mutex_t *mutex)
{
    int ret = pthread_mutex_lock(mutex);
    if ( ret != 0 )
        CB_FAIL("pthread_mutex_lock",ret);
}

When you get a crazy unexpected error like that, the app should just blow up right at the call site (rather than silently failing and then blowing up somewhere weird later on because the mutex wasn't actually locked).

In other cases there are a mix of expected failures and unexpected ones, and the level-2 wrapper should differentiate between them :


bool mysem_trywait(mysem * sem)
{
    for(;;)
    {
        int res = sem_trywait( sem );
        if ( res == 0 ) return true; // got it

        int err = errno;
        if ( err == EINTR )
        {
            // UNIX is such balls
            continue;
        }
        else if ( err == EAGAIN )
        {
            // expected failure, no count in sem to dec :
            return false;
        }
        else
        {
            // crazy failure; blow up :
            CB_FAIL("sem_trywait",err);
        }
    }
}

(BTW best practice these days is always to copy "errno" out to an int, because errno may actually be #defined to a function call in the multithreaded world)

And since I just stumbled into it by accident, I may as well talk about EINTR. Now I understand that there may be legitimate reasons why you *want* an OS API that's interrupted by signals - we're going to ignore that, because that's not what the EINTR debate is about. So for purposes of discussion pretend that you never have a use case where you want EINTR and it's just a question of whether the API should put that trouble on the user or not.

I ranted about EINTR at RAD a while ago and was informed (reminded) this was an ancient argument that I was on the wrong side of.

Mmm. One thing certainly is true : if you want to write an operating system (or any piece of software) such that it is easy to port to lots of platforms and maintain for a long time, then it should be absolutely as simple as possible (meaning simple to implement, not simple in the API or simple to use), even at the cost of "rightness" and pain to the user. That I certainly agree with; UNIX has succeeded at being easy to port (and also succeeded at being a pain to the user).

But most people who argue on the pro-EINTR side of the argument are just wrong; they are confused about what the advantage of the pro-EINTR argument is (for example Jeff Atwood takes off on a general rant against complexity ; I think we all should know by now that huge complex APIs are bad; that's not interesting, and that's not what "Worse is Better" is about; or Jeff's example of INI files vs the registry - INI files are just massively better in every way, it's not related at all, there's no pro-con there).

(to be clear and simple : the pro-EINTR argument is entirely about simplicity of implementation and porting of the API; it's about requiring the minimum from the system)

The EINTR-returning API is not simpler (than one that doesn't force you to loop). Consider an API like this :


U64 system( U64 code );

doc :

if the top 32 bits of code are 77 this is a file open and the bottom 32 bits specify a device; the
return values then are 0 = call the same function again with the first 8 chars of the file name ...
if it returns 7 then you must sleep at least 1 milli and then call again with code = 44 ...
etc.. docs for 100 pages ...

what you should now realize is that *the docs are part of the API*. (that is not a "simple" API)

An API that requires you to carefully read about the weird special cases and understand what is going on inside the system is NOT a simple API. It might look simple, but it's in disguise. A simple API does what you expect it to. You should be able to just look at the function signature and guess what it does and be right 99% of the time.

Aside from the issue of simplicity, any API that requires you to write the exact same boiler-plate every time you use it is just a broken fucking API.

Also, I strongly believe that any API which returns error codes should be usable if you don't check the error code at all. Yeah yeah in real production code of course you check the error code, but for little test apps you should be able to do :


int fd = open("blah");

read(fd,buf);

close(fd);

and that should work okay in my hack test app. Nope, not in UNIX it doesn't. Thanks to its wonderful "simplicity" you have to call "read" in a loop because it might decide to return before the whole read is done.
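
The boiler-plate you're forced to write looks something like this (a sketch; the helper name is mine) :


#include <unistd.h>
#include <errno.h>

// read exactly "count" bytes unless EOF or a real error :
ssize_t read_full(int fd, void * buf, size_t count)
{
    char * p = (char *) buf;
    size_t got = 0;
    while ( got < count )
    {
        ssize_t n = read(fd, p + got, count - got);
        if ( n > 0 ) got += (size_t) n;
        else if ( n == 0 ) break;             // EOF
        else if ( errno == EINTR ) continue;  // "simplicity" at work; retry
        else return -1;                       // real error
    }
    return (ssize_t) got;
}

and every single caller of read() in every program has to write it.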

Another example that occurs to me is the reuse of keywords and syntax in C. Things like making "static" mean something completely different depending on how you use it makes the number of special keywords smaller. But I believe it actually makes the "API" of the language much *more* complex. Instead of having intuitive and obvious separate clear keywords for each meaning that you could perhaps figure out just by looking at them, you instead have to read a bunch of docs and have very technical knowledge of the internals of what the keywords mean in each usage. (there are legitimate advantages to minimizing the number of keywords, of course, like leaving as many names available to users as possible). Knowledge required to use an API is part of the API. Simplicity is determined by the amount of knowledge required to do things correctly.


03-26-13 | Oodle 1.1 and GDC

Hey it's GDC time again, so if you're here come on by the RAD booth and say "hi" (or "fuck you", or whatever).

The Oodle web site just went live a few days ago.

Sometimes I feel embarrassed (ashamed? humiliated?) that it's taken me five years to write a file IO and data compression library. Other times I think I've basically written an entire OS by myself (and all the docs, and marketing materials, and a video compressor, and aborted paging engine, and a bunch of other crap) and that doesn't sound so bad. I suppose the truth is somewhere in the middle. (perhaps with Oodle finally being officially released and selling, I might write a little post-mortem about how it's gone, try to honestly look back at it a bit. (because lord knows what I need is more introspection in my life)).

Oodle 1.1 will be out any day now. Main new features :


Lots more platforms.  Almost everything except mobile platforms now.

LZNIB!  I think LZNIB is pretty great.  8X faster to decode than ZLIB and usually
makes smaller files.

Other junk :
All the compressors can run parallel encode & decode now.
Long-range-matcher for LZ matching on huge files (still only in-memory though).
Incremental compressors for online transmission, and faster resets.

Personally I'm excited the core architecture is finally settling down, and we have a more focused direction to go forward, which is mainly the compressors. I hope to be able to work on some new compressors for 1.2 (like a very-high-compression option, which I currently don't have), and then eventually move on to some image compression stuff.


03-26-13 | Simulating Deep Yield with a Wait

I'm becoming increasingly annoyed at my lack of "deep yield" for coroutines.

Any time you are in a work item, if you decide that you can get some more parallelism by doing a branch-merge inside that item, you need deep yield.

Remember you should never ever do an OS wait on a coroutine thread (with normal threads anyway; on a WinRT threadpool thread you can). The reason is the OS wait disables that worker thread, so you have one less. In the worst case, it leads to deadlock, because all your worker threads can be asleep waiting on work items, and with no worker threads they will never get done.

Anyway, I've cooked up a temporary work-around, it looks like this :


I'm in some function and I want to branch-merge

If I'm not on a worker thread
  -> just do a normal branch-merge, send the work off and use a Wait for completion

If I am on a worker thread :

inc target worker thread count
if # currently live worker threads is < target count
  start a new worker thread (either create or wake from pool)

now do the branch-merge and use OS Wait
dec the target worker thread count


on each worker thread, after completing a work item and before popping more work :
if target worker thread count < currently live count
  stop self (go back into a sleeping state in the pool)

this is basically using OS threads to implement stack-saving deep yield. It's not awesome, but it is okay if deep yield is rare.
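
In code, the count management might look something like this sketch (the pool calls are hypothetical, and the races between the checks and the inc/dec are glossed over; you'd have to close them in a real system) :


#include <atomic>

static std::atomic<int> s_worker_target(0); // how many workers we want live
static std::atomic<int> s_worker_live(0);   // how many actually are

void branch_merge_on_worker_thread()
{
    s_worker_target++;
    if ( s_worker_live.load() < s_worker_target.load() )
        pool_wake_or_create_worker(); // hypothetical; bumps s_worker_live

    // ... kick off the child work items, then OS Wait on their completion ...

    s_worker_target--;
}

// in the worker main loop, after completing an item and before popping more :
//
//   if ( s_worker_live.load() > s_worker_target.load() )
//       { s_worker_live--; pool_sleep_self(); }  // back into the pool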


03-23-13 | AI Rambles

With GDC coming up I've been thinking generally about the state of game technology in general. First a bit of a rant on that.

I am so fucking bored of graphics. Graphics are not the damn problem. I'm completely appalled by the derivative repetitive boring games you all keep making. I don't want to play "Shoot People in the Face 227" or "Space Marines 154" or "Slide Blocks to Make them Go Bling N" or "Cute Creatures Jump Around on Blocks N". Barf, boring. And making them all shiny with new graphics is just gilding the turd. Stop working on graphics.

Games have huge tech problems that nobody seems to want to work on. One that I have wanted to work on for a long time is animation. And by "animation" I don't really mean playing back clips, which fundamentally looks like garbage, but making characters move naturally, able to transition movements the way their body should, respond to surface variations and so on. Game animation just looks so awful, and it's becoming more uncanny as the graphics get better.

(in fact if we were smart we would have done it the other way around. Every cartoonist for a hundred years has known that it's actually ok for the visuals to look unrealistic if the animation and sound are really good. Human perception cares more about motion than the static appearance of things.)

Anyhoo, the other big one is AI. And by "AI" I don't mean playing scripts, or moving to designer-placed cover spots. Even some of the more sophisticated game AI systems are really just fancy whack-a-mole. You can see the AIs run to one spot, do a pre-programmed routine, run to another spot, pop out of cover so the player can shoot them, pop back in cover. Now, certainly there are merits to whack-a-mole AI. If you're making a platformer you don't want the enemy to do surprising things, you just want them to walk back and forth on a set pattern that the player can pick up easily. They're not really AI at all, they're rigid bodies with an animal painted on them.

These AI's never surprise you, they never make you laugh, they never make you want to play again because they might do something new. They feed off your energy and don't give anything back, like a bad conversation partner.

So it made me realize that game AIs are actually more interesting when the game is very simple. It might naively seem like a big complex sandbox 3d world has got a more complex AI, but really that complex world means that the AI no longer understands what it's doing. Your only hope is to give it simple rules to follow about what it can do in that world.

In contrast, AI for simple game systems (chess, checkers, backgammon, poker) can do amazing things that the human programmer never anticipated. There's a funny thing that happens with computer algorithms where a cold rational scientific brute-force search of a mathematical problem space actually leads to behavior that's more human than the shitty heuristic decision-tree type programming that's explicitly trying to simulate human behavior.

For example, when I was writing poker AI, I was really amazed at the "creative" plays that a simulation-based bot makes. (for review : a standard UAlberta-style poker bot works by building a model of the opponent based on observation of previous action; it then simulates all the possibilities for future cards and imagines what the opponent will do in each situation; it sums the EV over all paths for each of its own actions, and chooses the action that maximizes EV).
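
Schematically the decision step is just brute-force EV maximization over the opponent model; something like this (all names hypothetical) :


Action choose_action(const GameState & state, const OpponentModel & model)
{
    Action best = ACTION_FOLD;
    double best_ev = -1e300;
    for ( Action act : legal_actions(state) )
    {
        double ev = 0.0;
        // enumerate the future card runouts, weighted by probability :
        for ( const Runout & r : enumerate_runouts(state) )
            ev += r.prob * simulate_to_showdown(state, act, r, model);
        if ( ev > best_ev ) { best_ev = ev; best = act; }
    }
    return best;
}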

At the simplest level, it figures out things like check-raising when you tend to bet checked flops too much. But it did even weirder things. For example the bot would very quickly become hyper-aggressive against an opponent that folds even slightly too much; it adjusted faster and way more severely than any human. I would play against it sometimes with our cards face up so that I could make sure it was doing sane things, and I would see it make a huge check-raise bluff on the river with junk. My first thought is "I have a bug" and I'd go looking into the stats of the model, and found that there was no bug, it's just that the AI had learned that I thought a big river raise meant strength, so I was folding to them a lot, and therefore the simulation will jam almost every hand.

This type of poker AI is not the game theoretic equilibrium solution. It's assuming that the opponent plays by some scheme which may not be optimal, and that its own strategy is not face up. That can lead it to make mistakes. One I've long been aware of is that it doesn't hedge correctly. Normal humans hedge all the time in their poker play, perhaps too much; you will often suspect that someone is bluffing a huge percent of the time, but you aren't sure. A non-hedging AI would immediately start making very light call-downs, but a cautious human will weight in some factor for the model being wrong and play with a blended strategy that's not disastrous if the model is wrong (like only doing the light call-down in small pots, or waiting for a call-down with a hand that has some chance of being best even if the model is wrong).

Continuing the random rambling train of thought, I just realized (re-realized?) that one of the flaws with this style of poker AI is that it doesn't anticipate the reaction to its moves. Of course it does anticipate the reaction just in terms of "if I bet, what hands will he call with or raise with", but it is evaluating based on the *past* model of the opponent. After you make your bet, the opponent sees it and adjusts their view of you, so you need to be anticipating how their play style changes. For example in the case I mentioned above - when someone is playing pretty weak/tight the bot rapidly becomes hyper-aggressive, which is mostly good, but the bot never gets the idea that "hey he can see I'm raising every single street of every hand, he's going to adjust and call me down more".

Anyway, bringing it back to games, it occurred to me that it would be interesting to try some really simple 2d games, and give them a mathematical solving AI, instead of the usual heuristic crap we do. Like, let's face facts - we can't actually make games in these big free form 3d worlds, it's too complex. Our ability to do the graphics has gotten way beyond every other aspect. We need to back up and go to like Ultima-style 2d tile-based games. Now you have a space where the AI can just explore future actions, and things like advancing on the player by moving from cover to cover just pops out of the behavior automatically because it maximizes EV, not because it was explicitly coded.

(I'm not contending that this is the "right way" to make games or that it will necessarily make good games, I just thought it was interesting)


03-19-13 | Windows Sleep Variation

Hmm, I've discovered that Sleep(n) behaves very differently on my three Windows boxes.

(Also remember there are a lot of other issues with Sleep(n) ; the times are only reliable here because this is in a no-op test app)

This actually started because I was looking into Linux thread sleep timing, so I wrote a little test to just Sleep(n) a bunch of times and measure the observed duration of the sleep.

(Of course on Windows I do timeBeginPeriod(1) and bump my thread to very high priority (and timeGetDevCaps says the minp is 1)).
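
The test loop is trivial, something like this (seconds_now() standing in for a QueryPerformanceCounter wrapper) :


timeBeginPeriod(1);
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

for (int i = 0; i < 1000; i++)
{
    double t0 = seconds_now();
    Sleep(n);
    double millis = (seconds_now() - t0) * 1000.0;
    // accumulate average / sdev / min / max of millis
}

timeEndPeriod(1);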

Anyway, what I'm seeing is this :


Win7 :
sleep(1) : average = 0.999 , sdev = 0.035 ,min = 0.175 , max = 1.568
sleep(2) : average = 2.000 , sdev = 0.041 ,min = 1.344 , max = 2.660
sleep(3) : average = 3.000 , sdev = 0.040 ,min = 2.200 , max = 3.774

Sleep(n) averages n
duration in [n-1,n+1]

WinXP :
sleep(1) : average = 1.952 , sdev = 0.001 ,min = 1.902 , max = 1.966
sleep(2) : average = 2.929 , sdev = 0.004 ,min = 2.665 , max = 2.961
sleep(3) : average = 3.905 , sdev = 0.004 ,min = 3.640 , max = 3.927

Sleep(n) averages (n+1)
duration very close to (n+1) every time (tiny sdev)

Win8 :
sleep(1) : average = 2.002 , sdev = 0.111 ,min = 1.015 , max = 2.101
sleep(2) : average = 2.703 , sdev = 0.439 ,min = 2.017 , max = 3.085
sleep(3) : average = 3.630 , sdev = 0.452 ,min = 3.003 , max = 4.130

average no good
Sleep(n) minimum very precisely n
duration in [n,n+1] (+ a little error)
rather larger sdev

it's like completely different logic on each of my 3 machines. XP is the most precise, but it's sleeping for (n+1) millis instead of (n) ! Win8 has a very precise min of n, but the average and max is quite sloppy (sdev of almost half a milli, very high variation even with nothing happening on the system). Win7 hits the average really nicely but has a large range, and is the only one that will go well below the requested duration.

As noted before, I had a look at this because I'm running Linux in a VM and seeing very poor performance from my threading code under Linux-VM. So I ran this experiment :


Sleep(1) on Linux :

native : average = 1.094 , sdev = 0.015 , min = 1.054 , max = 1.224
in VM  : average = 3.270 , sdev =14.748 , min = 1.058 , max = 656.297

(added)
in VM2 : average = 1.308 , sdev = 2.757 , min = 1.052 , max = 154.025

obviously being inside a VM on Windows is not being very kind to Linux's threading system. On the native box, Linux's sleep time is way more reliable than Windows (small min-max range) (and this is just with default priority threads and SCHED_OTHER, not even using a high priority trick like with the Windows tests above).

added "in VM2". So the VM threading seems to be much better if you let it see many fewer cores than you have. I'm running on a 4 core (8 hypercore) machine; the base "in VM" numbers are with the VM set to see 4 cores. "in VM2" is with the VM set to 2 cores. Still a really bad max in there, but much better overall.


03-16-13 | Writing Portable Code Rambles

Some thoughts after spending some time on this (still a newbie). How I would do it differently if I started from scratch.

1. Obviously you all know the best practice of using your own data types (S32 or whatever) and making macros for any kind of common operation that the standards don't handle well (like use a SEL macro instead of ?: , make a macro for ROT, etc). Never use bit-fields, make your own macros for manipulating bits within words. You also have to make your own whole macro meta-language for things not quite in the language, like data alignment, restrict/alias, etc. etc. (god damn C standard people, spend some time on the actual problems that real coders face every day. Thanks mkay). That's background and it's the way to go.

Make your own defines for SIZEOF_POINTER since stupid C doesn't give you any way to check sizeof() in a macro. You probably also want SIZEOF_REGISTER. You need your own equivalent of ptrdiff_t and intptr_t. Best practice is to use pointer-sized ints for all indexing of arrays and buffer sizes.

(one annoying complication is that there are platforms with 64 bit pointers on which 64-bit int math is very slow; for example they might not have a 64-bit multiply at all and have to emulate it. In that case you will want to use 32-bit ints for array access when possible; bleh)
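
As for SIZEOF_POINTER itself : you can't compute it in the preprocessor, but you can define it per-platform and then verify the define at compile time (CB_COMPILER_ASSERT standing in for whatever static assert your meta-language has) :


#if defined(_WIN64) || defined(__LP64__) || defined(__x86_64__)
#define SIZEOF_POINTER  8
#else
#define SIZEOF_POINTER  4
#endif

// sizeof() doesn't work in #if, but the define can be checked at compile time :
CB_COMPILER_ASSERT( SIZEOF_POINTER == sizeof(void *) );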

Avoid using "wchar_t" because it is not always the same size. Try to explicitly use UTF16 or UTF32 in your code. You could make your own SIZEOF_WCHAR and select one or the other on the appropriate platform. (really try to avoid using wchar at all; just use U16 or U32 and do your own UTF encoding).

One thing I would add to the macro meta-language next time is to wrap every single function (and class) in my code. That is, instead of :


int myfunc( int args );

do

FUNC1 int FUNC2 myfunc( int args );

or even better :

FUNC( int , myfunc , (int args) );

this gives you lots of power to add attributes and other munging as may be needed later on some platforms. If I was doing this again I would use the last style, and I would have two of them, a FUNC_PUBLIC and FUNC_PRIVATE to control linkage. Probably should have separate wrapper macros for the proto and the body.

While you're at it you may as well have a preamble in every func too :


FUNC_PUBLIC_BODY( int , myfunc , (int args) )
{
    FUNC_PUBLIC_PRE

    ...
}

which lets you add automatic func tracing, profiling, logging, and so on.
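
As an illustration, the wrappers might expand like this on one platform (PLATFORM_WIN32_DLL and FUNC_TRACE are hypothetical) :


#if PLATFORM_WIN32_DLL
#define FUNC_EXPORT __declspec(dllexport)
#else
#define FUNC_EXPORT
#endif

#define FUNC_PUBLIC_PROTO(ret,name,args)   FUNC_EXPORT ret name args ;
#define FUNC_PUBLIC_BODY(ret,name,args)    FUNC_EXPORT ret name args
#define FUNC_PRIVATE_BODY(ret,name,args)   static ret name args

#define FUNC_PUBLIC_PRE  FUNC_TRACE(__FUNCTION__);  // or profiling, logging, arg checks...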

I wish I had made several different layers of platform Id #defines. The first one you want is the lowest level, which explicitly Id's the current platform. These should be exclusive (no overlaps), something like OODLE_PLATFORM_X86X64_WIN32 or OODLE_PLATFORM_PS3_PPU.

Then I'd like another layer that's platform *groups*. For me the groups would probably be OODLE_PLATFORM_GROUP_PC , GROUP_CONSOLE, and GROUP_EMBEDDED. Those let you make gross characterizations like on "GROUP_PC" you use more memory and have more debug systems and such. With these mutually exclusive platform checks, you should never use an #else. That is, don't do :

#if OODLE_PLATFORM_X86X64_WIN32
.. some code ..
#else
.. fallback ..
#endif

it's much better to explicitly enumerate which platforms you want to go to which code block, and then have an
#else
#error new platform
#endif
at the end of every check. That way when you try building on new platforms that you haven't thought carefully about yet, you get nice compiler notification about all the places where you need to think "should it use this code path or should I write a new one". Fallbacks are evil! I hate fallbacks, give me errors.

Aside from the explicit platforms and groups I would have platform flags or caps which are non-mutually exclusive. Things like PLATFORM_FLAG_STDIN_CONSOLE.

While you want the raw platform checks, in end code I wish I had avoided using them explicitly, and instead converted them into logical queries about the platform. What I mean is, when you just have an "#if some platform" in the code, it doesn't make it clear why you care that's the platform, and it doesn't make it reusable. For example I have things like :

#if PLATFORM_X86X64
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif

what I should have done is to introduce an abstraction layer in the #if that makes it clear what I am checking for, like :

#if PLATFORM_X86X64
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 1
#elif PLATFORM_PS3
#define PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS 0
#else
#error classify me
#endif

#if PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS
// .. do string matching by U64 and xor/cntlz
#else
// unaligned U64 read may be slow
// do string match byte by byte
#endif

then it's really clear what you want to know and how to classify new platforms. It also lets you reuse that toggle in lots of places without code duping the fiddly bit, which is the platform classification.

Note that when doing this, it's best to make high level usage-specific switches. You might be tempted to try to use platform attributes there. Like instead of "PLATFORM_SWITCH_DO_STRING_MATCH_BIGWORDS" you might want to use "PLATFORM_SWITCH_UNALIGNED_READ_PENALTY" . But that's not actually what you want to know, you want to know if on my particular application (LZ string match) it's better to use big words or not, and that might not match the low level attribute of the CPU.

It's really tempting to skip all this and abuse the switches you can see (lord knows I do it); I see (and write) lots of code that does evil things like using "#ifdef _MSC_VER" to mean something totally different like "is this x86 or x64" ? Of course that screws you when you move to another x86 platform and you aren't detecting it correctly (or when you use MSVC to make PPC or ARM compiles).

Okay, that's all pretty standard, now for the new bit :

2. I would opaque out the system APIs in two levels. I haven't actually ever done this, so grains of salt, but I'm pretty convinced it's the right way to go after working with a more standard system.

(for the record : the standard way is to make a set of wrappers that tries to behave the same on all systems, eg. that tries to hide what system you are on as much as possible. Then if you need to do platform-specific stuff you would just include the platform system headers and talk to them directly. That's what I'm saying is not good.)

In the proposed alternative, the first level would just be a wrapper on the system APIs with minimal or no behavior change. That is, it's just passing them through and standardizing naming and behavior.

At this level you are doing a few things :

2.A. Hiding the system includes from the rest of your app. System includes are often in different places, and often turn on compiler flags in nasty ways. You want to remove that variation from the rest of your code so that your main codebase only sees your own wrapper header.

2.B. Standardizing naming. For example the MSVC POSIX funcs are all named wrong; at this level you can patch that all up.

2.C. Fixing things that are slightly different or don't work on various platforms where they really should be the same. For example things like pthreads are not actually all the same on all the pthreads platforms, and that can catch you out in nasty ways. (eg. things like sem_init always failing on Mac).

Note this is *not* trying to make non-POSIX platforms look like POSIX. It's not hiding the system you're on, just wrapping it in a standard way.

2.D. I would also go ahead and add my own asserts for args and returns in this layer, because I hate functions that just return error codes when there's a catastrophic failure like a null arg or an EHEAPCORRUPT or whatever.

So once you have this wrapper you no longer call any system funcs directly from your main codebase, but you still would be doing things like :


#if PLATFORM_WIN32

    HANDLE h = platform_CreateFile( ... )

#elif PLATFORM_POSIX

    int fd = platform_open( name , flags )

#else
    #error unknown platform
#endif

that is, you're not hiding what platform you're on, you're still letting the larger codebase get to the low level calls, it's just the mess of how fucked they are that's hidden a bit.

3. You then have a second level of wrapping which tries to make same-action interfaces that dont require you to know what platform you're on. Second level is written on the first level.

The second level wrappers should be as high level as necessary to opaque out the operation. For example rather than having "make temp file name" and "open file" you might have "open file with temp name", because on some platforms that can be more efficient when you know it is a high-level combined op. You don't just have "GetTime" you have "GetTimeMonotonic" , because on some platforms they have an efficient monotonic clock for you, and on other platforms/hardwares you may have to do a lot of complicated work to ensure a reliable clock (that you don't want to do in the low level timer).

When a platform can't provide a high-level function efficiently, rather than emulate it in a complex way I'd rather just not have it - not a stub that fails, but no definition at all. That way I get a compile error and in those spots I can do something different, using the level 1 APIs.

The first level wrappers are very independent of the large code base's usage, but the second level wrappers are very much specifically designed for their usage.

To be clear about the problem of making platform-hiding second layer wrappers, consider something like OpenFile(). What are the args to that? What can it do? It's hopeless to make something that works on all platforms without greatly reducing the capabilities of some platforms. And the meaning of various options (like async, temporary, buffered, etc.) all changes with platform.

If you wanted to really make a general purpose multi-platform OpenFile you would have to use some kind of "caps" query system, where you first do something like OpenFile_QueryCaps( OF_DOES_UNBUFFERED_MEAN_ALIGNMENT_IS_REQUIRED ) and it would be an ugly disaster. (and it's retarded on the face of it, because really what you're doing there is saying "is this win32" ?). The alternative to the crazy caps system is to just make the high level wrappers very limited and specific to your usage. So you could make a platform-agnostic wrapper like OpenFile_ForReadShared_StandardFlagsAndPermissions(). Then the platforms can all do slightly different things and satisfy the high level goal of the imperative in the best way for that platform.

A good second level has as few functions as possible, and they are as high level as possible. Making them very high level allows you to do different compound ops on the platform in a way that's hidden from the larger codebase.


03-10-13 | Two LZ Notes

Note 1 : on rep matches.

"Rep matches" are a little weird. They help a lot, but the reason why they help depends on the file you are compressing. (rep match = repeat match, gap match, aka "last offset")

On text files, they work as interrupted matches, or "gap matches". They let you generate something like :


stand on the floor
stand in the door

stand in the door
[stand ][i][n the ][d][oor]

[off 19, len 6][1 lit][rep len 6][1 lit][off 18, len 3]

that is, you have a long match of [stand on the ] but with a gap at the 'o'.

Now, something I observed was that more than one last offset continues to help. On text the main benefit from having two last offsets is that it lets you use a match for the gap. When the gap is not just one character but a word, you might want to use a match to put that word in, in which case the continuation after the gap is no longer the first last-offset, it's the second one. eg.


cope
how to work with animals
how to cope with animals

[how to ][cope][ with animals]
[off 25 ][off 32][off 25 (rep2)]

You could imagine alternative coding structures that don't require keeping some number of "last offsets". (oddly, the last offset maintenance can be a large part of decode time, because maintaining an MTF list is something that CPUs do incredibly poorly). For example you could code with a scheme where you just send the entire long match, and then any time you send a long match you send a flag for "are there any gaps", and if so you then code some gaps inside the match.

The funny thing is, on binary files "last offsets" do something else which can be more important. They become the most common offsets. In particular, on highly structured binary data, they will generally be some factor of the structure size. eg. on a file that has a struct size of 36, and that struct has dwords and such in it, the last offsets will generally be things like 4,8,16,36, or 72. They provide a sort of dictionary of the most common offsets so that you can code those smaller. You are still getting the gap-match effect, but the common-offset benefit is much bigger on these files.

(aside : word-replacing transform on text really helps LZ (and everything) by removing the length variance of tokens. In particular for LZ77, word length variance breaks rep matches. There are lots of common occurrences of a single replaced word in a phrase, like : "I want some stuff" -> "I want the stuff". You can't get a rep match here of [ stuff] because the offset changed because the substituted word was different length. If you do WRT first, then gap matches get these.)

Note 2 : on offset structure.

I've had it in the back of my head for quite some time now to do an LZ compressor specifically designed for structured data.

One idea I had was to use "2d" match offsets. That is, send a {dx,dy} where dx is within the record and dy is different records. Like imagine the data is in a table, dy is going back rows, dx is an offset on the row. You probably want to mod dx around the row so its range is always the same, and special case dy=0 (matches within your own record).

It occurred to me that the standard way of sending LZ offsets these days actually already does this. The normal way that good LZ's send offsets these days is to break it into low and high parts :

low = offset & 0x7F;
high = offset >> 7;

or similar, then you send "high" using some kind of "NoSB" scheme (Number of Significant Bits is entropy coded, and the bits themselves are sent raw), and you send "low" with an order-0 entropy coder.
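
A sketch of that encoder side, with hypothetical helper primitives (bit_length, entropy_encode, put_raw_bits) :


void send_offset(U32 offset)
{
    U32 low  = offset & 0x7F;
    U32 high = offset >> 7;

    // NoSB : entropy code the number of significant bits of "high",
    // then send the bits below the implicit top 1 bit raw :
    int nbits = bit_length(high); // 0 when high == 0
    entropy_encode(nosb_model, nbits);
    if ( nbits > 1 )
        put_raw_bits(high & ((1u<<(nbits-1))-1), nbits-1);

    entropy_encode(low_model, low); // order-0 model on the low 7 bits
}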

But this is just a 2d structured record offset for a particular power-of-2 record size. It's why when I've experimented with 2d offsets I haven't seen huge wins - because I'm already doing it.

There is some win to be had from custom 2d-offsets (vs. the standard low/high bits scheme) when the record size is not a power of two.


03-06-13 | Sympathy for the Library Writer

Over the years of being a coder who was a library-consumer and not a library-writer, I've done my share of griping about annoying API's or what I saw as pointless complication or inefficiency. Man, I've been humbled by my own experience trying to write a public library. It is *hard*.

The big problem with libraries is that you don't control how they're used. This is in contrast to game engines. Game engines are not libraries. I've worked on many game engines over the years, including ones that went out to large free user bases (Genesis 3d and Wild Tangent), and they are much much easier than libraries.

The difference is that game engines generally impose an architecture on the user. They force you to use it in a certain way. (this is of course why more advanced developers despise them so much; it sucks to have some 3rd party telling you your code architecture). It's totally acceptable if a game engine only works well when you use it in the approved way, and is really slow if you abuse it, or it could even crash if you use it oddly.

A library has to be flexible about how it's used; it can't impose a system on the user, like a certain threading model, or a certain memory management model, or even an error-handling style.

Personally when I do IO for games, I make a "tool path" that just uses stdio and is very simple and flexible, does streaming IO and text parsing and so on, but isn't shipped with the game, and I make a "game path" that only does large-block async IO that's pre-baked so you can just point at it. I find that system is powerful enough for my use, it's easy to write and use. It means that the "tool path" doesn't have to be particularly fast, and the fast game path doesn't need to support buffered character IO or anything other than big block reads.

But I can't force that model on clients, so I have to support all the permutations and I have to make them all decently fast.

A lot of times in the past I've complained about over-complicated APIs that have tons of crazy options that nobody ever needs (look at the IJG jpeg code for example). Well, now I see that often those complicated APIs were made because somebody (probably somebody important) needed those options. Of course as the library provider you can offer the complex interface and also simpler alternatives, but that has its own pitfalls of making the API bigger and more redundant (like if you offer OpenFileSimple and OpenFileComplex); in some ways it's better to only offer the complex API and make the user wrap it and reduce the parameter set to what they actually use.

There's also a sort of "liability" issue in libraries. Not exactly legal liability, but program bad behavior liability. Lots of things that would make the library easier to use and faster are naughty to do automatically. For example Oodle under Vista+ can run faster with elevated privilege, to get access to some of the unsecure file APIs (like extending without zeroing), but it would be naughty for me to do that automatically, so instead I have to add an extra step to make the client specifically ask for that.

Optimization for me has really become a nightmare. At first I was trying to make every function fast, but it's impossible, there are just too many entry points and too many usage patterns. Now my philosophy is to make certain core functions fast, and then address problems in the bigger high level API as customers see issues. I remember as a game developer always being so pissed that all the GL drivers were specially optimized for Id. I would want to use the API in a slightly different style, and my way would be super slow, not for any good reason but just because it hadn't gotten the optimization loving of the important customer's use case.

I used to also rail about the "unnecessary" argument checking that all the 3d APIs do. It massively slows them down, and I would complain that I had ensured the arguments were good so just fucking pass them through, stop slowing me down with all your validation! But now I see that if you really do that, you will just constantly be crashing people as they pass in broken args. In fact arg validation is often the way that people figure out the API, either because they don't read the docs or because the docs are no good.

(this is not even getting into the issue of API design which is another area where I have been suitably humbled)

ADDENDUM : I guess I should mention the really obvious points that I didn't make.

1. One of the things that makes a public library so hard after release is that you can't refactor. The normal way I make APIs for myself (and for internal teams) is to sort of make an effort at a good API the first time, but it usually sucks, and you rip it out and go through big scourges of find-rep. That only works when you control all the code, the library and the consumer. It's only after several iterations that the API becomes really nice (and even then it's only nice for that specific use case, it might still suck in the wild).

2. APIs without users almost always suck. When someone goes away in a cave and works on a big new fancy library and then shows it to the world, it's probably terrible. This is a problem that I think everyone at RAD faces. The code of mine that I really like is stuff that I use over and over, so that I see the flaws and when I want it to be easier to use I just go fix it.

3. There are two separate issues about what makes an API "good". One is "is it good for the user?" and one is "is it good for the library maintainer?". Often they are the same but not always.

Anyway, the main point of this post is supposed to be : the next time you complain about a bad library design, there may well be valid reasons why it is the way it is; they have to balance a lot of competing goals. And even if they got it wrong, hey it's hard.


03-05-13 | Obama

Sometimes when I see Obama making a speech (eg. recently on the sequester, and before on gun control, and health care), it strikes me that he's addressing the country and the opposing legislature as if he can convince them through logic and reasoning and good discussion of the issues.

I think maybe Obama just doesn't understand politics. Perhaps because of his youth and lack of experience in serious elected office, he seems to think he can just make a good speech to the public and the legislature will somehow see the light and bow to his finely reasoned and rationally based argument. LOL, silly Obama. The only way to actually accomplish anything progressive is through strong-arm backroom deals and dirty tricks (see eg. LBJ and FDR). You can't just beat crooks like the NRA and AMA by being *right* ; the moral high ground or rational correctness never got anyone anywhere.

Either that or he's super clever and knows that none of his stuff will ever pass, and he's just trying to make a show to look a bit progressive while intentionally only succeeding on the very pro-big-business measures.


Sometimes when I see a bit of Fox News or some Tea Party demonstration or whatever, I imagine Mr. Burns is standing just off screen whispering "and what about the taxes" or "it's big government that's doing this to you!". Can't you see that these talking points have been written by think tanks and your angry mob is just a puppet in their game?


03-05-13 | Immigration Reform

This is something that everyone with a clue has been saying forever, and I'm quite sure it's not going to happen, and it's really too late to take advantage of our huge lead, but anyway :

1. Open up immigration for anyone with a PhD , MD , etc.. Not just visas - give them citizenships. You want them to stay and make their business here.

(not really on topic, but if the AMA wasn't such a bunch of fuckers we would have a super-fast-track for MD's in other countries to get a US MD)

2. Instant citizenship for any immigrant who goes to a US PhD program and graduates.

(* obviously some difficulty here because colleges would pop up just to take money and make citizens, so there has to be some control on this)

Anybody who's gone to an American science PhD program knows just how completely insane our policy is and how much amazing human talent we are letting slip away. It's fucking retarded that so many Indian and Chinese and Russian (and others) scientists come to America, get PhDs, and leave because they can't stay. Now, as I said it's already too late, and our small-mindedness and intransigence has already fucked us, because they now have decent tech economies to go home to. If we'd done this 10 years ago we could have been the tech leader of the world for a long time.

3. No limit on H1B visas. Fast track to citizenship for H1B workers. Certainly anyone who works in software knows how stupid this is. Don't let American tech companies hire the best workers in the world, and then even when we do get to hire them, don't let them stay and become assimilated US citizens. Good system guys.


03-01-13 | Finance and Realism

interfluidity is pretty great. It's like if I wrote about economics and actually knew what I was talking about.


Now some ranting.

The financial system of the world is in a very sick state. Since the Great Recession, things have actually gotten worse. Zero reforms have been passed to prevent instability and counter-humanist actions by the big banks. What's worse is that there has been even *more* consolidation, so the reins of power are now in the hands of the very few, and government is almost helpless against them. Furthermore, Europe's troubles have provided a great opportunity for the big banks to redesign the finances of many european countries in their favor.

There *will* be another major collapse similar to the housing crash. I have no idea how long it will take or how it will happen exactly, but with the current economic climate it's inevitable. When government wants nothing but growth and has no stomach for regulation, the inevitable result is bubble and crash.

When there are crises, there are great opportunities for change. How have they been used?


LTCM -> should have been a wake up call about the danger of huge leverage and computer trading.  Nothing done.

Dot com bubble -> chance to break investment banks from financial advisors, make televised stock pumping illegal, etc.  Nothing done.

9/11 -> used to strip Americans of civil liberties and provide justification for the Cheney/Rumsfeld project in Iraq.

Obama election -> great opportunity to restore some transparency, rule of law, and civil rights.  Opportunity not taken.

Great Recession -> chance to restore bank separation and generally improve regulation of the financial system.  Opportunity not taken.

Collapse of Iceland, Ireland, Italy, Spain, etc. -> Goldman is there literally writing the conditions of the bailouts.

Reduced revenue for all levels of US government -> small government schemers use it to tighten the one way ratchet.

the forces of evil are far more clever about using crises. Of course they are, they have huge advantages when it comes to acting decisively in a moment of crisis (lack of morals, political power, unified organization, etc.).

The entire productive world economy is now functioning to subsidize the financial sector (and brand owners & patent holders). We let this happen partly because we are all fools, and partly because they control the government, so the system is designed to make it that way.

It's absurd to think of capitalism as any kind of fair rational playing field; if capitalism was left on its own unfettered, it would very quickly become a world oligopoly, in which a few players controlled everything. (in a game design sense, capitalism is a game which is badly afflicted by "runaways" ; once one player starts winning, they become even more powerful, and soon the other players have no hope other than a massive blunder). The only way to make capitalism work decently is with a robust regulatory structure which crafts the system so that it is okay for workers and consumers. A capitalism economy is a lot like a game system, the way it plays is a direct result of the rules. We are sculptors of our own capitalism environment; it is a human political creation.

Anyway, ranting about it is pointless of course and I gave up on all that long ago, because it won't get better. Money wins. A rational realist only has one option : if the system is going to stay this way you have to work in finance (or try to get some bullshit patent so you can sell your tech company). If you don't choose to work in finance, you are choosing to subsidize people who work in finance, which is a silly thing to do.

Long ago I decided it was dumb to be angry about the way the world works, and it was pointless to try to change it. So when you identify something, the only rational thing to do is to use it for your benefit. But I can't bring myself to actually do that for some reason.

Sometimes I wonder if the whole idea of being "moral" is a trick that was used to brainwash us into being good obedient and controllable pawns. It's largely the churches that created this idea of it being so great to live a quiet life of moral goodness, when at the same time the churches themselves were in immoral ruthless power grabs. It's like one of the monkeys in the tribe convinces everyone that you shouldn't hit other monkeys when they steal your food, and then he proceeds to steal all your food. Very clever, monkey, very clever.

(Obviously those in power use the "moral" argument in a transparent and disgusting way to suppress opposition; like claiming that government workers who speak out about the evils of government are "traitors" or that questioning the merits of going to war is cowardly or unpatriotic, or that regulating the financial sector in a recession is "irresponsible". Of course the way kids are taught to "respect authority" and such is just to keep them in line. Of course the entire public school system is a way of converting individuals into mindless worker drones. But those specific things are really what I'm talking about in the above paragraph).


03-01-13 | Zopfli

zopfli seems to make small zips. There's no description of the algorithm so I can't comment on it. But hey, if you want small zips it seems to be the current leader.

(update : I've had a little look, and it seems to be pretty straightforward, it's an optimal parser + huff reset searcher. There are various other prior tools to do this (kzip,deflopt,defluff,etc). It's missing some of the things that I've written about before here, such as methods of dealing with the huff-parse feedback; the code looks pretty clean, so if you want a good zip-encoder code it looks like a good place to start.)

I've written these things before, but I will summarize here how to make small zips :

1. Use an exact (windowed) string matcher.

cbloom rants 09-24-12 - LZ String Matcher Decision Tree

2. Optimal parser. Optimal parsing zip is super easy because it has no "repeat match", so you can use plain old backwards scan. You do have the huffman code costs, so you have to consider at least one match candidate for each codeword length.

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 09-04-12 - LZ4 Optimal Parse

3. Deep huffman reset search. You can do this pretty easily by using some heuristics to set candidate points and then building a bottom-up tree. Zopfli seems to use a top-down greedy search. More huffman resets makes decode slower, so a good encoder should expose some kind space-speed tradeoff parameter (and/or a maximum number of resets).

cbloom rants 06-15-10 - Top down - Bottom up
cbloom rants 10-02-12 - Small note on Adaptive vs Static Modeling

4. Multi-parse. The optimal parser needs to be seeded in some way, with either initial code costs or some kind of heuristic parse. There may be multiple local minima, so the right way to do it is to run 4 seeds (or so) simultaneously with different strategies.

cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies

5. The only unsolved bit : huffman - parse feedback. The only solution I know to this is iteration. You should use some tricks like smoothing and better handling of the zero-frequency symbols, but it's just heuristics and iteration.


One cool thing to have would be a cheap way to compute incremental huffman cost.

That is, say you have some array of symbols. The symbols have a corresponding histogram and huffman code. The full huffman cost is :

fullcost(symbol set) = cost{ transmit code lengths } + sum[n] { codelen[n] * count[n] }

that is, the cost to send the code lengths + the cost of sending all the symbols with those code lengths.
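
In code the non-incremental version is just this (cost_of_code_lengths being however you transmit the lengths, eg. RLE + entropy coded as in deflate) :


U32 fullcost(const U32 * count, const U32 * codelen, int num_syms)
{
    U32 bits = cost_of_code_lengths(codelen, num_syms); // header cost
    for (int s = 0; s < num_syms; s++)
        bits += codelen[s] * count[s]; // payload cost
    return bits;
}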

You'd like to be able to do an incremental update of fullcost. That is, if you add one more symbol to the set, what is the delta of fullcost ?

*if* the huffman code lengths don't change, then the delta is just +codelen[symbol].

But, the addition of the symbol might change the code lengths, which causes fullcost to change in several ways.

I'm not sure if there's some clever fast way to do incremental updates; like when adding the symbol pushes you over the threshold to change the huffman tree, it often only changes some small local part of the tree, so you don't have to re-sum your whole histogram, just the changed part. Then you could slide your partition point across an array and find the optimal point quite quickly.


Now some ranting.

How sad is it that we're still using zip?

I've been thinking about writing my own super-zipper for many years, but I always stop myself because WTF is the point? I don't mean for the world, I guess I see that it is useful for some people, but it does nothing for *me*. Hey I could write some thing and probably no one would use it and I wouldn't get any reward from it and it would just be another depressing waste of some great effort like so many other things in my life.

It's weird to me that the best code in the world tends to be the type of code that's given away for free. The little nuggets of pure genius, the code that really has something special in it - that tends to be the free code. I'm thinking of compressors, hashers, data structures, the real gems. Now, I'm all for free code and sharing knowledge and so on, but it's not equal. We (the producers of those gems) are getting fucked on the deal. Apple and the financial service industry are gouging me in every possible immoral way, and I'm giving away the best work of my life for nothing. It's a sucker move, but it's too late. The only sensible play in a realpolitik sense of your own life optimization is to not work in algorithms.

Obviously anyone who claims that patents provide money to inventors is either a liar (Myhrvold etc.) or just has no familiarity with actual engineering. I often think about LZ77 as a case in point. The people who made money off LZ77 patents were PK and Stac, both of whom contributed *absolutely nothing*. Their variants were completely trivial obvious extensions of the original idea. Of course the real inventors (L&Z, and the modern variant is really due to S&S) didn't patent and got nothing. Same thing with GIF and LZW, etc. etc. perhaps v42 goes in there somewhere too; not a single one of the compression-patent money makers was an innovator. (and this is even ignoring the larger anti-patent argument, which is that things like LZ77 would have been obvious to any number of researchers in the field at the time; it's almost always impossible to attribute scientific invention/discovery to an individual)


02-23-13 | Threading APIs that would be ideal

If you were writing an OS from scratch right now, what low level threading primitives should you provide? I contend they are rather different than the norm.

1. A low-level keyed event with double-checked wait.

Futex and NT's keyed event are both pretty great, but the ideal low level wait should be double-checked. I believe it should be something like :


HANDLE Waitset;

Waitset CreateWaitset();
DestroyWaitset(Waitset ws);

HANDLE wait_handle = Waitset_PrepareWait( Waitset ws , U64 key );

Waitset_CancelWait( Waitset ws , wait_handle h );
Waitset_Wait( Waitset ws , wait_handle h );

Waitset_Signal( Waitset ws, U64 key );

Now, key of course could be a pointer, but there's no reason it has to be. This is easily a superset of futex; if you want you could just have one global Waitset object, and key could be an int pointer, and you could check *ptr in between PrepareWait and Wait; that would give you futex. But you can do much more with this.

I prefer having a "waitset" object to put the waits on (like KeyedEvent), not just making it global/static (like futex). The advantage is 1. efficiency and 2. multiple meanings for a single "key". It's more efficient because you can have different waitsets for different uses, which makes each one cover fewer waits, which makes all the lookups faster. (that is, rather than 100 global waits pending, maybe you have 10 on 10 different waitsets). The other advantage is that you can reuse the same value for key without it confusing the system. You could have one Waitset where key is a pointer, and another where key is an internal handle number, etc.
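
For illustration, here's how the futex-style double-checked wait looks on top of this proposed API (my sketch; the atomic flag standing in for the predicate is invented) :

    #include <atomic>

    Waitset g_waitset = CreateWaitset();

    void wait_until_set( std::atomic<int> * pFlag )
    {
        for(;;)
        {
            if ( pFlag->load() ) return;

            wait_handle h = Waitset_PrepareWait( g_waitset, (U64) pFlag );

            // double check : the flag may have been set between the first
            // check and our registration in the waitset
            if ( pFlag->load() )
            {
                Waitset_CancelWait( g_waitset, h );
                return;
            }

            Waitset_Wait( g_waitset, h );
        }
    }

    void set_and_signal( std::atomic<int> * pFlag )
    {
        pFlag->store( 1 );
        Waitset_Signal( g_waitset, (U64) pFlag );
    }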

2. A proper cond_var with waker-side condition checking.

First of all, a decent cond_var API combines a lot of the disjoint junk in the posix API. It should include the mutex, because that allows for vastly more efficient implementation :


    class condition_var
    {
    public:
        void lock();
        void unlock();
    
        // the below are always called with lock held :

        void unlock_wait_lock();
        
        void signal_unlock();
        void broadcast_unlock();

    private:
        ...
    };

The basic usage of this cv is like :

    cv.lock();

    while( ! condition )
    {
        cv.unlock_wait_lock();
    }

    .. do stuff with condition true ..

    cv.unlock();

A good implementation should do the compound ops (signal_unlock, etc) atomically. But I wouldn't require that because it's too hard.

But that's just background. What you really want is to put the condition check in the API. It should be :


        void wait_lock( [] { wake condition } );

The spec of the API is that "wake condition" is some code that will be run with the mutex locked, and when the function exits you will own the mutex and the condition is true. Then client usage is like :

    cv.wait_lock( condition );

    .. do stuff with condition true ..

    cv.unlock();

which allows for much more efficient implementation. The wake condition of the waiter list can be evaluated easily inside signal_unlock(), because that's always called with the mutex held.
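
For reference, the trivial conforming implementation is just the wakee-side loop from above wrapped up (a sketch; the interesting implementation instead stores the predicates on the waiter list and evaluates them inside signal_unlock) :

    template <typename Pred>
    void condition_var::wait_lock( Pred condition )
    {
        lock();
        while ( ! condition() )     // predicate always evaluated with mutex held
        {
            unlock_wait_lock();
        }
        // returns with the mutex locked and condition() true
    }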


02-23-13 | Threading Patterns - Wake Polling

Something I've written about a lot but never given a solid name to.

When a thread is waiting on some condition, your goal should be to only wake it up if that condition is actually true - that is, the thread really gets to run. In reverse order of badness :

1. Wakeup condition polling. This is the worst and is very common. You're essentially just using the thread wakeup to say "hey your condition *might* be set, check it yourself". The suspect code looks something like :


while( ! condition )
{
    Wait(event);
}

these threads can waste a ton of cycles just waking up, checking their condition, then going back to sleep.

One of the common ways to get nasty wake-polling is when you are trying to just wake one thread, but you have to do a broadcast due to the possibility of a missed wakeup (as in the naive semaphore from waitset ).

Of course any usage of cond_var is a wake-poll loop. I really don't like cond_var as an API or a programming pattern. It encourages you to write wakee-side condition checks. Whenever possible, waker-side condition checks are better. (See previous notes on cond vars such as : In general, you should prefer to use the CV to mean "this condition is set", not "hey wakeup and check your condition").

(ADDENDUM : in fact I dislike cond_var so much I wrote a proposal on an alternative cond_var api ).

Now it's worth breaking this into two sub-categories :

1.A. Wake-polling when it is extremely likely that you get to run immediately.

This is super standard and is not that bad. At root, what's happening here is that under normal conditions, the wakeup means the condition is true and you get to run. The loop is only needed to catch the race where someone stole your wakeup.

For example, the way Linux implements semaphore on futex is a classic wake-poll. The core loop is :


    for(;;)
    {
        if ( try_wait() ) break;

        futex_wait( & sem->value, 0 ); // wait if sem value == 0
    }

If there's no contention, you wake from the wait and get to try_wait (dec the count) and proceed. The only time you have to loop is if someone else raced in and dec'ed the count before you. (see also in that same post a discussion of why you actually *want* that race to happen, for performance reasons).
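
(for completeness, try_wait is just a lock-free dec-if-positive; something like this, with invented names :)

    #include <atomic>

    struct futex_sem { std::atomic<int> value; };

    bool try_wait( futex_sem * sem )
    {
        int v = sem->value.load();
        while ( v > 0 )
        {
            // on CAS failure v is reloaded, so just retry :
            if ( sem->value.compare_exchange_weak( v, v-1 ) )
                return true;    // got a count, no wait needed
        }
        return false;   // count was zero ; caller falls through to futex_wait
    }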

The reason this is okay is because the futex semaphore only has to do a wake 1 when it signals. If it had to do a broadcast, this would be a bad loop. (and note that the reason it can get away with just a wake 1 is due to the special nature of the futex wait, which ensures that the single-thread signal actually goes to someone who needs it!) (see : cbloom rants 08-01-11 - Double checked wait ).

1.B. Wake-polling when it is unlikely that you get to run.

This is the really bad one.

As I've noted previously ( cbloom rants 07-26-11 - Implementing Event WFMO ) this is a common way for people to implement WFMO. The crap implementation basically looks like this :


while ( any events in array[] not set )
{
    wait on an unset event in array[]
}

What this does is : any time one of the events in the set triggers, it wakes up all the waiters that have that event in their array; each one checks its array, finds it incomplete, and goes back to sleep.

Obviously this is terrible, it causes bad "thread thrashing" - tons of wakeups and immediate sleeps just to get one thread to eventually run.

2. "Direct Handoff" - minimal wakes. This is the ideal; you only wake a thread when you absolutely know it gets to run.

When only a single thread is waiting on the condition, this is pretty easy, because there's no issue of "stolen wakeups". With multiple threads waiting, this can be tricky.

The only way to really robustly ensure that you have direct handoff is by making the wakeup ensure the condition.

At the low level, you want threading primitives that don't give you unnecessary wakeups. eg. we don't like the pthreads cond_var that has you call :

    condvar.wait();
    mutex.lock();

as two separate calls, which means you can wake from the condvar and immediately fail to get the mutex and go back to sleep. Prefer a single call :

    condvar.wait_then_lock(mutex);

which only wakes you when you get a cv signal *and* can acquire the mutex.

At the high level, the main thing you should be doing is *waker* side checks.

eg. to do a good WFMO you should be checking for all-events-set on the *waker* side. To do this you must create a proxy event for the set when you enter the wait, register all the events on the proxy, and then you only signal the proxy when they are all set. When one of them is set, it does the checking. That is, the checking is moved to the signaller. The advantage is that the signalling thread is already running.
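
A sketch of that proxy idea, with all names invented ("OsEvent" stands in for any single-wake primitive, and RegisterProxy is assumed to call NotifyOneSet immediately for events that are already set) :

    #include <atomic>

    struct WaitAllProxy
    {
        std::atomic<int> remaining;
        OsEvent done;

        // called by the signalling thread when one member event becomes set :
        void NotifyOneSet()
        {
            if ( remaining.fetch_sub(1) == 1 )
                done.Signal();      // last one in : exactly one wake
        }
    };

    void WaitForAll( Event ** events, int count )
    {
        WaitAllProxy proxy;
        proxy.remaining.store( count );
        for ( int i=0; i < count; i++ )
            events[i]->RegisterProxy( &proxy );
        proxy.done.Wait();      // woken once, only when all events are set
    }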


02-23-13 | Threading - Reasoning Behind Coroutine Centric Design

cbloom rants 12-21-12 - Coroutine-centric Architecture is a proposed architecture.

Why do I think it should be that way? Let's revisit some points.

1. Main thread should be a worker and all workers should be symmetric. That is, there's only one type of thread - worker threads, and all functions are work items. There are no special-purpose threads.

The purpose of this is to minimize thread switches, and to make waits be immediate runs whenever possible.

Consider the alternative. Say you have a classic "main" thread and a worker thread. Your main thread is running along and then decides it has to Wait() on a work item. It has to sleep the thread pending a notification from the worker thread. The OS has to switch to the worker, run the job, notify, then switch you back.

With fully symmetric threads, there is no actual thread wait there. If the work item is not started, or is in a yield point of a coroutine, you simply pop it and run it immediately. (of course your main() also has to be a coroutine, so that it can be yielded out at that spot to run the work item). Symmetric threads = less thread switching.

There are other advantages. One is that you're less affected by the OS starving one of your threads. When your threads are not symmetric, if one is starved (and is the bottleneck) it can ruin your throughput; one crucial job or IO can't run and then all the other threads back up. With symmetric threads, someone else grabs that job and off you go.

Symmetric threads are self-balancing. Any time you decide "we have 2 threads for graphics and 1 for compute" you are assuming you know your load exactly, and you can only be wrong. Symmetric threads maximize the utilization of the cpu. (Note that for cache coherence you might want to have a system that *prefers* to keep the same type of job on the same thread, but that's only a soft preference and it will run other jobs if there are none of the right type).

Symmetric threads scale cleanly down to 1. This is a big one that I think is important. Even just for debugging purposes, you want to be able to run your system non-threaded. With asymmetric threads you have to have a custom "non-threaded" pathway, which leads to bugs and means you aren't testing the same threaded pathway. The symmetric thread system scales down to 1 thread using the same code as always - when you wait on a job, if it hasn't been started it's just run immediately.

It's also much easier to have deadlocks in asymmetric thread systems. If an IO job waits on a graphics job, and a graphics job waits on an IO job, you're in a tricky situation; of course you shouldn't deadlock as long as there are no circular dependencies, but if one of those threads is processing in FIFO order you can get in trouble. It's just better to have a system where that issue doesn't even arise.

2. Deep yield.

Obviously if you want to write real software, you can't be returning out to the root level of the coroutine every time you want to yield.

In the full coroutine-centric architecture, all the OS waits (mutex locks, etc) should be coroutine yields. The only way to do that is if they can call yield() internally and it's a full stack-saving deep yield.

Of course you should be able to spawn more coroutines from inside your coroutine, and wait on them (with that wait being a yield). That is, aside from the outer branch-merge, each internal operation should be able to do its own branch-merge, and yield its thread to its sub-items.

3. Everything GC.

This is just the only reasonable way to write code in this system. It gives you a race-free way to ensure that object lifetimes exceed their usage.

The simple string crash from the last post is just such an easy mistake to make. The problem is that without GC you inevitably try to be "clever" and "efficient" (really "dumb" and "pointless") about your lifetime management. That is, you'll write things like :


void func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

Wait(h);
}

which is okay, because it waits on the async op inside the lifetime of "name". But of course a week later you change this function to :

Handle func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

return h;
}

with the wait done externally, and now it's a crash. Manual lifetime management in heavily-threaded code is just not reasonable.

The other compelling reason is that you want to be able to have "dangling" coroutines, that is you don't want to have to wait on them and clean them up on the outside, just fire them off and they clean themselves up when they finish. This requires some kind of ref-counted or GC'ed ownership of all objects.

4. A thread per core.

With all your "threading" as coroutines and all your waits as "yields", you no longer need threads to share the cpu time, so you just make one thread per core and leave it there.

I wanted to note an exception to this - OS signals that cannot be converted to yields, such as IO. In this case you still need to do a true OS Wait that would block a thread. This would stop your entire worker from running, so that's not nice.

The solution is to have a separate pool of threads for running the small set of OS functions that do internal thread waits. That is, you convert :


ReadFile( args )

->

yield RunOnThreadPool( ReadFile, args );

this separate pool of threads is in addition to the one per core (or it could just be all one pool, and you make new threads as needed to ensure that #cores of them are running).
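
That is, something like this (all names invented) :

    // wrap the blocking OS call so the calling coroutine yields while a
    // blocking-pool thread eats the OS wait :
    Handle ReadFile_Job( const ReadFileArgs & args )
    {
        return BlockingPool_Run( [=]{ ReadFile( args ); } );
    }

    // in a coroutine :
    //   Handle h = ReadFile_Job( args );
    //   yield_until_done( h );   // this worker is free to run other work items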


02-18-13 | Don't write spaghetti

It never works out. I usually even warn myself about it (by writing comments to myself), but it still catches me out. I also usually give myself a todo item like "hmm this smells funny, revisit it", but of course the todos just pile up in a never-ending heap, and little old ones like that get buried.


void DoLZDecompress(const char *filename,...)
{
    struct CommandInfo i;
    i.data = (void *)filename;
    // warning : passing string pointer (not copying) to another thread, make sure it's const / sticks around!
    StartJob( &i );
}

Yup, that's a crash.


void OodleIOQ_SetKickImmediate( bool kick );
/* kick state is global ; hmm should really be per-thread ; makes it a race
*/

Yup, that's a problem, which leads to the later deadlock :

void Oodle_Wait( Handle h )
{
    // @@ ? can this handle depend on un-kicked items, and hence never complete ?
    //  I used to check for that with normal deps but it's hard now with the "permits"
    ...
}

Coding crime doesn't pay. Spaghetti always gets you in the end, with its buggy staining sauce.

Whenever I have one of those "hmm this smells funny, I'm worried about the robustness of this", yep, it's a problem.

One of my mortal enemies is the "don't worry about it, it'll be fine" people. No it will fucking not be fine. You know what will happen? It'll be a nasty god damn race bug, which I will wind up fixing while the "don't worry about it, it'll be fine" guy is watching lolcatz or browsing facebook.


02-18-13 | The Myth of Future Value

I believe that most people (aka me) grossly overvalue future rewards when weighing the merits of various options.

I've been thinking about this a lot over the last few days, and have come to it simultaneously from several different angles.

For the past month or so I've been going over my finances, reviewing my spending, because I'm not happy with the amount I'm saving, and I'm trying to figure out where the money is leaking. Obviously there are big expenses like cars and vacations, but those I've budgeted for, they're not the leak (*) (**), but there's still a large general money leakage that I want to track down. It turns out a lot of it is buying stuff for the house or for productivity, stuff that on its own I can justify, but overall adds up to a big waste.

A lot of that waste are things that I tell myself will "pay off someday". Like I need some rope for around the house; hey look it's a much better deal if I buy it in a 500 foot spool. I'll use it eventually so that's the better buy. Or, I need to set a bolt in concrete; sure a hammer drill is expensive, but I'll use it the rest of my life, so it will be a good value some day (better than renting one for this one job). etc. Lots of stuff where the idea is that in the long term it will be a good value.

Now I certainly haven't hit the "long term" yet, but I can already see the flaw in that logic. There are lots of reasons why that imagined long term value might never come. I might never wind up using the stuff. It might get damaged over time from sitting, or flood or who knows what. I also essentially pay a tax to store it, having stuff is not free. I pay a tax on it any time I move. Maybe I won't want to do DIY in the future and will just hire out the jobs and so won't use it. There are a lot of costs and uncertainty about this future value which make it much less valuable than it naively appears.

Perhaps computer stuff is an even easier example; like I sort of need a USB hub; I could live without it and just unplug stuff to make room depending on what device I want to use at the moment. You could easily convince yourself that the value is high because "even if I don't really need it now I'll use it someday". But of course there's any number of reasons why you might not use it some day.

(* = aside : expensive cars actually aren't that expensive; if you're careful about how you buy and sell, and look for cars that are on a pretty flat part of the depreciation curve, you can get a "$100k car" that actually only costs $5k a year. That's not really a big expense in the scheme of things. However that also doesn't mean it's free; the big cost is the time spent buying and selling; if you actually want it to be low cost you have to spend a lot of time on the transaction to get good value, and for people like me that's excruciatingly painful; for people who like wheeling-and-dealing, they can do pretty well, getting almost free stuff that they are just holding temporarily between sales)

(** = more aside, and actually there is a spending leak that I have that's associated to cars and vacations; I, like most, and perhaps less than average, fall prey to the sunk cost fallacy. The sunk cost fallacy is the idea that since you've spent a bunch on something, you should stick with it and spend some more. Like I've spent this much to go on vacation, I shouldn't cheap out on the dining or whatever. Or I've got an expensive car, I should buy the expensive tires. But that of course is not true. Each decision should be evaluated independently for its value; the fact that you have a large sunk cost only matters in that it changes your current situation. You don't keep chasing your flush just because you've already called some big bets (though obviously your past calls do affect the pot size which affects your current decision)).

Of course home improvements are a classic of false future value. I'm not foolish enough to think I'll get any resale value benefit, but I do fall prey to thinking "I'll enjoy this for many years" when in fact I might not.

I was thinking about buying a really good mattress that's supposed to last 30 years vs one that will only last 5. In theory the long-life one is a much better deal, but there are any number of reasons why that might not be the case. It might not last like it's supposed to; you might pee and poo on it; you might want a different size mattress. By making an "investment" what you've done is commit yourself to something, you've removed flexibility, which is a cost.

Of course if you ever decide you want to travel the world and live in apartments again, all the buying of stuff is a big liability.

Getting away from just "accumulating stuff" now :

I've been thinking lately about my career arc. All through my young-adulthood I was carefully building up my value as a software development employee. I was improving my skills, improving my profile, networking, all that stuff, going up the career ladder. During that time I was not getting paid particularly well. I took jobs based on them being good opportunities for my larger career, not for their immediate financial reward.

The problem is that the big payoff never came. When Oddworld went under I was at the point where I could have moved on to CTO-level jobs at major studios, but I decided I didn't want to do that. The stress was ruining my life (and various other things that I've blogged about back then). The point is that this "future value" I had been building suddenly became zero. If you actually want to redeem that future value, you are locking yourself in to a path, which is a major cost you are paying, giving up flexibility in your life. And in careers there are so many factors outside your control; perhaps your specialty will become less prominent in a few years. Lots of people have done things like getting a JD only to find the law job market has dried up by the time they graduate.

Saving money in general is questionable now. The governments of the world have demonstrated that they don't care about the integrity of the world financial systems, so socking money away for the future has immense risk associated with it. (I don't put much credence in the complete currency collapse alarmists, but I do believe that a long period of negative inflation-adjusted returns is very likely). In the old days we glorified the good salaryman, who worked hard and saved some money, putting the joy of today aside to build a life for themselves and their family tomorrow.

Of course we can relate this all to poker, in old skool cbloom-rants style.

One of the first big realizations I had on my own as I was getting better and moving beyond book TAG play is that implied odds are massively overvalued by most players. "implied odds" is the term used for the imagined future value that you will get if you hit a big hand. Like if I call with a flush draw, it might be a bad value based on the immediate odds, but if I hit I'll make some more money, which makes the call worth it. The problem is that there are a wide variety of reasons why you might not get paid off even if your flush comes (scare cards, or your opponent never had a strong calling hand to begin with). Or your flush might come and he might have a better flush (negative implied odds). If you realistically weight all those undesirable outcomes, the result is that the true effect of implied odds is very small. eg. if on the turn you have a 16% chance to improve, you can call a bet for about zero EV when the bet is around 20% of pot size. The action of implied odds is very small; you can only call a bet that's maybe 25% of pot size; really not much more. Certainly not the 30-35% that people talk themselves into believing is correct. (and of course in no limit holdem you have to adjust for position; out of position you should consider your implied odds to be zero, chasing a draw out of position is so very bad). What I'm getting at is the imagined future value of your current investment is far lower than you imagine.
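
To put rough numbers on that (my arithmetic, not from any book) : call a bet b into a pot P (P = pot before the bet) with winning chance p; the call is breakeven when

    p*(P+b) = (1-p)*b  ->  p = b/(P+2b)

    at b = 0.20*P : breakeven p = 0.20/1.40 = 14.3%
    at b = 0.25*P : breakeven p = 0.25/1.50 = 16.7%

so with 16% to improve, a quarter-pot bet is right at the edge on immediate odds alone; anything bigger has to be justified by those heavily discounted implied odds.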

(sort of not related but "implied odds" is also a good example of the "rationalization trap". Whenever a complicated logical exercise justifies behavior that your naughty irresponsible side secretly wants to do (like chasing draws), you should be very skeptical of that logic. Whenever you read that "a little red wine is healthy" you should be very skeptical. Whenever the result of a "study" is exactly what people want to hear, beware).

This isn't really related to the "future value" mistake, but I've been mulling another spending fallacy, which is the "value of an hour" fallacy. Sometimes I'll do something like buy a tool or hire a helper because it only costs $50 and saves me an hour of work; my hour is worth more than $50, so that's a good deal, right? I'm not so sure. I feel like that line of reasoning is just a way of rationalizing more spending, but I haven't quite found the flaw in it yet.


02-17-13 | Rambles on Mattresses and Retail

I'm finally getting around to trying to buy a new mattress, after the last new mattress that I bought turned out to be a dud (don't buy an S brand).

One of the better stores around here is "Bedrooms and More", which sounds just like a national chain sleaze-o-rama mattress trash peddler, but is actually not. The owner does some funny rants online and he suggests that the real shittening of the S-brands is due to private equity. Interesting idea; certainly there's no doubt that the S-brands have gone to total shit.

Of course we should be mad at the corporate overlords for sending product quality to shit, and generally using dishonest schemes to maximize short term profit. But I'm also angry at consumers for letting it happen. The only way to direct good behavior is to punish people who behave badly. And that just doesn't happen, neither in social life, nor capitalist life. People are amazingly (foolishly?) forgiving. Your only weapon as a consumer is your money.

Hanging out on Porsche forums a few years ago (zomg what was I thinking), I kept having my mind blown by how short-memoried everyone was. Even people who were pretty realistic about what fuckers Porsche had been in the past were all ready to buy the new model. (background : back in the early 90's, Porsche almost went bankrupt; they were completely restructured to focus more on marketing and profit, and less on quality. They intentionally drove the quality of their products down to the absolute minimum (actually, below minimum). This was the era of the Boxster and then the 996, and the early cars that were made were complete junk, some of the worst made cars for any money (worse than a Tata or god knows, it's hard to even think of an example of a horribly made car any more), the engine blocks were porous, the cylinders were out of round, there was cheap plastic inside the engine, and of course terrible cheap plastic everywhere in the interior, it was just a total clusterfuck). The rational consumer response should have been : whoah you guys are lying fuckers, we are never going to buy anything from you again. Instead most of the people were just incredibly forgiving and short-memoried, like yeah that was bad, but they'll fix it in the next model!

Wouldn't it be nice if products that cost more were actually better? Then you could just look at the range of products, pick your price-quality tradeoff point, and buy one. It would still be a tough decision, you'd have to weight how much you want to spend on this thing, but you would at least know that spending more got you something better. In the real world, that's not remotely the case. It's so nice when you go shopping in a video game RPG and you can just buy the more expensive sword and know it's better (and it's so fucking retarded when video games designers throw in wild-cards of expensive items that suck or really cheap items that are great, you dumb assholes, you don't get it, the game world should not make me do all the stressful shit I have to do in the real world).

I've always wanted a grocery store that actually selected its products for good cost/quality tradeoff. That is, a good store should only sell things on the Pareto Frontier. Why the fuck do you have 50 different olive oils? I have no fucking clue what all these olive oils are, don't offer them to me. You (the retailer) should be an expert in this product (and also act as an agglomerator of customer feedback). There should only be like 4 olive oils to choose from, at various cost/quality tradeoffs (and also some for finishing vs. cooking oils, but let's pretend right now that there's only one axis of "quality" for olive oil), so I can just choose how much I want to spend and get the best oil at that price.

I had a funny self-realization moment at Soaring Heart when the salesman was saying how everything was made locally and they pay health care and benefits for their workers, I instinctively/subconscious thought "yeuck, that means bad value". Apparently my subconscious wants to buy products made in sweatshops. More generally I've got a major bias against ever giving money to someone who seems to be living well. When I see a realtor in a fancy car or a contractor who gets a swim and massage daily, fuck you I'm not giving you money, I want my pay to you to be barely enough to support human life, you should be in miserable subsistence conditions, not living it up! I guess I'm also biased against anything made in America; my mental image of Seattle mattress builders is not great skilled workers (like New Yankee Workshop), but something like failed philosophy PhDs who smoke weed while they work and don't know WTF they're doing (like Workaholics).


02-16-13 | The Reflection Visitor Pattern

Recent blog post by Maciej ( Practical C++ RTTI for games ) set me to thinking about the old "Reflect" visitor pattern.

"Reflect" is in my opinion clearly the best way to do member-enumeration in C++. And yet almost nobody uses it. A quick reminder : the reflection visitor pattern is that every class provides a member function named Reflect which takes a templated functor visitor and applies that visitor to all its members; something like :


class Thingy
{
type1 m_x;
type2 m_y;

template <typename functor>
void Reflect( functor visit )
{
    // (for all members)
    visit(m_x);
    visit(m_y);
}

};

with Reflect you can efficiently generate text IO, binary IO, tweak variable GUIs, etc.

(actually instead of directly calling "visit" you probably want to use a macro like #define VISIT(x) visit(x,#x))

A typical visitor is something like a "ReadFromText" functor. You specialize ReadFromText for the basic types (int, float), and for any type that doesn't have a specialization, you assume it's a class and call Reflect on it. That is, the fallback specialization for every visitor should be :


struct ReadFromText
{
    template <typename visiting>
    void operator () ( visiting & v )
    {
        v.Reflect( *this );
    }
};
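
To flesh that out : the handlers for the basic types are just non-template overloads, which beat the template on an exact match. Something like this (my sketch; the IO details are invented) :

    #include <stdio.h>

    struct ReadFromText
    {
        FILE * fp;

        void operator () ( int & v )   { fscanf( fp, "%d", &v ); }
        void operator () ( float & v ) { fscanf( fp, "%f", &v ); }

        // fallback : anything without a handler is assumed to be a class :
        template <typename visiting>
        void operator () ( visiting & v )
        {
            v.Reflect( *this );
        }
    };

    // usage :
    //   Thingy t;
    //   ReadFromText reader = { fopen("thingy.txt","r") };
    //   reader( t );     // recurses via Thingy::Reflect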

The standard alternative is to use some macros to mark up your variables and create a walkable set of extra data on the side. That is much worse in many ways, I contend. You have to maintain a whole type ID system, you have to have virtuals for each type of class IO (note that the Reflect pattern uses no virtuals). The Reflect method lets you use the compiler to create specializations, and get decent error messages when you try to use new visitors or new types that don't have the correct handlers.

Perhaps the best thing about the Reflect system is that it's code, not data. That means you can add arbitrary special case code directly where it's needed, rather than trying to make the macro-cvar system handle everything.

Of course you can go farther and auto-generate your Reflect function, but in my experience manual maintenance is really not a bad problem. See previous notes :

cbloom rants 04-11-07 - 1 - Reflection
cbloom rants 03-13-09 - Automatic Prefs
cbloom rants 05-05-09 - AutoReflect

Now, despite being pro-Reflect I thought I would look at some of the drawbacks.

1. Everything in headers. This is the standard C++ problem. If you truly want to be able to Reflect any class with any visitor, everything has to be in headers. That's annoying enough that in practice in a large code base you probably want to restrict to just a few types of visitor (perhaps just BinIO,TextIO), and provide non-template accessors for those.

This is a transformation that the compiler could do for you if C++ was actually well designed and friendly to programmers (grumble grumble). That is, we have something like

    template <typename functor>
    void Reflect( functor visit );

but we don't want to eat all that pain, so we tell the compiler which types can actually ever visit us :

    void Reflect( TextIO & visit );
    void Reflect( BinIO & visit );

and then you can put all the details in the body. Since C++ won't do it for you, you have to do this by hand, and it's annoying boiler-plate, but could be made easier with macros or autogen.

2. No virtual templates in C++. To call the derived-class implementation of Reflect you need to get down there in some ugly way. If you are specializing to just a few possible visitors (as above), then you can just make those virtual functions and it's no problem. Otherwise you need a derived-class dispatcher (see cblib and previous discussions).

3. Versioning. First of all, versioning in this system is not really any better or worse than versioning in any other system. I've always found automatic versioning systems to be more trouble than they're worth. The fundamental issue is that you want to be able to incrementally do the version transition (you should still be able to load old versions during development), so the objects need to know how to load old versions and convert them to new versions. So you wind up having to write custom code to adapt the old variables to the new, stuff like :


if ( version == 1 )
{
    // used to have member m_angle
    double m_angle;
    visitor(m_angle);
    m_angleCos = cos(m_angle);
}
else
{
    visitor(m_angleCos);
}

now, you can of course do this without explicit version numbers, which is my preference for simple changes. eg. when I have some text prefs and decide I want to remove some values and add new ones, you can just leave code in to handle both ways for a while :

{

#ifndef FINAL
if ( visitor.IsRead() )
{
    double m_angle = 0;
    visitor(m_angle);
    m_angleCos = cos(m_angle);
}
#endif

visitor(m_angleCos);

}

where I'm using the assumption that my IO visitor is a NOP on variables that aren't in the stream. (eg. when loading an old stream, m_angleCos won't be found and so the value from m_angle will be loaded, and when loading a new stream the initial filling from m_angle will be replaced by the later load from m_angleCos).

Anyway, the need for conversions like this has always put me off automatic versioning. But that also means that you can't use the auto-gen'ed reflection. I suspect that in large real-world code, you would wind up doing lots of little special case hacks which would prevent use of the simple auto-gen'ed reflection.

4. Note that macro-markup and Reflect() both could provide extra data, such as min & max value ranges, version numbers, etc. So that's not a reason to prefer one or the other.

5. Reflect() can be abused to call the visitor on values that are on the stack or otherwise not actually data members. Mostly that's a big advantage, it lets you do conversions, and also serialize in a more human-friendly format (for text or tweakers) (eg. you might store a quaternion, but expose it to tweak/text prefs as euler angles) (*).

But, in theory with a macro-markup cvar method, you could use that for layout info of your objects, which would allow you to do more efficient binary IO (eg. by identifying big blocks of data that can be read in binary without any conversions).

(* = whenever you expose a converted version, you should also store the original form in binary so that write-then-read is a guaranteed nop ; this is of course true even for just floating point numbers that aren't printed to all their places, which is something I've talked about before).

I think this potential advantage of the cvar method is not a real advantage. Doing super-efficient binary IO should be as close to this :


void * data = Load( one file );
GameWorld * world = (GameWorld *) data;

as possible. That's going to require a whole different pathway for IO that's separate from the cvar/reflect pathway, so there's no need to consider that as part of the pro/con.

6. The End. I've never used the Reflect() pattern in the real world of a large production codebase, so I don't know how it would really fare. I'd like to try it.


02-05-13 | Some Media Reviews

"The Rings of Saturn" (WG Sebald) is the most incredible book I've read in a long time. It's like one big rambling stream of consciousness aside; there's no story, it's not really about anything in particular, there's no paragaphs - all things that I normally hate - and yet it's totally compelling. It's a real "page turner", an easy read, I love his writing and I just wanted to consume more and more of it. Touched me deeply, amazing book.

(followed up by reading "The Emigrants" which is good, but much more of a normal book, it's terrestrial, not so oddly magical and other-worldly as "The Rings of Saturn").

I've just discovered "The Sky at Night" (ex of Sir Patrick Moore) on BBC. What a marvelous show. I don't really even care much about astronomy, and yet here is a show with real scientists, talking to each other about things they actually understand, and talking at a very high level and not really dumbing it down much for the audience. I've never seen anything like it on television before, actual intelligent people talking to each other, it's amazing. I love Patrick's interviewing style, the way he just blurts things out; he reminds me so much of the real scientists I've known in my life who are super direct and straight to the facts without dancing around the point. It's best to watch early episodes before he got too old/ill.

They did a demo of the Higgs spontaneous symmetry breaking on The Sky at Night which is the best I've seen. Take a wine bottle (with a good hump at the bottom; those familiar will recognize that the hump is the key to symmetry breaking) and put a piece of cork or a ball or something inside. Now shake the bottle vigorously. At high energy like that (bottle shaking), the location of the cork is random, so the whole assemblage still has rotational symmetry. Now stop shaking (low energy) and the cork will settle somewhere - not in the middle of the wine bottle where the hump is, but off to one side in the trough. Symmetry broken.

"The Loneliest Planet" is a really terrible movie and I don't recommend it (jesus christ the scenes where they sit around the camp fire and say the same words over and over are excruciating torture), but it has these few scenes that are some of the most beautiful I've ever seen in a movie - the scenes with the wide static shots that the characters slowly walk across, they're staggering, breath-taking.

"Hello Lonesome" was good. The director obviously knows loneliness; it reminded me a lot of various times in my life; the weirdness of being alone for a long time, the sad joy - you do whatever you want, but it all kind of sucks. The long idle times, so much free time, rambling around your property, hitting fruit with a baseball bat (me, not the movie).

also watchable : Summer Hours (quiet, nothing ever happens, and yet very adult interactions, somehow compelling), Bernie (irresistible, charming), Bonsai (simple little movie that reminds me of life at that age; tasteful), Youth in Revolt (much better teeny rom-com than the more well known teeny rom-coms), Breaking and Entering (some great characters; in the end it's a movie about sad and lonely people)

GAYLE is crazy funny. Weird as hell, wtf is going on, and yet it's the most biting mockery of normal suburban life.

"Utopia" is pretty retarded; the plot is standard unrealistic conspiracy crap, straight out of that awful graphic-novel type of writing; there's no part of it that's clever or insightful or well thought out (so far), and the characters are pretty awful to boot. I don't really care for the torture-porn stuff either (just skip it). All that said, the look of it is just super beautiful, amazing art direction, subtle and realistic but strikingly odd; every shot has these dramatic geometric forms and colors in it. And there's an eerie stillness to it, lots of long holds. It's so good to look at, and good sounds too.

(a lot of recent British stuff is just staggeringly good looking. See also "The Tower", "Red Riding", "Wallander".)

The new season of Top Gear is finally here, and good god is it painfully bad. I guess I should face the fact that it's been awful for many years now, but I was clinging on, hoping it would perk back up (as it occasionally has done, eg. the Bolivia Special). You develop this almost pavlovian response to things that have given you pleasure in the past; the sound of a beer bottle top popping off, the smell of coffee, that Top Gear opening theme song, it starts the pleasure molecules bursting in my brain, even if I don't actually want a beer, and I know that Top Gear is going to suck, there's still this vestigial fondness I have for it. The best part of the series so far has been the meta-comedy moment of James May falling asleep on the show because he too was so bored of it.

ADDENDUM : "Beasts of the Southern Wild" is amazing! joyous, sad, hard to watch, thrilling, it's a rich emotional feast, but it's also an incredible work of art. There's obvious allegory, but it's characters aren't unrealistic victims without faults. More than any thing I think it's a modern fairy tale (of the old style); not "fairy tale" like in the shit Disney sense meaning "princess, happy ending, dreams come true" but in the original sense, like Grimm's and all the old stories that were terrifying ; they were fantasies, but grounded in real world horrors, and often were obvious warning messages. Real fairy tales are magical and beautiful but also scary and sad.


01-31-13 | Ugh ugh I hate the web

So Blogger randomly changed a bunch of shit a while ago.

One of the consequences of this is that the layout of "cbloom rants" can no longer be achieved or maintained with the new blogger layout, which means I can't edit it without losing it completely. (the existing layout does seem to keep working as long as I don't touch it, because they keep the raw HTML of the layout).

Another nasty one I just discovered is that a key setting that I rely upon is no longer there. Under "Settings->Formatting" there used to be a setting for "Convert Line Breaks" which defaults to Yes and causes any LF to be turned into an HTML BR code. I set that to "No" for cbloom rants so that it doesn't crud up my html when I send it over the Blogger API. (god dammit just let me put up HTML and stop fucking with it).

The odd thing is that the "No" setting (of "Convert Line Breaks") for cbloom rants appears to have stuck even though that setting has disappeared. That's fine with me I guess, though I wouldn't be surprised if it just stops working at some point when they revise the service again. The problem is I'm trying to set up a new blog and I can't get that setting any more.

(I of course have a workaround, which is removing LF's before I upload posts. The workaround sucks a little because I like to be able to download my posts back down and have them match the way I wrote them, which of course was with LF's in it for my readability during composition. The point is not the specific issue, it's god damn it don't push updates on me ever never ever unless I ask for them.)

Software updates are incredibly harmful. The benefit from changing *anything* has to be really massive for it to be a win. I'm so sick of getting new versions of crap pushed on me. At least with non-web software you can try to hold onto old versions as long as possible so that you can keep your valuable knowledge and its connections to your automation suite.


01-28-13 | Importing Eudora MBX's to Gmail

I'd like to import all my old Eudora mail to gmail, to get it all together in one place, and for searchability.

(my current mail solution is to use Eudora POP on my local machine, but forward all my mail through gmail for spam filtering and archiving and searchability; it's working pretty well finally).

Gmail does not offer any "import from local disk" options. Sigh. There appear to be a few ways to do this :

1. Change my gmail temporarily to IMAP. Get all my Eudora MBX's into an IMAP client (something like Outlook; perhaps requiring an MBX to PST conversion step or something). Open the IMAP client and connect to gmail; drag the mail from the Eudora boxes to the gmail boxes.

Should work in theory, but a bit scary, and extremely slow (moving mail on IMAP is ungodly slow).

(Also, when I switch back to POP, is it going to redeliver me all that mail that I just uploaded? That would double-suck.)

2. Make a POP server somewhere. Convert the mbx's to mbox's to maildirs and dump them on the POP server for it to serve up. Tell gmail to grab mail from that POP server.

One issue is where I could get a POP server with a public IP and admin access. The other is that any time I try to do networking stuff it's a massive fail of mysterious problems and no error messages.

3. Get a Google Apps gmail account (different from regular gmail account for unknowable reasons). Import MBX's to Outlook. Use "Google Apps Migration for Microsoft Outlook" to import mail to Apps mail account. Use gmail fetcher to bring mail from apps-gmail to my normal gmail.

(similar alternative : get apps gmail. Convert mbx to mbox. Find a Mac. Use "Google Email Uploader for Mac" to upload the mbox. Transfer mail from apps-mail to normal mail).

(I could also use gmail API to write my own importer, but that also requires an Apps gmail, so may as well just use the existing importers in method 3)

It's all such a hassle that I'm once again tempted to just write my own damn email client. Sigh I wish I'd done that long ago, but it's always the local optimization to not do it. I'm so fucking sick of getting penis emails. Hello spam filterers, *penis* -> spam. You're welcome. And of course if I used my own email client, my private property (words) wouldn't be data-mined to serve me ads (you bastards).

(oddly gmail does remarkably well at spam detection on the cases that would be hard for me to do with simple filters; things like bank phishing mails that are designed to look exactly like legitimate mails from my bank; I don't think I could give that up, so I'd still be stuck with routing mail through gmail even if I had my own client).


01-27-13 | Kauai

We took a vacation from the Big Island vacation for a few days to go to Kauai. I'd never been to another Hawaiian island so it was interesting to compare. It also gave us a chance to stay right on the beach, which we're not doing this trip, which was nice for a little while, getting that salt air in the bedroom and sitting on the water late at night. Anyhoo, thoughts on Kauai :

1. Yes it's beautiful. It looks more like Vietnam or Thailand and their limestone karst stuff, all old and weathered and crumbly with these random protrusions and such. (it's cool how you can travel the Hawaiian islands from south to north and visually see geographic time passing at a rate of 100,000 years per island hop). It actually wasn't as lush as I expected given all the hype about how wet it was and the incredible lushness. It's no more jungley beautiful than the Hamakua Coast near Hilo is really. My favorite parts of Kauai were the northern coast, and also just south of Lihue around the Hulemalu Road area (which would be a pretty sweet bike ride; good pavement, no cars).

2. There're sweet beaches all over the island. Like you almost don't have to seek them out there's another one around every corner, and most with very few people on them. None of them looked really perfect the way Mauna Kea and Hapuna are just ridiculously perfect in every way (clear water, no rocks, bottom drops neither too fast nor too slow, no rips, etc), but they were uncrowded and more sort of charming in a rustic way and often have cool surrounding cliffs and pretty settings.

3. The traffic sucks. The island is small, which is cool for a vacation (actually I love staying on tiny islands, like Ko Hai, Caye Caulker, or the Isla Mujeres of my childhood; islands where you can walk from one side to the other in half an hour or so). However, despite the smallness it takes forever to get anywhere because it's constant gridlock. Sitting in traffic fucking blows, and this alone is almost enough to put me off Kauai.

4. The human development on Kauai is repellent. The cities are all really ugly (though that seems to be standard all over Hawaii); most of the island is strip malls and run down shopping centers and fast food and such. Then the alternative are these fancy manicured suburban/golf developments like Poipu and Princeville which are disgusting in a different way. Between the two, the human hand on Kauai has scarred it with an ugliness that is quite tragic.

5. It's extremely tourist-oriented. Every restaurant is for tourists (which means rotten food and weird phoney-nice service), the place is covered in tourist crap shops (t-shirts, mac nuts, koa, etc). It has no feeling of being its own place independent from the tourists. It also has a big port where cruise ships drop off hordes. Part of the problem with that is that Kauai is so small it can't really handle the appearance of 5000 people in one day.

6. The Na Pali Coast trail (Kalalau) is pretty cool. We made it 6 miles in before turning around (just into Hanakoa Valley, which was the best part of the trail that we saw) (pretty impressive for a pregnant lady). It's definitely not the most beautiful hike ever (as some say); there are lots of hikes in WA with better scenery and not so jam packed with ding-dongs. It is sweet to be able to take a dip in the rivers along the way and swim at the beach afterward. Much like the Big Island, there's too much private property and not very much development of good trails, so you see all this beautiful stuff around but you can't really get to it (unless you want to trespass and bushwhack, which you certainly can do).

7. I think it would be a pretty great place for a surf vacation. One of the good things about it from that standpoint is there are decent beaches facing every cardinal direction, so you can pick your spot to match the swell, and because it's small it doesn't take forever to get there. I could see maybe going back to Kauai some day for an intensive "finally learn to really surf" vacation, based around Hanalei or something (and never leaving that area).

8. For anyone considering going to Kauai - don't go in winter. We got super lucky with no big storms during our short trip, but generally Kauai is pounded in winter with big waves and lots of rain. You can always go hide in the dry south, but since the north is the best part of the island it's just better to go when it's not storm season.

Overall it made me miss our Big Island home, and I'm happy to be back.

I guess I'm a little negative about Kauai because I was super tired the whole time from not sleeping well. I also realized that I kind of hate vacation these days. I like workcation where I rent a house for a while and settle in and can cook my own food and bring my bike and get to do what I like (bike, swim, work). I don't really like sight seeing, just going from place to place and going yep I saw that; it feels so pointless, and it's kind of all the same experience no matter what sight it is you're seeing. I hate hotels, the invariably awful beds and pillows, the ice makers and elevators and other guests, the nasty decor and bad air, the attendants angling for tips. I hate restaurants, I'm so sick of restaurants. I wish I could just buy some proper ingredients that are actually fresh and okay quality, and have them cooked simply at the time that I order. Instead you get frozen super-low-grade Sysco garbage that's been pre-cooked and then warmed to order and covered in some nasty "sauce", it's just revolting the filth that they pass off as food all over America. (and the fancy expensive restaurants are not much better). And you have to sit around forever while the waiter does god knows what and try to act nice and make the most of it while poisonous filth is flopped down in front of you.

I like the idea of vacations that are for a certain activity that you like. Not going to see sights or relax, but to go hiking in some place that's really great for hiking, or to go biking, or surfing, or whatever you like. I sort of did this with the CA work/bike-cation, and it was rocking good. I'd like to do it more, but it's hard to find good information. A lot of the "epic hikes" or "great bike rides" are actually total shit; the rating is done by people who don't know WTF they're talking about. (same is true of "great beaches", which are often total crap beaches except for their white sand or something stupid like that). For example, I know that Hwy 1 in CA is on many a list of epic rides, and having lived there for a long time I know that's totally retarded; not only is it not epic, it's barely even tolerable, like I would never ride it by choice (I only rode it when necessary to connect a loop between other roads), and in the same area where they recommend Hwy 1 there are probably 30 rides that are much better. So anyway, actually finding solid information on places that are good "destination biking" is very difficult.

I'm also getting more sensitive about travelling places where the tourism is sort of a form of exploitation. In Hawaii the bad vibes are mild, but they're definitely there. We stole these islands from the Hawaiians, and now they are mostly pretty poor and get to watch rich tourists come in and buy up their best land and crowd up their favorite local spots. But despite that Hawaii is immensely better than other beach destinations I've gone to. In Mexico & Central America you get to see the abject poverty of people whose lives have been destroyed by government corruption and "free trade" (which is a transparent absurdity when we own all the patents and subsidize our exports and fuel costs); most of the beach developments were the result of the government evicting the people who rightfully lived there with minimal compensation; you used to be able to get away from the Zona Hoteleria areas and find sweet little towns that were still pretty untouched, but that's increasingly hard. In Thailand you're surrounded by the sex tourists and the cheap-booze backpacking set, who generally sleaze the place up (but it's better when you get away from the tourist-heavy areas).

Anyhoo, some photos from Kauai :


(including "tree canopy" and "how to look at tree canopy")


01-17-13 | What Happened to Tech Blogs?

I feel like the internet is dying. There's less and less legitimate content, and more and more fluff and self-promotional ignorant useless crap. It's becoming harder and harder to find solid information that's written by people who actually know anything about what they're posting about.

The information on the internet is now almost entirely one of :

1. Advertising. Sometimes even subtly hidden advertising (there are now tons of "blogs" that are actually advertisements, and a lot of the posters on web forums are actually advertisers who are more or less clever about it).

2. Ignorant. Stuff like answers.yahoo and eHow and Yelp and so on are once in a while written by someone who knows their topic, but usually not. Reading these sites is often more harmful than helpful.

Oddly, the vast majority of blogs about things like cooking, cars, home improvement, or any DIY hobbies are not written by people who actually do those things and know anything about them. They're usually written by housewives or techie nerds who just want some attention or love blogging or god knows why they do it. It should sort of be harmless for ignorant people to write about their adventures building a shack for the first time (lord knows I do it), but it's actually not harmless. For one thing, they tend to become popular and so become the leading search results, ahead of much better information which is drier or not so cutesy. For another, the writers often present themselves as more well informed than they actually are, and they often misrepresent the success of their endeavour.

3. Self-vertising. Even some of the better blogs are just ways to self-promote or otherwise make money. This can be okay and there can still be good information from the self-vertisers, but they also do a lot of padding, a lot of repetition, and heavily distort the truth to make themselves seem more important. The tech self-vertisers tend to be annoyingly pedantic and act like experts when they are not. They almost never do the helpful thing and link to their (better) original sources. They often use the same style as pundits or paid "experts" in that they present their solution as The One True Way to give it extra legitimacy, when in fact the truth is more nuanced (maybe there are disadvantages that they don't talk about, or equally valid other solutions that already exist, or uncertainties in the parameters). Part of the problem with the self-vertisers is that they all mutually promote and are very active about SEO, so they become the primary visible voices. Also to pad their posting they tend to grab "facts" from other sources and repeat them, which creates a bad false sense of confidence in those nonsensities because they are being repeated all over.

Somewhat related to this are the lunatics with some kind of agenda. They aren't exactly advertising, but they are rabid about some point and so spam the web with their "facts" which are just creations designed to prove their point. It makes it almost impossible to find information about controversial topics, because these people are so active that they dominate search results.

4. Communities. I used to get some of my best information from web communities/forums. The great thing about them is that you can find these individual posters that hang out on them who are actually true experts in the field; like if you're searching for home improvement stuff you can find guys in web communities who are actual long term builders and provide solid facts; or for car info you can find people who actually build or race cars and know WTF they're talking about. However, it takes a lot of work to find those guys; they generally are not the most frequent posters, they tend to pop in and snipe some amazing wisdom once in a while and then disappear. You have to do a lot of scrounging around, and read multiple posts from each poster to try to assess the credibility of the individual user.

But I've been noticing something really nasty about web communities recently. They tend to get into this kind of rigid group-think which can lead them to constantly repeat certain "facts" despite there being no substance to them. What happens is some strong personality on the forum promotes some fact and everyone gives it a "thumbs up"; they start repeating it every time someone asks that question, and it winds up in the FAQ. Posters on web communities are highly motivated by the approval of their peers; they act like a pack of high schoolers who are constantly looking around to make sure everyone else thinks they're cool. There's very little independent thinking and willingness to challenge the group-think. There's lots of high-fiving.

The truly wise tend to be humble and a bit soft-spoken. That's all well and good, but in the juvenile shouting match which is the modern internet, it's the people who are unashamed to loudly pontificate and bully about things they know not much of who are heard.

Try searching for something like "Calphalon" or "Big Island Waterfall" and see how many results you can find that aren't one of those 4 groups. Sure there's still signal out there but it's getting drowned in the noise.

Anyhoo. One of the symptoms of the dying internet that I've noticed is that there are basically no tech blogs for me to read any more. Maybe I'm just out of the loop? Are you all blogging on facebook now, or some other closed system that I refuse to join?

A few years ago, I felt like I was getting really superb quality tech blogs in my RSS on an almost daily basis, and now that has slowed to a trickle of maybe one a week or one a month. The vast majority of people that I liked and followed are not posting any more. What gives?

I understand that a lot of people who blog do it for a while, but lose steam and their blog goes silent. But there should be new people picking up the mantle; maybe I just haven't been active enough about figuring out who the good new bloggers are.

For reference, my tech blog subscriptions :



<opml version="1.0">
    <head>
        <title>cbloom subscriptions in Google Reader</title>
    </head>
    <body>
        <outline text=".mischief.mayhem.soap."
            title=".mischief.mayhem.soap." type="rss"
            xmlUrl="http://msinilo.pl/blog/?feed=rss2" htmlUrl="http://msinilo.pl/blog"/>
        <outline text="1024cores" title="1024cores" type="rss"
            xmlUrl="http://feeds.feedburner.com/1024cores" htmlUrl="http://blog.1024cores.net/"/>
        <outline text="A random walk through geek-space"
            title="A random walk through geek-space" type="rss"
            xmlUrl="http://api.live.net/Users(4929737823860505484)/Main?$format=rss20" htmlUrl="http://sebastiansylvan.wordpress.com"/>
         <outline text="Amit's Thoughts" title="Amit's Thoughts"
            type="rss"
            xmlUrl="http://amitp.blogspot.com/feeds/posts/default" htmlUrl="http://amitp.blogspot.com/"/>
        <outline text="Aras' website" title="Aras' website" type="rss"
            xmlUrl="http://aras-p.info/atom.xml" htmlUrl="http://aras-p.info/"/>
        <outline text="Atom" title="Atom" type="rss"
            xmlUrl="http://farrarfocus.blogspot.com/feeds/posts/default" htmlUrl="http://farrarfocus.blogspot.com/"/>
        <outline text="Attractive Chaos" title="Attractive Chaos"
            type="rss"
            xmlUrl="http://attractivechaos.wordpress.com/feed/" htmlUrl="http://attractivechaos.wordpress.com"/>
         <outline text="Beautiful Pixels" title="Beautiful Pixels"
            type="rss"
            xmlUrl="http://feeds.feedburner.com/BeautifulPixels" htmlUrl="http://beautifulpixels.blogspot.com/"/>
        <outline text="Birth of a Game" title="Birth of a Game"
            type="rss"
            xmlUrl="http://uber.typepad.com/birthofagame/atom.xml" htmlUrl="http://uber.typepad.com/birthofagame/"/>
        <outline text="bitsquid: development blog"
            title="bitsquid: development blog" type="rss"
            xmlUrl="http://bitsquid.blogspot.com/feeds/posts/default" htmlUrl="http://bitsquid.blogspot.com/"/>
        <outline text="bouliiii's blog" title="bouliiii's blog"
            type="rss"
            xmlUrl="http://bouliiii.blogspot.com/feeds/posts/default" htmlUrl="http://bouliiii.blogspot.com/"/>
        <outline text="Braid" title="Braid" type="rss"
            xmlUrl="http://braid-game.com/news/?feed=rss2" htmlUrl="http://braid-game.com/news"/>
        <outline text="Breaking Eggs And Making Omelettes"
            title="Breaking Eggs And Making Omelettes" type="rss"
            xmlUrl="http://multimedia.cx/eggs/feed/" htmlUrl="http://multimedia.cx/eggs"/>
        <outline text="C++Next" title="C++Next" type="rss"
            xmlUrl="http://cpp-next.com/feed/" htmlUrl="http://cpp-next.com"/>
        <outline text="c0de517e" title="c0de517e" type="rss"
            xmlUrl="http://c0de517e.blogspot.com/feeds/posts/default" htmlUrl="http://c0de517e.blogspot.com/"/>
        <outline text="Canned Platypus" title="Canned Platypus"
            type="rss" xmlUrl="http://pl.atyp.us/wordpress/?feed=rss2" htmlUrl="http://pl.atyp.us/wordpress"/>
        <outline text="cbloom rants" title="cbloom rants" type="rss"
            xmlUrl="http://feeds.feedburner.com/CbloomRants" htmlUrl="http://cbloomrants.blogspot.com/"/>
        <outline text="Cessu's blog" title="Cessu's blog" type="rss"
            xmlUrl="http://cessu.blogspot.com/feeds/posts/default" htmlUrl="http://cessu.blogspot.com/"/>
         <outline text="CodeItNow" title="CodeItNow" type="rss"
            xmlUrl="http://www.rorydriscoll.com/feed/" htmlUrl="http://www.rorydriscoll.com"/>
        <outline text="Coder Corner" title="Coder Corner" type="rss"
            xmlUrl="http://www.codercorner.com/blog/?feed=rss2" htmlUrl="http://www.codercorner.com/blog"/>
        <outline text="copypastepixel" title="copypastepixel" type="rss"
            xmlUrl="http://copypastepixel.blogspot.com/feeds/posts/default" htmlUrl="http://copypastepixel.blogspot.com/"/>
        <outline text="Corensic" title="Corensic" type="rss"
            xmlUrl="http://corensic.wordpress.com/feed/" htmlUrl="http://corensic.wordpress.com"/>
        <outline text="Diary of a Graphics Programmer"
            title="Diary of a Graphics Programmer" type="rss"
            xmlUrl="http://diaryofagraphicsprogrammer.blogspot.com/feeds/posts/default" htmlUrl="http://diaryofagraphicsprogrammer.blogspot.com/"/>
        <outline text="Diary Of An x264 Developer"
            title="Diary Of An x264 Developer" type="rss"
            xmlUrl="http://x264dev.multimedia.cx/?feed=atom" htmlUrl="http://x264dev.multimedia.cx/"/>
        <outline text="direct to video" title="direct to video"
            type="rss" xmlUrl="http://directtovideo.wordpress.com/feed/" htmlUrl="http://directtovideo.wordpress.com"/>
        <outline text="el trastero" title="el trastero" type="rss"
            xmlUrl="http://www.iquilezles.org/blog/?feed=rss2" htmlUrl="http://www.iquilezles.org/blog"/>
        <outline text="EntBlog" title="EntBlog" type="rss"
            xmlUrl="http://feeds2.feedburner.com/EntBlog" htmlUrl="http://entland.homelinux.com/blog"/>
        <outline text="EnterTheSingularity" title="EnterTheSingularity"
            type="rss"
            xmlUrl="http://enterthesingularity.blogspot.com/feeds/posts/default?alt=rss" htmlUrl="http://enterthesingularity.blogspot.com/"/>
         <outline text="Fast Data Compression"
            title="Fast Data Compression" type="rss"
            xmlUrl="http://fastcompression.blogspot.com/feeds/posts/default" htmlUrl="http://fastcompression.blogspot.com/"/>
        <outline text="fixored?" title="fixored?" type="rss"
            xmlUrl="http://www.sjbrown.co.uk/feed/" htmlUrl="http://www.sjbrown.co.uk"/>
        <outline text="Game Angst" title="Game Angst" type="rss"
            xmlUrl="http://gameangst.com/?feed=rss2" htmlUrl="http://gameangst.com"/>
        <outline text="Game Rendering" title="Game Rendering" type="rss"
            xmlUrl="http://www.gamerendering.com/feed/atom/" htmlUrl="http://www.gamerendering.com/"/>
        <outline text="Game Rendering" title="Game Rendering" type="rss"
            xmlUrl="http://www.gamerendering.com/feed/" htmlUrl="http://www.gamerendering.com"/>
        <outline text="GameArchitect" title="GameArchitect" type="rss"
            xmlUrl="http://gamearchitect.net/feed/" htmlUrl="http://gamearchitect.net"/>
        <outline text="Gamedev Coder Diary" title="Gamedev Coder Diary"
            type="rss" xmlUrl="http://gamedevcoder.wordpress.com/feed/" htmlUrl="http://gamedevcoder.wordpress.com"/>
        <outline text="Graphic Rants" title="Graphic Rants" type="rss"
            xmlUrl="http://graphicrants.blogspot.com/feeds/posts/default" htmlUrl="http://graphicrants.blogspot.com/"/>
        <outline text="Graphics Runner" title="Graphics Runner"
            type="rss"
            xmlUrl="http://graphicsrunner.blogspot.com/feeds/posts/default" htmlUrl="http://graphicsrunner.blogspot.com/"/>
        <outline text="Graphics Size Coding"
            title="Graphics Size Coding" type="rss"
            xmlUrl="http://sizecoding.blogspot.com/feeds/posts/default" htmlUrl="http://sizecoding.blogspot.com/"/>
        <outline text="Gustavo Duarte" title="Gustavo Duarte" type="rss"
            xmlUrl="http://feeds2.feedburner.com/GustavoDuarte" htmlUrl="http://duartes.org/gustavo/blog"/>
        <outline text="Hardwarebug" title="Hardwarebug" type="rss"
            xmlUrl="http://hardwarebug.org/feed/" htmlUrl="http://hardwarebug.org"/>
        <outline text="hbr" title="hbr" type="rss"
            xmlUrl="http://brnz.org/hbr/?feed=rss2" htmlUrl="http://brnz.org/hbr"/>
        <outline text="Humus" title="Humus" type="rss"
            xmlUrl="http://www.humus.name/rss.xml" htmlUrl="http://www.humus.name"/>
        <outline text="I am an extreme moderate"
            title="I am an extreme moderate" type="rss"
            xmlUrl="https://extrememoderate.wordpress.com/feed/" htmlUrl="https://extrememoderate.wordpress.com"/>
        <outline text="I Get Your Fail" title="I Get Your Fail"
            type="rss" xmlUrl="http://feeds.feedburner.com/IGetYourFail" htmlUrl="http://igetyourfail.blogspot.com/"/>
        <outline text="Ignacio Castaño" title="Ignacio Castaño"
            type="rss" xmlUrl="http://castano.ludicon.com/blog/feed/" htmlUrl="http://www.ludicon.com/castano/blog"/>
        <outline text="Industrial Arithmetic"
            title="Industrial Arithmetic" type="rss"
            xmlUrl="http://industrialarithmetic.blogspot.com/feeds/posts/default" htmlUrl="http://industrialarithmetic.blogspot.com/"/>
          <outline text="John Ratcliff's Code Suppository"
            title="John Ratcliff's Code Suppository" type="rss"
            xmlUrl="http://codesuppository.blogspot.com/feeds/posts/default" htmlUrl="http://codesuppository.blogspot.com/"/>
        <outline text="Just Software Solutions Blog"
            title="Just Software Solutions Blog" type="rss"
            xmlUrl="http://www.justsoftwaresolutions.co.uk/index.rss" htmlUrl="http://www.justsoftwaresolutions.co.uk/blog/"/>
        <outline text="Lair Of The Multimedia Guru"
            title="Lair Of The Multimedia Guru" type="rss"
            xmlUrl="http://guru.multimedia.cx/feed/" htmlUrl="http://guru.multimedia.cx"/>
        <outline text="Larry Osterman's WebLog"
            title="Larry Osterman's WebLog" type="rss"
            xmlUrl="http://blogs.msdn.com/larryosterman/rss.xml" htmlUrl="http://blogs.msdn.com/b/larryosterman/"/>
        <outline text="Level of Detail" title="Level of Detail"
            type="rss" xmlUrl="http://levelofdetail.wordpress.com/feed/" htmlUrl="http://levelofdetail.wordpress.com"/>
        <outline text="level of detail" title="level of detail"
            type="rss" xmlUrl="http://www.jshopf.com/blog/?feed=rss2" htmlUrl="http://jshopf.com/blog"/>
        <outline text="Light is beautiful" title="Light is beautiful"
            type="rss"
            xmlUrl="http://feeds.feedburner.com/LightIsBeautiful?format=xml" htmlUrl="http://lousodrome.net/blog/light"/>
        <outline text="Lightning Engine" title="Lightning Engine"
            type="rss"
            xmlUrl="http://feeds2.feedburner.com/LightningEngine" htmlUrl="http://blog.makingartstudios.com"/>
        <outline text="Lost in the Triangles"
            title="Lost in the Triangles" type="rss"
            xmlUrl="http://feeds.feedburner.com/LostInTheTriangles" htmlUrl="http://aras-p.info/"/>
        <outline text="Mark's Blog" title="Mark's Blog" type="rss"
            xmlUrl="http://blogs.technet.com/markrussinovich/rss.xml" htmlUrl="http://blogs.technet.com/b/markrussinovich/"/>
        <outline text="meshula.net" title="meshula.net" type="rss"
            xmlUrl="http://meshula.net/wordpress/?feed=rss2" htmlUrl="http://meshula.net/wordpress"/>
        <outline text="Miles Macklin's blog"
            title="Miles Macklin's blog" type="rss"
            xmlUrl="http://blog.mmacklin.com/feed/" htmlUrl="http://blog.mmacklin.com"/>
        <outline text="Mod Blog" title="Mod Blog" type="rss"
            xmlUrl="http://www.modularpeople.com/blog/?feed=rss2" htmlUrl="http://www.modularpeople.com/blog"/>
        <outline text="Molecular Musings" title="Molecular Musings"
            type="rss"
            xmlUrl="http://molecularmusings.wordpress.com/feed/" htmlUrl="http://molecularmusings.wordpress.com"/>
        <outline text="Monty" title="Monty" type="rss"
            xmlUrl="http://xiphmont.livejournal.com/data/rss" htmlUrl="http://xiphmont.livejournal.com/"/>
        <outline text="My Green Paste, Inc."
            title="My Green Paste, Inc." type="rss"
            xmlUrl="http://mygreenpaste.blogspot.com/feeds/posts/default" htmlUrl="http://mygreenpaste.blogspot.com/"/>
        <outline text="Nerdblog.com" title="Nerdblog.com" type="rss"
            xmlUrl="http://www.nerdblog.com/feeds/posts/default" htmlUrl="http://www.nerdblog.com/"/>
         <outline text="nothings' projects" title="nothings' projects"
            type="rss" xmlUrl="http://nothings.org/projects/?feed=rss2" htmlUrl="http://nothings.org/projects"/>
        <outline text="Nynaeve" title="Nynaeve" type="rss"
            xmlUrl="http://www.nynaeve.net/?feed=rss2" htmlUrl="http://www.nynaeve.net"/>
        <outline text="onepartcode.com" title="onepartcode.com"
            type="rss" xmlUrl="http://onepartcode.com/main/index.rss" htmlUrl="http://onepartcode.com/main"/>
        <outline text="Online Game Techniques"
            title="Online Game Techniques" type="rss"
            xmlUrl="http://onlinegametechniques.blogspot.com/feeds/posts/default" htmlUrl="http://onlinegametechniques.blogspot.com/"/>
        <outline text="Pete Shirley's Graphics Blog"
            title="Pete Shirley's Graphics Blog" type="rss"
            xmlUrl="http://psgraphics.blogspot.com/feeds/posts/default" htmlUrl="http://psgraphics.blogspot.com/"/>
        <outline text="Pixels, Too Many.." title="Pixels, Too Many.."
            type="rss" xmlUrl="http://pixelstoomany.wordpress.com/feed/" htmlUrl="http://pixelstoomany.wordpress.com"/>
        <outline text="Preshing on Programming"
            title="Preshing on Programming" type="rss"
            xmlUrl="http://preshing.com/feed" htmlUrl="http://preshing.com"/>
        <outline text="Ray Tracey's blog" title="Ray Tracey's blog"
            type="rss"
            xmlUrl="http://raytracey.blogspot.com/feeds/posts/default" htmlUrl="http://raytracey.blogspot.com/"/>
        <outline text="Real-Time Rendering" title="Real-Time Rendering"
            type="rss"
            xmlUrl="http://www.realtimerendering.com/blog/feed/" htmlUrl="http://www.realtimerendering.com/blog"/>
        <outline text="realtimecollisiondetection.net - the blog"
            title="realtimecollisiondetection.net - the blog" type="rss"
            xmlUrl="http://realtimecollisiondetection.net/blog/?feed=atom" htmlUrl="http://realtimecollisiondetection.net/blog"/>
        <outline text="Reenigne blog" title="Reenigne blog" type="rss"
            xmlUrl="http://www.reenigne.org/blog/feed/" htmlUrl="http://www.reenigne.org/blog"/>
        <outline text="RenderWonk" title="RenderWonk" type="rss"
            xmlUrl="http://renderwonk.com/blog/index.php/feed/" htmlUrl="http://renderwonk.com/blog"/>
        <outline text="ridiculous_fish" title="ridiculous_fish"
            type="rss" xmlUrl="http://ridiculousfish.com/blog/feed/" htmlUrl="http://ridiculousfish.com/blog/"/>
        <outline text="Sanders' blog" title="Sanders' blog" type="rss"
            xmlUrl="http://sandervanrossen.blogspot.com/feeds/posts/default?alt=rss" htmlUrl="http://sandervanrossen.blogspot.com/"/>
         <outline text="Self Shadow" title="Self Shadow" type="rss"
            xmlUrl="http://blog.selfshadow.com/feed/" htmlUrl="http://blog.selfshadow.com/"/>
        <outline text="Some Assembly Required"
            title="Some Assembly Required" type="rss"
            xmlUrl="http://assemblyrequired.crashworks.org/feed/" htmlUrl="http://assemblyrequired.crashworks.org"/>
        <outline text="stinkin' thinkin'" title="stinkin' thinkin'"
            type="rss"
            xmlUrl="http://stinkygoat.livejournal.com/data/rss" htmlUrl="http://stinkygoat.livejournal.com/"/>
        <outline text="Stuart Denman" title="Stuart Denman" type="rss"
            xmlUrl="http://www.stuartdenman.com/feed/" htmlUrl="http://www.stuartdenman.com"/>
        <outline text="Stumbling Toward 'Awesomeness'"
            title="Stumbling Toward 'Awesomeness'" type="rss"
            xmlUrl="http://www.chrisevans3d.com/pub_blog/?feed=atom" htmlUrl="http://www.chrisevans3d.com/pub_blog"/>
         <outline text="Syntopia" title="Syntopia" type="rss"
            xmlUrl="http://blog.hvidtfeldts.net/index.php/feed/" htmlUrl="http://blog.hvidtfeldts.net"/>
        <outline text="Sébastien Lagarde" title="Sébastien Lagarde"
            type="rss" xmlUrl="https://seblagarde.wordpress.com/feed/" htmlUrl="https://seblagarde.wordpress.com"/>
        <outline text="The Atom Project" title="The Atom Project"
            type="rss"
            xmlUrl="http://www.farrarfocus.com/atom/index.atom" htmlUrl="http://www.farrarfocus.com/atom/"/>
        <outline text="The Danger Zone" title="The Danger Zone"
            type="rss" xmlUrl="http://mynameismjp.wordpress.com/feed/" htmlUrl="http://mynameismjp.wordpress.com"/>
        <outline text="The Data Compression News Blog"
            title="The Data Compression News Blog" type="rss"
            xmlUrl="http://www.c10n.info/feed" htmlUrl="http://www.c10n.info"/>
        <outline text="The Fifth Column" title="The Fifth Column"
            type="rss"
            xmlUrl="http://thefifthcolumn.com/blog/?feed=rss2" htmlUrl="http://thefifthcolumn.com/blog"/>
        <outline text="The Ladybug Letter" title="The Ladybug Letter"
            type="rss" xmlUrl="http://www.ladybugletter.com/?feed=atom" htmlUrl="http://www.ladybugletter.com/"/>
        <outline text="The ryg blog" title="The ryg blog" type="rss"
            xmlUrl="http://fgiesen.wordpress.com/feed/" htmlUrl="http://fgiesen.wordpress.com"/>
        <outline text="The software rendering world"
            title="The software rendering world" type="rss"
            xmlUrl="http://winden.wordpress.com/feed/" htmlUrl="http://winden.wordpress.com"/>
        <outline text="The Witness" title="The Witness" type="rss"
            xmlUrl="http://the-witness.net/news/feed/" htmlUrl="http://the-witness.net/news"/>
        <outline text="Transcendental Technical Travails"
            title="Transcendental Technical Travails" type="rss"
            xmlUrl="http://t-t-travails.blogspot.com/feeds/posts/default" htmlUrl="http://t-t-travails.blogspot.com/"/>
        <outline text="Treatise on Graphics Programming"
            title="Treatise on Graphics Programming" type="rss"
            xmlUrl="http://www.wolfgang-engel.info/blogs/?feed=rss2" htmlUrl="http://www.wolfgang-engel.info/blogs"/>
        <outline text="UMBC Games, Animation and Interactive Media"
            title="UMBC Games, Animation and Interactive Media"
            type="rss" xmlUrl="http://gaim.umbc.edu/feed/" htmlUrl="http://gaim.umbc.edu"/>
        <outline text="View" title="View" type="rss"
            xmlUrl="http://view.eecs.berkeley.edu/blog/rss.php?ver=2" htmlUrl="http://view.eecs.berkeley.edu/blog/"/>
         <outline text="VirtualBlog" title="VirtualBlog" type="rss"
            xmlUrl="http://www.virtualdub.org/blog/rss.xml" htmlUrl="http://virtualdub.org/blog/index.php"/>
         <outline text="Voxelium" title="Voxelium" type="rss"
            xmlUrl="http://voxelium.wordpress.com/feed/" htmlUrl="http://voxelium.wordpress.com"/>
        <outline
            text="What your mother never told you about graphics development"
            title="What your mother never told you about graphics development"
            type="rss"
            xmlUrl="http://zeuxcg.blogspot.com/feeds/posts/default" htmlUrl="http://zeuxcg.blogspot.com/"/>
        <outline
            text="What your mother never told you about graphics development"
            title="What your mother never told you about graphics development"
            type="rss" xmlUrl="http://zeuxcg.org/feed/" htmlUrl="http://zeuxcg.org"/>
        <outline text="Work Without Dread" title="Work Without Dread"
            type="rss"
            xmlUrl="http://workwithoutdread.blogspot.com/feeds/posts/default" htmlUrl="http://workwithoutdread.blogspot.com/"/>
        <outline text="Zack Rusin" title="Zack Rusin" type="rss"
            xmlUrl="http://zrusin.blogspot.com/feeds/posts/default" htmlUrl="http://zrusin.blogspot.com/"/>
        <outline text="ZigguratVertigo's Hideout"
            title="ZigguratVertigo's Hideout" type="rss"
            xmlUrl="http://zigguratvertigo.com/feed/" htmlUrl="http://colinbarrebrisebois.com"/>
        <outline text="  Bartosz Milewski's Programming Cafe"
            title="  Bartosz Milewski's Programming Cafe" type="rss"
            xmlUrl="http://bartoszmilewski.wordpress.com/feed/" htmlUrl="http://bartoszmilewski.com"/>
    </body>
</opml>


01-15-13 | Kids

Some random thoughts on my impending kid-having-ness :

(note for dumb people : we're not here to talk about boring obvious shit like "kids make you sleep less" or "many parents live out their frustrated life goals through their kids". That's an obvious given as a baseline that should not need to be said; on this blog we try to talk about the things that are past the baseline, though many readers seem to not get that and want to chime in with the material that was a prerequisite for this course; get out of here and go back to reading "Excessive DOF Photos of Crappy Food" or "The New Old Coding Bore" or "Precious Twee Artisinal All-Organic Parenting" or whatever banal blog you usually read)

1. Kids automatically make you cooler. They're like a +1 modifier on anything you do. Like if you're just some single guy and you're in good shape and do triathlons or whatever, who cares, you're kind of an obsessive dweeb. But if you're a good family-man dad and you do the same, then you're amazing cool fit dad. (of course there's valid reason for this +1, because it's so much harder to do anything once you have kids, they're such a huge energy-suck)

(I've long been aware that I have some sort of bad jealousy tick where I really hate awesome dads; whenever I meet a dad who's super-fit and has great kids and also has a great job and builds robots or writes books on the side, I'm just filled with loathing; I'm not entirely sure but I assume that instant gut loathing comes from jealousy; I also think those guys are liars/phonies. Like, I think they must actually be terrible dads, it's just not possible to do all those things and spend enough time with your kids; why aren't you exhausted and frazzled? perhaps they have very self-sacrificing wives who are actually doing all the work at home, and/or they aren't actually putting in the work at their job; something is amiss, my spidey senses tingle)

2. Kids let you do things you suck at without feeling awkward. Say you suck at skiing; if you just go as a single man and take beginner lessons and have to ski the tow-rope bunny slope, you feel embarrassed and most people can't get over it (of course if you do it anyway, you actually are super cool, and it's the people who look down on you that are fucking retard losers, but I digress). With kids you can go and ski the bunny slope with them and nobody looks at you funny. If you go ice skate for the first time as a single adult and are falling all over and wobbly you're a weirdo, but if you do it with your kids, you're a cool dad.

(one of the great tragedies of life is that people stop doing new things around 20 because they don't want to look like a beginner; they also lose all humility and never want to admit that they are a beginner at something. It's super dumb and I've been trying to get past it for the last 20 years or so. It's so funny seeing men at track days or at home improvement stores; they obviously don't know a thing about cars or construction (like I don't), but they can't just admit it and go "yeah I'm a newbie, can you help me?" they have to act all macho-man and pretend to be in their element like "I need a ball-peen wrench to adjust the timing on my carburetor." Um, let's back up and try again.)

3. Kids let you do things that are dweeby to do as single people, like go to the zoo or ride in a carriage. Part of the issue is that those activities are just not quite interesting enough on their own, but when you have the +1 enjoyment modifier of seeing it through your kids' eyes that pushes them over the threshold of worth-doing-ness. I've always loved factory tours and those living-history museums where you can see how stained glass is made or whatever, also science museums (particularly interactive ones), but they just aren't quite worth doing as an adult. Kids remove the embarrassment of everyone around you thinking "why are you here? it's only really old people and families, childless adults are not allowed".

4. Kids give you an excuse to be a selfish inconsiderate asshole. This is not a good thing and lots of parents over-do it. (it starts with pregnant moms who use the pregnancy as an excuse to be selfish bitches way beyond what's necessary or appropriate). Things like we can be loud at the symphony because we have kids, or we can cut in line because we're pregnant, or lets take the best seats and spread out all over, or lets take all the chairs at the hotel pool and then leave a giant mess behind us, etc. People know that kids make it much harder for others to go "hey fucker, you're out of line" and they abuse that advantage.

5. Kids let you play. I'm super excited about this. For a long time I've known that what I really need in my life is *play* , not sports, not games, but just joyous pointless movement. Adults are so fucking uptight and trying to act cool and impress each other all the time that they can never just play (actually I had a pretty sweet thing going for a while with Ryan where we could play a bit, but that was rare). Of course there's a whole industry of "ecstatic dance" and shit like that which is basically adults paying someone to let them play, which is so sad and bizarre; you have these uptight type-A business assholes who are total fuckers to everyone all day long, and then they go in a room and listen to a teacher tell them to run around in circles and stick their tongue out; super bizarre disconnect there. Anyhoo, kids let you go to the park and run around and roll in the grass and jump on things and nobody thinks you're a weirdo. (alternatively : move to San Francisco; fucking wonderful place SF, but all the gentry and computerists are ruining it)

(I guess those funny-dress-up runs are also societal concoctions to let adults play; but they ruin it by being a regimented precisely specified play; you're still just trying to fit in and do what you're supposed to. Oh crap, I wore a tutu and everyone else is wearing a cape! And it's still competitive and judgemental - ooh look, that guy is really relaxing well. Adulthood is so bizarre.)

6. Kids let you not have friends. They let you turn inward and just hang out with family. And of course you get some socializing through your kids doing things and hanging out with other parents. You don't have to make any effort to make adult relationships work, which is a pain in the ass. Kids let you just stay home with your family without being a weird lonely hermit. Of course this is also a danger if you take it too far; you see these families that are so drawn in and almost afraid of other adults that when they're out in public they hardly even look up at the world around them.

7. Kids let you feel okay about sucking. If you're not really doing anything with your life and you're just kind of a rotten human being, but you have kids - then you can think "I devote my life to my kids, they are my pride and joy, at least I've made them, they are my life's work". They provide a +1 smugness bonus.

8. Kids give you a new thing to be ridiculously analytical and obsessive and introspective about. Most type-A nerds have kind of gotten bored of thinking about life by the time they hit 25 or so. We've already thought and over-thought everything that we do in life ("what is actually going on in the little social interaction with the grocery store checkout person? should I make minimal polite smalltalk, or should I try to say something unexpected to cheer them up? do I feel bad for this person whose life has obviously taken such a wrong turn? am I trying to make them feel like my equal and not my servant?" etc). We've made charts and graphs about how various influences in our life affect our productivity, and it's just all old hat. ("should I turn the other cheek, or should I get aggressive back at this asshole? Turning the other cheek is a local optimization of my own happiness, but that does not create a social game-theory structure which directs overall behavior in a good way. Oh wait I've had this same thought like 100 times before"). Kids give you a whole new set of things to be anal and nerdy about, read books and think about cause-effect and blah blah blah. Of course this blog post is a sign that I've already begun.


01-14-13 | Hawaii Workcation 2013

Photos from the first few days here. Tasha's bro visited so we did a bit of travelling around sightseeing. Starting with the rental house, my office, and then some excursions :



Man it feels great to be here. The house is incredible, just as we hoped, tons of windows and a big view of Mauna Kea with not a single neighbor around. I feel alive, young, virile, lithe. I love the sun and the sweat. I love the trees and the good vibes.

I packed my bike this year (mild hassle (and the damn TSA opened my box and disturbed my careful packing)), looking forward to getting some good rides.

BTW you may notice that the correct ergonomic position for a "laptop" is about three feet above the human lap.

I can't wait for Tasha to pop the kids out so we can travel with them and play on the beach and run around in the trees.


01-01-13 | Chicken Coop Learnings

Some hindsight and lessons learned after living a while with my first coop, some mistakes made and things I'll do differently the next time. In all cases I'm assuming a backyard-size flock, 10 birds or less. Obviously different considerations apply to large-scale coops. Also I'm assuming that you live somewhere relatively warm (winters above 20 degrees); in the super-cold different considerations apply.

1. Chickens don't need a big coop. They don't like to be inside, they like to be outside (as noted above, I'm assuming a decently warm climate). The coop is just for sleeping and laying. Almost all the coop designs you'll see on the internet, and all the fancy ones you can buy, are much too big. Not only is it a waste of time and materials to build a big coop, it's a huge disadvantage because it takes up more space and is more work to clean and is harder to move.

2. Don't build a coop you can walk inside. As per #1, the coop should be small, and it should be high (chickens like to be up high to sleep). All you need is a small raised box. You do not need a door for humans or a floor at human height. Do, however, put an entire wall or roof on hinges so that you can open up the whole thing and easily reach every corner.

3. Don't over-engineer. Because the coop only needs to bear chicken-weight not human-weight, there's no need to use 2x4's or half inch plywood, you can use much lighter and smaller construction materials. Again most of the internet designs and coops you can buy are just way off here, way over-engineered. (it does need to be strong enough to be wind-proof and dog-proof; dogs are by far the biggest hazard to urban chickens).

Even if you want a movable coop, you don't really need wheels if you use suitably light building materials and are moderately athletic. It's very easy to just pick up a small coop and move it around the yard as needed.

4. Paint. I painted the inside of the coop, and some sites & people consider this silly and froo-froo, but I think it was a good call and would do it again. A thick coat of high-gloss paint provides great waterproofing and a smooth surface, which makes for much easier cleanup and longer life.

5. Rain/Snow. In contrast to #3, you should *not* cut corners in following good practices for weather-proofing. In particular, don't leave exposed edges of plywood or sheathing (they delaminate very easily), do use good shingle-principle for roofing (overlap and cover holes), use a proper drip-edge to prevent water wrapping around, etc.

6. Doors. I put a bunch of doors in the coop and one thing I didn't really consider was that all the poop and shavings and such will constantly be getting in the door jamb, which will prevent closure if it's a tight fit. One option is just to intentionally make a sloppy door that's a loose fit; another is to put some kind of trough near the door so that closing it pushes out the crud into the gap. Many designs, including mine, feature a door hinged at the bottom, so that when it opens it becomes a ramp. This seems clever but is not a very functional door because of the poop-in-the-hinges problem, it just becomes a static ramp. Probably the best type of door is top-hinged, with a raised bottom sill to prevent crud building up there. There's just not a lot of need for doors though; if you make the whole coop open for cleanability (such as via a hinged or removable roof), you can just use that to get the eggs as well; there's no need for the cute little nesting boxes with individual doors that people do.

7. The roost is the backbone of the coop. The chickens will spend 90% of their indoor time on the roosts, so locating the roost is the most important aspect of the design. The coop is really just the roost and the nesting boxes, the chickens want to spend their time outside in the run or free ranging, not on the floor of the coop.

8. The Poop Trough. Because of #7, I've found that almost all the chicken poop that's inside the coop is in a perfect straight line under the roost. I think you could take advantage of that and put an angled trough under the roost so that the poop was super easy to clean out. Another option would be a line of wire mesh instead of solid floor under the roost, perhaps with a removable trough under the wire mesh.

9. Rats. You have to decide from the beginning if you want to try to make a rat-proof coop. Doing so is a major undertaking and requires careful design. For example, chicken wire is not rat-proof. To make a rat-proof coop, first you need a solid stone foundation (for a small coop the easiest way to rat-proof the floor is just to cover the whole floor with pavers or bricks; for a larger coop you wouldn't want to do that, so you have to dig down at least 1 foot underground and surround the perimeter with rat-proof wire mesh or concrete blocks; rats are excellent diggers). Then the entire coop must be surrounded with hardware cloth (wire mesh) or similar. Rats are also superb climbers and jumpers, so vertical barriers will not stop them (you need a closed roof).

Some people try to rat-proof by putting wire on the floor (rather than a solid paver floor or burying a barrier around the perimeter). This is not a great idea. What will happen is the rats will still dig under the coop and create a network of tunnels under the wire floor. The chickens knock their feed all around, so lots (most) of it will fall through the wire mesh into the gap below it, and the rats will have a party living in the dirt under the wire floor. This might be okay with you (at least the rats are not actually in the chicken's space) but I think that overall wire on the floor is actually worse than nothing.

10. Feeders. Lots of people advocate these big automatic hanging feeders that you can fill with feed and it will drop down to let out more. Unless you have made a seriously rat-proof coop, these things are a terrible idea. Rats with an unlimited supply of food like that will multiply incredibly rapidly. You're going to want to visit the chickens every day anyway, so I see no advantage to these gravity feeders, just give them their ration each day so that there aren't a lot of left-overs for the vermin.

11. The Run. You have to decide up front whether you are going to free-range the chickens or not. If you are going to free-range them, then you don't need any run at all, just let them out in the yard. If not, then you need a big run. A tiny run (like under the popular commercial A-frame "chicken tractor") is pointless and cruel. If I had a decent amount of land I would build a simple run by just putting in some posts and wrapping it in chicken wire. (obviously this run is not rat proof). There's no need to cover the top of a large run (assuming as above you do not use a big feeder which would attract other birds).

12. Free ranging in your yard kind of sucks. Chickens love to dig in soft soil, so will go after your new plantings and vegetable beds and dig up your seedlings. They like to sit on railings and handles and poop. You will have poop all over everything. It's not awesome. On the other hand, it is very easy. They will eat a better diet without you having to carefully manage the supplements in their feed. They also naturally return to their coop at night so you don't really have to do any work to get them in and out, they do it themselves.

13. The poop pile. If you are going to try to reuse the poop and shavings you get when you clean out the coop as manure, you need to locate a spot for the poop to rest. You will get a *lot* of waste out of the coop, so you need a big spot, and you need at least two piles so you can cycle the new into the old (like compost; poop needs 2-3 months rest before use). The poop pile should not be near the coop (or run) and should also not be near your planting beds to avoid pest and pathogen transfer. It can be hard to find a good location for the poop pile in an urban yard, so you may want to abandon this idea and just throw out the poop. The poop pile will also attract rats and flies (but of course so will composting); it may also attract justifiably irate neighbors.


12-22-12 | Data Considered Harmful

I believe that the modern trend of doing some very superficial data analysis to prove a point or support an argument is extremely harmful. It gives arguments a false air of scientific basis that is in fact spurious.

I've been thinking about this for a while, but this Washington Post blog about the correlation of video games and gun violence recently popped into my blog feed, so I'll use it as an example.

The Washington Post blog leads you to believe that the data shows an unequivocal lack of correlation between videogames and gun violence. That's retarded. It only takes one glance at the chart to see that the data is completely dominated by other factors, like probably most strongly the gun ownership rate. You can't possibly find the effect of a minor contributing factor without normalizing for the other factors; most of these "analyses" fail to do that, which makes them totally bogus. Furthermore, as usual, you would need a much larger sample size to have any confidence in the data, and you'd have to question the selection of data that was done. Also, the entire thing being charted is wrong; it shouldn't be video game spending per capita, it should be video games played per capita (especially with China on there), and it shouldn't be gun-related murders, it should be all murders (because the fraction of murders that is gun related varies strongly by gun control laws, while the all murders rate varies more directly with the level of economic and social development in a country).

(Using data and charts and graphs has been a very popular way to respond to the recent shootings. Every single one that I've seen is complete nonsense. People just want to make a point that they've previously decided, so they trot out some data to "prove it" or make it "non-partisan" as if their bogus charts somehow make it "factual". It's pathetic. Here's a good example of using tons of data to show absolutely nothing . If you want to make an editorial point, just write your opinion, don't trot out bogus charts to "back it up". )

It's extremely popular these days to "prove" that some intuition is wrong by finding some data that shows a reverse correlation. (blame Freakonomics, among other things). You get lots of this in the smarmy TED talks - "you may expect that stabbing yourself in the eye with a pencil is harmful, but in fact these studies show that stabbing yourself in the eye is correlated to longer life expectancy!" (and then everyone claps). The problem with all this cute semi-intellectualism is that it's very often just wrong.

Aside from just poor data analysis, one of the major flaws with this kind of reasoning is the assumption that you are measuring all the inputs and all the outputs.

An obvious case is education, where you get all kinds of bogus studies that show such-and-such program "improves learning". Well, how did you actually measure learning? Obviously something like cutting music programs out of schools "improves learning" if you measure "learning" in a myopic way that doesn't include the benefits of music. And of course you must also ask what else was changed between the measured kids and the control (selection bias, novelty effect, etc; essentially all the studies on charter schools are total nonsense since any selection of students and new environment will produce a short term improvement).

I believe that choosing the wrong inputs and outputs is even worse than the poor data analysis, because it can be so hidden. Quite often there are some huge (bogus) logical leaps where the article will measure some narrow output and then proceed to talk about it as if it was just "better". Even when your data analysis was correct, you did not show it was better, you showed that one specific narrow output that you chose to measure improved, and you have to be very careful to not start using more general words.

(one of the great classic "wrong output" mistakes is measuring GDP to decide if a government financial policy was successful; this is one of those cases where economists have in fact done very sophisticated data analysis, but with a misleadingly narrow output)

Being repetitive : it's okay if you are actually very specific and careful not to extrapolate. eg. if you say "lowering interest rates increased GDP" and you are careful not to ever imply that "increased GDP" necessarily means "was good for the economy" (or that "was good for the economy" meant "was good for the population"); the problem is that people are sloppy, in their analysis and their implications and their reading, so it becomes "lowering interest rates improved the well-being of the population" and that becomes accepted wisdom.

Of course you can transparently see the vapidity of most of these analyses because they don't propagate error bars. If they actually took the errors of the measurement, corrected for the error of the sample size, propagated it through the correlation calculation and gave a confidence at the end, you would see things like "we measured a 5% improvement (+- 50%)" , which is no data at all.
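(To make that concrete, here's a back-of-envelope sketch; the numbers are made up for illustration, not from any of the studies above. The standard error of a measured correlation r over n samples is roughly sqrt((1-r^2)/(n-2)). So a study that finds r = 0.2 from n = 30 subjects has a standard error of about 0.19, and the 95% confidence interval is roughly -0.16 to +0.56. That is, the data is consistent with no effect, a small effect, or a huge effect - zero information - but it gets reported as "correlation found!".)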

I saw Brian Cox on QI recently, and there was some point about the US government testing whether heavy doses of LSD helped schizophrenics or not. Everyone was aghast but Brian popped up with "actually I support data-based medicine; if it had been shown to help then I would be for that therapy". Now obviously this was a jokey context so I'll cut Cox some slack, but it does in fact reflect a very commonly held belief these days (that we should trust the data more than our common sense that it's a terrible idea). And it's just obviously retarded on the face of it. If the study had shown it to help, then obviously something was wrong with the study. Medical studies are almost always so flawed that it's hard to believe any of them. Selection bias is huge, novelty and placebo effect are huge; but even if you really have controlled for all that, the other big failure is that they are too short term, and the "output" is much too narrow. You may have improved the thing you were measuring for, but done lots of other harm that you didn't see. Perhaps they did measure a decrease in certain schizophrenia symptoms (but psychotic lapses and suicides were way up; oops that wasn't part of the output we measured).

Exercise/dieting and child-rearing are two major topics where you are just bombarded with nonsense pseudo-science "correlations" all the time.

Of course political/economic charts are useless and misleading. A classic falsehood that gets trotted out regularly is the charts showing "the economy does better under democrats" ; for one thing the sample size is just so small that it could be totally random ; for another the economy is more affected by the previous president than the current ; and in almost every case huge external factors are massively more important (what's the Fed rate, did Al Gore recently invent the internet, are we in a war or an oil crisis, etc.). People love to show that chart but it is *pure garbage* ; it contains zero information. Similarly the charts about how the economy does right after a tax raise or decrease; again there are so many confounding factors and the sample size is so tiny, but more importantly tax raises tend to happen when government receipts are low (eg. economic growth is already slowing), while tax cuts tend to happen in flush times, so saying "tax cuts lead to growth" is really saying "growth leads to growth".

What I'm trying to get at in this post is not the ridiculous lack of science in all these studies and "facts", but the way that the popular press (and the semi-intellectual world of blogs and talks and magazines) use charts and graphs to present "data" to legitimize the bogus point.

I believe that any time you see a chart or graph in the popular press you should look away.

I know they are seductive and fun, and they give you a vapid conversation piece ("did you know that christmas lights are correlated with impotence?") but they in fact poison the brain with falsehoods.

Finally, back to the issue of video games and violence. I believe it is obvious on the face of it that video games contribute to violence. Of course they do. Especially at a young age, if a kid grows up shooting virtual men in the face it has to have some effect (especially on people who are already mentally unstable). Is it a big factor? Probably not; by far the biggest factor in violence is poverty, then government instability and human rights, then the gun ownership rate, the ease of gun purchasing, etc. I suspect that the general gun glorification in America is a much bigger effect, as is growing up in a home where your parents had guns, going to the shooting range as a child, rappers glorifying violence, movies and TV. Somewhere after all that, I'm sure video games contribute. The only thing we can actually say scientifically is that the effect is very small and almost impossible to measure due to the presence of much larger and highly uncertain factors.

(of course we should also recognize that these kind of crazy school shooting events are completely different than ordinary violence, and statistically are a drop in the bucket. I suspect the rare mass-murder psycho killer things are more related to a country's mental health system than anything else. Pulling out the total murder numbers as a response to these rare psychotic events is another example of using the wrong data and then glossing over the illogical jump.)

I think in almost all cases if you don't play pretend with data and just go and sit quietly and think about the problem and tap into your own brain, you will come to better conclusions.


12-21-12 | File Handle to File Name on Windows

There are a lot of posts about this floating around, most not quite right. Trying to sort it out once and for all. Note that in all cases I want to resolve back to a "final" name (that is, remove symlinks, substs, net uses, etc.). I do not believe that the methods I present here guarantee a "canonical" name, eg. a name that's always the same if it refers to the same file - that would be a nice extra step to have.

This post will be code-heavy and the code will be ugly. This code is all sloppy about buffer sizes and string over-runs and such, so DO NOT copy-paste it into production unless you want security holes. (a particular nasty point to be wary of is that many of the APIs differ in whether they take a buffer size in bytes or chars, which with unicode is different)
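(A quick illustration of that bytes-vs-chars hazard; this example is mine, not from the code below, and assumes "key" is an already-opened HKEY :)


wchar_t buf[MAX_PATH];

// GetCurrentDirectoryW takes the buffer size in *characters* :
GetCurrentDirectoryW( ARRAYSIZE(buf), buf );

// RegQueryValueExW takes the buffer size in *bytes* :
DWORD bytes = sizeof(buf);
RegQueryValueExW( key, L"SomeValue", 0, NULL, (BYTE *)buf, &bytes );

// mix those two conventions up and you either truncate silently or over-run the buffer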

We're gonna use these helpers to call into Windows DLLs :


template <typename t_func_type>
t_func_type GetWindowsImport( t_func_type * pFunc , const char * funcName, const char * libName , bool dothrow)
{
    if ( *pFunc == 0 )
    {
        HMODULE m = GetModuleHandle(libName);
        if ( m == 0 ) m = LoadLibrary(libName); // adds extension for you
        ASSERT_RELEASE( m != 0 );
        t_func_type f = (t_func_type) GetProcAddress( m, funcName );
        if ( f == 0 && dothrow )
        {
            throw funcName;
        }
        *pFunc = f;
    }
    return (*pFunc); 
}

// GET_IMPORT can return NULL
#define GET_IMPORT(lib,name) (GetWindowsImport(&STRING_JOIN(fp_,name),STRINGIZE(name),lib,false))

// CALL_IMPORT throws if not found
#define CALL_IMPORT(lib,name) (*GetWindowsImport(&STRING_JOIN(fp_,name),STRINGIZE(name),lib,true))
#define CALL_KERNEL32(name) CALL_IMPORT("kernel32",name)
#define CALL_NT(name) CALL_IMPORT("ntdll",name)

I also make use of the cblib strlen, strcpy, etc. on wchars. Their implementation is obvious.
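(For completeness, a minimal sketch of the cblib bits assumed above; this is my guess at the obvious implementations, not the actual cblib source :)


// two-level stringize / token-paste , the standard preprocessor idiom :
#define STRINGIZE_2(x)      #x
#define STRINGIZE(x)        STRINGIZE_2(x)
#define STRING_JOIN_2(a,b)  a##b
#define STRING_JOIN(a,b)    STRING_JOIN_2(a,b)

// wchar overloads of the string helpers (the CRT wcslen/wcscpy would do just as well) :
inline size_t strlen(const wchar_t * s) { size_t n = 0; while ( s[n] ) n++; return n; }
inline void strcpy(wchar_t * to,const wchar_t * from) { while ( (*to++ = *from++) != 0 ) { } }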

Also, for reference, to open a file handle just to read its attributes (to map its name) you use :


    HANDLE f = CreateFile(from,
        FILE_READ_ATTRIBUTES |
        STANDARD_RIGHTS_READ
        ,FILE_SHARE_READ,0,OPEN_EXISTING,FILE_FLAG_BACKUP_SEMANTICS,0);

(also works on directories; the FILE_FLAG_BACKUP_SEMANTICS flag is what makes it legal to open a directory handle).

Okay now : How to get a final path name from a file handle :

1. On Vista+ , just use GetFinalPathNameByHandle.

GetFinalPathNameByHandle gives you back a "\\?\" prefixed path, or "\\?\UNC\" for network shares.
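Here's a sketch of that, using the dynamic-import helpers above so the same binary still loads pre-Vista; the fp_ declaration and the wrapper function are my sketch, not part of the original code :


DWORD
(WINAPI *
fp_GetFinalPathNameByHandleW)(
HANDLE hFile, LPWSTR lpszFilePath, DWORD cchFilePath, DWORD dwFlags )
= 0;

bool GetFileNameFromHandleW_Vista(HANDLE f,wchar_t * pszFilename,int pszFilenameChars)
{
    pszFilename[0] = 0;

    // GET_IMPORT returns NULL pre-Vista instead of throwing :
    if ( GET_IMPORT("kernel32",GetFinalPathNameByHandleW) == 0 )
        return false; // pre-Vista ; fall back to the methods below

    // FILE_NAME_NORMALIZED|VOLUME_NAME_DOS gives the "\\?\C:\..." style final name :
    DWORD len = (*fp_GetFinalPathNameByHandleW)(f,pszFilename,(DWORD)pszFilenameChars,
                        FILE_NAME_NORMALIZED|VOLUME_NAME_DOS);

    // returns 0 on failure ; if the buffer was too small, returns the size needed :
    return ( len > 0 && len < (DWORD)pszFilenameChars );
}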

2. Pre-Vista, lots of people recommend mem-mapping the file and then using GetMappedFileName.

This is a bad suggestion. It doesn't work on directories. It requires that you actually have the file open for read, which is of course impossible in some scenarios. It's just generally a non-robust way to get a file name from a handle.

For the record, here is the code from MSDN to get a file name from handle using GetMappedFileName. Note that GetMappedFileName gives you back an NT-namespace name, and I have factored out the bit to convert that to Win32 into MapNtDriveName, which we'll come back to later.



BOOL GetFileNameFromHandleW_Map(HANDLE hFile,wchar_t * pszFilename,int pszFilenameSize)
{
    BOOL bSuccess = FALSE;
    HANDLE hFileMap;

    pszFilename[0] = 0;

    // Get the file size.
    DWORD dwFileSizeHi = 0;
    DWORD dwFileSizeLo = GetFileSize(hFile, &dwFileSizeHi); 

    if( dwFileSizeLo == 0 && dwFileSizeHi == 0 )
    {
        lprintf(("Cannot map a file with a length of zero.\n"));
        return FALSE;
    }

    // Create a file mapping object.
    hFileMap = CreateFileMapping(hFile, 
                    NULL, 
                    PAGE_READONLY,
                    0, 
                    1,
                    NULL);

    if (hFileMap) 
    {
        // Create a file mapping to get the file name.
        void* pMem = MapViewOfFile(hFileMap, FILE_MAP_READ, 0, 0, 1);

        if (pMem) 
        {
            if (GetMappedFileNameW(GetCurrentProcess(), 
                                 pMem, 
                                 pszFilename,
                                 pszFilenameSize)) // buffer size is in chars ; don't hard-code MAX_PATH
            {
                //pszFilename is an NT-space name :
                //pszFilename = "\Device\HarddiskVolume4\devel\projects\oodle\z.bat"

                wchar_t temp[2048];
                strcpy(temp,pszFilename);
                MapNtDriveName(temp,pszFilename);

                // only report success if the name was actually retrieved :
                bSuccess = TRUE;
            }
            UnmapViewOfFile(pMem);
        } 

        CloseHandle(hFileMap);
    }
    else
    {
        return FALSE;
    }

    return(bSuccess);
}

3. There's a more direct way to get the name from file handle : NtQueryObject.

NtQueryObject gives you the name of any handle. If it's a file handle, you get the file name. This name is an NT namespace name, so you have to map it down of course.

The core code is :


typedef enum _OBJECT_INFORMATION_CLASS {
    ObjectBasicInformation,
    ObjectNameInformation,
    ObjectTypeInformation,
    ObjectAllInformation,
    ObjectDataInformation
} OBJECT_INFORMATION_CLASS, *POBJECT_INFORMATION_CLASS;

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

typedef struct _OBJECT_NAME_INFORMATION {
    UNICODE_STRING Name;
    WCHAR NameBuffer[1];
} OBJECT_NAME_INFORMATION, *POBJECT_NAME_INFORMATION;


NTSTATUS
(NTAPI *
fp_NtQueryObject)(
    IN HANDLE ObjectHandle,
    IN OBJECT_INFORMATION_CLASS ObjectInformationClass,
    OUT PVOID ObjectInformation,
    IN ULONG Length,
    OUT PULONG ResultLength )
= 0;


{
    // f is the file HANDLE we want the name of :
    char infobuf[4096];
    ULONG ResultLength = 0;

    // (a real implementation should check the returned NTSTATUS,
    //  and retry with a bigger buffer if ResultLength > sizeof(infobuf))
    CALL_NT(NtQueryObject)(f,
        ObjectNameInformation,
        infobuf,
        sizeof(infobuf),
        &ResultLength);

    OBJECT_NAME_INFORMATION * pinfo = (OBJECT_NAME_INFORMATION *) infobuf;

    wchar_t * ps = pinfo->NameBuffer;
    // pinfo->Name.Length is in BYTES , not wchars
    ps[ pinfo->Name.Length / 2 ] = 0;

    lprintf("OBJECT_NAME_INFORMATION: (%S)\n",ps);
}

which will give you a name like :

    OBJECT_NAME_INFORMATION: (\Device\HarddiskVolume1\devel\projects\oodle\examples\oodle_future.h)

and then you just have to pull off the drive part and call MapNtDriveName (mentioned previously but not yet detailed).
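"Pull off the drive part" just means splitting at the third backslash; something like this hypothetical helper (mine, not cblib) :


// split "\Device\HarddiskVolume1\devel\foo.h" into the drive part
//  ("\Device\HarddiskVolume1") and the rest ("\devel\foo.h") :
const wchar_t * SplitNtDrivePart(const wchar_t * ntName,wchar_t * drivePart,int drivePartChars)
{
    int slashes = 0;
    const wchar_t * p = ntName;
    while ( *p )
    {
        if ( *p == L'\\' && ++slashes == 3 )
            break;
        p++;
    }
    int len = (int)( p - ntName );
    if ( len >= drivePartChars ) len = drivePartChars-1;
    memcpy(drivePart,ntName,len*sizeof(wchar_t));
    drivePart[len] = 0;
    return p; // rest of the path (starts with '\' , or empty string)
}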

Note that there's another call that looks appealing :


    IO_STATUS_BLOCK block; // (declaration was implied in the original snippet)
    CALL_NT(NtQueryInformationFile)(f,
        &block,
        infobuf,
        sizeof(infobuf),
        FileNameInformation);

but NtQueryInformationFile seems to always give you just the file path without the drive. In fact it seems possible to use NtQueryInformationFile together with NtQueryObject to separate the drive part from the path part.

That is, you get something like :


t: is substed to c:\trans

LogDosDrives prints :

T: : \??\C:\trans

we ask about :

fmName : t:\prefs.js

we get :

NtQueryInformationFile: "\trans\prefs.js"
NtQueryObject: "\Device\HarddiskVolume4\trans\prefs.js"

If there were a way to get the drive letter, you could just use NtQueryInformationFile , but as far as I know there is no simple way, so we have to go through all this mess.

On network shares, it's similar but a little different :


y: is net used to \\charlesbpc\C$

LogDosDrives prints :

Y: : \Device\LanmanRedirector\;Y:0000000000034569\charlesbpc\C$

we ask about :

fmName : y:\xfer\path.txt

we get :

NtQueryInformationFile: "\charlesbpc\C$\xfer\path.txt"
NtQueryObject: "\Device\Mup\charlesbpc\C$\xfer\path.txt"

so in that case you could just prepend a "\" to the NtQueryInformationFile result, but again I'm not sure how you'd know that what you got was a network share and not just a directory, so we'll go through the whole mess here to figure it out.

4. MapNtDriveName is needed to map an NT-namespace drive name to a Win32/DOS-namespace name.

I've found two different ways of doing this, and they seem to produce the same results in all the tests I've run, so it's unclear if one is better than the other.

4.A. MapNtDriveName by QueryDosDevice

QueryDosDevice gives you the NT name of a DOS drive. This is the opposite of what we want, so we have to reverse the mapping. The trick is GetLogicalDriveStrings, which gives you all the DOS drive letters; you can then look up each one to get its NT name, and thus build the reverse mapping.

Here's LogDosDrives :


void LogDosDrives()
{
    #define BUFSIZE 2048
    // Translate path with device name to drive letters.
    wchar_t szTemp[BUFSIZE];
    szTemp[0] = '\0';

    // GetLogicalDriveStrings
    //  gives you the DOS drives on the system
    //  including substs and network drives
    if (GetLogicalDriveStringsW(BUFSIZE-1, szTemp)) 
    {
      wchar_t szName[MAX_PATH];
      wchar_t szDrive[3] = (L" :");

      wchar_t * p = szTemp;

      do 
      {
        // Copy the drive letter to the template string
        *szDrive = *p;

        // Look up each device name
        if (QueryDosDeviceW(szDrive, szName, MAX_PATH))
        {
            lprintf("%S : %S\n",szDrive,szName);
        }

        // Go to the next NULL character.
        while (*p++);
        
      } while ( *p); // double-null is end of drives list
    }

    return;
}

/**

LogDosDrives prints stuff like :

A: : \Device\Floppy0
C: : \Device\HarddiskVolume1
D: : \Device\HarddiskVolume2
E: : \Device\CdRom0
H: : \Device\CdRom1
I: : \Device\CdRom2
M: : \??\D:\misc
R: : \??\D:\ramdisk
S: : \??\D:\ramdisk
T: : \??\D:\trans
V: : \??\C:
W: : \Device\LanmanRedirector\;W:0000000000024326\radnet\raddevel
Y: : \Device\LanmanRedirector\;Y:0000000000024326\radnet\radmedia
Z: : \Device\LanmanRedirector\;Z:0000000000024326\charlesb-pc\c

**/

Recall from the last post that "\??\" is the NT-namespace way of mapping back to the win32 namespace. Those are substed drives. The "net use" drives get the "Lanman" prefix.

MapNtDriveName using QueryDosDevice is :


bool MapNtDriveName_QueryDosDevice(const wchar_t * from,wchar_t * to)
{
    #define BUFSIZE 2048
    // Translate path with device name to drive letters.
    wchar_t allDosDrives[BUFSIZE];
    allDosDrives[0] = '\0';

    // GetLogicalDriveStrings
    //  gives you the DOS drives on the system
    //  including substs and network drives
    if (GetLogicalDriveStringsW(BUFSIZE-1, allDosDrives)) 
    {
        wchar_t * pDosDrives = allDosDrives;

        do 
        {
            // Copy the drive letter to the template string
            wchar_t dosDrive[3] = (L" :");
            *dosDrive = *pDosDrives;

            // Look up each device name
            wchar_t ntDriveName[BUFSIZE];
            if ( QueryDosDeviceW(dosDrive, ntDriveName, ARRAY_SIZE(ntDriveName)) )
            {
                size_t ntDriveNameLen = wcslen(ntDriveName);

                if ( _wcsnicmp(from, ntDriveName, ntDriveNameLen) == 0
                         && ( from[ntDriveNameLen] == L'\\' || from[ntDriveNameLen] == 0 ) )
                {
                    wcscpy(to,dosDrive);
                    wcscat(to,from+ntDriveNameLen);
                            
                    return true;
                }
            }

            // Go to the next NULL character.
            while (*pDosDrives++);

        } while ( *pDosDrives); // double-null is end of drives list
    }

    return false;
}

4.B. MapNtDriveName by IOControl :

There's a more direct way using DeviceIoControl. You just send a message to the "MountPointManager" which is the guy who controls these mappings. (this is from "Mehrdad" on Stackoverflow) :


struct MOUNTMGR_TARGET_NAME { USHORT DeviceNameLength; WCHAR DeviceName[1]; };
struct MOUNTMGR_VOLUME_PATHS { ULONG MultiSzLength; WCHAR MultiSz[1]; };

#define MOUNTMGRCONTROLTYPE ((ULONG) 'm')
#define IOCTL_MOUNTMGR_QUERY_DOS_VOLUME_PATH \
    CTL_CODE(MOUNTMGRCONTROLTYPE, 12, METHOD_BUFFERED, FILE_ANY_ACCESS)

union ANY_BUFFER {
    MOUNTMGR_TARGET_NAME TargetName;
    MOUNTMGR_VOLUME_PATHS TargetPaths;
    char Buffer[4096];
};

bool MapNtDriveName_IoControl(const wchar_t * from,wchar_t * to)
{
    ANY_BUFFER nameMnt;
    
    int fromLen = (int) wcslen(from);
    // DeviceNameLength is in *bytes*
    nameMnt.TargetName.DeviceNameLength = (USHORT) ( 2 * fromLen );
    wcscpy(nameMnt.TargetName.DeviceName, from );
    
    HANDLE hMountPointMgr = CreateFileW( L"\\\\.\\MountPointManager",
        0, FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL, OPEN_EXISTING, 0, NULL);
        
    ASSERT_RELEASE( hMountPointMgr != INVALID_HANDLE_VALUE );
        
    DWORD bytesReturned;
    BOOL success = DeviceIoControl(hMountPointMgr,
        IOCTL_MOUNTMGR_QUERY_DOS_VOLUME_PATH, &nameMnt,
        sizeof(nameMnt), &nameMnt, sizeof(nameMnt),
        &bytesReturned, NULL);

    CloseHandle(hMountPointMgr);
    
    if ( success && nameMnt.TargetPaths.MultiSzLength > 0 )
    {    
        wcscpy(to,nameMnt.TargetPaths.MultiSz);

        return true;    
    }
    else
    {    
        return false;
    }
}

5. Fix MapNtDriveName for network names.

I said that MapNtDriveName_IoControl and MapNtDriveName_QueryDosDevice produced the same results and both worked. Well, that's only true for local drives. For network drives they both fail, but in different ways. MapNtDriveName_QueryDosDevice just won't find network drives, while MapNtDriveName_IoControl will hang for a long time and eventually time out with a failure.

We can fix it easily though because the NT path for a network share contains the valid win32 path as a suffix, so all we have to do is grab that suffix.


bool MapNtDriveName(const wchar_t * from,wchar_t * to)
{
    // hard-code network drives :
    if ( strisame(from,L"\\Device\\Mup") || strisame(from,L"\\Device\\LanmanRedirector") )
    {
        strcpy(to,L"\\");
        return true;
    }

    // either one :
    //return MapNtDriveName_IoControl(from,to);
    return MapNtDriveName_QueryDosDevice(from,to);
}

This just takes the NT-namespace network paths, like :

"\Device\Mup\charlesbpc\C$\xfer\path.txt"

->

"\\charlesbpc\C$\xfer\path.txt"

And we're done.


12-21-12 | File Name Namespaces on Windows

A little bit fast and loose but trying to summarize some insanity from a practical point of view.

Windows has various "namespaces" or classes of file names :

1. DOS Names :

"c:\blah" and such.

Max path of 260 including drive and trailing null. Different cases refer to the same file, *however* different unicode encodings of the same character do *NOT* refer to the same file (eg. a precomposed "accented e" and "e + combining accent" are different files). See previous posts about code pages and the general unicode disaster on Windows.

I'm going to ignore the 8.3 legacy junk, though it still has some funny lingering effects on even "long" DOS names. (for example, the longest path name length allowed is 244 characters, because they require room for an 8.3 name after the longest path).

2. Win32 Names :

This includes all DOS names plus all network paths like "\\server\blah".

The Win32 APIs can also take the "\\?\" names, which are sort of a way of peeking into the lower-level NT names.

Many people incorrectly think the big difference with the "\\?\" names is that the length can be much longer (32768 instead of 260), but IMO the bigger difference is that the name that follows is treated as raw characters. That is, you can have "/" or "." or ".." or whatever in the name - they do not get any processing. Very scary. I've seen lots of code that blindly assumes it can add or remove "\\?\" with impunity - that is not true!


"\\?\c:\" is a local path

"\\?\UNC\server\blah" is a network name like "\\server\blah"

Assuming you have your drives shared, you can get to yourself as "\\localhost\c$\"

I think the "\\?\" namespace is totally insane and using it is a Very Bad Idea. The vast majority of apps will do the wrong thing when given it, and many will crash.

3. NT names :

Win32 is built on "ntdll" which internally uses another style of name. They start with "\" and then refer to the drivers used to access them, like :

"\Device\HarddiskVolume1\devel\projects\oodle"

In the NT namespace network shares are named :


Pre-Vista :

\Device\LanmanRedirector\<some per-user stuff>\server\share

Vista+ : Lanman way and also :

\Device\Mup\Server\share

And the NT namespace has a symbolic link to the entire Win32 namespace under "\Global??\" , so


"\Global??\c:\whatever"

is also a valid NT name, (and "\??\" is sometimes valid as a short version of "\Global??\").

What fun.


12-21-12 | Coroutine-centric Architecture

I've been talking about this for a while but maybe haven't written it all clearly in one place. So here goes. My proposal for a coroutine-centric architecture (for games).

1. Run one thread locked to each core.

(NOTE : this is only appropriate on something like a game console where you are in control of all the threads! Do not do this on an OS like Windows where other apps may also be locking to cores, and you have the thread affinity scheduler problems, and so on).

The one-thread-per-core set of threads is your thread pool. All code runs as "tasks" (or jobs or whatever) on the thread pool.

The threads never actually do ANY OS Waits. They never switch. They're not really threads, you're not using any of the OS threading any more. (I suppose you still are using the OS to handle signals and such, and there are probably some OS threads that are running which will grab some of your time, and you want that; but you are not using the OS threading in your code).

2. All functions are coroutines. A function with no yields in it is just a very simple coroutine. There's no special syntax to be a coroutine or call a coroutine.

All functions can take futures or return futures. (a future is just a value that's not yet ready). Whether you want this to be totally implicit or not is up to your taste about how much of the operations behind the scenes are visible in the code.

For example if you have a function like :


int func(int x);

and you call it with a future<int> :

future<int> y;
func(y);

it is promoted automatically to :

future<int> func( future<int> x )
{
    yield x;
    return func( x.value );
}

When you call a function, it is not a "branch", it's just a normal function call. If that function yields, it yields the whole current coroutine. That is, it works just like threads and waits, but with coroutines and yields instead.

To branch I would use a new keyword, like "start" :


future<int> some_async_func(int x);

int current_func(int y)
{

    // execution will step directly into this function;
    // when it yields, current_func will yield

    future<int> f1 = some_async_func(y);

    // with "start" a new coroutine is made and enqueued to the thread pool
    // my coroutine immediately continues to the f1.wait
    
    future<int> f2 = start some_async_func(y);
    
    return f1.wait();
}

"start" should really be an abbreviation for a two-phase launch, which allows a lot more flexibility. That is, "start" should be a shorthand for something like :


start some_async_func(y);

is

coro * c = new coro( some_async_func(y) );
c->start();

because that allows batch-starting, and things like setting dependencies after creating the coro, which I have found to be very useful in practice. eg :

coro * c[32];

for(i in 32)
{
    c[i] = new coro( );
    if ( i > 0 )
        c[i-1]->depends( c[i] );
}

start_all( c, 32 );

Batch starting is one of those things that people often leave out. Starting tasks one by one is just like waiting for them one by one (instead of using a wait_all) : it causes bad thread-thrashing (waking up and going back to sleep over and over, or switching back and forth).
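In the same pseudo-syntax, the difference is just :

// bad : one by one ; each boundary can wake a worker and put it back to sleep :
for(i in 32) { c[i]->start(); }
for(i in 32) { c[i]->wait(); }

// good : one batch submit, one joint wait :
start_all( c, 32 );
wait_all( c, 32 );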

3. Full stack-saving is crucial.

For this to be efficient you need a very small minimum stack size (4k is probably good) and you need stack-extension on demand.

You may have lots of pending coroutines sitting around and you don't want them gobbling all your memory with 64k stacks.

Full stack saving means you can do full variable capture for free, even in a language like C where tracking references is hard.

4. You stop using the OS mutex, semaphore, event, etc. and instead use coroutine variants.

Instead of a thread owning a lock, a coroutine owns a lock. When you block on a lock it's a yield of the coroutine instead a full OS wait.

Getting access to a mutex or semaphore is an event that can trigger coroutines being run or resumed. eg. it's a future just like the return from an async procedural call. So you can do things like :


future<int> y = some_async_func();

yield( y , my_mutex.when_lock() );

which yields your coroutine until the joint condition is met that the async func is done AND you can get the lock on "my_mutex".

Joint yields are very important because they prevent unnecessary coroutine wakeups. Coroutine thrashing is not nearly as bad as thread thrashing (that cheapness is one of the big advantages of a coroutine-centric architecture, in fact perhaps the biggest), but it's still worth avoiding.

You must have coroutine versions of all the ops that have delays (file IO, networking, GPU, etc) so that you can yield on them instead of doing thread-waits.
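eg. something like this sketch (hypothetical names) :

// coroutine-flavored IO : looks synchronous, but the wait is a Yield of
// this coroutine, so the worker thread goes off and runs other tasks :
future<Buffer> coReadFile( const char * name );

int Process( const char * name )
{
    future<Buffer> fb = coReadFile(name);   // kicks off the IO, returns immediately
    // ... other work here overlaps the IO ...
    Buffer b = fb.wait();                   // yields the coroutine, not the thread
    return Checksum(b);
}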

5. You must have some kind of GC.

Because coroutines will constantly be capturing values, you must ensure their lifetime is >= the life of the coroutine. GC is the only reasonable way to do this.

I would also go ahead and put an RW-lock in every object, since that will be necessary.

6. Dependencies and side effects should be expressed through args and return values.

You really need to get away from funcs like


void DoSomeStuff(void);

that have various un-knowable inputs and outputs. All inputs & outputs need to be values so that they can be used to create dependency chains.

When that's not directly possible, you must use a convention to express it. eg. for file manipulation I recommend using a string containing the file name to express the side effects that go through the file system (eg. for Rename, Delete, Copy, etc.).
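eg. a convention like this sketch (hypothetical signatures) :

// the file name string is the proxy for the side effect ; returning it as
// a future lets later ops depend on the change having happened :
future<string> coDeleteFile( future<string> name );
future<string> coCopyFile( future<string> from, future<string> to );

future<string> deleted = coDeleteFile( "b" );
future<string> copied  = coCopyFile( "a", deleted );    // can't start until the delete is done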

7. Note that coroutines do not fundamentally alter the difficulties of threading.

You still have races, deadlocks, etc. Basic async ops are much easier to write with coroutines, but they are no panacea and do not try to be anything other than a nicer way of writing threading. (eg. they are not transactional memory or any other auto-magic).

to be continued (perhaps) ....

Add 3/15/13 : 8. No static size anything. No resources you can run out of. This is another "best practice" that goes with modern thread design that I forgot to list.

Don't use fixed-size queues for thread communication; they seem like an optimization or simplification at first, but if you can ever hit the limit (and you will) they cause big problems. Don't assume a fixed number of workers or a maximum number of async ops in flight, this can cause deadlocks and be a big problem.

The thing is that a "coroutine centric" program is no longer so much like a normal imperative C program. It's moving towards a functional program where the path of execution is all nonlinear. You're setting a big graph to evaluate, and then you just need to be able to hit "go" and wait for the graph to close. If you run into some limit at some point during the graph evaluation, it's a big mess figuring out how to deal with that.

Of course the OS can impose limits on you (eg. running out of memory) and that is a hassle you have to deal with.


12-21-12 | Coroutines From Lambdas

Being pedantic while I'm on the topic. We've covered this before.

Any language with lambdas (that can be fired when an async completes) can simulate coroutines.

Assume we have some async function call :


future<int> AsyncFunc( int x );

which send the integer off over the net (or whatever) and eventually gets a result back. Assume that future<> has a "AndThen" which schedules a function to run when it's done.

Then you can write a sequence of operations like :


future<int> MySequenceOfOps( int x1 )
{
    x1++;

    future<int> f1 = AsyncFunc(x1);

    return f1.AndThen( [](int x2){

    x2 *= 2;

    future<int> f2 = AsyncFunc(x2);

    return f2.AndThen( [](int x3){

    x3 --;

    return x3;

    } );
    } );

}

with a little munging we can make it look more like a standard coroutine :

#define YIELD(future,args)  return future.AndThen( [](args){

future<int> MySequenceOfOps( int x1 )
{
    x1++;

    future<int> f1 = AsyncFunc(x1);

    YIELD(f1,int x2)

    x2 *= 2;

    future<int> f2 = AsyncFunc(x2);

    YIELD(f2,int x3)

    x3 --;

    return x3;

    } );
    } );

}

the only really ugly bit is that you have to put a bunch of scope-closers at the end to match the number of yields.

This is really what any coroutine is doing under the hood. When you hit a "yield", what it does is take the remainder of the function and package that up as a functor to get called after the async op that you're yielding on is done.

Coroutines from lambdas have a few disadvantages, aside from the scope-closers annoyance. It's ugly to do anything but simple linear control flow. The above example is the very simple case of "imperative, yield, imperative, yield" , but in real code you want to have things like :


if ( bool )
{
    YIELD
}

or

while ( some condition )
{

    YIELD
}

which while probably possible with lambda-coroutines, gets ugly.
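For example, a yield inside a loop has to turn into recursion ; something like this sketch (using the same hypothetical AsyncFunc/AndThen, and assuming AndThen flattens a returned future) :

// "while ( some condition ) { YIELD }" as a lambda-coroutine :
// the loop body becomes a function that re-schedules itself via AndThen :
future<int> LoopStep( int x )
{
    if ( ! some_condition(x) )
        return make_ready_future(x);    // hypothetical : an already-completed future

    future<int> f = AsyncFunc(x);

    return f.AndThen( [](int x2){ return LoopStep(x2); } );
}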

An advantage of lambda-coroutines is if you're in a language where you have lambdas with variable-capture, then you get that in your coroutines.


12-18-12 | Async/Await ; Microsoft's Coroutines

As usual I'm way behind in knowing what's going on in the world. Lo and behold, MS have done a coroutine system very similar to me, which they are now pushing as a fundamental technology of WinRT. Dear lord, help us all. (I guess this stuff has been in .NET since 2008 or so, but with WinRT it's now being pushed on C++/CX as well)

I'm just catching up on this, so I'm going to make some notes about things that took a minute to figure out. Correct me where I'm wrong.

For the most part I'll be talking in C# lingo, because this stuff comes from C# and is much more natural there. There are C++/CX versions of all this, but they're rather more ugly. Occasionally I'll dip into what it looks like in CX, which is where we start :

1. "hat" (eg. String^)

Hat is a pointer to a ref-counted object. The ^ means inc and dec ref in scope. In cbloom code String^ is StringPtr.

The main note : "hat" is a thread-safe ref count, *however* it implies no other thread safety. That is, the ref-counting and object destruction is thread safe / atomic , but derefs are not :


Thingy^ t = Get(); // thread safe ref increment here
t->var1 = t->var2; // non-thread safe var accesses!

There is no built-in mutex or anything like that for hat-objects.

2. "async" func keyword

Async is a new keyword that indicates a function might be a coroutine. It does not make the function into an asynchronous call. What it really is is a "structify" or "functor" keyword (plus a "switch"). Like a C++ lambda, the main thing the language does for you is package up all the local variables and function arguments and put them all in a struct. That is (playing rather loosely with the translation for brevity) :


async void MyFunc( int x )
{
    string y;

    stuff();
}

[ is transformed to : ]

struct MyFunc_functor
{
    int x;
    string y;

    void Do() { stuff(); }
};

void MyFunc( int x )
{
    // allocator functor object :
    MyFunc_functor * f = new MyFunc_functor();
    // copy in args :
    f->x = x;
    // run it :
    f->Do();
}

So obviously this functor that captures the function's state is the key to making this into an async coroutine.

It is *not* stack saving. However for simple usages it is the same. Obviously crucial to this is using a language like C# which has GC so all the references can be traced, and everything is on the heap (perhaps lazily). That is, in C++ you could have pointers and references that refer to things on the stack, so just packaging up the args like this doesn't work.
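eg. in C++ terms the problem is things like this (a sketch) :

// packaging up the args by value doesn't save you in C++ , because an
// arg can point into the caller's stack frame :
void caller()
{
    int local = 7;
    SomeAsyncFunc( &local );    // hypothetical async call ; the functor stores an int* ;
                                // by the time the continuation runs, caller's
                                // frame (and "local") may be long gone
}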

Note that in the above you didn't see any task creation or asynchronous func launching, because it's not. The "async" keyword does not make a function async, all it does is "functorify" it so that it *could* become async. (this is in contrast to C++11 where "async" is an imperative to "run this asynchronously").

3. No more threads.

WinRT is pushing very hard to remove manual control of threads from the developer. Instead you have an OS thread pool that can run your tasks.

Now, I actually am a fan of this model in a limited way. It's the model I've been advocating for games for a while. To be clear, what I think is good for games is : run 1 thread per core. All game code consists of tasks for the thread pool. There are no special purpose threads, any thread can run any type of task. All the threads are equal priority (there's only 1 per core so this is irrelevant as long as you don't add extra threads).

So, when a coroutine becomes async, it just enqueues to a thread pool.

There is this funny stuff about execution "context", because they couldn't make it actually clean (so that any task can run any thread in the pool); a "context" is a set of one or more threads with certain properties; the main one is the special UI context, which only gets one thread, which therefore can deadlock. This looks like a big mess to me, but as long as you aren't actually doing C# UI stuff you can ignore it.

See ConfigureAwait etc. There's a lot of control you might want that's intentionally missing. Things like how many real threads are in your thread pool; also things like "run this task on this particular thread" are forbidden (or even just "stay on the same thread"; you can only stay on the same context, which may be several threads).

4. "await" is a coroutine yield.

You can only use "await" inside an "async" func because it relies on the structification.

It's very much like the old C-coroutines using switch trick. await is given an Awaitable (an interface to an async op). At that point your struct is enqueued on the thread pool to run again when the Awaitable is ready.

"await" is a yield, so you may return to your caller immediately at the point that you await.

Note that because of this, "async/await" functions cannot have return values (* except for Task which we'll see next).

Note that "await" is the point at which an "async" function actually becomes async. That is, when you call an async function, it is *not* initially launched to the thread pool, instead it initially runs synchronously on the calling thread. (this is part of a general effort in the WinRT design to make the async functions not actually async whenever possible, minimizing thread switches and heap allocations). It only actually becomes an APC when you await something.

(aside : there is a hacky "await Task.Yield()" mechanism which kicks off your synchronous invocation of a coroutine to the thread pool without anything explicit to await)

I really don't like the name "await" because it's not a "wait" , it's a "yield". The current thread does not stop running, but the current function might be enqueued to continue later. If it is enqueued, then the current thread returns out of the function and continues in the calling context.

One major flaw I see is that you can only await one async; there's no yield_all or yield_any. Because of this you see people writing atrocious code like :

await x;
await y;
await z;
stuff(x,y,z);

Now they do provide a Task.WhenAll and Task.WhenAny , which create proxy tasks that complete when the desired condition is met, so it is possible to do it right (but much easier not to).

Of course "await" might not actually yield the coroutine; if the thing you are awaiting is already done, your coroutine may continue immediately. If you await a task that's not done (and also not already running), it might be run immediately on your thread. They intentionally don't want you to rely on any certain flow control, they leave it up to the "scheduler".

5. "Task" is a future.

The Task< > template is a future (or "promise" if you like) that provides a handle to get the result of a coroutine when it eventually completes. Because of the previously noted problem that "await" returns to the caller immediately, before your final return, you need a way to give the caller a handle to that result.

IAsyncOperation< > is the lower level C++/COM version of Task< > ; it's the same thing without the helper methods of Task.

IAsyncOperation.Status can be polled for completion. IAsyncOperation.GetResults can only be called after completed. IAsyncOperation.Completed is a callback function you can set to be run on completion. (*)

So far as I can tell there is no simple way to just Wait on an IAsyncOperation. (you can "await"). Obviously they are trying hard to prevent you from blocking threads in the pool. The method I've seen is to wrap it in a Task and then use Task.Wait()

(* = the .Completed member is a good example of a big annoyance : they play very fast-and-loose with documenting the thread safety semantics of the whole API. Now, I presume that for .Completed to make any sense it must be a thread-safe accessor, and it must be atomic with Status. Otherwise there would be a race where my completion handler would not get called. Presumably your completion handler is called once and only once. None of this is documented, and the same goes across the whole API. They just expect it all to magically work without you knowing how or why.)

(it seems that .NET used to have a Future< > as well, but that's gone since Task< > is just a future and having both is pointless (?))


So, in general if I read it as :


"async" = "coroutine"  (hacky C switch + functor encapsulation)

"await" = yield

"Task" = future

then it's pretty intuitive.


What's missing?

Well there are some places that are syntactically very ugly, but possible. (eg. working with IAsyncOperation/IAsyncInfo in general is super ugly; also the lack of simple "await x,y,z" is a mistake IMO).

There seems to be no way to easily automatically promote a synchronous function to async. That is, if you have something like :


int func1(int x) { return x+1; }

and you want to run it on a future of an int (Task< int >) , what you really want is just a simple syntax like :

future<int> x = some async func that returns an int

future<int> y = start func1( x );

which makes a coroutine that waits for its args to be ready and then runs the synchronous function. (maybe it's possible to write a helper template that does this?)
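In plain C++11 terms (std::future, not WinRT Task) such a helper might look like this sketch ; it burns a thread waiting on the argument rather than truly chaining, so it shows the shape but not an efficient implementation :

#include <future>

// promote a synchronous func to run when its future argument is ready :
template <typename R, typename A>
std::future<R> start_on( R (*func)(A), std::future<A> arg )
{
    return std::async( std::launch::async,
        [func]( std::future<A> a ) { return func( a.get() ); },
        std::move(arg) );
}

// usage :  std::future<int> y = start_on( func1, std::move(x) );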

Now it's tempting to do something like :


future<int> x = some async func that returns an int

int y = func1( await x );

and you see that all the time in example code, but of course that is not the same thing at all and has many drawbacks (it waits immediately even though "y" might not be needed for a while, it doesn't allow you to create async dependency chains, it requires you are already running as a coroutine, etc).

The bigger issue is that it's not a real stackful coroutine system, which means it's not "composable", something I've written about before :
cbloom rants 06-21-12 - Two Alternative Oodles
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

Specifically, a coroutine cannot call another function that does the await. This makes sense if you think of the "await" as being the hacky C-switch-#define thing, not a real language construct. The "async" on the func is the "switch {" and the "await" is a "case ". You cannot write utility functions that are usable in coroutines and may await.

To call functions that might await, they must be run as their own separate coroutine. When they await, they block their own coroutine, not your calling function. That is :


int helper( bool b , AsyncStream s )
{
    if ( b )
    {
        return 0;
    }
    else
    {
        int x = await s.Get<int>();
        return x + 10;
    }
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = helper( true, s );
    return x;
}

The idea here is that "myfunc1" is a coroutine, it calls a function ("helper") which does a yield; that yields out of the parent coroutine (myfunc1). That does not work and is not allowed. It is what I would like to see in a good coroutine-centric language. Instead you have to do something like :

async Task<int> helper( bool b , AsyncStream s )
{
    if ( b )
    {
        return 0;
    }
    else
    {
        int x = await s.Get<int>();
        return x + 10;
    }
}

async Task<int> myfunc1()
{
    AsyncStream s = open it;
    int x = await helper( true, s );
    return x;
}

Here "helper" is its own coroutine, and we have to block on it. Now it is worth noting that because WinRT is aggressive about delaying heap-allocation of coroutines and is aggresive about running coroutines immediately, the actual flow of the two cases is not very different.

To be extra clear : lack of composability means you can't just have something like "cofread" which acts like synchronous fread , but instead of blocking the thread when it doesn't have enough data, it yields the coroutine.

You also can't write your own "cosemaphore" or "comutex" that yield instead of waiting the thread. (does WinRT provide cosemaphore and comutex? To have a fully functional coroutine-centric language you need all that kind of stuff. What does the normal C# Mutex do when used in a coroutine? Block the whole thread?)
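What I mean by a "comutex" is something like this (pseudo-C++ sketch) :

// acquiring yields the coroutine instead of waiting the thread :
struct comutex
{
    future<void> when_lock();   // completes when this coroutine owns the lock
    void unlock();              // resumes the next pending coroutine, if any
};

// usage inside a coroutine :
//    yield( m.when_lock() );
//    ... critical section ...
//    m.unlock();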


There are a few places in the syntax that I find very dangerous due to their (false) apparent simplicity.

1. Args to coroutines are often references. When the coroutine is packaged into a struct and its execution is delayed, what you get is a non-thread-safe pointer to some shared object. It's incredibly easy to write code like :


async void func1( SomeStruct^ s )
{

    s->DoStuff();
    MoreStuff( s );

}

where in fact every touch of 's' is potentially a race and bug.

2. There is no syntax required to start a coroutine. This means you have no idea if functions are async or not at the call site!


void func2()
{
    DeleteFile("b");
    CopyFile("a","b");
}

Does this code work? No idea! They might be coroutines, in which case DeleteFile might return before the delete is actually done, and then I would be calling CopyFile while the delete is still pending. (if it is a coroutine, the fix is to call "await", assuming it returned a Task).

Obviously the problem arises from side effects. In this case the file system is the medium for communicating side effects. To use coroutine/future code cleanly, you need to try to make all functions take all their inputs as arguments, and return all their effects as return values. Even if the return is not necessary, you must return some kind of proxy to the change as a way of expressing the dependency.

"async void" functions are probably bad practice in general; you should at least return a Task with no data (future< void >) so that the caller has something to wait on if they want to. async functions with side effects are very dangerous but also very common. The fantasy that we'll all write pure functions that only read their args (by value) and put all output in their return values is absurd.


It's pretty bold of them to make this the official way to write new code for Windows. As an experimental C# language feature, I think it's pretty decent. But good lord man. Race city, here we come. The days of software having repeatable outcomes are over!

As a software design point, the whole idea that "async improves responsiveness" is rather disturbing. We're gonna get a lot more trickle-in GUIs, which is fucking awful. Yes, async is great for making tasks that the user *expects* to be slow to run in the background. What it should not be used for is hiding the slowness of tasks that should in fact be instant. Like when you open a new window, it should immediately appear fully populated with all its buttons and graphics - if there are widgets in the window that take a long time to appear, they should be fixed or deleted, not made to appear asynchronously.

The way web pages give you an initial view and then gradually trickle in updates? That is fucking awful and should be used as a last resort. It does not belong in applications where you have control over your content. But that is exactly what is being heavily pushed by MS for all WinRT apps.

Having buttons move around after they first appeared, or having buttons appear after the window first opened - that is *terrible* software.

(Windows 8 is of course itself an example; part of their trick for speeding up startup is to put more things delayed until after startup. You now have to boot up, and then sit there and twiddle your thumbs for a few minutes while it actually finishes starting up. (there are some tricks to reduce this, such as using Task Scheduler to force things to run immediately at the user login event))


Some links :

Jerry Nixon @work Windows 8 The right way to Read & Write Files in WinRT
Task.Wait and “Inlining” - .NET Parallel Programming - Site Home - MSDN Blogs
CreateThread for Windows 8 Metro - Shawn Hargreaves Blog - Site Home - MSDN Blogs
Diving deep with WinRT and await - Windows 8 app developer blog - Site Home - MSDN Blogs
Exposing .NET tasks as WinRT asynchronous operations - Windows 8 app developer blog - Site Home - MSDN Blogs
Windows 8 File access sample in C#, VB.NET, C++, JavaScript for Visual Studio 2012
Futures and promises - Wikipedia, the free encyclopedia
Effective Go - The Go Programming Language
Deceptive simplicity of async and await
async (C# Reference)
Asynchronous Programming with Async and Await (C# and Visual Basic)
Creating Asynchronous Operations in C++ for Windows Store Apps
Asynchronous Programming - Easier Asynchronous Programming with the New Visual Studio Async CTP
Asynchronous Programming - Async Performance Understanding the Costs of Async and Await
Asynchronous Programming - Pause and Play with Await
Asynchronous programming in C++ (Windows Store apps) (Windows)
AsyncAwait Could Be Better - CodeProject
File Manipulation in Windows 8 Store Apps
SharpGIS Reading and Writing text files in Windows 8 Metro


12-15-12 | How to Lose Game Developer's Love

How to Lose Game Developer's Love ... using only Hello World.

MS has gone from by far the most beloved sweet simple console API to develop for to this :


[System::PleaseLetMyAppRun::PrettyPlease]
Universe::MilkyWay::SolarSystem::Earth::DataTypes::Void 
main( some complicated args that don't matter because they don't work ^ hat )
{

    IPrintf^ p = System::GoodLord::Deprecated::stdio::COM::AreYouKiddingMe( IPrintfToken );
    p->OnReady( [this]{ return CharStreamer( StreamBufferBuilder( StringStreamer( StringBuffer( CharConcatenator('h') +
        CharConcatenator('e') + IQuitSoftware("llo world\n") )))) } ); 

}

(this example fails because it didn't request privilege elevation with the security token to access the console)

(and then still fails because it didn't list its imports correctly in its manifest xml)


12-13-12 | Windows 8

With each version of Windows it takes progressively longer to install and set up into a cbloom-usable state. Windows 8 now takes 3-4 days to find all the crud they're trying to shove down my throat and disable it. I've gotten it mostly sorted out but there are a few little things I haven't figured out how to disable :

1. The Win-X key. Win-X is mine; I've been using it to mean "maximize window" for the last 10 years. You can't have it. I've figured out how to disable their Win-X menu, but they still seem to be eating that key press somewhere very early, before I can see it. (they also stole my Win-1 and various other keys, but those went away with the NoWinKeys registry setting; Win-X seems unaffected by that setting).

2. Win 8 seems to have even more UAC than previous versions. As usual you can kill most of it by turning UAC down to min, setting Everyone to Full Control, and Taking Ownership of c:\ recursively. But apparently in Win 8 when you turn the UAC slider down to min, it no longer actually turns off. Before Win 8, with UAC set to min all processes were "high integrity", now processes have to request elevation from code. One annoyance I haven't figured out how to fix is that net-used and subst'ed drives are now per-user. eg. if you open an admin cmd and do a subst, the drive doesn't show up in your normal explorer (and vice-versa).

3. There seems to be no way to tweak the colors, and the default colors are really annoying. Somebody thought it was a good idea to make every color white or light gray so all the windows and frames just run together and you can't easily spot the edges. You *can* tweak individual colors if you choose a "high contrast" theme (it's pretty standard on modern Windows that you only get the options you deserve by pretending to be disabled (reasonable things like "no animations" are all hidden in "accessibility")) - however, the "high contrast" theme seems to confuse every app (devenv, firefox) such that they use white text on white backgrounds. Doh.

Once you get Win 8 set up, it's basically Win 7. I don't understand what they were thinking putting the tablet UI as the default on the desktop. Mouse/keyboard user interface is so completely different from jamming your big fat clumsy fingers into a screen that it makes no sense to try to use one on the other. You wouldn't put tiny little buttons on a tablet, so why are you putting giant ham-finger tablet buttons on my desktop? Oh well, easy to disable.

So far the only improvement I've noticed (over Win 7) is that Windows Networking seems massively improved (finally, thank god). It might actually be faster to copy files across a local network than to re-download them from the internet now.


Some general OS ranting :

An OS should be a blank piece of paper. It is a tool for *me* to create what *I* want. It should not have a visual style. It should not be "designed" any more than a good quality blank piece of paper is designed.

(that said I prefer the Win 8 look to anything MS has done since Win 2k (which was the pinnacle of Windows, good lord how sweet it would be if I could still use Win 2k); Aero was an abortion, you don't base your OS GUI design on WinAmp for fuck's sake, though at least with the Aero-OS'es you could easily get a "classic" look, which is not so easy any more)

It's almost impossible to find an OS that actually respects its users any more. I want control of *everything*. If you add some new feature, fine, let me turn it off. If you change my key mappings, fine, let me put them back the way I'm used to.

I despise multi-user OS'es. In my experience they never actually work for security, and they are a constant huge pain in the ass. If you all want to make multi-user OS'es, please just give me a way to get a no-users install with a flat file system and just one set of configs. Nobody but me will ever touch my computer, I don't need this extra layer of shit that adds friction every single day that I use a computer (is that config under "cbloom" or is it under "all users"? Fuck, why do I have to think about this, give me a god damn flat OS. Oh wait the config was under "administrators" or "local machine". ARG). I know this is not gonna happen. Urg.

While we're at it can we talk about how ridiculously broken Windows is in general now?

One of the most basic thing you might want to do with an OS is to take your apps and data and config from one machine and put it on another. LOL, good luck. I know, let's just take all the system hardware config and the user settings and the app installs and let's just shuffle them all up in a big haystack.

Any serious developer needs to be able to clone their dev software from one machine to another, or completely back up their code + dev tools (but without backing up the whole machine, and be able to restore to a different machine).

Obviously the whole DLL mess is a disaster (and now manifests, packages, SxS assemblies, .net frameworks, WTF WTF). It's particularly insane to me that MS can't even get it right with their own damn runtimes. How in hell is it that I can run an exe on Windows and get an "msvcrxx not found" error? WTF, just ship all your damn runtimes with Windows, it can't be that big. And even if you don't want to ship them all, how can you not just have a server that gives me the missing runtimes? It's so insane.

God help you if you are trying to write software that can build on various installs of windows. Oh you have Win Vista SP1 with Platform SDK N, that's a totally different header which is missing some functions and causes some other weird warning, and you need .net framework X on this machine and blah blah it's such a total mess.


12-13-12 | vcproj nightmare

Ridiculous. WTF were they thinking.

Ok, so XML suxors and all, but if you're going to use XML then *use XML*. When you rev the god damn devstudio you don't break the old file format, you just add new tags for whatever new crap you feel you need to add. You don't put the devstudio version in the header of the file, you put on the individual tags that are specific to that version.

If you need to do per-version settings files, put them in a different file than my basic list of what my source code is and how to build it. And of course don't mix up your GUI cache with my project data.

The thing that really boggles my mind is how they can make such a huge mistake, and then stick with it year after year. It's sort of understandable to make a mistake once (though I think this one was entirely avoidable), but then you go "whoah what a fuckup, let's change that". Nope.

(of course they've done the same thing with their flagship (Office). It's crazy broken that I can't at least load the text and basic formatting from any type of document into any version)


12-8-12 | Sandy Cars

I'm still keeping half an eye open for an E46 M3.

Something I've noticed is that in the last month or so a lot of cars with histories like this are popping up :


09/06/2012      70,668      Inspection Co.  New Jersey      Inspection performed
11/20/2012      74,471      Covert Ford Austin, TX          Car offered for sale

You're not fooling me, bub. I know what happened in NJ between those two dates! Beware!


12-6-12 | Theoretical Oodle Rewrite Continued

So, continuing on the theme of a very C++-ish API with "futures" , ref-counted buffers, etc. :

cbloom rants 07-19-12 - Experimental Futures in Oodle
cbloom rants 10-26-12 - Oodle Rewrite Thoughts

It occurs to me that this could massively simplify the giant API.

What you do is treat "array data" as a special type of object that can be linearly broken up. (I noted previously about having RW locks in every object and special-casing arrays by letting them be RW-locked in portions instead of always locking the whole buffer).

Then arrays could have two special ways of running async :

1. Stream. A straightforward futures sequence to do something like read-compress-write would wait for the whole file read to be done before starting the compress. What you could do instead is have the read op immediately return a "stream future" which would be able to dole out portions of the read as it completed. Any call that processes data linearly can be a streamer, so "compress" could also return a stream future, and "write" would then be able to write out compressed bits as they are made, rather than waiting on the whole op.

2. Branch-merge. This is less of an architectural thing than just a helper (you can easily write it client-side with normal futures); it takes an array and runs the future on portions of it, rather than running on the whole thing. But having this helper in from the beginning means you don't have to write lots of special case branch-merges to do things like compress large files in several pieces.

So you basically just have a bunch of simple APIs that don't look particularly Async. Read just returns a buffer (future). ReadStream returns a buffer stream future. They look like simple buffer->buffer APIs and you don't have to write special cases for all the various async chains, because it's easy for the client to chain things together as they please.
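eg. client-side chaining might look like this sketch (hypothetical names) :

// read-compress-write with stream futures : each stage starts working as
// soon as the previous stage produces its first pieces :
StreamFuture<Buffer> fin   = ReadStream( "in.dat" );
StreamFuture<Buffer> fcomp = CompressStream( fin );
future<bool>         fout  = WriteStream( "out.dat", fcomp );
fout.wait();    // nothing above ever blocked a thread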

To be redundant, the win is that you can write a function like Compress() and you write it just like a synchronous buffer-to-buffer function, but its arguments can be futures and its return value can be a future.

Compress() should actually be a stackful coroutine, so that if the input buffer is a Stream buffer, then when you try to access bytes that aren't yet available in that buffer, you Yield the coroutine (pending on the stream filling).
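Something like this sketch (Get() is a hypothetical accessor that Yields when the bytes aren't available yet) :

// Compress written as plain synchronous-looking code ; running it as a
// stackful coroutine turns the waits into Yields :
future<Buffer> Compress( StreamBuffer in )
{
    Encoder enc;
    while ( ! in.eof() )
    {
        Bytes chunk = in.Get(64*1024);  // may Yield this coroutine ; never waits the thread
        enc.Encode(chunk);
    }
    return enc.Finish();
}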


Functions take futures as arguments and return futures.

Every function is actually run as a stackful coroutine on the worker threadpool.

Functions just look like synchronous code, but things like file IO cause a coroutine Yield rather than a thread Wait.

All objects are ref-counted and create automatic dependency chains.

All objects have built-in RW locks, arrays have RW locks on regions.

Parallelism is achieved through generic Stream and Branch/Merge facilities.

While this all sounds very nice in theory, I'm sure in practice it wouldn't work. What I've found is that every parallel routine I write requires new hacky special-casing to make it really run at full efficiency.


12-6-12 | The Oodle Dilemma

Top two Oodle (potential) customer complaints :

1. The API is too big, it's too complicated. There are too many ways of doing basically the same thing, and too many arguments and options on the functions.

2. I really want to be able to do X but I can't do it exactly the way I want with the API, can you add another interface?

Neither one is wrong.


12-5-12 | The Fiscal Cliff

First of all, let's talk about the general structure of the laws that are causing this problem. Our federal government, like a large number of states, now has a sort of one-way ratchet towards smaller government. These laws have been slipped in by Republicans with little attention paid, but they are very powerful and will drastically change American government in the future. Basically, the Republicans have already won and we can't stop them.

The structure of all these laws is basically the same : 1. allow tax cuts to pass by simple majority. 2. make tax raises difficult to pass (many states now require 2/3 super-majority for taxes increases (and they were already politically nearly impossible)). 3. set a debt limit and force mandatory cuts to balance the budget (many states actually have a debt limit of 0, every year's budget must be balanced, which is manifestly absurd given the variance of economies and the resulting receipts and costs).

They claim that this produces "fiscal responsibility" but of course they know that's a lie; the goal is small government and that's all it produces. If you wanted actual fiscal responsibility, you wouldn't cut taxes in flush times, instead you would require that governments save during surplus times to provide a cushion for recessionary times; you would also require that any tax *cut* is matched by spending cuts in order to pass. If you were actually fiscally responsible, you would allow defecits during recessions, but require them to be matched by tax raises in boom periods.

The result of these laws is obvious : a simple Republican majority can pass tax cuts when they are in power (and the dumb voters will love it), then it's almost impossible to put the taxes back where they were, and then you inevitably run out of money and have to cut spending. Particularly if you hit a recession and have to keep the budget balanced, you will have to slash government drastically.

(whether or not government should be minimal is open for debate, but the duplicitous method of achieving it is incontrovertibly scummy)

So first of all, let's recall the source of the fiscal cliff. It is not the growth of entitlements, which is sort of an unrelated long term issue that people love to mix in to any financial discussion. The primary causes of the short term deficit are the Bush tax cuts and the recession. (the other major factors are the war spending and TARP spending (etc)). This is not a fundamental problem of the way the US government is run, it's the combination of cutting taxes and increasing spending that happened under GWB.

The other major issue that we must keep in mind is that we are still currently deep in a recession. Tiny amounts of GDP growth may hide this, and the unemployment numbers look better, but I believe the reality is that the American economy is still deeply sick, with no real growth of industry and no prospects. Essentially we are propping it up with the free money from the Fed and the super-low taxes. Any attempt to return to a sustainable Fed interest rate and tax rate would show the economy for what it really is. Trying to tighten the belt now would certainly look bad; I don't say that it would "lead to a recession", I believe we are in a recession and are just hiding it with a candy coating.

Now, briefly about entitlements. The Republicans love to make entitlement growth seem really scary, but it's not true. Social Security can be made solvent very easily : simply make the SS tax non-regressive. The current SS tax is regressive because it's a flat percentage but has a maximum. If you simply remove the maximum, Social Security is solvent for the next 100 years (CBO numbers).

Medicare is a bigger problem, but not because of the increase of the number of elderly - rather due to the corruption of doctors and the medical establishment. With increased productivity and technology, the cost of health care should be going *down*, instead it rises at an obscene rate, because the insurance complex has cooked up a system where we have no control over the cost of our care. Unfortunately Obamacare has perhaps made this worse than ever, locking the corrupt health insurance system into law without taking any steps to limit private profits.

How do you actually fix the American economy and get some real growth that's not just an illusion propped up by free Fed money?

1. Legally require open systems. Make net neutrality law. Open up the cable-TV lines. Perhaps the best option is a national open broadband system on some new super-fast fiber (unrealistic). Make the Apple Store type of computer lockdown illegal. Openness and free competition for small business is what will really save this country.

2. Make it easier to start small businesses. Remove the favoritism for big business. Tax loopholes and breaks massively favor big business - eliminate them all. Eliminate all development and "green" subsidies, which again massively favor big business. Simplify the tax code (see below) and then perhaps even simplify it more for small businesses, like provide a super-basic flat tax option for businesses that make less than $1M a year.

3. Make it cheaper to hire Americans in America. Eliminate payroll taxes. Eliminate employer-run health care (or provide a national group option for small businesses). Increase taxes on corporate profits and aggressively go after offshoring of money.

4. Long term we're fucked no matter what. What would you say are the prospects for a country where the education system sucks (the cost of education continues to rise way faster than inflation, and most of the "educated" can't actually do anything useful), the IT infrastructure sucks, and the cost of labor is sky-high? You would say that country is doomed to poverty, and that it is.


A few proposals for real government taxing & spending reform :

O. Get corporate money out of politics or everything else is hopeless.

O. Stop the revolving door between government and private industry. eg. if you work on the Texas Railroad Commission, then you are not allowed to go work for the oil/gas industry (and vice-versa). Treasury secretaries shouldn't be allowed to rotate in and out of wall street. It's totally absurd and like corporate speech, everything else is hopeless until it's stopped.

O. Return defense spending to 1999 levels ($300 billion from the current $700+ billion). And then cut it even further. Never going to happen since defense is the biggest pork item in government (by far).

O. Stop all farm subsidies and tax breaks. They're a sick farce. The small family farmers that are trotted out for political purposes don't exist in the real world; farm subsidies go to large agribusiness and to rich people with vineyards. They're actually very bad for small farmers that are trying to legitimately compete because they massively favor big business. Not only does it make the American farm economy sick, we're destroying the entire world food economy with our export subsidies.

O. Stop all direct aid for ethanol, electric cars, etc. They're a sick distortion of the market that isn't helping anything except corrupt profit. Let the market find solutions to problems.

O. Stop sending federal money to leech states. (need to get rid of the ridiculous over-powering of small states caused by the Senate)

O. Eliminate all payroll taxes. Fund medicare, SS, unemployment, etc. from the general tax revenue. This massively simplifies the tax code, removes the regressive SS tax, and reduces the cost of employment.

O. Don't cut Medicare spending or fake out the inflation rate for COLA. Instead go after the reason why medical costs are rising out of control. Don't reimburse doctors for unnecessary procedures or scans. Don't reimburse for unnecessary MRI's. Don't allow any medical practitioner to pass on the fee in excess of the negotiated rates to the client. Require up-front pricing for all medical treatment. Force the AMA to stop its corrupt limiting of the number of doctors. etc. etc.

O. Eliminate capital gains tax. I don't mean reduce it, I mean treat all profit as profit - tax it as normal income. Eliminate the dividend loophole. Stop letting the super rich pay 10-15% tax rates.

O. Eliminate all tax deductions. Nothing is deductible (but raise the standard deduction so that a majority of Americans actually get to deduct more). Alternatively : raise the AMT and remove exceptions from the AMT.

O. Remove foreign residency as a way to avoid US taxes. If you do business in America, you pay US taxes. Same for corporate taxes. Remove non-income benefits as a way to avoid taxes; eg. company cars, apartments, dinners, etc. all count as income. Pass new laws so we can be more aggressive about going after holding companies or "consulting firms" as ways to hide personal income.

O. Make US companies pay for our foreign spending on their behalf. eg. if you're Chevron and want to run a pipeline through Afghanistan, fine go ahead, but then you pay for the Afghanistan war. etc. Almost all of our defense and foreign aid spending should be paid by the companies that do business in unstable countries.


12-1-12 | Hawk's return

The hawk returned (perhaps a different one). He missed his kill this time and I couldn't get a shot of him before he fled, but it did give me a chance to snap a photo of what the chickens do in response :

(I'm lifting the roof on their house) (there are five there, they're all sitting on top of the one broody hen that never leaves that box)

We've got hawthorn trees at the house which are unusual for Seattle (they don't belong here). In the fall/winter the leaves drop and they are covered in berries which are inedible to humans but are apparently like ambrosia to birds and squirrels. We get incursions from neighboring squirrels that the resident ones have to fend off with much shrieking, and of course lots of little birds come through in packs, which seems to be attracting the predator.

I was a bit worried that our cats would take advantage of this bountiful hunting ground (it's really perverse when people with cats set up bird feeders, and having super-delicious trees is not much removed from that), but so far that hasn't really happened.


11-29-12 | Unnatural

I hate having neighbors so much. It's just not a natural way to live, this modern human way, where we're all crammed together with people who are not our tribe.

I believe that human beings are only comfortable living with people they are intimate with. In ancient days this was your whole tribe, now it's usually just family. You essentially have no privacy from these people, and not even separate property. Trying to keep your own stuff is an exercise in frustration. You must trust these people and work together and open up to them to be happy. Certainly there is always friction in this, but it's a natural human existence, and even though it may give you lots to complain about, there will also be joy. (foolishly moving away from this way of life is the root of much unhappiness)

Everyone else is an enemy. If you aren't in my intimate tribal group, WTF are you doing near my home? This is my land where I have my family, I will fucking jab you in the eye with a pointy stick.

I'm not really comfortable with "friends". So-called "friends" are not your friends; they will make fun of you behind your back, they will let you down when you need help. You can't ever really open up and admit things to them, you can't show your weaknesses, they will mock you for it or use your weaknesses against you. It's so awful to me the way normal people talk to each other; everyone is pretending to be strong and happy all the time, nobody ever talks about anything serious, some people put on a big show to be entertaining, it's all just so tense and shallow and unpleasant. The reason is that these people are not in my tribe, hence they are my enemies, and this whole "friends" thing is a very modern societal invention that doesn't really work.

I realized a while ago that this is one of the things I hate about going into the office. The best office experiences I've had have been the ones where it was a small team, we were all young and spent lots of time at work, and we actually bonded and had fun together and were almost like a family after several years of crunching (going through tough shit together is a classic way to brainwash people into acting like a tribe); at that point, it feels comfortable to go in to work, you can rip a fart and people laugh instead of tensely pretending that nothing happened. But most of the time an office never reaches that level of intimacy, the people in the office are just acquaintances, so you're in this space where you're supposed to be relaxed, and there are people walking around all the time looking over your shoulder, but they are enemies! Of course I can't relax, they're not my tribe, why are they in my space? It's terrible.


Going away from home to work is really unnatural.

At first when people start working from home it feels weird because they're so used to leaving, but really this whole going to a factory/office thing is a super-modern invention (last few hundred years).

Of course you should stay home and work your farm. You should build your house, till your field, and be there to protect your women and children (eg. in the modern world : open jars for them). Of course you should have your children beside you so that you can talk to them and teach them your trade as you work.

Of course when you're hungry you should go in to your own kitchen and eat some braised pork shoulder that's real simple hearty food cooked by your own hands, not the poisonous filth that restaurants purvey.

You shouldn't leave your family for 8 hours every day, that's bizarre and horrible. You should see your livestock running around, be there to shoo away the neighbors' cats, see the trees changing color, and put your time and your love into what is really yours.


11-28-12 | SSD Buying Guide

Have done a bunch of reading over the past 24 hours and updated my old posts on SSD's :

cbloom's definitive SSD buying guide :

recommended :

Intel's whatever (currently the 520, but actually the old X25-M is still just fine; the S3700 stuff looks promising for the future)

not recommended :

Everything else.

The whole issue of flash degradation and moving blocks and such is a total red herring. SSD's are not failing because of the theoretical lifetime of flash memory, they are failing because the non-Intel drives are just broken. It's pretty simple, don't buy them.

The other issue I really don't care about is speed. They're all super fast. If they all actually worked then maybe I would care which was fastest, but since the non-Intel ones are just broken, the question of speed is irrelevant. The hardware review sites are all pretty awful with their endless benchmarking and complete missing of the actual issues. And even my ancient X25-M is plenty fast enough.

I think it's tempting to just go for the enterprise-grade stuff (Intel 710 at the moment). Saving money on storage doesn't make any sense to me, and all the speed measurement stuff just makes me yawn. (Intel 720 looks promising for the future). It's not quite as clear cut as ECC RAM (which is obviously worth it), but I suspect that spending another few $hundred to not worry about drive failure is worth it.

Oh, also brief googling indicates various versions of Mac OS don't support various SSD's correctly. I would just avoid SSD's on Mac unless you are very confident about getting this right. (best practice is probably just avoiding Mac altogether, but YMMV and various other acronyms)


11-26-12 | Chickens and Hawks

Hawk with kill in our yard :

And in context :

The chickens were out free ranging at the time; they all ran inside the coop and climbed back into the farthest corner nesting box and sat on top of each other in a writhing pile of terrified chickens.


Watching animals is pretty entertaining. I remember when I was younger, I used to think it was a pathetic waste of time. Old people would sit around and watch the cats play, or get all excited about going on safari or whatever, and I would think "ppfft, boring, whatever, I've seen it on TV, what a sad vapid way to get entertainment, you oldsters are all so brain-dead, doing nothing with your time, you could be learning quantum field theory, but you've all just given up on life and want to sit around smiling at animals". Well, that's me now.


11-26-12 | VAG sucks

(VAG = Volkswagen Auto Group)

Going through old notes I found this (originally from Road and Track) :

"For instance, just about every Audi, Porsche and Volkswagen model that I've driven in the U.S. doesn't allow throttle/brake overlap. Our long-term Nissan 370Z doesn't, either, which is a big reason why I'm not particularly keen on taking it out for a good flog; overlap its throttle and brake just a little bit and the Z cuts power for what seems an eternity (probably about two seconds)."

VAG makes fucked up cars. I certainly won't ever buy a modern one again. They have extremely intrusive computers that take the power for LOLs out of the driver's hands. (apparently the 370Z has some stupidity as well; this shit does not belong in cars that are sold as "driver's cars").

(in case it wasn't clear from the above : you cannot left-foot-brake a modern Porsche with throttle overlap. Furthermore, you also can't trail-brake oversteer a modern Porsche because ESC is always on under braking. You have to be careful going fully off throttle and then back on due to the off-throttle-timing-advance. etc. etc. probably more stupid shit I'm not aware of. This stuff may be okay for most drivers in their comfort saloons, but is inexcusable in a sports car)

Anyway, I'm posting cuz this reminded me that I found another good little mod for the 997 :

Stupid VAG computer has clutch-release-assist. What this does is change the engine map in the first few seconds after you let the clutch out. The reason they do this is so that incompetent old fart owners don't stall the car when pulling away from a light, and also to help you not burn the clutch so much. (the change to the engine map increases the minimum throttle and also reduces the max).

If you actually want to drive your car and do hard launches and clutch-kicks and generally have fun, it sucks. (the worst part is when you do a hard launch and turn, like when you're trying to join fast traffic, and you get into a slight slide, which is fine and fun, but then in the middle of your maneuver the throttle map suddenly changes back as the clutch-assist phase ends, and the car sort of lurches and surges weirdly, it's horrible). Fortunately disabling it is very easy :

There's a sensor that detects clutch depression. It's directly above the clutch in the underside of the dash. You should be able to see the plastic piston for the sensor near the hinge of the clutch pedal. All you have to do is unplug the sensor (it's a plastic clip fitting)

With the sensor unplugged you get no more clutch-release-assist and the car feels much better. You will probably stall it a few times as you get used to the different throttle map, but once you're used to it smooth fast starts are actually easier. (oh, and pressing the clutch will no longer disable cruise control, so be aware of that). I like it.

(aside : it's a shame that all the car magazines are such total garbage. If they weren't, I would be able to find out if any modern cars are not so fucked. And you also want to know if they're easy to fix; problems that are easy to fix are not problems)

(other aside : the new 981-gen Cayman looks really sweet to me, but there are some problems. I was hoping they would use the longer wheelbase to make the cabin a bit bigger, which apparently they didn't really do. They also lowered the seat and raised the door sills, which ruins one of the great advantages of the 997-gen Porsches (that they had not adopted that horrible trend of excessively high doors and poor visibility). But the really big drawback is that I'm sure it's all VAG-ed up in stupid ways that make it annoying for a driver. And of course all the standard Cayman problems remain, like the fact that they down-grade all the parts from the 911 in shitty ways (put the damn multi-link rear suspension on the Cayman you assholes, put an adjustable sway on it and adjustable front control arms))

(final aside : car user interface design is generally getting worse in the last 10-20 years. Something that user interface designers used to understand but seem to have forgotten is that the most potent man-machine bond develops when you can build muscle memory for the device, so that you can use it effectively with your twitch reflexes that don't involve rational thought. In order for that to work, the device must behave exactly the same way at all times. You can't have context-sensitive knobs. You can't have the map of the throttle or brake pedal changing based on something the car computer detected. You must have the same outcome from the same body motion every time. This must be an inviolable principle of good user interfaces.)


11-24-12 | The Adapted Eye

Buying a house for a view is a big mistake. (*)

Seattle is a somewhat beautiful place (I'm not more enthusiastic because it is depressing to me how easily it could have been much better (and it continues to get worse as the modern development of Cap Hill and South Lake Union turn the city into a generic condo/mall dystopia)) but I just don't see it any more. When we got back from CA I realized that I just don't see the lake and the trees anymore, all I see is "home".

There are some aspects that still move me, like clear views of the Olympics, because they are a rare treat. But after 4 years, the beauty all around is just background.

We have pretty great views from our house, and I sort of notice them, but really the effect on happiness of the view is minimal.

(* = there are benefits to houses with a view other than the beauty of the view. Usually a good view is associated with being on a hill top, or above other people, or up high in a condo tower, and those have the advantages of being quieter, better air, more privacy, etc. Also having a view of nature is an advantage just in the fact that it is *not* a view of other people, which is generally stressful to look at because they are doing fucked up things that you can't control. I certainly appreciate the fact that our house is above everyone else; it's nice to look down on the world and be separate from it).

I was driving along Lake Wash with my brother this summer and he made some comment about how beautiful it was, and for a second there I just couldn't figure out what he was talking about. I was looking around to see if there was some new art installation, or if Mount Rainier was showing itself that day, and then I realized that he just meant the tree lined avenue on the lake and the islands and all that which I just didn't see at all any more.

Of course marrying for beauty is a similar mistake. Even ignoring the fact that beauty fades, if we imagine that it lasted forever it would still be a mistake because you would stop seeing it.

I've always thought that couples could keep the aesthetic interest in each other alive by completely changing their style every few years. Like, dress as a hipster for a while, dress as a punk rocker or a goth, dress as a preppy business person. Or get drastically different hair cuts, like for men grow out your hair like an 80's rocker, or get a big Morrisey pompadour, something different. Most people over 30 tend to settle into one boring low-maintenance style for the rest of their lives, and it becomes invisible to the adapted eyes in their lives.

I suppose there are various tricks you can use; like rather than have your favorite paintings on the wall all the time, rotate them like a museum, put some in storage for a while and hang up some others. It might even help to roll some dice to forcibly randomize your selection.

I guess the standard married custom of wearing sweats around the house and generally looking like hell is actually a smart way of providing intermittent reward. It's the standard sitcom-man refrain to complain that your wife doesn't fancy herself up any more, but that's dumb; if she did dress up every day, then that would just become the norm and you would stop seeing it. Better to set the baseline low so that you can occasionally have something exceptional.

(add : hmm the generalized point that you should save your best for just a few moments and be shitty other times is questionable. Think about behavior. Should you intentionally be kind of dicky most of the time and occasionally really nice? If you're just nice all the time, that becomes the baseline and people take it for granted. I'm not sure about that. But certainly morons do love the "dicky dad" character in TV's and movies; your typical fictional football coach is a great example; dicky dad is stern and tough, scowly and hard on you, but then takes you aside and is somewhat kind and generous, and all the morons in the audience melt and just eat that shit up.)

One of the traps of life is optimizing things. You paint your walls your favorite color for walls, you think you're making things better, but that gets you stuck in a local maximum, which you then stop seeing, and you don't feel motivated to change it because any change is "worse".

I realized the other day that quite a few ancient societies actually have pretty clever customs to provide randomized rewards. For example lots of societies have something like "numbers" , which ignoring the vig, is just a way of taking a steady small income and turning it into randomized big rewards.

Say you got a raise and make $1 more a day. At first you're happy because your life got better, but soon that happiness is gone because you just get used to the new slightly better life and don't perceive it any more. If instead of getting that $1 a day, you instead get $365 randomly on average once a year, your happiness baseline is the same, but once in a while you get a really happy day. This is probably actually better for happiness.

I think the big expensive parties that lots of ancient societies throw for special events might be a similar thing. Growing up in LA we would see our poor latino neighbors spend ridiculous amounts on a quinceañera or a wedding and think how foolish it was, surely it's more rational to save that money and use it for health care or education or a nicer house. But maybe they had it right? Human happiness is highly resistant to rational optimization.


11-23-12 | Global State Considered Harmful

In code design, a frequent pattern is that of singleton state machines. eg. a module like "the log" or "memory allocation" which has various attributes you set up that affect its operation, and then subsequent calls are affected by those attributes. eg. things like :


Log_SetOutputFile( FILE * f );

then

Log_Printf( const char * fmt .... );

or :

malloc_setminimumalignment( 16 );

then

malloc( size_t size );

The goal of this kind of design is to make the common use API minimal, and have a place to store the settings (in the singleton) so they don't have to be passed in all the time. So, eg. Log_Printf() doesn't have to pass in all the options associated with logging, they are stored in global state.

I propose that global state like this is the classic mistake of improving the easy case. For small code bases with only one programmer, they are mostly okay. But in large code bases, with multi-threading, with chunks of code written independently and then combined, they are a disaster.

Let's look at the problems :

1. Multi-threading.

This is an obvious disaster and pretty much a nail in the coffin for global state. Say you have some code like :


pcb * previous_callback = malloc_setfailcallback( my_malloc_fail_callback );

void * ptr = malloc( big_size ); 

malloc_setfailcallback( previous_callback );

this is okay single threaded, but if other threads are using malloc, you just set the "failcallback" for them as well during that span. You've created a nasty race. And of course you have no idea whether the failcallback that you wanted is actually set when you call malloc because someone else might change it on another thread.

Now, an obvious solution is to make the state thread-local. That fixes the above snippet, but sometimes you want to change the state so that other threads are affected. So now you have to have thread-local versions and global versions of everything. This is a viable, but messy, solution. The full solution is :

There's a global version of all state variables. There are also thread-local copies of all the global state. The thread-local copies have a special value that means "inherit from global state". The initial value of all the thread-local state should be "inherit". All state-setting APIs must have a flag for whether they should set the global state or the thread-local state. Scoped thread-local state changes (such as the above example) need to restore the thread-local state to "inherit".

This can be made to work (I'm using it for the Log system in Oodle at the moment) but it really is a very large conceptual burden on the client code and I don't recommend it.
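
To make that concrete, here's a minimal sketch of the inherit-from-global scheme. (The names are hypothetical, not the actual Oodle API; "__thread" is the gcc TLS keyword, use __declspec(thread) on MSVC.)

#define LOG_INHERIT (-1)    // special thread-local value : "inherit from global"

static int s_log_verbosity_global = 1;
static __thread int t_log_verbosity = LOG_INHERIT;

int Log_GetVerbosity()
{
    int v = t_log_verbosity;
    return ( v == LOG_INHERIT ) ? s_log_verbosity_global : v;
}

void Log_SetVerbosity( int v , bool set_global )
{
    if ( set_global ) s_log_verbosity_global = v;
    else              t_log_verbosity = v;
}

// scoped thread-local changes save and restore, so the slot goes back to "inherit" :

struct LogVerbosityScope
{
    int m_prev;
    explicit LogVerbosityScope( int v ) : m_prev( t_log_verbosity ) { t_log_verbosity = v; }
    ~LogVerbosityScope() { t_log_verbosity = m_prev; }
};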

There's another way that these global-state singletons are horrible for multi-threading: they create dependencies between threads that are not obvious or intentional. A little utility function that just calls some simple functions silently picks up ties to shared variables, and now needs synchronization against the global state. This is related to :

2. Non-local effects.

The global state makes the functions that use it non-"pure" in a very hidden way. It means that innocuous functions can break code that's very far away from it in hidden ways.

One of the classic disasters of global state is the x87 (FPU) control word. Say you have a function like :


void func1()
{
    // (a concrete version using the MSVC CRT ; _controlfp is in <float.h>)

    unsigned int oldCW = _controlfp( 0, 0 );    // read the current x87 control word
    _controlfp( _PC_24, _MCW_PC );              // set 24-bit precision

    // ... do a bunch of math that relies on that CW ...

    func2();                                    // may secretly read or change the CW !

    // ... do more math that relies on CW ...

    _controlfp( oldCW, _MCW_PC );               // restore the control word
}

Even without threading problems (the x87 CW is thread-local under any normal OS), this code has nasty non-local effects.

Some branch of code way out in func2() might rely on the CW being in a certain state, or it might change the CW and that breaks func1().

You don't want to be able to break code very far away from you in a hidden way, which is what all global state does. Particularly in the multi-threaded world, you want to be able to detect pure functions at a glance, or if a function is not pure, you need to be able to see what it depends on.

3. Undocumented and un-asserted requirements.

Any code base with global state is just full of bugs waiting to happen.

Any 3d graphics programmer knows about the nightmare of the GPU state machine. To actually write robust GPU code, you have to check every single render state at the start of the function to ensure that it is set up the way you expect. Good code always expresses (and checks) its requirements, and global state makes that very hard.

This is a big problem even in a single-source code base, but even worse with multiple programmers, and a total disaster when trying to copy-paste code between different products.

Even something like taking a function that's called in one spot in the code and calling it in another spot can be a hidden bug, if it relied on some global state that was set up in just the right way in that original spot. That's terrible; as much as possible, functions should be self-contained and work the same no matter where they are called. It's sort of like "movement of call site invariance symmetry" : the action of a function should be determined only by its arguments (as much as possible), and any memory locations that it reads should be as clearly documented as possible.

4. Code sharing.

I believe that global state is part of what makes C code so hard to share.

If you take a code snippet that relies on some specific global state out of its context and paste it somewhere else, it no longer works. Part of the problem is that nobody documents or checks that the global state they need is set. But a bigger issue is :

If you take two chunks of code that work independently and just link them together, they might no longer work. If they share some global state, either intentionally or accidentally, and set it up differently, suddenly they are stomping on each other and breaking each other.

Obviously this occurs with anything in stdlib, or on the processor, or in the OS (for example there are lots of per-Process settings in Windows; eg. if you take some libraries that want a different time period, or process priority class, or privilege level, etc. etc. you can break them just by putting them together).
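
For example, a sketch of the Windows timer-period case (timeBeginPeriod is the real winmm API; the two libs are hypothetical) :

#include <windows.h>
#pragma comment(lib, "winmm.lib")

// two independently-written libs, linked into one process :

void LibA_Init() { timeBeginPeriod( 1 );  }  // wants 1 ms scheduler granularity for low-latency work
void LibB_Init() { timeBeginPeriod( 10 ); }  // wants coarse timing to save power

// the system runs at the minimum of all requests, so LibB's power saving
// silently never happens ; and a mismatched timeEndPeriod from either lib
// breaks the other's assumptions.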

Ideally this really should not be so. You should be able to link together separate libs and they should not break each other. Global state is very bad.


Okay, so we hate global state and want to avoid it. What can we do? I don't really have the answer to this because I've only recently come to this conclusion and don't have years of experience, which is what it takes to really make a good decision.

One option is the thread-local global state with inheritance and overrides as sketched above. There are some nice things about the thread-local-inherits-global method. One is that you do still have global state, so you can change the options somewhere and it affects all users. (eg. if you hit 'L' to toggle logging that can change the global state, and any thread or scope that hasn't explicitly set it picks up the global option immediately).

Other solutions :

1. Pass in everything :

When it's reasonable to do so, try to pass in the options rather than setting them on a singleton. This may make the client code uglier and longer to type at first, but is better down the road.

eg. rather than


malloc_set_alignment( 16 );

malloc( size );

you would do :

malloc_aligned( size , 16 );

One change I've made to Oodle is taking state out of the async systems and putting it in the args for each launch. It used to be like :

OodleWork_SetKickImmediate( OodleKickImmediate_No );
OodleWork_SetPriority( OodlePriority_High );
OodleWork_Run( job );

and now it's :

OodleWork_Run( job , OodleKickImmediate_No, OodlePriority_High );

2. An options struct rather than lots of args.

I distinguish this from #3 because it's sort of a bridge between the two. In particular I think of an "options struct" as just plain values - it doesn't have to be cleaned up, it could be const or made with an initializer list. You use this when the number of options is too large to pass as individual args, or when you frequently set up the options once and then use them many times.

So eg. the above would be :


OodleWorkOptions wopts = { OodleKickImmediate_No, OodlePriority_High  };
OodleWork_Run( job , &wopts );

Now I should emphasize that we have already given ourselves great power and clarity. The options struct could just be global, and then you have the standard mess with that. You could keep it in TLS so you have per-thread options. And then you could locally override even the thread-local options in some scope. Subroutines should take OodleWorkOptions as a parameter so the caller can control how things inside are run; otherwise you lose the ability to affect child code, which a global state system gives you.

Note also that options structs are dangerous for maintenance because of the C default initializer value of 0 and the fact that there's no warning for partially assigned structs. You can fix this by either making 0 mean "default" for every value, or making 0 mean "invalid" (and assert) - do not have 0 be a valid value which is anything but default. Another option is to require a magic number in the last value of the struct; unfortunately this is only caught at runtime, not compile time, which makes it ugly for a library. Because of that it may be best to only expose Set() functions for the struct and make the initializer list inaccessible.
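
For example, a sketch of the magic-number variant (names hypothetical) :

struct OodleWorkOptions
{
    U32 kickImmediate;
    U32 priority;
    U32 magic;          // must stay the last member
};

#define OODLEWORKOPTIONS_MAGIC  0xA0B1C2D3

void OodleWorkOptions_Validate( const OodleWorkOptions * pOpts )
{
    // a too-short initializer list leaves magic == 0 and this fires :
    RR_ASSERT( pOpts->magic == OODLEWORKOPTIONS_MAGIC );
}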

The options struct can inherit values when it's created; eg. it might fill any non-explicitly given values (eg. the 0 default) by inheriting from global options. As long as you never store options (you just make them on the stack), and each frame tick you get back to a root for all threads that has no options on the stack, then global options percolate out at least once a frame. (so for example the 'L' key to toggle logging will affect all threads on the next frame).

3. An initialized state object that you pass around.

Rather than a global singleton for things like The Log or The Allocator, this idea is to completely remove the concept that there is only one of those.

Instead, Log or Allocator is a struct that is passed in, and must be used to do those operations. eg. like :


void FunctionThatMightLogOrAllocate( Log * l, Allocator * a , int x , int y )
{
    if ( x )
    {
        Log_Printf( l , "some stuff" );
    }

    if ( y )
    {
        void * p = malloc( a , 32 );

        free( a , p );
    }
}

now you can set options on your object, which may be a per-thread object or it might be global, or it might even be unique to the scope.

This is very powerful, it lets you do things like make an "arena" allocator in a scope ; the arena is allocated from the parent allocator and passed to the child functions. eg :


void MakeSuffixTrie( Allocator * a , U8 * buf, int bufSize )
{

    Allocator_Arena arena( a , bufSize * 4 );

    MakeSuffixTrie_Sub( &arena, buf, bufSize );
}

The idea is there's no global state, everything is passed down.

At first the fact that you have to pass down a state pointer to use malloc seems like an excessive pain in the ass, but it has advantages. It makes it super clear in the signature of a function which subsystems it might use. You get no more surprises because you forgot that your Mat3::Invert function logs about degeneracy.

It's unclear to me whether this would be too much of a burden in real world large code bases like games.


11-13-12 | Another Oodle Rewrite Note

Of course this is not going to happen. But in the imaginary world in which I rewrite from scratch :

I've got a million (actually several hundred) APIs that start an Async op. All of those APIs take a bunch of standard arguments that they all share, so they all look like :


OodleHandle Oodle_Read_Async(

                // function-specific args :
                OodleIOQFile file,void * memory,SINTa size,S64 position,

                // standard args on every _Async function :
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

The idea was that you pass in everything needed to start the op, and when it's returned you get a fully valid Handle which is enqueued to run.

What I should have done was make all the little _Async functions create an incomplete handle, and then have a standard function to start it. Something like :


// prep an async handle but don't start it :
OodleHandleStaging Oodle_Make_Read(
                OodleIOQFile file,void * memory,SINTa size,S64 position
                );

// standard function to run any op :
OodleHandle Oodle_Start( OodleHandleStaging handle,
                OodleHandleAutoDelete autoDelete,OodlePriority priority,const OodleHandle * dependencies,S32 numDependencies);

it would remove a ton of boilerplate from all my functions, and make it a lot easier to add more standard args, or have different ways of firing off handles. It would also allow things like creating a bunch of "Staging" handles that aren't enqueued yet, and then firing them off all at once, or even just holding them un-fired for a while, etc.

It's sort of ugly to make clients call two functions to run an async op, but you can always get client code that looks just like the old way by doing :


OodleHandle h = Oodle_Start( Oodle_Make_Read( file, memory, size, position ) ,
                autoDelete, priority, dependencies, numDependencies );

and I could easily make macros that make that look like one function call.
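
Something along these lines (a sketch, not the actual Oodle macro) :

#define Oodle_Read_Async( file, memory, size, position, autoDelete, priority, deps, numDeps ) \
    Oodle_Start( Oodle_Make_Read( (file), (memory), (size), (position) ), \
                 (autoDelete), (priority), (deps), (numDeps) )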

Having that interval of a partially-constructed op would also let me add more attributes that you could set on the Staging handle before firing it off.

(for example : when I was testing compressors on enwik, some of the tasks could allocate something like 256MB each; it occurred to me that a robust task system should understand limiting the number of tasks that run at the same time if their usage of some resource exceeds the max. eg. for memory usage, if you know you have 2 GB free, don't run more than 8 of those 256 MB tasks at once, but you could run other non-memory-hungry tasks during that time. (I guess the general way to do that would be to make task groups and assign tasks to groups and then limit the number from a certain group that can run simultaneously))


11-13-12 | Manifesto on American Gardens

The landscaping in my neighborhood is so incredibly ugly, it occurred to me the other day that it would actually look better if everybody just stopped tending their yards for 5 years and let it go back to natural; some blackberry, some sweet peas, oregon grape, and conifers. Wild, bushy, green, that stuff is beautiful. The un-tended strips along the highways or in abandoned lots look better than the damn city, which is covered with all these Home Depot plants that don't belong in the area.

American vernacular garden style is disgusting. It consists of a patch of lawn that's generally very dense and tightly mowed, then a hard edge at the border of the lawn, often even with something utterly unforgivable like a black plastic strip, then planted beds. The planted beds typically are a big expanse of mulch with scattered little dots of isolated plants. It's totally unnatural looking; it doesn't look like it belongs in its place.

Just like with food, you should ask yourself, is all this stuff I'm doing actually making it better? If you go out to the woods around here and then walk around a neighborhood, how could you possibly think that the concrete pavers and decks and inappropriate warm-weather plants are an improvement?

Anyhoo, here's my manifesto about natural garden design :


1. Plants should look like themselves. Ferns look good green, not yellow or variegated. Daffodils look good yellow, not red or white. These days with advanced hybridization techniques you can get all kinds of crazy stuff, but DON'T, they are tacky as hell. They're like heavy makeup or plastic surgery, they sort of optimize a beauty goal but wind up being worse.

2. Your garden should match your area. Again with careful tending you can grow things from all over the world, but you shouldn't. The plants will look the most natural if they suit your area. Here in the PNW that means evergreens, ferns, rhododendra, etc. You can add some Japanese plants and such that have similar native climates, but things like tropical plants or hot/dry mediterranean plants do not belong here.

3. Mulch is fucking disgusting looking. One of the worst possible things you can do to your garden is to spread a huge field of mulch and then dot it with a sparse scattering of shrubs. Mulch is a necessary evil (actually some modern thought believes that mulch is overrated, but that's a digression) and should be invisible as much as possible. Mulches should, like the plants, match the area, not be some weird imported thing. So the modern cocoa and coco (coconut) mulches are both inappropriate everywhere. Here in the Northwest, pine bark is a semi-natural forest floor mulch and so looks okay in moderation.

4. Fertilizers, weed killers, grass treatments, etc. are all massive poisons, they flow into our natural water ways and fuck up the environment. You are a huge fucking selfish asshole if you use them beyond the bare minimum that is absolutely necessary. If your yard or plants require large amounts of chemicals to be happy, then change your fucking yard you asshole, plant something that works better in your environment; poisoning the damn lake is not a good tradeoff for you having a nice lawn.

5. Moss is a very natural and beautiful part of a Northwest garden and should not be removed. (BTW on a semi-related topic, the idea that moss can damage your roof is basically a myth here in the northwest (we very rarely get a major freeze, which is what makes moss harmful (because freezing makes it expand which rips up the shingles)) - and certainly pressure washing is guaranteed to do more harm than the moss ever could). Removing moss in the northwest is like polishing the patina off antique metal ware, it shows a complete lack of taste.

6. Concrete, manufactured stone, plastic, cast pavers, etc. have no place in a garden. Landscape fabric, plastic path/yard edging, etc. can be used but only if they will stay invisible, which they won't (they always work themselves out into sight), so probably just shouldn't be used.

7. A sort of general issue I've been thinking about is that there is a conflict between what looks good in a photo, or in a first impression, vs. what looks good to live with. This is true of gardens, houses, lots of things. Basically for a photo or a first impression (like a realtor visit) what looks good is simple, clean, and above all coherent; for example flowers should be all of one color or two colors. Many people now design gardens with this in mind, optimizing for a single view. However, that is quite different from what is enjoyable to live with on a regular basis. The scene that looks great in a photo will get boring if it's all you have to look at every day. To live with it's nicer to have lots of variety, unusual specimens, lots of little bits of interest you can walk around and look at. It's sort of like the overall impression vs. the density of interest; it's the "macro" vs "micro" optimization if you like. (the same is true of house decor; magazines and realtors favor a very clean, unified, almost Japanese simple interior, but living in that is quite boring; it's more interesting to live in the very cluttered house full of curios and covered with paintings and photos that give you lots of little things to look at).


The ideal Seattle house should have big Doug Fir beams, a cedar shake roof, and a big fireplace made of natural boulders. There should be a stream on the property and french drains that route groundwater to the stream.

The ideal Seattle garden should be like a woodland meadow. Obviously you don't actually want a "house in the woods" because it's too dark; what you want is that feeling when you're walking through the dense woods and you get to a big meadow and suddenly the sun appears and there's this lovely clearing of grasses and flowers with trees all around.

An ideal Seattle garden should always include some big evergreen trees, since they are the true masters of this landscape. A forest garden around the evergreens could include rhododendra, ferns, blueberries, etc. A good Seattle garden should always include moss and boulders; a truly lucky site would have one of our magnificent ice-age glacially deposited boulders.

(Seattle used to have lots of amazing giant boulders in the city. They were deposited by the glaciers that cut the sound, and were usually granite, giant 40 foot diameter things that just plopped there randomly. The vast majority of them have been destroyed, clearing space for houses and roads and such, and also to create smaller rocks. If you drive around Seattle you may observe all the rockeries used as retaining walls, made of large boulders (2-3 foot diameter typically); those boulders were usually made by dynamiting the original huge glacial boulders. There was one of the giant glacial boulders right on my street up until quite recently (the 1950's or something like that; the old neighbor was alive when it was still there); it's a shame that more weren't left and used as interesting city landscape features; of course it's an even bigger shame that more stands of old growth forest weren't left, we could have had stretches like the Golden Gate Park Panhandle running all over the city, and Seattle would then have been a unique and gorgeous city instead of the architectural garbage heap that it is today).


I've realized after buying this house we're in now that when shopping for a house, if you want to be a gardener, it's actually a liability to buy a house with a nice existing garden. The problem is you will want to work with what's already there, to respect those plants, and to save yourself work. But most likely the previous owner did some dumb things, planting big trees in bad places, or picking bad species or whatever. It's hard for me to just pull the trigger and rip out a garden that's already pretty nice, but if I had a crap garden I could plan it from the beginning and get more of what I want.


Typical "American Vernacular" style, taken from a real estate listing in my neighborhood :

Contrast with a Seattle park (Llandover Woods) that has very minimal sculpting, but is sticking to what this area should look like :


11-12-12 | The Destruction of American Democracy

People who actually care about democracy in America should be very concerned about the trend in the last 10 years. (granted if we look back farther to the LBJ era and before, corruption was rife in American politics, but it seems like it got better for a while, and now it's getting worse again).

1. Corporate Speech / Unlimited political spending.

Duh. Can we impeach Thomas and Scalia already? Every lawyer and judge in this country knows that they have no business being on the bench; they are literally the punch-line of law school jokes. It's a farce that we have such incompetent, corrupt, lazy, biased buffoons making some of the most important decisions in the country. (in case you aren't aware : in Citizens United (as with countless other cases), Thomas and Scalia were known to be meeting with the supporters of the conservative side of the case).

Without campaign finance reform, there is no democracy. Both parties are just the parties of big corporations now. Both parties are the puppets of wall street, the military, the health-care complex, the cable companies, etc, and none of those interests will ever be harmed, no matter how much they fuck over the populace. Politicians are the puppets of money, and money wins elections. With the current court, the only way we'll get serious campaign finance reform would be with an amendment, which is pretty unlikely in this day.

(in amusing absurdism, lots of states are going after political speech by labor unions, and the courts have so far been upholding it. While I basically agree that Unions should not be making political speech, or at least their members should be allowed to opt out of funding it, it's in odd opposition to allowing unlimited corporate political speech)

2. Electoral College.

I think everyone with a brain realizes now that the electoral college is a huge disaster that's ruining national politics. National elections hinge entirely on the results in a few swing states, thus national party platforms and attention are directed at the interests of those states. The majority of the country has almost no say in national elections. It's completely retarded.

The electoral college also means that the national elections are disproportionately controlled by the state governments of a few swing states, which gives those state governments massive power over the nation that we all must be concerned about.

3. Voter roll tampering.

Voter roll tampering should be really shocking to anyone of either party that respects the right to vote. At the moment it appears to be the Republicans who are mainly using this tool (certainly in the olden corrupt days, the dems were masters of it).

For a while the main tool was expunging criminals from the rolls (with collateral damage to non-criminals). The new tool of choice is "voter id" laws. Voter id sounds okay in theory, but in practice the point of the voter id laws is to remove some poor people and some old people from the rolls, because they tend to vote more Dem. I've heard some Dems say it's not a big deal, it's only 20,000 people or so that lose their right to vote, but of course that number is *massive* in the swing states.

In a wider view, the problem is that the party which gets into majority in the state government has the power to change the rules to affect future elections. That should raise some serious eyebrows, and brings us to the next point :

4. Gerrymandering.

It's completely ridiculous that the party in charge (in many states) gets to draw the voting districts.

I think a lot of people don't realize how powerful this is, or how widespread, or how ridiculous many of the districts are. (see for example : amusing maps and disturbing control ). We're being robbed right in front of our eyes and we're not doing anything about it, and they're laughing all the way to the bank; it's sickening.

If you win control of a state by even a tiny margin like 51-49 , you can rig the districts to go for your party by a huge margin, like 12-4. The way you do it is you put all the opposition support into a few districts that they will win in landslides, like 95-5, and then you spread out your support just thin enough to guarantee lots of wins, like 55-45 wins. (if you have a 51% majority of the state's population, you can split your state into just 2 districts where the opposition wins 95-5 , and 23 districts where you win 55-45, giving you a 23-2 majority from 51-49).

I've seen proposals that there should be better non-political committees to draw the voting districts (something like direct-elected long term seats), but I think they're all doomed to be corrupt eventually. I would much rather see the elimination of voting districts entirely, and instead use direct state-wide election (something like : you vote for your top 5 people and the people with the most votes get the seats). (multi-vote systems are also a big win for other reasons; they give non-mainstream candidates a better chance of winning, and allow viable 3rd parties to form)

(I understand that the idea of local districts is that you have a rep in your area to help address your local issues, but I think that's basically a myth; the only thing local representation does is encourage corruption, as the rep tries to get tax cuts and earmarks for business interests in their district)

amusing disclosure : my first ever software job was working on gerrymandering software ("redistricting software" but of course we all knew what it was for). It was a CAD package that had all the census data, and you could move the borders around and see the political balance in each district so that you could easily adjust the lines to get the voter ratios you wanted. We sold it to Cook County, which is one of the classic Dem-side gerrymanders (what they do is take little slices of Democratic Chicago and put them into the suburban districts that might otherwise go Republican).


11-09-12 | Rotten Systems

1. Excessively long and hidden contracts are the not-much-discussed open secret behind the destruction of consumer rights.

Anytime you do anything in America these days you are handed a fifty page contract (*) with all kinds of crazy clauses that nobody reads. And even if you do read it and object, what are you supposed to do? Not sign it? The competition has the exact same kind of abusive contract. As a consumer you can't choose to avoid them.

(* = if you are actually handed the real contract, that's rare and you should consider that company to be upstanding. In reality there is usually a clause somewhere that says "the full contract is available by request" and there's another 200 pages you don't know about that have even more exclusions. And even if you do request the full contract and actually get it, by the time you get it they've changed it and you no longer have the latest. And even if you like the terms you see and sign it, they'll change it the next day, and one of the clauses was "the agreement is superseded by changes to the contract" or some such. You may as well just sign a blank page that says "you may rape me however you choose" because they can always change the rules)

Tricky contracts are one of the forgotten evils behind the whole mortgage crisis. Asshole type-A republican types will say "it's your own fault if you don't read the fine print; you were given a contract and agreed to it, you have to live with it". Bullshit. We are all presented with reams of intentionally-obfuscated lawyer-speak, it is totally unreasonable to allow it.

There need to be laws about limiting the complexity of contracts. There need to be laws about limiting the amount of fine print in fee structures; pricing needs to be up front and standardized and clearly advertised.

Things like the classic "It's only $9.95 (with ten more monthly payments of 99 thousand)" should just be illegal, obviously. There's no social benefit in allowing that kind of advertising. The point of laws is to make a capitalist/social structure which is good for the people, and that kind of shit is not good; it's particularly bad for capitalists who want to play fair by selling a good product with an honest price.

The new evil in abusive contracts is of course software license agreements. These should just be completely illegal. You shouldn't be allowed to compel me to sign anything in order to use the product that I already bought. The purchase implies a standard obligation of functionality and indemnity, that should be the only agreement allowed. The standard indemnity laws should be updated a bit to reflect the modern age of software of course.

2. America desperately needs a real right-to-privacy law.

2.A. Government agencies should not be allowed to sell your personal information to corporations! (among others, the USPS does this, as do most state DMV's). This one is a super no-brainer.

2.B. No corporation should be allowed to sell your personal information without your explicit permission. But what's really needed is :

2.C. Corporations should not be allowed to require any personal information beyond some bare minimum that can identify you. eg. they can ask for SSN and a password of your choosing, but they cannot ask for previous addresses or bank account numbers or etc.

2.D. It should be illegal to tie incentives to privacy violations. eg. club cards that give you discounts in exchange for giving up your privacy. Similarly lots of bills now give you a discount if you allow direct withdrawal from your bank account. Utilities often will allow you to not give your SSN, but only with a fee.

2.E. All privacy options must be set to max-privacy by default. Allowing increased privacy but requiring you to go through forms is no good.

2.F. You should be able to request deletion of all your records. eg. when you close a bank account, or from a doctor, or whatever, you should be able to say "please delete all your info on me" and they should be required by law to comply (if your accounts are in good standing blah blah blah).

3. I feel like the government-corporate complex is intentionally building this world structure in which you are locked into a variety of fixed fees which suck up all your income. (okay this part of the post is going off the deep end a bit)

You don't go work and get your money and then choose how to spend it. The corporate masters have auto-debit on your account and just suck it right out. I know this is retarded hyperbole, but it sort of feels like mining towns where you get paid scrip and then just have to give it right back at the company store, but in this case the company store is apple and the chinese-crap importers (target, walmart, gap, c&b, etc etc).

3.1. Health insurance is perhaps the most obvious; the cost of health care is ridiculously inflated, but that cost is hidden from you a bit (intentionally), so we're all just locked into paying out a massive amount monthly to the health care complex.

3.2. Car insurance is of course the same story. The car insurance companies very much want you to feel like "accidents are no big deal" ; hey lets all jump in our tanks with no visibility and smash into each other, no biggie, the car insurance pays for it. And in the mean time a huge government-required deduction slips out of your account every month.

3.3. Cell phones obviously; and Cable companies; these ones are semi-government-enforced monopolies, and basically not optional in modern life, you are just required by law to give them $200 every month. More and more software wants to move to subscription plans. Everyone is getting very clever about making the easy way out just being "give us lots of money every month automatically" , and you have to work harder and harder to actually be proactive about spending your money.

3.4. I actually think online purchasing is part of this and is changing the whole relationship people have with money. You very often no longer get to see the thing you are buying before you buy it. Then when it sucks, it's usually too much trouble to return it. The result is that the whole purchasing is more like "send money out into the ether and then products show up which I have no control over". It's almost like a constant tax, and then they give you some shitty products once in a while.


11-08-12 | Job System Task Types

I think I've written this before, but a bit more verbosely :

It's useful to have at least 3 classes of task in your task system :

1. "Normal" tasks that want to run on one of the worker threads. These take some amount of time, you don't want them to block other threads, they might have dependencies and create other jobs.

2. "IO" tasks. This is CPU work, but the main thing it does is wait on an IO and perhaps spawn another one. For example something like copying a file through a double-buffer is basically just wait on an IO op then do a tiny bit of math, then start another IO op. This should be run on the IO thread, because it avoids all thread switching and thread communication as it mediates the IO jobs.

(of course the same applies to other subsystems if you have them)

3. "Tiny" / "Run anywhere" tasks. These are tasks that take very little CPU time and should not be enqueued onto worker threads, because the time to wake up a thread for them dwarfs the time to just run them.

The only reason you would run these as async tasks at all is because they depend on something. Generally this is something trivial like "set a bool after this other job is done".

These tasks should be run immediately when they become ready-to-run on whatever thread made them rtr. So they might run on the IO thread, or a worker thread, or the main thread (any client thread). eg. if a tiny task is enqueued on the main thread and its dependencies are already done, it should just be run immediately.

It's possible that class #2 could be merged into class #3. That is, eliminate the IO-tasks (or GPU-tasks or whatever) and just call them all "tiny tasks". You might lose a tiny bit of efficiency from that, but the simplicity of having only two classes of tasks is probably preferable. If the IO tasks are made into just generic tiny tasks, then it's important that the IO thread be able to execute tiny tasks from the generic job system itself, otherwise it might go to sleep thinking there is no IO to be done, when a pending tiny IO task could create new IO work for it.
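
To make the dispatch rule concrete, a sketch of the submit path (all names hypothetical) :

void Task_Submit( Task * t )
{
    if ( ! Task_DependenciesDone( t ) )
    {
        Task_AddToPendingList( t ); // revisited when its dependencies complete
    }
    else if ( t->flags & TASK_TINY )
    {
        Task_Run( t );              // run right now on the current thread ;
                                    // waking a worker costs more than the task itself
    }
    else
    {
        WorkerPool_Enqueue( t );    // normal task : hand it to a worker thread
    }
}

The same check runs again when a dependency completes : whichever thread marks the last dependency done runs the tiny task in place.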

Okay.

Beyond that, for "normal" tasks there's the question of typical duration, which tells you whether it's worth it to fire up more threads.

eg. say you shoot 10 tasks at your thread-pool worker system. Should you wake up 1 thread and let it do all 10 ? Or wake up 10 threads and give each one a task? Or maybe wake 2?

One issue that still is bugging me is when you have a worker thread, and in doing some work it makes some more tasks ready-to-run. Should it fire up new worker threads to take those tasks, or should it just finish its task and then do them itself? You need two pieces of information : 1. are the new tasks significant enough to warrant a new thread? and 2. how close to the end of my current task am I? (eg. if I'm in the middle of some big work I might want to fire up a new thread even though the new RTR tasks are tiny).

When you have "tiny" and "normal" tasks at the same priority level, it's probably worth running all the tinies before any normals.

Good lord.


11-06-12 | Protect Ad-Hoc Multi-Threading

Code with un-protected multi-threading is bad code just waiting to be a nightmare of a bug.

"ad-hoc" multi-threading refers to sharing of data across threads without an explicit sharing mechanism (such as a queue or a mutex). There's nothing wrong per-se with ad-hoc multi-threading, but too often people use it as an excuse for "comment moderated handoff" which is no good.

The point of this post is : protect your threading! Use name-changes and protection classes to make access lifetimes very explicit and compiler (or at least assert) moderated rather than comment-moderated.

Let's look at some examples to be super clear. Ad-Hoc multi-threading is something like this :


int shared;


thread0 :
{

shared = 7; // no atomics or protection or anything

// shared is now set up

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

print shared;

}

thread1 :
{

shared ++;

}

this code works (assuming that thread creation and waiting has some kind of memory barrier in it, which it usually does), but the hand-offs and synchronization are all ad-hoc and "comment moderated". This is terrible code.

I believe that even with something like a mutex, you should make the protection compiler-enforced, not comment-enforced.

Comment-enforced mutex protection is something like :


struct MyStruct s_data;
Mutex s_data_mutex;
// lock s_data_mutex before touching s_data

That's okay, but comment-enforced code is always brittle and bug-prone. Better is something like :

struct MyStruct s_data_needs_mutex;
Mutex s_data_mutex;
#define MYSTRUCT_SCOPE(name)    MUTEX_IN_SCOPE(s_data_mutex); MyStruct & name = s_data_needs_mutex;

assuming you have some kind of mutex-scoper class and macro. This makes it impossible to accidentally touch the protected stuff outside of a lock.
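
(For reference, the assumed scoper is just the usual RAII lock guard; a sketch, assuming a Mutex with Lock/Unlock methods :)

class MutexInScope
{
public:
    explicit MutexInScope( Mutex & m ) : m_mutex(m) { m_mutex.Lock(); }
    ~MutexInScope() { m_mutex.Unlock(); }
private:
    Mutex & m_mutex;
    // non-copyable :
    MutexInScope( const MutexInScope & );
    MutexInScope & operator=( const MutexInScope & );
};

#define MUTEX_IN_SCOPE(m)   MutexInScope RR_STRING_JOIN(mis_,__LINE__) (m)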

Even cleaner is to make a lock-scoper class that un-hides the data for you. Something like :

//-----------------------------------

template <typename t_data> class ThinLockProtectedHolder;

template <typename t_data>
class ThinLockProtected
{
public:

    ThinLockProtected() : m_lock(0), m_data() { }
    ~ThinLockProtected() { }    

protected:

    friend class ThinLockProtectedHolder<t_data>;
    
    OodleThinLock   m_lock;
    t_data          m_data;
};

template <typename t_data>
class ThinLockProtectedHolder
{
public:

    typedef ThinLockProtected<t_data>   t_protected;
    
    ThinLockProtectedHolder(t_protected * ptr) : m_protected(ptr) { OodleThinLock_Lock(&(m_protected->m_lock)); }
    ~ThinLockProtectedHolder() { OodleThinLock_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define TLP_SCOPE(t_data,ptr,data)  ThinLockProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//--------
/*

// use like :

    ThinLockProtected<int>  tlpi;
    {
    TLP_SCOPE(int,&tlpi,shared_int);
    shared_int = 7;
    }

*/

//-----------------------------------
Errkay.

So the point of this whole post is that even when you are just doing ad-hoc thread ownership, you should still use a robustness mechanism like this. For example by direct analogy you could use something like :

//=========================================================================

template <typename t_data> class AdHocProtectedHolder;

template <typename t_data>
class AdHocProtected
{
public:

    AdHocProtected() : 
    #ifdef RR_DO_ASSERTS
        m_lock(0), 
    #endif
        m_data() { }
    ~AdHocProtected() { }

protected:

    friend class AdHocProtectedHolder<t_data>;
    
    #ifdef RR_DO_ASSERTS
    U32             m_lock;
    #endif
    t_data          m_data;
};

#ifdef RR_DO_ASSERTS
void AdHoc_Lock( U32 * pb)  { U32 old = rrAtomicAddExchange32(pb,1); RR_ASSERT( old == 0 ); }
void AdHoc_Unlock(U32 * pb) { U32 old = rrAtomicAddExchange32(pb,-1); RR_ASSERT( old == 1 ); }
#else
#define AdHoc_Lock(xx)
#define AdHoc_Unlock(xx)
#endif


template <typename t_data>
class AdHocProtectedHolder
{
public:

    typedef AdHocProtected<t_data>  t_protected;
    
    AdHocProtectedHolder(t_protected * ptr) : m_protected(ptr) { AdHoc_Lock(&(m_protected->m_lock)); }
    ~AdHocProtectedHolder() { AdHoc_Unlock(&(m_protected->m_lock)); }
    
    t_data & Data() { return m_protected->m_data; }
    
protected:

    t_protected *   m_protected;

};

#define ADHOC_SCOPE(t_data,ptr,data)    AdHocProtectedHolder<t_data> RR_STRING_JOIN(tlph,data) (ptr); t_data & data = RR_STRING_JOIN(tlph,data).Data();

//==================================================================
which provides scoped checked ownership of variable hand-offs without any explicit mutex.

We can now revisit our original example :


AdHocProtected<int> ahp_shared;


thread0 :
{

{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared = 7; // access now checked by the scoper

// shared is now set up
}

start thread1;

// .. do other stuff ..

kill thread1;
wait thread1;

{
ADHOC_SCOPE(int,&ahp_shared,shared);
print shared;
}

}

thread1 :
{
ADHOC_SCOPE(int,&ahp_shared,shared);

shared ++;

}

And now we have code which is efficient, robust, and safe from accidents.


11-06-12 | Using your own malloc with operator new

In Oodle, I use my own allocator, and I wish to be able to still construct & destruct classes. (Oodle's public API is C only, but I use C++ internally).

The traditional way to do this is to write your own "operator new" implementation which will link in place of the library implementation. This way sucks for various reasons. The important one to me is that it changes all the "new"s of any other statically-linked code, which is just not an okay thing for a library to do. You may want to have different mallocs for different purposes; the whole idea of a single global allocator is kind of broken in the modern world.

(the presence of global state in the C standard lib is part of what makes C code so hard to share. The entire C stdlib should be a passed-in vtable argument. Perhaps more on this in a later post.)

Anyway, what I want is a way to do a "new" without interfering with client code or other libraries. It's relatively straightforward (*), but there are a few little details that took me a while to get right, so here they are.

(* = ADDENDUM = not straightforward at all if multiple-inheritance is used and deletion can be done on arbitrary parts of the MI class)

//==================================================================

/*

subtlety : just calling placement new can be problematic; it's safer to make an explicit selection
of placement new.  This is how we call the constructor.

*/

enum EnumForPlacementNew { ePlacementNew };

// explicit access to placement new when there's ambiguity :
//  if there are *any* custom overrides to new() then placement new becomes ambiguous   
inline void* operator new   (size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
inline void  operator delete(void*,  EnumForPlacementNew, void*) { }

#ifdef __STRICT_ALIGNED
// subtlety : some stdlibs have a non-standard operator new with alignment (second arg is alignment)
//  note that the alignment is not getting passed to our malloc here, so you must ensure you are
//    getting it in some other way
inline void* operator new   (size_t , size_t, EnumForPlacementNew, void* pReturn) { return pReturn; }
#endif

// "MyNew" macro is how you do construction

/*

subtlety : trailing the arg list off the macro means we don't need to do this kind of nonsense :

    template <typename Entry,typename t_arg1,typename t_arg2,typename t_arg3,typename t_arg4,typename t_arg5,typename t_arg6,typename t_arg7,typename t_arg8,typename t_arg9>
    static inline Entry * construct(Entry * pEntry, t_arg1 arg1, t_arg2 arg2, t_arg3 arg3, t_arg4 arg4, t_arg5 arg5, t_arg6 arg6, t_arg7 arg7, t_arg8 arg8, t_arg9 arg9)

*/

//  Stuff * ptr = MyNew(Stuff) (constructor args); 
//  eg. for void args :
//  Stuff * ptr = MyNew(Stuff) ();
#define MyNew(t_type)   new (ePlacementNew, (t_type *) MyMalloc(sizeof(t_type)) ) t_type 

// call the destructor :
template <typename t_type>
static inline t_type * destruct(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    ptr->~t_type();
    return ptr;
}

// MyDelete is how you kill a class

/*

subtlety : I like to use a Free() which takes the size of the object.  This is a big optimization
for the allocator in some cases (or lets you not store the memory size in a header of the allocation).
*But* if you do this, you must ensure that you don't use sizeof() if the object is polymorphic.
Here I use MSVC's nice __has_virtual_destructor() extension to detect if a type is polymorphic.

*/

template <typename t_type>
void MyDeleteNonVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    #ifdef _MSC_VER
    RR_COMPILER_ASSERT( ! __has_virtual_destructor(t_type) );
    #endif
    destruct(ptr);
    MyFree_Sized((void *)ptr,sizeof(t_type));
}

template <typename t_type>
void MyDeleteVirtual(t_type * ptr)
{
    RR_ASSERT( ptr != NULL );
    destruct(ptr);
    // can't use size :
    MyFree_NoSize((void *)ptr);
}

#ifdef _MSC_VER

// on MSVC , MyDelete can select the right call at compile time

template <typename t_type>
void MyDelete(t_type * ptr)
{
    if ( __has_virtual_destructor(t_type) )
    {
        MyDeleteVirtual(ptr);
    }
    else
    {
        MyDeleteNonVirtual(ptr);
    }
}

#else

// must be safe and use the polymorphic call :

#define MyDelete    MyDeleteVirtual

#endif

and the end result is that you can do :

    foo * f = MyNew(foo) ();

    MyDelete(f);

and you get normal construction and destruction but with your own allocator, and without polluting (or depending on) the global linker space. Yay.
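(if you want to compile the above outside of Oodle, these are the entry points it assumes; minimal stand-ins using plain malloc/free, purely for illustration :)

#include <stdlib.h>
#include <assert.h>

// stand-ins for the Oodle-side pieces used above :
#define RR_ASSERT(exp)           assert(exp)
#define RR_COMPILER_ASSERT(exp)  static_assert(exp, #exp)

void * MyMalloc(size_t size)                 { return malloc(size); }
void   MyFree_Sized(void * ptr, size_t size) { (void)size; free(ptr); }
void   MyFree_NoSize(void * ptr)             { free(ptr); }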


10-26-12 | Simple Stuff C Should Have

I'm not interested in pie-in-the-sky weirdo shit, only very simple stuff that I believe would make real world coding massively better.

1. Global variables should not be allowed without a "global" prefix (optionally). eg.


int x;  // compile failure!
static int y;
global int x; // ok

(this should be turned on as an option). I should be able to search for "global" and find them all (and remove them, because globals are pointless and horrible, especially in the modern age of multi-threading).

2. Name-hiding should require an explicit "hide" prefix, eg.


int x;

{
  int x;  // compile failure !
  hide int x; // ok

}

I hate name-hiding and think it never should have been in the language. It does nothing good and creates lots of bugs. Similarly :

3. Overloads and virtual overrides should have an explicit prefix.

This ensures that you are adding an overload or virtual override intentionally, not by accident just because the names happen to line up. The entire C++ overload/override mechanism only works by coincidence; like it's a coincidence that these names are the same so they are an overload; there's no way to say "I intend this to be an overload, please be an error if it's not" (and the opposite; I intend this to be a unique function, please error if it is an overload).

Similarly, it catches the very common error that you wrote a virtual override and it all worked and then somebody changes the signature in the base class and suddenly you have code that still compiles but the virtuals no longer override.

4. C-style cast should be (optionally) an error.

I should be able to flip a switch and make C-style casts illegal. They are the devil, too easy to abuse, and impossible to search for. There should be a c_cast that provides the same action in a more verbose way.

5. Uninitialized variables should be an error. (optionally).


int x; // compile failure!
int x(0); // ok

Duh. Of course. Same thing with member variables in class constructors. It's ridiculous that it's so easy to use uninitialized memory.

6. Less undefined behavior.

Everything that's currently "undefined" should be a compile error by default. Then let me make it allowed by setting pragmas, like :


#require signed_shift

#require flat_memory_model

which then changes those usages from compile errors to clearly defined operations.

7. Standardize everything that's sort of outside the standard, such as common pragmas, warning disablement, etc.

The way to do this is to have a chunk of standard that's like "you don't have to implement this, but if you do the syntax must be like so" and then provide a way to check if it's implemented.

Just a standard syntax for warning disablement would be great (and of course the ability to do it in ranges or even C scopes).

Things like SIMD could also be added to the language in this way; just simple things like "if you have a simd type it is named this" would massively help with standardizing shit which is different on every platform/compiler for no good reason.

8. (optionally) disable generation of all automatic functions, eg. assignment, copy construct, default construct. The slight short-term convenience of these being auto-generated is vastly outweighed by the bugs they create when you use them accidentally on classes that should not be copied. Of course I know how to put non-copyable junk in classes but I shouldn't have to do that; that should be the default, and they should only be copyable when I explicitly say it's okay. And you shouldn't have to write that out either, there should be a "use default copy" directive that you put in the class when you know it's okay.

9. Trivial reflection. Just give me a way to say "on all member variables". There's no reason not to have this, it's so easy for the compiler to add, just give me a way to do :

template <typename t_op>
void Reflect( t_op & op )
{
    op(member1);
    op(member2);
    ...
}
In which the compiler generates the list of members for me, I don't have to manually do it.
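To be clear what the op is : it's just a functor that gets applied to every member, so you write generic ops once and use them on any reflected struct. A sketch of the usage side, with the Reflect written by hand since the compiler won't do it for us :

#include <iostream>

struct PrintOp
{
    template <typename T>
    void operator () (const T & member) { std::cout << member << "\n"; }
};

struct Vec3
{
    float x,y,z;

    // hand-written today; the proposal is that the compiler emits this :
    template <typename t_op>
    void Reflect( t_op & op ) { op(x); op(y); op(z); }
};

// usage :
//  Vec3 v = { 1, 2, 3 };
//  PrintOp op;
//  v.Reflect(op);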


10-26-12 | Oodle Rewrite Thoughts

I'm getting increasingly annoyed with the C-style Oodle threading code. It's just such a nightmare to manually manage things like object lifetimes in an async / multi-threaded environment.

Even something as simple as "write part of this buffer to a file" constantly causes me pain, because implied in that operation is "the buffer must not be freed until the write is done" , "the buffer should not be changed in the area being written until the write is done" , and "the file should not be closed until the write is done".

When you first start out and aren't doing a lot of complicated ops, it doesn't seem too bad, you can keep those things in your head; they become "comment-enforced" rules; that is, the code doesn't make itself correct, you have to write comments like "// write is pending, don't free buffer yet" (often you don't actually write the comments, but they're still "comment-enforced" as opposed to "code-enforced").

I think the better way is the very-C++-y Oodle futures.

Oodle futures rely on every object they take as inputs having refcounts, so there is no issue of free before exit. Some key points about the Oodle futures that I think are good :

A. Dependencies are automatic based on your arguments. You depend on anything you take as arguments. If the arguments themselves depend on async ops, then you depend on the chain of ops automatically. This is super-sweet and just removes a ton of bugs. You are then required to write code such that all your dependencies are in the form of function arguments, which at first is a pain in the ass, but actually results in much cleaner code overall because it makes the expression of dependencies really clear (as opposed to just touching some global deep inside your function, which creates a dependency in a really nasty way).

B. Futures create implicit async handles; the async handles in the Oodle futures system are all ref-counted so they clean themselves automatically when you no longer care about them. This is way better than the manual lifetime management in Oodle right now, in which you have to hold a bunch of handles and remember to free them all at the right time.

C. It's an easy way to plug in the result of one async op into the input of the next one. It's like an imperative way of using code to do that graph drawing thing ; "this op has an output which goes into this input slot". Without an automated system for this, what I'm doing at the moment is writing lots of little stub functions that just wait on one op, gather up its results and starts the next op. There's no inefficiency in this, it's the same thing the future system does, but it's a pain in the ass.
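Point A can be illustrated in plain standard C++; this is only a sketch of the dependencies-as-arguments idea, with std::shared_future standing in for the ref-counted Oodle future (which it is not) :

#include <future>
#include <cstdio>

int Produce() { return 7; }

// the dependency is the argument; Consume can't run ahead of its input :
int Consume( std::shared_future<int> in ) { return in.get() + 1; }

int main()
{
    std::shared_future<int> a = std::async( std::launch::async, Produce ).share();
    std::shared_future<int> b = std::async( std::launch::async, Consume, a ).share();
    printf( "%d\n", b.get() ); // 8
    return 0;
}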

If I was restarting from scratch I would go even further. Something like :

1. Every object has a refcount AND a read-write lock built in. Maybe the refcount and RW lock count go together in one U32 or U64 which is maintained by lockfree ops. (a sketch of such a packed word is below, after these numbered points)

Refcounting is obvious. Lifetimes of async ops are way too complicated without it.

The RW lock in every object is something that sophomoric programmers don't see the need for. They think "hey it's a simple struct, I fill it on one thread, then pass it to another thread, and he touches it". No no no, you're a horrible programmer and I don't want to work with you. It seems simple at first, but it's just so fragile and prone to bugs any time you change anything, it's not worth it. If every object doesn't just come with an RW lock it's too easy to be lazy and skip adding one, which is very bad. If the lock is uncontended, as in the simple struct handoff case above, then it's very cheap, so just use it anyway.

2. Whenever you start an async op on an object, it takes a ref and also takes either a read lock or write lock.

3. Buffers are special in that you RW lock them in ranges. Same thing with textures and such. So you can write non-overlapping ranges simultaneously.

4. Every object has a list of the ops that are pending on that object. Any time you start a new op on an object, it is delayed until those pending ops are done. Similarly, every op has a list of objects that it takes as input, and won't run until those objects are ready.
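To be concrete about point 1, here's a sketch of the packed word (the field layout is my assumption, and a real version would want to wait rather than just try-lock) :

#include <atomic>
#include <stdint.h>

// assumed field layout (not Oodle's) :
static const uint64_t kRef    = 1ull;        // bits 0..31  : refcount
static const uint64_t kReader = 1ull << 32;  // bits 32..62 : reader count
static const uint64_t kWriter = 1ull << 63;  // bit 63      : writer flag

struct RefAndRWLock
{
    std::atomic<uint64_t> word;

    RefAndRWLock() : word(1) { } // starts with one ref, unlocked

    void AddRef()  { word.fetch_add(kRef, std::memory_order_relaxed); }
    bool Release() // true when the last ref is gone
    {
        uint64_t old = word.fetch_sub(kRef, std::memory_order_acq_rel);
        return (old & 0xFFFFFFFFull) == 1;
    }

    bool TryReadLock()
    {
        uint64_t cur = word.load(std::memory_order_relaxed);
        while ( ! (cur & kWriter) ) // no writer -> can add a reader
        {
            if ( word.compare_exchange_weak(cur, cur + kReader,
                    std::memory_order_acquire, std::memory_order_relaxed) )
                return true;
        }
        return false;
    }
    void ReadUnlock() { word.fetch_sub(kReader, std::memory_order_release); }

    bool TryWriteLock()
    {
        uint64_t cur = word.load(std::memory_order_relaxed);
        while ( (cur >> 32) == 0 ) // no readers, no writer
        {
            if ( word.compare_exchange_weak(cur, cur | kWriter,
                    std::memory_order_acquire, std::memory_order_relaxed) )
                return true;
        }
        return false;
    }
    void WriteUnlock() { word.fetch_and(~kWriter, std::memory_order_release); }
};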

The other big thing I would do in a rewrite from scratch is the basic architecture :

1. Write all my own threading primitives (semaphore, mutex, etc) and base them on a single waitset. (I basically have this already).

2. Write stack-ful coroutines.

3. When the low level Wait() is called on a stackful coroutine, instead yield the coroutine.

That way the coroutine code can just use Semaphore or whatever, and when it goes to wait on the semaphore, it will yield instead. It makes the coroutine code exactly the same as non-coroutine code and makes it "composable" (eg. you can call functions and they actually work), which I believe is crucial to real programming. This lets you write stackful coroutine code that does file IO or waits on async ops or whatever, and when you hit some blocking code it just automatically yields the coroutine (instead of blocking the whole worker thread).

This would mean that you could write coroutine code without any special syntax; so eg. you can call the same functions from coroutines as you do from non-coroutines and it Just Works the way you want. Hmm I think I wrote the same sentence like 3 times, but it's significant enough to bear repetition.


10-22-12 | Windows 7 Start Menu Input Race

I've been super annoyed by some inconsistent behavior in the Windows 7 start menu for a while now. Sometimes I hit "Start - programname - enter" and nothing happens. I just sort of put it down to "god damn Windows is flakey and shit" but I finally realized yesterday exactly what's happening.

It's an input race, as previously discussed here

What happens is, you hit Start, and you get your focus in the type-in-a-program edit box. That part is fine. You type in a program name. At that point it does the search in the start menu thing in the background (it doesn't stall after each key press). In many cases there will be a bit of a delay before it updates the list of matching programs found.

If you hit Enter before it finds the program and highlights it, it just closes the dialog and doesn't run anything. If you wait a beat before hitting enter, the background program-finder will highlight the thing and hitting enter will work.

Very shitty. The start menu should not have keyboard input races. In this case the solution is obvious and trivial - when you hit enter it should wait on the background search task before acting on that key (but if you hit escape it should immediately close the window and abort the task without waiting).

I've long been an advocate of video game programmers doing "flakiness" testing by playing the game at 1 fps, or capturing recordings of the game at the normal 30 fps and then watching them play back at 1 fps. When you do that you see all sorts of janky shit that should be eliminated, like single frame horrible animation pops, or in normal GUIs you'll see things like the whole thing redraw twice in a row, or single frames where GUI elements flash in for 1 frame in the wrong place, etc.

Things like input races can be very easily found if you artificially slow down the program by 100X or so, so that you can see what it's actually doing step by step.

I'm a big believer in eliminating this kind of flakiness. Almost nobody that I've ever met in development puts it as a high priority, and it does take a lot of work for apparently little reward, and if you ask consumers they will never rate it highly on their wish list. But I think it's more important than people realize; I think it creates a feeling of solidness and trust in the application. It makes you feel like the app is doing what you tell it to, and if your avatar dies in the game it's because of your own actions, not because the stupid game didn't jump even though you hit the jump button because there was one frame where it wasn't responding to input.


10-22-12 | LZ-Bytewise conclusions

Wrapping this up + index post. Previous posts in the series :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3
cbloom rants 09-04-12 - Encoding Values in Bytes Part 4
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-10-12 - LZ4 - Large Window
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-13-12 - LZNib
cbloom rants 09-14-12 - Things Most Compressors Leave On the Table
cbloom rants 09-15-12 - Some compression comparison charts
cbloom rants 09-23-12 - Patches and Deltas
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 09-28-12 - LZNib on enwik8 with Long Range Matcher
cbloom rants 09-30-12 - Long Range Matcher Notes
cbloom rants 10-02-12 - Small note on LZHAM
cbloom rants 10-04-12 - Hash-Link match finder tricks
cbloom rants 10-05-12 - OodleLZ Encoder Speed Variation with Worker Count
cbloom rants 10-07-12 - Small Notes on LZNib
cbloom rants 10-16-12 - Two more small notes on LZNib

And some little additions :

First a correction/addendum on cbloom rants 09-04-12 - LZ4 Optimal Parse :

I wrote before that going beyond the 15 states needed to capture the LRL overflowing the control byte doesn't help much (or at all). That's true if you only go up to 20 or 30 or 200 states, but if you go all the way to 270 states, so that you capture the transition to needing another byte, there is some win to be had (LZ4P-LO-332 got lztestset to 12714031 with small optimal state set, 12492631 with large state set).

If you just do it naively, it greatly increases memory use and run time. However, I realized that there is a better way. The key is to use the fact that there are so many code-cost ties. In LZ-Bytewise with the large state set, often the coding decision in a large number of states will have the same cost, and furthermore often the end point states will all have the same cost. When this happens, you don't need to make the decision independently for each state, instead you make one decision for the entire block, and you store a decision for a range of states, instead of one for each state.

eg. to be explicit, instead of doing :


in state 20 at pos P
consider coding a literal (takes me to state 21 at pos P+1)
consider various matches (takes me to state 0 at pos P+L)
store best choice in table[P][20]

in state 21 ...

do :

in states 16-260 at pos P
consider coding a literal (takes me to states 17-261 at pos P+1 which I saw all have the same cost)
consider various matches (takes me to state 0 at pos P+L)
store in table[P] : range {16-260} makes decision X

in states 261-263 ...

so you actually can do the very large optimal parse state set with not much increase in run time or memory use.
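The bookkeeping is simple; here's a sketch of collapsing the per-state decisions into ranges (names hypothetical, the actual cost evaluation omitted) :

#include <vector>

// one stored decision covering a run of states that tied :
struct DecisionRange { int stateLo, stateHi, choice; };

// collapse per-state decisions into ranges wherever adjacent states
// made the same choice (which in LZ-Bytewise is most of the time) :
std::vector<DecisionRange> CollapseTies( const std::vector<int> & decisions )
{
    std::vector<DecisionRange> out;
    int n = (int) decisions.size();
    for (int s = 0; s < n; )
    {
        int e = s;
        while ( e+1 < n && decisions[e+1] == decisions[s] ) e++;
        DecisionRange r = { s, e, decisions[s] };
        out.push_back(r);
        s = e+1;
    }
    return out;
}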

Second : I did a more complex variant of LZ4P (large window). LZ4P-LO includes "last offset". LZ4P-LO-332 uses a 3-bit-3-bit-2-bit control word (as described previously here : cbloom rants 09-10-12 - LZ4 - Large Window ) ; the 2 bit offset reserves one value for LO and 3 values for normal offsets.

(I consider this an "LZ4" variant because (unlike LZNib) it sends LZ codes as a strictly alternating LRL-ML pairs (LRL can be zero) and the control word of LRL and ML is in one byte)

Slightly better than LZ4P-LO-332 is LZ4P-LO-695 , where the numbering has switched from bits to number of values (so 332 should be 884 for consistency). You may have noticed that 6*9*5 = 270 does not fit in a byte, but that's fixed easily by forbidding some of the possibilities. 6-9-5 = 6 values for literals, 9 for match lengths, and 5 for offsets. The 5 offsets are LO + 2 bits of normal offset. So for example one of the ways that the 270 values is reduced is because an LO match can never occur after an LRL of 0 (the previous match would have just been longer), so those combinations are removed from the control byte.

LZ4P-LO-695 is not competitive with LZNib unless you spill the excess LRL and ML (the amount that is too large to fit in the control word) to nibbles, instead of spilling to bytes as in the original LZ4 and LZ4P. Even with spilling to nibbles, it's no better than LZNib. Doing LZ4P-LO-695, I found a few bugs in LZNib, so its results also got better.

Thirdly, current numbers :

             raw       lz4   lz4p332 lz4plo695  lznib-d8      zlib OodleLZHLW
lzt00      16914      6473      6068      6012      5749      4896       4909
lzt01     200000    198900    198880    198107    198107    198199     198271
lzt02     755121    410695    292427    265490    253935    386203     174946
lzt03    3471552   1820761   1795951   1745594   1732491   1789728    1698003
lzt04      48649     16709     15584     15230     14352     11903      10679
lzt05     927796    460889    440742    420541    413894    422484     357308
lzt06     563160    493055    419768    407437    398780    446533     347495
lzt07     500000    265688    248500    240004    237120    229426     210182
lzt08     355400    331454    322959    297694    302303    277666     232863
lzt09     786488    344792    325124    313076    298340    325921     268715
lzt10     154624     15139     13299     11774     11995     12577      10274
lzt11      58524     25832     23870     22381     22219     21637      19132
lzt12     164423     33666     30864     29023     29214     27583      24101
lzt13    1041576   1042749   1040033   1039169   1009055    969636     923798
lzt14     102400     56525     53395     51328     51522     48155      46422
lzt15      34664     14062     12723     11610     11696     11464      10349
lzt16      21504     12349     11392     10881     10889     10311       9936
lzt17      53161     23141     22028     21877     20857     18518      17931
lzt18     102400     85659     79138     74459     76335     68392      59919
lzt19     768771    363217    335912    323886    299498    312257     268329
lzt20    1179702   1045179    993442    973791    955546    952365     855231
lzt21     679936    194075    113461    107860    102857    148267      83825
lzt22     400000    361733    348347    336715    331960    309569     279646
lzt23    1048576   1040701   1035197   1008638    989387    777633     798045
lzt24    3471552   2369885   1934129   1757927   1649592   2289316    1398291
lzt25    1029744    324190    332747    269047    230931    210363      96745
lzt26     262144    246465    244990    239816    239509    222808     207600
lzt27     857241    430350    353497    315394    328666    333120     223125
lzt28    1591760    445806    388712    376137    345343    335243     259488
lzt29    3953035   2235299   1519904   1451801   1424026   1805289    1132368
lzt30     100000    100394    100393    100010    100013    100020     100001
total   24700817  14815832  13053476  12442709  12096181  13077482   10327927

And comparison charts on the aggregated single file lzt99 :

Speeds are the best of 20 trials on each core; speed is the best of either x86 or x64 (usually x64 is faster). The decode times measured are slightly lower for everybody in this post (vs the last post of this type) because of the slightly more rigorous timing runs. For reference the decode speeds I measured are (mb/s) :


LZ4 :      1715.10235
LZNib :     869.1924302
OodleLZHLW: 287.2821629
zlib :      226.9286645
LZMA :       31.41397495

Also LZNib current enwik8 size : (parallel chunking (8 MB chunks) and LRM 12/12 with bubble)

LZNib enwik8 mml3 : 30719351
LZNib enwik8 stepml : 30548818

(all other LZNib results are for mml3)


10-16-12 | Thoughts on Bit-Packing Structs Before Compression

If you're trying to transmit some data compactly, and you are *not* using any back-end compression, then it's relatively straightforward to pack the structs through ad-hoc "bit packing" - you just want to squeeze them into as few bits as possible. But if you are going to apply a standard compressor after bit packing, it's a little less clear. In particular, a lot of people make mistakes that result in larger final data than necessary.

To be clear, there are two compression steps :


{ Raw structs } --[ad hoc]--> { Bit Packed } --[compressor]--> { Transmitted Data }

What you actually want to minimize is the size of the final transmitted data, which is not necessarily achieved with the smallest bit packed data.

The ideal scenario is if you know your back-end compressor, simply try a variety of ways of packing and measure the final size. You should always start with completely un-packed data, which often is a reasonable way to go. It's also important to keep in mind the speed hit of bit packing. Compressors (in particular, decompressors) are very fast, so even though your bit-packing may just consist of some simple math, it actually can very easily be much slower than the back-end decompressor. Many people incorrectly spend CPU time doing pre-compression bit-packing, when they would be better off spending that same CPU time by just running a stronger compressor and not doing any twiddling themselves.

The goal of bit-packing should really be to put the data in a form that the compressor can model efficiently. Almost all compressors assume an 8-bit alphabet, so you want your data to stay in 8-bit form (eg. use bit-aligned packing, don't use non-power-of-2 multiplies to tightly pack values if they will cross a byte boundary). Also almost all compressors, even the best in the world (PAQ, etc) primarily achieve compression by modeling correlation between neighboring bytes. That means if you have data that does not have the property of maximum correlation to its immediate neighbor (and steady falloff) then some swizzling may help, just rearranging bytes to put the correlated bytes near each other and the uncorrelated bytes far away.

Some issues to consider :

1. Lossy bit packing.

Any time you can throw away bits completely, you have a big opportunity that you should exploit (which no back end compressor can ever do, because it sends data exactly). The most common case of this is if you have floats in your struct. Almost always there are several bits in a float which are pure garbage, just random noise which is way below the error tolerance of your app. Those bits are impossible to compress and if you can throw them away, that's pure win. Most floats are better transmitted as something like a 16 bit fixed point, but this requires application-specific knowledge about how much precision is really needed.
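eg. a sketch of that float-to-fixed-point transform (the [lo,hi] range is exactly that app-specific knowledge; the names are mine, not from any library) :

#include <stdint.h>

// lossy pack : float in a known [lo,hi] range -> 16-bit fixed point
uint16_t FloatToFixed16( float v, float lo, float hi )
{
    float t = (v - lo) / (hi - lo); // 0..1
    if ( t < 0.f ) t = 0.f;
    if ( t > 1.f ) t = 1.f;
    return (uint16_t)( t * 65535.f + 0.5f );
}

float Fixed16ToFloat( uint16_t q, float lo, float hi )
{
    return lo + ( q * (1.f/65535.f) ) * (hi - lo);
}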

Even if you decide you can't throw away those bits, something that can help is just to get them out of the main stream. Having some random bytes mixed in to an otherwise nicely compressible stream really mucks up the order-0 statistics, so just putting them on the side is a nice way to go. eg. you might take the bottom 4 or 8 bits out of each float and just pass them uncompressed.

(in practical bone-head tips, it's pretty common for un-initialized memory to be passed to compressors; eg. if your structs are padded by C so there are gaps between values, put something highly compressible in the gap, like zero or a duplicate of the neighboring byte)

2. Relationships between values.

Any time you have a struct where the values are not completely independent, you have a good opportunity for packing. Obviously there are cases where one value in a struct can be computed from another and should just not be sent.

There are more subtle cases, like if A = 1 then B has certain statistics (perhaps it's usually high), while if A = 0 then B has other statistics (perhaps it's usually low). In these cases there are a few options. One is just to rearrange the transmission order so that A and B are adjacent. Most back end compressors model correlation between values that are adjacent, so putting the most-related values in a struct next to each other will let the back end find that correlation.

There are also often complicated mathematical relationships. A common case is a normalized vector; the 3 values are constrained in a way that the compressor will never be able to figure out (proof that current compressors are still very far away from the ideal of perfect compression). When possible you want to reduce these related values to their minimal set; another common case is rotation matrices, where 9 floats (36 bytes) can be reduced to 3 fixed points (6-9 bytes).

This is really exactly the same as the kinds of variable changes that you want to do in physics; when you have a lot of values in a struct that are constrained together in some way, you want to identify the true number of degrees of freedom, and try to convert your values into independent unconstrained variables.

When numerical values are correlated to their neighbors, delta transformation may help. (this particularly helps with larger-than-byte values where a compressor will have a harder time figuring it out)
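eg. a sketch of an in-place delta transform on 16-bit values (the inverse is just a running sum) :

#include <stdint.h>

// delta transform in place : the compressor then sees small values near
// zero instead of large slowly-changing absolutes
void DeltaTransform( uint16_t * vals, int count )
{
    uint16_t prev = 0;
    for (int i = 0; i < count; i++)
    {
        uint16_t cur = vals[i];
        vals[i] = (uint16_t)(cur - prev);
        prev = cur;
    }
}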

3. Don't mash together statistics.

A common mistake is to get too aggressive with mashing together values into bits in a way that wrecks the back-end statistical model. Most back end compressors work best if the bytes in the file all have the same probability histogram; that is, they are drawn from the same "source". (as noted in some of the other points, if there are multiple unrelated "sources" in your one data stream, the best thing to do is to separate them from each other in the buffer)

Let me give a really concrete example of this. Say you have some data which has lots of unused space in its bytes, something like :


bytes in the original have values :

0000 + 4 bits from source "A"
0001 + 4 bits from source "B"

(when I say "from source" I mean a random value drawn under a certain probability distribution)

You might be tempted to bit-pack these to compact them before the back end compressor. You might do something like this :


Take the top 4 bits to make a flag bit
Take 8 flag bits and put them in a byte

Then take the 4 bits of either A or B and put them together in the high and low nibble of a byte

eg, in nibbles :

0A 1B 1B 0A 0A 0A 1B 0A 

--[bit packed]-->

01100010 (binary) + ABBAAABA (nibbles)

(and A and B are not the hex numbers but mean 4 bits drawn from that source)

It looks like you have done a nice job of packing, but in fact you've really wrecked the data. The sources A and B had different statistics, and in the original form the compressor would have been able to learn that, because the flag bit was right there in the byte with the payload. But by packing it up tightly what you have done is made a bunch of bytes whose probability model is a mix of {bit flags},{source A},{source B}, which is a big mess.

I guess a related point is :

4. Even straightforward bit packing doesn't work for the reasons you think it does.

Say for example you have a bunch of bytes which only take on the values 0-3 (eg. use 2 bits). You might think that it would be a big win to do your own bit packing before the compressor and cram 4 bytes together into one. Well, maybe.

The issue is that the back end compressor will be able to do that exact same thing just as well. It can see that the bytes only take values 0-3 and thus will send them as 2 bits. It doesn't really need your help to see that. (you could help it if you had say some values that you knew were in 0-3 and some other values you knew were in 0-7, you might de-interleave those values so they are separated in the file, or somehow include their identity in the value so that their statistics don't get mixed up; see #5)

However, packing the bytes down can help in some cases. One is if the values are correlated to their neighbors; by packing them you get more of them near each other, so the correlation is modeled at an effective higher order. (eg. if the back end only used order-0 literals, then by packing you get order-3 (for one of the values anyway)). If the values are not neighbor-correlated, then packing will actually hurt.

(with a Huffman back end packing can also help because it allows you to get fractional bits per original value)

Also for small window LZ, packing down effectively increases the window size. Many people see advantages to packing data down before feeding it to Zip, but largely that is just reflective of the tiny 32k window in Zip (left over from the DOS days and totally insane that we're still using it).

5. Separating values that are independent :

I guess I've covered this in other points but it's significant enough to be redundant about. If you have two different sources (A and B); and there's not much correlation between the two, eg. A's and B's are unrelated, but the A's are correlated to other A's - you should try to deinterleave them.

A common simple case is AOS vs SOA. When you have a bunch of structs, often each value in the struct is more related to the same value in its neighbor struct than to other values within its own struct (eg. struct0.x is related to struct1.x more than to struct0.y). In this case, you should transform from array-of-structs to struct-of-arrays ; that is, put all the .x's together.

For example, it's well known that DXT1 compresses better if you de-interleave the end point colors from the palette interpolation indices. Note that AOS-SOA transformation is very slow if done naively so this has to be considered as a tradeoff in the larger picture.
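To be concrete, the naive transform is just this (a sketch; this straightforward loop is also exactly the cache-unfriendly version the speed warning above is about) :

struct Particle { float x, y, z; };

// AOS -> SOA : all the .x bytes become contiguous, so the compressor
// sees each field as its own (self-correlated) stream
void AosToSoa( const Particle * in, int count,
               float * xs, float * ys, float * zs )
{
    for (int i = 0; i < count; i++)
    {
        xs[i] = in[i].x;
        ys[i] = in[i].y;
        zs[i] = in[i].z;
    }
}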

More generally when given a struct you want to use app-specific knowledge to pack together values that are strongly correlated and de-interleave values that are not.


10-16-12 | Two more small notes on LZNib

Followup to Small Notes on LZNib

1. Because cost ties are common, and ties are not actually ties (due to "last offset"), just changing the order that you visit matches can change your compression. eg. if you walk matches from long to short or short to long or low offset to high offset, etc.

Another important way to break ties is for speed. Basically prefer long matches and long literal runs vs. a series of shorter ones that make the same output length. Because the code cost is integer bytes, you can do this pretty easily by just adding a small bias to the cost (one thousandth of a byte or whatever) each time you start a new match or LRL.
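eg. keep costs in fixed point and charge a hair extra each time an op starts; a sketch (the constants are illustrative, not from LZNib) :

// costs in fixed point so a fraction of a byte can be expressed :
static const int kCostPerByte = 4096;
static const int kCostOpBias  = 4;    // ~1/1000 of a byte per op started

// cost to start a new match or LRL that will output "bytes" bytes;
// exact byte-count ties now resolve toward fewer, longer ops :
int OpCost( int bytes )
{
    return bytes * kCostPerByte + kCostOpBias;
}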

(more generally in an ideal world every compressor should have a lagrange parameter for space-speed tradeoff, but that's the kind of thing nobody ever gets around to)

2. Traditional LZ coders did not output matches unless they were cheaper than literals. That is, say you send a match len in 4 bits and an offset in 12 bits, so a match is 2 bytes - you would think that the minimum match length should be 3 - not 2 - because sending a 2 byte match is pointless (it's cheaper or the same cost to send those 2 bytes as literals (cheaper as literals if you are in a literal run-len already)). By using a larger MML, you can send higher match lengths in your 4 bits, so it should be a win.

This is not true if you have "last offset". With LO in your coder, it is often beneficial to send matches which are not a win (vs literals) on their own. eg. in the above example, minimum match length should be 2 in an LO coder.

This is one of those cases where text and binary data differ drastically. If you never tested on structured data you would not see this. Really the nature of LZ compression on text and binary is so different that it's worth considering two totally independent compressors (or at least some different tweaked config vals). Text match offsets fall off very steadily in a perfect curve, and "last offsets" are only used for interrupted matches, not for re-using an offset (and generally don't help that much). Binary match offsets have very sparse histograms with lots of strong peaks at the record sizes in the file, and "last offset" is used often just as a way of cheaply encoding the common record distance.

On text, it is in fact best to use an MML which only allows matches that are strictly cheaper than sending the same bytes as literals.

If I keep at this work in the future I'm sure I'll get around to doing an LZ specifically designed for structured data; it's sort of hopeless trying to find a compromise that works great on both; I see a lot more win possible.


10-15-12 | Joy

Paso Robles Joy :

The sun, the heat. The big open spaces and shade trees. It makes me want to get naked and feel the hand of the sun on my skin, run around in the field, it's the way human beings should be. The vast rolling hills just beg you to get on a horse and ride (though our shitty world of fences and private property makes that only a fantasy).

The biking is just fantastic. Pretty deserted roads (though there's more traffic than I remember being here 10 years ago), decent pavement. Grape vines and oak trees all around, and lovely rolling terrain. I hate flats, and I kind of hate endless climbs; here I have the sweet mix, a hard sprint climb, and then a fun windy descent, then a bit of a gradual climb, up and down, lots of variety, never boring. Really great riding, and so many different routes with varying difficulty levels, all out in the country but close by.

The smell; maybe above all the thing that hits me any time I come back to California are the sweet smells; sage and grass up on the dry hillsides, and bay laurel down in the river hollows, the gentle breezes just rich with the wild smells.

October might be the best time here; the grapes are ripe and just about to be picked (in fact there are pickers working right now); you can smell them as you ride around, or stop and have a snack. Wine grapes are super delicious; they have much more interesting flavors than the garbage you get in grocery stores, tons of weird musky notes and caramel and just lots of complexity, not so sweet and boring.

I love that everyone drives fast in California. It just makes life so much easier for me, because I'm not constantly fighting the general flow around me. (in fact being used to Northwest driving, I'm often the slowest person here). I know that it doesn't mean that people here are actually more intelligent or better drivers, they're just following the regional habit the same way Northwest people are (the way people so uniformly just go with the habit of their area is a great demonstration of how little actual individuality anyone has; 99% of your "personality" is just where you live and the time and place you were raised), but man it is a fucking bummer driving in the Northwest with all the up-tight busy-body dumbass passive-aggressive speed-limit followers (who are actually very dangerous drivers, because they don't adapt their behavior to the situation at hand).

Seattle Joy :

This is a memory of July/August, trying to remind myself of the good things.

Being able to walk down and swim in the lake is incredibly nice. I love to just swim out a hundred feet or so and float; getting away from shore you get a view of Mt Rainier and the city skyline. Incredibly it really never gets too crowded in the lake, and even on busy boating days it clears out around twilight, which is one of the best times to be in the lake.

It's really magical when the blackberries get ripe all over the city. The sweet rich smell fills the air and you get it just everywhere as you walk around town. You can ride around Mercer Island and stop and snack as you go.

I've found some pretty decent river swims; they aren't the river swims of my dreams (too cold, and not private enough to get naked), but they are a joy on those rare hot summer days, when you get a bracing dip that shocks you and makes you feel alive.

One of the things that I totally take for granted is that we have no bugs. I completely forget that it's true until I go visit some place like PA or The South where you just can't even sit outside at all without being attacked. It absolutely sucks to live in places with bugs and it's some kind of bizarre miracle that we don't have them (it makes no sense to me that we don't, there's lots of water, and it doesn't get that cold, it should be ideal mosquito land, wtf).

Of course the high mountains are really incredible. Once again I only got out backpacking twice; every year I tell myself that I need to go more next year, but it doesn't happen. One problem is that I feel like I can't take that much time off work; another problem is that just staying in the city and swimming and biking near home is so sweet in the summer that the motivation to go way out to the woods is reduced. Anyway, once again I swear I'll try to get out more often next year. It would be easier if I could work in the mountains.


Comparing the Northwest with California, I've had some revelations about what makes for really great driving/riding roads. The driving & riding around Seattle just sucks, and surprisingly in CA, which is a much much more populous state and doesn't really seem to be that much older, it's way better.

The key thing for great roads is that they are somewhat old and now disused. That is, there had to be some reason to put in good country roads long ago (mining, farming) but now there is not much reason for people to be on those roads, so they are low traffic. They have to be old enough to be made before earth-moving equipment, so they are nice and windy and narrow.

The problem with the Northwest is it's just too young. Habitation in the area is only 100 years old; there aren't farm roads from 150 years ago. The only old roads are logging roads and those are/were dirt and temporary. There's only a handful of nice windy old mountain pass roads, and they all are popular tourist attractions which makes them no good for me.

Of course one of the things that makes the Central Coast area so great is the strict development controls that keep the towns from creeping into the countryside and devouring it with endless suburbs. With no housing subdivisions on these old farm roads there's not much use for them, and that makes them heaven for a windy road lover.


Being back in SLO gives me some perspective on how badly I've lived my life. Walking around downtown Tasha asked me if I did this or that, did I go to college parties? did I surf? did I make wine? No, not really. What did I do all the time that I lived here? Pretty much just worked. What a retard.

I feel like I accomplish a lot more than the average programmer, and I like to think that it's because I'm smart and more efficient; I think I have good methodology and solve problems more directly, but maybe I don't, maybe I just work more. When I'm in the moment I can't see it, but any time I look back at my life with 10 years or more of distance I go "wtf I was just doing nothing but working the whole time".

Maybe that's just the way it is for everyone; you work and buy groceries and sleep and go through life without ever doing much.

In related news, I think going out to dinner and going to movies and such is a really horrible way to spend your time. It doesn't really impact your life, you don't remember it down the road, it's just a way of killing time, it's not much better than watching TV or drinking booze (which is the ultimate in "please just make this lifetime go away with as little involvement from me as possible").


10-15-12 | Treat People Like Children

One of the things I've realized in the last year or two is that you should treat people like children. It's what they really want, and it's also just better for yourself.

What I mean is, when you treat someone "like an adult", you let them be responsible for their own decisions, you let them suffer the ill consequences of their own mistakes, and you listen to their words as if they mean what they say literally. When you treat someone "like a child" , you clean up after them, you fix their mistakes for them, you assume that when they say something wrong they didn't mean it, etc.

I think some examples may clarify what I mean.

Say you're going hiking in the mountains with a friend. You notice that they have not brought a jacket and you know it will be cold up there. You say "hey do you want to borrow a jacket?" and they say "nah, I'll be fine". You know fully well they will not be fine. If you "treat them like an adult", you would just let them suffer the ill consequences of their bad decision, but the result will be unpleasant for you as well, they will complain, they'll be all pouty and in a bad mood, they'll want to leave quickly, it will suck. Either you can say "fuck you, I told you to bring a jacket, I want to stay, suck it up!" or you can accommodate them and leave early, and either way sucks. So the better thing is to "treat them like a child" and just say at the start "well I'll bring an extra one anyway in case you want it". (with particularly childish friends you shouldn't even say anything and just silently bring an extra one).

(The same goes with snacks and water and such obviously; basically you're better off being like a mom and carrying a pouch of supplies to keep all the "children" (ie. all humans) from getting cranky).

Say you're driving with your dad and you're lost and he doesn't want to stop for directions. If you treat him "like an adult" you would either just speak to him rationally and say "hey this is silly you need to stop and ask someone, don't be so childish" or you would just let him suffer the ill consequences of being lost. But of course neither of those would actually work (almost nobody responds well to having their bad behavior pointed out to them). What you need to do is treat him like a pouty child and fix the situation yourself; eg. say you really need to pee, can we stop for that please, and then ask for directions yourself.

A very common one is just when someone is really pouty or starts acting like a jerk to you. You could "treat them like an adult" and assume they are aware of what they are saying and actually mean to be a jerk to you. But in reality they probably don't, they are just hungry or cranky or need a poop (they are a child) and you shouldn't take it personally. If you need to interact with them, you should get them some food and water and try to fix their crankiness before proceeding.

I find in general that interactions with people work much better if I treat them like a child. (and the same goes in reverse - I get along much better with people who treat me like a child). (basically the idea of the rational self-responsible adult is an invention that does not correspond to reality)

(I guess a related thing that everyone in "communication" knows is you can't just criticize someone and expect them to rationally accept the information and decide if it is useful or not; you have to butter them up first and do it really gently and all that, just like you were trying to critique your child's drawing ("that tree is awful, trees don't look like lollipops, you moron"))

(I guess 99% of modern publicity is just treating people like children. It doesn't matter how good your product is if it has enough stars on the box; your store can sell garbage if it smells like cookies; it doesn't matter what the president actually says as long as he has good hair. I feel like in the 50's before PR was figured out, that media actually treated adults like adults a bit more, and the cleverness of the modern age is realizing that everyone is an easily manipulated pouty child (suck on your iNipple))


Related : thoughts on using money.

I have enough money now to live comfortably, much more so than when I was a child. The little differences are really what strike me. When I was a child of course you would buy store brand aluminum foil, of course you would use coupons, those dollars all mattered. Buying food at Disneyland was a huge luxury (to be avoided if possible, you can wait till we get out of the park, right?) because it was marked up so much. So the first good use of money is just hey you can buy whatever basic necessities you want and not waste your time worrying about the price.

I've tried various ways of spending money now and think that I've made some discoveries. Fancy cars and fancy houses are not good ways to spend money. They are not any better and don't improve your life. In general buying stuff/goods/toys is not helpful (except when it allows you to do an activity that you could not have otherwise done, and you actually do that activity and enjoy it; eg. buying fancy road bikes has zero value if you already had a bike that was good enough and you enjoyed riding; if it doesn't change your ability to do an activity, just making it faster or easier or whatever has zero actual value; but if you had no bike and buy one and then actually ride it, okay that's a good use of money).

Anyway, one of the best uses of money is just to fix all those little moments of crankiness. Like you're in a museum and you're kind of tired or hungry or thirsty; you start to get cranky and not enjoy it. My depression-era upbringing tells me to just gut it out; stay the hell away from the museum cafe, because it's crap food and it's way overpriced. But with money you can just buy the ten dollar tuna sandwich and it will fix your bad mood; that's a good use of money. (in my youth we would have brought homemade sandwiches).


10-14-12 | Ranting

We're staying in this cabiny rental in Paso Robles. I walk in the door and one of the first things I see is a sign in the kitchen saying "don't cut on the butcher block counter; if you spill wine please use stain remover; etc..". What the fuck is wrong with everyone !? It is mind boggling how bad you all are at life. (my ex-landlady had the same dumb rule about the butcher block). First of all, the whole point of butcher block counter is to cut on, you dumb suburban ladies see them in magazines and think it looks nice and don't actually understand them, but whatever even if you don't agree with that - don't put it in a fucking rental unit if you don't want people to cut on it, it's just so incredibly stupid, it's creating problems for yourself. Of course renters are going to use it in a reasonable way (eg. cut on it) not in your unreasonable uptight way. (it's particularly ridiculous here because it's a rustic cabiny thing with super-shitty home-improver home-depot fixtures; ooo wouldn't want to get a water mark on your plywood furniture ooo)

It sort of reminds me of all the terrible park designers who make these circuitous awkward paths such that the direct route to get through the park is straight through the greenery, and then put up signs saying "stay off the grass". If you wanted people to stay off the grass you should have put the path in the natural place to walk. Of course people are going to cut off your dumb artsy loop. It's not their fault for walking on the grass, they're not doing anything wrong, you did something wrong by building your paths dumbly.

Driving down I-5 to get here, you're on this seemingly endless straight flat stretch of highway; I spent the whole time thinking about what incredible morons everyone around me was. First of all, hardly anybody seems to use cruise control, so I'm going along at 70 and people keep passing me and then slowing down in front of me and such annoying shit. Dumbasses. Even the people who are reasonably adept at controlling a car are just such inconsiderate assholes. For example people would constantly pull out to pass me at like 71 mph when I'm going 70, which causes them to box me in on the left for like half an hour because it takes them so long to pass; inevitably some truck is in the right lane and I get trapped. It was so consistent that I started to just hold the left lane (which I hate to do and felt like an asshole) until another car proved to me that they were a decent human being (eg. would pass me quickly). I'm trying to be less courteous to strangers; my new rule is that you have to give me some kind of sign that you aren't a waste of oxygen before I get upset with myself for inconveniencing you. (it's not really working yet, I still instinctively get out of the way of assholes who are rudely barging through a crowd, etc)

Tasha and I have both gotten speeding tickets recently while passing, right in that brief moment when we sped up to get the pass over with quickly. Speeding tickets in general are obviously a farce, so this is no surprise, but it's completely absurd to suggest that you should pass without speeding. If someone is going 3 mph under the limit and you want to pass (in a 65 mph zone, and assuming you need 200 feet of clearance on each side to pass safely) it would take 1.64 miles to make the pass (3 mph of closing speed is about 4.4 feet per second, so gaining the 400 feet takes around 90 seconds, during which you cover 1.64 miles at 65 mph). Of course the safest way to make a pass is to get it over with as quickly as possible, eg. pop up to 90 mph briefly; in a reasonable world you should get a ticket for passing without speeding up enough (which I occasionally see and is very dangerous) (or of course for blocking up traffic and not pulling out).

Because I have a ticket I'm trying to be really careful and not speed at all, and it *sucks* god it sucks. It's not the speed I miss, I'm actually totally fine with just driving slowly for a while, it's the ability to get away from all the dumb fuckers out there. If you actually drive the speed limit everywhere, you are constantly surrounded by other cars and they are just constantly doing cock-ass-motherfucker things like changing lanes right into me without signalling so that I have to take evasive action, or just hanging out right in my blind spot and matching my speed, or speeding up to pull in front of me and then slowing down again, etc. etc. It's just awful driving in a pack. I actually think it was much safer when I was speeding, because I would use it to find empty spots on the freeway and just get alone. I also think it's a lot safer to always be slightly faster than the average traffic, because then it's all laid out in front of you for you to see, rather than buzzing around and coming up from behind. (obviously this is a local optimization not a global one) (I think that most drivers are not actually watching for other cars the way I do; that's why nobody but me seems to care about the fact that almost every modern car has absolutely horrid visibility. When I drive I know exactly where every car around me is, so that I can always make an evasive move without looking, because I know there's a space on my left or right.)

In other news of "my god everyone is so incredibly dumb" I had three retail experiences in a row with the exact same bizarre dumb interaction. I went to this gross mongolian wok place and asked the cashier for a "grilled pork bowl" and she looked at me like I had just said "blerkabootyppsh" , she was like "err, what is that?"; eventually I re-checked the menu and said "barbecue pork bowl" and she was like "oh, okay". Huh? What? You can't figure out that maybe that's what I meant? It's not a very hard puzzle, the only things you sell are "chicken bowl" , "pork bowl", and "beef bowl", so just because I put the wrong adjective in front shouldn't have blown your mind (and FYI it's actually grilled not barbecued); it's like there's just empty space behind those eyes.

The one that really boggles my mind is the constant level of stupidity in coffee shops; I ordered a doppio somewhere and the girl was like "err.. uhh.. do you mean a double espresso?" , uh yeah, you work in a fucking espresso shop and you've never heard of a doppio before? Of course it always kind of blows my mind the way people can do a job day after day and not be at all interested in learning about it or doing it well.


10-13-12 | Sports

Random thoughts from someone who doesn't know much about sports.

1. Of course Lance Armstrong was on drugs; if you didn't know that, you're a moron. He completely dominated a field which was full of dopers, winning in the mountains and on the flats; if he could do that naturally he would be some kind of super-human abnormality, which of course he wasn't. It doesn't diminish his amazing achievements at all. Everyone he was competing against was on drugs too, so it was a totally level playing field. Everyone in cycling has been on drugs since maybe the 30's or so when they took straight amphetamines. (so did most athletes in those days). You do know that Eddy Merckx was on drugs, right? And pretty much every TdF winner ever. Everyone in every sport in the world has been on drugs for roughly the past century, it's bizarre to act like it's a scandal. It's sort of like a man admitting that he thinks about women other than his wife and everyone gets all upset about it (harrumph and drops their monocle); it's fucking retarded to have these societal faux-pas that publicly we decry and nobody can admit to, but anybody with a brain knows that everybody does it.

Even if you have perfect drug testing of pro athletes, it wouldn't diminish the importance or usage of drugs in sport. eg. say you tested every single day and the tests could detect everything so no doping was possible. That would just make it even more important for the kids to dope in high school before going pro and getting into the testing regime (which is what happens in NFL football these days; to be a football player you must use steroids in high school).

(The French obsession with taking down Lance for doping is particularly ridiculous; they're upset that Lance beat all their French stars so badly, and just generally upset that French cyclists all suck so bad these days, but of course the only French cyclists who have had any success at all recently (eg. Richard Virenque) were huge huge dopers (which is inevitable when you are carrying the expectations of a nation))

2. Sebastien Loeb was probably the greatest racing driver of all time. Unfortunately for him, the WRC format has just not been very interesting during his reign, and he didn't have the fortune of a good foil - he needed a rival to seriously challenge him and make it interesting for the fans, but nobody ever could. (it also didn't help that the cars are so boring now; historic rallies are probably the best rallies to watch now; I love to watch the old Ford Escorts in the 2wd historic rallies hanging the tail around every corner).

Obviously 9 straight championships speaks of his dominance, but if you actually watched some of the races you would appreciate that his supremacy was at a level even beyond what the numbers show. You could tell that he was playing it safe most of the time, that he always had a little more speed in the bank. Of course that's smart, and part of what made Loeb the greatest of all time, that he was not only skilled but crafty and good at managing the risk and percentages. He would drive just fast enough to win and no faster; sometimes he would fall behind in an early stage of a race, and then he would push a little harder and just rip time out of the competition, showing how much speed he really had.

3. The current Spanish national soccer team is a real joy to watch; one of the best international teams I've ever seen (but I don't watch a lot of soccer). The thing that makes them so great is they play a dynamic style with lots of movement (their movement off the ball is particularly good; they run great "give and go's" in basketball lingo), and they make goals from the natural flow of play. Way too many international teams use a very boring defensive style, where they just randomly launch the ball forward or rely on set pieces (corner kicks, penalties) for goals. It's so much nicer to see goals come out of the flow of play rather than set pieces. The German teams in particular are always very effective but just agony to watch. Even the great Brazilian teams have fun individual flair but actually play a pretty defensive configuration most of the time and rely on just a few forwards to make something happen.

Soccer, like most sports, is clearly broken. By "broken" I mean that the rules of the game do not encourage beautiful play; in fact they punish it. A well designed game system makes it so that playing smartly (eg. to maximize the chance of winning) also causes you to play in a way that is elegant and nice to watch and in the spirit of the game. I don't think that playing 8 defenders and winning on penalty kicks is in the spirit of the game and the rules should not let that be such a profitable strategy.

4. I've been enjoying watching F1 recently, mainly as a nice way to zone out (it's very boring, a bit like watching golf or something, just a nice bit of mindless background).

One of the very annoying things about it is that all the video feeds are provided by the F1 organization (not each TV channel) and they are just terrible. The director seems to have very little clue about what is interesting to watch in racing. They're constantly showing cars in 20th place (HRT or whatever) going into the pits; oo 20th place pitted, better cut away to that, fascinating. And of course they cut away from the leaders right when they are setting up for a pass.

F1, like almost all sports, also has just terrible announcers. There are lots of interesting things happening all the time that you would not have any clue about unless you really know racing, because the commentators don't tell you. For example smart drivers like Alonso are very clever about how they interact with other cars; if he's trying to make a pass, he will pester the car in front in areas of the track where he is not planning to pass; this makes the leading car use up its KERS unwisely, meanwhile Alonso is saving all his KERS for the spot where he is planning to make the pass. Sometimes a tough pass is set up with fakes (bluffs) over the course of several laps; you show that you want to pass in one part of the track, so that the leading driver starts going defensive in that spot, then you actually make the pass in a totally other section where you have previously bluffed that you don't have pace. Good commentators should be telling you about these dynamics, as well as just constantly telling stories about the drivers to give you some background on how they interact with each other. If you actually stop and think about how good commentating could be, and how shitty it actually is, the gulf is massive. We've been kind of inured to just atrocious sports commentary, so much so that it is the expected norm (it doesn't feel right to watch football without some commentator saying "they need to get more physical").

I feel like Red Bull is actually way way better than all the other teams, but it doesn't seem that way on the surface because they keep getting torpedoed by the FIA. Every time they make another advancement that lets them run away with races, the rules get changed to make their innovation illegal. Certainly the racing is more interesting if the teams are close by, but constantly changing the rules to hinder the leader is not a good way to make a sport competitive.


10-08-12 | Oh dear god, what a relief

For the past few days I've had a terrible heisen-bug. It mainly showed up as decompression failing to reproduce the original data. At first I thought it was just a regular bug, some weird case exposing a broken pathway, but I could never get it to repro, and it only seemed to happen when I ran very long tests; like if I compressed 10,000 files it would show up in the 7000th one, but then if I ran on just the file that failed it would never repro.

I do a lot of weird threading stuff so my first fear was that I had some kind of race. So I turned off all my threading, but it kept happening.

My next thought was some kind of uninitialized memory problem or out-of-bounds problem. The circumstances of failure jibe with the bug only happening after I have touched a lot of memory and maybe moved into a weird part of address space, or maybe I'm writing past the end of a buffer somewhere and it doesn't show up and hurt me until much later.

So I turned on my various debug allocator features and tried a bunch of things to stress that, but still couldn't get it to fail in any kind of repeatable way.

Yesterday I saw the exact same kind of bug happen in a few of my different compressors and the lightbulb finally came on in my head : maybe I have bad RAM. Memtest86 and just a few seconds in, yep, bad RAM.

Phew. As pissed as I am to have to deal with this (getting into the RAM on my lappy is a serious pain in the ass) it's nice to not actually have a bizarro bug.

The failure rate of RAM in desktop-replacement lappies is around 100% in my experience. I've had two different desktop replacement lappies in the past 8 years and I have burned out 3 RAM chips; I've blown the OEM RAM on both of them and on this one I also toasted the replacement RAM. Presumably the problem is that it just gets too hot in there and they don't have sufficient cooling. (and yes I keep them on a screen for air flow and all that, and never actually use them on a lap or pillow or anything bad like that). (perhaps I should get one of those laptop stands that has active cooling fans).

Also, shouldn't we have better debugging features by now?

I should be able to take any range of memory, not just page boundaries, and mark it as "no access". So for example I could take compression buffers and put little no access regions at the head and tail.
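
You can get a crude page-granularity version of this today with guard pages. A minimal sketch using the Windows virtual memory API (the returned pointer is generally not aligned, and a real debug allocator would also want a head-guard mode) :

#include <windows.h>

// put a no-access page just past the end of an allocation,
// so any overrun faults immediately :
void * alloc_with_tail_guard(size_t size)
{
    SYSTEM_INFO si; GetSystemInfo(&si);
    size_t page = si.dwPageSize;
    size_t numPages = (size + page - 1) / page;
    unsigned char * base = (unsigned char *) VirtualAlloc(
        NULL, (numPages + 1) * page, MEM_RESERVE|MEM_COMMIT, PAGE_READWRITE);
    if ( ! base ) return NULL;
    DWORD old;
    VirtualProtect(base + numPages * page, page, PAGE_NOACCESS, &old);
    // slide the allocation up so its end abuts the guard page :
    return base + numPages * page - size;
}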

For uninitialized memory you want to be able to mark every allocation as "fault if it's read before it's written". (this requires a bit per byte which is cleared on write).

You could enforce true const in C by making a true_const template that marks its memory as read-only.

I've ranted before about how thread debugging would be much better if we could mark memory as "fault unless you are thread X", eg. give exclusive access of a memory region to a thread.

I see two good solutions for this : 1. a VM that could run your exe and add these features, or 2. special memory chips and MMU's for programmers. I certainly would pay extra for RAM that had an extra 2 bits per byte with access flags. Hell with how cheap RAM is these days I would pay extra for more error-correction bits too; maybe even completely duplicate bytes. And self-healing RAM wouldn't be bad either (just mark a page as unusable if it sees failures in that page).

(for thread debugging we should also have a VM that can record exact execution traces and replay them, of course).


10-07-12 | Small Notes on LZNib

Some little thoughts.

1. It's kind of amazing to me how well LZNib does. (currently 30,986,634 on enwik8 with parallel chunked compress and LRM). I guess it's just the "asymptotic optimality" of LZ77; as the dictionary gets bigger, LZ77 approaches perfect compression (assuming the data source is static, which of course it never is, which is why LZ77 does not in fact approach the best compressor). But anyway, the point is with basic LZ the way matches are encoded becomes less and less important as the window gets bigger (and the average match length thus gets longer).

2. With byte-wise coders you have something funny in the optimal parser that you don't run into much with huffman or arithmetic coders : *ties*. That is, there are frequently many ways to code that have exactly the same code length. (in fact it's not uncommon for *all* the coding choices at a given position to produce the same total length).

You might think ties don't matter but in fact they do. One way you can break a tie is to favor speed; eg. break the tie by picking the encoding that decodes the fastest. But beyond that if your format has some feedback, the tie is important. For example in LZNib the "divider" value could be dynamic and set by feedback from the previous encoding.

In my LZNib I have "last offset" (repeat match), which is affected by ties.

3. My current decoder is around 800 mb/s on my machine. That's almost half the speed of LZ4 (around 1500 mb/s). I think there are a few things I could do to get a little more speed, but it's never going to get all the way. Presumably the main factor is the large window - LZ4 matches mostly come from L1 and if not then they are in L2. LZNib gets a lot of large offsets, thus more cache misses. It might help to do a lagrangian space-speed thing that picks smaller offsets when they don't hurt too much (certainly for breaking ties). (LZNib is also somewhat more branchy than LZ4 which is the other major source of speed loss)

4. One of the nice things about optimal parsing LZNib is that you can strictly pick the set of matches you need to consider. (and there are also enough choices for the optimal parser to make interesting decisions). Offsets can be sent in 12 bits, 20 bits, 28 bits, etc. so for each offset size you just pick the longest match in that window (see the cost sketch after these notes). (this is in contrast to any entropy-coded scheme where reducing to only a few matches is an approximation that hurts compression, or a fixed-size scheme like LZ4 that doesn't give the optimal parser any choices to make)

5. As usual I'm giving up some compression in the optimal parser by not considering all possible lengths for each match. eg. if you find a match of length 10 you should consider only using 3,4,5... ; I don't do that, I only consider lengths that result in a shorter match length code word. That is a small approximation but helps encoder speed a lot.

6. Since LZNib uses "last offset", the optimal parse is only approximate and that is an unsolved problem. Because big groups of offsets code to the same output size, the choice between those offsets should be made by how useful they are in the future as repeat matches, which is something I'm not doing yet.
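
Since note 4 mentions the nibble cost function, here's a sketch of it (illustrative, not the actual LZNib code) :

// cost of an offset in nibbles, for the 12/20/28-bit buckets :
int offset_cost_nibbles(unsigned int offset)
{
    if ( offset < (1u<<12) ) return 3;
    if ( offset < (1u<<20) ) return 5;
    return 7; // offset < 2^28
}

// the optimal parser then only needs the longest match within each
// bucket (eg. bestLen12/bestLen20/bestLen28) as its candidate set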


10-05-12 | Picture of the Day

(Animated gifs are so annoying.)


10-05-12 | OodleLZ Encoder Speed Variation with Worker Count

Thought I would look into this. One thing I've been wondering is whether putting workers on the hyper-threads helps or not.

Measured speed on enwik8. This is the slow optimal encoder to give it something to do. enwik8 is encoded by breaking into 4 MB chunks (24 of them). Each chunk gets 4 MB of dictionary overlap precondition. Matches before the overlap are found using the LRM (Long Range Matcher). The LRM is created for the whole file and shared between all chunks.

What we see :

The speed dip from 0 to 1 workers is expected, it's the cost of firing up threads and communication and chunking and such. (0 = synchronous, just encode on the main thread).

My machine has 4 real cores and 8 hyper-cores. From 1-4 workers we see not-quite-linear speedup, but big steps. Once we get into the hyperthreads, the benefit is smaller but I'm still seeing steady speedup, which surprises me a bit, I thought it would flatten out more after 4 workers.

(the wiggle at 7 is probably just a random fluctuation in Windows (some service doing something I didn't ask it to do, you bastards); I only ran this test once so the numbers are not very solid; normally I run 40 trials or so when measuring speeds on Windows).

And here's the Oodle ThreadProfile of the encode showing what's happening on all the threads :



Of course part of the reason for the not-quite-linear speedup is the gap at the end when not all the workers are busy. You can fix that by using smaller chunks, but it's really not anything to worry too much about. While it does affect the latency of this single "encode enwik8" operation, it doesn't affect throughput of the overall system under multiple workloads.


OodleLZHLW enwik8 compressed size variation with different chunkings :


28,326,489   4 MB chunks - no LRM
27,559,112   4 MB chunks with LRM
27,098,361   8 MB chunks with LRM , 4 matches
26,976,079   16 MB chunks , 4 matches
26,939,463   16 MB chunks , 8 matches
26,939,812   16 MB chunks , 8 matches, with thresholds

In each case the amount of overlap is = the chunk size (it's really overlap that affects the amount of compression). After the first one, all others are with LRM. Note that the effective local dictionary size varies as you parse through a chunk; eg. with 4 MB chunks, you start with 4 MB of overlap, so you have an effective 4 MB local window; as you parse, your window effectively grows up to a max of 8 MB, so the end of each chunk is better compressed than the beginning.

My LZHLW optimal parse only considers 4 matches normally; as the overlap gets bigger, that becomes a worse compromise. Part of the problem is how those matches are chosen - I just take the 4 longest matches (and the lowest offset at each unique length). Normally this compromise is okay, you get a decent sampling of matches to choose from; on moderate file sizes the cost of going from infinite to 16 to 4 matches is not that great, but as the dictionary gets bigger, you will sometimes fill all 4 matches with high offsets (because they provide the longest match lengths) and not any low offsets to try.

At 16 MB chunks (+16 overlap = 32 MB total window) it becomes necessary to consider more matches. (in fact there's almost no benefit in going from 8 MB to 16 MB chunks without increasing the number of matches).

I tried adding "thresholds"; requiring that some of the matches found be in certain windows, but it didn't help; that merits more investigation. Intuitively it seems to me that the optimal parser wants to be able to choose between some long high-offset matches and some shorter low-offset matches, so the question is how to provide it a few good selections to consider. I think there's definitely some more win possible in my optimal parser by considering more matches, or by having a better heuristic to choose which matches to consider.


10-04-12 | Hash-Link match finder tricks

Some notes on the standard old Hash->Links match finder for LZ. (See previous posts on StringMatchTest Hash1b and Amortized Hashing or index post here )

Some additional tricks which are becoming more or less standard these days :

1. Carry-forward "follows" matches. Previously discussed, see Hash1b post. (also in the Hash1b post : checking for improvement first).

2. "Good enough length". Once you find a match of length >= GOOD_ENOUGH (256 or 1024 or so), you stop the search. This helps in super-degenerate areas; eg. you are at a big run of zeros and that has occured many times before in your file, you can get into a very bad O(N^2) thing if you aren't careful, so once you find a long match, just take it. Hurts compression very little. (note this is not just a max match length; that does hurt compression a bit more (on super-compressable files))

3. Extra steps when not finding matches. The first place I saw this was in LZ4 and Snappy, dunno where it was done first. The idea is when you fail to find a match, instead of stepping ahead by 1 you step ahead by some variable amount. As you continue to fail to find matches, that variable amount increases. Something like :


ptr += 1 + (numSearchesWithNoMatchFound>>5);

instead of just ptr++. The idea is that on incompressible files (or incompressible portions of files) you stop bothering with all the work to find matches that you won't find anyway. Once you get back to a compressible part, the step resets.

4. Variable "amortize" (truncated hash search). A variant of #3 is to use a variable limit for the amortized hash search. Instead of just stepping over literals and doing no match search at all, you could do a match search but with a very short truncated limit. Alternatively, if you are spending too much time in the match finder, you could reduce the limit (eg. in degenerate cases not helped by the "good enough len"). The amortize limit might vary between 64 and 4096.

The goal of all this is to even out the speed of the LZ encoder.

The ideal scenario for an LZ encoder (greedy parsing) is that it finds a very long match (and thus can step over many bytes without doing any lookup at all), and it finds it in a hash bucket which has very few other entries, or if there are other entries they are very easily rejected (eg. they mismatch on the first byte).

The worst scenario for an LZ encoder (without our tricks) is either : 1. there are tons of long matches, so we go and visit tons of bytes before picking one, or 2. there are no matches (or only a very short match) but we had to look at tons of pointers in our hash bucket to find it, and we will have to do hash lookups many times in the file because we are not finding long matches.


10-04-12 | Work and Life Patterns

Some random thoughts.

1. I'm pretty sure that people who have "work-life balance" are not actually working. Not by my standard of "work". I see these people sometimes who manage to exercise every day, take a nice relaxing lunch break, stop working to be sweet to their wife or play with their kids. No fucking way those guys are working, you can't put in a solid self-destructing day when you're doing that.

It seems like you should be able to stop and take 30 minutes off for stretching or whatever, but in fact you can't. For one thing, if you are really in deep "work mode", it takes at least an hour to get out of it. Then if you really were working hard, your body and mind are exhausted when you get out so you don't want to do anything. Then when you do go back to work, it takes hours to really get your mind back up to full speed.

The worst are the motivational speaker douchebags who will tell you can get more done if you only work 1 hour a day, or the dot-com-luckboxes who made millions over some trivial bullshit and now think they are business geniuses. I get more done in 5 minutes than you fuckers have done in your entire lives. I don't think you have any concept of what people do when they're actually working.

2. I've been in crazy crunch all summer long, and only in the last few weeks have kind of "hit the wall" where it's become a bit unpleasant. Not just in terms of job work, but also exercising, house work, etc. it's been a summer of work work work, take a break from one kind of work to do another kind of work.

(aside : actually taking a break from work to do other work is wonderful; I find that almost any day on which I do a variety of jobs I'm quite happy; like 6 hours of job work, then a few hours of wood working in the garage, then a few hours of gardening; that's a nice day. Any day that I spend doing all the same work all day is a sucky horrible day; obviously job work all day long is miserable, but so is home improving all day long. I've never been much for socializing or relaxing or whatever you're supposed to do when you're not working, so a lifestyle of hobbies and chores is okay with me.

Sometimes I see these old guys, generally 50-60 or so, wirey leather-hard old guys, who are just always doing something, they built their own house, they're overhauling an old engine, carrying bags of feed; you know they're really miserable inside their own brains which is why they never stop working to just sit and think or talk with the family, but they've found a way to live by just keeping themselves busy all the time. I look at those old guys and think yeah I could see myself getting through life that way.)

Anyhoo, now that fall is rolling in my body & mind want to quit. It occurred to me that this is the ancient Northern European life cycle; when spring rolls around you kick into high gear and take advantage of the long days and work your ass off for a while, then fall rolls around and you retreat into your dens. In the long ago, Northern Europeans actually almost hibernated in the winter; they would sleep for 16+ hours a day, and their heart rates and calorie consumption would drastically lower.

One of the problems with the modern world is that Northern Europeans won. With the advent of artificial light and heat, they can keep that Northern European summer work ethic going all year round. Back in the ancient days if you lived somewhere where you could work year round (eg. the tropics) then of course you took a slower pace. It's a real un-human situation we've gotten ourselves into. The Northern Europeans had to work their asses off in the summer because they didn't have much opportunity; and they had to be really careful uptight jerks, cache their food carefully and repair their shelters and such, because if they didn't they would die in the winter.

To be repetitive : in the ancient days you had the tropical peoples that lived a slower pace year round, and the northern peoples who lived very intensely, but only for the brief summer. What we've got now is basically that intense summer pace of life, but year round.

(as usual I have no idea if this is actually true; but a good parable is much better than factual accuracy).

3. Work life quality is obviously a race for the bottom. Basically capitalism is a pressure against life quality. I suppose the fundamental reason is that productivity is ever increasing (as knowledge increases, the value of each laborer goes down), and population is also increasing. But anyway, it's clear that the fundamental direction of capitalism is towards worse life quality (*). There are two factors I see that resist it : 1. unions , and 2. new fields of work. (or 3. get to the top of the hierarchy)

(* = this is clearly a half baked thought, as there are various ways in which capitalism is a pressure towards better life quality overall. I guess I'm specifically talking about the pressure of competition in a field where the number of people that want to be in it is greater than what's really needed. All fields go through a progression where at first the number of people trying to do it is very small, there are great opportunities in that phase, but at some point it becomes a well known thing to do and then the pressure is towards worse and worse life quality. I'm also normalizing out the general improvement in life quality for all, since human perception also normalizes that out and it doesn't affect our perception of our life quality)

The "race for the bottom" basically works like this : say you have some job that pays decently and gives you decent life quality; someone else sees that and says "hey I'll do the same job but for 90% of the pay" or more often "I'll take the same pay but work 120% of the hours". Because there is excess labor, the life quality for the worker goes down and down.

New areas of work, where there is a relatively small pool of competent labor, is one of the few ways to avoid this. Software has been new enough to be quite nice for some time, but for your standard low-level computer programmer it is already no longer so, and it will only get worse as the field becomes more mature.

The "race for the bottom" also occurs due to competition. Say you're an independent, maybe you make an archiver like WinPackStuffSmall, if your competition starts working crazy hours adding more and more features, suddenly you have to do the same to compete; then anybody else who wants to get into that business has to work even harder; over time the profit gets smaller and the work conditions get worse. This has certainly happened in games; it's almost impossible to make a competitive game with a small budget in a small amount of time without just killing the employees.

Anyway, I certainly feel it in data compression; there are so many smart guys putting in tons of work on compression for free because it's fun work, that you can't compete unless you really push hard. If you're going for the maximum compression prize and somebody else is putting in just killer work to do all the little things that get you a little more win, you can't compete unless you do it too. Being more efficient or having better ideas wins you a little bit of relaxation, but in the end some grindstone time is inevitable.

4. I really want my cabin in the woods to go off and work. It's too hard for me to try to work and live a normal life at the same time; I'd like to be able to just go out and be alone and eat gruel and code for a week straight.

For a while I was thinking about buying my own piece of land and building a little basic shack. But now that I own a house I'm not so sure. Owning property fucking sucks, it's a constant pain in the ass. (the only thing worse is renting in America, where the landlords have all the rights and are even worse pains in the ass). Sometimes I think that it would be nice to own a piece of mountain land, maybe an orchard, a beach house in the tropics, that that would be a legacy to pass on to my children, to stay in the family, but god damn it's a pain in the ass maintaining properties.

I wish I could find a rental, but I just cannot find anything decent, which is very odd, I know it must be out there. If I went out to my coding shack and I owned it, I would spend much of the time stressed out about the fixes I should be doing to it, at least with a rental I can go "yeah this place sucks but it's not my problem".

I sort of vaguely considered going backpacking-working, but I can't stand working on laptops, and carrying out the standing desk seems a bit difficult. (I said to James when we were backpacking that if I was rich it would be sweet to go backpacking and have a mule team or something carry in a nice bit of kit for you, way back into the inaccessible wilderness, so you could be out there all alone but with a nice supply of non-freeze-dried food (and a keyboard and standing desk) (like an old timey British explorer; have a coolie carry my oak desk into the woods on his back).)

I do think the best way for a programmer to work (well, me anyway) is not the steady put in 8 hours every day and plod along. It's take a few weeks off and basically don't work at all, then go heavy crunch for a few months where you just dive in and every thought is about work. It's so much better if you can stay focused on the problem and not have to pop the stack and try to be relaxed and social and such. I'm not fucking relaxed, I can't chit chat with you, I have shit to get done! Unfortunately the periodic lifestyle doesn't work very well with other people in your life. (and mainstream employers expect you to do the crunch part but not the take a break part).

5. I've always thought that the ideal job would be a seasonal/periodic one. Something like being an F1 engineer, or an NFL coach. (NFL coach was my dream job in college; now I think F1 engineer looks mighty attractive; you get the fun of competition, and then you get to go back to your shed and study film and develop strategies and run computer models). There's some phase when you're "on" where you just work like crazy, and then you get a little bit of a break in the off season. (unfortunately, due to the "race to the bottom", the break in these kinds of jobs is disappearing; back in the ancient days they really were seasonal, in the off season everyone would just go relax, but then uptight assholes started taking the jobs and working year round, and now that's more the norm).

The other awesome thing about F1 engineer or NFL coach is that you get a big feedback moment (eg. "I won" or "I lost") which is very cathartic either way and gives you nice resolution. For me the absolute worst kind of work is the never-ending maintenance; you do some work, and then you do some more; guess what, next year you do some more; there's no real end point. Working on games at least does have that end point (whew, we shipped!) but they're way too far apart to be a nice cyclical lifestyle; you want it once a year, not once every 3-4 years.

I also like the overt competition in those kind of jobs. Real intellectual competition is one of the most fun things in the world; it's what I loved about poker, about going after the most compression, the Netflix prize, etc. It's so cool to see someone else beat you, and you get motivated and try to figure out how they did it, or take the next step yourself and come back and top them. Love that. And you don't have to listen to any dumb fucker's opinion about what the best way is, you go out and prove it in the field of combat; if your ideas are right, you win.

(for quite a while I've been thinking about making my own indie game, solely for the competitive aspect of it; I want to prove I can make a game faster and better than my peers. I really have very little interest in games, for me the "game" is the programming; I want to win at programming. Good lord that is a childish bad reason to make a game. Anyway that part of me is slowly dying as I get older so the chance of me actually making an indie game declines by the day.)

6. I can be quite happy with a simple lifestyle : work really hard, then exercise hard to release the stress and relax the body, then sleep a lot. It actually feels pretty great. The problem is it's an unstable equilibrium, like a pendulum on its tip. The slightest disturbance sends it toppling down.

Any day you don't get enough sleep, suddenly the work is agony and you don't feel like exercising, and then you carry the stress and it's a disaster. In this lifestyle I feel very productive and healthy, but I'm also very prickly; you have to be quite self-defensive to make it work, you can't let people sap your energy because you are so close to using all the energy you have. You will seem quite unreasonable to others; if someone asks you for a little favor or even just wants to socialize or something; no, sorry I can't do it; I have to work and then I have to go swim or everything is going to come crashing down.

I see a lot of the type A successful douchebag types living this lifestyle, and I've never quite put my finger on it about what makes it so douchey. I suppose part of it is jealousy; somebody who actually manages to put in a hard day of work and then exercise off the stress and have a good evening is something that I am rarely able to do, and I'm jealous of people who pull it off. But part of it is that it is a very self-centered lifestyle; you have to be very selfish to make it work.

7. I certainly am aware that I am using work to avoid life at the moment. I've got a bunch of home improving I need to do and other such shite that I really don't want to deal with, so every morning I wake up and just get straight on the computer to do RAD work so that I don't have to think about any of the other things I should be doing.

Of course that's not unusual. I have quite a few friends/acquaintances around here who very reliably use work to avoid life; they can't do this or that "because they have to work". Bullshit, of course you don't have to work right at that moment, you almost never do, you're just avoiding life. It's not really even a lie; if you think it's a lie it's just because you're listening too literally; they're really just saying "no I don't want to" or "my head is all fucked up right now and it's better if I don't spend time in the normal world".

A few months ago I had a fence put in, and on the day that the guys were doing the layout, I felt like I had to be at the office. Of course they did some kind of fucked up things because I wasn't there to supervise, and of course looking back now I can't even remember why it was I felt like I really had to go to work that day, of course I didn't.

8. The times that I really kill myself working are 1. when a team depends on me; like if I made a bug that's blocking the artists, of course I'll kill myself until it's fixed (and you're an asshole if you don't), 2. when I'm working on something that I kind of am not supposed to be; eg. back when I did VIPM at WildTangent or cube map lights at Oddworld or many things at RAD (LZNib the latest); even if it's sort of within the scope of what I should be doing, if it's not what I would have told myself to do if I was the producer, then I feel guilty about it and try to get it over with as quickly as possible and feel bad about it the whole time. 3. when I'm embarrassed; that's maybe the biggest motivator. If I release a product that has bugs, that's embarrassing, or if I claim something is the best and then find out it's not, I have to go and kill myself to make it right.

Right now I'm embarrassed about how long Oodle has taken to get out, so I'm trying to fix that.

9. There's a kind of mania you can get into when you're working a lot where you stop seeing the forest for the trees. You can dive down a hole, and you just keep doing stuff, you're knocking items off the todo list, but you aren't seeing the big picture. It's like the more you work the more you only see the foreground, the details. You have to stop and take a break to take stock and realize you should move onto a different topic.

Sometimes when you are faced with a mountain of tasks and are kind of overwhelmed about where to start, the best thing is to just pick one and do it, then the next, and eventually you will be done. But that rarely works with code, because there are really an infinite number of tasks, doing each one creates two new ones, so "putting your head down" (as producers love to say) can be non-productive.


10-02-12 | Small note on Buffered IO Timing

On Windows, Oodle by default uses OS buffering for reads and does not use OS buffering for writes. I believe this is the right way to go 99% of the time (for games).

(see previous notes on how Windows buffering works and why this is fastest :
cbloom rants 10-06-08 - 2
cbloom rants 10-07-08 - 2
cbloom rants 10-09-08 - 2
)

Not buffering writes also has other advantages besides raw speed, such as not polluting the file cache; if you buffer writes, then first some existing cache page is evicted, then the page is zero'ed, then your bytes are copied in, and finally it goes out to disk. Particularly if you are streaming out large amounts of data, there's no need to dump out a bunch of read-cached data for your write pages (which is what Windows will do because its page allocation strategy is very greedy).

(the major exception to unbuffered writes being best is if you will read the data soon after writing; eg. if you're writing out a file so that some other component can read it in again immediately; that usage is relatively rare, but important to keep in mind)
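
On Windows the whole choice is just flags at open time. A minimal sketch (note that FILE_FLAG_NO_BUFFERING requires sector-aligned buffer addresses, sizes and file offsets, so the unaligned tail of a file needs special handling) :

// unbuffered : writes go (more or less) straight to disk,
// no cache page eviction/zeroing/copying :
HANDLE hUnbuf = CreateFileA("out.bin", GENERIC_WRITE, 0, NULL,
    CREATE_ALWAYS, FILE_FLAG_NO_BUFFERING, NULL);

// buffered : WriteFile just copies into the OS cache; the actual
// disk writes happen later on the cache maintenance thread :
HANDLE hBuf = CreateFileA("out2.bin", GENERIC_WRITE, 0, NULL,
    CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);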

Anyhoo, this post is a small note to remind myself of a caveat :

If you are benchmarking apps by their time to run (eg. as an exe on a command line), buffered writes can appear to be much much faster. The reason is that the writes are not actually done when the app exits. When you do a WriteFile to a buffered file, it synchronously reserves the page and zeroes it and copies your data in. But the actual writing out to disk is deferred and is done by the Windows cache maintenance thread at some later time. Your app is even allowed to exit completely with those pages unwritten, and they will trickle out to disk eventually.

For a little command line app, this is a better experience for the user - the app runs much faster as far as they are concerned. So you should probably use buffered writes in this case.

For a long-running app (more than a few seconds) that doesn't care much about the edge conditions around shutdown, you care more about speed while your app is running (and also CPU consumption) - you should probably use unbuffered writes.

(the benefit for write throughput is not the only compelling factor, unbuffered writes also consume less CPU due to avoiding a memset and memcpy).


10-02-12 | Small note on Adaptive vs Static Modeling

Even most people who work in compression don't realize this, but in fact in most cases Adaptive Models and Static Models can achieve exactly the same amount of compression.

Let me now try to make that note more precise :

With an adaptive model to really do things right you must :


1. Initialize to a smart/tuned initial condition (not just 50/50 probabilities or an empty model)

2. Train the model with carefully trained rates; perhaps faster learning at first then slowing down;
perhaps different rates in different parts of the model

3. Reset the model smartly at data-change boundaries, or perhaps have multiple learning scales

4. Be careful of making the adaptive model too big for your data; eg. don't use a huge model space
that will be overly sparse on small files, but also don't use a tiny model that can't learn about
big files
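
For concreteness, points 1 and 2 for a simple binary model look something like this (an LZMA-style shift update; an illustrative sketch, not any particular codec's actual code) :

// adaptive bit model with a 12-bit probability and a variable rate ;
// smaller rate = faster adaptation. A careful coder starts with a
// tuned initP (not just 2048 = 50/50) and a fast rate, then slows down.
typedef struct { int p; int rate; } BitModel;

void BitModel_Init(BitModel * m, int initP, int rate)
{
    m->p = initP;
    m->rate = rate;
}

void BitModel_Update(BitModel * m, int bit)
{
    if ( bit ) m->p += (4096 - m->p) >> m->rate;
    else       m->p -= m->p >> m->rate;
}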

With a static model to do things right you must :

1. Transmit the model very compactly, using assumptions about what the model is like typically;
transmit model refreshes as deltas

2. Send model refreshes in the appropriate places; the encoder must optimally choose model refresh points

3. Be able to send variable amounts of model; eg. with order-1 huffman decide which contexts get their
own statistics and which go into a shared group

4. Be able to send the model with varying degrees of precision; eg. be able to approximate when that's better
for the overall size(model) + size(coded bits)

We've seen over and over in compression that these can be the same. For example with linear-prediction lossless image compression, assuming you are doing LSQR fits to make predictors, you can either use the local neighborhood and generate an LSQR in the decoder each time, or you can transmit the LSQR fits at the start of the file. It turns out that either way compression is about the same (!!* BUT only if the encoder in the latter case is quite smart about deciding how many fits to send and how precise they are and what pixels they apply to).

Same thing with coding residuals of predictions in images. You can either do an adaptive coder (which needs to be pretty sophisticated these days; it should have variable learning rates and tiers, ala the Fenwick symbol-rank work; most people do this without realizing it just by having independent statistics for the low values and the high values) or you can create static shaped laplacian models and select a model for each coefficient. It turns out they are about the same.

The trade off is that the static model way needs a very sophisticated encoder which can optimize the total size (sizeof(transmitted model) + sizeof(coded bits)) , but then it gets a simpler decoder.

(caveat : this is not applicable to compressors where the model is huge, like PPM/CM/etc.)

A lot of people incorrectly think that adaptive models offer better compression. That's not really true, but it is *much* easier to write a compressor that achieves good compression with an adaptive model. With static models, there is a huge over-complete set of ways to encode the data, and you need a very complex optimizing encoder to find the smallest rep. (see, eg. video coders).

Even something as simple as doing order-0 Huffman and choosing the optimal points to retransmit the model is a very hard unsolved problem. And that's just the very tip of the iceberg for static models; even just staying with order-0 Huffman you could do much more; eg. instead of retransmitting a whole model, send a delta instead. Instead of sending the delta to the ideal code lens, instead send a smaller delta to non-ideal codelens (that makes a smaller total len); instead of sending new code lens, select from one of your previous huffmans. Perhaps have 16 known huffmans that you can select from and not transmit anything (would help a lot for small buffers). etc. etc. It's very complex.

Another issue with static models is that you really need to boil the data down to its simplest form for static models to work well. For example with images you want to be in post-predictor space with bias adjusted and all that gubbins before using a static model; on text you want to be in post-BWT space or something like that; eg. you want to get as close to decorrelated as possible. With adaptive models it's much easier to just toss in some extra context bits and let the model do the decorrelation for you. Put another way, static models need much more human guidance in their creation and study about how to be minimal, whereas adaptive models work much better when you treat them badly.


10-02-12 | Small note on LZHAM

When I did my comparison of various compressors a little while ago, I also tested LZHAM, but I didn't include it in the charts because the numbers I was seeing from it were very strange. In particular, I saw very very slow decode speeds, which surprised me because it seems to test well in other people's benchmarks.

So I finally had a deeper look to sort it out. The short answer is that LZHAM has some sort of very long initialization (even for just the decoder) which makes its speed extremely poor on small buffers. I was seeing speeds like 2 MB/sec , much worse than LZMA (which generally gets 10-25 MB/sec on my machine). (this is just from calling lzham_lib_decompress_memory)

On large buffers, LZHAM is in fact pretty fast (some numbers below). The space-speed is very good (on large buffers); it gets almost LZMA compression with much faster decodes. Unfortunately the breakdown on small buffers makes it not a general solution at the moment IMO (it's also very slow on incompressible and nearly-incompressible data). I imagine it's something like the huffman table construction is very slow, which gets hidden on large files but dominates small ones.

Anyhoo, here are some numbers. Decode shows mb/s.

BTW BEWARE : don't pay too much attention to enwik8 results; compressing huge amounts of text is irrelevant to almost all users. The results on lzt99 are more reflective of typical use.

name        lzt99      decode
raw         24700820   inf
lz4         14814442   1718.72
zlib        13115250   213.99
oodlelzhlw  10164511   287.54
lzham       10066153   61.24
lzma        9344463    29.77

name        enwik8     dec
raw         100000000  inf
lz4         42210253   1032.34
zlib        36445770   186.96
oodlelzhlw  27729121   258.46
lzham       24769055   103.01
lzma        24772996   54.59

(lzma should beat lzham on enwik8 but I can't be bothered to fiddle with all the compress options to find the ones that make it win; this is just setting both to "uber" (and -9) parse level and setting dict size = 2^29 for both)

And some charts for lzt99. See the previous post on how to read the charts.


09-30-12 | Long Range Matcher Notes

Some notes on the LRM mentioned last time.

Any time you do a string search based on hashes you will have a degeneracy problem. We saw this with the standard "Hash1b" (Hash->links) string matcher. In short, the problem is if you have many occurrences of the same hash, then exact string matching becomes very slow. The standard solution is to truncate your search at some number of maximum steps (aka "amortized hashing"), but that has potentially unbounded cost (though typically low).

We have this problem with LRM and I brushed it under the rug last time. When you are doing "separate scan" (eg. not incrementally adding to the hash table), then there's no need to have a truncated search, instead you can just have a truncated insert. That is, if you're limiting your search to 10, then don't add 1000 of the same hash and only ever search 10, just add 10. In fact on my test files it's not terrible to limit the LRM search to just 1 (!).
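
That is, the cap just moves from the search loop to the build loop; something like this (sketch, names illustrative) :

// separate-scan build : bound the work at insert time instead :
if ( countForHash[h] < maxEntriesPerHash )
{
    lrm_add_entry(h, pos);
    countForHash[h]++;
}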

But I'm not happy with that as a general solution because there is a potential for huge inefficiency. The really bad degenerate case looks something like this :


LRM hash length is 32 or whatever
Lots of strings in the file of length 32 have the same hash value
You only add 100 or so to the hash
One of the ones you didn't add would have provided a really good match

Typically, missing that match is not a disaster, because at the next byte you will roll to a new hash and look that up, and so on, so if you miss a 128k long match, you will usually find a (128k - 256) long match 256 bytes later. But it is possible to miss it for a long time if you are unlucky, and I like my inefficiency to be bounded. The more common bad case is that you get matches just a bit shorter than possible, and that happens many times, and it adds up to compression lost. eg. say hash length is 16 and there are 24 byte matches possible, but due to the reduced search you only find 16 or 18-length matches.

But most importantly, I don't like to throw away compression for no good reason, I want to know that the speed of doing it this approximate way is worth it vs. a more exact matcher.

There are a few obvious solutions with LRM :

1. Push matches backwards :

If you find a match at pos P of length L, that match might also have worked at pos (P-1) for length (L+1), but a match wasn't found there, either because of the approximate search or because hashes are only put in the dictionary every N bytes.

In practice you want to be scanning matches forward (so that you can roll the hash forward, and also so you can carry forward "last match follows" in degenerate cases), so to implement this you probably want to have a circular window of the next 256 positions or whatever with matches in them.

This is almost free (in terms of speed and memory use) so should really be in any LRM.

2. Multiple Hashes :

The simplest form of this is to do two hashes; like one of length 16 and one of length 64 (or whatever). The shorter hash is the main one you use to find most matches, the longer hash is there to make sure you can find the big matches.

That is, this is trying to reduce the chance that you miss out on a very long match due to truncating the search on the short hash. More generally, to really be scale-invariant, you should have a bunch of levels; length 16,64,256,1024,etc. Unfortunately implementing this the naive way (by simply having several independent hashes and tables) hurts speed by a linear factor.

3. Multiple Non-Redundant Hashes :

The previous scheme has some obvious inefficiencies; why are we doing completely independent hash lookups when in fact you can't match a 64-long hash if you don't match a 16-long hash.

So you can imagine that we would first do a 16-long hash , in a lookup where the hashes have been unique'd (each hash value only occurs once), then for each 16-long hash there is another table of the 64-long hashes that occurred for that 16-long hash. So then we look up in the next table. If one is found there, we look in the 256-long etc.

An alternative way to imagine this is as a sorted array. For each entry you store a hash of 16,64,256,etc. You compare first on the 16-long hash, then for entries where that is equal you compare on the 64-long hash, etc. So to lookup you first use the 16-long hash and do a sorted array lookup; then in each range of equal hashes you do another sorted array lookup on the 64-long hash, etc.

These methods are okay, but the storage requirements are too high in the naive rep. You can in fact store them compactly but it all gets a bit complicated.

4. Hash-suffix sort :

Of course it should occur to us that what we're doing in #3 is really just a form of coarse suffix sort! Why not just actually use a suffix sort?

One way is like this : for each 16-byte sequence of the file, replace it with a 4-byte U32 hash value, so the array shrinks by 4X. Now suffix-sort this array of hash values, but use a U32 alphabet instead of a U8 alphabet; that is, suffix strings only start on every 4th byte.

To lookup you can use normal sorted-array lookup strategies (binary search, interpolation search, jump-ins + binary or jump-ins + interpolation, etc). So you start with a 16-byte hash to get into the suffix sort, then if you match you use the next 16-byte hash to step further, etc.
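
A sketch of the setup (hash16 here stands for any 16-byte-to-U32 hash; names are illustrative) :

// replace each non-overlapping 16-byte block with a 4-byte hash ;
// the hash array is 1/4 the size of the file :
int n = fileLen / 16;
for ( int i = 0; i < n; i++ )
    hashArr[i] = hash16(file + i*16);

// now suffix-sort hashArr treating each U32 as a single "character" ;
// each comparison step of a lookup advances 16 bytes in the file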


09-28-12 | LZNib on enwik8 with Long Range Matcher

I wrote a "Long Range Matcher" (ala SREP/etc. (see previous blog post )).

I'm using a rolling hash, and a sorted-array lookup. Building the lookup is insanely fast. One problem with it in an Optimal Parsing scenario is that when you get a very long LRM match, you will see it over and over (hence N^2 kind of badness), so I use a heuristic that if I get a match over some threshold (256 or 1024) I don't look for any more in that interval.

For a rolling hash I'm currently using just multiplicative hash with modulus of 2^32 and no additive constant. I have no idea if this is good, I've not had much luck finding good reference material on rolling hashes. (and yes of course I've read the Wikipedia and such easily Googleable stuff; I've not tested buzhash yet; I don't like table lookups for hashing, I needs all my L1 for myself)
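
For reference, the rolling update for that kind of hash is just this (a sketch; I'm not claiming this particular MULT is good, and the mod 2^32 comes free from unsigned wraparound) :

typedef unsigned int U32;
#define MULT 741103597u  // some odd 32-bit constant

// hash of window [p,p+L) : h = sum of p[i] * MULT^(L-1-i)  (mod 2^32)
U32 hash_init(const unsigned char * p, int L)
{
    U32 h = 0;
    for ( int i = 0; i < L; i++ ) h = h * MULT + p[i];
    return h;
}

// roll one byte : remove "out" = p[0] , add "in" = p[L] ;
// multPowL = MULT^L (mod 2^32) , precomputed once :
U32 hash_roll(U32 h, U32 multPowL, unsigned char out, unsigned char in)
{
    return h * MULT + in - out * multPowL;
}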

LRM builds its search structure by hashing L bytes (LRM hash length is a parameter (default is 12)) every S bytes (S step is a parameter (default is 10)). Memory use is 8 bytes per LRM entry, so a step of 8 would mean the LRM uses memory equal to the size of the file. For large files you have to increase the step. Hash length does not affect memory use.

So anyhoo, I tested on enwik8. This is a test of different dictionary overlaps and LRM settings.

Compression works like this :


I split the file into 8 MB chunks (12 of them).

Chunks are compressed independently (in parallel).

Each chunk preconditions its dictionary with "overlap" bytes preceding the chunk.

Each chunk can also use the LRM to match the entire file preceding its overlap range.

So for each chunk the view of the file is like :

[... LRM matches here ...][ overlap precondition ][ my chunk ][ not my business ]

(note : enwik8 is 100mB (millions of bytes) not 100 MB, which means that 12*8 MB chunks = 96 MB actually covers the 100 mB).

(BTW of course compression is maximized if you don't do any chunking, or set the overlap to infinity; we want chunking for parallelism, and we want overlap to be finite to limit memory use; enwik8 is actually small enough to do infinite overlap, but we want a solution that has bounded memory use for arbitrarily large files)

With no further ado, some data. Varying the amount of chunk dictionary overlap :

overlap MB   no LRM     LRM
 0           32709771   31842119
 1           32355536   31797627
 2           32203046   31692184
 3           32105834   31628054
 4           32020438   31568893
 5           31947086   31518298
 6           31870320   31463842
 7           31797504   31409024
 8           31731210   31361250
 9           31673081   31397825
10           31619846   31355133
11           31571057   31316477
12           31527702   31281434
13           31492445   31253955
14           31462962   31231454
15           31431215   31206202
16           31408009   31189477
17           31391335   31215474
18           31374592   31202448
19           31361233   31192874

0 overlap means the chunks are totally independent. My LRM has a minimum match length of 8, and an LRM match must also cover the full rolling hash length. The "with LRM" in the above test used a step of 10 and hash length of 12.

LRM helps less as the overlap gets bigger, because you find the most important matches in the overlap region. Also enwik8 being text doesn't really have that huge repeated block that lots of binary data has. (on many of my binary test files, turning on LRM gives a huge jump because some chunk is completely repeated from the beginning of the file to the end). On text it's more incremental.

We can also look at how compression varies with the LRM step and hash length :

lrm_step  lrm_length  compressed
 0          0         32020443
32         32         31846039
16         32         31801822
16         16         31669798
12         16         31629439
 8         16         31566822
12         12         31599906
10         12         31568893
 8         12         31529746
 6         12         31478409
 8          8         31511345
 6          8         31457094

(this test was run with 4 MB overlap). On text you really want the shortest hash you can get. That's not true for binary though, 12 or 16 is usually best. Longer than that hurts compression a little but may help speed.

For reference, some other compressors on enwik8 (from Matt's LTCB page )


enwik8
lzham alpha 3 x64 -m4 -d29                    24,954,329
cabarc 1.00.0601  -m lzx:21                   28,465,607
bzip2 1.0.2       -9                          29,008,736
crush 0.01        cx                          32,577,338
gzip 1.3.5        -9                          36,445,248


ADDENDUM : newer LRM, with bubble-back and other improvements :


LZNib enwik8
lots of win from bubble back (LRM Scanner Windowed) :

    32/32 no bubble : 31,522,926
    32/32 wi bubble : 31,445,380
    12/12 wi bubble : 30,983,058
    10/12 no bubble : 31,268,849
    10/12 wi bubble : 30,958,529
     6/10 wi bubble : 30,886,133


09-26-12 | Optimizing Carefully and Compression Bugs

This Theora update is a good reminder to be careful when optimizing. Apparently they had a massively broken DCT which was just throwing away quality for no good reason.

This may seem really bone-headed but it's very very common. You may recall in some of my past posts I looked into some basic video stuff in great detail and found that almost every major video encoder/decoder is broken on the simple stuff :

the DCT (or whatever transform)
quantization/dequantization
color space conversion
upsample/downsample
translating float math into ints

These things are all relatively trivial, but it's just SO common to get them wrong, and you throw away a bottom bit (or half a bottom bit) when you do so. Any time you are writing a compressor, you need to write reference implementations of all these basic things that you know are right - and check them! And then a crucial thing is : keep the reference implementation around! Ideally you would be able to switch it on from the command line, or failing that with a build toggle, so at anytime you can go back and enable the slow mode and make sure everything works as expected.

(of course a frequent cause of trouble is that people go and grab an optimized integer implementation that they found somewhere, and it either is bad or they use it incorrectly (eg. maybe it assumes data that's biased in a certain way, or centered at 0, or scaled up by *8, or etc))

A lot of this basic stuff in video is very easy to do regression tests on (eg. take some random 8x8 data, dct, quantize, dequantize, idct, measure the error, it should be very low) so there's no excuse to get it wrong. But even very experienced programmers do get it wrong, because they get lazy. They might even start with a reference implementation they know is right, and then they start optimizing and translating stuff into ints or SIMD, and they don't maintain the slow path, and somewhere along the line a mistake slips in and they don't even know it.
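
To be concrete, here's the shape of such a regression test (a sketch : the fdct/quant/dequant/idct prototypes are stand-ins for your own implementations, and the acceptable error depends on your scaling and quantizer conventions) :

#include <stdlib.h>

// stand-ins for your transform path :
void fdct8x8(const short * in, short * out);
void quantize(const short * in, short * out, int qp);
void dequantize(const short * in, short * out, int qp);
void idct8x8(const short * in, short * out);

double test_roundtrip_mse(int qp, int trials)
{
    double totalSqErr = 0;
    for (int t = 0; t < trials; t++)
    {
        short block[64], coeffs[64], quant[64], dequant[64], recon[64];
        for (int i = 0; i < 64; i++)
            block[i] = (short)((rand() & 511) - 256); // random pixel-ish data

        fdct8x8(block, coeffs);
        quantize(coeffs, quant, qp);
        dequantize(quant, dequant, qp);
        idct8x8(dequant, recon);

        for (int i = 0; i < 64; i++)
        {
            double e = recon[i] - block[i];
            totalSqErr += e * e;
        }
    }
    // at the lowest qp this should be just rounding error ;
    // if it's not, something in the chain is broken
    return totalSqErr / (trials * 64.0);
}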


I've been thinking about a more difficult problem, which is : how do you deal with bugs in compression algorithms?

I don't mean bugs like "it crashes" or "the decompressor doesn't reproduce the original data" - those are the easy kind of bugs and you just go fix them. I mean bugs that cause the compressor to not work the way you intended, and thus not compress as much as it should.

The very hard thing about these bugs is that you can have them and not even know it; I'm sure I have a hundred of them right now. Frequently they are tiny things like you have a less-than where you should have a less-or-equal.

To avoid them really requires a level of care that most programmers never use. You have to be super vigilant. Any time something surprises you or is a bit fishy, you can't just go "hmm that's weird, oh well, move on to the next task". You have to stop and think and look into it. You have to gather data obsessively.

Any time you implement some new idea and it doesn't give you the compression win you expected, you can't just say "oh well guess that didn't work", you have to treat it like a crash bug, and go set breakpoints and watch your variables and make sure it really is doing what you think; and if it is, then you have to gather stats about how often that code is hit and what the values are, and see where your expectations didn't match reality.


I've really been enjoying working on compression again. It's one of the most fun areas of programming that exists. What makes it great :

1. Clear objective measure of success. You can measure size and speed (or whatever other criteria) and see if you are doing well or not. (lossy compression is harder for this).

2. Decompressors are one extreme of fun "hacker" programming; they have to be very lean; great decompressors are like haikus, they're little pearls that you feel could not get any simpler without ruining them.

3. Compressors, on the other hand, can be big and slow, and you get to pull out all the big guns of algorithms for optimization and searching and so on.


09-24-12 | I prefer the old C Preprocessor

It was so much better back when CPP was just text sub. (for context, I'm working on making structs that can encapsulate arbitrary function calls).

I want to be able to use CPP to make lists of N args (N a compile-time variable) like :


(int stuff0, int stuff1, int stuff2)

In MSVC (any old compiler where CPP is just text sub) you can do this quite easily.


#define LIST1(prefix,between)   RR_STRING_JOIN(prefix,1)
#define LIST2(prefix,between)   LIST1(prefix,between) between RR_STRING_JOIN(prefix,2)
#define LIST3(prefix,between)   LIST2(prefix,between) between RR_STRING_JOIN(prefix,3)
#define LIST4(prefix,between)   LIST3(prefix,between) between RR_STRING_JOIN(prefix,4)
#define LIST5(prefix,between)   LIST4(prefix,between) between RR_STRING_JOIN(prefix,5)
#define LIST6(prefix,between)   LIST5(prefix,between) between RR_STRING_JOIN(prefix,6)
#define LIST7(prefix,between)   LIST6(prefix,between) between RR_STRING_JOIN(prefix,7)
#define LIST8(prefix,between)   LIST7(prefix,between) between RR_STRING_JOIN(prefix,8)
#define LIST9(prefix,between)   LIST8(prefix,between) between RR_STRING_JOIN(prefix,9)

#define LISTN(N,prefix,between) RR_STRING_JOIN(LIST,N)(prefix,between)

#define LISTCOMMAS(N,prefix)        LISTN(N,prefix,COMMA)

#define COMMA   ,

and then you can use it like :

#define TestFuncN(N)  void RR_STRING_JOIN(TestFunc,N) ( LISTCOMMAS(N,int arg) );

Similarly for other variants of LIST, and then you can quite neatly construct structs/templates that take N args of N types.

But this doesn't work in compilers (GCC) with the newer standard that says preprocessor tokens have to be C identifiers (or whatever pedantic thing it says). IMO it's another one of those GCC/C99 (C89?) things that breaks old code and takes power away from the programmer and has very little to no benefit. (I guess I just really don't like the strict C standard).

Urg. GCC is like the nit-picky guy on the team who wants to endlessly debate some pointless crap that doesn't actually help anyone.

Is there any way to do this type of thing correctly? I'm so sick of manually making variants for N args every time I want this, the freaking preprocessor is the perfect tool to make all the N-arg variants for me if only they would let me use it.

(see for example : cblib/autoprintf.inl or cblib/callback.h)

To be concrete, a common usage case is something like cblib/callback where you want a struct to encapsulate a member function call. You have to do something like :


    explicit CallbackM3(T_ClassPtr c, T_fun_type f, Arg1 a1 , Arg2 a2, Arg3 a3, double when) : Callback(when)
    {
        ASSERT( c != NULL && f != NULL );
        __p = c;
        _mem_fun = f;
        _arg1 = a1;
        _arg2 = a2;
        _arg3 = a3;
    }

and write variants for every number of args. With the LIST macros I could very easily make variants for N args automatically.

... later ...

Oh well, I just bit the bullet and made a bunch of these :


#if NUM >= 1
    prefix1 
#if NUM >= 2
    prefix2 
#if NUM >= 3
    prefix3 
#if NUM >= 4
    prefix4 
#if NUM >= 5
    prefix5 
#if NUM >= 6
    prefix6 
#if NUM >= 7
    prefix7 
#if NUM >= 8
    prefix8 
#if NUM >= 9
    prefix9 
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif

which is much much worse than the LIST solution but works in GCC. And of course that could be much nicer if you could put preprocessor directives inside macros, cuz then I could just have a macro that does that, instead I have to copy-paste the whole thing all over.

Bleh. How lame is it that C++98 doesn't have a way to encapsulate a function call in a struct. (yes yes I know we finally have lambdas now, well not now but maybe in 10 years or so).

Another thing that would have made this all easier would be if C had a "null" type. Then I could just make templates that always take 10 arguments, and for the fewer-argument variants I could make the later ones be null types. eg. for cases like :


template <typename t1,typename t2,typename t3,typename t4>
struct stuff
{
    t1 m1;
    t2 m2;
    t3 m3;
    t4 m4;
};

I could just have stuff<int,int,null,null> to get a two-member variant. Meh, I guess there are lots of annoying omissions in C++98. Why can't I have "typeof(var)" ? Why can't I say "if ( defined(type) )" ? etc. etc. You would also need the ability to not even try to compile scopes that are compile-time unreachable (eg. stuff in if(0) ).

Hmm.. so one option for this is I could just run a different CPP before compiling; it used to be that CPP was completely independent from the compiler, and you could do whatever you want there, but I'm not sure that's the case any more. (eg. assuming I want things like debugging in the original pre-CPP code to work). Or I could just eat the pain once again and work around GCC yet again.


09-24-12 | LZ String Matcher Decision Tree

Revisiting this to clarify a bit on the question of "I want to do X, which string matcher should I use?"

Starting from the clearest cases to the least clear :

There are two types of parsing, I will call them Optimal and Greedy. What I really mean is Optimal = find matches at every position in the file, and Greedy = find matches only at some positions, and then take big steps over the match. (Lazy parses and such are in the Greedy category).

There are three types of windowing : 1. not windowed (eg. infinite window or we don't really care about windowing), 2. fixed window; eg. you send offsets in 16 bits so you have a 64k window for matches, 3. multi-window or "cascade" ; eg. you send offsets up to 128 in 1 byte , up to 16384 in 2 bytes, etc. and you want to find the best match in each window.

There are two scan cases : 1. Incremental scan; that is, we're matching against the same buffer we are parsing; matches cannot come from in front of our current parse position, 2. Separate scan - the match source buffer is independent from the current parse point, eg. this is the case for precondition dictionaries, or just part of the file that's well before the current parse point.

1. Optimal Parsing, No window, Incremental scan : Suffix Trie is the clear winner here. Suffix Trie is only a super clear winner when you are parsing and matching at the same time; adding strings and getting matches are exactly the same tree walk, so if you do them separately you double your time. That is, you must be scanning forward, adding strings and getting matches as you go. Suffix Trie can be extended to Cascaded Windowing in an approximate way, by walking to parents in the tree, but doing it exactly breaks the O(N) of the Suffix Trie.

2. Optimal Parsing, No window or single window, Separate Scan : Suffix Array is pretty great here. Separate scan means you can just take the whole buffer you want to match against and suffix sort it.

(BTW this is a general point that I don't think most people get - any time you are not doing incremental update, a sort is a superb search structure. For example it's very rarely best to use a hash table when you are doing separate scan, you should just have a sorted list, possibly with jump-ins)

3. Optimal Parsing, Windowed or Cascaded, Incremental or Separate Scan : there's not an awesome solution for this. One method I use is cascades of suffix arrays. I wrote in the past about how to use Suffix Array Searcher with Incremental Scan (you have to exclude positions ahead of you), and also how to extend it to Windowing. But those methods get slow if the percentage of matches allowed gets low; eg. if you have a 1 GB buffer and a 64k window, you get a slowdown proportional to (1GB/64k). To address this I use chunks of suffix array; eg. for a 64k window you might cut the file into 256k chunks and sort each one, then you only have to search in a chunk that's reasonably close to the size of your window. For cascaded windows, you might need multiple levels of chunk size. This is all okay and it has good O(N) performance (eg. no degeneracy disasters), but it's rather complex and not awesome.

Another option for this case is just to use something like Hash->Links and accept its drawbacks. A more complex option is to use a hybrid; eg. for cascaded windows you might use Hash->Links for the small windows, and then Suffix Array for medium size windows, and Suffix Trie for infinite window. For very small windows (4k or less) hash->links (or even just a "cache table") is very good, so it can be a nice supplement to a matcher like suffix trie, which is not great at cascaded windows.

Addendum : "Suffix Array Sets" is definitely the best solution for this.

4. Greedy Parsing : Here SuffixArray and SuffixTrie both are not awesome, because they are essentially doing all the work of an optimal-style parse (eg. string matching at every position), which is a big waste of time if you only need the greedy matches.

Hash-Link is comparable to the best matcher that I know of for greedy parsing. Yann's MMC is generally a bit faster (or finds better matches at the same speed) but is basically in the same class. The pseudo-binary-tree thing used in LZMA (and I believe it's the same thing that was used in the original PkZip that was patented) is not awesome; sometimes it's slightly faster than hash-link, sometimes slightly slower. All of them window relatively easily.

Hash-Link extends very naturally to cascaded windows, because you are always visiting links in order from lowest offset to highest, you can easily find exact best matches in each window of the cascade as you go.
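
A sketch of that walk (names mine; head[]/link[] are the usual hash-link arrays, and the window sizes are illustrative) :

#include <stdint.h>

#define NUM_WINDOWS 3
static const int windowSize[NUM_WINDOWS] = { 1<<7, 1<<14, 1<<21 }; // cascade

// walk candidates from lowest offset to highest ; because offsets only
// increase as we walk, the running best for each window is exact :
void FindCascadedMatches(const uint8_t * data, int pos, int maxLen,
                         const int * head, const int * link, uint32_t hash,
                         int bestLen[NUM_WINDOWS], int bestPos[NUM_WINDOWS])
{
    for (int w = 0; w < NUM_WINDOWS; w++) { bestLen[w] = 0; bestPos[w] = -1; }

    for (int cand = head[hash]; cand >= 0; cand = link[cand])
    {
        int dist = pos - cand;
        if (dist > windowSize[NUM_WINDOWS-1]) break; // past the biggest window

        int len = 0; // caller guarantees maxLen stays in bounds
        while (len < maxLen && data[cand + len] == data[pos + len]) len++;

        // a match in a small window also counts for all larger windows :
        for (int w = 0; w < NUM_WINDOWS; w++)
            if (dist <= windowSize[w] && len > bestLen[w])
                { bestLen[w] = len; bestPos[w] = cand; }
    }
}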

With Greedy Parsing you don't have to worry about degeneracies quite so much, because when you find a very long match you are just going to take it and step over it. (that is, with optimal parse if you find a 32k long match, then at the next step you will find a 32k-1 match, etc. which is a bad N^2 (or N^3) thing if you aren't super careful (eg. use a SuffixTrie with correct "follows" implementation)). However, with lazy parsing you can still hit a mild form of this degeneracy, but you can avoid that pretty easily by just not doing the lazy eval if your first match length is long enough (over 1024 or whatever).

(BTW I'm pretty sure it's possible to do a Suffix Trie with lazy/incremental update for Greedy Parsing; the result should be similar to MMC but provide exact best matches without any degenerate bad cases; it's rather complex and I figure that if I want perfect matching I generally also want Optimal Parsing, so the space of perfect matching + greedy parsing is not that important)

Previous posts on string matching :

cbloom rants 06-17-10 - Suffix Array Neighboring Pair Match Lens
cbloom rants 09-23-11 - Morphing Matching Chain
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
cbloom rants 09-25-11 - More on LZ String Matching
cbloom rants 09-26-11 - Tiny Suffix Note
cbloom rants 09-27-11 - String Match Stress Test
cbloom rants 09-28-11 - Algorithm - Next Index with Lower Value
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 09-30-11 - Don't use memset to zero
cbloom rants 09-30-11 - String Match Results Part 1
cbloom rants 09-30-11 - String Match Results Part 2
cbloom rants 09-30-11 - String Match Results Part 2b
cbloom rants 09-30-11 - String Match Results Part 3
cbloom rants 09-30-11 - String Match Results Part 4
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 10-01-11 - More Reliable Timing on Windows
cbloom rants 10-01-11 - String Match Results Part 6
cbloom rants 10-02-11 - How to walk binary interval tree
cbloom rants 10-03-11 - Amortized Hashing
cbloom rants 10-18-11 - StringMatchTest - Hash 1b
cbloom rants 11-02-11 - StringMatchTest Release


09-23-12 | Patches and Deltas

A while ago Jon posted a lament about how bad Steam's patches are. Making small patches seems like something nice for Oodle to do, so I had a look into what the state of the art is for patches/deltas.

To be clear, the idea is computer 1 (receiver) has a previous version of a file (or set of files, but for now we'll just assume it's one file; if not, make a tar), computer 2 (transmitter) has a newer version and wishes to send a minimum patch, which computer 1 can apply to create the newer version.

First of all, you need to generate patches from uncompressed data (or the patch generator needs to be able to do decompression). Once the patch is generated, it should generally be compressed. If you're trying to patch a zip to a zip, there will be lots of different bits even if the contained files are the same, so decompress first before patching.

Second, there are really two classes of this problem, and they're quite different. One class is where the transmitter cannot see the old version that the receiver has; this is the case where there is no authoritative source of data; eg. in rsync. Another class is where the transmitter has all previous versions and can use them to create the diff; this is the case eg. for game developers creating patches to update installed games.

Let's look at each class.

Class 1 : Transmitter can't see previous version

This is the case for rsync and Windows RDC (Remote Differential Compression).

The basic way all these methods work is by sending only hashes of chunks of data to each other, and hoping that when the hashes for chunks of the files match, then the bytes actually were the same. These methods are fallible - it is possible to get corrupt data if you have an unlucky hash match.

In more detail about how they work :

The file is divided into chunks. It's important that these chunks are chosen based on the *contents* of the file, not just every 256 bytes or whatever, some fixed size chunking. If you did fixed size chunking, then just adding 1 byte at the head of a file would make every chunk different. You want to use some kind of natural signature to choose the chunk boundaries. (this reminds me rather of SURF type stuff in image feature detection).

I've seen two approaches to finding chunking boundaries :

1. Pick a desired average chunk size of L. Start from the previous chunk end, and look ahead 2*L and compute a hash at each position. The next chunk boundary is set to the local min (or max) of the hash value in that range.

2. Pick a desired average chunk size of 2^B. Make a mask M with B random bits set. Compute a hash at each position in the file; any position with (hash & M) == (M) is a boundary; this should occur once in 2^B bytes, giving you an average chunk len of 2^B.

Both methods can fall apart in degenerate areas, so you could either enforce a maximum chunk size, or you could specifically detect degenerate areas (areas with the same hash at many positions) and handle them as a special case.
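
Here's a sketch of method 2 with a simple polynomial rolling hash (the constants are illustrative; the max-chunk clamp is the degenerate-case guard just mentioned) :

#include <stdint.h>
#include <vector>

std::vector<int> FindChunkBoundaries(const uint8_t * data, int size,
                                     int window,   // bytes covered by the rolling hash
                                     int avgBits,  // B : average chunk size = 2^B
                                     int maxChunk) // degenerate-case clamp
{
    std::vector<int> boundaries;
    const uint32_t M = 2654435761u;      // any odd multiplier
    uint32_t mask = (1u << avgBits) - 1; // B bits set (random bits work too)
    uint32_t hash = 0, mulOut = 1;
    for (int i = 1; i < window; i++) mulOut *= M;

    int last = 0;
    for (int i = 0; i < size; i++)
    {
        if (i < window) hash = hash * M + data[i];
        else            hash = (hash - data[i - window] * mulOut) * M + data[i];

        // a hash hit occurs once in 2^B positions on average :
        if ((i >= window && (hash & mask) == mask) || (i + 1 - last) >= maxChunk)
        {
            boundaries.push_back(i + 1); // chunk ends after byte i
            last = i + 1;
        }
    }
    return boundaries;
}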

So once you have these chunk boundaries, you compute a strong hash for each chunk (MD5 or SHA or whatever; actually any many-bit hash is fine, the cryptographic hashes are widely over-used for this, they are generally slower to compute than an equally strong non-cryptographic hash). Then the transmitter and receiver send these hashes between each other; if the hashes match they assume the bytes match and don't send the bytes. If the hashes differ, they send the bytes for that chunk.

When sending the bytes for a chunk that needs updating, you can use all the chunks that were the same as context for a compressor.

If the file is large, you may wish to use multi-scale chunking. That is, first do a coarse level of chunking at large scale to find big regions that need transmission, then for each of those regions do finer scale chunking. One way to do this is to just use a constant size chunk (1024 bytes or whatever), and to apply the same algorithm to your chunk-signature set; eg. recurse (RDC does this).

Class 2 : Transmitter can see previous version

This case is simple and allows for smaller patches (as well as guaranteed, non-probabilistic patches). (you probably want to do some simple hash check to ensure that the previous versions do in fact match).

The simplest way to do this is just to take an LZ77 compressor, take your previous version of the file and put it in your LZ dictionary, then compress the new version of the file. This will do byte-exact string matching and find any parts of the file that are duplicated in the previous version.

(aside : I went and looked for "preload dictionary" options in a bunch of mainstream compressors and couldn't find it in any of them. This is something that every major compressor should have, so if you are the author of FreeArc or 7zip or anything like that, go add a preload dictionary option)
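
(to be fair, zlib the *library* does expose the idea via deflateSetDictionary / inflateSetDictionary, though its 32k window makes it only a toy for real patches. A minimal sketch of class-2 patch generation that way, just to show the shape :)

#include <string.h>
#include "zlib.h"

// returns 0 on success ; the patch is just a deflate stream whose LZ window
// was preconditioned with (the tail of) the old version. patchCap must be
// big enough (use deflateBound in real code).
int MakePatch(const unsigned char * oldV, unsigned oldLen,
              const unsigned char * newV, unsigned newLen,
              unsigned char * patch, unsigned patchCap, unsigned * patchLen)
{
    z_stream z; memset(&z, 0, sizeof(z));
    if (deflateInit(&z, Z_BEST_COMPRESSION) != Z_OK) return -1;

    unsigned dictLen = (oldLen > 32768) ? 32768 : oldLen;
    deflateSetDictionary(&z, oldV + oldLen - dictLen, dictLen);

    z.next_in  = (Bytef *) newV;  z.avail_in  = newLen;
    z.next_out = patch;           z.avail_out = patchCap;
    int ret = deflate(&z, Z_FINISH);
    *patchLen = patchCap - z.avail_out;
    deflateEnd(&z);
    return (ret == Z_STREAM_END) ? 0 : -1;
}

// the receiver does the mirror image : inflate, and when inflate returns
// Z_NEED_DICT, call inflateSetDictionary with the same old-version bytes.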

(aside : you could use other compressors than LZ77 of course; for example you could use PPM (or CM) and use the previous version to precondition the model. For large preconditions, the PPM would have to be very high order, probably unbounded order. An unbounded order PPM would be just as good (actually, better) at differential compression than LZ77. The reason why we like LZ77 for this application is that the memory use is very low, and we want to use very large preconditions. In particular, the memory use (in excess of the window itself) for LZ77 compression can be very low without losing the ability to deduplicate large blocks; it's very easy to control, and when you hit memory limits you simply increase the block size that you can deduplicate; eg. up to 1 GB you can find all dupes of 64 bytes or more; from 1-2 GB you can find dupes of 128 bytes or more; etc. this kind of scaling is very hard to do with other compression algorithms)

But for large distributions, you will quickly run into the limits of how many byte-exact matches an LZ77 matcher can handle. Even a 32 MB preload is going to stress most matchers, so you need some kind of special very-large-window matcher to find the large repeated blocks.

Now at this point the approaches for very-large-window matching look an awful lot like what was done in class 1 for differential transmission, but it is really a different problem and not to be confused.

The standard approach is to pick a large minimum match length L (L = 32 or more) for the long-range matches, and to only put them in the dictionary once every N bytes (N = 16 or more, scaling based on available memory and the size of the files). So basically every N bytes you compute a hash for the next L bytes and add that to the dictionary. Now when scanning over the new version to look for matches, you compute an L-byte hash at every position (this is fast if you use a rolling hash) and look that up.
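
Sketched out (the hash table choice is for clarity, not speed, and the names are mine; the scan side would use the rolling update at every position rather than re-hashing the way FindMatch does here) :

#include <stdint.h>
#include <vector>
#include <unordered_map>

// same polynomial form as a rolling hash, so the scan can roll it :
static uint32_t HashL(const uint8_t * p, int L)
{
    uint32_t h = 0;
    for (int i = 0; i < L; i++) h = h * 2654435761u + p[i];
    return h;
}

struct LongRangeDict
{
    std::unordered_map<uint32_t, std::vector<int>> table; // hash -> positions in old file
    int L, N;

    void Build(const uint8_t * oldData, int oldSize, int hashLen, int step)
    {
        L = hashLen; N = step;
        for (int pos = 0; pos + L <= oldSize; pos += N)
            table[HashL(oldData + pos, L)].push_back(pos);
    }

    // longest match for newData[pos..] against the old file ; extends
    // byte-by-byte off the end of the L-byte seed, as described above.
    // assumes pos + L <= newSize.
    int FindMatch(const uint8_t * oldData, int oldSize,
                  const uint8_t * newData, int newSize, int pos,
                  int * pMatchPos) const
    {
        auto it = table.find(HashL(newData + pos, L));
        if (it == table.end()) return 0;
        int bestLen = 0;
        for (int cand : it->second)
        {
            int len = 0;
            while (cand + len < oldSize && pos + len < newSize &&
                   oldData[cand + len] == newData[pos + len])
                len++;
            if (len > bestLen) { bestLen = len; *pMatchPos = cand; }
        }
        return (bestLen >= L) ? bestLen : 0; // a hash hit must be verified
    }
};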

One interesting variant of this is out-of-core matching; that is if the previous version is bigger than memory. What you can do is find the longest match using only the hashes, and then confirm it by pulling the previous file bytes back into memory only when you think you have the longest one. (SREP does some things like this; oddly SREP also doesn't include a "preload dictionary" option, or it could be used for making patches)

In the end you're just generating LZ matches though. Note that even though you only make dictionary entries every N bytes for L byte chunks, you can generate matches of arbitrary length by doing byte-by-byte matching off the end of the chunk, and you can even adjust to other offsets by sliding matches to their neighboring bytes. But you might not want to do that; instead for very large offsets and match lengths you could just not send some bottom bits; eg. only send a max of 24 bits of offset, but you allow infinite window matches, so over 24 bits of offset you don't send some of the bottom bits.

Special Cases

So far we've only looked at pretty straightforward repeated sequence finding (deduplication). In some cases, tiny changes to original files can make lots of derived bytes differ.

A common case is executables; a tiny change to source code can cause the compiled exe to differ all over. Ideally you would back up to the source data and transmit that diff and regenerate the derived data, but that's not practical.

Some of the diff programs have special case handling for exes that backs out one of the major problems : jump address changes. Basically the problem is if something like the address of memcpy changes (or the location of a vtable, or address of some global variable, etc..), then you'll have diffs all over your file and generating a small patch will be hard.

I speculate that what these diffs do basically is first do the local-jump to absolute-jump transform, and then they create a mapping of the absolute addresses to find the same routine in the new & old files. They send the changed address, like "hey, replace all occurrences of address 0x104AC with 0x10FEE", so that chunks that only differ by some addresses moving can be counted as unchanged.
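
The local-to-absolute transform half of that is a standard exe prefilter trick (ala the BCJ filters in LZMA); roughly, for x86 E8 calls (ignoring false positives) :

#include <stdint.h>
#include <string.h>

// convert rel32 call targets to absolute, in place ; after this, all calls
// to the same routine have identical bytes, so they dedupe / diff cleanly
void ExeFilter_RelToAbs(uint8_t * data, int size)
{
    for (int i = 0; i + 5 <= size; i++)
    {
        if (data[i] == 0xE8) // call rel32
        {
            int32_t rel;  memcpy(&rel, data + i + 1, 4);
            int32_t abs_ = rel + (i + 5); // target relative to file start
            memcpy(data + i + 1, &abs_, 4);
            i += 4;
        }
    }
}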

(bsdiff does some fancy stuff like this for executables) (ADDENDUM : not really; what bsdiff does is much more general and not as good on exes; see comments)

If you're trying to send small patches of something like lightmaps, you might have areas where you just increased the brightness of a light; that might change every pixel and create a huge diff. It might be possible to express deltas of image (and sound) data as linear transforms (add & scale). An alternative would be finding the original piece of content and just using it as a mocomp source (dictionary precondition) for an image compressor. But at some point the complexity of the diff becomes not worth it.

Links

In no particular order :

-ck hacking lrzip-0.613
ngramhashing - Rolling Hash C++ Library - Google Project Hosting
A new 900GB compression target
Patchdelta compression
Remote diff utility
SREP huge-dictionary LZ77 preprocessor
Long Range ZIP – Freecode
About Remote Differential Compression (Windows)
There is a Better Way. Instead of using fixed sized blocks, use variable sized b... Hacker News
bsdiff windows
ZIDRAV Free Communications software downloads at SourceForge.net
Binary diff (bsdiff)
Data deduplication (exdupe)
xdelta
Tridge (rsync)

BTW some of you have horrible page titles. "binary diff" is not a useful page title, nor is "data deduplication". It's like all the freaking music out there named "track1.mp3".

I have not done an extensive review of the existing solutions. I think bsdiff is very strong, but is limited to relatively small files, since it builds an entire suffix sort. I'm not sure what the best solution is for large file diffs; perhaps xdelta (?). The algorithms in rsync look good but I don't see any variant that makes "class 2" (transmitter has previous version) diffs. It seems neither lrzip nor srep has a "precondition dictionary" option (wtf?). So there you go.


09-22-12 | Oodle Beta and Roadmap

Oodle went Beta a few weeks ago (yay). If you're a game developer interested in Oodle you can mail oodle at rad.

I wanted to write about what's in Oodle new and what's coming, as a sort of roadmap for myself and others. This is not a detailed feature list; contact RAD for documents with more details about Oodle features.

Oodle Beta : (now)

What's in Oodle at the moment :

Oodle RC / 1.0 : (2012-2013)

Oodle 1.1 : (around GDC 2013)

Oodle 1.2/2.0 :


09-22-12 | Input Streams and Focus Changes

Clearly apps should have an input/render thread which takes input and immediately responds to simple actions even when the app is busy doing processing.

This thread should be able to display the current state of the "world" (whatever the app is managing) and let you do simple things like move/resize windows, scroll, etc. without blocking on complex processing.

Almost every app gets this wrong; even the ones that try (like some web browsers) just don't actually do it; eg. you should never ever get into a situation where you browse to a slow page that has some broken script or something, and that causes your other tabs to become unresponsive. (part of the problem with web browsers these days of course is that scripts are allowed to do input processing, which never should have been allowed, but anyhoo).

Anyway, that's just very basic and obvious. A slightly more advanced topic is how to respond to input when the slow processing causes a change in state which affects input processing.

That is, we see a series of input commands { A B C D ... } and we start doing them, but A is some big slow operation. As long as the commands are completely independent (like "pure" functions) then we can just fire off A, then while it's still running we go ahead and execute B, C, D ...

But if A is something like "open a new window and take focus" , then it's completely ambiguous about whether we should go ahead and execute B,C,D now or not.

I can certainly make arguments for either side.

Argument for "go ahead and process B C D immediately" :

Say for example you're in a web browser and you click on a link as action A. The link is very slow to load so you decide you'll do something else and you center-click some other links on the original page to open them in new tabs. Clearly these inputs should be acted on immediately.

Argument for "delay processing B C D until A is done" :

To keep the examples similar, we'll assume a web browser again. Say you are trying to log into your bank, which you have done many times. You type in your user name and hit enter. You know that this will load the next page which will put you at a password prompt, so you go ahead and start typing your password. Of course those key presses should be enqueued until the focus change is done.

A proponent of this argument could outline two clear principles :

1. User input should be race free. That is, the final result should not depend on a race between my fingers and the computer. I should get a consistent result even if the processing of commands is subject to random delays. One way to do this is :

2. For keyboard input, any keyboard command which changes key focus should cause all future keyboard input to be enqueued until that focus change is done.

This certainly bugs me on a nearly daily basis. The most common place I hit it is in MSVC because that's where I spend most of my life, and I've developed muscle-memory for common things. So I'll frequently do something like hit "ctrl-F stuff enter" , expecting to do a search for "stuff" , only to be dismayed to see that for some inscrutable reason the find dialog box took longer than usual to open, and instead of searching for "stuff" I instead typed it into my source code and am left with an empty find box.

I think in the case of pure keyboard input in a source code editor, the argument for race-freeness of user input is the right one. I should be able to develop keyboard finger instinctive actions which have consistent results.

However, the counter-example of the slow web browser means that this is not an obvious general rule for user inputs.

The thing I ask myself in these scenarios is "if there was a tiny human inside my computer that was making this decision, could they do it?". If the answer to that question is yes, then it means that there is a solution in theory, it just may not be easy to express as a computer algorithm.

I believe that in this case, 99% of the time a human would be able to tell you if the input should be enqueued or not. For example in the source code "ctrl-F stuff" case - duh, of course he wants stuff to be in the find dialog, not typed into the source code; the human computer would get that right (by saying "enqueue the input, don't process immediately"). Also in the web browser case where I click a slow link and then click other stuff on the original page - again a human would get that right (by saying "don't enqueue the input, do process it immediately").

Obviously there are ambiguous cases, but this is an interesting point that I figured out while playing poker that I think most people don't get : the hard decisions don't matter !

Quickly repeating the point for the case of poker (I've written this before) : in poker you are constantly faced with decisions, some easy (in that the right answer is relatively obvious) and some very hard, where the right answer is quite unclear, maybe the right answer is not what the standard wisdom thinks it is, or maybe it requires deep thought. The thing is, the hard decisions don't matter. The reason they are hard is because the EV (expected value) of either line is very close; eg. maybe the EV of raise is 1.1 BB and the EV of call is 1.05 BB ; obviously in analysis you aren't actually figuring out the EV, but just the fact that it's not clear tells you that either line is okay.

The way that people lose value in poker is by flubbing the *easy* decisions. If you fold a full house on the river because you were afraid your opponent had quads, that is a huge error and gives up tons of value. When you fail to do something that is obviously right (like three-betting often enough from the blinds against aggressive late position openers) that is a big error. When you are faced with tricky situations that poker experts would have to debate for some time and still might not agree what the best line is - those are not important situations.

You can of course apply the same situation to politics, and here to algorithms. People love to debate the tricky situations, or to say that "that solution is not the answer because it doesn't work 100% of the time". That's stupid non-productive nit picking.

A common debate game is to make up extreme examples that prove someone's solution is not universal or not completely logical or self-consistent. That's retarded. Similarly, if you have a good solution for case A, and a good (different) solution for case B, a common debate game is to interpolate the cases and find something in the middle where it's ambiguous or neither solution works, and the sophomoric debater contends that this invalidates the solutions. Of course it doesn't, it's still a good solution for case A and case B and if those are the common cases then who cares.

What actually matters is to get the answer right *when there is obviously a right answer*.

In particular with user input response, the user expects the app to respond in the way that it obviously *should* respond when there is an obvious response. If you do something that would be very easy for the app to get right, and it gets it wrong, that is very frustrating. However if you give input that you know is ambiguous, then it's not a bad thing if the app gets it wrong.


09-15-12 | Some compression comparison charts

These charts show the time to load + decompress a compressed file using various compressors.

(the compressors are the ones I can compile into my testbed and run from code, eg. not command line apps; they are all run memory to memory; I tried to run all compressors in max-compression-max-decode-speed mode, eg. turning on heavy options on the encode side. Decode times are generated by running each decompressor 10 times locked to each core of my box (80 total runs) and taking the min time; the cache is wiped before each decode. Load times are simulated by dividing the compressed file size by the disk speed parameter. All decoders were run single threaded.)

They are sorted by fastest decompressor. (the "raw" uncompressed file takes zero time to decompress).

"sum" = the sum of decomp + load times. This is the latency if you load the entire compressed file and then decompress in series.

"max" = the max of decomp & load times. This is the latency if the decomp and load were perfectly parallelizable, and neither ever stalled on the other.

The criterion you actually want to use is something between "sum" and "max", so the idea is you look at them both and kind of interpolate in your mind. (obviously you can replace "disk" with ram or network or "channel")
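
For example : say a 100 mb file compresses to 50 mb, the disk is 100 mbps, and the decoder outputs 200 mbps. Load is 0.5s and decode is 0.5s, so "sum" = 1.0s (an effective 100 mbps, no better than raw) while "max" = 0.5s (an effective 200 mbps, twice raw). The truth is somewhere in between.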

Discussion :

The compressors are ordered from left to right by speed. If you look at the chart of compressed file sizes, they should be going monotonically downward from left to right. Any time it pops up as you go to the right (eg. at snappy, minilzo, miniz, zlib) that is just a bad compressor; it has no hope of being a good choice in terms of space/speed tradeoff. The only ones that are on the "Pareto frontier" are raw, LZ4, OodleLZH, and LZMA.

Basically what you should see is that on a fast disk (100 mbps (and mb = millions of bytes, not 1024*1024)), a very slow decoder like LZMA does not make a lot of sense, you spend way too much time in decomp. On very slow data channels (like perhaps over the net) it starts to make sense, but you have to get to 5 mbps or slower before it becomes a clear win. (of course there may be other reasons that you want your data very small other than minimizing time to load; eg. if you are exceeding DVD capacity).

On a fast disk, the fast decompressors like LZ4 are appealing. (though even at 100 mbps, OodleLZH has a lower "max"; LZ4 has the best "sum").

Of the fast decoders, LZ4 is just the best. (in fact LZ4-largewindow would be even stronger). Zip is pretty poor; the small window is surely hurting it, it doesn't find enough matches which not only hurts compression, it hurts decode speed. Part of the problem is neither miniz nor zlib have super great decoders with all the tricks.

It's kind of ridiculous that we don't have a single decent mainstream free compression library. Even just zip-largewindow would be at least decent. (miniz could easily be extended to large windows; that would make it a much more competitive compressor for people who don't care about zlib compatibility)

If you are fully utilizing your CPU's, you may need a low-CPU decoder even if it's not the best choice in a vacuum. In fact because of that you should avoid CPU-hungry decoders even if they are the best by some simple measure like time to load. eg. even in cases where LZMA does seem like the right choice, if it's close you should avoid it, because you could use that CPU time for something else. You could say that any compressor that can decode faster than it can load compressed data is "free"; that is, the decode time can be totally hidden by parallelizing with the IO and you can saturate the disk loading compressed data. While that is true it assumes no other CPU work is being done, so does not apply to games. (it does sort of apply to archivers or installers, assuming you aren't using all the cores).

As a rough rule of thumb, compressors that are in the "sweet spot" take time that is roughly on par with the disk time to load their compressed data. That is, maybe half the time, or double the time, but not 1/10th the time of the disk (then they are going too fast, compressing too little, leaving too much on the table), and also not 10X the time of the disk (then they are just going way too slow and you'd be better off with less compression and a faster compressor).


The other thing we can do is draw the curves and see who's on the Pareto frontier.

Here I make the Y axis the "effective mbps" to load and then decompress (sequentially). Note that "raw" is an x=y line, because the effective speed equals the disk speed.

Let me emphasize that these charts should be evaluated as information that goes into a decision. You do not just go "hey my disk is 80 mbps let me see which compressor is on top at that speed" and go use that. That's very wrong.

and the log-log (log base 2) :

You can see way down there at the bottom of the log-log, where the disk speed is 1.0 mbps, LZMA finally becomes best. Also note that a log2 disk speed of 10 is a gigabyte per second (2^10 mbps), almost the speed of memory.

Some intuition about log-log compressor plots :

Over on the right hand side, all the curves will flatten out and become horizontal. This is the region where the decompress time dominates and the disk speed becomes almost irrelevant (load time is very tiny compared to decompress time). You see LZMA flattens out at relatively low disk speed (at 16 mbps (log2 = 4) it's already pretty flat). The speed over at the far right approaches the speed of just the decompressor running memory to memory.

On the left all the curves become straight lines with a slope of 1 (Y = X + B). In this area their total time is dominated by their loading time, which is just a constant times the disk speed. In a log-log plot this constant multiple becomes a constant addition - the Y intercept of each curve is equal to log2(rawLen/compLen) ; eg someone with 2:1 compression will hit the Y axis at log2(2) = 1.0 . You can see them stacked up hitting the Y axis in order of who gets the most compression.

Another plot we can do is the L2 mean of load time and decode time ( sqrt( load^2 + decode^2 ) ). What the L2 mean does is penalize compressors where the load time and decode time are very different (it favors ones where they are nearly equal). That is, it sort of weights the average towards the max. I think this is actually a better way to rate a compressor for most usages, but it's a little hand-wavey so take it with some skepticism.


09-14-12 | Things Most Compressors Leave On the Table

It's very appealing to write a "pure" algorithmic compressor which just implements PPM or LZP or whatever in a very data agnostic way and make it quite minimal. But if you do that, you are generally leaving a lot on the table.

There are lots of "extra" things you can do on top of the base pure compressor. It makes it very hard to compare compressors when one of them is doing some of the extra things and another isn't.

I used to only write pure compressors and considered the extra things "cheating", but of course in practical reality they can sometimes provide very good bang-for-the-buck, so you have to do them. (and archivers these days are doing more and more of these things, so you will look bad in comparison if you don't do them).

Trying to dump out a list of things :

Parameter Optimization. Most compressors have some hard-coded parameters; sometimes it's an obvious one, like in LZMA you can set the # of position bits used in the context. Getting that number right for the particular file can be a big win. Other compressors have hard-coded tweaks that are not so obvious; for example almost all modern PPM or CM compressors use some kind of secondary-statistics table; the hash index made for that table is usually some heuristic, and tweaking it per file can be a big win.

Model Preconditioning. Any time you have a compressor that learns (eg. adaptive statistical coders, or the LZP cache table, or the LZ77 dictionary) - a "pure" compressor starts from an empty memory and then learns the file as it goes. But that's rarely optimal. You can usually get some win by starting from some pre-loaded state; rather than starting from empty and learning the file, you start from "default file" and learn towards the current file. (eg. every binary arithmetic probability should not start at 50% but rather at the expected final probability). And of course you can take this a step further by having a few different preconditions for different file types and selecting one.
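
eg. for binary adaptive models the mechanical difference is tiny (a sketch; the trained[] table is hypothetical, stored alongside the codec) :

#include <stdint.h>

struct BitModel { uint16_t p1; }; // P(bit==1) in [0,65536)

// "pure" init : everything starts at 50%
void InitModelsPure(BitModel * m, int count)
{
    for (int i = 0; i < count; i++) m[i].p1 = 1u << 15;
}

// preconditioned init : start from probabilities trained offline
// on a representative corpus
void InitModelsPreconditioned(BitModel * m, const uint16_t * trained, int count)
{
    for (int i = 0; i < count; i++) m[i].p1 = trained[i];
}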

Prefilters. BCJ (exe jump transform), deltas, image deltas, table record deltas (Bulat's DELTA), record transposes, various deswizzles, etc. etc. There are lots of prefilters possible, and they can provide very big wins for the amount of time they take. If you don't implement all the prefilters you are at a disadvantage to compressors that do. (for example, RAR has a pretty strong set of prefilters that are enabled by default, which means that RAR actually beats 7zip on lots of files, even though as a pure compressor it's much worse).

Header compression. Anything you send, like buffer sizes or compressor parameters, can generally be made smaller by more advanced modeling. Typically this is just a few bytes total so not important, but it can become important if you transmit incremental headers, or something like static huffman codes. eg. in something like Zip that can adapt by resending Huffmans, it's actually important to get that as small as possible, and it's usually something that's neglected because it's outside of the pure compressor.

Data Size Specialization. Most compressors work well on either large buffers or small buffers, not both; eg. if you do an LZSS, you might pick 3 byte offsets for large buffers, but on tiny buffers that's a huge waste; in fact you should use 1 byte offsets at first, and then switch to 2, and then 3. People rarely go to the trouble to have separately tuned algorithms for various buffer sizes.

Data specialization. Compressing text, structured records, images, etc. is actually all very different. You can get a major win by special-casing for the major types of data (eg. text has weird features like the top bits tell you the type of character; word-replacing transforms are a big win, as are de-punctuators, etc. etc.).

Decompression. One of the new favorite tricks is decompressing data to compress it. If someone hands you a JPEG or a Zip or whatever and tells you to compress it as well as possible, of course the first thing you have to do is decompress it to undo the bad compressor so you can get back to the original bits.

This is almost all stuff I haven't done yet, so I have some big win in my back pocket if I ever get around to it.

In the compression community, I'm happy to see packages like FreeArc that are gathering together the prefilters so that they can be used with a variety of back-end "pure" compressors.


09-13-12 | LZNib

LZNib is the straightforward/trivial way to do an LZ77 coder using EncodeMod for variable length numbers and 4-bit nibble aligned IO. That is, literals are always 8 bit; the control word is 4 bits and signals either a literal run len or a match length, using a range division value instead of a flag bit.

eg. if the nibble is < config_divider_lrl , it's a literal run len; if nibble is >= config_divider_lrl, it's a match.
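
The shape of the decode loop is roughly this (just my sketch of the scheme described above; the escape-to-varint conventions, MIN_MATCH, offset coding, and all the helper names are illustrative, not the actual LZNib format) :

#include <string.h>

// hypothetical helpers standing in for the nibble stream & EncodeMod varints :
int GetNibble();
int DecodeVarLen();
int DecodeOffset();

#define MIN_MATCH 4 // illustrative

void DecodeLZNibSketch(unsigned char * out, unsigned char * outEnd,
                       const unsigned char * in, int config_divider_lrl)
{
    while (out < outEnd)
    {
        int nib = GetNibble(); // 4-bit control
        if (nib < config_divider_lrl)
        {
            int lrl = nib; // literal run len ; the top value escapes to a varint
            if (nib == config_divider_lrl - 1) lrl += DecodeVarLen();
            memcpy(out, in, lrl); out += lrl; in += lrl;
        }
        else
        {
            int ml = (nib - config_divider_lrl) + MIN_MATCH; // match len, same escape
            if (nib == 15) ml += DecodeVarLen();
            int offset = DecodeOffset(); // byte-aligned variable length
            const unsigned char * src = out - offset;
            while (ml--) *out++ = *src++; // byte copy is overlap-safe
        }
    }
}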

The point of LZNib is to see how much compression is possible while keeping the decode speed close to the fastest of any reasonable compressor (LZ4,snappy,etc).

Testing different values of config_divider_lrl :

name raw div=4 div=5 div=6 div=7 div=8 div=9 div=10 div=11 div=12 best    (div = config_divider_lrl)
lzt00 16914 5639 5638 5632 5636 5654 5671 5696 5728 5771 5632
lzt01 200000 199360 199354 199348 199345 199339 199333 199324 199319 199314 199314
lzt02 755121 244328 243844 250836 255146 255177 255257 257754 260107 260597 243844
lzt03 3471552 1746220 1744630 1743728 1743043 1742718 1742728 1743191 1744496 1746915 1742718
lzt04 48649 13932 13939 13968 14015 14120 14184 14268 14319 14507 13932
lzt05 927796 422058 421115 420746 420592 418289 418200 418639 418854 418082 418082
lzt06 563160 414925 414080 412748 412223 409673 408884 408361 408435 407393 407393
lzt07 500000 237756 237318 237004 236910 236771 236949 237381 238091 239176 236771
lzt08 355400 309397 309490 308579 307706 307263 306418 305689 305495 305793 305495
lzt09 786488 302834 303018 303773 304350 305222 307405 308888 310649 314647 302834
lzt10 154624 11799 11785 11792 11800 11821 11843 11866 11885 11923 11785
lzt11 58524 22420 22341 22249 22288 22276 22322 22370 22561 22581 22249
lzt12 164423 28901 28974 28900 28957 29053 29122 29296 29381 29545 28900
lzt13 1041576 1072275 1068614 1058545 1047273 1025641 1025616 1025520 1025404 1024891 1024891
lzt14 102400 52010 51755 51595 51462 51379 51314 51298 51302 51341 51298
lzt15 34664 11846 11795 11767 11760 11740 11740 11756 11831 11837 11740
lzt16 21504 11056 11000 10961 10934 10911 10904 10893 10883 10892 10883
lzt17 53161 20122 20119 20152 20210 20288 20424 20601 20834 21091 20119
lzt18 102400 77317 77307 77274 77045 77037 77020 77006 76964 76976 76964
lzt19 768771 306499 307120 308138 309635 311801 314857 318983 323981 329683 306499
lzt20 1179702 975546 974447 973507 972326 971521 972060 971614 971569 985009 971521
lzt21 679936 99059 99182 99385 99673 100013 100492 101018 101652 102387 99059
lzt22 400000 334796 334533 334357 334027 333860 333733 333543 333501 337864 333501
lzt23 1048576 1029556 1026539 1023978 1021833 1019900 1018124 1016552 1015139 1013815 1013815
lzt24 3471552 1711694 1710524 1708577 1706969 1696663 1695663 1694205 1692996 1688324 1688324
lzt25 1029744 224428 224423 224306 229365 229362 229368 229603 227083 227546 224306
lzt26 262144 240106 239633 239200 238864 238538 238232 237960 237738 237571 237571
lzt27 857241 323147 323098 323274 323133 322050 322068 322799 322182 322573 322050
lzt28 1591760 343555 345586 348549 350601 352455 354077 356025 360583 364438 343555
lzt29 3953035 1445657 1442589 1440996 1440794 1437132 1437593 1440565 1442614 1442914 1437132
lzt30 100000 100668 100660 100656 100655 100653 100651 100651 100643 100643 100643
total 24700817 12338906 12324450 12314520 12308570 12268320 12272252 12283315 12296219 12326039 12212820

The divider at 8 is the same as using a flag bit + 3 bit payload in the nibble.

There does seem to be some win to be had by transmitting divider in the file (but it's not huge). Adaptive divider seems like an interesting thing to explore also, but it will make the decoder slower. Obviously it would be nice to be able to find the optimal divider for each file without just running the compressor N times.


Some comparison to other compressors. I tried to run all compressors with settings for max compression and max decode speed.

I'm having a bit of trouble finding anything to compare LZNib to. Obviously the LZ4+Large Window (LZ4P) that I talked about before is a fair challenger. I thought CRUSH would be the one, but it does very poorly on the big files, indicating that it has a small window (?). If anyone knows of a strong byte-aligned LZ that has large/unbounded window, let me know so I have something to compare against.

(ADDENDUM : tor -c1 can do byte-aligned IO and large windows, but it's a really bad byte-aligned coder)

(obviously things like LZ4 and snappy are not fair challengers because their small window makes them much worse; also note that zlib of course has a small window but is the only one with entropy coding; also several of these are not truly "byte aligned"; they have byte-aligned literals but do matches and control words bit by bit, which is "cheating" in this test (if you're allowed to do variable bit IO you can do much better; hel