There is a subtle mistake in the Model + Encoder paradigm. Let me warm up to it. The modern coder is actually not a general Model + Encoder, because it is too slow to ask the model to supply probabilities for each new character to be encoded. Instead a modern coder is as follows : 1. a Model generates a "context" for the character to be encoded. All characters assigned to a given context shall be called a "context group". 2. the encoder uses a separate set of statistics for each context. In a two-pass scenario step two could be thought of as separating the characters into many separate arrays, one for each context group, and then coding each array to their order zero entropy. More realistically, step 2 is done by an adaptive arithmetic coder; this adaptive coder should not only learn the statistics of each context group, it can adapt to (temporal, if you consider a file as a time sequence) changing trends. The goal of the first step is to decorellate all the context groups, that is if G1 and G2 are different groups, the mutual information I(G1|G2) should be zero. This seems all well and good, but it is wrong. In practice, the arithmetic coder used on each context group is adapting over time, so its statistics are adapting over time. This means that the character of each context group is changing over time. Thus the correct context group to add a certain character to might change. Let us make this more precise. There are : contexts : Cn characters : c(t) already-filled context groups : Gn(t) the best-estimated statistics of a character : H(t) and the current statistics of the adaptive coder for group Gn : Hn(t) our goal in assigning a context Cn to the character c(t) is to choose the n such that | Hn(t) - H(t) | is minimized. This problem statement differs from the original one above in that I have now explicitly acknowledged that Hn changes with t : that is no static context-choosing scheme can ever be locally optimal (all known context-choosing schemes, such as DMC, PPM, CALIC, ECECOW, etc. are static; the only scheme which approaches non-static-ness is Paul Volf's Mixing Scheme or the CTW blending scheme. (my old PPMZ LOE is also somewehere in this family)). Now that we have recongized the problem, we can elaborate on it. We have implicit control over Hn(t) which is seldom used or recognized. Rather than imagining that Hn is constant and trying to group similar statistics, we can acknowledge that Hn(t) varies, and we can control this variation, (that is, we can do other things than just letting Hn(t) vary due to the adaptation of the model). A nice example of this is image coding. Consider lossless image coding (this all that matters, since lossy coding is just lossless on the post-quantized coefficients). We are looking at pixels in the nearby two-dimensional neighborhood to create the context. Once we have this context, Cn, which we have tried to design to correspond to Hn, we simply code the character with Hn(t). But Hn(t) is usually an adaptive arith-coder, which is gathering statistics in one-dimensional time : that is, in some arbitrary scan order! In fact when we made Cn from the 2d neighborhood we implicitly assumed that Hn(t) also was localized in the 2d neighborhood. This is usually not true and results in a mis-match of Cn(t) and Hn(t). When I refer to this sort of mismatch, I mean that in assymptotia there is some optimal matching : Cx and Hy are matched. There is a mismatch if at some finite time t Cx is best-fit to some Hn with n != y. That is, the relation of contexts and context-groups flip-flops at some finite time. LoPresto, Ramchandran, and Orchard (in "Image Coding based on Mixture Modeling..." the EQ paper from DCC 97) solved this problem in the most trivial way. The simply cut the (t) from Hn, and use constant probability distributions for each context : Cn(t) choose Hn. (in their case "n" corresponds to a standard-deviation of a generalized gaussian). This obviously eliminates the incongruity I mentioned above, but is also obviously not optimal, because it entirely surrends the possibility of coding with changing statistics. (using Hn(t) instead of Hn can compensate for a poor context-choosing algorithm).
Charles Bloom / cb at my domain
The free web counter says you are visitor number