(5-28-96) This section presumes a basic understanding of arithmetic encoding, or that you have read the Statistical Algorithms section.
This text will describe the ideas behind and methods of implementing modern modelling schemes.
The basic goal of modelling is to provide a probability estimate to the arithmetic encoder. You will recall that if an arithmetic encoder is given the ideal range sizes for the characters, it will encode them with maximum compression. Thus, we need a model which can provide these range sizes.
The first step is to put the data into its "natural state". This means that text, and most similar data, should be read in 8-bit bytes. Binary images should be read as bits. Full color images should be transformed into a Fourier or Wavelet basis. Sound should be decomposed into subbands. And so on - compression works best on the "natural state" of data. For this discussion, we will presume we are compressing "text-like" data (e.g. PostScript, source code, binaries, database files, etc.) and will deal with 8-bit bytes.
(1-1-98)
Let me talk briefly about why we need modelling. Let's say we have some arithmetic encoder. There are various possible inputs (e.g. bytes), call them 'n', and each occurs with a probability P(n) (in general, this is not enough to describe the "source", e.g. a file, a book, you, but it'll do for now). Now, as I've shown in my Kraft Inequality article, and perhaps convinced you if you've read the 'statistical' section, the best the arithmetic coder can do while remaining decodable is to assign a length to each entry n :
L(n) = - log[ Q(n) ]
where Sum[n] Q(n) = 1
and Q >= 0 for all n
(the first line here is just a definition of Q, the latter lines are the Kraft inequality - the decodability requirement). Hence, with a probability P(n) we write a length L(n) , so that the average output length is :
H(n) = Sum[n] P(n) L(n) =
- Sum[n] P(n) log[ Q(n) ]
Now, H is the average output length. (Strictly, this is the cross-entropy of P and Q; it equals the entropy of the source only when Q = P.) In compression, all we can do is change the Q's.
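To make the formula concrete, here is a minimal Python sketch that evaluates the average output length for two candidate models. The two-symbol source and its probabilities are made up for illustration, and the log is taken base 2 (so lengths are in bits), which the text above does not specify.

```python
import math

def average_code_length(p, q):
    """Average output length Sum[n] P(n) L(n), with L(n) = -log2(Q(n)).

    p holds the true probabilities P, q holds the model's Q.
    Base-2 logs are an assumption, so the result is in bits per symbol.
    """
    return sum(p[s] * -math.log2(q[s]) for s in p if p[s] > 0)

# Hypothetical two-symbol source, invented for this example.
p = {"a": 0.75, "b": 0.25}           # true probabilities P
q_uniform = {"a": 0.5, "b": 0.5}     # a model that knows nothing
q_matched = {"a": 0.75, "b": 0.25}   # a model equal to the source

print(average_code_length(p, q_uniform))   # 1.0 bit per symbol
print(average_code_length(p, q_matched))   # about 0.811 bits per symbol
```

The source P is the same in both calls; only the choice of Q changes the output length.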
The P's are the actual probabilities - the character of the source; for example, in text P('e') > P('x') (if you watch Wheel of Fortune you can rattle off P('rstlne') > P(else) ). We are free to choose the Q's; a choice of Q's is called a "model". If I just chose the Q's arbitrarily, it would be a static model, designed to compress a specific data set; nowadays we can adaptively & dynamically choose the Q's to match the source as we "learn" about the source.
Let's do an example: if someone is flipping a coin, they ask you to predict if it's heads (H) or tails (T). First, you would say H&T are equally likely. They flip H. You still say they're equally likely. And so on; but if they flip H twenty times in a row, you might start to get suspicious - maybe they have a weighted coin; you would start to guess H every time. You've adaptively adjusted your model!
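The coin example can be sketched as a tiny adaptive model. This is one simple way to do it, not the only one: the estimate is just a ratio of counts, and starting both counts at 1 is an assumed choice so that neither outcome ever gets probability zero. Unlike the cautious human above, this rule starts shifting after the very first flip.

```python
def adaptive_coin_estimates(flips):
    """Return the model's estimate of P(H) before each flip is seen.

    Counts start at 1 for each side (an assumption, so that no outcome
    is ever given probability 0).
    """
    counts = {"H": 1, "T": 1}
    estimates = []
    for flip in flips:
        estimates.append(counts["H"] / (counts["H"] + counts["T"]))
        counts[flip] += 1   # learn from the flip only after predicting it
    return estimates

estimates = adaptive_coin_estimates("H" * 20)
print(estimates[0])    # 0.5: no evidence yet, H and T equally likely
print(estimates[-1])   # 20/21, about 0.952, just before the twentieth flip
```

Note that the prediction for each flip uses only the flips before it, which is what lets a decoder rebuild the same estimates.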
I've noted in my article on entropy that the expression
H(n) = - Sum[n] P(n) log[ Q(n) ]
considered as a function of Q is minimized when Q = P (that is, when the model is identical to the source!). If you knew the exact character of the source, then you could design a perfect model (for that source!). However, we don't usually know the source beforehand, so we must design a scheme to adaptively choose the Q's.
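For readers who want the reasoning behind the Q = P claim, here is a sketch of the standard argument (Gibbs' inequality). The notation H(P,Q) for the average length under model Q is introduced here for convenience and is not from the text above; b > 1 is the base of the log, and the sums run over the n with P(n) > 0. The only fact used is ln x <= x - 1.

```latex
\begin{align*}
H(P,Q) - H(P,P)
  &= -\sum_n P(n)\log_b Q(n) + \sum_n P(n)\log_b P(n) \\
  &= -\sum_n P(n)\log_b\frac{Q(n)}{P(n)} \\
  &\ge -\frac{1}{\ln b}\sum_n P(n)\left(\frac{Q(n)}{P(n)} - 1\right) \\
  &= \frac{1}{\ln b}\left(1 - \sum_n Q(n)\right) \;\ge\; 0 .
\end{align*}
```

The last step uses Sum[n] Q(n) <= 1, and equality throughout requires Q(n) = P(n) for every n, so no model can beat Q = P on average.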
Let me rephrase this another way before we jump into the nitty gritty:
"You cannot say anything about a fully compressed string"
For example, I can say a lot about 'English Text' :
'u' tends to follow 'q' (as 'qu')
vowels follow 'th' and 'ch' and 'sh'
'on' tends to follow 'ti'
etc..
In fact, the whole idea of grammar, spelling, etc. means that we can 'say' a lot about English text. Similarly for an image (GIF, JPEG) file; the fact that the reader can decode it means there is a certain structure; for example a tag 'JPEG' or 'GIF87a' occurs explicitly in the file.
On the other hand, a fully compressed string has no structure (note that zip files and the like have headers with structure; that doesn't count!). That is, any one byte looks like any other. If you hand me two bytes I can't tell you which came first, or even if they could be adjacent (unlike 'q' and 'u').
Hence, modelling can be thought of as 'removing structure', or preventing your friends from gossiping about your data files, because there's nothing to say about a compressed file!
The "Model" is defined thusly:
Thus, we can define a simple coder with this flow chart :
And the decoder is nearly identical, all we do is flip a couple of arrows:
Let me define the terms here in case there's any confusion:
The basic entity in a Finite Context Model (FCM) is the 'context'.
An "order-N context" is the set of N preceding characters (bytes,
bits, whatever) before the current one to be coded. When we say
'context' we mean any order of context, or an infinite order
context (all the preceding characters). It is key to note that
the context uses only preceding characters, hence "already
transmitted" characters (this is conceptual; they need not have
been actually transmitted) so that the encoder and decoder can
both see them to construct the same models.
BTW, FCM's are sometimes called 'Markov Models'.
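As a small illustration of the definition, the following Python sketch pulls the order-N context out of the already-coded data. The function name and the sample string are hypothetical, chosen only for this example.

```python
def order_n_context(history, n):
    """Return the n symbols immediately before the current coding position,
    or None if fewer than n symbols have been seen so far.

    `history` is everything already coded, so the decoder has it too.
    """
    if n > len(history):
        return None
    return history[len(history) - n:]

history = "hello"                     # hypothetical data coded so far
print(order_n_context(history, 0))    # '' (order-0: no context)
print(order_n_context(history, 1))    # 'o'
print(order_n_context(history, 2))    # 'lo'
print(order_n_context(history, 6))    # None: not enough data yet
```

Because the function reads only `history`, the encoder and decoder compute the same context at every step.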
The basic idea is that we use not only the "order-0" (no context) model, but we also look at the context. The context has predictive powers (recall again my favorite example: the order-1 context 'q' strongly predicts 'u' ; if you did not look at the context, then you would think 'u' was very unlikely; however, the order-4 context 'Iraq' does not predict 'u' , even though its order-1 sub-context does).
The conceptual FCM is simply a list of every single context that has ever occurred, and all the characters that have followed them. Hence, if the data seen so far is :
"abcab" $
where '$' indicates the current coding position, then all the contexts and their predictions are:
a : b
b : c    ab : c
c : a    bc : a    abc : a
a : b    ca : b    bca : b    abca : b
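The conceptual table can be built mechanically. The sketch below is a deliberately naive Python version (its memory use grows quadratically with the data, so it is for illustration only, not a practical data structure); the function name is hypothetical.

```python
from collections import defaultdict

def build_context_table(data):
    """Map every context that has occurred (of every order >= 1) to the
    list of characters that followed it, in order of occurrence."""
    table = defaultdict(list)
    for pos in range(1, len(data)):
        # data[start:pos] is a context ending just before position pos;
        # a larger start gives a lower-order context.
        for start in range(pos):
            table[data[start:pos]].append(data[pos])
    return dict(table)

table = build_context_table("abcab")
# Print shortest contexts first, alphabetically within each order.
for context in sorted(table, key=lambda c: (len(c), c)):
    print(context, ":", " ".join(table[context]))
```

For "abcab" this yields nine distinct contexts; 'a' appears once in the table with two recorded followers ('b' both times), since it occurred twice in the data.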
So, now we have 'b' as our order-1 context. Looking in our table, we see that 'b' predicts 'c', so we would guess that 'c' would follow. We need some scheme to actually assign the probabilities; we might say P('c') = 0.9 and spread the remaining 0.1 among the other characters. (We must always give a nonzero probability to every character, regardless of how unlikely we might think it is, since if it does occur we must be able to code it with a finite length; recall L(c) = -log[Q(c)], where Q is the probability predicted by the model.)
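Here is a minimal Python sketch of that probability assignment, using the 0.9 figure from the text. Splitting the remaining 0.1 evenly over the other 255 byte values is an assumption made for simplicity, as is the use of base-2 logs; a real model would do something smarter.

```python
import math

def predict(predicted_byte, confidence=0.9, alphabet_size=256):
    """Give `confidence` to the predicted byte and split the rest evenly
    (an assumed rule), so every byte keeps a nonzero probability."""
    other = (1.0 - confidence) / (alphabet_size - 1)
    q = [other] * alphabet_size
    q[predicted_byte] = confidence
    return q

q = predict(ord("c"))
print(-math.log2(q[ord("c")]))   # about 0.152 bits if 'c' does occur
print(-math.log2(q[ord("x")]))   # about 11.3 bits for any other byte
```

A correct guess costs a small fraction of a bit, while a wrong one costs more than the 8 bits of the raw byte, which is why the quality of the prediction matters so much.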
As I stressed above, the role of the model is to mimic the true source; FCM's work so well because most sources are very nearly FCM sources.
NOT FINISHED! UNDER CONSTRUCTION!
Charles Bloom / cb at my domain