There are no magic compressors that can pack any data!!! I shall now go through a semi-technical proof of the fact that NO compressor can compress all data.
First of all, we must understand that all compressors of an input file S must work statistically. Really, this isn't saying anything at all, or making any incorrect assumptions - it is just a mode of interpretation (kind of like the various interpretations of quantum mechanics - the results are always the same under the mathematic framework, you just have some freedom in choosing your philosophical viewpoint). Let me describe what I mean. Each file S, we will assume is a string of characters in an array, terminated by an EOF character of value 0x100 . Imagine we "know" the probability of each file appearing; Let us call this P(S) , then all we do to compress a file is to assign a number to each S, and we write that number in -log2(P(S)) bits. It is known that you cannot do better than this for a given P(S) (see Shannon). Really, the only this the statistical view is to consider the probabilities P(S) of each code instead of the length L(S) of each code ; by the relation L(S) = -log2(P(S)) , these views are equal by a transform (kind of like Matrix Mechanics vs. Wave Mechanics vs. q-number algebra ; I'm just full of quantum mechanical analogies today).see Shannon's derivation of L(S) = -log2(P(S))
But we can generalize this, for those of you who say "you don't have to use a probability compressor!" - let us just assign a code to each possible input S. These codes must be prefix codes in order for them to be uniquely decodable - i.e. we do not want two codes to be confused by the decoder. Let us call the length of the code for a certain input L(S) ; now all we do is write this code with this length. However, we cannot just choose our lengths arbitrarily - obviously, if S is 2 bytes long, then I could not write 1 bit for each code in S - some of the possible values of S would be unspecifiable! i.e. S code --------- 00 0 01 1 10 00 11 01 IS NOT OK - because the decoder doesn't know how long code is, so when it sees '00' it doesn't know if that means S = 00 or S = 10 ; if I want to use that code set I must also write out code lengths, which make all the codes longer. It turns out there is an easy way to specify this "decodability" requirement. It is called the Kraft Inequality. I have written about it before in :Kraft Requirement on Prefix Codes and Power of the Kraft Inequality
However, there is an equally good way of saying it, more simply: For each L(S) let me define a number P(S) which I have not yet given a name or meaning, and let me define it by the relation: L(S) = -log2( P(S) ) or P(S) = 2^( - L(S) ) Now, the decodability requirement is : Sum[ all S ] P(S) <= 1 And the average output length is minimized when this Sum is at its maximum, i.e. Sum[ all S ] P(S) = 1 Furthermore, it is a property of the definition of P(S) that 0 < P(S) < 1 ( with L(S) > 0 ) Now these properties look just like the properties of a probability, so I will call P(S) a probability, since it acts just like one, whether it really is one or not.
Thus, we come back to the assertion : however I compress the stream S, I MUST assign each S an output length L(S) ; these lengths are NOT arbitrary, but must satisfy the Kraft inequality to be decodable. This requirment is the same as saying that each L(S) corresponds to a P(S), which acts like a probability, and is constrained by normal probability rules. Thus, finding the best L(S) set is identical to finding the best P(S) set, and coding with L(S) = -log2( P(S) ). NOTE that I did not say ANYTHING about our coder being a statistical coder - all my conclusions simply came from the fact that it was a coder which wrote a length L(S) for each input S, and that its output was decodable. Thus these conclusions apply to ANY compressor - including one that works by mathematical tricks like factoring !!!!!
Now, we can say some more very nice things given this fact :
1. Any statistical coder (PPM,LZ77,Zip,HA,ACB,etc. etc.) is
the same as the simple P(S) coder I have described, except
that instead of keeping a massive offline database of every
P(S) , they use tricks to adaptively figure out P(S) based
one what they expect a "reasonable" file to look like.
In general, a coder which works character by character is
just a subset of the more general whole-file-at-once coder
2. Because we know we are constrained by the Kraft Inequality,
my proof from Power of the Kraft Inequality
Holds on our L(S) set : therefore if you compress one possible
input file by a factor F, you MUST expand another file by
a factor greater than F - this is a simple mathematical consequence
of the Kraft Inequality.
3. Let's take a look at what the "average" output length will be.
Imagine that each input S occurs with a probability Q(S) , exactly.
Note that Q(S) not necessarily = P(S) , because P(S) isn't yet
a probability, it is just defined to be P(S) = 2^( - L(S) ) , from
the output lengths of our general coder. Now, the output length
L(S) will occur with probability Q(S), so the average output length
L(avg) is:
L(avg) = Sum[ all S ] Q(S) * L(S)
but L(S) = - log2( P(S) ) so:
L(avg) = - Sum[ all S ] Q(S) * log2( P(S) )
It is shown by Shannon and
others
(and is easy to prove, but requires some math)
that L(avg) is minimized when P(S) = Q(S)
Thus let us define H = min[ L(avg) ] = - Sum[ all S ] P(S) * log2( P(S) )
Now we clearly see that in the best case, P(S) is indeed the probability
of S appearing, and that the Entropy H is equal to the minimum possible
average output length. Thus when people say that you cannot code
better than the entropy, IT IS TRUE , and it DOES apply to ANY coder!!!
4. Let me attempt a short proof of H = min[ L(avg) ] ;
I will use a two-symbol (binary) alphabet for S; S = 0 or 1 ; Any
larger alphabet of S can be trivially constructed by recursive
application of a binary coder to a binary tree.
P(0) = a
P(1) = 1-a
Q(0) = b
Q(1) = 1-b
L(avg) = - Sum[ all S ] Q(S) * log2( P(S) )
= - [ b * log2( a ) + ( 1-b ) * log2( 1-a ) ]
Now, let us find the value of a that minimizes this L(avg)
It is determined by :
d/da [ L(avg) ] = 0
L(avg) = - ( 1 / ln(2) ) * [ b * ln(a) + (1-b) * ln(1-a) ]
d/da [ L(avg) ] = - ( 1 / ln(2) ) * [ b * (1/a) + (1-b) * (-1/(1-a)) ]
= 0
b/a - (1-b)/(1-a) = 0
b/a = (1-b)/(1-a)
b/(1-b) = a/(1-a)
b = a
-> L(avg) minimized when P(0) = Q(0)
proves it for the binary case, therefore for the general case.
Charles Bloom / cb at my domain
The free web counter
says you are visitor number