Compression : Algorithms : Statistical Coders



Introduction

Statistical coders are the backbone of compression. We first require a simple model for a compressor:

( RAW ) INPUT -> COMPRESSOR -> OUTPUT ( TRANSMISSION )

( TRANSMISSION ) INPUT -> DECOMPRESSOR -> OUTPUT ( RAW )

The basic idea is : the compressor guesses what the input will be, and writes fewer bits to the output if it guessed correctly. Note that the decompressor MUST be able to make the same guess as the compressor, in order to correctly decode the transmission, since the transmission is different depending on the prediction.

Currently, however, we are only concerned with the "statistical coder". This is an algorithm which encodes or decodes a character in a number of bits approximately equal to -log2(predicted probability). Note that it may round the number of bits up or encode with a few extra - as long as it is trying to reach this goal, it is a statistical coder.

For example, if the RAW data can be one of four characters : a,b,c, or d, and the statistical coder is told (by the rest of the compressor) to expect that a occurs 50% of the time, b occurs 25% of the time, c occurs 25% of the time, and d occurs almost never, then we would want to code as follows:

1 bit for a (-log2(.5) = 1)
2 bits for b & c (-log2(.25) = 2)

for example:

a : 0
b : 10
c : 11

note that there are many other possibilities (i.e. {1,01,00},{1,00,01},{0,11,10}) - all that matters is that the length of the codes be as good as possible

You may think of "RAW" IO as a statistical coder which predicts each symbol with equal probability. Thus, there are 256 values for a byte, and each is given an equal probability, p = 1/256, so -log2(1/256) = 8 bits.

In typical text (like this) the SPACE character and 'e' occur very frequently, so they should be written in few bits (3) while 'q' occurs rarely, so it should be written in more bits (12).

Another way to look at the goal of a statistical coder is this: imagine that the input has only two symbols A and B (this is perfectly general, since any byte can be decomposed into its bits). Let p(A) be the actual probability of A. Because there are only two symbols, p(A) + p(B) = 1, so p(B) = 1 - p(A). Let L(A) = # of bits used to write A, and L(B) = # of bits used to write B (note that these are NOT necessarily integers). Thus, our goal is to minimize the function

L = L(A)*p(A) + L(B)*p(B)

p(A) & p(B) are given, and L(A) & L(B) are constrained by the fact that the code must be decodable (i.e. L(A)=L(B)=0 is NOT valid, neither is L(A)=L(B)=0.5)

It can be shown that the values which minimize L are:

L(A) = -log2(p(A))
L(B) = -log2(p(B)) = -log2(1 - p(A))

L = -log2(p(A))*p(A) -log2(1 - p(A))*(1 - p(A))

Thus L depends only on p(A) (or p(B) if you prefer).
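
As an aside, here is a tiny C function (a sketch added purely for illustration, not part of the original argument) which evaluates this L as a function of p(A) :

#include <math.h>

/* expected bits per symbol for a two-symbol source :
   L(p) = -p*log2(p) - (1-p)*log2(1-p) , for 0 < p < 1
   (this is just the formula above) */
double ExpectedBitsPerSymbol ( double p )
{
return - p * log2(p) - (1.0 - p) * log2(1.0 - p);
}

ExpectedBitsPerSymbol(0.5) gives 1.0 (its maximum), and ExpectedBitsPerSymbol(0.25) gives about 0.811 - the more lopsided the probabilities, the fewer bits are needed.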

You may wonder : how can L be a non-integer? Well, one answer is : "arithmetic encoding", which we will talk about later. Another, easier answer is : "blocking".

Blocking is when you combine two adjacent symbols together in order to allow fractional code lengths. For example, let A and B be the only input symbols (binary input). If the coding set is:

AA : 0
AB : 10
BA : 110
BB : 111

Then, A is written in (about) 0.7 bits, and B is written in about 1.4 bits.

It has been shown that if N symbols are blocked together, the coding inefficiency due to integer code lengths is bounded by :

(Extra length per symbol) <= 1/N bits

Ok. This is the end of the intro to statistical coding. Following this we go through Shannon-Fano coding, Huffman coding, and then Arithmetic coding.

ADDENDUM :
The goal of statistical coding is to code symbols with their Shannon-order0-entropy. Please see my news postings on entropy.


Coding Algorithms

Remember my example:

a : 0
b : 10
c : 11

You may ask how such a code was arrived at. It is easy to do by hand for a small alphabet, but how can we do it for a large alphabet? Obviously, we want a general-purpose iterative or recursive algorithm.

This process is required for all integer-length prefix-codes.

A prefix code (or prefix-free code) is one in which no code is a prefix of another code. This is required for simple decoding. For example:

a : 0
b : 01
c : 10

is NOT a prefix-free code. If the decoder saw the string "0010" it wouldn't know if this was "a,b,a" or "a,a,c" .

Another example of a NON prefix-code :

a : 0
b : 01
c : 11

the string "00111" MUST be "a,b,c" however, the decoder cannot tell right away if begins as "a,a" or "a,b" - i.e. one code cannot be decoded until after later ones are examined. This kind of decoding is to be avoided.

a : 0
b : 10
c : 11

IS a prefix-code. No string is ambiguous. "0010" means "a,a,b" and "01011" means "a,b,c" and the decoder can tell this right away.

Now we wish to create one of these prefix-codes. How do we do it? We break the process into two steps:

1. Determine the lengths of the codes. If Li is the length of the ith code, let Ei = Li - (-log2(Pi)), where Pi is the probability of the ith symbol. Ei is the extra length written for each symbol, i.e. the length which is written but is not necessary. We would like to pick the lengths so as to minimize the sum of Pi*Ei. This should make sense because Pi is the occurrence rate of i, and Ei is the excess length, so the sum of Pi*Ei is the amount by which the total output is longer than needed.

2. Create a prefix-code for those code lengths.

Step 1 is accomplished with Shannon-Fano coding, or Huffman coding. These are discussed later. As far as Step 2 is concerned, these lengths could be generated in any way.

Kraft Inequality

It turns out that a prefix-code with lengths Li exists IF AND ONLY IF:

K = Sum( i = 0 to N ) of 2^(-Li) <= 1

This is the Kraft inequality. Proof of this statement is beyond the scope of this paper. Note, though, that Li = -log2(Pi) + Ei

thus 2^(-Li) = 2^(log2(Pi) - Ei) = 2^(log2(Pi)) * 2^(-Ei) = Pi / 2^(Ei)

Thus K = Sum( i = 0 to N ) of Pi / 2^(Ei)

Imagine the simple case where Ei = E for all i

Thus K = Sum( i = 0 to N ) of Pi / 2^(E) = [ Sum( i = 0 to N ) of Pi ] / 2^E

but [ Sum( i = 0 to N ) of Pi ] = 1

Thus K = 1 / 2^E

So the Kraft inequality says 1/2^E <= 1 . Since 1 = 2^0 , this means :

2^(-E) <= 2^(0)

-E <= 0

E >= 0

The result is : the Kraft inequality says that a code is decodable IF AND ONLY IF the code lengths are greater than or equal to those determined by the entropy.

This should make good sense.
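
As a quick illustration (a sketch with hypothetical names, not from the original), here is how one might test a proposed set of integer code lengths against the Kraft inequality :

#include <math.h>

/* returns 1 if a prefix-code with these integer lengths can exist,
   i.e. if the sum of 2^(-Li) is <= 1 (the Kraft inequality) */
int KraftOk ( const int * lengths, int numSymbols )
{
double K = 0.0;
int i;

for ( i = 0; i < numSymbols; i++ )
  K += ldexp( 1.0 , - lengths[i] );   /* ldexp(1,-L) = 2^(-L) */

return ( K <= 1.0 );
}

For the lengths {1,2,2} of the earlier {a,b,c} example, K = 1/2 + 1/4 + 1/4 = 1, so a prefix-code exists (and none of the coding space is wasted).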

Generating a Prefix-Free Code

Two methods will be presented here. One is simple and intuitive, one is practical and cutting edge.

The first is very simple. Just create a binary tree of symbols, such that the depth in the tree is the code length.

For example :
L(a) = 1
L(b) = 2
L(c) = 2

Then we make the tree:

  
      a
    /
root      b
    \   /
      X
        \ 
          c

Now traverse the tree. Each up-branch is a one bit. Each down branch is a zero bit. The creation of this tree is a relatively simple algorithm.

The second method is fast and creates an alphabetically-ordered code. I shan't discuss it here, but you may look at my Huffman Source code to check it out.
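
For concreteness, here is a sketch of one direct way to turn code lengths into actual prefix codes (a canonical-style assignment, given only as an illustration; it is not necessarily the method used in the source code mentioned above) :

/* Assign prefix codes from code lengths.
   Assumes the symbols are already sorted by nondecreasing code length,
   and that the lengths satisfy the Kraft inequality.
   codes[i] receives the code for symbol i in its low lengths[i] bits. */
void AssignPrefixCodes ( const int * lengths, unsigned int * codes, int numSymbols )
{
unsigned int code = 0;
int i;

if ( numSymbols <= 0 )
  return;

codes[0] = 0;
for ( i = 1; i < numSymbols; i++ )
  {
  /* step to the next code value, then extend it to the new length */
  code = ( code + 1 ) << ( lengths[i] - lengths[i-1] );
  codes[i] = code;
  }
}

For the lengths {1,2,2} this produces the codes 0, 10, 11 - the same code as the example at the top.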


Shannon-Fano Coding

Shannon-Fano coding is a method for creating the code lengths of a prefix-code.

The basic idea behind Huffman & S-F coding is :
because each BIT of the output reserves the same amount of space, a 0 and a 1 should be equally probable at each bit position.

Let's think of a code as a binary tree as described above. The above statement is the same as saying that at each node of the tree, the above branch and the below branch should have the same probability of occurring.

For example, in a normal 8-bit byte, we have a totally symmetric tree. Some of the branches are taken often, and some are taken almost never. For example, in text, the upper branch from the root is almost never taken (i.e. ASCII characters don't go over 127).

Thus, all code-length generating algorithms are a way of spreading characters over a binary tree such that the probabilities are equal across each branch.

Imagine that we know an occurrence count for each symbol : i.e. if the input string is (aabbaac) then the occurrence counts are:
C(a) = 4
C(b) = 2
C(c) = 1

Then we want the tree:

  
      a
     /
    4
   /
root      b
   \     /
    3   2
     \ /
      X
       \ 
        1
         \
          c

The number on each branch indicates the sum of frequency counts in the "subtree" : i.e. the tree with that node as its root.

We can see that across each branch, the counts are NOT equal, though they ARE as close to equal as is possible. This means that an integer-length code for this string is sub-optimal. Whenever the counts are unequal, then the code is not as good as the entropy.

How do we make a tree like this? It's easy.

Take all the characters, each has a count associated with it, and put them in nodes which are linked to nothing. Call this set S0. Let C(S0) = the sum of all counts in S0.

Take the node with the highest count, and put it into the set S1. Now C(S0) is lower.

Keep doing this until C(S1) >= C(S0). Now we have split the data thusly:

	
      [S1]
     /
    C(S1)
   /
root
   \
    C(S0)
     \
      [S0]

Now, take [S1] as your starting point, and split it into [S2] and [S1]. Do the same for [S0] , making [S3] and [S0] . Keep doing this until a set has only one member, at which point it is a leaf node.

I will do an example on the set:

C(a) = 4
C(b) = 2
C(c) = 1
C(d) = 1

S0 = {a,b,c,d} , C(S0) = 8

pull out a

S1 = {a} C(S1) = 4 ; S0 = {b,c,d} C(S0) = 4

counts are equal, stop building S1

S1 has only one member, stop working on S1

S0 = {b,c,d} C(S0) = 4

pull out b to make S2

S2 = {b} C(S2) = 2 ; S0 = {c,d} C(S0) = 2

counts are equal, stop building S2

S2 has only one member, stop working on S2

S0 = {c,d} C(S0) = 2

let S3 = {c} , and S0 = {d} , and we're done.

  
      a
     /
    4
   /
root      b
   \     /
    4   2
     \ /
      X       c
       \     /
        2   1
         \ /
          X
           \
            1
             \
              d
Which is the ideal code for this set.

This is called Shannon-Fano coding.
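
Here is a recursive sketch of that splitting procedure (hypothetical names and simplified bookkeeping; it assumes the symbols are already sorted by decreasing count) :

/* Shannon-Fano code lengths (a sketch).
   counts[first..last] must be sorted in decreasing order;
   lengths[] must be zeroed by the caller before the top-level call. */
void ShannonFano ( const int * counts, int * lengths, int first, int last )
{
int total = 0, upper, split, i;

if ( first >= last )     /* a single symbol is a leaf : stop */
  return;

for ( i = first; i <= last; i++ )
  {
  total += counts[i];
  lengths[i] ++;         /* this group is about to be split : one more bit each */
  }

/* move the highest-count symbols into the upper group until
   its count reaches (at least) that of the lower group */
split = first;
upper = counts[first];
while ( upper < total - upper && split + 1 < last )
  {
  split ++;
  upper += counts[split];
  }

ShannonFano ( counts, lengths, first, split );       /* upper group */
ShannonFano ( counts, lengths, split + 1, last );    /* lower group */
}

For the example counts {4,2,1,1} this gives the lengths {1,2,3,3}, matching the tree drawn above.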


Huffman Coding

It was found that Shannon-Fano coding is slightly sub-optimal in some cases.

Huffman coding, on the other hand, is the best which can be had with integer-length codes.

Here it is, roughly : instead of a forward pass like Shannon-Fano, a backwards pass is made. In Shannon-Fano we created a tree by starting with all the characters at the root, then breaking them into two groups and recursing down. In Huffman, we start with all the characters "Off of the tree" , on the "right" hand side - i.e. past the leaves. We then grab some characters to become the right-most leaves. Then we grab some more and make the parent nodes, and find our way back to root. However, in practice it's done slightly differently :

Take a list of all your characters, put them in leaf nodes, and give each node a count equal to the count of the character in it. These nodes currently have no parents or children (they never will have children). Now forget that these are characters, and just call them nodes with counts. Sort them so that the ones with the lowest counts are first. Now, grab the two lowest-count nodes, make a new node as the parent of those two, and set the count of the new node equal to the sum of its children's counts. (you see how we are building the tree backwards, from the leaves up). Now, add this new node to the sorted list of nodes, in its correct sorted position. The two nodes which were made into children are no longer in the list (i.e. the list gets shorter by one node). Repeat until the list has only one member - this is the root!

Now an example, one we already did with Shannon-Fano:

[a] : 4
[b] : 2
[c] : 1
[d] : 1

I'll use [] to indicate a node, so [[x],[[y],[z]]] is a node with two children, one is a leaf [x] , and one has two children of its own, [y] and [z]. [x] : N means the count of that node is N.

Initialize the coding: turn the chars into leaf nodes & sort :

[c] : 1
[d] : 1
[b] : 2
[a] : 4

Grab two lowest : [c] & [d] , and make a new node : [[c],[d]] : 2
put it back in sorted order :

[b] : 2
[[c],[d]] : 2
[a] : 4

grab the two lowest : [b] & [[c],[d]] , and make a new node : [[b],[[c],[d]]] : 4
put it back in sorted order :

[a] : 4
[[b],[[c],[d]]] : 4

Now we're obviously done : just grab the lowest two and make a new node:

[[a],[[b],[[c],[d]]]] : 8

There's only one node left, it's the root.

Now I'll draw the tree so you can see what this notation means:

  
      a
     /
    4
   /
root      b
   \     /
    4   2
     \ /
      X       c
       \     /
        2   1
         \ /
          X
           \
            1
             \
              d
The same as Shannon-Fano coding created. (this is an easy one because the probabilities are inverse powers of two)
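
Here is a sketch of that list-based procedure (hypothetical names; an O(N^2) search is used to find the two lowest nodes, where a real implementation would keep a heap or a sorted list) :

#define MAX_SYMBOLS 256

/* Huffman code lengths (a sketch, assumes numSymbols <= MAX_SYMBOLS) :
   repeatedly merge the two lowest-count parentless nodes, then read each
   leaf's depth off the parent links. */
void HuffmanLengths ( const int * counts, int * lengths, int numSymbols )
{
int count [ 2*MAX_SYMBOLS ];
int parent[ 2*MAX_SYMBOLS ];
int numNodes = numSymbols;
int i, n, lo1, lo2;

for ( i = 0; i < numSymbols; i++ )
  { count[i] = counts[i]; parent[i] = -1; }

for (;;)
  {
  /* find the two parentless nodes with the lowest counts */
  lo1 = lo2 = -1;
  for ( i = 0; i < numNodes; i++ )
    {
    if ( parent[i] != -1 ) continue;
    if      ( lo1 == -1 || count[i] < count[lo1] ) { lo2 = lo1; lo1 = i; }
    else if ( lo2 == -1 || count[i] < count[lo2] ) { lo2 = i; }
    }
  if ( lo2 == -1 ) break;      /* only one parentless node left : the root */

  /* make a new node as the parent of the two, with the sum of their counts */
  n = numNodes ++;
  count [n] = count[lo1] + count[lo2];
  parent[n] = -1;
  parent[lo1] = parent[lo2] = n;
  }

/* a leaf's code length is its depth : the number of parent links up to the root */
for ( i = 0; i < numSymbols; i++ )
  {
  int len = 0;
  for ( n = i; parent[n] != -1; n = parent[n] )
    len ++;
  lengths[i] = len;
  }
}

For the counts {4,2,1,1} this gives lengths {1,2,3,3}, the same code lengths as the tree above.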


Theoretical Arithmetic Coding (Introduction)

Ok, here we go. Put your wading boots on.

Arithmetic Coding (arithcoding) is a method of writing a code in a non-integer length. This allows us to code very close to the ideal entropy.

Recall that blocking characters together can result in coding of non-integer lengths. For example :


p(A) = 2/3
p(B) = 1/3

then the code :
0 = AA
10 = AB
11 = B

Writes A & B with close to the ideal code lengths.
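
To put numbers on that (a small side calculation, not from the original text) : with p(A) = 2/3 this code averages 14/15 = 0.933 bits per symbol, against an entropy of about 0.918 bits per symbol :

#include <stdio.h>
#include <math.h>

/* average bits/symbol of the code { AA->0 , AB->10 , B->11 } at p(A) = 2/3,
   compared with the entropy. an illustration only. */
int main ( void )
{
double pA = 2.0/3.0, pB = 1.0/3.0;
double bitsPerEvent    = pA*pA*1 + pA*pB*2 + pB*2;   /* events AA , AB , B */
double symbolsPerEvent = pA*pA*2 + pA*pB*2 + pB*1;
double entropy = - pA*log2(pA) - pB*log2(pB);

printf( "bits/symbol = %f , entropy = %f\n", bitsPerEvent/symbolsPerEvent, entropy );
return 0;
}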

In practice, the blocking required to get optimal behavior may be arbitrarily large. Also, ideal-blocking (like above) is a difficult problem to solve in real-time. Thus we need arithcoding.

Arithcoding is based on this principle : Real Numbers are very similar to Data Files. Let's think about this and understand it. If you have a real number, specified up to N digits, then there are still an infinite number of ways to extend off of it. Similarly for a file. Also, for a real number of N digits, there are exponentially many (base^N) numbers with this many digits. Similarly for files. By analogy, we can think of a file as a real number.

For example, if we know the length of the input, L (in bits), there are 2^L possible inputs. We can think of each input file as a binary number (that's what it is). For example : the input 0101100 is equivalent to the binary number 0101100 (duh!). But, we want the encoder to be length-independent, so we want to use the extensibility property of real numbers, as mentioned above. So instead of thinking the binary number as an integer, we will think of it as the binary fraction 0.0101100 - i.e. a decimal point is added to the far left. Now we can extend the file : to 0.01011001 or 0.01011000 without impinging on other real numbers : i.e. each number must uniquely specify one file.

Unlike mathematical real numbers, we get to know how many digits of precision are in the number, i.e. 0.00 is not the same as 0.000 because the number of digits is different; they specify different files.

Time for a diagram :

-------------------------------------------------------- 8/8

                    1                               111
                    +----------------------------------- 7/8
                    0                               110
            1
            +------------------------------------------- 6/8
            0
                    1                               101
                    +----------------------------------- 5/8
                    0                               100
   1
   +---------------------------------------------------- 4/8
   0
                    1                               011
                    +----------------------------------- 3/8
                    0                               010
            1
            +------------------------------------------- 2/8
            0
                    1                               001
                    +----------------------------------- 1/8
                    0                               000

-------------------------------------------------------- 0/8

Let us look at this from left to right. We start with an interval at the far left, between the two farthest dotted lines. We proceed to the first '+' mark. At this point we input a bit from the file. This tells us whether to go up or down. Now, we cut the interval in half, and go to one side. We keep doing this at each plus. Once we have done this 3 times, we go all the way to the right. We are now in an interval which corresponds to the 3 bits inputted. The fractions at the far right are equivalent to the inputted data. Each data triplet corresponds to the fraction below it.

It is easy to confirm that these fractions are the same as the binary-fraction of the inputted data:

----- 8/8
0.111
----- 7/8
0.110
----- 6/8
0.101
----- 5/8
0.100
----- 4/8
0.011
----- 3/8
0.010
----- 2/8
0.001
----- 1/8
0.000
----- 0/8

It is clear that this is equivalent to the far right of the above diagram. Again, each binary fraction corresponds to the fraction below it. We write it this way, because each binary fraction corresponds to the interval between the fractions above and below it.

For example, 010 corresponds to the interval ( 2/8 , 3/8 ).

This is convenient, because any extension of 010 will also be in the interval ( 2/8 , 3/8 ) : for example, 0100 is in the interval ( 4/16 , 5/16 ) and 0101 is in the interval ( 5/16 , 6/16 ) : both of which are in ( 2/8 , 3/8 ).
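
A tiny helper (hypothetical, just to make the correspondence concrete) which maps a bit string to the interval that all of its extensions must fall inside :

/* map a bit string (e.g. "010") to the interval [low, low + 1/2^n)
   that every extension of it falls inside. */
void BitsToInterval ( const char * bits, double * low, double * high )
{
double l = 0.0, size = 1.0;
int i;

for ( i = 0; bits[i]; i++ )
  {
  size *= 0.5;             /* each bit halves the interval ...       */
  if ( bits[i] == '1' )
    l += size;             /* ... and a one bit keeps the upper half */
  }

*low  = l;
*high = l + size;
}

BitsToInterval("010",...) returns ( 2/8 , 3/8 ), as in the example above.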

Ok, now we have considered the input as a binary fraction, let's think about how we would compress it.

It's really quite simple :

1. first, we must think of how we will write out an interval. This is done by writing out ANY value inside that interval. We arbitrarily decide that the bottom edge of the interval will be inside that interval, and the upper edge will be in the next. Thus, to indicate 0.000 we could write out 0/8 or 1/16 or .000000001 . We always choose the shortest way, and write it as a binary fraction, since computer data is in binary, not rational numbers.

2. Second, we must realize that larger intervals are written in fewer bits. For example, if A gives the interval ( 0 , 1/2 ) and B gives the interval ( 1/2 , 3/4 ) and C gives the interval ( 3/4 , 1 ) then we can specify these intervals as :

A : 0.0  - 0.1
B : 0.10 - 0.11
C : 0.11 - 1.00
and we can write them as the shortest value inside the interval :
A : 0.0  : 0
B : 0.10 : 10
C : 0.11 : 11
Don't worry that 1.00 is invalid (remember that we always use 0.X to indicate file X) because the top of the interval is unused.

In fact, you can intuitively see that if the interval is of size R, it can be specified in approximately log2((1/R)) bits. Note that this could be a non-integer, and that's fine.

Just for practice, let's look at the intervals for the code set, (A,B,C) above :

------------------------------------------------- 1
               C                               CC
               ---------------------------------- 15/16
               B                               CB
               ---------------------------------- 7/8
    
    
    C          A                               CA
    --------------------------------------------- 3/4
               C                               BC
               ---------------------------------- 11/16
               B                               BB
               ---------------------------------- 5/8
    
    
    B          A                               BA
    --------------------------------------------- 1/2
    
               C                               AC
                ---------------------------------- 3/8
    
               B                               AB
               ---------------------------------- 1/4
    
    
    
    
    A          A                               AA
------------------------------------------------- 0
The process is the same as before : start with the full interval at left (0,1) : when you get to a branch, read from the input, and take the branch which corresponds to the data you read. Do the same at the second branch. Once you've read two, you now have the interval to specify those two inputs. Think of this as successively dividing/refining the working-interval.

For longer inputs, you would keep subdividing over and over.

For example, to write BA you would have the interval ( 1/2 , 5/8 ) which can be written as binary fraction 0.100 : NOT 0.1 : 0.1 specifies (B or C) in the first choice : 0.10 specifies B in the first choice : you need to write 0.100 to specify BA. Similarly 0.1010 is BB and 0.1011 is BC.


Since we know that an interval of size R is written in l = -log2(R) bits, it is simple to see that if R = p (the probability of that character) then l = -log2(p) bits, which is the ideal. Thus, all we have to do to code data is to figure out p, then we can make a process of dividing the interval which will code the data IDEALLY!

Thus, we see that the above example (with A,B,C) will provide ideal coding if and only if P(A) = 1/2, P(B) = 1/4 and P(C) = 1/4 and A,B & C have no correlation (i.e. after an A, the probabilities of the next character are the same as after a B, same as after a C; i.e. no character implies anything about another).
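
To make the interval-division concrete, here is a tiny floating-point encoder sketch for the A,B,C model above (P(A)=1/2, P(B)=1/4, P(C)=1/4). It only narrows the interval - there is no renormalization - so it is an illustration of the "theoretical" coder, usable only for short inputs; all names here are made up for the example :

#include <stdio.h>

int main ( void )
{
const char * input = "BA";
double cumLow [3] = { 0.0 , 0.5  , 0.75 };   /* cumulative probs of A , B , C */
double cumHigh[3] = { 0.5 , 0.75 , 1.0  };
double low = 0.0, high = 1.0;
int i;

for ( i = 0; input[i]; i++ )
  {
  int s = input[i] - 'A';
  double range = high - low;
  high = low + range * cumHigh[s];     /* narrow the working interval to the */
  low  = low + range * cumLow [s];     /* slice belonging to this symbol     */
  }

printf( "interval = ( %f , %f )\n", low, high );   /* "BA" -> ( 0.5 , 0.625 ) */
return 0;
}

The result ( 1/2 , 5/8 ) agrees with the BA interval worked out above, and any binary fraction inside it (the shortest is 0.100) specifies the input.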


Practical Arithmetic Coding with Renormalization

The above discussion is all very nice : but how do you do it ? The interval-division method above will work great, except for one problem : on large data files, the interval will get VERY small. Too small to keep in a machine word.

In fact, if you don't care about speed, you can use the above method. However, most of us like speed. Thus, we need to keep the interval in an integer machine word.

This means that for a 32-bit word, you cannot specify more precision than 2^(-32).

However, consider a flat code for a byte : i.e. a byte has values 0 -> 255 and you write each with equal probability 1/256. Then after one coding event, the interval is 1/256, after another byte, the interval is 1/65536, after 3 bytes the interval is 1/2^(24) and after 4 bytes the interval is 2^(-32). YIKES! In 4 bytes of input we used up our machine word! We would like to compress files longer than 4 bytes!!

Thus, we need a way to keep our interval from getting too small.

The secret is : Renormalization.

Let us think of it this way :

we start with a low bound, L, and an upper bound, H, for our range. Pictorially, L is the bottom of the graph and H is the top. Now, as we go through the coding recursion (i.e. subdividing the range) L and H get closer and closer together. When we are done, we write a number between the two (say (L+H)/2) and we're done. But, if we represent L and H as a binary fraction, as above, then we can write out bits as we go : once some of the left-most bits of L and H are the same, then they can never change.

for example : if we want to code from the "model" of :

------------------------------------------------- 1
               B                               
               ---------------------------------- 1/2
               A                               
------------------------------------------------- 0

and the input is "ABAA" then the codes proceed as:

L          H          Chars seen
------------------------------------
0.0000      0.1111      none
0.0000      0.0111      A
0.0011      0.0111      AB
0.0011      0.0101      ABA
0.0011      0.0100      ABAA

A quick walk-through of this table, for extra clarity :

we start with L = 0.0000 and H = 0.1111 ; this is the initialization condition corresponding to the full range. Then we see an A, and cut the range in half, keeping the bottom half. Obviously, L doesn't change. H must be /2 , which is the same as >> 1, so the new H is 0.0111 . Another way to see this is : H really started at 1.0000 , with 0.0000...0001 subtracted off. Thus, 1/2 = 0.1000 , but we must subtract off 0.0000...0001 , which gives 0.0111 . Now we see a B, so we must halve the range upwards. H stays the same, and L is increased by half of (H-L) , which is 0.0011 . The rest follows as this did.

now we could write out 0.0011 and be done. However, we could have output the first 0 right after the first A : once L is 0.0000 and H is 0.0111 then we can output a 0 bit and shift both up to 0.000 and 0.111 ;

We always shift a zero into L and a one into H, so L would be 0.0000 and H would be 0.1111 after these shifts.

The reason we shift a 1 into H is that H is not really 0.1111 , it is really 1.00000 , but since we can't handle this, we treat it as 0.11111111111111 .... which is equivalent to 0.9999999999 ... which we know from Math is equal to 1 (in base-ten). (this must be equivalent to One, because if it weren't equal, then there would be a different number which could fit between the two).

Thus, a simple algorithmic arith-coder in 16-bit registers would be:


typedef unsigned short uword;  /* assumed to be a 16-bit unsigned integer */

void A_Simple_Arithcoder(void)
{
uword L,H;

L = 0;
H = (1<<16) - 1; /* there is an implicit decimal place at the 16th bit,
                   just to the left of the machine word */

while ( MoreInput )
  {
  get input

  shrink L
  shrink H
  /* L and H are shrunk based on symbol cumulative probabilities.
       this is called the "coding step" */

  while ( (L & (1<<15)) == (H & (1<<15)) )
    {
    outputbit ( L & (1<<15) );

    /* while the top bits are the same, shift them out
       (there may be several such bits per coding step) */

    L <<= 1;
    H <<= 1;

    H ++;

    /* shift a one into the least-significant place of H */
    }
  }

}

By the way, the coding step is :

let Sp = the symbol probability
let Sl = the sum of all symbol probabilities, whose index is
           lower than the current one

R = H - L
L += R * Sl
H -= R * ( 1 - (Sl + Sp) )

This should be pretty obvious. Multiplying by R just scales the probabilities into the range of L and H, so that a cumulative probability of 0 maps to the old L, a cumulative probability of 1 maps to the old H, and the symbol's slice [ Sl , Sl+Sp ] becomes the new [ L , H ].
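
In integer arithmetic this is usually done with cumulative frequency counts rather than probabilities. Here is a sketch of the coding step in that form (hypothetical names; the "H is really H+1" convention follows the coders and notes below, and the products are assumed to fit in the machine word) :

/* coding step with integer cumulative frequencies.
   cumFreq[s] = sum of the counts of all symbols with index < s
   cumFreq[numSymbols] = total count
   H is stored as "really H+1" (see the notes after the coders below). */
void CodingStep ( unsigned long * L, unsigned long * H,
                  const unsigned long * cumFreq, int s, int numSymbols )
{
unsigned long range = *H - *L + 1;
unsigned long total = cumFreq[numSymbols];

*H = *L + ( range * cumFreq[s+1] ) / total - 1;
*L = *L + ( range * cumFreq[s]   ) / total;
}

Note that the new H is computed before L is overwritten, since it needs the old L.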

Now we are all ready to write an arithcoder except for one thing : UnderFlow. UnderFlow is the "worst case" for an arithcoder of this style (or any style). It happens about once every 1000 bytes, but we must be able to handle it, or our arithcoder will die.

Underflow is the case when L = 0.01 ... and H = 0.10 ... : i.e. they straddle the midpoint : L = 0.011111111 and H = 0.100000001 is the worst case. The problem here is that no matter how much the range is divided, the top digits of L and H will never be equal until L and H are equal !!!! ( this is not true if you have an infinitely large machine register).

Thus, we have to recognize that L and H are in the underflow case, and do something about it.

The best way to understand this is to return to our previous arithcoder and change the conceptual basis. Whenever the first digit of L and H are both 0 , this means L < 1/2 and H < 1/2 : when we scale them up, what we have really done is "shrunk our range" from (0,1) to (0,1/2) :

for example:

let Rl = the low bound of the range
let Rh = the high bound of the range
let Al = arithcoder low , absolute
let Ah = arithcoder high, absolute

Rl = 0 , Rh = 1 , Al = 0, Ah = 1

Proceed with arithcoding, shrink Al and Ah, don't worry
about renormalization.

Our old friends L and H can be found from :

L = ( Al - Rl ) / ( Rh - Rl )
H = ( Ah - Rl ) / ( Rh - Rl )

which is just scaling of Al and Ah in the Rl-Rh frame, i.e. Al and Ah are taken to be in the interval (Rl,Rh) while L and H are in the interval (0,1).

now, if L < 1/2 and H < 1/2 then let Rh = ( Rh - Rl ) / 2 + Rl

in other words, if Al and Ah are both in the lower half of
the Rl-Rh interval, then cut the Rl-Rh interval in half, choosing
the lower half

similarly, if L > 1/2 and H > 1/2 let Rl = ( Rh - Rl) / 2 + Rl

choose the upper half.

This is identical to our "simple arithmetic encoder".

This viewpoint will make it easier to understand the solution to the underflow problem (which eluded this author for quite some time) which is called the "bit-plus-follow" method.

It's really quite simple once you know how to do it. When L & H were less than 1/2 , we cut the range in half and took the bottom. When L & H were above 1/2, we cut the range in half and took the top. So, when L & H are in the middle, we just cut the range in half and take the middle! As follows:

if ( L >= 1/4 && H < 3/4 ) { L -= 1/4; H -= 1/4; L *= 2; H *= 2; }

It should be clear that this code segment scales the middle 1/2 range up to the full range.

Now the only problem is : how to tell the decoder that we did this? Let us again change our conceptual basis.

A data file is like a series of commands. The decoder reads these commands and knows how to interpret them & do something useful with them. In arithmetic encoding, we have two commands we can send to the decoder:
0 = halve lower
1 = halve upper

Since we're in binary, these are the only commands we get. You should stop and think for a second to be sure that this viewpoint is equivalent to our Simple_ArithCoder which outputted the leftmost bit of the binary fractions of L & H whenever they became the same.

So, how do we send a "halve middle" command? Well, it would seem that we can't - and indeed, we can't!!! - but we CAN send a "halve middle" command if it is followed by one of the other commands. Let us look at some examples:

What if we do a "halve middle" followed by a "halve lower" : the result is:

1 ----------

                             3/4 --------

initial   -> halve middle ->               -> "halve lower" ->   1/2 ------
range 
                             1/4 --------                        1/4 ------

0 ----------

The net result is a range of (1/4 , 1/2) - but that's the same as doing a "halve lower" followed by a "halve upper"!!

Now what happens if we do a "halve middle" followed by a "halve upper" :

1 ----------

                             3/4 --------                        3/4 ------

initial   -> halve middle ->               -> "halve upper" ->   1/2 ------
range 
                             1/4 --------                      

0 ----------
The net result is a range of (1/2 , 3/4) - but that's the same as doing a "halve upper" followed by a "halve lower"!!

Now recall
0 = halve lower
1 = halve upper

So we can write algorithmically :

("Halve Middle") then (operation B) = (operation B) then (operation NOT B)

Now what if we have to do a "Halve Middle" multiple times before a normal operation? Well, that just requires multiple applications of the !B operation. (this is easy to confirm, just think about the ranges being halved, as above).
Thus we have this picture:

N applications of ("Halve Middle") then (operation B)
= (operation B) then N applications of (operation !B)

So, any time we want to do a "Halve Middle", we just increment a counter, which I'll call BPF, which tells us how many opposite bits to output.

Now, let's revisit our Simple_Arithcoder, and we'll add BitPlusFollow.

void A_Simple_Arithcoder_Number1 (void)
{
uword L,H;
int b,bpf;

bpf = 0;

L = 0;
H = (1<<16) - 1; /* there is an implicit decimal place at the 16th bit,
                   just to the left of the machine word */

while ( MoreInput )
  {
  get input

  shrink L
  shrink H
  /* L and H are shrunk based on symbol cumulative probabilities.
       this is called the "coding step" */

  for (;;)    /* renormalize until neither case applies */
    {
    if ( (L & (1<<15)) == (H & (1<<15)) )
      {
      b = L & (1<<15);
      outputbit ( b );
      while ( bpf > 0 )
        {
        outputbit ( !b );
        bpf --;
        }

      /* the top bits are the same : shift them out */

      L <<= 1;
      H <<= 1;

      H ++;

      /* shift a one into the least-significant place of H */
      }
    else if ( (L>>14) == 0x1 && (H>>14) == 0x2 )
      {
      /* if the top two bits of L are 01 and the top two bits of H are 10
         then they straddle the middle and we need to do a "halve middle" */

      bpf ++;

      /* this does the -= 1/4 ; the surplus top bit of H falls off
         in the 16-bit left shift */
      L &= (1<<14)-1;
      H |= (1<<14);

      L <<= 1;
      H <<= 1;
      H ++;
      }
    else
      break;
    }
  }

}

That's one way to do it, using a very bit-oriented approach. Let's look at another way:

typedef unsigned long ulong;   /* assumed to be at least 32 bits */

void A_Simple_Arithcoder_Number2 (void)
{
ulong One,Half,Quarter,ThreeQuarters;
ulong L,H;
int b,bpf;

bpf = 0;

One = (1<<16);
Half = One/2;
Quarter = Half/2;
ThreeQuarters = Half + Quarter;

L = 0;
H = One - 1;

while ( MoreInput )
  {
  get input & shrink L,H

  for (;;)    /* renormalize until none of the three cases applies */
    {
    if ( L >= Half )
      {
      b = 1;
      outputbit ( b );
      while ( bpf > 0 )
        {
        outputbit ( !b );
        bpf --;
        }

      L -= Half;
      H -= Half;

      L *= 2;
      H *= 2;

      H ++;
      }
    else if ( H < Half )
      {
      b = 0;
      outputbit ( b );
      while ( bpf > 0 )
        {
        outputbit ( !b );
        bpf --;
        }

      L *= 2;
      H *= 2;

      H ++;
      }
    else if ( L >= Quarter && H < ThreeQuarters )
      {
      bpf ++;

      /* this does the -= 1/4 */
      L -= Quarter;
      H -= Quarter;

      L *= 2;
      H *= 2;

      H ++;
      }
    else
      break;
    }
  }

}

Well, that's about it. There are a couple of things to notice :

1. note that we use <= and >= for the bottom of the range, but < and > for the top of the range. It is critical that you choose one boundary to be "in" and one to be "out" - they cannot both be "in" the range.

2. note that X += X is equivalent to X *= 2 and X <<= 1 and that the method of X += X is the fastest on most architectures.

3. Note that H is REALLY (H + 1), and also that H < One always, and that a 1 bit is shifted into H instead of a 0 bit (i.e. H++ )


Go here for some further notes on finite-integer arithcoding/decoding:

Email on integer coding


Charles Bloom / cb at my domain
