PPMZ v7.3 Algorithm Description
June 5 1996
by Charles Bloom
cbloom@mail.utexas.edu

---

This discussing presumes a familiarity with PPMC,PPM*,and PPMD+.

PPMD+ is described at:

        http://www.cs.waikato.ac.nz/papers/pe_for_ppm/

PPM* information is availabe at:


If you don't know PPMC, run for the hills.

---

PPMZ v7.3 source code, reports, and this text are available at:

http://wwwvms.utexas.edu/~cbloom/src/indexppm.html

I will soon (one week) be adding an extremely detailed statistics
report (about 100k long!!!) to this page.

---

PPMZ includes several coders, all of which are made of several
fundamental PPM components.  For example, the primary coder consists
of :

Order-12        PPMdet
Order-8         PPMZ
Order-5-4       PPMQ
Order-3-2-1     PPMC
Order-0-(-1)    Frequency model with full exclusion

There is certain confusion of nomenclature which I will try to alleviate
here:
        PPMQ,PPMZ, and PPMdet are all Escape Probability Estimation algorithms
        PPMZ v7.3 is the name of the executable
        in PPMZ v7 , PPMQ was merged into PPMZ as a special case,
                so PPMQ is actually implemented as a kind of PPMZ.
                The same was done to PPMC.
        PPMQ is also the name of the PPMQ-Second-Level-Statistics-Manager
                which is an algorithm abstraction
---

Now a brief (amazing) introduction to PPM statistics :
        about 50% of characters coded by PPM are coded by a
                deterministic context
        about 90% of characters coded by a deterministic context
                are successfully matched

deterministic contexts are those that predict only a single character.
It is extremely difficult to estimate the statistics of a deterministic
context 

---

PPMdet :

PPMdet is an unbounded-length context coder (like PPM*) which accepts only
deterministic contexts with a count >= 2 .  This faccilitates extremely
simple data structure maintenance :

1. every context seen is added to an LZ77-style hash-lookup table, using
   a reverse pointer
2. each context contains a Minimum Match Length, which is initially set to
   the "Order" of the PPMdet (12 in PPMZ v7.3)
3. To match a context, the current context must match to at least MinMatchLength
   The Length of mutual match is MatchLength.
4. If the new character is successfully predicted, the character count
   is updated.
5. Otherwise the old context MinMatchLength is set to MatchLength + 1
   as is the new context.  Thus, both context minmatchlengths are 
   extended so that both remain deterministic, and a longer match length
   is required.
   Perhaps adding a NumExtensions field would improve probability
   estimation.
6. Since every context is deterministic, every context has an escape count
   of 1.
7. Statistics are passed through the PPMQ-Second-Level-Statistics-Manager
        (which will be described later)
8. Matches can be very slow in PPMdet, so efficient LZ77 methods are used,
        but are still very slow, and are helped greatly by a
        preliminary pass of RunTransform
                (http://wwwvms.utexas.edu/~cbloom/src/genlib.zip)
9. If a context has not yet seen two characters, it is not coded from,
        and an escape is made without writing one

---

PPMZ :

PPMZ is identical to PPMC (with faster data structures inspired by LZ77)
except that deterministic contexts (contexts with one character in them
AFTER exclusion has been applied) and contexts with two characters in them
AFTER exclusion has been applied use the PPMQ-Second-Level-Statistics-Manager
to encode escapes.

---        

PPMQ :

PPMQ is identical to PPMZ except that only deterministic contexts
use the PPMQ-Second-Level-Statistics-Manager to encode escapes.

---        

PPMQ-Second-Level-Statistics-Manager : (SLSM)

This the core of PPMdet/Z/Q

The driving idea is that if a context has a PPMC-Escape-Count of
3 and a PPMC-Total-Count of 6, then it might not ACTUALLY have a
50% probability of escaping.  For example, deterministic contexts
almost always have a much lower chance of escaping than PPMC
would predict.  However, the difference is not a simple linear
scaling, and it also varies greatly from file to file.

LZP and ACB capture this variance in the statistics of match
lengths (trans,etc. have longer match length, paper1,etc. have
distributions which tail-off very rapidly after length=1).  SLSM
captures this information in its statistics tables.

when PPMC-Total-Count = 2 and PPMC-Escape-Count == 1

then Actual-Escape-Frequency =

        paper1 : 50% (same)
        trans  : 25% (2 times lower)

This information is tracked in the SLSM by keeping a second-level
array of statistics which translates the Stored PPMC statistics
into Actual Coding statistics.

This is done by create a hash value from the PPMC-Tot & Esc
counts.  This hash is used as an index into a table.  The table
contains the actual number of escapes and non-escapes done by
other contexts with the same hash (and order).  This actual number
done is used for coding.

All of this operationg is done through the PPMZ_Write_Escape function
call.

---

Future improvements :

PPMZ v7.3 consists primarily of a small modification of the work done by
Bill Teahan on PPMD+ : PPMZ handles deterministic contexts very well (almost
as well as possible).  It is projected that all the things Bill Teahan found
to improve PPMD+ will also improve PPMZ, so the following things will be
tried :
                                        
0. There are several constants in PPMdet.c and PPMZ.c relating to statistics
   management (i.e. DT_INC, and the DT_HASH values) which have not
   been significantly tuned.  It is estimated that severe optimal tuning
   of parameters & startup-values would result in about 0.01 bpb improvment,
   but would require too damn much work.

1. Recency scaling.

2. Switching to Random and Semi-Random models (i.e. geo, obj1)

3. LZP and ACB teach that string matching can occasionally be an improvement
   over PPM (because string matching inherently handles the deterministic
   problem).  Thus, adding a string-matching functionality to PPMdet, ala
   LZP, might improve performance on trans, pic, prog* etc. while hopefully
   not hurting too much on other files.

4. Semi-Adaptive PPMZ-Coder switching.  Version 8 of PPMZ will quickly
   detect which coder to use and then do so.