PPMZ v7.3 Algorithm Description June 5 1996 by Charles Bloom cbloom@mail.utexas.edu --- This discussing presumes a familiarity with PPMC,PPM*,and PPMD+. PPMD+ is described at: http://www.cs.waikato.ac.nz/papers/pe_for_ppm/ PPM* information is availabe at: If you don't know PPMC, run for the hills. --- PPMZ v7.3 source code, reports, and this text are available at: http://wwwvms.utexas.edu/~cbloom/src/indexppm.html I will soon (one week) be adding an extremely detailed statistics report (about 100k long!!!) to this page. --- PPMZ includes several coders, all of which are made of several fundamental PPM components. For example, the primary coder consists of : Order-12 PPMdet Order-8 PPMZ Order-5-4 PPMQ Order-3-2-1 PPMC Order-0-(-1) Frequency model with full exclusion There is certain confusion of nomenclature which I will try to alleviate here: PPMQ,PPMZ, and PPMdet are all Escape Probability Estimation algorithms PPMZ v7.3 is the name of the executable in PPMZ v7 , PPMQ was merged into PPMZ as a special case, so PPMQ is actually implemented as a kind of PPMZ. The same was done to PPMC. PPMQ is also the name of the PPMQ-Second-Level-Statistics-Manager which is an algorithm abstraction --- Now a brief (amazing) introduction to PPM statistics : about 50% of characters coded by PPM are coded by a deterministic context about 90% of characters coded by a deterministic context are successfully matched deterministic contexts are those that predict only a single character. It is extremely difficult to estimate the statistics of a deterministic context --- PPMdet : PPMdet is an unbounded-length context coder (like PPM*) which accepts only deterministic contexts with a count >= 2 . This faccilitates extremely simple data structure maintenance : 1. every context seen is added to an LZ77-style hash-lookup table, using a reverse pointer 2. each context contains a Minimum Match Length, which is initially set to the "Order" of the PPMdet (12 in PPMZ v7.3) 3. To match a context, the current context must match to at least MinMatchLength The Length of mutual match is MatchLength. 4. If the new character is successfully predicted, the character count is updated. 5. Otherwise the old context MinMatchLength is set to MatchLength + 1 as is the new context. Thus, both context minmatchlengths are extended so that both remain deterministic, and a longer match length is required. Perhaps adding a NumExtensions field would improve probability estimation. 6. Since every context is deterministic, every context has an escape count of 1. 7. Statistics are passed through the PPMQ-Second-Level-Statistics-Manager (which will be described later) 8. Matches can be very slow in PPMdet, so efficient LZ77 methods are used, but are still very slow, and are helped greatly by a preliminary pass of RunTransform (http://wwwvms.utexas.edu/~cbloom/src/genlib.zip) 9. If a context has not yet seen two characters, it is not coded from, and an escape is made without writing one --- PPMZ : PPMZ is identical to PPMC (with faster data structures inspired by LZ77) except that deterministic contexts (contexts with one character in them AFTER exclusion has been applied) and contexts with two characters in them AFTER exclusion has been applied use the PPMQ-Second-Level-Statistics-Manager to encode escapes. --- PPMQ : PPMQ is identical to PPMZ except that only deterministic contexts use the PPMQ-Second-Level-Statistics-Manager to encode escapes. --- PPMQ-Second-Level-Statistics-Manager : (SLSM) This the core of PPMdet/Z/Q The driving idea is that if a context has a PPMC-Escape-Count of 3 and a PPMC-Total-Count of 6, then it might not ACTUALLY have a 50% probability of escaping. For example, deterministic contexts almost always have a much lower chance of escaping than PPMC would predict. However, the difference is not a simple linear scaling, and it also varies greatly from file to file. LZP and ACB capture this variance in the statistics of match lengths (trans,etc. have longer match length, paper1,etc. have distributions which tail-off very rapidly after length=1). SLSM captures this information in its statistics tables. when PPMC-Total-Count = 2 and PPMC-Escape-Count == 1 then Actual-Escape-Frequency = paper1 : 50% (same) trans : 25% (2 times lower) This information is tracked in the SLSM by keeping a second-level array of statistics which translates the Stored PPMC statistics into Actual Coding statistics. This is done by create a hash value from the PPMC-Tot & Esc counts. This hash is used as an index into a table. The table contains the actual number of escapes and non-escapes done by other contexts with the same hash (and order). This actual number done is used for coding. All of this operationg is done through the PPMZ_Write_Escape function call. --- Future improvements : PPMZ v7.3 consists primarily of a small modification of the work done by Bill Teahan on PPMD+ : PPMZ handles deterministic contexts very well (almost as well as possible). It is projected that all the things Bill Teahan found to improve PPMD+ will also improve PPMZ, so the following things will be tried : 0. There are several constants in PPMdet.c and PPMZ.c relating to statistics management (i.e. DT_INC, and the DT_HASH values) which have not been significantly tuned. It is estimated that severe optimal tuning of parameters & startup-values would result in about 0.01 bpb improvment, but would require too damn much work. 1. Recency scaling. 2. Switching to Random and Semi-Random models (i.e. geo, obj1) 3. LZP and ACB teach that string matching can occasionally be an improvement over PPM (because string matching inherently handles the deterministic problem). Thus, adding a string-matching functionality to PPMdet, ala LZP, might improve performance on trans, pic, prog* etc. while hopefully not hurting too much on other files. 4. Semi-Adaptive PPMZ-Coder switching. Version 8 of PPMZ will quickly detect which coder to use and then do so.