High Compression Markov Predictive Coder
Download : PPMZ2 v0.81 Here's the source code and
WinNT executable. The code is much cleaner than the PPMZ code; you
might actually have a chance of figuring it out :^) Also, it runs in much less memory than PPMZ,
so packing 'pic' and such is actually reasonable. As usual, you'll need crblib
July 22 2009 - Ian Cadieu has made a managed wrapper for PPMZ2; it includes a test harness. See :
managed PPMZ2 start page ;
May 09 2004 - PPMZ2 v0.81 , with cross-correlation measurement. I've also rolled in some of the
improvements from Cynbe ru Taren. Cross-correlation is an excellent way of measuring how similar
two files are. Use PPMZ2 -x to build the model from one file and then use it to compress another.
If the files are totally unrelated, you'll get an entropy >= 8 bits per symbol. If the files are
closely related, you'll get an entropy similar to (or even less) than the self-entropy of that file.
For example, if you take an exerpt from a larger text and use that larger text as the condition, the
"cross-entropy" is lower than the self-entropy. For example, on some of the files of the corpus
(this is PPMZ2 -x) :
Another interesting metric is to do the self-compression with pre-condition (the -c) option. You can
then measure similarity by taking that divided by the self-entropy (then I do a -1 and x100% to make it
more clear). For example :
- paper1 self entropy : 2.210 bpc
- paper1 conditioned on paper2 : 3.247 bpc
- paper1 conditioned on bib : 4.297 bpc
- paper1 conditioned on pic : 12.263 bpc
This is nice because it completely normalizes out the self-compressability,
but it's a much more transient measure than the -x cross-correlation mode where the
model isn't updated from self at all.
- paper1 self entropy : 2.210 bpc
- paper1 self conditioned on paper2 : 1.992 bpc
- similarity = (2.210/1.992 - 1)*100% = 10.944 %
- paper1 self conditioned on pic : 2.313 bpc
- similarity = (2.210/2.313 - 1)*100% = -4.453 %
April 2002 : Hannu Peltola and Jorma Tarhio have produced a port of
PPMZ for Linux ; their
work and web page is very nice, even if you don't care for Linux. They've
also generated results on the Canterbury corpus.
May 2004 - Cynbe ru Taren is porting PPMZ2 to Linux. He's cleaning up and optimizing the code,
it looks nice. The PPMZ2 code base was a research project design to easily explore new ideas in
PPM, and I wasn't a great coder at the time, so the improvements are welcome.
PPMZ2 for Linux . Little bug fixes to PPMZ2 ; there was a bug in the "text mode" order -1 coder, and
the MTF I was doing was actually a "rotate".
June 1999 - I'm now working on PPMZ version 2. Some exciting things are in the works. As usual, you can follow the
developments on comp.compression.
Results for ppmz2 v0.7 on the Calgary Corpus :
bib : 111261 -> 23873 = 1.717 bpc
book1 : 768771 -> 210952 = 2.195 bpc
book2 : 610856 -> 140932 = 1.846 bpc
geo (de-interleaved) : 102404 -> 52446 = 4.097 bpc
news : 377109 -> 103951 = 2.205 bpc
obj1 : 21504 -> 9841 = 3.661 bpc
obj2 : 246814 -> 69137 = 2.241 bpc
paper1 : 53161 -> 14711 = 2.214 bpc
paper2 : 82199 -> 22449 = 2.185 bpc
pic (transposed) : 513216 -> 30814 = 0.480 bpc
progc : 39611 -> 11178 = 2.258 bpc
progl : 71646 -> 12938 = 1.445 bpc
progp : 49379 -> 8948 = 1.450 bpc
trans : 93695 -> 14224 = 1.214 bpc
total : 726400
average : = 2.086 bpc
The performance on medium sized "text-like" files here is by far the
best of any "pure" compressor (eg. the switching scheme of Volf
not included). Medium sized text-like files are news, bib, paper2, trans, progl,
book2, etc.. On book1 our compression is hurt by the LRU limitting the number of
contexts (by an unknown amount; it is unlikely we would beat CTW's rate of 2.158 bpc).
On many of the "medium text-like" files (eg. progl, trans, news) we even beat the
best switching schemes of Volf!! This is at first quite an astounding result, because
our computational complexity is orders of magnitude lower. However, we should perhaps
not be surprised. Volf's best switching/mixing schemes mix between CTW and something
like PPM* or LZ77. In both cases, he is simply trying to mix the good low-order
performance of CTW with a good high-order coder. We, on the other hand, have got such a
hybrid automatically by using infinite-length PPM and LOE. If you like, the PPMDet and LOE
scheme can be seen as a way of using the local character to switch algorithms (in my case,
between PPMDet and PPMZ-finite-order), instead of weighting all possible switches by
Note that on the very small files (obj1,progc) we're worse than the old ppmz, but we
don't really care about that.
Note that this means I can now send the corpus self-extracting (separate files, with a totally
generic 1->1 coder) in about 759,000 bytes. Hey Leonid - you owe me fifty bucks :^)
A run-down of things we do that make us so good :
- SEE (Secondary Escape Estimation) using 22 bits of data, and then weighting by entropy
between various orders (9 and 15 bit orders). PPMZ 2 now uses 2 bits for the number of characters
in the parent context (thanks to Malcolm Taylor) and that helps a bit. I also use the *post-exclusion*
statistics to compute the escape.
- LOE (Local Order Estimation) every time we step down a context. eg we start at infinity and
then keeping LOE'ing down to some best lower order until we hit order (-1). I use the excluded stats
and the SEE escapes to make the LOE decision, and I use the MPS probability with a penalty for non-deterministic
as the LOE decision (this is the same thing done in PPMZ and was one of the primary innovations there).
In v0.5 I do LOE betwen orders 16,12,8,and 5 through 0
- PPMdet : infinite length deterministic context modeller. This is why our performance on trans is
so much better than ever. I now have a custom SEE module coupled to ppmdet. The basic idea here is that
we track the deterministic "front" of a PPM* model. However, there are some heuristic modfications that
improve compression : I don't always take the longest context matched, but instead take ones which have high
likelyhoods of succeeding (eg. high match counts or low minimum match lengths). In v0.5 I use PPMdet with
a minimum match length of 24.
- The source code is all brand new (except for some stuff in crblib) and I'm using a more elegant
Trie structure to do my context lookups. The Trie is not a win if you're doing old-school PPMD, but if you
need to get all the contexts to do LOE anyway, you might as well walk bottom to top (instead of top to bottom).
PPMZ version 9.1 released March 1998
See release notes below for more info.
A report on the files of the Calgary Corpus
|PPMZ v9.1 results on the Calgary Corpus
|file ||raw size||compressed||bits per byte||tuned (best)|
|bib ||111261 || 24256||1.744||1.743|
|book1 ||768771 || 212733||2.213||2.210|
|book2 ||610856 || 143075||1.873||1.869|
|geo ||102404 || 51635||4.033||3.874|
|news ||377109 || 105725||2.242||2.242|
|obj1 || 21504 || 9854||3.665||3.644|
|obj2 ||246814 || 68804||2.230||2.215|
|paper1|| 53161 || 14772||2.222||2.221|
|paper2|| 82199 || 22749||2.214||2.210|
|pic ||513216 || 50685||0.790||0.787|
|progc || 39611 || 11180||2.257||2.250|
|progl || 71646 || 13185||1.472||1.467|
|progp || 49379 || 9122||1.477||1.471|
|trans || 93695 || 14508||1.238||1.231|
| || || || |
For more information...
PPM is a classic algorithm that many luminaries have done extensive work
on. My PPMZ represents only a few small steps away from their body of
work. The major contributors to PPM have been: Alistair Moffat, John
Cleary, Ian Witten, and Bill Teahan.
Of most immediate interest is Bill Teahan's
, which was the starting point for PPMZ.
The fact that something better was possible was suggested by Cleary
and Teahan's paper on
PPMdet was inspired by the PPM* coder by Cleary,Witten, and Teahan :
and the LZ77 context index structures of Peter Fenwick.
Most recently, a weighting method has been inspired by the work
of the CTW algorithm.
As always, a good summary of the original PPM algorithms can be found in:
T. Bell, J. Cleary, and I. Witten, Text Compression , Prentice Hall, 1990
Charles Bloom / cb at my domain Send Me Email
Back to the Index
The free web counter says you are visitor number
and visitor number to this section.