High Compression Markov Predictive Coder


  • Download : PPMZ2 v0.81 . Here's the source code and WinNT executable. The code is much cleaner than the PPMZ code; you might actually have a chance of figuring it out :^) Also, it runs in much less memory than PPMZ, so packing 'pic' and such is actually reasonable. As usual, you'll need crblib to compile.

  • July 22 2009 - Ian Cadieu has made a managed wrapper for PPMZ2; it includes a test harness. See : managed PPMZ2 start page ; code trunk

  • May 09 2004 - PPMZ2 v0.81 , with cross-correlation measurement. I've also rolled in some of the improvements from Cynbe ru Taren. Cross-correlation is an excellent way of measuring how similar two files are: use PPMZ2 -x to build the model from one file and then use it to compress another. If the files are totally unrelated, you'll get an entropy >= 8 bits per symbol; if the files are closely related, you'll get an entropy similar to (or even less than) the self-entropy of that file. For example, if you take an excerpt from a larger text and use that larger text as the condition, the "cross-entropy" is lower than the self-entropy, as it is on some of the files of the corpus (this is PPMZ2 -x). Another interesting metric is to do the self-compression with the pre-condition (-c) option. You can then measure similarity by dividing that by the self-entropy (then I subtract 1 and multiply by 100% to make it clearer). This is nice because it completely normalizes out the self-compressibility, but it's a much more transient measure than the -x cross-correlation mode, where the model isn't updated from self at all.
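    The two measurements above can be sketched with a toy order-0 byte model (PPMZ2's real model is a high-order PPM; the file names and numbers here are made up for illustration):

```python
import math
from collections import Counter

def cross_entropy(model_bytes, target_bytes, adaptive=False):
    """Bits per symbol needed to code target_bytes under an order-0
    byte model built from model_bytes (Laplace-smoothed).  This is only
    a toy stand-in for PPMZ2's high-order model, to show the metric.
    adaptive=False freezes the model, like the -x cross-correlation
    mode; adaptive=True keeps updating it, like the -c pre-condition
    mode."""
    counts = Counter(model_bytes)
    counts = {s: counts.get(s, 0) + 1 for s in range(256)}  # +1 per symbol
    total = len(model_bytes) + 256
    bits = 0.0
    for b in target_bytes:
        bits += -math.log2(counts[b] / total)
        if adaptive:
            counts[b] += 1
            total += 1
    return bits / len(target_bytes)

big_text = b"the quick brown fox jumps over the lazy dog " * 50
excerpt  = b"the quick brown fox " * 40     # closely related to big_text
noise    = bytes(range(256)) * 8            # unrelated to big_text

print(cross_entropy(big_text, excerpt))     # low: shared statistics
print(cross_entropy(big_text, noise))       # high: > 8 bits per symbol

# The normalized similarity metric: pre-conditioned self-compression
# divided by plain self-compression, minus 1, times 100%.
self_e = cross_entropy(b"", excerpt, adaptive=True)
cond_e = cross_entropy(big_text, excerpt, adaptive=True)
print((cond_e / self_e - 1) * 100)          # negative here: the condition helps
```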

  • April 2002 : Hannu Peltola and Jorma Tarhio have produced a port of PPMZ for Linux ; their work and web page are very nice, even if you don't care for Linux. They've also generated results on the Canterbury corpus.

  • May 2004 - Cynbe ru Taren is porting PPMZ2 to Linux. He's cleaning up and optimizing the code, and it looks nice. The PPMZ2 code base was a research project designed to make it easy to explore new ideas in PPM, and I wasn't a great coder at the time, so the improvements are welcome. PPMZ2 for Linux . Little bug fixes to PPMZ2 : there was a bug in the "text mode" order -1 coder, and the MTF I was doing was actually a "rotate".
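    The MTF-vs-rotate distinction: true move-to-front pops the symbol out of the table and reinserts it at the head, shifting only the entries ahead of it, while a rotate cycles entries around, which also reorders everything that was behind the symbol. (The page doesn't say which rotate the bug was; rotating the whole table is one plausible reading, used here for illustration.)

```python
def move_to_front(table, sym):
    """True MTF: pull sym to index 0; only entries ahead of it shift."""
    i = table.index(sym)
    table.insert(0, table.pop(i))
    return i

def rotate_to_front(table, sym):
    """Rotate variant: cycle the whole table until sym is at index 0,
    which also reorders everything that was behind sym."""
    i = table.index(sym)
    table[:] = table[i:] + table[:i]
    return i

mtf = list("abcde")
rot = list("abcde")
move_to_front(mtf, "c")     # mtf -> ['c', 'a', 'b', 'd', 'e']
rotate_to_front(rot, "c")   # rot -> ['c', 'd', 'e', 'a', 'b']
print(mtf, rot)
```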

  • June 1999 - I'm now working on PPMZ version 2. Some exciting things are in the works. As usual, you can follow the developments on comp.compression.

    Results for ppmz2 v0.7 on the Calgary Corpus :

    bib                  :   111261 ->    23873 = 1.717 bpc
    book1                :   768771 ->   210952 = 2.195 bpc
    book2                :   610856 ->   140932 = 1.846 bpc
    geo (de-interleaved) :   102404 ->    52446 = 4.097 bpc
    news                 :   377109 ->   103951 = 2.205 bpc
    obj1                 :    21504 ->     9841 = 3.661 bpc
    obj2                 :   246814 ->    69137 = 2.241 bpc
    paper1               :    53161 ->    14711 = 2.214 bpc
    paper2               :    82199 ->    22449 = 2.185 bpc
    pic (transposed)     :   513216 ->    30814 = 0.480 bpc
    progc                :    39611 ->    11178 = 2.258 bpc
    progl                :    71646 ->    12938 = 1.445 bpc
    progp                :    49379 ->     8948 = 1.450 bpc
    trans                :    93695 ->    14224 = 1.214 bpc
    total :                              726400
    average :                                   = 2.086 bpc
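    The bpc figures are just compressed bytes × 8 / raw bytes, and the "average" row is the unweighted mean of the per-file rates (not total compressed bits over total raw bytes). A quick check in Python, using the numbers from the table above:

```python
# (raw size, compressed size) for each file, from the v0.7 table above
results = {
    "bib":    (111261,  23873), "book1":  (768771, 210952),
    "book2":  (610856, 140932), "geo":    (102404,  52446),
    "news":   (377109, 103951), "obj1":   ( 21504,   9841),
    "obj2":   (246814,  69137), "paper1": ( 53161,  14711),
    "paper2": ( 82199,  22449), "pic":    (513216,  30814),
    "progc":  ( 39611,  11178), "progl":  ( 71646,  12938),
    "progp":  ( 49379,   8948), "trans":  ( 93695,  14224),
}

def bpc(raw, comp):
    return comp * 8.0 / raw            # bits per character (byte)

for name, (raw, comp) in results.items():
    print(f"{name:8s} {bpc(raw, comp):.3f} bpc")

# The "average" row is the unweighted mean over the 14 files.
average = sum(bpc(r, c) for r, c in results.values()) / len(results)
print(f"average  {average:.3f} bpc")   # 2.086, matching the table
```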

    The performance on medium-sized "text-like" files here is by far the best of any "pure" compressor (eg. not counting the switching scheme of Volf). Medium-sized text-like files are news, bib, paper2, trans, progl, book2, etc. On book1 our compression is hurt by the LRU limiting the number of contexts (by an unknown amount; it is unlikely we would beat CTW's rate of 2.158 bpc).

    On many of the "medium text-like" files (eg. progl, trans, news) we even beat the best switching schemes of Volf!! This is at first quite an astounding result, because our computational complexity is orders of magnitude lower. However, we should perhaps not be surprised. Volf's best switching/mixing schemes mix between CTW and something like PPM* or LZ77. In both cases, he is simply trying to mix the good low-order performance of CTW with a good high-order coder. We, on the other hand, have got such a hybrid automatically by using infinite-length PPM and LOE. If you like, the PPMDet and LOE scheme can be seen as a way of using the local character to switch algorithms (in my case, between PPMDet and PPMZ-finite-order), instead of weighting all possible switches by their performance.
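    The order-selection idea can be caricatured in a few lines. This is only the flavor of it: PPMZ's actual LOE uses secondary estimation and more careful confidence statistics; the class, orders, and confidence formula below are all invented for illustration.

```python
from collections import defaultdict

class ToyLOE:
    """Toy illustration of local order estimation: keep byte counts for
    several context orders and, at each position, prefer the order whose
    current context predicts its top symbol most confidently.  Long
    deterministic contexts win when they exist; otherwise we fall back
    to lower orders -- an automatic "switch" between behaviors."""

    def __init__(self, orders=(4, 2, 0)):
        self.orders = orders
        self.counts = {o: defaultdict(lambda: defaultdict(int)) for o in orders}

    def pick_order(self, history):
        best, best_conf = min(self.orders), 0.0
        for o in self.orders:
            ctx = history[-o:] if o else b""
            c = self.counts[o].get(ctx)
            if not c:
                continue                        # context never seen: skip
            total = sum(c.values())
            conf = max(c.values()) / (total + 1)  # crude confidence estimate
            if conf > best_conf:
                best, best_conf = o, conf
        return best

    def update(self, history, sym):
        for o in self.orders:
            ctx = history[-o:] if o else b""
            self.counts[o][ctx][sym] += 1

loe = ToyLOE()
data = b"abracadabra abracadabra"
hist = b""
for s in data:
    loe.update(hist, s)
    hist += bytes([s])
print(loe.pick_order(data))   # after the repetition, the long context wins
```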

    Note that on the very small files (obj1,progc) we're worse than the old ppmz, but we don't really care about that.

    Note that this means I can now send the corpus self-extracting (separate files, with a totally generic 1->1 coder) in about 759,000 bytes. Hey Leonid - you owe me fifty bucks :^)
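    The "(de-interleaved)" and "(transposed)" notes in the table refer to reversible byte reorderings applied before coding, so that bytes with similar statistics land next to each other. A sketch of both transforms (the 4-byte record size for geo and the row width for pic are illustrative assumptions, not taken from this page):

```python
def deinterleave(data, stride):
    """Byte-plane split for records of `stride` bytes: all first bytes,
    then all second bytes, and so on.  For 'geo' this groups the same
    byte of each fixed-size sample together (record size is assumed)."""
    return b"".join(data[i::stride] for i in range(stride))

def interleave(data, stride):
    """Inverse of deinterleave (len(data) must be a multiple of stride)."""
    n = len(data) // stride
    planes = [data[i * n:(i + 1) * n] for i in range(stride)]
    return bytes(b for group in zip(*planes) for b in group)

def transpose(data, width):
    """Read a width-column bitmap column-by-column instead of
    row-by-row, as for 'pic' (a tall fax image)."""
    rows = [data[i:i + width] for i in range(0, len(data), width)]
    return bytes(rows[r][c] for c in range(width) for r in range(len(rows)))

raw = bytes(range(12))
print(deinterleave(raw, 4))                   # 0,4,8, 1,5,9, 2,6,10, 3,7,11
assert interleave(deinterleave(raw, 4), 4) == raw   # fully reversible
```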

  • A run-down of things we do that make us so good :

    PPMZ version 9.1 released March 1998

    See release notes below for more info.

    A report on the files of the Calgary Corpus

    PPMZ v9.1 results on the Calgary Corpus
    file     raw size   compressed   bits per byte   tuned (best)
    bib        111261        24256           1.744          1.743
    book1      768771       212733           2.213          2.210
    book2      610856       143075           1.873          1.869
    geo        102404        51635           4.033          3.874
    news       377109       105725           2.242          2.242
    obj1        21504         9854           3.665          3.644
    obj2       246814        68804           2.230          2.215
    paper1      53161        14772           2.222          2.221
    paper2      82199        22749           2.214          2.210
    pic        513216        50685           0.790          0.787
    progc       39611        11180           2.257          2.250
    progl       71646        13185           1.472          1.467
    progp       49379         9122           1.477          1.471
    trans       93695        14508           1.238          1.231

    For more information...

    PPM is a classic algorithm that many luminaries have done extensive work on. My PPMZ represents only a few small steps away from their body of work. The major contributors to PPM have been: Alistair Moffat, John Cleary, Ian Witten, and Bill Teahan.

    Of most immediate interest is Bill Teahan's PPMD+ , which was the starting point for PPMZ.

    The fact that something better was possible was suggested by Cleary and Teahan's paper on escape estimation.

    PPMdet was inspired by the PPM* coder by Cleary, Witten, and Teahan : PPM* paper and the LZ77 context index structures of Peter Fenwick.

    Most recently, a weighting method has been inspired by the work of the CTW algorithm.

    As always, a good summary of the original PPM algorithms can be found in:

    T. Bell, J. Cleary, and I. Witten, Text Compression, Prentice Hall, 1990.

    Charles Bloom / cb at my domain

