see also:

George Buynovsky's Description of ACB (the author)

Peter Fenwick's LZP = BlockSort = Shannon

Email

From: leob@best.com (Leonid A. Broukhis)

To: Mark Nelson and Charles Bloom

Mark and Charles, I went over the George's paper on weekend and here is what I made out of it (the paper only describes the basic idea without implementation details; this is my understanding of it): A. Encoding. Consider a sorted dictionary D of N binary strings with imaginary "all 0's" as 0th entry and "all 1's" as N+1th entry. The procedure of encoding a binary stream S with the dictionary is as follows: 1. Find the lexicographic location L of S to be encoded in the dictionary. L = X: { length of matching part of D[X](*) and S(*) = max } Encode and transmit L to the output stream. There may be two cases: D[L]: XX...X0ZZ.Z0V... D[Lprev]: XX...X0YY... S: XX...X0ZZ.Z1T... S: XX...X1ZZ.Z0T... D[Lnext]: XX...X1YY... D[L]: XX...X1ZZ.Z1V... 2. Now, encode and transmit the difference bit B=1 between S(+) and D[L](+); let K = the number of (~B) bits in ZZ.Z . 3. Update the dictionary if desired. 4. Make the first non-encoded bit to be the new stream head, repeat. The decoder gets L and B, transmits the common part of D[L] and D[Lnext] or D[Lprev] depending on B, transmits ~B, gets K, transmits the ZZ..ZZ part of D[L] until it is about to transmit the K+1th (~B) bit not including it, then transmits B. Pretty close to LZP+Sort so far, but what's new is B. Dictionary organization. At this point we define the way the dictionary is sorted. Each dictionary entry has two parts: content (post-history in George's terms) and context (pre-history), and it is sorted _rigth_to_left_ based on context. E.g.the dictionary built using ...D R D O B B S D R D O B B S ... : | George's terms: | Dict. |"Funnels of analogies" entry #'s +---------+----------+ | pre- | post- | | history | history | - -----------+---------+----------+ i ...SDRDOBB | SDRDOBB... ii ...BSDRDOB | BSDRDOB... iii ...OBBSDRD | OBBSDRD... iv ...RDOBBSD | RDOBBSD... v ...BBSDRDO | BBSDRDO... vi ...DOBBSDR | DOBBSDR... vii ...DRDOBBS | DRDOBBS... Here is what happens if we want to encode ...DR | ROBBSSDP... where ...DR is the context, and ROBBSSDP... is the content. Entry (vi) is the best context match, entry (iv) is the best content match, therefore L = -2, the first letter 'R' matches. What will be transmitted is (-2, 'O', 1). Now, the context is ...DRRO, and the content is BBSSDP... Best context match is (v), the same as the best content match. Difference character is 'S', and there are only two non-S characters is the string (v) before the difference, therefore we transmit (0, 'S', 2). Now the context is ....BBSS, content = DP.... Best context match is (vii), best content match is (vii) (we are free to choose between (vi) and (vii), so we choose the min distance), what's transmitted is (0, 'P', 0), as the first characters of (vi) and (vii) match, and there's nothing more to copy. It's easy to see (I'm not sure about "to prove") that for most real inputs abs(L) will be quite small. If a sequence of literals needs to be output, it is prepended by a "magic dict. reference number". The depth with which the contexts are sorted controls the speed and the compression ratio of the algorithm. +---------------+ |Footnotes: | |(*) use context| |(+) use content| +---------------+ Last note: the actual ACB algorithm uses right-to-left scanning, so the contexts are left-to-right, and the contents - right-to-left. This simplifies sorting of the contexts, and George claims that it improves compression as well (~0.01 bpb, but very consistently). Please tell me what you think, Leo

From: leob@best.com (Leonid A. Broukhis) Newsgroups: comp.compression.research Subject: Associative coding by George Buynovsky Date: 5 May 1996 07:05:29 GMT This article is based on the ideas described by George Buynovsky in his 8'94 article "Associative coding" and on the "Shannon" coder. In the setting of the Shannon's experiment, the test subject was asked to guess the next symbol in the text, then guess again if the first guess was wrong, etc. The idea is to guess phrases, not single symbols. That is, the "predictor" gives several possible multicharacter guesses, and rank and length of the longest correct one is recorded. An implementation may use the following organization of the dictionary: each entry in the dictionary consists of the two parts - the context and the content, where the context is unbounded and is directed "backwards", and the content is also unbounded and is directed "forwards". The entries in the dictionary are being alphabetically sorted by the conteXts. It is best done using the ring buffer. The example shows the dictionary built on a string "MISSISSIPPI": Context Content 1 ...MI SSISSIPPI.. 2 ...PI MISSISSIPPI 3 ...MISSI SSIPPI... 4 ...SISSI PPI... 5 ...M ISSISSIPPI. 6 ...IP PI... 7 ...PP I... 8 ...MIS SISSIPPI... 9 ...SIS SIPPI... 10 ...MISS ISSIPPI... 11 ...SISS IPPI... The contexts are shown up to the point of uniqueness, and sorted alphabetically right-to-left (towards the past). Let's see what happens when we encode a phrase "MISTER IS SIPPING ..." (spaces are ignored for clarity). For demonstration purposes assume the previous character was A. This will place the current context (...A) before the context 1 in the dictionary. The best matching conteNt (with the current MIST...) is content 2 with match length 3. The first non-matching character is E. The output of the coder will be (+2, 3, E), and the context becomes (...E). The current content (R...) doesn't match any, therefore a literal must be output. Now the current conteXt is (...R), placed between the conteXts 7 and 8, and the best conteNt match is 10 with length 7. Therefore a triplet (+3, 7, N) will be output, etc. Several improvements can be made to reduce the set of possible dictionary offsets and lengths that need to be output by restricting the set of entries the current content is matched against, e.g. imposing a conteXt match length requirement (this will eliminate random matches with weird dictionary offsets), and only counting characters comprising the _unique_ part of the matched entry (e.g. in the triplet (+3, 7, N) corresponding to the string "ISSIPPI" the first 4 characters are not unique, they also occur in entry 5, etc. Leo

Charles Bloom / cb at my domain

Send Me Email

The free web counter says you are visitor number