In the course of exploring "how many bits per bit", it has come to my attention that "how big is nothing" is an important question that I have never seen answered. First of all, nothing must be at least 1 bit. That is, we must send a flag for "nothing/something". If we did not do this, then we would have a channel which did nothing but transmit nothing (no, that's not a double negative!). (You're not allowed to choose different channels, since that would require transmission over the 'channel-choice channel'). Hence we need 1 bit for the existence of the data. We also need roughly 2*Log(N) bits to send the EOF or the file size of an N bit sequence. I get 2Log(N) because I assume the following coding scheme : prior to writing each symbol we code an adaptively adjusting escape : flag 0 for EOF, flag 1 for else, then code as normal. Hence, after n symbols the probability of EOF is 1/n . Thus the total output length added by this scheme is : L = log(N+1) + Sum[n=1 to N] log(n+1/n) L = log(N+1) + Sum[n=1 to N] log(n+1) - log(n) L = log(N+1) + log(N+1) - log(1) L = 2log(N+1) If there is a more efficient general method, let me know. (note that we cannot just write L in log(N+1) bits; if you wish to try to write L as a number at the head of the file you must use some variable-length scheme with an implicity probability distribution of N's; that is, you must write an EOF for the number itself). Aside/addendum : in fact, this can be done quite nicely. First, one way is to unary-code the length of the bits of N, which takes log(N) bits, then to code the length of N itself, which takes log(N) bits. This is then exactly the length required for my EOF code. However, you can do still better by using this trick recursively: instead of writing L = log(N) using unary, we code L the same way we did N : write the length of L, then the bits of L raw. We can write the length of L again with this scheme. Then, instead of 2log(N) we can use log(N) + log(log(N)) + log(log(log(N))) + ... bits. All of this stuff is usually hidden away in the operating-system's file headers, but if we wish to discuss fundamental issues we must account for this business. Now, there is a more subtle issue. We must also specify which coder we are using. If we used an "algorithmic" (Kolmogorov/Chaitin) compressor, this would not be an issue, but I consider this unacceptable because of the halting problem; we also wish to have a more general scheme to discuss the performance of other coders. Now, one solution is to have the convention of sending the data as a self-extracting executable, ie sending the decoder before the data. This is of course the most general method, and seems the best we can do (it is the only way to account for the sneaks who would put the whole universe in a static data file in their decoder!). Hence, we find : L_tot = 1 + 2log(N+1) + sizeof(decoder) + sizeof(data) But now we can cut out losses. Since we have generated this general purpose cannel, we can now make one-time agreements between the receiver and transmitter. Since this channel has existed since the big bang (when time began) the one-time communication cost is irrelevant compared to all the data we can send over it. So, what EXTRA receiver-transmitter schemes can operate in conjunction with this channel? 1. N has information. Since we've explicitly sent N , the receiver will find N. In addition to telling the decoder when to stop receiving, it can also contain information (as is well known; see for example my web page where I detail the operation of an 8 -> 6 bit coder which works on this principle (hiding information in the string length)). It is roughly log(N) bits (for N large, anyway; perhaps log(N)-1 more generally). 2. I've also now suggested that the coder can skew the 0/1 probabilities, and the decoder can figure out the skewing, and hence get another log(N) that way. See again my web page for an extensive review [ note added 12-30-97 : this does NOT work ] The conclusion is that our above result is modified (for large N) to : L_tot = sizeof(decoder) + sizeof(data) ! This is quite amazing to me; our implicit schemes 1 & 2 have exactly cancelled the overhead necessary for a general-purpose channel!
Charles Bloom / cb at my domain
The free web counter says you are visitor number