It seems that under careful consideration, more than 1 bit of information can be kept in 1 bit of storage. Note that this does not apply to >input< random sources, for which there are many general proofs (see my web page, news section, for example). This does apply to a special form of transmission. Let us send bits, 0 or 1, to a receiver. They are sent with a probability slightly skewed from 50% : p(0) = p = 1/2 ( 1 + e ) p(1) = q = 1/2 ( 1 - e ) where 'e' means epsilon, and is small. If we send N bits with a coding scheme carefully chosen to have this distribution (eg. with an arithmetic coder on the back-end of a scheme which has a skewed 0/1 output rate) then the bits themselves will be slightly redundant due to the skewing. The information sent is then: I = N [ -p L(p) - q L(q) ] where 'L' means 'log base 2' I/N = 1 - ( p L(1+e) + q L(1-e) ) I/N = 1 - 1/2 ( L((1+e)(1-e)) + eL((1+e)/(1-e)) ) I/N = 1 - 1/2 ( -e^2 + e( 2e) + O(e^3) ) I/N = 1 - 1/2 e^2 So the information per bit is slightly less than one bit, as we would have suspected. Now for the finesse part. The receiver not only gets the bits sent by the source, he also can estimate the skewing factor e, by counting the ones and zeros. To find the error of this guess we need not assume a gaussian source or anything of the sort, it is a simple fractional truncation problem. If n = pN is the number of 1 bits, then the measured p (p_m) is : p_m = Int[n]/N = Int[pN]/N Where the operation Int[] takes the rounded integer closest to the fraction pN. Thus the Int[pN] can different from pN by at most 1, so that abs[p_m - p] <= 1/N Thus, the receiver can resolve e to one of N values (eg. the middle of a range of N buckets dividing the range zero to one). This means that an overall extra piece of information is received, with entropy : I_extra = L(N) (well, this all is true as N -> infinity anyway; I have implicity assumed that n = the number of ones actually is equal to pN , the probabalistic number of ones. Of course, for any finite N it is possible that n differs from pN dramatically. Taking account for this with the proper random walk formulas would make the math intractible). Actually, on further consideration, this error due to fractional truncation is smaller than the error due to the random-walk deviation of n not= pN , (i.e. sqrt(N) < N ). However, both are O(N) so that you can choose whichever you prefer. Michael Schindler has made the good point in email that just because we can estimate the average error does not mean that we can confidently expect to resolve epsilon inside the "error bars". Of course he is right; the right way to do this analysis is to probabilistically count the resolving power : I_average = Sum[x] P(x) * H_x where x represents a certain resolvability (deviation of the epsilon realized from the input parameter epsilon), and P(x) is an elaborate combinatoric factor representing how often a sequence of resolvability x occurs, and H_x is like log2(1/x). However, all of this is unnecessary if we instead assume that epsilon is resolvable in the error bars, and then account for the times it falls out of that region Actually, we can fix this as well. Once the encoder has sent his N bits, he can compare the actual number of p's and q's with the amount predicted by epsilon, and hence the amount the decoder will predict. He can then encode the delta of the two. This will be small, and hence easy to encode. Thus rather than accounting for the errors in decoding, we simply need to count how many bits this correction-code requires. How big is the correction code? By simple arguments, it is negligible. Consider the worst coder we can think of : if e is out of the pre-set error bars (1/N resolution) then we simply send it manually. This happens 50% of the time, and we must send log(N) bits when this happens. Thus instead of gaining log(N) information, we gain L(N) - (1/2)L(N) = (1/2)L(N) This is simply a multiplicative correction, and is irrelevant for N reasonably large. There are more imporant corrections, as noted further below. (btw we must also encode one flag bit to indicate whether or not we send the correction code (ie which half of 50%) but this is small) Let me re-state what we are doing here. The encoder knows the actual probabilities p & q (we'll assume these to be real numbers for now; this does not affect the results). All the decoder knows is the number of 1s and 0s actually encoded : n1 and n0. Now decoder guesses values of p and q, p(g) = n0/N , q(g) = n1/N . As N->large , these get very close to the actual values of p and q. Let us imagine that both parties agreed on a precision of m out of N bits. Thus we round n0 and n1 to nearest multiple of m. Now, if Np and Np(g) differ by less than m, the encoder and decoder agree on the probability - that is, the actually realized effective probability agrees with our true probability (within the error m). In this case, we send an extra zero bit at the end of the transmission. If this is not the case, then the encoder knows that the decoder will make an error. In this case, the encoder sends a one bit to indicate that the probabilities the decoder will guess are wrong. He then transmits the true probability. We quote the random walk formula, and m = sqrt(N) , then we know the actual probabilities will be wrong 50% of the time. Thus, we gain log(sqrt(N)) information by knowing the probabilities; we lose 1 bit by sending the are-we-wrong-flag, and 50% of the time we must excplicitly transmit the probability. Hence the net gain is (1/4) log(N) - 1 The astute reader may balk that we assumed e was small, but have now counted the information from distinguishing epsilon in the entire range (0,1). However, if we did restrict the range to be small (such as (0,5e)) this would simply by a multiplicative factor in front of N, which would be an additive factor in I_extra, which becomes irrelevant as N becomes large. (similar to the one bit from the flag above) Thus, we find that the difference (from one bit) in the total information per encoded digit is : (delta_I/N) = [1/4 L(N) - 1 ]/N - e^2 or approximately: (delta_I/N) = L(N)/N - e^2 This is an intriguing result, and begs for a more careful analysis than the one conducted here. In particular, as N->infinity, the first term dies, and we are left only with the small correction for the redundancy in the skewed bits - nothing is gained. Of course, this is not surprising since measuring e is a one-time overall gain which disappears as N->large. However, when N is small (eg 2^4) then the first term will dominate (eg 4/2^4 = 1/4) and we have succeeded in sending more than one bit per bit. However, in this region our assumptions are invalid (in particular, the difference between e and 1 becomes important, also it must be than N is large enough that Ne >> 1 so that integer-quantization is not catastrophic). This feels something like non-equilibrium statistical mechanics : all our arguments are designed for N -> infinity, but now we're trying to explore interesting effects that might occur for N finite. However, it remains that e can be very small, so it seems there must be cases where the first term makes the net information gained positive. One nice case is N = 1/e , (that is, N just large enough so that integer-quantization of (Ne) doesn't destroy the effect). Then: (delta_I/N) = e L(1/e) - e^2 = L(N)/N - N^-2 Now as N -> large or e -> 0 , we find that the positive term does indeed dominate the sum, resulting in a gain of information. [note added 12-30-97: ] you CANNOT choose e to be 1/N as I have done here; e must be chosen by the coder based on the data which we desire to code in e. Hence, it must have a range of possible choices, and the true delta_I comes from an average thereof. See the debunking below. Possibilities for future work : 1. does this work imply something about qubits? (quantum bits). The qubit is an inherently probabalistic binary source. If we are sent a set of N qubits ( p|0> + q|1> ) with skewed p and q as described here, we can measure the skewing, and so extract the Log2(N) information exactly as above. Perhaps this means that qubits do carry slightly more classical information than classical bits. [despite the debunking, I believe there may be something to this; perhaps qubits carry slightly LESS information..] 2. See my note on the general-purpose channel (or, "how big is nothing").
(Bubble Bursted 12-28-97)
My "scheme" of making the decoder guess the
encoder's probability skewing does indeed
not work. (or rather, it does, but the
net space saved happens to be negative!!).
I had completed the calculations to show
the failure in the general case, which
I will present shortly, but let me first
review and give thanks.
I note that Jeff Epler has sent me a
similar experimental analysis in email,
showing that the typical gain is negative.
Andrey Savov and Michael Schindler have
also been of great help. There are
probably some more that I've forgotten.
Also, Hugo Witters has posted a brilliant
analysis in terms of signal-to-noise in
the transmission (some is reproduced below).
I will just quote some intuitive ideas
from his arguments:
If the skewing (e) is small, then the
numbers of 0s and 1s actually encoded (n0,
and n1) will be very close to eachother,
and the random-walk standard deviation
will be quite large. Hence, n0 and n1
will contain almost no information about
e, and the only information sent will be
in the flag coded at the end; hence we
will have only sent information about e by
explicitly coding it.
Also we may see that my scheme MUST be
wrong - if it were not we would have a
"perfect" (infinite) compressor, which of
course cannot exist.
In fact, my scheme works like any other
compressors, it expands some inputs, and
compresses others. In particular, my scheme
will compress strings who happen to have
bits skewed in the same way as e - and will
expand those whose skewing deviates from
the predicted.
Let us now see why I erred. We start with
the basic (approximate, but good enough
for now) expression:
(delta_I/N) = L(N)/N - e^2
Now, under (all of our) watchful eyes, it
seems that this means an "extra" information
(instead of "extra information" you could
say "compression") which can be small and positive
when 'e' is small. However, let us note that the
first term is always less than one. Now,
let us remember what 'e' is for. 'e' CANNOT
be chosen arbitrarily by the encoder. In
fact, 'e' must be chosen so as to encode some
data. Thus, for certain types of data to
encode, e will be small, and we will compress.
For other types of data, e will be large, and hence
larger than L(N)/N ! so we will expand.
So, now we can be more quantitative. We will
assume that the coding model is uniform : that is,
one "symbol" is assigned to each range of e's .
I will also take e to be in the range (0,1/4)
This may not be ideal, but it allows a quantitative
answer. Hence, the average information gained
(compression) is:
(delta_I/N) = Sum[e] { L(N)/N - e^2 }
(delta_I/N) = 4 Integral[e from 0 to 1/4] { L(N)/N - e^2 }
But this is actually not right, because the expression
at right is only accurate when e is small (and N is large).
(and overal factors have been dropped).
If you go through my original anylsis but make fewer
approximations, you will arive at the accurate formula:
(delta_I/N) = (1/4) [ L(N) - L(1-e^2) - 2]
- (N/2) [ L(1-e^2) + e L(1+e/1-e) ]
We may now integrate this over e, using some tricks
(mostly, L(x) = ln(x)/ln(2) , and then do integration
by parts on the ln's) we find:
(delta_I/N) = L(N)/4 - 0.1507 N - 0.5678
For N large this simply give -0.1507
We may find the N which minimizes the loss.
It is N = 26.34 , giving a (delta_I/N) of -0.100
Hence, the coding redundancy from the skewing is
far greater than the gain from distinguishing the
skewing. The best we can do is to fix e = 0,
in the coding step, in which case this method does
nothing but code the information in e with the
correction code.
One might think from these arguments that we
can do better with a small range for e, but then
less information is sent (fewer buckets) and the
net is again negative. (as has been shown by
Hugo Witters in the limitting case of only
two buckets).
Let's consider a simple model that we can try
to optimize. The encoder tries to send an
extra bit by skewing the probabilities by
(+/-e) (read: "plus or minus epsilon"). Here e
is an externally chosen parameter, the coder
sends a bit with the +/- choice. Now, the decoder
can decode that bit if the number of 1's (n1) is
greater than the number of 0's (n0), it guesses
a skewing of +e, or if n1 < n0 it guesses -e.
[small correction below due to Michael Schindler,
12-31-97]
Now, if the encoder sees that he used +e but got
n1 < n0, then he must send the correction. So,
we must flag whether a corrrection is needed; if the
correction is not needed we're done, if it was, the
decoder just uses the opposite of what he guesses,
so all we need to do is flag whether the guess was
right. So, in addition to the coding inefficiency, we
must send:
when right : flag we're right
when wrong : flag we're wrong
Now, if e is small (we'll confirm that it is
optimal when it is very small (N is large)) then
the information sent per bit is (1 - ee/2). Hence
the net information sent is :
I = (1 - ee/2)*N
+ 1
- ( P L(1/P) + Q L(1/Q) )
The first line is the information in the skewed
coded bits. The next line (+1) is the extra bit
encoded in the skewing. The last line is the
entropy of the optimal coding of the flag bit,
where P is the probability that n1>n0 when e is
positive, and Q = 1-P ;
Put dI = I - N , the "extra" information
(aka compression). Now, if e was zero, P and Q
would be 1/2 , so let us first consider this
limit. Since e is small, perturbation will be
done around this limit.
dI(e=0) = 1 - ( 1/2 + 1/2( 1 ) = 1 - 1 = 0
There is no gain. It is easy to see conceptually
why this has happened. Our encoding scheme fails
so often that we must send the correction, cancelling
the gain.
Now, the lowest order corrections in e are also easy
to compute. (to do this exactly, you must evaluate an
"error function" or worse, sum the random-walk
distribution over some range, both of which can only
be done computationally, not analytically). The
distribution of (n1)/N is roughly Gaussian, and the
effect of e is to shift the center of the gaussian
away from 1/2. Thus, to compute P and Q we must
compute the area under the curve from 1/2 to 1/2(1+e)
Now, since the Gaussian has standard-deviation of
sqrt(N) and is normalized, it must have a maximum
height of 1/sqrt(N pi). Hence the shift due to
moving the center makes:
P = 1/2( 1 + e/sqrt(N pi) )
Thus the entropy of the flag and correction becomes:
H = ( P L(1/P) + Q L(1/Q) )
= 1/2( 1 + e/sqrt(N pi) ) [ 1 + e/sqrt(N pi) ]
+ 1/2( 1 - e/sqrt(N pi) ) [ 1 - e/sqrt(N pi) ]
= 1/2( 1 + 2 e/sqrt(N pi) + ee/N pi )
+ 1/2( 1 - 2 e/sqrt(N pi) + ee/N pi )
= 1 + ee/Npi
Where I have used the multiplication property and taylor
expansion of the logarithm.
Hence, the information change is:
dI = - Nee/2 + 1 - H
= - Nee/2 + ee/Npi
= - ee N/2 ( 1 - 2/(NN pi) )
We see from the positivity of the last term that the
skewing has reduced the redundancy from the flag &
correction, but not nearly enough to cancel the loss
due to the coding redundancy; the gain is quenched
strongly by a factor of NN = N^2
Obviously, dI is smallest for e=0 , so our assumption
of e small was valid.
>> Tom Lane wrote: >You have forgotten about the coding overhead. How did the >receiver learn how to decode the symbols? [...] In any >case, the coding overhead will necessarily amount to more than >the alleged "extra" information. I did count the coding inefficiency introduced by having skewed p & q, and epsilon is transmitted explicity in the data. Tom's point may be interpretted in a more subtle manner : the fact that the encoder and decoder have agreed on a certain transmission scheme is an implicit overhead of information; (that is, they have agreed that the decoder will count the zeros and ones to estimate the skewing). However, I think that this agreement can be done in an initial communication of finitely many bits, that is the # of possible coders is not O(N), while the amount we gain IS O(N), so that we may neglect this communication. (note that in the various schemes which allow you to "sneak" one bit off of a 'random' input, we can interpret this as failing to count the implicit encoder/decoder agreement; however, these gain like O(1), not O(N)). We can interpret this as epsilon's ability to have nearly infinite precision, being a real number, thus being able to transmit a huge amount of information, limitted only by our ability to quantize this real number. I have shown here (to my surprise!) that the errors in estimating epsilon from the difference in actual probability and theoretical probability are no bigger than the quantization error in making (Ne) an integer. ----------------------- In article <67pr81$pk6@graves.maths.tcd.ie>, tim@maths.tcd.ie says... >But n bits can certainly hold more than n bits of information, >since they also hold the information in the number n. > >The additional information is roughly log(n), >which seemed to be the additional amount you got. Well, this is indeed correct, and well said. However, my method does not actually use the number N. It only uses the ratio n/N, the apparent probability. The only reason we end up with a log(N) additional information is that we are specifying the probability, a real number, and can only get as much precision as we have bits, hence the log(N) dependence. In particular, the standard tricks for getting an extra log(N) from the length could still be performed after having done my method. ----------------------- >> Chris Hillman wrote: >Are you saying that if you fix some K large, and take many samples of >sequences of K bits produced by your source and compute the ratio of zeros >to ones for each sequence, this ratio will be distributed like a bell >curve around a mean slightly different from 1/2? Right. If the sender has p(1) = p, p(0) = q, and the receiver finds there are actually n 1's and m 0's , n+m = N , then he can estimate p and q, pe = n/N , qe = m/N . Now, the full formula for n is the random walk formula: P(n) = (N!/(n!m!)) p^n q^m But when N and n become large we can approximate this with a Gaussian, centered around n = Np Hence the distribution of pe's is a Gaussian around p. As usual, the standard deviation for a gaussian is sqrt(Np/q) but since p/q = 1 to order epsilon^2 , we have simply a standard deviation of sqrt(N) in n, hence sqrt(N)/N in pe standard_dev( pe ) = N^(-1/2) (actually, as Andrey Savov astutely pointed out, we need not use the Gaussian approximation; these results are exactly the same for the exact random walk formulas) ----------------------------------------- Michael Schindler's nice conceptual interpretation: >> Andy McFadden said: --- > This I have trouble with. For an input of length N, you need to transmit > M bits of model, in this case it's 'e' or something derived from 'e'. As > N increases, I believe M increases as well; otherwise I can come up with > an input of size N+x for which I can change the value of a symbol and > not have enough precision in 'e' to detect the change. M is a parameter that you can choose. I'll explain the idea with other words, without mentioning log at all :) You have a total of X bits of information (visualize it as an X bit long random binary stream). I will not consider how you send along the length of this binary stream, you have to send it along with any compression method, even if you do just verbatim copy. You may assume X to be constant if not knowing the length bothers you. These number (X) of bits can be split in two parts: X=M+N some of the bits (M) are used to determine a model, and the other part (N) is transmitted using that model. M is constant, so no overhead involved in how much bits are encoded in the model choice. The idea is now to take the transmitted string (code for the N bits) and try to guess the model (which contains M bit of information). Since you have to resolve the ambiguity in the model some extra bits are needed. The question is: is the average loss caused by using an inexact model plus this extra infomation to determine the model (or e, which is the same) less than M (for some M, N) - if it is it would be great, if you can prove it isn't it is also great - Chales will be glad to be proven to post impossible ideas here. Charles idea is that we know a lot about the model (e) already just by seeing its output, so less than M bits should be needed on average. It may be required to make N (and thus X) a variable, but the value can be determined at the decoder) and fix the length of the message containing N bits of information - this would not change the fundamental question, although it makes the question more complicate. Charles did not post it in this general form, his choice of M was that e is accurate to sqrt(N), but that need not be. Just getting the sign of e right is just one bit and may be easier to guess - quite easy actually if you choose e large, but then the loss is large too. ----------------------------------------------------------------------- [ [ more good stuff from Mike Schindler : I've incorporated some of [ this into the main text [ hi! found a bug: Charles Bloomwrote in article <68bfou$rt1@gap.cco.caltech.edu>... > Now, if the encoder sees that he used +e but got > n1 < n0, then he must send the correction. So, > we must flag whether a corrrection is needed, and > then if it is, we explicitly send the extra bit. > So, in addition to the coding inefficiency, we > must send: > > when right : flag we're right > when wrong : flag we're wrong , + 1 bit just the flag will do; in a two way choice knowing that it is wrong leaves not much alternatives. This is one of the reasons why binary decisions only work better in this case. Another problem: if you base the probability of beeing wrong on the actual numbers of 0's and 1's (if you get 120 1's and 90 0's it is more probable that the result is right than if you get 101 1's and 99 0's) you can reduce the amount needed to signal wether we are right or wrong a little; however the net gain is still negative. So the question is not: if I choose the coding favouring the 0's what is the probability that there will be more 1's, but if I have a given distribution of 0's and 1's, what is the probability for each of both models (or more than two in the general case). then encode a symbol (you called it: flag we are right + correct value) that gives the correct model using the distribution of models calculated. The code for this symbol should be short on average, but total gain is still negative. Michael [ [ my reply: [ Ah, yes of course; when e = 0, the flag bit must simply send the extra bit we are trying to sneak. >Another problem: if you base the probability of beeing wrong on >the actual numbers of 0's and 1's (if you get 120 1's and 90 0's it >is more probable that the result is right than if you get >101 1's and 99 0's) Hmm. Yes, that is indeed true, but seems awfully hard to analytically model. Let me rephrase this for the reader : when the decoder gets a stream, he knows not only that n1>n0 (which encodes our bit), but he also knows the value of n1/n0 , and he can see that when n1 = n0 roughly he cannot be very confident in his guess, and so must 'expect' to read a correction flag about 50% of the time, hence 1 bit of correction. But, when n1/n0 = 2 or so, he can be very confident that his guess is right, and so 'expect' to read a correction only rarely. Note that you might try my scheme with e=0 in this case, to try to take advantage of the "extreme cases" of n1 & n0 that come out of a balanced model; hence there is 1 bit per bit, and we need a correction flag; when n1 = n0 all the 1 bit is in the flag. When n1 is far from n0 the decoder will guess that 1 is skewed more than 0, but since we skewed not at all, he will be wrong half the time, so this is no good. >if I have a given distribution of 0's and 1's, what is the probability >for each of both models (or more than two in the general case). Yes, well put; you get more information from this reverse-guessing procedure. >The code for this symbol should be short on average Well, the code for that symbol should be very close in length to the code for my simple approximation, since the extremely skewed cases are so rare. The right entropy is: H = Sum[ sets (S) of N 0's and 1's ] P(S) * [ P(+e made S) log(1/P) + P(-e made S) log(1/P) ] ----------------------------------------------------------------------- > a brilliant post by > > hugo.witters@tornado.be (Hugo Witters) Following is a more detailed example and calculation including the information send by the coder to correct the errors caused by the noise signal. The distinction between the bit (measure of information) and the binit (binary digit, which is an output binary symbol) should be noted. For clarity, I take a decoder where the "zero" binits received have -1V (Volt) level and the "one" binits have +1V level. The average received signal will then be 0V without skewing. The coder operates in two modes: +e skewing and -e skewing, respectively resulting in a +eV and a -eV level. The threshold of detection is set at 0V. Positive voltages will produce a one, negative voltages a zero output signal. Working with e = 0.01 and N = 5000, this would yield a useful skewing signal (integration of signal over N and dividing result by N) of + and -0.01V and a noise voltage superposed on the previous signal of 1/sqrt5000 = 0.014Veffective with Gaussian amplitude distribution. This means the skewing signal level lies at 0.707 of the standard deviation of the noise. The error rate is then 0.24, or 24 percent of the skewed bits are inverted. To correct this, the coder has to send information, say in an extra time slot at the end of each block, where a one would signal the skewing receiver to inverse the detected bit and a zero would leave the output signal intact. The average information carried by this extra time slot per block would be: I = -0.24 L(0.24) - 0.76 L(0.76) = 0.795 bits per block. The average infomation of one binit in this time slot would go to 1 bit for a 50 percent error rate (random signal) and to 0 bit for a 0 percent error rate, so this value could be used as the penalty in bits caused by the error signal. I don't see a more information effective way to do the correction. The information balance in this example would then give 1 bit gained by the skewing code, 0.5 bits lost by the redundancy penalty and 0.795 bits lost by the signal to noise level, giving an overall loss of 0.295 bits per block. This is not the optimum choice of N/Nmax ratio. When one varies this ratio from 0 to 1 the optimum lies at N/Nmax = 0. In this case the skewing detector is completely drowned by noise and produces a random signal and the binit in the extra slot carries 1 bit of information. This means the skewing system does not work at all and the extra bit is simply encoded in the extra slot. When N/Nmax increases the skewing system produces more and more information, and the binit in the extra slot less and less, but the penalty associated with the skewing coding also increases and produces an overall loss of information. N/Nmax Ratio Total Information Penalty in Bits per block 0 0 0.05 0.027 0.1 0.055 0.2 0.11 0.5 0.29 0.75 0.46 1 0.63 These penalties are independent of the value of e as long as e remains small.. If a large number of different e levels are used to code k extra bits in each block, the error rate comes so close to that of a random signal (S/N = sqrt(k+1)/(sqrt2*2^k) when both positive and negative e values are used with equal spacing), that almost all of the information has to come in via the correction bits in the additional time slots like in the above example when N/Nmax comes close to zero.
Charles Bloom / cb at my domain
The free web counter says you are visitor number