with regards from Kenya.
]]>It’s a little difficult to determine how many bits are being used per code word. It’s not like, say, Huffman coding, in which each symbol has a specific integral number of bits.
In arithmetic coding, a symbol tends to cost close to -log2(p) bits, where p is the probability your model assigns to that symbol. The actual number of bits used will be as close to that as the precision of your model allows.
So in order to restrict a symbol to a certain number of bits, you have to adjust your model so that it never generates a probability less than 1/2^bits, or something like that. Since you control the model, this is totally under your control.
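As a rough illustration of that relationship, here is a minimal sketch in Python (not the article's C++). The function names `ideal_bits` and `min_probability` are made up for this example and do not come from the article's source.

```python
import math

def ideal_bits(p):
    """Ideal cost, in bits, of coding a symbol whose model probability is p."""
    return -math.log2(p)

def min_probability(max_bits):
    """Smallest probability the model may assign if no symbol should cost
    more than max_bits bits (hypothetical helper for this sketch)."""
    return 1.0 / (2 ** max_bits)

print(ideal_bits(0.5))     # 1.0
print(ideal_bits(0.125))   # 3.0
print(min_probability(8))  # 0.00390625
```

A real coder will land slightly above these ideal figures, depending on the precision of the model.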
- Mark
]]>Wish I knew enough about it to write an article!
- Mark
]]>Cheers,
Aaron
But after some more thinking (after writing the first post), I think I understand it. Let's say |A| = 3 and |B| = 5. Since we want to map one onto the other, let's make them equal in length by "multiplying" A by |B| and B by |A|:
We see that the mapping from A to B is just y = floor(|B|*x/|A|). To do the reverse, we need to take the next y (hence y + 1) and find how many times |B| fits into it: |A|*(y + 1)/|B|. But we need to avoid the case where |A|*(y + 1) % |B| == 0, because in that case we would get x + 1, not the x we are looking for. For that purpose we subtract 1: (|A|*(y + 1) - 1)/|B|.
As you can see, my explanation is clumsy and not formal, so I would like to hear a better one if somebody has it.
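One possible way to tighten this up, offered as a sketch rather than as the article's own derivation: floor(|B|*x/|A|) <= y holds exactly when |B|*x < |A|*(y + 1), which for integers is the same as |B|*x <= |A|*(y + 1) - 1. So the largest x that maps to something no greater than y is floor((|A|*(y + 1) - 1)/|B|). The Python below checks this on the |A| = 3, |B| = 5 example; the function names `forward` and `reverse` are invented for the illustration.

```python
def forward(x, a, b):
    """Map a position x in a range of size a onto a range of size b."""
    return (b * x) // a

def reverse(y, a, b):
    """Largest x in the size-a range whose forward image is <= y."""
    return (a * (y + 1) - 1) // b

a, b = 3, 5
print([forward(x, a, b) for x in range(a)])  # [0, 1, 3]
print([reverse(y, a, b) for y in range(b)])  # [0, 1, 1, 2, 2]

# Going forward and then back returns the original x (here |B| >= |A|).
for x in range(a):
    assert reverse(forward(x, a, b), a, b) == x
```

The round trip only works when |B| is at least |A|, since otherwise several x values share one y.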
You've got one basic problem that is tripping you up: you aren't calculating the position of a given code in the interval properly.
In your example, the difference between upper and lower is 65280. So when you are trying to calculate the width of the range, you come up with 1771 and 2024 as your interval.
That's correct as far as it goes: the width of the interval is bounded by 1771 and 2024. But it doesn't start at 0, it starts at 640. So your final interval is 1771+640, 2024+640.
The math involved in dividing up the line is pretty simple, but you need to get it right. You can see how your initial calculation is not working by simply adjusting the intervals. Imagine that low is 1,000,000 and high is 1,065,280. With your calculation, the value of the character is still at position 1771 - but that isn't even in the current range! So the method you are using is clearly broken.
Figuring out how these calculations work will be more interesting when you look at the bottom and top ranges, 0/n to x/n, and x/n to n/n. For those cases you know that the new low and high values after update have to incorporate the old low, or the old high.
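To make the offset concrete, here is a small Python sketch of the update. It is a simplification, not the article's code: it treats the width as high - low, and the cumulative counts 7 and 8 out of a total of 258 are hypothetical values chosen only because they reproduce the 1771 and 2024 figures from this example.

```python
def narrow(low, high, cum_low, cum_high, total):
    """Shrink [low, high] to the sub-interval for one symbol.

    cum_low/cum_high are the symbol's cumulative counts and total is the
    sum of all counts (all hypothetical in this sketch).
    """
    width = high - low
    new_low = low + width * cum_low // total
    new_high = low + width * cum_high // total
    return new_low, new_high

print(narrow(640, 640 + 65280, 7, 8, 258))      # (2411, 2664)
print(narrow(1_000_000, 1_065_280, 7, 8, 258))  # (1001771, 1002024)
```

Because the old low is always added back in, the result stays inside the current range wherever that range sits. For the bottom symbol (cum_low of 0) the new low equals the old low, and for the top symbol (cum_high equal to total) the new high equals the old high.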
- Mark
]]>In your code you use the following formula to map the current value to a frequency interval:
((code_value + 1)*freq_range - 1)/code_range
But I can't understand exactly how that works. Can you give any hint as to where I can read more about it?
]]>No, that underflow would be a problem. However, that code in the article is just trying to demonstrate a bit of the concept.
In the actual implementation, as you read further on, you will see that the algorithm has to pay close attention to the number of bits used in the math.
In the final implementation, the range variable will typically be a long, and the number of bits used by high and low will be low enough that underflow and overflow cannot occur.
If you download and compile the source you will see that the model metrics header figures out reasonable values for these things.
Keep working with the source to improve your understanding - this algorithm is kind of challenging, getting the deep understanding in your brain takes some practice.
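For anyone who wants to see the effect in isolation, here is a Python sketch that imitates fixed-width unsigned arithmetic with a mask. It assumes the range is computed as high - low + 1, and the 32/17/15 bit split is an illustrative choice for this example, not a value read from the article's header.

```python
def wrap(value, bits):
    """Keep only the low `bits` bits, like unsigned arithmetic of that width."""
    return value & ((1 << bits) - 1)

WORD_BITS = 32  # assumed width of the integer type doing the math

# Code values that fill the whole word: high - low + 1 wraps around to 0.
high, low = 0xFFFFFFFF, 0
full_range = wrap(high - low + 1, WORD_BITS)
print(full_range)  # 0

# Code values limited to fewer bits (illustrative split: 17 + 15 = 32).
CODE_BITS, FREQ_BITS = 17, 15
high, low = (1 << CODE_BITS) - 1, 0
safe_range = wrap(high - low + 1, WORD_BITS)
print(safe_range)  # 131072

# The largest product of range and cumulative count still fits in the word.
max_count = (1 << FREQ_BITS) - 1
product = safe_range * max_count
assert product == wrap(product, WORD_BITS)
```

The point is only that the bits used by the code values plus the bits used by the counts must not exceed the width of the type that holds the intermediate product.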
- Mark
]]>Thanks for the article, Mark. It's very helpful. I have a question, though. Code in the article says:
But in that case range will be 0 (because of the overflow). Am I missing something?
]]>