That is an interesting approach to the problem, for sure. I had not considered using the hazard rate in both directions when defining my symbols. I chose to define symbols only by the number of 0s before encountering a 1, because I found that the shortest symbol then has only 20 0s instead of the 22 in the MDF, resulting in fewer bits per symbol.

Are you just theorizing when you say “… but the number of patterns are large enough that it outweighs the smaller expansion of .322 bits when totaled and compression occurs.”?

I’m not sure that is the case; I would need to see the counts for all the symbols to know whether that statement is true. My experience tells me that the bit cost incurred by symbols 0-5 will outweigh the savings from all the larger symbols combined, due to the entropy of the MDF. Only one way to find out, though. Thanks for sharing…

-Brian

If you really wanted to use arithmetic encoding, there is a way to do that with this, but I’d suggest first unifying the data and tokenizing it: make every run a context weight in a bitwise encoding, from a single 0 or 1 up to 22 0s or 22 1s. Technically you’d be using a model built on that design rather than on ASCII symbols. Once you have the frequency statistics for those tokens, you can give 1 and 0 the smallest sizes and the 22-bit runs the largest/heaviest if they occur less frequently. I haven’t done this version yet, only the one above, which is more Huffman/Shannon-like, but it is doable.
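As a rough sketch of the tokenization step described above (the function name and the `cap` parameter are mine, not from the thread; this assumes the stream is available as a 0/1 string):

```python
from collections import Counter

def run_tokens(bits, cap=22):
    """Split a 0/1 string into (bit, run_length) tokens, capping runs at `cap`."""
    tokens = []
    i = 0
    while i < len(bits):
        j = i
        # Extend the run while the bit repeats, but never past the cap.
        while j < len(bits) and bits[j] == bits[i] and j - i < cap:
            j += 1
        tokens.append((bits[i], j - i))
        i = j
    return tokens

# Frequent tokens would then get the shortest codes, rare long runs the longest:
freqs = Counter(run_tokens("01101110001011111"))
```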

Another thing: you can reduce the symbol set from 44 fields to 22 if you transform the data, so that rather than having 1/0, 11/00, 111/000, 1111/0000, etc. you have 1, 01, 001, 0001, etc., and track whether you’re on a 1 or a 0 to make it a monolithic yield. I.e.: a series like 01101110001011111 becomes 10110010011100001, which is fully reversible if you know whether the stream started with a 1 or a 0. That transformation allows an unlimited number of 1s but never more than 21 0s before a 1, so your arithmetic encoder only needs to deal with 22 symbols as variable-size binary tokens.
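The transform and its inverse can be sketched in a few lines (function names are mine; this assumes the stream is a 0/1 string and that the decoder is told the first bit). It reproduces the example given above:

```python
def to_monolithic(bits):
    """Replace each run with (run_length - 1) zeros followed by a 1."""
    out = []
    i = 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        out.append('0' * (j - i - 1) + '1')
        i = j
    return ''.join(out)

def from_monolithic(enc, first_bit):
    """Invert the transform, given the first bit of the original stream."""
    out, bit, run = [], first_bit, 1
    for c in enc:
        if c == '1':
            out.append(bit * run)          # emit the finished run
            bit = '0' if bit == '1' else '1'  # polarity flips at each run boundary
            run = 1
        else:
            run += 1
    return ''.join(out)

s = "01101110001011111"
print(to_monolithic(s))                      # 10110010011100001
print(from_monolithic(to_monolithic(s), s[0]))  # 01101110001011111
```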

If any clarification is needed let me know. Good luck. :)

James

010 0
011 1
100 2
101 3
001 4
110 5
0001000 60
0001001 61
0001010 62
0001011 63
0001100 64
0001101 65
0001110 66
0001111 67
1110000 68
1110001 69
1110010 70
1110011 71
1110100 72
1110101 73
1110110 74
1110111 75
0000100 76
0000101 77
0000110 78
0000111 79
1111000 80
1111001 81
1111010 82
1111011 83
0000010 84
0000011 85
1111100 86
1111101 87
0000001 88
1111110 89
00000001 90
11111110 91
000000001 92
111111110 93
0000000001 94
1111111110 95
00000000001 96
11111111110 97
000000000001 980
111111111110 981
0000000000001 982
1111111111110 983
00000000000001 984
11111111111110 985
000000000000001 986
111111111111110 987
0000000000000001 988
1111111111111110 989
00000000000000001 990
11111111111111110 991
000000000000000001 992
111111111111111110 993
0000000000000000001 994
1111111111111111110 995
00000000000000000001 996
11111111111111111110 997
000000000000000000001 998
111111111111111111110 9990
0000000000000000000001 9991
1111111111111111111110 9992
00000000000000000000001 9993
11111111111111111111110 9994


It is very simple to understand. If you write a program like I did that checks how many 0s there are before a 1, and how many 1s there are before a 0, you will see that there are never more than 22 0s before a 1, and never more than 22 1s before a 0. That is to say, the entire stream will never have more than 0000000000000000000000 (22 zeros) before the next bit is a 1, and never more than 1111111111111111111111 (22 ones) before the next bit is a 0. If you use a fixed bit field that sees those pulses as fields, you can compress this.
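A check like the one described might look like this (the function name is mine, and it assumes the file has already been unpacked into a 0/1 string):

```python
def max_runs(bits):
    """Return the longest run of 0s and the longest run of 1s in a 0/1 string."""
    longest = {'0': 0, '1': 0}
    run, prev = 0, None
    for b in bits:
        run = run + 1 if b == prev else 1  # run grows or restarts at each flip
        prev = b
        longest[b] = max(longest[b], run)
    return longest['0'], longest['1']

print(max_runs("0001111011"))  # (3, 4)
```

Running this over the MDF's binary form is what would confirm (or refute) the 22/22 claim.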

010 0

011 1

100 2

101 3

001 4

110 5

0001 6

1110 7

0000100 80

0000101 81

0000110 82

0000111 83

1110000 84

1110001 85

1110010 86

1110011 87

0000010 88

0000011 89

1111100 90

1111101 91

0000001 92

1111110 93

00000001 94

11111110 95

000000001 96

111111110 97

0000000001 980

1111111110 981

00000000001 982

11111111110 983

000000000001 984

111111111110 985

0000000000001 986

1111111111110 987

00000000000001 988

11111111111110 989

000000000000001 990

111111111111110 991

0000000000000001 992

1111111111111110 993

00000000000000001 994

11111111111111110 995

000000000000000001 996

111111111111111110 997

0000000000000000001 998

1111111111111111110 9990

00000000000000000001 9991

11111111111111111110 9992

000000000000000000001 9993

111111111111111111110 9994

0000000000000000000001 9995

1111111111111111111110 9996

00000000000000000000001 9997

11111111111111111111110 9998

As you see, there is nothing wrong with my counting. Every digit costs 3.322 bits per symbol. The first few symbols carry an extra .322 bits each, while any pattern of 4 bits or larger compresses by a fractional amount. There is no need to use arithmetic encoding with this, because this is a static table that both the compressor and decompressor can use.

It never exceeds 13.288 bits (four digits) to represent a pattern of up to 22 bits. The best you’ll get is a saving of 8.712 bits per numeric pair, but the number of patterns are large enough that it outweighs the smaller expansion of .322 bits when totaled and compression occurs. It is also possible to look for other patterns and extend 9999 with more sequences to squeeze out more.
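The accounting above follows from log2(10) ≈ 3.322 bits per decimal digit; a quick check (the function name is mine, not from the thread):

```python
from math import log2

DIGIT_COST = log2(10)  # one decimal digit carries about 3.322 bits

def net_saving(pattern_bits, n_digits):
    """Bits saved by coding a bit pattern as n decimal digits (negative = expansion)."""
    return pattern_bits - n_digits * DIGIT_COST

# A 3-bit symbol coded as one digit expands by about 0.322 bits:
print(round(net_saving(3, 1), 3))    # -0.322
# A 22-bit run coded as four digits ("9994") saves about 8.712 bits:
print(round(net_saving(22, 4), 3))   # 8.712
```

The open question, as Brian notes elsewhere in the thread, is whether the frequent short symbols expand more in total than the rare long runs save.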

I’ve been down that road before and it does not compute. There is simply not enough unused binary space to overcome the bit loss incurred by the arithmetic encoder. Even with a perfectly efficient arithmetic encoder achieving the exact entropy, the compressed file would only be 415238.648679351 bytes. Unfortunately, the arithmetic encoder eats up all the savings, leaving no room for assembly code or any other form of language expression. Plus, there is an error in your method of counting.

-Brian

Another way is with fractional bits, but I’ll hold off on that to see if anyone makes progress with this.

The right answers require a fresh start.
