The Folly of Infinite Compression is one that crops up all too often in the world of Data Compression. The man-hours wasted on comp.compression arguing about magic compressors is now approaching that of the Manhattan Project.
In a nutshell, a magic compressor would be one that violates the Abraham Lincoln’s little-known compression invariant: “You can’t compress all of the files all of the time”. It’s trivial to prove, and I won’t do it here, that no single compressor can losslessly and reversibly compress every file.
The easiest way to foil most compressors is with a so-called “Random File”. In the world of Information Theory, Randomness is a tricky thing to define, but I like to fall back on Kolmogorov complexity, where we define the complexity or randomness of a file as the Komogorov complexity w/r/t a Turing Machine.
Several years ago I posted a challenge on comp.compression that I hoped could be used to silence any proponents of Magic Compression, and I’m reposting it here so I have a permanent location I can point to.
How does it work? I took a well-known file of random digits created by the RAND group (no doubt at taxpayer expense), and converted that file into binary format, which squeezed out all the representational fluff.
The result was AMillionRandomDigits.bin, 415,241 bytes of incompressible, dense data.
The challenge is to create a decompressor + data file that when executed together create a copy of this file.
The size of the decompressor and data file added together must be less than 415,241 bytes.
So far nobody has been able to do this. I wouldn’t have been surprised to see somebody get a byte or two, but that hasn’t happened.
The only real rule here is that we have to negotiate exactly how to measure the size of the program plus data file. I’m willing to exclude lots of stuff. For example, if you wrote the program in C++, how would we measure the size of the program?
In this case, I would measure it as length of the source file, after passing through a resaonable compactor. The run time libraries and operating system libraries could be reasonably excluded. But that’s just one example, the rest are all subject to public interpretation.
Really, the only rule we need is that the executable program can’t have free access to a copy of the million random digits file. For example, you can’t create a new programming language called JG++ that includes a copy of this file in its Run Time library. You can’t hide the digits in file names. And so on.
As long as those rules are obeyed, the challenge will never be met.
Recursive Compression
This challenge has a special place for the variant of Magic Compressors known as Recursive Compressors. Some savants will claim that they have a compressor that can compress any file by a very small amount, say 1%. The beauty of this is that of course they can repeately use the output of one cycle as the input to another, compressing the file down to any size they wish.
The obvious absurdity to this is that if we compress every file to a single bit, it’s going to be kind of hard to represent more than two files using this algorithm. So most people in this subspecialty will claim that their process peters out around some fixed lower limit, say 512 bytes.
For those people, a working program should be able to meet the challenge quite easily. If their compressed data file is a mere 512 bytes, that leaves 400K of space for a decompressor that can be called repeatedly until the output is complete.
The Payoff
The first person to achieve the challenge, while staying within the spirit of the challenge, will receive a prize of $100.
324 users commented in " The Million Random Digit Challenge Revisited "
Follow-up comment rss or Leave a Trackback*Cracks knuckles*
*fires up compiler*
*scratchs head dumbly – wonders how long this might take*
Wow – you’ll give me $100 for doing that – um ok. I’ll get back to you – later.
It’s not just the $100. You’d be famous!
Hello!
When it comes to recursive magic, I think you might need to explicitly include the requirement that the number of cycles be included along with the actual decompressor and compressed data. Otherwise, the challenge would be easy. (But you already know this, of course, but I wouldn’t be surprised if the magic-doers don’t realise this, with all that that could entail.)
The way I’d do it is to treat each file as a string of bits, add an extra one on one end, and interpret the result as being a nonzero natural number in binary (with that extra one being the most significant bit). Then, my compressor would subtract one from that number, unless that number is already one, in which case it cannot be compressed any further. As the result must be at least one, it’s going to have a most significant one in its binary representation. Chop that one off, and take the remaining string of bits as the ‘compressed’ file.
Most files don’t compress with just one compression cycle, but, if enough cycles are applied, all files (of finite length) end up compressed to zero length.
To recursively decompress such a file, you just have to apply the (obvious) decompressor as many times as the compressor was applied. But how many times is that? “Oh!” I can imagine the magic-doers saying, “You never said the cycle count counted, too! You’re cheating by changing the rules! It’s a CONSPIRACY!!!1! Yet more proof that the ‘experts’ won’t admit that they’ve been wrong all these years!…”
:-)
I think the challenge as is is safe against this. Basically, your method requires hiding information in the compression cyle, and then using it in the decompression cycle.
I’m trying to avoid most of those tricks by asking the user to create a decompressor only – a single file, that when executed, creates the random file. So even if the user counted up the cycles while compressing the data, there’s no place to store them.
So for example, if you created a shell script or batch file that executed your decompressor, it would have one line that said:
for i = 0 ; i < big_big_number ; i )
decompress
Because that number is so big, the shell script itself would have to be bigger than the million random digit file.
You could try to finesse this by hiding the number in a huge package of 0 length files, each with a name that makes up part of the number. But we’ll nip that in the bud!
Sachin Garg has a post on c10n.info that highlights a classic magic compressor here. Jeff Fries seems to be claiming spooky compression powers that would enable him to get a slam-dunk on this million digit test – however, there’s a bit of a problem in that I think he needs some hardware to make it happen.
fifty quid? hardly worth getting out of bed.
i’ll keep the file though as it could be a good test.
>fifty quid?
fifty of your quid plus worldwide fame.
of course, you can always wait for the exchange rate to improve.
Don’t tease or Peter St. George and Hamby of Zeosync will come swirling out of the sky on giant bat wings and roost everywhere.
- truth
@truth:
I try not to tease, but it is kind of fun to watch the challenge roll on for years and years with (of course) nobody even able to submit a challenge!
Hey, Mark.
I thought you would remember when the “truth” brought Zeosync down. Seems to me they started to fold about three weeks after Sam Costello’s article in PC World.
- truth
Hi everyone!
If someone manages to compress the million random digits file by a few bytes, would it have any significance? Would it be an accomplishment?
Wouldn’t it simply mean that the specific sequence of digits wasn’t completely random after all?
@Emil:
A few people did some analysis on the file and have found some minor weaknesses in it. In theory one could squeeze a few bytes out of it.
But of course, to win the challenge, you’re decompressor would have to be vanishingly small. This will be tough to do.
So… if you did create a decompressor + data file that was smaller then the challenge file, but only by a few bytes, you will be lauded for your accomplishment, but you won’t be breaking any laws.
The flaw in the idea of counting up to the number represented in the file’s bits is, as stated, flawed because inserting that number into the source code (or data file) is exactly as big as the original.
But I like the tactic. Are there ways around this limitation? Assuming infinite computer time: factor out the primes, and write each in an efficient form to the data file. The decompressor would then multiply the numbers. Obviously, that’s extremely vast division and multiplication, but the algorithm for doing so with bit streams as the operand is probably not TOO large.
Assuming the decompressor also has unlimited runtime memory, it opens a few more options. Say instead of just multiplication we also allow for certain meta-instructions. Each entry in the table is either a bitstream or a manipulation of one of the other (preceding) entries. With addition, subtraction, multiplication, maybe division, left shift, right shift, and/or/xor, you could start to get pretty expressive.
As the size of the compressor is not in question, and the efficiency on both ends is not part of the question, the success of this approach begins to depend less on the algorithm – over an arbitrarily large enough file even a very small compression ratio would overcome the size of such an algorithm. Instead, we’d have to ponder whether or not the data file, with its instructions *and the separation of the component bitstreams (factors)* would actually be any smaller than the source set, or smaller than any source set, or smaller than only very large source sets, etc.
My suspicion is that it would not. Even merely the storage of any given prime factor from the original number requires overhead. Where the source file is considered to contain the number and nothing but the number, we must be able to store multiple streams of arbitrary length in ours.
I don’t think you could demark them in a header to the data file because the bit positions are ALSO arbitrarily long. Would you store them in 4 bits? 8 bits? 32 bits? 128 bits?
Instead, each bitstream should be preceded by its length. The length itself must have a length; to store the length, you could count each 1 up to the first 0, then use that as the number of bits to read for the length, and use that as the number of bits to read for the string. If you wanted to meta it out further, after each ‘length’ string you could insert 1 for ‘the stream begins next’ or ’0′ for read a new length string with a length of the last length. :D
I did ramble a bit. It caught my eye. Any thoughts?
The idea that there are shortcuts to counting things is appealing, but ultimately fails to bear fruit. Counting any random thing will lead to some numbers that compress, but many, many more that don’t.
*shrugs* Alright, how about this idea (I’m no expert in compression or codeing, just throwing an idea)
Alright assuming these are numbers. 0-9
why not have it search for the inevitable patterns that arise. Find the most common ones (say if 12345 appears 100 times for example) of as long of string as possible, and re-assign that to a different character
so for example:
123453726482912345375482
would be
X37264829X375482
Find the most common repeating patterns, and longest running ones. Then you have, for example, 1 saved (12345) and 100 X’s.
If the system finds there’s 10 occurances of a 250 character instance, that shortens things by 2000 characters or so no? (including your new X’s and one copy of the 250 char key)
Patterns WILL occur in any random number, especially at a million characters.
0-9 only uses a few of the veriations of binary, leaving you a lot of characters to convert them into.
Now if the program could be made small enough to fit into it, and left to run for a LONG ass time to find the patterns for all lengths and combo’s… (but processing time wasn’t a factor)… that should do it?… no?
@Digital:
You’ve done a good job of describing a class of common compression algorithms, sometimes known as the Macro Replacement Model.
It works great when some patterns appear more frequently than others. But if this is not the case, if patterns appear randomly, it takes more bits to describe and replace them than it did in the first place.
Think about it: let’s say that the string 1234567890 only appears once in the million digit file. To use macro replacement, I have to insert a macro definition in the file, something like:
X=123456789
Then I replace that occurence in the file with ‘X’.
At a minimum, that takes two chacters plus change more space in the file than it was using before. Fail.
If the string occurs more than once, you win. And when long sequences start showing up with some regularity, we describe that file as not being random.
@ Mark
Figures something that simple would have been thought of already, haha
I would be interested to see how well it would work on this, though I’m sure it’s been tried. A million is a large number, large enough that odd things start happening with probability. I would think that there would have to be quite a few 3-4 character matches, and at least a few larger ones.
You’d also have to have the program try all possible combo’s, and THAT’s the processor eater.
00…01…02… ~~~ 99…000…001…002
after each one, check number of replacements vs length of string to check total saved space. That would have to go up to 500,000 characters looping to verify every combo has been included. (now that I think about it, go backwards, starting at 999999, 999998, 999997… hate to kill a larger combo by finding a smaller one… although with unlimited time, run it both ways to be sure)
It would have to keep a record of every single one of them with a ‘space saved’ tag next to them…
Now that I think about it… since time isn’t an issue here, after it goes thru all combos, run that sucker again! who knows, maybe we’ll get to compress our new ‘letters’ we’re using as place holders… as long as we keep track of what order we compressed them nothing should be lost.
Of course, before finishing, all the non-used combo’s could be deleted out, so you don’t have a dozen strings of 0 matches, or 1 matches that aren’t used to compress.
___
bah, now I’m rambling again. I’m sure all of this is already done and been rendered useless by the next generation of compression. Still interesting to re-invent it in your own head :P
@mark
I’m going to read up on “Macro Replacement Model” some, I’m sure it will be interesting to see the completed idea. ty for the info
@Digital:
>I would be interested to see how well it would work on
>this, though I’m sure it’s been tried. A million is
>a large number, large enough that odd things start
>happening with probability. I would think that there
>would have to be quite a few 3-4 character matches,
>and at least a few larger ones.
I encourage you to work through the problem, I think you’ll find it instructive. A lot of people run into the trap of coming up with ideas like this and then not following through with actual tests to see what the real costs are.
@mark
the extent of my programing knowlage is VBA unfortunatly… and though I think it would be possible in VBA it would be,… lets just say very very ugly. haha
Reading up on different compression methods, there’s a lot of interesting stuff in there. Of course now I’m trapped in the ‘wiki-hop’ of reading every artical that’s linked to it, so I’m sure I’ll be sucked in for a few days.
You really have not received a *single* challenge? That’s amazing. And pretty convincing. Is the challenge “real”? Is it still available (July, 2008)? Maybe potential challengers don’t think that you really intend to give out the money.
Of course, your prize is “only” $100. Is Mike Goldman still offering the $5,000 prize that I read about years ago?
@Ak3la:
No, after all these years no serious challenge.
And I think it unlikely that there will be. Some hard workers on comp.compression identified a tiny bit of redundancy in RAND’s million digits, but the amount is so small that I’d consider it impossible for even the most skillful coder to take advantage of.
Are those Dollars are American Dollars or Canadian?
I think Canadian is worth more these days.
I’m still trying to make it happen.
I have another idea so I’m working on it, again.
I was curious as to the status of th project so I did a search for million digits and here is this nice site.
I should know something in about a month.
Wish me luck.
Hi Mark,
I came across this challenge while casually browsing the internet. Is this challenge still on? What’s the best compression reported to date? I tried WinZip and Winrar, both gave compressed files bigger than the original file, so I realize this is not a trivial challenge.
Anash
@Anash:
I don’t think anyone has bothered to report “best compression” on this – it’s kind of an all or nothing deal. So far nobody has been able to come up with a good algorithm for compressing this random data. Nor is it likely that anyone ever will.
Hmm…
There’s no mention of isolating the decompressor’s output to a single file. Nothing says the decompressor is responsible for identifying which output is the match, only that it’s been created. If your script simply created 415,241 files with the bytes incrementing from all 0′s, this random data should be in there, yes?
On the other hand, this sort of “letter of the law” vs spirit has inspired me to look at different techniques for hiding information; thanks A Million Random Digit Challenge.
Len
@Len:
Stop a minute and think about how many files you’d have to create in order to be sure the random file was included… that would be 2^415,241, not 415,241.
That would be a lot of files.
Alright well a month may be a little bit of an underestimate.
Still trying new ideas.
Thought I had a winner the other day but I’ve learned:
Great results… Magic compression? “There is a mistake someplace.”
I hope that make ya snicker…
Anyway Happy new Year to all of ya still trying to compress million digit file.
Ernst
Hi Mark
Here is my recipe (awnfull pseudocode).
mydigest = Digest the file with md5 (as a example)
Decompresión
iterate a from 0 to max_length
{
if md5(a) == mydigest then ouput(“File found: a”) ; exit }
outpur “error!!”
As md5 can collide maybe a two digest aproach adds more against finding a collision on the way..
Well this is akind of joke but you never said the algo has to be deterministic…
Regards Angel
@Angel:
Nice idea, if only it worked!
- Mark
Anyone else working with recoding the file?
The number of collisions of a hash function on a file storing number around 10^1,000,000 is going to be around 10^1,000,000 divided by 2^(size of hash in bits). The count of the number collisions to reach the orignal data plus the size of the hash is going to be about equal to the origninal size of the data.
If you use multiple hash functions it is more difficult to determine the number of collisions. However it would take near infinite time to encrypt or decrypt by going through every possiblity and calculating the hashes.
You could hash the orignal file, drop every Nth bit or byte, and calculate a hash(es) of the dropped data. This may (or may not) reduce the number of collisions significantly, but it would still take near infinite time to reconstruct the original file.
Never give up. Never surrender.
Hey everyone.
Helping a friend move this weekend and I needed a break so I thought to post some here and there.
Happy Valentines day all you ladies.
I’ve been working on the Million Digit challenge these past few weeks. It’s funny how looking for work lets a man focus his mind.
I’ve had a few interesting results but nothing works all the way yet.
This next week I will try a new algorithm I worked out and see if it does any better.
What did they say in the original Hitchhiker guide to the galaxy? “Keep banging those rocks together.”
http://en.wikiquote.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy
Good luck to all.
Ernst
How about...
Yay! I win! Ok, I know that this is cheating really...
@mike40033:
Yeah, kind of violates this rule:
>the executable program can’t have free access to a
>copy of the million random digits file
- Mark
Hmm... that comment didn't quite look like I meant it to. I guess I didn't read the instructions for the iG:Syntax Hiliter plugin properly........
The rule about "not inventing new programming languages" should perhaps be tightened to not allow new programming languages at all, or to require that the compiler/interpreter for a non-standard language be included in the bit count.
Otherwise, for example, I could invent a programming language as follows. The programming language is called "MikesMillionMunger". As the rules stand, the compiler for MikesMillionMunger doesn't count towards the number of bits. The language doesn't contain the file in its runtime libraries.
1) Any sequence of bits is a valid program. The two instructions are '1' and '0'.
2) When the program starts, it outputs the bits corresponding to hexadecimal 9b1b.
3) The instruction '1' means 'output the bit 1'
4) The instruction '0' means 'output the bit 0'.
Now that I've defined a programming language, I need a program to run on it. The program I will use will be the last 415239 bytes of 'AMillionRandomDigits.bin'
My program is 2 bytes less than the file, size of the input is 0 bytes, and I can argue (against the spirit, but within the letter) of the contest that my implementation of my language won't count towards the bit-count.
Mike H...
@mike40033:
As I've said before, the whole point of this is not to finesse the rules, but actually achieve the goal, so I'm not going to go crazy on definitions.
Anyone who really things they have won the prize should be able to have a program that works on the real file, then works again on the same file encoded with a randomly chosen key using DES or whatever.
- Mark
But you only require a decompressor. To make it work on the encoded file needs a compressor as well.
If I understand the challenge correctly, the whole point of this is not to 'achieve the goal' but to silence people who don't understand the pigeonhole principle. I would suggest that this requires closing loopholes in the rules.
However, it is *your* challenge, so I shall respectfully defer to your interpretation.
Mike H...
Oh it's fun and the history behind it all as well.
It's a tough challenge. Fun for someone like me.
I will be very impressed if I or anyone else does well.
Best to have a bit of humor about it all.
I'll place a side bet the winning method will be clever.
Ernst
Maybe someone here can help me remember something I've been trying to find for years...
A long time ago (early 90's or so) there was actually an ad in Byte magazine for a company that claimed to have a magic compressor that could compress anything.
If I recall correctly, they got some interesting letters to the editor and then quietly went bankrupt.
It's the only time I've ever seen this in print in a reasonable magazine, and I'd love to see it again. Anyone remember this ad, or the name of the company?
David
@deg:
Search for the comp.compression FAQ, and when you find it, I think you'll find that the company was called WEB Technologies or something like that.
- Mark
Thanks Mark, that was it. Details in answer #9 of that FAQ.
Turns out that the story played out over three years --1992-95. But, in an ironic mental compression, my memory had reduced the story down to just a few consecutive issues of Byte.
David
I am assuming that someone has already thought of this, but I would be interested in learning about why this doesn't work. Could you use a similar algorithm to the Macro Replacement Model, but choose macros that fit in with Benfords law? That is, random numbers will tend to have more 1's than 2's, etc. Could you use this to your advantage when writing a compressor?
not sure if this way has been mentioned (as i didn't read the entire list of comments)
encode the data using a prime factor algorithm:
for a string of digits "102310"
you would compute 2^1 * 3^0 * 5^2 * 7^3 * 11^1 * 13^0
you could then split this number into larger factors that you would be able to store in a much smaller filesize (IE: 450^345 + 6344^231 or something (completely made up)
This would of course be an extremely inefficient way of compressing the data and would require hardware to prime-factor numbers with multiplicands of the millionth prime number (assuming a file of 1 million digits)
I just remember reading about this code (can't remember the name now) in a scifi book in middle school.
@aloishis89:
Benford's law applies to data that is in a non-linear distribution. The random million digits should be linear.
No way this is going to be any help.
@soda:
You've described a different way of encoding the data. Is there any reason to think that it results in compression? Simply finding a new way to represent a number is not enough, it must also be a more compact representation.
Mark mentioned that a few minor weaknesses have been identified in the data file, and in theory it might be compressible by a few bytes. The problem is of course that's is hard to write a decompresser that's only a a few bytes long. For fun I tried to see how small a script could be while still doing anything meaningful.
Suppose for a moment that the data file happened to contain the sequence '134113410686196639649008076861966396490080729' starting at location 531441.
By coincidence 134113410686196639649008076861966396490080729 can be constructed/compressed using the ruby expression 7**25 (which calculates 7 to the power of 25), and 531441 can be compressed using the expression 9**6.
I'm feeling lucky! Now I can write the following decompresser in Ruby:
The program reads the file 'f', which is a copy of the million digits file, but with the sequence removed. It then inserts the sequence at location 531441, which recreates the original data. The inserted sequence is 45 chars long, and my decompresser is only 28. I saved 17 bytes!
From this litle experiment, it seems that to do anything meaningful, a ruby script needs to be about 30 chars, with 10 chars for just reading the file. I guess the problem here is that any 30 char sequence that can be calculated simply would represent an island of order in the sea or randomness, and be a significant weakness, probably far beyond the actual weakness identified in the file.
In any case achieving a few bytes of compression would not prove anything. It would simply be neat. Wether it's possible depend really on the exact nature and location of the weakness in this specific file. If the weakness is only one or two bytes, I would say it's impossible.
I was wondering if there's a way to calculate the absolute value of randomness/orderliness of a specific integer?
take 0. what's the orderliness of zero? probably undefined. or low, probably 0.
1: seems no different. what the chance of accidentally hitting on the number 1? depends on the range!
2: we might still be at 0?
3: hmm maybe we're not exactly at 0 anymore.
4: high. here it seems some level of orderliness can already be identified. 2+2=4, a kind of symmetry.
5. low. it's a prime. all primes must have a low orderliness value?
6. high, maybe higher than 4? less of a coincidence than 4, since 6 is bigger than 4.
seems the bigger the number, the more order it can contain, which seems right since it becomes more unlikely to hit on the number by chance.
is there a meassure for this? is there a way to calcualte the absolute value or it?
A question about the rules. Suppose I run the a program using:
> ruby -n million.rb random.txt >> output.txt
The program million.rb is run with the file random.txt as input, and the result is saved to output.txt.
How would the size of the program be meassured? As the number of byte inside the million.rb file?
(btw, I know the data is binary not text...)
@Emil:
Like I said in the rules, we are looking for the total size of the program plus the data file. So in the case shown above, I suppose it would be the length of million.rb plus the length of random.txt.
Creating a build of ruby that stores the data in the runtime would obviously not count.
Someone suggested a while ago factorising the number and just storing the primes. I can't see that working, since storing the primes will require as much space as storing the number (it's the same amount of information, after all). However, it did give me an idea:
The file given is random in terms of sequences of digits. It may not be so random in terms of primes. If the number is not square-free (and we have no reason to expect it to be), then you can reduce the space required to store the primes by not storing the duplicates (just storing the power required).
Unfortunately, I can't test to see if this works because the largest numbers that can be reasonably factored by modern clusters are in the 100s of bits, far smaller than this number. I think the idea may have merit, though - whenever you see the word "random" your first thought should always be "what's the distribution?" It may be a uniform distribution from one point of view, but heavily skewed from another. If you provide a somewhat smaller random number (100 bits, maybe), I'll give it a go, though. (Although it may turn out to be too small to have a large enough repeated prime to absorb the overhead of the decompresser, as small as it would be.)
if it turned out that the million random digit number was simply the factor of a few primes, you could express it by the ordinal numbers of those primes, and then indeed you would have great compression.
but imagine the luck involved for that to be true!
the folks who created this file spent a lot of time making sure it was random in every way they could imagine, but they couldn't have done a factoring check too easily.
so it could happen, but the odds are incredibly high against it.
an interesting experiment would be to start cranking out a list of numbers that *can* be compressed by prime factorization, and seeing what percentage actually meet the cut.
A number theory guy might be able to help with the answer.
I forgot to come back and check for replies, sorry!
For my method to work, the number doesn't need to be the product of a small number of primes, it just needs to not be square-free (and have the repeated prime be fairly large). The vast majority of numbers are not square-free.
It's not actually necessary to factorise the whole number, just to find a (large) square factor, so perhaps this is tractable. I have exams coming up (one of which happens to be number theory!), but I'll give it a go in a few weeks.
Having actually looked it up, rather than guessing (I should know better than to guess about such things!), I need to correct myself - it's not a majority of numbers than aren't square free, it's actually a slight minority. So, the odds of this compression method working is not great... If it weren't for the limits of computing power, it would still be reasonable, but even if this number has a repeated prime factor it will probably be difficult to find... Oh well...
Hi Mark,
Is the challenge still open? I'd like to try,
So all you want is a file (exe) smaller than this one, that outputs this file. Obv, as per the spirit of the competition.
What i am asking, is , I wont give any code /source at this stage.
Thanx
Waiting for a promt reply.
JITENDER BEDWAL
New Delhi
India
@Jitender, sure, there's no requirement that you disclose any source code to win this challenge.
Hi Nelson,
Is there any limit to the acceptable time need to be taken for the task to complete ? (AmillionRandomDigits Challenge) eg, if it acceptable if the decompression need 3 weeks time. (joke).
Regards,
Ray
@Ray:
You take as long as you want, Ray.
Hi Mark,
Can u tell me, how do we know, how much a file is compressible to, Or , how exactly do we calculate the 'Entropy' of it.
I mean, that if i consider its entropy as composed of Symbols of the 256 symbol alphabet (read 8 bit byte alphabet), it comes out something, and if i consider it as the 256*256 symbol alphabet then its different.... So, how do we exactly get to know ( any STANDARD or absolute scale), that i can say, this file is compressible to 'this much'.
Thanks.
Jitender Bedwal.
@JitenderJB:
There is no such thing as an absolute measure of compressibility.
Any measure of compressibility is *always* relative to some specific model.
And for any file, there is some model that can compress that file down to 1 bit.
So when you hear people talk about the entropy of a file, you have to know how they are measuring the entropy.
Even something like Kolmogorov Complexity is not an adequate way to define an absolute measure of the complexity of a sequence - because it is going to differ depending on what sort of machine you choose to run it on. For example, a machine that has a single instruction that can generate the million-random-digit file will have super-low Kolmogorov Complexity on that machine, but may be very complex on some other machine.
- Mark
Mark,
What i am asking is, on what scale (or basis) [if any], did the RAND people, rate the 'A Million Random Digits' as random.
Jitender
@JitenderJB:
There are a few places you can read up on the history of these digits, but the most important thing is that they generated the digits using an unbiased, memoryless source. For their purposes, this qualifies as random.
Or maybe, is it exhaustive manual scrutiny, of the 'Tests of Randomness'.
Mark,
Is there any function, which does the reverse of what BigInteger did to the Random Digits file?
QUERY ABOUT SUBMISSION PROCEDURE
--------------------------------
I have been trying to compress THE file for a while no, I hav got 'a little' (but still some) success. Please elaborate the possible combination of the contents of the final file to be submitted.
So far, I hav a .CPP file, a .JAVA file, and a .BIN file, which when run Create the EXACT File u are providing. And the Total Size of the above contents is 232Bytes less than Ur File.
I hav tried to remain strictly in accordance with the spirit of the Challenge.
Thanks again..
JITENDER BEDWAL
@Jitender:
If you can mail those files, along with build procedures, and I can reproduce your results, you get $100, it's that simple.
- Mark
Thank you very much Mark.
I'll publish (or mail) them very soon. I am not actually that concerned about the patents or something. And I am not in any way afraid of posting my code (or the THEORY ). But still, Mark, please guide me, what should be a BETTER way, (like for example Publishing it in some magazine/paper/website or something so that i don't loose the credit for it, in long run) to do it. In short of course.
And btw, i hav gone through numerous threads on Comp.Compression. I know there hav been a lot of Pranks there.
So if you like, we can end this discussion for now, and resume it later (in a week or so, when i Publish the files), but still some advice would be greatly appreciated.
Thank (yet again)
Jitender
Waiting here... for another 10 minutes.
looking forward to seeing jitender's claimed solution.
i'm curious what code was used to convert the digits to binary format?
@Emil:
See this post on comp.compression for the full Java source code:
http://groups.google.com/group/comp.compression/msg/e219c4f18a9f546e
thanks for the link to the original java encoder. for fun i tried to write the shortest decoder i could come up with:
p gets.unpack('B*')[0].to_i(2)
it's ruby, takes about 40 min to complete on my mac, and outputs the one million original digits when given the binary file as input (and disabling the record separator using -0777):
the script is 30 bytes long. can anyone beat that? :-)
Is there a restriction on the OS used? I have an idea, but it will only work correctly on Linux.
@AtomicCheese:
No restrictions on the O/S, but of course, it helps if it is something I can run to verify things.
If you have an idea, and it will only work on Linux, I am pretty skeptical.
- Mark
Is this competition still open?
I would like to give a try.
But I feel prize money of $100 is too low.
Hi Mark,
A few questions about the contest:
1. When measuring the size of the compressor / decompressor, I assume we are talking about the length of the source code in bytes, not the binary. Can you confirm this is the case?
2. Also, when including the length of the source code, whether binary or source code, is the packed size (with something standard, zip, or winrar) of the source / binary used or the uncompressed size? If uncompressed that would mean we need to use one letter variables, macros to hell, etc.
3. Is this offer really still good and nobody has won it?
@Parlance:
I'll be happy to measure your submission by the size of the source, even the size of the ZIP or RAR file that holds the source. I agree that it would be annoying to have to make the source unreadable.
Yes, the offer is still good, nobody has won, and I don't think anyone ever will. If it was me, I'd be trying to get the file inserted into the Linux kernel so I could use that to snag a back door victory.
- Mark
@keshavshetty:
Contest is certainly still open.
Winning it would of course give you much fame and acclaim, way more than $100 worth. As Matt Mahoney has pointed out, if you find a general purpose way to recompress compressed data, you would actually be eligible for at least one million dollar math challenge.
Obviously the money is simply a token - nobody is going to do this for cash value.
- Mark
i just found this Improbable Research video about the million random digits , i think its extra funny cuz im trying to compress that crazyness =)
njoy
http://www.youtube.com/watch?v=0y8Wa10Lm7c
@haggo:
That video is simply awesome! Thanks for posting.
Does anyone still visit this place?
I have a few questions.
Is this challenge impossible?
Has anyone successfully completed this challenge?
What about this:
http://cs.fit.edu/~mmahoney/compression/barf.html
Hi Noodlesgc,
About barf, reality can be found at
http://nikhilsheth.blogspot.com/2007/09/barf-compression-holy-grail-or-goose.html
Meh... If you are so confident that this will never be achieved, why is the reward only a measly $100.00? Why not a million? By offering such a piss-poor reward, you've pretty much guaranteed that no one with anything valuable will ever take part in your challenge. If someone has a real working program, their eyes will be more focused on the millions they can make by marketing their little compression program. Your cowardly offer will not capture the interest of anyone who is serious.
@NotImpressed:
First, I don't have a million dollars, sorry.
Second, the people who pursue impossible compression claims are going to do it regardless of whether I offer any prize or not. This prize is simply a stake in the ground that they can choose to shoot for.
The primary reason I created the challenge was to provide a benchmark I could use on comp.compression every time someone made a fantastic claim. It has done very well in that role. When someone says they have a great new algorithm, I just point them to the challenge and tell them to let me know when they are done.
And despite what you say, it has caught the attention of many, many people who are quite serious. Of course, none of them have managed to meet the challenge. Nor will they.
Sorry I disappointed you.
It it still open? I fully do understand the chellange:
decompressor executable + compressed file MUST BE LESS THAN ORIGINAL AND MUST DECOMPRESS TO ORIGINAL WHEN RUN.
I ma not after prize even if it is 1 million, am just interested, I have an algorithm (not yet published anywhere) that can bring it down, which I use as post compression technique to reduce data to almost 1/3rd and have tried it already, but rightnow seems a little tricky with your binary file, but yes I am 95% use it is possible and I would love to proove it with an example (only if it is still open). Machines can never ever be more powerful than human brain.
regards
Khan
@Khan:
Most certainly, the challenge is still open. Please give it a shot!
- Mark
There appears to be just one way to crack this problem, which is to find something about the data that's non-random.
But if you succeed in that you still haven't solved the actual problem (although you'd win the $100), since the data file wasn't fully random.
Therefore, as long as the data is as near to fully random every way you analyze it then this problem cannot be solved.
@Robin:
First, there are many people who think that they can write compressors that work on all datasets, regardless of randomness. This challenge is their opportunity to prove it - of course no one has done so.
As for a sequence like the million random digits - regardless of how random it looks, there may be a short program that can generate the sequence - it is not possible to refute this due to the halting problem. So no matter how random it looks, there is always a chance that it could be compressed.
- Mark
Been a while.. I worked a long time and then again fell into the hole of hopelessness..
You know i should get it all going again..
This last round gave me three unique encoders so it was worth the effort.
What I love about this is when i get within striking range, and I have been there a few times, it seems I run out of storage space.
I'm always stuck with one or more bits of information I need to store that i have no place for.
Now that I think of it.. i did have a new idea on on old encoder I didn't work out yet..
But nice to say Hi! I will never quit trying I'm sure..
@Ernst:
It might be impossible to solve, but this is unprovable, so it is not necessarily a fool's errand to continue. Kolmogorov shows that we cannot prove that a given program is the shortest one to produce a given sequence, so there is always hope.
Needless to say, though, I don't have much faith that one will be found.
- Mark
Has there been any real submission for the challenge, other then scams??
@Zoran:
I received one submission, but it was just a waste of my time - I guess the guy didn't really understand the contest. He submitted a program that did some kind of compression, but it did not meet the requirements to win the prize.
I don't ever expect to see a valid submission.
- Mark
[...] (Before you read further I suggest you to read Mark Nelson “The Million Random Digit Challenge“) [...]
Loss less random data compression – Is it possible?
The article I posted in my blog - http://blog.adityon.com/2009/12/random-data-compression-is-it-possible/
Sorry I couldn't post here for 2 reasons.
1. Blog is too long
2. I want the user comments to be available at my blog.
New article added at http://blog.adityon.com/2009/12/random-data-compression-is-it-possible-part-2/
Well Guys.. It's a real mental challenge and I believe that is why I phase back into it from time to time.
I am now working on an encoder that has some promise in recoding the data. I have a ways to go since I decided to start from scratch and prove each step as I go.
Well if there is a boundary we cannot cross to information then that is the way it is. I've been within one bit of one bit a few times. I know what that boundary is like first hand. Thanks to the great spirit of this challenge. I have enjoyed the time spent.
Hey Mark I have been checking pattern with your ARI from the BWT program. Is that as good as any? Do you recommend a different one?
I figure is there is pattern that ARI program will see it.
Well to all who are still smacking the keys in 2010 on this challenge I salute you! I will be at it for a few months with this one.
I have coined the term Dark Information to reference any information that can be assumed in a system.
This was inspired by the idea of an observer in quantum system terms and one encoder with a special state that has no spare bits if I want it to "compress" I need that external reference to signal the state of the system. It is an one to one but I wanted it to be Special :)
Well this is the basic representation of million digit data. Not that this form of it is anything special; it's not. It's the kind of reality I see over and over again in all the ways I have expressed the data of the million digit file.
Not any savings to brag about so far.
1 : 829609
2 : 415167
3 : 207993
4 : 104024
5 : 52223
6 : 25717
7 : 12968
8 : 6475
9 : 3197
10 : 1627
11 : 769
12 : 420
13 : 212
14 : 92
15 : 37
16 : 29
17 : 7
18 : 5
19 : 3
20 : 1
21 : 1
Any Ideas?
I've posted a short article on Dobb's Code Talk dealing with this challenge. Some of you might find some interesting ideas there:
http://dobbscodetalk.com/index.php?option=com_myblog&show=The-Million-Random-Digit-Challenge.html&Itemid=29
Thanks Mark..
Imagination is more important than knowledge...
Albert Einstein
US (German-born) physicist (1879 - 1955)
I realise this is a dumb question, but.
Don't you need 4 bits per 0-9 digit?
1million * 4 bits = 500,000 bytes.
So how come the .bin file is 415,241 bytes?
@thicky:
There is a form of coding numbers called BCD. In BCD, you encode two decimal digits in a single byte, just like you describe.
The million digit file does not use BCD coding - it is a single long string of binary digits. If you do the math, you will see that a single decimal digit takes approximately 3.32 bits to encode. So a million decimal digits takes 3.32M bits, or 415K bits.
If this doesn't make sense, start Googling for computer numeric formats.
- Mark
@thicky: if you use Binary-coded decimals (http://en.wikipedia.org/wiki/Binary-coded_decimal), you would be right. But converting a decimal number to a binary number is a bit different. Let's convert 11 in decimal to binary: 11 = 8*1 + 4*0 + 2*1 + 1*1 = 2^3 * 1 + 2^2 *0 + 2^1 * 1 + 2^0 * 1, so 11 in decimal is 1011 in binary. In BCD it would be 00010001 More information on http://en.wikipedia.org/wiki/Binary_numeral_system
@Mark @Ben
Thanks - makes sense now I think. The file contains the binary representation of a million digit number, not the binary representation of a million digits!
Happy New Year!
Hey.. I tested the new encoder and no real surprise it turns any file into random data as far as Zip is concerned.
LOL it is a one to one encoder but still no magic compression.
That must be Codex number 25. At least this one doesn't have any unexpected states like one I have does.
There are systems out there. Keep banging those rocks together!
Mark - in response to your article at Dobb's Code Talk: The normality of pi has yet to be proven. That is, it has yet to proven (although many mathematicians think it is probably true) that pi contains every sequence (actually, normality is a little stronger than that, but I don't think the weaker version has been proven either). Your Magic Function, therefore, may not work (but it probably will).
For more information, see:
http://en.wikipedia.org/wiki/Normal_number
@Tango:
As I said, I am obviously not a mathematician! Thinking it is true and proving it is true are obviously two different things. Thanks for the pointer!
- Mark
Pi is an data set then until we know more.
That works for me.
Hey Mark are folks looking for that Holy Grail of Data Compression Numerologists then?
Oh I'm hooked on trying that is for sure. Anything that helps the imagination.. Thumbs up..
There was a story I remember about a substitute teacher who teaches one Kindergarten class one day and a High School senior class the next.
Kindergarten kids when asked to tell the teacher what a dot of chalk on the black board was. They replied with enthusiasm one answer after another. A Sun, the top of a telephone poll.. a... a... and the answers poured in.
The High school seniors when asked the same question were mostly silent with a lone condescending voice or it's a chalk dot dweeb! Coming from the back of the class.
The moral of the story is love your inner chalk dot.
Hey Did I tell you guys I can compress any binary number by one bit?
Drum roll....
@Ernst:
There are numerous ways to do as you are describing, but they all rely on hiding information. You of course can't compress any stream by one bit without actually storing that bit elsewhere.
- Mark
Well.. This is meant to be on the Data Compression humor side.
So If I blunder on the terms or other proper defines please correct me gently.
Again try to see this as Data compression humor and not a Big Brag.
I need to say that all bit strings except zero value ie 0 or 00000000
-----------------
If we take a string and I will just type and make up a data
1010101 -- We can show how value and symbol can be exchanged and encoded.
1010101 Starting with a limit of length of MSSB ( remember this is humor ) Ask yourself what must we subtract at the location of MSSB - 1 to get the MSSB to reset?
In this case only a single set bit at the MSSB - 1 location resets the MSSB.
1010101
-0100000
-------------
so that is 0110101 now.. Choose a parity in this case a simple 0 = 1 and 1 = 2 works so the first symbol is a zero.
repeat and you will end with the value of one which is Dark Information we assume when decoding the Symbol string. we start with a value of one and not a value of zero
So, in humor, I can reduce all greater than 0 binary values by one bit.
:)
it's 1010101 = 010101.. Where 0 and 1 are symbols rather than value.
Poo you say just drop the MSSB? well if I change the symbol meanings then it's 101010..
It's okay.. Just a little Data compression humor. I compressed all greater than one values by one bit..
> 0 duhh.. LOL
Wow Tough crowd..
Hey maybe you guys see a way to manage that parity flip issue?
Oh well I will log into Random Data compression and chat.
Happy new Year all..
And again:
Here's my algorithm. It relies on parallel processing. Well, parallel universes to be precise.
Theoretically, that should do it in some universe (because the data file was only read in universe0 and probably doesn't exist in the one which returns the file). Of course your interpreter or executable would have to exist outside the universes, which could make it hard to invoke!
I have an official update and I'd like to ask your position on this, Mark.
I have had the good fortune to discover a bijective encoding algorithm.
The size of the decoder is under 8000 bytes.
The down side is to recode a packet the size of million digit file, which is over 3 million bits, it takes nearly 48 hours using a single core at 100% This version wouldn't divide over multi-cores.
My question is: If I find a file "some place down or up the line" sequentially that will compress, at least that 8000 bytes, with a publicly available data compressor will that qualify as compressing the million digit file?
Thanks
Ernst
@Ernst:
I'm not sure I know what you mean. If you are suggesting that you might find a sequence in the file that compresses by 8000 bytes, then you should be able to construct a decompressor that meets the challenge.
- Mark
@danielnash:
I'll let you know when I am able to test your code. Don't hold your breath!
Hey Mark.
I am able to encode or decode the entire file recursively here. I'm confident it can "go" many iterations in both encode and decode directions but, sequentially only.
D3, D2, D1... E1, E2, E3...
There may be a diamond in the rubble but, it's worthless if you don't buy it.
Just wanted to see where you stand on this.
A little "Kobayashi Maru" Cheat on the file but it's in-line with my personal theory that there is a "quantum" of information involved and the symbolism representing it is non-permanent.
The new files are exactly the same size as the original million digit binary you provided. I am looking for a way to compress but the best hope at the moment is recoding.
I can learn to write a traditional data compressor but I will be learning from the bottom up. Why reinvent the wheel right?
So, my question, relating to recoding the file is; how would we score that. Would it be 8000 bytes + the data compressor? Maybe we might waive the size of programs all together since the codex is bijective and encodes or decodes any binary data and a publicly available data compressor is common enough? Wouldn't that be nice...
Files don't have to be encoded with "Wave", the codex code name, first to be decoded.
Maybe recoding the file is out of bounds?
You are fair so I respect your rules.
Ernst
Hey everyone.
I have an update: No luck on finding any exploit on the million digit challenge however I am pleased to announce I discovered dynamic Unary encoding. It is the basis for a bijective codex I code-named "Wave" since Wave encodes down to and back from, two bits.
I don't know if this is a self discovery or a discovery in general so I could use some feedback on the concept.
I have posted in the Beginners thread of the Usenet group alt.comp.compression.
Mark I placed this here because of two reasons. 1.) I have your challenge to thank for the focal point and 2.)alt.comp.compression is really slow these days. No one is even BS'ing. I hope to make use of common knowledge of those in the know. :)
So Questions; the concept is dynamic encoding where data is in motion rather than static and, has anyone said or read anything on this general concept of recoding bijectively?
Just what to look for will be helpful and I'll bet it's more appropriate to reply in the alt.comp.compression group to the thread Beginners Thread.
Thanks Mark.. I am trying to find a way and sharing what I find. It's a challenge and I like that.
Ernst
Hello I want to stay anonymous and I want to explain why no one who is serious will actually do this. To begin I was reading over some old AIM (Instant Messenger) logs from 1999 based on what I and this german kid used to talk about.. now I am 25 years old and still don't understand half of the things he used to write to me but it all comes down to this, at the age of 15 as he states and checked by his DOB profile, I cannot get in touch with him anymore but he lived a crazy life don't know if he is in jail now or dead. So let me explain a few things he was a social outcast he used to talk about how easy he could control girls without even talking to him just by clapping his hands and staring at them he would get them to have anal sex with him. Don't stop reading because of that part it's funny I haven't understood what anal sex even met when I was 15 haha. He was socially awkward to talk to becasue of his vocabulary I can't say I was the best influence on him either when ever he used to talk in big words I would say just thats awesome and laugh and thats how it conversations ended. He used help me in magicially turning any shareware program into a freeware and tell me to enjoy it =]. Now I figured out he was a hacker. He also used to talk about how he takes some kind of drugs which induced attention deficit disorder which was half of the key and the other half was empirically (still have problems understanding what this ment) he needed to become into a millionaire by the age of 17 he never logged in AIM after that point he did mention he was rich. I don't know what made him into a hacker but he could take any program apart and recreate the source code to it in less then a week thats what he did to turn RICH and I was a big part of him becoming rich as I told him about this cool software which can detect letters in pictures like a human brain back to text now called OCR (optical character recognition) it was pretty primitive at that time and was a shareware, he quickly made it a freeware for me and continnued to turn it back to source code and work on improving it, later patenting his improvement at a VERY young age i may say 16!. Not sure exactly what he patented it since it was already patented I believe? but he started to license it to websites which were for blind/color blind people, which he later said after I asked him that blind people were definetly not using them it was for automated mail hackers. Thats pretty much the story I have logs here which are still VERY cryptic for me to understand and I'm pretty dang mature. Most of them talk about compression and life and a bunch of other things about space and aliens and things I found useless. He states some funny things he must of been a bit immature back then like he writes that life is just a closed circuit with in a cycled loops. I think he ment environmental evolution? rofl If I could still talk to him I'd comment about that by saying yes life is a cycle :P but its no dang circuit its but its a process of SEX, baby then and age which gives wisdom lol. His compression theories talked about how in life the compression take many cycles to unpack and it may take years for it to fully unpack things that could be done in just a month using a CPU. Haha yah he was pretty nutty talked about evolution I am guessing. I won't go into much details here because I'm still trying to understand half of his theories maybe I could patent some of them and get rich myself.
But I wrote all this to tell you that hackers could turn programs into source code so don't risk it even though hackers may be crazy like this guy but they exist.
I'm learning Visual Basc 6 and I may be able to recreate what this german guy was talking about near the future at the moment I'm struggling with bitwise stuff and loops at the moment it's progress.
If lets say someone did submit a program that qualified for the challenge, what are the steps you would take to submit it to the scientific community for verification, and so on??
@joe:
Verification would be done by me, maybe with whatever help I can get from the folks at comp.compression. Verification is not a matter of science, more of forensics.
Now, if you choose to disclose the algorithm, that opens the door to more interest from other people.
- Mark
Wow we took a turn for the Surreal.
Hey Whatever on the long Story.. I can get it but I doubt anyone else here would be able to get the story..
Anyway.
Update.
I now understand how to recode all binary strings 1 to 1 under simple integer count. I'll be looking at one exploit that came to mind, next. I only hope I don't become homeless here too soon. The Job market is still oppressing economic opportunities.
Does anyone recognize this integer sequence? 13,128,23,217... ? It's my Q0-B.
Just a shout out to all you other would be Nut Cases trying this challenge.. Catch me if you can!
Ernst
Hey Mark,
where i have to go to get my 100,- dollar if i had a solution for this problem?
@lilalula:
You don't have to go anywhere, I'll get you the $100 wherever you are - at least if you take Paypal!
- Mark
Hey Mark,
OK. What is about my idea?
I think i can compress any kind of data(even the AMillionRandomDigits.bin file). Well not endless but to a smaller amount. The compression is effective only on files with a certain minimum size. The bigger the file the better is the compression rate. If you're interested let me know. I'll give you some more detailed infos on this.
@lilalula:
Hey, the rules of the contest are right out there - don't tell me how you are going to do it, just do it.
You write a program that can compress any file without resorting to various cheating tactics and you'll be famous, don't worry about it.
- Mark
Hey Mark,
How are you today? So i'm back and i found a Way to do it. I was paying arround with the "AMillionRandomDigits.bin" file and it worked fine. Also i tryed to run some more tests with my program and i found this here:
http://www.stat.fsu.edu/pub/diehard/
This is a cd with 4.8 billion random bits in sixty 10-megabyte files. You can go there and download this files to see for your self.
OK. here it is:
I can save 16 MB(16777216 Bytes) data(ANY data) in a file with a size of 11296 Bytes. My program don't compress the original data. All i store is the information in the original data. It is mutch like tranforming it to something else. It is like tranforming energy from one to another form. Like making heat out of wood or electricity out of wind.
During my work on this project i realized that this is more than 100,- Dollar worth. So you can keep your bugs. If you want proof for my program than look out for a project with name ALICE. Also i can meet you to show you how it works. I hope you understand that i can't send you this program.
cu
lilalula
PS: for more questions: you got my email.
@lilalula:
It is rather trivial to prove that you cannot save any 16 MB file in an 11K container. Sorry, the pigeonhole principle shows that you are incorrect without even breaking a sweat.
The purpose of the challenge is not so much to get $100, it is to prove to the world that you can do it. I don't think anyone is going to believe your boast until you actually prove it.
- Mark
Hey Mark,
Well basicaly you are right it is not possible to save 16 MB in an 11KB container. As i sayed this is not what i'm doing. I'm not trying to save the 16MB content somehow in a smaller container. All i save is the information from the 16MB.
OK, i will make you an "dekompressor"(i don't like this word because it's not right) and i send it to you. You can use it ONLY to restore the original file. After you have the program you send me a file. I'll send the container file back to you. you can use the programm and restore your original file. Would you be happy with this as proof?
Hey Mark,
It's me again. You know that everybody was 100% sure that the earth is the center of the universe? Until 1610 a Guy in Italy(Galileo Galilei) profed that this is not right. He almost got killed for this. This is just one example of our way to understand things. It is often just simpler to belive what we don't know.
@lilalula:
Yes, your proposed proof is fine with me.
>Until 1610 a Guy in Italy(Galileo Galilei)
>profed that this is not right
This is a nice little story, but of course it was Copernicus who was credited with advancing the notion of heliocentrism, not Galileo.
- Mark
looks like somebody did homework...
Send me an e-mail so i get your e-mail.
Found another claim for recursive compression:
Even "explains" how it works :p
http://www.recursiveware.com/
@TonioRoffo:
Awesome site, thanks for the link. Looks like the guy is having trouble getting the final implementation complete though... Think I'll contact him.
- Mark
Well, I looked at this. Curiously, as I wrote pictorial representations of the phenomena described, I came up with a pattern that fitted the concept "a million random digits" and "so-many bytes of (supposedly) -incompressible 'random' 'data'" BEFORE I saw the direction Mark's description was going. This suggests my method so far is consistent with what Mark describes.
I found a very interesting pattern. I then intuitively found a candidate method to fulfill the "The challenge is to create a decompressor + data file that when executed together create a copy of this file."
I then altered my method to allow for the requirement that the information be in the form of "bytes".
Curiously, BEFORE I READ of the notion of "negotiating how to measure the result etc.", I produced the notion "so you get an approximation of bytes (cannot measure it))".
My intuition suggests "if write the program in C++" one could measure it with (say at this stage) any other 2 programs...?
Of course, the technical details of what I found could do with sponsorship more than $100 as I could do with a freehold house!
The alleged problem of "recursive compression" is apparently "solvable" ("what problem? " one may ask......?)
by a simple but very clever technique- true there is a natural "limit" but this is very low it appears.
The general subject area of ultra "compression" is what I call "space computing", it has possible ramifications for the study of astronomy.
Dr. Wong seems clued up;
5-d floating characters
nice work I think
yes Sir.
I can go much further apparently (If higher rate of data storage minimisation of container ( ) required)
with hyperspace bypass technology
entropy "vanishes" in a very delightful 'way'
!
Is this challenge still on? I love the idea. However, $100 is a very small incentive for a standing challenge 10 years old. The Clay Math Institute (http://www.claymath.org/millennium/) offers a million dollars. How about submitting this conjecture to them? Or, maybe you would be willing to up the incentive to say $10,000?
@jjdawson7:
Yes, the challenge is still on. Many have tackled the problem, but of course, nobody has even come close to defeating it.
I would love to offer a million dollar prize, but of course, I don't have a million dollars. And the Clay Math Institute would never have a challenge like this.
Anyone who manages to beat this problem will certainly get fame and acclaim, and I will do my best to help them.
- Mark
The Clay Math Institute wouldn't have this kind of challenge, but James Randi might accept a solution to this problem as winning his "One Million Dollar Paranormal Challenge" (http://www.randi.org/site/index.php/1m-challenge.html). One would need to compress it by a significant amount, though, not just a token reduction as you require.
Quote: "...nobody has even come close to defeating it."
I doubt this is accurate; I quite likely solved your challenge, and many others (such as all seven Clay Institute Millenium problems, and "the Goldbach Conjecture"- all these connect to a revolution in science that may be as profound and far-reaching as the Copernican Revolution.)
It is time "skeptics" stopped being so quick to discount these possibilities- to realise "the stone that the builder rejected possibly has become "the corner stone", as the saying .... ............... "
The reason why the Clay Institute problems are so "hard" is perhaps the manner in which the construction of these problems interacts with how mathematicians do math.
As my living conditions are very difficult in several ways, I would prefer to find a sponsor or commercial avenue to releasing what I have (apparently) discovered.
I now have maybe over 14 different ways to look at this discovery- if it was in error it wouldn't sustain so many perspectives?
I need a salesperson to find me a deal
.
@Alan:
> I quite likely solved your challenge
You "quite likely" solved it? Alan, either you did, in which case you could provide proof, or you didn't. There is no maybe about it.
The challenge is still open for you to solve, but until you do so, perhaps you should stop posting comments and get to work.
- Mark
I worked the apparent 'solution' on a piece of paper. That is the "proof". For intellectual property reasons, and severe living discomfort reasons, I consider it unwise to reveal the contents of the piece of paper without using a non-disclosure agreement and dealing with a genuine potential sponsor or licensee.
Surely you knew this would likely happen if someone solved it? Who can afford to give away that potential new technology?
I can tell you, the piece of paper includes drawings, i.e. patterns.
It was a rather strange problem, but was it was nice to find what appears to be the resolution.
@Alan:
Sorry Alan, you are wrong. You did not solve it, you will never demonstrate it, and you are living in a world of delusion. I've seen too many like you to be fooled.
You will always have some excuse for not demonstrating your results, and I expect 20 years from now you will still be whining and begging for money like you are now.
- Mark
Mark Nelson, EMI said that "guitar bands have had their day", and did not sign the band known as "The Beatles".
You have no evidence for your claim that I did not solve (or whatever) your challenge. You are apparently not facing reality- that people do not necessarily give away this sort of knowledge.
You appear to be living a self-fulfilling prophesy of destructive skepticism....... ???
A large pool of knowledge now exists, already many pages have been written for a scientist who has signed a non-disclosure agreement.
What I know is on the market.
Got to hand it to Mark. What he lacks in humility he makes up (in spades) with smarminess.
Alan, you seem to be misunderstanding the rules of the challenge. You don't need to reveal your method. Just send Mark the compressed file and the decoder (as an exe, no need for the source code), nothing else.
I don't know much about computers, this technology is (apparently) a massive jump in concept from how computer experts are thinking (though they get little bits of the idea at times)
If a file "writes its own (perfectly tailored) code", you automatically have a fantastically efficient use-of-space. The "decoder" and the ""file"" are one- a hyperspace dynamic.
This is (apparently) very new system (?)
The file was already a code- this an object-recognising system- you don't "decode" a file like this- you "upscale it"
All the space-efficiency-storage-enhanced file needs is "to see itself "in the mirror"" (laser interferometry)- one converts the file into space-efficient storage system; in doing so you create a 'run triangle' (inverted location) the file exists now "in 2 places at once".
To re-see the file as originally depicted, requires running it past 5 (or 6) directions more-or-less "simultaneously" - the easiest way to re-calibrate the file is a dot matrix calculator gizmo (like using a cross-word puzzle to unravel the location and links between each word (you have a jumble of words and the shape of the puzzle- now can find where each word fits)).
the prize is $ $ 100 or $ 100 million U.S. dollars?
@ademsam:
$100. I don't have $100 million.
- Mark
Not worth wasting time on a prize that I can Mediger up the street.
@ademsam:
>Not worth wasting time on a prize that I can
>Mediger up the street.
Not worth wasting time on a challenge that you have no chance of winning!
- Mark
I am still working on things Mark.
I found a way to drop bits but, I will have to work the program out to see if there is actual data reduction when all things are considered. It's possible that it could come in under the wire.
This design is a much more challenging structure then anything I have attempted before.
So it's a win-win situation for me. I get to improve my program writing skills and see if this algorithm can meet your requirements.
It is an extension of that dynamic unary encoding I was chatting up earlier.
Best of luck to everyone.
I'll check back in later this year.
one million bits is tooooooooooooo big
how about a nice manageable 1k of bits ?
@Mark
Fun contest... do you think its solvable if P=NP?
@dmfdmf:
>do you think its solvable if P=NP?
Matt Mahoney shows that if you can compress random data, then you can solve NPC problems in polynomial time, which would mean that P=NP. So that's kind of the same thing...
Hi Mark,
could you give me a pointer to the paper by Matt Mahoney which shows compressing random data would solve the P=NP problem.
Thanks.
@ps:
In a post on comp.compression:
http://groups.google.com/group/comp.compression/browse_thread/thread/965bb085438e33db/2dc4b259c0c81503
"random data" : a mathematician agreed with me that "there is no such thing as "randomness"".
Though actually, considering the notion of "compressing random data", "random data" could be described as "not data at all".
"random data" is NOT "data" I submit; "random data" is "an OBJECT".
There is nothing to compress.
"random data" is not about something, it IS "something".
I have apparently solved, at least in a manner of speaking, all 7 Clay Institute Millenium problems (including "P vs NP"), and the Goldbach Conjecture.
I am fed up with being so poor when the value of these apparent discoveries is so helpful !
If you want to "compress "random" data, you will first have to expand it: this results in ____________________ (censored) which = "the Goldbach Conjecture" (or "P vs NP quantisation) To "solve "P vs NP" you need "quantum electro mechanics" ...
Hello Mark,
I have a question concerning the nature of the challenge that you have set.
As I understand it, the challenge is simply to provide a compressed file with a decompressor whose combined total size is less than the original AMillionRandomDigits.bin file, i.e. less than 415241 bytes.
I don't know, but I suspect that this challenge cannot be met.
My question is this: Why does the decompressor and compressed file combination have to be less than 415241 bytes?
Given the random nature of the AMillionRandomDigits.bin file, would it not be, at least in principle, sufficient to show that one can compress/decompress the random file to a few bytes less than 415241, regardless of the size of the decompressor?
Why does the size of the decompressor matter?
Is it simply to ensure that the missing bytes are not encoded within the decompressor?
Is it to guarantee fair play, or is there a reason, in principle, why the decompressor has to be smaller than the number of bytes saved by compressing the random bin file?
If one puts aside the idea of trying to cheat, would there be any worth in a compression/decompression algorithm that could, regardless of its size, compress this random binary file by a few bytes?
I have no idea whether the challenge could ever be met, even if the only constraint was that the random file had to be compressed/decompressed by a general purpose algorithm.
Could you please explain the reasoning behind the constraints set in the challenge?
I am sorry for asking what may be a naive question.
Thanks
ps
@ps:
If the decompressor could be any size, it would be trivial to keep a copy of the million digit file in the compressor and then to compress it down to 1 byte, or even 0.
If somebody actually was able to compress this file, and found that their code was too large to make a go of it, I'd be happy to create a file that is 10X larger and repeat the challenge.
- Mark
What I have taken from this challenge is that we can try. We can look at this in new ways and maybe innovation will happen.
This maybe a think outside the box challenge where imagination is more important than education or it is folly.
Time will tell.
Working on a one to less than one system here. meta-data may outweigh net reduction tho.. Cheers!
What about this (only theoretically, would be far too slow):
have a function checkme() which goes through 415241 bytes of data and calculates some kind of hash or whatever and then returns TRUE or FALSE.
The function checkme() must return TRUE if the data == AMillionRandomDigits.bin. For a lot of other data it will also return TRUE, but for some (hopefully as much as possible) it will return FALSE.
The compressor does a loop through all possible data of 415241 bytes length and calls checkme() on the data each time. If it returns TRUE a conter is increased. The loop is aborted/done when data == AMillionRandomDigits.bin. Now the counter is the index which represents AMillionRandomDigits.bin: of all possible data of 415241 bytes length it's the n'th for which checkme() returns TRUE.
The decompressor program stores that index alongside it and does a loop through all possible data of 415241 bytes length. It calls checkme() and if it returns TRUE it increases counter. If counter == index it stops and outputs the current data.
If decompressor prog size + index size needs more than 415241 bytes modify checkme() in such a way that the resulting index for AMillionRandomDigits.bin can itself be compressed (or represented in a short way) enough such that decompressor prog size + compressed index is
@flott:
This is pretty much a standard implementation of Magic Function Theory. Start simple and let's compress 16 bits by hashing down to eight bits. Further, let's say this is a really excellent hash function, and every eight bit hash covers exactly 256 input files.
So now, for each eight bit hash, we have 256 possible files. So to specify a given input file of 16 bits, we need an eight bit hash, plush eight bits of extra data.
Net result: no compression.
- Mark
Mark,
I'm sure this won't work, but for the life of me, I can't think why. I'm trying to understand your reply to flott, but also having trouble understanding your point.
Take the number, and hash it with a variety of hashing/checksum algorithms. For this example, let's use SHA1, MD5 and CRC32.
The "compressed file" will simply be the output of all three (or however many you choose to use) concatenated together.
The decompressor would simply (HA!) try every single input permutation (for a slight increase in expediency, you could also concatenate the length of the input number, so it doesn't have to try numbers below that length), and check if that test number hashes to the same MD5, SHA1 and CRC32 checksum as the input file has provided.
While I understand that due the pigeonhole principle, this can never work perfectly, to me it seems that there should be a way that this works (albeit very very slowly) for most numbers.
Or (as I'm starting to think more and more), as the output of this "program" can never be larger than the input, the pigeonhole principle applies, and there would always be multiple collisions in such a system.
Thoughts?
Thanks!
Another thought (unsure if it actually matters or not).
Since you're bound to find collisions in any combined set of hash functions, could you not just also include which "hit" is going to be the correct one?
In other words, when you're "compressing" the file, also attempt to decompress it. Since you have the source number on hand, you can easily determine if the answer is correct or not. Each time you find a value that collides with all of the chosen hash functions, increment an index. Once you've arrived at the "right" number, simply (again HA!) append the collision index to the end of the compressed file, and you're good to go.
I'm honestly very curious to see why I'm wrong (because I'm almost certain I am).
>Since you're bound to find collisions in
>any combined set of hash functions, could you not just
>also include which "hit" is going to be the correct one?
Yes, and I tried to get to the point of this in a previous response.
If you were able to hash a 16 bit string down to eight bits, you would have 2:1 compression on every possible string. The only problem is that your hash function would have at least 256 possible hits, only one of which is the correct one.
So, in order to carry along the information about which hit is the correct one, you need an extra eight bits.
Now you are right back where you started - using 16 bits to data to encode 16 bits.
Magic Function Theory always works out this way - you don't get any information for free.
- Mark
I see your point, but when you have hash functions which always output a fixed length message no matter what the input length, and you're using two completely separate hash functions (from separate cryptographic families), would you not be able to AND the two sets of hits together?
Then couldn't you take the collisions that occurred only in each hash function and then from the known starting length of the input, use the collision index to determine how far to search to find the "correct" collision?
I realize I've just repeated myself basically, but in my mind, each additional step (the additional hash functions, the collision index, the starting length) all serve to reduce the "wrong" outcomes.
I'm not trying to claim that this could be a magic compressor (or anything to that effect), but I can't see why it wouldn't work in some cases.
Again, it's possible that I'm just not getting the math that's proving me wrong. My apologies if I'm being dense.
Look, you asked why it wouldn't work with just one hash function and I gave you a very straightforward, simple explanation.
Your next try falls victim to exactly the same problem. Even if I'm using two hash functions, if there are multiple files that have the same intersection, you will need extra bits to enumerate them.
Do the math man, it's very simple. Instead of just tossing out these ideas, actually sit down and work them out. When you propose something like this once, it's understandable that maybe you are impatient. But don't keep hitting your head against the wall and asking why it hurts. If you really think it can work, start out by developing a test case using 16 bits, 24 bits, whatever. If you aren't willing to test to that extent you are just wasting your time and mine.
- Mark
I realized I failed to address one of your points (that of being a net zero compression due to the size of the index value).
If you take a 10000 bit input, you'd get a 128 bit MD5 hash, and a 160bit SHA1 hash. We're now at ~34 times smaller than input value. Adding on a CRC32 for speedy sanity checking, a 128 bit index, and a 32 bit length field, we're now at 480 bits.
I can't imagine that in the space (for lack of a better term) of 10000 bits, you'll find more joint collisions in MD5 and SHA1 than will fit in a 128 bit value.
Mark,
My apologies. I'm not trying to waste your time, although it seems I'm succeeding rapidly. I'm obviously missing something key to this whole concept, and I apologize.
If I could impose upon you to show me the error (which I'm sure exists) in my last post (with the 10000 bit input), I'll admit defeat, and have learned something to boot.
I think the problem is just that you need to brush up on your math. This statement gives it away:
>I can't imagine that in the space (for lack of a better term)
>of 10000 bits, you'll find more joint collisions in MD5 and SHA1 >than will fit in a 128 bit value.
Your imagination is too limited. Again, do the math on this problem, assume that the hash function generates a perfectly flat distribution, and work out how many bits you need.
- Mark
I figured that might be the case. Math has never been my strongest suit.
I assume then that concatenating two different hash functions basically just creates one larger hash function with the same number of collisions over the same space?
>I assume then that concatenating two different hash
>functions basically just creates one larger hash
>function with the same number of collisions over the same space
Yes, ideally the two functions will be distributed across the input space without much overlap. So if your hash functions are well designed, your strategy of MD5 + SHA1 is going to be more or less the same as using a good hash function with a 288 bit digest.
In general, you don't ever get any useful compression with magic functions. They don't do anything to exploit redundancy in the input data. Since we are talking about random-appearing data, we are not going to find any redundancy that can be exploited by simple coding.
- Mark
Thanks for taking the time to explain where I was going astray, I really appreciate it.
as Miss Marple could say: 'how interesting'
oh the things I could say!
very nice discussion
I have a thought can you use data file(s) like
data.001
data.002
data.003
and so on. Since it doesn't really say it has to 1 data file only??
@Joe:
You could certainly use multiple data files, but you can't hide information in the file names.
- Mark
I love this stuff.
I have a programming friend now.
I may get back to work if my spirits lift.. Kinda sux to be alone working on things but Data Compression is accolades or cool-aid.
Happy Holidays all!
I know that this cannot be solved, but I wanted to look at it to remove any naively niggling notions floating in my own head. So I did some basic analysis of the file.
I got results like: 1600 times, a byte is repeated in the next byte(so if a byte was 200, the next byte was also 200), and 12 instances of the same byte repeating 3 times.
So let's take just this example, let's say we want to gain compression by removing these repetitions.
Let's try a code replacement: Because all 256 codes are already used, you have to create a 257th code, then 1/256th of the bytes would need to be augmented by one bit. (We can be reasonably certain that each byte is nearly equally represented even without doing an analysis.) How many bytes does this cost? 415241/256 = 1622 bits = 203 Bytes.
Well, code replacement doesn't actually help in the case of just one repetition, since the next byte would need to encode the duplication, thus losing any compression right away. But what about the repetitions of 3? There are 12 of them, so after the first instance, you could put in the Duplication code. This would save the next byte (the third instance). So you would end up saving 12 bytes, but this would cost you 203 bytes, for a net loss of 191 bytes.
OK, so let's try another approach.
Let's try to really go after those juicy 1600 repetitions. Why not just remove them, then in a separate data stream, indicate which bytes in the data should be repeated? So we remove them and gain 1600 bytes; our compressed file is, for the moment, just 413641 bytes long.
If we gave the exact position of each byte in need of duplication, we would need Log2(413641) bits to do that -- that's 18.7 bits -- but each byte that we saved before was only 8 bits, so we lose 10 bits for each. Ok so that doesn't work.
But why don't we just add the number starting from zero, so if the first repetition occurs 100 bytes in, we just store 100 in binary, and then keep adding to it to get to the next repetition. In theory, each repetition occurs about every 413641/1600 bytes in the file. That's 258.5 bytes. How many bits does it take to store 258.5 bytes? Log2(258.5) = 8.014 bits. But the Byte that we saved was 8 bits, which means we lose .014 bits per coding. .014 * 1600 = 22.7 bits = 3 bytes.
Again, psychologically, this looks like we should be gaining compression, but when you do the math, you see that you're losing "just" 3 bytes. (And depending on the implementation, you'll probably lose even more.)
I've done some more interesting analyses, but they always result in bit loss.
For example, modulo 16 repetitions. If you modulo 16, there are 6 repetitions of 5; 105 repetitions of 4; 1437 repetitions of 3; and 1 repetition of 5. In other words, it's like 35 and 67 -- both modulo 16 to an answer of 3. You can reduce 4 bits per byte by diving by 4 and remembering the remainder is 3. 35 would become 2, 67 would become 4.
To Compress:
Remainder = 3
35 Div 16 = 2
67 Div 16 = 4
(This can be stored in 12 bits)
Then to decompress:
2*16+3 = 35
4*16+3 = 67
(Back to 16 bits)
When you have a run of 3, you can save 8 bits (on the face of it). When you have a run of 5, you can save 2 Bytes (on the face of it).
I leave it to the reader to see that this also would not allow any compression. Nor would any other scheme, like the oft-proposed prime factoring of the number that is the entire file.
The same problems are encountered for Modulo 8 and such.
@James V:
Some good analysis. Every time you manage to squeeze some space out of the file, the information needed to restore it is equal to or slightly greater than the savings. This means the file's randomness is holding up well.
- Mark
Heh, I did notice that I made a couple typos, but they don't effect the end results. You could include the 12 runs of triple repetitions, which would mean (8-Log2((413641-12)/1612))*1612 = 5.4 bits lost. So that's still losing a byte even before thinking about how to implement it (which will lose more than the theoretical loss).
I would guess that if there were more duplications, the file would no longer be random, and then you might possibly squeeze out a theoretical byte (before implementation). But you would not be able to then reshuffle the file and do it again.
Is there any reason why you are letting people us multiple files? If someone used like 300+ or more files would you use the byte count for total??
@Joe:
It does no harm to let people use multiple files. And yes, the byte count will be for all files summed up.
- Mark
I have a submission, if you want to take a look at it, but it's worth a shot and I tried to follow the rules as stated.
The only problem with multiple files, is that (a representation of) the file length should be included in the count. This can be represented in 1 to 3 bytes per file.
[And this assumes that no other file parameters are used as data.]
Because if the files were combined, you would need a way to discern where each separate data area ends.
@Zoran:
You can find my email address on the About page, go ahead and send me your submission along via email.
- Mark
James. I disagree. I have an apparent solution.
How exciting.. Do we have a winner Mark?
@Ernst:
No. Alan suffers from serious delusions of grandeur. While he often talks about his great ideas (solving all the Millenium Prize problems in one stroke, for example) he has never revealed even the slightest hint that he understands the material at hand.
In short, he is a crank, and is to be ignored.
- Mark
Poor fellow.
I know how that feels. To get caught up is the idea of success when it looks like it could or might *IF* it holds up once it gets to storage.
Been there done that.
I have a fish on the line here of late. Hopefully it won't be a story of another one that got away but we all know how that goes when it comes to compressing the Million Digit Challenge file.
Good luck Challenge people!
Happy Holidays and Seasons Greetings!
I am happy to see our unemployed in the USA will not go homeless for Christmas.
Ernst
What value today to perfect compressor capable of compressing rapidly and recursively, any volume of data, with any format, hypothetically speaking, the compression magic, which value enterprises Software & Telecom would pay for this algorithm / software?
@Maicon:
Of course much would depend on the computation complexity.
Obviously such a thing would be worth a lot if it worked in a reasonable manner.
But this is like asking what would be the value of the Philosopher's Stone. It's a foolish question, because what you suggest does not, and will not exist.
- Mark
You're wrong, it will be real soon. Many have promised and failed. Many theories have been tested, many say is impossible, I believe is possible. 2011 we will have big news.
@Maicon:
Well, at a minimum you're going to win $100 from me.
Be sure to let me know when you have a submission ready!
- Mark
2011? Are we going to rock the house?
Nice!
Update here: I have a possible here but I have not finished the encoder yet. Could go either way at this point; like a dozen before.
The real bitch is to keep that bit of size-savings once I get it. The "Devil is in the details."
Good luck Challenge people!
We may be able to compress AMillionRandomDigits.bin as follows:
1.) For now, ingore the last byte in the file, so that there are 415240/8 = 51905 eight byte chunks. Let each of these eight byte chunks represent an unsigned, 64 bit integer AND CRITICALLY, notice that each of these 64 bit integers are distinct.
2.) Let binomial(n, k) be the binomial coefficient, or the number of possible ways there are to choose k objects from n distinct items (see http://en.wikipedia.org/wiki/Binomial_coefficient)
3.) Notice that Ceiling(Log_2(Binomial(2^64, 51905))) = 2458197. Thus, it would take this many bits, or ceiling(2458197/8) = 307275 bytes to REPRESENT THE TOTAL NUMBER OF WAYS TO SELECT 51905 objects from a set of 2^64 objects.
4.) Thus, the idea for compression is to instead of store the 51905 unique 64 bit integers, we can find the unique integers just by storing the COMBINATION RANK, which in this case would take only 307275 bytes. As will be explained later, we also need to store information about the ordering of these integers.
As explained on http://home.hccnet.nl/david.dirkse/math/rank/ranking.html, the concept of combination rank is simple: Suppose we have two distinct numbers in the range [1, 4]. There are binomial(4, 2) = 6 ways to choose two numbers from the list of four. The possible choices are: 1-2; 1-3; 1-4; 2-3; 2-4; and 3-4. Thus instead of storing the two integers, we could just store ceiling(Log_2(6)) = 3 instead of 4 bits to represent the 2 numbers. Notice however, that we are loosing information about the order of the numbers.
5.) Since there are 51905 64-bit integers in the file, we would need to store ceil(log_2(51905)) = 16 bits for each integer to also represent its position. Thus the number of bytes needed to store information about the integers position is (16/ 8) * 51905 = 103810 bytes to store information about the order of the numbers.
6.) Thus, we can compress the file as follows:
4 byte integer representing size of binomial coefficient | 307275 bytes representing binomial coefficient | 4 bytes representing total number of 64-bit integers | 103810 bytes of location information | 1 last byte of file that we originally ignored
Thus, the total number of bytes we need to store JUST in the data file is = 4 + 307275 + 4 + 103810 + 1 = 411091. But the original file was 415241 bytes, so we have saved 4150 bytes.
7.) We are not done, because we have not included the size of the decompressor. If someone can write a program that can open the compressed file (or byte array stored in the program itself), calculate the combination from the combination rank as per http://home.hccnet.nl/david.dirkse/math/rank/ranking.html (writing this function is not hard at all; the challenge is making it small), reconstruct the file, and then write it out to a file in less than 4150 bytes, then the file will be compressed.
I DO know how to write such a decompression program (this is easy); I DO NOT know how to write it so that the compiled code is under 4150 bytes (but it may be possible).
@pete6:
Interesting - have you verified that the 64 bit integers are distinct?
Since you are using 16 bits to order 51,905 numbers, you really can skip the combination ranking step, right? With 16 bits per number you could simply assign each number to a given slot using a linear address.
Are you sure about the math in steps 2 and 3? If my reduction of step 5 is correct, this implies that you would be able to compress any million digit file in which all of the integers were unique - which is going to be a huge number of files. I don't have a handy way to check your calculation without coding.
If your math holds up, the size of the compiled code doesn't really matter. The algorithm would be able to compress and expand any file as long as we ensure that the 64-bit integers are distinct. This more or less proves that you aren't hiding information. You would win the prize for adhering to the spirit.
Why not create a demo program that operates on, say, a file with just 4 64-bit integers? Can you compress every 32 byte file using this method?
- Mark
I WILL WORK ON ACTUALY CODE A LITTLE LATER TONIGHT...
Interesting - have you verified that the 64 bit integers are distinct?
NO; WILL DO SO LATER AND LET YOU KNOW
Since you are using 16 bits to order 51,905 numbers, you really can skip the combination ranking step, right? With 16 bits per number you could simply assign each number to a given slot using a linear address.
I DO NOT THINK THIS IS POSSIBLE, BUT IF YOU CAN EXPLAIN TO ME HOW TO DO IT, I WOULD LOVE TO UNDERSTAND. SINCE WE HAVE ONLY 51,905 NUMBERS (LESS THAN 2^16), THEY CAN DEFINITELY BE MAPPED BY A ONE-TO-ONE FUNCTION (64 BIT NUMBER ==> 16 BIT NUMBER). HOWEVER, I DO NOT KNOW OF A COMPACT WAY (SUCH AS A FORMULA) TO STORE THIS MAPPING; THE ONLY WAY I HAVE THOUGHT OF TO STORE THE MAPPING IS BY USING THE BINOMIAL COEFFICIENT (THINK OF JAVA'S BigInteger AS A WAY TO DO THE ACTUAL IMPLEMENTATION)
Are you sure about the math in steps 2 and 3?
AS FAR AS I KNOW, YES. YOU CAN VERIFY FOR YOURSELF BY CUTTING AND PASTING THE FOLLOWING INTO WOLFRAMALPHA.COM (THINK WEB-VERSION OF Mathematica):
Ceil[Ceil[Log[2, Binomial[2^64, 51905]]]/8]
If my reduction of step 5 is correct, this implies that you would be able to compress any million digit file in which all of the integers were unique - which is going to be a huge number of files. I don't have a handy way to check your calculation without coding.
YES, ANY MILLION BINARY DIGIT FILE IN WHICH EACH OF THE 64-BIT CONSECUTIVE CHUNKS ARE ALL UNIQUE. NOTE THIS IS NOT MAGICAL COMPRESSION - IT WILL ONLY WORK WITH SOME FILES.
If your math holds up, the size of the compiled code doesn't really matter. The algorithm would be able to compress and expand any file as long as we ensure that the 64-bit integers are distinct. This more or less proves that you aren't hiding information. You would win the prize for adhering to the spirit.
I WILL TRY TO WORK ON CODING SOMETHING UP TONIGHT.
Why not create a demo program that operates on, say, a file with just 4 64-bit integers? Can you compress every 32 byte file using this method?
NO. MY METHOD WILL NOT WORK WITH A FILE OF JUST 4 UNIQUE 64 BIT INTEGERS (BUT LATER, I MAY BE ABLE TO COME UP WITH A SMALLER EXAMPLE TO SHOW YOU). THE SIZE OF THE FILE YOU SUGGEST IS
4 * 64 BITS = 32 BYTES
BUT
Ceil[Ceil[Log[2, Binomial[2^64, 4]]]/8] = 32, WHICH MEANS JUST STORING THE COMBINATION RANK NUMBER IS JUST AS LARGE AS THE FILE ITSELF. BUT REMEMBER, THE COMBINATION RANK DOES NOT INCLUDE INFORMATION ABOUT ORDER BECAUSE IT IS NOT A PERMUTATION RANK. THUS, IN THIS CASE, THE COMPRESSED FILE WOULD ACTUALLY EXPAND.
YES, IT SHOULD BE POSSIBLE TO COME UP WITH A FORMULA TO SHOWING UNDER WHAT CONDITIONS MY COMPRESSION SCHEME WILL WORK.
I am fairly certain that you have a serious error in your math. Wolfram Alpha is giving you bad results.
- Mark
Thanks for sharing pete6
There are many tricks to compressing this. The several I have worked out will reduce any given string by one bit but all have some dependancy to be valid encoding and that is the reality of existence.
Save a bit here spend it there..
Still.. This is oddly an interesting activity. pete6, I hope you find it interesting.
Who knows if there is a system out there for binary digits there seems to have been one for mater and energy.
Keep on keeping on pete6.
Ernst
Yes, you were right that my calculations were wrong, but I think I may have another idea that may work.
I finally came up with an idea that may work, but before wasting time on implementing it, I want to know if I can still win the prize if I do the following:
Write an open-source java program that you and the entire world can look at that transforms the data. Use paq8l to compress the transformed data.
Can I win the $100 even if I use paq8l and even if size of (paq8l.exe + my compiled java program + compressed source) > original source. In other words, would I still win the prize if part of it uses another compressor AND I do not consider the size of the compressor. Of course, (compressed source) < (original source) and the compressor I will write is entirely generic, so it will expand some files and compress others?
See http://cs.fit.edu/~mmahoney/compression/ for PAQ8L.
@pete6:
How about if you get it working first? I spent a lot of time working with you on your last idea, I think the burden is now on you to show that you can actually do what you say.
- Mark
Code is available at
http://userpages.umbc.edu/~pete6/compr_golden/pete6/FunnyZip.java.
it is a single file
note that to compile you must type javac pete6/FunnyZip.java since it
is in the pete6 package
Executable jar is available at
http://userpages.umbc.edu/~pete6/compr_golden/FunnyZip.jar
java -jar FunnyZipper.jar
You were right. No, the data is not compressible, but I have sucessfully written a WORTHLESS, but GENERIC compressor/ decompressor written in JAVA that will either compress or expand AND sucessfully deflate each file by a single byte. No, it is NOT A BARF CLONE OR A SPOOF and it IS NOT A MAGIC COMPRESSOR (IT EXPANDS SOME FILES).
HOWEVER, IT DOES INDEED COMPRESS AMillionRandomDigits.bin by exaclty ONE BYTE AND DECOMPRESS IT BACK TO THE ORIGINAL. FILE NAME NOT IMPORTANT.
it will give a printout of how to use if
It actually does something similar to the TRANSFORMATION I explained above and can compress SOME 9 byte strings by 8 byte strings. Look at the two functions untransformChunk() and transformChunk(). I can explain further if needed
Would you please be able to look at the code and see if it falls within the
ecs021pc26[674]% javac pete6/FunnyZip.java
ecs021pc26[675]% jar cmf mainClass FunnyZip.jar pete6
ecs021pc26[676]% md5sum AMillionRandomDigits.bin
5fa22a14e52b58d27ebbc1e074b7949d AMillionRandomDigits.bin
ecs021pc26[677]% java -jar FunnyZip.jar -c AMillionRandomDigits.bin c
ecs021pc26[678]% java -jar FunnyZip.jar -d c d
ecs021pc26[679]% diff d AMillionRandomDigits.bin
ecs021pc26[680]% ls -l
total 1231
-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:00 AMillionRandomDigits.bin
-rw-r--r-- 1 pete6 rpc 415240 Jan 7 22:27 c
-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:27 d
-rw-r--r-- 1 pete6 rpc 9825 Jan 7 22:26 FunnyZip.jar
-rw-r--r-- 1 pete6 rpc 29 Jan 7 21:57 mainClass
drwxr-xr-x 2 pete6 rpc 2048 Jan 7 22:26 pete6
ecs021pc26[681]% java -jar FunnyZip.jar -c FunnyZip.jar xyzzy
ecs021pc26[682]% java -jar FunnyZip.jar -d xyzzy sss
ecs021pc26[683]% diff FunnyZip.jar sss
ecs021pc26[684]% ls -l
total 1251
-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:00 AMillionRandomDigits.bin
-rw-r--r-- 1 pete6 rpc 415240 Jan 7 22:27 c
-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:27 d
-rw-r--r-- 1 pete6 rpc 9825 Jan 7 22:26 FunnyZip.jar
-rw-r--r-- 1 pete6 rpc 29 Jan 7 21:57 mainClass
drwxr-xr-x 2 pete6 rpc 2048 Jan 7 22:26 pete6
-rw-r--r-- 1 pete6 rpc 9825 Jan 7 22:29 sss
-rw-r--r-- 1 pete6 rpc 9826 Jan 7 22:28 xyzzy
ecs021pc26[685]% java -jar FunnyZip.jar -c mainClass xyzzy
ERROR: the output file 'xyzzy' already exists
ecs021pc26[686]% java -jar FunnyZip.jar -c mainClass mainCLass.c
ecs021pc26[687]% java -jar FunnyZip.jar -d mainCLass.c mainclass.d
ecs021pc26[688]% diff mainclass.d mainClass
...
ecs021pc26[721]% java -jar FunnyZip.jar
USAGE: FunnyZip
-d to decompress; -c to compress
ecs021pc26[722]%
Here is the followup: I wrote a completely WORTHLESS, but GENERIC compressor plus decompressor that will either compress or expand every file by exactly one byte. It is not a MAGIC COMPRESSOR since is expands some (most) files. It is NOT a BARF CLONE or A SPOOF, but does COMPRESS A MILLIONRANDOMDIGITS.bin by one byte (and successfully compress it back to orinal).
source code at http://userpages.umbc.edu/~pete6/compr_golden/pete6/FunnyZip.java
a singe file
executable jar at
http://userpages.umbc.edu/~pete6/compr_golden/FunnyZip.jar
java -jar FunnyZip.jar
USAGE: FunnyZip
-d to decompress; -c to compress
Doesn't matter anymore, but: When I did my analysis, I found that the longest repetition was 3 bytes. All 4 byte numbers (in the million random digits file) are unique.
http://www.riceresources.com/documents/RDET_Anlys_78.pdf
@pete6:
If you read my post today, I think you will understand why this document is garbage:
http://marknelson.us/2011/01/09/combinatorial-data-compression/
- Mark
[...] I tried to compress the famous million random digit file from the Mark Nelson blog [...]
Keep on keeping on...
There has to be a way.
Hey, Happy new year.
About a year ago I discovered an encoding method I named dynamic unary encoding.
Dynamic unary encoding (DUE) is a way to change one file into all other files of that same size bijectively. So, if we start with this million digit file we would transform that file into the next file in either direction. Then on to the next and so on until it cycles right back to million digit file.
The catch is that it takes something like a day and a half to transform that file of 3321928 bits just once using a single 2 ghz core.
I believe a system using DUE can be used to "compress" million digit file and am willing to work with others to achieve success.
The idea is this: Cooperation; I believe a series of DUE transforms can compress this data.
I am open to a group design of the software. Perhaps others will see implementation aspects I have not.
Is there interest in forming such a group?
Ernst
Ernst,
If it takes a day to transform a file this small once, you're doing something seriously wrong.
What do you mean by "next" file? Logically, the "next" file would be accomplished by adding 1 to the current file, and the "previous" file would be accomplished by subtracting 1. Neither of which should take more than a fraction of a second.
To change the original file into a compressible file, would be the same thing as adding (or subtracting) a file of the same size as the one you are trying to encode.
Or are you doing something else?
Yeah doing something else..
Same sort of thing as a numerical increment except the strings are in a different order.
Just thought to offer..
Mark,
Will you wave size of zlib or Gzip?
I may have found an exploit. Part of the data from an encoder crunched 10k tonight.
I'll have to rely on off the shelf compressors at zero cost to meet your challenge if this pans out.
Ernst
@Ernst:
I would be willing to consider zlib to be part of the compiler library - after all, it is included in quite a few language definitions.
- Mark
Thanks Mark,
I'll be working this version of codec and learning how to use Zlib.
TTYL
Ernst
413555 Jan 17 18:24 Mdigit.dat.gz
Source file was 453570 Jan 17 18:24 Mdigit.dat
Gzip keeps the original time and date.
There is hope, Mark.
Now to see if I am correct.
Fingers crossed and NO Claims being made here!
I'll work and report if or if not but this is something like encoder 100 so I am truly hopeful with these results.
Decoder work ahead and some file format thinking to do today.
Again friends.. No claims being made just blogging in the Challenge blog.
Good luck Challenge people!
Ernst
Oh Well..
As usual, forget that version of things.. Once all is recorded I lose the savings. Ain't that the way it goes?
Now to search for another idea.
Good luck challenge people!
Ernst
I just found this by Mark:
"No. Alan suffers from serious delusions of grandeur. While he often talks about his great ideas (solving all the Millenium Prize problems in one stroke, for example) he has never revealed even the slightest hint that he understands the material at hand.
In short, he is a crank, and is to be ignored."
WRONG
I did not say that I had solved all the Millenium Prize problems in one stroke.
It took me about three hours, using a combination of intuition and a VERY fast problem/phrase analysis method to get an APPARENT insight on how to solve all 7 Clay Institute Millenium Problems.
Mark I have had a GUTSFULL of your negative, stereotyping of anyone who dares to not restrict themselves to OLD WAYS OF THINKING as promoted (perhaps) by people with power and priveledge (and possibly high status) in academic fields (of endeavour) ........
I subsequently analysed various of the 7 Millenium problems more formally and gained apparent additional insight in other ways (I was on a commercial airliner looking out the window at a "patchwork quilt" of stratoculumus clouds when I realised that "turbuluence" was "the packaging of space" (and recently made the (stunning) discovery that (apparently) "teleportation" is the opposite of "turbuluence" (which is curious as when you are in an aircraft that meets turbulence, you do seem to experience your body being moved rather abruptly!)
However I also recently made AWESOME apparent theoretical breakthroughs in theorising about teleportation, finding a way of explaining it that appears to satisfy conventional scientific concerns re: chemistry and physics. I do not want to say too much here as I am VERY POOR by first-world standards for my age group, and am looking to license many possible new technologies.
IF YOU LOOK at the Milllenium problems you will find that "turbulence" is mentioned...
Regarding an earlier apparent discovery of a way to utilise SPACE much more eficiently in the quest to store computer data, the APPARENT breakthrough I made I now have probably over 25 different perspectives on it. Of interest is that the ability to find INHERENT COHERENCY IN A SEEMINGLY RANDOM FIELD OF DATA (i.e. naturally occuring data interference fringes i.e. how the data tends to jerk itself about (like a airliner passenger who encounters- you guessed : turbulence!)) has potential application in finding how seismometer data shows the patterns that lead to (or lead from) earthquakes.
Regarding so-called "cranks": a photo of a (apparent) series of craters on the lunar surface was claimed by a "crank magazine" to be a spaceship (as it can be looked at as if the shading is on a hill-like surface not a valley surface). I realised that it was (most likely) a series of craters overlapping forming a valley, that due to the lighting, could be perceived as if it were a long space-ship-like hill.
However, metaphorically, a series of overlapping craters does in fact match the notion of "shipping space"....
Million digit encoded into 349314 Trytes..
The Number looks good but in Binary space it's larger than Million Digit.
LOL It's the thought that counts..
Hey Challenge people!
I'm not sure if there are a few other people or if this challenge is considered "Twit of the Year" Monty Python stuff.
Anyway I have an update.
The three symbol transform of million digit file
0: 699198
1: 697479
2: 699202
2095880 trits.
I'll be exploring actual data compression with this Ternary translation of Million Digit File.
To get any compression I have to achieve 6 symbols per byte so 12 bits down to 8.
If anyone wants to work with me on it I'll consider it.
2095880 trits of information to work with.
That's about it I think for the winter season. I will learn more about probability and combinatorics.
Perhaps dense data is dense in all forms. I'd say that was true from the simple testing I have done so far. However, maybe I can represent 6 trits with 8 bits if I study this ternary encoding.
I see something about ternary Huffman codes so I'll go read that.
Ernst
p.s. are there others working on this?
Ernst,
Your suspicion is correct: dense data remains dense in all forms - whether its in other numerical bases, sets of arbitrary bit strings, or even arranged in different shapes (such as 2d grids, 3d grids, 4d grids, an Ulam Spiral, zigzags, etc).
More incredibly, all the derivative data remains dense. For example, if you took every 8th bit of information from file, you'd find that the resulting bitstring was just as dense and incompressible. If you took a random sample from it, you'd find that sampling just as dense.
Even abstract methods fail to reduce the density. For instance, you could take the file and break it into 32 bit chunks representing unsigned integers. You could then say that on the total number line, 0 to 2^32-1, each of the results was "1" and the other numbers were "0". Your numbers would represent something like 103,810 out of 4,294,967,296 - 0.0024%.
Yet, if you try to randomly sample that resulting line and remove half of the numbers without hitting a "1" (despite the fact that the there are so incredibly few actual "1"s in the line) you'll find that nothing helps - the arrangement of "1"'s is so randomly dispersed that almost exactly half of them will be picked up.
If this were not the case, you could compress the million digit file by 1 bit per 32 bits by devising a scheme that allowed you to ignore half of the 32 bit numbers, thus storing your offsets in 31 bits. Instead, you'll find that no function or algorithm can remove half of those numbers without hitting the distributed "1"s unless the function uses more information about the pattern than you'd actually save in compression.
No matter how you examine, slice, or splice the data it will remain dense =)
That fits with what I have experienced.
Would the term Quanta apply to the underlying information?
I read that information has been called the fifth element or perhaps called quintessence.
So could the unit of information that is the million digit file be called a quanta and be independent of form?
Sorry if i sound deranged. I didn't expect your reply.
Tim
There is a way to cycle through all patterns for a fixed length of binary string and back to the starting string. It is bijective.
I've avoided that project since I'm looking at a possible a year of processor time to run such a thing.
The information needed to be saved is the offset in one direction or the other and the pattern-information such as all zeros or all ones.
Will that offset be smaller space than the million digit? I don't know. Maybe I'd be lucky and it is just a distance away a few thousand bits can represent or maybe not so lucky.
But, I am sure that density transfers into new encodings. Not that an encoding won't compress some but that I have not seen an obvious smaller compressed encoding for my efforts.
I appreciate the feedback.
Ernst,
I applaud your line of thinking. Yes, it should be fully possible to cycle through all the possible patterns, but let me explain why its a bad idea.
First, I'm going to set aside performance and time considerations (though they defeat the idea alone, as may become apparent momentarily).
Let's say you wanted to cycle from the number 0 to the number that the file represents. Obviously, in this case, the number would be (in binary) 3,321,928 digits long and would require 2^3,321,928 cycles to reach the goal.
That number, btw, comes out to about ~9.36x10^999999 according to Wolfram Alpha. For comparison, there are only about 10^80 particles in the entire universe.
So, no matter what, if you start from 0 and try to cycle, you would require the same number of bits as the original file to express your cycling offset.
Moving on, let's say you tried to roll a random number (since hey, you could build a binary string of the same length as your target data with a single integer). Let's say that, lucky for you, this new binary string represented a number that was exactly halfway to your goal! Awesome! Now you only need to express 2^3,321,927 cycles to get there... oh.
At this point, you might get clever and say "Well, what if I kept rolling random numbers, and stored only the ones that led me closer to the goal through some deterministic method." The first number will get you halfway there, the next perhaps 3/4, the next perhaps 7/8, and so on... and yet, it will never quite reach the full and correct number until you expend the same amount of data as was originally required to express the original number (or maybe 1 bit less! You might get lucky enough to get past that one bit!)
In essence, you'll never spend less than one bit less than the original sequence if you're trying to just cycle, or combine other numbers to express it.
I'm not sure what direction to take.
There are mechanical blind-sort possibilities coupled with transform.
There is stepping through sequence using zlib to test each instance.
I agree that it is possible to end up with one bit less for my efforts. I want you to know I understand that because I have seen that in several systems I have tried.
Well, I can't cry because I have no choices or no ideas.. Also I can't discount looking under every proverbial rock since I have been surprised when following through on what looked to be insignificant but turned out to be important to have explored.
I don't know what direction to follow next.
I'll be thinking about it.
Thanks Tim.
Ernst,
First, don't get disheartened - at least, not for the wrong reasons. Sure, it is provably impossible to compress all possible data, and sure, it's most likely to be able to compress a given "random" source of data when its encountered as it has already reached a high state of information entropy.
You never know though, this particular file might have an undiscovered weakness. As long as you approach it from the perspective of not trying to build a universal random data compression/decompression then you remain sane AND educated about this entire challenge.
For my part I've gone through much of what you have, and I can't tell you how much I learned and how valuable it's been to me.
So don't despair, just understand that you've managed to unleash a great deal of your focus and attention on realizing your ideas and trying to accomplish something. Now imagine if you focused that same zeal on mastering the current tech behind compression theory and algorithms - you may yet provide the community valuable and insightful contributions outside of this one particular challenge!
That's kind of you to say.
Yeah, having spent a few years learning to encode and decode. Learning to spot similarities in different systems because of the common source file ( Million digit ) is of value to me.
My first encoder was based on the dynamic system Lothar Collatz made famous [3x+1,x/2]. In the beginning (2002) I thought I would just use determination to get things done but after I had done my best to make a dent in the million digit file size I had to step back and say: Okay I see that I am not the master so I must be the student.
Once I changed my point of view each experiment became a lesson. That's it really.. Keep on keeping on.. Pay attention. Follow up and most important revisit and review the concepts already experienced from time to time because as the understanding evolves so does the ability to apply old algorithms in new ways.
I'm starting the project to step through the sequence of files the bijective transform generates. I'll be testing the compressed size of each instance. Perhaps a compressible file is not so too far away.
I'll be using a faster method than the one that takes a day and a half @ 2ghz to render a single instance .
Here is a thought.
Since we can generate all files of specified length there is data to consider. Maybe one of those files is a short story no one ever wrote or a picture no one ever took.
Building on that concept then all the information that can be, already exists. That means "The Truth is out there." http://www.youtube.com/watch?v=JDZBgHBHQT8
I can tie up the desktop now that I have a laptop and I might as well.
I don't want to quit before I tried bijective transform brute force style.
Lots of ideas..
You are welcome to chat more Tim.
The one thing that is lacking in my hermit lifestyle is healthy interaction with those who share my interests.
Number crunching Hermits are odd folk..
----
Since I have gone on for years about the [3x+1,x/2] I can share the series that I believe the Collatz fits into.
Trying to extend the series past 0/0 is where it gets trippy in my opinion.
And the point isn't the numbers it's the pattens.
With the Collatz, and I believe for all [3x+/-y,x/2] the 3x+/-y part shares only one element with x/2 ; The even right after a 3x+/-y operation. Other then that it's a struggle for control between the two systems and x/2 always wins.
The structure of all pattern for [A(x)+or-y,x/2] is Fibonacci and base 3 in nature from what I can tell.
I'd love to work on a base 3 computer.
I asked about geometric relationships of that series and one kind professor took a look and said that there was a relationship in a couple of the A/B but the whole thing didn't fit.
Anyway interested folk can look at those for some fun.
I had a web page for a few years with that stuff on it but Geocities has closed down so I can be reached via Email for questions and help.
That is the stuff of a type of encoders if anyone wants to play with it. For all those Collatz type systems they share a common parity language structure.
I have examples of attractor finder code so that people can locate the attractors for any system.
Anyway I entertain myself with such musings. So far it keeps me off the streets :)
How strange.
The series is missing from the post.
Just above the line "Trying to extend the series past 0/0 is where it gets trippy in my opinion."
Should be a series so let me try it with spaces like A / B
Sorry about the confusion but I did write a good looking text.
All these can fit in a dynamic system [A(x)+/-y,x/z]
Again the point, in my opinion, is the patterns not the values.
Mark your editor is stripping text out.
To Share the series here I have to use a different character than the divide by to represent the system
In the denominator position is the multiplier and in the case of the Collatz example it's three so I will use the ":" for the back slash I like to use that makes it look like a fraction. How funky.
Try and see a fraction looking thing in your mind when you see A:B.
One more try to post.
It just won't let me post numbers not even A : B,
Ahhh.. Code escape..
let's try that!
[c]
[\c]
The Collatz is [3x+1,x/2] so it exists in the 3/2 for example
Fingers crossed.
Wow Mark.. If I could edit I would
One more try.. My bad on the wrong slash
Again the point, in my opinion, is the patterns not the values.
Let me try spelling it out.
Lets see Imagine a series of fractions.
It extends to infinity on both ends.
from left to right
Negative 4 over negative three.
Negative 3 over positive two.
Negative 2 over negative one.
Negative 1 over positive one.
Zero over zero.
One over One.
Two over One
Three over Two.
Four over Three.
And so on in both directions to infinity.
So you reader have to write it down to see the relationship.
I'm trying to share but this is a cluster f*ck of an editor.
Test -4/-3 -3/2 -2/-1
Okay then it's the commas that trip it up.
Mark please delete the comments effected. Pardon the snafu.
-------------------------------------------------------------------
You are welcome to chat more Tim.
The one thing that is lacking in my hermit lifestyle is healthy interaction with those who share my interests.
Number crunching Hermits are odd folk..
----
Since I have gone on for years about the [3x+1,x/2] I can share the series that I believe the Collatz fits into.
Trying to extend the series past 0/0 is where it gets trippy in my opinion.
And the point isn't the numbers it's the pattens.
With the Collatz, and I believe for all [3x+/-y,x/2] the 3x+/-y part shares only one element with x/2 ; The even right after a 3x+/-y operation. Other then that it's a struggle for control between the two systems and x/2 always wins.
The structure of all pattern for [A(x)+or-y,x/2] is Fibonacci and base 3 in nature from what I can tell.
I'd love to work on a base 3 computer.
I asked about geometric relationships of that series and one kind professor took a look and said that there was a relationship in a couple of the A/B but the whole thing didn't fit.
Anyway interested folk can look at those for some fun.
I had a web page for a few years with that stuff on it but Geocities has closed down so I can be reached via Email for questions and help.
That is the stuff of a type of encoders if anyone wants to play with it. For all those Collatz type systems they share a common parity language structure.
I have examples of attractor finder code so that people can locate the attractors for any system.
Anyway I entertain myself with such musings. So far it keeps me off the streets :)
Frack! It did it again.
I am trying to post a series of dynamic systems.
-4 / -3 -3 / 2 -2 / -1 -1 / 1 0 / 0 1 / 1 2 / 1 3 / 2 4 / 3
Matt,
I am running tests on a rewritten encoder.
I'll be working on the software for a while but I spotted an interesting file.
T: 1193 Ones 1660964 Zeros 1660964 Diff: 0 High: 6484
This is the 1193rd file generated in one direction that has equal set and reset bits.
The "High" of 6484 is the largest difference between set and reset bits up to the point of 1193 files.
This algorithm is a lot faster than the original algorithm.
Mark feel free to delete the above posts and clean your blog up.
Pardon the jumble.
Approximately 600,000 files per 2ghz processor per day can be generated. A test run of 100,000 files took 4 hours.
Nothing too exciting to report in the first 100,000 files generated from this simple test run.
I have no Idea if a coherent pattern was generated since I was using a ratio of set bits and reset bits as my guide to get the general idea of the informational landscape.
The largest difference between set and reset bits in the first 100,000 files generated in the forward direction was 7572 bits difference.
I'll check back Matt.. Did you understand the series of systems I posted? I may have to have Mark post an email since the editor is horrid. Why would an editor selectively strip data from a post?
Who-thunk that up?
On to the reverse function.
I have isolated that first 0 difference between set and reset bits file in the forward direction of this transform.
There are a lot more zero difference files.
The million digit file has a difference of Ones 1660251 Zeros 1661677 Diff: 1426
If anyone wants a copy just email me.
I'm open to reading about what you determine about the file.
My guess is it is more statistically flat than the million digit file.
Gzip doesn't compress it.
Well, Tim..
I think I was wrong about this version of DUE.. I think it splits the patterns into two sets rather than one. I'm rechecking my original work but so far I think it's a two set system not a one set system as I thought.
So whatever set the Million digit file is in after 850,000 files the high difference between set and reset is 9100 bits.
I'll have to implement the zlib to be sure I am not passing up files that can compress.
I think you were wrong on this being a bad idea. How can I learn thing If I don't explore them.
More work to do..
This looks to be two (rings?) So I need to audit things here.
Anyway.. Mark.. Whatever man.. I feel like I wrote graffiti but I was hoping to chat and share. As you see fit on this segment of posts.
Back to the solace of "why did it do that."
Later Tim.
So, Hey.. Update.
I have a question. I also posted on Data compression newsgroup.
Does anyone understand what Cycles are? I believe I may be working with Cycles as defined by Wikipedia.
http://en.wikipedia.org/wiki/Cycle_%28mathematics%29 but I am not sure yet.
The above wikipedia define seems to fit best. It seems to be what "Tiger" I have by the proverbial tail.
Any data-compression use cycles? If so which ones. Link or search term welcomed.
I'd appreciate feedback.
This is a reply to a post I had a chance to reply to before.
However, I did not know enough to counter. I do now.
I counter because I know better now.
Humbly, Ernst
Quote followes.
----------------------------------
Tim said,
in March 10th, 2011 at 4:15 pm
Ernst,
I applaud your line of thinking. Yes, it should be fully possible to cycle through all the possible patterns, but let me explain why its a bad idea.
First, I'm going to set aside performance and time considerations (though they defeat the idea alone, as may become apparent momentarily).
Let's say you wanted to cycle from the number 0 to the number that the file represents. Obviously, in this case, the number would be (in binary) 3,321,928 digits long and would require 2^3,321,928 cycles to reach the goal.
That number, btw, comes out to about ~9.36x10^999999 according to Wolfram Alpha. For comparison, there are only about 10^80 particles in the entire universe.
So, no matter what, if you start from 0 and try to cycle, you would require the same number of bits as the original file to express your cycling offset.
Moving on, let's say you tried to roll a random number (since hey, you could build a binary string of the same length as your target data with a single integer). Let's say that, lucky for you, this new binary string represented a number that was exactly halfway to your goal! Awesome! Now you only need to express 2^3,321,927 cycles to get there... oh.
At this point, you might get clever and say "Well, what if I kept rolling random numbers, and stored only the ones that led me closer to the goal through some deterministic method." The first number will get you halfway there, the next perhaps 3/4, the next perhaps 7/8, and so on... and yet, it will never quite reach the full and correct number until you expend the same amount of data as was originally required to express the original number (or maybe 1 bit less! You might get lucky enough to get past that one bit!)
In essence, you'll never spend less than one bit less than the original sequence if you're trying to just cycle, or combine other numbers to express it.
----------------------------------
Cycles:
There are finite sets of finite sets. That is as simple as I can write it.
So the orbit is a finite set. There is a finite set of cycles.
Thanks for the conversation. Pardon any stress in receiving a delayed second reply.
I am sure the complete geometry of existence will surprise us all.
Update:
Million digit file is an element in a k-cycle of 4194304 elements for this configuration.
I don't know yet if a compressible file is in that cycle but that is fewer files to check than I assumed.
Good luck Challenge people!
Hey!
I've worked out the bugs in the "proof" program. Meaning I have verified that the k-size of the cycle which the Million Digit file is an element of is 4149404.
What is next is to understand how to progress from one cycle to another and back. There has to be a way to "jump orbits."
I have to say this, I didn't know much about this kind of math and I am still learning.
There are other cycles for other configurations of this encoder and that too is an area of interest for me but first I should see about making use of this configuration.
It isn't important now that this challenge was intended as a Gag. It has been a catalyst for discovery. I don't see what I do anywhere on the Internet so just maybe I have discovery and then that alone validates this whole challenge.
I have much to learn and much to do. I'll blog back in while.
Good Luck Challenge people.
It is amazing how people are still hopeful to compress the incompressible!
Mohamed Al-Dabbagh said,
in July 2nd, 2011 at 1:52 am
It is amazing how people are still hopeful to compress the incompressible!
--------------------------
Ah, folly of the incompressible, and yet the effort opens doors of discovery.
Hey I watched a documentary of interest to me and maybe you and you reader. http://video.google.com/videoplay?docid=-5122859998068380459
It is possible our Data "tool set" may not be complete.
Perhaps there is more discovery ahead for all.
Good luck Challenge people.
well I apparently have what appears to be a answer.
Also can see what "hierarchies of infinity" actually are (differing points of view of external things);
and what the on-again off -again problem Cantor had may actually be (why he thought he solved, then hadn't solved, "continuum hypothesis" (sensitive ideas).
@Alan:
So which is it? Is the Continuum Hypothesis provably true or false?
- Mark
What is "proof"?
Proof = "(a) "continuum --hypothesis - -"
It is a ___________________
I am not going to fill in this because it is worth a lot of income and I am tired of being as poor and hungry as I am.
You want to "prove" if "proof" is "true" or "false"...?
If you want to make a ridiculous amount of money then please get over your ?????prejudices!
What is "Ultimate logic"? (see "New Scientist" recent issue re: Cantor etc.). "Logic" = "path"; "ultimate path" = a path with discrete boundaries (a VIRTUAL 'set')
inherent coherence: a PERSPECTIVE
a way things "hang together"
If you make a "data hologram" of e..g. seismometer data, then you can "rotate the data" (= "continuum hypothesis!!!!!!" )
and in some views you will see how "a quake to come" (or a quake that has been") is leaving a trace in the data (the data has a kind of "Objectivity" - it "holds together"- it has "continuuance" !!!!!
Needless to say, if you can find the inherent ways that data "sticks"- forms its own sort of jigsaw- you may find you can fit a lot of data in a very small room (by finding the hyper-space tangential (the "L-function" I could describe it) of the "data" (the data almost "dissapears"- becomes nearly 'invisible'- it becomes like a material object: software becomes more like "hardware"!!!!!!!)
Quote: "The Folly of Infinite Compression is one that crops up all too often in the world of Data Compression. The man-hours wasted on comp.compression arguing about magic compressors is now approaching that of the Manhattan Project.
In a nutshell, a magic compressor would be one that violates the Abraham Lincoln’s little-known compression invariant: “You can’t compress all of the files all of the time”. "
comment: "all of the files IS "all of the time" (all of the
______________________)
Of course you can not, unless you don't even try to or need to (via using a Space computer (which creates a (constant...) "time differential" (imaginary time in 3d: hyperspace known (internal limits on external edge) (i.e. a (coherent type) fixed structure (like a regular ____________________)
Quote: "It’s trivial to prove, and I won’t do it here, that no single compressor can losslessly and reversibly compress every file."
Comment: The answer IS "no single "compressor" (i.e. an algorithm that maps space: so every file has its own individually designed compressor via the _____________________ quantisation diffraction _________ 5-6-7 (...8) dimension ____________ 'function' )
Quote: "The easiest way to foil most compressors is with a so-called “Random File”.
Comment: This technique turns any (?) (except one) (which is !!!! pi !!!!! I think) file into a so-called "random file" (it becomes _________ _________(these blanks are in part a: a virtual _________))
Quote:
"In the world of Information Theory, Randomness is a tricky thing to define, but I like to fall back on Kolmogorov complexity, where we define the complexity or randomness of a file as the Komogorov complexity w/r/t a Turing Machine."
Comment: = '4 bit ____________ or a virtual __________________'
Add this to a ______________ file and you get hyperspace virtual real quantisation (or establishing local LIMITS on hyperspace with respect to the supposed "file" being tailor -
s______ )
Quote:
"Several years ago I posted a challenge on comp.compression that I hoped could be used to silence any proponents of Magic Compression, and I’m reposting it here so I have a permanent location I can point to.
How does it work? I took a well-known file of random digits created by the RAND group (no doubt at taxpayer expense), and converted that file into binary format, which squeezed out all the representational fluff."
Comment: a RAND file is ALREADY in so-called binary format if you look at it a certain way
When put in so-called binary format it becomes a space-computed file
Quote:
"The result was AMillionRandomDigits.bin, 415,241 bytes of incompressible, dense data.
correct (in one sense) (well done, but.....
turn it into a hyper-space jigsaw and something amazing begins to happen (it starts to unravel and line up as a matrix of variables with specifc right-angle arrangement (Like a ____________________s game/ board))
Quote:
"The challenge is to create a decompressor + data file that when executed together create a copy of this file."
comment: Hyper-space _______________ in 5 d should do it
Quote: "The size of the decompressor and data file added together must be less than 415,241 bytes."
Comment: It may be less than or no more than 1 byte (it reconfigures the inside of your computer (using a hyperspace binary (super) 'group'
Quote: "So far nobody has been able to do this. I wouldn’t have been surprised to see somebody get a byte or two, but that hasn’t happened."
Comment: "A byte or two" is a very interesting idea
Quote: "The only real rule here is that we have to negotiate exactly how to measure the size of the program plus data file."
Hmmmmmmm
Quote: "I’m willing to exclude lots of stuff. For example, if you wrote the program in C++, how would we measure the size of the program?
In this case, I would measure it as length of the source file, after passing through a resaonble compactor. The run time libraries and operating system libraries could be reasonably excluded. But that’s just one example, the rest are all subject to public interpretation.
Really, the only rule we need is that the executable program can’t have free access to a copy of the million random digits file. For example, you can’t create a new programming language called JG++ that includes a copy of this file in its Run Time library. You can’t hide the digits in file names. And so on.
As long as those rules are obeyed, the challenge will never be met."
Comment: The file HAS ITS OWN RUN TIME LIBRARY.
The key is knowing how to find this internal hyper-space quantisation ((diffraction)) '...limit ' and de/ not d "quantise it (in hyper'space')(turn it into a ________(______ "library" (or difraction limit on ... hyp_________)
Quote: "Recursive Compression This challenge has a special place for the variant of Magic Compressors known as Recursive Compressors. Some savants will claim that they have a compressor that can compress any file by a very small amount, say 1%. The beauty of this is that of course they can repeately use the output of one cycle as the input to another, compressing the file down to any size they wish."
Comment: If you do that it will push out in other ways that place limits on it (you end out with a maximum of a ________________(a PERSPECTIVE or point-of-view on the so-called file)(__________________________________________________________)
Quote:
"The obvious absurdity to this is that if we compress every file to a single bit, it’s going to be kind of hard to represent more than two files using this algorithm."
Comment: What you need is "hyper _________" ____________________________________________________________________________________________________________________
Quote:
"So most people in this subspecialty will claim that their process peters out around some fixed lower limit, say 512 bytes.
For those people, a working program should be able to meet the challenge quite easily. If their compressed data file is a mere 512 bytes, that leaves 400K of space for a decompressor that can be called repeatedly until the output is complete."
Comment: The computer may interfere with such a program?
The PayoffThe first person to achieve the challenge, while staying within the spirit of the challenge, will receive a prize of $100.
Comment: Unfortunately the possibility answer is worth more in the market?
Alan,
I've reviewed each of your comments and found no mathematical coherency to the ideas you've put forth.
Are you OK?
Tim:
I have intentionally left blanks throughout my comments, so as not to reveal too much.
You are only supposed to get the notion that I MAY know an answer. I am very poor for a non-third-world country person, I cannot afford to give away what appears to be an answer.
Mathematics is "the science of pattern"; "physics" I could call "the pattern of science".
If I were to describe "mathematical coherency", would you expect to find mathematical coherency in my description? Or would you expect something wider than "mathematical coherency' to be part of how one gets a handle on "mathematical coherency" (or " the cohering of categories" or universal set quantisation )(flippability )?
There is a statement on this website that says:
"Arithmetic coding + statistical modeling = data compression".
According to an analysis I did of this claim, I found this claim to be true.
My analysis then might suggest to professionals in the computer industry that the un-revealed apparent breakthroughs I have made with regard to the subject of what is known by the phrase "data compression" may also turn out to be successful.
I just stumbled upon this funny challenge. It is absolutely amazing how many replies it has caused.
I want to add that of course compressing a random file can be possible. The probability of creating a random file consisting purely of 0's is not zero, for example. That file would be trivially compressible.
I think it might very well be that your specific file can be compressed by a few bytes, just by coincidence. That would be a meaningless result however.
What if all files are 'trivially compressible' when you know how to do this?
Apparently the way to do this is via a --------------------------------------------
(censored by me as potential commercial use) !
Mark,
Although I think you may have stopped bothering to read the comments after a few weeks of these rather frustrating posts, I hope you will answer my question.
Is a compressor still "magic" if, instead of compressing all files, the files it can't compress are just very usefully small? With the reading I've done so far I'm inclined to think not; it fits the proof against infinite compression.
Any reply would be appreciated! Thanks.
@Komon:
Let's just say for the sake of argument that your definition of "very usefully small" means a file that is under 1K in size.
A compressor that can compress all files that are greater than 1K, but cannot compress some of the files under 1K is still magic (and provably impossible.)
- Mark
Can you provide the alleged proof of "impossibility",Mark?
Here is something which is not specifically my commercially sensitive idea, that occurred to me that I could say here, to provide food for thought:
A way in principle to store an unknown string of zeroes and ones that is one million characters in length:
The simplest way to store something is via a reference, e.g. "Tolstoy's novel "War and Peace" is readily "decompressed" by looking up this reference at a library.
However, if you entered this title in a brand new unused computer, you would not expect to find that novel in all its detail.
But if you entered the following instructions to the brand new unused computer, and if it had a particular type of logic program installed, you would in principle recreate the novel in all its detail:
The instructions are:
The object to be generated by the computer has n characters (or in the case of "the million digit challenge" it has a million characters)("n" refers to whatever the actual number of characters including letters, spaces, and punctuation is).
In a list of all possible combinations and frequencies of these characters, the object to be found is number x on that list (e.g., in "the million digit challenge", it could be number 5 billion and 27 on the list of all possible options for arranging zeroes and ones in a string of one million characters).
The computer needs to have a logic program installed that goes through the list of all possible arrangements in a methodical way (e.g. the list of all possible arrangements of 4 characters is: one: 0000; two: 0001: three: 0010: four: 0100: five: 1000: six: 0011; seven: 0110: eight: 1100; nine: 1001: ten:: 0111; eleven: 1110 : twelve: 1011; thirteen: 1101; fourteen: 1111 )
The computer only need to know that the item to be stored has so-many characters, and is item number whatever on a methodically generated list of all possible ways to arrange those characters. It can then recreate the item by running a logic-scan of the list (the way in which the list is generated can be a methodical process, with numerous shortcuts to improve efficiency - also a small amount od additional information would help (e.g. how many zeroes and how many ones in the million characters)).
The secret more advanced technology could be like a Sudoku puzzle, with floating information that self-configures to produce one solution.
@Alan:
Math: learn it.
http://marknelson.us/2011/01/09/combinatorial-data-compression/
Interesting article...
First: to correct an error in my post: there are sixteen ways of presenting zeroes and ones in 4 characters, the two I omitted are:
fifteen: 1010; sixteen: 0101
Regarding the article on combinations and permutations:
What I presented here, which may be related to other things but is not specifically the technically advanced apparent discovery that I have not revealed, was a "throw-away line' that suggests (though I am aware there may be problems) an intriguing possibility in principle, that concerns zeroes and ones, not bits and bytes.
in your article, you introduce the idea of a "subset". the notion I presented does not have the "problems" that are shown in your article, because there is (initially at least) no subset situation, just ALL unique ways of listing zeroes and ones in a fixed number of characters.
Each way is identified by its number in the total (e.g. in all ways of listing zeroes and ones in 4 characters, option five can be 1000. To "store" this option, one only needs to have a program that runs through all options in a methodical way, the number of characters (4), and the location on the list of all options as run according to the automatic listing of options (whatever it is designed to be re: how such a list is listed re: which option when) .
The immediate issues are: can a listing procedure be designed? I think so.
Does it take more zeroes and ones to describe the (1) number of characters and (2) the number on the list of all possible ways of writing zeroes and ones in that number of characters? In the case of just 4 characters, the method appears to fail as it would take more zeroes and ones to describe the option this way than to just list the actual file e.g. file 1000.
But what if it were a million characters? Wouldn't the numbers to describe how many characters (1,000,000) and which option on the list of options (could be over many billion) take up more zeroes and ones than the actual file? Maybe, depending on efficient mathematical shorthand techniques (like exponents etc.).
Then there is the question of listing more information about the file to reduce the size of the list of all possible options. Now you do have a subset and it is here that saying any more might open up too much!
1 1000 00110001
2 1001 00110010
3 1010 00110011
4 1011 00110100
5 1100 00110101
6 1101 00110110
7 1110 00110111
8 1111 00111000
9 0111 00111001
0 0110 00110000
Make a translator to create a file that switches all the numbers from 8 digit binary. The file is only numbers, you only need 4 digits of binary to represent them all losslessly. Should cut the file size by 50% leaving plenty of room for the decompressor.
Cheers.
@Brian:
No, not even close.
The file is only number? Yes, the ASCII version of the million random digit file contains only the digits 0-9 plus record delimiters. But that's not the file that you are being asked to compress. The file that you are being asked to compress contains the binary version of the million digit number.
- Mark
ah, bummer.
...And if someone could complete that challenge (and more generaly solve that problem of data hypercompression). What could be the consequences on the computer sciences and industrial/commercial computing?
@Monad:
Matt Mahoney has pointed out that may fantastic compression claims would require that P=NP. The web has some good speculation on what the consequences of that would be.
No worries on my part - it's been a long time since anyone has even thrown a new idea at this challenge.
- Mark
P=NP does not mean impossible, that does mean very expensive in time ressources, thing that could be solved with a qbit technology (quantum theory). In plus, it can not be exclude that a new paradigm approach gives a polynomial time solution or more realistic: an almost polymonial time solution.
(sorry for my english, that is not my native tongue)
Ah, it's that time of year again. Time to spend more time inside where it's warm than outside where it is cold. It's my favorite time of the year since I like to work on this challenge.
As to MillionDigit Challange.. I still Dig it. I too am moving on into Cantor and Set Theory.
I have more questions than experience and I have not run out of possibilities.
Allen was nice to read because of the display of creativity.
My favourite saying about these things now is "the Universe will not let us cheat but I keep trying."
That can apply to MillionDigit Challenge and to death itself.
As Prof. Einstein wrote "Imagination is more important than education."
I don't have a candidate for a compressor going into this winter cycle Mark. I do need to learn more about Set Theory at this time.
I do believe there is a function that can satisfy this challenge. Something that will connect the set of bits of MillionDigit to a smaller file and that file and the program size will be smaller than the MillionDigit file.
I think we can all agree that the path to find that function isn't obvious or easy.
And now a quote from Hitch-hikers guide to the Galaxy "Keep banging those rocks together!"
Good Luck Challenge people and thanks for keeping the site going Mark. I still like it and the challenge.
Hey Friends.
If there are any in this line of work :)
I am enjoying some tunes and again spending an afternoon day-dreaming on the topic of this file.
I am about to display the binary and look for some structure.
I know this whole challenge is a stfu effort but hey that is a narrow minded attitude.
Dare to dream people!
Okay.. Since I am in the mood to work on my programming projects I have had to decide on a direction for the coming year.
It's to be to write a proper paper on an encoding method or two I already understand.
Any advice on paper writing is appreciated.
I don't see myself authoring more compression program efforts at this time. Not that I don't find myself working out an idea every now and again.
I'll stop back in, in a while. Be Well..
Ernst
I'm going to be working soon on some arithmetic coding updates. I'd love to see good work on Q-coder, range coding, and other arithmetic coding techniques that go beyond the basic Cleary-Whitten model.
- Mark
Here's my solution, wherein a) the compressor will shorten any source greater than 32 bytes in length by 32 bytes, and b) the decompressor is only 24 bytes long and can be run from a shell prompt.
Total net savings: 8 bytes, every file, every time, guaranteed!*
For brevity, I've written a compressor specifically for AMillionRandomDigits.bin; a general-purpose compressor is left as an exercise for the reader.
$ head -c415209 AMillionRandomDigits.bin > x
And, the decompressor (only 25 bytes; shell prompt prepended for clarity):
$ head -c32 /dev/random>>x
Success!
* Note: decompressor may not decompress the file correctly on the first iteration; if so, re-compress and try again. It'll get it eventually.
Errata: the decompressor is in fact 24 bytes (the description above erroneously stated it is 25 bytes in one place; this is because the beta version of the decompressor used /dev/urandom instead of /dev/random).
@Andrew:
This humorous approach could almost work - you could try hiding data in the form of an MD5 signature in your decompressor, then keep trying until you get a match.
Of course, the extra cost of the signature would always be greater than the bits it allowed you to remove, but that's where the fun comes in - trying to get that past the judges.
- Mark
Suppose the compressor and decompressor exceed million digit file but still compress the file and other random file to minimal size let’s say ((zv[x] + ((x) mod 23) ) mod 256) );
Calculate the frequencies of both the original and the transformed file
Original Frequency
[ [ 47, 44, 43, 42, 42, 40, 40, 40, 40, 39, 39, 39, 39, 39, 39, 38, 37, 37,
37, 37, 37, 37, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 35, 35, 34, 34,
34, 34, 34, 34, 34, 34, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,
32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 29, 29, 29, 29, 29, 29,
29, 29, 29, 29, 29, 29, 29, 29, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 27, 27, 27, 27, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 26, 26, 26,
26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 25, 25, 25, 25, 25, 25, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 23, 23, 23, 23, 23,
23, 23, 23, 23, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 21, 21,
21, 21, 21, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 19, 19, 19, 18, 18,
16, 16, 15, 14 ],
[ 125, 12, 44, 82, 43, 130, 223, 61, 23, 228, 69, 0, 62, 87, 187, 227, 220,
232, 165, 84, 108, 106, 161, 246, 233, 41, 197, 239, 36, 226, 214, 250,
26, 40, 55, 251, 147, 124, 168, 94, 157, 243, 153, 128, 42, 151, 58,
67, 18, 117, 178, 89, 200, 142, 22, 99, 70, 9, 215, 35, 144, 206, 3,
149, 120, 16, 118, 231, 7, 32, 81, 90, 93, 219, 73, 60, 98, 8, 131,
132, 34, 115, 159, 24, 135, 17, 249, 49, 91, 134, 245, 241, 247, 202,
95, 10, 158, 222, 20, 182, 216, 109, 74, 177, 31, 195, 196, 163, 25,
15, 48, 170, 143, 244, 204, 63, 65, 139, 185, 13, 242, 205, 186, 51,
66, 11, 52, 191, 6, 4, 59, 39, 47, 208, 136, 138, 193, 238, 92, 234,
21, 64, 254, 2, 85, 119, 156, 37, 175, 210, 188, 56, 181, 179, 27, 111,
71, 80, 107, 180, 173, 1, 167, 190, 104, 218, 209, 102, 199, 110, 225,
162, 148, 140, 78, 101, 192, 248, 123, 183, 169, 198, 145, 137, 14,
121, 203, 174, 240, 160, 235, 76, 86, 126, 77, 211, 28, 114, 113, 207,
38, 171, 166, 176, 88, 79, 103, 221, 172, 229, 252, 19, 133, 217, 33,
201, 213, 50, 212, 152, 230, 116, 154, 72, 30, 96, 155, 129, 146, 53,
45, 83, 105, 112, 184, 46, 141, 75, 54, 237, 224, 100, 97, 236, 29,
255, 122, 253, 57, 164, 189, 5, 150, 127, 194, 68 ] ]
Transformed frequency
[ [ 46, 45, 44, 44, 41, 41, 40, 40, 40, 39, 39, 39, 38, 38, 38, 38, 37, 37,
37, 37, 36, 36, 36, 36, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 35, 34,
34, 34, 34, 34, 34, 34, 34, 34, 34, 33, 33, 33, 33, 33, 33, 33, 33, 33,
33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,
32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 30, 30, 30, 30, 30, 30, 30, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
29, 29, 29, 29, 29, 29, 29, 29, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,
28, 28, 28, 28, 28, 28, 28, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 26, 26, 26, 26, 26, 26, 26, 26,
26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 25, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 24, 24, 24, 24, 24, 24, 24,
23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 21, 21, 21, 21, 21, 21,
21, 20, 20, 20, 20, 20, 20, 20, 20, 19, 19, 19, 19, 19, 19, 19, 19, 19,
18, 18, 17, 15 ],
[ 218, 95, 32, 177, 134, 245, 89, 19, 156, 24, 49, 22, 233, 242, 38, 39, 6,
146, 92, 125, 217, 179, 182, 31, 74, 235, 231, 246, 13, 107, 168, 114,
214, 142, 251, 15, 238, 171, 169, 199, 86, 103, 127, 126, 5, 250, 120,
25, 83, 90, 196, 63, 77, 232, 18, 10, 147, 255, 42, 65, 239, 67, 101,
128, 221, 167, 75, 116, 9, 208, 72, 99, 183, 143, 197, 69, 121, 59, 87,
93, 44, 111, 139, 64, 51, 52, 223, 150, 164, 50, 84, 224, 202, 55, 190,
181, 187, 158, 28, 226, 47, 3, 195, 117, 201, 80, 138, 8, 110, 248, 37,
36, 56, 252, 104, 163, 14, 141, 27, 154, 155, 188, 153, 45, 100, 136,
4, 192, 249, 76, 157, 144, 166, 53, 228, 254, 203, 43, 29, 132, 211,
68, 26, 109, 58, 118, 61, 54, 130, 237, 12, 48, 222, 210, 70, 41, 1,
94, 73, 162, 213, 244, 60, 243, 198, 17, 151, 30, 33, 212, 21, 35, 129,
170, 180, 152, 88, 122, 105, 97, 189, 140, 66, 115, 172, 57, 91, 113,
247, 204, 102, 16, 40, 71, 191, 11, 207, 135, 227, 205, 159, 215, 240,
108, 230, 193, 241, 96, 176, 124, 194, 186, 145, 79, 112, 161, 178, 20,
234, 2, 78, 236, 209, 220, 23, 149, 225, 0, 165, 7, 123, 200, 175, 216,
62, 119, 219, 133, 82, 98, 184, 206, 185, 106, 173, 174, 160, 34, 131,
81, 137, 229, 46, 253, 148, 85 ] ]
Then calculate the LogInt of Frequency Binomial
The final Result is
Non Compressed size is 7245 X 8 gives 57960
Original Compressed 56811 Bits
Transformed Compressed 56804 Bits
Some Header information is missing. The missing part is here
However you can find other resources here
http://www.facebook.com/pages/South-Side-Technologies/129860307070256
http://www.facebook.com/?sk=lf#!/pages/South-Side-Technologies/129860307070256
However I have watched this challenge with keen interest for sometime, though my interest is not Data Compression per se is on Universal Intelligence. With proper understanding of intelligence then recursive compression may not be all that trivial.
Apparently through UI I have several solution\approaches that compresses any random file which include the million digit file among other random file I have generated.
One approach I want to describe here that result in saving of 7bit is a basic version of several approaches I already have. This approach basically transform random file to one with a lot of redundancy.
For example
The first 7245 bytes of million digit file if you transform
By the following equation
zv:= First 7245;zx:=Size(zv);
zh:=List([1..Length(zv)], x -> ((zv[x] + ((x) mod 23) ) mod 256) );
Calculate the frequencies of both the original and the transformed file
Original Frequency
I'll take a rain-check on those topics Mark. Not that I already understand them enough, just not enough time for more right now.
I have some work ahead of me but I gave the MillionDigit Challenge some thought today and could work on Move To Front again.
That MtF is interesting in that it's cyclic.
I dig that.
So, another wild goose to chase here then before I get completely absorbed in what I must do.
Good Luck Challenge people!
Hello fellow dreamers,
I have accepted this challenge 3 months ago and I’d like to thank you mark, for this challenge is very entertaining and fun!
I totally understand the limitations and how this challenge cannot be won but somehow for an illogical reason I keep coming back and giving it another go (third attempt yet). Probably Ernst has to do something with this, for he is a beacon of hope in thei dark uncharted territories. haha
Now I have a question I’d like to ask.
I have successfully completed an algorithm that will compress the random digits file. The only setback is the speed of this algorithm; it’s going to take around 1-2 months to compress the entire random digits file (boring!!!). (There is plenty of room for optimizations but it will also take time)
My question:
If I extracted a small chunk of the random digits file, say 1k data and compressed it using my algorithm, then added it into a 7zip archive, and then if I compared it to the original 1k digits which are also added into a 7zip archive, I get a significant size reduction on the compressed file (almost 50%). Now Can I generalize this size reduction on the rest of the random digits file? or entropy’s magic only works with the whole file?!
Note: the compressed file is 100% decompressable into the random digits.
From skimming through the above posts, I’ve found the answer to my question in tim's reply to ernst
“dense data remains dense in all forms - whether its in other numerical bases, sets of arbitrary bit strings, or even arranged in different shapes (such as 2d grids, 3d grids, 4d grids, an Ulam Spiral, zigzags, etc).
More incredibly, all the derivative data remains dense. For example, if you took every 8th bit of information from file, you'd find that the resulting bitstring was just as dense and incompressible. If you took a random sample from it, you'd find that sampling just as dense.”.
If this is true then I’ve achieved partial success, ill just have to wait 1-2 months to see the result of the entire million random digits file compressed. I am confident it will achieve 30-50% compression. Expect a submission soon mark :)
Uhn, for all numbers listed, the search of a transform takes more time! Well, if the parameters in the transform have to be attached in the header, there won't be too much compression!
Well there Kiwi-Coder I agree it is fun.
It challenges me to think and imagine.
Back to my previous post about Move to Front.
I worked with variable length codes in matching tokens.
There isn't a single code that allows for a reduction of output bit count for all input tokens so I focused on assigning single bit output for the most frequent.
I worked the Milliondigit file and in most all cases it expanded.
However with two different encoding systems being applied I saw file size reductions of up to 50% but that is after the milliondigit file expanded due to re-encoding.
I never saw a size smaller that the original 415241 bytes.
So there is an area of potential in what I just did. The draw back is that recoding expands this file. The possible reward is that once several re-coding efforts are made there might be a compressible data created. I have not examined the outputs all that closely so there might be a gem in the rubble for some sequence of processing.
Anyway. I just finished that hybrid move to front effort. I have two encoding schemes for it and if there is interest I can share the ideas for good or ill.
Other News:
I have recently found a paper by a respected man that discovered half of an encoding system I discovered for myself in Jan. 2010.
The paper has an associated proof by two other fellows.
I wrote that man and hope he will write back.
Having part of the truth in print already will make my job easier in explaining the encoding universe that it actually is.
So I can't say spending time on Million Digit is a waste of time.
I just hope I am treated fairly as I learn the "paper writing ropes."
So, Heh... I'm able to publish. Who would have thunk it 8 years ago. yeah I have been hacking-about for 8 years on this stuff.
Oh and Kiwi on dense data. I have recoded this Million digit file hundreds of times. With dozens of encoding systems I made myself.
There are files that result that do compress some but never smaller than the source of 415241
I believe the way to compress this is to either find a system that generates this pattern the file is or to transform the data into something that can be manipulated into a smaller file.
Obviously all the standard encoding and compressing techniques fail on the million digit file and files like it.
I do believe that information is separate from form. This would be as we expect a black hole to behave. Such that energy continues into the gravity well but the information is smeared on the surface.
I also believe we don't know everything.
Okay. Wish me luck on the paper and I'm open to sharing my MTF efforts.
Update:
I spoke with that accomplished fellow who has retired and I feel great. He and I share a Union of cyclic arrays and we have different systems that generate that Union. It's nice to rub elbows (in a way) with the accomplished. He sure is a great guy!
I'll be working on recoding the Million Digit file for the winter. It's dark and cold out so inside is a good thing.
I think I will see what transformations provide.
I worked 96 bytes, by hand, and managed to align a row of 6 where the same bit was reset in all bytes.
Another alignment left only 14 set bits out of 48 where the original was about 50/50 or about 24/24.
I was thinking that I might try a few combinations and see if some quality can be exposed such as biasing the file towards reset bits or something.
There is overhead to the effort but maybe I'll get lucky and it will compress enough.
If nothing else I get the experience of writing the software.
Well friends, I am far from out of ideas so don't think that creativity and imagination are false tools. They are your friends.
Good Luck Challenge people.
Ernst
Update:
I decided last night to work with recursive transforming (reversible naturally)of the million digit file.
This means that I am creating new files of same size and eventually it will cycle back to million digit file. Think of it as cyclic code.
I am looking for qualities which make size reduction simple.
I could add Zlib and test each file but I am green on using zlib so I'll have to research.
I will look things over here and once I am satisfied I will be running the quad core with 4 instances of variations of Million digit.
The only way to speed up this search process is to have the computer cycle in both directions and meet itself in the middle some place but I need to learn about parallel processing.
Well if any of you are interested in a different file than million digit I have billions here. Let me know. I'll get some out there.
I was thinking about filling up a 3 TB drive just for fun.
Well lets see how long it takes to find a diamond among the coal.
This is the only way I see compressing million digits working.. Using a surrogate file derived from Million Digit file.
I see I am over 15,000 files at this point..
Time to examine a few.
I am saving files mod 1024
Just think that for some cycle there could be an mp3 that no one ever recorded. Perhaps a picture or a book never written..
Cool stuff..
Update:
Heading on to 4 million files generated as of this writing..
There are billions of files that will satisfy so it's just a matter of processor time.
This is a crude effort at this point so I can improve on this effort.
I'll post more when I have something solid.
Good luck Challenge people!
Ernst
What you are describing here is normally referred as Computational Thaumaturgy. Or "Real Sorcery or Computational Sorcery or In African word is called kamutii "
"Just think that for some cycle there could be an mp3 that no one ever recorded. Perhaps a picture or a book never written..
Cool stuff.."
But it is good you may continue!!!
@mpith
I had never read that word before.
That is a nice thing to say mpith.
Merry Christmas!
Soon it will be New Years Day again!
Happy New Year to you all.
My Update is honest and here it is..
Since I am actually working this challenge I trust I am seen in that light.
I didn't see an easy advantage in cycling the file but since it seemed relative at each instance I will consider that there is structure.
If that is true then as long as I do and undo the same maths some interesting possibilities exist.
However, time is always the limiting factor in life and I will wait on that for a while. Push it on the Stack.
Currently I am writing my first hash function.
I am accepting advice and links for study.
This function is in the coding stage so testing is a ways off and that will happen in real-time on the Million Digit file. The Hash may fail at some point but hopefully this design is flexible enough to offer alternate codes to get around collisions for any one instance. Collisions, I understand, are the problem with hash codes.
Now if this hash function can hold up, at least for the Million Digit file, compression is possible.
I can only guess, for this Hash function design, on the actual range for a given input range at this point but I hope for 23 bits I'll have eighteen million codes of utility or so.
Obviously, I will be learning as I go.
So Update: I have a Candidate for 2012 now and will be working in earnest.
No promises or claims of achievement here since this is a learning experience at every turn and in that it is reward but, this is the first time I have sizeable compression and am trying to keep it.
This is a different point of view for me.
If nothing else my code writing has taken a turn for the serious and efficient . Reminds me of school work and making the "grade."
Good Luck Challenge people!
Ernst
My bad.. I don't know why I typed 18 million there.. I wasn't thinking :)
LOL 23 bits at most is 16,777,215 :) I have 18 on the brain.
Again Happy New Year!
Very interesting thread, and great challenge. I've been puzzling on the problem of compressing 'random' data for quite a while in the past, and I may have a good approach for this.
Does the $100 challenge still stand? I'd like to try and see how well my method works for your Million Random Digits file, and hopefully submit my solution soon!
It most certainly still stands! Good luck.
-Mark
Once one understands the limits of math and why the limits occur then the "problem" is easily solvable.
Think of it in these terms, man can not walk through fire due to the heat and your skin. Understanding the issue helps address the need for a fire suit.
Same concept applies here, clearly using an environment that is reliant on math will not allow one to achieve the desired results.
Another approach is required and I have found it, however it is CPU intensive and there is no real world application that can make use of it aside from encryption of small data.
@ Timo Makinen hey Timo! Sounds really interesting.
I think we all agree that our current tools are what are really being tested with this challenge.
@ all: Any suggestions on Hash functions to read up on?
Now is the time. I am about to do the code with the design I have been working on but, external examples are welcome.
I am interested is systems that offer alternate codes for individual elements.
I expect to have multiple options and hopefully no more need than 1 out of 4 but I am designing for 1 out of 16.
It is easier to write one function once than to redesign it several times.
Other than that .. Week Three for me on this one..
I'm writing a full program this time win or lose on this approach. That means all the mundane details have to be ironed out.
I usually prefer quick-hack style proofs of ideas but this one is a win or lose kind of effort.
I must have faith at this point. I can believe later.
Anyway.. I thought to re-post that I am in for 2012's Million Digit Challenge action.
If I do manage a bit of magic.. Mark I take it that the 100 dollar bill will be new and unfolded? It has to look good in a picture frame :) lol
Good luck Challenge people!
I am now good on the Hash functioning. I get it.
Thanks for the read. I must get to work.
Hey Guys,
I am working on a Codec and part of it is a hash code. I am learning about these but I think I am doing it right. Please comment.
I have the need to identify a class of string. Doesn't have to be an exact choice so a general code say 0 to 1023 (10-bits) seems okay.
Now where I am with this is to be happy with a distribution ( on the Million digit file ) of reasonably equal use of all 1024 values.
The spread ranges from a low of 29 instances of one 10-bit value to a high 80 instances of a 10-bit value and it looks fairly distributed.
What I am asking is how are these things gauged?
I have not taken Statistics at the J.C. yet.
Having nearly equal use of all codes is the goal yes?
And the input file is Million Digits file.
You answer need not be specific or need to directly relate to my work.
I assume a good hash code is uniformly distributed for a set of uniformly different strings.
Am I stating that correct?
>I assume a good hash code is uniformly distributed for a set of uniformly different strings.
I'm no hashing expert, but a good hash function generally should be generating something that looks like a uniform distribution for random input. Ideally, for any input.
But every hash function will have some weakness in which a certain series of inputs will cluster in a non-uniform manner. It's just that in a good hash function, those series of inputs are not ones that you would expect to encounter in typical data.
So maybe your statement is not mathematically concised, but I think it is pretty close to getting to the core idea of what a good hash function does.
- Mark
Cool; Thanks Mark.
I was a bit surprised with this one after a few designs didn't look good to me.
I'll have a better idea on the true utility once I try to decode I am sure.
Good Luck Challenge people.
Update:
Okay.. I am a bit surprised just how hard this is to do. :)
I have a Codec and I calculate, to decode, that it must choose one out of over 2 million potential strings.
Wow, that is a needle in a proverbial haystack. You see what I am up against.
I have choices of which data to sample and how I sample so when I have a clear idea of how this may fail it should help me refine this approach.
The miracle side of this decode effort is that it might work.
I'm now working on the decode aspect. Hash codes are interesting and my hope is that this will decode.
Okay then, I'll be a couple weeks or so I am sure. I assume there are people reading your site and this challenge blog Mark so this is an offering of content as well as a space I blog about my experiences with your challenge.
What good is our Challenge Blog without a candidate once in a while?
I'm over-due presenting at least a hope of winning so here it is.
Good Luck Challenge people!
Hey,
I am a fan of integer sequences. I have an account on a database because I love it.
So; Here, is an interesting sequence.. I'm looking at the "bigger world" through the eyes of the above mentioned hash.
Three dots means sequence before assumed. Three dots after means sequence after assumed. You know, I cannot, post the whole thing so enjoy a snippit.
... 720,8,0,1,2,3,13,727,1021,1022,1023,0,19,733,21 ...
Interesting.
Hey Guys,
I have a question. Has Million Digit been factored?
If so is the data available?
Update:
I'm sidetracked again.
Three days a go I discovered Recursive Modulus and I have been exploring Recursive Modulus to find factors of numbers.
I started three copies of a simple program I wrote to find factors. Each one is running the RSA 2048 data.
I'm guessing I could be waiting years or minutes more for a factor to show up.
Ain't it something? I get so many projects stacked up I feel like I need a staff.
Well, Sleepy time now that that program is running. There is a thread in the Comp.compression group where I explain RM a bit.
It looks like the world has one more way to find factors for numbers.
I am interested in knowing if this Recursive Modulus was already known or if I discovered it,
Either way the RSA 2048 race is on. Lets hope I get lucky.
Okay.. I need to calm down and get back to my data compression effort once again.
I was reworking the Codec when I found the RM so I guess that answers that question I had.
Still it will be really hard to get 1 selection out of 2million+ possible strings.
I was worried when I was looking at the modulus stuff and I found a partial answer to my worry but, I think it could go either way.
Now that is okay as long as I can make it work for the Million Digit file. That is a possibility.
But, there is no telling for a while more.
I hope I have time for all this stuff.
If no one beats me to my own prize since I have been giving it away, I will factor this file soon.
I hope to create a codec of factors and powers. I hope that will satisfy this challenge.
God Speed my friends...
Ernst
I'm asking again has anyone factored Million Digit file.
I have a factor that is 175 pages of 10 point font in open office.
If no one has claimed that factoring I now do.
The factor ends in "36579116120702356361444362876721627845529455567385497"
So I have three factors so far at last check.
Alright then . Finally some headway on Million Digit file.
How exciting.
Well I found a bug in my program on factoring. I am exploring factoring since I have never messed with it to any significant degree before.
I wasn't checking for an end condition of the N being 1 after a factor is removed. It's reasonable to find this errors after checking it out a while.
So i fixed that. Also I was printing the current N before the factor was removed and so seeing a large value for N and no stop for N=1 I didn't see that there are only three factors.
Now I am using GMP's function to check if this nearlly a million digit number is a composite probably prime or prime as thos are the three results that function will return.
My question is, having never used this function before let alone on nearly a million digit number, how long does it take for a million digit number check?
Dang.. feels like it's spinning wheels but I don't know.
Anyone?
I am also running a search using recursive modulus. I started the RM program I have just a minute ago.
Who will exit first?
This factoring is interesting and I am happy I figured out what the recursive modulus was doing and I have code to work with if need be.
I'm still hoping to find reference to this method so if any one has seen recursive modulus method before let me know. I'm sure it will make for interesting reading.
That's enough for me on Recursive Modulus. It works the way it works. Does what it does and has interesting mechanics.
I have it in my tool box. It is another aspect to consider.
I hope others will share their discoveries like I did. Makes for an interesting day.
Now back to work on what I was doing when I saw RM.
Good Luck Challenge people.
Commentary on the Million Digit challenge and the stigma it has accrued in Comp.Compression
When I learned of the Collatz Conjecture I was in class at the junior college taking introduction to programming using the newest programming language C in 1991 and Pascal had just been replaced by C for the class.
I fell deep in to the magic of all things that cycle like the famous 3x+1,x/2 Collatz Conjecture.
As time went on I discovered many things through my private efforts like there are infinite systems for 3x+y where Y is odd.
I also saw the series of systems where collatz fits in I can understand all the systems and I know, without proof, that there are some strange dynamics in play in the true nature of whatever this reality is just by extending that series to the left and realizing that it can go negative. Negative Zero? How strange indeed. It could be true for some point of reality but I simply keep the idea open in my mind. I don't know exactly and it's okay for me to keep it open.
I consider myself a successful man when it comes to my programming interests. I have more ideas than I have days I am sure. I am a friend and a fan of the sciences and I have a wide range of interests. Naturally, as a hobby programmer, I have not specialized in any one discipline but I do really good with the data encoding these days.
Moving forward to 2012.
My interests have found a focus with the problem to solve of the Million Digit Challenge. I have been working on and around the file as a data set for many years.
As a data set I understand it is trustworthy. As a trustworthy thing I measure other things against it and it is a good measure to work with.
So why is there such bias and outright aggression towards the people working the Challenge?
Here is what I understand about the state of the art of data compression.
Data compression as defined by Shannon is statistical in nature and by his own words there is a fork in the early road and Shannon lead us in one direction over the other and that modern data compression is currently travelling on that road.
Along the way many proofs and truths have been confirmed that relate and define many boundaries. That is all good. A wall is a wall after all.
Yet all of these things fail to compress all files with the tools that have been developed and most of the advancements in the tools we use come in small increments and many in specialized efforts.
Indeed there are many good craftsmen out there who have tools and skills to keep those advancements going and that is good. More power to ya friends!
For myself I am not in the business. I am simply an interested man who finds this Challenge a great tool and focus for my interests.
The data set is trustworthy and the results of experiments are valid using it by the very statements of the data compression professionals of all times.
The bottom line is I really don't need this file to compress because in parallel I do my own thing in private. I have more than one iron in the fire and it is my private business what discoveries I may be hiding from the rest of you.
So, why is it such a moral data compression sin to work this Challenge and why do smart people in comp.compression feel the need interfere and down right discriminate?
Rest assured if I was not generating successes from the path I am following I would not be right here right now. I would be doing a different thing on a different path somewhere else but I am having successes. Private success that keep me moving in the direction of finding that information I seek. Me Personally seek! The answers I want. I am not some data-compression-Jesus to hang on the Data Compression cross.
My latest experience, being associated with the Million Digit Challenge, was to report the effect of Recursive Modulus as a factoring primitive. I assumed it was old news I had discovered yet no one seemed to know about it.
So, it turns out that I had seen factors in the stream while designing a hash code and it turns out that it can be considered a primitive recursive function of some class. It is possible I am the first to report it but I am fine with knowing who first wrote it down.
So there is a private success anyone can see just as I assure you I have more private successes that are worth reporting.
Yet, what does it hurt to look for new ways and new understandings? Is it because someone may have disability? Like Newton or Cantor?
Conclusion :
There is nothing wrong with innovation and for every success there are untold numbers of failures in human history.
When it comes to data compression our understanding is incomplete since our understanding of what reality is, is incomplete.
The moral of the story is every one dies and no one knows everything but it is often the fringe people that advance innovation like Newton or Cantor and many others.
So let people dream you data compression Thought Nazis'
Let those on the fringe take a shot at fringe things and stop the authoritarian tirade that wrestled control of comp.compression forum from the same over the years.
If we don't know everything then we can use all the help we can get even from fringe projects like compressing random data.
If someone wants to invest part of their life trying and you simply are not a mind reader don't stand there telling them they are idiots and then play them on every idea they have like some molester.
I am now taking the name calling and harassment personal after 10 years of efforts on understanding the nature of information come June 2012 and focusing on the million digit data set as a development tool since around 2003-2004 I believe.
So learn to be a friend comp.compression because you have the opposite part down pat.
However, if someone can prove to me that they will never die and that they know everything I will apologize sincerely.
Other than that expect me to get as ugly as possible to protest the labels and comp.compression keep on closing threads to cover up the mess that you comp.compression make by being a biggot.
In short, let the fringe people give things a try and stop kicking them in the teeth for sharing news and interacting for friendship especially the need for social friendship part.
It is simply amazing that such a noble challenge is tuned into a psychotic-hustle in that forum. What the hell are they afraid of?
Thank you Mark Nelson for a forum and a focus.
Ernst
Thanks everyone.
I am currently "down for repairs" with my desktop and the new machine is awaiting parts but once I have access to the hard drives again that large factor I can share. I take it a simple export in GMP is what is required. I'll have to ask Mark if that is how he generated million digit. I think that is the case. On that maybe I should give a copy to Mark as well.
If anyone is interested is seeing if we are trying to compress a prime number I have no problem sharing some technology and the number I believe is a factor.
The old desktop needs a older quad core and I have to find someone who has the older chip but when all is done here that quad core is available to run factoring efforts.
I am open to working with others in determining is we are trying to compress a prime number or not. Maybe we can conjecture no prime number is compressible?
If nothing else is used I know recursive modulus will do the search for factors but I am open to suggestions.
What do people think of GMP's prime number test?
With recursive modulus as I employed it, it will search from the number minus one to one so it will cover the ground albeit mechanically.
With recursive modulus we can divide the search into sections and share the work such that each of us can get a block and process so we all don't overlap.
Other than that I will be slowing down on the number of hours I have for computer with the coming of spring. I am still hopeful I will find a full time job with benefits and it looks surprisingly hopeful with the resent up-tick of hiring across the Nation and especially locally.
Alright then, I hope to check out that factor candidate. At this point I conjecture there only three factors but that could change if this nearly a million digit number is composite.
And again if there is interest in working the number with me just let me know. It would be fun to work with others for a change.
Ernst
As of late many people have had a instance or perception about Ernst the Internet~Usenet guy. I cannot worry how any one person feels and I will practice a punch back for a jab at me from here on out; however, I mostlt wish for peaceful interactions.
Update.
As it often is for me I go sideways as I explore things.
Recursive modulus may well be a primitive recursive function of some class. I see conflicting data classifications but then again perhaps there is a bigger picture. My best guess so far is primitive recursive function.
I am open to other suggestions.
As a result of blurting out recursive modulus rather than quietly absorbing a new thing as I usually do, I have a speed-race to re-evaluate other efforts.. You see each piece of the puzzle understood helps paint the picture of what is really true; for me.
Anyway.. I must divert a second time. The effort I have posted as being my effort is being pushed on the stack until I have an answer about the utility of Pi.
Don't expect me to debate the philosophy of transcendentals' I simply am interested in the factual patterns of Pi and other Transcendentals'
So friends, I am delayed again but for good reason. No sense in designing a codec that takes a long time to decode if there is a faster way. That is my concession here. I simply have to look some more before I proceed.
Ernst
Update guys.
I have to review a few things before I will know what is the forward direction here.
I have a couple of parallel threads to process so to speak.
So I will be working but not so active here or comp.compression.
I'm studying.
Well, Spring is happening here.
I've settled on pattern matching using the tools I have on hand at this time for my focus at this time.
I faced several important programming topics lately and any of those directions is worthy of effort but I have to pick a focus I feel confident will be a reasonable project for me.
It's rather simple to generate a large data set from small sized code.
For example I have a 10kbyte program here I am working on that generates a 2 million bit dataset rather fast.
The real challenge then is writing the pattern matching code that will expose the results of matching to the dataset.
I expect to have to spend some bits and to save some bits. If this follows other experiments I have done, that can be considered related, then it is an open question as to if the net result will be smaller than million digit file by one bit and the decoder size.
Then again one thing builds on others and this may be another tool in the tool-set after I am done.
At this time, it looks like all I need is about 15kbytes of reduction of million digit file at this time. It is my guess as to the actual size but from my current point of view that would be 3kbytes more than I assume the final size a decoder will be at this time just to be safe.
Well, pass or fail I am still interested in this aspect of data compression and I am still working the challenge.
Pattern matching looks to be an interesting exploration to code for.
Any one else? Drop a line here and say hello.
Good Luck Friends!
Heh!
Speaking of RSA style numbers.. Million-digit's byte count is the product of relatively sized primes like RSA would be.
617 and 673 is 415241
It has other interesting properties but since we are all interested in prime pairs such as;
Hint: Well try recursive modulus.
Hell, I am just puttering around. The Pattern Matching is less than rewarding for me today.. Bummer..
Anyway.. there is some trivia..
Ernst, here are some useful comments for your work on challenge
>> 617 and 673 is 415241
617 and 673 is 1290, not 415241
>> ... but since we are all interested in prime pairs ...
log(a * b) = log(a) + log(b)
Let million-digit-file = x * y, then
log(x) + log(y) = log(x * y) = log(mdf) =~ 3320000
>> Hint: Well try recursive modulus.
My hint would be: Well try math!
I'm curious why multiple files are allowed, I think that's a loophole.
From what I read, storing data in the filenames is not allowed. I read before that someone "beat" a $5000 compression challenge in this way before, so I can see what's trying to be avoided. (notice "beat" in quotes)
The only loophole I see is this: would enumerating files be considered storing data in them? I mean something like: 001.file, 002.file, 003.file, 004.file etc.
Because just the sheer presence of these files provides information that otherwise does not exist. For example if a certain bit pattern that repeats in the original data is found, then you could take that bit pattern out. Normally, you'd have to store at which locations you took the pattern out of, so it can be re-inserted, and this would quickly add up to more info than you saved.
But, if instead you split the file each time you encountered this bit pattern and stored these pieces of the original file in enumerated sequential files, you wouldn't have to store the information of where to re-insert the bit pattern. The concatination would simply be 001.file + bit pattern + 002.file + bit pattern + ... etc.
The question is only to find such a bit pattern. This isn't difficult:
The file is 415,241 bytes. A single byte can have 256 separate states (meaning there are 256 unique bytes)
By the pigeonhole principle then, just under 415,000 of those bytes have to be a repeat of a previous byte that's already been seen.
In one extreme, each one of the 256 bytes would be repeated about 1,600 times, and in the other extreme, 1 of these bytes would be repeated the whole 415,241 times.
Since if only one byte repeated, this data wouldn't be very random, the case here is probably closer to the first one, where the most any single byte is repeated is ~1,600 times.
So, we find this most-often repeating byte, and split the data each time we encounter it, storing each new chunk in an enumerated file.
The "decompressor" should simply store the value of this byte, and then concatenate these files together, each time inserting this byte between two files.
So, basically, just by having enumerated files, we should save about 1.6kb (in the worst case scenario). The source code for the concatination program could potentially be written in under 1.6kb - it is rather simple.
This process can also be repeated with nibbles (4bits) and 2bit patterns that repeat - there should be even more of those repeating.
But, even if this works, this is NOT quite in the spirit of the challenge, and isn't compressing the data, so much as just hiding the information elsewhere.
To be clear, I am indeed a programmer by profession, but I have NOT programmed this. I found this article today at work and the idea occurred to me when I read some of the comments later.
I am NOT CLAIMING to be able to compress this (or any other) random data - only that there's a potential loophole in the rules (if Mark allows enumerated files). I am also well aware of the quite solid proof of the counting argument of why no method can compress all files.
If I do have a free time, I'll try to write a program to run this sort of data-hiding scheme, to see if it would really end up being smaller.
@Milcho:
One of the reasons that I have the rather vague comment at the end is because I don't want to get in a never-ending contest with people who find clever ways to hide their extra data. I can't anticipate every possible way to do this in advance, and I don't really have any interest in doing so.
For example, let's say I limit things to one file. Well, someone will come along and create a directory structure with 400,000 entries, and just one file.
Then I say no, it all has to go in a single directory. Then they create a file name with 400,000 characters in the name.
Really, who cares? This is not interesting and I don't want to deal with it.
And so far, this has worked okay, I haven't really had anyone suggest that they have met the challenge while staying within the rules.
- Mark
@Mark
It makes sense, the rules of the game are to actually compress the data, not shuffle it around to unseen places. It took me all of one hour to write the code of what I posted above, and that's just the simplest way to use the filesystem to hide data.
It's amazing how recent some of these claims for incredible compression are, and that there's still people who buy into it. Oh well, they may be wasting their time, but it's at least slightly amusing to read about it.
Hey Milcho!
I see you found the challenge interesting!
cool!
But hiding data is not in the spirit of the challenge for sure!
Give it some time and maybe you will find a unique solution!
Update :
I'm working on bit-pattern matching using a constructed data-set. Also I am learning about OpenMP and parallel processing along with this effort.
So win or lose there are benefits for trying.
Ernst
on April 11th, 2012 at 8:39 am, Tobbi said:
Ernst, here are some useful comments for your work on challenge
>> 617 and 673 is 415241
617 and 673 is 1290, not 415241
>> ... but since we are all interested in prime pairs ...
log(a * b) = log(a) + log(b)
Let million-digit-file = x * y, then
log(x) + log(y) = log(x * y) = log(mdf) =~ 3320000
>> Hint: Well try recursive modulus.
My hint would be: Well try math!
Pardon Tobbi those are the factors of 415241 as in 617 times 673 = 415241.
I can understand your confusion and I promise to not try and trick you again.
Be Well!
The prize amount ecouurages to try this. In a way, challange creator also doubtful about the impossibility of this work. Hence he seems has not taken much risk ;)
Leave A Reply
You can insert source code in your comment without fear of too much mangling by tagging it to use the iG:Syntax Hiliter plugin. A typical usage of this plugin will look like this:[c]
Note that tags are enclosed in square brackets, not angle brackets. Tags currently supported by this plugin are: as (ActionScript), asp, c, cpp, csharp, css, delphi, html, java, js, mysql, perl, python, ruby, smarty, sql, vb, vbnet, xml, code (Generic). If you post your comment and you aren't happy with the way it looks, I will do everything I can to edit it to your satisfaction.int main()
{
printf( "Hello, world!\n" );
return 1;
}
[/c]