Note: As of 9 October 2012 the compression challenge has been updated. See the new post for details. The comment stream on this post is closed due to excessive BMI.

The Folly of Infinite Compression is one that crops up all too often in the world of Data Compression. The man-hours wasted on comp.compression arguing about magic compressors now approach those of the Manhattan Project.

In a nutshell, a magic compressor would be one that violates Abraham Lincoln’s little-known compression invariant: “You can’t compress all of the files all of the time”. It’s trivial to prove, and I won’t do it here, that no single compressor can losslessly and reversibly compress every file.
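The proof is a simple pigeonhole count, and it can be checked mechanically. A small Python sketch (mine, not part of the original post): for any length n there are more n-bit files than there are strictly shorter files for them to compress into.

```python
# Pigeonhole count: there are 2**n files of exactly n bits, but only
# 2**n - 1 files of fewer than n bits (lengths 0 through n-1).
# A lossless compressor is injective, so it cannot shrink all of them.
def shorter_files(n):
    """Number of distinct bit strings strictly shorter than n bits."""
    return sum(2**k for k in range(n))

for n in range(1, 16):
    assert shorter_files(n) == 2**n - 1   # always one short of 2**n
    assert shorter_files(n) < 2**n        # so some n-bit file can't shrink

print(shorter_files(8), 2**8)  # 255 possible shorter outputs for 256 inputs
```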

The easiest way to foil most compressors is with a so-called “Random File”. In the world of Information Theory, randomness is a tricky thing to define, but I like to fall back on Kolmogorov complexity, where the complexity or randomness of a file is defined as the length of the shortest program that generates it on a given Turing Machine.

Several years ago I posted a challenge on comp.compression that I hoped could be used to silence any proponents of Magic Compression, and I’m reposting it here so I have a permanent location I can point to.

How does it work? I took a well-known file of random digits created by the RAND group (no doubt at taxpayer expense), and converted that file into binary format, which squeezed out all the representational fluff.

The result was AMillionRandomDigits.bin, 415,241 bytes of incompressible, dense data.

The challenge is to create a decompressor + data file that when executed together create a copy of this file.

The size of the decompressor and data file added together must be less than 415,241 bytes.

So far nobody has been able to do this. I wouldn’t have been surprised to see somebody get a byte or two, but that hasn’t happened.

The only real rule here is that we have to negotiate exactly how to measure the size of the program plus data file. I’m willing to exclude lots of stuff. For example, if you wrote the program in C++, how would we measure the size of the program?

In this case, I would measure it as the length of the source file, after passing through a reasonable compactor. The run-time libraries and operating system libraries could reasonably be excluded. But that’s just one example; the rest is all subject to public interpretation.

Really, the only rule we need is that the executable program can’t have free access to a copy of the million random digits file. For example, you can’t create a new programming language called JG++ that includes a copy of this file in its Run Time library. You can’t hide the digits in file names. And so on.

As long as those rules are obeyed, the challenge will never be met.

#### Recursive Compression

This challenge has a special place for the variant of Magic Compressors known as Recursive Compressors. Some savants will claim that they have a compressor that can compress any file by a very small amount, say 1%. The beauty of this is that of course they can repeatedly use the output of one cycle as the input to another, compressing the file down to any size they wish.

The obvious absurdity to this is that if we compress every file to a single bit, it’s going to be kind of hard to represent more than two files using this algorithm. So most people in this subspecialty will claim that their process peters out around some fixed lower limit, say 512 bytes.

For those people, a working program should be able to meet the challenge quite easily. If their compressed data file is a mere 512 bytes, that leaves 400K of space for a decompressor that can be called repeatedly until the output is complete.

#### The Payoff

The first person to achieve the challenge, while staying within the spirit of the challenge, will receive a prize of $100.

## 552 users commented in " The Million Random Digit Challenge Revisited "

*Cracks knuckles*

*fires up compiler*

*scratches head dumbly – wonders how long this might take*

Wow – you’ll give me $100 for doing that – um ok. I’ll get back to you – later.

It’s not just the $100. You’d be famous!

Hello!

When it comes to recursive magic, I think you might need to explicitly include the requirement that the number of cycles be included along with the actual decompressor and compressed data. Otherwise, the challenge would be easy. (You already know this, of course, but I wouldn’t be surprised if the magic-doers don’t realise it, with all that that could entail.)

The way I’d do it is to treat each file as a string of bits, add an extra one on one end, and interpret the result as being a nonzero natural number in binary (with that extra one being the most significant bit). Then, my compressor would subtract one from that number, unless that number is already one, in which case it cannot be compressed any further. As the result must be at least one, it’s going to have a most significant one in its binary representation. Chop that one off, and take the remaining string of bits as the ‘compressed’ file.

Most files don’t compress with just one compression cycle, but, if enough cycles are applied, all files (of finite length) end up compressed to zero length.
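The subtract-one scheme described above can be written out directly. A Python sketch (my rendering of the comment's description, with files modelled as bit strings):

```python
def to_num(bits):
    """Map a bit string to a natural number by prepending a sentinel 1."""
    return int('1' + bits, 2)

def from_num(n):
    """Inverse: render n in binary and drop the sentinel leading 1."""
    return bin(n)[3:]   # bin(n) starts '0b1...', strip '0b' plus the sentinel

def compress(bits):
    """One cycle: subtract one. Fails only on the empty file (n == 1)."""
    n = to_num(bits)
    assert n > 1, "cannot compress further"
    return from_num(n - 1)

def decompress(bits):
    """One inverse cycle: add one."""
    return from_num(to_num(bits) + 1)

# A cycle only shortens the file when it crosses a power of two:
print(compress('000'))                # '11' -- shorter
print(compress('101'))                # '100' -- same length
print(decompress(compress('101')))    # round-trips to '101'
```

The catch, as the reply below the comment notes, is that recovering the original requires knowing how many cycles were applied, and that cycle count carries exactly the information that was "compressed" away.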

To recursively decompress such a file, you just have to apply the (obvious) decompressor as many times as the compressor was applied. But how many times is that? “Oh!” I can imagine the magic-doers saying, “You never said the cycle count counted, too! You’re cheating by changing the rules! It’s a CONSPIRACY!!!1! Yet more proof that the ‘experts’ won’t admit that they’ve been wrong all these years!…”

:-)

I think the challenge as is is safe against this. Basically, your method requires hiding information in the compression cycle, and then using it in the decompression cycle.

I’m trying to avoid most of those tricks by asking the user to create a decompressor only – a single file, that when executed, creates the random file. So even if the user counted up the cycles while compressing the data, there’s no place to store them.

So for example, if you created a shell script or batch file that executed your decompressor, it would have one line that said:

for (i = 0; i < big_big_number; i++)

decompress

Because that number is so big, the shell script itself would have to be bigger than the million random digit file.

You could try to finesse this by hiding the number in a huge package of 0 length files, each with a name that makes up part of the number. But we’ll nip that in the bud!

Sachin Garg has a post on c10n.info that highlights a classic magic compressor here. Jeff Fries seems to be claiming spooky compression powers that would enable him to get a slam-dunk on this million digit test – however, there’s a bit of a problem in that I think he needs some hardware to make it happen.

fifty quid? hardly worth getting out of bed.

i’ll keep the file though as it could be a good test.

>fifty quid?

fifty of your quid plus worldwide fame.

of course, you can always wait for the exchange rate to improve.

Don’t tease or Peter St. George and Hamby of Zeosync will come swirling out of the sky on giant bat wings and roost everywhere.

- truth

@truth:

I try not to tease, but it is kind of fun to watch the challenge roll on for years and years with (of course) nobody even able to submit a challenge!

Hey, Mark.

I thought you would remember when the “truth” brought Zeosync down. Seems to me they started to fold about three weeks after Sam Costello’s article in PC World.

- truth

Hi everyone!

If someone manages to compress the million random digits file by a few bytes, would it have any significance? Would it be an accomplishment?

Wouldn’t it simply mean that the specific sequence of digits wasn’t completely random after all?

@Emil:

A few people did some analysis on the file and have found some minor weaknesses in it. In theory one could squeeze a few bytes out of it.

But of course, to win the challenge, your decompressor would have to be vanishingly small. This will be tough to do.

So… if you did create a decompressor + data file that was smaller than the challenge file, but only by a few bytes, you would be lauded for your accomplishment, but you wouldn’t be breaking any laws.

The idea of counting up to the number represented in the file’s bits is, as stated, flawed because inserting that number into the source code (or data file) takes exactly as much space as the original.

But I like the tactic. Are there ways around this limitation? Assuming infinite computer time: factor out the primes, and write each in an efficient form to the data file. The decompressor would then multiply the numbers. Obviously, that’s extremely vast division and multiplication, but the algorithm for doing so with bit streams as the operands is probably not TOO large.

Assuming the decompressor also has unlimited runtime memory, it opens a few more options. Say instead of just multiplication we also allow for certain meta-instructions. Each entry in the table is either a bitstream or a manipulation of one of the other (preceding) entries. With addition, subtraction, multiplication, maybe division, left shift, right shift, and/or/xor, you could start to get pretty expressive.

As the size of the compressor is not in question, and the efficiency on both ends is not part of the question, the success of this approach begins to depend less on the algorithm – over a large enough file even a very small compression ratio would overcome the size of such an algorithm. Instead, we’d have to ponder whether or not the data file, with its instructions *and the separation of the component bitstreams (factors)* would actually be any smaller than the source set, or smaller than any source set, or smaller than only very large source sets, etc.

My suspicion is that it would not. Even merely the storage of any given prime factor from the original number requires overhead. Where the source file is considered to contain the number and nothing but the number, we must be able to store multiple streams of arbitrary length in ours.

I don’t think you could demarcate them in a header to the data file because the bit positions are ALSO arbitrarily long. Would you store them in 4 bits? 8 bits? 32 bits? 128 bits?

Instead, each bitstream should be preceded by its length. The length itself must have a length; to store the length, you could count each 1 up to the first 0, then use that as the number of bits to read for the length, and use that as the number of bits to read for the string. If you wanted to meta it out further, after each ‘length’ string you could insert ‘1’ for ‘the stream begins next’ or ‘0’ for ‘read a new length string with a length of the last length’. :D
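The length-of-length idea described above is essentially an Elias-style self-delimiting code. A Python sketch (mine, under the interpretation that the unary run of 1s gives the bit-width of the length field):

```python
def encode(bits):
    """Prefix a payload with a self-delimiting length: a unary count of
    the length field's width, a 0 terminator, the length in binary,
    then the payload itself."""
    length_field = bin(len(bits))[2:]
    return '1' * len(length_field) + '0' + length_field + bits

def decode(stream):
    """Read one self-delimited payload; return (payload, rest of stream)."""
    k = 0
    while stream[k] == '1':                    # unary: width of length field
        k += 1
    length = int(stream[k + 1 : k + 1 + k] or '0', 2)
    start = k + 1 + k                          # skip unary, '0', length field
    return stream[start : start + length], stream[start + length:]

packed = encode('10110') + encode('0')         # streams concatenate cleanly
first, rest = decode(packed)
print(first, rest == encode('0'))              # 10110 True
```

The overhead is roughly twice the bit-length of the length plus one, which is why, as the thread keeps concluding, the bookkeeping tends to eat any savings on random data.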

I did ramble a bit. It caught my eye. Any thoughts?

The idea that there are shortcuts to counting things is appealing, but ultimately fails to bear fruit. Counting any random thing will lead to some numbers that compress, but many, many more that don’t.

*shrugs* Alright, how about this idea (I’m no expert in compression or codeing, just throwing an idea)

Alright assuming these are numbers. 0-9

why not have it search for the inevitable patterns that arise. Find the most common ones (say if 12345 appears 100 times for example) with as long a string as possible, and re-assign that to a different character

so for example:

123453726482912345375482

would be

X37264829X375482

Find the most common repeating patterns, and longest running ones. Then you have, for example, 1 saved (12345) and 100 X’s.

If the system finds there are 10 occurrences of a 250 character instance, that shortens things by 2000 characters or so, no? (including your new X’s and one copy of the 250 char key)

Patterns WILL occur in any random number, especially at a million characters.

0-9 only uses a few of the variations of binary, leaving you a lot of characters to convert them into.

Now if the program could be made small enough to fit into it, and left to run for a LONG ass time to find the patterns for all lengths and combos… (but processing time wasn’t a factor)… that should do it?… no?

@Digital:

You’ve done a good job of describing a class of common compression algorithms, sometimes known as the Macro Replacement Model.

It works great when some patterns appear more frequently than others. But if this is not the case, if patterns appear randomly, it takes more bits to describe and replace them than it did in the first place.

Think about it: let’s say that the string 1234567890 only appears once in the million digit file. To use macro replacement, I have to insert a macro definition in the file, something like:

X=1234567890

Then I replace that occurence in the file with ‘X’.

At a minimum, that takes two characters and change more space in the file than it was using before. Fail.

If the string occurs more than once, you win. And when long sequences start showing up with some regularity, we describe that file as not being random.
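Mark's accounting can be put into a back-of-envelope helper. A Python sketch (mine; the two-character overhead for the macro name and the '=' is my assumption, not his exact bookkeeping):

```python
def macro_savings(pattern_len, occurrences, overhead=2):
    """Net characters saved by replacing every occurrence of an
    m-character pattern with a single-character macro, after paying
    for the definition (the pattern plus ~overhead characters)."""
    saved = occurrences * (pattern_len - 1)   # m chars shrink to 1 each time
    cost = pattern_len + overhead             # the 'X=pattern' definition
    return saved - cost

print(macro_savings(10, 1))     # -3: a single occurrence is a net loss
print(macro_savings(250, 10))   # 2238: Digital's 250-char, 10-hit example
```

On genuinely random data the long, frequent patterns that make the first term large simply don't show up, which is Mark's point.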

@ Mark

Figures something that simple would have been thought of already, haha

I would be interested to see how well it would work on this, though I’m sure it’s been tried. A million is a large number, large enough that odd things start happening with probability. I would think that there would have to be quite a few 3-4 character matches, and at least a few larger ones.

You’d also have to have the program try all possible combos, and THAT’s the processor eater.

00…01…02… ~~~ 99…000…001…002

after each one, check number of replacements vs length of string to check total saved space. That would have to go up to 500,000 characters looping to verify every combo has been included. (now that I think about it, go backwards, starting at 999999, 999998, 999997… hate to kill a larger combo by finding a smaller one… although with unlimited time, run it both ways to be sure)

It would have to keep a record of every single one of them with a ‘space saved’ tag next to them…

Now that I think about it… since time isn’t an issue here, after it goes thru all combos, run that sucker again! who knows, maybe we’ll get to compress our new ‘letters’ we’re using as place holders… as long as we keep track of what order we compressed them nothing should be lost.

Of course, before finishing, all the non-used combos could be deleted out, so you don’t have a dozen strings of 0 matches, or 1 matches that aren’t used to compress.

___

bah, now I’m rambling again. I’m sure all of this is already done and been rendered useless by the next generation of compression. Still interesting to re-invent it in your own head :P

@mark

I’m going to read up on “Macro Replacement Model” some, I’m sure it will be interesting to see the completed idea. ty for the info

@Digital:

>I would be interested to see how well it would work on

>this, though I’m sure it’s been tried. A million is

>a large number, large enough that odd things start

>happening with probability. I would think that there

>would have to be quite a few 3-4 character matches,

>and at least a few larger ones.

I encourage you to work through the problem, I think you’ll find it instructive. A lot of people run into the trap of coming up with ideas like this and then not following through with actual tests to see what the real costs are.

@mark

the extent of my programming knowledge is VBA unfortunately… and though I think it would be possible in VBA it would be… let’s just say very very ugly. haha

Reading up on different compression methods, there’s a lot of interesting stuff in there. Of course now I’m trapped in the ‘wiki-hop’ of reading every article that’s linked to it, so I’m sure I’ll be sucked in for a few days.

You really have not received a *single* challenge? That’s amazing. And pretty convincing. Is the challenge “real”? Is it still available (July, 2008)? Maybe potential challengers don’t think that you really intend to give out the money.

Of course, your prize is “only” $100. Is Mike Goldman still offering the $5,000 prize that I read about years ago?

@Ak3la:

No, after all these years no serious challenge.

And I think it unlikely that there will be. Some hard workers on comp.compression identified a tiny bit of redundancy in RAND’s million digits, but the amount is so small that I’d consider it impossible for even the most skillful coder to take advantage of.

Are those dollars American or Canadian?

I think Canadian is worth more these days.

I’m still trying to make it happen.

I have another idea so I’m working on it, again.

I was curious as to the status of the project so I did a search for million digits and here is this nice site.

I should know something in about a month.

Wish me luck.

Hi Mark,

I came across this challenge while casually browsing the internet. Is this challenge still on? What’s the best compression reported to date? I tried WinZip and Winrar, both gave compressed files bigger than the original file, so I realize this is not a trivial challenge.

Anash

@Anash:

I don’t think anyone has bothered to report “best compression” on this – it’s kind of an all or nothing deal. So far nobody has been able to come up with a good algorithm for compressing this random data. Nor is it likely that anyone ever will.

Hmm…

There’s no mention of isolating the decompressor’s output to a single file. Nothing says the decompressor is responsible for identifying which output is the match, only that it’s been created. If your script simply created 415,241 files with the bytes incrementing from all 0′s, this random data should be in there, yes?

On the other hand, this sort of “letter of the law” vs spirit has inspired me to look at different techniques for hiding information; thanks A Million Random Digit Challenge.

Len

@Len:

Stop a minute and think about how many files you’d have to create in order to be sure the random file was included… that would be 2^415,241, not 415,241.

That would be a lot of files.

Alright well a month may be a little bit of an underestimate.

Still trying new ideas.

Thought I had a winner the other day but I’ve learned:

Great results… Magic compression? “There is a mistake someplace.”

I hope that makes ya snicker…

Anyway Happy new Year to all of ya still trying to compress million digit file.

Ernst

Hi Mark

Here is my recipe (awful pseudocode).

mydigest = digest the file with md5 (as an example)

Decompression:

iterate a from 0 to max_length

{

if md5(a) == mydigest then output("File found: a") ; exit }

output "error!!"

As md5 can collide, maybe a two-digest approach adds more protection against finding a collision on the way..

Well this is a kind of joke but you never said the algo has to be deterministic…

Regards Angel
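Angel's recipe does run in miniature. A Python sketch (mine) that recovers a two-byte "file" from its MD5; the catch is that the loop bound for the real 415,241-byte file would be 2^3,321,928, and collisions mean the first match need not even be the right file:

```python
# Brute-force the preimage of a digest -- feasible only for tiny inputs.
import hashlib

secret = bytes([0, 42])                   # a 2-byte "file" to transmit
digest = hashlib.md5(secret).hexdigest()  # the "compressed" form

recovered = None
for n in range(2**16):                    # enumerate every 2-byte file
    candidate = n.to_bytes(2, 'big')
    if hashlib.md5(candidate).hexdigest() == digest:
        recovered = candidate
        break

print(recovered == secret)   # True here, only because the input is tiny
```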

@Angel:

Nice idea, if only it worked!

- Mark

Anyone else working with recoding the file?

The number of collisions of a hash function on a file storing a number around 10^1,000,000 is going to be around 10^1,000,000 divided by 2^(size of hash in bits). The count of the number of collisions needed to reach the original data, plus the size of the hash, is going to be about equal to the original size of the data.

If you use multiple hash functions it is more difficult to determine the number of collisions. However it would take near infinite time to compress or decompress by going through every possibility and calculating the hashes.

You could hash the orignal file, drop every Nth bit or byte, and calculate a hash(es) of the dropped data. This may (or may not) reduce the number of collisions significantly, but it would still take near infinite time to reconstruct the original file.
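The back-of-envelope in that first paragraph can be checked mechanically. A Python sketch (mine), working in log2 terms:

```python
FILE_BITS = 415_241 * 8   # bits in the million-digit file

for hash_bits in (128, 256, 512):
    # log2 of the expected number of preimages (collisions) of one digest
    log2_collisions = FILE_BITS - hash_bits
    # bits needed to index WHICH preimage is the real file, plus the digest
    print(hash_bits, log2_collisions, log2_collisions + hash_bits)
```

The index-plus-digest total always comes back to exactly the original file size: the hash buys nothing, which is the comment's point.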

Never give up. Never surrender.

Hey everyone.

Helping a friend move this weekend and I needed a break so I thought to post some here and there.

Happy Valentines day all you ladies.

I’ve been working on the Million Digit challenge these past few weeks. It’s funny how looking for work lets a man focus his mind.

I’ve had a few interesting results but nothing works all the way yet.

This next week I will try a new algorithm I worked out and see if it does any better.

What did they say in the original Hitchhiker guide to the galaxy? “Keep banging those rocks together.”

http://en.wikiquote.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy

Good luck to all.

Ernst

How about...

Yay! I win! Ok, I know that this is cheating really...

@mike40033:

Yeah, kind of violates this rule:

>the executable program can’t have free access to a

>copy of the million random digits file

- Mark

Hmm... that comment didn't quite look like I meant it to. I guess I didn't read the instructions for the iG:Syntax Hiliter plugin properly........

The rule about "not inventing new programming languages" should perhaps be tightened to not allow new programming languages at all, or to require that the compiler/interpreter for a non-standard language be included in the bit count.

Otherwise, for example, I could invent a programming language as follows. The programming language is called "MikesMillionMunger". As the rules stand, the compiler for MikesMillionMunger doesn't count towards the number of bits. The language doesn't contain the file in its runtime libraries.

1) Any sequence of bits is a valid program. The two instructions are '1' and '0'.

2) When the program starts, it outputs the bits corresponding to hexadecimal 9b1b.

3) The instruction '1' means 'output the bit 1'

4) The instruction '0' means 'output the bit 0'.

Now that I've defined a programming language, I need a program to run on it. The program I will use will be the last 415239 bytes of 'AMillionRandomDigits.bin'

My program is 2 bytes less than the file, the size of the input is 0 bytes, and I can argue (against the spirit, but within the letter, of the contest) that my implementation of my language won't count towards the bit-count.

Mike H...

@mike40033:

As I've said before, the whole point of this is not to finesse the rules, but actually achieve the goal, so I'm not going to go crazy on definitions.

Anyone who really thinks they have won the prize should be able to have a program that works on the real file, then works again on the same file encoded with a randomly chosen key using DES or whatever.

- Mark

But you only require a decompressor. To make it work on the encoded file needs a compressor as well.

If I understand the challenge correctly, the whole point of this is not to 'achieve the goal' but to silence people who don't understand the pigeonhole principle. I would suggest that this requires closing loopholes in the rules.

However, it is *your* challenge, so I shall respectfully defer to your interpretation.

Mike H...

Oh it's fun and the history behind it all as well.

It's a tough challenge. Fun for someone like me.

I will be very impressed if I or anyone else does well.

Best to have a bit of humor about it all.

I'll place a side bet the winning method will be clever.

Ernst

Maybe someone here can help me remember something I've been trying to find for years...

A long time ago (early 90's or so) there was actually an ad in Byte magazine for a company that claimed to have a magic compressor that could compress anything.

If I recall correctly, they got some interesting letters to the editor and then quietly went bankrupt.

It's the only time I've ever seen this in print in a reasonable magazine, and I'd love to see it again. Anyone remember this ad, or the name of the company?

David

@deg:

Search for the comp.compression FAQ, and when you find it, I think you'll find that the company was called WEB Technologies or something like that.

- Mark

Thanks Mark, that was it. Details in answer #9 of that FAQ.

Turns out that the story played out over three years, 1992-95. But, in an ironic mental compression, my memory had reduced the story down to just a few consecutive issues of Byte.

David

I am assuming that someone has already thought of this, but I would be interested in learning about why this doesn't work. Could you use a similar algorithm to the Macro Replacement Model, but choose macros that fit in with Benford's law? That is, random numbers will tend to have more 1's than 2's, etc. Could you use this to your advantage when writing a compressor?

not sure if this way has been mentioned (as i didn't read the entire list of comments)

encode the data using a prime factor algorithm:

for a string of digits "102310"

you would compute 2^1 * 3^0 * 5^2 * 7^3 * 11^1 * 13^0

you could then split this number into larger factors that you would be able to store in a much smaller filesize (e.g. 450^345 + 6344^231 or something, completely made up)

This would of course be an extremely inefficient way of compressing the data and would require hardware that can prime-factor numbers with factors up to the millionth prime (assuming a file of 1 million digits)

I just remember reading about this code (can't remember the name now) in a scifi book in middle school.
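soda's encoding can be tried directly. A Python sketch (mine): the i-th digit becomes the exponent of the i-th prime, and decoding is trial division. As soda concedes, this is wildly inefficient; for a long file the i-th prime is large, so each exponent costs far more bits than the ~3.32 bits a decimal digit actually carries.

```python
def first_primes(count):
    """Naive prime generator, fine for small examples."""
    ps, c = [], 2
    while len(ps) < count:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def encode(digits):
    """i-th digit becomes the exponent of the i-th prime."""
    n = 1
    for p, d in zip(first_primes(len(digits)), digits):
        n *= p ** d
    return n

def decode(n, count):
    """Recover the digits by trial division."""
    out = []
    for p in first_primes(count):
        d = 0
        while n % p == 0:
            n, d = n // p, d + 1
        out.append(d)
    return out

n = encode([1, 0, 2, 3, 1, 0])   # the "102310" example from the comment
print(n, decode(n, 6))           # 188650 [1, 0, 2, 3, 1, 0]
```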

@aloishis89:

Benford's law applies to data whose distribution spans several orders of magnitude. The random million digits should be uniformly distributed.

No way this is going to be any help.

@soda:

You've described a different way of encoding the data. Is there any reason to think that it results in compression? Simply finding a new way to represent a number is not enough, it must also be a more compact representation.

Mark mentioned that a few minor weaknesses have been identified in the data file, and in theory it might be compressible by a few bytes. The problem is of course that it's hard to write a decompressor that's only a few bytes long. For fun I tried to see how small a script could be while still doing anything meaningful.

Suppose for a moment that the data file happened to contain the sequence '134113410686196639649008076861966396490080729' starting at location 531441.

By coincidence 134113410686196639649008076861966396490080729 can be constructed/compressed using the ruby expression 7**25 (which calculates 7 to the power of 25), and 531441 can be compressed using the expression 9**6.

I'm feeling lucky! Now I can write the following decompressor in Ruby:
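(The Ruby one-liner itself did not survive in the post; here is a Python rendering of what the next paragraph describes. My reconstruction, not the original script: the offset 9**6 and the expression 7**25 are from the comment, everything else is guesswork.)

```python
# Re-insert the lucky digit run at offset 9**6 into 'f', the copy of
# the million-digit file with that run cut out.
def decompress(data, pos=9**6, seq=str(7**25)):
    return data[:pos] + seq + data[pos:]

# e.g.: sys.stdout.write(decompress(open('f').read()))
```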

The program reads the file 'f', which is a copy of the million digits file, but with the sequence removed. It then inserts the sequence at location 531441, which recreates the original data. The inserted sequence is 45 chars long, and my decompressor is only 28. I saved 17 bytes!

From this little experiment, it seems that to do anything meaningful, a ruby script needs to be about 30 chars, with 10 chars for just reading the file. I guess the problem here is that any 30 char sequence that can be calculated simply would represent an island of order in the sea of randomness, and be a significant weakness, probably far beyond the actual weakness identified in the file.

In any case, achieving a few bytes of compression would not prove anything. It would simply be neat. Whether it's possible really depends on the exact nature and location of the weakness in this specific file. If the weakness is only one or two bytes, I would say it's impossible.

I was wondering if there's a way to calculate the absolute value of randomness/orderliness of a specific integer?

take 0. what's the orderliness of zero? probably undefined. or low, probably 0.

1: seems no different. what's the chance of accidentally hitting on the number 1? depends on the range!

2: we might still be at 0?

3: hmm maybe we're not exactly at 0 anymore.

4: high. here it seems some level of orderliness can already be identified. 2+2=4, a kind of symmetry.

5. low. it's a prime. all primes must have a low orderliness value?

6. high, maybe higher than 4? less of a coincidence than 4, since 6 is bigger than 4.

seems the bigger the number, the more order it can contain, which seems right since it becomes more unlikely to hit on the number by chance.

is there a measure for this? is there a way to calculate the absolute value of it?

A question about the rules. Suppose I run a program using:

> ruby -n million.rb random.txt >> output.txt

The program million.rb is run with the file random.txt as input, and the result is saved to output.txt.

How would the size of the program be measured? As the number of bytes inside the million.rb file?

(btw, I know the data is binary not text...)

@Emil:

Like I said in the rules, we are looking for the total size of the program plus the data file. So in the case shown above, I suppose it would be the length of million.rb plus the length of random.txt.

Creating a build of ruby that stores the data in the runtime would obviously not count.

Someone suggested a while ago factorising the number and just storing the primes. I can't see that working, since storing the primes will require as much space as storing the number (it's the same amount of information, after all). However, it did give me an idea:

The file given is random in terms of sequences of digits. It may not be so random in terms of primes. If the number is not square-free (and we have no reason to expect it to be), then you can reduce the space required to store the primes by not storing the duplicates (just storing the power required).

Unfortunately, I can't test to see if this works because the largest numbers that can be reasonably factored by modern clusters are in the 100s of bits, far smaller than this number. I think the idea may have merit, though - whenever you see the word "random" your first thought should always be "what's the distribution?" It may be a uniform distribution from one point of view, but heavily skewed from another. If you provide a somewhat smaller random number (100 bits, maybe), I'll give it a go, though. (Although it may turn out to be too small to have a large enough repeated prime to absorb the overhead of the decompressor, as small as it would be.)
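The bookkeeping behind this idea can be sketched in Python (mine): factor a deliberately non-square-free number and compare storing every prime copy against storing (prime, exponent) pairs. Note the sketch ignores the delimiter bits any real file format would need to separate the entries.

```python
def factorise(n):
    """Trial-division factorisation into {prime: exponent}."""
    f, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            f[p] = f.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        f[n] = f.get(n, 0) + 1
    return f

n = 2**10 * 3**7 * 101                    # far from square-free
factors = factorise(n)
flat   = sum(p.bit_length() * e for p, e in factors.items())      # each copy
paired = sum(p.bit_length() + e.bit_length() for p, e in factors.items())
print(n.bit_length(), flat, paired)       # 28 41 19
```

The pairs only win when exponents are large; a typical huge random number has mostly exponent-1 factors, where the pair representation is strictly worse.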

if it turned out that the million random digit number was simply the factor of a few primes, you could express it by the ordinal numbers of those primes, and then indeed you would have great compression.

but imagine the luck involved for that to be true!

the folks who created this file spent a lot of time making sure it was random in every way they could imagine, but they couldn't have done a factoring check too easily.

so it could happen, but the odds are incredibly high against it.

an interesting experiment would be to start cranking out a list of numbers that *can* be compressed by prime factorization, and seeing what percentage actually meet the cut.

A number theory guy might be able to help with the answer.

I forgot to come back and check for replies, sorry!

For my method to work, the number doesn't need to be the product of a small number of primes, it just needs to not be square-free (and have the repeated prime be fairly large). The vast majority of numbers are not square-free.

It's not actually necessary to factorise the whole number, just to find a (large) square factor, so perhaps this is tractable. I have exams coming up (one of which happens to be number theory!), but I'll give it a go in a few weeks.

Having actually looked it up, rather than guessing (I should know better than to guess about such things!), I need to correct myself - it's not a majority of numbers that aren't square-free, it's actually a slight minority. So, the odds of this compression method working are not great... If it weren't for the limits of computing power, it would still be reasonable, but even if this number has a repeated prime factor it will probably be difficult to find... Oh well...

Hi Mark,

Is the challenge still open? I'd like to try,

So all you want is a file (exe) smaller than this one, that outputs this file. Obv, as per the spirit of the competition.

What I am asking is, I won't give any code/source at this stage.

Thanx

Waiting for a prompt reply.

JITENDER BEDWAL

New Delhi

India

@Jitender, sure, there's no requirement that you disclose any source code to win this challenge.

Hi Nelson,

Is there any limit on the time the task may take to complete? (AMillionRandomDigits challenge) E.g., is it acceptable if the decompression needs 3 weeks? (joke).

Regards,

Ray

@Ray:

You take as long as you want, Ray.

Hi Mark,

Can you tell me how we know how much a file is compressible to, or how exactly we calculate its 'entropy'?

I mean, if I consider its entropy as composed of symbols of the 256-symbol alphabet (read: 8-bit byte alphabet), it comes out one way, and if I consider it as the 256*256-symbol alphabet then it's different... So how exactly do we know, on any STANDARD or absolute scale, that this file is compressible to 'this much'?

Thanks.

Jitender Bedwal.

@JitenderJB:

There is no such thing as an absolute measure of compressibility.

Any measure of compressibility is *always* relative to some specific model.

And for any file, there is some model that can compress that file down to 1 bit.

So when you hear people talk about the entropy of a file, you have to know how they are measuring the entropy.

Even something like Kolmogorov Complexity is not an adequate way to define an absolute measure of the complexity of a sequence - because it is going to differ depending on what sort of machine you choose to run it on. For example, a machine that has a single instruction that can generate the million-random-digit file will have super-low Kolmogorov Complexity on that machine, but may be very complex on some other machine.

- Mark
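Mark's point that entropy is always relative to a model can be made concrete. Here's a small Python illustration (my own toy example, not anything from the challenge) where the very same file measures "incompressible" under a 1-byte alphabet but highly compressible under a 2-byte alphabet:

```python
import math
from collections import Counter


def order0_entropy(data, symbol_size=1):
    """Order-0 entropy in bits per byte, over symbol_size-byte symbols."""
    symbols = [data[i:i + symbol_size]
               for i in range(0, len(data) - symbol_size + 1, symbol_size)]
    total = len(symbols)
    h = -sum(c / total * math.log2(c / total)
             for c in Counter(symbols).values())
    return h / symbol_size   # normalize to bits per byte


# A perfectly regular toy file: every byte value appears equally often,
# so a byte-at-a-time model sees maximum entropy.
data = bytes(range(256)) * 4
print(order0_entropy(data, 1))   # 8.0 bits/byte -- looks "incompressible"
print(order0_entropy(data, 2))   # 3.5 bits/byte -- a 2-byte model sees the order
```

Same bytes, two different "entropies" - which is exactly why a single absolute compressibility number doesn't exist.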

Mark,

What I am asking is: on what scale (or basis), if any, did the RAND people rate the 'A Million Random Digits' as random?

Jitender

@JitenderJB:

There are a few places you can read up on the history of these digits, but the most important thing is that they generated the digits using an unbiased, memoryless source. For their purposes, this qualifies as random.

Or maybe was it exhaustive manual scrutiny via the 'tests of randomness'?

Mark,

Is there any function which does the reverse of what BigInteger did to the random digits file?

QUERY ABOUT SUBMISSION PROCEDURE

--------------------------------

I have been trying to compress THE file for a while now, and I have had 'a little' (but still some) success. Please elaborate on the possible combinations of contents for the final file to be submitted.

So far, I have a .CPP file, a .JAVA file, and a .BIN file, which when run create the EXACT file you are providing. And the total size of the above contents is 232 bytes less than your file.

I have tried to remain strictly in accordance with the spirit of the challenge.

Thanks again..

JITENDER BEDWAL

@Jitender:

If you can mail those files, along with build procedures, and I can reproduce your results, you get $100, it's that simple.

- Mark

Thank you very much Mark.

I'll publish (or mail) them very soon. I am not actually that concerned about patents or anything, and I am not in any way afraid of posting my code (or the THEORY). But still, Mark, please advise me on a BETTER way to do it (for example, publishing it in some magazine/paper/website, so that I don't lose the credit for it in the long run). In short, of course.

And by the way, I have gone through numerous threads on comp.compression. I know there have been a lot of pranks there.

So if you like, we can end this discussion for now and resume it later (in a week or so, when I publish the files), but still, some advice would be greatly appreciated.

Thanks (yet again)

Jitender

Waiting here... for another 10 minutes.

Looking forward to seeing Jitender's claimed solution.

I'm curious: what code was used to convert the digits to binary format?

@Emil:

See this post on comp.compression for the full Java source code:

http://groups.google.com/group/comp.compression/msg/e219c4f18a9f546e

Thanks for the link to the original Java encoder. For fun I tried to write the shortest decoder I could come up with:

p gets.unpack('B*')[0].to_i(2)

It's Ruby, takes about 40 minutes to complete on my Mac, and outputs the one million original digits when given the binary file as input (and disabling the record separator using -0777).

The script is 30 bytes long. Can anyone beat that? :-)
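For anyone curious, here's a hedged Python sketch of the same round trip at toy scale (the digit string is a short stand-in for the real million digits, and restoring any leading zeros would require knowing the fixed length of one million digits):

```python
# Encode: treat the digit string as one huge integer and pack its bytes,
# essentially what the BigInteger-based encoder does.
digits = "31415926535897932384"   # stand-in for the real million digits
n = int(digits)
packed = n.to_bytes((n.bit_length() + 7) // 8, "big")

# Decode: read the bytes back as a big integer and print it in decimal --
# the same job as the Ruby one-liner `p gets.unpack('B*')[0].to_i(2)` above.
recovered = str(int.from_bytes(packed, "big"))
print(recovered == digits)
```

The packed form is shorter only because decimal text wastes representational space; the underlying information is untouched.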

Is there a restriction on the OS used? I have an idea, but it will only work correctly on Linux.

@AtomicCheese:

No restrictions on the O/S, but of course, it helps if it is something I can run to verify things.

If you have an idea, and it will only work on Linux, I am pretty skeptical.

- Mark

Is this competition still open?

I would like to give it a try.

But I feel the prize money of $100 is too low.

Hi Mark,

A few questions about the contest:

1. When measuring the size of the compressor / decompressor, I assume we are talking about the length of the source code in bytes, not the binary. Can you confirm this is the case?

2. Also, when measuring the length of the submission, whether binary or source code, is the packed size (with something standard: zip or WinRAR) used, or the uncompressed size? If uncompressed, that would mean we need to use one-letter variables, macros to hell, etc.

3. Is this offer really still good and nobody has won it?

@Parlance:

I'll be happy to measure your submission by the size of the source, even the size of the ZIP or RAR file that holds the source. I agree that it would be annoying to have to make the source unreadable.

Yes, the offer is still good, nobody has won, and I don't think anyone ever will. If it was me, I'd be trying to get the file inserted into the Linux kernel so I could use that to snag a back door victory.

- Mark

@keshavshetty:

Contest is certainly still open.

Winning it would of course give you much fame and acclaim, way more than $100 worth. As Matt Mahoney has pointed out, if you find a general purpose way to recompress compressed data, you would actually be eligible for at least one million dollar math challenge.

Obviously the money is simply a token - nobody is going to do this for cash value.

- Mark

I just found this Improbable Research video about the million random digits. I think it's extra funny because I'm trying to compress that craziness =)

Enjoy

http://www.youtube.com/watch?v=0y8Wa10Lm7c

@haggo:

That video is simply awesome! Thanks for posting.

Does anyone still visit this place?

I have a few questions.

Is this challenge impossible?

Has anyone successfully completed this challenge?

What about this:

http://cs.fit.edu/~mmahoney/compression/barf.html

Hi Noodlesgc,

About barf, reality can be found at

http://nikhilsheth.blogspot.com/2007/09/barf-compression-holy-grail-or-goose.html

Meh... If you are so confident that this will never be achieved, why is the reward only a measly $100.00? Why not a million? By offering such a piss-poor reward, you've pretty much guaranteed that no one with anything valuable will ever take part in your challenge. If someone has a real working program, their eyes will be more focused on the millions they can make by marketing their little compression program. Your cowardly offer will not capture the interest of anyone who is serious.

@NotImpressed:

First, I don't have a million dollars, sorry.

Second, the people who pursue impossible compression claims are going to do it regardless of whether I offer any prize or not. This prize is simply a stake in the ground that they can choose to shoot for.

The primary reason I created the challenge was to provide a benchmark I could use on comp.compression every time someone made a fantastic claim. It has done very well in that role. When someone says they have a great new algorithm, I just point them to the challenge and tell them to let me know when they are done.

And despite what you say, it has caught the attention of many, many people who are quite serious. Of course, none of them have managed to meet the challenge. Nor will they.

Sorry I disappointed you.

Is it still open? I fully understand the challenge:

decompressor executable + compressed file MUST BE SMALLER THAN THE ORIGINAL AND MUST DECOMPRESS TO THE ORIGINAL WHEN RUN.

I am not after the prize, even if it were 1 million; I am just interested. I have an algorithm (not yet published anywhere) that can bring it down, which I use as a post-compression technique to reduce data to almost 1/3rd, and I have tried it already. Right now it seems a little tricky with your binary file, but I am 95% sure it is possible, and I would love to prove it with an example (only if it is still open). Machines can never be more powerful than the human brain.

regards

Khan

@Khan:

Most certainly, the challenge is still open. Please give it a shot!

- Mark

There appears to be just one way to crack this problem, which is to find something about the data that's non-random.

But if you succeed in that, you still haven't solved the actual problem (although you'd win the $100), since the data file wasn't fully random.

Therefore, as long as the data is near to fully random every way you analyze it, this problem cannot be solved.

@Robin:

First, there are many people who think that they can write compressors that work on all datasets, regardless of randomness. This challenge is their opportunity to prove it - of course no one has done so.

As for a sequence like the million random digits - regardless of how random it looks, there may be a short program that can generate the sequence - it is not possible to refute this due to the halting problem. So no matter how random it looks, there is always a chance that it could be compressed.

- Mark

Been a while... I worked a long time and then again fell into the hole of hopelessness.

You know, I should get it all going again.

This last round gave me three unique encoders, so it was worth the effort.

What I love about this is when I get within striking range, and I have been there a few times, it seems I run out of storage space.

I'm always stuck with one or more bits of information I need to store that I have no place for.

Now that I think of it, I did have a new idea on an old encoder that I haven't worked out yet.

But it's nice to say hi! I will never quit trying, I'm sure.

@Ernst:

It might be impossible to solve, but this is unprovable, so it is not necessarily a fool's errand to continue. Kolmogorov shows that we cannot prove that a given program is the shortest one to produce a given sequence, so there is always hope.

Needless to say, though, I don't have much faith that one will be found.

- Mark

Have there been any real submissions for the challenge, other than scams?

@Zoran:

I received one submission, but it was just a waste of my time - I guess the guy didn't really understand the contest. He submitted a program that did some kind of compression, but it did not meet the requirements to win the prize.

I don't ever expect to see a valid submission.

- Mark

[...] (Before you read further I suggest you to read Mark Nelson “The Million Random Digit Challenge“) [...]

Lossless random data compression - is it possible?

The article I posted in my blog - http://blog.adityon.com/2009/12/random-data-compression-is-it-possible/

Sorry I couldn't post it here, for 2 reasons:

1. The blog post is too long.

2. I want the user comments to be available at my blog.

New article added at http://blog.adityon.com/2009/12/random-data-compression-is-it-possible-part-2/

Well Guys.. It's a real mental challenge and I believe that is why I phase back into it from time to time.

I am now working on an encoder that has some promise in recoding the data. I have a ways to go since I decided to start from scratch and prove each step as I go.

Well, if there is a boundary to information that we cannot cross, then that is the way it is. I've been within one bit of it a few times. I know what that boundary is like first hand. Thanks to the great spirit of this challenge, I have enjoyed the time spent.

Hey Mark, I have been checking for patterns with your ARI from the BWT program. Is that as good as any? Do you recommend a different one?

I figure if there is a pattern, that ARI program will see it.

Well, to all who are still smacking the keys on this challenge in 2010, I salute you! I will be at it for a few months with this one.

I have coined the term Dark Information to reference any information that can be assumed in a system.

This was inspired by the idea of an observer in quantum-system terms, and by one encoder with a special state that has no spare bits: if I want it to "compress", I need that external reference to signal the state of the system. It is one-to-one, but I wanted it to be Special :)

Well this is the basic representation of million digit data. Not that this form of it is anything special; it's not. It's the kind of reality I see over and over again in all the ways I have expressed the data of the million digit file.

Not any savings to brag about so far.

1 : 829609
2 : 415167
3 : 207993
4 : 104024
5 : 52223
6 : 25717
7 : 12968
8 : 6475
9 : 3197
10 : 1627
11 : 769
12 : 420
13 : 212
14 : 92
15 : 37
16 : 29
17 : 7
18 : 5
19 : 3
20 : 1
21 : 1

Any Ideas?

I've posted a short article on Dobb's Code Talk dealing with this challenge. Some of you might find some interesting ideas there:

http://dobbscodetalk.com/index.php?option=com_myblog&show=The-Million-Random-Digit-Challenge.html&Itemid=29

Thanks Mark..

Imagination is more important than knowledge...

Albert Einstein

US (German-born) physicist (1879 - 1955)

I realise this is a dumb question, but.

Don't you need 4 bits per 0-9 digit?

1million * 4 bits = 500,000 bytes.

So how come the .bin file is 415,241 bytes?

@thicky:

There is a form of coding numbers called BCD. In BCD, you encode two decimal digits in a single byte, just like you describe.

The million digit file does not use BCD coding - it is a single long string of binary digits. If you do the math, you will see that a single decimal digit takes approximately 3.32 bits to encode. So a million decimal digits take about 3.32 million bits, or 415K bytes.

If this doesn't make sense, start Googling for computer numeric formats.

- Mark

@thicky: if you use binary-coded decimal (http://en.wikipedia.org/wiki/Binary-coded_decimal), you would be right. But converting a decimal number to a binary number is a bit different. Let's convert 11 in decimal to binary: 11 = 8*1 + 4*0 + 2*1 + 1*1 = 2^3 * 1 + 2^2 * 0 + 2^1 * 1 + 2^0 * 1, so 11 in decimal is 1011 in binary. In BCD it would be 00010001. More information at http://en.wikipedia.org/wiki/Binary_numeral_system
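The arithmetic behind those two replies, as a quick Python check: 415,241 bytes is exactly the information content of a million uniform decimal digits.

```python
import math

# One uniform decimal digit carries log2(10) bits of information.
bits_per_digit = math.log2(10)            # ≈ 3.3219 bits per digit
total_bits = 1_000_000 * bits_per_digit   # ≈ 3,321,928 bits for a million digits
total_bytes = total_bits / 8              # ≈ 415,241 bytes -- the .bin file's size

print(round(bits_per_digit, 4))   # 3.3219
print(int(total_bytes))           # 415241
```

BCD, by contrast, spends a full 4 bits per digit (500,000 bytes total), wasting the gap between 10 used codes and 16 possible ones.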

@Mark @Ben

Thanks - it makes sense now, I think. The file contains the binary representation of a million-digit number, not the binary representation of a million digits!

Happy New Year!

Hey... I tested the new encoder and, no real surprise, it turns any file into random data as far as Zip is concerned.

LOL, it is a one-to-one encoder, but still no magic compression.

That must be codex number 25. At least this one doesn't have any unexpected states like one of mine does.

There are systems out there. Keep banging those rocks together!

Mark - in response to your article at Dobb's Code Talk: the normality of pi has yet to be proven. That is, it has yet to be proven (although many mathematicians think it is probably true) that pi contains every sequence (actually, normality is a little stronger than that, but I don't think the weaker version has been proven either). Your Magic Function, therefore, may not work (but it probably will).

For more information, see:

http://en.wikipedia.org/wiki/Normal_number

@Tango:

As I said, I am obviously not a mathematician! Thinking it is true and proving it is true are obviously two different things. Thanks for the pointer!

- Mark

Pi is just a data set, then, until we know more.

That works for me.

Hey Mark, are the folks looking for that Holy Grail of Data Compression numerologists, then?

Oh, I'm hooked on trying, that is for sure. Anything that helps the imagination... Thumbs up.

There was a story I remember about a substitute teacher who teaches one kindergarten class one day and a high school senior class the next.

The kindergarten kids were asked to tell the teacher what a dot of chalk on the blackboard was. They replied with enthusiasm, one answer after another: a sun, the top of a telephone pole, a... a... and the answers poured in.

The high school seniors, asked the same question, were mostly silent, with a lone condescending "It's a chalk dot, dweeb!" coming from the back of the class.

The moral of the story is: love your inner chalk dot.

Hey Did I tell you guys I can compress any binary number by one bit?

Drum roll....

@Ernst:

There are numerous ways to do as you are describing, but they all rely on hiding information. You of course can't compress any stream by one bit without actually storing that bit elsewhere.

- Mark

Well, this is meant to be on the data compression humor side.

So if I blunder on the terms or other proper definitions, please correct me gently.

Again, try to see this as data compression humor and not a big brag.

I need to say that this applies to all bit strings except zero values, i.e. 0 or 00000000.

-----------------

Let's take a string; I will just type and make up some data:

1010101 -- We can show how value and symbol can be exchanged and encoded.

1010101: starting with a limit of length at the MSSB (remember, this is humor), ask yourself what we must subtract at the location MSSB - 1 to get the MSSB to reset.

In this case, only a single set bit at the MSSB - 1 location resets the MSSB.

1010101

-0100000

-------------

So that is 0110101 now. Choose a parity; in this case a simple 0 = 1 and 1 = 2 works, so the first symbol is a zero.

Repeat, and you will end with the value of one, which is Dark Information we assume when decoding the symbol string: we start with a value of one, not a value of zero.

So, in humor, I can reduce all greater-than-0 binary values by one bit.

:)

It's 1010101 = 010101... where 0 and 1 are symbols rather than values.

Pooh, you say, just drop the MSSB? Well, if I change the symbol meanings then it's 101010...

It's okay. Just a little data compression humor. I compressed all greater-than-one values by one bit.

> 0, duhh... LOL

Wow, tough crowd...

Hey, maybe you guys see a way to manage that parity flip issue?

Oh well, I will log into Random Data Compression and chat.

Happy New Year, all.

And again:

Here's my algorithm. It relies on parallel processing. Well, parallel universes to be precise.

Theoretically, that should do it in some universe (because the data file was only read in universe0 and probably doesn't exist in the one which returns the file). Of course your interpreter or executable would have to exist outside the universes, which could make it hard to invoke!

I have an official update and I'd like to ask your position on this, Mark.

I have had the good fortune to discover a bijective encoding algorithm.

The size of the decoder is under 8000 bytes.

The downside is that recoding a packet the size of the million digit file, which is over 3 million bits, takes nearly 48 hours using a single core at 100%. This version wouldn't divide over multiple cores.

My question is: if I find a file "some place down or up the line" sequentially that will compress, by at least those 8000 bytes, with a publicly available data compressor, will that qualify as compressing the million digit file?

Thanks

Ernst

@Ernst:

I'm not sure I know what you mean. If you are suggesting that you might find a sequence in the file that compresses by 8000 bytes, then you should be able to construct a decompressor that meets the challenge.

- Mark

@danielnash:

I'll let you know when I am able to test your code. Don't hold your breath!

Hey Mark.

I am able to encode or decode the entire file recursively here. I'm confident it can "go" many iterations in both encode and decode directions, but sequentially only.

D3, D2, D1... E1, E2, E3...

There may be a diamond in the rubble, but it's worthless if you don't buy it.

Just wanted to see where you stand on this.

A little "Kobayashi Maru" cheat on the file, but it's in line with my personal theory that there is a "quantum" of information involved and that the symbolism representing it is non-permanent.

The new files are exactly the same size as the original million digit binary you provided. I am looking for a way to compress but the best hope at the moment is recoding.

I could learn to write a traditional data compressor, but I would be learning from the bottom up. Why reinvent the wheel, right?

So my question, relating to recoding the file, is: how would we score that? Would it be 8000 bytes + the data compressor? Maybe we might waive the size of the programs altogether, since the codex is bijective and encodes or decodes any binary data, and a publicly available data compressor is common enough? Wouldn't that be nice...

Files don't have to be encoded with "Wave", the codex code name, first to be decoded.

Maybe recoding the file is out of bounds?

You are fair so I respect your rules.

Ernst

Hey everyone.

I have an update: no luck on finding any exploit in the million digit challenge. However, I am pleased to announce I discovered dynamic unary encoding. It is the basis for a bijective codex I code-named "Wave", since Wave encodes down to, and back from, two bits.

I don't know if this is a self discovery or a discovery in general so I could use some feedback on the concept.

I have posted in the Beginners thread of the Usenet group alt.comp.compression.

Mark, I placed this here for two reasons: 1) I have your challenge to thank for the focal point, and 2) alt.comp.compression is really slow these days. No one is even BS'ing. I hope to make use of the common knowledge of those in the know. :)

So, questions: the concept is dynamic encoding, where data is in motion rather than static. Has anyone said or read anything on this general concept of recoding bijectively?

Just knowing what to look for will be helpful, and I'll bet it's more appropriate to reply in the Beginners Thread of the alt.comp.compression group.

Thanks Mark.. I am trying to find a way and sharing what I find. It's a challenge and I like that.

Ernst

Hello. I want to stay anonymous, and I want to explain why no one who is serious will actually do this. I was reading over some old AIM (Instant Messenger) logs from 1999 of things I and this German kid used to talk about. I am now 25 years old and still don't understand half of what he used to write to me. He was 15 at the time, as he stated and as checked against his DOB profile. I cannot get in touch with him anymore; he lived a crazy life, and I don't know if he is in jail now or dead. He was a social outcast and awkward to talk to because of his vocabulary, and I can't say I was the best influence on him either: whenever he talked in big words I would just say "that's awesome" and laugh, and that's how conversations ended. He used to help me by magically turning any shareware program into freeware and telling me to enjoy it =]. I have since figured out he was a hacker. He also used to talk about taking some kind of drug that induced attention deficit disorder, which he said was half of the key; the other half was that he needed, empirically, to become a millionaire by the age of 17. He never logged into AIM after that point, though he did mention he was rich.

He could take any program apart and recreate the source code for it in less than a week. That's how he got rich, and I was a big part of it: I told him about this cool piece of software that could detect letters in pictures like a human brain, now called OCR (optical character recognition). It was pretty primitive at the time and was shareware; he quickly made it freeware for me, then turned it back into source code and worked on improving it, later patenting his improvement at the very young age of 16. I am not sure exactly what he patented, since it was already patented, I believe, but he started licensing it to websites that were supposedly for blind/color-blind people; he later admitted, after I asked, that blind people were definitely not using them, and that it was for automated mail hackers. That's pretty much the story. I have logs here which are still VERY cryptic for me to understand. Most of them talk about compression and life and a bunch of other things about space and aliens that I found useless. He says some funny things; he must have been a bit immature back then. For example, he writes that life is just a closed circuit within cycled loops. I think he meant environmental evolution? His compression theories talked about how, in life, the compression takes many cycles to unpack, and it may take years to fully unpack things that could be done in just a month using a CPU. I won't go into much detail here because I'm still trying to understand half of his theories; maybe I could patent some of them and get rich myself.

But I wrote all this to tell you that hackers can turn programs back into source code, so don't risk it. Hackers may be crazy like this guy, but they exist.

I'm learning Visual Basic 6 and I may be able to recreate what this German guy was talking about in the near future. At the moment I'm struggling with bitwise stuff and loops, but it's progress.

If lets say someone did submit a program that qualified for the challenge, what are the steps you would take to submit it to the scientific community for verification, and so on??

@joe:

Verification would be done by me, maybe with whatever help I can get from the folks at comp.compression. Verification is not a matter of science, more of forensics.

Now, if you choose to disclose the algorithm, that opens the door to more interest from other people.

- Mark

Wow, we took a turn for the surreal.

Hey, whatever, on the long story. I can get it, but I doubt anyone else here would be able to.

Anyway.

Update.

I now understand how to recode all binary strings 1-to-1 under a simple integer count. I'll be looking at one exploit that came to mind next. I only hope I don't become homeless here too soon; the job market is still depressing economic opportunities.

Does anyone recognize this integer sequence? 13,128,23,217... ? It's my Q0-B.

Just a shout out to all you other would-be nut cases trying this challenge... Catch me if you can!

Ernst

Hey Mark,

Where do I have to go to get my $100 if I have a solution to this problem?

@lilalula:

You don't have to go anywhere, I'll get you the $100 wherever you are - at least if you take Paypal!

- Mark

Hey Mark,

OK. What about my idea?

I think I can compress any kind of data (even the AMillionRandomDigits.bin file). Well, not endlessly, but to a smaller amount. The compression is effective only on files above a certain minimum size; the bigger the file, the better the compression ratio. If you're interested, let me know. I'll give you some more detailed info on this.

@lilalula:

Hey, the rules of the contest are right out there - don't tell me how you are going to do it, just do it.

You write a program that can compress any file without resorting to various cheating tactics and you'll be famous, don't worry about it.

- Mark

Hey Mark,

How are you today? So I'm back, and I found a way to do it. I was playing around with the "AMillionRandomDigits.bin" file and it worked fine. I also tried to run some more tests with my program, and I found this:

http://www.stat.fsu.edu/pub/diehard/

This is a CD with 4.8 billion random bits in sixty 10-megabyte files. You can go there and download these files to see for yourself.

OK, here it is:

I can save 16 MB (16,777,216 bytes) of data (ANY data) in a file with a size of 11,296 bytes. My program doesn't compress the original data. All I store is the information in the original data. It is much like transforming it into something else. It is like transforming energy from one form to another, like making heat out of wood or electricity out of wind.

During my work on this project I realized that this is worth more than $100, so you can keep your bucks. If you want proof of my program, then look out for a project named ALICE. Also, I can meet you to show you how it works. I hope you understand that I can't send you this program.

cu

lilalula

PS: for more questions: you got my email.

@lilalula:

It is rather trivial to prove that you cannot save any 16 MB file in an 11K container. Sorry, the pigeonhole principle shows that you are incorrect without even breaking a sweat.

The purpose of the challenge is not so much to get $100, it is to prove to the world that you can do it. I don't think anyone is going to believe your boast until you actually prove it.

- Mark
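Mark's pigeonhole argument at toy scale, as a sketch: count the distinct 3-byte files against all possible shorter outputs. A lossless scheme must be injective, so there simply aren't enough short names to go around.

```python
# There are 2**24 distinct 3-byte files, but only 2**16 + 2**8 + 1 files
# of two bytes or fewer -- far too few targets for any injective
# (losslessly reversible) "compressor" to shrink every input.
inputs = 2 ** (8 * 3)                            # all exactly-3-byte files
outputs = sum(2 ** (8 * k) for k in range(3))    # all files of 0, 1, or 2 bytes

print(inputs, outputs)    # 16777216 65793
print(inputs > outputs)   # True: most 3-byte files have no shorter image
```

Scale the exponents up and the same counting rules out storing any 16 MB file in an 11 KB container.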

Hey Mark,

Well, basically you are right: it is not possible to save 16 MB in an 11 KB container. As I said, this is not what I'm doing. I'm not trying to save the 16 MB content somehow in a smaller container. All I save is the information from the 16 MB.

OK, I will make you a "dekompressor" (I don't like this word because it's not right) and send it to you. You can use it ONLY to restore the original file. After you have the program, you send me a file. I'll send the container file back to you, you use the program, and you restore your original file. Would you be happy with this as proof?

Hey Mark,

It's me again. You know that everybody was 100% sure that the earth is the center of the universe? Until 1610, a guy in Italy (Galileo Galilei) proved that this is not right. He almost got killed for it. This is just one example of the way we understand things. It is often just simpler to believe what we don't know.

@lilalula:

Yes, your proposed proof is fine with me.

>Until 1610 a Guy in Italy(Galileo Galilei)

>profed that this is not right

This is a nice little story, but of course it was Copernicus who was credited with advancing the notion of heliocentrism, not Galileo.

- Mark

Looks like somebody did their homework...

Send me an e-mail so I get your e-mail address.

Found another claim for recursive compression:

Even "explains" how it works :p

http://www.recursiveware.com/

@TonioRoffo:

Awesome site, thanks for the link. Looks like the guy is having trouble getting the final implementation complete though... Think I'll contact him.

- Mark

Well, I looked at this. Curiously, as I wrote pictorial representations of the phenomena described, I came up with a pattern that fitted the concept "a million random digits" and "so-many bytes of (supposedly) -incompressible 'random' 'data'" BEFORE I saw the direction Mark's description was going. This suggests my method so far is consistent with what Mark describes.

I found a very interesting pattern. I then intuitively found a candidate method to fulfill the "The challenge is to create a decompressor + data file that when executed together create a copy of this file."

I then altered my method to allow for the requirement that the information be in the form of "bytes".

Curiously, BEFORE I READ of the notion of "negotiating how to measure the result etc.", I produced the notion "so you get an approximation of bytes (cannot measure it)".

My intuition suggests "if write the program in C++" one could measure it with (say at this stage) any other 2 programs...?

Of course, the technical details of what I found could do with sponsorship more than $100 as I could do with a freehold house!

The alleged problem of "recursive compression" is apparently "solvable" ("what problem?" one may ask) by a simple but very clever technique; true, there is a natural "limit", but it appears to be very low.

The general subject area of ultra "compression" is what I call "space computing", it has possible ramifications for the study of astronomy.

Dr. Wong seems clued up;

5-d floating characters

nice work I think

yes Sir.

I can go much further apparently (If higher rate of data storage minimisation of container ( ) required)

with hyperspace bypass technology

entropy "vanishes" in a very delightful 'way'

!

Is this challenge still on? I love the idea. However, $100 is a very small incentive for a standing challenge that is 10 years old. The Clay Math Institute (http://www.claymath.org/millennium/) offers a million dollars. How about submitting this conjecture to them? Or, maybe you would be willing to up the incentive to, say, $10,000?

@jjdawson7:

Yes, the challenge is still on. Many have tackled the problem, but of course, nobody has even come close to defeating it.

I would love to offer a million dollar prize, but of course, I don't have a million dollars. And the Clay Math Institute would never have a challenge like this.

Anyone who manages to beat this problem will certainly get fame and acclaim, and I will do my best to help them.

- Mark

The Clay Math Institute wouldn't have this kind of challenge, but James Randi might accept a solution to this problem as winning his "One Million Dollar Paranormal Challenge" (http://www.randi.org/site/index.php/1m-challenge.html). One would need to compress it by a significant amount, though, not just a token reduction as you require.

Quote: "...nobody has even come close to defeating it."

I doubt this is accurate; I quite likely solved your challenge, and many others (such as all seven Clay Institute Millennium problems, and "the Goldbach Conjecture") - all these connect to a revolution in science that may be as profound and far-reaching as the Copernican Revolution.

It is time "skeptics" stopped being so quick to discount these possibilities - to realise that "the stone that the builder rejected" possibly has become "the corner stone", as the saying goes.

The reason why the Clay Institute problems are so "hard" is perhaps the manner in which the construction of these problems interacts with how mathematicians do math.

As my living conditions are very difficult in several ways, I would prefer to find a sponsor or commercial avenue for releasing what I have (apparently) discovered.

I now have maybe over 14 different ways to look at this discovery- if it was in error it wouldn't sustain so many perspectives?

I need a salesperson to find me a deal

.

@Alan:

> I quite likely solved your challenge

You "quite likely" solved it? Alan, either you did, in which case you could provide proof, or you didn't. There is no maybe about it.

The challenge is still open for you to solve, but until you do so, perhaps you should stop posting comments and get to work.

- Mark

I worked the apparent 'solution' on a piece of paper. That is the "proof". For intellectual property reasons, and severe living discomfort reasons, I consider it unwise to reveal the contents of the piece of paper without using a non-disclosure agreement and dealing with a genuine potential sponsor or licensee.

Surely you knew this would likely happen if someone solved it? Who can afford to give away that potential new technology?

I can tell you, the piece of paper includes drawings, i.e. patterns.

It was a rather strange problem, but it was nice to find what appears to be the resolution.

@Alan:

Sorry Alan, you are wrong. You did not solve it, you will never demonstrate it, and you are living in a world of delusion. I've seen too many like you to be fooled.

You will always have some excuse for not demonstrating your results, and I expect 20 years from now you will still be whining and begging for money like you are now.

- Mark

Mark Nelson, Decca said that "guitar bands have had their day", and did not sign the band known as "The Beatles".

You have no evidence for your claim that I did not solve (or whatever) your challenge. You are apparently not facing reality- that people do not necessarily give away this sort of knowledge.

You appear to be living a self-fulfilling prophecy of destructive skepticism?

A large pool of knowledge now exists, already many pages have been written for a scientist who has signed a non-disclosure agreement.

What I know is on the market.

Got to hand it to Mark. What he lacks in humility he makes up (in spades) with smarminess.

Alan, you seem to be misunderstanding the rules of the challenge. You don't need to reveal your method. Just send Mark the compressed file and the decoder (as an exe, no need for the source code), nothing else.

I don't know much about computers; this technology is (apparently) a massive jump in concept from how computer experts are thinking (though they get little bits of the idea at times).

If a file "writes its own (perfectly tailored) code", you automatically have a fantastically efficient use-of-space. The "decoder" and the ""file"" are one- a hyperspace dynamic.

This is (apparently) a very new system (?)

The file was already a code - this is an object-recognising system - you don't "decode" a file like this - you "upscale" it

All the space-efficiency-storage-enhanced file needs is "to see itself "in the mirror"" (laser interferometry)- one converts the file into space-efficient storage system; in doing so you create a 'run triangle' (inverted location) the file exists now "in 2 places at once".

To re-see the file as originally depicted, requires running it past 5 (or 6) directions more-or-less "simultaneously" - the easiest way to re-calibrate the file is a dot matrix calculator gizmo (like using a cross-word puzzle to unravel the location and links between each word (you have a jumble of words and the shape of the puzzle- now can find where each word fits)).

Is the prize $100 or $100 million U.S. dollars?

@ademsam:

$100. I don't have $100 million.

- Mark

Not worth wasting time on a prize that I can Mediger up the street.

@ademsam:

>Not worth wasting time on a prize that I can

>Mediger up the street.

Not worth wasting time on a challenge that you have no chance of winning!

- Mark

I am still working on things Mark.

I found a way to drop bits but, I will have to work the program out to see if there is actual data reduction when all things are considered. It's possible that it could come in under the wire.

This design is a much more challenging structure than anything I have attempted before.

So it's a win-win situation for me. I get to improve my program writing skills and see if this algorithm can meet your requirements.

It is an extension of that dynamic unary encoding I was chatting up earlier.

Best of luck to everyone.

I'll check back in later this year.

one million bits is tooooooooooooo big

how about a nice manageable 1k of bits ?

@Mark

Fun contest... do you think it's solvable if P=NP?

@dmfdmf:

>do you think it's solvable if P=NP?

Matt Mahoney shows that if you can compress random data, then you can solve NPC problems in polynomial time, which would mean that P=NP. So that's kind of the same thing...

Hi Mark,

Could you give me a pointer to the paper by Matt Mahoney which shows that compressing random data would solve the P=NP problem?

Thanks.

@ps:

In a post on comp.compression:

http://groups.google.com/group/comp.compression/browse_thread/thread/965bb085438e33db/2dc4b259c0c81503

"random data" : a mathematician agreed with me that "there is no such thing as "randomness"".

Though actually, considering the notion of "compressing random data", "random data" could be described as "not data at all".

"random data" is NOT "data" I submit; "random data" is "an OBJECT".

There is nothing to compress.

"random data" is not about something, it IS "something".

I have apparently solved, at least in a manner of speaking, all 7 Clay Institute Millennium problems (including "P vs NP"), and the Goldbach Conjecture.

I am fed up with being so poor when the value of these apparent discoveries is so helpful !

If you want to "compress "random" data, you will first have to expand it: this results in ____________________ (censored) which = "the Goldbach Conjecture" (or "P vs NP quantisation) To "solve "P vs NP" you need "quantum electro mechanics" ...

Hello Mark,

I have a question concerning the nature of the challenge that you have set.

As I understand it, the challenge is simply to provide a compressed file with a decompressor whose combined total size is less than the original AMillionRandomDigits.bin file, i.e. less than 415241 bytes.

I don't know, but I suspect that this challenge cannot be met.

My question is this: Why does the decompressor and compressed file combination have to be less than 415241 bytes?

Given the random nature of the AMillionRandomDigits.bin file, would it not be, at least in principle, sufficient to show that one can compress/decompress the random file to a few bytes less than 415241, regardless of the size of the decompressor?

Why does the size of the decompressor matter?

Is it simply to ensure that the missing bytes are not encoded within the decompressor?

Is it to guarantee fair play, or is there a reason, in principle, why the decompressor has to be smaller than the number of bytes saved by compressing the random bin file?

If one puts aside the idea of trying to cheat, would there be any worth in a compression/decompression algorithm that could, regardless of its size, compress this random binary file by a few bytes?

I have no idea whether the challenge could ever be met, even if the only constraint was that the random file had to be compressed/decompressed by a general purpose algorithm.

Could you please explain the reasoning behind the constraints set in the challenge?

I am sorry for asking what may be a naive question.

Thanks

ps

@ps:

If the decompressor could be any size, it would be trivial to keep a copy of the million digit file inside the decompressor itself and then to compress it down to 1 byte, or even 0.

If somebody actually was able to compress this file, and found that their code was too large to make a go of it, I'd be happy to create a file that is 10X larger and repeat the challenge.

- Mark
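Mark's trivial cheat is easy to see in miniature. Here is a hypothetical sketch (not anyone's actual submission); the `EMBEDDED` constant stands in for the full 415,241-byte file:

```python
# Hypothetical sketch of the trivial cheat: the whole target file is
# embedded in the decompressor, so the "compressed file" is 0 bytes.
# EMBEDDED stands in for all 415,241 bytes of AMillionRandomDigits.bin.
EMBEDDED = bytes([0x3A, 0x91, 0x7C])   # imagine the full file here

def decompress(compressed: bytes) -> bytes:
    # Ignore the (empty) input entirely and just emit the embedded copy.
    return EMBEDDED

assert decompress(b"") == EMBEDDED     # perfect "compression" to 0 bytes
```

This is exactly why the challenge counts the decompressor's size: the information has to live somewhere.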

What I have taken from this challenge is that we can try. We can look at this in new ways and maybe innovation will happen.

This may be a think-outside-the-box challenge where imagination is more important than education, or it is folly.

Time will tell.

Working on a one-to-less-than-one system here. Meta-data may outweigh the net reduction, though. Cheers!

What about this (only theoretically, would be far too slow):

have a function checkme() which goes through 415241 bytes of data, calculates some kind of hash or whatever, and then returns TRUE or FALSE.

The function checkme() must return TRUE if the data == AMillionRandomDigits.bin. For a lot of other data it will also return TRUE, but for some (hopefully as much as possible) it will return FALSE.

The compressor does a loop through all possible data of 415241 bytes length and calls checkme() on the data each time. If it returns TRUE, a counter is incremented. The loop is aborted/done when data == AMillionRandomDigits.bin. Now the counter is the index which represents AMillionRandomDigits.bin: of all possible data of 415241 bytes length, it's the nth for which checkme() returns TRUE.

The decompressor program stores that index alongside it and does a loop through all possible data of 415241 bytes length. It calls checkme() and if it returns TRUE it increases counter. If counter == index it stops and outputs the current data.

If decompressor prog size + index size needs more than 415241 bytes, modify checkme() in such a way that the resulting index for AMillionRandomDigits.bin can itself be compressed (or represented in a short way) enough such that decompressor prog size + compressed index is less than 415241 bytes.
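The scheme can be run at toy scale, which also makes the problem visible. This is my own sketch on 8-bit "files", with a hypothetical 4-bit function standing in for checkme():

```python
# Toy, brute-force version of the scheme above on 8-bit "files",
# with a hypothetical 4-bit hash standing in for checkme().
def h(x):
    return (x * 37) & 0xF          # toy hash: low 4 bits of x*37

target = 0b10110101                # the 8-bit "file" to compress
code = h(target)                   # the 4-bit "compressed" part
index = sum(1 for x in range(target) if h(x) == code)

# Decoding: scan candidates in order, take the (index+1)-th match.
matches = [x for x in range(256) if h(x) == code]
assert matches[index] == target

# Each 4-bit code has 16 matches among the 256 inputs, so the index
# itself needs ~4 bits: 4 (hash) + 4 (index) = 8 bits. No net gain.
assert len(matches) == 16
```

The decoding works, but the index you must store grows exactly as fast as the bits the hash "saved".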

@flott:

This is pretty much a standard implementation of Magic Function Theory. Start simple and let's compress 16 bits by hashing down to eight bits. Further, let's say this is a really excellent hash function, and every eight bit hash covers exactly 256 input files.

So now, for each eight bit hash, we have 256 possible files. So to specify a given input file of 16 bits, we need an eight bit hash, plus eight bits of extra data.

Net result: no compression.

- Mark

Mark,

I'm sure this won't work, but for the life of me, I can't think why. I'm trying to understand your reply to flott, but also having trouble understanding your point.

Take the number, and hash it with a variety of hashing/checksum algorithms. For this example, let's use SHA1, MD5 and CRC32.

The "compressed file" will simply be the output of all three (or however many you choose to use) concatenated together.

The decompressor would simply (HA!) try every single input permutation (for a slight increase in expediency, you could also concatenate the length of the input number, so it doesn't have to try numbers below that length), and check if that test number hashes to the same MD5, SHA1 and CRC32 checksum as the input file has provided.

While I understand that due to the pigeonhole principle, this can never work perfectly, to me it seems that there should be a way that this works (albeit very very slowly) for most numbers.

Or (as I'm starting to think more and more), as the output of this "program" can never be larger than the input, the pigeonhole principle applies, and there would always be multiple collisions in such a system.

Thoughts?

Thanks!

Another thought (unsure if it actually matters or not).

Since you're bound to find collisions in any combined set of hash functions, could you not just also include which "hit" is going to be the correct one?

In other words, when you're "compressing" the file, also attempt to decompress it. Since you have the source number on hand, you can easily determine if the answer is correct or not. Each time you find a value that collides with all of the chosen hash functions, increment an index. Once you've arrived at the "right" number, simply (again HA!) append the collision index to the end of the compressed file, and you're good to go.

I'm honestly very curious to see why I'm wrong (because I'm almost certain I am).

>Since you're bound to find collisions in

>any combined set of hash functions, could you not just

>also include which "hit" is going to be the correct one?

Yes, and I tried to get to the point of this in a previous response.

If you were able to hash a 16 bit string down to eight bits, you would have 2:1 compression on every possible string. The only problem is that your hash function would have at least 256 possible hits, only one of which is the correct one.

So, in order to carry along the information about which hit is the correct one, you need an extra eight bits.

Now you are right back where you started - using 16 bits to data to encode 16 bits.

Magic Function Theory always works out this way - you don't get any information for free.

- Mark
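Mark's 16-bit example can be checked exhaustively. The hash below is my own stand-in (any reasonably flat function shows the same count):

```python
# Exhaustive count of Mark's 16-bit -> 8-bit example with a toy hash.
from collections import Counter

def toy_hash(x: int) -> int:
    return (x ^ (x >> 8)) & 0xFF   # fold 16 bits down to 8

buckets = Counter(toy_hash(x) for x in range(1 << 16))

# Every 8-bit digest has exactly 256 preimages, so pinning down the
# right one costs log2(256) = 8 extra bits: 8 + 8 = 16 bits. No gain.
assert all(count == 256 for count in buckets.values())
```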

I see your point, but when you have hash functions which always output a fixed length message no matter what the input length, and you're using two completely separate hash functions (from separate cryptographic families), would you not be able to AND the two sets of hits together?

Then couldn't you take the collisions that occurred only in each hash function and then from the known starting length of the input, use the collision index to determine how far to search to find the "correct" collision?

I realize I've just repeated myself basically, but in my mind, each additional step (the additional hash functions, the collision index, the starting length) all serve to reduce the "wrong" outcomes.

I'm not trying to claim that this could be a magic compressor (or anything to that effect), but I can't see why it wouldn't work in some cases.

Again, it's possible that I'm just not getting the math that's proving me wrong. My apologies if I'm being dense.

Look, you asked why it wouldn't work with just one hash function and I gave you a very straightforward, simple explanation.

Your next try falls victim to exactly the same problem. Even if I'm using two hash functions, if there are multiple files that have the same intersection, you will need extra bits to enumerate them.

Do the math man, it's very simple. Instead of just tossing out these ideas, actually sit down and work them out. When you propose something like this once, it's understandable that maybe you are impatient. But don't keep hitting your head against the wall and asking why it hurts. If you really think it can work, start out by developing a test case using 16 bits, 24 bits, whatever. If you aren't willing to test to that extent you are just wasting your time and mine.

- Mark

I realized I failed to address one of your points (that of being a net zero compression due to the size of the index value).

If you take a 10000 bit input, you'd get a 128 bit MD5 hash and a 160 bit SHA1 hash. We're now at ~34 times smaller than the input value. Adding on a CRC32 for speedy sanity checking, a 128 bit index, and a 32 bit length field, we're now at 480 bits.

I can't imagine that in the space (for lack of a better term) of 10000 bits, you'll find more joint collisions in MD5 and SHA1 than will fit in a 128 bit value.

Mark,

My apologies. I'm not trying to waste your time, although it seems I'm succeeding rapidly. I'm obviously missing something key to this whole concept, and I apologize.

If I could impose upon you to show me the error (which I'm sure exists) in my last post (with the 10000 bit input), I'll admit defeat, and have learned something to boot.

I think the problem is just that you need to brush up on your math. This statement gives it away:

>I can't imagine that in the space (for lack of a better term)

>of 10000 bits, you'll find more joint collisions in MD5 and SHA1

>than will fit in a 128 bit value.

Your imagination is too limited. Again, do the math on this problem, assume that the hash function generates a perfectly flat distribution, and work out how many bits you need.

- Mark

I figured that might be the case. Math has never been my strongest suit.

I assume then that concatenating two different hash functions basically just creates one larger hash function with the same number of collisions over the same space?

>I assume then that concatenating two different hash

>functions basically just creates one larger hash

>function with the same number of collisions over the same space

Yes, ideally the two functions will be distributed across the input space without much overlap. So if your hash functions are well designed, your strategy of MD5 + SHA1 is going to be more or less the same as using a good hash function with a 288 bit digest.

In general, you don't ever get any useful compression with magic functions. They don't do anything to exploit redundancy in the input data. Since we are talking about random-appearing data, we are not going to find any redundancy that can be exploited by simple coding.

- Mark

Thanks for taking the time to explain where I was going astray, I really appreciate it.

as Miss Marple could say: 'how interesting'

oh the things I could say!

very nice discussion

I have a thought: can you use data files like

data.001

data.002

data.003

and so on. Since it doesn't really say it has to be 1 data file only?

@Joe:

You could certainly use multiple data files, but you can't hide information in the file names.

- Mark

I love this stuff.

I have a programming friend now.

I may get back to work if my spirits lift.. Kinda sux to be alone working on things but Data Compression is accolades or cool-aid.

Happy Holidays all!

I know that this cannot be solved, but I wanted to look at it to remove any naively niggling notions floating in my own head. So I did some basic analysis of the file.

I got results like: 1600 times, a byte is repeated in the next byte (so if a byte was 200, the next byte was also 200), and 12 instances of the same byte repeating 3 times.
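A scan like this is easy to reproduce. Here is a sketch run on seeded pseudorandom bytes as a stand-in for AMillionRandomDigits.bin (which is not embedded here):

```python
# Sketch of the repeat scan described above, run on seeded pseudorandom
# bytes as a stand-in for AMillionRandomDigits.bin.
import random

random.seed(1)
data = bytes(random.randrange(256) for _ in range(415241))

pairs = sum(1 for a, b in zip(data, data[1:]) if a == b)
triples = sum(1 for a, b, c in zip(data, data[1:], data[2:]) if a == b == c)

# Random bytes give ~415240/256 ~= 1622 adjacent repeats and a handful
# of triple repeats in expectation -- the same ballpark as the 1600 and
# 12 reported for the real file.
print(pairs, triples)
```

That the real file's counts match the random-data expectation is itself evidence that it has no exploitable structure.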

So let's take just this example: say we want to gain compression by removing these repetitions.

Let's try a code replacement: Because all 256 codes are already used, you have to create a 257th code, then 1/256th of the bytes would need to be augmented by one bit. (We can be reasonably certain that each byte is nearly equally represented even without doing an analysis.) How many bytes does this cost? 415241/256 = 1622 bits = 203 Bytes.

Well, code replacement doesn't actually help in the case of just one repetition, since the next byte would need to encode the duplication, thus losing any compression right away. But what about the repetitions of 3? There are 12 of them, so after the first instance, you could put in the Duplication code. This would save the next byte (the third instance). So you would end up saving 12 bytes, but this would cost you 203 bytes, for a net loss of 191 bytes.

OK, so let's try another approach.

Let's try to really go after those juicy 1600 repetitions. Why not just remove them, then in a separate data stream, indicate which bytes in the data should be repeated? So we remove them and gain 1600 bytes; our compressed file is, for the moment, just 413641 bytes long.

If we gave the exact position of each byte in need of duplication, we would need Log2(413641) bits to do that -- that's 18.7 bits -- but each byte that we saved before was only 8 bits, so we lose 10 bits for each. Ok so that doesn't work.

But why don't we just add the number starting from zero, so if the first repetition occurs 100 bytes in, we just store 100 in binary, and then keep adding to it to get to the next repetition. In theory, each repetition occurs about every 413641/1600 bytes in the file. That's 258.5 bytes. How many bits does it take to store 258.5 bytes? Log2(258.5) = 8.014 bits. But the Byte that we saved was 8 bits, which means we lose .014 bits per coding. .014 * 1600 = 22.7 bits = 3 bytes.

Again, psychologically, this looks like we should be gaining compression, but when you do the math, you see that you're losing "just" 3 bytes. (And depending on the implementation, you'll probably lose even more.)
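The gap-coding arithmetic above checks out when run with the comment's own numbers:

```python
# Checking the gap-coding arithmetic with the numbers from the comment.
import math

remaining, repeats = 413641, 1600
bits_per_gap = math.log2(remaining / repeats)  # average gap is ~258.5 bytes
loss_bits = (bits_per_gap - 8) * repeats       # each removed byte was 8 bits
print(bits_per_gap, loss_bits)                 # ~8.014 bits/gap, ~22.7 bits lost
```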

I've done some more interesting analyses, but they always result in bit loss.

For example, modulo 16 repetitions. If you modulo 16, there are 6 repetitions of 5; 105 repetitions of 4; 1437 repetitions of 3; and 1 repetition of 6. In other words, it's like 35 and 67 -- both modulo 16 give an answer of 3. You can reduce 4 bits per byte by dividing by 16 and remembering that the remainder is 3. 35 would become 2, 67 would become 4.

To Compress:

Remainder = 3

35 Div 16 = 2

67 Div 16 = 4

(This can be stored in 12 bits)

Then to decompress:

2*16+3 = 35

4*16+3 = 67

(Back to 16 bits)

When you have a run of 3, you can save 8 bits (on the face of it). When you have a run of 5, you can save 2 Bytes (on the face of it).

I leave it to the reader to see that this also would not allow any compression. Nor would any other scheme, like the oft-proposed prime factoring of the number that is the entire file.

The same problems are encountered for Modulo 8 and such.
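For what it's worth, the modulo 16 round trip above can be sketched as a toy, ignoring the cost of flagging where such runs occur (which is what kills the savings in practice):

```python
# Toy round trip of the mod-16 trick described above.
run = [35, 67, 51]                  # all equal to 3 modulo 16
remainder = run[0] % 16             # stored once: 4 bits
quotients = [b // 16 for b in run]  # [2, 4, 3]: 4 bits each

restored = [q * 16 + remainder for q in quotients]
assert restored == run              # 4 + 3*4 = 16 bits instead of 24
```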

@James V:

Some good analysis. Every time you manage to squeeze some space out of the file, the information needed to restore it is equal to or slightly greater than the savings. This means the file's randomness is holding up well.

- Mark

Heh, I did notice that I made a couple of typos, but they don't affect the end results. You could include the 12 runs of triple repetitions, which would mean (8-Log2((413641-12)/1612))*1612 = 5.4 bits lost. So that's still losing a byte even before thinking about how to implement it (which will lose more than the theoretical loss).

I would guess that if there were more duplications, the file would no longer be random, and then you might possibly squeeze out a theoretical byte (before implementation). But you would not be able to then reshuffle the file and do it again.

Is there any reason why you are letting people use multiple files? If someone used 300 or more files, would you use the total byte count?

@Joe:

It does no harm to let people use multiple files. And yes, the byte count will be for all files summed up.

- Mark

I have a submission, if you want to take a look at it; it's worth a shot, and I tried to follow the rules as stated.

The only problem with multiple files is that (a representation of) the file length should be included in the count. This can be represented in 1 to 3 bytes per file.

[And this assumes that no other file parameters are used as data.]

Because if the files were combined, you would need a way to discern where each separate data area ends.

@Zoran:

You can find my email address on the About page, go ahead and send me your submission along via email.

- Mark

James. I disagree. I have an apparent solution.

How exciting.. Do we have a winner Mark?

@Ernst:

No. Alan suffers from serious delusions of grandeur. While he often talks about his great ideas (solving all the Millennium Prize problems in one stroke, for example) he has never revealed even the slightest hint that he understands the material at hand.

In short, he is a crank, and is to be ignored.

- Mark

Poor fellow.

I know how that feels. To get caught up in the idea of success when it looks like it could or might work, *IF* it holds up once it gets to storage.

Been there done that.

I have a fish on the line here of late. Hopefully it won't be a story of another one that got away but we all know how that goes when it comes to compressing the Million Digit Challenge file.

Good luck Challenge people!

Happy Holidays and Seasons Greetings!

I am happy to see our unemployed in the USA will not go homeless for Christmas.

Ernst

Hypothetically speaking, what would be the value today of a perfect compressor capable of rapidly and recursively compressing any volume of data, in any format? What would software and telecom enterprises pay for this magic compression algorithm / software?

@Maicon:

Of course much would depend on the computation complexity.

Obviously such a thing would be worth a lot if it worked in a reasonable manner.

But this is like asking what would be the value of the Philosopher's Stone. It's a foolish question, because what you suggest does not, and will not exist.

- Mark

You're wrong; it will be real soon. Many have promised and failed. Many theories have been tested; many say it is impossible, but I believe it is possible. In 2011 we will have big news.

@Maicon:

Well, at a minimum you're going to win $100 from me.

Be sure to let me know when you have a submission ready!

- Mark

2011? Are we going to rock the house?

Nice!

Update here: I have a possible solution here, but I have not finished the encoder yet. Could go either way at this point; like a dozen before.

The real bitch is to keep that bit of size-savings once I get it. The "Devil is in the details."

Good luck Challenge people!

We may be able to compress AMillionRandomDigits.bin as follows:

1.) For now, ignore the last byte in the file, so that there are 415240/8 = 51905 eight byte chunks. Let each of these eight byte chunks represent an unsigned, 64 bit integer AND CRITICALLY, notice that each of these 64 bit integers is distinct.

2.) Let binomial(n, k) be the binomial coefficient, or the number of possible ways there are to choose k objects from n distinct items (see http://en.wikipedia.org/wiki/Binomial_coefficient)

3.) Notice that Ceiling(Log_2(Binomial(2^64, 51905))) = 2458197. Thus, it would take this many bits, or ceiling(2458197/8) = 307275 bytes to REPRESENT THE TOTAL NUMBER OF WAYS TO SELECT 51905 objects from a set of 2^64 objects.

4.) Thus, the idea for compression is that instead of storing the 51905 unique 64 bit integers, we can recover them just by storing the COMBINATION RANK, which in this case would take only 307275 bytes. As will be explained later, we also need to store information about the ordering of these integers.

As explained on http://home.hccnet.nl/david.dirkse/math/rank/ranking.html, the concept of combination rank is simple: Suppose we have two distinct numbers in the range [1, 4]. There are binomial(4, 2) = 6 ways to choose two numbers from the list of four. The possible choices are: 1-2; 1-3; 1-4; 2-3; 2-4; and 3-4. Thus instead of storing the two integers, we could just store ceiling(Log_2(6)) = 3 instead of 4 bits to represent the 2 numbers. Notice, however, that we are losing information about the order of the numbers.

5.) Since there are 51905 64-bit integers in the file, we would need to store ceil(log_2(51905)) = 16 bits for each integer to also represent its position. Thus we need (16 / 8) * 51905 = 103810 bytes to store information about the order of the numbers.

6.) Thus, we can compress the file as follows:

4 byte integer representing size of binomial coefficient | 307275 bytes representing binomial coefficient | 4 bytes representing total number of 64-bit integers | 103810 bytes of location information | 1 last byte of file that we originally ignored

Thus, the total number of bytes we need to store JUST in the data file is = 4 + 307275 + 4 + 103810 + 1 = 411094. But the original file was 415241 bytes, so we have saved 4147 bytes.

7.) We are not done, because we have not included the size of the decompressor. If someone can write a program that can open the compressed file (or byte array stored in the program itself), calculate the combination from the combination rank as per http://home.hccnet.nl/david.dirkse/math/rank/ranking.html (writing this function is not hard at all; the challenge is making it small), reconstruct the file, and then write it out to a file in less than 4147 bytes, then the file will be compressed.

I DO know how to write such a decompression program (this is easy); I DO NOT know how to write it so that the compiled code is under 4147 bytes (but it may be possible).

@pete6:

Interesting - have you verified that the 64 bit integers are distinct?

Since you are using 16 bits to order 51,905 numbers, you really can skip the combination ranking step, right? With 16 bits per number you could simply assign each number to a given slot using a linear address.

Are you sure about the math in steps 2 and 3? If my reduction of step 5 is correct, this implies that you would be able to compress any million digit file in which all of the integers were unique - which is going to be a huge number of files. I don't have a handy way to check your calculation without coding.

If your math holds up, the size of the compiled code doesn't really matter. The algorithm would be able to compress and expand any file as long as we ensure that the 64-bit integers are distinct. This more or less proves that you aren't hiding information. You would win the prize for adhering to the spirit.

Why not create a demo program that operates on, say, a file with just 4 64-bit integers? Can you compress every 32 byte file using this method?

- Mark

I WILL WORK ON ACTUAL CODE A LITTLE LATER TONIGHT...

Interesting - have you verified that the 64 bit integers are distinct?

NO; WILL DO SO LATER AND LET YOU KNOW

Since you are using 16 bits to order 51,905 numbers, you really can skip the combination ranking step, right? With 16 bits per number you could simply assign each number to a given slot using a linear address.

I DO NOT THINK THIS IS POSSIBLE, BUT IF YOU CAN EXPLAIN TO ME HOW TO DO IT, I WOULD LOVE TO UNDERSTAND. SINCE WE HAVE ONLY 51,905 NUMBERS (LESS THAN 2^16), THEY CAN DEFINITELY BE MAPPED BY A ONE-TO-ONE FUNCTION (64 BIT NUMBER ==> 16 BIT NUMBER). HOWEVER, I DO NOT KNOW OF A COMPACT WAY (SUCH AS A FORMULA) TO STORE THIS MAPPING; THE ONLY WAY I HAVE THOUGHT OF TO STORE THE MAPPING IS BY USING THE BINOMIAL COEFFICIENT (THINK OF JAVA'S BigInteger AS A WAY TO DO THE ACTUAL IMPLEMENTATION)

Are you sure about the math in steps 2 and 3?

AS FAR AS I KNOW, YES. YOU CAN VERIFY FOR YOURSELF BY CUTTING AND PASTING THE FOLLOWING INTO WOLFRAMALPHA.COM (THINK WEB-VERSION OF Mathematica):

Ceil[Ceil[Log[2, Binomial[2^64, 51905]]]/8]

If my reduction of step 5 is correct, this implies that you would be able to compress any million digit file in which all of the integers were unique - which is going to be a huge number of files. I don't have a handy way to check your calculation without coding.

YES, ANY MILLION BINARY DIGIT FILE IN WHICH EACH OF THE 64-BIT CONSECUTIVE CHUNKS ARE ALL UNIQUE. NOTE THIS IS NOT MAGICAL COMPRESSION - IT WILL ONLY WORK WITH SOME FILES.

If your math holds up, the size of the compiled code doesn't really matter. The algorithm would be able to compress and expand any file as long as we ensure that the 64-bit integers are distinct. This more or less proves that you aren't hiding information. You would win the prize for adhering to the spirit.

I WILL TRY TO WORK ON CODING SOMETHING UP TONIGHT.

Why not create a demo program that operates on, say, a file with just 4 64-bit integers? Can you compress every 32 byte file using this method?

NO. MY METHOD WILL NOT WORK WITH A FILE OF JUST 4 UNIQUE 64 BIT INTEGERS (BUT LATER, I MAY BE ABLE TO COME UP WITH A SMALLER EXAMPLE TO SHOW YOU). THE SIZE OF THE FILE YOU SUGGEST IS

4 * 64 BITS = 32 BYTES

BUT

Ceil[Ceil[Log[2, Binomial[2^64, 4]]]/8] = 32, WHICH MEANS JUST STORING THE COMBINATION RANK NUMBER IS JUST AS LARGE AS THE FILE ITSELF. BUT REMEMBER, THE COMBINATION RANK DOES NOT INCLUDE INFORMATION ABOUT ORDER BECAUSE IT IS NOT A PERMUTATION RANK. THUS, IN THIS CASE, THE COMPRESSED FILE WOULD ACTUALLY EXPAND.

YES, IT SHOULD BE POSSIBLE TO COME UP WITH A FORMULA SHOWING UNDER WHAT CONDITIONS MY COMPRESSION SCHEME WILL WORK.

I am fairly certain that you have a serious error in your math. Wolfram Alpha is giving you bad results.

- Mark
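The quantities being argued about here can be estimated without Wolfram Alpha. This is a rough double-precision sketch, not pete6's program: it sums log2 terms instead of building the exact binomial coefficient, and the 16-bits-per-number order cost is the figure suggested earlier in the thread, so treat the byte counts as approximations.

```java
public class RankSize {
    // Approximate log2(C(2^64, k)) by summing log2((N-k+i)/i) for i = 1..k,
    // then convert to bytes the way the Wolfram Alpha expression does.
    public static long rankBytes(int k) {
        double n = Math.pow(2, 64);
        double log2C = 0;
        for (int i = 1; i <= k; i++) {
            log2C += (Math.log(n - k + i) - Math.log(i)) / Math.log(2);
        }
        return (long) Math.ceil(Math.ceil(log2C) / 8.0);
    }

    public static void main(String[] args) {
        long rank = rankBytes(51905);    // bytes for the combination rank alone
        long order = 51905L * 2;         // 16 bits per number to restore order
        System.out.println(rank + " + " + order + " = " + (rank + order));
    }
}
```

The rank alone comes out near 323 KB, which looks like a saving over 415,241 bytes, but the rank discards order; adding 16 bits per number to restore it pushes the total past the original file size.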

Thanks for sharing pete6

There are many tricks to compressing this. The several I have worked out will reduce any given string by one bit, but all have some dependency to be valid encoding, and that is the reality of existence.

Save a bit here spend it there..

Still.. This is oddly an interesting activity. pete6, I hope you find it interesting.

Who knows if there is a system out there for binary digits; there seems to have been one for matter and energy.

Keep on keeping on pete6.

Ernst

Yes, you were right that my calculations were wrong, but I think I may have another idea that may work.

I finally came up with an idea that may work, but before wasting time on implementing it, I want to know if I can still win the prize if I do the following:

Write an open-source java program that you and the entire world can look at that transforms the data. Use paq8l to compress the transformed data.

Can I win the $100 even if I use paq8l and even if size of (paq8l.exe + my compiled java program + compressed source) > original source. In other words, would I still win the prize if part of it uses another compressor AND I do not consider the size of the compressor. Of course, (compressed source) < (original source) and the compressor I will write is entirely generic, so it will expand some files and compress others?

See http://cs.fit.edu/~mmahoney/compression/ for PAQ8L.

@pete6:

How about if you get it working first? I spent a lot of time working with you on your last idea; I think the burden is now on you to show that you can actually do what you say.

- Mark

Code is available at http://userpages.umbc.edu/~pete6/compr_golden/pete6/FunnyZip.java. It is a single file. Note that to compile you must type javac pete6/FunnyZip.java since it is in the pete6 package.

Executable jar is available at

http://userpages.umbc.edu/~pete6/compr_golden/FunnyZip.jar

java -jar FunnyZip.jar

You were right. No, the data is not compressible, but I have successfully written a WORTHLESS, but GENERIC compressor/decompressor in JAVA that will either compress or expand each file by exactly one byte. No, it is NOT A BARF CLONE OR A SPOOF and it IS NOT A MAGIC COMPRESSOR (IT EXPANDS SOME FILES).

HOWEVER, IT DOES INDEED COMPRESS AMillionRandomDigits.bin by exactly ONE BYTE AND DECOMPRESS IT BACK TO THE ORIGINAL. FILE NAME NOT IMPORTANT.

It will give a printout of how to use it.

It actually does something similar to the TRANSFORMATION I explained above and can compress SOME 9 byte strings to 8 byte strings. Look at the two functions untransformChunk() and transformChunk(). I can explain further if needed.

Would you please be able to look at the code and see if it falls within the rules?

ecs021pc26[674]% javac pete6/FunnyZip.java

ecs021pc26[675]% jar cmf mainClass FunnyZip.jar pete6

ecs021pc26[676]% md5sum AMillionRandomDigits.bin

5fa22a14e52b58d27ebbc1e074b7949d AMillionRandomDigits.bin

ecs021pc26[677]% java -jar FunnyZip.jar -c AMillionRandomDigits.bin c

ecs021pc26[678]% java -jar FunnyZip.jar -d c d

ecs021pc26[679]% diff d AMillionRandomDigits.bin

ecs021pc26[680]% ls -l

total 1231

-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:00 AMillionRandomDigits.bin

-rw-r--r-- 1 pete6 rpc 415240 Jan 7 22:27 c

-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:27 d

-rw-r--r-- 1 pete6 rpc 9825 Jan 7 22:26 FunnyZip.jar

-rw-r--r-- 1 pete6 rpc 29 Jan 7 21:57 mainClass

drwxr-xr-x 2 pete6 rpc 2048 Jan 7 22:26 pete6

ecs021pc26[681]% java -jar FunnyZip.jar -c FunnyZip.jar xyzzy

ecs021pc26[682]% java -jar FunnyZip.jar -d xyzzy sss

ecs021pc26[683]% diff FunnyZip.jar sss

ecs021pc26[684]% ls -l

total 1251

-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:00 AMillionRandomDigits.bin

-rw-r--r-- 1 pete6 rpc 415240 Jan 7 22:27 c

-rw-r--r-- 1 pete6 rpc 415241 Jan 7 22:27 d

-rw-r--r-- 1 pete6 rpc 9825 Jan 7 22:26 FunnyZip.jar

-rw-r--r-- 1 pete6 rpc 29 Jan 7 21:57 mainClass

drwxr-xr-x 2 pete6 rpc 2048 Jan 7 22:26 pete6

-rw-r--r-- 1 pete6 rpc 9825 Jan 7 22:29 sss

-rw-r--r-- 1 pete6 rpc 9826 Jan 7 22:28 xyzzy

ecs021pc26[685]% java -jar FunnyZip.jar -c mainClass xyzzy

ERROR: the output file 'xyzzy' already exists

ecs021pc26[686]% java -jar FunnyZip.jar -c mainClass mainCLass.c

ecs021pc26[687]% java -jar FunnyZip.jar -d mainCLass.c mainclass.d

ecs021pc26[688]% diff mainclass.d mainClass

...

ecs021pc26[721]% java -jar FunnyZip.jar

USAGE: FunnyZip

-d to decompress; -c to compress

ecs021pc26[722]%

Here is the followup: I wrote a completely WORTHLESS, but GENERIC compressor plus decompressor that will either compress or expand every file by exactly one byte. It is not a MAGIC COMPRESSOR since it expands some (most) files. It is NOT a BARF CLONE or A SPOOF, but does COMPRESS AMillionRandomDigits.bin by one byte (and successfully decompress it back to the original).

source code at http://userpages.umbc.edu/~pete6/compr_golden/pete6/FunnyZip.java

a single file

executable jar at

http://userpages.umbc.edu/~pete6/compr_golden/FunnyZip.jar

java -jar FunnyZip.jar

USAGE: FunnyZip

-d to decompress; -c to compress

Doesn't matter anymore, but: When I did my analysis, I found that the longest repetition was 3 bytes. All 4 byte numbers (in the million random digits file) are unique.

http://www.riceresources.com/documents/RDET_Anlys_78.pdf

@pete6:

If you read my post today, I think you will understand why this document is garbage:

http://marknelson.us/2011/01/09/combinatorial-data-compression/

- Mark

[...] I tried to compress the famous million random digit file from the Mark Nelson blog [...]

Keep on keeping on...

There has to be a way.

Hey, Happy new year.

About a year ago I discovered an encoding method I named dynamic unary encoding.

Dynamic unary encoding (DUE) is a way to change one file into all other files of that same size bijectively. So, if we start with this million digit file we would transform that file into the next file in either direction. Then on to the next and so on until it cycles right back to million digit file.

The catch is that it takes something like a day and a half to transform that file of 3321928 bits just once using a single 2 ghz core.

I believe a system using DUE can be used to "compress" million digit file and am willing to work with others to achieve success.

The idea is this: Cooperation; I believe a series of DUE transforms can compress this data.

I am open to a group design of the software. Perhaps others will see implementation aspects I have not.

Is there interest in forming such a group?

Ernst

Ernst,

If it takes a day to transform a file this small once, you're doing something seriously wrong.

What do you mean by "next" file? Logically, the "next" file would be accomplished by adding 1 to the current file, and the "previous" file would be accomplished by subtracting 1. Neither of which should take more than a fraction of a second.
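That "add 1" view of the next file is a few lines in Java. A sketch, treating the file as one giant unsigned integer of fixed length, wrapping at the top:

```java
public class FileIncrement {
    // Interpret a file as one big unsigned big-endian integer and add 1
    // (mod 2^(8n)), keeping the length fixed.
    public static byte[] increment(byte[] file) {
        byte[] r = file.clone();
        for (int i = r.length - 1; i >= 0; i--) {
            if (++r[i] != 0) break; // stop as soon as there is no carry
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] next = increment(new byte[] { 0x00, (byte) 0xFF });
        System.out.println(java.util.Arrays.toString(next)); // [1, 0]
    }
}
```

Even on a 415,241-byte file this is a single pass in the worst case, hence Tim's "fraction of a second."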

To change the original file into a compressible file would be the same thing as adding (or subtracting) a file of the same size as the one you are trying to encode.

Or are you doing something else?

Yeah doing something else..

Same sort of thing as a numerical increment except the strings are in a different order.

Just thought to offer..

Mark,

Will you waive the size of zlib or Gzip?

I may have found an exploit. Part of the data from an encoder crunched 10k tonight.

I'll have to rely on off the shelf compressors at zero cost to meet your challenge if this pans out.

Ernst

@Ernst:

I would be willing to consider zlib to be part of the compiler library - after all, it is included in quite a few language definitions.

- Mark
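For anyone wanting to try this, deflate (the algorithm inside zlib and gzip) ships with Java in java.util.zip, so its behavior on dense versus sparse data is easy to check. A minimal sketch; the comments describe typical behavior, not exact sizes:

```java
import java.util.zip.Deflater;

public class DeflateCheck {
    // Deflate a buffer and return the compressed byte count.
    public static int deflatedSize(byte[] data) {
        Deflater d = new Deflater();
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length + 1024]; // room for incompressible input
        int n = d.deflate(buf);
        d.end();
        return n;
    }

    public static void main(String[] args) {
        byte[] random = new byte[10000];
        new java.util.Random(1).nextBytes(random); // stand-in for dense data
        byte[] zeros = new byte[10000];
        System.out.println(deflatedSize(random)); // slightly larger than 10000
        System.out.println(deflatedSize(zeros));  // tiny
    }
}
```

On incompressible input, deflate falls back to stored blocks, so the output is a few bytes larger than the input rather than smaller.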

Thanks Mark,

I'll be working this version of codec and learning how to use Zlib.

TTYL

Ernst

413555 Jan 17 18:24 Mdigit.dat.gz

Source file was 453570 Jan 17 18:24 Mdigit.dat

Gzip keeps the original time and date.

There is hope, Mark.

Now to see if I am correct.

Fingers crossed and NO Claims being made here!

I'll work and report if or if not but this is something like encoder 100 so I am truly hopeful with these results.

Decoder work ahead and some file format thinking to do today.

Again friends.. No claims being made just blogging in the Challenge blog.

Good luck Challenge people!

Ernst

Oh Well..

As usual, forget that version of things.. Once all is recorded I lose the savings. Ain't that the way it goes?

Now to search for another idea.

Good luck challenge people!

Ernst

I just found this by Mark:

"No. Alan suffers from serious delusions of grandeur. While he often talks about his great ideas (solving all the Millenium Prize problems in one stroke, for example) he has never revealed even the slightest hint that he understands the material at hand.

In short, he is a crank, and is to be ignored."

WRONG

I did not say that I had solved all the Millennium Prize problems in one stroke.

It took me about three hours, using a combination of intuition and a VERY fast problem/phrase analysis method, to get an APPARENT insight on how to solve all 7 Clay Institute Millennium Problems.

Mark, I have had a GUTSFULL of your negative stereotyping of anyone who dares to not restrict themselves to OLD WAYS OF THINKING as promoted (perhaps) by people with power and privilege (and possibly high status) in academic fields (of endeavour) ........

I subsequently analysed various of the 7 Millennium problems more formally and gained apparent additional insight in other ways (I was on a commercial airliner looking out the window at a "patchwork quilt" of stratocumulus clouds when I realised that "turbulence" was "the packaging of space" (and recently made the (stunning) discovery that (apparently) "teleportation" is the opposite of "turbulence" (which is curious as when you are in an aircraft that meets turbulence, you do seem to experience your body being moved rather abruptly!)

However I also recently made AWESOME apparent theoretical breakthroughs in theorising about teleportation, finding a way of explaining it that appears to satisfy conventional scientific concerns re: chemistry and physics. I do not want to say too much here as I am VERY POOR by first-world standards for my age group, and am looking to license many possible new technologies.

IF YOU LOOK at the Millennium problems you will find that "turbulence" is mentioned...

Regarding an earlier apparent discovery of a way to utilise SPACE much more efficiently in the quest to store computer data, the APPARENT breakthrough I made I now have probably over 25 different perspectives on it. Of interest is that the ability to find INHERENT COHERENCY IN A SEEMINGLY RANDOM FIELD OF DATA (i.e. naturally occurring data interference fringes i.e. how the data tends to jerk itself about (like an airliner passenger who encounters, you guessed it: turbulence!)) has potential application in finding how seismometer data shows the patterns that lead to (or lead from) earthquakes.

Regarding so-called "cranks": a photo of a (apparent) series of craters on the lunar surface was claimed by a "crank magazine" to be a spaceship (as it can be looked at as if the shading is on a hill-like surface not a valley surface). I realised that it was (most likely) a series of craters overlapping forming a valley, that due to the lighting, could be perceived as if it were a long space-ship-like hill.

However, metaphorically, a series of overlapping craters does in fact match the notion of "shipping space"....

Million digit encoded into 349314 Trytes..

The Number looks good but in Binary space it's larger than Million Digit.

LOL It's the thought that counts..

Hey Challenge people!

I'm not sure if there are a few other people or if this challenge is considered "Twit of the Year" Monty Python stuff.

Anyway I have an update.

The three symbol transform of million digit file

0: 699198

1: 697479

2: 699202

2095880 trits.

I'll be exploring actual data compression with this Ternary translation of Million Digit File.

To get any compression I have to achieve 6 symbols per byte so 12 bits down to 8.

If anyone wants to work with me on it I'll consider it.

2095880 trits of information to work with.

That's about it I think for the winter season. I will learn more about probability and combinatorics.

Perhaps dense data is dense in all forms. I'd say that was true from the simple testing I have done so far. However, maybe I can represent 6 trits with 8 bits if I study this ternary encoding.
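One constraint worth stating: 3^6 = 729 > 256, so six trits can never fit losslessly in one byte, while five trits (3^5 = 243) can. And even ideal packing is no help here: 2,095,880 trits times log2(3) is roughly 3,321,900 bits, essentially the size of the original file. A sketch of five-trits-per-byte packing, assuming the input length is a multiple of 5:

```java
public class TritPack {
    // Pack 5 trits (base-3 digits) into one byte; max value is 3^5 - 1 = 242.
    public static byte[] pack(int[] trits) {
        byte[] out = new byte[trits.length / 5];
        for (int i = 0; i < out.length; i++) {
            int v = 0;
            for (int j = 4; j >= 0; j--) v = v * 3 + trits[i * 5 + j];
            out[i] = (byte) v;
        }
        return out;
    }

    // Recover the trits by repeated division by 3.
    public static int[] unpack(byte[] packed) {
        int[] out = new int[packed.length * 5];
        for (int i = 0; i < packed.length; i++) {
            int v = packed[i] & 0xFF;
            for (int j = 0; j < 5; j++) { out[i * 5 + j] = v % 3; v /= 3; }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] t = { 0, 1, 2, 2, 1, 2, 0, 0, 1, 2 };
        System.out.println(java.util.Arrays.equals(t, unpack(pack(t)))); // true
    }
}
```

Getting closer to the log2(3) bits-per-trit ideal would take arithmetic coding, but as the numbers above show, the ideal itself already offers no savings on this file.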

I see something about ternary Huffman codes so I'll go read that.

Ernst

p.s. are there others working on this?

Ernst,

Your suspicion is correct: dense data remains dense in all forms - whether it's in other numerical bases, sets of arbitrary bit strings, or even arranged in different shapes (such as 2d grids, 3d grids, 4d grids, an Ulam Spiral, zigzags, etc).

More incredibly, all the derivative data remains dense. For example, if you took every 8th bit of information from the file, you'd find that the resulting bitstring was just as dense and incompressible. If you took a random sample from it, you'd find that sample just as dense.

Even abstract methods fail to reduce the density. For instance, you could take the file and break it into 32 bit chunks representing unsigned integers. You could then say that on the total number line, 0 to 2^32-1, each of the results was "1" and the other numbers were "0". Your numbers would represent something like 103,810 out of 4,294,967,296 - 0.0024%.

Yet, if you try to randomly sample that resulting line and remove half of the numbers without hitting a "1" (despite the fact that there are so incredibly few actual "1"s in the line) you'll find that nothing helps - the arrangement of "1"s is so randomly dispersed that almost exactly half of them will be picked up.

If this were not the case, you could compress the million digit file by 1 bit per 32 bits by devising a scheme that allowed you to ignore half of the 32 bit numbers, thus storing your offsets in 31 bits. Instead, you'll find that no function or algorithm can remove half of those numbers without hitting the distributed "1"s unless the function uses more information about the pattern than you'd actually save in compression.

No matter how you examine, slice, or splice the data it will remain dense =)
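The every-8th-bit claim is easy to try on any dense data. A sketch using seeded pseudorandom bytes as a stand-in for the million digit file:

```java
public class BitDensity {
    // Take the top bit of every byte (i.e. every 8th bit of the stream)
    // and return the fraction of those bits that are set.
    public static double topBitFraction(byte[] data) {
        int ones = 0;
        for (byte b : data) if ((b & 0x80) != 0) ones++;
        return ones / (double) data.length;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100000];
        new java.util.Random(42).nextBytes(data); // stand-in for the random file
        System.out.println(topBitFraction(data)); // hovers very near 0.5
    }
}
```

A strong deviation from 0.5 in any such subsequence would itself be exploitable structure; dense data gives you none.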

That fits with what I have experienced.

Would the term Quanta apply to the underlying information?

I read that information has been called the fifth element or perhaps called quintessence.

So could the unit of information that is the million digit file be called a quanta and be independent of form?

Sorry if I sound deranged. I didn't expect your reply.

Tim

There is a way to cycle through all patterns for a fixed length of binary string and back to the starting string. It is bijective.

I've avoided that project since I'm looking at a possible year of processor time to run such a thing.

The information needed to be saved is the offset in one direction or the other and the pattern-information such as all zeros or all ones.

Will that offset be smaller space than the million digit? I don't know. Maybe I'd be lucky and it is just a distance away a few thousand bits can represent or maybe not so lucky.

But, I am sure that density transfers into new encodings. Not that an encoding won't compress some but that I have not seen an obvious smaller compressed encoding for my efforts.

I appreciate the feedback.

Ernst,

I applaud your line of thinking. Yes, it should be fully possible to cycle through all the possible patterns, but let me explain why it's a bad idea.

First, I'm going to set aside performance and time considerations (though they defeat the idea alone, as may become apparent momentarily).

Let's say you wanted to cycle from the number 0 to the number that the file represents. Obviously, in this case, the number would be (in binary) 3,321,928 digits long and would require 2^3,321,928 cycles to reach the goal.

That number, btw, comes out to about ~9.36x10^999999 according to Wolfram Alpha. For comparison, there are only about 10^80 particles in the entire universe.

So, no matter what, if you start from 0 and try to cycle, you would require the same number of bits as the original file to express your cycling offset.

Moving on, let's say you tried to roll a random number (since hey, you could build a binary string of the same length as your target data with a single integer). Let's say that, lucky for you, this new binary string represented a number that was exactly halfway to your goal! Awesome! Now you only need to express 2^3,321,927 cycles to get there... oh.

At this point, you might get clever and say "Well, what if I kept rolling random numbers, and stored only the ones that led me closer to the goal through some deterministic method." The first number will get you halfway there, the next perhaps 3/4, the next perhaps 7/8, and so on... and yet, it will never quite reach the full and correct number until you expend the same amount of data as was originally required to express the original number (or maybe 1 bit less! You might get lucky enough to get past that one bit!)

In essence, you'll never spend less than one bit less than the original sequence if you're trying to just cycle, or combine other numbers to express it.
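Tim's figures check out without building the giant number: the decimal digit count of 2^b is floor(b * log10(2)) + 1. A sketch:

```java
public class DigitCount {
    // Decimal digit count of 2^bits, computed without constructing the number.
    public static long digitsOfPow2(long bits) {
        return (long) Math.floor(bits * Math.log10(2)) + 1;
    }

    public static void main(String[] args) {
        System.out.println(digitsOfPow2(3321928L)); // a million decimal digits
    }
}
```

So the cycle count 2^3,321,928 really is a number with a million decimal digits, matching the ~9.36x10^999999 figure.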

I'm not sure what direction to take.

There are mechanical blind-sort possibilities coupled with transform.

There is stepping through sequence using zlib to test each instance.

I agree that it is possible to end up with one bit less for my efforts. I want you to know I understand that because I have seen that in several systems I have tried.

Well, I can't cry because I have no choices or no ideas.. Also I can't discount looking under every proverbial rock since I have been surprised when following through on what looked to be insignificant but turned out to be important to have explored.

I don't know what direction to follow next.

I'll be thinking about it.

Thanks Tim.

Ernst,

First, don't get disheartened - at least, not for the wrong reasons. Sure, it is provably impossible to compress all possible data, and sure, it's most likely impossible to compress a given "random" source of data when it's encountered, as it has already reached a high state of information entropy.

You never know though, this particular file might have an undiscovered weakness. As long as you approach it from the perspective of not trying to build a universal random data compressor/decompressor, you remain sane AND educated about this entire challenge.

For my part I've gone through much of what you have, and I can't tell you how much I learned and how valuable it's been to me.

So don't despair, just understand that you've managed to unleash a great deal of your focus and attention on realizing your ideas and trying to accomplish something. Now imagine if you focused that same zeal on mastering the current tech behind compression theory and algorithms - you may yet provide the community valuable and insightful contributions outside of this one particular challenge!

That's kind of you to say.

Yeah, having spent a few years learning to encode and decode. Learning to spot similarities in different systems because of the common source file ( Million digit ) is of value to me.

My first encoder was based on the dynamic system Lothar Collatz made famous [3x+1,x/2]. In the beginning (2002) I thought I would just use determination to get things done but after I had done my best to make a dent in the million digit file size I had to step back and say: Okay I see that I am not the master so I must be the student.

Once I changed my point of view each experiment became a lesson. That's it really.. Keep on keeping on.. Pay attention. Follow up and most important revisit and review the concepts already experienced from time to time because as the understanding evolves so does the ability to apply old algorithms in new ways.

I'm starting the project to step through the sequence of files the bijective transform generates. I'll be testing the compressed size of each instance. Perhaps a compressible file is not too far away.

I'll be using a faster method than the one that takes a day and a half @ 2ghz to render a single instance.

Here is a thought.

Since we can generate all files of specified length there is data to consider. Maybe one of those files is a short story no one ever wrote or a picture no one ever took.

Building on that concept then all the information that can be, already exists. That means "The Truth is out there." http://www.youtube.com/watch?v=JDZBgHBHQT8

I can tie up the desktop now that I have a laptop and I might as well.

I don't want to quit before I tried bijective transform brute force style.

Lots of ideas..

You are welcome to chat more Tim.

The one thing that is lacking in my hermit lifestyle is healthy interaction with those who share my interests.

Number crunching Hermits are odd folk..

----

Since I have gone on for years about the [3x+1,x/2] I can share the series that I believe the Collatz fits into.

Trying to extend the series past 0/0 is where it gets trippy in my opinion.

And the point isn't the numbers, it's the patterns.

With the Collatz, and I believe for all [3x+/-y,x/2], the 3x+/-y part shares only one element with x/2: the even right after a 3x+/-y operation. Other than that it's a struggle for control between the two systems and x/2 always wins.

The structure of all patterns for [A(x)+or-y,x/2] is Fibonacci and base 3 in nature from what I can tell.

I'd love to work on a base 3 computer.

I asked about geometric relationships of that series and one kind professor took a look and said that there was a relationship in a couple of the A/B but the whole thing didn't fit.

Anyway interested folk can look at those for some fun.

I had a web page for a few years with that stuff on it but Geocities has closed down so I can be reached via Email for questions and help.

That is the stuff of a type of encoders if anyone wants to play with it. For all those Collatz type systems they share a common parity language structure.

I have examples of attractor finder code so that people can locate the attractors for any system.

Anyway I entertain myself with such musings. So far it keeps me off the streets :)
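For anyone who wants to play with the [3x+1, x/2] system mentioned above, the basic orbit-length computation is a few lines. A sketch; 27 is the classic example of a surprisingly long orbit:

```java
public class CollatzSteps {
    // Apply the [3x+1, x/2] map until reaching 1 and count the steps.
    public static int steps(long n) {
        int s = 0;
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            s++;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(steps(27)); // 111 steps before hitting 1
    }
}
```

The parity of n at each step gives the "parity language" structure Ernst refers to.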

How strange.

The series is missing from the post.

Just above the line "Trying to extend the series past 0/0 is where it gets trippy in my opinion."

Should be a series so let me try it with spaces like A / B

Sorry about the confusion but I did write a good looking text.

All these can fit in a dynamic system [A(x)+/-y,x/z]

Again the point, in my opinion, is the patterns not the values.

Mark your editor is stripping text out.

To Share the series here I have to use a different character than the divide by to represent the system

In the denominator position is the multiplier and in the case of the Collatz example it's three so I will use the ":" for the back slash I like to use that makes it look like a fraction. How funky.

Try and see a fraction looking thing in your mind when you see A:B.

One more try to post.

It just won't let me post numbers not even A : B,

Ahhh.. Code escape..

let's try that!

[c]

[\c]

The Collatz is [3x+1,x/2] so it exists in the 3/2 for example

Fingers crossed.

Wow Mark.. If I could edit I would

One more try.. My bad on the wrong slash

Again the point, in my opinion, is the patterns not the values.

Let me try spelling it out.

Lets see Imagine a series of fractions.

It extends to infinity on both ends.

from left to right

Negative 4 over negative three.

Negative 3 over positive two.

Negative 2 over negative one.

Negative 1 over positive one.

Zero over zero.

One over One.

Two over One

Three over Two.

Four over Three.

And so on in both directions to infinity.

So you reader have to write it down to see the relationship.

I'm trying to share but this is a cluster f*ck of an editor.

Test -4/-3 -3/2 -2/-1

Okay then it's the commas that trip it up.

Mark, please delete the comments affected. Pardon the snafu.


Frack! It did it again.

I am trying to post a series of dynamic systems.

-4 / -3 -3 / 2 -2 / -1 -1 / 1 0 / 0 1 / 1 2 / 1 3 / 2 4 / 3

Matt,

I am running tests on a rewritten encoder.

I'll be working on the software for a while but I spotted an interesting file.

T: 1193 Ones 1660964 Zeros 1660964 Diff: 0 High: 6484

This is the 1193rd file generated in one direction that has equal set and reset bits.

The "High" of 6484 is the largest difference between set and reset bits up to the point of 1193 files.

This algorithm is a lot faster than the original algorithm.

Mark feel free to delete the above posts and clean your blog up.

Pardon the jumble.

Approximately 600,000 files per 2ghz processor per day can be generated. A test run of 100,000 files took 4 hours.

Nothing too exciting to report in the first 100,000 files generated from this simple test run.

I have no idea if a coherent pattern was generated since I was using the ratio of set bits to reset bits as my guide to get the general idea of the informational landscape.

The largest difference between set and reset bits in the first 100,000 files generated in the forward direction was 7572 bits difference.

I'll check back Matt.. Did you understand the series of systems I posted? I may have to have Mark post an email since the editor is horrid. Why would an editor selectively strip data from a post?

Who-thunk that up?

On to the reverse function.

I have isolated that first 0 difference between set and reset bits file in the forward direction of this transform.

There are a lot more zero difference files.

The million digit file has a difference of Ones 1660251 Zeros 1661677 Diff: 1426

If anyone wants a copy just email me.

I'm open to reading about what you determine about the file.

My guess is it is more statistically flat than the million digit file.

Gzip doesn't compress it.
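Set/reset counts like the ones quoted here take only a few lines to reproduce, and the arithmetic checks out: 1,660,251 + 1,661,677 = 3,321,928 = 415,241 bytes x 8. A sketch of the bit counting:

```java
public class PopCount {
    // Count set and clear bits in a byte array.
    public static long[] countBits(byte[] data) {
        long ones = 0;
        for (byte b : data) ones += Integer.bitCount(b & 0xFF);
        return new long[] { ones, 8L * data.length - ones };
    }

    public static void main(String[] args) {
        long[] c = countBits(new byte[] { (byte) 0xF0, 0x01 });
        System.out.println(c[0] + " ones, " + c[1] + " zeros"); // 5 ones, 11 zeros
    }
}
```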

Well, Tim..

I think I was wrong about this version of DUE.. I think it splits the patterns into two sets rather than one. I'm rechecking my original work but so far I think it's a two set system not a one set system as I thought.

So whatever set the Million digit file is in after 850,000 files the high difference between set and reset is 9100 bits.

I'll have to implement the zlib to be sure I am not passing up files that can compress.

I think you were wrong on this being a bad idea. How can I learn things if I don't explore them?

More work to do..

This looks to be two (rings?) So I need to audit things here.

Anyway.. Mark.. Whatever man.. I feel like I wrote graffiti but I was hoping to chat and share. As you see fit on this segment of posts.

Back to the solace of "why did it do that."

Later Tim.

So, Hey.. Update.

I have a question. I also posted on Data compression newsgroup.

Does anyone understand what Cycles are? I believe I may be working with Cycles as defined by Wikipedia.

http://en.wikipedia.org/wiki/Cycle_%28mathematics%29 but I am not sure yet.

The above Wikipedia definition seems to fit best. It seems to be the "Tiger" I have by the proverbial tail.

Does any data compression use cycles? If so, which ones? Link or search term welcomed.

I'd appreciate feedback.

This is a reply to a post I had a chance to reply to before.

However, I did not know enough to counter. I do now.

I counter because I know better now.

Humbly, Ernst

Quote follows.

----------------------------------

Tim said,

in March 10th, 2011 at 4:15 pm

Ernst,

I applaud your line of thinking. Yes, it should be fully possible to cycle through all the possible patterns, but let me explain why its a bad idea.

First, I'm going to set aside performance and time considerations (though they defeat the idea alone, as may become apparent momentarily).

Let's say you wanted to cycle from the number 0 to the number that the file represents. Obviously, in this case, the number would be (in binary) 3,321,928 digits long and would require 2^3,321,928 cycles to reach the goal.

That number, btw, comes out to about ~9.36x10^999999 according to Wolfram Alpha. For comparison, there are only about 10^80 particles in the entire universe.

So, no matter what, if you start from 0 and try to cycle, you would require the same number of bits as the original file to express your cycling offset.

Moving on, let's say you tried to roll a random number (since hey, you could build a binary string of the same length as your target data with a single integer). Let's say that, lucky for you, this new binary string represented a number that was exactly halfway to your goal! Awesome! Now you only need to express 2^3,321,927 cycles to get there... oh.

At this point, you might get clever and say "Well, what if I kept rolling random numbers, and stored only the ones that led me closer to the goal through some deterministic method." The first number will get you halfway there, the next perhaps 3/4, the next perhaps 7/8, and so on... and yet, it will never quite reach the full and correct number until you expend the same amount of data as was originally required to express the original number (or maybe 1 bit less! You might get lucky enough to get past that one bit!)

In essence, you'll never spend less than one bit less than the original sequence if you're trying to just cycle, or combine other numbers to express it.

----------------------------------
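Tim's counting argument above can be sketched in a few lines (an illustration of the quoted reasoning, nothing more):

```python
import math

# To single out one specific n-bit string among all 2**n possibilities,
# the index alone needs ceil(log2(2**n)) = n bits -- as large as the
# string itself.
def index_bits(n: int) -> int:
    return math.ceil(math.log2(2 ** n))

assert index_bits(20) == 20

# "Lucky" halving doesn't help either: each stored guess that halves the
# remaining search range buys exactly one bit, so saving k bits costs
# k stored binary choices -- a wash.
remaining, stored = 2 ** 20, 0
while remaining > 1:
    remaining //= 2
    stored += 1
assert stored == 20
```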

Cycles:

There are finite sets of finite sets. That is as simple as I can write it.

So the orbit is a finite set. There is a finite set of cycles.

Thanks for the conversation. Pardon any stress in receiving a delayed second reply.

I am sure the complete geometry of existence will surprise us all.

Update:

Million digit file is an element in a k-cycle of 4194304 elements for this configuration.

I don't know yet if a compressible file is in that cycle but that is fewer files to check than I assumed.

Good luck Challenge people!

Hey!

I've worked out the bugs in the "proof" program. Meaning I have verified that the k-size of the cycle which the Million Digit file is an element of is 4149404.

What is next is to understand how to progress from one cycle to another and back. There has to be a way to "jump orbits."

I have to say this, I didn't know much about this kind of math and I am still learning.

There are other cycles for other configurations of this encoder and that too is an area of interest for me but first I should see about making use of this configuration.

It isn't important now that this challenge was intended as a gag. It has been a catalyst for discovery. I don't see what I'm doing anywhere else on the Internet, so just maybe I have a discovery, and that alone validates this whole challenge.

I have much to learn and much to do. I'll blog back in a while.

Good Luck Challenge people.

It is amazing how people are still hopeful to compress the incompressible!

Mohamed Al-Dabbagh said,

in July 2nd, 2011 at 1:52 am

It is amazing how people are still hopeful to compress the incompressible!

--------------------------

Ah, folly of the incompressible, and yet the effort opens doors of discovery.

Hey, I watched a documentary of interest to me and maybe to you and your readers. http://video.google.com/videoplay?docid=-5122859998068380459

It is possible our Data "tool set" may not be complete.

Perhaps there is more discovery ahead for all.

Good luck Challenge people.

Well, I apparently have what appears to be an answer.

I can also see what "hierarchies of infinity" actually are (differing points of view of external things);

and what the on-again, off-again problem Cantor had may actually be (why he thought he solved, then hadn't solved, the "continuum hypothesis" (sensitive ideas)).

@Alan:

So which is it? Is the Continuum Hypothesis provably true or false?

- Mark

What is "proof"?

Proof = "(a) "continuum --hypothesis - -"

It is a ___________________

I am not going to fill in this because it is worth a lot of income and I am tired of being as poor and hungry as I am.

You want to "prove" if "proof" is "true" or "false"...?

If you want to make a ridiculous amount of money then please get over your ?????prejudices!

What is "Ultimate logic"? (see "New Scientist" recent issue re: Cantor etc.). "Logic" = "path"; "ultimate path" = a path with discrete boundaries (a VIRTUAL 'set')

inherent coherence: a PERSPECTIVE

a way things "hang together"

If you make a "data hologram" of e.g. seismometer data, then you can "rotate the data" (= "continuum hypothesis!!!!!!" )

and in some views you will see how "a quake to come" (or a quake that has been") is leaving a trace in the data (the data has a kind of "Objectivity" - it "holds together"- it has "continuuance" !!!!!

Needless to say, if you can find the inherent ways that data "sticks"- forms its own sort of jigsaw- you may find you can fit a lot of data in a very small room (by finding the hyper-space tangential (the "L-function" I could describe it) of the "data" (the data almost "disappears"- becomes nearly 'invisible'- it becomes like a material object: software becomes more like "hardware"!!!!!!!)

Quote: "The Folly of Infinite Compression is one that crops up all too often in the world of Data Compression. The man-hours wasted on comp.compression arguing about magic compressors is now approaching that of the Manhattan Project.

In a nutshell, a magic compressor would be one that violates the Abraham Lincoln’s little-known compression invariant: “You can’t compress all of the files all of the time”. "

comment: "all of the files IS "all of the time" (all of the

______________________)

Of course you can not, unless you don't even try to or need to (via using a Space computer (which creates a (constant...) "time differential" (imaginary time in 3d: hyperspace known (internal limits on external edge) (i.e. a (coherent type) fixed structure (like a regular ____________________)

Quote: "It’s trivial to prove, and I won’t do it here, that no single compressor can losslessly and reversibly compress every file."

Comment: The answer IS "no single "compressor" (i.e. an algorithm that maps space: so every file has its own individually designed compressor via the _____________________ quantisation diffraction _________ 5-6-7 (...8) dimension ____________ 'function' )

Quote: "The easiest way to foil most compressors is with a so-called “Random File”.

Comment: This technique turns any (?) (except one) (which is !!!! pi !!!!! I think) file into a so-called "random file" (it becomes _________ _________(these blanks are in part a: a virtual _________))

Quote:

"In the world of Information Theory, Randomness is a tricky thing to define, but I like to fall back on Kolmogorov complexity, where we define the complexity or randomness of a file as the Kolmogorov complexity w/r/t a Turing Machine."

Comment: = '4 bit ____________ or a virtual __________________'

Add this to a ______________ file and you get hyperspace virtual real quantisation (or establishing local LIMITS on hyperspace with respect to the supposed "file" being tailor -

s______ )

Quote:

"Several years ago I posted a challenge on comp.compression that I hoped could be used to silence any proponents of Magic Compression, and I’m reposting it here so I have a permanent location I can point to.

How does it work? I took a well-known file of random digits created by the RAND group (no doubt at taxpayer expense), and converted that file into binary format, which squeezed out all the representational fluff."

Comment: a RAND file is ALREADY in so-called binary format if you look at it a certain way

When put in so-called binary format it becomes a space-computed file

Quote:

"The result was AMillionRandomDigits.bin, 415,241 bytes of incompressible, dense data.

Comment: correct (in one sense) (well done, but.....

turn it into a hyper-space jigsaw and something amazing begins to happen (it starts to unravel and line up as a matrix of variables with specific right-angle arrangement (Like a ____________________s game/ board))

Quote:

"The challenge is to create a decompressor + data file that when executed together create a copy of this file."

comment: Hyper-space _______________ in 5 d should do it

Quote: "The size of the decompressor and data file added together must be less than 415,241 bytes."

Comment: It may be less than or no more than 1 byte (it reconfigures the inside of your computer (using a hyperspace binary (super) 'group'

Quote: "So far nobody has been able to do this. I wouldn’t have been surprised to see somebody get a byte or two, but that hasn’t happened."

Comment: "A byte or two" is a very interesting idea

Quote: "The only real rule here is that we have to negotiate exactly how to measure the size of the program plus data file."

Hmmmmmmm

Quote: "I’m willing to exclude lots of stuff. For example, if you wrote the program in C++, how would we measure the size of the program?

In this case, I would measure it as the length of the source file, after passing through a reasonable compactor. The run time libraries and operating system libraries could be reasonably excluded. But that’s just one example, the rest are all subject to public interpretation.

Really, the only rule we need is that the executable program can’t have free access to a copy of the million random digits file. For example, you can’t create a new programming language called JG++ that includes a copy of this file in its Run Time library. You can’t hide the digits in file names. And so on.

As long as those rules are obeyed, the challenge will never be met."

Comment: The file HAS ITS OWN RUN TIME LIBRARY.

The key is knowing how to find this internal hyper-space quantisation ((diffraction)) '...limit ' and de/ not d "quantise it (in hyper'space')(turn it into a ________(______ "library" (or difraction limit on ... hyp_________)

Quote: "Recursive Compression: This challenge has a special place for the variant of Magic Compressors known as Recursive Compressors. Some savants will claim that they have a compressor that can compress any file by a very small amount, say 1%. The beauty of this is that of course they can repeatedly use the output of one cycle as the input to another, compressing the file down to any size they wish."

Comment: If you do that it will push out in other ways that place limits on it (you end out with a maximum of a ________________(a PERSPECTIVE or point-of-view on the so-called file)(__________________________________________________________)

Quote:

"The obvious absurdity to this is that if we compress every file to a single bit, it’s going to be kind of hard to represent more than two files using this algorithm."

Comment: What you need is "hyper _________" ____________________________________________________________________________________________________________________

Quote:

"So most people in this subspecialty will claim that their process peters out around some fixed lower limit, say 512 bytes.

For those people, a working program should be able to meet the challenge quite easily. If their compressed data file is a mere 512 bytes, that leaves 400K of space for a decompressor that can be called repeatedly until the output is complete."

Comment: The computer may interfere with such a program?

The Payoff: The first person to achieve the challenge, while staying within the spirit of the challenge, will receive a prize of $100.

Comment: Unfortunately, the possible answer is worth more on the market?

Alan,

I've reviewed each of your comments and found no mathematical coherency to the ideas you've put forth.

Are you OK?

Tim:

I have intentionally left blanks throughout my comments, so as not to reveal too much.

You are only supposed to get the notion that I MAY know an answer. I am very poor for someone in a non-third-world country; I cannot afford to give away what appears to be an answer.

Mathematics is "the science of pattern"; "physics" I could call "the pattern of science".

If I were to describe "mathematical coherency", would you expect to find mathematical coherency in my description? Or would you expect something wider than "mathematical coherency' to be part of how one gets a handle on "mathematical coherency" (or " the cohering of categories" or universal set quantisation )(flippability )?

There is a statement on this website that says:

"Arithmetic coding + statistical modeling = data compression".

According to an analysis I did of this claim, I found this claim to be true.

My analysis then might suggest to professionals in the computer industry that the un-revealed apparent breakthroughs I have made with regard to the subject of what is known by the phrase "data compression" may also turn out to be successful.

I just stumbled upon this funny challenge. It is absolutely amazing how many replies it has caused.

I want to add that of course compressing a random file can be possible. The probability of creating a random file consisting purely of 0's is not zero, for example. That file would be trivially compressible.

I think it might very well be that your specific file can be compressed by a few bytes, just by coincidence. That would be a meaningless result however.

What if all files are 'trivially compressible' when you know how to do this?

Apparently the way to do this is via a --------------------------------------------

(censored by me as potential commercial use) !

Mark,

Although I think you may have stopped bothering to read the comments after a few weeks of these rather frustrating posts, I hope you will answer my question.

Is a compressor still "magic" if, instead of compressing all files, the files it can't compress are just very usefully small? With the reading I've done so far I'm inclined to think not; it fits the proof against infinite compression.

Any reply would be appreciated! Thanks.

@Komon:

Let's just say for the sake of argument that your definition of "very usefully small" means a file that is under 1K in size.

A compressor that can compress all files that are greater than 1K, but cannot compress some of the files under 1K is still magic (and provably impossible.)

- Mark
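The impossibility Mark mentions is a pigeonhole count; a minimal sketch, with a toy cutoff standing in for 1K:

```python
# Why a lossless compressor cannot shrink every file above some cutoff:
# count all bit strings up to length LIMIT versus all strictly shorter
# strings they would have to map into, injectively.
LIMIT = 20  # toy stand-in for 1K; the count works the same for any limit

inputs = sum(2 ** n for n in range(LIMIT + 1))   # strings of length 0..LIMIT
outputs = sum(2 ** n for n in range(LIMIT))      # strings of length 0..LIMIT-1

# More inputs than available shorter outputs: two inputs must collide,
# so decompression cannot be unambiguous.
assert inputs > outputs
```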

Can you provide the alleged proof of "impossibility", Mark?

Here is something which is not specifically my commercially sensitive idea, that occurred to me that I could say here, to provide food for thought:

A way in principle to store an unknown string of zeroes and ones that is one million characters in length:

The simplest way to store something is via a reference; e.g. Tolstoy's novel "War and Peace" is readily "decompressed" by looking up the reference at a library.

However, if you entered this title in a brand new unused computer, you would not expect to find that novel in all its detail.

But if you entered the following instructions to the brand new unused computer, and if it had a particular type of logic program installed, you would in principle recreate the novel in all its detail:

The instructions are:

The object to be generated by the computer has n characters (or in the case of "the million digit challenge" it has a million characters)("n" refers to whatever the actual number of characters including letters, spaces, and punctuation is).

In a list of all possible combinations and frequencies of these characters, the object to be found is number x on that list (e.g., in "the million digit challenge", it could be number 5 billion and 27 on the list of all possible options for arranging zeroes and ones in a string of one million characters).

The computer needs to have a logic program installed that goes through the list of all possible arrangements in a methodical way (e.g. the list of all possible arrangements of 4 characters is: one: 0000; two: 0001; three: 0010; four: 0100; five: 1000; six: 0011; seven: 0110; eight: 1100; nine: 1001; ten: 0111; eleven: 1110; twelve: 1011; thirteen: 1101; fourteen: 1111)

The computer only needs to know that the item to be stored has so-many characters, and is item number whatever on a methodically generated list of all possible ways to arrange those characters. It can then recreate the item by running a logic-scan of the list (the way in which the list is generated can be a methodical process, with numerous shortcuts to improve efficiency - also a small amount of additional information would help (e.g. how many zeroes and how many ones in the million characters)).

The secret more advanced technology could be like a Sudoku puzzle, with floating information that self-configures to produce one solution.
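As a sketch of the store-the-index idea (using plain lexicographic order rather than any particular listing procedure), and of where the accounting breaks down:

```python
# Encode a bit string as (length, rank) where rank is its position in the
# lexicographic list of all strings of that length; decode by regenerating
# entry #rank. The catch: rank ranges over 2**n values, so writing it down
# takes up to n bits -- the size of the original string.
def encode_as_rank(bits: str) -> tuple[int, int]:
    return len(bits), int(bits, 2)

def decode_from_rank(n: int, rank: int) -> str:
    return format(rank, "b").zfill(n)

n, rank = encode_as_rank("1000")
assert decode_from_rank(n, rank) == "1000"
assert rank < 2 ** n   # the index needs as many bits as the file
```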

@Alan:

Math: learn it.

http://marknelson.us/2011/01/09/combinatorial-data-compression/

Interesting article...

First: to correct an error in my post: there are sixteen ways of presenting zeroes and ones in 4 characters, the two I omitted are:

fifteen: 1010; sixteen: 0101

Regarding the article on combinations and permutations:

What I presented here, which may be related to other things but is not specifically the technically advanced apparent discovery that I have not revealed, was a "throw-away line" that suggests (though I am aware there may be problems) an intriguing possibility in principle, one that concerns zeroes and ones, not bits and bytes.

In your article, you introduce the idea of a "subset". The notion I presented does not have the "problems" shown in your article, because there is (initially at least) no subset situation, just ALL unique ways of listing zeroes and ones in a fixed number of characters.

Each way is identified by its number in the total (e.g. among all ways of listing zeroes and ones in 4 characters, option five can be 1000). To "store" this option, one only needs a program that runs through all options in a methodical way, the number of characters (4), and the location on the list of all options as generated by the automatic listing procedure (however it is designed, re: which option comes when).

The immediate issues are: can a listing procedure be designed? I think so.

Does it take more zeroes and ones to describe the (1) number of characters and (2) the number on the list of all possible ways of writing zeroes and ones in that number of characters? In the case of just 4 characters, the method appears to fail as it would take more zeroes and ones to describe the option this way than to just list the actual file e.g. file 1000.

But what if it were a million characters? Wouldn't the numbers to describe how many characters (1,000,000) and which option on the list of options (could be over many billion) take up more zeroes and ones than the actual file? Maybe, depending on efficient mathematical shorthand techniques (like exponents etc.).

Then there is the question of listing more information about the file to reduce the size of the list of all possible options. Now you do have a subset and it is here that saying any more might open up too much!
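The subset question can be checked numerically; a sketch with a smaller n (`math.comb` counts the subset exactly):

```python
from math import comb, log2

# Does recording how many ones the string has shrink the index? For a
# balanced n-bit string, the index into the subset of strings with k ones
# needs log2(C(n, k)) bits, plus a few more bits to record k itself.
n, k = 10_000, 5_000
index_bits = log2(comb(n, k))   # bits to pick one member of the subset
count_bits = log2(n + 1)        # bits to record k (0..n)

# For typical (balanced) files the two together still exceed n bits:
assert index_bits + count_bits > n
```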

digit   4-bit code   8-bit ASCII
1       1000         00110001
2       1001         00110010
3       1010         00110011
4       1011         00110100
5       1100         00110101
6       1101         00110110
7       1110         00110111
8       1111         00111000
9       0111         00111001
0       0110         00110000

Make a translator that converts every character from 8-bit binary to the 4-bit code. The file is only numbers, so you only need 4 bits of binary to represent them all losslessly. That should cut the file size by 50%, leaving plenty of room for the decompressor.

Cheers.
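Brian's packing step is easy to sketch, assuming the ASCII digits file and straight 0-9 nibbles rather than the table above:

```python
# Pack each ASCII decimal digit into 4 bits instead of 8, halving the size.
# Non-digit bytes (record delimiters) are simply dropped in this sketch.
def pack_digits(ascii_digits: bytes) -> bytes:
    nibbles = [b - 0x30 for b in ascii_digits if 0x30 <= b <= 0x39]
    if len(nibbles) % 2:
        nibbles.append(0)          # pad to a whole number of bytes
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

assert len(pack_digits(b"1415926535")) == 5   # 10 ASCII bytes -> 5 bytes
```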

@Brian:

No, not even close.

The file is only numbers? Yes, the ASCII version of the million random digit file contains only the digits 0-9 plus record delimiters. But that's not the file you are being asked to compress. The file you are being asked to compress contains the binary version of the million digit number.

- Mark

ah, bummer.

...And if someone could complete the challenge (and more generally solve the problem of data hypercompression), what would the consequences be for computer science and industrial/commercial computing?

@Monad:

Matt Mahoney has pointed out that many fantastic compression claims would require that P=NP. The web has some good speculation on what the consequences of that would be.

No worries on my part - it's been a long time since anyone has even thrown a new idea at this challenge.

- Mark

That it requires P=NP does not mean it is impossible; it means it is very expensive in time resources, something that might be addressed with qubit technology (quantum computing). Moreover, it cannot be excluded that a new paradigm approach gives a polynomial-time solution or, more realistically, an almost-polynomial-time solution.

(sorry for my english, that is not my native tongue)

Ah, it's that time of year again. Time to spend more time inside where it's warm than outside where it is cold. It's my favorite time of the year since I like to work on this challenge.

As to the MillionDigit Challenge.. I still dig it. I too am moving on into Cantor and Set Theory.

I have more questions than experience and I have not run out of possibilities.

Alan was nice to read because of the display of creativity.

My favourite saying about these things now is "the Universe will not let us cheat but I keep trying."

That can apply to MillionDigit Challenge and to death itself.

As Prof. Einstein wrote, "Imagination is more important than knowledge."

I don't have a candidate for a compressor going into this winter cycle Mark. I do need to learn more about Set Theory at this time.

I do believe there is a function that can satisfy this challenge. Something that will map the bits of MillionDigit to a smaller file, where that file plus the program is smaller than the MillionDigit file itself.

I think we can all agree that the path to find that function isn't obvious or easy.

And now a quote from Hitch-hikers guide to the Galaxy "Keep banging those rocks together!"

Good Luck Challenge people and thanks for keeping the site going Mark. I still like it and the challenge.

Hey Friends.

If there are any in this line of work :)

I am enjoying some tunes and again spending an afternoon day-dreaming on the topic of this file.

I am about to display the binary and look for some structure.

I know this whole challenge is a stfu effort, but hey, that is a narrow-minded attitude.

Dare to dream people!

Okay.. Since I am in the mood to work on my programming projects I have had to decide on a direction for the coming year.

It's to write a proper paper on an encoding method or two I already understand.

Any advice on paper writing is appreciated.

I don't see myself authoring more compression program efforts at this time. Not that I don't find myself working out an idea every now and again.

I'll stop back in, in a while. Be Well..

Ernst

I'm going to be working soon on some arithmetic coding updates. I'd love to see good work on the Q-coder, range coding, and other arithmetic coding techniques that go beyond the basic Witten-Neal-Cleary model.

- Mark

Here's my solution, wherein a) the compressor will shorten any source greater than 32 bytes in length by 32 bytes, and b) the decompressor is only 24 bytes long and can be run from a shell prompt.

Total net savings: 8 bytes, every file, every time, guaranteed!*

For brevity, I've written a compressor specifically for AMillionRandomDigits.bin; a general-purpose compressor is left as an exercise for the reader.

$ head -c415209 AMillionRandomDigits.bin > x

And, the decompressor (only 25 bytes; shell prompt prepended for clarity):

$ head -c32 /dev/random>>x

Success!

* Note: decompressor may not decompress the file correctly on the first iteration; if so, re-compress and try again. It'll get it eventually.

Errata: the decompressor is in fact 24 bytes (the description above erroneously stated it is 25 bytes in one place; this is because the beta version of the decompressor used /dev/urandom instead of /dev/random).

@Andrew:

This humorous approach could almost work - you could try hiding data in the form of an MD5 signature in your decompressor, then keep trying until you get a match.

Of course, the extra cost of the signature would always be greater than the bits it allowed you to remove, but that's where the fun comes in - trying to get that past the judges.

- Mark
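The hash-and-retry idea can be sketched in miniature (illustrative only; a 2-byte suffix search with MD5):

```python
import hashlib
import itertools
import os

# Drop the last two bytes of a message, keep the MD5 of the whole thing,
# then "decompress" by trying every 2-byte suffix until the hash matches.
# The 16-byte digest costs far more than the 2 bytes it recovers -- which
# is exactly Mark's point about the accounting never working out.
original = os.urandom(32)
truncated, digest = original[:-2], hashlib.md5(original).digest()

for pair in itertools.product(range(256), repeat=2):
    candidate = truncated + bytes(pair)
    if hashlib.md5(candidate).digest() == digest:
        break

# A spurious collision in a mere 2-byte search is vanishingly unlikely.
assert candidate == original
```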

Suppose the compressor and decompressor together exceed the million digit file in size, but still compress the file (and other random files) to minimal size; say, via the transform ((zv[x] + ((x) mod 23)) mod 256):

Calculate the frequencies of both the original and the transformed file

Original Frequency

[ [ 47, 44, 43, 42, 42, 40, 40, 40, 40, 39, 39, 39, 39, 39, 39, 38, 37, 37,

37, 37, 37, 37, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 35, 35, 34, 34,

34, 34, 34, 34, 34, 34, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 32, 32,

32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 31, 31, 31, 31, 31, 31,

31, 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,

30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 29, 29, 29, 29, 29, 29,

29, 29, 29, 29, 29, 29, 29, 29, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,

28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 27, 27, 27, 27, 27, 27, 27, 27,

27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 26, 26, 26,

26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 25, 25, 25, 25, 25, 25, 25, 25,

25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 24, 24, 24, 24, 24, 24, 24, 24,

24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 23, 23, 23, 23, 23,

23, 23, 23, 23, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 21, 21,

21, 21, 21, 21, 21, 20, 20, 20, 20, 20, 19, 19, 19, 19, 19, 19, 18, 18,

16, 16, 15, 14 ],

[ 125, 12, 44, 82, 43, 130, 223, 61, 23, 228, 69, 0, 62, 87, 187, 227, 220,

232, 165, 84, 108, 106, 161, 246, 233, 41, 197, 239, 36, 226, 214, 250,

26, 40, 55, 251, 147, 124, 168, 94, 157, 243, 153, 128, 42, 151, 58,

67, 18, 117, 178, 89, 200, 142, 22, 99, 70, 9, 215, 35, 144, 206, 3,

149, 120, 16, 118, 231, 7, 32, 81, 90, 93, 219, 73, 60, 98, 8, 131,

132, 34, 115, 159, 24, 135, 17, 249, 49, 91, 134, 245, 241, 247, 202,

95, 10, 158, 222, 20, 182, 216, 109, 74, 177, 31, 195, 196, 163, 25,

15, 48, 170, 143, 244, 204, 63, 65, 139, 185, 13, 242, 205, 186, 51,

66, 11, 52, 191, 6, 4, 59, 39, 47, 208, 136, 138, 193, 238, 92, 234,

21, 64, 254, 2, 85, 119, 156, 37, 175, 210, 188, 56, 181, 179, 27, 111,

71, 80, 107, 180, 173, 1, 167, 190, 104, 218, 209, 102, 199, 110, 225,

162, 148, 140, 78, 101, 192, 248, 123, 183, 169, 198, 145, 137, 14,

121, 203, 174, 240, 160, 235, 76, 86, 126, 77, 211, 28, 114, 113, 207,

38, 171, 166, 176, 88, 79, 103, 221, 172, 229, 252, 19, 133, 217, 33,

201, 213, 50, 212, 152, 230, 116, 154, 72, 30, 96, 155, 129, 146, 53,

45, 83, 105, 112, 184, 46, 141, 75, 54, 237, 224, 100, 97, 236, 29,

255, 122, 253, 57, 164, 189, 5, 150, 127, 194, 68 ] ]

Transformed frequency

[ [ 46, 45, 44, 44, 41, 41, 40, 40, 40, 39, 39, 39, 38, 38, 38, 38, 37, 37,

37, 37, 36, 36, 36, 36, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 35, 34,

34, 34, 34, 34, 34, 34, 34, 34, 34, 33, 33, 33, 33, 33, 33, 33, 33, 33,

33, 33, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32,

32, 32, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,

31, 30, 30, 30, 30, 30, 30, 30, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,

29, 29, 29, 29, 29, 29, 29, 29, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,

28, 28, 28, 28, 28, 28, 28, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,

27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 26, 26, 26, 26, 26, 26, 26, 26,

26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 25, 25, 25,

25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 24, 24, 24, 24, 24, 24, 24,

23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 22, 22, 22, 22, 22,

22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 21, 21, 21, 21, 21, 21, 21,

21, 20, 20, 20, 20, 20, 20, 20, 20, 19, 19, 19, 19, 19, 19, 19, 19, 19,

18, 18, 17, 15 ],

[ 218, 95, 32, 177, 134, 245, 89, 19, 156, 24, 49, 22, 233, 242, 38, 39, 6,

146, 92, 125, 217, 179, 182, 31, 74, 235, 231, 246, 13, 107, 168, 114,

214, 142, 251, 15, 238, 171, 169, 199, 86, 103, 127, 126, 5, 250, 120,

25, 83, 90, 196, 63, 77, 232, 18, 10, 147, 255, 42, 65, 239, 67, 101,

128, 221, 167, 75, 116, 9, 208, 72, 99, 183, 143, 197, 69, 121, 59, 87,

93, 44, 111, 139, 64, 51, 52, 223, 150, 164, 50, 84, 224, 202, 55, 190,

181, 187, 158, 28, 226, 47, 3, 195, 117, 201, 80, 138, 8, 110, 248, 37,

36, 56, 252, 104, 163, 14, 141, 27, 154, 155, 188, 153, 45, 100, 136,

4, 192, 249, 76, 157, 144, 166, 53, 228, 254, 203, 43, 29, 132, 211,

68, 26, 109, 58, 118, 61, 54, 130, 237, 12, 48, 222, 210, 70, 41, 1,

94, 73, 162, 213, 244, 60, 243, 198, 17, 151, 30, 33, 212, 21, 35, 129,

170, 180, 152, 88, 122, 105, 97, 189, 140, 66, 115, 172, 57, 91, 113,

247, 204, 102, 16, 40, 71, 191, 11, 207, 135, 227, 205, 159, 215, 240,

108, 230, 193, 241, 96, 176, 124, 194, 186, 145, 79, 112, 161, 178, 20,

234, 2, 78, 236, 209, 220, 23, 149, 225, 0, 165, 7, 123, 200, 175, 216,

62, 119, 219, 133, 82, 98, 184, 206, 185, 106, 173, 174, 160, 34, 131,

81, 137, 229, 46, 253, 148, 85 ] ]

Then calculate the LogInt of the frequency binomial.

The final result is:

Uncompressed size: 7245 x 8 = 57,960 bits

Original, compressed: 56,811 bits

Transformed, compressed: 56,804 bits

Some Header information is missing. The missing part is here

However you can find other resources here

http://www.facebook.com/pages/South-Side-Technologies/129860307070256

http://www.facebook.com/?sk=lf#!/pages/South-Side-Technologies/129860307070256

I have watched this challenge with keen interest for some time, though my interest is not data compression per se but Universal Intelligence. With a proper understanding of intelligence, recursive compression may not be all that trivial.

Apparently, through UI, I have several solutions\approaches that compress any random file, including the million digit file and other random files I have generated.

One approach I want to describe here, which results in a saving of 7 bits, is a basic version of several approaches I already have. This approach basically transforms a random file into one with a lot of redundancy.

For example

Take the first 7245 bytes of the million digit file and transform them by the following equation:

zv:= First 7245;zx:=Size(zv);

zh:=List([1..Length(zv)], x -> ((zv[x] + ((x) mod 23) ) mod 256) );
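For anyone who wants to experiment, the transform above is easy to restate in Python (a sketch; the original appears to use 1-based indexing, preserved here). Note it is a bijection, so it cannot change the information content of the file:

```python
# zh[x] = (zv[x] + (x mod 23)) mod 256, with x counted from 1 as in the post.
def transform(zv: bytes) -> bytes:
    return bytes((b + ((i + 1) % 23)) % 256 for i, b in enumerate(zv))

# Inverting is just subtraction, so the mapping is reversible byte-for-byte.
def untransform(zh: bytes) -> bytes:
    return bytes((b - ((i + 1) % 23)) % 256 for i, b in enumerate(zh))

data = bytes(range(256)) * 4
assert untransform(transform(data)) == data
```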


I'll take a rain-check on those topics Mark. Not that I already understand them enough, just not enough time for more right now.

I have some work ahead of me but I gave the MillionDigit Challenge some thought today and could work on Move To Front again.

That MtF is interesting in that it's cyclic.

I dig that.

So, another wild goose to chase here then before I get completely absorbed in what I must do.

Good Luck Challenge people!
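For reference, a minimal move-to-front transform, the technique mentioned above (a sketch, not Ernst's actual code):

```python
# Move-to-front: each byte is replaced by its current index in a
# self-organizing list, so runs of repeated symbols become runs of zeros.
def mtf_encode(data: bytes) -> list[int]:
    table = list(range(256))
    out = []
    for b in data:
        i = table.index(b)
        out.append(i)
        table.insert(0, table.pop(i))   # move the symbol to the front
    return out

def mtf_decode(codes: list[int]) -> bytes:
    table = list(range(256))
    out = bytearray()
    for i in codes:
        b = table.pop(i)
        out.append(b)
        table.insert(0, b)
    return bytes(out)

assert mtf_decode(mtf_encode(b"banana")) == b"banana"
assert mtf_encode(b"aaaa")[1:] == [0, 0, 0]   # repeats become zeros
```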

Hello fellow dreamers,

I accepted this challenge 3 months ago, and I'd like to thank you, Mark, for this challenge is very entertaining and fun!

I totally understand the limitations and how this challenge cannot be won, but somehow, for an illogical reason, I keep coming back and giving it another go (third attempt now). Probably Ernst has something to do with this, for he is a beacon of hope in these dark, uncharted territories. Haha.

Now I have a question I’d like to ask.

I have successfully completed an algorithm that will compress the random digits file. The only setback is the speed of this algorithm; it’s going to take around 1-2 months to compress the entire random digits file (boring!!!). (There is plenty of room for optimizations but it will also take time)

My question:

If I extract a small chunk of the random digits file, say 1K of data, compress it using my algorithm, and add it to a 7zip archive, and then compare it to the original 1K of digits also added to a 7zip archive, I get a significant size reduction on the compressed file (almost 50%). Now, can I generalize this size reduction to the rest of the random digits file? Or does entropy's magic only work on the whole file?!

Note: the compressed file is 100% decompressible into the random digits.

From skimming the above posts, I've found the answer to my question in Tim's reply to Ernst:

“dense data remains dense in all forms - whether its in other numerical bases, sets of arbitrary bit strings, or even arranged in different shapes (such as 2d grids, 3d grids, 4d grids, an Ulam Spiral, zigzags, etc).

More incredibly, all the derivative data remains dense. For example, if you took every 8th bit of information from file, you'd find that the resulting bitstring was just as dense and incompressible. If you took a random sample from it, you'd find that sampling just as dense.”.

If this is true then I've achieved partial success; I'll just have to wait 1-2 months to see the result of the entire million random digits file compressed. I am confident it will achieve 30-50% compression. Expect a submission soon, Mark :)
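The density claim quoted above is easy to spot-check empirically; a minimal sketch, using os.urandom as a stand-in for the million digit file and zlib as the yardstick:

```python
import os
import zlib

# Sample every 8th bit of a dense (random) byte stream and see whether a
# general-purpose compressor can shrink the sample. It can't: the sample
# is just as dense as the source.
data = os.urandom(100_000)
bits = ''.join(format(b, '08b') for b in data)
sample_bits = bits[::8]                                   # every 8th bit
sample = int(sample_bits, 2).to_bytes(len(sample_bits) // 8 + 1, 'big')

# zlib at maximum effort still can't beat the raw size of random input.
assert len(zlib.compress(sample, 9)) >= len(sample)
```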

Uhn, for all the numbers listed, the search for a transform takes more time! Well, if the parameters of the transform have to be attached in the header, there won't be much compression!

Well there, Kiwi-Coder, I agree it is fun.

It challenges me to think and imagine.

Back to my previous post about Move to Front.

I worked with variable length codes in matching tokens.

There isn't a single code that allows a reduction in output bit count for all input tokens, so I focused on assigning a single-bit output to the most frequent ones.

I worked the Million Digit file, and in almost all cases it expanded.

However, with two different encoding systems applied, I saw file size reductions of up to 50%, but that is after the Million Digit file had expanded due to re-encoding.

I never saw a size smaller than the original 415,241 bytes.

So there is an area of potential in what I just did. The drawback is that recoding expands this file. The possible reward is that once several re-coding efforts are made, compressible data might be created. I have not examined the outputs all that closely, so there might be a gem in the rubble for some sequence of processing.

Anyway, I just finished that hybrid move-to-front effort. I have two encoding schemes for it, and if there is interest I can share the ideas, for good or ill.
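For anyone following along, a plain textbook move-to-front transform (not Ernst's hybrid, which he hasn't published here) looks like this:

```python
def mtf_encode(data: bytes):
    """Move-to-front: recently seen bytes map to small indices."""
    table = list(range(256))
    out = []
    for b in data:
        i = table.index(b)
        out.append(i)
        table.insert(0, table.pop(i))  # move the byte to the front
    return out

def mtf_decode(indices):
    """Exact inverse: replay the same table updates."""
    table = list(range(256))
    out = bytearray()
    for i in indices:
        b = table.pop(i)
        out.append(b)
        table.insert(0, b)
    return bytes(out)

data = b"aaabbbaaacccaaa"
enc = mtf_encode(data)
assert mtf_decode(enc) == data       # reversible
print(enc[:6])                       # [97, 0, 0, 98, 0, 0]
```

After MTF, runs of repeated bytes become runs of zeros, which is what makes a variable-length code with a one-bit symbol for 0 attractive; it is also why MTF gains nothing on dense data, where small indices come out no more frequent than large ones.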

Other News:

I have recently found a paper by a respected man who discovered half of an encoding system I had discovered for myself in Jan. 2010.

The paper has an associated proof by two other fellows.

I wrote that man and hope he will write back.

Having part of the truth in print already will make my job easier in explaining the encoding universe as it actually is.

So I can't say spending time on Million Digit is a waste of time.

I just hope I am treated fairly as I learn the "paper writing ropes."

So, heh... I'm able to publish. Who would have thunk it 8 years ago? Yeah, I have been hacking about for 8 years on this stuff.

Oh, and Kiwi, on dense data: I have recoded this Million Digit file hundreds of times, with dozens of encoding systems I made myself.

There are resulting files that do compress some, but never to smaller than the source's 415,241 bytes.

I believe the way to compress this is either to find a system that generates the pattern this file is, or to transform the data into something that can be manipulated into a smaller file.

Obviously all the standard encoding and compressing techniques fail on the million digit file and files like it.

I do believe that information is separate from form. This would be as we expect a black hole to behave: energy continues into the gravity well, but the information is smeared on the surface.

I also believe we don't know everything.

Okay. Wish me luck on the paper and I'm open to sharing my MTF efforts.

Update:

I spoke with that accomplished fellow, who has retired, and I feel great. He and I share a union of cyclic arrays, and we have different systems that generate that union. It's nice to rub elbows (in a way) with the accomplished. He sure is a great guy!

I'll be working on recoding the Million Digit file for the winter. It's dark and cold out so inside is a good thing.

I think I will see what transformations provide.

I worked 96 bytes, by hand, and managed to align a row of 6 where the same bit was reset in all bytes.

Another alignment left only 14 set bits out of 48, where the original was about 50/50, or about 24/24.

I was thinking that I might try a few combinations and see if some quality can be exposed such as biasing the file towards reset bits or something.

There is overhead to the effort but maybe I'll get lucky and it will compress enough.

If nothing else I get the experience of writing the software.

Well friends, I am far from out of ideas so don't think that creativity and imagination are false tools. They are your friends.

Good Luck Challenge people.

Ernst

Update:

I decided last night to work with recursive transforming (reversible, naturally) of the Million Digit file.

This means that I am creating new files of the same size, and eventually it will cycle back to the Million Digit file. Think of it as a cyclic code.

I am looking for qualities which make size reduction simple.

I could add zlib and test each file, but I am green on using zlib, so I'll have to do some research.

I will look things over here and once I am satisfied I will be running the quad core with 4 instances of variations of Million digit.

The only way to speed up this search process is to have the computer cycle in both directions and meet itself in the middle some place but I need to learn about parallel processing.

Well if any of you are interested in a different file than million digit I have billions here. Let me know. I'll get some out there.

I was thinking about filling up a 3 TB drive just for fun.

Well lets see how long it takes to find a diamond among the coal.

This is the only way I see compressing Million Digits working: using a surrogate file derived from the Million Digit file.

I see I am over 15,000 files at this point..

Time to examine a few.

I am saving files mod 1024

Just think that for some cycle there could be an mp3 that no one ever recorded. Perhaps a picture or a book never written..

Cool stuff..

Update:

Heading on to 4 million files generated as of this writing..

There are billions of files that will satisfy so it's just a matter of processor time.

This is a crude effort at this point so I can improve on this effort.

I'll post more when I have something solid.

Good luck Challenge people!

Ernst

What you are describing here is normally referred to as Computational Thaumaturgy, or "Real Sorcery" or "Computational Sorcery"; the African word for it is "kamutii".

"Just think that for some cycle there could be an mp3 that no one ever recorded. Perhaps a picture or a book never written..

Cool stuff.."

But it is good; you may continue!!!

@mpith

I had never read that word before.

That is a nice thing to say mpith.

Merry Christmas!

Soon it will be New Years Day again!

Happy New Year to you all.

My Update is honest and here it is..

Since I am actually working this challenge I trust I am seen in that light.

I didn't see an easy advantage in cycling the file, but since it seemed relative at each instance, I will consider that there is structure.

If that is true then as long as I do and undo the same maths some interesting possibilities exist.

However, time is always the limiting factor in life and I will wait on that for a while. Push it on the Stack.

Currently I am writing my first hash function.

I am accepting advice and links for study.

This function is in the coding stage, so testing is a ways off, and that will happen in real time on the Million Digit file. The hash may fail at some point, but hopefully this design is flexible enough to offer alternate codes to get around collisions in any one instance. Collisions, I understand, are the problem with hash codes.

Now if this hash function can hold up, at least for the Million Digit file, compression is possible.

I can only guess, for this hash function design, at the actual output range for a given input range at this point, but I hope that for 23 bits I'll have eighteen million codes of utility or so.

Obviously, I will be learning as I go.

So Update: I have a Candidate for 2012 now and will be working in earnest.

No promises or claims of achievement here, since this is a learning experience at every turn, and in that it is its own reward; but this is the first time I have sizeable compression and am trying to keep it.

This is a different point of view for me.

If nothing else, my code writing has taken a turn for the serious and efficient. Reminds me of schoolwork and making the "grade."

Good Luck Challenge people!

Ernst

My bad.. I don't know why I typed 18 million there.. I wasn't thinking :)

LOL, 23 bits is at most 8,388,607 (2^23 - 1) :) I have 18 on the brain.

Again Happy New Year!

Very interesting thread, and a great challenge. I've been puzzling over the problem of compressing 'random' data for quite a while, and I may have a good approach for this.

Does the $100 challenge still stand? I'd like to try and see how well my method works for your Million Random Digits file, and hopefully submit my solution soon!

It most certainly still stands! Good luck.

-Mark

Once one understands the limits of math and why the limits occur then the "problem" is easily solvable.

Think of it in these terms: a man cannot walk through fire, due to the heat and his skin. Understanding the issue helps address the need for a fire suit.

The same concept applies here: clearly, using an environment that is reliant on math will not allow one to achieve the desired results.

Another approach is required, and I have found it; however, it is CPU intensive, and there is no real-world application that can make use of it aside from encryption of small data.

@ Timo Makinen hey Timo! Sounds really interesting.

I think we all agree that our current tools are what are really being tested with this challenge.

@ all: Any suggestions on Hash functions to read up on?

Now is the time. I am about to do the code with the design I have been working on but, external examples are welcome.

I am interested in systems that offer alternate codes for individual elements.

I expect to have multiple options and hopefully no more need than 1 out of 4 but I am designing for 1 out of 16.

It is easier to write one function once than to redesign it several times.

Other than that .. Week Three for me on this one..

I'm writing a full program this time win or lose on this approach. That means all the mundane details have to be ironed out.

I usually prefer quick-hack style proofs of ideas but this one is a win or lose kind of effort.

I must have faith at this point. I can believe later.

Anyway.. I thought to re-post that I am in for 2012's Million Digit Challenge action.

If I do manage a bit of magic.. Mark I take it that the 100 dollar bill will be new and unfolded? It has to look good in a picture frame :) lol

Good luck Challenge people!

I am now good on the Hash functioning. I get it.

Thanks for the read. I must get to work.

Hey Guys,

I am working on a Codec and part of it is a hash code. I am learning about these but I think I am doing it right. Please comment.

I have the need to identify a class of string. It doesn't have to be an exact choice, so a general code, say 0 to 1023 (10 bits), seems okay.

Where I am with this now is being happy with a distribution (on the Million Digit file) of reasonably equal use of all 1024 values.

The spread ranges from a low of 29 instances of one 10-bit value to a high of 80 instances of another, and it looks fairly distributed.

What I am asking is how are these things gauged?

I have not taken Statistics at the J.C. yet.

Having nearly equal use of all codes is the goal yes?

And the input file is Million Digits file.

Your answer need not be specific or directly relate to my work.

I assume a good hash code is uniformly distributed for a set of uniformly different strings.

Am I stating that correctly?

>I assume a good hash code is uniformly distributed for a set of uniformly different strings.

I'm no hashing expert, but a good hash function generally should be generating something that looks like a uniform distribution for random input. Ideally, for any input.

But every hash function will have some weakness in which a certain series of inputs will cluster in a non-uniform manner. It's just that in a good hash function, those series of inputs are not ones that you would expect to encounter in typical data.

So maybe your statement is not mathematically precise, but I think it is pretty close to the core idea of what a good hash function does.

- Mark
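To put a number on "reasonably equal use," the usual gauge is a chi-squared test against the uniform distribution: with 1024 buckets, the statistic should land near 1023 (the degrees of freedom) if the hash is behaving. A sketch using simulated bucket counts (not Ernst's actual data):

```python
import random
from collections import Counter

def chi_squared_uniform(counts, buckets):
    """Chi-squared statistic of observed bucket counts vs. a uniform spread."""
    n = sum(counts.values())
    expected = n / buckets
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(buckets))

# Simulate a well-behaved 10-bit hash: 50,000 keys spread over 1024 buckets.
random.seed(1)
counts = Counter(random.randrange(1024) for _ in range(50_000))

stat = chi_squared_uniform(counts, 1024)
# For a uniform hash the statistic lands near the degrees of freedom (1023);
# a value far above that indicates clustering (a poor hash for this input).
print(round(stat))
```

Comparing the statistic against a chi-squared table for 1023 degrees of freedom then gives a p-value, which is the formal version of "looks fairly distributed."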

Cool; Thanks Mark.

I was a bit surprised with this one after a few designs didn't look good to me.

I'll have a better idea on the true utility once I try to decode I am sure.

Good Luck Challenge people.

Update:

Okay.. I am a bit surprised just how hard this is to do. :)

I have a codec, and I calculate that, to decode, it must choose one out of over 2 million potential strings.

Wow, that is a needle in a proverbial haystack. You see what I am up against.

I have choices of which data to sample and how I sample so when I have a clear idea of how this may fail it should help me refine this approach.

The miracle side of this decode effort is that it might work.

I'm now working on the decode aspect. Hash codes are interesting and my hope is that this will decode.

Okay then, I'll be a couple of weeks or so, I am sure. I assume there are people reading your site and this challenge blog, Mark, so this is an offering of content as well as a space where I blog about my experiences with your challenge.

What good is our Challenge Blog without a candidate once in a while?

I'm over-due presenting at least a hope of winning so here it is.

Good Luck Challenge people!

Hey,

I am a fan of integer sequences. I have an account on a database because I love it.

So here is an interesting sequence. I'm looking at the "bigger world" through the eyes of the above-mentioned hash.

Three dots before means the preceding sequence is assumed; three dots after means the following sequence is assumed. You know I cannot post the whole thing, so enjoy a snippet.

... 720,8,0,1,2,3,13,727,1021,1022,1023,0,19,733,21 ...

Interesting.

Hey Guys,

I have a question. Has Million Digit been factored?

If so is the data available?

Update:

I'm sidetracked again.

Three days ago I discovered Recursive Modulus, and I have been exploring it to find factors of numbers.

I started three copies of a simple program I wrote to find factors. Each one is running the RSA 2048 data.

I'm guessing I could be waiting years or minutes more for a factor to show up.

Ain't it something? I get so many projects stacked up I feel like I need a staff.

Well, Sleepy time now that that program is running. There is a thread in the Comp.compression group where I explain RM a bit.

It looks like the world has one more way to find factors for numbers.

I am interested in knowing if this Recursive Modulus was already known or if I discovered it.

Either way, the RSA 2048 race is on. Let's hope I get lucky.

Okay.. I need to calm down and get back to my data compression effort once again.

I was reworking the Codec when I found the RM so I guess that answers that question I had.

Still, it will be really hard to get 1 selection out of 2 million+ possible strings.

I was worried when I was looking at the modulus stuff and I found a partial answer to my worry but, I think it could go either way.

Now that is okay as long as I can make it work for the Million Digit file. That is a possibility.

But, there is no telling for a while more.

I hope I have time for all this stuff.

If no one beats me to my own prize since I have been giving it away, I will factor this file soon.

I hope to create a codec of factors and powers. I hope that will satisfy this challenge.

God Speed my friends...

Ernst

I'm asking again: has anyone factored the Million Digit file?

I have a factor that is 175 pages of 10-point font in OpenOffice.

If no one has claimed that factoring I now do.

The factor ends in "36579116120702356361444362876721627845529455567385497"

So I have three factors so far at last check.

Alright then . Finally some headway on Million Digit file.

How exciting.

Well I found a bug in my program on factoring. I am exploring factoring since I have never messed with it to any significant degree before.

I wasn't checking for the end condition of N being 1 after a factor is removed. It's reasonable to find these errors after checking things out a while.

So I fixed that. Also, I was printing the current N before the factor was removed, and so, seeing a large value for N and no stop at N=1, I didn't see that there are only three factors.

Now I am using GMP's function to check whether this nearly-a-million-digit number is composite, probably prime, or prime, as those are the three results that function will return.

My question is, having never used this function before, let alone on a nearly-a-million-digit number: how long does the check take on a number that size?

Dang.. feels like it's spinning wheels but I don't know.

Anyone?

I am also running a search using recursive modulus. I started the RM program I have just a minute ago.

Who will exit first?

This factoring is interesting and I am happy I figured out what the recursive modulus was doing and I have code to work with if need be.

I'm still hoping to find a reference to this method, so if anyone has seen the recursive modulus method before, let me know. I'm sure it will make for interesting reading.

That's enough for me on Recursive Modulus. It works the way it works. Does what it does and has interesting mechanics.

I have it in my tool box. It is another aspect to consider.

I hope others will share their discoveries like I did. Makes for an interesting day.

Now back to work on what I was doing when I saw RM.

Good Luck Challenge people.

Commentary on the Million Digit challenge and the stigma it has accrued in Comp.Compression

When I learned of the Collatz Conjecture, I was in a class at the junior college in 1991, taking introduction to programming using the newest programming language, C; Pascal had just been replaced by C for the class.

I fell deep into the magic of all things that cycle, like the famous 3x+1, x/2 Collatz Conjecture.

As time went on, I discovered many things through my private efforts, like the fact that there are infinite systems of the form 3x+y where y is odd.

I also saw the series of systems where Collatz fits in. I can understand all the systems, and I know, without proof, that there are some strange dynamics in play in the true nature of whatever this reality is, just by extending that series to the left and realizing that it can go negative. Negative zero? How strange indeed. It could be true for some point of reality, but I simply keep the idea open in my mind. I don't know exactly, and it's okay for me to keep it open.

I consider myself a successful man when it comes to my programming interests. I have more ideas than I have days, I am sure. I am a friend and a fan of the sciences, and I have a wide range of interests. Naturally, as a hobby programmer, I have not specialized in any one discipline, but I do really well with data encoding these days.

Moving forward to 2012.

My interests have found a focus with the problem to solve of the Million Digit Challenge. I have been working on and around the file as a data set for many years.

As a data set I understand it is trustworthy. As a trustworthy thing I measure other things against it and it is a good measure to work with.

So why is there such bias and outright aggression towards the people working the Challenge?

Here is what I understand about the state of the art of data compression.

Data compression as defined by Shannon is statistical in nature; by his own words, there was a fork in the early road, and Shannon led us in one direction over the other, and modern data compression is still travelling that road.

Along the way, many proofs and truths have been confirmed that relate to and define many boundaries. That is all good. A wall is a wall, after all.

Yet all of these things fail to compress all files with the tools that have been developed, and most of the advancements in the tools we use come in small increments, many in specialized efforts.

Indeed there are many good craftsmen out there who have tools and skills to keep those advancements going and that is good. More power to ya friends!

For myself I am not in the business. I am simply an interested man who finds this Challenge a great tool and focus for my interests.

The data set is trustworthy and the results of experiments are valid using it by the very statements of the data compression professionals of all times.

The bottom line is I really don't need this file to compress because in parallel I do my own thing in private. I have more than one iron in the fire and it is my private business what discoveries I may be hiding from the rest of you.

So, why is it such a moral data-compression sin to work this Challenge, and why do smart people in comp.compression feel the need to interfere and downright discriminate?

Rest assured, if I were not generating successes from the path I am following, I would not be right here right now. I would be doing a different thing on a different path somewhere else, but I am having successes. Private successes that keep me moving in the direction of finding that information I seek. Me, personally! The answers I want. I am not some data-compression Jesus to hang on the data compression cross.

My latest experience, being associated with the Million Digit Challenge, was to report the effect of Recursive Modulus as a factoring primitive. I assumed it was old news I had rediscovered, yet no one seemed to know about it.

So, it turns out that I had seen factors in the stream while designing a hash code, and that the method can be considered a primitive recursive function of some class. It is possible I am the first to report it, but I am fine with just knowing who first wrote it down.

So there is a private success anyone can see just as I assure you I have more private successes that are worth reporting.

Yet what does it hurt to look for new ways and new understandings? Is it because someone may have a disability? Like Newton or Cantor?

Conclusion :

There is nothing wrong with innovation and for every success there are untold numbers of failures in human history.

When it comes to data compression our understanding is incomplete since our understanding of what reality is, is incomplete.

The moral of the story is every one dies and no one knows everything but it is often the fringe people that advance innovation like Newton or Cantor and many others.

So let people dream you data compression Thought Nazis'

Let those on the fringe take a shot at fringe things, and stop the authoritarian tirade that wrested control of the comp.compression forum from them over the years.

If we don't know everything then we can use all the help we can get even from fringe projects like compressing random data.

If someone wants to invest part of their life trying, and you simply are not a mind reader, don't stand there telling them they are idiots and then prey on every idea they have.

I am now taking the name-calling and harassment personally, after 10 years of effort at understanding the nature of information (come June 2012), and after focusing on the million digit data set as a development tool since around 2003-2004, I believe.

So learn to be a friend comp.compression because you have the opposite part down pat.

However, if someone can prove to me that they will never die and that they know everything I will apologize sincerely.

Other than that, expect me to get as ugly as possible to protest the labels, while comp.compression keeps closing threads to cover up the mess it makes by being a bigot.

In short, let the fringe people give things a try and stop kicking them in the teeth for sharing news and interacting for friendship especially the need for social friendship part.

It is simply amazing that such a noble challenge is turned into a psychotic hustle in that forum. What the hell are they afraid of?

Thank you Mark Nelson for a forum and a focus.

Ernst

Thanks everyone.

I am currently "down for repairs" with my desktop, and the new machine is awaiting parts, but once I have access to the hard drives again I can share that large factor. I take it a simple export in GMP is what is required. I'll have to ask Mark if that is how he generated Million Digit; I think that is the case. On that note, maybe I should give a copy to Mark as well.

If anyone is interested in seeing whether we are trying to compress a prime number, I have no problem sharing some technology and the number I believe is a factor.

The old desktop needs an older quad core, and I have to find someone who has the older chip, but when all is done here that quad core will be available to run factoring efforts.

I am open to working with others in determining whether we are trying to compress a prime number or not. Maybe we can conjecture that no prime number is compressible?

If nothing else is used I know recursive modulus will do the search for factors but I am open to suggestions.

What do people think of GMP's prime number test?

With recursive modulus as I employed it, it will search from the number minus one down to one, so it will cover the ground, albeit mechanically.

With recursive modulus we can divide the search into sections and share the work, such that each of us gets a block to process and we don't overlap.

Other than that, I will be slowing down on the number of hours I have for the computer with the coming of spring. I am still hopeful I will find a full-time job with benefits, and it looks surprisingly hopeful with the recent uptick of hiring across the nation, and especially locally.

Alright then, I hope to check out that factor candidate. At this point I conjecture there are only three factors, but that could change if this nearly-a-million-digit number is composite.

And again if there is interest in working the number with me just let me know. It would be fun to work with others for a change.

Ernst

As of late, many people have had an issue with, or a perception about, Ernst the Internet/Usenet guy. I cannot worry about how any one person feels, and I will practice a punch back for a jab at me from here on out; however, I mostly wish for peaceful interactions.

Update.

As it often is for me I go sideways as I explore things.

Recursive modulus may well be a primitive recursive function of some class. I see conflicting classifications, but then again, perhaps there is a bigger picture. My best guess so far is a primitive recursive function.

I am open to other suggestions.

As a result of blurting out recursive modulus, rather than quietly absorbing a new thing as I usually do, I have a speed-race to re-evaluate other efforts. You see, each piece of the puzzle understood helps paint the picture of what is really true, for me.

Anyway.. I must divert a second time. The effort I have posted as being my effort is being pushed on the stack until I have an answer about the utility of Pi.

Don't expect me to debate the philosophy of transcendentals; I simply am interested in the factual patterns of Pi and other transcendentals.

So friends, I am delayed again but for good reason. No sense in designing a codec that takes a long time to decode if there is a faster way. That is my concession here. I simply have to look some more before I proceed.

Ernst

Update guys.

I have to review a few things before I will know what the forward direction is here.

I have a couple of parallel threads to process so to speak.

So I will be working but not so active here or comp.compression.

I'm studying.

Well, Spring is happening here.

I've settled on pattern matching, using the tools I have on hand, as my focus for now.

I faced several important programming topics lately and any of those directions is worthy of effort but I have to pick a focus I feel confident will be a reasonable project for me.

It's rather simple to generate a large data set from small sized code.

For example, I have a 10 KB program here that I am working on that generates a 2-million-bit dataset rather fast.

The real challenge then is writing the pattern matching code that will expose the results of matching to the dataset.

I expect to have to spend some bits and to save some bits. If this follows other experiments I have done that can be considered related, then it is an open question whether the net result will be smaller than the Million Digit file by one bit plus the decoder size.

Then again one thing builds on others and this may be another tool in the tool-set after I am done.

It looks like all I need is about 15 KB of reduction of the Million Digit file. That is my guess as to the required amount; from my current point of view it is 3 KB more than I assume the final decoder size will be, just to be safe.

Well, pass or fail I am still interested in this aspect of data compression and I am still working the challenge.

Pattern matching looks to be an interesting exploration to code for.

Any one else? Drop a line here and say hello.

Good Luck Friends!

Heh!

Speaking of RSA-style numbers: Million Digit's byte count is the product of two similarly sized primes, just like an RSA modulus would be.

617 and 673 is 415241

It has other interesting properties but since we are all interested in prime pairs such as;

Hint: Well try recursive modulus.

Hell, I am just puttering around. The Pattern Matching is less than rewarding for me today.. Bummer..

Anyway.. there is some trivia..

Ernst, here are some useful comments for your work on the challenge.

>> 617 and 673 is 415241

617 and 673 is 1290, not 415241

>> ... but since we are all interested in prime pairs ...

log(a * b) = log(a) + log(b)

Let million-digit-file = x * y, then

log(x) + log(y) = log(x * y) = log(mdf) =~ 3,320,000 bits (taking logs base 2)

>> Hint: Well try recursive modulus.

My hint would be: Well try math!
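The log identity above is the whole argument: writing a number as two factors costs essentially as many bits as the number itself, so factoring cannot compress the Million Digit file. A quick check with the small primes quoted above (617 and 673):

```python
# The bit-length of a product is within one bit of the sum of the
# factors' bit-lengths, so listing the factors saves (almost) nothing.
x, y = 617, 673
n = x * y                      # 415,241 -- the byte count of the file
total = x.bit_length() + y.bit_length()
assert n.bit_length() in (total, total - 1)
print(x.bit_length(), y.bit_length(), n.bit_length())  # 10 10 19
```

The same arithmetic applied to the file read as a ~3,320,000-bit number says its factors need ~3,320,000 bits in total.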

I'm curious why multiple files are allowed, I think that's a loophole.

From what I read, storing data in the filenames is not allowed. I read that someone "beat" a $5,000 compression challenge that way before, so I can see what's being avoided. (Notice "beat" in quotes.)

The only loophole I see is this: would enumerating files be considered storing data in them? I mean something like: 001.file, 002.file, 003.file, 004.file etc.

Because the sheer presence of these files provides information that otherwise does not exist. For example, if a certain bit pattern that repeats in the original data is found, you could take that bit pattern out. Normally, you'd have to store the locations you took the pattern out of, so it can be re-inserted, and this would quickly add up to more info than you saved.

But if, instead, you split the file each time you encountered this bit pattern and stored the pieces of the original file as enumerated sequential files, you wouldn't have to store the information about where to re-insert the bit pattern. The concatenation would simply be 001.file + bit pattern + 002.file + bit pattern + ... etc.

The question is only to find such a bit pattern. This isn't difficult:

The file is 415,241 bytes. A single byte can have 256 separate states (meaning there are 256 unique bytes)

By the pigeonhole principle, then, just under 415,000 of those bytes have to be repeats of a byte value that's already been seen.

In one extreme, each of the 256 byte values would be repeated about 1,600 times; in the other extreme, one of these values would be repeated the whole 415,241 times.

Since the data wouldn't be very random if only one byte value repeated, the case here is probably close to the first one, where the most any single byte value is repeated is ~1,600 times.

So, we find this most-often repeating byte, and split the data each time we encounter it, storing each new chunk in an enumerated file.

The "decompressor" should simply store the value of this byte, and then concatenate these files together, each time inserting this byte between two files.

So, basically, just by having enumerated files, we should save about 1.6 KB (in the worst-case scenario). The source code for the concatenation program could potentially be written in under 1.6 KB; it is rather simple.

This process can also be repeated with nibbles (4 bits) and 2-bit patterns that repeat; there should be even more of those repeating.

But even if this works, it is NOT quite in the spirit of the challenge, and isn't compressing the data so much as hiding the information elsewhere.

To be clear, I am indeed a programmer by profession, but I have NOT programmed this. I found this article today at work and the idea occurred to me when I read some of the comments later.

I am NOT CLAIMING to be able to compress this (or any other) random data - only that there's a potential loophole in the rules (if Mark allows enumerated files). I am also well aware of the quite solid proof of the counting argument of why no method can compress all files.

If I do have some free time, I'll try to write a program to run this sort of data-hiding scheme, to see if the result would really end up being smaller.

@Milcho:

One of the reasons that I have the rather vague comment at the end is because I don't want to get in a never-ending contest with people who find clever ways to hide their extra data. I can't anticipate every possible way to do this in advance, and I don't really have any interest in doing so.

For example, let's say I limit things to one file. Well, someone will come along and create a directory structure with 400,000 entries, and just one file.

Then I say no, it all has to go in a single directory. Then they create a file name with 400,000 characters in the name.

Really, who cares? This is not interesting and I don't want to deal with it.

And so far, this has worked okay, I haven't really had anyone suggest that they have met the challenge while staying within the rules.

- Mark

@Mark

It makes sense, the rules of the game are to actually compress the data, not shuffle it around to unseen places. It took me all of one hour to write the code of what I posted above, and that's just the simplest way to use the filesystem to hide data.

It's amazing how recent some of these claims of incredible compression are, and that there are still people who buy into it. Oh well, they may be wasting their time, but it's at least slightly amusing to read about.

Hey Milcho!

I see you found the challenge interesting!

cool!

But hiding data is not in the spirit of the challenge for sure!

Give it some time and maybe you will find a unique solution!

Update :

I'm working on bit-pattern matching using a constructed data-set. Also I am learning about OpenMP and parallel processing along with this effort.

So win or lose there are benefits for trying.

Ernst

on April 11th, 2012 at 8:39 am, Tobbi said:

Ernst, here are some useful comments for your work on the challenge

>> 617 and 673 is 415241

617 and 673 is 1290, not 415241

>> ... but since we are all interested in prime pairs ...

log(a * b) = log(a) + log(b)

Let million-digit-file = x * y, then

log(x) + log(y) = log(x * y) = log(mdf) =~ 3320000

>> Hint: Well try recursive modulus.

My hint would be: Well try math!

Pardon, Tobbi: those are the factors of 415241, as in 617 times 673 = 415241.

I can understand your confusion and I promise to not try and trick you again.

Be Well!
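For what it's worth, the factorization above is easy to verify; trial division is plenty at this size (a quick sketch, not part of anyone's compressor):

```python
# Verify 415,241 = 617 * 673 and that both factors are prime.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

assert 617 * 673 == 415241
assert is_prime(617) and is_prime(673)
print("415241 = 617 * 673, both factors prime")
```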

The prize amount encourages people to try this. In a way, the challenge creator also seems doubtful about the impossibility of this work. Hence he seems to have not taken much risk ;)

Hey Ashish,

As far as I know it started as a way to silence those claiming fantastic algorithms that do the "Impossible."

In its long life it has been a point of humor for some, a source of ridicule for others.

I may be the only one still working on it. I seem to be the only one updating the blog somewhat regularly.

Speaking of that.

UPDATE:

I have been successful in generating a 16-gigabyte dataset from a 10kbyte program.

I studied the dataset for a few weeks and have decided on the initial dimensions and layout of this Matrix.

I am currently indexing the Million Digit file. That will take some time the way I am doing it so I have time to ready other aspects of the encoder.

A few posts back I wrote about learning and designing hash functions. I also wrote about being sidetracked with new maths for me to check out.

Well it has come full circle and that effort with the hash functions is again the focus. Good thing I did all that work already. This hashing effort will be more fun than work I believe.

If I can hash some of the indexing then compression is possible.

Wishful thinking suggests storing a byte in 6.5 bits.

What is really true is until I have the indexed data I cannot test any hashing.

It won't be long now but still it will take a while.

I thought to stop in and add to the blog.

Good luck Challenge people!

hey guys,

it is done :) tested and it works great; unfortunately, commercialization is the only practical option here. I'm not posting this to gloat, rather to tell you guys that the answer is out there staring you in the face; you just need to apply more imagination than math and think outside the box.

Thanks and Good Luck to all

I have a very good system for providing an initial test of your claims. It will not compromise your work in any way, and will give some indication that you are doing what you claim.

Oddly, every time I offer this to someone who claims to have solved the problem, they decline!

- Mark

@marknelson

How do I sell my working program? And how much is it worth?

And also, please tell me more about your "very good system" for testing.

And how trustworthy are you?

Good day

thanks.

@Pedros:

Your program may or may not have some value.

To start testing your program, we need the following:

1) A Linux system built around a standard distribution, say Ubuntu or CentOS.

2) A program that, when run on that system, will create the Million Random Digit file. The program plus any required data files should be smaller than the million random digit file.

When you have those two conditions in place, we can talk.

- Mark

Well, I am working on a "Look-Up Matrix" here.

So far Million Digit is showing me it's still a pain in the ass. I am serious.. A new dimension of "random" is being displayed here as the program runs. Go figure.. Is this file from Space Aliens?

Without qualifying this statement "I am being very clever in my encoding method." Still it defies exploitation.

I am trying to find some sort of indexing, re-ordering of matrix or any other notation and still it defies "compression."

So if anyone has a solution we all will be very impressed.

As for money? Hell.. simply keep the chain of custody of the program limited to yourself and then make the best deal you can.

If I do get a solution I don't intend to simply email it off to anyone that wants a copy. I too would like to see some capital come my way. Yet it's a game of exploitation so at some point we all sell out.

First show the application, second join with some capital group and be ready to handle all the crap like buy-outs, bankruptcies and other lesser legal language that has inspired good folks to kill themselves.

I feel that whatever the solution is it will require dedicated hardware.

Well, an exhaustive search is happening here and I am simply dumbfounded trying to find an example and definition of "random" so far within this effort. Perhaps "random data" is far more structured than we know, but not in the way we hope in order to win that $100.

Live and Learn.

Hey,

I apparently (in logic terms), in principle, may have actually found a solution to this "challenge"....

The (if it is valid)(I think it will prove correct!) (theoretical) SOLUTION is (at least seems to be possibly describable as being) : COOL AS

but... TOO valuable to say here (I guess)

but a clue:

it also happens to deliver the dream of the world's theoretical physicists and others:

a "theory of EVERY THING".

I have a BEAUTIFUL math-like version of this

it's STUNNING

(Just my opinion)

OH yes, so-called "random data" is VERY "structured" (apparently)

Guess what? The "Higgs boson" IS "structure", "quantum mechanics" is .......................

Who wants to form an intellectual property licensing company....????????

@ Alan

Nice to read.. Do you really think anyone can chip off a piece of space with a collider? I'm not holding my breath.

Put an application together and claim the prize!

My Update is : Still running an exhaustive search here on a constructed dataset. It has 6 days left at the current rate.

Other than that, I am working on ideas for indexing and hashing. No breakthrough here yet, but it's yet another empirical experience.

Good Luck Challenge people!

Ernst,

I too am finding this data compression stuff fascinating and quite the challenge, and think I'll work on it for a while. At the moment, I'm working on numerical distribution algorithms... Basically, without delimiters, and based on the distribution of each of the digits, we can drop the original file down to just 2,598,851 bits or 324,857 bytes. That is as compact as the file could possibly get as a single string of bits, but you can't unpack it.

So, what I am working on is taking chunks and writing the numerical distribution of each of the digits within each chunk. This adds about 20,000 bytes of information, but if it works the way it is supposed to, we should be able to unpack it.

Hmmm.... calculation mistake. It's either 2,598,730 bits or 2,598,731 bits. The original data file from the RAND website shows a deviation in the numerical totals of the digits compared to the page: http://www.rand.org/pubs/monograph_reports/MR1418/index2.html

The file shows 99803 instances of 0 and 100640 instances of 2, where the site reports 99802 instances of 0 and 100641 instances of 2. Regardless of this deviation, my original calculation was wrong, and the minimum the file can be as a single string of bits is 324,842 bytes.
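For what it's worth, the "single string of bits" figure above follows from writing each decimal digit in its minimal binary form (0 -> "0", 2 -> "10", ..., 9 -> "1001") with no delimiters. A sketch with perfectly uniform digit counts (a simplifying assumption for illustration; the real RAND tallies differ slightly, which is where 2,598,730 rather than 2,600,000 comes from):

```python
# Minimal-binary digit lengths: bin(d) without the '0b' prefix.
bit_len = [len(bin(d)) - 2 for d in range(10)]  # [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]

counts = [100_000] * 10        # hypothetical uniform tallies of each digit
total_bits = sum(c * bit_len[d] for d, c in enumerate(counts))
print(total_bits)              # 2600000 -- i.e. 2.6 bits per digit on average
```

As noted above, these codes aren't prefix-free, which is exactly why the resulting string can't be unpacked without extra information.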

Ernst, what can I say?

I am looking for someone to find me a buyer i.e. a licensee to use (or a sponsor to publish) what I have apparently found (I could possibly pay them a huge sales commission if it all worked out).

I guess I could form a company and sell shares (it is very important that investors know what risk the investment has- these apparent discoveries have not been verified by industry people yet). A computer engineer and computer programming expert would likely need to be hired to translate the apparent theoretical finding into a practical product.

The potential commercial value is such that it is not practical for me to claim this prize (or even to chase the Clay Institute Millennium prize for the "P versus NP" 'problem' (I need income now and I'm not sure mathematicians will be (at least at first) too keen on my technique))

I would like to find a way to share these apparent discoveries while earning something

"As Matt Mahoney has pointed out, if you find a general purpose way to recompress compressed data, you would actually be eligible for at least one million dollar math challenge."

Oh? Curious, how so?

I believe Matt has shown that so called infinite compression would require that P=NP.

- Mark

What if infinite regressive compression was a proven and simple "one way function"?

For example, say we begin with the digits 0123456789 and write each digit in its minimal binary form, giving a single string of bits: 01101110010111011110001001

Then we take that string of bits and re-write them as a string of bytes: 147,93,226,1

Then take these new digits and write them as a string of bits and so on:

147932261 = 110011110011110101101

110011110011110101101 = 207,61,13

2076113 = 1000111011011101

1000111011011101 = 142,220

142220 = 11001010100

11001010100 = 202,4

2024 = 10010100

10010100 = 148

...

The limit here is two bits, 10, and it would be absolutely impossible to get my data back. But, the function is pure in that it is simply a regressive conversion routine.

However, a sub-routine can be written that stores nothing but the numerical distribution of each of the digits, and then it becomes possible, but extremely unlikely, to get the data back correctly. For example, if at some point we know that the string of bits 10010100 contains two 2's, a 0 and a 4, then there are two solutions that we can derive from this string:

100 10 10 0 = 4220

10 0 10 100 = 2024

If this string occurs within the data several times, then I have to choose which one each time and I'm up a creek. When you get down to the recursion limit of 1 or 2 bits from the original string of data, there's nothing you can do, because there's no longer enough data to do any kind of reversal. At this point, you'd just be re-creating the data with some random technique.
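The ambiguity can be demonstrated directly: enumerate every way to cut the bit string into per-digit minimal-binary codes and keep only the parses whose digit multiset matches the known distribution. A sketch (function names are mine):

```python
# Enumerate parses of a bit string into minimal-binary decimal digits,
# filtered by a known digit distribution -- showing 10010100 decodes to
# either 4220 or 2024, as described above.
from collections import Counter

CODES = {format(d, 'b'): d for d in range(10)}  # '0'->0, '10'->2, '100'->4, ...

def parses(bits: str, target: Counter):
    found = []
    def walk(i, digits):
        if i == len(bits):
            if Counter(digits) == target:
                found.append(''.join(map(str, digits)))
            return
        for width in (1, 2, 3, 4):              # code lengths are 1..4 bits
            chunk = bits[i:i + width]
            if len(chunk) == width and chunk in CODES:
                walk(i + width, digits + [CODES[chunk]])
    walk(0, [])
    return sorted(found)

print(parses("10010100", Counter({2: 2, 0: 1, 4: 1})))   # ['2024', '4220']
```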

Nice reading!

Thanks you guys!

I sure hope both Alan and krishields are able! That would be really cool.

@ Alan Wow[1]! You may do very well to demonstrate an application even for a lowly $100. I mean even if your program size is way over encoded + program == 415241

@ krishields Wow[2] I will look up what you wrote to expand my understanding. It sounds great and honestly I can't remember reading about that before. Thanks!

---

Update :

There is hope here if I can find an indexing that uniquely identifies an element using 14 bits.

Since I work mostly empirically it is a "cat and mouse" effort at the moment but I have a result of a set of 3 selections this morning. I need a set of one. Close but no $100 yet.

I did find a bug(s) in a linux lib and/or gcc. I am having to write code carefully or my program hits that bug. Sort of annoying.

Good Luck Challenge people. Again "noice" to come and read anything. Misspelling intended.

Well, good news..

I need to see if this method holds for the whole matrix but I just encoded 51904 bits with 50282. 3.125 %
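The 3.125 % figure above checks out:

```python
# 51,904 bits encoded in 50,282 bits: the saving is exactly 1/32.
original, encoded = 51904, 50282
saving = 100 * (original - encoded) / original
print(f"{saving:.3f} %")    # 3.125 %
```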

I'm not saying I have succeeded in compressing anything here just updating the update.

Keep on Keeping on.. I'll post again succeed or fail.

Good Luck Challenge people!

Nope:

My latest empirical theory did not hold.

This was an interesting jaunt into sub-sets and relational indexing but the closer I get to defining an exact reference the more questions I have.

Ultimately, any indirect referencing depends on a final specific index.

The solution eludes me today.

Bah!

Mark,

I have apparently "solved" P vs NP

In fact, all 7 "Clay Institute Millennium problems", and the Goldbach Conjecture.

However, in the last month or so I have found new, sensational math-related ideas about all this, and found apparent "solution" to "theory of everything" re: physics.

I just quickly looked at "The Grand Design" by Stephen Hawking, and noted his recent inclinations toward an Escher-like "universe" in a New Scientist (I think, current) magazine...

My 'hyper-space jigsaw' has more than a passing resemblance to his Escher-like tessellations..... ????

I may have "solved" his idea of a supposedly self-"generating" universe- i.e. why it looks like that...

The / a maybe very spectacular part of this is the recent possibly sort-of "accurate" math-linked ways I have more recently discovered of describing "the/ a theory of every thing' (i.e. "d'oh" or "objectivity") (!!!!!!!)

Ernst, thanks, though I'm not a computer programmer (wrote some simple programs in Fortran and Basic years ago). I worked out logical theories (including one applying to zeroes and ones). If what I found works, I need to find a buyer. I can offer few clues, but could say try looking at this "holistically", like how Sudoku operates- the "million digits" will be "rubbish" until the program (if it was a program- more like a _____________) has "run" (it doesn't necessarily "run" in a conventional way (it may appear to 'run' but what is potentially happening is something quite unusual)(more like a precisely timed space time continuum collision interference scenario) (like turning the whole problem into a uniquely designed 'Rubik's cube" (hence the link with Stephen Hawking's 'Grand Design')(The links with "levitation" and "teleportation" ("space of packaging") (opposite of "turbulence" ("packaging of space")) are not surprising given the notion of "objectivity").

It's like (maybe) several people all doing bits of a jigsaw simultaneously- it all looks like chaos to start with- then the answer (the "magically" "compressed" (not actually compressed (no need to) rather "hyper-space _____________") "million digits" "fall out" (Like a million ________going _______________)

A curious thing:

if my theory is correct:

the "million digits" when "data ________", becomes like a "baby universe", it forms a single coherent "object" (or one may say a "blob" but more likely something more detailed-looking....)

It is comprised of many individual "data objects" that all "mesh" together

P = NP because "time" "stops" (has a precise start and end)- and space is de-quantised most space is now time, time has "decomposed" into what looks like a chunk of "matter" "occupying" 'space' (a hyper-space 'continuum'....)

something like that

...........

@ Alan

I wish you luck...

I too like thinking on the fringe. It's an interesting way to approach things.

I came up with a possible arithmetic that I coined Imaginary Space.

In my imagination I considered it "entangled." In fact the two spinning directions exist for the two objects I imagined a connection to with the hypothetical arithmetic.

I have no proof of functioning to go by but I can say it behaves that way if I want. A system can be constructed. What use that system would be I don't know. That too is domain for the imagination.

And Imagination is a true worth to man!

If you have crunched things to a blob then you got me. I have never seen anything blobish. Existence is a constant from what I can tell.

If you have found some magic hats off to you because I have been within one bit of saving one bit many many times in the 10 years I have been at this; my hobby-interest and focal for my programming bent.

With a matrix, like I have constructed, I can index it in many different ways. All of them valid and all of them 1 to 1. There hasn't been any easy solution that I can see, but this effort has led me to database-relative methods. I was surprised, after reading a couple of PDFs, to see that I do the same as them.

Attempting to find some hash where I use fewer bits than the source also places me at a disadvantage. Albeit I can demonstrate my best method so far, a method of selecting 2 to 6 elements from 2^32 elements. Alas, there are no more bits left to define the exact index once I do find that set of solutions.

So, if you do have a solution, please step forward and show it. It will help me and others sharpen our focus.

This is truly a hard problem to solve, or nigh impossible, yet it is also like feeling along a smooth wall in the pitch black for a way through. The way may never be found, yet the wall goes on forever (path), and we must always remember that.

Ah ha!

I believe I just found an indexing that works..

I'll now focus on this one. This is about my 20th attempt today.

I'll be off-line awhile. Wish me luck.

Good Luck Challenge People

I thought to post here rather than the comp.compression

I have a question related to distribution and frequency.

Here is a frequency table of 6-bit values.

 1 >      39    12 > 4985843    23 > 1315138    34 >    3494
 2 >    1394    13 > 5745053    24 >  894743    35 >    1754
 3 >    6228    14 > 6212867    25 >  589508    36 >     787
 4 >   39638    15 > 6344082    26 >  374279    37 >     385
 5 >  120402    16 > 6131073    27 >  230967    38 >     184
 6 >  333743    17 > 5633410    28 >  137342    39 >      95
 7 >  697016    18 > 4927016    29 >   79264    40 >      38
 8 > 1292337    19 > 4115707    30 >   44347    41 >      20
 9 > 2074589    20 > 3289675    31 >   24215    42 >       4
10 > 3032803    21 > 2518785    32 >   12946    43 >       1
11 > 4036856    22 > 1854111    33 >    6641

In my mind that has Bell shaped curve written all over it.

Since I do so little of the statistical side of data compression I need to ask a question.

Do you think I can represent the 6-bit values (the indices on the left of the bell-shaped curve) in fewer bits than the total number of instances times 6 bits?

I am getting really tired but I thought to ask for a general opinion before I head off in that direction.

I'm looking hard for a way to come in under 1 to 1 and it's not easy.
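One way to answer the question above (a sketch, using the counts from the table): compute the Shannon entropy of the observed distribution. If it comes out below 6 bits per symbol, an entropy coder such as Huffman or arithmetic coding can represent the values in fewer bits on average; whether that yields net compression still depends on the cost of inverting whatever transform produced these 6-bit values in the first place.

```python
# Shannon entropy of the bell-shaped frequency table above: a lower
# bound on the average bits per symbol for any lossless code.
import math

counts = [
    39, 1394, 6228, 39638, 120402, 333743, 697016, 1292337, 2074589,
    3032803, 4036856, 4985843, 5745053, 6212867, 6344082, 6131073,
    5633410, 4927016, 4115707, 3289675, 2518785, 1854111, 1315138,
    894743, 589508, 374279, 230967, 137342, 79264, 44347, 24215,
    12946, 6641, 3494, 1754, 787, 385, 184, 95, 38, 20, 4, 1,
]
total = sum(counts)
entropy = -sum((c / total) * math.log2(c / total) for c in counts)
print(f"{entropy:.2f} bits/symbol, vs 6 bits for a raw fixed-width code")
```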

My bad; posting I would be busy for a while and now posting again..

Thanks..

Oh Yeah.. That is a hard one..

I understand that no-one has a clue..

It's what I do mostly.

Sad too is the whimpered lack of participation in the comp.compression newsgroup after someone stood up to bullying.

They can dish it out but cannot take it.. LOL. Pussies.

Alright, I will check back and see if there are any bright suggestions..

I have a hopeful effort going. I have cycled back to the effort before Recursive Modulus. Nice that I did so much work on it from the start!

I soon will be working 7-days a week so I may drop off for three months but know that my loyalty is still aimed at solving this problem.

Wish me luck.. ~ Good Luck Challenge people.

I dare not reveal valuable secrets.

I could give a vague clue:

"The birthday paradox" (hyperspace discontinuity paradox)

And what I call "the hyperspace continuity paradox"

You don't need to index a million digits

they can index themselves

Why?

Because there are only so many digits

@ Alan

Sounds cool. I don't remember ever reading about "Birthday Paradox." That is really cool!

From my experience the Million Digit file is full of surprises. It is "unique" in so many ways.. I and others figured recently that it would have a lot of factors, and when I checked again the million digit file defied us. Oh, that huge factor could be composite, but it resisted every attempt I made to find a factor. It could be the product of two primes, like RSA, maybe.

Update.. Merging two branches of development today here. It will be nice to have a unified foundation at this point in development.

I still do not know if it will work. So much coding and designing has had to be done before I can get an answer on it. I'm hoping I will have an answer by December.

So Alan man.. You go dood! I thought to tell you that Google has a contract with the Government to archive the Internet in real time. They might be the best choice since it's less risk from what I understand and they have huge amounts of cash. I mean they aren't going to go bankrupt or drop the need for technology any time soon.

Feel free to chat in any way you feel comfortable with. I enjoy even the most random chats as a break from the isolation and long hours at the keyboard.

Woah,

I could use some advice on RNGs (random number generators).

I finally have need and understand why good random numbers are so useful.

My first question is: with RNGs, I take it that a sequence can be reconstructed when needed.

Is that true?

Any suggestions on algorithmic RNGs?

as an aside... I thought I had found a way to uniquely identify an element and to my shock hundreds of matches occurred.. LOL :) Obviously there is a maths at work :)

Anyway, I am now looking at the RNGs. I appreciate any info.
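On the RNG questions: yes, algorithmic (pseudo-random) generators are deterministic, so storing just the seed lets you reconstruct the exact sequence on demand. A quick illustration with Python's standard Mersenne Twister:

```python
# Same seed -> same stream: a pseudo-random sequence is reproducible.
import random

a = random.Random(12345)
b = random.Random(12345)
seq_a = [a.randrange(256) for _ in range(8)]
seq_b = [b.randrange(256) for _ in range(8)]
assert seq_a == seq_b
print("reproducible:", seq_a == seq_b)
```

The flip side, relevant to the challenge: a seed only "compresses" sequences the generator can actually produce, which is a vanishing fraction of all possible files.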

Well, I have looked into hashing as it has been defined in articles available on the internet. I'm sure I need something clever since representing larger data with a smaller code means codes have to be relative to some context.

This means that pigeon holes have to be multidimensional.

I do have the ability and tools to construct such types of functions.

Are there examples of contextual codes? Am I describing that correctly?

Ah! A new Video standard is a fine example to follow. Still I welcome other suggestions to read.

Thanks..

I see that a new "ah"-Front is "ah"-foot

Cool.. Any-1 got bitches about GCC and its temperamental compilation?

I feel like I have to massage the code to get a desired compile.

Well; if this is cutting-edge then perhaps "they" did not know?

Cool enough..

I have a question.

Sure, all the haters can continue to be dicks.. That's natural behaviour, but if one has a system of patterns where 9 bits describe a system, and then there is a complement of 11 bits of data, is that 2^9 ^ 2^11?

I ask sincerely. Kick me in the ass if it makes you feel like a PhD graduate, but hey..

Thanks..
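On the counting question above: independent choices multiply, so a 9-bit pattern combined with an 11-bit complement gives 2^9 * 2^11 = 2^20 combinations, not one power raised to the other. A one-liner to check:

```python
# Counting combined states: counts multiply, so exponents add.
assert 2**9 * 2**11 == 2**(9 + 11) == 1_048_576
print(2**9 * 2**11)   # 1048576
```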

@ Mark Nelson

Hey wtf with the Moose?

Wanna come clean? What does a moose that only moved when you took a photo have to do with the challenge???

That is a bigger mystery!

That those who have become so big that if they stand still no one notices?

Then along comes a photographic moment and it is important to "prove" that they are doing something ("Work") ?

Are you the Moose Mark?

Eh, forget it.. I was trying to be funny and it fails.. Oh well..

@Ernst:

Sometimes a moose is just a moose.

That's a picture I took in Alaska, and since I took the photo, I don't have to pay anyone royalties!

- Mark

Nice.. The Echo was so lonely...

I'd like to go to Alaska.. Got a Relative there.

It is rather slow in c.c and here...

So What sound does a Moose make?

@Ernst:

All the moose I've ever seen in Alaska were silent. People who live there, even in Anchorage, a fairly big city, are very accustomed to seeing moose wander through their backyards, up and down the street, etc. If it is a year with heavy snow, the moose can't get around in their normal habitat and so there tend to be more of them in the city.

It might seem pretty cool having moose walking around in your neighborhood, but if they get upset they can kill you. Bull moose will get territorial and charge you just for being around. Of course, they are not so smart, they are known to also charge trash cans, cars, trees, etc. But one good kick can kill a dog or put a man in the hospital.

And of course, their main job is eating trees, so if you are a backyard gardener, it's not happy time to see a moose in your yard.

I was reading a book by Chuck Klosterman this weekend and I learned a new word, apophenia:

http://en.wikipedia.org/wiki/Apophenia

Very apropos to people who read this page on a regular basis.

Personally, I think apophenia is as far from mental illness as can be, I think it is in our nature.

Have you ever been in fairly quiet surroundings, maybe with just nature noise, wind, a stream, or whatever, and you think you can barely hear a whispered word in the white noise?

- Mark

My respect that people read this regular!

Still if we wish to understand the dynamics of data we cannot depend on statistical and recorded data; don't you think?

Leaves room for all folk and all walks of life I would assume..

Still all the data in the world doesn't replace one Moose vocalization in the context of the unknowing experiencing the "anew" ahem!

This is a nice link I assume.. It is data after all :)

http://www.youtube.com/watch?v=ejlCv9pAiHY

Again.. I attempt humour..

Also I have an idea about encoding that might work or lead to a better concept!

Enjoying the theme of data compression and Moose.. Thanks!

Well; An Update!

All Pariah state aside!

I am still in the game! I think I see a way to encode 8-bits to 7-bits.

I am going to work on that.

I just finished a 32-bit data set (16-GB) and was rewarded with the results that lent towards the simpler.

I'll endeavour along those lines...

Good Luck Challenge people!

God Bless the Moose!

Any guess as to the probability that this is compressible?

For 1 bit of compression, at least half of the possible output sequences cannot, by the pigeonhole principle, be generated. And since the input executes on an architecture, a huge number of input combinations don't meaningfully execute, don't generate 415k of output, or don't terminate.
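The pigeonhole point can be made concrete with the standard counting argument (a small sketch):

```python
# There are 2**n bit strings of length n, but only 2**n - 1 strings of
# length strictly less than n, so no lossless scheme can shorten every
# n-bit input -- at least one input must map to something no shorter.
def strings_shorter_than(n_bits: int) -> int:
    return sum(2**k for k in range(n_bits))      # equals 2**n_bits - 1

assert strings_shorter_than(16) == 2**16 - 1
print(f"{2**16} inputs of 16 bits, only {2**16 - 1} shorter descriptions")
```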

@ Mikemon

My approach is one of inventiveness. I (rightly) assumed all the tools the accomplished and educated followers of Data Compression cling to were useless. That would include all methods of statistical dependencies.

So, I've always been trying to construct something new.

Who can deny if some method could generate the binary number that million-digit is, that (it) would satisfy the challenge even if it did nothing to utilize known compression methods.

Encoding has been my main direction and true it often relates to statistical pigeon hole issues but, I have to see that and try to design around that.

The truth is that the solution is not anything we already know how to do. The Solution is something we all can respect.

Good luck! I find it a focus for my programming interests so it's fine with me to work on it win or lose on "compressing."

I forget that some (okay many) do not share the experiences of my adolescence.

In that light may I share the origin of the Moose humour.

http://www.smouse.force9.co.uk/monty.htm

Enjoy!

Watch the film too.

Ernst wrote:

"My approach is one of inventiveness."

yes

"I (rightly) assumed all the tools the accomplished and educated followers of Data Compression cling to were useless."

Or could be (I don't know much about them)

"That would include all methods of statistical dependencies."

Very nice

"So, I've always been trying to construct something new."

yes

"Who can deny if some method could generate the binary number that million-digit is, that (it) would satisfy the challenge even if it did nothing to utilize known compression methods."

And who wants a million US dollars for finding a licensee, who pays 3x (or possibly more) that commission to the licensor, after all taxes and costs, in a deal acceptable to the licensor (no deal = no fee) for the commercial (and ethical, e.g. not too monopolistic) use of the unproven, theoretical idea of how to do this?

"Encoding has been my main direction and true it often relates to statistical pigeon hole issues but, I have to see that and try to design around that."

hyperspace baryons could be interpreted as "codes" when seen from some "dimensions"....?

"The truth is that the solution is not anything we already know how to do. The Solution is something we all can respect."

The possible, theoretical, unproven "solution" I have found is beautiful to see!

And: some people know more than they realise

I have nothing against it. It is good to exercise one's mind on problems, be they solvable or not. It is like valuable educational problems, which largely repeat well-known lessons in order to educate. A calculation of the probability of compressibility is just out of curiosity, since it doesn't change whether this challenge is solvable or not.

It's so amusing to watch these guys claim they've solved this problem, and greedily state they want a ton of money, without showing any proof.

Not only that, but certain individuals claim they've done a lot more of these impossible tasks, continuously spewing buzzwords without knowing what they mean.

I honestly can't tell if some of these people are incredibly self-delusional, or are stupid enough to believe they can scam someone to buy their invention without showing concrete proof first.

@everyone..

This is a beat your head against the wall kind of problem.

The first thing anyone who wants to solve this problem has to do is assume it can be solved. Yet, how can anyone get paid?

They have to have a demonstration of their work. Then it would need to be reviewed by independent folks I suppose. Then merged into some product and perhaps then they can be paid.

Just watched a video "Compressed Sensing Meets Information Theory." There are a lot of concepts being used there. Bottom line is that new forms of indexing may exist where traditional indexing has failed. (pigeon hole fail)

Those folks are in the business and they believe that alternative systems to our standard efforts are possible.

So it's not so far-fetched to believe it can be done, nor is it far-fetched to experiment.

I enjoyed the "mentions" in the video such as Bernoulli's principle and more.

So, there are great thinkers trying new methods.

Around here: Gone back to the previous system for a bit. Looking at representing 40 bits with 34. It works, but there is a lot of searching involved, so it isn't fast. Yet I now have the memory to attempt such things, where I didn't for a number of years.

@Mikemon.. Yeah, after I replied I realized that I may have seemed critical. Sorry..

As for people claiming things? Well, I just let it slide on by. I mean if someone has a fish on the line they will eat, and if they have an old shoe then they can pretend. Doesn't affect my fishing too much.

My Bad .. Bernoulli's Theorem not Bernoulli's principle. Although I did ponder the principle with regards to data compression too.

Since there are systems that generate a lot of data, such as Recursive Modulus, perhaps it can be considered a "flow" in some way.

Hey Zen, I can assume you are not referring to me?

I claim to have found what appears to me (an amateur scientist/ inventor) a possible way of solving this (so-called) puzzle.

I also claim to have found an apparent "solution" to the physicists' quest for a "theory of everything", and apparent "solutions" to each of the 7 Clay Institute Millennium problems in mathematics, and to the Goldbach conjecture.

It is possible that I am mistaken- that I may have overlooked something. But I have found probably over 40 ways of describing the same basic, and dare I say, fantastic, idea.

Some of these ways are probably going to greatly excite mathematicians, even though initially I thought they might not be too keen on my method.

So I have tons of what one may call "potential proof"- as in things for people to analyse and test if it works. But as there is potential commercial value, like any inventor wishing to preserve patentability etc., I am only in a position to provide the potential proof to someone who has signed a non-disclosure agreement and who can demonstrate that they are a real, reliable, potential start-up company founder/ licensee/ investor etc. I have no interest in scamming anyone.

The usual process is, that after an invention is evaluated under confidentiality by a potential licensee/ investor, i.e. after they have seen the potential proof, they decide if it looks o.k. and if it is worth investing in.

Regarding the phrase "hyperspace baryon"- I was going on instinct (I seem to recall a baryon is either a boson or a fermion in physics)- I used this phrase to describe a particular type of pattern of information. I could tell you exactly what this refers to- I thought about what if Ernst asks me what it is? I then figured out a wonderfully clear explanation- but it is too obvious to say without a non-disclosure agreement.

I am not greedy- I have very little income and difficult living circumstances. As a freelance inventor I have a responsibility to myself to try to earn something.

Well, these things tend to stir the pot, Zen. It leads to folks thinking that poking fun at those who claim nothing, but still work on these problems, is a right.

For example, there is nothing "kook-ish" about Recursive Modulus. It's simple and works; I assume it is a primitive recursive function of some class. But since no one had seen it or knew of any prior art, as far as I could tell, suggesting I may have "discovered" it drew the ire of the very people who should have had the sense not to bully.

So making claims without proof can hurt the rest of us who work on these things without academic standing.

I see that some of the people who fled after the fight have returned, and I also see the sport of kook-bashing is not far from their minds. Bullying is a defence: as long as the group is focused on making someone else the icon of ridicule, the group finds unity and each member is spared from being the focus.

I have experienced this demented group mores related to the Million Digit file for over 10 years. I did stand up and go to blows earlier this year and can do so again but it's better that we all move on.

In the USA we fall short in the measure of our people and science. I think that while we have good educational systems, we need to evolve socially. For example, the Japanese focus on group cohesion and working as a team over individual achievements, and that puts Japan much higher on such lists than the USA.

So it's not good to make claims alone. If one has achieved things, and many things at that, it should be clear that proving just one of them true would help us all as a group.

When it comes to Million Digits, after a decade of experimenting and examining the results, I know a few things. One is enough to write a paper on, but success in encoding the file smaller is still not within my reach.

So either it is truly impossible or the way has not been discovered yet.

Make no mistake, friends, there are systems yet to study for us all, but until there is a way, I and others have to walk between success and failure and never join either side. So please don't stir the pot and create more hostility. Our task is hard enough already.

Thanks

Man o Man.. Update..

I am finding those 40-bit strings, but it is going to take a long, long time to find them all.

I guess the best thing to do now that I know it works is to employ at least 6 of these 8 processors full time and let it run a few months.

So it looks like I can find the 83,049-ish 40-bit strings that make up the Million Digit file and come in under the size limitation for the challenge.

Today I'll rework the code here and set it up to run a few months.

It may take until December to find all those strings in the system I am exploring but it looks promising.

I currently have 54 of the 40-bit strings found, and if it were only these, they would cost me 29 bits each, so at this point I have saved 594 bits already.

I don't expect that to hold up; however, it is possible that at least 4 bits per 40 can be saved, worst case, and if that holds true then the file size will be 373,720.5 bytes, give or take. That would leave 41,520.5 bytes for the decoder, and the decoder should cost about the minimum file size, or about 12K.
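
A quick back-of-the-envelope check of those figures (a sketch only; the 4-bits-per-40 saving is Ernst's working assumption here, not an established result):

```python
# Project the encoded size, assuming the 415,241-byte file splits into
# 40-bit strings and each can be recoded in 36 bits (4 bits saved per string).
FILE_BYTES = 415_241
total_bits = FILE_BYTES * 8                    # 3_321_928 bits
strings = -(-total_bits // 40)                 # 83_049: round up so the leftover bits get a string
encoded_bytes = strings * 36 / 8               # 373_720.5 bytes for the encoded strings
decoder_budget = FILE_BYTES - encoded_bytes    # 41_520.5 bytes left for the decoder
print(strings, encoded_bytes, decoder_budget)  # 83049 373720.5 41520.5
```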

The good news is that decoding will not take as long as encoding so it won't be like the search for patterns I have going on now. The decode time frame will be reasonable.

This is the only thing I can come up with to try, at this time, that promises to represent more bits with less. I have worked and reworked as many encoding ideas as I can think of lately, and usually the 1-to-1 "ness" is the defining rule, so it takes 32 bits to represent all 32-bit values no matter what pattern the system generates.

So, I am not out of the game, albeit the pace will be slothy for some time to come. It's time to work this 8-core; after all, it's why I bought it.

Anyone else got something going?

I'm interested in reading the news..

Ernst

Update Again..

Gone through a couple-three software revisions and I think this is the last round of revisions.

I'll be running longer than a couple months. In fact, maybe all the way into spring from what I can tell.

This is a brute-force search so I'll have to consider other methods of searching this system to improve on this. I'll have the time to think it over.

The way I see it, time is going to pass anyway and spring will come, so if I have a solution, that will make 2013 a great year. The wait will be worth it if I do find all the integers that make up the Million Digit file. The view from my chair is that it is possible.

So, we have a possible solution. Let's see what the Spring of 2013 brings.

Ernst

some news!

Ernst referred to "recursive modulus". I once knew what "modulus" was- anyway, after sleeping on it I realised "recursive modulus" is about something that recurs, and modulates ("D'oh" (!)).

Actually "d'oh" is related to the expression "stating the obvious". The (or "a") "theory of everything" is, curiously "stating the obvious" i.e. "stating the objects" (a quantum field formula)(at least two (or could be more (in the casde of "40 bit strings"- way more?)) ways of looking at "objects").

I thought about my theory re: "the million digit challenge"

and figured out how the concept "recursive modulus" may be linked. Very intelligent to be thinking about "recursive modulus", though it may be much more radical (a very extraordinary type of recursive modulus).

I checked a maths book on what "Modulus" refers to, and it fitted what I had been thinking. (It says "modulus" is "the shortest distance from a point to a line (on a graph); and "absolute value" (so non-negative); and "the size of a number". )

Given the relevance of finding 'data objects' (inherent ways that data "sticks" together (like lollipops, one could say)), "recursive modulus" is obviously potentially relevant....

The question one can ask about "the million digit challenge" is "can one find a "jigsaw" version of this file?" If one can, then it is possible that the "pieces" can be rotated around (each other) and moved about in such a way to "quantise hyperspace in ...... .........." , resulting in a (potentially?) catastrophic reduction in required (storage) space....

Here's something:

What relates "P vs NP", "the million digit challenge", and "the theory of everything" ?

"P vs NP" is apparently about: if there is a question that requires a searching process, but where the answer, when known, makes it obvious how to get there (in hindsight), is there some way to see this "in advance"? Is there some way of "STATING" the obvious?

Well, I found that "the (or "a") "theory of everything" is "stating the obvious". If you "state the obvious" ("state the objects") in the "million digit file", you get a quantisation version of hyperspace (You can see it even though you don't "know" it....)

Some curious theories:

The phenomenon apparently in Loch Ness , the "flying saucers" and "foo fighters" of the 1940s and 1950s, the "Roswell" incident in New Mexico, "freak waves", "dark matter", and "dark energy":

According to my analysis, "dark matter" = "approximate space", "dark energy" = "approximate time".

It was found by scientists that there were, in the open ocean, waves from time to time that would "appear out of nowhere" and be much higher than the surrounding ocean. After possible explanations for these had been considered (such as long runs of wind, colliding ocean currents, or storms), there still remained (photographed from satellites) apparently unexplained occurrences of these "freak waves".

A possible explanation was offered via Schrödinger's equation in physics.

I think that at Loch Ness, wind blowing down/ up the length of the long narrow loch may set up a "standing wave" vibration in the air/ water that causes cross-wave interference between the hillsides bordering the Loch.

This may cause a partial vacuum to occur that draws water up in the center of the Loch.

Possibly a person entering into the water (the water may recede from the shore) would generate a quantum interference structure or effect- the water in the center or wherever may appear to simulate the shape of a person (creating a localised small water tower?).

Flying saucers: these may be mechanical vibration standing waves in the atmosphere, generated when a "metastable" region of atmosphere (a micro-fine, long-lasting vibrational condition in the air) reacts to a moving object (such as an aeroplane) entering that region, causing an approximate "image" (or group of images) of the aeroplane to form at a distance (the 9 "gleaming discs" seen by Mr. Arnold in 1947 near Mt. Rainier?)

A closer version could be "just a light" (ionisation of the atmosphere) that appears to "read the mind" of the pilot (Called a "foo fighter").

A metal container containing Rhesus monkeys (used to test high-altitude space suits etc. when sent up under helium balloons) may have come too near a mechanical standing wave in the air. A sudden 'discharge' of energy to the container could cause it to crash, but the metal to become "twitchy" (otherworldly, like) and the features of the monkeys to stretch (as if they had been in a centrifuge)- hence looking like "aliens".

@ Alan

Alan that post is truly strange. But, I am not a hater so work on!

I ran into another hater on the IRC last night. He went ape saying there isn't any way to encode things with fewer bits and would not think for himself. Naturally he starts in ridiculing and asks for proof, so I talked about recursive modulus and how it can generate a lot of bits from a small input, as an example of what I am currently using to encode 40 bits to about 36 bits. That guy simply has no other way of thinking besides hurting others. He couldn't do the maths, I suppose, and opted to hurt others instead. So watch out for someone who goes by the handle "Hmmmm". He's an arse.

Oh and btw Google now links primitive recursive functions when doing a Google search for recursive modulus. That is good.

Well Alan.. Just because I don't understand what you wrote very well, I'll assume you are trying and leave it at that.

So guys, I update again. I found an error in how I was parsing the input, so I reworked the code and restarted at about 6 PM last night, and this morning I have over 60 40-bit strings (I shall simply refer to them as encoded) encoded.

This is a slow process and like I said a brute force approach.

It is the best I have come up with as of late.

Right now, if I had to encode only these 60, I would need only 33 bits each, so as of this restart that is 420 bits saved so far.

The Maximum codeword size is expected to be 36-bits as I wrote before.

It's happening. It is just going to take a long time to encode the whole file.

I'm wondering if there is a faster way so I'll be thinking about that.

So, Alan.. Friend hang in there! Keep on keeping on and believe in yourself because if you don't no one will.

What's happening!

Still a slow process going here but it is a going concern.

Just passed a 300-bit reduction, based on a codeword size of 34. I know that's really small, and it's so slow, but it is happening.

Anyone else? I'd love to hear about it.

As slow as this is someone could beat me still.

I thought to mention that decoding has been proven, as in a program has been written to do the job. Not that I'm not going to write a better one.

Also, this isn't a magic function. There are limitations to its encoding capacity. For example, if the source file to encode were a file of 2^40 40-bit strings and each represented a unique value, then the encoding would be 1 to 1: 40 bits for 40 bits.

Hopefully, with such a small subset of values as the Million Digit file, there will be plenty of reduction through "encoding." It's a guess, but I'm hopeful that 36 bits will be enough, or possibly 35 bits. I hope for 32, but that is very much wishful thinking.

Well, okay.... I'm excited and chatty about a chance at providing a solution after a lot of man-hours at the keyboard.

Alan? Zen? Mikemon? Mark? How is it going?

@Mark

I remember it being said that compressing the C program that would "uncompress" was legal under the challenge.

I was calculating how many matches and 4 bit reductions I needed to reach the break even point. It's a lot sooner with a compressed decoder than an uncompressed executable.

That is, the size of the compression program, such as bzip2, will not count in the measurement, because it is much like a library, which is also not counted if it is common to all.

Do you still support compressing the executable without the size of that compression program counting toward the total size?

Hello Ernst-

I am not a computer programmer (I wrote some simple programmes many years ago in Fortran and Basic)- I worked out theories about how to apparently solve the challenge with logic.

I do not fully understand what you are doing- if you could explain it then I could figure out how potentially to accelerate it or whatever, though I would have to limit what I can say if it is commercially sensitive (though you could look at options re: start-up company or something)

Eh, it's a little of the old mixed with some newer work. Naturally, why I post is simply to interact with those who share my interests, and you have to admit the Million Digit challenge isn't a hobby of the masses!

Perhaps this will mature into some utility and perhaps have commercial application but it's premature to consider such things.

I have some concern here to improve the searching method. If my current results hold, then this particular search design will come up about 20,000 short of the 80,000-ish. Not that it fails to encode, but the search method may fall short of finding every string by 20k-ish. I must study.

So please, all of you: I'm just doing the hobby as I have for the past decade, and I have never before stated that I had any results which encoded smaller, so I don't post to be a problem.

As it stands, I have a potential set of 1200 40-bit strings encoded to 36 bits. Now, since I am running two versions, to speak simply, there is some overlap, so 1200 isn't an exact count of unique source elements, but in principle it is 1200 40-bit strings, 900 bytes reduced.

This is an avenue to pursue; hardly, a cause to plan for commerce. I'll be looking at different approaches to the searching for solutions.

To be fair to everyone here: if I am called upon to prove the decoding of individual elements, I can provide that demonstration, so this is not without a decoder. However, having 1200 40-bit strings encoded to 36 bits, out of 80,000+, hardly qualifies as compressing the Million Digit File, and the time frame for this primary exploration is a serious detriment.

"I have Miles to go" http://www.poemhunter.com/poem/stopping-by-woods-on-a-snowy-evening-2/

I've been attracted to the challenge from the first moment I read about it. My Million Digit File has a date on it of Monday, Aug 5, 2002 at about 7:30 PM- almost 10 years to the day soon.

Other than that, I may have a full-time job! Yeah, cool! I'm one of the long-term unemployed, and it's getting old!

So be a friend..

I see what you are doing... Lots of brain work, but it could very well be worth it in the end. AND it gives me an idea!

Good Luck ( both with this and the job ) Ernst!

If anyone is at, or goes to, a location that is extremely quiet, and listens, what do they hear?

In my opinion, they "hear" an almost "soundless sound", the "singing of silence". If you have ever used a Sunpak flashgun from many years ago: as it warms up, the sound increases in pitch till the pitch is so high it is almost 'invisible'. That is fairly like what I am talking about.

I discovered what this "soundless sound" may actually be. It appears to be a sort of superposition, or wave-interwoven higher-dimension form, of all "sense data" you have experienced in your life (all sound data, all image data, and all other sense data). The 'amount' is phenomenal- yet it is ALL there (Total recall appears not only possible, but normal, if a human were in complete harmony...)

There may also be another, lower pitch, "bell-like" sound that you hear. This appears to be likely data from before you were born.

A curious thing is: there appears to be a phenomenon of Nature, and of what one may call "Theory of everything", that I call "a hyperspace bypass", and I call many other names all too commercially sensitive to say now, but which could be called "the SOUND of SILENCE".

It may even be that the more data you have to store, the easier it is to store it.

I think I may have a way to 'collapse' as-it-were (Or expand in another 'dimension') the 40 bit string, so that you can add more 40 (eg) bit strings to it, and each time, the added string results in BOTH strings taking up LESS space than the initial 40 bit string did.

The more you add, the less each takes up compared to the first one. By the time you reach ONE MILLION digits, you are using a tiny amount of space for each string when you average it out!

(Just a theory)

Alan

(Note: "both strings taking up less space each, i.e. the total of the two strings taking up less than 2 times the first string would)

@Alan.. Reading your input I was inspired to read up on the "Holographic Universe."

There is a theory that our 3D experience is a projection of a 2D "surface" of information.

Now it seems that Hawking and others see that, as matter falls into a black hole, the "information" about said matter is "smeared" over the 2D surface of the black hole while the matter travels into the deep.

So such thoughts are not too strange. Perhaps there is some division to an element of information and "said structure" that makes up an object. Perhaps if nothing else with "Holographic Universe" we at least realize our ability to perceive is limited to our human structure.

The one thing I will defend as true is that we cannot get rid of information and then magically get it back.

Everything needs a reference to exist. Lose the reference and you lose the information, and in data encoding/compression, if we cannot decode, our algorithm is worthless.

I know this from experience. I have been within one bit of saving one bit for years.

I do have a couple dozen Codecs for my efforts but none are magic functions where I can toss information and magically get it back.

I named one such encoding for Alan Turing, so I have the "Turing Engine" here, named so because it has a mechanical "event."

@ krishields

Good for you on the idea! That aspect of comp.compression always inspired me, and we have lost that: the inspiring of each other.

I have a Seasonal job that starts next week but I just got called for a full time fabrication/welding job which is what I used to do before the great recession.

I've started a secondary program effort here with a goal of Faster, Better and hopefully search-complete.

Here is today's motivational song for us all. http://www.youtube.com/watch?v=5-8dhZRfiTw

As I have said before we must first believe it is possible before we can succeed.

So if I am not so chatty after this weekend good luck Challenge people. Putting food in the Cat Bowl comes first around here. They have me trained.

@krishields

Working on it.. Have a new draft that should run much much faster.

So yeah, I get this done and start it, and I am busy with work, but the CPU don't care.

I will see how it goes. I learned things from the first design. I am now looking at the proverbial depth where the first was proverbial width.

I have no idea how long the depth version will take. I may have some idea of how long in a few days. But, it can run a few months.

Once I have this program proven I'll start multiple copies and hope for positive results.

I suppose this could employ several thousand processors if not a million but again I don't know that answer at this time.

I do wonder if there is a Maths to consider.

Again I say honestly the decoder is functional and the encoder searches for the right combinations of "things." Eh, some mystery is okay.

Feels good to do this and have a job too. I never thought that in the USA a man who wants to work and has skills couldn't do so for years. Still, I understand our corporations are sitting on record amounts of cash. It seems the theory of American capitalism has broken, starting with "too big to fail."

So how did that idea go?

Update:

@ everyone

Well, this didn't take long to come to an understanding.

I see that given an infinity of systems that no one system adds up to significantly more than any other.

All systems scan a division of the file: 59 sets of 2346 24-bit strings each. I then scan for matches of these 2346 24-bit strings. The results are between 500 and 600 consistently, with the total for all sets consistently 34,000-ish.

I divided the file into divisions and reduced my sample size to 24-bits. This is a lot faster than matching 40 bits.
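
As an arithmetic sanity check on that division (assuming the 24-bit strings tile the whole file):

```python
# 59 sets of 2346 24-bit strings should cover the 415,241-byte file.
FILE_BYTES = 415_241
strings = 59 * 2346                  # 138_414 24-bit strings
bytes_covered = strings * 24 // 8    # 415_242 bytes: one byte over the file size,
                                     # so the final string presumably carries padding
print(strings, bytes_covered, bytes_covered - FILE_BYTES)  # 138414 415242 1
```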

So, I hoped to see results where for some system a specific subset of the file would match perfectly and I could logically define the decoding for that subset.

It looks to be a consistent approximate 25% total matching for any system and any subset of the file for that system.

Now I will conjecture, given the nature of matching 25% of the whole Million Digit file with any given system, that there are 4 unique systems that match 100% of this file in such a way that a logical division of elements occurs. Otherwise, for any 4 given systems that match 100% of the Million Digit file, file-order will be chaotic and require overhead that exceeds the bit savings.

So I will conjecture that there exist 4 systems that provide a subset on some modulus of the bit count such that, for example, file-order can be assumed and a string of, say, 24 bits can be represented with 22 bits.

The probability of finding such a set of 4 systems? Well, I imagine it will take more time than I have left in life.

So a possible solution but perhaps not practical.

I'll let it run a while longer just to be sure but the results are uniform and I would expect that to continue.

My bad.. The results are between 500 and 700 consistently for any of the 59 sets of 2346 24-bit strings.

Wow, some of these posts really are full of crazy.

Are you claiming you can fit any arbitrary 2x40-bit string in less than 80 bits?

Basic pigeonhole principle, dudes. Say I bring along a hard drive containing every single possible variant of an 80-bit file. Are you claiming you can produce a set of 79-bit (or smaller) files for each of those? How many different 80-bit files can there be? 2^80. How many different 79-bit files? 2^79. OBVIOUSLY you cannot cover the collection of every 80-bit file in less than 80 bits.

I am going to try to do an update of the challenge with some new conditions that make this a little easier to manage.

Stay tuned!

- Mark

Update.

I let these run over night and the results are the same.

For any system, and there is an infinity of them, it appears that a single system matches approximately 25% of the Million Digit file, and that 25% can be encoded to 22 bits.

However, with 75% not matched, and the likelihood that there isn't a coherent indexing that will maintain file-order, this was an exciting exercise that leaves me with more questions than solutions for compressing the Million Digit file.

I do believe that this encoding scheme functions, since I can generate the infinities of solutions with the decoder for a given string length, such as 22 bits generating 24 bits or, as previously explored, 36 bits generating 40-bit strings.

It was exciting to chase this one down nonetheless.

@Mark

Changing the game now? Is that fair?

Awaiting your new rules...

>Changing the game now? Is that fair?

I think you will find that I am sticking very closely to the spirit of the competition!

- Mark

@Mark cool!

@wtf,

I'm not so sure that we cannot generate all 80-bit files with 79 bits.

Sure thing about the pigeonhole principle, but what about some hypothetical function that generates more bits than are input and has some context to draw upon, such as file position, for example?

It may be possible but not in a strict pigeon-hole-principle sense.

I do agree 100% that we cannot throw information away and get it back without a context.

Then again, nothing exists without context. Every observable or provable thing does so in the context of other things.

As for what the 80-bit comment truly means? I dunno. Maybe a stream-of-consciousness kind of thing, or mockery on some level.

@ernst, you must have lost touch with reality. Let's take it down a notch to more understandable numbers. You're proposing that you can represent any 4-bit string with 3 bits. There are 16 possible 4-bit strings, and 8 possible 3-bit strings.

Out of only 8 different encoded inputs, how can you produce the 16 different decoded outputs? Does any one specific encoded 3-bit input represent 2 specific 4-bit outputs, and in that case, does the decoder then read your mind to figure out which one you were looking for?

79 bits means there are 2^79 = 604462909807314587353088 possible encoded files.

80 bits means there are 2^80 = 1208925819614629174706176 possible decoded files.

How could

604462909807314587353088 inputs

correctly map to the

1208925819614629174706176 outputs?

You must be trolling.
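
The 4-bit versus 3-bit version of this counting argument can be checked exhaustively. The sketch below tries one arbitrary decoder, but the cardinality check shows the same bound holds for every possible decoder:

```python
from itertools import product

# No decoder from 3-bit codes can cover all 16 possible 4-bit strings,
# because any function of 8 inputs has at most 8 distinct outputs.
three_bit_codes = ["".join(b) for b in product("01", repeat=3)]  # 8 codes
four_bit_files = ["".join(b) for b in product("01", repeat=4)]   # 16 files

def some_decoder(code):
    # an arbitrary example decoder; the argument holds for any choice here
    return code + "0"

covered = {some_decoder(c) for c in three_bit_codes}
missed = [f for f in four_bit_files if f not in covered]
print(len(covered), len(missed))  # at most 8 covered, so at least 8 missed
```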

@wtf

No, only specific ones. The file contains 415,241 bytes. That means there are 83,048 five-byte combinations in the file, plus 1 byte left over. If we consider that there are 256^5 = 1,099,511,627,776 possible five-byte combinations, we immediately see that 83,048 is only a small portion of them.

This means that, while we cannot represent ALL of the possible values from 80 bits with only 79 bits, we are only required to represent a small fraction of those possible values.
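
The arithmetic behind that observation, as a quick sketch:

```python
# The file holds far fewer 5-byte (40-bit) chunks than the 256**5
# possible 5-byte values.
FILE_BYTES = 415_241
chunks, leftover = divmod(FILE_BYTES, 5)  # 83_048 chunks plus 1 byte left over
possible = 256 ** 5                       # 1_099_511_627_776 possible 5-byte values
print(chunks, leftover, possible, chunks / possible)
```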

@Ernst

I think I may have found a possible solution. The key, I think, is to not have to store any encoding overhead. Rather, generate the encoding on the fly- hide the encoding in the algorithm.

Meaning, it is easily possible, through redundancy, to generate very large numbers with very small numbers. For example, I can represent no less than 1,000,000 unique 12-digit sequences, each with only 6 digits. Or, to go to the extreme, represent a handful of bazillion-digit numbers with only 2 digits, or a single infinite number of digits with only 1 digit....

So, what are the odds that of those 1,000,000 possible 12-digit numbers I generated, I will not find at least a handful of the sequences in the file? Not very good. And if we change the makeup of the file by representing 12 digits with 6 digits (plus a flag), will it then compress with an ordinary compressor? Possibly.
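
One concrete reading of that 6-to-12-digit example (a hypothetical sketch; "repeat the digits" is my stand-in for whatever expansion rule krishields actually has in mind):

```python
# A fixed expansion rule mapping each 6-digit input to a distinct 12-digit
# output: 10**6 twelve-digit sequences, each representable by 6 digits.
def expand(six_digits: str) -> str:
    return six_digits * 2  # hypothetical rule: repeat the digits

outputs = {expand(format(n, "06d")) for n in range(10**6)}
print(len(outputs))  # 1000000 distinct 12-digit sequences
```

The catch, of course, is that only sequences in the image of the rule are reachable; whether enough of them actually occur in the file is the open question.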

Regarding:

"How could

604462909807314587353088 inputs

correctly map to the

1208925819614629174706176 outputs?"

EASY? Maybe....?

There is (apparently) a very cool trick to this (but I dare not say (unless you are CEO of Samsung or IBM and have signed an NDA...?))

Ever wondered how the English language was created....?

Ernst has said some very cool things but I dare not say much..

Except I could say: your 25% (of the million digit file) is dark matter-energy

*shakes head*.

You guys really need to read http://mattmahoney.net/dc/dce.html#Section_11 and show us some working code instead of just posting nonsense here.

@Mark Nelson

I am also awaiting the new rules! Maybe this will render the challenge more solvable!

A bit about my progress:

I first noticed this challenge about one and a half months ago. I started tinkering with the file, but I haven't solved the challenge so far. I found a partial algorithm which could decrease the entropy of the file, but as of now, it still doesn't achieve any compression.

Regards,

Vacek Nules

a computer programmer

@wtf

You took the words right out of my mouth. I am highly bothered by charlatans making unfounded, boastful, pseudo-scientific claims about their possible solutions to the problem, instead of showing some actually impressive, convincing examples of their work (if there is any real work involved). A layman may be impressed by their hyperbolic ramblings, but we computer programmers can see what's behind the surface -- nothing. (A big emptiness, to be precise.) I would rather read interesting comments by real participants than plough through the pointless "dreamworlds" of these quacks.

@wtf

What, you can't figure out how to strictly generate the 80-bit sequence 125412236523145214256325 from the 20-bit sequence 123456?

It's not THAT hard is it?

@ Alan..

Um, I thought about sharing my current direction and I figure it's safe.

Have you read about the Collatz Conjecture? http://en.wikipedia.org/wiki/Collatz_conjecture

Well there are an infinity of systems because instead of 3x+1 it can be 3x+odd_number.

This is my own notation, so it's not a generally accepted nomenclature: A(X)+or-Y for odd, and X/2 for even, where for the Collatz form A=3 and Y=1.

Y can be any odd value; for Y equalling a power of three there is but one attractor for that system, and for Y not a power of three there are many attractors.

Some non-power-of-three systems have as many as 40 or 50 attractors that I have seen.

To generate 40 bits from, say, 15 bits, we simply sample 40 bits via the "parity language": if the value is odd the output bit is 1, and if even the output bit is 0, thus constructing larger words from smaller words.

That forms words of any length.
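
A minimal sketch of that construction as I read it (the 3x+y-style branch and the parity sampling come from the description above; the fixed word length and seed handling are my assumptions):

```python
def parity_word(seed, a=3, y=1, nbits=40):
    """Iterate x -> a*x + y when x is odd, x // 2 when x is even,
    emitting one parity bit per step (1 for odd, 0 for even) to
    build an nbits-bit word from a small seed."""
    x = seed
    word = 0
    for _ in range(nbits):
        if x % 2:                 # odd: emit 1, apply the a*x + y branch
            word = (word << 1) | 1
            x = a * x + y         # y odd, so the odd branch lands on an even value
        else:                     # even: emit 0, halve
            word = word << 1
            x //= 2
    return word

# Seed 1 under plain Collatz (a=3, y=1) cycles 1 -> 4 -> 2 -> 1,
# so the parity stream is "100" repeating.
print(format(parity_word(1, nbits=6), "06b"))  # 100100
```

Searching for seeds (and systems a, y) whose output words match pieces of the million-digit file is then the brute-force part Ernst describes.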

I just ran all 2^32 input values and generated millions of 40-bit numbers.

The area I am studying is how to find the ones I need. So far I see some structure, which tells me there is a maths behind it, but with brute-force searching it takes a very long time to generate 40-bit words and match them to the Million Digit file.

Naturally, I conjecture that, given infinite time, all 40-bit values of the Million Digit file will be seen.

I also employ some other "tricks" to manufacture more 40-bit words but I'll keep that closeted.

I hope this clears the air.

I will be working full time for a few months so I just started searching for 40-bit segments again.

A lot of times when trying to interact I don't like to publish my current projects just so I will have a friend but this is okay this time.

So far I see about 5 matches after restarting the 40-bit versions this Sunday morning so that is 5 more possible reductions from 40 bits down to 36 bits.

I'm out of time for designing anything new so I'll employ this 8-core for a few months while I am gone and see what happens.

Any Questions?

This is one thing I know rather well.

@krishields

It seems you and I are fringe thinkers, employing unusual methods and concepts on a difficult problem.

Long ago I decided I wasn't going to follow along with the regular data compression methods and went my own way. Not that I haven't learned something about classical data compression; I have, but I have not gone deeply into that realm yet.

I thought about on-the-fly generation too and settled on a middle way with the [A(x)+or-y, x/2] (my own notation for a binary cyclic function, so I'm not trying to usurp any proper notations).

So with that, in goes some number smaller than the desired string length, and out comes a string of the desired size that is part of the million digit file. So therein is a form of data compression.

For me there is the overhead of which system is used, so I am currently trying to work with a finite number of systems so I can use input values right on up to 2^36. And like you, I am hoping to snag all the 40-bit strings that make up the million digit file, because we all know the Pigeon's Hole (tm).

I just ran a 24-bit experiment and found structure that finds all 24-bit values with 22 bits, but the overhead of which system is used for which element makes the resulting code larger than 24 bits. It was useful to try because I needed an answer about heading on up the infinity of systems. As I wrote before, each system generates about 25% of the 24-bit values of the million digit file. So, in theory, there exist 4 systems that together contain 100% of the million digit file encoded to 22 bits. But having just 4 systems that equal 100% of the file isn't enough, because there is file-order to consider. So, naturally, I conjecture there exist 4 systems (4 sets) that contain 100% of the file and are ordered in such a way that file-order is logical. To find them... well, that is the stuff of supercomputers for sure. So it's back to finite sets rather than infinite sets.

But this is about "compressing" one file as you state and I also agree we are after a subset of 2^n strings not the whole set.

I don't see any other way. Clearly we have to have all 2^n to cover all 2^n and that is that. I know, you know.. most everyone knows that.

Well, I hope this points out that unusual methods have to be explored and considered if we are to solve compressing the Million Digit File.

Ernst

@Vacek Nules

I can tutor you on the [A(x)+or-Y,X/2] if you need help understanding.

It's the least I can do to make up for my crimes and claims; however, I am equally sick of reading nasty and hateful things from people who have not used their imaginations to come up with even one clever idea, yet decide they have a right to chastise me for that lack of creativity on their part.

Not that I am suggesting you are such a dork.. No, it would be hateful to say you are a kook or dork because you cannot think for yourself.. That's just ignorant.

So, any questions on how it is possible to generate n-bit-length strings from (n-?)-bit strings using the [A(x)+or-y, x/2]?

The basic code is rather simple; however, I don't think I'll be sharing full sources of my latest works yet, but I can aid you in crafting your own.

@Ernst

I guess so... it seems. I don't see why it is so hard to understand that we don't have to represent ALL of the possible values from a given number of bits, only a small subset of them because we are dealing with a single file, and so hard to understand that a very large data set can be generated by a very small data set.

Since both wtf and Vacek Nules don't seem to understand how to generate the 80 bit sequence 125412236523145214256325 from the 20 bit sequence 123456 I'll just have to lay it out for them...

All we have to do is stack 123456 in sets of three then apply a clockwise rotation read and then a counterclockwise rotation read to the set:

123 -> CW = 125 412 236 523

456 -> CCW = 145 214 256 325

-----------------------------

125412236523145214256325

An 80 bit sequence strictly generated from a 20 bit sequence. It's really that simple.
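For anyone who wants to check the arithmetic, here is one way to reproduce the reads above, reverse-engineered from the example. The exact rotation rule was never spelled out, so treat this as an illustrative guess: each pair of adjacent columns in the 2x3 grid forms a 2x2 window, and each window is read around its cycle.

```python
def expand(digits):
    """Reverse-engineered sketch of the 'rotation read' demo: stack
    six digits into a 2x3 grid, then for each 2x2 window emit two
    3-digit clockwise reads, then two counterclockwise reads,
    reproducing the 24-digit output from the 6-digit input."""
    top, bot = digits[:3], digits[3:]
    out = []
    for c in (0, 1):                      # the two 2x2 windows
        a, b = top[c], top[c + 1]         # top-left, top-right
        d, e = bot[c + 1], bot[c]         # bottom-right, bottom-left
        cw = [a, b, d, e]                 # clockwise cycle a->b->d->e
        out += [cw[0], cw[1], cw[2]]      # read starting top-left
        out += [cw[3], cw[0], cw[1]]      # read starting bottom-left
    for c in (0, 1):
        a, b = top[c], top[c + 1]
        d, e = bot[c + 1], bot[c]
        ccw = [a, e, d, b]                # counterclockwise a->e->d->b
        out += [ccw[0], ccw[1], ccw[2]]   # read starting top-left
        out += [ccw[3], ccw[0], ccw[1]]   # read starting top-right
    return ''.join(out)
```

Running `expand("123456")` does reproduce the 24-digit string shown; the catch, per the counting argument below, is that only 2^20 of the 2^80 possible outputs are reachable this way.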

Then just flag the thing as a match and be on your merry way saving 59 bits in the process.

I mean, this is just one trick you can use amongst a vast quantity of tricks to chop huge bit gains. You can set up any number of rules to generate large sets of data from smaller sets of data. Bottom line is, not being able to find a sequence in any given file that cannot be generated using a much smaller sequence is nil.

The question remains, however, whether or not we can find enough matches to make enough of a difference in the total output size to cover the cost of the code... Which means, a lot of hard work to find all the various methods necessary to do it as a single method cannot cover all the values, multiple methods are necessary.

"Bottom line is, not being able to find a sequence in any given file that cannot be generated using a much smaller sequence is nil."

I assume you meant "...the probability of not being able to find..."

In which case I must kindly ask all of you pseudo-science kooks to please re-read http://mattmahoney.net/dc/dce.html#Section_11

If you still think that you can find a shorter sequence, i.e. less than 80 bits, for any given 80 bit sequence, then please read section 1.1 once more.

Then, if you STILL think that you can find a shorter sequence, i.e. less than 80 bits, for any given 80 bit sequence, then please read section 1.1 once more.

Finally, if you STILL insist that any given 80 bit sequence can be represented with less than 80 bits, please post working code or shut up.
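The counting argument being invoked here is just arithmetic: there are fewer bit strings shorter than 80 bits than there are 80-bit strings, so no lossless scheme can shorten all of them. A one-liner makes the point:

```python
# Every bit string shorter than 80 bits, counted by length:
shorter = sum(2**k for k in range(80))   # lengths 0 through 79
targets = 2**80                          # distinct 80-bit strings
# shorter == targets - 1: there is always at least one 80-bit
# string left with no shorter code available.
assert shorter == targets - 1
```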

You're just not understanding the points we're making...

NOT "any given sequence", but SPECIFIC ones absolutely. I thought I clearly demonstrated that. Again, we're not looking for ALL the possible sequences - only SPECIFIC ones which CAN encode to less than 80 bits with a strict algorithm. There are 1,000,000 80 bit sequences that CAN encode to 20 bits using the method posted above. MOST of which are not useful, but are 80 bit sequences nonetheless.

If you still insist that AN 80 bit sequence cannot be represented with a 20 bit sequence - as clearly demonstrated above, then I feel sorry for you. Are you really that DENSE that you do not understand what we are saying? That we're not looking to encode ALL the possible values - only a PORTION of them? The PORTION that is most relevant to the Million Random Digit file itself...?

Well, it was you who said, and I quote, "find a sequence in any given file".

Which was also the point of the original argument, in which you entered in the middle, and which the other guys are still arguing for. Well, at least that one guy who keeps rambling about "dark matter-energy".

I believe that if the challenge file is truly 100% random, it represents one of the many files which are completely unpredictable and therefore incompressible. I think that, to have a chance at shaving off bytes, you would need to find a pattern, any pattern, in the file, and it is just not there: there simply is no "slack" in the data; you need all the bits to represent the file.

Yes, I said that... but there is a distinct difference between the statements "[a sequence] in any given file" and "[any given sequence]".

I don't know about that dark energy guy, but Ernst is working on something very similar to what I demonstrated only using a strict mathematical rule instead of rules which simply re-use parts of a sequence to generate larger sequences.

It is unavoidable that there exist patterns within any set of random data, and in that sense there is no truly random set. For example, it is impossible not to re-use the pattern n2 = n1+1 somewhere within the Million Digits file. Or many other rules. This means that somewhere we have to re-use a pattern we have already used elsewhere in the file.

The tricky here is to figure out WHERE these patterns are and how to mark them without costing more than the data itself.
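Locating the n2 = n1+1 pattern in a digit stream is the easy half; the hard part, as noted above, is marking the locations for less than they cost. A sketch of the easy half (an illustrative helper, not anyone's actual code):

```python
def count_increments(digits):
    """Count adjacent positions where the next digit is exactly one
    more than the previous one (the n2 = n1 + 1 pattern)."""
    return sum(1 for a, b in zip(digits, digits[1:])
               if int(b) == int(a) + 1)
```

In uniformly random digits about 9% of adjacent pairs match (9 of the 100 equally likely digit pairs), so the pattern is plentiful; recording where each occurrence sits is what eats the savings.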

Well, I'm not a professionally trained mathematician so I too will make statements that are not exact.

@krishields I know something you would like... I should find out how to write a proper paper so I can claim discovery perhaps.

@everyone

So I'm working on a solution.. krishields is working on one..

Anyone else?

I may not solve this but I find it fun to try.

@Ernst

Certainly write that paper!

I have actually successfully used the above method of sequence generation some time ago to win money in the IA state lottery... a game called pick 3 :)

But I found that the patterns drift and flip, so it's difficult to maintain solid monetary gains. We will see what happens when we employ the method to compressing the file...

@krishields

It's on the list. I worked with it for about a year and I didn't see how it could work for compression. Cryptography? Yes, absolutely, but nothing I have tried seems to work for compression so far. It's one of those "you can't throw information away and magically get it back" kinds of things. I tried to find a cheat and it won't let me.

Heh on the Lotto.. My first application of the Collatz function was a Lotto number program. Without knowing much about randomness, and still in my first computer programming class, I managed a function that used one's Social Security number as the seed, along with the date, to generate a set of 6 lotto numbers. That program was awful in how it was written. It had biorhythm functions and other things I copied out of books and pieced together, naturally all without the experience of a full command of the C language. It was horrid, with BASIC-style constructions instead of tight loops-in-loops.

I understand the attraction you have for your pick-3.

I need to look over your post but I thought I might share a thing I did a few months back and didn't put much time into yet it might be an interesting maths to examine.

I'll take some time to see if your method might be similar and if it is then I'd like to share that "thing" I briefly explored with you for you to consider. If you like.

That, like many other interests of the moment, is one of hundreds of short branches off other works in the past decade.

It's a back and forth thing and may be something like your pattern matching or offer some additional mechanics to work with.

By the way what name would you assign your method? "Rocker" came to mind from what little I took in.

I didn't read too closely as I was in and out doing the mundane get ready for the work week stuff.

Update: 74 more 40-bit to 36-bit encodings added to the list.

I figure it will come up short of the 80,000-ish total, yet whatever the result, it offers me an additional result to consider as I ponder the nature of this functioning.

And again I'm enjoying "compressing" million digit file even if it cannot be all I need to win the challenge. There is a satisfying feeling with every line of report announcing another 4-bits reduction.

LOL!

If I may make some comments:

re: the online book: looks interesting.

A quick glance and I see a potential problem: making rules right at the beginning that potentially overly reduce the options a would-be data 'storer' may take...

Why 'probability'? There are other paths....

re: "universal compressor":

you do not have to compress "every string". You only have the task of "compressing" one unique file.

It is already known that 'compression' works better if it is 'tailored' to a particular file. So a "universal compressor" need only let the file itself tailor its own 'compressor'.....

Professor Stephen Toulmin wrote on the philosophy of science. In a book, he writes that "probability" is a myth. Probability statements are only statements about how willing someone is to bank on something.

It occurs to me that if you say "the probability of a cube, with the numbers 1 to 6, one number per face on each of its six faces, tumbling about with no bias toward finishing on any particular face, landing on any one face, is 1 in 6",

you have surely just counted to two.

You have "imagined time".

If you stop "imagining time", and deal with "real time", then you can start "imagining space".........

"Working code" is the answer- but in a different way than you might suppose.....!

"Shannon's limit" is apparently associated with "probability" - but why be restricted by "probability"....

krishields wrote: "The tricky here is to figure out WHERE these patterns are and how to mark them without costing more than the data itself"

True

The answer is to let the data mark its OWN patterns....

You can do that with what I call "a hyper space bypass" (apparently)

This technique, it appears, may demolish the so-called "counting argument"...

why?

because when you finish counting "when you reach the one millionth digit" that is, you have created a space vector (a fixed direction IN SPACE) - i.e. a "universal 'counting' (or uncounting) "argument" .........

a HYPER SPACE BY PASS (!)

cool !

One could say:

"dark matter" = "approximate space"

"dark energy" = "approximate time"

Higgs boson = space vector

Higgs fermion = time vector

proton = space in time

neutron = time in space

electron = (approximate) space/time

quark = conservation (of measurement) (' mass ')

antiquark (Higgs interferon) (theoretical - not yet 'discovered' (!!!)) involves measurement of conservation

(spin)

"lotto machine" = like a Higgs boson

"ice cream machine (snow-freeze machine)" = like a Higgs fermion

million digit challenge = like a lotto machine going backwards (how do these 'random' numbers none-the-less seemingly 'self-sort' themselves into .................. with structure constantly ..................... ?

(curious- the drifts are flips and the flips are drifts)

.....................

Hey, I'm cool with the discussion. Thanks..

I'm not sure how the high-level symbolism works, but I have used such types of reasoning to divine a direction of study.

Working (as in a job) here, and redesigning the search program tonight.

I've added an expansion to the pattern matching essentially creating 128 patterns for every one generated.

I'm pausing the 40 bit works to experiment with a 64 bit pattern-match system.

I figure I'll be busy the next few months so let the computer work!

Searching 2^61 64-bit patterns to match the 51,906 64-bit values that make up the million digit file.

We all know that to be 100% certain to find them all we need 2^64 patterns, so this is a hopeful search of a rather large subset.

I liked the 40-bit version of the new search because it is much tighter.

Okie-Dokie I have little to share so I'll add when I have something worthy.

It's good to be earning an income again.

Well, I have been reading that we are still not absolutely sure we have a Higgs Field what with all the activity in that range.

Still, adding mass to information is what to data-representation?

Just asking in a friendly way..

-------------

How goes it?

I'm matching 50% of the million digit file with one system now. That is up from the 25% I wrote about before.

Hopefully I will work out a way to match 100%. It looks feasible.

So how goes things?

Working 7 days a week now but still managing to scribble in the notebook at lunch.

Good Luck Challenge people!

Interesting... very interesting... The file is definitely constructible. Now it is only a matter of time before one of us finishes our project and successfully cracks this SOB of a challenge.

It's really quite curious. The construction works much better on a binary output of the file. I'd have this sucker cracked right now, this instant if I could find a better method to sort the insertion points of the data...

But, yeah, what's more curious about it is that when I run the file and set it to look for a clockwise rotation, I find a certain amount of matches in the binary code - 100k or some such. But then, when I run the same algorithm on the remainder after extraction, I find zero... But if instead I look for a counterclockwise rotation, I find a bunch again. Then I have to run it clockwise again to find any matches.

In other words, there is something fishy going on here... An alternating clockwise/counterclockwise rotation finds results on the remainder.

I'm really close to having this SOB cracked, as it seems you are as well, Ernst. Both are based on similar principles, so there is something special about our thinking here...

It is amazing to see that many are getting close to the solution... In fact there are some bizarre solutions to compress compressed files and random files.. but the bottom line is that I am paranoid about releasing any of those solutions, or in fact a compressed executable, for fear of the solution being reverse engineered. Added to this, the number of non-implementable patents released by the USPTO makes me feel all the more sick that innovation has lost its place to come up with solutions that are unthinkable to mankind....

I know this post is going to draw lot of flak about me... but I felt that I should atleast let everyone know that there are solutions available for recursive compression...

-R-

@ Rick.

Speaking for myself, I am closer but not close. That there is some actual reduction happening is exciting; however, I too have issues to overcome.

Around here I am again up against the Pigeon's Hole. With the distribution of the "values" of any word size, it's really, really difficult to find one system of encoding that will provide a uniform reduction, and without a uniform reduction what I am doing will require overhead and thus fail to compress.

I'm now thinking about how to get around the Pigeonhole issue. You see, I can get in the 100% match range with no bits of compression now, and with one bit of compression I'm in the 50% range.

I have some thinking to do.

@krishields

Yes indeed! This file is a slick pig. Once you get hold of it, it wiggles free in another way. Cases in point: factoring the file and the encoding efforts.

However, we must believe this is solvable, so having hopeful banter is a goodly thing.

I too see encoding spaced out, and I usually see results that are not optimal. Without explaining: I want to get a result of 6 matches, and it's more common to get 5 and 4 than 6, 3, 2 or 1. Then it almost adds up to the total I want. Cat and mouse, cat and mouse indeed!

Ernst: It is good to hear that you are close... but the bottom line is that trying to beat the pigeonhole principle using permutations and combinations just won't help... My own theory is that the pigeonhole principle exists only for distributions of values, but if you take a different approach altogether, with math numbers, the solution exists.. and the solution works for any type of file, be it image, audio, compressed, random, etc. etc.. so for me the bottom line is the bizarre patents issued by the USPTO.... So imagine, if you did manage to find the solution, what next? Would you be brave enough to let the world know the solution? Nowadays getting a patent on compression is just as tough as fighting against the pigeonhole principle....

-R-

@Rick.

Not close but closer, actually. I thought we were working this weekend, but the grapes are not ready, so the weekend is off.

Rick, I have written programs and looked at the results. From that I have come to understand some things about the systems I have experimented with.

And, uh, every once in a while I cycle back over what I have done before, especially when I feel I am at an impasse in my current projects, so this Collatz stuff isn't all that new, except that I am using much newer algorithms with it.

If I get a solution with that dynamic equation, I figure it would spawn a new effort to prove how that dynamic system works.. It is still unproven. I assure you it would not be a comprehensive compression system. It might lead to one if it is a success, but hey, one step at a time.

STILL! Has anyone else even reported partial success? I don't remember anyone even saying they could reduce any part of that file.

Again I say I can demonstrate decoding what I have so far any time.

For me the Pigeonhole Principle shows up as the range of input needed to produce all output. Seeing that the Million Digit file is flat statistically and in other ways, the need to select unique output for near-unique input often means that the bit length of the input equals that of the desired output. So no mapping of n to n-1 is possible, and I agree with that science.

Where I am poking around is to find some "cheat" where I can pick values out of different sets and hopefully pick inputs that are all of size n-1.

This could easily be another tail-chase exercise, but that is okay for me, since this is not my career and failing means nothing to my well-being. Now, not trying, or giving up?? That could be a bad influence.

I just happen to be one in a million who like trying.

@krishields

So are you trying to get double duty out of one string then?

Did you try inverting the logic, such as X and NOT(X), as well?

XOR and XNOR also offer some interesting and reversible forms.

I have found that X and NOT(X) tend to double the matches in the streams I work with.

Still we have to represent with bits so everything has a limitation in that way.
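The X / NOT(X) trick reads like this in code (an illustrative sketch, not Ernst's program): match a word against either a pattern or its bitwise complement, at the price of one flag bit per match to record which form was used.

```python
def complement_hits(words, pattern, bits):
    """Count words equal to the pattern (X) and words equal to its
    bitwise complement (NOT X) within the given bit width."""
    comp = pattern ^ ((1 << bits) - 1)   # NOT(X), masked to the width
    direct = sum(1 for w in words if w == pattern)
    inverted = sum(1 for w in words if w == comp)
    return direct, inverted
```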

Oh, pardon the excessive posts..

@krishields

I do have that technology I wrote about. If you hit an impasse perhaps we can merge two technologies.

I saw some ability to calculate arithmetic results using a back-and-forth kind of functioning, but placed that on the "I'll look into that later" stack; meaning, for when I am frustrated with what I am doing.

As you pointed out, I do work with strict mathematics and have done so for a long time.

Anyway, if you find you wish to take a break and branch out I will show you mine and you can decide if you want to show me yours..

LOL that sounds sick :) Oh well..

Recursive Compression is impossible, I agree. However, it's simple to transform the data into a format that CAN be compressed.

I've explained 2 "quick & dirty" methods elsewhere on the internet; the 3rd method I'll be working on myself when I've time & inclination.

Don't forget that, while I am not a computer programmer as such, I apparently solved logically how this challenge might be solved. Re: "Higgs field"- from my perspective a group of protons fulfills this. The "knottiness" or "chunkiness" of the million digits is your Higgs field. The "Higgs Boson" is a --------------- that allows you to 'rotate' these 'chunks' around each other in different ways and at varying ---------.

I just worked out yet another way of describing the apparent discovery (how to navigate your way around obstacles) that also appears to allow a solution to the million digit challenge from a logical perspective. What is needed is a possibility of commercialising this. Those who are using mathematics: I suggest operating at higher dimensions...

@Alan

Perhaps; however, a computer program is what is required to actually win, my friend.

The history of this challenge is full of ideas and no proof of functioning. That is why I never made any claims of compression before I shared the partial reductions of the aforementioned "encodings", since one should have proof to back up any claims to be in good standing with the Million Digit Community.

I am not doubting the power of mathematical logic at all. Still, the spirit of this challenge is to produce one or two files that generate the Million Digit File and add up to less bits than the Million Digit File.

I assume one bit less is all it takes, providing it can be proven that any "unused" bits have no effect whether set or reset.

On the topic of creative thinking: I find nothing wrong with symbolic reasoning. I have had my fair share of "Stream of Consciousness" sessions, and with great enjoyment too. So I don't hate creative thinking. Yet, for me, I had to balance out the flood of creative reasoning with a productive output. I chose keeping logs.

I type down or write in notebooks whatever I happen to be thinking along the lines of this Challenge.

Naturally I have hundreds if not over a thousand such records.

So, when I read "Higgs" or "Boson" I harken back to earlier times when vague concepts were a major benefit to the creative effort.

So, write it down exactly.

@EwenG

Cool!

Whoever finds the way is welcome for sure!

@Alan

You inspired me to watch http://www.youtube.com/watch?v=rGf8fEXbF94&feature=related

Interesting.. Do you know I have a maths that has spins?

^_^ Indeed!

So any news Challenge people?

I had some unusual traffic reported here so I am a bit suspicious.

My Update: I'm compiling a large file that hopefully will contain all elements of the Million Digit File.

I'm not sure what will be done after that however some traditional data-compression algorithms may provide utility.

I'm not sure.. As I see these results, I see a lack of brevity in any codec I've constructed. However, I am rather clever, so if there is a way I will try to find it.

I'm missing the OT-pay but loving the feeling of a day off and a job to go back to.

Okay, I've had a good day coding. Good Luck Challenge People!

@Rick

"I know this post is going to draw lot of flak about me... but I felt that I should atleast let everyone know that there are solutions available for recursive compression..."

The idea of "Recursive Compression" as traditionally defined is considered impossible; however, I will look at re-encoding if I can get a result here smaller than the million digit file.

What I am doing isn't traditional "compression", so "Recursive Compression" as seen from classical data compression will still be impossible.

I am getting reductions here, so the next question to answer is: can I represent all patterns of length n with fewer bits than n?

Once I see that all patterns of the Million Digit File are represented and that the total information is less than the source, I will then attempt to reduce the new dataset.

So from one point of view "recursive compression" is possible; however, it is the class of encoding that is changing, not the mathematical laws.

So @Everyone from my point of view there is a possibility.

@Ernst

Busy traveling so I haven't had too much time to work on it. But am looking to go forward here in the next few days or so...

I'll get back to you on those questions then.

@krishields

I have a dedication to follow through here so the time frame is wide.

In all the days since my download of Million Digit it has been the domain of the willing.

I am a bit encouraged by the past few posts as to a community here.

Still, I have miles to go before I rest.

Update:

Ah, the 7 day a week work has started. I will be rather tired for the most part.

I have switched my computer work to the early AM where the best of me for the day occurs. The Job can have the body but this work needs the mind.

I am compiling a large dataset to work with. I expect it to run for a couple three months. Perhaps it will be overkill but hey, I have no real time to discern such for a few weeks.

So more towards Winter I expect to sort through to find the values that generate the whole Million Digit File.

The only potential obstacle to satisfying the challenge is that I might not find the data to encapsulate the elements in a finite number of discrete sets. Hence the large dataset.

In the mean time I will need to design search functions which are challenging to me from a design standpoint.

So with my great respect to you all I have little to add to the blogging done so far.

It is true reductions are happening here. It is true Encodings function and it is true I have not told you everything.

@Mark I do not think I need any reductions or redactions to the rules thus far. Perhaps compressing the C program will be useful, since I have seen a 15k program compressed to less than 4k. That is a major benefit for us all.

Okay, I will be tired for some weeks to come, but in today's economy any work is good work. What has happened to America?

Good Luck Challenge people.. It is time we get this done.

re: "q-bits": from the video Ernst referred to, it looks like they are just using spin states of atomic particles for zeroes and ones. However, I thought "quantum computing" was also about using so-called quantum-mechanical aspects of matter.

"Computing" is about "adding", one could say; "quantum", one could say, is about "quantity" or "things that meet", i.e. that are in a group, so about "meeting"...

"quantum computing" then would be "adding up meeting", e.g. one apple plus one apple meet one orange plus one orange, so a bounding of higher-dimension space ("fruit" is the higher-dimension space, partially distinguished by the level of detail in having a quantity (or quantum) of apples and/or oranges).

I think of "q-bit" as "quantum bit" of information. Aeroplanes can fly "in formation"; a snowman or a house in the process of being constructed is "in formation"; the presence of ________________ seems to be occurring in this pattern.

Without giving all the details, "information" appears to be "quantum bit", so to define "quantum bit of information" would require a code (or "algorithm"); a precise or clear way of doing things.

For what it is worth, my apparent technique for storing the million digits in minimal space involves what I could call "r-bits" (or reletavon bits)(these are related to Relativity).

They are far more condensed than quantum bits (a "reletavon" is, one could possibly say, "a space triangle", or a piece of higher-dimension 'space').

In this regard consider the activity of "speed cubing" (Rubik's cube solving at high speed). Using basic Rubik's cube solving techniques, it may take many moves to solve a cube. "Speed cubers" have numerous algorithms to create shortcuts that allow them to do multiple orthodox moves via algorithm-based moves. A speed cuber showed me a solving of one problem via two algorithms, but said there would be a way to do this in one algorithm.

Apparently the world record holder for speed cubing has solved a Rubik's cube in just 32 moves. I heard that scientists calculated that, in theory, any Rubik's cube state can be solved in as few as 20 moves. I also heard that the record breaker has now solved a Rubik's cube in just 20 moves (even though there are 43 quintillion combinations, I hear).

To solve the million digit challenge then: find the analogous "surface" to a Rubik's cube, then find the short-cut that allows the numerous combinations to be related to the "Rubik's cube" in minimal moves.... is a thought.

Yeah, I like the thought-flow.

Interesting thing too.

Add to that http://www.youtube.com/watch?v=2DIl3Hfh9tY in which Leonard Susskind has an actual measurement for the "quantum or such" bit and has a theory for information density.

It is not "pie in the sky" to think everything we are, or that is, is information and what? Energy, matter??

So perhaps we are a projection. That our 3D perspective is only a projection of 2D information..

Very very stimulating concept in my opinion.

Oh Update.. Tired like a worn out hound dog here. Hard work in the heat of the summer sun.

I'm going to write a much more efficient search function Sunday. I had a great blog-lunch and realized how to maximize the search mechanics. I understand we have no double-time work Sunday, so it is a down-day.

It is pleasing to utilize all N bits and have the flexibility of a dimensioning facility.

I am now wishing for Twin 16-Core AMD CPU's with a full complement of Memory, hardware RAID, Twin maximum-Density Blu-Ray burners and a lot of 4TB high-speed Hard-Drives. Heh, any rich folk out there willing to grant me my wish?? Never hurts to ask.

Well, I wondered what the final search construct would look like. I cannot imagine a more efficient algorithm than the one I am ready to construct and test.

Well thanks for the Blog space here. I'll keep up the efforts until I get too tired from 7-day a week work.. It won't last but while it does it takes a lot out of me.

Holy Cow!

This file is now weird in a new way. Let me describe my testing of an evolving search algorithm design.

I am constructing 32 versions of source data taken from the Million Digit file and am comparing them to 32 versions of a match value generated by a stable algorithm.

Each list contains values that are unique within that list.

So with two lists of 32 I make 32*32 comparisons or 1024 comparisons.

It seems logical that there would be matches of the source list to the match list where some pattern of list indexes occurs.

That is not what is happening.

I have spent the last couple hours proving the functioning of this new search function and it looks right. The Lists are being generated correctly and the comparisons are being done correctly.

It would seem logical that the indexes for both lists would vary between 0 to 31.

Now here is the odd thing so far. Only the original source from the Million Digit file ever matches any of the match list.

That means the index for the source list is always 0 and yes I have proven that it really is a true result.

This is really strange!

The match indexes change as expected, but they never match any other version of the source data? That is extremely odd!

It is also a failure of the hoped-for advantage of this search design, and not for lack of engineering and testing.

Some sort of wall is being hit and I am baffled.

One of the goals for this version is to prove all values can be found, so there is still some time to go before I know that, but so far the results are rather unexpected.

Another strange reality for the Million Digit file, and an even bigger strange reality for the nature of information and Number Theory.

Well, I have modified my first construct and now it does use the full lists. The coding was correct but the result was unexpected, so I adopted a slightly different way of generating the second list.

How odd that previous result was. Okay, so now I will fine-tune the program to find all values, which looks very likely now even though it currently isn't finding them within the bounds needed. Still, it is finding them sequentially without fail.

This evolution is closer. I have decisions to make to be in bounds but will allow this version to prove it found all values before I make choices.

Back to work.. One day off was good, but the second day off is always even better.. Not happening here. I expect to work straight through to October now. Good enough to pay the bills.

How is everyone else doing?

Good Luck Challenge People!

@ Alan

What do you think of http://www.youtube.com/watch?v=2DIl3Hfh9tY ?

Hi,

I watched the video (my internet time is usually limited).

Very curious...

re: some recent comments you made- I think you may have found something but I cannot say the name of it here...........

Re: Video-

I think when Alice enters the black hole, it rotates around her so she is not injured

I think "black hole" matches what is called "number"

I think "Hawking radiation" matches what is called "factorisation"

I think "entropy" is a change between 2 states of higher dimension rotations (e.g. to define two higher dimensional rotations of a Rubik's cube would take two speed-cubing algorithms, therefore a projected 'superposition' of the 'jam' of variables in lower dimensional rotational 'data', causing that data to break up into a limit on space (or a time integral of 'time' (the information is "hidden" because it is revealed (in complex form) (as an inverse object)(a 'subject' (?))(The meaning or _____________of the 'information' is the _______________ of the meaning.)) )

(The "bit" the physicist "left out" is the most important - what does the sentence mean (not just what shape are the letters.....( or what do you know or can you say re: the atoms and particles etc.))

I (apparently), may, know how to see the "information" that the physicist says is beyond reach (it is on the surface of _________________________ and the _____________ of _________________ (!))

a new? old? "law" ?? of "Physics" ;

information can be detected / seen on higher dimension surfaces (anything that _________________________________________________________________________________________________________________________)

( ? )

I agree it is very interesting.

That we ourselves could be projections and our Universe then is a sort of Hologram.

Here is an interesting idea. Perhaps, then, starting at the big bang, our Universe might be expanding and evolving like a data compressor's decompression function: the evolving forms are reconstructions of the data, and the data is the Universe.

I hope that makes sense. In essence we exist because this is the current state of the decompression program in a sense.

---------------

My Update:

I am working on an Alpha version of a compressor using these latest works.

I have set the first goal to the level of minimum compression in hopes of succeeding at reducing all instances of the data. Being able to process all input is the goal.

I am attempting to use multiple subsets to get around the Pigeonhole Principle.

It is a hopeful effort. Still, there are always Walls and Lessons to experience so I'll see. There is an element of chance involved since I am using a collection of subsets.

As I have written, this approach does "compress" any individual input; yet getting this to work uniformly with all files, as a data compressor, is the unknown realm I face.

Still, I'm not out of the Game! Far From It!

@ Alan

Heh, if the Rubik's Cube's smaller cubes could individually rotate in place as well as get reordered in the traditional Rubik's cube way..

I think that would be a way. Now what algorithm would do that and would the instructions plus data be smaller than the source file being compressed?

I think BWT is along these lines.

I know whatever we do (algebraically) and undo works. It's just that I have not seen any process that can throw away information and get it back without some reference which usually means a bit out and a bit needed to signal to put a bit back.

I think I've invented a new way of representing 64 bits in 58 bits (with a few bits of overhead in some cases). The algorithm is reminiscent of the classic newspaper puzzle game "Japanese Crossword", with the decompressor being essentially a puzzle solver, which is given some clues when the solution is ambiguous (hence the overhead). The compressor checks if a bit block is ambiguous, and if so, it encodes a clue for a partial solution. This algorithm produces ~310000 bits of saving without clue bits, so this leaves ~38800 bytes for the decompressor and the overhead.
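For what it's worth, the arithmetic behind those two estimates can be checked in a few lines of Python. (A back-of-the-envelope sketch only; the 6-bits-per-block saving is the claim above, not an established result.)

```python
# Sanity-check the savings estimate for the 64-bit -> 58-bit scheme above.
# The 6-bit-per-block saving is the commenter's claim; this only does the math.

FILE_BYTES = 415_241
file_bits = FILE_BYTES * 8                 # 3,321,928 bits in the target file
blocks, leftover = divmod(file_bits, 64)   # 51,905 full 64-bit blocks, 8 bits over
saved_bits = blocks * 6                    # 6 bits saved per 64-bit block
budget_bytes = saved_bits // 8             # room left for decompressor + clue bits

print(saved_bits)     # 311430 bits, i.e. the "~310000 bits of saving"
print(budget_bytes)   # 38928 bytes, close to the quoted ~38,800
```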

Ernst and Alan, what do you think???

And a bit of light-hearted approach to the problem... :)

The 64-bit "data block" is essentially a 4-D object, whose 3-D projection (i.e. what we can see) makes up the 58-bit "codeword". The decompression "program" (more like a _________) "sees" the "projection" and tries to "guess" the "fourth dimension". The clue bits control the _________ of the decompression _________ and represent the missing information in a _________ format. The decompressor reads the "rules" (i.e. clue bits) and starts to work on the "puzzle", abiding by the "rules". (Think of the compressor as a scientist who encodes the "higher state of information" in a way that the ordinary "observer" can "read" it, and trains "him" on how to "read" the "code"!)

Compressor = "scientist"

Decompressor = "observer"

64-bit block = "4-D space"

58-bit block = "3-D projection"

Clue bits = "lessons for the observer"

Compressing = "projecting the higher space"

Decompressing = "guessing the missing dimension"

Encoding the clue bits = "training the observer"

PS. Alan, if you recognize yourself, don't be offended! :)

Ah HA! http://www.proprofs.com/games/crossword/japanese-crossword-6/

I have to look at that!

Just got in from work. 102 today heading for 107 Friday and I am out in the sun all day.

I read through and yes we have to have some information we can retrieve so if your system has the rules to provide that last bit then it sounds great!

I have been within one bit of compressing everything by one bit so many times and in so many ways.

Let me share the algorithm for compressing all binary values by one bit. This is called Base-1 encoding, and I do believe I was the first to post it in comp.compression many moons ago. I thought of it at lunch at my last full-time job, in, I think, 2005 or 2006. I have it written down in a log book if anyone needs an exact date.

Now this requires that we allow for our human perspective: when we write a number, we usually do not pad to the left with zeros to reach word boundaries. If allowed to do the same for binary numbers, I can show that there is a codec to compress every binary value by one bit.

Take an arbitrary value of 23 (I just looked at the minutes of your last post).

23 is reported to be "10111" in binary. Now, we do work in bytes, so in byte terms it's "00010111", but in human terms it is 10111, just as 23 is not 0023.

Now, with 23 as 10111 in human terms, the algorithm is as follows.

Look at the most significant set bit. In this case it is 2^4, or 16. Now look right, one power of two down, and ask: what must I subtract here to reset the set bit one magnitude larger?

Well, in this example a reset bit is at 2^3, or 8, so subtracting a single 2^3 value will cause a borrow from the 2^4 position, so the first element in our codec must represent the action of subtracting a single value.

We can choose whatever symbols we like to represent these; however, there are only two possible actions: subtract a single value, as in the case of 2^3 being a reset bit, or subtract two values, in the case of 2^3 being a set bit (as it would be if the value were 31).

So for simplicity I will choose 1 and 0 as symbols. Now, I do mean symbols and not values here, so I could swap 0 for 1, or pick the letters A and B as the symbols; it matters not. However, we do work with binary, so we do better to pick a parity scheme and call it good.

Again, there is a difference between {1,0} as symbols and {1,0} as values.

So 0 will represent "subtract 1" and 1 will represent "subtract 2".

Repeat this action:

10111   (23)
    0   (subtract 1 at the 2^3 place)
-------
 1111   (15)
    1   (subtract 2 at the 2^2 place)
-------
  111   (7)
    1   (subtract 2 at the 2^1 place)
-------
   11   (3)
    1   (subtract 2 at the 2^0 place)
-------
    1   < now we stop. We have 1 as the last value, and when we "decode" we can assume the state is a set bit, the same as we assume, for the most part, that we start with zero, or a reset bit.

So the code now (and I stress code, not value) is "0111".

We can also write it in reverse parity as "1000", and again, that is code, not value.

The final 1 is the assumable state for the decoder to start with; hence "Base-1", as I have called it these many years.
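If it helps anyone follow along, here is a minimal Python sketch of the Base-1 coding as I read the description above: '0' means "subtract one", '1' means "subtract two", and the final 1 is the assumed state. (This is my interpretation, not Ernst's code. Note that the output turns out to equal the ordinary binary form with its leading set bit dropped, so a decoder still has to be told where each code ends, which is the usual catch with such schemes.)

```python
# A sketch of the "Base-1" coding described above, as interpreted by the editor:
# emit '0' for "subtract one" and '1' for "subtract two" at each step, stopping
# when the value reaches the assumed state of 1.

def base1_encode(n: int) -> str:
    assert n >= 1
    code = []
    while n > 1:
        k = n.bit_length() - 1          # position of the most significant set bit
        if (n >> (k - 1)) & 1:          # next bit down is set: subtract two 2^(k-1)
            code.append('1')
            n -= 2 << (k - 1)
        else:                           # next bit down is clear: subtract one 2^(k-1)
            code.append('0')
            n -= 1 << (k - 1)
    return ''.join(code)                # final value 1 is the assumed start state

def base1_decode(code: str) -> int:
    # Restore the assumed leading set bit; the decoder must know the code length.
    return int('1' + code, 2) if code else 1

print(base1_encode(23))      # '0111', one symbol shorter than '10111'
print(base1_decode('0111'))  # 23
```

Running this for all small values shows the code is always the binary form minus its leading 1, so the "saved" bit is exactly the length information the decoder has to get from elsewhere.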

Perhaps that may inspire you as well?

--

I will look at your Japanese Puzzle Game soon.. Time to cool off and watch TV for now.. Good thinking!

Sorry, the alignment of the subtract bits was modified in posting.

Want to edit that Mark?

@Ernst

What you've found is not the Japanese Crossword I mentioned... The "Japanese Crossword" I thought about is also called "Paint by Numbers" or "Griddlers", and it's a visual game. Perhaps this starts a brainstorming session for you! :) Using puzzle games as compression methods may seem stupid... but I think puzzles are a good metaphor for (de)compression, because the point of puzzles is to retrieve information not included in the puzzle. Think of Sudoku... the missing numbers can be fully deduced using the rules of Sudoku. It may be just a coincidence, but it is interesting to note that my puzzle metaphor also uses a Japanese invention.

PS. I have read back, and found a post from Alan, dated Sept. 2011, which mentions Sudoku... perhaps even he has good ideas sometimes! :)))

Yeah, I'd believe it is a learning opportunity for me for sure.

I'll have to take your word as gold and that's fine with me.

I'm neck deep in my own pool at the moment but I will not forget to read more about the puzzles. Thank You. Hey feel free to comment even though I will not follow readily.

I don't think using the puzzle as a method is stupid at all. I fear I am not adept enough to engage on a meaningful level when you are neck-deep in that pool, to use a phrase.

Learn I will. Read I must.

Learning Japanese didn't sound bad either but hey.. Little time much to do.

------------------------------------

My Update after a day of 105F all day is good.

I am now ready to prove the flow of the compressor/encoder, and even though it is not flowing correctly yet, the first result found the first dataset within three tries.

That bodes well.

My faith is in multiple subsets rather than the full range of all possible values for a given bit length. Storing values in fewer bits than N, because of the smaller cardinality of the subsets, is how it will compress, if it finds all the values from the Million Digit file.

Even if there is a failure overall, and it only takes one, there are many other variable combinations to try. I am using a best guess approach based on observation and experience but I have never been in this end of the pool exactly so this could be a learning stage yet.

I feel strongly that the mechanics and maths are solid; still, I tend to pore over the code again and again, testing all aspects for proper functioning.

If all works out then Saturday or Sunday may be the day I launch the first effort to actually compress the Million Digit file as a whole while expecting and hoping it will work in earnest.

Again failure is not the final result but it would start me hunting for a variable set that works for all instances which I admit is more enthusiasm and inspiration than solid maths.

Overall I am excited to have a proper candidate. Now on to the last steps of proving the program flow tomorrow morning.

Hey Guys!

I got this new version of the search and code going but the results so far suggest I am better off with the previous 40-bit method.

I'll let this run a few days and see.

Tomorrow I will look into "Japanese Crossword" and Sudoku

Can't hurt.

@ Vacek Nules

They told us that they will try their hardest to not have us work Sundays so I get tomorrow off. Cool.

I'll look into your suggestion once I am rested. Feel free to share any links that may be helpful if you have time.

Ah, just in after 107F all day in the sun. I see the assumption I made for the new design doesn't offer the production I had hoped. Still, these things have to be determined, and so it is this evening.

So it's back towards a center effort. The Middle Way, as it were.

The new design offers some improvements over the previous effort. They will help; however, it is back to a long wait. A wait with proven results.

Ernst- another idea is that the belief that the universe is expanding may be an illusion caused by "compressing" (or over-flattening) astronomical (telescope-obtained) data....

Regarding:

"Heh, if the Rubik's Cube's smaller cubes could individually rotate in place as well as get reordered in the traditional Rubik's cube way.."

= hyperspace continuum..... (?)

"I think that would be a way. Now what algorithm would do that and would the instructions plus data be smaller than the source file being compressed?"

would be a hyperspace continuum algorithm

Vacek Nules- I couldn't make a lot of sense of your post in the brief time I had to read it...

Due to possibility of commercial usefulness of ideas I have to refrain from saying all the intricate detail....

General: here's a curious thing:

I have, apparently, made some major breakthroughs on blackholes, and a related teleportation theory.

Without giving full detail, I can say some things. Re: teleportation- it appears that if you wanted to "jump" from the Earth to the moon, this would involve a gravitational interference 'pattern' re: the outer planets.

I have figured out that the 'information pattern' that is called "black hole" could be seen as being about the notion of "conserved space".

Examples of "black holes" could be:

there is a toy which consists of a square grid containing, say, 5 x 5 squares that can slide past each other vertically and/or horizontally, with letters or numbers on them, minus one empty square.

Any superposition of two or more configurations of this grid, would leave the "hole" "invisible" (or seeming to be temporarily occupied by a square)- so would be a "black hole".

Black holes are all about "the structure" - the relativity between potentially different configurations (of the toy grid array here). Higher dimension _________________ (not sure i should say what this is!)

The (higher dimension) 'space' (it is very time-like, interestingly...) ("time" as "anywhere within limits", e.g. anywhere between 12 o'clock and 12 o'clock on a clock face, or anywhere between the limits on a pendulum swing)

"__________" (maybe I should say that...??) on every possible _________________of ________________

so is a "holographic ________"

It is time-like.

Gyroscopic (!!!!!)

the big picture implicit -

a limit on locating ____________ the distinguished content/contents ___________________________________________

of a (real time) object (or (a) space ________________

The relationship between the _____________________________________of the toy movable squares sitting in a square grid.

If __________________(many things here but not said here now....) ____________

soliographic

as if __________________(the tele____________!)

blackholes may be 'artifacts' of multiple telescope 'integrating'

(a) natural SPACE telescope(s) !

("SPACE" telescope as a telescope that "sees space" (that knows how to differentiate between objects in real-time i.e. that knows ALL the data "simultaneously" (a telescope that operates within the idea(s) of higher dimension Einstein _____________)

Like a ___________

it ___________

things the more dispersed (spread out) they _____________hence they__________________________________________(like galaxies (or sometimes supernova ?)

Integron: ____________________________________

Differention: ____________________________________

Gyroscope: 'still' when it spins

Compass: 'spins' when it (is) (relatively) still

gyroscope: particle-like, locks on to space

____________________________________________________________________________________________________________________________

compass: locks on to "___________?

Any two gyroscopes create a REGION IN SPACE (or "real time")

Any two compasses create a REGION IN '_________' (or "real space" )

compass plus electro magnetic: points in time _

_____________________

gyroscope plus gravity: points in s_____

like a ________________

Then I have drawings of really cool things (involving gyroscopes and compasses) and descriptions...

_____________

(quantum)l___________ (?)

not quite sure where something is __________

quantum ____________

(lots of details not posted.....)

_____________

space gravity

(!) sounds like fun (?????????????????????)

so can teleport (outside e.g. planets in (e.g.) solar system time- (code) electro-magnetic hover-board (stiff not really stiff just conservation matter (in space across time))

Teleport Earth to moon:

very slight change in Earth moon positions and outer planets - equal only to 1 (teleport) 'mass'

easy

So it (may) SEEM that all your atoms and molecules 'gyrate" when you attempt to teleport (as if you were "burning") but due to gravitational feedforward feedback vis-a-vis the relationship between your location vis-a-vis a large gravitational mass (like your place on the earth and your teleport-arrival location eg on the earth or the moon say) and other large (though maybe distant) other masses (other planets not the Sun I think the Sun may be too hot as a suitable gravity teleport "beacon" ( ? ))(or may be a still second reference- more likely - I guess )

your atoms and molecules do not "twist" around each other; instead the large gravitational masses absorb and re-emit the "strange assemblage/ re-'assemblage' " of your body as just another day at the 'office' of their relative movements (around the Sun) called a geomagnetic orbital equation (a phrase that comes to mind and that I seem to recall reading was used by Mark McCutcheon)

fun as

@ Alan

I don't understand. Sorry.

I do like that read of it in a creative thinking context. I'll accept that you are making progress and keeping it mum. I have done the same many a moon in here and comp.compression

I have to agree that "Spooky Action at a Distance" was considered nonsense by more than one learned fellow at a not too distant point in the past.

My Sunday update.

Back-tracked and added the best of the newest evolution to the middle evolution of the search/encoder/compressor program and started multiple copies running. I set the search range back to the low end so it may be awhile before I see a report however, I must be thorough.

I have looked for a faster method of searching and will continue to think about it.

So, unless I can come up with something better this will take as long as it takes.

Again I have nothing but great things to say about the AMB Bulldozer cpu http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29

It sure can take on a lot and still have more to give! Thank You AMD!

So, starting the search over. The Clock is reset and running. Let's see what gets reported!

Darn Typos.. Sorry AMD...

The multiple copies are reporting matches, and so far 136 bits reduced/compressed. All but one of the programs are reporting matches so far. I also have the previous data, so I already have possibly a kilobyte of reduction available.

This is the encoding that actually reduces/compresses random data or any binary actually.

In theory it should recursively encode so a sort of infinite compression is possible because this doesn't depend on statistical methods.

I could utilize even a 64-core CPU here. If anyone is so kind!

@ Vacek Nules

I am now browsing the "paint by numbers" information. I didn't come up with any C source code so I'm keen to glean some code about it.

Got the chores done so I have the afternoon off..

Well, Vacek and Alan, I have a functioning system here. I have explored as many aspects of these equations as is reasonable with my skill set and have decided on a configuration.

It is slow and may take a long time but it's happening. Time for you guys to focus on your efforts.

Good Luck. I will be happy if any of us reach the finish line!

Ernst

@Ernst...

I have devised two new algorithms for encoding the file... both of them utilize the idea of encoding only absolutely necessary information, not fluff. For example, imagine a certain, attractive young woman. (If you have trouble imagining, use Google to find a picture of one.) "She" looks as you can see "her", but "she" is, in reality, just a quantity of incomprehensible data of a higher level. You can differentiate her from all the people by her looks. But what if you know *two* similar girls? (If so, you're lucky.) Then you must encode more information to avoid confusing them! For example, their names. But if they have the same names (that would be paranormal), you have to find some subtle difference between them. Got it? Now imagine 64-bit codewords instead of girls, and repeat the process. The variant part between my algorithms is the method of capturing the "essence" of data - a puzzle, a checksum, a combination enumerator, etc... By the way, last night I had a nightmare of someone (maybe you?) winning the challenge and claiming the prize of -how strange!- $111! (Here I realized I was probably dreaming.) I hope that the winner will be me... but I am not convinced about the usefulness of my ideas.

PS. If the post sounds Alan-ish, don't worry, I'm not losing my marbles... :)

Me thinks maybe I just nailed this one... :) Expect a submission in the next few days.

@krishields

Could you tell us more about your algorithm? I've shared my ideas already, so this is your turn! I will, if needed, divulge more about my progress... I wish you luck... after all, I'm just a greenhorn in these subjects, so I hail to you, Master! :)

I will not have time to work on this until Saturday (or if I'm lucky, Friday), so I'm going on a hiatus... and if it happens, I will step aside, and watch the ending sequence, performed by others...

Until next time!

Thanks Ernst for inspiring yet another way i can describe an apparently awesome discovery (i must be nearing 50 ways to describe it!)

Hey Everyone,

Good to read.

I'm the Turtle in this race, then. Having changed the search algorithm yet again, I now see reasonable use for the bits spent on the methods.

Still, I may come up against the Pigeonhole Principle in another way, as the number of matches is under the number needed to be under "the wire", so to speak.

I do have 4 bits of compression I can work with but I don't have a plan yet. It may come to me or I may face a shortfall.

The break point is finding half of all the values, and spending one of the 4 bits reduced to mark compressed versus not compressed.

That suggests that, with compressing the decoder, the resulting encoded file and the compressor would come in under 415,241 bytes.
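To put rough numbers on that break point, here is one reading of the accounting (an editorial sketch, not Ernst's program): a found word nets 3 bits after its flag, an unfound word costs a 1-bit flag, and the word count assumes 40-bit words over the whole file. Under this model the strict break-even is a quarter of the values found, and the stated target of half leaves roughly 10 KB of headroom for the decoder.

```python
# Rough break-even arithmetic for the flag-bit scheme described above.
# One reading of the accounting; the 40-bit word size is taken from the thread.

FILE_BYTES = 415_241
WORD_BITS = 40
words = -(-FILE_BYTES * 8 // WORD_BITS)      # ceiling division: 83,049 words

def net_saved_bits(found_fraction):
    flags = words                             # one compressed/not flag per word
    saving = found_fraction * words * 4       # 4 bits saved per found word
    return saving - flags

print(net_saved_bits(0.25))   # 0.0, the true break-even point in this model
print(net_saved_bits(0.5))    # 83049.0 bits (~10 KB) of headroom at one half
```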

This is all I have. Yes, individual elements are being encoded 4 bits smaller, and yes, they are decodable, but finding enough of them is the challenge.

On one hand it's nice to have a method and on the other it's not nice to fear failure of another kind.

As for the other? Bring it on! Step up and claim your prize!

De-compressor. not Compressor.. sorry.

Ah, I gave it some thought today, guys. I have been inching towards a minimum compression achievement, ruling out things as I go.

The multiple systems running here were meant to be a collective repository. I think I had better use a competitive model and find a single system that contains all the values in the Million Digit file.

So, I'm going to rewrite again and change over to a competitive strategy.

This will offer the maximum possible, if compression is to be had for this file, and that is 2^N-1 code bits for the sample word size of 2^N.

Again this will depend on what sub set of values are generated by what configuration.

Simply put, I am getting too few results to achieve the goal of even 50% encoding of the Million Digit file using this collective-effort strategy, but I am getting 40-bit to 36-bit encodings.

Back to the Vim Editor!

Ah, a Day off..

I thought to ask if anyone else is working with the Collatz types of dynamic equations now that I posted about it..

It would be nice to get a "Hell Yeah" on the compression of 40 bits to 36 bits if you are following along.

I wrote @ Alan that telling on myself was safe since I know how hard it is even if people know the basic idea for encoding.

So I'm interested in general chatter on the effort.. I'm getting good utilization but I can stand contrasting points of view and can provide some in return..

Well I have chores and I am needing to write a function today so I will get to them both. I'll check back later.

I give up... ROAR!!! The algorithm took much longer than expected, and it didn't find nearly enough... well I give up for the time being. I'll probably come back to this problem in a month or so.

Sorry to read that, krishields, but you know, "Failures are only early attempts at success."

Been really tired here. The "Drag effect" is starting in but I hope to have the new version cobbled together this next Sunday.

The current ones have found over 1600 matches, for 4*1600 bits reduced. That is 800 bytes of "provable" reduction, but not enough to win.

krishields,

Keep on keeping on.. It was when I thought I had exhausted everything that I made my greatest discovery.

@Ernst,

I've decided to go after the closely related field of encryption. So, maybe I will find something in there that will help me... if you haven't already accomplished the task! ;P

I've got some interesting ideas I want to flesh out. So, I didn't really give up... just moving on for time being.

Good luck!

I understand.

I do have a method here but Pigeon Hole is showing up so I am going for one bit reduction per 40 bits and using a competitive relationship for running multiple instances of the encoder.

It is possible I can hit on one system that generates all the Million Digit file and it is possible it could take a long long time to find one that does.

Well, I am writing a function, now that I managed a full night's sleep, that is required for the new version. I will try and sleep 8 hours tonight as well. It's the only way I can work in the morning on Code.

As for going sideways on projects? Yep, I do it, you are doing it, we all do it!

Sometimes it is necessary to follow a thread and I understand that!

I'd have to say that I spent most of the decade I have been working this challenge exploring the nature of data and encoding, realizing that what was being said publicly existed in my results, and moving on from there.

When I committed to solving this challenge in February of 2002 I knew next to nothing and now I feel much more enlightened.

It takes time.. What may be the biggest asset is that one may not know better, and so may see things outside the box that others avoided.

Okay Man, I look forward to reading from you again.

God Speed on your efforts!

Bummer.. Had a GFX card failure due to poor air-flow. Couldn't see anything, but the AMD 8-core was churning away. I know this by two things: the sound of the cooling fans, and the dates on the files were later than when I discovered a "frozen" display.

I have since used the can of "duster" to blow the dust from the fans and the Temps have returned to normal.

Rebooted.

There is little reason to restart this last effort. I would have liked an up-to-date summary up to the moment I manually killed them, but it's moot.

The last effort would have run out of bounds before I had anywhere near enough matches.

It is interesting that there are ways to encode that few have written about, yet they hold promise. True, it is outside the scope of traditional data compression; still, think outside the box, and outside that box, if you can...

Okay! I have some work to do on the future effort and I shall shut down for the time being and rest my poor computer awhile.

Sunday is scheduled off so it's a date!

Keep on banging those rocks together a wise man once said!

Maybe it was a movie!

So on to Sunday?! So be it..

@krishields

I'd say I haven't accomplished the task. "Closer" is the term best describing where this is at.

It's Sunday and the water is getting hot for a Cup of instant-Joe.

What this next evolution will aim at is finding the set of 40-bit elements that is the Million Digit file, in a single instance of this encoder/system.

Effectively this will generate 549,755,813,888 (2^39) 40-bit numbers, and I need to find all 83,049 40-bit values in one system.

It seems to me to be possible and I am making every effort to have a quality program.

Since the variables can vary and the resulting 2^39-element sets will be different, there is a possibility that the search for an exact match of the Million Digit file may take much too long.

All I can say this morning is I have the tools to craft such a program and the experience to work with these algorithms.

One more function to write and I can work the main code.

There is still a "luck" factor involved and that makes it exciting.

First Coffee!

Good Luck Challenge People!

Based on my current work with encryption, it seems there might actually be a sneaky way of going about it... sort of sneaky, anyway. I would say that, while the algorithm I am thinking of would be "tuned" specifically for maximum efficiency on this particular file, it has a high probability of working well on a great majority of files also, so it could stand as a supplement to a standard compressor, or as a generalized stand-alone compressor in itself. That is, my overhead would be much, much bigger than the original file, to the tune of 16 MB, but would it count against me for this challenge if this kind of overhead also works with a great majority of files?

Can you generate your 16 MB data on the fly? That wouldn't count if your program generates it and the program's size satisfies the challenge.

It's possible, I trust, since I am able to generate a 16 GB dataset with a 10k ELF program.

Too bad there wasn't an obvious way to index that to generate the MDF.

This is pretty funny. I was momentarily convinced I had a sneaky trick to solve this. Now I am convinced that I can't and leaning toward it being completely insoluble!

My plan was naive. I have at my disposal a function that can generate an endless series of pseudo-random numbers. All I had to do was find a small random seed that matches a large segment of this random data, trading runtime for filesize in a dramatic fashion!

Seed size and combinatorics won out though. There is some subtle rational relationship between seed length, the number of seeds per bit width, and the length of sub-problems. As I increase my window size, say from 3-byte to 4-byte, my problem domain goes from 4.2e9 to 1.4e12, and thus using integer seeds the ratio of seed values to subproblems plummets, causing a need for a bigger seed!

For instance, with only a 4-byte window size, if I have 3 bytes of seed that's (2^(3*8)/4.2e9) = 0.00399458, or about 0.4% coverage of all combinations! If I increase to a 4-byte seed the ratio becomes 1.02261126, or 102.2% coverage, but now storing my seed takes as much room as the value I'm generating!
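The coverage arithmetic above checks out directly. A quick sketch reproducing the quoted figures (4.2e9 is the rounded window-space size used in the comment, roughly 2^32):

```python
# Fraction of the ~2**32 possible 4-byte window values reachable from an
# n-byte integer seed, assuming one generated value per seed.
window_space = 4.2e9                      # problem domain quoted above

coverage_3 = 2 ** (3 * 8) / window_space  # 3-byte seed
coverage_4 = 2 ** (4 * 8) / window_space  # 4-byte seed

print(f"{coverage_3:.8f}")                # 0.00399458 -> ~0.4% coverage
print(f"{coverage_4:.8f}")                # 1.02261126 -> ~102.2% coverage
```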

Now even with those small numbers you may have noticed that 0.4% of 1e6 is no small number. Unfortunately that is the best case, if every seed generated a value in the data set. My best run was a little odd though:

Which leaves me 191 bytes even after storing all the seed values! Looks great, right? Unfortunately I don't think I can cram the re-constitution program into such a small space, and the real kicker is this: I forgot about the window indices! For each seed I also have to store a window number between 0 and 138414... 18 bits apiece minimum.

I figured I could trade size for runtime: just search for one seed that created the entire data set, thus only having to save a single number and a trivial program. But even a set of 255 samples from 255 values with repeats results in a ridiculous 4.7e613 possible combinations. At a million samples the runtime really would be longer than the lifespan of the universe, even with a good algorithm!
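The 4.7e613 figure is just 255^255, the number of ordered samples with repetition, which is easy to confirm (a quick check, not part of the original comment):

```python
# 255 samples, with repeats, from 255 values: 255**255 ordered outcomes.
combos = 255 ** 255
digits = len(str(combos))               # 614 decimal digits
approx = combos / 10 ** (digits - 1)    # leading value, about 4.65
print(f"~{approx:.1f}e{digits - 1}")    # ~4.7e613, as quoted above
```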

That was `fun`, but I'm done. Very nice object lesson though!

Thank You Kitsu!

It is nice to read your post!

Perhaps inspiration will visit again!

Update!

I have, not by my choice, two unpaid days off for the holiday so I will endeavour to code some.

I hope to see an increase in the percentage of matches. The best I can hope is that all values will be found before it goes out of range; however, getting a result that matches a higher percentage will be satisfactory.

I expect that the "right" values for this function are what would match all of the MDF. I can only guess at the values used. Oh, I have a general idea of what range I wish to use, yet I'm hopeful more than confident.

It should be a fun couple of days.

The details: going for a one-bit reduction on each 40-bit value, so the maximum reduction of file size is 10,381 bytes. That might work if I strip the decoder down to the bare minimum and compress the ELF.
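The 10,381-byte figure follows from the word count. A quick check of the arithmetic, assuming the count comes from splitting the 415,241-byte file into 40-bit words:

```python
# One bit saved per 40-bit word of the million digit file.
file_bytes = 415_241                  # size of AMillionRandomDigits.bin
n_words = -(-file_bytes * 8 // 40)    # 40-bit words, rounded up: 83,049
saved_bytes = n_words * 1 // 8        # one bit per word -> 10,381 bytes
print(n_words, saved_bytes)
```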

Still there are a lot of sets to search and I will know more about how the mechanics of the new design function soon.

That's all I have.. I am not out of the race; that's for sure!

Happy Labor Day!

Hey Guys!

Nice to sleep in. I was surprised to sleep an extra 5 hours! Tired I was.

I'm running the new prototype now. I may need to test for "collisions."

When working with so many numbers it's hard to know if duplicates are occurring.

For now I will simply test for output.

How goes it?

Well, This new work offers more match values but it is slower.

A trade off it would seem.

Running 16 instances and I must be patient. Run time is not the same as it was and boundaries will take longer to reach.

The good thing is that the number of comparisons is the same.

It's the old deep versus wide argument. Half as deep as before, but much wider.

The question is, is there a way to find more matches, or is there a limit no matter how it's parsed?

Well, I will check back. I know I have time-off but many may be on real holiday.

@Kitsu

I am keeping your observations in mind as I work this new design.

I am providing more choices yet I see no matches so far. It is slower, that is true. Still, I wonder about randomness, and wonder whether it isn't so random after all.

I'm limited as to expressing true terms because I am an amateur mathematician. Perhaps even considered a tinkerer but I despise the hateful labels "kook", "crank" and so on.

Hate is not limited to the lesser educated.

Conformity is a powerful organizer of groups and the easiest way to maintain control is to define some class to despise. Kook and Crank have been the default defines for the Compression Community for a long time.

So, try and try again! Imagination is the only limitation throughout the ages! It certainly has not failed to be true in our "Modern Times."

As I mentioned I appreciate your post. I'm sure others do as well!

Good Luck Challenge Person!

I have a mystery this morning.

I am following the very same methods as before in constructing values, yet I had no matches at all overnight for these greatly expanded match sets.

This seems to point to a failure, yet how can this fail when its cousins work well? Very interesting.

I have already gone over the basics and things look right.

Is this version generating everything except the values in MDF?

Is this version very sensitive to the function's variable set?

Coffee first!

I'm not getting random enough generations.

Interesting.

Anyone else hanging out today?

I don't understand what domain this is in.

I'm dead in the water until I do.

Very Very interesting to say the least!

Update:

I see a valid system this morning. I examined the generation of match values last night, changed to a much simpler system, and I didn't see any major repeats as I had before.

So, more to do, but the main goal of increasing the match choices per word is accomplished.

Now to add a bit more functioning and then restart.

Have a good week everyone!

Good news! I see an increase in the percentage of matches.

The instances are much closer together so this may not over run the boundaries.

I mentioned before that I have not opted for a 4-bit reduction this time, just one bit per 40-bit word. That is, I hope to represent each 40-bit word with 39 bits.

So far, since 5 am this morning, I have 78 matches. 78 bits isn't much, I know, but it actually is compression of the million digit file and is worthy of further efforts.

My main goal is validated: the goal of increasing the percentage of matches within variable bounds.

I'll observe a couple of weeks and decide what to do next.

While the true run time of this version will not be aeons it could be a year or so. I'll have to decide if I wish to commit to this version but, hey! It's the best so far as for being tight on the bounds.

Don't forget everyone, that (apparently) solving this "problem" also appears related to (apparently) solving "P vs NP"

Kitsu, the phrase I could say is ___________________________ (need a deal to say it!)

I don't think so Alan. I'd be surprised if this solves P vs NP. This is a "different animal."

Update:

By the 24 hour results here the time frame for a full run is 18 months.

There is an option to use a combined effort of all 16 instances running that would run 12 months.

Also my simple reckoning suggests this version will not run out of bounds.

The fail here would be that not all 83,049 40-bit words will be found after generating 2^39 (549,755,813,888) 40-bit words.

This would happen if I have failed to write a competent number generator.

I'm going to keep on running it and look at the rate of progress each day.

Again I ask, if anyone has a 64-core system I can use that would help a lot.

Never hurts to ask.

(Ernst: the way that you are approaching this problem, it LOOKS "a different animal" to "P vs NP" ? (is my intuitive response))

P vs NP is about "is there a way to "see" what in hindsight is obvious".

I earlier I think mentioned that the apparent answer to this million digit challenge is associated with what I could call "(The) theory of every thing" , or "stating the obvious", or (quantum) objectivity. If you can break the so-called "random" data into a set of self-sticking objects (objects that naturally fit together (in a lump of, well, clay, you might say)(this may have interest to economics as a "lump" (of "clay") is like the opposite of a anti-lump (a depression)- it seems likely that "depressions" in economics may tend to occur when there is an extra "lumpiness" beyond which the trade/financial management patterns can handle ( too much wealth "in the wrong hands" ? Not enough real economy (people being directly compensated for the fruits of their activity?)) then the data "collapses" (called a "wave function" !) into a minimum amount (of space), it would seem.

Well, first off, no one can claim success before they can show success to everyone in regards to the MDF Challenge.

I just took a count of all 16 instances running here and the sum is less than twice that of day one. The projection for finding all 40-bit values is 1.6 years (as of this morning).

Now many will say "wow, that is too long" but I point out that is only a portion of the amount of time I have been at it and even a smaller portion of time this Challenge has stood.

All I have come up with in 10 years is being used in this effort. If this doesn't fly sure I can try other variants but I know of nothing else that offers even a chance of encoding this "Random Data" file smaller.

@Alan

Honestly, I have to read up on the P vs NP to know what we are actually talking about. Let me Google http://en.wikipedia.org/wiki/P_versus_NP_problem

So, I didn't have the concept correct in my mind after all!

Well Alan, if this takes 1.6 years to run on an 8-core AMD Bulldozer CPU, and it is actually 16 programs running with different variables, then I don't think it counts as P vs NP.

I'm open to read more about it but my understanding of P vs NP is limited.

Again, this effort could fail to provide all values in two ways.

The first is that not all values will be found before the point where each of the 16 instances must stand alone due to coding restrictions; the second is that, after all instances have been generated, no one system has all the values for MDF.

As of this morning, and this is really early to make a judgement, it looks like all 16 together providing all values will fail. The projection that any one instance will generate all values still has a green light.

I stress, this is very very early and since I am getting results in ranges that never generated results before I have hope this version will hit "sections" where it will generate many matches.

I have seen the outputs surge for one and lag for others, only to be rather equal as time went on. I see the same this morning, with one surging ahead and another catching up, and yet another not generating anything since 2 pm yesterday.

Still, this is worth exploring, and are we not ready for some good news in this forum?

So Alan if I read the Wikipedia page correctly "The P versus NP problem is a major unsolved problem in computer science. Informally, it asks whether every problem whose solution can be quickly verified by a computer can also be quickly solved by a computer."

Quickly is not what this is so I assume this wouldn't qualify.

I was just re-reading some of the posts over the years.. My how we have grown.

I can safely say I knew next to nothing when I decided to try this challenge, and all I know now I have learned along the way. I see my posts reflect that.

I can only wonder what I will know tomorrow and beyond!

When I started I didn't know how to set a bit. I didn't know about GMP, nor could I have used it on my computer at the time, a Commodore Amiga 2000, so I wrote my own big-number addition and subtraction functions.

Since then I have written several hundred programs which have explored ideas I have had and also served as a teacher when things failed or simply exposed results to me.

I have a feeling I will be doing something related to data encoding until I die.

Alan,

I know you are reluctant to give out any of your results, so nobody really can comment on your incredible proofs.

But I'm curious as to whether you have anything to say about Gödel's incompleteness theorems, or perhaps Turing's proof showing the halting problem is undecidable?

It might sound like I'm trolling, but really, I'm just curious as to how far you are willing to stake your claims. There is of course no accepted proof on P vs. NP, so your assertions in those areas are open to debate, but Gödel and Turing are regarded as the solid foundations of modern computing and logic - if you were to refute those it would obviously have enormous consequences.

- Mark

Just on the safe side I wrote you Mark to explain myself and my claims.

Update here..

Much the same today. I am recording the rate of matchings and it's steady.

The system that was generating the most yesterday is still in the lead so those variables seem to be better. What is Best? That is a guess.

I have a design in mind for a better number generator. There are some repeats of sequence in the current generator and that is due to the nature of the maths so I am working on overcoming any duplicate generations. If the new effort is better I may start the run over.

For now any data I see will help me decide what could improve the chances of encoding the whole MDF.

The amount compressed so far is 0.463582%

It's still going!

Hi Mark, some quick ideas / research, I found:

what I do could be described as "higher dimensional mathematics" (or curiously, 'physics'!) (not "conventional maths in higher dimensions").

It appears that: "Godel's Incompleteness Theorem" occurs when mathematics meets what I could call "negative mathematics".

You end out with a stop (or "halting" ( ! ))

The Turing "Halting problem" appears to possibly be when mathematics cannot "complete" itself, it tends to split!

So these two problems appear to be closely related: the "Incompleteness" occurs when usual mathematics "halts" (when it encounters higher dimensional mathematics to such an extent that it causes it to reach a "halting"), and the "Halting" occurs when regular mathematics splits and cannot complete itself.

In more detail:

(referring to Wikipedia for descriptions)

I have discovered (apparently) what I could call "hyperspace _______________" (not written here!)

I could call these "true natural numbers".

What I am about to say may shock you (It is not for me to deny an apparent discovery just because it mimics something in a supposedly fictional story). It appears that there may be a solution to "Godel's Incompleteness Theorem".

The strange thing is that the phrase from "The Hitchhiker's Guide To The Galaxy" may apply here! That is, "The answer is 42, but what is the question?"

It appears that, if you have AT LEAST 42 axioms, you then could have 43 (or 44, or who knows ? That is, "the answer is '42", but what is the question?" The "proof" (for each axiom) involves a exchange between "natural numbers" and "true natural numbers". (The exact reason for "42" began with a thought of it would be "20 axioms" then I realised I had to have 40, then 2 more or not to more (the whole system becomes like a _____________________________ so you end out with mathematics in a higher dimension so-to-speak))(This is (apparently, I possibly could suppose) "Higgs field quantisation" (a "Higgs Field" I found to fit the concept of "spatial span" (or spatial extent)' ("Higgs Boson" as "structure" ("quantum mechanics" as "analysis of structure")).

Key phrase (which applies to my apparent solution to the million digit challenge) The ________________________________ (too sensitive to say here). (It is to do with what happens when one encounters an object)(An object occupies a region in (of) space). (Related subjects are "the supposed rotation mass problem re: large distant distended objects (galaxies), and maybe the "Sloane digital Sky survey" claim of a galaxy spatial distribution structure called (the) "Sloane wall" (possibly, I guess)))(differentiation of natural space telescopes)(Negative primary telescope 'mirror')(when a mirror becomes an (accidental sort of I guess...) interferometer...( ? ))

Looking at Wikipedia, re: "Turing machine Halting problem":

I cannot say too much here (Samsung may be interested..!)) , I can say:

Something to do with "fractals" re: the so-called Turing machine (5 Turing machines if you want to find an answer, I could say) Re: (one) Turing machine: the data can be viewed as a ________________________ in (maybe in extreme) higher dimension space.

Inverse-computing.... (anti-computing) .....

forms a "black hole" i.e. "some thing" .................

it IS 'the "unlimited memory"....

that is: _________________________________

(to do with fractals)

If you "alter the second symbol" on the "tape" on the Turing machine, the machine is defined as different (from the data on the "tape")(Note the question can be asked "WHAT symbols elsewhere on the tape?)(!!!!!!!!!))

(The approach of the text would give you a limit on math....)

(so the answer is you don't have to identify individual "symbols" on the tape)(at this point in the activity)

________________________________________________

Well it seems like that is "solvable" also

it doesn't "halt" neither does it "not halt", it distills hyperspace in a string of "infinite" "dimensions"

As many clues as I dare give!

Muchas gracias

The irony of these apparent, possible results is:

So mathematics need not be incomplete as Godel said, but you may have to have at least 42 axioms, and it is up to you to complete how many more than 42 axioms.

The "Turing Halting problem" also involves this strange 42 as I found 5 plus 37 with an uncertainty in the 7 (and/or in the 5, but not both (that may seem wrong but there is a way of describing this I reckon)) as computability itself reaches a question (so computers start to resemble "data" (they continuously 'break' into "bits" (!)) If you have 5 Turing machines "in harmony" that would seem to ensure continuous separation of data from machine (it could run "forever" one might suppose so-to-say but would at some stage be repeating things it had done previously though not necessarily in the same sequence or whatever).

-----------

the "birthday paradox" is core to these considerations

just discovered how all this links to old-fashioned math

654 @ 8:48 AM.. Slept in on a Sunday

0.787487 % Total count between all 16. No cross-check for duplicates but that's okay for a general report.

Getting closer to 1.0%

What to do today. I have an idea for a better number generator so I can work on that.

Good Luck Challenge People!

Ah, Sitting down with a hot cup of Coffee.

I thought to chat about Infinite Compression.

It may be possible that some form of infinite compression can be done with the encoding I designed.

Theory : Given the encoding I am working with and assuming there exists a system which will generate all values for a subset of the word size.

Meaning, if we have a smaller number of symbols to encode than the complete word size, such as a 2^40 word-size set, there will be a system that generates all the values in that set and which encodes them to a smaller word set.

Given then that if that is true, any "file" within that define would be encodable to a smaller file.

Then, extending the same assumption that all values can be found in some system's output, "Recursive Compression" or "Recursive Encoding" is possible.

So perhaps I am working on the first Infinite Compression Algorithm in history that has a chance of succeeding?

Naturally there is much more work to do but this is about the possibility.

Comments ? Insights?

Will anyone entertain the possibility or are you all simply daft and exploitive?

It's a valid question.

Ernst said:

"So perhaps I am working on the first Infinite Compression Algorithm in history that has a chance of succeeding?"

No you're not. Infinite compression is impossible.

It's not just a matter of not having read and understood even a tiny bit of the ton of serious work that's gone into this subject. It's more importantly a total failure to logically think about compression.

I know you're going to call me 'daft' or 'hater' or something like that, but here's a fact:

I say you can't make an infinite compression algorithm. The one, and only way to show me that I'm wrong is to make such an algorithm. But here's the thing, I wager all my money that you can't do it. Because infinite compression is mathematically impossible.

Cool, a reply!

Really you will give me money? How much money are we talking about?

To be fair here, I am finding I can encode 40-bit values with about 30 bits right now. The method is based on addition, multiplication, division and subtraction.

So, I will argue that Recursive Compression is Possible since once we are working with numbers it's a whole new game.

So, how much money are we talking about?

Also I feel bad about "picking a fight"

with "daft" and such now. My Bad.

So, what is your define of recursive compression Zen? I believe I understand the concept but I don't know how you define it. Helps to be on the same page.

Also thanks for the conversation. Again I feel bad for the Daft stuff.

I am needing a new motorized bicycle since California has banned the importing of two cycle engines and I need to build a new 4-Stroke bike.

Would run around $1200 to $1500 and I am really poor.

I would consider seeing if I can recursively compress a small file, say around 4000 bits. I say small because the catch with this encoding is time. The 16 different instances of the encoder running right now on the AMD Bulldozer 8-core will take about 18 months to run far enough for me to know if my selection of variables works.

A file that is small such as a 4000 or so bit file should be easier to do.

For a new 4-Stroke motorized bicycle I would do my best to recursively encode a small file and prove or disprove this algorithm can "recursively compress or encode" depending on which is a correct term for what I am doing..

I have a Tri-Core Laptop I can run it on.

So, this is exciting! I hope you are phat with cash friend! I'm a poor man and money is very tempting. I might even suspend encoding MDF and use 10 cores just so I can increase the number of systems I will search.

Again small file equals less time. What the data is , is irrelevant.

Thanks Zen!

Ernst, you seem very excited by the idea of money.

In order for anyone (including me) to give you money, you would have to test your method under conditions fair for both parties.

Which is exactly what this challenge has been all about. The approximately 405 kB file provided by Mark in this challenge is exactly that - a fair test of any revolutionary compression method. Note that Mark even has a special mention for Recursive Compression in his challenge.

If you are so certain your method works, complete Mark's challenge. I'm not that rich, and frankly you sound better off than I, but I will meet Mark's $100 prize.

Not satisfied? Still think your work is worth more money? Download Mark's file, and do your own private testing, with compression/decompression. You don't need to show me, or anyone else here your program, your method, or even your results.

Simply complete the challenge in private, and contact any major IT company of your choice. Tell them you have a new compression method that is able to complete the Million Digit Challenge, and ask to demonstrate and sell it to them, working out privacy details.

You don't need to take money from us - the poor slobs who're trying to tell you why you're wrong. If you're convinced you have a revolutionary compression method, compressing and decompressing the MRD will be enough to get the attention of at least some major tech companies.

Okay Zen, Let me explain what is real here.

I have worked over 10 years on the Million Digit Challenge.

In those 10 years I have taught myself and come to understand by observation what works and what doesn't.

As of late, I have cycled back to my earliest encoder effort and have incorporated my most current encoding discovery.

I am currently running 16 instances of a program that is looking for smaller integers that generate larger integers through this latest software.

Now in order to do a recursive compression I will need to gather up all of the found integers and make a file from them. They are 40-bit values from the Million Digit File. This means as well, that, I would suspend searching for more integers for the million digit file for that effort.

Now Zen, it looks like you are a jerk, with the offer of "all your money", and when you realized you might have to pay up you are now wiggling out of your own offer.

So, I conclude you thought you could proverbially "kick my ass" for free, draw attention to yourself, and possibly gain admiration from idiots like yourself who make offers then run away when they might lose.

As of this morning my 16 instances report a "found count total" of 1.058411% of the Million Digit File. Duplicate finds not accounted for at this time.

@Everyone; I could pause searching and see if I can recursively encode what I have encoded so far. Perhaps I can wait until I have 1000 unique MDF 40-bit values encoded that can be made into a MDF subset file for the purpose of recursive compression.

Again I remind that I am working with numbers so traditional data compression defines may not always apply.

Also, again, I restate that demonstrating decoding is an open invitation here. I cannot afford to travel far but I will show up and "prove" my claims any time.

Besides I'd like to meet Compression friends; even you Zen except the dinner check is on you and I like Steak at Center Street Grill so bring your money.

The absolute truth on this compression effort here is that it is unclear if I will find all the 40-bit values that make up the Million Digit File with the variables I picked, even with 16 different instances running 16 different configurations.

It will also be about 18 months more before I can decide if the goal is possible for the variables I have picked.

It is not unusual to get closer one effort at a time. Also I figure it is welcome news that compressing the Million Digit File is not impossible.

We all have to admit that it's a big deal to all us Comp.Compression folks. It's a part of our culture.

Yes 18 months is a long time but it is a functioning data-encoder where "Random" or "Dense" data doesn't matter. Besides once I have more time I may be able to advance the search function to be better and faster.

I welcome others to report they are compressing the "MDF", Especially you Zen.

Time to get ready for work. I was happy to learn we have 4 weeks more work this year than was expected. I have my eye on a new hard-drive now!

Good Luck Challenge People

Ernst, I'll be a jerk when you prove your method is real.

Until then, everything I've said is correct.

And please stop proclaiming yourself as 'poor' when you can afford an 8-core AMD PC. $100 is a significant amount to some people, though obviously not to you, and most of us don't have huge amounts of money saved up.

My wager was for you achieving infinite compression, which is what you claimed to be working on. Frankly, it's as safe a bet as betting on a perpetual motion machine.

Again, you want to prove any sort of attempt at infinite compression, do so on Mark's data.

> Again I remind that I am working with numbers so traditional data compression defines may not always apply.

I hate double posting, but this struck me as funny because it's specifically addressed in the comp.compression FAQ on compressing random data, link: http://www.faqs.org/faqs/compression-faq/part1/section-8.html

The counting argument is described there, and I quote:

"Note that no assumption is made about the compression algorithm. The proof applies to *any* algorithm, including those using an external dictionary, or repeated application of another algorithm, or combination of different algorithms, or representation of the data as formulas, etc... All schemes are subject to the counting argument."

Notice the "...or representation of the data as formulas..." ?
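The counting argument the FAQ describes is pure pigeonhole, and it can even be verified exhaustively for tiny sizes. A small demonstration (mine, not from the FAQ):

```python
# Pigeonhole check: there are 2**n bit strings of length n, but only
# 2**n - 1 strings of length strictly less than n, so no lossless
# compressor can map every n-bit input to a distinct shorter output.
def shorter_strings(n: int) -> int:
    """Count bit strings of length 0 .. n-1."""
    return sum(2 ** k for k in range(n))

for n in range(1, 9):
    inputs, outputs = 2 ** n, shorter_strings(n)
    assert outputs == inputs - 1  # always exactly one output short
    print(n, inputs, outputs)
```

At least one input always has no shorter output available, no matter what the algorithm is - which is why "repeated application of another algorithm" cannot help either.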

Yesterday I realized Zen had a good point.

If an IT company (wink wink, Google) were interested in me, I should make it reasonable to contact me.

So here is my Email Ernst_Berg@Sbcglobal.net Please use the Title "Att: Data Compression"

Perhaps there is no direction but up from here on out. In the past three years I have earned less than $12,000 after the factory where I worked many years closed down.

It's a nice idea to say the least. Thanks Zen.

@Zen

I understand the counting argument.

To overcome the counting argument symbols would need to map to more than one element in general. Mapping to more than one element would require some sort of contextual reference.

To be fair here I am looking for one system or maybe a set of systems that have all 83,049 40-bit values of MDF in it (them).

I haven't found all the values yet, so there is still a chance of failure to find all of them; however, I am running 16 different instances which are generating different symbols for the same file, so yes, mapping is flexible and contextual, and this run is nowhere near out of bounds yet.

Perhaps the net result will be short of the total file elements but it will NOT have failed to compress the ones it found.

Don't ask for a proof. That is something no one has or can do at least not yet.

I don't know of anyone that is expert with the Collatz Conjecture or it would have been proven already. I can only rely on my experiences of working with it starting in 1992; 20 years worth. How it works or what limitations it has is a fair topic of conversation.

What can be said to be true and basic is that compression of the whole MDF is possible and so is recursive compression given the context of individual elements and perhaps extending to a whole set such as the encoded MDF.

What the data means, means nothing here. That it is a number generated by a file compressor program or a number encoded by this encoding system doesn't matter. All that matters is that for a given system some input-value must generate it.

I trust you all realize that in over 10 years I have never claimed any successes in compressing the MDF. I did not take sharing the news lightly. This means I must prove what I am saying and so I restate I will any time and the invitation is open. Just let me know. I would enjoy meeting Compression People.

The next thing for me here is to sort through and see how many unique values have been found. And, Keep On Keeping On.

Zen, why not see for yourself. Download software or write your own. Perhaps you will have a better design than I. I can live with that.

Respectfully, Ernst_Berg@Sbcglobal.net

@Ernst

I really like your idea of using a number generator to generate the values of MDF... I've set up a system based on a re-constructable non-periodic non-predictable random number generator, and it happily munches through the file, reporting a match every 8-9 seconds (average). It's different from your idea in that it doesn't *compress* the file (as such), but rather *transform* the data to a same-sized but denser, (hopefully) compressible form!

It now says it will run for 44 days, a long time, but better than your 18 months! ;) Sometimes it produces 2 or even 3 matches in quick succession, and sometimes it doesn't find a match in 15 seconds...

The program is set up to produce a beep when a match is found, so I can detect matches even if I'm surfing the web... it's rather annoying, but if it produces meaningful data in an approximately 4-hour time-frame, I will remove the beep routine and set up a separate workstation in the basement... I have several old computers there; why not give one of them a second chance?
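For what it's worth, here is a minimal sketch of the shape of such a scan (my own reconstruction, not Vacek's actual program; the byte-level matching and seed value are assumptions): seed a deterministic PRNG, walk the file, and log every offset where the generator's next output equals the byte found there. The (seed, offset) pairs are what make the run "re-constructable."

```python
import random

def scan_for_matches(data, seed):
    """Log every offset where the PRNG's next byte equals the file's byte."""
    rng = random.Random(seed)  # deterministic: the same seed replays the same stream
    return [i for i, b in enumerate(data) if rng.randrange(256) == b]

# A hit is expected roughly once per 256 bytes on average; re-running with
# the same seed reproduces the identical hit list.
```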

Vacek Nules

Nice to read that you find my sharing useful, and cool that you shared your own!

I figured it was safe to share what I have, and thought others might see more than I do. Naturally these things are not the whole effort but the basic direction.

@anyone if you are working with the dynamic equation(s) such as the Collatz feel free to email me if you like. I'd be happy to chat.

@Vacek Nules Yeah, that 18 months and possibly not finding all values is a serious concern here. I can't do much in the programming department right now because I am dead tired every day. Crush season is hard work daily, and after a month people start dragging. Zombie mode is upon me and writing code is a no-go mentally. The season will end in 4 weeks or less. Perhaps I can come up with a better generator. I assume some duplication is happening, but in studying the MDF I have seen that it is "random" in ways I can't really explain, except to say it always manages to have unique elements any way I work it.

I expect to see that with even this effort and I feel I have except this approach is finding far more matches for a given range. That doesn't mean it can't be improved.

Well, I lack an advanced understanding of how these dynamic equations relate to the counting numbers. It would be nice to be able to skip the brute-force approach.

On transforming the file: I have had some luck. What always seems to transfer with it is the structure. I didn't find a reversible way to map to the same values. Does your "generator" map to the same value? I mean, cause repeats of a value? That would be cool.

Well, speaking of 18 months, I see a total count, duplicates not accounted for, of 1.322111%.

It's a long time with an uncertain outcome, but it's working, and that still amazes me.

Worst case is I will need to run a second set of 16 after this, in hopes of finding the rest of the values. Making a smaller file of what is missing would speed things up.

Good Luck on your transform. A cool basement sounds great for the computer! I have had to keep the cooler running here.

Oh and when I can I will be creating a subset of 1000 "found and unique" values in order to attempt "Recursive Compression." Hopefully there will be a demonstration and I'll be at the ready.

With this machine running 16 instances I failed to notice that some of the words I typed didn't show in the post.

"I expect to see that with even this effort and I feel I have except this approach is finding far more matches for a given range. That doesn't mean it can't be improved."

That makes no sense..

I was intending to convey that its "random structure" always comes through, and even doing this "matching," that structure determines the matching rate.

It should be possible to match more than one element in a given cycle, but I have yet to see that in any one instance. That points back to the MDF and its structure.

Remember it could be 4 times a really big prime number. We could be essentially trying to compress a large prime number.

@Ernst

I went down to the basement and brought one of the ancient PCs upstairs, along with a bag of computer junk... it seems I am the assembly point of all the family's IT junk. :) I'm now wielding my trusty screwdriver to build a dedicated machine for the task! BTW, I found interesting things down there... a vacuum cleaner I didn't know was there, a fried printer/copier, an icebox, a boxful of schoolbooks, etc. I can only hope that this computer will be a plus for me... I must pray for no blackouts to occur! :)

Hi Mark,

worked out a beautiful, apparent, somewhat rough proof of the "solution" to "Godel's Incompleteness Theorem" (and apparently found a "Turing machine" along the way!)

42 axioms does it (apparently).

Also worked out the Turing Halting problem apparent solution again

It could be stated that a "hyperspace bypass" (or "theory of everything") solves the (so-called) "counting argument" (as it could be said that it is a "no counting" argument).

@Vacek Nules A UPS is the only way to go!!!

Saved me here several times already.

@Alan Did you really write a proof? Wow, I have some maths to publish and don't even know where to start to write a proof. Good job!

I'm setting off to try my hand at this challenge. I admit that my approach is not so exotic as some of those described above.

I think I may have a good shot at a solution, though I hesitate to describe it until I can squeeze some reasonable space out of (or into) the 415,241-byte room.

But this is where it gets dicey. I've never much worried about resulting program size, and through the years it's always been something outside my control anyway. It would seem that a 200k+ executable is not uncommon, but squeezing 200k out of that file is pretty unlikely.

I just worry that after solving a compression/decompression scheme for this file, I'll be stuck with a new, and only tangentially related, problem: how to find a compiler that will fit my algorithm into the available space.

I expect my decompressor source code could be reasonably small, but I would rather not release that, since it would give away the entire technique, and I hope to hang onto that for proprietary purposes.

Is it possible that compressing the file is trivial, but fitting the decompressor into 40k is the real challenge?

Oh well, I guess I'll burn that bridge when I get to it.

Welcome CWCunningham

Fresh eyes Fresh Ideas!

My experience is that however the data is parsed, transformed or segmented, the uniqueness of MDF elements is always a factor.

I agree with you that having a total combined size less than the file is the "last twist on the lime slice."

This Cocktail's history, as described in Mark's introduction, is that of an objective: silencing the banter of more than a decade ago on comp.compression.

Living in that shadow, I believe it can be done. I would suggest that it's a harsh qualification, but in a broader sense it aims to limit hidden data, which is basically cheating.

I agree that a Linux ELF program can be compressed from, say, 10k to 3-ish K, so that is allowed.

All in all, if there is a real solution which can be defended as "not a hoax or cheat," I would stand on the side of relaxing the total size requirement. Seriously, if a new technology can be demonstrated, why be so rigid about a $100 prize?

I've found this "Challenge" and the MDF data itself to be a focus for my own interests. I am not a true Data-Compression person. I have been interested in things like the Collatz Conjecture and things that cycle "dynamic equations" and such.

This Challenge seemed interesting in that I could "play with real information," whatever that was, I thought. I could do what others above my station were doing, and maybe I could win. Nice idea.

I could win indeed.

Again welcome and good luck! There is still time to try!

I have no intentions of cheating. My ideas may be new, and that might be their only slim chance of success. My guess is that there are a lot of smarter people than I, who have pondered this challenge.

I hesitate to describe my approach. I'm currently working on a commercial application that required some particular technology. In developing that technology, I began to see that it could be an aid to compression. So I come at this as a complete newb to the art of compression.

The good news is that if my approach is viable, I should be able to discover that without too much trouble. My problem is that I have been using my own homegrown libraries for years in the development of software, and so compression (if I can achieve it) will be done using generic tools that alone comprise 40k. So my worst case performance would have to achieve 10% size savings in order to qualify as a "winner". But what do I do if I can only shave 20k ... or 10k? I end up with a whole different compression problem that is not compression per se, but instead, busy work to replace a set of tools with another identical set of tools to accomplish a task that I've already accomplished.

Is there a scoreboard anywhere that shows the best efforts to date?

Or is it the case that no-one has ever achieved any compression at all?

on February 16th, 2009 at 8:25 am, Mark Nelson said:

@mike40033:

As I've said before, the whole point of this is not to finesse the rules, but actually achieve the goal, so I'm not going to go crazy on definitions.

Anyone who really thinks they have won the prize should be able to have a program that works on the real file, then works again on the same file encoded with a randomly chosen key using DES or whatever.

- Mark

=================

Ouch, this sounds like changing the rules (or perhaps I misunderstand)

My plan is to use a compressor that analyzes *THIS* million random digit file and constructs a plan to squeeze "X" data out. This 'plan' would inform the technique to decompress the result (and define the maximum size of the decompressor).

I believe that (if I can achieve the necessary compression) I could do it again for an encrypted version, but the whole process would have to be redone, including the creation of a decompressor designed explicitly for *THAT* compressed file.

My point being that the compression techniques would be carefully chosen to match the input, and so the decompression technique will not exist until the compression technique is selected. If the input is changed, the compression technique has to be changed (more than likely) and a new decompressor built.

If that is breaking the rules, let me know now.

on September 16th, 2012 at 11:25 am, CWCunningham said:

Is there a scoreboard anywhere that shows the best efforts to date?

Or is it the case that no-one has ever achieved any compression at all?

-------------

I believe "do or die" is the law of the land.

Partials or claims carry little weight. I expect that one must present a complete solution before any acknowledgement.

If you wish to believe my posts then some success is occurring here.

I've adapted a well-studied dynamic equation called the Collatz Conjecture (for one name) to generate numbers that are matched to the MDF. This is allowing "compression" since generating a sample word of the MDF takes fewer bits than the sample word's own size.
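A minimal sketch of the search as Ernst describes it (my reconstruction, not his program; the seed range, step limit, and the (seed, step) code format are assumptions): run the Collatz map from small seeds and record which target words appear along the trajectories.

```python
def collatz_path(n, limit=200):
    """Yield successive values of the Collatz map starting at n."""
    for _ in range(limit):
        yield n
        if n == 1:
            return
        n = 3 * n + 1 if n % 2 else n // 2

def find_matches(targets, max_seed):
    """Map each matched target value to the (seed, step) pair that produced it."""
    matches = {}
    for seed in range(2, max_seed):
        for step, value in enumerate(collatz_path(seed)):
            if value in targets and value not in matches:
                matches[value] = (seed, step)
    return matches

# Example: 27 is the smallest seed whose (famously long) trajectory
# reaches the peak value 9232.
found = find_matches({9232}, max_seed=30)
```

Whether a (seed, step) pair actually costs fewer bits than the 40-bit word it reproduces is exactly the question the counting argument raises, of course; the sketch only shows the matching loop.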

The potential failure is that I will not "find" all the values I need within the bounds defined for this version.

I am working 7 days a week now for a couple three weeks more, and then I will sit down and extract the first 1000 unique matches (with 16 instances running different configurations, some duplicate matches happen), take that set, make a file, and attempt to "re-encode" the first output.

I don't see why it wouldn't fly, since this approach doesn't depend on frequencies but rather values.

A couple of other members have stated they have solutions in the works but little else.

Well, coffee in hand, I see there are a total of 1535 matches reported this morning. Things are running consistently, with over 100 matches added every day, but this pace is slower than first calculated.

I am now questioning the time-frame and totals possible.

Perhaps the number generator needs to be better. I know some duplicates are being generated but that is the nature of the maths.

Well, this is the best I have come up with so it can only improve.

I have seen the MDF's structure and I know just how hard it is to deal with, so matching it consistently is a miracle in itself.

As I underst(an|oo)d it, success is defined as:

1) Deliver a compressed file and additional tools sufficient to decompress that file into the MRD file.

2) The combined size of the compressed file and tools is not allowed to exceed the size of the MRD file.

3) Cheating is not allowed.

But what that comment seems to suggest is that there is an additional requirement to do the same thing with some arbitrary additional file using the same tool for decompression. I just wonder if that breaks the spirit of the challenge.

No, there is no additional requirement. If you do 1, 2, and 3 you have passed.

I need to get my explanation out. The comment about being able to compress a second file would only apply to someone who creates a general purpose compressor that works on random data. The original requirements would work for someone who creates a tuned program that outputs the million digit file.

- Mark

Mark,

(what is written here involves apparent discoveries by an amateur, not verified by industry people yet)

thank you for drawing my attention to Godel's so-called Incompleteness Theorem (higher dimension 'quantisation' in 'mass' ) and the Turing Halting problem (mass quantisation in higher dimension space).

Now that i have apparently defeated both Godel and Turing, perhaps you would care to please follow through on your claim that there would be "enormous consequences": by finding me a commercial deal for the applications re: apparently solving the dream of theoretical physicists - the "theory of everything" - obvious applications in data storage, telecommunications, data pattern finding….

No deal no fee and no expenses, but a deal that I agree with could be worth a good commission (no guarantees)…

Thanks Earnts- i have "proofs" galore

and it is SO COOL even the pros will go bananas

! (?)

cheers

Alan

Sorry, correct spelling is "Ernst" (by the way: the "Higgs Boson" (according to my understanding), IS 'structure'; the "Higgs field" is "spatial extent" and a group of protons is itself a "Higgs field"…………...

Alan, it's been over 2 years since you first posted here making these extraordinary claims, and you've given no proof whatsoever.

You've had 2 years to contact some scientific journals to try and publish your results, or to even try to contact *any* company to show them what you've done. And yet you haven't.

If you've done even ONE of the things you claim you've done (and I highly doubt you're even close), go send out a few emails to ask to demonstrate your findings, under whatever terms you want.

But I'm sure you won't. You're delusional. You've discovered nothing, and just spit out what you think are scientific sounding terms in a sad attempt for attention. Mark called it out right 2 years ago - 20 years from now you'll still just be posting around here and whining that you need money, and never actually doing anything about it. Because at the end, you've done nothing worth any money.

@Zen

Given the assumption that "re-compression" is a reality, how many "levels" of re-compression would satisfy as proof?

In about three or four weeks this seasonal job of mine will be over. I assume I will have time to code again so I will gather up the first 1000 unique matches and craft a subset of MDF.

I will share that and the positions where these values were taken from out of the Million Digit File (MDF).

I will then attempt to "compress" that output.

So, how many "levels" will qualify and satisfy as recursive compression? I assume three. Is three enough for you? Is Three enough for us all?

Once that is complete, I have a California State University nearby, and perhaps I can demonstrate to faculty there.

You will then be welcome to reward me, just as my failure to achieve would demean me. I am confident I will not fail. Let's see how long it takes to process just 1000 40-bit values encoded to 35 bits, down to less than 35, and then again less than that.

Update: Good Morning (as I type ) coffee in hand. Tuesday, September 18 2012 1645 @ 04:28 AM 1.980758%

The pace is consistent but not the pace I hoped for. The pace is pointing to 24 months time to elapse for a complete run.

So, the answer then is to spend some time, soon, improving the program. I have an idea that could eliminate some duplicate efforts.

Even if I do discontinue this run for a different construct and adopt a newer design these results will still be valid and any "recursive compression" will still decode.

@Alan:

This presumes that you have shown the halting problem to be decidable - in other words, that for a given program, you can determine whether it will run to completion or not.

This is of course in contradiction to accepted proof by Turing, so you will have a very tough time with community of scholars who care about these things.

But on to monetization. If I had solved the halting problem, I believe my first task would be to collect a million bucks from the Clay Mathematics Institute by resolving the Riemann Hypothesis.

To do so, I would write a program that iterates through the non-trivial zeros of the Riemann Zeta function, looking for zeros whose real part is not 1/2. If my program found such a zero, it would stop, and by stopping, it would have disproved the Riemann Hypothesis.

Currently, we have no way of knowing whether such a program would halt - doing so requires some sort of proof, since there is no general solution to tell if a program will halt. If you have such a method, you can then assert whether the program will halt or not, and resolve the question.
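Mark's argument has the following shape (a conceptual sketch only: a `halts()` oracle is hypothetical and does not exist, and a real implementation would also need a way to enumerate zeta zeros): any conjecture refutable by an exhaustive counterexample search becomes decidable if halting is decidable.

```python
def counterexample_search(is_counterexample):
    """Loop over candidates forever; halt exactly when a counterexample exists."""
    n = 0
    while True:
        if is_counterexample(n):
            return n
        n += 1

# With a (hypothetical) halting oracle, asking whether this search halts
# settles the conjecture without ever running the loop. Demonstration with
# a decidable stand-in predicate:
assert counterexample_search(lambda n: n * n > 50) == 8
```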

I've given you all this information, but please keep in mind that I don't believe you have really solved this problem. I am not acting as a troll, trying to draw you into some pointless argument. You make a lot of assertions, but you never provide any backup, so I won't get any pleasure out of arguing with you about it.

The truth, of course, is that I am not actually qualified anyway to make a judgement about something as exotic as a proof of the decidability of the halting problem - I would be way too easy to fool one way or another. I have no particular mathematical credentials.

I wish I did, I have a proof bubbling around in my head showing that real numbers are uncountable, and I think it is simpler than Cantor's diagonal argument!

- Mark

Excuse me Zen.....

who are you kidding?

You know as well as I do, that "experts" do not react nicely to amateurs. It can be very difficult to break through the extremely dense shield of negativity. What I have apparently discovered has potential commercial value, so I do not publish it. But I do pursue commercial deals. However, being poor takes up a lot of time.

A scientist has signed a non-disclosure document (in fact, a number of people have signed non-disclosure documents with regards to various theoretical inventions that I have). But as I said, poverty takes up a lot of time. Eventually I hope to have a document ready for the scientist to analyse.

But why wait? Some of the main discoveries can be explained very quickly, to the right people- i.e. serious, genuine, investors/start-up founders whatever.

Why would i publish in a scientific journal? And lose patentability? And pay massive page fees that i cannot afford? And since when do scientific journals embrace revolution? Who are you kidding?

Re: contacting companies - I have contacted companies, but as you surely know, they tend to assume that I am rich, university-connected or from another company. They tend not to be geared towards amateurs. If contacting Samsung, IBM or whoever is so easy - why don't you contact them? If this leads to a successful deal you could get a nice commission (no guarantees)?

Hi Mark,

I referred to the "Turing Halting problem" as "mass quantisation in higher dimension space". It is a long time since I analysed the Riemann hypothesis- i do not have my apparent solution/ insight here.

After looking at Wikipedia I'm somewhat guessing its triangulating space-time in 5 dimensions (also could call this "the prime number hypothesis" or the "real-time" "distribution" (or " UN-distribution ") of /"Prime" numbers (i.e. they are ONLY "prime" if you want them to be) (they are Zeta functions in themselves)).

Ironically - (life meets art!) "Optional primes" could be called TRANSFORMERS - pivots on a so-called number line that "flips" the (so-called) number line into a number of higher dimensional regions.

A computer program "stops" when it starts- when it reaches maximum "saturation" of data analytic capability- when ____________________________(censored)

A complex number is a 2-d number- it could be argued that the "real part" IS a Zeta function (or mass factorisation: a non-trivial zero because the "zero" involves "mass" (a region of overlapping)). Looking for (so-called)non-trivial zeroes" in a Riemann Zeta function could be seen as "looking for __________________________________________________ (censored)"

You end out with "______________________" (censored)

(Hilbert's second problem QUANTISED (I'm somewhat guessing/ instinct)(a fixed direction IN SPACE identified.....!!!!!)

An IMAGINARY computer program!!!!!!!!!

If your computer program stopped as you say, you would have both proved and disproved the Riemann hypothesis (because WHEN your computer program "stopped", it would "start"- I guess you would have "broken" (or ?) your computer program into two (or more) "chunks" - it would look more like ?data

I interpret Hilbert's second problem as 'How do you define a space vector" (or "how do you define a (perfectly) integrated TIME) or how do you define a _________________________________________

WHICH

I

HAVE APPARENTLY SOLVED

("Theory of everything" apparently can solve this, or that is any viable method to navigate completely about a group of fixed or semi-fixed objects in a region of "space" )

(Otherwise known one may say as a "Hyperspace bypass")

the group of objects could be seen as a optional prime number (a space-computed 'field' ...........)

@Alan

When you use terms like "Could" I suspect you are not so confident about "proof."

I for one really hope you are sincere about your proof(s).

I've come to the conclusion that it's not very productive to reduce the file down to less than 3 bits.

At 2 bits, it's impossible to tell what it might decompress as. In one run, it appeared to decompress into a popular Commodore 64 game, though I couldn't find enough single sided 5-1/4" disks to actually test it.

On the few occasions that I was able to get it down to 0 bits, my decompressor complained bitterly and refused to do anything useful ... though these were some of the fastest runs of that program.

So my advice to anyone attempting this, once you get down to 3 bits, call it a day.

Seriously though, I'm only joking. Have a nice weekend.

@ CW

Everything requires a reference. To drop down to one bit (and yes, I know a way) still requires an external reference, so there isn't a way to do it without storing information externally.

The nifty thing is that transformations are fluid!

Have a nice weekend!

Cunningham, I'm sure all the compression experts would agree that reducing a file down to 3 bits is trivial, really, and the only challenge is, as you said, uncompressing successfully from 2 bits.

On a side note, if you happen to get any old DOS/Atari 5200 games in your decompression experiments, I'd be willing to assist you in testing them.

I'm also convinced that you're posting from the future, as it's only Thursday here... but Time Travel is obviously trivial once you get Recursive Compression figured out.

@Ernst,

I love that Idea!!!

If and when I have this puzzle solved, I'm going to submit a 1 bit file that represents the compressed MRD file, and an additional file of a few hundred K that's just 'reference' (so the bit knows where to go)

@Zen,

Yes, I am from the future, but my time machine has broken down, which is a shame. I came from the future to show that I've solved the MRD file, but I inadvertently left the thumb drive in my other hand.

Well, there isn't a free lunch in reducing down to, say, 1 bit. You can't magically throw away bits and magically get them back. I wrote that you need an external reference to know what your bits are, but mechanics exist to reduce.

As Zen said people can reduce but "decompressing" is where it gets back to reality.

See the "Base1 encoding I posted about. If we "flip parity" or change what the 1 and 0 mean value wise then a "reduction" is recursive but that requires knowing what the set and reset mean at any point.

I did show how I can "compress" every binary value by one bit.
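The one-bit trick alluded to here is real but not free, and worth spelling out (a sketch of the standard argument, not necessarily Ernst's exact method): every positive integer's binary form starts with a 1, so that bit can be left implicit. But a file is a bit string that may begin with 0, so to apply the trick to arbitrary data you must first prepend a 1 — and the bit you then drop is the bit you just added.

```python
def shorten(n):
    """Drop the implicit leading 1 of a positive integer's binary form."""
    return bin(n)[3:]                   # 13 = 0b1101 -> '101'

def restore(bits):
    """Recover the integer; the leading 1 is knowledge stored outside the data."""
    return int('1' + bits, 2)

assert shorten(13) == '101' and restore('101') == 13

def shorten_file(bitstring):
    """The same trick applied to an arbitrary bit string: a net no-op."""
    return shorten(int('1' + bitstring, 2))

assert shorten_file('0101') == '0101'   # no bit was actually saved
```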

Update: I suspect I have two more weeks of work; possibly three. Then I will code again.

Today's stats are Friday, September 21 2012

1979 @ 04:17 AM 2.382931%

So consistently over 100 found daily.

Zen are you compressing "Random Data?"

Ah Sunday off!

I admit I slept in to 10AM and that is a long time when I get up at 3AM every day.

I am feeling the effects of long days and hard work.

The Grapevine said I have one, maybe two, weeks of work left, but I am hopeful my dedication and loyalty will give me another week's pay.

I am sitting here sipping a mix of instant coffee and a flavored coffee beverage that is sort of like a Starbucks cold coffee drink in flavor. I added fresh cream to get it there.

So, my Challenge friends are away and I have an updated completion time of 1.6 years (1.619441). One thing that seems real is the generation of new matches is consistent with some days better than others but none failing to "find" over a hundred a day.

I've been giving a new design some thought. There are many ways to write a number generator so the logical advance is one that generates more matches per day than this. More than likely the new design will not be the same output as this effort. That means starting over on the generations.

Other than that, I believe the next logical effort is a database. Since I run different instances, each instance is its own set of results, and so knowing which contributes which element from the Million Digit File (MDF) is necessary. Also, knowing when I have 1000 unique elements is necessary too.

On the 1000: I mentioned I will attempt to encode the output of this effort. Most will recognize that attempt as recursive compression. I ask for nothing less as well.

The idea here is that since I am generating values that match up with file-values it is simply an effort to find all the values and not an effort to find qualities in the strings that can be exploited by classical compression techniques.

Once I have all 1000 found, then it will be time to do it again on that output.

Three Levels is what I suggest as proof but I am open to comments on that.

Well, the coffee-drink here is having an effect, so I have chores today I can now get to.

I know that telling people what I am doing may be viewed as unwise, because a good scientist toils in silence to claim his prize, but I still have technology in reserve that will do fine to offer a patent situation, should some IT type wish to offer me a path. I sure would like to end my working days with a house of my own paid for and a little dignity in the bank to lean on.

So it feels good to be actually working on an infinite data compression scheme, albeit a slow approach to infinite data compression.

It can only get faster from here.

Have a great Sunday friends.

Hi Ernst-

just briefly:

when I said "could" - if you think of a Sudoku puzzle as like a prime number in that it has only one solution, it "could be seen as a "optional" prime number" if there was sufficient space in it (e.g. if the rules allowed any one digit to be a repeat of another digit in that group of 9) that allowed for more than one solution. A configuration might be so complicated that it was or was not prime number-like depending on where the space was (hence "P" equalling "NP": a "simple listing procedure" matching a Sudoku single solution with the difference being where the space is (knowing where space is allows "P" to 'swap' so-to-speak (or rotate like two sides of a single phenomenon) with NP).

(NP comes from "listing in space", "P" comes from "listing in time"; BOTH come from: "listing" in "space-time" (or a "hyperspace bypass")

re: my "proofs": they might not satisfy a professional mathematician necessarily, they tend to be logical explanations.

artificial randomness

Okay Alan,

I'm sleepy late this morning... turned off the clock instead of hitting Snooze. Only the kittens expecting food saved me from being late.

Nothing much new here Wednesday, September 26 2012 2552 @ 04:29 3.072885%

I still don't know if Friday is the last day or if I have 2 weeks more work. It will more than likely be obvious one day when I come in and get my last check and the day off paid.

I hope I have 1000 values matched already. A database will be needed to read in all the data and then select the unique matches. That will be better than the total I am now posting, which can contain duplicate matches. Mind you, the "codes" that map to an element are all different, so one element does map to more than one code, and therefore the counting argument has a counter-argument.

I was thinking we could think of it as the "string" or "element" mapping to more than one codeword, or a single "codeword" value mapping to more than one element. Naturally, knowing what system the codeword belongs to is needed. Still, this is a counter to the counting argument.

The goal here is to find matches for 83,049 40-bit numbers, so that is all that is needed. I'm not expecting to prove that this set of 16 systems will generate all 2^40 symbols for all 2^40 values possible.
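The 83,049 figure checks out against the file size (simple arithmetic, shown for the record):

```python
# AMillionRandomDigits.bin is 415,241 bytes; read it as 40-bit words:
total_bits = 415_241 * 8                      # 3,321,928 bits
full_words, leftover_bits = divmod(total_bits, 40)
print(full_words, leftover_bits)              # 83048 8
# 83,048 full 40-bit words plus one 8-bit tail: 83,049 pieces to match.
```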

I have a theory to explain how all 2^n symbols do exist using these dynamic equations but that is a theory thus far. So I believe all data exists however I am looking for a subset.

Ah Coffee water is hot!

Good Luck Alan and Challenge People

I'm going to give up on this challenge for now.

I'm confident that I can achieve some decent compression. I won't explain how, since I don't want to share my technique. The problem I have is that when I try to build minimalist tools to achieve decompression, each addition of a transform or a decoder causes a disproportionate expansion of my executable size, making this more an exercise in programming artistry than algorithmic artistry.

I suspect that my choice of tools (Embarcadero's C++ Builder) may be the problem, but this is all a tangent from my critical path, which has nothing to do with compression.

If I were truly confident that there were riches at the end of this rainbow, I'd switch critical paths (yes Virginia, it is all about income), but for now, I must get back to work.

Best of luck to the travelers, may your algorithms be short, sweet and shiny with the bright gleam of genius!

Cunningham, it's not the size of the compiled executable that needs to be measured. As stated in the challenge:

"For example, if you wrote the program in C++, how would we measure the size of the program?

In this case, I would measure it as the length of the source file, after passing through a reasonable compactor."

Just measure your source code's size, which is usually far smaller than the compiled program's size - for example, my 458 byte source compiled into a 9028 byte executable on Fedora.

Since you don't think there are riches beyond the $100, and since you only need to send the decompressor, not the compressor, you should have no problem sending Mark your decompressor source.

Well, I for one like having Challenge friends around.

In 10 years I've explored a lot of ideas so I know what it's like to run into limits. That's a part of trying.

I can't say "keep on keeping on," but here, I can't imagine ever stopping 100%.

Look at what is happening here; the computer even with 8-cores must work a long time to find matches.

Perhaps there is a better faster way but the maths are not clear to me.

My point is one thing leads to another and another so you will most likely get back to things CW.

Update here is I may have two more weeks of work so that is good money wise. I'll be coding again after the work is over and I am a bit rested.

Matches have been reported for 3.35585% of the 83,049 words. It's unknown how many are different codes for the same elements. This offers the ability to map a code to more than one element, or an element to more than one code, within the scope of context.

The first 1,000 may be here already; it's hard to tell right now, but it should be exciting to reduce the data three times. Something to work towards.

@Zen Is it really true the "Source" file size is what is measured? Heh.. I've been compressing the ELF to see where I am at size wise.

What is true?

The only problem with the 'source size' option is that the techniques I am using are proprietary to an entirely different product. Even releasing an executable would constitute a potential security breach (not national security, mind you). I would trust Mark if he told me that he would not reverse engineer my code, or scan it more than necessary to satisfy himself that it does not contain hidden bits of the MRD file. In fact, if he gave me a signed statement that he would not share my technique, I'd be glad to share the concept with him, and he could build his own version that does not disclose my techniques.

On the other hand, it's also quite plausible that I'm deluding myself into thinking that I can solve a puzzle that has been examined by smarter men than I, and I have to give some weight to Mark's assertion that it can't be done. He knows a lot more about compression than I ever will, and my techniques may be clever for their intended purpose, but I wouldn't be surprised to find that I'm treading ground familiar to people with Mark's experience.

I'm not saying that I won't twiddle with it on occasion, but I have more important things at the moment, and maybe when the need for secrecy has fallen away at some future time, I can stop being so guarded.

Just as a side note, my compiler has a zlib library, and when I linked that in with just calls to compress and uncompress a buffer, my executable shot from 12k to 427k. It probably has special technology that searches a hard drive for a copy of the MRD file, and if it finds it, it self expands to make sure that Mark's lunch money remains safe ;-).
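As an aside on that zlib anecdote: beyond the executable bloat, zlib also can't shrink dense data of the kind the challenge file contains. A minimal sketch using Python's zlib bindings (the same underlying library); the data here is an illustrative deterministic pseudorandom buffer, not the actual MRD file:

```python
import random
import zlib

# Deterministic stand-in for dense/"random" data (NOT the actual MRD file).
rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(4096))

compressed = zlib.compress(data, level=9)

# High-entropy input gives DEFLATE no matches to exploit, and the format
# adds framing overhead, so the "compressed" buffer is no smaller.
assert zlib.decompress(compressed) == data    # still round-trips losslessly
print(len(compressed) >= len(data))           # True
```

This is the pigeonhole principle in miniature: a general-purpose compressor must expand some inputs, and high-entropy buffers are exactly those inputs.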

Well CW, you know best. Just wanted to be supportive. Having friends when this activity has been classified as "nutty" is not so easy.

So, with the lowly $100 and the stigma, Mark seemed to be satisfying an agenda; yet we should have known that there would be people who would find the niche a fit.

So what you have may well be protected, and so be it; but if you can "do it," then "Do It."

While I agree that the goal of decoder + compressed MDF is doable, it is a ruthless reality.

Who knows how many things I passed on because it would be larger.

Oh man.. Time to do the count. Looks about the same: 100+ every day.

@ CWCunningham

I was thinking today at work (yes, a half day on Sunday) that while Mark has good, solid reasons to state that compression of the file under these conditions is impossible, he cannot dispute the possibility of smaller encodings of data considered "dense" or "random."

So while the whole file may require a lot of resources to compress, it's now demonstrable that at least portions of the file have fallen to smaller encodings that decode back to the original size.

So don't figure that because some guys say it's so, it's God's Law and you must die for irrational-number belief.

Follow your mind and see what comes of your ideas. With any luck the "White Rabbit" you follow will lead you to other "Blue Pills."

As an aside;

No Idea is a good Idea until it's your idea.. I understand the philosophy.

So are you and Zen really real?

Hello Everyone!

I finally had the time to work on the MDF. My number generator proved ineffective in the test run (it tended to make large indices), so it was shut down and the 'workhorse' hauled back to the basement without even starting the main run. :( But I have a clever (if maybe a bit slow) en-/de-coding algorithm for my Japanese Crossword method. (Yes, it uses the same steps for encoding and decoding!) If it works, the Workhorse will get a second (third) chance! :) This algorithm will make a 51-bit 'puzzle' out of a 64-bit data block, but it needs additional 'clue bits' to disambiguate the puzzle. The only big question mark is the quantity of clue bits needed. (12 bits of free space accounts for 4,096 solutions.) If this method achieves the task, the remaining step will be the decompressor, and if I succeed, I'll be $100 richer...
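The numbers quoted here can be checked against the pigeonhole principle. A quick arithmetic sketch (my own, not the poster's algorithm), assuming each 51-bit puzzle admits up to 4,096 solutions as stated:

```python
# Pigeonhole check on the 64-bit block -> 51-bit puzzle scheme.
BLOCK_BITS = 64
PUZZLE_BITS = 51
SOLUTIONS_PER_PUZZLE = 4096   # "12 bits of free space accounts for 4096 solutions"

# A puzzle plus 12 clue bits can distinguish at most 2^51 * 2^12 = 2^63
# blocks, which is only half of the 2^64 possible 64-bit blocks.
distinguishable = 2**PUZZLE_BITS * SOLUTIONS_PER_PUZZLE
print(distinguishable < 2**BLOCK_BITS)        # True: 2^63 < 2^64

# So on average at least 64 - 51 = 13 clue bits are needed per block,
# and 51 + 13 = 64: no net saving over arbitrary blocks.
min_avg_clue_bits = BLOCK_BITS - PUZZLE_BITS
print(PUZZLE_BITS + min_avg_clue_bits)        # 64
```

In other words, the scheme only wins if the clue bits average under 13 per block, which a counting argument rules out for arbitrary (i.e. random) input.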

Good luck everyone, and have a nice day! :)

@ Vacek Nules

Strike while the iron is hot and keep logs on your efforts!

Document all code well and keep a log (blog) of what the code does and what your thoughts and goals were.

The reason for the suggestions is that I have gone back to work that is 7 or 8 years old, and reading the notes has been the best help when the code isn't so clear.

---------

I find your "Puzzle" approach very interesting! I won't be disappointed if you are able to do this as you describe. Just don't lose your will if you need to think about the mechanics every now and then.

I have found the limits of many aspects of manipulating information, and many times I have sat here and asked myself, "Now what do I do?"

-----------------

I find "hitting" the limits is like being in pitch-black dark, following a smooth wall with my hands, looking for a "way" (a door), and the Eastern concept of the Way.

In science, failure is as useful as success, so when I hit a limit, I know at least that the effort is consistently resulting in truth.

When I get "success," I find I ask myself, "What did I do wrong?" In a way, success results in more doubt than failure for me.

-------------------------------

Well, this is a project for the Fringe Folks, and I understand that discussing it or telling stories is considered bad form, but with an open invite to see the decoding, I will take a moment, coffee in hand, to post today's report on encoding the Million Digit File smaller.

Tuesday, October 02, 2012, 04:37 AM: 3,248 matches (3.910944%)

I am still working a Seasonal Job, so I am rather tired and will wait to code the database program. I'll be looking for the first 1,000 "compressed" 40-bit values; 40 bits is the word size I chose to work with. There are 83,049 40-bit words in the million digit file.

Not accounting for codes generated for the same word by 16 separate programs, the total count is 3,248 as of 4:37 AM today.

Once the database program is running, I'll switch to counting matches against the 83,049 rather than the simple sum total of all matches, which in some instances are different codes for the same word.
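The figures quoted here are internally consistent; a quick check (my arithmetic, using the 415,241-byte file size from the challenge):

```python
import math

FILE_BYTES = 415_241    # size of AMillionRandomDigits.bin per the challenge
WORD_BITS = 40          # word size chosen above

# Number of 40-bit words in the file (the last word is partial/padded).
total_bits = FILE_BYTES * 8                 # 3,321,928 bits
words = math.ceil(total_bits / WORD_BITS)
print(words)                                # 83049

# The 3,248 matches reported in the log line work out to the stated percentage.
pct = 3248 / words * 100
print(round(pct, 6))                        # 3.910944
```

So the reported percentage is simply matches divided by the 83,049-word total, not yet deduplicated for multiple codes per word.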

Have a Good One and pick the Red Pill! (Matrix the Movie Reference)

@ Vacek Nules

So you feel you can index it?

That is truly a good idea.

I was thinking about your posts today.

Here is a curious thing:

the form that the million digits take in my apparent solution has the following very interesting quality:

look at it from a wave perspective, and it looks like (one or more) particles

look at it from a particle perspective, and it looks like (one or more) waves

@ Alan

I don't understand..

It is true, however, that I can provide one of over 5 million versions of the file that require only 23 bits of overhead for me to convert back to the MDF.

I wonder if any of those files have qualities we can exploit?
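A sanity check on that 23-bit figure (my arithmetic, not the poster's scheme): 23 bits is exactly enough to index one of up to 2^23 variants, but the selector itself costs those bits back.

```python
# 23 bits of overhead can select one of 2^23 variants.
variants = 2 ** 23
print(variants)                  # 8388608
print(variants > 5_000_000)      # True: covers "over 5 million versions"

# But since converting any variant back to the original costs those 23 bits,
# some variant must compress by MORE than 23 bits before the scheme
# yields any net saving -- the pigeonhole bound is unchanged.
```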

So, what is MDF as a "wave" what is "wave" in this instance? I'm cool with liberal use of terms. I just want to understand the concepts.

(Briefly) a wave is a coherent group that builds to form a single entity; a particle is a piece of something bigger.

If the million digits are stored in a hyperspace continuum (a higher-dimension continuum; at least two spaces make a hyperspace), then a method would seem to exist for connecting data-related higher-dimension spaces: the data forms 'blocks' that affect the boundaries of neighbouring data pieces, so the higher-dimension space arises via a co-operation amongst the data, like a Sudoku puzzle. To see this "wave" you would have to break off a piece of data (a 'particle'), as the whole thing is a "wave"; but to see a "particle" (a little piece of data) you would have to take a wave view (interact the data with something, say...).

Thanks for the explanation.

I remembered yesterday that I had used "wave" and "particle" in describing a finite set of data transformations: when it "cycles" through all its states it's a "wave," and when any one state is evaluated it's a "particle." It was simply creative thinking there.

Well, it's day by day at work now. I do go in today, so this makes for a full check, but I cannot say which day in the near future is my last working day for the season.

California Wine is having a great harvest! A lot of wine.. Drink California Wines!

Hey, you know this encoder finds over 100 matches every day, but I'm wondering if I can improve that. I'm happy to see a consistent result over all these days it's been running; yet if it is consistent, then there has to be a maths behind it, I would assume. I wonder if I can get to 500 matches a day by redesigning?

What do you think, Alan? Once I am off work, should I focus on "re-encoding/re-compressing" the "compressed/encoded" results found so far, and then do it again for three levels of "compression/encoding," and then redesign for more matches?

I guess the least of results is that I would "follow that wall in the dark" and see if there is a "way" to find more matches per day consistently.

The wisdom of demonstrating dense data "compression" and hopefully "recursive compression" seems valid. I have a University in my area and they have a computer science department. I'd hope I could get a couple of them to spend the time on me.

HOLY SHIT!

If any one of you people has anything to back up your claims so far, you need to publish a peer-reviewed paper or call the CTO of any major multinational immediately.

Even just putting your software on GitHub will assure you immortal credit and fame far greater than Turing and Ada combined, guaranteeing you an R&D position paying a fortune every month just for playing around.

I am saying this as a very drunk senior technology scientist at an unnameable multinational. Sorry I cannot be more specific.

Post within a few weeks or your chance will be forfeit.