Ten years ago I issued a simple challenge to the compression community: reduce the size of roughly half a megabyte of *random data* – by as little as one byte – and be the first to actually have a legitimate claim to this accomplishment.

Ten years later, my challenge is still unmet. After making a small cake and blowing out the candles, I thought this would be a good time to revisit this most venerable quest for coders who think outside the box.

#### Some History

In George Dyson’s great book on the early history of electronic computers, Turing’s Cathedral, he describes how much of the impetus for computation in the 40′s and 50′s was from the US military’s urge to design better fission and fusion bombs. A powerful technique used in this design work was the Monte Carlo method, which relied on streams of random numbers to drive simulations.

The problem then, as now, is that coming up with random numbers is not always an easy task. John von Neumann was intimately involved in all this, and is famously quoted as having said:

any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.

The book describes how the RAND corporation, flush with bomb money, took on the task of using physical processes to generate a handy bucket of random numbers, eventually published as A Million Random Digits with 100,000 Normal Deviates.

Since my tax dollars paid for those numbers, I thought it only fair that I make it the basis of my challenge. I took the decimal digits, converted them to a base two number, stored it in AMillionRandomDigits.bin, and challenged the world to compress it. (The original USENET post was followed with a more findable blog posting that contains a bit more detail.)

Ten years later, there have been no serious entrants, although there are a few dedicated souls who are continuing to attack the problem. Unfortunately for all who are working on it, it seems that those RAND scientists back in the 50′s did a really, really good job of scrubbing those numbers.

#### The 2012 Edition

For a few different reasons, it is time to close down the older versions of the contest and issue an updated 2012 challenge. Nothing much has changed, but over the years I’ve bumped into a few points of confusion that I can clear up, and in addition, I would like to slightly widen the contest’s scope. And most importantly, the comments section of the previous contest is way out of hand, and issuing an update lets me close that stream down and open a new one.

For the 2012 edition, I am actually giving entrants two possible ways to win the prize. The first challenge is essentially a reprise of the original, with minor tweaks and updates. The second poses a more difficult problem that is a superset of the first.

Meeting either challenge brings you worldwide fame, a cash prize of $100, and the knowledge that you have defeated a problem that many said was untouchable.

Likewise, both problems are governed by one meta-rule which seeks to implement an overarching principle: the point of this is to win algorithmically, not to game the contest. I will disqualify any entry that wins through means such as hiding data in filenames, environment variables, kernel buffers, or whatever. Someone can always find a way to win with monkey business like this, but that is beside the point of the contest. No hiding data.

#### Challenge Version 1 – A Kolmogorov Compressor

The original version of the challenge is basically unchanged. Your goal is to find the shortest program possible that will produce the million random digit file. In other words, demonstrate that its Kolmogorov complexity is less than its size. So the heart of Challenge 1 is the question of whether the file is compressible à la Kolmogorov and standard, general purpose computing machines.

The interesting part about this challenge is that it is only *very likely* impossible. Turing, and Gödel before him, made sure that we can’t state with any certainty that there is no program of size less than 415,241 bytes that will produce the file. All it takes is a lucky strike. Maybe the digits are a prime? Maybe they just happen to be nested in the expansion of some transcendental number? Or better yet, maybe the RANDians overlooked some redundancy, hidden somewhere in a fifth order curve, just waiting to be fit. There are no telling how many different ways you could hit the jackpot.

However, the dismal logic of The Counting Argument tells us that there are always going to be some files of size 415,241 bytes that are not compressible by the rules of Challenge 1. And of course, it’s actually much worse than that – when you cast a critical eye on the task, it turns out that nearly all files are incompressible. But for a given file, we don’t really have any way of proving incompressibility.

**Rulings**

I want everyone to have the best chance possible to win this challenge. The basic rule is that your program file, possibly combined with a data file, must be less than 415,241 bytes. Clarifications on questions that have come up in the past included:

- Programs written in C or some other compiled language can be measured by the length of their source, not their compiled product.
- Using external programs to strip comments, rename variables, etc. is all fine. It would be nice to have the unmangled source available as your input, with the mangling step part of the submission process.
- Programs that link to standard libraries included with the language don’t have to include the length of those libraries against their total. Hiding data in standard libraries is of course not allowed. (And don’t even think of hiding it in the kernel!)
- Source code can be submitted in a compressed container of your choice, and I will only count the bytes used in the container against you.
- Likewise, any data files can be submitted in a compressed container, and I will only count the bytes used in the container against you.
- You own the code, and just because you win, I don’t have the right to publish it. If you insist on an NDA I may be willing to comply.
- In general, you need to submit source code that I can build and execute in a relatively standard VM. If you are paranoid and insist on binaries only, we might be able to come to terms, but no guarantees.
- The nature of this contest is such that gaming the rules is pointless. You aren’t entering in a quest to beat the rules, you are entering in a quest to beat the data.
- Your program might take a long, long time to run, but we will have to draw the line somewhere.

If there is anyone who deserves to beat this file, it is Ernst Berg, who has been relentlessly attacking it from various directions for years now. Ernst doesn’t seem to be doing this to feed his ego – he’ll share his results with all comers, and is always willing to listen to someone’s new approach. I consider him to be the unofficial Sergeant-at-Arms of this enterprise.

But Ernst will also be the first to tell you what a harsh mistress the file can be – always taking, but never giving.

#### Challenge Version 2 – A General Purpose Random Compressor

Challenge 1 is interesting because it is nearly, but not assuredly impossible. Challenge 2 is more along the lines of troll bait, because it is patently impossible: create a system to compress and then decompress *any* file of size 415,241 bytes. In other words, create a compressed file that is smaller than the input, then use only that compressed file to restore the original data.

Unlike Challenge 1, there are no size limitations on Challenge 2. Your compressor and decompressor can be as large as you like. Because the programs have to be able to handle *any* input data, size is of no particular advantage – there is no data to hide.

This challenge is for the contestant who is sure that he or she has figured out a way to compress the million digit file, but finds that their program takes 100 MB of space. Okay, that’s fine, we shall first see if it can compress the file. It then must be able to correctly compress and decompress a file of the same size. Let’s say, on 1,000 different files.

To keep it simple, the files will be simple permutations of the million digit file – scrambled with an encryption system, or perhaps a random number generator, or whatever. The point is, they should have all the same numeric characteristics as the original file, just organized slightly differently.

Again, I will emphasize that Challenge 2 is at its heart provably impossible to beat. No program can compress all files of a given size, and the chances of any program being able to compress 1,000 different files of this length is so vanishingly small that it can safely be ruled out, even if every computer on earth was working on it from now until we are engulfed in flames during Sol’s red giant phase.

#### Conclusion

It seems unlikely that there will be any developments in lossless compression that change the terms of this contest. No doubt I’ll reissue the challenge in another ten years or so, but if it is beatable, the tools are already at hand. Good luck to those of you who are tackling the challenge. Beating this will not get you the fame associated with something like the Clay Millenium Prizes, but in the small world of data compression, you will sit alone on a throne of your own making, and deservedly so.

## 262 users commented in " The Random Compression Challenge Turns Ten "

Follow-up comment rss or Leave a Trackback> However, the dismal logic of The Counting Argument tells us that there are always going to be some files of size 415,241 bytes that are not compressible by the rules of Challenge 1

From my understanding the counting argument shows it’s clear that for a given compressor, there MUST be some files which are incompressible. Indeed, most, meaning definitely over 50% would be incompressible, since any compression must reduce by at least 1 bit, which cuts the number of possible representations in half.

But I think the counting argument doesn’t say that there’s a set of files which cannot be compressed by any given compressor.

More precisely I don’t think that the statement: “There exists a file F, such that for all possible functions C, size_of(C(F)) >= size_of(F)” is provable only with the counting argument.

Actually I guess the above statement, if true, would basically say that there exists a piece of information for which it’s Kolmogorov complexity is irreducible. I’m not entirely familiar with all the theorems in this field, so this could have already been disproven, or maybe proven, though it doesn’t sound like it could be proven at least.

Anyway, I’m not sure if that’s what you were saying in your post, but that line sort of sounded like that.

Regardless, great post, I especially like the v2 of the challenge, since I’m sure I’ll see many who’ll claim to have solved it :)

@Zen:

>But I think the counting argument doesn’t say that

>there’s a set of files which cannot be compressed

>by any given compressor.

No, you are right about that. I think that it would be fairly easy to disprove that idea.

Kolmogorov doesn’t care what your Turing Machine looks like, so in principle, I could have a machine that has a one byte instruction that then punches out the MRDF to the tape, showing that I have compressed that file down to 1 byte.

It’s sometimes frustrating for people to think that there are no absolute measures of information, but it ends up all being relative to the model.

I think you might be able to get into the area of absolute incompressibility if you start talking about something like Chaitin’s number omega, but it’s a little over my head I think.

- Mark

Just a reminder:

As far as I can currently ascertain, with limited time at hand, I may have solved, comprehensively, this challenge, in theory (I am not a computer programmer, but have worked out an apparent logical answer, that is sufficently highly defined that I have very precise descriptions of it, though it is an extraordinary thing, being very “holistic” in nature.)

Due to the possible commercial value of the large group of inter-related discoveries of which apparently solving this challenge is one, I am limited in what I can say. I can say that I have also apparently solved, or found insights into, each Clay Institute Millenium problem, and the Goldbach Conjecture, the Godel Incompleteness Theorem, and the Turing Halting problem.

What can I say, then? The million digits, however “random” it is claimed to be, has an Archilles heel- it has a limit on its size, and it has a very precise order. There are only so many ways the million digits can be arranged, and only so many higher-level structures it can contain, and for every such higher-level structure, “limits” occur on whatever other such structures can exist.

This is not a problem to necessarily be ‘solved’ by being restricted to “Shannon limit” or “probability” or the ‘counting argument”: the million digits is ALREADY a “counting” (a roll call of the digits you could say..) “argument” (alternating bias)- a varying density of zeros and ones- you don’t solve that with ‘pigeon holes’ – you solve this with un-pigeon holes (multiple possibilities juxtaposed in varying higher dimension ‘spaces’ – like a Sudoku puzzle on ice cold ginger beer- string theory going backwards and forwards (in time)) – you end out with a “solution” that only becomes “real” when all the digits are accounted for.

I found a new potential proof re: defeating Godel’s Incompleteness Theorem: if you consider the 6 quarks as associated with the phenomenon of conservation of measurement, then 6 x 6 gives quarks blur, but 6 x 7 gives quarks that are there at some scale but not at every scale.

That means “units” are defined (so mathematics- which depends on the counting of units) purely by means of a structural system that is fully integrated, and differentiated, with of its own content- it is a universal higher dimension space- one could say everything inside is potentially or partially “at right angles” to everything (else)- except its wholesomeness.

6 x 7 = 42, if you have 42 axioms, then each axiom partially may deconstruct the imaginary units, but in doing so, it must affect other such units in such away that the idea of “axiom” and the notion of “unit” become complementary over the whole thing (which forms a haphazard looking “square” floating as it were.

cheers

Alan

for what it is worth:

looking at Wikipedia, list of maths problems:

the 4th, to construct all metrics where lines are geodesics, appears solved by the apparent system to minimise space to store the million digits.

the 9th, to find most general law of reciprocity theorem in an algebraic number field, appears to be soved by Einstein Relativity

the 1st, to find a set strictly between integers and real numbers, ‘Continuum hypothesis’, appears to be such that it inviolves a floating set, or an actual ‘axiom of choice’ in that the rule is you get to make a choice (I.e. optional elements in sets)

the 6th, axiomise all of physics, is curious, in that if physics is really “imaginary mathematics”, what is there to axiomise?

Re; 8th, Riemann, and twin prime conjecture, e.g. the twin primes 41 and 43; well given the role of 42 (as neither prime nor not-prime (In higher dimensional space or math)), 43 minus 1 is like 42 plus 3 (Or 4 or 5 or?)- ‘priming’ and ‘counting’ become ‘interchangeable’ -so if you can count, you can still find primes, that ‘(as if) ‘count reverse’ ( 2 back as 1 forwards)?

chance that the 10th- re: Diophantine equation- that there is ‘one’ such algorithm, but applied to each scenario (e.g. the million digit challenge) gives a unique ‘algorithm’ for each scenario

@Alan:

Disproving the Continuum Hypothesis merely requires that you create a set with greater cardinality than aleph-0 and less than that of the reals. It’s always nice when all you need is a counter-example.

Proving it to be true is beyond the scope of my understanding.

Be careful though, thinking about transfinite sets may have contributed to Cantor’s mental illness.

- Mark

Cool Update Mark!

Wow what a nice comment about me. It is true that I started knowing nothing and have worked it like a “new boy should.” http://www.youtube.com/watch?v=k4VFFBCa5Aw : Heh :) I spend enough time with it.

My official status is still as it was; the file has been divided into 83,049 40-bit words and is currently being compressed when a match is found in a number generator I wrote. The number of bits to generate the 40-bit word from the Million Digit File by the number generator are fewer than 40-Bits. The fewer is variable but no less than 1 bit.

The Pace is slow with over 100 matches a day and a total number of candidates being Friday, October 12 2012 4385 @ 04:44 AM 5.280015% of 83,049

This number generator will be reviewed and I may wish to improve the number of matches by redesigning but the current reality is that while I have only a possible 5%, not counting duplicate codes for some same elements, I have a possibility to “re-compress” the “compressed” matches so far.

This possible re-compress could demonstrate what we have long thought impossible.

So, not knowing what maths are actually in play here, I want to see if I can do this “recursive-compression” and back it up with recursive-decompression.”

If so then it is time to “show and tell.” I trust after all these years you guys know I am serious.

So Props on the new forum and it’s nice to still be in the running. As always Good Luck Challenge People!

This Challenge is possible to win!

Oh my,

Challenge 2 is not about a Challenge it’s a Safety isn’t it Mark.

I understand the mechanics now. I have intimate insight.

What Challenge 2 is isn’t a new version it is Challenge 1 to a Power-Of.

Generating a set of files based on MDF is easy. I can show 5 million+ files for any file you like of that exact length.

If I am to write a general purpose encoder using what I am exploring now as to satisfy Challenge 2 I would need a team and fantastic resources.

I would believe something can be crafted and it may not be anything I can envision at this point. I have a glimpse of possible maths but not a command of such to reach that goal.

Hell You and all your friends cant even do MDF and you offer Challenge 2.

So I am still a Challenge 1 guy. I simply don’t see Challenge 2 as anything but a Safety for you. That is sad.

Sorry Mark, Joe Biden has influenced me. Here is a Video enjoy!

http://www.youtube.com/watch?feature=endscreen&NR=1&v=w-NshzYK9y0

Mark:

a jigsaw puzzle would appear to be a set whose content satisfies the requirements of the “Continuum hypothesis”….

I don’t know about “transfinite sets”- sounds like “across finity” – also a jigsaw it appears.

re: “Cantor’s “mental illness”"- It is worth noting that this is a metaphorical use of the concept “illness”- conflict, disagreement, dissaproved conduct, and conduct that annoys people, or conduct that “gets under someone’s skin, so-to-speak”, is not illness. Illness is either physical, or metaphorical. It was once claimed that slaves who tried to run away, were somehow not responsible for their personal desires, but that (acording to those who wished to think that they could somehow control them) they were allegedly suffering from an alleged ‘mental illness” called “drapetomania” (that supposedly made them run away!)

Of course, it is a myth, there are no such “illnesses” (except in the minds of people who try to coerce others and try to hide behind a rhetoric of “health”.)

(There is, however, such a thing as disagreement, and annoyance, and puzzlement by one person re: the actions of another).

Curiously, real (physical) illness appears to be all about minding (or person-environment communication), and so-called “mental illness” is all about people NOT minding (people trying to “bash” others up, usually out of fear of (re) discovering how much freedom there really is (both for them, and for others) in reality (which tends to jar with their own (often parent/society “programmed” (or influenced) subservience to so-called authority…?)

(For more information on fake medicine (i.e. what is known as “psychiatry” see http://www.szasz.com)

I must apologize;

I became frustrated whit what I see as the need for fantastic resources, ( computer power, storage and talented people assisting me ) over the Newer Second Challenge.

I was angry because I can see a possibility that a general compressor can be crafted; at-least in theory from my point of view. I do not see how I could accomplish such even with my nice 8-Core AMD and 32 GB of ram.

————–

So good luck anyone who attempts the second challenge. I want to but I don’t see how I can with my resources. Maybe that will change.

As to it being possible? We still don’t know what Dark-Matter and Dark-Energy are so can there be things we don’t know? Absolutely! Infinite compression could be one and the maths for it too.

I think there may be an answer to challenge 2, it is an integrated _________________ (censored).

Note: the data is ALREADY “compressed and decompressed” (I.e there will presumably be “wide bits” (lots of spread re: the zeroes and (the) ones, and “narrow bits” (bunching of zeroes (e.g. 0000 and ones (eg 1111)). (So to both compress and decompress it you need something called a “virtual computer”.

Re: “dark matter” and “dark energy”: I have a theory re: what these are- “dark matter” is “conserved disturbance” and “dark energy” is “conserved alternatives” (both relevant to the Million Digit Challenge.

These consepts are possibly from “continually projecting space”; if you look at a flat Atlas page, at the projection of the spherical Earth, Antarctica is typically too large, and India is too small. That is what “dark matter” and “dark energy” may be about. Astronomers are failing to accurately truly differentiate and integrate space (due to the real-time changes in the location of their instruments (how do you know where the planet Earth is, except by looking at the stars (and planets), but how do you know where these reference objects are, except by “triangulating” them from two locations on the Earth- but what about possible BOTH shifting..?

Re: Ernst’s recursive compression: if you partially differentiate hyperspace (two or more connected spaces), you will get the impression that “recursive compression” is possible, but I think you will hit a problem if you try to apply it to ALL the data. There is a way around this……

Mark:

re; “Continuum hypothesis”: I did more research on this.

First, I do not think there is such a thing as “the square root of 2″- think about it!: two quantities that supposedly match, and when juxtaposed alternately (i.e. multiplied, so creating “2 views”) create “2″? It would be less thahn 2, yet more than 2 (a hyperspace ‘object’ (or just plain “hyperspace”).

The so-called “real numbers”: claimed to be on a continuous line, which is divided. My impression is that such a line would tend to form a curve/ or circle. There is no ‘Cardinality” of the “real numbers” (number of elements in the so-called “set”), other than hyperspace quantisation (the error margin (depending on how many divisions you place on the line and where).

The cardinality of the “natural numbers” (counting numbers) is claimed to be Aleph zero (or I could call: a higher dimensional space times three): the non-(“error”) margin: the magnitude (or real-’time’ “size”) of the “set” (the number of elements in the “set” (of natural numbers) is (apparently) the number of “sets” in the (any) “element” (!!!!!)

So : the “set” of (locally) “infinitely” divisable numbers (i.e. cardinality of set of natural numbers = (local) real numbers. (How the natural numbers ‘mesh’ together (like in a jigsaw puzzle))

(The relatedness between the so-called numbers (their role in describing SPACE (or “real-time “objectivity”) (partial differential of time, full differentiating of space (or space within limits in some versions, I could say/ guess))).

The “cardinality of the Natural numbers ” is: the real “numbers” (the number of elements in the set of counting is (apparently) the number of ‘countings’ in the element (ultimate subdivisability) of ‘sets’ (assuming “real numbers” overlap (which they apparently do! )

If error margins and anti’error margins’ converge (on a magical “counting argument” “number”) you have “a transfinite set” (a set of “reals” that is (perfectly) “natural”). A limit on space (and a de-limit on time), or vice versa.

an OBJECTIVE definition of SPACE, as it were….????

Well, This is the last week of work. I cannot say if the last day is tomorrow, Thursday or Friday but it’s one of them.

I look forward to resting and coding.

The first thing is to craft the decoder and database programs. Don’t worry decode works fine I just need to update to fit this particular encoding configuration.

The Goal now is to gather up 1000 unique 40-bit matches and make the new file. 40,000 bits from the MDF.

Then once that is done I will share this file and the positional information of where these 40-bits are taken from MDF with Mark if he wants to bother with sharing it with everyone. Then I’ll attempt to reduce the first reduction again. Once that is successful then another level of compression will be added.

The decoding has to be universal so the same program that decompresses the first must also process the second and third.

This will be interesting to attempt. True I am blindly faithful to a conjecture that it will work with such a small set of data fast.

So the next couple of months should be interesting. The only drawback is I might land another job and then the time I have could be limited.

For those not familiar this “encoding” is not the traditional data compression. This is a number generator that looks for matches in the Million Digit File and can generate the target data with a code that uses fewer bits than the target data word size.

The hope is that all values can be found within the boundaries of those codes. The Counting argument gets a little bent with this approach since multiple systems are used concurrently with each having to obey it’s own counting argument reality. Together they can use the same “codes” and generate different results.

I trust I have shared enough to put many at ease as to the “claims” being made.

So I raise my coffee cup to the last week! Cheers!

I find the challenge interesting and I assume you still remember the $5000 Goldman Challenge. I still believe Patrick won.

Which brings me to this challenge using your data. I think you have to restrict the challenge to files in bytes only since to be fair using byte files you can not make the file 1 bit smaller. It has to be at least a integer multiple of 8.

Since I could map you file bijectively to a computer system where your byte file is actually smaller datawise in number of bits from scratch. Since on a bit file system you have not only the 256 files that the byte file system has for a file which is 1 byte long the bit files also has 128 file that are exactly 7 bits long. These 128 file would have to be longer than 1 byte on the normal file system. And so on for files 1 to 6 bits long.

I don’t believe and one will find a solution using bytes on a normal PC.

However I think there are several general bijective fair transforms that when applied in series to your data will result in a smaller data file. But these are all very long programs and the compression would only be a few bytes so there length alone would make them not in the running for a win. Yet if one can apply a simple series of general bijective transforms not really tuned to the data and still save a few bytes. It should raise some sort of flag than maybe the data file used was not the best since I don’t think for a truly random file (what ever that is) should be able to be compressed more than a few bytes using normal general bijective means. Anyway thats my thoughts on the subject and I belive your money safe.

It would indeed be interesting if you found a transform that saved a few bytes. It could be a start at Challenge 2, but of course it would be hard to

Back in the day some folks on comp.compression did find something like a couple of bytes worth of bias, alhtoughI don’t recall the details.

- Mark

The question is asked: “how many elements in the so-called “set of real numbers”?

What is a “real number”? I am inclined to think a “real number” is just that, a REAL number: i.e. e.g., the number of eggs in a carton (e.g. 12). A number that describes something real.

What distinguishes a real number also then is that it is “a number divisable another way” (e.g. by breaking eggs).

Hence it includes fractions.(?) Its so-called “continuity” (mathematics appears to define “real numbers”

as numbers on a claimed “continuous” line) suggests that such numbers are ‘real space numbers’- i.e. they always involve “some space” e.g. where are the limits in defining/ dividing an egg?

“Natural numbers” are apparently just that- “natural” as in numbers you use to count (with).

What then would a “real-natural” number be (a “hyper-space number”)? When you count, then it refers to something with “span”. There are 27 ways of arranging 3 beans in 3 boxes; the “27″ is I could say, a “transfinite set”, a real-natural number (you get 27 (“counting”), ONLY when you get “an argument” (a two-way bias) (Beans/boxes).

The “Continuum hypothesis” appears to be solved by such a number.

How many elements are there in the set “27 ways of fitting 3 beans in 3 boxes”?

Is the number of elements in this set, this (“trans finite set” ?), strictly between that of the integers and that of the real numbers?

The 27 ways all involve juggling beans with regard to boxes- pluses, and minuses (relative to the boxes).

But they are bigger than that (because each way creates interference in potential other ways) and smaller than that (Because each other way takes plus and minus possibilities from a particular way?). Overall, they are bigger because of “the continuum” (Beans plus boxes)?

So the set is “bigger than the (Localised) integers”?

So “the set of “trans-finite sets” is, by definition (by quantising higher dimension space), bigger than the set of integers? It counts more than just pluses and minuses, when it counts pluses, and minuses ?

The ‘real-ness” of the 27 ways comes from it describing something real, beans, and “boxes”. But in the 27, the beans and boxes finally merge? So become “less than real”- but only when fractionalised. A local limit is formed on how fractionalised the 27 can be. The set is like a dotted line? So lacks the continuity of a number line? So the set of transfinite sets always has less members than the set of real numbers?

But more than the set of integers?

Because the 27 ways of arranging 3 beans in 3 boxes is a real integer? A REAL plus and minus. A fitting together of objects in (a) space.

To solve the claimed million digit challenge, you need to find the hyperspace continuity of the (million) digits. How bunches of digits can form mutually rotating groups in hyperspace. I have apparently found how….. ??????????

Yes, there are bijective transforms. I work with such.

Nothing I know of “compresses” all files.

To encode smaller takes an external reference and that has been bit for bit in the “all files case.”

I cannot dismiss the possibility that there are some set of transforms that will do for Challenge 1.

Great thinking!

Looks like Friday for me…

Most of an email that I tried to send with Samsung’s website but it didn’t send:

Hello,

I am wondering if it is possible for an inventor to interest Samsung with apparent, theoretical, developments in information technology?

I have worked out, to at least a rough extent, many potential (un-tested, unproven)discoveries in information technology, including:

(1) possible breakthroughs in 3-D television (to be more natural)

(2) possible breakthroughs in forming holograms- using natural light

(3) possible breakthroughs in storing computer data in less space

(4) possible breakthroughs in finding hidden patterns in data (how to form “data holograms”, how to ‘rotate’ data and see it from different perspectives, potentially highlighting previously hidden patterns (such as patterns that may be hidden in the time. place. and activity data of seismometers, to potentially produce a “differential strain map” to possibly detect trends associated with eartrhquakes))

(5) possibility of breakthroughs in search engines

(6) possibility of breakthroughs in super-conductivity

(7) apparent insights or breakthroughs in each “Clay Institute Millenium problem in mathematics” (“P vs NP”, “Riemann hypothesis”, “Hodge conjecture”, Poincare Conjecture”, “Yang Mills existence and mass gap”,”Navier-Stokes existence and smoothness”,”Birch and Swinnerton-Dyer conjecture”) , and in: Turing Halting Problem, Goldbach Conjecture, Godel’s Incompleteness Theorem, Continuum hypothesis, and more, also with regards to Physics “Theory Of Everything”.

(8) possible breakthroughs in telecommunications including in: capability of copper, capability of fibre optic (including “wave division multiplexing in passive optical systems”), and many other apparent, un-tested, theoretical new technology ideas

License to use one or more of what I have discovered is something I wish to consider selling to technology companies, e.g. Samsung.

Usually a company signs a Non-Disclosure agreement re: the theoretical potential technology development they may be interested in.

Thank you,

Alan ——- (amateur theoretical physicist/ free-lance inventor)

I had a strange email this morning that Recursive compression has been done. I do see a few articles dated April 27 2012 but no major news such as Dr. Dobbs or other. Oh well.. It’s this if anyone wishes to look at the claim. http://www.prweb.com/releases/prweb2012/4/prweb9445241.htm

———————————–

My Update: A not so bold claim of random data compression but a modest claim of partial compression of the Million Digit File.

I am now off work as of this morning. I am having coffee and will get to coding tonight.

The goal is to extract at least 1000 unique 40-bit words from the MDF that have codes that compress the 40-bits to something like 36 bits I think. So If that 36 bits is the right length that means 4 bits reduction each word. 4000 bits for the subset of 1000 40-bit words. 500 bytes off of 5000 bytes.

So, it’s back On!!

I read the site on Recursive compression and the quote below is on it: from Q10 of the FAQ

START QUOTE

This will never happen, because after a file has gone through a large number of passes, it will first get into a phase of diminishing returns, and eventually reaches the optimum, when the effects of further passes would become negligible. Furthermore, even in the event two files become exactly the same at their optimum by sheer coincidence, the decompressed files will surely be different, because the two files would have gone through different numbers of passes to reach the same optimum.

END QUOTE

This is BS since if two files reach the same so called optimum where is the information stored that tells it how many times to decompress since it appears the number of times to do decompression passes differs for the two original different files that compress to same file.

Hell I can compress any file recursively to ZERO bits. Since it will decompress using my method to every finite file. The Trick is to know how many times to decompress since doing it a different number of times goes to a different file.

I will tell you for free how to do this first 256 decompressions go to every possible one byte file. the next 256*25 decompressions cover all 2 byte files. Then the next 256*256*256 decompression cover all files of 3 bytes and etc.

Recursive compression discoveries are BS. There’s a number of sources that describe the reasons why lossless recursive compression is impossible, as well as why making a general algorithm to compress *ANY* given ‘random’ data is impossible.

The arguments behind this are rock solid, in fact most of them can be covered by the counting argument. (which for some reason people often misunderstand, or just ignore)

That’s why anyone who comes on here saying they’ve solved the challenge by making some general compressor, or claiming to be able to re-compress their output is inevitably going to fail.

The only reason why trying to solve this challenge isn’t actually a pointless effort is because there’s a tiny, tiny chance that someone might find and exploit some incredibly well hidden pattern in the million digits, entirely specific to this file only, by which they can save some space. That’s the only approach that can win this challenge.

Well given something external to the source such as a number generator, there is a different realm of probability in my opinion.

I have a subset of the MDF and I will be looking for encoding for those integers that already encode the 40-bit word to say 36 bits.

So recursive compression may be possible at least in a limited sense.

So do we know everything ? Hell no..

Let us realize that we are part of “creation” and as such we observe not define.

So, that we dream of other worlds of reality my not be so “Crank” it may well be part of the multi-verse in principle.

Still if you claim it defend it.

Quoting Zen:

“Recursive compression discoveries are BS. There’s a number of sources that describe the reasons why lossless recursive compression is impossible, as well as why making a general algorithm to compress *ANY* given ‘random’ data is impossible.”

reply: the current secret to (apparently) solving the million digit challenge: via this method the _______________________________________________________________________________________________________

_____________________________________________________________________________________________________________________________________________________________________________________

The “compression” strangely enough, may actually be recursive IN HINDSIGHT i.e. AFTER it has been done like a constantly changing jigsaw that only fits at the very last (Or near last) piece

Quoting Zen:

“The arguments behind this are rock solid, in fact most of them can be covered by the counting argument. (which for some reason people often misunderstand, or just ignore)”

counting arguments can be ____________________________________ ____________________________________________________________________________________________

Quoting Zen:

“That’s why anyone who comes on here saying they’ve solved the challenge by making some general compressor, or claiming to be able to re-compress their output is inevitably going to fail.”

reply:

Any “recompression of so-called “output”" is hidden by the apparent solving technique, with all data and its arrangement non-set until the system has completed its activities (Like a Sudoku puzzle- it may be)

What is “output” and what is not “output” may be “undefined” exactly till the end of the higher-domensional processing technique…

Quoting Zen:

“The only reason why trying to solve this challenge isn’t actually a pointless effort is because there’s a tiny, tiny chance that someone might find and exploit some incredibly well hidden pattern in the million digits, entirely specific to this file only, by which they can save some space. That’s the only approach that can win this challenge

”

It can be said that may be all data contains patterns.

(Due to commercial sensitivity sections of this reply have been left blank)

A hypothetical but provable solution exists. Hypothetical only in that it cannot be done in a reasonable amount of time. It is based on a *working* prototype of a streaming OTP data encryption solution I developed.

Without giving too many details, the encryption program begins with an initial key k and plain text p. Encrypting p produces a cipher text c which is precisely 6 times the size of p, which results in key k1. Given the initial key k, one can stream decrypt (“compress”) c down to exactly 1/6th the size.

The statistical analysis ran by the diehard program of the encryption output shows the encryption method produces a cipher text c which is PRECISELY identical to any random data.

Meaning, given this encryption program, for every file of size n > 36 bytes, regardless of contents, there exists an initial key k of some size < 10,242 bytes which can stream decrypt ("compress") that file to ~ n/6 bytes if we view that file as a cipher text c.

IOW, given the random digits file of size 415,241 bytes, we can compress it, given an initial key k down to just 69,207 bytes. Resulting in key k1. Not by any means regressive compression, but a HELLOFALOT of compression nonetheless. The problem is finding the initial key k which is a very specific key that belongs to that file and that file only – OTP. That simply cannot be done in a reasonable amount of time (otherwise my encryption would be broken!) – BUT nonetheless mathematics has "an infinite amount of time" to work with… So one could argue that it is possible to compress any random data, however it is extremely unlikely that you would ever be able to find an initial key k which belongs to that file, to the tune of a keyspace of 256^10242! But there exists only ONE k and that's all that matters…

Oh, I forgot to mention, given an initial key k, the resulting cipher text c – even given the same exact input repeatedly, will result in a unique output every time. Meaning, the initial key k belongs to many different files of many different sizes. Pigeon hole, Key k combined with c, must produce k1.

@krishields:

There’s a problem with your method, in that it’s not only going to be very slow, but is most likely not going to work at all. Let me explain:

What you’re saying basically, is that you have a function E(encryption is just a type of function) that works like this (using your terminology where p = original data, k = key, and c = encrypted data)

E(p, k) = c

Now, c is 6 times the size of p, and from what you’ve said, k won’t exceed 10,242 bytes.

It’s safe to assume then, say for files p where the size of p is over 10,242 bytes, that

size_of(p) + size_of(k) < size_of(c) – where size_of(X) is the number of bits in X.

In fact, if you are going to claim to compress, this MUST be true.

Note that the MRD file is well over 10,242 bytes.

Your function then, has 2^(size_of(p) + size_of(k)) number of unique inputs. (remember, size_of is just the number of bits)

Given this, in the best case scenario, your function can then produce 2^(size_of(p) + size_of(k)) unique outputs.

However, the total number of possible bit-strings with the same number of bits as c is actually 2^(size_of(c))

Since we established that size_of(p) + size_of(k) < size_of(c), that means that 2^(size_of(p) + size_of(k)) < 2^(size_of(c))

What that means is that your encryption function will only be able to produce a tiny fraction of the actual possible strings of length size_of(c). In fact, each bit saved, halves the number of possible outputs – so if size_of(p) + size_of(k) is just 1 byte (8 bits) smaller than size_of(c), you'll only be able to produce 1/256th of all the possible strings with length size_of(c)

Why is this important? Because if you want to start with some random string c, and try to find a matching p and k so that E(p,k) = c, then you are very likely to find that no possible combination of p and k actually produces the string c, since all possible combinations of p and k only produce a fraction of the possible strings c.

This whole problem that I described is just a result of the counting argument, and in fact, your approach is commonly referred to as the "magic function theory", covered here:

http://www.faqs.org/faqs/compression-faq/part1/

in [9] Compression of random data

and more specifically here:

http://www.dogma.net/markn/FAQ.html#Q19

To everyone posting here, the first link to the compression FAQ should be something you've read and fully understood.

Zen:

apparently:

the so-called “counting argument” produces “area”, also “synthetic electro-magnetism”;

it results in “invisibility” (the million digits are not seen)

the secret “no-counting argument”, produces “volume”, “synthetic gravity” ;

it results in “visibility”, (the million digits are seen)

I have other names for “electro-magnetism” and “gravity”…

Your aproach appears to be based in (and limited via) statistics…

My approach involves something different

@Zen

I understand what you are saying.

BUT the common mistake is thinking that k must be static. The counting argument does not apply here. It isn’t, at all. It is modified every step of the way – for every 6 bytes, a new k is produced using information it gathers from those 6 bytes. Meaning, the program will use 69,207 very unique k. This means, if we stored the key at every interval in one big file, we have created a library that is 708,818,094 bytes long and stored it in a file that is just 10,242 bytes long and then combined it with a file that is 69,207 bytes long to produce a file that is 415,242 bytes long.

Put it this way, the encryption actually RELY’S on the injection of random data in c to be fully successful – and actually requires the use of that random data to decrypt the file. Think about that one for a moment… How is this possible?

I know it sounds like a load of crap, but it does work that way. I mean, how in hell can a guy look at random data in a file that was not produced from the RNG on his machine and actually use that to decrypt that file? Sounds Bizarre, but that’s the point. The problem IS bizarre, and requires a bizarre solution.

What this is saying is, a plain text such as this message, must also be the cipher text to some other plain text 1/6 the size of it. Weird, but true.

Again, the idea to inverse of my encryption to use it as a compressor is completely infeasible hence the strength of the encryption given a keyspace of 256^10242 but it is a working solution given an infinite amount of time to work on the problem…

@krishields

>The counting argument does not apply here.

I can’t tell you how many times I’ve seen people say this, as though they’ve somehow worked around the counting argument. Except that no one can work around it, because it’s just a very basic mathematical property of any sort of data.

It doesn’t matter if you use multiple keys. The counting argument applies to any type of data manipulation, because it only concerns the size of the data, not what method you use.

The number of possible files of a certain length is only dependant on the length of the data. Because of this, you simply cannot have enough combinations of smaller size data to represent ALL possible larger-size data.

It doesn’t matter what algorithm you use or what sort of trick you employ – if you take write any program that tries to take some file and outputs a smaller file, you will find that you can’t output a unique smaller file for every larger file, simply because there’s far less unique smaller files than there are larger files!

Please read the counting argument again! And again! It’s universal, it’s basic math, and you can’t bypass it!

@Zen

But you can because you are not concerned with producing the entire scope of the file, “in one interval”. You are only concerned about producing a tiny portion of it at any given interval….

While I cannot fit 15 pigeons into 16 holes all “at once”, I can stream them so that at some interval within the operation of the program all pigeons have gotten a chance to sit in one of them “for a moment”.

So I make for myself, another box to contain the pigeons that are waiting to sit in one of the holes.

Given a block size of 6 bytes, I am only concerned with being able to produce 256^6 possible combinations during that interval, and 256^6 possible combinations during the next and so on until we have enough intervals to cover the length of the file. Our box which contains the pigeons that are waiting to be sat is the most important aspect – we can re-arrange that box however we like until we find the pigeon that we want and then record its location.

If the algorithm is based in causation, the output given identical p and identical initial k are in fact, going to be unique and resulting k1 will also be unique, generally speaking anyways – notwithstanding the statistical “chance” of reproducing the same end result, even given different input.

Kris

Wait, that should read, “16 pigeons in 15 holes”.

@Kris:

>But you can …

Nope.

I was going to try to explain this again, but if you haven’t understood what the issue is, another explanation from me isn’t going to change that. I also think this is not the right place to have this discussion.

The problem is that to show the counter-argument to what Zen is saying is difficult to do while maintaining commercial secrecy.

There exists what I could call “the no-counting argument”.

It is too sensitive to say here.

Some comments though:

Zen claims that “The number of possible files of a certain length is only dependant on the length of the data. Because of this, you simply cannot have enough combinations of smaller size data to represent ALL possible larger-size data.”

The problems here: “possible” files, the word “possible” implies a statistical perspective. What about the number of “impossible” files? You are lucky to get this big a clue.

And re: “length” of the data, well, who is worried about

“length”?

“Cannot have enough combinations”… who needs “combinations”? Who said anything about representing “ALL”, “possible”, “larger-size” data?

These are statistics-based perspectives…

With hyperspace triangulation, you do not need to worry about “statistics” (and associated (relative) limitations)

You do not need to represent ALL possible combinations of “larger size” data with “smaller size” data.

You just need to find how, and in what way, the data is “like a blob of something sticking together like clay”.

It’s “large size” IS its “small size”, because it forms a blob- it has volume. It is no longer strung out.

@Zen

On the contrary, I do understand what the problem is.

Give me 8 bytes. If I analyze the file one byte at a time, I give you 256^8 possible combinations or files. I have to be able to come up with that.

Say I set my block size to 2 bytes. This gives me 4 runs each looking at 2 bytes at a time. Each 2 bytes gives me 256^2 possible combinations or files that I can produce.

(256^2)*(256^2)*(256^2)*(256^2) = (256^8)

I’ve successfully blocked out all possible combinations or files.

Now I have to represent all those in a smaller file, somehow.

But how am going to do that. I am going to represent them in two files. One that remains static, and one that is dynamic. But for this initial explanation, I will generate a static k so you can see the whole dynamic in one frame.

What I will do is set each two byte character in a key file, k. So I have a k that is 65536 bytes long.

We simply string search for our combination in k and assign the appropriate corresponding character, depending on our position in k to it, 0-255. However, this means that we are going to have at most, 4 duplicate characters in p. Thus conforming quite nicely to the counting argument and making it impossible to decompress. How do you account for that?

Simple! Make k *smaller* which brings out the requirement that it be *dynamic* rather than static. Instead of looking at 65536 bytes “at once”, k should only be 512 bytes for each of the 4 blocks so that you only have a single instance of a particular two byte sequence in k per block. That is, for this instance, ‘b’ = ‘ik’. Then shuffle k for the next block and so on so that you can have 4 ‘b’ in the target file, but each time ‘b’ might mean something different, ‘b”b”b”b’ = ‘ik”jh”uy”tr’.

So yes, you CAN get around the counting argument by breaking it into smaller pieces and streaming modifying k. How to stream modify k, determining what block size, k size and initial arrangements of k etc. are necessary to ‘make’ the file is the challenge… Can it be done, given an infinite amount of time, YES. Is it feasible, HELL NO, not in the slightest.

@Alan

Absolutely. We are not interested in producing ALL possible combinations at once. Only the ones which are necessary.

There is some mathematical rule that says, no file of any length can contain within it, all possible combinations for that length of file. So that is a moot point, really. We are only concerned with what is in that file. It matters not what the potential a particular sequence has to be in that file, only WHAT IS. That is, the program should not be responsible for producing WHAT COULD BE in the file, only WHAT IS in the file – for that particular file only.

Computation is bound by the same rules reality are.

Consider, every byte sequence could be a potential “program” in itself. If I had a program that sequentially generated every file of length n, then I have for myself, every program of length n! What those programs could do… who knows.

Hi. Interesting discussion here. Been dabbling with compression myself as a hobby for a couple decades and nice to come across people looking at compressing data from a different point of view, ie., not looking for redundancy or other regular methods. Looks like some of you have been at this challenge for a very long time!

I do have a comment about the post above about the Recursive site claiming the new lossless recursive technology. Looking over his method, the idea of splitting couplets into control/subordinate groups is interesting and certainly will provide some compression with certain data. The main flaw I see right away is with his examples. He didn’t take into account what happens when you have a couplet like 10. 00 01 and 11 will all work as he describes, 00 being the only one that will save 1 bit. Maybe I didn’t think it through enough to see how that would be split so it could be reconstructed on decompress but it doesn’t appear to have a solution the way I see it.

Anyway, just wanted to make a short comment and we’ll read through all these posts and maybe join in on some of these discussions when time permits.

Take care and good luck to you long timers.

Hey,

Been doing a hell of a lot of projects now that I have time.

It is not yet time for me to head off into the warmth of the cave and start working the data and writing code it’s been in the mid to upper 70′s for a while now and this weather is like spring here in Central California.. Obviously climate change but it’s Hawaiian like in my opinion of Hawaiian winters.

Just dropping in to shout out I have not forgot or slipped away.

@ the challenge. I will try to be clear.

This is not a race that we are going to sprint to the finish-line with the gold cup in hand but, I do think we are finding chinks in random-data’s armour.

I agree with the counting argument but what if we only encode 2^(n-1) words ( or less ) of n bit words with a number generator that generates all the words in “a file” with a n minus one bit length codeword?

The file will not contain all 2^n values so are we limited by the counting argument?

I would think that if a codeword can be input and a larger word retrieved then that is (de)compression.

I have over 1000 40 to 36-ish bit examples to work with later on..

Perhaps not the gold cup but it qualifies. Anyone else got anything?

Okay then the faster I get these chores done the sooner I can get to the fun of working on this challenge.

Thank you Kris, brilliant thinking as I was rethinking about what the counting argument may be and it does seem difficult from some perspectives i.e. how could 10 digits stand for 100- wouldn’t several identical-looking ten digits have to stand for different hundreds the counting argument might say?

But one could have non-data in different places so not all 0s and 1s would be data, that would introduce differences as could breaks in how the data is grouped.

But your point is superb- it doesn’t matter if theoretically a stored “compressed” version is one of a relatively limited range of optons to store data compared with all the possible larger versions of the data that could exist- so long as large data plus process becomes less-space required data and that it works.

What is stored is not really data but could be an entanglement between the original data and the process itself and the apparent stored version. (I call my system a dynamic data storage system)

Because the process and the original data are outside the set of possible combinations of the supposed compressed version- the secret is that THE PROCESS ITSELF IS PART OF IT- the process itself “Becomes entangled” with both sets- the original data and the what looks like “end-data”.

One could ask- how does the process “know” that the apparent “stored” “data” is different from what may appear to be similar-looking data? Simple: the “Process” IS NOT ABSOLUTELY FIXED: the process changes- the original data affects the process- there is an interference pattern between the original data and the process- you end out with a “compressed data” that could be seemingly identical to many other “compressed datas”, but the process has altered its patterns- the data is stored in a original-data altered process- hence the data IS ACTUALLY STORED IN HYPER SPACE (It is stored as a large group of sympathetic interference patterns like a hologram (hence: “data holograms”!)

What handles this data is probably not an algorithm in a conventional sense.

It is stored as a “wave” and as a “particle”, it is actually stored “like light” one may even say, it is up and down and sideways (like a wave), and part of something larger (it collides with a detector as “energy packets” (as “bunches of alternative possibiities”)).

One could suppose there is no absolutely fixed algorithm- the orginal data helps “create” the process that “stores” it- so suddenly you have vast numbers of combinations available because a process is time-spread, and time allows a lot to happen.

So tough as it was/ is, the “counting argument” is SHATTERED it appears.

As the data is partly stored “in time” not just “in space”.

All you need to do to defeat the counting argument is store a run time, then identical-looking zeroes and ones could stand for even identical-looking instructions and meta-data, but both would be entirely different!!!!!!!!!!!!!!!!!!!!!!

this of course only (apparently) works because of natural interference in all data (caused, naturally, by what is known in mathematics books as “the birthday paradox”)

You don’t have to worry about storing all combinations of a million digits, just a particular combination. If it works, it works. If you tried to store all combinations, you can store that very easily: “all combinations of a million zeroes and ones” is a sentence that “stores” it, you could say.

The sentence would be longer if you reduced the number of combinations to store- but as the sentence got longer, the number of particular storage systems needed to uniquely recreate each one decreases. At some point an efficient system is reached…

I have, apparently the answer

“All combinations of zeroes and ones” is “predictable”, you could say.

One combination of zeroes and ones is easily stored one may say by birthday-paradox-linked internal interference patterns.

At maximum improbability, it appears any combination can be stored via a particular apparent way.

Possibly, apparently:

all combinations are always stored (In the hyperspace continuum).

My method simply extracts a particular million digits from there

Wow, your post has attracted the crazies with crazy ideas. But in fact challenge 2 is solvable as posed. The following program can compress any file that consists of a bytewise permutation of AMillionRandomDigits.bin by exactly one byte and decompress it to its original contents. It requires AMillionRandomDigits.bin as a reference file (consider it to be part of the program).

In fact it would be possible to achieve a better compression factor because of the fact that the number of permutations of AMillionRandomDigits.bin is considerably smaller than the number of files of length 415241.

Please contact me privately to arrange payment of the $100.

I should have mentioned that the program is written for Python v2 and is run like

./random-compressor [--compress | --decompress] INPUT OUTPUT

where INPUT and OUTPUT are filenames.

I estimated the total possible compression based on lg(number of files of length 415241 / number of permutations of AMillionRandomDigits.bin) and using Stirling's approximation. If I haven't made a mistake, an optimal compressor could compress any file containing a permutation of AMillionRandomDigits.bin by 235 bytes.

@mhagger:

You would certainly win if the only input files were permutations of the million digit file. But of course the wording of the challenge is:

>Create a system to compress and then decompress any file of size 415,241 bytes.

Of course using the word "permutation" to describe sample files I might create is probably a bad idea, because it implies a formal reordering of the bytes, and you jumped on that. Because the problem definition said "any file", you can assume that when I use the verb "permute", I am using the English meaning from dictionary.com:

1. to alter; change.

Not the mathematical definition. But I should change to a more generic term like "scramble". And please, before you decide to engage in a protracted argument about the rules, remember the key point I tried to make:

>The nature of this contest is such that gaming the rules is pointless. You aren’t entering in a quest to beat the rules, you are entering in a quest to beat the data.

However, you certainly win the unstated contest, which is to find a way to win the contest if this were that kind of contest.

- Mark

;-)

I was just kidding about taking your $100.

I was more interested in pointing out how a tiny extra crumb of information can be used to effect a tiny bit of compression. I was initially surprised how little compression is made possible by knowledge of character frequency counts, though in retrospect it is pretty obvious:

N = 415241

Each byte value appears approximately N/256 = 1622 times with a variation of about ±sqrt(1622) = ±40

That fact that byte b appears n_b times therefore gives you (very roughly) lg(80) or 6.3 bits of information. Since we have independent counts for 255 bytes, that gives about 6.3 bits * 255 ≈ 200 bytes of information, which is not far off of the value of 235 that I computed.

The compressed size, 99.94% of the original file size, is nevertheless not very impressive.

@mhagger:

I've seen a lot of people get sucked into traps over the years when they think about compressing based on combinations and permutations. One guy actually had a formula for recoding the million digit file in big blocks, and he could show that he was going to save space. It took both of us a few days to figure out that his problem was evaluating a huge factorial using the online version of Mathematica. It turned out that when it gets to some really huge numbers, it switches from an exact evaluation to some sort of approximation, and even though it was flagging this, neither of us noticed it!

http://marknelson.us/2011/01/09/combinatorial-data-compression/

It is interesting to try to find those tiny crumbs. Every once in a while they will pay off.

- Mark

Suppose the million digits were all zeroes. They could be. Difficult to "compress"?

Suppose the million digits were all ones. They could be. Difficult to compress?

What if the million digits were 500,000 zeroes followed by 500,000 ones? They could be. Difficult to compress?

At what point do they become difficult, one may ask...?

To defeat the "Counting argument", one could say that there must be a way of increasing information content of say 100 digits, so it easily matches that of say 1,000 digits, and there must be information missing in a conventional 1,000 digits.

How can you represent any unique combination of 1,000 digits with say, 100 digits? What is missing in the 1,000 digits that could be included in the 100 digits?

One thing, you could run a list of 100 digits in many different orders, with the list itself ordered in a rotating pattern so that it easily covers a simple ordering of the combinations of 1,000 digits- that would be one way (so you go through a series of differing list orders re: the 100 digits that creates a much longer meta-list to correspond to just a simple listing of combinations of 1,000 digits.)

The question is whether Mark Nelson can be persuaded, without revealing (to him) actually how it is apparently done, that it is logically possible to counter the (so-called) "counting argument" and potentially, theoreticaly, if one knew how (which apparently I do....) solve the million digit challenge.

Regarding the claim that there are not enough combinations available to have a unique 'code' for any pattern of the million digits, well, something the milion digits do not have in their initial state, that a solution can have, is a break into two types of information: so-called 'decoder" and so-called "data". Hence you could have numerous semingly identical lines of zeroes and ones for different versions of the million digits, that are different vbecause that which is "so-called 'decoder"" and that which is "so-called "data"' can vary.

Another part of the lines of zeroes and ones can designate 'run time', and who knows which part of the lines of zeroes and ones refers to that? The computer can store the three roles differently ("Run time". "so-called 'de-coder"", and "so-called "data""), plus the "solution" files can be different lengths. With all these options, is there not enough to cover every combination of the million digits, with less zeroes and ones (If you knew how)(which, apparently, I do...............)

Mark Nelson's challenge: to refute this argument....?

Good work anyway mhagger!

I enjoyed reading the conversation.

Hey Alan.. It makes no sense to me.

I'm having to job hunt and get welfare since my UI benefits are not valid until January. LOL one of life's little twists but I will rejoin you and the challenge soon.

"Life is what happens to you while you are busy making other plans." John Lennon

I do have that potential recursive encoding so that is worth the effort and I will try to improve the number generator and maybe increase the number of matches.

Small Steps.

The problem with the contest is that the random file is to short. Also I feel there is a weakness in your rules you allow 2 files.

One the program file and the other a data file. You sum the program file and the data file to see if they are shorter than the random file.

THAT rule is a mistake. Since and two files will combine to an average length of which is greater than the sum of bytes in the two files see DSC at

http://bijective.dogma.net

and see pairing at

http://mathworld.wolfram.com/PairingFunction.html

To the current problem define a program N+1 bytes long where the first byte is either 0x00 or 0x80 this byte will be XORED with the first byte in the uncompressed file. The rest of data added to this one byte file making it thousands of bytes long is the data of so called random file skipping the first two bytes in the random file but only a few thousand bytes. The other the data file is made up of the rest of the original data file following where the previous stopped off.

The sum of the program plus the data file is exactly one byte less than the so called random file.

To get the first two bytes out of the random file when decompressing with as a two byte number the length of the first fragment as a two byte number then xor the first byte with the first byte of the program. Your done!

So this is shorter than the original file byte by one byte. Well suppose you could code such a program use real code it might be X bytes long but its finite. If you random file is long enough you get the first several byte for free that is the number of bytes that give the length of first fragment. When the random file is long enough you eventually get enough free space to write the program to do the combining.

Of course one should allow 2 files. But instead of adding the lengths of two files to N bytes and say get that should be less than the single file of N bytes. You should add to sum of the 2 files log(N)/(8*log(2)) bytes rounded up to next integer this would make the problem fair.

Or better yet is think to the Randomfile as a single number write how bytes it needs. In this case it the number of bits in the file itself so pretty easy.

Compare that with the number of bits in the program file say its X then look at number of bits in the data file.

Combine those 2 numbers using the pairing function to bet Z at wolfram and get the number of bits used. Require that number to be less than the number of bits in your random file. Which is why in the real game you only look at length of simple program that is run to get file instead of program file plus a data file.

@david:

First, there is certainly no requirement that a program in contest 1 include a data file - that is strictly optional. But I don't see how it causes any problems.

I'm not sure I follow your algorithm though.

- Mark

Gee Mark I was hoping you clear up what I wrote. I proof read it several times but many typos especially the last few lines

but here is an example of what I meant

take any file 2^16 + 2 bytes long assume its random suppose the first two are bytes 0x00 and 0x00 create a file that is

2^15 + 1 bytes long. the first byte is 0x80 and the next bytes are the next 2^16 bytes of file you compressing this is the program file its 2^16 + 1 bytes long the last file is rest of original file which is 2^16 bytes long you have two file that when run create the one file that is 2^16 + 2 bytes long so you save one byte.

Next same file as above but random file has first to byte as 0x80 and 0x00 the answer is the same except know the control byte is 0x00

You look at fist two bytes of the long random file read them as a number if the number is less then 2^15 then you use 0x80 and first word of program if its 2^15 or larger you use 0x00 as first byte of program. Then you tack on to the contorl byte the number of bytes that after the first two bytes. Then the rest of the file is just the bytes left

which is the lat file. When you combine you save one byte.

Extreme example I have a finite file such that the length N is so large log(N)/log(2) is several billion. I can cut the file into 3 pieces. The first short peace is is roughly a billion bytes long that file will be thrown away since it is can be recovered and is less than half the values of log(N)/log(2) so its still a billion bytes long. The next peace is much longer its length when write in bytes is actually the same as what the first fragment is. The last peace is the rest of the file so I have one short piece and two long piece. The there file when you glue then together you get the original file. And the sum of there there lengths is the length of original file. Well I can though the first piece away and since it at least a billion byte replce it by a simple program that is combined with the byte control work and second fragment call that the program its quite large. The data file used with the program is the third fragment. The two pieces the program and the last fragment should easily save you millions of bytes.

That is way you can't have the total of two files compared to a file by its self. You need a handicap based on the log of the length of the two files.

I hope this clearer. Feel free to change any typos you want I am sure there are many.

Okay friends I'm back at it.

It's rough to walk away from a "hot Project" and coming back cold.

The first order of business is to extract all the data and build a codec.

How are you doing Alan, mhagger, Mark and David?

I need to have a file that reports on where the data was taken from the Million Digit File and I need the subset of that data as it's own file.

I hope I warm back up soon.

here are some questions to ponder:

there are many arrangements of zeroes and ones that can be massively 'compressed', even a million digits, e.g. a million zeroes, a million ones, 500,000 of each, etc.

Agree?

Therefore, in storing all possible combinations of zeroes and ones in a million digits, it takes less space than just storing the fully written out combinations.

Agree?

If you allow combinations to interfere, you can transfer savings on 'simple' patterns of zeroes and ones (easily 'compressed') to more complicated patterns?

Hey all.

I have been unable to work on writing code here in a productive way.

I just started to work on the programs I need to extract the data and make a file of a subset (40,000+ bits) of the MDF.

On that issue : I hope my personal life settles down to a focus so I can get into writing code again.

I have a request. I am leaping into attempting to understand DE BRUIJN GRAPHS. I'm clueless about graph theory.

I was looking for URL's or a tutorial along the lines of Graph Theory for Dummies : DE BRUIJN GRAPHS explained or something reasonable along those lines.

Looks very interesting. I would like to understand.

I welcome (simple) tutelage on the topic of "DE BRUIJN GRAPHS and Data Structures explained for dummies" in general.

It looks like a way to index permutations and more.

Thanks.

@ Alan..

How is it going?

The way your post reads I believe segments of some file can be encoded smaller. Knowing what is what is an expense.

I'm taking the lazy approach. I am finding matches in MDF using a number generator that issues codewords of fewer bits than source length.

That's my subset of the MDF I claim to have compressed. I am not encoding where in the file those elements exist.

It would be a challenge to dynamically select what number generator goes best with what subset of the stream.

Like wise enumerating "compressed" segments along with uncompressed segments has an expense too.

I cannot suggest anything at this time.

there is a 'trick' to bypass the expense

if one looks at numbers at a ultra basic level the whole file becomes its own 'number generator', it 'generates the exact million digit file because it operates at the 'hyperspace boundary' where even what numbers are involves a 'limit' on space . (cross-referenced against its own 'uncertainty' 'parameters')

Good thinking re: "dynamically select what number generator........"

Well, this is my 1st post since the new rules... I haven't stepped aside, but I had (and still have) much to do, and didn't have the time to compress.

My number generator finally seemed to be periodic (probably it has a large cycle - several millions - but it is not enough to generate enough different numbers for the file), and that's not good. I think I'll resume to my classical "Puzzle" approach. Or I'll invent some new way to compress. Or if all else fail, I'll try to make a new number generator (I'll read a bit about randomness in general...) At least, it's finally winter (with temperatures under 23ºF), so my computer doesn't need to be stored away in the basement (but it doesn't hurt anyway)...

Someone making progress? Ernst? Anyone?

Same thing here.. Puttering around on the data extraction off and on.

Still aiming at encoding smaller the subset of MDF already compressed.

Recursive compression.

Also I have written to one of the men who wrote a maths proof related to a discovery I made in 2010. I need to get help in understanding that proof in order to see if I have something to add.

I believe at this point I have the Universe of that maths and I assume they have just a piece of it. But, I don't know exactly yet.

I hope he replies.

Wouldn't it be cool if all this bit-crunching adds to our understandings?

@ Alan

I use the same function but use different values for the "X and Y" for 16 instances of the program running at the same time.

The number of variable sets that can be used is infinite.

Any "word" from MDF can have more than one code in the same instance of the program or other instances of this program.

Some numbers show up more often than others.

I was seeing about 100 new numbers found with 16 programs running per day. I can only assume the number of repeat matches was very high. I believe the number of numbers generated not found in MDF was even higher.

The truth is that those 37 bits actually generates 20,480 bits but only the specific instance of one of the 512 40 bit values will match the source value.

There are more possible designs. I know I don't have command of the science.

Vacek Nules asks if anyone is making progress-

I apparently figured out a way to resolve this challenge ages ago. As I am fincially poor I cannot afford to give it away, and can only give very few clues.

Publishing a possible solution to Godel's Incompleteness theorem, and apparent answers re: the Turing Halting problem, the seven Clay Institute Millenium problems, and the Goldbach conjecture, and the physics "theory of everything", would not guarantee being able to eat well and buy a house (as far as I know).

THe Clay Institute require solutions are published in mathematical journals, and last two years of challenges, and generally be accepted by mathematicians, before they even consider their option of paying a prize.

If my apparent knowledge was added to the computer programming knowledge of someone here like Ernst, then the result could be worth hundreds of billions?

But a new company needs to be created????

>Publishing a possible solution to Godel’s Incompleteness theorem, and apparent answers re: the Turing Halting problem,

>the seven Clay Institute Millenium problems, and the Goldbach conjecture, and the physics “theory of everything”,

>would not guarantee being able to eat well and buy a house (as far as I know).

While it is true that the Clay Institute moves slowly, publishing a reasoned rebuttal of Godel or Turing's Halting problem would pretty much guarantee you a good job somewhere in academia right away, no matter how lowly your credentials. It would simply be too remarkable a feat for the world to overlook it. If it was done by someone outside mainstream academia, it would make you all the more interesting.

Your wording needs correction though. You describe a "possible solution to Godel's Incompleteness Theorem." This theorem is a well-buttressed logical statement with a proof that has not been seriously challenged for 80 years. There is no "solution" to it as it is not a problem, puzzle, or challenge. Really, if you are talking about doing something to Godel's work, it would be in the form of either showing that his proof is incorrect, or in showing some extrapolated framework in which it does not hold. Either would be remarkable. Neither would be a solution.

The same thing holds true for the halting problem. To make a rigorous proof that we cannot decide which programs halt takes a bit of work, but at the heart the logic behind it is quite simple. And again, it is proven. For you to do anything, you would have to find a flaw in the proof, which is not "a solution".

- Mark

Thank you for this, Mark.

I am not looking for a job in academia, I have other things to deal with that puts me in a situation where I need to be independent. The default situation is I in due course hope to get a document sufficiently ready to send to a scientist who has signed a non disclosure agreement, and hopefully negotiate a license for a specific application (involving "holographic data analysis and hidden pattern discovery re: seismometer raw data) of a collection of apparent discoveries all related (so includes data storage ideas).

If desperate I could contact investors who lost money on non-delivering data-compression technology investments, to offer a second chance.

Re: 'solution"- I was thinking about how Godel's Incompleteness theorem, and the Turing Halting problems, are it seems depicted as problems limiting mathematics or technology.

My Internet service has been on the Fritz friends. I am able to use between 2 and 7 minutes of Internet time before the line goes off again but I wanted to Wish everyone Happy Holidays as soon as I could. A bit kate but it's the thought that counts.

To all Million Digit folks the best of luck in the coming year.

For Christmas fun here I went to inventing. I invented a transform that can work with BWT and MTF. It maps the values of the set of object words (like 8 bit words) in a dataset to new values in a different way than MTF. There is no physical movement of elements like MTF involved just recoding in place. There are many possible applications. This was an exciting avenue to explore for the holiday.

Still.. I owe a report on compressing MDF. I am still needing to prove my long rant on compressing subsets of Million Digit File. I can cover my ass on that. I'm lazy but I'll get busy this weekend.

Hey @everyone. I am very willing to take a full time job at college so I could go back to school. I would do much better in a work/study mode in my old age than a waiting to die, slowly broadcasting updates mode.

I'm cool with waxing the hallways and cleaning the bathrooms...

Must help with Cat Friendly housing. Toga Parties optional but fun.

Again The Best of Luck to any and all Million Digit Challenge Folks!

Happy New Year.

Challenge #1 is interesting. Even if remains unsolved (most likely so), tackling it could advance theories on how we might be able to measure the absolute degree of order in any data set.

And challenge #2 is bogus and impossible, of course. But I'm sure there will always be people around to confuse the apparently impossible with the mathematically impossible :).

I have good news to start the MDF Challenge New year.

A subset of the Million Digit File of 5860 40 bit words 234,400 bits has been compressed 7.5% and decompressed.

The "compressed" data has been decoded (decompressed). That means a partial compression of Million Digit File has been achieved.

I plan on offering the subset and the compressed file as soon as possible to any who care to examine the data.

Publishing the code for the number generator is on hold pending information on the maths discovery I made in 2010. I have wrote someone who can answer my questions and have not received a reply yet.

Again I announce a subset has been compressed.

7.5% compression achieved.

Good Luck Challenge People.. This Can be done!

@ Alexis Naveros

You Wrote :

Challenge #1 is interesting. Even if remains unsolved (most likely so), tackling it could advance theories on how we might be able to measure the absolute degree of order in any data set.

---------------------------

Tell me about it! 10 years I have been tinkering around with this challenge. As Mark wrote I have come at this from many directions in that time.

I am hopeful I have something to offer on advancing theories as you wrote. If not it will advance my understanding.

I would say that no one is going to get far on compressing million digit file if they don't have tools to manipulate "ordering" and reassign elements into different sets.

I'm way over my education level here but I see some fantastic possibilities and I see mathematical objects to work with. I am needing to find out what exactly I'm romping around in.

The next phase here is a new number generator to see what those maths things have to offer.. It's exciting.

So yeah my friend; it's all as you say. I have to agree.

Happy new year!

Hi,

without directly going to my apparent technique re: the million digit challenge", I can say the "counting argument" can often seem very tough.

However, I just worked out an apparent simple answer, that also (apparently) "knocks out" (from very brief investigation) what Wikipedia calls "incomputability of Kolmogorov complexity" ("Kolmogorov complexity" is said to be the least number of characters required to describe a piece of data in some universal description language), and also (apparently) dissolves the "Counting Argument" (which is I gather saying that if you have "n" items and "m" pigeonholes, and if "n" is greater than "m", at last one pigeonhole must contain more than one item). (Hint: don't exactly count pigonholes...........)

I thought about saying both these results, as it doesn't on the face of it appear to involve giving detail of actual method, but will say some other things just now-

I assume that if one million digits, using these: 0,1,2,3,4,5,6,7,8,9, are written in binary, there are a fixed number of possibilities.

I assume that people recognise that many of these possibilities can easily be stored in very little space: e.g. one million "1"s, one million '2"s, one million of a huge range of patterns (one could say the anti-Kolmogorov complexity is list all easily reduced versions of the million digits)- I wonder just how many non-easily reduced versions there are?

It seems to be assumed that the remaining versions are going to require more characters, even no compression.

Yet overall it takes less space to write all possibilities than just writing them....

so there must be some inherent anti-Kolmogorov complexity....

The pigeonholes are not isolated, but are at least partly linked along with whatever is stored in them- i.e. they form (with the "data that could be in them") a higher-dimensional system-

It comes down to what goes where when-

a universal space-time difference

Another thing is: one approach of people is to look for any possible "weakness" in the million digit file, that may allow easier storing. The technical apparent discovery I found (apparently) allows one to find how to re-arrange the data so that it naturally 'fits togther' like a ball of clay, and to find "clumps" (or "objects" in the data).

Ernst writes: ".... tools to manipulate "ordering" and reassign elements into different sets"

true

Alexis Naveros wrote:

"Challenge #1 is interesting. Even if remains unsolved (most likely so), tackling it could advance theories on how we might be able to measure the absolute degree of order in any data set"

A way of apparently solving (i.e. of counter-arguing to) the claims of Godel's Incompleteness theorem and (If it is a claim) the Turing Halting problem, seems to adress this

(both answers involve a system with at least 42 axioms, allowing all data or whatever to be rotated in higher dimension space that it itself "projects" so to speak, which brings into the scenario the question of turbulence, its opposite 'teleportation', the Poincare conjecture, probably also Riemann hypothesis, Hodge conjecture, and other Clay Insitiute Millenium claimed problems in mathematics.

Its "halting" IS at the moment of its "incompleteness: a perfect ....... (a example of a theory of every thing)(a sort of mini uni verse subdivided into discrete bounded objects)(that (curiously!) "exchange" quanta of "energy")

I recently thought of about 25 ways (from maybe 50 or so have found) to describe, name or relate to the technique I found, but only a few have thought it o.k. to say (most too obvious re: the apparent discovery)

the ones I have said include: "theory of everything", "hyperspace bypass", "stating the obvious".

You Wrote.. without directly going to my apparent technique re: the million digit challenge", I can say the "counting argument" can often seem very tough.

--------------------

It's something to count on..

We have to enumerate codes to tell the difference between them. It doesn't mean that there can't be two or more different codes that map to the same uncompressed source.

Obviously if the weight of trying to encode all N elements in one go is lifted there are systems that can represent a subset of MDF.

Remember who is to say Mark's dataset is the best way to approach a solution to the underlying problem?

All we all can agree on is it's a tough climb. It's a Challenge.

The fun is in trying to climb it.

I'm holding out that you have clever ideas that will bare fruit.

I can honestly say I can see where I was in the yesterdays but I have to take the steps to be some place tomorrow.

How many characters does it take for me to describe all the possible combinations of a million digits?

This many: "all the possible combinations of a million digits". By my count that's 49 characters. See, that's what Kolmogorov complexity is.

Just because the complexity of some string is really low, doesn't mean that the complexity of all of its individual sub-strings is going to be lower.

Ernst, you still seem to be somewhat realistic, though some of your posts hinting at recursive compression are bordering the absurd. But take my advice - pay no attention to Alan's worthless rumblings - there is no mathematical basis behind what he says, he's simply making up things that aren't real and delusional in believing himself a genius.

Why not add a leader board to see how close people are getting to crossing the +% compression gain threshold? That would encourage more people to participate... and in the process something innovative may come out!

It's okay Zen.. Alan doesn't bother me. If he has issues with reality well haven't we all?

Alan you do seem strange.

1 Matches Found. FPos-Element 4563

The first match found from the "compressed" binary. if the binary file checks out then that is "compressing" the compressed.

Thanks for having a little faith in my sanity Zen. I know nothing about yours.

Hey all.

The most difficult thing to sync up has been me. I'm working on the stand alone decoder. Part of that is to rethink the whole flow of data. Rechecking what has already been rechecked. It never hurts to review.

It looks like the compressed binary is true. Having just parsed the first record I will assume that the following records are going to be at sequentially predictable intervals.

Now on to integrating the number generator and generating the first de-compression / decode for the stand alone version.

On "re-compressing:"

Here is a list of the first few matches found then; since the binary appears to be true.

1 Matches Found. FPos-Element 4563

2 Matches Found. FPos-Element 1901

3 Matches Found. FPos-Element 1649

4 Matches Found. FPos-Element 1900

5 Matches Found. FPos-Element 5251

6 Matches Found. FPos-Element 2635

7 Matches Found. FPos-Element 5273

8 Matches Found. FPos-Element 3934

9 Matches Found. FPos-Element 1595

10 Matches Found. FPos-Element 2625

11 Matches Found. FPos-Element 3060

12 Matches Found. FPos-Element 5011

13 Matches Found. FPos-Element 150

14 Matches Found. FPos-Element 4027

15 Matches Found. FPos-Element 3699

16 Matches Found. FPos-Element 2936

17 Matches Found. FPos-Element 4774

18 Matches Found. FPos-Element 4017

19 Matches Found. FPos-Element 1987

20 Matches Found. FPos-Element 1518

The above reports the compressing of the "compressed" subset of MDF.

The same ratio of compression 7.5% is still valid encoding smaller a second level.

There are many questions I have. Like will all 5421 40 bit words of the this binary be found under a "full load run?"

That answer and more has to wait until I load up and run a full load.

Could go either way. I am exploring after all.

It's exciting guys.

Good Luck Challenge people.

Ernst

Update:

Current Chill out music: http://www.youtube.com/watch?v=7apji-hg5j4

--------------

I Love that kind of chill and also pan flute but I digress.

I just extracted the first MDF subset source[0] 40-bit word from the binary with the stand alone version.

Looks good once again. The re-compression then has a green light.

Okay now to finish it and move on. I'll set up for a full load run on level 1.

Let me call the level we stop at Level[0] or some variant like Source[0]. It's time to introduce such concepts.

I am sure it's possible to generate a level[-1] dataset.

Haven't been around in awhile ...

Ernst, are you on the edge of greatness (winning the challenge).

If so, good on you!!!

I'm positive it can be done ... well, let's just say, "incredulously confident".

Thanks CWCunningham

Winning by encoding all that data, that is MDF, smaller is still a mountain I am yet to climb from my point of view.

It is still a really big challenge for me even though I've found codes for some of it.

I'm no where near claiming a win on Challenge 1 or Challenge 2 nor am I thinking I'll have a solution to either challenge any time soon.

However, it would seem to be true that a subset of the file is vulnerable.

At this point it will help us all to be clear on the progress I claim to have made.

The truth is that, it is a simple construct that demonstrates a codec that has an associated data set.

It's a dictionary method.

I can match any "word" in a file that is in that dataset and decode is simply a look-up table function.

The data is generated by a number generator, of my own design, and is not based on statistical aspects of the Million Digit File directly.

It is the year to share some good news of partial success' so good news Challenge friends. A codec of data-compression value has been found albeit a simple one. If I may say there are more wonders ahead. It's beautiful.

Please don't let this news hamper your efforts or darken your spirit. Mine is not the only path. Of that I have faith.

Update:

Synced up the stand alone decoder output binary file with the subset binary file. The "CMP" utility reported no differences. Good enough for me. How about you?

If there is more interest comment I guess.

I just started a "full load" run on the compressed "level1" binary.

That file did change. It had a header added to it.

Full load means all 16 instances searching the Level1 file for words that match the data generated. I have an 8-Core AMD Bulldozer CPU now running on all 8 100%

I will be moving on to the next design from here.

If anyone has questions feel free to ask. I'm rather relaxed at this point so it's all good.

I could write a manual decoder version. Allow anyone to type in 37 bits and generate a 40-bit output.

I can demonstrate generating the Level[0] file from the Level[1] file.

I don't expect to find all of the 40-bit words of the Level[1] file with this new search. I expect to see a subset again but, I could get lucky.

Zen: my "worthless rumblings" are like this because I cannot afford to tell people the actual apparent answer- it has potential commercial value. Hence i have to talk in vague terms.

As for "delusions" and claims of genius- I made no such claims.

Any tricky puzzle may seem impossible till you know how- e.g. the "string and ring" puzzle looks impossible.

The 7 Clay Institute Millenium problems in mathematics can potentially all have the same weakness- if you find that then they could solve.

Have you heard of Edward de Bono and "Lateral thinking"? Do you think his book "Six thinking hats" is strange?

Is being successful "strange"?

I have something- whether it is in error or not and whether it works or not re: the million digit challenge is not yet considered by commerce- if you think I do not have something then you are mistaken- the only question is how good is it?

there is no mathematical basis behind what he says, he's simply making up things that aren't real and delusional in believing himself a genius.

on January 8th, 2013 at 3:02 pm, Ernst said:It's okay Zen.. Alan doesn't bother me. If he has issues with reality well haven't we all?

Alan you do seem strange.

1 Matches Found. FPos-Element 4563

The first match found from the "compressed" binary. if the binary file checks out then that is "compressing" the compressed.

Thanks for having a little faith in my sanity Zen. I know nothing about yours.

on January 9th, 2013 at 9:02 pm, Ernst said:Hey all.

The most difficult thing to sync up has been me. I'm working on the stand alone decoder. Part of that is to rethink the whole flow of data. Rechecking what has already been rechecked. It never hurts to review.

It looks like the compressed binary is true. Having just parsed the first record I will assume that the following records are going to be at sequentially predictable intervals.

Now on to integrating the number generator and generating the first de-compression / decode for the stand alone version.

On "re-compressing:"

Here is a list of the first few matches found then; since the binary appears to be true.

1 Matches Found. FPos-Element 4563

2 Matches Found. FPos-Element 1901

3 Matches Found. FPos-Element 1649

4 Matches Found. FPos-Element 1900

5 Matches Found. FPos-Element 5251

6 Matches Found. FPos-Element 2635

7 Matches Found. FPos-Element 5273

8 Matches Found. FPos-Element 3934

9 Matches Found. FPos-Element 1595

10 Matches Found. FPos-Element 2625

11 Matches Found. FPos-Element 3060

12 Matches Found. FPos-Element 5011

13 Matches Found. FPos-Element 150

14 Matches Found. FPos-Element 4027

15 Matches Found. FPos-Element 3699

16 Matches Found. FPos-Element 2936

17 Matches Found. FPos-Element 4774

18 Matches Found. FPos-Element 4017

19 Matches Found. FPos-Element 1987

20 Matches Found. FPos-Element 1518

The above reports the compressing of the "compressed" subset of MDF.

The same ratio of compression 7.5% is still valid encoding smaller a second level.

There are many questions I have. Like will all 5421 40 bit words of the this binary be found under a "full load run?"

That answer and more has to wait until I load up and run a full load.

Could go either way. I am exploring after all.

It's exciting guys.

Good Luck Challenge people.

Ernst

on January 9th, 2013 at 11:19 pm, Ernst said:Update:

Current Chill out music: http://www.youtube.com/watch?v=7apji-hg5j4

--------------

I Love that kind of chill and also pan flute but I digress.

I just extracted the first MDF subset source[0] 40-bit word from the binary with the stand alone version.

Looks good once again. The re-compression then has a green light.

Okay now to finish it and move on. I'll set up for a full load run on level 1.

Let me call the level we stop at Level[0] or some variant like Source[0]. It's time to introduce such concepts.

I am sure it's possible to generate a level[-1] dataset.

on January 10th, 2013 at 4:04 am, CWCunningham said:Haven't been around in awhile ...

Ernst, are you on the edge of greatness (winning the challenge).

If so, good on you!!!

I'm positive it can be done ... well, let's just say, "incredulously confident".

on January 10th, 2013 at 5:54 pm, Ernst said:Thanks CWCunningham

Winning by encoding all that data, that is MDF, smaller is still a mountain I am yet to climb from my point of view.

It is still a really big challenge for me even though I've found codes for some of it.

I'm no where near claiming a win on Challenge 1 or Challenge 2 nor am I thinking I'll have a solution to either challenge any time soon.

However, it would seem to be true that a subset of the file is vulnerable.

At this point it will help us all to be clear on the progress I claim to have made.

The truth is that, it is a simple construct that demonstrates a codec that has an associated data set.

It's a dictionary method.

I can match any "word" in a file that is in that dataset and decode is simply a look-up table function.

The data is generated by a number generator, of my own design, and is not based on statistical aspects of the Million Digit File directly.

It is the year to share some good news of partial success' so good news Challenge friends. A codec of data-compression value has been found albeit a simple one. If I may say there are more wonders ahead. It's beautiful.

Please don't let this news hamper your efforts or darken your spirit. Mine is not the only path. Of that I have faith.

on January 11th, 2013 at 5:34 pm, Ernst said:Update:

Synced up the stand alone decoder output binary file with the subset binary file. The "CMP" utility reported no differences. Good enough for me. How about you?

If there is more interest comment I guess.

on January 11th, 2013 at 6:28 pm, Ernst said:I just started a "full load" run on the compressed "level1" binary.

That file did change. It had a header added to it.

Full load means all 16 instances searching the Level1 file for words that match the data generated. I have an 8-Core AMD Bulldozer CPU now running on all 8 100%

I will be moving on to the next design from here.

Zen wrote: "

If anyone has questions feel free to ask. I'm rather relaxed at this point so it's all good.

I could write a manual decoder version. Allow anyone to type in 37 bits and generate a 40-bit output.

I can demonstrate generating the Level[0] file from the Level[1] file.

I don't expect to find all of the 40-bit words of the Level[1] file with this new search. I expect to see a subset again but, I could get lucky.

Leave A Reply

Zen: my "worthless rumblings" are like this because I cannot afford to tell people the actual apparent answer- it has potential commercial value. Hence i have to talk in vague terms.

As for "delusions" and claims of genius- I made no such claims.

Any tricky puzzle may seem impossible till you know how- e.g. the "string and ring" puzzle looks impossible.

The 7 Clay Institute Millenium problems in mathematics can potentially all have the same weakness- if you find that then they could solve.

Have you heard of Edward de Bono and "Lateral thinking"? Do you think his book "Six thinking hats" is strange?

Is being successful "strange"?

I have something- whether it is in error or not and whether it works or not re: the million digit challenge is not yet considered by commerce- if you think I do not have something then you are mistaken- the only question is how good is it?

(re-posted as previous post somehow posted other peole's comments)

@ Alan

See I believe in you man.. I'm sure your ideas will bear fruit.

Hey, I have a plan on the next evolution of the number generator Alan.

So with clear direction I'm starting today.

Good luck Alan. I hope to taste the ambrosia from the fruits of your labor soon!

@ CWCunningham and all

191 matches found so far against the already compressed subset from the Million Digit File using the same software that compressed 5860 40 bit words or 29,300 bytes from the Million Digit File.

By using a simple estimate of output count so far it's reasonable to hope that all the 5422 new 40 bit words will be matched. It remains to be seen if that will hold true.

I think I am on to something here. It will require some work over a few months to get an encoder designed.

So CWCunningham and all I'm going to press on with an evolution of this maths I use now and see what that encoder offers.

The encoder software I envision will encode the whole million digit file that is a give you can trust but will it encode the whole file smaller? That is unknown even in theory at this time.

@Alan.. Sorry if you have been made uncomfortable. It does help to have something working to show like I do. I feel much better today than just a year ago. I can hold my head up and say to people who know of my dedication to this challenge and tell then I succeeded in encoding like I said I would last year.

Like I say it goes a long way to be able to say that and have the proof to show if any one cares which as Mark points out few do unless they get the source code to go with it.

I tell you guys I was wide eyed back ten years ago when Comp.Compression was wild with dreamers.

I remember being awestruck reading some of the ideas people had. I enjoyed all the banter about "Random Data compression." It, after all, fit into my programming interests then and still does now.

Some how over time the fires of creativity and BS have been tamped down to smouldering coals. Still remnants remain of those early days and the general rebuff of dense data claims in all of us.

I say to you fellow MDF people I have compressed. I am recompressing. These are truths.

Well,on the home front, my finances are recovering and I have to shut down my Internet for a couple of months to catch the bills up. So, I don't know what day is the cut off or when I will post again after that. I thought to check in, in case it is of concern.

On the recompression, I suspect 50 days of run time to find all 5422 words if they are all in this dataset. I have nothing to suggest they all are. I assume they are not. Still I have enough already to prove re-compression is a reality.

That's it really. It's been a long strange trip so far with miles to go.

I am not out of the running on MDF challenge at all. In fact I have to rule out the possibility of compressing MDF before I will be sure with this next evolution.

So get busy guys. You don't want me to win do you?

Ernst

I'll leave the winning to those who are actually working on the problem.

I had been working on an approach to find the weakest point to attack, when it occurred to me that I needed to get back to what I was *supposed* to be doing.

One day, I decided to look back at that code, and I saw that I could finish it up in just a few hours, which I did.

What I discovered was that those boys at Rand did an excellent job of producing a homogeneously random bunch of bits. This doesn't stop my approach to compression at all, but it does show that a successful approach will not be trivial, IE, there are no (obvious) shortcuts.

I can't think of any reason why you shouldn't win, Ernst. In fact, I'm pulling for you. If my approach has any validity, it will still be valid, whether it is the first to succeed, or the last. But I won't be snapping at your heels in the foreseeable future. So hit a home run soon, Alan has the secret to the universe and it's only a matter of time before Bill Gates decides to buy it for a few hundred million (and what a bargain).

On a related note, my proprietary idea which would be the key to my solution, may prove to be less 'security sensitive' than I thought. That will free me up to release the source, if, and when, the time comes. For now, I'm still preoccupied with other things.

I am currently at a low point in the economic year. In fact my Internet is whatever I can access wherver it's available for a few months.

Anyway, I Update nonetheless.

@CWCunningham

I understand. Thank You for your kind words.

The point of this (MDF) dataset is that it has none of the relative qualities we associate with classic data compression. That in itself is it's strength.

The, on my end, current focus is a number generation function generating a dataset which I can use to search files to be compressed for matches to that generated dataset encoding those data with fewer bits.

This new effort is read to run once the previous is done.

The status of the "re-compression" of the compressed subset of MDF is going well. I estamate 86% of the 5422 40-bit words will be found.

After that writing a program to allow manual input of 37 bit values and exporting the expanded 40-bit words is a good idea. I can polish up the compression and re-compression demonstration.

Since I am at the Job Center looking for work it's reasonable to point out that I am looking for oworka nd why not a job in R&D ??

I am compressing dense data so I have skills.

So I am open to making a quantum leap from labor class to the sciences. I have cats so that is my only requirement if I must relocate. I must earn enough to live on with cats.

About this next evolution of number generator.

The new effort will map any file to a 320 tebibyte virtual datset.

I will examine the mapping for exploits and hopefully be able to compress the whole MDF using that mapping data.

This is a long term project and I can use secure access to 1000+ CPUs to shorten the time it will take to generate a map.

So, I need a job. I have technology and I'm tossing that out.

Until next time.

Good Luck Challenge people!

PS my email is blocked right now so give me some time to get that bill paid.

Drop a note here if you have a job offer. I am very flexable on wages and hours.

Ernst I'm not sure I should say what you have possibly done (commercial interest potential relevance?...)

(I could say that you appear to have possibly made a "parallel universe" to the millon digit file and this is a hyperspace _________)

!

@Ernst

you stat

"I have good news to start the MDF Challenge New year.

A subset of the Million Digit File of 5860 40 bit words 234,400 bits has been compressed 7.5% and decompressed."

My question is this is your subset of the file a continuous subset of the file? If it is good work! If it is not then I can break the file into 2 subsets using one bit words. The one containing zeroes and the other one containing ones these two subsets can be compressed quite small.

Another point is this say using a large word size say 40 bits its a fact that not all the patterns will exist in the file that mark gave. If you make a universal bijective arithmetic coder with fixed weights for those patterns that exist and another for those that don't exist in this file

you have in worst case for a one million bit file 25000 40 bit words you could use:

-log((1)/25001)/log(2)= 14.609698181... bits each symbol this would easily make the file less than 1/2 the size of original file.

for other files. each symbol not used in the million bit file would expand so that each 40 bit symbol would be represented by 54.609.. bits so most other files might expand a little bit.

Does this win the first challenge. No I don't think so since somehow you have to represent the table that was used for the bijective arithmetic coder that caused this compression.

Mark

How does the rule "then decompress any file of size 415,241 bytes." will be verified, you aren't going to test the compressor/decompressor with every possible combination of files of this size.

Can you please clarify how are you planning to test it exactly, ie - how many files of size 415,241 actually are going to be used for testing ?

>Can you please clarify how are you planning to test it exactly, ie –

>how many files of size 415,241 actually are going to be used for testing ?

I could try to be funny and say I am going to test all of them :-)

But I don't really have a specific number in mind. I will say that if someone submits a file that compresses one file, I will surely test it against at least 10 more.

If my 10 other input files are simply versions of the original file scrambled with a reasonably strong crypto algorithm, I feel pretty good about them being a good test.

So let's say that the submission passes the original file, then ten randomly chosen others.

I would probably have to declare that person a winner at that point. It's very possible that the world could go on to problem that there were files that couldn't be compressed, but it is not for me to do the exhaustive analysis. I would be satisfied that the submission met the spirit of the contest and was at least a pretty good result.

So if you want a precise answer, how about if we say ten? I reserve the right to change my mind, but I won't be doing it just to hassle someone, there would need to be a compelling reason.

- Mark

Mark,

Thanks, that's the answer I was looking for, as it adds some room for making it actually work.

Even if somebody can come up with random data compression algorithm it sure can't work well with all the data every single time.

unless it Is "all the data" every time ??????

Hey everyone!

I am still very poor but in America we do have free WiFi spots so I am now sitting at one.

About the 7.5% compression. It is a subset of the million digit file and the "recursive" compression is a subset of that subset.

I have not compressed the whole(compressed)subset of the Million Digit file just part of that subset. I have compressed the subset of the Million Digit File 7.5%.

What I can show with that is the potential encoding of the whole file.

No claim of complete success compressing the Million Digit file is being made.

Let me see, long time no update.

I stopped running the "Number generator version 1" which gave the 7.5% compression of a subset.

I then spent time reworking the "Turing Engine" concept which doesn't compress all input but does allow for a stable system of modifying the stream. Could be useful in encryption. Possible data compression use too.

I then went on to rework the number generator effort and apply OpenMP coding to take advantage of all 8 cores of the AMD cpu I own. There are some interesting aspects to multi-processors and threads.

Version 2 of the number generator is now running and it is closely related in design to Version one. It is closely related because I had the right idea at the start. I did try different coding in the development of version 2 but found the version one concept the nicest.

So far I am matching to 8192 sets generated albeit slowly.

One cycle of all eight threads takes 21 hours and it is matching around 160 new 40-bit words of the Million Digit file a day. That is up from 100 a day with version one.

This effort aims at generating a data set to use in potentially cobbling together data to satisfy the challenge number 1.

What I am now working on is publishing a discovery I made in 2010. There is already a published work on this but that work lacks that actual transform believe it or not. I have the transform so it's a matter of learning to write a proper peer reviewed paper that will add to the previous work by an honoured and distinguished cryptographer.

I have little experience or training in writing a proper paper so I'm thinking I could use a buddy who has experience in doing this. I'm considering the local college or signing up to take the basic two algebra classes at the local J.C. to gain access to the maths faculty.

So that is what I plan.

Over all and again the time frame for the version 2 number generator run is 18 months. The actual generation can go to infinity but the data set I assume I'll have after 18 months will be enough to start with.

I simply did not find an exploit in any of the encodings (codecs) I have written over the years.

I now assume that simple matching of one value in file to a value in generator is the way to go for future efforts.

Perhaps in the future generating such a dataset as a 5 tebibyte set will only take a few seconds on the computers of the future but for now we have to brave the slough rather than take the free-way.

If you are interested in coaching me through a paper I will consider that. After all you are my data-compression friends. After all I am the common man attempting the (assumed) impossible so why not be humble and ask for guidence.

I do doubt that I will compress random data in real time but I know I will always try to find some exploit because it suits my nature.

What I can find solice in is that I have something to contribute because of these efforts having been made.

So for 2013-2014 I am wanting to publish a discovery that I know the Crypto-community will enjoy a lot. I assume that will aid the Data-compression community too.

I can see BWT being modified as well as other applications. It might be an advantage for some corporation if they take me on board. I still need a good job even if it's part time so I can go to school.

Never hurts to network.

Good Luck Challenge people!

My guess is that with your computer skills and bravery, if you were aware of what i have found, it may be commercialisable. You could form a company to do that?

There are only 27 ways of placing 27 beans in 27 boxes.

There are only so-many ways of fitting the "beans" of people's birthdays into the "boxes" of the days of the year.

It is said that if you go in to a room that has 23 people in it, you probably will find two of them share a birthday.

What if you went into a room and it had 364 people in it, and you never found a shared birthday even by the time you spoke to the 365th person?

(called "a large eddy" (or "electro ... magnet")

What if you went in to a room with 365 people in it, and you only came across a shared birthday every 52 people (what you have done, inadvertently ...?)

What if you went into a room with 10 people in it, and they all shared one birthday? (called the opposite of electro magnet: gravitational ("collapse") ?

(and what if you.... )

I have said quite a bit.....

error: was to write: 27 ways of placing 3 beans in 3 boxes

Gentlemen: I commend Mark for providing a venue for the discussion and allowing it to live in a very 'noisy' channel. I am sorry to observe that a couple of you have missed an important point: Challenges like this one stand up over time because people misunderstand the problem. You must provide an implementation that compresses an arbitrary file. Solutions like the 'mad professor' approach of hiding the data in hyperspace are nonsense because the potential value of accessing hyperspace is far in excess of any value of winning the challenge. It will be trivial for Mark to supply a file you can't compress.

With that said, I have to travel back to 'the other side' for a few millenia. Watch out for bald men in suits wearing similar hats. (I'm going to miss Fringe even though the stories were thinning out).

@Alan All of the solutions you describe about going in to a room with slightly more than 360 people in it and finding 'n' people with the same birthday are possible and have nothing to do with 'spooky action at a distance'. This is simply a demonstration of the mathematics of probability.

Each case you describe has no magical short-cut causal connection to cataclysmic events in the cosmos and your perception of it is influenced more by unusual brain chemistry than anything else.

I wish I could cure the side-effects but I can't and I wish there were a gentle way to persuade you that there are benefits to the medication but I can't. I hope your situation improves as you age as it has for my younger brother whose future was cut down by a tragic neuro-chemical development. None of that prevented him from saving the life of our youngest brother with his gift of bone marrow stem cells. There is a purpose to life and we all urge you to accept any help that is offered.

wow, John , you sure are trying to cause some type of annoying?

who is the "mad" professor? People on the street used to think that man would never walk on the moon. People used to think it was witchcraft to sail beyond the horizon on a ship?

People who had an agenda against those who were slaves tried to claim that slaves who run away were suffering friom a "disease" called "drapetomania".

Colonel Sanders received one thousand "no"s before someone said "yes" to his receipe for Kentucky fried chicken. EMI said "Guitar bands have had their day" and wouldn't sign up a new band called "The Beatles".

Are you afraid of losng your priviledged position in the world? Or are you just being a classic TROLL?

I have what I always said I have: an apparent, theoretical, potential answer to the so-called "million digit challenge" and a lot of other things. I object to any call to psychiatric fascism- conflict and disagreement are just that, conflict and disagreement. Those who have priveledge might not like a more free world- trying to medicalise and verbally abuse those who dare to challenge academic orthodoxy is a type of fallacy in argument (read: "The Uses Of Argument" by Stephen Toulmin, also anything by Thomas S. Szasz).

you wrote an extraordinary thing- apparently you claimed that any solutions involving hyperspace werre nonsense because the value of accessing hyperspace was far greater than the prize Mark was offering- um- how can something be "nonsense" because it is more valuable than a prize being offered?

That looks like what is known as a "Non sequitur".

Your conclusion is not warranted by your premise(s).

The claim is made that it wil be trivial for Mark to provide a file that I cannot compress. That is your opinion.

You refer to "spooky action at a distance". I did not use this phrase. I can say other things.

You refer to "mathematics of probabilty". Yeah- what about higher-dimensional mathematics, improbability, and so on?

You talk of "magical short cuts to cataclismic events in the cosmos"- I did not use this phrase; you may be refering to my referring to "electro-magnetism" and "gravity".

There appears to be a misunderstanding and/or lack of comprehension by you of what I am doing. Your lack of understanding and/or comprehension of what I am doing should not lead to verbal / moral attack.

I am looking for licensees for commercial use of apparent theoretical breakthroughs- so I have to say less than would reveal too much. If you want to know the missing bits- please assist me to find a licensee, so eventually the detail can be published with patent protection if need be.

accusations of "unusual brain chemistry" is just an abuse- fist, it matters not whether I or anyone else has "unusual brain chemistry"- it has been said that Nazis understood that being a Jew meant you could be discriminated against, similarly in western societies people who are accused of being different are often abused- in such societies it is tanatmount to calling for abuse of a person to claim that they are different. You then went on to abuse me by talking of "medication"- that has no more vailidity than calls for "curing" slaves that dared to run away.

Your post could be "trolling" - all that talk about stem cells- I'd say you are stirring.

Regrading apparent lack of understanding;

I often look at "minimum defining characteristics" of concepts. It so happens that I found that what is called "electro-magnetism" appears to match (as a pattern of information) another pattern. Similar with what is called "gravity".

It is sad if you are too scared to challenge those who would claim to have "the universe"- the physics and mathematics hierarchy. The Universe belongs to everyone.

@Alan on February 20th, 2013 at 3:55 am, :

You made an error again: placing 3 beans in 3 boxes

could be done in 6 ways, because this is a permutation problem

and 3! = 6.

Written out:

Box 1 Box 2 Box 3

1 2 3

1 3 2

2 3 1

2 1 3

3 2 1

3 1 2

--> 6 possibilities

http://en.wikipedia.org/wiki/Permutation

Not an error:

beans: A, B, C

each box can contain 3 beans, there are three boxes:

(I could call this a differentiated limit in 5 dimensions. I I got this beans/boxes notion from a Time-life book about Mathematics)

It only seems to adress the question of how many of each bean is in each box, with room in each box for three beans.

you could possibly call it "quantum entanglement"

there is likely an error here as I listed 28, must be a repeat entry somewhere:

(1) box 1: empty. box 2: empty. box 3: ABC

(2) box 1: empty. box 2: A . box 3: BC

(3) box 1: C . box 2: A . box 3: B

(4) box 1: A . box 2: B . box 3: C

(5) box 1: C . box 2: B . box 3: A

(6) box 1: empty. box 2: BC . box 3: A

(7) box 1: empty. box 2: AC . box 3: B

(8) box 1: empty. box 2: B . box 3: AC

(9) box 1: empty. box 2: ABC . box 3: empty

(10) box 1: ABC . box 2: empty. box 3: empty

(12) box 1: A . box 2: empty. box 3: BC

(13) box 1: A . box 2: BC . box 3: empty

(14) box 1: BC . box 2: A . box 3: empty

(15) box 1: B . box 2: AC . box 3: empty

(16) box 1: AC . box 2: B . box 3: empty

(17) box 1: C . box 2: AB . box 3: empty

(18) box 1: AB . box 2: C . box 3: empty

(19) box 1: empty. box 2: AB . box 3: C

(20) box 1: empty. box 2: C . box 3: AB

(21) box 1: BC . box 2: empty. box 3: A

(22) box 1: AC . box 2: empty. box 3: B

(23) box 1: B . box 2: empty. box 3: AC

(24) box 1: AB . box 2: empty. box 3: C

(25) box 1: C . box 2: empty. box 3: AB

(26) box 1: A . box 2: C . box 3: B

(27) box 1: B . box 2: A . box 3: C

(28) box 1: B . box 2: C . box 3: A

I found that there were no repeats, the numbering was missing 11, so the list does contain 27 entries

...

rather: it only seems to adress the question of "which bean is in which box", with room in each box for 3 beans

Gawd bless the noise John :)

After a decade of writing codecs related to MDF it's just fine with me to have (even this) social interaction with those doing the same. The Monks life of it sucks if one is too isolated.

Proof: The work is as I think it should be; full of reflection with every word, construct and example.

So far I read that it should be clear and understandable. The contrast between novice and professional demands careful reasoning.

Have any of you published a paper?

I will be watching Joel FeinStein's videos on proofs but I welcome advice on writing a paper for publishing.

Anyone work with OpenOffice? I have questions about how to use it to do formulas and graphics.

As it stands, I have nothing that can crack random data down to compression "in whole."

Sure a subset of the MDF has been compressed and then some of that compressed subset has been compressed but the challenge of encoding all files smaller is there.

With thye Number Generator V2; In time there will be a complete set of all the elements of the MDF file generated by the number generator effort here. Also in time each of the 8192 sets will find all the elements of the MDF individually. That is months and years with just an 8-Core cpu.

In time perhaps, some careful selection of encodings might qualify as a compressing the file.

It's all "maybe stuff" even after a decade of trying but, it is real.

Oh and @Alan..

I was thinking about you and hyperspace which has never seemed real to me in the first place but what about zero dimensional space?

Could space itself have qualities without mass?

Could there be a cycle or system in one dimensional space?

What kind of space exists just beyond the edge of our expanding Universe? Zero-Dimensional space? Could there be negative dimensional space?

@Alan..

Is what you are reasoning have anything to do with Hilbert Space?

If one wanted to go to extremes with the 3 beans 3 boxes there are far more solutions than 27 it's actually much higher

example even for your first entry

(1) box 1: empty. box 2: empty. box 3: ABC

Actually when you look at a box in this case

box 3: ABC could one not make an entry

box 1: empty. box 2: empty. box 3: BCA

if one considers thinking outside the box who

says you can't look into the box and see the

order of the beans.

Also who says the boxs are not totally different.

maybe you look at box 2 first so that

box 2: empty. box 1 empty. box 3: ABC

could be considered another entry. Of course

this could be extended to infinity if one

makes up enough rules such as where the boxes

are placed on table and etc...

In other words I think even Ben is as correct as

Alan it is just who makes the rules. Sure in this game

or any game like this one can also call the others

wrong with a few more rules.

What does this have to do with Marks challenge.

I think that if one is allowed TWO bit files. One a

pure data file and one a program file where the first

bit of the program file is a complete program to the

special computer saying to combine the bits of the rest

of the so called program file. With the pure data file.

It will create the random file mark used. And the length

of the TWO bit files will be shorter than the Fixed bit file of marks random data.

I still feel that Mark left a way out for the fixed case by allowing two files the so called decompression program files with an extra data file. This kind of arrangement does not violate the counting axiom.

@Alan

That maths of Hilbert space is something I have to learn. I understand it allows us to position satellites using 12 dimension state space.

Speaking of that what do you think of my series

{...-3/-2,2/-1,-1/-1,0/0,1/1,2/1,3/2 ...}

I have a possible explanation or at least a suggestion as to what 0/0,1/1,2/1 are about. One possible point of view anyway.

I realized a relationship in another famous challenge and wrote down that series because of that insight. It's based on something real.

Does it suggest anything to any of you guys? What would the series be called?

Update:

Version 2 of the number generator is running. There are about 4000 unique matches and about 200 more repeats at this time.

As time goes on and it gets closer to finding all 83049 40 bit words of MDF I am sure the repeats will go up. The Ideal is 8192 repeats for each value since I am searching 8192 sets for matches.

It would be fantastic to have 83,049*8192 matches to choose from in cobbling together a compressed file right this second.

I say give me a few thousand threads and I can get this done faster.

Anyone care to donate a super computer for a couple of months?

That's it for now. I will get to Internet as I can.

Oh also anyone care to work with me on BWT?

I know Mark wrote something in C++ but I am a C-Guy and I would like to start exploring BWT. I think having a partner would be really cool.

If anyone is interested in learning about BWT along with me I am open to the partnership. I'd love to have a code-friend.

I already worked MTF, I also have designed some that applies to MTF and BWT.

{...-3/-2,2/1,-1/-1,0/0,1/1,2/1,3/2 ...}

My Bad I made typos on posting my own series. LOL

{...-3/-2,-2/1,-1/-1,0/0,1/1,2/1,3/2 ...}

My Bad I made typos on posting my own series. LOL^2

http://www.youtube.com/watch?v=ITW5bjTipPA

This must be fake :)

John

Of course it's fake.

What's even worse is that this doesn't look like the work of a well-meaning individual who thinks he's stumbled onto a magical compression method.

This looks like the work of someone who has deliberately designed a video to deceive others.

http://www.youtube.com/watch?v=kQsWP6n03EU

Another fake from the same author

thoughts:

Earlier I was very surprised that Ben left empty the columns under "box 2" and "box 3".

Re: David's comments re: the rules: very interesting and not something I can say much about without perhaps saying too much. However, to create "the birthday paradox" one needs the rules that the Time-Life book on mathematics were using (any order of beans, 3 beans spread distributed over 1, 2, or 3 boxes, boxes remain in place once decided (if ------------------------------------you get ---------------the method used for what has been mentioned as project 7?)(rest censored)

Re: Project 7 and "hidden valley":

I do not know much about computers and computer programming, but I looked at this and have a theory of what they are doing and how it works (if it works). At the moment I have as much reason to think it is for real as not to- does anyone have any proof it is fake? There is hardly anything I can say about it here as it may be too revealing (could call this technology "a warp 'drive' )!

as far as I can see they appear to be say partly on to the apparent discovery I have found?

Hilbert space and 12 d state space:

If "Hilbert space" is an objective view of higher dimension space (so virtual 3-d as higher-d space could be regarded I think as "at least two spaces linked by a wall say" - to have any view of the two spaces requires the wall itself acts like a cushion - as if it contains a little space itself (the locations of the other two spaces being "objective" so not exactly located except via mutual location as it were)

then it is an un-stated space

12-d may also be a way to describe a (quantised) unstated space, a 12-d "state space" would be vector analysis in 5 (or 6) "dimensions" (or a Hilbert space quantisation "vector") , so if have "12-d state space" then "Hilbert space" if added to this would give an "uncertainty bias" in 5 (or more) dimensions- a ----------------------------- (Censored for commercial reasons) or the (possibly ???) more-or-less 'exact' 'location' of any object(s) in 'orbit (or statically -----------) around a sort of centre of 'mass') ?

in case you wondered what i just did- i instinctively seem at least to sense how patterns comnect- this can have the curious effect of writing something and a bit like hoping it will be comprehensible ; which on checking often seems it figures out to be true (idea is that:

one gets to trust ones instinct on what word or concept is right though may not know immediately why but after writing can check if it at least seems to be on a brief check as logically o.k. )

e.g. example: solve problem how to skip a stone on water many extra times instinct answer: don't skip it - continuously rotate it in 5 or more dimensions (hold it flat at a precise angle) - now one can look at this idea and see if it is looking like it is o.k. - it may be o.k. to some extent (i heard of a musician called Brian Eno who had a saying "honour thy error"- if he 'made a mistake' instead of throwing away a piece of paper with some music ideas on it, he would stop and think this may open an unexcpected door to a new discovery or potentiality, or something .....

Re: Valley and Project 7

Interesting story but everything in dutch except:

http://jansloot.telcomsoft.nl/Sources-1/Netwerk/2001_1.htm

http://en.wikipedia.org/wiki/Jan_Sloot

There is more but google translate fails :(

Has nothing to do with compression but neural data storage using reference table. Never saw it before.

Mustafa, Budapest

Very interesting video.

Well, I may (apparently, I guess) KNOW the "secret" to this- as what I figured out appears to allow for something that could be similar I suppose to this. I figured out what may have gone wrong and why (if the chip blew up, this may be because of the connection between this type of "technology" and "structure"- if his chip had a buffer zone this may protect it from such interactions with its surroundings that could muck it up). Related subject: solitons; how to utilise an environment to sustain data for very long distances via (the) data "surfing".......

My theoretical technology both compresses and expands the data (in higher-dimension space), rotating it in multiple dimensions till you have a perfect something- it seems to be related to how humans may "store" data- if you go to a very quiet place and listen- you will be able to hear I think a very faint "soundless sound" - like a Sunpak flashgun at the top of its warm-up cycle- so high pitched it is almost "transparent".

I discovered that this "soundless sound" may be like a "space-time doorway"- a sort of 'quantum' superposition of every thing you have heard, seen, touched, tasted, felt, and thought about etc.

It also happens that one description of this idea could be called '(the) theory of every thing". I also call it "a hyper-space bypass".

So who wants to be a billionaire? I sure need a house and some income....

@ Alan..

I wish to name the series {...,-1/-1,0/0,1/1,...} Fibonacci State-Space

Since the "object"is size Fibonacci number as the numerator.

The "division" by zero is interesting and I have to go with BramaGupta on how to handle it. I think I get the Guy and why he suggests we write it as a "pending" operation.

I define 0/0 as function without output.

What BhramaGupta means by Naught is something I am pondering but it seems he is referring to function of object with one and function on object with the other. Naught may be the object but 0/0, for example is function on object.

In my thoughts these days, I see 0/0 = 0 but any other value the answer is an infinity not a rational at all. The answer doesn't exist in finite state-space from my point of view.

I suppose if, in Fibonacci State-Space, then, say 6/8 the object is a 8-dimensional object. in 0/0 it is a 0-dimensional object.

Then again I need to understand why a six dimensional object is called a http://en.wikipedia.org/wiki/6-cube Hexerate

Hex seems to be base 16 to me.. lol...

What all this is suggesting is that functions in higher dimensions may offer some insight for us who attempt to invent new data compression. That's all I'm on about.

As for my traditional Update:

The Number Generator effort is going well.

I was able to write an "inelegant search" function and that increased matching 120 fold. I was more concerned in truth of results at first than speed.

As of today the total approaches 400,000 matches with less than 1500 unique matches to go before all 83,049 bit numbers have at least one vector in the dataset.

I plan to post the last unique number found when it's reported to me here.

Multiplicity of references is an asset rather than a detriment.

The more "matches" the more flexibility in attempting to index the dataset in a smaller binary State-Space.

After all, All Data Compression is a Game of Indexing.

I also believe I must attend college now. I am making the slow turn because going to school at 52 isn't about career it is about life-direction. If I go I should see it through to a degree.

Naturally I must start at the bottom and revisit the basics necessary to be successful such as English and the Maths. The Junior College seems a smart move. I amy have to "warm up" to a life of work and school but I have little else in my private life so adding school won't hurt me.

Also as it would be I am advancing to study BWT now. I hope all of the efforts to transform data I have made will apply when i am able to write and BWT Codec.

Any suggestions on reading or source code to study?

So, it's still going. I am not out of the game and I soundly believe compressing MDF is possible.

That's all I have now.. I hope to afford Internet service at home later this year. Also later this year I will attempt to index the matches of MDF made with this number generator effort. Later this year means after my "seasonal job" is over and The Winter is at hand.

Spring and Summer are outside months for me :) Lovely Spring this year.

So keep Banging those "Data-Compression" rocks together. There is a reason we believe it is possible to compress MDF. So, Good Luck Challenge folks.

Denominator not numerator.. LOL always making typos.

A is numerator B is denominator in A/B

Doh!

@Ernst

Hex, as a Greek word, means 'six' (6). But hex, as in 'base 16', is an abbreviation of 'hexadecimal', which means 16. It's just like 'six' in the english word 'sixteen'.

Hope this will clear your doubts.

@Alan LOL Dude..

I get out to Internet now once in a week or longer and I get a chuckle at the "_____________" Commercial sensitive.

Well at home the count of unique numerical matches continues. I believe it's accurate to say counting down to 200 left to go.

It is finding new mathces less frequently however I would be shocked if it cannot find all numbers of the values {0,...(2^40-1) }

Again I state I plan to let it run till Winter anyway but finding the proverbial "Last-One" will be celebrated.

The good in finding all the _________ Unique matches is that it will represent at least one vector for each 40 bit word in the MDF. For some "_________________" Reason I adopted 40-bits as my word size. I gave this curiosity some thought and I believe it comes from an odd exploration of a design, that proved to not be "all that I wished it to", but from that 40-bits seemed a reasonable length to work with numerically.

Truth be known, no matter what the distribution in the MDF it is rather even for all bytes. For example the current data set size I am working with, excluding the number generator and matching, is 4096 bytes. For that the "byte-symbol" set count is always 256. That means in 4096 bytes expect all 256 byte values to be used.

My latest "________________" sensitive effort is exploring reducing that symbol set down below 200 for each 4096 "block" of data.

This is in anticipation of exploring BWT, however, naturally; I will be sending it out to Bzip2 just to see if any magic happens.

So, I'm not there yet with the "__________" output however the math is solid already.

I plan to walk through the process and learn more so I can bring the construction into focus. It does cost o reduce the symbol set. It is revertible.

Now let me ask. How do we deal with mapping byte values?

I mean say I have this hypothetical "___________" transform. Now given that I can reduce a set of symbols with a modest expense, I then see a large cost in storing the symbol orders as related to frequency. In that I would craft a generic body of data for the whole file and strip the symbol mapping to that aforementioned symbol-data track.

So in that expense of byte order mapping I assume it's about expressing that data in whole and in full size. This is a bother since it inflates data size.

However, I can subject the data generated to the same "_____________" process so as to unify the data with the original data in the end. Still increasing the file size is not desirable naturally.

My restated question "in the clear" is what methods of representing byte-symbol order are understood? Are there clever algorithms I may benefit from the study of? Some clever referential system by chance?

Anyway!!! I am still reading on BWT and I think Mr. Scott and his friend are outstanding in how their "S" transform is written up. I am still pondering all the terms but as far as I can tell; for an exact description, David Scott and his friend are concise to a precision I am yet to master. I do, however, have a copy to digest so I'll be nibbling on it for a while. I believe it doesn't get more exact then that.

So, this brings me to a more general question: Assuming one can manipulate data in a commanding yet reversible way, what makes something compressible?

I read that BWT is a process that makes symbols frequently following others or preceding a symbol, more likely to be in "runs"n of same symbol. Is that the main benefit of BWT?

I believe that organization is considered as favourable input to some algorithms. But which?

I would also assume that fewer symbols in a set of symbols say 100 verses 256 means more repetition and that suggests to me compression.

I also have assumed that some "constructed-codes" for symbols also lend to compressibility along with error detection.

So, with respect and reverence I am asking you; if you could " change MDF" into data with qualities what qualities would you believe would lend to compression?

I'm starting to venture into this 'State-Space" from the number-milling I do and I find I am not prepared since my data set has been MDF almost exclusively for over a decade and I did little with proper data-compression works before that.

So what would you choose if you could "transform' data into other forms? What qualities would you seek?

In humorous terms What "White Rabbit" would you follow?

I am asking honestly. I have not followed traditional schools of thought in this Data-Compression hobby/challenge and now I am facing these questions. What is compressibility?

I have come to terms with "entropy" but see that even learned folks disagree on it's exact meaning. So context is king I assume.

Again what makes for compressible from uncompress-able.

Thanks.

On a personal note the employment season is about to start here so that is a positive. I plan to reclaim Internet service at home at some point but it's months away still.

So, friends.. I cannot put this "book" down however at some point I must return it to the library of life. Until then I do not see myself giving up on this quest.

I appreciate any input you can share.

@Vacek Nules

Thanks. I assume I could have search for an answer but conversation is rewarding too. Thanks.

Hilbert space is exciting. Robotics is indebted indeed.

Now if we can do for the process of thought and conscience the same then we will truly have State-Space. I do believe our minds work on mathematical principles.

Isn't it funny we seek to design machines that think when we are one already.

I have been exploring Plato so I'm on about such things.

Again thanks.

"So context is king I assume"

!

You deserve a gold medal Ernst but I am tooo poor to say why

You are asking the very awesome questions...........

Well Alan, I would say that context is King of data compression in the same light that Higher Arithmetic is the Queen of Mathematics.

I should also be more sincere in general here. I am being a Newton for sure with all my "__________"'ing

The works I am experimenting with is a Mathematics that allows us to transform some "data" we can call "object" to billions of other forms of data the same size in bits. I discovered this mathematics.

I am now considering David's S(.) transform because it promises to "bwt-style" sort a data without overhead.

Thus iterating David Scott's S(.) transform is realistic.

I'm at my local coffee shop right now using the Internet to see if there is existing code for David's S(.) transform. That would save me from reinventing the wheel although I shall continue to learn what I can about the BWT in general and to absorb the fantastic construction that is the S(.) function.

Imagine Alan that given an "Object" of say 4096 bytes, that I can present to you thousands of versions of that object for just a few bits of cost.

Imagine that from that set one of those instances (I call them eXpressions) is selected for some quality(s) that lend to construction of a data that will compress. We are now constructing a data rather than simply exploiting a data which as we know has a limit the MDF challenge is based on.

In general I think this is the path we are all wanting to walk. This mathematics is something useful and I must learn what I have to to properly report this discovery.

@ David Scott

Dear Mr. Scott. I am unsure as to if I have an email account I can use since I am still needing to pay my bill with the phone company but I intend to contact you privately.

In short, if I am reading things correctly, you S(.) transform requires no overhead. It costs us nothing.

The news then is I can iterate your S(.) and we can construct new data from old. Perhaps we can construct something compressible from that which isn't through iteration and modification.

This is Mathematics so it's not a fluke concept.

Okay guys I need to see if I have access to my Gmail still. I may have never written the PW down but the computer knows. Until I have recovered myself I will be checking this forum for news.

I wish to let David know that there may be something more exciting to come for his S(.) than what he has already done.

Also I could use working code if there is some. I have to check on that. I mean I could write my own code but it's a lot of work to do so I am sure. So Link me please to David's S(.) code fi there is any and I can start experimenting with this.

I have a good foundation with the mathematics of my eXpress() so I want to see some results of iterating S(eXpress(object).

It's that zero cost that is a gem David!! Well Done!

Oh and I don't see why I can't share the "Number Generator" results with everyone and they can work on it while I keep the mathematics behind it closeted a bit longer as I work to publish properly.

I have a lot to do and I think some Junior College classes are in order since I an 20+ years out of school and don't remember how to write or even how to solve a quadratic equation now.

Oh I am accepting donations to that end! I need money to go take classes. I'm poor and I will go if I can pay.

Okay then.. Till the next outing to find an Internet connection.

Until then I have no Email to use until I pay my phone bill I assume. That will happen but not as soon as I would like.

Ernst

Offering to make my encodings the new MDF file.

If you can compress my encodings you compress MDF.

Okay, Good News..

I can be Emailed at quintery747576@Gmail.com

Also I wrote David Scott.

Hey I have to do this and sooner is better than later so I'm now aiming in that direction. I mean I could die and no one would ever know if I don't right?

I welcome suggestions on how to proceed.

Does anyone have an OpenMP friendly version of Scott's BWST or S(.) ?

I started to modify David's code but I thought to ask if someone has done the work already.

-------------------

Other than that nothing much different.

The Number Generator is approaching 83,049 I think it was 83,005 the last time I looked.

I thought to be clear this is a swath of data ranging from 0 to 2^20-1 but it must be realized this as a parallel for loop so those iterations are divided between 8 threads.

By that division it cannot be said that these "last few" matches to complete a set of all 40-bit values that make up the MDF, are rare or even last. It is simply the results of a search order.

In addition to that it can be considered that given enough inputs there will be a reasonably uniform number of outputs generated for all sets.

Currently I am generating 8192 sets for each "input" and over time all sets will have set sizes approximately the same as all other sets. It just takes time to find them.

That's all. It seems to be a few days away before all 83,049 40-bit words are found. After that, I expect to let it run till winter so there will be millions, hopefully, of matches to the MDF.

The Goal after that is to find some indexing that allows each code to be represented with fewer than 40-bits. 39-bits each will do.

I figured that to compress MDF we need to be able to encode the whole file. By that I mean our encoder must first represent all possible values.

As it would be there will be many more codes for each 40-bit value of MDF than just one. I believe surjective is the correct term for this where given the set of 40-bit words that make up MDF there are codes that point to those elements No one code points to two 40-bit values of MDF but more than one code can point to just one 40-bit MDF value.

Oh hell, I'm learning as I go. It's coming along.

http://www.youtube.com/watch?v=hgDaKL_YpKM

james gates jr + Strange Computer Code Discovered Concealed In Superstring Equations!

Wouldn't it be cool if we will know the actual codec of all things in our lifetimes?

Just thought to toss Alan an Info-Bone..

A better version of that video..

http://www.youtube.com/watch?v=GZKmyMaG40o

also http://en.wikipedia.org/wiki/Adinkra_symbols_%28physics%29

I am back at the Coffee Shop to set up my Voice Mail.

I have never had a Cell Phone before so it's awkward.

I looked at the total unique matches of the number generator on the way out of the house and it had 28 more to go.

Again I say, this isn't an instance of least nor does it represent some frequency. This is simply a product of a shared iteration of a for loop among threads and the numerical range I am searching.

Over a large input set the outputs are more or less the same for all sets.

I will come and post when the proverbial "last" match for a complete set of the MDF using words of 40-bits in size when I have it.

I plan to let it run into the Winter of 2013 so I have many choices for each 40-bit Element of MDF.

This also is not a statement of magic compression.

If there is to be a result that satisfies the MDF challenge it will be based on clever indexing of those surjective codes the number generator has generated.

It's possible I will be busy with returning to school so it's possible all I will have time for is to verify and release the data and let others attempt to index the data.

If not then I may attempt finding some indexing myself.

What is important with what we call Random Data is that we need a codec that represents all the data and in a few days I hope to feel more confident that indeed such a thing is a reality. I'm rather confident the theory is sound.

So, Good Luck Challenge people. We can give up the day after we die :)

A "superstring equation" could be said to be a way of showing "the fifth dimension".

A "computer code" is a code about things "adding up". A "code" is an alternate view of a thing, one could say.....?

A "computer code" would appear to be "an alternative way that something can add up".

Therefore a "computer code" implies "an imaginary string", or alternatively: a "theory of a string" or it could be "just a super string" (a higher d string".....(???))

So "computer code" EQUALS (in minimum pattern definition (?????)) space) the information idea "superstring".....(???)

So to see both "superstring" AND "computer code" would result in "strangeness" in one or the other.

The fifth dimension allows SPACE to be seen, if there is any space in it, to actually identify at least some of this space would result in seeing "a strange computer code in a super string" (?)

Although you said "superstring equations" so that would give a mixture of space swaps (3 and 4 d 'random numbers')

off-topic (?) question

Yahoo! are closing their e-mail classic system.

I wish to copy and paste my data (e-mails and files) - it will take over 100 hours copying and pasting each file unless there is away to do this in bulk. They imply something called an "e-mail client" can do this via something caled "IMAP"- I looked these up on Wikipedia but where do you buy the appropriate program?

Alan,

I recommend you get Mozilla Thunderbird. It is free and you will find help readily available due to its wide use.

And of course, it is free.

A good IDE would be to suck all your email out if the Yahoo! account, then back up to a new gmail account. Keeping your email on a local PC is just not a very good idea. I think the only good reason is if you are engaged in criminal activity.

- Mark

Hi,

thanks, however I have looked at Mozilla Thunderbird before and it seems to require considerable expertise to use. If I could only buy a program to do this job I would pay for it...

I have been alerted to "Windows LIve" but it looks complex, something called "True Switch" looks more promising but they don't seem to say what it costs or how to pay for it.

If I was a multimilionaire I would consider paying US$10,000 to get this done...., if that's what it takes though it shouldn't be that hard

I don't know anything much about where my e-mail is kept, I assume it is on a remote server somewhere run by the e-mail company, I just use web-based e-mail like many people do. I have no interest in criminal activity, in fact I am missing out on a lot of things because I refuse to lie, I refuse to pretend to have read or agreed to the incredibly long, very complicated, and often impossibly or impracticably demanding "terms and conditions" that corporate lawyers try to force on to their would-be customers.

(Just as well I have aparently solved many major mathematics and technology and physics issues- I may be able one day to offer people lying-free technology. If companies are worried about frivolous lawsuits they should say that their terms and conditions only apply in-so-far-as customers use the service at their own risk and that the company may be protected from lawsuits by what is in the terms and conditions, instead of a mutual lying situation where users of facebook, linkedin, Yahoo!. Youtube, i'tunes, Gmail, and many more so-called services pretend to have read, checked in case of changes (re-read daily?) so-called "terms and conditions" and the company pretends that the customers have read and agreed to the so-called "terms and conditions". (I heard that the rather dodgy tv show "Southpark"

has looked at this issue- I am not alone in noticing this it appears)

There may be a usable e-mail service in Germany where their laws appear to be more honesty-friendly, otherwise I may be lost from this forum unless I can get dispensation by entering an e-mail address after it no longer functions.)

some comments re: so-called "random numbers" (they are not numbers....?)(they are ________________(Censored for commercial sensitivity reasons ))

This is censored for commercial sensitivity reasons or suchlike:

What is "random"?

How know a __________ is 'random'?

___________________________________________________________________________________________________________________________________________________

random numbers are continuously _____________________

______________________________________________________________________________________________________________________

= computer code

i.e. __________

______________________________________________________

random number 1; 1,2; _______________ implies imaginary ____

a fixed direction in space

non-random number ((implies))

a fixed direction in time

__________________________

3 or 4d random numbers: quarks and gluons

2d random number: ____________________________________

3d random number:______________________________________

5d random number: a probability width width

a probability depth

a __________________________________________________________

4d random number: a minimum path to connect a collection of large and or distended 'objects' in space

6d 'random' number: no such thing unless have 8 (?)

(table)

7d 'random' number: no such thing unless have 7

(periodic table)

8d - doesn't exist unless have 'elements'

then it could 'exist' if (only if?) the elements are (sub) divided into 7s and 8s

metals and non-metals (?)

if you have a periodic table of the elements then you can ____________________________________________________________

the 5th dimension quantise (to at least some degree)

(10d space time interferometer (???))

5d random number: quark _____________

Gluon ____________________

....

hey guys!

Couldn't take another hour at the keyboard at home so I had to get out for some wifi action. LOL If only a computer was a woman I would be popular with my keystrokings.

But It's not lol..

Lets see. the count seems slow with 12 to go still. I may be mistaken in how waiting to find them all isn't like finding something infrequent or rare.

Yesterday I assume it too all 24 hours to find a new match while the total matches is possibly 1,000,000 by now.

If it takes a day then 12 more days if it gets even more stretched out then that is that.

Did I ever mention the MDF is a real pain?

Well what is new... I reworked the BWTS(0 archive from David Scott's site so it is an in-line single call encode/decode with OpenMP action happening.

I bit of testing in application to do and it will be safe to share.

This version aims at the multiple "file" process or bulk packet processing.

I have 8 cores so I can provide 8 different datasets and it will churn away.

This function can be compiled without OpenMP and will function as a single thread function.

So, ahead is the stroke I will have waiting to see a complete set of MDF in 40 bits and in a surjective codec. Meaning that instead of finding a code that is based on source I have many codes that point to source to choose from for some possible compression scheme.

Other than the stroke from waiting I now want to interface the eXpress() function with this Bwts() and process thousands of "forms" (to borrow from the Great Plato) with it.

I'm giving thought to what qualities I want to search for.

What qualities would be ideal under Bwts()? I assume a run of length of data is ideal but that isn't reasonable.

I figure runs of same data are important. That something akin to RLE might be used to measure the quality of that expression.

Lets back up a bit. :

Imagine that processing any given data is possible by sending that data into a function that provides thousands of "versions" or "forms" of the same length string.

The data as viewed by characters then changes elements but not total bit length. That remains constant.

That these forms of the data then can be selected for a small cost of a few bits and can be use in place of the source suggest to me that for some string that has a quality of being made more compressible with Bwts() may have an element it that eXpress() set that is even more compressible.

I am learning what compressible means so please be kind.

So given the idea that we can select from (okay billions and billions and billions) of elements with modest processing of combines systems suggest to me that we have the ability to select a surrogate data that has the qualities that will compress.

That is where I am right now. I am starting that project of developing the software to generate those billions of surrogate data.

I mean to say interfacing the two functions already "off the shelf" Here.

i don't know what exactly to do to apply statistical analysis to this. I admit there is a limit to my formal education and thus my skills need developing but my math is solid. Well it is the math of the realm no isn't it..

So, friend! I am starting to work with Burrows-Wheeler transform Scottified. Bwts() and any suggestions of what makes a great output even more sexy will keep these fingers fingering my keyboard.

Some of the things I thought of to look for are runs bwts() generates as in the best score.

I ask myself what sort of scoring can be done.

I ask myself what makes data compressible.

I have the theoretical ability to iterate the process. With Bwts() there is no cost so I can apply that at will but with eXpress() there is some overhead of indexing so it can be reversed. very few bits.. Something like 2^16 for 4096 bytes not bad.. 2 bytes spent to pick among thousands of forms of data.

So, I figure I could be moving data towards some quality with iteration. It's possible. Then knowing, in a sense, that going this way over that way is better for example is good enough to experiment.

Hey you simply cannot keep trying for over a decade and not make progress so it's over when I am dead but it just keeps getting more exciting.

Again advice is welcome.

@ Alan

Hey,

What the String Theorists are lacking is a data type. PROOF.

I have it. I am working with it now. How else can I get a billion versions of one 4096 byte data then need a Bwts() that can process by the truck load looking for the Gem of a form with the small overhead of 16 bits to reverse things.

No one is going to take me serious when I say I found the "spinning strings" of String Theory in 2010 and am working with the mathematics now.

I am going to say it anyway because the real loser is the bingo player who knows they have a bingo but waits till someone else claims bingo after the next number to yell bingo.

So think of me as King-Kook but I have a piece of the String-Theory puzzle we are looking for with String-Theory.

I have the data-type and it runs of our computers already. The "Spinning-Strings" are real.

I hope to publish the Proof if I can manage to get some schooling going. I think I can do something with the books I have but it will help to take English and Maths again so I have an updated skill set.

I have given this some thought and I am in-between being a Newton with my solitude and my binary-alchemy, and, being a Tesla with grand-visions of application of theoretical works he most definitely proved he could "pull out of his mind like a magic rabbit."

I hold both Tesla and Newton is high esteem.

So I may be laughed at but "BINGO' I have a something to contribute to String Theory, data-compression and cryptography.

There I said it and let it be.

All I need is a job that pays enough to go to junior college and enough to live on so if anyone wants to take a chance on me I'm open to offers.

I come with technology you can't find any place else and 5 cats.

@Alan Well.. over 10 years at this you know I had to find something that kept me in the chair besides the graviton-action at a distance.

So again.. any suggestions of what makes a Diamond out of coal in the data-mine that is eXpress()+Bwts() is welcome.

I suppose I need to study statistics so I can measure frequencies of individual elements.

I told you guys I need 1000 CPU's with 16 cores each didn't I.. That an a Grant to pay the electric bill and the Pizza-Man

Well there it is.. As Kook as it Gets! String Theory data-Type and all.

Good Luck Challenge people!

All my typos make me read like I come from a non-English speaking country..

How funny.

Spinning-Strings anyone?

This looks interesting and so I thought to share..

http://en.wikipedia.org/wiki/Algorithmic_probability

It has some insights of examining data. I can use links like this.

"i don't know what exactly to do to apply statistical analysis to this"

sample 24?

Thanks for a kind reaction to a Stark-Boast.

I think what I was trying to do with the last trip out to use the Internet was divine what methods and tools I'm supposed to know already.

I found a You tube video that is helping. http://www.youtube.com/watch?v=Ba0zSNYkWtw.

Complexity is the topic I'm searching for now. What seems really cool is now that I have a foundation in understanding the structure of information I find I have confidence to "sort through." Bwts pun.

As to value of the "discovery", I believe it can be patented. It's a lot of Mathematics but there seems to be grounds for a patent. After all isn't our Patent system there to encourage people like me to share the news?

I am willing to join-up with established IT now. I need a Job and if I can work/study then that life would be better than the life I have now.

Update: 7 more matches to go on the surjective number-generator codes.

I'm still thinking I will let it run till Winter and have 10 million or so matches to pick from.

I suppose this technology could be implemented in real time. It's a code after all and if we don't have to do a brute force search from scratch it should be as fast as a look-up.

So, 7 more to go for a "complete" set of matches to the MDF. After that it's a question of how to select elements that construct a compressible file from the surjective codes (SCODES).

Thanks for reading. I think this, live this and dream this. I spend every day working at this so it has become my life.

Again I'm open to income opportunities if anyone wants to jump on this too. It never hurts to network and I am looking for work.

Now to find more text and videos to study. I'm working towards understanding how we use Complexity to understand information in a system. I think I am starting to understand what http://en.wikipedia.org/wiki/Kolmogorov_complexity is. I am still thinking about what makes a better data with Burrows-Wheeler Transform Scott'ified.

I am designing software that will give me 2^32 choices for each 4096 byte segment of MDF. Well that is 2^32 choices per iteration and I can employ Bwts() with no overhead so the cost per iteration is just 32 bits.

It's important that I understand what I am looking for or I need to join with others soon. Well, that and write a proof.

I seem to love doing things the hard way huh.. Why couldn't I just give up like most folks :) <--- Because I am a true Fringe. LOL

Again thanks for a kind reply.

another curious thing:

people think that the million digit file is very "random"....?

What could possibly "de-randomise" it- what "force" could be perhaps say holding all those 'random' digits together?

I found that the information pattern called "the force of gravity" appears to be "the natural computing of space".

It seems that this supposedly very "weak" 'force' is stronger than the so-called 'randomness' .....

Not surprising? That only a very 'weak' 'force' could be capable of re-organising all those 'digits' in a more noticeably "organised" (?) manner?

Those are ideas many of us think about all the time.

I find I even go back and rework things I came to have a reasonable confidence "didn't" work and it helps sometimes.

I intend to patent if possible since I do need money. I got in touch with my brother and I think I will open up and share what I am doing with them.

I was looking at the report and it looks like the number generator is in another slow stretch after having a "spert" So I will keep a general watch on how long between but we were down to 5 more to go.

Again I stress this is NOT compression this is an encoding where there are many codes pointing to a few objects (83049 40-bit words of MDF in this case)

The worst case size is 48 bits to generate 40 bits so there will be some work to find some indexing that will bring it down to a uniform or average of 39 bits per 40 bit word.

At least, my Challenge Friends, we are not stuck with just the structure of MDF any more.

I am here at the coffee shop because some bastard stole my motorized bicycle and with the help of my brother I am shopping for another or maybe a used moped.

Just when it looks really hard to make it month to month someone kicks you down. F'in bastards

So what a bummer. I am very distracted now.

I was reworking Scott's bwts() and merging it with the new function eXpress() but I'm off my game now till I know I can report to work when the grapes come.

They stole my only way to work.

Again folks if any one can employ me I have talent and technology. For example say you give me some data. Some length say 4096 bytes I can give you billions of versions of that with only a cost of 32 bits overhead. Naturally I am saying it's reversible.

If we cannot find a string that can compress out of that then it's useless but when it comes to mathematics the clever rule so hey! I need a new life. I need to get out of poverty.

I'll drop in and post any time I have access to the internet.

I doubt I will do much work on the Number Generation data till winter. I possibly will get some work done on the bwts()+eXpress() in a month or so.

f'in thieves steal a poor man's motorized bicycle and stop him form working a low wage seasonal job.

Ernst that sounds like rough news, hopefully the law will find your motorised bike/ you'll get a recognition of what you can do by the commecial world.

It has been said that "one cannot compress all of the files all of the time". However, it is clear that many versions of the million diogit file can be easily compressed: e.g. a milion 1's. a million 2's, a million 3's, etc.; a hundred 1's followed by a hundred 2's.... to a hundred 9's, then repeating; and numerous other patterns.

So the list of "all possible million digit files" can easily be compressed because a number of those files are easy to compress.

If the actual million digit file is on the list of easily compressed patterns, then it is easy to compress.

Can you, Mark Nelson, refute this argument:

I presume you agree with the logic so far, now:

All that is needed to easily compress ANY million digit file is a method for converting any particular Million digit file into an easily compressed pattern (and presumably converting all other possible million digit files into different arrangements so that possibly files that were easily compressed no longer are).

The actual million digit file can itself in theory provide at least part of the instruction- the file tells the computer that it is one of a large number of possible files, and to change all the possible files so that the actual million digit file is easy to compress. So you don't have to compress all of the files all of the time, you just need to allow files on the list of all possible files to swap places till the file you want to be easy to compress is an easy pattern.

Of course you then may have to have a way of analysing all possible files to decode what the actual now code-written file actually is ( I apparently know an answer which the commercial use of I am looking for a licensee or sponsor)

Hi,

I wonder something. Basically, the claim is that the binary file has maximum randomness (entropy), so it is hard to find a pattern. That is maximum state of entropy or close to. How about allowing the original file to mutate in guided way. For example if you apply a 8-bit mask to every byte and XOR the mask to every byte, then try to compress with a known algorithm. If you manage to decrease the complexity with 2 bytes or more, you have done the task. Out of 256 possible masks (mutated files) one state should be lower in entropy than the original. If that is not enough, you could apply bigger mask (2, 3, 4 bytes) or apply two masks in consecutive order. The number of mutations of the original file could be astronomical, so one state should win. You can always reconstruct the original file if you keep track of the mask and the operation (XOR or something else). What you try to compress would be a mutant which has a clean track to the original. I have limited programing skills in bitflip operations, but I am sure it is not so hard to try that idea.

Kind regards,

bloodIce

Hello BloodIce..

@Alan.

This encoding I am testing out is the Version 3 of the Number generator and as I left to come to the Coffee Shop and use the Internet there was only one more to find!

The basic concept with surjective codes such as this is in theory there are many codes that map to object. In this case since we cannot find an exploit with random data then logically we must come up with something that can be compressed that will generate the MDF.

In this case I hope to have a few thousand choices for each 40-bit word of the set of 83,049 words of MDF.

I hope there can be some compression but I can't say what will happen. I must trim 9 bits off of each selected symbol so I am looking at RLE-type encoding now.

I will not be done generating the SCODE(s) till winter as it has only generated less than 64,000 matches out of the sample range of 1,048,576.

I do think this will get done. As for a general compressor I think this particular application could function in real time if a dataset was available so as to skip all the time it takes to run through 0 - 2^20 and check for matches. I would assume something more elegant can be crafted. Nothing is written stone with these maths.

About my transportation. I have received help from my Brother Tom. What a guy! I will be ordering a moped today. 150cc should get me to school so things are turning out okay but what an emotional thing. A man and his motorized bicycle, which he put hundreds of hours in on over the years, is not just theft it is an assault. The worst thing is I figure no one will love and care for that machine now. Motorized bikes always require TLC and parts. So they get my bike but they just don't have the skills I am sure. That is really the sad thing. That bike will be ruined if it isn't already. What's the point of taking it in the first place? Drugs possibly.

On a bright note if it was about Meth then there was a 290 pound local bust yesterday here. 290 pounds of Meth will more than likely only scratch the surface of the problem but I felt good hearing the news.

So Alan, Mark and all I think we have turned a corner on data compression and I will do my best to get the proof written and published.

I think you all will be pleased.

I plan to open a website where I can get my email and post news.

Also I will accept donations (if it's legal to do so) when I get going to school (If I need help that is).

I have to check with the Veterans and see if I have any education benefits. I may have some monies I can access there too. Also there are grants and maybe even some aid for older people?? It can't hurt to get help when a maths book is $120.

All in all we get knocked down and we get back up and keep swinging. To hell with bike thieves.

So, I cannot say at this moment that Version 3 has found a complete set of 40-bit words of MDF but with only one more to go it's exciting to say something here anyway.

I can afford the price of a cup of coffee so I can come down with a cut and past which will prove nothing but what good will it do me to lie?

I mean I can only make an ass of myself if I am not being honest after what a decade of work?

So until then. Keep banging those data compression rocks together and good luck Challenge People!

Oh and do feed the Moose.

@BloodIce

I have worked several hundred angles and yes yes yes please try what you think might fly.

Xor is useful and it does reverse so you can work with such things.

In my experiences I found that most schemes seem to expand in the 1.5 times source range and that 1.5 is something I see often in codecs I work. Take the Collatz conjecture dynamic function for example it cannot have a "11" state in the state-space of 2 bits so the only legal patterns are {00,10,01} and it expands to 1.5 times the source if we keep both parities separate for the even and the odd functions.

So, I got used to writing the programs and seeing what it did. Perhaps you can experiment too. I am sure I have not thought of everything.

My advice is be true to yourself and if you win you win but even if not, you win by doing anyway!

Hi Ernst,

I do not think that the source for my idea is more than 20-30 lines (in python), so the source expansion is not a big concern. It is interesting to know if a external call from python to gzip will be fine or I have to implement the compression in python too. Basically, the idea is a loop in python with XORing every byte of the file (already in memory) and test the size after compression with gzip. There is no requirement for time of compression... :-), so I can apply millions of masks (every result will be totally different state). If that doesn't help, we can use another bittransform operation.

By the way, I am NOT competing or anything (100$ are two hours at my workplace), it is just my curiosity.

So folks, are external calls (gzip, and all needed python modules) fine or everything should be in the decompressor?

/bloodIce

one

state

is

"entropy"

@Alan,

Please correct me if I am wrong, but if you shift the state, than you change the entropy. If I transform the file, I assign a different state (it is a new system), which has a different entropy. I am not saying that XORing is the key, but it is some initial idea.

Btw, is there better compressor than gzip? I see that it is some sort of optimal for linux, but it also adds extra garbage to the file (timestamp, filename, checksums, etc...).

/bloodice

Consider:

suppose that a University student has a bedroom. Suppose that it tends towards "disorder".

Someone could notice the state that the room is in.

How could ___________________________________________________________(censored) If the state was ___________________________________________(censored)"?

A ______________________________________(censored) 'time' ???

If there were only so many "states" that the room could be in, and only so much time available for the room to configure into one or other of these alternate "states"- (_________________ (censored))then the actual state the room is in (or the time at which that state occurs- is not so significant- what IS more significant is "the calculus of alternate orders disorders" (known as "the (approaching) category of imperator" -

the state of the room is "obvious" when all other differing states are incorporated in the "state vector (analysis) of the rooms sum of all possible (alternate) states "

(also called (The) THEORY of every THING (the cat in the Hat!)

What happens if I inject order in the room and compress the way I have done it? Say I order all socks but keep their original coordinates and compress the list :-) ? I would put the system in a new state, wouldn't I? The question is if the initial information of the state could be preserved and compressed.

There is no need to "preserve" and to "compress" the initial information of the state-

the initial information is the state

@Alan:

Please, no more posts of this type.

If you have useful, on-topic information to share, this is the place.

If you have information, but you do not wish to share it, this is not the place. Let us take it as axiomatic that you have things you believe to be true but want to keep to yourself.

Now that we all know it, there is no need to repeat it.

- Mark

Mark, you are surely well aware of the "silliness" of this challenge- in that if anyone really was solving it, the potential value would be more than US$100, and they would be solving "P vs NP" among other things.

One can only hope to say a limited amount, to protect potential commercial possibilities, and hope that someone like you would realise they actually may have found something of possible interest to information technology companies, without knowing detail, but it seems your attitude continues to be very negative...

However I am having to be very careful to avoid saying too much.

Do you expect someone to just give away the answer, if they actually had the answer?

Your use of the expression "... that you have things you believe to be true " looks like you are skeptical- I have things- how viable those things are (those unwritten words/ phrases ) is for a non-disclosure-agreement-signed serious honest investor/ party to decide

It is interesting to test my ideas as far as possible against computer people and skeptics, though it is risky as I have to refrain from saying much.

would it be true to say: that a number of "experts" in a field hate the idea of those who they regard as of low social status challenging their high position in their field?

Physicist Erwin Schrodinger reportedly made his "wave equation" look more complicated than it actually was, as he was afraid it would look too simple to his colleagues.

? ? ?

@ BloodIce

"So folks, are external calls (gzip, and all needed python modules) fine or everything should be in the decompressor? "

I believe this question has been raised several times over the years and I believe accessing common libraries, using existing compression software or other common and equally accessible to all is very much okay.

I also believe that Mark is allowing unlimited file size if there is some scheme that "compresses all files."

Honestly, I have not accesses those functions yet but I plan to use them some time. I even think Bzip2 has function calls and also I think Mark has a tutorial on this site to one compression scheme to be called from within a program.

While I am posting the Police found my bike last night and returned it to me. They arrested a man on felony charges for the theft. He or they took off the accessories, damaged the clutch and throttle assemblies and basically did $600 in damages. Meanwhile my brother Tom, sent me enough money to buy a scooter so I can now make it to the Junior College. The scooter will be here in about 15 days.

In a strange way this is a function of Karma in that while this was a bad thing good has come of it for me. I was wondering how I would manage to travel the 35 miles to school on the bike @25mph. I figured I would be riding the bus.

On the progress of the number generator. I made a mistake in counting to myself. I always think of zero being a valid count but I am counting from 1 to N not 0 to n-1 with the unique counts so I am waiting for two more matches not one. Still two to go as of this morning.

Back @BloodIce..

Good Luck man.. I find it fun no mater what and I hope you discover something useful there with your idea.

@everyone.

"Please correct me if I am wrong, but if you shift the state, than you change the entropy. If I transform the file, I assign a different state (it is a new system), which has a different entropy. I am not saying that XORing is the key, but it is some initial idea."

BloodIce asked this..

I am still learning what is meant by entropy in "information theory" but I agree that we are trying to change data in some useful way.

Some other traditional understandings is pigeon hole principle.

In general if you are transforming data there will be some cost. This is an area of study too where we see more information coming out of a data because we do a thing such as trying to compress "random data" and this is easily demonstrative by taking the MDF and make a "zip" file because it will be larger than the source.

The only transform I know of that has zero cost is David Scott's BWTS()

In general, in 10 years, I never saw a way to magically toss even a single bit away and magically get it back every time.

In general the term "demonstrative" applies in that some reference is necessary to derive context in data encoding and decoding.

In, general, the laws we are working under demand that existence of information is a constant and lucky for us existence is not conditional and with that we cannot toss out and get arbitrary bits back without some reference that allows us to know what is what.

As I said before I am sure I have not thought of everything so I'll keep an open mind on any news that can accomplish tossing bits and getting them back somehow "magically."

On the humorous side of things I am really liking the idea that the Universe is a Hologram and that we are puppets in some sort of puppet show. I figure we are all going to find the projector someday after we discover the "Codec" that makes it all possible and then we will realize that there is a projector of the projector in some other puppet show.

However, what is not funny is that all information exists already and is a constant. We discover we do not invent.

Now to chat at our resident "crazy-blankster." LOL @ Alan

@ Alan

"Solving it" may well take a lot of baby steps. As I wrote in protest before it may well take a fringe type such as Cantor or Newton to see some "digital mysticism" that will lead us to that "vision of our Holy Grail."

So while it is painfully obvious that there are limits and that Shannon is our main hero we still are dreaming our way past Shannon's limits not by proving Shannon a fool but by placing Shannon in a context where we can recode and compress (in theory).

Not one person here is claiming magical powers of data compression and I would believe that everyone who has every picked the MDF file apart in earnest knows just how hard this challenge is.

So I'm all good with a new guy showing up and having ideals and concepts to try. Perhaps it is a path that has been walked before by many or maybe not but (they) can grow intellectually by trying.

As for crazy.. We often see success presented for people like http://en.wikipedia.org/wiki/Leonhard_Euler but even he had failures.

So, with the Xor idea yes it's a valid approach so I say he should look into it for sure. If he can modify the bit patterns and keep the expense of such a thing down it's possible he could squeeze some out of MDF.

I have used Xor before and I like Xor. Bit shfts, static rotations and now I am studying the BWT with Scott's work.

Who knows maybe some sort of BWT sort crossed with some sort of Xoring may well offer the mechanics for a proper Codec.

Oh and we all realize the basic thing is it has to be a Codec. That means what can be done has to be able to be undone to be useful in data encoding.

Alan.. You asked about string Theory before and I thought to say I watch You Tube stuff so just wander through the lectures from MIT and such to gain an over-view.

I was saying I think I found something useful and I believe that it will apply to String Theory as well as Information Theory because I think the two are the same in a common way.

I also say to everyone I am not qualified nor educated enough to prove this belief but I intend to try and share what I can.

As Plato would say "to each as they can contribute to society."

So Alan your "Blanking" isn't funny all the time but it is cute once in a while.

The bottom line is throughout history there are more failures than success in all of Man's efforts and as a Challenge it is nonetheless true here as well. Keep a good spirit and be a good sport.

Folks are spending their lives trying to accomplish this Challenge.

@ Alan

"There is no need to "preserve" and to "compress" the initial information of the state-

the initial information is the state"

Perhaps there are "Quantum" States of any string Alan..

What then is P/NP ?

What if there is a function that given some binary string Y there exists a set of "N-states" we might call quantum-states. Imagine that we can select one of those strings in the stead of the referenced string. This is a truth Alan.

So do I think we can solve this challenge? Yes I do.

Here is one from MIT that I have been seeing parts of on You Tube.

I just downloaded the course materials myself.

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-050j-information-and-entropy-spring-2008/

simplicity:

what is "space time"?

what is "time"? "It is dinner time"- implies some limit- something in between "before" and "after".

A pendulum swings "between limits". A clock face has a beginning and an ending- it goes around from 12 (back) to 12.

So the information concept or pattern "time" appears to involve what one might call "a minimum defining characteristic" of the concept or pattern "anywhere within limits".

What then is "space time"? That is, "space anywhere within limits"? Just that: spacetime = apparently "space anywhere within limits".

The "birthday paradox" showes there are limits on any (set of) data. There are only so-many ways of placing birthdays ("beans") in the "boxes" of the days of the year.

A form of time?

Space in that time would allow precise identification of any partcular arrangement of those birthdays. The "Shannon limit" is shattered? Now the "information" Is "the entropy".........

(incidentally an example of "space time" is a ruler.)

bloodIce:

You can xor ever byte with some data, but you still need to keep track of that data when you go to decrompress the file.

Worst case scenario, if you use unique data for each byte of the MDF, the data you need to track to decompress will be as large as the MDF.

If you use the same data to xor every byte with, you won't get any better compressibility, simply because there's no pattern in the bytes of MDF.

You can try something in between the two scenarios, where you xor with data less than the total MDF size, but still large enough to improve compressability (Say, data 1/4 the size of MDF), but that becomes a battle in trying to find a pattern in MDF, and as far as everyone has been able to tell, no such pattern exists.

Remember the rules:

Size of all data necessary to produce the MDF + size of your decompressor program (source code size is acceptable) MUST be less than the size of the MDF.

That means you can't xor the MDF with some data to produce a tiny file but not count the size of the data you xored with.

And yes, you're allowed to use standardly available compression libraries.

@Zen:

Thank you for your informative comment. I agree with your conclusions (not to mention that I tried some primitive implementation of my idea, which did not work). As you point, the battle is indeed to hope for a pattern, after the transformation. XORing the file with itself will produce a file of 2-3 bytes :-), but that will be useless. The idea is to XOR with very small mask (1-2 bytes) and to hope that the new file will have a pattern. Well, it did not. I still believe that from the unlimited number of possible transformations (n-byte mask, subsequent masks, masking only n-th byte, many more combinations and new ideas) one will have a small pattern which will reduce the size with some bytes (squeezing several bytes). However the source code of the decompressor will need at least 50 bytes, so the idea might not work. But, if the compressed (source+mask+basic_logic) is one file and the decompressor is a standalone program (not included in the count), than I think it will work through violation of the rules :-).

Ah, BloodIce!!!

Welcome to our nightmare here with the MDF I think your gonna like it. http://www.youtube.com/watch?v=iQE0pfBAYQ8

It might be seen as somewhere between Alice Cooper and "Penn and Teller : Fool Us"

The idea of finding some mask is attractive. I know ways to generate data that can be used for a mask if you need some suggestions.

Just don't feel bad for trying. There are some who will want you to.

@Alan

Space-Time is the relationship of space, time and the speed of light. The actual maths of it I have not studied.

I do think you have a point to ponder the "assumptions" we hold about the nature of our reality in not only what we see but also in what we don't.

Hey, Version three here is still running with 2 to go. If it can find them then this is a mapping of many to the 83,049 40-bit words of MDF.

By this surjective mapping I hope there are enough choices of SCODES to select data that has qualities I can work with.

For sure I don't think there is an easy pattern in the native MDF. I believe that transformation is generally the way to go.

Also in keeping with my twisted review of strange videos this morning http://www.youtube.com/watch?v=Rtkdo7bOmJc

Enjoy!

This "Update" reports finding at least one of each of the 40-bit words of the MDF.

It means complete sets are possible to construct. I'll be running this into the tens of millions of matches.

The "Cnt" is the count of matches of that element found so far.

83,049 is the total 40-bit words that are the Million Digit File and is identified by "TU"

A total of 63,600 "inputs" were used.

It is assumed here that a complete "run" is a set of inputs the count of 1,048,576. 984,976 to go.

-------------------------------------------

T: 1299105 TU: 83048 Thread 1 matched 46965151098 of FPos 9247 Cnt 11

T: 1299106 TU: 83048 Thread 1 matched 2255103294 of FPos 50422 Cnt 23

T: 1299107 TU: 83048 Thread 6 matched 1016163663384 of FPos 37322 Cnt 9

T: 1299108 TU: 83048 Thread 4 matched 960454279373 of FPos 62237 Cnt 43

T: 1299109 TU: 83048 Thread 2 matched 821312821310 of FPos 47884 Cnt 14

T: 1299110 TU: 83049 Thread 1 matched 817508182085 of FPos 9953 Cnt 1 <-- Complete set Tuesday, May 14 2013 @17:00 hrs

T: 1299111 TU: 83049 Thread 5 matched 822974257824 of FPos 40050 Cnt 12

"

T: 1299112 TU: 83049 Thread 2 matched 666014465170 of FPos 27215 Cnt 20

T: 1299113 TU: 83049 Thread 0 matched 1013269890975 of FPos 2764 Cnt 11

T: 1299114 TU: 83049 Thread 5 matched 1019051599804 of FPos 69676 Cnt 16

T: 1299115 TU: 83049 Thread 3 matched 1075627242711 of FPos 13937 Cnt 16

T: 1299116 TU: 83049 Thread 6 matched 774046476528 of FPos 38423 Cnt 33

T: 1299117 TU: 83049 Thread 1 matched 281347678459 of FPos 57139 Cnt 11

T: 1299118 TU: 83049 Thread 4 matched 1048746262368 of FPos 30292 Cnt 16

T: 1299119 TU: 83049 Thread 3 matched 4039934805 of FPos 73933 Cnt 37

T: 1299120 TU: 83049 Thread 5 matched 442186383182 of FPos 72414 Cnt 10

T: 1299121 TU: 83049 Thread 2 matched 465097151251 of FPos 6156 Cnt 18

T: 1299122 TU: 83049 Thread 1 matched 245002864460 of FPos 52415 Cnt 27

msb:1011111001010111010010001110100001000101:lsb Factor 817508182085: 5 7 109 214287859

Now I will have a choice of data to call the MDF.

Also ahead is verifying the data and crafting a database.

It can be said I have reasonable confidence the data is correct and that multiple different sets are possible to represent MDF.

But, now I have nothing to report for a while and much study to accomplish.

So, my Challenge friends Good Luck. :)

Ernst

notice:

As Yahoo! are closing Classic mail June 3 I am not able to post here (Have temporarily got their upgrade as still have data to rescue). I found Gmail have what looks like a much better "terms of service" so may eventually manage to get a Gmail account.

Hey Alan did you catch NPR's show on patents? "This American Life" show http://www.thisamericanlife.org/radio-archives/episode/496/when-patents-attackpart-two

Looks like folks who work hard to innovate can get screwed big time just because of some vague patent or someone who has a vague patent suing them with deep pockets.

I wonder if my works pan out if I will be one who gets screwed? I assume someone probably already filed vague patents based on the comments I have made. It would stand to be if it's true patent trolls are always looking to scam and can access millions to do so. Like the POD-CAST patent holder who just now going after folks I understand.

Since I am enjoying a blueberry smoothie in the cool of this Cafe I thought to post something.

I hope my nephew will join in with me in this next year on this challenge. He is going for his BS in CS so it's good timing that I can introduce him to data compression and the ilk.

I am now making the effort to position myself to attend the J.C. so it's coming and once started I will try to get to a BS in CS myself.

That is about it. I wish all those graduation of simply finishing a semester the best.

That's it.. Good Luck Challenge people.

Ernst,

I may be misunderstand how you are trying to solve the problem, but I think the way you are scanning for matches is slowing your program down a lot.

If I have it right you are taking a 40 Bit Block then processing all the possible keys to your generator function to find a match.

However, this means that every 40 Bit Block requires a lot looping and calls to the generator.

Try instead the opposite, step thru all the possible keys and then scan the entire random digits file 40 Bits at a time and flag all sections that have a match (don't forget to store the matching key too). Don't forget to use the match flag to skip scanning bit blocks that have already been found.

I need to build a series of S-Boxes in the past and scan the work space in a manner similar to what you described if I am reading you right. The original program took 23 hours to run, the revised program 13 minutes.

Thanks Earl,

I will be keeping the method of generation confidential until such time as it's appropriate to publish.

I'm pleased you are considering a Number Generator(NG) approach. Perhaps we all recognize that it may be the best possibility of encoding all data and thus (possibly) compressing all data.

By that I am speaking hypothetically that if there are many codes for some element that in those many perhaps relationships to other (many) codes for ( all of a set or file) elements can be realized.

I have nothing to report on relationships or compression schemes yet. I do have an idea for a scheme I'm working but a complete set of (NG) SCODES will be the thing to work with.

My nephew is interested and he is young so it should be helpful to have a fresh young mind on the case. I feel my 52 years more and more each day... LOL

If nothing else here it's interesting.

An aside! I have a Motor Scooter now.. My Brother kicked down and helped me. I would have been homeless if not for Brother! Thanks Brother!

Also the seasonal work will start soon and after that school is the new goal.

All in all going back to school seems a good use of life. I trust I can achieve a BS in CS if I work at it.

So Earl and all good luck.

I forgot to ask you Earl..

Have you encoded the MDF? If so any good news to report?

I got to thinking that perhaps you were saying you have an encoding method and if so I wonder if you have any compression?

I have an encoding but am now learning more about data compression proper.

I have not had reason (with MDF) to focus on actual compression schemes before.

I'm fine if someone else beats me to it. I feel like a WINNER ALREADY.

So, do tell Earl.

I was just wondering if someone else had designed and encoder with a number generator is all.

I was looking at my "SCODES"and there is an exploit.

I still am waiting until I have a complete run which I expect a total of 21 million SCODES to be generated that will offer (hopefully) over 200 choices for each 40 bit word element of MDF.

I also wrote that I am searching 8192 sets for those SCODES and if, when the SCODE run is complete, I can select a reduced set say 64 sets not 8192 then I believe I can drop 10 to 12 bits from the 48 and @30 bits I would win.

I admit the time this is going to take will see me into May 2014.

As for Earl? Well if he is working off the videos linked then Earl it's a Jump to the Left..

So Challenge friends you still have time to come in first.

Good Luck.

Pardon the typos

I meant to type 38 bits vrs 48 will win.

If I can get 36 down to bits per SCODE then that will be even better.

I expect the "decoder" to be no more than a 20,000 byte ELF.

I measure by the ELF but I assume the C code would be far less than 20k compressed.

I also assume there is still a prize available? With this 10 year version no mention of cash has been made.

Well, We all have waited a long time and there is time to go unless someone can get me access to Blue Gene for a few days.

I'm willing if they are.

:)

Again sorry for the typos.

Oh and a video link.

http://www.youtube.com/watch?v=V1bFr2SWP1I

Why not.. Perhaps now I can work on that iterative BWTS

I don't think you get it.

I don't believe your idea stands a chance to work. The counting argument always win in the end. But since you are one of the few compression dreamers to have guts to write code and test your ideas, I want to help you.

1) Run your tests faster so you can waste less time finding out your idea will not work for general use.

2) Learn ways to improve your general coding skills, that is what I hope will happen in the long. I want you to improve your ability to program.

I want you to learn.

Why thank you Earl.

I indeed want to learn too. I am preparing to return to College.

I do try and improve every day. Some times it's a struggle to get motivated to continue writing code but if one wants to write then they must write every day so I keep on coding.

I keep the projects I am working going and some times invent new things. I just had a new idea today while I was finishing up a hybrid effort in fact, so I may want to write that one out and see what it does.

Also I have had a simple understanding of complexity and I found this video and I like it a lot. I thought to share. http://www.youtube.com/watch?v=8bLXHvH9s1A

So thanks for your concern. I appreciate that you took the time to chat.

The more I learn the more fascinating it becomes. The whole N vs NP issue and the other classifications of complexity are interesting so readers have a view of the video link.

Isn't it all about the struggle? Each of us working to add to the greater understanding. Possibly to leave something behind for everyone when we die.

What else is there?

The struggle is important. One should always have goals that the process of trying to reach them teaches more than just lazying around that teaches so little to most.

But it is important that you don't struggle to reach impossible goals ***ONLY***.

Trying for an impossible goal part-time is a good thing usually, you learn as you work to that goal, and rarely - very rarely you get to prove the impossible becomes possible.

But usually you just waste a lot of effort.

That is why you need to set some other goals, hard but reachable ones. Life is too short to waste on just one thing.

I just finished a 21 day vacation - 7 days Alaska, 5 days Long Beach and the Pacific Rim Park, the rest of the time spent in Victoria and Sidney. And no Internet or computer programming the whole time.

Get out there and enjoy life!

Well, your advice of taking a break actually hits home.

After working that "Hybrid" effort I did work that 'Idea" and I'm happy to have the ability to work out and encoder and decoder in 48 hours, so as to prove or disprove an idea, but it made me mentally tired.

I did not see a serious reduction in complexity however there was an obvious structure so that leaves open more thought.

If you like, since you seem to possibly lack compassion in a small degree by your tone, you can consider paying my way on the next trip :)

We all cannot jump a boat and party!

I am also accepting donations towards attending College. I plan just a couple classes to acclimate at first. Proofs and English I think.

LOL.

Sure man it's important to have a balance I am willing you see.

I find a long ride on the scooter to be a joy with 80 mpg.

I trust your trip was wonderful.

Well, since I don't see an offer to aid me in obtaining my goals ( goals defensibly sensible ) I will assume your charity is equal to the value of zero.

Since I would frame you in a fair and balanced perspective as a non-contestant in the MDF challenge I will assign the value of zero to your criticisms such that you are simply a man who has an opinion that doesn't necessarily represent the current state of my goals or efforts.

As for an Update :

I thought to provide an update but I do wish not to get so ahead of reality that I am projecting my hopes as accomplishments so I will comment on the factual state of the encoding software I am calling Version-3.

Version-3 is a number generator which allows me to encode all files. The process is surjective in that for some domain of codes they point to the million digit file ( as the example ) as divided up into 40 bit words.

So many codes point to a finite set of 40 bit words of the MDF.

Domain -> Co-domain I understand is the relationship called surjective.

I just processed a second collection of codes I call SCODES into my database and that gave me 1% of the total I estamate I can generate for this run.

So, with 1% in I see some structure I can exploit.

While the size of an SCODE is 48 bits I am hoping I can write some Run Length encoding function to take advantage of that. If I am to project I would hope I can reduce each element by 5 bits and with some luck it will be 6 bits. I am hoping then I have only 4 bits per SCODE element to go. I would win if I can get it down to 38 bits per element.

I'm happy that the data is numerous. That I can choose an SCODE to represent some 40-bit element based on the qualities of that code.

All in all I am now learning how to apply compression algorithms since I finally have something related to the MDF that I can compress.

With that said, on a home computer, I expect to have a full run as I designed Version 3 in May 2014 or there abouts.

That's it really. I have much to learn.

I am excited that I still have 99% of the data to go and with the results so far I can assume I will have a data set I can learn with as well as compress. On to the next level that is; actually writing compression algorithms.

So Good Luck Challenge people. I care not if I am first as much as that I finish this race. What a long strange trip it's been.

http://www.youtube.com/watch?v=pafY6sZt0FE

I almost forgot to share.

Here is a great video and love those Planck bits. I am smitten with the Universe as information.

Enjoy.

http://www.youtube.com/watch?v=XDAJinQL2c0

Update:

I am exploring run length encoding and there was some success.

I was able to reduce one selected group of SCODES by 6 bits for each element of 48.

That's right I'm now at 42 bits and falling.. ( HGTTG joke )

This "selection" of codes is not the final selection with just 1% of the data in but it's exciting to see data compression working.

I'm very happy with RLE as my target data was 7-bits of the SCODE and at a distance of 83,049 elements I got 6 bits per element compression from RLE.

I think that is a success for RLE. It sure inspires me to study more. I welcome suggestions on other algorithms.

I have worked with move to front so I have the experience on hand.

Again I think it will be easier when I have all the 21 million SCODES on hand but I have to wait for it. Until then more baby steps.

So now to design a compressor that gets those 6 bits and 4 or 5 more.

Talk about a lot of work for $100.00. Mark it had better be a really nice $100.00 bill. Not folded or anything. Maybe put it in a frame.

Heh.. I have not compressed the MDF yet. I'm working on it.

Well, it's to be 108 today so I had better get my chores done. Later then and good luck challenge people!

Hey all!

I am working again thankfully! It's wonderful to be back in the land of personal income.

Things will be much the same for a long time to come. The only news is the count is approaching 2.8 million surjective codes.

I did harvest a data set of 2.6 million to work with in the mean time and that should be enough to continue studying the data.

I am open to suggestions on how to analyse bit patterns. Perhaps there is software or algorithms already to suggest?

I find that Run Length Encoding is useful.

I must be patient as you all can see.

So, is anyone still working on this challenge besides me at this time? Anyone attempting a number generator approach?

Good luck Challenge people.

Hi Ernst,

Hi Ernst,

Hope all is well.

As for suggestions on how to analyze bit patterns, that really depends on how your particular genius envisions what a needle might look like in a haystack.

As for me, my technique was to examine bits in tiny patterns, 8 bits and less, across the entire file. My thinking was that if I could find a set of patterns that was more repetitious than others, this opens the door to common compression techniques. For each set of bits (IE 4 bits would be one set, 5 bits another), I would permute the pattern across a byte such that all (set) bit patterns within a byte were examined. What I found was that regardless of the number of bits I chose, or which bits within the byte I chose to comprise my sets, that the data in the file was alarmingly random. The number of unique values that could be represented by (set) bits appeared nearly uniformly. (IE, to examine every 3 bit combination within a byte and then examine each byte of the MRD through that lens, the number of 0, 1, 2, 3, 4, 5, 6 and 7 values were nearly uniform. This pattern held no matter which 3 bits I selected (because I did all of them).

Of course, that is just one way to look at bits, and it suffers from the fact that all my examinations were against patterns comprised of n bits out of 8, and no examinations occurred across byte boundaries.

The excellent distribution of ones and zeroes within the bytes of MRD suggest to me that if I had extrapolated outwards moving across byte boundaries and including larger sets of bits, I would likely discover the same uniform distributions of potential values. I jump to this conclusion based on the fact that I could find no bit within all bytes that was not quite literally a coin flip from a very statistically reliable coin, meaning that I could have examined larger and larger sets of numbers from larger and larger windows of bits, but each bit position would be filled with a uniformly random value.

If your question was more based on which tools to throw at such an endeavor. Bit wise operators are your friend, they execute faster than most instructions and you can have bit patterns of interest predefined in data types that cover your desired number of bits. Remember to use unsigned integer types so you don't get erroneous values from bit shifts. I assume that if you want to examine bits across the natural boundaries of bytes, you'll need to do some bit rotations and such meaning lot's of shifting.

You could probably make a nice little C++ data type to represent a fixed number of bits with all the rotation code stuck inside so you could overload the shift operators to catch the bits falling over the edge, and stuff them back in at the other end. A data type like that would probably get a lot of re-use in the endeavor to compress the MRD.

Hi Mark,

Thanks for the very interesting website and challenge. It is great that people are dedicating their time to this, and regardless of how futile people think it is, it beats doing Sudoku puzzles.

I've just had a look at your bin file and the original random numbers from RAND and would like to know how the numbers were converted (encoded) to binary and how to decode i.e. how can we be sure that the bin file is actually a representation of the original million digits?

Apologies if I've missed something really obvious or if you've already provided these details.

Best regards,

John

@John:

I think this code in comp.compression from a few years back gives the details:

https://groups.google.com/forum/#!topic/comp.compression/fLKEN065nrU

- Mark

Many thanks Mark. I'll check it out.

Hello John G.

Hey Mark and also "on August 27th, 2013 at 2:58 pm, CWCunningham said:"

I have been working 7 days a week for the past 84 days but that is over now. The Grape harvest is done and my seasonal job is over for the year.

I had some financial problems and the electric was off for 2.5 months so I have now restarted the number generator and am focusing on a range of data that has the best chance of being coded smaller.

With the "Version 3" I was staying true to a range of numbers so I could compare the "before and after" so to speak. I had wanted a data set of many millions to examine. I am going to let that slide for now and go smaller.

What I found with the first 3 million results was that I could see an exploit in the data where run length encoding worked to reduce the "codeword" by 6 bits.

By working in that larger range I am mentioning, one of the elements is reduced from a width of 20 bits to 14 bits.

With the reduced set it will go from 16 to 10 hopefully. That may put me in a potential striking range to 38 bit codeword for the 40 bit source.

I am unsure if I will see the same exploit in a reduced range set but I am going to find out.

I will be generating 1.1 million scodes for this reduced sample range and again the number of 40 bit words in the MDF is about 83,000 so again there will be several choices for what codeword represents any one 40 bit word segment.

So far it is doing very well. More than half of the whole file has been encoded in just under 72 hours so near 50,000 unique 40 bit words have codes and then another 35,000 codes have been found that offer choices as to what code to use for those found already. Surgjective Codes.

I hope this set will be done in just over 30 days from today. I will then see if that RLE exploit is there. I have my fingers crossed here.

Other than that I had an idea for another program and am currently working on that. Success with that will depend on the data and a couple of functions I wrote. I cannot say if I can get the ideal result but it's worth the effort to write and observe the program. More experiment and observation.

That is what is happening here.

@ John Yes the terms kook and crank have been thrown at folks working on this problem. I have gotten rather upset with folks about that. It is true that we "currently" must say that this challenge is proven to be impossible.

So, yes we are fringe working on this and much of what we do is to retrace efforts already made only to find those truths already espoused such as the Pigeon Hole and so on but is that all there is?

I'm still going strong on software development so perhaps there is more.

@CWCunningham

Sorry for the long delay. I lost electric, worked 7 days a week for 90 days and my laptop got fried.

I have been given an old laptop so I am now using the Internet again although I miss my 3-core Laptop bad.

I was asking in general about bits and patterns because I am trying to form an idea of what would make for useful patterns as if I could choose and construct the patterns myself.

I am still working on the thoughts of it and how it might work if one could change a file and change it back easy enough.

I am still tinking along the lines of "since it's random we have to translate it to something useful and back again."

I have followed the data encoding path these 11 years and had skipped the proper data compression stuff but am now wanting to merge the two.

It is so much easier if the file has simple runs of same byte or repeating patterns from the start but it wouldn't be the MDF challenge if we were working with that stuff.

I would have to say that I am sold on a number generation approach. I also say that everything I do is experiment and observe so it might work or it may not. I have to see.

Hopefully the same exploit ( the easy stuff to compress ) will be seen in a reduced focus and I can get this accomplished.

-----

So friends, it was a rough 3 months and I am so very tired. I manhandled 50 lbs sacks everyday 7 days a week and lost about 30 pounds. No joke.

I can say I expect to make it to next spring just fine because of that job. That is a good thing.

So, a couple days of rest and I should be back to coding full speed. I am back generating the data to see if the "Version Three approach" is a valid effort. I hope so. It's time for a winner.

@Mark

This next project I just started might be a general compressor. At least the effort is to head off in that direction.

It will be dependant on the data so it may qualify as the general compressor you want to see happen as part 2 of the new challenge if it happens to work. I can't say at this point. By that I mean the program will work but will the data have an exploit in it.. I don't know.

I most likely will know something solid by Christmas on that question.

Again Good Luck Challenge people.

> It is true that we "currently" must say

>that this challenge is proven to be impossible.

I disagree.

It is very likely that this challenge cannot be met. The actual probability of success may be infinitesimally small.

That said, anyone who says it is impossible to defeat the million digit challenge does not understand some very fundamental principles of computer science, computational logic, and information theory.

In fact, posting a challenge requiring a proof that the challenge cannot be met would be significantly harder than beeting the challenge. In the worst case, it turns into a busy beaver problem.

- Mark

Ah, I see your point.

Got one shot at reducing by about 20,000 bytes if I can get the same RLE reduction as before. If not I'll have to look closer.

Another 4 weeks of data generation and I should know then.

So, I am still in.

Hello everyone.

The 16-bit seed run of the "Version3" number generator is complete.

I have about 1.2 million codes that point to the 40 bit words of the million digit file.

Sadly, the exploit I was able to see in the (incomplete) 20-bit seed data doesn't appear in the 16-bit data.

I have been feeling down about what to do next yet there is hope.

I've explored isolating sections of bits without seeing any major advantage so far. Currently I am exploring selecting codewords of the same general magnitude so that a common subtraction can be performed on the codewords in a RLE type of encoding. This idea may have some merit since I can have runs of reduced values which are 8 to 10 bits less that the source codeword.

Please NOTE: It's possible I can share this dataset with you all as well. Perhaps someone has more experience and can spot a way to make this data into a "compressed" file?

What makes sense now is for me to begin to write some things down. I'm not sure where or how to start a paper. I thought I might go over to the local College and see if I can find a mentor.

The truth is that I focused on translating the MDF into other forms back when I started this challenge ten years ago. I worked and learned how and now have command of maths transforms that are possibly more applicable to data encryption then data compression.

That's about it. Looks like making an effort to publish a contribution to our science is the logical next step especially if I get nowhere with this number generator data. I'm sure what I can contribute will thrill the crypto folk.

If anyone has suggestions for writing things down and how to do it correctly I'm willing to listen to advice. I have never published a paper before.

Mark, if you like I can provide a text file of the codewords after a final check.

Each 40 bit word in the MDF has from seven to several hundred integers to choose from. Some are even less than 40-bits in length when the most significant reset bits are dismissed but the word size for these codewords is 44 bits.

I would think since I can "encode" any file we can, by the rules of this challenge, dismiss the filesize of any decoder for this "transform."

Mark do you agree? That if some program transforms data equally that we can dismiss the fileseize of a decoder for that transform?

That might make the difference if I complete the similar magnitude with common offset. I can get an overall reduction of file size but I fear not enough to account for the transform and the program to reconstruct the original codeword values.

Well, let me know. I will continue to work on this but am more inclined to work on publishing what I have so far from here on out.

Ernst

Well, I have an update. Not all went well.

Well Challenge friends, I generated the 16-bit seed data and recorded all their SCODES matching the 40 bit integers of the Million Digit File using the Version3 Number Generator I wrote.

I did not see an easy or obvious way to reduce the overall size of the new data below the 415241 bytes that is the size of the Million Digit File.

I spent about 3 days after that last effort which reduced an encoded and larger file size by 20kbytes feeling down and I had no inspiration of what to try next, but then, an idea came to mind.

I realized that I could group the SCODES into sets of 8 and I could take each 44 bit SCODE and place those bits in 44 bytes at a specific bit positions. Thus those 8 SCODES form bytes that become strings of 44 bytes.

A nested for loop then picks the SCODES then generates billions of strings for each group of 8.

Those strings are matched against a common string that is generated from the very first group of SCODES. There are 10,380 sets to search through and each set offers billions of strings.

So far ( for the very first match string ) it seems that a match of 50% occurs in each searched set. Also several choices for that 50% matching occur. So for even one search string and one group of 8 the number of choices for a 50% match are many. Thus generating files to compress can be based on further intelligent choices.

With so many billions of billions this makes for a huge data set and a huge undertaking.

I open this phase up to all. I welcome the help of others. There is nothing too secret about this phase of the work.

If anyone wants to do some of these searches or has access to a super computer ( hundreds of threads ) this work will go faster.

Also I may not have the right plan here. I have spent over a decade learning how to transform data and little of that time with compression algorithms.

Below are examples of search strings used to match to the other generated strings.

{ 250 177 204 59 240 90 165 88 208 153 199 74 1 14 183 97 151 192 234 254 242 97

1 184 217 15 175 192 50 76 129 215 73 49 47 232 147 195 173 198 85 117 39 135 }

{ 250 176 205 59 240 91 165 88 208 152 198 74 0 14 183 96 151 193 234 255 243 96

1 185 216 14 174 193 50 76 128 215 73 49 46 233 147 195 172 199 84 117 39 134 }

{ 251 176 204 58 241 91 165 88 209 152 198 74 1 15 183 97 150 193 235 254 242 97

1 184 217 15 174 192 50 77 128 215 73 49 46 232 147 195 173 199 84 116 39 134 }

{ 250 177 205 58 241 91 165 89 208 152 199 75 0 14 183 96 150 193 234 255 242 96

0 185 216 15 175 192 51 76 128 215 73 48 47 233 147 195 173 199 85 117 39 134 }

{ 251 177 205 59 241 91 165 89 208 153 198 74 0 15 183 97 151 192 235 254 243 96

1 185 217 15 174 193 51 76 129 215 73 48 47 233 146 195 173 198 85 116 38 135 }

{ 250 177 205 59 240 90 165 88 209 152 198 74 1 15 182 96 150 192 235 254 242 96

0 184 217 15 175 192 51 76 129 215 73 49 47 233 147 194 173 198 84 117 38 135 }

{ 250 176 205 58 241 91 164 88 208 153 199 74 1 14 183 97 150 192 234 254 243 96

0 185 217 15 175 193 51 77 128 214 73 48 47 232 147 194 172 199 85 116 39 135 }

{ 251 177 205 58 240 91 164 88 208 152 199 74 1 15 182 97 150 193 235 255 242 96

1 185 216 14 175 193 51 77 128 214 73 49 47 233 147 194 172 198 84 116 39 134 }

{ 251 176 204 59 240 91 164 89 208 152 199 74 0 14 183 97 150 192 235 254 243 97

1 185 217 15 174 192 51 77 129 214 72 48 46 232 147 194 173 198 84 116 38 134 }

All of these represent the same thing, the first group of 8 SCODES for the Million Digit File.

The point of this is that a mind blowing massive data set is here to select from. A huge Universal set to explore.

.

Below is an example of matches as logged today. I just started running 8 searches:

Checking string 1

----------------------------------------------------------------

214 170 253 14 203 84 181 68 236 166 206 36 3 57 191 48 243 206 138 145 186 103 73 130 222 45 203 210 0 0 237 145 126 124 125 139 203 164 195 217 26 25 112 140 The count of unique byte values are 39

Processing set 9082

================

Number of Strings for set 9082 is 44748262944

* Highest match 21 : 200 141 138 152 66 125 170 124 154 177 248 0 253 73 174 149 91 24 186 149 26 230 172 24 36 176 68 139 3 45 203 52 195 25 50 246 163 214 114 220 112 57 164 41

+ Equal match 21 : 200 141 170 152 66 125 138 124 154 145 216 32 253 73 142 149 91 24 186 149 26 230 140 24 36 144 68 139 3 45 203 52 227 57 50 214 131 214 82 220 112 25 164 9

+ Equal match 21 : 136 229 143 176 103 48 138 68 139 140 253 17 217 76 131 181 54 104 186 237 54 210 164 84 125 249 57 243 75 4 151 25 130 41 94 170 238 234 71 140 0 13 185 41

+ Equal match 21 : 222 166 181 26 112 210 16 222 3 46 206 131 125 126 147 37 96 138 139 10 147 111 10 14 136 128 254 48 57 57 215 183 75 186 176 73 130 204 78 103 124 45 132 61

+ Equal match 21 : 216 166 221 48 53 176 25 164 26 125 204 160 43 95 209 103 38 170 201 14 214 65 5 70 237 139 186 82 91 87 181 184 0 138 189 57 253 138 84 110 112 63 217 43

Any of the above strings decode to provide 8 SCODES that decode to 8 40 bit integers of the million digit file.

checking string one means the very first string to match to and it is printed just below that statement.

"Number of Strings for set 9082 is 44748262944" means that there are 44,748,262,944 strings to check for matches in that set of 8.

I am seeing match counts of 20,21 and 22 bytes which is about half of the 44 bytes. In this case there are 39 unique bytes to search for.

I have divided the search into 8 threads. Here is a highest count report from a different search group.

* Highest match 22 : 84 140 243 160 53 139 22 45 202 87 220 237 112 187 116 126 240 65 172 149 159 44 54 0 67 31 49 203 236 130 132 125 222 14 241 124 160 35 166 195 253 57 3 68

+ Equal match 22 : 84 140 243 160 53 139 22 45 202 87 220 237 112 187 116 126 240 65 172 149 159 44 54 0 67 31 49 203 236 130 132 125 222 14 241 124 160 35 166 195 253 57 3 68

Only two matches for the highest count of 22 in the group 7787.

I welcome input on this approach.

I welcome other approaches.

I welcome a group effort on every aspect.

More than likely when it's time to generate a complete file from all the groups of 8 there will be many choices to craft a file from.

While it's true that the overall file size will be 456770 bytes the reality is we can select what bytes are in that file.

Do you believe that it will be possible to compress such a file?

I will hope that since the encoding process applies to all data equally that all overhead of that coding can be excluded from the final count. I also hope that with such a hand made file that a off the shelf data compressor will do the work of compression. So if the Codec overhead is forgiven and the compressor overhead is forgiven then any reduction below 415241 might qualify to satisfy the challenge. I would hope for a large reduction but even 1000 bytes below 415241 would make me happy.

In my mind this doable. Is this wrong?

In closing, I face a long run of processor hours to complete even one search. More than likely there is some optimal search string in the billions of search strings and I probably will never know which one it is without help.

At this time I have a theory that more than likely the 50% matching is about standard for all search strings.

What I can say with confidence is that I now face Data Compression questions I have not considered before such as what is compressible? How do I select for effect? What does a data compressor look for?

I am open to suggestions on how best to mine this data for a compressible file.

I am open to other methods as well.

I welcome others to join in.

Your Challenge friend,

Ernst

I was just wondering what the rules are regarding temporary files and/or memory usage by the program. I'm trying to come up with new idea's but some have "horrible" senario's like a 32gb scratch file.

:) Wish I knew of this challenge 10 years ago. Always had the same facination for perpetual compression as some people have with perpetual energy.

You know somewhere that it's probally impossible but you can't help enjoying youself for hours trying to do things that in the end just don't seem to work. :) And still smile... and try something else.

@Ernst: I never put a hand on this random data set but on the ones I used I tried lots things.

The way I'm currently looking on compressing anything is the question if binary data can be plotted in a 2d/3d canvas in a way that recreating that image/object and getting the plot points takes less space then noting down the x,y(,z) points.

Just a crazy old nerd idea tho... probally won't work, like all the others I ever had. :)

I am new to the whole "compress this one dataset thing". Never looked at the problem from that perspective before.

Temporary files and RAM are not factored in to the rules, so you are free to do whatever you need to do.

The only thing that might cause trouble is if your requirements get crazy big, there may be trouble reproducing work. So for example, if you work for Google and you do this as a 20% project, and it requires a 20TB distributed database and map/reduce on 15,000 computers, well, you might not ever get certified.

- Mark

@Gert

Welcome to the Challenge. Been over a decade for me at this. So far I am a wonderful Cryptographer.

@Mark, I would like to have the data in file format so searching would be a simple look up but I may never have that kind of file storage.

What seems true no mater what is I can get the representation down to the same size the sample data is in most cases but never below.

I may have to give up and just publish what I found so far and see what comes after that.

If I had all the permutations of these surjective codes accessible in files perhaps some selection could come in under the goal.

What I have settled for is to take the first instance of 20 bytes matching a control string. Thus each group of 44 bytes represents 8 surjective codes which generate 8 40-bit integers of the MDF.

It still is going to take the rest of this month to extract just one set

Well that was odd. The post "posted" before I wanted it to.

So to continue. So far I estimate that the data will not come in under 415241 bytes. I will follow through.

So this becomes 12 years this next February that I have worked on recoding and the nature of Information. Not all of these efforts are lost so perhaps I was always meant to advance cryptography. After all my goal has been to master the art to transformations. I'd say I have done well.

Perhaps I will go over to Stanislaus State University and find a mentor to advise me on how to proceed from here..

Too bad it would be such a task to go in reverse with surjective codes. Since I could take an arbitrary number of bits such as 44 or 48 and derive the codomain integer of 40 bits such that any 44 or 48 bit string would relate to a valid integer but to get back to that 44 or 48 bit surjective code would be a challenge itself. Yes it's one way from what I can tell. Express all the seeds for a number range and see what matches to what.

I don't know if I could design a surjective function that could support indexing of the codes generated but even then it would require a set of generated data for a fixed witdh seed value so there would be overhead there in the form of a data file and the time required to generate it.

It's possible that I could expand the encoding to include more "buffered data in the number generator algorithm" and in that absorbing more source bits. Perhaps enough or perhaps not.

What I am saying is this has grown past my home computer. I may simply need to acknowledge I no longer have machine enough to continue experimenting and observing.

So; is there a representation of MDF that will compress and be un-compressible? Absolutely there is. Do I think I can find it? Not currently.

I say there is because I understand the mechanics of information now.

Okay well, I will continue extracting the first match of 20 or best count (19) bytes matching the first source string so that I may examine that further.. Who knows maybe some slice and remix of that data will provide a compressible file.

Maybe I will be surprised and it will compress 60kbytes..

I think I have gone as far as I can before I share what I know thus far.

Ernst

Hi Mark and all.

If (IF!), by using a practical method, I were able to meet the challenge of compressing and losslessly restoring random data, what might the reactions be from the mathematics and scientific communities? Would the programming community consider it the "Second Coming"?

Just curious.

@Mike:

Total, unrelenting disbelief.

- Mark

@Mark: Thx, no I don't work for google. Used to design anti shoplifting systems if your interested.

My hobbysystem is a humble i3, 8gb mem and seperate 60gb (ssd) scratchpad. My time limit has always been 12 hours. Whatever it does, on how ever much, it gives up or succeeds in 12 hours or less. Seems reasonable and flexible enough to try anything.

Unfortually had some other things on my plate lately. This whole converting data into an x,y,z context and try (de)compressing that been really tingling. Hope to have some more real free time soon.

@ Gert

I was watching a video about the Universe as a hologram and they said that there is a relationship (with hologram) of random looking 2D data which is actually a 3D hologram.

Is this the approach you are looking into?

Nifty idea to say the least.

---------------------

Not much is new here. I am wanting to head in the direction of the J.C. and get help in writing some of the work I have done here down. I'll see about that soon.

I am currently puttering around with reduced symbols set transform of MDF. I expect it to expand the byte size of the file while reducing the number of total symbols by 50%.

I have found that not only does the number of symbols create difficulty in attempting compression of MDF but also the orders of those symbols too have complexity to deal with.

No telling if any of that will be a solution.

Well, it is a seriously challenging challenge friends!

@Ernst

[quote] Below are examples of search strings used to match to the other generated strings.

{ 250 177 204 59 240 90 165 88 208 153 199 74 1 14 183 97 151 192 234 254 242 97

1 184 217 15 175 192 50 76 129 215 73 49 47 232 147 195 173 198 85 117 39 135 }

{ 250 176 205 59 240 91 165 88 208 152 198 74 0 14 183 96 151 193 234 255 243 96

1 185 216 14 174 193 50 76 128 215 73 49 46 233 147 195 172 199 84 117 39 134 }

{ 251 176 204 58 241 91 165 88 209 152 198 74 1 15 183 97 150 193 235 254 242 97

1 184 217 15 174 192 50 77 128 215 73 49 46 232 147 195 173 199 84 116 39 134 }

{ 250 177 205 58 241 91 165 89 208 152 199 75 0 14 183 96 150 193 234 255 242 96

0 185 216 15 175 192 51 76 128 215 73 48 47 233 147 195 173 199 85 117 39 134 }

{ 251 177 205 59 241 91 165 89 208 153 198 74 0 15 183 97 151 192 235 254 243 96

1 185 217 15 174 193 51 76 129 215 73 48 47 233 146 195 173 198 85 116 38 135 }

{ 250 177 205 59 240 90 165 88 209 152 198 74 1 15 182 96 150 192 235 254 242 96

0 184 217 15 175 192 51 76 129 215 73 49 47 233 147 194 173 198 84 117 38 135 }

{ 250 176 205 58 241 91 164 88 208 153 199 74 1 14 183 97 150 192 234 254 243 96

0 185 217 15 175 193 51 77 128 214 73 48 47 232 147 194 172 199 85 116 39 135 }

{ 251 177 205 58 240 91 164 88 208 152 199 74 1 15 182 97 150 193 235 255 242 96

1 185 216 14 175 193 51 77 128 214 73 49 47 233 147 194 172 198 84 116 39 134 }

{ 251 176 204 59 240 91 164 89 208 152 199 74 0 14 183 97 150 192 235 254 243 97

1 185 217 15 174 192 51 77 129 214 72 48 46 232 147 194 173 198 84 116 38 134 }[/quote]

I've never really understood your approach to this challenge. I figured that was fair, since I've never explained the approach I would take either. So unfortunately, I can't offer much insight into how you might be able to refine your approach, but in a recent post you said that you've concentrated entirely on transformations and haven't really looked at the compression end of things. Well I'm no expert on either subject, and besides, you've got Mark for that, but in another message, you suggested that you would like to be able to have all of your "strings" available for searching (if I understood correctly).

One thing that would help in that regard would be compression, and glancing at your example strings that I've copied above, the amount that this representative data could be compressed is amazing, and it presents an opportunity for you to look at some of Mark's compression techniques, but in this case, I'd feel good to roll my own. Here's what I see in that data (and if it is truly typical of the strings you'd like to store, it appears that you could store many of those 44 byte strings into 44 bit strings. You'll need an additional 44 bit string as overhead, and only 1 44 byte string.

Here's how it works. I notice that the all of the values in any given byte position are either identical to the value in the same byte location in each subsequent string, or they differ by just one. This means that you could write out the first string, and then write the byts of each subsequent string as just one bit with 0 meaning that it is identical to the first string, and one to signify that is is a "1 off". You would need the 44 overhead bits to signify for each byte position whether you need to add 1 to the one offs, or subtract 1 (Assuming that the original byte values might be either the high value for that column, or the low value).

I hope that makes sense. Alternatively (and likely the saner approach) would be to use LZW compression, I suspect it would really shine given that data ... but Mark would know much better than I.

@ CWCunningham

Thank you for your reply.

Indeed I have focused on transformations and succeeded.

I am over my head here trying to analyze so much data. I am in need of training and help.

My efforts have lead to discovery that needs to be published sooner rather than later at my age.

I will attempt to return to school and hopefully accomplish writing down what I discovered along the way here. I promise that it is all good news!

I feel it is wise to double back on this decade long effort and share what I did accomplish. Then see what might be possible for the future.

To reply about that data; it's the output of a number generator program where because of 16 bit seed the codeword is 44 bits long and those codewords are layered by bit position in the data you see.

Thus one codeword is across 44 bytes in bit position b0 and the next at b1 and so on.. Each 40 bit integer of the MDF then has a representation of a 44 bit codeword .

The relationship of one 44-byte grouping to the next then is an area of study that can be pursued with emphasis on restructuring the data like I started to explore. I didn't see anything obvious and became frustrated. I "went cold" on that and started to poke around with other things but realized I better put pen to paper at this point and then invest further in MDF challenge refreshed.

So yes I have generated tebibytes of transforms all MDF but I am not so skilled in analyzing the data at this time.

Over the years I have attempted humor in the form of video links and what not. I feel somber sharing this video link but it seems appropriate. http://www.youtube.com/watch?v=tKeSRsy2l9E

I am now at the point of what I need.

I am not giving up, just being smart about not losing what has been gained.

I know so much more today about the nature and structure of information then I did ten years ago but it's been forged in isolation. Time for a change.

Ernst

Just having a cup of coffee and using the Internet at my local Coffee Shop and thought to be more forthcoming with what those "data-strings" are.

As I wrote they are codewords which are decodable. They represent 40 bit integers taken from the MDF in sequential order.

The codewords are then grouped using nested for loops ( C language ) where by the nature of nested for loops the last element changes more rapidly than the others.

Given I have placed the codeword across a group of 44-bytes the change from one to the next is slight because of the nested for loop. I thought I would have greater control in selections this way. True but 44 billion to pick just one set of 8 from? Big job! 44 billion isn't even the largest set count. It was 750 billion or so I believe for the largest

In a previous post, this is the truth about how many strings can be generated. Simply put, this is the total number of 44-byte constructs possible for one set of 8. So in truth I am processing ONE group of 8 codewords and have a choice of 44,748,262,944 constructs for just this set of 8 codewords.

I looked at this and realized I need a better skill set if I am to tackle sifting through trying to find data that would result in a compressible file.

Thus "Over-my-head" is a correct phrase.

So, there really isn't an extreme compression because I have to choose one of those constructs since they are all the same thing; a grouping of 8 codewords representing 8 sequential 40 bit integers of the MDF.

The rest of the codewords too have a similar fate. It was my hope that I would see some construct that would be obvious. I didn't see any obvious constructs.

Do I think there is some transform that will satisfy this challenge still? Yes I do. I am unsure how to go about that at this time. The maths are solid but I need more training to take advantage.

Speaking of that, I am working the plan to attend the local J.C. The first step is to get a job I can work around classes with and then get to the classes. I will be back on the road this weekend. I mentioned someone knifed my rear tire on my motorcycle yes? Those replacement tires are on the way to me now. I have opposed the Meth dealing and all the traffic in my neighborhood. I think a couple people had to do a few months in jail after they got busted and within one week of one guy getting out I had my tire knifed. I did mention I live in the rough part of town didn't I?

This going back to school is a life-choice at my age. I assume once I start it will become my main social activity.

LOL; I believe I can benefit from some of the very basic classes in English and Maths to start with. I assume I will follow a Computer Science major.

This isn't a bad thing. Getting old and continuing to learn is a good way to go.

So again, sorry to deceive or confuse with that data but the relationship between groupings of codewords from a software design point of view is a much bigger project than I am ready for at this time.

Again, transforming the data is easy and natural. Picking the elements that make a compressible file , not so much.

As a programmer, I try to view your approach in terms of how it could be used to achieve the goal of data compression, since that is the whole point, correct?

Since I can't draw flow charts in this box, I'll lay it out in c

Given the notion that "tryHardToCompress(theMDF, compressedMDF)" tries all known compression algorithms to the current state of theMDF and returns the size of whichever algorithm produced the best result. (my pseudo-code doesn't show it, but you need to make note of the decompression scheme to reverse it later)

If attempts at compression didn't produce sufficient compression, you either give up, or apply whatever clever reversible transform(s) you can envision to the current state of theMDF, and you try again.

Once you have a framework like that, carved into code, you can add new compression algorithms as you encounter / create them, and you can add new reversible transformations as you dream them up.

The output of the program should be a compressed file 1 decompression scheme, and the series of reversed transformations that will result in the recovered MDF.

Then you can create a decompressor that contains only those algorithms necessary to decompress that one file ... do it in < 500 bytes, and you have earned $100

This frees you up to concentrate on reversible transforms, and when the mood or inspiration strikes, you can tinker with tweaking compression schemes.

Ah, That's funny!

This file can be expressed in many ways.

For example the parity language structure of http://en.wikipedia.org/wiki/Collatz_conjecture

Is one I worked for many years. I used that maths in the Version 3 Number Generator.

The structure of that transform has a rule. No two set bits will be together in stream.

So }{0,1}, {00,01,10}, {000,001,010,100,101},....} are the legal codes for the number of iterations we see in the sets.

The reality is that it is Fibonacci sequence in structure.

Thus in reality the MDF then becomes a 622861.5 byte file coded with [[X+n],[x/2]] cycle that is the Collatz. One can use Plus n or minus n and n can be any odd value. It's very easy to pick some integer for n and encode a file with it. A sort of encryption. Could be effective.

What does seem to remain is the structure of the information that is MDF. That seems to keep transforms from compressing below the original size.

What I was attempting with that interlaced data construct and the NG was to construct the data such that I am the one defining file structure. It's a project I don't have a final result of since the billions of strings take a long time to run through. Think 8 nested for-loops crafting each set of 44-bytes and there are 10381 sets to process where the largest number of strings fr just on set of 8 was 777 billion.

So if I had all of the strings in a dataset I would then look through them looking for data that would compress.

This has a cost of 41,524 bytes.

I am open to suggestion on how to generate all that data, search and select.

You see I find myself asking "what is compressible."

So, I feel like that is a job I cannot accomplish without more skills. Maybe a class in statistics?

By the way. If we could generate some file that xors MDF to a compressible form that would work.

That is an interesting idea. I may give that some thought.

So, I'll ask here.. Any one experienced is writing up maths and going through the process of getting published?

@ CWCunningham

As you see I just shared an encoding method and shared some hard earned insight on a class of dynamic equations ( iterative function ).

I'm thinking it will be a treat for us all if you share a little about your efforts on this the MDF challenge.

Ernst

Hey all!

I am now registered for Junior College and am in the process of finding work I can do near the school so I can manage both.

After a number of years without a regular job due to plant closure and the down-turned economy it looks like there is hope in accomplishing my goals.

I realize this is off topic somewhat but after a dozen years of working this challenge I feel as if this is part of my life and in turn my life is a part of this challenge.

Also I am giving some thought to starting an automated search for a compressible form of MDF through some encoding using perhaps Zlib. I am not sure what form of encoding I would try. Naturally this would be a blind-effort where random chance would play a part.

Still why let that workstation sit idle?

I welcome anyone to share some news here. I know I will be involved with this area of study the rest of my life. How about you?

Ernst

on February 12th, 2014 at 4:20 pm, Ernst said:

on February 12th, 2014 at 4:20 pm, Ernst said:

@ CWCunningham

As you see I just shared an encoding method and shared some hard earned insight on a class of dynamic equations ( iterative function ).

I'm thinking it will be a treat for us all if you share a little about your efforts on this the MDF challenge.

Ernst

===================

My efforts aren't going to make much of a treat.

I looked at the MDF through a different lens than you. I was looking for exploitable patterns in sub byte chunks rather than multi-byte chunks. What I found was that the MDF is quite uniformly random. As if someone took equal amounts of ones and zeros, and then stirred them all up until the ones and zeros were fully dispersed. This is not compressible using the techniques I was hoping to use.

The way I figure it, theoretically, you could apply any (or all) reversible transforms on the data, as many times as you like, and see if the result is more compressible. The data will have to be transformed, if it's going to be compressed. And the diabolical part of this puzzle is that if you xor non-random data with data that is uniformly random, the result will be uniformly random.

The trick would seem to be to discover any reversible transformation that makes the uniformly random data into something you can compress using existing technologies. Then your decompressor is just a system call to gzip, or other, and then just reverse your transformations.

That's as much time as I've spent with it.

MDF stands up as "random" in every encoding I have managed.

I will try and get a paper out as soon as possible on "the transforms" but I assure you that even with these "transforms" MDF is in a cycle of files that do not compress.

Some ideas on generating numbers of the same size to use in logical operations such as xor are Pi, e, recursive modulus with primes and so forth.

If you know of Pi code that emits a stream of binary and can be trusted I would like to get my hands on that. I am unsure how to construct such a program.

Same for e I think.

Recursive modulus with primes can be implemented in GMP rather easy but could eat up time to generate up to some referenced prime.

We agree on transform.

Hey all!

I am almost done with a primary draft of the paper I wish to publish.

It is much easier to design software then to write a paper I am finding.

I sure can use some advice.

I suspect there is a peer review process before I submit it. I read about preprint servers.

Are there folks in our compression realm who are qualified?

I was thinking to walk this over to the local University and see if I can get someone to read. Is that a good thing to do?

Any advice? Any of us qualified?

The classification is data encoding naturally.

Thanks.. It's exciting.

Ernst

The paper advances.

I assume for someone writing their first paper arXiv.org (preprint serve) is a valid choice..

I've been learning to use TeX. Nice but overwhelming at first. TeX can do so much.

I have come to understand that a scientific paper is dividable into six parts. I believe I have a good outline on all six. The maths are good to go. I got all that done. It's the 'story" and the format of presenting the information that is my concern at this point.

I'm trying to have a paper that has solid facts in math-speak and also have a narrative for the ordinary person. I think I can pull it off. I don't want to turn people off either way.

So, guys any news on compressing MDF? I will be "out" of the game for a while. Trust me writing a paper for the first time being unsure how to do any of it has taken up many hours. The "versions evolve" then they get trimmed and I keep on trying to condense things down so I communicate much more signal than noise. I feel the only thing that looks bloated right now is the two pages of data. I feel the data is important because I make claims and anyone wishing to check those claims will have the data handy. I also have learned to work with libreoffice draw and use that to do flowcharts. Not a bad tool at all. It is supported on RHEL6

I still can use advice. I am thinking I should walk the paper, when done, over to the University and find a couple professors to read it before I post it on the Internet. Is that what people do?

I see nothing much in the papers posted to arXiv but I do see it in papers published in proper publications.

Well I hope to share with you all some of the "discoveries" I made along the 12 year road of this challenge. Lets hope sooner rather than later.

Ernst

Yeah arXiv.org is a fine choice for making it available to people. That won't make anyone read it though :-)

>Is that what people do?

The truth is, people who aren't in academia don't publish papers. Unless you have a very good relationship with someone at a university, it is unlikely they will take you seriously. This is unfortunate but true.

- Mark

Thank Mark..

I think I have to have someone who is authorized on arXiv to get me authorized.

The paper is done. Well the first official draft. Not that I haven't spent the last few weeks draft after draft. It has been an experience to get to this point. I mean the amount of ideas that end up condensed and the wording thought out and re-thought out.

The cool thing is that the hard work was writing things down and not the mathematics.

As for serious? What's not to be serious about? I have the honour of introducing something new.

You guys have read me talking about surjective this and number generator that so here is the mathematics behind the evil plan!

I got to the point where I was going really deep into the data encoding and I am not really ready to continue on in the dark. This way I hope to interact with folks more and we will have a common reference if I tell folks what I am doing.

So, I think I am supposed to pass the paper around to my colleagues before attempting to publish but on this I have none except this forum.

I may try to walk it over to the University and see who I can meet there. Any advice on this "peer review " phase is welcome.

I am sure that someone with no experience just learning to use TeX and never having written a scientific paper before should have something to fix about his paper. Wording, phrasing or something.

The paper is 11 pages with seven as text two as flowcharts and two as data.

Okay I will see what I can come up with. I'm sure I want to get some feedback before submitting. I have a hunch about a guy in Oklahoma U. I will follow that lead today.

I am open to suggestions. Perhaps there are honest folk among us who are qualified to render an opinion and critique a paper.

I will need an endorsement for arXiv so if you dear reader has status as an authorised author and wants to help a new guy publish his first well here I am.

Ernst

Hello, long time from my last post. Sorry but it has taken a year to settle my mom's estate and now I have time to get back working on compression code.

I bought an I7 with 32GB of memory so I have lack of power to test complex code. Even then I have spent over a week running one program to generate a special SBOX[] array only to rewrite the code and have it run in just under 4 hours!

I hope to do a lot of coding this year.

PS. Ernst, where do you live? Maybe I will stop by for a visit and take you on a trip sometime!

PSS. I plan a number of different trips this year, so don't worry if you are a little far.

I will be walking the paper to my college Monday. I will see if I can interest a teacher or two to look at the paper.. If they can spot problems that will help. Also my nephew has been reviewing conference papers this last semester at Oregon U and I am hoping he will take time out of his summer and away from his fox of a GF to review this paper.. I may lose there.

If nothing else I should submit this week. I don't wish to have several versions listed but that is a possibility if I need to advance my form.

As for continuing this challenge. I think I can relax and work on some projects. I might even be open to corporative efforts with others.

Who knows what we can do once I am not worried that my "Secret" will be discovered. LOL

I'll post a link when I have one.

Ernst

Alright! I am happy that the LaTeX I wrote did not cause problems on arXiv The paper looks like it should.

I must wait until after 4pm today to see if it was accepted or delayed or rejected. I would assume delayed as I am a new author. I would assume I need an endorsement.

The paper is Introduction to Dynamic Unary Encoding and yes I discovered it as a direct result of attempting to compress the million digit file.

While progress on that will continue the obvious benefit of attempting this challenge is the data set itself.

By changing the schemes I was able to learn about aspects of the nature of information. I agree I have more to learn. Let's hope I can finance Junior College soon.

Hey everyone!

I introduce Dynamic Unary Encoding. Once considered an oxymoron it truly is not.

If there are any taoists reading I especially want to hear from you.

http://arxiv.org/abs/1405.2846

Comments? Questions?

I did try to have someone proof read but asking people who don't know me and that I don't know Didn't work out.

Turns out there were typos and the maths looked unclear so After a few days I was able to see I had typos and that their mathJax had changed maths symbols on me. I also added more text to acknowledgements to include Mark, all here and this challenge.

I hope I have it corrected but I know me.. I may need to change up once more..

An update fixing typos and adding clarifications will be available 05/17/2014

It is possible to look so hard one cannot see.

I am back to "Thinking about what to do."

Now that everyone can see the "Ace" I was holding it should be clear that data is fluid. That "Spinning Strings" are indeed true and so on and so on.. I was living with the truth but having to allow people to see me as a crank. I admit I needed social interaction and that for the most part data compression people are strange on that level.

With the Surjective encoder the problem was the amount of data was mind blowing. I mean if I had a DNA storage device or DNA ram well it would be no problem but all I can fit in 32GB is all the cycles for 32 bit integers. I did create that data set and studied that noting that each 32 bit element in MDF was in it's own orbit. LOL Mark you don't know just how random MDF really is.

Since the number of cycles is 2^n / #elements in cycle there is a one to one relationship of bit lengths in the integer and accessing the cycle and it's index.

I started to study the bijective form of BWT. Since there is no cost to either or DUE it's an interesting direction to go.

So hey.. Is there anybody out there? Just nod if you can hear me.. LOL (Pink Floyd)

Ernst

I'm out here, but that doesn't make me bright enough to make sense of your paper.

I hope you won't take my analysis as a put down of your approach because any honest critique would require that I understand what you're doing and could make constructive input as to it's strengths and weaknesses, and I can't do that. I can, however, critique my misunderstandings of your approach, and though my critiques may be useless as far as helping you in a concrete manner, maybe my ideas could spark some idea in your mind that will allow you to help yourself, since you know what's going on, and I don't. So, please forgive me if my analysis is naive. If it gives you some ideas ... great, but if it's useless, at least it didn't cost much.

When I think of the ultimate goal being compression, and the challenge being to compress data that has never been compressed before in any useful manner, then the ultimate reversible transform is one that produces a smaller data set than what you start with.

Since no known compression scheme has been able to compress the MDF, then the solution path (to my limited imagination) is to either produce a revolutionary compression scheme that has never been seen before (something I doubt I could do)-OR- transform the data in a reversible manner into something that can be compressed using existing methods (something I think is theoretically possible).

For instance, encryption is a reversible transform, and in many encryption/decryption schemes, the size of the output is equal to the size of the input. The biggest problem I see in using encryption as a pre-compression transform is that encryption is intended to produce a data set that is indistinguishable from random data, meaning that it's unlikely that an encryption of the MDF is going to be any more compressible than the original.

The Burroughs Wheeler Transform, on the other hand, is a much more useful transform since (A) it does not grow the data, and (B)the whole point of the transform is to produce data that is more compressible. If I were taking a serious run at this challenge, the BWT is definitely something I'd have in my toolkit.

When I look at your approach, what I assume you're trying to do is to create a reversible transform that produces something compressible, which makes sense to me. But when I see you talking about needing vast quantities of storage, it's obvious that you are growing the data set astronomically. I can picture a larger data set that is more compressible, but it would seem like you're heading in the wrong direction.

Unless all the data you're generating that you need so much space for, is hundreds of thousands unique transforms, any and all of which could be used to reproduce the MDF exactly, and each of these unique transforms is smaller than, or at worst, not that much larger than the original, I just can't see how that many gigs of data would be useful to compress a relatively small file.

Hey thanks for the conversation.

As to paper, Introduction to Dynamic Unary Encoding I made an effort to write it so it can be read by a wide audience. While there are aspects that are better expressed with Higher Mathematics constructs they are in context so there is a potential to comprehend in general for the novice just learning about data encoding. If you find it problematic I welcome a description of the difficulty and once I have that I may have reason to revise again.

The Final Draft is on arXiv.org as of yesterday. http://arxiv.org/abs/1405.2846

My approach has been to find an encoding that is compressible. To that end experimentation and observation are the tools to use in general. The cycle of length MDF with a PRef of b0 is 2^22 files 415241 bytes long.

If we count all of the bytes of every element in that cycle I calculate 18,518,862,987,264 bytes total. I believe that is 17247 gigabytes of data. Given that 2^22 * 415241 bytes it is only one cycle the number fo cycles is then 2^3321928 / 2^22 cycles.

That is what I meant by storage issues. How do I examine those data sets in whole?

Please see http://www.extremetech.com/extreme/134672-harvard-cracks-dna-storage-crams-700-terabytes-of-data-into-a-single-gram

Still a technology of the future but, we now know we can store that much data and more (in theory).

@ CWCunningham

As my Data Compression Friend you know the author.

As to BWTS Yes it is an interesting possibility. The truth is the number of possibilities of solving this challenge are now infinite. DU is the tool of possibilities.

Also Noted: BWTS would require a count of the number of times it was applied. If we know the count we can forgo recording the information.

Truthfully the BWTS project here is awaiting me. The BWTS is converted to OpenMP and is multi-thread ready.

So Seriously, I could be employed with a staff to get all the things done I want to see accomplished. Currently I am researching the Tao in an attempt to find a connection. There is a possibility DU and Tao are related. To that I am looking into developing an interactive Oracle. So far it's just me but I have left it an open invitation to those I have been in contact with.1

My Next attempt on the MDF, which I am also working on, is going to be a long one. I will construct files and test them for compressibility. Millions of Millions of files I am sure. I am here today to reply and to get a copy of how to use the Zlib. Mark have an article.

Not to hijack but if anyone is interested in such things I have a home on the internet. Just click the link that is my name here.

[quote=Ernst]My Next attempt on the MDF, which I am also working on, is going to be a long one. I will construct files and test them for compressibility. Millions of Millions of files I am sure. [/quote]

Flaunting my ignorance, I don't see the need for millions and millions of files, here's what I mean ...

Generate one file.

Throw the kitchen sink at it and if it won't compress, delete it.

If after millions of files, you're not getting anywhere, redesign your kitchen sink.

Why save files that won't compress?

Given your earlier work where you were working with 5 byte numbers, why not start with the first 5 bytes out of the MDF, if you can find a way to reversibly shave one bit off of it, you're on to something ... try the next 5 bytes.

If you could shave one bit off of 5 bytes for every other five byte chunk of the MDF, you could write a program to reverse that 'shave' in the space you have created.

Hey C.W. No skin off my nose. Got an eight core workstation. I need to think up stuff to tax it to the max.

I think after the "big think" last night that I will run two different designs. I love it when an idea just pops into my head.

Man you really have to digest my paper. Millions and millions is just code for trillions of trillions.

As we know C.W. if we had a file say file X and we xor it to MDF such that X xor MDF = C and C is very compressible we win.

So it's a matter of generating X.

I doubt so so so seriously that we can get any place with MDF until we break it's structural back. Having recoded the damn thing so many times and I never see anything budge much.

Seriously C.W. I am not even worried about being first any more. No Biggie. Look at DUE. Millions of millions of millions of millions of millions and so on.

What is that Song ?

http://www.eyeneer.com/video/rock/the-who/i-can-see-for-milesmy-generation

Oh Yeah! :)

"People Try to put us down.." My Generation.

I'm in a really good mood this morning. Have a great one C.W.

One more Tune..

https://www.youtube.com/watch?feature=player_detailpage&v=HKhI09XO5R0

Well... update..

This next project is a long one. The state of things is that I have the foundation code written. The next phase is all the record-keeping code with some of the framework already defined. Basically when we are doing a really long term run we have to expect that at some point the computer will get rebooted.

This effort may never finish with 2,758,801,409,296 total possible files but sometimes advancement in research is not a straight line. You might understand what I am saying by the phrase "We often learn more from failure than success."

So like Galaxies look at that number.. Wow! Do you think one of them is going to compress? I am leaning towards no myself.

So hey Challenge friends! What are you up to?

Hey hey hey.. It will soon be a busy work season for me so I am enjoying a bit of frequent Internet browsing and this is my favourite data compression challenge website.

I have finished testing of that program generating all those many files. I am now of the mind that it will not produce a compressible file. It is a tough love decision but initial testing suggests no. So Mark add that to MDF.. two trillion files just like it.

So this gets pushed onto the stack. I will give it more thought but a third actor is needed in the algorithm and I am not sure what to try next.

So down but nowhere near out of the game still..

Ernst

I have figured out how to do this.. I can't reveal what I know but what I can say is I don't know the full puzzle I know a very big important piece of it. But I lack the ability to figure out the smaller piece.

Obviously since the data is random it will have to be treated with always different parameters or else it's pointless, now I'm trying to limit the parameters down to just the few.

I been trying this for years too and I had many failures and some very good discoveries which all served a purpose of better understanding the flaws.

I tried very abstract ideas like putting the random data into a rubix cube kind of game then I simplified it into a Sudoku game with missing numbers being what is compressed.

And now I simplified it even deeper then that which I am keeping a secret for now.

All I know it's always possible to shrink data over and over to a certain limit where it will start only growing..

Of course there is the law that always gave me dead ends when I was just starting up with this field which goes like this the number of possibilities in a larger value cannot be stored in a smaller value no matter what you do it cannot express more values without having ambiguity occurrences.

Also like to state that no matter how difficult you may think your compression is it's only difficult because you haven't fully understood it yet.

I have a method that works.. but I lack the ability to code the brain which does this efficiently without brute force.. when I try to do this with pen and paper I rely on logic which is not understood by computers yet and can't put this in words.

Yes I am accessing pre-stored data in other to decompress other data.

That's the thing that's hard to do.

If you want to know my idea I might as well just say it.. I built a simple assembly langauge with only a few commands mostly math and other commands cause mutation (self-modifying code) (which I keep secret for now) like I say its logic you can reach it in your mind but you can't take the information out of it.. so you can't understand how it works.

The problem is the instruction set of commands will have to re-ordered with new command opcodes for every block of data I try to shrink.

The decompressing is pretty straight forward you let the code run normally at the beginning which has no modifications in it then later down the line it will start modifying itself at run-time everytime a modification happens it can end up anywhere in the file (like torrents) and starts executing a completely new set of opcodes which are created at the runtime from what seems like a random area of data. Every set of opcodes is not stored anywhere in the file it's a discovery process where it figures out which opcodes do what command. This is where you have to start doing back-stepping after prediction is pretty accurate enough (not sure how to check for this yet.), But don't forget the opcodes are not constants they will change over time and by this I mean their meanings get swapped around (although the old decompressed random data will serve no purpose at this point, it can be discarded), this process doesn't require too much of memory it runs somewhat in real-time.

But until the prediction is pretty accurate it will keep accumulating the results

The compressing part is hard to do with computers my mind blanks out here trying to even attempt to put this into code.

That's because the actual random data is the opcodes+payload but since it's random data you have to find which random data is dominant and other factors like with random data is like very nicely joined and could serve a purpose of mutation (there is times where data is discarded completely being a special opcode runs here but still gets put as being decomprsesed, also times where the data is overwritten to get more data and this step requires usually human assistance of manual work since you will have problems understanding what data to put as decompressed and what data just served it's purpose of expansion opcodes afaik it's like a feeling at this point not really something that has a specific size to output as decompressed, so ya this step is some what luck needs more work maybe rules like splitting it into a different container which run's together with the main container but gets filtered accordingly.

I try to keep the opcode in 4 bits which only contains 16 different possible commands the payload could be any size.

It also requires a 1 container I call the sandbox where the data mutates (with possible running cycles until it hits a certain opcode counter which is always some set attribute limit to avoid corruption of stream).

The scanning process to shrink the opcode+payload can easily be done on pen/paper with your own mind.. but other then brute force on computers I see no other way. It's like making a OCR software for computers you need alot of learning for computer to understand, so this requires tons of manual labor as a person to teach it.

I have an interesting encoding that compressed. By that I mean a lot. I will be checking this out over the next few days and adding the decompress side of Zlib to the program. I don't see any obvious flaw in the algorithm and I am a seasoned programmer when it comes to writing codecs for the MDF Challenge. Still Magic Compression has been seen before here and it always turns out to be a defect in the code. That is what I call a result that appears to have compressed MDF but is actually flawed. Magic Compression.

I will keep you all posted if I have a solution or not. I sure want one after all this time.. Then I can do other things with my life.

Good Luck Challenge friends!

@ pkedpker

Do not worry that you know not. Simply set your direction and work consistently towards your goal.

I have a great respect for those with creative minds that don't quite fit in.

I suggest that you document each effort. Get in the habit of Blogging to yourself so that you can go back and see what you said.

Often I find that simply sitting down and typing ideas for an hour or two leads to a good idea. It is my way of working things out. It is funny that some of those files are truly bizarre. But hey in the creative process all is fair. Don't be afraid to dance around the room if it helps generate that great idea.

I have been known to get up while listening to music and dance around. But hey.. I do have some great music I listen to.

Good luck my Challenge Friend!

I found your website through links on Wikipedia. You must be pretty big time.

I'm not an expert on these things, but if a number is truly random, isn't Challenge version 1 impossible? If you can make a smaller program, doesn't that mean that number wasn't really mathematically random? Is the challenge to find a mistake in their series of random numbers?

On an unrelated note, I read your article about combining statistical and dictionary compression methods. You talk about the different probability of finding a u after q, for example.

But I'm wondering are there variations on Huffman and Arithmetic coding that take into account whole words or strings of bytes? For example, if you were compressing text it might make sense to have a code that simply means "the".

>if a number is truly random, isn’t Challenge version 1 impossible?

If numbers are random, doesn't that mean they will sometimes fall into patterns that are easy to represent?

If you generate an infinite stream of random numbers, some subset of those numbers will contain the first 100,000,000 digits of pi, very easy to compress.

- Mark

Update..

Had Magic Compression again.. Bummer.. Was not setting a bit when I should have been.

Comment: So far I have encoded the MDF in many ways and always the structure of the data is preserved thus no compressing below the original size.

I assume it will require that the file be mapped to something compressible.

@ Joe

Yes sure we can have codes representing "The" actually many compression programs parse data and assign codes for the most frequent occurrences of patterns. If the pattern "The" hapens a lot it will probably be coded with shorter codes.

With Random data the frequency of patterns is about equal to all other patterns so to code them no one pattern is more than the other so the only code that can be used is one of the same size as pattern or the file expands in size.

Does that makes sense?

The issue with random data is how do we represent that data structure such that we can map some patterns to the same codes yest know the meaning of those codes in context.

So I could say that the words "The" and "They" are code zero but in decode how do i know what is meant? Sure I can get three letters right but what about Y?

I have explored transforming the file into other files in hope of finding a file that can convert back yet is compressible.

I just presented an encoding called Dynamic Unary that adds to our tools of manipulating data. I design encoders with Dynamic Unary as a component. That makes changing data as easy as any mathematics is such as A xor B and so on..

The Million Digit Challenge file is highly structured. Think of it as a super dense amount of information already condensed. It is far easier to make much more data out of it than to condense it further.

Still if we could generate some file the same length that Xor's to it and the result is compressible we win.

I like the Challenge myself. It's not the only thing I do naturally and what I do with it helps me in other projects. So even If I die trying here it's already paid off for me.

If you want advice I would be willing to chat. Just click my link.

Update:

Well it's getting closer to my three months of hell so I am enjoying these wonderful June days. Man what nice weather here in California.

I had an interesting encoding result so I am now exploring that. I am optimistic once again.

I have seen my share of Magic Compression. I think I have experienced the full range of highs and lows regarding this Challenge.

So if you don't see me here much it will be because I am really busy with my seasonal job and have no time to get online.

Good Luck Challenge Friends.

Ernst

I do understand the pigeonhole principle, but that won't turn me down immediately in accepting this challenge. While one random number is not the other, there are quite random numbers (as you stated) that do have a name (pi, e, golden number, silver number, etc.) if your bitstream is part of this number, I would only need the code to generate the number, and a starting point. Off course it would take supercomputer powers (or a good deal of luck) to initially find the 'hidden variables' . Can you reveal a little more on how the bitstream was generated? Or can you post some links to 'Interesting reading material'?

The only chaos there is, is the chaos we haven't structured yet.

They creators of the million random digits have generously made the details of the creation available for free:

http://www.rand.org/pubs/monograph_reports/MR1418.html

- Mark

I'm not familiar to this topic so I probably overlooked something, but this is what i found:

If I interpret the bits per 16, i get a uniform random distribution. (between 0 and 2^16). If i take the first number and the difference between subsequent numbers, then they are normally distributed. (I have the same amount of numbers). Since it is normal, it contains info, and therefor can be compressed. If i use gzip, i go from 415.241 to 285.656 bytes. To zip I use gzip (68.096 bytes). to diff i use Matlab, but i cannot imagine this takes a lot of bytes to code. Are you familiar to Matlab? The code looks as follows:

f1='AMillionRandomDigits.bin'

fid = fopen(f1,'r');y=fread(fid,inf,'*uint16');fclose(fid);% load

z=[y(1);diff(y)];

fid = fopen('z','w');fwrite(fid,z,'*uint16');fclose(fid);% save

dos('PATH H:\utils&&gzip -c H:\z > H:\z1'); % compress

y1(1)=z(1);for i=2:numel(z);y1(i)=y1(i-1)+z(i);end% check result

sum(y-y1') % check if data is still the same

I'm excited to know if my previous post was right.

I don't expect it to be right but like to get some feedback about it.

@ooms:

You said:

>Since it is normal, it contains info

While the fact that a set of numbers falls into a normal distribution certainly might contain information, I don't think it necessarily means that there it contains any information about a specific number in the sequence.

Are you saying that this matlab code works for you? I would be very surprised if it is doing what you think it is doing.

- Mark

Hi Mark,

The code works on my computer. I don't know if it is using other resources than gzip, but I can sent you the zipped file and the (un)zipper (gzip) so that you can check if it works. But to compare the unzipped file with the original file, you have to first integrate the numbers. Like this:

y1(1)=z(1); % z(1) is the first 16 bits of the unzipped file

% y1(1) is the first 16 bits of the recovered source

for i=2:numel(z);% do from 2 to end of file

y1(i)=y1(i-1)+z(i);% i is the 16*i th series of 16 bits

end

can you explain where the code won't work?

where can I sent the zipped file and the unzipper to? the unzipper is gzip compiled for x86 (I think 64 bit, but I'm not familiar with architectures)

My email address is up at the top in the "About" tab.

If you are providing an entry, you really don't want to be including this like "you have to first integrate the numbers".

That's your job, not mine.

- Mark

ooM, It's pleasing to see you consider such things.

I think we know the proverbial Way and it seems that no matter who it is they seems to track in the same tracks as those before.

I do not disparage those sorting through the shells for a Pearl.

As to the structure of MDF, I recently translated MDF into a form with only three symbols and although I didn't write those numbers down for ease of posting there was absolutely no love there.

Whatever is to be done it seems mapping MDF to something compressible is the only way to go in my opinion.

Keep your spirits up! Thinking is not harmful. Not thinking can be.

Wow not one post since the 10th

Well.. I was working 16 hours a day for 10 days and now just work 8 hours a day 7 days a week except for today! huge surprise to get the day off in the middle of Crush (Wine Grapes)

So: I have a story to tell about fooling myself big time with the absolute King of Magic Compression foolery.

The story starts with my experimentation and observations heading towards an encoding scheme that had promise.

It is true that I am working on some encoding scheme or another every day. I keep a Log-Book(s) and at the very least write in it everyday if I am not at the keyboard. So here is the story

I wrote an encoder and processed the MDF in the following way. First naturally Zlib tries to compress the input. Then the file is passed to a function I call the "Pre/Post Parser." Then that file is sent to the encoder. Then sent to Zlib. This becomes the output.

With this new encoder Zlib reported over fourty thousand bytes of compression. Well at this point seeing Magic Compression is nothing new so I start in checking the encoder and it looks reasonable so I go to step two. I write the decoder. Now it's often at this point that any flaw in the encoder is exposed. It's an activity that I know well.

So I get the decoder to sync with the encoder after a few minor corrections.

At this point then I see that the file that the pre-Parser sent to the encoder and the file that the decoder sends to the Post-Parser are the same!!! Wow indeed.

So Here I am looking at over forty thousand bytes of compression and the input and output match!

This where I get excited.. So much so I tell three people I did it! I'm as high as a kite.. Just giddy with excitement!

I begin to write the complete program. The command line parsing all the way down to writing the decompressed file out to disk.

So sold on this I was. So after about a week the complete "compression program" is finished and functioning as intended.

Now came the obligatory file check.. MDF to the "Decompressed file." They didn't match. Oh My GAWD!

After three days of examining the encoder and decoder it came down to going into the Basement and working my way up.

I soon discovered that after i had repartitioned and formatted my working drive then installing all the common files from backup that the backup of the "Pre/Post Parser" was a version that had an uncorrected bug. Somehow I missed updating the backup two years ago. The cause of the Magic Compression was an improperly initialised variable. A bug I had already dealt with in the copy of the Parser I had been using for two years already without any concerns.

So there is it.. Egg on my face in a huge way! I yelled BINGO for christ's sake.. Told folks I had done it.. Made a fool of myself.

Oh well. At least it wasn't deliberate.

I then have been trying to perfect the encoding scheme in hopes of saving face. So Twenty or more experiments later I have a good grasp of this aspect of encoding. Now huge reduction and nothing Zlib can crunch.

My work now is reaching the One to less than one ratio however. The aforementioned encoding scheme did not generate forty thousand bytes of reduction is file size but it did do nine bytes. My current best record is fourteen bytes.

There it is.. The ultimate foolishness. Not working it out to the end first! I know better. I just assumed Parser was all good!

From these twenty or so experiments in the direction of this encoding scheme has benefited the Surjective encoder project however. I now have an encoding scheme for that where 40 bits will be represented with 38 bits. Yes yes the Pigeon Holes are not over stuffed.. There will be 2^39 40 bit elements represented by 38 bits but I am sure the elements generated will not all be unique.

Then again the Surjective encoder I reported a while back also did not cover all 2^40 integers too. It still managed to find all 40 bit words of the MDF. The new Scheme has twice the number of elements at a six bit savings. Indeed I learned a new trick with the embarrassing episode of magic compression I just went through.

So that's my story and I am sticking to it.

My schedule is rather busy for the next three months so I don't know what progress I will make but I am working on things everyday, just not able to do it when I want.

Good Luck Challenge People.

Ernst

Well, by need I have managed to make it down to the Internet cafe. I attempted to clean the CPU cooler on the 8-core and found that using "their" heat transfer stuff makes for a baked on CPU which gets pulled out of socket without socket being unlocked. Going to clean that junk off the next one and use my own transfer stuff.

So what to say.. Bent pin dead 8-core.. So I look and see there is a new 8 core the AMD 8350 and all 8 cores clock at 4.0 GHZ with turbo of 4.2 GHZ so that sounds super sweet.

As I posted on my website the "Surjective2" program is now in testing phases which includes the verification and decoder programs. Not a giant leap there as basically this is an engine change-up from version3. The new engine is multiples faster and sports a 38 bit codeword representing 40 bit words.

It will be seen when the new CPU gets here if this variable set I have chosen is one that contains all the 83049 40 bit integers that are the MDF. I round up and Mark! That tail end 100 is the first to get found every time. I think you tacked that on did you not?

So it runs faster. It reduces the codeword length from 43 bits to 38 and if all of the MDF cannot be found in the 2^39 integers this set of variables generates then i swap out variables and try again.

I must make note that all previous versions found multiples of MDF in the Surjective realm meaning I could assemble many many different files that would all decode to MDF but at 43 bits per 40 bits source.

I compressed the encoder file which is larger than the decoder will end up as and it crunched to 7k.

The projected reduction is 2 bits out of 40 so with the rounding up 415245 bytes around 20,000 bytes reduction on MDF is the goal

Fingers crossed here. I sure have done the Home Work on my own all these years. I own it success or failure.

That's it.. Still doing the hard work I bet most of you couldn't do on your best day but hey.. It keeps the electric on and the CPU running all year.

Ernst

Hi ooms,

Like Mark I would be suprised if simple differences would be compressible, but for fun I wrote a small C++ program to test your approach.

The program computes the diffs and saves a binary file which is the same size as the orignal. When this diff file is gzipped, the result is 415398 bytes, 157 bytes larger than the original binary.

I parse the binary file one unsigned short (16 bit) at a time:

39707 2111 20863 40789 4328 35275 2122 1098 46155 33338 ...

Then set out[0]=in[0] and compute out[i] = in[i] - in[i-1] for the rest:

39707 27940 18752 19926 29075 30947 32383 64512 45057 52719 ...

The md5 hash of the diff binary file is c08177eb9d7fed4d0831e0577b7cd83c.

Does these values match what yoy get with MatLab?

Well, I have that new 9370 here. Liquid Cooled eh? Super cool. I plan to spend Labor Day installing everything.

I have been running the test of the newest "Number Generator Engine" on a laptop. What I see, because of the dependable structure of the Million Digit File, is that this engine design is a "Fair Coin" type of output. http://en.wikipedia.org/wiki/Fair_coin

The number of unique matches of 40-bit words that make up the MDF are about 18,000 each run when using different variable sets.

This is an 90% unique element result with about 10% not being unique matches.

If those results extends to all the 40-bit numbers generated then 90% of 2^38 is about 247,390,116,249 unique 40-bit integers with the codeword size that generates each 40-bit integer being 38 bits.

I assume that many uncompressible files the length of MDF or larger may be constructed from the set of integers generated. Thus it may be suggested that I have achieved compressing the uncompressible. If that total 40-bit dataset is considered one file then at 2 bits of savings per 40-bit word there is a theoretical 61,847,529,062 byte reduction.

This is common for all variable sets explored so far.

The mathematics that might determine the best set of variables to use in this Surjective2 engine is something I am thinking on. I feel graphs may offer some inspiration so I am open to suggestions of any C program functions to display integers as dots in a domain on screen.

The next experiment will aim at seeing what a simpler construct that is something of a hybrid between the Varsion3 engine and this the Surjective2 engine will provide. This Surjective2 can be considered "A Net cast shallow." Where the Version3 engine cast a net wide and found all MDF but at the cost of 43 bit codewords for 40 bit source words.

So, I am able to code 40-bit integers with 38-bit codewords now. All the standard of information-physics apply.

How to generate all the MDF words with one set of variables is a good question. One possibility is not looking for the whole file but smaller subsets of it.

I have Sept 1st off I think. I may have time to write the next evolution in this Number Generator Series.

I am considering moving my blogging to eberg.us I assume there is little interest in what I write here on Mark's site.

Am I wrong Data Compression Friends?

Happy Holidays

Ernst

Ernst

I would disagree, I think your posts are read by more people than you realize. I have been to your site and your posts here seem more grounded and easier to follow for me. I have never posted but I say keep it up. However, This mark's website not mine. I would note he has mentioned you favourably in the past and never gave an indication there was an issue. Also remember, he can keep you from posting if desired.

I for one, Thank You for your posts and work you have shared over the last years.

I am working on my own project and have been laughed at as well.

I just don't have a few billion years to prove I'm right! :)

I believe prime number distribution and PI are "compression" in their own way. I have been at this for 15 years now myself and I'm still at it. I will post results WHEN I have them. Trying not to let you beat me to it. :) Good Luck.

Jeremy

This is my 3rd attempt at posting this comment, hopefully it sticks this time. I've been following this thread for a while now, hoping I could contribute one day. I've yet to see anyone approach the problem from a graphical standpoint, so I've converted the data into the below images.

The first is the grayscale bytes at 673x617 resolution and the second is black/white bits at 1346x2468. There does appear to be a grid like pattern to the bits when they are viewed between 98 and 99% zoom. I don't believe this is a display/screen artifact as it's location and spacing are consistent.

http://i102.photobucket.com/albums/m108/tipofthescorpion/Million%20Random%20Digits/MRB_Bytes_20140922_1522_zpsab700177.png

http://i102.photobucket.com/albums/m108/tipofthescorpion/Million%20Random%20Digits/MRB_Bits_20140922_1533_zps59c83c8e.png

Addendum : The links I provided appear to have been encoded again and are no longer true to the originals. Here are the correct links.

http://s102.photobucket.com/user/tipofthescorpion/media/Million%20Random%20Digits/MRB_Bytes_20140922_1522_zpsab700177.png.html

http://s102.photobucket.com/user/tipofthescorpion/media/Million%20Random%20Digits/MRB_Bits_20140922_1533_zps59c83c8e.png.html

I think I've solved your second challenge. The program included below can compress the million random bit file by exactly one byte. It should also work on any random permutation, and on any file size, given substantial enough size and randomness.

Here's the code:

https://github.com/Barbacamanitu/splitCompressor/tree/master/md5Compressor

And a compiled version:

https://dl.dropboxusercontent.com/u/89223857/splitCompressor.exe

Just run it from command line to see the Usage.

@Russ:

Traveling today, I promise to take a look tomorrow.

A claim to win by one byte has an air of plausibility that other claims often lack, but still, it will be surprising to see.

- Mark

@Russ:

Sorry to be harsh, but your entry is neither successful nor particularly good.

To win, you need to be able to generate the million random digits with a collection of program and data that amounts to less than the original file size.

You don't do that. Your total size is:

sizeof(MillionDigitFile) - 1 + sizeof( your app )

Which is much greater than sizeof(Million DigitFile)

So... you just ignored the requirements and declared success?

Your algorithm is pretty simple: split the file in two at a point where two consecutive identical bytes are found. Discard the second consecutive byte and start a new file. The decompressor knows that each time it closes one input file and opens a second, it repeats the last byte in the previous file.

This of course is data hiding, and is prohibited in the rules. If you had taken this to the extreme and split on *every* duplicated pair, you could reduce the total file size by 1624 bytes.

Use an arbitrary delta instead of 0 and you can select the delta with the most hits, which would give you 1748 bytes savings. Maybe, just maybe, you could write a C program that stiches together 1748 files passed on the command line into the output, and have your zipped program size < 1748 bytes, but it would be very tough.

But again, you wouldn't win, because you are hiding information in the files.

Thanks for trying!

- Mark

Jeremy, Thanks for posting a little conversation. I'd really love to have an understanding as to the truth of the structure of numbers.

I'm not sure if it is of importance but I made an observation on the number of primes in relation to binary string length. I can look for that text if you like. There a steady growth rate to prime numbers in relationship to binary magnitude (string length).

Pi now, well what is a repeating pattern? There are plenty of repeating patterns if we look at bit segments of a finite length binary string from the "stream" of data that is Pi, say 8-bits. It's just that Pi doesn't look like rational numbers with the decimal cycles. Say do you have code for a Pi generator that emits bits?

I too know of those 'thousand year" programs. I'm experimenting with one right now. Either I have a bug or one cycle takes a lot longer then I expect. There are many billions of cycles to a complete run. Hey I now have a new AMD 8350 cpu so it's back to 24-7 uptime. By the way I recommend the Antec 950 cpu cooler! http://www.antec.com/product.php?id=706543&pid=58&lan=nz

Studying the nature of numbers is an honourable pursuit so don't accept ridicule. I know a lot about the structure of information today that I did not 12 years ago so I feel confident myself.

Ernst

thesupermonkey

I thought about representing MDF as some sort of graphics but I have weak skills in that area.

Can you reverse the process and extract the MDF? If so do the pictures compress losslessly?

You might try applying a Move To Front approach in a iterative cycle. That means processing the file over and over again. I saw some interesting "Data-Storms" in the output when I experimented with MTF a few years ago. That might make for an interesting graphics project. Maybe a kind of movie of sorts if the file is converted to graphics each iteration of the MTF. I can look for that code if that interests you. I don't have time to go down that road myself.

Ernst

Bummer Russ Martelli.

I will be surprised if a simple solution is found.

Data hiding was the basis of another claim I do believe. There was data in the file name I think of some such thing.

So don't give up! Keep on keeping on.

My personal record is something like 16 bytes reduction not taking into account the size of the decoder.

@Ernst:

16 bytes is pretty close to as good as anyone has ever shown theoretically possible. And it's a *lot* of work to get there!

- Mark

Really?

I am "recoding" the data so don't think of the Million Digit file proper getting 16 bytes of reduction. One of the forms I turned MDF into got 16 bytes of reduction.

I hope that helps clear things up a bit.

I have a question to ask before I invest a bunch more man hours on this latest experiment.

Has anyone explored the idea of expanding the data in the MDF in order to compress it?

I mean we could say the MDF is already a "Compressed File."

I have generated a file that ends up being about 16 times the size and compresses 91% to under 600kbytes so the goal is 95% or 96% compression. It was Bzip2 by the way that did the 91% compression.

So any input and or advice on this avenue of investigation?

I think data hiding should not be dismissed out of hand.

For instance, storing intelligence in the filename could be a key to some clever constructs, and let's face it, a successful attempt is going to be clever for sure.

As it is, filenames are expected to be useless appendages that don't count in size calculations, and this makes perfect sense for the anticipated case, but then you realize that someone could hide a few hundred bytes in a filename and game the system.

The way I see it, if someone supplies a solution that works, that's promising. If however it breaks when one of the filenames is changed arbitrarily, then the size of that filename has to be added to the filesize since it is obviously part of the file, rather than a useless appendage.

Any imaginative solution should be allowed as long as the author is honest about which parts are arbitrary and which are significant.

Ernst said:

I have a question to ask before I invest a bunch more man hours on this latest experiment.

Has anyone explored the idea of expanding the data in the MDF in order to compress it?

==================

If it works, it works, and is certainly worth experimenting with, but I think it's a bit of an uphill battle.

I figure that if you expand the file to make it more compressible, maybe it will help, but the compressor has to account for the additional data too, so the resulting zip file is likely to be larger, rather than smaller. I hope I'm wrong.

There is no doubt that data hiding can be done in clever ways, and it might exhibit some interesting programming techniques.

But in general I don't view it as interesting from a data compression point of view. For one thing, it is almost always dependent on very implementation-specific things, such as file systems, and doesn't scale into a general purpose technique.

So in this challenge, data hiding is ruled out.

I need to make a little more clear in the rules exactly what that means. I think the perfect test for data hiding would be the ability to decompress your data with just two items: a pointer to a block of memory, and a size value. The size value is there because we already know that the file system hides that information, and we normally get it for free when decompressing from a file.

- Mark

CWCunningham said:

Yeah I tried a few things but even with Bzip2 getting a 91% compression on the "expanded" data it wasn't enough to get it below the 415241.

Thanks for the reply.

## Leave A Reply

You can insert source code in your comment without fear of too much mangling by tagging it to use the iG:Syntax Hiliter plugin. A typical usage of this plugin will look like this:`[c]`

int main()

{

printf( "Hello, world!\n" );

return 1;

}

[/c]

Note that tags are enclosed in square brackets, not angle brackets. Tags currently supported by this plugin are: as (ActionScript), asp, c, cpp, csharp, css, delphi, html, java, js, mysql, perl, python, ruby, smarty, sql, vb, vbnet, xml, code (Generic).

If you post your comment and you aren't happy with the way it looks, I will do everything I can to edit it to your satisfaction.