<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The Hutter Prize</title>
	<atom:link href="http://marknelson.us/2006/08/24/the-hutter-prize/feed/" rel="self" type="application/rss+xml" />
	<link>http://marknelson.us/2006/08/24/the-hutter-prize/</link>
	<description>Programming, mostly.</description>
	<lastBuildDate>Tue, 07 Feb 2012 16:05:51 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Mark Nelson</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-325169</link>
		<dc:creator>Mark Nelson</dc:creator>
		<pubDate>Wed, 14 Apr 2010 14:01:27 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-325169</guid>
		<description>@jiahb:

I don&#039;t know how much this has been studied, but I have a strong feeling that using an order-0 or order-1 model to compress equivalent English and Chinese texts will yield very similar results. I&#039;d be surprised if there is much to gain by what amounts to tokenizing English text.

- Mark</description>
		<content:encoded><![CDATA[<p>@jiahb:</p>
<p>I don&#8217;t know how much this has been studied, but I have a strong feeling that using an order-0 or order-1 model to compress equivalent English and Chinese texts will yield very similar results. I&#8217;d be surprised if there is much to gain by what amounts to tokenizing English text.</p>
<p>- Mark</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jiahb</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-325167</link>
		<dc:creator>jiahb</dc:creator>
		<pubDate>Wed, 14 Apr 2010 13:56:56 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-325167</guid>
		<description>Translate into Chinese and you guys will probably get beyond the Shannon limit. Perhaps Google will be able to do that.
1 Character = 2 Bytes. 
1 Latin word = 7 Bytes = 1 Chinese word = 2 Characters = 4 Bytes
Use whatever conventional compression routine, then you get it.
Of course, this compression may lose information, or may gain more information, depending on how good your translator is. And it&#039;s not 100% reversible. 
Because you are using a sample from Wiki, not random stuff, there is already some internal information, or so to say, quantum entanglement. Good luck to y&#039;all!</description>
		<content:encoded><![CDATA[<p>Translate into Chinese and you guys will probably get beyond the Shannon limit. Perhaps Google will be able to do that.<br />
1 Character = 2 Bytes.<br />
1 Latin word = 7 Bytes = 1 Chinese word = 2 Characters = 4 Bytes<br />
Use whatever conventional compression routine, then you get it.<br />
Of course, this compression may lose information, or may gain more information, depending on how good your translator is. And it&#8217;s not 100% reversible.<br />
Because you are using a sample from Wiki, not random stuff, there is already some internal information, or so to say, quantum entanglement. Good luck to y&#8217;all!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-10464</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Mon, 19 Mar 2007 15:37:06 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-10464</guid>
		<description>era, i never know whether to bless them or curse them. it bothers me that academic papers are locked up and inaccessible to the general public behind pay-to-read journal sites like ieee xplore, especially when you consider that we, the taxpayers, pay for 90% of the stuff that is published there.

on the other hand, it is copyrighted material, and under current law the authors do have the right to keep it private.

meanwhile i suppose i ought to snag a copy...</description>
		<content:encoded><![CDATA[<p>era, i never know whether to bless them or curse them. it bothers me that academic papers are locked up and inaccessible to the general public behind pay-to-read journal sites like ieee xplore, especially when you consider that we, the taxpayers, pay for 90% of the stuff that is published there.</p>
<p>on the other hand, it is copyrighted material, and under current law the authors do have the right to keep it private.</p>
<p>meanwhile i suppose i ought to snag a copy&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: era</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-10431</link>
		<dc:creator>era</dc:creator>
		<pubDate>Mon, 19 Mar 2007 09:15:51 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-10431</guid>
		<description>Looks like somebody at Brown scanned the BSTJ article and put it on-line as a PDF: http://www.cs.brown.edu/courses/cs195-5/extras/shannon-1951.pdf

The quality is, erm, shall we say, a good test of human ability to reconstruct information from context.</description>
		<content:encoded><![CDATA[<p>Looks like somebody at Brown scanned the BSTJ article and put it on-line as a PDF: <a href="http://www.cs.brown.edu/courses/cs195-5/extras/shannon-1951.pdf" rel="nofollow">http://www.cs.brown.edu/courses/cs195-5/extras/shannon-1951.pdf</a></p>
<p>The quality is, erm, shall we say, a good test of human ability to reconstruct information from context.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-356</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Wed, 06 Sep 2006 09:53:08 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-356</guid>
		<description>Looks like the existence of the prize, and perhaps the prize money, are having the desired effect of accelerating the rate of improvement.</description>
		<content:encoded><![CDATA[<p>Looks like the existence of the prize, and perhaps the prize money, are having the desired effect of accelerating the rate of improvement.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: James Bowery</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-355</link>
		<dc:creator>James Bowery</dc:creator>
		<pubDate>Wed, 06 Sep 2006 05:02:17 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-355</guid>
		<description>Alexander Ratushnyak&#039;s latest (Aug 29) entry, paq8hp3, if confirmed, compresses enwik8 to 1.394 bits per character.  This represents nearly a 5% improvement since the contest began, whereas historically the improvement has been approximately 3% per year.

When Mr. Ratushnyak achieved 4% improvement with paq8hp2, some speculated that all &quot;low hanging fruit&quot; was &quot;gone, gone, gone&quot;.

For Marcus Hutter&#039;s sake I hope there is a reprieve for at least a few months so some other donors can feel comfortable placing their money &quot;at risk&quot; in this purse.

However, there is still the potential that someone will soon use a serious language model and push the compression rate to 1.3 bits per character, the top end of Shannon&#039;s estimate of how well humans predict English text.</description>
		<content:encoded><![CDATA[<p>Alexander Ratushnyak&#8217;s latest (Aug 29) entry, paq8hp3, if confirmed, compresses enwik8 to 1.394 bits per character.  This represents nearly a 5% improvement since the contest began, whereas historically the improvement has been approximately 3% per year.</p>
<p>When Mr. Ratushnyak achieved 4% improvement with paq8hp2, some speculated that all &#8220;low hanging fruit&#8221; was &#8220;gone, gone, gone&#8221;.</p>
<p>For Marcus Hutter&#8217;s sake I hope there is a reprieve for at least a few months so some other donors can feel comfortable placing their money &#8220;at risk&#8221; in this purse.</p>
<p>However, there is still the potential that someone will soon use a serious language model and push the compression rate to 1.3 bits per character, the top end of Shannon&#8217;s estimate of how well humans predict English text.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-294</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Fri, 25 Aug 2006 21:51:06 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-294</guid>
		<description>Thanks for the correction on the Shannon estimation, Matt. Lacking an online copy of the paper, I checked my copy of &lt;a href=&quot;http://www.amazon.com/gp/product/0139119914/102-2559624-1104943?v=glance&amp;n=283155&quot; rel=&quot;nofollow&quot;&gt;Text Compression&lt;/a&gt; by Cleary, Witten, and Bell, and they cite the same upper and lower bounds as you, 0.6 to 1.3.

Which is good, because it means that there is still a lot of room to go for Hutter prizewinners. If the 0.6 lower bound were to hold, we could see the 100 MB corpus reduced to 7.5 MB, roughly half the size of the best efforts to date. And I am a believer: getting to that point is going to require something that starts to look like intelligence. (And as Turing proposes, if it looks like intelligence...)</description>
		<content:encoded><![CDATA[<p>Thanks for the correction on the Shannon estimation, Matt. Lacking an online copy of the paper, I checked my copy of <a href="http://www.amazon.com/gp/product/0139119914/102-2559624-1104943?v=glance&#038;n=283155" rel="nofollow">Text Compression</a> by Cleary, Witten, and Bell, and they cite the same upper and lower bounds as you, 0.6 to 1.3.</p>
<p>Which is good, because it means that there is still a lot of room to go for Hutter prizewinners. If the 0.6 lower bound were to hold, we could see the 100 MB corpus reduced to 7.5 MB, roughly half the size of the best efforts to date. And I am a believer: getting to that point is going to require something that starts to look like intelligence. (And as Turing proposes, if it looks like intelligence&#8230;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt Mahoney</title>
		<link>http://marknelson.us/2006/08/24/the-hutter-prize/comment-page-1/#comment-293</link>
		<dc:creator>Matt Mahoney</dc:creator>
		<pubDate>Fri, 25 Aug 2006 21:23:31 +0000</pubDate>
		<guid isPermaLink="false">/2006/08/24/the-hutter-prize/#comment-293</guid>
		<description>It is true today that data compression programs are not very smart, but I think that will have to change to make progress in the Hutter competition.  English has a lot of redundancy that is difficult to exploit.  For example, it is hard for a program to know that &quot;roses are red&quot; is more likely than &quot;roses are green&quot; or even &quot;roses red are&quot;.  A compressor that can learn syntax and semantics from text and apply that knowledge to predict future text will do better than one that can&#039;t.

Also, I believe Shannon estimated the entropy of written English to be about 0.6 to 1.3 bits per character.  I got those numbers by eyeballing a graph in Shannon&#039;s original paper, &quot;Prediction and Entropy of Printed English&quot;, Bell Sys. Tech. J. 30, pp. 50-64, 1951.  The paper is not online as far as I know.  I don&#039;t know where Wikipedia got the numbers 1.1 to 1.6 bpc.  Cover and King (1978) used a text prediction gambling game to estimate the upper bound at 1.3 bpc.</description>
		<content:encoded><![CDATA[<p>It is true today that data compression programs are not very smart, but I think that will have to change to make progress in the Hutter competition.  English has a lot of redundancy that is difficult to exploit.  For example, it is hard for a program to know that &#8220;roses are red&#8221; is more likely than &#8220;roses are green&#8221; or even &#8220;roses red are&#8221;.  A compressor that can learn syntax and semantics from text and apply that knowledge to predict future text will do better than one that can&#8217;t.</p>
<p>Also, I believe Shannon estimated the entropy of written English to be about 0.6 to 1.3 bits per character.  I got those numbers by eyeballing a graph in Shannon&#8217;s original paper, &#8220;Prediction and Entropy of Printed English&#8221;, Bell Sys. Tech. J. 30, pp. 50-64, 1951.  The paper is not online as far as I know.  I don&#8217;t know where Wikipedia got the numbers 1.1 to 1.6 bpc.  Cover and King (1978) used a text prediction gambling game to estimate the upper bound at 1.3 bpc.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

