<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mark Nelson &#187; Computer Science</title>
	<atom:link href="http://marknelson.us/category/computer-science/feed/" rel="self" type="application/rss+xml" />
	<link>http://marknelson.us</link>
	<description>Programming, mostly.</description>
	<lastBuildDate>Wed, 01 Feb 2012 16:36:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>A Visit With Tim Bell</title>
		<link>http://marknelson.us/2012/01/21/a-visit-with-tim-bell/</link>
		<comments>http://marknelson.us/2012/01/21/a-visit-with-tim-bell/#comments</comments>
		<pubDate>Sun, 22 Jan 2012 02:22:50 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[People]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=1407</guid>
		<description><![CDATA[I was in Christchurch, New Zealand, recently and had a chance to meet Tim for the first time in person. Tim teaches at the <a href="http://www.canterbury.ac.nz/" class="newpage">University of Canterbury in Christchurch</a>, and is <a href="http://www.cosc.canterbury.ac.nz/tim.bell/" class="newpage">Deputy Head of the Computer Science and Software Engineering</a> department. I got a chance to ask him about his work in data compression as well as one of his new areas of interest, Computer Science education.]]></description>
			<content:encoded><![CDATA[<p><img src="/attachments/2012/bell/TimBell2.jpg" alt="Dr. Timothy Bell" align="right" style="margin-left:15px;border-style:solid;border-width:2px"><br />
In my early years of learning about data compression, the book <a href="http://books.google.com/books/about/Text_compression.html?id=sdZQAAAAMAAJ" class="newpage">Text Compression</a> by Timothy Bell, John Cleary, and Ian Witten was my resource of first resort. I was in Christchurch, New Zealand, recently and had a chance to meet Tim for the first time in person. Tim teaches at the <a href="http://www.canterbury.ac.nz/" class="newpage">University of Canterbury in Christchurch</a>, and is <a href="http://www.cosc.canterbury.ac.nz/tim.bell/" class="newpage">Deputy Head of the Computer Science and Software Engineering</a> department. I got a chance to ask him about his work in data compression as well as one of his new areas of interest, Computer Science education.<br />
<span id="more-1407"></span></p>
<hr/>
<p>MN: Tim, it seems like there has been a lot of interest in data compression in the Antipodes. Names that come to mind include you, John Cleary, and Peter Fenwick in New Zealand, and Ross Williams in Australia. Is this just coincidence, or is compression in the air down there?</p>
<p>TB: I’ve sometimes wondered about this myself&#8230; during the early days of computing and especially personal computers, it took some time for the latest technology to reach us “down under”, so perhaps we were motivated to get more out of what we had rather than wait some months for a larger disk or new memory to arrive from overseas. When the Internet arrived we started with a very small pipe, so a good compression algorithm could do the equivalent of laying a second cable from NZ to the US – who can resist getting something for free?</p>
<p>MN: Since you wrote Text Compression back in the early 90s, I&#8217;d say the biggest development in lossless compression has been the Burrows-Wheeler transform. Is lossless text compression basically done? Are we left with just incremental improvements as processor resources increase?</p>
<p>TB: That seems to be the case; the only big improvements we’ve seen have turned out to be frauds &#8212; we even had one in NZ recently, where a Nelson man raised NZ$5.3 million for an impressive sounding method; he was <a href="http://www.stuff.co.nz/nelson-mail/news/3892853/Whitley-found-guilty-of-fraud" class="newpage">convicted of fraud</a> last year. The main indicator we have that we’re running out of steam (apart from a lack of new discoveries) is Shannon’s experiments on predicting text which gave a bound in the order of 1 bit per character for English text, and current methods are approaching this. Of course, there’s plenty of room for dealing with new kinds of data (for example, bioinformatics deals with massive amounts of data that we’re still trying to understand) and for finding better data structures and algorithms for performing the compression and decompression. Lossy compression is a whole different story&#8230;</p>
<h4>A Change In Focus</h4>
<p>MN: It looks like you are now dedicating a large amount of your time to establishing computer science as part of the basic curriculum in high school education, for students in the 15-18 age range. In many ways, this is as much a bureaucratic problem as an academic one. What motivated you to take it on?</p>
<p>TB: It’s been a problem that we’ve complained about for decades, and it’s been getting worse and worse as computing in schools has focussed increasingly on using computers and not preparing students to be developers. A lot of this can be attributed to bureaucracy – it’s hard to explain to government officials that putting word processors in every classroom isn’t the same as building a computationally literate society. As a result of some strategic lobbying done by others, a small window of opportunity opened for me to be on a group to advise our Ministry of Education, just over 3 years ago. The group managed to convince the officials that something useful could be done, and then we had to work very quickly to come up with a concrete proposal before the enthusiasm died down.  This has happened rapidly; the advisory group first met in November 2008, and Computer Science started being taught in schools in February 2011.</p>
<p>MN: What have you been able to accomplish in New Zealand so far?</p>
<p>TB: Computer science (including programming, but also topics that involve understanding the importance of things like algorithms, HCI, programming languages and even compression) is currently available as part of computing courses for two of the three final years of our main high school graduation qualification, with all three years being covered from 2013. After that we would expect some of the introductory material to start filtering down to earlier classes, and for wider offerings as teachers become more confident in the subject. One of the biggest challenges has been preparing teachers, few of whom have significant experience in Computer Science. Many have embraced it enthusiastically, and the universities and others have done a lot of work to help them get up to speed. It’s been a wild ride doing it so quickly, but there have been some very pleasing outcomes.</p>
<p>MN: And how do things look in the rest of the world? Are there any obvious winners and losers at this point? Do you have any concise advice for the world?</p>
<p>TB: Computing in schools is a hot topic around the world; the UK have just announced a strong drive to introduce this sort of material to schools, and the US has people working hard to make it available to students. Israel and Korea have had computer science in schools for some time. We’re learning a lot about what is worth teaching, and what the best pedagogy is for the general classroom (most of our experience is for specialist students who have chosen the subject!) The New Zealand path of getting something going quickly with grass-roots support seems to be more effective than waiting for a top-down approach which could take years to develop and prepare teachers for, although it does make for a bumpy ride as problems are ironed out as we go along!</p>
<p>MN: This might be straying out of your area a bit, but do you see CS in a K-12 education setting having an effect on the representation of women in the STEM fields?</p>
<p>TB: Attitudes that affect representation definitely start at school, and to me the biggest goal of teaching CS in high school is not so much to prepare students for further study, but to enable them to find out what the subject is! School students rarely know what CS is, and even worse, it’s common for them to assume that it must be advanced word processing or some other dull area, and hence they avoid it. It’s particularly important for female students to have the opportunity to find out if it’s something that they might be good at, as the stereotypes associated with computing can make them assume that they shouldn’t consider it as a career.</p>
<p>MN: One final question, Tim. The whole world has seen the devastating damage Christchurch has suffered from the earthquakes in the last year. How has the University of Canterbury held up? Have you managed to maintain continuity in your academic calendar?</p>
<p>TB: It’s been quite a year! Thankfully our university has escaped the brunt of the earthquakes (most of the damage is some distance from the university), and we’ve managed to keep a full programme going despite being closed for three weeks for safety checks. Many students joined the “student volunteer army”, who helped with the cleanup in the damaged parts of town, and that was probably one of the most valuable experiences of their career! It hasn’t been without disruption as buildings need to be checked carefully, and some are still under repair, but with a bit of resourcefulness we managed to keep going (for a while I even delivered my classes in a restaurant while lecture theatres were being inspected). The city is now going through a massive program of redevelopment with some pretty creative ideas, and it’s an exciting time to be part of these changes.</p>
<hr/>
<p>
<img src="/attachments/2012/bell/New_Zealand.png" alt="New Zealand" align="left" style="margin-right:15px;border-style:solid;border-width:2px">Thanks to Dr. Bell for taking the time to share all this with us. My visit to his amazing homeland was a real treat, and the short time I got to spend with Tim in Christchurch was worth the trip all in itself.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2012/01/21/a-visit-with-tim-bell/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>GCC Tries to Help &#8211; With Mixed Results</title>
		<link>http://marknelson.us/2011/08/15/gcc-tries-to-help-with-mixed-results/</link>
		<comments>http://marknelson.us/2011/08/15/gcc-tries-to-help-with-mixed-results/#comments</comments>
		<pubDate>Mon, 15 Aug 2011 11:38:49 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[C/C++]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Work]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=589</guid>
		<description><![CDATA[One of the great things about C, and even more so for C++, is its strong type checking mechanisms. In general a lot of bugs are caught at compile time, and experienced programmers are able to recognize and fix these types of errors quickly. Unfortunately, there are plenty of places in any C program where [...]]]></description>
			<content:encoded><![CDATA[<p>One of the great things about C, and even more so for C++, is its strong type checking mechanisms. In general a lot of bugs are caught at compile time, and experienced programmers are able to recognize and fix these types of errors quickly.</p>
<p>Unfortunately, there are plenty of places in any C program where type checking disappears. The GNU C compiler, gcc, goes above and beyond the call of duty to help in these areas. But sometimes that help can be a can of worms all of its own.<br />
<span id="more-589"></span></p>
<h4>An Example of Type Safety MIA</h4>
<p>A classic example of type safety gone into hiding is found with the formatted I/O functions in C: <code>s/f/printf()</code> and <code>s/f/scanf()</code>. These function families take a format string that defines what types of arguments they expect to read or write, followed by a variable-length list of arguments. The arguments passed in must match the format string precisely in type and quantity. The problem is that the traditional compiler, which is concerned with the language, doesn&#8217;t really know how to check that argument list.</p>
<p>Over the years this has led to many spectacular bugs, when an original programmer or someone doing maintenance inserts a mismatch:</p>
<pre>
    int parse_user_name( char *name, size_t namel )
    {
        printf( "Please provide a name:" );
        scanf( "%s", namel );  /* bug: passes the length, not the buffer */
    }
</pre>
<p>In some cases, such as the simple one character typo shown above, you now have a system that can be easily exploited using off-the-shelf tools for code injection. Very bad stuff.</p>
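<p>For contrast, here is a minimal sketch of a bounded read that avoids the problem. It is not taken from any real program: the name <code>read_user_name</code> and the <code>FILE *</code> parameter are assumptions made for illustration, and it swaps <code>scanf()</code> for <code>fgets()</code>, which takes the buffer length explicitly:</p>
<pre>
#include &lt;limits.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Hypothetical helper: reads one line into name, never writing more than
   namel bytes. Returns 0 on success, -1 on error or end of input. */
int read_user_name( FILE *in, char *name, size_t namel )
{
    /* fgets() takes an int, so clamp very large buffer sizes */
    int limit = namel &gt; INT_MAX ? INT_MAX : (int) namel;
    if ( limit == 0 || fgets( name, limit, in ) == NULL )
        return -1;
    name[ strcspn( name, "\n" ) ] = '\0';  /* drop the trailing newline */
    return 0;
}
</pre>
<p>A caller would typically pass <code>stdin</code> and <code>sizeof</code> the destination buffer, so the length and the buffer can no longer be confused with each other in a format string.</p>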
<h4>GNU to the Rescue</h4>
<p>Traditional C compilers have let this kind of bad code go by without a second glance. Back in the day, the responsibility for this deeper type of checking rested with a useful program called <a href="http://en.wikipedia.org/wiki/Lint_(software)" class="newpage">lint</a>. But as the C standard tightened up over the years, fewer people saw a need for lint, and routine checks for these types of problems often disappeared from the build process.</p>
<p>Somewhere along the way (I first started noticing in the 4.x era), the GNU compiler people began inserting some proactive code to flag errors that once were the province of lint. Just as an example, this short piece of broken code compiles with no warnings when using default settings for gcc 3.4, 4.1, and the Sun C compiler v5.7:</p>
<pre>
#include &lt;stdio.h&gt;

int main()
{
    printf( "%s\n", 1.2 );
    return 0;
}
</pre>
<p>But gcc 4.5.2 correctly sees a big problem, and issues a warning:</p>
<pre>
gcc.c: In function 'main':
gcc.c:5:5: warning: format '%s' expects type 'char *', but argument 2 has type 'double'
</pre>
<p>Since it is valid C code, the compilation proceeds, but at this point it is caveat emptor.</p>
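<p>The same check can be extended to your own printf-style wrappers. As a hedged sketch (the function name <code>format_message</code> is made up for this example), gcc&#8217;s <code>format</code> function attribute tells the compiler which parameter holds the format string and where the checked arguments begin:</p>
<pre>
#include &lt;stdarg.h&gt;
#include &lt;stdio.h&gt;

/* Hypothetical wrapper around vsnprintf(). The attribute is a GCC
   extension: parameter 3 is the format string, and checking of the
   variable arguments starts at parameter 4. */
#if defined(__GNUC__)
__attribute__(( format( printf, 3, 4 ) ))
#endif
int format_message( char *buf, size_t size, const char *fmt, ... )
{
    va_list ap;
    va_start( ap, fmt );
    int n = vsnprintf( buf, size, fmt, ap );
    va_end( ap );
    return n;
}
</pre>
<p>With this in place, a call such as <code>format_message( buf, sizeof buf, "%s\n", 1.2 )</code> should draw the same kind of warning as the <code>printf()</code> call above whenever format checking is enabled.</p>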
<h4>How Far Do You Go?</h4>
<p>I ran into trouble with this GCC feature when working up an in-class demonstration of redirected I/O on a Linux system. I hypothesized that the early, paleolithic code to redirect standard output for a child process might have looked something like this:</p>
<pre>
#include &lt;unistd.h&gt;
#include &lt;stdio.h&gt;
#include &lt;fcntl.h&gt;

int main (int argc, char *argv[]) {
  if ( argc &lt;= 2 ) {
    printf( "Usage: test command file\n"
            "The command will be executed "
            "with stdout going to file\n" );
    return -1;
  }
  if (fork() == 0) {
    close(STDOUT_FILENO);
    open(argv[2], O_WRONLY | O_CREAT, 0744);
    printf( "About to execute %s\n", argv[1] );
    fflush( stdout );
    execl(argv[1], 0);
    printf("I will never be called\n");
  }
  sleep ( 2 );
  printf("Execution continues in"
         " the parent process\n");
  return 0;
}
</pre>
<p>This code does just what we want when executed with a command like: <code>./a.out /bin/ls output.txt</code>. The <code>ls</code> command is executed, and its standard output is redirected to the file output.txt.</p>
<p>However, several of my students reported that they were getting mysterious compiler warnings on this same code. Sure enough, when compiling with gcc 4.5, we get this:</p>
<pre>
exec.c: In function 'main':
exec.c:17:6: warning: not enough variable arguments to fit a sentinel
</pre>
<p>For beginning C programmers, this kind of warning can be a real time sink. The message uses terminology that is compiler-specific, it doesn&#8217;t provide a detailed description of exactly what it doesn&#8217;t like, and it doesn&#8217;t recommend a solution. Google searches for this specific warning do lead to a solution, but they demand a bit of work from the novice.</p>
<p>The first lesson my students picked up is that you need at least one more argument to this function call. While it is true that people normally call <code>execl()</code> with <code>arg0</code> set to the filename, this is by no means required. And in fact, most programs will work just fine with an argument count of 0. There is really no good way to know which programs require <code>arg0</code>, or further args, without reading the documentation. In any case, gcc is pushing things a bit by insisting on this.</p>
<p>So changing the line of code that calls exec to:</p>
<pre>
    execl(argv[1], argv[1], 0);
</pre>
<p>should fix the problem, right?</p>
<p>Wrong. This reconfiguration gets a slightly different warning:</p>
<pre>
exec.c: In function 'main':
exec.c:17:6: warning: missing sentinel in function call
</pre>
<p>As it turns out, gcc really wants you to cast the literal value 0 to a pointer:</p>
<pre>
    execl(argv[1], argv[1], (char*) 0);
</pre>
<p>That finally makes the warning go away.</p>
<p>This is a bit annoying, because both my C and C++ standards clearly say that a constant integral value of 0 can convert to any pointer type without any fuss. So why should I have to cast it to a <code>char *</code> before passing it to execl?</p>
<p>Well, it won&#8217;t make a difference on any of the computers I use, but if you are on a computer where an integral type is passed on the stack with fewer bytes than a pointer, you may run into trouble. <code>execl()</code> is not smart enough to know that the 0 you are passing is actually terminating the argument list &#8211; it may just be an integer &#8211; so an integer is what is pushed on the stack.</p>
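<p>To see the mechanism from the receiving side, here is a small sketch of a function that walks a null-terminated argument list much as <code>execl()</code> has to. The name <code>count_strings</code> is invented for this example. The <code>sentinel</code> attribute is a GCC extension which, according to the gcc documentation, the compiler applies automatically to <code>execl()</code>:</p>
<pre>
#include &lt;stdarg.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical helper: counts the strings passed before the terminating
   null pointer. The attribute asks gcc to check that every call ends
   with a null pointer. */
#if defined(__GNUC__)
__attribute__(( sentinel ))
#endif
size_t count_strings( const char *first, ... )
{
    size_t n = 0;
    va_list ap;
    va_start( ap, first );
    /* Each va_arg() reads a pointer-sized value. A plain int 0 in the
       caller's argument list may not fill that much space. */
    for ( const char *s = first; s != NULL; s = va_arg( ap, char * ) )
        n++;
    va_end( ap );
    return n;
}
</pre>
<p>A call like <code>count_strings( "ls", "-l", (char *) 0 )</code> returns 2. Dropping the cast leaves the function reading a pointer where the caller only promised an integer, which is exactly the mismatch the warning is about.</p>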
<h4>Confusion</h4>
<p>So yes, gcc&#8217;s lint-like characteristics made my lecture a bit more complicated. I had to explain a few things that I had hoped to gloss over. But on the other hand, the casting of 0 to a pointer type is actually a pretty important thing, and I&#8217;m glad gcc reminded me of it.</p>
<p>Requiring a value of <code>arg0</code> when using <code>execl()</code> seems to me less like something that requires reminding, but nonetheless, it certainly won&#8217;t do any harm to follow this advice.</p>
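<p>Putting both pieces of advice together, a somewhat more defensive version of the demonstration might look like the sketch below. This is an illustration under assumptions rather than the code used in class: the function name <code>run_redirected</code> is hypothetical, <code>dup2()</code> replaces the close-then-open trick, the output file is truncated and created with mode 0644, and exit status 127 is simply a conventional choice for "could not run the command":</p>
<pre>
#include &lt;fcntl.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

/* Hypothetical helper: runs path with stdout sent to outfile.
   Returns the child's exit status, or -1 if it could not be run
   or did not exit normally. */
int run_redirected( const char *path, const char *outfile )
{
    pid_t pid = fork();
    if ( pid &lt; 0 )
        return -1;
    if ( pid == 0 ) {
        int fd = open( outfile, O_WRONLY | O_CREAT | O_TRUNC, 0644 );
        if ( fd &lt; 0 || dup2( fd, STDOUT_FILENO ) &lt; 0 )
            _exit( 127 );
        if ( fd != STDOUT_FILENO )
            close( fd );
        /* arg0 is supplied, and the list ends with a real null pointer */
        execl( path, path, (char *) 0 );
        _exit( 127 );  /* only reached if execl() fails */
    }
    int status;
    if ( waitpid( pid, &amp;status, 0 ) &lt; 0 )
        return -1;
    return WIFEXITED( status ) ? WEXITSTATUS( status ) : -1;
}
</pre>
<p>Waiting on the child with <code>waitpid()</code> also removes the need for the <code>sleep()</code> call in the original demonstration.</p>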
<p>C++ does a good job of inserting type safety into places where it hasn&#8217;t been seen before. For example, you can do all your I/O in C++ with some assurance that you are not making those traditional mistakes seen with <code>scanf</code> or <code>printf</code>. However, the C++ standard library is not likely to tackle the APIs for Linux or Unix system calls, meaning we should be appreciative of the help we get from gcc.</p>
<p>So to whoever wrote the code that checks for sentinels in gcc, I thank you.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2011/08/15/gcc-tries-to-help-with-mixed-results/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Sapir-Whorf to Dijkstra to Torvalds &#8211; Language Bigotry In Our Time</title>
		<link>http://marknelson.us/2011/06/14/sapir-whorf-to-dijkstra-to-torvalds-language-bigotry-in-our-time/</link>
		<comments>http://marknelson.us/2011/06/14/sapir-whorf-to-dijkstra-to-torvalds-language-bigotry-in-our-time/#comments</comments>
		<pubDate>Tue, 14 Jun 2011 20:29:09 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[C/C++]]></category>
		<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[People]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=439</guid>
		<description><![CDATA[Back in the day the Sapir-Whorf hypothesis was all the rage in the study of linguistics. With apologies to those who actually work in the field, I&#8217;ll crudely summarize it as the idea that the language you speak both constrains and influences how you think. The idea says that if your language only has one [...]]]></description>
			<content:encoded><![CDATA[<p>Back in the day the Sapir-Whorf hypothesis was all the rage in the study of linguistics. With apologies to those who actually work in the field, I&#8217;ll crudely summarize it as the idea that the language you speak both constrains and influences how you think. The idea says that if your language only has one word for <i>snow</i>, for example, you will actually have a hard time seeing any difference between light powder and crunchy ice pack.</p>
<p><a href="http://en.wikipedia.org/wiki/Sapir-whorf" class="newpage">Sapir-Whorf</a> was seen as completely discredited back when I learned about it, and while Linguistic Relativity has enjoyed a slight comeback with a weakly restated set of hypotheses, it seems fairly certain that human thought is by no means confined to a cage built out of vocabulary and grammar.</p>
<p>Our field has long had its own Sapir in E.W. Dijkstra, who <a href="http://www.cs.utexas.edu/~EWD/transcriptions/EWD04xx/EWD498.html" class="newpage">spelled it</a> out with money quotes like these:<br />
<span id="more-439"></span></p>
<blockquote><p>It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.<br />
The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offence.
</p></blockquote>
<p>Closer to today, we have the famous <a href="http://lwn.net/Articles/249460/" class="newpage">rant against C++</a> from Linus Torvalds, who feels that a programmer who uses C++ is going to wreck any project he or she touches:</p>
<blockquote><p>I&#8217;ve come to the conclusion that any programmer that would prefer the project to be in C++ over C is likely a programmer that I really *would* prefer to piss off, so that he doesn&#8217;t come and screw up any project I&#8217;m involved with.</p>
<p>C++ leads to really really bad design choices.</p></blockquote>
<h4>Dogmatism à la Torvalds</h4>
<p>At the large corporate entity that pays my bills, we have a slogan on the back of our badges: <i>No Technology Religion</i>. It might not be easy to live up to this, but yes, I try. </p>
<p>To me, this admonishment means two things:</p>
<ol>
<li/>Try to objectively choose the best tool for the job
<li/>Don&#8217;t let your preferences in tools dictate the way the job should be done
</ol>
<p>Linus is clearly saying in his rant that anyone who programs in C++ is guilty of breaking both of these rules.</p>
<p>I disagree. I think there are times when C++ is clearly the right tool for the job, and that you can arrive at this conclusion fairly objectively. I think Linus is clearly fogged in by his particular Technology Religion.</p>
<h4>A Simple Example</h4>
<p>As the final assignment for my C/C++ programming class last semester, I asked my students to implement a simple token counting program in C. The goal was to reproduce the behavior given by this C++ fragment:</p>
<pre>
map&lt;string,int&gt; counts;
string s;
while ( cin &gt;&gt; s )
    counts[s]++;
for ( auto ii = counts.begin() ; ii != counts.end() ; ii++ )
    cout &lt;&lt; ii-&gt;second &lt;&lt; " : " &lt;&lt; ii-&gt;first &lt;&lt; endl;
</pre>
<p>This particular program highlights a number of features of C++ that are not present in C:</p>
<ul>
<li/>The versatile replacement for C arrays, vector&lt;T&gt;.
<li/>The string class.
<li/>Safe input using iostreams.
<li/>Associative arrays as part of the standard library.
</ul>
<p>This program is quite easy to write in C++, and is basically complete. One could flesh it out with a bit of error handling on the input stream, but that&#8217;s really not even necessary.</p>
<h4>Do it in C</h4>
<p>Rewriting this in C is a straightforward task with one big speed bump: the lack of any sort of associative array in the library. There are a number of ways to deal with this &#8211; I chose the following strategy:</p>
<ul>
<li/>Read all the tokens into an array.
<li/>Upon completion, sort the array.
<li/>Once the array is sorted, walk through it to get the count for each token.
</ul>
<p>While this algorithm uses more space than the C++ program, it probably takes up the same amount of time, assuming you don&#8217;t bump into one of the pathologically bad cases of qsort().</p>
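<p>As a rough illustration of that strategy, here is one way the three steps could be written. This is a sketch and not the version handed out in class: the names <code>read_token</code>, <code>compare_tokens</code>, and <code>count_tokens</code> are hypothetical, and an allocation failure while reading a token is simply treated as end of input:</p>
<pre>
#include &lt;ctype.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

/* Reads one whitespace-delimited token of any length. Returns NULL at
   end of input or if memory runs out. The caller frees the result. */
static char *read_token( FILE *in )
{
    int c;
    while ( ( c = fgetc( in ) ) != EOF &amp;&amp; isspace( c ) )
        ;
    if ( c == EOF )
        return NULL;
    size_t len = 0, cap = 16;
    char *tok = malloc( cap );
    if ( tok == NULL )
        return NULL;
    do {
        if ( len + 1 == cap ) {  /* keep room for the terminator */
            cap *= 2;
            char *bigger = realloc( tok, cap );
            if ( bigger == NULL ) { free( tok ); return NULL; }
            tok = bigger;
        }
        tok[ len++ ] = (char) c;
    } while ( ( c = fgetc( in ) ) != EOF &amp;&amp; !isspace( c ) );
    tok[ len ] = '\0';
    return tok;
}

/* qsort() hands us pointers to the array elements, which are char * */
static int compare_tokens( const void *a, const void *b )
{
    return strcmp( *(char *const *) a, *(char *const *) b );
}

/* Writes "count : token" lines to out. Returns 0, or -1 if the token
   array could not be grown. */
int count_tokens( FILE *in, FILE *out )
{
    size_t n = 0, cap = 0;
    char **tokens = NULL;
    char *tok;
    int rc = 0;

    /* Step 1: read all the tokens into a growing array */
    while ( ( tok = read_token( in ) ) != NULL ) {
        if ( n == cap ) {
            size_t new_cap = cap ? cap * 2 : 64;
            char **bigger = realloc( tokens, new_cap * sizeof *bigger );
            if ( bigger == NULL ) { free( tok ); rc = -1; break; }
            tokens = bigger;
            cap = new_cap;
        }
        tokens[ n++ ] = tok;
    }
    if ( rc == 0 ) {
        /* Step 2: sort, so that equal tokens become adjacent */
        if ( n &gt; 0 )
            qsort( tokens, n, sizeof *tokens, compare_tokens );
        /* Step 3: walk the array, counting each run of equal tokens */
        for ( size_t i = 0; i &lt; n; ) {
            size_t j = i + 1;
            while ( j &lt; n &amp;&amp; strcmp( tokens[ i ], tokens[ j ] ) == 0 )
                j++;
            fprintf( out, "%lu : %s\n", (unsigned long) ( j - i ), tokens[ i ] );
            i = j;
        }
    }
    for ( size_t i = 0; i &lt; n; i++ )
        free( tokens[ i ] );
    free( tokens );
    return rc;
}
</pre>
<p>Calling <code>count_tokens( stdin, stdout )</code> from <code>main()</code> gives output in the same "count : token" shape as the C++ fragment, at roughly ten times the line count.</p>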
<p>Since I asserted that this program is easier to write in C++ than C, it behooves me to give a list of reasons why. </p>
<ul>
<li/>C I/O deficiencies. Reading strings in C is considerably more difficult due to the fact that the C I/O library doesn&#8217;t have a standard way to read strings of unbounded length. (Compiler-specific extensions can be used, but that raises other problems.) Your input code has to do a lot of checking for error cases, or you have to build your own string input functions.
<li/>Memory management of C arrays is a very manual task. I have to allocate the original space for my array, take care that I reallocate if I exceed its length, and free the space when I am done.
<li/>Memory management of C strings has exactly the same problems.
<li/>Sorting the array of strings is just a tiny bit more inconvenient with qsort(), and qsort() doesn&#8217;t give me the performance guarantees of the sort() function in the C++ library.
</ul>
<p>The C version of my function has more lines of code, and more bookkeeping tasks that need to be done manually. There are more opportunities to make mistakes.</p>
<p>A final reason I like the C++ version of the program better is that it lends itself well to working with other types. Any type that has insertion and extraction operators, and a comparison operator, can use that same code with just one declaration change. Turning it into a function template accomplishes the same thing with no code changes needed at all.</p>
<h4>Some of My Best Friends are C Programmers</h4>
<p>So am I a language bigot for preferring the C++ version of this code?</p>
<p>I hope not. For one thing, I can see that the C version of the program has some nice advantages:</p>
<ul>
<li/>You can write this program using POSIX system calls for almost everything except memory allocation and sorting, resulting in an extremely small footprint.
<li/>The C version of the program will be faster due to the use of low-level I/O. C++ iostreams get better all the time, but their layered approach will always be at a disadvantage when it comes to efficiency.
</ul>
<p>So for a program like this, the choice of language really comes down to context. If you believe in the 80/20 rule, you might think this code should be written in C++ if it is outside of the expensive core part of your program. With fewer lines of code you have fewer chances for error, and efficiency is probably not a big consideration.</p>
<p>If this is in a critical section of code that is executed frequently, you might decide C is your best choice. Make sure to put a little extra time into code review to ensure that the code is free of memory leaks and pointer errors, and you are in business.</p>
<h4>Sic Temper Linus</h4>
<p>So how does Linus&#8217; rant hold up when looking at the C++ code shown at the top of the post? I would venture a guess that given the assignment, any decent C++ programmer would produce code similar to this. Linus says:</p>
<blockquote><p>You invariably start using the &#8220;nice&#8221; library features of the language like STL and Boost and other total and utter crap, that may &#8220;help&#8221; you program, but causes infinite amounts of pain when they don&#8217;t work (and anybody who tells me that STL and especially Boost are stable and portable is just so full of BS that it&#8217;s not even funny</p></blockquote>
<p>In this program I make good use of the standard library components that were once part of the STL. Having been part of the standard for over a decade, they work really well and have no portability or correctness issues in any compiler I am aware of. Saying that components like map and vector are problematic is just wrong.</p>
<blockquote><p>inefficient abstracted programming models where two years down the road you notice that some abstraction wasn&#8217;t very efficient, but now all your code depends on all the nice object models around it, and you cannot fix it without rewriting your app.</p></blockquote>
<p>Well, despite the fact that C++ has made it impossible for me to do so, I managed to write the program without using any abstractions &#8211; no new classes, no interfaces. Basically just straight procedural C code that happens to employ a few useful classes.</p>
<p>And at least with the people I work with, I think this is the rule rather than the exception.</p>
<blockquote><p>
In other words, the only way to do good, efficient, and system-level and portable C++ ends up to limit yourself to all the things that are basically available in C.
</p></blockquote>
<p>When C has container classes, a string class, typesafe I/O, and the programmer&#8217;s gift from the gods, <a href="http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization" class="newpage">RAII</a>, then this statement will be true. For now, it is bollocks.</p>
<p>Before modern C++ was available, I probably would have stuck with a simple pipeline to accomplish this task:</p>
<pre>
tr '[:blank:]' '\n' | grep -v "^$" | sort | uniq -c
</pre>
<p>The fact that I can do the same thing just as easily in a compiled language gives me some flexibility. I think I can appreciate that fact without being a bigot.</p>
<p>Can you?</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2011/06/14/sapir-whorf-to-dijkstra-to-torvalds-language-bigotry-in-our-time/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Euler Mania</title>
		<link>http://marknelson.us/2011/04/10/euler-mania/</link>
		<comments>http://marknelson.us/2011/04/10/euler-mania/#comments</comments>
		<pubDate>Sun, 10 Apr 2011 21:46:51 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Puzzles]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=390</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/04/10/euler-mania/' addthis:title='Euler Mania' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>If I had been drawing a paycheck for every hour I spent working on Project Euler&#8217;s problem 328, I think my summer vacation would already be paid for. But instead, after a long ten days or so of distraction, I&#8217;ll have to settle for the satisfaction of being number 38 or 39 to solve it. [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/04/10/euler-mania/' addthis:title='Euler Mania' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>If I had been drawing a paycheck for every hour I spent working on Project Euler&#8217;s <a href="http://projecteuler.net/index.php?section=problems&#038;id=328" class="newpage">problem 328</a>, I think my summer vacation would already be paid for. But instead, after a long ten days or so of distraction, I&#8217;ll have to settle for the satisfaction of being number  38 or 39 to solve it.<br />
<span id="more-390"></span><br />
In case you haven&#8217;t visited <a href="http://projecteuler.net/" class="newpage">Project Euler</a>, it is a site dedicated to &#8220;challenging mathematical/computer programming problems&#8221;. A prototypical Project Euler challenge is one with a simple definition that is easy to solve for simple cases, but requires some ingenuity to scale up to the requirements given in the problem.</p>
<h4>You&#8217;re Getting Warmer</h4>
<p>Problem 328 is a classic example of the genre. The basic setup is that of a number guessing game. Knowing that a number is some integer between 1 and N, your job is to make successive guesses until you get the answer. Each guess is answered with one of three conditions: low, high, or a match.</p>
<p>The twist in this problem is that your job is not to minimize the number of guesses you have to make, but rather, to minimize the sum of the guesses. For any range 1 to N, you have to select a path that minimizes the worst case cost.</p>
<p>As an example, if I was going to guess a number between 1 and 10, the best of the worst case strategies yields a value of 16. I get this score if the hidden number is either 8, 9, or 10. My first guess is 7, and in the worst case, my second guess of 9 nails it down to either 8, 9, or 10. Of course if the number is 6 or less I&#8217;ll get a lower score.  Change the first guess to some other number, and you will always have a case which results in a score of greater than 16.</p>
<p>Figure 1 shows what the choice graph looks like when choosing a number between 1 and 20. The first number in each node is the guess. When a  second number is present, it represents the accumulated cost at that point, working up from the leaf nodes. The choice at the top of the graph, 13, shows the cost for that problem: 49.</p>
<table border="0" align="center">
<tr>
<td><image src="/attachments/2011/eulermania/Figure01.png"></td>
</tr>
<tr>
<td><center>Figure 1<br/>Choosing a number between 1 and 20</center></td>
</tr>
</table>
<h4>The Naive Approach</h4>
<p>Solving this problem in small cases is nice and easy. Using a recursive formulation you can implement it in a single screenful of code. My test implementation in C++ is shown here &#8211; it calculates both the optimal first choice and the cost for a given range. By calling itself recursively, the problem solution is tidy and compact:</p>
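<pre>
#include &lt;algorithm&gt;  // max
#include &lt;climits&gt;    // INT_MAX
#include &lt;utility&gt;    // pair
using namespace std;

// Returns (best first guess, worst-case cost) for the range low..high.
pair&lt;int,int&gt; get_best_path( int low, int high )
{
    // Zero or one candidate: nothing left to guess, so no cost.
    if ( low &gt;= high )
        return pair&lt;int,int&gt;( low, 0 );
    // Two candidates: guess the lower one; a miss identifies the other.
    if ( low == ( high - 1 ) )
        return pair&lt;int,int&gt;( low, low );
    // Three candidates: guess the middle one.
    if ( low == ( high - 2 ) )
        return pair&lt;int,int&gt;( low + 1, low + 1 );
    int best_cost = INT_MAX;
    int best_choice = -1;
    // Try every interior guess; its worst case is the costlier subrange.
    for ( int choice = low + 1 ; choice &lt; high ; choice++ ) {
        int cost = choice + max( get_best_path( low, choice - 1 ).second,
                                 get_best_path( choice + 1, high ).second );
        if ( cost &lt; best_cost ) {
            best_cost = cost;
            best_choice = choice;
        }
    }
    return pair&lt;int,int&gt;( best_choice, best_cost );
}
</pre>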
<p>Like most recursive routines, it bails out early with one of three base cases which have trivial solutions. For all non-trivial solutions, the routine simply iterates through all possible guesses, calculating the cost of that choice and using recursion to calculate the cost of the two subproblems it creates.</p>
<p>Although this algorithm is simple and has a certain elegance, it has one big problem. A little examination will show that the runtime of this routine is asymptotically proportional to k<sup>N</sup>. Running on my desktop Linux system I was able to calculate best choices pretty quickly when N was under 30, but after that the runtime started ramping up drastically.</p>
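<p>One way to watch that growth for yourself is to count the recursive calls. The sketch below is my own instrumented variant, not part of the original test program: <code>worst_case_cost</code> is a hypothetical name, it uses the same recursion as <code>get_best_path</code>, returns only the cost, and bumps a counter on every call.</p>

```cpp
#include <algorithm>
#include <climits>
#include <cstdint>

// Hypothetical instrumented variant of the naive recursion: returns the
// worst-case cost for the range low..high and counts every call made.
int worst_case_cost(int low, int high, std::uint64_t& calls)
{
    ++calls;
    if (low >= high)     return 0;        // nothing left to guess
    if (low == high - 1) return low;      // two candidates: guess the lower
    if (low == high - 2) return low + 1;  // three candidates: guess the middle
    int best = INT_MAX;
    for (int choice = low + 1; choice < high; ++choice) {
        // The worst case for this guess is the costlier of the two subranges.
        int cost = choice + std::max(worst_case_cost(low, choice - 1, calls),
                                     worst_case_cost(choice + 1, high, calls));
        best = std::min(best, cost);
    }
    return best;
}
```

<p>Calling it for a range of 1 to 10 should return the cost of 16 described earlier, and comparing the counter for successive values of N shows how quickly the work piles up.</p>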
<h4>Getting There From Here</h4>
<p>Since the solution to this problem requires calculating the best choice for numbers up to 200,000, there is no way that an O(k<sup>N</sup>) algorithm is going to fly. And that, of course, is the essence of a good Project Euler problem. Developing a solution for the simple cases is just the start.</p>
<p>After realizing that the naive solution won&#8217;t do it, you have to start looking at the problem from all angles. Can some optimization reduce it to a tractable polynomial problem? Or do you need a completely different approach? Perhaps the problem has a closed form solution that just requires pumping some numbers into an equation?</p>
<p>Eventually I was able to develop a solution that calculated all 200,000 values in less than a second. And while that would make an interesting post all on its own, it would be the epitome of bad form to spill the beans on a Project Euler solution.</p>
<h4>My Path</h4>
<p>Without giving away the secrets, however, I can tell you what was the most important factor for me in nailing down this problem: visualization.</p>
<p>To try to make some sense out of these paths through the choice tree, I turned to an old friend: <a href="http://www.graphviz.org/" class="newpage">Graphviz</a>. This open source package makes visualization of data structures like binary trees a piece of cake.</p>
<p>Figure 1 is a simple graph created with graphviz. To really see the value of this package, examine the choice tree for N=100 in <a href="/attachments/2011/eulermania/1-100.pdf" class="newpage">PDF format</a> or <a href="/attachments/2011/eulermania/1-100.svg" class="newpage">SVG format</a>, if your browser supports it. I spent a long time inspecting these images, including some that had hundreds of nodes.</p>
<p>For this program, I didn&#8217;t even link to the graphviz library &#8211; I just created text files in the correct format, then made a <code>system()</code> call to the dot compiler program, which creates the graphics files.</p>
<p>For this particular problem, graphviz is what guided me to my solution, and I don&#8217;t know of any other package, free or commercial, that could have done as well. It was the perfect tool for the job.</p>
<h4>Up Next</h4>
<p>Now that I have put Problem 328 to bed, I have my eye on <a href="http://projecteuler.net/index.php?section=problems&#038;id=304" class="newpage">Problem 304</a>, known as Primonacci. This problem requires working with Fibonacci numbers with trillions of digits &#8211; numbers so big that they won&#8217;t fit in RAM on any computer I have access to.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2011/04/10/euler-mania/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Combinatorial Data Compression</title>
		<link>http://marknelson.us/2011/01/09/combinatorial-data-compression/</link>
		<comments>http://marknelson.us/2011/01/09/combinatorial-data-compression/#comments</comments>
		<pubDate>Sun, 09 Jan 2011 23:10:57 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Data Compression]]></category>
		<category><![CDATA[Mathematics]]></category>

		<guid isPermaLink="false">http://marknelson.us/?p=154</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/01/09/combinatorial-data-compression/' addthis:title='Combinatorial Data Compression' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>Newcomers to the world of data compression often stumble on this old idea in hopes of creating a novel and powerful algorithm. In a nutshell, the idea is to create an enumerative coding system that uses combinatorial numbering to identify a message, in hopes of providing a more compact representation . Unfortunately, these schemes always [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2011/01/09/combinatorial-data-compression/' addthis:title='Combinatorial Data Compression' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>Newcomers to the world of data compression often stumble on this old idea in hopes of creating a novel and powerful algorithm. In a nutshell, the idea is to create an enumerative coding system that uses combinatorial numbering to identify a message, in hopes of providing a more compact representation. Unfortunately, these schemes always fail, for reasons that I&#8217;ll lay out in this article.<br />
<span id="more-154"></span></p>
<h4>Combinations</h4>
<p>To design a combinatorial algorithm that will compress files, you can think of the file as a series of integers. Since most of the files that you use are streams of bytes, consider each file to be a sequence of integers with values from 0 to 255.</p>
<p>If you go back to your first classes on probability and statistics, you might remember the definition of a <a href="http://en.wikipedia.org/wiki/Combination" class="newpage">combination</a>. A combination is simply a way of selecting a number of things from a larger set. When you are trying to compress a file of bytes, the natural size of this set is 256.</p>
<p>Probability theory tells us that we can count the number of combinations of a given size using a pretty simple formula. If a set has <em>n</em> elements and we are choosing <em>k</em> at a time, the number of possible combinations is given by the formula n!/(k!*(n-k)!). This number is also known as the <em>binomial coefficient</em>.</p>
<p>Just for a simple example, the number of different ways you can select three bytes out of a set of 256 is 2,763,520. In general, with a large set, most combinations are going to generate very large numbers. The exceptions will be for values of k that are either very small or very close to the size of the set.</p>
<p>Combinations are well ordered, so any instance of the three bytes has a specific number between 0 and 2,763,519. We can call this the combinatorial rank. This means I can identify any three byte sequence by a combination number.</p>
<p>Assuming all combinations are equally likely, we can use an optimal arithmetic coder to encode this number in lg(2,763,520) bits, roughly 21.4. That&#8217;s interesting, because the three bytes actually take up 24 bits, so maybe there is some savings to be found here.</p>
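<p>If you want to check these figures, a short sketch like the following will do. The function names are my own, not from any library: <code>binomial</code> computes the count exactly for small cases, and <code>log2_binomial</code> sums logarithms so that larger values of k don&#8217;t overflow.</p>

```cpp
#include <cmath>
#include <cstdint>

// Exact C(n, k) for small cases. After step i the running value is
// C(n-k+i, i), so each division is exact. Large results will overflow.
std::uint64_t binomial(unsigned n, unsigned k)
{
    std::uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i)
        result = result * (n - k + i) / i;
    return result;
}

// lg C(n, k): the bits an ideal coder needs for a combinatorial rank,
// assuming every combination is equally likely.
double log2_binomial(unsigned n, unsigned k)
{
    double bits = 0.0;
    for (unsigned i = 1; i <= k; ++i)
        bits += std::log2(static_cast<double>(n - k + i))
              - std::log2(static_cast<double>(i));
    return bits;
}
```

<p>With n = 256 and k = 3, the first function should reproduce the 2,763,520 combinations and the second the roughly 21.4 bits quoted above.</p>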
<h4>The First Problem</h4>
<p>Knowing the combinatorial rank is good, but it won&#8217;t let you reconstruct a compressed file on its own. The combinatorial rank gives you the set of bytes in the file, but doesn&#8217;t tell you the <i>order</i> of those bytes. If there are k bytes, they can be ordered in k! different permutations. So to fully describe the file, you need to encode the combination rank <em>and</em> the permutation number.</p>
<p>Encoding the permutation for your 3 byte file is going to take 2.58 bits, calculated as lg(3!). This makes the total needed to encode your three byte file 23.98 bits. Admittedly not a lot of savings, but it&#8217;s also non-zero.</p>
<p>Let&#8217;s look at the number of bits needed to encode a 20 byte message. The number of combinations of this size is roughly 2.8*10^29, which will take 97.8 bits to encode. 20! is roughly 2.43*10^18, which will take 61.1 bits to encode. The total comes out to 158.9 bits. Since we&#8217;re encoding 160 bits of information, there is clearly a greater savings.</p>
<p>As the message size increases, the savings start to grow. At a message length of 50 bytes we save 7.5 bits, and at 75 bytes we save 16 bits per message. The trend looks good. By the time you get to a message length of 100 bytes, you&#8217;re saving 32 bits per message &#8211; a compression of 4% for doing nothing but recoding!</p>
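<p>The savings can be tabulated with a few lines of code. This is only a sketch with a made-up function name: it relies on the fact that the combination count times k! is n!/(n-k)!, so the combined cost of rank plus permutation is just the sum of lg(n-i) for i from 0 to k-1.</p>

```cpp
#include <cmath>

// Hypothetical helper: bits saved by coding k distinct symbols from an
// alphabet of n as a combination rank plus a permutation number,
// compared with the plain k * lg(n) bits. Assumes 1 <= k <= n.
double bits_saved(unsigned n, unsigned k)
{
    double coded_bits = 0.0;
    for (unsigned i = 0; i < k; ++i)
        coded_bits += std::log2(static_cast<double>(n - i));  // lg(n!/(n-k)!)
    return k * std::log2(static_cast<double>(n)) - coded_bits;
}
```

<p>Calling <code>bits_saved(256, k)</code> for k = 20 and k = 100 should land close to the 1.1 and 32 bits worked out above.</p>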
<h4>The Second Problem</h4>
<p>The second problem you encounter in the combinatorial system is that, by definition, a combination is composed of unique elements. So if you are compressing a three byte file, you can&#8217;t have any duplicate bytes. Is this a problem?</p>
<p>Your inclination is to hope not. You know that every compression scheme only works on a subset of files, so perhaps the combinatorial scheme can be developed to work on segments of files with no duplicates. </p>
<p>How likely are you to find a duplicate in a file of three bytes? You can start by enumerating the total number of files of that length: 256^3. And you know how many files there are with no duplicates: the combinatorial number times the number of permutations. So it&#8217;s a simple matter to calculate the probability that a message of length k has no duplicated bytes. The value will be n!/((n-k)!*256^k), where n is 256.</p>
<p>For a value of 3, we see that the probability of no duplicate bytes is .988 &#8211; this means you can compress almost every file by a fraction of a bit.</p>
<p>You&#8217;d like to think that you can look at pretty long stretches of data and expect a low probability of duplicates, but unfortunately you run into the <a href="http://en.wikipedia.org/wiki/Birthday_problem" class="newpage">Birthday Paradox</a>. In the birthday paradox, you&#8217;re asked a question something like this: in a room of 23 people, what are the chances that two people share a common birthday? For most people, the answer, 50% or so, is non-intuitive. </p>
<p>Likewise, it means that a file with 100 bytes and no duplicates is such a rarity that it might as well never appear &#8211; the chances are less than one in a billion.</p>
<h4>Facing the Music</h4>
<p>You can see the problem here. We can compress very short sequences using a combinatorial system, but the savings are very small. Even so, we can compress most files. We can compress longer files for greater savings, but very few sequences will prove to be eligible.</p>
<p>It&#8217;s actually worse than that. Let&#8217;s work out the numbers on a hypothetical compressor. This compressor will use a combinatorial scheme to compress all files of 10 bytes. The compressor will look at the file, and if it has no duplicates, it will set a flag symbol in the output stream to be true, followed by the combinatorial number, followed by the permutation.</p>
<p>If the 10 byte file has duplicates, the compressor will generate a flag symbol of false, followed by the uncompressed data.</p>
<p>If this scheme gives us some savings, we can scale it up to operate on files of any size &#8211; we&#8217;ll just compress them in 10-byte chunks.</p>
<p>So let&#8217;s analyze the result. First, the number of files that will make it past the first test is pretty impressive: 83.695%. Each of these files will be compressed down to 79.743 bits. The remaining 16.305 percent will take exactly 80 bits in the output stream. So the overall size of our output file thus far is going to add up to 79.78519 bits. Our algorithm is still in the black!</p>
<p>Unfortunately, we also need to account for the cost of the flag message. Using optimal coding, when the flag is true we are going to require .25679 bits. When it is false, optimal coding of the much rarer message will require 2.6 bits. Add in the cost of the flag, and the average output size goes up to a smidgen over 80 bits. </p>
<p>In other words, you lose.</p>
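<p>The bookkeeping for this hypothetical compressor fits in a short sketch. The struct and function names below are mine, and the flag is assumed to be coded optimally, costing -lg(p) bits when true and -lg(1-p) bits when false, where p is the fraction of chunks with no duplicates.</p>

```cpp
#include <cmath>

// Hypothetical bookkeeping for the flagged compressor described above.
struct FlagCost {
    double payload_bits;  // expected data bits per chunk, before the flag
    double flag_bits;     // expected cost of the flag symbol
    double total_bits;    // payload plus flag
};

// n = alphabet size, k = chunk length in symbols.
// Assumes 2 <= k <= n, so that 0 < p < 1 and both logarithms are finite.
FlagCost expected_output_bits(unsigned n, unsigned k)
{
    double p = 1.0;                                    // P(no duplicates)
    for (unsigned i = 0; i < k; ++i)
        p *= static_cast<double>(n - i) / n;
    double raw_bits = k * std::log2(static_cast<double>(n));
    double packed_bits = raw_bits + std::log2(p);      // lg(n!/(n-k)!)
    double payload = p * packed_bits + (1.0 - p) * raw_bits;
    double flag = -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    return {payload, flag, payload + flag};
}
```

<p>For n = 256 and k = 10 the payload comes in under 80 bits, but the total does not. The algebra shows why: the savings on the eligible chunks are exactly cancelled by the cost of the true flag, so all that remains is the penalty paid on the chunks that have duplicates.</p>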
<h4>Conclusion</h4>
<p>The problem is a familiar one in data compression. Every time you come up with a way to encode a subset of files that saves some space, you find that all your savings are lost when you try to encode the files that aren&#8217;t part of the subset. Even using a single bit to flag special files as being incompressible is enough to wipe out your savings. It is the definition of <a href="http://www.amazon.com/Hasbro-40509-Whac-A-Mole-Game/dp/B0001GDP00?tag=theinternetdatac" class="newpage">Whac-A-Mole</a>.</p>
<p>With combinatorial coding you will find that the same rule holds true as for all forms of data compression: it isn&#8217;t going to be a universal compressor that can reduce every file in size. The only reason it will be useful is if you have an input set of files that all have a common characteristic: a preponderance of streams where duplicates are rare. And odds are, this set of files will probably be compressible using some more reasonable algorithm.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2011/01/09/combinatorial-data-compression/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Never-ending Awesomeness of Bash</title>
		<link>http://marknelson.us/2010/10/17/the-never-ending-awesomeness-of-bash/</link>
		<comments>http://marknelson.us/2010/10/17/the-never-ending-awesomeness-of-bash/#comments</comments>
		<pubDate>Sun, 17 Oct 2010 21:07:01 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://marknelson.us/2010/10/17/the-never-ending-awesomeness-of-bash/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/10/17/the-never-ending-awesomeness-of-bash/' addthis:title='The Never-ending Awesomeness of Bash' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>This semester I&#8217;m teaching a class on Linux/UNIX, and am enjoying it immensely. With every lecture I&#8217;m reminded that you simply never stop finding new tools and tricks to use in an O/S that is now well into middle age. One of my midterm questions from last week was a basic query regarding filename expansion [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/10/17/the-never-ending-awesomeness-of-bash/' addthis:title='The Never-ending Awesomeness of Bash' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>This semester I&#8217;m teaching a class on Linux/UNIX, and am enjoying it immensely. With every lecture I&#8217;m reminded that you simply never stop finding new tools and tricks to use in an O/S that is now well into middle age.</p>
<p>One of my midterm questions from last week was a basic query regarding filename expansion under bash:</p>
<blockquote><p>
Write a command to copy files chapt00.txt, chapt01.txt, through chapt15.txt from the current directory to your home directory. Make the command as short as possible.
</p></blockquote>
<p>The answer I was expecting was<br/><br />
<tt>cp chapt0[0-9].txt chapt1[0-5].txt ~</tt><br/><br />
and that is indeed what I got from the majority of the students. Of course a few were confused about the use of character classes and tried to get something like <tt>chapt[00-15].txt</tt> to work. But one student turned in something a bit more novel:<br/><br />
<tt>cp chapt{0[0-9],1[0-5]}.txt ~</tt><br/><br />
My initial reaction was that this was simply a misguided attempt at filename expansion using broken syntax. But I was assured that this worked properly, and a quick look at the bash reference manual showed me the error of my ways. It turns out I was completely overlooking a nice feature of the shell: <i>brace expansion</i>. When the shell encounters a sequence of the format <i>preamble{comma-separated-list}postamble</i>, it iterates through the comma-separated list of tokens inside the braces and generates a new string for each one. The GNU manual example is for the sequence <i>a{d,c,b}e</i>, which generates a space separated list: <i>ade ace abe</i>.</p>
<p>What this means is what I took to be an incorrect file specification actually generated the sequence  <i>chapt0[0-9].txt chapt1[0-5].txt</i>. Since brace expansion occurs before filename expansion, this results in the expected output. So this answer, being more terse than my expected response, is actually <i>more right</i>.</p>
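<p>To make the mechanics concrete, here is a toy model of that rule written in C++. It is emphatically not how bash implements it: <code>expand_braces</code> is a name I made up, and the sketch handles only a single, non-nested brace group with no escaping and no numeric sequences.</p>

```cpp
#include <string>
#include <vector>

// Toy model of preamble{comma-separated-list}postamble. Handles one
// brace group only; nesting, quoting, and {x..y} sequences are ignored.
std::vector<std::string> expand_braces(const std::string& pattern)
{
    const std::size_t npos = std::string::npos;
    std::size_t open = pattern.find('{');
    std::size_t close = (open == npos) ? npos : pattern.find('}', open);
    std::size_t first_comma = (close == npos) ? npos : pattern.find(',', open);
    // No complete group, or no comma inside it: leave the text untouched.
    if (close == npos || first_comma == npos || first_comma > close)
        return {pattern};

    std::string preamble = pattern.substr(0, open);
    std::string postamble = pattern.substr(close + 1);
    std::vector<std::string> results;
    std::size_t start = open + 1;
    while (true) {
        std::size_t comma = pattern.find(',', start);
        if (comma == npos || comma > close)
            comma = close;                       // last item ends at the brace
        results.push_back(preamble + pattern.substr(start, comma - start) + postamble);
        if (comma == close)
            break;
        start = comma + 1;
    }
    return results;
}
```

<p>Fed the student&#8217;s pattern, the model produces two strings that still contain the bracket expressions, which is the form the shell then hands to filename expansion.</p>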
<p>Our machines at school are using bash 3.2. If we upgraded to bash 4, we could use a numerical sequence with brace expansion to get an even better answer. The expression <tt>chapt{00..15}.txt</tt> should expand to the list of all 16 file names. But bash 4.1 is still only slowly working its way into new distributions, so it may be a while before we can count on that syntax to work everywhere.</p>
<p>The moral of the story, of course, is that no matter how much time you spend working with the UNIX/Linux  command line, there is always plenty more to learn. I should be paying <a href="http://utdallas.edu" class="newpage">UTD</a> to teach this course, not the  other way around.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2010/10/17/the-never-ending-awesomeness-of-bash/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Headline Writing Gone Bad</title>
		<link>http://marknelson.us/2010/10/01/headline-writing-gone-bad/</link>
		<comments>http://marknelson.us/2010/10/01/headline-writing-gone-bad/#comments</comments>
		<pubDate>Fri, 01 Oct 2010 15:36:37 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Snarkiness]]></category>
		<category><![CDATA[Web Articles]]></category>
		<category><![CDATA[Writing]]></category>

		<guid isPermaLink="false">http://marknelson.us/2010/10/01/headline-writing-gone-bad/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/10/01/headline-writing-gone-bad/' addthis:title='Headline Writing Gone Bad' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>Microsoft has added a new keyword to C# as part of the 4.0 release earlier this year. Objects that are typed as dynamic bypass normal static type checking, allowing C# to have the flexibility of other scripting languages. This is all well and good, but the headline writers of the blogosphere have taken a decided [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/10/01/headline-writing-gone-bad/' addthis:title='Headline Writing Gone Bad' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>Microsoft has added a new keyword to C# as part of the 4.0 release earlier this year. Objects that are typed as <i>dynamic</i> bypass  normal static type checking, allowing C# to have the flexibility of other scripting languages.</p>
<p>This is all well and good, but the headline writers of the blogosphere have taken a decidedly wrong turn with their naming of this feature:</p>
<p><a href="http://www.codeproject.com/Articles/73856/Csharp-4-0-Dynamic-Programming.aspx" class="newpage">C# 4.0: Dynamic Programming</a><br/><br />
<a href="http://www.nikhilk.net/CSharp-Dynamic-Programming-JSON.aspx" class="newpage">C# 4.0, Dynamic Programming and JSON</a><br/><br />
<a href="http://www.codeguru.com/csharp/.net/net_general/visualstudionetadd-ins/article.php/c17991" class="newpage">Dynamic Programming Using C# 4.0 and Microsoft Visual Studio 2010</a><br/><br />
<a href="http://geekswithblogs.net/sdorman/archive/2008/11/16/c-4.0-dynamic-programming.aspx" class="newpage">C# 4.0: Dynamic Programming</a><br/></p>
<p>Note the misuse of the term <i>Dynamic Programming</i>. Everyone who takes an introductory algorithms course learns that the term <a href="http://en.wikipedia.org/wiki/Dynamic_programming" class="newpage">Dynamic Programming</a> has been in use for over fifty years, and refers to a method for solving problems by decomposition. It&#8217;s a useful technique that I&#8217;ve <a href="http://marknelson.us/2007/08/01/memoization/" class="newpage">covered here</a> in the past, and any skilled programmer should be familiar with it.</p>
<p>No, it&#8217;s not the end of the world, but people who are writing about Computer Science really ought to know something about Computer Science, don&#8217;t you think?</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2010/10/01/headline-writing-gone-bad/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Innumeracy Revisited</title>
		<link>http://marknelson.us/2010/09/12/innumeracy-revisited/</link>
		<comments>http://marknelson.us/2010/09/12/innumeracy-revisited/#comments</comments>
		<pubDate>Sun, 12 Sep 2010 17:35:39 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Culture]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Snarkiness]]></category>

		<guid isPermaLink="false">http://marknelson.us/2010/09/12/innumeracy-revisited/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/09/12/innumeracy-revisited/' addthis:title='Innumeracy Revisited' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>The New York Times has an interesting article today examining the curious fact that certain types of terrorist organizations have an unusually high ratio of engineers among their members. An interesting point to study, no doubt, but what caught my eye was this little blunder: William A. Wulf, a former president of the National Academy [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/09/12/innumeracy-revisited/' addthis:title='Innumeracy Revisited' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>The New York Times has an interesting <a href="http://www.nytimes.com/2010/09/12/magazine/12FOB-IdeaLab-t.html" class="newpage">article</a> today examining the curious fact that certain types of terrorist organizations have an unusually high ratio of engineers among their members. An interesting point to study, no doubt, but what caught my eye was this little blunder:</p>
<blockquote><p>
William A. Wulf, a former president of the National Academy of Engineering, is, no surprise, no fan of the Gambetta-Hertog theory. “If you have a million coin flips,” he says, “it’s almost certain that somewhere in those coin flips there will be 20 heads in a row.”
</p></blockquote>
<p>This numerical gaffe is a prime example of innumeracy, a <a href="http://marknelson.us/2008/07/20/innumeracy-part-n/" class="newpage">favorite</a> <a href="http://www.drdobbs.com/blog/archives/2008/05/innumeracy_cont.html" class="newpage">topic</a> of mine, and it is doubly bad. First, the New York Times with its old-school print-format hubris regarding fact checking should not have let this slip by unnoticed. Second, the fact that the speaker is not just an engineer, but president of our National Academy, adds insult to injury.<br />
<span id="more-132"></span></p>
<h3>Probability 101</h3>
<p>The Wikipedia says that <a href="http://en.wikipedia.org/wiki/Numeracy" class="newpage">Numeracy</a> is <i>the ability to reason with numbers and other mathematical concepts.</i> In today&#8217;s world, it should be considered as important as literacy. So let&#8217;s try doing some thinking about this problem.</p>
<p>What should first catch your eye in this is the meaning behind &#8220;20 heads in a row.&#8221; As a programmer, you are instinctively aware that 2 to the 20th power is roughly one million. This means that the chance of flipping a true coin and having it land heads up 20 times in a row is indeed roughly one in a million. Does this mean that flipping a coin a million times renders such a streak &#8220;almost certain?&#8221; Of course not.</p>
<p>If the chance of flipping a single head is one in two, and I flip a coin two times, am I almost certain to see one head? No. If the chances of two heads in a row is one in four, am I almost certain to see a streak of two if I flip four times? Still we intuitively answer no. It seems likely, but nowhere near a certainty. So the task in front of us is to scale this equation up and see if it changes in character as we near one million.</p>
<h3>Pinning it Down</h3>
<p>Determining how likely this streak is requires a frequent ruse we employ in probability. Instead of calculating the probability directly, we determine out how likely it is <i>not to occur</i>, then subtract that value from one.</p>
<p>We know that the chance of the streak happening in the first 20 flips is 1/2^20. We&#8217;ll call this number <i>p</i>. Now let&#8217;s imagine a sequence of a million coin flips. The chance of a streak of 20 heads not starting at position one is 1-<i>p</i>. The chance of it not happening in the sequence of coins starting at position 2 is likewise 1-<i>p</i>. The same probability is true for every sequence of flips from position 1 to position 999,981, the last possible start of a streak of twenty.</p>
<p>The chance of not seeing a streak start at any of those positions is found by multiplying their values together, leading to the rather unwieldy formula (1-<i>p</i>)^999,981. (This treats the positions as independent, which is a simplification, since overlapping streaks share flips.) Unwieldy, perhaps, but your scientific calculator will quickly tell you it resolves to roughly 0.39. So the chance of seeing 20 heads in a row after a million coin flips is more or less 61%. Hardly &#8220;almost certain&#8221;.</p>
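<p>If you would rather not reach for a calculator, the formula is a few lines of Python. The sketch below evaluates it as written, and then, as a cross-check that is my addition rather than part of the argument above, computes the exact probability with a simple recurrence that does not assume the starting positions are independent. Both function names are hypothetical.</p>

```python
def independent_estimate(flips, streak):
    """The formula from the text: 1 - (1 - p)^(start positions), p = 1/2^streak."""
    p = 0.5 ** streak
    positions = flips - streak + 1  # 999,981 for a million flips and a streak of 20
    return 1.0 - (1.0 - p) ** positions


def exact_probability(flips, streak):
    """Exact chance of at least one run of `streak` heads in `flips` fair tosses.

    q[n] is the chance of NO such run in the first n flips. The first run ends
    exactly at flip n when flip n-streak is a tail, the next `streak` flips are
    all heads, and there was no run before that tail, which gives
        q[n] = q[n-1] - q[n-streak-1] / 2^(streak+1).
    """
    if streak > flips:
        return 0.0
    q = [1.0] * (flips + 1)          # no run is possible in fewer than `streak` flips
    q[streak] = 1.0 - 0.5 ** streak  # only an all-heads opening contains a run
    step = 0.5 ** (streak + 1)
    for n in range(streak + 1, flips + 1):
        q[n] = q[n - 1] - q[n - streak - 1] * step
    return 1.0 - q[flips]


if __name__ == "__main__":
    print(independent_estimate(1000000, 20))
    print(exact_probability(1000000, 20))
```

<p>The first function reproduces the figure of roughly 61%. The exact recurrence comes out lower, around 38% by my reckoning, because streaks that overlap tend to arrive in clumps. Either way, the result is a long way from &#8220;almost certain&#8221;.</p>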
<h3>Finding Almost Certain</h3>
<p>I&#8217;d like to think that &#8220;almost certain&#8221; is somewhere in the neighborhood of 99%. I&#8217;ll leave the calculation as an exercise for the reader, but if your calculator has a log button you will be able to determine that you will need almost five million coin tosses to achieve near certainty. And when you think about it (using your beloved numeracy) that number seems a lot more realistic. Something that has a one in a million chance of occurring would seem to be only somewhat likely to occur in a million tries. Give me five million and it&#8217;s a sure thing.</p>
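<p>For readers who want to check the exercise, here is one way to do it, as a sketch under the same simplified formula used earlier (each starting position treated as an independent one-in-2^20 shot). We solve (1-<i>p</i>)^<i>n</i> = 0.01 for <i>n</i> by taking logs. The function names are mine, chosen for illustration.</p>

```python
import math


def positions_needed(streak, confidence):
    """Smallest n with 1 - (1 - p)^n at or above `confidence`, where p = 1/2^streak."""
    p = 0.5 ** streak
    # log1p(-p) is log(1 - p), computed accurately for very small p
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


def flips_needed(streak, confidence):
    """Flips required: the last streak runs streak - 1 flips past its starting position."""
    return positions_needed(streak, confidence) + streak - 1


if __name__ == "__main__":
    print(flips_needed(20, 0.99))
```

<p>The answer lands at a bit over 4.8 million flips, which is where &#8220;almost five million&#8221; comes from.</p>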
<p>Ironically, the <a href="http://en.wikipedia.org/wiki/The_New_York_Times" class="newpage">Gray Lady</a> just ran an <a href="http://www.nytimes.com/2010/08/22/magazine/22FOB-medium-t.html" class="newpage">ode to fact checking</a> a few weeks ago. Apparently that department is short on people with any sort of mathematical fluency. Perhaps they should think about hiring an engineer or two?</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2010/09/12/innumeracy-revisited/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Brute Force vs. AI</title>
		<link>http://marknelson.us/2010/08/26/brute-force-vs-ai/</link>
		<comments>http://marknelson.us/2010/08/26/brute-force-vs-ai/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 13:32:59 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>

		<guid isPermaLink="false">http://marknelson.us/2010/08/26/brute-force-vs-ai/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/08/26/brute-force-vs-ai/' addthis:title='Brute Force vs. AI' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>One of the annoying things that old school Artificial Intelligence researchers have to deal with is the fact that simple brute force is such a daunting foe. Back in the dawn era of the field, attempts to replicate human thought processes used deductive reasoning, symbolic representation, and incremental learning to solve problems. As an example, [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/08/26/brute-force-vs-ai/' addthis:title='Brute Force vs. AI' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>One of the annoying things that old school Artificial Intelligence researchers have to deal with is the fact that simple brute force is such a daunting foe. Back in the dawn era of the field, attempts to replicate human thought processes used deductive reasoning, symbolic representation, and incremental learning to solve problems.</p>
<p>As an example, look at what the AI consensus might have been 30 years ago for championship chess programs, and compare it to the massive database searches used by Deep Blue to pummel human opponents. I think you&#8217;ll find that things haven&#8217;t worked out quite the way they were expected to.</p>
<p>The feeling of the <i>hoi polloi</i>, of course, is that Artificial Intelligence is dead, and that&#8217;s probably the best thing that could happen to what is still a pretty exciting field.  But it&#8217;s quite a rarity for the public to get a hands-on look at exciting developments in AI.</p>
<p>Unfortunately, the public beta of the Swingly search engine is not going to be the exception to this rule.<br />
<span id="more-131"></span></p>
<h3>Swingly</h3>
<p>Swingly is a search engine that purports to answer questions in plain English by searching the web. They give some examples which work quite well, such as:</p>
<ul>
<li>How much money did Avatar make?</li>
<li>Who won the World Series in 2004?</li>
<li>Who killed Inigo&#8217;s father?</li>
</ul>
<p>
Natural language queries are to AI what register optimization is to compilers &#8211; a fundamental problem that has been studied to death. As a result, I thought maybe Swingly might have some interesting results to present here.</p>
<h3>Swingly vs. Google</h3>
<p>Of course, it is only fair to compare its results to those of Google, the brute force ghost at the banquet. I threw out a random selection of questions and examined the results, judging them only on whether or not they answered the question. My sample list included:</p>
<ul>
<li>Who is the editor of Dr. Dobb&#8217;s Journal?</li>
<li>Is same sex marriage legal in Texas?</li>
<li>How does Swingly work?</li>
<li>What books did Mark Nelson write?</li>
<li>How do I tie a half hitch?</li>
<li>What song has the lyrics &#8220;My loneliness is killing me&#8221;?</li>
</ul>
<p>I rated the test results as Good, Okay, or Bad. Google got 5 Goods and 1 Okay. Swingly got 2 Okays and 4 Bads &#8211; Bad meaning that it totally missed the point.</p>
<p>Just as an example, the answer to the last question is &#8220;Baby One More Time&#8221; by Britney Spears. Google&#8217;s first result was a video for that song &#8211; a clear win. Swingly&#8217;s first result was a Wikipedia entry for a song called &#8220;Killing Me&#8221; by the Japanese rock band L&#8217;Arc-en-Ciel &#8211; fail.</p>
<p>So today at least, chalk one up for finely tuned brute force queries, and mark it a loss for natural language parsing. </p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2010/08/26/brute-force-vs-ai/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Little Knowledge Can Be a Felonious Thing</title>
		<link>http://marknelson.us/2010/08/20/a-little-knowledge-can-be-a-felonious-thing/</link>
		<comments>http://marknelson.us/2010/08/20/a-little-knowledge-can-be-a-felonious-thing/#comments</comments>
		<pubDate>Fri, 20 Aug 2010 19:06:59 +0000</pubDate>
		<dc:creator>Mark Nelson</dc:creator>
				<category><![CDATA[Computer Science]]></category>
		<category><![CDATA[Scams]]></category>

		<guid isPermaLink="false">http://marknelson.us/2010/08/20/a-little-knowledge-can-be-a-felonious-thing/</guid>
		<description><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/08/20/a-little-knowledge-can-be-a-felonious-thing/' addthis:title='A Little Knowledge Can Be a Felonious Thing' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div>Not understanding some basic rules of mathematics or logic can be a problem if you are a computer programmer. It can stand in the way of a good solution to a problem, or worse yet, can cause you to spend a lot of time working on a dead end. In the case of New Zealand [...]]]></description>
			<content:encoded><![CDATA[<div class="addthis_toolbox addthis_default_style" addthis:url='http://marknelson.us/2010/08/20/a-little-knowledge-can-be-a-felonious-thing/' addthis:title='A Little Knowledge Can Be a Felonious Thing' ><a class="addthis_button_twitter"></a><a class="addthis_button_favorites"></a><a class="addthis_button_print"></a><a class="addthis_button_facebook_like"></a><a class="addthis_button_google_plusone"></a><a class="addthis_button_compact"></a></div><p>Not understanding some basic rules of mathematics or logic can be a problem if you are a computer programmer. It can stand in the way of a good solution to a problem, or worse yet, can cause you to spend a lot of time working on a dead end.</p>
<p>In the case of New Zealand developer Philip Whitley, not understanding the <a href="http://marknelson.us/2010/08/01/the-pigeonhole-principle/">Pigeonhole Principle</a> led to a $NZ 5.3M fine, and up to five years in jail.<br />
<span id="more-130"></span></p>
<h3>Compression Scam</h3>
<p>For roughly the past ten years, Whitley had been raising money to fund a company that was to use his revolutionary data compression algorithm. This algorithm purported to be able to compress <i>every</i> file by 92.5%.</p>
<p>Readers of this space know quite well that it is provably impossible to compress every file. And anyone with some experience in data compression knows that the notion that all files could be compressed by a large, fixed amount is patently ludicrous.</p>
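<p>The counting argument behind that claim fits in a few lines. This is a minimal sketch, assuming files are nothing more than bit strings and that a lossless compressor must map different inputs to different outputs. The function name is hypothetical.</p>

```python
from fractions import Fraction


def max_fraction_compressible(input_bits, output_bits):
    """Upper bound on the share of `input_bits`-bit files that any lossless
    scheme can squeeze into `output_bits` bits or fewer.

    There are 2^input_bits possible inputs but only
    2^0 + 2^1 + ... + 2^output_bits = 2^(output_bits + 1) - 1 possible outputs,
    and no two inputs may share an output.
    """
    inputs = 2 ** input_bits
    outputs = 2 ** (output_bits + 1) - 1
    return Fraction(min(outputs, inputs), inputs)


if __name__ == "__main__":
    # Shrinking an 8-bit file by even one bit: at least one input must miss out.
    print(max_fraction_compressible(8, 7))
    # A 92.5% reduction of a 1000-bit file leaves 75 bits.
    print(float(max_fraction_compressible(1000, 75)))
```

<p>For 8-bit files the bound is 255/256, so at least one file cannot shrink at all. For a 92.5% reduction of 1000-bit files, the share that could possibly qualify is smaller than one in 2^900, which is about as far from <i>every</i> file as you can get.</p>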
<p>But to the man on the street, this sort of thing isn&#8217;t entirely obvious, and that fact allowed Whitley to raise millions from investors in New Zealand. </p>
<p>Fortunately, the prosecutors didn&#8217;t have to hang their case on <a href="http://findarticles.com/p/articles/mi_8062/is_20100305/ai_n52392609/">testimony from academics</a> about abstract mathematical proofs &#8211; Whitley lied to his investors about having patented the technology, and that was enough for the Kiwi Serious Fraud Office to garner a conviction.</p>
<h3>Not the First, Not the Last</h3>
<p>Whitley isn&#8217;t the first compression con artist, and he won&#8217;t be the last. A few years back I served as a witness for a man who ran afoul of the US Department of Justice for a similar compression scheme. His work resulted in substantial losses for a number of people, as disparate as a dentist from my suburban home of Plano, Texas, and Swiss musician of note <a href="http://www.dietermeier.com/">Dieter Meier</a>. Neither the defendant nor his victims had much knowledge of algorithmic law in that case, either.</p>
<p>What does this mean to us? Well, the next time you hear someone arguing that a CS degree doesn&#8217;t do the average programmer any good, you might remember that at a minimum it can help keep you out of jail. Surely that&#8217;s worth something.</p>
]]></content:encoded>
			<wfw:commentRss>http://marknelson.us/2010/08/20/a-little-knowledge-can-be-a-felonious-thing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

