Microsoft has never been a slacker in the C++ department - they’ve always worked hard to provide a top-notch, compliant product. Visual Studio 2010 supports their current incarnation of the language, and for the most part it is up to their usual standards. It’s a great development environment, and I am a dedicated user, but I have to give Microsoft a demerit in one area: their C++11 hash containers have some serious performance problems - so much so that the Debug versions of the containers may well be unusable in your application.

Background

I first noticed the problem with unordered_map when I was working on the code for my updated LZW article. I found that when running in the debugger, my program would hang after exiting the compression routine. A little debugging showed that the destructor for my hash table was taking a long time to run. And by a long time, I mean it was approaching an hour!

This seemed pretty crazy. Destroying a hash table shouldn’t be a complicated task. I decided to see if I could come up with a reasonable benchmark. I wrote a test program that does a simple word frequency count. As a starter data set, I used the first one million whitespace-delimited words in the 2010 CIA Factbook, as published by Project Gutenberg. This data set yields 74,208 unique tokens.
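The input stage isn’t part of the listing below, but it is simple enough to sketch. This is a minimal version written for this post, not the reader from HashTest.cpp; the read_tokens helper and the file name are made up for illustration:

#include <fstream>
#include <string>
#include <vector>

// Pull the first 'limit' whitespace-delimited words from a text file.
// operator>> on a string skips and breaks on whitespace, which matches
// the tokenization described above.
std::vector<std::string> read_tokens( const char *path, size_t limit )
{
    std::vector<std::string> tokens;
    std::ifstream input( path );
    std::string word;
    while ( tokens.size() < limit && input >> word )
        tokens.push_back( word );
    return tokens;
}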

I wrote a simple test rig that runs the word count program against four different containers:

  • unordered_map indexed by std::string
  • unordered_map indexed by std::string *
  • map indexed by std::string
  • map indexed by std::string *

The reason for testing with std::string * was to reduce the cost of copying strings into the hash table as it was filled, and then to reduce the cost of destroying those strings when the table was destroyed.
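For reference, the four variants look something like this - a sketch, not the exact declarations from HashTest.cpp. One wrinkle worth noting: with the default std::hash and std::less, a pointer key is hashed and ordered by its address, so the pointer-keyed runs only produce correct counts if duplicate words share a single interned std::string object (the driver sketch further down does exactly that):

#include <map>
#include <string>
#include <unordered_map>

// Illustrative names, not the ones in HashTest.cpp.
typedef std::unordered_map<std::string, unsigned>         hash_by_string;
typedef std::unordered_map<const std::string *, unsigned> hash_by_pointer;
typedef std::map<std::string, unsigned>                   map_by_string;
typedef std::map<const std::string *, unsigned>           map_by_pointer;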

I ran tests against map expecting to see a pretty big difference in performance. Because map is normally implemented using a balanced binary tree structure, it has O(log(N)) performance on insertions. A sparsely populated hash table can have O(1) performance. By using fairly large data sets, I expected to see a big difference between the two.
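To put rough numbers on that: with 74,208 unique tokens, log2(74,208) is a little over 16, so each map insertion does on the order of 16 string comparisons as it walks down the tree, while a sparsely populated hash table does one hash computation and inspects only a bucket or two. (That is back-of-the-envelope arithmetic on my part, not a measurement.)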

I tried to eliminate a few obvious sources of error in my test function, and I used a template function so that I could use the same code on all the different container types:

#include <ctime>
#include <iostream>

template<class CONTAINER, class DATA>
void test( const DATA &data, const char *test_name )
{
  std::cout << "Testing container: " << test_name << std::endl;

// Debug runs are slow enough that two passes have to suffice
#ifdef _DEBUG
  const int passes = 2;
#else
  const int passes = 10;
#endif
  double fill_times = 0;
  double delete_times = 0;
  size_t entries = 0;
  for ( int i = 0 ; i < passes ; i++ ) {
    CONTAINER *container = new CONTAINER();
    std::cout << "Filling... " << std::flush;
    clock_t t0 = clock();
    for ( auto ii = data.begin() ; ii != data.end() ; ii++ )
      (*container)[*ii]++;  // default-constructs the count on first sight
    double span = double(clock() - t0)/CLOCKS_PER_SEC;
    fill_times += span;
    entries = container->size();
    std::cout << " " << span << " Deleting... " << std::flush;
    t0 = clock();
    delete container;       // the destructor is what gets timed here
    span = double(clock() - t0)/CLOCKS_PER_SEC;
    delete_times += span;
    std::cout << span << " " << std::endl;
  }
  std::cout << "Entries: " << entries 
            << ", Fill time: " << (fill_times/passes) 
            << ", Delete time: " << (delete_times/passes) 
            << std::endl;
}
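The driver just instantiates the template once per container type - something like the following, again a sketch that leans on the hypothetical read_tokens helper and typedefs from above:

#include <set>
#include <vector>

int main()
{
    // One million whitespace-delimited words from the factbook text.
    std::vector<std::string> words = read_tokens( "factbook2010.txt", 1000000 );

    test<hash_by_string>( words, "unordered_map<string>" );
    test<map_by_string>( words, "map<string>" );

    // The pointer runs need interned tokens: one canonical std::string
    // per unique word, with the data set rebuilt as stable pointers.
    std::set<std::string> interned( words.begin(), words.end() );
    std::vector<const std::string *> word_ptrs;
    for ( auto ii = words.begin() ; ii != words.end() ; ii++ )
        word_ptrs.push_back( &*interned.find( *ii ) );

    test<hash_by_pointer>( word_ptrs, "unordered_map<string const *>" );
    test<map_by_pointer>( word_ptrs, "map<string const *>" );
    return 0;
}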

I didn’t go overboard when it came to instrumenting this problem; I just used the timing functions built into the C++ library. On my Windows and Linux test systems, the values of CLOCKS_PER_SEC are both high enough that I’m not worried about granularity issues.

The First Results

I ran my test program in Visual C++ Release mode, using all the standard settings for a console application. For purposes of comparison, I ran the same program using g++ 4.6.1 on the same computer, booted up under Linux. For the set of 1,000,000 tokens, the results are shown below:

Task                                   VC++ 10 Release   g++ 4.6.1 -O3
Fill unordered_map<string>             0.41s             0.11s
Fill unordered_map<string const *>     0.39s             0.14s
Destroy unordered_map<string>          3.17s             0.01s
Destroy unordered_map<string const *>  3.24s             0.004s
Fill map<string>                       0.83s             0.53s
Fill map<string const *>               0.88s             0.66s
Destroy map<string>                    0.14s             0.01s
Destroy map<string const *>            0.07s             0.002s

There are a few interesting points to take away from these tests:

  • Microsoft's library is taking an exceptionally long time to destroy hashed containers - one order of magnitude longer than it took to create them, and two orders of magnitude longer than it takes g++ to do the same task.
  • It doesn't look like constructing and destroying the strings is a big factor. Both compilers have roughly the same performance with std::string and std::string *. Microsoft's numbers are actually a bit counterintuitive: in several cases the pointer-keyed containers take longer to fill or destroy than the string-keyed ones.
  • The GNU compiler appears to be able to run through this exercise notably faster.

The time it takes to destroy the table is the real worry - having a C++ program hang for over 3 seconds while it tears down a modestly large data structure is a serious problem, particularly when the same task completes in a few milliseconds with g++.

The Pathological Results

These concerns are nothing compared to what I see when running in debug mode. Setting my Visual Studio project to Debug mode, then running the same test, yields the results shown here:

Task                                   VC++ 10 Debug
Fill unordered_map<string>             17.41s
Fill unordered_map<string const *>     17.08s
Destroy unordered_map<string>          505.36s
Destroy unordered_map<string const *>  505.99s
Fill map<string>                       13.29s
Fill map<string const *>               13.15s
Destroy map<string>                    0.94s
Destroy map<string const *>            0.18s

Those numbers are hard to believe. Destroying this hash table takes about ten milliseconds with g++. In VC++ 10’s Debug configuration, it takes nearly eight and a half minutes!

Worse, we suddenly see that hashed containers are slower than the containers built on red-black trees. Again, this just doesn’t make sense.

The big problem with these numbers is that they mean Debug builds against Microsoft’s library are effectively unusable for a lot of tasks. Regardless of how much extra checking the debug runtime does, when it is this slow, it is just not useful.

A Workaround

I didn’t invest the time to try debugging Microsoft’s library, so I don’t really know where the time is being spent. I did try a few things to speed things up, and I found one technique that helps a lot. Before including any Microsoft header files, add this single line to your source:

#define _ITERATOR_DEBUG_LEVEL 0

With this definition in place, the delete times return to the ballpark of the times seen when running in release mode. Of course, you give up some of the library’s iterator debugging. Microsoft’s documentation for _ITERATOR_DEBUG_LEVEL explains what the macro controls.
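Two cautions, both assumptions worth verifying against Microsoft’s documentation rather than guarantees: the define has to appear before the first standard header in every translation unit, and since the setting changes the container internals, mixing different values across the object files of one binary is unsafe. Defining it on the compiler command line keeps everything consistent:

// Option 1: the very first lines of each source file,
// before any standard header is pulled in.
#define _ITERATOR_DEBUG_LEVEL 0
#include <unordered_map>

// Option 2: define it project-wide so every translation unit agrees,
// e.g. in the project's Preprocessor Definitions or on the command line:
//   cl /D_ITERATOR_DEBUG_LEVEL=0 /EHsc /MDd HashTest.cpp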

In the final analysis, I think Microsoft has some serious work to do here. The performance of their hashed containers, and to a lesser extent the pre-C++11 associative containers, needs some serious examination. If the library is going to run this much slower than the competition, I need a good explanation why.

Source

HashTest.cpp