Microsoft has never been a slacker in the C++ department – they’ve always worked hard to provide a top-notch, compliant product. Visual Studio 10 supports their current incarnation, and for the most part it is up to their usual standards. It’s a great development environment, and I am a dedicated user, but I have to give Microsoft a demerit in one area: their C++11 hash containers have some serious performance problems – so much that the Debug versions of the containers may well be unusable in your application.
Background
I first noticed the problem with unordered_map when I was working on the the code for my updated LZW article. I found that when running in the debugger, my program would hang after exiting the compression routine. A little debugging showed that the destructor for my hash table was taking a long time to run. And by a long time, I mean it was approaching an hour!.
This seemed pretty crazy. Destroying a hash table wouldn’t seem to be a complicated task. I decided to see if I could come up with a reasonable benchmark. I wrote a test program that does a simple word frequency count. As a starter data set, I used the first one million white space delimited words in the 2010 CIA factbook, as published by Project Gutenberg. This data set yields 74,208 unique tokens.
I wrote a simple test rig that I used to test the word count program using four different containers:
unordered_map indexed by std::string
unordered_map indexed by std::string *
map indexed by std::string
map indexed by std::string *
The reason for testing with std::string * was to reduce the cost of copying strings into the hash table as it was filled, and then to reduce the cost of destroying those strings when the table was destroyed.
I ran tests against map expecting to see a pretty big difference in performance. Because map is normally implemented using a balanced binary tree structure, it has O(log(N)) performance on insertions. A sparsely populated hash table can have O(1) performance. By using fairly large data sets, I expected to see a big difference between the two.
I tried to eliminate a few obvious sources for error in my test function – and I used a template function so that I could use the same code on all the different container types:
template<class CONTAINER, class DATA>
void test( const DATA &data, const char *test_name )
{
std::cout << "Testing container: " << test_name << std::endl;
#ifdef _DEBUG
const int passes = 2;
#else
const int passes = 10;
#endif
double fill_times = 0;
double delete_times = 0;
size_t entries;
for ( int i = 0 ; i < passes ; i++ ) {
CONTAINER *container = new CONTAINER();
std::cout << "Filling... " << std::flush;
clock_t t0 = clock();
for ( auto ii = data.begin() ; ii != data.end() ; ii++ )
(*container)[*ii]++;
double span = double(clock() - t0)/CLOCKS_PER_SEC;
fill_times += span;
entries = container->size();
std::cout << " " << span << " Deleting... " << std::flush;
t0 = clock();
delete container;
span = double(clock() - t0)/CLOCKS_PER_SEC;
delete_times += span;
std::cout << span << " " << std::endl;
}
std::cout << "Entries: " << entries
<< ", Fill time: " << (fill_times/passes)
<< ", Delete time: " << (delete_times/passes)
<< std::endl;
}
I didn’t go overboard when it came to instrumenting this problem, I just used the timing functions built into the C++ library. On my Windows and Linux test systems, the values of CLOCKS_PER_SEC are both high enough that I’m not worried about granularity issues.
The First Results
I ran my test program in Visual C++ Release mode, using all the standard settings for a console application. For purposes of comparison, I ran the same program using g++ 4.6.1 on the same computer, booted up under Linux. For the set of 1,000,000 tokens, the results are shown below:
| Task | VC++ 10 Release | g++ 4.6.1 -O3 |
|---|---|---|
Fill unordered_map<string> |
0.41s | .11s |
Fill unordered_map<string const *> |
0.39s | 0.14s |
Destroy unordered_map<string> |
3.17s | 0.01s |
Destroy unordered_map<string const *> |
3.24s | 0.004s |
Fill map<string> |
0.83s | .53s |
Fill map<string const *> |
0.88s | 0.66s |
Destroy map<string> |
.14s | 0.01s |
Destroy map<string const *> |
.07s | 0.002s |
There are a few interesting points to take away from these tests:
-
Microsoft’s compiler is taking an exceptionally long time to destroy hashed containers – one order of magnitude greater than it took to create it, and two orders of magnitude greater than it takes g++ to do the same task.
It doesn’t look like constructing and destroying the strings is a big factor. Both compilers have roughly the same performance with both
std::string and std::string *. Microsoft’s behavior is counterintuitive, as it takes longer to construct and destroy containers using the pointer.
The GNU compiler appears to be able to run through this exercise notably faster.
The time it takes to destroy the table is a concern – having a C++ program hang for over 3 seconds to destroy a modestly large data structure is a serious concern – particularly when the same task completes in a few milliseconds with g++.
The Pathological Results
These concerns are nothing compared to what I see when running in debug mode. Setting my Visual Studio project to Debug mode, then running the same test, yields the results shown here:
| Task | VC++ 10 Debug |
|---|---|
Fill unordered_map<string> |
17.41s |
Fill unordered_map<string const *> |
17.08s |
Destroy unordered_map<string> |
505.36s |
Destroy unordered_map<string const *> |
505.99s |
Fill map<string> |
13.29s |
Fill map<string const *> |
13.15s |
Destroy map<string> |
0.94s |
Destroy map<string const *> |
0.18s |
Those numbers are hard to believe. Destroying a hash table takes one millisecond when using g++. In VC++ 10, it takes almost ten minutes!
Worse, we suddenly see that hashed containers are slower than the containers built on red-black trees. Again, this just doesn’t make sense.
The big problem with these numbers is that it means the debug mode of the compiler is effectively unusable for a lot of tasks. Regardless of how much testing it does, when it is this slow, it is just not useful.
A Workaround
I didn’t invest the time to try debugging Microsoft’s library, so I don’t really know where the time is being spent. I did try a few things to speed things up, and I found one technique that helps a lot. Before including any Microsoft header files, try entering this single line in your source:
#define ITERATOR_DEBUG_LEVEL 0
With this definition in place, the delete times return to ball park of the times seen when running in release mode. Of course, you give up some debugging. I believe that an explanation of what this macro does might be found here.
In the final analysis, I think Microsoft has some serious work to to do here. The performance of their hashed containers, and to some lesser extent, the pre-C++11 associative containers, needs some serious examination. If the library is going to run this much slower than the competition, I need a good explanation why.
4 users commented in " VC++ 10 Hash Table Performance Problems "
Follow-up comment rss or Leave a TrackbackI work with Dinkumware to maintain Microsoft's C++ Standard Library implementation. Fortunately, we've already fixed the horrible IDL=2 penalties in VC11's unordered containers. I've also improved our hashes.
VC and GCC have radically different string implementations. VC uses the Small String Optimization, while GCC (according to my knowledge) still uses Copy-On-Write, although C++11 is going to force them to stop doing that. This shouldn't affect your const string * numbers, but VC and GCC still have different hash algorithms.
Here's a cleaned-up perf test, comparing VC11 to GCC 4.6.1, that uses the same string representation (just a pair of pointers) and the same hash (FNV-1a, identical to VC11's - I don't know what GCC uses).
Here's a convenient table:
Comparing optimized release mode, VC11 and GCC 4.6.1 are identical for map (as expected - it's hard to seriously screw up a red-black tree). VC11 is somewhat faster for unordered_map (this surprised me; I thought we were generally slower).
I've included separate Od IDL=0 and Od IDL=2 numbers to distinguish the lack of optimizations from IDL=2 penalties. IDL=2 exacts a 2.6x penalty for map insertion, 2.4x for unordered_map insertion, and essentially 0 for destruction (yay!).
Stephan T. Lavavej
Visual C++ Libraries Developer
stl@microsoft.com
Hi Mark,
I tried VC++ 11 release build, got similar result as you do, but once I changed the optimization flag to /Ox, it gets much faster, here is my result:
VC++11 Release with /Ox flag
E:\>HashTest.exe
Read in 1000000 strings
Testing container: unordered_map
Filling... 0.164 Deleting... 0.007
... <9 lines deleted by mrn>
Entries: 74208, Fill time: 0.1611, Delete time: 0.0082
Testing container: unordered_map
Filling... 0.192 Deleting... 0.006
... <9 lines deleted by mrn>
Entries: 74208, Fill time: 0.1909, Delete time: 0.0072
Testing container: map
Filling... 0.485 Deleting... 0.011
... <9 lines deleted by mrn>
Entries: 74208, Fill time: 0.4868, Delete time: 0.0102
Testing container: map
Filling... 0.63 Deleting... 0.01
... <9 lines deleted by mrn>
Filling... 0.64 Deleting... 0.01
Entries: 74208, Fill time: 0.6278, Delete time: 0.0094
VC++11 Release default settings
E:\>HashTest.exe
Read in 1000000 strings
Testing container: unordered_map
... <9 lines deleted by mrn>
Filling... 0.592 Deleting... 0.018
Entries: 74208, Fill time: 0.5939, Delete time: 0.02
Testing container: unordered_map
Filling... 0.634 Deleting... 0.015
... <9 lines deleted by mrn>
Entries: 74208, Fill time: 0.6227, Delete time: 0.0117
Testing container: map
Filling... 1.576 Deleting... 0.014
... <9 lines deleted by mrn>
Filling... 1.568 Deleting... 0.016
Entries: 74208, Fill time: 1.5708, Delete time: 0.0156
Testing container: map
Filling... 1.736 Deleting... 0.012
... <9 lines deleted by mrn>
Filling... 1.736 Deleting... 0.011
Entries: 74208, Fill time: 1.735, Delete time: 0.0123
g++ 4.6.1 with -O3
$ a.exe
Read in 1000000 strings
Testing container: unordered_map
Filling... 0.168 Deleting... 0.019
... <9 lines deleted by mrn>
Entries: 74208, Fill time: 0.158, Delete time: 0.0191
Testing container: unordered_map
Filling... 0.177 Deleting... 0.006
... <9 lines deleted by mrn>
Filling... 0.184 Deleting... 0.006
Entries: 74208, Fill time: 0.181, Delete time: 0.0065
Testing container: map
Filling... 0.798 Deleting... 0.015
... <9 lines deleted by mrn>
Filling... 0.792 Deleting... 0.015
Entries: 74208, Fill time: 0.802, Delete time: 0.0159
Testing container: map
Filling... 0.968 Deleting... 0.007
... <9 lines deleted by mrn>
Entries: 74208, Fill time: 0.9329, Delete time: 0.0075
@Lailin:
Good test - I tend to use the defaults for Release mode under Visual Studio, but I did jack up g++ to -O3, I probably should have compared apples to apples and left g++ options to the default as well.
I wonder if /Ox changes the ITERATER_DEBUG_LEVEL value from 1 to 0?
- Mark
Mark Nelson> I wonder if /Ox changes the ITERATER_DEBUG_LEVEL value from 1 to 0?
Optimizations don't affect _ITERATOR_DEBUG_LEVEL. In VC10+, IDL defaults to 0 (no checking, maximum speed) in release mode (the /MT or /MD options) and defaults to 2 (exhaustive correctness checks, including iterator invalidation) in debug mode (the /MTd or /MDd options).
In release mode, IDL=1 can be explicitly requested. This provides a set of security checks (including out-of-bounds access, but NOT including iterator invalidation). If an attacker is attempting to exploit a buggy program that misuses the STL, and IDL=1 detects such misuse, it terminates the program. This downgrades whatever the attack would have been, to a denial of service which is the least bad result. In VC8 and VC9, this used to be the default in release mode (when it was called _SECURE_SCL=1), but we changed the default in VC10 because of the performance penalty. IDL=2 cannot be requested in release mode (you get an error if you try).
In debug mode, both IDL=0 and IDL=1 can be explicitly requested.
IDEs and build systems typically pair release mode with optimizations, and debug mode without optimizations, but they aren't fundamentally linked.
Leave A Reply
You can insert source code in your comment without fear of too much mangling by tagging it to use the iG:Syntax Hiliter plugin. A typical usage of this plugin will look like this:[c]
Note that tags are enclosed in square brackets, not angle brackets. Tags currently supported by this plugin are: as (ActionScript), asp, c, cpp, csharp, css, delphi, html, java, js, mysql, perl, python, ruby, smarty, sql, vb, vbnet, xml, code (Generic). If you post your comment and you aren't happy with the way it looks, I will do everything I can to edit it to your satisfaction.int main()
{
printf( "Hello, world!\n" );
return 1;
}
[/c]