fgda logo for printout

GCC hash_map vs. unordered_map

Long story cut short: Do not use hash<const char*>. Such keys will be treated as integers. Use C++ strings instead (or create a custom hashing function).

It started innocently but, before noticing anything, I was writing C++ code again. I had a problem to solve that asked for something like Python's dictionaries. There is this thing called hash_map in C++, which apparently wasn't included in the standard library because the standards committee ran out of time, but it has made its way into most compilers anyway. So hash_map it is, I thought to myself.

Before discussing the real problem, let's start with a smaller issue. I started writing on my Cygwin installation with GCC 3.4.4 using <ext/hash_map>. Fine, but after moving to a current Linux distro with GCC 4.4.5 I got a nasty deprecation warning. It seems that starting with GCC 4.3 one should rather use unordered_map (which isn't part of the standard library just yet either). Since I wanted the program to compile with both compilers without warnings, I've changed the includes into this:

#define GCC_VERSION (__GNUC__ * 10000 \
    + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)

#if GCC_VERSION >= 40300
#include <tr1/unordered_map>
using namespace std::tr1;
#define hash_map unordered_map

#else
#include <ext/hash_map>
using namespace __gnu_cxx;
#endif

using namespace std;

This way I could leave all the rest of the code unchanged, you'd think. Well, guess again. If you're a C die hard, who likes using const char* (which is passionately used even in the most popular C++ hash_map example), you're in for a surprise. unordered_map (and probably everything else from the TR1/C++0x standards) doesn't know the concept of const char* any more. By design! To them it's just a pointer like any other, without any magical features. So, while old SGI's hash_map creates a hash from the string itself, this new class just takes the memory address as hash. You most certainly didn't want that.

I was surprised to learn this and I did that by running the benchmark from Attractive Chaos. According to the diagrams posted there, GCC's unordered_map is a terrible memory hog for some unknown reason when using cstrings as keys. It was said that it might have been a compiler bug (but it turned out to be a feature instead). I really couldn't believe that, therefore I downloaded the benchmark to check it myself. I have only tested map, hash_map and unordered_map. Both tests compiled just fine after adding #include <cstring> to their source for the sake of strcmp. This benchmark tries to fill the storage with 5 million int-int or cstring-int pairs (but most of them are duplicates), and then reports the extra memory and time used for that plus the number of elements actually stored. Here are the results on a 32-bit relatively slow machine:

test keys elem­ents mem­ory CPU time
tr1_unordered_map int 625792 10 MB 1.06
tr1_unordered_map cstring 5000000 78 MB 1.04
sgi_hash_map int 625792 10 MB 1.14
sgi_hash_map cstring 625792 10 MB 2.03
sgi_map int 625792 220 kB 4.44
sgi_map cstring 625792 228 kB 9.85

unordered_map has utilised 8 times as much memory when using string keys but, wait a second, it has also stored 8 times as much data! Here is the culprit: duplicate keys have been stored. It became immediately clear that what was really stored were pointers and not string hashes. Unfortunately there's some extra work, that needs to be done, in order to support C-type strings. Instead of the original code of test.cc:

struct eqstr {
    inline bool operator()(const char *s1, const char *s2) const {
        return strcmp(s1, s2) == 0;
    }
};

typedef unordered_map<const char*, int, hash<const char*>, eqstr> strhash;

we have to specify our own hashing function and write:

struct eqstr {
    inline size_t operator()(const char *s) const {
        size_t hash = 1;
        for (; *s; ++s) hash = hash * 5 + *s;
        return hash;
    }
    inline bool operator()(const char *s1, const char *s2) const {
        return strcmp(s1, s2) == 0;
    }
};

typedef unordered_map<const char*, int, eqstr, eqstr> strhash;

I've taken the hashing function from the libstdc++ mailinglist. I have tried something more sophisticated, like FLV, but then the program was running a few times slower. There's really no need for that. We shouldn't worry about hash collisions that much and rather pick a fast function. A few collisions won't harm performance that much, it will just take a few calls to strcmp to find the correct key when required. Look at the performance comparison now. Not bad:

test keys elem­ents mem­ory CPU time
tr1_unordered_map int 625792 10 MB 1.06
tr1_unordered_map cstring 625792 10 MB 1.98
sgi_hash_map int 625792 10 MB 1.14
sgi_hash_map cstring 625792 10 MB 2.03
sgi_map int 625792 220 kB 4.44
sgi_map cstring 625792 228 kB 9.85

Share Share this! Other articles

Comments

Paul M. Dubuc on October 27, 2011, 15:02:

Thanks for this very useful analysis. I wouldn't recommend putting the "using namespace" statements in header files even though because it promotes everything in those namespaces into the global namespace. If you are already using the hash_* containers without the namespace, I would suggest something like this:

#define GCC_VERSION (__GNUC__ * 10000 \
    + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)

#if GCC_VERSION >= 40300
#include <tr1/unordered_map>

#define hash_map std::tr1::unordered_map

#else
#include <ext/hash_map>

#define hash_map __gnu_cxx::hash_map

#endif

See item Item 40 in Herb Sutter's book "More Exceptional C++" for other ideas on migrating to namespaces.