published: August 30, 2011 — tags: c — 7 comments

GCC hash_map vs. unordered_map

Long story cut short: Do not use hash<const char*>. Such keys will be treated as integers. Use C++ strings instead (or create a custom hashing function).

It started innocently but, before noticing anything, I was writing C++ code again. I had a problem to solve that asked for something like Python's dictionaries. There is this thing called hash_map in C++, which apparently wasn't included in the standard library because the standards committee ran out of time, but it has made its way into most compilers anyway. So hash_map it is, I thought to myself.

Before discussing the real problem, let's start with a smaller issue. I started writing on my Cygwin installation with GCC 3.4.4 using <ext/hash_map>. Fine, but after moving to a current Linux distro with GCC 4.4.5 I got a nasty deprecation warning. It seems that starting with GCC 4.3 one should rather use unordered_map (which isn't part of the standard library just yet either). Since I wanted the program to compile with both compilers without warnings, I've changed the includes into this:

#define GCC_VERSION (__GNUC__ * 10000 \
    + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)

#if GCC_VERSION >= 40300
#include <tr1/unordered_map>
using namespace std::tr1;
#define hash_map unordered_map

#else
#include <ext/hash_map>
using namespace __gnu_cxx;
#endif

using namespace std;

This way I could leave all the rest of the code unchanged, you'd think. Well, guess again. If you're a C die hard, who likes using const char* (which is passionately used even in the most popular C++ hash_map example), you're in for a surprise. unordered_map (and probably everything else from the TR1/C++0x standards) doesn't know the concept of const char* any more. By design! To them it's just a pointer like any other, without any magical features. So, while old SGI's hash_map creates a hash from the string itself, this new class just takes the memory address as hash. You most certainly didn't want that.

I was surprised to learn this and I did that by running the benchmark from Attractive Chaos. According to the diagrams posted there, GCC's unordered_map is a terrible memory hog for some unknown reason when using cstrings as keys. It was said that it might have been a compiler bug (but it turned out to be a feature instead). I really couldn't believe that, therefore I downloaded the benchmark to check it myself. I have only tested map, hash_map and unordered_map. Both tests compiled just fine after adding #include <cstring> to their source for the sake of strcmp. This benchmark tries to fill the storage with 5 million int-int or cstring-int pairs (but most of them are duplicates), and then reports the extra memory and time used for that plus the number of elements actually stored. Here are the results on a 32-bit relatively slow machine:

test	keys	elements	memory	CPU time
tr1_unordered_map	int	625792	10 MB	1.06
tr1_unordered_map	cstring	5000000	78 MB	1.04
sgi_hash_map	int	625792	10 MB	1.14
sgi_hash_map	cstring	625792	10 MB	2.03
sgi_map	int	625792	220 kB	4.44
sgi_map	cstring	625792	228 kB	9.85

unordered_map has utilised 8 times as much memory when using string keys but, wait a second, it has also stored 8 times as much data! Here is the culprit: duplicate keys have been stored. It became immediately clear that what was really stored were pointers and not string hashes. Unfortunately there's some extra work, that needs to be done, in order to support C-type strings. Instead of the original code of test.cc:

struct eqstr {
    inline bool operator()(const char *s1, const char *s2) const {
        return strcmp(s1, s2) == 0;
    }
};

typedef unordered_map<const char*, int, hash<const char*>, eqstr> strhash;

we have to specify our own hashing function and write:

struct eqstr {
    inline size_t operator()(const char *s) const {
        size_t hash = 1;
        for (; *s; ++s) hash = hash * 5 + *s;
        return hash;
    }
    inline bool operator()(const char *s1, const char *s2) const {
        return strcmp(s1, s2) == 0;
    }
};

typedef unordered_map<const char*, int, eqstr, eqstr> strhash;

I've taken the hashing function from the libstdc++ mailinglist. I have tried something more sophisticated, like FLV, but then the program was running a few times slower. There's really no need for that. We shouldn't worry about hash collisions that much and rather pick a fast function. A few collisions won't harm performance that much, it will just take a few calls to strcmp to find the correct key when required. Look at the performance comparison now. Not bad:

test	keys	elements	memory	CPU time
tr1_unordered_map	int	625792	10 MB	1.06
tr1_unordered_map	cstring	625792	10 MB	1.98
sgi_hash_map	int	625792	10 MB	1.14
sgi_hash_map	cstring	625792	10 MB	2.03
sgi_map	int	625792	220 kB	4.44
sgi_map	cstring	625792	228 kB	9.85

Share this! Other articles

Comments

Thanks for this very useful analysis. I wouldn't recommend putting the "using namespace" statements in header files even though because it promotes everything in those namespaces into the global namespace. If you are already using the hash_* containers without the namespace, I would suggest something like this:

#define GCC_VERSION (__GNUC__ * 10000 \
    + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)

#if GCC_VERSION >= 40300
#include <tr1/unordered_map>

#define hash_map std::tr1::unordered_map

#else
#include <ext/hash_map>

#define hash_map __gnu_cxx::hash_map

#endif

See item Item 40 in Herb Sutter's book "More Exceptional C++" for other ideas on migrating to namespaces.

If you're still using C strings, you're writing bad code out of ancient habit that will be hard for anyone to maintain. Don't keep doing things just because it was what you learned first.

Seriously: there's no excuse.

I'm not saying that I would prefer picking Cstring over string, but nevertheless I would expect it to work with unordered map, given all the effort to provide compatibility with C. The actual use of CString would depend e.g. on the libraries I'd have to operate with.

As for the remark about writing bad code, I had a look at it and noticed that there were no checks for null pointers, which unnecessarily limits the example and leads to a crash if someone tries to use a nullptr as key. So here is an updated example:

#define GCC_VERSION (__GNUC__ * 10000 \
    + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)

#if GCC_VERSION >= 40300
#include <tr1/unordered_map>
#define hash_map std::tr1::unordered_map

#else
#include <ext/hash_map>
#define hash_map __gnu_cxx::hash_map
#endif

#include <string.h>
#include <assert.h>

struct eqstr {
    inline size_t operator()(const char *s) const {
        size_t hash = 1;
        if (!s)
            return 0;
        for (; *s; ++s)
            hash = hash * 5 + *s;
        return hash;
    }
    inline bool operator()(const char *s1, const char *s2) const {
        if (!s1 || !s2)
            return s1 == s2;
        return strcmp(s1, s2) == 0;
    }
};

typedef hash_map<const char*, int, eqstr, eqstr> CstrIntHash;
typedef std::pair<const char*, int> CstrIntPair;

int main() {
    CstrIntHash data;
    data[0] = 1;
    data["ab"] = 2;
    data["ba"] = 2;
    data["ab"] = 1;
    data[0] = 0;
    assert(data.size() == 3);
    return 0;
}

The "assert(data.size() == 3);" is not going to work as intended. Although the string literal "ab" occurs twice, they will be assigned the same address in most compilers.

Test

Test 2

Test 3

Add your comment