Perfect Hashing of Very Short Strings

This post is a (bit modified) translation of my older post in Russian.

Let a small set of very short strings (1—4 bytes here) is given and we want to construct a dictionary with strings from this set being used as keys. A simple solution is to use a universal hash table with a generic hash function (usually the one provided by the library implementing the hash table, e.g. C++ Standard library provides some std::hash implementation for strings) such as Murmur or FNV hashes.

However such a solution is a heavyweight one in some cases, for example when keys are known in advance and are short enough. This is the case when a generic array may handle the job. For this to be true we have to find a hash function that suffers no collisions for inputs from the set of our keys, so called perfect hash function.

Another feature I want is high hash computation speed, so I am not going to use complex and costly hash functions. For example Pearson hashing is too complex for me (plus I am considering CPUs capable of fast integer hardware multiplication). Finally, I am not going to use division or remainder operations. Instead I will use an array having size equal to some power of 2 (I am OK with some cells left unused).

For strings no longer than four bytes I can use 32-bit words as keys, «NUL» character is not used in texts so one-character strings have only the least significant byte being non-zero and the other three bytes being zero, two-character strings have the least significant half non-zero and the other half zero and so on. The function casting strings to integers that way may be written in C++ as follows.

inline uint32_t to_number(const std::string &word)
{
  register uint32_t result = 0;
  switch (word.size())
  {
  case 4: result  = (uint32_t)word[3] << 24;
  case 3: result |= (uint32_t)word[2] << 16;
  case 2: result |= (uint32_t)word[1] << 8;
  case 1: result |= word[0];
  default: return result;
  }
}

The hash function I am going to construct belongs to the family defined by the following template (one has to find N and b for the given set of keys).

template <uint32_t N, unsigned b>
uint32_t ssphash(uint32_t word)
{
  return (word * N) >> b;
}

Of course, there is no guarantee that for the given set of keys this family of functions contains a perfect hash function. But due to its simplicity and small combinatorial size of the problem (for modern computers) we can try to brute-force N and b (the software was written for MSVC and uses _alloca from malloc.h and intrinsic __lzcnt from intrin.h, some CPUs do not support the latter). The size of the array (our hash table) is taken as the smallest power of 2 not less than quantity of keys.

For examples for the following set of keys (operator and punctuation lexemes of C++)

+ - ! ~ # % ^ & * ( ) / // /* */ [ ] { } | ? : ; , . && || ++ -- < > " ' <= >= = == -> .* ->* << >>

I have got N == 242653, b == 17.

Реклама

Добавить комментарий

Заполните поля или щелкните по значку, чтобы оставить свой комментарий:

Логотип WordPress.com

Для комментария используется ваша учётная запись WordPress.com. Выход / Изменить )

Фотография Twitter

Для комментария используется ваша учётная запись Twitter. Выход / Изменить )

Фотография Facebook

Для комментария используется ваша учётная запись Facebook. Выход / Изменить )

Google+ photo

Для комментария используется ваша учётная запись Google+. Выход / Изменить )

Connecting to %s

%d такие блоггеры, как: