Description
I've observed that the test case from #456 (which has about 15,000 variables, IIRC) shows pretty bad collision rates, even after switching from k&r hash to perl hash.
I suspect two things are in play:
-
We use 32 bits hash functions. Per the Birthday Paradox, that means a 50% chance of collision after ~25,000 elements, and growing rapidly1. A 64 bits hash function reaches that point only after 1.6 billion elements.
-
We use multiplicative hashes and those tend to have poor entropy in their low bits. Perl hash tries to mitigate that with a final shuffle but I have a hunch that a 64 bits big prime multiplicative hash performs better.
Counterpoint contra (2): we could get better distribution out of our 32 bits multiplicative hash functions with h >> (32-N)
, where N
is a power of two, but only if we used power-ot-two hash tables. But we don't because those are more memory hungry.
1 ninja edit: which we then bucket - truncating the result, basically - compounding the effect