Skip to content

Review hash functions #859

Open
Open
@bnoordhuis

Description

@bnoordhuis

I've observed that the test case from #456 (which has about 15,000 variables, IIRC) shows pretty bad collision rates, even after switching from k&r hash to perl hash.

I suspect two things are in play:

  1. We use 32 bits hash functions. Per the Birthday Paradox, that means a 50% chance of collision after ~25,000 elements, and growing rapidly1. A 64 bits hash function reaches that point only after 1.6 billion elements.

  2. We use multiplicative hashes and those tend to have poor entropy in their low bits. Perl hash tries to mitigate that with a final shuffle but I have a hunch that a 64 bits big prime multiplicative hash performs better.

Counterpoint contra (2): we could get better distribution out of our 32 bits multiplicative hash functions with h >> (32-N), where N is a power of two, but only if we used power-ot-two hash tables. But we don't because those are more memory hungry.


1 ninja edit: which we then bucket - truncating the result, basically - compounding the effect

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions