Tech Off Thread

18 posts

Forum Read Only

This forum has been made read only by the site admins. No new threads or comments can be added.

What's behind string.GetHashCode()?

Back to Forum: Tech Off
  • User profile image
    erik

    Hi,

    I need to hash a string value an array member.  The array size is 1009, so to map a string to an array member I do something like this:

    int arrayMember = (ABS(myStr.GetHashCode()) % 1009

    Everything seems to work great, but at some point later in the day,  I'll run the routine again but get a different hash value for the same string value. 

    My question is whether there is something behind GetHashCode() that might cause the results to change for different runs of the program.  Is GetHashCode seeded in some way when the program initializes, perhaps? 

    Thanks!

    -Erik

  • User profile image
    Sven Groot

    According to reflector it's a fairly standard hash code algorithm. It shouldn't vary given the same input.

  • User profile image
    androidi

    erik wrote:
    to map a string to an array member I do something like this:

    int arrayMember = (ABS(myStr.GetHashCode()) % 1009



    Trying to get into the "Daily WTF" (or whatever it's called now) or what are you trying to do... Actually never mind, I don't want to hear it. Perplexed

  • User profile image
    Matthew van Eerde

    1009 is prime, which is a tantalizing clue.

  • User profile image
    TommyCarlier

    A hashing function ALWAYS has to return the same hash-code for the same value. I seriously doubt the built-in string class would return different hash-codes for the same string.

  • User profile image
    odujosh

    Not enough information. From the information you did provide we have the question, "Why would you ever want to do that?"

    If you are curious about how string.ToHash works download reflector.

    From a quick look. From the look of your problem you should be in deep enough to work your way through. Looks like the answer would always be the same because they are just doing some math on the int equivalent.

    [ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
    public override unsafe int GetHashCode()
    {
        fixed (char* text = ((char*) this))
        {
            char* chPtr = text;
            int num = 0x15051505;
            int num2 = num;
            int* numPtr = (int*) chPtr;
            for (int i = this.Length; i > 0; i -= 4)
            {
                num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                if (i <= 2)
                {
                    break;
                }
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                numPtr += 2;
            }
            return (num + (num2 * 0x5d588b65));
        }
    }
    

  • User profile image
    erik

    Thanks for all the replies -- I appreciate it.  Actually, I figured it out last night (it literally woke me up). 

    I was writing/testing the code on my 32-bit system and then running new tests on a 64-bit system.  GetHashCode() returns a different int value for a given string on different platforms.  For example (on .NET 3.5), the string "professional" hashes to 1867359584 on x86, but then hashes to 1306225169 on x64.

    The reason I missed it was that some of the tests on the 64-bit server came though ASP.NET, which was running in it's default 32-bit config.  I just didn't put together the variances before pinging the board.

    As for what I am doing, sorry about the narrow question, I didn't want to drown the basic problem in details.  I'm doing a poor-man's search engine and I needed a reasonably fast way to store a couple of million words (incl. numbers, amounts, dates, whatever). 

    So, the project is sort of a big hash table that keeps it's buckets on a DB.  Words are mapped to a bucket by taking their hashcode modulo the number of buckets.  The number of buckets is prime because it's a simple way to get an even distribution of words across the buckets. 

  • User profile image
    amotif

    You shouldn't rely on the implementation details of String.GetHashCode() remaining the same.

    Also, GetHashCode isn't guaranteed to return the same value for a given object value from one process to the next, per the docs:


    "The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again."

  • User profile image
    littleguru

    You should never rely on what GetHashCode() returns. Especially since the default implementation returns (at least under 32 bit systems and right now when I write this post) the reference value (I don't know if it is exactly the pointer in memory) of the object.

  • User profile image
    Minh

    erik wrote:
    The number of buckets is prime because it's a simple way to get an even distribution of words across the buckets.

    Sorry, I'm bad in math, but how does having (prime #) of buckets ensure an even distribution. I thought that's more of a function of the hash algorithm.

    Say, if you have a poor hash function:

    return _myString.GetHashCode() % 2;

    then, having 1009 buckets wouldn't help. Would it?

  • User profile image
    erik

    > how does having (prime #) of buckets ensure an even distribution

    It's just an ancient approach that's easy to implement and that I chose when starting this adventure.  But looking at it again, I agree that any value could be used so long as the hash codes I get for each string are well-distributed.  Thanks for pointing that out.  I have a feeling that I'll be changing the hash approach quite often after I start tuning the project with better data.

  • User profile image
    littleguru

    Minh wrote:
    
    erik wrote:
    The number of buckets is prime because it's a simple way to get an even distribution of words across the buckets.

    Sorry, I'm bad in math, but how does having (prime #) of buckets ensure an even distribution. I thought that's more of a function of the hash algorithm.

    Say, if you have a poor hash function:

    return _myString.GetHashCode() % 2;

    then, having 1009 buckets wouldn't help. Would it?


    Some older hashing algorithms required the bucket number to be a prime... That allowed to hasing function to add the items in a distributed manner.

  • User profile image
    Minh

    littleguru wrote:
    
    Some older hashing algorithms required the bucket number to be a prime... That allowed to hasing function to add the items in a distributed manner.
    Aahh... You mean the output of the hashing function's maximum is a prime number, not Int32.Max like .Net?

    But that doesn't prevent you from mapping that to a smaller set of buckets, though, right? If the distribution is even, shouldn't matter.

  • User profile image
    Matthew van Eerde

    Yes, if the distribution is even.  Using a prime number of buckets helps protect you from unevennesses in the distribution.

    This is why the reproductive cycle of cicadas is 17 years, for example, rather than 16; it minimizes the chance of resonance with a predator cycle.

  • User profile image
    Frank Hileman

    The "mod prime" technique has been considered obsolete for many years. You can read why in these links. So any bucket size is fine with a good hash function, and powers of two are typically used.

    http://www.azillionmonkeys.com/qed/hash.html

    http://burtleburtle.net/bob/hash/

  • User profile image
    odujosh

    Frank Hileman wrote:
    The "mod prime" technique has been considered obsolete for many years. You can read why in these links. So any bucket size is fine with a good hash function, and powers of two are typically used.

    http://www.azillionmonkeys.com/qed/hash.html

    http://burtleburtle.net/bob/hash/



    I tend to distrust sites with silly names. Maybe I am alone here:)

  • User profile image
    Matthew van Eerde

    Well, there are obvious benefits from doing a mod-prime.  There are also obvious benefits from doing a mod-(power-of-2).

    I would caution against doing a mod-n, though, where n is simultaneously prime and a power of 2... it's true you get the benefits of both techniques, but there are some unexpected side effects due to your restricted range of choice in n. Wink

  • User profile image
    Minh

    Matthew van Eerde wrote:
    
    I would caution against doing a mod-n, though, where n is simultaneously prime and a power of 2... it's true you get the benefits of both techniques, but there are some unexpected side effects due to your restricted range of choice in n. Wink

    Why? It worked for Neo Smiley


Conversation locked

This conversation has been locked by the site admins. No new comments can be made.