Re: Fast hash of NSData?
Re: Fast hash of NSData?
- Subject: Re: Fast hash of NSData?
- From: Scott Ribe <email@hidden>
- Date: Sun, 01 Dec 2013 16:52:22 -0700
On Dec 1, 2013, at 1:31 PM, Graham Cox <email@hidden> wrote:
> That doesn’t mean it’s impossible, but it surely indicates that it’s vanishingly improbable?
Actually, no. The next file added could collide, and your test provides 0 information as to how likely that is, or even if the fact that you got no collisions was an outcome that had a vanishingly small probability. Now, the probability of collisions is in fact small, but your test doesn't prove it--it just provides evidence that the probability analysis is not fantastically wrong.
What would provide stronger evidence would be looking at both sets of 63-bit hashes, all 3 sets of 62-bit hashes, and so on until you find the size at which you get collisions. But this would be an exercise purely for curiosity.
Now, sometime ago in this discussion, Kyle mentioned the "birthday problem", and it is a valid point, in that the probability of a collision is proportional to an exponential of the square of the number of documents. So, while for 1,000 documents the chance of a collision is down around 1 in 10^14, for 1,000,000 documents it's up to 1 in 10^7. But there's an easy solution, use a 128-bit hash. (The probability of collision is inversely proportional to an exponential of the number of hashes, in other words to an exponential of 2^ number of bits.) So with 128 bits you're at 1 in 10^33 for 1,000 documents, and 1 in 10^27 for 1,000,000 documents. On the other hand, with a 32-bit hash, at 1,000 documents your odds are not so comforting (1 in 10,000 chance of a collision) and for 1,000,000 documents, ugh, the odds of *not* having a collision are only about 1 in 10^51. All of that is assuming a hash with perfectly random distribution, and rounded to 1 significant digit as is appropriate for such rough approximations.
Of course, you can always treat the hash as a quick index to find possible matches, then compare contents. Even with 32-bit hashes, you would not be comparing files very often, with 64-bit probably never, with 128-bit never, to a degree of certainty greater than anything else in your life or your customers'...
--
Scott Ribe
email@hidden
http://www.elevated-dev.com/
(303) 722-0567 voice
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden