Re: Fast hash of NSData?
Re: Fast hash of NSData?
- Subject: Re: Fast hash of NSData?
- From: Graham Cox <email@hidden>
- Date: Mon, 02 Dec 2013 18:23:36 +0100
On 2 Dec 2013, at 3:43 pm, Roland King <email@hidden> wrote:
> I keep not seeing the problem here .. you have two NSData's, you want to know if they are the same, at the time you store them you store the data and some fast-but-not-necessarily-very-good hash. So 'are they the same' becomes
>
> is the underlying data the the same length? - no - not the same
> do they have the same rapid hash? - no - not the same
> actually compare the data byte by byte if the length and hash are the same to work out if they are the same. rare case.
>
> You always get the right answer and even a fairly cruddy hash collides rarely enough to make a full data test unlikely.
>
> reductio ad absurdum leads to not even bothering to hash them at all, comparing the length and if it's the same, just comparing the data.
Essentially that’s correct, but the original use case means that some care is needed to keep things moving at a reasonable clip.
Let’s say I have an array of data blocks I’m using, and the user adds a new one. I need to know whether I’ve seen this data before. I could search the list, doing a comparison with each one, and
afterwards I know if I’ve seen it before or not, but the worst case is the most common one - I haven’t seen it before, but I’ve just spent ages doing a potentially massive byte compare to check.
The hash here is both a quick way to check whether I might have seen it before and to serve as a key to the data that yields that hash. I mean, that’s how a hash is typically used anyway. I compute the hash and I can go straight away to the data that matches and ignore all others, because it’s keyed by the hash. It’s only at that point I need to do an extra compare to check whether it’s a hash collision or a genuine match. (And the argument has become, is this final compare really needed? Looks unlikely, but doing it has a low cost, so why not, appears to be the answer).
> Reductio to even more absurdum says if two chunks of data are the same length, just store the both of 'em.
Except that no longer solves the original problem, which is to make sure that adding two identical blocks of data (images, as it happens) only yields one copy of said data in the eventual file stream for that document. If I just store them without bothering to check for equality, adding the same image 100 times adds the same data 100 times to the file. It seems pretty obvious you’d want to avoid that.
—Graham
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden