Re: Fast hash of NSData?
- Subject: Re: Fast hash of NSData?
- From: Mike Abdullah <email@hidden>
- Date: Sat, 30 Nov 2013 22:51:55 +0000
On 30 Nov 2013, at 10:37, Graham Cox <email@hidden> wrote:
>
> On 30 Nov 2013, at 12:41 am, Kyle Sluder <email@hidden> wrote:
>
>> That’s not a good idea. If the file on disk changes out from under you, the cache no longer reflects the data.
>
> That can’t happen - the image data is embedded in the file stream at the moment, not held as a separate file. While I intend to change to that at some point, I don’t do it now. When I do, I’ll also (have to) change the way that the image references in the main stream are written. So the hash approach is only really a stop-gap - it gets me better performance without changing the basic architecture.
>
> @Scott Ribe:
>
>> I recently (last week, no kidding) investigated this question, and went with murmur hash.
>
> I went with a 64-bit FNV-1a, which looks on a par with murmur, according to the page linked by Eric: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed. Which admittedly I am taking at face value, having only profiled my own code for speed, not randomness, etc. The main reason being that an implementation of FNV-1a was right there on the page :)
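
(For anyone following along: 64-bit FNV-1a over an NSData's bytes boils down to roughly the sketch below. The constants are the standard FNV ones; the function name is made up here.)

#import <Foundation/Foundation.h>

// Minimal sketch of 64-bit FNV-1a over the bytes of an NSData.
static uint64_t FNV1aHash64(NSData *data)
{
    uint64_t hash = 14695981039346656037ULL;   // standard FNV-1a 64-bit offset basis
    const uint64_t prime = 1099511628211ULL;   // standard FNV-1a 64-bit prime
    const uint8_t *bytes = data.bytes;
    NSUInteger length = data.length;

    for (NSUInteger i = 0; i < length; i++) {
        hash ^= bytes[i];   // XOR in the next byte, then multiply (the "1a" ordering)
        hash *= prime;
    }
    return hash;
}
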
Hi Graham,
Anything short of a cryptographic hash is unsuitable for you here. You’re living dangerously, and asking for trouble when a collision does eventually happen.
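
(If you do insist on hashing, at least use something from CommonCrypto. A rough sketch, nothing more; the function name is mine:)

#import <Foundation/Foundation.h>
#import <CommonCrypto/CommonDigest.h>

// Rough sketch: SHA-256 digest of an NSData via CommonCrypto.
// Note CC_LONG is 32-bit, so data over 4 GB would need the
// incremental CC_SHA256_Init/Update/Final API instead.
static NSData *SHA256DigestOfData(NSData *data)
{
    unsigned char digest[CC_SHA256_DIGEST_LENGTH];
    CC_SHA256(data.bytes, (CC_LONG)data.length, digest);
    return [NSData dataWithBytes:digest length:CC_SHA256_DIGEST_LENGTH];
}
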
As it happens, we have almost exactly the same scenario as you in Sandvox. We store documents as a file package, where each piece of media is kept on disk, and we want to avoid unnecessary duplication of that media. We experimented with hashing to help with this in the past, but in practice it wasn’t worth it.
Instead, what we now do is this: each time the document is saved, for any new media we scan through the existing media files looking for any that are the same length. Any that match we then properly compare for equality, and act on the result.
This sounds like a pretty poor algorithm, but in practice it works well. The chances of two files having exactly the same length are actually pretty slim. If you think of the length as a simple hash, it has the nice property that as equality testing gets more expensive (i.e. as the files in question get larger), collisions become ever less likely.
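
Roughly, the shape of it is this (a sketch with made-up names, not our actual code):

#import <Foundation/Foundation.h>

// Sketch: find an existing media file with identical contents, using
// file length as a cheap pre-screen before comparing bytes.
static NSString *ExistingDuplicateOfFile(NSString *newFile, NSString *mediaDirectory)
{
    NSFileManager *fileManager = [NSFileManager defaultManager];
    unsigned long long newLength =
        [[fileManager attributesOfItemAtPath:newFile error:NULL] fileSize];

    for (NSString *name in [fileManager contentsOfDirectoryAtPath:mediaDirectory error:NULL])
    {
        NSString *candidate = [mediaDirectory stringByAppendingPathComponent:name];
        unsigned long long candidateLength =
            [[fileManager attributesOfItemAtPath:candidate error:NULL] fileSize];

        // Only files of exactly the same length are worth comparing properly.
        if (candidateLength != newLength) continue;

        // Full comparison of contents; return the first genuine match.
        if ([fileManager contentsEqualAtPath:newFile andPath:candidate]) return candidate;
    }

    return nil;   // no existing media matches; copy the new file in
}
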
Similarly, the actual equality test is performed in chunks as we read off disk, rather than by processing the whole file at once (NSFileManager has a nice method for doing this with two files, and we have our own variants to handle one or both files being in memory), so we can bail out of the check as soon as a mismatch occurs. Again, this has the nice property that the further through the file you get, the more likely the files are to be equal.
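
One of those in-memory variants might look something like this (again a sketch only, not our real code; the chunk size and name are arbitrary):

#import <Foundation/Foundation.h>

// Sketch: compare in-memory data against a file on disk in chunks,
// bailing out at the first mismatch rather than reading the whole file.
static BOOL DataEqualsContentsOfFile(NSData *data, NSString *path)
{
    NSFileHandle *handle = [NSFileHandle fileHandleForReadingAtPath:path];
    if (!handle) return NO;

    NSUInteger const chunkSize = 64 * 1024;
    NSUInteger offset = 0;

    while (offset < data.length)
    {
        NSUInteger length = MIN(chunkSize, data.length - offset);
        NSData *fileChunk = [handle readDataOfLength:length];

        // A short read means the file is smaller than the data; a differing
        // chunk means a mismatch. Either way, stop immediately.
        if (fileChunk.length != length) return NO;
        if (![fileChunk isEqualToData:[data subdataWithRange:NSMakeRange(offset, length)]]) return NO;

        offset += length;
    }

    // Equal only if the file doesn't carry on past the end of the data.
    return ([handle readDataOfLength:1].length == 0);
}
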
If two files turn out to be equal, then we’ve done the work of reading both of them off disk (the worst case, where neither is in memory), which seems wasteful. But it isn’t, if you consider that with no testing at all the file would still have to be read once in its entirety and then written out to the new location, so both approaches come to roughly the same amount of work. And given the points above, in the event of a mismatch the odds are highly in our favour that it will be discovered without having done very much work.
All in all, it’s working pretty well for us. We’ve since adopted background document saving, which makes the process even smoother, but even on 10.6 it works well.
Mike.