Re: Fast hash of NSData?
Re: Fast hash of NSData?
- Subject: Re: Fast hash of NSData?
- From: Graham Cox <email@hidden>
- Date: Sun, 01 Dec 2013 16:36:36 +0100
On 30 Nov 2013, at 11:51 pm, Mike Abdullah <email@hidden> wrote:
> Anything short of a cryptographic hash is unsuitable for you here. You’re living dangerously, asking for trouble when a collision does happen.
So, I’ve written a test which scans a chosen folder, hashing each file (all files, not just images) and storing the hashes. If a hash collides, I store it in a list and log it.
Scanning my entire hard drive (excluding hidden files), which took several hours, sure I had plenty of collisions - but absolutely no false ones - they all turned out to be genuine duplicates of existing files. This is using the FNV-1a 64-bit hash + length approach.
I’m thinking this is good enough, really. The odds of a particular user having two different image files that collide, and happening to add those exact images at once to our app must be astronomically low. Talk me out of it :)
Here’s the code, if anyone is interested enough to want to reproduce these results, or spot a flaw in my test:
The actual hash, which is part of a category on NSData:
- (u_int64_t) fnvHash64
{
unsigned char* p = (unsigned char*)[self bytes];
u_int64_t h = 14695981039346656037U;
NSUInteger i = 0, len = [self length];
while( i < len )
h = p[i++] ^ (h * 1099511628211U);
return h;
}
The hash class is much as Kyle suggested:
@interface GCDataHash : NSObject <NSCopying>
{
@private
u_int64_t mHash;
NSUInteger mLength;
}
@property (nonatomic, readonly) NSUInteger length;
- (instancetype) initWithData:(NSData*) data;
- (BOOL) isEqual:(id) object;
- (NSUInteger) hash;
@end
@implementation GCDataHash
@synthesize length = mLength;
- (instancetype) initWithData:(NSData*) data
{
self = [super init];
if( self )
{
mHash = [data fnvHash64];
mLength = [data length];
}
return self;
}
- (NSUInteger) hash
{
return mHash;
}
- (BOOL) isEqual:(id) object
{
if( object == self )
return YES;
if(![object isKindOfClass:[GCDataHash class]])
return NO;
GCDataHash* other = (GCDataHash*)object;
return other->mLength == mLength && other->mHash == mHash;
}
- (NSString*) description
{
return [NSString stringWithFormat:@"%@ length=%lu, hash=%lu", [super description], mLength, [self hash]];
}
- (id) copyWithZone:(NSZone *)zone
{
return [self retain];
}
This is the test, which expects a directory URL. ivars ‘collisions' is a mutable array, ‘hashes' a mutable dictionary. With as large a data set as I can reasonably muster (which is my entire hard disk), it returns 0 collisions. It's set to go into packages so it would scan all the internals of my iPhoto library, among others.
- (void) runTestWithURL:(NSURL*) url
{
[collisions removeAllObjects];
[hashes removeAllObjects];
NSDirectoryEnumerator* dirEnum = [[NSFileManager defaultManager] enumeratorAtURL:url
includingPropertiesForKeys:nil
options:NSDirectoryEnumerationSkipsHiddenFiles
errorHandler:^BOOL(NSURL *url, NSError *error)
{
return YES;
}];
NSUInteger count = 0;
for( NSURL* file in dirEnum )
{
@autoreleasepool
{
NSError* error = nil;
NSData* data = [NSData dataWithContentsOfURL:file
options:NSDataReadingMappedIfSafe | NSDataReadingUncached
error:&error];
if( data )
{
GCDataHash* hash = [[GCDataHash alloc] initWithData:data];
// check for hash collision
NSURL* existing = [hashes objectForKey:hash];
if( existing )
{
// the hash collided. See if the data is actually different, or whether it's just a copy
// of an existing file.
NSData* otherData = [NSData dataWithContentsOfURL:existing];
if(![otherData isEqualToData:data])
{
// we have a bona fide collision - hash matches but data doesn't. Save that.
NSDictionary* crunch = @{ hash : file };
[collisions addObject:crunch];
NSLog(@"%lu: collision for path '%@', collided with '%@', hash = %@", collisions.count + 1, file, existing, hash );
}
}
else
[hashes setObject:file forKey:hash];
[hash release];
++count;
}
}
}
NSLog(@"hashes tested: %lu, collided: %lu", count, collisions.count );
}
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden