• Open Menu Close Menu
  • Apple
  • Shopping Bag
  • Apple
  • Mac
  • iPad
  • iPhone
  • Watch
  • TV
  • Music
  • Support
  • Search apple.com
  • Shopping Bag

Lists

Open Menu Close Menu
  • Terms and Conditions
  • Lists hosted on this site
  • Email the Postmaster
  • Tips for posting to public mailing lists
Re: Fast hash of NSData?
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Fast hash of NSData?


  • Subject: Re: Fast hash of NSData?
  • From: Graham Cox <email@hidden>
  • Date: Sun, 01 Dec 2013 16:36:36 +0100

On 30 Nov 2013, at 11:51 pm, Mike Abdullah <email@hidden> wrote:

> Anything short of a cryptographic hash is unsuitable for you here. You’re living dangerously, asking for trouble when a collision does happen.


So, I’ve written a test which scans a chosen folder, hashing each file (all files, not just images) and storing the hashes. If a hash collides, I store it in a list and log it.

Scanning my entire hard drive (excluding hidden files), which took several hours, sure I had plenty of collisions - but absolutely no false ones - they all turned out to be genuine duplicates of existing files. This is using the FNV-1a 64-bit hash + length approach.

I’m thinking this is good enough, really. The odds of a particular user having two different image files that collide, and happening to add those exact images at once to our app must be astronomically low. Talk me out of it :)

Here’s the code, if anyone is interested enough to want to reproduce these results, or spot a flaw in my test:

The actual hash, which is part of a category on NSData:

- (u_int64_t)	fnvHash64
{
	unsigned char* p = (unsigned char*)[self bytes];

	u_int64_t 	h = 14695981039346656037U;
	NSUInteger i = 0, len = [self length];

	while( i < len )
		h = p[i++] ^ (h * 1099511628211U);

	return h;
}


The hash class is much as Kyle suggested:

@interface GCDataHash	: NSObject <NSCopying>
{
@private
	u_int64_t		mHash;
	NSUInteger	mLength;
}

@property (nonatomic, readonly) NSUInteger	length;

- (instancetype) initWithData:(NSData*) data;

- (BOOL)		isEqual:(id) object;
- (NSUInteger)	hash;

@end

@implementation GCDataHash

@synthesize length = mLength;

- (instancetype) initWithData:(NSData*) data
{
	self = [super init];
	if( self )
	{
		mHash = [data fnvHash64];
		mLength = [data length];
	}

	return self;
}


- (NSUInteger)	hash
{
	return mHash;
}


- (BOOL)		isEqual:(id) object
{
	if( object == self )
		return YES;

	if(![object isKindOfClass:[GCDataHash class]])
		return NO;

	GCDataHash* other = (GCDataHash*)object;

	return other->mLength == mLength && other->mHash == mHash;
}


- (NSString*)	description
{
	return [NSString stringWithFormat:@"%@ length=%lu, hash=%lu", [super description], mLength, [self hash]];
}


- (id)		copyWithZone:(NSZone *)zone
{
	return [self retain];
}

This is the test, which expects a directory URL. ivars ‘collisions' is a mutable array, ‘hashes' a mutable dictionary. With as large a data set as I can reasonably muster (which is my entire hard disk), it returns 0 collisions. It's set to go into packages so it would scan all the internals of my iPhoto library, among others.

- (void)			runTestWithURL:(NSURL*) url
{
	[collisions removeAllObjects];
	[hashes removeAllObjects];


	NSDirectoryEnumerator* dirEnum = [[NSFileManager defaultManager] enumeratorAtURL:url
												 includingPropertiesForKeys:nil
																	options:NSDirectoryEnumerationSkipsHiddenFiles
															   errorHandler:^BOOL(NSURL *url, NSError *error)
																{
																	return YES;
																}];

	NSUInteger count = 0;

	for( NSURL* file in dirEnum )
	{
		@autoreleasepool
		{
			NSError* error = nil;
			NSData* data = [NSData dataWithContentsOfURL:file
												 options:NSDataReadingMappedIfSafe | NSDataReadingUncached
												   error:&error];

			if( data )
			{
				GCDataHash* hash = [[GCDataHash alloc] initWithData:data];

				// check for hash collision

				NSURL* existing = [hashes objectForKey:hash];

				if( existing )
				{
					// the hash collided. See if the data is actually different, or whether it's just a copy
					// of an existing file.

					NSData* otherData = [NSData dataWithContentsOfURL:existing];
					 if(![otherData isEqualToData:data])
					 {
						// we have a bona fide collision - hash matches but data doesn't. Save that.

						NSDictionary* crunch = @{ hash : file };
						 [collisions addObject:crunch];

						NSLog(@"%lu: collision for path '%@', collided with '%@', hash = %@", collisions.count + 1, file, existing, hash );
					 }
				}
				else
					[hashes setObject:file forKey:hash];

				[hash release];

				++count;
			}
		}
	}

	NSLog(@"hashes tested: %lu, collided: %lu", count, collisions.count );
}


_______________________________________________

Cocoa-dev mailing list (email@hidden)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:

This email sent to email@hidden


  • Follow-Ups:
    • Re: Fast hash of NSData?
      • From: Marcel Weiher <email@hidden>
    • Re: Fast hash of NSData?
      • From: Scott Ribe <email@hidden>
  • Prev by Date: KVO - receptionist design pattern
  • Next by Date: Re: Fast hash of NSData?
  • Previous by thread: Re: KVO - receptionist design pattern
  • Next by thread: Re: Fast hash of NSData?
  • Index(es):
    • Date
    • Thread