Re: Best way of identifying duplicate files in Cocoa
- Subject: Re: Best way of identifying duplicate files in Cocoa
- From: Frank Reiff <email@hidden>
- Date: Wed, 21 Nov 2007 15:23:52 +0100
Dear Jean-Daniel,
Thanks for the snippet.
This is all for a later release of the software, so I won't get around
to actually trying this for a few weeks yet. For the moment I'm
eliminating memory leaks with Core Data, optimizing performance and
sorting out multi-threading bugs :-(
I can see the point of using the md5 (or any other hash really), but
it's also a question of reliability v. performance. The more code
you've got, the more opportunities you have to mess something up. I'm
not sure yet whether performance is going to be a huge issue or not.
I'll probably first implement something quite simple but (hopefully)
bullet proof and then I might look into optimizing this over time.
I'm not actually writing one of those "eliminate duplicate files to
make space on your hard disk" kind of utilities and I only really need
to check files which have identical names but are in different
directories. So I can go with:
1) not the same name
2) not the same size
3) not the same content
and 1) should take care of 99% of cases and 2) will probably put a big
dent into what's left.
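The first two checks can be sketched in portable C as below. This is only
an illustration under assumptions: base_name and may_be_duplicates are
hypothetical names, the name comparison is case-sensitive (HFS+ volumes are
usually not), and st_size covers the data fork only.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: last path component (no trailing-slash handling). */
static const char *base_name(const char *path) {
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Cheap pre-filter. Returns true only if the two files have the same name
   and the same size, i.e. a content comparison is still needed.
   Files that cannot be stat'ed are treated as "not duplicates". */
static bool may_be_duplicates(const char *path1, const char *path2) {
    struct stat s1, s2;
    if (strcmp(base_name(path1), base_name(path2)) != 0)
        return false;                 /* 1) not the same name */
    if (stat(path1, &s1) != 0 || stat(path2, &s2) != 0)
        return false;                 /* unreadable: skip the pair */
    if (s1.st_size != s2.st_size)
        return false;                 /* 2) not the same size */
    return true;                      /* 3) content check still required */
}
```

Only pairs for which this returns true would need to be read from disk at
all.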
Long term the option of checking all files for duplicates irrespective
of their names might be a cool new feature to add, so more complex
schemes will come into their own there. Of course there always is the
guy who has five million files with the same name and the same 5GB
length but different content in the last three bytes out there, but
perhaps I'll wait for him to get in touch with me first ("It's really
slow!") :-)
Best regards,
Frank
On 21 Nov 2007, at 10:33, Jean-Daniel Dupas wrote:
On Nov 20, 2007, at 2:48 PM, Michael Watson wrote:
I implemented MD5 hashing and comparison in a file diff utility I
wrote for internal use, and I gotta say . . . it was *fast* with
tens of thousands of files of varying size. (Say, anywhere from
4KB to dozens of megs.)
So did I! Here is source:
http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py
It checks the file sizes and then hashes the first 4k. Finally,
it'll hash the full file if the sizes and first 4k match.
b.bum
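The size / first-4k / full-file sequence described above might look roughly
like this in C. It is a sketch, not a port of dupinator.py: it uses 64-bit
FNV-1a as a stand-in for MD5 so that it needs no crypto library, and the
names fnv1a_file and probably_same are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* FNV-1a over at most `limit` bytes of a file (limit 0 = whole file).
   Stand-in for MD5; it is NOT collision resistant. */
static bool fnv1a_file(const char *path, size_t limit, uint64_t *out) {
    FILE *f = fopen(path, "rb");
    if (!f) return false;
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    size_t done = 0;
    int c;
    while ((limit == 0 || done < limit) && (c = getc(f)) != EOF) {
        h ^= (uint64_t)(unsigned char)c;
        h *= 0x100000001b3ULL;               /* FNV prime */
        done++;
    }
    bool ok = !ferror(f);
    fclose(f);
    if (ok) *out = h;
    return ok;
}

/* Size first, then a hash of the first 4k, then a hash of the full file.
   Equal hashes only make equality very likely, not certain. */
static bool probably_same(const char *p1, const char *p2) {
    struct stat s1, s2;
    uint64_t h1, h2;
    if (stat(p1, &s1) != 0 || stat(p2, &s2) != 0) return false;
    if (s1.st_size != s2.st_size) return false;
    if (!fnv1a_file(p1, 4096, &h1) || !fnv1a_file(p2, 4096, &h2)) return false;
    if (h1 != h2) return false;              /* first 4k differ */
    if (!fnv1a_file(p1, 0, &h1) || !fnv1a_file(p2, 0, &h2)) return false;
    return h1 == h2;
}
```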
To get an MD5 you have to read the file AND compute the digest. To
compare two files, you just have to read them. What is the benefit
of the MD5 in this case?
I use plain C to be able to compare forks, but you can easily
replace CFDataRef with NSData.
/* Read up to `length` bytes from the fork's current mark into `buffer`.
   On return the buffer length is the number of bytes actually read. */
static __inline__
OSStatus SOFileReadChunk(SInt16 aFork, CFIndex length,
                         CFMutableDataRef buffer) {
  ByteCount size = 0;
  OSStatus err = noErr;
  ByteCount remaining = length;
  CFDataSetLength(buffer, length);
  UInt8 *buf = CFDataGetMutableBytePtr(buffer);
  /* FSReadFork may return fewer bytes than requested, so loop until the
     chunk is full, EOF is reached, or an error occurs. */
  do {
    size = 0;
    err = FSReadFork(aFork, fsAtMark | kFSNoCacheMask, 0, remaining,
                     buf, &size);
    if (noErr == err || eofErr == err) {
      remaining -= size;
      buf += size; /* append after the bytes already read */
    }
  } while (remaining > 0 && noErr == err);
  CFDataSetLength(buffer, length - remaining);
  return err;
}

#define BUFFER_SIZE_KB 32

/* Compare the named fork of two files chunk by chunk.
   When noErr is returned, *equals tells whether the contents match. */
static
OSStatus SOCompareFork(FSRef *f1, FSRef *f2, HFSUniStr255 *forkName,
                       bool *equals) {
  OSStatus err = noErr;
  SInt16 fnum1 = 0, fnum2 = 0;
  err = FSOpenFork(f1, forkName->length, forkName->unicode, fsRdPerm,
                   &fnum1);
  if (noErr == err)
    err = FSOpenFork(f2, forkName->length, forkName->unicode,
                     fsRdPerm, &fnum2);
  if (noErr == err) {
    *equals = true;
    CFMutableDataRef d1 = CFDataCreateMutable(kCFAllocatorDefault,
                                              1024 * BUFFER_SIZE_KB);
    CFMutableDataRef d2 = CFDataCreateMutable(kCFAllocatorDefault,
                                              1024 * BUFFER_SIZE_KB);
    /* Stop at the first differing chunk, at EOF, or on a read error. */
    do {
      err = SOFileReadChunk(fnum1, 1024 * BUFFER_SIZE_KB, d1);
      if (noErr == err || eofErr == err)
        err = SOFileReadChunk(fnum2, 1024 * BUFFER_SIZE_KB, d2);
      if (noErr == err || eofErr == err)
        *equals = CFEqual(d1, d2);
    } while (noErr == err && *equals);
    CFRelease(d2);
    CFRelease(d1);
  }
  if (eofErr == err) err = noErr; /* reaching EOF is the normal exit */
  if (fnum2) verify_noerr(FSCloseFork(fnum2));
  if (fnum1) verify_noerr(FSCloseFork(fnum1));
  return err;
}
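If forks don't matter, roughly the same chunked comparison can be written
with plain stdio. The following is a sketch under that assumption (data
fork only); compare_files and CHUNK_SIZE are illustrative names, not part
of the code above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (32 * 1024)

/* Compare two files chunk by chunk, stopping at the first difference.
   Returns 0 and sets *equal on success, -1 on an open or read error.
   Uses static buffers, so it is not safe to call from several threads. */
static int compare_files(const char *p1, const char *p2, bool *equal) {
    static unsigned char b1[CHUNK_SIZE], b2[CHUNK_SIZE];
    FILE *f1 = fopen(p1, "rb");
    FILE *f2 = f1 ? fopen(p2, "rb") : NULL;
    int result = (f1 && f2) ? 0 : -1;
    *equal = (result == 0);
    while (result == 0 && *equal) {
        size_t n1 = fread(b1, 1, CHUNK_SIZE, f1);
        size_t n2 = fread(b2, 1, CHUNK_SIZE, f2);
        if (ferror(f1) || ferror(f2)) {
            result = -1;              /* read error: result is unknown */
            *equal = false;
        } else if (n1 != n2 || memcmp(b1, b2, n1) != 0) {
            *equal = false;           /* different length or content */
        } else if (n1 < CHUNK_SIZE) {
            break;                    /* both files ended, all chunks equal */
        }
    }
    if (f2) fclose(f2);
    if (f1) fclose(f1);
    return result;
}
```

A caller would check the return value first and only then look at *equal,
much like the err / equals pair in SOCompareFork.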
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden