Re: Best way of identifying duplicate files in Cocoa


  • Subject: Re: Best way of identifying duplicate files in Cocoa
  • From: Frank Reiff <email@hidden>
  • Date: Wed, 21 Nov 2007 15:23:52 +0100

Dear Jean-Daniel,

Thanks for the snippet.

This is all for a later release of the software, so I won't get around to actually trying this for a few weeks yet. For the moment I'm eliminating memory leaks with Core Data, optimizing performance and sorting out multi-threading bugs :-(

I can see the point of using MD5 (or any other hash, really), but it's also a question of reliability vs. performance. The more code you've got, the more opportunities you have to mess something up. I'm not sure yet whether performance is going to be a huge issue or not. I'll probably first implement something quite simple but (hopefully) bulletproof, and then I might look into optimizing this over time.

I'm not actually writing one of those "eliminate duplicate files to make space on your hard disk" kind of utilities, and I only really need to check files which have identical names but are in different directories. So I can rule a pair out as soon as one of these holds:

1) not the same name
2) not the same size
3) not the same content

Check 1) should take care of 99% of cases, and 2) will probably put a big dent in what's left.
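
The three checks above could be sketched roughly as follows. This is only an illustration in portable C using stdio and stat(), not code from the thread: the names base_name, same_content and is_duplicate are hypothetical, the name comparison is case-sensitive (which may not match the volume's behaviour), and only the data fork is examined.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: last path component, without modifying the input. */
static const char *base_name(const char *path) {
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Byte-for-byte comparison in 32 KB chunks. Returns 0 and sets *equal on
   success, or -1 if either file cannot be opened or read. */
static int same_content(const char *p1, const char *p2, bool *equal) {
    FILE *f1 = fopen(p1, "rb");
    FILE *f2 = f1 ? fopen(p2, "rb") : NULL;
    int rc = -1;
    if (f1 && f2) {
        char b1[32 * 1024], b2[32 * 1024];
        *equal = true;
        for (;;) {
            size_t n1 = fread(b1, 1, sizeof b1, f1);
            size_t n2 = fread(b2, 1, sizeof b2, f2);
            if (n1 != n2 || memcmp(b1, b2, n1) != 0) { *equal = false; break; }
            if (n1 == 0) break; /* both files ended at the same point */
        }
        rc = (ferror(f1) || ferror(f2)) ? -1 : 0;
    }
    if (f2) fclose(f2);
    if (f1) fclose(f1);
    return rc;
}

/* Returns 1 if the files are duplicates, 0 if not, -1 on I/O error.
   The cheap checks run first, so most pairs never reach the content read. */
static int is_duplicate(const char *p1, const char *p2) {
    if (strcmp(base_name(p1), base_name(p2)) != 0) return 0; /* 1) name */
    struct stat s1, s2;
    if (stat(p1, &s1) != 0 || stat(p2, &s2) != 0) return -1;
    if (s1.st_size != s2.st_size) return 0;                  /* 2) size */
    bool equal = false;
    if (same_content(p1, p2, &equal) != 0) return -1;        /* 3) content */
    return equal ? 1 : 0;
}
```

A call such as is_duplicate("a/report.txt", "b/report.txt") only reads file contents when both the name and the size already match.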

Long term, the option of checking all files for duplicates irrespective of their names might be a cool new feature to add, so more complex schemes will come into their own there. Of course there is always the guy out there who has five million files with the same name and the same 5GB length but different content in the last three bytes, but perhaps I'll wait for him to get in touch with me first (It's really slow!) :-)

Best regards,

Frank


On 21 Nov 2007, at 10:33, Jean-Daniel Dupas wrote:


On Nov 20, 2007, at 2:48 PM, Michael Watson wrote:
I implemented MD5 hashing and comparison in a file diff utility I wrote for internal use, and I gotta say . . . it was *fast* with tens of thousands of files of varying size. (Say, anywhere from 4KB to dozens of megs.)

So did I! Here is the source:

http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py

It checks the file sizes and then hashes the first 4k. Finally, it'll hash the full file if the sizes and first 4k match.

b.bum
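
The staged approach described above (size, then a hash of the first 4k, then a hash of the whole file) might look roughly like this in C. It is a sketch under stated assumptions and not a port of dupinator.py: 64-bit FNV-1a stands in for MD5 purely to keep the example self-contained, it is not a cryptographic hash, and the names hash_prefix and probably_same are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* 64-bit FNV-1a over at most `limit` bytes of the file (0 = whole file).
   Stand-in for MD5 in this sketch. Returns 0 on success, -1 on I/O error. */
static int hash_prefix(const char *path, size_t limit, uint64_t *out) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    uint64_t h = 0xcbf29ce484222325ULL;       /* FNV offset basis */
    unsigned char buf[4096];
    size_t total = 0;
    for (;;) {
        size_t want = sizeof buf;
        if (limit && limit - total < want) want = limit - total;
        if (want == 0) break;                 /* prefix limit reached */
        size_t n = fread(buf, 1, want, f);
        if (n == 0) break;                    /* EOF or read error */
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;            /* FNV prime */
        }
        total += n;
    }
    int rc = ferror(f) ? -1 : 0;
    fclose(f);
    *out = h;
    return rc;
}

/* Returns 1 if the files are probably identical, 0 if they certainly
   differ, -1 on I/O error. Each stage is more expensive than the last. */
static int probably_same(const char *p1, const char *p2) {
    struct stat s1, s2;
    if (stat(p1, &s1) != 0 || stat(p2, &s2) != 0) return -1;
    if (s1.st_size != s2.st_size) return 0;               /* stage 1: size */
    uint64_t h1, h2;
    if (hash_prefix(p1, 4096, &h1) || hash_prefix(p2, 4096, &h2)) return -1;
    if (h1 != h2) return 0;                               /* stage 2: first 4k */
    if (hash_prefix(p1, 0, &h1) || hash_prefix(p2, 0, &h2)) return -1;
    return h1 == h2;                                      /* stage 3: full file */
}
```

A matching hash only means the files are very likely identical; a byte-for-byte comparison is still needed if certainty matters.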

To get an MD5 you have to read the file AND compute the digest. To compare two files, you just have to read them. What is the benefit of the MD5 in this case?


I use plain C to be able to compare forks, but you can easily replace CFDataRef with NSData.

static __inline__
OSStatus SOFileReadChunk(SInt16 aFork, CFIndex length, CFMutableDataRef buffer) {
  ByteCount size = 0;
  OSStatus err = noErr;
  ByteCount remaining = length;
  CFDataSetLength(buffer, length);

  /* Read until the chunk is full, EOF is reached, or an error occurs. */
  UInt8 *buf = CFDataGetMutableBytePtr(buffer);
  do {
    err = FSReadFork(aFork, fsAtMark | kFSNoCacheMask, 0, remaining, buf, &size);
    if (noErr == err || eofErr == err) {
      remaining -= size;
      buf += size; /* advance past the bytes already read */
    }
  } while (remaining > 0 && noErr == err);

  /* Shrink the buffer to the number of bytes actually read. */
  CFDataSetLength(buffer, length - remaining);
  return err;
}

#define BUFFER_SIZE_KB 32
static
OSStatus SOCompareFork(FSRef *f1, FSRef *f2, HFSUniStr255 *forkName, bool *equals) {
  OSStatus err = noErr;
  SInt16 fnum1 = 0, fnum2 = 0;

  /* Open the same named fork on both files. */
  err = FSOpenFork(f1, forkName->length, forkName->unicode, fsRdPerm, &fnum1);
  if (noErr == err)
    err = FSOpenFork(f2, forkName->length, forkName->unicode, fsRdPerm, &fnum2);

  if (noErr == err) {
    *equals = true;
    CFMutableDataRef d1 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);
    CFMutableDataRef d2 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);

    /* Compare chunk by chunk; stop at the first difference, EOF, or error. */
    do {
      err = SOFileReadChunk(fnum1, 1024 * BUFFER_SIZE_KB, d1);
      if (noErr == err || eofErr == err)
        err = SOFileReadChunk(fnum2, 1024 * BUFFER_SIZE_KB, d2);
      if (noErr == err || eofErr == err)
        *equals = CFEqual(d1, d2);
    } while (noErr == err && *equals);

    CFRelease(d2);
    CFRelease(d1);
  }
  if (eofErr == err) err = noErr; /* reaching EOF is the normal exit */
  if (fnum2) verify_noerr(FSCloseFork(fnum2));
  if (fnum1) verify_noerr(FSCloseFork(fnum1));

  return err;
}


_______________________________________________

Cocoa-dev mailing list (email@hidden)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden

References: 
 >Best way of identifying duplicate files in Cocoa (From: Frank Reiff <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Jean-Daniel Dupas <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Frank Reiff <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Michael Watson <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Bill Bumgarner <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Jean-Daniel Dupas <email@hidden>)
