Re: Best way of identifying duplicate files in Cocoa
  • Subject: Re: Best way of identifying duplicate files in Cocoa
  • From: Jean-Daniel Dupas <email@hidden>
  • Date: Wed, 21 Nov 2007 10:33:12 +0100


On Nov 20, 2007, at 2:48 PM, Michael Watson wrote:
I implemented MD5 hashing and comparison in a file diff utility I wrote for internal use, and I gotta say . . . it was *fast* with tens of thousands of files of varying size. (Say, anywhere from 4KB to dozens of megs.)

So did I! Here is source:

http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py

It checks the file sizes and then hashes the first 4k. Finally, it'll hash the full file if the sizes and the first 4k match.

b.bum
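
The staged check described in the quoted message (size first, then the first 4k, then the whole file) could be sketched roughly as below. This is only an illustration and not the code from dupinator.py: the names hash_file and probably_duplicates are hypothetical, it uses POSIX stat() and stdio, and a simple FNV-1a hash stands in for a real digest such as MD5 so that the sketch has no dependencies.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define PREFIX_LEN 4096

/* Hypothetical helper: FNV-1a 64-bit hash over the first `limit` bytes of a
   file (limit 0 means the whole file). FNV-1a is a stand-in for a real
   digest such as MD5. Returns 0 on success, -1 on I/O error. */
static int hash_file(const char *path, size_t limit, uint64_t *out) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    uint64_t h = 0xcbf29ce484222325ULL;          /* FNV offset basis */
    unsigned char buf[PREFIX_LEN];
    size_t total = 0;
    for (;;) {
        size_t want = sizeof buf;
        if (limit && limit - total < want) want = limit - total;
        if (want == 0) break;                    /* reached the byte limit */
        size_t n = fread(buf, 1, want, f);
        if (n == 0) break;                       /* EOF or error */
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;               /* FNV prime */
        }
        total += n;
    }
    int failed = ferror(f);
    fclose(f);
    if (failed) return -1;
    *out = h;
    return 0;
}

/* Returns 1 if the two files are probable duplicates, 0 if they differ,
   -1 on error. Each stage runs only if the cheaper one before it matched. */
static int probably_duplicates(const char *a, const char *b) {
    struct stat sa, sb;
    uint64_t ha, hb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0) return -1;
    if (sa.st_size != sb.st_size) return 0;      /* stage 1: sizes */
    if (hash_file(a, PREFIX_LEN, &ha) || hash_file(b, PREFIX_LEN, &hb))
        return -1;
    if (ha != hb) return 0;                      /* stage 2: first 4k */
    if (hash_file(a, 0, &ha) || hash_file(b, 0, &hb))
        return -1;
    return ha == hb;                             /* stage 3: whole file */
}
```

Equal hashes only make the files probable duplicates, so a byte-by-byte comparison is still needed if certainty matters.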

To get an MD5 you have to read the file AND compute the digest. To compare two files, you just have to read them. What is the benefit of MD5 in this case?


I use plain C to be able to compare forks, but you can easily replace CFDataRef with NSData.

static __inline__
OSStatus SOFileReadChunk(SInt16 aFork, CFIndex length, CFMutableDataRef buffer) {
  ByteCount size = 0;
  OSStatus err = noErr;
  ByteCount remaining = length;
  CFDataSetLength(buffer, length);

  /* Read until the chunk is full, EOF is reached, or an error occurs. */
  UInt8 *buf = CFDataGetMutableBytePtr(buffer);
  do {
    err = FSReadFork(aFork, fsAtMark | kFSNoCacheMask, 0, remaining, buf, &size);
    if (noErr == err || eofErr == err) {
      remaining -= size;
      buf += size; /* advance so a partial read is not overwritten */
    }
  } while (remaining > 0 && noErr == err);

  /* Trim the buffer to the number of bytes actually read. */
  CFDataSetLength(buffer, length - remaining);
  return err;
}

#define BUFFER_SIZE_KB 32
static
OSStatus SOCompareFork(FSRef *f1, FSRef *f2, HFSUniStr255 *forkName, bool *equals) {
  OSStatus err = noErr;
  SInt16 fnum1 = 0, fnum2 = 0;

  *equals = false; /* defined result even if a fork cannot be opened */

  err = FSOpenFork(f1, forkName->length, forkName->unicode, fsRdPerm, &fnum1);
  if (noErr == err)
    err = FSOpenFork(f2, forkName->length, forkName->unicode, fsRdPerm, &fnum2);

  if (noErr == err) {
    *equals = true;
    CFMutableDataRef d1 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);
    CFMutableDataRef d2 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);

    /* Compare chunk by chunk; stop at the first difference, EOF, or error. */
    do {
      err = SOFileReadChunk(fnum1, 1024 * BUFFER_SIZE_KB, d1);
      if (noErr == err || eofErr == err)
        err = SOFileReadChunk(fnum2, 1024 * BUFFER_SIZE_KB, d2);
      if (noErr == err || eofErr == err)
        *equals = CFEqual(d1, d2);
    } while (noErr == err && *equals);

    CFRelease(d2);
    CFRelease(d1);
  }
  /* Reaching EOF is the normal way out of the loop, not a failure. */
  if (eofErr == err) err = noErr;
  if (noErr != err) *equals = false; /* never report equality after an error */
  if (fnum2) verify_noerr(FSCloseFork(fnum2));
  if (fnum1) verify_noerr(FSCloseFork(fnum1));

  return err;
}
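
If you only care about the data fork, the same chunked comparison can probably be written with plain stdio. Here is a minimal sketch under that assumption; compare_files and CHUNK_SIZE are hypothetical names, and unlike the code above it knows nothing about named forks.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (32 * 1024)

/* Hypothetical portable counterpart to SOCompareFork (data fork only).
   Returns 0 and stores the result in *equals, or -1 on an I/O error. */
static int compare_files(const char *path1, const char *path2, bool *equals) {
    unsigned char b1[CHUNK_SIZE], b2[CHUNK_SIZE];
    int status = -1;
    *equals = false;

    FILE *f1 = fopen(path1, "rb");
    FILE *f2 = f1 ? fopen(path2, "rb") : NULL;
    if (f1 && f2) {
        size_t n1, n2;
        do {
            n1 = fread(b1, 1, sizeof b1, f1);
            n2 = fread(b2, 1, sizeof b2, f2);
            /* Chunks differ if their lengths or their bytes differ. */
            *equals = (n1 == n2) && memcmp(b1, b2, n1) == 0;
        } while (*equals && n1 == sizeof b1);   /* short read: EOF or error */

        if (!ferror(f1) && !ferror(f2))
            status = 0;
        else
            *equals = false;                    /* do not trust a failed read */
    }
    if (f2) fclose(f2);
    if (f1) fclose(f1);
    return status;
}
```

As with the fork version, the loop stops at the first chunk that differs, so files that differ early are rejected without being read in full.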


  • Follow-Ups:
    • Re: Best way of identifying duplicate files in Cocoa
      • From: Frank Reiff <email@hidden>
    • Re: Best way of identifying duplicate files in Cocoa
      • From: Bill Bumgarner <email@hidden>
References: 
 >Best way of identifying duplicate files in Cocoa (From: Frank Reiff <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Jean-Daniel Dupas <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Frank Reiff <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Michael Watson <email@hidden>)
 >Re: Best way of identifying duplicate files in Cocoa (From: Bill Bumgarner <email@hidden>)
