Re: Best way of identifying duplicate files in Cocoa
- Subject: Re: Best way of identifying duplicate files in Cocoa
- From: Army Research Lab <email@hidden>
- Date: Wed, 21 Nov 2007 07:21:46 -0500
- Thread-topic: Best way of identifying duplicate files in Cocoa
> ------------------------------
>
> Message: 16
> Date: Fri, 16 Nov 2007 14:25:31 +0100
> From: Frank Reiff <email@hidden>
> Subject: Best way of identifying duplicate files in Cocoa
> To: "Cocoa-Dev (Apple)" <email@hidden>
> Message-ID: <email@hidden>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> Hi,
>
> For my latest project, I need to be able to check whether the files
> (or file bundles) at two paths are duplicates (= have the same
> content) or not.
>
> While this would be ridiculously easy to do on a plain Unix system, it
> turns out to be a major undertaking on OS X.
>
> * file bundles aren't files, but directories, so in fact I need to be
> able to compare directories (invisible files included)
> * OS 9 files still have resource forks as well as data forks, so both
> need to be checked
>
> Another issue is of course performance. Comparing byte-by-byte is
> certainly the simplest and most reliable way of doing this, but it's
> SLOW. On the other hand, I don't really know what the performance
> characteristics of an MD5, CRC32, or SHA hash are, or whether or not
> you need to read in the whole file contents to apply them.
>
> It would thus be great if somebody, somewhere had published a ready-to-
> use - (BOOL) file: (NSString*) path isIdenticalTo: (NSString*) path2;
> method :-)
[SNIP]
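(On the hash question above: MD5/SHA-1 do have to consume every byte of
the file, but they can do so in fixed-size chunks, so memory use stays
flat and the cost is essentially one sequential read of the file. A
minimal sketch using Python's hashlib; the chunk size is an arbitrary
choice, not a recommendation:)

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Stream a file through SHA-1 in fixed-size chunks.

    The hash still reads every byte, but only chunk_size bytes
    are ever held in memory, so large files stay cheap.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()
```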
I had to do something similar to this, but I did it using Python. The
method I used is fairly fast (about 3 GB with 30,000 files, in about 27
seconds of user time on a Xeon 3 GHz machine), and it worked well. I
basically built trees and dictionaries to do it.
1) Build a tree of all the files you're looking into, including the
directories. The leaf nodes are the files, the interior nodes are the
directories. CoreData is a good idea for this.
2) Depth first, and recursively do the following:
a) If the node is a file, its value is its hash
b) If the node is a directory, sort its direct descendants by their
hashes, then treat that array as if it were a single string and hash
it. This is the directory's hash.
3) Create a dictionary where the keys are hashes and the values are lists
of paths. Do this by walking the tree from step 2 again, but this time
start at the root of the tree and go breadth first. Any time a list has 2
or more items in it, everything below those points is probably the same.
This is true of directories, bundles, etc. There is no point in walking
below either side, so you can either trim the subtrees off the tree from
step 2, or just skip them. I skip them; it makes debugging easier if the
information is still around.
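(The core of steps 1-3 can be sketched in a few lines of Python. This is
an illustration only: it uses plain files and directories, ignores
resource forks, symlinks, and permissions, and skips the CoreData tree in
favor of plain recursion, so it is not the original tool, just the idea:)

```python
import hashlib
import os

def hash_tree(path, table):
    """Depth-first hash of a file or directory (steps 1-2).

    Files hash their contents in chunks; directories hash the
    sorted list of their children's hashes, so identical trees get
    identical hashes regardless of traversal order. Every node's
    hash is also recorded in table (hash -> list of paths), which
    is the dictionary from step 3.
    """
    h = hashlib.sha1()
    if os.path.isdir(path):
        children = sorted(
            hash_tree(os.path.join(path, name), table)
            for name in os.listdir(path))  # includes invisible files
        for child in children:
            h.update(child.encode("ascii"))
    else:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    digest = h.hexdigest()
    table.setdefault(digest, []).append(path)
    return digest

def find_duplicates(root):
    """Return only the hash -> paths entries with 2+ candidates.

    These are the probable duplicates; step 4 (a byte-by-byte
    comparison) is still needed to be sure.
    """
    table = {}
    hash_tree(root, table)
    return {h: ps for h, ps in table.items() if len(ps) > 1}
```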
4) At this point, you'll have to do a file-by-file comparison to be sure
that you really have duplicates. BTW, if you DO find collisions in the
SHA-1 hash where the file contents are different, tell NIST (www.nist.gov)
about it; they'd be interested to know (assuming you're willing to share
the directory contents for them to poke at).
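(For the confirmation pass on plain files, Python's standard library
already has the byte-by-byte comparison; note that shallow=False is what
forces it to compare contents rather than just the stat() metadata.
Resource forks would still need separate handling on OS X, which this
doesn't attempt:)

```python
import filecmp

def really_identical(path1, path2):
    """Step 4: confirm a hash match by comparing actual file bytes.

    filecmp.cmp with shallow=False reads and compares the contents
    of both files; with the default shallow=True it would only
    compare os.stat() signatures.
    """
    return filecmp.cmp(path1, path2, shallow=False)
```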
Also note that my method will miss certain cases, e.g. if A and B are
identical trees, and C is a subtree of either one, then if I run into A and
B first, I'll never realize that C is a subtree (because of my
skipping/trimming step). For me, this isn't a big deal. My tool is just
for cleaning up large amounts of garbage, rapidly. Once I delete either A
or B, I can then rerun my tool and pick up C as well. Maybe not the
cleanest way, but it works for me.
Thanks,
Cem Karan
_______________________________________________
Cocoa-dev mailing list (email@hidden)