Re: Best way of identifying duplicate files in Cocoa
Re: Best way of identifying duplicate files in Cocoa
- Subject: Re: Best way of identifying duplicate files in Cocoa
- From: Frank Reiff <email@hidden>
- Date: Wed, 21 Nov 2007 15:40:17 +0100
Hi Cem,
I had to do something similar to this, but I did it using Python. The
method I used is fairly fast (about 3 GB with 30,000 files, in about
27
seconds of user time on a Xeon 3 GHz machine), and it worked well. I
basically built trees and dictionaries to do it.
1) Build a tree of all the files you're looking into, including the
directories. The leaf nodes are the files, the interior nodes are the
directories. CoreData is a good idea for this.
I'm in Core Data already, so that wouldn't be problem.
2) Depth first, and recursively do the following:
a) If the node is a file, its value is its hash
b) If the node is a directory, sort its direct descendents by their
hashes, and then treat that array as if it were a single
string, and
hash it. This is the directory's hash.
3) Create a dictionary where the keys are hashes and values are
lists of
paths. Do this by walking the tree from step 2, but this time
starting at
the root of the tree, and do it breadth first. Any time you have a
list
with 2 items in it, then everything below those points are probably
the
same. This is true of directories, bundles, etc. There is no point
in
walking below the tree of either side, so you can trim the subtrees
off from
the tree in step 2, or you can just skip them. I skip them, it makes
debugging easier if the information is still around.
4) At this point, you'll have to do a file by file comparison to be
sure
that you really have duplicates. BTW, if you DO find collisions in
the
SHA-1 hash, and the file contents are different, tell NIST (www.nist.gov
)
about it; they'd be interested to know (assuming you're willing to
share the
directory contents for them to poke at)
Also note that my method will miss certain cases, e.g. if A and B are
identical trees, and C is a subtree of either one, then if I run
into A and
B first, I'll never realize that C is a subtree (because of my
skipping/trimming step). For me, this isn't a big deal. My tool is
just
for cleaning up large amounts of garbage, rapidly. Once I delete
either A
or B, I can then rerun my tool and pick up C as well. Maybe not the
cleanest way, but it works for me.
That's a fairly impressive algorithm..
Luckily, I don't need to go quite that far because of the nature of
the program I'm writing. I'm basically taking a bunch of files,
computing new destination paths for them and then check whether there
are any file name clashes (say two files called "1.jpg" and "1.jpg").
If there are any clashes, I need to know whether they are duplicates
(in which case I throw away the duplicate) or not ( in which case I
change one of file names to make it unique, e.g. "1.jpg" and "1
(1).jpg").
There's certainly plenty to think about now..
Thanks again for your insights.
Best regards,
Frank
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden