Lists

Terms and Conditions
Lists hosted on this site
Email the Postmaster
Tips for posting to public mailing lists

Re: Best way of identifying duplicate files in Cocoa

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Best way of identifying duplicate files in Cocoa

Subject: Re: Best way of identifying duplicate files in Cocoa
From: Frank Reiff <email@hidden>
Date: Wed, 21 Nov 2007 15:40:17 +0100

Hi Cem,

I had to do something similar to this, but I did it using Python. The method I used is fairly fast (about 3 GB with 30,000 files, in about 27 seconds of user time on a Xeon 3 GHz machine), and it worked well. I basically built trees and dictionaries to do it.
1) Build a tree of all the files you're looking into, including the
directories.  The leaf nodes are the files, the interior nodes are the
directories.  CoreData is a good idea for this.


I'm in Core Data already, so that wouldn't be problem.

2) Depth first, and recursively do the following: a) If the node is a file, its value is its hash b) If the node is a directory, sort its direct descendents by their hashes, and then treat that array as if it were a single string, and hash it. This is the directory's hash.

3) Create a dictionary where the keys are hashes and values are lists of paths. Do this by walking the tree from step 2, but this time starting at the root of the tree, and do it breadth first. Any time you have a list with 2 items in it, then everything below those points are probably the same. This is true of directories, bundles, etc. There is no point in walking below the tree of either side, so you can trim the subtrees off from the tree in step 2, or you can just skip them. I skip them, it makes debugging easier if the information is still around.

4) At this point, you'll have to do a file by file comparison to be sure that you really have duplicates. BTW, if you DO find collisions in the SHA-1 hash, and the file contents are different, tell NIST (www.nist.gov ) about it; they'd be interested to know (assuming you're willing to share the directory contents for them to poke at)

Also note that my method will miss certain cases, e.g. if A and B are identical trees, and C is a subtree of either one, then if I run into A and B first, I'll never realize that C is a subtree (because of my skipping/trimming step). For me, this isn't a big deal. My tool is just for cleaning up large amounts of garbage, rapidly. Once I delete either A or B, I can then rerun my tool and pick up C as well. Maybe not the cleanest way, but it works for me.


That's a fairly impressive algorithm..

Luckily, I don't need to go quite that far because of the nature of the program I'm writing. I'm basically taking a bunch of files, computing new destination paths for them and then check whether there are any file name clashes (say two files called "1.jpg" and "1.jpg"). If there are any clashes, I need to know whether they are duplicates (in which case I throw away the duplicate) or not ( in which case I change one of file names to make it unique, e.g. "1.jpg" and "1 (1).jpg").

There's certainly plenty to think about now..

Thanks again for your insights.

Best regards,

Frank
_______________________________________________

Cocoa-dev mailing list (email@hidden)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden



References:  
  >Re: Best way of identifying duplicate files in Cocoa (From: Army Research Lab <email@hidden>)




Prev by Date:
Re: Best way of identifying duplicate files in Cocoa

Next by Date:
Re: NSWindow and Core Anim - flickering

Previous by thread:
Re: Best way of identifying duplicate files in Cocoa

Next by thread:
The Magic Delegate

Index(es):

Date
Thread