Re: Best way of identifying duplicate files in Cocoa
- Subject: Re: Best way of identifying duplicate files in Cocoa
- From: Michael Watson <email@hidden>
- Date: Tue, 20 Nov 2007 17:48:46 -0500
I implemented MD5 hashing and comparison in a file diff utility I
wrote for internal use, and I gotta say . . . it was *fast* with tens
of thousands of files of varying size. (Say, anywhere from 4KB to
dozens of megs.)
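For illustration only, here is a minimal sketch of hashing a file's contents with MD5. It is written in Python rather than Cocoa, and it is not the code from the utility mentioned above; `md5_of_file` and `chunk_size` are hypothetical names. It also shows that a content hash does have to read the whole file, although it can do so in fixed-size chunks instead of loading everything into memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which is the sentinel that stops iter().
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with different digests are certainly different; two files with the same digest are almost certainly, but not provably, identical.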
--
m-s
On 20 Nov, 2007, at 16:42, Frank Reiff wrote:
Hi Jean-Daniel,
Thanks for your response.
On 16 Nov 2007, at 14:46, Jean-Daniel Dupas wrote:
On 16 Nov 2007, at 14:25, Frank Reiff wrote:
Another issue is of course performance. Comparing byte-by-byte is
certainly the simplest and most reliable way of doing this, but
it's SLOW. On the other hand, I don't really know what the
performance characteristics of an MD5, CRC32, or SHA hash are, or
whether you need to read in the whole file contents to
apply them.
It would thus be great if somebody, somewhere had published a
ready-to-use
    - (BOOL)file:(NSString *)path isIdenticalTo:(NSString *)path2;
method :-)
I've spent the last two hours searching the web, but I haven't
found anything that comes close.
You don't have to check byte-by-byte if the two files have
different sizes.
Beyond that, comparing byte-by-byte is not so slow, as you can abort the
comparison as soon as two bytes differ.
Using a hash has no benefit when comparing two files on
disk. It's only useful if you want to compare a remote file (with a
precomputed hash) against a local file.
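As a point of comparison outside Cocoa, Python's standard library already ships a comparison that works the way described above: `filecmp.cmp` returns False immediately when the sizes differ and otherwise compares contents in blocks, stopping at the first mismatch. The wrapper name `files_identical` below is hypothetical, and this is only a sketch of the approach, not a Cocoa method.

```python
import filecmp


def files_identical(path_a, path_b):
    """Return True if both files have the same size and the same bytes."""
    # shallow=False forces a content comparison instead of trusting
    # matching size/mtime metadata; differing sizes still return False
    # without reading any file data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```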
I'll probably be going with:
* check length
* check last few bytes (many files begin with the same bytes but do
not end with them)
* check byte-by-byte
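The three steps above could be sketched roughly as follows. This is Python for illustration rather than a Cocoa implementation, and the names `is_identical`, `tail_size`, and `chunk_size`, as well as the default values, are assumptions chosen for the example.

```python
import os


def is_identical(path_a, path_b, tail_size=4096, chunk_size=1 << 16):
    """Return True if the two files have exactly the same contents."""
    # Step 1: files of different length can never be identical.
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    if size == 0:
        return True  # two empty files

    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # Step 2: compare the last few bytes, which is cheap and often
        # differs even when the files share a common header.
        tail = min(tail_size, size)
        fa.seek(-tail, os.SEEK_END)
        fb.seek(-tail, os.SEEK_END)
        if fa.read(tail) != fb.read(tail):
            return False

        # Step 3: full comparison from the start, aborting at the
        # first chunk that differs.
        fa.seek(0)
        fb.seek(0)
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files hit end of file together
                return True
```

The tail check rereads a few bytes that step 3 will read again, which is a small cost compared with skipping a full pass over files that only differ at the end.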
Computing a hash could be interesting in situations where there are
lots and lots of files with the same length. Instead of having to
compare each file with every other file of the same length, one could
compute each file's hash by reading it once and then compare the
hashes instead. Of course, to be 100% certain one would then still need
a byte-by-byte check of any files whose hashes match. Alternatively, one
could cache the relationships between files, e.g. A != B and B == C
means A != C and C != A.
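A rough sketch of that idea, in Python for illustration: group files by length, hash only the files that share a length, then confirm hash matches byte-by-byte. Because equality is transitive, each file is compared against one representative of each confirmed group rather than against every other file. All function names here (`find_duplicate_groups`, `_md5`, `_same_bytes`) are hypothetical, and MD5 is just one possible choice of hash.

```python
import hashlib
import os
from collections import defaultdict


def _md5(path, chunk_size=1 << 20):
    """Hash a file's contents, reading it once in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def _same_bytes(path_a, path_b, chunk_size=1 << 16):
    """Byte-by-byte comparison that stops at the first differing chunk."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk_size), fb.read(chunk_size)
            if a != b:
                return False
            if not a:
                return True


def find_duplicate_groups(paths):
    """Return lists of paths whose contents are identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique length cannot have a duplicate; skip hashing
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[_md5(path)].append(path)
        for candidates in by_hash.values():
            # Confirm hash matches byte-by-byte. A file is compared only
            # with the first member of each confirmed class.
            classes = []
            for path in candidates:
                for cls in classes:
                    if _same_bytes(cls[0], path):
                        cls.append(path)
                        break
                else:
                    classes.append([path])
            groups.extend(cls for cls in classes if len(cls) > 1)
    return groups
```

With this layout every file is read at most once for hashing, and the byte-by-byte pass only touches files that already share both a length and a hash.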
I can see this could be fun :-)
Best regards,
Frank
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
@bungie.org
This email sent to email@hidden