What I would try is using zip or jpeg for the comparison. The idea is
that compression removes redundancy, so if you compress two
_uncompressed_ images together and also each of them separately, and
the combined result is much smaller than the sum of the two separately
compressed sizes, then those two images are very much alike. Lossy
compression for pictures and audio should work even better than
lossless, since it throws away more of the detail that differs.
See <http://complearn.sourceforge.net/> for a Ruby script, the
mathematical background, and a better explanation.
I read that they use it at the Vrije Universiteit (a university in
Amsterdam) to check whether student papers are original work.
I tried their script on a few test documents and it worked amazingly
well.
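
To make the idea above concrete, here is a rough Java sketch of the
comparison. It is not the CompLearn script; the class and method names
are made up for illustration, and it assumes the uncompressed pixel
data is already available as byte arrays. It computes the usual
normalized form of the measure, (C(xy) - min(C(x), C(y))) / max(C(x),
C(y)), where C is the compressed size: values near 0 mean "very much
alike", values near 1 mean "nothing in common".

```java
import java.util.zip.Deflater;

// Hypothetical helper, for illustration only.
final class CompressionDistance {

    // Number of bytes the data occupies after DEFLATE compression.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        // We only need the size, so the output bytes are discarded.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Normalized compression distance between two byte arrays.
    static double ncd(byte[] x, byte[] y) {
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        int cx = compressedSize(x);
        int cy = compressedSize(y);
        int cxy = compressedSize(xy);
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }
}
```

One caveat: DEFLATE only looks back 32 KB, so with this compressor the
sketch is only meaningful for small inputs such as thumbnails. For
full-size images you would want a compressor with a larger window.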
Hope this helps,
Arjan
On 10 Apr 2005, at 20:20, Scott Ellsworth wrote:
Hi, all.
A friend just asked how one would scan approximately 40k images for
duplicates. Is anyone able to recommend a Java toolkit for this that
works with reasonable speed on the Mac?
This is a terribly complicated problem to do right, but I wanted to
try an "off the cuff" fast solution first. Something perhaps a bit
more clever than a straight checksum of the image data, like a color
histogram. This would let us decide whether something smart that
actually analyzes the images is a good idea.
Scott
---
They both savoured the strange warm glow of being much more ignorant
than ordinary people, who were only ignorant of ordinary things.
-- Discworld scientists at work (Terry Pratchett, Equal Rites)
Java-dev mailing list (email@hidden)