Re: Matching Postal Addresses
- Subject: Re: Matching Postal Addresses
- From: p3consulting <email@hidden>
- Date: Wed, 12 Nov 2003 09:35:29 +0100
On 12 nov. 2003, at 03:32, email@hidden wrote:
I'm working on a program that must take two postal
addresses and determine if they are logically the
same. That is, it must ignore minor differences like
"Elm Street" versus "Elm St".
I'm sure I'm not the first programmer to face this
problem and am wondering if there is anyone out there
that knows of some published code that will do this
for me. My mind hurts just thinking about it...
Thanks!
Jason
It's the kind of problem for which you will never find a "perfect"
solution, but by combining several steps you can reach some
interesting results:
1. convert everything to the same case and strip accented chars (and
some other special chars like the German sharp S) using a good
algorithm (not the standard C lib!)
2. remove "noise" words or replace them with a "standardized" version
(noise words = street, avenue, boulevard, box, etc.); maintain the
noise words in an editable list, you decide what is "standard" (it may
be an escape sequence of your own)
3. if you also have to compare people's names, a db of first names may
help to tell first name from last name so that you can standardize
their order - again, no 100% automatic solution exists here, there are
situations in which telling first name from last name is impossible
4. use several Soundex/Levenshtein/Metaphone/hashing algorithms to get
various "key" values (download the PostgreSQL sources and look at
fuzzystrmatch in the contrib directory)
5. display your address list to a human being, sorted by the keys you
have created and highlighted with colors (unique records, duplicates,
records to be checked): he/she will be able to quickly detect
duplicates among the records with "to be checked" status (don't
forget to apply the same principles to the names)
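To make steps 1, 2, 4 and 5 a bit more concrete, here is a minimal
sketch in Python (chosen only for brevity, it is not what we used). The
noise-word table, the function names and the distance threshold are all
assumptions made up for illustration, and only Levenshtein is shown out
of the algorithms listed in step 4:

```python
import unicodedata

# Hypothetical, editable noise-word table (step 2): every variant maps
# to one "standard" token. A real list would be much longer.
NOISE_WORDS = {
    "street": "st", "st": "st",
    "avenue": "ave", "ave": "ave", "av": "ave",
    "boulevard": "blvd", "blvd": "blvd",
}

def fold(text):
    """Step 1: same case, no accents; casefold() also maps sharp S to 'ss'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize(address):
    """Steps 1-2: fold, drop punctuation, standardize noise words."""
    cleaned = "".join(c if c.isalnum() else " " for c in fold(address))
    return " ".join(NOISE_WORDS.get(w, w) for w in cleaned.split())

def levenshtein(a, b):
    """Step 4: classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def status(a, b, max_distance=2):
    """Step 5: label a pair for the reviewer; the threshold is arbitrary."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return "duplicate"
    if levenshtein(na, nb) <= max_distance:
        return "to be checked"
    return "unique"
```

With this table, normalize("12 Elm Street") and normalize("12 Elm St.")
both give "12 elm st", so the pair is labelled "duplicate", while a
near miss such as "12 Elmm St" ends up as "to be checked" and is left
to the human reviewer.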
By combining these techniques we were able to detect about 5% of
duplicates in a list of more than 3000 addresses (dentists), including
duplicates like:
- dentists having given both their private and their professional
address, or sometimes several professional addresses (individual
office and hospital)
- groups of dentists working together
- couples or father/son working together
- women referenced under 2 different names because they are now married
We also bound the system to some 411-like Internet sites to check for
people who had moved.
One important element in our success was that the db was made of known
prospects/customers: the people checking the entries had some
"physical" experience with the prospects/customers (phone calls,
invoices, contact at exhibitions). The success rate may be lower when
working on "bulk" addresses, like addresses you buy from specialized
marketing firms. The payback also came from the fact that the savings
in postage were high, due to the weight of the professional catalog to
be sent every year.
And of course we have never been able to remove the human part of the
work, only to make it a lot easier, faster and more reliable.
Pascal Pochet
P3 Consulting
email@hidden
http://www.p3-consulting.net