Re: Matching Postal Addresses
- Subject: Re: Matching Postal Addresses
- From: Brian Hannan <email@hidden>
- Date: Wed, 12 Nov 2003 01:25:54 -0800
At my last job, I worked at a place that did geolocation work. One
of the things we did was geocoding: taking an address in natural human
language and getting the latitude and longitude for that location. It
can get pretty involved. However, without investing lots of time you
can get by with some good heuristics. I'd say go for normalization of
words that can be written in more than one form. For example, in US
English "street" can appear as "st", "st.", "street", or not at all, as
in just "Elm". As was suggested, flatten everything to a single case,
upper or lower. Then attempt to normalize words, so as input you'll have:
"Elm St."
and as output you'd get
"elm street"
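A minimal sketch of that kind of normalization, in Python. The abbreviation table and the name normalize_address are made up for illustration; a real table would be much larger and specific to one country.

```python
import re

# Illustrative abbreviation table (an assumption, nowhere near complete).
SUFFIXES = {
    "st": "street",
    "ave": "avenue",
    "blvd": "boulevard",
    "rd": "road",
}

def normalize_address(text):
    """Lowercase the input, drop punctuation, expand known abbreviations."""
    # Flatten to one case, then turn anything that is not a lowercase
    # letter, digit or whitespace into a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    words = [SUFFIXES.get(word, word) for word in cleaned.split()]
    return " ".join(words)

print(normalize_address("Elm St."))     # elm street
print(normalize_address("elm STREET"))  # elm street
```

Note that this expands "st" wherever it appears, so "St. Louis" would come out wrong, and it does nothing about a bare "Elm" with no street word at all.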
As was suggested, nothing is perfect. If you can't resolve the address
or you get multiple normalizations, as was also suggested, let the user
help you out. Ask for their input. They usually know best, and picking
the wrong location usually has very bad implications. Letting the user
quickly choose from a brief list of two or three choices usually won't
bother them.
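One way to sketch the "multiple normalizations" case: expand each ambiguous abbreviation every way it can be read, and hand the resulting short list to the user. The table and the name candidate_normalizations are hypothetical, chosen only to show the idea.

```python
import itertools
import re

# Hypothetical table: each abbreviation maps to every reading it might have.
EXPANSIONS = {
    "st": ["street", "saint"],
    "dr": ["drive", "doctor"],
    "ave": ["avenue"],
}

def candidate_normalizations(text):
    """Return every normalized form the input could stand for."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    # One list of options per word; unknown words have a single option.
    options = [EXPANSIONS.get(word, [word]) for word in words]
    return [" ".join(choice) for choice in itertools.product(*options)]

candidates = candidate_normalizations("Elm St.")
if len(candidates) == 1:
    print("resolved:", candidates[0])
else:
    # More than one reading: show the short list and let the user pick.
    for number, candidate in enumerate(candidates, start=1):
        print(number, candidate)
```

The number of candidates multiplies with every ambiguous word, so in practice you would prune the list (by word position, for instance) before showing it to anyone.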
And when you get down to it, this is just another form of the classic
normalization problem, of which you'll find tons of examples in computer
science, especially among DB people. Which means you can get lots of
good advice all over the place.
And keep in mind each country has its own unique address formats.
Hopefully you aren't doing Japanese addresses, with stuff like "first
house next to the old train station" in it.
On Nov 12, 2003, at 12:35 AM, p3consulting wrote:
On 12 nov. 2003, at 03:32, email@hidden wrote:
I'm working on a program that must take two postal
addresses and determine if they are logically the
same. That is, it must ignore minor differences like
"Elm Street" versus "Elm St".
I'm sure I'm not the first programmer to face this
problem and am wondering if there is anyone out there
that knows of some published code that will do this
for me. My mind hurts just thinking about it...
Thanks!
Jason
It's the kind of problem for which you will never find a "perfect"
solution, but by combining several steps you can reach some
interesting results:
1. convert to the same case, stripping accented characters (and some
other special characters like the German double s, "ß") using a good
algorithm (not the standard C lib!)
2. remove "noise" words or replace them with a "standardized" version
(noise words = street, avenue, boulevard, box, etc.) (maintain the noise
words in an editable list; you decide what is "standard", it may be an
escape sequence of your own)
3. if you also have to compare people's names, having a db of first
names may help to tell first name from last name so you can standardize
the ordering - again, no 100% automatic solution exists here; there are
situations in which telling first name from last name is impossible
4. use several soundex/levenshtein/metaphone/hashing algorithms to get
various "key" values (download the PostgreSQL sources and look at
fuzzystrmatch in the contrib folder)
5. display your address list to a human being, sorted using the keys
you have created and highlighted using colors (unique records,
duplicates, records to be checked): he/she will be able to quickly
detect duplicates among the ones with "to be checked" status (don't
forget to apply the same principles to the names)
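Steps 1, 2, 4 and 5 above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the noise-word table, the distance threshold and all function names are invented here, plain Levenshtein distance stands in for the several keys suggested in step 4, and none of it is the poster's actual code.

```python
import unicodedata

# Hypothetical editable noise-word list (step 2): variant -> standard form.
NOISE_WORDS = {"st": "street", "str": "street", "ave": "avenue",
               "av": "avenue", "blvd": "boulevard"}

def fold(text):
    """Step 1: same case, accents stripped, German sharp s expanded."""
    # casefold() lowercases and also turns "ß" into "ss".
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Drop the combining marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def make_key(address):
    """Steps 1 and 2: fold, drop punctuation, standardize noise words."""
    folded = fold(address)
    cleaned = "".join(ch if ch.isalnum() else " " for ch in folded)
    return " ".join(NOISE_WORDS.get(word, word) for word in cleaned.split())

def levenshtein(a, b):
    """Step 4: classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def classify(address_a, address_b, max_distance=2):
    """Step 5: label a pair so a person can review the doubtful ones."""
    key_a, key_b = make_key(address_a), make_key(address_b)
    if key_a == key_b:
        return "duplicate"
    if levenshtein(key_a, key_b) <= max_distance:
        return "to be checked"
    return "unique"

print(classify("5 Elm St.", "5 Elm Street"))      # duplicate
print(classify("5 Elm Street", "5 Elms Street"))  # to be checked
print(classify("5 Elm Street", "90 Oak Avenue"))  # unique
```

The three labels match the three colors in step 5; the threshold of 2 is arbitrary and would need tuning against real data.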
(By using combinations of these techniques we have been able to detect
about 5% of duplicates in a list of more than 3000 addresses (dentists),
including duplicates like:
- dentists having given both their private and their professional
  address, or sometimes several professional addresses (individual
  office and hospital)
- groups of dentists working together
- couples
- father and son working together
- women referenced under 2 different names because they are now married
We have also bound the system to some 411-like Internet sites to check
for people who have moved.
One important element in the success in our case was that the db was
made of known prospects/customers: the people checking the entries had
some "physical" experience with the prospects/customers (phone calls,
invoices, contact at exhibitions). The success may be lower when
working on "bulk" addresses, like addresses you buy from specialized
marketing firms. Also, the payback comes from the fact that the savings
in postage were high due to the weight of the professional catalog to
be sent every year.)
And of course we have never been able to remove the human part of the
work, just to make it a lot easier, faster and more reliable.
Pascal Pochet
P3 Consulting
email@hidden
http://www.p3-consulting.net
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.
--
Brian Hannan
Chief Admiral of Uncle Jam's Navy
"One nation under a groove, gettin' down just for the FUNK of it."