Re: Matching Postal Addresses

Subject: Re: Matching Postal Addresses
From: Brian Hannan <email@hidden>
Date: Wed, 12 Nov 2003 01:25:54 -0800

At my last job, I worked at a place that did geolocation work. One of the things we did was geocoding: taking an address in natural human language and getting the latitude and longitude for that location. It can get pretty involved. However, without investing lots of time you can get by with some good heuristics. I'd say go for normalization of words that can have more than one form. For example, in US English "street" can appear as "st", "st.", "street", or not at all, as in just "Elm". As was suggested, flatten everything to a single case, either upper or lower. Then attempt to normalize words, so as input you'll have:

"Elm St."

and as output you'd get

"elm street"

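A minimal sketch of that normalization, in Python. The abbreviation table and the function name are made up for illustration; a real table would be far longer and specific to one country. The case where the street word is missing entirely (just "Elm") is not handled here.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (an assumption,
# not a complete or official list).
STREET_WORDS = {
    "st": "street",
    "str": "street",
    "ave": "avenue",
    "av": "avenue",
    "blvd": "boulevard",
    "rd": "road",
}

def normalize_address(text):
    """Lower-case the text, drop punctuation, expand known abbreviations."""
    # Keep word characters and whitespace; turn everything else into a space.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = cleaned.split()
    return " ".join(STREET_WORDS.get(tok, tok) for tok in tokens)

print(normalize_address("Elm St."))     # elm street
print(normalize_address("ELM STREET"))  # elm street
```

Two addresses can then be compared by comparing their normalized strings.
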
As was suggested, nothing is perfect. If you can't resolve the address, or you get multiple normalizations, then, as was also suggested, let the user help you out. Ask for their input. They usually know best, and picking the wrong location usually has very bad implications. Letting the user quickly choose from a brief list of two or three choices usually won't bother them.
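
One way the "ask the user" fallback might look, as a rough Python sketch. The table of ambiguous tokens, the function names and the prompt text are all assumptions for illustration. The list is simply cut off at `limit` entries; a real system would want to rank the candidates before truncating.

```python
from itertools import product

# Hypothetical table of tokens with more than one plausible expansion.
AMBIGUOUS = {
    "st": ["street", "saint"],
    "dr": ["drive", "doctor"],
}

def candidate_normalizations(text, limit=3):
    """Return up to `limit` possible normalizations of an address."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    # Each token contributes either its possible expansions or itself.
    options = [AMBIGUOUS.get(tok, [tok]) for tok in tokens]
    candidates = [" ".join(combo) for combo in product(*options)]
    return candidates[:limit]

def choose(candidates, ask=input):
    """Accept a single candidate silently; otherwise let the user pick."""
    if len(candidates) == 1:
        return candidates[0]
    for number, candidate in enumerate(candidates, start=1):
        print(f"{number}. {candidate}")
    answer = ask("Which address did you mean? ")
    return candidates[int(answer) - 1]
```

Passing `ask` as a parameter keeps the prompt replaceable, so the same logic can sit behind a dialog instead of a terminal.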

And when you get down to it, this is just another form of the classic normalization problem, of which you'll find tons of examples in computer science, especially among DB people. That means you can get lots of good advice all over the place.

And keep in mind each country has its own unique address formats. Hopefully you aren't doing Japanese addresses, with stuff like "first house next to the old train station" in them.

On Nov 12, 2003, at 12:35 AM, p3consulting wrote:

On 12 nov. 2003, at 03:32, email@hidden wrote:
I'm working on a program that must take two postal
addresses and determine if they are logically the
same. That is, it must ignore minor differences like
"Elm Street" versus "Elm St".

I'm sure I'm not the first programmer to face this
problem and am wondering if there is anyone out there
that knows of some published code that will do this
for me. My mind hurts just thinking about it...

Thanks!

Jason

It's the kind of problem for which you will never find a "perfect" solution, but by combining several steps you can
reach some interesting results:

1. Convert to the same case, stripping accented chars (and some other special chars like the German double S) using a good algorithm (not the standard C lib!).
2. Remove "noise" words or replace them with a "standardized" version (noise words = street, avenue, boulevard, box, etc.). Maintain the noise words in an editable list; you decide what is "standard" (it may be an escape sequence of your own).
3. If you also have to compare people's names, having a db of first names may help to detect first name/last name so you can try to standardize the ordering. Again, no 100% automatic solution exists here; there are situations in which detecting first name/last name is impossible.
4. Use several soundex/levenshtein/metaphone/hashing algorithms to get various "key" values (download the PostgreSQL sources and look at fuzzystrmatch in the contrib folder).
5. Display your address list to a human being, sorted using the keys you have created and highlighted using colors (unique records, duplicates, to-be-checked records): he/she will be able to quickly detect duplicates among the ones having "to be checked" status (don't forget to apply the same principles to the names).
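
Steps 1, 2 and 4 above could be sketched in Python roughly as follows. The noise-word list and all function names are assumptions for illustration, the noise words are removed rather than replaced, and only Levenshtein distance is shown; soundex and metaphone are left out.

```python
import unicodedata

# Hypothetical noise-word list; step 2 suggests keeping it editable.
NOISE_WORDS = {"street", "st", "avenue", "ave", "boulevard", "blvd", "box"}

def fold(text):
    """Step 1: case-fold and strip accents ("ß" becomes "ss" via casefold)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # After decomposition, accents are separate combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_key(address):
    """Step 2: drop punctuation and noise words, keep the remaining tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in fold(address))
    return " ".join(tok for tok in cleaned.split() if tok not in NOISE_WORDS)

def levenshtein(a, b):
    """Step 4: classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

With this sketch, "12 Élm St." and "12 Elm Street" produce the same key, and a small edit distance between two different keys marks a pair worth a second look.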

(By using combinations of these techniques we have been able to detect about 5% duplicates in a list of more than 3000 addresses (dentists), including:

- dentists having given both their private and their professional address, or sometimes several professional addresses (individual office and hospital)
- groups of dentists working together
- couples, or father and son, working together
- women referenced under 2 different names because they are now married

We have also bound the system to some 411-like Internet sites to check for people who have moved.

One important element in the success in our case was that the db was made of known prospects/customers: the people checking the entries had some "physical" experience with the prospects/customers (phone calls, invoices, contact at exhibitions). The success may be lower when working on "bulk" addresses, like addresses you buy from specialized marketing firms. Also, the payback came from the fact that the savings in postage were high, due to the weight of the professional catalog to be sent every year.)

And of course we have never been able to remove the human part of the work, just to make it a lot easier, faster and more reliable.
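
The pre-sorting that makes the human's job easier might be sketched like this in Python. It assumes each record has already been reduced to a key string, and it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure (not one of the algorithms named in step 4). The function name and the 0.85 threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher

def triage(keys, threshold=0.85):
    """Label each key 'duplicate', 'to be checked' or 'unique'."""
    statuses = []
    for i, key in enumerate(keys):
        others = keys[:i] + keys[i + 1:]
        if key in others:
            # An identical key exists elsewhere in the list.
            statuses.append("duplicate")
        elif any(SequenceMatcher(None, key, other).ratio() >= threshold
                 for other in others):
            # Close to another key, but not identical: needs a human.
            statuses.append("to be checked")
        else:
            statuses.append("unique")
    return statuses

print(triage(["12 elm", "12 elm", "12 elms", "9 oak"]))
# ['duplicate', 'duplicate', 'to be checked', 'unique']
```

This compares every key with every other key, which is fine for a few thousand records but would need blocking or indexing for much larger lists.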

Pascal Pochet
P3 Consulting
email@hidden
http://www.p3-consulting.net
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.

--
Brian Hannan
Chief Admiral of Uncle Jam's Navy

"One nation under a groove, gettin' down just for the FUNK of it."
