Re: Spellchecking queries to a database
- Subject: Re: Spellchecking queries to a database
- From: petite_abeille <email@hidden>
- Date: Thu, 23 Oct 2003 11:49:07 +0200
Hi Arturo,
Arturo Pirez wrote:
> I used to do this sort of thing for a living (up until about 3 months
> ago). That Google spellchecking thing was the bane of my existence.
> I've hated them ever since they unveiled it :-)
>
> It looks so simple.
> It appears to work so well.
> Everybody wants it.
>
> But...
Yep. Google must be the best spellchecker out there by far... sigh...
> Wordnet doesn't have all the information necessary to duplicate the
> functionality.
Right... but it's a start. And short of indexing the entire Internet
I'm not aware of any "extensive" source of data ;) Your best bet is to
combine whatever you can find.
> No taxonomy that I'm aware of does.
As far as "taxonomy" goes, one thing one could leverage is the dmoz
catalog:
http://rdf.dmoz.org/
But... practically speaking... how would a taxonomy fit in the picture?
> The functionality can't be duplicated with any RDBMS.
> Or it would be too cumbersome to do so.
> To try and do it with a natural language search engine like Lucene
> would negatively impact search performance, to put it mildly.
Well... it depends on how much resources you can throw at it. Memory is
cheap ;)
> There's a reason that Google has 50 PhDs in mathematics and natural
> language processing. To make the rest of us miserable. :-)
Or happy, as one could use the Google API to access this functionality
programmatically when practical.
> For example, how does it decide to suggest dogg from doggl? Or to
> suggest immeasurable from imesurable?
My guess is that alongside all the traditional techniques used for
spellchecking, Google's "magic sauce" is some clever statistics about
which words people _do_ correct and which ones they ignore. That way,
their suggestion mechanism can automatically improve over time thanks
to user interactions. They must also have the largest data set of
"words" available to mankind...
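The idea of learning from which corrections users accept could be sketched as follows. This is purely illustrative: the log entries and function names are made up, and nothing here reflects how Google actually does it.

```python
from collections import Counter

# Hypothetical interaction log of (typed word, correction the user
# accepted) pairs -- the kind of data speculated about above.
# All entries are invented for illustration.
correction_log = [
    ("doggl", "dogg"), ("doggl", "dogg"), ("doggl", "doggy"),
    ("imesurable", "immeasurable"),
]

def build_model(log):
    """Count how often each correction was accepted for each typo."""
    model = {}
    for typo, accepted in log:
        model.setdefault(typo, Counter())[accepted] += 1
    return model

def suggest(model, word):
    """Return the most frequently accepted correction, or None."""
    counts = model.get(word)
    return counts.most_common(1)[0][0] if counts else None

model = build_model(correction_log)
print(suggest(model, "doggl"))  # "dogg" -- accepted twice vs. once
```

As more users click (or ignore) a given suggestion, the counts shift and the top suggestion improves on its own, with no change to the underlying algorithm.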
> None of the stemming algorithms will do that. It must be some sort of
> distance metric. But (optimal) string transformations of that sort are
> NP-complete IIRC. So you need massive amounts of computes to do it.
Personally, I would use a statistical approach to improve the
suggestions.
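One common statistical recipe (the corpus and names below are invented for illustration) is to generate every string within one edit of the misspelling and rank the ones that are real words by how often they occur in a large corpus:

```python
import re
from collections import Counter

# Stand-in corpus; a real system would use a very large
# word-frequency table built from lots of text.
corpus = "the dog chased the other dog and the dogs barked"
WORDS = Counter(re.findall(r"[a-z]+", corpus))

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Every string exactly one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def suggest(word):
    """Most frequent known word within one edit, else the word itself."""
    candidates = [w for w in edits1(word) if w in WORDS]
    return max(candidates, key=WORDS.get) if candidates else word

print(suggest("dogg"))  # "dog": in the corpus and one delete away
```

The word frequencies play the role of the statistics: when several candidates are equally close, the one people actually write wins.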
> Google's 5-10K (that's thousands) of CPUs probably strain a little
> doing it :-)
Hardware is cheap ;)
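One nitpick on the quoted "NP-complete" remark: the edit distance between a single pair of strings (Levenshtein distance) is computable in O(nm) time with the classic Wagner-Fischer dynamic program. The expensive part is comparing a query against millions of dictionary words, which is where the hardware goes. A minimal sketch:

```python
def levenshtein(a, b):
    """Wagner-Fischer dynamic programming: O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("imesurable", "immeasurable"))  # 2: insert "m" and "a"
```

This directly handles the earlier examples: doggl is one edit from dogg, and imesurable is two edits from immeasurable.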
> Anyways, I always had fun working on that stuff. I'd be happy to share
> what I know about the topic.
I would love to hear about it :)
For those interested in the topic, one handy starting point is Alex
Murzaku's "Text and Language Processing Resources":
http://www.lissus.com/resources/index.htm
Cheers,
PA.
_______________________________________________
webobjects-dev mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/webobjects-dev
Do not post admin requests to the list. They will be ignored.