Re: AUGD: MUG Newsletter Indexing
- Subject: Re: AUGD: MUG Newsletter Indexing
- From: Paul Richards <email@hidden>
- Date: Wed, 15 Feb 2006 16:14:51 -0500
On Feb 14, 2006, at 10:29 PM, Jo Booth wrote:
Unless you are searching for a phrase (and probably even then) most
search engines treat the pdf / text file / html file as a bunch of
keywords - and ignore whitespace. The excerpts they present to you
in the search results may look a little weird due to pdf->html
conversion ;) but when you go to a search result it is in its
native pdf format.
Okay, maybe a few examples will help illustrate the types of problems
I encountered. The following were pulled from a PDF of one of our
past newsletters by selecting all of the newsletter in Preview and
copying, then pasting into Text Wrangler and removing the line breaks
added for the columns:
SM UG
2 0 0 5
Elec- tronic
uni- verse
<http:// www.imaging-resource.com>
<http:/ /www.dpreview.com>
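For what it's worth, two of the artifacts above (words hyphenated across a column break, and stray spaces inside URLs) can be patched up mechanically before indexing. What follows is only a rough Python sketch: the function names are made up for illustration, and the rules are heuristics that assume URLs are wrapped in angle brackets as in the samples. It does not attempt letter-spaced headings such as "SM UG" or "2 0 0 5".

```python
import re

# Hypothetical helpers for illustration; heuristic rules, not a general fix.
_HYPHEN_BREAK = re.compile(r"(\w)- (\w)")   # 'Elec- tronic'
_BRACKETED_URL = re.compile(r"<(https?:[^<>]*)>")


def rejoin_hyphenated(text):
    """Join words that were split at a column break: 'uni- verse' -> 'universe'."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


def repair_urls(text):
    """Remove stray whitespace inside <...>-delimited URLs."""
    def fix(match):
        return "<" + re.sub(r"\s+", "", match.group(1)) + ">"
    return _BRACKETED_URL.sub(fix, text)


def clean(text):
    """Apply both repairs to a block of extracted text."""
    return repair_urls(rejoin_hyphenated(text))


if __name__ == "__main__":
    sample = "Elec- tronic uni- verse <http:// www.imaging-resource.com>"
    print(clean(sample))
```

One caveat: a genuinely hyphenated compound that happened to break at the end of a line (say "well- known") would be joined into "wellknown", so the output still deserves a quick read-through.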
I also tried using pdftotext <http://www.bluem.net/downloads/pdftotext_en/>
from the command line. It seems to be much smarter about hyphens, line
breaks, and space padding. However, if there
were two separate articles on the same page, it tended to get them
partially intertwined when it spit the text back out. Page banners
and headers and page numbers could also appear in strange spots.
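The stray banners and page numbers, at least, might be filtered out after the fact. Here is a minimal Python sketch, assuming the pdftotext output separates pages with a form feed character (its default behaviour as far as I know) and that a banner is any line repeated verbatim on every page. The function name is hypothetical, and this does nothing about the intertwined articles.

```python
from collections import Counter


def strip_page_furniture(text):
    """Drop lines repeated on every page and lines that are bare page numbers.

    Assumes pages are separated by form feeds ('\f').
    """
    pages = [p for p in text.split("\f") if p.strip()]

    # Count on how many pages each distinct non-blank line appears.
    counts = Counter()
    for page in pages:
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    # A line present on every page of a multi-page file is treated as a banner.
    repeated = {ln for ln, n in counts.items()
                if len(pages) > 1 and n == len(pages)}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines()
                if ln.strip() not in repeated and not ln.strip().isdigit()]
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "SMUG News\nArticle one text\n1\n\fSMUG News\nArticle two text\n2\n"
    print(strip_page_furniture(sample))
```

Note that the digit test also throws away any real content line that consists only of a number, so it is best suited to text that is headed for a search index rather than for reading.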
Paul Richards, Ambassador
Syracuse Macintosh Users Group
Syracuse, NY, USA
http://www.iSMUG.com