Re: AUGD: MUG Newsletter Indexing

Subject: Re: AUGD: MUG Newsletter Indexing
From: Paul Richards <email@hidden>
Date: Wed, 15 Feb 2006 16:14:51 -0500


On Feb 14, 2006, at 10:29 PM, Jo Booth wrote:

Unless you are searching for a phrase (and probably even then), most search engines treat the PDF, text file, or HTML file as a bunch of keywords and ignore whitespace. The excerpts they present to you in the search results may look a little weird due to PDF->HTML conversion ;) but when you go to a search result, it is in its native PDF format.

Okay, maybe a few examples will help illustrate the types of problems I encountered. The following were pulled from a PDF of one of our past newsletters by selecting the whole newsletter in Preview, copying it, pasting it into TextWrangler, and removing the line breaks added for the columns:

SM UG
2 0 0 5
Elec- tronic
uni- verse
<http:// www.imaging-resource.com>
<http:/ /www.dpreview.com>
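
Breakage like the samples above can be partly repaired after extraction. What follows is a minimal Python sketch, not anything that was run on the newsletter: the function names are hypothetical and the patterns are tuned only to these examples. It rejoins words split as "Elec- tronic" and strips stray whitespace from URLs inside angle brackets. Split letters and digits such as "SM UG" and "2 0 0 5" are left alone, since no simple rule distinguishes them from ordinary text.

```python
import re


def repair_bracketed_urls(text):
    """Remove whitespace that extraction inserted inside <http...> URLs."""
    def strip_spaces(match):
        return "<" + re.sub(r"\s+", "", match.group(1)) + ">"
    return re.sub(r"<(https?:[^<>]*)>", strip_spaces, text)


def rejoin_hyphenated(text):
    """Join words split as 'Elec- tronic' into 'Electronic'.

    Caveat: a genuine compound broken at a line end ('well- known')
    would also lose its hyphen, so review the output by hand.
    """
    return re.sub(r"(\w)-\s+(\w)", r"\1\2", text)


def clean_extracted(text):
    # URLs first, so a hyphen followed by a stray space inside a URL
    # is closed up rather than deleted by the hyphen pass.
    return rejoin_hyphenated(repair_bracketed_urls(text))


if __name__ == "__main__":
    sample = "Elec- tronic uni- verse <http:// www.imaging-resource.com>"
    print(clean_extracted(sample))
```

Running the URL pass before the hyphen pass is a deliberate ordering choice; in the other order, a URL such as "imaging- resource" would have its hyphen removed.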

I also tried using pdftotext <http://www.bluem.net/downloads/pdftotext_en/> from the command line. It seems to be much smarter about hyphens and line breaks and space padding. However, if there were two separate articles on the same page, it tended to get them partially intertwined when it spit the text back out. Page banners and headers and page numbers could also appear in strange spots.
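
One way to investigate the intertwined-article problem is to run pdftotext in each of its output modes and compare the results. The sketch below assumes a pdftotext binary on the PATH that accepts the `-layout` and `-raw` options (the xpdf and Poppler builds do); the helper names and output file names are hypothetical. Whether any mode keeps two articles on one page apart depends on how the PDF was produced, so this is a way to compare, not a fix.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, mode=None):
    """Build the argument list for one pdftotext run.

    mode is None (default reading-order heuristics), "layout"
    (preserve physical layout) or "raw" (content-stream order).
    """
    if mode not in (None, "layout", "raw"):
        raise ValueError("mode must be None, 'layout' or 'raw'")
    cmd = ["pdftotext"]
    if mode is not None:
        cmd.append("-" + mode)
    cmd += [pdf_path, txt_path]
    return cmd


def extract_all_modes(pdf_path, stem):
    """Write stem.default.txt, stem.layout.txt and stem.raw.txt."""
    outputs = []
    for mode in (None, "layout", "raw"):
        out = "{}.{}.txt".format(stem, mode or "default")
        # check=True raises CalledProcessError if pdftotext fails.
        subprocess.run(pdftotext_command(pdf_path, out, mode), check=True)
        outputs.append(out)
    return outputs
```

For example, `extract_all_modes("newsletter.pdf", "newsletter")` would produce three text files that can be compared side by side to see which one keeps the articles and page banners in the most usable order.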


Paul Richards, Ambassador
Syracuse Macintosh Users Group
Syracuse, NY, USA
http://www.iSMUG.com
