Re: AUGD: MUG Newsletter Indexing


  • Subject: Re: AUGD: MUG Newsletter Indexing
  • From: Paul Richards <email@hidden>
  • Date: Wed, 15 Feb 2006 16:14:51 -0500


On Feb 14, 2006, at 10:29 PM, Jo Booth wrote:

Unless you are searching for a phrase (and probably even then), most search engines treat a PDF, text file, or HTML file as a bunch of keywords and ignore whitespace. The excerpts they present in the search results may look a little weird due to PDF-to-HTML conversion ;) but when you open a search result, it is in its native PDF format.

Okay, maybe a few examples will help illustrate the types of problems I encountered. The following were pulled from a PDF of one of our past newsletters by selecting the entire newsletter in Preview, copying it, pasting it into TextWrangler, and removing the line breaks added for the columns:


SM UG
2 0 0 5
Elec- tronic
uni- verse
<http:// www.imaging-resource.com>
<http:/ /www.dpreview.com>
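
For anyone who wants to clean up text like this before indexing it, here is a minimal Python sketch of the kind of repair pass that might help. It is only an illustration built around the samples above, not a tested tool: the function name `repair_extracted_text` is made up, and the three rules are assumptions about what is safe to fix. In particular, rejoining "Elec- tronic" will also mangle a genuinely hyphenated compound that happened to break at a column edge, and nothing here can fix "SM UG" without a list of known words.

```python
import re


def repair_extracted_text(text: str) -> str:
    """Apply a few heuristic repairs to text copied out of a multi-column PDF."""
    # Rejoin words split at a column edge: "Elec- tronic" -> "Electronic".
    # Only lowercase letters on both sides, to limit false positives.
    text = re.sub(r"(?<=[a-z])- (?=[a-z])", "", text)

    # Remove stray whitespace inside angle-bracketed URLs:
    # "<http:/ /www.dpreview.com>" -> "<http://www.dpreview.com>".
    text = re.sub(
        r"<(https?:[^>]*)>",
        lambda m: "<" + re.sub(r"\s+", "", m.group(1)) + ">",
        text,
    )

    # Collapse space-padded digits: "2 0 0 5" -> "2005".
    text = re.sub(
        r"\b\d(?: \d)+\b",
        lambda m: m.group(0).replace(" ", ""),
        text,
    )
    return text


if __name__ == "__main__":
    samples = [
        "2 0 0 5",
        "Elec- tronic",
        "uni- verse",
        "<http:// www.imaging-resource.com>",
        "<http:/ /www.dpreview.com>",
    ]
    for sample in samples:
        print(repr(sample), "->", repr(repair_extracted_text(sample)))
```

The point of the exercise is that a keyword index built from the raw text would file "Elec" and "tronic" as two separate terms, so a search for "Electronic" would miss the article unless something like this runs first.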

I also tried using pdftotext <http://www.bluem.net/downloads/pdftotext_en/> from the command line. It seems to be much smarter about hyphens, line breaks, and space padding. However, when there were two separate articles on the same page, it tended to partially intertwine them when it spit the text back out. Page banners, headers, and page numbers could also appear in strange spots.
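
If you want to script pdftotext rather than run it by hand, something like the following Python sketch may be a starting point. It assumes the installed pdftotext is an xpdf-derived build that accepts the `-f` and `-l` page-range options and the `-layout` option; the helper names and the file names in the usage example are hypothetical. Extracting one page at a time does not untangle two articles that share a page, but it does make it easier to see which pages need fixing by hand, and it may be worth comparing the output with and without `-layout` on a problem page.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_command(
    pdf_path: str,
    txt_path: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    layout: bool = False,
) -> List[str]:
    """Build the argument list for a pdftotext run (assumes xpdf-style flags)."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    if layout:
        cmd.append("-layout")           # try to keep the physical page layout
    cmd += [pdf_path, txt_path]
    return cmd


def extract_page(pdf_path: str, txt_path: str, page: int, layout: bool = False) -> None:
    """Convert a single page to text; raises CalledProcessError if pdftotext fails."""
    cmd = build_pdftotext_command(pdf_path, txt_path, page, page, layout)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_pdftotext_command("newsletter.pdf", "page2.txt", 2, 2, layout=True))
```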


Paul Richards, Ambassador
Syracuse Macintosh Users Group
Syracuse, NY, USA
http://www.iSMUG.com


