Re: AUGD: MUG Newsletter Indexing
- Subject: Re: AUGD: MUG Newsletter Indexing
- From: Jo Booth <email@hidden>
- Date: Wed, 15 Feb 2006 16:29:11 +1300
On 15/02/2006, at 15:50 , Paul Richards wrote:
On Feb 7, 2006, at 12:41 AM, Greg Sharp wrote:
These newsletters are retained in PDF format.
I am wondering if anyone has any suggestions on how one might be able
to develop an Index for all this material, preferably something that
would be accessible online.
This functionality is also built in to a lot of servers.
I got to thinking about this more over the past couple of days and
remembered another thing about my previous explorations into PDF
indexing. One of the workarounds I tried was converting the PDFs to
text and indexing the text. What tripped me up was space padding
and other visual text treatments that
were placed in the PDF to make it look good, either by the DTP
software or by the PDF process itself, but which disrupted the
natural flow of text when it was converted back. I am curious about
whether this also poses a problem for the various PDF indexing
methods that have been mentioned in this thread.
Unless you are searching for a phrase (and probably even then), most
search engines treat the PDF, text, or HTML file as a bunch of
keywords and ignore whitespace. The excerpts they present to you
in the search results may look a little weird due to PDF-to-HTML
conversion ;) but when you open a search result it is in its
native PDF format.
I think ;)
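
To illustrate the keyword idea, here is a rough Python sketch, not how any particular search engine actually works. It assumes the PDFs have already been converted to plain text; the file names, sample text, and the helper names tokenize, build_index, and search are all made up for the example. It shows that once text is split into keywords, extra space padding no longer matters:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its word tokens, dropping whitespace and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build a simple inverted index: keyword -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every keyword in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical sample data: the same sentence with and without layout padding.
docs = {
    "padded.txt": "MUG    Newsletter\n\n   February      2006",
    "plain.txt": "MUG Newsletter February 2006",
}
index = build_index(docs)
matches = search(index, "newsletter february")
```

Both sample documents produce the same keyword list, so the query matches either one. Padding between words is harmless here, but treatments that split a word itself (letter-spacing such as "N e w s", or hyphenation across lines) would still produce different keywords and could hurt matching.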
-Jo.
WelMac VP / NMGite
http://forums.welmac.org.nz