Re: Spotlight, Content Indexing, and SearchKit integration questions
- Subject: Re: Spotlight, Content Indexing, and SearchKit integration questions
- From: Mark T <email@hidden>
- Date: Thu, 02 Jun 2005 11:04:57 -0400
This is from memory, so I may be wrong on a few small points.
On Mon, 2005-05-30 at 14:42 -0500, Dr. Smoke wrote:
> 1. What is the underlying integration of metadata indexing and content
> indexing, e.g. which is done first, which is done second?
> Specifically, when is content indexing performed in the overall
> process of "indexing" documents with Spotlight. Is it performed on a
> per-document basis as a document is processed by its
> associated .mdimporter? In other words, does the larger Spotlight
> indexing process first call the .mdimporter to extract the metadata,
> then invoke SearchKit (ContentIndexing.app?) to index the content?
Spotlight performs both content and metadata indexing, relying on
the .mdimporter to perform both functions.
> 2. Is Spotlight or SearchKit using ContentIndexing.app to
> produce /.Spotlight-V100/ContentIndex.db ? If not, how is the
> ContentIndex.db produced?
Initial indexing of a volume is handled automatically by the mdindex
tool, which creates the files in .Spotlight-V100. After that, the kernel
updates the index at the file I/O level whenever files are added or
changed, so no separate indexing pass is needed once the initial index
exists.
> 3. Have the .mdimporter objects assumed the role of the
> *TextExtractor.bundle objects under Panther? It appears not and that
> this is somehow handled by SearchKit.
They have.
> 4. Where are the "text extractor plug-ins" employed by SearchKit?
> SearchKit documentation -- "Adding Search to Your Application
> (Preliminary)"
> ( http://developer.apple.com/documentation/UserExperience/Conceptual/SearchKitConcepts/index.html ) -- notes the following:
>
>
> "In general, your application takes responsibility for providing
> document content to Search Kit in the appropriate form, namely a
> CFString object. For local, on-disk files supported by the built-in
> text extractor plug-ins, Search Kit knows how to get content for you.
> The text extractor plug-ins work with RTF, XML, plaintext, and PDF
> files, as well as with Microsoft Word documents."
>
>
> However, I've been unable to find these unless they are somehow
> embedded in the type-specific .mdimporter objects included with Tiger,
> i.e. in the /System/Library/Spotlight/ and /Library/Spotlight/
> directories. Examining the contents of these bundles has not given me
> any clues.
The .mdimporter files are, indeed, the "text extractor plug-ins"
mentioned in the documentation.
> 5. How would one create new text extractor plug-ins? For example,
> there is a FBC (Find By Content) Extractor SDK available here
>
>
> http://developer.apple.com/sdk/
>
>
> that is probably outdated and not applicable to SearchKit or
> Spotlight.
You can create text extractors with Xcode. The process is fully documented.
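
To give a rough idea of what an extractor's job looks like, here is a minimal sketch in portable C. It is not the Spotlight plug-in API: a real importer is built from Xcode's importer project template and returns its results through Core Foundation types. The function name extract_plain_text and its buffer-based signature are hypothetical, chosen only to show the shape of "read the file, hand back plain text, report success or failure".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of any Apple API): copy the printable
   text found in the file at `path` into `out`, which holds `cap` bytes.
   Returns 1 on success, 0 if the file cannot be opened.
   Output is truncated to fit and is always NUL-terminated. */
int extract_plain_text(const char *path, char *out, size_t cap)
{
    assert(path != NULL && out != NULL && cap > 0);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        out[0] = '\0';
        return 0;   /* report failure; the caller decides what to do */
    }

    size_t n = 0;
    int c;
    while (n + 1 < cap && (c = fgetc(fp)) != EOF) {
        /* Keep printable ASCII plus newline and tab; drop everything else. */
        if ((c >= 0x20 && c < 0x7f) || c == '\n' || c == '\t') {
            out[n++] = (char)c;
        }
    }
    out[n] = '\0';
    fclose(fp);
    return 1;
}
```

A caller would pass a path and a buffer, then index the buffer only when the function returns 1.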
> 6. In Panther and Jaguar, one could encounter failed index builds if
> ContentIndexing hit "corrupted" documents. I've documented this here:
> http://www.thexlab.com/faqs/failedindex.html In my experience,
> sometimes a document ContentIndexing considered corrupted would
> actually open AOK in its associated application, so the problem was
> probably in the *TextExtractor.bundle, or how ContentIndexing handled
> "bad" return codes from such. For metadata extraction, it appears
> there are provisions that can be made for indicating extraction failed
> or did not produce results, hence the whole "Spotlight" (as seen by
> the user) indexing process should not come crashing down. However,
> how is this handled with content indexing under Tiger?
The text extractor handles pretty much everything. Detecting corrupt
documents and skipping them is up to whoever writes the importer.
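
The "detect and skip" idea can be sketched as follows. This is an illustration under stated assumptions, not how Spotlight itself is implemented: the "DOC1" signature, the struct doc type, and the functions extract_doc and index_all are all made up for the example. The point is only that the extractor reports failure with a return value, and the indexing loop moves on to the next document instead of aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAGIC "DOC1"   /* hypothetical 4-byte signature of a made-up format */

/* Hypothetical extractor: the format is the signature followed by text.
   Returns 1 and copies the text into `out` on success, 0 if the data
   is malformed (missing or truncated signature) or arguments are bad. */
int extract_doc(const unsigned char *data, size_t len, char *out, size_t cap)
{
    size_t magic_len = strlen(MAGIC);
    if (data == NULL || out == NULL || cap == 0)
        return 0;
    out[0] = '\0';
    if (len < magic_len || memcmp(data, MAGIC, magic_len) != 0)
        return 0;                       /* treat as corrupt */
    size_t body = len - magic_len;
    if (body >= cap)
        body = cap - 1;                 /* truncate rather than overflow */
    memcpy(out, data + magic_len, body);
    out[body] = '\0';
    return 1;
}

/* An in-memory stand-in for a document on disk (hypothetical). */
struct doc {
    const char *name;
    const unsigned char *data;
    size_t len;
};

/* Try every document; skip the ones the extractor rejects.
   Returns the number indexed and stores the number skipped. */
size_t index_all(const struct doc *docs, size_t count, size_t *skipped)
{
    assert(docs != NULL && skipped != NULL);
    char text[256];
    size_t indexed = 0;
    *skipped = 0;
    for (size_t i = 0; i < count; i++) {
        if (!extract_doc(docs[i].data, docs[i].len, text, sizeof text)) {
            fprintf(stderr, "skipping %s: extraction failed\n", docs[i].name);
            (*skipped)++;
            continue;                   /* one bad file does not stop the run */
        }
        indexed++;   /* a real indexer would add `text` to its index here */
    }
    return indexed;
}
```

With one well-formed document and two malformed ones, index_all indexes the first and counts the other two as skipped.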
> I'd appreciate answers to these questions from Apple Development.
> Thanks!
It seems like you're confused about what Spotlight and friends really
are. Spotlight handles all aspects of system-wide searching, but it isn't
really one thing. Instead, it's a set of technologies for indexing
(again, at the kernel level) and for searching information in any form.
.mdimporter files are run by Spotlight (at initial indexing or
kernel-based index updating) to get information out of whatever format
the file is in and into the plain-text index.
Content Indexing is no longer used by Tiger. It used a user-level
indexing process for updates instead of the kernel-level method used
now.
SearchKit can be used by developers to add indexed searching to their
applications. I think it's what Spotlight uses to search the Spotlight
index, but I may be wrong on that point.
You really need to read more about the technology involved. The Apple
docs are a great place to start. Also, Ars Technica has an in-depth
review of Tiger with plenty of metadata and Spotlight talk. You can find
it here:
http://arstechnica.com/reviews/os/macosx-10.4.ars
Cocoa-dev mailing list (email@hidden)