I hope this post -- my first to this list -- is not considered "too long," but I've tried to document in detail my research, assumptions, and questions concerning the interrelationship of Spotlight, Content Indexing, and SearchKit. I've posted it here since there is neither a Spotlight nor a SearchKit mailing list; I hope such lists will be added soon.
I have read the following thread, entitled "SearchKit and Spotlight"
as well as searched all the mailing lists and read all the relevant documentation I could find. However, I cannot find documentation that addresses how Spotlight (as seen by the user) goes about content indexing and the creation of the ContentIndex.db vs. the metadata extraction.
Background:
Under Panther and Jaguar, the /System/Library/Find/ directory contained ContentIndexing.app and a variety of *TextExtractor.bundle files used to extract text from various file types, e.g. HTMLTextExtractor.bundle for HTML documents and PDFTextExtractor.bundle for Adobe PDF files.
Under Tiger:
- It is unclear whether the functions of the *TextExtractor.bundle files are incorporated in the *.mdimporter plug-ins. There are no *TextExtractor.bundle files in the /System/Library/Find directory. I suspect these are now in SearchKit, but I cannot find them.
- The ContentIndexing.app is still included in the /System/Library/Find directory, but appears to be the same app from Panther.
- There was a major change in where the content index was saved. Under Panther and Jaguar, the content index for a given folder was saved in a series of .FBC* files within the folder. Now the .Spotlight-V100 directory contains ContentIndex.db, which appears to be the content index for all Spotlight-searchable files on the volume.
In the "Spotlight Importer Programming Guide" section "Extracting Metadata from Documents" it states:
"Spotlight importers parse your document format for relevant information and assigning that information to the appropriate metadata keys. Keys help index the content in the data store and facilitate searches."
I interpret "help index the content" as being different from a full content index, i.e. the metadata index is not a content index, but is "helpful" in searching for objects. The kMDItemTextContent metadata attribute, at least as mdls reports it, is clearly not the document's text content: I have plain-text files for which mdls returns no value for this attribute, yet those files are returned as search results if I use Find (Finder > Command-F) and search by Content.
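The comparison described above can be repeated from Terminal. The following is a minimal sketch, assuming the Tiger command-line tools mdls and mdfind; the function names check_text_content and find_by_content are hypothetical helpers introduced only for illustration, and the exact query form accepted for kMDItemTextContent is an assumption that may vary between system versions.

```shell
#!/bin/sh
# Sketch only: compare what mdls reports for kMDItemTextContent with what
# a content query returns. Helper names are hypothetical.

# Print the kMDItemTextContent attribute that mdls reports for one file.
# Returns 2 when mdls is not installed (for example, on a non-Mac system).
check_text_content() {
    file="$1"
    if ! command -v mdls >/dev/null 2>&1; then
        echo "mdls not found; this check needs Mac OS X 10.4 or later" >&2
        return 2
    fi
    mdls -name kMDItemTextContent "$file"
}

# List files under a folder that match a query on kMDItemTextContent.
# Assumption: a plain word comparison is accepted for this attribute.
# Returns 2 when mdfind is not installed.
find_by_content() {
    dir="$1"
    word="$2"
    if ! command -v mdfind >/dev/null 2>&1; then
        echo "mdfind not found; this check needs Mac OS X 10.4 or later" >&2
        return 2
    fi
    mdfind -onlyin "$dir" "kMDItemTextContent == '${word}'"
}
```

For example, running check_text_content on a plain-text file and then find_by_content on its enclosing folder with a word known to be in that file would show whether the file is found by content even though mdls reports no value for the attribute.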
It seems that "Spotlight" (as a user sees it) employs two distinct processes when it "indexes" documents:
a. The document's metadata is extracted and added to the metadata stores. This is what is documented in the ADC documents on Spotlight and .mdimporter objects.
b. The document's content is indexed, if appropriate. I suspect this is handled by SearchKit.
How these two processes are integrated is unclear.
Questions:
1. What is the underlying integration of metadata indexing and content indexing, e.g. which is done first and which second? Specifically, when is content indexing performed in the overall process of "indexing" documents with Spotlight? Is it performed on a per-document basis as a document is processed by its associated .mdimporter? In other words, does the larger Spotlight indexing process first call the .mdimporter to extract the metadata, then invoke SearchKit (ContentIndexing.app?) to index the content?
2. Is Spotlight or SearchKit using ContentIndexing.app to produce /.Spotlight-V100/ContentIndex.db ? If not, how is the ContentIndex.db produced?
3. Have the .mdimporter objects assumed the role of the *TextExtractor.bundle objects under Panther? It appears not; text extraction seems to be handled somehow by SearchKit.
"In general, your application takes responsibility for providing document content to Search Kit in the appropriate form, namely a CFString object. For local, on-disk files supported by the built-in text extractor plug-ins, Search Kit knows how to get content for you. The text extractor plug-ins work with RTF, XML, plaintext, and PDF files, as well as with Microsoft Word documents."
However, I've been unable to find these unless they are somehow embedded in the type-specific .mdimporter objects included with Tiger, i.e. in the /System/Library/Spotlight/ and /Library/Spotlight/ directories. Examining the contents of these bundles has not given me any clues.
5. How would one create new text extractor plug-ins? For example, there is a FBC (Find By Content) Extractor SDK available here
that is probably outdated and not applicable to SearchKit or Spotlight.
6. In Panther and Jaguar, one could encounter failed index builds if ContentIndexing hit "corrupted" documents. I've documented this (see http://www.thexlab.com/faqs/failedindex.html). In my experience, a document that ContentIndexing considered corrupted would sometimes open without problems in its associated application, so the fault was probably in the *TextExtractor.bundle, or in how ContentIndexing handled "bad" return codes from it. For metadata extraction, it appears that an importer can indicate that extraction failed or produced no results, so the whole "Spotlight" (as seen by the user) indexing process should not come crashing down. How, though, is this handled for content indexing under Tiger?
I'd appreciate answers to these questions from Apple Development. Thanks!