Re: Spotlight, Content Indexing, and SearchKit integration questions
- Subject: Re: Spotlight, Content Indexing, and SearchKit integration questions
- From: Mark T <email@hidden>
- Date: Thu, 02 Jun 2005 09:14:16 -0400
On Jun 1, 2005, at 8:45 PM, Dr. Smoke wrote:
On 2005.06.01, at 18:31, Mark T wrote:
Here's the SearchKit situation as I understand it. SearchKit is an
API consisting of functions for a) creating searchable indexes of
content and b) searching them very quickly. SearchKit was
introduced in Panther so developers could easily implement search.
Yes, I understand this.
Spotlight a) tells importer plug-ins about files to index, b) adds
whatever the plug-ins return to a SearchKit index, and c) searches
that index.
I don't think (b) is correct. Spotlight adds whatever
the .mdimporter returns to the "metadata store" which I interpret
to mean the .Spotlight-V100 directory. This contains:
It looks to me from the docs like the metadata store
(ContentIndex.db) is actually a SearchKit index.
I suspect the .store.db and store.db files are metadata indices,
e.g. kMDItemAuthors, etc. (why two is another question) and
ContentIndex.db contains the content index, which is searched when
queries specify "kMDItemTextContent == foo" as a search criterion.
That seems likely.
If Privacy settings have been made on a volume, there's an
additional _exclusions.plist file in that volume's .Spotlight-V100
directory containing the paths to be excluded from searches.
_IndexPolicy.plist merely holds a value related to the -i flag of
mdutil. The value is:
= 3 if indexing is permitted (mdutil -i on),
= 5 if not (mdutil -i off)
Go figure! ;-)
The _rules.plist specifies directories that Spotlight should
include in, or exclude from, indexing beyond its system-wide defaults.
SearchKit is for searching individual applications. Spotlight is a
system-wide SearchKit index.
I'd say that SearchKit is for developers who want to add full-text
index-and-search capabilities to their applications, whereas
Spotlight is a system-wide search facility that (I suspect) employs
SearchKit for its content-based index-and-search functions. AFAIK,
everything outside of kMDItemTextContent is pure Spotlight. I see
nothing in the SearchKit doc (preliminary though it is) that
addresses anything but indexing and searching text content.
That sounds about right. I've looked into Spotlight a bit, but
haven't needed to dig very deep yet. Spotlight must use custom,
Spotlight-only code to search the metadata stores and combine the
results with the SearchKit content-search results.
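To make the split being discussed concrete, here is a toy sketch in Python. It is not how Spotlight is implemented (the thread only guesses at that). The ToyStore class and its method names are hypothetical. It assumes only what is said above: kMDItemTextContent goes to a content index, the other attributes go to a separate metadata store, and a query intersects the two result sets.

```python
from collections import defaultdict


class ToyStore:
    """Illustrative stand-in for a metadata store plus a content index."""

    def __init__(self):
        # path -> {attribute: value}; plays the role guessed for store.db
        self.metadata = {}
        # term -> set of paths; plays the role guessed for ContentIndex.db
        self.content_index = defaultdict(set)

    def add(self, path, attributes):
        """Route text content to the content index, everything else to metadata."""
        attrs = dict(attributes)
        text = attrs.pop("kMDItemTextContent", "")
        self.metadata[path] = attrs
        for term in text.lower().split():
            self.content_index[term].add(path)

    def search(self, text_term=None, **attribute_filters):
        """Intersect content-index hits with exact-match metadata filters."""
        hits = set(self.metadata)
        if text_term is not None:
            hits &= self.content_index.get(text_term.lower(), set())
        for key, wanted in attribute_filters.items():
            hits = {p for p in hits if self.metadata[p].get(key) == wanted}
        return sorted(hits)


store = ToyStore()
store.add("/docs/a.txt", {"kMDItemAuthors": ["Mark T"],
                          "kMDItemTextContent": "SearchKit index notes"})
store.add("/docs/b.txt", {"kMDItemAuthors": ["Dr. Smoke"],
                          "kMDItemTextContent": "Spotlight index notes"})
print(store.search(text_term="index", kMDItemAuthors=["Mark T"]))
```

The point of the sketch is only that the two kinds of lookup can live in separate structures and be combined at query time, which is the arrangement suspected above.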
kMDItemTextContent, as I recall, does contain all the text of a
document (assuming a sane importer developer).
ContentIndex.db stores an index of all the text and metadata on
your system. I think.
IMO, ContentIndex.db includes just the content index for content-
based searches. I suspect the other metadata, e.g. kMDItemAuthors,
etc. is in the *store.db files.
I just did an ASCII dump of /.Spotlight-V100/store.db, and it looks
like that's correct. The store.db files do contain metadata.
There's no real reason to worry about what each file in Spotlight
does, as Spotlight is the only thing that should do anything with
them.
I personally like to know what is happening under the covers. I
write books on troubleshooting Mac OS X -- higher-level, end-user
books -- so knowing what's going on deep under the covers often helps.
I understand that, but I'm having a hard time understanding how the
binary storage format of an indexing technology could produce any
helpful troubleshooting tips. ; )
I have no idea why ContentIndexing.app is still in your Tiger
system, because there's no trace of it on mine. Did you upgrade or
archive and install?
I did an Upgrade install. Sure it's not in /System/Library/Find ?
Note that Spotlight won't look there since it's a system folder.
You have to open the folder or use UNIX find.
I asked another fellow who did an Erase and Install and he said he
had /System/Library/Find/ContentIndexing.app.
Perhaps he was mistaken. Can you double-check that directory?
It's in there on my system. I didn't know about Spotlight not
searching /System. I guess that makes sense. No need to have all
those odd-looking results confusing Average Joe.
I can't figure out what it does, though. I was under the impression
that it was used by the old content indexing system. I have no idea
how it fits into Spotlight.
I haven't looked much into the procedure used for how indexing is
actually done, but it seems like the kernel notifies the mds
daemon when a file has been changed. The daemon then spawns an
mdimport process (and yes, I got the name wrong. I did say I was
going from memory.)
That's how it gets kicked off. I'm more "obsessed" with the sub-
atomic detail, e.g. where SearchKit is being called by
the .mdimporter, for example.
From the Spotlight Importer Programming Guide:
"When metadata is extracted for a file, the GetMetadataForFile
function is called. The function is passed the plug-in interface, a
mutable dictionary that you’ll add the metadata attribute keys and
values to, the UTI type of the target file, and the full path to the
target file.
...
"Your implementation of this function should extract the metadata
from the file and insert it into the dictionary with the appropriate
keys and values. If it successfully returns metadata, the function
should return with a value of true. If no metadata was extracted, you
should return false."
Looks like Spotlight adds whatever the importer returns for text
content to the SearchKit index. The importer never has to know about
the SearchKit part of the whole business.
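The contract quoted from the guide can be mimicked in a few lines of Python. This is an analogue only: the real GetMetadataForFile is a C function working on CoreFoundation types, and import_file below is a hypothetical stand-in for whatever the host does with the result. The sketch shows the shape described above: the callback fills a mutable dictionary and returns true or false, and the host, not the importer, decides where the text content is stored.

```python
def get_metadata_for_file(attributes, content_type_uti, path_to_file):
    """Analogue of the importer callback: fill `attributes`, return success."""
    try:
        with open(path_to_file, encoding="utf-8") as handle:
            text = handle.read()
    except (OSError, UnicodeDecodeError):
        return False          # nothing could be extracted
    if not text.strip():
        return False          # empty file: no metadata to report
    attributes["kMDItemTextContent"] = text
    return True


def import_file(path, content_type_uti, content_index):
    """Hypothetical host side: call the importer, then index what it returned."""
    attributes = {}
    if not get_metadata_for_file(attributes, content_type_uti, path):
        return False
    # The importer never sees this step; it only filled the dictionary.
    content_index[path] = attributes.get("kMDItemTextContent", "")
    return True
```

Here content_index is a plain dict standing in for the real content index; the importer function never touches it.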
My interest is also piqued because, when I worked at IBM, I was one
of their global experts in text retrieval. I've developed large-scale
text retrieval systems, so it's an area I know a little about. ;-)
Most of my development work was done in procedural languages:
FORTRAN, APL, REXX, so Cocoa is new to me. I estimated once that
I'd written one million lines of code, directly or indirectly: I've
developed programs that wrote other programs. My OO experience was
primarily in Smalltalk. So I've tried to keep my hand in all
things "text retrieval." ;-)
Programs that write other programs... That sounds useful. Got one for
Cocoa? ; )
I haven't found any problem with the index getting out of sync.
I've done some basic tests where I created a smart folder (as you
know, that's a saved Spotlight search) and deleted/changed some of
the files it found. The indexing seemed instant. 10.4.1 made some
changes to Spotlight, so perhaps the out-of-sync (OOS) problems were fixed.
The updates are very quick. However, I have seen them get out of
sync. As I noted earlier, I have a mailbox that I know contains 46
items matching specific search criteria. A Spotlight search
returned 10. Rebuilding the index resulted in Spotlight returning
46. I strongly suspect the indices will get out of sync over a
period of time and should probably be rebuilt regularly.
After Spotlight finished: Note 1 MB+ larger ContentIndex.db. No
changes were made to the files indexed during this process.
Something is not right about that. The volume in question contains
only data files and there are no background processes writing to it.
We don't know what algorithm Apple uses for indexing, but it's either
very dynamic or very poorly written. As I said, I haven't had any
problems with Spotlight yet. As I also said, Apple fixed some
Spotlight issues in 10.4.1. Have you repeated these experiments under
that version?
The "Spotlight Importer Programming Guide" is for
making .mdimporters for developers' custom formats. You can write
the importer however you want, but the NSData method you gave
sounds right.
Thanks.
I Googled for examples of Spotlight importers, but all I could
find were various blogs and the Apple Spotlight importer page.
Someone will make an example eventually, but for now read the docs
and look at what others have done:
http://www.apple.com/downloads/macosx/spotlight/
Good idea. Hopefully some of the importers will provide source code.
There's actually a very simple example in the Spotlight Importer
Programming Guide. I haven't needed to write one yet, so I was just
looking at it now. The example is for a property list with Author,
Title, and Notes fields. Those all go under the metadata category;
there's no example that I've found with any content indexing.
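For a rough idea of what such a property-list importer does, here is a Python analogue. It is not the guide's code, which is C against CoreFoundation. The mapping of Author, Title, and Notes onto kMDItemAuthors, kMDItemTitle, and kMDItemComment is my assumption, and extract_plist_metadata is a hypothetical name.

```python
import plistlib

# Assumed mapping from the document's own keys to Spotlight attribute names.
KEY_MAP = {
    "Author": "kMDItemAuthors",
    "Title": "kMDItemTitle",
    "Notes": "kMDItemComment",
}


def extract_plist_metadata(attributes, plist_bytes):
    """Fill `attributes` from a property list; return True if anything was found."""
    try:
        document = plistlib.loads(plist_bytes)
    except Exception:
        return False          # not a readable property list
    if not isinstance(document, dict):
        return False
    found = False
    for source_key, spotlight_key in KEY_MAP.items():
        value = document.get(source_key)
        if isinstance(value, str) and value:
            # kMDItemAuthors is a list of names; the others are single strings.
            attributes[spotlight_key] = [value] if spotlight_key == "kMDItemAuthors" else value
            found = True
    return found
```

If the thread's reading is right, additionally storing the Notes text under kMDItemTextContent would be what makes it content-searchable, but I have not verified that.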
Again, good luck. And thanks for educating me about Spotlight. That's
the funny thing about tech discussions: the person trying to answer
questions can actually learn as much (or more) as the person asking
them.