Re: Spotlight, Content Indexing, and SearchKit integration questions

  • Subject: Re: Spotlight, Content Indexing, and SearchKit integration questions
  • From: Mark T <email@hidden>
  • Date: Thu, 02 Jun 2005 09:14:16 -0400


On Jun 1, 2005, at 8:45 PM, Dr. Smoke wrote:

On 2005.06.01, at 18:31, Mark T wrote:

Here's the SearchKit situation as I understand it. SearchKit is an API consisting of functions for a) creating searchable indexes of content and b) searching them very quickly. SearchKit was introduced in Panther so developers could easily implement search.
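To make the two halves of that description concrete, here is a minimal sketch of an inverted index in Python. It is a toy model only: it does not call SearchKit, and the `ToyIndex` class with its `add_document` and `search` methods is a hypothetical name invented for illustration, not part of any Apple API.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Toy inverted index: maps each term to the set of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_document(self, doc_id, text):
        # Phase (a): tokenize the text and record which document each term came from.
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(doc_id)

    def search(self, query):
        # Phase (b): return the documents that contain every query term (AND semantics).
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        results = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            results &= self._postings.get(term, set())
        return results


index = ToyIndex()
index.add_document("notes.txt", "Spotlight indexes file content")
index.add_document("todo.txt", "Write a Spotlight importer")
print(sorted(index.search("spotlight")))
print(sorted(index.search("spotlight importer")))
```

Lookups are fast because a search only touches the posting sets for the query terms rather than rescanning every document, which is the general reason to build an index up front.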

Yes, I understand this.

Spotlight a) tells importer plug-ins about files to index, b) adds whatever the plug-ins return to a SearchKit index, and c) searches that index.

I don't think (b) is correct. Spotlight adds whatever the .mdimporter returns to the "metadata store" which I interpret to mean the .Spotlight-V100 directory. This contains:

It looks to me from the docs like the metadata store (ContentIndex.db) is actually a SearchKit index.


I suspect the .store.db and store.db files are metadata indices, e.g. kMDItemAuthors, etc. (why two is another question), and ContentIndex.db contains the content index, which is searched when a query specifies "kMDItemTextContent == foo" as a search criterion.

That seems likely.

If Privacy settings have been made on a volume, there's an additional _exclusions.plist file in that volume's .Spotlight-V100 directory containing the paths to be excluded from searches.

_IndexPolicy.plist merely holds a value related to the -i flag of mdutil. The value is:

= 3 if indexing is permitted (mdutil -i on),
= 5 if not (mdutil -i off)

Go figure! ;-)

The _rules.plist specifies directories that Spotlight should include in, or exclude from, indexing beyond its system-wide defaults.

SearchKit is for searching individual applications. Spotlight is a system-wide SearchKit index.

I'd say that SearchKit is for developers who want to add full-text index-and-search capabilities to their applications, whereas Spotlight is a system-wide search facility that (I suspect) employs SearchKit for its content-based index-and-search functions. AFAIK, everything outside of kMDItemTextContent is pure Spotlight. I see nothing in the SearchKit doc (preliminary though it is) that addresses anything but indexing and searching text content.

That sounds about right. I've looked into Spotlight a bit, but haven't needed to dig very deep yet. Spotlight must use custom, Spotlight-only code to search the metadata stores and combine the results with the SearchKit content-search results.
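As a rough illustration of what "combine the results" could mean, the sketch below intersects a metadata filter with a content-term lookup. Everything here is an assumption made for illustration: the `run_query` function, the two dictionaries standing in for the store.db and ContentIndex.db roles, and the sample paths are all hypothetical, and nothing is known from this thread about how Spotlight really merges the two result sets.

```python
def run_query(metadata_store, content_index, author=None, text=None):
    """Return paths matching an optional author filter AND an optional content term."""
    # Start from every path known to the metadata store.
    matches = set(metadata_store)
    if author is not None:
        # Metadata side: check a stored attribute value for each path.
        matches = {
            path for path in matches
            if author in metadata_store[path].get("kMDItemAuthors", [])
        }
    if text is not None:
        # Content side: look the term up in the separate content index.
        matches &= content_index.get(text.lower(), set())
    return matches


# Hypothetical sample data; real stores are binary files managed by Spotlight.
metadata_store = {
    "/docs/a.txt": {"kMDItemAuthors": ["Mark"]},
    "/docs/b.txt": {"kMDItemAuthors": ["Smoke"]},
    "/docs/c.txt": {"kMDItemAuthors": ["Mark"]},
}
content_index = {
    "searchkit": {"/docs/a.txt", "/docs/b.txt"},
    "importer": {"/docs/c.txt"},
}

print(sorted(run_query(metadata_store, content_index, author="Mark", text="SearchKit")))
```

Here the author filter narrows the candidates to a.txt and c.txt, and the content lookup for "searchkit" then leaves only a.txt.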


kMDItemTextContent, as I recall, does contain all the text of a document (assuming a sane importer developer).

ContentIndex.db stores an index of all the text and metadata on your system. I think.

IMO, ContentIndex.db includes just the content index for content-based searches. I suspect the other metadata, e.g. kMDItemAuthors, etc., is in the *store.db files.

I just did an ASCII dump of /.Spotlight-V100/store.db, and it looks like that's correct. The store.db files do contain metadata.


There's no real reason to worry about what each file in Spotlight does, as Spotlight is the only thing that should do anything with them.

I personally like to know what is happening under the covers. I write books on troubleshooting Mac OS X -- higher-level, end-user books -- so knowing what's going on deep under the covers often helps.

I understand that, but I'm having a hard time understanding how the binary storage format of an indexing technology could produce any helpful troubleshooting tips. ; )


I have no idea why ContentIndexing.app is still in your Tiger system, because there's no trace of it on mine. Did you upgrade or archive and install?

I did an Upgrade install. Are you sure it's not in /System/Library/Find? Note that Spotlight won't look there since it's a system folder. You have to open the folder or use UNIX find.


I asked another fellow who did an Erase and Install and he said he had /System/Library/Find/ContentIndexing.app.

Perhaps he was mistaken. Can you double-check that directory?

It's in there on my system. I didn't know about Spotlight not searching /System. I guess that makes sense. No need to have all those odd-looking results confusing Average Joe.


I can't figure out what it does, though. I was under the impression that it was used by the old content indexing system. I have no idea how it fits into Spotlight.

I haven't looked much into the procedure used for how indexing is actually done, but it seems like the kernel notifies the mds daemon when a file has been changed. The daemon then spawns an mdimport process (and yes, I got the name wrong. I did say I was going from memory.)

That's how it gets kicked off. I'm more "obsessed" with the sub-atomic detail, e.g. where SearchKit is being called by the .mdimporter.

From the Spotlight Importer Programming Guide:

"When metadata is extracted for a file, the GetMetadataForFile function is called. The function is passed the plug-in interface, a mutable dictionary that you’ll add the metadata attribute keys and values to, the UTI type of the target file, and the full path to the target file.

...

"Your implementation of this function should extract the metadata from the file and insert it into the dictionary with the appropriate keys and values. If it successfully returns metadata, the function should return with a value of true. If no metadata was extracted, you should return false."

Looks like Spotlight adds whatever the importer returns for text content to the SearchKit index. The importer never has to know about the SearchKit part of the whole business.
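The contract quoted from the guide, and the guess about what happens next, can be modeled in a few lines of Python. This is a sketch under stated assumptions, not Apple's code: the real GetMetadataForFile is a C function that also receives the plug-in interface (omitted here), and the `ingest` host function and the way it routes `kMDItemTextContent` to a separate content index are hypothetical, reflecting only the speculation in this thread.

```python
import os
import tempfile


def get_metadata_for_file(attributes, content_type_uti, path_to_file):
    """Toy importer: fill `attributes` and report whether anything was extracted."""
    if content_type_uti != "public.plain-text":
        return False  # Not a type this toy importer handles.
    try:
        with open(path_to_file, encoding="utf-8") as handle:
            text = handle.read()
    except OSError:
        return False  # Unreadable file: no metadata extracted.
    if not text:
        return False
    attributes["kMDItemTextContent"] = text
    attributes["kMDItemTitle"] = text.splitlines()[0]
    return True


def ingest(path, uti, metadata_store, content_index):
    """Hypothetical host: call the importer, then route what it returned."""
    attributes = {}
    if not get_metadata_for_file(attributes, uti, path):
        return False
    # Assumption: text content goes to a content index, everything else to a metadata store.
    text = attributes.pop("kMDItemTextContent", "")
    for word in text.lower().split():
        content_index.setdefault(word, set()).add(path)
    metadata_store[path] = attributes
    return True


# Demonstration with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Meeting notes\nDiscuss SearchKit integration")
    sample_path = tmp.name

metadata_store, content_index = {}, {}
ok = ingest(sample_path, "public.plain-text", metadata_store, content_index)
print(ok, metadata_store[sample_path], sorted(content_index))
os.remove(sample_path)
```

The point of the split is that the importer only ever touches the dictionary; whatever indexing technology sits behind the host stays invisible to it.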

My interest is also piqued because when I worked at IBM I was one of their global experts in text retrieval. I've developed large-scale text retrieval systems, so it's an area I know a little about. ;-) Most of my development work was done in procedural languages: FORTRAN, APL, REXX, so Cocoa is new to me. I estimated once that I'd written one million lines of code, directly or indirectly: I've developed programs that wrote other programs. My OO experience was primarily in Smalltalk. So I've tried to keep my hand in all things "text retrieval." ;-)

Programs that write other programs... That sounds useful. Got one for Cocoa? ; )


I haven't found any problem with the index getting out of sync. I've done some basic tests where I created a smart folder (as you know, that's a saved Spotlight search) and deleted/changed some of the files it found. The indexing seemed instant. 10.4.1 made some changes to Spotlight; perhaps the out-of-sync problems were fixed.

The updates are very quick. However, I have seen them get out of sync. As I noted earlier, I have a mailbox that I know contains 46 items matching a specific search criterion. A Spotlight search returned 10. Rebuilding the index resulted in Spotlight returning 46. I strongly suspect the indices will get out of sync over a period of time and should probably be rebuilt regularly.

After Spotlight finished: Note 1 MB+ larger ContentIndex.db. No changes were made to the files indexed during this process. Something is not right about that. The volume in question contains only data files and there are no background processes writing to it.

We don't know what algorithm Apple uses for indexing, but it's either very dynamic or very poorly written. As I said, I haven't had any problems with Spotlight yet. As I also said, Apple fixed some Spotlight issues in 10.4.1. Have you repeated these experiments under that version?


The "Spotlight Importer Programming Guide" is for making .mdimporters for developers' custom formats. You can write the importer however you want, but the NSData method you gave sounds right.

Thanks.


I Googled for examples of Spotlight importers, but all I could find were various blogs and the Apple Spotlight importer page. Someone will make an example eventually, but for now read the docs and look at what others have done:
http://www.apple.com/downloads/macosx/spotlight/

Good idea. Hopefully some of the importers will provide source code.

There's actually a very simple example in the Spotlight Importer Programming Guide. I haven't needed to write one yet, so I was just looking at it now. The example is for a property list with Author, Title, and Notes fields. Those all go under the metadata category; there's no example that I've found with any content indexing.


Again, good luck. And thanks for educating me about Spotlight. That's the funny thing about tech discussions: the person trying to answer questions can actually learn as much (or more) as the person asking them.