Re: CoreData async fetch request
Re: CoreData async fetch request
- Subject: Re: CoreData async fetch request
- From: David Melgar <email@hidden>
- Date: Tue, 6 Oct 2009 23:29:28 -0400
Hello,
Thanks for the response. Seems that its straying somewhat from my
original question.
Searching based on prefix matching is fine.  The predicate I'm using
really is of the form "SELF like foo", no wildcard, so it doesn't seem
that it should be that expensive. You say its possible to structure
this to use a binary index. How? I don't see any mention of indices in
the Coredata documentation. If I use SQLite directory, presumably I
can set indices on the fields I want and more closely manage the data
model.
I don't see how setBatchFetchSize helps. Doesn't it just limit the
number of results returned? I have no idea how quickly the results
will come in. Setting a size >1 is therefore indeterminate and may
take the full 3 minutes. If I set it to one, and I want to try and get
the second row as well, it appears that it starts the query all over
again, worst case resulting in 6 minutes before the 2nd result shows
up. Doesn't seem that it scales reasonably if I want to display the
first 10-20 entries.
My issue with Coredata is that it NSFetchRequest always returns ALL
the results of the particular query at one time. If I use SQLite
directly... assuming it supports cursors, I can get each result one at
a time as they show up, display it to the user without slowing down
the query as it continues to find other results.
NSFetchRequest could support a delegate to invoke some method when for
each item that has been found, rather than blocking until all the
results are received.
It also could have been implemented as a virtual queue, an object
which could be read from while being written to in another thread.
On Oct 6, 2009, at 4:08 AM, Ben Trumbull wrote:
On Oct 5, 2009, at 7:00 PM, email@hidden wrote:
I am doing a simple query search for a text string pattern (ie
'SELF like foo') on ~10 million small records stored persistently
using sqlite. This is a performance test to make sure I get
reasonable performance from my database engine before I commit too
much code to it.
Well, @"self like 'foo'" is a different problem than @"self like
'*foo*'".  LIKE queries require Unicode compliant regex and are
intrinsically expensive.  If you do not have a wildcard, you are
better off use an == query.  The DerivedProperty ADC example shows
how to transform the text to make it much faster to search.
If you do need to use a wildcard, you'll really want to stick with
1, either prefix matching or suffix matching.  The DerivedProperty
example shows prefix matching.  It's possible to structure this to
use a binary index, and make the query extremely fast even for
millions of records.  There is a huge difference in computational
complexity.  Prefix matching can use an index, and therefore can run
O(lg(N)).
*foo* (contains) searches are slow, and cannot use an index.  You
really want to avoid these.  Even Spotlight does not do arbitrary
substring matching.  Compare "help" with "elp" in your Spotlight
results.  If you want word matching, you can use Spotlight or
SearchKit to build a supplemental FTS index.
The query is taking over 3 minutes with a small result set. This is
on a new 13" macbook pro w 4gb memory.
... a full table scan executing a regex on each of 10 million rows
on a 5400 rpm drive ?  Well, for doing all that, 3 minutes sounds
pretty fast.
Just as a reference point, if you grab the objectIDs from the result
set, and execute an IN query selecting those objects, how long does
it take ?  50ms ?  100ms ?
The query is taking too long for a user to sit and wait for it. Is
there a way to speed it up? Can indexing be applied to it?
I had thought if I could display results as they are found that
might be reasonable. In my tests, if I use setFetchBatchSize and
setOffset to restart it, then it ends up repeating the query taking
that many times longer to get a result. Not reasonable. It does not
seem to start the query where it left off, as a database cursor
would do.
You can use -com.apple.CoreData.SQLDebug 1 to see the SQL we pass to
the database.  This also has nothing to do with Core Data.  This is
how offset queries behave.  I realize it's not what you expected,
which is why I recommended using -setFetchBatchSize: instead.
My impression is that my usage scenario is not an appropriate use
of core data.
Core Data is just passing the query off to the database.  I'm not
sure why you think going to the database directly will do anything
for the 179.9 / 180.0 seconds it takes to evaluate the query in the
database.
I was planning to try SQLite directly. Would it be more appropriate?
You can try it directly, but it won't have any meaningful effect on
your performance results except that SQLite's built in LIKE operator
doesn't support Unicode.  It'll be a tiny bit faster for that, but
still the same order of magnitude.  And then, either you'll have to
integrate ICU support as Core Data does, and it'll be exactly the
same, or be stuck with ASCII.
Regardless, you'll need to make your searches eligible for an
index.  The DerivedProperty example shows how to do that.
- Ben
Thanks
On Oct 5, 2009 7:14pm, Ben Trumbull <email@hidden> wrote:
> Is there a way to do an asynchronous fetch request against Core
data
> returning partial results?
>
> That depends on whether it's the query part that's expensive
(e.g. WHERE clause with complex text searching and table scans) or
simply the quantity of the row data that's your problem.  For the
latter, you can just use -setFetchBatchSize: and be done.
>
>
> You can use a separate MOC on a background thread to perform
asynchronous work.  You can then pass over results to the main
thread to display to the user.  However, unless your search terms
are very expensive, it's usually easier and faster to use -
setFetchBatchSize: synchronously.  For well indexed queries, it can
handle a million or two rows per second.  Not sure why you'd
subject your users to that kind of experience.  It's common to use
fetch limits, count requests, and only show the top N results.
What's your user going to do with a hundred thousand results anyway ?
>
>
> If you need to attack the computational expense of your query
terms, that's more complicated.  Obviously it would be best to
optimize the queries and ensure they are using an index.  But if
that's not enough, you can execute the queries in a background MOC,
fetching objectIDs + row data (put in the the row cache) and then
have the other MOC materialize the objects by ID from the row
cache.  There's a BackgroundFetching example in /Developer/Examples/
CoreData.  It shows how to do this.  Returning partial results
incrementally would require some creativity on your part to
subdivide the query into several.  Since most expensive queries are
text searches, it's usually possible to subdivide the result set
naturally.  Like the first letter of 'title'.  Similar to the thumb
bar index on the side of the Contacts app on the iPhone.
>
>
> There's also a DerivedProperty example on ADC for optimizing text
queries.
>
>
> Obviously, Apple's own Spotlight could not use something like
> Coredata, since it heavily relies on returning asynchronous partial
> results.
>
> Which is neither here nor there.  Most Cocoa applications
wouldn't want Spotlight to be the sole persistence back end of
their data.  The latency of putting all your data in a full text
index instead of a relational database or keyed archive would be
pretty absurd.  Now, if you're writing an app that's primarily
structured around full text searching, you might instead prefer to
focus on putting your data in Spotlight via small files, and using
the Spotlight APIs.  But it's not suitable for apps interested in
an OOP view of their data.
>
>
> Frankly, this is my second application I've attempted to use
Coredata
> to find it come up surprisingly short. The first time the issue was
> core data not being thread safe.
>
> Core Data can be used efficiently with multiple threads.  It
might help to think of each MOC as a separate writeable view.  If
you'd like to know more, you can search the archives for my posts.
>
>
> What is the target market for Core Data? Why sort of application is
> ideal for its use? What size data store? Right now it escapes me.
>
>
> Cocoa and Cocoa Touch applications, particularly done in an MVC
style with an OO perspective on their data.  Some people also use
it as a persistent cache for data stored in another canonical
format, such as XML files.  On the Mac side, we've had customers
with 3+ million rows (multi GB) databases, and on the embedded
side, roughly 400,000 rows (100s MB).  However, it does take some
care and feeding to handle data sets like that, and most developers
find it straight forward up to about 10% those numbers.
>
>
>
> It sounds like you're having performance issues.  What kinds of
queries are you trying to accomplish ?  How much data are you
working with ?  How have you modeled your primary entities?
>
>
> You can fetch back just NSManagedObjectIDs, and -
setIncludesPropertyValues: to NO to effectively create your own
cursors if you prefer.
>
>
> - Ben
>
>
>
>
>
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden