Performance problem with GC enabled
Performance problem with GC enabled
- Subject: Performance problem with GC enabled
- From: John Engelhart <email@hidden>
- Date: Thu, 12 Mar 2009 09:04:26 -0400
I'm sort of at a loss on this one...
While working on some improvements to RegexKitLite, I've come across a
bit of a problem. I'm in the middle of adding - (NSArray
*)arrayOfStringsMatching:(NSString *)regex and friends, a method that
(surprise!) returns an array of all the matches of a string for a
given regex. Lots of people use the example regex enumerator included
in the project, but hoisting arrayOfStrings like functionality in to
the core makes things like
for(NSString *matchedString in [subjectString
arrayOfStringsMatching:@"^(.*)$"]) { /* process results */ }
a lot easier and cleaner, and potentially faster to boot.
So, while testing the "faster" part with some simple bench marks, I
came across a serious performance problem once GC was turned on.
Without GC, here's what a simple benchmark consisting of a large
(>1MB) file matched by the regex \S+ in a loop spits out:
arrayOfStringsMatching: Time used: 1619821.0 us, per: 0.919759700420
us, count: 1761135
Profiling with shark shows the top three functions, after charging
malloc/free and friends to their callers:
47.8% in uregex_findNext (this is obviously where the regex matching
takes place)
22.0% in CFStringCreateWithSubstring
12.1% in _CFRelease
The bulk of the remaining time is spent all over the place, with
memcpy and bzero making up a good chunk of it.
Performance counters show an average of 1.611 instructions retired per
clock cycle, 79.101 clock cycles / 127.433 instructions retired
between L1 misses, and 37,731.484 clock cycles / 60,785.422
instructions retired between L2 misses.
Now, with GC turned on I get:
arrayOfStringsMatching: Time used: 6247963.0 us, per: 3.547691119647
us, count: 1761135
Oookay, a 2.8x increase in the time it takes to do the same thing is
sort of a problem. I could live with, but would be unhappy about, a
10% increase in the time it takes. A 280% increase in the time it
takes is pretty deep in to "Not Acceptable" territory. Digging in with
shark, the following turns up:
First, two threads:
54.8% in main
45.2% in auto_collection_thread (!!!)
Top 3 functions (for all threads) breaks down to, with no charging of
anything to callers:
39.2% in auto_collection_thread
14.5% in uregex_findNext
11.1% in __spin_lock (!! eleven percent!? in spin_lock!?)
Switching over to just the main thread, again, with no charging of
anything to callers:
26.4% in uregex_findNext
19.0% in __spin_lock (!! WTF?)
16.3% in auto_zone_allocate_object
Performance counters show an average of 1.395 instructions retired per
clock cycle, 45.468 clock cycles / 63.429 instructions retired between
L1 misses, and 23,857.762 clock cycles / 33,281.578 instructions
retired between L2 misses.
This amounts to a 13.5% reduction in instructions retired per clock
cycle, a 50.2% decrease in the number of instructions retired between
L1 misses, and a 45.2% decrease in the number of instructions retired
between L2 misses. This is pretty horrific, GC is slaughtering the
CPU caches.
The basic overview of what happens internally is that uregex_findNext
is called repeatedly until it returns with "nothing left to find".
The results of the find are stored in to an NSRange *ranges array.
The next step is to convert those ranges in to strings in bulk, the
pointers of the strings are stored in (the equivalent of) "id
*matchedStrings = NSAllocateCollectable(totalRanges * sizeof(id),
NSScannedOption)";
Then, a simple for loop:
for(int x = 0; x < totalRanges; x++) {
matchedStrings[x] = [subjectString substringWithRange:ranges[x]];
}
After some digging, nearly 12.5% of the total execution time is spent
on "matchedStrings[x] =". That is, just in the assignment portion of
that statement, the storing of the pointer returned by
substringWithRange:.
This is (obviously) due to -fobjc-gc turning the storing of a __strong
pointer in to a call to objc_assign_strongCast(). Each and every call
to objc_assign_strongCast, in turn, grabs a gc lock before it does its
work. Soo.. what was a simple, single movl instruction of the
returned pointer in to the matchedStrings array has now turned in to,
at a minimum, a function call and an expensive mutex lock acquisition,
a lock that is pretty heavily contested to boot, and that doesn't even
include any of the work done inside the function itself.
Turning the loop body statement in to "/* matchedStrings[x] = */
[subjectString substringWithRange:ranges[x]];" (creating the string,
but not storing the pointer) turns in a time of:
arrayOfStringsMatching: Time used: 3464158.0 us, per: 1.967003097434
us, count: 1761135
This is just shy of a 50% in reduction in the total execution time by
commenting out just the assignment part of a single statement.
For the record, completely commenting out that one single line (so
that the string isn't created, nor is a pointer stored) turns in a
time of 1133339.0 with GC turned on and 1065135.0 with GC turned off
(gc is about 5% slower), so there's no doubt this single line is
responsible for all of the slow down. And in case anyone was
wondering, the existing RKLEnumerator supplied with the project turns
in a time of 2242281.0us, so hoisting this functionality up in to the
core is worth about a 27% improvement.
This kind of reduction in performance is just flat out unacceptable,
and I can't think of anything I can do to speed things up. Any ideas?
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden