Re: Help DTrace gurus: suggestions for capturing a mis-allocated NSData object on a customer's system
Re: Help DTrace gurus: suggestions for capturing a mis-allocated NSData object on a customer's system
- Subject: Re: Help DTrace gurus: suggestions for capturing a mis-allocated NSData object on a customer's system
- From: James Bucanek <email@hidden>
- Date: Sun, 21 Nov 2010 11:56:00 -0700
Ken Thomases <mailto:email@hidden> wrote (Sunday,
November 21, 2010 6:13 AM -0600):
On Nov 20, 2010, at 11:58 AM, James Bucanek wrote:
I think the term "uninitialized" in this context is a valgrind concept.
Valgrind runs your application in a simulator. It associates a "valid-value"
bit with every byte of memory. If your application ever tries to read a byte
of data that hasn't been written yet, it catches it.
I was aware of that.
Which line of your real code corresponds to:
==78805== Uninitialised value was created by a stack allocation
==78805== at 0x1000E4BA7: -[PackageNames readNamePackagesOp] (PackageNames.m:590)
The line in question is:
NSData* data = [NSData dataWithBytes:&batch
length:offsetof(ReadBatch,set)+sizeof(PackageNameRecord)*batch.count];
Valgrind knows that &batch refers to an address that was "created by a stack allocation".
My impression from the Valgrind report is that it's talking about a stack
allocation that occurs on the mentioned line. The allocation of batch does
not occur at that line.
What possible uninitialized value could be "created by a stack allocation at"
that line?
I don't know. If valgrind is, indeed, referring to the batch
variable, then the statement that logically "allocates" it is
RecordBatch batch;
which occurs earlier in the function.
But all stack variables for a function are allocated
simultaneously when the stack frame is created. So, technically,
the batch variable were physically allocated at
- (void)readNamePackagesOp {
I interpreted this statement to mean "the code that referenced
the variable that was allocated on the stack." If that's the
case, then valgrind is a very, very, clever tool because I have
no idea how it would determine that.
When the function returns, valgrind marks all the bytes in that stack frame
as invalid. Any future attempt to read or write from those addresses is
caught as an error:
=78805== Thread 9:
==78805== Conditional jump or move depends on uninitialised value(s)
==78805== at 0x1002AAD08: -[NSConcreteData dealloc] (in /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation)
==78805== by 0x1000EA4B4: -[InsertPackageNamesOp dealloc] (PackageNames.m:1824)
=
Note that valgrind is not saying that the data was being overwritten, but
that NSConcreteData merely "depends on" (a.k.a. "reads") information from an
address space that formerly belonged to the stack frame of
-readNamePackagesOp, which should now be considered as invalid/uninitialized.
My point is that I suspect something is doing a copy of some uninitialized
bytes from a stack variable over top of NSConcreteData's
internal state. Later, when NSConcreteData accesses it, it's
basing a jump or move on this
corrupted data, which Valgrind knows comes from uninitialized data. (Valgrind
knows that a copy from uninitialized data propagates the uninitialized-ness,
even though the destination bytes have been written to.)
But what code could that be? The NSData object's internal
variables are allocated on the heap by +alloc. The only
reference my code has to that is the NSData* data variable.
There are only three other references to this variable after
it's created, and those are all used to send it a message. The
only messages sent are, in order:
-autorelease
-retain
-bytes
-release
That's total extent of interaction with the NSData object.
Now, if you're referring to the contents of the data block, then
I can't see how that would be a factor either, for four reasons.
(a) If uninitialized data were present in |batch| when it was
copied into NSData (via +dataWithBytes:length:) then the method
that would encounter said uninitialized data would be
-insertCount:records:excludingDuplicates: (which is what
processes the records in the array), not -[NSData dealloc].
(b) NSData shouldn't be reading the *contents* of its data block
for any reason. The only thing NSData should be doing during
-dealloc is to free() it's internally malloc'd copy. If that
involves any heap-related variables surrounding said block, and
those variables are uninitialized, then that's the fault of the
framework, not the client.
(c) Once the NSData block is created, the contents of the batch
array are never modified. So the code can't be responsible for
corrupting data surrounding NSData's internally allocated copy.
(d) And finally, any subsequent changes to the |batch| should be
independent of the NSData object once it's created. So any stack
variable munging should be localized to damaged stack variables
of the running thread.
The really damning evidence, in my mind, is the fact that the
whole problem goes away when I replace
RecordBatch batch;
NSData* data = [NSData dataWithBytes:&batch length:sizeof(RecordBatch)];
with
RecordBatch* batch = malloc(sizeof(RecordBatch));
NSData* data = [NSData dataWithBytesNoCopy:batch
length:sizeof(RecordBatch) freeWhenDone:YES];
If the code that fills the batch were overrunning, then it would
overrun *batch. If the code that processes the records was
corrupting it, then it would corrupt *batch the same way it
would corrupt the copy of the array at -[data bytes]. If the
code corrupting the instance variables of the object created
with +dataWithBytes:length:, then it should also be corrupting
the internal variables of a similar object created with +dataWithBytesNoCopy:length:freeWhenDone:.
The body of InsertPackageNamesOp is:
- (void)main
{
...
const RecordBatch* batch = (const RecordBatch*)[data bytes];
SortedIndexSlowLock(namesIndex);
[namesIndex insertCount:batch->count records:batch->set.names excludingDuplicates:YES];
"batch->set.names"? Isn't batch->set an array?
Not in the real code. ;) Sorry about the confusion. I ended up
mixing the simplified version that I posted with the real code,
which is bit more complex. The RecordBatch structure is a buffer
for several types of records, so it's defined as
typedef struct {
PackageRecordType recordType;
NSUInteger count;
union {
PackageNameRecord names[kBatchSizeMax];
DuplicateNameRecord dups[kBatchSizeMax];
AdjunctNameRecord adjuncts[kBatchSizeMax];
} set;
} RecordBatch;
I suspect this is a confusion based on your simplification of your code. If
batch->set is a struct, which contains an array field called "names", then
this code makes sense.
|set| is a union.
However, your calculation of the length for the NSData:
offsetof(ReadBatch,set)+sizeof(PackageNameRecord)*batch.count
no longer does. I would think it should be something like:
offsetof(ReadBatch,set.names)+sizeof(PackageNameRecord)*batch.count
That would work too.
or, my preference, to assure all of the types are and remain correct:
offsetof(typeof(batch),set.names)+sizeof(batch.set.names[0])*batch.count
Oh, I like that. I have to remember to use the typeof() operator
more often...
I kind of agree that the sizeof() calculation is a good place
where things could go wrong. But I'm having trouble imagining
how it could be the root of this problem. If the length: value
was too small, the insert records method would read off the end
of the array and process garbage. However, there's some pretty
heavy error checking in this routine. For example, all duplicate
records are sorted before being written, so the -insert function
checks that each inserted value is in ascending order; hard to
imaging that that would never fail given random data. And if the
length: was too big, +dataWithBytes:length: would just copy more
bytes than were needed--also harmless.
SortedIndexSlowUnlock(namesIndex);
}
'records:' is a (const PackageNameRecord*) parameter (I simplified this to
just Record* in the earlier code example). The contents of this array are
never modified by -insertCount:records:excludingDuplicates:.
If the NSConcreteData were still referring to the 'batch' variable from the
stack of -readNamePackagesOp, then you'd be getting Valgrind errors within
-[InsertPackageNamesOp main], wouldn't you?
No. NSConcreteData could (internally) be retaining a reference
the original &batch address, but not return that in -bytes. The
address returned by -bytes is the only thing
insertCount:records:excludingDuplicates: ever sees.
You can try adding logging of the bytes within the data object from within
both methods. Also, log the object address in each, the address of the
'batch' variable in -readNamePackagesOp, and the -bytes address in -main.
That's worth a try. I'll revert to the earlier code and see if
there's anything useful there.
By the way, from which version of Mac OS X were your Valgrind results and the
crash log obtained? The Valgrind results give a location within
-[NSConcreteData dealloc] of 0x1002AAD08. The crash log gives an address of
0x00007fff8637dd48.
The valgrind results were from my machine (Xeon Mac Pro,
10.6.5), the crash log was from a customer (Core Duo, Macbook
Pro, 10.6.4).
Both addresses seem like they're 64-bit, but I think they're
both 32-bit. From my Mac OS X 10.6.4 (10F569) system, adjusting
for relocation, the ...d48
address makes sense. It's right after the call from within -[NSConcreteData
dealloc] to __NSGenericFree, which jumps to free() (and so disappears from the
backtrace).
On the other hand, the ...d08 address doesn't correspond to an
instruction. It's in the middle of an instruction within the
function prologue. I suspect
you have a different version of Mac OS X, with a different Foundation
framework. I'd be curious to know exactly what -[NSConcreteData dealloc] is
doing when Valgrind complains.
So would I. :)
Lastly, in the modified code you posted for -readNamePackagesOp, there was a
reference to "eof" and a placeholder for "<read a record from the file>". I
wonder what happens if there's an incomplete record. Could the inner 'record'
variable actually be incomplete or partially uninitialized?
Good guess, but not in the real code. The batch array is
completely assembled before the NSData block and its associated
operation are created, so it's impossible to create an operation
for an incomplete batch.
In the real code, file records are read into a memory buffer,
verified, and then converted into a Package object and returned
for processing. The Package object has a |reader| object
associated with it--a deserializer roughly equivalent to
NSCoder--that decodes the byte stream of the record. The actual
code for "!eof" is:
while ([reader remaining]!=0 ...) { [reader readVar:... type:...
Anything that went seriously wrong (buffer overrun, invalid data
type, ...) will cause the reader object to throw an exception,
which would terminate the reader thread before it could create
the NSOperation to process any of the records it had decoded.
Anyway, this morning I tried to create a test program to
reproduce this problem, but it doesn't exhibit any of these
issues when run under valgrind. So, clearly something more
complex is going on and/or I'm still missing something critical.
I genuinely appreciate the time and effort people have spent
helping me track down this problem.
James Bucanek
____________________________________________________________________
Author of Professional Xcode 3 ISBN: 9780470525227
<http://www.proxcode3.com/>
and Learn Objective-C for Java Developers ISBN: 9781430223696
<http://objectivec4java.com/>
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Xcode-users mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden