Re: Help DTrace gurus: suggestions for capturing a mis-allocated NSData object on a customer's system

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help DTrace gurus: suggestions for capturing a mis-allocated NSData object on a customer's system

Subject: Re: Help DTrace gurus: suggestions for capturing a mis-allocated NSData object on a customer's system
From: James Bucanek <email@hidden>
Date: Sun, 21 Nov 2010 11:56:00 -0700

Ken Thomases <mailto:email@hidden> wrote (Sunday, November 21, 2010 6:13 AM -0600):

On Nov 20, 2010, at 11:58 AM, James Bucanek wrote:

I think the term "uninitialized" in this context is a valgrind concept.
Valgrind runs your application in a simulator. It associates a "valid-value"
bit with every byte of memory. If your application ever tries to read a byte
of data that hasn't been written yet, it catches it.


I was aware of that.

Which line of your real code corresponds to:
  ==78805==  Uninitialised value was created by a stack allocation
==78805==   at 0x1000E4BA7: -[PackageNames readNamePackagesOp] (PackageNames.m:590)


The line in question is:

NSData* data = [NSData dataWithBytes:&batch
length:offsetof(ReadBatch,set)+sizeof(PackageNameRecord)*batch.count];

Valgrind knows that &batch refers to an address that was "created by a stack allocation".


My impression from the Valgrind report is that it's talking about a stack
allocation that occurs on the mentioned line.  The allocation of batch does
not occur at that line.

What possible uninitialized value could be "created by a stack allocation at"
that line?

I don't know. If valgrind is, indeed, referring to the batch variable, then the statement that logically "allocates" it is

    RecordBatch batch;

which occurs earlier in the function.

But all stack variables for a function are allocated simultaneously when the stack frame is created. So, technically, the batch variable were physically allocated at

    - (void)readNamePackagesOp {

I interpreted this statement to mean "the code that referenced the variable that was allocated on the stack." If that's the case, then valgrind is a very, very, clever tool because I have no idea how it would determine that.

When the function returns, valgrind marks all the bytes in that stack frame
as invalid. Any future attempt to read or write from those addresses is
caught as an error:

=78805== Thread 9:
==78805== Conditional jump or move depends on uninitialised value(s)
==78805==    at 0x1002AAD08: -[NSConcreteData dealloc] (in /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation)
==78805==    by 0x1000EA4B4: -[InsertPackageNamesOp dealloc] (PackageNames.m:1824)
=

Note that valgrind is not saying that the data was being overwritten, but
that NSConcreteData merely "depends on" (a.k.a. "reads") information from an
address space that formerly belonged to the stack frame of
-readNamePackagesOp, which should now be considered as invalid/uninitialized.

My point is that I suspect something is doing a copy of some uninitialized bytes from a stack variable over top of NSConcreteData's internal state. Later, when NSConcreteData accesses it, it's basing a jump or move on this corrupted data, which Valgrind knows comes from uninitialized data. (Valgrind knows that a copy from uninitialized data propagates the uninitialized-ness, even though the destination bytes have been written to.)

But what code could that be? The NSData object's internal variables are allocated on the heap by +alloc. The only reference my code has to that is the NSData* data variable.

There are only three other references to this variable after it's created, and those are all used to send it a message. The only messages sent are, in order:

    -autorelease
    -retain
    -bytes
    -release

That's total extent of interaction with the NSData object.

Now, if you're referring to the contents of the data block, then I can't see how that would be a factor either, for four reasons. (a) If uninitialized data were present in |batch| when it was copied into NSData (via +dataWithBytes:length:) then the method that would encounter said uninitialized data would be -insertCount:records:excludingDuplicates: (which is what processes the records in the array), not -[NSData dealloc]. (b) NSData shouldn't be reading the *contents* of its data block for any reason. The only thing NSData should be doing during -dealloc is to free() it's internally malloc'd copy. If that involves any heap-related variables surrounding said block, and those variables are uninitialized, then that's the fault of the framework, not the client. (c) Once the NSData block is created, the contents of the batch array are never modified. So the code can't be responsible for corrupting data surrounding NSData's internally allocated copy. (d) And finally, any subsequent changes to the |batch| should be independent of the NSData object once it's created. So any stack variable munging should be localized to damaged stack variables of the running thread.

The really damning evidence, in my mind, is the fact that the whole problem goes away when I replace

    RecordBatch batch;
    NSData* data = [NSData dataWithBytes:&batch length:sizeof(RecordBatch)];

with

RecordBatch* batch = malloc(sizeof(RecordBatch)); NSData* data = [NSData dataWithBytesNoCopy:batch length:sizeof(RecordBatch) freeWhenDone:YES];

If the code that fills the batch were overrunning, then it would overrun *batch. If the code that processes the records was corrupting it, then it would corrupt *batch the same way it would corrupt the copy of the array at -[data bytes]. If the code corrupting the instance variables of the object created with +dataWithBytes:length:, then it should also be corrupting the internal variables of a similar object created with +dataWithBytesNoCopy:length:freeWhenDone:.

The body of InsertPackageNamesOp is:

- (void)main
{
...
const RecordBatch* batch = (const RecordBatch*)[data bytes];
SortedIndexSlowLock(namesIndex);
[namesIndex insertCount:batch->count records:batch->set.names excludingDuplicates:YES];

"batch->set.names"? Isn't batch->set an array?

Not in the real code. ;) Sorry about the confusion. I ended up mixing the simplified version that I posted with the real code, which is bit more complex. The RecordBatch structure is a buffer for several types of records, so it's defined as

typedef struct {
    PackageRecordType   recordType;
    NSUInteger          count;
    union {
        PackageNameRecord   names[kBatchSizeMax];
        DuplicateNameRecord dups[kBatchSizeMax];
        AdjunctNameRecord   adjuncts[kBatchSizeMax];
        } set;
    } RecordBatch;

I suspect this is a confusion based on your simplification of your code.  If
batch->set is a struct, which contains an array field called "names", then
this code makes sense.


|set| is a union.

However, your calculation of the length for the NSData:

offsetof(ReadBatch,set)+sizeof(PackageNameRecord)*batch.count

no longer does.  I would think it should be something like:

offsetof(ReadBatch,set.names)+sizeof(PackageNameRecord)*batch.count


That would work too.

or, my preference, to assure all of the types are and remain correct:
offsetof(typeof(batch),set.names)+sizeof(batch.set.names[0])*batch.count

Oh, I like that. I have to remember to use the typeof() operator more often...

I kind of agree that the sizeof() calculation is a good place where things could go wrong. But I'm having trouble imagining how it could be the root of this problem. If the length: value was too small, the insert records method would read off the end of the array and process garbage. However, there's some pretty heavy error checking in this routine. For example, all duplicate records are sorted before being written, so the -insert function checks that each inserted value is in ascending order; hard to imaging that that would never fail given random data. And if the length: was too big, +dataWithBytes:length: would just copy more bytes than were needed--also harmless.

SortedIndexSlowUnlock(namesIndex);
}

'records:' is a (const PackageNameRecord*) parameter (I simplified this to
just Record* in the earlier code example). The contents of this array are
never modified by -insertCount:records:excludingDuplicates:.


If the NSConcreteData were still referring to the 'batch' variable from the
stack of -readNamePackagesOp, then you'd be getting Valgrind errors within
-[InsertPackageNamesOp main], wouldn't you?

No. NSConcreteData could (internally) be retaining a reference the original &batch address, but not return that in -bytes. The address returned by -bytes is the only thing insertCount:records:excludingDuplicates: ever sees.

You can try adding logging of the bytes within the data object from within
both methods.  Also, log the object address in each, the address of the
'batch' variable in -readNamePackagesOp, and the -bytes address in -main.

That's worth a try. I'll revert to the earlier code and see if there's anything useful there.

By the way, from which version of Mac OS X were your Valgrind results and the
crash log obtained?  The Valgrind results give a location within
-[NSConcreteData dealloc] of 0x1002AAD08.  The crash log gives an address of
0x00007fff8637dd48.

The valgrind results were from my machine (Xeon Mac Pro, 10.6.5), the crash log was from a customer (Core Duo, Macbook Pro, 10.6.4).

Both addresses seem like they're 64-bit, but I think they're both 32-bit. From my Mac OS X 10.6.4 (10F569) system, adjusting for relocation, the ...d48 address makes sense. It's right after the call from within -[NSConcreteData dealloc] to __NSGenericFree, which jumps to free() (and so disappears from the backtrace).

On the other hand, the ...d08 address doesn't correspond to an instruction. It's in the middle of an instruction within the function prologue. I suspect you have a different version of Mac OS X, with a different Foundation framework. I'd be curious to know exactly what -[NSConcreteData dealloc] is doing when Valgrind complains.


So would I. :)

Lastly, in the modified code you posted for -readNamePackagesOp, there was a
reference to "eof" and a placeholder for "<read a record from the file>".  I
wonder what happens if there's an incomplete record.  Could the inner 'record'
variable actually be incomplete or partially uninitialized?

Good guess, but not in the real code. The batch array is completely assembled before the NSData block and its associated operation are created, so it's impossible to create an operation for an incomplete batch.

In the real code, file records are read into a memory buffer, verified, and then converted into a Package object and returned for processing. The Package object has a |reader| object associated with it--a deserializer roughly equivalent to NSCoder--that decodes the byte stream of the record. The actual code for "!eof" is:

    while ([reader remaining]!=0 ...) { [reader readVar:... type:...

Anything that went seriously wrong (buffer overrun, invalid data type, ...) will cause the reader object to throw an exception, which would terminate the reader thread before it could create the NSOperation to process any of the records it had decoded.

Anyway, this morning I tried to create a test program to reproduce this problem, but it doesn't exhibit any of these issues when run under valgrind. So, clearly something more complex is going on and/or I'm still missing something critical.

I genuinely appreciate the time and effort people have spent helping me track down this problem.


James Bucanek
____________________________________________________________________
Author of Professional Xcode 3                   ISBN: 9780470525227
<http://www.proxcode3.com/>
and Learn Objective-C for Java Developers        ISBN: 9781430223696
<http://objectivec4java.com/>

_______________________________________________
Do not post admin requests to the list. They will be ignored.
Xcode-users mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden


Follow-Ups:

Re: Help DTrace gurus: suggestions for capturing a mis-allocated	NSData object on a customer's system
From: Ken Thomases <email@hidden>


References:  
  >Re: Help DTrace gurus: suggestions for capturing a mis-allocated	NSData object on a customer's system (From: Ken Thomases <email@hidden>)




Prev by Date:
Re: Help DTrace gurus: suggestions for capturing a mis-allocated	NSData object on a customer's system

Next by Date:
Re: Target 10.4 SDK

Previous by thread:
Re: Help DTrace gurus: suggestions for capturing a mis-allocated	NSData object on a customer's system

Next by thread:
Re: Help DTrace gurus: suggestions for capturing a mis-allocated	NSData object on a customer's system

Index(es):

Date
Thread