Re: Garbage collector vs variable lifetime
- Subject: Re: Garbage collector vs variable lifetime
- From: John Engelhart <email@hidden>
- Date: Mon, 9 Jun 2008 01:03:16 -0400
On Jun 7, 2008, at 10:10 PM, Michael Ash wrote:
On Sat, Jun 7, 2008 at 6:37 PM, Hamish Allan <email@hidden> wrote:
On Sat, Jun 7, 2008 at 7:35 PM, Michael Ash <email@hidden> wrote:
This is pretty nitpicky. If it's in scope but you don't use it, then
it doesn't matter. Kind of like a Zen koan: if an object is collected
in the forest but nobody is pointing to it anymore, does it make a
sound?
:)
I'm just arguing that it makes it more straightforward if the GC
behaves deterministically according to the code you wrote, rather
than what the optimising compiler re-wrote for you.
If you don't like undefined behavior, then C-based languages are a
poor choice. If you don't like nondeterministic object lifetimes, then
garbage collection is a poor choice.
If your statement regarding nondeterministic object lifetimes is true,
and (as I think has been shown) deterministic object lifetimes are
sometimes required for deterministic program behavior, does this not
imply that the current GC system is fundamentally flawed?
I think Hamish is right. It is reasonable to expect that code, as
entered, results in deterministic behavior and that the 'principle of
least surprise' holds. When programming for multithreading, one
implicitly accepts that common programming techniques and their
cause -> effect relationships may no longer be valid. It requires a
complete change in discipline to account for these effects if one
hopes to produce code that executes deterministically. Otherwise, one
is exposed to 'race conditions' in which things work correctly 99% of
the time, but occasionally fail.
It's difficult not to see similarities between multithreaded
programming and programming with Leopard's GC system. If one uses
Leopard's GC system without compensating for these 'non-deterministic
object lifetimes', one essentially creates a race condition. 99% of
the time, these race conditions won't result in abnormal program
behavior, but every once in a while the conditions will be such that
the collector will run and reclaim an allocation that is still in use.
In the case of NSData/NSMutableData, the relationship between the
parent object and the pointer returned by bytes/mutableBytes is
obvious. The parent object has an ivar pointer to the bytes
allocation, so keeping the parent object alive keeps the bytes
allocation alive.
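To make that concrete, a minimal sketch (the names are illustrative):

    NSMutableData *data = [NSMutableData dataWithLength:4096];
    unsigned char *bytes = [data mutableBytes];
    bytes[0] = 0x2A;   // safe only for as long as 'data' itself is
                       // still considered live by the collector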
In the case of NSString and UTF8String, there is no such
relationship. Since there is no pointer from the parent NSString to
the created UTF8String allocation, keeping the parent NSString 'live'
does not keep the UTF8String allocation live.
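By contrast, a sketch of the UTF8String case, assuming the behavior
just described (the string's contents are illustrative):

    NSString *theString = [NSString stringWithFormat:@"%d", 42];
    const char *utf8 = [theString UTF8String];
    // Keeping 'theString' live does not keep the buffer behind 'utf8'
    // live; with no reference from theString to that buffer, the
    // collector is free to reclaim it at any point after this line.
    printf("%s\n", utf8);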
Under GC, something as simple as [[NSMutableData dataWithLength:4096]
mutableBytes] can't be used. It must be restructured such that the
result of the object instantiation is assigned to a variable rather
than being 'anonymous'.
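A sketch of the two forms:

    // Unsafe under GC: nothing keeps the anonymous NSMutableData
    // alive once -mutableBytes returns.
    void *unsafeBuf = [[NSMutableData dataWithLength:4096] mutableBytes];

    // Restructured: 'data' gives the collector a visible reference.
    NSMutableData *data = [NSMutableData dataWithLength:4096];
    void *buf = [data mutableBytes];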
Even this has pitfalls. Leopard's GC system considers all pointers on
the stack to be live, and any pointers in the heap must be updated via
a write barrier. In order to assist the compiler in identifying which
pointers require write barriers, __strong is introduced. It's
important to note, though, that the specification and definition of
the C language in no way requires a stack, and the statement
'{ NSMutableData *data = [NSMutableData dataWithLength:4096]; }' does
not imply in any way, by the C language definition, that the variable
'data' will exist on the stack.
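For illustration, a sketch of where __strong matters (the struct is
hypothetical):

    typedef struct {
        __strong NSString *name;   // heap-resident slot: assignments
                                   // must go through a write barrier
    } Record;

Stack variables of object type need no such annotation, because the
collector conservatively scans the stack anyway.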
It's clear that the C language definition of a pointer and the
definition of a GC pointer under Leopard are close, but not
necessarily the same. A very small addendum to the rules, along the
lines of 'A __strong pointer will remain visible to the GC system from
the point at which it is defined until the end of its enclosing
block.', would neatly solve an awful lot of issues. This one change
would result in generated code that matches expected behavior, versus
the current C pointer rules, which allow the optimizer to consider a
pointer 'dead' at the point of its last use.
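Under such a rule, the earlier example would behave as written (the
helper function is hypothetical):

    {
        NSMutableData *data = [NSMutableData dataWithLength:4096];
        void *buf = [data mutableBytes];
        doSomethingWith(buf);   // hypothetical helper
        // Proposed rule: 'data' remains visible to the collector
        // until here, even though this point is past its last use.
    }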
It does not fix the case of UTF8String, though, as the variable
containing the pointer is a 'char *'. In fact, the only way I can
think of to use the pointer returned by UTF8String that is
deterministic and approximates the old autorelease rules is something
like this (uses C99):
{
    // Disable collection so the buffer returned by UTF8String cannot
    // be reclaimed between the call and the copy.
    [[NSGarbageCollector defaultCollector] disable];
    const char *utf8String = [theString UTF8String];
    size_t utf8StringLength = strlen(utf8String);
    char utf8StringCopy[utf8StringLength + 1];   // C99 variable-length array
    memcpy(utf8StringCopy, utf8String, utf8StringLength);
    utf8StringCopy[utf8StringLength] = '\0';
    [[NSGarbageCollector defaultCollector] enable];
    // utf8StringCopy lives on the stack and is valid until this block ends.
}
As it stands, this is really the only bulletproof way of using the
pointer returned by UTF8String. One can say with certainty that
utf8StringCopy is valid under all uses, by any function, method, or
C-only library function, until the end of the enclosing block,
regardless of the behavior of the collector.
Using the raw UTF8String pointer is far, far more convenient, and very
likely to work 99.99% of the time without problems. Convenience,
however, rarely trumps correctness.
The compiler and garbage collector both do their jobs. They keep the
object around as long as it is referenced, after which it becomes a
candidate for collection. The trouble is just that where it stops
being referenced and where you think it should stop being referenced
are not the same place.
I think this argues for the case that __strong pointers should not
necessarily be treated as equivalent to regular pointers. As I
previously suggested, amending the pointer rules such that 'A
__strong pointer will remain visible to the GC system from the point
at which it is defined until the end of its enclosing block.' would
seem to better reflect people's expected behavior, rather than actual
behavior.
This really highlights the danger of bolting such features onto an
existing compiler and language. There are decades of extremely subtle
implied invariants built into the assumptions used to code the GCC
compiler, especially when it comes to optimization transformations.
Some of these are no longer true when it comes to __strong pointers,
or they create undesirable, subtle side effects.
Now it's your turn -- where is the problem with the compiler marking
references you have *semantically* placed on the stack as strong,
whether or not they really end up on the stack after optimisation?
The problem is that this proposal doesn't make any sense given the
architecture of the Cocoa GC. The GC considers the stack plus
registers of each thread to be one big blob which it then scans for
anything that looks like a pointer. There's no way to mark anything as
being "strong", because the collector considers everything on the
stack to be strong. Even if this were to be resolved it still wouldn't
help because the problem isn't that the data pointer isn't strong, the
problem is that the data pointer *goes away*. No amount of annotation
will fix that; you have to change the compiler to keep object pointers
on the stack through the end of the current scope, and if you make
that change then the annotation is unnecessary anyway.
Actually, one has to tackle the issue of "places it on the stack" as
well, since the language itself does not specify nor require the use
of a stack. One would really need a new storage class specifier, much
like 'auto' and 'register', with the obvious candidate being 'stack'.
As it stands, block-local auto __strong pointers are implied to reside
on the stack, not explicitly required to reside there. Otherwise, one
is stuck with a very subtle coupling of language specification to
implementation requirement, which I think should remain distinct. In
fact, it's a subtle coupling of language specification to the specific
implementation details of Leopard's GC system.
Granted, this is a pedantic point, but I think it's important to be
explicit in such matters. This whole thread has really shown that
there is an uneven application of the 'rules' when it comes to GC. It
also highlights the fact that the current documentation regarding the
GC system is woefully inadequate when it comes to resolving some of
these finer, nuanced points.
I disagree. Since the programmer cannot possibly know the state of the
call stack in any other way than by knowing that the compiler must
keep references to objects right up to the point where those objects
are no longer used, he must not make any assumption as to the lifetime
of those objects beyond that point.
But who is to say the compiler won't make optimisations which alter
when the object is last referenced? (See my reply to Peter for a code
example.)
As Chris Hanson pointed out, the compiler cannot move function or
method calls without changing the underlying semantics of the code, so
you're guaranteed to be safe by doing a [data self] or equivalent at
the end of the loop. You can also, of course, use CFRetain/CFRelease
to more explicitly manage its lifetime.
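A sketch of that pattern (the file path and the processing are
illustrative):

    NSData *data = [NSData dataWithContentsOfFile:path];
    const unsigned char *bytes = [data bytes];
    NSUInteger length = [data length];
    for (NSUInteger i = 0; i < length; i++) {
        process(bytes[i]);   // hypothetical work on the raw buffer
    }
    [data self];   // a message send the compiler cannot legally move,
                   // intended to keep 'data' live through the loop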
As I pointed out, even this is not necessarily true. The GCC
__attribute__(()) annotations 'const' and 'pure' can render this
assumption invalid if they were to be extended to Objective-C methods
(and, since I realize that I haven't actually tried it, they just
might be already).
Also, a strict reading of the documentation for CFRetain and CFRelease
gives no indication of their behavior under GC. In the absence of
anything explicit, I would think that the standard 'toll-free
bridging' rules apply, and therefore CFRetain and CFRelease
essentially become empty function calls. I would tend to think that it
is much more appropriate to use the methods provided by the
NSGarbageCollector class: enableCollectorForPointer: and
disableCollectorForPointer:.
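A sketch of that approach, under the assumption that pinning the
buffer works as those method names suggest (the helper is
hypothetical):

    NSGarbageCollector *gc = [NSGarbageCollector defaultCollector];
    const char *utf8 = [theString UTF8String];
    [gc disableCollectorForPointer:(void *)utf8];   // pin the buffer
    useTheBytes(utf8);                              // hypothetical helper
    [gc enableCollectorForPointer:(void *)utf8];    // un-pin it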
If you're doubtful about this, consider what would happen in a boring
non-collected environment if the compiler were allowed to move this
stuff around. A [obj release] or free(ptr) could be moved to before
code which accesses the object or pointer, which would of course
result in disaster.
Ah, but the devil is in the details. The result of [data self] is
invariant; it does not matter whether it is executed immediately after
the object is instantiated or just before [data release], the result
is always the same. If the compiler can glean this fact, through
either explicit means such as '-(id)self __attribute__((const));' or
via optimization introspection, then the call can be subject to common
subexpression elimination, loop-invariant code motion, even dead code
elimination. Since -(id)self causes no side effects, and its result is
presumably unused in our hypothetical example, the statement
accomplishes no real work and can be eliminated.
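The same effect is easy to demonstrate with a plain C function,
assuming GCC's documented semantics for 'const' (the function is
hypothetical):

    int identity(int x) __attribute__((const));

    void demo(void) {
        identity(42);   // result unused and the function is declared
                        // 'const', so GCC is free to delete the call
    }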
Consider the fact that '[data self]' essentially becomes 'id self(id
self, SEL _cmd) { return(self); }'. If one were to take the LLVM ideas
to their hypothetical logical conclusion, in other words a 'run-time,
continuous, interprocedural optimizing compiler', the compiler would
have everything at its disposal to make this induction all by itself.
So, in the specific case of '[data self]', a sufficiently informed
optimizing compiler can cheerfully mark this statement as dead code,
negating its 'object lifetime extending' effects.
[data self] just so happens to be trivial enough that your statements
might not hold with future changes and advances in the compiler. Your
statements stand for anything more complicated.