Re: Possible bug with nanosleep()?
Delivered-To: Darwin-kernel@lists.apple.com

On Mar 2, 2010, at 12:26 AM, Chris Wilson wrote:
> Hi Terry and all,
>
> On Mon, 1 Mar 2010, Terry Lambert wrote:
>> You need to be looking at tr, not ts, or you need to do the structure assign immediately after the EINTR, before your trace statements. The tr contains the remainder time; the ts structure contents are irrelevant after the nanosleep() call. Your BOX_ macros appear to indicate you wanted the remainder time.
>
> Sorry, my mistake, my code does that but I edited it somewhat while writing the email and forgot to update that part of the sample that I pasted. For reference, this is the code I have in the application now:
>
> [ ... code elided ... ]
>
> And if I change both int32_t to long (or __darwin_time_t, the type used in the structure) on this line:
>
>     int32_t secs = (int32_t) ts.tv_sec;
>
> then it hangs forever on OSX.

Very interesting issue. 8-).

OK, there is a possible scenario which will cause it to drift, and my modification of your original test function to put it into a working test harness didn't trigger it. I've modified my initial test harness to cause the problem to reproduce (moral: always include fully working example code when noting a problem to a mailing list).

The comment in the Libc code about the layout isn't strictly correct. Specifically, ADD_MACH_TIMESPEC() and SUB_MACH_TIMESPEC() are macros, and the macros only really care about having corresponding field names for tv_sec and tv_nsec in the structures, so the difference in element size in the structure isn't going to matter, at least until we hit Y2038.
By that time I expect that the mach_timespec_t will be using a 64 bit tv_sec value (unless you are running your clock forward to do Y2038 compliance testing, in which case, nanosleep() isn't currently Y2038 compliant because clock_get_time() isn't Y2038 compliant, even when doing 64 bit programs).

> Actually I originally wrote the code to pass the same structure for both, and was wondering whether that was allowed (it's not documented as supported or unsupported) so I wanted to separate the two to make sure that people wouldn't claim that I was using the function inappropriately and thus ignore the rest of my message.

>> Even then, this is probably a bad use of signals, since multiple signals being sent won't necessarily result in multiple notifications.
>
> I'm not interested at all in signals in this code, I wish they wouldn't happen, I just want to sleep for the appointed time and nothing else.
>
> Any ideas why I'm seeing (1<<32)-1 in tr.tv_sec after the call, when the call finishes late? This is the actual problem that I'm having.

The base problem causing the issue is the signal handler taking a very long time to run in the test code. Specifically, the idea of the 'remain' in the libc function in the nanosleep() implementation is a result of a call to clock_get_time(), which occurs after the SEMWAIT_SIGNAL(). If an EINTR happens there as the result of the signal, the EINTR will not return from the semwait until after the signal trampoline has been run, which will take however long the signal trampoline takes to run.

Since the current time used to calculate the remainder time is polled non-atomically with regard to the semwait, using the separate clock_get_time, the remaining delta can be off by the amount of time that it took to run the handler. If this number is larger than the initial timespec time request, then you can "go backwards".
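To illustrate how a late clock read can turn into (1<<32)-1 seconds, here is a minimal sketch of a borrow-style subtraction of the kind SUB_MACH_TIMESPEC() is described as performing. It is not the Libc source: struct sketch_timespec and sketch_sub() are hypothetical stand-ins, and the field types (unsigned 32-bit seconds, signed 32-bit nanoseconds) are an assumption about mach_timespec_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000

/* Hypothetical stand-in for mach_timespec_t (assumed layout:
 * unsigned 32-bit seconds, signed 32-bit nanoseconds). */
struct sketch_timespec {
    uint32_t tv_sec;
    int32_t  tv_nsec;
};

/* t1 -= t2, borrowing one second when the nanoseconds go negative.
 * Nothing here checks whether t2 is later than t1, so the unsigned
 * seconds field wraps around when the result would be negative. */
void sketch_sub(struct sketch_timespec *t1, const struct sketch_timespec *t2)
{
    assert(t1 != NULL && t2 != NULL);
    t1->tv_nsec -= t2->tv_nsec;
    if (t1->tv_nsec < 0) {
        t1->tv_nsec += SKETCH_NSEC_PER_SEC;
        t1->tv_sec  -= 1;               /* the borrow */
    }
    t1->tv_sec -= t2->tv_sec;           /* wraps if t2 is later than t1 */
}
```

With a wake-up deadline of 100 s + 0.5 ms and a clock read taken at 100 s + 2.5 ms (that is, 2 ms late because a handler ran first), the "remainder" comes out as tv_sec = 4294967295 and tv_nsec = 998000000 instead of a small negative value.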
So as your code is currently written, you:

(A) Have to ask for a very short sleep
(B) Have a registered signal handler for the interrupting signal which takes longer than the (remaining) very short sleep to run.

Consequently, the SUB_MACH_TIMESPEC() subtracts 1 from the seconds as a borrow for the tv_nsec, and underflows.

The usual suggestion for (A) is to not loop so tightly, or, if you are doing the work in a thread other than the one you are observing the timing on (by injecting a thread via the test harness), to block all signals on it. Asking for a really tight loop, though, would still leave you racing up to the expiration time because of (B).

The usual suggestion for (B) is to only set a volatile variable in any signal handler, and then examine it in the main thread, instead of doing the work in the signal handler. It would be a very tiny window then, but given the code, you'd still have one, since you're not going to hit the trampoline in 0 instructions. POSIX specifically states "The suspension time may be longer than requested" for resolution and scheduling reasons, so that's not entirely incorrect.

For now, I would suggest you look at the BAD_MACH_TIMESPEC() macro, and post-test the remainder from the EINTR. If it comes up bad, then it means that you've actually spent more time than requested in the nanosleep() plus the signal handler(s), and you can just pretend it returned normally, without the EINTR.

I'd caution you that if a signal handler is taking long enough to run that it's causing you to see this, then it's probably also throwing off any elapsed time you are measuring. If all you are doing is delaying "for at least this long", then that's not an issue for you.

Meanwhile, you should file a bug, and include this conversation as part of your description. The component it needs to be filed against is Libc.

Hope that helps,
-- Terry
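The two suggestions can be combined in a retry loop along these lines. This is only a sketch under stated assumptions: note_signal(), remainder_is_plausible() and sleep_at_least() are hypothetical names, and the plausibility check is a portable stand-in (a remainder that is not normalized, is negative, or exceeds what was requested is treated as "already overslept"); it is not the BAD_MACH_TIMESPEC() macro itself.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <time.h>

/* Set by the handler, examined later by the main thread. */
volatile sig_atomic_t got_signal = 0;

/* Handler that only records the event; install it with sigaction(). */
void note_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Hypothetical check: a remainder is plausible only if it is normalized,
 * non-negative, and no larger than the time that was requested. */
int remainder_is_plausible(const struct timespec *req, const struct timespec *rem)
{
    if (rem->tv_nsec < 0 || rem->tv_nsec >= 1000000000L) return 0;
    if (rem->tv_sec < 0 || rem->tv_sec > req->tv_sec) return 0;
    if (rem->tv_sec == req->tv_sec && rem->tv_nsec > req->tv_nsec) return 0;
    return 1;
}

/* Sleep for at least `total`, restarting after EINTR.
 * Returns 0 on completion, -1 on any error other than EINTR. */
int sleep_at_least(struct timespec total)
{
    struct timespec req = total;
    struct timespec rem = { 0, 0 };

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        if (!remainder_is_plausible(&req, &rem))
            return 0;       /* overslept: pretend it returned normally */
        req = rem;          /* otherwise sleep for what is left */
    }
    return 0;
}
```

The choice to return success on an implausible remainder only suits callers that want to delay "for at least this long"; code that measures elapsed time would need its own clock reads.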