Re: getaddrinfo() starts failing with EAI_AGAIN (again)
Delivered-To: darwin-dev@lists.apple.com

Personally, I'd be looking for horses, not zebras, at this point.

-- Terry

On Jul 10, 2008, at 5:36 AM, Jamus Jegier <jamus@jamus.com> wrote:
On Jul 9, 2008, at 10:21 PM, Terry Lambert wrote:

#!/bin/bash
whoami >> /tmp/debug.log
date >> /tmp/debug.log
dig "$1" >> /tmp/debug.log
ping -c 5 "$1" >> /tmp/debug.log

It is probably not being "fixed" by that for the reasons being assumed by that explanation. The person I would ask for a definitive answer on whether it was as Peter suggests is on vacation.

You said this was a long-running process. The daemon that answers lookup information registers in the system lookupd bootstrap namespace, not the "session" bootstrap namespace, so it's somewhat doubtful to me that losing access to the latter would/should impact you this way.

We internally use exactly this approach for distributed builds, and it seems to me they would break on us the same way for the same reasons, if this were the case. It also seems to me the latency would be a heck of a lot less than 8 hours, if this were a "session" based port rights revocation issue; more like "the instant the session ends".

An outside possibility is that some degenerate behaviour on this code path is causing Mach ports to leak like there's no tomorrow. There are tools to check for this as well, but you should be seeing much worse behavior than just one app failing, were that the case.

On Jul 7, 2008, at 8:41 AM, Jamus Jegier wrote:

A couple months ago, there was a post by Peter Oberauer about getaddrinfo() failing with EAI_AGAIN by processes under a specific process tree.
I am running into the exact same problem, and was wondering if there was any resolution to the issue. I installed Nagios, which frequently forks processes to verify the status of network hosts. After about 8 hours, all child processes start failing with EAI_AGAIN. Under the Nagios parent process, dig works, but ping doesn't. Both DNS and Bonjour lookups begin to fail. Ping and the Nagios helper executables work fine when started under an unrelated process. The only other thing I can add is that I see this on 10.5.3 on a G4 system. I'm installing 10.5.4 now, and will post if I still see this.

Typically, getaddrinfo() fails with EAI_AGAIN when there is a failure of malloc of the memory needed to return the linked list of addrinfo structures. This generally happens if:

(1) you have a memory leak (either you fail to call freeaddrinfo() on the returned memory, or have a different leak)
(2) you have fragmented your process address space (sufficiently that it is impossible to allocate a contiguous memory chunk large enough to return the requested information)
(3) the information you are requesting would be so large that it's impossible to return it (misconfigured DNS server, DNS cache poisoning attack, broken DNS server software, etc.)

You should examine how much memory is in use by your process to distinguish #1, you should use vmmap to distinguish #2, and you should use host and similar commands and/or a packet analyzer (or tcpdump) to distinguish #3.

Thanks for the suggestions, but I believe Peter Oberauer hit the nail on the head - starting the process from launchd solved my issues. Before that, nagios was started by sshing into my system and using sudo to switch to the nagios user like this:

sudo -u nagios /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Nagios would run this script every few minutes successfully before it would start failing after approximately 8-10 hours. In this script, ping would fail while dig would succeed in resolving the domain name.
So I don't think it would be #2, seeing that I doubt that ping would allocate and fragment its address space like you suggested. Also, I don't think it's #3, since Bonjour also fails with a directly connected Airport Extreme, dig returns a good value with DNS queries, and pings outside of the nagios process tree would succeed. I also don't think it's #1, since I would have noticed ping thrashing as it tried to allocate 4 GB on my 1.5 GB system.

I can attempt to reproduce the issue and try to gather hard data if my observations aren't sufficient.

Jamus
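
One gap in the debug script quoted in this thread: it redirects only standard output, so any message a failing command prints to standard error, and its exit status, never reach /tmp/debug.log. Below is a minimal sketch of a variant that records both. The helper name run_logged and the LOG variable are hypothetical, not from the original posts, and the sketch assumes that ping reports a failed lookup on standard error.

```shell
#!/bin/bash
# Sketch only: run_logged and LOG are illustrative names, not part of the
# script posted in the thread.
LOG="${LOG:-/tmp/debug.log}"

# Run a command, appending its stdout, stderr and exit status to $LOG.
run_logged() {
    local status=0
    echo "--- $* ---" >> "$LOG"
    "$@" >> "$LOG" 2>&1 || status=$?
    echo "exit status: $status" >> "$LOG"
    return "$status"
}

# Run the same probes as the original script, but only when a host name
# is supplied as the first argument.
if [ -n "${1:-}" ]; then
    run_logged whoami
    run_logged date
    run_logged dig "$1"
    run_logged ping -c 5 "$1"
fi
```

With this in place, a log entry for ping would show whatever error text it emitted together with a non-zero exit status, which may help separate a resolver failure from an unreachable host.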