Re: getaddrinfo() starts failing with EAI_AGAIN (again)
- Subject: Re: getaddrinfo() starts failing with EAI_AGAIN (again)
- From: Terry Lambert <email@hidden>
- Date: Thu, 10 Jul 2008 08:29:10 -0700
It is probably not being "fixed" by that for the reasons that
explanation assumes.
The person I would ask for a definitive answer on whether it is as
Peter suggests is on vacation.
You said this was a long running process. The daemon that answers
lookup information registers in the system lookupd bootstrap
namespace, not the "session" bootstrap namespace, so it's somewhat
doubtful to me that losing access to the latter would/should impact
you this way. We internally use exactly this approach for distributed
builds, and it seems to me they would break on us the same way for the
same reasons, if this were the case. It also seems to me the latency
would be a heck of a lot less than 8 hours, if this were a "session"
based port rights revocation issue; more like "the instant the session
ends".
An outside possibility is that some degenerate behaviour on this code
path is causing Mach ports to leak like there's no tomorrow. There are
tools to check for this as well, but you should be seeing much worse
behavior than just one app failing, were that the case.
Personally, I'd be looking for horses, not zebras, at this point.
-- Terry
On Jul 10, 2008, at 5:36 AM, Jamus Jegier <email@hidden> wrote:
On Jul 9, 2008, at 10:21 PM, Terry Lambert wrote:
On Jul 7, 2008, at 8:41 AM, Jamus Jegier wrote:
A couple months ago, there was a post by Peter Oberauer about
getaddrinfo() failing with EAI_AGAIN by processes under a specific
process tree.
I am running into the exact same problem, and was wondering if
there was any resolution to the issue.
I installed Nagios, which frequently forks processes to verify the
status of network hosts. After about 8 hours, all child processes
start failing with EAI_AGAIN.
Under the Nagios parent process, dig works, but ping doesn't.
Both DNS and Bonjour lookups begin to fail.
Ping and the Nagios helper executables work fine when started
under an unrelated process.
The only other thing I can add is that I see this on 10.5.3 on a
G4 system. I'm installing 10.5.4 now, and will post if I still
see this.
Typically, getaddrinfo() fails with EAI_AGAIN when there is a
failure of malloc of the memory needed to return the linked list of
addrinfo structures. This generally happens if:
(1) you have a memory leak (either you fail to call freeaddrinfo()
on the returned memory, or have a different leak)
(2) you have fragmented your process address space (sufficiently
that it is impossible to allocate a contiguous memory chunk large
enough to return the requested information)
(3) the information you are requesting would be so large that it's
impossible to return it (misconfigured DNS server, DNS cache
poisoning attack, broken DNS server software, etc.)
You should examine how much memory is in use by your process to
distinguish #1, you should use vmmap to distinguish #2, and you
should use host and similar commands and/or a packet analyzer (or
tcpdump) to distinguish #3.
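As a rough illustration of case #1 above, here is a minimal C sketch of a lookup that reports the getaddrinfo() error code and always frees the returned list. The helper name count_addresses() and its parameters are hypothetical and do not come from this thread; it only shows the getaddrinfo()/gai_strerror()/freeaddrinfo() pattern, not how ping or Nagios actually do their lookups.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the thread): resolve `host`, count the
 * addrinfo entries returned, and release the list before returning.
 * Returns 0 on success, or the EAI_* code reported by getaddrinfo().
 */
int count_addresses(const char *host, int flags, int *count)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    struct addrinfo *p;
    int rc;

    *count = 0;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* avoid one entry per socket type */
    hints.ai_flags = flags;

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) {
        /* Log the failure and flag EAI_AGAIN, the code seen in this thread. */
        fprintf(stderr, "getaddrinfo(%s): %s%s\n", host, gai_strerror(rc),
                rc == EAI_AGAIN ? " (EAI_AGAIN)" : "");
        return rc;
    }

    for (p = res; p != NULL; p = p->ai_next)
        (*count)++;

    /* Skipping this call on every lookup is the leak described in #1. */
    freeaddrinfo(res);
    return 0;
}
```

Calling it with a flags value of 0 and a host name goes through the normal system lookup path; passing AI_NUMERICHOST with a literal address skips name resolution entirely, which can help separate resolver failures from everything else.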
Thanks for the suggestions, but I believe Peter Oberauer hit the
nail on the head - starting the process from launchd solved my issues.
Before that nagios was started by sshing into my system and sudo to
the nagios user like this:
sudo -u nagios /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Nagios would run this script every few minutes successfully before
it would start failing after approximately 8-10 hours. In this
script, ping would fail while dig would succeed in resolving the
domain name.
#!/bin/bash
# Log who we are, the time, and both lookup paths for the host in $1.
whoami >> /tmp/debug.log
date >> /tmp/debug.log
dig "$1" >> /tmp/debug.log
ping -c 5 "$1" >> /tmp/debug.log
So I don't think it would be #2, seeing that I doubt that ping would
allocate and fragment its address space like you suggested. Also, I
don't think it's #3, since Bonjour also fails with a directly
connected AirPort Extreme, dig returns a good value with DNS
queries, and pings outside of the nagios process tree would succeed.
I also don't think it's #1, since I would have noticed ping
thrashing as it tried to allocate 4 GB on my 1.5 GB system.
I can attempt to reproduce the issue and try to gather hard data if
my observations aren't sufficient.
Jamus