We have a number of OSX (10.4.5) machines which NFS mount stuff
(including home directories) from a Solaris8 file-server.
Mostly things work fairly well, except if there is an interruption
(e.g. network glitch or we take down the server for some time).
If this happens then *some* of the OSX clients start moaning about the
nfs server being dead (as one would expect if they try to access the
file-systems) but they *don't* recover when the server does come back.
I'm told that *sometimes* they see a dialog about the server
not responding, offering to umount it (which, not being privileged, they
couldn't do anyway, though the GUI doesn't know that...)
Sadly (most of) our OSX users seem to delight in rebooting the
machines and *then* telling us about the problems so it is hard to
investigate.
However, on at least one occasion I managed to ssh to one of them and
run some tests -- interrupted by the user hitting the power switch
'cos despite telling him over the phone to leave it alone he got bored
after 5 mins...
On that occasion I could verify that the machine was mostly still
alive and NFS worked fine to other servers. Moreover if I NFS mounted
the *same* fs from the server it was happy with that. It was just the
particular mount which was stuck.
Attempting to umount it with the -f force option resulted in the
umount hanging (unkillably), though that did then cause the list
reported by mount to no longer include the dead fs. I was *about* to
re-mount in the same place when the user powered it off.
Sadly this happens for us about once a week (maybe more) because the
server can get somewhat overloaded during nightly backups and can stop
responding for enough seconds for (some of) the OSX clients to get
stuck.
In case it matters, the NFS seems to have been running over UDP. I
can't work out whether OSX has picked nfs2 or nfs3 since I can't find
where the mtab data is stored. The actual mounts are being done with
am-utils (6.1, built by ourselves of course), and we have a couple of
scripts running periodically which attempt to notice things like
memberd/lookupd/ypbind falling over; these also touch some NFS mounted
pieces...
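
A periodic check that touches an NFS mount has to avoid blocking on the very mount it is testing. A minimal sketch in Python of one way to do that, by running the stat() in a child process with a deadline. The name `probe_path` and the 5-second default are hypothetical, chosen for illustration; this is not the script mentioned above.

```python
import subprocess
import sys


def probe_path(path, timeout=5.0):
    """Return True if `path` can be stat()ed within `timeout` seconds."""
    # Run the stat in a child process so that a hung NFS mount blocks
    # the child, not the monitoring script itself.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat succeeded; anything else is a failure.
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in an uninterruptible NFS call
        # may ignore SIGKILL, so do not wait for it to exit.
        child.kill()
        return False
```

A False result only says the path did not answer in time (or does not exist); it does not distinguish a dead server from a stuck client-side mount.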
We see syslog messages like:
  magnus kernel[0]: nfs server <ourfs>: not responding
  magnus KernelEventAgent[36]: tid 00000000 received VQ_NOTRESP event (1)
  magnus KernelEventAgent[36]: tid 00000000 type 'nfs', mounted on '<mountpoint>', from '<ourfs>', not responding
  magnus KernelEventAgent[36]: tid 00000000 found 1 filesystem(s) with problem(s)
  magnus kernel[0]: nfs server <ourfs>: is alive again
when the outage doesn't cause problems. On the occasion I got to
look, I got messages like:
  magnus KernelEventAgent[36]: tid 00000000 received VQ_NOTRESP event (1)
  magnus KernelEventAgent[36]: tid 00000000 type 'nfs', mounted on '<ourfs>', from '<ourfs>', not responding
  magnus KernelEventAgent[36]: tid 00000000 found 1 filesystem(s) with problem(s)
  magnus KernelEventAgent[36]: tid 00000000 received VQ_NOTRESPLOCK event (128)
  magnus KernelEventAgent[36]: tid 00000000 type 'nfs', mounted on '<ourfs>', from '<ourfs>', not responding, lockd not responding
  magnus KernelEventAgent[36]: tid 00000001 display_lock_timeout_warning: generating warning for volume 'raid'
  magnus KernelEventAgent[36]: tid 00000000 found 1 filesystem(s) with problem(s)
followed by a huge number of:
  magnus kernel[0]: nfs server <ourfs>: not responding
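
Since users tend to reboot before anyone can look, the logs may be the only evidence left. A rough sketch in Python of scanning syslog for servers whose last kernel message was "not responding" with no later "is alive again". It assumes the kernel lines look like the ones quoted above. The function name `stuck_servers` and the server names in the usage lines are made up for illustration.

```python
import re

# Matches the two kernel messages quoted above; any timestamp prefix
# in front of the hostname is simply ignored by search().
NFS_MSG = re.compile(
    r"kernel\[\d+\]: nfs server (?P<server>.+?): "
    r"(?P<state>not responding|is alive again)\s*$"
)


def stuck_servers(lines):
    """Return sorted servers whose last logged state is 'not responding'."""
    down = set()
    for line in lines:
        m = NFS_MSG.search(line)
        if not m:
            continue  # KernelEventAgent and unrelated lines are skipped
        if m.group("state") == "not responding":
            down.add(m.group("server"))
        else:
            down.discard(m.group("server"))
    return sorted(down)


# Usage with hypothetical server names:
sample = [
    "magnus kernel[0]: nfs server fs1:/home: not responding",
    "magnus kernel[0]: nfs server fs1:/home: is alive again",
    "magnus kernel[0]: nfs server fs2:/raid: not responding",
]
print(stuck_servers(sample))
```

A server left in the result is only a candidate: the log going quiet after "not responding" does not by itself prove the mount is wedged.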
None of the other NFS clients (including Sun, SGI, DEC, Linux, and (I
think) OSX 10.2) show any similar problems.
Now the VQ_NOTRESPLOCK message makes me wonder if this *might* be
related to the NFS locking code (BSD didn't use to bother to include
NFS client locking until fairly recently)... I know that it took Sun
at least 200 revisions to get the NFS locking code (mostly) working,
so it seems plausible that the early BSD code might be buggy.
If anyone can suggest any debugging we can do (other than anything
involving turning off our file-servers!), I'd be glad to try it.
--
Jon Peatfield, Computer Officer, DAMTP, University of Cambridge
Mail: email@hidden Web: http://www.damtp.cam.ac.uk/