On Feb 3, 2006, at 1:13 PM, Christopher Dwan wrote:
> Originally, we configured the SAN with one portal machine to be the
> metadata controller, one for failover, and the third as a client.
> We then re-exported the SAN volume via NFS from all three portals
> for a client:server ratio of 13:1 on the NFS side.
You should not be running other services on your Xsan MDCs (primary
or backup). Something like "OD master" or "OD slave" or DNS server
is probably OK in a pinch, but "NFS server", "AFP server", or "SMB
server" definitely is not. Every one of those CPU cycles you steal
from the "fsm" processes on the MDCs is going to negatively impact
your SAN's overall performance.
> This configuration seemed very unresponsive, and it would fall over
> (all three portals reboot or hang) if I loaded the cluster with
> enough work to get a bunch of reads and writes going from all the
> compute nodes.
You didn't mention what version of Xsan you are running... Have you
contacted AppleCare about this problem?
> First we totally isolated the MDC machine (turned off most of the
> other system services) and reconfigured NFS to only serve from the
> two remaining portal machines (one of which was still configured as
> a failover MDC). This bumped the client:server ratio for NFS up to
> around 20:1. This performed better, though I could still knock the
> three portal machines over.
That's because the "backup MDC" can still potentially host the SAN
volume, and anytime it is doing so you run into the same problem as
with the original configuration.
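As a rough sanity check on the ratios quoted above, here is a small sketch of the arithmetic. It assumes the client count implied by the original 13:1 figure (3 servers x 13 = 39 NFS clients), which is not stated anywhere in this thread, and the function name is made up for illustration.

```python
def clients_per_server(clients: int, servers: int) -> float:
    """Average number of NFS clients each NFS server has to handle."""
    if servers <= 0:
        raise ValueError("need at least one NFS server")
    return clients / servers


# Assumption: 13:1 across three portals implies about 39 NFS clients.
clients = 13 * 3

# Three portals serving NFS, then two after isolating the primary MDC.
for servers in (3, 2):
    ratio = clients_per_server(clients, servers)
    print(f"{servers} NFS servers: {ratio:.1f}:1")
```

With two servers this works out to 19.5:1, which is consistent with the "around 20:1" figure, so each remaining NFS server is carrying about half again as many clients as before.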
> In this case, I noticed that the failover MDC would occasionally
> reboot. The logs said something about timeouts communicating with
> the MDC. On a hunch, I decided to remove it as a failover. This
> performed better still, but it still is not resilient to high
> loads. High loads, in this case, are defined as "all the compute
> nodes running jobs that involve reading and writing from their
> NFS-mounted SAN volumes."
At this point, what is failing? The NFS server or the MDCs? I
suggest you contact AppleCare about this issue.
> When the systems are loaded, "top" shows me that the "fsm" process
> on the MDC is using ">>" threads, which means "more than 100".
It is normal for the "fsm" processes to have > 100 threads.
-- thorpej
Xsan-Users mailing list (email@hidden)