On Mar 12, 2007, at 2:57 PM, David Ross wrote:
Here's some data that might explain other AFP hangs or be a new one.
Early XServe G4, 1.33 GHz single processor, 1 GB RAM, running 10.4.8.
Ran fine for years.
We recently connected an XRAID with 2 x 7 x 400 GB drives. Both sides
are set up as RAID 5 with a hot spare.
We have had 5 hangs where the entire office locks up as soon as anyone
touches a shared item. The XServe is still responsive. Well, somewhat.
When you tell it to restart you get a gray screen and have to manually
power it off.
System log shows a series of 5 to 10 i/o errors on one of the RAID
sides that appears to match the time of the lockup and is the last
thing in the log before the startup entries from the restart.
Checking the AFP logs, someone has been copying at least 50 files of
8 MB each, either from one share point to another or from their
client computer to the server. Several copies have been well over
1/2 GB total. All were Finder copies.
An interesting point is that with at least one of these hangs the i/o
errors pointed to the side of the XRAID that we have not even used yet
except to format the space. No user files. Nothing shared. The other
i/o errors were on the side of the XRAID we're using.
This tells me it is very likely NOT the XRAID box or the cables. Both
sides of the XRAID are cabled directly to ports on the same FC card
inside the XServe. So the common point where things come together is
the FC card.
Which leads me to think the problem is in the XServe. There is
something about the amount of data being transferred that is triggering
the problem. Bad FC card, bad memory, OS bug, bus design bug, bus
hardware failure, or who knows. Maybe the bus is being saturated and
something isn't dealing correctly with the situation.
I added a gig of RAM to try to suppress the issue in case it was indeed
an OS bug of some sort. No go.
Tonight I'll pull the original 1 gig of RAM and swap the FC card.
We had planned to retire the XServe this summer after Leopard but now
I wonder if replacing it will solve the issue or just get us to spend
our dollars sooner than planned without solving the issue.
Any other thoughts?

While it's not the most verbose information, have you at least looked
at the event log on the RAID unit via RAID Admin?
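
If it helps to line up the I/O errors in the system log against the AFP copy times or the RAID event log, the bursts of errors can be pulled out mechanically. Below is a rough Python sketch, not a tested tool for this server. It assumes classic syslog-style lines that start with a "Mar 12 14:50:01" timestamp and contain the text "I/O error". The function names, the 2-minute gap, and the sample lines are all made up for illustration and are not taken from the actual logs.

```python
from datetime import datetime, timedelta


def io_error_times(lines, year=2007):
    """Return timestamps of syslog-style lines that mention an I/O error."""
    times = []
    for line in lines:
        if "i/o error" not in line.lower():
            continue
        # Assumed syslog prefix: "Mar 12 14:50:01" (15 characters, no year),
        # so the year has to be supplied by the caller.
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        times.append(stamp)
    return times


def group_bursts(times, gap=timedelta(minutes=2)):
    """Group timestamps into bursts; a new burst starts after more than `gap`."""
    bursts = []
    for t in sorted(times):
        if bursts and t - bursts[-1][-1] <= gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


# Invented sample lines, only to show the expected shape of the input.
sample = [
    "Mar 12 14:50:01 server kernel[0]: disk2s3: I/O error.",
    "Mar 12 14:50:03 server kernel[0]: disk2s3: I/O error.",
    "Mar 12 14:50:04 server kernel[0]: disk3s3: I/O error.",
    "Mar 12 14:58:30 server loginwindow[70]: Login Window Started",
]

for burst in group_bursts(io_error_times(sample)):
    print(burst[0], "to", burst[-1], f"({len(burst)} errors)")
```

Each burst's start and end time can then be compared by hand with the AFP access log and the RAID Admin event log for the same window.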