I have been administering our company's small Xsan (2 MDCs, 10 clients) for about 3 years now, and a familiar problem crops up from time to time. Whenever I have to shut it down, I follow the normal shutdown procedure, but when I power it back on it never comes back up as expected.
Currently, although the volume is running and appears to be healthy, none of the clients or MDCs will mount the volume. I have tried it both from Xsan Admin and from xsanctl.
How do you approach troubleshooting this type of problem?
I am not running a DNS server, but the controllers do have static IP addresses. I've been learning Xsan as I go, so I assume my problems are related to some type of faulty setup.
The LUNs show up in Disk Utility on the controllers, and I can fail the volume over between the controllers with no problem. However, I am seeing "Disk Stripe Group is DOWN" in the system log of one of the controllers.
I cannot stop the volume from Xsan Admin (or start it when it's stopped), but I have no problems using cvadmin to stop/start the volume.
This is what shows up in the logs when I attempt to mount the volume on one of the MDCs:
Feb 2 14:49:27 cyberedit12 kernel[0]: CVFS 'HAL': request reserved space 0x11a00000
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <HalLeftMetaJournal> hba 1 lun 0 state 0x200000f4 device </dev/rdisk5>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <AE35Right> hba 1 lun 0 state 0x200000f4 device </dev/rdisk2>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <AE35Left> hba 1 lun 0 state 0x200000f4 device </dev/rdisk4>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <TMARight> hba 1 lun 0 state 0x200000f4 device </dev/rdisk7>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <HalRight> hba 1 lun 0 state 0x200000f4 device </dev/rdisk6>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <HalLeft> hba 1 lun 1 state 0x200000f4 device </dev/rdisk8>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <ProLun1a> hba 2 lun 0 state 0x200000f4 device </dev/rdisk3>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <ProLun2a> hba 1 lun 1 state 0x200000f4 device </dev/rdisk9>
Feb 2 14:49:27 cyberedit12 kernel[0]: CVFS 'HAL': FsBlk size 16384, bits 14, mask 0x3fff
Feb 2 14:49:27 cyberedit12 kernel[0]: CVFS 'HAL': Sector size 512, bits 9, mask 0x1ff
Feb 2 14:49:27 cyberedit12 kernel[0]: Not all drives available on stripe group 2 for filesystem 'HAL'
Feb 2 14:49:27 cyberedit12 kernel[0]: Could not mount filesystem HAL, cvfs error 'No such device' (36)
Feb 2 14:49:28 cyberedit12 com.apple.xsan[27]: mount_acfs: Operation not supported by device
Feb 2 14:49:28 cyberedit12 xsand[27]: mount of volume 'HAL' failed (exit code = 22)
Feb 2 14:49:28 cyberedit12 servermgrd[31]: xsan: [31/1545A0] ERROR: -[SANFilesystem mountVolumeNamed:writable:withOptions:]: mount of 'HAL' failed: Unable to mount volume `HAL'
Feb 2 14:49:28 cyberedit12 Xsan Admin[373]: ERROR: Error mounting volume...: Server returned a non-zero status code (100007)
Feb 2 14:49:34 cyberedit12 fsm[307]: Xsan FSS 'HAL[0]': [Node 132] Disk Stripe Group 2 is DOWN for this client. # disks 9 unitmap[3] 0xfffff partaccess 0x1
Feb 2 14:49:35 cyberedit12 Xsan Admin[373]: ERROR: No session for computer cyberedit7 (CF5BFCCD-75AB-4BDF-9263-650F347B6AA1)
Feb 2 14:49:36 cyberedit12 Xsan Admin[373]: ERROR: Error getting list of volumes: kOfflineError (0)
Feb 2 14:49:36 cyberedit12 servermgrd[31]: xsan: [31/1489C0] ERROR: -[SANFilesystem spotlightSearchLevelForVolume:]: 'HAL': -doSpotlightRpcForVolume failed (2)
Feb 2 14:49:49 cyberedit12 fsm[307]: Xsan FSS 'HAL[0]': [Node 136] Disk Stripe Group 2 is DOWN for this client. # disks 9 unitmap[3] 0xfffff partaccess 0x1
Feb 2 14:50:04 cyberedit12 fsm[307]: Xsan FSS 'HAL[0]': [Node 137] Disk Stripe Group 2 is DOWN for this client. # disks 9 unitmap[3] 0xfffff partaccess 0x1
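
One thing that stands out in the excerpt above: the kernel logs eight CvOpenDisk labels, while the fsm lines say "# disks 9". To make that comparison less error-prone, here is a rough Python sketch that pulls the opened LUN labels and the "stripe group is DOWN" messages out of a log. The function name summarize and both regular expressions are my own and not part of Xsan. I am also only assuming that "# disks" is the number of disks the FSM expects for the volume, so treat the result as a hint about a missing LUN, not a diagnosis.

```python
import re

# Patterns written against the log lines pasted above; they are assumptions
# about the message layout, not a documented Xsan log format.
OPEN_RE = re.compile(r"CvOpenDisk: label <([^>]+)>.*device <([^>]+)>")
DOWN_RE = re.compile(r"Disk Stripe Group (\d+) is DOWN.*# disks (\d+)")


def summarize(log_text):
    """Return (labels, down) parsed from a syslog excerpt.

    labels maps each LUN label the kernel opened to its device node.
    down maps each stripe group reported DOWN to the '# disks' value
    printed on the same line.
    """
    labels = {}
    down = {}
    for line in log_text.splitlines():
        match = OPEN_RE.search(line)
        if match:
            labels[match.group(1)] = match.group(2)
            continue
        match = DOWN_RE.search(line)
        if match:
            down[int(match.group(1))] = int(match.group(2))
    return labels, down


# Three lines copied from the log above, just to exercise the parser.
SAMPLE = """\
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <HalLeftMetaJournal> hba 1 lun 0 state 0x200000f4 device </dev/rdisk5>
Feb 2 14:49:27 cyberedit12 kernel[0]: CvOpenDisk: label <AE35Right> hba 1 lun 0 state 0x200000f4 device </dev/rdisk2>
Feb 2 14:49:34 cyberedit12 fsm[307]: Xsan FSS 'HAL[0]': [Node 132] Disk Stripe Group 2 is DOWN for this client. # disks 9 unitmap[3] 0xfffff partaccess 0x1
"""

if __name__ == "__main__":
    labels, down = summarize(SAMPLE)
    print("LUN labels opened:", len(labels))
    for label, device in sorted(labels.items()):
        print("  ", label, "->", device)
    for group, disks in sorted(down.items()):
        print("Stripe group", group, "DOWN, '# disks' =", disks)
```

Feeding it the full system log from each MDC and each client should show whether the same label is missing everywhere or only on some machines.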