I've run into some strange problems with our Xgrid controller and
haven't found the answer in the list archives or in the Xgrid
documentation. I'd really appreciate a pointer...
We had a power failure at work which shut down the server running our
controller (no UPS, doh!). After the server came back up, jobs
submitted to the grid would fail with the error "task: unexpected
reply". Does anyone know what might have been causing this problem? I
expected that the controller would just resubmit tasks that failed on
their assigned agents.
Anyway, I tried restarting the controller in Server Admin. After
spinning for a while, Server Admin crashed. I'd seen this problem
before when the Xgrid controller database was corrupted. There was a
job running at the time of the power failure, so I thought the
database may have gotten corrupted again. (Aside: are there tools to
repair the Xgrid database?) I trashed the database files in
/var/xgrid/controller and the controller now restarts fine. However,
all of the agents appear "Offline" in XgridAdmin even though they aren't.
Nothing I've done, including restarting all the agents and the
controller and removing all of the agents from the grid and letting
them re-connect, seems to help.
Can anyone point me in a useful direction?
Thanks!
Barry
P.S. While I'm in question-asking mode, does anyone know where the
interaction between xgridagent and the Energy Saver control panel is
documented? It appears that computers that are running xgridagent and
are connected to a controller don't sleep, even if they should
according to their Energy Saver settings. I'd expect that xgridagent might
prevent sleep while running a task, but even idle agents appear to
prevent their host from sleeping.
Xgrid-users mailing list (email@hidden)