Hi,

On 02.09.2014 at 22:30, Michael Stauffer wrote:
> I'm trying to help users of another cluster whose admin is on vacation - a
> bit of Murphy's Law at work here, it seems.
>
> Their queue keeps failing, and after restarting qmaster it fails again after
> about a minute. The suspicion is some bad job files, judging from these log
> entries:
>
> => Also, the last few lines in the qmaster logfile
> => "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"
> =>
> => 09/02/2014 14:15:02| main|cbica-cluster|C|job file "jobs/00/0005/2729" has zero size
> => 09/02/2014 14:15:02| main|cbica-cluster|C|job file "jobs/00/0005/2726" has zero size
> => 09/02/2014 14:15:02| main|cbica-cluster|C|job file "jobs/00/0005/2727" has zero size
> => 09/02/2014 14:15:02| main|cbica-cluster|C|job file "jobs/00/0005/2728" has zero size
> => 09/02/2014 14:15:02| main|cbica-cluster|C|job file "jobs/00/0003/2326" has zero size
> => 09/02/2014 14:15:02| main|cbica-cluster|E|wrong cull version, read 0x00000000, but expected actual version 0x10020000
> => 09/02/2014 14:15:02| main|cbica-cluster|E|error in init_packbuffer: wrong cull version

Are the qmaster and the commands working, and is it just the exechost that keeps failing? If so, you could stop the execd on that host and remove the complete spool directory for the node. When the execd is started again, it will recreate the directory structure for that node.

If it's the spool structure of the qmaster instead: do you use classic spooling?

-- Reuti

> How can we clear any state files and get a fresh start? Thanks. In the
> meantime I'll look more online for answers.
>
> -M
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
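
Before removing anything, it may help to list exactly which job files are empty. The log entries quoted above complain about files with zero size. Below is a minimal read-only sketch in Python that reports such files. It assumes classic spooling and that the job files live under $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs. That location is inferred from the messages path and the "jobs/00/0005/2729" entries and is not confirmed in this thread. The function name find_zero_size_files is made up for illustration.

```python
import os
from pathlib import Path


def find_zero_size_files(jobs_dir):
    """Return sorted paths (relative to jobs_dir) of all regular files with size 0."""
    root = Path(jobs_dir)
    empty = []
    for path in root.rglob("*"):
        # Only regular files are considered; directories are skipped.
        if path.is_file() and path.stat().st_size == 0:
            empty.append(path.relative_to(root).as_posix())
    return sorted(empty)


if __name__ == "__main__":
    # ASSUMPTION: classic spooling with this directory layout; adjust as needed.
    sge_root = os.environ.get("SGE_ROOT", "")
    sge_cell = os.environ.get("SGE_CELL", "default")
    jobs_dir = Path(sge_root) / sge_cell / "spool" / "qmaster" / "jobs"
    if not jobs_dir.is_dir():
        print(f"no such directory: {jobs_dir}")
    else:
        # Report only; nothing is deleted or modified.
        for name in find_zero_size_files(jobs_dir):
            print(name)
```

The script only prints relative paths such as 00/0005/2729 and changes nothing on disk, so whether to move or delete the listed files remains a separate decision, ideally made with the qmaster stopped and a backup of the spool directory at hand.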