Hi,

On 14.06.2012 at 17:36, Sabine Kreidl wrote:

> thanks very much for all the suggestions and sorry for my late follow up. I
> have to admit, that adding disabled nodes to the AR has its advantages for
> e.g. maintenance windows. It would be a very nice feature, though, if one was
> able to specify the desired behavior (respecting disabled nodes vs. omitting
> them from the AR) with an option to qrsub - as a potential RFE? :-)

At one location there is already one:

https://arc.liv.ac.uk/trac/SGE/ticket/770

You can extend this if you like.

-- Reuti


> I currently had a (new) problem with a waiting AR for a maintenance window.
> The used version on this system is SGE 6.2u3, admittedly, so maybe this is a
> known and already resolved issue within newer versions:
>
> We have two queues, only one of them - par.q - accepting parallel jobs, i.e.
> associated with our defined PEs. I got the AR submitted via
> qrsub -u XXX,YYY -a 07051000 -e 07091000 -pe openmpi-* 1008
> granted within par.q (default job runtimes are 10 days, so we do have plenty
> of time still).
> All of a sudden the available slots for all instances of par.q were set to 0
> and no parallel jobs got scheduled anymore. Accordingly, "qstat -g c" showed
> a negative count for available slots in par.q (some parallel jobs still
> running). As I suspected the AR, I deleted it, but a Master restart was
> necessary before the default 8 cores per queue instance were recognized again.
>
> Does anyone have experience with such a behavior and maybe some suggestions
> on how to avoid the problem?
>
> Thanks again and best regards,
> Sabine
>
>
> On 16.02.2012 01:06, Dave Love wrote:
>> William Hay <[email protected]>
>> writes:
>>
>>
>>> We have a complex associated with every node called status that is
>>> normally set to OK. When a node has a problem we set it to a
>>> description of the problem instead. Our JSV ensures jobs always
>>> request status=OK. With a similar complex you could request status=OK
>>> when making the AR.
>>>
>> Yes, I think that's the only solution currently for disabled queues, but
>> I'd guess it's straightforward to avoid them as an option if someone
>> would like to try. We don't currently use AR, so I haven't looked at
>> it.
>>
>>
>>> We also have a script that lists out nodes that aren't OK and their
>>> status. Essentially duplicating the functionality of pbsnodes under
>>> Torque. With this available as a permanent way to disable nodes we've
>>> set queues to enabled at startup and use qmod -d to mean "disabled
>>> till next reboot" only.
>>>
>> I tag bad nodes with a comment and put them into a "testing" hostgroup
>> with access only for admins (via RQS, which will be ignored for AR for a
>> reason I don't follow). I think if node user_lists were used instead of
>> the RQS to restrict access, an AR would exclude the bad nodes for
>> non-admins, but I'm not sure.
>> _______________________________________________
>> users mailing list
>>
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
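
For illustration only: William's suggestion of requesting status=OK when making the AR could be scripted roughly as in the sketch below. It only assembles the qrsub argument list from the values quoted in the thread and runs nothing. The helper name build_qrsub_args is hypothetical, and the sketch assumes that a site-defined complex named "status" exists (as in William's setup) and is requested through qrsub's -l option. Whether this avoids the slot problem on 6.2u3 is untested.

```python
import shlex


def build_qrsub_args(users, start, end, pe, slots, resources=None):
    """Assemble a qrsub argument list; nothing is executed here.

    `resources` maps complex names to requested values, e.g. {"status": "OK"}.
    The "status" complex is site-defined (an assumption), not an SGE built-in.
    """
    args = [
        "qrsub",
        "-u", ",".join(users),   # users allowed to use the reservation
        "-a", start,             # start time, as given in the thread
        "-e", end,               # end time, as given in the thread
        "-pe", pe, str(slots),   # parallel environment and slot count
    ]
    if resources:
        # Hard resource request, e.g. -l status=OK
        args += ["-l", ",".join(f"{k}={v}" for k, v in resources.items())]
    return args


if __name__ == "__main__":
    args = build_qrsub_args(
        ["XXX", "YYY"], "07051000", "07091000", "openmpi-*", 1008,
        resources={"status": "OK"},
    )
    # shlex.join quotes the PE wildcard so the shell does not expand it.
    print(shlex.join(args))
```

Printing the command instead of running it makes it easy to check the request before submitting; the wildcard PE name needs quoting on a real command line.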
