On Mon, 16 Apr 2018, William Hay wrote:
...
I don't think that can be right given that the qmaster complains about
multiple user files on start up. If it gave up after the first then
presumably it wouldn't complain about the others.
All I know is that, when we had this sort of problem, most of our users
vanished from the output of 'qconf -suserl'... but returned when the
errors in a small number of user files were corrected.
The user who first drew our attention to this has a user file that looks
like this:
...
It is possible that this file has fixed itself as two tasks from the
problem array job have started and the file has changed since I first
looked at it. However, qconf -suser still doesn't show the user in
question and the array job is apparently stuck at the back of the queue
due to not getting any functional tickets.
...
Nothing particularly bad is leaping out at me: you might want to try a
restart to see if the message appears again.
I'm still trying to remember how this works... it's been a while.
Rethinking this, usage data is stored in both user and project files and I
don't immediately remember which is used when. You might just lose the
user's share relative to others within AllUsers, rather than the share of
the AllUsers project relative to other projects.
Stopping qmaster and removing the entire 'project' line would probably get
you going again. Or perhaps keeping it and removing all fields apart from
those important to your scheduling policies (e.g. cpu, mem and io). I
think I've done that before with projects, but not users. The missing
fields ought to get reinitialised, and probably aren't being used by
anything anyway. (I've always thought of them as a side effect of
adding up job usages to calculate cpu/mem/io. I don't really think
that things like a half-life decayed aggregate value for ru_minflt are of
much use...)
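
For what it's worth, the "remove the entire 'project' line" idea could be
sketched roughly as below. This is only an illustration, not something I've
run against a real spool: it assumes classic (flat file) spooling, one
attribute per line with the attribute name as the first word, and long
values continued with a trailing backslash. The function name
strip_project_usage is made up for this example.

```python
def strip_project_usage(user_file_text: str) -> str:
    """Return the file text with the 'project' attribute removed.

    Assumptions (not checked against a real spool file): one attribute
    per line, attribute name is the first word, and a trailing backslash
    means the value continues on the next line.
    """
    kept = []
    skipping = False  # True while inside a continued 'project' value
    for line in user_file_text.splitlines():
        continues = line.rstrip().endswith("\\")
        if skipping:
            # Drop continuation lines that belong to the 'project' attribute.
            skipping = continues
            continue
        fields = line.split(None, 1)
        if fields and fields[0] == "project":
            # Drop the attribute itself, and remember whether it continues.
            skipping = continues
            continue
        kept.append(line)
    return "".join(kept_line + "\n" for kept_line in kept)
```

If anyone tries this, do it with qmaster stopped and on a copy of the user
file, keeping the original so it can be put back.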
...
Oh fun. Fortunately we're mostly per user functional share and use
share-tree only as a tie breaker. But deleting jobs would be bad. Is
the "probably lose any jobs queued" something you know from experience?
It seems odd that we can have jobs queued and running with the running
qmaster knowing nothing of the user but deleting the file would kill
them on restart.
No, not from experience. Maybe it'll all be fine, then :)
Odd is certainly the word.
Mark