On Mon, 16 Apr 2018, William Hay wrote:
...
I don't think that can be right given that the qmaster complains about multiple user files on start up. If it gave up after the first then presumably it wouldn't complain about the others.

All I know is that, when we had this sort of problem, most of our users vanished from the output of 'qconf -suserl'... but returned when the errors in a small number of user files were corrected.

The user who first drew our attention to this has a user file that looks like this:
...
It is possible that this file has fixed itself as two tasks from the problem array job have started and the file has changed since I first looked at it. However qconf -suser still doesn't show the user in question and the array job is apparently stuck at the back of the queue due to not getting any functional tickets.
...

Nothing particularly bad is leaping out at me: you might want to try a restart to see if the message appears again.

I'm still trying to remember how this works... it's been a while. Rethinking this, usage data is stored in both user and project files and I don't immediately remember which is used when. You might just lose the relative share between this user and the others in AllUsers, rather than the share of the AllUsers project relative to other projects.

Stopping qmaster and removing the entire 'project' line would probably get you going again. Or perhaps keep it and remove all fields apart from those important to your scheduling policies (e.g. cpu, mem and io). I think I've done that before with projects, but not users. The missing fields ought to get reinitialised, and probably aren't being used by anything anyway. (I've always thought of them as a side effect of adding up job usages to calculate cpu/mem/io. I don't really think that something like a half-life decayed aggregate value for ru_minflt is of much use...)
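
For what it's worth, here is a minimal Python sketch of the "remove the entire 'project' line" idea. It assumes the user file is plain text with one "keyword value" entry per line, and that a long entry may be wrapped with a trailing backslash; I haven't checked either against a real spool file. The function name and the sample content are made up for illustration and are not the file from this thread. If you try something like this, do it on a copy and with qmaster stopped.

```python
def drop_project_line(text):
    """Return the user-file text without its 'project' entry.

    Assumption: entries are 'keyword value' lines, and a trailing
    backslash means the entry continues on the next line.
    """
    kept = []
    skipping = False  # True while inside a continued 'project' entry
    for line in text.splitlines():
        if skipping:
            # Drop continuation lines until one does not end in a backslash.
            skipping = line.rstrip().endswith("\\")
            continue
        if line.split(None, 1)[:1] == ["project"]:
            skipping = line.rstrip().endswith("\\")
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical file content, for illustration only.
sample = (
    "name exampleuser\n"
    "fshare 0\n"
    "project AllUsers cpu=1.0,mem=2.0,io=3.0,\\\n"
    "    ru_minflt=4.0\n"
    "default_project NONE\n"
)
print(drop_project_line(sample), end="")
```

This only covers dropping the whole line. Pruning it down to cpu, mem and io would need the exact layout of that line, which I'd rather not guess at.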

...
Oh fun. Fortunately we're mostly per user functional share and use share-tree only as a tie breaker. But deleting jobs would be bad. Is the "probably lose any queued jobs" part something you know from experience? It seems odd that we can have jobs queued and running with the running qmaster knowing nothing of the user, yet deleting the file would kill them on restart.

No, not from experience. Maybe it'll all be fine, then :)

Odd is certainly the word.

Mark
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
