Hello,

Recently, in one of our clusters, we noticed production jobs going into the PENDING state due to insufficient CPU. The non-production jobs were not preempted, as we haven't set the --preemption_delay flag for the scheduler, so its default value applies. The default value for this flag is 10 mins. Why is it so high? Is there any reasoning behind using 10 mins as the default value?
We are thinking of using 2 mins for this flag. We wouldn't want to wait longer than 2 mins to run a prod job when resources are constrained. Does that sound reasonable? What's the typical preemption delay used by SREs?

--
Regards,
Bhuvan Arumugam
www.livecipher.com