On 30/10/20 14:38, Zacarias Benta wrote:

> I know it sounds kind of silly giving a limit and at the same time
> allowing for exceptions, but we are trying to prevent the waste of
> valuable cpu time.

Then convince your users to use checkpointing, and set shorter run times (we have 24h for the 'normal' QoS and 72h for the 'long' QoS, which has very low priority).
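A minimal sketch of SIGTERM-triggered checkpointing in Python, assuming the job's state fits in a small JSON file. The names here (run, save_state, load_state, checkpoint.json) are made up for illustration, and the loop body is only a stand-in for real work:

```python
import json
import os
import signal
import tempfile
import threading

# Set by the signal handler; the main loop polls it between work steps.
stop = threading.Event()


def handle_sigterm(signum, frame):
    # Do as little as possible here: just ask the main loop to stop.
    stop.set()


def load_state(path):
    # Resume from the last checkpoint, or start fresh if there is none.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_state(path, state):
    # Write to a temporary file and rename it, so a crash in the middle
    # of a write never leaves a truncated checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(path, n_steps, checkpoint_every=100):
    # Returns True if the job finished, False if it stopped early.
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if stop.is_set() or state["step"] % checkpoint_every == 0:
            save_state(path, state)
        if stop.is_set():
            return False
    save_state(path, state)
    return True


def main():
    # Hypothetical checkpoint location; a real job would use its own path.
    signal.signal(signal.SIGTERM, handle_sigterm)
    finished = run("checkpoint.json", 1000)
    print("finished" if finished else "stopped early, checkpoint written")


if __name__ == "__main__":
    main()
```

The periodic save (checkpoint_every) is an addition for the case where the node dies without any signal being delivered. The SIGTERM path only helps when the scheduler or an admin stops the job cleanly.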
If the program writes the current state when it receives a SIGTERM (it has to be quick, but you can tune the timeout), it can load that state on the next run and nothing is lost.

If you allow a job to run for a month and the node crashes on the 29th day, you've lost a lot. If the job works in chunks of 24h, in the worst case you lose 23h59' ...

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786