I'm sending a job to Aurora that consists of multiple processes. One of
them fails, failing the job. I'd like to get the return code for the failed
process.

Going to the Thermos observer on the slave that runs the job tells me which
process failed, but no return code appears in either the table or the JSON.
However, the thermos_runner.INFO log file in the root of the sandbox
contains this line:

runner.py:139] Process(my_failing_ps) failed [rc=1]

So Thermos, at some point, knows that my process failed with return code 1,
but it's not making it back up to either the web or JSON interfaces. If it
helps, the failed process has a start_time field, but is missing a
stop_time.
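
As a stopgap, the return code could be scraped from the runner log itself.
This is only a rough sketch: it assumes every failure is logged in the same
format as the single line quoted above, which I have not verified, and the
function name failed_return_codes is my own, not part of Thermos.

```python
import re

# Assumed format, inferred from the one log line quoted above:
#   runner.py:139] Process(my_failing_ps) failed [rc=1]
FAILED_RE = re.compile(r"Process\((?P<name>[^)]+)\) failed \[rc=(?P<rc>-?\d+)\]")


def failed_return_codes(lines):
    """Map each failed process name to the last return code logged for it."""
    codes = {}
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            codes[match.group("name")] = int(match.group("rc"))
    return codes


if __name__ == "__main__":
    # In practice the lines would come from thermos_runner.INFO in the sandbox.
    sample = ["runner.py:139] Process(my_failing_ps) failed [rc=1]"]
    print(failed_return_codes(sample))
```

That works per sandbox, but it depends on a log format that may change, so
I'd much rather get the value from the observer.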

Any clues?

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
