On 4/19/21 1:07 PM, Jarek Potiuk wrote:
> Also some comments for the stats. This is good stuff Marton.
> Apparently, what's named "jobhours" in your statistics is actually the
> runtime for an entire workflow (the sum of all job runtimes for that
> workflow). That's at least what I conclude if I look at this workflow,
> which your table lists as the longest Arrow "job" with 24 hours of
> runtime: https://github.com/apache/arrow/actions/runs/699123317
> None of the jobs in that workflow took more than 6 hours, but
> cumulatively they indeed end up around 24 hours (because 4 jobs timed
> out at 6 hours).
That's correct. This is the sum of the time between started_at and
completed_at for each job in a workflow run (using the jobs API, like
this: https://api.github.com/repos/elek/flekszible/actions/runs/621666286/jobs).
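In Python it is roughly this (a simplified, untested sketch of that
calculation, not the exact script I run; authentication and pagination
are left out, and the run URL is just the example above):

# Sketch: sum completed_at - started_at over the jobs of one workflow run.
from datetime import datetime

import requests

JOBS_URL = ("https://api.github.com/repos/elek/flekszible/"
            "actions/runs/621666286/jobs")

def parse(ts):
    # GitHub returns ISO-8601 timestamps like "2021-04-19T11:07:32Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

resp = requests.get(JOBS_URL, params={"per_page": 100})
resp.raise_for_status()

total_seconds = 0
for job in resp.json()["jobs"]:
    # completed_at is null while a job is still running, so skip those
    if job.get("started_at") and job.get("completed_at"):
        delta = parse(job["completed_at"]) - parse(job["started_at"])
        total_seconds += delta.total_seconds()

print("jobhours for this run: %.2f" % (total_seconds / 3600))
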
> It does look like you have workflows rather than jobs - we had very
> similar problems when we (Tobiasz - one of the Airflow contributors)
> tried to get the stats. The REST API limitations are super painful;
> there is no way to dig down to the job level (unfortunately there is
> no GraphQL version to do it efficiently).
> We found that rather than looking at jobhours, it's much better to
> look at the "in-progress" and "queued" workflows from each project.
> It gives a much better overview of what's going on.
100% agree. And I think the "jobhours" include the queue time as well
(and a rerun overwrites all the data).
(BTW, I agree with all the other points, too ;-) )
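(For reference, pulling those counts per repository looks simple with
the status filter of the workflow runs API. A rough, untested sketch;
the repository list is just an example and authentication is omitted:)

# Sketch: count queued / in-progress workflow runs per repository.
import requests

repos = ["apache/arrow", "apache/airflow"]  # example projects

for repo in repos:
    for status in ("queued", "in_progress"):
        r = requests.get(
            "https://api.github.com/repos/%s/actions/runs" % repo,
            params={"status": status, "per_page": 1},
        )
        r.raise_for_status()
        # total_count in the response is the number of matching runs
        print(repo, status, r.json()["total_count"])
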
> we regularly run and store in Google Bigquery and simple DataStudio
> report showing it (unfortunately we cannot share it with everyone as
> it will incur some costs if it is publicly used).
Is it possible to share the raw data in some form? If you can publish
the data in any form (CSV? SQLite?), I can generate static HTML files
with Python notebooks which can be shared with everybody...
(BTW, how do you get the data? Do you somehow poll the actual runs, or
collect data from the workflow runs / jobs API (this is what I do)?)
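In case it's useful, the collection on my side is roughly this shape
(a simplified sketch of the workflow-runs part only; the real script
also walks the jobs API per run and pages until there are no more
results, and the field selection here is just illustrative):

# Sketch: page through workflow runs and dump one CSV row per run.
import csv

import requests

repo = "apache/arrow"  # example repository

rows = []
for page in range(1, 4):  # only a few pages here for brevity
    r = requests.get(
        "https://api.github.com/repos/%s/actions/runs" % repo,
        params={"per_page": 100, "page": page},
    )
    r.raise_for_status()
    runs = r.json()["workflow_runs"]
    if not runs:
        break
    for run in runs:
        rows.append([repo, run["id"], run["status"],
                     run["created_at"], run["updated_at"]])

with open("workflow_runs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["repo", "run_id", "status", "created_at", "updated_at"])
    writer.writerows(rows)
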
Marton