I think I understand well where yatus sits :). And pre-commit it is both - a bit earlier in the stack (i.e. where developer commits the code) but also works very well for the CI case. We use it with `--all-files` now though as this is maybe ~5% of all the time of our CI build time so we have not yet optimized that part - we focused on optimizing the remaining 95% (and I think we already optimized it by at least 50% over last weeks).
For us using pre-commit is important because we do not have to have different sets of checks in the CI and different sets for the developers locally. We just add it once and it runs in both. The pylint one is a very good example and you nicely detected the case I am going to work on. We plan to do exactly this. This is optimization yet to happen for Airflow, and the extreme case is precisely what we are going to address once we address the 95% of other cases. For now, the long "pylint" check is actually what pre-commit does when you run it in with "--all-files" switch. But you are absolutely right in this case it is not needed at all. We simply have not wired it up yet to our "selective tests" introduced last weekend :). Those are wired on the "pytest" level but not yet on the "pylint" level. And this is something we absolutely plan to do. Simply the thing is we are optimizing it addressing the bigger fish first. So just to explain how pre-commit will run in this case -> we already have "run-tests" output set when there is no python code changed, and this will disable pylint, mypy, flake8. All those are a bit special because they require a docker image with all the dependencies to run (Airflow has few 100s of them so in order to get repeatable build we need to run them in a docker container that needs to have the latest dependencies installed). This is an implicit dependency because pylint needs to analyze all the dependencies together with the airflow code and we need to install them. So in this (extreme) case when no python code has changed, we not only have to disable those tests but also disable building the docker file building for pre-commits (which is much longer than the tests). Right now for our builds, we do not only run pre-commits but we also build the images first (this is done in a separate workflow_run). We build those images only once and they are shared between all the jobs (ci check, tests, validation of packages, etc. etc.). This was far more important optimization for us than the "markdown only" change. For the extreme case of markdown-only change, we will have to not only disable the pylint check but also we have to skip building the image entirely because it is not needed in this case. And I am afraid this kind of logic needs to be custom-built, no generic tool can know that for "markdown-only" change the images should not be built in a separate workflow and that we do not need to run all other checks. I think yatus will not be able to tell us automatically "do not build those images in your CI build because none of the actual tests need it". This is a logic that we have to implement regardless - whether we use yatus or pre-commit (please correct me if I am wrong). For all others, this is not a big issue because in total all other pre-commits take 2-3 minutes at best. And if we find that we need to optimize it further we can simply disable the '--all-files' switch for pre-commit and they will only run on the latest commit-changed files (pre-commit will only run the tests related to those changed files). But since they are pretty fast (except pylint/mypy/flake8) we think running them all, for now, is not a problem. Just to take your markdown-only example to show you exactly how pre-commit works - in case we choose to run it for the files changed by the last commit. Attached the output of it. Link to pasteboard here showing the output we get when we do "pre-commit run" in this case: https://pasteboard.co/JvyOQ2f.png In this particular "markdown only" change - the whole pre-commit check took exactly 2s. As you will see only potentially relevant tests for the markdown files were selected: "no-tabs", license adding for "markdown" files, "generating TOC for markdown files", "merge-conflict" check, "detecting private keys" committed, "fixing end-of-files", "trim trailing whitespace", "checking if we have offensive language - such as white/blacklist". All the other tests are skipped. I think this is really equivalent to what yatus does now - if I understand correctly how it works. I do not think yatus can do better than that without wiring in additional hand-coded logic (but maybe I am wrong about it - I would love to hear if it does). BTW. We have an open issue to add markdownlint and since pre-commit is so popular, it's a matter of adding these few lines of configuration to our pre-commit config: https://github.com/markdownlint/markdownlint/blob/master/.pre-commit-hooks.yaml . Funny enough - we discussed this very thing yesterday in this PR here: https://github.com/apache/airflow/pull/11465. The nice thing about the pre-commit framework is that once we add it to our pre-commit config file, everyone who gets the latest master will automatically run markdownlint for markdown changes without having to do anything if they have pre-commit installed (and this markdownlint will happen automatically when they run "git commit" locally). Once we optimize the builds to not build the docker file for non-python changes - all this optimization will be possible and markdown-only change will actually take 2s to run. It was just not our goal for optimization (yet). But we are getting there. On Wed, Oct 14, 2020 at 4:35 AM Allen Wittenauer <a...@effectivemachines.com.invalid> wrote: > > > > On Oct 13, 2020, at 11:46 AM, Jarek Potiuk <jarek.pot...@polidea.com> > wrote: > > > > I am rather interested in how those kinds of cases might be handled > better > > by Yetus - i.e. how much smarter it can be when selecting which parts of > > the tests should be run - and how you would define such relation. What > > pre-commit is doing is rather straightforward (run tests on files that > > changed), what I did in tests takes into account the "structure" of the > > project and acts accordingly. And those are rather simple to implement. > As > > you'd see in my PR it's merely <100 lines in bash to find which files > have > > changed and based on some predefined rules select which tests to run. I'd > > be really interested though if Yates can provide some better ways of > > handling it? > > I think you are misunderstanding where Yetus sits in the stack. > And I also misunderstood where you were running pre-commit; it's clear you > aren't running it _also_ as part of the CI, just as part of the developer > experience. (which also means there is an assumption that every PR _has_ > run those tools ...) > > Yetus' precommit probably better thought about as a pre-merge > utility. The functionality pre-dates git and commit hooks sooooo... > > Anyway, let's look at Airflow. For example, in Airflow's CI is > static-checks-pylint ( > https://github.com/apache/airflow/blob/master/.github/workflows/ci.yml). > This runs on _every_ PR. > > Let's look at https://github.com/apache/airflow/pull/11518. This is a > markdown update. There is no python code. Yet: > > > https://github.com/apache/airflow/runs/1250824427?check_suite_focus=true > > 16 minutes blown on an absolutely useless check. And that's just pylint. > Look at the entire CI Build workflow: > > https://github.com/apache/airflow/actions/runs/305454304 > > ~45 minutes for likely zero value. Spell check might be the only thing > that happened that was actual useful. That's 45 minutes that something > else could have been executing. I haven't looked at the other workflows > that also ran but probably more wasted time. > > Under Yetus, test-patch would have detected this PR was markdown, > ran markdownlint, blanks, and probably a few other things, and reported the > results. It probably would have taken _maybe_ 2 minutes, most of that > spent dealing with the docker image. Hooray! 43 minutes back in the > executor pool for either more Airflow work or for another project! > > Is this an extreme example? Sure. But if you stretch these types > of cuts over a large amount of PRs, it makes a huge, huge difference. > > Airflow is 'advanced' enough in its CI that using test-patch to > cut back on _all_ of these workflows is certainly possible using custom > plug-ins. But it might be easier to use smart-apply-patch's > --changedfilesreport option to generate a list of files in the PR and then > short-circuit workflows based upon that list of file changes. (which > reminds me, we need to update the smart-apply-patch docs to cover this haha) > > === > > In the specific case of testing, test-patch will slice and dice > tests based upon the build tool. So if you are using, say, maven, it will > only run the unit tests for that module. There is nothing an end user > needs to do. No classifications or anything like that. They get that > functionality for free. Since Airflow doesn't have a build tool that Yetus > supports unfortunately. So it wouldn't work out of the box, but it could be > shoe-horned in by supplying a custom build-tool. Probably not worth the > effort in this specific use case, frankly. > > > > -- Jarek Potiuk Polidea <https://www.polidea.com/> | Principal Software Engineer M: +48 660 796 129 <+48660796129> [image: Polidea] <https://www.polidea.com/>