Sure, we can first run the ARM job daily, like the Travis CI nightly jobs. Once it's stable enough, we can consider running it on every PR.
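A nightly ARM job could, for example, run each of the Travis split scripts in turn, keep going after a failure, and collect the failed ones for a report like the ones the Travis cron builds send to bui...@flink.apache.org. The following is a minimal sketch, assuming only the seven split_*.sh script names listed in this thread. The directory passed in and the helper names (`run_nightly`, `build_report`) are hypothetical, not part of Flink or OpenLab:

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence

# The seven end-to-end scenarios listed in this thread (same split as Travis).
SCENARIOS = (
    "split_checkpoints.sh",
    "split_sticky.sh",
    "split_ha.sh",
    "split_heavy.sh",
    "split_misc_hadoopfree.sh",
    "split_misc.sh",
    "split_container.sh",
)


def run_script(path: Path) -> int:
    """Run one split script with bash and return its exit code."""
    return subprocess.run(["bash", str(path)], check=False).returncode


def run_nightly(
    script_dir: Path,
    scenarios: Sequence[str] = SCENARIOS,
    runner: Callable[[Path], int] = run_script,
) -> List[str]:
    """Run every scenario, continuing past failures, and return the failed ones."""
    failed: List[str] = []
    for name in scenarios:
        if runner(script_dir / name) != 0:
            failed.append(name)
    return failed


def build_report(failed: Sequence[str]) -> str:
    """Summarize the run in one line, e.g. for a failure mail."""
    if not failed:
        return "ARM nightly: all scenarios passed"
    return "ARM nightly: failed scenarios: " + ", ".join(failed)
```

On an ARM node one would pass the directory that holds the split scripts and send `build_report(...)` only when the returned list is non-empty. The `runner` parameter is there so the loop can be exercised without actually running the scripts.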
BTW, I tested the Flink end-to-end tests on ARM over the last few days. Following the Travis setup, I tested all 7 scenarios:

1. split_checkpoints.sh
2. split_sticky.sh
3. split_ha.sh
4. split_heavy.sh
5. split_misc_hadoopfree.sh
6. split_misc.sh
7. split_container.sh

Scenarios 1-6 work well after some local hacking and bug fixing:

1. frocksdb doesn't have an official ARM release, so I built and installed it locally for ARM.
   https://issues.apache.org/jira/browse/FLINK-13598
2. Prometheus has an ARM release, but the test always downloads the x86 version. Downloading the correct version fixes the issue.
   https://issues.apache.org/jira/browse/FLINK-14086
3. Elasticsearch 6.0+ enables the X-Pack machine learning feature by default, but this feature doesn't support ARM, so Elasticsearch 6.0+ fails to start on ARM. Setting `xpack.ml.enabled: false` fixes this issue.
   https://issues.apache.org/jira/browse/FLINK-14126

Scenario 7 (containers) failed because:

1. docker-compose doesn't have an official ARM package. Using `apt install docker-compose` solves the problem.
2. minikube doesn't support the ARM architecture. Using kubeadm to install K8S solves the problem.

Fixing the problems mentioned above is not hard. So I think we can add the Flink build, unit tests and e2e tests as nightly jobs now. Any thoughts?

Thanks.

Stephan Ewen <se...@apache.org> wrote on Thu, Sep 19, 2019 at 5:44 PM:

> My gut feeling is that having a CI that only runs on a specific command
> will not help too much.
>
> What about going with nightly builds then? We could set up the ARM CI the
> same way as the Travis CI nightly builds (cron builds). They report build
> failures to "bui...@flink.apache.org".
> Maybe Chesnay or Jark could help with what needs to be done to post to that
> mailing list?
>
> A requirement would be that the builds are stable from the ARM
> perspective, meaning that there are currently no failures caused by
> ARM-specific issues.
>
> What do the others think?
>
> On Tue, Sep 3, 2019 at 4:40 AM Xiyuan Wang <wangxiyuan1...@gmail.com>
> wrote:
>
> > The ARM CI trigger has been changed to work only via GitHub comments. This
> > means that no PR will start the ARM test unless a `check_arm` comment is
> > added, as I did in the PR[1].
> >
> > A POC for a Flink nightly end-to-end test job has been created as well[2].
> > I'll improve it later.
> >
> > Any feedback or questions?
> >
> >
> > [1]: https://github.com/apache/flink/pull/9416
> > https://github.com/apache/flink/pull/9416#issuecomment-527268203
> > [2]: https://github.com/theopenlab/openlab-zuul-jobs/pull/631
> >
> >
> > Thanks
> >
> > Xiyuan Wang <wangxiyuan1...@gmail.com> wrote on Mon, Aug 26, 2019 at 7:41 PM:
> >
> > > Before the ARM CI is ready, I can disable the CI test for each PR and let
> > > it be triggered only by a PR comment. It's quite easy for OpenLab to do
> > > this.
> > >
> > > OpenLab has many job pipelines[1]. Right now I use the `check` pipeline
> > > in https://github.com/apache/flink/pull/9416. Its job trigger contains
> > > github_action and github_comment[2]. I can create a new pipeline for
> > > Flink whose trigger contains only github_comment, like:
> > >
> > >   trigger:
> > >     github:
> > >       - event: pull_request
> > >         action: comment
> > >         comment: (?i)^\s*recheck_arm_build\s*$
> > >
> > > That way the ARM job will not run for every PR. It will run only for
> > > PRs that have a `recheck_arm_build` comment.
> > >
> > > Then, once the ARM CI is ready, I can add it back.
> > >
> > >
> > > Nightly tests can be added as well, of course. OpenLab has a kind of job
> > > called a `periodic job`. We can use it for Flink's daily nightly tests.
> > > If any error occurs, the report can be sent to bui...@flink.apache.org as
> > > well.
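To see which PR comments the `comment` pattern in the trigger above would react to, it can be tried out as a regular expression. This is a minimal sketch using Python's `re` module. It assumes the pattern is searched against the whole comment body, which may differ from how Zuul actually applies it, and `triggers_arm_job` is a hypothetical helper used only to reason about the pattern:

```python
import re

# The comment pattern from the proposed Zuul pipeline trigger.
# (?i) makes it case-insensitive; ^ and $ (without MULTILINE) anchor it to the
# whole comment, so the comment may contain only the keyword plus whitespace.
TRIGGER = re.compile(r"(?i)^\s*recheck_arm_build\s*$")


def triggers_arm_job(comment: str) -> bool:
    """Return True if a PR comment would start the ARM job."""
    return TRIGGER.search(comment) is not None
```

With these anchors, a comment such as "please recheck_arm_build" does not start the job. Only a comment consisting of the keyword alone does, which keeps accidental triggers rare.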
> > >
> > > [1]:
> > > https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml
> > > [2]:
> > > https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml#L10-L19
> > >
> > > Stephan Ewen <se...@apache.org> wrote on Mon, Aug 26, 2019 at 6:13 PM:
> > >
> > >> Adding CI builds for ARM only makes sense when we actually take them into
> > >> account as "blocking a merge"; otherwise there is no point in having them.
> > >> So we would need to be prepared to do that.
> > >>
> > >> The cases where something runs on UNIX/x64 but fails on ARM are few,
> > >> and so far they seem to have been related to libraries or to some magic
> > >> that tries to do system-dependent actions outside Java.
> > >>
> > >> One worthwhile discussion could be whether to run the ARM CI builds as
> > >> part of the nightly tests, not on every commit.
> > >> There are a lot of nightly tests, for example for different Java / Scala /
> > >> Hadoop versions.
> > >>
> > >> On Mon, Aug 26, 2019 at 10:46 AM Xiyuan Wang <wangxiyuan1...@gmail.com>
> > >> wrote:
> > >>
> > >> > Sorry, maybe my wording was misleading.
> > >> >
> > >> > We are just starting to add ARM support, so the CI is non-voting at
> > >> > the moment to avoid blocking normal Flink development.
> > >> >
> > >> > But once the ARM CI works well and is stable enough, we should mark it
> > >> > as voting. That means that in the future, if the ARM test fails in a
> > >> > PR, the PR cannot be merged. The test log may tell developers what
> > >> > error occurred. If a developer needs to debug the details on an ARM VM,
> > >> > OpenLab can provide one.
> > >> >
> > >> > Adding ARM CI ensures that Flink supports ARM natively.
> > >> >
> > >> > I left a workflow in the PR; I'd like to repeat it here:
> > >> >
> > >> > 1. Add the basic build script to ensure the CI system and build job
> > >> >    work as expected. The job should be marked as non-voting first,
> > >> >    meaning a CI test failure won't block a Flink PR from being merged.
> > >> > 2. Add the test script to run the unit/integration tests. At this step
> > >> >    the -fn (--fail-never) parameter will be added to mvn test. It will
> > >> >    run all the test cases in Flink, so that we can find which tests
> > >> >    fail on ARM.
> > >> > 3. Fix the test failures one by one.
> > >> > 4. Once all the tests pass, remove the -fn parameter and keep watching
> > >> >    the CI's status for some days. If any bugs come up then, fix them
> > >> >    as we usually do for travis-ci.
> > >> > 5. Once the CI is stable enough, remove the non-voting tag, so that
> > >> >    the ARM CI becomes, like travis-ci, one of the gates for Flink PRs.
> > >> > 6. Finally, the Flink community can announce and release a Flink ARM
> > >> >    version.
> > >> >
> > >> >
> > >> > Chesnay Schepler <ches...@apache.org> wrote on Mon, Aug 26, 2019 at 2:25 PM:
> > >> >
> > >> >> I'm sorry, but if these issues are only fixed later anyway, I see no
> > >> >> reason to run these tests on each PR. We're just adding noise to each
> > >> >> PR that everyone will just ignore.
> > >> >>
> > >> >> I'm curious as to the benefit of having this directly in Flink; why
> > >> >> aren't the ARM builds run outside of the Flink project, with fixes
> > >> >> for it provided?
> > >> >>
> > >> >> It seems to me like nothing about these ARM builds is actually handled
> > >> >> by the Flink project.
> > >> >>
> > >> >> On 26/08/2019 03:43, Xiyuan Wang wrote:
> > >> >> > Thanks to Stephan for bringing up this topic.
> > >> >> >
> > >> >> > The package build jobs work well now. I have a simple online demo
> > >> >> > that is built and run on an ARM VM. Feel free to try it[1].
> > >> >> >
> > >> >> > As the first step for ARM support, maybe it's good to add them now.
> > >> >> >
> > >> >> > For the next step, however, the test part is still broken. This
> > >> >> > relates to some issues we found:
> > >> >> >
> > >> >> > 1. Some unit tests fail[1] because of the Java code itself. This kind
> > >> >> >    of failure can be fixed easily.
> > >> >> > 2. Some tests fail because they depend on third-party libraries[2],
> > >> >> >    including frocksdb, MapR Client and Netty, which don't have ARM
> > >> >> >    releases.
> > >> >> >    a. frocksdb: I'm testing it locally now with `make check_some` and
> > >> >> >       `make jtest`, similar to its Travis job. 3 tests fail with
> > >> >> >       `make check_some`; please see the ticket for more details. Once
> > >> >> >       the tests pass, frocksdb can release an ARM package.
> > >> >> >    b. MapR Client: this belongs to the MapR company. For now, maybe
> > >> >> >       we should skip MapR support for Flink on ARM.
> > >> >> >    c. Netty: Netty actually runs well on our ARM machines. We will
> > >> >> >       ask the Netty community to release ARM support. If they don't
> > >> >> >       want to, OpenLab will host a Maven repository for some common
> > >> >> >       libraries on ARM.
> > >> >> >
> > >> >> >
> > >> >> > Regarding Chesnay's concern:
> > >> >> >
> > >> >> > First, the OpenLab team will keep maintaining and fixing the ARM CI.
> > >> >> > That means that whenever a build or test fails, we'll fix it at once.
> > >> >> > Second, OpenLab can provide ARM VMs to everyone for reproducing and
> > >> >> > testing. You just need to create a Test Request issue in openlab[3].
> > >> >> > Then we'll create ARM VMs for you; you can log in and do whatever
> > >> >> > you want.
> > >> >> >
> > >> >> > Does it make sense?
> > >> >> >
> > >> >> > [1]: http://114.115.168.52:8081/#/overview
> > >> >> > [1]: https://issues.apache.org/jira/browse/FLINK-13449
> > >> >> > https://issues.apache.org/jira/browse/FLINK-13450
> > >> >> > [2]: https://issues.apache.org/jira/browse/FLINK-13598
> > >> >> > [3]: https://github.com/theopenlab/openlab/issues/new/choose
> > >> >> >
> > >> >> > Chesnay Schepler <ches...@apache.org> wrote on Sat, Aug 24, 2019 at 12:10 AM:
> > >> >> >
> > >> >> >> I'm wondering what we are supposed to do if the build fails.
> > >> >> >> We aren't providing any guides on setting up an ARM dev environment,
> > >> >> >> so reproducing it locally isn't possible.
> > >> >> >>
> > >> >> >> On 23/08/2019 17:55, Stephan Ewen wrote:
> > >> >> >>> Hi all!
> > >> >> >>>
> > >> >> >>> As part of the Flink on ARM effort, there is a pull request that
> > >> >> >>> triggers a build on OpenLab's CI for each push and runs tests on
> > >> >> >>> ARM machines.
> > >> >> >>>
> > >> >> >>> Currently that build is roughly equivalent to what the "core" and
> > >> >> >>> "tests" profiles do on Travis.
> > >> >> >>> The result will be posted to the PR comments, similar to the Flink
> > >> >> >>> Bot's Travis build result.
> > >> >> >>> The build currently passes :-) so Flink seems to be okay on ARM.
> > >> >> >>>
> > >> >> >>> My suggestion would be to try to add this and gather some
> > >> >> >>> experience with it.
> > >> >> >>> The Travis build results should be our "ground truth", and the ARM
> > >> >> >>> CI (OpenLab CI) would be "informational only" at the beginning, but
> > >> >> >>> would help us understand when we break ARM support.
> > >> >> >>>
> > >> >> >>> You can see this in the PR that adds the OpenLab CI config:
> > >> >> >>> https://github.com/apache/flink/pull/9416
> > >> >> >>>
> > >> >> >>> Any objections?
> > >> >> >>>
> > >> >> >>> Best,
> > >> >> >>> Stephan
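The Prometheus download issue (FLINK-14086) and the Elasticsearch start-up failure (FLINK-14126) described at the top of this thread both come down to picking architecture-specific settings. This is a minimal sketch of that idea. It assumes the usual Prometheus GitHub release naming `prometheus-<version>.linux-<arch>.tar.gz` and that the test harness can append lines to elasticsearch.yml for Elasticsearch 6.x. The helper names are hypothetical, not taken from the Flink test scripts:

```python
import platform
from typing import List, Optional

# Map platform.machine() values to the arch suffix used in the names of
# Prometheus release archives (linux-amd64, linux-arm64).
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(machine: Optional[str] = None) -> str:
    """Return 'amd64' or 'arm64' for the given (or current) machine."""
    name = (machine or platform.machine()).lower()
    if name not in ARCH_ALIASES:
        raise ValueError(f"unsupported architecture: {name}")
    return ARCH_ALIASES[name]


def prometheus_url(version: str, machine: Optional[str] = None) -> str:
    """Download URL of the Prometheus archive matching the CPU architecture."""
    arch = normalize_arch(machine)
    return (
        "https://github.com/prometheus/prometheus/releases/download/"
        f"v{version}/prometheus-{version}.linux-{arch}.tar.gz"
    )


def elasticsearch_extra_settings(machine: Optional[str] = None) -> List[str]:
    """Lines to append to elasticsearch.yml; X-Pack ML is turned off on ARM."""
    if normalize_arch(machine) == "arm64":
        return ["xpack.ml.enabled: false"]
    return []
```

Deriving both settings from one `normalize_arch` call keeps the x86 and ARM code paths identical apart from the values, which is roughly what the two tickets ask for.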