Hi Flink team,

Based on the discussion, I assume we now agree on running a cron job for ARM for the time being. I have been running a POC e2e test in OpenLab for some days [1]. It includes:

  flink-end-to-end-test-part1: split_checkpoints.sh and split_sticky.sh
  flink-end-to-end-test-part2: split_heavy.sh and split_ha.sh
  flink-end-to-end-test-part3: split_misc.sh and split_misc_hadoopfree.sh

part1 and part2 run well. part3 is not stable yet; I need more time to fix it. The container part is not included because of problem 5 below (I'll add it once that is solved).

I had to apply some hacks to make the jobs pass:

1. Frocksdb ARM package:
   https://issues.apache.org/jira/browse/FLINK-13598 (not solved)
2. PrometheusReporterEndToEndITCase doesn't support the ARM architecture:
   https://issues.apache.org/jira/browse/FLINK-14086
   (PR for fix: https://github.com/apache/flink/pull/9768)
3. Elasticsearch X-Pack machine learning doesn't support ARM:
   https://issues.apache.org/jira/browse/FLINK-14126
   (PR for fix: https://github.com/apache/flink/pull/9765)
4. maven-shade-plugin 3.2.1 doesn't work on ARM for Flink (fixed, thanks @Dian Fu)
5. The Flink e2e container test doesn't support ARM:
   https://issues.apache.org/jira/browse/FLINK-14241
   (PR for fix: https://github.com/apache/flink/pull/9782)

Please help review these PRs. Thanks very much.

I have also opened a PR [2] to make Flink run the cron jobs officially. Once it's merged, the jobs will run once a day at 20:00 UTC. The results can be sent to bui...@flink.apache.org if the Flink team gives i...@openlabtesting.org permission to send mail to that list.

[1]: http://status.openlabtesting.org/builds?project=apache%2Fflink
[2]: https://github.com/apache/flink/pull/9416

Thanks.

Xiyuan Wang <wangxiyuan1...@gmail.com> wrote on Wed, Sep 25, 2019 at 5:33 PM:

> Hi Till,
>
> Thanks for your response. All ARM-related work is tracked here:
> https://issues.apache.org/jira/browse/FLINK-13448 and I have created some
> PRs already.
>
> After some local hacking, the E2E tests run well now. I have added
> them to OpenLab already.
> The POC log:
> http://status.openlabtesting.org/builds?project=apache%2Fflink&pipeline=periodic-20
> It runs at 20:00 UTC every day. Following the POC, I have also created the
> official PR for the cron job, which contains the core/test-related module
> tests and the e2e tests (except the container ones):
> https://github.com/apache/flink/pull/9416
>
> Once it's merged, I can configure OpenLab to send the test results to
> bu...@flink.apache.org every day.
>
> Thanks.
>
> Till Rohrmann <trohrm...@apache.org> wrote on Mon, Sep 23, 2019 at 8:40 PM:
>
>> This sounds good, Xiyuan. I'd also be in favour of running the ARM builds
>> regularly as cron jobs, and once we see that they are stable we could run
>> them for every master commit. Hence, I'd say let's fix the above-mentioned
>> problems and then set the nightly cron job up.
>>
>> Cheers,
>> Till
>>
>> On Fri, Sep 20, 2019 at 8:57 AM Xiyuan Wang <wangxiyuan1...@gmail.com> wrote:
>>
>> > Sure, we can first run the daily ARM job the same way as the Travis CI
>> > nightly jobs. Once it's stable enough, we can consider running it per PR.
>> >
>> > BTW, I tested flink-end-to-end-test on ARM over the last few days. As on
>> > Travis, all 7 scenarios were tested:
>> >
>> > 1. split_checkpoints.sh
>> > 2. split_sticky.sh
>> > 3. split_ha.sh
>> > 4. split_heavy.sh
>> > 5. split_misc_hadoopfree.sh
>> > 6. split_misc.sh
>> > 7. split_container.sh
>> >
>> > Scenarios 1-6 work well with some local hacking and bug fixing:
>> >    1. frocksdb doesn't have an official ARM release, so I built and
>> >       installed it locally for ARM.
>> >       https://issues.apache.org/jira/browse/FLINK-13598
>> >    2. Prometheus has an ARM release, but the test always downloads the
>> >       x86 version. Downloading the correct version fixes the issue.
>> >       https://issues.apache.org/jira/browse/FLINK-14086
>> >    3. Elasticsearch 6.0+ enables the X-Pack machine learning feature by
>> >       default, but this feature doesn't support ARM, so Elasticsearch
>> >       6.0+ failed to start on ARM.
>> >       Setting `xpack.ml.enabled: false` fixes this issue.
>> >       https://issues.apache.org/jira/browse/FLINK-14126
>> >
>> > The 7th scenario (container) failed because:
>> >    1. docker-compose doesn't have an official ARM package. Using
>> >       `apt install docker-compose` solves the problem.
>> >    2. minikube doesn't support the ARM architecture. Using kubeadm for
>> >       the K8s installation solves the problem.
>> >
>> > Fixing the problems mentioned above is not hard, so I think we can add
>> > the Flink build, unit tests and e2e tests as nightly jobs now.
>> >
>> > Any thoughts?
>> >
>> > Thanks.
>> >
>> > Stephan Ewen <se...@apache.org> wrote on Thu, Sep 19, 2019 at 5:44 PM:
>> >
>> > > My gut feeling is that having a CI that only runs on a specific
>> > > command will not help too much.
>> > >
>> > > What about going with nightly builds then? We could set up the ARM CI
>> > > the same way as the Travis CI nightly builds (cron builds). They
>> > > report build failures to "bui...@flink.apache.org".
>> > > Maybe Chesnay or Jark could help with what needs to be done to post
>> > > to that mailing list?
>> > >
>> > > A requirement would be that the builds are stable from the ARM
>> > > perspective, meaning that there are no failures at the moment caused
>> > > by ARM-specific issues.
>> > >
>> > > What do the others think?
>> > >
>> > > On Tue, Sep 3, 2019 at 4:40 AM Xiyuan Wang <wangxiyuan1...@gmail.com> wrote:
>> > >
>> > > > The ARM CI is now triggered by GitHub comment only. This means that
>> > > > no PR starts the ARM test unless a `check_arm` comment is added, as
>> > > > I did in the PR [1].
>> > > >
>> > > > A POC for the Flink nightly end-to-end test job has been created as
>> > > > well [2]. I'll keep improving it.
>> > > >
>> > > > Any feedback or questions?
>> > > >
>> > > > [1]: https://github.com/apache/flink/pull/9416
>> > > >      https://github.com/apache/flink/pull/9416#issuecomment-527268203
>> > > > [2]: https://github.com/theopenlab/openlab-zuul-jobs/pull/631
>> > > >
>> > > > Thanks
>> > > >
>> > > > Xiyuan Wang <wangxiyuan1...@gmail.com> wrote on Mon, Aug 26, 2019 at 7:41 PM:
>> > > >
>> > > > > Until the ARM CI is ready, I can turn off the CI test for each PR
>> > > > > and let it be triggered only by a PR comment. It's quite easy for
>> > > > > OpenLab to do this.
>> > > > >
>> > > > > OpenLab has many job pipelines [1]. Right now I use the `check`
>> > > > > pipeline in https://github.com/apache/flink/pull/9416. Its job
>> > > > > trigger contains github_action and github_comment [2]. I can
>> > > > > create a new pipeline for Flink whose trigger contains only
>> > > > > github_comment, like:
>> > > > >
>> > > > >     trigger:
>> > > > >       github:
>> > > > >         - event: pull_request
>> > > > >           action: comment
>> > > > >           comment: (?i)^\s*recheck_arm_build\s*$
>> > > > >
>> > > > > That way the ARM job will not run for every PR. It will run only
>> > > > > for PRs that have a `recheck_arm_build` comment.
>> > > > >
>> > > > > Then once the ARM CI is ready, I can add it back.
>> > > > >
>> > > > > Nightly tests can be added as well, of course. OpenLab has a kind
>> > > > > of job called `periodic job`. We can use it for the Flink nightly
>> > > > > tests. If any error occurs, the report can be sent to
>> > > > > bui...@flink.apache.org as well.
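
A minimal sketch, not from the thread, of how the comment pattern in the trigger above behaves. It uses Python's `re` module rather than Zuul itself, so it assumes Zuul applies the pattern with comparable regex semantics; the helper name `triggers_arm_build` is hypothetical.

```python
import re

# Pattern copied from the trigger definition quoted above.
RECHECK_PATTERN = re.compile(r"(?i)^\s*recheck_arm_build\s*$")


def triggers_arm_build(comment: str) -> bool:
    """Return True if a PR comment matches the trigger pattern.

    Hypothetical helper for illustration only; this is not OpenLab or
    Zuul code.
    """
    return RECHECK_PATTERN.search(comment) is not None


if __name__ == "__main__":
    for text in ("recheck_arm_build", "  Recheck_ARM_Build  ",
                 "please recheck_arm_build"):
        print(repr(text), triggers_arm_build(text))
```

Under these assumptions the pattern is case-insensitive and tolerates surrounding whitespace, but the comment must contain nothing else, so an ordinary review comment that merely mentions the keyword would not start the ARM job.
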
>> > > > >
>> > > > > [1]: https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml
>> > > > > [2]: https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml#L10-L19
>> > > > >
>> > > > > Stephan Ewen <se...@apache.org> wrote on Mon, Aug 26, 2019 at 6:13 PM:
>> > > > >
>> > > > >> Adding CI builds for ARM only makes sense when we actually take
>> > > > >> them into account as "blocking a merge"; otherwise there is no
>> > > > >> point in having them. So we would need to be prepared to do that.
>> > > > >>
>> > > > >> The cases where something runs on UNIX/x64 but fails on ARM are
>> > > > >> few and so far seem to have been related to libraries or some
>> > > > >> magic that tries to do system-dependent actions outside Java.
>> > > > >>
>> > > > >> One worthwhile discussion could be whether to run the ARM CI
>> > > > >> builds as part of the nightly tests, not on every commit.
>> > > > >> There are a lot of nightly tests, for example for different
>> > > > >> Java / Scala / Hadoop versions.
>> > > > >>
>> > > > >> On Mon, Aug 26, 2019 at 10:46 AM Xiyuan Wang <wangxiyuan1...@gmail.com> wrote:
>> > > > >>
>> > > > >> > Sorry, maybe my wording was misleading.
>> > > > >> >
>> > > > >> > We are just starting to add ARM support, so the CI is
>> > > > >> > non-voting at this moment to avoid blocking normal Flink
>> > > > >> > development.
>> > > > >> >
>> > > > >> > But once the ARM CI works well and is stable enough, we should
>> > > > >> > mark it as voting. That means that in the future, if the ARM
>> > > > >> > test fails in a PR, the PR cannot be merged. The test log
>> > > > >> > should tell developers what the error is.
>> > > > >> > If a developer needs to debug the details on an ARM VM, OpenLab
>> > > > >> > can provide one.
>> > > > >> >
>> > > > >> > Adding ARM CI can make sure Flink supports ARM natively.
>> > > > >> >
>> > > > >> > I left a workflow in the PR; I'd like to repeat it here:
>> > > > >> >
>> > > > >> >    1. Add the basic build script to ensure the CI system and
>> > > > >> >       build job work as expected. The job should be marked as
>> > > > >> >       non-voting first, which means a CI test failure won't
>> > > > >> >       block a Flink PR from being merged.
>> > > > >> >    2. Add the test script to run the unit/integration tests. At
>> > > > >> >       this step the --fn parameter will be added to mvn test.
>> > > > >> >       It will run the full set of test cases in Flink, so that
>> > > > >> >       we can find out which tests fail on ARM.
>> > > > >> >    3. Fix the test failures one by one.
>> > > > >> >    4. Once all the tests pass, remove the --fn parameter and
>> > > > >> >       keep watching the CI's status for some days. If bugs come
>> > > > >> >       up then, fix them as we usually do for travis-ci.
>> > > > >> >    5. Once the CI is stable enough, remove the non-voting tag,
>> > > > >> >       so that the ARM CI is treated the same as travis-ci, as
>> > > > >> >       one of the gates for Flink PRs.
>> > > > >> >    6. Finally, the Flink community can announce and release a
>> > > > >> >       Flink ARM version.
>> > > > >> >
>> > > > >> > Chesnay Schepler <ches...@apache.org> wrote on Mon, Aug 26, 2019 at 2:25 PM:
>> > > > >> >
>> > > > >> >> I'm sorry, but if these issues are only fixed later anyway I
>> > > > >> >> see no reason to run these tests on each PR. We're just adding
>> > > > >> >> noise to each PR that everyone will just ignore.
>> > > > >> >>
>> > > > >> >> I'm curious as to the benefit of having this directly in
>> > > > >> >> Flink; why aren't the ARM builds run outside of the Flink
>> > > > >> >> project, and fixes for it provided?
>> > > > >> >>
>> > > > >> >> It seems to me like nothing about these ARM builds is actually
>> > > > >> >> handled by the Flink project.
>> > > > >> >>
>> > > > >> >> On 26/08/2019 03:43, Xiyuan Wang wrote:
>> > > > >> >> > Thanks to Stephan for bringing up this topic.
>> > > > >> >> >
>> > > > >> >> > The package build jobs work well now. I have a simple online
>> > > > >> >> > demo which is built and run on an ARM VM. Feel free to give
>> > > > >> >> > it a try[1].
>> > > > >> >> >
>> > > > >> >> > As the first step for ARM support, maybe it's good to add
>> > > > >> >> > them now.
>> > > > >> >> >
>> > > > >> >> > As for the next step, the test part is still broken. It
>> > > > >> >> > relates to some points we found:
>> > > > >> >> >
>> > > > >> >> > 1. Some unit tests fail[1] because of the Java code. This
>> > > > >> >> >    kind of failure can be fixed easily.
>> > > > >> >> > 2. Some tests fail because they depend on third-party
>> > > > >> >> >    libraries[2]. These include frocksdb, MapR Client and
>> > > > >> >> >    Netty. They don't have ARM releases.
>> > > > >> >> >       a. Frocksdb: I'm testing it locally now with
>> > > > >> >> >          `make check_some` and `make jtest`, similar to its
>> > > > >> >> >          Travis job. 3 tests fail under `make check_some`.
>> > > > >> >> >          Please see the ticket for more details. Once the
>> > > > >> >> >          tests pass, frocksdb can release an ARM package.
>> > > > >> >> >       b. MapR Client: This belongs to the MapR company. At
>> > > > >> >> >          this moment, maybe we should skip MapR support for
>> > > > >> >> >          Flink on ARM.
>> > > > >> >> >       c. Netty.
>> > > > >> >> >          Actually Netty runs well on our ARM machine. We
>> > > > >> >> >          will ask the Netty community to release ARM
>> > > > >> >> >          support. If they do not want to, OpenLab will host
>> > > > >> >> >          a Maven repository for some common libraries on
>> > > > >> >> >          ARM.
>> > > > >> >> >
>> > > > >> >> > Regarding Chesnay's concern:
>> > > > >> >> >
>> > > > >> >> > Firstly, the OpenLab team will keep maintaining and fixing
>> > > > >> >> > the ARM CI. This means that once a build or test fails,
>> > > > >> >> > we'll fix it at once.
>> > > > >> >> > Secondly, OpenLab can provide ARM VMs to everyone for
>> > > > >> >> > reproducing and testing. You just need to create a Test
>> > > > >> >> > Request issue in OpenLab[3]. Then we'll create ARM VMs for
>> > > > >> >> > you, and you can log in and do whatever you want.
>> > > > >> >> >
>> > > > >> >> > Does that make sense?
>> > > > >> >> >
>> > > > >> >> > [1]: http://114.115.168.52:8081/#/overview
>> > > > >> >> > [1]: https://issues.apache.org/jira/browse/FLINK-13449
>> > > > >> >> >      https://issues.apache.org/jira/browse/FLINK-13450
>> > > > >> >> > [2]: https://issues.apache.org/jira/browse/FLINK-13598
>> > > > >> >> > [3]: https://github.com/theopenlab/openlab/issues/new/choose
>> > > > >> >> >
>> > > > >> >> > Chesnay Schepler <ches...@apache.org> wrote on Sat, Aug 24, 2019 at 12:10 AM:
>> > > > >> >> >
>> > > > >> >> >> I'm wondering what we are supposed to do if the build
>> > > > >> >> >> fails?
>> > > > >> >> >> We aren't providing any guides on setting up an ARM dev
>> > > > >> >> >> environment, so reproducing it locally isn't possible.
>> > > > >> >> >>
>> > > > >> >> >> On 23/08/2019 17:55, Stephan Ewen wrote:
>> > > > >> >> >>> Hi all!
>> > > > >> >> >>>
>> > > > >> >> >>> As part of the Flink on ARM effort, there is a pull
>> > > > >> >> >>> request that triggers a build on OpenLabs CI for each push
>> > > > >> >> >>> and runs tests on ARM machines.
>> > > > >> >> >>>
>> > > > >> >> >>> Currently that build is roughly equivalent to what the
>> > > > >> >> >>> "core" and "tests" profiles do on Travis.
>> > > > >> >> >>> The result will be posted to the PR comments, similar to
>> > > > >> >> >>> the Flink Bot's Travis build result.
>> > > > >> >> >>> The build currently passes :-) so Flink seems to be okay
>> > > > >> >> >>> on ARM.
>> > > > >> >> >>>
>> > > > >> >> >>> My suggestion would be to try and add this and gather some
>> > > > >> >> >>> experience with it.
>> > > > >> >> >>> The Travis build results should be our "ground truth" and
>> > > > >> >> >>> the ARM CI (openlabs CI) would be "informational only" at
>> > > > >> >> >>> the beginning, but helping us understand when we break ARM
>> > > > >> >> >>> support.
>> > > > >> >> >>>
>> > > > >> >> >>> You can see this in the PR that adds the openlabs CI
>> > > > >> >> >>> config:
>> > > > >> >> >>> https://github.com/apache/flink/pull/9416
>> > > > >> >> >>>
>> > > > >> >> >>> Any objections?
>> > > > >> >> >>>
>> > > > >> >> >>> Best,
>> > > > >> >> >>> Stephan
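
One of the fixes discussed in this thread (FLINK-14086) is to download the Prometheus build that matches the host architecture instead of always fetching the x86 one. The following is a rough sketch of that idea only, not the code from the linked PR. The function name `prometheus_archive_name`, the architecture mapping, the example version `2.4.3`, and the `prometheus-<version>.linux-<arch>.tar.gz` naming convention are all assumptions made for illustration.

```python
import platform

# Assumed mapping from `platform.machine()` values to the architecture
# suffix used in Prometheus release archive names.
_ARCH_SUFFIX = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def prometheus_archive_name(version, machine=None):
    """Build the Linux archive file name for the given machine type.

    Hypothetical helper: `machine` defaults to the current host as
    reported by `platform.machine()`.
    """
    machine = (machine or platform.machine()).lower()
    try:
        arch = _ARCH_SUFFIX[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None
    return f"prometheus-{version}.linux-{arch}.tar.gz"


if __name__ == "__main__":
    # Example version only; use whatever version the test actually pins.
    print(prometheus_archive_name("2.4.3"))
```

Failing loudly on an unknown machine type, rather than silently falling back to the x86 archive, would surface this kind of problem at download time instead of as an obscure test failure later.
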