Hi, Flink Team,

Based on the discussion, I assume we now agree on running a cron job for ARM
at this point. I have been running POC e2e tests in OpenLab for several
days [1]. They include:

flink-end-to-end-test-part1
    split_checkpoints.sh  and split_sticky.sh
flink-end-to-end-test-part2
     split_heavy.sh  and split_ha.sh
flink-end-to-end-test-part3
    split_misc.sh and split_misc_hadoopfree.sh

part1 and part2 run well. part3 is not stable yet; I need more time to
fix it. The container part is not included because of problem 5 listed
below. (I'll add it once that is solved.)

However, I needed some hacks to make the jobs pass. They are:

1. Frocksdb ARM package: https://issues.apache.org/jira/browse/FLINK-13598
(Not solved)
2. PrometheusReporterEndToEndITCase doesn't support ARM arch:
https://issues.apache.org/jira/browse/FLINK-14086 (PR for fix:
https://github.com/apache/flink/pull/9768)
3. Elasticsearch Xpack Machine Learning doesn't support ARM:
https://issues.apache.org/jira/browse/FLINK-14126 (PR for fix:
https://github.com/apache/flink/pull/9765)
4. maven-shade-plugin 3.2.1 doesn't work on ARM for Flink (Fixed, thanks
@Dian Fu)
5. flink e2e container test doesn't support ARM:
https://issues.apache.org/jira/browse/FLINK-14241 (PR for fix:
https://github.com/apache/flink/pull/9782)
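
For reference, problem 2 comes down to picking the download that matches the
host CPU. Below is a minimal sketch in Python, assuming Prometheus's usual
`prometheus-<version>.linux-<arch>.tar.gz` release naming. The function name
and the version string in the usage note are illustrative only; they are not
taken from the Flink test or from the PR.

```python
import platform

# Map values reported by platform.machine() to the architecture suffix
# used in Prometheus release archive names (assumed naming scheme).
_ARCH_SUFFIX = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def prometheus_archive_name(version, machine=None):
    """Return the release archive name for the given (or current) machine.

    Hypothetical helper: raises ValueError for architectures not in the map
    instead of silently falling back to the x86 build.
    """
    machine = (machine or platform.machine()).lower()
    if machine not in _ARCH_SUFFIX:
        raise ValueError("unsupported architecture: " + machine)
    return "prometheus-{}.linux-{}.tar.gz".format(version, _ARCH_SUFFIX[machine])
```

Calling `prometheus_archive_name("2.4.3", "aarch64")` yields an `arm64`
archive name, while an x86 host keeps getting the `amd64` one.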

Please help review these PRs. Thanks very much.

I have also added a PR [2] to make Flink run the cron jobs officially. Once
it's merged, the jobs will run once a day at 20:00 UTC. The results can be
sent to bui...@flink.apache.org if the Flink team grants
i...@openlabtesting.org permission to send mail to that list.


[1]: http://status.openlabtesting.org/builds?project=apache%2Fflink
[2]: https://github.com/apache/flink/pull/9416


Thanks.

Xiyuan Wang <wangxiyuan1...@gmail.com> wrote on Wed, Sep 25, 2019 at 5:33 PM:

> Hi Till
>     Thanks for your response. All ARM related work is triggered here:
> https://issues.apache.org/jira/browse/FLINK-13448 and I have created some
> PRs already.
>
>     After some local hacking, the E2E tests run well now. I have added
> them to OpenLab already. The POC log:
> http://status.openlabtesting.org/builds?project=apache%2Fflink&pipeline=periodic-20
> It runs at 20:00 UTC every day. Following the POC, I have also created the
> official PR for the cron job, which contains the core/test-related module
> tests and the e2e tests (except the container ones):
> https://github.com/apache/flink/pull/9416
>
>     Once it's merged, I can configure OpenLab to send the test
> results every day to bu...@flink.apache.org.
>
> Thanks.
>
>
>
>
>
> Till Rohrmann <trohrm...@apache.org> wrote on Mon, Sep 23, 2019 at 8:40 PM:
>
>> This sounds good Xiyuan. I'd also be in favour of running the ARM builds
>> regularly as cron jobs and once we see that they are stable we could run
>> them for every master commit. Hence, I'd say let's fix the above mentioned
>> problems and then set the nightly cron job up.
>>
>> Cheers,
>> Till
>>
>> On Fri, Sep 20, 2019 at 8:57 AM Xiyuan Wang <wangxiyuan1...@gmail.com>
>> wrote:
>>
>> > Sure, we can first run the daily ARM job the same way as the Travis CI
>> > nightly jobs. Once it's stable enough, we can consider running it for each PR.
>> >
>> > BTW, I tested flink-end-to-end-test on ARM over the last few days. To match
>> > Travis, all 7 scenarios were tested:
>> >
>> > 1. split_checkpoints.sh
>> > 2. split_sticky.sh
>> > 3. split_ha.sh
>> > 4. split_heavy.sh
>> > 5. split_misc_hadoopfree.sh
>> > 6. split_misc.sh
>> > 7. split_container.sh
>> >
>> > Scenarios 1-6 work well with some local hacking and bug fixing:
>> >     1. frocksdb doesn't have an official ARM release, so I built and
>> > installed it locally for ARM.
>> >           https://issues.apache.org/jira/browse/FLINK-13598
>> >     2. Prometheus has an ARM release, but the test always downloads the x86
>> > version. Downloading the correct version fixes the issue.
>> >           https://issues.apache.org/jira/browse/FLINK-14086
>> >     3. Elasticsearch 6.0+ enables the Xpack machine learning feature by
>> > default, but this feature doesn't support ARM, so Elasticsearch 6.0+ fails
>> > to start on ARM. Setting `xpack.ml.enabled: false` fixes this issue.
>> >           https://issues.apache.org/jira/browse/FLINK-14126
>> >
>> > The 7th scenario for container failed because:
>> >     1. docker-compose doesn't have an official ARM package. Using
>> > `apt install docker-compose` solves the problem.
>> >     2. minikube doesn't support the ARM arch. Using kubeadm for the K8S
>> > installation solves the problem.
>> >
>> > Fixing the problems mentioned above is not hard, so I think we can add the
>> > Flink build, unit tests and e2e tests as nightly jobs now.
>> >
>> > Any idea?
>> >
>> > Thanks.
>> >
>> > Stephan Ewen <se...@apache.org> wrote on Thu, Sep 19, 2019 at 5:44 PM:
>> >
>> > > My gut feeling is that having a CI that only runs on a specific
>> command
>> > > will not help too much.
>> > >
>> > > What about going with nightly builds then? We could set up the ARM CI
>> the
>> > > same way as the Travis CI nightly builds (cron builds). They report
>> build
>> > > failures to "bui...@flink.apache.org".
>> > > Maybe Chesnay or Jark could help with what needs to be done to post to
>> > that
>> > > mailing list?
>> > >
>> > > A requirement would be that the builds are stable, from the ARM
>> > > perspective, meaning that there are no failures at the moment caused
>> by
>> > ARM
>> > > specific issue.
>> > >
>> > > What do the others think?
>> > >
>> > >
>> > > On Tue, Sep 3, 2019 at 4:40 AM Xiyuan Wang <wangxiyuan1...@gmail.com>
>> > > wrote:
>> > >
>> > > > The ARM CI trigger has been changed to `github comment` only. It means
>> > > > that a PR won't start the ARM test unless a `check_arm` comment is added,
>> > > > like what I did in the PR[1].
>> > > >
>> > > > A POC for the Flink nightly end-to-end test job has been created as
>> > > > well[2]. I'll keep improving it.
>> > > >
>> > > > Any feedback or question?
>> > > >
>> > > >
>> > > > [1]: https://github.com/apache/flink/pull/9416
>> > > >
>> https://github.com/apache/flink/pull/9416#issuecomment-527268203
>> > > > [2]: https://github.com/theopenlab/openlab-zuul-jobs/pull/631
>> > > >
>> > > >
>> > > > Thanks
>> > > >
>> > > > Xiyuan Wang <wangxiyuan1...@gmail.com> wrote on Mon, Aug 26, 2019 at 7:41 PM:
>> > > >
>> > > > > Until the ARM CI is ready, I can turn off the CI test for each PR and let
>> > > > > it be triggered only by a PR comment. That is quite easy to do in OpenLab.
>> > > > >
>> > > > > OpenLab has many job pipelines[1]. Right now I use the `check` pipeline in
>> > > > > https://github.com/apache/flink/pull/9416. Its job trigger contains
>> > > > > github_action and github_comment[2]. I can create a new pipeline for
>> > > > > Flink whose trigger contains only github_comment, like:
>> > > > >
>> > > > > trigger:
>> > > > >   github:
>> > > > >     - event: pull_request
>> > > > >       action: comment
>> > > > >       comment: (?i)^\s*recheck_arm_build\s*$
>> > > > >
>> > > > > That way the ARM job will not be run for every PR. It will only be run
>> > > > > for PRs that have a `recheck_arm_build` comment.
>> > > > >
>> > > > > Then once ARM CI is ready, I can add it back.
>> > > > >
>> > > > >
>> > > > > Nightly tests can be added as well, of course. OpenLab has a kind of job
>> > > > > called `periodic job`. We can use it for Flink's nightly tests.
>> > > > > If any error occurs, the report can be sent to bui...@flink.apache.org
>> > > > > as well.
>> > > > >
>> > > > > [1]:
>> > > > > https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml
>> > > > > [2]:
>> > > > > https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml#L10-L19
>> > > > >
>> > > > > Stephan Ewen <se...@apache.org> wrote on Mon, Aug 26, 2019 at 6:13 PM:
>> > > > >
>> > > > >> Adding CI builds for ARM only makes sense when we actually take them
>> > > > >> into account as "blocking a merge", otherwise there is no point in
>> > > > >> having them. So we would need to be prepared to do that.
>> > > > >>
>> > > > >> The cases where something runs on UNIX/x64 but fails on ARM are few,
>> > > > >> and so far seem to have been related to libraries or some magic that
>> > > > >> tries to do system-dependent actions outside Java.
>> > > > >>
>> > > > >> One worthwhile discussion could be whether to run the ARM CI builds as
>> > > > >> part of the nightly tests, not on every commit.
>> > > > >> There are a lot of nightly tests, for example for different Java /
>> > > > >> Scala / Hadoop versions.
>> > > > >>
>> > > > >> On Mon, Aug 26, 2019 at 10:46 AM Xiyuan Wang <wangxiyuan1...@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Sorry, maybe my wording was misleading.
>> > > > >> >
>> > > > >> > We are just starting to add ARM support, so the CI is non-voting at
>> > > > >> > this moment to avoid blocking normal Flink development.
>> > > > >> >
>> > > > >> > But once the ARM CI works well and is stable enough, we should mark it
>> > > > >> > as voting. That means that in the future, if the ARM test fails in a
>> > > > >> > PR, the PR cannot be merged. The test log should tell developers which
>> > > > >> > error occurred. If a developer needs to debug the details on an ARM VM,
>> > > > >> > OpenLab can provide one.
>> > > > >> >
>> > > > >> > Adding ARM CI can make sure Flink supports ARM natively.
>> > > > >> >
>> > > > >> > I left a workflow in the PR; I'd like to copy it here:
>> > > > >> >
>> > > > >> >    1. Add the basic build script to ensure the CI system and build job
>> > > > >> >    work as expected. The job should be marked as non-voting first, which
>> > > > >> >    means a CI test failure won't block a Flink PR from being merged.
>> > > > >> >    2. Add the test script to run the unit/integration tests. At this step
>> > > > >> >    the -fn (--fail-never) parameter will be added to mvn test. It will run
>> > > > >> >    the full set of test cases in Flink, so that we can find which tests
>> > > > >> >    fail on ARM.
>> > > > >> >    3. Fix the test failures one by one.
>> > > > >> >    4. Once all the tests pass, remove the -fn parameter and keep watching
>> > > > >> >    the CI's status for some days. If bugs come up then, fix them as we
>> > > > >> >    usually do for travis-ci.
>> > > > >> >    5. Once the CI is stable enough, remove the non-voting tag, so that
>> > > > >> >    the ARM CI will be the same as travis-ci: one of the gates for Flink
>> > > > >> >    PRs.
>> > > > >> >    6. Finally, the Flink community can announce and release a Flink ARM
>> > > > >> >    version.
>> > > > >> >
>> > > > >> >
>> > > > >> > Chesnay Schepler <ches...@apache.org> wrote on Mon, Aug 26, 2019 at 2:25 PM:
>> > > > >> >
>> > > > >> >> I'm sorry, but if these issues are only fixed later anyway I
>> see
>> > no
>> > > > >> >> reason to run these tests on each PR. We're just adding noise
>> to
>> > > each
>> > > > >> PR
>> > > > >> >> that everyone will just ignore.
>> > > > >> >>
>> > > > >> >> I'm curious as to the benefit of having this directly in
>> Flink;
>> > why
>> > > > >> >> aren't the ARM builds run outside of the Flink project, and
>> fixes
>> > > for
>> > > > >> it
>> > > > >> >> provided?
>> > > > >> >>
>> > > > >> >> It seems to me like nothing about these arm builds is actually
>> > > > handled
>> > > > >> >> by the Flink project.
>> > > > >> >>
>> > > > >> >> On 26/08/2019 03:43, Xiyuan Wang wrote:
>> > > > >> >> > Thanks to Stephan for bringing up this topic.
>> > > > >> >> >
>> > > > >> >> > The package build jobs work well now. I have a simple online demo
>> > > > >> >> > that is built and run on an ARM VM. Feel free to give it a try[1].
>> > > > >> >> >
>> > > > >> >> > As the first step for ARM support, maybe it's good to add them now.
>> > > > >> >> >
>> > > > >> >> > As for the next step, the test part is still broken. It relates to
>> > > > >> >> > some issues we found:
>> > > > >> >> >
>> > > > >> >> > 1. Some unit tests fail[1] because of Java code issues. This kind of
>> > > > >> >> > failure can be fixed easily.
>> > > > >> >> > 2. Some tests fail because they depend on third-party libraries[2],
>> > > > >> >> > namely frocksdb, MapR Client and Netty. These don't have an ARM release.
>> > > > >> >> >      a. Frocksdb: I'm testing it locally now with `make check_some` and
>> > > > >> >> > `make jtest`, similar to its Travis job. 3 tests fail under
>> > > > >> >> > `make check_some`. Please see the ticket for more details. Once the
>> > > > >> >> > tests pass, frocksdb can release an ARM package.
>> > > > >> >> >      b. MapR Client: this belongs to the MapR company. For now, maybe
>> > > > >> >> > we should skip MapR support for Flink on ARM.
>> > > > >> >> >      c. Netty: Netty actually runs well on our ARM machine. We will ask
>> > > > >> >> > the Netty community to release ARM support. If they do not want to,
>> > > > >> >> > OpenLab will host a Maven repository for some common libraries on ARM.
>> > > > >> >> >
>> > > > >> >> >
>> > > > >> >> > For Chesnay's concern:
>> > > > >> >> >
>> > > > >> >> > Firstly, the OpenLab team will keep maintaining and fixing the ARM CI.
>> > > > >> >> > That means that once a build or test fails, we'll fix it right away.
>> > > > >> >> > Secondly, OpenLab can provide ARM VMs to everyone for reproducing and
>> > > > >> >> > testing. You just need to create a Test Request issue in OpenLab[3].
>> > > > >> >> > Then we'll create ARM VMs for you, and you can log in and do whatever
>> > > > >> >> > you want.
>> > > > >> >> >
>> > > > >> >> > Does it make sense?
>> > > > >> >> >
>> > > > >> >> > [1]: http://114.115.168.52:8081/#/overview
>> > > > >> >> > [1]: https://issues.apache.org/jira/browse/FLINK-13449
>> > > > >> >> >        https://issues.apache.org/jira/browse/FLINK-13450
>> > > > >> >> > [2]: https://issues.apache.org/jira/browse/FLINK-13598
>> > > > >> >> > [3]:
>> https://github.com/theopenlab/openlab/issues/new/choose
>> > > > >> >> >
>> > > > >> >> >
>> > > > >> >> >
>> > > > >> >> >
>> > > > >> >> > Chesnay Schepler <ches...@apache.org> wrote on Sat, Aug 24, 2019 at 12:10 AM:
>> > > > >> >> >
>> > > > >> >> >> I'm wondering what we are supposed to do if the build
>> fails?
>> > > > >> >> >> We aren't providing any guides on setting up an ARM dev environment, so
>> > > > >> >> >> reproducing it locally isn't possible.
>> > > > >> >> >>
>> > > > >> >> >> On 23/08/2019 17:55, Stephan Ewen wrote:
>> > > > >> >> >>> Hi all!
>> > > > >> >> >>>
>> > > > >> >> >>> As part of the Flink on ARM effort, there is a pull
>> request
>> > > that
>> > > > >> >> >> triggers a
>> > > > >> >> >>> build on OpenLabs CI for each push and runs tests on ARM
>> > > > machines.
>> > > > >> >> >>>
>> > > > >> >> >>> Currently that build is roughly equivalent to what the
>> "core"
>> > > and
>> > > > >> >> "tests"
>> > > > >> >> >>> profiles do on Travis.
>> > > > >> >> >>> The result will be posted to the PR comments, similar to
>> the
>> > > > Flink
>> > > > >> >> Bot's
>> > > > >> >> >>> Travis build result.
>> > > > >> >> >>> The build currently passes :-) so Flink seems to be okay
>> on
>> > > ARM.
>> > > > >> >> >>>
>> > > > >> >> >>> My suggestion would be to try and add this and gather some
>> > > > >> experience
>> > > > >> >> >> with
>> > > > >> >> >>> it.
>> > > > >> >> >>> The Travis build results should be our "ground truth" and
>> the
>> > > ARM
>> > > > >> CI
>> > > > >> >> >>> (openlabs CI) would be "informational only" at the
>> beginning,
>> > > but
>> > > > >> >> helping
>> > > > >> >> >>> us understand when we break ARM support.
>> > > > >> >> >>>
>> > > > >> >> >>> You can see this in the PR that adds the openlabs CI
>> config:
>> > > > >> >> >>> https://github.com/apache/flink/pull/9416
>> > > > >> >> >>>
>> > > > >> >> >>> Any objections?
>> > > > >> >> >>>
>> > > > >> >> >>> Best,
>> > > > >> >> >>> Stephan
>> > > > >> >> >>>
>> > > > >> >> >>
>> > > > >> >>
>> > > > >> >>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
