Re: Ask for ARM CI for spark

bo zhaobo Sun, 22 Sep 2019 20:56:41 -0700

Hi Guys,

We have recently been testing PySpark on ARM and found some issues that we
cannot explain. Could you please have a look when you have time?
Thanks.


There are two issues:
1. The first one looks like an ARM performance issue: the job started by a
PySpark test has not fully finished when the assert check is executed. When we
changed the source code in our local environment to test this, the tests
passed. We opened a JIRA issue [1] for it. If you have time, please help. Thanks.
2. The second one looks like a Spark internal issue. When we run
"pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_train_prediction",
it fails in the "condition" function. We dug into it and found that
the predicted value is still [0. 0. .....0.], even though we wait for a long
time in the ARM test environment. I think that is the main cause, but we failed
to find out which step goes wrong. Could you please help figure it out? I
uploaded the test logs after inserting some 'printf'-style output into the
'func' function of the test case. I tried on both ARM and x86. The ARM log is
[2] and the x86 log is [3]. The two test environments are the same except for
the architecture.
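
To make the timing question in both issues concrete, here is a minimal Python sketch of a "poll until the condition holds" helper. It is only an illustration under stated assumptions: the names `eventually`, `slow_job` and `predictions` are hypothetical, and this is not the helper the PySpark tests actually use. The point is that an assert which runs once, before a slow job has finished, fails on a slow machine, while a polled check with a generous timeout only fails if the value never changes (which is what we seem to see in issue 2).

```python
import threading
import time


def eventually(condition, timeout=30.0, interval=0.1):
    """Poll condition() until it returns True or timeout seconds elapse.

    Hypothetical helper for illustration only, not a PySpark API.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            if condition():
                return True
        except AssertionError as exc:
            # Remember the most recent assertion failure for the final message.
            last_error = exc
        if time.monotonic() >= deadline:
            raise AssertionError(
                "condition not met within %.1f s (last error: %r)"
                % (timeout, last_error)
            )
        time.sleep(interval)


# Simulated slow job: the predictions stay at zero until the job finishes.
predictions = [0.0] * 5


def slow_job():
    time.sleep(0.3)  # stands in for slower execution on the test machine
    predictions[:] = [1.0] * 5


worker = threading.Thread(target=slow_job)
worker.start()
# A one-shot assert here would fail; polling waits for the job to finish.
eventually(lambda: any(p != 0.0 for p in predictions), timeout=5.0)
worker.join()
```

If a check like this still times out with a long timeout, the problem is probably not speed alone, which matches what we observe for the second issue.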

Thanks. If you have time, please help us.

Best Regards

[1] https://issues.apache.org/jira/browse/SPARK-29205
[2] https://etherpad.net/p/pyspark-arm
[3] https://etherpad.net/p/pyspark-x86


Tianhua huang <huangtianhua...@gmail.com> wrote on Thu, Sep 19, 2019 at 10:59 AM:

> @Dongjoon Hyun <dongjoon.h...@gmail.com> ,
>
> Sure, and I have updated the JIRA already :)
> https://issues.apache.org/jira/browse/SPARK-29106
> If anything is missing, please let me know, thank you.
>
> On Thu, Sep 19, 2019 at 12:44 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, Tianhua.
>>
>> Could you summarize the detail on the JIRA once more?
>> It will be very helpful for the community. Also, I've been waiting on
>> that JIRA. :)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <huangtianhua...@gmail.com>
>> wrote:
>>
>>> @shane knapp <skn...@berkeley.edu> thank you very much. I opened an
>>> issue for this, https://issues.apache.org/jira/browse/SPARK-29106, and we
>>> can talk about the details there :)
>>> We will prepare an ARM instance today and send the info to your
>>> email later.
>>>
>>> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <skn...@berkeley.edu> wrote:
>>>
>>>> @Tianhua huang <huangtianhua...@gmail.com> sure, i think we can get
>>>> something sorted for the short-term.
>>>>
>>>> all we need is ssh access (i can provide an ssh key), and i can then
>>>> have our jenkins master launch a remote worker on that instance.
>>>>
>>>> instance setup, etc, will be up to you.  my support for the time being
>>>> will be to create the job and 'best effort' for everything else.
>>>>
>>>> this should get us up and running asap.
>>>>
>>>> is there an open JIRA for jenkins/arm test support?  we can move the
>>>> technical details about this idea there.
>>>>
>>>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <
>>>> huangtianhua...@gmail.com> wrote:
>>>>
>>>>> @Sean Owen <sro...@gmail.com>, sorry for the late reply, we had the
>>>>> Mid-Autumn holiday :)
>>>>>
>>>>> If you would like to integrate ARM CI into amplab Jenkins, we can offer the
>>>>> ARM instance, and then the ARM job will run together with the other x86
>>>>> jobs. Is there a guideline for doing this? @shane knapp
>>>>> <skn...@berkeley.edu> would you help us?
>>>>>
>>>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> I don't know what's involved in actually accepting or operating those
>>>>>> machines, so can't comment there, but in the meantime it's good that you
>>>>>> are running these tests and can help report changes needed to keep it
>>>>>> working with ARM. I would continue with that for now.
>>>>>>
>>>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <
>>>>>> huangtianhua...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> For the whole work process of Spark ARM CI, we want to make two things
>>>>>>> clear.
>>>>>>>
>>>>>>> The first thing is:
>>>>>>> For Spark ARM CI we now have two periodic jobs: one job [1] based
>>>>>>> on commit [2] (which already fixed the replay test failures [3]; we
>>>>>>> made a new test branch based on the date 09-09-2019), and the other
>>>>>>> job [4] based on Spark master.
>>>>>>>
>>>>>>> The first job tests the specified branch to prove that our ARM
>>>>>>> CI is good and stable.
>>>>>>> The second job checks Spark master every day, so we can find out
>>>>>>> whether the latest commits affect the ARM CI. The build
>>>>>>> history and results show that some problems are easier to find on ARM,
>>>>>>> like SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>,
>>>>>>> and also that we make the effort to trace them and figure them
>>>>>>> out. So far we have found and fixed several problems [5][6][7], thanks
>>>>>>> to everyone in the community :). We believe that ARM CI is very
>>>>>>> necessary, right?
>>>>>>>
>>>>>>> The second thing is:
>>>>>>> We plan to run the jobs for a period of time, and you can see the
>>>>>>> results and logs in the 'build history' of the jobs console. If everything
>>>>>>> goes well for one or two weeks, could the community accept the ARM CI? Or
>>>>>>> how long should the periodic jobs run before the community has enough
>>>>>>> confidence to accept the ARM CI? As you suggested before, it would be good
>>>>>>> to integrate ARM CI into amplab Jenkins. We agree with that, and we can
>>>>>>> donate the ARM instances and then maintain the ARM-related test jobs
>>>>>>> together with the community. Any thoughts?
>>>>>>>
>>>>>>> Thank you all!
>>>>>>>
>>>>>>> [1]
>>>>>>> http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>>>>> [2]
>>>>>>> https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>>>>> [4]
>>>>>>> http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>>>>> [5] https://github.com/apache/spark/pull/25186
>>>>>>> [6] https://github.com/apache/spark/pull/25279
>>>>>>> [7] https://github.com/apache/spark/pull/25673
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, I think it's just local caching. After you run the build you
>>>>>>>> should find lots of stuff cached at ~/.m2/repository and it won't
>>>>>>>> download every time.
>>>>>>>>
>>>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <
>>>>>>>> bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sean,
>>>>>>>>> Thanks for the reply, and apologies for the confusion.
>>>>>>>>> I know the dependencies are downloaded by SBT or Maven. But
>>>>>>>>> the Spark QA job also executes "mvn clean package", so why doesn't the
>>>>>>>>> log [1] print anything about downloading jars from Maven Central, and
>>>>>>>>> why is the build so fast? Is the reason that Spark Jenkins builds the
>>>>>>>>> Spark jars on physical machines and doesn't destroy the test environment
>>>>>>>>> after a job finishes? Then other jobs that build Spark get the
>>>>>>>>> dependency jars from the local cache, because previous jobs executed
>>>>>>>>> "mvn package" and those dependencies were already downloaded to the
>>>>>>>>> local worker machine. Am I right? Is that the reason the job
>>>>>>>>> log [1] didn't print any download information from Maven Central?
>>>>>>>>>
>>>>>>>>> Thank you very much.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> ZhaoBo
>>>>>>>>>
>>>>>>>>> Sean Owen <sro...@gmail.com> wrote on Fri, Aug 16, 2019 at 10:38 AM:
>>>>>>>>>
>>>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by
>>>>>>>>>> SBT and Maven like in any other project, and nothing about it is
>>>>>>>>>> specific to Spark.
>>>>>>>>>> The worker machines cache artifacts that are downloaded from
>>>>>>>>>> these, but this is a function of Maven and SBT, not Spark. You may
>>>>>>>>>> find that the initial download takes a long time.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <
>>>>>>>>>> bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Sean,
>>>>>>>>>>>
>>>>>>>>>>> Thanks very much for pointing out the roadmap ;-). We will
>>>>>>>>>>> continue to focus on our test environment.
>>>>>>>>>>>
>>>>>>>>>>> As for the networking problems: we can access Maven
>>>>>>>>>>> Central, and our jobs can download the required jar packages at a
>>>>>>>>>>> high network speed. What we want to know is why the logs of the Spark
>>>>>>>>>>> QA test jobs [1] show that the job script/Maven build does not seem to
>>>>>>>>>>> download the jar packages. Could you tell us the reason? Thank you.
>>>>>>>>>>> We raised the "networking problems" because of something we noticed
>>>>>>>>>>> during testing: if we execute "mvn clean package" in a new test
>>>>>>>>>>> environment (in our test environment, we destroy the test VMs after
>>>>>>>>>>> the job finishes), Maven downloads the dependency jar packages from
>>>>>>>>>>> Maven Central. But in the log of the job
>>>>>>>>>>> "spark-master-test-maven-hadoop" [2], we didn't find it downloading
>>>>>>>>>>> any jar packages. What is the reason for that?
>>>>>>>>>>> Also, when we build the Spark jars and download the dependencies from
>>>>>>>>>>> Maven Central, it takes about 1 hour, while [2] takes just
>>>>>>>>>>> 10 min. But if we run "mvn package" in a VM that has already executed
>>>>>>>>>>> "mvn package" before, it takes just 14 min, which is very close to
>>>>>>>>>>> [2]. So we suspect that downloading the jar packages takes most of the
>>>>>>>>>>> time. For the goal of ARM CI, we expect the performance of the new ARM
>>>>>>>>>>> CI to be close to that of the existing x86 CI, so that users can
>>>>>>>>>>> accept it more easily.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>>>> [2]
>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>>>
>>>>>>>>>>> Best regards
>>>>>>>>>>>
>>>>>>>>>>> ZhaoBo
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sean Owen <sro...@gmail.com> wrote on Thu, Aug 15, 2019 at 9:58 PM:
>>>>>>>>>>>
>>>>>>>>>>>> I think the right goal is to fix the remaining issues first. If
>>>>>>>>>>>> we set up CI/CD it will only tell us there are still some test
>>>>>>>>>>>> failures. If it's stable, and not hard to add to the existing CI/CD,
>>>>>>>>>>>> yes it could be done automatically later. You can continue to test
>>>>>>>>>>>> on ARM independently for now.
>>>>>>>>>>>>
>>>>>>>>>>>> It sounds indeed like there are some networking problems in the
>>>>>>>>>>>> test system if you're not able to download from Maven Central.
>>>>>>>>>>>> That rarely takes significant time, and there aren't
>>>>>>>>>>>> project-specific mirrors here. You might be able to point at a
>>>>>>>>>>>> closer public mirror, depending on where you are.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <
>>>>>>>>>>>> huangtianhua...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I want to discuss Spark ARM CI again. We ran some tests on an
>>>>>>>>>>>>> ARM instance based on master; the job includes
>>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/13 and the k8s
>>>>>>>>>>>>> integration https://github.com/theopenlab/spark/pull/17/ .
>>>>>>>>>>>>> There are several things I want to talk about:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>>>>     1. We have fixed some problems, such as
>>>>>>>>>>>>> https://github.com/apache/spark/pull/25186 and
>>>>>>>>>>>>> https://github.com/apache/spark/pull/25279. Thanks to Sean Owen
>>>>>>>>>>>>> and others for helping us.
>>>>>>>>>>>>>     2. We tried the k8s integration test on ARM and hit an error:
>>>>>>>>>>>>> apk fetch hangs. The tests passed after adding the '--network host'
>>>>>>>>>>>>> option to the `docker build` command, see:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176
>>>>>>>>>>>>> The solution comes from
>>>>>>>>>>>>> https://github.com/gliderlabs/docker-alpine/issues/307 . I
>>>>>>>>>>>>> don't know whether this has ever happened in the community CI, or
>>>>>>>>>>>>> whether we should submit a PR to pass '--network host' to
>>>>>>>>>>>>> `docker build`?
>>>>>>>>>>>>>     3. We found that two tests fail after the commit
>>>>>>>>>>>>> https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>>>>        ReplayListenerSuite:
>>>>>>>>>>>>>        - ...
>>>>>>>>>>>>>        - End-to-end replay *** FAILED ***
>>>>>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>>        - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>>>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>>
>>>>>>>>>>>>>         When we reverted the commit, the tests passed. The patch
>>>>>>>>>>>>> is very big and, sorry, we have not found the reason so far. If
>>>>>>>>>>>>> you are interested, please try it; we would very much appreciate
>>>>>>>>>>>>> it if someone could help us figure it out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Second, about the test time: we increased the flavor of the ARM
>>>>>>>>>>>>> instance to 16U16G, but there seemed to be no significant
>>>>>>>>>>>>> improvement. The k8s integration test took about one and a half
>>>>>>>>>>>>> hours, and the QA test (like the
>>>>>>>>>>>>> spark-master-test-maven-hadoop-2.7 community Jenkins job) took
>>>>>>>>>>>>> about seventeen hours (it is too long :(). We suspect that the
>>>>>>>>>>>>> reasons are performance and network.
>>>>>>>>>>>>> When we split the jobs by project, such as sql, core and so
>>>>>>>>>>>>> on, the time can be decreased to about seven hours, see
>>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/19 . We looked at the
>>>>>>>>>>>>> Spark QA tests, like
>>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/ ,
>>>>>>>>>>>>> and it looks like none of the tests ever download the jar packages
>>>>>>>>>>>>> from the Maven Central repo (such as
>>>>>>>>>>>>> https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar).
>>>>>>>>>>>>> So we want to know how the Jenkins jobs do that. Is there an
>>>>>>>>>>>>> internal Maven repo running? Maybe we can do the same thing to
>>>>>>>>>>>>> avoid the network cost of downloading the dependency jar packages.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Third, and most important, is the ARM CI of Spark itself.
>>>>>>>>>>>>> We believe that it is necessary, right? You can see that we have
>>>>>>>>>>>>> really made a lot of effort, and the basic ARM build/test jobs are
>>>>>>>>>>>>> now OK. So we suggest adding the ARM jobs to the community CI. We
>>>>>>>>>>>>> can set them to non-voting at first and improve/enrich the jobs
>>>>>>>>>>>>> step by step. Generally, we have two ways in mind to integrate the
>>>>>>>>>>>>> ARM CI for Spark:
>>>>>>>>>>>>>      1) We introduce the openlab ARM CI into Spark as a custom CI
>>>>>>>>>>>>> system. We provide human resources and ARM test VMs, and we will
>>>>>>>>>>>>> also focus on the ARM-related issues in Spark. We will push the
>>>>>>>>>>>>> PRs to the community.
>>>>>>>>>>>>>      2) We donate ARM VM resources to the existing amplab
>>>>>>>>>>>>> Jenkins. We still provide human resources, focus on the
>>>>>>>>>>>>> ARM-related issues in Spark, and push the PRs to the community.
>>>>>>>>>>>>> With either option, we will provide human resources for
>>>>>>>>>>>>> maintenance. Of course it will be great if we can work together.
>>>>>>>>>>>>> So please tell us which option you prefer, and let's move forward.
>>>>>>>>>>>>> Waiting for your reply, thank you very much.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
