Hi guys,

We have recently been testing PySpark on ARM and found some issues that we cannot explain. Could you please have a look when you have time? Thanks.
There are two issues:

1. The first one looks like an ARM performance issue: the job started by a PySpark test has not fully finished by the time the assert check runs. After we changed the source code in our local environment, the tests pass. We opened a JIRA issue for this [1]. If you have time, please help with it. Thanks.

2. The second one looks like a Spark internal issue. When we run "pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_train_prediction", it fails in the "condition" function. We dug into it and found that the predicted value is still [0. 0. .....0.], even though we wait for a long time in the ARM testing environment. I think that is the main cause, but we failed to find out which step goes wrong. Could you please help us figure it out? I have uploaded the test logs produced after I inserted some 'printf' output into the 'func' function of the test case. I ran it on both ARM and x86: the ARM log is [2] and the x86 log is [3]. The two testing environments are identical except for the architecture.

Thanks. If you have time, please help us.

Best Regards

[1] https://issues.apache.org/jira/browse/SPARK-29205
[2] https://etherpad.net/p/pyspark-arm
[3] https://etherpad.net/p/pyspark-x86

On Thu, Sep 19, 2019 at 10:59 AM Tianhua huang <huangtianhua...@gmail.com> wrote:

> @Dongjoon Hyun <dongjoon.h...@gmail.com>,
>
> Sure, and I have updated the JIRA already :)
> https://issues.apache.org/jira/browse/SPARK-29106
> If anything is missing, please let me know, thank you.
>
> On Thu, Sep 19, 2019 at 12:44 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>
>> Hi, Tianhua.
>>
>> Could you summarize the details on the JIRA once more?
>> It will be very helpful for the community. Also, I've been waiting on that JIRA. :)
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Sep 16, 2019 at 11:48 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>
>>> @shane knapp <skn...@berkeley.edu> thank you very much. I opened an issue for this, https://issues.apache.org/jira/browse/SPARK-29106 , and we can talk through the details there :)
>>> We will prepare an ARM instance today and send the info to your email later.
>>>
>>> On Tue, Sep 17, 2019 at 4:40 AM Shane Knapp <skn...@berkeley.edu> wrote:
>>>
>>>> @Tianhua huang <huangtianhua...@gmail.com> sure, i think we can get something sorted for the short-term.
>>>>
>>>> all we need is ssh access (i can provide an ssh key), and i can then have our jenkins master launch a remote worker on that instance.
>>>>
>>>> instance setup, etc, will be up to you. my support for the time being will be to create the job and 'best effort' for everything else.
>>>>
>>>> this should get us up and running asap.
>>>>
>>>> is there an open JIRA for jenkins/arm test support? we can move the technical details about this idea there.
>>>>
>>>> On Sun, Sep 15, 2019 at 9:03 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>
>>>>> @Sean Owen <sro...@gmail.com>, sorry for the late reply, we had the Mid-Autumn holiday :)
>>>>>
>>>>> If you would like to integrate ARM CI into the amplab Jenkins, we can offer the ARM instance, and the ARM job will then run together with the other x86 jobs. Is there a guideline for doing this? @shane knapp <skn...@berkeley.edu> would you help us?
>>>>>
>>>>> On Thu, Sep 12, 2019 at 9:36 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> I don't know what's involved in actually accepting or operating those machines, so can't comment there, but in the meantime it's good that you are running these tests and can help report changes needed to keep it working with ARM. I would continue with that for now.
>>>>>>
>>>>>> On Wed, Sep 11, 2019 at 10:06 PM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Regarding the overall process for Spark ARM CI, we want to make two things clear.
>>>>>>>
>>>>>>> The first thing is:
>>>>>>> For Spark ARM CI we now have two periodic jobs. One job [1] is based on commit [2] (which already includes the fix for the failing replay tests [3]; we made a new test branch from 09-09-2019), and the other job [4] is based on Spark master.
>>>>>>>
>>>>>>> The first job tests the specified branch to prove that our ARM CI is good and stable.
>>>>>>> The second job checks Spark master every day, so we can find out whether the latest commits affect the ARM CI. The build history and results show that some problems are easier to find on ARM, like SPARK-28770 <https://issues.apache.org/jira/browse/SPARK-28770>, and also that we make the effort to trace them and figure them out. So far we have found and fixed several problems [5][6][7], thanks to everyone in the community :). And we believe that ARM CI is very necessary, right?
>>>>>>>
>>>>>>> The second thing is:
>>>>>>> We plan to run the jobs for a period of time, and you can see the results and logs in the 'build history' of the jobs console. If everything goes well for one or two weeks, could the community accept the ARM CI? Or how long should the periodic jobs run before the community has enough confidence to accept the ARM CI? As you suggested before, it would be good to integrate ARM CI into the amplab Jenkins. We agree with that, and we can donate the ARM instances and then maintain the ARM-related test jobs together with the community. Any thoughts?
>>>>>>>
>>>>>>> Thank you all!
>>>>>>>
>>>>>>> [1] http://status.openlabtesting.org/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64
>>>>>>> [2] https://github.com/apache/spark/commit/0ed9fae45769d4b06b8cf8128f462f09ff3d9a72
>>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-28770
>>>>>>> [4] http://status.openlabtesting.org/builds?job_name=spark-master-unit-test-hadoop-2.7-arm64
>>>>>>> [5] https://github.com/apache/spark/pull/25186
>>>>>>> [6] https://github.com/apache/spark/pull/25279
>>>>>>> [7] https://github.com/apache/spark/pull/25673
>>>>>>>
>>>>>>> On Fri, Aug 16, 2019 at 11:24 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, I think it's just local caching. After you run the build you should find lots of stuff cached at ~/.m2/repository and it won't download every time.
>>>>>>>>
>>>>>>>> On Fri, Aug 16, 2019 at 3:01 AM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sean,
>>>>>>>>> Thanks for the reply, and apologies for the confusion.
>>>>>>>>> I know the dependencies are downloaded by SBT or Maven. But the Spark QA job also runs "mvn clean package", so why doesn't its log [1] print anything about downloading jars from Maven Central, and why is the build so fast? Is the reason that Spark Jenkins builds the Spark jars on physical machines and does not destroy the test environment after a job finishes? Then other jobs that build Spark get the dependency jars from the local cache, because previous jobs ran "mvn package" and those dependencies were already downloaded to the local worker machine. Am I right? Is that why the job log [1] didn't print any download information from Maven Central?
>>>>>>>>>
>>>>>>>>> Thank you very much.
>>>>>>>>>
>>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> ZhaoBo
>>>>>>>>>
>>>>>>>>> On Fri, Aug 16, 2019 at 10:38 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm not sure what you mean. The dependencies are downloaded by SBT and Maven like in any other project, and nothing about it is specific to Spark.
>>>>>>>>>> The worker machines cache artifacts that are downloaded from these, but this is a function of Maven and SBT, not Spark. You may find that the initial download takes a long time.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Sean,
>>>>>>>>>>>
>>>>>>>>>>> Thanks very much for pointing out the roadmap. ;-) Then I think we will continue to focus on our test environment.
>>>>>>>>>>>
>>>>>>>>>>> As for the networking problems: we can access Maven Central, and our jobs can download the required jar packages at a high network speed. What we want to know is why the logs of the Spark QA test jobs [1] show that the job script / Maven build does not seem to download the jar packages. Could you tell us the reason for that? Thank you.
>>>>>>>>>>> The reason we raised the "networking problems" is something we observed during testing. If we execute "mvn clean package" in a new test environment (in our test environment, we destroy the test VMs after the job finishes), Maven downloads the dependency jar packages from Maven Central. But in the job "spark-master-test-maven-hadoop" [2], we did not find any jar downloads in the log. What is the reason for that?
>>>>>>>>>>> Also, when we build the Spark jars with the dependencies downloaded from Maven Central, it takes almost 1 hour, while [2] takes just 10 min. But if we run "mvn package" in a VM that has already run "mvn package" before, it takes just 14 min, which is very close to [2]. So we suspect that downloading the jar packages accounts for most of the time. For the goal of ARM CI, we expect the performance of the new ARM CI to be close to that of the existing x86 CI, so that users can accept it more easily.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
>>>>>>>>>>> [2] https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.6-ubuntu-testing/lastBuild/consoleFull
>>>>>>>>>>>
>>>>>>>>>>> Best regards
>>>>>>>>>>>
>>>>>>>>>>> ZhaoBo
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 15, 2019 at 9:58 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think the right goal is to fix the remaining issues first. If we set up CI/CD it will only tell us there are still some test failures. If it's stable, and not hard to add to the existing CI/CD, yes it could be done automatically later. You can continue to test on ARM independently for now.
>>>>>>>>>>>>
>>>>>>>>>>>> It sounds indeed like there are some networking problems in the test system if you're not able to download from Maven Central. That rarely takes significant time, and there aren't project-specific mirrors here. You might be able to point at a closer public mirror, depending on where you are.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <huangtianhua...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I want to discuss Spark ARM CI again. We ran some tests on an ARM instance based on master; the job includes https://github.com/theopenlab/spark/pull/13 and the k8s integration https://github.com/theopenlab/spark/pull/17/ . There are several things I want to talk about:
>>>>>>>>>>>>>
>>>>>>>>>>>>> First, about the failed tests:
>>>>>>>>>>>>> 1. We have fixed some problems, like https://github.com/apache/spark/pull/25186 and https://github.com/apache/spark/pull/25279 . Thanks to Sean Owen and others for helping us.
>>>>>>>>>>>>> 2. We tried the k8s integration test on ARM and hit an error: apk fetch hangs. The tests passed after adding the '--network host' option to the `docker build` command, see:
>>>>>>>>>>>>> https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176
>>>>>>>>>>>>> The solution follows https://github.com/gliderlabs/docker-alpine/issues/307 . I don't know whether this has ever happened in the community CI, or maybe we should submit a PR to pass '--network host' to `docker build`?
>>>>>>>>>>>>> 3. We found that two tests fail after the commit https://github.com/apache/spark/pull/23767 :
>>>>>>>>>>>>> ReplayListenerSuite:
>>>>>>>>>>>>> - ...
>>>>>>>>>>>>> - End-to-end replay *** FAILED ***
>>>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>> - End-to-end replay with compression *** FAILED ***
>>>>>>>>>>>>>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>>>>>>>>>>>>>
>>>>>>>>>>>>> We tried reverting the commit and the tests then passed. The patch is too big, and we are sorry that we have not found the reason so far. If you are interested, please try it; we would very much appreciate it if someone could help us figure it out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Second, about the test time: we increased the flavor of the ARM instance to 16U16G, but there seemed to be no significant improvement. The k8s integration test took about one and a half hours, and the QA test (like the spark-master-test-maven-hadoop-2.7 community Jenkins job) took about seventeen hours (it is too long :(). We suspect that the reasons are performance and network.
>>>>>>>>>>>>> When we split the jobs by project, such as sql, core and so on, the time can be reduced to about seven hours, see https://github.com/theopenlab/spark/pull/19 . We looked at the Spark QA tests, like https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/ , and it looks like none of the tests ever download jar packages from the Maven Central repo (such as https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar). So we want to know how the Jenkins jobs do that. Is there an internal Maven repo running? Maybe we can do the same thing to avoid the network cost of downloading the dependency jar packages.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Third, and most important, is the ARM CI for Spark itself. We believe that it is necessary, right? You can see that we have put in a lot of effort, and the basic ARM build/test jobs are now OK, so we suggest adding the ARM jobs to the community CI. We can set them to non-voting at first and improve/enrich the jobs step by step. Generally, we have two ways in mind to integrate the ARM CI for Spark:
>>>>>>>>>>>>> 1) We introduce the openlab ARM CI into Spark as a custom CI system. We provide human resources and test ARM VMs, and we will focus on the ARM-related issues in Spark. We will push the PRs to the community.
>>>>>>>>>>>>> 2) We donate ARM VM resources to the existing amplab Jenkins. We still provide human resources, focus on the ARM-related issues in Spark and push the PRs to the community.
>>>>>>>>>>>>> With both options, we will provide human resources for maintenance; of course it will be great if we can work together. So please tell us which option you would prefer, and let's move forward. Waiting for your reply, thank you very much.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
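
On the question from the August messages in this thread (why the Jenkins logs show no downloads from Maven Central): Sean's answer was that the workers reuse Maven's local cache under ~/.m2/repository. Below is a minimal sketch of how one might check that on a worker, assuming the default local repository location and the standard Maven directory layout (groupId split on dots, then artifactId, then version). The helper names `local_repo_path` and `is_cached` are hypothetical and only for illustration; the okapi-api coordinates are taken from the URL quoted in the thread.

```python
from pathlib import Path


def local_repo_path(group_id, artifact_id, version, packaging="jar", repo_root=None):
    """Return the path where a Maven local repository would store an artifact.

    Assumption: standard Maven layout, with ~/.m2/repository as the default root.
    """
    if repo_root is not None:
        root = Path(repo_root)
    else:
        root = Path.home() / ".m2" / "repository"
    # Dots in the groupId become directory separators; the artifactId is kept whole.
    group_dirs = group_id.split(".")
    file_name = f"{artifact_id}-{version}.{packaging}"
    return root.joinpath(*group_dirs, artifact_id, version, file_name)


def is_cached(group_id, artifact_id, version, **kwargs):
    """True if the artifact file already exists in the local repository."""
    return local_repo_path(group_id, artifact_id, version, **kwargs).is_file()


if __name__ == "__main__":
    # Coordinates of the jar mentioned earlier in the thread.
    path = local_repo_path("org.opencypher", "okapi-api", "0.4.2")
    print(path, "cached" if path.is_file() else "not cached")
```

If the file is already present on a long-lived worker, a later "mvn package" should not need to fetch it again, which would be consistent with the 14 min vs. 1 hour difference reported above. The repository location can be overridden (for example with -Dmaven.repo.local), in which case `repo_root` would have to point there.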