Thanks Jeff. Problem solved by installing the R packages into /usr/lib64/R/library (the default lib path) on each data node. Your clue helped!
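For anyone who lands on this thread later, here is a rough sketch of the kind of spot check that can confirm the install. This is not code from the thread; it assumes Rscript is on the PATH of every node and that sc is available in the %livy2.pyspark paragraph as in the snippets below:

%livy2.pyspark
# Sketch only: spot-check that the 'changepoint' package resolves from the
# default R library (/usr/lib64/R/library) on the nodes that run tasks.
# Assumes Rscript is on the PATH of each node.
import subprocess

def has_changepoint(_):
    cmd = "Rscript -e 'cat(\"changepoint\" %in% rownames(installed.packages()))'"
    return subprocess.getoutput(cmd)

# Use several partitions so the check lands on multiple executors.
print(sc.parallelize(range(8), 8).map(has_changepoint).collect())

Each task should print TRUE once the package is in the default lib path on that node.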
On Wed, Aug 29, 2018 at 7:40 PM Jeff Zhang <zjf...@gmail.com> wrote:

> I am not sure what's wrong. Maybe you can ssh to that machine and run the
> R script manually first to verify what's wrong.
>
> Lian Jiang <jiangok2...@gmail.com> wrote on Thu, Aug 30, 2018 at 10:34 AM:
>
>> Jeff,
>>
>> R is installed on the namenode and all data nodes. The R packages have been
>> copied to them all too. I am not sure whether an R script launched by pyspark's
>> subprocess can access the spark context or not. If not, using addFile to add
>> R packages into the spark context will not help test.r install the packages.
>> Thanks for the clue.
>>
>> On Wed, Aug 29, 2018 at 7:24 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> You need to make sure the spark driver machine has this package
>>> installed. And since you are using yarn-cluster mode via livy, you have to
>>> install these packages on all nodes because the spark driver could be
>>> launched on any node of the cluster.
>>>
>>> Lian Jiang <jiangok2...@gmail.com> wrote on Thu, Aug 30, 2018 at 1:46 AM:
>>>
>>>> After calling a sample R script, we found another issue when running a
>>>> real R script. This R script failed to load the changepoint library.
>>>>
>>>> I tried:
>>>>
>>>> %livy2.sparkr
>>>> install.packages("changepoint", repos="file:///mnt/data/tmp/r")
>>>> library(changepoint) // I see "Successfully loaded changepoint package version 2.2.2"
>>>>
>>>> %livy2.pyspark
>>>> from pyspark import SparkFiles
>>>> import subprocess
>>>>
>>>> sc.addFile("hdfs:///user/zeppelin/test.r")
>>>> testpath = SparkFiles.get('test.r')
>>>> stdoutdata = subprocess.getoutput("Rscript " + testpath)
>>>> print(stdoutdata)
>>>>
>>>> The error: Error in library(changepoint) : there is no package called 'changepoint'
>>>>
>>>> test.r is simply:
>>>>
>>>> library(changepoint)
>>>>
>>>> Any idea how to make changepoint available for the R script? Thanks.
>>>>
>>>> On Tue, Aug 28, 2018 at 10:07 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Jeff.
>>>>>
>>>>> This worked:
>>>>>
>>>>> %livy2.pyspark
>>>>> from pyspark import SparkFiles
>>>>> import subprocess
>>>>>
>>>>> sc.addFile("hdfs:///user/zeppelin/ocic/test.r")
>>>>> testpath = SparkFiles.get('test.r')
>>>>> stdoutdata = subprocess.getoutput("Rscript " + testpath)
>>>>> print(stdoutdata)
>>>>>
>>>>> Cheers!
>>>>>
>>>>> On Tue, Aug 28, 2018 at 6:09 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>>> Do you run it under yarn-cluster mode? Then you must ensure your
>>>>>> R script is shipped to the driver (via sc.addFile or setting livy.spark.files).
>>>>>>
>>>>>> And you also need to make sure R is installed on all hosts of the
>>>>>> yarn cluster, because the driver may run on any node of the cluster.
>>>>>>
>>>>>> Lian Jiang <jiangok2...@gmail.com> wrote on Wed, Aug 29, 2018 at 1:35 AM:
>>>>>>
>>>>>>> Thanks Lucas. We tried and got the same error. Below is the code:
>>>>>>>
>>>>>>> %livy2.pyspark
>>>>>>> import subprocess
>>>>>>> sc.addFile("hdfs:///user/zeppelin/test.r")
>>>>>>> stdoutdata = subprocess.getoutput("Rscript test.r")
>>>>>>> print(stdoutdata)
>>>>>>>
>>>>>>> Fatal error: cannot open file 'test.r': No such file or directory
>>>>>>>
>>>>>>> sc.addFile adds test.r to the spark context. However, subprocess does
>>>>>>> not use the spark context.
>>>>>>>
>>>>>>> An hdfs path does not work either: subprocess.getoutput("Rscript hdfs:///user/zeppelin/test.r")
>>>>>>>
>>>>>>> Any idea how to make Python call an R script? Appreciate it!
>>>>>>> On Tue, Aug 28, 2018 at 1:13 AM Partridge, Lucas (GE Aviation) <lucas.partri...@ge.com> wrote:
>>>>>>>
>>>>>>>> Have you tried SparkContext.addFile() (not addPyFile()) to add your
>>>>>>>> R script?
>>>>>>>>
>>>>>>>> https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.SparkContext.addFile
>>>>>>>>
>>>>>>>> From: Lian Jiang <jiangok2...@gmail.com>
>>>>>>>> Sent: 27 August 2018 22:42
>>>>>>>> To: users@zeppelin.apache.org
>>>>>>>> Subject: EXT: Python script calls R script in Zeppelin on Hadoop
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are using HDP 3.0 (with Zeppelin 0.8.0) and are migrating Jupyter
>>>>>>>> notebooks to Zeppelin. One issue we came across is that a Python
>>>>>>>> script calling an R script does not work in Zeppelin.
>>>>>>>>
>>>>>>>> %livy2.pyspark
>>>>>>>> import os
>>>>>>>> sc.addPyFile("hdfs:///user/zeppelin/my.py")
>>>>>>>> import my
>>>>>>>> my.test()
>>>>>>>>
>>>>>>>> my.test() calls the R script like: ['Rscript', 'myR.r']
>>>>>>>>
>>>>>>>> Fatal error: cannot open file 'myR.r': No such file or directory
>>>>>>>>
>>>>>>>> When running this notebook in Jupyter, both my.py and myR.r exist
>>>>>>>> in the same folder. I understand the story changes on Hadoop because
>>>>>>>> the scripts run in containers.
>>>>>>>>
>>>>>>>> My question:
>>>>>>>> Is this scenario supported in Zeppelin? How can I add an R script into
>>>>>>>> a Python spark context so that the Python script can find the R script?
>>>>>>>> Appreciate it!