Re: Python script calls R script in Zeppelin on Hadoop

Jeff Zhang Wed, 29 Aug 2018 19:40:21 -0700

I am not sure what's wrong. Maybe you can ssh to that machine and run the
R script manually first to see what fails.
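
A minimal sketch of such a check, assuming Python 3.7+ and that it is run
either over ssh on the node or in a %livy2.pyspark paragraph. The helper name
r_diagnostics and its result keys are made up for illustration. It reports
which host and user the process runs as, whether Rscript is on PATH, which
library directories R searches, and whether the package can be loaded:

```python
import getpass
import shutil
import socket
import subprocess


def r_diagnostics(package="changepoint", rscript="Rscript"):
    """Collect facts about the R environment visible to this Python process."""
    try:
        user = getpass.getuser()
    except Exception:
        user = "unknown"  # e.g. a container uid without a passwd entry
    info = {
        "host": socket.gethostname(),
        "user": user,
        "rscript": shutil.which(rscript),  # None if Rscript is not on PATH
        "lib_paths": [],
        "has_package": None,
    }
    if info["rscript"] is None:
        return info

    # .libPaths() lists the directories R searches for installed packages.
    paths = subprocess.run(
        [info["rscript"], "-e", 'cat(.libPaths(), sep = "\\n")'],
        capture_output=True, text=True,
    )
    info["lib_paths"] = paths.stdout.splitlines()

    # requireNamespace() returns TRUE/FALSE instead of raising an error.
    check = subprocess.run(
        [info["rscript"], "-e",
         'cat(requireNamespace("%s", quietly = TRUE))' % package],
        capture_output=True, text=True,
    )
    info["has_package"] = check.stdout.strip() == "TRUE"
    return info


if __name__ == "__main__":
    for key, value in r_diagnostics().items():
        print(key, "=", value)
```

Comparing the output from the ssh session with the output from the notebook
should show whether both are looking at the same R library directories.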




Lian Jiang <[email protected]> wrote on Thu, Aug 30, 2018 at 10:34 AM:

> Jeff,
>
> R is installed on the namenode and all data nodes, and the R packages have
> been copied to all of them too. I am not sure whether an R script launched
> by pyspark's subprocess can access the Spark context. If not, using addFile
> to add R packages to the Spark context will not help test.r find the
> packages. Thanks for the clue.
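
On the question above: a process started with subprocess is an ordinary OS
process and knows nothing about the Spark context, so R only looks in its own
library paths. One possible workaround, sketched here under the assumption
that the packages are already installed into a directory that exists on every
node (the path /mnt/data/rlibs and the helper names are hypothetical), is to
point R at that directory through the R_LIBS_USER environment variable:

```python
import os
import subprocess


def rscript_command(script_path, r_lib_dir, rscript="Rscript"):
    """Build argv and environment for running an R script with an extra library dir."""
    env = dict(os.environ)
    # R reads R_LIBS_USER at startup and adds the listed directories
    # (those that exist) to its library search path.
    existing = env.get("R_LIBS_USER")
    if existing:
        env["R_LIBS_USER"] = r_lib_dir + os.pathsep + existing
    else:
        env["R_LIBS_USER"] = r_lib_dir
    return [rscript, script_path], env


def run_r_script(script_path, r_lib_dir):
    """Run the script and return (exit code, combined stdout and stderr)."""
    argv, env = rscript_command(script_path, r_lib_dir)
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr
```

In the notebook this would be called as
run_r_script(SparkFiles.get("test.r"), "/mnt/data/rlibs"). The directory could
be filled beforehand on each node with
install.packages("changepoint", lib="/mnt/data/rlibs", repos="file:///mnt/data/tmp/r").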
>
>
>
> On Wed, Aug 29, 2018 at 7:24 PM Jeff Zhang <[email protected]> wrote:
>
>>
>> You need to make sure the Spark driver machine has this package
>> installed. And since you are using yarn-cluster mode via Livy, you have to
>> install this package on all nodes, because the Spark driver could be
>> launched on any node of the cluster.
>>
>>
>>
>> Lian Jiang <[email protected]> wrote on Thu, Aug 30, 2018 at 1:46 AM:
>>
>>> After calling a sample R script, we found another issue when running a
>>> real R script: it failed to load the changepoint library.
>>>
>>> I tried:
>>>
>>> %livy2.sparkr
>>> install.packages("changepoint", repos="file:///mnt/data/tmp/r")
>>> library(changepoint)  # I see "Successfully loaded changepoint package version 2.2.2"
>>>
>>> %livy2.pyspark
>>> from pyspark import SparkFiles
>>> import subprocess
>>>
>>> sc.addFile("hdfs:///user/zeppelin/test.r")
>>> testpath = SparkFiles.get('test.r')
>>> stdoutdata = subprocess.getoutput("Rscript " + testpath)
>>> print(stdoutdata)
>>>
>>> The error: Error in library(changepoint) : there is no package called
>>> ‘changepoint’
>>>
>>> test.r is simply:
>>>
>>> library(changepoint)
>>>
>>> Any idea how to make changepoint available for the R script? Thanks.
>>>
>>>
>>>
>>> On Tue, Aug 28, 2018 at 10:07 PM Lian Jiang <[email protected]>
>>> wrote:
>>>
>>>> Thanks Jeff.
>>>>
>>>> This worked:
>>>>
>>>> %livy2.pyspark
>>>> from pyspark import SparkFiles
>>>> import subprocess
>>>>
>>>> sc.addFile("hdfs:///user/zeppelin/ocic/test.r")
>>>> testpath = SparkFiles.get('test.r')
>>>> stdoutdata = subprocess.getoutput("Rscript " + testpath)
>>>> print(stdoutdata)
>>>>
>>>> Cheers!
>>>>
>>>> On Tue, Aug 28, 2018 at 6:09 PM Jeff Zhang <[email protected]> wrote:
>>>>
>>>>> Do you run it in yarn-cluster mode? Then you must ensure your R script
>>>>> is shipped to the driver (via sc.addFile or by setting livy.spark.files).
>>>>>
>>>>> You also need to make sure R is installed on all hosts of the YARN
>>>>> cluster, because the driver may run on any node of the cluster.
>>>>>
>>>>>
>>>>>
>>>>> Lian Jiang <[email protected]> wrote on Wed, Aug 29, 2018 at 1:35 AM:
>>>>>
>>>>>> Thanks Lucas. We tried and got the same error. Below is the code:
>>>>>>
>>>>>> %livy2.pyspark
>>>>>> import subprocess
>>>>>> sc.addFile("hdfs:///user/zeppelin/test.r")
>>>>>> stdoutdata = subprocess.getoutput("Rscript test.r")
>>>>>> print(stdoutdata)
>>>>>>
>>>>>> Fatal error: cannot open file 'test.r': No such file or directory
>>>>>>
>>>>>>
>>>>>> sc.addFile adds test.r to the Spark context. However, subprocess does
>>>>>> not use the Spark context.
>>>>>>
>>>>>> An HDFS path does not work either:
>>>>>> subprocess.getoutput("Rscript hdfs:///user/zeppelin/test.r")
>>>>>>
>>>>>> Any idea how to make Python call an R script? Much appreciated!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 1:13 AM Partridge, Lucas (GE Aviation) <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Have you tried SparkContext.addFile() (not addPyFile()) to add your
>>>>>>> R script?
>>>>>>>
>>>>>>>
>>>>>>> https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.SparkContext.addFile
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Lian Jiang <[email protected]>
>>>>>>> *Sent:* 27 August 2018 22:42
>>>>>>> *To:* [email protected]
>>>>>>> *Subject:* EXT: Python script calls R script in Zeppelin on Hadoop
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We are using HDP 3.0 (with Zeppelin 0.8.0) and are migrating Jupyter
>>>>>>> notebooks to Zeppelin. One issue we came across is that a Python script
>>>>>>> calling an R script does not work in Zeppelin.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> %livy2.pyspark
>>>>>>>
>>>>>>> import os
>>>>>>>
>>>>>>> sc.addPyFile("hdfs:///user/zeppelin/my.py")
>>>>>>>
>>>>>>> import my
>>>>>>>
>>>>>>> my.test()
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> my.test() calls R script like: ['Rscript', 'myR.r']
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Fatal error: cannot open file 'myR.r': No such file or directory
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> When running this notebook in jupyter, both my.py and myR.r exist in
>>>>>>> the same folder. I understand the story changes on hadoop because the
>>>>>>> scripts run in containers.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My question:
>>>>>>>
>>>>>>> Is this scenario supported in Zeppelin? How can we add an R script to a
>>>>>>> Python Spark context so that the Python script can find the R script?
>>>>>>> Much appreciated!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
