OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html
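If I'm reading that thread right, the problem is that zipimport can't handle archives with more than 65535 entries, since those force the ZIP64 format. For reference, counting the entries in the jar is straightforward with zipfile (the jar path below is just a placeholder):

"""
import zipfile

# Placeholder path -- point this at the actual Spark assembly jar.
jar_path = "/path/to/assembly/jar"

with zipfile.ZipFile(jar_path) as jar:
    n = len(jar.namelist())

print("entries in jar:", n)
# More than 65535 entries forces ZIP64, which zipimport (Python 2.7) can't read.
print("needs ZIP64:", n > 65535)
"""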
And my jar file has 70011 files. Fantastic...

On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> I asked several people; no one seems to believe that we can do this:
>
> $ PYTHONPATH=/path/to/assembly/jar python
> >>> import pyspark
>
> The following pull request did mention something about generating a zip file for all python-related modules:
> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
>
> I've tested that zipped modules can at least be imported via zipimport.
>
> Any ideas?
>
> -Simon
>
> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
>
>> Hi Simon,
>>
>> You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try:
>>
>> 1) Run a simple pyspark shell with yarn-client and do "sc.parallelize(range(10)).count()" to see if you get the same error.
>>
>> 2) If so, check whether your assembly jar is compiled correctly. Run
>>
>> $ jar -tf <path/to/assembly/jar> pyspark
>> $ jar -tf <path/to/assembly/jar> py4j
>>
>> to see if the files are there. For Py4j, you need both the Python files and the Java class files.
>>
>> 3) If the files are there, try running a simple Python shell (not the pyspark shell) with the assembly jar on the PYTHONPATH:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> 4) If that works, try it on every worker node. If it doesn't work, there is probably something wrong with your jar.
>>
>> There is a known issue for PySpark on YARN: jars built with Java 7 cannot be properly opened by Java 6. I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>
>> $ cd /path/to/spark/home
>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
>>
>> 5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug an application running on YARN in general. In my experience, the steps outlined there are quite useful.
>>
>> Let me know if you get it working (or not).
>>
>> Cheers,
>> Andrew
>>
>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>>
>>> Hi folks,
>>>
>>> I have a weird problem when using pyspark with YARN. I started ipython as follows:
>>>
>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
>>>
>>> When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my client machine on port 4040.
>>>
>>> I have the following simple script:
>>>
>>> """
>>> import pyspark
>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>> oneday = data.map(lambda line: line.split(",")).\
>>>     map(lambda f: (f[0], float(f[1]))).\
>>>     filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>>>     map(lambda t: (parser.parse(t[0]), t[1]))
>>> oneday.take(1)
>>> """
>>>
>>> When I execute this, it is my client machine (where ipython is launched) that reads all the data from HDFS and produces the result of take(1), rather than my worker nodes...
>>>
>>> When I do "data.count()", things blow up altogether.
>>> But I do see in the error message something like this:
>>>
>>> """
>>> Error from python worker:
>>>     /usr/bin/python: No module named pyspark
>>> """
>>>
>>> Am I supposed to install pyspark on every worker node?
>>>
>>> Thanks.
>>> -Simon
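For completeness, the kind of zipimport check mentioned above can be done with something along these lines (a minimal sketch; the jar path is a placeholder):

"""
import zipimport

# Placeholder path -- substitute the actual assembly jar.
importer = zipimport.zipimporter("/path/to/assembly/jar")

# Load the pyspark package (pyspark/__init__.py) straight from the jar.
pyspark = importer.load_module("pyspark")
print(pyspark.__file__)
"""

This only exercises the Python side, of course; per Andrew's point above, the executors also need the Py4j Java class files from the same jar.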