OK, my colleague found this: https://mail.python.org/pipermail/python-list/2014-May/671353.html
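If I'm reading that thread right, the problem is that zipimport can't handle archives with more than 65535 entries, since those force the ZIP64 format. For reference, counting the entries in the jar is straightforward with zipfile (the jar path below is just a placeholder):

"""
import zipfile

# Placeholder path -- point this at the actual Spark assembly jar.
jar_path = "/path/to/assembly/jar"

with zipfile.ZipFile(jar_path) as jar:
    n = len(jar.namelist())

print("entries in jar:", n)
# More than 65535 entries forces ZIP64, which zipimport (Python 2.7) can't read.
print("needs ZIP64:", n > 65535)
"""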
And my jar file has 70011 files. Fantastic...

On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> I asked several people; no one seems to believe that we can do this:
>
> $ PYTHONPATH=/path/to/assembly/jar python
> >>> import pyspark
>
> The following pull request did mention something about generating a zip file for all python-related modules:
> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
>
> I've tested that zipped modules can at least be imported via zipimport.
>
> Any ideas?
>
> -Simon
>
> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
>
>> Hi Simon,
>>
>> You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try:
>>
>> 1) Run a simple pyspark shell with yarn-client and do "sc.parallelize(range(10)).count()" to see if you get the same error.
>>
>> 2) If so, check whether your assembly jar is compiled correctly. Run
>>
>> $ jar -tf <path/to/assembly/jar> pyspark
>> $ jar -tf <path/to/assembly/jar> py4j
>>
>> to see if the files are there. For Py4j, you need both the Python files and the Java class files.
>>
>> 3) If the files are there, try running a simple Python shell (not the pyspark shell) with the assembly jar on the PYTHONPATH:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> 4) If that works, try it on every worker node. If it doesn't work, there is probably something wrong with your jar.
>>
>> There is a known issue for PySpark on YARN: jars built with Java 7 cannot be properly opened by Java 6. I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>
>> $ cd /path/to/spark/home
>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
>>
>> 5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug an application running on YARN in general. In my experience, the steps outlined there are quite useful.
>>
>> Let me know if you get it working (or not).
>>
>> Cheers,
>> Andrew
>>
>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>>
>>> Hi folks,
>>>
>>> I have a weird problem when using pyspark with YARN. I started ipython as follows:
>>>
>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
>>>
>>> When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my client machine on port 4040.
>>>
>>> I have the following simple script:
>>>
>>> """
>>> import pyspark
>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>> oneday = data.map(lambda line: line.split(",")).\
>>>     map(lambda f: (f[0], float(f[1]))).\
>>>     filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>>>     map(lambda t: (parser.parse(t[0]), t[1]))
>>> oneday.take(1)
>>> """
>>>
>>> When I execute this, it is my client machine (where ipython is launched) that reads all the data from HDFS and produces the result of take(1), rather than my worker nodes...
>>>
>>> When I do "data.count()", things blow up altogether.
>>> But I do see in the error message something like this:
>>>
>>> """
>>> Error from python worker:
>>>     /usr/bin/python: No module named pyspark
>>> """
>>>
>>> Am I supposed to install pyspark on every worker node?
>>>
>>> Thanks.
>>> -Simon
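For completeness, the kind of zipimport check mentioned above can be done with something along these lines (a minimal sketch; the jar path is a placeholder):

"""
import zipimport

# Placeholder path -- substitute the actual assembly jar.
importer = zipimport.zipimporter("/path/to/assembly/jar")

# Load the pyspark package (pyspark/__init__.py) straight from the jar.
pyspark = importer.load_module("pyspark")
print(pyspark.__file__)
"""

This only exercises the Python side, of course; per Andrew's point above, the executors also need the Py4j Java class files from the same jar.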