Hi Ted,

I think I've made a mistake. I referred to python/mllib: callJavaFunc in
mllib/common.py uses SparkContext._active_spark_context because it is called
from the driver. So maybe there is no explicit way to reach the JVM during
RDD operations?
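To convince myself, I tried a small pure-Python mimic of the pattern (no Spark needed; `FakeSparkContext` below is just a stand-in for the real SparkContext, not actual Spark code). A module-level function pickles by *reference*, so the driver-side class attribute never travels with it, and a fresh worker process sees that attribute in its unset state — which is exactly the `'NoneType' object has no attribute '_jvm'` error in the traceback below:

```python
import pickle

class FakeSparkContext:
    # mimics SparkContext._active_spark_context, a class-level global
    # that is only assigned in the driver process
    _active_spark_context = None

def reach_jvm(s):
    # mimics the lambda that reaches for the driver-only gateway;
    # the real code would go on to call ctx._jvm.java.lang.Integer.valueOf(s)
    ctx = FakeSparkContext._active_spark_context
    return ctx

# driver side: the context is set
FakeSparkContext._active_spark_context = "py4j-gateway"
payload = pickle.dumps(reach_jvm)  # serializes a *reference* to the function,
                                   # not the class attribute it reads

# worker side: a fresh process re-imports the module, so the global is
# None again (simulated here by resetting it before unpickling)
FakeSparkContext._active_spark_context = None
restored = pickle.loads(payload)
print(restored("123"))  # None -> hence the AttributeError on ._jvm
```

Note that the conversion in my example never needed the JVM at all: `data.map(lambda s: int(s.strip()))` runs entirely inside the Python worker.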
What I want to achieve is to take a ThriftWritable object from an
LzoBlockInputFormat and deserialize it to a Java object. If I could, I want
to further transform the Thrift object to a DataFrame. I think I can
implement a custom org.apache.spark.api.python.Converter and pass it to
sc.hadoopFile(..., keyConverterClass, valueConverterClass, ...). But once I
get the converted Java object, can I call its methods in Python directly,
i.e. reach the JVM? Thanks a lot!

2015-09-30 0:54 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:
> bq. the right way to reach JVM in python
>
> Can you tell us more about what you want to achieve ?
>
> If you want to pass some value to workers, you can use broadcast variable.
>
> Cheers
>
> On Mon, Sep 28, 2015 at 10:31 PM, YiZhi Liu <javeli...@gmail.com> wrote:
>>
>> Hi Ted,
>>
>> Thank you for reply. The sc works at driver, but how can I reach the
>> JVM in rdd.map ?
>>
>> 2015-09-29 11:26 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:
>> > >>> sc._jvm.java.lang.Integer.valueOf("12")
>> > 12
>> >
>> > FYI
>> >
>> > On Mon, Sep 28, 2015 at 8:08 PM, YiZhi Liu <javeli...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing some data processing on pyspark, but I failed to reach JVM
>> >> in workers.
>> >> Here is what I did:
>> >>
>> >> $ bin/pyspark
>> >> >>> data = sc.parallelize(["123", "234"])
>> >> >>> numbers = data.map(lambda s:
>> >> ...     SparkContext._active_spark_context._jvm.java.lang.Integer.valueOf(s.strip()))
>> >> >>> numbers.collect()
>> >>
>> >> I got,
>> >>
>> >> Caused by: org.apache.spark.api.python.PythonException: Traceback
>> >> (most recent call last):
>> >>   File "/mnt/hgfs/lewis/Workspace/source-codes/spark/python/lib/pyspark.zip/pyspark/worker.py",
>> >> line 111, in main
>> >>     process()
>> >>   File "/mnt/hgfs/lewis/Workspace/source-codes/spark/python/lib/pyspark.zip/pyspark/worker.py",
>> >> line 106, in process
>> >>     serializer.dump_stream(func(split_index, iterator), outfile)
>> >>   File "/mnt/hgfs/lewis/Workspace/source-codes/spark/python/lib/pyspark.zip/pyspark/serializers.py",
>> >> line 263, in dump_stream
>> >>     vs = list(itertools.islice(iterator, batch))
>> >>   File "<stdin>", line 1, in <lambda>
>> >> AttributeError: 'NoneType' object has no attribute '_jvm'
>> >>
>> >>     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
>> >>     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
>> >>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
>> >>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> >>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> >>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> >>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> >>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >>     ...
>> >> 1 more
>> >>
>> >> While _jvm at the driver end looks fine:
>> >>
>> >> >>> SparkContext._active_spark_context._jvm.java.lang.Integer.valueOf("123".strip())
>> >> 123
>> >>
>> >> The program is trivial, I just wonder what is the right way to reach
>> >> JVM in python. Any help would be appreciated.
>> >>
>> >> Thanks
>> >>
>> >> --
>> >> Yizhi Liu
>> >> Senior Software Engineer / Data Mining
>> >> www.mvad.com, Shanghai, China
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>> --
>> Yizhi Liu
>> Senior Software Engineer / Data Mining
>> www.mvad.com, Shanghai, China
>

--
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
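For reference, the Converter route discussed above keeps all per-record JVM calls on the JVM side: the converter flattens each Thrift record into plain maps and strings before it crosses the Py4J boundary, so the Python worker never needs a gateway at all. Here is a minimal pure-Python mimic of that data flow — the JSON rendering and the class name `com.example.ThriftToMapConverter` below are illustrative assumptions, not real elephant-bird or Spark code:

```python
import json

# Stand-in for the JVM-side Converter: a real class implementing
# org.apache.spark.api.python.Converter would deserialize the
# ThriftWritable and return a java.util.Map of simple types, which
# Py4J hands to Python as a plain dict.
def thrift_to_map(raw_record):
    # assumption: the record is rendered as JSON purely for illustration
    return json.loads(raw_record)

# What the Python side of the RDD sees after conversion: plain dicts,
# so rdd.map needs no JVM access per record.
raw = ['{"user_id": 1, "event": "click"}', '{"user_id": 2, "event": "view"}']
rows = [thrift_to_map(r) for r in raw]
print(rows[0]["event"])  # click
```

On the PySpark side the real converter would be wired in via the `valueConverter` argument of `sc.hadoopFile(path, inputFormatClass, keyClass, valueClass, valueConverter="com.example.ThriftToMapConverter")` (the converter class name is hypothetical), and the resulting RDD of dicts can then be turned into a DataFrame with `sqlContext.createDataFrame(...)`.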