Hi Ted,

I think I've made a mistake. I referred to python/mllib: callJavaFunc
in mllib/common.py uses SparkContext._active_spark_context because it
is called from the driver. So maybe there is no explicit way to reach
the JVM during RDD operations?
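To make the distinction concrete, here is a minimal sketch of what I
understand now (Integer.valueOf just stands in for any JVM call):

    from pyspark import SparkContext

    sc = SparkContext._active_spark_context  # non-None only in the driver
    # Driver side: the Py4J gateway is available, so this works.
    sc._jvm.java.lang.Integer.valueOf("12")

    # Worker side: a closure passed to rdd.map runs in a separate Python
    # worker process, where _active_spark_context is None, so any _jvm
    # access fails with the AttributeError from my earlier trace:
    # sc.parallelize(["12"]).map(
    #     lambda s: SparkContext._active_spark_context
    #                           ._jvm.java.lang.Integer.valueOf(s)).collect()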

What I want to achieve is to take a ThriftWritable object from an
LzoBlockInputFormat and deserialize it into a Java object. If that
works, I want to further transform the thrift object into a DataFrame.
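One workaround I am considering is to avoid the JVM on the workers
altogether and deserialize the thrift payload with the Python thrift
bindings inside mapPartitions. A rough sketch, assuming the records
reach Python as raw thrift bytes (raw_bytes_rdd below) and that a
Python-generated class is shipped to the workers; the module, class,
and field names are hypothetical:

    from thrift.TSerialization import deserialize
    from thrift.protocol.TBinaryProtocol import TBinaryProtocolFactory
    from my_thrift_gen.ttypes import MyThriftStruct  # hypothetical generated class

    def parse_partition(records):
        # One protocol factory per partition, reused across records.
        factory = TBinaryProtocolFactory()
        for raw in records:
            obj = deserialize(MyThriftStruct(), raw, protocol_factory=factory)
            yield (obj.field_a, obj.field_b)  # hypothetical fields

    rows = raw_bytes_rdd.mapPartitions(parse_partition)
    df = sqlContext.createDataFrame(rows, ["field_a", "field_b"])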

I think I can implement a custom org.apache.spark.api.python.Converter
and pass its class name to sc.hadoopFile(..., keyConverter=...,
valueConverter=...). But once I get the converted Java object, can I
call its methods from Python directly, i.e. reach the JVM?
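From what I can tell from the Spark examples, the converted value is
pickled and shipped to the Python worker, so what arrives is a plain
Python object rather than a live handle into the JVM. If that is right,
the Converter would have to flatten the thrift object into something
picklable, e.g. a java.util.Map that becomes a Python dict. The Python
side would then look roughly like this (the path and all class names
are placeholders for whatever I end up writing):

    from pyspark.sql import Row

    rdd = sc.hadoopFile(
        "hdfs:///path/to/lzo/input",  # placeholder path
        inputFormatClass="com.twitter.elephantbird.mapreduce.input.LzoBlockInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="com.twitter.elephantbird.mapreduce.io.ThriftWritable",
        valueConverter="com.example.ThriftWritableToMapConverter",  # my converter
    )
    # Each value arrives as a Python dict, not a JVM object, so there
    # are no Java methods left to call at this point.
    df = sqlContext.createDataFrame(rdd.values().map(lambda d: Row(**d)))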

Thanks a lot!

2015-09-30 0:54 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:
> bq. the right way to reach JVM in python
>
> Can you tell us more about what you want to achieve?
>
> If you want to pass some value to the workers, you can use a broadcast variable.
>
> Cheers
>
> On Mon, Sep 28, 2015 at 10:31 PM, YiZhi Liu <javeli...@gmail.com> wrote:
>>
>> Hi Ted,
>>
>> Thank you for your reply. The sc works at the driver, but how can I
>> reach the JVM in rdd.map?
>>
>> 2015-09-29 11:26 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:
>> >>>> sc._jvm.java.lang.Integer.valueOf("12")
>> > 12
>> >
>> > FYI
>> >
>> > On Mon, Sep 28, 2015 at 8:08 PM, YiZhi Liu <javeli...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing some data processing on pyspark, but I failed to reach the
>> >> JVM in the workers. Here is what I did:
>> >>
>> >> $ bin/pyspark
>> >> >>> data = sc.parallelize(["123", "234"])
>> >> >>> numbers = data.map(lambda s:
>> >> ...     SparkContext._active_spark_context._jvm.java.lang.Integer.valueOf(s.strip()))
>> >> >>> numbers.collect()
>> >>
>> >> I got,
>> >>
>> >> Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>> >>   File "/mnt/hgfs/lewis/Workspace/source-codes/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
>> >>     process()
>> >>   File "/mnt/hgfs/lewis/Workspace/source-codes/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
>> >>     serializer.dump_stream(func(split_index, iterator), outfile)
>> >>   File "/mnt/hgfs/lewis/Workspace/source-codes/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
>> >>     vs = list(itertools.islice(iterator, batch))
>> >>   File "<stdin>", line 1, in <lambda>
>> >> AttributeError: 'NoneType' object has no attribute '_jvm'
>> >>
>> >>   at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
>> >>   at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
>> >>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
>> >>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> >>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> >>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> >>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> >>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> >>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >> ... 1 more
>> >>
>> >> While _jvm at the driver end looks fine:
>> >>
>> >> >>> SparkContext._active_spark_context._jvm.java.lang.Integer.valueOf("123".strip())
>> >> 123
>> >>
>> >> The program is trivial; I just wonder what the right way to reach the
>> >> JVM from Python is. Any help would be appreciated.
>> >>
>> >> Thanks
>> >>
>> >> --
>> >> Yizhi Liu
>> >> Senior Software Engineer / Data Mining
>> >> www.mvad.com, Shanghai, China
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Yizhi Liu
>> Senior Software Engineer / Data Mining
>> www.mvad.com, Shanghai, China
>
>



-- 
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
