Other alternatives are to look at how PythonRDD does it in Spark, or to go
for a more traditional setup where you expose your Python functions behind a
local/remote service and call that from Scala, say over Thrift/gRPC/HTTP/a
local socket, etc.
Another option, though I've never done it myself so I'm not sure it would
work, is to look at whether Arrow could help, by sharing a piece of memory
holding the data you need from Scala and then reading it from Python.
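
As a very rough, untested sketch of the Arrow idea (the file path and column
name below are just placeholders): the Scala side would write the vectors to
an Arrow IPC file, e.g. with the Arrow Java library, and the Python side
would memory-map and read it:

    import pyarrow as pa

    # Memory-map the IPC file the Scala side is assumed to have written;
    # reading it this way avoids an extra copy of the data.
    with pa.memory_map("/tmp/vectors.arrow", "r") as source:
        table = pa.ipc.open_file(source).read_all()
        # "features" is a placeholder column name for the float arrays.
        vectors = table.column("features").to_pylist()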

On Sat, Jul 16, 2022 at 9:56 PM Sean Owen <sro...@gmail.com> wrote:

> Use GraphFrames?
>
> On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang <yhzhang1...@gmail.com> wrote:
>
>> Hi Shay,
>>
>> Thanks for your reply! I would very much like to use PySpark. However, my
>> project depends on GraphX, which is only available in the Scala API as far
>> as I know, so I'm locked into Scala and trying to find a way out. I wonder
>> if there's a way around it.
>>
>> Best regards,
>> Yuhao Zhang
>>
>>
>> On Sun, Jul 10, 2022 at 5:36 AM Shay Elbaz <shay.el...@gm.com> wrote:
>>
>>> Yuhao,
>>>
>>>
>>> You can use PySpark as the entry point to your application. With py4j you
>>> can call Java/Scala functions from the Python application. There's no need
>>> to use the pipe() function for that.
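>>>
>>> As a minimal sketch (com.example.GraphOps is just a placeholder for
>>> whatever Scala object you put on the driver classpath):
>>>
>>>     from pyspark.sql import SparkSession
>>>
>>>     spark = SparkSession.builder.getOrCreate()
>>>     # py4j gateway into the driver JVM
>>>     jvm = spark.sparkContext._jvm
>>>     # Call a hypothetical Scala object exposed on the classpath
>>>     result = jvm.com.example.GraphOps.run("some-arg")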
>>>
>>>
>>> Shay
>>> ------------------------------
>>> *From:* Yuhao Zhang <yhzhang1...@gmail.com>
>>> *Sent:* Saturday, July 9, 2022 4:13:42 AM
>>> *To:* user@spark.apache.org
>>> *Subject:* [EXTERNAL] RDD.pipe() for binary data
>>>
>>>
>>> Hi All,
>>>
>>> I'm currently working on a project that involves transferring data between
>>> Spark 3.x (I use Scala) and a Python runtime. In Spark, the data is stored
>>> in an RDD as arrays/vectors of floating-point numbers, and I have custom
>>> routines written in Python to process them. On the Spark side, I also have
>>> some operations specific to the Spark Scala APIs, so I need to use both
>>> runtimes.
>>>
>>> Now, to transfer the data, I've been using the RDD.pipe() API:
>>> 1. Spark converts the arrays to strings and calls RDD.pipe(script.py).
>>> 2. Python receives the strings, casts them back into Python data
>>> structures, and runs the operations.
>>> 3. Python converts the resulting arrays back into strings and prints them
>>> to Spark.
>>> 4. Spark gets the strings and casts them back into arrays.
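>>>
>>> Roughly, the script side looks like the following sketch (assuming one
>>> comma-separated vector per line of stdin; the delimiter and the processing
>>> step are just placeholders):
>>>
>>>     import sys
>>>
>>>     for line in sys.stdin:
>>>         # 2. Parse the comma-separated string back into floats
>>>         vector = [float(x) for x in line.strip().split(",")]
>>>         # ... custom Python processing on `vector` goes here ...
>>>         # 3. Serialize back to a string and print it for Spark to read
>>>         print(",".join(str(x) for x in vector))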
>>>
>>> Needless to say, this feels unnatural and slow to me, and there are
>>> potential floating-point precision issues, since I think the float arrays
>>> should really be transmitted as raw bytes. I found no way to use RDD.pipe()
>>> for this purpose; judging from
>>> https://github.com/apache/spark/blob/3331d4ccb7df9aeb1972ed86472269a9dbd261ff/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L139,
>>> .pipe() seems to be locked to text-based streaming.
>>>
>>> Can anyone shed some light on how I can achieve this? I'm trying to come
>>> up with a way that does not involve modifying core Spark myself. One
>>> potential solution I can think of is saving/loading the RDD as binary
>>> files, but I'm hoping to find a streaming-based solution. Any help is much
>>> appreciated, thanks!
>>>
>>>
>>> Best regards,
>>> Yuhao
>>>
>>
