Re: PySpark API on top of Apache Arrow

Jules Damji Sat, 26 May 2018 14:28:37 -0700

Actually, we do mention that Pandas UDF is built upon Apache Arrow. :-) And 
we point to the blog by its contributors from Two Sigma. :-)


“On the other hand, Pandas UDF built atop Apache Arrow accords high-performance 
to Python developers, whether you use Pandas UDFs on a single-node machine or 
distributed cluster.”
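
For anyone finding this thread later, here is a minimal sketch of what a 
scalar Pandas UDF looks like. It is an illustration only, not code from the 
blog: the names plus_one and run_with_spark are made up for this example, and 
the Spark part assumes pyspark (2.3 or later) and pyarrow are installed. The 
point is that the UDF body is ordinary pandas code operating on a whole 
pandas.Series batch, which Spark hands over via Arrow.

```python
import pandas as pd


def plus_one(values: pd.Series) -> pd.Series:
    """Plain pandas logic: receives a batch of values, returns a batch."""
    return values + 1


def run_with_spark():
    """Hypothetical driver; assumes pyspark >= 2.3 and pyarrow are installed."""
    # Imported here so the pandas function above can be used without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    # local[*] runs Spark on a single machine using all available cores.
    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Wrap the pandas function as a scalar Pandas UDF returning a long column.
    plus_one_udf = pandas_udf(plus_one, returnType="long")

    df = spark.range(5)  # a single column named "id"
    df.select(plus_one_udf(df["id"]).alias("id_plus_one")).show()
    spark.stop()
```

The pandas function can be tried on its own, e.g. plus_one(pd.Series([1, 2, 3])); 
calling run_with_spark() applies the same function through Spark.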

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On May 26, 2018, at 12:41 PM, Corey Nolet <[email protected]> wrote:
> 
> Gourav & Nicholas,
> 
> Thank you! It does look like the PySpark Pandas UDF is exactly what I want, 
> and the article I read didn't mention that it used Arrow underneath. Looks 
> like Wes McKinney was also a key part of building the Pandas UDF.
> 
> Gourav,
> 
> I totally apologize for my long and drawn out response to you. I initially 
> misunderstood your response. I also need to take the time to dive into the 
> PySpark source code. I was assuming that it was just firing up JVMs under the 
> hood.
> 
> Thanks again! I'll report back with findings. 
> 
>> On Sat, May 26, 2018 at 2:51 PM, Nicolas Paris <[email protected]> wrote:
>> Hi Corey,
>> 
>> I'm not familiar with Arrow or Plasma. However, I recently read an article 
>> about Spark on a standalone machine (your case). It sounds like you could 
>> benefit from PySpark "as-is":
>> 
>> https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
>> 
>> Regards,
>> 
>> 2018-05-23 22:30 GMT+02:00 Corey Nolet <[email protected]>:
>>> Please forgive me if this question has been asked already. 
>>> 
>>> I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if 
>>> anyone knows of any efforts to implement the PySpark API on top of Apache 
>>> Arrow directly. In my case, I'm doing data science on a machine with 288 
>>> cores and 1 TB of RAM. 
>>> 
>>> It would make life much easier if I was able to use the flexibility of the 
>>> PySpark API (rather than having to be tied to the operations in Pandas). It 
>>> seems like an implementation would be fairly straightforward using the 
>>> Plasma server and object_ids. 
>>> 
>>> If you have not heard of an effort underway to accomplish this, are there 
>>> any reasons why it would be a bad idea?
>>> 
>>> 
>>> Thanks!
>> 
>
