Actually, we do mention that Pandas UDF is built upon Apache Arrow. :-) And we point to the blog by its contributors from Two Sigma. :-)
“On the other hand, Pandas UDF built atop Apache Arrow accords high-performance to Python developers, whether you use Pandas UDFs on a single-node machine or distributed cluster.”

Cheers
Jules

Sent from my iPhone
Pardon the dumb thumb typos :)

> On May 26, 2018, at 12:41 PM, Corey Nolet <[email protected]> wrote:
>
> Gourav & Nicolas,
>
> Thank you! It does look like the pyspark Pandas UDF is exactly what I want
> and the article I read didn't mention that it used Arrow underneath. Looks
> like Wes McKinney was also a key part of building the Pandas UDF.
>
> Gourav,
>
> I totally apologize for my long and drawn out response to you. I initially
> misunderstood your response. I also need to take the time to dive into the
> PySpark source code; I was assuming that it was just firing up JVMs under the
> hood.
>
> Thanks again! I'll report back with findings.
>
>> On Sat, May 26, 2018 at 2:51 PM, Nicolas Paris <[email protected]> wrote:
>> hi corey
>>
>> not familiar with arrow, plasma. However recently read an article about
>> spark on a standalone machine (your case). Sounds like you could benefit
>> from pyspark "as-is"
>>
>> https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
>>
>> regards,
>>
>> 2018-05-23 22:30 GMT+02:00 Corey Nolet <[email protected]>:
>>> Please forgive me if this question has been asked already.
>>>
>>> I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if
>>> anyone knows of any efforts to implement the PySpark API on top of Apache
>>> Arrow directly. In my case, I'm doing data science on a machine with 288
>>> cores and 1TB of ram.
>>>
>>> It would make life much easier if I was able to use the flexibility of the
>>> PySpark API (rather than having to be tied to the operations in Pandas). It
>>> seems like an implementation would be fairly straightforward using the
>>> Plasma server and object_ids.
>>>
>>> If you have not heard of an effort underway to accomplish this, any reasons
>>> why it would be a bad idea?
>>>
>>>
>>> Thanks!
