@Justin, it's fixed by https://github.com/apache/spark/pull/12057
On Thu, Feb 11, 2016 at 11:26 AM, Davies Liu wrote:
> Had a quick look at your commit; I think that makes sense. Could you
> send a PR for that, so we can review it?
>
> In order to support 2), we need to change the serialized Python ...
Had a quick look at your commit; I think that makes sense. Could you
send a PR for that, so we can review it?
In order to support 2), we need to change the serialized Python
function from `f(iter)` to `f(x)`, processing one row at a time (rather
than a whole partition); then we can easily combine them together.
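Here's a rough, hypothetical sketch of that idea in plain Python (not Spark's actual serializer code; `compose`, `upper`, and `exclaim` are made-up names just for illustration): once each UDF takes a single row, chained UDFs collapse into one Python call per row.

    def compose(*fs):
        """Combine several row-at-a-time (f(x)-style) UDFs into one function."""
        def combined(row):
            for f in fs:
                row = f(row)
            return row
        return combined

    # Hypothetical UDFs, each taking one value and returning one value.
    upper = lambda s: s.upper()
    exclaim = lambda s: s + "!"

    combined_udf = compose(upper, exclaim)

    # With f(iter)-style UDFs, chaining would instead have to thread an
    # iterator between every UDF in the pipeline.
    rows = ["a", "b", "c"]
    print([combined_udf(r) for r in rows])   # ['A!', 'B!', 'C!']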
Hey guys,
BLUF: sorry for the length of this email; I'm trying to figure out how to batch
Python UDF executions, and since this is my first time messing with Catalyst,
I'd appreciate any feedback.
My team is starting to use PySpark UDFs quite heavily, and performance is a
huge blocker. The extra roundtrip