Re: Python UDF performance at large scale

2015-06-25 Thread Justin Uang
Sweet, filed here: https://issues.apache.org/jira/browse/SPARK-8632
On Thu, Jun 25, 2015 at 3:05 AM Davies Liu wrote:
> I'm thinking that the batched synchronous version will be too slow
> (with small batch size) or easy to OOM with large (batch size). If
> it's not that hard, you can give it a

Re: Python UDF performance at large scale

2015-06-25 Thread Davies Liu
I'm thinking that the batched synchronous version will be too slow (with a small batch size) or easy to OOM (with a large batch size). If it's not that hard, you can give it a try.
On Wed, Jun 24, 2015 at 4:39 PM, Justin Uang wrote:
> Correct, I was running with a batch size of about 100 when I did t
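
For readers following the thread, here is a minimal sketch of the "batched synchronous" idea being discussed. It is not Spark's code: the names `batched` and `eval_udf_synchronously` are hypothetical, and the in-process list comprehension only stands in for a round trip between the Java side and the Python worker. It shows the trade-off raised above: only one batch of rows is held at a time, so a small batch size means many round trips and a large one means more rows in memory.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def eval_udf_synchronously(
    rows: Iterable[T], udf: Callable[[T], R], batch_size: int
) -> Iterator[Tuple[T, R]]:
    """Evaluate udf one batch at a time, pairing each row with its result.

    Only the current batch is buffered. The next batch is not read until
    the results for the current one have been handed back to the caller.
    """
    for batch in batched(rows, batch_size):
        # Hypothetical stand-in for: send the batch to the Python worker,
        # then block until all of its results have come back.
        results = [udf(row) for row in batch]
        yield from zip(batch, results)


if __name__ == "__main__":
    for row, result in eval_udf_synchronously(range(5), lambda x: x * 2, batch_size=2):
        print(row, result)
```

With `batch_size=2` and five input rows, the loop above performs three simulated round trips (batches of 2, 2, and 1 rows).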

Re: Python UDF performance at large scale

2015-06-24 Thread Justin Uang
Correct, I was running with a batch size of about 100 when I did the tests, because I was worried about deadlocks. Do you have any concerns regarding the batched synchronous version of communication between the Java and Python processes, and if not, should I file a ticket and start writing it? O

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
From your comment, the 2x improvement only happens when you have the batch size as 1, right?
On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang wrote:
> FYI, just submitted a PR to Pyrolite to remove their StopException.
> https://github.com/irmen/Pyrolite/pull/30
>
> With my benchmark, removing it ba

Re: Python UDF performance at large scale

2015-06-24 Thread Justin Uang
FYI, just submitted a PR to Pyrolite to remove their StopException.
https://github.com/irmen/Pyrolite/pull/30
With my benchmark, removing it basically made it about 2x faster.
On Wed, Jun 24, 2015 at 8:33 AM Punyashloka Biswal wrote:
> Hi Davies,
>
> In general, do we expect people to use CPyth

Re: Python UDF performance at large scale

2015-06-24 Thread Punyashloka Biswal
Hi Davies, In general, do we expect people to use CPython only for "heavyweight" UDFs that invoke an external library? Are there any examples of using Jython, especially performance comparisons to Java/Scala and CPython? When using Jython, do you expect the driver to send code to the executor as a

Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Fair points, I also like simpler solutions. The overhead of a Python task could be a few milliseconds, which means we should also evaluate them in batches (one Python task per batch). Decreasing the batch size for UDFs sounds reasonable to me, together with other tricks to reduce the data in socket/p

Re: Python UDF performance at large scale

2015-06-23 Thread Justin Uang
// + punya Thanks for your quick response! I'm not sure that using an unbounded buffer is a good solution to the locking problem. For example, in the situation where I had 500 columns, I am in fact storing 499 extra columns on the Java side, which might make me OOM if I have to store many rows. I

Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Thanks for looking into it. I like the idea of having a ForkingIterator. If we have an unlimited buffer in it, then we will not have the deadlock problem, I think. The writing thread will be blocked by the Python process, so not many rows will be buffered (though this could still be a reason to OOM). At least, this

Python UDF performance at large scale

2015-06-23 Thread Justin Uang
BLUF: BatchPythonEvaluation's implementation is unusable at large scale, but I have a proof-of-concept implementation that avoids caching the entire dataset.
Hi,
We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of BatchPyth