Sweet, filed here: https://issues.apache.org/jira/browse/SPARK-8632
On Thu, Jun 25, 2015 at 3:05 AM Davies Liu wrote:
I'm thinking that the batched synchronous version will be too slow
(with a small batch size) or easy to OOM (with a large batch size). If
it's not that hard, you can give it a try.
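The trade-off Davies describes can be sketched in plain Python (nothing Spark-specific here; `batched_sync_eval` and its shape are hypothetical) as a loop that buffers exactly one batch at a time:

```python
from itertools import islice

# Hypothetical sketch of batched *synchronous* evaluation: buffer one
# batch, hand it to the worker, block until its results come back, then
# fetch the next batch. Peak memory is O(batch_size), but every batch
# pays a full round trip, so a tiny batch size is slow and a huge one
# risks OOM.
def batched_sync_eval(rows, udf, batch_size):
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))  # at most one batch in memory
        if not batch:
            return
        # stand-in for "pickle batch -> Python worker -> unpickle results"
        yield from (udf(r) for r in batch)
```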
On Wed, Jun 24, 2015 at 4:39 PM, Justin Uang wrote:
Correct, I was running with a batch size of about 100 when I did the tests,
because I was worried about deadlocks. Do you have any concerns regarding
the batched synchronous version of communication between the Java and
Python processes, and if not, should I file a ticket and start writing
it?
From your comment, the 2x improvement only happens when you have the
batch size as 1, right?
On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang wrote:
FYI, just submitted a PR to Pyrolite to remove their StopException.
https://github.com/irmen/Pyrolite/pull/30
With my benchmark, removing it basically made it about 2x faster.
On Wed, Jun 24, 2015 at 8:33 AM Punyashloka Biswal wrote:
Hi Davies,
In general, do we expect people to use CPython only for "heavyweight" UDFs
that invoke an external library? Are there any examples of using Jython,
especially performance comparisons to Java/Scala and CPython? When using
Jython, do you expect the driver to send code to the executor as a
Fair points, I also like simpler solutions.
The overhead of a Python task could be a few milliseconds, which
means we should also eval them as batches (one Python task per batch).
Decreasing the batch size for UDFs sounds reasonable to me, together
with other tricks to reduce the data in socket/p
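A back-of-the-envelope model of why the per-task overhead pushes toward batching (the function and the numbers below are illustrative, not measured):

```python
import math

# Toy cost model: total time = (number of Python tasks) * per-task
# overhead + per-row work. All numbers are illustrative assumptions.
def eval_cost_ms(n_rows, batch_size, per_task_overhead_ms, per_row_ms):
    n_tasks = math.ceil(n_rows / batch_size)
    return n_tasks * per_task_overhead_ms + n_rows * per_row_ms
```

With a few milliseconds of overhead per task, batch size 1 is dominated entirely by overhead, while even a batch size of 100 amortizes it to near zero per row.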
// + punya
Thanks for your quick response!
I'm not sure that using an unbounded buffer is a good solution to the
locking problem. For example, in the situation where I have 500 columns, I
am in fact storing 499 extra columns on the Java side, which might make me
OOM if I have to store many rows. I
Thanks for looking into it; I like the idea of having a
ForkingIterator. If we have an unlimited buffer in it, then we will not
have the deadlock problem, I think. The writing thread will be blocked
by the Python process, so there will not be many rows buffered (still
a reason to OOM). At least, this
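The ForkingIterator idea can be sketched like this (plain Python, all names hypothetical; a bounded queue stands in for the buffer, which turns the OOM concern raised above into back-pressure instead):

```python
import queue
import threading

# Hypothetical sketch of the ForkingIterator idea: a separate writer
# thread feeds input rows toward the worker while the main thread pulls
# results back, so no single thread has to both write and read (the
# deadlock scenario). The buffer here is bounded: when it fills, the
# writer blocks (back-pressure) rather than buffering unboundedly.
_SENTINEL = object()

def forked_eval(rows, udf, buffer_size=4):
    buf = queue.Queue(maxsize=buffer_size)

    def writer():
        for row in rows:
            buf.put(row)       # blocks when the buffer is full
        buf.put(_SENTINEL)     # signal end of input

    threading.Thread(target=writer, daemon=True).start()
    while True:
        row = buf.get()
        if row is _SENTINEL:
            return
        yield udf(row)         # stand-in for the Python worker's output
```

Bounding the queue is one way to keep memory at O(buffer_size) rows regardless of how wide the rows are, at the cost of stalling the writer when the consumer falls behind.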
BLUF: BatchPythonEvaluation's implementation is unusable at large scale,
but I have a proof-of-concept implementation that avoids caching the entire
dataset.
Hi,
We have been running into performance problems using Python UDFs with
DataFrames at large scale.
From the implementation of BatchPyth