Hi,

Pandas UDFs are a bit overhyped. One of the blogs shows the use case of
adding 1 to a field using a Pandas UDF, which is pretty much pointless.
Then you go beyond the blog, realise that your actual use case is more
than adding one :) and reality hits you.

In certain scenarios a Pandas UDF is actually slow; try using apply with
a custom or pandas function and compare. In fact, in certain scenarios I
have found that plain UDFs run much faster and use much less memory.
Therefore, test your use case (with at least 30 million records) before
settling on the Pandas UDF option; rough sketches of the variants I mean
follow below.

And when you start using the grouped map flavour, you read
https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
and realise: "Oh!! Now I can run into random OOM errors, and the
maxRecordsPerBatch option does not help at all."

Excerpt from the above link:

    Note that all data for a group will be loaded into memory before the
    function is applied. This can lead to out of memory exceptions,
    especially if the group sizes are skewed. The configuration for
    maxRecordsPerBatch
    <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
    is not applied on groups and it is up to the user to ensure that the
    grouped data will fit into the available memory.
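To make the comparison concrete, here is a minimal sketch of the two
variants I mean, in Spark 2.3 style (the function names, the DataFrame
df, and its column "value" are invented for illustration, not from any
particular benchmark):

from pyspark.sql.functions import udf, pandas_udf, PandasUDFType

# Plain row-at-a-time UDF: Python is invoked once per row.
@udf("double")
def plain_double(v):
    return None if v is None else v * 2.0

# Scalar Pandas UDF: Python is invoked once per Arrow batch,
# receiving and returning a pandas Series.
@pandas_udf("double", PandasUDFType.SCALAR)
def vectorised_double(s):
    return s * 2.0

# Time both on a realistic volume, forcing evaluation with an
# aggregate (df and "value" are hypothetical):
#   df.select(plain_double("value").alias("x2")).agg({"x2": "sum"}).show()
#   df.select(vectorised_double("value").alias("x2")).agg({"x2": "sum"}).show()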
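For the grouped map case, this is the shape that triggers the caveat in
the excerpt: every row of a group arrives as one pandas DataFrame on a
single executor, so a heavily skewed key can exhaust executor memory no
matter what maxRecordsPerBatch is set to. A minimal sketch, with an
invented schema and invented demeaning logic:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Grouped map Pandas UDF (Spark 2.3+): the return schema is declared
# up front, and the function maps one pandas DataFrame per group.
@pandas_udf("id long, value double", PandasUDFType.GROUPED_MAP)
def demean(pdf):
    # pdf holds every row of one group in memory; with a skewed key
    # this single DataFrame can exceed what the executor can hold.
    return pdf.assign(value=pdf.value - pdf.value.mean())

# Hypothetical usage, for a df with columns (id, value):
#   result = df.groupby("id").apply(demean)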
Let me know about your use case if possible.

Regards,
Gourav

On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:

> Thanks Patrick! I tried to package it according to these instructions;
> it got distributed on the cluster, however the same Spark program that
> takes 5 mins without the Pandas UDF has started to take 25 mins...
>
> Have you experienced anything like this? Also, is PyArrow 0.12 supported
> with Spark 2.3 (according to the documentation, it should be fine)?
>
> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com>
> wrote:
>
>> Hi Rishi,
>>
>> I've had success using the approach outlined here:
>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>
>> Does this work for you?
>>
>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com>
>> wrote:
>>
>>> Modified the subject; to clarify, I am looking to create an Anaconda
>>> parcel with PyArrow and other libraries, so that I can distribute it
>>> on the Cloudera cluster.
>>>
>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have been trying to figure out a way to build an Anaconda parcel
>>>> with PyArrow included, for distribution on my Cloudera-managed
>>>> cluster, but this doesn't seem to work right. Could someone please
>>>> help?
>>>>
>>>> I have tried installing Anaconda on one of the management nodes of
>>>> the Cloudera cluster and tarring the directory, but that directory
>>>> doesn't include all the packages needed to form a proper parcel for
>>>> distribution.
>>>>
>>>> Any help is much appreciated!
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Rishi Shah
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>
>> --
>> *Patrick McCarthy*
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>
> --
> Regards,
>
> Rishi Shah