And you found the Pandas UDF more performant? Can you share your code and
prove it?
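For reference, the kind of scalar Pandas UDF Patrick describes below
(lat/long to MGRS) might look like the following. This is a minimal
sketch, not his actual code: it assumes the mgrs package's
MGRS().toMGRS(lat, lon) API, and the DataFrame and column names are
hypothetical.

    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StringType
    import pandas as pd

    @pandas_udf(StringType(), PandasUDFType.SCALAR)
    def to_mgrs(lat, lon):
        # Import inside the UDF so each worker resolves mgrs from the
        # shipped conda environment, not the driver's interpreter.
        import mgrs
        m = mgrs.MGRS()
        # toMGRS returns bytes in some mgrs releases, hence the decode.
        out = []
        for la, lo in zip(lat, lon):
            s = m.toMGRS(la, lo)
            out.append(s.decode() if isinstance(s, bytes) else s)
        return pd.Series(out)

    # Hypothetical input columns:
    df = df.withColumn("mgrs", to_mgrs("latitude", "longitude"))

Note the per-row Python loop: the win over a plain UDF here is Arrow
serialization and batching, not vectorized math, which is consistent
with the mixed benchmark results discussed below.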
On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:

> I disagree that it's hype. Perhaps not 1:1 with pure Scala
> performance-wise, but for Python-based data scientists or others with a
> lot of Python expertise it allows one to do things that would otherwise
> be infeasible at scale.
>
> For instance, I recently had to convert latitude/longitude pairs to MGRS
> strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
> Writing a pandas UDF (and putting the mgrs python package into a conda
> environment) was _significantly_ easier than any alternative I found.
>
> @Rishi - depending on how your network is constructed, some lag could
> come from just uploading the conda environment. If you load it from HDFS
> with --archives, does it improve?
>
> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> Pandas UDFs are a bit of hype. One of their blogs shows the use case of
>> adding 1 to a field using a Pandas UDF, which is pretty much pointless.
>> So you go beyond the blog, realise that your actual use case involves
>> more than adding one :) and the reality hits you.
>>
>> Pandas UDFs are actually slow in certain scenarios; try using apply
>> with a custom or a pandas function. In fact, in certain scenarios I
>> have found that general UDFs work much faster and use much less memory.
>> Therefore, test out your use case (with at least 30 million records)
>> before committing to the Pandas UDF option.
>>
>> And when you start using GroupMap, you realise after reading
>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>> that "Oh!! now I can run into random OOM errors and the
>> maxRecordsPerBatch option does not help at all".
>>
>> Excerpt from the above link:
>> Note that all data for a group will be loaded into memory before the
>> function is applied. This can lead to out of memory exceptions,
>> especially if the group sizes are skewed. The configuration for
>> maxRecordsPerBatch
>> <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
>> is not applied on groups and it is up to the user to ensure that the
>> grouped data will fit into the available memory.
>>
>> Let me know about your use case if possible.
>>
>> Regards,
>> Gourav
>>
>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>
>>> Thanks Patrick! I tried to package it according to these instructions;
>>> it got distributed on the cluster, however the same Spark program that
>>> takes 5 mins without the pandas UDF has started to take 25 mins...
>>>
>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>> supported with Spark 2.3 (according to the documentation, it should be
>>> fine)?
>>>
>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>
>>>> Hi Rishi,
>>>>
>>>> I've had success using the approach outlined here:
>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>
>>>> Does this work for you?
>>>>
>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>
>>>>> Modified the subject, and I would like to clarify that I am looking
>>>>> to create an Anaconda parcel with pyarrow and other libraries, so
>>>>> that I can distribute it on the Cloudera cluster...
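The grouped-map pattern Gourav warns about above looks roughly like
this; a sketch only, with a hypothetical key/value schema and the
subtract-mean body from the Spark docs. Note that maxRecordsPerBatch
bounds Arrow batches for scalar UDFs but, per the excerpt quoted above,
does nothing for groups:

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Caps Arrow batch size for scalar Pandas UDFs only; each
    # grouped-map group is still materialized in memory in full, so
    # skewed groups can OOM regardless of this setting.
    spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

    @pandas_udf("key string, value double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # pdf is ALL rows for one key as a single pandas DataFrame.
        pdf["value"] = pdf["value"] - pdf["value"].mean()
        return pdf

    result = df.groupBy("key").apply(demean)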
>>>>>
>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have been trying to figure out a way to build an Anaconda parcel
>>>>>> with pyarrow included for distribution on my Cloudera-managed
>>>>>> cluster, but this doesn't seem to work right. Could someone please
>>>>>> help?
>>>>>>
>>>>>> I have tried to install Anaconda on one of the management nodes on
>>>>>> the Cloudera cluster... tarred the directory, but this directory
>>>>>> doesn't include all the packages needed to form a proper parcel for
>>>>>> distribution.
>>>>>>
>>>>>> Any help is much appreciated!
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>>
>>>>>> Rishi Shah
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> Rishi Shah
>>>>
>>>>
>>>> --
>>>> Patrick McCarthy
>>>> Senior Data Scientist, Machine Learning Engineering
>>>> Dstillery
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>
>
> --
> Patrick McCarthy
> Senior Data Scientist, Machine Learning Engineering
> Dstillery
> 470 Park Ave South, 17th Floor, NYC 10016
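On the --archives suggestion earlier in the thread, the usual YARN
wiring for a pre-built environment looks roughly like this. A sketch
under assumptions: the HDFS path and archive name are hypothetical, and
the same settings are more commonly passed as spark-submit flags
(--archives plus --conf spark.pyspark.python=...):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pandas-udf-job")
        # Ship the packed conda env from HDFS; YARN unpacks it on each
        # executor under the alias after the '#'.
        .config("spark.yarn.dist.archives",
                "hdfs:///user/me/envs/pyarrow_env.tar.gz#pyarrow_env")
        # Point the Python workers at the interpreter in the archive.
        .config("spark.pyspark.python", "./pyarrow_env/bin/python")
        .getOrCreate()
    )

Loading the archive from HDFS this way avoids re-uploading the
environment from the client on every submit, which is the lag Patrick
suspects above.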