Hi Patrick,

Super, thanks a ton for sharing the code. Can you please confirm that this
runs faster than the regular UDFs?

Interestingly, I am also running the same transformations using another
geospatial library in Python, where I pass in two fields and get back an
array.

Regards,
Gourav Sengupta
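For anyone wanting to answer the "faster than a regular UDF?" question
empirically: a minimal harness along the lines below would do it. This is a
sketch only; the DataFrame `df`, its `lat_lon` and `level` columns, and the
plain-UDF twin of Patrick's pandas UDF (quoted further down) are assumptions,
not code from this thread.

    import time
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    # A row-at-a-time Python UDF doing the same conversion as the pandas
    # UDF quoted below, so the two can be timed on identical data.
    @F.udf(T.StringType())
    def generate_mgrs_plain(lat_lon_str, level):
        import mgrs
        m = mgrs.MGRS()
        precision = {1000: 2, 100: 3}.get(level, 0)
        lat, lon = lat_lon_str.split('_')
        return m.toMGRS(float(lat), float(lon), MGRSPrecision=precision)

    def time_full_scan(df):
        # foreach computes every row and discards the output, so the
        # measurement is not skewed by collect() or display limits.
        start = time.perf_counter()
        df.foreach(lambda _: None)
        return time.perf_counter() - start

    # Example comparison (column names are assumptions):
    # t_pandas = time_full_scan(df.select(generate_mgrs_series('lat_lon', 'level')))
    # t_plain  = time_full_scan(df.select(generate_mgrs_plain('lat_lon', 'level')))

Running both over the same input of realistic size (Gourav suggests at least
30 million records below) gives a like-for-like wall-clock comparison.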
On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:

> Human time is considerably more expensive than computer time, so in that
> regard, yes :)
>
> This took me one minute to write and ran fast enough for my needs. If
> you're willing to provide a comparable Scala implementation I'd be happy
> to compare them.
>
> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> def generate_mgrs_series(lat_lon_str, level):
>     # Import inside the UDF so each executor resolves mgrs from its
>     # own environment.
>     import mgrs
>     m = mgrs.MGRS()
>
>     # `level` arrives as a pandas Series; this assumes it is constant
>     # within each batch.
>     levelval = level[0]
>     precision_level = 0
>     if levelval == 1000:
>         precision_level = 2
>     if levelval == 100:
>         precision_level = 3
>
>     def convert(ll_str):
>         # toMGRS expects numeric lat/lon, hence the float() casts.
>         lat, lon = ll_str.split('_')
>         return m.toMGRS(float(lat), float(lon),
>                         MGRSPrecision=precision_level)
>
>     return lat_lon_str.apply(convert)
>
> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> And you found the pandas UDF more performant? Can you share your code
>> and prove it?
>>
>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com>
>> wrote:
>>
>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala
>>> performance-wise, but for Python-based data scientists or others with
>>> a lot of Python expertise it allows one to do things that would
>>> otherwise be infeasible at scale.
>>>
>>> For instance, I recently had to convert latitude/longitude pairs to
>>> MGRS strings
>>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>>> Writing a pandas UDF (and putting the mgrs Python package into a conda
>>> environment) was _significantly_ easier than any alternative I found.
>>>
>>> @Rishi - depending on how your network is constructed, some lag could
>>> come from just uploading the conda environment. If you load it from
>>> HDFS with --archives, does it improve?
>>>
>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta
>>> <gourav.sengu...@gmail.com> wrote:
>>>
>>>> hi,
>>>>
>>>> Pandas UDFs are a bit of hype. One of the blog posts shows the use
>>>> case of adding 1 to a field using a pandas UDF, which is pretty much
>>>> pointless. So you go beyond the blog and realise that your actual use
>>>> case is more than adding one :) and the reality hits you.
>>>>
>>>> Pandas UDFs in certain scenarios are actually slow; try using apply
>>>> with a custom or pandas function. In fact, in certain scenarios I
>>>> have found general UDFs to run much faster and use much less memory.
>>>> Therefore, test out your use case (with at least 30 million records)
>>>> before committing to the pandas UDF option.
>>>>
>>>> And when you start using grouped map you realise, after reading
>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs,
>>>> that "Oh!! now I can run into random OOM errors and the
>>>> maxRecordsPerBatch option does not help at all".
>>>>
>>>> Excerpt from the above link:
>>>> Note that all data for a group will be loaded into memory before the
>>>> function is applied. This can lead to out of memory exceptions,
>>>> especially if the group sizes are skewed. The configuration for
>>>> maxRecordsPerBatch
>>>> <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
>>>> is not applied on groups and it is up to the user to ensure that the
>>>> grouped data will fit into the available memory.
>>>>
>>>> Let me know about your use case if possible.
>>>>
>>>> Regards,
>>>> Gourav
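To make the excerpted warning concrete, here is a minimal grouped-map sketch
(Spark 2.3-style API; the DataFrame `df` and its `user_id` column are
assumptions, not code from this thread). Each invocation of the function
receives every row of one group as a single pandas DataFrame, so peak memory
follows the largest group and maxRecordsPerBatch never enters into it:

    import pandas as pd
    import pyspark.sql.functions as F

    # Grouped-map pandas UDF: each call gets ALL rows of one group as
    # one pandas DataFrame, so memory use scales with the largest group.
    @F.pandas_udf("user_id string, n long", F.PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                             "n": [len(pdf)]})

    result = df.groupBy("user_id").apply(summarize)

If one user_id accounts for a large share of the rows, that single pandas
DataFrame alone can exceed executor memory, which is the skew failure mode
the documentation describes.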
>>>>
>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Patrick! I tried to package it according to these
>>>>> instructions; it got distributed on the cluster, however the same
>>>>> Spark program that takes 5 mins without the pandas UDF has started
>>>>> to take 25 mins...
>>>>>
>>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>>> supported with Spark 2.3 (according to the documentation, it should
>>>>> be fine)?
>>>>>
>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy
>>>>> <pmccar...@dstillery.com> wrote:
>>>>>
>>>>>> Hi Rishi,
>>>>>>
>>>>>> I've had success using the approach outlined here:
>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>>
>>>>>> Does this work for you?
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah
>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>
>>>>>>> Modified the subject, and would like to clarify that I am looking
>>>>>>> to create an Anaconda parcel with PyArrow and other libraries, so
>>>>>>> that I can distribute it on the Cloudera cluster..
>>>>>>>
>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah
>>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I have been trying to figure out a way to build an Anaconda
>>>>>>>> parcel with PyArrow included for my Cloudera-managed cluster,
>>>>>>>> but this doesn't seem to work right. Could someone please help?
>>>>>>>>
>>>>>>>> I have tried installing Anaconda on one of the management nodes
>>>>>>>> on the Cloudera cluster and tarring the directory, but that
>>>>>>>> directory doesn't include all the packages needed to form a
>>>>>>>> proper parcel for distribution.
>>>>>>>>
>>>>>>>> Any help is much appreciated!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Rishi Shah
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rishi Shah
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Patrick McCarthy*
>>>>>>
>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>
>>>>>> Dstillery
>>>>>>
>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> Rishi Shah
>>>>
>>>
>>> --
>>> *Patrick McCarthy*
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>
> --
> *Patrick McCarthy*
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
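On the packaging question, the --archives route Patrick suggests upthread
usually looks something like the following sketch. The environment name,
package versions, and paths are all assumptions, and conda-pack is just one
way to make the environment relocatable:

    # Build a relocatable conda env with pyarrow, plus mgrs from PyPI.
    conda create -y -n mgrs_env -c conda-forge python=3.6 pyarrow=0.12
    source activate mgrs_env && pip install mgrs
    conda pack -n mgrs_env -o mgrs_env.tar.gz

    # Stage the archive on HDFS so executors fetch it locally instead of
    # re-uploading it from the client on every submit.
    hdfs dfs -put mgrs_env.tar.gz /user/me/envs/

    # Point the Python workers at the unpacked env (client deploy mode).
    export PYSPARK_PYTHON=./environment/bin/python
    spark-submit \
      --master yarn \
      --archives hdfs:///user/me/envs/mgrs_env.tar.gz#environment \
      my_job.py

In YARN cluster mode the driver also runs on the cluster, so
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python would
typically be set via --conf instead of the exported variable.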