Hi Patrick,

Super, thanks a ton for sharing the code. Can you please confirm that this
runs faster than the regular UDFs?

Interestingly, I am also running the same transformations using another
geospatial library in Python, where I pass in two fields and get back an
array.

Regards,
Gourav Sengupta
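For anyone wanting to answer the "faster than a regular UDF?" question
empirically: a minimal harness along the lines below would do it. This is a
sketch only; the DataFrame `df`, its `lat_lon` and `level` columns, and the
plain-UDF twin of Patrick's pandas UDF (quoted further down) are assumptions,
not code from this thread.

    import time
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    # A row-at-a-time Python UDF doing the same conversion as the pandas
    # UDF quoted below, so the two can be timed on identical data.
    @F.udf(T.StringType())
    def generate_mgrs_plain(lat_lon_str, level):
        import mgrs
        m = mgrs.MGRS()
        precision = {1000: 2, 100: 3}.get(level, 0)
        lat, lon = lat_lon_str.split('_')
        return m.toMGRS(float(lat), float(lon), MGRSPrecision=precision)

    def time_full_scan(df):
        # foreach computes every row and discards the output, so the
        # measurement is not skewed by collect() or display limits.
        start = time.perf_counter()
        df.foreach(lambda _: None)
        return time.perf_counter() - start

    # Example comparison (column names are assumptions):
    # t_pandas = time_full_scan(df.select(generate_mgrs_series('lat_lon', 'level')))
    # t_plain  = time_full_scan(df.select(generate_mgrs_plain('lat_lon', 'level')))

Running both over the same input of realistic size (Gourav suggests at least
30 million records below) gives a like-for-like wall-clock comparison.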
On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:

> Human time is considerably more expensive than computer time, so in that
> regard, yes :)
>
> This took me one minute to write and ran fast enough for my needs. If
> you're willing to provide a comparable Scala implementation I'd be happy
> to compare them.
>
> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> def generate_mgrs_series(lat_lon_str, level):
>     # Import inside the UDF so each executor resolves mgrs from its
>     # own environment.
>     import mgrs
>     m = mgrs.MGRS()
>
>     # `level` arrives as a pandas Series; this assumes it is constant
>     # within each batch.
>     levelval = level[0]
>     precision_level = 0
>     if levelval == 1000:
>         precision_level = 2
>     if levelval == 100:
>         precision_level = 3
>
>     def convert(ll_str):
>         # toMGRS expects numeric lat/lon, hence the float() casts.
>         lat, lon = ll_str.split('_')
>         return m.toMGRS(float(lat), float(lon),
>                         MGRSPrecision=precision_level)
>
>     return lat_lon_str.apply(convert)
>
> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> And you found the pandas UDF more performant? Can you share your code
>> and prove it?
>>
>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com>
>> wrote:
>>
>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala
>>> performance-wise, but for Python-based data scientists or others with
>>> a lot of Python expertise it allows one to do things that would
>>> otherwise be infeasible at scale.
>>>
>>> For instance, I recently had to convert latitude/longitude pairs to
>>> MGRS strings
>>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
>>> Writing a pandas UDF (and putting the mgrs Python package into a conda
>>> environment) was _significantly_ easier than any alternative I found.
>>>
>>> @Rishi - depending on how your network is constructed, some lag could
>>> come from just uploading the conda environment. If you load it from
>>> HDFS with --archives, does it improve?
>>>
>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta
>>> <gourav.sengu...@gmail.com> wrote:
>>>
>>>> hi,
>>>>
>>>> Pandas UDFs are a bit of hype. One of the blog posts shows the use
>>>> case of adding 1 to a field using a pandas UDF, which is pretty much
>>>> pointless. So you go beyond the blog and realise that your actual use
>>>> case is more than adding one :) and the reality hits you.
>>>>
>>>> Pandas UDFs in certain scenarios are actually slow; try using apply
>>>> with a custom or pandas function. In fact, in certain scenarios I
>>>> have found general UDFs to run much faster and use much less memory.
>>>> Therefore, test out your use case (with at least 30 million records)
>>>> before committing to the pandas UDF option.
>>>>
>>>> And when you start using grouped map you realise, after reading
>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs,
>>>> that "Oh!! now I can run into random OOM errors and the
>>>> maxRecordsPerBatch option does not help at all".
>>>>
>>>> Excerpt from the above link:
>>>> Note that all data for a group will be loaded into memory before the
>>>> function is applied. This can lead to out of memory exceptions,
>>>> especially if the group sizes are skewed. The configuration for
>>>> maxRecordsPerBatch
>>>> <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
>>>> is not applied on groups and it is up to the user to ensure that the
>>>> grouped data will fit into the available memory.
>>>>
>>>> Let me know about your use case if possible.
>>>>
>>>> Regards,
>>>> Gourav
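To make the excerpted warning concrete, here is a minimal grouped-map sketch
(Spark 2.3-style API; the DataFrame `df` and its `user_id` column are
assumptions, not code from this thread). Each invocation of the function
receives every row of one group as a single pandas DataFrame, so peak memory
follows the largest group and maxRecordsPerBatch never enters into it:

    import pandas as pd
    import pyspark.sql.functions as F

    # Grouped-map pandas UDF: each call gets ALL rows of one group as
    # one pandas DataFrame, so memory use scales with the largest group.
    @F.pandas_udf("user_id string, n long", F.PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                             "n": [len(pdf)]})

    result = df.groupBy("user_id").apply(summarize)

If one user_id accounts for a large share of the rows, that single pandas
DataFrame alone can exceed executor memory, which is the skew failure mode
the documentation describes.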
>>>>
>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Patrick! I tried to package it according to these
>>>>> instructions; it got distributed on the cluster, however the same
>>>>> Spark program that takes 5 mins without the pandas UDF has started
>>>>> to take 25 mins...
>>>>>
>>>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>>>> supported with Spark 2.3 (according to the documentation, it should
>>>>> be fine)?
>>>>>
>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy
>>>>> <pmccar...@dstillery.com> wrote:
>>>>>
>>>>>> Hi Rishi,
>>>>>>
>>>>>> I've had success using the approach outlined here:
>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>>>
>>>>>> Does this work for you?
>>>>>>
>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah
>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>
>>>>>>> Modified the subject, and would like to clarify that I am looking
>>>>>>> to create an Anaconda parcel with PyArrow and other libraries, so
>>>>>>> that I can distribute it on the Cloudera cluster..
>>>>>>>
>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah
>>>>>>> <rishishah.s...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I have been trying to figure out a way to build an Anaconda
>>>>>>>> parcel with PyArrow included for my Cloudera-managed cluster,
>>>>>>>> but this doesn't seem to work right. Could someone please help?
>>>>>>>>
>>>>>>>> I have tried installing Anaconda on one of the management nodes
>>>>>>>> on the Cloudera cluster and tarring the directory, but that
>>>>>>>> directory doesn't include all the packages needed to form a
>>>>>>>> proper parcel for distribution.
>>>>>>>>
>>>>>>>> Any help is much appreciated!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Rishi Shah
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>>
>>>>>>> Rishi Shah
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Patrick McCarthy*
>>>>>>
>>>>>> Senior Data Scientist, Machine Learning Engineering
>>>>>>
>>>>>> Dstillery
>>>>>>
>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> Rishi Shah
>>>>
>>>
>>> --
>>> *Patrick McCarthy*
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>
> --
> *Patrick McCarthy*
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
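On the packaging question, the --archives route Patrick suggests upthread
usually looks something like the following sketch. The environment name,
package versions, and paths are all assumptions, and conda-pack is just one
way to make the environment relocatable:

    # Build a relocatable conda env with pyarrow, plus mgrs from PyPI.
    conda create -y -n mgrs_env -c conda-forge python=3.6 pyarrow=0.12
    source activate mgrs_env && pip install mgrs
    conda pack -n mgrs_env -o mgrs_env.tar.gz

    # Stage the archive on HDFS so executors fetch it locally instead of
    # re-uploading it from the client on every submit.
    hdfs dfs -put mgrs_env.tar.gz /user/me/envs/

    # Point the Python workers at the unpacked env (client deploy mode).
    export PYSPARK_PYTHON=./environment/bin/python
    spark-submit \
      --master yarn \
      --archives hdfs:///user/me/envs/mgrs_env.tar.gz#environment \
      my_job.py

In YARN cluster mode the driver also runs on the cluster, so
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python would
typically be set via --conf instead of the exported variable.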