Hi Andrew,

Do not misrepresent my statements. I mentioned that it depends on the use case; I NEVER (note the word "never") said that Pandas UDFs are ALWAYS (note the word "always") slow.
Regards,
Gourav Sengupta

On Mon, May 6, 2019 at 6:00 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> Hi,
>
> On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >
> > Hence, what I mentioned initially does sound correct?
>
> I don't agree at all - we've had a significant boost from moving from regular UDFs to pandas UDFs. YMMV, of course.
>
> > On Mon, May 6, 2019 at 5:43 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:
> >> >
> >> > Thanks Gourav.
> >> >
> >> > Incidentally, since the regular UDF is row-wise, we could optimize that a bit by taking the convert() closure and simply making that the UDF.
> >> >
> >> > Since there's that MGRS object that we have to create too, we could probably optimize it further by applying the function via rdd.mapPartitions, which would allow it to instantiate objects once per partition instead of per row and then iterate element-wise through the rows of the partition (a sketch of this pattern appears further down the thread).
> >> >
> >> > All that said, having done the above on prior projects, I find the pandas abstractions to be very elegant and friendly to the end user, so I haven't looked back :)
> >> >
> >> > (The common memory model via Arrow is a nice boost too!)
> >>
> >> And some tentative SPIPs that want to use columnar representations internally in Spark should also add some good performance in the future.
> >>
> >> Cheers
> >> Andrew
> >>
> >> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>
> >> >> The proof is in the pudding :)
> >> >>
> >> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>>
> >> >>> Hi Patrick,
> >> >>>
> >> >>> Super duper, thanks a ton for sharing the code. Can you please confirm that this runs faster than the regular UDFs?
> >> >>>
> >> >>> Interestingly, I am also running the same transformations using another geospatial library in Python, where I am passing two fields and getting back an array.
> >> >>>
> >> >>> Regards,
> >> >>> Gourav Sengupta
> >> >>>
> >> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >> >>>>
> >> >>>> Human time is considerably more expensive than computer time, so in that regard, yes :)
> >> >>>>
> >> >>>> This took me one minute to write and ran fast enough for my needs. If you're willing to provide a comparable Scala implementation, I'd be happy to compare them.
> >> >>>>
> >> >>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> >> >>>> def generate_mgrs_series(lat_lon_str, level):
> >> >>>>     import mgrs
> >> >>>>     m = mgrs.MGRS()
> >> >>>>
> >> >>>>     precision_level = 0
> >> >>>>     levelval = level[0]
> >> >>>>
> >> >>>>     if levelval == 1000:
> >> >>>>         precision_level = 2
> >> >>>>     if levelval == 100:
> >> >>>>         precision_level = 3
> >> >>>>
> >> >>>>     def convert(ll_str):
> >> >>>>         lat, lon = ll_str.split('_')
> >> >>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
> >> >>>>
> >> >>>>     return lat_lon_str.apply(lambda x: convert(x))
> >> >>>>
> >> >>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>>>>
> >> >>>>> And you found the PANDAS UDF more performant? Can you share your code and prove it?
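A minimal sketch of the rdd.mapPartitions pattern Patrick describes above, for readers who want it spelled out. This is hypothetical code, not from the thread; it assumes the same '<lat>_<lon>' string format and a fixed precision level for brevity. The point is that mgrs.MGRS() is constructed once per partition rather than once per row:

import mgrs

def convert_partition(ll_strs, precision_level=2):
    # Build the MGRS converter once for the whole partition...
    m = mgrs.MGRS()
    # ...then iterate element-wise through the partition's rows.
    for ll_str in ll_strs:
        lat, lon = ll_str.split('_')
        # The float() casts are an assumption; toMGRS may expect numeric inputs.
        yield m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level)

# Hypothetical usage: df has a string column 'lat_lon_str'.
mgrs_rdd = (df.select('lat_lon_str').rdd
            .map(lambda row: row['lat_lon_str'])
            .mapPartitions(convert_partition))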
> >> >>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >> >>>>>>
> >> >>>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala performance-wise, but for Python-based data scientists or others with a lot of Python expertise, it allows one to do things that would otherwise be infeasible at scale.
> >> >>>>>>
> >> >>>>>> For instance, I recently had to convert latitude/longitude pairs to MGRS strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a pandas UDF (and putting the mgrs Python package into a conda environment) was _significantly_ easier than any alternative I found.
> >> >>>>>>
> >> >>>>>> @Rishi - depending on how your network is constructed, some lag could come from just uploading the conda environment. If you load it from HDFS with --archives, does it improve? (A sketch of this appears further down the thread.)
> >> >>>>>>
> >> >>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use case of adding 1 to a field using a Pandas UDF, which is pretty much pointless. So you go beyond the blog and realise that your actual use case is more than adding one :) and the reality hits you.
> >> >>>>>>>
> >> >>>>>>> Pandas UDF in certain scenarios is actually slow - try using apply for a custom or pandas function. In fact, in certain scenarios I have found general UDFs work much faster and use much less memory. Therefore, test out your use case (with at least 30 million records) before settling on the Pandas UDF option.
> >> >>>>>>>
> >> >>>>>>> And when you start using GroupMap, you realise after reading https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs that "Oh!! now I can run into random OOM errors and the maxRecordsPerBatch option does not help at all".
> >> >>>>>>>
> >> >>>>>>> Excerpt from the above link: "Note that all data for a group will be loaded into memory before the function is applied. This can lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied on groups and it is up to the user to ensure that the grouped data will fit into the available memory."
> >> >>>>>>>
> >> >>>>>>> Let me know about your use case if possible.
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> Gourav
> >> >>>>>>>
> >> >>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Thanks Patrick! I tried to package it according to these instructions; it got distributed on the cluster, however the same Spark program that takes 5 mins without the pandas UDF has started to take 25 mins...
> >> >>>>>>>>
> >> >>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12 supported with Spark 2.3 (according to the documentation, it should be fine)?
> >> >>>>>>>>
> >> >>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Hi Rishi,
> >> >>>>>>>>>
> >> >>>>>>>>> I've had success using the approach outlined here: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
> >> >>>>>>>>>
> >> >>>>>>>>> Does this work for you?
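A rough sketch of the --archives route Patrick mentions above, expressed as SparkSession configuration (spark.yarn.dist.archives is the configuration equivalent of the spark-submit --archives flag). The environment name and paths are placeholders; it assumes the conda environment has already been packed into a tarball and uploaded to HDFS, and YARN unpacks the archive on each node under the alias given after the '#':

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('yarn')
    # Ship the packed conda env from HDFS instead of uploading it per job.
    .config('spark.yarn.dist.archives',
            'hdfs:///user/me/envs/pyarrow_env.tar.gz#environment')
    # Point the Python workers (and the AM, for cluster mode) at the unpacked env.
    .config('spark.executorEnv.PYSPARK_PYTHON', './environment/bin/python')
    .config('spark.yarn.appMasterEnv.PYSPARK_PYTHON', './environment/bin/python')
    .getOrCreate()
)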
> >> >>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>> Modified the subject, and I would like to clarify that I am looking to create an Anaconda parcel with PyArrow and other libraries, so that I can distribute it on the Cloudera cluster.
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Hi All,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I have been trying to figure out a way to build an Anaconda parcel with PyArrow included for my Cloudera-managed cluster for distribution, but this doesn't seem to work right. Could someone please help?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I have tried to install Anaconda on one of the management nodes on the Cloudera cluster and tarred the directory, but this directory doesn't include all the packages needed to form a proper parcel for distribution.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Any help is much appreciated!
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Regards,
> >> >>>>>>>>>>> Rishi Shah
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Patrick McCarthy
> >> >>>>>>>>> Senior Data Scientist, Machine Learning Engineering
> >> >>>>>>>>> Dstillery
> >> >>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
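For anyone who wants to run the head-to-head comparison Gourav asks for, here is a minimal sketch of a row-wise UDF equivalent of Patrick's pandas UDF. This is hypothetical code, not from the thread; it assumes the same '<lat>_<lon>' input format and takes the precision level as a plain argument. Unlike the pandas version, the MGRS object is created on every call, which is exactly the per-row overhead discussed above:

from pyspark.sql import functions as F, types as T

def make_mgrs_udf(precision_level):
    def convert(ll_str):
        import mgrs      # resolved on the executor
        m = mgrs.MGRS()  # built once per row - the cost under debate
        lat, lon = ll_str.split('_')
        # The float() casts are an assumption; toMGRS may expect numeric inputs.
        return m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level)
    return F.udf(convert, T.StringType())

# Hypothetical usage, to time against generate_mgrs_series on the same data:
# df = df.withColumn('mgrs', make_mgrs_udf(2)(F.col('lat_lon_str')))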