Hence, what I mentioned initially does sound correct?

On Mon, May 6, 2019 at 5:43 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> Hi,
>
> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy
> <pmccar...@dstillery.com.invalid> wrote:
> >
> > Thanks Gourav.
> >
> > Incidentally, since the regular UDF is row-wise, we could optimize that
> > a bit by taking the convert() closure and simply making that the UDF.
> >
> > Since there's that MGRS object that we have to create too, we could
> > probably optimize it further by applying the UDF via rdd.mapPartitions,
> > which would allow the UDF to instantiate objects once per partition
> > instead of per row and then iterate element-wise through the rows of
> > the partition.
> >
> > All that said, having done the above on prior projects, I find the
> > pandas abstractions to be very elegant and friendly to the end user,
> > so I haven't looked back :)
> >
> > (The common memory model via Arrow is a nice boost too!)
>
> And some tentative SPIPs that want to use columnar representations
> internally in Spark should also add some good performance in the
> future.
>
> Cheers
> Andrew
>
> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >>
> >> The proof is in the pudding :)
> >>
> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >>>
> >>> Hi Patrick,
> >>>
> >>> Super duper, thanks a ton for sharing the code. Can you please
> >>> confirm that this runs faster than the regular UDFs?
> >>>
> >>> Interestingly, I am also running the same transformations using
> >>> another geospatial library in Python, where I am passing two fields
> >>> and getting back an array.
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >>>>
> >>>> Human time is considerably more expensive than computer time, so
> >>>> in that regard, yes :)
> >>>>
> >>>> This took me one minute to write and ran fast enough for my needs.
> >>>> If you're willing to provide a comparable Scala implementation I'd
> >>>> be happy to compare them.
> >>>>
> >>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> >>>> def generate_mgrs_series(lat_lon_str, level):
> >>>>     import mgrs
> >>>>     m = mgrs.MGRS()
> >>>>
> >>>>     precision_level = 0
> >>>>     levelval = level[0]
> >>>>
> >>>>     if levelval == 1000:
> >>>>         precision_level = 2
> >>>>     if levelval == 100:
> >>>>         precision_level = 3
> >>>>
> >>>>     def convert(ll_str):
> >>>>         lat, lon = ll_str.split('_')
> >>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
> >>>>
> >>>>     return lat_lon_str.apply(lambda x: convert(x))
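For comparison, a minimal sketch (not from the thread) of the rdd.mapPartitions alternative Patrick describes above. A DataFrame df with the same 'lat'_'lon' string column is assumed, and the fixed precision level and helper name are illustrative:

    def convert_partition(rows):
        # Instantiate the MGRS object once per partition, not once per
        # row, then iterate element-wise through the partition's rows.
        import mgrs
        m = mgrs.MGRS()
        for row in rows:
            lat, lon = row['lat_lon_str'].split('_')
            # Precision 2 is a hypothetical choice for this sketch; the
            # float casts guard against string inputs.
            yield (row['lat_lon_str'],
                   m.toMGRS(float(lat), float(lon), MGRSPrecision=2))

    mgrs_rdd = df.select('lat_lon_str').rdd.mapPartitions(convert_partition)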
> >>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >>>>>
> >>>>> And you found the PANDAS UDF more performant? Can you share your
> >>>>> code and prove it?
> >>>>>
> >>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >>>>>>
> >>>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala
> >>>>>> performance-wise, but for Python-based data scientists or others
> >>>>>> with a lot of Python expertise it allows one to do things that
> >>>>>> would otherwise be infeasible at scale.
> >>>>>>
> >>>>>> For instance, I recently had to convert latitude/longitude pairs
> >>>>>> to MGRS strings
> >>>>>> (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
> >>>>>> Writing a pandas UDF (and putting the mgrs python package into a
> >>>>>> conda environment) was _significantly_ easier than any
> >>>>>> alternative I found.
> >>>>>>
> >>>>>> @Rishi - depending on how your network is constructed, some lag
> >>>>>> could come from just uploading the conda environment. If you
> >>>>>> load it from HDFS with --archives, does it improve?
> >>>>>>
> >>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use
> >>>>>>> case of adding 1 to a field using a Pandas UDF, which is pretty
> >>>>>>> much pointless. So you go beyond the blog and realise that your
> >>>>>>> actual use case is more than adding one :) and the reality hits
> >>>>>>> you.
> >>>>>>>
> >>>>>>> Pandas UDFs in certain scenarios are actually slow; try using
> >>>>>>> apply with a custom or pandas function. In fact, in certain
> >>>>>>> scenarios I have found general UDFs to work much faster and use
> >>>>>>> much less memory. Therefore, test out your use case (with at
> >>>>>>> least 30 million records) before settling on the Pandas UDF
> >>>>>>> option.
> >>>>>>>
> >>>>>>> And when you start using GroupMap, you realise after reading
> >>>>>>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
> >>>>>>> that "Oh!! Now I can run into random OOM errors, and the
> >>>>>>> maxRecordsPerBatch option does not help at all."
> >>>>>>>
> >>>>>>> Excerpt from the above link:
> >>>>>>> Note that all data for a group will be loaded into memory
> >>>>>>> before the function is applied. This can lead to out of memory
> >>>>>>> exceptions, especially if the group sizes are skewed. The
> >>>>>>> configuration for maxRecordsPerBatch is not applied on groups
> >>>>>>> and it is up to the user to ensure that the grouped data will
> >>>>>>> fit into the available memory.
> >>>>>>>
> >>>>>>> Let me know about your use case if possible.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gourav
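To make the caveat in that excerpt concrete, here is a minimal grouped-map sketch modelled on the example in the linked documentation, assuming a DataFrame df with columns id and v. Each pdf the function receives is an entire group materialized as a single in-memory pandas DataFrame, which maxRecordsPerBatch does not limit:

    from pyspark.sql import functions as F

    @F.pandas_udf('id long, v double', F.PandasUDFType.GROUPED_MAP)
    def subtract_mean(pdf):
        # pdf holds ALL rows of one group at once; a heavily skewed
        # group can therefore exhaust executor memory on its own.
        return pdf.assign(v=pdf.v - pdf.v.mean())

    df.groupby('id').apply(subtract_mean)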
> >>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Thanks Patrick! I tried to package it according to these
> >>>>>>>> instructions, and it got distributed on the cluster; however,
> >>>>>>>> the same Spark program that takes 5 mins without the pandas
> >>>>>>>> UDF has started to take 25 mins...
> >>>>>>>>
> >>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12
> >>>>>>>> supported with Spark 2.3 (according to the documentation, it
> >>>>>>>> should be fine)?
> >>>>>>>>
> >>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Rishi,
> >>>>>>>>>
> >>>>>>>>> I've had success using the approach outlined here:
> >>>>>>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
> >>>>>>>>>
> >>>>>>>>> Does this work for you?
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Modified the subject, and would like to clarify that I am
> >>>>>>>>>> looking to create an anaconda parcel with pyarrow and other
> >>>>>>>>>> libraries, so that I can distribute it on the Cloudera
> >>>>>>>>>> cluster.
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi All,
> >>>>>>>>>>>
> >>>>>>>>>>> I have been trying to figure out a way to build an anaconda
> >>>>>>>>>>> parcel with pyarrow included for distribution on my
> >>>>>>>>>>> Cloudera-managed cluster, but this doesn't seem to work
> >>>>>>>>>>> right. Could someone please help?
> >>>>>>>>>>>
> >>>>>>>>>>> I have tried to install anaconda on one of the management
> >>>>>>>>>>> nodes on the Cloudera cluster... tarred the directory, but
> >>>>>>>>>>> this directory doesn't include all the packages needed to
> >>>>>>>>>>> form a proper parcel for distribution.
> >>>>>>>>>>>
> >>>>>>>>>>> Any help is much appreciated!
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Rishi Shah
> >
> > --
> > Patrick McCarthy
> > Senior Data Scientist, Machine Learning Engineering
> > Dstillery
> > 470 Park Ave South, 17th Floor, NYC 10016
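On the packaging question above, a commonly used alternative to building a full Cloudera parcel is to pack a conda environment and ship it per job with --archives, along the lines of what conda-pack documents for YARN. The environment name and all paths below are hypothetical:

    # On a machine with conda (the env must include pyarrow, pandas, etc.)
    conda pack -n my_pyarrow_env -o my_pyarrow_env.tar.gz
    hdfs dfs -put my_pyarrow_env.tar.gz /user/me/envs/

    # Submit; YARN unpacks the archive on each node as ./environment
    PYSPARK_PYTHON=./environment/bin/python \
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
      --archives hdfs:///user/me/envs/my_pyarrow_env.tar.gz#environment \
      my_job.py

Loading the archive from HDFS rather than a local path also avoids re-uploading the environment on every submit, which is the lag Patrick mentions earlier in the thread.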