Hi Andrew,

Do not misrepresent my statements. I mentioned that it depends on the use case; I NEVER (note the word "never") said that Pandas UDFs are ALWAYS (note the word "always") slow.
Regards,
Gourav Sengupta

On Mon, May 6, 2019 at 6:00 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> Hi,
>
> On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >
> > Hence, what I mentioned initially does sound correct?
>
> I don't agree at all - we've had a significant boost from moving from regular UDFs to pandas UDFs. YMMV, of course.
>
> > On Mon, May 6, 2019 at 5:43 PM Andrew Melo <andrew.m...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:
> >> >
> >> > Thanks Gourav.
> >> >
> >> > Incidentally, since the regular UDF is row-wise, we could optimize that a bit by taking the convert() closure and simply making that the UDF.
> >> >
> >> > Since there's that MGRS object that we have to create too, we could probably optimize it further by applying the function via rdd.mapPartitions, which would allow it to instantiate objects once per partition instead of per row and then iterate element-wise through the rows of the partition (a sketch of this pattern appears further down the thread).
> >> >
> >> > All that said, having done the above on prior projects, I find the pandas abstractions to be very elegant and friendly to the end user, so I haven't looked back :)
> >> >
> >> > (The common memory model via Arrow is a nice boost too!)
> >>
> >> And some tentative SPIPs that want to use columnar representations internally in Spark should also add some good performance in the future.
> >>
> >> Cheers
> >> Andrew
> >>
> >> > On Mon, May 6, 2019 at 11:13 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>
> >> >> The proof is in the pudding :)
> >> >>
> >> >> On Mon, May 6, 2019 at 2:46 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>>
> >> >>> Hi Patrick,
> >> >>>
> >> >>> Super duper, thanks a ton for sharing the code. Can you please confirm that this runs faster than the regular UDFs?
> >> >>>
> >> >>> Interestingly, I am also running the same transformations using another geospatial library in Python, where I am passing two fields and getting back an array.
> >> >>>
> >> >>> Regards,
> >> >>> Gourav Sengupta
> >> >>>
> >> >>> On Mon, May 6, 2019 at 2:00 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >> >>>>
> >> >>>> Human time is considerably more expensive than computer time, so in that regard, yes :)
> >> >>>>
> >> >>>> This took me one minute to write and ran fast enough for my needs. If you're willing to provide a comparable Scala implementation, I'd be happy to compare them.
> >> >>>>
> >> >>>> @F.pandas_udf(T.StringType(), F.PandasUDFType.SCALAR)
> >> >>>> def generate_mgrs_series(lat_lon_str, level):
> >> >>>>     import mgrs
> >> >>>>     m = mgrs.MGRS()
> >> >>>>
> >> >>>>     precision_level = 0
> >> >>>>     levelval = level[0]
> >> >>>>
> >> >>>>     if levelval == 1000:
> >> >>>>         precision_level = 2
> >> >>>>     if levelval == 100:
> >> >>>>         precision_level = 3
> >> >>>>
> >> >>>>     def convert(ll_str):
> >> >>>>         lat, lon = ll_str.split('_')
> >> >>>>         return m.toMGRS(lat, lon, MGRSPrecision=precision_level)
> >> >>>>
> >> >>>>     return lat_lon_str.apply(lambda x: convert(x))
> >> >>>>
> >> >>>> On Mon, May 6, 2019 at 8:23 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>>>>
> >> >>>>> And you found the PANDAS UDF more performant? Can you share your code and prove it?
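A minimal sketch of the rdd.mapPartitions pattern Patrick describes above, for readers who want it spelled out. This is hypothetical code, not from the thread; it assumes the same '<lat>_<lon>' string format and a fixed precision level for brevity. The point is that mgrs.MGRS() is constructed once per partition rather than once per row:

import mgrs

def convert_partition(ll_strs, precision_level=2):
    # Build the MGRS converter once for the whole partition...
    m = mgrs.MGRS()
    # ...then iterate element-wise through the partition's rows.
    for ll_str in ll_strs:
        lat, lon = ll_str.split('_')
        # The float() casts are an assumption; toMGRS may expect numeric inputs.
        yield m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level)

# Hypothetical usage: df has a string column 'lat_lon_str'.
mgrs_rdd = (df.select('lat_lon_str').rdd
            .map(lambda row: row['lat_lon_str'])
            .mapPartitions(convert_partition))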
> >> >>>>> On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >> >>>>>>
> >> >>>>>> I disagree that it's hype. Perhaps not 1:1 with pure Scala performance-wise, but for Python-based data scientists or others with a lot of Python expertise, it allows one to do things that would otherwise be infeasible at scale.
> >> >>>>>>
> >> >>>>>> For instance, I recently had to convert latitude/longitude pairs to MGRS strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System). Writing a pandas UDF (and putting the mgrs Python package into a conda environment) was _significantly_ easier than any alternative I found.
> >> >>>>>>
> >> >>>>>> @Rishi - depending on how your network is constructed, some lag could come from just uploading the conda environment. If you load it from HDFS with --archives, does it improve? (A sketch of this appears further down the thread.)
> >> >>>>>>
> >> >>>>>> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> >> >>>>>>>
> >> >>>>>>> Hi,
> >> >>>>>>>
> >> >>>>>>> Pandas UDF is a bit of hype. One of their blogs shows the use case of adding 1 to a field using a Pandas UDF, which is pretty much pointless. So you go beyond the blog and realise that your actual use case is more than adding one :) and the reality hits you.
> >> >>>>>>>
> >> >>>>>>> Pandas UDF in certain scenarios is actually slow - try using apply for a custom or pandas function. In fact, in certain scenarios I have found general UDFs work much faster and use much less memory. Therefore, test out your use case (with at least 30 million records) before settling on the Pandas UDF option.
> >> >>>>>>>
> >> >>>>>>> And when you start using GroupMap, you realise after reading https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs that "Oh!! now I can run into random OOM errors and the maxRecordsPerBatch option does not help at all".
> >> >>>>>>>
> >> >>>>>>> Excerpt from the above link: "Note that all data for a group will be loaded into memory before the function is applied. This can lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for maxRecordsPerBatch is not applied on groups and it is up to the user to ensure that the grouped data will fit into the available memory."
> >> >>>>>>>
> >> >>>>>>> Let me know about your use case if possible.
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> Gourav
> >> >>>>>>>
> >> >>>>>>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>> Thanks Patrick! I tried to package it according to these instructions; it got distributed on the cluster, however the same Spark program that takes 5 mins without the pandas UDF has started to take 25 mins...
> >> >>>>>>>>
> >> >>>>>>>> Have you experienced anything like this? Also, is PyArrow 0.12 supported with Spark 2.3 (according to the documentation, it should be fine)?
> >> >>>>>>>>
> >> >>>>>>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Hi Rishi,
> >> >>>>>>>>>
> >> >>>>>>>>> I've had success using the approach outlined here: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
> >> >>>>>>>>>
> >> >>>>>>>>> Does this work for you?
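A rough sketch of the --archives route Patrick mentions above, expressed as SparkSession configuration (spark.yarn.dist.archives is the configuration equivalent of the spark-submit --archives flag). The environment name and paths are placeholders; it assumes the conda environment has already been packed into a tarball and uploaded to HDFS, and YARN unpacks the archive on each node under the alias given after the '#':

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('yarn')
    # Ship the packed conda env from HDFS instead of uploading it per job.
    .config('spark.yarn.dist.archives',
            'hdfs:///user/me/envs/pyarrow_env.tar.gz#environment')
    # Point the Python workers (and the AM, for cluster mode) at the unpacked env.
    .config('spark.executorEnv.PYSPARK_PYTHON', './environment/bin/python')
    .config('spark.yarn.appMasterEnv.PYSPARK_PYTHON', './environment/bin/python')
    .getOrCreate()
)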
> >> >>>>>>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>> Modified the subject, and I would like to clarify that I am looking to create an Anaconda parcel with PyArrow and other libraries, so that I can distribute it on the Cloudera cluster.
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Hi All,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I have been trying to figure out a way to build an Anaconda parcel with PyArrow included for my Cloudera-managed cluster for distribution, but this doesn't seem to work right. Could someone please help?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I have tried to install Anaconda on one of the management nodes on the Cloudera cluster and tarred the directory, but this directory doesn't include all the packages needed to form a proper parcel for distribution.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Any help is much appreciated!
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Regards,
> >> >>>>>>>>>>> Rishi Shah
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Patrick McCarthy
> >> >>>>>>>>> Senior Data Scientist, Machine Learning Engineering
> >> >>>>>>>>> Dstillery
> >> >>>>>>>>> 470 Park Ave South, 17th Floor, NYC 10016
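For anyone who wants to run the head-to-head comparison Gourav asks for, here is a minimal sketch of a row-wise UDF equivalent of Patrick's pandas UDF. This is hypothetical code, not from the thread; it assumes the same '<lat>_<lon>' input format and takes the precision level as a plain argument. Unlike the pandas version, the MGRS object is created on every call, which is exactly the per-row overhead discussed above:

from pyspark.sql import functions as F, types as T

def make_mgrs_udf(precision_level):
    def convert(ll_str):
        import mgrs      # resolved on the executor
        m = mgrs.MGRS()  # built once per row - the cost under debate
        lat, lon = ll_str.split('_')
        # The float() casts are an assumption; toMGRS may expect numeric inputs.
        return m.toMGRS(float(lat), float(lon), MGRSPrecision=precision_level)
    return F.udf(convert, T.StringType())

# Hypothetical usage, to time against generate_mgrs_series on the same data:
# df = df.withColumn('mgrs', make_mgrs_udf(2)(F.col('lat_lon_str')))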