And you found the Pandas UDF more performant? Can you share your code and
prove it?
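For reference, the kind of scalar Pandas UDF Patrick describes below
(lat/long to MGRS) might look like the following. This is a minimal
sketch, not his actual code: it assumes the mgrs package's
MGRS().toMGRS(lat, lon) API, and the DataFrame and column names are
hypothetical.

    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StringType
    import pandas as pd

    @pandas_udf(StringType(), PandasUDFType.SCALAR)
    def to_mgrs(lat, lon):
        # Import inside the UDF so each worker resolves mgrs from the
        # shipped conda environment, not the driver's interpreter.
        import mgrs
        m = mgrs.MGRS()
        # toMGRS returns bytes in some mgrs releases, hence the decode.
        out = []
        for la, lo in zip(lat, lon):
            s = m.toMGRS(la, lo)
            out.append(s.decode() if isinstance(s, bytes) else s)
        return pd.Series(out)

    # Hypothetical input columns:
    df = df.withColumn("mgrs", to_mgrs("latitude", "longitude"))

Note the per-row Python loop: the win over a plain UDF here is Arrow
serialization and batching, not vectorized math, which is consistent
with the mixed benchmark results discussed below.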
On Sun, May 5, 2019 at 9:24 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:

> I disagree that it's hype. Perhaps not 1:1 with pure Scala
> performance-wise, but for Python-based data scientists or others with a
> lot of Python expertise it allows one to do things that would otherwise
> be infeasible at scale.
>
> For instance, I recently had to convert latitude/longitude pairs to MGRS
> strings (https://en.wikipedia.org/wiki/Military_Grid_Reference_System).
> Writing a pandas UDF (and putting the mgrs python package into a conda
> environment) was _significantly_ easier than any alternative I found.
>
> @Rishi - depending on how your network is constructed, some lag could
> come from just uploading the conda environment. If you load it from HDFS
> with --archives, does it improve?
>
> On Sun, May 5, 2019 at 2:15 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> Pandas UDFs are a bit of hype. One of their blogs shows the use case of
>> adding 1 to a field using a Pandas UDF, which is pretty much pointless.
>> So you go beyond the blog, realise that your actual use case involves
>> more than adding one :) and the reality hits you.
>>
>> Pandas UDFs are actually slow in certain scenarios; try using apply
>> with a custom or a pandas function. In fact, in certain scenarios I
>> have found that general UDFs work much faster and use much less memory.
>> Therefore, test out your use case (with at least 30 million records)
>> before committing to the Pandas UDF option.
>>
>> And when you start using GroupMap, you realise after reading
>> https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
>> that "Oh!! now I can run into random OOM errors and the
>> maxRecordsPerBatch option does not help at all".
>>
>> Excerpt from the above link:
>> Note that all data for a group will be loaded into memory before the
>> function is applied. This can lead to out of memory exceptions,
>> especially if the group sizes are skewed. The configuration for
>> maxRecordsPerBatch
>> <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
>> is not applied on groups and it is up to the user to ensure that the
>> grouped data will fit into the available memory.
>>
>> Let me know about your use case if possible.
>>
>> Regards,
>> Gourav
>>
>> On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>
>>> Thanks Patrick! I tried to package it according to these instructions;
>>> it got distributed on the cluster, however the same Spark program that
>>> takes 5 mins without the pandas UDF has started to take 25 mins...
>>>
>>> Have you experienced anything like this? Also, is PyArrow 0.12
>>> supported with Spark 2.3 (according to the documentation, it should be
>>> fine)?
>>>
>>> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>
>>>> Hi Rishi,
>>>>
>>>> I've had success using the approach outlined here:
>>>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>>>
>>>> Does this work for you?
>>>>
>>>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>
>>>>> Modified the subject, and I would like to clarify that I am looking
>>>>> to create an Anaconda parcel with pyarrow and other libraries, so
>>>>> that I can distribute it on the Cloudera cluster...
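The grouped-map pattern Gourav warns about above looks roughly like
this; a sketch only, with a hypothetical key/value schema and the
subtract-mean body from the Spark docs. Note that maxRecordsPerBatch
bounds Arrow batches for scalar UDFs but, per the excerpt quoted above,
does nothing for groups:

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Caps Arrow batch size for scalar Pandas UDFs only; each
    # grouped-map group is still materialized in memory in full, so
    # skewed groups can OOM regardless of this setting.
    spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

    @pandas_udf("key string, value double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # pdf is ALL rows for one key as a single pandas DataFrame.
        pdf["value"] = pdf["value"] - pdf["value"].mean()
        return pdf

    result = df.groupBy("key").apply(demean)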
>>>>>
>>>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have been trying to figure out a way to build an Anaconda parcel
>>>>>> with pyarrow included for distribution on my Cloudera-managed
>>>>>> cluster, but this doesn't seem to work right. Could someone please
>>>>>> help?
>>>>>>
>>>>>> I have tried to install Anaconda on one of the management nodes on
>>>>>> the Cloudera cluster... tarred the directory, but this directory
>>>>>> doesn't include all the packages needed to form a proper parcel for
>>>>>> distribution.
>>>>>>
>>>>>> Any help is much appreciated!
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>>
>>>>>> Rishi Shah
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> Rishi Shah
>>>>
>>>>
>>>> --
>>>> Patrick McCarthy
>>>> Senior Data Scientist, Machine Learning Engineering
>>>> Dstillery
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>
>
> --
> Patrick McCarthy
> Senior Data Scientist, Machine Learning Engineering
> Dstillery
> 470 Park Ave South, 17th Floor, NYC 10016
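On the --archives suggestion earlier in the thread, the usual YARN
wiring for a pre-built environment looks roughly like this. A sketch
under assumptions: the HDFS path and archive name are hypothetical, and
the same settings are more commonly passed as spark-submit flags
(--archives plus --conf spark.pyspark.python=...):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("pandas-udf-job")
        # Ship the packed conda env from HDFS; YARN unpacks it on each
        # executor under the alias after the '#'.
        .config("spark.yarn.dist.archives",
                "hdfs:///user/me/envs/pyarrow_env.tar.gz#pyarrow_env")
        # Point the Python workers at the interpreter in the archive.
        .config("spark.pyspark.python", "./pyarrow_env/bin/python")
        .getOrCreate()
    )

Loading the archive from HDFS this way avoids re-uploading the
environment from the client on every submit, which is the lag Patrick
suspects above.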