Hi,

Pandas UDFs are a bit overhyped. One of the blogs shows the use case of
adding 1 to a field using a Pandas UDF, which is pretty much pointless.
Then you go beyond the blog, realise that your actual use case is more
than adding one :) and reality hits you.

In certain scenarios a Pandas UDF is actually slow; try using apply with
a custom or pandas function and compare. In fact, in certain scenarios I
have found that plain UDFs run much faster and use much less memory.
Therefore, test your use case (with at least 30 million records) before
settling on the Pandas UDF option; rough sketches of the variants I mean
follow below.

And when you start using the grouped map flavour, you read
https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs
and realise: "Oh!! Now I can run into random OOM errors, and the
maxRecordsPerBatch option does not help at all."

Excerpt from the above link:

    Note that all data for a group will be loaded into memory before the
    function is applied. This can lead to out of memory exceptions,
    especially if the group sizes are skewed. The configuration for
    maxRecordsPerBatch
    <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size>
    is not applied on groups and it is up to the user to ensure that the
    grouped data will fit into the available memory.
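To make the comparison concrete, here is a minimal sketch of the two
variants I mean, in Spark 2.3 style (the function names, the DataFrame
df, and its column "value" are invented for illustration, not from any
particular benchmark):

from pyspark.sql.functions import udf, pandas_udf, PandasUDFType

# Plain row-at-a-time UDF: Python is invoked once per row.
@udf("double")
def plain_double(v):
    return None if v is None else v * 2.0

# Scalar Pandas UDF: Python is invoked once per Arrow batch,
# receiving and returning a pandas Series.
@pandas_udf("double", PandasUDFType.SCALAR)
def vectorised_double(s):
    return s * 2.0

# Time both on a realistic volume, forcing evaluation with an
# aggregate (df and "value" are hypothetical):
#   df.select(plain_double("value").alias("x2")).agg({"x2": "sum"}).show()
#   df.select(vectorised_double("value").alias("x2")).agg({"x2": "sum"}).show()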
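For the grouped map case, this is the shape that triggers the caveat in
the excerpt: every row of a group arrives as one pandas DataFrame on a
single executor, so a heavily skewed key can exhaust executor memory no
matter what maxRecordsPerBatch is set to. A minimal sketch, with an
invented schema and invented demeaning logic:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Grouped map Pandas UDF (Spark 2.3+): the return schema is declared
# up front, and the function maps one pandas DataFrame per group.
@pandas_udf("id long, value double", PandasUDFType.GROUPED_MAP)
def demean(pdf):
    # pdf holds every row of one group in memory; with a skewed key
    # this single DataFrame can exceed what the executor can hold.
    return pdf.assign(value=pdf.value - pdf.value.mean())

# Hypothetical usage, for a df with columns (id, value):
#   result = df.groupby("id").apply(demean)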
Let me know about your use case if possible.

Regards,
Gourav

On Sun, May 5, 2019 at 3:59 AM Rishi Shah <rishishah.s...@gmail.com> wrote:

> Thanks Patrick! I tried to package it according to these instructions;
> it got distributed on the cluster, however the same Spark program that
> takes 5 mins without the Pandas UDF has started to take 25 mins...
>
> Have you experienced anything like this? Also, is PyArrow 0.12 supported
> with Spark 2.3 (according to the documentation, it should be fine)?
>
> On Tue, Apr 30, 2019 at 9:35 AM Patrick McCarthy <pmccar...@dstillery.com>
> wrote:
>
>> Hi Rishi,
>>
>> I've had success using the approach outlined here:
>> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>>
>> Does this work for you?
>>
>> On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah <rishishah.s...@gmail.com>
>> wrote:
>>
>>> Modified the subject; to clarify, I am looking to create an Anaconda
>>> parcel with PyArrow and other libraries, so that I can distribute it
>>> on the Cloudera cluster.
>>>
>>> On Tue, Apr 30, 2019 at 12:21 AM Rishi Shah <rishishah.s...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have been trying to figure out a way to build an Anaconda parcel
>>>> with PyArrow included, for distribution on my Cloudera-managed
>>>> cluster, but this doesn't seem to work right. Could someone please
>>>> help?
>>>>
>>>> I have tried installing Anaconda on one of the management nodes of
>>>> the Cloudera cluster and tarring the directory, but that directory
>>>> doesn't include all the packages needed to form a proper parcel for
>>>> distribution.
>>>>
>>>> Any help is much appreciated!
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Rishi Shah
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>
>> --
>> *Patrick McCarthy*
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>
> --
> Regards,
>
> Rishi Shah