Re: PySpark Pandas UDF

2019-11-17 Thread Gourav Sengupta
Hi, sorry, a completely unrelated question: when is the upcoming release of Spark 3.0? There are several parallel distributed deep learning frameworks being developed; do you think we could use Spark 3.0 for distributed deep learning with PyTorch or TensorFlow? Is there any place w

Re: PySpark Pandas UDF

2019-11-17 Thread Bryan Cutler
There was a change in the binary format of Arrow 0.15.1, and there is an environment variable you can set to make pyarrow 0.15.1 compatible with current Spark, which looks to be your problem. Please see the doc below for instructions added in SPARK-29367. Note, this will not be required for the upcom
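For readers following this thread, here is a minimal sketch of the workaround described above. The environment variable named in the Spark SQL guide is ARROW_PRE_0_15_IPC_FORMAT, and the guide scopes it to Spark 2.3.x and 2.4.x with a manually upgraded pyarrow 0.15.0 or later. The helper names below (parse_version, arrow_compat_conf) are hypothetical and only illustrate the version check. Passing the flag to executors through spark.executorEnv is one option; the guide itself suggests conf/spark-env.sh.

```python
import re

# Environment variable named in the Spark SQL guide for pyarrow >= 0.15.0
LEGACY_ARROW_ENV = "ARROW_PRE_0_15_IPC_FORMAT"


def parse_version(text):
    """Turn a string such as '0.15.1' into a tuple such as (0, 15, 1).

    Hypothetical helper: only the leading digits of the first three
    dot-separated parts are used.
    """
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match:
            parts.append(int(match.group()))
    return tuple(parts)


def arrow_compat_conf(pyarrow_version, spark_version):
    """Return Spark conf entries that pass the legacy-format flag to executors.

    Hypothetical helper. The flag is assumed to be needed only when
    pyarrow >= 0.15.0 is used with Spark 2.3.x or 2.4.x.
    """
    new_arrow = parse_version(pyarrow_version) >= (0, 15, 0)
    old_spark = (2, 3, 0) <= parse_version(spark_version) < (3, 0, 0)
    if new_arrow and old_spark:
        return {"spark.executorEnv." + LEGACY_ARROW_ENV: "1"}
    return {}


if __name__ == "__main__":
    # The combination reported in this thread: pyarrow 0.15.1 on a 2.4.x cluster.
    print(arrow_compat_conf("0.15.1", "2.4.4"))
```

The returned entries could be passed with --conf on spark-submit or set on the session builder. The driver process needs the same variable in its own environment, and downgrading pyarrow below 0.15.0 avoids the issue altogether.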

Re: PySpark Pandas UDF

2019-11-12 Thread Holden Karau
Thanks for sharing that. I think we should maybe add some checks around this so it’s easier to debug. I’m CCing Bryan who might have some thoughts. On Tue, Nov 12, 2019 at 7:42 AM gal.benshlomo wrote: > SOLVED! > thanks for the help - I found the issue. it was the version of pyarrow > (0.15.1) w

RE: PySpark Pandas UDF

2019-11-12 Thread gal.benshlomo
SOLVED! Thanks for the help - I found the issue. It was the version of pyarrow (0.15.1), which apparently isn't currently stable. Downgrading it solved the issue for me. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: PySpark Pandas UDF

2019-11-11 Thread gal.benshlomo
Hi, thanks for your reply. I tried what you suggested and am still getting the same error. Also worth mentioning: when I simply write the DataFrame to S3 without applying the function, it works.

Re: PySpark Pandas UDF

2019-11-10 Thread Holden Karau
Can you switch the write for a count just so we can isolate if it’s the write or the count? Also what’s the output path you're using? On Sun, Nov 10, 2019 at 7:31 AM Gal Benshlomo wrote: > > > Hi, > > > > I’m using pandas_udf and not able to run it from cluster mode, even though > the same code wo

RE: PySpark Pandas UDF

2019-11-10 Thread Gal Benshlomo
Hi, I'm using pandas_udf and not able to run it in cluster mode, even though the same code works on standalone. The code is as follows:
    schema_test = StructType([
        StructField("cluster", LongType()),
        StructField("name", StringType())
    ])
    @pandas_udf(schema_test, PandasUDFType.GROU

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Gourav Sengupta
Just try using an apply on a series for a custom function or on any other library. Advertisement and actual delivery are two different skills altogether. Not everyone wants to add a one to their column using the pandas UDF, as one of their links shows :) Most of the actual use cases are more aroun

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Nicolas Paris
Hi Gourav, > And also be aware that pandas UDF does not always lead to better performance > and sometimes even massively slow performance. This information is not widely spread; it is good to know. In which circumstances is it worse than a regular UDF? > With Grouped Map dont you run into the

Re: pySpark - pandas UDF and binaryType

2019-05-03 Thread Gourav Sengupta
Also be aware that a pandas UDF does not always lead to better performance, and is sometimes even massively slower. With Grouped Map, don't you run into the risk of random memory errors as well? On Thu, May 2, 2019 at 9:32 PM Bryan Cutler wrote: > Hi, > > BinaryType support was not added

Re: pySpark - pandas UDF and binaryType

2019-05-02 Thread Bryan Cutler
Hi, BinaryType support was not added until Spark 2.4.0, see https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0 or greater is required, as you saw in the docs. Bryan On Thu, May 2, 2019 at 4:26 AM Nicolas Paris wrote: > Hi all > > I am using pySpark 2.3.0 and pyArrow 0.10.0 >
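The two requirements in this reply (Spark 2.4.0 or later, pyarrow 0.10.0 or later) can be checked up front. This is only a sketch: the helper names are hypothetical, and in a real job the version strings would come from pyspark.__version__ and pyarrow.__version__.

```python
import re


def parse_version(text):
    """Turn a string such as '2.4.0' into a tuple such as (2, 4, 0).

    Hypothetical helper: only the leading digits of the first three
    dot-separated parts are used.
    """
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match:
            parts.append(int(match.group()))
    return tuple(parts)


def binary_type_supported(spark_version, pyarrow_version):
    """Check the two minimums stated in the reply for BinaryType in pandas UDFs."""
    spark_ok = parse_version(spark_version) >= (2, 4, 0)
    arrow_ok = parse_version(pyarrow_version) >= (0, 10, 0)
    return spark_ok and arrow_ok


if __name__ == "__main__":
    # The setup from the original question: pySpark 2.3.0 with pyArrow 0.10.0.
    print(binary_type_supported("2.3.0", "0.10.0"))
    # The same pyarrow with Spark upgraded to 2.4.0.
    print(binary_type_supported("2.4.0", "0.10.0"))
```

With the versions from the original question, the pyarrow requirement is already met and only the Spark version falls short.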