Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Wenchen Fan
We really need some documents to define what non-deterministic means. AFAIK, non-deterministic expressions may produce a different result for the same input row, if the already processed input rows are different. The optimizer tries its best to not change the input sequence of non-deterministic ex

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Takuya UESHIN
+1 On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp wrote: > +1 > > On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon wrote: > > > > +1 > > > > On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan wrote: > >> > >> Sounds reasonable to me. We should make the behavior consistent within > Spark. > >> > >> On Tue, Nov 5, 20

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Shane Knapp
+1 On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon wrote: > > +1 > > On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan wrote: >> >> Sounds reasonable to me. We should make the behavior consistent within Spark. >> >> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: >>> >>> Currently, when a PySpark Row is cre

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Hyukjin Kwon
+1 On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan wrote: > Sounds reasonable to me. We should make the behavior consistent within > Spark. > > On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > >> Currently, when a PySpark Row is created with keyword arguments, the >> fields are sorted alphabetically.

[ANNOUNCE] Announcing Apache Spark 3.0.0-preview

2019-11-07 Thread Xingbo Jiang
Hi all, To enable wide-scale community testing of the upcoming Spark 3.0 release, the Apache Spark community has posted a preview release of Spark 3.0. This preview is *not a stable release in terms of either API or functionality*, but it is meant to give the community early access to try the code

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Rubén Berenguel
That was very interesting, thanks Enrico. Sean, IIRC it also prevents push down of the UDF in Catalyst in some cases. Regards, Ruben > On 7 Nov 2019, at 11:09, Sean Owen wrote: > > Interesting, what does non-deterministic do except have this effect? > aside from the naming, it could be a fin

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Sean Owen
Interesting, what does non-deterministic do except have this effect? Aside from the naming, it could be a fine use of this flag if that's all it effectively does. I'm not sure I'd introduce another flag with the same semantics just over naming. If anything 'expensive' also isn't the right word, mor

[DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Enrico Minack
Hi all, Running expensive deterministic UDFs that return complex types, followed by multiple references to those results, causes Spark to evaluate the UDF multiple times per row. This has been reported and discussed before: SPARK-18748 SPARK-17728     val f: Int => Array[Int]     val udfF = ud