Re: Release Apache Spark 2.4.5

2020-01-07 Thread Takeshi Yamamuro
+1, sorry for the late response... :( Anyway, happy new year, all! Bests, Takeshi On Tue, Jan 7, 2020 at 2:50 AM Dongjoon Hyun wrote: > Thank you all. > > I'll start to check and prepare the 2.4.5 release. > > Bests, > Dongjoon. > > On Sun, Jan 5, 2020 at 22:51 Xiao Li wrote: > >> +1 >> >> Xiao >> >> On

Re: [SPARK-30296][SQL] Add Dataset diffing feature

2020-01-07 Thread Reynold Xin
Can this perhaps exist as a utility function outside Spark? On Tue, Jan 07, 2020 at 12:18 AM, Enrico Minack < m...@enrico.minack.dev > wrote: > > > > Hi Devs, > > > > I'd like to get your thoughts on this Dataset feature proposal. Comparing > datasets is a central operation when regressio

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-07 Thread Long, Andrew
“Where can I find information on how to run standard performance tests/benchmarks?” The gold standard is spark-sql-perf, and in particular the TPC-DS benchmark. Most of the big optimization teams use this as their primary benchmark. One word of warning is that most groups have also extende

Re: Fail to use SparkR of 3.0 preview 2

2020-01-07 Thread Xiao Li
We can use R version 3.6.1, if we have a concern about the quality of 3.6.2? On Thu, Dec 26, 2019 at 8:14 PM Hyukjin Kwon wrote: > I was randomly googling out of curiosity, and seems indeed that's the > problem ( > https://r.789695.n4.nabble.com/Error-in-rbind-info-getNamespaceInfo-env-quot-S3me

Re: [SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-07 Thread Wenchen Fan
I think it's simply because as[T] is lazy. You will see the right schema if you do `df.as[T].map(identity)`. On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack wrote: > Hi Devs, > > I'd like to propose a stricter version of as[T]. Given the interface def > as[T](): Dataset[T], it is counter-intuitiv

[SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-07 Thread Enrico Minack
Hi Devs, I'd like to propose a stricter version of as[T]. Given the interface def as[T](): Dataset[T], it is counter-intuitive that the schema of the returned Dataset[T] is not agnostic to the schema of the originating Dataset. The schema should always be derived only from T. I am proposing

[SPARK-30296][SQL] Add Dataset diffing feature

2020-01-07 Thread Enrico Minack
Hi Devs, I'd like to get your thoughts on this Dataset feature proposal. Comparing datasets is a central operation when regression testing your code changes. It would be super useful if Spark's Datasets provided this transformation natively. https://github.com/apache/spark/pull/26936 Regar

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-07 Thread Brett Marcott
1. Where can I find information on how to run standard performance tests/benchmarks? 2. Are performance degradations to existing queries that are fixable by new equivalent queries not allowed for a new major Spark version? On Thu, Jan 2, 2020 at 3:05 PM Brett Marcott wrote: > Thanks for the resp