SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-01 Thread Ryan Blue
Over the last couple years, I’ve noticed a trend toward specialized logical plans and increasing use of RunnableCommand nodes. DataSourceV2 is currently on the same path, and I’d like to make the case that we should avoid these practices. I think it’s helpful to consider an example I’ve been watch

Re: data source v2 online meetup

2018-02-01 Thread Reynold Xin
Still would be good to join. We can also do an additional one in March to give people more time. On Thu, Feb 1, 2018 at 3:59 PM, Russell Spitzer wrote: > I can try to do a quick scratch implementation to see how the connector > fits in, but we are in the middle of release land so I don't have t

Re: data source v2 online meetup

2018-02-01 Thread Russell Spitzer
I can try to do a quick scratch implementation to see how the connector fits in, but we are in the middle of release land so I don't have the amount of time I really need to think about this. I'd be glad to join any hangout to discuss everything though. On Thu, Feb 1, 2018 at 11:15 AM Ryan Blue w

Re: data source v2 online meetup

2018-02-01 Thread Ryan Blue
We don't mind updating Iceberg when the API improves. We are fully aware that this is a very early implementation and will change. My hope is that the community is receptive to our suggestions. A good example of an area with friction is filter and projection push-down. The implementation for DSv2

[MLlib] Gaussian Process regression in MLlib

2018-02-01 Thread Valeriy Avanesov
Hi all, it came as a surprise to me that there is no implementation of Gaussian Process in Spark MLlib. The approach is widely known, employed and scalable (in its sparse versions). Is there a good reason for that? Has it been discussed before? If there is a need for this approach being a part of MLl

Re: [Core][Suggestion] sortWithinPartitions and aggregateWithinPartitions for RDD

2018-02-01 Thread Mridul Muralidharan
On Wed, Jan 31, 2018 at 1:15 AM, Ruifeng Zheng wrote: > Hi all: > > 1, Dataset API supports operation “sortWithinPartitions”, but in RDD > API there is no counterpart (I know there is > “repartitionAndSortWithinPartitions”, but I don’t want to repartition the > RDD), I have to convert R

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Tom Graves
I filed a jira [SPARK-23304] Spark SQL coalesce() against hive not working - ASF JIRA for the coalesce issue. Tom On Thursday, February 1, 2018, 12:36:02 PM CST, Sameer Agarwal wrote: [+

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Sameer Agarwal
[+ Xiao] SPARK-23290 does sound like a blocker. On the SQL side, I can confirm that there were non-trivial changes around repartitioning/coalesce and cache performance in 2.3 -- we're currently investigating these. On 1 February 2018 at 10:02, Andrew Ash wrote: > I'd like to nominate SPARK-23

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Andrew Ash
I'd like to nominate SPARK-23290 as a potential blocker for the 2.3.0 release. It's a regression from 2.2.0 in that user pyspark code that works in 2.2.0 now fails in the 2.3.0 RCs: the return type of date columns changed from object to date

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Tom Graves
Testing with spark 2.3 and I see a difference in the sql coalesce talking to hive vs spark 2.2. It seems spark 2.3 ignores the coalesce. Query: spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL").coalesce(16).sh

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Michael Heuer
We found two classes new to Spark 2.3.0 that must be registered in Kryo for our tests to pass on RC2: org.apache.spark.sql.execution.datasources.BasicWriteTaskStats and org.apache.spark.sql.execution.datasources.ExecutedWriteSummary. See https://github.com/bigdatagenomics/adam/pull/1897 Perhaps a mention

Re: data source v2 online meetup

2018-02-01 Thread Felix Cheung
+1 hangout From: Xiao Li Sent: Wednesday, January 31, 2018 10:46:26 PM To: Ryan Blue Cc: Reynold Xin; dev; Wenchen Fen; Russell Spitzer Subject: Re: data source v2 online meetup Hi, Ryan, wow, your Iceberg already used data source V2 API! That is pretty cool! I

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Nick Pentreath
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that should be everything outstanding. On Thu, 1 Feb 2018 at 06:21 Yin Huai wrote: > seems we are not running tests related to pandas in pyspark tests (see my > email "python tests related to pandas are skipped in jenkins").