+1 on Jungtaek's point. Can we revisit this when we release Spark 3.1?
After the release of 3.0, I believe we will get more feedback about DSv2
from the community. The current design was made by just a small group of
contributors. DSv2 + catalog APIs are still evolving. It is very likely we
will mak
Just 2 cents, I haven't tracked the changes to DSv2 (though I had to deal
with them, as the changes caused confusion on my PRs...), but my bet is that
DSv2 has already changed in incompatible ways, at least for anyone who works
on a custom DataSource, forcing downstream projects to diverge their
implementations heavily
small correction: confusion -> conflict, so I had to go through and
understand parts of the changes
On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim wrote:
> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
> deal with this as the change made confusion on my PRs...), but my bet
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning (
https://spark.apache.org/versioning-policy.html ).
> We just won’t add any breaking changes before 3.1.
Bests,
Dongjoon.
On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue
wrote:
> I don’
Hi all,
I am trying to do some actions on the Driver side in Spark while an
application is running. The Driver needs to know the task progress before
making any decision. I know that task progress can be accessed within each
executor or task from the RecordReader class by calling getProgress().
The
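Aside from per-task getProgress(), progress can also be observed entirely on the driver side. A minimal sketch (the class and field names here are made up, not from the thread): count finished tasks per stage with a SparkListener, which Spark invokes on the driver.

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener: tracks how many tasks have finished per stage.
class TaskProgressListener extends SparkListener {
  private val finished = mutable.Map.empty[Int, Int]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    finished(taskEnd.stageId) = finished.getOrElse(taskEnd.stageId, 0) + 1
  }

  def finishedTasks(stageId: Int): Int = synchronized {
    finished.getOrElse(stageId, 0)
  }
}

// Driver-side registration:
//   spark.sparkContext.addSparkListener(new TaskProgressListener)
```

Alternatively, SparkContext.statusTracker exposes getStageInfo(stageId), whose numCompletedTasks and numTasks fields give the same progress information without a custom listener.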
Hi, All.
As a next step, we started JDK11 QA.
https://issues.apache.org/jira/browse/SPARK-29194
This issue mainly focuses on the following areas, but feel free to add any
sub-issues which you hit on JDK11 from now.
- Documentations
- Examples
- Performance
- Integration Tests
B
I don’t think we need to gate a 3.0 release on making a more stable version
of InternalRow
Sounds like we agree, then. We will use it for 3.0, but there are known
problems with it.
Thinking we’d have dsv2 working in both 3.x (which will change and progress
towards more stable, but will have to br
I don't think we need to gate a 3.0 release on making a more stable version of
InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change
and progress towards more stable, but will have to break certain APIs) and 2.x
seems like a false premise.
To point out some problems wi
When you created the PR to make InternalRow public
This isn’t quite accurate. The change I made was to use InternalRow instead
of UnsafeRow, which is a specific implementation of InternalRow. Exposing
this API has always been a part of DSv2 and while both you and I did some
work to avoid this, we
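For context on what "use InternalRow instead of UnsafeRow" means in practice, here is a hedged sketch of a v2 reader (package names as in the 3.x connector API; the class and its data are hypothetical, not from the thread):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader

// Hypothetical reader: a DSv2 PartitionReader is declared against the
// InternalRow interface, so any InternalRow implementation works, not
// only the UnsafeRow implementation.
class LongsReader(values: Iterator[Long]) extends PartitionReader[InternalRow] {
  private var current: InternalRow = _

  override def next(): Boolean =
    if (values.hasNext) { current = InternalRow(values.next()); true }
    else false

  override def get(): InternalRow = current

  override def close(): Unit = ()
}
```

This is the sense in which the API exposes InternalRow: the interface appears in reader signatures, while the concrete row implementation is left to Spark.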
I don't know enough about DSv2 to comment on this part, but, any
theoretical 2.5 is still a ways off. Does waiting for 3.0 to 'stabilize' it
as much as is possible help?
I say that because re: Java 11, the main breaking change is probably the
Hive 2 / Hadoop 3 dependency, JPMML (minor), as well as
To push back, while I agree we should not drastically change "InternalRow",
there are a lot of changes that need to happen to make it stable. For example,
none of the publicly exposed interfaces should be in the Catalyst package or
the unsafe package. External implementations should be decoupled
I didn't realize that Java 11 would require breaking changes. What breaking
changes are required?
On Fri, Sep 20, 2019 at 11:18 AM Sean Owen wrote:
> Narrowly on Java 11: the problem is that it'll take some breaking
> changes, more than would be usually appropriate in a minor release, I
> think.
> DSv2 is far from stable right?
No, I think it is reasonably stable and very close to being ready for a
release.
> All the actual data types are unstable and you guys have completely
ignored that.
I think what you're referring to is the use of `InternalRow`. That's a
stable API and there has be
Narrowly on Java 11: the problem is that it'll take some breaking
changes, more than would be usually appropriate in a minor release, I
think. I'm still not convinced there is a burning need to use Java 11
but stay on 2.4, after 3.0 is out, and at least the wheels are in
motion there. Java 8 is sti
DSv2 is far from stable right? All the actual data types are unstable and you
guys have completely ignored that. We'd need to work on that and that will be a
breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that
seems too invasive of a change to backport once you consider th
I’m not sure that DSv2 list is accurate. We discussed this in the DSv2 sync
this week (just sent out the notes) and came up with these items:
- Finish TableProvider update to avoid another API change: pass all
table config from metastore
- Catalog behavior fix: https://issues.apache.org/j
Hi everyone,
In the DSv2 sync this week, we talked about a possible Spark 2.5 release
based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
A Spark 2.5 release with these two additions will help people migrate to
Spark 3.0 when it is released because they will be able to use a s
Here are my notes from this week’s DSv2 sync.
*Attendees*:
Ryan Blue
Holden Karau
Russell Spitzer
Terry Kim
Wenchen Fan
Shiv Prashant Sood
Joseph Torres
Gengliang Wang
Matt Cheah
Burak Yavuz
*Topics*:
- Driver-side Hadoop conf
- SHOW DATABASES/NAMESPACES behavior
- Review outstanding 3
Thank you for the summary, Xingbo.
I also agree with Sean; I don't think those block the 3.0.0 preview
release.
In particular, correctness issues should not be there.
Instead, could you summarize what we have as of now for 3.0.0 preview?
I believe JDK11 (SPARK-28684) and Hive 2.3.5 (SPARK-
Hi,
to the best of my knowledge, the existing FileStreamSource reads all the
files in a directory (Hive table).
However, I need to be able to specify an initial partition it should start
from (i.e. like a Kafka offset/initial warmed-up state) and then only read
data which is semantically (i.e. using a fi
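As far as I know there is no built-in "starting offset" option for the file source. One hedged workaround (the path, schema, and the dt partition column below are all hypothetical) is to filter on a partition column, so that earlier partitions are skipped:

```scala
import org.apache.spark.sql.types._
// assumes a SparkSession named `spark` and `import spark.implicits._`

// Streaming file sources require an explicit schema.
val eventSchema = new StructType()
  .add("id", LongType)
  .add("dt", StringType)   // hypothetical partition column

val events = spark.readStream
  .format("parquet")
  .schema(eventSchema)
  .load("/data/events")            // hypothetical path
  .where($"dt" >= "2019-09-01")    // acts like an initial "offset"
```

Note this only filters what is processed; Spark still lists the directory, so it is a workaround rather than a true warmed-up starting state.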
Is this a list of items that might be focused on for the final 3.0
release? At least, Scala 2.13 support shouldn't be on that list. The
others look plausible, or are already done, but there are probably
more.
As for the 3.0 preview, I wouldn't necessarily block on any particular
feature, though, y
I forgot to mention an important part: I'm issuing the same query against
both parquets, selecting only one column:
df.select(sum('amount))
BR,
Tomas
On Thu, Sep 19, 2019 at 18:10 Tomas Bartalos wrote:
> Hello,
>
> I have 2 parquets (each containing 1 file):
>
>- parquet-wide - schema has 25 top le
> New pushdown API for DataSourceV2
One correction: I want to revisit the pushdown API to make sure it works
for dynamic partition pruning and can be extended to support
limit/aggregate/... pushdown in the future. It should be a small API update
instead of a new API.
On Fri, Sep 20, 2019 at 3:46
Hi all,
Let's start a new thread to discuss the on-going features for Spark 3.0
preview release.
Below is the feature list for the Spark 3.0 preview release. The list is
collected from the previous discussions in the dev list.
- Followup of the shuffle+repartition correctness issue: support r