Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Marco Gaido
Hi all, I also like this idea very much and I think it may also bring other performance improvements in the future. Thanks to everybody who worked on this. I agree to target this feature for 3.0. Thanks everybody, Bests. Marco On Tue, 31 Jul 2018, 08:39 Wenchen Fan, wrote: > Hi Carson and Yu

Re: Review notification bot

2018-07-31 Thread Hyukjin Kwon
> I originally did that, but GitHub told me I could only have one personal and one bot account. If someone else registered the spark-mention-bot I'd be happy to switch it to that. I have my own spare account for testing purpose (spark-test). https://github.com/spark-test I don't mind sharing it.

Re: Why percentile and distinct are not done in one job?

2018-07-31 Thread 吴晓菊
I mean in AnalyzeColumnCommand.scala the first one to compute percentiles and the second one to compute columnStats. Chrysan Wu 吴晓菊 Phone:+86 17717640807 2018-07-30 23:28 GMT+08:00 Reynold Xin : > Which API are you talking about? > > On Mon, Jul 30, 2018 at 7:03 AM 吴晓菊 wrote: > >> I noticed t

RE: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Wang, Carson
Thanks Marco and Wenchen for reviewing. It sounds good to target this for 3.0. I can also share more data on the benchmark. In the 100 TB TPC-DS benchmark we performed on a 100-node cluster, we saw 90% of the 103 queries had performance gain, and 46% of them are more than 1.1x faster. Individual

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Marco Gaido
Hi Wenchen, I think it would be great to consider also - SPARK-24598 : Datatype overflow conditions gives incorrect result As it is a correctness bug. What do you think? Thanks, Marco 2018-07-31 4:01 GMT+02:00 Wenchen Fan : > I went through t

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Petar Zečević
This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner range optimization) but I think it could be useful to others too. It is finished and is ready to be merged (was ready a month ago at least). Do you think you could consider including it in 2

Re: Review notification bot

2018-07-31 Thread Holden Karau
Sure, I've turned off the bot for now. I'll look at giving it a preference for more recent contributors, switching it to suggest people to the user rather than doing the ping itself, and skipping the PMC list as well. On Tue, Jul 31, 2018 at 12:30 AM, Hyukjin Kwon wrote: >

Re: Data source V2

2018-07-31 Thread vaclavkosar
For streaming there is an event StreamingQueryProgress which provides the number of input rows for each source. The number of output rows that were written is currently not available in StreamingQueryProgress, but I submitted a PR for that here: https://github.com/apache/spark/pull/21919 If you are interested,

Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Yu, Yucai
Hi, I would like to share some experience from using AE in eBay’s data warehouse. 1. Saving a lot of manual setting and tuning effort. Setting shuffle.partition query by query is annoying; with AE, we just need to set a big number for all queries. 2. Saving memory. With AE, we can start less exe

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Tomasz Gawęda
Hi, what is the status of Continuous Processing + Aggregations? As far as I remember, Jose Torres said it should be easy to perform aggregations if coalesce(1) works. IIRC it's already merged to master. Is this work in progress? If yes, it would be great to have full aggregation/join support i

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Stavros Kontopoulos
I have a PR out for SPARK-14540 (Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner). This should allow us to add support for Scala 2.12; I think we can resolve this long-standing issue with 2.4. Best, Stavros On Tue, Jul 31, 2018 at 4:07 PM, Tomasz Gawęda wrote: > Hi, > > what i

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Joseph Torres
Full continuous processing aggregation support ran into unanticipated scalability and scheduling problems. We’re planning to overcome those by using some of the barrier execution machinery, but since barrier execution itself is still in progress the full support isn’t going to make it into 2.4. Jo

Writing file

2018-07-31 Thread mattbuttow
According to Stack Overflow (https://stackoverflow.com/q/40786093) it should be possible to write a file to a local path and the result should be available on the driver node. However, when I try this: df.write.parquet("file:///some/path") the data seems to be written on each node, not a drive

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Wenchen Fan
Here is my interpretation of your proposal, please correct me if something is wrong. End users can read/write a data source with its name and some options. e.g. `df.read.format("xyz").option(...).load`. This is currently the only end-user API for data source v2, and is widely used by Spark users t

Re: Writing file

2018-07-31 Thread Wenchen Fan
It depends on how you deploy Spark. The writer just writes data to your specified path (HDFS or local path), but the writer is run on executors. If you deploy Spark in local mode, i.e. executor and driver are together, then you will see the output file on the driver node. If you deploy Spark
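
A minimal sketch of one way to get output onto the driver node regardless of deploy mode, assuming the result is small enough to fit in driver memory: collect the rows first, then write them with ordinary local file I/O. The helper name write_collected_rows and the CSV output format are hypothetical choices made for illustration; they are not from the thread and not part of Spark's API.

```python
import csv


def write_collected_rows(rows, columns, path):
    """Write already-collected rows to a local CSV file.

    This runs in whatever process calls it. When called from the driver
    program, the file lands on the driver node's local filesystem.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)   # header line
        writer.writerows(rows)     # one CSV line per row
    return len(rows)


# Hypothetical usage with an existing Spark DataFrame `df`:
#   write_collected_rows(df.collect(), df.columns, "/some/path/out.csv")
# df.collect() brings every row to the driver, so this is only suitable
# for small results.
```

For large results, writing to a shared filesystem such as HDFS, as the reply mentions, avoids pulling all the data through the driver.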

DISCUSS: SPARK-24882 data source v2 API improvement

2018-07-31 Thread Wenchen Fan
Hi all, Data source v2 has been out for a while. During this release, we migrated most of the streaming sources to the v2 API (SPARK-22911 ) started to migrate file sources (SPARK-23817 ) started to de

Re: Review notification bot

2018-07-31 Thread Sean Owen
I haven't been pinged by this bot :( :) But I do like these comments on PRs, like https://github.com/apache/spark/pull/21925#issuecomment-409035244 Is the issue that @-mentions cause emails too? Is there any option to maybe only consider pinging someone if they've touched the code within the last

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
Barrier mode seems like a high impact feature on Spark's core code: is one additional week enough time to properly vet this feature? On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres wrote: > Full continuous processing aggregation support ran into unanticipated > scalability and scheduling problems

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet the code as a PR. I'm not entirely happy with the design and code as they currently are (and I'm still trying to find the time to more publicly express my thoughts and concerns), but I'm fine with them going into 2.4 much as

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
I don't have a comprehensive knowledge of the project hydrogen PRs, however I've perused them, and they make substantial modifications to Spark's core DAG scheduler code. What I'm wondering is: how high is the confidence level that the "traditional" code paths are still stable. Put another way, is

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Reynold Xin
I actually totally agree that we should make sure it has no impact on existing code if the feature is not used. On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson wrote: > I don't have a comprehensive knowledge of the project hydrogen PRs, > however I've perused them, and they make substant

Re: [DISCUSS] Multiple catalog support

2018-07-31 Thread Ryan Blue
Wenchen, I think the misunderstanding is around how the v2 API should work with multiple catalogs. Data sources are read/write implementations that resolve to a single JVM class. When we consider how these implementations should work with multiple table catalogs, I think it is clear that the catal

[build system] DOWNTIME jenkins unreachable overnight

2018-07-31 Thread shane knapp
our building is finally replacing the broken UPS that keeps biting us... ...which means another bit of downtime. :( it begins in 6 hours (11pm PDT) and will be finished tomorrow (august 1st) by ~8am PDT. shane -- Shane Knapp UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise

Re: Review notification bot

2018-07-31 Thread Hyukjin Kwon
(BTW, it seems not to be turned off yet - https://github.com/apache/spark/pull/21939#issuecomment-409412805) On Wed, Aug 1, 2018 at 2:24 AM, Sean Owen wrote: > I haven't been pinged by this bot :( :) > > But I do like this comments on PRs: like > https://github.com/apache/spark/pull/21925#issuecomment-409035244

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Imran Rashid
I'd like to add SPARK-24296, replicating large blocks over 2GB. It's been up for review for a while, and would end the 2GB block limit (well ... subject to a couple of caveats on SPARK-6235). On Mon, Jul 30, 2018 at 9:01 PM, Wenchen Fan wrote: > I went through the open JIRA tickets and here is a