Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Hi Maciej, FYI, this fix is submitted at https://github.com/apache/spark/pull/16785. Liang-Chi Hsieh wrote > Hi Maciej, > > After looking into the details of the time spent on preparing the executed > plan, the cause of the significant difference between 1.6 and current > codebase when running

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Hi Maciej, After looking into the details of the time spent on preparing the executed plan, the cause of the significant difference between 1.6 and the current codebase when running the example is the optimization process that generates constraints. There seem to be a few operations in generating constraints

Re: Oracle JDBC - Spark SQL - Key Not Found: Scale

2017-02-02 Thread Takeshi Yamamuro
-user +dev cc: xiao Hi ayan, I made a PR to fix the issue that you reported, but it seems that none of the releases I checked (e.g., v1.6, v2.0, v2.1) hit the issue. Could you describe your environment and conditions in more detail? You first reported that you used v1.6, but I checked and found t

Re: Apache Spark Contribution

2017-02-02 Thread Shuai Lin
> > The goal of the project is to develop an algorithm that automatically > scales the cluster up and down based on the volume of data processed by the > application. By "scale the cluster up and down" do you mean: 1) adding/removing spark executors based on the load? How is that from the dynami

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-02 Thread StanZhai
CentOS 7.1, Linux version 3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #1 SMP Fri Mar 6 11:36:42 UTC 2015 Michael Allman-2 wrote > Hi Stan, > > What OS/version are you using? > > Michael > >> On Jan 22, 2017, at 11:36 PM, StanZh

4 days left to submit your abstract to Spark Summit SF

2017-02-02 Thread Scott walent
We are just 4 days away from closing the CFP for Spark Summit 2017. We have expanded the tracks in SF to include sessions that focus on AI, Machine Learning and a 60 min deep dive track with technical demos. Submit your presentation today and join us for the 10th Spark Summit! Hurry, the CFP clos

Apache Spark Contribution

2017-02-02 Thread Gabi Cristache
Hello, My name is Gabriel Cristache and I am a final-year student at a Computer Engineering/Science university. For my Bachelor's thesis, I want to add support for dynamic scaling to a Spark Streaming application. *The goal of the project is to develop an algorithm that automatically scales t

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Hi Maciej, Thanks for the info you provided. I ran the same example with 1.6 and the current branch and recorded the difference in the time spent preparing the executed plan. Current branch: 292 ms 95 ms

Re: Structured Streaming Schema Issue

2017-02-02 Thread Sam Elamin
Hi All, I've done a bit more digging into where exactly this happens. It seems the schema is inferred again after the data leaves the source and comes into the sink. Below is a stack trace. The schema at the BigQuerySource has a LongType for customer id, but then at the sink, the data received

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Maciej Szymkiewicz
Hi Liang-Chi, Thank you for your answer and PR, but I think I wasn't specific enough. In hindsight I should have illustrated this better. What really troubles me here is a pattern of growing delays. Difference between 1.6.3 (roughly 20s runtime since the first job): 1.6 timeline vs 2.1.0 (45

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-02 Thread Liang-Chi Hsieh
Thanks, Nick, for pointing it out. I totally agree. In the 1.6 codebase, Pipeline actually uses DataFrame instead of Dataset, because the two were not yet merged in 1.6. StringIndexer and OneHotEncoder call ".rdd" on the Dataset, which deserializes the rows. In 1.6, as they use DataF