[SQL] A confusing NullPointerException when creating a table using Spark 2.1.0

2017-02-03 Thread StanZhai
Hi all, After upgrading our Spark from 1.6.2 to 2.1.0, I encountered a confusing NullPointerException when creating a table under Spark 2.1.0, but the problem does not exist in Spark 1.6.1. Environment: Hive 1.2.1, Hadoop 2.6.4 Code // spark is an instance

Re: Google Summer of Code 2017 is coming

2017-02-03 Thread Jacek Laskowski
Thanks Sean. You've again been very helpful in setting the right tone on the matter. I stand corrected and have no interest in GSoC anymore. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitt

Re: Google Summer of Code 2017 is coming

2017-02-03 Thread Sean Owen
I have a contrarian opinion on GSoC from experience many years ago in Mahout. Of 3 students I interacted with, 2 didn't come close to completing the work they signed up for. I think it's mostly that students are hungry for the resumé line item, and don't understand the amount of work they're propos

Re: Google Summer of Code 2017 is coming

2017-02-03 Thread Holden Karau
As someone who did GSoC back in university, I think this could be a good idea if there is enough interest from the PMC, and I'd be willing to help mentor if that is a bottleneck. On Fri, Feb 3, 2017 at 12:42 PM, Jacek Laskowski wrote: > Hi, > > Is this something Spark is considering? Would be nice to

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-03 Thread Jacek Laskowski
Hi, Just to throw a few zlotys into the conversation, I believe that Spark Standalone does not enforce any memory checks to limit or even kill executors that exceed the requested memory (as YARN does). I also found that memory is of little use when scheduling tasks and only CPU matters. My understandin

Re: Remove support for Hadoop 2.5 and earlier?

2017-02-03 Thread Jacek Laskowski
Hi Sean, Given that 3.0.0 is coming, removing the unused versions would be a huge benefit from a maintenance point of view. I'd support removing support for 2.5 and earlier. Speaking of Hadoop support, is anyone considering 3.0.0 support? Can't find any JIRA for this. Regards, Jacek Laskowski -

Re: Apache Spark Contribution

2017-02-03 Thread Steve Loughran
You might want to look at Nephele: Efficient Parallel Data Processing in the Cloud, Warneke & Kao, 2009 http://stratosphere.eu/assets/papers/Nephele_09.pdf This was some of the work done in the research project which gave birth to Flink, though this bit didn't surface as they chose to leave VM a

Fwd: Google Summer of Code 2017 is coming

2017-02-03 Thread Jacek Laskowski
Hi, Is this something Spark is considering? Would be nice to mark issues as GSoC in JIRA and solicit feedback. What do you think? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jac

Re: Remove support for Hadoop 2.5 and earlier?

2017-02-03 Thread Steve Loughran
> On 3 Feb 2017, at 11:52, Sean Owen wrote: > > Last year we discussed removing support for things like Hadoop 2.5 and > earlier. It was deprecated in Spark 2.1.0. I'd like to go ahead with this, so > am checking whether anyone has strong feelings about it. > > The original rationale for sepa

Re: Structured Streaming Schema Issue

2017-02-03 Thread Sam Elamin
Hey TD, I figured out what was happening. My source would return the correct schema, but the schema on the returned df was actually different. I'm loading JSON data from cloud storage, and that gets inferred instead of set. So basically the schema I return on the source provider wasn't actually being

RE: No Reducer scenarios

2017-02-03 Thread Praveen Mothkuri
In this case the output of the map tasks goes directly to the distributed file system, to the path set by FileOutputFormat.setOutputPath(JobConf, Path).
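
A toy illustration of the no-reducer case, in plain Python rather than the Hadoop API: each map task writes its own output file straight into the output directory, and nothing is shuffled, sorted, or merged. The function names, the sample mapper, and the file naming scheme are all assumptions made up for this sketch; they are not Hadoop's.

```python
import os
import tempfile


def run_map_only_job(input_splits, map_fn, output_dir):
    """Toy stand-in for a job with zero reducers.

    Every map task writes its records directly to its own file in
    output_dir. There is no shuffle, sort, or reduce step afterwards.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for task_id, split in enumerate(input_splits):
        # Hypothetical file naming, chosen for this sketch only.
        path = os.path.join(output_dir, f"map-output-{task_id:05d}.txt")
        with open(path, "w", encoding="utf-8") as out:
            for record in split:
                for key, value in map_fn(record):
                    out.write(f"{key}\t{value}\n")
        written.append(path)
    return written


def word_lengths(line):
    """Sample mapper (hypothetical): emit (word, length) for each word."""
    for word in line.split():
        yield word, len(word)


if __name__ == "__main__":
    splits = [["spark hadoop"], ["map only job"]]
    with tempfile.TemporaryDirectory() as out_dir:
        for path in run_map_only_job(splits, word_lengths, out_dir):
            print(os.path.basename(path))
```

In this sketch the number of output files equals the number of map tasks, and records appear in the order the mapper emitted them, since no later stage regroups them by key.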

Remove support for Hadoop 2.5 and earlier?

2017-02-03 Thread Sean Owen
Last year we discussed removing support for things like Hadoop 2.5 and earlier. It was deprecated in Spark 2.1.0. I'd like to go ahead with this, so am checking whether anyone has strong feelings about it. The original rationale for separate Hadoop profile was bridging the significant difference b

Re: No Reducer scenarios

2017-02-03 Thread ??????????
Hi Nair, do you know the class, please? I tried to find it but failed. I know NewDirectOutputCollector is used to write tmp files. ---Original--- From: "R Nair" Date: 2017/1/30 13:32:04 To: "dev";"user";"user"; Subject: No Reducer scenarios

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-03 Thread Maciej Szymkiewicz
Hi Liang-Chi, Thank you for the updates. This looks promising. On 02/03/2017 08:34 AM, Liang-Chi Hsieh wrote: > Hi Maciej, > > FYI, this fix is submitted at https://github.com/apache/spark/pull/16785. > > > Liang-Chi Hsieh wrote >> Hi Maciej, >> >> After looking into the details of the time spen