Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-28 Thread Reynold Xin
I will kick it off with my own +1. On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a > majority of at least 3+1 PMC votes are cast. > > [ ]

Re: Broadcast big dataset

2016-09-28 Thread WangJianfei
First, thank you very much! My executor memory is also 4G, but my Spark version is 1.5. Could the Spark version be the problem? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127p19143.html Sent from the Apache Spark Developers

[VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-28 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a majority of at least 3+1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.1 [ ] -1 Do not release this package because ... The t

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-28 Thread Michael Gummelt
+1 I know this is cancelled, but FYI, RC3 passes mesos/spark integration tests On Wed, Sep 28, 2016 at 2:52 AM, Sean Owen wrote: > (Process-wise there's no problem with that. The vote is open for at > least 3 days and ends when the RM says it ends. So it's valid anyway > as the vote is still op

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Jakob Odersky
I agree with Sean's answer, you can check out the relevant serializer here https://github.com/twitter/chill/blob/develop/chill-scala/src/main/scala/com/twitter/chill/Traversable.scala On Wed, Sep 28, 2016 at 3:11 AM, Sean Owen wrote: > My guess is that Kryo specially handles Maps generically or

Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Joseph Bradley
+1 for 4 months. With QA taking about a month, that's very reasonable. My main ask (especially for MLlib) is for contributors and committers to take extra care not to delay on updating the Programming Guide for new APIs. Documentation debt often collects and has to be paid off during QA, and a l

Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Burak, you can configure what happens with corrupt records for the datasource using the parse mode. The parse will still fail, so we can't get any data out of it, but we do leave the JSON in another column for you to inspect. In the case of this function, we'll just return null if it's unparseable.
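
A minimal sketch of the data source behavior described above (Spark 2.x JSON reader; the file path and column contents are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("json-corrupt-records").getOrCreate()
    import spark.implicits._

    // PERMISSIVE mode keeps malformed rows instead of dropping them or failing;
    // the raw text of each bad record is preserved in the configured corrupt-record column.
    val df = spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/path/to/events.json")        // path is illustrative

    df.filter($"_corrupt_record".isNotNull).show(false)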

Re: Broadcast big dataset

2016-09-28 Thread Andrew Duffy
Have you tried upping executor memory? There's a separate spark conf for that: spark.executor.memory In general driver configurations don't automatically apply to executors. On Wed, Sep 28, 2016 at 7:03 AM -0700, "WangJianfei" wrote: Hi Devs In my application, i just broadcast a
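
A rough illustration of the distinction (values are made up; driver memory generally has to be set before the driver JVM starts, e.g. via spark-submit or spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    // Executor heaps are sized by spark.executor.memory; the driver heap is separate
    // and is not affected by it. The 8g below is illustrative.
    val conf = new SparkConf()
      .setAppName("broadcast-example")
      .set("spark.executor.memory", "8g")

    val sc = new SparkContext(conf)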

Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Segel
Silly question? When you talk about ‘user specified schema’ do you mean for the user to supply an additional schema, or that you’re using the schema that’s described by the JSON string? (or both? [either/or] ) Thx On Sep 28, 2016, at 12:52 PM, Michael Armbrust <mich...@databricks.com>

Re: Spark SQL JSON Column Support

2016-09-28 Thread Burak Yavuz
I would really love something like this! It would be great if it doesn't throw away corrupt_records like the Data Source. On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande wrote: > We are currently pulling out the JSON columns, passing them through > read.json, and then joining them back onto the i

Re: Spark SQL JSON Column Support

2016-09-28 Thread Nathan Lande
We are currently pulling out the JSON columns, passing them through read.json, and then joining them back onto the initial DF so something like from_json would be a nice quality of life improvement for us. On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust wrote: > Spark SQL has great support fo
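
Roughly, the workaround described above might look like this (a sketch only; it assumes a source frame with an id column and a JSON string column named payload, where the JSON also carries the id so the parsed rows can be joined back):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Illustrative source with (id, payload); the JSON in payload also carries id.
    val events = spark.table("events")

    // Parse the JSON strings through the regular JSON reader...
    val parsed = spark.read.json(events.select($"payload").as[String].rdd)

    // ...and join the parsed columns back onto the original frame.
    val enriched = events.join(parsed, Seq("id"))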

Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new function from_json t
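
Usage might look roughly like the following once the function lands (the exact signature is defined by the PR; the schema, column names, and the kafkaDF frame here are made up):

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("user", StringType)
      .add("action", StringType)

    // kafkaDF is a hypothetical frame whose `value` column holds the raw message bytes.
    val parsed = kafkaDF.select(from_json(col("value").cast("string"), schema).as("event"))
    parsed.select(col("event.user"), col("event.action"))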

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Sean Owen
I guess I'm claiming the artifacts wouldn't even be different in the first place, because the Hadoop APIs that are used are all the same across these versions. That would be the thing that makes you need multiple versions of the artifact under multiple classifiers. On Wed, Sep 28, 2016 at 1:16 PM,

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Olivier Girardot
ok, don't you think it could be published with just different classifiers: hadoop-2.6, hadoop-2.4, hadoop-2.2 being the current default. So for now, I should just override spark 2.0.0's dependencies with the ones defined in the pom profile On Thu, Sep 22, 2016 11:17 AM, Sean Owen so...@cloudera.
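
For what it's worth, the sbt equivalent of pinning the Hadoop version that Spark pulls in would look roughly like this (version numbers are illustrative); the Maven route is the same idea expressed through dependencyManagement in the POM:

    // Pull in Spark as usual, then force the hadoop-client version it resolves to.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.0",
      "org.apache.spark" %% "spark-sql"  % "2.0.0"
    )

    dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "2.6.4"  // 2.6.4 is illustrative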

Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Tom Graves
+1 to 4 months. Tom On Tuesday, September 27, 2016 2:07 PM, Reynold Xin wrote: We are 2 months past releasing Spark 2.0.0, an important milestone for the project. Spark 2.0.0 deviated (took 6 months) from the regular release cadence we had for the 1.x line, and we never explicitly discu

Broadcast big dataset

2016-09-28 Thread WangJianfei
Hi Devs, In my application, I just broadcast a dataset (about 500M) to the executors (100+), and I got a Java heap error: Jmartad-7219.hadoop.jd.local:53591 (size: 4.0 MB, free: 3.3 GB) 16/09/28 15:56:48 INFO BlockManagerInfo: Added broadcast_9_piece19 in memory on BJHC-Jmartad-9012.hadoop.jd.local:53197

Re: Spark Executor Lost issue

2016-09-28 Thread Aditya
Hi All, Any updates on this? On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote: Try increasing the parallelism by repartitioning, and you may also increase spark.default.parallelism. You can also try decreasing num-executor cores. Basically, this happens when the executor
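
A rough sketch of those suggestions (the numbers are illustrative and depend on cluster size and data volume):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("executor-lost-tuning")
      .set("spark.default.parallelism", "200")   // default partition count for RDD shuffles; 200 is illustrative

    val sc = new SparkContext(conf)

    // Repartitioning an under-partitioned RDD spreads the work over more, smaller
    // tasks, which reduces per-task memory pressure. Path and count are illustrative.
    val input = sc.textFile("/path/to/input")
    val repartitioned = input.repartition(200)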

Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-28 Thread Marcin Tustin
I've solved this in the past by using a thread pool which runs clean up code on thread creation, to clear out stale values. On Wednesday, September 28, 2016, Grant Digby wrote: > Hi, > > We've received the following error a handful of times and once it's > occurred > all subsequent queries fail
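
A rough sketch of that approach (sc is an existing SparkContext; this is illustrative rather than a drop-in fix):

    import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
    import org.apache.spark.SparkContext

    // Local properties such as spark.sql.execution.id are inherited by child threads,
    // so each new worker thread clears any stale value before it starts running jobs.
    def cleanPool(sc: SparkContext, size: Int): ExecutorService =
      Executors.newFixedThreadPool(size, new ThreadFactory {
        override def newThread(r: Runnable): Thread = new Thread(new Runnable {
          override def run(): Unit = {
            sc.setLocalProperty("spark.sql.execution.id", null)  // a null value removes the property
            r.run()
          }
        })
      })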

Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Sean Owen
My guess is that Kryo specially handles Maps generically or relies on some mechanism that does, and it happens to iterate over all key/values as part of that and of course there aren't actually any key/values in the map. The Java serialization is a much more literal (expensive) field-by-field seria

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-28 Thread Sean Owen
(Process-wise there's no problem with that. The vote is open for at least 3 days and ends when the RM says it ends. So it's valid anyway as the vote is still open.) On Tue, Sep 27, 2016 at 8:37 PM, Reynold Xin wrote: > So technically the vote has passed, but IMHO it does not make sense to > relea

IllegalArgumentException: spark.sql.execution.id is already set

2016-09-28 Thread Grant Digby
Hi, We've received the following error a handful of times and once it's occurred all subsequent queries fail with the same exception until we bounce the instance: IllegalArgumentException: spark.sql.execution.id is already set at org.apache.spark.sql.execution.SQLExecution$.withNewExecuti

java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Maciej Szymkiewicz
Hi everyone, I suspect there is no point in submitting a JIRA to fix this (not a Spark issue?) but I would like to know if this problem is documented anywhere. Somehow Kryo is losing the default value during serialization: scala> import org.apache.spark.{SparkContext, SparkConf} import org.a

Re: Spark Executor Lost issue

2016-09-28 Thread Aditya
Thanks Sushrut for the reply. Currently I have not defined spark.default.parallelism property. Can you let me know how much should I set it to? Regards, Aditya Calangutkar On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote: Try with increasing the parallelism by repartitioning and al