[JIRA] (SPARK-546) Support full outer join and multiple join in a single shuffle

2016-07-20 Thread Anonymous (JIRA)
Anonymous started wor

[JIRA] (SPARK-546) Support full outer join and multiple join in a single shuffle

2016-07-20 Thread Anonymous (JIRA)
Anonymous stopped wor

[JIRA] (SPARK-546) Support full outer join and multiple join in a single shuffle

2016-07-20 Thread Anonymous (JIRA)
Anonymous started wor

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Reynold Xin
+1
On Wednesday, July 20, 2016, Krishna Sankar wrote:
> +1 (non-binding, of course)
>
> 1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
> mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
> 2. Tested pyspark, mllib (iPython 4.0)
> 2.0 Spark version is 2.0.0
> 2.1. statistics

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Dongjoon Hyun
+1 (non-binding)
- MD5/SHA/GPG matched.
- Test passed on Ubuntu (16.04) + Oracle JDK (1.7.0_80) + R(3.2.3)
  * build/mvn -Phive -Phadoop-2.7 -Pyarn clean package
  * python python/run-tests.py
  * R/install-dev.sh & R/run-tests.sh
Cheers!
Dongjoon.
On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Krishna Sankar
+1 (non-binding, of course)
1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
   mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
   2.0 Spark version is 2.0.0
   2.1. statistics (min,max,mean,Pearson,Spearman) OK
   2.2. Linear/Ridge/Lasso Regression

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Joseph Gonzalez
+1

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Jonathan Kelly
+1 (non-binding)
On Wed, Jul 20, 2016 at 2:48 PM Michael Allman wrote:
> I've run some tests with some real and some synthetic parquet data with
> nested columns with and without the hive metastore on our Spark 1.5, 1.6
> and 2.0 versions. I haven't seen any unexpected performance surprises,
> e

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
I've run some tests with some real and some synthetic parquet data with nested columns with and without the hive metastore on our Spark 1.5, 1.6 and 2.0 versions. I haven't seen any unexpected performance surprises, except that Spark 2.0 now does schema inference across all files in a partitione

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Maciej Bryński
@Michael, I answered in Jira and can repeat it here. I think my problem is unrelated to Hive, because I'm using the read.parquet method. I also attached some VisualVM snapshots to SPARK-16321 (I think I should merge both issues), and code profiling suggests a bottleneck when reading the Parquet file. I wo

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
In reference to https://issues.apache.org/jira/browse/SPARK-16320, the code path for reading data from parquet files has been refactored extensively. The fact that Maciej was testing performance on a table with 400 partitions makes me wonder if my PR for https://issues.apache.org/jira/browse/SPA

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Marcin Tustin
I refer to Maciej Bryński's (mac...@brynski.pl) emails of 29 and 30 June 2016 to this list. He said that his benchmarking suggested that Spark 2.0 was slower than 1.6. I'm wondering whether that was ever investigated and, if so, whether the speed is back up.
On Wed, Jul 20, 2016 at 12:18 PM, Michael

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Michael Allman
Marcin, I'm not sure what you're referring to. Can you be more specific?
Cheers,
Michael
> On Jul 20, 2016, at 9:10 AM, Marcin Tustin wrote:
>
> Whatever happened with the query regarding benchmarks? Is that resolved?
>
> On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Marcin Tustin
Whatever happened with the query regarding benchmarks? Is that resolved?
On Tue, Jul 19, 2016 at 10:35 PM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes
> if a majority o

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Shivaram Venkataraman
+1
SHA and MD5 sums match for all binaries. Docs look fine this time around. Built and ran `dev/run-tests` with Java 7 on a Linux machine. No blocker bugs on JIRA, and the only critical bug targeted for 2.0.0 is SPARK-16633, which doesn't look like a release blocker. I also checked issues which

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Sean Owen
+1 at last. Sigs and hashes check out, and compiles and passes tests with "-Pyarn -Phadoop-2.7 -Phive" on Ubuntu 16 + Java 8. There are actually only 2 issues still targeted for 2.0.0, which is great:
SPARK-16633 lag/lead does not return the default value when the offset row does not exist
SPARK-

Snappy initialization issue, spark assembly jar missing snappy classes?

2016-07-20 Thread Eugene Morozov
Greetings! We're reading input files with newAPIHadoopFile, configured with a multiline split. Everything's fine except for https://issues.apache.org/jira/browse/MAPREDUCE-6549. It looks like the issue is fixed, but in Hadoop 2.7.2, which means we have to download Spark without Hadoop and p