Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
UPDATE: The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1023/ The previous repo contains exactly the same content but is mutable. Thanks Patrick for pointing it out! -Xiangrui On Thu, Jul 17, 2014 at 7:52 PM, Reynold Xin w

Re: Current way to include hive in a build

2014-07-17 Thread Patrick Wendell
Hey Stephen, The only change to the build was that we ask users to run -Phive and -Pyarn instead of --with-hive and --with-yarn (which internally just set -Phive and -Pyarn). I don't think this should affect the dependency graph. Just to test this, what happens if you run *without* the CDH profile and build

Re: preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Zongheng Yang
Hi Will, These three environment variables are needed [1]. I have had success with Hive 0.12 and Hadoop 1.0.4. For Hive, getting the source distribution seems to be required. Docs contribution will be much appreciated! [1] https://github.com/apache/spark/tree/master/sql#other-dependencies-for-d

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Reynold Xin
+1 On Thursday, July 17, 2014, Matei Zaharia wrote: > +1 > > Tested on Mac, verified CHANGES.txt is good, verified several of the bug > fixes. > > Matei > > On Jul 17, 2014, at 11:12 AM, Xiangrui Meng > wrote: > > > I start the voting with a +1. > > > > Ran tests on the release candidates and s

preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Will Benton
Hi all, What's the preferred environment for generating golden test outputs for new Hive tests? In particular: * what Hadoop version and Hive version should I be using, * are there particular distributions people have run successfully, and * are there any system properties or environment varia

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread DB Tsai
+1 Tested with my Ubuntu Linux. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jul 17, 2014 at 6:36 PM, Matei Zaharia wrote: > +1 > > Tested on Mac, verified CHANGES.txt is good, v

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Matei Zaharia
+1 Tested on Mac, verified CHANGES.txt is good, verified several of the bug fixes. Matei On Jul 17, 2014, at 11:12 AM, Xiangrui Meng wrote: > I start the voting with a +1. > > Ran tests on the release candidates and some basic operations in > spark-shell and pyspark (local and standalone). >

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, Cool discussion! I agree that a more standardized API for clustering, and easy access to underlying routines, would be useful (we've also been discussing this when trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361) For divisive, hier

Current way to include hive in a build

2014-07-17 Thread Stephen Boesch
Having looked at trunk's make-distribution.sh, I see that --with-hive and --with-yarn are now deprecated. Here is the way I have built it: Added to pom.xml: cdh5 false 2.3.0-cdh5.0.0 2.3.0-cdh5.0.0 0.96.1.1-cdh5.0.0 3.4.5-cdh5.0.0

InputSplit and RecordReader control on HadoopRDD

2014-07-17 Thread Nick R. Katsipoulakis
Hello, I am currently trying to extend some custom InputSplit and RecordReader classes to provide to SparkContext's hadoopRDD() function. My question is the following: Does the value returned by InputSplit.getLength() and/or RecordReader.getProgress() affect the execution of a map() function in t

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
I start the voting with a +1. Ran tests on the release candidates and some basic operations in spark-shell and pyspark (local and standalone). -Xiangrui On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng wrote: > Please vote on releasing the following candidate as Apache Spark version > 0.9.2! > >

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-17 Thread Nicholas Chammas
On Thu, Jul 17, 2014 at 1:23 AM, Stephen Haberman < stephen.haber...@gmail.com> wrote: > I'd be ecstatic if more major changes were this well/succinctly > explained > Ditto on that. The summary of user impact was very nice. It would be good to repeat that on the user list or release notes when th

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
Should be an easy rebase for your PR, so I went ahead just to get this fixed up: https://github.com/apache/spark/pull/1466 On Thu, Jul 17, 2014 at 5:32 PM, Ted Malaska wrote: > Don't make this change yet. I have a 1642 that needs to get through around > the same code. > > I can make this change

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
OK, I will create a PR. Thanks. On Thu, Jul 17, 2014 at 7:58 AM, Sean Owen wrote: > Looks like a real problem. I see it too. I think the same workaround > found in ClientBase.scala needs to be used here. There, the fact that > this field can be a String or String[] is handled explicitly. In fac

Re: Compile error when compiling for cloudera

2014-07-17 Thread Ted Malaska
Don't make this change yet. I have a 1642 that needs to get through around the same code. I can make this change after 1642 is through. On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen wrote: > CC tmalaska since he touched the line in question. This is a fun one. > So, here's the line of code adde

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
CC tmalaska since he touched the line in question. This is a fun one. So, here's the line of code added last week: val channelFactory = new NioServerSocketChannelFactory (Executors.newCachedThreadPool(), Executors.newCachedThreadPool()); Scala parses this as two statements, one invoking a no-ar
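
The parsing behaviour described in this message can be reproduced in isolation. The following is a minimal sketch, not the Spark code: `Factory` is a hypothetical stand-in for a class that has both a no-argument and a two-argument constructor, and it assumes the argument list in the original source began on a new line (the line break is not visible in this flattened listing).

```scala
object NewlineDemo {
  // Hypothetical stand-in for a class with both a no-arg and a two-arg constructor.
  class Factory(val boss: String, val worker: String) {
    def this() = this("default-boss", "default-worker")
  }

  // Argument list starts on a new line: Scala infers a statement separator
  // before "(", so this is a no-arg constructor call followed by a tuple
  // expression whose value is discarded (the compiler may warn about it).
  def splitAcrossLines(): Factory = {
    val f = new Factory
    ("custom-boss", "custom-worker")
    f
  }

  // Opening parenthesis on the same line: a single two-arg constructor call.
  def sameLine(): Factory = {
    val f = new Factory("custom-boss", "custom-worker")
    f
  }

  def main(args: Array[String]): Unit = {
    println(splitAcrossLines().boss) // default-boss
    println(sameLine().boss)         // custom-boss
  }
}
```

If the class has no no-argument constructor, the split form fails to compile, which would be consistent with an error that appears only against some dependency versions. Keeping the opening parenthesis on the same line as the class name avoids the split.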

Re: Does RDD checkpointing store the entire state in HDFS?

2014-07-17 Thread Yan Fang
Thank you, TD ! Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Wed, Jul 16, 2014 at 6:53 PM, Tathagata Das wrote: > After every checkpointing interval, the latest state RDD is stored to HDFS > in its entirety. Along with that, the series of DStream transformations > that was set up with th

Re: Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
er, that line being in toDebugString, where it really shouldn't affect anything (no signature changes or the like) On Thu, Jul 17, 2014 at 10:58 AM, Nathan Kronenfeld < nkronenf...@oculusinfo.com> wrote: > My full build command is: > ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly >

Re: Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
My full build command is: ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly I've changed one line in RDD.scala, nothing else. On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen wrote: > This looks like a Jetty version problem actually. Are you bringing in > something that might be changi

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Looks like a real problem. I see it too. I think the same workaround found in ClientBase.scala needs to be used here. There, the fact that this field can be a String or String[] is handled explicitly. In fact I think you can just call to ClientBase for this? PR it, I say. On Thu, Jul 17, 2014 at 3
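
One way to handle a field that is a String in some library versions and a String[] in others is to read it as an untyped value and match on its runtime type. This is only an illustrative sketch: `asEntries` and the comma separator are assumptions, and it is not the actual code in ClientBase.scala.

```scala
object ClasspathField {
  // Hypothetical helper: normalizes a value that may be a single
  // comma-separated String or an Array[String] into a Seq[String].
  def asEntries(value: Any): Seq[String] = value match {
    case s: String =>
      // Assumed format: entries separated by commas, optional whitespace.
      s.split(",").map(_.trim).filter(_.nonEmpty).toSeq
    case arr: Array[String] =>
      arr.toSeq
    case other =>
      throw new IllegalArgumentException(
        "Unsupported field value: " + String.valueOf(other))
  }

  def main(args: Array[String]): Unit = {
    println(asEntries("a, b,c"))          // entries from a single String
    println(asEntries(Array("a", "b")))   // entries from an array
  }
}
```

In a real build the value would typically be obtained reflectively, since the static type of the field differs between the versions being compiled against; that step is omitted here.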

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
This looks like a Jetty version problem actually. Are you bringing in something that might be changing the version of Jetty used by Spark? It depends a lot on how you are building things. Good to specify exactly how you're building here. On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld wrote:

Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
I'm trying to compile the latest code, with the hadoop-version set for 2.0.0-mr1-cdh4.6.0. I'm getting the following error, which I don't get when I don't set the hadoop version: [error] /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/Flu

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
@Sean and @Sandy Thanks for the reply. I used to be able to see yarn-alpha and yarn directories which correspond to the modules. I guess due to the recent SparkBuild.scala changes, I did not see yarn-alpha (by default) and I thought yarn-alpha is renamed to "yarn" and "yarn-stable" is th

[VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
Please vote on releasing the following candidate as Apache Spark version 0.9.2! The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b The release files, including signatures, digests, etc. ca

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sandy Ryza
To add, we've made some effort to get yarn-alpha to work with the 2.0.x line, but this was a time when YARN went through wild API changes. The only line that the yarn-alpha profile is guaranteed to work against is the 0.23 line. On Thu, Jul 17, 2014 at 12:40 AM, Sean Owen wrote: > Are you setting

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Are you setting -Pyarn-alpha? ./sbt/sbt -Pyarn-alpha, followed by "projects", shows it as a module. You should only build yarn-stable *or* yarn-alpha at any given time. I don't remember the modules changing in a while. 'yarn-alpha' is for YARN before it stabilized, circa early Hadoop 2.0.x. 'yarn-