Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Are you setting -Pyarn-alpha? ./sbt/sbt -Pyarn-alpha, followed by
"projects", shows it as a module. You should only build yarn-stable
*or* yarn-alpha at any given time.

I don't remember the modules changing in a while. 'yarn-alpha' is for
YARN before it stabilized, circa early Hadoop 2.0.x. 'yarn-stable' is
for beta and stable YARN, circa late Hadoop 2.0.x and onwards. 'yarn'
is code common to both, so should compile with yarn-alpha.

What's the compile error, and are you setting yarn.version? The
default is to use hadoop.version, but that defaults to 1.0.4 and there
is no such YARN version.
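
(For example, a yarn-alpha build against a matching Hadoop alpha release would
be invoked roughly like this; just a sketch:

sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha yarn-alpha/test:compile

Since yarn.version defaults to hadoop.version, -Dyarn.version only needs to be
set when the two should differ.)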

Unless I missed it, I only see compile errors in yarn-stable, and you
are trying to compile against YARN alpha versions, no?

On Thu, Jul 17, 2014 at 5:39 AM, Chester Chen  wrote:
> Looking further, the yarn and yarn-stable modules are both for the stable
> version of Yarn; that explains the compilation errors when using the
> 2.0.5-alpha version of Hadoop.
>
> The module yarn-alpha (although it is still in SparkBuild.scala) is no
> longer there in the sbt console.
>
>
>> projects
>
> [info] In file:/Users/chester/projects/spark/
>
> [info]assembly
>
> [info]bagel
>
> [info]catalyst
>
> [info]core
>
> [info]examples
>
> [info]graphx
>
> [info]hive
>
> [info]mllib
>
> [info]oldDeps
>
> [info]repl
>
> [info]spark
>
> [info]sql
>
> [info]streaming
>
> [info]streaming-flume
>
> [info]streaming-kafka
>
> [info]streaming-mqtt
>
> [info]streaming-twitter
>
> [info]streaming-zeromq
>
> [info]tools
>
> [info]yarn
>
> [info]  * yarn-stable
>
>
> On Wed, Jul 16, 2014 at 5:41 PM, Chester Chen  wrote:
>
>> Hmm
>> looks like a Build script issue:
>>
>> I run the command with :
>>
>> sbt/sbt clean *yarn/*test:compile
>>
>> but errors came from
>>
>> [error] 40 errors found
>>
>> [error] (*yarn-stable*/compile:compile) Compilation failed
>>
>>
>> Chester
>>
>>
>> On Wed, Jul 16, 2014 at 5:18 PM, Chester Chen 
>> wrote:
>>
>>> Hi, Sandy
>>>
>>> We do have some issues with this. The difference is between Yarn-Alpha and
>>> Yarn-Stable. (I noticed that in the latest build, the module names have
>>> changed:
>>>  yarn-alpha --> yarn
>>>  yarn --> yarn-stable
>>> )
>>>
>>> For example:  MRJobConfig.class
>>> the field:
>>> "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH"
>>>
>>>
>>> In Yarn-Alpha : the field returns   java.lang.String[]
>>>
>>>   java.lang.String[] DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;
>>>
>>> while in Yarn-Stable, it returns a String
>>>
>>>   java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;
>>>
>>> So in ClientBaseSuite.scala
>>>
>>> The following code:
>>>
>>> val knownDefMRAppCP: Seq[String] =
>>>   getFieldValue[*String*, Seq[String]](classOf[MRJobConfig],
>>>
>>>  "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
>>>  Seq[String]())(a =>
>>> *a.split(",")*)
>>>
>>>
>>> works for yarn-stable, but doesn't work for yarn-alpha.
>>>
>>> This is the only failure for the SNAPSHOT I downloaded 2 weeks ago.  I
>>> believe this can be refactored into the yarn-alpha module, with different
>>> tests according to the different API signatures.
>>>
>>>  I just updated the master branch and the build doesn't even compile for the
>>> Yarn-Alpha (yarn) module. Yarn-Stable compiles with no errors and the tests pass.
>>>
>>>
>>> Does the Spark Jenkins job run against yarn-alpha ?
>>>
>>>
>>>
>>>
>>>
>>> Here is output from yarn-alpha compilation:
>>>
>>> I got the 40 compilation errors.
>>>
>>> sbt/sbt clean yarn/test:compile
>>>
>>> Using /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home as
>>> default JAVA_HOME.
>>>
>>> Note, this will be overridden by -java-home if it is set.
>>>
>>> [info] Loading project definition from
>>> /Users/chester/projects/spark/project/project
>>>
>>> [info] Loading project definition from
>>> /Users/chester/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/project
>>>
>>> [warn] Multiple resolvers having different access mechanism configured
>>> with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
>>> project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
>>>
>>> [info] Loading project definition from
>>> /Users/chester/projects/spark/project
>>>
>>> NOTE: SPARK_HADOOP_VERSION is deprecated, please use
>>> -Dhadoop.version=2.0.5-alpha
>>>
>>> NOTE: SPARK_YARN is deprecated, please use -Pyarn flag.
>>>
>>> [info] Set current project to spark-parent (in build
>>> file:/Users/chester/projects/spark/)
>>>
>>> [success] Total time: 0 s, completed Jul 16, 2014 5:13:06 PM
>>>
>>> [info] Updating {file:/Users/chester/projects/spark/}core...
>>>
>>> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
>>>
>>> [info] Done updating.
>>>
>>> [info] Updating {file:/Users/chester/projects/spark/}yarn...
>>>
>>> [info] Updating {file:/Users/chester/projects/spark/}yarn-stable...
>>>
>>> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
>>>
>>> [info] Done updating.
>>>
>>> [info] Reso

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sandy Ryza
To add, we've made some effort to get yarn-alpha to work with the 2.0.x line,
but this was a time when YARN went through wild API changes.  The only line
that the yarn-alpha profile is guaranteed to work against is the 0.23 line.


On Thu, Jul 17, 2014 at 12:40 AM, Sean Owen  wrote:

> Are you setting -Pyarn-alpha? ./sbt/sbt -Pyarn-alpha, followed by
> "projects", shows it as a module. You should only build yarn-stable
> *or* yarn-alpha at any given time.
>
> I don't remember the modules changing in a while. 'yarn-alpha' is for
> YARN before it stabilized, circa early Hadoop 2.0.x. 'yarn-stable' is
> for beta and stable YARN, circa late Hadoop 2.0.x and onwards. 'yarn'
> is code common to both, so should compile with yarn-alpha.
>
> What's the compile error, and are you setting yarn.version? the
> default is to use hadoop.version, but that defaults to 1.0.4 and there
> is no such YARN.
>
> Unless I missed it, I only see compile errors in yarn-stable, and you
> are trying to compile vs YARN alpha versions no?
>
> On Thu, Jul 17, 2014 at 5:39 AM, Chester Chen 
> wrote:
> > Looking further, the yarn and yarn-stable are both for the stable version
> > of Yarn, that explains the compilation errors when using 2.0.5-alpha
> > version of hadoop.
> >
> > the module yarn-alpha ( although is still on SparkBuild.scala), is no
> > longer there in sbt console.
> >
> >
> >> projects
> >
> > [info] In file:/Users/chester/projects/spark/
> >
> > [info]assembly
> >
> > [info]bagel
> >
> > [info]catalyst
> >
> > [info]core
> >
> > [info]examples
> >
> > [info]graphx
> >
> > [info]hive
> >
> > [info]mllib
> >
> > [info]oldDeps
> >
> > [info]repl
> >
> > [info]spark
> >
> > [info]sql
> >
> > [info]streaming
> >
> > [info]streaming-flume
> >
> > [info]streaming-kafka
> >
> > [info]streaming-mqtt
> >
> > [info]streaming-twitter
> >
> > [info]streaming-zeromq
> >
> > [info]tools
> >
> > [info]yarn
> >
> > [info]  * yarn-stable
> >
> >
> > On Wed, Jul 16, 2014 at 5:41 PM, Chester Chen 
> wrote:
> >
> >> Hmm
> >> looks like a Build script issue:
> >>
> >> I run the command with :
> >>
> >> sbt/sbt clean *yarn/*test:compile
> >>
> >> but errors came from
> >>
> >> [error] 40 errors found
> >>
> >> [error] (*yarn-stable*/compile:compile) Compilation failed
> >>
> >>
> >> Chester
> >>
> >>
> >> On Wed, Jul 16, 2014 at 5:18 PM, Chester Chen 
> >> wrote:
> >>
> >>> Hi, Sandy
> >>>
> >>> We do have some issue with this. The difference is in Yarn-Alpha
> and
> >>> Yarn Stable ( I noticed that in the latest build, the module name has
> >>> changed,
> >>>  yarn-alpha --> yarn
> >>>  yarn --> yarn-stable
> >>> )
> >>>
> >>> For example:  MRJobConfig.class
> >>> the field:
> >>> "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH"
> >>>
> >>>
> >>> In Yarn-Alpha : the field returns   java.lang.String[]
> >>>
> >>>   java.lang.String[] DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;
> >>>
> >>> while in Yarn-Stable, it returns a String
> >>>
> >>>   java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;
> >>>
> >>> So in ClientBaseSuite.scala
> >>>
> >>> The following code:
> >>>
> >>> val knownDefMRAppCP: Seq[String] =
> >>>   getFieldValue[*String*, Seq[String]](classOf[MRJobConfig],
> >>>
> >>>  "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
> >>>  Seq[String]())(a =>
> >>> *a.split(",")*)
> >>>
> >>>
> >>> works for yarn-stable, but doesn't work for yarn-alpha.
> >>>
> >>> This is the only failure for the SNAPSHOT I downloaded 2 weeks ago.  I
> >>> believe this can be refactored to yarn-alpha module and make different
> >>> tests according different API signatures.
> >>>
> >>>  I just update the master branch and build doesn't even compile for
> >>> Yarn-Alpha (yarn) model. Yarn-Stable compile with no error and test
> passed.
> >>>
> >>>
> >>> Does the Spark Jenkins job run against yarn-alpha ?
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Here is output from yarn-alpha compilation:
> >>>
> >>> I got the 40 compilation errors.
> >>>
> >>> sbt/sbt clean yarn/test:compile
> >>>
> >>> Using /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home
> as
> >>> default JAVA_HOME.
> >>>
> >>> Note, this will be overridden by -java-home if it is set.
> >>>
> >>> [info] Loading project definition from
> >>> /Users/chester/projects/spark/project/project
> >>>
> >>> [info] Loading project definition from
> >>>
> /Users/chester/.sbt/0.13/staging/ec3aa8f39111944cc5f2/sbt-pom-reader/project
> >>>
> >>> [warn] Multiple resolvers having different access mechanism configured
> >>> with same name 'sbt-plugin-releases'. To avoid conflict, Remove
> duplicate
> >>> project resolvers (`resolvers`) or rename publishing resolver
> (`publishTo`).
> >>>
> >>> [info] Loading project definition from
> >>> /Users/chester/projects/spark/project
> >>>
> >>> NOTE: SPARK_HADOOP_VERSION is deprecated, please use
> >>> -Dhadoop.version=2.0.5-alpha
> >>>
> >>> NOT

[VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
Please vote on releasing the following candidate as Apache Spark version 0.9.2!

The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~meng/spark-0.9.2-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/meng.asc

The staging repository for this release can be found at:
https://repository.apache.org/service/local/repositories/orgapachespark-1023/content/

The documentation corresponding to this release can be found at:
http://people.apache.org/~meng/spark-0.9.2-rc1-docs/

Please vote on releasing this package as Apache Spark 0.9.2!

The vote is open until Sunday, July 20, at 11:10 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 0.9.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

=== About this release ===
This release fixes a few high-priority bugs in 0.9.1 and has a variety
of smaller fixes. The full list is here: http://s.apache.org/d0t. Some
of the more visible patches are:

SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size
SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
SPARK-1676: HDFS FileSystems continually pile up in the FS cache
SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
SPARK-1870: Secondary jars are not added to executor classpath for YARN

This is the second maintenance release on the 0.9 line. We plan to make
additional maintenance releases as new fixes come in.

Best,
Xiangrui


Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
@Sean and @Sandy

   Thanks for the reply. I used to be able to see the yarn-alpha and yarn
directories, which correspond to the modules.

   I guess that, due to the recent SparkBuild.scala changes, I did not see
yarn-alpha (by default), and I thought yarn-alpha had been renamed to "yarn" and
"yarn-stable" was the old yarn. So I compiled "yarn" against
hadoop.version = 2.0.5-alpha.  My mistake.



I tried
export SPARK_HADOOP_VERSION=2.0.5-alpha
sbt/sbt -Pyarn-alpha  yarn-alpha/test

the compilation errors are all gone.

sbt/sbt -Pyarn-alpha projects

does show the yarn-alpha project; I did not realize this is dynamically
enabled based on the yarn profile flag. Thanks, Sean, for pointing that out.

To Sandy's point, I am not trying to use an alpha version of Yarn. I am
experimenting with some changes in the Yarn client, refactoring code, and just
want to make sure I am passing the tests for both yarn-alpha and yarn-stable.


The yarn-alpha tests are actually failing due to the YARN API differences in
the MRJobConfig class.

As I mentioned in an earlier email, the field

DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH

is a String in yarn-stable, but a String array in the yarn-alpha API.

So the method in ClientBaseSuite.scala


val knownDefMRAppCP: Seq[String] =
  getFieldValue[String, Seq[String]](classOf[MRJobConfig],

 "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
 Seq[String]())(a => a.split(","))

will fail for yarn-alpha.

sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha yarn-alpha/test

...

4/07/17 07:07:16 INFO ClientBase: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties

[info] - default Yarn application classpath *** FAILED ***

[info]   java.lang.ClassCastException: [Ljava.lang.String; cannot be cast
to java.lang.String

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$$anonfun$12.apply(ClientBaseSuite.scala:152)

[info]   at scala.Option.map(Option.scala:145)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.getFieldValue(ClientBaseSuite.scala:180)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$.<init>(ClientBaseSuite.scala:152)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures$lzycompute(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$1.apply$mcV$sp(ClientBaseSuite.scala:47)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$1.apply(ClientBaseSuite.scala:47)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$1.apply(ClientBaseSuite.scala:47)

[info]   at
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)

[info]   ...

[info] - default MR application classpath *** FAILED ***

[info]   java.lang.ClassCastException: [Ljava.lang.String; cannot be cast
to java.lang.String

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$$anonfun$12.apply(ClientBaseSuite.scala:152)

[info]   at scala.Option.map(Option.scala:145)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.getFieldValue(ClientBaseSuite.scala:180)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$.<init>(ClientBaseSuite.scala:152)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures$lzycompute(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$2.apply$mcV$sp(ClientBaseSuite.scala:51)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$2.apply(ClientBaseSuite.scala:51)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$2.apply(ClientBaseSuite.scala:51)

[info]   at
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)

[info]   ...

[info] - resultant classpath for an application that defines a classpath
for YARN *** FAILED ***

[info]   java.lang.ClassCastException: [Ljava.lang.String; cannot be cast
to java.lang.String

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$$anonfun$12.apply(ClientBaseSuite.scala:152)

[info]   at scala.Option.map(Option.scala:145)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.getFieldValue(ClientBaseSuite.scala:180)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$Fixtures$.<init>(ClientBaseSuite.scala:152)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures$lzycompute(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite.Fixtures(ClientBaseSuite.scala:141)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$3.apply$mcV$sp(ClientBaseSuite.scala:55)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$3.apply(ClientBaseSuite.scala:55)

[info]   at
org.apache.spark.deploy.yarn.ClientBaseSuite$$anonfun$3.apply(ClientBaseSuite.scala:55)

[info]   at
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)


Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
I'm trying to compile the latest code, with the hadoop-version set for
2.0.0-mr1-cdh4.6.0.

I'm getting the following error, which I don't get when I don't set the
hadoop version:

[error]
/data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
overloaded method constructor NioServerSocketChannelFactory with
alternatives:
[error]   (x$1: java.util.concurrent.Executor,x$2:
java.util.concurrent.Executor,x$3:
Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory 
[error]   (x$1: java.util.concurrent.Executor,x$2:
java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
[error]  cannot be applied to ()
[error]   val channelFactory = new NioServerSocketChannelFactory
[error]^
[error] one error found


I don't know flume from a hole in the wall - does anyone know what I can do
to fix this?


Thanks,
 -Nathan


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
This looks like a Jetty version problem actually. Are you bringing in
something that might be changing the version of Jetty used by Spark?
It depends a lot on how you are building things.

Good to specify exactly how you're building here.

On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld
 wrote:
> I'm trying to compile the latest code, with the hadoop-version set for
> 2.0.0-mr1-cdh4.6.0.
>
> I'm getting the following error, which I don't get when I don't set the
> hadoop version:
>
> [error]
> /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
> overloaded method constructor NioServerSocketChannelFactory with
> alternatives:
> [error]   (x$1: java.util.concurrent.Executor,x$2:
> java.util.concurrent.Executor,x$3:
> Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory 
> [error]   (x$1: java.util.concurrent.Executor,x$2:
> java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
> [error]  cannot be applied to ()
> [error]   val channelFactory = new NioServerSocketChannelFactory
> [error]^
> [error] one error found
>
>
> I don't know flume from a hole in the wall - does anyone know what I can do
> to fix this?
>
>
> Thanks,
>  -Nathan
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com


Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Looks like a real problem. I see it too. I think the same workaround
found in ClientBase.scala needs to be used here. There, the fact that
this field can be a String or String[] is handled explicitly. In fact
I think you can just call to ClientBase for this? PR it, I say.
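
Roughly, the normalization in question looks like the following; this is a
sketch only, not the actual ClientBase code:

import org.apache.hadoop.mapreduce.MRJobConfig

// The reflective lookup returns AnyRef; per the discussion above, yarn-alpha
// declares the field as String[] while yarn-stable declares it as a
// comma-separated String, so handle both shapes. (A robust version would also
// guard against the field being absent in some Hadoop versions.)
val raw = classOf[MRJobConfig]
  .getField("DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH")
  .get(null)

val knownDefMRAppCP: Seq[String] = raw match {
  case s: String          => s.split(",").toSeq
  case arr: Array[String] => arr.toSeq
  case _                  => Seq.empty[String]
}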

On Thu, Jul 17, 2014 at 3:24 PM, Chester Chen  wrote:
> val knownDefMRAppCP: Seq[String] =
>   getFieldValue[String, Seq[String]](classOf[MRJobConfig],
>
>  "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
>  Seq[String]())(a => a.split(","))
>
> will fail for yarn-alpha.
>
> sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha yarn-alpha/test
>


Re: Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
My full build command is:
./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly


I've changed one line in RDD.scala, nothing else.



On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen  wrote:

> This looks like a Jetty version problem actually. Are you bringing in
> something that might be changing the version of Jetty used by Spark?
> It depends a lot on how you are building things.
>
> Good to specify exactly how your'e building here.
>
> On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld
>  wrote:
> > I'm trying to compile the latest code, with the hadoop-version set for
> > 2.0.0-mr1-cdh4.6.0.
> >
> > I'm getting the following error, which I don't get when I don't set the
> > hadoop version:
> >
> > [error]
> >
> /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
> > overloaded method constructor NioServerSocketChannelFactory with
> > alternatives:
> > [error]   (x$1: java.util.concurrent.Executor,x$2:
> > java.util.concurrent.Executor,x$3:
> > Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
> 
> > [error]   (x$1: java.util.concurrent.Executor,x$2:
> >
> java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
> > [error]  cannot be applied to ()
> > [error]   val channelFactory = new NioServerSocketChannelFactory
> > [error]^
> > [error] one error found
> >
> >
> > I don't know flume from a hole in the wall - does anyone know what I can
> do
> > to fix this?
> >
> >
> > Thanks,
> >  -Nathan
> >
> >
> > --
> > Nathan Kronenfeld
> > Senior Visualization Developer
> > Oculus Info Inc
> > 2 Berkeley Street, Suite 600,
> > Toronto, Ontario M5A 4J5
> > Phone:  +1-416-203-3003 x 238
> > Email:  nkronenf...@oculusinfo.com
>



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
er, that line being in toDebugString, where it really shouldn't affect
anything (no signature changes or the like)


On Thu, Jul 17, 2014 at 10:58 AM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> My full build command is:
> ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly
>
>
> I've changed one line in RDD.scala, nothing else.
>
>
>
> On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen  wrote:
>
>> This looks like a Jetty version problem actually. Are you bringing in
>> something that might be changing the version of Jetty used by Spark?
>> It depends a lot on how you are building things.
>>
>> Good to specify exactly how your'e building here.
>>
>> On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld
>>  wrote:
>> > I'm trying to compile the latest code, with the hadoop-version set for
>> > 2.0.0-mr1-cdh4.6.0.
>> >
>> > I'm getting the following error, which I don't get when I don't set the
>> > hadoop version:
>> >
>> > [error]
>> >
>> /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
>> > overloaded method constructor NioServerSocketChannelFactory with
>> > alternatives:
>> > [error]   (x$1: java.util.concurrent.Executor,x$2:
>> > java.util.concurrent.Executor,x$3:
>> > Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
>> 
>> > [error]   (x$1: java.util.concurrent.Executor,x$2:
>> >
>> java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
>> > [error]  cannot be applied to ()
>> > [error]   val channelFactory = new NioServerSocketChannelFactory
>> > [error]^
>> > [error] one error found
>> >
>> >
>> > I don't know flume from a hole in the wall - does anyone know what I
>> can do
>> > to fix this?
>> >
>> >
>> > Thanks,
>> >  -Nathan
>> >
>> >
>> > --
>> > Nathan Kronenfeld
>> > Senior Visualization Developer
>> > Oculus Info Inc
>> > 2 Berkeley Street, Suite 600,
>> > Toronto, Ontario M5A 4J5
>> > Phone:  +1-416-203-3003 x 238
>> > Email:  nkronenf...@oculusinfo.com
>>
>
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com
>



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: Does RDD checkpointing store the entire state in HDFS?

2014-07-17 Thread Yan Fang
Thank you, TD !

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108


On Wed, Jul 16, 2014 at 6:53 PM, Tathagata Das 
wrote:

> After every checkpointing interval, the latest state RDD is stored to HDFS
> in its entirety. Along with that, the series of DStream transformations
> that was setup with the streaming context is also stored into HDFS (the
> whole DAG of DStream objects is serialized and saved).
>
> TD
>
>
> On Wed, Jul 16, 2014 at 5:38 PM, Yan Fang  wrote:
>
> > Hi guys,
> >
> > am wondering how the RDD checkpointing
> > <https://spark.apache.org/docs/latest/streaming-programming-guide.html#RDD Checkpointing>
> > works in Spark Streaming. When I use updateStateByKey, does
> > the Spark store the entire state (at one time point) into the HDFS or only
> > put the transformation into the HDFS? Thank you.
> >
> > Best,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> > +1 (206) 849-4108
> >
>


Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
CC tmalaska since he touched the line in question. This is a fun one.
So, here's the line of code added last week:

val channelFactory = new NioServerSocketChannelFactory
  (Executors.newCachedThreadPool(), Executors.newCachedThreadPool());

Scala parses this as two statements, one invoking a no-arg constructor
and one making a tuple for fun. Put it on one line and it's fine.

It works with newer Netty since there is a no-arg constructor. It
fails with older Netty, which is what you get with older Hadoop.

The fix is obvious. I'm away and if nobody beats me to a PR in the
meantime, I'll propose one as an addendum to the recent JIRA.
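
For concreteness, a sketch of the one-line form described above, with the
imports the surrounding code needs (illustrative, not necessarily the exact
eventual patch):

import java.util.concurrent.Executors
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory

// Keeping the argument list on the same line as `new` avoids the semicolon
// inference split, and uses the two-Executor constructor that exists in both
// old and new Netty.
val channelFactory = new NioServerSocketChannelFactory(
  Executors.newCachedThreadPool(), Executors.newCachedThreadPool())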

Sean

*

On Thu, Jul 17, 2014 at 3:58 PM, Nathan Kronenfeld
 wrote:
> My full build command is:
> ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly
>
>
> I've changed one line in RDD.scala, nothing else.
>
>
>
> On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen  wrote:
>
>> This looks like a Jetty version problem actually. Are you bringing in
>> something that might be changing the version of Jetty used by Spark?
>> It depends a lot on how you are building things.
>>
>> Good to specify exactly how your'e building here.
>>
>> On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld
>>  wrote:
>> > I'm trying to compile the latest code, with the hadoop-version set for
>> > 2.0.0-mr1-cdh4.6.0.
>> >
>> > I'm getting the following error, which I don't get when I don't set the
>> > hadoop version:
>> >
>> > [error]
>> >
>> /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
>> > overloaded method constructor NioServerSocketChannelFactory with
>> > alternatives:
>> > [error]   (x$1: java.util.concurrent.Executor,x$2:
>> > java.util.concurrent.Executor,x$3:
>> > Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
>> 
>> > [error]   (x$1: java.util.concurrent.Executor,x$2:
>> >
>> java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
>> > [error]  cannot be applied to ()
>> > [error]   val channelFactory = new NioServerSocketChannelFactory
>> > [error]^
>> > [error] one error found
>> >
>> >
>> > I don't know flume from a hole in the wall - does anyone know what I can
>> do
>> > to fix this?
>> >
>> >
>> > Thanks,
>> >  -Nathan
>> >
>> >
>> > --
>> > Nathan Kronenfeld
>> > Senior Visualization Developer
>> > Oculus Info Inc
>> > 2 Berkeley Street, Suite 600,
>> > Toronto, Ontario M5A 4J5
>> > Phone:  +1-416-203-3003 x 238
>> > Email:  nkronenf...@oculusinfo.com
>>
>
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com


Re: Compile error when compiling for cloudera

2014-07-17 Thread Ted Malaska
Don't make this change yet.  I have a 1642 that needs to get through around
the same code.

I can make this change after 1642 is through.


On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen  wrote:

> CC tmalaska since he touched the line in question. This is a fun one.
> So, here's the line of code added last week:
>
> val channelFactory = new NioServerSocketChannelFactory
>   (Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
>
> Scala parses this as two statements, one invoking a no-arg constructor
> and one making a tuple for fun. Put it on one line and it's fine.
>
> It works with newer Netty since there is a no-arg constructor. It
> fails with older Netty, which is what you get with older Hadoop.
>
> The fix is obvious. I'm away and if nobody beats me to a PR in the
> meantime, I'll propose one as an addendum to the recent JIRA.
>
> Sean
>
> *
>
> On Thu, Jul 17, 2014 at 3:58 PM, Nathan Kronenfeld
>  wrote:
> > My full build command is:
> > ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly
> >
> >
> > I've changed one line in RDD.scala, nothing else.
> >
> >
> >
> > On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen  wrote:
> >
> >> This looks like a Jetty version problem actually. Are you bringing in
> >> something that might be changing the version of Jetty used by Spark?
> >> It depends a lot on how you are building things.
> >>
> >> Good to specify exactly how your'e building here.
> >>
> >> On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld
> >>  wrote:
> >> > I'm trying to compile the latest code, with the hadoop-version set for
> >> > 2.0.0-mr1-cdh4.6.0.
> >> >
> >> > I'm getting the following error, which I don't get when I don't set
> the
> >> > hadoop version:
> >> >
> >> > [error]
> >> >
> >>
> /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
> >> > overloaded method constructor NioServerSocketChannelFactory with
> >> > alternatives:
> >> > [error]   (x$1: java.util.concurrent.Executor,x$2:
> >> > java.util.concurrent.Executor,x$3:
> >> > Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
> >> 
> >> > [error]   (x$1: java.util.concurrent.Executor,x$2:
> >> >
> >>
> java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
> >> > [error]  cannot be applied to ()
> >> > [error]   val channelFactory = new NioServerSocketChannelFactory
> >> > [error]^
> >> > [error] one error found
> >> >
> >> >
> >> > I don't know flume from a hole in the wall - does anyone know what I
> can
> >> do
> >> > to fix this?
> >> >
> >> >
> >> > Thanks,
> >> >  -Nathan
> >> >
> >> >
> >> > --
> >> > Nathan Kronenfeld
> >> > Senior Visualization Developer
> >> > Oculus Info Inc
> >> > 2 Berkeley Street, Suite 600,
> >> > Toronto, Ontario M5A 4J5
> >> > Phone:  +1-416-203-3003 x 238
> >> > Email:  nkronenf...@oculusinfo.com
> >>
> >
> >
> >
> > --
> > Nathan Kronenfeld
> > Senior Visualization Developer
> > Oculus Info Inc
> > 2 Berkeley Street, Suite 600,
> > Toronto, Ontario M5A 4J5
> > Phone:  +1-416-203-3003 x 238
> > Email:  nkronenf...@oculusinfo.com
>


Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
OK, I will create a PR.

thanks



On Thu, Jul 17, 2014 at 7:58 AM, Sean Owen  wrote:

> Looks like a real problem. I see it too. I think the same workaround
> found in ClientBase.scala needs to be used here. There, the fact that
> this field can be a String or String[] is handled explicitly. In fact
> I think you can just call to ClientBase for this? PR it, I say.
>
> On Thu, Jul 17, 2014 at 3:24 PM, Chester Chen 
> wrote:
> > val knownDefMRAppCP: Seq[String] =
> >   getFieldValue[String, Seq[String]](classOf[MRJobConfig],
> >
> >  "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
> >  Seq[String]())(a =>
> a.split(","))
> >
> > will fail for yarn-alpha.
> >
> > sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha yarn-alpha/test
> >
>


Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
Should be an easy rebase for your PR, so I went ahead just to get this fixed up:

https://github.com/apache/spark/pull/1466

On Thu, Jul 17, 2014 at 5:32 PM, Ted Malaska  wrote:
> Don't make this change yet.  I have a 1642 that needs to get through around
> the same code.
>
> I can make this change after 1642 is through.
>
>
> On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen  wrote:
>>
>> CC tmalaska since he touched the line in question. This is a fun one.
>> So, here's the line of code added last week:
>>
>> val channelFactory = new NioServerSocketChannelFactory
>>   (Executors.newCachedThreadPool(), Executors.newCachedThreadPool());
>>
>> Scala parses this as two statements, one invoking a no-arg constructor
>> and one making a tuple for fun. Put it on one line and it's fine.
>>
>> It works with newer Netty since there is a no-arg constructor. It
>> fails with older Netty, which is what you get with older Hadoop.
>>
>> The fix is obvious. I'm away and if nobody beats me to a PR in the
>> meantime, I'll propose one as an addendum to the recent JIRA.
>>
>> Sean
>>
>> *
>>
>> On Thu, Jul 17, 2014 at 3:58 PM, Nathan Kronenfeld
>>  wrote:
>> > My full build command is:
>> > ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly
>> >
>> >
>> > I've changed one line in RDD.scala, nothing else.
>> >
>> >
>> >
>> > On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen  wrote:
>> >
>> >> This looks like a Jetty version problem actually. Are you bringing in
>> >> something that might be changing the version of Jetty used by Spark?
>> >> It depends a lot on how you are building things.
>> >>
>> >> Good to specify exactly how your'e building here.
>> >>
>> >> On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld
>> >>  wrote:
>> >> > I'm trying to compile the latest code, with the hadoop-version set
>> >> > for
>> >> > 2.0.0-mr1-cdh4.6.0.
>> >> >
>> >> > I'm getting the following error, which I don't get when I don't set
>> >> > the
>> >> > hadoop version:
>> >> >
>> >> > [error]
>> >> >
>> >>
>> >> /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
>> >> > overloaded method constructor NioServerSocketChannelFactory with
>> >> > alternatives:
>> >> > [error]   (x$1: java.util.concurrent.Executor,x$2:
>> >> > java.util.concurrent.Executor,x$3:
>> >> > Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
>> >> 
>> >> > [error]   (x$1: java.util.concurrent.Executor,x$2:
>> >> >
>> >>
>> >> java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
>> >> > [error]  cannot be applied to ()
>> >> > [error]   val channelFactory = new NioServerSocketChannelFactory
>> >> > [error]^
>> >> > [error] one error found
>> >> >
>> >> >
>> >> > I don't know flume from a hole in the wall - does anyone know what I
>> >> > can
>> >> do
>> >> > to fix this?
>> >> >
>> >> >
>> >> > Thanks,
>> >> >  -Nathan
>> >> >
>> >> >
>> >> > --
>> >> > Nathan Kronenfeld
>> >> > Senior Visualization Developer
>> >> > Oculus Info Inc
>> >> > 2 Berkeley Street, Suite 600,
>> >> > Toronto, Ontario M5A 4J5
>> >> > Phone:  +1-416-203-3003 x 238
>> >> > Email:  nkronenf...@oculusinfo.com
>> >>
>> >
>> >
>> >
>> > --
>> > Nathan Kronenfeld
>> > Senior Visualization Developer
>> > Oculus Info Inc
>> > 2 Berkeley Street, Suite 600,
>> > Toronto, Ontario M5A 4J5
>> > Phone:  +1-416-203-3003 x 238
>> > Email:  nkronenf...@oculusinfo.com
>
>


Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-17 Thread Nicholas Chammas
On Thu, Jul 17, 2014 at 1:23 AM, Stephen Haberman <
stephen.haber...@gmail.com> wrote:

> I'd be ecstatic if more major changes were this well/succinctly
> explained
>

Ditto on that. The summary of user impact was very nice. It would be good
to repeat that on the user list or release notes when this change goes out.

Nick


Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
I start the voting with a +1.

Ran tests on the release candidates and some basic operations in
spark-shell and pyspark (local and standalone).

-Xiangrui

On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 0.9.2!
>
> The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~meng/spark-0.9.2-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/meng.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/service/local/repositories/orgapachespark-1023/content/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~meng/spark-0.9.2-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 0.9.2!
>
> The vote is open until Sunday, July 20, at 11:10 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 0.9.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> === About this release ===
> This release fixes a few high-priority bugs in 0.9.1 and has a variety
> of smaller fixes. The full list is here: http://s.apache.org/d0t. Some
> of the more visible patches are:
>
> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size
> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> SPARK-1676: HDFS FileSystems continually pile up in the FS cache
> SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
> SPARK-1870: Secondary jars are not added to executor classpath for YARN
>
> This is the second maintenance release on the 0.9 line. We plan to make
> additional maintenance releases as new fixes come in.
>
> Best,
> Xiangrui


InputSplit and RecordReader control on HadoopRDD

2014-07-17 Thread Nick R. Katsipoulakis
Hello,

I am currently trying to extend some custom InputSplit and RecordReader
classes to provide to SparkContext's hadoopRDD() function.
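
For context, here is a minimal sketch of this kind of wiring, with the stock
TextInputFormat standing in for the custom InputSplit/RecordReader classes and
a placeholder input path:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "custom-input-format-demo")

// hadoopRDD uses the old (mapred) API: the InputFormat's getSplits() produces
// the InputSplits, and each split's RecordReader feeds records to its task.
val jobConf = new JobConf()
FileInputFormat.setInputPaths(jobConf, "/path/to/input")

val rdd = sc.hadoopRDD(
  jobConf,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  2 /* minPartitions */)

rdd.map { case (_, line) => line.toString.length }.count()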

My question is the following:

Does the value returned by InputSplit.getLength() and/or
RecordReader.getProgress() affect the execution of a map() function in the
Spark runtime?

I am asking because I have used these two custom classes on Hadoop and they
do not cause any problems. However, in Spark, I see that new InputSplit
objects are generated during runtime. To be more precise:

In the beginning, I see in my log file that an InputSplit object is
generated and the RecordReader object associated with it is fetching records.
At some point, the job that is handling the previous InputSplit stops, and
a new one is spawned with a new InputSplit. I do not understand why this is
happening.

Any help?

Thank you,
Nick

P.S.-1 : I am sorry for posting my question on the Developer Mailing List,
but I could not find anything similar in the User's list. Also, I really
need to understand the runtime of Spark and I believe that in the
developer's list my question will be read by contributors of Spark.

P.S.-2: I can provide more technical details if they are needed.


Current way to include hive in a build

2014-07-17 Thread Stephen Boesch
Having looked at trunk's make-distribution.sh, the --with-hive and --with-yarn
flags are now deprecated.

Here is the way I have built it:

Added to pom.xml:

    <profile>
      <id>cdh5</id>
      <activation>
        <activeByDefault>false</activeByDefault>
      </activation>
      <properties>
        <hadoop.version>2.3.0-cdh5.0.0</hadoop.version>
        <yarn.version>2.3.0-cdh5.0.0</yarn.version>
        <hbase.version>0.96.1.1-cdh5.0.0</hbase.version>
        <zookeeper.version>3.4.5-cdh5.0.0</zookeeper.version>
      </properties>
    </profile>

*mvn -Pyarn -Pcdh5 -Phive -Dhadoop.version=2.3.0-cdh5.0.1
-Dyarn.version=2.3.0-cdh5.0.0 -DskipTests clean package*


[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [3.165s]
[INFO] Spark Project Core  SUCCESS
[2:39.504s]
[INFO] Spark Project Bagel ... SUCCESS [7.596s]
[INFO] Spark Project GraphX .. SUCCESS [22.027s]
[INFO] Spark Project ML Library .. SUCCESS [36.284s]
[INFO] Spark Project Streaming ... SUCCESS [24.309s]
[INFO] Spark Project Tools ... SUCCESS [3.147s]
[INFO] Spark Project Catalyst  SUCCESS [20.148s]
[INFO] Spark Project SQL . SUCCESS [18.560s]
*[INFO] Spark Project Hive  FAILURE
[33.962s]*

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-dependency-plugin:2.4:copy-dependencies
(copy-dependencies) on project spark-hive_2.10: Execution copy-dependencies
of goal
org.apache.maven.plugins:maven-dependency-plugin:2.4:copy-dependencies
failed: Plugin org.apache.maven.plugins:maven-dependency-plugin:2.4 or one
of its dependencies could not be resolved: Could not find artifact
commons-logging:commons-logging:jar:1.0.4 -> [Help 1]

Anyone who is presently building with -Phive and has a suggestion for this?


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, 

Cool discussion! I agree that a more standardized API for clustering, and
easy access to underlying routines, would be useful (we've also been
discussing this when trying to develop streaming clustering algorithms,
similar to https://github.com/apache/spark/pull/1361) 

For divisive, hierarchical clustering I implemented something a while back;
here's a gist:

https://gist.github.com/freeman-lab/5947e7c53b368fe90371

It does bisecting k-means clustering (with k=2), with a recursive class for
keeping track of the tree. I also found this much better than agglomerative
methods (for the reasons Hector points out).
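
In spirit, the recursion is something like this rough sketch on top of MLlib's
KMeans (this is not the code in the gist, just an illustration of the idea):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Recursively split a cluster in two with k=2 k-means until maxDepth is
// reached; the result holds the leaf clusters. E.g. bisect(data, 0, 3)
// yields up to 8 leaves.
def bisect(points: RDD[Vector], depth: Int, maxDepth: Int): Seq[RDD[Vector]] = {
  if (depth >= maxDepth || points.count() < 2) {
    Seq(points)
  } else {
    val model = KMeans.train(points, 2, 20)   // k = 2, 20 iterations
    val left  = points.filter(p => model.predict(p) == 0).cache()
    val right = points.filter(p => model.predict(p) == 1).cache()
    bisect(left, depth + 1, maxDepth) ++ bisect(right, depth + 1, maxDepth)
  }
}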

This needs to be cleaned up, and can surely be optimized (esp. by replacing
the core KMeans step with existing MLLib code), but I can say I was running
it successfully on quite large data sets. 

RJ, depending on where you are in your progress, I'd be happy to help work
on this piece and / or have you use this as a jumping off point, if useful. 

-- Jeremy 



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Matei Zaharia
+1

Tested on Mac, verified CHANGES.txt is good, verified several of the bug fixes.

Matei

On Jul 17, 2014, at 11:12 AM, Xiangrui Meng  wrote:

> I start the voting with a +1.
> 
> Ran tests on the release candidates and some basic operations in
> spark-shell and pyspark (local and standalone).
> 
> -Xiangrui
> 
> On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng  wrote:
>> Please vote on releasing the following candidate as Apache Spark version 
>> 0.9.2!
>> 
>> The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~meng/spark-0.9.2-rc1/
>> 
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/meng.asc
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/service/local/repositories/orgapachespark-1023/content/
>> 
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~meng/spark-0.9.2-rc1-docs/
>> 
>> Please vote on releasing this package as Apache Spark 0.9.2!
>> 
>> The vote is open until Sunday, July 20, at 11:10 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>> 
>> [ ] +1 Release this package as Apache Spark 0.9.2
>> [ ] -1 Do not release this package because ...
>> 
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> 
>> === About this release ===
>> This release fixes a few high-priority bugs in 0.9.1 and has a variety
>> of smaller fixes. The full list is here: http://s.apache.org/d0t. Some
>> of the more visible patches are:
>> 
>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size
>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
>> SPARK-1676: HDFS FileSystems continually pile up in the FS cache
>> SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
>> SPARK-1870: Secondary jars are not added to executor classpath for YARN
>> 
>> This is the second maintenance release on the 0.9 line. We plan to make
>> additional maintenance releases as new fixes come in.
>> 
>> Best,
>> Xiangrui



Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread DB Tsai
+1

Tested with my Ubuntu Linux.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Jul 17, 2014 at 6:36 PM, Matei Zaharia  wrote:
> +1
>
> Tested on Mac, verified CHANGES.txt is good, verified several of the bug 
> fixes.
>
> Matei
>
> On Jul 17, 2014, at 11:12 AM, Xiangrui Meng  wrote:
>
>> I start the voting with a +1.
>>
>> Ran tests on the release candidates and some basic operations in
>> spark-shell and pyspark (local and standalone).
>>
>> -Xiangrui
>>
>> On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng  wrote:
>>> Please vote on releasing the following candidate as Apache Spark version 
>>> 0.9.2!
>>>
>>> The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba):
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~meng/spark-0.9.2-rc1/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/meng.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/service/local/repositories/orgapachespark-1023/content/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~meng/spark-0.9.2-rc1-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 0.9.2!
>>>
>>> The vote is open until Sunday, July 20, at 11:10 UTC and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 0.9.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> === About this release ===
>>> This release fixes a few high-priority bugs in 0.9.1 and has a variety
>>> of smaller fixes. The full list is here: http://s.apache.org/d0t. Some
>>> of the more visible patches are:
>>>
>>> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size
>>> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
>>> SPARK-1676: HDFS FileSystems continually pile up in the FS cache
>>> SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
>>> SPARK-1870: Secondary jars are not added to executor classpath for YARN
>>>
>>> This is the second maintenance release on the 0.9 line. We plan to make
>>> additional maintenance releases as new fixes come in.
>>>
>>> Best,
>>> Xiangrui
>


preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Will Benton
Hi all,

What's the preferred environment for generating golden test outputs for new 
Hive tests?  In particular:

* what Hadoop version and Hive version should I be using,
* are there particular distributions people have run successfully, and 
* are there any system properties or environment variables (beyond HADOOP_HOME, 
HIVE_HOME, and HIVE_DEV_HOME) I need to set before running the suite?

I ask because I'm getting some errors while trying to add new tests and would 
like to eliminate any possible problems caused by differences between what my 
environment offers and what Spark expects.  (I'm currently running with the 
Fedora packages for Hadoop 2.2.0 and a locally-built Hive 0.12.0.)  Since I'll 
only be using this for generating test outputs, something as simple to set up 
as possible would be great.

(Once I get something working, I'll be happy to write it up and contribute it 
as developer docs.)


thanks,
wb


Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Reynold Xin
+1

On Thursday, July 17, 2014, Matei Zaharia  wrote:

> +1
>
> Tested on Mac, verified CHANGES.txt is good, verified several of the bug
> fixes.
>
> Matei
>
> On Jul 17, 2014, at 11:12 AM, Xiangrui Meng  > wrote:
>
> > I start the voting with a +1.
> >
> > Ran tests on the release candidates and some basic operations in
> > spark-shell and pyspark (local and standalone).
> >
> > -Xiangrui
> >
> > On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng  > wrote:
> >> Please vote on releasing the following candidate as Apache Spark
> version 0.9.2!
> >>
> >> The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba):
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~meng/spark-0.9.2-rc1/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/meng.asc
> >>
> >> The staging repository for this release can be found at:
> >>
> https://repository.apache.org/service/local/repositories/orgapachespark-1023/content/
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~meng/spark-0.9.2-rc1-docs/
> >>
> >> Please vote on releasing this package as Apache Spark 0.9.2!
> >>
> >> The vote is open until Sunday, July 20, at 11:10 UTC and passes if
> >> a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 0.9.2
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> >> http://spark.apache.org/
> >>
> >> === About this release ===
> >> This release fixes a few high-priority bugs in 0.9.1 and has a variety
> >> of smaller fixes. The full list is here: http://s.apache.org/d0t. Some
> >> of the more visible patches are:
> >>
> >> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
> size
> >> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> >> SPARK-1676: HDFS FileSystems continually pile up in the FS cache
> >> SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
> >> SPARK-1870: Secondary jars are not added to executor classpath for YARN
> >>
> >> This is the second maintenance release on the 0.9 line. We plan to make
> >> additional maintenance releases as new fixes come in.
> >>
> >> Best,
> >> Xiangrui
>
>


Re: preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Zongheng Yang
Hi Will,

These three environment variables are needed [1].

I have had success with Hive 0.12 and Hadoop 1.0.4. For Hive, getting
the source distribution seems to be required. Docs contribution will
be much appreciated!
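
For example, something along these lines before running the suite (the paths
are placeholders):

export HADOOP_HOME=/path/to/hadoop-1.0.4
export HIVE_HOME=/path/to/hive-0.12.0
export HIVE_DEV_HOME=/path/to/hive-0.12.0-src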

[1] 
https://github.com/apache/spark/tree/master/sql#other-dependencies-for-developers

Zongheng

On Thu, Jul 17, 2014 at 7:51 PM, Will Benton  wrote:
> Hi all,
>
> What's the preferred environment for generating golden test outputs for new 
> Hive tests?  In particular:
>
> * what Hadoop version and Hive version should I be using,
> * are there particular distributions people have run successfully, and
> * are there any system properties or environment variables (beyond 
> HADOOP_HOME, HIVE_HOME, and HIVE_DEV_HOME) I need to set before running the 
> suite?
>
> I ask because I'm getting some errors while trying to add new tests and would 
> like to eliminate any possible problems caused by differences between what my 
> environment offers and what Spark expects.  (I'm currently running with the 
> Fedora packages for Hadoop 2.2.0 and a locally-built Hive 0.12.0.)  Since 
> I'll only be using this for generating test outputs, something as simple to 
> set up as possible would be great.
>
> (Once I get something working, I'll be happy to write it up and contribute it 
> as developer docs.)
>
>
> thanks,
> wb


Re: Current way to include hive in a build

2014-07-17 Thread Patrick Wendell
Hey Stephen,

The only change to the build was that we ask users to run -Phive and
-Pyarn instead of --with-hive and --with-yarn (which internally just set
-Phive and -Pyarn). I don't think this should affect the dependency
graph.

Just to test this, what happens if you run *without* the CDH profile
and build with hadoop version 2.3.0? Does that work?
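
That is, something along the lines of (a sketch, keeping the other flags the
same):

mvn -Pyarn -Phive -Dhadoop.version=2.3.0 -DskipTests clean package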

- Patrick

On Thu, Jul 17, 2014 at 4:00 PM, Stephen Boesch  wrote:
> Having looked at trunk make-distribution.sh the --with-hive and --with-yarn
> are now deprecated.
>
> Here is the way I have built it:
>
> Added to pom.xml:
>
>
>   cdh5
>   
> false
>   
>   
> 2.3.0-cdh5.0.0
> 2.3.0-cdh5.0.0
> 0.96.1.1-cdh5.0.0
> 3.4.5-cdh5.0.0
>   
> 
>
> *mvn -Pyarn -Pcdh5 -Phive -Dhadoop.version=2.3.0-cdh5.0.1
> -Dyarn.version=2.3.0-cdh5.0.0 -DskipTests clean package*
>
>
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM .. SUCCESS [3.165s]
> [INFO] Spark Project Core  SUCCESS
> [2:39.504s]
> [INFO] Spark Project Bagel ... SUCCESS [7.596s]
> [INFO] Spark Project GraphX .. SUCCESS [22.027s]
> [INFO] Spark Project ML Library .. SUCCESS [36.284s]
> [INFO] Spark Project Streaming ... SUCCESS [24.309s]
> [INFO] Spark Project Tools ... SUCCESS [3.147s]
> [INFO] Spark Project Catalyst  SUCCESS [20.148s]
> [INFO] Spark Project SQL . SUCCESS [18.560s]
> *[INFO] Spark Project Hive  FAILURE
> [33.962s]*
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-dependency-plugin:2.4:copy-dependencies
> (copy-dependencies) on project spark-hive_2.10: Execution copy-dependencies
> of goal
> org.apache.maven.plugins:maven-dependency-plugin:2.4:copy-dependencies
> failed: Plugin org.apache.maven.plugins:maven-dependency-plugin:2.4 or one
> of its dependencies could not be resolved: Could not find artifact
> commons-logging:commons-logging:jar:1.0.4 -> [Help 1]
>
> Anyone who is presently building with -Phive and has a suggestion for this?


Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
UPDATE:

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1023/

The previous repo contains exactly the same content but is mutable.
Thanks, Patrick, for pointing it out!

-Xiangrui

On Thu, Jul 17, 2014 at 7:52 PM, Reynold Xin  wrote:
> +1
>
> On Thursday, July 17, 2014, Matei Zaharia  wrote:
>
>> +1
>>
>> Tested on Mac, verified CHANGES.txt is good, verified several of the bug
>> fixes.
>>
>> Matei
>>
>> On Jul 17, 2014, at 11:12 AM, Xiangrui Meng > > wrote:
>>
>> > I start the voting with a +1.
>> >
>> > Ran tests on the release candidates and some basic operations in
>> > spark-shell and pyspark (local and standalone).
>> >
>> > -Xiangrui
>> >
>> > On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng > > wrote:
>> >> Please vote on releasing the following candidate as Apache Spark
>> version 0.9.2!
>> >>
>> >> The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba):
>> >>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> http://people.apache.org/~meng/spark-0.9.2-rc1/
>> >>
>> >> Release artifacts are signed with the following key:
>> >> https://people.apache.org/keys/committer/meng.asc
>> >>
>> >> The staging repository for this release can be found at:
>> >>
>> https://repository.apache.org/service/local/repositories/orgapachespark-1023/content/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> http://people.apache.org/~meng/spark-0.9.2-rc1-docs/
>> >>
>> >> Please vote on releasing this package as Apache Spark 0.9.2!
>> >>
>> >> The vote is open until Sunday, July 20, at 11:10 UTC and passes if
>> >> a majority of at least 3 +1 PMC votes are cast.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 0.9.2
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see
>> >> http://spark.apache.org/
>> >>
>> >> === About this release ===
>> >> This release fixes a few high-priority bugs in 0.9.1 and has a variety
>> >> of smaller fixes. The full list is here: http://s.apache.org/d0t. Some
>> >> of the more visible patches are:
>> >>
>> >> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
>> size
>> >> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
>> >> SPARK-1676: HDFS FileSystems continually pile up in the FS cache
>> >> SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
>> >> SPARK-1870: Secondary jars are not added to executor classpath for YARN
>> >>
>> >> This is the second maintenance release on the 0.9 line. We plan to make
>> >> additional maintenance releases as new fixes come in.
>> >>
>> >> Best,
>> >> Xiangrui
>>
>>