Re: Welcoming two new committers

2014-08-09 Thread Andrew Or
Thanks everyone. I look forward to continuing to work with all of you!


2014-08-08 3:23 GMT-07:00 Prashant Sharma :

> Congratulations Andrew and Joey.
>
> Prashant Sharma
>
>
>
>
> On Fri, Aug 8, 2014 at 2:10 PM, Xiangrui Meng  wrote:
>
>> Congrats, Joey & Andrew!!
>>
>> -Xiangrui
>>
>> On Fri, Aug 8, 2014 at 12:14 AM, Christopher Nguyen 
>> wrote:
>> > +1 Joey & Andrew :)
>> >
>> > --
>> > Christopher T. Nguyen
>> > Co-founder & CEO, Adatao  [ah-'DAY-tao]
>> > linkedin.com/in/ctnguyen
>> >
>> >
>> >
>> > On Thu, Aug 7, 2014 at 10:39 PM, Joseph Gonzalez <
>> jegon...@eecs.berkeley.edu
>> >> wrote:
>> >
>> >> Hi Everyone,
>> >>
>> >> Thank you for inviting me to be a committer.  I look forward to working
>> >> with everyone to ensure the continued success of the Spark project.
>> >>
>> >> Thanks!
>> >> Joey
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Aug 7, 2014 at 9:57 PM, Matei Zaharia 
>> >> wrote:
>> >>
>> >> > Hi everyone,
>> >> >
>> >> > The PMC recently voted to add two new committers and PMC members:
>> Joey
>> >> > Gonzalez and Andrew Or. Both have been huge contributors in the past
>> year
>> >> > -- Joey on much of GraphX as well as quite a bit of the initial work
>> in
>> >> > MLlib, and Andrew on Spark Core. Join me in welcoming them as
>> committers!
>> >> >
>> >> > Matei
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
Hi Xiangrui,

Based on your suggestion I moved both core and mllib to 1.1.0-SNAPSHOT...I
am still getting the ClassCastException:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 249 in stage 52.0 failed 4 times, most recent
failure: Lost task 249.3 in stage 52.0 (TID 10002,
tblpmidn06adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException:
scala.Tuple1 cannot be cast to scala.Product2

I am running ALS.scala merged with my changes. I will try the mllib jar
without my changes next...
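
For reference, the failing job is essentially a standard MLlib ALS run over
the ratings; a minimal, self-contained sketch of the plain-ALS version of it
(synthetic data, made-up object name, my quadratic-optimization changes
omitted) would look like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-repro"))
    // Tiny synthetic ratings standing in for the real ~400M-rating dataset.
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0), Rating(1, 20, 2.0),
      Rating(2, 10, 5.0), Rating(2, 30, 1.0)))
    // ALS.train materializes the user/product factor RDDs internally
    // (the usersOut.count() step at ALS.scala:311), which is where the
    // ClassCastException is being thrown on the executors.
    val model = ALS.train(ratings, 8, 5, 0.01)
    println(model.predict(1, 30))
    sc.stop()
  }
}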

Could this be due to the fact that my jars are compiled with Java 1.7.0_55
while the cluster JRE is at 1.7.0_45?

Thanks.

Deb




On Wed, Aug 6, 2014 at 12:01 PM, Debasish Das 
wrote:

> I did not play with Hadoop settings...everything is compiled against Hadoop
> 2.3.0-cdh5.0.2 for me...
>
> I did try to bump the HBase version from 0.94 to 0.96 or 0.98,
> but there was no profile for CDH in the pom...but that's unrelated to this!
>
>
> On Wed, Aug 6, 2014 at 9:45 AM, DB Tsai  wrote:
>
>> One related question: is the mllib jar independent of the Hadoop version
>> (i.e., it doesn't use the Hadoop API directly)? Can I use an mllib jar
>> compiled for one version of Hadoop and use it with another version of Hadoop?
>>
>> Sent from my Google Nexus 5
>> On Aug 6, 2014 8:29 AM, "Debasish Das"  wrote:
>>
>>> Hi Xiangrui,
>>>
>>> Maintaining another file will be a pain later so I deployed spark 1.0.1
>>> without mllib and then my application jar bundles mllib 1.1.0-SNAPSHOT
>>> along with the code changes for quadratic optimization...
>>>
>>> Later the plan is to patch the snapshot mllib with the deployed stable
>>> mllib...
>>>
>>> There are 5 variants that I am experimenting with on around 400M ratings
>>> (daily data; I will update with monthly data in a few days)...
>>>
>>> 1. LS
>>> 2. NNLS
>>> 3. Quadratic with bounds
>>> 4. Quadratic with L1
>>> 5. Quadratic with equality and positivity
>>>
>>> Now the ALS 1.1.0 snapshot runs fine until it reaches this step at
>>> ALS.scala:311
>>>
>>> // Materialize usersOut and productsOut.
>>> usersOut.count()
>>>
>>> I am getting from one of the executors: java.lang.ClassCastException:
>>> scala.Tuple1 cannot be cast to scala.Product2
>>>
>>> I am debugging it further, but I was wondering if this is due to RDD
>>> compatibility between 1.0.1 and 1.1.0-SNAPSHOT?
>>>
>>> I have built the jars on my Mac which has Java 1.7.0_55 but the deployed
>>> cluster has Java 1.7.0_45.
>>>
>>> The flow runs fine on my local Spark 1.0.1 with 1 worker. Can that Java
>>> version mismatch cause this?
>>>
>>> Stack traces are below
>>>
>>> Thanks.
>>> Deb
>>>
>>>
>>> Executor stacktrace:
>>>
>>>
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
>>>
>>>
>>>
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
>>>
>>>
>>>
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>
>>>
>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>
>>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
>>>
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>
>>>
>>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>>>
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>
>>>
>>>
>>> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>>>
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>
>>>
>>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>>>
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>
>>>
>>>
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
>>>
>>>
>>>
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:123)
>>>
>>>
>>>
>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>>
>>>
>>>
>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>>
>>>
>>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>>
>>>
>>>
>>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>>
>>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:123)
>>>
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>
>>>
>>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>>>
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>
>>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>
>>>
>>>
>>> org.apa

RE: Welcoming two new committers

2014-08-09 Thread Guru Medasani
Congrats Joey and Andrew!

Sent from my Windows Phone

From: Andrew Or
Sent: 8/9/2014 2:43 AM
To: Prashant Sharma
Cc: Xiangrui Meng; Christopher 
Nguyen; Joseph 
Gonzalez; Matei 
Zaharia; 
d...@spark.incubator.apache.org
Subject: Re: Welcoming two new committers



Re: Unit tests in < 5 minutes

2014-08-09 Thread Mridul Muralidharan
The issue with supporting this, IMO, is that ScalaTest uses the
same JVM for all the tests (the Surefire plugin supports forking, but
ScalaTest ignores it, IIRC).
So different tests would initialize different SparkContexts and could
potentially step on each other's toes.
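
For context, each suite has to manage its own local SparkContext along these
lines (a rough sketch with a hypothetical trait name, approximating what the
existing test helpers do):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Hypothetical helper: gives each suite its own local SparkContext and tears
// it down afterwards, so suites sharing the JVM don't step on each other.
trait SharedLocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
    // Clear the driver port so the next suite in this JVM can bind cleanly.
    System.clearProperty("spark.driver.port")
    super.afterAll()
  }
}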

Regards,
Mridul


On Fri, Aug 8, 2014 at 9:31 PM, Nicholas Chammas
 wrote:
> Howdy,
>
> Do we think it's both feasible and worthwhile to invest in getting our unit
> tests to finish in under 5 minutes (or something similarly brief) when run
> by Jenkins?
>
> Unit tests currently seem to take anywhere from 30 min to 2 hours. As
> people add more tests, I imagine this time will only grow. I think it would
> be better for both contributors and reviewers if they didn't have to wait
> so long for test results; PR reviews would be shorter, if nothing else.
>
> I don't know how this is normally done, but maybe it wouldn't be too
> much work to get a test cycle to feel lighter.
>
> Most unit tests are independent and can be run concurrently, right? Would
> it make sense to build a given patch on many servers at once and send
> disjoint sets of unit tests to each?
>
> I'd be interested in working on something like that if possible (and
> sensible).
>
> Nick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
I validated that I can reproduce this problem with master as well (without
adding any of my mllib changes)...

I separated the mllib jar from the assembly, deployed the assembly, and then
supplied the mllib jar via the --jars option to spark-submit...

I get this error:

14/08/09 12:49:32 INFO DAGScheduler: Failed to run count at ALS.scala:299

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 238 in stage 40.0 failed 4 times, most recent
failure: Lost task 238.3 in stage 40.0 (TID 10002,
tblpmidn05adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException:
scala.Tuple1 cannot be cast to scala.Product2


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)

scala.collection.Iterator$$anon$11.next(Iterator.scala:328)


org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)


scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)


scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)


scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)


scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)


scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)


scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)

org.apache.spark.rdd.RDD.iterator(RDD.scala:227)

org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

org.apache.spark.scheduler.Task.run(Task.scala:54)


org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)


java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)


java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

java.lang.Thread.run(Thread.java:744)

Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:68

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Matt Forbes
I was having this same problem early this week and had to include my
changes in the assembly.


On Sat, Aug 9, 2014 at 9:59 AM, Debasish Das 
wrote:

> I validated that I can reproduce this problem with master as well (without
> adding any of my mllib changes)...
>
> I separated the mllib jar from the assembly, deployed the assembly, and then
> supplied the mllib jar via the --jars option to spark-submit...
>
> I get this error:
>
> 14/08/09 12:49:32 INFO DAGScheduler: Failed to run count at ALS.scala:299
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 238 in stage 40.0 failed 4 times, most recent
> failure: Lost task 238.3 in stage 40.0 (TID 10002,
> tblpmidn05adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException:
> scala.Tuple1 cannot be cast to scala.Product2
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)
>
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
>
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>
>
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
>
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
>
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
>
>
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
>
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
>
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> java.lang.Thread.run(Thread.java:744)
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
>
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)
>
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)
>
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)
>
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> org.apach

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
Including mllib inside the assembly worked fine...If I deploy only the core and
ship mllib via --jars then this problem shows up...

Xiangrui, could you please comment on whether this is a bug or expected
behavior? I will create a JIRA if this needs to be tracked...
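
To be concrete, these are roughly the two invocations I am comparing (the
master URL, jar names, and driver class below are made up; only --jars
differs):

# worked fine in my test: mllib 1.1.0-SNAPSHOT bundled into the app assembly
spark-submit --master spark://host:7077 --class com.example.ALSDriver \
  app-assembly-with-mllib.jar

# shows the ClassCastException: mllib shipped separately via --jars
spark-submit --master spark://host:7077 --class com.example.ALSDriver \
  --jars spark-mllib_2.10-1.1.0-SNAPSHOT.jar app-assembly.jar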


On Sat, Aug 9, 2014 at 11:01 AM, Matt Forbes  wrote:

> I was having this same problem early this week and had to include my
> changes in the assembly.
>
>
> On Sat, Aug 9, 2014 at 9:59 AM, Debasish Das 
> wrote:
>
>> I validated that I can reproduce this problem with master as well (without
>> adding any of my mllib changes)...
>>
>> I separated the mllib jar from the assembly, deployed the assembly, and then
>> supplied the mllib jar via the --jars option to spark-submit...
>>
>> I get this error:
>>
>> 14/08/09 12:49:32 INFO DAGScheduler: Failed to run count at ALS.scala:299
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due
>> to stage failure: Task 238 in stage 40.0 failed 4 times, most recent
>> failure: Lost task 238.3 in stage 40.0 (TID 10002,
>> tblpmidn05adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException:
>> scala.Tuple1 cannot be cast to scala.Product2

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
Actually, no, it did not work fine...

With multiple ALS iterations, I am getting the same error (with or without
my mllib changes):

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 206 in stage 52.0 failed 4 times, most recent
failure: Lost task 206.3 in stage 52.0 (TID ,
tblpmidn42adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException:
scala.Tuple1 cannot be cast to scala.Product2


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)

scala.collection.Iterator$$anon$11.next(Iterator.scala:328)


org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)


scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)


scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)


scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)


scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)


scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)


scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)


org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

org.apache.spark.scheduler.Task.run(Task.scala:54)


org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)


java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)


java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

java.lang.Thread.run(Thread.java:744)

Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)

at akka.act