Re: Spark on EC2

2014-06-01 Thread Jeremy Lee
Hmm... you've gotten further than me. Which AMIs are you using?


On Sun, Jun 1, 2014 at 2:21 PM, superback 
wrote:

> Hi,
> I am trying to run an example on AMAZON EC2 and have successfully
> set up one cluster with two nodes on EC2. However, when I was testing an
> example using the following command,
>
> ./run-example org.apache.spark.examples.GroupByTest
> spark://`hostname`:7077
>
> I got the following warnings and errors. Can anyone help me solve this
> problem? Thanks very much!
>
> 46781 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl - Initial
> job has not accepted any resources; check your cluster UI to ensure that
> workers are registered and have sufficient memory
> 61544 [spark-akka.actor.default-dispatcher-3] ERROR
> org.apache.spark.deploy.client.AppClient$ClientActor - All masters are
> unresponsive! Giving up.
> 61544 [spark-akka.actor.default-dispatcher-3] ERROR
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend - Spark
> cluster looks dead, giving up.
> 61546 [spark-akka.actor.default-dispatcher-3] INFO
> org.apache.spark.scheduler.TaskSchedulerImpl - Remove TaskSet 0.0 from pool
> 61549 [main] INFO org.apache.spark.scheduler.DAGScheduler - Failed to run
> count at GroupByTest.scala:50
> Exception in thread "main" org.apache.spark.SparkException: Job aborted:
> Spark cluster looks down
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
>
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
>
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>



-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
Could you post how exactly you are invoking spark-ec2? And are you having
trouble just with r3 instances, or with any instance type?

On Sunday, June 1, 2014, Jeremy Lee wrote:

> It's been another day of spinning up dead clusters...
>
> I thought I'd finally worked out what everyone else knew - don't use the
> default AMI - but I've now run through all of the "official" quick-start
> linux releases and I'm none the wiser:
>
> Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
> Provisions servers, connects, installs, but the webserver on the master
> will not start
>
> Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
> Spot instance requests are not supported for this AMI.
>
> SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
> Not tested - costs 10x more for spot instances, not economically viable.
>
> Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
> Provisions servers, but "git" is not pre-installed, so the cluster setup
> fails.
>
> Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
> Provisions servers, but "git" is not pre-installed, so the cluster setup
> fails.
>
> Have I missed something? What AMIs are people using? I've just gone back
> through the archives, and I'm seeing a lot of "I can't get EC2 to work" and
> not a single "My EC2 has post-install issues".
>
> The quickstart page says "...can have a spark cluster up and running in
> five minutes." But it's been three days for me so far. I'm about to bite
> the bullet and start building my own AMIs from scratch... if anyone can
> save me from that, I'd be most grateful.
>
> --
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers
>


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
If you are explicitly specifying the AMI in your invocation of spark-ec2,
may I suggest simply removing any explicit mention of AMI from your
invocation? spark-ec2 automatically selects an appropriate AMI based on the
specified instance type.

On Sunday, June 1, 2014, Nicholas Chammas wrote:

> Could you post how exactly you are invoking spark-ec2? And are you having
> trouble just with r3 instances, or with any instance type?
>
> On Sunday, June 1, 2014, Jeremy Lee wrote:
>
> It's been another day of spinning up dead clusters...
>
> I thought I'd finally worked out what everyone else knew - don't use the
> default AMI - but I've now run through all of the "official" quick-start
> linux releases and I'm none the wiser:
>
> Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
> Provisions servers, connects, installs, but the webserver on the master
> will not start
>
> Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
> Spot instance requests are not supported for this AMI.
>
> SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
> Not tested - costs 10x more for spot instances, not economically viable.
>
> Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
> Provisions servers, but "git" is not pre-installed, so the cluster setup
> fails.
>
> Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
> Provisions servers, but "git" is not pre-installed, so the cluster setup
> fails.
>
>


SparkSQL Table schema in Java

2014-06-01 Thread Kuldeep Bora
Hello,

Congrats for 1.0.0 release.

I would like to ask why table creation requires a proper class in Scala and
Java, while in Python you can just use a map.
I think that using a class to define a table is a bit too restrictive. Using
a plain map, on the other hand, could be very handy for creating tables
dynamically.

Are there any alternative APIs for Spark SQL which can work with plain Java
maps, as in Python?
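For context, a minimal sketch of the class-based pattern in question, as it looks in Spark 1.0's Scala API (the case class, data, and table name below are purely illustrative):

import org.apache.spark.sql.SQLContext

// The schema has to be declared up front as a class.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD

val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
people.registerAsTable("people")   // table schema is derived from the case class
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")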

Regards


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
*sigh* OK, I figured it out. (Thank you Nick, for the hint)

"m1.large" works, (I swear I tested that earlier and had similar issues...
)

It was my obsession with starting "r3.*large" instances. Clearly I hadn't
patched the script in all the places.. which I think caused it to default
to the Amazon AMI. I'll have to take a closer look at the code and see if I
can't fix it correctly, because I really, really do want nodes with 2x the
CPU and 4x the memory for the same low spot price. :-)

I've got a cluster up now, at least. Time for the fun stuff...

Thanks everyone for the help!



On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas  wrote:

> If you are explicitly specifying the AMI in your invocation of spark-ec2,
> may I suggest simply removing any explicit mention of AMI from your
> invocation? spark-ec2 automatically selects an appropriate AMI based on
> the specified instance type.
>
> On Sunday, June 1, 2014, Nicholas Chammas wrote:
>
> Could you post how exactly you are invoking spark-ec2? And are you having
>> trouble just with r3 instances, or with any instance type?
>>
>> On Sunday, June 1, 2014, Jeremy Lee wrote:
>>
>> It's been another day of spinning up dead clusters...
>>
>> I thought I'd finally worked out what everyone else knew - don't use the
>> default AMI - but I've now run through all of the "official" quick-start
>> linux releases and I'm none the wiser:
>>
>> Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
>> Provisions servers, connects, installs, but the webserver on the master
>> will not start
>>
>> Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
>> Spot instance requests are not supported for this AMI.
>>
>> SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
>> Not tested - costs 10x more for spot instances, not economically viable.
>>
>> Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
>> Provisions servers, but "git" is not pre-installed, so the cluster setup
>> fails.
>>
>> Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
>> Provisions servers, but "git" is not pre-installed, so the cluster setup
>> fails.
>>
>>


-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre B
Hi all!

We've been using the sbt-pack sbt plugin
(https://github.com/xerial/sbt-pack) for building our standalone Spark
application for a while now. Until version 1.0.0, that worked nicely.

For those who don't know the sbt-pack plugin, it basically copies all the
dependency JARs from your local ivy/maven cache to your target folder
(in target/pack/lib), and creates launch scripts (in target/pack/bin) for
your application (notably setting all these jars on the classpath).

Now, since Spark 1.0.0 was released, we are encountering a weird error where
running our project with "sbt run" is fine but running our app with the
launch scripts generated by sbt-pack fails.

After a (quite painful) investigation, it turns out some JARs are NOT copied
from the local ivy2 cache to the lib folder. I noticed that all the missing
jars contain "shaded" in their file name (but not all jars with such a name
are missing).
One of the missing JARs is explicitly from the Spark definition
(SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.

This file is clearly present in my local ivy cache, but is not copied by
sbt-pack.

Is there an evident reason for that?

I don't know much about the shading mechanism, maybe I'm missing something
here?


Any help would be appreciated!

Cheers

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
Hi All,

Is it possible to create an RDD from a directory tree of the following form?

RDD[(PATH, Seq[TEXT])]

Thank you,
Oleg


Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Nicholas Chammas
Could you provide an example of what you mean?

I know it's possible to create an RDD from a path with wildcards, like in
the subject.

For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
provide a comma delimited list of paths.
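A minimal sketch of both forms (the bucket and dates are placeholders):

// A glob pattern is expanded by the underlying Hadoop input format.
val daily = sc.textFile("s3n://bucket/2014-??-??/*.gz")

// A comma-delimited list of paths also works as a single argument.
val twoDays = sc.textFile("s3n://bucket/2014-05-31/*.gz,s3n://bucket/2014-06-01/*.gz")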

Nick

On Sunday, June 1, 2014, Oleg Proudnikov wrote:

> Hi All,
>
> Is it possible to create an RDD from a directory tree of the following
> form?
>
> RDD[(PATH, Seq[TEXT])]
>
> Thank you,
> Oleg
>
>


Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Chanwit Kaewkasi
Hi all,

This is what I found:

1. Like Aaron suggested, an executor will be killed silently when the
OS's memory is running out.
I've seen this enough times to conclude that it's real. Adding swap and
increasing the JVM heap solved the problem, but then you will encounter OS
paging out and full GC.

2. OS paging out and full GC are not likely to affect my benchmark
much while processing data from HDFS, but Akka processes were randomly
killed during the network-related stage (for example, sorting). I've
found that an Akka process cannot fetch the results fast enough.
Increasing the block manager timeout helped a lot; I've doubled the
value many times, as the network of our ARM cluster is quite slow.

3. We'd like to collect the time spent in all stages of our benchmark,
so we always re-run when some tasks fail. Failures happened a lot, but
that's understandable, as Spark is designed on top of Akka's let-it-crash
philosophy. To make the benchmark run cleanly (without a task
failure), I called .cache() before calling the transformation of the
next stage, and it helped a lot.

Combining the above with other tuning, we can now boost the performance of
our ARM cluster to 2.8 times faster than our first report.
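A rough sketch of the two tunings mentioned above (the timeout property name is the one I believe applies to this generation of Spark; the value, path, and RDD names are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("arm-benchmark")
  // Give a slow network more time before the block manager declares a peer
  // dead (value in milliseconds).
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")
val sc = new SparkContext(conf)

// Materialize the input before a shuffle-heavy stage, so a task failure
// during the sort does not force recomputation of the earlier stages.
val records = sc.textFile("hdfs:///data/input").cache()
val sorted  = records.map(line => (line, 1)).sortByKey()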

Best regards,

-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit


On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi  wrote:
> May be that's explaining mine too.
> Thank you very much, Aaron !!
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>
>
> On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson  wrote:
>> Spark should effectively turn Akka's failure detector off, because we
>> historically had problems with GCs and other issues causing disassociations.
>> The only thing that should cause these messages nowadays is if the TCP
>> connection (which Akka sustains between Actor Systems on different machines)
>> actually drops. TCP connections are pretty resilient, so one common cause of
>> this is actual Executor failure -- recently, I have experienced a
>> similar-sounding problem due to my machine's OOM killer terminating my
>> Executors, such that they didn't produce any error output.
>>
>>
>> On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi  wrote:
>>>
>>> Hi all,
>>>
>>> On an ARM cluster, I have been testing a wordcount program with JRE 7
>>> and everything is OK. But when changing to the embedded version of
>>> Java SE (Oracle's eJRE), the same program cannot complete all
>>> computing stages.
>>>
>>> It fails with many Akka disassociations.
>>>
>>> - I've been trying to increase Akka's timeout but still stuck. I am
>>> not sure what is the right way to do so? (I suspected that GC pausing
>>> the world is causing this).
>>>
>>> - Another question is that how could I properly turn on Akka's logging
>>> to see what's the root cause of this disassociation problem? (If my
>>> guess about GC is wrong).
>>>
>>> Best regards,
>>>
>>> -chanwit
>>>
>>> --
>>> Chanwit Kaewkasi
>>> linkedin.com/in/chanwit
>>
>>


Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Aaron Davidson
Thanks for the update! I've also run into the block manager timeout issue;
it might be a good idea to increase the default significantly (it would
probably time out immediately if the TCP connection itself dropped anyway).


On Sun, Jun 1, 2014 at 9:48 AM, Chanwit Kaewkasi  wrote:

> Hi all,
>
> This is what I found:
>
> 1. Like Aaron suggested, an executor will be killed silently when the
> OS's memory is running out.
> I've found this many times to conclude this it's real. Adding swap and
> increasing the JVM heap solved the problem, but you will encounter OS
> paging out and full GC.
>
> 2. OS paging out and full GC are not likely to affect my benchmark
> much while processing data from HDFS. But Akka process's randomly
> killed during the network-related stage (for example, sorting). I've
> found that an Akka process cannot fetch the result fast enough.
> Increasing the block manager timeout helped a lot. I've doubled the
> value many times as the network of our ARM cluster is quite slow.
>
> 3. We'd like to collect times spent for all stages of our benchmark.
> So we always re-run when some tasks failed. Failure happened a lot but
> it's understandable as Spark is designed on top of Akka's let-it-crash
> philosophy. To make the benchmark run more perfectly (without a task
> failure), I called .cache() before calling the transformation of the
> next stage. And it helped a lot.
>
> Combined above and others tuning, we can now boost the performance of
> our ARM cluster to 2.8 times faster than our first report.
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>
>
> On Wed, May 28, 2014 at 1:13 AM, Chanwit Kaewkasi 
> wrote:
> > May be that's explaining mine too.
> > Thank you very much, Aaron !!
> >
> > Best regards,
> >
> > -chanwit
> >
> > --
> > Chanwit Kaewkasi
> > linkedin.com/in/chanwit
> >
> >
> > On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson 
> wrote:
> >> Spark should effectively turn Akka's failure detector off, because we
> >> historically had problems with GCs and other issues causing
> disassociations.
> >> The only thing that should cause these messages nowadays is if the TCP
> >> connection (which Akka sustains between Actor Systems on different
> machines)
> >> actually drops. TCP connections are pretty resilient, so one common
> cause of
> >> this is actual Executor failure -- recently, I have experienced a
> >> similar-sounding problem due to my machine's OOM killer terminating my
> >> Executors, such that they didn't produce any error output.
> >>
> >>
> >> On Thu, May 22, 2014 at 9:19 AM, Chanwit Kaewkasi 
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> On an ARM cluster, I have been testing a wordcount program with JRE 7
> >>> and everything is OK. But when changing to the embedded version of
> >>> Java SE (Oracle's eJRE), the same program cannot complete all
> >>> computing stages.
> >>>
> >>> It is failed by many Akka's disassociation.
> >>>
> >>> - I've been trying to increase Akka's timeout but still stuck. I am
> >>> not sure what is the right way to do so? (I suspected that GC pausing
> >>> the world is causing this).
> >>>
> >>> - Another question is that how could I properly turn on Akka's logging
> >>> to see what's the root cause of this disassociation problem? (If my
> >>> guess about GC is wrong).
> >>>
> >>> Best regards,
> >>>
> >>> -chanwit
> >>>
> >>> --
> >>> Chanwit Kaewkasi
> >>> linkedin.com/in/chanwit
> >>
> >>
>


Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
Gotcha. The easiest way to get your dependencies to your Executors would
probably be to construct your SparkContext with all necessary jars passed
in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is
a "necessary jar", but it's possible your application also needs to
distribute other ones to the cluster.

An easy way to make sure all your dependencies get shipped to the cluster
is to create an assembly jar of your application, and then you just need to
tell Spark about that jar, which includes all your application's transitive
dependencies. Maven and sbt both have pretty straightforward ways of
producing assembly jars.
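A minimal sketch of the SparkConf route (the master URL and the assembly jar path are placeholders for whatever your application actually uses):

import org.apache.spark.{SparkConf, SparkContext}

// Ship the application assembly (which bundles avro and the other transitive
// dependencies) to every Executor.
val conf = new SparkConf()
  .setAppName("avro-reader")
  .setMaster("spark://master-host:7077")
  .setJars(Seq("/path/to/my-app-assembly.jar"))
val sc = new SparkContext(conf)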


On Sat, May 31, 2014 at 11:23 PM, Russell Jurney 
wrote:

> Thanks for the fast reply.
>
> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
> standalone mode.
>
>
> On Saturday, May 31, 2014, Aaron Davidson  wrote:
>
>> First issue was because your cluster was configured incorrectly. You
>> could probably read 1 file because that was done on the driver node, but
>> when it tried to run a job on the cluster, it failed.
>>
>> Second issue, it seems that the jar containing avro is not getting
>> propagated to the Executors. What version of Spark are you running on? What
>> deployment mode (YARN, standalone, Mesos)?
>>
>>
>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney > > wrote:
>>
>> Now I get this:
>>
>> scala> rdd.first
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>> :41
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
>> :41) with 1 output partitions (allowLocal=true)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
>> (first at :41)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
>> List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
>> partition locally
>>
>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
>> hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
>> :41, took 0.037371256 s
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>> :41
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
>> :41) with 16 output partitions (allowLocal=true)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
>> (first at :41)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
>> List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
>> (HadoopRDD[0] at hadoopRDD at :37), which has no missing parents
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
>> tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at :37)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0
>> with 16 tasks
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as
>> TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as
>> 1294 bytes in 1 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as
>> TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as
>> TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as
>> 1294 bytes in 1 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as
>> TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as
>> TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as
>> TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as
>> TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as
>> TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as
>> 1294 bytes in 0 ms
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as
>> TID 100

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Anwar Rizal
I presume that you need to have access to the path of each file you are
reading.

I don't know whether there is a good way to do that for HDFS; I needed to
read the files myself, with something like:

import java.net.URI
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

def openWithPath(inputPath: String, sc: SparkContext) = {
  val path    = new Path(inputPath)
  val fs      = path.getFileSystem(sc.hadoopConfiguration)
  val filesIt = fs.listFiles(path, false)
  val paths   = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  // One small RDD of (path, line) pairs per file, unioned into a single RDD.
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  withPaths.reduce { _ ++ _ }
}
...

I would be interested if there is a better way to do the same thing ...

Cheers,
a:


On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas  wrote:

> Could you provide an example of what you mean?
>
> I know it's possible to create an RDD from a path with wildcards, like in
> the subject.
>
> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
> provide a comma delimited list of paths.
>
> Nick
>
> On Sunday, June 1, 2014, Oleg Proudnikov wrote:
>
> Hi All,
>>
>> Is it possible to create an RDD from a directory tree of the following
>> form?
>>
>> RDD[(PATH, Seq[TEXT])]
>>
>> Thank you,
>> Oleg
>>
>>


Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
I would agree with your guess; it looks like the YARN library isn't
correctly finding your yarn-site.xml file. If you look in
yarn-site.xml, do you definitely see the resource manager
address/addresses?

Also, you can try running this command with
SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
set-up correctly.

- Patrick

On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen  wrote:
> Hi all,
>
> I tried a couple ways, but couldn't get it to work..
>
> The following seems to be what the online document
> (http://spark.apache.org/docs/latest/running-on-yarn.html) is suggesting:
> SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
> YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
>
> Help info of spark-shell seems to be suggesting "--master yarn --deploy-mode
> cluster".
>
> But either way, I am seeing the following messages:
> 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8032
> 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>
> My guess is that spark-shell is trying to talk to resource manager to setup
> spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from
> though. I am running CDH5 with two resource managers in HA mode. Their
> IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
> HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
>
> Any ideas? Thanks.
> -Simon


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Patrick Wendell
Hey, just to clarify this - my understanding is that the poster
(Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
launch spark-ec2 from my laptop. And he was looking for an AMI that
had a high enough version of Python.

Spark-ec2 itself has a flag "-a" that allows you to give a specific
AMI. This flag is just an internal tool that we use for testing when
we spin new AMIs. Users can't set that to an arbitrary AMI because we
tightly control things like the Java and OS versions, libraries, etc.


On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
 wrote:
> *sigh* OK, I figured it out. (Thank you Nick, for the hint)
>
> "m1.large" works, (I swear I tested that earlier and had similar issues... )
>
> It was my obsession with starting "r3.*large" instances. Clearly I hadn't
> patched the script in all the places.. which I think caused it to default to
> the Amazon AMI. I'll have to take a closer look at the code and see if I
> can't fix it correctly, because I really, really do want nodes with 2x the
> CPU and 4x the memory for the same low spot price. :-)
>
> I've got a cluster up now, at least. Time for the fun stuff...
>
> Thanks everyone for the help!
>
>
>
> On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
>  wrote:
>>
>> If you are explicitly specifying the AMI in your invocation of spark-ec2,
>> may I suggest simply removing any explicit mention of AMI from your
>> invocation? spark-ec2 automatically selects an appropriate AMI based on the
>> specified instance type.
>>
>> On Sunday, June 1, 2014, Nicholas Chammas wrote:
>>
>>> Could you post how exactly you are invoking spark-ec2? And are you having
>>> trouble just with r3 instances, or with any instance type?
>>>
>>> On Sunday, June 1, 2014, Jeremy Lee wrote:
>>>
>>> It's been another day of spinning up dead clusters...
>>>
>>> I thought I'd finally worked out what everyone else knew - don't use the
>>> default AMI - but I've now run through all of the "official" quick-start
>>> linux releases and I'm none the wiser:
>>>
>>> Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
>>> Provisions servers, connects, installs, but the webserver on the master
>>> will not start
>>>
>>> Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
>>> Spot instance requests are not supported for this AMI.
>>>
>>> SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
>>> Not tested - costs 10x more for spot instances, not economically viable.
>>>
>>> Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
>>> Provisions servers, but "git" is not pre-installed, so the cluster setup
>>> fails.
>>>
>>> Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
>>> Provisions servers, but "git" is not pre-installed, so the cluster setup
>>> fails.
>
>
>
>
> --
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
One potential issue here is that mesos is now using classifiers to
publish its jars. It might be that sbt-pack has trouble with
dependencies that are published using classifiers. I'm pretty sure
mesos is the only dependency in Spark that is using classifiers, so
that's why I mention it.
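For reference, a sketch of what a classifier-based dependency looks like in an sbt build (the coordinates below just illustrate the pattern; see SparkBuild.scala for the exact line):

// An sbt dependency that pulls a specific classifier of an artifact, the way
// Spark pulls the shaded-protobuf build of Mesos.
libraryDependencies += "org.apache.mesos" % "mesos" % "0.18.1" classifier "shaded-protobuf"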

On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
 wrote:
> Hi all!
>
> We'be been using the sbt-pack sbt plugin
> (https://github.com/xerial/sbt-pack) for building our standalone Spark
> application for a while now. Until version 1.0.0, that worked nicely.
>
> For those who don't know the sbt-pack plugin, it basically copies all the
> dependencies JARs from your local ivy/maven cache to a your target folder
> (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
> your application (notably setting all these jars on the classpath).
>
> Now, since Spark 1.0.0 was released, we are encountering a weird error where
> running our project with "sbt run" is fine but running our app with the
> launch scripts generated by sbt-pack fails.
>
> After a (quite painful) investigation, it turns out some JARs are NOT copied
> from the local ivy2 cache to the lib folder. I noticed that all the missing
> jars contain "shaded" in their file name (but all not all jars with such
> name are missing).
> One of the missing JARs is explicitly from the Spark definition
> (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.
>
> This file is clearly present in my local ivy cache, but is not copied by
> sbt-pack.
>
> Is there an evident reason for that?
>
> I don't know much about the shading mechanism, maybe I'm missing something
> here?
>
>
> Any help would be appreciated!
>
> Cheers
>
> Pierre
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350

On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell  wrote:
> One potential issue here is that mesos is using classifiers now to
> publish there jars. It might be that sbt-pack has trouble with
> dependencies that are published using classifiers. I'm pretty sure
> mesos is the only dependency in Spark that is using classifiers, so
> that's why I mention it.
>
> On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
>  wrote:
>> Hi all!
>>
>> We'be been using the sbt-pack sbt plugin
>> (https://github.com/xerial/sbt-pack) for building our standalone Spark
>> application for a while now. Until version 1.0.0, that worked nicely.
>>
>> For those who don't know the sbt-pack plugin, it basically copies all the
>> dependencies JARs from your local ivy/maven cache to a your target folder
>> (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
>> your application (notably setting all these jars on the classpath).
>>
>> Now, since Spark 1.0.0 was released, we are encountering a weird error where
>> running our project with "sbt run" is fine but running our app with the
>> launch scripts generated by sbt-pack fails.
>>
>> After a (quite painful) investigation, it turns out some JARs are NOT copied
>> from the local ivy2 cache to the lib folder. I noticed that all the missing
>> jars contain "shaded" in their file name (but all not all jars with such
>> name are missing).
>> One of the missing JARs is explicitly from the Spark definition
>> (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.
>>
>> This file is clearly present in my local ivy cache, but is not copied by
>> sbt-pack.
>>
>> Is there an evident reason for that?
>>
>> I don't know much about the shading mechanism, maybe I'm missing something
>> here?
>>
>>
>> Any help would be appreciated!
>>
>> Cheers
>>
>> Pierre
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre Borckmans
You're right, Patrick!

I just had a chat with the sbt-pack creator, and indeed dependencies with
classifiers are ignored to avoid problems with a dirty cache...

This should be fixed in the next version of the plugin.

Cheers

Pierre 

Message sent from a mobile device - excuse typos and abbreviations 

> On June 1, 2014, at 20:04, Patrick Wendell wrote:
> 
> https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
> 
>> On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell  wrote:
>> One potential issue here is that mesos is using classifiers now to
>> publish there jars. It might be that sbt-pack has trouble with
>> dependencies that are published using classifiers. I'm pretty sure
>> mesos is the only dependency in Spark that is using classifiers, so
>> that's why I mention it.
>> 
>> On Sun, Jun 1, 2014 at 2:34 AM, Pierre B
>>  wrote:
>>> Hi all!
>>> 
>>> We'be been using the sbt-pack sbt plugin
>>> (https://github.com/xerial/sbt-pack) for building our standalone Spark
>>> application for a while now. Until version 1.0.0, that worked nicely.
>>> 
>>> For those who don't know the sbt-pack plugin, it basically copies all the
>>> dependencies JARs from your local ivy/maven cache to a your target folder
>>> (in target/pack/lib), and creates launch scripts (in target/pack/bin) for
>>> your application (notably setting all these jars on the classpath).
>>> 
>>> Now, since Spark 1.0.0 was released, we are encountering a weird error where
>>> running our project with "sbt run" is fine but running our app with the
>>> launch scripts generated by sbt-pack fails.
>>> 
>>> After a (quite painful) investigation, it turns out some JARs are NOT copied
>>> from the local ivy2 cache to the lib folder. I noticed that all the missing
>>> jars contain "shaded" in their file name (but all not all jars with such
>>> name are missing).
>>> One of the missing JARs is explicitly from the Spark definition
>>> (SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.
>>> 
>>> This file is clearly present in my local ivy cache, but is not copied by
>>> sbt-pack.
>>> 
>>> Is there an evident reason for that?
>>> 
>>> I don't know much about the shading mechanism, maybe I'm missing something
>>> here?
>>> 
>>> 
>>> Any help would be appreciated!
>>> 
>>> Cheers
>>> 
>>> Pierre
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
Note that everything works fine in spark 0.9, which is packaged in CDH5: I
can launch a spark-shell and interact with workers spawned on my yarn
cluster.

So in my /opt/hadoop/conf/yarn-site.xml, I have:
...

<property>
  <name>yarn.resourcemanager.address.rm1</name>
  <value>controller-1.mycomp.com:23140</value>
</property>
...
<property>
  <name>yarn.resourcemanager.address.rm2</name>
  <value>controller-2.mycomp.com:23140</value>
</property>

...

And the other usual stuff.

So spark 1.0 is launched like this:
Spark Command: java -cp
::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
-XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
--class org.apache.spark.repl.Main

I do see "/opt/hadoop/conf" included, but not sure it's the right place.

Thanks..
-Simon



On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell  wrote:

> I would agree with your guess, it looks like the yarn library isn't
> correctly finding your yarn-site.xml file. If you look in
> yarn-site.xml do you definitely the resource manager
> address/addresses?
>
> Also, you can try running this command with
> SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
> set-up correctly.
>
> - Patrick
>
> On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen 
> wrote:
> > Hi all,
> >
> > I tried a couple ways, but couldn't get it to work..
> >
> > The following seems to be what the online document
> > (http://spark.apache.org/docs/latest/running-on-yarn.html) is
> suggesting:
> >
> SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
> > YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
> >
> > Help info of spark-shell seems to be suggesting "--master yarn
> --deploy-mode
> > cluster".
> >
> > But either way, I am seeing the following messages:
> > 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at
> > /0.0.0.0:8032
> > 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
> > 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> > 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
> > 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> >
> > My guess is that spark-shell is trying to talk to resource manager to
> setup
> > spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from
> > though. I am running CDH5 with two resource managers in HA mode. Their
> > IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
> > HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
> >
> > Any ideas? Thanks.
> > -Simon
>


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
Ah yes, looking back at the first email in the thread, indeed that was the
case. For the record, I too launch clusters from my laptop, where I have
Python 2.7 installed.


On Sun, Jun 1, 2014 at 2:01 PM, Patrick Wendell  wrote:

> Hey just to clarify this - my understanding is that the poster
> (Jeremey) was using a custom AMI to *launch* spark-ec2. I normally
> launch spark-ec2 from my laptop. And he was looking for an AMI that
> had a high enough version of python.
>
> Spark-ec2 itself has a flag "-a" that allows you to give a specific
> AMI. This flag is just an internal tool that we use for testing when
> we spin new AMI's. Users can't set that to an arbitrary AMI because we
> tightly control things like the Java and OS versions, libraries, etc.
>
>
> On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
>  wrote:
> > *sigh* OK, I figured it out. (Thank you Nick, for the hint)
> >
> > "m1.large" works, (I swear I tested that earlier and had similar
> issues... )
> >
> > It was my obsession with starting "r3.*large" instances. Clearly I hadn't
> > patched the script in all the places.. which I think caused it to
> default to
> > the Amazon AMI. I'll have to take a closer look at the code and see if I
> > can't fix it correctly, because I really, really do want nodes with 2x
> the
> > CPU and 4x the memory for the same low spot price. :-)
> >
> > I've got a cluster up now, at least. Time for the fun stuff...
> >
> > Thanks everyone for the help!
> >
> >
> >
> > On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
> >  wrote:
> >>
> >> If you are explicitly specifying the AMI in your invocation of
> spark-ec2,
> >> may I suggest simply removing any explicit mention of AMI from your
> >> invocation? spark-ec2 automatically selects an appropriate AMI based on
> the
> >> specified instance type.
> >>
> >> On Sunday, June 1, 2014, Nicholas Chammas wrote:
> >>
> >>> Could you post how exactly you are invoking spark-ec2? And are you
> having
> >>> trouble just with r3 instances, or with any instance type?
> >>>
> >>> On Sunday, June 1, 2014, Jeremy Lee wrote:
> >>>
> >>> It's been another day of spinning up dead clusters...
> >>>
> >>> I thought I'd finally worked out what everyone else knew - don't use
> the
> >>> default AMI - but I've now run through all of the "official"
> quick-start
> >>> linux releases and I'm none the wiser:
> >>>
> >>> Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
> >>> Provisions servers, connects, installs, but the webserver on the master
> >>> will not start
> >>>
> >>> Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
> >>> Spot instance requests are not supported for this AMI.
> >>>
> >>> SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
> >>> Not tested - costs 10x more for spot instances, not economically
> viable.
> >>>
> >>> Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
> >>> Provisions servers, but "git" is not pre-installed, so the cluster
> setup
> >>> fails.
> >>>
> >>> Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
> >>> Provisions servers, but "git" is not pre-installed, so the cluster
> setup
> >>> fails.
> >
> >
> >
> >
> > --
> > Jeremy Lee  BCompSci(Hons)
> >   The Unorthodox Engineers
>


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
More specifically with the -a flag, you *can* set your own AMI, but you’ll need 
to base it off ours. This is because spark-ec2 assumes that some packages (e.g. 
java, Python 2.6) are already available on the AMI.

Matei

On Jun 1, 2014, at 11:01 AM, Patrick Wendell  wrote:

> Hey just to clarify this - my understanding is that the poster
> (Jeremey) was using a custom AMI to *launch* spark-ec2. I normally
> launch spark-ec2 from my laptop. And he was looking for an AMI that
> had a high enough version of python.
> 
> Spark-ec2 itself has a flag "-a" that allows you to give a specific
> AMI. This flag is just an internal tool that we use for testing when
> we spin new AMI's. Users can't set that to an arbitrary AMI because we
> tightly control things like the Java and OS versions, libraries, etc.
> 
> 
> On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
>  wrote:
>> *sigh* OK, I figured it out. (Thank you Nick, for the hint)
>> 
>> "m1.large" works, (I swear I tested that earlier and had similar issues... )
>> 
>> It was my obsession with starting "r3.*large" instances. Clearly I hadn't
>> patched the script in all the places.. which I think caused it to default to
>> the Amazon AMI. I'll have to take a closer look at the code and see if I
>> can't fix it correctly, because I really, really do want nodes with 2x the
>> CPU and 4x the memory for the same low spot price. :-)
>> 
>> I've got a cluster up now, at least. Time for the fun stuff...
>> 
>> Thanks everyone for the help!
>> 
>> 
>> 
>> On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
>>  wrote:
>>> 
>>> If you are explicitly specifying the AMI in your invocation of spark-ec2,
>>> may I suggest simply removing any explicit mention of AMI from your
>>> invocation? spark-ec2 automatically selects an appropriate AMI based on the
>>> specified instance type.
>>> 
>>> On Sunday, June 1, 2014, Nicholas Chammas wrote:
>>> 
 Could you post how exactly you are invoking spark-ec2? And are you having
 trouble just with r3 instances, or with any instance type?
 
 On Sunday, June 1, 2014, Jeremy Lee wrote:
 
 It's been another day of spinning up dead clusters...
 
 I thought I'd finally worked out what everyone else knew - don't use the
 default AMI - but I've now run through all of the "official" quick-start
 linux releases and I'm none the wiser:
 
 Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
 Provisions servers, connects, installs, but the webserver on the master
 will not start
 
 Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
 Spot instance requests are not supported for this AMI.
 
 SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
 Not tested - costs 10x more for spot instances, not economically viable.
 
 Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
 Provisions servers, but "git" is not pre-installed, so the cluster setup
 fails.
 
 Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
 Provisions servers, but "git" is not pre-installed, so the cluster setup
 fails.
>> 
>> 
>> 
>> 
>> --
>> Jeremy Lee  BCompSci(Hons)
>>  The Unorthodox Engineers



Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
I have a large number of directories under a common root:

batch-1/file1.txt
batch-1/file2.txt
batch-1/file3.txt
...
batch-2/file1.txt
batch-2/file2.txt
batch-2/file3.txt
...
batch-N/file1.txt
batch-N/file2.txt
batch-N/file3.txt
...

I would like to read them into an RDD like

{
"batch-1" : [ content1, content2, content3,...]
"batch-2" : [ content1, content2, content3,...]
...
"batch-N" : [ content1, content2, content3,...]
}

Thank you,
Oleg



On 1 June 2014 17:00, Nicholas Chammas  wrote:

> Could you provide an example of what you mean?
>
> I know it's possible to create an RDD from a path with wildcards, like in
> the subject.
>
> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
> provide a comma delimited list of paths.
>
> Nick
>
> On Sunday, June 1, 2014, Oleg Proudnikov wrote:
>
> Hi All,
>>
>> Is it possible to create an RDD from a directory tree of the following
>> form?
>>
>> RDD[(PATH, Seq[TEXT])]
>>
>> Thank you,
>> Oleg
>>
>>


-- 
Kind regards,

Oleg


Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Nicholas Chammas
sc.wholeTextFiles() will get you close. Alternately, you could write a loop
with plain sc.textFile() that loads all the files under each batch into a
separate RDD.
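A sketch of the wholeTextFiles() approach for the layout above, grouping files by their parent batch-N directory (the root path is a placeholder, and it assumes glob patterns are honoured here the way they are for textFile()):

// (path, content) pairs for every matching file under the root...
val files = sc.wholeTextFiles("hdfs:///root/batch-*/*.txt")

// ...grouped by the parent directory name, giving roughly
// RDD[(batch, Iterable[content])].
val byBatch = files
  .map { case (path, content) =>
    val parent = new java.io.File(new java.net.URI(path).getPath).getParentFile.getName
    (parent, content)
  }
  .groupByKey()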


On Sun, Jun 1, 2014 at 4:40 PM, Oleg Proudnikov 
wrote:

> I have a large number of directories under a common root:
>
> batch-1/file1.txt
> batch-1/file2.txt
> batch-1/file3.txt
> ...
> batch-2/file1.txt
> batch-2/file2.txt
> batch-2/file3.txt
> ...
> batch-N/file1.txt
> batch-N/file2.txt
> batch-N/file3.txt
> ...
>
> I would like to read them into an RDD like
>
> {
> "batch-1" : [ content1, content2, content3,...]
> "batch-2" : [ content1, content2, content3,...]
> ...
> "batch-N" : [ content1, content2, content3,...]
> }
>
> Thank you,
> Oleg
>
>
>
> On 1 June 2014 17:00, Nicholas Chammas  wrote:
>
>> Could you provide an example of what you mean?
>>
>> I know it's possible to create an RDD from a path with wildcards, like in
>> the subject.
>>
>> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
>> provide a comma delimited list of paths.
>>
>> Nick
>>
>> On Sunday, June 1, 2014, Oleg Proudnikov wrote:
>>
>> Hi All,
>>>
>>> Is it possible to create an RDD from a directory tree of the following
>>> form?
>>>
>>> RDD[(PATH, Seq[TEXT])]
>>>
>>> Thank you,
>>> Oleg
>>>
>>>
>
>
> --
> Kind regards,
>
> Oleg
>
>


Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
Nicholas,

The wholeTextFiles() method, new in 1.0, gets me exactly what I need. It would
be great to have this functionality for an arbitrary directory tree.

Thank you,
Oleg




Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
Followup question: the docs to make a new SparkContext require that I know
where $SPARK_HOME is. However, I have no idea. Any idea where that might be?


On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson  wrote:

> Gotcha. The easiest way to get your dependencies to your Executors would
> probably be to construct your SparkContext with all necessary jars passed
> in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is
> a "necessary jar", but it's possible your application also needs to
> distribute other ones to the cluster.
>
> An easy way to make sure all your dependencies get shipped to the cluster
> is to create an assembly jar of your application, and then you just need to
> tell Spark about that jar, which includes all your application's transitive
> dependencies. Maven and sbt both have pretty straightforward ways of
> producing assembly jars.
>
>
> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney  > wrote:
>
>> Thanks for the fast reply.
>>
>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
>> standalone mode.
>>
>>
>> On Saturday, May 31, 2014, Aaron Davidson  wrote:
>>
>>> First issue was because your cluster was configured incorrectly. You
>>> could probably read 1 file because that was done on the driver node, but
>>> when it tried to run a job on the cluster, it failed.
>>>
>>> Second issue, it seems that the jar containing avro is not getting
>>> propagated to the Executors. What version of Spark are you running on? What
>>> deployment mode (YARN, standalone, Mesos)?
>>>
>>>
>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <
>>> russell.jur...@gmail.com> wrote:
>>>
>>> Now I get this:
>>>
>>> scala> rdd.first
>>>
>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>>> :41
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
>>> :41) with 1 output partitions (allowLocal=true)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
>>> (first at :41)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
>>> List()
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
>>> partition locally
>>>
>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
>>> hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864
>>>
>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
>>> :41, took 0.037371256 s
>>>
>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>>> :41
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
>>> :41) with 16 output partitions (allowLocal=true)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
>>> (first at :41)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
>>> List()
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
>>> (HadoopRDD[0] at hadoopRDD at :37), which has no missing parents
>>>
>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
>>> tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at :37)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0
>>> with 16 tasks
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as
>>> TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0
>>> as 1294 bytes in 1 ms
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as
>>> TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3
>>> as 1294 bytes in 0 ms
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as
>>> TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1
>>> as 1294 bytes in 1 ms
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as
>>> TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2
>>> as 1294 bytes in 0 ms
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as
>>> TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4
>>> as 1294 bytes in 0 ms
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as
>>> TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6
>>> as 1294 bytes in 0 ms
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as
>>> TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>
>>> 14/05/31 21:36:28 INFO scheduler.TaskSe

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
Anwar,

I will try this, as it might do exactly what I need. I will follow your
pattern but use sc.textFile() for each file.

I am now thinking that I could start with an RDD of file paths and map it
into (path, content) pairs, provided I could read a file on the server.
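A sketch of that idea, assuming the files are small enough to read whole and the HDFS URI is reachable from the executors (the paths are placeholders):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val paths = sc.parallelize(Seq(
  "hdfs://namenode/batch-1/file1.txt",
  "hdfs://namenode/batch-1/file2.txt"))

val pairs = paths.map { p =>
  // Build a fresh Configuration inside the task; the driver's
  // sc.hadoopConfiguration is not available in the closure.
  val fs = FileSystem.get(new URI(p), new Configuration())
  val in = fs.open(new Path(p))
  try (p, Source.fromInputStream(in).mkString) finally in.close()
}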

Thank you,
Oleg



On 1 June 2014 18:41, Anwar Rizal  wrote:

> I presume that you need to have access to the path of each file you are
> reading.
>
> I don't know whether there is a good way to do that for HDFS, I need to
> read the files myself, something like:
>
> def openWithPath(inputPath: String, sc:SparkContext) =  {
>   val fs= (new
> Path(inputPath)).getFileSystem(sc.hadoopConfiguration)
>   val filesIt   = fs.listFiles(path, false)
>   val paths   = new ListBuffer[URI]
>   while (filesIt.hasNext) {
> paths += filesIt.next.getPath.toUri
>   }
>   val withPaths = paths.toList.map{  p =>
> sc.newAPIHadoopFile[LongWritable, Text,
> TextInputFormat](p.toString).map{ case (_,s)  => (p, s.toString) }
>   }
>   withPaths.reduce{ _ ++ _ }
> }
> ...
>
> I would be interested if there is a better way to do the same thing ...
>
> Cheers,
> a:
>
>
> On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Could you provide an example of what you mean?
>>
>> I know it's possible to create an RDD from a path with wildcards, like in
>> the subject.
>>
>> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
>> provide a comma delimited list of paths.
>>
>> Nick
>>
>> On Sunday, June 1, 2014, Oleg Proudnikov wrote:
>>
>> Hi All,
>>>
>>> Is it possible to create an RDD from a directory tree of the following
>>> form?
>>>
>>> RDD[(PATH, Seq[TEXT])]
>>>
>>> Thank you,
>>> Oleg
>>>
>>>
>


-- 
Kind regards,

Oleg


[Spark Streaming] Distribute custom receivers evenly across excecutors

2014-06-01 Thread Guang Gao
Dear All,

I'm running Spark Streaming (1.0.0) with Yarn (2.2.0) on a 10-node cluster.
I set up 10 custom receivers to listen to 10 data streams. I want one
receiver per node in order to maximize the network bandwidth. However, if I
set "--executor-cores 4", the 10 receivers only run on 3 of the nodes in
the cluster, running 4, 4, and 2 receivers respectively; if I set
"--executor-cores 1", each node runs exactly one receiver, but it seems that
Spark can't make any progress processing these streams.

I read the documentation on configuration and also googled but didn't find
a clue. Is there a way to configure how the receivers are distributed?

Thanks!

Here are some details:

How I created 10 receivers:

val conf = new SparkConf().setAppName(jobId)
val sc = new StreamingContext(conf, Seconds(1))

// First receiver
var lines: DStream[String] = sc.receiverStream(new CustomReceiver(...))

// Nine more receivers, unioned into a single DStream
for (i <- 1 to 9) {
  lines = lines.union(sc.receiverStream(new CustomReceiver(...)))
}

How I submit a job to Yarn:

spark-submit \
--class $JOB_CLASS \
--master yarn-client \
--num-executors 10 \
--driver-memory 1g \
--executor-memory 2g \
--executor-cores 4 \
$JAR_NAME


Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
You can avoid that by using the constructor that takes a SparkConf, a la

val conf = new SparkConf()
conf.setJars("avro.jar", ...)
val sc = new SparkContext(conf)


On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney 
wrote:

> Followup question: the docs to make a new SparkContext require that I know
> where $SPARK_HOME is. However, I have no idea. Any idea where that might be?
>
>
> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson 
> wrote:
>
>> Gotcha. The easiest way to get your dependencies to your Executors would
>> probably be to construct your SparkContext with all necessary jars passed
>> in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is
>> a "necessary jar", but it's possible your application also needs to
>> distribute other ones to the cluster.
>>
>> An easy way to make sure all your dependencies get shipped to the cluster
>> is to create an assembly jar of your application, and then you just need to
>> tell Spark about that jar, which includes all your application's transitive
>> dependencies. Maven and sbt both have pretty straightforward ways of
>> producing assembly jars.
>>
>>
>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <
>> russell.jur...@gmail.com> wrote:
>>
>>> Thanks for the fast reply.
>>>
>>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
>>> standalone mode.
>>>
>>>
>>> On Saturday, May 31, 2014, Aaron Davidson  wrote:
>>>
 First issue was because your cluster was configured incorrectly. You
 could probably read 1 file because that was done on the driver node, but
 when it tried to run a job on the cluster, it failed.

 Second issue, it seems that the jar containing avro is not getting
 propagated to the Executors. What version of Spark are you running on? What
 deployment mode (YARN, standalone, Mesos)?


 On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <
 russell.jur...@gmail.com> wrote:

 Now I get this:

 scala> rdd.first

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 :41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
 :41) with 1 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
 (first at :41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
 partition locally

 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
 hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864

 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
 :41, took 0.037371256 s

 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
 :41

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
 :41) with 16 output partitions (allowLocal=true)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
 (first at :41)

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
 List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
 (HadoopRDD[0] at hadoopRDD at :37), which has no missing parents

 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
 tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at :37)

 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0
 with 16 tasks

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as
 TID 92 on executor 2: hivecluster3 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0
 as 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as
 TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as
 TID 94 on executor 4: hivecluster4 (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1
 as 1294 bytes in 1 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as
 TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as
 TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)

 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4
 as 1294 bytes in 0 ms

 14/05/31 21:36:28 INFO scheduler.TaskSetM

Re: Trouble with EC2

2014-06-01 Thread PJ$
Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't
gotten any further. No clue what's wrong. I'd really appreciate any
guidance y'all could offer.

Best,
PJ$


On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia 
wrote:

> What instance types did you launch on?
>
> Sometimes you also get a bad individual machine from EC2. It might help to
> remove the node it’s complaining about from the conf/slaves file.
>
> Matei
>
> On May 30, 2014, at 11:18 AM, PJ$  wrote:
>
> Hey Folks,
>
> I'm really having quite a bit of trouble getting spark running on ec2. I'm
> not using scripts the https://github.com/apache/spark/tree/master/ec2
> because I'd like to know how everything works. But I'm going a little
> crazy. I think that something about the networking configuration must be
> messed up, but I'm at a loss. Shortly after starting the cluster, I get a
> lot of this:
>
> 14/05/30 18:03:22 INFO master.Master: Registering worker
> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
> 14/05/30 18:03:22 INFO master.Master: Registering worker
> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
> 14/05/30 18:03:23 INFO master.Master: Registering worker
> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
> 14/05/30 18:03:23 INFO master.Master: Registering worker
> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
> 14/05/30 18:05:54 INFO master.Master:
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
> removing it.
> 14/05/30 18:05:54 INFO actor.LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkMaster/deadLetters] to
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
> was not delivered. [5] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/05/30 18:05:54 INFO master.Master:
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
> removing it.
> 14/05/30 18:05:54 INFO master.Master:
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
> removing it.
> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
> -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
> [Association failed with
> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
> akka.remote.EndpointAssociationException: Association failed with [
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
> ]
> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
> -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
> [Association failed with
> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
> akka.remote.EndpointAssociationException: Association failed with [
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
> ]
> 14/05/30 18:05:54 INFO master.Master:
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
> removing it.
> 14/05/30 18:05:54 INFO master.Master:
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
> removing it.
> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
> -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
> [Association failed with
> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
> akka.remote.EndpointAssociationException: Association failed with [
> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
>
>
>


Re: Trouble with EC2

2014-06-01 Thread Matei Zaharia
So to run spark-ec2, you should use the default AMI that it launches with if 
you don’t pass -a. Those are based on Amazon Linux, not Debian. Passing your 
own AMI is an advanced option but people need to install some stuff on their 
AMI in advance for it to work with our scripts.

Matei


On Jun 1, 2014, at 3:11 PM, PJ$  wrote:

> Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't 
> gotten any further. No clue what's wrong. I'd really appreciate any guidance 
> y'all could offer. 
> 
> Best, 
> PJ$
> 
> 
> On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia  
> wrote:
> What instance types did you launch on?
> 
> Sometimes you also get a bad individual machine from EC2. It might help to 
> remove the node it’s complaining about from the conf/slaves file.
> 
> Matei
> 
> On May 30, 2014, at 11:18 AM, PJ$  wrote:
> 
>> Hey Folks, 
>> 
>> I'm really having quite a bit of trouble getting spark running on ec2. I'm 
>> not using scripts the https://github.com/apache/spark/tree/master/ec2 
>> because I'd like to know how everything works. But I'm going a little crazy. 
>> I think that something about the networking configuration must be messed up, 
>> but I'm at a loss. Shortly after starting the cluster, I get a lot of this: 
>> 
>> 14/05/30 18:03:22 INFO master.Master: Registering worker 
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:03:22 INFO master.Master: Registering worker 
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:03:23 INFO master.Master: Registering worker 
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:03:23 INFO master.Master: Registering worker 
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:05:54 INFO master.Master: 
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
>> removing it.
>> 14/05/30 18:05:54 INFO actor.LocalActorRef: Message 
>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
>> Actor[akka://sparkMaster/deadLetters] to 
>> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
>>  was not delivered. [5] dead letters encountered. This logging can be turned 
>> off or adjusted with configuration settings 'akka.log-dead-letters' and 
>> 'akka.log-dead-letters-during-shutdown'.
>> 14/05/30 18:05:54 INFO master.Master: 
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
>> removing it.
>> 14/05/30 18:05:54 INFO master.Master: 
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
>> removing it.
>> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
>> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> 
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
>> failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
>> akka.remote.EndpointAssociationException: Association failed with 
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
>> Caused by: 
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
>> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
>> ]
>> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
>> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> 
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
>> failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
>> akka.remote.EndpointAssociationException: Association failed with 
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
>> Caused by: 
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
>> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
>> ]
>> 14/05/30 18:05:54 INFO master.Master: 
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
>> removing it.
>> 14/05/30 18:05:54 INFO master.Master: 
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated, 
>> removing it.
>> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
>> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077] -> 
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error [Association 
>> failed with [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
>> akka.remote.EndpointAssociationException: Association failed with 
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
>> Caused by: 
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
>> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
> 
> 



Re: Trouble with EC2

2014-06-01 Thread Jeremy Lee
Ha, yes... I just went through this.

(a) You have to use the 'default' Spark AMI (ami-7a320f3f at the moment),
not any of the other Linux distros. They don't work.
(b) Start with m1.large instances. I tried going for r3.large at first,
and had no end of self-caused trouble. m1.large works.
(c) It's possible for the script to choose the wrong AMI, especially if one
has been messing with it to allow other instance types. (ahem)

But it will work in the end... just start simple. (Yeah, I know m1.large
doesn't look that large anymore. :-)


On Mon, Jun 2, 2014 at 8:11 AM, PJ$  wrote:

> Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't
> gotten any further. No clue what's wrong. I'd really appreciate any
> guidance y'all could offer.
>
> Best,
> PJ$
>
>
> On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia 
> wrote:
>
>> What instance types did you launch on?
>>
>> Sometimes you also get a bad individual machine from EC2. It might help
>> to remove the node it’s complaining about from the conf/slaves file.
>>
>> Matei
>>
>> On May 30, 2014, at 11:18 AM, PJ$  wrote:
>>
>> Hey Folks,
>>
>> I'm really having quite a bit of trouble getting spark running on ec2.
>> I'm not using scripts the https://github.com/apache/spark/tree/master/ec2
>> because I'd like to know how everything works. But I'm going a little
>> crazy. I think that something about the networking configuration must be
>> messed up, but I'm at a loss. Shortly after starting the cluster, I get a
>> lot of this:
>>
>> 14/05/30 18:03:22 INFO master.Master: Registering worker
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:03:22 INFO master.Master: Registering worker
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:03:23 INFO master.Master: Registering worker
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:03:23 INFO master.Master: Registering worker
>> ip-10-100-184-45.ec2.internal:7078 with 2 cores, 6.3 GB RAM
>> 14/05/30 18:05:54 INFO master.Master:
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
>> removing it.
>> 14/05/30 18:05:54 INFO actor.LocalActorRef: Message
>> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
>> Actor[akka://sparkMaster/deadLetters] to
>> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.100.75.70%3A36725-25#847210246]
>> was not delivered. [5] dead letters encountered. This logging can be turned
>> off or adjusted with configuration settings 'akka.log-dead-letters' and
>> 'akka.log-dead-letters-during-shutdown'.
>> 14/05/30 18:05:54 INFO master.Master:
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
>> removing it.
>> 14/05/30 18:05:54 INFO master.Master:
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
>> removing it.
>> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
>> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
>> -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
>> [Association failed with
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
>> akka.remote.EndpointAssociationException: Association failed with [
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
>> ]
>> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
>> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
>> -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
>> [Association failed with
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
>> akka.remote.EndpointAssociationException: Association failed with [
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
>> ]
>> 14/05/30 18:05:54 INFO master.Master:
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
>> removing it.
>> 14/05/30 18:05:54 INFO master.Master:
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485 got disassociated,
>> removing it.
>> 14/05/30 18:05:54 ERROR remote.EndpointWriter: AssociationError 
>> [akka.tcp://sparkMaster@ip-10-100-184-45.ec2.internal:7077]
>> -> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]: Error
>> [Association failed with
>> [akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]] [
>> akka.remote.EndpointAssociationException: Association failed with [
>> akka.tcp://spark@ip-10-100-75-70.ec2.internal:38485]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: ip-10-100-75-70.ec2.internal/10.100.75.70:38485
>>
>>
>>
>


-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
As a debugging step, does it work if you use a single resource manager
with the key "yarn.resourcemanager.address" instead of using two named
resource managers? I wonder if somehow the YARN client can't detect
this multi-master set-up.
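
One quick way to check which yarn-site.xml (if any) the JVM actually picks up is to
load a YarnConfiguration from the same classpath and print the resolved address; a
small debugging sketch, assuming hadoop-yarn-common is available:

import org.apache.hadoop.yarn.conf.YarnConfiguration

object CheckYarnConf {
  def main(args: Array[String]): Unit = {
    // If yarn-site.xml is not on the classpath, this falls back to the
    // yarn-default.xml value 0.0.0.0:8032 -- the address seen in the
    // connection retries quoted earlier in this thread.
    val conf = new YarnConfiguration()
    println("yarn.resourcemanager.address = " + conf.get(YarnConfiguration.RM_ADDRESS))
  }
}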

On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen  wrote:
> Note that everything works fine in spark 0.9, which is packaged in CDH5: I
> can launch a spark-shell and interact with workers spawned on my yarn
> cluster.
>
> So in my /opt/hadoop/conf/yarn-site.xml, I have:
> ...
> <property>
>   <name>yarn.resourcemanager.address.rm1</name>
>   <value>controller-1.mycomp.com:23140</value>
> </property>
> ...
> <property>
>   <name>yarn.resourcemanager.address.rm2</name>
>   <value>controller-2.mycomp.com:23140</value>
> </property>
> ...
>
> And the other usual stuff.
>
> So spark 1.0 is launched like this:
> Spark Command: java -cp
> ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
> -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
> org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class
> org.apache.spark.repl.Main
>
> I do see "/opt/hadoop/conf" included, but not sure it's the right place.
>
> Thanks..
> -Simon
>
>
>
> On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell  wrote:
>>
>> I would agree with your guess, it looks like the yarn library isn't
>> correctly finding your yarn-site.xml file. If you look in
>> yarn-site.xml, do you definitely see the resource manager
>> address/addresses?
>>
>> Also, you can try running this command with
>> SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
>> set-up correctly.
>>
>> - Patrick
>>
>> On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen 
>> wrote:
>> > Hi all,
>> >
>> > I tried a couple ways, but couldn't get it to work..
>> >
>> > The following seems to be what the online document
>> > (http://spark.apache.org/docs/latest/running-on-yarn.html) is
>> > suggesting:
>> >
>> > SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
>> > YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
>> >
>> > Help info of spark-shell seems to be suggesting "--master yarn
>> > --deploy-mode
>> > cluster".
>> >
>> > But either way, I am seeing the following messages:
>> > 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager at
>> > /0.0.0.0:8032
>> > 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
>> > 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
>> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>> > 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
>> > 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
>> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
>> >
>> > My guess is that spark-shell is trying to talk to resource manager to
>> > setup
>> > spark master/worker nodes - I am not sure where 0.0.0.0:8032 came from
>> > though. I am running CDH5 with two resource managers in HA mode. Their
>> > IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
>> > HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
>> >
>> > Any ideas? Thanks.
>> > -Simon
>
>


Re: Create/shutdown objects before/after RDD use (or: Non-serializable classes)

2014-06-01 Thread Tobias Pfeiffer
Xiangrui,

thanks for your suggestion!

On Sat, May 31, 2014 at 6:12 PM, Xiangrui Meng  wrote:
> One hack you can try is:
>
> rdd.mapPartitions(iter => {
>   val x = new X()
>   iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty }
> })

In fact, I have employed a similar hack for now:

rdd.mapPartitions(iter => {
  val x = new X()
  iter.map(row => {
x.doSomethingWith(row)
if (!iter.hasNext) x.shutdown()
row
  })
})

Thanks
Tobias
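
A self-contained sketch of this setup/teardown-per-partition pattern, runnable in
local mode; DummyClient stands in for the non-serializable X and is purely
illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for the non-serializable resource discussed above.
class DummyClient {
  def doSomethingWith(s: String): String = s.toUpperCase
  def shutdown(): Unit = println("client shut down")
}

object PerPartitionResourceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("per-partition").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
    val out = rdd.mapPartitions { iter =>
      val client = new DummyClient()           // created once per partition, on the executor
      iter.map { row =>
        val result = client.doSomethingWith(row)
        if (!iter.hasNext) client.shutdown()   // last element of the partition: clean up
        result
      }
    }
    out.collect().foreach(println)
    sc.stop()
  }
}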


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
Sort of.. there were two separate issues, but both related to AWS..

I've sorted the confusion about the Master/Worker AMI ... use the version
chosen by the scripts. (and use the right instance type so the script can
choose wisely)

But yes, one also needs a "launch machine" to kick off the cluster, and for
that I _also_ was using an Amazon instance... (made sense: I have a team
that will need to do things as well, not just me) and I was just pointing
out that if you use the "most recommended by Amazon" AMI (for your free
micro instance, for example) you get Python 2.6 and the ec2 scripts fail.

That merely needs a line in the documentation saying "use Ubuntu for your
cluster controller, not Amazon Linux" or somesuch. But yeah, for a newbie,
it was hard working out when to use "default" or "custom" AMIs for various
parts of the setup.


On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell  wrote:

> Hey just to clarify this - my understanding is that the poster
> (Jeremey) was using a custom AMI to *launch* spark-ec2. I normally
> launch spark-ec2 from my laptop. And he was looking for an AMI that
> had a high enough version of python.
>
> Spark-ec2 itself has a flag "-a" that allows you to give a specific
> AMI. This flag is just an internal tool that we use for testing when
> we spin new AMI's. Users can't set that to an arbitrary AMI because we
> tightly control things like the Java and OS versions, libraries, etc.
>
>
> On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
>  wrote:
> > *sigh* OK, I figured it out. (Thank you Nick, for the hint)
> >
> > "m1.large" works, (I swear I tested that earlier and had similar
> issues... )
> >
> > It was my obsession with starting "r3.*large" instances. Clearly I hadn't
> > patched the script in all the places.. which I think caused it to
> default to
> > the Amazon AMI. I'll have to take a closer look at the code and see if I
> > can't fix it correctly, because I really, really do want nodes with 2x
> the
> > CPU and 4x the memory for the same low spot price. :-)
> >
> > I've got a cluster up now, at least. Time for the fun stuff...
> >
> > Thanks everyone for the help!
> >
> >
> >
> > On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
> >  wrote:
> >>
> >> If you are explicitly specifying the AMI in your invocation of
> spark-ec2,
> >> may I suggest simply removing any explicit mention of AMI from your
> >> invocation? spark-ec2 automatically selects an appropriate AMI based on
> the
> >> specified instance type.
> >>
> >> On Sunday, June 1, 2014, Nicholas Chammas wrote:
> >>
> >>> Could you post how exactly you are invoking spark-ec2? And are you
> having
> >>> trouble just with r3 instances, or with any instance type?
> >>>
> >>> On Sunday, June 1, 2014, Jeremy Lee wrote:
> >>>
> >>> It's been another day of spinning up dead clusters...
> >>>
> >>> I thought I'd finally worked out what everyone else knew - don't use
> the
> >>> default AMI - but I've now run through all of the "official"
> quick-start
> >>> linux releases and I'm none the wiser:
> >>>
> >>> Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
> >>> Provisions servers, connects, installs, but the webserver on the master
> >>> will not start
> >>>
> >>> Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
> >>> Spot instance requests are not supported for this AMI.
> >>>
> >>> SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
> >>> Not tested - costs 10x more for spot instances, not economically
> viable.
> >>>
> >>> Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
> >>> Provisions servers, but "git" is not pre-installed, so the cluster
> setup
> >>> fails.
> >>>
> >>> Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
> >>> Provisions servers, but "git" is not pre-installed, so the cluster
> setup
> >>> fails.
> >
> >
> >
> >
> > --
> > Jeremy Lee  BCompSci(Hons)
> >   The Unorthodox Engineers
>



-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
That helped a bit... Now I have a different failure: the start-up process
is stuck in an infinite loop, printing the following message:

14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application
report from ASM:
 appMasterRpcPort: -1
 appStartTime: 1401672868277
 yarnAppState: ACCEPTED

I am using the Hadoop 2 prebuilt package. Probably it doesn't have the
latest YARN client.

-Simon




On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell  wrote:

> As a debugging step, does it work if you use a single resource manager
> with the key "yarn.resourcemanager.address" instead of using two named
> resource managers? I wonder if somehow the YARN client can't detect
> this multi-master set-up.
>
> On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen 
> wrote:
> > Note that everything works fine in spark 0.9, which is packaged in CDH5:
> I
> > can launch a spark-shell and interact with workers spawned on my yarn
> > cluster.
> >
> > So in my /opt/hadoop/conf/yarn-site.xml, I have:
> > ...
> > <property>
> >   <name>yarn.resourcemanager.address.rm1</name>
> >   <value>controller-1.mycomp.com:23140</value>
> > </property>
> > ...
> > <property>
> >   <name>yarn.resourcemanager.address.rm2</name>
> >   <value>controller-2.mycomp.com:23140</value>
> > </property>
> > ...
> >
> > And the other usual stuff.
> >
> > So spark 1.0 is launched like this:
> > Spark Command: java -cp
> >
> ::/home/chenxu/spark-1.0.0-bin-hadoop2/conf:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar:/home/chenxu/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
> > -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
> > org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client
> --class
> > org.apache.spark.repl.Main
> >
> > I do see "/opt/hadoop/conf" included, but not sure it's the right place.
> >
> > Thanks..
> > -Simon
> >
> >
> >
> > On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell 
> wrote:
> >>
> >> I would agree with your guess, it looks like the yarn library isn't
> >> correctly finding your yarn-site.xml file. If you look in
> >> yarn-site.xml, do you definitely see the resource manager
> >> address/addresses?
> >>
> >> Also, you can try running this command with
> >> SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath is being
> >> set-up correctly.
> >>
> >> - Patrick
> >>
> >> On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen 
> >> wrote:
> >> > Hi all,
> >> >
> >> > I tried a couple ways, but couldn't get it to work..
> >> >
> >> > The following seems to be what the online document
> >> > (http://spark.apache.org/docs/latest/running-on-yarn.html) is
> >> > suggesting:
> >> >
> >> >
> SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
> >> > YARN_CONF_DIR=/opt/hadoop/conf ./spark-shell --master yarn-client
> >> >
> >> > Help info of spark-shell seems to be suggesting "--master yarn
> >> > --deploy-mode
> >> > cluster".
> >> >
> >> > But either way, I am seeing the following messages:
> >> > 14/06/01 00:33:20 INFO client.RMProxy: Connecting to ResourceManager
> at
> >> > /0.0.0.0:8032
> >> > 14/06/01 00:33:21 INFO ipc.Client: Retrying connect to server:
> >> > 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is
> >> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
> SECONDS)
> >> > 14/06/01 00:33:22 INFO ipc.Client: Retrying connect to server:
> >> > 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is
> >> > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1
> SECONDS)
> >> >
> >> > My guess is that spark-shell is trying to talk to resource manager to
> >> > setup
> >> > spark master/worker nodes - I am not sure where 0.0.0.0:8032 came
> from
> >> > though. I am running CDH5 with two resource managers in HA mode. Their
> >> > IP/port should be in /opt/hadoop/conf/yarn-site.xml. I tried both
> >> > HADOOP_CONF_DIR and YARN_CONF_DIR, but that info isn't picked up.
> >> >
> >> > Any ideas? Thanks.
> >> > -Simon
> >
> >
>


Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
Thanks again. Run results here:
https://gist.github.com/rjurney/dc0efae486ba7d55b7d5

This time I get a port already in use exception on 4040, but it isn't
fatal. Then when I run rdd.first, I get this over and over:

14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has
not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient memory



On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson  wrote:

> You can avoid that by using the constructor that takes a SparkConf, a la
>
> val conf = new SparkConf()
> conf.setJars(Seq("avro.jar", ...))
> val sc = new SparkContext(conf)
>
>
> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney 
> wrote:
>
>> Followup question: the docs to make a new SparkContext require that I
>> know where $SPARK_HOME is. However, I have no idea. Any idea where that
>> might be?
>>
>>
>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson 
>> wrote:
>>
>>> Gotcha. The easiest way to get your dependencies to your Executors would
>>> probably be to construct your SparkContext with all necessary jars passed
>>> in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is
>>> a "necessary jar", but it's possible your application also needs to
>>> distribute other ones to the cluster.
>>>
>>> An easy way to make sure all your dependencies get shipped to the
>>> cluster is to create an assembly jar of your application, and then you just
>>> need to tell Spark about that jar, which includes all your application's
>>> transitive dependencies. Maven and sbt both have pretty straightforward
>>> ways of producing assembly jars.
>>>
>>>
>>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <
>>> russell.jur...@gmail.com> wrote:
>>>
 Thanks for the fast reply.

 I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
 standalone mode.


 On Saturday, May 31, 2014, Aaron Davidson  wrote:

> First issue was because your cluster was configured incorrectly. You
> could probably read 1 file because that was done on the driver node, but
> when it tried to run a job on the cluster, it failed.
>
> Second issue, it seems that the jar containing avro is not getting
> propagated to the Executors. What version of Spark are you running on? 
> What
> deployment mode (YARN, standalone, Mesos)?
>
>
> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <
> russell.jur...@gmail.com> wrote:
>
> Now I get this:
>
> scala> rdd.first
>
> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
> :41
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
> :41) with 1 output partitions (allowLocal=true)
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
> (first at :41)
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
> List()
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested
> partition locally
>
> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
> hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864
>
> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
> :41, took 0.037371256 s
>
> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
> :41
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
> :41) with 16 output partitions (allowLocal=true)
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
> (first at :41)
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage:
> List()
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
> (HadoopRDD[0] at hadoopRDD at :37), which has no missing parents
>
> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
> tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at :37)
>
> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set
> 5.0 with 16 tasks
>
> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0
> as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>
> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0
> as 1294 bytes in 1 ms
>
> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3
> as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>
> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3
> as 1294 bytes in 0 ms
>
> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1
> as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>
> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized

Re: Spark on EC2

2014-06-01 Thread superback
I haven't set up an AMI yet. I am just trying to run a simple job on the EC2
cluster. So, is setting up an AMI a prerequisite for running a simple Spark
example like org.apache.spark.examples.GroupByTest?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638p6681.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this.

Matei

On Jun 1, 2014, at 6:14 PM, Jeremy Lee  wrote:

> Sort of.. there were two separate issues, but both related to AWS..
> 
> I've sorted the confusion about the Master/Worker AMI ... use the version 
> chosen by the scripts. (and use the right instance type so the script can 
> choose wisely)
> 
> But yes, one also needs a "launch machine" to kick off the cluster, and for 
> that I _also_ was using an Amazon instance... (made sense.. I have a team 
> that will needs to do things as well, not just me) and I was just pointing 
> out that if you use the "most recommended by Amazon" AMI (for your free micro 
> instance, for example) you get python 2.6 and the ec2 scripts fail.
> 
> That merely needs a line in the documentation saying "use Ubuntu for your 
> cluster controller, not Amazon Linux" or somesuch. But yeah, for a newbie, it 
> was hard working out when to use "default" or "custom" AMIs for various parts 
> of the setup.
> 
> 
> On Mon, Jun 2, 2014 at 4:01 AM, Patrick Wendell  wrote:
> Hey just to clarify this - my understanding is that the poster
> (Jeremey) was using a custom AMI to *launch* spark-ec2. I normally
> launch spark-ec2 from my laptop. And he was looking for an AMI that
> had a high enough version of python.
> 
> Spark-ec2 itself has a flag "-a" that allows you to give a specific
> AMI. This flag is just an internal tool that we use for testing when
> we spin new AMI's. Users can't set that to an arbitrary AMI because we
> tightly control things like the Java and OS versions, libraries, etc.
> 
> 
> On Sun, Jun 1, 2014 at 12:51 AM, Jeremy Lee
>  wrote:
> > *sigh* OK, I figured it out. (Thank you Nick, for the hint)
> >
> > "m1.large" works, (I swear I tested that earlier and had similar issues... )
> >
> > It was my obsession with starting "r3.*large" instances. Clearly I hadn't
> > patched the script in all the places.. which I think caused it to default to
> > the Amazon AMI. I'll have to take a closer look at the code and see if I
> > can't fix it correctly, because I really, really do want nodes with 2x the
> > CPU and 4x the memory for the same low spot price. :-)
> >
> > I've got a cluster up now, at least. Time for the fun stuff...
> >
> > Thanks everyone for the help!
> >
> >
> >
> > On Sun, Jun 1, 2014 at 5:19 PM, Nicholas Chammas
> >  wrote:
> >>
> >> If you are explicitly specifying the AMI in your invocation of spark-ec2,
> >> may I suggest simply removing any explicit mention of AMI from your
> >> invocation? spark-ec2 automatically selects an appropriate AMI based on the
> >> specified instance type.
> >>
> >> On Sunday, June 1, 2014, Nicholas Chammas wrote:
> >>
> >>> Could you post how exactly you are invoking spark-ec2? And are you having
> >>> trouble just with r3 instances, or with any instance type?
> >>>
> >>> On Sunday, June 1, 2014, Jeremy Lee wrote:
> >>>
> >>> It's been another day of spinning up dead clusters...
> >>>
> >>> I thought I'd finally worked out what everyone else knew - don't use the
> >>> default AMI - but I've now run through all of the "official" quick-start
> >>> linux releases and I'm none the wiser:
> >>>
> >>> Amazon Linux AMI 2014.03.1 - ami-7aba833f (64-bit)
> >>> Provisions servers, connects, installs, but the webserver on the master
> >>> will not start
> >>>
> >>> Red Hat Enterprise Linux 6.5 (HVM) - ami-5cdce419
> >>> Spot instance requests are not supported for this AMI.
> >>>
> >>> SuSE Linux Enterprise Server 11 sp3 (HVM) - ami-1a88bb5f
> >>> Not tested - costs 10x more for spot instances, not economically viable.
> >>>
> >>> Ubuntu Server 14.04 LTS (HVM) - ami-f64f77b3
> >>> Provisions servers, but "git" is not pre-installed, so the cluster setup
> >>> fails.
> >>>
> >>> Amazon Linux AMI (HVM) 2014.03.1 - ami-5aba831f
> >>> Provisions servers, but "git" is not pre-installed, so the cluster setup
> >>> fails.
> >
> >
> >
> >
> > --
> > Jeremy Lee  BCompSci(Hons)
> >   The Unorthodox Engineers
> 
> 
> 
> -- 
> Jeremy Lee  BCompSci(Hons)
>   The Unorthodox Engineers



Please put me into the mail list, thanks.

2014-06-01 Thread Yunmeng Ban



Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-06-01 Thread Ngoc Dao
Alternative solution:
https://github.com/xitrum-framework/xitrum-package

It collects all of your Scala program's dependency .jar files into a
directory. It doesn't merge the .jar files together; the .jar files
are left as-is.
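
For the assembly-plus-spark-submit route described further down the thread, a
minimal build.sbt sketch (names and versions here are illustrative assumptions;
the sbt-assembly plugin itself would be enabled separately in project/plugins.sbt):

// build.sbt -- mark Spark as "provided" so spark-submit supplies it at runtime
// and it stays out of the assembly jar, avoiding most merge conflicts.
name := "sample-spark-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
)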


On Sat, May 31, 2014 at 3:42 AM, Andrei  wrote:
> Thanks, Stephen. I have eventually decided to go with assembly, but put away
> Spark and Hadoop jars, and instead use `spark-submit` to automatically
> provide these dependencies. This way no resource conflicts arise and
> mergeStrategy needs no modification. To memorize this stable setup and also
> share it with the community I've crafted a project [1] with minimal working
> config. It is SBT project with assembly plugin, Spark 1.0 and Cloudera's
> Hadoop client. Hope, it will help somebody to take Spark setup quicker.
>
> Though I'm fine with this setup for final builds, I'm still looking for a
> more interactive dev setup - something that doesn't require full rebuild.
>
> [1]: https://github.com/faithlessfriend/sample-spark-project
>
> Thanks and have a good weekend,
> Andrei
>
> On Thu, May 29, 2014 at 8:27 PM, Stephen Boesch  wrote:
>>
>>
>> The MergeStrategy combined with sbt assembly did work for me.  This is not
>> painless: some trial and error and the assembly may take multiple minutes.
>>
>> You will likely want to filter out some additional classes from the
>> generated jar file.  Here is an SOF answer to explain that and with IMHO the
>> best answer snippet included here (in this case the OP understandably did
>> not want to not include javax.servlet.Servlet)
>>
>> http://stackoverflow.com/questions/7819066/sbt-exclude-class-from-jar
>>
>>
>> mappings in (Compile, packageBin) ~= { (ms: Seq[(File, String)]) =>
>>   ms filter { case (file, toPath) => toPath != "javax/servlet/Servlet.class" }
>> }
>>
>> There is a setting to not include the project files in the assembly but I
>> do not recall it at this moment.
>>
>>
>>
>> 2014-05-29 10:13 GMT-07:00 Andrei :
>>
>>> Thanks, Jordi, your gist looks pretty much like what I have in my project
>>> currently (with few exceptions that I'm going to borrow).
>>>
>>> I like the idea of using "sbt package", since it doesn't require third
>>> party plugins and, most important, doesn't create a mess of classes and
>>> resources. But in this case I'll have to handle jar list manually via Spark
>>> context. Is there a way to automate this process? E.g. when I was a Clojure
>>> guy, I could run "lein deps" (lein is a build tool similar to sbt) to
>>> download all dependencies and then just enumerate them from my app. Maybe
>>> you have heard of something like that for Spark/SBT?
>>>
>>> Thanks,
>>> Andrei
>>>
>>>
>>> On Thu, May 29, 2014 at 3:48 PM, jaranda  wrote:

 Hi Andrei,

 I think the preferred way to deploy Spark jobs is by using the sbt
 package
 task instead of using the sbt assembly plugin. In any case, as you
 comment,
 the mergeStrategy in combination with some dependency exlusions should
 fix
 your problems. Have a look at  this gist
    for further
 details (I just followed some recommendations commented in the sbt
 assembly
 plugin documentation).

 Up to now I haven't found a proper way to combine my
 development/deployment
 phases, although I must say my experience in Spark is pretty poor (it
 really
 depends in your deployment requirements as well). In this case, I think
 someone else could give you some further insights.

 Best,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-uberjar-a-recommended-way-of-running-Spark-Scala-applications-tp6518p6520.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>


Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Yunmeng Ban
Hi,

I'm running the JavaKafkaWordCount example on a standalone cluster. I
want to set 1600 MB of memory for each slave node. I wrote the following in
spark/conf/spark-env.sh:

SPARK_WORKER_MEMORY=1600m

But the logs on the slave nodes look like this:
Spark Executor Command: "/usr/java/latest/bin/java" "-cp"
":/~path/spark/conf:/~path/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar"
"-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"

The memory seems to be the default value, not 1600 MB.
I don't know how to make SPARK_WORKER_MEMORY take effect.
Can anyone help me?
Many thanks in advance.

Yunmeng


Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
Sounds like you have two shells running, and the first one is taking all
your resources. Do a "jps" and kill the other one, then try again.

By the way, you can look at http://localhost:8080 (replace localhost with
the server your Spark Master is running on) to see what applications are
currently started, and what resource allocations they have.


On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney 
wrote:

> Thanks again. Run results here:
> https://gist.github.com/rjurney/dc0efae486ba7d55b7d5
>
> This time I get a port already in use exception on 4040, but it isn't
> fatal. Then when I run rdd.first, I get this over and over:
>
> 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not 
> accepted any resources; check your cluster UI to ensure that workers are 
> registered and have sufficient memory
>
>
>
>
>
> On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson  wrote:
>
>> You can avoid that by using the constructor that takes a SparkConf, a la
>>
>> val conf = new SparkConf()
>> conf.setJars(Seq("avro.jar", ...))
>> val sc = new SparkContext(conf)
>>
>>
>> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney 
>> wrote:
>>
>>> Followup question: the docs to make a new SparkContext require that I
>>> know where $SPARK_HOME is. However, I have no idea. Any idea where that
>>> might be?
>>>
>>>
>>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson 
>>> wrote:
>>>
 Gotcha. The easiest way to get your dependencies to your Executors
 would probably be to construct your SparkContext with all necessary jars
 passed in (as the "jars" parameter), or inside a SparkConf with setJars().
 Avro is a "necessary jar", but it's possible your application also needs to
 distribute other ones to the cluster.

 An easy way to make sure all your dependencies get shipped to the
 cluster is to create an assembly jar of your application, and then you just
 need to tell Spark about that jar, which includes all your application's
 transitive dependencies. Maven and sbt both have pretty straightforward
 ways of producing assembly jars.


 On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <
 russell.jur...@gmail.com> wrote:

> Thanks for the fast reply.
>
> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
> standalone mode.
>
>
> On Saturday, May 31, 2014, Aaron Davidson  wrote:
>
>> First issue was because your cluster was configured incorrectly. You
>> could probably read 1 file because that was done on the driver node, but
>> when it tried to run a job on the cluster, it failed.
>>
>> Second issue, it seems that the jar containing avro is not getting
>> propagated to the Executors. What version of Spark are you running on? 
>> What
>> deployment mode (YARN, standalone, Mesos)?
>>
>>
>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <
>> russell.jur...@gmail.com> wrote:
>>
>> Now I get this:
>>
>> scala> rdd.first
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>> :41
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at
>> :41) with 1 output partitions (allowLocal=true)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4
>> (first at :41)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final
>> stage: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the
>> requested partition locally
>>
>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split:
>> hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-0.avro:0+3864
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at
>> :41, took 0.037371256 s
>>
>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at
>> :41
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at
>> :41) with 16 output partitions (allowLocal=true)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5
>> (first at :41)
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final
>> stage: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5
>> (HadoopRDD[0] at hadoopRDD at :37), which has no missing parents
>>
>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing
>> tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at :37)
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set
>> 5.0 with 16 tasks
>>
>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0
>> as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>
>> 14/05/31 21:36:28 INFO

Re: Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Aaron Davidson
In addition to setting the Standalone memory, you'll also need to tell your
SparkContext to claim the extra resources. Set "spark.executor.memory" to
1600m as well. This should be a system property set in SPARK_JAVA_OPTS in
conf/spark-env.sh (in 0.9.1, which you appear to be using) -- e.g.,
export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"
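
An alternative sketch is to request the executor memory from the application
itself via its SparkConf (the property name is standard; the app name below is
just taken from the question):

import org.apache.spark.{SparkConf, SparkContext}

// Ask for 1600 MB per executor when the context is created.
val conf = new SparkConf()
  .setAppName("JavaKafkaWordCount")
  .set("spark.executor.memory", "1600m")
val sc = new SparkContext(conf)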


On Sun, Jun 1, 2014 at 7:36 PM, Yunmeng Ban  wrote:

> Hi,
>
> I'm running the example of JavaKafkaWordCount in a standalone cluster. I
> want to set 1600MB memory for each slave node. I wrote in the
> spark/conf/spark-env.sh
>
> SPARK_WORKER_MEMORY=1600m
>
> But the logs on slave nodes looks this:
> Spark Executor Command: "/usr/java/latest/bin/java" "-cp"
> ":/~path/spark/conf:/~path/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar"
> "-Xms512M" "-Xmx512M"
> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>
> The memory seems to be the default number, not 1600M.
> I don't how to make SPARK_WORKER_MEMORY work.
> Can anyone help me?
> Many thanks in advance.
>
> Yunmeng
>


Re: apache whirr for spark

2014-06-01 Thread chirag lakhani
Thanks for letting me know. I am leaning towards using Whirr to set up a
YARN cluster with Hive, Pig, HBase, etc., and then adding Spark on YARN.
Is it pretty straightforward to install Spark on a YARN cluster?


On Fri, May 30, 2014 at 5:51 PM, Matei Zaharia 
wrote:

> I don’t think Whirr provides support for this, but Spark’s own EC2 scripts
> also launch a Hadoop cluster:
> http://spark.apache.org/docs/latest/ec2-scripts.html.
>
> Matei
>
> On May 30, 2014, at 12:59 PM, chirag lakhani 
> wrote:
>
> > Does anyone know if it is possible to use Whirr to setup a Spark cluster
> on AWS.  I would like to be able to use Whirr to setup a cluster that has
> all of the standard Hadoop and Spark tools.  I want to automate this
> process because I anticipate I will have to create and destroy often enough
> that I would like to have it all automated.  Could anyone provide any
> pointers into how this could be done or whether it is documented somewhere?
> >
> > Chirag Lakhani
>
>


Re: Spark on EC2

2014-06-01 Thread Nicholas Chammas
No, you don't have to set up your own AMI. Actually it's probably simpler
and less error prone if you let spark-ec2 manage that for you as you first
start to get comfortable with Spark. Just spin up a cluster without any
explicit mention of AMI and it will do the right thing.

On Sunday, June 1, 2014, superback wrote:

> I haven't set up AMI yet. I am just trying to run a simple job on the EC2
> cluster. So, is setting up AMI a prerequisite for running simple Spark
> example like org.apache.spark.examples.GroupByTest?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2-tp6638p6681.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-06-01 Thread Wei Da
Hi guys,
I'm using IntelliJ IDEA 13.1.2 Community Edition, and I have installed the
Scala plugin and Maven 3.2.1. I want to develop Spark applications with
IntelliJ IDEA through Maven.

In IntelliJ, I created a Maven project with the archetype ID
"spark-core_2.10", but got the following messages in the "Message Maven
Goal" window:

=

[WARNING] Archetype not found in any catalog. Falling back to central
repository (http://repo1.maven.org/maven2).
[WARNING] Use -DarchetypeRepository= if archetype's
repository is elsewhere.
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 20.064 s
[INFO] Finished at: 2014-06-02T11:50:14+08:00
[INFO] Final Memory: 9M/65M
[INFO]

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-archetype-plugin:2.2:generate (default-cli)
on project standalone-pom: The defined artifact is not an archetype ->
[Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] Maven execution terminated abnormally (exit code 1)

=

I have spent several days on this, but without any success.
The instructions on the Spark website
(http://spark.apache.org/docs/latest/building-with-maven.html) may be too
brief for newbies like me. Are there any more detailed instructions on how
to build a Spark app with IntelliJ IDEA? Thanks a lot!


Re: Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-06-01 Thread Matei Zaharia
Don’t try to use spark-core as an archetype. Instead just create a plain Scala 
project (no archetype) and add a Maven dependency on spark-core. That should be 
all you need.

Matei
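
Once spark-core is on the classpath as a plain Maven dependency, a minimal
smoke-test app (the names and the local[2] master are illustrative) should
compile and run from inside the IDE:

import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("HelloSpark").setMaster("local[2]"))
    // Trivial job just to confirm the project is wired up correctly.
    println("even count = " + sc.parallelize(1 to 100).filter(_ % 2 == 0).count())
    sc.stop()
  }
}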

On Jun 1, 2014, at 9:15 PM, Wei Da  wrote:

> Hi guys,
> I'm using IntelliJ IDEA 13.1.2 Community Edition, and I have installed Scala 
> plugin and Maven 3.2.1. I want to develop Spark applications with IntelliJ 
> IDEA through Maven. 
> 
> In IntelliJ, I create a Maven project with the archetype ID 
> "spark-core_2.10", but got the following messages in the "Message Maven Goal":
> 
> =
> 
> [WARNING] Archetype not found in any catalog. Falling back to central 
> repository (http://repo1.maven.org/maven2).
> [WARNING] Use -DarchetypeRepository= if archetype's 
> repository is elsewhere.
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 20.064 s
> [INFO] Finished at: 2014-06-02T11:50:14+08:00
> [INFO] Final Memory: 9M/65M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-archetype-plugin:2.2:generate (default-cli) on 
> project standalone-pom: The defined artifact is not an archetype -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> [ERROR] Maven execution terminated abnormally (exit code 1)
> 
> =
> 
> I have spent several days on this, but did not get any success. 
> The intructions on Spark Website 
> (http://spark.apache.org/docs/latest/building-with-maven.html) may be to 
> brief for newbies like me. Is there any more detailed instructions on how to 
> build Spark App with Intellij IDEA? Thanks a lot!