Re: Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-06-01 Thread Matei Zaharia
Don’t try to use spark-core as an archetype. Instead just create a plain Scala project (no archetype) and add a Maven dependency on spark-core. That should be all you need. Matei On Jun 1, 2014, at 9:15 PM, Wei Da wrote: > Hi guys, > I'm using IntelliJ IDEA 13.1.2 Community Edition, and I hav
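For reference, the Maven dependency Matei describes would look roughly like this in a pom.xml; the 1.0.0 / Scala 2.10 coordinates are an assumption based on the era of this thread:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.0</version>
    </dependency>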

Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-06-01 Thread Wei Da
Hi guys, I'm using IntelliJ IDEA 13.1.2 Community Edition, and I have installed the Scala plugin and Maven 3.2.1. I want to develop Spark applications with IntelliJ IDEA through Maven. In IntelliJ, I created a Maven project with the archetype ID "spark-core_2.10", but got the following messages in the

Re: Spark on EC2

2014-06-01 Thread Nicholas Chammas
No, you don't have to set up your own AMI. Actually it's probably simpler and less error-prone if you let spark-ec2 manage that for you as you first start to get comfortable with Spark. Just spin up a cluster without any explicit mention of an AMI and it will do the right thing. On Sunday, June 1, 2014, supe
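As an illustration, a plain spark-ec2 launch with no -a/--ami flag looks roughly like this; the key pair, identity file, slave count, and cluster name are placeholders:

    ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 2 -t m1.large launch my-cluster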

Re: apache whirr for spark

2014-06-01 Thread chirag lakhani
Thanks for letting me know. I am leaning towards using Whirr to set up a Yarn cluster with Hive, Pig, Hbase, etc... and then adding Spark on Yarn. Is it pretty straightforward to install Spark on a Yarn cluster? On Fri, May 30, 2014 at 5:51 PM, Matei Zaharia wrote: > I don’t think Whirr provide

Re: Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Aaron Davidson
In addition to setting the Standalone memory, you'll also need to tell your SparkContext to claim the extra resources. Set "spark.executor.memory" to 1600m as well. This should be a system property set in SPARK_JAVA_OPTS in conf/spark-env.sh (in 0.9.1, which you appear to be using) -- e.g., export
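A sketch of the spark-env.sh line being described for 0.9.x, using the 1600m figure from the original question:

    export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"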

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
Sounds like you have two shells running, and the first one is taking all your resources. Do a "jps" and kill the other guy, then try again. By the way, you can look at http://localhost:8080 (replace localhost with the server your Spark Master is running on) to see what applications are currently

Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Yunmeng Ban
Hi, I'm running the JavaKafkaWordCount example in a standalone cluster. I want to set 1600MB of memory for each slave node. I wrote SPARK_WORKER_MEMORY=1600m in spark/conf/spark-env.sh, but the logs on the slave nodes look like this: Spark Executor Command: "/usr/java/latest/bin/java" "-cp" ":/~path/

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-06-01 Thread Ngoc Dao
Alternative solution: https://github.com/xitrum-framework/xitrum-package It collects all the dependency .jar files of your Scala program into a directory. It doesn't merge the .jar files together; the .jar files are left as-is. On Sat, May 31, 2014 at 3:42 AM, Andrei wrote: > Thanks, Stephen. I h

Please put me into the mail list, thanks.

2014-06-01 Thread Yunmeng Ban

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
FYI, I opened https://issues.apache.org/jira/browse/SPARK-1990 to track this. Matei On Jun 1, 2014, at 6:14 PM, Jeremy Lee wrote: > Sort of.. there were two separate issues, but both related to AWS.. > > I've sorted the confusion about the Master/Worker AMI ... use the version > chosen by the

Re: Spark on EC2

2014-06-01 Thread superback
I haven't set up an AMI yet. I am just trying to run a simple job on the EC2 cluster. So, is setting up an AMI a prerequisite for running a simple Spark example like org.apache.spark.examples.GroupByTest? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-EC2

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
Thanks again. Run results here: https://gist.github.com/rjurney/dc0efae486ba7d55b7d5 This time I get a port already in use exception on 4040, but it isn't fatal. Then when I run rdd.first, I get this over and over: 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted a

Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
That helped a bit... Now I have a different failure: the startup process is stuck in an infinite loop outputting the following message: 14/06/02 01:34:56 INFO cluster.YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort: -1 appStartTime: 1401672868277 yarnAppState: ACCEPTE

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
Sort of.. there were two separate issues, but both related to AWS.. I've sorted the confusion about the Master/Worker AMI ... use the version chosen by the scripts. (and use the right instance type so the script can choose wisely) But yes, one also needs a "launch machine" to kick off the cluster

Re: Create/shutdown objects before/after RDD use (or: Non-serializable classes)

2014-06-01 Thread Tobias Pfeiffer
Xiangrui, thanks for your suggestion! On Sat, May 31, 2014 at 6:12 PM, Xiangrui Meng wrote: > One hack you can try is: > > rdd.mapPartitions(iter => { > val x = new X() > iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty } > }) In fact, I employed a similar hack by n
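Cleaned up, the per-partition setup/shutdown pattern sketched above looks roughly like this; X and doSomethingWith stand in for the user's non-serializable class:

    import org.apache.spark.rdd.RDD

    // Placeholder for the non-serializable resource being discussed
    class X {
      def doSomethingWith(row: String): String = row.toUpperCase
      def shutdown(): Unit = ()
    }

    def processWithResource(rdd: RDD[String]): RDD[String] =
      rdd.mapPartitions { iter =>
        val x = new X()                          // created once per partition, on the executor
        iter.map(row => x.doSomethingWith(row)) ++ {
          x.shutdown()                           // runs only after the partition is fully consumed
          Iterator.empty
        }
      }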

Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
As a debugging step, does it work if you use a single resource manager with the key "yarn.resourcemanager.address" instead of using two named resource managers? I wonder if somehow the YARN client can't detect this multi-master set-up. On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen wrote: > Not

Re: Trouble with EC2

2014-06-01 Thread Jeremy Lee
Ha yes... I just went through this. (a) You have to use the 'default' spark AMI (ami-7a320f3f at the moment) and not any of the other linux distros. They don't work. (b) Start with m1.large instances. I tried going for r3.large at first, and had no end of self-caused trouble. m1.large works.

Re: Trouble with EC2

2014-06-01 Thread Matei Zaharia
So to run spark-ec2, you should use the default AMI that it launches with if you don’t pass -a. Those are based on Amazon Linux, not Debian. Passing your own AMI is an advanced option but people need to install some stuff on their AMI in advance for it to work with our scripts. Matei On Jun 1

Re: Trouble with EC2

2014-06-01 Thread PJ$
Running on a few m3.larges with the ami-848a6eec image (debian 7). Haven't gotten any further. No clue what's wrong. I'd really appreciate any guidance y'all could offer. Best, PJ$ On Sat, May 31, 2014 at 1:40 PM, Matei Zaharia wrote: > What instance types did you launch on? > > Sometimes you

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
You can avoid that by using the constructor that takes a SparkConf, a la val conf = new SparkConf() conf.setJars("avro.jar", ...) val sc = new SparkContext(conf) On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney wrote: > Followup question: the docs to make a new SparkContext require that I know >
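Note that in the Scala API SparkConf.setJars takes a sequence, so a compilable version of that snippet looks roughly like this; the app name and jar paths are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .setJars(Seq("/path/to/avro.jar", "/path/to/my-app.jar"))   // shipped to the executors
    val sc = new SparkContext(conf)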

[Spark Streaming] Distribute custom receivers evenly across executors

2014-06-01 Thread Guang Gao
Dear All, I'm running Spark Streaming (1.0.0) with Yarn (2.2.0) on a 10-node cluster. I set up 10 custom receivers to listen to 10 data streams. I want one receiver per node in order to maximize the network bandwidth. However, if I set "--executor-cores 4", the 10 receivers only run on 3 of the nod

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
Anwar, Will try this as it might do exactly what I need. I will follow your pattern but use sc.textFile() for each file. I am now thinking that I could start with an RDD of file paths and map it into (path, content) pairs, provided I could read a file on the server. Thank you, Oleg On 1 June

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Russell Jurney
Followup question: the docs to make a new SparkContext require that I know where $SPARK_HOME is. However, I have no idea. Any idea where that might be? On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson wrote: > Gotcha. The easiest way to get your dependencies to your Executors would > probably be

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
Nicholas, wholeTextFiles(), new in 1.0, gets me exactly what I need. It would be great to have this functionality for an arbitrary directory tree. Thank you, Oleg

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Nicholas Chammas
sc.wholeTextFiles() will get you close. Alternately, you could write a loop with plain sc.textFile() that loads all the files under each batch into a separate RDD. On Sun, Jun 1, 2014 at 4:40 P
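A rough sketch of both approaches, assuming an sc already in scope and made-up paths:

    // 1) (filePath, fileContent) pairs for the small files in one directory:
    val batch1 = sc.wholeTextFiles("hdfs:///data/batch-1")

    // 2) Or loop over batch directories, building one RDD per batch, keyed by batch name:
    val batches = Seq("batch-1", "batch-2", "batch-3")
    val perBatch = batches.map(b => b -> sc.textFile(s"hdfs:///data/$b/*.txt")).toMap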

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
I have a large number of directories under a common root: batch-1/file1.txt batch-1/file2.txt batch-1/file3.txt ... batch-2/file1.txt batch-2/file2.txt batch-2/file3.txt ... batch-N/file1.txt batch-N/file2.txt batch-N/file3.txt ... I would like to read them into an RDD like { "batch-1" : [ conte

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Matei Zaharia
More specifically with the -a flag, you *can* set your own AMI, but you’ll need to base it off ours. This is because spark-ec2 assumes that some packages (e.g. java, Python 2.6) are already available on the AMI. Matei On Jun 1, 2014, at 11:01 AM, Patrick Wendell wrote: > Hey just to clarify t

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
Ah yes, looking back at the first email in the thread, indeed that was the case. For the record, I too launch clusters from my laptop, where I have Python 2.7 installed. On Sun, Jun 1, 2014 at 2:01 PM, Patrick Wendell wrote: > Hey just to clarify this - my understanding is that the poster > (Je

Re: spark 1.0.0 on yarn

2014-06-01 Thread Xu (Simon) Chen
Note that everything works fine in spark 0.9, which is packaged in CDH5: I can launch a spark-shell and interact with workers spawned on my yarn cluster. So in my /opt/hadoop/conf/yarn-site.xml, I have: ... yarn.resourcemanager.address.rm1 controller-1.mycomp.com:23140

Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre Borckmans
You're right Patrick! Just had a chat with the sbt-pack creator and indeed dependencies with classifiers are ignored to avoid problems with a dirty cache... Should be fixed in the next version of the plugin. Cheers Pierre Message sent from a mobile device - excuse typos and abbreviations > Le 1 jui

Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350 On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell wrote: > One potential issue here is that mesos is using classifiers now to > publish their jars. It might be that sbt-pack has trouble with > dependencies that are published

Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
One potential issue here is that mesos is using classifiers now to publish their jars. It might be that sbt-pack has trouble with dependencies that are published using classifiers. I'm pretty sure mesos is the only dependency in Spark that is using classifiers, so that's why I mention it. On Sun,

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Patrick Wendell
Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of Python. Spark-ec2 itself has a flag "-a" that allows you to give a spec

Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml, do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Anwar Rizal
I presume that you need to have access to the path of each file you are reading. I don't know whether there is a good way to do that for HDFS; I need to read the files myself, something like: def openWithPath(inputPath: String, sc:SparkContext) = { val fs= (new Path(inputPath)).getFile
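A compilable sketch in the spirit of this approach, reading each file on the driver and pairing it with its path; the helper name and details are illustrative, not the exact code from the thread:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext
    import scala.io.Source

    def openWithPath(inputPath: String, sc: SparkContext) = {
      val fs = FileSystem.get(sc.hadoopConfiguration)
      val files = fs.listStatus(new Path(inputPath)).map(_.getPath)
      val pairs = files.map { p =>
        val in = fs.open(p)
        val content = try Source.fromInputStream(in).mkString finally in.close()
        (p.toString, content)                  // (path, file content)
      }
      sc.parallelize(pairs)                    // RDD[(String, String)]
    }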

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
Gotcha. The easiest way to get your dependencies to your Executors would probably be to construct your SparkContext with all necessary jars passed in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is a "necessary jar", but it's possible your application also needs to distribu

Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Aaron Davidson
Thanks for the update! I've also run into the block manager timeout issue; it might be a good idea to increase the default significantly (it would probably time out immediately if the TCP connection itself dropped anyway). On Sun, Jun 1, 2014 at 9:48 AM, Chanwit Kaewkasi wrote: > Hi all, > > Thi

Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Chanwit Kaewkasi
Hi all, This is what I found: 1. Like Aaron suggested, an executor will be killed silently when the OS's memory is running out. I've seen this enough times to conclude that it's real. Adding swap and increasing the JVM heap solved the problem, but you will encounter OS paging out and full GC. 2.

Re: sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Nicholas Chammas
Could you provide an example of what you mean? I know it's possible to create an RDD from a path with wildcards, like in the subject. For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also provide a comma-delimited list of paths. Nick On Sunday, June 1, 2014, Oleg Proudnikov wrote:
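For example (bucket and paths are made up), both forms work with plain sc.textFile:

    // Wildcards in the path:
    val gzipped = sc.textFile("s3n://my-bucket/2014-??-??/*.gz")

    // A comma-delimited list of paths/globs in a single call:
    val twoBatches = sc.textFile("hdfs:///data/batch-1/*.txt,hdfs:///data/batch-2/*.txt")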

sc.textFileGroupByPath("*/*.txt")

2014-06-01 Thread Oleg Proudnikov
Hi All, Is it possible to create an RDD from a directory tree of the following form? RDD[(PATH, Seq[TEXT])] Thank you, Oleg

Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre B
Hi all! We've been using the sbt-pack sbt plugin (https://github.com/xerial/sbt-pack) for building our standalone Spark application for a while now. Until version 1.0.0, that worked nicely. For those who don't know the sbt-pack plugin, it basically copies all the dependency JARs from your local
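For anyone unfamiliar with the plugin, enabling it is typically a one-liner in project/plugins.sbt, after which running "sbt pack" collects the application and its dependency jars under target/pack; the version shown is an assumption, so check the sbt-pack releases for a current one:

    // project/plugins.sbt -- coordinates as published by xerial; the version is an assumption
    addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.1")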

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Jeremy Lee
*sigh* OK, I figured it out. (Thank you, Nick, for the hint.) "m1.large" works (I swear I tested that earlier and had similar issues...). It was my obsession with starting "r3.*large" instances. Clearly I hadn't patched the script in all the places, which I think caused it to default to the Amazo

SparkSQL Table schema in Java

2014-06-01 Thread Kuldeep Bora
Hello, Congrats on the 1.0.0 release. I would like to ask why table creation requires a proper class in Scala and Java while in Python you can just use a map? I think that the use of a class for the definition of a table is a bit too restrictive. Using a plain map otoh could be very handy in c
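For context, the class-based route being discussed looks roughly like this in the 1.0.0 Scala API; the file path and fields are made up:

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD            // implicit RDD[Person] -> SchemaRDD

    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")             // schema inferred from the case class
    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")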

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
If you are explicitly specifying the AMI in your invocation of spark-ec2, may I suggest simply removing any explicit mention of the AMI from your invocation? spark-ec2 automatically selects an appropriate AMI based on the specified instance type. On Sunday, June 1, 2014, Nicholas Chammas wrote: > Could y

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Nicholas Chammas
Could you post how exactly you are invoking spark-ec2? And are you having trouble just with r3 instances, or with any instance type? On Sunday, June 1, 2014, Jeremy Lee wrote: > It's been another day of spinning up dead clusters... > > I thought I'd finally worked out what everyone else knew - don't

Re: Spark on EC2

2014-06-01 Thread Jeremy Lee
Hmm.. you've gotten further than me. Which AMI's are you using? On Sun, Jun 1, 2014 at 2:21 PM, superback wrote: > Hi, > I am trying to run an example on AMAZON EC2 and have successfully > set up one cluster with two nodes on EC2. However, when I was testing an > example using the follo