Need help understanding how Hadoop works.

2014-04-22 Thread Carter
Hi, I am a beginner with Hadoop and Spark, and I want some help understanding how Hadoop works. Suppose we have a cluster of 5 computers and install Spark on the cluster WITHOUT Hadoop, and then we run this code on one computer: val doc = sc.textFile("/home/scalatest.txt",5) doc.count Can the "count" t

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Praveen R
I guess you need to limit the heap size. Add the line below to spark-env.sh and make sure to rsync it to all workers. SPARK_JAVA_OPTS+=" -Xms512m -Xmx512m " On Wed, Apr 23, 2014 at 4:55 AM, jaeholee wrote: > Ok. I tried setting the partition number to 128 and numbers greater than > 128, > and no

Re: no response in spark web UI

2014-04-22 Thread Akhil Das
Hi, SparkContext launches the web interface on port 4040; if you have multiple SparkContexts on the same machine, they will bind to successive ports beginning with 4040. Here's the documentation: https://spark.apache.org/docs/0.9.0/monitoring.html And here's a simple Scala program to star
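A minimal sketch of such a program (the object name, master URL, and numbers are illustrative, not from the message): creating the SparkContext is what brings the UI up, and the UI stays reachable only while the context is alive.

  import org.apache.spark.{SparkConf, SparkContext}

  object WebUiDemo {
    def main(args: Array[String]): Unit = {
      // Creating the SparkContext starts the web UI on port 4040
      // (or the next free port if another context already holds 4040).
      val conf = new SparkConf().setMaster("local[2]").setAppName("WebUiDemo")
      val sc = new SparkContext(conf)

      val rdd = sc.parallelize(1 to 1000000, 4)
      println(rdd.count())   // run a job so the Stages tab has something to show

      Thread.sleep(60000)    // keep the context (and the UI) alive for a minute
      sc.stop()
    }
  }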

Re: Question about running spark on yarn

2014-04-22 Thread sandy . ryza
I currently don't have plans to work on that. -Sandy > On Apr 22, 2014, at 8:06 PM, Gordon Wang wrote: > > Thanks, I see. Do you guys have plans to port this to sbt? > > >> On Wed, Apr 23, 2014 at 10:24 AM, Sandy Ryza wrote: >> Right, it only works for Maven >> >> >>> On Tue, Apr 22, 2014 at

Re: Some questions in using Graphx

2014-04-22 Thread Wu Zeming
Hi Ankur, Thanks for your reply! I think these are useful for me. I hope these can be improved in spark-0.9.2 or spark-1.0. Another thing I forgot: I think the persist API for Graph, VertexRDD and EdgeRDD should not be public for now, because it will lead to UnsupportedOperationException when I

no response in spark web UI

2014-04-22 Thread wxhsdp
Hi all, I used to run my app using sbt run, but now I want to see the job information in the Spark web UI. I'm in local mode; I start the spark shell and access the web UI at http://ubuntu.local:4040/stages/. But when I sbt run some application, there is no response in the web UI. How to make con

Custom KryoSerializer

2014-04-22 Thread Soren Macbeth
Does Spark support extending and registering a KryoSerializer class in an application jar? An example of why you might want to do this would be to always register some set of common classes within an organization while still allowing the particular application jar to use a Kryo registrator to regi
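For reference, the registrator route mentioned at the end looks roughly like this (a sketch only; the Record class, registrator name, and wiring are illustrative, not from the thread):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  // Hypothetical "common" class an organization might always want registered.
  case class Record(id: Long, payload: String)

  class OrgRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      kryo.register(classOf[Record])
      // an application jar could extend this class and register its own types as well
    }
  }

  // Wiring it up when building the SparkConf:
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "OrgRegistrator")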

Re: Question about running spark on yarn

2014-04-22 Thread Gordon Wang
Thanks, I see. Do you guys have plans to port this to sbt? On Wed, Apr 23, 2014 at 10:24 AM, Sandy Ryza wrote: > Right, it only works for Maven > > > On Tue, Apr 22, 2014 at 6:23 PM, Gordon Wang wrote: > >> Hi Sandy, >> >> Thanks for your reply! >> >> Does this work for sbt? >> >> I checked the

Re: Question about running spark on yarn

2014-04-22 Thread Sandy Ryza
Right, it only works for Maven On Tue, Apr 22, 2014 at 6:23 PM, Gordon Wang wrote: > Hi Sandy, > > Thanks for your reply! > > Does this work for sbt? > > I checked the commit, looks like only the Maven build has such an option. > > > > On Wed, Apr 23, 2014 at 12:38 AM, Sandy Ryza wrote: > >> Hi Gord

No configuration setting found for key 'akka.version'

2014-04-22 Thread mbaryu
I can't seem to instantiate a SparkContext. What am I doing wrong? I tried using a SparkConf and the 2-string constructor with identical results. (Note that the project is configured for Eclipse in the pom, but I'm compiling and running on the command line.) Here's the exception: ~/workspace/Re

Re: Question about running spark on yarn

2014-04-22 Thread Gordon Wang
Hi Sandy, Thanks for your reply! Does this work for sbt? I checked the commit, looks like only the Maven build has such an option. On Wed, Apr 23, 2014 at 12:38 AM, Sandy Ryza wrote: > Hi Gordon, > > We recently handled this in SPARK-1064. As of 1.0.0, you'll be able to > pass -Phadoop-provided

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt This code: println(g.numEdges) println(g.numVertices) println(g.edges.distinct().count()) gave me 1 9294 2 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave wrote: > I wasn't able to reproduce this with a small test file

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ankur Dave
I wasn't able to reproduce this with a small test file, but I did change the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take the third column rather than the second? If so, would you mind posting a larger sample of the file, or even the whole file if possible? Here's

GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
I am trying to read an edge list into a Graph. My data looks like 394365859 --> 136153151 589404147 --> 1361045425 I read it into a Graph via: val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName) val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t")) .ma
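One way to finish that pipeline, sketched here assuming tab-separated source/destination ids as in the snippet (the path and the dummy attribute value are placeholders, and an existing SparkContext `sc` is assumed):

  import org.apache.spark.graphx.Graph
  import org.apache.spark.rdd.RDD

  // Sketch: parse "srcId<TAB>dstId" lines into vertex-id pairs and build the graph.
  val edgeTuples: RDD[(Long, Long)] = sc.textFile("hdfs:///path/to/edges.tsv")
    .map(_.split("\t"))
    .map(x => (x(0).toLong, x(1).toLong))

  val g = Graph.fromEdgeTuples(edgeTuples, 1)   // 1 is a dummy vertex attribute

  println(g.numVertices)
  println(g.numEdges)
  println(g.edges.distinct().count())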

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
Ok. I tried setting the partition number to 128 and numbers greater than 128, and now I get another error message about "Java heap space". Is it possible that there is something wrong with the setup of my Spark cluster to begin with? Or is it still an issue with partitioning my data? Or do I just n

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
On Wed, Apr 23, 2014 at 12:06 AM, jaeholee wrote: > How do you determine the number of partitions? For example, I have 16 > workers, and the number of cores and the worker memory set in spark-env.sh > are: > > CORE = 8 > MEMORY = 16g So you have the capacity to work on 16 * 8 = 128 tasks at a

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
How do you determine the number of partitions? For example, I have 16 workers, and the number of cores and the worker memory set in spark-env.sh are: CORE = 8 MEMORY = 16g The .csv data I have is about 500MB, but I am eventually going to use a file that is about 15GB. Is the MEMORY variable in s

Joining large dataset causes failure on Spark!

2014-04-22 Thread Hasan Asfoor
Greetings, I have created an RDD of 60 rows and then joined it with itself. For some reason Spark consumes all of my storage, which is more than 20GB of free space! Is this the expected behavior of Spark!? Am I doing something wrong here? The code is shown below (done in Java). I tried to c

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
Most likely the data is not "just too big". For most operations the data is processed partition by partition. The partitions may be too big. This is what your last question hints at too: > val numWorkers = 10 > val data = sc.textFile("somedirectory/data.csv", numWorkers) This will work, but not q
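(Not the continuation of the message, just an illustrative sketch: the second argument to textFile is a lower bound on the number of partitions, and an already-loaded RDD can be re-split into smaller pieces. The path and the counts are made up, and an existing SparkContext `sc` is assumed.)

  // More (and therefore smaller) partitions keep each task's working set small.
  val data = sc.textFile("somedirectory/data.csv", 128)   // ask for at least 128 partitions
  val finer = data.repartition(512)                       // re-split an existing RDD further
  println(finer.partitions.size)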

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
Spark is running fine, but I get this message. Does this mean that my data is just too big? 14/04/22 17:06:20 ERROR TaskSchedulerImpl: Lost executor 2 on WORKER#2: OutOfMemoryError 14/04/22 17:06:20 ERROR TaskSetManager: Task 550.0:2 failed 4 times; aborting job org.apache.spark.SparkException

Re: Bind exception while running FlumeEventCount

2014-04-22 Thread Tathagata Das
Hello Neha, This is the result of a known bug in 0.9. Can you try running the latest Spark master branch to see if this problem is resolved? TD On Tue, Apr 22, 2014 at 2:48 AM, NehaS Singh wrote: > Hi, > > I have installed > spark-0.9.0-incubating-bin-cdh4 and

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-22 Thread Tathagata Das
As I said before, starting two SSCs in the same JVM is not supported, neither in local mode nor in cluster mode. You have two choices. 1. Run one SSC in one JVM: This will use a single Spark cluster (as it will use a single SparkContext) for the computation. Therefore they can share the cluster's res

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
Wow! It worked! Thank you so much! So now, all I need to do is put the number of workers that I want to use when I read the data, right? e.g. val numWorkers = 10 val data = sc.textFile("somedirectory/data.csv", numWorkers)

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Praveen R
Could you try setting the MASTER variable in spark-env.sh? export MASTER=spark://:7077 For starting the standalone cluster, ./sbin/start-all.sh should work as long as you have passwordless access to all machines. Any error here? On Tue, Apr 22, 2014 at 10:10 PM, jaeholee wrote: > No, I am not u

Running large join in ALS example through PySpark

2014-04-22 Thread Laird, Benjamin
Hello all - I'm running the ALS/Collaborative Filtering code through PySpark on Spark 0.9.0. (http://spark.apache.org/docs/0.9.0/mllib-guide.html#using-mllib-in-python) My data file has about 27M tuples (User, Item, Rating). ALS.train(ratings,1,30) runs on my 3-node cluster (24 cores, 60GB RAM)
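For reference, the same call through the Scala MLlib API looks roughly like this (a sketch only; the path, file format, and parsing are assumptions, and only the ALS.train(ratings, 1, 30) call mirrors the message; an existing SparkContext `sc` is assumed):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // Parse (user, item, rating) triples and factorize with rank 1 over 30 iterations.
  val ratings = sc.textFile("hdfs:///ratings.csv")
    .map(_.split(","))
    .map { case Array(user, item, rating) =>
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

  val model = ALS.train(ratings, 1, 30)
  println(model.userFeatures.count())   // number of user factor vectors learned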

Re: java.net.SocketException on reduceByKey() in pyspark

2014-04-22 Thread benlaird
I was getting this error after upgrading my nodes to Python 2.7. I suspected the problem was due to conflicting Python versions, but my 2.7 install seemed correct on my nodes. I set the PYSPARK_PYTHON variable to my 2.7 install (as I still had 2.6 installed and linked to the 'python' executable, w

Re: Some questions in using Graphx

2014-04-22 Thread Ankur Dave
These are excellent questions. Answers below: On Tue, Apr 22, 2014 at 8:20 AM, wu zeming wrote: > 1. Why do some transformations like partitionBy, mapVertices cache the new > graph and some like outerJoinVertices not? In general, we cache RDDs that are used more than once to avoid recomputatio
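That rule of thumb, sketched with plain RDDs (the path and names are illustrative; an existing SparkContext `sc` is assumed):

  // Cache a dataset that is referenced by more than one action, so the second
  // action reads the cached partitions instead of recomputing the whole lineage.
  val pairs = sc.textFile("hdfs:///graph/edges.tsv")
    .map(_.split("\t"))
    .map(x => (x(0).toLong, x(1).toLong))
    .cache()

  println(pairs.count())               // first action materializes and caches the partitions
  println(pairs.distinct().count())    // second action reuses the cache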

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
No, I am not using AWS. I am using one of the national labs' clusters. But as I mentioned, I am pretty new to computer science, so I might not be answering your question right... but 7077 is accessible. Maybe I got it wrong from the get-go? I will just write down what I did... Basically I have

Re: Question about running spark on yarn

2014-04-22 Thread Sandy Ryza
Hi Gordon, We recently handled this in SPARK-1064. As of 1.0.0, you'll be able to pass -Phadoop-provided to Maven and avoid including Hadoop and its dependencies in the assembly jar. -Sandy On Tue, Apr 22, 2014 at 2:43 AM, Gordon Wang wrote: > In this page http://spark.apache.org/docs/0.9.0/

internship opportunity

2014-04-22 Thread Tom Vacek
Thomson Reuters is looking for a graduate (or possibly advanced undergraduate) summer intern in Eagan, MN. This is a chance to work on an innovative project exploring how big data sets can be used by professionals such as lawyers, scientists and journalists. If you're subscribed to this mailing li

Some questions in using Graphx

2014-04-22 Thread wu zeming
Hi all, I am using GraphX in spark-0.9.0-incubating. The number of vertices can be 100 million and the number of edges can be 1 billion in our graph. As a result, I must use my limited memory carefully. So I have some questions about the GraphX module. Why do some transformations like partitionBy,

Re: Spark is slow

2014-04-22 Thread Nicholas Chammas
How long are the count() steps taking? And how many partitions are pairs1 and triples initially divided into? You can see this by doing pairs1._jrdd.splits().size(), for example. If you just need to count the number of distinct keys, is it faster if you did the following instead of groupByKey().cou
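The message is cut off before the alternative it suggests, but one common way to count distinct keys without grouping all the values is sketched below (in Scala for brevity, although the thread itself uses PySpark; the dummy `pairs` RDD stands in for the thread's data and an existing SparkContext `sc` is assumed):

  // Counting distinct keys directly shuffles only the keys,
  // whereas groupByKey() shuffles every value for every key first.
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  println(pairs.groupByKey().count())      // 2, after moving all the values
  println(pairs.keys.distinct().count())   // 2, after moving only the keys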

Spark runs applications in an inconsistent way

2014-04-22 Thread Aureliano Buendia
Hi, Sometimes the very same Spark application binary behaves differently with every execution. - The Ganglia profile is different with every execution: sometimes it takes 0.5 TB of memory, the next time it takes 1 TB of memory, the next time it is 0.75 TB... - Spark UI shows

Re: Using google cloud storage for spark big data

2014-04-22 Thread Aureliano Buendia
On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth < andras.nem...@lynxanalytics.com> wrote: > We don't have anything fancy. It's basically some very thin layer of > google specifics on top of a stand alone cluster. We basically created two > disk snapshots, one for the master and one for the workers

Re: stdout in workers

2014-04-22 Thread Daniel Darabos
On Mon, Apr 21, 2014 at 7:59 PM, Jim Carroll wrote: > > I'm experimenting with a few things trying to understand how it's working. > I > took the JavaSparkPi example as a starting point and added a few System.out > lines. > > I added a system.out to the main body of the driver program (not inside

Re: 'Filesystem closed' while running spark job

2014-04-22 Thread Marcin Cylke
On Tue, 22 Apr 2014 12:28:15 +0200 Marcin Cylke wrote: > Hi > > I have a Spark job that reads files from HDFS, does some pretty basic > transformations, then writes it to some other location on hdfs. > > I'm running this job with spark-0.9.1-rc3, on Hadoop Yarn with > Kerberos security enabled.

help me

2014-04-22 Thread Joe L
I got the following performance; is it normal for Spark to be like this? Sometimes Spark switches into node_local mode from process_local and it becomes 10x faster. I am very confused. scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000") scala> f.count() Long = 137805557 took 130.80966161

Re: Spark-ec2 asks for password

2014-04-22 Thread Pierre Borckmans
We’ve been experiencing this as well, and our simple solution is to actually keep trying the ssh connection instead of just waiting: Something like this: def wait_for_ssh_connection(opts, host): u.message("Waiting for ssh connection to host {}".format(host)) connected = False while (conne

'Filesystem closed' while running spark job

2014-04-22 Thread Marcin Cylke
Hi I have a Spark job that reads files from HDFS, does some pretty basic transformations, then writes it to some other location on hdfs. I'm running this job with spark-0.9.1-rc3, on Hadoop Yarn with Kerberos security enabled. One of my approaches to fixing this issue was changing SparkConf, s

Re: Spark running slow for small hadoop files of 10 mb size

2014-04-22 Thread Andre Bois-Crettez
The data partitioning is done by default *according to the number of HDFS blocks* of the source. You can change the partitioning with .repartition, either to increase or decrease the level of parallelism: val recordsRDD = sc.sequenceFile[NullWritable,BytesWritable](FilePath,256) val re
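A sketch of both directions (the path and partition counts are illustrative; an existing SparkContext `sc` is assumed):

  import org.apache.hadoop.io.{BytesWritable, NullWritable}

  val recordsRDD = sc.sequenceFile[NullWritable, BytesWritable]("hdfs:///small-files/part-*", 256)

  val morePartitions  = recordsRDD.repartition(512)   // full shuffle, finer-grained tasks
  val fewerPartitions = recordsRDD.coalesce(16)        // merge partitions without a full shuffle

  println(morePartitions.partitions.size)
  println(fewerPartitions.partitions.size)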

Re: Using google cloud storage for spark big data

2014-04-22 Thread Andras Nemeth
We don't have anything fancy. It's basically some very thin layer of Google specifics on top of a standalone cluster. We basically created two disk snapshots, one for the master and one for the workers. The snapshots contain initialization scripts so that the master/worker daemons are started on b

Bind exception while running FlumeEventCount

2014-04-22 Thread NehaS Singh
Hi, I have installed spark-0.9.0-incubating-bin-cdh4 and I am using Apache Flume for streaming. I have used the streaming.examples.FlumeEventCount. Also I have written an Avro conf file for Flume. When I try to do streaming in Spark and I run the following command it

Question about running spark on yarn

2014-04-22 Thread Gordon Wang
According to this page http://spark.apache.org/docs/0.9.0/running-on-yarn.html, we have to use the Spark assembly to submit Spark apps to a YARN cluster. I checked the assembly jars of Spark; they contain some YARN classes which are added at compile time. The YARN classes are not what I want. My question i

Re: how to save RDD partitions in different folders?

2014-04-22 Thread dmpour23
I am not exactly sure how to use MultipleOutputs in Spark. I have been looking into Apache Crunch; in its guide http://crunch.apache.org/user-guide.html it states that: Multiple outputs: Spark doesn't have a concept of multiple outputs; when you write a data set to disk, the pipeline that creates tha

Re: PySpark still reading only text?

2014-04-22 Thread Bertrand Dechoux
Cool, thanks for the link. Bertrand Dechoux On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath wrote: > Also see: https://github.com/apache/spark/pull/455 > > This will add support for reading sequencefile and other inputformat in > PySpark, as long as the Writables are either simple (primitives,

Efficient Aggregation over DB data

2014-04-22 Thread Sai Prasanna
Hi All, I want to access a particular column of a DB table stored in CSV format and perform some aggregate queries over it. I wrote the following query in Scala as a first step. var add=(x:String)=>x.split("\\s+")(2).toInt var result=List[Int]() input.split("\\n").foreach(x=>result::=add(x
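A sketch of pushing the same per-line work into an RDD instead of a mutable List on the driver (the path is an assumption; the whitespace split and column index simply mirror the snippet above; an existing SparkContext `sc` is assumed):

  // Extract the third whitespace-separated column and aggregate it in parallel.
  val lines  = sc.textFile("hdfs:///path/to/table.csv")
  val column = lines.map(_.split("\\s+")(2).toInt)

  println(column.reduce(_ + _))   // sum; other aggregates, e.g. column.reduce(_ max _), follow the same pattern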

Re: how to solve this problem?

2014-04-22 Thread Ankur Dave
This is a known bug in GraphX, and the fix is in PR #367: https://github.com/apache/spark/pull/367 Applying that PR should solve the problem. Ankur On Mon, Apr 21, 2014 at 8:20 PM, gogototo wrote: > java.util.NoSuchElementException: End of stream >