Check your cluster UI to ensure that workers are registered and have sufficient memory

2014-05-05 Thread Sai Prasanna
I executed the following commands to launch a Spark app in YARN client mode. I have Hadoop 2.3.0, Spark 0.8.1 and Scala 2.9.3: SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly SPARK_YARN_MODE=true \ SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar

unsubscribe

2014-05-05 Thread Konstantin Kudryavtsev
unsubscribe Thank you, Konstantin Kudryavtsev

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
How can I do the iteration? Because persist is lazy and recomputation may be required, the whole lineage of the iteration will be kept, so can memory overflow really be avoided? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p53

unsubscribe

2014-05-05 Thread Shubhabrata Roy
unsubscribe

Re: what`s the meaning of primitive in "gradient descent primitive"?

2014-05-05 Thread Sean Owen
I understood it to mean module, unit of functionality, subroutine. On May 5, 2014 3:50 AM, "phoenix bai" wrote: > > Hi all, > > I am reading the doc of spark ( http://spark.apache.org/docs/0.9.0/mllib-guide.html#gradient-descent-primitive). I am trying to translate the doc into Chinese, and there

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
.set("spark.cleaner.ttl", "120") drops broadcast_0 which makes a Exception below. It is strange, because broadcast_0 is no need, and I have broadcast_3 instead, and recent RDD is persisted, there is no need for recomputing... what is the problem? need help. ~~~ 14/05/05 17:03:12 INFO stor

Re: Is any idea on architecture based on Spark + Spray + Akka

2014-05-05 Thread Quintus Zhou
Hi, Yi Your project sounds interesting to me. I'm also working in the 3G/4G communication domain; besides, I've also done a tiny project based on Hadoop which analyzes execution logs. Recently, I planned to pick it up again. So, if you don't mind, may I know the introduction of your log analyzing

unsubscribe

2014-05-05 Thread Chhaya Vishwakarma
unsubscribe Regards, Chhaya Vishwakarma The contents of this e-mail and any attachment(s) may contain confidential or privileged information for the intended recipient(s). Unintended recipients are prohibited from taking action on the basis of information in th

java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-05 Thread Francis . Hu
Hi, All. We run a Spark cluster with three workers. We created a Spark Streaming application, then ran the Spark project using the command below: shell> sbt run spark://192.168.219.129:7077 tcp://192.168.20.118:5556 foo We looked at the web UI of the workers; jobs failed without any error or in

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Using checkpoint. It removes dependencies :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5368.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for broadcast cleaning. Maybe broadcasts could be removed automatically when there are no more dependencies. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5369.html Se

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Cheng Lian
Have you tried Broadcast.unpersist()? On Mon, May 5, 2014 at 6:34 PM, Earthson wrote: > RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for > broadcast > cleaning. May be it could be removed automatically when no dependences. > > > > -- > View this message in context: > http://a
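For anyone following this thread, a minimal sketch of the rebroadcast-and-unpersist pattern being suggested here; the variable names and update logic are illustrative rather than taken from the original code, and it assumes a Spark version (1.0+) where Broadcast.unpersist is available:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RebroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rebroadcast-sketch"))
    val data = sc.parallelize(1 to 1000000).cache()
    var weights = Array.fill(10)(1.0)                // driver-side state updated each step

    for (step <- 1 to 5) {
      val bc = sc.broadcast(weights)                 // broadcast the current state
      val total = data.map(x => x * bc.value(x % 10)).sum()
      weights = weights.map(_ * 0.9 + total / 1e9)   // illustrative update
      bc.unpersist(blocking = true)                  // drop the now-stale broadcast blocks
    }
    sc.stop()
  }
}
```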

Spark Streaming and JMS

2014-05-05 Thread Patrick McGloin
Hi all, Is there a "best practice" for subscribing to JMS with Spark Streaming? I have searched but not found anything conclusive. In the absence of a standard practice the solution I was thinking of was to use Akka + Camel (akka.camel.Consumer) to create a subscription for a Spark Streaming Cus

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-05 Thread Nan Zhu
Ah, I think this should be fixed in 0.9.1? Did you see the exception thrown on the worker side? Best, -- Nan Zhu On Sunday, May 4, 2014 at 10:15 PM, Cheney Sun wrote: > Hi Nan, > > Have you found a way to fix the issue? Now I run into the same problem with > version 0.9.1. > > Thank

Re: Shark on cloudera CDH5 error

2014-05-05 Thread manas Kar
No replies yet. Guess everyone who had this problem knew the obvious reason why the error occurred. It took me some time to figure out the workaround though. It seems shark depends on /var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core.jar for client server com

Re: configure spark history server for running on Yarn

2014-05-05 Thread Tom Graves
Since 1.0 is still in development you can pick up the latest docs in git: https://github.com/apache/spark/tree/branch-1.0/docs I didn't see anywhere that you said you started the Spark history server. There are multiple things that need to happen for the Spark history server to work: 1) config

Spark GCE Script

2014-05-05 Thread Akhil Das
Hi Sparkers, We have created a quick spark_gce script which can launch a spark cluster in the Google Cloud. I'm sharing it because it might be helpful for someone using the Google Cloud for deployment rather than AWS. Here's the link to the script https://github.com/sigmoidanalytics/spark_gce F

Re: Using google cloud storage for spark big data

2014-05-05 Thread Akhil Das
Hi Aureliano, You might want to check this script out, https://github.com/sigmoidanalytics/spark_gce Let me know if you need any help around that. Thanks Best Regards On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia wrote: > > > > On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth < > andras.ne

Re: performance improvement on second operation...without caching?

2014-05-05 Thread Ethan Jewett
Thanks Patrick and Matei for the clarification. I actually have to update some code now, as I was apparently relying on the fact that the output files are being re-used. Explains some edge-case behavior that I've seen. For me, at least, I read the guide, did some tests on fairly extensive RDD depe

Caused by: java.lang.OutOfMemoryError: unable to create new native thread

2014-05-05 Thread Soumya Simanta
I just upgraded my Spark version to 1.0.0_SNAPSHOT. commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3 Author: witgo Date: Fri May 2 12:40:27 2014 -0700 I'm running a standalone cluster with 3 workers. - Workers: 3 - Cores: 48 Total, 0 Used - Memory: 469.8 GB Total, 0.0 B Used

RE: another updateStateByKey question - updated w possible Spark bug

2014-05-05 Thread Adrian Mocanu
I’ve encountered this issue again and am able to reproduce it about 10% of the time. 1. Here is the input: RDD[ (a, 126232566, 1), (a, 126232566, 2) ] RDD[ (a, 126232566, 1), (a, 126232566, 3) ] RDD[ (a, 126232566, 3) ] RDD[ (a, 126232566, 4) ] RDD[ (a, 126232566, 2) ]

RE: another updateStateByKey question - updated w possible Spark bug

2014-05-05 Thread Adrian Mocanu
Forgot to mention my batch interval is 1 second: val ssc = new StreamingContext(conf, Seconds(1)) hence the Thread.sleep(1100) From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: May-05-14 12:06 PM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: RE: another updateSt
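For context, a minimal sketch of an updateStateByKey job with a 1-second batch interval like the one under discussion; the input source and state function are illustrative, not Adrian's actual code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningSumSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("running-sum").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batches, as in the post
    ssc.checkpoint("/tmp/streaming-checkpoint")           // updateStateByKey requires a checkpoint dir

    // Keep a running sum of the values seen so far for each key.
    val updateSum: (Seq[Int], Option[Int]) => Option[Int] =
      (newValues, state) => Some(newValues.sum + state.getOrElse(0))

    val lines  = ssc.socketTextStream("localhost", 9999)  // "key,value" pairs, one per line
    val pairs  = lines.map(_.split(",")).map(a => (a(0), a(1).toInt))
    val totals = pairs.updateStateByKey[Int](updateSum)
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```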

Re: Caused by: java.lang.OutOfMemoryError: unable to create new native thread

2014-05-05 Thread ssimanta
Thanks Wayne. Maybe that is what is happening. My current limits are: $ ps -u ssimanta -L | wc -l (with Spark and spark-shell not running) 790 $ ulimit -u 1024 Once I start Spark, my thread count increases: $ ps -u ssimanta -L | wc -l (with Spark and spark-shell running) 982 Any recommend

Comprehensive Port Configuration reference?

2014-05-05 Thread Scott Clasen
Is there somewhere documented how one would go about configuring every open port a Spark application needs? This seems like one of the main things that make running Spark hard in places like EC2 where you aren't using the canned Spark scripts. Starting an app, it looks like you'll see ports open for
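For readers looking for a starting point, a hedged sketch of pinning some of these ports through SparkConf; which *.port properties are available depends on the Spark release, so treat the names below as examples to verify against the configuration docs for your version:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pin down some of the ports Spark would otherwise pick at random so they can be
// opened in a security group. The host and port values here are illustrative.
val conf = new SparkConf()
  .setAppName("fixed-ports")
  .set("spark.driver.host", "10.0.0.5")   // address executors use to reach the driver (illustrative)
  .set("spark.driver.port", "7001")       // driver <-> executor communication
  .set("spark.ui.port", "4040")           // application web UI
// Later releases expose further *.port settings (file server, block manager, broadcast, ...).
val sc = new SparkContext(conf)
```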

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Yes, I've tried. The problem is that a new broadcast object is generated at every step, until all of the memory is eaten up. I solved it by using RDD.checkpoint to remove dependencies on old broadcast objects, and cleaner.ttl to clean up these broadcast objects automatically. If there's a more elegant way to
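A rough sketch of the pattern described here: re-broadcast each step and checkpoint periodically so the lineage (and its references to stale broadcasts) can be dropped. The data and update logic are illustrative, not from the original code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("iterate-with-broadcast"))
sc.setCheckpointDir("hdfs:///tmp/checkpoints")      // checkpoints cut the lineage here

var rdd = sc.parallelize(1 to 1000000).map(_.toDouble)
var weights = Array.fill(10)(1.0)

for (step <- 1 to 20) {
  val bc = sc.broadcast(weights)                    // a fresh broadcast every iteration
  val next = rdd.map(x => x * bc.value(step % 10)).persist(StorageLevel.MEMORY_AND_DISK)
  if (step % 5 == 0) {
    next.checkpoint()                               // truncate the lineage every few steps
    next.count()                                    // force the checkpoint to be written
  }
  weights = weights.map(_ + next.sum() / 1e9)       // illustrative driver-side update
  rdd.unpersist(blocking = false)                   // release the previous iteration's blocks
  rdd = next
}
```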

Re: Spark GCE Script

2014-05-05 Thread Matei Zaharia
Very cool! Have you thought about sending this as a pull request? We’d be happy to maintain it inside Spark, though it might be interesting to find a single Python package that can manage clusters across both EC2 and GCE. Matei On May 5, 2014, at 7:18 AM, Akhil Das wrote: > Hi Sparkers, > >

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-05 Thread Jacob Eisinger
Howdy Andrew, I agree; the subnet idea is a good one... unfortunately, it doesn't really help to secure the network. You mentioned that the drivers need to talk to the workers. I think it is slightly broader - all of the workers and the driver/shell need to be addressable from/to each other on

Re: CDH 5.0 and Spark 0.9.0

2014-05-05 Thread Paul Schooss
Hello Sean, Thanks a bunch. I am not currently working in HA mode. The configuration is identical to our CDH4 setup, which works perfectly fine. It's really strange how only Spark breaks with this enabled. On Thu, May 1, 2014 at 3:06 AM, Sean Owen wrote: > This codec does require native libraries to be

Problem with sharing class across worker nodes using spark-shell on Spark 1.0.0

2014-05-05 Thread Soumya Simanta
Hi, I'm trying to run a simple Spark job that uses a 3rd-party class (in this case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT. I'm starting my bin/spark-shell with the following command: ./spark-shell --driver-class-path "$LIBPATH/jodatime2.3/joda-convert-1.2.jar:$LIBPATH/j
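One thing worth trying (a hedged suggestion, not from the original post): --driver-class-path only affects the driver JVM, so third-party jars also have to reach the executors, for example via SparkContext.addJar. The jar paths below are illustrative:

```scala
// In spark-shell the context already exists as `sc`.
sc.addJar("/path/to/twitter4j-core.jar")     // fetched by every executor and added to task classloaders
sc.addJar("/path/to/joda-convert-1.2.jar")
```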

Re: Spark GCE Script

2014-05-05 Thread Nicholas Chammas
I second this motion. :) A unified "cloud deployment" tool would be absolutely great. On Mon, May 5, 2014 at 1:34 PM, Matei Zaharia wrote: > Very cool! Have you thought about sending this as a pull request? We’d be > happy to maintain it inside Spark, though it might be interesting to find a >

Re: Spark GCE Script

2014-05-05 Thread François Le lay
Has anyone considered using jclouds tooling to support multiple cloud providers? Maybe using Pallet? François > On May 5, 2014, at 3:22 PM, Nicholas Chammas > wrote: > > I second this motion. :) > > A unified "cloud deployment" tool would be absolutely great. > > > On Mon, May 5, 2014 at 1

Re: sbt run with spark.ContextCleaner ERROR

2014-05-05 Thread Tathagata Das
Well, there have been more bug fixes added since RC3 as well. So it's best to try out the current master and let us know whether you still get the scary logs. TD On Sun, May 4, 2014 at 3:52 AM, wxhsdp wrote: > Hi, TD > > actually, i'am not very clear with my spark version. i check out from > https:/

Re: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-05 Thread Tathagata Das
Do those files actually exist? Those stdout/stderr should have the output of the Spark executors running on the workers, and it's weird that they don't exist. Could be a permission issue - maybe the directories/files are not being generated because they cannot be? TD On Mon, May 5, 2014 at 3:06 AM, Fran

Re: performance improvement on second operation...without caching?

2014-05-05 Thread Diana Carroll
Ethan, you're not the only one, which is why I was asking about this! :-) Matei, thanks for your response. Your answer explains the performance jump in my code, but shows I've missed something key in my understanding of Spark! I was not aware until just now that map output was saved to disk (othe

Re: spark streaming kafka output

2014-05-05 Thread Tathagata Das
There is no built-in code in Spark Streaming to output to Kafka yet. However, I have heard people have used Twitter Storehaus with foreachRDD, and Storehaus has a Kafka output. Something that you might look into. TD On Sun, May 4, 2014 at 11:45 PM, Weide Zhang wrote: > Hi , > > Is there any cod
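To make the foreachRDD route concrete, a rough sketch using the plain Kafka 0.8 producer instead of Storehaus; the broker list, topic, and serializer settings are illustrative assumptions:

```scala
import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.dstream.DStream

def writeToKafka(stream: DStream[String], brokers: String, topic: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One producer per partition per batch; creating it on the executor avoids
      // trying to serialize a producer object from the driver.
      val props = new Properties()
      props.put("metadata.broker.list", brokers)
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      val producer = new Producer[String, String](new ProducerConfig(props))
      records.foreach(r => producer.send(new KeyedMessage[String, String](topic, r)))
      producer.close()
    }
  }
}
```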

Re: spark streaming question

2014-05-05 Thread Tathagata Das
One main reason why Spark Streaming can achieve higher throughput than Storm is that Spark Streaming operates in coarser-grained batches - second-scale massive batches - which reduce per-tuple overheads in shuffles and other kinds of data movement, etc. Note that it is also true that th

Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-05 Thread Gerard Maas
Hi all, I'm currently working on creating a set of docker images to facilitate local development with Spark/streaming on Mesos (+zk, hdfs, kafka) After solving the initial hurdles to get things working together in docker containers, now everything seems to start-up correctly and the mesos UI show

Re: Spark Streaming and JMS

2014-05-05 Thread Tathagata Das
A few high-level suggestions. 1. I recommend using the new Receiver API in almost-released Spark 1.0 (see branch-1.0 / master branch on github). It's a slightly better version of the earlier NetworkReceiver, as it hides away the block generator (which previously had to be unnecessarily manually started and stop
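To illustrate the first suggestion, here is a rough sketch of a JMS receiver written against the Spark 1.0 Receiver API, using ActiveMQ purely as an example JMS provider; connection details and error handling are omitted:

```scala
import javax.jms.{Message, MessageListener, Session, TextMessage}
import org.apache.activemq.ActiveMQConnectionFactory
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class JmsReceiver(brokerUrl: String, queueName: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  @transient private var connection: javax.jms.Connection = _

  def onStart(): Unit = {
    connection = new ActiveMQConnectionFactory(brokerUrl).createConnection()
    val session  = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createQueue(queueName))
    consumer.setMessageListener(new MessageListener {
      def onMessage(msg: Message): Unit = msg match {
        case t: TextMessage => store(t.getText)   // hand the payload to Spark Streaming
        case _              =>                    // ignore non-text messages in this sketch
      }
    })
    connection.start()
  }

  def onStop(): Unit = if (connection != null) connection.close()
}

// Usage: ssc.receiverStream(new JmsReceiver("tcp://localhost:61616", "events"))
```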

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-05 Thread Benjamin
Hi, Before considering running on Mesos, did you try to submit the application on Spark deployed without Mesos on Docker containers? Currently investigating this idea to deploy quickly a complete set of clusters with Docker, I'm interested in your findings on sharing the settings of Kafka and Zo

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-05 Thread Gerard Maas
Hi Benjamin, Yes, we initially used a modified version of the AmpLabs docker scripts [1]. The amplab docker images are a good starting point. One of the biggest hurdles has been HDFS, which requires reverse-DNS and I didn't want to go the dnsmasq route to keep the containers relatively simple to u

Increase Stack Size Workers

2014-05-05 Thread Andrea Esposito
Hi there, I'm doing an iterative algorithm and sometimes I end up with a StackOverflowError, no matter whether I do checkpoints or not. I still don't understand why this is happening, but I figured out that increasing the stack size is a workaround. Developing using "local[n]", so the local mode i

Spark 0.9.1 - saveAsSequenceFile and large RDD

2014-05-05 Thread Allen Lee
Hi, Fairly new to Spark. I'm using Spark's saveAsSequenceFile() to write large Sequence Files to HDFS. The Sequence Files need to be large to be efficiently accessed in HDFS, preferably larger than Hadoop's block size, 64MB. The task works for files smaller than 64 MiB (with a warning for seque
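One common way to influence output file size (a hedged sketch, not from this thread) is to repartition to a target number of files before saving, since each partition becomes one output file; the size estimate and paths below are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("seqfile-output"))
val records = sc.textFile("hdfs:///input/events").map(line => (line.take(8), line))

// Aim for roughly 128 MB per output file: estimate the total size, pick the partition count.
val targetBytesPerFile  = 128L * 1024 * 1024
val estimatedTotalBytes = 50L * 1024 * 1024 * 1024               // illustrative estimate
val numFiles = math.max(1, (estimatedTotalBytes / targetBytesPerFile).toInt)

records
  .repartition(numFiles)                                          // one output file per partition
  .saveAsSequenceFile("hdfs:///output/events-seq")
```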

Re: Incredible slow iterative computation

2014-05-05 Thread Andrea Esposito
Update: checkpointing doesn't seem to take effect. I checked with the "isCheckpointed" method but it always returns false. ??? 2014-05-05 23:14 GMT+02:00 Andrea Esposito : > Checkpointing doesn't help, it seems. I do it at each iteration/superstep. > > Looking deeply, the RDDs are recomputed just a few times at

Re: Incredible slow iterative computation

2014-05-05 Thread Matei Zaharia
It may be slow because of serialization (have you tried Kryo there?) or just because at some point the data starts to be on disk. Try profiling the tasks while it’s running (e.g. just use jstack to see what they’re doing) and definitely try Kryo if you’re currently using Java Serialization. Kryo
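For reference, a minimal sketch of enabling Kryo in this generation of Spark via a registrator; the registered class is a stand-in for whatever types the iterative job actually shuffles:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

case class VertexState(id: Long, rank: Double)        // stand-in for the job's shuffled types

class MyRegistrator extends KryoRegistrator {
  def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[VertexState])               // avoids writing class names per record
  }
}

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")     // use the fully-qualified name in a real package
val sc = new SparkContext(conf)
```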

How can adding a random count() change the behavior of my program?

2014-05-05 Thread Nicholas Chammas
I'm running into something very strange today. I'm getting an error on the following innocuous operations. a = sc.textFile('s3n://...') a = a.repartition(8) a = a.map(...) c = a.countByKey() # ERRORs out on this action. See below for traceback. [1] If I add a count() right after the repartition(), t

Re: Incredible slow iterative computation

2014-05-05 Thread Earthson
checkpoint seems to just add a checkpoint mark? You need an action after marking it. I have tried it with success :)
newRdd = oldRdd.map(myFun).persist(myStorageLevel)
newRdd.checkpoint // mark for checkpointing
newRdd.foreach(_ => {}) // Force evaluation
newRdd.isCheckpointed // true here
oldRdd.unpersist(true)
If you have

答复: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-05 Thread Francis . Hu
In fact, the file does not exist, and there is no permission issue. francis@ubuntu-4:/test/spark-0.9.1$ ll work/app-20140505053550-/ total 24 drwxrwxr-x 6 francis francis 4096 May 5 05:35 ./ drwxrwxr-x 11 francis francis 4096 May 5 06:18 ../ drwxrwxr-x 2 francis francis 4096 May 5 05:35 2/

Re: 答复: java.io.FileNotFoundException: /test/spark-0.9.1/work/app-20140505053550-0000/2/stdout (No such file or directory)

2014-05-05 Thread Tathagata Das
Can you check the Spark worker logs on that machine, either from the web UI or directly? It should be /test/spark-XXX/logs/ See if that has any error. If there is no permission issue, I am not sure why stdout and stderr are not being generated. TD On Mon, May 5, 2014 at 7:13 PM, Francis.Hu wrote: >

How to use spark-submit

2014-05-05 Thread Stephen Boesch
I have a Spark Streaming application that uses the external streaming modules (e.g. kafka, mqtt, ..) as well. It is not clear how to properly invoke the spark-submit script: what are the --driver-class-path and/or -Dspark.executor.extraClassPath parameters required? For reference, the following

details about event log

2014-05-05 Thread wxhsdp
Hi, I'm looking at the event log and I'm a little confused about some metrics. Here's the info for one task: "Launch Time":1399336904603, "Finish Time":1399336906465, "Executor Run Time":1781, "Shuffle Read Metrics":{"Shuffle Finish Time":1399336906027, "Fetch Wait Time":0}, "Shuffle Write Metrics":{"Shuf

RE: "sbt/sbt run" command returns a JVM problem

2014-05-05 Thread Carter
Hi, I still have over 1 GB left for my program. Date: Sun, 4 May 2014 19:14:30 -0700 From: ml-node+s1001560n5340...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: "sbt/sbt run" command returns a JVM problem The total memory of your machine is 2 GB, right? Then how much memory is left free?

Re: How to use spark-submit

2014-05-05 Thread Soumya Simanta
Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks > On May 5, 2014, at 10:24 PM, Stephen Boesch wrote: > > > I ha

Can I share RDD between a pyspark and spark API

2014-05-05 Thread manas Kar
Hi experts. I have some pre-built Python parsers that I am planning to use, just because I don't want to write them again in Scala. However, after the data is parsed I would like to take the RDD and use it in a Scala program. (Yes, I like Scala more than Python and am more comfortable in Scala :) In d
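As far as I know there is no supported way in these versions to hand an in-memory RDD from PySpark to a Scala program directly; the usual workaround is to have the Python job persist its parsed output (e.g. with saveAsTextFile) and read it back from Scala. A sketch of the Scala side, with an assumed tab-separated layout and illustrative paths:

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Record(id: String, value: Double)          // field layout is an assumption

val sc = new SparkContext(new SparkConf().setAppName("read-parsed-output"))

// The Python job is assumed to have written tab-separated lines with
// rdd.saveAsTextFile("hdfs:///parsed/output"); rebuild typed records here.
val records = sc.textFile("hdfs:///parsed/output")
  .map(_.split("\t"))
  .map(f => Record(f(0), f(1).toDouble))
```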

about broadcast

2014-05-05 Thread randylu
In my code, there are two broadcast variables. Sometimes reading the small one took more time than the big one, which is strange! The log on the slave node is as follows: Block broadcast_2 stored as values to memory (estimated size 4.0 KB, free 17.2 GB) Reading broadcast variable 2 took 9.998537123 s Blo

Re: about broadcast

2014-05-05 Thread randylu
Additionally, reading the big broadcast variable always took about 2s. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-broadcast-tp5416p5417.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Better option to use Querying in Spark

2014-05-05 Thread prabeesh k
Hi, I have seen three different ways to query data from Spark: 1. Default SQL support (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/sql/examples/HiveFromSpark.scala) 2. Shark 3. BlinkDB I would like to know which one is more efficient. Regard

Re: sbt run with spark.ContextCleaner ERROR

2014-05-05 Thread wxhsdp
Hi TD, I tried v1.0.0-rc3 and still got the error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304p5421.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Better option to use Querying in Spark

2014-05-05 Thread Mayur Rustagi
All three have different use cases. If you are looking for more of a warehouse, you are better off with Shark. SparkSQL is a way to query regular data in SQL-like syntax, leveraging a columnar store. BlinkDB is an experiment, meant to integrate with Shark in the long term; it is not meant for production useca
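To make the SparkSQL option concrete, a small sketch against the Spark 1.0 SQL API; the data layout, table name, and query are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Sale(product: String, amount: Double)      // illustrative schema

val sc = new SparkContext(new SparkConf().setAppName("sparksql-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext._                                    // implicit conversion of case-class RDDs to SchemaRDD

val sales = sc.textFile("hdfs:///data/sales.csv")
  .map(_.split(","))
  .map(f => Sale(f(0), f(1).toDouble))

sales.registerAsTable("sales")                         // Spark 1.0 API (later renamed registerTempTable)
val totals = sql("SELECT product, SUM(amount) FROM sales GROUP BY product")
totals.collect().foreach(println)
```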

Re: Better option to use Querying in Spark

2014-05-05 Thread prabeesh k
Thank you for your prompt reply. Regards, prabeesh On Tue, May 6, 2014 at 11:44 AM, Mayur Rustagi wrote: > All three have different usecases. If you are looking for more of a > warehouse you are better off with Shark. > SparkSQL is a way to query regular data in sql like syntax leveraging > col

Re: Increase Stack Size Workers

2014-05-05 Thread Matei Zaharia
Add export SPARK_JAVA_OPTS="-Xss16m" to conf/spark-env.sh. Then it should apply to the executor. Matei On May 5, 2014, at 2:20 PM, Andrea Esposito wrote: > Hi there, > > i'm doing an iterative algorithm and sometimes i ended up with > StackOverflowError, doesn't matter if i do checkpoints o
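As an alternative (in Spark 1.0+), a hedged sketch of passing the same stack-size option per application through the executor Java options rather than spark-env.sh:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("deep-recursion")
  .set("spark.executor.extraJavaOptions", "-Xss16m")   // larger stack for each executor JVM
val sc = new SparkContext(conf)
```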

Re: "sbt/sbt run" command returns a JVM problem

2014-05-05 Thread Akhil Das
Hi Carter, Do an export JAVA_OPTS="-Xmx2g" before running sbt/sbt run. That will solve your problem. Thanks Best Regards On Tue, May 6, 2014 at 8:02 AM, Carter wrote: > hi I still have over 1g left for my program. > > -- > Date: Sun, 4 May 2014 19:14:30 -0700 > From