I executed the following commands to launch a Spark app in YARN client
mode. I have Hadoop 2.3.0, Spark 0.8.1 and Scala 2.9.3:
SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
SPARK_YARN_MODE=true \
SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
unsubscribe
Thank you,
Konstantin Kudryavtsev
How can I do iteration? Because persist is lazy and recomputation may be
required, the whole lineage of the iteration will be kept, so how can a
memory overflow be avoided?
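One pattern that keeps the lineage bounded is to checkpoint every few iterations and unpersist the previous step. A rough sketch, assuming sc, initialRdd, numIterations and a per-iteration transformation runStep already exist; the checkpoint directory and the interval of 10 are only illustrative:

sc.setCheckpointDir("hdfs:///tmp/checkpoints") // illustrative checkpoint location

var current = initialRdd
for (i <- 1 to numIterations) {
  val next = runStep(current).persist() // runStep = one iteration's transformations
  if (i % 10 == 0) next.checkpoint()    // cut the lineage so old steps can be dropped
  next.count()                          // force evaluation; persist/checkpoint happen here
  current.unpersist(true)               // release the previous iteration's cached blocks
  current = next
}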
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p53
unsubscribe
I understood it to mean module, unit of functionality, subroutine.
On May 5, 2014 3:50 AM, "phoenix bai" wrote:
>
> Hi all,
>
> I am reading the doc of spark (
http://spark.apache.org/docs/0.9.0/mllib-guide.html#gradient-descent-primitive).
I am trying to translate the doc into Chinese, and there
.set("spark.cleaner.ttl", "120") drops broadcast_0 which makes a Exception
below. It is strange, because broadcast_0 is no need, and I have broadcast_3
instead, and recent RDD is persisted, there is no need for recomputing...
what is the problem? need help.
~~~
14/05/05 17:03:12 INFO stor
~~~
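For reference, this is roughly how the TTL cleaner is usually switched on; the app name and the 120-second value here are just illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("IterationWithBroadcast") // illustrative
  .set("spark.cleaner.ttl", "120")      // metadata older than 120 s becomes eligible for cleanup
val sc = new SparkContext(conf)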
Hi, Yi
Your project sounds interesting to me. I'm also working in the 3G/4G communication
domain; besides, I've also done a tiny project based on Hadoop, which analyzes
execution logs. Recently, I planned to pick it up again. So, if you don't
mind, may I know the introduction of your log analyzing
unsubscribe
Regards,
Chhaya Vishwakarma
The contents of this e-mail and any attachment(s) may contain confidential or
privileged information for the intended recipient(s). Unintended recipients are
prohibited from taking action on the basis of information in th
Hi, All
We run a Spark cluster with three workers.
We created a Spark Streaming application,
then ran the project using the command below:
shell> sbt run spark://192.168.219.129:7077 tcp://192.168.20.118:5556 foo
We looked at the web UI of the workers; the jobs failed without any error or in
Use checkpoint. It removes the dependencies :)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
RDD.checkpoint works fine, but spark.cleaner.ttl is really ugly for broadcast
cleaning. Maybe a broadcast could be removed automatically when nothing depends on it.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5369.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Have you tried Broadcast.unpersist()?
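For context, a rough sketch of releasing each step's broadcast explicitly once it is done, assuming a Spark version where Broadcast.unpersist is available; data, gradientContribution and addVectors are placeholders, not anything from the original thread:

var weights: Array[Double] = initialWeights
for (i <- 1 to numIterations) {
  val bcWeights = sc.broadcast(weights)                // ship the current weights to executors
  weights = data.map(p => gradientContribution(p, bcWeights.value)) // placeholder per-point work
                .reduce(addVectors)                    // placeholder aggregation
  bcWeights.unpersist()                                // drop the old broadcast blocks explicitly
}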
On Mon, May 5, 2014 at 6:34 PM, Earthson wrote:
> RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for
> broadcast
> cleaning. May be it could be removed automatically when no dependences.
>
>
>
> --
> View this message in context:
> http://a
Hi all,
Is there a "best practice" for subscribing to JMS with Spark Streaming? I
have searched but not found anything conclusive.
In the absence of a standard practice, the solution I was thinking of was to
use Akka + Camel (akka.camel.Consumer) to create a subscription for a Spark
Streaming Cus
Ah, I think this should be fixed in 0.9.1?
Did you see whether the exception is thrown on the worker side?
Best,
--
Nan Zhu
On Sunday, May 4, 2014 at 10:15 PM, Cheney Sun wrote:
> Hi Nan,
>
> Have you found a way to fix the issue? Now I run into the same problem with
> version 0.9.1.
>
> Thank
No replies yet. I guess everyone who had this problem knew the obvious reason
why the error occurred.
It took me some time to figure out the workaround, though.
It seems Shark depends on
/var/lib/spark/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core.jar
for client server com
Since 1.0 is still in development, you can pick up the latest docs in git:
https://github.com/apache/spark/tree/branch-1.0/docs
I didn't see anywhere that you said you started the Spark history server.
There are multiple things that need to happen for the Spark history server to
work.
1) config
Hi Sparkers,
We have created a quick spark_gce script which can launch a spark cluster
in the Google Cloud. I'm sharing it because it might be helpful for someone
using the Google Cloud for deployment rather than AWS.
Here's the link to the script
https://github.com/sigmoidanalytics/spark_gce
F
Hi Aureliano,
You might want to check this script out,
https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help around that.
Thanks
Best Regards
On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia wrote:
>
>
>
> On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth <
> andras.ne
Thanks Patrick and Matei for the clarification. I actually have to update
some code now, as I was apparently relying on the fact that the output
files are being re-used. Explains some edge-case behavior that I've seen.
For me, at least, I read the guide, did some tests on fairly extensive RDD
depe
I just upgraded my Spark version to 1.0.0_SNAPSHOT.
commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3
Author: witgo
Date: Fri May 2 12:40:27 2014 -0700
I'm running a standalone cluster with 3 workers.
- *Workers:* 3
- *Cores:* 48 Total, 0 Used
- *Memory:* 469.8 GB Total, 0.0 B Used
I’ve encountered this issue again and am able to reproduce it about 10% of the
time.
1. Here is the input:
RDD[ (a, 126232566, 1), (a, 126232566, 2) ]
RDD[ (a, 126232566, 1), (a, 126232566, 3) ]
RDD[ (a, 126232566, 3) ]
RDD[ (a, 126232566, 4) ]
RDD[ (a, 126232566, 2) ]
Forgot to mention my batch interval is 1 second:
val ssc = new StreamingContext(conf, Seconds(1))
hence the Thread.sleep(1100)
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: May-05-14 12:06 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: RE: another updateSt
Thanks, Wyane.
Maybe that's what is happening. My current limits are:
$ ps -u ssimanta -L | wc -l (with Spark and spark-shell *not* running)
790
$ ulimit -u
1024
Once I start Spark, the count goes up to
$ ps -u ssimanta -L | wc -l (with Spark and spark-shell running)
982
Any recommend
Is it documented somewhere how one would go about configuring every open
port a Spark application needs?
This seems like one of the main things that makes running Spark hard in
places like EC2 when you aren't using the canned Spark scripts.
When starting an app, it looks like you'll see ports open for
Yes, I've tried.
The problem is that a new broadcast object is generated at every step, until all
of the memory is eaten up.
I solved it by using RDD.checkpoint to remove the dependencies on the old broadcast
objects, and by using cleaner.ttl to clean up these broadcast objects
automatically.
If there's a more elegant way to
Very cool! Have you thought about sending this as a pull request? We’d be happy
to maintain it inside Spark, though it might be interesting to find a single
Python package that can manage clusters across both EC2 and GCE.
Matei
On May 5, 2014, at 7:18 AM, Akhil Das wrote:
> Hi Sparkers,
>
>
Howdy Andrew,
I agree; the subnet idea is a good one... unfortunately, it doesn't really
help to secure the network.
You mentioned that the drivers need to talk to the workers. I think it is
slightly broader - all of the workers and the driver/shell need to be
addressable from/to each other on
Hello Sean,
Thanks a bunch. I am not currently running in HA mode.
The configuration is identical to our CDH4 setup, which works perfectly fine. It's
really strange that only Spark breaks with this enabled.
On Thu, May 1, 2014 at 3:06 AM, Sean Owen wrote:
> This codec does require native libraries to be
Hi,
I'm trying to run a simple Spark job that uses a 3rd-party class (in this
case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT.
I'm starting my bin/spark-shell with the following command:
./spark-shell
--driver-class-path "$LIBPATH/jodatime2.3/joda-convert-1.2.jar:$LIBPATH/j
I second this motion. :)
A unified "cloud deployment" tool would be absolutely great.
On Mon, May 5, 2014 at 1:34 PM, Matei Zaharia wrote:
> Very cool! Have you thought about sending this as a pull request? We’d be
> happy to maintain it inside Spark, though it might be interesting to find a
>
Has anyone considered using jclouds tooling to support multiple cloud
providers? Maybe using Pallet?
François
> On May 5, 2014, at 3:22 PM, Nicholas Chammas
> wrote:
>
> I second this motion. :)
>
> A unified "cloud deployment" tool would be absolutely great.
>
>
> On Mon, May 5, 2014 at 1
Well, there have been more bug fixes added to RC3 as well. So it's best to try
out the current master and let us know whether you still get the scary logs.
TD
On Sun, May 4, 2014 at 3:52 AM, wxhsdp wrote:
> Hi, TD
>
> actually, i'am not very clear with my spark version. i check out from
> https:/
Do those files actually exist? Those stdout/stderr files should have the output of
the Spark executors running on the workers, and it's weird that they don't
exist. It could be a permission issue - maybe the directories/files are not
being generated because they cannot be?
TD
On Mon, May 5, 2014 at 3:06 AM, Fran
Ethan, you're not the only one, which is why I was asking about this! :-)
Matei, thanks for your response. Your answer explains the performance jump
in my code, but shows I've missed something key in my understanding of
Spark!
I was not aware until just now that map output was saved to disk (othe
There is no built-in support in Spark Streaming for outputting to Kafka yet.
However, I have heard that people have used Twitter's Storehaus with foreachRDD, and
Storehaus has a Kafka output. Something you might look into.
TD
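For illustration only, here is one way this can look with the plain Kafka 0.8 producer inside foreachRDD (not the Storehaus route mentioned above); dstream, the broker list and the topic name are assumptions:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("metadata.broker.list", "localhost:9092")           // assumed broker address
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props)) // one producer per partition
    records.foreach(r => producer.send(new KeyedMessage[String, String]("output-topic", r)))
    producer.close()
  }
}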
On Sun, May 4, 2014 at 11:45 PM, Weide Zhang wrote:
> Hi ,
>
> Is there any cod
One main reason why Spark Streaming can achieve higher throughput than
Storm is that Spark Streaming operates in coarser-grained batches -
second-scale massive batches - which reduce per-tuple overheads in
shuffles and other kinds of data movement, etc.
Note that it is also true that th
Hi all,
I'm currently working on creating a set of docker images to facilitate
local development with Spark/streaming on Mesos (+zk, hdfs, kafka)
After solving the initial hurdles to get things working together in docker
containers, now everything seems to start-up correctly and the mesos UI
show
A few high-level suggestions.
1. I recommend using the new Receiver API in the almost-released Spark 1.0 (see
branch-1.0 / master branch on GitHub). It's a slightly better version of the
earlier NetworkReceiver, as it hides away the block generator (which previously needed
to be unnecessarily manually started and stop
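A bare-bones sketch of that Receiver API as it looks in branch-1.0; the host/port and the read loop are placeholders:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Spin up a background thread that reads from the source and hands records to Spark.
    new Thread("MyReceiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record read from " + host + ":" + port) // placeholder for a real read
          Thread.sleep(1000)                             // simulate waiting on the source
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: the reading thread checks isStopped() and exits on its own.
  }
}

// Hooked up with: val stream = ssc.receiverStream(new MyReceiver("localhost", 9999))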
Hi,
Before considering running on Mesos, did you try to submit the application
on Spark deployed without Mesos in Docker containers?
Since I'm currently investigating this idea of quickly deploying a complete set of
clusters with Docker, I'm interested in your findings on sharing the
settings of Kafka and Zo
Hi Benjamin,
Yes, we initially used a modified version of the AmpLabs docker scripts
[1]. The amplab docker images are a good starting point.
One of the biggest hurdles has been HDFS, which requires reverse-DNS and I
didn't want to go the dnsmasq route to keep the containers relatively
simple to u
Hi there,
I'm doing an iterative algorithm and I sometimes end up with a
StackOverflowError, no matter whether I do checkpoints or not.
I still don't understand why this is happening; I figured out that
increasing the stack size is a workaround.
I'm developing using "local[n]", so the local mode i
Hi,
Fairly new to Spark. I'm using Spark's saveAsSequenceFile() to write large
Sequence Files to HDFS. The Sequence Files need to be large to be
efficiently accessed in HDFS, preferably larger than Hadoop's block size,
64MB. The task works for files smaller than 64 MiB (with a warning for
seque
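As an aside, one common way to end up with fewer, larger sequence files is to reduce the number of output partitions before the write. A rough sketch; the paths, the key derivation and the partition count of 4 are just placeholders:

import org.apache.spark.SparkContext._ // implicit conversions needed for saveAsSequenceFile

val pairs = sc.textFile("hdfs:///input/data")            // assumed input path
  .map(line => (line.take(8), line))                     // build (key, value) pairs somehow

pairs.coalesce(4)                                        // 4 output partitions => 4 sequence files
  .saveAsSequenceFile("hdfs:///output/large-seqfiles")   // assumed output path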
Update: the checkpointing doesn't seem to happen. I checked with the "isCheckpointed"
method, but it always returns false. ???
2014-05-05 23:14 GMT+02:00 Andrea Esposito :
> Checkpoint doesn't help it seems. I do it at each iteration/superstep.
>
> Looking deeply, the RDDs are recomputed just few times at
It may be slow because of serialization (have you tried Kryo there?) or just
because at some point the data starts to be on disk. Try profiling the tasks
while it’s running (e.g. just use jstack to see what they’re doing) and
definitely try Kryo if you’re currently using Java Serialization. Kryo
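If it helps, switching to Kryo is just a conf change. A minimal sketch; the app name is illustrative and the registrator line is optional (the registrator class here is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                                                   // illustrative
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")        // hypothetical registrator class
val sc = new SparkContext(conf)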
I'm running into something very strange today. I'm getting an error on the
following innocuous operations.
a = sc.textFile('s3n://...')
a = a.repartition(8)
a = a.map(...)
c = a.countByKey() # ERRORs out on this action. See below for traceback. [1]
If I add a count() right after the repartition(), t
checkpoint seems to just add a checkpoint mark? You need an action after
marking it. I have tried it with success :)
newRdd = oldRdd.map(myFun).persist(myStorageLevel)
newRdd.checkpoint
newRdd.foreach(_ => {}) // Force evaluation
newRdd.isCheckpointed // true here
oldRdd.unpersist(true)
If you have
The files do not exist, in fact, and there is no permission issue.
francis@ubuntu-4:/test/spark-0.9.1$ ll work/app-20140505053550-/
total 24
drwxrwxr-x 6 francis francis 4096 May 5 05:35 ./
drwxrwxr-x 11 francis francis 4096 May 5 06:18 ../
drwxrwxr-x 2 francis francis 4096 May 5 05:35 2/
Can you check the Spark worker logs on that machine, either from the web
UI or directly? They should be in /test/spark-XXX/logs/. See if they have any errors.
If there is no permission issue, I am not sure why stdout and stderr are not
being generated.
TD
On Mon, May 5, 2014 at 7:13 PM, Francis.Hu wrote:
>
I have a Spark Streaming application that uses the external streaming
modules (e.g. Kafka, MQTT, ...) as well. It is not clear how to properly
invoke the spark-submit script: what --driver-class-path and/or
-Dspark.executor.extraClassPath parameters are required?
For reference, the following
Hi,
I'm looking at the event log; I'm a little confused about some metrics.
Here's the info for one task:
"Launch Time":1399336904603
"Finish Time":1399336906465
"Executor Run Time":1781
"Shuffle Read Metrics":"Shuffle Finish Time":1399336906027, "Fetch Wait
Time":0
"Shuffle Write Metrics":{"Shuf
Hi, I still have over 1 GB left for my program.
Date: Sun, 4 May 2014 19:14:30 -0700
From: ml-node+s1001560n5340...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: "sbt/sbt run" command returns a JVM problem
The total memory of your machine is 2 GB, right? Then how much memory is
left free?
Yes, I'm struggling with a similar problem where my classes are not found on the
worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone
could provide some documentation on the usage of spark-submit.
Thanks
> On May 5, 2014, at 10:24 PM, Stephen Boesch wrote:
>
>
> I ha
Hi experts,
I have some pre-built Python parsers that I am planning to use, just
because I don't want to write them again in Scala. However, after the data is
parsed I would like to take the RDD and use it in a Scala program. (Yes, I
like Scala more than Python and am more comfortable in Scala :)
In d
In my code, there are two broadcast variables. Sometimes reading the small
one takes more time than reading the big one - so strange!
The log on the slave node is as follows:
Block broadcast_2 stored as values to memory (estimated size *4.0 KB*, free
17.2 GB)
Reading broadcast variable 2 took *9.998537123* s
Blo
Additionally, reading the big broadcast variable always took about 2 s.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/about-broadcast-tp5416p5417.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I have seen three different ways to query data from Spark
1. Default SQL support(
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/sql/examples/HiveFromSpark.scala
)
2. Shark
3. BlinkDB
I would like to know which one is more efficient.
Regards,
Hi, TD
I tried on v1.0.0-rc3 and still got the error.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/sbt-run-with-spark-ContextCleaner-ERROR-tp5304p5421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
All three have different use cases. If you are looking for more of a
warehouse, you are better off with Shark.
Spark SQL is a way to query regular data with SQL-like syntax, leveraging a
columnar store.
BlinkDB is an experiment, meant to integrate with Shark in the long term.
Not meant for production use cases.
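For a feel of the Spark SQL path, a minimal sketch along the lines of the 1.0 docs; the file path and the Person schema are illustrative:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD          // implicit RDD[Person] => SchemaRDD

val people = sc.textFile("people.txt")     // assumed "name,age" lines
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.collect().foreach(println)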
Thank you for your prompt reply.
Regards,
prabeesh
On Tue, May 6, 2014 at 11:44 AM, Mayur Rustagi wrote:
> All three have different usecases. If you are looking for more of a
> warehouse you are better off with Shark.
> SparkSQL is a way to query regular data in sql like syntax leveraging
> col
Add export SPARK_JAVA_OPTS="-Xss16m" to conf/spark-env.sh. Then it should apply
to the executor.
Matei
On May 5, 2014, at 2:20 PM, Andrea Esposito wrote:
> Hi there,
>
> i'm doing an iterative algorithm and sometimes i ended up with
> StackOverflowError, doesn't matter if i do checkpoints o
Hi Carter,
Do an export JAVA_OPTS="-Xmx2g" before hitting sbt/sbt run. That will solve
your problem.
Thanks
Best Regards
On Tue, May 6, 2014 at 8:02 AM, Carter wrote:
> hi I still have over 1g left for my program.
>
> --
> Date: Sun, 4 May 2014 19:14:30 -0700
> From