Hi all,
I installed spark-0.9.1 and zeromq 4.0.1, and then ran the examples below:
./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo.bar
./bin/run-example org.apache.spark.streaming.examples.ZeroMQWordCount local[2] tcp://127.0.1.1:1234 foo
Hi Liu,
Is this a feature of Spark 0.9.1?
My version is 0.9.0; setting spark.eventLog.enabled has no effect.
Unfortunately zeromq 4.0.1 is not supported.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala#L63
says which version is needed. You will need that version of zeromq to see it
work. Basically I have seen it working nicely with zeromq 2.2.
Hi,
You can also use any path pattern as defined here:
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
e.g.:
sc.textFile("{/path/to/file1,/path/to/file2}")
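For instance, a minimal sketch of wildcard and brace patterns (the paths here are hypothetical; the globs are expanded by Hadoop's FileSystem.globStatus):
val allParts = sc.textFile("hdfs:///logs/2014/04/*/part-*")       // wildcard over directories and files
val twoDays  = sc.textFile("hdfs:///logs/2014/04/{28,29}/part-*") // brace alternation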
Christophe.
On 29/04/2014 05:07, Nicholas Chammas wrote:
Not tha
Hi,
By default a fraction of the executor memory (60%) is reserved for RDD
caching, so if there's no explicit caching in the code (e.g. rdd.cache()
etc.), or if we persist RDDs with StorageLevel.DISK_ONLY, is this part of
memory wasted? Does Spark allocate the RDD cache memory dynamically? Or
does s
Thanks, Prashant Sharma
It works now after downgrading zeromq from 4.0.1 to 2.2.
Do you know whether the next release of Spark will upgrade zeromq?
Many of our programs are using zeromq 4.0.1, so if Spark Streaming can ship
with a newer zeromq in the next release, that would be better.
Well that is not going to be easy, simply because we depend on akka-zeromq
for zeromq support. And since akka does not support the latest zeromq
library yet, I doubt if there is something simple that can be done to
support it.
Prashant Sharma
On Tue, Apr 29, 2014 at 2:44 PM, Francis.Hu wrote:
>
Very interesting.
One of Spark's attractive features is being able to do stuff interactively
via spark-shell. Is something like that still available via Ooyala's job
server?
Or do you use spark-shell independently of that? If the latter, then how
do you manage custom jars for spark-shell? Our
In the context of the telecom industry, let's suppose we have several existing
RDDs populated from some tables in Cassandra:
val callPrices: RDD[PriceRow]
val calls: RDD[CallRow]
val offersInCourse: RDD[OfferRow]
where types are defined as follows,
/** Represents the p
Finally, I'm saving the RDDs to files and then reloading them. It works fine,
because Gibbs sampling for LDA is really slow: it takes about 10 minutes to sample
10k wiki documents for 10 rounds (1 round/min).
I have no idea why shuffle spill is so large. But this might make it
smaller:
val addition = (a: Int, b: Int) => a + b
val wordsCount = wordsPair.combineByKey(identity, addition, addition)
This way only one entry per distinct word will end up in the shuffle for
each partition, instead of one entr
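Spelled out, a minimal sketch of that suggestion (wordsPair is assumed to be an RDD[(String, Int)] of (word, 1) pairs):
val addition = (a: Int, b: Int) => a + b
val wordsCount = wordsPair.combineByKey(
  (v: Int) => v, // createCombiner: start a per-partition sum from the first value
  addition,      // mergeValue: fold further values into the partial sum
  addition)      // mergeCombiners: add partial sums from different partitions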
Hi all -
I’m using pySpark/MLLib ALS for user/item clustering and would like to directly
access the user/product RDDs (called userFeatures/productFeatures in class
MatrixFactorizationModel in mllib/recommendation/MatrixFactorizationModel.scala).
This doesn't seem too complex, but it doesn't seem l
Create a key and join on that.
val callPricesByHour = callPrices.map(p =>
  ((p.year, p.month, p.day, p.hour), p))
val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) =>
  BillRow(c.customer, c.hour, c.minutes *
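A hedged completion of the truncated line above, assuming PriceRow carries a per-minute price field and BillRow is (customer, hour, amount):
val bills = callPricesByHour.join(callsByHour).mapValues { case (p, c) =>
  BillRow(c.customer, c.hour, c.minutes * p.price) // `price` is an assumed field of PriceRow
}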
There's no easy way to do this currently. The pieces are there in the PySpark
code for regression, which should be adaptable.
But you'd have to roll your own solution.
This is something I also want, so I intend to put together a pull request for
this soon.
On Tue, Apr 29,
Hi,
Is it possible to know from code whether an RDD is cached, and more
precisely, how many of its partitions are cached in memory and how many are
cached on disk? I know I can get the storage level, but I also want to know
the current actual caching status. Knowing memory consumption would al
SparkContext.getRDDStorageInfo
On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth <
andras.nem...@lynxanalytics.com> wrote:
> Hi,
>
> Is it possible to know from code about an RDD if it is cached, and more
> precisely, how many of its partitions are cached in memory and how many are
> cached on dis
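A minimal sketch of how that could be used (the RDDInfo field names follow that era's API and should be treated as assumptions):
sc.getRDDStorageInfo.find(_.id == rdd.id).foreach { info =>
  println(s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}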
The return type should be RDD[(Int, Int, Int)] because sc.textFile()
returns an RDD. Try adding an import for the RDD type to get rid of the
compile error.
import org.apache.spark.rdd.RDD
On Mon, Apr 28, 2014 at 6:22 PM, SK wrote:
> Hi,
>
> I am a new user of Spark. I have a class that define
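A hedged sketch of the corrected signature (the parsing step is invented for illustration):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Declare the method to return the RDD itself instead of a single tuple
def input(sc: SparkContext, path: String): RDD[(Int, Int, Int)] =
  sc.textFile(path).map { line =>
    val Array(a, b, c) = line.split(",").map(_.trim.toInt)
    (a, b, c)
  }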
Hi all:
Is it possible to develop Spark programs in Python and run them on YARN?
From the Python SparkContext class, it doesn't seem to have such an option.
Thank you,
- Guanhua
===
Guanhua Yan, Ph.D.
Information Sciences Group (CCS-3)
Los Alamos National Laboratory
Hi,
I am a new user of Spark. I have a class that defines a function as follows.
It returns a tuple: (Int, Int, Int).
class Sim extends VectorSim {
  override def input(master: String): (Int, Int, Int) = {
    sc = new SparkContext(master, "Test")
    val ratings = sc.tex
What is Seq[V] in updateStateByKey?
Does this store the collected tuples of the RDD in a collection?
Method signature:
def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
In my case I used Seq[Double] assuming a sequence of Doubles in the RDD; the
Each time I run sbt/sbt assembly to compile my program, the packaging time
takes about 370 sec (about 6 min). How can I reduce this time?
thanks
Hi All,
I have replication factor 3 in my HDFS.
With 3 datanodes, I ran my experiments. Now I just added another node to it
with no data in it.
When I ran, Spark launches non-local tasks on it and the time taken is more
than what it took for the 3-node cluster.
Here delayed scheduling fails I think b
Hello folks,
I was going to post this question to spark user group as well. If you have
any leads on how to solve this issue please let me know:
I am trying to build a basic Spark project (Spark depends on Akka) and I am
trying to create a fat jar using sbt assembly. The goal is to run the fat jar
You need to merge reference.conf files and then it's no longer an issue.
See the build for Spark itself:
case "reference.conf" => MergeStrategy.concat
On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao wrote:
> Hello folks,
>
> I was going to post this question to spark user group as well. If you ha
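For reference, a sketch of how that case typically sits inside an sbt-assembly merge-strategy block (sbt 0.12/0.13-era sbt-assembly syntax assumed):
import sbtassembly.Plugin._
import AssemblyKeys._

mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case "reference.conf" => MergeStrategy.concat // concatenate the Akka/Spark reference.conf files
    case x                => old(x)               // otherwise keep the default strategy
  }
}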
Tips from my experience. Disable scaladoc:
sources in doc in Compile := List()
Do not package the source:
publishArtifact in packageSrc := false
And most importantly do not run "sbt assembly". It creates a fat jar. Use
"sbt package" or "sbt stage" (from sbt-native-packager). They create a
direc
Tip: read the wiki --
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
On Tue, Apr 29, 2014 at 12:48 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> Tips from my experience. Disable scaladoc:
>
> sources in doc in Compile := List()
>
> Do not package the s
I am getting a ClassCastException. I am clueless as to why this occurs.
I am transforming a non-pair RDD to a PairRDD and doing groupByKey.
org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 4 times
(most recent failure: Exception failure: java.lang.ClassCastException:
java.lang.Double
The original DStream is of (K,V). This function creates a DStream of
(K,S). Each time slice brings one or more new V for each K. The old
state S (can be different from V!) for each K -- possibly non-existent
-- is updated in some way by a bunch of new V, to produce a new state
S -- which also might
You may have already seen it, but I will mention it anyway. This example
may help:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala
Here the state is essentially a running count of the words seen. So the
value t
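A minimal sketch of that running-count pattern (the `pairs` DStream of (word, 1) tuples and the checkpoint setup are assumed):
// Requires ssc.checkpoint(...) to be set somewhere beforehand
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateFunc) // DStream[(String, Int)]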
This will be possible in 1.0 after this pull request:
https://github.com/apache/spark/pull/30
Matei
On Apr 29, 2014, at 9:51 AM, Guanhua Yan wrote:
> Hi all:
>
> Is it possible to develop Spark programs in Python and run them on YARN? From
> the Python SparkContext class, it doesn't seem to
Hi,
I am configuring a standalone Spark cluster using the
spark-0.9.1-bin-hadoop2 binary.
I started the master and slave (localhost) using start-master.sh and
start-slaves.sh. I can see the master and worker started in the web UI.
Now I am running a sample POC Java jar file which connects to the master
Hi TD,
In my tests with Spark Streaming, I'm using the JavaNetworkWordCount (modified) code
and a program that I wrote that sends words to the Spark worker; I use TCP as
the transport. I verified that after starting Spark, it connects to my source, which
actually starts sending, but the first word count
Hi,
I started with a text file (CSV) of data sorted by the first column, and parsed it
into Scala objects using a map operation in Scala. Then I used more maps to
add some extra info to the data and saved it as a text file.
The final text file is not sorted. What do I need to do to keep the order
from the ori
Is your batch size 30 seconds by any chance?
Assuming not, please check whether you are creating the streaming context
with master "local[n]" where n > 1. With "local" or "local[1]", the system
only has one processing slot, which is occupied by the receiver, leaving no
room for processing the receiv
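For illustration, a minimal sketch of a context with enough local slots (the app name, batch interval, and socket source are placeholders):
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]": one slot for the receiver, one left over for processing
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()
ssc.start()
ssc.awaitTermination()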
Also seeing logs related to memory towards the end.
14/04/29 15:07:54 INFO MemoryStore: ensureFreeSpace(138763) called with
curMem=0, maxMem=1116418867
14/04/29 15:07:54 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 135.5 KB, free 1064.6 MB)
14/04/29 15:07:54 INFO F
Hi TD,
We are not using the streaming context with a local master; we have 1 master and 8
workers and 1 word source. The command line that we are using is:
bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount
spark://192.168.0.13:7077
On Apr 30, 2014, at 0:09, Tathagata Das wrot
Thanks, Matei. Will take a look at it.
Best regards,
Guanhua
From: Matei Zaharia
Date: Tue, 29 Apr 2014 14:19:30 -0700
Subject: Re: Python Spark on YARN
This will be possible in 1.0 after this pull request:
https://github.com/apache/spark/pull/30
Matei
On Apr 29, 2014, at
Strange! Can you just do lines.print() to print the raw data instead of
doing the word count? Beyond that we can do two things.
1. Can you check the Spark stage UI to see whether there are stages running
during the 30-second period you referred to?
2. If you upgrade to using the Spark master branch (or Spark 1.
Hi TD, I am GMT+8 from you; tomorrow I will get the information that you
asked for.
Thanks
----- Original Message -----
From: "Tathagata Das"
Sent: 30/04/2014 00:57
To: "user@spark.apache.org"
Subject: Re: Spark's behavior
Strange! Can you just do lines.print() to print the raw d
Hello,
I am trying to write multiple files with Spark, but I cannot find a way to
do it.
Here is the idea:
val rddKeyValue: RDD[(String, String)] = rddlines.map(line =>
  createKeyValue(line))
Now I would like to save this as and all the values inside
the file.
I tried to use this after the
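One common approach from that era, not taken from this thread, is a custom MultipleTextOutputFormat keyed on the record key; a hedged sketch (the output path is a placeholder):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a file named after its key (one file per key per partition)
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get() // keep the key out of the file contents
}

rddKeyValue.saveAsHadoopFile("/output/path", classOf[String], classOf[String],
  classOf[KeyBasedOutput])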
Hi Daniel
Thanks for your reply. I think reduceByKey will also do a
map-side combine, so the result is the same: for each partition,
one entry per distinct word. In my case with the JavaSerializer, a 240MB dataset
yields around 70MB of shuffle data. Only that shuffle
Hi
I noticed that at the Spark 1.0 meetup, on the 1.1-and-beyond roadmap, it
mentioned support for pluggable storage strategies. We are also planning
similar things to enable the block manager to store data on more storage media.
So is there any existing plan, design, or rough idea on this
Hi all,
I searched around but failed to find anything about running SparkR
on YARN.
So, is it possible to run SparkR with YARN, either in yarn-standalone or
yarn-client mode?
If so, is there any document that could guide me through the build & setup
process?
I am desperate for some
There is a JavaSparkContext, but no JavaSparkConf object. I know SparkConf
is new in 0.9.x.
Is there a plan to add something like this to the Java API?
It's rather a bother to have things like setAll take a Scala
Traversable[(String, String)] when using SparkConf from the Java API.
At a minimum addi
I think the real problem is "spark.akka.frameSize". It is too small for
passing the data. Every executor failed, and with no executors left, the
task hangs.
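If that is the cause, a minimal sketch of raising the limit (the value is in MB in this era of Spark; the app name and number are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                // placeholder name
  .set("spark.akka.frameSize", "128") // default was 10 (MB); use something larger than your biggest message
val sc = new SparkContext(conf)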
I met the same problem when updating to Spark 0.9.1
(svn checkout https://github.com/apache/spark/)
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq;
at org.apache.spark.examples.GroupByTest$.main(
Hi
I am running a WordCount program which counts words from HDFS, and I
noticed that the serializer part of the code takes a lot of CPU time. On a
16-core/32-thread node, the total throughput is around 50MB/s with the JavaSerializer,
and if I switch to the KryoSerializer, it doubles to around 100-150
Is this the serialization throughput per task or the serialization
throughput for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond wrote:
> Hi
>
> I am running a WordCount program which count words from HDFS, and I
> noticed that the serializer part of code takes a lot of CPU
The signature of this function was changed in spark 1.0... is there
any chance that somehow you are actually running against a newer
version of Spark?
On Tue, Apr 29, 2014 at 8:58 PM, wxhsdp wrote:
> i met with the same question when update to spark 0.9.1
> (svn checkout https://github.com/apache
For all the tasks, say 32 tasks in total.
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Is this the serialization throughput per task or the serialization throughput
for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond wrote
Hm - I'm still not sure if you mean
100MB/s for each task = 3200MB/s across all cores
-or-
3.1MB/s for each task = 100MB/s across all cores
If it's the second one, that's really slow and something is wrong. If
it's the first one, this is in the range of what I'd expect, but I'm no
expert.
On Tue, Apr
Hi Patrick,
I'm a little confused about your comment that RDDs are not ordered. As far
as I know, RDDs keep a list of partitions that are ordered, and this is why I
can call RDD.take() and get the same first k rows every time I call it, and
RDD.take() returns the same entries as RDD.map().take() beca
By the way, to be clear, I run repartition first to make all data go through
the shuffle instead of running reduceByKey etc. directly (which reduces the data that
needs to be shuffled and serialized), so all 50MB/s of data from HDFS will go to the
serializer. (In fact, I also tried generating data in memory dir
This class was made to be "java friendly" so that we wouldn't have to
use two versions. The class itself is simple. But I agree adding java
setters would be nice.
On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth wrote:
> There is a JavaSparkContext, but no JavaSparkConf object. I know SparkConf
> i
The latter case: total throughput aggregated across all cores.
Best Regards,
Raymond Liu
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Wednesday, April 30, 2014 1:22 PM
To: user@spark.apache.org
Subject: Re: How fast would you expect shuffle serialize to be?
Hm -
You are right, once you sort() the RDD, then yes it has a well defined ordering.
But that ordering is lost as soon as you transform the RDD, including
if you union it with another RDD.
On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim wrote:
> Hi Patrick,
>
> I'm a little confused about your comment
My implication is that it isn't "Java friendly" enough. The following methods
return Scala objects:
getAkkaConf
getAll
getExecutorEnv
and the following methods require Scala objects as their params:
setAll
setExecutorEnv (both of the bulk methods)
so, while it is usable from Java, I wouldn't call it fr
Thanks for the quick response!
To better understand it, the reason a sorted RDD has a well-defined ordering
is because sortedRDD.getPartitions() returns the partitions in the right
order and each partition internally is properly sorted. So, if you have
var rdd = sc.parallelize([2, 1, 3]);
var sorte
If you call map() on an RDD it will retain the ordering it had before,
but that is not necessarily a correct sort order for the new RDD.
var rdd = sc.parallelize([2, 1, 3]);
var sorted = rdd.map(x => (x, x)).sort(); // should be [1, 2, 3]
var mapped = sorted.mapValues(x => 3 - x); // should be [2,
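A runnable version of the same sketch (using sortByKey, since RDDs expose sortByKey rather than sort):
val rdd = sc.parallelize(Seq(2, 1, 3))
val sorted = rdd.map(x => (x, x)).sortByKey() // keys come out as 1, 2, 3
val mapped = sorted.mapValues(x => 3 - x)     // partition order kept, but the values (2, 1, 0)
                                              // are no longer in sorted order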
We don't have any documentation on running SparkR on YARN and I think there
might be some issues that need to be fixed (The recent PySpark on YARN PRs
are an example).
SparkR has only been tested to work with Spark standalone mode so far.
Thanks
Shivaram
On Tue, Apr 29, 2014 at 7:56 PM, phoenix
Hi, any suggestions on the following issue?
I have replication factor 3 in my HDFS.
With 3 datanodes, I ran my experiments. Now I just added another node to it
with no data in it.
When I ran, Spark launches non-local tasks on it and the time taken is more
than what it took for the 3-node cluster.
He
I just tried to use the serializer to write objects directly in local mode with this code:
val datasize = args(1).toInt
val dataset = (0 until datasize).map( i => ("asmallstring", i))
val out: OutputStream = {
  new BufferedOutputStream(new FileOutputStream(args(2)), 1024 * 100)
}
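A hedged completion of that snippet (java.io imports omitted), assuming it runs inside a Spark application where SparkEnv.get is accessible and returns the configured serializer:
val ser = org.apache.spark.SparkEnv.get.serializer.newInstance()
val serStream = ser.serializeStream(out) // wrap the buffered file stream
dataset.foreach(pair => serStream.writeObject(pair))
serStream.flush()
serStream.close()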
Yes, that’s what I meant. Sure, the numbers might not be actually sorted,
but the order of rows is semantically kept throughout non-shuffling
transforms. I’m on board with you on union as well.
Back to the original question, then, why is it important to coalesce to a
single partition? When you un
Hi Patrick,
I checked out https://github.com/apache/spark/ this morning and built
/spark/trunk
with ./sbt/sbt assembly.
Is it Spark 1.0?
So how can I update my sbt file? The latest version in
http://repo1.maven.org/maven2/org/apache/spark/
is 0.9.1.
Thank you for your help.