Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Francis . Hu
Hi all, I installed spark-0.9.1 and zeromq 4.0.1, and then ran the examples below: ./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher tcp://127.0.1.1:1234 foo.bar ./bin/run-example org.apache.spark.streaming.examples.ZeroMQWordCount local[2] tcp://127.0.1.1:1234 foo

RE: questions about debugging a spark application

2014-04-29 Thread wxhsdp
Hi Liu, is that a feature of Spark 0.9.1? My version is 0.9.0, and setting spark.eventLog.enabled has no effect. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/questions-about-debugging-a-spark-application-tp4891p5028.html Sent from the Apache Spark User

Re: Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Prashant Sharma
Unfortunately zeromq 4.0.1 is not supported. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/ZeroMQWordCount.scala#L63 states the supported version. You will need that version of zeromq to see it work. Basically I have seen it working nicely with zer

Re: File list read into single RDD

2014-04-29 Thread Christophe Préaud
Hi, You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29 e.g.: sc.textFile("{/path/to/file1,/path/to/file2}") Christophe. On 29/04/2014 05:07, Nicholas Chammas wrote: Not tha
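A minimal sketch of the forms sc.textFile accepts, assuming sc is an existing SparkContext and the paths are placeholders:

  val twoFiles = sc.textFile("/path/to/file1,/path/to/file2")      // comma-separated list of paths
  val globbed  = sc.textFile("/logs/2014-04-*/part-*")             // Hadoop glob pattern
  val braces   = sc.textFile("{/path/to/file1,/path/to/file2}")    // brace expansion, as in the mail above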

Spark RDD cache memory usage

2014-04-29 Thread Han JU
Hi, By default a fraction of the executor memory (60%) is reserved for RDD caching, so if there's no explicit caching in the code (eg. rdd.cache() etc.), or if we persist RDD with StorageLevel.DISK_ONLY, is this part of memory wasted? Does Spark allocate the RDD cache memory dynamically? Or does s
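For reference, the fraction reserved for caching in that era was controlled by spark.storage.memoryFraction (default 0.6); a hedged sketch with a hypothetical value, for a job that does no caching:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("NoCacheJob")                        // hypothetical app name
    .set("spark.storage.memoryFraction", "0.2")      // shrink the cache region if nothing is cached
  val sc = new SparkContext(conf)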

RE: Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Francis . Hu
Thanks, Prashant Sharma. It works now after downgrading zeromq from 4.0.1 to 2.2. Do you know whether the next release of Spark will upgrade zeromq? Many of our programs are using zeromq 4.0.1, so if Spark Streaming could ship with a newer zeromq in the next release, that would be

Re: RE: Issue during Spark streaming with ZeroMQ source

2014-04-29 Thread Prashant Sharma
Well that is not going to be easy, simply because we depend on akka-zeromq for zeromq support. And since akka does not support the latest zeromq library yet, I doubt if there is something simple that can be done to support it. Prashant Sharma On Tue, Apr 29, 2014 at 2:44 PM, Francis.Hu wrote: >

Re: launching concurrent jobs programmatically

2014-04-29 Thread ishaaq
Very interesting. One of spark's attractive features is being able to do stuff interactively via spark-shell. Is something like that still available via Ooyala's job server? Or do you use the spark-shell independently of that? If the latter then how do you manage custom jars for spark-shell? Our

Joining not-pair RDDs in Spark

2014-04-29 Thread jsantos
In the context of the telecom industry, let's suppose we have several existing RDDs populated from some tables in Cassandra: val callPrices: RDD[PriceRow] val calls: RDD[CallRow] val offersInCourse: RDD[OfferRow] where types are defined as follows, /** Represents the p

Re: Why Spark require this object to be serializerable?

2014-04-29 Thread Earthson
Finally, I'm using files to save RDDs, and then reloading them. It works fine, because Gibbs Sampling for LDA is really slow. It takes about 10 min to sample 10k wiki documents for 10 rounds (1 round/min). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-requir

Re: Shuffle Spill Issue

2014-04-29 Thread Daniel Darabos
I have no idea why shuffle spill is so large. But this might make it smaller: val addition = (a: Int, b: Int) => a + b val wordsCount = wordsPair.combineByKey(identity, addition, addition) This way only one entry per distinct word will end up in the shuffle for each partition, instead of one entr
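A self-contained version of this suggestion, assuming wordsPair is an RDD[(String, Int)] of (word, 1) pairs as in the thread:

  import org.apache.spark.SparkContext._   // brings in the pair RDD functions (Spark 0.9-era import)

  val addition   = (a: Int, b: Int) => a + b
  val wordsCount = wordsPair.combineByKey((v: Int) => v, addition, addition)
  // reduceByKey is equivalent here and also combines on the map side:
  val wordsCount2 = wordsPair.reduceByKey(_ + _)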

User/Product Clustering with pySpark ALS

2014-04-29 Thread Laird, Benjamin
Hi all - I’m using pySpark/MLLib ALS for user/item clustering and would like to directly access the user/product RDDs (called userFeatures/productFeatures in class MatrixFactorizationModel in mllib/recommendation/MatrixFactorizationModel.scala). This doesn’t seem too complex, but it doesn’t seem l

Fwd: Spark RDD cache memory usage

2014-04-29 Thread Han JU
Hi, By default a fraction of the executor memory (60%) is reserved for RDD caching, so if there's no explicit caching in the code (eg. rdd.cache() etc.), or if we persist RDD with StorageLevel.DISK_ONLY, is this part of memory wasted? Does Spark allocate the RDD cache memory dynamically? Or does s

Re: Joining not-pair RDDs in Spark

2014-04-29 Thread Daniel Darabos
Create a key and join on that. val callPricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p)) val callsByHour = calls.map(c => ((c.year, c.month, c.day, c.hour), c)) val bills = callPricesByHour.join(callsByHour).mapValues({ case (p, c) => BillRow(c.customer, c.hour, c.minutes *
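A hedged, self-contained sketch of that approach; the case class fields are assumptions based on the thread:

  import org.apache.spark.SparkContext._   // pair RDD functions (join)
  import org.apache.spark.rdd.RDD

  case class PriceRow(year: Int, month: Int, day: Int, hour: Int, price: Double)
  case class CallRow(customer: String, year: Int, month: Int, day: Int, hour: Int, minutes: Int)
  case class BillRow(customer: String, hour: Int, amount: Double)

  def bill(callPrices: RDD[PriceRow], calls: RDD[CallRow]): RDD[BillRow] = {
    val pricesByHour = callPrices.map(p => ((p.year, p.month, p.day, p.hour), p))
    val callsByHour  = calls.map(c => ((c.year, c.month, c.day, c.hour), c))
    pricesByHour.join(callsByHour).map { case (_, (p, c)) =>
      BillRow(c.customer, c.hour, c.minutes * p.price)
    }
  }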

Re: User/Product Clustering with pySpark ALS

2014-04-29 Thread Nick Pentreath
There's no easy way to do this currently. The pieces are there from the PySpark code for regression, which should be adaptable. But you'd have to roll your own solution. This is something I also want, so I intend to put together a pull request for this soon — Sent from Mailbox On Tue, Apr 29,

Storage information about an RDD from the API

2014-04-29 Thread Andras Nemeth
Hi, Is it possible to know from code about an RDD if it is cached, and more precisely, how many of its partitions are cached in memory and how many are cached on disk? I know I can get the storage level, but I also want to know the current actual caching status. Knowing memory consumption would al

Re: Storage information about an RDD from the API

2014-04-29 Thread Koert Kuipers
SparkContext.getRDDStorageInfo On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth < andras.nem...@lynxanalytics.com> wrote: > Hi, > > Is it possible to know from code about an RDD if it is cached, and more > precisely, how many of its partitions are cached in memory and how many are > cached on dis
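A short sketch of how that call could be used (field names follow the RDDInfo class of that era; sc is an existing SparkContext):

  val infos = sc.getRDDStorageInfo   // Array[RDDInfo] for RDDs that have been persisted
  infos.foreach { info =>
    println(s"RDD ${info.id} '${info.name}': " +
      s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
      s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
  }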

Re: How to declare Tuple return type for a function

2014-04-29 Thread Roger Hoover
The return type should be RDD[(Int, Int, Int)] because sc.textFile() returns an RDD. Try adding an import for the RDD type to get rid of the compile error. import org.apache.spark.rdd.RDD On Mon, Apr 28, 2014 at 6:22 PM, SK wrote: > Hi, > > I am a new user of Spark. I have a class that define
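A hedged sketch of the two signatures being discussed; the parsing logic is illustrative only:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // A plain tuple return type needs no Spark import:
  def threeInts(): (Int, Int, Int) = (1, 2, 3)

  // If the value is built from sc.textFile(...), it is an RDD and must be declared as such:
  def inputAsRdd(sc: SparkContext, path: String): RDD[(Int, Int, Int)] =
    sc.textFile(path).map { line =>
      val Array(a, b, c) = line.split(",").map(_.trim.toInt)   // assumes exactly three comma-separated ints per line
      (a, b, c)
    }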

Python Spark on YARN

2014-04-29 Thread Guanhua Yan
Hi all: Is it possible to develop Spark programs in Python and run them on YARN? From the Python SparkContext class, it doesn't seem to have such an option. Thank you, - Guanhua === Guanhua Yan, Ph.D. Information Sciences Group (CCS-3) Los Alamos National Laboratory

How to declare Tuple return type for a function

2014-04-29 Thread SK
Hi, I am a new user of Spark. I have a class that defines a function as follows. It returns a tuple : (Int, Int, Int). class Sim extends VectorSim { override def input(master:String): (Int,Int,Int) = { sc = new SparkContext(master, "Test") val ratings = sc.tex

What is Seq[V] in updateStateByKey?

2014-04-29 Thread Adrian Mocanu
What is Seq[V] in updateStateByKey? Does this store the collected tuples of the RDD in a collection? Method signature: def updateStateByKey[S: ClassTag] ( updateFunc: (Seq[V], Option[S]) => Option[S] ): DStream[(K, S)] In my case I used Seq[Double] assuming a sequence of Doubles in the RDD; the

packaging time

2014-04-29 Thread SK
Each time I run sbt/sbt assembly to compile my program, the packaging time takes about 370 sec (about 6 min). How can I reduce this time? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/packaging-time-tp5048.html Sent from the Apache Spark User List

Delayed Scheduling - Setting spark.locality.wait.node parameter in interactive shell

2014-04-29 Thread Sai Prasanna
Hi All, I have replication factor 3 in my HDFS. With 3 datanodes, I ran my experiments. Now I just added another node with no data on it. When I ran again, Spark launches non-local tasks on it and the time taken is more than what it took for the 3-node cluster. Here delayed scheduling fails, I think, b

Spark: issues with running a sbt fat jar due to akka dependencies

2014-04-29 Thread Shivani Rao
Hello folks, I was going to post this question to spark user group as well. If you have any leads on how to solve this issue please let me know: I am trying to build a basic spark project (spark depends on akka) and I am trying to create a fatjar using sbt assembly. The goal is to run the fatjar

Re: Spark: issues with running a sbt fat jar due to akka dependencies

2014-04-29 Thread Koert Kuipers
you need to merge reference.conf files and it's no longer an issue. see the build file for Spark itself: case "reference.conf" => MergeStrategy.concat On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao wrote: > Hello folks, > > I was going to post this question to spark user group as well. If you ha
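A rough build-file fragment in the sbt-assembly syntax of that era (plugin versions differ, so treat this as a sketch rather than the exact Spark build):

  import sbtassembly.Plugin._
  import AssemblyKeys._

  mergeStrategy in assembly := {
    case "reference.conf" => MergeStrategy.concat   // concatenate akka reference.conf files
    case _                => MergeStrategy.first    // simplistic fallback; tighten per project
  }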

Re: packaging time

2014-04-29 Thread Daniel Darabos
Tips from my experience. Disable scaladoc: sources in doc in Compile := List() Do not package the source: publishArtifact in packageSrc := false And most importantly do not run "sbt assembly". It creates a fat jar. Use "sbt package" or "sbt stage" (from sbt-native-packager). They create a direc
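Those tips as build.sbt lines (sbt 0.13-era syntax, as a sketch):

  sources in (Compile, doc) := Seq.empty        // skip scaladoc generation
  publishArtifact in packageSrc := false        // do not package sources
  publishArtifact in packageDoc := false        // do not package docs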

Re: packaging time

2014-04-29 Thread Mark Hamstra
Tip: read the wiki -- https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Tue, Apr 29, 2014 at 12:48 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Tips from my experience. Disable scaladoc: > > sources in doc in Compile := List() > > Do not package the s

java.lang.ClassCastException for groupByKey

2014-04-29 Thread amit karmakar
I am getting a ClassCastException. I am clueless as to why this occurs. I am transforming a non-pair RDD to a PairRDD and doing groupByKey org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 4 times (most recent failure: Exception failure: java.lang.ClassCastException: java.lang.Double

Re: What is Seq[V] in updateStateByKey?

2014-04-29 Thread Sean Owen
The original DStream is of (K,V). This function creates a DStream of (K,S). Each time slice brings one or more new V for each K. The old state S (can be different from V!) for each K -- possibly non-existent -- is updated in some way by a bunch of new V, to produce a new state S -- which also might

Re: What is Seq[V] in updateStateByKey?

2014-04-29 Thread Tathagata Das
You may have already seen it, but I will mention it anyways. This example may help. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala Here the state is essentially a running count of the words seen. So the value t
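A hedged sketch of that running count, assuming pairs is a DStream[(String, Int)] of (word, 1) pairs and that checkpointing has been enabled on the StreamingContext (stateful operations require it):

  import org.apache.spark.streaming.StreamingContext._   // pair DStream functions (0.9-era import)
  import org.apache.spark.streaming.dstream.DStream

  val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
    Some(runningCount.getOrElse(0) + newValues.sum)   // newValues: every count that arrived for this key in the batch

  val runningCounts: DStream[(String, Int)] = pairs.updateStateByKey[Int](updateFunc)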

Re: Python Spark on YARN

2014-04-29 Thread Matei Zaharia
This will be possible in 1.0 after this pull request: https://github.com/apache/spark/pull/30 Matei On Apr 29, 2014, at 9:51 AM, Guanhua Yan wrote: > Hi all: > > Is it possible to develop Spark programs in Python and run them on YARN? From > the Python SparkContext class, it doesn't seem to

Spark cluster standalone setup

2014-04-29 Thread pradeep_s
Hi, I am configuring a standalone setup for a spark cluster using the spark-0.9.1-bin-hadoop2 binary. Started the master and slave (localhost) using start-master.sh and start-slaves.sh. I can see the master and worker started in the web ui. Now I am running a sample poc java jar file which connects to the master

Spark's behavior

2014-04-29 Thread Eduardo Costa Alfaia
Hi TD, In my tests with spark streaming, I'm using the JavaNetworkWordCount (modified) code and a program that I wrote that sends words to the Spark worker; I use TCP as transport. I verified that after starting Spark, it connects to my source, which actually starts sending, but the first word count

rdd ordering gets scrambled

2014-04-29 Thread Mohit Jaggi
Hi, I started with a text file(CSV) of sorted data (by first column), parsed it into Scala objects using map operation in Scala. Then I used more maps to add some extra info to the data and saved it as text file. The final text file is not sorted. What do I need to do to keep the order from the ori

Re: Spark's behavior

2014-04-29 Thread Tathagata Das
Is your batch size 30 seconds by any chance? Assuming not, please check whether you are creating the streaming context with master "local[n]" where n > 2. With "local" or "local[1]", the system only has one processing slot, which is occupied by the receiver, leaving no room for processing the receiv
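For reference, a minimal sketch of creating the context with enough local slots (the app name and batch interval are placeholders):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // local[2]: one slot for the receiver, at least one left for processing
  val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))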

Re: Spark cluster standalone setup memory issue

2014-04-29 Thread pradeep_s
Also seeing logs related to memory towards the end. 14/04/29 15:07:54 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=1116418867 14/04/29 15:07:54 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 1064.6 MB) 14/04/29 15:07:54 INFO F

Re: Spark's behavior

2014-04-29 Thread Eduardo Costa Alfaia
Hi TD, We are not using stream context with master local, we have 1 Master and 8 Workers and 1 word source. The command line that we are using is: bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount spark://192.168.0.13:7077 On Apr 30, 2014, at 0:09, Tathagata Das wrot

Re: Python Spark on YARN

2014-04-29 Thread Guanhua Yan
Thanks, Matei. Will take a look at it. Best regards, Guanhua From: Matei Zaharia Reply-To: Date: Tue, 29 Apr 2014 14:19:30 -0700 To: Subject: Re: Python Spark on YARN This will be possible in 1.0 after this pull request: https://github.com/apache/spark/pull/30 Matei On Apr 29, 2014, at

Re: Spark's behavior

2014-04-29 Thread Tathagata Das
Strange! Can you just do lines.print() to print the raw data instead of doing the word count? Beyond that we can do two things. 1. Can you check the Spark stage UI to see whether there are stages running during the 30 second period you referred to? 2. If you upgrade to using Spark master branch (or Spark 1.

R: Spark's behavior

2014-04-29 Thread Eduardo Alfaia
Hi TD, I am GMT+8 from you. Tomorrow I will get the information that you asked for. Thanks - Original Message - From: "Tathagata Das" Sent: 30/04/2014 00.57 To: "user@spark.apache.org" Subject: Re: Spark's behavior Strange! Can you just do lines.print() to print the raw d

Fwd: MultipleOutputs IdentityReducer

2014-04-29 Thread Andre Kuhnen
Hello, I am trying to write multiple files with Spark, but I cannot find a way to do it. Here is the idea. val rddKeyValue : RDD[(String, String)] = rddlines.map( line => createKeyValue(line)) now I would like to save this as and all the values inside the file I tried to use this after the

RE: Shuffle Spill Issue

2014-04-29 Thread Liu, Raymond
Hi Daniel Thanks for your reply. Though I think reduceByKey will also do a map-side combine, thus the result is the same: for each partition, one entry per distinct word. In my case with the JavaSerializer, a 240MB dataset yields around 70MB of shuffle data. Only that shuffle

About pluggable storage roadmap?

2014-04-29 Thread Liu, Raymond
Hi I noticed that in the spark 1.0 meetup, on the 1.1-and-beyond roadmap, it mentioned support for pluggable storage strategies. We are also planning similar things, to enable the block manager to store data on more storage media. So is there any existing plan or design or rough idea on this

sparkR - is it possible to run sparkR on yarn?

2014-04-29 Thread phoenix bai
Hi all, I searched around, but failed to find anything about running sparkR on YARN. So, is it possible to run sparkR with YARN? Either with yarn-standalone or yarn-client mode. If so, is there any document that could guide me through the build & setup processes? I am desperate for some

JavaSparkConf

2014-04-29 Thread Soren Macbeth
There is a JavaSparkContext, but no JavaSparkConf object. I know SparkConf is new in 0.9.x. Is there a plan to add something like this to the java api? It's rather a bother to have things like setAll take a Scala Traversable[(String, String)] when using SparkConf from the java api. At a minimum addi

Re: parallelize for a large Seq is extreamly slow.

2014-04-29 Thread Earthson
I think the real problem is "spark.akka.frameSize". It is too small for passing the data. Every executor failed, and with no executors left, the task hangs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp
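A hedged sketch of raising that setting (the value is hypothetical; in this era it was interpreted in MB, with a default of 10):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("LargeSeq")                    // hypothetical app name
    .set("spark.akka.frameSize", "64")         // raise from the 10 MB default
  val sc = new SparkContext(conf)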

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread wxhsdp
i met with the same question when update to spark 0.9.1 (svn checkout https://github.com/apache/spark/) Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.jarOfClass(Ljava/lang/Class;)Lscala/collection/Seq; at org.apache.spark.examples.GroupByTest$.main(

How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
Hi I am running a WordCount program which counts words from HDFS, and I noticed that the serializer part of the code takes a lot of CPU time. On a 16core/32thread node, the total throughput is around 50MB/s with the JavaSerializer, and if I switch to the KryoSerializer, it doubles to around 100-150

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond wrote: > Hi > > I am running a WordCount program which count words from HDFS, and I > noticed that the serializer part of code takes a lot of CPU

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread Patrick Wendell
The signature of this function was changed in spark 1.0... is there any chance that somehow you are actually running against a newer version of Spark? On Tue, Apr 29, 2014 at 8:58 PM, wxhsdp wrote: > i met with the same question when update to spark 0.9.1 > (svn checkout https://github.com/apache

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
For all the tasks, say 32 task on total Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond wrote

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
Hm - I'm still not sure if you mean 100MB/s for each task = 3200MB/s across all cores -or- 3.1MB/s for each task = 100MB/s across all cores If it's the second one, that's really slow and something is wrong. If it's the first one this in the range of what I'd expect, but I'm no expert. On Tue, Apr

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
Hi Patrick, I'm a little confused about your comment that RDDs are not ordered. As far as I know, RDDs keep list of partitions that are ordered and this is why I can call RDD.take() and get the same first k rows every time I call it and RDD.take() returns the same entries as RDD.map(…).take() beca

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
By the way, to be clear, I run repartition first to make all data go through shuffle instead of running reduceByKey etc. directly (which reduces the data that needs to be shuffled and serialized); thus, say, all 50MB/s of data from HDFS will go to the serializer. (In fact, I also tried generating data in memory dir

Re: JavaSparkConf

2014-04-29 Thread Patrick Wendell
This class was made to be "java friendly" so that we wouldn't have to use two versions. The class itself is simple. But I agree adding java setters would be nice. On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth wrote: > There is a JavaSparkContext, but no JavaSparkConf object. I know SparkConf > i

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
Later case, total throughput aggregated from all cores. Best Regards, Raymond Liu -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Wednesday, April 30, 2014 1:22 PM To: user@spark.apache.org Subject: Re: How fast would you expect shuffle serialize to be? Hm -

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
You are right, once you sort() the RDD, then yes it has a well defined ordering. But that ordering is lost as soon as you transform the RDD, including if you union it with another RDD. On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim wrote: > Hi Patrick, > > I¹m a little confused about your comment

Re: JavaSparkConf

2014-04-29 Thread Soren Macbeth
My implication is that it isn't "java friendly" enough. The following methods return Scala objects: getAkkaConf getAll getExecutorEnv and the following methods require Scala objects as their params: setAll setExecutorEnv (both of the bulk methods) so -- while it is usable from Java, I wouldn't call it fr

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
Thanks for the quick response! To better understand it, the reason sorted RDD has a well-defined ordering is because sortedRDD.getPartitions() returns the partitions in the right order and each partition internally is properly sorted. So, if you have var rdd = sc.parallelize([2, 1, 3]); var sorte

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
If you call map() on an RDD it will retain the ordering it had before, but that is not necessarily a correct sort order for the new RDD. var rdd = sc.parallelize([2, 1, 3]); var sorted = rdd.map(x => (x, x)).sort(); // should be [1, 2, 3] var mapped = sorted.mapValues(x => 3 - x); // should be [2,
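A runnable Scala rendering of that example (sortByKey is the actual RDD API; sc is an existing SparkContext):

  import org.apache.spark.SparkContext._   // sortByKey / mapValues on pair RDDs

  val rdd    = sc.parallelize(Seq(2, 1, 3))
  val sorted = rdd.map(x => (x, x)).sortByKey()   // keys now ordered 1, 2, 3
  val mapped = sorted.mapValues(x => 3 - x)       // keys keep their order, but the values
                                                  // (2, 1, 0) are no longer sorted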

Re: sparkR - is it possible to run sparkR on yarn?

2014-04-29 Thread Shivaram Venkataraman
We don't have any documentation on running SparkR on YARN and I think there might be some issues that need to be fixed (The recent PySpark on YARN PRs are an example). SparkR has only been tested to work with Spark standalone mode so far. Thanks Shivaram On Tue, Apr 29, 2014 at 7:56 PM, phoenix

Setting spark.locality.wait.node parameter in interactive shell

2014-04-29 Thread Sai Prasanna
Hi, Any suggestions on the following issue? I have replication factor 3 in my HDFS. With 3 datanodes, I ran my experiments. Now I just added another node with no data on it. When I ran again, Spark launches non-local tasks on it and the time taken is more than what it took for the 3-node cluster. He

RE: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Liu, Raymond
I just tried to use the serializer to write objects directly in local mode with this code: val datasize = args(1).toInt val dataset = (0 until datasize).map( i => ("asmallstring", i)) val out: OutputStream = { new BufferedOutputStream(new FileOutputStream(args(2)), 1024 * 100)
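A hedged completion of that measurement, using the 0.9-era serializer API (treat the exact calls as assumptions; it also assumes a SparkContext already exists so that SparkEnv is initialized):

  import java.io.{BufferedOutputStream, FileOutputStream, OutputStream}
  import org.apache.spark.SparkEnv

  val datasize = args(1).toInt
  val dataset  = (0 until datasize).map(i => ("asmallstring", i))
  val out: OutputStream =
    new BufferedOutputStream(new FileOutputStream(args(2)), 1024 * 100)

  val ser    = SparkEnv.get.serializer.newInstance()   // the serializer configured via spark.serializer
  val stream = ser.serializeStream(out)
  val start  = System.nanoTime()
  stream.writeAll(dataset.iterator)                    // serialize every record to the file
  stream.close()
  println("serialized " + datasize + " records in " + (System.nanoTime() - start) / 1e9 + " s")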

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Mingyu Kim
Yes, that’s what I meant. Sure, the numbers might not actually be sorted, but the order of rows is semantically kept throughout non-shuffling transforms. I’m on board with you on union as well. Back to the original question, then: why is it important to coalesce to a single partition? When you un

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread wxhsdp
Hi Patrick, I checked out https://github.com/apache/spark/ this morning and built /spark/trunk with ./sbt/sbt assembly. Is it Spark 1.0? So how can I update my sbt file? The latest version in http://repo1.maven.org/maven2/org/apache/spark/ is 0.9.1. Thank you for your help. -- View this message