Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
Hi I get this exception when I run a Spark test case on my local machine: An exception or error caused a run to abort: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions(Lorg/apache/spark/streaming/dstream/DStream;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Orderi

RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
I use spark 1.1.0-SNAPSHOT val spark="org.apache.spark" %% "spark-core" % "1.1.0-SNAPSHOT" % "provided" excludeAll( -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: January-22-15 11:39 AM To: Adrian Mocanu Cc: u...@spark.inc

RE: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-22 Thread Adrian Mocanu
"org.apache.cassandra", "cassandra-thrift") val casAll = "org.apache.cassandra" % "cassandra-all" % "2.0.3" intransitive() val casThrift = "org.apache.cassandra" % "cassandra-thrift" % "2.0.3" intransitive() val sparkStre

does updateStateByKey return Seq() ordered?

2015-02-10 Thread Adrian Mocanu
I was looking at the updateStateByKey documentation. It passes in a values Seq which contains values that have the same key. I would like to know if there is any ordering to these values. My feeling is that there is no ordering, but maybe it does preserve RDD ordering. Example: RDD[ (a,2), (a,3), (a
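
Since the API documents no ordering for the values Seq, the safe assumption is to write the update function so it does not depend on order. A minimal sketch, not from the thread (the socket source and key are placeholders), of an order-insensitive update:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OrderInsensitiveUpdate {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("update-order-sketch").setMaster("local[2]"), Seconds(1))
    ssc.checkpoint("/tmp/checkpoint")                     // updateStateByKey needs a checkpoint dir

    val pairs = ssc.socketTextStream("localhost", 9999)   // placeholder source
      .map(line => ("a", line.length))

    // Sum is commutative and associative, so the result is the same
    // no matter what order the per-key values arrive in Seq[V].
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + values.sum)

    pairs.updateStateByKey(updateFunc).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```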

how to print RDD by key into file with groupByKey

2015-03-13 Thread Adrian Mocanu
Hi I have an RDD: RDD[(String, scala.Iterable[(Long, Int)])] which I want to print into a file, a file for each key string. I tried to trigger a repartition of the RDD by doing group by on it. The grouping gives RDD[(String, scala.Iterable[Iterable[(Long, Int)]])] so I flattened that: Rdd.gro
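
One approach that fits this shape of RDD, sketched below under the assumption that the set of keys is small: cache the RDD, then filter and save once per key (saveAsTextFile creates one output directory per key). The sample data and paths are placeholders, not the poster's.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions

object PerKeyOutput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-key-output").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(
      ("a", Iterable((1L, 10), (2L, 20))),
      ("b", Iterable((3L, 30)))
    )).cache()                                    // scanned once per key below

    for (k <- rdd.keys.distinct().collect()) {
      rdd.filter { case (key, _) => key == k }
         .flatMap { case (_, values) => values }  // keep only the (Long, Int) pairs
         .saveAsTextFile(s"/tmp/output/$k")       // one output directory per key
    }
    sc.stop()
  }
}
```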

partitionBy not working w HashPartitioner

2015-03-16 Thread Adrian Mocanu
Here's my use case: I read an array into an RDD and I use a hash partitioner to partition the RDD. This is the array type: Array[(String, Iterable[(Long, Int)])] topK:Array[(String, Iterable[(Long, Int)])] = ... import org.apache.spark.HashPartitioner val hashPartitioner=new HashPartitioner(10) v
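
A sketch of the usual gotcha here (illustrative data, not the poster's topK array): partitionBy returns a new RDD, so the partitioned result has to be assigned and used; the original RDD is left unchanged.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions

object HashPartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[2]"))
    val topK: Array[(String, Iterable[(Long, Int)])] =
      Array(("a", Iterable((1L, 1))), ("b", Iterable((2L, 2))))

    val rdd = sc.makeRDD(topK)
    val partitioned = rdd.partitionBy(new HashPartitioner(10))  // returns a NEW, partitioned RDD

    println(rdd.partitioner)                 // None: the original is untouched
    println(partitioned.partitioner)         // Some(HashPartitioner)
    println(partitioned.partitions.length)   // 10
    sc.stop()
  }
}
```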

updateStateByKey - Seq[V] order

2015-03-24 Thread Adrian Mocanu
Hi Does updateStateByKey pass elements to updateFunc (in Seq[V]) in the order in which they appear in the RDD? My guess is no, which means updateFunc needs to be commutative. Am I correct? I've asked this question before but there were no takers. Here's the scala docs for updateStateByKey /** *

writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi Is there a way to write all RDDs in a DStream to the same file? I tried this and got an empty file. I think it's bc the file is not closed i.e. ESMinibatchFunctions.writer.close() executes before the stream is created. Here's my code myStream.foreachRDD(rdd => { rdd.foreach(x => {
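
A sketch of one workaround (assuming each batch is small enough to collect to the driver): open the file in append mode inside foreachRDD, so nothing is closed before the streaming job actually runs. The source and file path are placeholders.

```scala
import java.io.{FileWriter, PrintWriter}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AppendEachBatch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("append-sketch").setMaster("local[2]"), Seconds(1))
    val myStream = ssc.socketTextStream("localhost", 9999)   // placeholder source

    myStream.foreachRDD { rdd =>
      val batch = rdd.collect()                 // runs on the driver, once per batch
      if (batch.nonEmpty) {
        val out = new PrintWriter(new FileWriter("/tmp/all-batches.txt", true)) // append mode
        try batch.foreach(line => out.println(line)) finally out.close()
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```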

Spark log shows only this line repeated: [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time X

2015-03-26 Thread Adrian Mocanu
Here's my log output from a streaming job. What is this? 09:54:27.504 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067504 09:54:27.505 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Call

EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out

2015-03-26 Thread Adrian Mocanu
Hi I need help fixing a time out exception thrown from ElasticSearch Hadoop. The ES cluster is up all the time. I use ElasticSearch Hadoop to read data from ES into RDDs. I get a collection of these RDDs which I traverse (with foreachRDD), creating more RDDs from each RDD in the collection.

RE: EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out

2015-03-26 Thread Adrian Mocanu
] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_75] this is a huge stack trace... but it keeps repeating What could this be from? From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March 26, 2015 2:10 PM To: u

ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi I use the ElasticSearch package for Spark and very often it times out reading data from ES into an RDD. How can I keep the connection alive (why doesn't it? Bug?) Here's the exception I get: org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: java.net.SocketTimeoutExceptio
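
A configuration sketch, assuming the native elasticsearch-spark integration (org.elasticsearch.spark._): the settings below are elasticsearch-hadoop options for raising the HTTP read timeout, retrying transient failures, and reducing scroll round trips — verify the exact keys and defaults against the connector version in use. Host, index, and type are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // provides sc.esRDD(...)

object EsTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-timeout-sketch").setMaster("local[2]")
      .set("es.nodes", "localhost")      // ES host (placeholder)
      .set("es.http.timeout", "5m")      // raise the HTTP read timeout
      .set("es.http.retries", "5")       // retry transient failures
      .set("es.scroll.size", "500")      // fewer round trips per scroll

    val sc = new SparkContext(conf)
    val rdd = sc.esRDD("myindex/mytype") // placeholder index/type
    println(rdd.count())
    sc.stop()
  }
}
```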

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
.map { case (str, map) => unwrapAndCountMentionsPerPost(map)} } } var uberRdd = esRdds(0) for (rdd <- esRdds) { uberRdd = uberRdd ++ rdd } uberRdd.map joinforeach(x => println(x)) } From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: April 22, 2

RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
at it uses scroll to get the data from ES; about 150 items at a time. Usual delay when I perform the same query from a browser plugin ranges from 1-5sec. Thanks From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: April 22, 2015 3:09 PM To: Adrian Mocanu Cc: u...@spark.incubator.apache.org S

RE: ElasticSearch enrich

2014-06-27 Thread Adrian Mocanu
b0c1, could you post your code? I am interested in your solution. Thanks Adrian From: boci [mailto:boci.b...@gmail.com] Sent: June-26-14 6:17 PM To: user@spark.apache.org Subject: Re: Elastic

RE: Spark java.lang.AbstractMethodError

2014-07-28 Thread Adrian Mocanu
I'm interested in this as well! From: Alex Minnaar [mailto:aminn...@verticalscope.com] Sent: July-28-14 3:40 PM To: user@spark.apache.org Subject: Spark java.lang.AbstractMethodError I am trying to run an example Spark standalone app with the following code import org.apache.spark.streaming._

[spark upgrade] Error communicating with MapOutputTracker when running test cases in latest spark

2014-09-10 Thread Adrian Mocanu
I use https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala to help me with testing. In Spark 0.9.1 my tests depending on TestSuiteBase worked fine. As soon as I switched to latest (1.0.1) all tests fail. My sbt import is: "org.apache.sp

using RDD result in another RDD

2014-11-12 Thread Adrian Mocanu
Hi I'd like to use the result of one RDD1 in another RDD2. Normally I would use something like a barrier to make the 2nd RDD wait till the computation of the 1st RDD is done, then include the result from RDD1 in the closure for RDD2. Currently I create another RDD, RDD3, out of the result of RDD1
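
The usual pattern, sketched below with placeholder data: run an action on RDD1 so it is fully computed, broadcast the (small) result, and reference the broadcast value inside RDD2's closure. The action itself acts as the barrier.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UseRdd1InRdd2 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]"))

    val rdd1 = sc.parallelize(1 to 100)
    val total = rdd1.reduce(_ + _)          // action: RDD1 finishes before anything below runs
    val totalBc = sc.broadcast(total)       // shipped once to every executor

    val rdd2 = sc.parallelize(1 to 10)
      .map(x => x.toDouble / totalBc.value) // RDD1's result used inside RDD2's closure
    rdd2.collect().foreach(println)
    sc.stop()
  }
}
```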

RE: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
My understanding is that the reason you have an Option is so you could filter out tuples when None is returned. This way your state data won't grow forever. -Original Message- From: spr [mailto:s...@yarcdata.com] Sent: November-12-14 2:25 PM To: u...@spark.incubator.apache.org Subject: R
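
A sketch of that eviction idea (the state type and idle threshold are made up for illustration): returning None from the update function is what removes a key from the state, so keys that have been silent for a while can be dropped explicitly.

```scala
// State per key: running count plus how many batches have passed with no new values.
case class KeyState(count: Int, idleBatches: Int)

val update = (values: Seq[Int], state: Option[KeyState]) => {
  val prev = state.getOrElse(KeyState(0, 0))
  if (values.nonEmpty) Some(KeyState(prev.count + values.sum, 0))
  else if (prev.idleBatches < 10) Some(prev.copy(idleBatches = prev.idleBatches + 1))
  else None                               // evict: the key disappears from the state DStream
}
// pairs.updateStateByKey(update)         // pairs: DStream[(K, Int)], hypothetical
```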

RE: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread Adrian Mocanu
You are correct; the filtering I’m talking about is done implicitly. You don’t have to do it yourself. Spark will do it for you and remove those entries from the state collection. From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: November-12-14 3:50 PM To: Adrian Mocanu Cc: spr; u

RE: RowMatrix.multiply() ?

2015-01-09 Thread Adrian Mocanu
I’m resurrecting this thread because I’m interested in doing transpose on a RowMatrix. There is this other thread too: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-td12562.html Which presents https://issues.apache.org/jira/browse/SPARK-3434 which is still in

reconciling best effort real time with delayed aggregation

2015-01-14 Thread Adrian Mocanu
Hi I have a question regarding design trade offs and best practices. I'm working on a real time analytics system. For simplicity, I have data with timestamps (the key) and counts (the value). I use DStreams for the real time aspect. Tuples w the same timestamp can be across various RDDs and I ju

RE: ETL on pyspark

2014-02-25 Thread Adrian Mocanu
Hi Matei If Spark crashes while writing the file, after recovery from the failure does it continue where it left off or will there be duplicates in the file? -A From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: February-24-14 4:20 PM To: u...@spark.incubator.apache.org Subject: Re: ETL o

RE: ETL on pyspark

2014-02-25 Thread Adrian Mocanu
duplicates. Old attempts will just be deleted. Matei On Feb 25, 2014, at 9:19 AM, Adrian Mocanu mailto:amoc...@verticalscope.com>> wrote: Hi Matei If Spark crashes while writing the file, after recovery from the failure does it continue where it left off or will there be duplicates in the

window every n elements instead of time based

2014-02-26 Thread Adrian Mocanu
Hi Is there a way to do window processing based not on time but on every n items going through the stream? Example: Window of size 3 with 1 item "duration" Stream data: 1,2,3,4,5,6,7 [1,2,3]=window 1 [2,3,4]=window 2 [3,4,5]=window 3 etc -Adrian

skipping ahead in RDD

2014-02-26 Thread Adrian Mocanu
Hi Scenario: Say I've been streaming tuples with Spark for 24 hours and one of the nodes fails. The RDD will be recomputed on the other Spark nodes and the streaming continues. I'm interested to know how I can skip the first 23 hours and jump in the stream to the last hour. Is this possible?

RE: window every n elements instead of time based

2014-02-27 Thread Adrian Mocanu
: u...@spark.incubator.apache.org Subject: Re: window every n elements instead of time based Currently, all in-built DStream operation is time-based windowing. We may provide count-based windowing in the future. On Wed, Feb 26, 2014 at 9:34 AM, Adrian Mocanu mailto:amoc...@verticalscope.com

is RDD failure transparent to stream consumer

2014-02-27 Thread Adrian Mocanu
Is RDD failure transparent to a spark stream consumer, except for the slowdown needed to recreate the RDD? After reading the papers on RDDs and DStreams from the spark homepage I believe it is, but I'd like a confirmation. Thanks -Adrian

RE: is RDD failure transparent to stream consumer

2014-02-28 Thread Adrian Mocanu
Would really like an answer to this. A `yes` or `no` would suffice. I'm talking about RDD failure in this context: myStream.foreachRDD(rdd=>rdd.foreach(tuple => println(tuple))) From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: February-27-14 12:19 PM To: u...@spark.incubator

RE: is RDD failure transparent to stream consumer

2014-02-28 Thread Adrian Mocanu
. However, the built-in save operators (e.g. saveAsTextFile) are automatically idempotent (they only create each output partition once). Matei On Feb 28, 2014, at 10:10 AM, Adrian Mocanu mailto:amoc...@verticalscope.com>> wrote: Would really like an answer to this. A `yes` or `no` would suffice

sstream.foreachRDD

2014-03-04 Thread Adrian Mocanu
Hi I've noticed that if in the driver of a spark app I have a foreach and add stream elements to a list from the stream, the list contains no elements at the end of the processing. Take this sample code: val list = new java.util.ArrayList[Any]() sstream.foreachRDD (rdd => rdd.foreach( tuple => list.add
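
A sketch of what is going on and one way around it: rdd.foreach runs on the executors against a serialized copy of the list, so the driver's copy never changes; the body of foreachRDD, by contrast, runs on the driver, so collecting each (small) batch there makes the mutation visible. The source is a placeholder.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CollectToDriver {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("collect-sketch").setMaster("local[2]"), Seconds(1))
    val sstream = ssc.socketTextStream("localhost", 9999)   // placeholder source

    val list = ArrayBuffer[String]()          // lives on the driver
    sstream.foreachRDD { rdd =>
      list ++= rdd.collect()                  // foreachRDD body runs on the driver
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```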

how to add rddID to tuples in DStream

2014-03-04 Thread Adrian Mocanu
Hi I've another simple question: I have a DStream and want to add the ID of each RDD to the tuples contained in that RDD and return a DStream of them. So far I've figured out how to do this via foreachRDD and return RDDs. How to create a DStream from these RDDs? Or even better, how to use .map to
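
A sketch of the transform-based approach (placeholder source): transform exposes each batch's underlying RDD, so its id can be attached to every tuple while the result stays a DStream, with no foreachRDD needed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TagWithRddId {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("rdd-id-sketch").setMaster("local[2]"), Seconds(1))
    val stream = ssc.socketTextStream("localhost", 9999)   // placeholder source

    val tagged = stream.transform { rdd =>
      val id = rdd.id                  // resolved once per batch, on the driver
      rdd.map(element => (id, element))
    }
    tagged.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```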

slf4j and log4j loop

2014-03-14 Thread Adrian Mocanu
Hi Have you encountered a slf4j and log4j loop when using Spark? I pull a few packages via sbt. The Spark package uses slf4j-log4j12.jar and another package uses log4j-over-slf4j.jar, which creates the circular loop between the 2 loggers and thus the exception below. Do you know of a fix for this
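
A build.sbt sketch of the usual fix (the version number is a placeholder): keep only one of the two bindings on the classpath by excluding slf4j-log4j12 from Spark, or equivalently excluding log4j-over-slf4j from the other dependency.

```scala
// build.sbt — exclude Spark's slf4j-log4j12 so only log4j-over-slf4j remains
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" excludeAll (
  ExclusionRule("org.slf4j", "slf4j-log4j12")
)
```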

RE: slf4j and log4j loop

2014-03-14 Thread Adrian Mocanu
fix: https://github.com/apache/spark/pull/107 -- Sean Owen | Director, Data Science | London On Fri, Mar 14, 2014 at 1:04 PM, Adrian Mocanu mailto:amoc...@verticalscope.com>> wrote: Hi Have you encountered a slf4j and log4j loop when using Spark? I pull a few packages via sbt. Spark packa

is collect exactly-once?

2014-03-17 Thread Adrian Mocanu
Hi Quick question here, I know that .foreach is not idempotent. I am wondering if collect() is idempotent? Meaning that once I've collect()-ed, if a spark node crashes, I can't get the same values from the stream ever again. Thanks -Adrian

partitioning via groupByKey

2014-03-19 Thread Adrian Mocanu
When you partition via groupByKey, tuples (parts of the RDD) are moved from some node to another node based on key (hash partitioning). Do the tuples remain part of 1 RDD as before but moved to different nodes, or does this shuffling create, say, several RDDs which will have parts of the original

how to sort within DStream or merge consecutive RDDs

2014-03-19 Thread Adrian Mocanu
Hi I would like to know if it is possible to sort within DStream. I know it's possible to sort within an RDD and I know it's impossible to sort within the entire DStream but I would be satisfied with sorting across 2 RDDs: 1) merge 2 consecutive RDDs 2) reduce by key + sort the merged data 3) ta
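
A sketch of the "merge two consecutive RDDs, then reduce and sort" idea (batch interval, source, and keys are placeholders): a non-overlapping window of two batch intervals combines consecutive RDDs, and sorting happens per windowed RDD inside transform.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._               // ordered/pair RDD functions on older Spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream functions on older Spark

object SortWithinWindow {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-sort-sketch").setMaster("local[2]"), Seconds(5))
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(line => (line, 1))

    val merged = pairs
      .window(Seconds(10), Seconds(10))      // each windowed RDD = 2 consecutive 5s batches
      .reduceByKey(_ + _)
      .transform(rdd => rdd.sortByKey())     // sort within each merged (windowed) RDD
    merged.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```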

is sorting necessary after join of sorted RDD

2014-03-20 Thread Adrian Mocanu
Is sorting necessary after a join on 2 sorted RDD? Example: val stream1= mystream.transform(rdd=>rdd.sortByKey(true)) val stream2= mystream.transform(rdd=>rdd.sortByKey(true)) val stream3= stream1.join(stream2) Will stream3 be ordered? -Adrian

DStream spark paper

2014-03-20 Thread Adrian Mocanu
I looked over the specs on page 9 from http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf The first paragraph mentions the window size is 30 seconds "Word-Count, which performs a sliding window count over 30s; and TopKCount, which finds the k most frequent words over the past 30s.

remove duplicates

2014-03-24 Thread Adrian Mocanu
I have a DStream like this: ..RDD[a,b],RDD[b,c].. Is there a way to remove duplicates across the entire DStream? Ie: I would like the output to be (by removing one of the b's): ..RDD[a],RDD[b,c].. or ..RDD[a,b],RDD[c].. Thanks -Adrian

[bug?] streaming window unexpected behaviour

2014-03-24 Thread Adrian Mocanu
I have what I would call unexpected behaviour when using window on a stream. I have 2 windowed streams with a 5s batch interval. One window stream is (5s,5s)=smallWindow and the other (10s,5s)=bigWindow What I've noticed is that the 1st RDD produced by bigWindow is incorrect and is of the size 5s

RE: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Adrian Mocanu
ing window duration > sliding interval). TD On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu wrote: > I have what I would call unexpected behaviour when using window on a stream. > > I have 2 windowed streams with a 5s batch interval. One window stream > is (5s,5s)=smallWindow and the

RE: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Adrian Mocanu
Let me rephrase that, Do you think it is possible to use an accumulator to skip the first few incomplete RDDs? -Original Message- From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-25-14 9:57 AM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: RE

closures & moving averages (state)

2014-03-26 Thread Adrian Mocanu
I'm passing a moving average function during the map phase like this: val average= new Sma(window=3) stream.map(x=> average.addNumber(x)) where class Sma extends Serializable { .. } I also tried to put creation of object average in an object like I saw in another post: object Average {
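
A sketch of why this loses state and of one alternative (key name, window size, and source are placeholders): the Sma instance captured in map is serialized into every task, so each task updates its own copy; keeping the last N values per key as updateStateByKey state keeps the average in one place.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulMovingAverage {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("sma-sketch").setMaster("local[2]"), Seconds(1))
    ssc.checkpoint("/tmp/sma-checkpoint")

    val values = ssc.socketTextStream("localhost", 9999)   // placeholder source
      .map(line => ("series-1", line.toDouble))

    val window = 3
    val update = (newValues: Seq[Double], state: Option[Seq[Double]]) => {
      // State = the last `window` values seen for this key.
      Some((state.getOrElse(Seq.empty) ++ newValues).takeRight(window))
    }
    val averages = values.updateStateByKey(update)
      .mapValues(recent => recent.sum / recent.size)
    averages.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```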

RE: closures & moving averages (state)

2014-03-26 Thread Adrian Mocanu
one of the Spark creators to comment on this. -A From: Benjamin Black [mailto:b...@b3k.us] Sent: March-26-14 11:50 AM To: user@spark.apache.org Subject: Re: closures & moving averages (state) Perhaps you want reduce rather than map? On Wednesday, March 26, 2014, Adrian Mocanu mailto:amo

RE: streaming questions

2014-03-26 Thread Adrian Mocanu
Hi Diana, I'll answer Q3: You can check if an RDD is empty in several ways. Someone here mentioned that using an iterator was safer: val isEmpty = rdd.mapPartitions(iter => Iterator(! iter.hasNext)).reduce(_&&_) You can also check with a fold or rdd.count rdd.reduce(_ + _) // can't handle em

function state lost when next RDD is processed

2014-03-27 Thread Adrian Mocanu
Is there a way to pass a custom function to spark to run it on the entire stream? For example, say I have a function which sums up values in each RDD and then across RDDs. I've tried with map, transform, reduce. They all apply my sum function on 1 RDD. When the next RDD comes the function start

StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Found this transform fn in StreamingContext which takes in a DStream[_] and a function which acts on each of its RDDs Unfortunately I can't figure out how to transform my DStream[(String,Int)] into DStream[_] /*** Create a new DStream in which each RDD is generated by applying a function on RD

RE: StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Please disregard; I didn't see the Seq wrapper. From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-27-14 11:57 AM To: u...@spark.incubator.apache.org Subject: StreamingContext.transform on a DStream Found this transform fn in StreamingContext which takes in a DStream[_]

how to create a DStream from bunch of RDDs

2014-03-27 Thread Adrian Mocanu
I create several RDDs by merging several consecutive RDDs from a DStream. Is there a way to add these new RDDs to a DStream? -Adrian
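
One option, sketched with placeholder data: queueStream turns an existing collection of RDDs into a DStream, serving one queued RDD per batch interval.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RddsToDStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("queue-sketch").setMaster("local[2]"), Seconds(1))

    val rdd1 = ssc.sparkContext.makeRDD(Seq(("a", 1), ("b", 2)))
    val rdd2 = ssc.sparkContext.makeRDD(Seq(("c", 3)))
    val queue = mutable.Queue(rdd1, rdd2)

    val stream = ssc.queueStream(queue)   // one queued RDD per batch interval
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```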

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
I'd like to resurrect this thread since I don't have an answer yet. From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-27-14 10:04 AM To: u...@spark.incubator.apache.org Subject: function state lost when next RDD is processed Is there a way to pass a custom function t

RE: function state lost when next RDD is processed

2014-03-28 Thread Adrian Mocanu
t when next RDD is processed As long as the amount of state being passed is relatively small, it's probably easiest to send it back to the driver and to introduce it into RDD transformations as the zero value of a fold. On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu mailto:amoc...@verti

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I think you should sort each RDD -Original Message- From: yh18190 [mailto:yh18...@gmail.com] Sent: March-28-14 4:44 PM To: u...@spark.incubator.apache.org Subject: Re: Splitting RDD and Grouping together to perform computation Hi, Thanks Nanzhu.I tried to implement your suggestion on fol

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
I say you need to remap so you have a key for each tuple that you can sort on. Then call rdd.sortByKey(true) like this mystream.transform(rdd => rdd.sortByKey(true)) For this fn to be available you need to import org.apache.spark.rdd.OrderedRDDFunctions -Original Message- From: yh18190 [

RE: Splitting RDD and Grouping together to perform computation

2014-03-28 Thread Adrian Mocanu
Not sure how to change your code because you'd need to generate the keys where you get the data. Sorry about that. I can tell you where to put the code to remap and sort though. import org.apache.spark.rdd.OrderedRDDFunctions val res2=reduced_hccg.map(_._2) .map( x=> (newkey,x)).sortByKey(true)

writing booleans w Calliope

2014-04-17 Thread Adrian Mocanu
Has anyone managed to write Booleans to Cassandra from an RDD with Calliope? My Booleans give compile time errors: expression of type List[Any] does not conform to expected type Types.CQLRowValues CQLColumnValue is defined as ByteBuffer: type CQLColumnValue = ByteBuffer For now I convert them to

default spark partitioner

2014-04-23 Thread Adrian Mocanu
How does the default spark partitioner partition RDD data? Does it keep the data in order? I'm asking because I'm generating an RDD by hand via `ssc.sparkContext.makeRDD(collection.toArray)` and I collect and iterate over what I collect, but the data is in a different order than in the initial

RE: default spark partitioner

2014-04-23 Thread Adrian Mocanu
should keep them in order, but what kind of collection do you have? Maybe toArray changes the order. Matei On Apr 23, 2014, at 8:21 AM, Adrian Mocanu mailto:amoc...@verticalscope.com>> wrote: How does the default spark partitioner partition RDD data? Does it keep the data in order? I&

reduceByKeyAndWindow - spark internals

2014-04-24 Thread Adrian Mocanu
If I have this code: val stream1= doublesInputStream.window(Seconds(10), Seconds(2)) val stream2= stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10)) Does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10 second window? Example, in the first 10 secs stream1 will have

FW: reduceByKeyAndWindow - spark internals

2014-04-25 Thread Adrian Mocanu
Any suggestions where I can find this in the documentation or elsewhere? Thanks From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: April-24-14 11:26 AM To: u...@spark.incubator.apache.org Subject: reduceByKeyAndWindow - spark internals If I have this code: val stream1

What is the recommended way to store state across RDDs?

2014-04-28 Thread Adrian Mocanu
What is the recommended way to store state across RDDs as you traverse a DStream and go from 1 RDD to another? Consider a trivial example of moving average. Between RDDs should the average be saved in a cache (ie redis) or is there another global var type available in Spark? Accumulators are on

What is Seq[V] in updateStateByKey?

2014-04-29 Thread Adrian Mocanu
What is Seq[V] in updateStateByKey? Does this store the collected tuples of the RDD in a collection? Method signature: def updateStateByKey[S: ClassTag] ( updateFunc: (Seq[V], Option[S]) => Option[S] ): DStream[(K, S)] In my case I used Seq[Double] assuming a sequence of Doubles in the RDD; the
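
To make the signature concrete, a small sketch with hypothetical types: for a DStream[(String, Double)], V is Double, so the update function receives Seq[Double] — only the values that arrived for that key in the current batch; anything accumulated across batches lives in the state parameter S, not in Seq[V].

```scala
// Hypothetical DStream[(String, Double)]: V = Double, S = Double (a running sum).
val updateFunc: (Seq[Double], Option[Double]) => Option[Double] =
  (newValues, runningSum) => Some(runningSum.getOrElse(0.0) + newValues.sum)

// doubles.updateStateByKey(updateFunc)   // doubles: DStream[(String, Double)], hypothetical
```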

RE: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Adrian Mocanu
e new thing arriving for each key at each time slice. On Tue, Apr 29, 2014 at 7:50 PM, Adrian Mocanu mailto:amoc...@verticalscope.com>> wrote: > What is Seq[V] in updateStateByKey? > > Does this store the collected tuples of the RDD in a collection? > > > > Method signature

RE: What is Seq[V] in updateStateByKey?

2014-05-01 Thread Adrian Mocanu
ty trivial but hey. On Wed, Apr 30, 2014 at 9:23 PM, Adrian Mocanu wrote: > Hi TD, > > Why does the example keep recalculating the count via fold? > > Wouldn’t it make more sense to get the last count in values Seq and > add 1 to it and save that as current count? > >

updateStateByKey example not using correct input data?

2014-05-01 Thread Adrian Mocanu
I'm trying to understand updateStateByKey. Here's an example I'm testing with: Input data: DStream( RDD( ("a",2) ), RDD( ("a",3) ), RDD( ("a",4) ), RDD( ("a",5) ), RDD( ("a",6) ), RDD( ("a",7) ) ) Code: val updateFunc = (values: Seq[Int], state: Option[StateClass]) => { val previousState

range partitioner with updateStateByKey

2014-05-01 Thread Adrian Mocanu
If I use a range partitioner, will this make updateStateByKey take the tuples in order? Right now I see them not being taken in order (most of them are ordered but not all) -Adrian

RE: range partitioner with updateStateByKey

2014-05-02 Thread Adrian Mocanu
order? TD On Thu, May 1, 2014 at 2:35 PM, Adrian Mocanu mailto:amoc...@verticalscope.com>> wrote: If I use a range partitioner, will this make updateStateByKey take the tuples in order? Right now I see them not being taken in order (most of them are ordered but not all) -Adrian

another updateStateByKey question

2014-05-02 Thread Adrian Mocanu
Has anyone else noticed that sometimes the same tuple calls update state function twice? I have 2 tuples with the same key in 1 RDD part of DStream: RDD[ (a,1), (a,2) ] When the update function is called the first time Seq[V] has data: 1, 2 which is correct: StateClass(3,2, ArrayBuffer(1, 2)) The

RE: another updateStateByKey question

2014-05-02 Thread Adrian Mocanu
PM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: Re: another updateStateByKey question Could be a bug. Can you share a code with data that I can use to reproduce this? TD On May 2, 2014 9:49 AM, "Adrian Mocanu" mailto:amoc...@verticalscope.com>> wrote:

RE: another updateStateByKey question - updated w possible Spark bug

2014-05-05 Thread Adrian Mocanu
user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: Re: another updateStateByKey question Could be a bug. Can you share a code with data that I can use to reproduce this? TD On May 2, 2014 9:49 AM, "Adrian Mocanu" mailto:amoc...@verticalscope.com>> wrote: Has

RE: another updateStateByKey question - updated w possible Spark bug

2014-05-05 Thread Adrian Mocanu
Forgot to mention my batch interval is 1 second: val ssc = new StreamingContext(conf, Seconds(1)) hence the Thread.sleep(1100) From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: May-05-14 12:06 PM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: RE: another

missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Adrian Mocanu
Hey guys, I've asked before, in Spark 0.9 - I now use 0.9.1, about removing log4j dependency and was told that it was gone. However I still find it part of zookeeper imports. This is fine since I exclude it myself in the sbt file, but another issue arises. I wonder if anyone else has run into th

same log4j slf4j error in spark 0.9.1

2014-05-15 Thread Adrian Mocanu
I recall someone from the Spark team (TD?) saying that Spark 0.9.1 will change the logger and the circular loop error between slf4j and log4j wouldn't show up. Yet on Spark 0.9.1 I still get SLF4J: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class path, preempting StackOverflowEr

RE: missing method in my slf4j after excluding Spark ZK log4j

2014-05-15 Thread Adrian Mocanu
;s a first guess. I'd say consult mvn dependency:tree, but you're on sbt and I don't know the equivalent. On Mon, May 12, 2014 at 3:51 PM, Adrian Mocanu wrote: > Hey guys, > > I've asked before, in Spark 0.9 - I now use 0.9.1, about removing > log4j dependenc

RE: same log4j slf4j error in spark 0.9.1

2014-05-16 Thread Adrian Mocanu
n your project is bringing in dependency in the classpath? Alternatively, if you dont want slf4j-log4j12 from spark, you can safely exclude in the dependencies. TD On Thu, May 8, 2014 at 12:56 PM, Adrian Mocanu mailto:amoc...@verticalscope.com>> wrote: I recall someone from the Spark team

RE: slf4j and log4j loop

2014-05-16 Thread Adrian Mocanu
Please ignore. This was sent last week not sure why it arrived so late. -Original Message- From: amoc [mailto:amoc...@verticalscope.com] Sent: May-09-14 10:13 AM To: u...@spark.incubator.apache.org Subject: Re: slf4j and log4j loop Hi Patrick/Sean, Sorry to resurrect this thread, but aft

tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
I have a few test cases for Spark which extend TestSuiteBase from org.apache.spark.streaming. The tests run fine on my machine but when I commit to repo and run the tests automatically with bamboo the test cases fail with these errors. How to fix? 21-May-2014 16:33:09 [info] StreamingZigZagSp

RE: tests that run locally fail when run through bamboo

2014-05-21 Thread Adrian Mocanu
java.net.BindException: Address already in use Is there a way to set these connections up so that they don't all start on the same port (that's my guess for the root cause of the issue)? From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: May-21-14 4:58 PM To: u...@spark.incubator.apache
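
A sketch of one mitigation for parallel test runs on a shared CI box (verify the keys against the Spark version in use; spark.ui.enabled only exists in newer releases, and on older ones giving each build a distinct spark.ui.port achieves much the same thing):

```scala
import org.apache.spark.SparkConf

val testConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("streaming-test")
  .set("spark.ui.enabled", "false")   // no web UI, so no fight over port 4040
  .set("spark.driver.port", "0")      // 0 = bind to any free port
```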

How to turn off MetadataCleaner?

2014-05-22 Thread Adrian Mocanu
Hi After using Spark's TestSuiteBase to run some tests I've noticed that at the end, after finishing all tests, the cleaner is still running and outputs the following periodically: INFO o.apache.spark.util.MetadataCleaner - Ran metadata cleaner for SHUFFLE_BLOCK_MANAGER I use method testOperat

RE: How to turn off MetadataCleaner?

2014-05-23 Thread Adrian Mocanu
at version are you using. If it is 0.9.1, I can see that the cleaner in ShuffleBlockManager<https://github.com/apache/spark/blob/v0.9.1/core/src/main/scala/org/apache/spark/storage/ShuffleBlockManager.scala> is not stopped, so it is a bug. TD On Thu, May 22, 2014 at 9:24 AM, Adrian Moc