Re: Sessionization using updateStateByKey

2015-07-15 Thread Sean McNamara
I would just like to add that we do the very same/similar thing at Webtrends; updateStateByKey has been a life-saver for our sessionization use cases. Cheers, Sean On Jul 15, 2015, at 11:20 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote: Hi Cody, I’ve had success using upda
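
A minimal sketch of the updateStateByKey sessionization pattern under discussion, assuming a DStream of (userId, eventTimeMs) pairs, a hypothetical Session state class, and a 30-minute timeout; this is illustrative, not the Webtrends implementation, and it assumes checkpointing has been enabled via ssc.checkpoint(...):

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical per-user session state: event count plus last-seen timestamp.
    case class Session(eventCount: Int, lastSeenMs: Long)

    val sessionTimeoutMs = 30 * 60 * 1000L // assumed inactivity timeout

    // Called once per key per batch; returning None drops the key, ending the session.
    def updateSession(newEvents: Seq[Long], state: Option[Session]): Option[Session] = {
      val now = System.currentTimeMillis()
      state match {
        case Some(s) if newEvents.isEmpty && now - s.lastSeenMs > sessionTimeoutMs =>
          None // expire idle sessions so the state table doesn't grow without bound
        case _ =>
          val prev = state.getOrElse(Session(0, 0L))
          val lastSeen = if (newEvents.nonEmpty) newEvents.max else prev.lastSeenMs
          Some(Session(prev.eventCount + newEvents.size, lastSeen))
      }
    }

    // events: DStream[(String, Long)] of (userId, eventTimeMs) pairs
    def sessionize(events: DStream[(String, Long)]): DStream[(String, Session)] =
      events.updateStateByKey(updateSession _)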

Re: Counters in Spark

2015-02-13 Thread Sean McNamara
.map is just a transformation, so no work is actually performed until an action is run against it. Try adding a .count(), like so: inputRDD.map { x => counter += 1 }.count() In case it is helpful, here are the docs on exactly which operations are transformations and which are actions: htt
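
To illustrate the lazy-evaluation point, a small sketch (sc and inputRDD are assumed from the thread). Note that the snippet above increments a plain var, which only behaves as expected in local mode; this version uses the Spark 1.x accumulator API so increments from executors reach the driver:

    val counter = sc.accumulator(0) // accumulator, so executor-side increments are visible on the driver

    val mapped = inputRDD.map { x =>
      counter += 1 // a transformation only *defines* this work
      x
    }
    // Nothing has executed yet; counter.value is still 0 here.

    mapped.count()         // the action actually triggers the job
    println(counter.value) // now equals the number of input records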

Re: Spark (SQL) as OLAP engine

2015-02-03 Thread Sean McNamara
We have gone down a similar path at Webtrends; Spark has worked amazingly well for us in this use case. Our solution goes from REST directly into Spark and back out to the UI instantly. Here is the resulting product in case you are curious (and please pardon the self-promotion): https://www
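
A rough sketch of that REST-into-Spark-and-back pattern, assuming a long-lived SparkContext, data pre-cached in memory, and the Spark 1.2-era SQL API; this is not the actual product code, and the source path is hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val events = sqlContext.jsonFile("hdfs:///events/") // hypothetical source path
    events.registerTempTable("events")
    sqlContext.cacheTable("events") // keep it in memory so queries return near-instantly

    // A REST handler can then run parameterized queries against the cached table
    // (dimension is assumed validated upstream):
    def handleRequest(dimension: String) =
      sqlContext.sql(s"SELECT $dimension, COUNT(*) FROM events GROUP BY $dimension").collect()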

Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara
Hi Kane- http://spark.apache.org/docs/latest/tuning.html has excellent information that may be helpful. In particular, increasing the number of tasks may help, as well as confirming that you don’t have more data than you’re expecting landing on a key. Also, if you are using Spark < 1.2.0, set
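
A sketch of both suggestions, assuming pairs: RDD[(String, Int)]; the task count is illustrative, not a recommendation:

    val numTasks = 400 // more, smaller tasks means smaller per-task shuffle buffers

    // The second argument sets the number of reduce-side partitions for this shuffle.
    val counts = pairs.reduceByKey(_ + _, numTasks)

    // Check whether a single key is collecting an unexpected share of the data:
    val heaviest = pairs
      .mapValues(_ => 1L)
      .reduceByKey(_ + _, numTasks)
      .top(10)(Ordering.by(_._2)) // the ten keys with the most records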

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Sean McNamara
> Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka? Yes it can: "0.8.1 is fully compatible with 0.8." It is buried on this page: http://kafka.apache.org/documentation.html In addition to the pom version bump, SPARK-2492 would bring the kafka streaming receiver (which was originally
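
For context, the user-facing side of such a version bump would look roughly like this in sbt form (artifact names and versions here are assumptions; the actual change discussed was to Spark's own pom.xml):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming-kafka" % "1.2.0",
      "org.apache.kafka"  %% "kafka"                % "0.8.1.1"
    )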

Re: foreachPartition and task status

2014-10-14 Thread Sean McNamara
Are you using Spark Streaming? On Oct 14, 2014, at 10:35 AM, Salman Haq wrote: > Hi, > > In my application, I am successfully using foreachPartition to write large > amounts of data into a Cassandra database. > > What is the recommended practice if the application wants to know that the > ta
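
For reference, a sketch of the foreachPartition write pattern from the question, with a hypothetical CassandraClient standing in for whatever driver is actually used:

    rdd.foreachPartition { rows =>
      val client = CassandraClient.connect("cassandra-host") // hypothetical client; one connection per partition
      try {
        rows.foreach(client.insert)
      } finally {
        client.close()
      }
    }
    // foreachPartition is an action, so this call blocks until every partition has
    // been written, which is one straightforward way to know the tasks completed.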

Re: Spark can't find jars

2014-10-13 Thread Sean McNamara
I’ve run into this as well. I haven’t had a chance to troubleshoot exactly what was going on, but I got around it by building my app as a single uberjar. Sean On Oct 13, 2014, at 6:40 PM, HARIPRIYA AYYALASOMAYAJULA <aharipriy...@gmail.com> wrote: Hello, Can you check if the jar file
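
A sketch of the uberjar workaround using sbt-assembly (plugin and Spark versions are assumptions appropriate to the era):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt: mark Spark as "provided" so it is not bundled into the fat jar
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"

    // Then build with `sbt assembly` and submit the single resulting jar:
    //   spark-submit --class your.Main target/scala-2.10/your-app-assembly-1.0.jar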

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
(https://issues.apache.org/jira/browse/SPARK-2492). Thanks Jerry From: Abraham Jacob <abe.jac...@gmail.com> Sent: Saturday, October 11, 2014 6:57 AM To: Sean McNamara Cc: user@spark.apache.org Subject: Re: Spark Streaming KafkaUtils I

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
ion { return tuple2; } } ) ); }
JavaPairDStream unifiedStream;
if (kafkaStreams.size() > 1) {
  unifiedStream = jssc.union(kafkaStreams.get(0), kafkaStreams.subList(1, kafkaStreams.size()));
} else {
  unifiedStream = kafkaStreams.get(0);
}
unifiedStream.print();
jssc.start();
jssc.
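
The (truncated) Java above creates several Kafka receivers and unions them into one DStream; an equivalent Scala sketch, where ssc is a StreamingContext and the quorum, group, and topic names are assumptions:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 3 // assumed; one receiver per stream to parallelize ingestion
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("events" -> 1))
    }

    // Union into a single DStream, mirroring the if/else in the Java snippet.
    val unified = ssc.union(streams)
    unified.print()
    ssc.start()
    ssc.awaitTermination()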

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
Would you mind sharing the code leading up to your createStream? Are you also setting group.id? Thanks, Sean On Oct 10, 2014, at 4:31 PM, Abraham Jacob wrote: > Hi Folks, > > I am seeing some strange behavior when using the Spark Kafka connector in > Spark Streaming. > > I have a Kafka top
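
For reference, group.id can be set either as the third argument to createStream or explicitly through a kafkaParams map, as in this sketch (hosts and names are assumptions):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "zkhost:2181",
      "group.id"          -> "my-consumer-group" // consumers in distinct groups each re-read the topic
    )
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER)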

Re: balancing RDDs

2014-06-25 Thread Sean McNamara
unless task scheduling time > beats spark.locality.wait. Can cause overall low performance for all > subsequent tasks. > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi <https://twitter.com/mayur_rustagi> > > > > O

balancing RDDs

2014-06-23 Thread Sean McNamara
We have a use case where we’d like something to execute once on each node, and I thought it would be good to ask here. Currently we achieve this by setting the parallelism to the number of nodes and using a mod partitioner, roughly as sketched below: val balancedRdd = sc.parallelize( (0 until Settings.parallelism)
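
A self-contained sketch of that mod-partitioner approach, with Settings.parallelism replaced by an assumed node count; note that one-partition-per-node placement still depends on locality scheduling, per the reply above:

    import org.apache.spark.{Partitioner, SparkConf, SparkContext}

    // Route each integer key to partition key % numParts.
    class ModPartitioner(numParts: Int) extends Partitioner {
      override def numPartitions: Int = numParts
      override def getPartition(key: Any): Int = key.asInstanceOf[Int] % numParts
    }

    val sc = new SparkContext(new SparkConf().setAppName("balanced"))
    val numNodes = 4 // assumed cluster size; Settings.parallelism in the original

    val balancedRdd = sc
      .parallelize(0 until numNodes)
      .map(i => (i, i))                          // pair form so we can partition by key
      .partitionBy(new ModPartitioner(numNodes)) // exactly one element per partition

    // One task runs per partition; whether each lands on a distinct node
    // depends on scheduling and spark.locality.wait.
    balancedRdd.foreachPartition { iter =>
      () // node-local work replaces this placeholder
    }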