Re: Spark on Apache Ignite?

2016-01-11 Thread RodrigoB
Although I haven't worked explicitly with either, they do seem to differ in design and consequently in usage scenarios. Ignite is claimed to be a pure in-memory distributed database. With Ignite, updating existing keys is something that is self-managed compared with Tachyon. In Tachyon, once a val

Scala 2.11 and Akka 2.4.0

2015-12-01 Thread RodrigoB
Hi, I'm currently trying to build Spark with Scala 2.11 and Akka 2.4.0. I've changed the main pom.xml files to the corresponding Akka version and am getting the following exception when starting the master in standalone mode: Exception Details: Location: akka/dispatch/Mailbox.processAllSystemMessage

Re: Scala 2.11 and Akka 2.4.0

2015-12-07 Thread RodrigoB
Hi Manas, Thanks for the reply. I've done that. The problem lies with the Spark + Akka 2.4.0 build. It seems the Maven shade plugin is altering some class files and breaking the Akka runtime. The Spark build on Scala 2.11 using SBT also seems broken; I'm getting build errors with sbt due to the issues f

Running in local mode as SQL engine - what to optimize?

2016-09-29 Thread RodrigoB
Hi all, For several reasons which I won't elaborate on (yet), we're using Spark in local mode as an in-memory SQL engine: we retrieve data from Cassandra, execute SQL queries, and return the results to the client - so no cluster, no worker nodes. I'm well aware local mode has always been considered a test
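
For illustration, a minimal sketch of that setup, assuming the DataStax spark-cassandra-connector and hypothetical keyspace/table names:

    import org.apache.spark.sql.SparkSession

    object LocalSqlEngine {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")                          // no cluster, no worker nodes
          .appName("local-sql-engine")
          .config("spark.cassandra.connection.host", "127.0.0.1")
          .getOrCreate()

        // Register the Cassandra table as a temp view and serve SQL against it.
        spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
          .load()
          .createOrReplaceTempView("my_table")

        spark.sql("SELECT key, count(*) FROM my_table GROUP BY key").show()
      }
    }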

Re: Spark Streaming checkpoint recovery causes IO re-execution

2015-01-20 Thread RodrigoB
Hi Hannes, Good to know I'm not alone in this boat. Sorry about not posting back; I haven't been on the user list in a while. It's on my agenda to get past this issue, as it will be very important for our recovery implementation. I have done an internal proof of concept but without any conclusions so

Spark Streaming - Minimizing batch interval

2015-03-25 Thread RodrigoB
I've been given a feature requirement that means processing events at a latency lower than 0.25 ms, meaning I would have to make sure that Spark Streaming gets new events from the messaging layer within that period of time. Has anyone achieved such numbers using a Spark cluster? Or would thi
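
For reference, a minimal sketch of where the batch interval is set; the 50 ms value is only an illustrative assumption, and Spark's micro-batch model (one scheduled job per batch) makes sub-millisecond intervals unrealistic:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}

    val conf = new SparkConf().setAppName("low-latency-test").setMaster("local[*]")
    // Batch duration is fixed at context creation time.
    val ssc = new StreamingContext(conf, Milliseconds(50))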

Re: Spark streaming on load run - How to increase single node capacity?

2014-06-05 Thread RodrigoB
Hi Wayne, Tnks for the reply. I did raise the thread max before posting, based on your previous comment on another post (using ulimit -n 2048). That seemed to have helped with the out-of-memory issue. I'm curious whether this is standard procedure for scaling a Spark node's resources vertically or is it just

RE: range partitioner with updateStateByKey

2014-06-06 Thread RodrigoB
Hi TD, I have the same question: I need the workers to process in arrival order, since each update of a state is based on the previous one. tnks in advance. Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/range-partitioner-with-updateStateByKey-tp5190p7123

Re: NoSuchElementException: key not found

2014-06-06 Thread RodrigoB
Hi Tathagata, I'm seeing the same issue on an overnight load run with Kafka streaming (6,000 msgs/sec) and a 500 ms batch size. Again, it's occasional and only happens after a few hours, I believe. I'm using updateStateByKey. Any comment will be very welcome. tnks, Rod -- View this message in conte

Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call myDStream.foreachRDD so that I can collect each RDD in the driver app at runtime, and inside I do something like this: myDStream.foreachRDD(rdd => { var someCol = Seq[MyType]() foreach(kv =>{ someC
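
A hedged reconstruction of what that truncated snippet could look like end to end, assuming the DataStax spark-cassandra-connector and hypothetical type/keyspace/table names:

    import com.datastax.spark.connector._   // provides saveToCassandra on RDDs
    import org.apache.spark.streaming.dstream.DStream

    case class MyType(key: String, value: Long)   // hypothetical row type

    def save(myDStream: DStream[(String, Long)]): Unit = {
      myDStream.foreachRDD { rdd =>
        // Collect each RDD on the driver, build a local collection...
        var someCol = Seq[MyType]()
        rdd.collect().foreach { case (k, v) => someCol = someCol :+ MyType(k, v) }
        // ...then parallelize it back into an RDD in order to save it.
        rdd.sparkContext.parallelize(someCol).saveToCassandra("my_ks", "my_table")
      }
    }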

Re: Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi Luis, Yes, it's actually an output of the previous RDD. Have you ever used the Cassandra Spark driver in the driver app? I believe these limitations stem from that - it's designed to save RDDs from the nodes. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n

Re: Cassandra driver Spark question

2014-07-16 Thread RodrigoB
Tnks to both for the comments and the debugging suggestion, which I will try. Regarding your comment: yes, I do agree the current solution was not efficient, but to use the saveToCassandra method I need an RDD, hence the parallelize call. I finally got directed by Piotr to use the CassandraConnec
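
A minimal sketch of the CassandraConnector approach mentioned here, with hypothetical host, keyspace, and CQL; it executes statements through a session directly, without building an RDD:

    import com.datastax.spark.connector.cql.CassandraConnector
    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
    CassandraConnector(conf).withSessionDo { session =>
      // Runs a plain CQL statement; no parallelize/saveToCassandra needed.
      session.execute("INSERT INTO my_ks.my_table (key, value) VALUES ('k1', 42)")
    }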

UpdateStateByKey assumptions

2014-07-29 Thread RodrigoB
Hi all, I'm currently having a load issue with the updateStateByKey function. It seems to run with considerable delay for a few of the state objects when their number increases. I have a 1 sec batch size receiving events from a Kafka stream, which creates state objects and also updates them consequen
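
For context, a minimal sketch of the updateStateByKey pattern described: one state object per key, updated from the events of the current batch (the Long counter state is an illustrative assumption):

    import org.apache.spark.streaming.dstream.DStream

    def track(events: DStream[(String, Long)]): DStream[(String, Long)] = {
      // Called once per key per batch: fold the batch's new values into the state.
      val update = (newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum)
      events.updateStateByKey(update)   // requires ssc.checkpoint(...) to be set
    }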

Re: UpdateStateByKey assumptions

2014-07-29 Thread RodrigoB
Just an update on this. Looking into the Spark logs, it seems that some partitions are not found and are recomputed, which gives the impression that those are related to the delayed updateStateByKey calls. I'm seeing something like: log line 1 - Partition rdd_132_1 not found, computing it log line N - Found

Re: Spark Streaming Checkpoint: SparkContext is not serializable class

2014-07-30 Thread RodrigoB
Hi, I don't think you can do that. The code inside the foreach runs at the node level, and you're trying to create another RDD within the node's specific execution context. Try to load the text file before creating the streaming context, in the driver app, and use it later as a cached RDD in the following
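
A minimal sketch of that suggestion, with hypothetical paths and join keys: the reference RDD is loaded and cached on the driver before the streaming logic, then reused inside transform:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[*]", "cached-ref")
    // Load and cache the reference data once, on the driver, before streaming.
    val ref = sc.textFile("/data/reference.txt").map(l => (l, 1)).cache()

    val ssc = new StreamingContext(sc, Seconds(1))
    val stream = ssc.socketTextStream("localhost", 9999).map(l => (l, 1))
    // Reuse the cached RDD inside each batch instead of recreating it per node.
    val joined = stream.transform(rdd => rdd.join(ref))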

Re: streaming window not behaving as advertised (v1.0.1)

2014-08-01 Thread RodrigoB
Hi TD, I've also been fighting this issue, only to find the exact same solution you are suggesting. Too bad I didn't find either the post or the issue sooner. I'm using a 1-second batch with N Kafka events per batch (1-to-1 with the state objects) and only calling updateStateByKey f

Re: Running a task once on each executor

2014-08-11 Thread RodrigoB
Hi Christopher, I also need a single function call at the node level. Your suggestion makes sense as a solution to the requirement, but it still feels like a workaround - this check will get called on every row... Also, having static members and methods created specially on a multi-t
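
One common workaround (an assumption here, not necessarily Christopher's exact suggestion) is a JVM-level singleton whose lazy val initializer runs exactly once per executor process, however many rows touch it:

    object PerExecutorInit {
      lazy val initialized: Boolean = {
        // One-time, node-level setup goes here (open connections, load caches...)
        println(s"initializing on executor ${java.net.InetAddress.getLocalHost}")
        true
      }
    }

    // Inside a task: touching the lazy val triggers the init at most once per JVM.
    // rdd.foreachPartition { _ => PerExecutorInit.initialized }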

Spark Streaming checkpoint recovery causes IO re-execution

2014-08-21 Thread RodrigoB
Dear Spark users, We have a Spark Streaming application which receives events from Kafka and has an updateStateByKey call that executes IO, like writing to Cassandra or sending events to other systems. Upon metadata checkpoint recovery (before the data checkpoint occurs), all lost RDDs get recompu
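
A minimal sketch of the problematic pattern being described, with illustrative types; the update function performs IO as a side effect, so any lineage recomputation after recovery re-executes that IO:

    import org.apache.spark.streaming.dstream.DStream

    def updateWithIo(events: DStream[(String, Long)]): DStream[(String, Long)] = {
      events.updateStateByKey { (newValues: Seq[Long], state: Option[Long]) =>
        val next = state.getOrElse(0L) + newValues.sum
        if (newValues.nonEmpty) {
          // Side-effecting IO inside the update function - re-executed whenever
          // the lineage is recomputed, e.g. a Cassandra write or an outbound event.
        }
        Some(next)
      }
    }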

Re: Low Level Kafka Consumer for Spark

2014-08-25 Thread RodrigoB
Hi Dibyendu, My colleague has taken a look at the Spark Kafka consumer GitHub project you provided and started experimenting. We found that somehow, when Spark has a failure after a data checkpoint, the expected re-computations corresponding to the metadata checkpoints are not recovered, so we lose K

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread RodrigoB
Dibyendu, Tnks for getting back. I believe you are absolutely right. We were under the assumption that the raw data was being computed again, and that's not happening after further tests. This applies to Kafka as well. Fortunately, the issue is of major priority. Regarding your suggestion, I wou

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread RodrigoB
Hi Yana, The fact is that the DB writing happens at the node level and not at the Spark level. One of the benefits of Spark's distributed nature is that it enables IO distribution as well. For example, it is much faster to have the nodes write to Cassandra than to have the results all collected
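
A minimal sketch of that IO-distribution point, assuming the DataStax spark-cassandra-connector and a hypothetical keyspace/table: each node writes its own partitions instead of shipping results back to the driver:

    import com.datastax.spark.connector.cql.CassandraConnector
    import org.apache.spark.rdd.RDD

    def distributedWrite(rdd: RDD[(String, Long)], connector: CassandraConnector): Unit = {
      rdd.foreachPartition { partition =>
        // The connector is serializable; this session is opened on the worker.
        connector.withSessionDo { session =>
          partition.foreach { case (k, v) =>
            session.execute(s"INSERT INTO my_ks.my_table (key, value) VALUES ('$k', $v)")
          }
        }
      }
    }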

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-31 Thread RodrigoB
Hi Yana, You are correct. What needs to be added is that besides RDDs being checkpointed, metadata representing the execution of computations is also checkpointed in Spark Streaming. Upon driver recovery, the last batches (the ones already executed and the ones that should have been executed whi

Re: Low Level Kafka Consumer for Spark

2014-08-31 Thread RodrigoB
Just a comment on the recovery part. Is it correct to say that the current Spark Streaming recovery design does not consider re-computations (upon metadata lineage recovery) that depend on blocks of data of the received stream? https://issues.apache.org/jira/browse/SPARK-1647 Just to illustrate a

RDD data checkpoint cleaning

2014-09-22 Thread RodrigoB
Hi all, I've just started to take Spark Streaming recovery more seriously as things get more serious with the project roll-out. We need to ensure full recovery at all Spark levels - driver, receiver, and worker. I've started to do some tests today and have become concerned with the current findings. I

Re: NullPointerException on reading checkpoint files

2014-09-23 Thread RodrigoB
Hi TD, This is actually an important requirement for us (recovery of shared variables), as we need to spread some referential data across the Spark nodes on application startup. I just bumped into this issue on Spark version 1.0.1. I assume the latest one also doesn't include this capability. Are t

Re: broadcast variable get cleaned by ContextCleaner unexpectedly ?

2014-09-23 Thread RodrigoB
Could you by any chance be using getOrCreate for the StreamingContext creation? I've seen this happen when I tried to first create the Spark context, then create the broadcast variables, and then recreate the StreamingContext from the checkpoint directory - so the worker process cannot find the
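
For reference, a minimal sketch of the getOrCreate pattern, with a hypothetical checkpoint path; everything the job needs, including broadcast variables, is created inside the factory function, so recovery from the checkpoint stays consistent:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("recoverable-app")
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint(checkpointDir)
      // Create broadcast variables and wire up the DStream graph here,
      // not before or outside this factory function.
      ssc
    }

    // Recovers from the checkpoint if present, otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()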

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Hi TD, tnks for getting back on this. Yes, that's what I was experiencing - data checkpoints were being recovered from a considerable time before the last data checkpoint, probably since the beginning of the first writes; I would have to confirm. I have some development on this though. These results

Re: RDD data checkpoint cleaning

2014-09-23 Thread RodrigoB
Just a follow-up. To make sure about the RDDs not being cleaned up, I just replayed the app both on the remote Windows laptop and then on the Linux machine, and at the same time observed the RDD folders in HDFS. Confirming the observed behavior: running on the laptop I could see the RDDs

Dynamically switching nr of allocated cores

2014-11-03 Thread RodrigoB
Hi all, I can't seem to find a clear answer in the documentation. Does the standalone cluster support dynamic assignment of the nr of allocated cores to an application once another app stops? I'm aware that we can have core sharing if we use Mesos between active applications, depending on the nr of
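
For reference, a minimal sketch of the related knob on standalone (the value is an illustrative assumption): spark.cores.max caps an application's cores at submit time, and cores freed by another app are only taken up within that static cap:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("capped-app")
      .set("spark.cores.max", "8")   // static upper bound for this application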

Re: Low Level Kafka Consumer for Spark

2014-12-02 Thread RodrigoB
Hi Dibyendu, What are your thoughts on keeping this solution (or not), considering that Spark Streaming v1.2 will have built-in recoverability of the received data? https://issues.apache.org/jira/browse/SPARK-1647 I'm concerned about the complexity of this solution with regards to the added complexity an

Re: Low Level Kafka Consumer for Spark

2014-12-02 Thread RodrigoB
Dibyendu, Just to make sure I will not be misunderstood - my concerns refer to the upcoming Spark solution, not yours. I would like to gather the perspective of someone who implemented recovery with Kafka in a different way. Tnks, Rod -- View this message in context: http://apache-spark-

Re: protobuf error running spark on hadoop 2.4

2014-12-16 Thread RodrigoB
Hi, I'm currently having this issue as well, while using Spark with Akka 2.3.4. Did Debasish's suggestion work for you? I've already built Spark with Hadoop 2.3.0 but am still getting the error. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/prot