Although I haven't worked explicitly with either, they do seem to differ in
design and consequently in usage scenarios.
Ignite is claimed to be a pure in-memory distributed database.
With Ignite, updating existing keys is self-managed, compared with Tachyon.
In Tachyon, once a val
Hi,
I'm currently trying to build spark with Scala 2.11 and Akka 2.4.0.
I've changed the main pom.xml files to the corresponding Akka version and am
getting the following exception when starting the master in standalone mode:
Exception Details:
Location:
akka/dispatch/Mailbox.processAllSystemMessage
Hi Manas,
Thanks for the reply. I've done that. The problem lies with the Spark + Akka
2.4.0 build. It seems the Maven Shade plugin is altering some class files and
breaking the Akka runtime.
It also seems the Spark build on Scala 2.11 using SBT is broken. I'm getting
build errors with sbt due to the issues f
Hi all,
For several reasons which I won't elaborate on (yet), we're using Spark in
local mode as an in-memory SQL engine for data we're retrieving from
Cassandra: we execute SQL queries and return the results to the client - so no
cluster, no worker nodes. I'm well aware local mode has always been considered a test
Hi Hannes,
Good to know I'm not alone in this boat. Sorry about not posting back; I
haven't been on the user list in a while.
It's on my agenda to get past this issue. It will be very important for our
recovery implementation. I have done an internal proof of concept but
without any conclusions so
I've been given a feature requirement that means processing events with a
latency lower than 0.25 ms,
meaning I would have to make sure that Spark Streaming gets new events from
the messaging layer within that period of time. Has anyone achieved
such numbers using a Spark cluster? Or would thi
Hi Wayne,
Thanks for the reply. I did raise the thread max before posting, based on your
previous comment on another post suggesting ulimit -n 2048. That seems to have
helped with the out-of-memory issue.
I'm curious whether this is standard procedure for scaling a Spark node's
resources vertically or is it just
Hi TD,
I have the same question: I need the workers to process events in arrival
order, since the state is updated based on the previous one.
Thanks in advance.
Rod
Hi Tathagata,
I'm seeing the same issue on an overnight load run with Kafka streaming (6000
msgs/sec) and a 500 millisecond batch size. Again, it's occasional and only
happening after a few hours, I believe.
I'm using updateStateByKey.
Any comment will be very welcome.
Thanks,
Rod
Hi all,
I am currently trying to save to Cassandra after some Spark Streaming
computation.
I call myDStream.foreachRDD so that I can collect each RDD in the driver
app at runtime, and inside it I do something like this:
myDStream.foreachRDD(rdd => {
var someCol = Seq[MyType]()
foreach(kv =>{
someC
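A minimal sketch of the pattern being described - collecting each micro-batch to the driver and re-parallelizing it so saveToCassandra can be used - assuming the DataStax spark-cassandra-connector; MyType and the keyspace/table names are hypothetical:

import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type and table names, for illustration only.
case class MyType(key: String, value: Long)

def saveViaDriver(myDStream: DStream[MyType]): Unit = {
  myDStream.foreachRDD { rdd =>
    // Pull the micro-batch into the driver process...
    val someCol: Seq[MyType] = rdd.collect().toSeq
    // ...then re-parallelize it so saveToCassandra can be applied, which is
    // the (inefficient) approach discussed later in this thread.
    rdd.sparkContext.parallelize(someCol)
      .saveToCassandra("my_keyspace", "my_table")
  }
}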
Hi Luis,
Yes, it's actually an output of the previous RDD.
Have you ever used the Cassandra Spark driver in the driver app? I believe
these limitations stem from that - it's designed to save RDDs from the
nodes.
Thanks,
Rod
Thanks to both of you for the comments and the debugging suggestion, which I
will try to use.
Regarding your comment, yes, I do agree the current solution was not efficient,
but to use the saveToCassandra method I need an RDD, hence the parallelize
method. I finally got directed by Piotr to use the CassandraConnec
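A hedged sketch of that driver-side alternative - using the connector's CassandraConnector to open a session directly in the driver instead of parallelizing a local collection; table and column names are made up for illustration:

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

// Assumes spark.cassandra.connection.host is set on the SparkConf.
def writeFromDriver(conf: SparkConf, rows: Seq[(String, Long)]): Unit = {
  CassandraConnector(conf).withSessionDo { session =>
    val stmt = session.prepare(
      "INSERT INTO my_keyspace.my_table (key, value) VALUES (?, ?)")
    rows.foreach { case (k, v) =>
      session.execute(stmt.bind(k, Long.box(v)))
    }
  }
}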
Hi all,
I'm currently having a load issue with the updateStateByKey function.
It seems to run with considerable delay for a few of the state objects when
their number increases.
I have a 1 sec batch size receiving events from a Kafka stream, which creates
state objects and also updates them consequen
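A minimal sketch of the kind of setup being described (1-second batches, Kafka events driving updateStateByKey), with a hypothetical state class and made-up ZooKeeper/topic names; the actual update logic isn't shown in the thread:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StateExample {
  // Hypothetical per-key state, for illustration only.
  case class EventState(count: Long)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("updateStateByKey-example")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1 s batch, as in the thread
    ssc.checkpoint("hdfs:///tmp/checkpoints")        // required by updateStateByKey

    // Hypothetical ZooKeeper quorum, consumer group and topic names.
    val kafkaStream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "example-group", Map("events" -> 1))

    val pairs = kafkaStream.map { case (_, msg) => (msg, 1L) }

    // State update: each batch folds the new values into the previous state.
    val updated = pairs.updateStateByKey[EventState] { (values, state) =>
      if (values.isEmpty) state
      else Some(EventState(state.map(_.count).getOrElse(0L) + values.sum))
    }

    updated.print()
    ssc.start()
    ssc.awaitTermination()
  }
}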
Just an update on this:
Looking into the Spark logs, it seems that some partitions are not found and
are recomputed, which gives the impression that those are related to the
delayed updateStateByKey calls.
I'm seeing something like:
log line 1 - Partition rdd_132_1 not found, computing it
log line N - Found
Hi,
I don't think you can do that. The code inside the foreach is running at
the node level, and you're trying to create another RDD within the node's
specific execution context.
Try to load the text file before the streaming context in the driver app and
use it later as a cached RDD in the following
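A rough sketch of that suggestion, assuming the reference data is a text file keyed by its first comma-separated field and joined against the stream; the path, the socket source and the key format are all illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("driver-side-reference-data")
val ssc = new StreamingContext(conf, Seconds(1))

// Load the reference file once in the driver, before any streaming logic,
// and cache it so every batch reuses the same RDD instead of trying to
// create a new one inside code that runs on the executors.
val reference = ssc.sparkContext
  .textFile("hdfs:///data/reference.txt")        // hypothetical path
  .map(line => (line.split(",")(0), line))
  .cache()

val stream = ssc.socketTextStream("localhost", 9999)   // stand-in source
  .map(line => (line.split(",")(0), line))

// Each micro-batch joins against the cached, driver-defined reference RDD.
val enriched = stream.transform(rdd => rdd.join(reference))
enriched.print()

ssc.start()
ssc.awaitTermination()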
Hi TD,
I've also been fighting this issue, only to find the exact same solution you
are suggesting.
Too bad I didn't find either the post or the issue sooner.
I'm using a 1 second batch with N Kafka events (1 to 1 with the
state objects) per batch and only calling the updateStateByKey f
Hi Christopher,
I am also in need of having a single function call at the node level.
Your suggestion makes sense as a solution to the requirement, but it still
feels like a workaround; this check will get called on every row... Also,
having static members and methods created specially in a multi-t
Dear Spark users,
We have a Spark Streaming application which receives events from Kafka and
has an updateStateByKey call that executes IO, like writing to Cassandra or
sending events to other systems.
Upon metadata checkpoint recovery (before the data checkpoint occurs), all
lost RDDs get recompu
Hi Dibyendu,
My colleague has taken a look at the Spark Kafka consumer GitHub project you
provided and has started experimenting.
We found that somehow, when Spark has a failure after a data checkpoint, the
expected re-computations corresponding to the metadata checkpoints are not
recovered, so we lose K
Dibyendu,
Thanks for getting back.
I believe you are absolutely right. We were under the assumption that the
raw data was being computed again, and that's not happening after further
tests. This applies to Kafka as well.
The issue is of major priority, fortunately.
Regarding your suggestion, I wou
Hi Yana,
The fact is that the DB writing is happening at the node level and not at
the Spark level. One of the benefits of the distributed computing nature of
Spark is that it enables IO distribution as well. For example, it is much
faster to have the nodes write to Cassandra instead of having them all collected
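A hedged sketch of node-level writes of that kind: foreachPartition with the spark-cassandra-connector, so each executor opens its own session and the writes are spread across the nodes; keyspace, table and record shape are hypothetical:

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.rdd.RDD

// Assumes spark.cassandra.connection.host is set on the SparkConf.
def writeFromNodes(rdd: RDD[(String, Long)]): Unit = {
  val connector = CassandraConnector(rdd.sparkContext.getConf)
  rdd.foreachPartition { partition =>
    // This block runs on the executors, so the IO is distributed across the
    // nodes instead of being collected back to the driver first.
    connector.withSessionDo { session =>
      val stmt = session.prepare(
        "INSERT INTO my_keyspace.my_table (key, value) VALUES (?, ?)")
      partition.foreach { case (k, v) =>
        session.execute(stmt.bind(k, Long.box(v)))
      }
    }
  }
}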
Hi Yana,
You are correct. What needs to be added is that, besides RDDs being
checkpointed, metadata which represents the execution of computations is also
checkpointed in Spark Streaming.
Upon driver recovery, the last batches (the ones already executed and the
ones that should have been executed whi
Just a comment on the recovery part.
Is it correct to say that the current Spark Streaming recovery design does not
consider re-computations (upon metadata lineage recovery) that depend on
blocks of data from the received stream?
https://issues.apache.org/jira/browse/SPARK-1647
Just to illustrate a
Hi all,
I've just started to take Spark Streaming recovery more seriously as things
get more serious with the project roll-out. We need to ensure full recovery at
all Spark levels - driver, receiver and worker.
I started doing some tests today and became concerned with the current
findings.
I
Hi TD,
This is actually an important requirement (recovery of shared variables) for
us, as we need to spread some referential data across the Spark nodes on
application startup. I just bumped into this issue on Spark version 1.0.1, and
I assume the latest one also doesn't include this capability. Are t
Could you by any chance be using getOrCreate for the StreamingContext
creation?
I've seen this happen when I tried to first create the Spark context, then
create the broadcast variables, and then recreate the StreamingContext from
the checkpoint directory. In that case the worker process cannot find the
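A hedged sketch of the getOrCreate pattern being discussed, with the broadcast variable created inside the factory function rather than before it; whether state like this fully survives checkpoint recovery is exactly what this thread is questioning. The path, host and reference map are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("getOrCreate-example")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)

  // Broadcast variables and the streaming graph are created inside the
  // factory, not before getOrCreate, so they exist in the same context that
  // gets checkpointed.
  val reference = ssc.sparkContext.broadcast(Map("a" -> 1, "b" -> 2))

  val lines = ssc.socketTextStream("localhost", 9999)    // stand-in source
  lines.map(line => (line, reference.value.getOrElse(line, 0))).print()

  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()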
Hi TD, tnks for getting back on this.
Yes that's what I was experiencing - data checkpoints were being recovered
from considerable time before the last data checkpoint, probably since the
beginning of the first writes, would have to confirm. I have some
development on this though.
These results
Just a follow-up.
To make sure about the RDDs not being cleaned up, I replayed the
app both on the remote Windows laptop and then on the Linux machine, and at
the same time I was observing the RDD folders in HDFS.
Confirming the observed behavior: running on the laptop I could see the RDDs
Hi all,
I can't seem to find a clear answer in the documentation.
Does the standalone cluster support dynamic assignment of the number of
allocated cores to an application once another app stops?
I'm aware that we can have core sharing if we use Mesos between active
applications, depending on the nr of
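For reference, a hedged sketch of the standalone-mode settings usually involved in this kind of per-application core allocation; the master URL and the cap value are illustrative:

import org.apache.spark.SparkConf

// In standalone mode, spark.cores.max caps the total cores an application may
// take across the cluster; if it is unset, spark.deploy.defaultCores applies
// (unlimited by default), which is why one app can hold all cores until it exits.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical master URL
  .setAppName("core-allocation-example")
  .set("spark.cores.max", "4")             // illustrative cap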
Hi Dibyendu,
What are your thoughts on keeping this solution (or not), considering that
Spark Streaming v1.2 will have built-in recoverability of the received data?
https://issues.apache.org/jira/browse/SPARK-1647
I'm concerned about the complexity of this solution with regard to the added
complexity an
Dibyendu,
Just to make sure I will not be misunderstood - my concerns refer to
the upcoming Spark solution and not yours. I would like to gather the
perspective of someone who implemented recovery with Kafka in a different way.
Thanks,
Rod
Hi,
I'm currently having this issue as well, while using Spark with Akka 2.3.4.
Did Debasish's suggestion work for you?
I've already built Spark with Hadoop 2.3.0 but am still getting the error.
Thanks,
Rod