I need to implement the following logic in a Spark Streaming app:
for the incoming DStream, do some transformation, then invoke
updateStateByKey to update the state object for each key (marking data
entries that were updated as dirty for the next step), then let the state
objects produce event(s) based on the dirty entries.
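A minimal sketch of that pattern, assuming a made-up SensorState case class with a dirty flag and a toEvents helper (the socket source, field names, and event format are placeholders of mine, not from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical state object: keeps the latest value plus a dirty flag.
case class SensorState(lastValue: Double, dirty: Boolean) {
  def toEvents: Seq[String] =
    if (dirty) Seq(s"value changed to $lastValue") else Seq.empty
}

val ssc = new StreamingContext(
  new SparkConf().setAppName("DirtyStateSketch").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("/tmp/dirty-state-checkpoint") // required by updateStateByKey

// (sensorId, measurement) pairs; the real source is application-specific.
val keyedDStream = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(id, v) = line.split(","); (id, v.toDouble) }

// Mark the state dirty whenever new data arrived for the key in this batch.
val stateDStream = keyedDStream.updateStateByKey[SensorState] {
  (newValues: Seq[Double], state: Option[SensorState]) =>
    if (newValues.isEmpty) state.map(_.copy(dirty = false)) // nothing new this batch
    else Some(SensorState(newValues.last, dirty = true))
}

// Next step: let the state objects produce events based on the dirty flag.
val events = stateDStream.flatMap { case (_, s) => s.toEvents }
events.print()

ssc.start()
ssc.awaitTermination()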
Here's the official Spark documentation about batch size/interval:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-size
Spark is batch-oriented processing. As you mentioned, a stream is a
continuous flow, which core Spark cannot handle directly. Spark Streaming
bridges that gap by dividing the continuous stream into small batches
(micro-batches) that core Spark can then process.
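For example, with a 2-second batch interval (an arbitrary value here), every 2 seconds of input becomes one small batch that is processed as a regular Spark job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchExample").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2)) // batch interval = 2s

// Each 2-second window of input becomes one batch (one RDD) to process.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

ssc.start()
ssc.awaitTermination()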
What are you trying to do? Generate time series from your data in HDFS, or
do some transformation and/or aggregation on your time series data in HDFS?
If you want the result as an RDD of (key, 1):
val newRdd = rdd.filter(x => x._2 == 1)
If you want the result as an RDD of keys (since you know the values are 1), then:
val newRdd = rdd.filter(x => x._2 == 1).map(x => x._1)
x._1 and x._2 are the Scala way to access the key and value of a
key/value pair.
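A self-contained way to try both variants (the sample data is made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("FilterOnes").setMaster("local[*]"))

val rdd = sc.parallelize(Seq(("a", 1), ("b", 0), ("c", 1)))

val pairsRdd = rdd.filter(x => x._2 == 1)            // RDD[(String, Int)]: (a,1), (c,1)
val keysRdd  = rdd.filter(x => x._2 == 1).map(_._1)  // RDD[String]: a, c

pairsRdd.collect().foreach(println)
keysRdd.collect().foreach(println)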
I'm not sure what you mean by "previous run". Is it the previous batch, or
a previous run of spark-submit?
If it's the previous batch (Spark Streaming creates a batch every batch
interval), then there's nothing to do.
If it's a previous run of spark-submit (assuming you are able to save the
result somewhere), ...
There's no need to initialize StateDStream. Take a look at the example
StatefulNetworkWordCount.scala; it's part of the Spark source code.
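A trimmed-down version of the pattern in that example (this is a paraphrase, not the exact file; the socket source and interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("/tmp/wordcount-checkpoint") // updateStateByKey needs a checkpoint dir

// Running count per word; the state simply starts from None on the first
// batch, so there is no explicit StateDStream initialization anywhere.
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val stateDStream = words.map(w => (w, 1)).updateStateByKey[Int](updateFunc)
stateDStream.print()

ssc.start()
ssc.awaitTermination()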
Follow the instructions here:
http://spark.apache.org/docs/latest/building-with-maven.html
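If memory serves, the basic command on that page is along the lines of mvn -DskipTests clean package, run after raising MAVEN_OPTS as the page describes; check the page itself for the exact profile flags for your Hadoop/YARN version.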
Thanks for your response! I found that too, and it does the trick! Here's
the refined code:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
val partitionedDStream = keyedDStream.transform(rdd =>
  rdd.partitionBy(new MyPartitioner(...)))
val stateDStream = partitionedDStream.updateStateByKey[...](updateFunc)
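Filling in the elided pieces with placeholders of my own (MyPartitioner's hashing logic, the socket source, and updateFunc below are illustrative, not from the original thread), the full shape is roughly:

import org.apache.spark.{Partitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative partitioner: pins each sensorId to a fixed partition so its
// state stays on the same partition across batches.
class MyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    (key.hashCode & Integer.MAX_VALUE) % numPartitions
}

val ssc = new StreamingContext(
  new SparkConf().setAppName("PartitionedState").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("/tmp/state-checkpoint") // required by updateStateByKey

// Stand-in source: "sensorId,value" lines over a socket.
val keyedDStream = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(id, v) = line.split(","); (id, v.toDouble) }

val partitionedDStream = keyedDStream.transform(rdd =>
  rdd.partitionBy(new MyPartitioner(8)))

// Illustrative update function: keep the latest value per sensor.
val updateFunc = (newValues: Seq[Double], state: Option[Double]) =>
  if (newValues.isEmpty) state else Some(newValues.last)

// updateStateByKey also has an overload that takes a Partitioner directly,
// which keeps the state RDD partitioned the same way.
val stateDStream = partitionedDStream.updateStateByKey[Double](
  updateFunc, new MyPartitioner(8))
stateDStream.print()

ssc.start()
ssc.awaitTermination()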
Add something like the following to spark-env.sh:
export LD_LIBRARY_PATH=<path to your native libraries>:$LD_LIBRARY_PATH
(and remove all 5 exports you listed). Then restart all worker nodes and
try again.
Good luck!
Hi Mayur,
Thanks for your response. I did write a simple test that sets up a DStream
with 5 batches. The batch duration is 1 second, and the 3rd batch takes an
extra 2 seconds. The output of the test shows that the 3rd batch causes a
backlog, and Spark Streaming does catch up on the 4th and 5th batches (...
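A sketch of that kind of test using queueStream to feed a fixed number of batches, with an artificial delay in the 3rd one (the 1-second interval and 2-second delay mirror the numbers above; the rest is my own scaffolding):

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("SlowBatchTest").setMaster("local[2]"), Seconds(1))

// 5 pre-built batches, one dequeued per 1-second batch interval.
val queue = mutable.Queue[RDD[Int]]()
(1 to 5).foreach { i => queue += ssc.sparkContext.makeRDD(Seq(i)) }
val stream = ssc.queueStream(queue, oneAtATime = true)

stream.foreachRDD { (rdd, time) =>
  val batchId = rdd.collect().headOption.getOrElse(-1)
  if (batchId == 3) Thread.sleep(2000) // make the 3rd batch take ~2s extra
  println(s"batch $batchId finished at $time")
}

ssc.start()
ssc.awaitTerminationOrTimeout(15000)
ssc.stop()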
Thanks for your response. I do have something like:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
val partitionedDStream = keyedDStream.transform(rdd =>
  rdd.partitionBy(new MyPartitioner(...)))
val stateDStream = partitionedDStream.updateStateByKey[...](updateFunc)
I'm working on a DStream application. The input is sensors' measurements;
the data format is ...
There are 10 thousand sensors, and updateStateByKey is used to maintain
the state of each sensor. The code looks like the following:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
Since you are using your home computer, it's probably not reachable by EC2
from the internet.
You can try setting "spark.driver.host" to your WAN IP and "spark.driver.port"
to a fixed port in SparkConf, and open that port in your home network (port
forwarding to the computer you are using). See if that helps.
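A minimal sketch of those two settings (the IP, port, and master host below are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RemoteDriverExample")
  .setMaster("spark://<master-host>:7077")
  .set("spark.driver.host", "203.0.113.10") // placeholder: your WAN IP
  .set("spark.driver.port", "51000")        // placeholder: the port you open/forward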
The command should be "spark-shell --master spark://<master-host>:7077".
... factor or failures),
then what's the right batch interval? 5 seconds (the worst case)?
2. What will happen to DStream processing if one batch takes longer than the
batch interval? Can Spark recover from that?
Thanks,
Qihong