hey mike!
you'll definitely want to increase your parallelism by adding more shards to
the stream - as well as spinning up 1 receiver per shard and unioning all the
shards per the KinesisWordCount example that is included with the kinesis
streaming package.
you'll need more cores (cluster) or threads (local) to support this - equalling
at least the number of shards/receivers + 1.
also, it looks like you're writing to S3 per RDD. you'll want to broaden that
out to write DStream batches - or expand even further and write window batches
(where the window interval is a multiple of the batch interval).
this goes for any spark streaming implementation - not just Kinesis.
lemme know if that works for you.
thanks!
-Chris
_____________________________
From: Mike Trienis <[email protected]>
Sent: Wednesday, March 18, 2015 2:45 PM
Subject: Spark Streaming S3 Performance Implications
To: <[email protected]>
Hi All,
I am pushing data from Kinesis stream to S3 using Spark Streaming and
noticed that during testing (i.e. master=local[2]) the batches (1 second
intervals) were falling behind the incoming data stream at about 5-10 events /
second. It seems that the rdd.saveAsTextFile(s3n://...) is taking at a few
seconds to complete.
val saveFunc = (rdd: RDD[String], time: Time) => {
val count = rdd.count()
if (count > 0) {
val s3BucketInterval = time.milliseconds.toString
rdd.saveAsTextFile(s3n://...)
} }
dataStream.foreachRDD(saveFunc)
Should I expect the same behaviour in a deployed cluster? Or does the
rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?
"Write the elements of the dataset as a text file (or set of text
files) in a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element to
convert it to a line of text in the file."
Thanks, Mike.