Hey Chris,
Apologies for the delayed reply. Your responses are always insightful and
appreciated :-)
However, I have a few more questions.
"also, it looks like you're writing to S3 per RDD. you'll want to broaden
that out to write DStream batches"
I assume you mean "dstream.saveAsTextFiles(....)" vs
"rdd.saveAsTextFile(....)". Although looking at the source code
DStream.scala, the saveAsTextFiles is simply wrapping a rdd.saveAsTextFile
def saveAsTextFiles(prefix: String, suffix: String = "") {
val saveFunc = (rdd: RDD[T], time: Time) => {
val file = rddToFileName(prefix, suffix, time)
rdd.saveAsTextFile(file)
}
this.foreachRDD(saveFunc)
}
So it's not clear to me how it would improve the throughput.
Also, for your comment "expand even further and write window batches (where
the window interval is a multiple of the batch interval)". I still don't
quite understand mechanics underneath, for example, what would be the
difference between extending the batch interval and adding windowed
batches? I presume it has something to do with the processor thread(s)
within a receiver?
By the way, in the near future, I'll be putting together some performance
numbers on a proper deployment, and will be sure to share my findings.
Thanks, Mike!
On Sat, Mar 21, 2015 at 8:09 AM, Chris Fregly <[email protected]> wrote:
> hey mike!
>
> you'll definitely want to increase your parallelism by adding more shards
> to the stream - as well as spinning up 1 receiver per shard and unioning
> all the shards per the KinesisWordCount example that is included with the
> kinesis streaming package.
>
> you'll need more cores (cluster) or threads (local) to support this -
> equalling at least the number of shards/receivers + 1.
>
> also, it looks like you're writing to S3 per RDD. you'll want to broaden
> that out to write DStream batches - or expand even further and write
> window batches (where the window interval is a multiple of the batch
> interval).
>
> this goes for any spark streaming implementation - not just Kinesis.
>
> lemme know if that works for you.
>
> thanks!
>
> -Chris
> _____________________________
> From: Mike Trienis <[email protected]>
> Sent: Wednesday, March 18, 2015 2:45 PM
> Subject: Spark Streaming S3 Performance Implications
> To: <[email protected]>
>
>
>
> Hi All,
>
> I am pushing data from Kinesis stream to S3 using Spark Streaming and
> noticed that during testing (i.e. master=local[2]) the batches (1 second
> intervals) were falling behind the incoming data stream at about 5-10
> events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking
> at a few seconds to complete.
>
> val saveFunc = (rdd: RDD[String], time: Time) => {
>
> val count = rdd.count()
>
> if (count > 0) {
>
> val s3BucketInterval = time.milliseconds.toString
>
> rdd.saveAsTextFile(s3n://...)
>
> }
> }
>
> dataStream.foreachRDD(saveFunc)
>
>
> Should I expect the same behaviour in a deployed cluster? Or does the
> rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?
>
> "Write the elements of the dataset as a text file (or set of text files)
> in a given directory in the local filesystem, HDFS or any other
> Hadoop-supported file system. Spark will call toString on each element to
> convert it to a line of text in the file."
>
> Thanks, Mike.
>
>
>