Spark Streaming S3 Performance Implications

Mike Trienis Wed, 18 Mar 2015 11:46:09 -0700

Hi All,

I am pushing data from Kinesis stream to S3 using Spark Streaming and
noticed that during testing (i.e. master=local[2]) the batches (1 second
intervals) were falling behind the incoming data stream at about 5-10
events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking
at a few seconds to complete.


        val saveFunc = (rdd: RDD[String], time: Time) => {

            val count = rdd.count()

            if (count > 0) {

                val s3BucketInterval = time.milliseconds.toString

               rdd.saveAsTextFile(s3n://...)

            }
        }

        dataStream.foreachRDD(saveFunc)


Should I expect the same behaviour in a deployed cluster? Or does the
rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?

"Write the elements of the dataset as a text file (or set of text files) in
a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element to
convert it to a line of text in the file."

Thanks, Mike.

Spark Streaming S3 Performance Implications

Reply via email to