Hi All,
I am pushing data from Kinesis stream to S3 using Spark Streaming and
noticed that during testing (i.e. master=local[2]) the batches (1 second
intervals) were falling behind the incoming data stream at about 5-10
events / second. It seems that the rdd.saveAsTextFile(s3n://...) is taking
at a few seconds to complete.
val saveFunc = (rdd: RDD[String], time: Time) => {
val count = rdd.count()
if (count > 0) {
val s3BucketInterval = time.milliseconds.toString
rdd.saveAsTextFile(s3n://...)
}
}
dataStream.foreachRDD(saveFunc)
Should I expect the same behaviour in a deployed cluster? Or does the
rdd.saveAsTextFile(s3n://...) distribute the push work to each worker node?
"Write the elements of the dataset as a text file (or set of text files) in
a given directory in the local filesystem, HDFS or any other
Hadoop-supported file system. Spark will call toString on each element to
convert it to a line of text in the file."
Thanks, Mike.