Some background on what we're trying to do: we have four Kinesis receivers with varying amounts of data coming through them. Ultimately we work on a unioned stream that receives about 11 MB/second of data, using a batch size of 5 seconds.
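For context, our ingestion looks roughly like this (a sketch, not our exact code; the app name, stream name, endpoint, and region are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

    val conf = new SparkConf().setAppName("kinesis-aggregator")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batches

    // Four receivers reading the same stream; blocks are replicated
    // (the _2 storage level) so losing one executor should not lose
    // not-yet-computed blocks.
    val receivers = (1 to 4).map { _ =>
      KinesisUtils.createStream(
        ssc, "kinesis-aggregator", "our-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST,
        Seconds(5),                        // Kinesis checkpoint interval
        StorageLevel.MEMORY_AND_DISK_SER_2
      )
    }
    val unioned = ssc.union(receivers)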
We create four distinct DStreams from this data, each with a different aggregation computation (various combinations of map/flatMap/reduceByKeyAndWindow, finishing by serializing the records to JSON strings and writing them to S3). We want to compute over 30-minute windows to get a better compression rate for the aggregates (there are a lot of repeated keys across this time frame, and we combine them all using reduceByKeyAndWindow). But even when trying 5-minute windows, we run into "Could not compute split, block —— not found" errors.

This is being run on a YARN cluster, and it seems like the executors are getting killed even though they should have plenty of memory.

Also, it seems like no computation actually takes place until the end of the window duration. This seems inefficient when there is a lot of data that you know will be needed for the computation. Is there any good way around this?

Here are some of the configuration settings we are using for Spark:

spark.executor.memory=26000M
spark.executor.cores=4
spark.executor.instances=5
spark.driver.cores=4
spark.driver.memory=24000M
spark.default.parallelism=128
spark.streaming.blockInterval=100ms
spark.streaming.receiver.maxRate=20000
spark.akka.timeout=300
spark.storage.memoryFraction=0.6
spark.rdd.compress=true
spark.executor.instances=16
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=2047m

Is this the correct way to do this, and how can we debug further to figure out this issue?
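One thing we have been considering, to frame the question: the incremental form of reduceByKeyAndWindow that takes an inverse reduce function, so that each batch only folds in newly arrived data and subtracts data falling out of the window, rather than re-reducing the entire window at once. A rough sketch of what we mean, assuming the aggregate is invertible (e.g. a sum); parseRecords and the checkpoint path are hypothetical placeholders:

    import org.apache.spark.streaming.Minutes
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical parser: turn a raw Kinesis record into (key, count) pairs.
    def parseRecords(bytes: Array[Byte]): Seq[(String, Long)] =
      new String(bytes, "UTF-8").split("\n").map(line => (line, 1L)).toSeq

    ssc.checkpoint("s3://our-bucket/checkpoints") // required by the inverse form

    val counts: DStream[(String, Long)] = unioned
      .flatMap(parseRecords)
      .reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,  // fold new data into the window
        (a: Long, b: Long) => a - b,  // subtract data leaving the window
        Minutes(30),                  // window length
        Minutes(5)                    // slide interval
      )

    counts.foreachRDD { rdd =>
      // serialize to JSON strings and write to S3 here (elided in this sketch)
    }

Would this be the right way to spread the windowed computation across batches, instead of having it all happen at the end of the window?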