With most stream systems you're still going to incur the cost of reading each message... I suppose you could rotate among reading just the latest messages from a single partition of a Kafka topic if the partitions were evenly balanced.
But once you've read the messages, nothing stops you from filtering most of them out before doing further processing. The DStream .transform method lets you do any filtering / sampling you could have done on an RDD (see the sketch at the end of this message).

On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <martin.leq...@gmail.com> wrote:
> Hi all,
>
> I have to handle a high-speed data stream. To reduce the heavy load, I
> want to use sampling techniques for each stream window, i.e. process a
> subset of the data instead of the whole window. I saw that Spark supports
> sampling operations on RDDs, but does Spark support sampling on DStreams
> as well? If not, could you please suggest how to implement it?
>
> Thanks,
> Martin
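
Concretely, something like the following (a minimal sketch; the socket source, host/port, and the 10% fraction are placeholders standing in for whatever high-rate source and sampling rate you actually have):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SampledStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SampledStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // A socket text stream stands in for the real high-rate source (e.g. Kafka).
    val lines = ssc.socketTextStream("localhost", 9999)

    // transform exposes the underlying RDD of each micro-batch, so any RDD
    // operation -- including sample -- is available per batch. Here we keep
    // roughly 10% of each batch, sampled without replacement.
    val sampled = lines.transform(rdd =>
      rdd.sample(withReplacement = false, fraction = 0.1))

    sampled.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that sample runs independently on each micro-batch's RDD, so the fraction applies per batch rather than across the stream as a whole.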