Can you keep a queue per executor in memory? (Rough sketch at the bottom of
this mail.)

On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <martin.leq...@gmail.com> wrote:
> Hi Cody and all,
>
> Thank you for your answer. I implemented simple random sampling (SRS) for
> DStreams using the transform method, and it works fine.
> However, I have a problem when I implement reservoir sampling (RS). In RS,
> I need to maintain a reservoir (a queue) to store the selected data items
> (RDDs). If I define a large stream window, the queue also grows, and the
> driver runs out of memory. I explain my problem in detail here:
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>
> Could you please give me some suggestions or advice to fix this problem?
>
> Thanks
>
> On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> With most streaming systems you're still going to incur the cost of
>> reading each message... I suppose you could rotate among reading just the
>> latest messages from a single partition of a Kafka topic if they were
>> evenly balanced.
>>
>> But once you've read the messages, nothing's stopping you from filtering
>> most of them out before doing further processing. The DStream .transform
>> method will let you do any filtering / sampling you could have done on an
>> RDD.
>>
>> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <martin.leq...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I have to handle a high-rate data stream. To reduce the heavy load, I
>> > want to use sampling techniques for each stream window, i.e. process a
>> > subset of the data instead of the whole window. I see that Spark
>> > supports sampling operations on RDDs, but does it also support sampling
>> > operations for DStreams? If not, could you please give me a suggestion
>> > how to implement it?
>> >
>> > Thanks,
>> > Martin
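
Something along these lines is what I meant. It is only a rough, untested
sketch, and the names in it (ExecutorReservoir, the socket-based lines
DStream, k = 1000, fraction = 0.01) are made up for illustration. The
reservoir lives in a singleton object, so each executor JVM keeps its own
copy and the queue is never collected back to the driver:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical per-JVM reservoir: one instance is initialized lazily in each
// executor, so the sampled items never travel back to the driver.
object ExecutorReservoir {
  private val rand = new Random()
  private val reservoir = new ArrayBuffer[String]()
  private var seen = 0L

  // Classic Algorithm R: keep the first k items, then replace a random slot
  // with probability k / seen.
  def offer(item: String, k: Int): Unit = synchronized {
    seen += 1
    if (reservoir.size < k) {
      reservoir += item
    } else {
      val j = (rand.nextDouble() * seen).toLong
      if (j < k) reservoir(j.toInt) = item
    }
  }

  def snapshot(): Seq[String] = synchronized { reservoir.toList }
}

val ssc = new StreamingContext(new SparkConf().setAppName("sampling"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999) // any DStream[String] works here

// Per-batch simple random sampling, i.e. what transform + sample already gives you.
val srs = lines.transform(rdd => rdd.sample(withReplacement = false, fraction = 0.01))
srs.print()

// Reservoir sampling: push every record of every batch into the executor-local
// reservoir; the driver only schedules the jobs, it never holds the queue.
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach(item => ExecutorReservoir.offer(item, k = 1000))
  }
}

ssc.start()
ssc.awaitTermination()

Two caveats: you end up with one independent reservoir per executor rather
than a single global one, so you'd have to merge the snapshot() results
yourself (e.g. with a small job that runs on every executor), and a
reservoir is lost if its executor dies. So treat this as a starting point,
not a finished solution.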