Can you keep a queue per executor in memory? (Rough sketch at the bottom of
this mail.)

On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <martin.leq...@gmail.com> wrote:
> Hi Cody and all,
>
> Thank you for your answer. I implemented simple random sampling (SRS) for
> DStreams using the transform method, and it works fine.
> However, I have a problem when I implement reservoir sampling (RS). In RS,
> I need to maintain a reservoir (a queue) to store the selected data items
> (RDDs). If I define a large stream window, the queue also grows, and the
> driver runs out of memory. I explain my problem in detail here:
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>
> Could you please give me some suggestions or advice to fix this problem?
>
> Thanks
>
> On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> With most streaming systems you're still going to incur the cost of
>> reading each message... I suppose you could rotate among reading just the
>> latest messages from a single partition of a Kafka topic if they were
>> evenly balanced.
>>
>> But once you've read the messages, nothing's stopping you from filtering
>> most of them out before doing further processing. The DStream .transform
>> method will let you do any filtering / sampling you could have done on an
>> RDD.
>>
>> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <martin.leq...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I have to handle a high-rate data stream. To reduce the heavy load, I
>> > want to use sampling techniques for each stream window, i.e. process a
>> > subset of the data instead of the whole window. I see that Spark
>> > supports sampling operations on RDDs, but does it also support sampling
>> > operations for DStreams? If not, could you please give me a suggestion
>> > how to implement it?
>> >
>> > Thanks,
>> > Martin
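
Something along these lines is what I meant. It is only a rough, untested
sketch, and the names in it (ExecutorReservoir, the socket-based lines
DStream, k = 1000, fraction = 0.01) are made up for illustration. The
reservoir lives in a singleton object, so each executor JVM keeps its own
copy and the queue is never collected back to the driver:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical per-JVM reservoir: one instance is initialized lazily in each
// executor, so the sampled items never travel back to the driver.
object ExecutorReservoir {
  private val rand = new Random()
  private val reservoir = new ArrayBuffer[String]()
  private var seen = 0L

  // Classic Algorithm R: keep the first k items, then replace a random slot
  // with probability k / seen.
  def offer(item: String, k: Int): Unit = synchronized {
    seen += 1
    if (reservoir.size < k) {
      reservoir += item
    } else {
      val j = (rand.nextDouble() * seen).toLong
      if (j < k) reservoir(j.toInt) = item
    }
  }

  def snapshot(): Seq[String] = synchronized { reservoir.toList }
}

val ssc = new StreamingContext(new SparkConf().setAppName("sampling"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999) // any DStream[String] works here

// Per-batch simple random sampling, i.e. what transform + sample already gives you.
val srs = lines.transform(rdd => rdd.sample(withReplacement = false, fraction = 0.01))
srs.print()

// Reservoir sampling: push every record of every batch into the executor-local
// reservoir; the driver only schedules the jobs, it never holds the queue.
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach(item => ExecutorReservoir.offer(item, k = 1000))
  }
}

ssc.start()
ssc.awaitTermination()

Two caveats: you end up with one independent reservoir per executor rather
than a single global one, so you'd have to merge the snapshot() results
yourself (e.g. with a small job that runs on every executor), and a
reservoir is lost if its executor dies. So treat this as a starting point,
not a finished solution.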