Thanks Burak. But with a BloomFilter, won't I be getting false positives?

On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz <brk...@gmail.com> wrote:

> I noticed that 1 wouldn't be a problem, because you'll save the
> BloomFilter in the state.
>
> For 2, you would keep a Map of UUIDs to the timestamp of when you saw
> them. If the UUID exists in the map, then you wouldn't increase the count.
> If the timestamp of a UUID expires, you would remove it from the map. The
> reason we remove from the map is to keep a bounded amount of space. It'll
> probably take a lot more space than the BloomFilter though, depending on
> your data volume.
>
> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>
>> In the previous email you gave me 2 solutions:
>> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
>> 2. keeping the state of the unique ids
>>
>> Please elaborate on 2.
>>
>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> I don't have any sample code, but on a high level:
>>>
>>> My state would be: (Long, BloomFilter[UUID])
>>> In the update function, my value will be the UUID of the record, since
>>> the word itself is the key.
>>> I'll ask my BloomFilter if I've seen this UUID before. If not, increase
>>> the count and also add the UUID to the filter.
>>>
>>> Does that make sense?
>>>
>>>
>>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>
>>>> Hi Burak,
>>>> Thanks for the response. Can you please elaborate on your idea of
>>>> storing the state of the unique ids?
>>>> Do you have any sample code or links I can refer to?
>>>> Thanks
>>>>
>>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>
>>>>> Off the top of my head... (Each may have its own issues)
>>>>>
>>>>> If upstream you add a uniqueId to all your records, then you may use a
>>>>> BloomFilter to approximate if you've seen a row before.
>>>>> The problem I can see with that approach is how to repopulate the
>>>>> bloom filter on restarts.
>>>>>
>>>>> If you are certain that you're not going to reprocess some data after
>>>>> a certain time, i.e. the same data cannot show up again after 2 hours
>>>>> and duplicates can only occur within the last 2 hours, then you may
>>>>> also keep the state of uniqueIds and age them out after a certain time.
>>>>>
>>>>>
>>>>> Best,
>>>>> Burak
>>>>>
>>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> Please share your thoughts.....
>>>>>>
>>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>
>>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> My streaming application stores a lot of aggregations using
>>>>>>>> mapWithState.
>>>>>>>>
>>>>>>>> I want to know all the possible ways I can make it idempotent.
>>>>>>>>
>>>>>>>> Please share your views.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have a word count application that stores the count of all the
>>>>>>>>> words input so far using mapWithState. How do I make sure my counts
>>>>>>>>> are not messed up if I happen to read a line more than once?
>>>>>>>>>
>>>>>>>>> Appreciate your response.
>>>>>>>>>
>>>>>>>>> Thanks
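
For reference, here is a minimal sketch of the two approaches discussed in this thread. It is plain Python rather than Spark code: the per-word state is modeled as a simple tuple instead of Spark's state object, and the names TinyBloomFilter, update_with_bloom, and update_with_map are hypothetical, chosen only for illustration. On the false-positive question: a Bloom filter never reports a seen id as new, but it can occasionally report a new id as already seen, in which case the first function below would skip a legitimate increment and undercount slightly. The map variant is exact, but only within its expiry window.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter for illustration; a real job would use a tuned library."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely new"; True means "probably seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def update_with_bloom(state, record_id):
    """Approach 1: state is (count, bloom_filter) for one word."""
    count, seen = state
    if seen.might_contain(record_id):
        # Either a replayed record or a false positive; both are skipped.
        return state
    seen.add(record_id)
    return (count + 1, seen)


def update_with_map(state, record_id, now, ttl_seconds):
    """Approach 2: state is (count, {record_id: first_seen_time}) for one word."""
    count, seen = state
    # Age out ids older than the TTL so the map stays bounded in size.
    seen = {rid: ts for rid, ts in seen.items() if now - ts < ttl_seconds}
    if record_id in seen:
        return (count, seen)
    seen[record_id] = now
    return (count + 1, seen)
```

In both functions a replayed record id leaves the count unchanged, which is what makes the update idempotent. The map variant only protects against replays that arrive within ttl_seconds of the first sighting, so the TTL has to cover the longest window in which reprocessing can happen.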