I don't have any sample code, but at a high level: my state would be (Long, BloomFilter[UUID]). In the update function, the value would be the UUID of the record, since the word itself is the key. I'd ask the BloomFilter whether I've seen this UUID before. If not, increment the count and add the UUID to the filter.
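
Nothing tested against a real job, but the update logic could be sketched roughly like this outside of Spark. SimpleBloomFilter, IdempotentCount, newState and update are hypothetical names made up for illustration, and the filter sizing (65536 bits, 3 hashes) is an arbitrary assumption; a real job would more likely use a library Bloom filter.

```scala
import java.util.{BitSet, UUID}

// Hypothetical minimal Bloom filter keyed on UUIDs (illustration only).
final class SimpleBloomFilter(numBits: Int, numHashes: Int) extends Serializable {
  private val bits = new BitSet(numBits)

  // Derive the i-th bit position from the two 64-bit halves of the UUID
  // (simple double hashing); floorMod keeps the index non-negative.
  private def position(id: UUID, i: Int): Int = {
    val h = id.getMostSignificantBits + i.toLong * id.getLeastSignificantBits
    java.lang.Math.floorMod(h, numBits.toLong).toInt
  }

  // May return true for an unseen id (false positive), never false for a seen one.
  def mightContain(id: UUID): Boolean =
    (0 until numHashes).forall(i => bits.get(position(id, i)))

  def put(id: UUID): Unit =
    (0 until numHashes).foreach(i => bits.set(position(id, i)))
}

object IdempotentCount {
  // Per-word state: running count plus the filter of record ids already counted.
  type CountState = (Long, SimpleBloomFilter)

  // Assumed sizing: 65536 bits, 3 hash positions per id.
  def newState(): CountState = (0L, new SimpleBloomFilter(1 << 16, 3))

  // Count the record only if its UUID has (probably) not been seen before.
  def update(state: CountState, recordId: UUID): CountState = {
    val (count, seen) = state
    if (seen.mightContain(recordId)) state
    else {
      seen.put(recordId)
      (count + 1, seen)
    }
  }
}
```

Inside the mapWithState function you would read the previous state (or start from newState() for a new word), call update with the record's UUID, and write the result back to the state. Keep in mind that a false positive in the filter makes a new record look like a duplicate, so the count can come out slightly low.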
Does that make sense?

On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <[email protected]> wrote:

> Hi Burak,
> Thanks for the response. Can you please elaborate on your idea of storing
> the state of the unique ids.
> Do you have any sample code or links I can refer to.
> Thanks
>
> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <[email protected]> wrote:
>
>> Off the top of my head... (Each may have it's own issues)
>>
>> If upstream you add a uniqueId to all your records, then you may use a
>> BloomFilter to approximate if you've seen a row before.
>> The problem I can see with that approach is how to repopulate the bloom
>> filter on restarts.
>>
>> If you are certain that you're not going to reprocess some data after a
>> certain time, i.e. there is no way I'm going to get the same data in 2
>> hours, it may only happen in the last 2 hours, then you may also keep the
>> state of uniqueId's as well, and then age them out after a certain time.
>>
>>
>> Best,
>> Burak
>>
>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>> [email protected]> wrote:
>>
>>> Please share your thoughts.....
>>>
>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>> [email protected]> wrote:
>>>>
>>>>> My streaming application stores lot of aggregations using
>>>>> mapWithState.
>>>>>
>>>>> I want to know what are all the possible ways I can make it
>>>>> idempotent.
>>>>>
>>>>> Please share your views.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> In a Wordcount application which stores the count of all the words
>>>>>> input so far using mapWithState. How do I make sure my counts are not
>>>>>> messed up if I happen to read a line more than once?
>>>>>>
>>>>>> Appreciate your response.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
