Thanks, Burak. But with a BloomFilter, won't I be getting false positives?

On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz <brk...@gmail.com> wrote:

> I noticed that 1 wouldn't be a problem, because you'll save the
> BloomFilter in the state.
>
> For 2, you would keep a Map of UUIDs to the timestamp of when you saw
> them. If the UUID exists in the map, then you wouldn't increase the count.
> If the timestamp of a UUID expires, you would remove it from the map. The
> reason we remove from the map is to keep a bounded amount of space. It'll
> probably take a lot more space than the BloomFilter, though, depending on
> your data volume.
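
The map-of-timestamps idea described above could be sketched roughly as follows. This is plain Python showing only the update logic, not Spark code; the names `update` and `TTL_SECONDS`, the use of string UUIDs, and the 2-hour window are assumptions made for illustration. In a Spark job this logic would presumably sit inside the function passed to mapWithState, with one such state per word.

```python
from typing import Dict, Tuple

# Hypothetical per-word state: (count, {uuid: timestamp of first sighting}).
State = Tuple[int, Dict[str, float]]

# Assumed retention window; the thread mentions 2 hours as an example.
TTL_SECONDS = 2 * 60 * 60


def update(state: State, uuid: str, now: float, ttl: float = TTL_SECONDS) -> State:
    """Return the new state after seeing one record carrying `uuid`."""
    count, seen = state
    # Age out UUIDs whose timestamp has expired, to keep the map bounded.
    seen = {u: ts for u, ts in seen.items() if now - ts < ttl}
    if uuid not in seen:
        # First sighting inside the window: count it and remember when.
        count += 1
        seen[uuid] = now
    return count, seen
```

Scanning the whole map on every update keeps the sketch short but costs time proportional to the map size; a real job would likely expire entries less often. Note also that a record replayed after its UUID has aged out is counted again, which is why this approach relies on duplicates only arriving within the window.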
>
> On Wed, Jan 25, 2017 at 11:24 AM, shyla deshpande <
> deshpandesh...@gmail.com> wrote:
>
>> In the previous email you gave me 2 solutions
>> 1. Bloom filter --> problem in repopulating the bloom filter on restarts
>> 2. keeping the state of the unique ids
>>
>> Please elaborate on 2.
>>
>>
>>
>> On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>
>>> I don't have any sample code, but on a high level:
>>>
>>> My state would be: (Long, BloomFilter[UUID])
>>> In the update function, my value will be the UUID of the record, since
>>> the word itself is the key.
>>> I'll ask my BloomFilter if I've seen this UUID before. If not, I increase
>>> the count and also add the UUID to the filter.
>>>
>>> Does that make sense?
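
A rough sketch of the (count, BloomFilter) state described above, in plain Python rather than Spark. The `BloomFilter` class here is a tiny hand-rolled stand-in written for illustration, not any library's implementation, and `update_count`, `num_bits`, and `num_hashes` are hypothetical names and sizes.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; not a production implementation."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def update_count(state, uuid: str):
    """state is (count, BloomFilter), mirroring (Long, BloomFilter[UUID])."""
    count, seen = state
    if not seen.might_contain(uuid):
        # Definitely a new UUID: count the record and remember its UUID.
        count += 1
        seen.add(uuid)
    return count, seen
```

With this structure, a false positive means a genuinely new UUID is mistaken for one already seen, so its record is skipped. The count can therefore come out slightly low, but a replayed record is never counted twice.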
>>>
>>>
>>> On Wed, Jan 25, 2017 at 9:28 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
>>>> Hi Burak,
>>>> Thanks for the response. Can you please elaborate on your idea of
>>>> storing the state of the unique ids?
>>>> Do you have any sample code or links I can refer to?
>>>> Thanks
>>>>
>>>> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>
>>>>> Off the top of my head... (Each may have its own issues)
>>>>>
>>>>> If upstream you add a uniqueId to all your records, then you may use a
>>>>> BloomFilter to approximate if you've seen a row before.
>>>>> The problem I can see with that approach is how to repopulate the
>>>>> bloom filter on restarts.
>>>>>
>>>>> If you are certain that you're not going to reprocess some data after
>>>>> a certain time, i.e. the same data can only show up again within the
>>>>> last 2 hours and never later than that, then you may also keep the
>>>>> state of the uniqueIds, and then age them out after that time.
>>>>>
>>>>>
>>>>> Best,
>>>>> Burak
>>>>>
>>>>> On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <
>>>>> deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> Please share your thoughts.....
>>>>>>
>>>>>> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <
>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> My streaming application stores a lot of aggregations using
>>>>>>>> mapWithState.
>>>>>>>>
>>>>>>>> I want to know all the possible ways I can make it idempotent.
>>>>>>>>
>>>>>>>> Please share your views.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>>>>>>> deshpandesh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> In a word count application which stores the count of all the
>>>>>>>>> words input so far using mapWithState, how do I make sure my
>>>>>>>>> counts are not messed up if I happen to read a line more than once?
>>>>>>>>>
>>>>>>>>> Appreciate your response.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
