Hi Chengzhi,

currently, it's not possible to process both a stream and a (dynamically updated) 
dataset in a single job. I'll provide you with some workarounds, all of which 
assume that the file of active test names is not too large.

(1) You can define your own stream source[1] that is aware of file updates and 
keeps reading the input file as a stream (Stream B as you described). Special 
records can be inserted to mark the start and end of an update. Note that 
instead of using the `keyBy()` method, Stream B should be broadcast, while 
Stream A can be partitioned arbitrarily. With this method, you can clear and 
rebuild the state according to the start/end markers.
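
A rough (and simplified) sketch of this approach could look like the one below. 
The record types (TestEvent, ControlRecord), the custom sources, and the getter 
names are placeholders, and the active set is kept in plain JVM fields rather 
than in Flink's fault-tolerant state, just to illustrate the idea:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;
import java.util.Set;

// Stream A: the actual test events (placeholder source)
DataStream<TestEvent> streamA = env.addSource(new TestEventSource());
// Stream B: active test names from the file, plus start/end marker records
DataStream<ControlRecord> streamB = env.addSource(new ActiveTestFileSource());

streamA
    .connect(streamB.broadcast())   // every parallel task sees the whole Stream B
    .flatMap(new CoFlatMapFunction<TestEvent, ControlRecord, TestEvent>() {

        // set being rebuilt while an update is in progress
        private final Set<String> pending = new HashSet<>();
        // last complete snapshot of active test names
        private Set<String> active = new HashSet<>();

        @Override
        public void flatMap1(TestEvent event, Collector<TestEvent> out) {
            if (active.contains(event.getTestName())) {
                out.collect(event);                // the test is still active
            }
        }

        @Override
        public void flatMap2(ControlRecord rec, Collector<TestEvent> out) {
            if (rec.isStartMarker()) {
                pending.clear();                   // an update begins: start rebuilding
            } else if (rec.isEndMarker()) {
                active = new HashSet<>(pending);   // update finished: swap in the snapshot
            } else {
                pending.add(rec.getTestName());    // an active test name from the file
            }
        }
    });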

(2) You could also treat the file of active test names as external state and 
set processing time timers[2] in a ProcessFunction[3] to refresh it regularly 
(e.g., every minute).
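
Just as a sketch (not tested), such a ProcessFunction could keep the active 
names in a transient set per task and re-read the file when a processing time 
timer fires. The file path, the TestEvent type, and the getter are 
placeholders; note that the function must be applied to a keyed stream (e.g., 
streamA.keyBy(...).process(...)) since timers are only available on keyed 
streams:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class ActiveTestFilter extends ProcessFunction<TestEvent, TestEvent> {

    private static final long REFRESH_INTERVAL = 60_000L;  // refresh every minute

    private transient Set<String> activeNames;  // the external "state", kept per task
    private transient long lastRefresh;

    @Override
    public void open(Configuration parameters) throws Exception {
        activeNames = readActiveNames();        // initial load of the file
        lastRefresh = Long.MIN_VALUE;
    }

    @Override
    public void processElement(TestEvent event, Context ctx, Collector<TestEvent> out)
            throws Exception {
        // simplified: registers a timer per element; a real job would guard this
        long nextRefresh = ctx.timerService().currentProcessingTime() + REFRESH_INTERVAL;
        ctx.timerService().registerProcessingTimeTimer(nextRefresh);

        if (activeNames.contains(event.getTestName())) {
            out.collect(event);                 // the test is still active
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<TestEvent> out)
            throws Exception {
        // timers fire per key, so avoid re-reading the file more than once per interval
        if (timestamp - lastRefresh >= REFRESH_INTERVAL) {
            activeNames = readActiveNames();
            lastRefresh = timestamp;
        }
    }

    private Set<String> readActiveNames() throws IOException {
        // placeholder: load the file of active test names from a shared location
        return new HashSet<>(Files.readAllLines(Paths.get("/path/to/active_tests.txt")));
    }
}

// usage: streamA.keyBy(e -> e.getTestName()).process(new ActiveTestFilter())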

IMO, watermarks may not work as expected for your use case. Besides, since the 
file is updated unpredictably, it's hard to guarantee the precision of the 
results.

Hope that helps,
Xingcan

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/datastream_api.html#data-sources
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/process_function.html#timers
[3] https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/process_function.html#the-processfunction

> On May 14, 2018, at 10:05 PM, Chengzhi Zhao <w.zhaocheng...@gmail.com> wrote:
> 
> Hi Xingcan,
> 
> Thanks for your response. To give you more background about my use case, I 
> have Stream B with some split test names, and Stream A will be the actual 
> tests. I want to have Stream A connect to Stream B to figure out whether a 
> test is still active or not. I am not sure this is the right way to do it: my 
> watermark is based on event time with a 15 min delay, and onTimer will emit 
> the records after 15 mins. I was wondering if there is a way to purge the 
> state of the entire Stream B so I can get all the active tests, since the 
> file includes all the updated split test names and I can refresh the lookup.
> 
> Also, I am not sure if I am using the right operator here, or if there is a 
> way to share a variable globally so I can just perform a filter on Stream A.
> Please let me know your thoughts, and thanks for your suggestions again.
> 
> Regards,
> Chengzhi
> 
> On Sat, May 12, 2018 at 8:55 PM, Xingcan Cui <xingc...@gmail.com> wrote:
> Hi Chengzhi,
> 
> you said that Stream B, which comes from a file, will be updated 
> unpredictably. I wonder if you could share more about how you judge whether 
> an item (from Stream A, I suppose) is not in the file, and what watermark 
> generation strategy you chose?
> 
> Best,
> Xingcan
> 
> > On May 12, 2018, at 12:48 AM, Chengzhi Zhao <w.zhaocheng...@gmail.com> wrote:
> > 
> > Hi there,
> > 
> > I have a use case to check for active IDs. There are two streams and I 
> > connect them: one has the actual data (Stream A) and the other one is for 
> > lookup purposes (Stream B). I am getting Stream B as a file which includes 
> > all active IDs, so inactive IDs would not show up on this list. I tried to 
> > use watermarks to clean up the state of inactive IDs, but the Stream B 
> > updates are unpredictable, so I want to keep everything in state until I 
> > find that an item is no longer in that file. 
> > 
> > Please suggest the best way to implement this in Flink. Thanks in advance 
> > for your help.
> > 
> > Regards,
> > Chengzhi
> > 
> > 
> 
> 