Hi Benchao,

I think your suggestion is very reasonable. For most users, enabling compaction by default whenever mini-batch is enabled is the more beneficial approach. However, compaction within mini-batch is a separate matter that we could discuss in the future; it is orthogonal to this discussion. The mini-batch join itself would follow the options 'table.exec.mini-batch.enabled', 'table.exec.mini-batch.allow-latency' and 'table.exec.mini-batch.size'.
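To make the trade-off concrete, here is a minimal Python sketch of what compaction inside one mini-batch means for the audit scenario quoted below. It only folds the single case named in the FLIP text (a +I followed by a matching -D of the same row). The function name `fold_mini_batch`, the `compact` flag and the record layout are hypothetical and chosen for illustration; this is not the operator's actual code.

```python
from typing import List, Tuple

# (row kind, key, value); row kinds are written as they appear in this thread.
Record = Tuple[str, str, int]


def fold_mini_batch(batch: List[Record], compact: bool = True) -> List[Record]:
    """Return the records one mini-batch would hand on to the join.

    Hypothetical helper for illustration; not Flink's operator code.
    """
    if not compact:
        # Compaction disabled: every change is kept, in arrival order.
        return list(batch)

    kept: List[Record] = []
    for kind, key, value in batch:
        insert = ("+I", key, value)
        if kind == "-D" and insert in kept:
            # The delete matches a buffered insert of the same row:
            # drop the most recent such insert and skip the delete.
            idx = len(kept) - 1 - kept[::-1].index(insert)
            del kept[idx]
        else:
            kept.append((kind, key, value))
    return kept


batch = [
    ("+I", "order-1", 10),
    ("+I", "order-2", 20),
    ("-D", "order-1", 10),  # cancels the first insert when compaction is on
    ("+I", "order-3", 30),
]

print(fold_mini_batch(batch, compact=True))
print(fold_mini_batch(batch, compact=False))
```

With compaction on, the insert and delete of order-1 never reach the join, which saves two join invocations but also hides that change from a downstream audit. With compaction off, the batch passes through unchanged.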
> On Jan 16, 2024, at 18:52, Benchao Li <libenc...@apache.org> wrote:
>
> shuai,
>
> Thanks for the explanations, I understand the scenario you described
> now. IIUC, it will be a rather rare case that needs to disable
> "compaction" when mini-batch is enabled, so I won't be against
> introducing it. However, I would suggest enabling "compaction" by
> default (if mini-batch is enabled), which will benefit most use cases.
> Others that have special requirements about the changelog semantics
> (no compaction) can disable compaction themselves. WDYT?
>
>> This is a relatively large optimization that may pose a significant
>> risk of bugs, so I like to keep it from being enabled by default for
>> now.
> @Jingsong has raised an interesting point that for large optimizations
> or new features, we want to have an option for it and disable it by
> default in case of the risk of bugs. I agree with it, mostly.
> Currently there is no standard about whether a change is major or not,
> which means we may run into a situation debating whether a change is
> major or not. Anyway, it's an orthogonal topic to this discussion.
>
> On Tue, Jan 16, 2024 at 13:14, shuai xu <xushuai...@gmail.com> wrote:
>>
>> Hi Benchao,
>>
>> Do you have any other questions about this issue? Also, I would appreciate
>> your thoughts on the proposal to introduce the new option
>> 'table.exec.mini-batch.compact-changes-enabled'. I’m looking forward to your
>> feedback.
>>
>>> On Jan 12, 2024, at 15:01, shuai xu <xushuai...@gmail.com> wrote:
>>>
>>> Suppose we currently have a job that joins two CDC sources after
>>> de-duplicating them and the output is available for audit analysis, and the
>>> user turns off the parameter
>>> "table.exec.deduplicate.mini-batch.compact-changes-enabled" to ensure that
>>> it does not lose update details.
>>> If we don't introduce this parameter,
>>> after the user upgrades the version, some update details may be lost due to
>>> the mini-batch join being enabled by default, resulting in distorted
>>> audit results.
>>>
>>>> On Jan 11, 2024, at 16:19, Benchao Li <libenc...@apache.org> wrote:
>>>>
>>>>> the change might not be expected by the downstream of the job which
>>>>> requires details of changelog
>>>>
>>>> Could you elaborate on this a bit? I've never met such kinds of
>>>> requirements before, I'm curious what is the scenario that requires
>>>> this.
>>>>
>>>> On Thu, Jan 11, 2024 at 13:08, shuai xu <xushuai...@gmail.com> wrote:
>>>>>
>>>>> Thanks for your response, Benchao.
>>>>>
>>>>> Here is my thought on the newly added option.
>>>>> Users' current jobs are running on a version without minibatch join. If
>>>>> the existing option to enable minibatch join is utilized, then when
>>>>> users' jobs are migrated to the new version, the internal behavior of the
>>>>> join operation within the jobs will change. Although the semantics of the
>>>>> changelog emitted by the Join operator is eventual consistency, the
>>>>> change might not be expected by the downstream of the job which requires
>>>>> details of changelog. This newly added option also refers to
>>>>> 'table.exec.deduplicate.mini-batch.compact-changes-enabled'.
>>>>>
>>>>> As for the implementation, the new operator shares the state of the
>>>>> original operator and it merely has an additional minibatch for storing
>>>>> records to do some optimization. The storage remains consistent, and
>>>>> there is only a minor modification to the computational logic.
>>>>>
>>>>> Best,
>>>>> Xu Shuai
>>>>>
>>>>>> On Jan 10, 2024, at 22:56, Benchao Li <libenc...@apache.org> wrote:
>>>>>>
>>>>>> Thanks shuai for driving this, mini-batch Join is a very useful
>>>>>> optimization, +1 for the general idea.
>>>>>>
>>>>>> Regarding the configuration
>>>>>> "table.exec.stream.join.mini-batch-enabled", I'm not sure it's really
>>>>>> necessary.
>>>>>> The semantics of the changelog emitted by the Join operator is
>>>>>> eventual consistency, so there is not much difference between the original
>>>>>> Join and mini-batch Join in this respect. Besides, introducing more
>>>>>> options would make it more complex for users, harder to understand and
>>>>>> maintain, which we should be careful about.
>>>>>>
>>>>>> One thing about the implementation, could you make the new operator
>>>>>> share the same state definition with the original one?
>>>>>>
>>>>>> On Wed, Jan 10, 2024 at 21:23, shuai xu <xushuai...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi devs,
>>>>>>>
>>>>>>> I’d like to start a discussion on FLIP-415: Introduce a new join
>>>>>>> operator to support minibatch[1].
>>>>>>>
>>>>>>> Currently, when performing cascading joins in Flink, there is a
>>>>>>> pain point of record amplification. Every record the join operator
>>>>>>> receives triggers a join process. However, if +I and -D records match,
>>>>>>> they could be folded to save two join processes. Besides,
>>>>>>> -U +U records might output 4 records, two of which are
>>>>>>> redundant, when encountering an outer join.
>>>>>>>
>>>>>>> To address this issue, this FLIP introduces a new
>>>>>>> MiniBatchStreamingJoinOperator to achieve batch processing which could
>>>>>>> reduce the number of redundant output messages and avoid unnecessary
>>>>>>> join processes.
>>>>>>> A new option is added to control the operator to avoid influencing
>>>>>>> existing jobs.
>>>>>>>
>>>>>>> Please find more details in the FLIP wiki document [1]. Looking
>>>>>>> forward to your feedback.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-415%3A+Introduce+a+new+join+operator+to+support+minibatch
>>>>>>>
>>>>>>> Best,
>>>>>>> Xu Shuai
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Best,
>>>>>> Benchao Li
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Best,
>>>> Benchao Li
>>>
>>> Best,
>>> Xu Shuai
>>
>>
>> Best,
>> Xu Shuai
>
>
>
> --
>
> Best,
> Benchao Li

Best,
Xu Shuai