Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-17 Thread shuai xu
Hi all, Thank you for the valuable input. Based on the current discussion, the minibatch join will follow the three existing options 'table.exec.mini-batch.enabled', 'table.exec.mini-batch.allow-latency' and 'table.exec.mini-batch.size'. As for the compaction within the minibatch

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-17 Thread shuai xu
Hi Benchao, I think your suggestion is very reasonable. For most users, having compaction enabled by default when mini-batch is enabled is the more beneficial approach. However, I think compaction within minibatch is another topic we could discuss in the future, which is orthogonal to

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-16 Thread Benchao Li
shuai, Thanks for the explanations, I understand the scenario you described now. IIUC, needing to disable "compaction" when mini-batch is enabled will be a rather rare case, so I won't be against introducing it. However, I would suggest enabling "compaction" by default (if mini-batch e

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-15 Thread shuai xu
Hi Benchao, Do you have any other questions about this issue? Also, I would appreciate your thoughts on the proposal to introduce the new option 'table.exec.mini-batch.compact-changes-enabled'. I'm looking forward to your feedback. > On Jan 12, 2024, at 15:01, shuai xu wrote: > > Suppose we currently have

Re: Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-15 Thread Jane Chan
Hi shuai, Thanks for your clarification. The internal behavior of minibatch processing is not well-defined now. I think you're right on this point. If you change the goal of the newly introduced configuration to address this issue, then I'm ok with it. Best, Jane On Mon, Jan 15, 2024 at 2:27

Re:Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-14 Thread Xuyang
Hi, shuai. Thanks for this explanation. This scenario sounds reasonable to me. I agree that we need to split the minibatch behavior into two types of options: 1. Whether to enable minibatch to buffer batch data; 2. Whether to compress the changelog data while saving the batch, and merge the data

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-12 Thread shuai xu
Hi all. The point I want to highlight is that minibatch join could potentially yield an incomplete changelog, which existing jobs are not expected to produce. For example, a job that joins two CDC sources after de-duplicating them, and whose output is used for audit analysis, could not accept

Re: Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-12 Thread Jane Chan
Hi shuai, Thanks for the update! Regarding the newly introduced configuration, I share the same concern as Benchao and Xuyang. First of all, in most cases, the fact that users choose to enable mini-batch configuration indicates they are aware of the trade-off between throughput and completeness

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-11 Thread Jingsong Li
Hi all, This is a relatively large optimization that may pose a significant risk of bugs, so I would like to keep it from being enabled by default for now. Best, Jingsong On Fri, Jan 12, 2024 at 3:01 PM shuai xu wrote: > > Suppose we currently have a job that joins two CDC sources after > de-duplica

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-11 Thread shuai xu
Suppose we currently have a job that joins two CDC sources after de-duplicating them and the output is available for audit analysis, and the user turns off the parameter "table.exec.deduplicate.mini-batch.compact-changes-enabled" to ensure that it does not lose update details. If we don't introd

Re:Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-11 Thread Xuyang
Hi, Xu Shuai. Thanks for driving this FLIP. The CDC message amplification of cascade join has always been a problem for users. Judging from the Nexmark results, this optimization is very meaningful. I have the same question as Benchao: why can't we use minibatch join as the default behavio

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-11 Thread Benchao Li
> the change might not be supposed for the downstream of the job which requires > details of changelog Could you elaborate on this a bit? I've never come across this kind of requirement before, and I'm curious what scenario requires this. On Thu, Jan 11, 2024 at 13:08, shuai xu wrote: > > Thanks for your re

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-10 Thread shuai xu
Hi Jane, Thanks for your reminder! I missed this. I updated the FLIP with the UML of MiniBatchStreamingJoinOperator and linked my POC implementation as a reference. Both are in the Proposed Changes section. Best, Xu Shuai > On Jan 11, 2024, at 11:18, Jane Chan wrote: > > Hi shuai, > > Thanks

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-10 Thread shuai xu
Thanks for your response, Benchao. Here is my thought on the newly added option. Users' current jobs are running on a version without minibatch join. If the existing option to enable minibatch join is utilized, then when users' jobs are migrated to the new version, the internal behavior of the j

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-10 Thread Jane Chan
Hi shuai, Thanks for initiating the discussion. The mini-batch join optimization is very helpful, particularly for optimizing outer join conditions in CDC sources and handling cascade joins. And +1 for the proposal. However, I don't see any details on the proposed "MiniBatchStreamingJoinOperator"

Re: [DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-10 Thread Benchao Li
Thanks shuai for driving this, mini-batch Join is a very useful optimization, +1 for the general idea. Regarding the configuration "table.exec.stream.join.mini-batch-enabled", I'm not sure it's really necessary. The semantics of the changelog emitted by the Join operator are eventual consistency, so the

[DISCUSS] FLIP-415: Introduce a new join operator to support minibatch

2024-01-10 Thread shuai xu
Hi devs, I’d like to start a discussion on FLIP-415: Introduce a new join operator to support minibatch[1]. Currently, when performing cascading joins in Flink, there is a pain point of record amplification. Every record the join operator receives triggers the join process. However, if reco