Hi Jing, Thank you for the suggestion. Yes, we can extend it to support null if in the future we find any use-case for this flexibility.
Best, Dong On Thu, Jul 6, 2023 at 7:55 PM Jing Ge <j...@ververica.com> wrote: > Hi Dong, > > one scenario I could imagine is that users could enable global object > reuse features but force deep copy for some user defined specific functions > because of any limitations. But that is only my gut feeling. And agree, we > could keep the solution simple for now as FLIP described and upgrade to 3VL > once there are such real requirements that are rising. > > Best regards, > Jing > > On Thu, Jul 6, 2023 at 12:30 PM Dong Lin <lindon...@gmail.com> wrote: > >> Hi Jing, >> >> Thank you for the detailed explanation. Please see my reply inline. >> >> On Thu, Jul 6, 2023 at 3:17 AM Jing Ge <j...@ververica.com> wrote: >> >>> Hi Xuannan, Hi Dong, >>> >>> Thanks for your clarification. >>> >>> @Xuannan >>> >>> A Jira ticket has been created for the doc update: >>> https://issues.apache.org/jira/browse/FLINK-32546 >>> >>> @Dong >>> >>> I don't have a concrete example. I just thought about it from a >>> conceptual or pattern's perspective. Since we have 1. coarse-grained global >>> switch(CGS as abbreviation), i.e. the pipeline.object-reuse and 2. >>> fine-grained local switch(FGS as abbreviation), i.e. the >>> objectReuseCompliant variable for specific operators/functions, there will >>> be the following patterns with appropriate combinations: >>> >>> pattern 1: coarse-grained switch only. Local object reuse will be >>> controlled by the coarse-grained switch: >>> 1.1 cgs == true -> local object reused enabled >>> 1.2 cgs == true -> local object reused enabled >>> 1.3 cgs == false -> local object reused disabled, i.e. deep copy enabled >>> 1.4 cgs == false -> local object reused disabled, i.e. deep copy enabled >>> >>> afaiu, this is the starting point. I wrote 4 on purpose to make the >>> regression check easier. We can consider it as the combinations with >>> cgs(true/false) and fgs(true/false) while fgs is ignored. >>> >>> Now we introduce fine-grained switch. There will be two patterns: >>> >>> pattern 2: fine-grained switch over coarse-grained switch. >>> Coarse-grained switch will be ignored when the local fine-grained switch >>> has different value: >>> 2.1 cgs == true and fgs == true -> local object reused enabled >>> 2.2 cgs == true and fgs == false -> local object reused disabled, i.e. >>> deep copy enabled >>> 2.3 cgs == false and fgs == true -> local object reused enabled >>> 2.4 cgs == false and fgs == false -> local object reused disabled, i.e. >>> deep copy enabled >>> >>> cgs is actually ignored. >>> >>> Current FLIP is using a slightly different pattern: >>> >>> pattern 3: fine-grained switch over coarse-grained switch only when >>> coarse-grained switch is off, i..e cgs OR fgs: >>> 3.1 cgs == true and fgs == true -> local object reused enabled >>> 3.2 cgs == true and fgs == false -> local object reused enabled >>> 3.3 cgs == false and fgs == true -> local object reused enabled >>> 3.4 cgs == false and fgs == false -> local object reused disabled, i.e. >>> deep copy enabled >>> >>> All of those patterns are rational and each has different focus. It >>> depends on the real requirement to choose one of them. >>> >>> As we can see, if fgs is using 2VL, there is a regression between >>> pattern 1 and pattern 2. You are absolutely right in this case. That's why >>> I suggested 3VL, i.e. fgs will have triple values: true, false, >>> unknown(e.g. null) >>> >>> pattern 4: 3VL fgs with the null as init value (again, there are just >>> two combination, I made it 4 on purpose): >>> 4.1 cgs == true and fgs == null -> local object reused enabled >>> 4.2 cgs == true and fgs == null -> local object reused enabled >>> 4.3 cgs == false and fgs == null -> local object reused disabled, i.e. >>> deep copy enabled >>> 4.4 cgs == false and fgs == null -> local object reused disabled, i.e. >>> deep copy enabled >>> >>> Since the default value of fgs is null, pattern 4 is backward compatible >>> with pattern 1, which means no regression. >>> >>> Now we will set value to fgs and follow the pattern 2: >>> 4.5 cgs == true and fgs == true -> local object reused enabled >>> 4.6 cgs == true and fgs == false -> local object reused disabled, i.e. >>> deep copy enabled >>> 4.7 cgs == false and fgs == true -> local object reused enabled >>> 4.8 cgs == false and fgs == false -> local object reused disabled, i.e. >>> deep copy enabled >>> >>> Pattern 4 contains pattern 3 with the following combinations(force >>> enabling local object reuse): >>> 4.5 cgs == true and fgs == true -> local object reused enabled >>> 4.2 cgs == true and fgs == null -> local object reused enabled >>> 4.7 cgs == false and fgs == true -> local object reused enabled >>> 4.4 cgs == false and fgs == null -> local object reused disabled, i.e. >>> deep copy enabled >>> >>> Comparing pattern 4 to pattern 3, user will have one additional >>> flexibility to control(force disabling) the local object reuse capability >>> because of 3VL, i.e. 4.2+4.6 vs. 3.2. >>> >> >> I think you are suggesting to allow the user setting fgs to null so that >> "user will have one additional flexibility to control(force disabling) the >> local object reuse capability". >> >> In general, an API that only allows false/true is a bit more complex to >> implement than an API that allows false/true/null. All things being equal, >> I believe it is preferred to have a simpler public API. >> >> I understand you are coming from a conceptual perspective and trying to >> make it similar to hierarchical RBAC. However, after thinking through this, >> I still could not find a use-case where we actually need this flexibility. >> In particular, for cases where a user has explicitly configured >> pipeline.object-reuse to true, that means the user already knows (or takes >> the responsibility of ensuring) that correctness can be achieved, why would >> the user want to explicitly disable the object-reuse for an operator? >> >> Regards, >> Dong >> >> >>> >>> It is commonly used in the hierarchical RBAC to enable more fine-grained >>> access control of sub role. >>> >>> I hope I have been able to explain myself clearly. Looking forward to >>> your feedback. >>> >>> Best regards, >>> Jing >>> >>> >>> >>> On Wed, Jul 5, 2023 at 12:47 PM Dong Lin <lindon...@gmail.com> wrote: >>> >>>> Hi Jing, >>>> >>>> Thanks for the comments! Please find below my comments, which are based >>>> on the offline discussion with Xuannan. >>>> >>>> On Wed, Jul 5, 2023 at 1:36 AM Jing Ge <j...@ververica.com> wrote: >>>> >>>>> Hi Xuannan, Hi Dong >>>>> >>>>> Thanks for the Proposal! After reading the FLIP, I'd like to ask some >>>>> questions: >>>>> >>>>> 1. Naming convention for boolean variables. It is recommended to >>>>> follow JavaBean [1], i.e. objectReuseCompliant as the variable name with >>>>> isObjectReuseCompliant() and setObjectReuseCompliant() as the methods' >>>>> name. >>>>> >>>>> >>>> Good point. We have updated the FLIP as suggested. >>>> >>>> >>>>> >>>>> 2. >>>>> >>>>> - >>>>> >>>>> *If pipeline.object-reuse is set to true, records emitted by this >>>>> operator will be re-used.* >>>>> - >>>>> >>>>> *Otherwise, if getIsObjectReuseCompliant() returns true, records >>>>> emitted by this operator will be re-used.* >>>>> - >>>>> >>>>> *Otherwise, records emitted by this operator will be deep-copied >>>>> before being given to the next operator in the chain.* >>>>> >>>>> >>>>> If I understand you correctly, the hard coding objectReusedCompliant >>>>> should have higher priority over the configuration, the checking logic >>>>> should be: >>>>> >>>>> - >>>>> >>>>> *If getIsObjectReuseCompliant() returns true, records emitted by >>>>> this operator will be re-used.* >>>>> - >>>>> >>>>> *Otherwise, if pipeline.object-reuse is set to true, records >>>>> emitted by this operator will be re-used.* >>>>> - >>>>> >>>>> *Otherwise, records emitted by this operator will be deep-copied >>>>> before being given to the next operator in the chain.* >>>>> >>>>> >>>>> The results are the same but the checking logics are different, but >>>>> there are some additional thoughts, which lead us to the next question. >>>>> >>>> >>>>> >>>> >>>>> 3. Current design lets specific operators enable object reuse and >>>>> ignore the global config. There could be another thought, on the contrary: >>>>> if an operator has hard coded the objectReuseCompliant as false, i.e. >>>>> disable object reuse on purpose, records should not be reused even if the >>>>> global config pipeline.object-reused is set to true, which turns out that >>>>> the objectReuseCompliant could be a triple value logic: ture: force object >>>>> reusing; false: force deep-copying; unknown: depends on >>>>> pipeline.object-reuse config. >>>>> >>>> >>>> With the current proposal, if pipeline.object-reused == true and >>>> operatorA's objectReuseCompliant == false, Flink will not deep-copy >>>> operatorA's output. I think you are suggesting changing the behavior such >>>> that Flink should deep-copy the operatorA's output. >>>> >>>> Could you explain what is the advantage of this approach compared to >>>> the approach described in the FLIP? >>>> >>>> My concern with this approach is that it can cause performance >>>> regression. This is an operator's objectReuseCompliant will be false by >>>> default unless it is explicitly overridden. For those jobs which are >>>> currently configured with pipeline.object-reused = true, these jobs will >>>> likely start to have lower performance (due to object deep-copy) after >>>> upgrading to the newer Flink version. >>>> >>>> Best, >>>> Dong >>>> >>>> >>>>> >>>>> Best regards, >>>>> Jing >>>>> >>>>> >>>>> [1] https://en.wikipedia.org/wiki/JavaBeans >>>>> >>>>> On Mon, Jul 3, 2023 at 4:25 AM Xuannan Su <suxuanna...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> Dong(cc'ed) and I are opening this thread to discuss our proposal to >>>>>> add operator attribute to allow operator to specify support for >>>>>> object-reuse [1]. >>>>>> >>>>>> Currently, the default configuration for pipeline.object-reuse is set >>>>>> to false to avoid data corruption, which can result in suboptimal >>>>>> performance. We propose adding APIs that operators can utilize to >>>>>> inform the Flink runtime whether it is safe to reuse the emitted >>>>>> records. This enhancement would enable Flink to maximize its >>>>>> performance using the default configuration. >>>>>> >>>>>> Please refer to the FLIP document for more details about the proposed >>>>>> design and implementation. We welcome any feedback and opinions on >>>>>> this proposal. >>>>>> >>>>>> Best regards, >>>>>> >>>>>> Dong and Xuannan >>>>>> >>>>>> [1] >>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=255073749 >>>>>> >>>>>