Hi everybody,


As part of the upcoming 1.0 release we are stabilizing Flink's APIs.

I would like to start a discussion about removing the Combinable annotation
from the DataSet API.



# The Current State:

The DataSet API offers two interface for combine functions, CombineFunction
and GroupCombineFunction, which can be added to a GroupReduceFunctions
(GRF).


However, implementing a combine interface does not make a GRF combinable.
In addition, the GRF class needs to be annotated with the Combinable
annotation. The RichGroupReduceFunction has a default implementation of
combine, which forwards the combine parameters to the reduce method. This
default implementation is not used, unless the class that extends RGRF has
the Combinable annotation.



In addition to the Combinable annotation, the DataSet API
GroupReduceOperator features the setCombinable(boolean) method to override
a missing or existing Combinable annotation.



# My Proposal:

I propose to remove the Combinable annotation because it is not required
and complicates the definition of combiners. Instead, the combine method of
a GroupReduceFunction should be executed if it implements one of the
combine function interfaces. This would require to remove the default
combine implementation of the RichGroupReduceFunction as well.

This would be an API breaking change and has a few implications.



# Possible Implementation:

There are a few ways to implement this change.

- Remove Combinable annotation or mark it deprecated (and keep effect)

- Remove default combine method from RichGroupReduceFunction or deprecate it



Approach 1:
- Remove Combinable annotation
- Remove default combine() method from RichGroupReduceFunction
- Effect:
    - All RichGroupReduceFunctions that do either have the Combinable
annotation or implement the combine method do not compile anymore
    - GroupReduceFunctions that have the Combinable annotation do not
compile anymore
    - GroupReduceFunctions that implement a combine interface without
having the annotation (and not being set to combinable during program
construction) will execute the previously not executed combine method. This
might change the behavior of the program. In worst case, the program might
silently produce wrong results (or crash) if the combiner implementation
was faulty. In best case, the program executes faster.



Approach 2:
- As Approach 1
- In addition extend both combine interfaces with a deprecated marker
method. This will ensure that all functions that implement a combinable
interface do not compile anymore and need to be fixed. This could prevent
the silent failure as in Approach 1, but would also cause an additional API
breaking change once the deprecated marker method is removed again.



Approach 3:
- Mark Combinable annotation deprecated
- Mark combine() method in RichGroupReduceFunction as deprecated
- Effect:
    - There'll be a couple of deprecation warnings.
    - We face the same problem with silent failures as in Approach 1.
    - We have to check if RichGroupReduceFunction's override combine or not
(can be done with reflection). If the method is not overridden we do not
execute it (unless there is a Combinable annotation) and we are fine. If it
is overridden and no Combinable annotation has been defined, we have the
same problem with silent failures as before.
    - After we remove the deprecated annotation and method, we have the
same effect as with Approach 1.



There are more alternatives, but these are the most viable, IMO.



I think, if we want to remove the combinable annotation, we should do it
now.

Given the three options, would go for Approach 1. Yes, breaks a lot of code
and yes there is the possibility of computing incorrect results. Approach 2
is safer but would mean another API breaking change in the future. Approach
3 comes with fewer breaking changes but has the same problem of silent
failures.

IMO, the breaking API changes of Approach 1 are even desirable because they
will make users aware that this feature changed.



What do you think?



Cheers, Fabian

Reply via email to