I noticed that the ticket to remove deprecated DataStream#fold() [1] has been 
created but not yet reach an agreement or assigned.

Actually fold related function and state descriptions has been deprecated for 
long than 3 years, and I think it's okay to remove such kind of state now.

[1] https://issues.apache.org/jira/browse/FLINK-19035

Best,
Yun Tang
________________________________
From: Aljoscha Krettek <aljos...@apache.org>
Sent: Thursday, August 27, 2020 16:45
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] Removing deprecated methods from DataStream API

Did you consider DataStream.project() yet? In general I think we should
remove most of the relational-ish methods from DataStream. More
candidates in this set of methods would be the tuple index/expression
methods for aggregations like min/max etc...

Aljoscha

On 25.08.20 20:52, Konstantin Knauf wrote:
> I would argue that the guarantees of @Public methods that became
> ineffective were broken when they became ineffective (and were deprecated).
>
>     - ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
>     - ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
>
> Removing these methods seems like the better of two evils to me as it shows
> users that they have been using no-ops for some time.
>
> On Thu, Aug 20, 2020 at 10:50 AM Stephan Ewen <se...@apache.org> wrote:
>
>> We have removed some public methods in the past, after a careful
>> deprecation period, if they were not well working any more.
>>
>> The sentiment I got from users is that careful cleanup is in fact
>> appreciated, otherwise things get confusing over time (the deprecated
>> methods cause noise in the API).
>> Still, we need to be very careful here.
>>
>> I would suggest to
>>    - start with the non-public breaking methods
>>    - remove fold() (very long deprecated)
>>    - remove split() buggy
>>
>> Removing the env.socketStream() and env.fileStream() methods would
>> probably be good, too. They are very long deprecated and they don't work
>> well (with checkpoints) and the sources are the first thing a user needs to
>> understand when starting with Flink, so removing noise there is super
>> helpful.
>>
>>
>> On Thu, Aug 20, 2020 at 8:53 AM Dawid Wysakowicz <dwysakow...@apache.org>
>> wrote:
>>
>>> Hey Till,
>>>
>>> You've got a good point here. Removing some of the methods would cause
>>> breaking the stability guarantees. I do understand if we decide not to
>>> remove them for that reason, let me explain though why I am thinking it
>>> might make sense to remove them already. First of all I am a bit afraid it
>>> might take a long time before we arrive at the 2.0 version. We have not
>>> ever discussed that in the community. At the same time a lot of the methods
>>> already don't work or are buggy, and we do not fix them any more.
>>>
>>> Methods which removing would not break the Public guarantees:
>>>
>>>     - ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
>>>     - RuntimeContext#getAllAccumulators (deprecated in 0.10)
>>>     - ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
>>>     - 
>>> StreamExecutionEnvironment#setNumberOfExecutionRetries/getNumberOfExecutionRetries
>>>     (not the equivalent in the ExecutionConfig)
>>>     - StreamExecutionEnvironment#setStateBackend(AbstractStateBackend)
>>>     (deprecated in 1.5)
>>>
>>> Methods which removing would break the Public guarantees:
>>>
>>> which have no effect:
>>>
>>>     - ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
>>>     - ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
>>>
>>> which are buggy or discouraged and thus we do not support fixing them:
>>>
>>>     - DataStream#split (deprecated in 1.8)
>>>     - DataStream#fold and all related classes and methods such as
>>>     FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in
>>>     1.3/1.4)
>>>
>>> The methods like:
>>>
>>>     - 
>>> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...),
>>>
>>>     - methods in (Connected)DataStream that specify keys as either
>>>     indices or field names
>>>     -
>>>     ExecutionConfig#setNumberOfExecutionRetries/getNumberOfExecutionRetries
>>>
>>> should be working just fine and I feel the least eager to remove those.
>>>
>>> I'd suggest I will open PRs for removing the methods that will not cause
>>> breakage of the Public guarantees as the general feedback was rather
>>> positive. For the rest I do understand the resentment to do so and will not
>>> do it in the 1.x branch. Still I think it is valuable to have the
>>> discussion.
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>>
>>> On 18/08/2020 09:26, Till Rohrmann wrote:
>>>
>>> Having looked at the proposed set of methods to remove I've noticed that
>>> some of them are actually annotated with @Public. According to our
>>> stability guarantees, only major releases (1.0, 2.0, etc.) can break APIs
>>> with this annotation. Hence, I believe that we cannot simply remove them
>>> unless the community decides to change the stability guarantees we give or
>>> by making the next release a major release (Flink 2.0).
>>>
>>> Cheers,
>>> Till
>>>
>>> On Tue, Aug 18, 2020 at 5:57 AM Yun Gao <yungao...@aliyun.com> wrote:
>>>
>>>> +1 for removing the methods that are deprecated for a while & have
>>>> alternative methods.
>>>>
>>>> One specific thing is that if we remove the DataStream#split, do we
>>>> consider enabling side-output in more operators in the future ? Currently
>>>> it should be only available in ProcessFunctions, but not available to other
>>>> commonly used UDF like Source or AsyncFunction[1].
>>>>
>>>> One temporary solution occurs to me is to add a ProcessFunction after
>>>> the operators want to use side-output. But I think the solution is not very
>>>> direct to come up with and if it really works we might add it to the
>>>> document of side-output.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-7954
>>>>
>>>> Best,
>>>>   Yun
>>>>
>>>> ------------------Original Mail ------------------
>>>> *Sender:*Kostas Kloudas <kklou...@gmail.com>
>>>> *Send Date:*Tue Aug 18 03:52:44 2020
>>>> *Recipients:*Dawid Wysakowicz <dwysakow...@apache.org>
>>>> *CC:*dev <dev@flink.apache.org>, user <u...@flink.apache.org>
>>>> *Subject:*Re: [DISCUSS] Removing deprecated methods from DataStream API
>>>>
>>>>> +1 for removing them.
>>>>>
>>>>>
>>>>>
>>>>>  From a quick look, most of them (not all) have been deprecated a long
>>>>> time ago.
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Kostas
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 17, 2020 at 9:37 PM Dawid Wysakowicz wrote:
>>>>>
>>>>>>
>>>>>
>>>>>> @David Yes, my idea was to remove any use of fold method and all
>>>>> related classes including WindowedStream#fold
>>>>>
>>>>>>
>>>>>
>>>>>> @Klou Good idea to also remove the deprecated enableCheckpointing() &
>>>>> StreamExecutionEnvironment#readFile and alike. I did another pass over 
>>>>> some
>>>>> of the classes and thought we could also drop:
>>>>>
>>>>>>
>>>>>
>>>>>> ExecutionConfig#set/getCodeAnalysisMode
>>>>>
>>>>>> ExecutionConfig#disable/enableSysoutLogging
>>>>>
>>>>>> ExecutionConfig#set/isFailTaskOnCheckpointError
>>>>>
>>>>>> ExecutionConfig#isLatencyTrackingEnabled
>>>>>
>>>>>>
>>>>>
>>>>>> As for the `forceCheckpointing` I am not fully convinced to doing it.
>>>>> As far as I know iterations still do not participate in checkpointing
>>>>> correctly. Therefore it still might make sense to force it. In other words
>>>>> there is no real alternative to that method. Unless we only remove the
>>>>> methods from StreamExecutionEnvironment and redirect to the setter in
>>>>> CheckpointConfig. WDYT?
>>>>>
>>>>>>
>>>>>
>>>>>> An updated list of methods I suggest to remove:
>>>>>
>>>>>>
>>>>>
>>>>>> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
>>>>>
>>>>>> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
>>>>>
>>>>>> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
>>>>>
>>>>>> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
>>>>>
>>>>>> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in
>>>>> 1.1?)
>>>>>
>>>>>>
>>>>> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
>>>>> (deprecated in 1.2)
>>>>>
>>>>>> RuntimeContext#getAllAccumulators (deprecated in 0.10)
>>>>>
>>>>>> DataStream#fold and all related classes and methods such as
>>>>> FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in
>>>>> 1.3/1.4)
>>>>>
>>>>>> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend)
>>>>> (deprecated in 1.5)
>>>>>
>>>>>> DataStream#split (deprecated in 1.8)
>>>>>
>>>>>> Methods in (Connected)DataStream that specify keys as either indices
>>>>> or field names such as DataStream#keyBy, DataStream#partitionCustom,
>>>>> ConnectedStream#keyBy, .... (deprecated in 1.11)
>>>>>
>>>>>>
>>>>>
>>>>>> Bear in mind that majority of the options listed above in
>>>>> ExecutionConfig take no effect. They were left there purely to satisfy the
>>>>> binary compatibility. Personally I don't see any benefit of leaving a
>>>>> method and silently dropping the underlying feature. The only 
>>>>> configuration
>>>>> that is respected is setting the number of execution retries.
>>>>>
>>>>>>
>>>>>
>>>>>> I also wanted to make it explicit that most of the changes above
>>>>> would result in a binary incompatibility and require additional exclusions
>>>>> in the japicmp. Those are:
>>>>>
>>>>>>
>>>>>
>>>>>> ExecutionConfig#set/getCodeAnalysisMode (deprecated in 1.9)
>>>>>
>>>>>> ExecutionConfig#disable/enableSysoutLogging (deprecated in 1.10)
>>>>>
>>>>>> ExecutionConfig#set/isFailTaskOnCheckpointError (deprecated in 1.9)
>>>>>
>>>>>> ExecutionConfig#isLatencyTrackingEnabled (deprecated in 1.7)
>>>>>
>>>>>> ExecutionConfig#(get)/setNumberOfExecutionRetries() (deprecated in
>>>>> 1.1?)
>>>>>
>>>>>> DataStream#fold and all related classes and methods such as
>>>>> FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in
>>>>> 1.3/1.4)
>>>>>
>>>>>> DataStream#split (deprecated in 1.8)
>>>>>
>>>>>> Methods in (Connected)DataStream that specify keys as either indices
>>>>> or field names such as DataStream#keyBy, DataStream#partitionCustom,
>>>>> ConnectedStream#keyBy, .... (deprecated in 1.11)
>>>>>
>>>>>>
>>>>> StreamExecutionEnvironment#readFile,readFileStream(...),socketTextStream(...),socketTextStream(...)
>>>>> (deprecated in 1.2)
>>>>>
>>>>>>
>>>>>
>>>>>> Looking forward to more opinions on the issue.
>>>>>
>>>>>>
>>>>>
>>>>>> Best,
>>>>>
>>>>>>
>>>>>
>>>>>> Dawid
>>>>>
>>>>>>
>>>>>
>>>>>>
>>>>>
>>>>>> On 17/08/2020 12:49, Kostas Kloudas wrote:
>>>>>
>>>>>>
>>>>>
>>>>>> Thanks a lot for starting this Dawid,
>>>>>
>>>>>>
>>>>>
>>>>>> Big +1 for the proposed clean-up, and I would also add the deprecated
>>>>>
>>>>>> methods of the StreamExecutionEnvironment like:
>>>>>
>>>>>>
>>>>>
>>>>>> enableCheckpointing(long interval, CheckpointingMode mode, boolean
>>>>> force)
>>>>>
>>>>>> enableCheckpointing()
>>>>>
>>>>>> isForceCheckpointing()
>>>>>
>>>>>>
>>>>>
>>>>>> readFile(FileInputFormat inputFormat,String
>>>>>
>>>>>> filePath,FileProcessingMode watchType,long interval, FilePathFilter
>>>>>
>>>>>> filter)
>>>>>
>>>>>> readFileStream(...)
>>>>>
>>>>>>
>>>>>
>>>>>> socketTextStream(String hostname, int port, char delimiter, long
>>>>> maxRetry)
>>>>>
>>>>>> socketTextStream(String hostname, int port, char delimiter)
>>>>>
>>>>>>
>>>>>
>>>>>> There are more, like the (get)/setNumberOfExecutionRetries() that were
>>>>>
>>>>>> deprecated long ago, but I have not investigated to see if they are
>>>>>
>>>>>> actually easy to remove.
>>>>>
>>>>>>
>>>>>
>>>>>> Cheers,
>>>>>
>>>>>> Kostas
>>>>>
>>>>>>
>>>>>
>>>>>> On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz
>>>>>
>>>>>> wrote:
>>>>>
>>>>>>
>>>>>
>>>>>> Hi devs and users,
>>>>>
>>>>>>
>>>>>
>>>>>> I wanted to ask you what do you think about removing some of the
>>>>> deprecated APIs around the DataStream API.
>>>>>
>>>>>>
>>>>>
>>>>>> The APIs I have in mind are:
>>>>>
>>>>>>
>>>>>
>>>>>> RuntimeContext#getAllAccumulators (deprecated in 0.10)
>>>>>
>>>>>> DataStream#fold and all related classes and methods such as
>>>>> FoldFunction, FoldingState, FoldingStateDescriptor ... (deprecated in
>>>>> 1.3/1.4)
>>>>>
>>>>>> StreamExecutionEnvironment#setStateBackend(AbstractStateBackend)
>>>>> (deprecated in 1.5)
>>>>>
>>>>>> DataStream#split (deprecated in 1.8)
>>>>>
>>>>>> Methods in (Connected)DataStream that specify keys as either indices
>>>>> or field names such as DataStream#keyBy, DataStream#partitionCustom,
>>>>> ConnectedStream#keyBy, .... (deprecated in 1.11)
>>>>>
>>>>>>
>>>>>
>>>>>> I think the first three should be straightforward. They are long
>>>>> deprecated. The getAccumulators method is not used very often in my
>>>>> opinion. The same applies to the DataStream#fold which additionally is not
>>>>> very performant. Lastly the setStateBackend has an alternative with a 
>>>>> class
>>>>> from the AbstractStateBackend hierarchy, therefore it will be still code
>>>>> compatible. Moreover if we remove the
>>>>> #setStateBackend(AbstractStateBackend) we will get rid off warnings users
>>>>> have right now when setting a statebackend as the correct method cannot be
>>>>> used without an explicit casting.
>>>>>
>>>>>>
>>>>>
>>>>>> As for the DataStream#split I know there were some objections against
>>>>> removing the #split method in the past. I still believe the output tags 
>>>>> can
>>>>> replace the split method already.
>>>>>
>>>>>>
>>>>>
>>>>>> The only problem in the last set of methods I propose to remove is
>>>>> that they were deprecated only in the last release and those method were
>>>>> only partially deprecated. Moreover some of the methods were not 
>>>>> deprecated
>>>>> in ConnectedStreams. Nevertheless I'd still be inclined to remove the
>>>>> methods in this release.
>>>>>
>>>>>>
>>>>>
>>>>>> Let me know what do you think about it.
>>>>>
>>>>>>
>>>>>
>>>>>> Best,
>>>>>
>>>>>>
>>>>>
>>>>>> Dawid
>>>>
>>>>
>

Reply via email to