Hi Chesnay
> "Move Calcite rules from Scala to Java": I would hope that this would be
> an entirely internal change, and could thus be an incremental process
> independent of major releases.
> What is the actual scale of this item; how much are we actually
> re-writing?
Thanks for asking.
Yes, you're right, that should be an internal change.
I was also thinking about an incremental change (rule by rule, or in
reasonably small groups of rules).
And yes, this could be an activity independent of major releases.
The problem actually concerns subclasses of RelOptRule.
Currently I see 60+ such rules (in Scala) using the mentioned deprecated
API.
There are also subclasses of ConverterRule (50+) which do not have such
issues.
Maybe having all the rules in Java could be considered as the next step.
On Tue, Jun 27, 2023 at 1:34 PM Xintong Song <tonysong...@gmail.com> wrote:
Hi Alex & Gyula,
> By compatibility discussion do you mean the "[DISCUSS] FLIP-321: Introduce
> an API deprecation process" thread [1]?
Yes, I meant the FLIP-321 discussion. I just noticed I pasted the wrong URL
in my previous email. Sorry for the mistake.
> I am also curious to know if the rationale behind this new API has been
> previously discussed on the mailing list. Do we have a list of shortcomings
> in the current DataStream API that it tries to resolve? How does the
> current ProcessFunction functionality fit into the picture? Will it be kept
> as is or subsumed by new API?
> I don't think we should create a replacement for the DataStream API unless
> we have a very good reason to do so and with a proper discussion about this
> as Alex said.
The ProcessFunction API, which is intended to replace the DataStream API, is
still a proposal, not a decision. Sorry for the confusion. I should have
been more careful with my words and not given the impression that this is
something we'll do anyway.
There will be a FLIP describing the motivations and designs in detail, for
the community to discuss and vote on. We are still working on it. TBH, this
is not trivial and we will need more time for it.
Just to quickly share some background:
- We see quite a few problems with the current DataStream APIs
  - Users are working with concrete classes rather than interfaces,
    which means
    - Users can access methods that are designed to be used by internal
      classes, even though they are annotated with `@Internal`. E.g.,
      `DataStream#getTransformation`.
    - Changes to the non-API implementations (e.g., `Transformation`)
      would affect the API classes (e.g., `DataStream`), which makes it
      hard to provide binary compatibility.
  - Internal classes are used as parameters / return values of public
    APIs. E.g., while `AbstractStreamOperator` is PublicEvolving,
    `StreamTask`, which is returned by
    `AbstractStreamOperator#getContainingTask`, is Internal.
  - In many cases, users are asked to extend the API classes, rather
    than implementing interfaces. E.g., `AbstractStreamOperator`.
    - Any changes to the base classes, even the internal part, may
      affect the behavior of the user-provided sub-classes
    - Users can override the behavior of the base classes
  - The API module `flink-streaming-java` contains non-API classes, and
    depends on internal modules such as `flink-runtime`, which means
    - Changes to the internal modules may affect the API modules, which
      requires users to re-build their applications upon upgrading
    - The artifact users need for building their applications is larger
      than necessary.
  - We probably should not expose operators (e.g.,
    `AbstractStreamOperator`) to users. Functions should be enough for
    users to define their data processing logic. Exposing operator-level
    concepts (e.g., mailbox thread model, checkpoint barrier alignment,
    etc.) is unnecessary and, because of compatibility considerations,
    limits how much such exposed mechanisms can be improved.
  - The current DataStream API seems to be a mixture of many things,
    making it hard to understand, especially for newcomers. It might be
    better to re-organize it into several parts (the taxonomy below is
    just an example; we are still working on this):
    - The most fundamental stateful stream processing: streams,
      partitions / key, process functions, state, timeline-service
    - An extension for common batch-streaming unified functions: map,
      flatmap, filter, agg, reduce, join, etc.
    - An extension for windowing supports: window, triggering
    - An extension for event-time supports: event time, watermark
    - The extensions are like short-cuts / sugars, without which users
      can probably still achieve the same behavior by working with the
      fundamental APIs, but it would be a lot easier with the extensions
- The original plan was to do in-place refactors / changes on the
  DataStream API. Some related items are listed in this doc [2] attached
  to the kick-off email [3]. Not all of the above issues are listed,
  because at that time we had not looked into this as deeply as we have
  now.
- We proposed this as a new API rather than in-place refactors in the
  2.0 work item list, because we realized the changes might be too big
  for an in-place change. First having a new API and then gradually
  retiring the old one would help users to migrate smoothly between them.
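To make two of the points above a bit more concrete (users programming
against interfaces rather than concrete classes, and extensions being sugar
over the fundamental API), here is a minimal Java sketch. All names in it
(`SimpleStream`, `ProcessFn`, `Output`, `InMemoryStream`, `Streams`) are
hypothetical and made up for illustration only; this is neither the proposed
API nor existing Flink code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustration only: every type below is hypothetical, not a Flink class.
final class LayeredApiSketch {
    public static void main(String[] args) {
        // User code only ever sees the SimpleStream interface.
        List<Integer> result = Streams.of(List.of(1, 2, 3, 4))
                .filter(x -> x % 2 == 0)
                .map(x -> x * 10)
                .collectAll();
        System.out.println(result); // prints [20, 40]
    }
}

/** Receives the records emitted by a process function. */
interface Output<T> {
    void collect(T record);
}

/** The fundamental building block: one input record, zero or more outputs. */
interface ProcessFn<IN, OUT> {
    void processRecord(IN record, Output<OUT> out);
}

/** The public surface is an interface, so internal methods cannot leak. */
interface SimpleStream<T> {
    <R> SimpleStream<R> process(ProcessFn<T, R> fn);

    List<T> collectAll();

    // "Sugar" extensions, expressed purely in terms of process().
    default <R> SimpleStream<R> map(Function<T, R> f) {
        return this.<R>process((record, out) -> out.collect(f.apply(record)));
    }

    default SimpleStream<T> filter(Predicate<T> p) {
        return this.<T>process((record, out) -> {
            if (p.test(record)) {
                out.collect(record);
            }
        });
    }
}

/** Internal implementation; it can change without touching the interface. */
final class InMemoryStream<T> implements SimpleStream<T> {
    private final List<T> records;

    InMemoryStream(List<T> records) {
        this.records = new ArrayList<>(records);
    }

    @Override
    public <R> SimpleStream<R> process(ProcessFn<T, R> fn) {
        List<R> emitted = new ArrayList<>();
        for (T record : records) {
            fn.processRecord(record, emitted::add);
        }
        return new InMemoryStream<>(emitted);
    }

    @Override
    public List<T> collectAll() {
        return new ArrayList<>(records);
    }
}

/** Entry point that hides the concrete implementation class from users. */
final class Streams {
    private Streams() {}

    static <T> SimpleStream<T> of(List<T> records) {
        return new InMemoryStream<>(records);
    }
}
```

In this sketch `map` and `filter` could be dropped without losing any
expressiveness, since both are written using `process` alone, and
`InMemoryStream` could be rewritten freely because user code never names it.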
A thorough discussion is definitely needed once the FLIP is out. And of
course it's possible that the FLIP might be rejected. Given that we are
planning for release 2.0, I just feel it would be better to bring this up
early, even if the concrete plan is not yet ready.
Best,
Xintong
[1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9
[2] https://docs.google.com/document/d/1_PMGl5RuDQGlV99_gL3y7OiRsF0DgCk91Coua6hFXhE/edit?usp=sharing
[3] https://lists.apache.org/thread/b8w5cx0qqbwzzklyn5xxf54vw9ymys1c
On Tue, Jun 27, 2023 at 5:15 PM Gyula Fóra <gyf...@apache.org> wrote:
Hey!
I share the same concerns mentioned above regarding the "ProcessFunction
API".
I don't think we should create a replacement for the DataStream API unless
we have a very good reason to do so and with a proper discussion about this
as Alex said.
Cheers,
Gyula
On Tue, Jun 27, 2023 at 11:03 AM Alexander Fedulov <alexander.fedu...@gmail.com> wrote:
Hi Xintong,
By compatibility discussion do you mean the "[DISCUSS] FLIP-321: Introduce
an API deprecation process" thread [1]?
I am also curious to know if the rationale behind this new API has been
previously discussed on the mailing list. Do we have a list of shortcomings
in the current DataStream API that it tries to resolve? How does the
current ProcessFunction functionality fit into the picture? Will it be kept
as is or subsumed by new API?
[1] https://lists.apache.org/thread/vmhzv8fcw2b33pqxp43486owrxbkd5x9
Best,
Alex
On Mon, 26 Jun 2023 at 14:33, Xintong Song <tonysong...@gmail.com> wrote:
> The ProcessFunction API item is giving me the most headaches because it's
> very unclear what it actually entails; like is it an entirely separate API
> to DataStream (sounds like it is!) or an extension of DataStream. How much
> will it share the internals with DataStream etc.; how does it relate to the
> Table API (w.r.t. switching APIs / what Table API uses underneath).
I totally understand your confusion. We started planning this after kicking
off the release 2.0, so there's still a lot to be explored and the plan
keeps changing.
- In the beginning, we planned to do an in-place refactor of the DataStream
  API, until the API migration period was proposed.
- Then we wanted to make it an entirely separate API from DataStream, and
  listed it as a must-have for release 2.0 so that we can remove DataStream
  once it's ready.
- However, depending on the outcome of the API compatibility discussion
  [1], we may not be able to remove DataStream in 2.0 anyway, which means
  we might need to re-evaluate the necessity of this item for 2.0.
I'd say we wait a bit longer for the compatibility discussion [1] and
decide the priority for this item afterwards.
Best,
Xintong
[1] https://lists.apache.org/list.html?dev@flink.apache.org
On Mon, Jun 26, 2023 at 6:00 PM Chesnay Schepler <ches...@apache.org> wrote:
By and large I'm quite happy with the list of items.
I'm curious as to why the "Disaggregated State Management" item is marked
as a must-have; will it require changes that break something? What prevents
it from being added in 2.1?
We may want to update the Java 17 item to "Make Java 17 the default, drop
Java 8/11". Maybe even split it into a must-have "Drop Java 8" and a
nice-to-have "Drop Java 11"?
"Move Calcite rules from Scala to Java": I would hope that this would be
an entirely internal change, and could thus be an incremental process
independent of major releases.
What is the actual scale of this item; how much are we actually
re-writing?
"Add MetricGroup#getLogicalScope": I'd raise this to a must-have; I think
I marked it down as nice-to-have only because it depends on another item.
The ProcessFunction API item is giving me the most headaches because it's
very unclear what it actually entails; like is it an entirely separate API
to DataStream (sounds like it is!) or an extension of DataStream. How much
will it share the internals with DataStream etc.; how does it relate to the
Table API (w.r.t. switching APIs / what Table API uses underneath).
There are a few items I added as ideas which don't have a priority yet;
would love to get some feedback on those.
On 21/06/2023 08:41, Xintong Song wrote:
Hi devs,
As previously discussed in [1], we had been collecting work item proposals
for the 2.0 release until June 15th, on the wiki page [2].
- As we have passed the due date, I'd like to kindly remind everyone *not
  to add / remove items directly on the wiki page*. If needed, please post
  in this thread or reach out to the release managers instead.
- I've reached out to some folks for clarifications about their
  proposals. Some of them mentioned that they cannot yet tell whether we
  should do an item or not, and would need more time / discussions to make
  the decision. So I added a new symbol for items whose priorities are
  `TBD`.
Now it's time to collaboratively decide a minimum set of must-have items.
I've gone through the entire list of proposed items, and found that most of
them make a lot of sense. So I think an online sync might not be necessary
for this. I'd like to go with this DISCUSS thread, where everyone can
comment on how they think the list can be improved, followed by a VOTE to
formally make the decision.
Any feedback and opinions, including but not limited to the following
aspects, will be appreciated.
- Important items that are missing from the list
- Concerns regarding the listed items or their priorities
Looking forward to your feedback.
Best,
Xintong
[1] https://lists.apache.org/list?dev@flink.apache.org:lte=1M:release%202.0%20status%20updates
[2] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release
--
Best regards,
Sergey