Re: [DISCUSSION] Scan mode in Table Store for Flink Stream and Batch job

2022-12-09 Thread Jingsong Li
Thanks Shammon!

Your summary was very good and very detailed.

I thought about it again.

## Solution 1
Actually, following what you said, in theory there would be these dimensions:
- Runtime-mode: streaming or batch.
- Range: full or incremental.
- Position: Latest, timestamp, snapshot-id, compacted.

Advantages: The breakdown is very detailed, and every behavior is very clear.
Disadvantages: Orthogonality yields many combinations. Combined with
runtime-mode stream or batch, there are 2 x 2 x 4 = 16 modes, many of
which are meaningless. As you said, the default behavior is also a
problem.

## Solution 2
Currently [1]:
- The environment determines the runtime-mode: streaming or batch.
- The `scan.mode` determines the position.
- No specific option determines the range; it is implied by the
runtime-mode. However, it is not completely determined by the runtime
mode: `full` and `compacted`, for example, also read the full range
under streaming.

Advantages: Simple. The default values of the options are what we want for
both streaming and batch.
Disadvantages:
1. The semantics of `from-timestamp` differ between streaming and batch.
2. `full` and `compacted` are special cases.

## Solution 3

I think the core problem of solution 2 is mainly problem 2:
`full` and `compacted` are special cases.
How about:
- the runtime mode determines whether to read incremental-only or full data;
- `scan.mode` contains: latest, timestamp, snapshot-id.
The default range is full in batch mode and incremental in stream mode.

However, we have two other choices for `scan.mode`: `latest-full` and
`compacted-full`. Regardless of the runtime-mode, these two choices
force a full-range read.

I think solution 3 is a good compromise. It keeps the default values
usable, and conceptually it can at least explain the current options.
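
To make this concrete, a quick sketch of how solution 3 could look from a
user's perspective (the option names follow the proposal above, but the
exact values and the sample table are illustrative only, not a final API):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Batch: 'timestamp' reads a full snapshot as of the given time.
// Streaming: the same option reads only the changes after that time.
TableEnvironment batchEnv =
        TableEnvironment.create(EnvironmentSettings.inBatchMode());
batchEnv.executeSql(
        "SELECT * FROM t /*+ OPTIONS("
                + "'scan.mode' = 'timestamp', "
                + "'scan.timestamp-millis' = '1670000000000') */");

// 'latest-full' would force a full-range read even in streaming mode.
TableEnvironment streamEnv =
        TableEnvironment.create(EnvironmentSettings.inStreamingMode());
streamEnv.executeSql(
        "SELECT * FROM t /*+ OPTIONS('scan.mode' = 'latest-full') */");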

What do you think?

[1] 
https://nightlies.apache.org/flink/flink-table-store-docs-master/docs/development/configuration/

Best,
Jingsong

On Thu, Dec 8, 2022 at 4:11 PM Shammon FY  wrote:
>
> Hi devs:
>
> I'm an engineer from ByteDance, and here I'd like to discuss "Scan mode in 
> Table Store for Flink Stream and Batch job".
>
> Users can execute Flink Stream and Batch jobs on Table Store. In Table Store 
> 0.2 there are two items which determine how the Stream and Batch jobs' sources 
> read data: StartupMode and config in Options.
> 1. StartupMode
>   a) DEFAULT. Determines the actual startup mode according to other table 
> properties. If "scan.timestamp-millis" is set, the actual startup mode 
> will be "from-timestamp". Otherwise, the actual startup mode will be 
> "full".
>   b) FULL. For streaming sources, read the latest snapshot on the table upon 
> first startup, and continue to read the latest changes. For batch sources, 
> just consume the latest snapshot but do not read new changes.
>   c) LATEST. For streaming sources, continuously reads the latest changes 
> without reading a snapshot at the beginning. For batch sources, behaves the 
> same as the "full" startup mode.
>   d) FROM_TIMESTAMP. For streaming sources, continuously reads changes 
> starting from the timestamp specified by "scan.timestamp-millis", without 
> reading a snapshot at the beginning. For batch sources, read a snapshot at 
> the timestamp specified by "scan.timestamp-millis" but do not read new changes.
> 2. Config in Options
>   a) scan.timestamp-millis, log.scan.timestamp-millis. Optional timestamp 
> used in case of the "from-timestamp" scan mode.
>   b) read.compacted. Read the latest compact snapshots only.
>
> After discussing with @Jingsong Li and @Caizhi wen, we found that the config 
> in Options and StartupMode are not orthogonal; consider, for example, the 
> combination of read.compacted and the FROM_TIMESTAMP mode, and its behavior 
> in Stream and Batch sources. We want to improve StartupMode to unify the data 
> reading mode of Stream and Batch jobs, and add the following StartupMode item:
> COMPACTED: For streaming sources, read a snapshot after the latest compaction 
> on the table upon first startup, and continue to read the latest changes. For 
> batch sources, just read a snapshot after the latest compaction but do not 
> read new changes.
> The advantage is that for Stream and Batch jobs, we only need to determine 
> their behavior through StartupMode, but we also found two main problems:
> 1. The behaviors of some StartupModes in Stream and Batch jobs are 
> inconsistent, which may cause user misunderstanding, such as FROM_TIMESTAMP: 
> the streaming job reads incremental data, while the batch job reads full data
> 2. StartupMode does not define all data reading modes. For example, streaming 
> jobs read snapshots according to timestamp, and then read incremental data.
>
> To support all data reading modes in Table Store such as time travel, we try 
> to divide data reading into two orthogonal dimensions: data reading range and 
> startup position.
> 1. Data reading range
>   a) Inc

Re: [DISCUSS] Cleaning up HighAvailabilityServices interface to reflect the per-JM-process LeaderElection

2022-12-09 Thread Matthias Pohl
Hi Dong,
see my answers below.

Regarding "Interface change might affect other projects that customize HA
> services", are you referring to those projects which hack into Flink's
> source code (as opposed to using Flink's public API) to customize HA
> services?


Yes, the proposed change might affect projects that need to have their own
HA implementation for whatever reason (interface change) or if a project
accesses the HA backend to retrieve metadata from the ZK node/k8s ConfigMap
(change about how the data is stored in the HA backend). The latter one was
actually already the case with the change introduced by FLINK-24038 [1].

By the way, since Flink already supports zookeeper and kubernetes as the
> high availability services, are you aware of many projects that still need
> to hack into Flink's code to customize high availability services?


I am aware of projects that use customized HA. But based on our experience
in FLINK-24038 [1] no one complained. So, making people aware through the
mailing list might be good enough.

And regarding "We lose some flexibility in terms of per-component
> LeaderElection", could you explain what flexibility we need so that we can
> gauge the associated downside of losing the flexibility?


Just to recap: The current interface allows having per-component
LeaderElection (e.g. the ResourceManager leader can run on a different
JobManager than the Dispatcher). This implementation was replaced by
FLINK-24038 [1] and removed in FLINK-25806 [2]. The new implementation does
LeaderElection per process (e.g. ResourceManager and Dispatcher always run
on the same JobManager). The changed interface would require us to touch
the interface again if (for whatever reason) we want to reintroduce
per-component leader election in some form.
The interface change is, strictly speaking, not necessary to provide the
new functionality. But I like the idea of certain requirements (currently,
we need per-process leader election to fix what was reported in FLINK-24038
[1]) being reflected in the interface. This makes sure that we don't
introduce a per-component leader election again accidentally in the future
because we thought it's a good idea but forgot about FLINK-24038.
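
To illustrate the direction (the signatures below are just a sketch, not the
final API):

// Today: HighAvailabilityServices exposes one factory method per component,
// even though they are all backed by the same per-process leader election.
public interface HighAvailabilityServices {
    LeaderElectionService getResourceManagerLeaderElectionService();
    LeaderElectionService getDispatcherLeaderElectionService();
    LeaderElectionService getClusterRestEndpointLeaderElectionService();
    LeaderElectionService getJobManagerLeaderElectionService(JobID jobID);
    // ...
}

// Proposed: a single factory method returning the shared
// per-JM-process LeaderElectionService.
public interface HighAvailabilityServices {
    LeaderElectionService getLeaderElectionService();
    // ...
}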

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-24038
[2] https://issues.apache.org/jira/browse/FLINK-25806

On Fri, Dec 9, 2022 at 2:09 AM Dong Lin  wrote:

> Hi Matthias,
>
> Thanks for the proposal! Overall I am in favor of making this interface
> change to make Flink's codebase more maintainable.
>
> Regarding "Interface change might affect other projects that customize HA
> services", are you referring to those projects which hack into Flink's
> source code (as opposed to using Flink's public API) to customize HA
> services? If yes, it seems OK to break those projects since we don't have
> any backward compatibility guarantee for those projects.
>
> By the way, since Flink already supports zookeeper and kubernetes as the
> high availability services, are you aware of many projects that still need
> to hack into Flink's code to customize high availability services?
>
> And regarding "We lose some flexibility in terms of per-component
> LeaderElection", could you explain what flexibility we need so that we can
> gauge the associated downside of losing the flexibility?
>
> Thanks!
> Dong
>
>
>
> On Wed, Dec 7, 2022 at 4:28 PM Matthias Pohl  .invalid>
> wrote:
>
> > Hi everyone,
> >
> > The Flink community introduced a new way of doing leader election in
> Flink
> > 1.15 with FLINK-24038 [1]. Instead of a per-component leader election,
> all
> > components (i.e. ResourceManager, Dispatcher, REST server, JobMaster)
> use a
> > single (per-JM-process) leader election instance. It was meant to fix
> some
> > issues with deregistering Flink applications in multi-JM setups [1] and
> > reduce load on the HA backend. Users were able to opt-out and switch back
> > to the old implementation [2].
> >
> > The new approach was kind of complicated to implement while still
> > maintaining support for the old implementation through the existing
> > interfaces. With FLINK-25806 [3], the old implementation was removed in
> > Flink 1.16. This enables us to clean things up in the
> > HighAvailabilityServices.
> >
> > The proposed change would mean touching the HighAvailabilityServices
> > interface. Currently, the interface provides factory methods for
> > LeaderElectionServices of the aforementioned components. All of these
> > LeaderElectionServices are internally based on the same LeaderElection
> > instance handled in DefaultMultipleComponentLeaderElectionService.
> > Therefore, we can replace all these factory methods by a single one which
> > returns a LeaderElectionService instance that’s going to be used by all
> > components. Of course, we could also stick to the old
> > HighAvailabilityServices and return the same LeaderElectionService
> instance
> > through each of the four factory methods (similar to what’s done now with

Re: [DISCUSS] FLIP-276: Data Consistency of Streaming and Batch ETL in Flink and Table Store

2022-12-09 Thread Piotr Nowojski
Hi Shammon,

Do I understand it correctly, that you effectively want to expand the
checkpoint alignment mechanism across many different jobs and hand over
checkpoint barriers from upstream to downstream jobs using the intermediate
tables?

Re the watermarks for the "Rejected Alternatives". I don't understand why
this has been rejected. Could you elaborate on this point? Here are a
couple of my thoughts on this matter, but please correct me if I'm wrong,
as I haven't dived deeper into this topic.

> As shown above, there are 2 watermarks T1 and T2, T1 < T2.
> The StreamTask reads data in order:
V11,V12,V21,T1(channel1),V13,T1(channel2).
> At this time, StreamTask will confirm that watermark T1 is completed, but
the data beyond
> T1 has been processed(V13) and the results are written to the sink table.

1. I see the same "problem" with unaligned checkpoints in your current
proposal.
2. I don't understand why this is a problem. Just store in the "sink table"
what the watermark is (T1), and downstream jobs should process the data with
that "watermark" anyway. Record "V13" should be treated as "early" data.
Downstream jobs if:
 a) they are streaming jobs, for example they should aggregate it in
windowed/temporal state, but they shouldn't produce the result that
contains it, as the watermark T2 was not yet processed. Or they would just
pass that record as "early" data.
 b) they are batch jobs, it looks to me like batch jobs shouldn't take "all
available data", but only consider "all the data until some watermark", for
example the latest available: T1 (see the sketch after point 5 below)

3. I'm pretty sure there are counter examples, where your proposed
mechanism of using checkpoints (even aligned!) will produce
inconsistent data from the perspective of the event time.
  a) For example what if one of your "ETL" jobs, has the following DAG:
[image: flip276.jpg]
  Even if you use aligned checkpoints for committing the data to the sink
table, the watermarks of "Window1" and "Window2" are completely
independent. The sink table might easily have data from the Src1/Window1
from the event time T1 and Src2/Window2 from later event time T2.
  b) I think the same applies if you have two completely independent ETL
jobs writing either to the same sink table, or two to different sink tables
(that are both later used in the same downstream job).

4a) I'm not sure if I like the idea of centralising the whole system in
this way. If you have 10 jobs, the likelihood of the checkpoint failure
will be 10 times higher, and/or the duration of the checkpoint can be much
much longer (especially under backpressure). And this is actually already a
limitation of Apache Flink (global checkpoints are more prone to fail the
larger the scale), so I would be anxious about making it potentially even a
larger issue.
4b) I'm also worried about increased complexity of the system after adding
the global checkpoint, and additional (single?) point of failure.
5. Such a design would also not work if we ever wanted to have task local
checkpoints.
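
Returning to point 2b, here is a minimal sketch of what I mean, assuming the
sink table stores the watermark committed with each snapshot (the
lastCommittedWatermark() accessor and the Row type below are hypothetical,
not an existing Table Store API):

// Only consume rows whose event time is covered by the committed watermark;
// record "V13" from the example would be excluded until T2 is committed.
long committedWatermark = sinkTable.lastCommittedWatermark();
List<Row> consistentView =
        allRows.stream()
                .filter(row -> row.eventTime() <= committedWatermark)
                .collect(Collectors.toList());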

All in all, it seems to me that watermarks and event time are actually
the better concepts in this context and should have been used for
synchronising and ensuring data consistency across the whole system.

Best,
Piotrek

On Thu, Dec 1, 2022 at 11:50 Shammon FY  wrote:

> Hi @Martijn
>
> Thanks for your comments, and I'd like to reply to them
>
> 1. It sounds good to me, I'll update the content structure in FLIP later
> and give the problems first.
>
> 2. "Each ETL job creates snapshots with checkpoint info on sink tables in
> Table Store"  -> That reads like you're proposing that snapshots need to be
> written to Table Store?
>
> Yes. To support the data consistency in the FLIP, we need to connect
> checkpoints in Flink with snapshots in the store; this requires a close
> combination of the Flink and store implementations. In the first stage we plan
> to implement it based on Flink and Table Store only; snapshots written to
> external storage don't support consistency.
>
> 3. If you introduce a MetaService, it becomes the single point of failure
> because it coordinates everything. But I can't find anything in the FLIP on
> making the MetaService high available or how to deal with failovers there.
>
> I think you raise a very important problem that I missed in the FLIP. The
> MetaService is a single point and should support failover; we will do that in
> the future. In the first stage we only support standalone mode. THX
>
> 4. The FLIP states under Rejected Alternatives "Currently watermark in
> Flink cannot align data." which is not true, given that there is FLIP-182
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-182%3A+Support+watermark+alignment+of+FLIP-27+Sources
>
> Watermark alignment in FLIP-182 is different from requirements "watermark
> align data" in our FLIP. FLIP-182 aims to fix watermark generation in
> different sources for "slight imbalance or data skew", which means in some
> cases the source must generate watermark even if they should 

[jira] [Created] (FLINK-30349) Sync missing HBase e2e tests to external repo

2022-12-09 Thread Martijn Visser (Jira)
Martijn Visser created FLINK-30349:
--

 Summary: Sync missing HBase e2e tests to external repo
 Key: FLINK-30349
 URL: https://issues.apache.org/jira/browse/FLINK-30349
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / HBase
Affects Versions: hbase-3.0.0
Reporter: Martijn Visser
Assignee: Martijn Visser






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30350) Write dependency-reduced pom to default directory

2022-12-09 Thread Chesnay Schepler (Jira)
Chesnay Schepler created FLINK-30350:


 Summary: Write dependency-reduced pom to default directory
 Key: FLINK-30350
 URL: https://issues.apache.org/jira/browse/FLINK-30350
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.17.0


The dependency-reduced pom is currently written to the target/ directory. This 
makes sense in general as it is a generated artifact and should be 
automatically cleaned up when doing a mvn clean.

The shade-plugin however marks this pom as the module's pom during the build, 
and maven has a requirement that the basedir of a module is where the pom 
resides.
As a result the basedir is changed whenever dependency-reduction is applied, 
causing weird side-effects like changing the working directory during IT cases.

Let's bite the bullet and keep the dependency-reduced pom next to the original 
pom, with some extra configuration to remove it during a mvn clean.
This side-steps all the issues encountered in FLINK-30083.
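
One possible way to do the extra clean-up (a sketch using the standard 
maven-clean-plugin; dependency-reduced-pom.xml is the shade-plugin's default 
file name):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-clean-plugin</artifactId>
  <configuration>
    <filesets>
      <fileset>
        <directory>${basedir}</directory>
        <includes>
          <include>dependency-reduced-pom.xml</include>
        </includes>
      </fileset>
    </filesets>
  </configuration>
</plugin>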



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release flink-connector-aws v4.0.0, release candidate #1

2022-12-09 Thread Chesnay Schepler

+1 (binding)

- clean source release
- builds from source
- source matches git tag
- all expected maven artifacts present
- maven artifacts have correct Flink version suffix
- releases notes are good
- PR is good

Not a blocking issue, but the source NOTICE currently says "Apache Flink", 
where it should say "Apache Flink AWS connector" or something.


On 07/12/2022 14:38, Teoh, Hong wrote:

+1 (non-binding)

* Hashes and Signatures look good
* All required files on dist.apache.org
* Tag is present in Github
* Verified source archive does not contain any binary files
* Source archive builds using maven
* Started packaged example SQL job using SQL client. Verified that the 
following connectors work:
* flink-sql-connector-aws-kinesis-firehose
* flink-sql-connector-aws-kinesis-streams
* flink-sql-connector-kinesis

Cheers,
Hong


On 06/12/2022, 17:41, "Danny Cranmer"  wrote:




 Hi everyone,

 I am reopening this vote since the issue we detected [1] cannot be fixed
 without AWS SDK changes, and it is an existing issue, not a regression.

 Please review and vote on the release candidate #1 for the version 4.0.0,
 as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)

 This release externalizes the Kinesis Data Streams and Kinesis Data
 Firehose connector to the flink-connector-aws repository.

 The complete staging area is available for your review, which includes:
 * JIRA release notes [2],
 * the official Apache source release to be deployed to dist.apache.org [3],
 which are signed with the key with fingerprint 125FD8DB [4],
 * all artifacts to be deployed to the Maven Central Repository [5],
 * source code tag v4.0.0-rc1 [6],
 * website pull request listing the new release [7].

 The vote will be open for at least 72 hours (Friday 9th December 18:00
 UTC). It is adopted by majority approval, with at least 3 PMC affirmative
 votes.

 Thanks,
 Danny

 [1] https://issues.apache.org/jira/browse/FLINK-30304
 [2]
 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352538
 [3]
 https://dist.apache.org/repos/dist/dev/flink/flink-connector-aws-4.0.0-rc1/
 [4] https://dist.apache.org/repos/dist/release/flink/KEYS
 [5] https://repository.apache.org/content/repositories/orgapacheflink-1558/
 [6] https://github.com/apache/flink-connector-aws/releases/tag/v4.0.0-rc1
  [7] https://github.com/apache/flink-web/pull/592

 On Mon, Dec 5, 2022 at 4:17 PM Danny Cranmer 
 wrote:

 > This release is officially cancelled due to the discovery of a critical
 > bug [1].
 >
 > I will fix and follow up with an RC.
 >
 > [1] https://issues.apache.org/jira/browse/FLINK-30304
 >
 > On Mon, Dec 5, 2022 at 3:25 PM Danny Cranmer 
 > wrote:
 >
 >> Hi everyone,
 >> Please review and vote on the release candidate #1 for the version 
4.0.0,
 >> as follows:
 >> [ ] +1, Approve the release
 >> [ ] -1, Do not approve the release (please provide specific comments)
 >>
 >> This release externalizes the Kinesis Data Streams and Kinesis Data
 >> Firehose connector to the flink-connector-aws repository.
 >>
 >> The complete staging area is available for your review, which includes:
 >> * JIRA release notes [1],
 >> * the official Apache source release to be deployed to dist.apache.org
 >> [2], which are signed with the key with fingerprint 125FD8DB [3],
 >> * all artifacts to be deployed to the Maven Central Repository [4],
 >> * source code tag v4.0.0-rc1 [5],
 >> * website pull request listing the new release [6].
 >>
 >> The vote will be open for at least 72 hours (Thursday 8th December 16:00
 >> UTC). It is adopted by majority approval, with at least 3 PMC 
affirmative
 >> votes.
 >>
 >> Thanks,
 >> Danny
 >>
 >> [1]
 >> 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352538
 >> [2]
 >> 
https://dist.apache.org/repos/dist/dev/flink/flink-connector-aws-4.0.0-rc1/
 >> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
 >> [4]
 >> https://repository.apache.org/content/repositories/orgapacheflink-1558/
 >> [5] 
https://github.com/apache/flink-connector-aws/releases/tag/v4.0.0-rc1
 >> [6] https://github.com/apache/flink-web/pull/592
 >>
 >>
 >>





[jira] [Created] (FLINK-30351) Enable test cases again

2022-12-09 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30351:
-

 Summary: Enable test cases again
 Key: FLINK-30351
 URL: https://issues.apache.org/jira/browse/FLINK-30351
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Pulsar
Affects Versions: 1.17.0, 1.16.1, 1.15.4
Reporter: Matthias Pohl


The following Pulsar-related tests were disabled due to test instabilities (see 
linked subtasks):
 * PulsarSourceUnorderedE2ECase
 * PulsarUnorderedPartitionSplitReaderTest
 * PulsarUnorderedSourceITCase

This issue can be resolved by enabling the tests again on {{master}} and the 
release branches after the test instabilities have been resolved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Release flink-connector-aws v4.0.0, release candidate #1

2022-12-09 Thread Martijn Visser
+1 (binding)

- Validated hashes
- Verified signature
- Verified that no binaries exist in the source archive
- Build the source with Maven
- Verified licenses
- Verified web PR

On Fri, Dec 9, 2022 at 12:16 PM Chesnay Schepler  wrote:

> +1 (binding)
>
> - clean source release
> - builds from source
> - source matches git tag
> - all expected maven artifacts present
> - maven artifacts have correct Flink version suffix
> - releases notes are good
> - PR is good
>
> Not a blocking issue, but the source NOTICE currently says "Apache
> Flink", where it should say "Apache Flink
> AWS connector" or something.
>
>
> On 07/12/2022 14:38, Teoh, Hong wrote:
> > +1 (non-binding)
> >
> > * Hashes and Signatures look good
> > * All required files on dist.apache.org
> > * Tag is present in Github
> > * Verified source archive does not contain any binary files
> > * Source archive builds using maven
> > * Started packaged example SQL job using SQL client. Verified that the
> following connectors work:
> > * flink-sql-connector-aws-kinesis-firehose
> > * flink-sql-connector-aws-kinesis-streams
> > * flink-sql-connector-kinesis
> >
> > Cheers,
> > Hong
> >
> >
> > On 06/12/2022, 17:41, "Danny Cranmer"  wrote:
> >
> >
> >
> >
> >  Hi everyone,
> >
> >  I am reopening this vote since the issue we detected [1] cannot be
> fixed
> >  without AWS SDK changes, and it is an existing issue, not a
> regression.
> >
> >  Please review and vote on the release candidate #1 for the version
> 4.0.0,
> >  as follows:
> >  [ ] +1, Approve the release
> >  [ ] -1, Do not approve the release (please provide specific
> comments)
> >
> >  This release externalizes the Kinesis Data Streams and Kinesis Data
> >  Firehose connector to the flink-connector-aws repository.
> >
> >  The complete staging area is available for your review, which
> includes:
> >  * JIRA release notes [2],
> >  * the official Apache source release to be deployed to
> dist.apache.org [3],
> >  which are signed with the key with fingerprint 125FD8DB [4],
> >  * all artifacts to be deployed to the Maven Central Repository [5],
> >  * source code tag v4.0.0-rc1 [6],
> >  * website pull request listing the new release [7].
> >
> >  The vote will be open for at least 72 hours (Friday 9th December
> 18:00
> >  UTC). It is adopted by majority approval, with at least 3 PMC
> affirmative
> >  votes.
> >
> >  Thanks,
> >  Danny
> >
> >  [1] https://issues.apache.org/jira/browse/FLINK-30304
> >  [2]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352538
> >  [3]
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-connector-aws-4.0.0-rc1/
> >  [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> >  [5]
> https://repository.apache.org/content/repositories/orgapacheflink-1558/
> >  [6]
> https://github.com/apache/flink-connector-aws/releases/tag/v4.0.0-rc1
> >  [7] https://github.com/apache/flink-web/pull/592
> >
> >  On Mon, Dec 5, 2022 at 4:17 PM Danny Cranmer <
> dannycran...@apache.org>
> >  wrote:
> >
> >  > This release is officially cancelled due to the discovery of a
> critical
> >  > bug [1].
> >  >
> >  > I will fix and follow up with an RC.
> >  >
> >  > [1] https://issues.apache.org/jira/browse/FLINK-30304
> >  >
> >  > On Mon, Dec 5, 2022 at 3:25 PM Danny Cranmer <
> dannycran...@apache.org>
> >  > wrote:
> >  >
> >  >> Hi everyone,
> >  >> Please review and vote on the release candidate #1 for the
> version 4.0.0,
> >  >> as follows:
> >  >> [ ] +1, Approve the release
> >  >> [ ] -1, Do not approve the release (please provide specific
> comments)
> >  >>
> >  >> This release externalizes the Kinesis Data Streams and Kinesis
> Data
> >  >> Firehose connector to the flink-connector-aws repository.
> >  >>
> >  >> The complete staging area is available for your review, which
> includes:
> >  >> * JIRA release notes [1],
> >  >> * the official Apache source release to be deployed to
> dist.apache.org
> >  >> [2], which are signed with the key with fingerprint 125FD8DB [3],
> >  >> * all artifacts to be deployed to the Maven Central Repository
> [4],
> >  >> * source code tag v4.0.0-rc1 [5],
> >  >> * website pull request listing the new release [6].
> >  >>
> >  >> The vote will be open for at least 72 hours (Thursday 8th
> December 16:00
> >  >> UTC). It is adopted by majority approval, with at least 3 PMC
> affirmative
> >  >> votes.
> >  >>
> >  >> Thanks,
> >  >> Danny
> >  >>
> >  >> [1]
> >  >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=123

[jira] [Created] (FLINK-30352) [Connectors][Elasticsearch] Document missing configuration properties

2022-12-09 Thread Andriy Redko (Jira)
Andriy Redko created FLINK-30352:


 Summary: [Connectors][Elasticsearch] Document missing 
configuration properties
 Key: FLINK-30352
 URL: https://issues.apache.org/jira/browse/FLINK-30352
 Project: Flink
  Issue Type: Bug
  Components: Connectors / ElasticSearch
Affects Versions: 1.15.3, 1.16.0
Reporter: Andriy Redko


There are a number of configuration properties that are not documented:

- sink.delivery-guarantee
- connection.request-timeout
- connection.timeout
- socket.timeout
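
For illustration, once documented these would be used like any other connector 
option (the connector name and values below are only an example, not taken 
from the current docs):

CREATE TABLE es_sink (
  id STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'elasticsearch-7',
  'hosts' = 'http://localhost:9200',
  'index' = 'my-index',
  'sink.delivery-guarantee' = 'at-least-once',
  'connection.request-timeout' = '30s',
  'connection.timeout' = '10s',
  'socket.timeout' = '60s'
);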
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

2022-12-09 Thread Dong Lin
Hi Hang,

Thanks for the FLIP! The FLIP looks good and it is pretty informative.

I have just two minor comments regarding names:
- Would it be useful to rename the config key to
metrics.scope.jm.job.operator-coordinator for consistency with
metrics.scope.jm.job (which is not named metrics.scope.jm-job)?
- Maybe rename the variable to SCOPE_NAMING_OPERATOR_COORDINATOR for
simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
SCOPE_NAMING_TM_JOB_OPERATOR)?
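
For illustration, with that naming the flink-conf.yaml entry could look like
this (the scope format on the right-hand side is purely illustrative, not a
proposed default):

metrics.scope.jm.job.operator-coordinator: <host>.jobmanager.<job_name>.<operator_name>.coordinator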

Cheers,
Dong



On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan  wrote:

> Hi all,
>
> MengYue and I created FLIP-274[1], "Introduce metric group for
> OperatorCoordinator". The OperatorCoordinator is the coordinator for runtime
> operators and runs on the Job Manager. The coordination mechanism is
> operator events between the OperatorCoordinator and all its operators; this
> coordination is used more and more in Flink. For example, many Sources and
> Sinks depend on the mechanism to assign splits and coordinate commits to
> external systems. The OperatorCoordinator is widely used in the flink kafka
> connector, flink pulsar connector, flink cdc connector, flink hudi
> connector, and so on.
>
> But there is neither a suitable metric group scope for the OperatorCoordinator
> nor an implementation of the interface OperatorCoordinatorMetricGroup.
> Metrics in the OperatorCoordinator could be how many splits/partitions
> have been assigned to source readers, or how many files have been written out
> by sink writers; such metrics not only help users to know the job progress
> but also make maintaining big jobs easier. Thus we propose FLIP-274 to
> introduce a new metric group scope for OperatorCoordinator and provide an
> internal implementation for OperatorCoordinatorMetricGroup.
>
> Could you help review this FLIP when you get time? Any feedback is
> appreciated!
>
> Best,
> Hang
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>


[jira] [Created] (FLINK-30353) Enable concurrency for external connector repositories

2022-12-09 Thread Martijn Visser (Jira)
Martijn Visser created FLINK-30353:
--

 Summary: Enable concurrency for external connector repositories
 Key: FLINK-30353
 URL: https://issues.apache.org/jira/browse/FLINK-30353
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / Common
Reporter: Martijn Visser
Assignee: Martijn Visser






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-277: Native GlueCatalog Support in Flink

2022-12-09 Thread Jark Wu
Hi Samrat,

Thanks a lot for driving the new catalog, and sorry for jumping into the
discussion late.

As Flink SQL is becoming the first-class citizen of the Flink API, we are
planning to make Catalog the first-class citizen of connectors, instead of
Source & Sink. For Flink SQL users, using a Catalog is as natural and
user-friendly as working with databases, rather than having to define DDL
and schemas over and over again. This is also how Trino/Presto do it.

Regarding the repo for the Glue catalog, I think we can add it to
flink-connector-aws. We don't need
separate repos for Catalogs because Catalog is a kind of connector (others
are sources & sinks).
For example, MySqlCatalog[1] and PostgresCatalog[2] are in
flink-connector-jdbc, and HiveCatalog is
in flink-connector-hive. This can reduce repository maintenance, and I
think maybe some common
AWS utils can be shared there.  cc @Danny Cranmer 
what do you think about this?

Besides, I have a question about Glue Namespaces. Could you share the
documentation of the Glue Namespaces? (Sorry, I didn't find it.) According
to the "Flink Glue Metaspace Mapping" section, if there is a database "mydb"
under namespace "ns1", does that mean the database name in Flink is "ns1.mydb"?
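
For illustration, if the mapping works that way, usage could hypothetically
look like this (the 'type' = 'glue' option and the table names are
assumptions on my side, not the FLIP's final API):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.inStreamingMode());
tEnv.executeSql("CREATE CATALOG glue WITH ('type' = 'glue')");
// If the namespace is folded into the database name, the path would be:
tEnv.executeSql("SELECT * FROM glue.`ns1.mydb`.my_table");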

Best,
Jark


[1]:
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/MySqlCatalog.java
[2]:
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-jdbc/src/main/java/org/apache/flink/connector/jdbc/catalog/PostgresCatalog.java

On Fri, 9 Dec 2022 at 08:51, Dong Lin  wrote:

> Hi Samrat,
>
> Sorry for the late reply. Yeah I am referring to creating a similar
> external repo such as flink-catalog-glue. flink-connector-aws is already
> named with `connector` so it seems a bit weird to put a catalog there.
>
> Thanks!
> Dong
>
> On Wed, Dec 7, 2022 at 1:04 PM Samrat Deb  wrote:
>
> > Hi Dong Lin,
> >
> > Since this is the first proposal for adding a vendor-specific catalog
> > > library in Flink, I think maybe we should also externalize those
> catalog
> > > libraries similar to how we are externalizing connector libraries. It
> is
> > > likely that we might want to add catalogs for other vectors in the
> > future.
> > > Externalizing those catalogs can make Flink development more scalable
> in
> > > the long term.
> >
> > Initially I misinterpreted externalising the catalogs; there already
> > exists an externalised connector for AWS [1].
> > Are you referring to creating a similar external repo for catalogs, or
> > would it be better to add it to flink-connector-aws [1]?
> >
> > [1] https://github.com/apache/flink-connector-aws
> >
> > Samrat
> >
> > On Tue, Dec 6, 2022 at 6:52 PM Samrat Deb  wrote:
> >
> > > Hi Dong Lin,
> > >
> > > Aws Glue Data catalog is vendor specific and in future we will get such
> > > type of implementation from different providers. We should
> > > definitely externalize these catalog libraries similar to flink
> > connectors.
> > > I am thinking of creating
> > > flink-catalog similar to flink-connector under the root (flink). glue
> > > catalog can be one of modules under the flink-catalog . Please suggest
> if
> > > there is a better structure we can create for catalogs.
> > >
> > >
> > > It is mentioned in the FLIP that there will be two types of
> SdkHttpClient
> > >> supported based on the catalog option http-client.type. Is
> > >> http-client.type
> > >> a public config for the GlueCatalog? If yes, can we add this config to
> > the
> > >> "Configurations" section and explain how users should choose the
> client
> > >> type?
> > >
> > >
> > > Yes, http-client.type is a public config for the GlueCatalog. By default
> > > the client-type will be `urlconnection` if the user doesn't specify any
> > > connection type.
> > > I have updated the FLIP-277 [1] #configuration section with all the
> > > configs. Please review it again.
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
> > >
> > > Samrat
> > >
> > > On Tue, Dec 6, 2022 at 5:50 PM Samrat Deb 
> wrote:
> > >
> > >> Hi Yuxia,
> > >>
> > >> Thank you for reviewing the flip and putting forward your observations
> > >> and comments.
> > >>
> > >> 1: I noticed there's a YAML part in the section of "Using the
> Catalog",
> > >>> what do you mean by that? Do you mean how to use glue catalog in sql
> > >>> client? If so, just for your information, it's not supported to use
> > yaml
> > >>> environment file in sql client [2].
> > >>
> > >>
> > >> Thank you for attaching the jira ticket [1]. I missed the changes.
> > >> There is a provision to register a catalog directly through factory
> > >> resources.
> > >> - GenericInMemoryCatalog is defined through
> > >>
> >
> `flink/flink-table/flink-table-api-java/src/main/resources/META-INF/services/org.apache.flink.table.factories.Factory`
> > >> - HiveCatalog is defined through
> 

Re: [VOTE] Release flink-connector-jdbc v3.0.0, release candidate #1

2022-12-09 Thread Martijn Visser
That was because the number of changes between 1.16.0 and master was
limited.


Re: [DISCUSS] Cleaning up HighAvailabilityServices interface to reflect the per-JM-process LeaderElection

2022-12-09 Thread weijie guo
Hi Matthias,

Thanks for the proposal! I am in favor of cleaning up this interface; it
seems a bit cumbersome now, especially since the implementation of
per-component leader election has been removed from our current code path.

To be honest, I don't like the per-component approach; I'm even often asked
why Flink is designed this way. Of course, I admit that it makes our HA
service more flexible. But personally, I think the per-process solution is
better, at least from the perspective of reducing potential problems
like FLINK-24038, and it can definitely reduce the complexity of the JobManager.

Regarding "We lose some flexibility in terms of per-component
LeaderElection", I am curious whether there are really so many extreme
requirements that we have to rely on the per-component pattern to achieve
them. If there are, are those requirements really reasonable? Users may
inadvertently recreate problems similar to FLINK-24038.

Best regards,

Weijie


Matthias Pohl  wrote on Fri, Dec 9, 2022 at 17:47:

> Hi Dong,
> see my answers below.
>
> Regarding "Interface change might affect other projects that customize HA
> > services", are you referring to those projects which hack into Flink's
> > source code (as opposed to using Flink's public API) to customize HA
> > services?
>
>
> Yes, the proposed change might affect projects that need to have their own
> HA implementation for whatever reason (interface change) or if a project
> accesses the HA backend to retrieve metadata from the ZK node/k8s ConfigMap
> (change about how the data is stored in the HA backend). The latter one was
> actually already the case with the change introduced by FLINK-24038 [1].
>
> By the way, since Flink already supports zookeeper and kubernetes as the
> > high availability services, are you aware of many projects that still
> need
> > to hack into Flink's code to customize high availability services?
>
>
> I am aware of projects that use customized HA. But based on our experience
> in FLINK-24038 [1] no one complained. So, making people aware through the
> mailing list might be good enough.
>
> And regarding "We lose some flexibility in terms of per-component
> > LeaderElection", could you explain what flexibility we need so that we
> can
> > gauge the associated downside of losing the flexibility?
>
>
> Just to recap: The current interface allows having per-component
> LeaderElection (e.g. the ResourceManager leader can run on a different
> JobManager than the Dispatcher). This implementation was replaced by
> FLINK-24038 [1] and removed in FLINK-25806 [2]. The new implementation does
> LeaderElection per process (e.g. ResourceManager and Dispatcher always run
> on the same JobManager). The changed interface would require us to touch
> the interface again if (for whatever reason) we want to reintroduce
> per-component leader election in some form.
> The interface change is, strictly speaking, not necessary to provide the
> new functionality. But I like the idea of certain requirements (currently,
> we need per-process leader election to fix what was reported in FLINK-24038
> [1]) being reflected in the interface. This makes sure that we don't
> introduce a per-component leader election again accidentally in the future
> because we thought it's a good idea but forgot about FLINK-24038.
>
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-24038
> [2] https://issues.apache.org/jira/browse/FLINK-25806
>
> On Fri, Dec 9, 2022 at 2:09 AM Dong Lin  wrote:
>
> > Hi Matthias,
> >
> > Thanks for the proposal! Overall I am in favor of making this interface
> > change to make Flink's codebase more maintainable.
> >
> > Regarding "Interface change might affect other projects that customize HA
> > services", are you referring to those projects which hack into Flink's
> > source code (as opposed to using Flink's public API) to customize HA
> > services? If yes, it seems OK to break those projects since we don't have
> > any backward compatibility guarantee for those projects.
> >
> > By the way, since Flink already supports zookeeper and kubernetes as the
> > high availability services, are you aware of many projects that still
> need
> > to hack into Flink's code to customize high availability services?
> >
> > And regarding "We lose some flexibility in terms of per-component
> > LeaderElection", could you explain what flexibility we need so that we
> can
> > gauge the associated downside of losing the flexibility?
> >
> > Thanks!
> > Dong
> >
> >
> >
> > On Wed, Dec 7, 2022 at 4:28 PM Matthias Pohl  > .invalid>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > The Flink community introduced a new way of doing leader election in
> > Flink
> > > 1.15 with FLINK-24038 [1]. Instead of a per-component leader election,
> > all
> > > components (i.e. ResourceManager, Dispatcher, REST server, JobMaster)
> > use a
> > > single (per-JM-process) leader election instance. It was meant to fix
> > some
> > > issues with deregistering Flink applications in multi-JM s

[jira] [Created] (FLINK-30354) Reducing the number of ThreadPools in LookupFullCache and related cache-loading classes

2022-12-09 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30354:
-

 Summary: Reducing the number of ThreadPools in LookupFullCache and 
related cache-loading classes
 Key: FLINK-30354
 URL: https://issues.apache.org/jira/browse/FLINK-30354
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Runtime
Affects Versions: 1.17.0
Reporter: Matthias Pohl


In the course of reviewing FLINK-29405, I came up with a proposal to reduce the 
complexity of the {{LookupFullCache}} implementation and shrink the number 
of threadpools being used from 3 to 2. Here's the proposal I also shared in the 
[FLINK-29405 PR 
comment|https://github.com/apache/flink/pull/20919#pullrequestreview-1208332584]:

About the responsibilities as I see them:
* {{LookupFullCache}} is the composite class for combining the {{CacheLoader}} 
and the {{CacheReloadTrigger}} through the {{ReloadTriggerContext}}
* {{ReloadTriggerContext}} provides an async call to trigger the reload but 
also some utility methods for providing processing or event time (it's not 
clear to me why this is connected with the reload; it looks like a future 
task based on the TODO comments)
* {{CacheLoader}} is in charge of loading the data into memory (if possible 
concurrently).
* {{CacheReloadTrigger}} provides different strategies to trigger new reloads.

About the different executors:
* The {{CacheReloadTrigger}} utilizes a {{SingleThreadScheduledExecutor}} which 
triggers {{ReloadTriggerContext::reload}} repeatedly. If the loading takes 
longer, subsequently triggered calls pile up. Here, I'm wondering whether 
that's what we want.
* {{CacheLoader}} utilizes a {{SingleThreadExecutor}} in 
{{CacheLoader#reloadExecutor}} which is kind of the "main" thread for reloading 
the data. It triggers {{CacheLoader#updateCache}} with 
{{CacheLoader#reloadLock}} being acquired. 
{{(InputFormat)CacheLoader#updateCache}} is implemented synchronously. The data 
is loaded concurrently if possible using a {{FixedThreadPool}}.

My proposal is now to reduce the number of used thread pools: Instead of having 
a {{SingleThreadExecutor}} and a {{FixedThreadPool}} in the {{CacheLoader}} 
implementation, couldn't we come up with a custom {{ThreadPoolExecutor}} where 
we specify the minimum number of threads being 1 and the maximum being the 
number of cores (similar to what is already there with 
[ThreadUtils#newThreadPool|https://github.com/apache/flink/blob/d067629d4d200f940d0b58759459d7ff5832b292/flink-table/flink-sql-gateway-api/src/main/java/org/apache/flink/table/gateway/api/utils/ThreadUtils.java#L36]).
 That would free the {{CacheLoader}} from starting and shutting down thread 
pools by moving its ownership from {{CacheLoader}} to {{LookupFullCache}} 
calling it the {{cacheLoadingThreadPool}} (or similar). Additionally, the 
{{ScheduledThreadPool}} currently living in the {{CacheReloadTrigger}} 
implementations could move into {{LookupFullCache}} as well calling it 
something like {{cacheLoadSchedulingThreadPool}}. LookupFullCache would be in 
charge of managing all cache loading-related threads. Additionally, it would 
manage the current execution through {{CompletableFutures}} (one for triggering 
the reload and one for executing the reload). Triggering a reload would require 
cancelling the current future (if it's not completed yet) or ignoring the 
trigger if we want a reload to finish before triggering a new one.
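
A rough sketch of that shared pool (illustrative only, not a final API):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical: one elastic pool owned by LookupFullCache. Note the
// SynchronousQueue; with an unbounded queue a ThreadPoolExecutor never
// grows beyond its core size, so the pool would stay single-threaded.
int maxThreads = Runtime.getRuntime().availableProcessors();
ExecutorService cacheLoadingThreadPool =
        new ThreadPoolExecutor(
                1, maxThreads, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                // Run in the submitting thread when all workers are busy.
                new ThreadPoolExecutor.CallerRunsPolicy());
{code}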

{{CacheLoader#updateCache}} would become 
{{CacheLoader#updateCacheAsync(ExecutorService)}} returning a 
{{CompletableFuture}} that completes as soon as all subtasks are completed. 
{{CacheLoader#reloadAsync}} would return this {{CompletableFuture}} instead of 
creating its own future. The lifecycle (as already explained in the previous 
paragraph) would be managed by {{LookupFullCache}}. The benefit would be that 
we wouldn't have to deal with interrupts in {{CacheLoader}}.
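
The async variant could roughly have the following shape ({{splits}} and 
{{loadSplit}} are placeholders for the actual input-format reading logic):
{code}
// Hypothetical: schedule one loading subtask per split on the shared pool
// and complete when all of them are done.
CompletableFuture<Void> updateCacheAsync(ExecutorService pool) {
    CompletableFuture<?>[] subtasks =
            splits.stream()
                    .map(split -> CompletableFuture.runAsync(() -> loadSplit(split), pool))
                    .toArray(CompletableFuture[]::new);
    return CompletableFuture.allOf(subtasks);
}
{code}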

I see the following benefits:

* {{ReloadTriggerContext}} becomes obsolete (one has to clarify what the event 
time and processing time functions are for, though).
* {{CacheLoader#awaitFirstLoad}} becomes obsolete as well. We can verify the 
completion of the cache loading in {{LookupFullCache}} through the 
{{CompletableFuture}} instances.
* {{CacheReloadTrigger}} can focus on the strategy implementation without 
worrying about instantiating threads. This is duplicated code right now in 
{{PeriodicCacheReloadTrigger}} and {{TimedCacheReloadTrigger}}.
I might be missing something here. I'm curious what you think. I probably got 
carried away a bit by your proposal introducing async calls. I totally 
understand if you argue that it's way too much out-of-scope for this issue and 
we actually want to focus on fixing the test instability. In that case, I would 
do another round of review on your current proposal. But I'm happy to help you 
if you think that my proposal is reasonable. 

Re: [VOTE] Release flink-connector-aws v4.0.0, release candidate #1

2022-12-09 Thread Danny Cranmer
+1 (binding)

- Validated hashes
- Verified signature
- Verified that no binaries exist in the source archive
- Build the source with Maven
- Verified licenses
- Run the following DataStream apps:
  - KDS > KDS (v1 legacy)
  - KDS > KDS (v2)
  - KDS > KDF
  - KDS (EFO) > DDB

On Fri, Dec 9, 2022 at 1:16 PM Martijn Visser 
wrote:

> +1 (binding)
>
> - Validated hashes
> - Verified signature
> - Verified that no binaries exist in the source archive
> - Build the source with Maven
> - Verified licenses
> - Verified web PR
>
> On Fri, Dec 9, 2022 at 12:16 PM Chesnay Schepler 
> wrote:
>
> > +1 (binding)
> >
> > - clean source release
> > - builds from source
> > - source matches git tag
> > - all expected maven artifacts present
> > - maven artifacts have correct Flink version suffix
> > - releases notes are good
> > - PR is good
> >
> > Not a blocking issue, but the source NOTICE currently says "Apache
> > Flink", where it should say "Apache Flink
> > AWS connector" or something.
> >
> >
> > On 07/12/2022 14:38, Teoh, Hong wrote:
> > > +1 (non-binding)
> > >
> > > * Hashes and Signatures look good
> > > * All required files on dist.apache.org
> > > * Tag is present in Github
> > > * Verified source archive does not contain any binary files
> > > * Source archive builds using maven
> > > * Started packaged example SQL job using SQL client. Verified that the
> > following connectors work:
> > > * flink-sql-connector-aws-kinesis-firehose
> > > * flink-sql-connector-aws-kinesis-streams
> > > * flink-sql-connector-kinesis
> > >
> > > Cheers,
> > > Hong
> > >
> > >
> > > On 06/12/2022, 17:41, "Danny Cranmer" 
> wrote:
> > >
> > >
> > >
> > >
> > >  Hi everyone,
> > >
> > >  I am reopening this vote since the issue we detected [1] cannot be
> > fixed
> > >  without AWS SDK changes, and it is an existing issue, not a
> > regression.
> > >
> > >  Please review and vote on the release candidate #1 for the version
> > 4.0.0,
> > >  as follows:
> > >  [ ] +1, Approve the release
> > >  [ ] -1, Do not approve the release (please provide specific
> > comments)
> > >
> > >  This release externalizes the Kinesis Data Streams and Kinesis
> Data
> > >  Firehose connector to the flink-connector-aws repository.
> > >
> > >  The complete staging area is available for your review, which
> > includes:
> > >  * JIRA release notes [2],
> > >  * the official Apache source release to be deployed to
> > dist.apache.org [3],
> > >  which are signed with the key with fingerprint 125FD8DB [4],
> > >  * all artifacts to be deployed to the Maven Central Repository
> [5],
> > >  * source code tag v4.0.0-rc1 [6],
> > >  * website pull request listing the new release [7].
> > >
> > >  The vote will be open for at least 72 hours (Friday 9th December
> > 18:00
> > >  UTC). It is adopted by majority approval, with at least 3 PMC
> > affirmative
> > >  votes.
> > >
> > >  Thanks,
> > >  Danny
> > >
> > >  [1] https://issues.apache.org/jira/browse/FLINK-30304
> > >  [2]
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352538
> > >  [3]
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-connector-aws-4.0.0-rc1/
> > >  [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > >  [5]
> > https://repository.apache.org/content/repositories/orgapacheflink-1558/
> > >  [6]
> > https://github.com/apache/flink-connector-aws/releases/tag/v4.0.0-rc1
> > >  [7] https://github.com/apache/flink-web/pull/592
> > >
> > >  On Mon, Dec 5, 2022 at 4:17 PM Danny Cranmer <
> > dannycran...@apache.org>
> > >  wrote:
> > >
> > >  > This release is officially cancelled due to the discovery of a
> > critical
> > >  > bug [1].
> > >  >
> > >  > I will fix and follow up with an RC.
> > >  >
> > >  > [1] https://issues.apache.org/jira/browse/FLINK-30304
> > >  >
> > >  > On Mon, Dec 5, 2022 at 3:25 PM Danny Cranmer <
> > dannycran...@apache.org>
> > >  > wrote:
> > >  >
> > >  >> Hi everyone,
> > >  >> Please review and vote on the release candidate #1 for the
> > version 4.0.0,
> > >  >> as follows:
> > >  >> [ ] +1, Approve the release
> > >  >> [ ] -1, Do not approve the release (please provide specific
> > comments)
> > >  >>
> > >  >> This release externalizes the Kinesis Data Streams and Kinesis
> > Data
> > >  >> Firehose connector to the flink-connector-aws repository.
> > >  >>
> > >  >> The complete staging area is available for your review, which
> > includes:
> > >  >> * JIRA release notes [1],
> > >  >> * the official Apache source release to be deployed to
> > dist.apache.org
> > >  >> [2], whic

[jira] [Created] (FLINK-30355) crictl causes long wait in e2e tests

2022-12-09 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-30355:
-

 Summary: crictl causes long wait in e2e tests
 Key: FLINK-30355
 URL: https://issues.apache.org/jira/browse/FLINK-30355
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure, Tests
Affects Versions: 1.17.0
Reporter: Matthias Pohl


We observed strange behavior in the e2e test where the e2e test run times out: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43824&view=logs&j=bea52777-eaf8-5663-8482-18fbc3630e81&s=ae4f8708-9994-57d3-c2d7-b892156e7812&t=b2642e3a-5b86-574d-4c8a-f7e2842bfb14&l=7446

The issue seems to be related to {{crictl}} again because we see the following 
error message in multiple tests. No logs are produced afterwards for ~30mins 
resulting in the overall test run taking too long:
{code}
Dec 09 08:55:39 crictl
fatal: destination path 'cri-dockerd' already exists and is not an empty 
directory.
fatal: a branch named 'v0.2.3' already exists
mkdir: cannot create directory ‘bin’: File exists
Dec 09 09:26:41 fs.protected_regular = 0
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-30356) Update NOTICE files to say "Apache Flink AWS connectors"

2022-12-09 Thread Danny Cranmer (Jira)
Danny Cranmer created FLINK-30356:
-

 Summary: Update NOTICE files to say "Apache Flink AWS connectors"
 Key: FLINK-30356
 URL: https://issues.apache.org/jira/browse/FLINK-30356
 Project: Flink
  Issue Type: Technical Debt
  Components: Connectors / AWS
Affects Versions: aws-connector-3.0.0, aws-connector-4.0.0
Reporter: Danny Cranmer
 Fix For: aws-connector-4.1.0, aws-connector-3.1.0


[https://lists.apache.org/thread/8bb8kh3w5ohztj50k4cgsqt97466t9fj]

 

Update all NOTICE files as per:

- "Not a blocking issue, but the source NOTICE currently says "Apache Flink", 
where it should say "Apache Flink AWS connector" or something."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Cleaning up HighAvailabilityServices interface to reflect the per-JM-process LeaderElection

2022-12-09 Thread Chesnay Schepler
I generally agree that the internals of the HA services are currently 
too complex, but I'm wondering if the proposal doesn't go a bit too far 
to resolve those.
Is there maybe some way we can refactor things internally to reduce 
complexity while keeping the per-component semantics?


Ultimately, the per-component leader election gives us the theoretical 
ability to split components into separate processes, which is also 
something we strive to maintain in other layers like the RPC system.


That's a powerful property to have, which is also quite difficult to 
patch back in once you get rid of it.


Of note, whenever a discussion came up about scalability of the JM 
process, the first answer has _always_ been "well, we could split it up at 
some point if it's necessary".


> I am curious that there are so many such extreme requirements that we 
have to rely on the per-component pattern to achieve them?


This doesn't necessarily go into the _extreme_ direction. It could be 
something as simple as running the UI in an environment that is more 
accessible than the other processes, running multiple UIs for 
load-balancing purposes without paying the additional memory tax of a 
full JM, or the Dispatcher process not running any user-code (== some 
isolation between jobs).
The original FLIP-6 design had ideas to that end, and they aren't really 
bad ideas. We just never executed them.


> users may inadvertently recreate problems similar to FLINK-24038

That's certainly a risk, but the per-process leader election was just 
one possible solution, that just also had other benefits at the time.




Right now I unfortunately can't provide specific ideas on how we could 
clean things up internally; that'd take some time that I won't have 
until next year.


On 09/12/2022 16:41, weijie guo wrote:

Hi Matthias,

Thanks for the proposal! I am in favor of cleaning up this interface; it
seems a bit cumbersome now, especially since the implementation of
per-component leader election has been removed from our current code path.

To be honest, I don't like the per-component approach; I'm even often asked
why Flink is designed this way. Of course, I admit that it makes our HA
service more flexible. But personally, I think the per-process solution is
better, at least from the perspective of reducing potential problems
like FLINK-24038, and it can definitely reduce the complexity of the JobManager.

Regarding "We lose some flexibility in terms of per-component
LeaderElection", I am curious whether there are really so many extreme
requirements that we have to rely on the per-component pattern to achieve
them. If there are, are those requirements really reasonable? Users may
inadvertently recreate problems similar to FLINK-24038.

Best regards,

Weijie


Matthias Pohl  wrote on Fri, Dec 9, 2022 at 17:47:


Hi Dong,
see my answers below.

Regarding "Interface change might affect other projects that customize HA

services", are you referring to those projects which hack into Flink's
source code (as opposed to using Flink's public API) to customize HA
services?


Yes, the proposed change might affect projects that need to have their own
HA implementation for whatever reason (interface change) or if a project
accesses the HA backend to retrieve metadata from the ZK node/k8s ConfigMap
(change about how the data is stored in the HA backend). The latter one was
actually already the case with the change introduced by FLINK-24038 [1].

By the way, since Flink already supports zookeeper and kubernetes as the

high availability services, are you aware of many projects that still

need

to hack into Flink's code to customize high availability services?


I am aware of projects that use customized HA. But based on our experience
in FLINK-24038 [1] no one complained. So, making people aware through the
mailing list might be good enough.

And regarding "We lose some flexibility in terms of per-component

LeaderElection", could you explain what flexibility we need so that we

can

gauge the associated downside of losing the flexibility?


Just to recap: The current interface allows having per-component
LeaderElection (e.g. the ResourceManager leader can run on a different
JobManager than the Dispatcher). This implementation was replaced by
FLINK-24038 [1] and removed in FLINK-25806 [2]. The new implementation does
LeaderElection per process (e.g. ResourceManager and Dispatcher always run
on the same JobManager). The changed interface would require us to touch
the interface again if (for whatever reason) we want to reintroduce
per-component leader election in some form.
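
To illustrate the difference, here is a deliberately simplified sketch in
Java; the names are hypothetical and this is not Flink's actual
HighAvailabilityServices API:

// Per-component: each component obtains its own leader election, so the
// components could in principle run in separate processes.
interface PerComponentHaServices {
    LeaderElection resourceManagerLeaderElection();
    LeaderElection dispatcherLeaderElection();
    LeaderElection restEndpointLeaderElection();
    LeaderElection jobMasterLeaderElection(String jobId);
}

// Per-process: a single election for the whole JobManager process; every
// component follows the process-wide leadership (the post-FLINK-24038 model).
interface PerProcessHaServices {
    LeaderElection jobManagerProcessLeaderElection();
}

// Placeholder type, only to make the sketch self-contained.
interface LeaderElection {}
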
The interface change is, strictly speaking, not necessary to provide the
new functionality. But I like the idea of certain requirements (currently,
we need per-process leader election to fix what was reported in FLINK-24038
[1]) being reflected in the interface. This makes sure that we don't
introduce a per-component leader election again accidentally in the future
because we thought it's a good idea but forgot about FLINK-24038.

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-24038
[2] https://issues.apache.org/jira/browse/FLINK-25806

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

2022-12-09 Thread Chesnay Schepler
As a whole I feel like this FLIP is overly complicated. A dedicated 
coordinator MG implementation is overkill; it could just re-use the 
existing Task/OperatorMGs to create the same structure we have on TMs, 
similar to what we did with the Job MG.


However, I'm not convinced that this is required anyway, because all the 
example metrics you listed can be implemented on the TM side + 
aggregating them in the external metrics backend.
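
For what it's worth, a minimal sketch of that TM-side alternative, assuming
a reader-style class that is handed a metric group (the class and wiring
are illustrative, not an existing Flink API):

import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.MetricGroup;

// Each subtask counts the splits assigned to it; the external backend
// (e.g. Prometheus) can then sum this counter across all subtasks to get
// the job-wide total, without any coordinator-side metric.
class SplitCountingReader {
    private final Counter splitsAssigned;

    SplitCountingReader(MetricGroup readerMetricGroup) {
        this.splitsAssigned = readerMetricGroup.counter("splitsAssigned");
    }

    void onSplitAssigned() {
        splitsAssigned.inc();
    }
}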


Since I'm on holidays soon, just so no one tries to pull a fast one on 
me, if this were to go to a vote as-is I'd be against it.



On 09/12/2022 15:30, Dong Lin wrote:

Hi Hang,

Thanks for the FLIP! The FLIP looks good and it is pretty informative.

I have just two minor comments regarding names:
- Would it be useful to rename the config key to
*metrics.scope.jm.job.operator-coordinator* for consistency with
*metrics.scope.jm.job* (which is not named as *jm-job*)?
- Maybe rename the variable to SCOPE_NAMING_OPERATOR_COORDINATOR for
simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
as SCOPE_NAMING_TM_JOB_OPERATOR)?

Cheers,
Dong



On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan  wrote:


Hi all,

MengYue and I created FLIP-274 [1], "Introduce metric group for
OperatorCoordinator". The OperatorCoordinator is the coordinator for
runtime operators and runs on the JobManager. The coordination mechanism is
operator events exchanged between an OperatorCoordinator and all of its
operators, and this kind of coordination is used more and more in Flink;
for example, many Sources and Sinks depend on the mechanism to assign
splits and coordinate commits to external systems. The OperatorCoordinator
is widely used in the flink kafka connector, flink pulsar connector, flink
cdc connector, flink hudi connector, and so on.

But there is no suitable metric group scope for the OperatorCoordinator,
nor an implementation of the interface OperatorCoordinatorMetricGroup.
Metrics in an OperatorCoordinator could include how many splits/partitions
have been assigned to source readers, or how many files have been written
out by sink writers; such metrics not only help users track job progress
but also make maintaining big jobs easier. Thus we propose FLIP-274 to
introduce a new metric group scope for the OperatorCoordinator and provide
an internal implementation for OperatorCoordinatorMetricGroup.

Could you help review this FLIP when you get time? Any feedback is
appreciated!

Best,
Hang

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator





Re: [VOTE] Release flink-connector-aws v4.0.0, release candidate #1

2022-12-09 Thread Danny Cranmer
Thanks all, this vote thread is now closed.

I have raised an issue [1] to fix the NOTICE file issue Chesnay detected.

Thanks,
Danny

[1] https://issues.apache.org/jira/browse/FLINK-30356

On Fri, Dec 9, 2022 at 4:03 PM Danny Cranmer 
wrote:

> +1 (binding)
>
> - Validated hashes
> - Verified signature
> - Verified that no binaries exist in the source archive
> - Build the source with Maven
> - Verified licenses
> - Run the following DataStream apps:
>   - KDS > KDS (v1 legacy)
>   - KDS > KDS (v2)
>   - KDS > KDF
>   - KDS (EFO) > DDB
>
> On Fri, Dec 9, 2022 at 1:16 PM Martijn Visser 
> wrote:
>
>> +1 (binding)
>>
>> - Validated hashes
>> - Verified signature
>> - Verified that no binaries exist in the source archive
>> - Build the source with Maven
>> - Verified licenses
>> - Verified web PR
>>
>> On Fri, Dec 9, 2022 at 12:16 PM Chesnay Schepler 
>> wrote:
>>
>> > +1 (binding)
>> >
>> > - clean source release
>> > - builds from source
>> > - source matches git tag
>> > - all expected maven artifacts present
>> > - maven artifacts have correct Flink version suffix
>> > - releases notes are good
>> > - PR is good
>> >
>> > Not a blocking issue, but the source NOTICE currently says "Apache
>> > Flink", where it should say "Apache Flink
>> > AWS connector" or something.
>> >
>> >
>> > On 07/12/2022 14:38, Teoh, Hong wrote:
>> > > +1 (non-binding)
>> > >
>> > > * Hashes and Signatures look good
>> > > * All required files on dist.apache.org
>> > > * Tag is present in Github
>> > > * Verified source archive does not contain any binary files
>> > > * Source archive builds using maven
>> > > * Started packaged example SQL job using SQL client. Verified that the
>> > following connectors work:
>> > > * flink-sql-connector-aws-kinesis-firehose
>> > > * flink-sql-connector-aws-kinesis-streams
>> > > * flink-sql-connector-kinesis
>> > >
>> > > Cheers,
>> > > Hong
>> > >
>> > >
>> > > On 06/12/2022, 17:41, "Danny Cranmer" 
>> wrote:
>> > >
>> > >  Hi everyone,
>> > >
>> > >  I am reopening this vote since the issue we detected [1] cannot
>> be
>> > fixed
>> > >  without AWS SDK changes, and it is an existing issue, not a
>> > regression.
>> > >
>> > >  Please review and vote on the release candidate #1 for the
>> version
>> > 4.0.0,
>> > >  as follows:
>> > >  [ ] +1, Approve the release
>> > >  [ ] -1, Do not approve the release (please provide specific
>> > comments)
>> > >
>> > >  This release externalizes the Kinesis Data Streams and Kinesis
>> Data
>> > >  Firehose connector to the flink-connector-aws repository.
>> > >
>> > >  The complete staging area is available for your review, which
>> > includes:
>> > >  * JIRA release notes [2],
>> > >  * the official Apache source release to be deployed to
>> > dist.apache.org [3],
>> > >  which are signed with the key with fingerprint 125FD8DB [4],
>> > >  * all artifacts to be deployed to the Maven Central Repository
>> [5],
>> > >  * source code tag v4.0.0-rc1 [6],
>> > >  * website pull request listing the new release [7].
>> > >
>> > >  The vote will be open for at least 72 hours (Friday 9th December
>> > 18:00
>> > >  UTC). It is adopted by majority approval, with at least 3 PMC
>> > affirmative
>> > >  votes.
>> > >
>> > >  Thanks,
>> > >  Danny
>> > >
>> > >  [1] https://issues.apache.org/jira/browse/FLINK-30304
>> > >  [2]
>> > >
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352538
>> > >  [3]
>> > >
>> >
>> https://dist.apache.org/repos/dist/dev/flink/flink-connector-aws-4.0.0-rc1/
>> > >  [4] https://dist.apache.org/repos/dist/release/flink/KEYS
>> > >  [5]
>> > https://repository.apache.org/content/repositories/orgapacheflink-1558/
>> > >  [6]
>> > https://github.com/apache/flink-connector-aws/releases/tag/v4.0.0-rc1
>> > >  [7] https://github.com/apache/flink-web/pull/592
>> > >
>> > >  On Mon, Dec 5, 2022 at 4:17 PM Danny Cranmer <
>> > dannycran...@apache.org>
>> > >  wrote:
>> > >
>> > >  > This release is officially cancelled due to the discovery of a
>> > critical
>> > >  > bug [1].
>> > >  >
>> > >  > I will fix and follow up with an RC.
>> > >  >
>> > >  > [1] https://issues.apache.org/jira/browse/FLINK-30304
>> > >  >
>> > >  > On Mon, Dec 5, 2022 at 3:25 PM Danny Cranmer <
>> > dannycran...@apache.org>
>> > >  > wrote:
>> > >  >
>> > >  >> Hi everyone,
>> > >  >> Please review and vote on the release candidate #1 for the
>> > version 4.0.0,
>> > >  >> as follows:
>> > >  >> [ ] +1, Approve the release
>> > >  >> [ ] -1, Do not approve the release (please provide specific
>> > comments)
>> > >  >>

[RESULT] [VOTE] flink-connector-aws v4.0.0, release candidate #1

2022-12-09 Thread Danny Cranmer
I'm happy to announce that we have unanimously approved this release.

There are 4 approving votes, 3 of which are binding:
* Hong (non-binding)
* Chesnay (binding)
* Martijn (binding)
* Danny (binding)

There are no disapproving votes.

Thanks everyone!
Danny


Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

2022-12-09 Thread Hang Ruan
Hi, Dong,

Thanks for your suggestion. I plan to rename this scope like this:

public static final ConfigOption<String> SCOPE_NAMING_OPERATOR_COORDINATOR =
        key("metrics.scope.operator-coordinator")
                .stringType()
                .defaultValue("<host>.jobmanager.<job_name>.<operator_name>.coordinator")
                .withDescription(
                        "Defines the scope format string that is applied to all"
                            + " metrics scoped to an operator coordinator on a JobManager.");
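
For example, assuming the registry substitutes the <host>, <job_name> and
<operator_name> variables as usual, a coordinator counter named
"numSplitsAssigned" (a hypothetical name) would then be reported under an
identifier along the lines of:

host-1.jobmanager.myJob.MyKafkaSource.coordinator.numSplitsAssigned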

Best,
Hang

Dong Lin wrote on Fri, Dec 9, 2022 at 22:34:

> Hi Hang,
>
> Thanks for the FLIP! The FLIP looks good and it is pretty informative.
>
> I have just two minor comments regarding names:
> - Would it be useful to rename the config key as
> *metrics.scope.jm.job.operator-coordinator* for consistency with
> *metrics.scope.jm.job* (which is not named as *jm-job*)?
> - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>
> Cheers,
> Dong
>
>
>
> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan  wrote:
>
> > Hi all,
> >
> > MengYue and I created FLIP-274[1] Introduce metric group for
> > OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> > operators and running on Job Manager. The coordination mechanism is
> > operator events between OperatorCoordinator and its all operators, the
> > coordination is more and more using in Flink, for example many Sources
> and
> > Sinks depend on the mechanism to assign splits and coordinate commits to
> > external systems. The OperatorCoordinator is widely using in flink kafka
> > connector, flink pulsar connector, flink cdc connector, flink hudi
> > connector and so on.
> >
> > But there is not a suitable metric group scope for the
> OperatorCoordinator
> > and not an implementation for the interface
> OperatorCoordinatorMetricGroup.
> > These metrics in OperatorCoordinator could be how many splits/partitions
> > have been assigned to source readers, how many files have been written
> out
> > by sink writers, these metrics not only help users to know the job
> progress
> > but also make big job maintaining easier. Thus we propose the FLIP-274 to
> > introduce a new metric group scope for OperatorCoordinator and provide an
> > internal implementation for OperatorCoordinatorMetricGroup.
> >
> > Could you help review this FLIP when you get time? Any feedback is
> > appreciated!
> >
> > Best,
> > Hang
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >
>


Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

2022-12-09 Thread Hang Ruan
Hi, Chesnay,

Thanks for your reply.

Actually we cannot reuse the Task/OperatorMGs for the OperatorCoordinator.
There are mainly two reasons.
First of all, the scopes of these metric groups are not suitable for the
OperatorCoordinator; it should be
"<host>.jobmanager.<job_name>.<operator_name>.coordinator".
Secondly, there are some metrics that we cannot compute from the subtasks.
For example, in the flink cdc connectors, we could report how many tables
are pending in the OperatorCoordinator, but this information is not
available in its subtasks.
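
A minimal sketch of such a coordinator-only gauge, assuming the coordinator
can obtain its OperatorCoordinatorMetricGroup from its context (the class
and wiring are illustrative):

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.groups.OperatorCoordinatorMetricGroup;

// The pending-table bookkeeping lives only in the coordinator/enumerator,
// so no single subtask could report this value.
class TableBacklogMetrics {
    private final AtomicInteger pendingTables = new AtomicInteger();

    TableBacklogMetrics(OperatorCoordinatorMetricGroup metricGroup) {
        metricGroup.gauge("pendingTables", (Gauge<Integer>) pendingTables::get);
    }

    void tableDiscovered() { pendingTables.incrementAndGet(); }
    void tableFinished() { pendingTables.decrementAndGet(); }
}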

We try to add some common metrics to the OperatorCoordinatorMetricGroup for
all OperatorCoordinator implementations. In fact, it should be discussed
whether these common metrics are necessary.

Best,
Hang

Chesnay Schepler wrote on Sat, Dec 10, 2022 at 01:33:

> As a whole I feel like this FLIP is overly complicated. A dedicated
> coordinator MG implementation is overkill; it could just re-use the
> existing Task/OperatorMGs to create the same structure we have on TMs,
> similar to what we did with the Job MG.
>
> However, I'm not convinced that this is required anyway, because all the
> example metrics you listed can be implemented on the TM side +
> aggregating them in the external metrics backend.
>
> Since I'm on holidays soon, just so no one tries to pull a fast one on
> me, if this were to go to a vote as-is I'd be against it.
>
>
> On 09/12/2022 15:30, Dong Lin wrote:
> > Hi Hang,
> >
> > Thanks for the FLIP! The FLIP looks good and it is pretty informative.
> >
> > I have just two minor comments regarding names:
> > - Would it be useful to rename the config key as
> > *metrics.scope.jm.job.operator-coordinator* for consistency with
> > *metrics.scope.jm.job* (which is not named as *jm-job*)?
> > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >
> > Cheers,
> > Dong
> >
> >
> >
> > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan  wrote:
> >
> >> Hi all,
> >>
> >> MengYue and I created FLIP-274[1] Introduce metric group for
> >> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> >> operators and running on Job Manager. The coordination mechanism is
> >> operator events between OperatorCoordinator and its all operators, the
> >> coordination is more and more using in Flink, for example many Sources
> and
> >> Sinks depend on the mechanism to assign splits and coordinate commits to
> >> external systems. The OperatorCoordinator is widely using in flink kafka
> >> connector, flink pulsar connector, flink cdc connector, flink hudi
> >> connector and so on.
> >>
> >> But there is not a suitable metric group scope for the
> OperatorCoordinator
> >> and not an implementation for the interface
> OperatorCoordinatorMetricGroup.
> >> These metrics in OperatorCoordinator could be how many splits/partitions
> >> have been assigned to source readers, how many files have been written
> out
> >> by sink writers, these metrics not only help users to know the job
> progress
> >> but also make big job maintaining easier. Thus we propose the FLIP-274
> to
> >> introduce a new metric group scope for OperatorCoordinator and provide
> an
> >> internal implementation for OperatorCoordinatorMetricGroup.
> >>
> >> Could you help review this FLIP when you get time? Any feedback is
> >> appreciated!
> >>
> >> Best,
> >> Hang
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>
>
>


Re: [DISCUSS] Cleaning up HighAvailabilityServices interface to reflect the per-JM-process LeaderElection

2022-12-09 Thread Dong Lin
Hi Chesnay,

I like the use-cases (e.g. running multiple UIs for load-balancing
purposes) you mentioned. On the other hand, these are probably not
high-priority features, and we don't know when the community will get to
implement them. It seems a bit of over-design to add implementation
complexity for something that we may never need.

And regarding the effort to add back the per-component election
capability: given that the implementation already follows per-process
election, and given that there will likely be a lot of extra
design/implementation/test effort needed to achieve the use-cases described
above, maybe the change proposed in this thread won't affect the overall
effort much?

I am hoping that by making the Flink codebase simpler and more readable, we
can increase developer velocity and reduce the time needed to tackle bugs
such as FLINK-24038. Then we would have more time to actually implement the
fancy use-cases described above :) What do you think?


On Sat, Dec 10, 2022 at 1:31 AM Chesnay Schepler  wrote:

> I generally agree that the internals of the HA services are currently
> too complex, but I'm wondering if the proposal doesn't go a bit too far
> to resolve those.
> Is there maybe some way we can refactor things internally to reduce
> complexity while keeping the per-component semantics?
>
> Ultimately, the per-component leader election gives us the theoretical
> ability to split components into separate processes, which is also
> something we strive to maintain in other layers like the RPC system.
>
> That's a powerful property to have, which is also quite difficult to
> patch back in once you get rid of it.


> Of note, whenever a discussion came up about the scalability of the JM
> process, the first answer has _always_ been "well, we could split it up at
> some point if necessary".
>
> > I am curious whether there really are so many such extreme requirements
> > that we have to rely on the per-component pattern to achieve them?
>
> This doesn't necessarily go into the _extreme_ direction. It could be
> something as simple as running the UI in an environment that is more
> accessible than the other processes, running multiple UIs for
> load-balancing purposes without paying the additional memory tax of a
> full JM, or the Dispatcher process not running any user-code (== some
> isolation between jobs).

> The original FLIP-6 design had ideas to that end, and they aren't really
> bad ideas. We just never executed them.


>  > users may inadvertently recreate problems similar to FLINK-24038
>
> That's certainly a risk, but the per-process leader election was just
> one possible solution, which also had other benefits at the time.
>
>
>
> Right now I unfortunately can't provide specific ideas on how we could
> clean things up internally; that'd take some time that I won't have
> until next year.
>
> On 09/12/2022 16:41, weijie guo wrote:
> > Hi Matthias,
> >
> > Thanks for the proposal! I am in favor of cleaning up this interface, as
> > it seems a bit cumbersome now, especially since the implementation of
> > per-component leader election has been removed from our current code
> > path.
> >
> > To be honest, I don't like the per-component approach; I'm even often
> > asked why Flink does it this way. Of course, I admit that it makes our HA
> > service more flexible. But personally I think the per-process solution is
> > better, at least from the perspective of reducing potential problems
> > like FLINK-24038, and it can definitely reduce the complexity of the
> > JobManager.
> >
> > Regarding "We lose some flexibility in terms of per-component
> > LeaderElection ", I am curious that there are so many such extreme
> > requirements that we have to rely on the per-component pattern to achieve
> > them? If there are, is this requirement really reasonable, and users may
> > inadvertently recreate problems similar to FLINK-24038.
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Matthias Pohl wrote on Fri, Dec 9, 2022 at 17:47:
> >
> >> Hi Dong,
> >> see my answers below.
> >>
> >>> Regarding "Interface change might affect other projects that customize
> >>> HA services", are you referring to those projects which hack into
> >>> Flink's source code (as opposed to using Flink's public API) to
> >>> customize HA services?
> >>
> >> Yes, the proposed change might affect projects that need to have their
> >> own HA implementation for whatever reason (interface change) or if a
> >> project accesses the HA backend to retrieve metadata from the ZK node/k8s
> >> ConfigMap (change about how the data is stored in the HA backend). The
> >> latter one was actually already the case with the change introduced by
> >> FLINK-24038 [1].
> >>
> >>> By the way, since Flink already supports zookeeper and kubernetes as
> >>> the high availability services, are you aware of many projects that
> >>> still need to hack into Flink's code to customize high availability
> >>> services?
> >>
> >> I am aware of projects that use customized HA. But based on our
> >> experience in FLINK-24038 [1] no one complained. So, making people aware
> >> through the mailing list might be good enough.

Re: [DISCUSS] Cleaning up HighAvailabilityServices interface to reflect the per-JM-process LeaderElection

2022-12-09 Thread Dong Lin
Hi Matthias,

Thanks for the explanation. I was trying to understand the concrete
user-facing benefits of preserving the flexibility of per-component leader
election. Now I get that maybe they want to scale those components
independently, and maybe run the UI in an environment that is more accessible
than the other processes.

I replied to Chesnay's email regarding whether it is worthwhile to keep the
existing interface for those potential but not-yet-realized benefits.

Thanks,
Dong

On Fri, Dec 9, 2022 at 5:47 PM Matthias Pohl 
wrote:

> Hi Dong,
> see my answers below.
>
> Regarding "Interface change might affect other projects that customize HA
> > services", are you referring to those projects which hack into Flink's
> > source code (as opposed to using Flink's public API) to customize HA
> > services?
>
>
> Yes, the proposed change might affect projects that need to have their own
> HA implementation for whatever reason (interface change) or if a project
> accesses the HA backend to retrieve metadata from the ZK node/k8s ConfigMap
> (change about how the data is stored in the HA backend). The latter one was
> actually already the case with the change introduced by FLINK-24038 [1].
>
> > By the way, since Flink already supports zookeeper and kubernetes as the
> > high availability services, are you aware of many projects that still
> > need to hack into Flink's code to customize high availability services?
>
>
> I am aware of projects that use customized HA. But based on our experience
> in FLINK-24038 [1] no one complained. So, making people aware through the
> mailing list might be good enough.
>
> > And regarding "We lose some flexibility in terms of per-component
> > LeaderElection", could you explain what flexibility we need so that we
> > can gauge the associated downside of losing the flexibility?
>
>
> Just to recap: The current interface allows having per-component
> LeaderElection (e.g. the ResourceManager leader can run on a different
> JobManager than the Dispatcher). This implementation was replaced by
> FLINK-24038 [1] and removed in FLINK-25806 [2]. The new implementation does
> LeaderElection per process (e.g. ResourceManager and Dispatcher always run
> on the same JobManager). The changed interface would require us to touch
> the interface again if (for whatever reason) we want to reintroduce
> per-component leader election in some form.
> The interface change is, strictly speaking, not necessary to provide the
> new functionality. But I like the idea of certain requirements (currently,
> we need per-process leader election to fix what was reported in FLINK-24038
> [1]) being reflected in the interface. This makes sure that we don't
> introduce a per-component leader election again accidentally in the future
> because we thought it's a good idea but forgot about FLINK-24038.
>
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-24038
> [2] https://issues.apache.org/jira/browse/FLINK-25806
>
> On Fri, Dec 9, 2022 at 2:09 AM Dong Lin  wrote:
>
> > Hi Matthias,
> >
> > Thanks for the proposal! Overall I am in favor of making this interface
> > change to make Flink's codebase more maintainable.
> >
> > Regarding "Interface change might affect other projects that customize HA
> > services", are you referring to those projects which hack into Flink's
> > source code (as opposed to using Flink's public API) to customize HA
> > services? If yes, it seems OK to break those projects since we don't have
> > any backward compatibility guarantee for those projects.
> >
> > By the way, since Flink already supports zookeeper and kubernetes as the
> > high availability services, are you aware of many projects that still
> > need to hack into Flink's code to customize high availability services?
> >
> > And regarding "We lose some flexibility in terms of per-component
> > LeaderElection", could you explain what flexibility we need so that we
> can
> > gauge the associated downside of losing the flexibility?
> >
> > Thanks!
> > Dong
> >
> >
> >
> > On Wed, Dec 7, 2022 at 4:28 PM Matthias Pohl wrote:
> >
> > > Hi everyone,
> > >
> > > The Flink community introduced a new way how leader election works in
> > Flink
> > > 1.15 with FLINK-24038 [1]. Instead of a per-component leader election,
> > all
> > > components (i.e. ResourceManager, Dispatcher, REST server, JobMaster)
> > use a
> > > single (per-JM-process) leader election instance. It was meant to fix
> > some
> > > issues with deregistering Flink applications in multi-JM setups [1] and
> > > reduce load on the HA backend. Users were able to opt-out and switch
> back
> > > to the old implementation [2].
> > >
> > > The new approach was kind of complicated to implement while still
> > > maintaining support for the old implementation through the existing
> > > interfaces. With FLINK-25806 [3], the old implementation was removed in
> > > Flink 1.16. This enables us to clean things up in the
> > > HighAvailabilityServices.

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

2022-12-09 Thread Dong Lin
Hi Chesnay,

Just to double check with you: OperatorCoordinatorMetricGroup (annotated as
@PublicEvolving) has already been introduced into Flink by FLIP-179.
And that FLIP got your +1. Do you mean we should remove this
OperatorCoordinatorMetricGroup?

Regards,
Dong

On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler  wrote:

> As a whole I feel like this FLIP is overly complicated. A dedicated
> coordinator MG implementation is overkill; it could just re-use the
> existing Task/OperatorMGs to create the same structure we have on TMs,
> similar to what we did with the Job MG.
>
> However, I'm not convinced that this is required anyway, because all the
> example metrics you listed can be implemented on the TM side +
> aggregating them in the external metrics backend.
>
> Since I'm on holidays soon, just so no one tries to pull a fast one on
> me, if this were to go to a vote as-is I'd be against it.
>
>
> On 09/12/2022 15:30, Dong Lin wrote:
> > Hi Hang,
> >
> > Thanks for the FLIP! The FLIP looks good and it is pretty informative.
> >
> > I have just two minor comments regarding names:
> > - Would it be useful to rename the config key as
> > *metrics.scope.jm.job.operator-coordinator* for consistency with
> > *metrics.scope.jm.job* (which is not named as *jm-job*)?
> > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >
> > Cheers,
> > Dong
> >
> >
> >
> > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan  wrote:
> >
> >> Hi all,
> >>
> >> MengYue and I created FLIP-274[1] Introduce metric group for
> >> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> >> operators and running on Job Manager. The coordination mechanism is
> >> operator events between OperatorCoordinator and its all operators, the
> >> coordination is more and more using in Flink, for example many Sources
> and
> >> Sinks depend on the mechanism to assign splits and coordinate commits to
> >> external systems. The OperatorCoordinator is widely using in flink kafka
> >> connector, flink pulsar connector, flink cdc connector, flink hudi
> >> connector and so on.
> >>
> >> But there is not a suitable metric group scope for the
> OperatorCoordinator
> >> and not an implementation for the interface
> OperatorCoordinatorMetricGroup.
> >> These metrics in OperatorCoordinator could be how many splits/partitions
> >> have been assigned to source readers, how many files have been written
> out
> >> by sink writers, these metrics not only help users to know the job
> progress
> >> but also make big job maintaining easier. Thus we propose the FLIP-274
> to
> >> introduce a new metric group scope for OperatorCoordinator and provide
> an
> >> internal implementation for OperatorCoordinatorMetricGroup.
> >>
> >> Could you help review this FLIP when you get time? Any feedback is
> >> appreciated!
> >>
> >> Best,
> >> Hang
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>
>
>


[jira] [Created] (FLINK-30357) Wrong link in connector/jdbc doc.

2022-12-09 Thread Aiden Gong (Jira)
Aiden Gong created FLINK-30357:
--

 Summary: Wrong link in connector/jdbc doc.
 Key: FLINK-30357
 URL: https://issues.apache.org/jira/browse/FLINK-30357
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.15.2, 1.16.0
Reporter: Aiden Gong
 Attachments: image-2022-12-10-15-40-50-043.png, 
image-2022-12-10-15-40-50-117.png




--
This message was sent by Atlassian Jira
(v8.20.10#820010)