Re: [VOTE] FLIP-400: AsyncScalarFunction for asynchronous scalar function support

2023-12-28 Thread Timo Walther
+1 (binding)

Cheers,
Timo

> On 28 Dec 2023, at 03:13, Yuepeng Pan wrote:
> 
> +1 (non-binding).
> 
> Best,
> Yuepeng Pan.
> 
> 
> 
> 
> At 2023-12-28 09:19:37, "Lincoln Lee"  wrote:
>> +1 (binding)
>> 
>> Best,
>> Lincoln Lee
>> 
>> 
>> Martijn Visser  wrote on Wednesday, December 27, 2023 at 23:16:
>> 
>>> +1 (binding)
>>> 
>>> On Fri, Dec 22, 2023 at 1:44 AM Jim Hughes 
>>> wrote:
 
 Hi Alan,
 
 +1 (non binding)
 
 Cheers,
 
 Jim
 
 On Wed, Dec 20, 2023 at 2:41 PM Alan Sheinberg
  wrote:
 
> Hi everyone,
> 
> I'd like to start a vote on FLIP-400 [1]. It covers introducing a new UDF
> type, AsyncScalarFunction, for completing invocations asynchronously. It
> has been discussed in this thread [2].
> 
> The vote will be open for at least 72 hours
> (until December 28th 18:00 GMT) unless there is an objection or
> insufficient votes.
> 
> [1]
> 
> 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-400%3A+AsyncScalarFunction+for+asynchronous+scalar+function+support
> [2] https://lists.apache.org/thread/q3st6t1w05grd7bthzfjtr4r54fv4tm2
> 
> Thanks,
> Alan
> 
>>> 
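For readers who haven't opened the FLIP: the proposed UDF type completes a future instead of returning a value, so a single operator can have many invocations in flight. A minimal sketch, assuming the API lands roughly as FLIP-400 describes (the client interface below is hypothetical):

{code:java}
import java.util.concurrent.CompletableFuture;

import org.apache.flink.table.functions.AsyncScalarFunction;

// Sketch only: the eval(CompletableFuture, ...) shape follows FLIP-400;
// the final API may differ in details.
public class AsyncLookupFunction extends AsyncScalarFunction {

    // Hypothetical non-blocking client, standing in for any async I/O.
    interface AsyncClient {
        CompletableFuture<String> lookup(String key);
    }

    private transient AsyncClient client;

    // The result is delivered by completing the future, not by returning.
    public void eval(CompletableFuture<String> result, String key) {
        client.lookup(key).whenComplete((value, error) -> {
            if (error != null) {
                result.completeExceptionally(error);
            } else {
                result.complete(value);
            }
        });
    }
}
{code}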



[ANNOUNCE] Apache Flink Pulsar Connector 4.1.0 released

2023-12-28 Thread Leonard Xu
The Apache Flink community is very happy to announce the release of Apache
Flink Pulsar Connector 4.1.0, which is compatible with the Apache Flink 1.17.x
and 1.18.x release series.

Apache Flink® is an open-source stream processing framework for distributed,
high-performing, always-available, and accurate data streaming applications.

The release is available for download at:
https://flink.apache.org/downloads.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353431

We would like to thank all contributors of the Apache Flink community who made 
this release possible!

Best,
Leonard

[jira] [Created] (FLINK-33955) UnsupportedFileSystemException when trying to save data to Azure's abfss File System

2023-12-28 Thread Jira
Alek Łańduch created FLINK-33955:


 Summary: UnsupportedFileSystemException when trying to save data 
to Azure's abfss File System
 Key: FLINK-33955
 URL: https://issues.apache.org/jira/browse/FLINK-33955
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.17.1, 1.18.0
 Environment: Flink 1.17.1 & Flink 1.18.0 with Java 11, ADLS Gen.2 with 
hierarchical namespace enabled
Reporter: Alek Łańduch
 Attachments: error.log, pom.xml, success.log

When using Azure's file system connector to read and write files on Azure Data
Lake Storage Gen2, the Flink job fails when writing files with the following error:

 
{noformat}
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: 
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme 
"file"{noformat}
 

Full logs from the JobManager, along with the stack trace, are attached as the
[^error.log] file.

The connection itself seems to be good, as the job successfully creates the
desired structure inside ADLS (including the `.part` file), but the file itself is empty.

The job is simple; its only purpose is to save the events `a`, `b`, and `c` into
a file on ADLS. The whole code is presented below:
{code:java}
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class DataStreamJob {
  public static void main(String[] args) throws Exception {
    final FileSink<String> sink = FileSink
        .<String>forRowFormat(
            new Path("abfss://t...@stads2dev01.dfs.core.windows.net/output"),
            new SimpleStringEncoder<>("UTF-8"))
        .build();
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.fromElements("a", "b", "c").sinkTo(sink);
    env.execute("Test");
  }
}
{code}
The code is run locally using Flink 1.18.0 (the same behavior was present in
version 1.17.1). The only change made to `flink-conf.yaml` was to add the
keys for accessing Azure:

 
{code:java}
fs.azure.account.auth.type.stads2dev01.dfs.core.windows.net: SharedKey
fs.azure.account.key.stads2dev01.dfs.core.windows.net: **{code}
 

The [^pom.xml] file was created by following the [Getting
Started|https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/configuration/overview/#getting-started]
documentation; the only thing I added was the `flink-azure-fs-hadoop` connector.
The whole [^pom.xml] file is attached. The connector JAR was also copied from
the `opt` directory to `plugins/azure-fs-hadoop` in the cluster files, according
to the documentation.

The interesting fact is that the deprecated `writeAsText` method (instead of
FileSink) not only works and creates the desired file on ADLS, but *subsequent
jobs that use FileSink, which previously failed, now work and create files
successfully* (until the cluster is restarted). The logs from the job with the
deprecated method are also attached here as the [^success.log] file.

I suspect that this is somehow connected to how the Azure file system is
initialized, and that the new FileSink path initializes it incorrectly.
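If that suspicion is right, one workaround worth trying (untested; the plugin-manager call is an assumption about Flink's internal API) is to initialize the file systems explicitly from the cluster configuration before building the job:

{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.plugin.PluginUtils;

// Untested sketch: force file-system initialization (including plugins such
// as flink-azure-fs-hadoop) before any sink resolves the "abfss" scheme.
Configuration conf = GlobalConfiguration.loadConfiguration();
FileSystem.initialize(conf, PluginUtils.createPluginManagerFromRootFolder(conf));
{code}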



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-402: Extend ZooKeeper Curator configurations

2023-12-28 Thread Martijn Visser
Hi Oleksandr,

The FLIP talks about Curator, but outside of the Flink test utils, the
usage of Curator is only for test purposes. I don't think there's
anything preventing you right now from providing these additional
parameters as values in flink-conf.yaml?

Best regards,

Martijn

On Thu, Dec 14, 2023 at 2:21 PM Alex Nitavsky  wrote:
>
> Hi all,
>
> I would like to start a discussion thread for: *FLIP-402: Extend ZooKeeper
> Curator configurations *[1]
>
> * Problem statement *
> Currently, Flink is missing several Apache Curator configuration options that
> could be useful for Flink deployments with ZooKeeper as the HA provider.
>
> * Proposed solution *
> We have inspected all available options for Apache Curator and propose
> those that could be valuable for Flink users:
>
> - high-availability.zookeeper.client.authorization [2]
> - high-availability.zookeeper.client.maxCloseWaitMs [3]
> - high-availability.zookeeper.client.simulatedSessionExpirationPercent [4]
>
> The proposal is to expose those properties as Flink configuration
> options for Apache ZooKeeper.
>
> Looking forward to your feedback and suggestions.
>
> Kind regards
> Oleksandr
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-402%3A+Extend+ZooKeeper+Curator+configurations
> [2]
> https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#authorization(java.lang.String,byte%5B%5D)
> [3]
> https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#maxCloseWaitMs(int)
> [4]
> https://curator.apache.org/apidocs/org/apache/curator/framework/CuratorFrameworkFactory.Builder.html#simulatedSessionExpirationPercent(int)
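For reference, since the proposal maps these Curator settings onto regular Flink options, a flink-conf.yaml using them would presumably look as follows (keys per the FLIP; all values purely illustrative):

{noformat}
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
# Proposed additions from FLIP-402 (illustrative values):
high-availability.zookeeper.client.authorization: digest:flink:secret
high-availability.zookeeper.client.maxCloseWaitMs: 2000
high-availability.zookeeper.client.simulatedSessionExpirationPercent: 100
{noformat}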


[DISCUSS] Hadoop 2 vs Hadoop 3 usage

2023-12-28 Thread Martijn Visser
Hi all,

I want to get some insight into how many users are still using Hadoop 2
versus how many are using Hadoop 3. Flink currently requires a
minimum version of Hadoop 2.10.2 for certain features, but also
extensively uses Hadoop 3 (e.g., for the file system implementations).
Hadoop 2 has a large number of direct and indirect vulnerabilities
[1]. Most of them can only be resolved by dropping support for Hadoop
2 and upgrading to a Hadoop 3 version. This thread is primarily to get
more insight into whether Hadoop 2 is still commonly used, or whether we can
actually discuss dropping support for Hadoop 2 in Flink.

Best regards,

Martijn

[1] https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/2.10.2


Re: [NOTICE] Hive connector externalization

2023-12-28 Thread Martijn Visser
Hi Sergey,

Is the next step that we need to generate a release of the
externalized code? Did someone already volunteer for that?

Best regards,

Martijn

On Mon, Dec 11, 2023 at 3:00 AM yuxia  wrote:
>
> Thanks Sergey for the work. Happy to see we can finally externalize the Hive
> connector.
>
> Best regards,
> Yuxia
>
> - Original Message -
> From: "snuyanzin" 
> To: "dev" 
> Sent: Saturday, December 9, 2023, 6:24:35 AM
> Subject: [NOTICE] Hive connector externalization
>
> Hi everyone
>
> We are getting close to the externalization of the Hive connector [1].
> Since the externalized version is already passing tests against
> release-1.18 and release-1.19, I'm going to remove the Hive connector code
> from the Flink main repo [2]. For that reason, I would kindly ask you to avoid
> merging Hive-connector-related changes to the Flink main repo (master
> branch) in order to make this smoother. Instead, please create/merge PRs
> against the connector's repo [3].
>
> Also, a huge shoutout to Yuxia Luo, Martijn Visser, and Ryan Skraba for the reviews.
>
> [1] https://issues.apache.org/jira/browse/FLINK-30064
> [2] https://issues.apache.org/jira/browse/FLINK-33786
> [3] https://github.com/apache/flink-connector-hive
>
> --
> Best regards,
> Sergey


Re: [VOTE] Release flink-shaded 18.0, release candidate #1

2023-12-28 Thread Martijn Visser
Hi,

+1 (binding)

- Validated hashes
- Verified signature
- Verified that no binaries exist in the source archive
- Build the source with Maven
- Verified licenses
- Verified web PRs

Best regards,

Martijn

On Mon, Dec 11, 2023 at 12:11 AM Sergey Nuyanzin  wrote:
>
> Hey everyone,
>
> The vote for flink-shaded 18.0 is still open. Please test and vote for
> rc1, so that we can release it.
>
> On Thu, Nov 30, 2023 at 4:03 PM Jing Ge  wrote:
>
> > +1(not binding)
> >
> > - validate checksum
> > - validate hash
> > - checked the release notes
> > - verified that no binaries exist in the source archive
> > - build the source with Maven 3.8.6 and jdk11
> > - checked repo
> > - checked tag
> > - verified web PR
> >
> > Best regards,
> > Jing
> >
> > On Thu, Nov 30, 2023 at 11:39 AM Sergey Nuyanzin 
> > wrote:
> >
> > > +1 (non-binding)
> > >
> > > - Downloaded all the resources
> > > - Validated checksum hash
> > > - Build the source with Maven and jdk8
> > > - Build Flink master with new flink-shaded and check that all the tests
> > are
> > > passing
> > >
> > > One minor thing I noticed while releasing: CI uses Maven 3.8.6, but at the
> > > same time the release profile has an enforcer plugin that checks that the
> > > Maven version is less than 3.3. I created a Jira issue [1] for that.
> > > I made the release with Maven 3.2.5 (I suppose the previous release was
> > > also done with 3.2.5 because of the same issue).
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-33703
> > >
> > > On Wed, Nov 29, 2023 at 11:41 AM Matthias Pohl 
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > * Downloaded all resources
> > > > * Extracted the sources and compiled them
> > > > * Diffed the git tag checkout against the downloaded sources
> > > > * Verified SHA512 checksums & GPG certification
> > > > * Checked that all POMs have the expected version
> > > > * Generated diffs to compare pom file changes with NOTICE files: nothing
> > > > suspicious found except for a minor (non-blocking) typo [1]
> > > >
> > > > Thanks for driving this effort, Sergey. :)
> > > >
> > > > [1] https://github.com/apache/flink-shaded/pull/126/files#r1409080162
> > > >
> > > > On Wed, Nov 29, 2023 at 10:25 AM Rui Fan <1996fan...@gmail.com> wrote:
> > > >
> > > >> Sorry, it's non-binding.
> > > >>
> > > >> On Wed, Nov 29, 2023 at 5:19 PM Rui Fan <1996fan...@gmail.com> wrote:
> > > >>
> > > >> > Thanks Matthias for the clarification!
> > > >> >
> > > >> > After I import the latest KEYS, it works fine.
> > > >> >
> > > >> > +1(binding)
> > > >> >
> > > >> > - Validated checksum hash
> > > >> > - Verified signature
> > > >> > - Verified that no binaries exist in the source archive
> > > >> > - Build the source with Maven and jdk8
> > > >> > - Verified licenses
> > > >> > - Verified web PRs, and left a comment
> > > >> >
> > > >> > Best,
> > > >> > Rui
> > > >> >
> > > >> > On Wed, Nov 29, 2023 at 5:05 PM Matthias Pohl
> > > >> >  wrote:
> > > >> >
> > > >> >> The key is the last key in the KEYS file. It just has a different
> > > >> >> format, with spaces added (due to different gpg versions?):
> > > >> >> F752 9FAE 2481 1A5C 0DF3  CA74 1596 BBF0 7268 35D8
> > > >> >>
> > > >> >> On Wed, Nov 29, 2023 at 9:41 AM Rui Fan <1996fan...@gmail.com>
> > > wrote:
> > > >> >>
> > > >> >> > Hey Sergey,
> > > >> >> >
> > > >> >> > Thank you for driving this release.
> > > >> >> >
> > > >> >> > I tried to check this signature; the whole key is
> > > >> >> > F7529FAE24811A5C0DF3CA741596BBF0726835D8,
> > > >> >> > which matches your 1596BBF0726835D8, but I cannot
> > > >> >> > find it in the Flink KEYS [1].
> > > >> >> >
> > > >> >> > Please correct me if my operation is wrong, thanks~
> > > >> >> >
> > > >> >> > [1] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > >> >> >
> > > >> >> > Best,
> > > >> >> > Rui
> > > >> >> >
> > > >> >> >
> > > >> >> > On Wed, Nov 29, 2023 at 6:09 AM Sergey Nuyanzin <
> > > snuyan...@gmail.com
> > > >> >
> > > >> >> > wrote:
> > > >> >> >
> > > >> >> > > Hi everyone,
> > > >> >> > > Please review and vote on the release candidate #1 for the
> > > version
> > > >> >> 18.0,
> > > >> >> > as
> > > >> >> > > follows:
> > > >> >> > > [ ] +1, Approve the release
> > > >> >> > > [ ] -1, Do not approve the release (please provide specific
> > > >> comments)
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > The complete staging area is available for your review, which
> > > >> >> includes:
> > > >> >> > > * JIRA release notes [1],
> > > >> >> > > * the official Apache source release to be deployed to
> > > >> >> dist.apache.org
> > > >> >> > > [2],
> > > >> >> > > which are signed with the key with fingerprint 1596BBF0726835D8
> > > >> [3],
> > > >> >> > > * all artifacts to be deployed to the Maven Central Repository
> > > [4],
> > > >> >> > > * source code tag "release-18.0-rc1" [5],
> > > >> >> > > * website pull request listing the new release [6].
> > > >> >> >

Re: [DISCUSS] FLIP-399: Flink Connector Doris

2023-12-28 Thread Martijn Visser
+1 for this :)

On Thu, Dec 7, 2023 at 6:05 AM wudi <676366...@qq.com.invalid> wrote:
>
>
> Hi all,
>
> As discussed in the previous email [1], this is about contributing the Flink
> Doris Connector to the Flink community.
>
>
> Apache Doris [2] is a high-performance, real-time analytical database based on
> an MPP architecture. For scenarios where Flink is used for data analysis,
> processing, or real-time writing to Doris, the Flink Doris Connector is an
> effective tool.
>
> At the same time, contributing the Flink Doris Connector to the Flink community
> will further expand the Flink connector ecosystem.
>
> So I would like to start an official discussion of FLIP-399: Flink Connector
> Doris [3].
>
> Looking forward to comments, feedback, and suggestions from the community on
> the proposal.
>
> [1] https://lists.apache.org/thread/lvh8g9o6qj8bt3oh60q81z0o1cv3nn8p
> [2] https://doris.apache.org/docs/dev/get-starting/what-is-apache-doris/
> [3] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-399%3A+Flink+Connector+Doris
>
>
> Brs,
>
> di.wu


Re: [VOTE] Release 1.18.1, release candidate #2

2023-12-28 Thread Yun Tang
Thanks Jing for driving this release.

+1 (non-binding)


  * Downloaded artifacts and verified the signatures.
  * Verified the web PR.
  * Verified the number of Python packages is 11.
  * Started a local cluster and verified FLIP-291 to see the rescale results.
  * Verified the jar packages were built with JDK8.

Best
Yun Tang



From: Rui Fan <1996fan...@gmail.com>
Sent: Thursday, December 28, 2023 10:54
To: dev@flink.apache.org 
Subject: Re: [VOTE] Release 1.18.1, release candidate #2

Thanks Jing for driving this release!

+1(non-binding)

- Downloaded artifacts
- Verified signatures and sha512
- The source archives do not contain any binaries
- Verified web PR
- Build the source with Maven 3 and java8 (Checked the license as well)
- bin/start-cluster.sh with java8: it works fine with no unexpected logs
- Ran the demo, and it's fine: bin/flink run examples/streaming/StateMachineExample.jar

Best,
Rui

On Wed, Dec 27, 2023 at 8:45 PM Martijn Visser 
wrote:

> Hi Jing,
>
> Thanks for driving this.
>
> +1 (binding)
>
> - Validated hashes
> - Verified signature
> - Verified that no binaries exist in the source archive
> - Build the source with Maven via mvn clean install
> -Pcheck-convergence -Dflink.version=1.18.1
> - Verified licenses
> - Verified web PR
> - Started a cluster and the Flink SQL client, successfully read and
> wrote with the Kafka connector to Confluent Cloud with AVRO and Schema
> Registry enabled
> - Started a cluster and submitted a job that checkpoints to GCS without
> problems
>
> Best regards,
>
> Martijn
>
> On Thu, Dec 21, 2023 at 4:55 AM gongzhongqiang
>  wrote:
> >
> > Thanks Jing Ge for driving this release.
> >
> > +1 (non-binding), I have checked:
> > [✓] The checksums and signatures are validated
> > [✓] The tag checked is fine
> > [✓] Built from source is passed
> > [✓] The flink-web PR is reviewed and checked
> >
> >
> > Best,
> > Zhongqiang Gong
>


[jira] [Created] (FLINK-33956) Bump org.apache.zookeeper:zookeeper from 3.7.1 to 3.7.2

2023-12-28 Thread Martijn Visser (Jira)
Martijn Visser created FLINK-33956:
--

 Summary: Bump org.apache.zookeeper:zookeeper from 3.7.1 to 3.7.2
 Key: FLINK-33956
 URL: https://issues.apache.org/jira/browse/FLINK-33956
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Coordination
Reporter: Martijn Visser
Assignee: Martijn Visser


Bumps org.apache.zookeeper:zookeeper from 3.7.1 to 3.7.2.

Merging this pull request will resolve a critical-severity [Dependabot
alert|https://github.com/apache/flink/security/dependabot/184] on
org.apache.zookeeper:zookeeper.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Building Flink JDBC connector locally on M1 Mac

2023-12-28 Thread Martijn Visser
Hi David,

I believe this is because Testcontainers relies on the log4j
configuration settings. GitHub CI runs something like mvn clean
install -Dlog4j.configurationFile=tools/ci/log4j.properties

Best regards,

Martijn

On Fri, Dec 8, 2023 at 5:30 PM David Radley  wrote:
>
> Hi,
> When I build the JDBC connector locally on my M1 Mac with mvn clean
> install, I get the following errors:
>
>
>  in (XaFacadeImpl.java:75)
>
> [ERROR] Errors:
>
> [ERROR]   DerbyExactlyOnceSinkE2eTest.testInsert » JobExecution Job execution 
> failed.
>
> [ERROR]   MySqlExactlyOnceSinkE2eTest » IllegalState Previous attempts to 
> find a Docker ...
>
> [ERROR]   OracleExactlyOnceSinkE2eTest » IllegalState Could not find a valid 
> Docker envi...
>
> [ERROR]   PostgresCatalogTest » IllegalState Could not find a valid Docker 
> environment. ...
>
> [ERROR]   JdbcCatalogFactoryTest » IllegalState Could not find a valid Docker 
> environmen...
>
> [ERROR]   PostgresExactlyOnceSinkE2eTest » IllegalState Could not find a 
> valid Docker en...
>
> [ERROR]   SqlServerExactlyOnceSinkE2eTest » IllegalState Previous attempts to 
> find a Doc...
>
> [INFO]
>
> [ERROR] Tests run: 280, Failures: 1, Errors: 7, Skipped: 1
>
> An example of an individual failure is:
>
> Test set: 
> org.apache.flink.connector.jdbc.databases.mysql.xa.MySqlExactlyOnceSinkE2eTest
> ---
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.006 s <<< 
> FAILURE! - in 
> org.apache.flink.connector.jdbc.databases.mysql.xa.MySqlExactlyOnceSinkE2eTest
> org.apache.flink.connector.jdbc.databases.mysql.xa.MySqlExactlyOnceSinkE2eTest
>   Time elapsed: 0.006 s  <<< ERROR!
> java.lang.IllegalStateException: Previous attempts to find a Docker 
> environment failed. Will not retry. Please see logs and check configuration
>at 
> org.testcontainers.dockerclient.DockerClientProviderStrategy.getFirstValidStrategy(DockerClientProviderStrategy.java:231)
>at 
> org.testcontainers.DockerClientFactory.getOrInitializeStrategy(DockerClientFactory.java:150)
>at 
> org.testcontainers.DockerClientFactory.client(DockerClientFactory.java:191)
>at 
> org.testcontainers.DockerClientFactory$1.getDockerClient(DockerClientFactory.java:104)
>at 
> com.github.dockerjava.api.DockerClientDelegate.authConfig(DockerClientDelegate.java:109)
>at 
> org.testcontainers.containers.GenericContainer.start(GenericContainer.java:321)
>
>
>
>
> Googling, I see a suggestion: if you're using a Mac, the problem may be
> solved by adding a link.
>
> Follow the command below:
>
> sudo ln -s $HOME/.docker/run/docker.sock /var/run/docker.sock
>
> But this does not work for me.
>
> I am running Rancher, so I do have a Docker environment. Any pointers?
>
> Kind regards, David.
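One thing worth trying with Rancher Desktop (an assumption on my part, not verified on this setup): point both the Docker client and Testcontainers at Rancher's socket via environment variables before running the build:

{noformat}
export DOCKER_HOST=unix://$HOME/.rd/docker.sock
export TESTCONTAINERS_DOCKER_SOCKET_OVERRIDE=/var/run/docker.sock
{noformat}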


Re: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios

2023-12-28 Thread Yangze Guo
Thanks for the response, Zhanghao.

PersistenceServices sounds good to me.

Best,
Yangze Guo

On Wed, Dec 27, 2023 at 11:30 AM Zhanghao Chen
 wrote:
>
> Thanks for driving this effort, Yangze! The proposal overall LGTM. Apart from
> the throughput enhancement in OLAP scenarios, separating the leader
> election/discovery services from the metadata persistence services will also
> make the HA implementation clearer and easier to maintain. Just a minor comment
> on naming: would it be better to rename PersistentServices to
> PersistenceServices, since we usually put a noun before Services?
>
> Best,
> Zhanghao Chen
> 
> From: Yangze Guo 
> Sent: Tuesday, December 19, 2023 17:33
> To: dev 
> Subject: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios
>
> Hi, there,
>
> We would like to start a discussion thread on "FLIP-403: High
> Availability Services for OLAP Scenarios"[1].
>
> Currently, Flink's high availability service consists of two
> mechanisms: leader election/retrieval services for JobManager and
> persistent services for job metadata. However, these mechanisms are
> set up in an "all or nothing" manner. In OLAP scenarios, we typically
> only require leader election/retrieval services for JobManager
> components since jobs usually do not have a restart strategy.
> Additionally, the persistence of job states can negatively impact the
> cluster's throughput, especially for short query jobs.
>
> To address these issues, this FLIP proposes splitting the
> HighAvailabilityServices into LeaderServices and PersistentServices,
> and enable users to independently configure the high availability
> strategies specifically related to jobs.
>
> Please find more details in the FLIP wiki document [1]. Looking
> forward to your feedback.
>
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-403+High+Availability+Services+for+OLAP+Scenarios
>
> Best,
> Yangze Guo
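To make the proposed split concrete, the decomposition discussed above might look roughly like the sketch below. The names follow this thread (LeaderServices, PersistenceServices); the method names and exact shape are illustrative, not taken from the FLIP:

{code:java}
// Sketch only: placeholder interfaces standing in for the FLIP's definitions.
interface LeaderServices { /* leader election/retrieval for JM components */ }

interface PersistenceServices { /* persistence of job metadata */ }

interface HighAvailabilityServices extends AutoCloseable {

    // Always needed, even when jobs have no restart strategy.
    LeaderServices getLeaderServices();

    // Could be a no-op implementation in OLAP scenarios, so that short
    // query jobs skip the metadata-persistence overhead.
    PersistenceServices getPersistenceServices();
}
{code}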


[jira] [Created] (FLINK-33957) Generic Log-Based Incremental Checkpoint fails when Task Local Recovery is enabled

2023-12-28 Thread Prabhu Joseph (Jira)
Prabhu Joseph created FLINK-33957:
-

 Summary: Generic Log-Based Incremental Checkpoint fails when Task 
Local Recovery is enabled
 Key: FLINK-33957
 URL: https://issues.apache.org/jira/browse/FLINK-33957
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.17.1
Reporter: Prabhu Joseph


Generic Log-Based Incremental Checkpoint fails when Task Local Recovery is 
enabled. The issue happened to one of our users. I am trying to reproduce the 
issue.
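For context, the combination in question can be enabled with the following configuration (key names as of Flink 1.17; shown only to describe the setup under suspicion):

{noformat}
state.backend.changelog.enabled: true
state.backend.local-recovery: true
{noformat}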

{noformat}
at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:298)
 ~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
 ~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
... 3 more
Caused by: org.apache.flink.util.SerializedThrowable: 
java.lang.IllegalStateException: one checkpoint contains at most one 
materializationID
at 
org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) 
~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
at 
org.apache.flink.runtime.state.ChangelogTaskLocalStateStore.updateReference(ChangelogTaskLocalStateStore.java:92)
 ~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
at 
org.apache.flink.runtime.state.ChangelogTaskLocalStateStore.storeLocalState(ChangelogTaskLocalStateStore.java:130)
 ~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
at 
org.apache.flink.runtime.state.TaskStateManagerImpl.reportTaskStateSnapshots(TaskStateManagerImpl.java:140)
 ~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.reportCompletedSnapshotStates(AsyncCheckpointRunnable.java:237)
 ~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:136)
 ~[flink-dist-1.17.1-amzn-1.jar:1.17.1-amzn-1]
... 3 more
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33958) Implement restore tests for IntervalJoin node

2023-12-28 Thread Bonnie Varghese (Jira)
Bonnie Varghese created FLINK-33958:
---

 Summary: Implement restore tests for IntervalJoin node
 Key: FLINK-33958
 URL: https://issues.apache.org/jira/browse/FLINK-33958
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Bonnie Varghese
Assignee: Bonnie Varghese






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] FLIP-406: Reorganize State & Checkpointing & Recovery Configuration

2023-12-28 Thread Xuannan Su
Hi Zakelly,

Thanks for driving this! The organization of the configuration option
in the FLIP looks much cleaner and easier to understand. +1 to the
FLIP.

Just some questions from me.

1. I think the change to the ConfigOptions should be put in the
`Public Interface` section, instead of `Proposed Changed`, as those
configuration options are public interface.

2. The key `state.checkpoint.cleaner.parallel-mode` seems confusing.
It feels like it is used to choose different modes. In fact, it is a
boolean flag to indicate whether to enable parallel clean. How about
making it `state.checkpoint.cleaner.parallel-mode.enabled`?

3. The `execution.checkpointing.write-buffer` may better be
`execution.checkpointing.write-buffer-size` so that we know it is
configuring the size of the buffer.

Best,
Xuannan


On Wed, Dec 27, 2023 at 7:17 PM Yanfei Lei  wrote:
>
> Hi Zakelly,
>
> > Considering the name occupation, how about naming it as 
> > `execution.checkpointing.type`?
>
> `Checkpoint Type`[1,2] is used to describe aligned/unaligned
> checkpoint, I am inclined to make a choice between
> `execution.checkpointing.incremental` and
> `execution.checkpointing.incremental.enabled`.
>
>
> [1] 
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/monitoring/checkpoint_monitoring/
> [2] 
> https://github.com/apache/flink/blob/master/flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/detail/job-checkpoints-detail.component.html#L27
>
> --
> Best,
> Yanfei
>
> Zakelly Lan  wrote on Wednesday, December 27, 2023 at 14:41:
> >
> > Hi Lijie,
> >
> > Thanks for the reminder! I missed this.
> >
> > Considering the name occupation, how about naming it as
> > `execution.checkpointing.type`?
> >
> > Actually I think the current `execution.checkpointing.mode` is confusing in
> > some ways, maybe `execution.checkpointing.data-consistency` is better.
> >
> >
> > Best,
> > Zakelly
> >
> >
> > On Wed, Dec 27, 2023 at 12:59 PM Lijie Wang 
> > wrote:
> >
> > > Hi Zakelly,
> > >
> > > >> I'm wondering if `execution.checkpointing.savepoint-dir` would be
> > > better.
> > >
> > > `execution.checkpointing.dir` and `execution.checkpointing.savepoint-dir`
> > > are also fine for me.
> > >
> > > >> So I think an enumeration option `execution.checkpointing.mode` which
> > > can be 'full' (default) or 'incremental' would be better
> > >
> > > I agree with using an enumeration option. But currently there is already a
> > > configuration option called `execution.checkpointing.mode`, which is used
> > > to choose EXACTLY_ONCE or AT_LEAST_ONCE. Maybe we need to use another name
> > > or merge these two options.
> > >
> > > Best,
> > > Lijie
> > >
> > > Zakelly Lan  wrote on Wednesday, December 27, 2023 at 11:43:
> > >
> > > > Hi everyone,
> > > >
> > > > Thanks all for your comments!
> > > >
> > > > @Yanfei
> > > >
> > > > > 1. For some state backends that do not support incremental checkpoint,
> > > > > how does the execution.checkpointing.incrementaloption take effect? Or
> > > > > is it better to put incremental under state.backend.xxx.incremental?
> > > > >
> > > > I'd rather not put the option for incremental checkpoint under the
> > > > 'state.backend', since it is more about the checkpointing instead of
> > > state
> > > > accessing. Of course, the state backend may not necessarily do
> > > incremental
> > > > checkpoint as requested. If the state backend is not capable of taking
> > > > incremental cp, it is better to fallback to the full cp.
> > > >
> > > > 2. I'm a little worried that putting all configurations into
> > > > > `ExecutionCheckpointingOptions` will introduce some dependency
> > > > > problems. Some options would be used by flink-runtime module, but
> > > > > flink-runtime should not depend on flink-streaming-java. e.g.
> > > > > FLINK-28286[1].
> > > > > So, I prefer to move configurations to `CheckpointingOptions`, WDYT?
> > > > >
> > > >
> > > > Yes, that's a very good point.  Moving to
> > > > `CheckpointingOptions`(flink-core) makes sense.
> > > >
> > > > @Lijie
> > > >
> > > > How about
> > > > > state.savepoints.dir -> execution.checkpointing.savepoint.dir
> > > > > state.checkpoints.dir -> execution.checkpointing.checkpoint.dir
> > > >
> > > >
> > > > Actually, I think the `checkpointing.checkpoint` may cause some
> > > confusion.
> > > > But I'm ok if others agree.
> > > > I'm wondering if `execution.checkpointing.savepoint-dir` would be 
> > > > better.
> > > > WDYT?
> > > >
> > > > 2. We changed the execution.checkpointing.local-copy' to
> > > > > 'execution.checkpointing.local-copy.enabled'. Should we also add
> > > > "enabled"
> > > > > suffix for other boolean type configuration options ? For example,
> > > > > execution.checkpointing.incremental ->
> > > > > execution.checkpointing.incremental.enabled
> > > > >
> > > >
> > > > Actually, the incremental cp is something like choosing a mode for doing
> > > > checkpoint instead of enabling a function. So I think an enumeration
> > > option
> > > > `execution.checkpointing.mode` which can
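To make the renaming discussion concrete: whatever names win, the mechanical change is declaring each option under the `execution.checkpointing.` prefix. A sketch, assuming a boolean option as debated in this thread (key, default, and description are all still under discussion):

{code:java}
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class CheckpointingOptionsSketch {
    // Sketch only: the key and default are taken from this thread, not a decision.
    public static final ConfigOption<Boolean> INCREMENTAL_CHECKPOINTS =
            ConfigOptions.key("execution.checkpointing.incremental")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription(
                            "Whether to take incremental checkpoints, if the"
                                    + " configured state backend supports them.");
}
{code}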

Re: [DISCUSS] FLIP-398: Improve Serialization Configuration And Usage In Flink

2023-12-28 Thread Ken Krugler
Hi Xintong,

I agree that decoupling from Kryo is a bigger topic, well beyond the scope of 
this FLIP.

The reason I’d brought up Fury is that this increases my confidence that Flink 
will want to decouple from Kryo sooner rather than later.

So I feel it would be worth investing in a (minor) name change now, to improve 
that migration path in the future. Thus my suggestion for avoiding the explicit 
use of Kryo in method names.

Regards,

— Ken




> On Dec 17, 2023, at 7:16 PM, Xintong Song  wrote:
> 
> Hi Ken,
> 
> I think the main purpose of this FLIP is to change how users interact with
> the knobs for customizing the serialization behaviors, from requiring code
> changes to working with pure configurations. Redesigning the knobs (i.e.,
> names, semantics, etc.), on the other hand, is not the purpose of this
> FLIP. Preserving the existing names and semantics should also help minimize
> the migration cost for existing users. Therefore, I'm in favor of not
> changing them.
> 
> Concerning decoupling from Kryo, and introducing other serialization
> frameworks like Fury, I think that's a bigger topic that is worth further
> discussion. At the moment, I'm not aware of any community consensus on
> doing so. And even if in the future we decide to do so, the changes needed
> should be the same w/ or w/o this FLIP. So I'd suggest not to block this
> FLIP on these issues.
> 
> WDYT?
> 
> Best,
> 
> Xintong
> 
> 
> 
> On Fri, Dec 15, 2023 at 1:40 AM Ken Krugler 
> wrote:
> 
>> Hi Yong,
>> 
>> Looks good, thanks for creating this.
>> 
>> One comment - related to my recent email about Fury, I would love to see
>> the v2 serialization decoupled from Kryo.
>> 
>> As part of that, instead of using xxxKryo in methods, call them xxxGeneric.
>> 
>> A more extreme change would be to totally rely on Fury (so no more POJO
>> serializer). Fury is faster than the POJO serializer in my tests, but this
>> would be a much bigger change.
>> 
>> Though it could dramatically simplify the Flink serialization support.
>> 
>> — Ken
>> 
>> PS - a separate issue is how to migrate state from Kryo to something like
>> Fury, which supports schema evolution. I think this might be possible, by
>> having a smarter deserializer that identifies state as being created by
>> Kryo, and using (shaded) Kryo to deserialize, while still writing as Fury.
>> 
>>> On Dec 6, 2023, at 6:35 PM, Yong Fang  wrote:
>>> 
>>> Hi devs,
>>> 
>>> I'd like to start a discussion about FLIP-398: Improve Serialization
>>> Configuration And Usage In Flink [1].
>>> 
>>> Currently, users can register custom data types and serializers in Flink
>>> jobs through various methods, including registration in code,
>>> configuration, and annotations. These lead to difficulties in upgrading
>>> Flink jobs and priority issues.
>>> 
>>> In flink-2.0 we would like to manage job data types and serializers
>> through
>>> configurations. This FLIP will introduce a unified option for data type
>> and
>>> serializer and users can configure all custom data types and
>>> pojo/kryo/custom serializers. In addition, this FLIP will add more
>> built-in
>>> serializers for complex data types such as List and Map, and optimize the
>>> management of Avro Serializers.
>>> 
>>> Looking forward to hearing from you, thanks!
>>> 
>>> [1]
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-398%3A+Improve+Serialization+Configuration+And+Usage+In+Flink
>>> 
>>> Best,
>>> Fang Yong
>> 
>> --
>> Ken Krugler
>> http://www.scaleunlimited.com
>> Custom big data solutions
>> Flink & Pinot
>> 
>> 
>> 
>> 



--
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink & Pinot
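For readers skimming the thread: the configuration-driven registration the FLIP proposes would be declared in flink-conf.yaml along these lines (a sketch of the FLIP's direction; the key name, entry syntax, and example classes here are assumptions, not the final format):

{noformat}
pipeline.serialization-config:
  - org.example.UserEvent: {type: pojo}
  - org.example.LegacyRecord: {type: kryo}
{noformat}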





Re: [DISCUSS] FLIP-406: Reorganize State & Checkpointing & Recovery Configuration

2023-12-28 Thread Zakelly Lan
Hi everyone,

Thanks all for your comments!

As many of you have questions about the names for boolean options, I
suggest we make a naming rule for them. For now I could think of three
options:

Option 1: Use enumeration options if possible. But this may cause some name
collisions or confusion, as we discussed, and we would need to unify the wording
everywhere.
Option 2: Use boolean options and add 'enabled' as the suffix.
Option 3: Use boolean options and ONLY add 'enabled' when there are more
detailed configurations under the same prefix, to prevent one name from
serving as a prefix to another.

I am slightly inclined to Option 3, since it is more in line with current
practice and friendlier for existing users. It also keeps configuration names
as short as possible. I really want to hear your opinions.
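To illustrate Option 3 with a made-up example: a boolean key only gets the suffix when it is also the prefix of a more detailed sibling key (the `.dir` key below is hypothetical):

{noformat}
# Standalone boolean key: no suffix needed under Option 3.
execution.checkpointing.incremental: true

# "local-copy" also prefixes a more detailed (hypothetical) key, so the
# boolean gets the `.enabled` suffix to keep one name from shadowing another:
execution.checkpointing.local-copy.enabled: true
execution.checkpointing.local-copy.dir: /tmp/local-copies
{noformat}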


@Xuannan

I agree with your comments 1 and 3.

For 2, if we decide to change the name, maybe
`execution.checkpointing.parallel-cleaner` is better? As for whether to
add 'enabled', I suggest we discuss the rule above. WDYT?
Thanks!


Best,
Zakelly

On Fri, Dec 29, 2023 at 12:02 PM Xuannan Su  wrote:

> Hi Zakelly,
>
> Thanks for driving this! The organization of the configuration option
> in the FLIP looks much cleaner and easier to understand. +1 to the
> FLIP.
>
> Just some questions from me.
>
> 1. I think the change to the ConfigOptions should be put in the
> `Public Interface` section, instead of `Proposed Changed`, as those
> configuration options are public interface.
>
> 2. The key `state.checkpoint.cleaner.parallel-mode` seems confusing.
> It feels like it is used to choose different modes. In fact, it is a
> boolean flag to indicate whether to enable parallel clean. How about
> making it `state.checkpoint.cleaner.parallel-mode.enabled`?
>
> 3. The `execution.checkpointing.write-buffer` may better be
> `execution.checkpointing.write-buffer-size` so that we know it is
> configuring the size of the buffer.
>
> Best,
> Xuannan
>
>
> On Wed, Dec 27, 2023 at 7:17 PM Yanfei Lei  wrote:
> >
> > Hi Zakelly,
> >
> > > Considering the name occupation, how about naming it as
> `execution.checkpointing.type`?
> >
> > `Checkpoint Type`[1,2] is used to describe aligned/unaligned
> > checkpoint, I am inclined to make a choice between
> > `execution.checkpointing.incremental` and
> > `execution.checkpointing.incremental.enabled`.
> >
> >
> > [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/monitoring/checkpoint_monitoring/
> > [2]
> https://github.com/apache/flink/blob/master/flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/detail/job-checkpoints-detail.component.html#L27
> >
> > --
> > Best,
> > Yanfei
> >
> > Zakelly Lan  wrote on Wednesday, December 27, 2023 at 14:41:
> > >
> > > Hi Lijie,
> > >
> > > Thanks for the reminder! I missed this.
> > >
> > > Considering the name occupation, how about naming it as
> > > `execution.checkpointing.type`?
> > >
> > > Actually I think the current `execution.checkpointing.mode` is
> confusing in
> > > some ways, maybe `execution.checkpointing.data-consistency` is better.
> > >
> > >
> > > Best,
> > > Zakelly
> > >
> > >
> > > On Wed, Dec 27, 2023 at 12:59 PM Lijie Wang 
> > > wrote:
> > >
> > > > Hi Zakelly,
> > > >
> > > > >> I'm wondering if `execution.checkpointing.savepoint-dir` would be
> > > > better.
> > > >
> > > > `execution.checkpointing.dir` and
> `execution.checkpointing.savepoint-dir`
> > > > are also fine for me.
> > > >
> > > > >> So I think an enumeration option `execution.checkpointing.mode`
> which
> > > > can be 'full' (default) or 'incremental' would be better
> > > >
> > > > I agree with using an enumeration option. But currently there is
> already a
> > > > configuration option called `execution.checkpointing.mode`, which is
> used
> > > > to choose EXACTLY_ONCE or AT_LEAST_ONCE. Maybe we need to use
> another name
> > > > or merge these two options.
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Zakelly Lan  wrote on Wednesday, December 27, 2023 at 11:43:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Thanks all for your comments!
> > > > >
> > > > > @Yanfei
> > > > >
> > > > > > 1. For some state backends that do not support incremental
> checkpoint,
> > > > > > how does the execution.checkpointing.incrementaloption take
> effect? Or
> > > > > > is it better to put incremental under
> state.backend.xxx.incremental?
> > > > > >
> > > > > I'd rather not put the option for incremental checkpoint under the
> > > > > 'state.backend', since it is more about the checkpointing instead
> of
> > > > state
> > > > > accessing. Of course, the state backend may not necessarily do
> > > > incremental
> > > > > checkpoint as requested. If the state backend is not capable of
> taking
> > > > > incremental cp, it is better to fallback to the full cp.
> > > > >
> > > > > 2. I'm a little worried that putting all configurations into
> > > > > > `ExecutionCheckpointingOptions` will introduce some dependency
> > > > > > problems. Some options would be used by fl

Anyone encounters the non-heap memory increasing issue when using RocksDB as state backend and has better solutions?

2023-12-28 Thread Sue Alen
Has anyone encountered the non-heap memory growth issue when using RocksDB
as the state backend with memory controlling enabled? It has plagued us for a
long time. Our Flink jobs are deployed on Kubernetes, and they get killed by
Kubernetes with OOM as the non-heap memory keeps increasing. We thought we could
add the environment variable MALLOC_ARENA_MAX=1 for the TaskManagers to avoid
it, but that reduces the throughput. In some scenarios, the throughput
decreases by 40%.
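For reference, one way to set that variable is through flink-conf.yaml (the `containerized.*` env prefix is what the Flink docs describe for container deployments; verify that it applies to your deployment mode):

{noformat}
containerized.taskmanager.env.MALLOC_ARENA_MAX: "1"
{noformat}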

So we want to find the answers to the questions below and, further, any better
solution. Could anyone help? Thanks in advance!


1. Why do savepoints or full checkpoints lead to this glibc bug? Which
method calls trigger the bug?

2. It doesn't happen during the savepoints or full checkpoints of every Flink
job, only in certain scenarios, such as over-aggregation windows that hold large
state. Does anyone know which scenarios, functions, or operators trigger this
glibc bug?

3. This troubleshooting advice has been in the official documentation for
several years. Why has nobody fixed the glibc bug or made an effort on the Flink
side, such as using another memory allocator like jemalloc?


Here is the troubleshooting link and description, FYI.

https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/memory/mem_trouble/#container-memory-exceeded

Container Memory Exceeded

If a Flink container tries to allocate memory beyond its requested size (Yarn 
or Kubernetes), this usually indicates that Flink has not reserved enough 
native memory. You can observe this either by using an external monitoring 
system or from the error messages when a container gets killed by the 
deployment environment.

If you encounter this problem in the JobManager process, you can also enable
the JVM Direct Memory limit by setting the
jobmanager.memory.enable-jvm-direct-memory-limit option to exclude a possible
JVM Direct Memory leak.

If RocksDBStateBackend is used:

  *   and memory controlling is disabled: You can try to increase the
TaskManager's managed memory.
  *   and memory controlling is enabled and non-heap memory increases during
savepoints or full checkpoints: This may happen due to the glibc memory
allocator (see the glibc bug). You can try to add the environment variable
MALLOC_ARENA_MAX=1 for TaskManagers.

Alternatively, you can increase the JVM Overhead.