Re: [PR] [FLINK-20625] Implement V2 Google Cloud PubSub Source in accordance with FLIP-27 [flink-connector-gcp-pubsub]

2024-10-09 Thread via GitHub


clmccart commented on PR #32:
URL: 
https://github.com/apache/flink-connector-gcp-pubsub/pull/32#issuecomment-2402775724

   looks like the sink rewrite was completed in 
[FLINK-24298](https://issues.apache.org/jira/browse/FLINK-24298). will need to 
take a look at that and probably remove the sink changes from this PR





[jira] [Updated] (FLINK-36456) Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key

2024-10-09 Thread Raul Garcia (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raul Garcia updated FLINK-36456:

Component/s: Runtime / Configuration
 (was: Runtime / Metrics)

> Improve Security in DatadogHttpReporterFactory by providing ENV alternative 
> to retrieve API key
> ---
>
> Key: FLINK-36456
> URL: https://issues.apache.org/jira/browse/FLINK-36456
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: Raul Garcia
>Priority: Minor
>
> The current implementation of the {{DatadogHttpReporterFactory}} class 
> retrieves the Datadog API key from the Flink configuration. In Kubernetes 
> environments, this typically means storing the API key in ConfigMaps, which 
> can expose sensitive information in plain text. Since ConfigMaps are not 
> designed to hold secrets, this approach poses potential security risks.
> My proposal is to fall back to {{DD_API_KEY}}, which is the standard way of
> passing the API key to containers and is usually available in the
> environment.





[PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]

2024-10-09 Thread via GitHub


rxp90 opened a new pull request, #25470:
URL: https://github.com/apache/flink/pull/25470

   … DD_API_KEY if not present
   
   
   
   ## What is the purpose of the change
   
   *(For example: This pull request makes task deployment go through the blob 
server, rather than through RPC. That way we avoid re-transferring them on each 
deployment (during recovery).)*
   
   
   ## Brief change log
   
   *(for example:)*
 - *The TaskInfo is stored in the blob store on job creation time as a 
persistent artifact*
 - *Deployments RPC transmits only the blob storage reference*
 - *TaskManagers retrieve the TaskInfo from the blob cache*
   
   
   ## Verifying this change
   
   Please make sure both new and modified tests in this PR follow [the 
conventions for tests defined in our code quality 
guide](https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing).
   
   *(Please pick either of the following options)*
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This change is already covered by existing tests, such as *(please describe 
tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
 - *Added integration tests for end-to-end deployment with large payloads 
(100MB)*
 - *Extended integration test for recovery after master (JobManager) 
failure*
 - *Added test that validates that TaskInfo is transferred only once across 
recoveries*
 - *Manually verified the change by running a 4 node cluster with 2 
JobManagers and 4 TaskManagers, a stateful streaming program, and killing one 
JobManager and two TaskManagers during the execution, verifying that recovery 
happens correctly.*
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / no)
 - The serializers: (yes / no / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / no / 
don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
 - The S3 file system connector: (yes / no / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / no)
 - If yes, how is the feature documented? (not applicable / docs / JavaDocs 
/ not documented)
   





[jira] [Updated] (FLINK-36456) Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36456:
---
Labels: pull-request-available  (was: )

> Improve Security in DatadogHttpReporterFactory by providing ENV alternative 
> to retrieve API key
> ---
>
> Key: FLINK-36456
> URL: https://issues.apache.org/jira/browse/FLINK-36456
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Reporter: Raul Garcia
>Priority: Minor
>  Labels: pull-request-available
>
> The current implementation of the {{DatadogHttpReporterFactory}} class 
> retrieves the Datadog API key from the Flink configuration. In Kubernetes 
> environments, this typically means storing the API key in ConfigMaps, which 
> can expose sensitive information in plain text. Since ConfigMaps are not 
> designed to hold secrets, this approach poses potential security risks.
> My proposal is to fall back to {{DD_API_KEY}}, which is the standard way of
> passing the API key to containers and is usually available in the
> environment.





Re: [PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]

2024-10-09 Thread via GitHub


Juoelenis commented on PR #25470:
URL: https://github.com/apache/flink/pull/25470#issuecomment-2403211335

   Very legit 👍👍👍





Re: [PR] [FLINK-36445][runtime] Introduce ExecutionPlan interface. [flink]

2024-10-09 Thread via GitHub


Juoelenis commented on PR #25469:
URL: https://github.com/apache/flink/pull/25469#issuecomment-2403214920

   Big W





Re: [PR] [FLINK-36440] Bump log4j from 2.17.1 to 2.23.1 [flink]

2024-10-09 Thread via GitHub


Juoelenis commented on PR #25463:
URL: https://github.com/apache/flink/pull/25463#issuecomment-2403221800

   1+ aura





[jira] [Updated] (FLINK-36440) Bump log4j from 2.17.1 to 2.23.1

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36440:
---
Labels: pull-request-available  (was: )

> Bump log4j from 2.17.1 to 2.23.1
> 
>
> Key: FLINK-36440
> URL: https://issues.apache.org/jira/browse/FLINK-36440
> Project: Flink
>  Issue Type: Improvement
>Reporter: Siddharth R
>Priority: Major
>  Labels: pull-request-available
>
> Bumping *log4j* to the latest version (2.23.1) - this will remediate a lot of
> vulnerabilities in dependent packages.
> Package details:
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-1.2-api/2.23.1]
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-slf4j-impl/2.23.1]
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api/2.23.1]
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core/2.23.1]
> Release notes:
> [https://logging.apache.org/log4j/2.x/release-notes.html]
>  
> A lot of bug fixes have been done in the newer versions and I don't see any
> breaking changes as such.





Re: [PR] [FLINK-36440] Bump log4j from 2.17.1 to 2.23.1 [flink]

2024-10-09 Thread via GitHub


ferenc-csaky commented on code in PR #25463:
URL: https://github.com/apache/flink/pull/25463#discussion_r1794118721


##
pom.xml:
##
@@ -127,7 +127,7 @@ under the License.
		<flink.markBundledAsOptional>true</flink.markBundledAsOptional>
		<target.java.version>11</target.java.version>
		<slf4j.version>1.7.36</slf4j.version>
-		<log4j.version>2.17.1</log4j.version>
+		<log4j.version>2.23.1</log4j.version>

Review Comment:
   Any reason why not bump to 2.24.1, which is the latest ATM?






Re: [PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]

2024-10-09 Thread via GitHub


flinkbot commented on PR #25470:
URL: https://github.com/apache/flink/pull/25470#issuecomment-2402739010

   
   ## CI report:
   
   * 8e38de9c2f05656c51131a9d241ccab6fa8824c2 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   





Re: [PR] Final [flink-connector-gcp-pubsub]

2024-10-09 Thread via GitHub


clmccart commented on code in PR #32:
URL: 
https://github.com/apache/flink-connector-gcp-pubsub/pull/32#discussion_r1793804145


##
ordering-keys-prober/README.md:
##
@@ -0,0 +1,75 @@
+### Introduction
+
+The prober is designed to demonstrate effective use of Cloud Pub/Sub's
+[ordered delivery](https://cloud.google.com/pubsub/docs/ordering).
+
+### Building
+
+These instructions assume you are using [Maven](https://maven.apache.org/).
+
+1.  If you want to build the connector from head, clone the repository, 
ensuring
+to do so recursively to pick up submodules:
+
+`git clone https://github.com/GoogleCloudPlatform/pubsub`

Review Comment:
   update





Re: [PR] Donate PubSub source and sink rewrite back to Apache [flink-connector-gcp-pubsub]

2024-10-09 Thread via GitHub


clmccart commented on code in PR #32:
URL: 
https://github.com/apache/flink-connector-gcp-pubsub/pull/32#discussion_r1793809868


##
.gitignore:
##
@@ -35,4 +35,8 @@ out/
 tools/flink
 tools/flink-*
 tools/releasing/release
-tools/japicmp-output
\ No newline at end of file
+tools/japicmp-output
+*Dockerfile
+*launch.sh
+*pubsub.yaml
+*.vscode/*

Review Comment:
   will remove once manual testing is done






Re: [PR] [FLINK-33926][kubernetes]: Allow using job jars in the system classpath in native mode [flink]

2024-10-09 Thread via GitHub


SamBarker commented on code in PR #25445:
URL: https://github.com/apache/flink/pull/25445#discussion_r1794173964


##
flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesClusterDescriptor.java:
##
@@ -214,9 +215,12 @@ public ClusterClientProvider<String> deployApplicationCluster(

         applicationConfiguration.applyToConfiguration(flinkConfig);

-        // No need to do pipelineJars validation if it is a PyFlink job.
+        // No need to do pipelineJars validation if it is a PyFlink job or no jars specified in
+        // pipeline.jars (to use jars in system classpath instead).
         if (!(PackagedProgramUtils.isPython(applicationConfiguration.getApplicationClassName())
-                || PackagedProgramUtils.isPython(applicationConfiguration.getProgramArguments()))) {
+                || PackagedProgramUtils.isPython(
+                        applicationConfiguration.getProgramArguments()))
+                && flinkConfig.get(PipelineOptions.JARS) != null) {

Review Comment:
   Would it be useful to push this compound condition to
`PackagedProgramUtils` as something like `hasClasspath` or `usesSystemClasspath`
(with the implied flip to false)?
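
   A minimal sketch of what such a helper could look like (hypothetical name
and shape, assuming the existing `isPython` overloads and `PipelineOptions.JARS`;
not the actual Flink API):

       // Hypothetical helper in PackagedProgramUtils; returns true when the
       // pipelineJars validation can be skipped, i.e. the job is a PyFlink job
       // or relies on jars already present in the system classpath.
       public static boolean usesSystemClasspath(
               ApplicationConfiguration applicationConfiguration, Configuration flinkConfig) {
           return isPython(applicationConfiguration.getApplicationClassName())
                   || isPython(applicationConfiguration.getProgramArguments())
                   || flinkConfig.get(PipelineOptions.JARS) == null;
       }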






[PR] [FlINK-XXXX] Expand the matrix for the smoke test. [flink-kubernetes-operator]

2024-10-09 Thread via GitHub


SamBarker opened a new pull request, #892:
URL: https://github.com/apache/flink-kubernetes-operator/pull/892

   The goal is to remove namespace from the main CI run based on 
https://github.com/apache/flink-kubernetes-operator/pull/881#discussion_r1790257507
   
   
   
   ## What is the purpose of the change
   
   *(For example: This pull request adds a new feature to periodically create 
and maintain savepoints through the `FlinkDeployment` custom resource.)*
   
   
   ## Brief change log
   
   *(for example:)*
 - *Periodic savepoint trigger is introduced to the custom resource*
 - *The operator checks on reconciliation whether the required time has 
passed*
 - *The JobManager's dispose savepoint API is used to clean up obsolete 
savepoints*
   
   ## Verifying this change
   
   *(Please pick either of the following options)*
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This change is already covered by existing tests, such as *(please describe 
tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
 - *Added integration tests for end-to-end deployment with large payloads 
(100MB)*
 - *Extended integration test for recovery after master (JobManager) 
failure*
 - *Manually verified the change by running a 4 node cluster with 2 
JobManagers and 4 TaskManagers, a stateful streaming program, and killing one 
JobManager and two TaskManagers during the execution, verifying that recovery 
happens correctly.*
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / no)
 - The public API, i.e., is any changes to the `CustomResourceDescriptors`: 
(yes / no)
 - Core observer or reconciler logic that is regularly executed: (yes / no)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / no)
 - If yes, how is the feature documented? (not applicable / docs / JavaDocs 
/ not documented)
   





Re: [PR] [FlINK-XXXX] Expand the matrix for the smoke test. [flink-kubernetes-operator]

2024-10-09 Thread via GitHub


SamBarker closed pull request #892: [FlINK-XXXX] Expand the matrix for the
smoke test.
URL: https://github.com/apache/flink-kubernetes-operator/pull/892





[jira] [Updated] (FLINK-33926) Can't start a job with a jar in the system classpath in native k8s mode

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-33926:
---
Labels: pull-request-available  (was: )

> Can't start a job with a jar in the system classpath in native k8s mode
> ---
>
> Key: FLINK-33926
> URL: https://issues.apache.org/jira/browse/FLINK-33926
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.6.0
>Reporter: Trystan
>Assignee: Gantigmaa Selenge
>Priority: Major
>  Labels: pull-request-available
>
> It appears that the combination of the running operator-controlled jobs in 
> native k8s + application mode + using a job jar in the classpath is invalid. 
> Avoiding dynamic classloading (as specified in the 
> [docs|https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/ops/debugging/debugging_classloading/#avoiding-dynamic-classloading-for-user-code])
>  is beneficial for some jobs. This affects at least Flink 1.16.1 and 
> Kubernetes Operator 1.6.0.
>  
> FLINK-29288 seems to have addressed this for standalone mode. If I am 
> misunderstanding how to correctly build jars for this native k8s scenario, 
> apologies for the noise and any pointers would be appreciated!
>  
> Perhaps related, the [spec 
> documentation|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/#jobspec]
>  declares it optional, but isn't clear about under what conditions that 
> applies.
>  * Putting the jar in the system classpath and pointing *jarURI* to that jar 
> leads to linkage errors.
>  * Not including *jarURI* leads to NullPointerExceptions in the operator:
> {code:java}
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"java.lang.NullPointerException","stackTrace":"org.apache.flink.kubernetes.operator.exception.ReconciliationException:
>  java.lang.NullPointerException\n\tat 
> org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:148)\n\tat
>  
> org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:56)\n\tat
>  
> io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:138)\n\tat
>  
> io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:96)\n\tat
>  
> org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)\n\tat
>  
> io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:95)\n\tat
>  
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:139)\n\tat
>  
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:119)\n\tat
>  
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:89)\n\tat
>  
> io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)\n\tat
>  
> io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)\n\tat
>  java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)\n\tat 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: 
> java.lang.NullPointerException\n\tat 
> org.apache.flink.kubernetes.utils.KubernetesUtils.checkJarFileForApplicationMode(KubernetesUtils.java:407)\n\tat
>  
> org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:207)\n\tat
>  
> org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)\n\tat
>  
> org.apache.flink.kubernetes.operator.service.NativeFlinkService.deployApplicationCluster(Native","additionalMetadata":{},"throwableList":[{"type":"java.lang.NullPointerException","additionalMetadata":{}}]}
>   {code}





[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted

2024-10-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887987#comment-17887987
 ] 

Arvid Heise commented on FLINK-36455:
-

I'm not an expert on the SFS, but from cross-checking the code it looks safe
because it has no retries whatsoever. So it would fail on a transient error and
trigger a new final checkpoint, which is in accordance with the contract of
notifyCheckpointCompleted.

> Sink should commit everything on notifyCheckpointCompleted
> --
>
> Key: FLINK-36455
> URL: https://issues.apache.org/jira/browse/FLINK-36455
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
> Fix For: 2.0-preview
>
>
> Currently, we retry committables at some later time until they eventually
> succeed.
> However, that violates the contract of notifyCheckpointCompleted, which states
> that all side effects must be committed before the method returns. In
> particular, notifyCheckpointCompleted must fail if we cannot guarantee that
> all side effects are committed for final checkpoints. As soon as
> notifyCheckpointCompleted returns, the final checkpoint is deemed completed,
> which currently may mean that some transactions are still open.
> The solution is that all retries must happen in a closed loop in
> notifyCheckpointCompleted.
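
A rough sketch of the intended behavior (hypothetical helper methods, not the
actual Sink/Committer API): retry in a closed loop inside
notifyCheckpointComplete so that nothing is left open when the method returns.

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // Sketch only: takePendingCommittables and commitAndCollectRetriable
        // are illustrative stand-ins for the committer plumbing.
        Collection<Committable> pending = takePendingCommittables(checkpointId);
        while (!pending.isEmpty()) {
            // non-retriable failures must throw, failing the (final) checkpoint
            pending = commitAndCollectRetriable(pending);
        }
    }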





[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted

2024-10-09 Thread Galen Warren (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887988#comment-17887988
 ] 

Galen Warren commented on FLINK-36455:
--

Thanks for checking, much appreciated. I may go ahead and use the older one for 
now until I can use your fix.

> Sink should commit everything on notifyCheckpointCompleted
> --
>
> Key: FLINK-36455
> URL: https://issues.apache.org/jira/browse/FLINK-36455
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
> Fix For: 2.0-preview
>
>
> Currently, we retry committables at some later time until they eventually
> succeed.
> However, that violates the contract of notifyCheckpointCompleted, which states
> that all side effects must be committed before the method returns. In
> particular, notifyCheckpointCompleted must fail if we cannot guarantee that
> all side effects are committed for final checkpoints. As soon as
> notifyCheckpointCompleted returns, the final checkpoint is deemed completed,
> which currently may mean that some transactions are still open.
> The solution is that all retries must happen in a closed loop in
> notifyCheckpointCompleted.





[jira] [Created] (FLINK-36456) Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key

2024-10-09 Thread Raul Garcia (Jira)
Raul Garcia created FLINK-36456:
---

 Summary: Improve Security in DatadogHttpReporterFactory by 
providing ENV alternative to retrieve API key
 Key: FLINK-36456
 URL: https://issues.apache.org/jira/browse/FLINK-36456
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Metrics
Reporter: Raul Garcia


The current implementation of the {{DatadogHttpReporterFactory}} class 
retrieves the Datadog API key from the Flink configuration. In Kubernetes 
environments, this typically means storing the API key in ConfigMaps, which can 
expose sensitive information in plain text. Since ConfigMaps are not designed 
to hold secrets, this approach poses potential security risks.

My proposal is to fall back to {{DD_API_KEY}}, which is the standard way of
passing the API key to containers and is usually available in the environment.
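
A minimal sketch of the proposed lookup order (hypothetical method; the actual
factory wiring and option key may differ):

    // Sketch: prefer the key from the metrics configuration and fall back
    // to the standard Datadog environment variable.
    static String resolveApiKey(MetricConfig config) {
        String apiKey = config.getString("apikey", null); // current behavior
        if (apiKey == null || apiKey.isEmpty()) {
            apiKey = System.getenv("DD_API_KEY"); // proposed fallback
        }
        return apiKey;
    }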





Re: [PR] [FLINK-36440] Bump log4j from 2.17.1 to 2.23.1 [flink]

2024-10-09 Thread via GitHub


Juoelenis commented on code in PR #25463:
URL: https://github.com/apache/flink/pull/25463#discussion_r1794127775


##
pom.xml:
##
@@ -127,7 +127,7 @@ under the License.
		<flink.markBundledAsOptional>true</flink.markBundledAsOptional>
		<target.java.version>11</target.java.version>
		<slf4j.version>1.7.36</slf4j.version>
-		<log4j.version>2.17.1</log4j.version>
+		<log4j.version>2.23.1</log4j.version>

Review Comment:
   I have no idea what that is cuz i dunno java. 






Re: [PR] [FLINK-25546][flink-connector-base][JUnit5 Migration] Module: flink-connector-base [flink]

2024-10-09 Thread via GitHub


snuyanzin commented on PR #21161:
URL: https://github.com/apache/flink/pull/21161#issuecomment-2403319420

   @Jiabao-Sun since you are one of those who contributed a lot to the JUnit5
migration, may I ask you to have a look here?





Re: [PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]

2024-10-09 Thread via GitHub


wcoqscwx commented on PR #25470:
URL: https://github.com/apache/flink/pull/25470#issuecomment-2402733236

   Same change as this PR opened > 24 months ago: 
https://github.com/apache/flink/pull/19684





[PR] [FLINK-36437][hive] Remove hive connector dependency usage [flink]

2024-10-09 Thread via GitHub


Sxnan opened a new pull request, #25467:
URL: https://github.com/apache/flink/pull/25467

   ## What is the purpose of the change
   
   Remove hive connector dependency usage
   
   ## Brief change log
   
   - Remove hive connector dependency usage
   
   
   ## Verifying this change
   
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers: no
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable





Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793080478


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceEnumeratorMetrics.java:
##
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.connectors.base.source.metrics;
+
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
+import org.apache.flink.metrics.groups.SplitEnumeratorMetricGroup;
+
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+
+/** A collection class for handling metrics in {@link 
SourceEnumeratorMetrics}. */
+public class SourceEnumeratorMetrics {
+private static final Logger LOGGER = 
LoggerFactory.getLogger(SourceEnumeratorMetrics.class);
+// Constants
+public static final int UNDEFINED = 0;
+
+// Metric names
+public static final String IS_SNAPSHOTTING = "isSnapshotting";
+public static final String IS_STREAM_READING = "isStreamReading";
+public static final String NUM_TABLES_SNAPSHOTTED = "numTablesSnapshotted";
+public static final String NUM_TABLES_REMAINING = "numTablesRemaining";
+public static final String NUM_SNAPSHOT_SPLITS_PROCESSED = 
"numSnapshotSplitsProcessed";
+public static final String NUM_SNAPSHOT_SPLITS_REMAINING = 
"numSnapshotSplitsRemaining";
+public static final String NUM_SNAPSHOT_SPLITS_FINISHED = 
"numSnapshotSplitsFinished";
+public static final String SNAPSHOT_START_TIME = "snapshotStartTime";
+public static final String SNAPSHOT_END_TIME = "snapshotEndTime";
+public static final String DATABASE_GROUP_KEY = "database";

Review Comment:
   Ok, it has been adjusted






Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793078972


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java:
##
@@ -20,16 +20,57 @@
 import 
org.apache.flink.cdc.connectors.base.source.reader.IncrementalSourceReader;
 import org.apache.flink.metrics.Counter;
 import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
 import org.apache.flink.metrics.groups.SourceReaderMetricGroup;
 import org.apache.flink.runtime.metrics.MetricNames;
+import org.apache.flink.util.clock.SystemClock;
+
+import io.debezium.data.Envelope;
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.kafka.connect.source.SourceRecord;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.HashMap;
+import java.util.Map;
+
+import static 
org.apache.flink.cdc.connectors.base.utils.SourceRecordUtils.getTableId;
+import static 
org.apache.flink.cdc.connectors.base.utils.SourceRecordUtils.isDataChangeRecord;
+import static 
org.apache.flink.cdc.connectors.base.utils.SourceRecordUtils.isSchemaChangeEvent;
 
 /** A collection class for handling metrics in {@link 
IncrementalSourceReader}. */
 public class SourceReaderMetrics {
 
+private static final Logger LOG = 
LoggerFactory.getLogger(SourceReaderMetrics.class);
+
 public static final long UNDEFINED = -1;
 
+// Metric group keys
+public static final String DATABASE_GROUP_KEY = "database";

Review Comment:
   Ok, it has been adjusted






[jira] [Created] (FLINK-36450) Exclude log4j from curator-test

2024-10-09 Thread Siddharth R (Jira)
Siddharth R created FLINK-36450:
---

 Summary: Exclude log4j from curator-test
 Key: FLINK-36450
 URL: https://issues.apache.org/jira/browse/FLINK-36450
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.10.0
Reporter: Siddharth R


Exclude vulnerable log4j dependency from curator-test

 


<exclusion>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
</exclusion>






Re: [PR] [FLINK-36445][runtime] Introduce ExecutionPlan interface. [flink]

2024-10-09 Thread via GitHub


flinkbot commented on PR #25469:
URL: https://github.com/apache/flink/pull/25469#issuecomment-2401882499

   
   ## CI report:
   
   * 477c3c2050a7fc508519b9806ac84cc9e3eab32a UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   





[jira] [Created] (FLINK-36451) Kubernetes Application JobManager Potential Deadlock and TaskManager Pod Residuals

2024-10-09 Thread xiechenling (Jira)
xiechenling created FLINK-36451:
---

 Summary: Kubernetes Application JobManager Potential Deadlock and 
TaskManager Pod Residuals
 Key: FLINK-36451
 URL: https://issues.apache.org/jira/browse/FLINK-36451
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.19.1
 Environment: * Flink version: 1.19.1
 * Deployment mode: Flink Kubernetes Application Mode
 * JVM version: OpenJDK 17

 
Reporter: xiechenling
 Attachments: 1.png, 2.png, jobmanager.log, jstack.txt

In Kubernetes Application Mode, when there is significant etcd latency or 
instability, the Flink JobManager may enter a deadlock situation. Additionally, 
TaskManager pods are not cleaned up properly, resulting in stale resources that 
prevent the Flink job from recovering correctly. This issue occurs during 
frequent service restarts or network instability.





[PR] [FLINK-36449] Bump apache-rat-plugin [flink-kubernetes-operator]

2024-10-09 Thread via GitHub


r-sidd opened a new pull request, #890:
URL: https://github.com/apache/flink-kubernetes-operator/pull/890

   ## What is the purpose of the change
   
   Bump apache-rat-plugin
   
   ## Brief change log
   
   Bump apache-rat-plugin from 0.12 to 0.16.1 to remediate the underlying 
vulnerabilities in the dependencies.
   
   Vulnerabilities from dependencies:
   [CVE-2022-4245](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4245)
   [CVE-2022-4244](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4244)
   
[CVE-2020-15250](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15250)
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency):  yes
 - The public API, i.e., is any changes to the `CustomResourceDescriptors`: 
no
 - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable
   





[jira] [Updated] (FLINK-36449) Bump apache-rat-plugin from 0.12 to 0.16.1

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36449:
---
Labels: pull-request-available  (was: )

> Bump apache-rat-plugin from 0.12 to 0.16.1
> --
>
> Key: FLINK-36449
> URL: https://issues.apache.org/jira/browse/FLINK-36449
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.10.0
>Reporter: Siddharth R
>Priority: Major
>  Labels: pull-request-available
>
> Bump apache-rat-plugin from 0.12 to 0.16.1 to remediate the underlying 
> vulnerabilities in the dependencies.
>  
> Vulnerabilities from dependencies:
> [CVE-2022-4245|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4245]
> [CVE-2022-4244|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4244]
> [CVE-2020-15250|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15250]





Re: [PR] [FLINK-33977][runtime] Adaptive scheduler may not minimize the number of TMs during downscaling [flink]

2024-10-09 Thread via GitHub


ztison commented on PR #25218:
URL: https://github.com/apache/flink/pull/25218#issuecomment-2401913141

   Hi, I was wondering if it's advisable to change the default slot assigner.
Wouldn't it be more appropriate to keep the default setting, which spreads the
workload across as many workers as possible? This approach enhances the
system's availability and resilience. The system will need to migrate the
entire state to a new TM if one crashes; is that what we want as the default
behavior?
   Wouldn't it be better to introduce a config option that activates the
proposed behavior as defined in FLINK-36426? Then we could use a new slot
assigner instead of the `DefaultSlotAssigner` and `StateLocalitySlotAssigner`.
   I'm also wondering if the same strategy can be applied in the Resource
Manager to allocate slots from a minimized number of Task Managers (TMs).





[jira] [Assigned] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted

2024-10-09 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-36455:
---

Assignee: Arvid Heise

> Sink should commit everything on notifyCheckpointCompleted
> --
>
> Key: FLINK-36455
> URL: https://issues.apache.org/jira/browse/FLINK-36455
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
> Fix For: 2.0-preview
>
>
> Currently, we retry committables at some later time until they eventually
> succeed.
> However, that violates the contract of notifyCheckpointCompleted, which states
> that all side effects must be committed before the method returns. In
> particular, notifyCheckpointCompleted must fail if we cannot guarantee that
> all side effects are committed for final checkpoints. As soon as
> notifyCheckpointCompleted returns, the final checkpoint is deemed completed,
> which currently may mean that some transactions are still open.
> The solution is that all retries must happen in a closed loop in
> notifyCheckpointCompleted.





[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted

2024-10-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887948#comment-17887948
 ] 

Arvid Heise commented on FLINK-36455:
-

Yes, all sinks following the FLIP-143 API are susceptible to the issue. After
implementing this ticket, graceful final checkpoints will never leave
transactions open (unless the connector is buggy). For normal checkpoints, open
transactions can only occur if the RPC message notifyCheckpointCompleted is
lost (which cannot happen for final checkpoints).

Note that the issue is only relevant when transactions fail to commit on the
first try, so everything is an edge case.

> Sink should commit everything on notifyCheckpointCompleted
> --
>
> Key: FLINK-36455
> URL: https://issues.apache.org/jira/browse/FLINK-36455
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
> Fix For: 2.0-preview
>
>
> Currently, we retry committables at some later time until they eventually
> succeed.
> However, that violates the contract of notifyCheckpointCompleted, which states
> that all side effects must be committed before the method returns. In
> particular, notifyCheckpointCompleted must fail if we cannot guarantee that
> all side effects are committed for final checkpoints. As soon as
> notifyCheckpointCompleted returns, the final checkpoint is deemed completed,
> which currently may mean that some transactions are still open.
> The solution is that all retries must happen in a closed loop in
> notifyCheckpointCompleted.





[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted

2024-10-09 Thread Galen Warren (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887949#comment-17887949
 ] 

Galen Warren commented on FLINK-36455:
--

Gotcha, thanks. Presumably that wouldn't include the legacy StreamingFileSink?

> Sink should commit everything on notifyCheckpointCompleted
> --
>
> Key: FLINK-36455
> URL: https://issues.apache.org/jira/browse/FLINK-36455
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
> Fix For: 2.0-preview
>
>
> Currently, we retry committables at some later time until they eventually
> succeed.
> However, that violates the contract of notifyCheckpointCompleted, which states
> that all side effects must be committed before the method returns. In
> particular, notifyCheckpointCompleted must fail if we cannot guarantee that
> all side effects are committed for final checkpoints. As soon as
> notifyCheckpointCompleted returns, the final checkpoint is deemed completed,
> which currently may mean that some transactions are still open.
> The solution is that all retries must happen in a closed loop in
> notifyCheckpointCompleted.





Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793008830


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceEnumeratorMetrics.java:
##
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.connectors.base.source.metrics;
+
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
+import org.apache.flink.metrics.groups.SplitEnumeratorMetricGroup;
+
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+
+/** A collection class for handling metrics in {@link 
SourceEnumeratorMetrics}. */
+public class SourceEnumeratorMetrics {
+private static final Logger LOGGER = 
LoggerFactory.getLogger(SourceEnumeratorMetrics.class);
+// Constants
+public static final int UNDEFINED = 0;
+
+// Metric names
+public static final String IS_SNAPSHOTTING = "isSnapshotting";
+public static final String IS_STREAM_READING = "isStreamReading";
+public static final String NUM_TABLES_SNAPSHOTTED = "numTablesSnapshotted";
+public static final String NUM_TABLES_REMAINING = "numTablesRemaining";
+public static final String NUM_SNAPSHOT_SPLITS_PROCESSED = 
"numSnapshotSplitsProcessed";
+public static final String NUM_SNAPSHOT_SPLITS_REMAINING = 
"numSnapshotSplitsRemaining";
+public static final String NUM_SNAPSHOT_SPLITS_FINISHED = 
"numSnapshotSplitsFinished";
+public static final String SNAPSHOT_START_TIME = "snapshotStartTime";
+public static final String SNAPSHOT_END_TIME = "snapshotEndTime";
+public static final String DATABASE_GROUP_KEY = "database";
+public static final String SCHEMA_GROUP_KEY = "schema";
+public static final String TABLE_GROUP_KEY = "table";
+
+private final SplitEnumeratorMetricGroup metricGroup;
+
+private volatile int isSnapshotting = UNDEFINED;
+private volatile int isStreamReading = UNDEFINED;
+private volatile int numTablesRemaining = 0;
+
+// Map for managing per-table metrics by table identifier
+// Key: Identifier of the table
+// Value: TableMetrics related to the table
+private final Map<TableId, TableMetrics> tableMetricsMap = new
ConcurrentHashMap<>();
+
+public SourceEnumeratorMetrics(SplitEnumeratorMetricGroup metricGroup) {
+this.metricGroup = metricGroup;
+metricGroup.gauge(IS_SNAPSHOTTING, () -> isSnapshotting);
+metricGroup.gauge(IS_STREAM_READING, () -> isStreamReading);
+metricGroup.gauge(NUM_TABLES_REMAINING, () -> numTablesRemaining);
+}
+
+public void enterSnapshotPhase() {
+this.isSnapshotting = 1;
+}
+
+public void exitSnapshotPhase() {
+this.isSnapshotting = 0;
+}
+
+public void enterStreamReading() {
+this.isStreamReading = 1;
+}
+
+public void exitStreamReading() {
+this.isStreamReading = 0;
+}
+
+public void registerMetrics(
+Gauge numTablesSnapshotted,
+Gauge numSnapshotSplitsProcessed,
+Gauge numSnapshotSplitsRemaining) {
+metricGroup.gauge(NUM_TABLES_SNAPSHOTTED, numTablesSnapshotted);
+metricGroup.gauge(NUM_SNAPSHOT_SPLITS_PROCESSED, 
numSnapshotSplitsProcessed);
+metricGroup.gauge(NUM_SNAPSHOT_SPLITS_REMAINING, 
numSnapshotSplitsRemaining);
+}
+
+public void addNewTables(int numNewTables) {
+numTablesRemaining += numNewTables;
+}
+
+public void startSnapshotTables(int numSnapshottedTables) {
+numTablesRemaining -= numSnapshottedTables;
+}
+
+public TableMetrics getTableMetrics(TableId tableId) {
+return tableMetricsMap.computeIfAbsent(
+tableId,
+key -> new TableMetrics(key.catalog(), key.schema(), 
key.table(), metricGroup));
+}
+
+// --- Helper

[jira] [Updated] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36421:
---
Labels: pull-request-available  (was: )

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After a system crash, the files written by the taskmanager may be corrupted
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  O_WRONLY|O_CREAT|O_EXCL, 0666) = 148
> [pid 947310] 08:22:59 fsync(148)= 0
> [pid 947310] 08:22:59 clos

[PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]

2024-10-09 Thread via GitHub


Planet-X opened a new pull request, #25468:
URL: https://github.com/apache/flink/pull/25468

   ## What is the purpose of the change
   
   The `FsCheckpointStreamFactory` returns handles to files containing larger 
serialized data for checkpoint creation. However, these files are not synced 
persistently to disk via `fsync` before their handles are incorporated into 
completed checkpoints. On e.g. power loss this can corrupt a checkpoint, as it 
may reference files that were lost due to the missing persistence.  
   This PR adds a `sync()` call to the `FsCheckpointStreamFactory`'s 
`closeAndGetHandle()` function, which makes sure that the returned handle 
always points to a file that has been safely persisted. 
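   
   As a standalone illustration of the failure mode (plain `java.io`, nothing 
Flink-specific assumed): closing a stream gives no persistence guarantee by 
itself, while syncing the file descriptor forces the bytes to stable storage.
   
   ```java
   import java.io.File;
   import java.io.FileOutputStream;
   import java.io.IOException;
   
   class DurableWrite {
       // close() alone does not imply fsync; getFD().sync() forces the bytes
       // to stable storage before any handle to the file is published.
       static void writeDurably(File file, byte[] data) throws IOException {
           try (FileOutputStream out = new FileOutputStream(file)) {
               out.write(data);
               out.getFD().sync(); // the step that was missing in the checkpoint path
           }
       }
   }
   ```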
   
   ## Brief change log
   
 - *Files written by `FsCheckpointStreamFactory` are now synced before 
their handle is referenced in checkpoints*
   
   ## Verifying this change
   
   There is no easy way to unit- or integration-test these changes, as OS-level 
tools are required. Correctness has been verified using `strace`; see the 
discussion at FLINK-36421.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
 - The serializers: (no)
 - The runtime per-record code paths (performance sensitive): (no)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
 - The S3 file system connector: (no)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (no)
 - If yes, how is the feature documented? (not applicable)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (FLINK-36322) Fix compile error of flink benchmark caused by breaking changes

2024-10-09 Thread Zakelly Lan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36322:
---

Assignee: Zakelly Lan

> Fix compile error of flink benchmark caused by breaking changes
> ---
>
> Key: FLINK-36322
> URL: https://issues.apache.org/jira/browse/FLINK-36322
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Benchmarks
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793008455


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceEnumeratorMetrics.java:
##
@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.connectors.base.source.metrics;
+
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
+import org.apache.flink.metrics.groups.SplitEnumeratorMetricGroup;
+
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+
+/** A collection class for handling metrics in {@link 
SourceEnumeratorMetrics}. */
+public class SourceEnumeratorMetrics {
+private static final Logger LOGGER = 
LoggerFactory.getLogger(SourceEnumeratorMetrics.class);
+// Constants
+public static final int UNDEFINED = 0;
+
+// Metric names
+public static final String IS_SNAPSHOTTING = "isSnapshotting";
+public static final String IS_STREAM_READING = "isStreamReading";
+public static final String NUM_TABLES_SNAPSHOTTED = "numTablesSnapshotted";
+public static final String NUM_TABLES_REMAINING = "numTablesRemaining";
+public static final String NUM_SNAPSHOT_SPLITS_PROCESSED = 
"numSnapshotSplitsProcessed";
+public static final String NUM_SNAPSHOT_SPLITS_REMAINING = 
"numSnapshotSplitsRemaining";
+public static final String NUM_SNAPSHOT_SPLITS_FINISHED = 
"numSnapshotSplitsFinished";
+public static final String SNAPSHOT_START_TIME = "snapshotStartTime";
+public static final String SNAPSHOT_END_TIME = "snapshotEndTime";
+public static final String DATABASE_GROUP_KEY = "database";
+public static final String SCHEMA_GROUP_KEY = "schema";
+public static final String TABLE_GROUP_KEY = "table";
+
+private final SplitEnumeratorMetricGroup metricGroup;
+
+private volatile int isSnapshotting = UNDEFINED;
+private volatile int isStreamReading = UNDEFINED;
+private volatile int numTablesRemaining = 0;
+
+// Map for managing per-table metrics by table identifier
+// Key: Identifier of the table
+// Value: TableMetrics related to the table
+private final Map tableMetricsMap = new 
ConcurrentHashMap<>();
+
+public SourceEnumeratorMetrics(SplitEnumeratorMetricGroup metricGroup) {
+this.metricGroup = metricGroup;
+metricGroup.gauge(IS_SNAPSHOTTING, () -> isSnapshotting);
+metricGroup.gauge(IS_STREAM_READING, () -> isStreamReading);
+metricGroup.gauge(NUM_TABLES_REMAINING, () -> numTablesRemaining);
+}
+
+public void enterSnapshotPhase() {
+this.isSnapshotting = 1;
+}
+
+public void exitSnapshotPhase() {
+this.isSnapshotting = 0;
+}
+
+public void enterStreamReading() {
+this.isStreamReading = 1;
+}
+
+public void exitStreamReading() {
+this.isStreamReading = 0;
+}
+
+public void registerMetrics(
+Gauge numTablesSnapshotted,
+Gauge numSnapshotSplitsProcessed,
+Gauge numSnapshotSplitsRemaining) {
+metricGroup.gauge(NUM_TABLES_SNAPSHOTTED, numTablesSnapshotted);
+metricGroup.gauge(NUM_SNAPSHOT_SPLITS_PROCESSED, 
numSnapshotSplitsProcessed);
+metricGroup.gauge(NUM_SNAPSHOT_SPLITS_REMAINING, 
numSnapshotSplitsRemaining);
+}
+
+public void addNewTables(int numNewTables) {
+numTablesRemaining += numNewTables;
+}
+
+public void startSnapshotTables(int numSnapshottedTables) {
+numTablesRemaining -= numSnapshottedTables;
+}
+
+public TableMetrics getTableMetrics(TableId tableId) {
+return tableMetricsMap.computeIfAbsent(
+tableId,
+key -> new TableMetrics(key.catalog(), key.schema(), 
key.table(), metricGroup));
+}
+
+// --- Helper

[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted

2024-10-09 Thread Galen Warren (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887940#comment-17887940
 ] 

Galen Warren commented on FLINK-36455:
--

[~arvid] I'm trying to understand the scope of this issue. Would all sinks be 
impacted? All versions of Flink? Thanks.

> Sink should commit everything on notifyCheckpointCompleted
> --
>
> Key: FLINK-36455
> URL: https://issues.apache.org/jira/browse/FLINK-36455
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Arvid Heise
>Priority: Major
> Fix For: 2.0-preview
>
>
> Currently, we retry committables some time later until they eventually 
> succeed.
> However, that violates the contract of notifyCheckpointCompleted, which states 
> that all side effects must be committed before the method returns. In 
> particular, notifyCheckpointCompleted must fail if we cannot guarantee that 
> all side effects are committed for final checkpoints. As soon as 
> notifyCheckpointCompleted returns, the final checkpoint is deemed completed, 
> which currently may mean that some transactions are still open.
> The solution is that all retries must happen in a closed loop in 
> notifyCheckpointCompleted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread Marc Aurel Fritz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887822#comment-17887822
 ] 

Marc Aurel Fritz commented on FLINK-36421:
--

[~gaborgsomogyi] We've been running the patched version for about a month on 
the most affected edge device. It's turned on and off multiple times a day, and 
pre-patch a corrupted checkpoint happened quite reliably at least once a week. 
About a week ago we also rolled out the patched version to all other edge 
devices in production and they seem to run perfectly stable. No corrupted 
checkpoints have been observed with the patch in place so far - I'll keep you 
updated if that changes!

[~zakelly] Interesting! I assumed rocksdb wouldn't be affected as I expected 
the rocksdb library to write all the serialized state to disk as part of the 
database. Does the rocksdb state backend also reference larger state as 
external files written by the {{FsCheckpointStreamFactory}}, then presumably 
just storing their references in the rocksdb database? I'll take a closer look 
at how this works!

I've opened up a PR for this here: https://github.com/apache/flink/pull/25468

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After a system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683f

[jira] [Assigned] (FLINK-36446) Refactor Job Submission Process to Use ExecutionPlan Instead of JobGraph

2024-10-09 Thread Zhu Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu reassigned FLINK-36446:
---

Assignee: Junrui Li

> Refactor Job Submission Process to Use ExecutionPlan Instead of JobGraph
> 
>
> Key: FLINK-36446
> URL: https://issues.apache.org/jira/browse/FLINK-36446
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
>
> Refactor the job submission process to submit an {{ExecutionPlan}} instead of 
> a {{JobGraph}}.
> Since {{JobGraph}} implements the {{ExecutionPlan}} interface, this change 
> will not impact the existing submission process.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-36445) Introducing an ExecutionPlan interface

2024-10-09 Thread Zhu Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu reassigned FLINK-36445:
---

Assignee: Junrui Li

> Introducing an ExecutionPlan interface
> --
>
> Key: FLINK-36445
> URL: https://issues.apache.org/jira/browse/FLINK-36445
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
>
> Introducing an {{ExecutionPlan}} interface to replace the {{JobGraph}} in the 
> job submission and recovery processes, with {{JobGraph}} implementing this 
> interface.
> Please note that this JIRA only focuses on the introduction of the interface; 
> the actual replacement of the interface will be addressed in subsequent JIRAs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-36158) Refactor Job Recovery Process to Use ExecutionPlan Instead of JobGraph

2024-10-09 Thread Zhu Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu reassigned FLINK-36158:
---

Assignee: Junrui Li

> Refactor Job Recovery Process to Use ExecutionPlan Instead of JobGraph
> --
>
> Key: FLINK-36158
> URL: https://issues.apache.org/jira/browse/FLINK-36158
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
>
> Change the existing JobGraphStore to ExecutionPlanStore support storing 
> ExecutionPlans, allowing jobs to be restored from ExecutionPlans.
> Since {{JobGraph}} implements the {{ExecutionPlan}} interface, this change 
> will not impact the existing recovery.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-36445) Introducing an ExecutionPlan interface

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36445:
---
Labels: pull-request-available  (was: )

> Introducing an ExecutionPlan interface
> --
>
> Key: FLINK-36445
> URL: https://issues.apache.org/jira/browse/FLINK-36445
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
>  Labels: pull-request-available
>
> Introducing an {{ExecutionPlan}} interface to replace the {{JobGraph}} in the 
> job submission and recovery processes, with {{JobGraph}} implementing this 
> interface.
> Please note that this JIRA only focuses on the introduction of the interface; 
> the actual replacement of the interface will be addressed in subsequent JIRAs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [FLINK-36441] Ensure producers are not leaked [flink-connector-kafka]

2024-10-09 Thread via GitHub


AHeise merged PR #126:
URL: https://github.com/apache/flink-connector-kafka/pull/126


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [FLINK-36445][runtime] Introduce ExecutionPlan interface. [flink]

2024-10-09 Thread via GitHub


JunRuiLee opened a new pull request, #25469:
URL: https://github.com/apache/flink/pull/25469

   
   
   ## What is the purpose of the change
   
   [FLINK-36445][runtime] Introduce ExecutionPlan interface.
   
   ## Brief change log
   
   Introduce the ExecutionPlan interface, with JobGraph implementing it.
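   
   As a minimal sketch of the shape of this change (the accessor names are 
assumptions for illustration, not necessarily the actual interface):
   
   ```java
   import org.apache.flink.api.common.JobID;
   
   // Sketch only: the real ExecutionPlan interface may expose different methods.
   public interface ExecutionPlan extends java.io.Serializable {
       JobID getJobID(); // assumed: anything submittable needs a job id
       String getName(); // assumed: and a human-readable name
   }
   
   // JobGraph already provides such accessors, so implementing the interface
   // leaves the existing submission path untouched.
   ```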
   
   
   ## Verifying this change
   
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (yes / **no**)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
 - The serializers: (yes / **no** / don't know)
 - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't 
know)
 - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (yes / **no**)
 - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-36441) Resource leaks in Kafka connector (Tests)

2024-10-09 Thread Arvid Heise (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887871#comment-17887871
 ] 

Arvid Heise commented on FLINK-36441:
-

Merged as 
f888d5afa7b62fe52ddad7575f3f7a4895e64453..17b26d28bf43bd47e342aaed5c07c9f5f00bdcc7.

> Resource leaks in Kafka connector (Tests)
> -
>
> Key: FLINK-36441
> URL: https://issues.apache.org/jira/browse/FLINK-36441
> Project: Flink
>  Issue Type: Bug
>Affects Versions: kafka-3.2.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
>
> Producers are leaked in quite a few tests. For non-transactional producers, 
> there is even a leak in the producer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [FLINK-35109] Bump to Flink 1.19 and support Flink 1.20 [flink-connector-kafka]

2024-10-09 Thread via GitHub


AHeise commented on PR #122:
URL: 
https://github.com/apache/flink-connector-kafka/pull/122#issuecomment-2401873098

   I forked out FLINK-36441 for the test fixes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (FLINK-36441) Resource leaks in Kafka connector (Tests)

2024-10-09 Thread Arvid Heise (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise resolved FLINK-36441.
-
Fix Version/s: kafka-3.3.0
   Resolution: Fixed

> Resource leaks in Kafka connector (Tests)
> -
>
> Key: FLINK-36441
> URL: https://issues.apache.org/jira/browse/FLINK-36441
> Project: Flink
>  Issue Type: Bug
>Affects Versions: kafka-3.2.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Major
>  Labels: pull-request-available
> Fix For: kafka-3.3.0
>
>
> Producers are leaked in quite a few tests. For non-transactional producers, 
> there is even a leak in the producer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]

2024-10-09 Thread via GitHub


gaborgsomogyi commented on PR #25468:
URL: https://github.com/apache/flink/pull/25468#issuecomment-2402241930

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887922#comment-17887922
 ] 

Gabor Somogyi commented on FLINK-36421:
---

I would backport it to at least 1.20 line to have this fix on 1.x. Thanks for 
the PR.

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After a system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  O_WRONLY|O_C

Re: [PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]

2024-10-09 Thread via GitHub


gaborgsomogyi commented on PR #25468:
URL: https://github.com/apache/flink/pull/25468#issuecomment-2402249199

   Needs a green azure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36009] Architecture tests fixes [flink-connector-jdbc]

2024-10-09 Thread via GitHub


eskabetxe commented on PR #140:
URL: 
https://github.com/apache/flink-connector-jdbc/pull/140#issuecomment-2401637907

   Created another PR with only the new module for the architecture change 
[here](https://github.com/apache/flink-connector-jdbc/pull/143/files).
   This one is kept to allow discussion about the fixes, and will be simplified 
once the other PR is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-31234][kubernetes] Add an option to redirect stdout to log dir. [flink]

2024-10-09 Thread via GitHub


xiaows08 commented on PR #22138:
URL: https://github.com/apache/flink/pull/22138#issuecomment-2401639724

   @huwh Thanks for the reply. The environment is as follows:
   flink-kubernetes-operator 1.9.0
   flink: flink:1.18.1-scala_2.12
   log4j-console.properties 是镜像默认的
   
   ```yaml
   apiVersion: flink.apache.org/v1beta1
   kind: FlinkDeployment
   metadata:
 name: basic
   spec:
 image: flink:1.18.1-scala_2.12
 flinkVersion: v1_18
 ingress:
   className: "nginx"
   template: "{{name}}.fo.xyuqing.com" # 引号必须有
 podTemplate:
   spec:
 containers:
   - name: flink-main-container
 ports:
   - name: metrics
 containerPort: 9249
 protocol: TCP
 env:
   - name: TZ  # set the container's timezone
 value: Asia/Shanghai
 flinkConfiguration:
   taskmanager.numberOfTaskSlots: "2"
   env.stdout-err.redirect-to-file: "true"
   metrics.reporter.prom.factory.class: 
org.apache.flink.metrics.prometheus.PrometheusReporterFactory
 serviceAccount: flink
 jobManager:
   resource:
 cpu: 1
 memory: 2G
 taskManager:
   resource:
 cpu: 1
 memory: 2G
 job:
   jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
   parallelism: 2
   upgradeMode: stateless
   
   ```
   
![image](https://github.com/user-attachments/assets/20975efa-0e45-4fc3-94c9-d8301b350d33)
   
   
![image](https://github.com/user-attachments/assets/ce2bcf7b-94dc-4df2-aa3f-a0d6b484c2d0)
   
   What I mean is: although `kubectl logs -f pods/xxx` no longer prints the 
log4j2 log output, the corresponding `xxx.out` file still contains the same 
log4j2 content as `xxx.log`. As a result the log4j output from the code is 
stored twice, taking up double the storage, and the `xxx.out` file cannot be 
rotated automatically by size or time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]

2024-10-09 Thread via GitHub


flinkbot commented on PR #25468:
URL: https://github.com/apache/flink/pull/25468#issuecomment-2401668508

   
   ## CI report:
   
   * fd427ffdeaa9b9fc6c31dee3dc09ca982ad7b5ba UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread Marc Aurel Fritz (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887822#comment-17887822
 ] 

Marc Aurel Fritz edited comment on FLINK-36421 at 10/9/24 8:43 AM:
---

[~gaborgsomogyi] We've been running the patched version for about a month on 
the most affected edge device. It's turned on and off multiple times a day, and 
pre-patch a corrupted checkpoint happened quite reliably at least once a week. 
About a week ago we also rolled out the patched version to all other edge 
devices in production and they seem to run perfectly stable. No corrupted 
checkpoints have been observed with the patch in place so far - I'll keep you 
updated if that changes!

[~zakelly] Interesting! I assumed rocksdb wouldn't be affected as I expected 
the rocksdb library to write all the serialized state to disk as part of the 
database. Does the rocksdb state backend also reference larger state as 
external files written by the {{FsCheckpointStreamFactory}}, then presumably 
just storing their references in the rocksdb database? I'll take a closer look 
at how this works!

I've opened up a PR for this here: [https://github.com/apache/flink/pull/25468]

Should I also open up PRs for backporting this?


was (Author: JIRAUSER305204):
[~gaborgsomogyi] We've been running the patched version for about a month on 
the most affected edge device. It's turned on and off multiple times a day, and 
pre-patch a corrupted checkpoint happened quite reliably at least once a week. 
About a week ago we also rolled out the patched version to all other edge 
devices in production and they seem to run perfectly stable. No corrupted 
checkpoints have been observed with the patch in place so far - I'll keep you 
updated if that changes!

[~zakelly] Interesting! I assumed rocksdb wouldn't be affected as I expected 
the rocksdb library to write all the serialized state to disk as part of the 
database. Does the rocksdb state backend also reference larger state as 
external files written by the {{FsCheckpointStreamFactory}}, then presumably 
just storing their references in the rocksdb database? I'll take a closer look 
at how this works!

I've opened up a PR for this here: https://github.com/apache/flink/pull/25468

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After a system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c2

[jira] [Closed] (FLINK-36064) Refactor REST API for Client and Operator Coordinator Communication to Use operatorUid.

2024-10-09 Thread Zhu Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu closed FLINK-36064.
---
Fix Version/s: 2.0.0
   Resolution: Done

63b92bd88ac32be5b3e70550ba90d828b7b655d1

> Refactor REST API for Client and Operator Coordinator Communication to Use 
> operatorUid.
> ---
>
> Key: FLINK-36064
> URL: https://issues.apache.org/jira/browse/FLINK-36064
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Reporter: Junrui Li
>Assignee: Junrui Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Currently, the REST API between clients and operator coordinators relies on 
> the operator ID to send requests to the Job Manager (JM) for querying 
> results. However, with the introduction of Stream Graph submission, the Job 
> Graph will be compiled and generated within the JM, preventing the client 
> from accessing the operator ID.
> To address this issue, we modify the REST API for communication between 
> clients and operator coordinators by removing the dependency on operatorID 
> and transitioning to a client-defined {{String operatorUid}}.
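
For illustration, the shape of the client-facing change (hedged sketch: 
{{ClusterClient#sendCoordinationRequest}} exists, but the exact new signature 
may differ from this):

{code:java}
// before: the client must know the JM-compiled OperatorID
CompletableFuture<CoordinationResponse> sendCoordinationRequest(
        JobID jobId, OperatorID operatorId, CoordinationRequest request);

// after (sketch): the coordinator is addressed by a client-defined uid,
// so the client no longer needs access to the compiled JobGraph
CompletableFuture<CoordinationResponse> sendCoordinationRequest(
        JobID jobId, String operatorUid, CoordinationRequest request);
{code}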



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36440) Bump log4j from 2.17.1 to 2.23.1

2024-10-09 Thread Ferenc Csaky (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887927#comment-17887927
 ] 

Ferenc Csaky commented on FLINK-36440:
--

The latest version is 2.24.1 [1]; any specific reason not to use that?

[1] https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core

> Bump log4j from 2.17.1 to 2.23.1
> 
>
> Key: FLINK-36440
> URL: https://issues.apache.org/jira/browse/FLINK-36440
> Project: Flink
>  Issue Type: Improvement
>Reporter: Siddharth R
>Priority: Major
>
> Bumping *log4j* to the latest version (2.23.1) - this will remediate a lot of 
> vulnerabilities in dependent packages.
> Package details:
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-1.2-api/2.23.1]
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-slf4j-impl/2.23.1]
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api/2.23.1]
>  # 
> [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core/2.23.1]
> Release notes:
> [https://logging.apache.org/log4j/2.x/release-notes.html]
>  
> A lot of bug fixes have been done in the newer versions, and I don't see any 
> breaking changes as such.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [FLINK-36241] Table API .plus should support string concatenation [flink]

2024-10-09 Thread via GitHub


dawidwys commented on PR #25298:
URL: https://github.com/apache/flink/pull/25298#issuecomment-2402337798

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted

2024-10-09 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-36455:
---

 Summary: Sink should commit everything on notifyCheckpointCompleted
 Key: FLINK-36455
 URL: https://issues.apache.org/jira/browse/FLINK-36455
 Project: Flink
  Issue Type: Bug
  Components: API / Core
Reporter: Arvid Heise
 Fix For: 2.0-preview


Currently, we retry committables some time later until they eventually 
succeed.

However, that violates the contract of notifyCheckpointCompleted, which states 
that all side effects must be committed before the method returns. In 
particular, notifyCheckpointCompleted must fail if we cannot guarantee that all 
side effects are committed for final checkpoints. As soon as 
notifyCheckpointCompleted returns, the final checkpoint is deemed completed, 
which currently may mean that some transactions are still open.

The solution is that all retries must happen in a closed loop in 
notifyCheckpointCompleted.
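
A hedged sketch of the proposed behavior against the 
{{CheckpointListener#notifyCheckpointComplete}} callback (the helper names 
below are illustrative, not an actual committer API):

{code:java}
// Sketch only: retry in a closed loop *inside* the callback, so that when it
// returns - and the (final) checkpoint is deemed completed - every side
// effect has actually been committed.
@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
    List<Committable> pending = pendingCommittablesUpTo(checkpointId); // hypothetical
    while (!pending.isEmpty()) {
        pending = tryCommit(pending); // hypothetical: returns what still needs retrying
        if (!pending.isEmpty()) {
            backoff(); // hypothetical: avoid a hot retry loop
        }
    }
}
{code}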



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36448) Compilation fails in IntelliJ

2024-10-09 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887865#comment-17887865
 ] 

Xintong Song commented on FLINK-36448:
--

[~mallikarjuna],

I'm not entirely sure, but this seems related: 
[https://stackoverflow.com/questions/69306905/error-package-sun-rmi-registry-is-declared-in-module-java-rmi-which-does-not]

I think we have added such options in the pom files, which might not be 
recognized by the IDE. Have you tried "Reload All Maven Projects" in the IDE?

cc [~sxnan]  
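
For reference, a compiler flag of this shape usually resolves that class of 
error (assumption: it should match what the poms already add - worth verifying 
there); in IntelliJ it can also be set under the Java compiler's "Additional 
command line parameters":

{code}
--add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
{code}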

> Compilation fails in IntelliJ
> -
>
> Key: FLINK-36448
> URL: https://issues.apache.org/jira/browse/FLINK-36448
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Kumar Mallikarjuna
>Priority: Major
>  Labels: build, build-failure, intellij
>
> I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and 
> tests work just fine when using Maven directly. However building the project 
> and running unit tests with IntelliJ fail with the below error:
>  
> {code:java}
> /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java
> package sun.rmi.registry is not visible
>   (package sun.rmi.registry is declared in module java.rmi, which does not 
> export it to the unnamed module)
> sun.rmi.registry{code}
> I also tried the "Use '–release' option for cross-compilation" setting in 
> IntelliJ but that leads to the same error. I am using JDK11 for compilation.
> cc: [~mapohl] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [FLINK-36450] exclude log4j from curator-test [flink-kubernetes-operator]

2024-10-09 Thread via GitHub


r-sidd opened a new pull request, #889:
URL: https://github.com/apache/flink-kubernetes-operator/pull/889

   ## What is the purpose of the change
   
   Exclude the vulnerable log4j dependency from curator-test
   
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., are there any changes to the 
`CustomResourceDescriptors`: no
 - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-36450) Exclude log4j from curator-test

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36450:
---
Labels: pull-request-available  (was: )

> Exclude log4j from curator-test
> ---
>
> Key: FLINK-36450
> URL: https://issues.apache.org/jira/browse/FLINK-36450
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.10.0
>Reporter: Siddharth R
>Priority: Major
>  Labels: pull-request-available
>
> Exclude vulnerable log4j dependency from curator-test
>  
> {code:xml}
> <exclusion>
>   <groupId>log4j</groupId>
>   <artifactId>log4j</artifactId>
> </exclusion>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [FLINK-36064][runtime] Refactor REST API for Client and Operator Coordinator Communication to Use String OperatorUid. [flink]

2024-10-09 Thread via GitHub


zhuzhurk closed pull request #25227: [FLINK-36064][runtime] Refactor REST API 
for Client and Operator Coordinator Communication to Use String OperatorUid.
URL: https://github.com/apache/flink/pull/25227


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792981064


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/assigner/HybridSplitAssigner.java:
##
@@ -137,6 +161,7 @@ public Optional getNext() {
 // assigning the stream split. Otherwise, records emitted from 
stream split
 // might be out-of-order in terms of same primary key with 
snapshot splits.
 isStreamSplitAssigned = true;
+enumeratorMetrics.enterStreamReading();

Review Comment:
   > If 
`isNewlyAddedAssigningFinished(snapshotSplitAssigner.getAssignerStatus())`, 
this should invoke `enumeratorMetrics.enterStreamReading();`.
   
   Yes, I did miss it. I have already made adjustments



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792984158


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java:
##
@@ -60,4 +110,146 @@ public void recordFetchDelay(long fetchDelay) {
 public void addNumRecordsInErrors(long delta) {
 this.numRecordsInErrorsCounter.inc(delta);
 }
+
+public void updateLastReceivedEventTime(Long eventTimestamp) {
+if (eventTimestamp != null && eventTimestamp > 0L) {
+lastReceivedEventTime = eventTimestamp;
+}
+}
+
+public void markRecord() {
+metricGroup.getIOMetricGroup().getNumRecordsInCounter().inc();
+}
+
+public void updateRecordCounters(SourceRecord record) {
+catchAndWarnLogAllExceptions(
+() -> {
+// Increase reader and table level input counters
+if (isDataChangeRecord(record)) {
+TableMetrics tableMetrics = 
getTableMetrics(getTableId(record));
+Envelope.Operation op = Envelope.operationFor(record);
+switch (op) {
+case READ:
+snapshotCounter.inc();
+tableMetrics.markSnapshotRecord();
+break;
+case CREATE:
+insertCounter.inc();
+tableMetrics.markInsertRecord();
+break;
+case DELETE:
+deleteCounter.inc();
+tableMetrics.markDeleteRecord();
+break;
+case UPDATE:
+updateCounter.inc();
+tableMetrics.markUpdateRecord();
+break;
+}
+} else if (isSchemaChangeEvent(record)) {
+schemaChangeCounter.inc();
+TableId tableId = getTableId(record);
+if (tableId != null) {
+getTableMetrics(tableId).markSchemaChangeRecord();
+}
+}
+});
+}
+
+private TableMetrics getTableMetrics(TableId tableId) {
+return tableMetricsMap.computeIfAbsent(
+tableId,
+id -> new TableMetrics(id.catalog(), id.schema(), id.table(), 
metricGroup));
+}
+
+// --- Helper functions 
-
+
+private void catchAndWarnLogAllExceptions(Runnable runnable) {
+try {
+runnable.run();
+} catch (Exception e) {
+// Catch all exceptions as errors in metric handling should not 
fail the job
+LOG.warn("Failed to update metrics", e);
+}
+}
+
+private long getCurrentEventTimeLag() {
+if (lastReceivedEventTime == UNDEFINED) {
+return UNDEFINED;
+}
+return SystemClock.getInstance().absoluteTimeMillis() - 
lastReceivedEventTime;
+}
+
+// --- Helper classes 

+
+/**
+ * Collection class for managing metrics of a table.
+ *
+ * Metrics of table level are registered in its corresponding subgroup 
under the {@link
+ * SourceReaderMetricGroup}.
+ */
+private static class TableMetrics {
+// Snapshot + Stream
+private final Counter recordsCounter;
+
+// Snapshot phase
+private final Counter snapshotCounter;
+
+// Stream phase
+private final Counter insertCounter;
+private final Counter updateCounter;
+private final Counter deleteCounter;
+private final Counter schemaChangeCounter;
+
+public TableMetrics(
+String databaseName, String schemaName, String tableName, 
MetricGroup parentGroup) {
+databaseName = processNull(databaseName);
+schemaName = processNull(schemaName);
+tableName = processNull(tableName);

Review Comment:
   > Will there be a problem if we provide an empty string for the group?
   
   MetricGroup does not allow the value to be null. To stay compatible with the 
multi-layer naming models of different databases, null is uniformly treated as 
an empty string.
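   
   (A plausible shape of that helper, purely for illustration - the actual 
`processNull` in the PR may differ:)
   
   ```java
   // Assumed behavior: MetricGroup names must be non-null, so map null to "".
   private static String processNull(String value) {
       return value == null ? "" : value;
   }
   ```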



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792988134


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java:
##
@@ -39,14 +80,23 @@ public class SourceReaderMetrics {
 /** The total number of record that failed to consume, process or emit. */
 private final Counter numRecordsInErrorsCounter;
 
+private volatile long lastReceivedEventTime = UNDEFINED;
+private volatile long currentReadTimestampMs = UNDEFINED;

Review Comment:
   Ok, it has been adjusted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792988461


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java:
##
@@ -60,4 +110,146 @@ public void recordFetchDelay(long fetchDelay) {
 public void addNumRecordsInErrors(long delta) {
 this.numRecordsInErrorsCounter.inc(delta);
 }
+
+public void updateLastReceivedEventTime(Long eventTimestamp) {
+if (eventTimestamp != null && eventTimestamp > 0L) {
+lastReceivedEventTime = eventTimestamp;
+}
+}
+
+public void markRecord() {
+metricGroup.getIOMetricGroup().getNumRecordsInCounter().inc();

Review Comment:
   Ok, it has been adjusted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792989625


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java:
##
@@ -39,14 +80,23 @@ public class SourceReaderMetrics {
 /** The total number of record that failed to consume, process or emit. */
 private final Counter numRecordsInErrorsCounter;
 
+private volatile long lastReceivedEventTime = UNDEFINED;
+private volatile long currentReadTimestampMs = UNDEFINED;
+
 public SourceReaderMetrics(SourceReaderMetricGroup metricGroup) {
 this.metricGroup = metricGroup;
 this.numRecordsInErrorsCounter = 
metricGroup.getNumRecordsInErrorsCounter();
-}
 
-public void registerMetrics() {
 metricGroup.gauge(
 MetricNames.CURRENT_FETCH_EVENT_TIME_LAG, (Gauge) 
this::getFetchDelay);
+metricGroup.gauge(CURRENT_READ_TIMESTAMP_MS, () -> 
currentReadTimestampMs);

Review Comment:
   Yes, I have already deleted it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792982379


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/assigner/HybridSplitAssigner.java:
##
@@ -137,6 +161,7 @@ public Optional getNext() {
 // assigning the stream split. Otherwise, records emitted from 
stream split
 // might be out-of-order in terms of same primary key with 
snapshot splits.
 isStreamSplitAssigned = true;
+enumeratorMetrics.enterStreamReading();

Review Comment:
   > `enumeratorMetrics.exitStreamReading()` should be invoked in `addSplits`.
   
   This is handled inside `snapshotSplitAssigner.addSplits`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]

2024-10-09 Thread via GitHub


liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792986970


##
flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/reader/IncrementalSourceRecordEmitter.java:
##
@@ -169,9 +172,14 @@ protected void reportMetrics(SourceRecord element) {
 }
 }
 
-private static class OutputCollector implements Collector {
-private SourceOutput output;
-private Long currentMessageTimestamp;
+/**
+ * Collector for outputting records.
+ *
+ * @param 
+ */

Review Comment:
   Ok, it has been adjusted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility

2024-10-09 Thread yux (Jira)
yux created FLINK-36453:
---

 Summary: Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver 
compatibility
 Key: FLINK-36453
 URL: https://issues.apache.org/jira/browse/FLINK-36453
 Project: Flink
  Issue Type: Bug
  Components: Flink CDC
Reporter: yux






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36452) Avoid EOFException when reading binlog after a large table has just completed snapshot reading

2024-10-09 Thread LvYanquan (Jira)
LvYanquan created FLINK-36452:
-

 Summary: Avoid EOFException when reading binlog after a large 
table has just completed snapshot reading
 Key: FLINK-36452
 URL: https://issues.apache.org/jira/browse/FLINK-36452
 Project: Flink
  Issue Type: Improvement
  Components: Flink CDC
Affects Versions: cdc-3.3.0
Reporter: LvYanquan
 Fix For: cdc-3.3.0


When reading the binlog after a large table has just completed snapshot reading, we 
need to compare the binlog position with all previous chunks to determine whether 
the data has already been processed. However, this comparison is quite 
time-consuming and may lead to an EOFException in the binlog client.

We should shorten the comparison time as much as possible.
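
One possible direction (an assumption on my side, not a committed design) is to cache the maximum high-watermark offset across all finished snapshot chunks, so most binlog positions can be accepted after a single comparison:
{code:java}
// Hypothetical sketch: precompute the max high watermark once after the
// snapshot phase finishes, and short-circuit the per-chunk comparison.
private BinlogOffset maxHighWatermark;

boolean shouldEmit(BinlogOffset position, TableId tableId) {
    if (position.isAfter(maxHighWatermark)) {
        // Past every chunk's high watermark: cannot be a duplicate.
        return true;
    }
    // Only fall back to the expensive scan over all chunks when needed.
    return compareAgainstFinishedChunks(position, tableId);
}
{code}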



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility

2024-10-09 Thread yux (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yux updated FLINK-36453:

Description: 
DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in 
versions prior to 2.2.

We may backport this bugfix for now before updating to Debezium 2.x.

> Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility
> ---
>
> Key: FLINK-36453
> URL: https://issues.apache.org/jira/browse/FLINK-36453
> Project: Flink
>  Issue Type: Bug
>  Components: Flink CDC
>Reporter: yux
>Priority: Major
>
> DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in 
> versions prior to 2.2.
> We may backport this bugfix for now before updating to Debezium 2.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility

2024-10-09 Thread yux (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887877#comment-17887877
 ] 

yux commented on FLINK-36453:
-

I'd love to take this ticket. [~leonard]

> Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility
> ---
>
> Key: FLINK-36453
> URL: https://issues.apache.org/jira/browse/FLINK-36453
> Project: Flink
>  Issue Type: Bug
>  Components: Flink CDC
>Reporter: yux
>Priority: Major
>
> DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in 
> versions prior to 2.2.
> We may backport this bugfix for now before updating to Debezium 2.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-36322) Fix compile error of flink benchmark caused by breaking changes

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36322:
---
Labels: pull-request-available  (was: )

> Fix compile error of flink benchmark caused by breaking changes
> ---
>
> Key: FLINK-36322
> URL: https://issues.apache.org/jira/browse/FLINK-36322
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Benchmarks
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-36452) Avoid EOFException when reading binlog after a large table has just completed snapshot reading

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36452:
---
Labels: pull-request-available  (was: )

> Avoid EOFException when reading binlog after a large table has just completed 
> snapshot reading
> --
>
> Key: FLINK-36452
> URL: https://issues.apache.org/jira/browse/FLINK-36452
> Project: Flink
>  Issue Type: Improvement
>  Components: Flink CDC
>Affects Versions: cdc-3.3.0
>Reporter: LvYanquan
>Priority: Major
>  Labels: pull-request-available
> Fix For: cdc-3.3.0
>
>
> When reading the binlog after a large table has just completed snapshot 
> reading, we need to compare the binlog position with all previous chunks to 
> determine whether the data has already been processed. However, this 
> comparison is quite time-consuming and may lead to an EOFException in the 
> binlog client. We should shorten the comparison time as much as possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility

2024-10-09 Thread Leonard Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Xu reassigned FLINK-36453:
--

Assignee: yux

> Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility
> ---
>
> Key: FLINK-36453
> URL: https://issues.apache.org/jira/browse/FLINK-36453
> Project: Flink
>  Issue Type: Bug
>  Components: Flink CDC
>Reporter: yux
>Assignee: yux
>Priority: Major
>
> DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in 
> versions prior to 2.2.
> We may backport this bugfix for now before updating to Debezium 2.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (FLINK-36452) Avoid EOFException when reading binlog after a large table has just completed snapshot reading

2024-10-09 Thread Leonard Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonard Xu reassigned FLINK-36452:
--

Assignee: LvYanquan

> Avoid EOFException when reading binlog after a large table has just completed 
> snapshot reading
> --
>
> Key: FLINK-36452
> URL: https://issues.apache.org/jira/browse/FLINK-36452
> Project: Flink
>  Issue Type: Improvement
>  Components: Flink CDC
>Affects Versions: cdc-3.3.0
>Reporter: LvYanquan
>Assignee: LvYanquan
>Priority: Major
>  Labels: pull-request-available
> Fix For: cdc-3.3.0
>
>
> When reading the binlog after a large table has just completed snapshot 
> reading, we need to compare the binlog position with all previous chunks to 
> determine whether the data has already been processed. However, this 
> comparison is quite time-consuming and may lead to an EOFException in the 
> binlog client. We should shorten the comparison time as much as possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36448) Compilation fails in IntelliJ

2024-10-09 Thread Kumar Mallikarjuna (Jira)
Kumar Mallikarjuna created FLINK-36448:
--

 Summary: Compilation fails in IntelliJ
 Key: FLINK-36448
 URL: https://issues.apache.org/jira/browse/FLINK-36448
 Project: Flink
  Issue Type: Bug
  Components: API / Core
Reporter: Kumar Mallikarjuna


I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and 
tests work just fine when using Maven directly. However, building the project 
and running unit tests with IntelliJ fail with the below error:

 
{code:java}
/Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java
package sun.rmi.registry is not visible
  (package sun.rmi.registry is declared in module java.rmi, which does not 
export it to the unnamed module)
sun.rmi.registry{code}
I also tried the "Use '--release' option for cross-compilation" setting in 
IntelliJ, but that leads to the same error. I am using JDK 11 for compilation.
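
A possible workaround (untested here; the standard JPMS fix for this class of error) is to export the package to the unnamed module via an extra javac argument in IntelliJ's compiler settings:
{code}
--add-exports java.rmi/sun.rmi.registry=ALL-UNNAMED
{code}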

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36448) Compilation fails in IntelliJ

2024-10-09 Thread Kumar Mallikarjuna (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887846#comment-17887846
 ] 

Kumar Mallikarjuna commented on FLINK-36448:


Hi [~xtsong] , would you know what's wrong here?

> Compilation fails in IntelliJ
> -
>
> Key: FLINK-36448
> URL: https://issues.apache.org/jira/browse/FLINK-36448
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Kumar Mallikarjuna
>Priority: Major
>  Labels: build, build-failure, intellij
>
> I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and 
> tests work just fine when using Maven directly. However, building the project 
> and running unit tests with IntelliJ fail with the below error:
>  
> {code:java}
> /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java
> package sun.rmi.registry is not visible
>   (package sun.rmi.registry is declared in module java.rmi, which does not 
> export it to the unnamed module)
> sun.rmi.registry{code}
> I also tried the "Use '--release' option for cross-compilation" setting in 
> IntelliJ, but that leads to the same error. I am using JDK 11 for compilation.
> cc: [~mapohl] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-36448) Compilation fails in IntelliJ

2024-10-09 Thread Kumar Mallikarjuna (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kumar Mallikarjuna updated FLINK-36448:
---
Description: 
I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and 
tests work just fine when using Maven directly. However, building the project 
and running unit tests with IntelliJ fail with the below error:

 
{code:java}
/Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java
package sun.rmi.registry is not visible
  (package sun.rmi.registry is declared in module java.rmi, which does not 
export it to the unnamed module)
sun.rmi.registry{code}
I also tried the "Use '--release' option for cross-compilation" setting in 
IntelliJ, but that leads to the same error. I am using JDK 11 for compilation.

cc: [~mapohl] 

 

  was:
I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and 
tests work just fine when using Maven directly. However, building the project 
and running unit tests with IntelliJ fail with the below error:

 
{code:java}
/Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java
package sun.rmi.registry is not visible
  (package sun.rmi.registry is declared in module java.rmi, which does not 
export it to the unnamed module)
sun.rmi.registry{code}
I also tried the "Use '--release' option for cross-compilation" setting in 
IntelliJ, but that leads to the same error. I am using JDK 11 for compilation.

 


> Compilation fails in IntelliJ
> -
>
> Key: FLINK-36448
> URL: https://issues.apache.org/jira/browse/FLINK-36448
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Reporter: Kumar Mallikarjuna
>Priority: Major
>  Labels: build, build-failure, intellij
>
> I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and 
> tests work just fine when using Maven directly. However, building the project 
> and running unit tests with IntelliJ fail with the below error:
>  
> {code:java}
> /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java
> package sun.rmi.registry is not visible
>   (package sun.rmi.registry is declared in module java.rmi, which does not 
> export it to the unnamed module)
> sun.rmi.registry{code}
> I also tried the "Use '--release' option for cross-compilation" setting in 
> IntelliJ, but that leads to the same error. I am using JDK 11 for compilation.
> cc: [~mapohl] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36449) Bump apache-rat-plugin from 0.12 to 0.16.1

2024-10-09 Thread Siddharth R (Jira)
Siddharth R created FLINK-36449:
---

 Summary: Bump apache-rat-plugin from 0.12 to 0.16.1
 Key: FLINK-36449
 URL: https://issues.apache.org/jira/browse/FLINK-36449
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.10.0
Reporter: Siddharth R


Bump apache-rat-plugin from 0.12 to 0.16.1 to remediate the underlying 
vulnerabilities in the dependencies.

 

Vulnerabilities from dependencies:
[CVE-2022-4245|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4245]
[CVE-2022-4244|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4244]
[CVE-2020-15250|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15250]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [FLINK-35363] Architecture tests refactor [flink-connector-jdbc]

2024-10-09 Thread via GitHub


eskabetxe opened a new pull request, #143:
URL: https://github.com/apache/flink-connector-jdbc/pull/143

   Created the module 
[flink-connector-jdbc-architecture](https://issues.apache.org/jira/browse/FLINK-connector-jdbc-architecture)
 to hold all architecture tests; each module must be added there as a 
dependency so that it gets validated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-36437][hive] Remove hive connector dependency usage [flink]

2024-10-09 Thread via GitHub


flinkbot commented on PR #25467:
URL: https://github.com/apache/flink/pull/25467#issuecomment-2401656939

   
   ## CI report:
   
   * 8675be3d2aef7c1bd59b40573b01741181702929 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-36454) FlinkContainer always overwrites default config

2024-10-09 Thread Arvid Heise (Jira)
Arvid Heise created FLINK-36454:
---

 Summary: FlinkContainer always overwrites default config
 Key: FLINK-36454
 URL: https://issues.apache.org/jira/browse/FLINK-36454
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 2.0-preview
Reporter: Arvid Heise


Because FlinkContainer disregards the default config, some meaningful options 
are lost when executing containers in tests. For example, because we overwrite 
env.java.opts.all, all --add-opens flags are lost, leading to exceptions like:

{code:java}
 10:26:26,171 [main] ERROR 
org.apache.flink.connector.testframe.container.FlinkContainers [] - 
java.lang.ExceptionInInitializerError
at 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:109)
at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:1026)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:247)
at 
org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1270)
at 
org.apache.flink.client.cli.CliFrontend.lambda$mainInternal$10(CliFrontend.java:1367)
at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
at 
org.apache.flink.client.cli.CliFrontend.mainInternal(CliFrontend.java:1367)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1335)
Caused by: java.lang.RuntimeException: Can not register process function 
transformation translator.
at 
org.apache.flink.datastream.impl.ExecutionEnvironmentImpl.(ExecutionEnvironmentImpl.java:98)
... 8 more
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field 
private final java.util.Map java.util.Collections$UnmodifiableMap.m accessible: 
module java.base does not "opens java.util" to unnamed module @771b8d5c
at 
java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(Unknown 
Source)
at 
java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(Unknown 
Source)
at java.base/java.lang.reflect.Field.checkCanSetAccessible(Unknown 
Source)
at java.base/java.lang.reflect.Field.setAccessible(Unknown Source)
at 
org.apache.flink.streaming.runtime.translators.DataStreamV2SinkTransformationTranslator.registerSinkTransformationTranslator(DataStreamV2SinkTransformationTranslator.java:104)
at 
org.apache.flink.datastream.impl.ExecutionEnvironmentImpl.(ExecutionEnvironmentImpl.java:96)
... 8 more {code}

FlinkImageBuilder should read the default config and only overwrite the custom 
settings in
{code:java}
private Path createTemporaryFlinkConfFile(Configuration finalConfiguration, 
Path tempDirectory)
throws IOException {
Path flinkConfFile = 
tempDirectory.resolve(GlobalConfiguration.FLINK_CONF_FILENAME);
Files.write(
flinkConfFile,
ConfigurationUtils.convertConfigToWritableLines(finalConfiguration, 
false));

return flinkConfFile;
} {code}
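
A sketch of that suggestion (assuming a `defaultConfDir` pointing at the distribution's conf directory; `GlobalConfiguration.loadConfiguration` and `Configuration.addAll` are existing Flink APIs, the merge placement itself is an assumption):
{code:java}
private Path createTemporaryFlinkConfFile(Configuration customConfiguration, Path tempDirectory)
        throws IOException {
    // Start from the distribution's default config instead of an empty one ...
    Configuration merged = GlobalConfiguration.loadConfiguration(defaultConfDir.toString());
    // ... and only overlay the test-specific settings on top of it.
    merged.addAll(customConfiguration);

    Path flinkConfFile = tempDirectory.resolve(GlobalConfiguration.FLINK_CONF_FILENAME);
    Files.write(
            flinkConfFile,
            ConfigurationUtils.convertConfigToWritableLines(merged, false));
    return flinkConfFile;
}
{code}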

The workaround is to set the option manually in the test, but that may become 
outdated on version upgrades.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [hotfix] Fix testfail due to data race with after migration to Fabric8 interceptor [flink-kubernetes-operator]

2024-10-09 Thread via GitHub


aplyusnin opened a new pull request, #891:
URL: https://github.com/apache/flink-kubernetes-operator/pull/891

   Tests that perform requests now usually fail on CI due to race conditions.
   
   To fix this, I added a busy-wait for metrics with a timeout, as is done in 
RestApiMetricsCollectorTest#testJmMetricCollection().
   I also added an extra delay in 
RestApiMetricsCollectorTest#testJmMetricCollection(); it helps avoid CI 
failures on my workload runs.
   
   Alternative fix:
   Add blocking inside the tests until the requests have completed, but this 
requires a lot of modifications.
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changes to the `CustomResourceDescriptors`: 
no
 - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not documented
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [hotfix] Fix testfail due to data race with after migration to Fabric8 interceptor [flink-kubernetes-operator]

2024-10-09 Thread via GitHub


aplyusnin commented on PR #891:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/891#issuecomment-2403691503

   ```kubernetesClient.resource(deployment).delete()``` has to be a blocking 
call, but when I ran the tests with some extra logging, I saw that the 404 
response code doesn't appear before the assertions on CI (I don't actually know 
why; I didn't hit that problem locally). After some experiments, I concluded 
that this approach fixes the problem.
   
   I agree that it is not a good way to fix the problem, but I don't know how 
to properly add extra blocking for the main test thread to get rid of it.
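   
   For what it's worth, the busy-wait looks roughly like this (assuming Awaitility is on the test classpath; `collector` and `metricName` are placeholders):
   
   ```java
   // Poll until the metric reflects the deletion instead of asserting
   // immediately after the (possibly not-yet-observed) delete call.
   await().atMost(Duration.ofSeconds(30))
          .pollInterval(Duration.ofMillis(100))
          .untilAsserted(
                  () -> assertThat(collector.getMetrics()).containsKey(metricName));
   ```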



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-25546][flink-connector-base][JUnit5 Migration] Module: flink-connector-base [flink]

2024-10-09 Thread via GitHub


Jiabao-Sun commented on code in PR #21161:
URL: https://github.com/apache/flink/pull/21161#discussion_r1794452407


##
flink-connectors/flink-connector-base/src/test/java/org/apache/flink/connector/base/sink/AsyncSinkBaseITCase.java:
##
@@ -32,7 +32,7 @@
 
 /** Integration tests of a baseline generic sink that implements the 
AsyncSinkBase. */
 @ExtendWith(TestLoggerExtension.class)

Review Comment:
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-36458) Use original 'OceanBaseCEContainer' of testcontainers in testing

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36458:
---
Labels: pull-request-available  (was: )

> Use original 'OceanBaseCEContainer' of testcontainers in testing
> 
>
> Key: FLINK-36458
> URL: https://issues.apache.org/jira/browse/FLINK-36458
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / JDBC
>Reporter: He Wang
>Priority: Minor
>  Labels: pull-request-available
>
> The docker image of OceanBase has been refactored, which caused some 
> incompatibility with the latest testcontainers, so an 'OceanBaseContainer' 
> implementation was introduced in FLINK-34572. Now the latest testcontainers 
> are fully compatible with the new Docker image, so we can use the original 
> 'OceanBaseCEContainer' to replace the existing OceanBaseContainer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [FLINK-30899][connector/filesystem] Fix FileSystemTableSource incorrectly selecting fields on partitioned tables [flink]

2024-10-09 Thread via GitHub


flinkbot commented on PR #25476:
URL: https://github.com/apache/flink/pull/25476#issuecomment-2404177936

   
   ## CI report:
   
   * 4cf5dec3610ec538a01e66210e753478fb6ff216 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [FLINK-30899][connector/filesystem] Fix FileSystemTableSource incorrectly selecting fields on partitioned tables [flink]

2024-10-09 Thread via GitHub


flinkbot commented on PR #25477:
URL: https://github.com/apache/flink/pull/25477#issuecomment-2404181439

   
   ## CI report:
   
   * ea2e19fbae18568d95b7195492cbe3ef448f1892 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [FLINK-36460] Refactoring the CI matrix [flink-kubernetes-operator]

2024-10-09 Thread via GitHub


SamBarker opened a new pull request, #893:
URL: https://github.com/apache/flink-kubernetes-operator/pull/893

   ## What is the purpose of the change
   
   As observed in 
[FLINK-36332](https://github.com/apache/flink-kubernetes-operator/pull/881#issuecomment-2378765602)
 the CI workflow has become difficult to manage. This PR attempts to split the 
tests along sensible lines to make the matrix more manageable. 
   
   ## Brief change log
   
   Simplify CI workflows.
   
   ## Verifying this change
   
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): **no**
 - The public API, i.e., is any changes to the `CustomResourceDescriptors`: 
**no**
 - Core observer or reconciler logic that is regularly executed: **no**
   
   ## Documentation
   
 - Does this pull request introduce a new feature? **no**
 - If yes, how is the feature documented? **not applicable**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (FLINK-36460) Refactor Kubernetes Operator Edge to Edge test setup

2024-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-36460:
---
Labels: pull-request-available  (was: )

> Refactor Kubernetes Operator Edge to Edge test setup
> 
>
> Key: FLINK-36460
> URL: https://issues.apache.org/jira/browse/FLINK-36460
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Sam Barker
>Priority: Major
>  Labels: pull-request-available
>
> The edge to edge operator tests are getting difficult to control and evolve, 
> as was noticed in FLINK-36332. 
> We should try to find clear seams along which to break up the test suite, to 
> avoid running lots of extra tests or creating mutually exclusive include and 
> exclude conditions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] Fix flaky tests in classes [flink]

2024-10-09 Thread via GitHub


nikunjagarwal321 opened a new pull request, #25471:
URL: https://github.com/apache/flink/pull/25471

   ## What is the purpose of the change
   
   [NonDex](https://github.com/TestingResearchIllinois/NonDex) is a tool for 
detecting and debugging wrong assumptions on under-determined Java APIs. While 
running the test cases using NonDex, flaky tests were found in the following 
classes : 
   - org.apache.flink.table.planner.plan.batch.sql.PartitionableSourceTest 
   - 
org.apache.flink.table.planner.plan.rules.logical.PushPartitionIntoTableSourceScanRuleTest
   
   The flaky tests can be found when running the following command:
   mvn edu.illinois:nondex-maven-plugin:2.1.7:nondex -Dtest={test} 
   
   Sample Error : 
   
   ```
   PartitionableSourceTest.testUnconvertedExpression:136 optimized exec plan 
==> expected: <
   Calc(select=[id, name, part1, part2, (part2 + 1) AS virtualField])
   +- TableSourceScan(table=[[default_catalog, default_database, 
PartitionableTable, partitions=[{part1=A, part2=2}]]], fields=[id, name, part1, 
part2])
   > but was: <
   Calc(select=[id, name, part1, part2, (part2 + 1) AS virtualField])
   +- TableSourceScan(table=[[default_catalog, default_database, 
PartitionableTable, partitions=[{part2=2, part1=A}]]], fields=[id, name, part1, 
part2])
   >
   ```
   
   - The fix is to impose a deterministic ordering in `PartitionPushDownSpec` and 
`PushPartitionIntoTableSourceScanRuleTest` when converting the partition Map to 
a String, so that the same String is always produced and the tests become stable.
   The function getDigests() converts the partitions to a String, and the 
iteration order of a Map is nondeterministic, so it may differ between runs.
   Also, the expected strings used in the tests are hardcoded in XML files and 
only contain one ordering.
   
   Hence, we can sort the entries in PartitionPushDownSpec.getDigests() so the 
string plan always renders the Map in the same order (see the sketch below).
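   
   A minimal sketch of the ordering fix (the method shape is an assumption, not the exact production code):
   
   ```java
   // Render partition specs with entries sorted by key so the digest string
   // is stable regardless of the Map's iteration order.
   private static String partitionToDeterministicString(Map<String, String> partition) {
       return partition.entrySet().stream()
               .sorted(Map.Entry.comparingByKey())
               .map(e -> e.getKey() + "=" + e.getValue())
               .collect(Collectors.joining(", ", "{", "}"));
   }
   ```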
   
   
   
   ## Brief change log
   
 - Sort the partition `Map` entries deterministically when converting them to 
strings in `PartitionPushDownSpec.getDigests()` and in 
`PushPartitionIntoTableSourceScanRuleTest`
   
   
   ## Verifying this change
   
   This change is already covered by existing tests, such as 
*org.apache.flink.table.planner.plan.batch.sql.PartitionableSourceTest*.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency):  no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
 - The serializers: don't know
 - The runtime per-record code paths (performance sensitive): no
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - If yes, how is the feature documented? not applicable 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [hotfix][state/forst] Remove string.format() in critical path [flink]

2024-10-09 Thread via GitHub


flinkbot commented on PR #25473:
URL: https://github.com/apache/flink/pull/25473#issuecomment-2403913805

   
   ## CI report:
   
   * dd54c09afab52f3178da592cd9196ad3036e66b4 UNKNOWN
   
   
   Bot commands
 The @flinkbot bot supports the following commands:
   
- `@flinkbot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (FLINK-36460) Refactor Kubernetes Operator Edge to Edge test setup

2024-10-09 Thread Sam Barker (Jira)
Sam Barker created FLINK-36460:
--

 Summary: Refactor Kubernetes Operator Edge to Edge test setup
 Key: FLINK-36460
 URL: https://issues.apache.org/jira/browse/FLINK-36460
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Sam Barker


The edge to edge operator tests are getting difficult to control and evolve, as 
was noticed in FLINK-36332. 

We should try to find clear seams along which to break up the test suite, to 
avoid running lots of extra tests or creating mutually exclusive include and 
exclude conditions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [hotfix][state/forst] Remove string.format() in critical path [flink]

2024-10-09 Thread via GitHub


fredia opened a new pull request, #25473:
URL: https://github.com/apache/flink/pull/25473

   
   
   ## What is the purpose of the change
   
   This PR removes String.format() from the critical path in ForSt map state.
   
   ## Brief change log
   - Remove String.format() in `ForStMapState#buildDBPutRequest`
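   
   Illustrative before/after (the exact key layout in `buildDBPutRequest` is an assumption):
   
   ```java
   // Before: String.format allocates a Formatter for every record on the hot path.
   // String stateId = String.format("%s#%s", keyPrefix, userKey);
   
   // After: plain concatenation, which javac lowers to a cheap StringBuilder chain.
   String stateId = keyPrefix + "#" + userKey;
   ```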
   
   
   ## Verifying this change
   
   Please make sure both new and modified tests in this PR follow [the 
conventions for tests defined in our code quality 
guide](https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing).
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): (no)
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
 - The serializers: (no)
 - The runtime per-record code paths (performance sensitive): (yes)
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no )
 - The S3 file system connector: (no)
   
   ## Documentation
   
 - Does this pull request introduce a new feature? (no)
 - If yes, how is the feature documented? (not applicable / docs / JavaDocs 
/ not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


