Re: [PR] [FLINK-20625] Implement V2 Google Cloud PubSub Source in accordance with FLIP-27 [flink-connector-gcp-pubsub]
clmccart commented on PR #32:
URL: https://github.com/apache/flink-connector-gcp-pubsub/pull/32#issuecomment-2402775724

looks like the sink rewrite was completed in [FLINK-24298](https://issues.apache.org/jira/browse/FLINK-24298). will need to take a look at that and probably remove the sink changes from this PR
[jira] [Updated] (FLINK-36456) Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key
[ https://issues.apache.org/jira/browse/FLINK-36456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Garcia updated FLINK-36456:
--------------------------------
    Component/s: Runtime / Configuration
                     (was: Runtime / Metrics)

> Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key
>
>                 Key: FLINK-36456
>                 URL: https://issues.apache.org/jira/browse/FLINK-36456
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Configuration
>            Reporter: Raul Garcia
>            Priority: Minor
>
> The current implementation of the {{DatadogHttpReporterFactory}} class retrieves the Datadog API key from the Flink configuration. In Kubernetes environments, this typically means storing the API key in ConfigMaps, which can expose sensitive information in plain text. Since ConfigMaps are not designed to hold secrets, this approach poses potential security risks.
> My proposal is to fall back to {{DD_API_KEY}}, which is the standard way of passing the API key to containers and is usually available in the environment.
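The proposed lookup order can be sketched as follows: configuration value first, then the environment. This is a minimal illustration assuming a plain static helper; the class and method names are hypothetical, not the actual DatadogHttpReporterFactory code.

```java
// Hypothetical sketch of the proposed fallback. Only System.getenv is
// assumed to exist; everything else is illustrative.
public final class DatadogApiKeySketch {
    private static final String ENV_VAR = "DD_API_KEY";

    /** Prefers the key from the Flink configuration, falls back to the environment. */
    static String resolveApiKey(String configuredKey) {
        if (configuredKey != null && !configuredKey.isEmpty()) {
            return configuredKey; // explicit configuration wins
        }
        String fromEnv = System.getenv(ENV_VAR);
        if (fromEnv == null || fromEnv.isEmpty()) {
            throw new IllegalArgumentException(
                    "Datadog API key found neither in configuration nor in " + ENV_VAR);
        }
        return fromEnv;
    }
}
```

With this order, existing configurations keep working unchanged, while Kubernetes deployments can inject the key via a Secret-backed environment variable instead of a ConfigMap.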
[PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]
rxp90 opened a new pull request, #25470:
URL: https://github.com/apache/flink/pull/25470

… DD_API_KEY if not present

## What is the purpose of the change

*(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)*

## Brief change log

*(for example:)*
- *The TaskInfo is stored in the blob store on job creation time as a persistent artifact*
- *Deployments RPC transmits only the blob storage reference*
- *TaskManagers retrieve the TaskInfo from the blob cache*

## Verifying this change

Please make sure both new and modified tests in this PR follow [the conventions for tests defined in our code quality guide](https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing).

*(Please pick either of the following options)*

This change is a trivial rework / code cleanup without any test coverage.

*(or)*

This change is already covered by existing tests, such as *(please describe tests)*.

*(or)*

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end deployment with large payloads (100MB)*
- *Extended integration test for recovery after master (JobManager) failure*
- *Added test that validates that TaskInfo is transferred only once across recoveries*
- *Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.*

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (yes / no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / no)
- The serializers: (yes / no / don't know)
- The runtime per-record code paths (performance sensitive): (yes / no / don't know)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
- The S3 file system connector: (yes / no / don't know)

## Documentation

- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
[jira] [Updated] (FLINK-36456) Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key
[ https://issues.apache.org/jira/browse/FLINK-36456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-36456:
-----------------------------------
    Labels: pull-request-available (was: )

> Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key
>
>                 Key: FLINK-36456
>                 URL: https://issues.apache.org/jira/browse/FLINK-36456
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Configuration
>            Reporter: Raul Garcia
>            Priority: Minor
>              Labels: pull-request-available
Re: [PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]
Juoelenis commented on PR #25470:
URL: https://github.com/apache/flink/pull/25470#issuecomment-2403211335

Very legit 👍👍👍
Re: [PR] [FLINK-36445][runtime] Introduce ExecutionPlan interface. [flink]
Juoelenis commented on PR #25469:
URL: https://github.com/apache/flink/pull/25469#issuecomment-2403214920

Big W
Re: [PR] [FLINK-36440] Bump log4j from 2.17.1 to 2.23.1 [flink]
Juoelenis commented on PR #25463:
URL: https://github.com/apache/flink/pull/25463#issuecomment-2403221800

1+ aura
[jira] [Updated] (FLINK-36440) Bump log4j from 2.17.1 to 2.23.1
[ https://issues.apache.org/jira/browse/FLINK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-36440:
-----------------------------------
    Labels: pull-request-available (was: )

> Bump log4j from 2.17.1 to 2.23.1
>
>                 Key: FLINK-36440
>                 URL: https://issues.apache.org/jira/browse/FLINK-36440
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Siddharth R
>            Priority: Major
>              Labels: pull-request-available
>
> Bumping *log4j* to the latest version (2.23.1) - this will remediate a lot of vulnerabilities in dependent packages.
> Package details:
> # [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-1.2-api/2.23.1]
> # [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-slf4j-impl/2.23.1]
> # [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api/2.23.1]
> # [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core/2.23.1]
> Release notes: [https://logging.apache.org/log4j/2.x/release-notes.html]
>
> A lot of bug fixes have been done in the newer versions and I don't see any breaking changes as such.
Re: [PR] [FLINK-36440] Bump log4j from 2.17.1 to 2.23.1 [flink]
ferenc-csaky commented on code in PR #25463:
URL: https://github.com/apache/flink/pull/25463#discussion_r1794118721

## pom.xml:

@@ -127,7 +127,7 @@ under the License.
 		<slf4j.version>1.7.36</slf4j.version>
-		<log4j.version>2.17.1</log4j.version>
+		<log4j.version>2.23.1</log4j.version>

Review Comment:
Any reason why not bump to 2.24.1, which is the latest ATM?
Re: [PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]
flinkbot commented on PR #25470:
URL: https://github.com/apache/flink/pull/25470#issuecomment-2402739010

## CI report:

* 8e38de9c2f05656c51131a9d241ccab6fa8824c2 UNKNOWN

Bot commands

The @flinkbot bot supports the following commands:

- `@flinkbot run azure` re-run the last Azure build
Re: [PR] Final [flink-connector-gcp-pubsub]
clmccart commented on code in PR #32:
URL: https://github.com/apache/flink-connector-gcp-pubsub/pull/32#discussion_r1793804145

## ordering-keys-prober/README.md:

@@ -0,0 +1,75 @@
+### Introduction
+
+The prober is designed to demonstrate effective use of Cloud Pub/Sub's
+[ordered delivery](https://cloud.google.com/pubsub/docs/ordering).
+
+### Building
+
+These instructions assume you are using [Maven](https://maven.apache.org/).
+
+1. If you want to build the connector from head, clone the repository, ensuring
+to do so recursively to pick up submodules:
+
+`git clone https://github.com/GoogleCloudPlatform/pubsub`

Review Comment:
update
Re: [PR] Donate PubSub source and sink rewrite back to Apache [flink-connector-gcp-pubsub]
clmccart commented on code in PR #32:
URL: https://github.com/apache/flink-connector-gcp-pubsub/pull/32#discussion_r1793809868

## .gitignore:

@@ -35,4 +35,8 @@ out/
 tools/flink
 tools/flink-*
 tools/releasing/release
-tools/japicmp-output
+tools/japicmp-output
+*Dockerfile
+*launch.sh
+*pubsub.yaml
+*.vscode/*

Review Comment:
will remove once manual testing is done
Re: [PR] [FLINK-33926][kubernetes]: Allow using job jars in the system classpath in native mode [flink]
SamBarker commented on code in PR #25445:
URL: https://github.com/apache/flink/pull/25445#discussion_r1794173964

## flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesClusterDescriptor.java:

@@ -214,9 +215,12 @@ public ClusterClientProvider deployApplicationCluster(
         applicationConfiguration.applyToConfiguration(flinkConfig);

-        // No need to do pipelineJars validation if it is a PyFlink job.
+        // No need to do pipelineJars validation if it is a PyFlink job or no jars specified in
+        // pipeline.jars (to use jars in system classpath instead).
         if (!(PackagedProgramUtils.isPython(applicationConfiguration.getApplicationClassName())
-                || PackagedProgramUtils.isPython(applicationConfiguration.getProgramArguments()))) {
+                || PackagedProgramUtils.isPython(
+                        applicationConfiguration.getProgramArguments()))
+                && flinkConfig.get(PipelineOptions.JARS) != null) {

Review Comment:
Would it be useful to push this compound condition to `PackagedProgramUtils` as something like `hasClasspath` or `usesSystemClasspath` (with the implied flip to false)?
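The reviewer's suggestion could look roughly like the sketch below: a static helper that owns the compound condition. The method name `usesSystemClasspath` is the reviewer's proposal, not existing Flink API, and the sketch only assumes the calls already visible in the diff above.

```java
// Hypothetical helper capturing the condition from the review comment.
// PackagedProgramUtils.isPython and PipelineOptions.JARS appear in the
// diff; the method itself is illustrative.
static boolean usesSystemClasspath(
        ApplicationConfiguration appConfig, Configuration flinkConfig) {
    // True when no user jar will be shipped: either a PyFlink job, or no
    // pipeline.jars entry, meaning job classes come from the system classpath.
    return PackagedProgramUtils.isPython(appConfig.getApplicationClassName())
            || PackagedProgramUtils.isPython(appConfig.getProgramArguments())
            || flinkConfig.get(PipelineOptions.JARS) == null;
}
```

The call site would then read `if (!usesSystemClasspath(applicationConfiguration, flinkConfig)) { ... }`, which is equivalent to the guarded validation in the diff but keeps the negated compound condition out of the descriptor.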
[PR] [FlINK-XXXX] Expand the matrix for the smoke test. [flink-kubernetes-operator]
SamBarker opened a new pull request, #892:
URL: https://github.com/apache/flink-kubernetes-operator/pull/892

The goal is to remove namespace from the main CI run, based on https://github.com/apache/flink-kubernetes-operator/pull/881#discussion_r1790257507

## What is the purpose of the change

*(For example: This pull request adds a new feature to periodically create and maintain savepoints through the `FlinkDeployment` custom resource.)*

## Brief change log

*(for example:)*
- *Periodic savepoint trigger is introduced to the custom resource*
- *The operator checks on reconciliation whether the required time has passed*
- *The JobManager's dispose savepoint API is used to clean up obsolete savepoints*

## Verifying this change

*(Please pick either of the following options)*

This change is a trivial rework / code cleanup without any test coverage.

*(or)*

This change is already covered by existing tests, such as *(please describe tests)*.

*(or)*

This change added tests and can be verified as follows:

*(example:)*
- *Added integration tests for end-to-end deployment with large payloads (100MB)*
- *Extended integration test for recovery after master (JobManager) failure*
- *Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.*

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (yes / no)
- The public API, i.e., is any changes to the `CustomResourceDescriptors`: (yes / no)
- Core observer or reconciler logic that is regularly executed: (yes / no)

## Documentation

- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
Re: [PR] [FlINK-XXXX] Expand the matrix for the smoke test. [flink-kubernetes-operator]
SamBarker closed pull request #892: [FlINK-XXXX] Expand the matrix for the smoke test.
URL: https://github.com/apache/flink-kubernetes-operator/pull/892
[jira] [Updated] (FLINK-33926) Can't start a job with a jar in the system classpath in native k8s mode
[ https://issues.apache.org/jira/browse/FLINK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-33926:
-----------------------------------
    Labels: pull-request-available (was: )

> Can't start a job with a jar in the system classpath in native k8s mode
>
>                 Key: FLINK-33926
>                 URL: https://issues.apache.org/jira/browse/FLINK-33926
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.0
>            Reporter: Trystan
>            Assignee: Gantigmaa Selenge
>            Priority: Major
>              Labels: pull-request-available
>
> It appears that the combination of running operator-controlled jobs in native k8s + application mode + using a job jar in the classpath is invalid. Avoiding dynamic classloading (as specified in the [docs|https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/ops/debugging/debugging_classloading/#avoiding-dynamic-classloading-for-user-code]) is beneficial for some jobs. This affects at least Flink 1.16.1 and Kubernetes Operator 1.6.0.
>
> FLINK-29288 seems to have addressed this for standalone mode. If I am misunderstanding how to correctly build jars for this native k8s scenario, apologies for the noise and any pointers would be appreciated!
>
> Perhaps related, the [spec documentation|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/#jobspec] declares it optional, but isn't clear about under what conditions that applies.
> * Putting the jar in the system classpath and pointing *jarURI* to that jar leads to linkage errors.
> * Not including *jarURI* leads to NullPointerExceptions in the operator:
> {code:java}
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"java.lang.NullPointerException","stackTrace":"org.apache.flink.kubernetes.operator.exception.ReconciliationException: java.lang.NullPointerException\n\tat org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:148)\n\tat org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:56)\n\tat io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:138)\n\tat io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:96)\n\tat org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)\n\tat io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:95)\n\tat io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:139)\n\tat io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:119)\n\tat io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:89)\n\tat io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)\n\tat io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: java.lang.NullPointerException\n\tat org.apache.flink.kubernetes.utils.KubernetesUtils.checkJarFileForApplicationMode(KubernetesUtils.java:407)\n\tat org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:207)\n\tat org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)\n\tat org.apache.flink.kubernetes.operator.service.NativeFlinkService.deployApplicationCluster(Native","additionalMetadata":{},"throwableList":[{"type":"java.lang.NullPointerException","additionalMetadata":{}}]}
> {code}
[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted
[ https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887987#comment-17887987 ]

Arvid Heise commented on FLINK-36455:
-------------------------------------

I'm not an expert on the SFS, but from cross-checking the code, it looks safe because it has no retries whatsoever. So it would fail on a transient error and trigger a new final checkpoint, which is in accordance with the contract of notifyCheckpointCompleted.

> Sink should commit everything on notifyCheckpointCompleted
>
>                 Key: FLINK-36455
>                 URL: https://issues.apache.org/jira/browse/FLINK-36455
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core
>            Reporter: Arvid Heise
>            Assignee: Arvid Heise
>            Priority: Major
>             Fix For: 2.0-preview
>
> Currently, we retry committables at some time later until they eventually succeed.
> However, that violates the contract of notifyCheckpointCompleted, which states that all side effects must be committed before the method returns. In particular, notifyCheckpointCompleted must fail if we cannot guarantee that all side effects are committed for final checkpoints. As soon as notifyCheckpointCompleted returns, the final checkpoint is deemed completed, which currently may mean that some transactions are still open.
> The solution is that all retries must happen in a closed loop in notifyCheckpointCompleted.
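The fix described in the last paragraph of the issue can be sketched as a commit loop that does not return until nothing retriable is left. This is purely illustrative; `Committable`, `commit`, `isRetriable`, and `pendingCommittablesUpTo` are stand-ins for Flink's sink v2 internals, not the actual implementation.

```java
// Illustrative closed-loop retry inside the checkpoint-completion callback.
// All helper names here are hypothetical.
@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
    List<Committable> pending = pendingCommittablesUpTo(checkpointId);
    while (!pending.isEmpty()) {
        List<Committable> retryable = new ArrayList<>();
        for (Committable committable : pending) {
            try {
                commit(committable);
            } catch (Exception e) {
                if (isRetriable(e)) {
                    retryable.add(committable); // retry now, not at some later checkpoint
                } else {
                    throw e; // fail the checkpoint; required for final checkpoints
                }
            }
        }
        pending = retryable;
    }
    // Only now may the method return: every side effect of this checkpoint
    // is committed, satisfying the notifyCheckpointCompleted contract.
}
```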
[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted
[ https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887988#comment-17887988 ]

Galen Warren commented on FLINK-36455:
--------------------------------------

Thanks for checking, much appreciated. I may go ahead and use the older one for now until I can use your fix.

> Sink should commit everything on notifyCheckpointCompleted
[jira] [Created] (FLINK-36456) Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key
Raul Garcia created FLINK-36456:
------------------------------------

             Summary: Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key
                 Key: FLINK-36456
                 URL: https://issues.apache.org/jira/browse/FLINK-36456
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Metrics
            Reporter: Raul Garcia

The current implementation of the {{DatadogHttpReporterFactory}} class retrieves the Datadog API key from the Flink configuration. In Kubernetes environments, this typically means storing the API key in ConfigMaps, which can expose sensitive information in plain text. Since ConfigMaps are not designed to hold secrets, this approach poses potential security risks.

My proposal is to fall back to {{DD_API_KEY}}, which is the standard way of passing the API key to containers and is usually available in the environment.
Re: [PR] [FLINK-36440] Bump log4j from 2.17.1 to 2.23.1 [flink]
Juoelenis commented on code in PR #25463:
URL: https://github.com/apache/flink/pull/25463#discussion_r1794127775

## pom.xml:

@@ -127,7 +127,7 @@ under the License.
 		<slf4j.version>1.7.36</slf4j.version>
-		<log4j.version>2.17.1</log4j.version>
+		<log4j.version>2.23.1</log4j.version>

Review Comment:
I have no idea what that is cuz i dunno java.
Re: [PR] [FLINK-25546][flink-connector-base][JUnit5 Migration] Module: flink-connector-base [flink]
snuyanzin commented on PR #21161:
URL: https://github.com/apache/flink/pull/21161#issuecomment-2403319420

@Jiabao-Sun since you are one of those who contributed a lot to the JUnit5 migration, may I ask you to have a look here?
Re: [PR] [FLINK-36456] [Runtime/Configuration] Improve Security in DatadogHttpReporterFactory by providing ENV alternative to retrieve API key [flink]
wcoqscwx commented on PR #25470:
URL: https://github.com/apache/flink/pull/25470#issuecomment-2402733236

Same change as this PR opened > 24 months ago: https://github.com/apache/flink/pull/19684
[PR] [FLINK-36437][hive] Remove hive connector dependency usage [flink]
Sxnan opened a new pull request, #25467:
URL: https://github.com/apache/flink/pull/25467

## What is the purpose of the change

Remove hive connector dependency usage

## Brief change log

- Remove hive connector dependency usage

## Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no

## Documentation

- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793080478

## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceEnumeratorMetrics.java:

@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.connectors.base.source.metrics;
+
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
+import org.apache.flink.metrics.groups.SplitEnumeratorMetricGroup;
+
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+
+/** A collection class for handling metrics in {@link SourceEnumeratorMetrics}. */
+public class SourceEnumeratorMetrics {
+    private static final Logger LOGGER = LoggerFactory.getLogger(SourceEnumeratorMetrics.class);
+
+    // Constants
+    public static final int UNDEFINED = 0;
+
+    // Metric names
+    public static final String IS_SNAPSHOTTING = "isSnapshotting";
+    public static final String IS_STREAM_READING = "isStreamReading";
+    public static final String NUM_TABLES_SNAPSHOTTED = "numTablesSnapshotted";
+    public static final String NUM_TABLES_REMAINING = "numTablesRemaining";
+    public static final String NUM_SNAPSHOT_SPLITS_PROCESSED = "numSnapshotSplitsProcessed";
+    public static final String NUM_SNAPSHOT_SPLITS_REMAINING = "numSnapshotSplitsRemaining";
+    public static final String NUM_SNAPSHOT_SPLITS_FINISHED = "numSnapshotSplitsFinished";
+    public static final String SNAPSHOT_START_TIME = "snapshotStartTime";
+    public static final String SNAPSHOT_END_TIME = "snapshotEndTime";
+    public static final String DATABASE_GROUP_KEY = "database";

Review Comment:
Ok, it has been adjusted
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793078972

## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java:

@@ -20,16 +20,57 @@
 import org.apache.flink.cdc.connectors.base.source.reader.IncrementalSourceReader;
 import org.apache.flink.metrics.Counter;
 import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
 import org.apache.flink.metrics.groups.SourceReaderMetricGroup;
 import org.apache.flink.runtime.metrics.MetricNames;
+import org.apache.flink.util.clock.SystemClock;
+
+import io.debezium.data.Envelope;
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.kafka.connect.source.SourceRecord;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.HashMap;
+import java.util.Map;
+
+import static org.apache.flink.cdc.connectors.base.utils.SourceRecordUtils.getTableId;
+import static org.apache.flink.cdc.connectors.base.utils.SourceRecordUtils.isDataChangeRecord;
+import static org.apache.flink.cdc.connectors.base.utils.SourceRecordUtils.isSchemaChangeEvent;

 /** A collection class for handling metrics in {@link IncrementalSourceReader}. */
 public class SourceReaderMetrics {
+    private static final Logger LOG = LoggerFactory.getLogger(SourceReaderMetrics.class);
+
     public static final long UNDEFINED = -1;

+    // Metric group keys
+    public static final String DATABASE_GROUP_KEY = "database";

Review Comment:
Ok, it has been adjusted
[jira] [Created] (FLINK-36450) Exclude log4j from curator-test
Siddharth R created FLINK-36450:
------------------------------------

             Summary: Exclude log4j from curator-test
                 Key: FLINK-36450
                 URL: https://issues.apache.org/jira/browse/FLINK-36450
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.10.0
            Reporter: Siddharth R

Exclude vulnerable log4j dependency from curator-test:

<exclusion>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
</exclusion>
Re: [PR] [FLINK-36445][runtime] Introduce ExecutionPlan interface. [flink]
flinkbot commented on PR #25469:
URL: https://github.com/apache/flink/pull/25469#issuecomment-2401882499

## CI report:

* 477c3c2050a7fc508519b9806ac84cc9e3eab32a UNKNOWN

Bot commands

The @flinkbot bot supports the following commands:

- `@flinkbot run azure` re-run the last Azure build
[jira] [Created] (FLINK-36451) Kubernetes Application JobManager Potential Deadlock and TaskManager Pod Residuals
xiechenling created FLINK-36451:
------------------------------------

             Summary: Kubernetes Application JobManager Potential Deadlock and TaskManager Pod Residuals
                 Key: FLINK-36451
                 URL: https://issues.apache.org/jira/browse/FLINK-36451
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.19.1
         Environment: - Flink version: 1.19.1
- Deployment mode: Flink Kubernetes Application Mode
- JVM version: OpenJDK 17
            Reporter: xiechenling
         Attachments: 1.png, 2.png, jobmanager.log, jstack.txt

In Kubernetes Application Mode, when there is significant etcd latency or instability, the Flink JobManager may enter a deadlock situation. Additionally, TaskManager pods are not cleaned up properly, resulting in stale resources that prevent the Flink job from recovering correctly. This issue occurs during frequent service restarts or network instability.
[PR] [FLINK-36449] Bump apache-rat-plugin [flink-kubernetes-operator]
r-sidd opened a new pull request, #890:
URL: https://github.com/apache/flink-kubernetes-operator/pull/890

## What is the purpose of the change

Bump apache-rat-plugin

## Brief change log

Bump apache-rat-plugin from 0.12 to 0.16.1 to remediate the underlying vulnerabilities in the dependencies.

Vulnerabilities from dependencies:
[CVE-2022-4245](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4245)
[CVE-2022-4244](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4244)
[CVE-2020-15250](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15250)

## Verifying this change

This change is a trivial rework / code cleanup without any test coverage.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): yes
- The public API, i.e., is any changes to the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: no

## Documentation

- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable
[jira] [Updated] (FLINK-36449) Bump apache-rat-plugin from 0.12 to 0.16.1
[ https://issues.apache.org/jira/browse/FLINK-36449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-36449:
-----------------------------------
    Labels: pull-request-available (was: )

> Bump apache-rat-plugin from 0.12 to 0.16.1
>
>                 Key: FLINK-36449
>                 URL: https://issues.apache.org/jira/browse/FLINK-36449
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.10.0
>            Reporter: Siddharth R
>            Priority: Major
>              Labels: pull-request-available
>
> Bump apache-rat-plugin from 0.12 to 0.16.1 to remediate the underlying vulnerabilities in the dependencies.
>
> Vulnerabilities from dependencies:
> [CVE-2022-4245|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4245]
> [CVE-2022-4244|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4244]
> [CVE-2020-15250|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15250]
Re: [PR] [FLINK-33977][runtime] Adaptive scheduler may not minimize the number of TMs during downscaling [flink]
ztison commented on PR #25218:
URL: https://github.com/apache/flink/pull/25218#issuecomment-2401913141

Hi, I was wondering if it's advisable to change the default slot assigner. Wouldn't it be more appropriate to keep the default setting, which spreads the workload across as many workers as possible? This approach enhances the system's availability and resilience. The system will need to migrate the entire state to a new TM if it crashes; is that what we want in the default implementation? Isn't it a better approach to have a config option to activate the proposed behavior as defined in FLINK-36426? Then we could use a new slot assigner instead of the `DefaultSlotAssigner` and `StateLocalitySlotAssigner`. I'm also wondering if the same strategy can be applied in the Resource Manager to allocate slots from a minimized number of Task Managers (TMs).
[jira] [Assigned] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted
[ https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvid Heise reassigned FLINK-36455:
---------------------------------------

    Assignee: Arvid Heise

> Sink should commit everything on notifyCheckpointCompleted
[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted
[ https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887948#comment-17887948 ]

Arvid Heise commented on FLINK-36455:
-------------------------------------

Yes, all sinks following the FLIP-143 API are susceptible to the issue. After implementing this ticket, graceful final checkpoints will never leave transactions open (unless the connector is buggy). For normal checkpoints, open transactions can only occur if the RPC message notifyCheckpointCompleted is lost (which cannot happen for final checkpoints). Note that the issue is only relevant when transactions fail to commit on the first try, so everything is an edge case.

> Sink should commit everything on notifyCheckpointCompleted
[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted
[ https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887949#comment-17887949 ]

Galen Warren commented on FLINK-36455:
--------------------------------------

Gotcha, thanks. Presumably that wouldn't include the legacy StreamingFileSink?

> Sink should commit everything on notifyCheckpointCompleted
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619:
URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793008830

## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceEnumeratorMetrics.java:

@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.connectors.base.source.metrics;
+
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
+import org.apache.flink.metrics.groups.SplitEnumeratorMetricGroup;
+
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+
+/** A collection class for handling metrics in {@link SourceEnumeratorMetrics}. */
+public class SourceEnumeratorMetrics {
+    private static final Logger LOGGER = LoggerFactory.getLogger(SourceEnumeratorMetrics.class);
+
+    // Constants
+    public static final int UNDEFINED = 0;
+
+    // Metric names
+    public static final String IS_SNAPSHOTTING = "isSnapshotting";
+    public static final String IS_STREAM_READING = "isStreamReading";
+    public static final String NUM_TABLES_SNAPSHOTTED = "numTablesSnapshotted";
+    public static final String NUM_TABLES_REMAINING = "numTablesRemaining";
+    public static final String NUM_SNAPSHOT_SPLITS_PROCESSED = "numSnapshotSplitsProcessed";
+    public static final String NUM_SNAPSHOT_SPLITS_REMAINING = "numSnapshotSplitsRemaining";
+    public static final String NUM_SNAPSHOT_SPLITS_FINISHED = "numSnapshotSplitsFinished";
+    public static final String SNAPSHOT_START_TIME = "snapshotStartTime";
+    public static final String SNAPSHOT_END_TIME = "snapshotEndTime";
+    public static final String DATABASE_GROUP_KEY = "database";
+    public static final String SCHEMA_GROUP_KEY = "schema";
+    public static final String TABLE_GROUP_KEY = "table";
+
+    private final SplitEnumeratorMetricGroup metricGroup;
+
+    private volatile int isSnapshotting = UNDEFINED;
+    private volatile int isStreamReading = UNDEFINED;
+    private volatile int numTablesRemaining = 0;
+
+    // Map for managing per-table metrics by table identifier
+    // Key: Identifier of the table
+    // Value: TableMetrics related to the table
+    private final Map<TableId, TableMetrics> tableMetricsMap = new ConcurrentHashMap<>();
+
+    public SourceEnumeratorMetrics(SplitEnumeratorMetricGroup metricGroup) {
+        this.metricGroup = metricGroup;
+        metricGroup.gauge(IS_SNAPSHOTTING, () -> isSnapshotting);
+        metricGroup.gauge(IS_STREAM_READING, () -> isStreamReading);
+        metricGroup.gauge(NUM_TABLES_REMAINING, () -> numTablesRemaining);
+    }
+
+    public void enterSnapshotPhase() {
+        this.isSnapshotting = 1;
+    }
+
+    public void exitSnapshotPhase() {
+        this.isSnapshotting = 0;
+    }
+
+    public void enterStreamReading() {
+        this.isStreamReading = 1;
+    }
+
+    public void exitStreamReading() {
+        this.isStreamReading = 0;
+    }
+
+    public void registerMetrics(
+            Gauge<Integer> numTablesSnapshotted,
+            Gauge<Integer> numSnapshotSplitsProcessed,
+            Gauge<Integer> numSnapshotSplitsRemaining) {
+        metricGroup.gauge(NUM_TABLES_SNAPSHOTTED, numTablesSnapshotted);
+        metricGroup.gauge(NUM_SNAPSHOT_SPLITS_PROCESSED, numSnapshotSplitsProcessed);
+        metricGroup.gauge(NUM_SNAPSHOT_SPLITS_REMAINING, numSnapshotSplitsRemaining);
+    }
+
+    public void addNewTables(int numNewTables) {
+        numTablesRemaining += numNewTables;
+    }
+
+    public void startSnapshotTables(int numSnapshottedTables) {
+        numTablesRemaining -= numSnapshottedTables;
+    }
+
+    public TableMetrics getTableMetrics(TableId tableId) {
+        return tableMetricsMap.computeIfAbsent(
+                tableId,
+                key -> new TableMetrics(key.catalog(), key.schema(), key.table(), metricGroup));
+    }
+
+    // --- Helper
[jira] [Updated] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory
[ https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-36421:
-----------------------------------
    Labels: pull-request-available (was: )

> Missing fsync in FsCheckpointStreamFactory
>
>                 Key: FLINK-36421
>                 URL: https://issues.apache.org/jira/browse/FLINK-36421
>             Project: Flink
>          Issue Type: Bug
>          Components: FileSystems, Runtime / Checkpointing
>    Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>            Reporter: Marc Aurel Fritz
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: fsync-fs-stream-factory.diff
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is similar to FLINK-35217, but affects only files written by the taskmanager (the ones with random names as described [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]). After system crash the files written by the taskmanager may be corrupted (file size of 0 bytes) if the changes in the file-system cache haven't been written to disk. The "_metadata" file written by the jobmanager is always fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in "FsCheckpointStreamFactory". In this case the "OutputStream" is closed without calling "fsync", thus the file is not durably persisted on disk before the checkpoint is completed. (As previously established in FLINK-35217, calling "fsync" is necessary as simply closing the stream does not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
> # The checkpoint chk-1217's directory is created at "mkdir"
> # The checkpoint chk-1217's non-inline state is written by the taskmanager at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that there's no "fsync" before "close".
> # The checkpoint chk-1217 is finished, its "_metadata" is written and synced properly
> # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get corrupted on e.g. power loss. This means there is no working checkpoint left as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 0777) = 0
> [pid 1303248] 08:22:59 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc", 0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199) = 0
> [pid 947310] 08:22:59 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata", 0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata", 0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0", 0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0", O_WRONLY|O_CREAT|O_EXCL, 0666) = 148
> [pid 947310] 08:22:59 fsync(148) = 0
> [pid 947310] 08:22:59 clos
[PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]
Planet-X opened a new pull request, #25468:
URL: https://github.com/apache/flink/pull/25468

## What is the purpose of the change

The `FsCheckpointStreamFactory` returns handles to files containing larger serialized data for checkpoint creation. However, these files are not synced persistently to disk via `fsync` before their handles are incorporated into completed checkpoints. Upon e.g. power loss this can cause corruption of a checkpoint, as it references files that may now be lost due to missing persistence.

This PR adds a #sync() call to the `FsCheckpointStreamFactory`'s #closeAndGetHandle() function. This makes sure that the returned handle always points to a file that has been safely persisted.

## Brief change log

- *Files written by `FsCheckpointStreamFactory` are now synced before their handle is referenced in checkpoints*

## Verifying this change

There is no easy way to unit- or integration-test these changes. OS tools are required. Correctness has been verified by using `strace`; see discussion at FLINK-36421.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
- The S3 file system connector: (no)

## Documentation

- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
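The shape of the fix can be sketched as follows. The method names `flushToFile` and `sync` follow Flink's FsCheckpointStreamFactory/FSDataOutputStream APIs, but the body below is a simplified illustration of the ordering, not the actual patch:

```java
// Simplified sketch: persist bytes durably before publishing the handle.
public StreamStateHandle closeAndGetHandle() throws IOException {
    flushToFile();        // write any buffered bytes to the OS
    outputStream.sync();  // fsync: force the file contents to stable storage
    outputStream.close();
    // Only now is it safe to reference the file from a checkpoint: after a
    // crash the file is guaranteed to exist with the written contents.
    return new FileStateHandle(statePath, stateSize);
}
```

The key design point is the ordering: the handle must not escape until the data it points to is durable, mirroring how the jobmanager already fsyncs the `_metadata` file.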
[jira] [Assigned] (FLINK-36322) Fix compile error of flink benchmark caused by breaking changes
[ https://issues.apache.org/jira/browse/FLINK-36322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zakelly Lan reassigned FLINK-36322:
---------------------------------------

    Assignee: Zakelly Lan

> Fix compile error of flink benchmark caused by breaking changes
>
>                 Key: FLINK-36322
>                 URL: https://issues.apache.org/jira/browse/FLINK-36322
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Benchmarks
>            Reporter: Zakelly Lan
>            Assignee: Zakelly Lan
>            Priority: Major
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1793008455

## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceEnumeratorMetrics.java:

@@ -0,0 +1,251 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.connectors.base.source.metrics;
+
+import org.apache.flink.metrics.Gauge;
+import org.apache.flink.metrics.MetricGroup;
+import org.apache.flink.metrics.groups.SplitEnumeratorMetricGroup;
+
+import io.debezium.relational.TableId;
+import org.apache.commons.lang3.StringUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.AtomicInteger;
+
+/** A collection class for handling metrics in {@link SourceEnumeratorMetrics}. */
+public class SourceEnumeratorMetrics {
+    private static final Logger LOGGER = LoggerFactory.getLogger(SourceEnumeratorMetrics.class);
+    // Constants
+    public static final int UNDEFINED = 0;
+
+    // Metric names
+    public static final String IS_SNAPSHOTTING = "isSnapshotting";
+    public static final String IS_STREAM_READING = "isStreamReading";
+    public static final String NUM_TABLES_SNAPSHOTTED = "numTablesSnapshotted";
+    public static final String NUM_TABLES_REMAINING = "numTablesRemaining";
+    public static final String NUM_SNAPSHOT_SPLITS_PROCESSED = "numSnapshotSplitsProcessed";
+    public static final String NUM_SNAPSHOT_SPLITS_REMAINING = "numSnapshotSplitsRemaining";
+    public static final String NUM_SNAPSHOT_SPLITS_FINISHED = "numSnapshotSplitsFinished";
+    public static final String SNAPSHOT_START_TIME = "snapshotStartTime";
+    public static final String SNAPSHOT_END_TIME = "snapshotEndTime";
+    public static final String DATABASE_GROUP_KEY = "database";
+    public static final String SCHEMA_GROUP_KEY = "schema";
+    public static final String TABLE_GROUP_KEY = "table";
+
+    private final SplitEnumeratorMetricGroup metricGroup;
+
+    private volatile int isSnapshotting = UNDEFINED;
+    private volatile int isStreamReading = UNDEFINED;
+    private volatile int numTablesRemaining = 0;
+
+    // Map for managing per-table metrics by table identifier
+    // Key: Identifier of the table
+    // Value: TableMetrics related to the table
+    private final Map<TableId, TableMetrics> tableMetricsMap = new ConcurrentHashMap<>();
+
+    public SourceEnumeratorMetrics(SplitEnumeratorMetricGroup metricGroup) {
+        this.metricGroup = metricGroup;
+        metricGroup.gauge(IS_SNAPSHOTTING, () -> isSnapshotting);
+        metricGroup.gauge(IS_STREAM_READING, () -> isStreamReading);
+        metricGroup.gauge(NUM_TABLES_REMAINING, () -> numTablesRemaining);
+    }
+
+    public void enterSnapshotPhase() {
+        this.isSnapshotting = 1;
+    }
+
+    public void exitSnapshotPhase() {
+        this.isSnapshotting = 0;
+    }
+
+    public void enterStreamReading() {
+        this.isStreamReading = 1;
+    }
+
+    public void exitStreamReading() {
+        this.isStreamReading = 0;
+    }
+
+    public void registerMetrics(
+            Gauge<Integer> numTablesSnapshotted,
+            Gauge<Integer> numSnapshotSplitsProcessed,
+            Gauge<Integer> numSnapshotSplitsRemaining) {
+        metricGroup.gauge(NUM_TABLES_SNAPSHOTTED, numTablesSnapshotted);
+        metricGroup.gauge(NUM_SNAPSHOT_SPLITS_PROCESSED, numSnapshotSplitsProcessed);
+        metricGroup.gauge(NUM_SNAPSHOT_SPLITS_REMAINING, numSnapshotSplitsRemaining);
+    }
+
+    public void addNewTables(int numNewTables) {
+        numTablesRemaining += numNewTables;
+    }
+
+    public void startSnapshotTables(int numSnapshottedTables) {
+        numTablesRemaining -= numSnapshottedTables;
+    }
+
+    public TableMetrics getTableMetrics(TableId tableId) {
+        return tableMetricsMap.computeIfAbsent(
+                tableId,
+                key -> new TableMetrics(key.catalog(), key.schema(), key.table(), metricGroup));
+    }
+
+    // --- Helper
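For readers following the review, here is a minimal sketch of how an enumerator might drive these gauges, assuming only the constructor and phase methods visible in the diff above; the surrounding enumerator method and variable names are hypothetical, not part of the PR.

```java
// Hypothetical wiring inside a split enumerator; only the calls on
// SourceEnumeratorMetrics correspond to the API shown in the diff.
void startEnumerator(SplitEnumeratorContext<SourceSplitBase> context, List<TableId> capturedTables) {
    SourceEnumeratorMetrics metrics = new SourceEnumeratorMetrics(context.metricGroup());
    metrics.addNewTables(capturedTables.size()); // every table starts as "remaining"
    metrics.enterSnapshotPhase();                // flips the isSnapshotting gauge to 1
    for (TableId tableId : capturedTables) {
        metrics.getTableMetrics(tableId);        // lazily registers the per-table subgroup
    }
}
```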
[jira] [Commented] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted
[ https://issues.apache.org/jira/browse/FLINK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887940#comment-17887940 ] Galen Warren commented on FLINK-36455: -- [~arvid] I'm trying to understand the scope of this issue. Would all sinks be impacted? All versions of Flink? Thanks. > Sink should commit everything on notifyCheckpointCompleted > -- > > Key: FLINK-36455 > URL: https://issues.apache.org/jira/browse/FLINK-36455 > Project: Flink > Issue Type: Bug > Components: API / Core >Reporter: Arvid Heise >Priority: Major > Fix For: 2.0-preview > > > Currently, we retry committables some time later until they eventually > succeed. > However, that violates the contract of notifyCheckpointCompleted, which states > that all side effects must be committed before the method returns. In > particular, notifyCheckpointCompleted must fail if we cannot guarantee that > all side effects are committed for final checkpoints. As soon as > notifyCheckpointCompleted returns, the final checkpoint is deemed completed, > which currently may mean that some transactions are still open. > The solution is that all retries must happen in a closed loop in > notifyCheckpointCompleted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
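As a hedged illustration of the proposed contract (the committable queue, commit(), RetriableException, and backoffMillis below are invented names, and the actual listener method is CheckpointListener#notifyCheckpointComplete), the retry would move into the notification itself:

```java
// Sketch: retries happen inside the notification, so the method only
// returns once every committable has actually been committed.
@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
    while (!pendingCommittables.isEmpty()) {
        Committable next = pendingCommittables.peek();
        try {
            commit(next);                  // side effect becomes visible
            pendingCommittables.poll();
        } catch (RetriableException e) {
            Thread.sleep(backoffMillis);   // retry in place instead of deferring
        }
    }
    // only now is the (final) checkpoint safely considered completed
}
```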
[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory
[ https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887822#comment-17887822 ] Marc Aurel Fritz commented on FLINK-36421: -- [~gaborgsomogyi] We've been running the patched version for about a month on the most affected edge device. It's turned on and off multiple times a day, and a corrupted checkpoint happened quite reliably at least once a week pre-patch. About a week ago we also rolled out the patched version to all other edge devices in production and they seem to run perfectly stable. No corrupted checkpoints have been observed with the patch in place so far - I'll keep you updated if that changes! [~zakelly] Interesting! I assumed rocksdb wouldn't be affected, as I expected the rocksdb library to write all the serialized state to disk as part of the database. Does the rocksdb state backend also reference larger state as external files written by the {{FsCheckpointStreamFactory}}, then presumably just storing their references in the rocksdb database? I'll take a closer look at how this works! I've opened up a PR for this here: https://github.com/apache/flink/pull/25468 > Missing fsync in FsCheckpointStreamFactory > -- > > Key: FLINK-36421 > URL: https://issues.apache.org/jira/browse/FLINK-36421 > Project: Flink > Issue Type: Bug > Components: FileSystems, Runtime / Checkpointing >Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0 >Reporter: Marc Aurel Fritz >Priority: Critical > Labels: pull-request-available > Attachments: fsync-fs-stream-factory.diff > > > With Flink 1.20 we observed another checkpoint corruption bug. This is > similar to FLINK-35217, but affects only files written by the taskmanager > (the ones with random names as described > [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]). > After system crash the files written by the taskmanager may be corrupted > (file size of 0 bytes) if the changes in the file-system cache haven't been > written to disk. The "_metadata" file written by the jobmanager is always > fine because it's properly fsynced. > Investigation revealed that "fsync" is missing, this time in > "FsCheckpointStreamFactory". In this case the "OutputStream" is closed > without calling "fsync", thus the file is not durably persisted on disk > before the checkpoint is completed. (As previously established in > FLINK-35217, calling "fsync" is necessary as simply closing the stream does > not have any guarantees on persistence.) > "strace" on the taskmanager's process confirms this behavior: > # The checkpoint chk-1217's directory is created at "mkdir" > # The checkpoint chk-1217's non-inline state is written by the taskmanager > at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that > there's no "fsync" before "close". > # The checkpoint chk-1217 is finished, its "_metadata" is written and synced > properly > # The old checkpoint chk-1216 is deleted at "unlink" > The new checkpoint chk-1217 now references a not-synced file that can get > corrupted on e.g. power loss. This means there is no working checkpoint left > as the old checkpoint was deleted. > For durable persistence an "fsync" call is missing before "close" in step 2. 
> Full "strace" log: > {code:java} > [pid 947250] 08:22:58 > stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", > 0x7f68414c5b50) = -1 ENOENT (No such file or directory) > [pid 947250] 08:22:58 > stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", > 0x7f68414c5b50) = -1 ENOENT (No such file or directory) > [pid 947250] 08:22:58 > stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", > {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 > [pid 947250] 08:22:58 > mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", > 0777) = 0 > [pid 1303248] 08:22:59 > stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc", > 0x7f56f08d5610) = -1 ENOENT (No such file or directory) > [pid 1303248] 08:22:59 > stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", > {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 > [pid 1303248] 08:22:59 openat(AT_FDCWD, > "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc", > O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199 > [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 > [pid 1303248] 08:22:59 close(199) = 0 > [pid 947310] 08:22:59 > stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata", > 0x7f683f
[jira] [Assigned] (FLINK-36446) Refactor Job Submission Process to Use ExecutionPlan Instead of JobGraph
[ https://issues.apache.org/jira/browse/FLINK-36446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhu Zhu reassigned FLINK-36446: --- Assignee: Junrui Li > Refactor Job Submission Process to Use ExecutionPlan Instead of JobGraph > > > Key: FLINK-36446 > URL: https://issues.apache.org/jira/browse/FLINK-36446 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Junrui Li >Assignee: Junrui Li >Priority: Major > > Refactor the job submission process to submit an {{ExecutionPlan}} instead of > a {{{}JobGraph{}}}. > Since {{JobGraph}} implements the {{ExecutionPlan}} interface, this change > will not impact the existing submission process. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-36445) Introducing an ExecutionPlan interface
[ https://issues.apache.org/jira/browse/FLINK-36445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhu Zhu reassigned FLINK-36445: --- Assignee: Junrui Li > Introducing an ExecutionPlan interface > -- > > Key: FLINK-36445 > URL: https://issues.apache.org/jira/browse/FLINK-36445 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Junrui Li >Assignee: Junrui Li >Priority: Major > > Introducing an {{ExecutionPlan}} interface to replace the {{JobGraph}} in the > job submission and recovery processes, with {{JobGraph}} implementing this > interface. > Please note that this JIRA only focuses on the introduction of the interface; > the actual replacement of the interface will be addressed in subsequent JIRAs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
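The pattern itself is plain Java; a hedged sketch follows (simplified and illustrative, not the actual Flink signatures):

```java
import org.apache.flink.api.common.JobID;

// Sketch: submission depends only on the interface, so JobGraph-based
// callers keep working unchanged once JobGraph implements ExecutionPlan.
interface ExecutionPlan {
    JobID getJobID();
}

class JobGraph implements ExecutionPlan {
    private final JobID jobId = new JobID();

    @Override
    public JobID getJobID() {
        return jobId;
    }
}

class Dispatcher {
    void submitJob(ExecutionPlan plan) {
        // accepts a JobGraph today and other ExecutionPlans later
    }
}
```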
[jira] [Assigned] (FLINK-36158) Refactor Job Recovery Process to Use ExecutionPlan Instead of JobGraph
[ https://issues.apache.org/jira/browse/FLINK-36158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhu Zhu reassigned FLINK-36158: --- Assignee: Junrui Li > Refactor Job Recovery Process to Use ExecutionPlan Instead of JobGraph > -- > > Key: FLINK-36158 > URL: https://issues.apache.org/jira/browse/FLINK-36158 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Junrui Li >Assignee: Junrui Li >Priority: Major > > Change the existing JobGraphStore to an ExecutionPlanStore to support storing > ExecutionPlans, allowing jobs to be restored from ExecutionPlans. > Since {{JobGraph}} implements the {{ExecutionPlan}} interface, this change > will not impact the existing recovery. -- This message was sent by Atlassian Jira (v8.20.10#820010)
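Correspondingly, a hedged sketch of what generalizing the store means (illustrative names, not the actual interfaces):

```java
// Sketch: recovery reads back ExecutionPlans, and a stored JobGraph
// round-trips unchanged because it is an ExecutionPlan.
interface ExecutionPlanStore {
    void putExecutionPlan(ExecutionPlan plan) throws Exception;
    ExecutionPlan recoverExecutionPlan(JobID jobId) throws Exception;
}
```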
[jira] [Updated] (FLINK-36445) Introducing an ExecutionPlan interface
[ https://issues.apache.org/jira/browse/FLINK-36445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36445: --- Labels: pull-request-available (was: ) > Introducing an ExecutionPlan interface > -- > > Key: FLINK-36445 > URL: https://issues.apache.org/jira/browse/FLINK-36445 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Junrui Li >Assignee: Junrui Li >Priority: Major > Labels: pull-request-available > > Introducing an {{ExecutionPlan}} interface to replace the {{JobGraph}} in the > job submission and recovery processes, with {{JobGraph}} implementing this > interface. > Please note that this JIRA only focuses on the introduction of the interface; > the actual replacement of the interface will be addressed in subsequent JIRAs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36441] Ensure producers are not leaked [flink-connector-kafka]
AHeise merged PR #126: URL: https://github.com/apache/flink-connector-kafka/pull/126 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [FLINK-36445][runtime] Introduce ExecutionPlan interface. [flink]
JunRuiLee opened a new pull request, #25469: URL: https://github.com/apache/flink/pull/25469 ## What is the purpose of the change [FLINK-36445][runtime] Introduce ExecutionPlan interface. ## Brief change log Introduce ExecutionPlan interface and JobGraph implements this interface. ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (yes / **no**) - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-36441) Resource leaks in Kafka connector (Tests)
[ https://issues.apache.org/jira/browse/FLINK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887871#comment-17887871 ] Arvid Heise commented on FLINK-36441: - Merged as f888d5afa7b62fe52ddad7575f3f7a4895e64453..17b26d28bf43bd47e342aaed5c07c9f5f00bdcc7. > Resource leaks in Kafka connector (Tests) > - > > Key: FLINK-36441 > URL: https://issues.apache.org/jira/browse/FLINK-36441 > Project: Flink > Issue Type: Bug >Affects Versions: kafka-3.2.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available > > Producers are leaked in quite a few tests. For non-transactional producers, > there is even a leak in the producer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
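For illustration, a hedged sketch of the fix pattern for such leaks in test code (the properties, topic, and payload are invented; only the try-with-resources idiom is the point):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class ProducerLeakExample {
    static void sendOnce(Properties props) throws Exception {
        // try-with-resources closes the producer on every exit path,
        // including when send() or a later assertion throws.
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", new byte[0])).get();
        }
    }
}
```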
Re: [PR] [FLINK-35109] Bump to Flink 1.19 and support Flink 1.20 [flink-connector-kafka]
AHeise commented on PR #122: URL: https://github.com/apache/flink-connector-kafka/pull/122#issuecomment-2401873098 I forked out FLINK-36441 for the test fixes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (FLINK-36441) Resource leaks in Kafka connector (Tests)
[ https://issues.apache.org/jira/browse/FLINK-36441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arvid Heise resolved FLINK-36441. - Fix Version/s: kafka-3.3.0 Resolution: Fixed > Resource leaks in Kafka connector (Tests) > - > > Key: FLINK-36441 > URL: https://issues.apache.org/jira/browse/FLINK-36441 > Project: Flink > Issue Type: Bug >Affects Versions: kafka-3.2.0 >Reporter: Arvid Heise >Assignee: Arvid Heise >Priority: Major > Labels: pull-request-available > Fix For: kafka-3.3.0 > > > Producers are leaked in quite a few tests. For non-transactional producers, > there is even a leak in the producer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]
gaborgsomogyi commented on PR #25468: URL: https://github.com/apache/flink/pull/25468#issuecomment-2402241930 @flinkbot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory
[ https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887922#comment-17887922 ] Gabor Somogyi commented on FLINK-36421: --- I would backport it to at least the 1.20 line to have this fix on 1.x. Thanks for the PR. > Missing fsync in FsCheckpointStreamFactory > -- -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]
gaborgsomogyi commented on PR #25468: URL: https://github.com/apache/flink/pull/25468#issuecomment-2402249199 Needs a green azure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36009] Architecture tests fixes [flink-connector-jdbc]
eskabetxe commented on PR #140: URL: https://github.com/apache/flink-connector-jdbc/pull/140#issuecomment-2401637907 Created another PR with only the new module for the architecture change [here](https://github.com/apache/flink-connector-jdbc/pull/143/files) This one is kept open to allow discussion about the fixes, and will be simplified once the other PR is merged. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-31234][kubernetes] Add an option to redirect stdout to log dir. [flink]
xiaows08 commented on PR #22138: URL: https://github.com/apache/flink/pull/22138#issuecomment-2401639724 @huwh Thanks for the reply. The environment is as follows: flink-kubernetes-operator 1.9.0, flink: flink:1.18.1-scala_2.12, and the log4j-console.properties is the image default.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic
spec:
  image: flink:1.18.1-scala_2.12
  flinkVersion: v1_18
  ingress:
    className: "nginx"
    template: "{{name}}.fo.xyuqing.com" # the quotes are required
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          ports:
            - name: metrics
              containerPort: 9249
              protocol: TCP
          env:
            - name: TZ # time zone for the container
              value: Asia/Shanghai
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    env.stdout-err.redirect-to-file: "true"
    metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
  serviceAccount: flink
  jobManager:
    resource:
      cpu: 1
      memory: 2G
  taskManager:
    resource:
      cpu: 1
      memory: 2G
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
```

My point is that although `kubectl logs -f pods/xxx` no longer prints the log4j2 log output, the corresponding `xxx.out` file still contains the same log4j2 content as `xxx.log`. As a result, the log4j output from the code exists twice, taking up twice the storage, and the `xxx.out` file cannot be rotated automatically by size or time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36421] [fs] [checkpoint] Sync outputStream before returning handle in FsCheckpointStreamFactory [flink]
flinkbot commented on PR #25468: URL: https://github.com/apache/flink/pull/25468#issuecomment-2401668508 ## CI report: * fd427ffdeaa9b9fc6c31dee3dc09ca982ad7b5ba UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory
[ https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887822#comment-17887822 ] Marc Aurel Fritz edited comment on FLINK-36421 at 10/9/24 8:43 AM: --- [~gaborgsomogyi] We've been running the patched version for about a month on the most affected edge device. It's turned on and off multiple times a day, and a corrupted checkpoint happened quite reliably at least once a week pre-patch. About a week ago we also rolled out the patched version to all other edge devices in production and they seem to run perfectly stable. No corrupted checkpoints have been observed with the patch in place so far - I'll keep you updated if that changes! [~zakelly] Interesting! I assumed rocksdb wouldn't be affected, as I expected the rocksdb library to write all the serialized state to disk as part of the database. Does the rocksdb state backend also reference larger state as external files written by the {{FsCheckpointStreamFactory}}, then presumably just storing their references in the rocksdb database? I'll take a closer look at how this works! I've opened up a PR for this here: [https://github.com/apache/flink/pull/25468] Should I also open up PRs for backporting this? > Missing fsync in FsCheckpointStreamFactory > -- -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (FLINK-36064) Refactor REST API for Client and Operator Coordinator Communication to Use operatorUid.
[ https://issues.apache.org/jira/browse/FLINK-36064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhu Zhu closed FLINK-36064. --- Fix Version/s: 2.0.0 Resolution: Done 63b92bd88ac32be5b3e70550ba90d828b7b655d1 > Refactor REST API for Client and Operator Coordinator Communication to Use > operatorUid. > --- > > Key: FLINK-36064 > URL: https://issues.apache.org/jira/browse/FLINK-36064 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Junrui Li >Assignee: Junrui Li >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > > Currently, the REST API between clients and operator coordinators relies on > the operator ID to send requests to the Job Manager (JM) for querying > results. However, with the introduction of Stream Graph submission, the Job > Graph will be compiled and generated within the JM, preventing the client > from accessing the operator ID. > To address this issue, we modify the REST API for communication between > clients and operator coordinators by removing the dependency on operatorID > and transitioning to a client-defined {{{}String operatorUid{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
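A hedged sketch of the client side of this change (the env, source, and uid string are illustrative; the point is that the uid is chosen by the client and therefore survives JobGraph compilation in the JM):

```java
// The client pins a stable uid on the operator; coordination requests can
// then be addressed by this string instead of the OperatorID that only
// exists after the JobGraph is compiled inside the JobManager.
DataStreamSource<String> src =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "lookup-source");
src.uid("lookup-coordinator"); // client-defined operatorUid
```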
[jira] [Commented] (FLINK-36440) Bump log4j from 2.17.1 to 2.23.1
[ https://issues.apache.org/jira/browse/FLINK-36440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887927#comment-17887927 ] Ferenc Csaky commented on FLINK-36440: -- The latest version is 2.24.1 [1]; any specific reason not to use that? [1] https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core > Bump log4j from 2.17.1 to 2.23.1 > > > Key: FLINK-36440 > URL: https://issues.apache.org/jira/browse/FLINK-36440 > Project: Flink > Issue Type: Improvement >Reporter: Siddharth R >Priority: Major > > Bumping *log4j* to the latest version (2.23.1) - this will remediate a lot of > vulnerabilities in dependent packages. > Package details: > # > [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-1.2-api/2.23.1] > # > [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-slf4j-impl/2.23.1] > # > [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api/2.23.1] > # > [https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core/2.23.1] > Release notes: > [https://logging.apache.org/log4j/2.x/release-notes.html] > > A lot of bug fixes have been done in the newer versions and I don't see any > breaking changes as such. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36241] Table API .plus should support string concatenation [flink]
dawidwys commented on PR #25298: URL: https://github.com/apache/flink/pull/25298#issuecomment-2402337798 @flinkbot run azure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-36455) Sink should commit everything on notifyCheckpointCompleted
Arvid Heise created FLINK-36455: --- Summary: Sink should commit everything on notifyCheckpointCompleted Key: FLINK-36455 URL: https://issues.apache.org/jira/browse/FLINK-36455 Project: Flink Issue Type: Bug Components: API / Core Reporter: Arvid Heise Fix For: 2.0-preview Currently, we retry committables some time later until they eventually succeed. However, that violates the contract of notifyCheckpointCompleted, which states that all side effects must be committed before the method returns. In particular, notifyCheckpointCompleted must fail if we cannot guarantee that all side effects are committed for final checkpoints. As soon as notifyCheckpointCompleted returns, the final checkpoint is deemed completed, which currently may mean that some transactions are still open. The solution is that all retries must happen in a closed loop in notifyCheckpointCompleted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-36448) Compilation fails in IntelliJ
[ https://issues.apache.org/jira/browse/FLINK-36448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887865#comment-17887865 ] Xintong Song commented on FLINK-36448: -- [~mallikarjuna], I'm not entirely sure, but this seems related: [https://stackoverflow.com/questions/69306905/error-package-sun-rmi-registry-is-declared-in-module-java-rmi-which-does-not] I think we have added such options in pom files, which might not be recognized by the IDE. Have you tried "Reload All Maven Projects" in the IDE? cc [~sxnan] > Compilation fails in IntelliJ > - > > Key: FLINK-36448 > URL: https://issues.apache.org/jira/browse/FLINK-36448 > Project: Flink > Issue Type: Bug > Components: API / Core >Reporter: Kumar Mallikarjuna >Priority: Major > Labels: build, build-failure, intellij > > I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and > tests work just fine when using Maven directly. However, building the project > and running unit tests with IntelliJ fail with the below error: > > {code:java} > /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java > package sun.rmi.registry is not visible > (package sun.rmi.registry is declared in module java.rmi, which does not > export it to the unnamed module) > sun.rmi.registry{code} > I also tried the "Use '--release' option for cross-compilation" setting in > IntelliJ but that leads to the same error. I am using JDK 11 for compilation. > cc: [~mapohl] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
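For reference, the JDK option that makes the package visible again has the form `--add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED` (derived directly from the module and package named in the error message); the open question above is whether IntelliJ picks such compiler arguments up from the pom files.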
[PR] [FLINK-36450] exclude log4j from curator-test [flink-kubernetes-operator]
r-sidd opened a new pull request, #889: URL: https://github.com/apache/flink-kubernetes-operator/pull/889 ## What is the purpose of the change Exclude the vulnerable log4j dependency from curator-test ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changes to the `CustomResourceDescriptors`: no - Core observer or reconciler logic that is regularly executed: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36450) Exclude log4j from curator-test
[ https://issues.apache.org/jira/browse/FLINK-36450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36450: --- Labels: pull-request-available (was: ) > Exclude log4j from curator-test > --- > > Key: FLINK-36450 > URL: https://issues.apache.org/jira/browse/FLINK-36450 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.10.0 >Reporter: Siddharth R >Priority: Major > Labels: pull-request-available > > Exclude vulnerable log4j dependency from curator-test:
> <exclusion>
>   <groupId>log4j</groupId>
>   <artifactId>log4j</artifactId>
> </exclusion>
-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [FLINK-36064][runtime] Refactor REST API for Client and Operator Coordinator Communication to Use String OperatorUid. [flink]
zhuzhurk closed pull request #25227: [FLINK-36064][runtime] Refactor REST API for Client and Operator Coordinator Communication to Use String OperatorUid. URL: https://github.com/apache/flink/pull/25227 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792981064 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/assigner/HybridSplitAssigner.java: ## @@ -137,6 +161,7 @@ public Optional getNext() { // assigning the stream split. Otherwise, records emitted from stream split // might be out-of-order in terms of same primary key with snapshot splits. isStreamSplitAssigned = true; +enumeratorMetrics.enterStreamReading(); Review Comment: > If `isNewlyAddedAssigningFinished(snapshotSplitAssigner.getAssignerStatus())`, this should invoke `enumeratorMetrics.enterStreamReading();`. Yes, I did miss it. I have already made adjustments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792984158 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java: ## @@ -60,4 +110,146 @@ public void recordFetchDelay(long fetchDelay) { public void addNumRecordsInErrors(long delta) { this.numRecordsInErrorsCounter.inc(delta); } + +public void updateLastReceivedEventTime(Long eventTimestamp) { +if (eventTimestamp != null && eventTimestamp > 0L) { +lastReceivedEventTime = eventTimestamp; +} +} + +public void markRecord() { +metricGroup.getIOMetricGroup().getNumRecordsInCounter().inc(); +} + +public void updateRecordCounters(SourceRecord record) { +catchAndWarnLogAllExceptions( +() -> { +// Increase reader and table level input counters +if (isDataChangeRecord(record)) { +TableMetrics tableMetrics = getTableMetrics(getTableId(record)); +Envelope.Operation op = Envelope.operationFor(record); +switch (op) { +case READ: +snapshotCounter.inc(); +tableMetrics.markSnapshotRecord(); +break; +case CREATE: +insertCounter.inc(); +tableMetrics.markInsertRecord(); +break; +case DELETE: +deleteCounter.inc(); +tableMetrics.markDeleteRecord(); +break; +case UPDATE: +updateCounter.inc(); +tableMetrics.markUpdateRecord(); +break; +} +} else if (isSchemaChangeEvent(record)) { +schemaChangeCounter.inc(); +TableId tableId = getTableId(record); +if (tableId != null) { +getTableMetrics(tableId).markSchemaChangeRecord(); +} +} +}); +} + +private TableMetrics getTableMetrics(TableId tableId) { +return tableMetricsMap.computeIfAbsent( +tableId, +id -> new TableMetrics(id.catalog(), id.schema(), id.table(), metricGroup)); +} + +// --- Helper functions - + +private void catchAndWarnLogAllExceptions(Runnable runnable) { +try { +runnable.run(); +} catch (Exception e) { +// Catch all exceptions as errors in metric handling should not fail the job +LOG.warn("Failed to update metrics", e); +} +} + +private long getCurrentEventTimeLag() { +if (lastReceivedEventTime == UNDEFINED) { +return UNDEFINED; +} +return SystemClock.getInstance().absoluteTimeMillis() - lastReceivedEventTime; +} + +// --- Helper classes + +/** + * Collection class for managing metrics of a table. + * + * Metrics of table level are registered in its corresponding subgroup under the {@link + * SourceReaderMetricGroup}. + */ +private static class TableMetrics { +// Snapshot + Stream +private final Counter recordsCounter; + +// Snapshot phase +private final Counter snapshotCounter; + +// Stream phase +private final Counter insertCounter; +private final Counter updateCounter; +private final Counter deleteCounter; +private final Counter schemaChangeCounter; + +public TableMetrics( +String databaseName, String schemaName, String tableName, MetricGroup parentGroup) { +databaseName = processNull(databaseName); +schemaName = processNull(schemaName); +tableName = processNull(tableName); Review Comment: > Will there be a problem if we provide an empty string for the group? MetricGroup does not allow value to be null. In order to be compatible with multi-layer models of different databases, null is uniformly treated as an empty string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
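A hedged guess at the helper being discussed (`processNull` appears in the diff; this body is the obvious implementation implied by the reply, not taken from the PR):

```java
// Metric group names must be non-null, so absent catalog/schema levels
// collapse to the empty string for cross-database compatibility.
private static String processNull(String value) {
    return value == null ? "" : value;
}
```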
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792988134 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java: ## @@ -39,14 +80,23 @@ public class SourceReaderMetrics { /** The total number of record that failed to consume, process or emit. */ private final Counter numRecordsInErrorsCounter; +private volatile long lastReceivedEventTime = UNDEFINED; +private volatile long currentReadTimestampMs = UNDEFINED; Review Comment: Ok, it has been adjusted -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792988461 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java: ## @@ -60,4 +110,146 @@ public void recordFetchDelay(long fetchDelay) { public void addNumRecordsInErrors(long delta) { this.numRecordsInErrorsCounter.inc(delta); } + +public void updateLastReceivedEventTime(Long eventTimestamp) { +if (eventTimestamp != null && eventTimestamp > 0L) { +lastReceivedEventTime = eventTimestamp; +} +} + +public void markRecord() { +metricGroup.getIOMetricGroup().getNumRecordsInCounter().inc(); Review Comment: Ok, it has been adjusted -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792989625 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/metrics/SourceReaderMetrics.java: ## @@ -39,14 +80,23 @@ public class SourceReaderMetrics { /** The total number of record that failed to consume, process or emit. */ private final Counter numRecordsInErrorsCounter; +private volatile long lastReceivedEventTime = UNDEFINED; +private volatile long currentReadTimestampMs = UNDEFINED; + public SourceReaderMetrics(SourceReaderMetricGroup metricGroup) { this.metricGroup = metricGroup; this.numRecordsInErrorsCounter = metricGroup.getNumRecordsInErrorsCounter(); -} -public void registerMetrics() { metricGroup.gauge( MetricNames.CURRENT_FETCH_EVENT_TIME_LAG, (Gauge) this::getFetchDelay); +metricGroup.gauge(CURRENT_READ_TIMESTAMP_MS, () -> currentReadTimestampMs); Review Comment: Yes, I have already deleted it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792982379 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/assigner/HybridSplitAssigner.java: ## @@ -137,6 +161,7 @@ public Optional getNext() { // assigning the stream split. Otherwise, records emitted from stream split // might be out-of-order in terms of same primary key with snapshot splits. isStreamSplitAssigned = true; +enumeratorMetrics.enterStreamReading(); Review Comment: > `enumeratorMetrics.exitStreamReading()` should be invoked in `addSplits`. This is handled inside `snapshotSplitAssigner.addSplits`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36315][cdc-base]The flink-cdc-base module supports source metric statistics [flink-cdc]
liuxiao2shf commented on code in PR #3619: URL: https://github.com/apache/flink-cdc/pull/3619#discussion_r1792986970 ## flink-cdc-connect/flink-cdc-source-connectors/flink-cdc-base/src/main/java/org/apache/flink/cdc/connectors/base/source/reader/IncrementalSourceRecordEmitter.java: ## @@ -169,9 +172,14 @@ protected void reportMetrics(SourceRecord element) { } } -private static class OutputCollector implements Collector { -private SourceOutput output; -private Long currentMessageTimestamp; +/** + * Collector for outputting records. + * + * @param + */ Review Comment: Ok, it has been adjusted -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility
yux created FLINK-36453: --- Summary: Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility Key: FLINK-36453 URL: https://issues.apache.org/jira/browse/FLINK-36453 Project: Flink Issue Type: Bug Components: Flink CDC Reporter: yux -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36452) Avoid EOFException when reading binlog after a large table has just completed snapshot reading
LvYanquan created FLINK-36452: - Summary: Avoid EOFException when reading binlog after a large table has just completed snapshot reading Key: FLINK-36452 URL: https://issues.apache.org/jira/browse/FLINK-36452 Project: Flink Issue Type: Improvement Components: Flink CDC Affects Versions: cdc-3.3.0 Reporter: LvYanquan Fix For: cdc-3.3.0 When reading the binlog after a large table has just completed snapshot reading, we will need to compare the binlog position with all previous chunks to determine if this data has been processed before. However, this process is quite time-consuming and may lead to an EOFException in the binlog client. We should try to shorten the comparison time as much as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
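One hedged sketch of the direction this suggests (the types and method names below borrow loosely from the CDC codebase but are illustrative; whether this matches the eventual fix is an assumption): precompute the maximum high watermark across finished chunks so most binlog records skip the per-chunk scan entirely.

```java
import java.util.List;

// Compare against the precomputed maximum first; only records at or below
// it need the expensive scan over every finished snapshot chunk.
final class ChunkPositionIndex {
    private final BinlogOffset maxHighWatermark; // max over all finished chunks

    ChunkPositionIndex(List<FinishedSnapshotSplitInfo> finishedChunks) {
        this.maxHighWatermark =
                finishedChunks.stream()
                        .map(FinishedSnapshotSplitInfo::getHighWatermark)
                        .max(BinlogOffset::compareTo)
                        .orElse(null);
    }

    boolean alreadyProcessed(BinlogOffset position) {
        if (maxHighWatermark == null || position.compareTo(maxHighWatermark) > 0) {
            return false; // beyond every chunk: fast path, no scan needed
        }
        return scanAllChunks(position); // rare slow path (existing logic)
    }

    private boolean scanAllChunks(BinlogOffset position) {
        return true; // placeholder for the existing per-chunk comparison
    }
}
```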
[jira] [Updated] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility
[ https://issues.apache.org/jira/browse/FLINK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yux updated FLINK-36453: Description: DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in versions prior to 2.2. We may backport this bugfix for now before updating to Debezium 2.x. > Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility > --- > > Key: FLINK-36453 > URL: https://issues.apache.org/jira/browse/FLINK-36453 > Project: Flink > Issue Type: Bug > Components: Flink CDC >Reporter: yux >Priority: Major > > DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in > versions prior to 2.2. > We may backport this bugfix for now before updating to Debezium 2.x. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility
[ https://issues.apache.org/jira/browse/FLINK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887877#comment-17887877 ] yux commented on FLINK-36453: - I'd love to take this ticket. [~leonard] > Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility > --- > > Key: FLINK-36453 > URL: https://issues.apache.org/jira/browse/FLINK-36453 > Project: Flink > Issue Type: Bug > Components: Flink CDC >Reporter: yux >Priority: Major > > DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in > versions prior to 2.2. > We may backport this bugfix for now before updating to Debezium 2.x. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36322) Fix compile error of flink benchmark caused by breaking changes
[ https://issues.apache.org/jira/browse/FLINK-36322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36322: --- Labels: pull-request-available (was: ) > Fix compile error of flink benchmark caused by breaking changes > --- > > Key: FLINK-36322 > URL: https://issues.apache.org/jira/browse/FLINK-36322 > Project: Flink > Issue Type: Technical Debt > Components: Benchmarks >Reporter: Zakelly Lan >Assignee: Zakelly Lan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36452) Avoid EOFException when reading binlog after a large table has just completed snapshot reading
[ https://issues.apache.org/jira/browse/FLINK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36452: --- Labels: pull-request-available (was: ) > Avoid EOFException when reading binlog after a large table has just completed > snapshot reading > -- > > Key: FLINK-36452 > URL: https://issues.apache.org/jira/browse/FLINK-36452 > Project: Flink > Issue Type: Improvement > Components: Flink CDC >Affects Versions: cdc-3.3.0 >Reporter: LvYanquan >Priority: Major > Labels: pull-request-available > Fix For: cdc-3.3.0 > > > When reading the binlog after a large table has just completed snapshot reading, > we will need to compare the binlog position with all previous chunks to > determine if this data has been processed before. However, this process is > quite time-consuming and may lead to an EOFException in the binlog client. > We should try to shorten the comparison time as much as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-36453) Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility
[ https://issues.apache.org/jira/browse/FLINK-36453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leonard Xu reassigned FLINK-36453: -- Assignee: yux > Backport DBZ-6502 patch to fix Oracle 23.x JDBC driver compatibility > --- > > Key: FLINK-36453 > URL: https://issues.apache.org/jira/browse/FLINK-36453 > Project: Flink > Issue Type: Bug > Components: Flink CDC >Reporter: yux >Assignee: yux >Priority: Major > > DBZ-6502 fixes Oracle JDBC 23.x compatibility, but it wasn't shipped in > versions prior to 2.2. > We may backport this bugfix for now before updating to Debezium 2.x. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (FLINK-36452) Avoid EOFException when reading binlog after a large table has just completed snapshot reading
[ https://issues.apache.org/jira/browse/FLINK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leonard Xu reassigned FLINK-36452: -- Assignee: LvYanquan > Avoid EOFException when reading binlog after a large table has just completed > snapshot reading > -- > > Key: FLINK-36452 > URL: https://issues.apache.org/jira/browse/FLINK-36452 > Project: Flink > Issue Type: Improvement > Components: Flink CDC >Affects Versions: cdc-3.3.0 >Reporter: LvYanquan >Assignee: LvYanquan >Priority: Major > Labels: pull-request-available > Fix For: cdc-3.3.0 > > > When reading the binlog after a large table has just completed snapshot reading, > we will need to compare the binlog position with all previous chunks to > determine if this data has been processed before. However, this process is > quite time-consuming and may lead to an EOFException in the binlog client. > We should try to shorten the comparison time as much as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36448) Compilation fails in IntelliJ
Kumar Mallikarjuna created FLINK-36448: -- Summary: Compilation fails in IntelliJ Key: FLINK-36448 URL: https://issues.apache.org/jira/browse/FLINK-36448 Project: Flink Issue Type: Bug Components: API / Core Reporter: Kumar Mallikarjuna I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and tests work just fine when using Maven directly. However, building the project and running unit tests with IntelliJ fail with the below error: {code:java} /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java package sun.rmi.registry is not visible (package sun.rmi.registry is declared in module java.rmi, which does not export it to the unnamed module) sun.rmi.registry{code} I also tried the "Use '--release' option for cross-compilation" setting in IntelliJ but that leads to the same error. I am using JDK 11 for compilation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-36448) Compilation fails in IntelliJ
[ https://issues.apache.org/jira/browse/FLINK-36448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887846#comment-17887846 ] Kumar Mallikarjuna commented on FLINK-36448: Hi [~xtsong], would you know what's wrong here? > Compilation fails in IntelliJ > - > > Key: FLINK-36448 > URL: https://issues.apache.org/jira/browse/FLINK-36448 > Project: Flink > Issue Type: Bug > Components: API / Core >Reporter: Kumar Mallikarjuna >Priority: Major > Labels: build, build-failure, intellij > > I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and > tests work just fine when using Maven directly. However, building the project > and running unit tests with IntelliJ fail with the below error: > > {code:java} > /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java > package sun.rmi.registry is not visible > (package sun.rmi.registry is declared in module java.rmi, which does not > export it to the unnamed module) > sun.rmi.registry{code} > I also tried the "Use '--release' option for cross-compilation" setting in > IntelliJ but that leads to the same error. I am using JDK 11 for compilation. > cc: [~mapohl] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-36448) Compilation fails in IntelliJ
[ https://issues.apache.org/jira/browse/FLINK-36448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kumar Mallikarjuna updated FLINK-36448: --- Description: I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and tests work just fine when using Maven directly. However, building the project and running unit tests with IntelliJ fail with the below error: {code:java} /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java package sun.rmi.registry is not visible (package sun.rmi.registry is declared in module java.rmi, which does not export it to the unnamed module) sun.rmi.registry{code} I also tried the "Use '--release' option for cross-compilation" setting in IntelliJ but that leads to the same error. I am using JDK 11 for compilation. cc: [~mapohl] > Compilation fails in IntelliJ > - > > Key: FLINK-36448 > URL: https://issues.apache.org/jira/browse/FLINK-36448 > Project: Flink > Issue Type: Bug > Components: API / Core >Reporter: Kumar Mallikarjuna >Priority: Major > Labels: build, build-failure, intellij > > I'm unable to build master (as of "c9c946f") with IntelliJ. Compilation and > tests work just fine when using Maven directly. However, building the project > and running unit tests with IntelliJ fail with the below error: > > {code:java} > /Users/kumarmallikarjuna/Workspace/confluent/flink/flink-core/src/main/java/org/apache/flink/management/jmx/JMXServer.java > package sun.rmi.registry is not visible > (package sun.rmi.registry is declared in module java.rmi, which does not > export it to the unnamed module) > sun.rmi.registry{code} > I also tried the "Use '--release' option for cross-compilation" setting in > IntelliJ but that leads to the same error. I am using JDK 11 for compilation. > cc: [~mapohl] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-36449) Bump apache-rat-plugin from 0.12 to 0.16.1
Siddharth R created FLINK-36449: --- Summary: Bump apache-rat-plugin from 0.12 to 0.16.1 Key: FLINK-36449 URL: https://issues.apache.org/jira/browse/FLINK-36449 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.10.0 Reporter: Siddharth R Bump apache-rat-plugin from 0.12 to 0.16.1 to remediate the underlying vulnerabilities in the dependencies. Vulnerabilities from dependencies: [CVE-2022-4245|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4245] [CVE-2022-4244|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-4244] [CVE-2020-15250|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-15250] -- This message was sent by Atlassian Jira (v8.20.10#820010)
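For reference, a plugin bump of this kind is a one-line version change in the operator's pom.xml; the coordinates below are the plugin's standard ones, and the surrounding configuration is elided:

{code:xml}
<!-- pom.xml: bump the RAT plugin so patched transitive dependencies are pulled in -->
<plugin>
  <groupId>org.apache.rat</groupId>
  <artifactId>apache-rat-plugin</artifactId>
  <version>0.16.1</version>
</plugin>
{code}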
[PR] [FLINK-35363] Architecture tests refactor [flink-connector-jdbc]
eskabetxe opened a new pull request, #143: URL: https://github.com/apache/flink-connector-jdbc/pull/143 Created the module `flink-connector-jdbc-architecture` to hold all architecture tests; the other modules need to be added as dependencies of it so that they are validated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-36437][hive] Remove hive connector dependency usage [flink]
flinkbot commented on PR #25467: URL: https://github.com/apache/flink/pull/25467#issuecomment-2401656939 ## CI report: * 8675be3d2aef7c1bd59b40573b01741181702929 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-36454) FlinkContainer always overwrites default config
Arvid Heise created FLINK-36454: --- Summary: FlinkContainer always overwrites default config Key: FLINK-36454 URL: https://issues.apache.org/jira/browse/FLINK-36454 Project: Flink Issue Type: Bug Components: Test Infrastructure Affects Versions: 2.0-preview Reporter: Arvid Heise Because FlinkContainer disregards the default config, some meaningful options are lost when executing containers in tests. For example, because we overwrite env.java.opts.all, all --add-opens flags are lost, leading to exceptions like {code:java} 10:26:26,171 [main] ERROR org.apache.flink.connector.testframe.container.FlinkContainers [] - java.lang.ExceptionInInitializerError at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:109) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:1026) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:247) at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1270) at org.apache.flink.client.cli.CliFrontend.lambda$mainInternal$10(CliFrontend.java:1367) at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) at org.apache.flink.client.cli.CliFrontend.mainInternal(CliFrontend.java:1367) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1335) Caused by: java.lang.RuntimeException: Can not register process function transformation translator. at org.apache.flink.datastream.impl.ExecutionEnvironmentImpl.<init>(ExecutionEnvironmentImpl.java:98) ... 8 more Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field private final java.util.Map java.util.Collections$UnmodifiableMap.m accessible: module java.base does not "opens java.util" to unnamed module @771b8d5c at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(Unknown Source) at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(Unknown Source) at java.base/java.lang.reflect.Field.checkCanSetAccessible(Unknown Source) at java.base/java.lang.reflect.Field.setAccessible(Unknown Source) at org.apache.flink.streaming.runtime.translators.DataStreamV2SinkTransformationTranslator.registerSinkTransformationTranslator(DataStreamV2SinkTransformationTranslator.java:104) at org.apache.flink.datastream.impl.ExecutionEnvironmentImpl.<init>(ExecutionEnvironmentImpl.java:96) ... 8 more {code} FlinkImageBuilder should read the default config and only overwrite the custom settings in {code:java} private Path createTemporaryFlinkConfFile(Configuration finalConfiguration, Path tempDirectory) throws IOException { Path flinkConfFile = tempDirectory.resolve(GlobalConfiguration.FLINK_CONF_FILENAME); Files.write( flinkConfFile, ConfigurationUtils.convertConfigToWritableLines(finalConfiguration, false)); return flinkConfFile; } {code} The workaround is to set the options manually in the test, but those may become outdated on version upgrades. -- This message was sent by Atlassian Jira (v8.20.10#820010)
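A minimal sketch of the suggested direction, assuming the default conf directory is available to the builder (the `defaultConfDirectory` handle and the merge order are assumptions, not the actual patch):

{code:java}
private Path createTemporaryFlinkConfFile(Configuration customConfiguration, Path tempDirectory)
        throws IOException {
    // Start from the distribution's defaults instead of an empty configuration ...
    Configuration merged = GlobalConfiguration.loadConfiguration(defaultConfDirectory.toString());
    // ... then overlay only the custom entries, so defaults such as
    // env.java.opts.all (carrying the --add-opens flags) survive.
    merged.addAll(customConfiguration);
    Path flinkConfFile = tempDirectory.resolve(GlobalConfiguration.FLINK_CONF_FILENAME);
    Files.write(
            flinkConfFile,
            ConfigurationUtils.convertConfigToWritableLines(merged, false));
    return flinkConfFile;
}
{code}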
[PR] [hotfix] Fix test failures due to data race after migration to Fabric8 interceptor [flink-kubernetes-operator]
aplyusnin opened a new pull request, #891: URL: https://github.com/apache/flink-kubernetes-operator/pull/891 Tests that perform requests now usually fail on CI due to race conditions. To fix this, I added a busy-wait for metrics with a timeout, as is done in RestApiMetricsCollectorTest#testJmMetricCollection(). I also added an extra delay in RestApiMetricsCollectorTest#testJmMetricCollection(); it helps avoid CI failures on my workflow runs. An alternative fix would be to block inside the tests until the requests have been performed, but that requires a lot of modifications. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changes to the `CustomResourceDescriptors`: no - Core observer or reconciler logic that is regularly executed: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not documented -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
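For readers unfamiliar with the pattern: a busy-wait polls the observed state until it appears or a deadline passes, instead of asserting immediately. A generic sketch under assumed names (`metricsSupplier` and `awaitNonEmpty` are illustrative, not the actual test code):

```java
import java.util.Collection;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Poll until the supplier yields a non-empty collection or the timeout elapses.
static <T> Collection<T> awaitNonEmpty(Supplier<Collection<T>> metricsSupplier, long timeoutSeconds)
        throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(timeoutSeconds);
    while (metricsSupplier.get().isEmpty() && System.nanoTime() < deadline) {
        Thread.sleep(50); // yield between polls so the async request can land
    }
    return metricsSupplier.get(); // assert on the result only after this returns
}
```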
Re: [PR] [hotfix] Fix test failures due to data race after migration to Fabric8 interceptor [flink-kubernetes-operator]
aplyusnin commented on PR #891: URL: https://github.com/apache/flink-kubernetes-operator/pull/891#issuecomment-2403691503 ```kubernetesClient.resource(deployment).delete()``` has to be a blocking call, but when I ran the tests with some extra logging, I saw that the 404 response code doesn't appear before the assertions on CI (I don't actually know why; I didn't hit that problem locally). After some experiments, I concluded that this approach fixes the problem. I agree that it is not a good way to fix the problem, but I don't know how to properly add extra blocking for the main test thread to get rid of it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-25546][flink-connector-base][JUnit5 Migration] Module: flink-connector-base [flink]
Jiabao-Sun commented on code in PR #21161: URL: https://github.com/apache/flink/pull/21161#discussion_r1794452407 ## flink-connectors/flink-connector-base/src/test/java/org/apache/flink/connector/base/sink/AsyncSinkBaseITCase.java: ## @@ -32,7 +32,7 @@ /** Integration tests of a baseline generic sink that implements the AsyncSinkBase. */ @ExtendWith(TestLoggerExtension.class) Review Comment: ```suggestion ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36458) Use original 'OceanBaseCEContainer' of testcontainers in testing
[ https://issues.apache.org/jira/browse/FLINK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36458: --- Labels: pull-request-available (was: ) > Use original 'OceanBaseCEContainer' of testcontainers in testing > > > Key: FLINK-36458 > URL: https://issues.apache.org/jira/browse/FLINK-36458 > Project: Flink > Issue Type: Improvement > Components: Connectors / JDBC >Reporter: He Wang >Priority: Minor > Labels: pull-request-available > > The Docker image of OceanBase has been refactored, which caused some > incompatibilities with the latest testcontainers, so an 'OceanBaseContainer' > implementation was introduced in FLINK-34572. Now the latest testcontainers > are fully compatible with the new Docker image, so we can use the original > 'OceanBaseCEContainer' to replace the existing 'OceanBaseContainer'. -- This message was sent by Atlassian Jira (v8.20.10#820010)
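For illustration, switching back to the stock testcontainers class would look roughly like this; the image tag is an assumption, not taken from the ticket:

{code:java}
import org.testcontainers.oceanbase.OceanBaseCEContainer;
import org.testcontainers.utility.DockerImageName;

// Stock container from the org.testcontainers:oceanbase module,
// replacing the custom OceanBaseContainer introduced in FLINK-34572.
OceanBaseCEContainer oceanbase =
        new OceanBaseCEContainer(DockerImageName.parse("oceanbase/oceanbase-ce:4.2.1"));
oceanbase.start();
String jdbcUrl = oceanbase.getJdbcUrl(); // hand this to the JDBC connector tests
{code}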
Re: [PR] [FLINK-30899][connector/filesystem] Fix FileSystemTableSource incorrectly selecting fields on partitioned tables [flink]
flinkbot commented on PR #25476: URL: https://github.com/apache/flink/pull/25476#issuecomment-2404177936 ## CI report: * 4cf5dec3610ec538a01e66210e753478fb6ff216 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [FLINK-30899][connector/filesystem] Fix FileSystemTableSource incorrectly selecting fields on partitioned tables [flink]
flinkbot commented on PR #25477: URL: https://github.com/apache/flink/pull/25477#issuecomment-2404181439 ## CI report: * ea2e19fbae18568d95b7195492cbe3ef448f1892 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [FLINK-36460] Refactoring the CI matrix [flink-kubernetes-operator]
SamBarker opened a new pull request, #893: URL: https://github.com/apache/flink-kubernetes-operator/pull/893 ## What is the purpose of the change As observed in [FLINK-36332](https://github.com/apache/flink-kubernetes-operator/pull/881#issuecomment-2378765602), the CI workflow has become difficult to manage. This PR attempts to split the tests along sensible lines to make the matrix more manageable. ## Brief change log Simplify CI workflows. ## Verifying this change This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): **no** - The public API, i.e., is any changes to the `CustomResourceDescriptors`: **no** - Core observer or reconciler logic that is regularly executed: **no** ## Documentation - Does this pull request introduce a new feature? **no** - If yes, how is the feature documented? **not applicable** -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (FLINK-36460) Refactor Kubernetes Operator Edge to Edge test setup
[ https://issues.apache.org/jira/browse/FLINK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated FLINK-36460: --- Labels: pull-request-available (was: ) > Refactor Kubernetes Operator Edge to Edge test setup > > > Key: FLINK-36460 > URL: https://issues.apache.org/jira/browse/FLINK-36460 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Sam Barker >Priority: Major > Labels: pull-request-available > > The edge to edge operator tests are getting difficult to control and evolve, > as was noticed in FLINK-36332. > We should try to find clear seams along which to break the test suite up, to > avoid running lots of extra tests or creating mutually exclusive include and > exclude conditions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] Fix flaky tests in classes [flink]
nikunjagarwal321 opened a new pull request, #25471: URL: https://github.com/apache/flink/pull/25471 ## What is the purpose of the change [NonDex](https://github.com/TestingResearchIllinois/NonDex) is a tool for detecting and debugging wrong assumptions on under-determined Java APIs. Running the tests under NonDex surfaced flaky tests in the following classes: - org.apache.flink.table.planner.plan.batch.sql.PartitionableSourceTest - org.apache.flink.table.planner.plan.rules.logical.PushPartitionIntoTableSourceScanRuleTest The flaky tests can be reproduced with the following command: mvn edu.illinois:nondex-maven-plugin:2.1.7:nondex -Dtest={test} Sample error: ``` PartitionableSourceTest.testUnconvertedExpression:136 optimized exec plan ==> expected: < Calc(select=[id, name, part1, part2, (part2 + 1) AS virtualField]) +- TableSourceScan(table=[[default_catalog, default_database, PartitionableTable, partitions=[{part1=A, part2=2}]]], fields=[id, name, part1, part2]) > but was: < Calc(select=[id, name, part1, part2, (part2 + 1) AS virtualField]) +- TableSourceScan(table=[[default_catalog, default_database, PartitionableTable, partitions=[{part2=2, part1=A}]]], fields=[id, name, part1, part2]) > ``` The root cause: `PartitionPushDownSpec.getDigests()` converts the partition Maps to Strings, and the iteration order of a Map is nondeterministic, while the expected plans are hardcoded in XML test files with only one ordering. The fix is to impose an ordering in `PartitionPushDownSpec` and `PushPartitionIntoTableSourceScanRuleTest` when converting the Maps to Strings, so the plan digest is deterministic and the tests become stable. ## Brief change log - Sort the partition map entries in `PartitionPushDownSpec#getDigests()` (and in the corresponding test) before building the digest string. ## Verifying this change This change is already covered by existing tests, such as *org.apache.flink.table.planner.plan.batch.sql.PartitionableSourceTest*. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: don't know - The runtime per-record code paths (performance sensitive): no - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
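As a sketch of the idea (names simplified and hypothetical, not the actual patch), sorting the entries by key before joining makes the digest independent of the Map's iteration order:

```java
import java.util.Map;
import java.util.stream.Collectors;

// Deterministic digest: sort the partition entries by key, then join.
static String partitionDigest(Map<String, String> partition) {
    return partition.entrySet().stream()
            .sorted(Map.Entry.comparingByKey())           // fix the iteration order
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(", ", "{", "}")); // e.g. {part1=A, part2=2}
}
```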
Re: [PR] [hotfix][state/forst] Remove string.format() in critical path [flink]
flinkbot commented on PR #25473: URL: https://github.com/apache/flink/pull/25473#issuecomment-2403913805 ## CI report: * dd54c09afab52f3178da592cd9196ad3036e66b4 UNKNOWN Bot commands The @flinkbot bot supports the following commands: - `@flinkbot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (FLINK-36460) Refactor Kubernetes Operator Edge to Edge test setup
Sam Barker created FLINK-36460: -- Summary: Refactor Kubernetes Operator Edge to Edge test setup Key: FLINK-36460 URL: https://issues.apache.org/jira/browse/FLINK-36460 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Sam Barker The edge to edge operator tests are getting difficult to control and evolve, as was noticed in FLINK-36332. We should try to find clear seams along which to break the test suite up, to avoid running lots of extra tests or creating mutually exclusive include and exclude conditions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [hotfix][state/forst] Remove string.format() in critical path [flink]
fredia opened a new pull request, #25473: URL: https://github.com/apache/flink/pull/25473 ## What is the purpose of the change This PR removes String.format() from the critical path in ForSt map state. ## Brief change log - Remove String.format() from `ForStMapState#buildDBPutRequest` ## Verifying this change Please make sure both new and modified tests in this PR follow [the conventions for tests defined in our code quality guide](https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#7-testing). This change is a trivial rework / code cleanup without any test coverage. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (no) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no) - The serializers: (no) - The runtime per-record code paths (performance sensitive): (yes) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no) - The S3 file system connector: (no) ## Documentation - Does this pull request introduce a new feature? (no) - If yes, how is the feature documented? (not applicable) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
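For context on why this matters on hot paths (illustrative names, not the actual ForSt code): String.format re-parses its format string on every invocation, which is measurable on per-record paths, while plain concatenation compiles down to a StringBuilder chain:

```java
// Hot-path anti-pattern: the format string is parsed on every record.
String key = String.format("%s#%s", stateName, userKey);

// Cheaper equivalent for simple cases; the compiler emits a StringBuilder.
String key2 = stateName + "#" + userKey;
```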