[ https://issues.apache.org/jira/browse/FLINK-34991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17910366#comment-17910366 ]
david radley commented on FLINK-34991:
--------------------------------------

Hi, I thought this might relate to your issue. We hit a classloading issue around UDFs. It was fixed in master after a big refactor of job scheduling. We have a PR [https://github.com/apache/flink/pull/25656] that we are looking to use to fix this at 1.20, if we can be convinced there are no side effects. I suggest checking whether this still fails for you on master / v2.

> Flink Operator ClassPath Race Condition Bug
> -------------------------------------------
>
>                 Key: FLINK-34991
>                 URL: https://issues.apache.org/jira/browse/FLINK-34991
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.7.2
>         Environment: Standard Flink Operator with Flink Deployment.
> To recreate, just remove a critical SQL connector library from the bundled jar
>            Reporter: Ryan van Huuksloot
>            Priority: Minor
>
> Hello,
> I believe we've found a bug with the Job Managers of the Kubernetes Operator. I think there is a race condition, or an incorrect conditional, where the operator checks for High Availability instead of detecting a class-loading problem in the Job Manager.
>
> *Example:*
> When deploying a SQL Flink job, the job managers start in a failed state. Describing the FlinkDeployment returns the error message
> {code:java}
> RestoreFailed ... HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted.{code}
> Upon further investigation, however, the actual error was that the Job Manager's class loading was incorrect.
> This was a log in the Job Manager:
> {code:java}
> "Could not find any factory for identifier 'kafka' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.\n\nAvailable factory identifiers are:\n\nblackhole\ndatagen\nfilesystem\nprint","name":"org.apache.flink.table.api.ValidationException","extendedStackTrace":"org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kafka' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath.\n\nAvailable factory identifiers are:\n\nblackhole\ndatagen\nfilesystem\nprint\n\"{code}
> There is also logging in the operator:
> {code:java}
> ... Cannot discover a connector using option: 'connector'='kafka'
> 	at org.apache.flink.table.factories.FactoryUtil.enrichNoMatchingConnectorError(FactoryUtil.java:798)
> 	at org.apache.flink.table.factories.FactoryUtil.discoverTableFactory(FactoryUtil.java:772)
> 	at org.apache.flink.table.factories.FactoryUtil.createDynamicTableSource(FactoryUtil.java:215)
> 	... 52 more
> Caused by: org.apache.flink.table.api.ValidationException: Could not find any factory for identifier 'kafka' that implements 'org.apache.flink.table.factories.DynamicTableFactory' in the classpath
> ....{code}
> I think the operator should surface this error in the CRD, since the HA error is not the root cause.
>
> To recreate:
> All I did was remove `"org.apache.flink:flink-connector-kafka:$flinkConnectorKafkaVersion"` from my bundled jar, so the connector was missing from the class path. This was executing a Flink SQL job, which means the job manager starts before the class path error is thrown; that seems to be the issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
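For context on why removing the connector jar produces exactly this error: Flink discovers table factories via Java's `ServiceLoader`, which scans `META-INF/services` entries on the classpath. If the kafka connector jar is not bundled, no provider advertises the `kafka` identifier and discovery fails before the job can run. A minimal self-contained sketch of that discovery pattern, where the `ConnectorFactory` interface is a hypothetical stand-in for `DynamicTableFactory` (it is not the real Flink API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class FactoryDiscoveryDemo {
    // Hypothetical stand-in for org.apache.flink.table.factories.DynamicTableFactory.
    public interface ConnectorFactory {
        String factoryIdentifier();
    }

    // Mimics the shape of factory discovery: scan ServiceLoader providers
    // for one whose identifier matches the requested 'connector' option.
    public static String discover(String identifier) {
        List<String> available = new ArrayList<>();
        for (ConnectorFactory f : ServiceLoader.load(ConnectorFactory.class)) {
            if (identifier.equals(f.factoryIdentifier())) {
                return "found factory for identifier '" + identifier + "'";
            }
            available.add(f.factoryIdentifier());
        }
        // No META-INF/services entry on the classpath -> discovery fails,
        // analogous to the ValidationException when the kafka jar is missing.
        return "no factory for identifier '" + identifier + "'; available: " + available;
    }

    public static void main(String[] args) {
        System.out.println(discover("kafka"));
    }
}
```

Since no provider is registered in this sketch, `discover("kafka")` reports failure, which is the same mechanism behind the "Could not find any factory for identifier 'kafka'" log above.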