Robert Metzger created FLINK-31810:
--------------------------------------

             Summary: RocksDBException: Bad table magic number on checkpoint rescale
                 Key: FLINK-31810
                 URL: https://issues.apache.org/jira/browse/FLINK-31810
             Project: Flink
          Issue Type: Bug
          Components: Runtime / State Backends
    Affects Versions: 1.15.2
            Reporter: Robert Metzger
While rescaling a job from a checkpoint, I ran into this exception:
{code:java}
SinkMaterializer[7] -> rob-result[7]: Writer -> rob-result[7]: Committer (4/4)#3 (c1b348f7eed6e1ce0e41ef75338ae754) switched from INITIALIZING to FAILED with failure cause: java.lang.Exception: Exception while creating StreamOperatorStateContext.
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:255)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:265)
    at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:703)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:679)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:646)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:917)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for SinkUpsertMaterializer_7d9b7588bc2ff89baed50d7a4558caa4_(4/4) from any of the 1 provided restore options.
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
    ... 11 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395)
    at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483)
    at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
    ... 13 more
Caused by: java.io.IOException: Error while opening RocksDB instance.
    at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreDBInstanceFromStateHandle(RocksDBIncrementalRestoreOperation.java:465)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithRescaling(RocksDBIncrementalRestoreOperation.java:321)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:164)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:315)
    ... 18 more
Caused by: org.rocksdb.RocksDBException: Bad table magic number: expected 9863518390377041911, found 4096 in /tmp/job_00000000000000000000000000000000_op_SinkUpsertMaterializer_7d9b7588bc2ff89baed50d7a4558caa4__4_4__uuid_d5587dfc-78b3-427c-8cb6-35507b71bc4b/46475654-5515-430e-b215-389d42cddb97/000232.sst
    at org.rocksdb.RocksDB.open(Native Method)
    at org.rocksdb.RocksDB.open(RocksDB.java:306)
    at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)
    ... 22 more {code}
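The "Bad table magic number" means that the SST file transferred to local disk during restore does not end with RocksDB's 8-byte table magic number; finding a small value like 4096 there would be consistent with a tail that is mostly zeros, i.e. a truncated or partially written file. As far as I understand the block-based table format, the magic number sits in the last 8 bytes of the file (little-endian), so a quick sanity check along the following lines can tell whether a given .sst on disk is corrupted. This is only a hypothetical diagnostic sketch (the expected value is copied from the error message above), not anything from Flink or RocksDB:
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SstMagicCheck {

    // Expected block-based table magic number, copied from the error message above.
    // It does not fit into a signed long literal, hence parseUnsignedLong.
    private static final long EXPECTED_MAGIC = Long.parseUnsignedLong("9863518390377041911");

    public static void main(String[] args) throws IOException {
        // Pass the path of the suspicious file, e.g. the 000232.sst path from the stack trace.
        String path = args[0];

        try (RandomAccessFile sst = new RandomAccessFile(path, "r")) {
            long length = sst.length();
            System.out.println("file length: " + length + " bytes");

            if (length < 8) {
                System.out.println("file too short to contain an SST footer -> truncated");
                return;
            }

            // The table magic number occupies the last 8 bytes of the file, little-endian.
            byte[] tail = new byte[8];
            sst.seek(length - 8);
            sst.readFully(tail);
            long magic = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN).getLong();

            System.out.println("footer magic: " + Long.toUnsignedString(magic)
                    + (magic == EXPECTED_MAGIC
                            ? " (looks like a valid SST footer)"
                            : " (mismatch -> truncated, zero-filled or not an SST file)"));
        }
    }
}
{code}
Running this against the file under /tmp right after the failure (before the task directory is cleaned up) would at least show whether the corruption is already present in the locally downloaded file.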
I haven't found any other reports of this issue, except for https://github.com/facebook/rocksdb/issues/3438, which is not related to Flink. I'm filing this ticket to track whether other users are affected as well, and to collect information about the circumstances under which the issue occurs.

What I did to run into this situation:
a) killed the JobManager
b) removed the jobgraph entry from the cluster ConfigMap, so that the JobGraph would be regenerated with the changed parallelism
c) restarted the JobManager, which regenerated the JobGraph with the new parallelism
d) during state recovery, the above exception occurred

In my understanding, rescaling from a checkpoint is legal, because that is exactly what reactive mode does. Removing the jobgraph entry from the cluster ConfigMap should also be fine, because the Flink Kubernetes operator does the same thing.
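For context on the circumstances: the failing frames go through RocksDBIncrementalRestoreOperation.restoreWithRescaling, i.e. the restore of an incremental RocksDB checkpoint at a changed parallelism. Below is a rough sketch of the kind of job setup under which that code path is exercised; the checkpoint interval, storage URI and the pipeline itself are placeholders and not taken from the actual (SQL) job:
{code:java}
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalRocksDbJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints enabled; only incremental
        // checkpoints are restored via RocksDBIncrementalRestoreOperation (and
        // restoreWithRescaling, when restored at a different parallelism).
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.enableCheckpointing(60_000); // placeholder interval
        env.getCheckpointConfig().setCheckpointStorage("s3://bucket/checkpoints"); // placeholder URI

        // Placeholder keyed pipeline standing in for the real job whose
        // SinkUpsertMaterializer state failed to restore.
        env.fromElements(1, 2, 3, 4)
                .keyBy(value -> value % 2)
                .map(value -> value * 10)
                .print();

        env.execute("incremental-rocksdb-job");
    }
}
{code}
Anyone hitting the same exception presumably has incremental checkpoints enabled in a similar way, since a non-incremental snapshot takes a different restore path.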