TengHuo opened a new issue, #6208: URL: https://github.com/apache/hudi/issues/6208
**Describe the problem you faced**

A Flink append-only pipeline fails with a `FileNotFoundException`. The error shows that a parquet file cannot be found after a checkpoint is triggered.

**To Reproduce**

Steps to reproduce the behavior: very hard to reproduce; details to be added later.

**Expected behavior**

The Flink append-only pipeline runs properly.

**Environment Description**

* Hudi version : 0.11.1
* Spark version : N/A
* Hive version : N/A
* Hadoop version : 2.10
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Additional context**

I checked the HDFS audit log. It shows that the missing file was created, but then deleted just a few seconds later:

```log
2022-07-19 17:16:22,961 INFO FSNamesystem.audit: allowed=true ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE) ip=/xxx.xxx cmd=create src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet dst=null perm=xxxxxx:xxxxxx:rw-rw-r-- proto=rpc callerContext=clientIp:xx.xx.xx.242
2022-07-19 17:16:24,824 INFO FSNamesystem.audit: allowed=true ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE) ip=/xxx.xxx cmd=delete src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet dst=null perm=null proto=rpc callerContext=clientIp:xx.xx.xx.173
```

**Stacktrace**

```log
2022-07-19 17:16:24,465 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 6713 for job 20c9582448454df74af4e9424245709d (412626 bytes in 2183 ms).
2022-07-19 17:16:25,805 INFO org.apache.hudi.sink.StreamWriteOperatorCoordinator [] - Commit instant [20220719171325240] success!
2022-07-19 17:16:25,919 INFO org.apache.hudi.sink.StreamWriteOperatorCoordinator [] - Create instant [20220719171625805] for table [wt_test_hudi] with type [COPY_ON_WRITE]
2022-07-19 17:16:25,919 INFO org.apache.hudi.sink.StreamWriteOperatorCoordinator [] - Executor executes action [commits the instant 20220719171325240] success!
2022-07-19 17:16:26,734 INFO org.apache.hudi.sink.StreamWriteOperatorCoordinator [] - Executor executes action [sync hive metadata for instant 20220719171625805] success!
2022-07-19 17:19:21,805 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 6714 (type=CHECKPOINT) @ 1658222361802 for job 20c9582448454df74af4e9424245709d.
2022-07-19 17:19:21,810 INFO org.apache.hudi.sink.StreamWriteOperatorCoordinator [] - Executor executes action [taking checkpoint 6714] success!
2022-07-19 17:19:22,169 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), _UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy (10/128) (4dacfeb990f80a0fe30c810a062e2e3b) switched from RUNNING to FAILED on container_e1004_1656494149297_195917_01_000062 @ xxxxxxx(dataPort=39371).
java.lang.Exception: Could not perform checkpoint 6714 for operator Source: TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), _UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy (10/128)#0.
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1006)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$7(StreamTask.java:958)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:344)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:782)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 6714 for operator Source: TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), _UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy (10/128)#0. Failure reason: Checkpoint was declined.
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:264)
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1092)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1076)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:994)
    ... 13 more
Caused by: org.apache.hudi.exception.HoodieException: Error collect the write status for task [9]
    at org.apache.hudi.sink.bulk.BulkInsertWriterHelper.getWriteStatuses(BulkInsertWriterHelper.java:184)
    at org.apache.hudi.sink.append.AppendWriteFunction.flushData(AppendWriteFunction.java:123)
    at org.apache.hudi.sink.append.AppendWriteFunction.snapshotState(AppendWriteFunction.java:78)
    at org.apache.hudi.sink.common.AbstractStreamWriteFunction.snapshotState(AbstractStreamWriteFunction.java:157)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
    at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:89)
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:218)
    at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
    at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1092)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1076)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:994)
    ... 13 more
Caused by: java.io.FileNotFoundException: File does not exist: /xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet (inode 6318419053) Holder DFSClient_NONMAPREDUCE_1440496717_78 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2782)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:521)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:161)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2663)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:520)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1087)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1109)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1030)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3039)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1115)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1944)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1750)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:751)
Caused by: org.apache.hadoop.ipc.RemoteException: File does not exist: /xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet (inode 6318419053) Holder DFSClient_NONMAPREDUCE_1440496717_78 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2782)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:521)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:161)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2663)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:520)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1087)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1109)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1030)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3039)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1590)
    at org.apache.hadoop.ipc.Client.call(Client.java:1521)
    at org.apache.hadoop.ipc.Client.call(Client.java:1418)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:251)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:130)
    at com.sun.proxy.$Proxy27.addBlock(Unknown Source) ~[?:?]
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:472)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:449)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:175)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:167)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:105)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:375)
    at com.sun.proxy.$Proxy28.addBlock(Unknown Source) ~[?:?]
    at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1112)
    at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1944)
    at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1750)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:751)
2022-07-19 17:19:22,200 WARN org.apache.hudi.sink.StreamWriteOperatorCoordinator [] - Reset the event for task [9]
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
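
For anyone trying to confirm the same pattern on their own cluster, the audit-log check described under "Additional context" could be automated roughly as follows. This is only a sketch, not part of Hudi or Hadoop: the function name `short_lived_files`, the regular expression, and the 10-second threshold are assumptions made for illustration, and the pattern assumes `src=` paths contain no whitespace. The two sample lines are the ones quoted in the issue.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern for the audit-log lines quoted in the issue.
# Assumes the timestamp starts the line and that src= paths contain no whitespace.
AUDIT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"\bcmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)"
)

def short_lived_files(lines, max_age=timedelta(seconds=10)):
    """Return (path, lifetime) for files created and then deleted within max_age."""
    created = {}  # path -> timestamp of the most recent create
    hits = []
    for line in lines:
        match = AUDIT_RE.search(line)
        if match is None:
            continue  # not an audit line in the expected shape
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
        cmd, src = match["cmd"], match["src"]
        if cmd == "create":
            created[src] = ts
        elif cmd == "delete" and src in created:
            lifetime = ts - created.pop(src)
            if lifetime <= max_age:
                hits.append((src, lifetime))
    return hits

# The two audit-log lines from the issue, verbatim.
SAMPLE = [
    "2022-07-19 17:16:22,961 INFO FSNamesystem.audit: allowed=true ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE) ip=/xxx.xxx cmd=create src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet dst=null perm=xxxxxx:xxxxxx:rw-rw-r-- proto=rpc callerContext=clientIp:xx.xx.xx.242",
    "2022-07-19 17:16:24,824 INFO FSNamesystem.audit: allowed=true ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE) ip=/xxx.xxx cmd=delete src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet dst=null perm=null proto=rpc callerContext=clientIp:xx.xx.xx.173",
]

if __name__ == "__main__":
    for path, lifetime in short_lived_files(SAMPLE):
        print(f"{path} lived {lifetime.total_seconds():.3f}s")
```

Running it over a full audit log (for example, by passing an open file object as `lines`) would list every file that was deleted shortly after creation. Comparing the `callerContext` client IPs of the create and delete entries, as in the excerpt above, may then help narrow down which process removed the file.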
