TengHuo opened a new issue, #6208:
URL: https://github.com/apache/hudi/issues/6208

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   The Flink append-only pipeline fails with a FileNotFoundException. The error shows that a parquet file cannot be found after a checkpoint is triggered.
   
   **To Reproduce**
   
   Steps to reproduce the behavior: very hard to reproduce; I will add details later.
   
   **Expected behavior**
   
   The Flink append-only pipeline runs properly.
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : N/A
   
   * Hive version : N/A
   
   * Hadoop version : 2.10
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I have checked the HDFS audit log; it shows that the missing file was created, but then deleted just a few seconds later:
   
   ```log
   2022-07-19 17:16:22,961 INFO FSNamesystem.audit: allowed=true        
ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE) ip=/xxx.xxx     cmd=create      
src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet
 dst=null        perm=xxxxxx:xxxxxx:rw-rw-r--    proto=rpc       
callerContext=clientIp:xx.xx.xx.242
   2022-07-19 17:16:24,824 INFO FSNamesystem.audit: allowed=true        
ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE) ip=/xxx.xxx     cmd=delete      
src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet
 dst=null        perm=null       proto=rpc       
callerContext=clientIp:xx.xx.xx.173
   ```
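   To check whether other files in the table were affected the same way, a small script can scan the HDFS audit log for create/delete pairs on the same path within a short window. This is a hypothetical diagnostic sketch, not part of Hudi or HDFS; the regex assumes the standard `FSNamesystem.audit` line layout shown above, and the 10-second window is an arbitrary choice.

   ```python
   import re
   from datetime import datetime

   # Matches the timestamp, command, and source path of an HDFS audit-log line,
   # e.g. "2022-07-19 17:16:22,961 INFO FSNamesystem.audit: ... cmd=create src=/path ..."
   AUDIT_RE = re.compile(
       r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
       r"cmd=(?P<cmd>create|delete)\s+src=(?P<src>\S+)"
   )

   def short_lived_files(lines, max_gap_seconds=10):
       """Return (path, gap_in_seconds) for files deleted soon after creation."""
       created = {}   # path -> creation timestamp
       hits = []
       for line in lines:
           m = AUDIT_RE.match(line)
           if not m:
               continue
           ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
           path = m.group("src")
           if m.group("cmd") == "create":
               created[path] = ts
           elif path in created:
               gap = (ts - created.pop(path)).total_seconds()
               if gap <= max_gap_seconds:
                   hits.append((path, gap))
       return hits
   ```

   Running it over the two audit lines above would report the parquet file as deleted roughly two seconds after creation, which matches the FileNotFoundException in the stacktrace below.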
   
   **Stacktrace**
   
   ```log
   2022-07-19 17:16:24,465 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 6713 for job 20c9582448454df74af4e9424245709d (412626 bytes in 2183 
ms).
   2022-07-19 17:16:25,805 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Commit 
instant [20220719171325240] success!
   2022-07-19 17:16:25,919 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Create 
instant [20220719171625805] for table [wt_test_hudi] with type [COPY_ON_WRITE]
   2022-07-19 17:16:25,919 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor 
executes action [commits the instant 20220719171325240] success!
   2022-07-19 17:16:26,734 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor 
executes action [sync hive metadata for instant 20220719171625805] success!
   2022-07-19 17:19:21,805 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 6714 (type=CHECKPOINT) @ 1658222361802 for job 
20c9582448454df74af4e9424245709d.
   2022-07-19 17:19:21,810 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor 
executes action [taking checkpoint 6714] success!
   2022-07-19 17:19:22,169 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., 
DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), 
_UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy (10/128) 
(4dacfeb990f80a0fe30c810a062e2e3b) switched from RUNNING to FAILED on 
container_e1004_1656494149297_195917_01_000062 @ xxxxxxx(dataPort=39371).
   java.lang.Exception: Could not perform checkpoint 6714 for operator Source: 
TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., 
DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), 
_UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy 
(10/128)#0.
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1006)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$7(StreamTask.java:958)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:344)
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:782)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
        at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could 
not complete snapshot 6714 for operator Source: TableSourceScan(table=[[...]], 
fields=[...]) -> Calc(select=[..., DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 
1000000000), 0), _UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> 
Sink: dummy (10/128)#0. Failure reason: Checkpoint was declined.
        at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:264)
        at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
        at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1092)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1076)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:994)
        ... 13 more
   Caused by: org.apache.hudi.exception.HoodieException: Error collect the 
write status for task [9]
        at 
org.apache.hudi.sink.bulk.BulkInsertWriterHelper.getWriteStatuses(BulkInsertWriterHelper.java:184)
        at 
org.apache.hudi.sink.append.AppendWriteFunction.flushData(AppendWriteFunction.java:123)
        at 
org.apache.hudi.sink.append.AppendWriteFunction.snapshotState(AppendWriteFunction.java:78)
        at 
org.apache.hudi.sink.common.AbstractStreamWriteFunction.snapshotState(AbstractStreamWriteFunction.java:157)
        at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
        at 
org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
        at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:89)
        at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:218)
        at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
        at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1092)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1076)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:994)
        ... 13 more
   Caused by: java.io.FileNotFoundException: File does not exist: 
/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet
 (inode 6318419053) Holder DFSClient_NONMAPREDUCE_1440496717_78 does not have 
any open files.
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2782)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:521)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:161)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2663)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:520)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1087)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1109)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1030)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3039)
   
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
        at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
        at 
org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1115)
        at 
org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1944)
        at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1750)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:751)
   Caused by: org.apache.hadoop.ipc.RemoteException: File does not exist: 
/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet
 (inode 6318419053) Holder DFSClient_NONMAPREDUCE_1440496717_78 does not have 
any open files.
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2782)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:521)
        at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:161)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2663)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:520)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1087)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1109)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1030)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3039)
   
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1590)
        at org.apache.hadoop.ipc.Client.call(Client.java:1521)
        at org.apache.hadoop.ipc.Client.call(Client.java:1418)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:251)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:130)
        at com.sun.proxy.$Proxy27.addBlock(Unknown Source) ~[?:?]
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:472)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:449)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:175)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:167)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:105)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:375)
        at com.sun.proxy.$Proxy28.addBlock(Unknown Source) ~[?:?]
        at 
org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1112)
        at 
org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1944)
        at 
org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1750)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:751)
   2022-07-19 17:19:22,200 WARN  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Reset the 
event for task [9]
   ```
   
   

