Paul Lin created FLINK-12022:
--------------------------------

             Summary: Enable StreamWriter to update file length on sync flush
                 Key: FLINK-12022
                 URL: https://issues.apache.org/jira/browse/FLINK-12022
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem
    Affects Versions: 1.7.2, 1.6.4
            Reporter: Paul Lin
            Assignee: Paul Lin


Currently, users of file systems that do not support truncating have to 
struggle with BucketingSink and use its valid length file to indicate the 
checkpointed data position. The problem is that by default the file length will 
only be updated when a block is full or the file is closed, but when the job 
crashes and the file is not closed properly, the file length is still behind 
its actual value and the checkpointed file length. When the job restarts, it 
looks like data loss, because the valid length is bigger than the file. This 
situation lasts until namenode notices the change of block size of the file, 
and it could be half an hour or more.

So I propose to add an option to StreamWriterBase to update file lengths on 
each flush. This can be expensive because it involves namenode and should be 
used when strong consistency is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to