Stephan Ewen created FLINK-9749:
-----------------------------------

             Summary: Rework Bucketing Sink
                 Key: FLINK-9749
                 URL: https://issues.apache.org/jira/browse/FLINK-9749
             Project: Flink
          Issue Type: New Feature
          Components: Streaming Connectors
            Reporter: Stephan Ewen
            Assignee: Kostas Kloudas
             Fix For: 1.6.0


The BucketingSink has a series of deficits at the moment.

Due to the long list of issues, I would suggest to add a new StreamingFileSink 
with a new and cleaner design

h3. Encoders, Parquet, ORC

 - It only efficiently supports row-wise data formats (avro, jso, sequence 
files.
 - Efforts to add (columnar) compression for blocks of data is inefficient, 
because blocks cannot span checkpoints due to persistence-on-checkpoint.
 - The encoders are part of the \{{flink-connector-filesystem project}}, rather 
than in orthogonal formats projects. This blows up the dependencies of the 
\{{flink-connector-filesystem project}} project. As an example, the rolling 
file sink has dependencies on Hadoop and Avro, which messes up dependency 
management.

h3. Use of FileSystems

 - The BucketingSink works only on Hadoop's FileSystem abstraction not support 
Flink's own FileSystem abstraction and cannot work with the packaged S3, 
maprfs, and swift file systems
 - The sink hence needs Hadoop as a dependency
 - The sink relies on "trying out" whether truncation works, which requires 
write access to the users working directory
 - The sink relies on enumerating and counting files, rather than maintaining 
its own state, making less efficient

h3. Correctness and Efficiency on S3
 - The BucketingSink relies on strong consistency in the file enumeration, 
hence may work incorrectly on S3.
 - The BucketingSink relies on persisting streams at intermediate points. This 
is not working properly on S3, hence there may be data loss on S3.

h3. .valid-length companion file
 - The valid length file makes it hard for consumers of the data and should be 
dropped



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to