Re: Re: Rewriting a new file instead of writing a ".valid-length" file in BucketSink when restoring

Xinyu Zhang Tue, 15 May 2018 06:06:09 -0700

Yes, I'm glad to do it, but I'm not sure writing a new file is a good solution,
so I want to discuss it here.
Do you have any ideas, @Kostas?
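
To make the trade-off concrete: rewriting on restore means copying every valid byte into a new file, so the restore time grows with the file size. Below is a minimal sketch of that copy. It uses local files through java.nio as a stand-in for HDFS, the class and method names are hypothetical, and it is not how BucketingSink is implemented.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only (not part of Flink).
final class RewriteOnRestore {

    /**
     * Copies the first validLength bytes of source into target.
     * Returns the number of bytes actually copied, which is smaller
     * than validLength if the source is shorter than expected.
     */
    static long copyValidPrefix(Path source, Path target, long validLength) throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = validLength;
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            while (remaining > 0) {
                int toRead = (int) Math.min(buffer.length, remaining);
                int read = in.read(buffer, 0, toRead);
                if (read < 0) {
                    break; // reached end of source before validLength bytes
                }
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }
        return validLength - remaining;
    }
}
```

The loop touches each valid byte once, which is the cost Timo is worried about for very large files; a truncate or a ".valid-length" marker avoids that copy.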





------------------ Original Message ------------------
From: "twalthr"<[email protected]>;
Sent: Tuesday, May 15, 2018, 8:21 PM
To: "Xinyu Zhang"<[email protected]>;
Cc: "dev"<[email protected]>; "kkloudas"<[email protected]>;
Subject: Re: Re: Rewriting a new file instead of writing a ".valid-length" file
in BucketSink when restoring



As far as I know, the bucketing sink is currently also limited by
relying on Hadoop's file system abstraction. It is planned to switch to
Flink's file system abstraction, which might also improve this situation.
Kostas (in CC) might know more about it.

But I think we can discuss whether another behavior should be configurable
as well. Would you be willing to contribute?

Regards,
Timo


On 15.05.18 at 14:01, Xinyu Zhang wrote:
> Thanks for your reply.
> Indeed, if a file is very large, rewriting it will take a long time. However,
> the ".valid-length" file is not convenient for others who use the
> data in HDFS.
> Maybe we should provide a configuration for users to choose which 
> strategy they prefer.
> Do you have any ideas?
>
>
> ------------------ Original Message ------------------
> *From:* "Timo Walther"<[email protected]>;
> *Sent:* Tuesday, May 15, 2018, 7:30 PM
> *To:* "dev"<[email protected]>;
> *Subject:* Re: Rewriting a new file instead of writing a ".valid-length"
> file in BucketSink when restoring
>
> I guess writing a new file would take much longer than just using the
> .valid-length file, especially if the files are very large. The
> restoring time should be as minimal as possible to ensure little
> downtime on restarts.
>
> Regards,
> Timo
>
>
> On 15.05.18 at 09:31, Gary Yao wrote:
> > Hi,
> >
> > The BucketingSink truncates the file if the Hadoop FileSystem supports this
> > operation (Hadoop 2.7 and above) [1]. What version of Hadoop are you using?
> >
> > Best,
> > Gary
> >
> > [1]
> > https://github.com/apache/flink/blob/bcd028d75b0e5c5c691e24640a2196b2fdaf85e0/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L301
> >
> > On Mon, May 14, 2018 at 1:37 PM, Zhang Xinyu <[email protected]> wrote:
> >
> >> Hi
> >>
> >>
> >> I'm trying to copy data from Kafka to HDFS. The data in HDFS is used by
> >> others for further computations in MapReduce.
> >> If some tasks fail, a ".valid-length" file is created on older Hadoop
> >> versions. The problem is that other people must know how to deal with the
> >> ".valid-length" file; otherwise, the data may not be exactly-once.
> >> Hence, why not rewrite a new file when restoring instead of writing a
> >> ".valid-length" file? In this way, others who use the data in HDFS don't
> >> need to know how to deal with the ".valid-length" file.
> >>
> >>
> >> Thanks!
> >>
> >>
> >> Zhang Xinyu
>
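
For readers wondering what "dealing with the .valid-length file" involves for a downstream job, here is a rough sketch of a reader that honors it. It rests on several assumptions that should be checked against the Flink version in use: the companion file is named with a "_" prefix and a ".valid-length" suffix, and its content is the decimal length written with DataOutputStream.writeUTF. Local files via java.nio stand in for HDFS, and the class and method names are hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical consumer-side helper, for illustration only (not part of Flink).
final class ValidLengthReader {

    /** Assumed naming: part-0-0 -> _part-0-0.valid-length in the same directory. */
    static Path validLengthPathFor(Path partFile) {
        return partFile.resolveSibling("_" + partFile.getFileName() + ".valid-length");
    }

    /** Returns only the bytes of partFile that the companion file marks as valid. */
    static byte[] readValidBytes(Path partFile) throws IOException {
        Path lengthFile = validLengthPathFor(partFile);
        byte[] all = Files.readAllBytes(partFile); // fine for a sketch; stream large files instead
        if (!Files.exists(lengthFile)) {
            return all; // no marker: the whole file is considered valid
        }
        long validLength;
        try (DataInputStream in = new DataInputStream(Files.newInputStream(lengthFile))) {
            // Assumption: the length was stored as a string via writeUTF.
            validLength = Long.parseLong(in.readUTF());
        }
        return Arrays.copyOf(all, (int) Math.min(validLength, all.length));
    }
}
```

Every consumer of the HDFS directory would need equivalent logic, which is the inconvenience raised in the original message.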
