Hi guys, I've created a Jira ticket for this issue and proposed a possible solution, so that we can continue our discussion and work out a plan for fixing it.
https://issues.apache.org/jira/browse/FLINK-10203

Cheers,
Artsem

On Tue, 21 Aug 2018 at 16:59, Artsem Semianenka <artfulonl...@gmail.com> wrote:

> Thanks for the reply, Kostas,
>
> But as long as there are distributions like Cloudera, whose latest version
> (5.15) is based on Hadoop 2.6, I and many other Cloudera users are obliged
> to use an older HDFS version. Moreover, I read a discussion on the Cloudera
> forum about moving to a more recent version of Hadoop, and the Cloudera
> folks said they are not going to do that because they are concentrating on
> the 6th version, which is based on Hadoop 3.x. In that case I doubt that
> Flink is ready to work with the latest Hadoop 3.x version. As a result, my
> company, as a Cloudera user, is caught in a trap: we placed a bet on Flink
> but can't use it in our environment.
>
> I will think about your idea of a RecoverableStream without truncate for
> bulk encoders. But to tell the truth, I currently have no idea how to
> implement it, because idiomatically a RecoverableWriter should be able to
> recover from a specified pointer. In our case, for the Parquet bulk format,
> we don't need to recover; we should recreate the whole file from the
> checkpointed state. That doesn't look like a RecoverableWriter.
>
> Cheers,
> Artsem
>
>
> On Tue, 21 Aug 2018 at 16:09, Kostas Kloudas <k.klou...@data-artisans.com>
> wrote:
>
>> Hi Artsem,
>>
>> Till is correct in that getting rid of the "valid-length" file was a
>> design decision for the new StreamingFileSink from the beginning. The
>> motivation was that users were reporting that it was essentially very
>> cumbersome to use.
>>
>> In general, when the BucketingSink gets deprecated, I could see a benefit
>> in having a legacy recoverable stream just in case you are obliged to use
>> an older HDFS version. But, at least for now, this would be useful only
>> for row-wise encoders, and NOT for bulk encoders like Parquet.
>>
>> The reason is that, for now, when using bulk encoders you roll on every
>> checkpoint. This implies that you do not need truncate, or the
>> valid-length file. Given this, you may only need to write a recoverable
>> stream that simply does not truncate.
>>
>> Would you like to try it out and see if it works for your use case?
>>
>> Cheers,
>> Kostas
>>
>> On Aug 21, 2018, at 1:58 PM, Artsem Semianenka <artfulonl...@gmail.com>
>> wrote:
>>
>> Thanks for the reply, Till!
>>
>> By the way, if Flink is going to support compatibility with Hadoop 2.6, I
>> don't see another way to achieve it. As I mentioned before, one of the
>> popular distributions, Cloudera, is still based on Hadoop 2.6, and it
>> would be very sad if Flink did not support it. I really want to help the
>> Flink community support this legacy setup, but currently I see only one
>> way to achieve it: by emulating the 'truncate' logic, i.e. recreating a
>> new file with the needed length and replacing the old one.
>>
>> Cheers,
>> Artsem
>>
>> On Tue, 21 Aug 2018 at 14:41, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> Hi Artsem,
>>>
>>> If I recall correctly, we explicitly decided not to support the
>>> valid-length files with the new StreamingFileSink because they are
>>> really hard for the user to handle. I've pulled Klou into this
>>> conversation, who is more knowledgeable and can give you a bit more
>>> advice.
>>>
>>> Cheers,
>>> Till
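
To make the "emulate truncate" idea from the quoted messages above more concrete, here is a rough, untested sketch of how recovery could rebuild a file of the checkpointed valid length on Hadoop 2.6 using only the plain FileSystem API. The class name and the temp-file naming are made up for illustration; this is not existing Flink code.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch: emulate truncate(file, validLength) on Hadoop < 2.7 by
 * copying the first validLength bytes into a temporary file and then replacing
 * the original with it.
 */
public final class TruncateEmulationSketch {

    public static void truncateByCopy(FileSystem fs, Path file, long validLength)
            throws IOException {
        Path tmp = new Path(file.getParent(), "." + file.getName() + ".truncating");

        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(file);
             FSDataOutputStream out = fs.create(tmp, true)) {
            long remaining = validLength;
            while (remaining > 0) {
                int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read < 0) {
                    throw new IOException(file + " is shorter than the valid length " + validLength);
                }
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }

        // Replace the original file with the shortened copy. The rename itself is
        // atomic on HDFS, but the delete + rename pair is not, so a real
        // implementation would have to handle a crash between the two steps.
        if (!fs.delete(file, false) || !fs.rename(tmp, file)) {
            throw new IOException("Could not replace " + file + " with " + tmp);
        }
    }

    private TruncateEmulationSketch() {}
}

The obvious downside is that the whole valid prefix has to be rewritten on every recovery; the valid-length marker files of the old BucketingSink avoid that copy but push the complexity onto every downstream reader, which is why they were dropped for the StreamingFileSink.
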
>>> On Mon, Aug 20, 2018 at 2:53 PM Artsem Semianenka <artfulonl...@gmail.com>
>>> wrote:
>>>
>>> > I have an idea to create a new version of the
>>> > HadoopRecoverableFsDataOutputStream class (for example named
>>> > LegacyHadoopRecoverableFsDataOutputStream :) ) which works with
>>> > valid-length files without invoking truncate, and to modify the check
>>> > in HadoopRecoverableWriter to use
>>> > LegacyHadoopRecoverableFsDataOutputStream in case the Hadoop version
>>> > is lower than 2.7. I will try to provide a PR soon if there are no
>>> > objections. I hope I am on the right track.
>>> >
>>> > On Mon, 20 Aug 2018 at 14:40, Artsem Semianenka <artfulonl...@gmail.com>
>>> > wrote:
>>> >
>>> > > Hi guys!
>>> > > I have a question regarding the new StreamingFileSink (introduced in
>>> > > version 1.6). We use this sink to write data in Parquet format, but
>>> > > I ran into an issue when trying to run the job on a YARN cluster and
>>> > > save the result to HDFS. In our case we use the latest Cloudera
>>> > > distribution (CDH 5.15), which contains HDFS 2.6.0. This version
>>> > > does not support the truncate method. I would like to create a pull
>>> > > request, but first I want to ask your advice on how best to design
>>> > > this fix and which ideas are behind this decision. I saw a similar
>>> > > PR for the BucketingSink:
>>> > > https://github.com/apache/flink/pull/6108 . Maybe I could also add
>>> > > support for valid-length files for older Hadoop versions?
>>> > >
>>> > > P.S. Unfortunately, CDH 5.15 (with Hadoop 2.6) is the latest version
>>> > > of the Cloudera distribution, and we can't upgrade to Hadoop 2.7.
>>> > >
>>> > > Best regards,
>>> > > Artsem
>>> > >
>>> >
>>> > --
>>> >
>>> > Best regards,
>>> > Artsem Semianenka
>>> >
>>>
>>
>> --
>>
>> Best regards,
>> Artsem Semianenka
>>
>
> --
>
> Best regards,
> Artsem Semianenka
>

--
Best regards,
Artsem Semianenka
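
For reference, a minimal sketch of the Hadoop-version gate described in the proposal above might look like the following. LegacyHadoopRecoverableFsDataOutputStream is only the class proposed in this thread and does not exist yet; the sketch relies solely on Hadoop's VersionInfo, and the surrounding writer code is left out.

import org.apache.hadoop.util.VersionInfo;

/**
 * Hypothetical sketch of the proposed gate: detect whether the Hadoop client
 * on the classpath supports truncate (added in 2.7), so a writer could choose
 * between the existing truncate-based recoverable stream and a legacy,
 * non-truncating variant such as the proposed
 * LegacyHadoopRecoverableFsDataOutputStream.
 */
final class HadoopTruncateSupport {

    static boolean truncateIsAvailable() {
        // VersionInfo.getVersion() returns strings like "2.6.0-cdh5.15.0".
        String[] parts = VersionInfo.getVersion().split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return major > 2 || (major == 2 && minor >= 7);
    }

    private HadoopTruncateSupport() {}
}

A writer could consult this flag once at construction time and route to the legacy, valid-length-style stream when it returns false, instead of requiring truncate on Hadoop 2.6 clusters.
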