Hi guys, I've created a Jira ticket for this issue and proposed a possible solution, so that we can continue our discussion and work out a plan for fixing it.
https://issues.apache.org/jira/browse/FLINK-10203

Cheers,
Artsem

On Tue, 21 Aug 2018 at 16:59, Artsem Semianenka <artfulonl...@gmail.com> wrote:

> Thanks for the reply, Kostas,
>
> But as long as there are distributions like Cloudera, whose latest version
> (5.15) is based on Hadoop 2.6, I and many other Cloudera users are obliged
> to use an older HDFS version. Moreover, I read a discussion on the Cloudera
> forum about moving to a more recent version of Hadoop, and the Cloudera
> folks said they are not going to do that because they are concentrating on
> the 6th version, which is based on Hadoop 3.x. In that case I doubt that
> Flink is ready to work with the latest Hadoop 3.x version. As a result, my
> company, as a Cloudera user, is caught in a trap: we placed a bet on Flink
> but can't use it in our environment.
>
> I will think about your idea of a RecoverableStream without truncate for
> bulk encoders. But to tell the truth, I currently have no idea how to
> implement it, because idiomatically a RecoverableWriter should be able to
> recover from a specified pointer. In our case, for the Parquet bulk format,
> we don't need to recover; we should recreate the whole file from the
> checkpointed state. That doesn't look like a RecoverableWriter.
>
> Cheers,
> Artsem
>
>
> On Tue, 21 Aug 2018 at 16:09, Kostas Kloudas <k.klou...@data-artisans.com>
> wrote:
>
>> Hi Artsem,
>>
>> Till is correct in that getting rid of the "valid-length" file was a
>> design decision for the new StreamingFileSink from the beginning. The
>> motivation was that users were reporting that it was essentially very
>> cumbersome to use.
>>
>> In general, when the BucketingSink gets deprecated, I could see a benefit
>> in having a legacy recoverable stream just in case you are obliged to use
>> an older HDFS version. But, at least for now, this would be useful only
>> for row-wise encoders, and NOT for bulk encoders like Parquet.
>>
>> The reason is that, for now, when using bulk encoders you roll on every
>> checkpoint. This implies that you do not need truncate, or the
>> valid-length file. Given this, you may only need to write a recoverable
>> stream that simply does not truncate.
>>
>> Would you like to try it out and see if it works for your use case?
>>
>> Cheers,
>> Kostas
>>
>> On Aug 21, 2018, at 1:58 PM, Artsem Semianenka <artfulonl...@gmail.com>
>> wrote:
>>
>> Thanks for the reply, Till!
>>
>> By the way, if Flink is going to support compatibility with Hadoop 2.6, I
>> don't see another way to achieve it. As I mentioned before, one of the
>> popular distributions, Cloudera, is still based on Hadoop 2.6, and it
>> would be very sad if Flink did not support it. I really want to help the
>> Flink community support this legacy setup, but currently I see only one
>> way to achieve it: by emulating the 'truncate' logic, i.e. recreating a
>> new file with the needed length and replacing the old one.
>>
>> Cheers,
>> Artsem
>>
>> On Tue, 21 Aug 2018 at 14:41, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> Hi Artsem,
>>>
>>> If I recall correctly, we explicitly decided not to support the
>>> valid-length files with the new StreamingFileSink because they are
>>> really hard for the user to handle. I've pulled Klou into this
>>> conversation, who is more knowledgeable and can give you a bit more
>>> advice.
>>>
>>> Cheers,
>>> Till
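
To make the "emulate truncate" idea from the quoted messages above more concrete, here is a rough, untested sketch of how recovery could rebuild a file of the checkpointed valid length on Hadoop 2.6 using only the plain FileSystem API. The class name and the temp-file naming are made up for illustration; this is not existing Flink code.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch: emulate truncate(file, validLength) on Hadoop < 2.7 by
 * copying the first validLength bytes into a temporary file and then replacing
 * the original with it.
 */
public final class TruncateEmulationSketch {

    public static void truncateByCopy(FileSystem fs, Path file, long validLength)
            throws IOException {
        Path tmp = new Path(file.getParent(), "." + file.getName() + ".truncating");

        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(file);
             FSDataOutputStream out = fs.create(tmp, true)) {
            long remaining = validLength;
            while (remaining > 0) {
                int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read < 0) {
                    throw new IOException(file + " is shorter than the valid length " + validLength);
                }
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }

        // Replace the original file with the shortened copy. The rename itself is
        // atomic on HDFS, but the delete + rename pair is not, so a real
        // implementation would have to handle a crash between the two steps.
        if (!fs.delete(file, false) || !fs.rename(tmp, file)) {
            throw new IOException("Could not replace " + file + " with " + tmp);
        }
    }

    private TruncateEmulationSketch() {}
}

The obvious downside is that the whole valid prefix has to be rewritten on every recovery; the valid-length marker files of the old BucketingSink avoid that copy but push the complexity onto every downstream reader, which is why they were dropped for the StreamingFileSink.
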
>>> On Mon, Aug 20, 2018 at 2:53 PM Artsem Semianenka <artfulonl...@gmail.com>
>>> wrote:
>>>
>>> > I have an idea to create a new version of the
>>> > HadoopRecoverableFsDataOutputStream class (for example named
>>> > LegacyHadoopRecoverableFsDataOutputStream :) ) which works with
>>> > valid-length files without invoking truncate, and to modify the check
>>> > in HadoopRecoverableWriter to use
>>> > LegacyHadoopRecoverableFsDataOutputStream in case the Hadoop version
>>> > is lower than 2.7. I will try to provide a PR soon if there are no
>>> > objections. I hope I am on the right track.
>>> >
>>> > On Mon, 20 Aug 2018 at 14:40, Artsem Semianenka <artfulonl...@gmail.com>
>>> > wrote:
>>> >
>>> > > Hi guys!
>>> > > I have a question regarding the new StreamingFileSink (introduced in
>>> > > version 1.6). We use this sink to write data in Parquet format, but
>>> > > I ran into an issue when trying to run the job on a YARN cluster and
>>> > > save the result to HDFS. In our case we use the latest Cloudera
>>> > > distribution (CDH 5.15), which contains HDFS 2.6.0. This version
>>> > > does not support the truncate method. I would like to create a pull
>>> > > request, but first I want to ask your advice on how best to design
>>> > > this fix and which ideas are behind this decision. I saw a similar
>>> > > PR for the BucketingSink:
>>> > > https://github.com/apache/flink/pull/6108 . Maybe I could also add
>>> > > support for valid-length files for older Hadoop versions?
>>> > >
>>> > > P.S. Unfortunately, CDH 5.15 (with Hadoop 2.6) is the latest version
>>> > > of the Cloudera distribution, and we can't upgrade to Hadoop 2.7.
>>> > >
>>> > > Best regards,
>>> > > Artsem
>>> > >
>>> >
>>> > --
>>> >
>>> > Best regards,
>>> > Artsem Semianenka
>>> >
>>>
>>
>> --
>>
>> Best regards,
>> Artsem Semianenka
>>
>
> --
>
> Best regards,
> Artsem Semianenka
>

--
Best regards,
Artsem Semianenka
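
For reference, a minimal sketch of the Hadoop-version gate described in the proposal above might look like the following. LegacyHadoopRecoverableFsDataOutputStream is only the class proposed in this thread and does not exist yet; the sketch relies solely on Hadoop's VersionInfo, and the surrounding writer code is left out.

import org.apache.hadoop.util.VersionInfo;

/**
 * Hypothetical sketch of the proposed gate: detect whether the Hadoop client
 * on the classpath supports truncate (added in 2.7), so a writer could choose
 * between the existing truncate-based recoverable stream and a legacy,
 * non-truncating variant such as the proposed
 * LegacyHadoopRecoverableFsDataOutputStream.
 */
final class HadoopTruncateSupport {

    static boolean truncateIsAvailable() {
        // VersionInfo.getVersion() returns strings like "2.6.0-cdh5.15.0".
        String[] parts = VersionInfo.getVersion().split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return major > 2 || (major == 2 && minor >= 7);
    }

    private HadoopTruncateSupport() {}
}

A writer could consult this flag once at construction time and route to the legacy, valid-length-style stream when it returns false, instead of requiring truncate on Hadoop 2.6 clusters.
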