Hi Stephan,

Making Flink's S3 integration independent of Hadoop is great. We've been running into a lot of Hadoop configuration trouble when trying to enable Flink checkpointing with S3 on AWS EMR.
Is there any concrete plan or tickets created yet for tracking?

Thanks,
Bowen

On Mon, Jul 24, 2017 at 11:12 AM, Stephan Ewen <se...@apache.org> wrote:

> Hi Prashant!
>
> Flink's S3 integration currently goes through Hadoop's S3 file system (as
> you probably noticed).
>
> It seems that Hadoop's S3 file system is not really well suited for what
> we want to do, and we are looking to drop it and replace it by something
> direct (independent of Hadoop) in the coming release...
>
> One essential thing to make sure of is to not have the "trash" activated
> in the configuration, as it adds very high overhead to the delete
> operations.
>
> Best,
> Stephan
>
>
> On Mon, Jul 24, 2017 at 7:56 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi Prashant!
>>
>> I assume you are using Flink 1.3.0 or 1.3.1?
>>
>> Here are some things you can do:
>>
>> - I would try and disable the incremental checkpointing for a start and
>> see what happens then. That should reduce the number of files already.
>>
>> - Is it possible for you to run a patched version of Flink? If yes, can
>> you try to do the following: in the class "FileStateHandle", in the method
>> "discardState()", remove the code around "FileUtils.deletePathIfEmpty(...)"
>> - this is probably not working well when hitting too many S3 files.
>>
>> - You can delete old "completedCheckpointXXXYYY" files, but please do
>> not delete the other two types; they are needed for HA recovery.
>>
>> Greetings,
>> Stephan
>>
>>
>> On Mon, Jul 24, 2017 at 3:46 AM, prashantnayak <
>> prash...@intellifylearning.com> wrote:
>>
>>> Hi Xiaogang and Stephan,
>>>
>>> We're continuing to test and have now set up the cluster to disable
>>> incremental RocksDB checkpointing as well as increasing the checkpoint
>>> interval from 30s to 120s (not ideal really :-( )
>>>
>>> We'll run it with a large number of jobs and report back if this setup
>>> shows improvement.
>>>
>>> Appreciate any other insights you might have around this problem.
>>>
>>> Thanks,
>>> Prashant
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14392.html
>>> Sent from the Apache Flink User Mailing List archive. mailing list
>>> archive at Nabble.com.
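
For context, the "trash" that Stephan refers to is Hadoop's trash feature, controlled by the "fs.trash.interval" property: a non-zero value means deleted files are first staged in a .Trash directory, which is where the extra overhead comes from. A minimal sketch of disabling it in core-site.xml, assuming your EMR cluster does not manage this property through some other mechanism:

<!-- core-site.xml: a value of 0 disables the Hadoop trash feature, so deletes
     are issued directly instead of being moved to a .Trash directory first. -->
<property>
  <name>fs.trash.interval</name>
  <value>0</value>
</property>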
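
As an illustration of the first suggestion (and of the setup Prashant describes), a minimal Java sketch of turning off incremental RocksDB checkpointing and widening the checkpoint interval on Flink 1.3.x might look like the following; the S3 path and the 120-second interval are placeholders, and the second constructor argument of RocksDBStateBackend is the incremental-checkpointing flag:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 120 seconds (placeholder) instead of 30, to reduce the rate
        // at which checkpoint files are created and deleted on S3.
        env.enableCheckpointing(120_000L);

        // The second constructor argument toggles incremental checkpoints; pass false
        // to fall back to full checkpoints while diagnosing the file-count growth.
        // "s3://my-bucket/flink/checkpoints" is a placeholder path.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", false));

        // ... define sources, transformations, and sinks, then call env.execute(...)
    }
}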
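
To make the "FileStateHandle" suggestion concrete: the change amounts to removing the parent-directory cleanup from "discardState()" in org.apache.flink.runtime.state.filesystem.FileStateHandle, since that extra check and delete are issued against S3 for every discarded handle. The snippet below is a paraphrase of roughly what the patched 1.3.x method would look like, not the verbatim source, so the exact code in your checkout may differ:

@Override
public void discardState() throws Exception {
    FileSystem fs = getFileSystem();

    // Delete the state file itself.
    fs.delete(filePath, false);

    // Removed per the suggestion above: the best-effort cleanup of (possibly empty)
    // parent directories, which scales poorly once a checkpoint directory holds a
    // large number of S3 objects.
    //
    // try {
    //     FileUtils.deletePathIfEmpty(fs, filePath.getParent());
    // } catch (Exception ignored) {
    // }
}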