I looked into the disk issues and found that Fabian was on the right path.
The checkpoints that were lingering were in fact still in use.
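
In case anyone else hits this: with incremental RocksDB checkpoints the
checkpoint directory looks roughly like this (paths from my config; <job-id>
is a placeholder for the actual job ID):

  /opt/ha/49/checkpoints/<job-id>/chk-<n>/    - metadata and exclusive state per checkpoint
  /opt/ha/49/checkpoints/<job-id>/shared/     - RocksDB files shared across incremental checkpoints
  /opt/ha/49/checkpoints/<job-id>/taskowned/  - state owned by the task managers

If I understand Fabian's explanation correctly, the files I assumed were stale
live mostly under shared/ and are still referenced by the retained checkpoints,
so Flink keeps them around until no retained checkpoint needs them anymore.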

Thanks for the help!
Clay



On Thu, Sep 26, 2019 at 8:09 PM Clay Teeter <clay.tee...@maalka.com> wrote:

> I see, I'll try turning off incremental checkpoints to see if that helps.
>
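> For reference, a minimal sketch of the change I have in mind (same
> flink-conf.yaml as further down the thread, only the incremental flag
> flipped):
>
> state.backend: rocksdb
> # write full, self-contained checkpoints instead of incremental ones
> state.backend.incremental: false
> state.checkpoints.num-retained: 3
>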
> Re: disk space, I could see a scenario with my application where I could
> end up with 10,000+ checkpoints if the checkpoints are additive. I'll let
> you know what I see.
>
> Thanks!
> Clay
>
>
> On Wed, Sep 25, 2019 at 5:40 PM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi,
>>
>> You enabled incremental checkpoints.
>> This means that parts of older checkpoints that did not change since the
>> last checkpoint are not removed, because they are still referenced by
>> newer incremental checkpoints.
>> Flink will automatically remove them once they are not needed anymore.
>>
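>> Roughly annotated, the two settings from your config that interact here:
>>
>> # a checkpoint only uploads the RocksDB files that changed since the last one
>> state.backend.incremental: true
>> # limits the number of retained checkpoints; files under shared/ stay on disk
>> # as long as any retained checkpoint still references them
>> state.checkpoints.num-retained: 3
>>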
>> Are you sure that the size of your application's state is not growing too
>> large?
>>
>> Best, Fabian
>>
>> On Tue, Sep 24, 2019 at 10:47 AM Clay Teeter <
>> clay.tee...@maalka.com> wrote:
>>
>>> Oh geez, checkmarks = checkpoints... sorry.
>>>
>>> What I mean by stale "checkpoints" is checkpoints that should be reaped
>>> by "state.checkpoints.num-retained: 3".
>>>
>>> What is happening is that the directories:
>>>   - state.checkpoints.dir: file:///opt/ha/49/checkpoints
>>>   - high-availability.storageDir: file:///opt/ha/49/ha
>>> are growing with every checkpoint, and I'm running out of disk space.
>>>
>>> On Tue, Sep 24, 2019 at 4:55 AM Biao Liu <mmyy1...@gmail.com> wrote:
>>>
>>>> Hi Clay,
>>>>
>>>> Sorry, I don't get your point. I'm not sure what "stale checkmarks" means
>>>> exactly. Do you mean the HA storage and checkpoint directories left over
>>>> after shutting down the cluster?
>>>>
>>>> Thanks,
>>>> Biao /'bɪ.aʊ/
>>>>
>>>>
>>>>
>>>> On Tue, 24 Sep 2019 at 03:12, Clay Teeter <clay.tee...@maalka.com>
>>>> wrote:
>>>>
>>>>> I'm trying to get my standalone cluster to remove stale checkmarks.
>>>>>
>>>>> The cluster is composed of a single job manager and a single task manager,
>>>>> backed by RocksDB, with high availability enabled.
>>>>>
>>>>> The configuration on both the job and task managers is:
>>>>>
>>>>> state.backend: rocksdb
>>>>> state.checkpoints.dir: file:///opt/ha/49/checkpoints
>>>>> state.backend.incremental: true
>>>>> state.checkpoints.num-retained: 3
>>>>> jobmanager.heap.size: 1024m
>>>>> taskmanager.heap.size: 2048m
>>>>> taskmanager.numberOfTaskSlots: 24
>>>>> parallelism.default: 1
>>>>> high-availability.jobmanager.port: 6123
>>>>> high-availability.zookeeper.path.root: ********_49
>>>>> high-availability: zookeeper
>>>>> high-availability.storageDir: file:///opt/ha/49/ha
>>>>> high-availability.zookeeper.quorum: ******t:2181
>>>>>
>>>>> Both machines have access to /opt/ha/49 and /opt/ha/49/checkpoints via
>>>>> NFS, and the directories are owned by the flink user. Also, there are no
>>>>> errors that I can find.
>>>>>
>>>>> Does anyone have any ideas that I could try?
>>>>>
>>>>>
