Well, for future reference, this Log4j2 logger configuration helped in the case of ABFS:

logger.abfs.name = org.apache.hadoop.fs.azurebfs.services.AbfsClient
logger.abfs.level = DEBUG
logger.abfs.filter.failures.type = RegexFilter
logger.abfs.filter.failures.regex = ^.*([Ff]ail|[Rr]etry|: [45][0-9]{2},).*$
logger.abfs.filter.failures.onMatch = ACCEPT
logger.abfs.filter.failures.onMismatch = DENY
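
The filter only ACCEPTs AbfsClient DEBUG lines that mention fail/retry or a
": 4xx,"/": 5xx," status-code pattern and DENYs everything else, so the
volume stays manageable. To keep even that out of the main log, the logger
can also be routed to its own appender; a minimal sketch in the same Log4j2
properties format, assuming it sits next to the logger.abfs.* entries above
(the appender name AbfsDebugFile and the file path are only illustrative):

# dedicated file appender for the filtered AbfsClient output
appender.abfsdebug.name = AbfsDebugFile
appender.abfsdebug.type = File
appender.abfsdebug.fileName = /tmp/abfs-debug.log
appender.abfsdebug.layout.type = PatternLayout
appender.abfsdebug.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %c - %m%n

# attach the appender to the AbfsClient logger and stop the same lines from
# also reaching the root logger's appenders
logger.abfs.appenderRef.abfsdebug.ref = AbfsDebugFile
logger.abfs.additivity = false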


On Wed, Aug 7, 2024 at 12:18 PM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> I must ask again: does anyone know whether Flink's file system abstraction
> can expose more detailed exceptions when things go wrong? Azure support is
> asking for the specific exception messages to decide how to troubleshoot.
>
> Regards,
> Alexis.
>
> On Tue, Jul 23, 2024 at 1:39 PM Alexis Sarda-Espinosa <
> sarda.espin...@gmail.com> wrote:
>
>> Hi again,
>>
>> I found a Hadoop class that can log latency information [1]. However,
>> since I don't see any exceptions in the logs when a checkpoint expires due
>> to timeout, I'm still wondering whether raising other log levels would give
>> more insight, maybe somewhere in Flink's file system abstractions?
>>
>> [1]
>> https://hadoop.apache.org/docs/r3.2.4/hadoop-azure/abfs.html#Perf_Options
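>>
>> In case it is useful later: a minimal sketch of turning that latency
>> logging on, assuming the property named on the linked page
>> (fs.azure.abfs.latency.track) and assuming Flink forwards fs.azure.* keys
>> from flink-conf.yaml to the Hadoop configuration of the ABFS filesystem;
>> if that forwarding does not apply in a given setup, the same key would go
>> into core-site.xml instead:
>>
>> # flink-conf.yaml -- assumed forwarding of fs.azure.* keys to the ABFS driver
>> fs.azure.abfs.latency.track: true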
>>
>>
>> Regards,
>> Alexis.
>>
>> On Fri, Jul 19, 2024 at 9:17 AM Alexis Sarda-Espinosa <
>> sarda.espin...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> We have a Flink job that uses ABFSS for checkpoints and related state.
>>> Lately we see a lot of exceptions because checkpoints expire before they
>>> complete. I'm guessing that's an issue in our infrastructure or on Azure's
>>> side, but I was wondering if there are Flink/Hadoop Java packages whose
>>> loggers would surface useful information if we set them to DEBUG/TRACE?
>>>
>>> Regards,
>>> Alexis.
>>>
>>>
