[ 
https://issues.apache.org/jira/browse/FLINK-32535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weijie Guo updated FLINK-32535:
-------------------------------
    Fix Version/s: 2.1.0
                       (was: 2.0.0)

> CheckpointingStatisticsHandler periodically returns NullArgumentException 
> after job restarts
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-32535
>                 URL: https://issues.apache.org/jira/browse/FLINK-32535
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST
>    Affects Versions: 2.1.0
>            Reporter: Hong Liang Teoh
>            Priority: Major
>             Fix For: 2.1.0
>
>
> *What*
> When making requests to the /checkpoints REST API after a job restart, we see 
> HTTP 500 responses for a short period of time. We should handle this case 
> gracefully in the CheckpointingStatisticsHandler.
>  
> *How to replicate*
>  * Set the checkpointing interval to 1s.
>  * Keep the job constantly restarting.
>  * Make constant requests to the /checkpoints REST API.
> See [here|https://github.com/apache/flink/pull/22901#issuecomment-1617830035] 
> for more info.
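
The request loop in the steps above could be sketched roughly as follows. This is a hypothetical helper, not part of Flink: the class name CheckpointsPoller, its methods, and the attempt count are illustrative assumptions. The path /jobs/<jobid>/checkpoints is assumed to be the endpoint meant by "/checkpoints", and the REST address and job ID must be supplied by the caller. It uses the JDK 11+ java.net.http client and only tallies the status codes it sees.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntSupplier;

/** Hypothetical reproduction helper; not part of Flink. */
class CheckpointsPoller {

    /** Builds the checkpoint statistics URL for a job (assumed path: /jobs/<jobid>/checkpoints). */
    static URI checkpointsUri(String baseUrl, String jobId) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(base + "/jobs/" + jobId + "/checkpoints");
    }

    /** Calls the probe the given number of times and counts how often each status code was seen. */
    static Map<Integer, Integer> tally(IntSupplier probe, int attempts) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int i = 0; i < attempts; i++) {
            counts.merge(probe.getAsInt(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: CheckpointsPoller <rest-base-url> <job-id>");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(checkpointsUri(args[0], args[1])).GET().build();

        // Each probe issues one GET and reports the HTTP status, or -1 on a transport error.
        IntSupplier probe = () -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
            } catch (IOException e) {
                return -1;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        };

        // 1000 attempts is an arbitrary choice for illustration.
        System.out.println(tally(probe, 1000));
    }
}
```

Run against a job that restarts repeatedly, a non-zero count under key 500 would match the behaviour described in this issue.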
>  
> Stack trace:
> {{org.apache.commons.math3.exception.NullArgumentException: input array}}
> {{    at 
> org.apache.commons.math3.util.MathArrays.verifyValues(MathArrays.java:1753)}}
> {{    at 
> org.apache.commons.math3.stat.descriptive.AbstractUnivariateStatistic.test(AbstractUnivariateStatistic.java:158)}}
> {{    at 
> org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:272)}}
> {{    at 
> org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:241)}}
> {{    at 
> org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics$CommonMetricsSnapshot.getPercentile(DescriptiveStatisticsHistogramStatistics.java:159)}}
> {{    at 
> org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics.getQuantile(DescriptiveStatisticsHistogramStatistics.java:53)}}
> {{    at 
> org.apache.flink.runtime.checkpoint.StatsSummarySnapshot.getQuantile(StatsSummarySnapshot.java:108)}}
> {{    at 
> org.apache.flink.runtime.rest.messages.checkpoints.StatsSummaryDto.valueOf(StatsSummaryDto.java:81)}}
> {{    at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.createCheckpointingStatistics(CheckpointingStatisticsHandler.java:133)}}
> {{    at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleCheckpointStatsRequest(CheckpointingStatisticsHandler.java:85)}}
> {{    at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleCheckpointStatsRequest(CheckpointingStatisticsHandler.java:59)}}
> {{    at 
> org.apache.flink.runtime.rest.handler.job.checkpoints.AbstractCheckpointStatsHandler.lambda$handleRequest$1(AbstractCheckpointStatsHandler.java:62)}}
> {{    at 
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)}}
> {{    at 
> java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)}}
> {{    at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)}}
> {{    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)}}
> {{    at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)}}
> {{    at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)}}
> {{    at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)}}
> {{    at java.base/java.lang.Thread.run(Thread.java:829)}}
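
The trace shows Percentile.evaluate rejecting a null input array. One possible reading of "handle this gracefully" is a guard that treats a missing or empty sample set as "no data" instead of throwing. The sketch below is only an illustration of that idea under stated assumptions: SafeQuantile is a hypothetical name, it is not Flink's code or the actual fix, it uses a simple nearest-rank quantile rather than the commons-math3 estimator, and returning NaN is an assumed placeholder value.

```java
import java.util.Arrays;

/** Hypothetical illustration; not Flink's StatsSummarySnapshot and not the actual fix. */
class SafeQuantile {

    /**
     * Returns the nearest-rank quantile q (0.0 to 1.0) of the samples,
     * or NaN when there are no samples, instead of throwing.
     */
    static double quantile(double[] values, double q) {
        if (q < 0.0 || q > 1.0) {
            throw new IllegalArgumentException("q must be in [0, 1]");
        }
        // Guard: a null or empty sample array means there is no data to summarize.
        if (values == null || values.length == 0) {
            return Double.NaN;
        }
        double[] sorted = values.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With a guard of this kind, a request that arrives before any checkpoint samples exist would get a placeholder value rather than a 500 response. Whether NaN, zero, or an omitted field is the right placeholder is a design decision for the handler.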
>  
> See the graph below for the test results. The dips in the green line 
> correspond to the failures immediately after a job restart.
> !https://user-images.githubusercontent.com/35062175/250529297-908a6714-ea15-4aac-a7fc-332589da2582.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
