[ https://issues.apache.org/jira/browse/FLINK-26864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517468#comment-17517468 ]
Sebastian Mattheis edited comment on FLINK-26864 at 4/5/22 2:18 PM:
--------------------------------------------------------------------

[~ym], I talked to [~pnowojski] about whether we should apply a quick fix, such as a revert, while I'm working on it, but we agreed that this is not too urgent for now as it is not included in 1.16. The fix is, as said, WIP and I will finish it this week; I expect it to bring performance back to the previous level.

The performance regression is similar to what is observed in FLINK-23560:
* Root cause: In the specific benchmarks, lock elision is applied because there are no mail actions generated/executed. Lock elision can no longer be applied once mail actions are generated/executed. That is normal if, e.g., checkpointing is executed, but it now also happens with the changes that perform latency measurements for mailbox processing, since both generate/execute mail actions. If lock elision cannot be applied anymore, performance drops for these specific benchmarks as observed/described in this issue.

The implications are:
# There is no performance regression if the application performs checkpointing anyway, i.e., in most streaming applications.
# For batch processing applications, there might be the observed performance regression.

To avoid the regression, the fix is to start latency measurements only if there are mails generated/executed. This fix is WIP.

> Performance regression on 25.03.2022
> ------------------------------------
>
>                 Key: FLINK-26864
>                 URL: https://issues.apache.org/jira/browse/FLINK-26864
>             Project: Flink
>          Issue Type: Bug
>          Components: Benchmarks
>    Affects Versions: 1.16.0
>            Reporter: Piotr Nowojski
>            Assignee: Sebastian Mattheis
>            Priority: Blocker
>
> http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=arrayKeyBy&extr=on&quarts=on&equid=off&env=2&revs=200
> http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=remoteFilePartition&extr=on&quarts=on&equid=off&env=2&revs=200
> http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=remoteSortPartition&extr=on&quarts=on&equid=off&env=2&revs=200
> http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=tupleKeyBy&extr=on&quarts=on&equid=off&env=2&revs=200
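
For illustration only, the idea behind the fix described in the comment above (start latency measurements only once mails are actually generated/executed) could look roughly like the following. This is a minimal sketch under stated assumptions, not Flink's mailbox implementation: the class `LazyLatencyMeasurementSketch` and all of its methods are hypothetical names, and a single probe mail stands in for what would presumably be a periodic measurement.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch; it does not mirror Flink's actual classes. */
class LazyLatencyMeasurementSketch {
    private final Queue<Runnable> mailbox = new ArrayDeque<>();
    private boolean measuring = false;
    private long lastLatencyNanos = -1;

    /** Enqueues a mail action, e.g. one triggered by checkpointing. */
    void send(Runnable mail) {
        mailbox.add(mail);
    }

    boolean isMeasuring() {
        return measuring;
    }

    long lastLatencyNanos() {
        return lastLatencyNanos;
    }

    /** One loop iteration; returns true if at least one mail was executed. */
    boolean processMail() {
        if (mailbox.isEmpty()) {
            // Mail-free path: nothing is generated or executed here, so a job
            // that never sends mails never starts the latency measurement.
            return false;
        }
        Runnable mail;
        while ((mail = mailbox.poll()) != null) {
            mail.run();
        }
        if (!measuring) {
            // Mails are in use anyway, so measuring adds no new kind of work.
            measuring = true;
            scheduleLatencyProbe();
        }
        return true;
    }

    /** Enqueues a probe that records how long it waited in the mailbox. */
    private void scheduleLatencyProbe() {
        final long enqueuedAt = System.nanoTime();
        mailbox.add(() -> lastLatencyNanos = System.nanoTime() - enqueuedAt);
    }

    public static void main(String[] args) {
        LazyLatencyMeasurementSketch loop = new LazyLatencyMeasurementSketch();
        loop.processMail();                 // no mails: measurement stays off
        System.out.println("measuring = " + loop.isMeasuring());
        loop.send(() -> { });               // first real mail arrives
        loop.processMail();                 // runs it and starts measuring
        loop.processMail();                 // runs the probe mail
        System.out.println("measuring = " + loop.isMeasuring());
    }
}
{code}

In this sketch, a loop that never receives a mail keeps taking the early return, which corresponds to the benchmarks where no mail actions are generated/executed; the measurement only begins after the first real mail has been processed.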