[ https://issues.apache.org/jira/browse/FLINK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gary Lam updated FLINK-37109:
-----------------------------
    Attachment: Flame graph prior to change using 74pct cpu.png

> Improve state processor API performance when reading keyed rocksdb state by allowing duplicates
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-37109
>                 URL: https://issues.apache.org/jira/browse/FLINK-37109
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / State Processor
>            Reporter: Gary Lam
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: Flame graph prior to change using 74pct cpu.png
>
>
> Could we allow for duplicates via a flag when reading keyed rocksdb state, to
> improve performance?
> From the [mailing list
> discussion,|https://www.mail-archive.com/user@flink.apache.org/msg43863.html]
> when the state processor api reads from state, it does multiple reads/writes
> to avoid duplicates:
>
> {code:java}
> The trick we perform is to delete keys from rocksDB after each read, so we
> can do full table scans on all column families but never see any
> duplicates.{code}
>
> In my application, which has a keyed state of size ~200GB, I have found it
> takes >4 hours to iterate the entire state. Doing a CPU profile, 70% of the
> time is spent on the `remove()` rocksdb call.
> If I comment out [this
> line|https://github.com/apache/flink/blob/26436ac27ae9e4705910b0502abb5bdd33ec686b/flink-libraries/flink-state-processing-api/src/main/java/org/apache/flink/state/api/input/KeyedStateInputFormat.java#L229]
> `keysAndNamespaces.remove();`, I can read the entire state in <15 minutes,
> and my particular application (trying to detect outliers in the state) is
> robust to duplicates.
> Thus if we allow this to be a user configurable flag (to skip deduplication)
> it would give a performance boost to users who don't care about
> deduplication.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
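
The trade-off described in the issue can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: it is not Flink or RocksDB code, and the class DedupScanSketch, its methods, and the deduplicate flag are hypothetical names invented for illustration. Each inner map stands in for a column family, and removing a key after it is read plays the role of the `keysAndNamespaces.remove()` call.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of scan-with-delete deduplication. Hypothetical; NOT Flink code. */
class DedupScanSketch {

    /** Two stand-in "column families" that share the key "b". */
    static Map<String, Map<String, Integer>> sampleState() {
        Map<String, Map<String, Integer>> families = new LinkedHashMap<>();
        families.put("counts", new TreeMap<>(Map.of("a", 1, "b", 2)));
        families.put("sums", new TreeMap<>(Map.of("b", 20, "c", 30)));
        return families;
    }

    /**
     * Scans every family in full and records each key handed to the reader.
     * With deduplicate=true, a key is deleted from all families right after
     * it is read, so later scans cannot see it again (extra writes per key).
     * With deduplicate=false, no deletes happen and shared keys repeat.
     */
    static List<String> scan(Map<String, Map<String, Integer>> families,
                             boolean deduplicate) {
        List<String> keysRead = new ArrayList<>();
        for (Map<String, Integer> family : families.values()) {
            Iterator<String> it = family.keySet().iterator();
            while (it.hasNext()) {
                String key = it.next();
                keysRead.add(key); // stand-in for invoking the user's reader
                if (deduplicate) {
                    it.remove(); // delete from the family being scanned
                    for (Map<String, Integer> other : families.values()) {
                        if (other != family) {
                            other.remove(key); // and from every other family
                        }
                    }
                }
            }
        }
        return keysRead;
    }

    public static void main(String[] args) {
        System.out.println(scan(sampleState(), true));  // prints [a, b, c]
        System.out.println(scan(sampleState(), false)); // prints [a, b, b, c]
    }
}
```

In this model the deduplicating scan pays for a delete on every key it reads, while the non-deduplicating scan is read-only but reports the shared key "b" twice. That mirrors the choice the proposed flag would expose; it says nothing about the actual cost of deletes in RocksDB, which the profile in the issue measures.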