[ https://issues.apache.org/jira/browse/FLINK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786304#comment-17786304 ]
Rui Fan commented on FLINK-18473:
---------------------------------

Hi [~yigress], thanks for your feedback. The current strategy is indeed unfriendly to servers with multiple HDDs, and a global round robin would be a good way to fix it. However, that is hard to do in the community version, because multiple TaskManagers on the same server cannot communicate with each other; Yun mentioned this as well. There are two possible solutions:
# Our internal Flink version uses ZooKeeper to achieve a server-wide global round robin (a rough sketch of this idea is appended at the end of this message).
# In our production, most jobs have small state and only a few jobs have large state. The problem I ran into years ago was that a TM with very large state had multiple slots, and several slots of the same TM might use the same disk.
** For this case, round robin at the TM level may be enough, and it needs no extra communication between TaskManagers (see the AtomicInteger sketch at the end of this message).

Also, I want to cc [~masteryhx], who is currently a very active Flink committer in the state module. He might be interested in this JIRA.

> Optimize RocksDB disk load balancing strategy
> ---------------------------------------------
>
>                 Key: FLINK-18473
>                 URL: https://issues.apache.org/jira/browse/FLINK-18473
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>    Affects Versions: 1.12.0
>            Reporter: Rui Fan
>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
>
> In general, big data servers have many disks. For large-state jobs, if multiple slots are running on a TM, each slot will create its own RocksDB instance. We want the RocksDB instances to use different disks to achieve load balancing.
> h3. The problem with the current load balancing strategy:
> When RocksDB is initialized, a random value nextDirectory is generated according to the number of RocksDB dirs: [code link|https://github.com/apache/flink/blob/2d371eb5ac9a3e485d3665cb9a740c65e2ba2ac6/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java#L441]
> {code:java}
> nextDirectory = new Random().nextInt(initializedDbBasePaths.length);
> {code}
> Different slots create different RocksDBStateBackend objects, so each slot generates its own *nextDirectory*. Because the value is drawn randomly, different slots may draw the same one. For example: if the RocksDB dir is configured with 10 disks and slot0 and slot1 both draw *nextDirectory* = 5, then slot0 and slot1 will use the same disk. That disk comes under heavy pressure while the other disks stay idle.
> h3. Optimization ideas:
> *{{nextDirectory}}* should be shared across the slots. Its initial value should not be fixed at 0 but still generated randomly. Define *nextDirectory* as a +_{{static AtomicInteger}}_+ and call +_{{nextDirectory.incrementAndGet()}}_+ every time a RocksDBKeyedStateBackend is created.
> {{nextDirectory}} modulo {{initializedDbBasePaths.length}} then decides which disk to use.
> Is there any problem with the above ideas?
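To make the TM-level idea concrete, here is a minimal sketch of the shared-counter selection. The helper class and method names are made up for illustration; in a real patch this logic would live inside RocksDBStateBackend, and only {{initializedDbBasePaths}} and {{nextDirectory}} come from the existing code:

{code:java}
import java.io.File;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: pick a RocksDB base dir per state backend, round robin within one TM. */
class RocksDbDirectoryPicker {

    // Shared by all slots in the same TaskManager JVM. Seeded randomly so that
    // different TMs on the same server do not all start cycling from disk 0.
    private static final AtomicInteger nextDirectory =
            new AtomicInteger(ThreadLocalRandom.current().nextInt(1024));

    /** Called once per RocksDBKeyedStateBackend creation. */
    static File pickBaseDir(File[] initializedDbBasePaths) {
        // incrementAndGet + modulo: consecutive backends in this TM cycle through
        // the configured directories instead of colliding randomly. floorMod keeps
        // the index non-negative even after the int counter overflows.
        int index = Math.floorMod(nextDirectory.incrementAndGet(), initializedDbBasePaths.length);
        return initializedDbBasePaths[index];
    }
}
{code}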
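And for the server-wide variant from option 1, one possible shape is a per-host counter in ZooKeeper, for example via Apache Curator's {{DistributedAtomicLong}}. This is only a sketch of the idea, not what our internal version actually does; the ZNode path, connection string, and fallback behaviour are placeholders:

{code:java}
import java.io.File;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.atomic.AtomicValue;
import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.retry.RetryNTimes;

/** Sketch: server-wide round robin over RocksDB dirs via a ZooKeeper counter. */
class GlobalRocksDbDirectoryPicker implements AutoCloseable {

    private final CuratorFramework client;
    private final DistributedAtomicLong counter;

    GlobalRocksDbDirectoryPicker(String zkQuorum, String hostName) {
        this.client = CuratorFrameworkFactory.newClient(
                zkQuorum, new ExponentialBackoffRetry(1000, 3));
        this.client.start();
        // One counter per physical server, shared by all TMs running on that host.
        this.counter = new DistributedAtomicLong(
                client, "/flink/rocksdb-disk-counter/" + hostName, new RetryNTimes(10, 100));
    }

    File pickBaseDir(File[] initializedDbBasePaths) throws Exception {
        AtomicValue<Long> result = counter.increment();
        if (!result.succeeded()) {
            // Placeholder fallback: use a local choice if ZooKeeper is unavailable.
            return initializedDbBasePaths[0];
        }
        int index = (int) Math.floorMod(result.postValue(), (long) initializedDbBasePaths.length);
        return initializedDbBasePaths[index];
    }

    @Override
    public void close() {
        client.close();
    }
}
{code}

A per-host counter keeps the disks balanced across all TMs on that machine, at the cost of one ZooKeeper round trip per state backend creation, which is why the TM-level round robin above may already be enough in practice.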