We have also faced similar issues. With async snapshots, the only step that runs synchronously is capturing a persistent point-in-time view of the state, which for the RocksDB backend amounts to creating hard links to its files. The cost of that step grows linearly with the number of files to link, but it should be negligible. We could not find a satisfying explanation for why latency increases with state size.
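To make the "linear in the number of files" point concrete, here is a minimal sketch in plain Java. It is not Flink or RocksDB code: the class `LinkSnapshot`, its methods, and the file names are hypothetical, and it assumes a file system that supports hard links. It only illustrates that a link-based snapshot does one metadata operation per file and copies no data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for a link-based snapshot; not the Flink/RocksDB implementation.
final class LinkSnapshot {

    // Hard-links every regular file in sourceDir into targetDir and
    // returns the number of links created. No file contents are copied.
    static int snapshot(Path sourceDir, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
            List<Path> files;
            try (Stream<Path> listing = Files.list(sourceDir)) {
                files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path file : files) {
                // One metadata operation per file: cost is O(number of files).
                Files.createLink(targetDir.resolve(file.getFileName()), file);
            }
            return files.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a temporary directory holding fileCount small dummy files.
    static Path createSampleDir(int fileCount) {
        try {
            Path dir = Files.createTempDirectory("state-files");
            for (int i = 0; i < fileCount; i++) {
                Files.write(dir.resolve("file-" + i + ".sst"), new byte[] {1, 2, 3});
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path source = createSampleDir(100);
        Path target = source.resolveSibling(source.getFileName() + "-snapshot");
        long start = System.nanoTime();
        int linked = snapshot(source, target);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("linked " + linked + " files in " + elapsedMicros + " us");
    }
}
```

Running `main` with different file counts gives a rough feel for how small the per-file cost is on a given machine, though it says nothing about what else the backend does during a checkpoint.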
Best,
Narendra

Narendra Joshi

On 29 Oct 2017 15:04, "Sofer, Tovi" <tovi.so...@citi.com> wrote:

> Hi all,
>
> In our application we have a requirement for very low latency, preferably
> less than 5 ms.
>
> We were able to achieve this so far, but when we start increasing the
> state size, we see a distinct degradation in latency.
>
> We have added MinPauseBetweenCheckpoints, and are using async snapshots.
>
> · Why does state size have such a distinct effect on latency? How
>   can this effect be minimized?
>
> · Can the state snapshot be done using separate threads and
>   resources, so that it has less effect on stream data handling?
>
> Details:
>
> Application configuration:
>
> env.enableCheckpointing(1000);
> env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
> env.setStateBackend(new FsStateBackend(checkpointDirURI, true)); // use async snapshots
> env.setParallelism(16); // running on a machine with 40 cores
>
> Results:
>
> A. When state size is ~20MB, we got a latency of 0.3 ms for the 99th
> percentile.
>
> Latency info (in nanos):
>
> 2017-10-26 07:26:55,030 INFO com.citi.artemis.flink.reporters.Log4JReporter - [Flink-MetricRegistry-1] localhost.taskmanager.6afd21aeb9b9bef41a4912b023469497.Flink Streaming Job.AverageE2ELatencyChecker.0.LatencyHistogram: count:10000 min:31919 max:13481166 mean:89492.0644 stddev:265876.0259763816 p50:68140.5 p75:82152.5 p95:146654.0499999999 p98:204671.74 p99:308958.73999999993 p999:3844154.002999794
>
> State\checkpoint info:
>
> B. When state size is ~200MB, latency degraded significantly, to 9 ms
> for the 99th percentile.
>
> Latency info:
>
> 2017-10-26 07:17:35,289 INFO com.citi.artemis.flink.reporters.Log4JReporter - [Flink-MetricRegistry-1] localhost.taskmanager.05431e7ecab1888b2792265cdc0ddf84.Flink Streaming Job.AverageE2ELatencyChecker.0.LatencyHistogram: count:10000 min:30186 max:46236470 mean:322105.7072 stddev:2060373.4782505725 p50:68979.5 p75:85780.25 p95:219882.69999999914 p98:2360171.4399999934 p99:9251766.559999945 p999:3.956163987499886E7
>
> State\checkpoint info:
>
> Thanks and regards,
> Tovi
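
For reference, the histogram values in the two quoted log lines are in nanoseconds, so the 0.3 ms and 9 ms figures come from dividing the p99 values by 10^6. A small sketch of that arithmetic follows. The class `LatencyPercentiles` and its methods are hypothetical helpers, not part of Flink or of the reporter in the log; only the metric values are copied from the quoted lines.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for reading "name:value" metrics out of a reporter log line.
final class LatencyPercentiles {

    // Collects every whitespace-separated "name:number" token; tokens whose
    // value part is not a plain number (timestamps, class names) are skipped.
    static Map<String, Double> parse(String line) {
        Map<String, Double> metrics = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            if (colon <= 0 || colon == token.length() - 1) {
                continue;
            }
            try {
                metrics.put(token.substring(0, colon),
                        Double.parseDouble(token.substring(colon + 1)));
            } catch (NumberFormatException ignored) {
                // Not a numeric metric, for example "07:26:55,030".
            }
        }
        return metrics;
    }

    static double nanosToMillis(double nanos) {
        return nanos / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Metric portions copied from the two quoted log lines.
        String small = "count:10000 min:31919 max:13481166 p99:308958.73999999993 p999:3844154.002999794";
        String large = "count:10000 min:30186 max:46236470 p99:9251766.559999945 p999:3.956163987499886E7";

        double p99Small = nanosToMillis(parse(small).get("p99"));
        double p99Large = nanosToMillis(parse(large).get("p99"));

        System.out.println(String.format(Locale.ROOT, "~20MB state:  p99 = %.3f ms", p99Small));
        System.out.println(String.format(Locale.ROOT, "~200MB state: p99 = %.3f ms", p99Large));
        System.out.println(String.format(Locale.ROOT, "ratio: %.1fx", p99Large / p99Small));
    }
}
```

Note that p50 and p75 barely move between the two runs, while p98 and above grow sharply, so the degradation is concentrated in the tail.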