Hi Peter,

State initialization with historic data is a use case that's coming up more and more. Unfortunately, there's no good solution for this yet, just a couple of workarounds that require careful design and don't work in all cases. There was a talk about exactly this problem and some ideas for addressing it at Flink Forward a month ago [1]. The slides and video of the talk are available online [2].
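One possible workaround is to treat the historic records as a second input: read the result of the batch job as a (bounded) stream, connect it with the live stream on the same key, and write the historic values into keyed state while those records are processed. Below is a rough, untested sketch of the idea; the sources, types, and state in it are placeholders, not your actual job.

// Rough sketch: bootstrap keyed state from a "historic" stream that is
// connected to the live stream. All names and types are placeholders.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class BootstrapStateSketch {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Placeholder sources: in your case the historic stream would read the
    // batch result from HDFS and the live stream would read from your real source.
    DataStream<Tuple2<String, Long>> historic =
        env.fromElements(Tuple2.of("a", 10L), Tuple2.of("b", 20L));
    DataStream<Tuple2<String, Long>> live =
        env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L), Tuple2.of("a", 3L));

    historic.connect(live)
        // key both inputs by the same field so that historic and live records of
        // a key are routed to the same parallel instance (and the same keyed state)
        .keyBy(0, 0)
        .flatMap(new BootstrapThenProcess())
        .print();

    env.execute("bootstrap-keyed-state-sketch");
  }

  public static class BootstrapThenProcess extends
      RichCoFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
      // open() can only create the state handle; reading or writing a value is
      // only possible while a record is processed, because only then is a key in scope.
      count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap1(Tuple2<String, Long> historicRecord, Collector<Tuple2<String, Long>> out) throws Exception {
      // Bootstrap input: write the precomputed value into keyed state, emit nothing.
      count.update(historicRecord.f1);
    }

    @Override
    public void flatMap2(Tuple2<String, Long> liveRecord, Collector<Tuple2<String, Long>> out) throws Exception {
      // Live input: read and update the (hopefully already bootstrapped) state.
      Long current = count.value();
      long updated = (current == null ? 0L : current) + liveRecord.f1;
      count.update(updated);
      out.collect(Tuple2.of(liveRecord.f0, updated));
    }
  }
}

Note that connect() does not guarantee that all historic records of a key are processed before its first live record arrives, so you typically have to buffer live records until the bootstrap input is drained. Handling this ordering is the "careful design" part and is one of the topics of the talk [1].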
Your idea of initializing keyed state during startup (in the open() method) doesn't work. Keyed state is automatically scoped to the key of the currently processed record. Since there are no records during initialization, one would need to manually set the key for which the state should be initialized. The challenge here is that keys are partitioned / sharded across the parallel instances, so one would need to know which key must be initialized on which instance. This is not trivial; see the PS below for how Flink maps keys to parallel instances.

Best,
Fabian

[1] https://sf-2018.flink-forward.org/kb_sessions/bootstrapping-state-in-apache-flink/
[2] https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink

2018-05-04 19:47 GMT+02:00 Tao Xia <t...@udacity.com>:

> Also would like to know how to do this if it is possible.
>
> On Fri, May 4, 2018 at 9:31 AM, Peter Zende <peter.ze...@gmail.com> wrote:
>
>> Hi,
>>
>> We use RocksDB with FsStateBackend (HDFS) to store state used by the
>> mapWithState operator. Is it possible to initialize / populate this state
>> during the streaming application startup?
>>
>> Our intention is to reprocess the historical data from HDFS in a batch
>> job and save the latest state of the records onto HDFS. Thus when we
>> restart the streaming job we can just build up or load the most recent view
>> of this store.
>>
>> Many thanks,
>> Peter
>>
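PS: To illustrate why one would need to know on which instance a key lives: Flink hashes every key into a key group and maps key groups to parallel subtasks. The snippet below shows that assignment using Flink's internal KeyGroupRangeAssignment class; the parallelism values are just example numbers.

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyPlacement {
  public static void main(String[] args) {
    int maxParallelism = 128; // 128 is the default max parallelism for small job parallelism
    int parallelism = 4;      // number of parallel operator instances
    for (String key : new String[] {"user-1", "user-2", "user-3"}) {
      // key -> key group (murmur hash of the key's hashCode modulo maxParallelism)
      int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
      // key group -> parallel subtask index
      int subtask = KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(maxParallelism, parallelism, keyGroup);
      System.out.println(key + " -> key group " + keyGroup + " -> subtask " + subtask);
    }
  }
}

Preloading state from outside the job would mean reproducing this assignment (including the max parallelism setting) for every key, which is why bootstrapping through the regular record flow, as sketched above, is usually the more practical route.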