This makes perfect sense to me. Thanks, Congxian and Kostas, for your input.
Gagan

On Thu, Jan 10, 2019 at 6:03 PM Kostas Kloudas <k.klou...@da-platform.com> wrote:

> Hi Gagan,
>
> I agree with Congxian!
> In MapState, when you access the value associated with a key in the
> map, the whole value is deserialized (and serialized in the case of a
> put()).
> Given this, it is more efficient to have many keys with small state than
> fewer keys with huge state.
>
> Cheers,
> Kostas
>
>
> On Thu, Jan 10, 2019 at 12:34 PM Congxian Qiu <qcx978132...@gmail.com>
> wrote:
>
>> Hi, Gagan Agrawal
>>
>> I would prefer the first option.
>>
>> Here is the reason.
>>
>> In the RocksDB StateBackend, we serialize the key, namespace, and
>> user-key into one byte array (key-bytes) and serialize the user-value
>> into another byte array (value-bytes), then insert the
>> key-bytes/value-bytes pair into RocksDB. When retrieving from RocksDB
>> we can use get (for a single key/value) or an iterator (for a key range).
>>
>> If we store four maps in a single MapState, we need to deserialize the
>> whole value-bytes (a Map) whenever we want to retrieve a single
>> user-value.
>>
>>
>> Gagan Agrawal <agrawalga...@gmail.com> wrote on Thu, Jan 10, 2019 at
>> 10:38 AM:
>>
>>> Hi,
>>> I have a use case where 4 streams get merged (union), grouped on a
>>> common key (keyBy), and passed to a custom KeyedProcessFunction. I need
>>> to keep state (RocksDB backend) for all 4 streams in my custom
>>> KeyedProcessFunction, where each of these 4 streams would be stored as
>>> a map. So I have 2 options:
>>>
>>> 1. Create a separate MapStateDescriptor for each of these streams and
>>> store their events separately.
>>> 2. Create a single MapStateDescriptor with only 4 keys
>>> (corresponding to the 4 stream types) whose values are of type Map and
>>> hold the events from the respective streams.
>>>
>>> I want to understand whether there would be any performance
>>> difference between these approaches. Will keeping 4 different MapStates
>>> cause 4 lookups in the RocksDB backend when they are accessed? Or are
>>> all of these MapStates stored internally in RocksDB in a single row
>>> corresponding to the respective key (as per the keyedStream), so that
>>> they are all fetched in a single call before the operator's
>>> processElement is called? If there is a separate RocksDB lookup for
>>> each MapStateDescriptor, then I think keeping them in a single
>>> MapStateDescriptor would be more efficient, since it minimizes RocksDB
>>> calls? Please advise.
>>>
>>> Gagan
>>>
>>
>>
>> --
>> Best,
>> Congxian
>>
>
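
For readers who want to see the trade-off Congxian and Kostas describe, here is a rough, self-contained Java sketch. It does not use the Flink or RocksDB APIs: a TreeMap stands in for RocksDB, the "|"-joined string stands in for the serialized key/namespace/user-key bytes, and every name in it (MapStateLayoutSketch, putSeparate, putNested, readCost, the "allStreams" state name) is hypothetical. It only models how many value bytes must be deserialized to read one event under each option.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapStateLayoutSketch {
    // Stand-in for RocksDB: sorted key-bytes -> value-bytes (keys kept as strings for readability).
    private final TreeMap<String, byte[]> store = new TreeMap<>();
    // Total value bytes that reads had to deserialize (writes are not counted).
    private long bytesDeserialized = 0;

    // Models "key + namespace/state + user-key" being serialized into one composite key.
    private static String compositeKey(String key, String stateName, String userKey) {
        return key + "|" + stateName + "|" + userKey;
    }

    // Option 1: one state per stream, so each event is its own store entry.
    void putSeparate(String key, String stream, String eventId, String event) {
        store.put(compositeKey(key, stream, eventId), event.getBytes(StandardCharsets.UTF_8));
    }

    String getSeparate(String key, String stream, String eventId) {
        byte[] value = store.get(compositeKey(key, stream, eventId));
        if (value == null) return null;
        bytesDeserialized += value.length; // only this one event is deserialized
        return new String(value, StandardCharsets.UTF_8);
    }

    // Option 2: a single state with one entry per stream; the value is a whole serialized map.
    void putNested(String key, String stream, String eventId, String event) {
        String ck = compositeKey(key, "allStreams", stream);
        Map<String, String> events = decode(store.get(ck)); // read-modify-write of the whole map
        events.put(eventId, event);
        store.put(ck, encode(events));
    }

    String getNested(String key, String stream, String eventId) {
        byte[] value = store.get(compositeKey(key, "allStreams", stream));
        if (value == null) return null;
        bytesDeserialized += value.length; // the whole map is deserialized to reach one event
        return decode(value).get(eventId);
    }

    // Toy serialization: one "id=event" pair per line (ids and events must not contain '=' or '\n').
    private static byte[] encode(Map<String, String> events) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : events.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    private static Map<String, String> decode(byte[] bytes) {
        Map<String, String> events = new LinkedHashMap<>();
        if (bytes == null) return events;
        for (String line : new String(bytes, StandardCharsets.UTF_8).split("\n")) {
            if (line.isEmpty()) continue;
            int eq = line.indexOf('=');
            events.put(line.substring(0, eq), line.substring(eq + 1));
        }
        return events;
    }

    // Value bytes deserialized to read one event when n events are stored for one stream.
    public static long readCost(boolean nested, int n) {
        MapStateLayoutSketch s = new MapStateLayoutSketch();
        for (int i = 0; i < n; i++) {
            if (nested) s.putNested("user-1", "streamA", "e" + i, "payload");
            else s.putSeparate("user-1", "streamA", "e" + i, "payload");
        }
        String got = nested
                ? s.getNested("user-1", "streamA", "e0")
                : s.getSeparate("user-1", "streamA", "e0");
        if (!"payload".equals(got)) throw new IllegalStateException("unexpected value: " + got);
        return s.bytesDeserialized;
    }

    public static void main(String[] args) {
        System.out.println("separate states: " + readCost(false, 1000) + " bytes deserialized");
        System.out.println("nested map:      " + readCost(true, 1000) + " bytes deserialized");
    }
}
```

In this model the read cost of the separate-state layout stays constant as events accumulate, while the nested-map layout grows with the size of the inner map, which is the effect described in the replies above. Real numbers in Flink will depend on the serializers and RocksDB configuration in use.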