Hi Avi,

The typical approach would be the one you've described in #1. #2 is not
necessary -- #1 is already doing essentially the same thing.
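Roughly, #1 would look something like the sketch below (untested, and assuming a
Car case class with a String id field plus an existing Kafka consumer source --
adjust the names to your own types). The keyBy on the car id means Flink shards
the keyed state across the parallel instances for you, which is what the manual
partitioning in #2 would otherwise do by hand:

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.scala._

    // Tags each car as new (never seen by this sensor before) or already known.
    class NewCarDetector extends RichMapFunction[Car, (Car, Boolean)] {
      @transient private var seen: ValueState[java.lang.Boolean] = _

      override def open(parameters: Configuration): Unit = {
        seen = getRuntimeContext.getState(
          new ValueStateDescriptor[java.lang.Boolean]("seen", classOf[java.lang.Boolean]))
      }

      override def map(car: Car): (Car, Boolean) = {
        val isNew = seen.value() == null   // null => no state yet for this car id
        if (isNew) seen.update(true)       // remember it for the next time it passes
        (car, isNew)
      }
    }

    val cars = env
      .addSource(consumer)
      .keyBy(_.id)               // keyed state is partitioned across the cluster by car id
      .map(new NewCarDetector)
    // downstream, route on the Boolean flag to the "new" / "old" Kafka topics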
-Jamie

On Wed, Nov 21, 2018 at 3:36 AM Avi Levi <avi.l...@bluevoyant.com> wrote:

> Hi,
> I am very new to Flink, so please be gentle :)
>
> The challenge:
> I have a road sensor that should scan billions of cars per day. For starters,
> I want to recognise whether each car that passes by is new or not. New cars
> (never seen before by that sensor) will be placed on a different Kafka topic
> than the others (two topics in total, for new and old), under the assumption
> that the state will contain billions of unique car ids.
>
> Suggested solutions:
> My question is which approach is better. Both approaches use RocksDB.
>
> 1. Use ValueState and split the stream like this:
>
>        val domainsSrc = env
>          .addSource(consumer)
>          .keyBy(car => car.id)
>          .map(...)
>
>    and check whether the state value is null to recognise new cars. If the
>    car is new, I will update the state.
>    How will the persistent data be sharded among the nodes in the cluster
>    (let's say I have 10 nodes)?
>
> 2. Use MapState and partition the stream into groups by some arbitrary
>    factor, e.g.
>
>        val domainsSrc = env
>          .addSource(consumer)
>          .keyBy { car =>
>            val h = car.id.hashCode % partitionFactor
>            math.abs(h)
>          }
>          .map(...)
>
>    and check mapState.keys.contains(car.id); if not, add it to the state.
>
> Which approach is better?
>
> Thanks in advance,
> Avi