Hi, Eduardo Currently, we can't load state from the outside(there is an ongoing jira[1] to do this), in the other word, if you disable checkpoint, and use the Kafka/database as your state storage, you should do the deduplication things by yourself.
Just curious, which state backend do you use, and how about the latency? [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-43%3A+Savepoint+Connector Best, Congxian Eduardo Winpenny Tejedor <eduardo.winpe...@gmail.com> 于2019年6月13日周四 下午11:31写道: > Is it possible someone could comment on this question in either direction > please? > > Thanks, > Eduardo > > On Sat, 8 Jun 2019, 14:10 Eduardo Winpenny Tejedor, < > eduardo.winpe...@gmail.com> wrote: > >> Hi, >> >> In generic terms, if a keyed operator outputs its state into a sink, and >> the state of that operator after a restart can be derived from the sink's >> contents, can we do just that, as opposed to using checkpointing mechanics? >> Would that reduce latency (even if only theoretically)? >> >> An example: A word counting app with a Kafka source and a Kafka sink >> (compacted topic). The input is an unbounded stream of single words and the >> output is a <word, word_count> tuple stream that goes into a Kafka >> compacted topic. >> >> AFAIK Flink can guarantee exactly-once semantics by updating the keyed >> state of its counting operator on _every_ event - after a restart it can >> then resume from the last checkpoint. But in this case, given the sink >> contains exactly the same relevant information as a checkpoint (namely the >> <key, key_count> tuples), could we load the state of an operator from our >> sink and avoid the latency added by our state backend? If so, how can this >> be achieved? >> >> If we replaced the Kafka sink with a database sink, could we on startup >> know which keys a Flink task has been allocated, perform a _single_ query >> to the database to load the key_counts and load those into the operator? >> How can this be achieved? Instead of a single query you may want to do a >> batched query, as long as you're not querying the database once per key. >> >> Thanks, >> Eduardo >> >