Chris + Samza Devs,

I was wondering whether Samza could support re-processing as described by
the Kappa architecture or Liquid (
http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf).

It seems that a changelog is not sufficient to be able to restore state
backward in time.  Kafka compaction will guarantee that local state can be
restored from where it left off but I don't see how it can restore past
state.

Imagine the case where a stream job has a lot of state in its local store
but it has not updated any keys in a long time.

Time t1: All of the data would be in the tail of the Kafka log (past the
cleaner point).
Time t2:  The job updates some keys.   Now we're in a state where the next
compaction will blow away the old values for those keys.
Time t3:  Compaction occurs and old values are discarded.

Say we want to launch a re-processing job that would begin from t1.  If we
launch that job before t3, it will correctly restore its state.  However,
if we launch the job after t3, it will be missing old values, right?
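
To make the t1/t2/t3 scenario concrete, here is a toy Python model of a
compacted changelog.  It is only a sketch: it does not use Kafka or Samza,
the helpers `compact` and `restore` are hypothetical names, and record
offsets stand in for time.

```python
# Toy changelog: a list of (offset, key, value) records, oldest first.

def compact(log):
    """Keep only the newest record for each key, as log compaction would."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)
    return sorted(latest.values())  # tuples sort by offset first


def restore(log, up_to_offset):
    """Rebuild state by replaying records with offset <= up_to_offset."""
    state = {}
    for offset, key, value in log:
        if offset <= up_to_offset:
            state[key] = value
    return state


# t1: two keys written, then nothing for a long time.
log = [(0, "a", 1), (1, "b", 2)]
t1_offset = 1

# t2: the job updates key "a".
log.append((2, "a", 10))

# Re-processing job launched before t3: the old value of "a" is still there.
state_before_t3 = restore(log, t1_offset)

# t3: compaction discards the old value of "a".
compacted = compact(log)

# Re-processing job launched after t3: key "a" is missing as of t1.
state_after_t3 = restore(compacted, t1_offset)
```

In this model the job launched before t3 sees both keys as of t1, while the
job launched after t3 sees only "b", because the t1 value of "a" no longer
exists anywhere in the log.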

Unless I'm misunderstanding something, the only way around this is to keep
snapshots in addition to the changelog.  Has there been any discussion of
providing an option in Samza of taking RocksDB snapshots and persisting
them to an object store or HDFS?
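
As a rough illustration of the snapshot idea, here is a small Python sketch
that keeps point-in-time copies of a store keyed by changelog offset.  It is
an assumption-laden toy: `SnapshotStore` is a hypothetical class, a dict
stands in for RocksDB, and an in-memory list stands in for the object store
or HDFS.

```python
import bisect
import copy


class SnapshotStore:
    """Holds full copies of the state, each tagged with a changelog offset."""

    def __init__(self):
        self._offsets = []  # assumed to be appended in increasing order
        self._states = []

    def take(self, offset, state):
        """Record a snapshot of `state` as of `offset`."""
        self._offsets.append(offset)
        self._states.append(copy.deepcopy(state))

    def latest_at_or_before(self, offset):
        """Return (snapshot_offset, state) for the newest snapshot taken at
        or before `offset`, or None if no such snapshot exists."""
        i = bisect.bisect_right(self._offsets, offset)
        if i == 0:
            return None
        return self._offsets[i - 1], copy.deepcopy(self._states[i - 1])


snapshots = SnapshotStore()
state = {"a": 1, "b": 2}
snapshots.take(1, state)      # snapshot as of t1

state["a"] = 10               # t2: key "a" updated
snapshots.take(2, state)

# A re-processing job that wants to begin from t1 loads the t1 snapshot,
# regardless of what compaction has since done to the changelog.
restored = snapshots.latest_at_or_before(1)
```

A job restored this way would start from the snapshot's offset rather than
from an arbitrary point, since the changelog records between snapshots may
already have been compacted.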

Thanks,

Roger
