I think the nuance in Roger's example is that the stream being rewound is an event stream, not a primary data stream. As such, going back to the earliest offset might only take you back a week. If you want a consistent view of that time, you'd want the table side of the join to reflect its state as of a week ago.
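For concreteness, here is a minimal sketch of the kind of stream-table join under discussion, written against Samza's low-level StreamTask API. The task class, store name, and output stream are all made up for illustration:

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class EnrichmentTask implements StreamTask, InitableTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "enriched-events");

  private KeyValueStore<String, String> dimensions;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Local store backed by a compacted changelog topic; on (re)start it is
    // restored from whatever the changelog currently contains.
    dimensions = (KeyValueStore<String, String>) context.getStore("dimension-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String key = (String) envelope.getKey();
    String event = (String) envelope.getMessage();

    // The join always sees the store's *current* value for this key; once
    // compaction has discarded older values, a rewound job cannot get them back.
    String dimension = dimensions.get(key);

    collector.send(new OutgoingMessageEnvelope(OUTPUT, key, event + "|" + dimension));
  }
}

The point is that the join reads whatever value the local store holds right now, which is exactly why a job rewound to last week and restoring that store from a compacted changelog may see newer dimension values than the original run did.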
On Friday, February 20, 2015, Roger Hoover <[email protected]> wrote:

> Jay,
>
> Sorry, I didn't explain it very well. I'm talking about a stream-table
> join where the table comes from a compacted topic that is used to populate
> a local data store. As the stream events are processed, they are joined
> with dimension data from the local store.
>
> If you want to kick off another version of this job that starts back in
> time, the new job cannot reliably recreate the same state of the local
> store that the original had, because old values may have been compacted
> away.
>
> Does that make sense?
>
> Roger
>
> On Fri, Feb 20, 2015 at 2:52 PM, Jay Kreps <[email protected]> wrote:
>
> > Hey Roger,
> >
> > I'm not sure I understand the case you are describing.
> >
> > As Chris says, we don't yet give you fine-grained control over when
> > history starts to disappear (though we designed with the intention of
> > making that configurable later). However, I'm not sure you need that
> > for the case you describe.
> >
> > Say you have a job J that takes inputs I1...IN and produces output
> > O1...ON, and in the process accumulates state in a topic S. I think the
> > approach is to launch a J' (changed or improved in some way) that
> > reprocesses I1...IN from the beginning of time (or some past point)
> > into O1'...ON' and accumulates state in S'. So the state for J and the
> > state for J' are totally independent. J' can't reuse J's state in
> > general because the code that generates that state may have changed.
> >
> > -Jay
> >
> > On Thu, Feb 19, 2015 at 9:30 AM, Roger Hoover <[email protected]> wrote:
> >
> > > Chris + Samza Devs,
> > >
> > > I was wondering whether Samza could support re-processing as
> > > described by the Kappa architecture or Liquid (
> > > http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf).
> > >
> > > It seems that a changelog is not sufficient to be able to restore
> > > state backward in time. Kafka compaction will guarantee that local
> > > state can be restored from where it left off, but I don't see how it
> > > can restore past state.
> > >
> > > Imagine the case where a stream job has a lot of state in its local
> > > store but has not updated any keys in a long time.
> > >
> > > Time t1: All of the data would be in the tail of the Kafka log (past
> > > the cleaner point).
> > > Time t2: The job updates some keys. Now we're in a state where the
> > > next compaction will blow away the old values for those keys.
> > > Time t3: Compaction occurs and old values are discarded.
> > >
> > > Say we want to launch a re-processing job that would begin from t1.
> > > If we launch that job before t3, it will correctly restore its state.
> > > However, if we launch the job after t3, it will be missing old
> > > values, right?
> > >
> > > Unless I'm misunderstanding something, the only way around this is to
> > > keep snapshots in addition to the changelog. Has there been any
> > > discussion of providing an option in Samza of taking RocksDB
> > > snapshots and persisting them to an object store or HDFS?
> > >
> > > Thanks,
> > >
> > > Roger
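As a rough illustration of the snapshot option Roger raises at the bottom of the thread, a job could pair a RocksDB checkpoint with the input offset it corresponds to and ship it to HDFS. This is only a sketch against the RocksDB and Hadoop Java client APIs; Samza itself exposes no such hook, and the class name, directory layout, and offset tagging are all invented here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.rocksdb.Checkpoint;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.io.IOException;

public class StoreSnapshotter {
  // Take a consistent RocksDB checkpoint of the local store and copy it to
  // HDFS, tagged with the input offset it corresponds to, so a re-processing
  // job could later pair a snapshot with a stream position.
  public static void snapshot(RocksDB db, String localDir, String hdfsDir, long offset)
      throws RocksDBException, IOException {
    String checkpointPath = localDir + "/checkpoint-" + offset;

    // createCheckpoint hard-links SST files, so the local copy is cheap.
    try (Checkpoint checkpoint = Checkpoint.create(db)) {
      checkpoint.createCheckpoint(checkpointPath);
    }

    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyFromLocalFile(new Path(checkpointPath),
                         new Path(hdfsDir + "/checkpoint-" + offset));
  }
}

A re-processing job starting from t1 could then restore the checkpoint taken at t1 instead of rebuilding the store from a changelog that compaction may have already thinned out.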
