Thanks for the quick confirmation! I'll need to send back a bit more data than I originally planned, but your suggestion of encoding progress in the keys should work quite nicely.
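
A rough sketch of the key-encoding idea, in plain Java rather than against the Accumulo iterator API. Everything here is hypothetical: the class `ResumableTraversal`, the methods `encode`, `decode`, `next` and `scanBatch`, and the "/" separator are made up for illustration. It assumes the graph is acyclic, that each node's targets come back in sorted order, and that node ids never contain "/". Each emitted key carries the full path from the root, so a freshly constructed traversal can rebuild its stack from the last key it returned:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustration only: simulates "torn down and re-created between batches"
// by giving each batch nothing but the last key that was emitted.
final class ResumableTraversal {

    // The emitted key is the traversal stack itself (root-to-node path).
    static String encode(List<String> path) {
        return String.join("/", path);
    }

    // Rebuild the stack from a previously emitted key.
    static List<String> decode(String key) {
        return new ArrayList<>(Arrays.asList(key.split("/")));
    }

    // Next path in pre-order depth-first order, or null when traversal is done.
    static List<String> next(Map<String, List<String>> edges, List<String> path) {
        List<String> p = new ArrayList<>(path);
        // Descend to the first target of the current node if there is one.
        List<String> targets = edges.getOrDefault(p.get(p.size() - 1), List.of());
        if (!targets.isEmpty()) {
            p.add(targets.get(0));
            return p;
        }
        // Otherwise backtrack until some ancestor has an unvisited later target.
        while (p.size() > 1) {
            String child = p.remove(p.size() - 1);
            List<String> siblings = edges.getOrDefault(p.get(p.size() - 1), List.of());
            int i = siblings.indexOf(child);
            if (i + 1 < siblings.size()) {
                p.add(siblings.get(i + 1));
                return p;
            }
        }
        return null;
    }

    // One "scan batch": holds no state of its own, only the resume key.
    // A null resumeKey means start at the root; otherwise continue strictly
    // after the path encoded in resumeKey.
    static List<String> scanBatch(Map<String, List<String>> edges, String root,
                                  String resumeKey, int batchSize) {
        List<String> out = new ArrayList<>();
        List<String> path = (resumeKey == null)
                ? List.of(root)
                : next(edges, decode(resumeKey));
        while (path != null && out.size() < batchSize) {
            out.add(encode(path));
            path = next(edges, path);
        }
        return out;
    }
}
```

The cost is the one mentioned above: every key gets longer, since it has to carry the whole path rather than just the current node.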
Best wishes as you push for 2.1.0!

On Wed, Jun 8, 2022 at 12:42 PM Christopher <ctubb...@apache.org> wrote:

> On Wed, Jun 8, 2022 at 2:40 PM Scott Kirklin <scott.kirk...@gmail.com> wrote:
> >
> > Hello,
> >
> > I am trying to do graph traversal with a custom Iterator. Simplifying a
> > bit, a “node” is a unique row id and edges are represented as an entry
> > where the Key.row is the source node and the Key.colQualifier is the
> > target node. The custom iterator maintains a stack and uses a
> > subordinate iterator to traverse following these edges. For small
> > graphs this works exactly as hoped, but once the graph becomes large
> > enough to fill a scan batch the iterator is torn down and when
> > re-init’ed the stack is gone, so I can’t resume from where it left off.
> > From the docs it says that "Being torn-down is equivalent to a new
> > instance of the Iterator being creating and deepCopy being called on
> > the new instance with the old instance provided as the argument to
> > deepCopy". I thought that meant that I could carry state through the
> > life of the traversal, at least as long as the iterator stays on a
> > single TServer and deepCopy copies the right data, but I cannot find
> > evidence that this actually happens in the code or by tracing.
> > IterConfigUtil looks like it is responsible for re-creating the
> > iterator when resuming a scan, and it only calls ‘init’.
>
> I'm not sure why the docs describe it that way. It certainly doesn't
> appear to match the code. deepCopy doesn't accept the old instance as
> an argument... it gets the iterator environment, which does not
> contain the previous iterator. There is some strange wording in that
> doc. It says "being creating" also, implying there's some serious
> grammar being mangled in this portion of the docs. I'm not sure what
> it was trying to say, but I don't think we have any guarantees
> regarding whether an iterator is torn down or not between scan session
> batches.
>
> >
> > Now, my actual question: Is there a supported way to maintain internal
> > state throughout the lifetime of an Iterator? Is my approach at all
> > sensible?
>
> I don't think you can rely on the statefulness of an iterator across
> scan session batches, but you may be able to encode information in the
> keys that are emitted so it knows how to skip paths in the graph when
> the newly constructed iterator seeks to the resume point.
>
> >
> > I am able to accomplish what I want 100% from the client as well of
> > course, but that will have much worse performance for many users. A lot
> > of usage happens by users who connect (over high latency connections)
> > through the thrift proxy, which will make a client side solution very
> > non-performant, so I am motivated to figure out a server-side solution,
> > but am not married to any particular pattern. Totally changing the key
> > design is on the table as well, as this effort is still somewhat
> > greenfield.
> >
> > Thanks in advance,
> > Scott