On Wed, Jun 8, 2022 at 2:40 PM Scott Kirklin <scott.kirk...@gmail.com> wrote:
>
> Hello,
>
> I am trying to do graph traversal with a custom Iterator. Simplifying a bit, 
> a “node” is a unique row id and edges are represented as an entry where the 
> Key.row is the source node and the Key.colQualifier is the target node. The 
> custom iterator maintains a stack and uses a subordinate iterator to traverse 
> following these edges. For small graphs this works exactly as hoped, but once 
> the graph becomes large enough to fill a scan batch the iterator is torn down 
> and when re-init’ed the stack is gone, so I can’t resume from where it left 
> off. From the docs it says that "Being torn-down is equivalent to a new 
> instance of the Iterator being creating and deepCopy being called on the new 
> instance with the old instance provided as the argument to deepCopy". I 
> thought that meant that I could carry state through the life of the 
> traversal, at least as long as the iterator stays on a single TServer and 
> deepCopy copies the right data, but I cannot find evidence that this actually 
> happens in the code or by tracing. IterConfigUtil looks like it is 
> responsible for re-creating the iterator when resuming a scan, and it only 
> calls ‘init’.

I'm not sure why the docs describe it that way. It doesn't appear to
match the code. deepCopy doesn't accept the old instance as an
argument. It receives only the iterator environment, which does not
contain the previous iterator. The wording in that part of the docs is
also garbled (it says "being creating"), so I'm not sure what it was
trying to say. In any case, I don't think we have any guarantees
regarding whether an iterator is torn down or not between scan session
batches.


>
> Now, my actual question: Is there a supported way to maintain internal state 
> throughout the lifetime of an Iterator? Is my approach at all sensible?

I don't think you can rely on the statefulness of an iterator across
scan session batches. However, you may be able to encode the traversal
state in the keys the iterator emits, so that a newly constructed
iterator knows which paths in the graph to skip when it seeks to the
resume point.
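
To make that idea concrete, here is a rough sketch in plain Java. It
does not use any Accumulo classes; it only simulates the pattern. The
names ResumableTraversal, nextBatch, and SEP are hypothetical, and it
assumes that node ids never contain the separator and that a resumed
scan is handed the last key that was returned. Each emitted key is the
full root-to-node path, so a fresh instance can rebuild its stack from
that key alone instead of relying on saved state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Plain-Java simulation of path-encoded keys; not an Accumulo iterator. */
final class ResumableTraversal {
    // Hypothetical separator. NUL sorts below every other character, so
    // the emitted paths come out in lexicographic order (assumes node
    // ids never contain it).
    static final String SEP = "\0";

    /**
     * Returns up to batchSize path keys in depth-first order, starting
     * strictly after resumeAfter (null means start at the root). No
     * state survives between calls, which mimics a torn-down iterator.
     */
    static List<String> nextBatch(Map<String, TreeSet<String>> edges,
                                  String root, String resumeAfter,
                                  int batchSize) {
        // Rebuild the stack from the last emitted key.
        List<String> path = new ArrayList<>();
        if (resumeAfter == null) {
            path.add(root);
        } else {
            path.addAll(Arrays.asList(resumeAfter.split(SEP)));
        }
        List<String> out = new ArrayList<>();
        while (out.size() < batchSize && advance(edges, path)) {
            out.add(String.join(SEP, path));
        }
        return out;
    }

    /** Moves path to the next node in preorder; false when exhausted. */
    private static boolean advance(Map<String, TreeSet<String>> edges,
                                   List<String> path) {
        String lowerBound = null; // null: any child of the current node
        while (!path.isEmpty()) {
            String node = path.get(path.size() - 1);
            TreeSet<String> children =
                edges.getOrDefault(node, new TreeSet<>());
            for (String child : (lowerBound == null
                    ? children : children.tailSet(lowerBound, false))) {
                if (!path.contains(child)) { // skip edges that form a cycle
                    path.add(child);
                    return true;
                }
            }
            // Nothing left below this node: pop it and try later siblings.
            lowerBound = path.remove(path.size() - 1);
        }
        return false;
    }
}
```

Calling nextBatch repeatedly, passing the last key of each batch as
resumeAfter, should yield the same sequence as one large batch. In a
real iterator the equivalent step would presumably happen in seek(),
using the start key of the range. The trade-off is that keys grow with
path depth, and a node reachable by several paths is emitted once per
path.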

>
> I am able to accomplish what I want 100% from the client as well of course, 
> but that will have much worse performance for many users. A lot of usage 
> happens by users who connect (over high latency connections) through the 
> thrift proxy, which will make a client side solution very non-performant, so 
> I am motivated to figure out a server-side solution, but am not married to 
> any particular pattern. Totally changing the key design is on the table as 
> well, as this effort is still somewhat greenfield.
>
> Thanks in advance,
> Scott
