Re: iterator state persistence in 2.0.1

Scott Kirklin Wed, 08 Jun 2022 13:17:24 -0700

Thanks for the quick confirmation! I'll need to send back a bit more data
than I originally planned, but your suggestion of encoding progress in the
keys should work quite nicely.


Best wishes as you push for 2.1.0!

On Wed, Jun 8, 2022 at 12:42 PM Christopher <ctubb...@apache.org> wrote:

> On Wed, Jun 8, 2022 at 2:40 PM Scott Kirklin <scott.kirk...@gmail.com>
> wrote:
> >
> > Hello,
> >
> > I am trying to do graph traversal with a custom Iterator. Simplifying a
> > bit, a “node” is a unique row id and edges are represented as an entry
> > where the Key.row is the source node and the Key.colQualifier is the
> > target node. The custom iterator maintains a stack and uses a
> > subordinate iterator to traverse following these edges. For small graphs
> > this works exactly as hoped, but once the graph becomes large enough to
> > fill a scan batch the iterator is torn down and when re-init’ed the
> > stack is gone, so I can’t resume from where it left off. From the docs
> > it says that "Being torn-down is equivalent to a new instance of the
> > Iterator being creating and deepCopy being called on the new instance
> > with the old instance provided as the argument to deepCopy". I thought
> > that meant that I could carry state through the life of the traversal,
> > at least as long as the iterator stays on a single TServer and deepCopy
> > copies the right data, but I cannot find evidence that this actually
> > happens in the code or by tracing. IterConfigUtil looks like it is
> > responsible for re-creating the iterator when resuming a scan, and it
> > only calls ‘init’.
>
> I'm not sure why the docs describe it that way. It certainly doesn't
> appear to match the code. deepCopy doesn't accept the old instance as
> an argument... it gets the iterator environment, which does not
> contain the previous iterator. There is some strange wording in that
> doc. It says "being creating" also, implying there's some serious
> grammar being mangled in this portion of the docs. I'm not sure what
> it was trying to say, but I don't think we have any guarantees
> regarding whether an iterator is torn down or not between scan session
> batches.
>
>
> >
> > Now, my actual question: Is there a supported way to maintain internal
> > state throughout the lifetime of an Iterator? Is my approach at all
> > sensible?
>
> I don't think you can rely on the statefulness of an iterator across
> scan session batches, but you may be able to encode information in the
> keys that are emitted so it knows how to skip paths in the graph when
> the newly constructed iterator seeks to the resume point.
>
> >
> > I am able to accomplish what I want 100% from the client as well of
> > course, but that will have much worse performance for many users. A lot
> > of usage happens by users who connect (over high latency connections)
> > through the thrift proxy, which will make a client side solution very
> > non-performant, so I am motivated to figure out a server-side solution,
> > but am not married to any particular pattern. Totally changing the key
> > design is on the table as well, as this effort is still somewhat
> > greenfield.
> >
> > Thanks in advance,
> > Scott
>
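
For anyone finding this thread later, here is a rough sketch of the
"encode progress in the keys" idea discussed above. It is plain Java and
deliberately does not use the Accumulo API: the in-memory EDGES map stands
in for the edge table, and the class name, the "/"-joined path encoding,
and the successor/batch helpers are all hypothetical names made up for
illustration. It assumes node ids never contain "/" and that neighbors are
visited in sorted order, as they would be in a sorted table. Each emitted
value carries the whole traversal stack, so a freshly constructed
traverser can pick up from the last emitted value with no other state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

public class ResumableTraversal {
    // Hypothetical stand-in for the edge table: source node -> sorted target nodes.
    static final Map<String, TreeSet<String>> EDGES = new HashMap<>();

    static {
        edge("a", "b"); edge("a", "c"); edge("b", "d"); edge("c", "d");
        edge("d", "a"); // cycle back to the start
    }

    static void edge(String from, String to) {
        EDGES.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    // First target of `node` sorted strictly after `after` (null means "from the
    // beginning") that is not already on the current path; null if there is none.
    static String nextChild(String node, String after, List<String> path) {
        TreeSet<String> targets = EDGES.getOrDefault(node, new TreeSet<>());
        NavigableSet<String> candidates =
            (after == null) ? targets : targets.tailSet(after, false);
        for (String t : candidates) {
            if (!path.contains(t)) {
                return t; // skipping nodes on the path avoids looping on cycles
            }
        }
        return null;
    }

    // Depth-first successor of an encoded path such as "a/b/d", or null when the
    // traversal is exhausted. The stack is rebuilt purely from the encoded path.
    static String successor(String encoded) {
        List<String> path = new ArrayList<>(Arrays.asList(encoded.split("/")));
        // Try to descend from the deepest node first.
        String child = nextChild(path.get(path.size() - 1), null, path);
        // Otherwise backtrack, looking for the next unvisited sibling at each level.
        while (child == null && path.size() > 1) {
            String finished = path.remove(path.size() - 1);
            child = nextChild(path.get(path.size() - 1), finished, path);
        }
        if (child == null) {
            return null;
        }
        path.add(child);
        return String.join("/", path);
    }

    // Returns up to batchSize paths that follow resumePoint. Nothing is kept
    // between calls, which models the iterator being torn down between batches.
    static List<String> batch(String resumePoint, int batchSize) {
        List<String> out = new ArrayList<>();
        String current = successor(resumePoint);
        while (current != null && out.size() < batchSize) {
            out.add(current);
            current = successor(current);
        }
        return out;
    }

    public static void main(String[] args) {
        String resume = "a"; // start at the root node
        List<String> results;
        while (!(results = batch(resume, 3)).isEmpty()) {
            System.out.println(results);
            resume = results.get(results.size() - 1); // last emitted "key"
        }
    }
}
```

The trade-off is the one mentioned at the top of this message: every
emitted entry has to carry the full path, so more data goes back to the
client. Skipping only nodes already on the current path keeps the sketch
stateless, but it means a node reachable by several routes is emitted
once per route.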
