Re: Correct way to collect results from an Acero query

Weston Pace Wed, 21 Sep 2022 10:54:55 -0700

Funny you should mention this, I just ran into the same problem :).
We use StartAndCollect so much in our unit tests that there must be
some usefulness there.  You are correct that it is not an API that can
be used outside of tests.

I added utility methods DeclarationToTable, DeclarationToBatches, and
DeclarationToExecBatches to exec_plan.h in [1]. These all take in a
declaration (that does not have a sink node), add a sink node, create
an exec plan, and run it.  It might be a bit before [1] merges so if
you want to pull these out into their own PR that might be useful.

The utility methods capture the common case where a user wants to use
the default exec context and run the plan immediately.  The main
downside of these utility methods is that they gather all results in
memory.  However, if you are dealing with small amounts of data (e.g.
prototyping, testing) or doing some kind of aggregation then this
might not be a problem.
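
As a rough illustration of that trade-off, here is a sketch in plain C++
that does not use Arrow at all.  Batch, Next, Source, Result, FromBatches,
and CollectAll are hypothetical stand-ins invented for this example; they
are not the types or signatures from [1].  The sketch only shows the
general shape: drain a source, hold every batch in memory, and hand
errors back to the caller instead of dropping them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; none of these are Arrow types.
using Batch = std::vector<int>;

enum class Step { kBatch, kDone, kError };

// One pull from the source: a batch, end of stream, or an error message.
struct Next {
  Step step;
  Batch batch;
  std::string error;
};

using Source = std::function<Next()>;

// Toy result type: holds a value on success, an error message on failure.
template <typename T>
struct Result {
  std::optional<T> value;
  std::string error;
  bool ok() const { return value.has_value(); }
};

// Drains the source and returns all batches at once.  Memory use grows with
// the total output size, which is the downside mentioned above.
Result<std::vector<Batch>> CollectAll(const Source& source) {
  std::vector<Batch> batches;
  for (;;) {
    Next n = source();
    if (n.step == Step::kError) return {std::nullopt, n.error};
    if (n.step == Step::kDone) return {std::move(batches), ""};
    batches.push_back(std::move(n.batch));
  }
}

// Builds a source that yields the given batches, then either ends cleanly
// or, if fail_with is non-empty, reports that error.
Source FromBatches(std::vector<Batch> batches, std::string fail_with = "") {
  return [batches = std::move(batches), fail_with = std::move(fail_with),
          i = std::size_t{0}]() mutable -> Next {
    if (i < batches.size()) return {Step::kBatch, batches[i++], ""};
    if (!fail_with.empty()) return {Step::kError, {}, fail_with};
    return {Step::kDone, {}, ""};
  };
}

// Usage: a clean run collects every batch; a failing run surfaces the error.
void Example() {
  auto ok = CollectAll(FromBatches({{1, 2}, {3}, {4, 5, 6}}));
  assert(ok.ok() && ok.value->size() == 3);

  auto bad = CollectAll(FromBatches({{1}}, "source failed"));
  assert(!bad.ok() && bad.error == "source failed");
}
```

A reader-style variant would instead hand the Source itself to the caller,
so batches are consumed one at a time and never all held at once.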

We could probably also add a DeclarationToReader method in the future.

[1] https://github.com/apache/arrow/pull/13782

On Wed, Sep 21, 2022 at 8:26 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hello!
>
> I am testing a custom data source node I added to Acero and found myself in
> need of collecting the results from an Acero query into memory.
>
> Searching the codebase, I found "StartAndCollect" is what many of the tests
> and benchmarks are using, but I am not sure if that is the public API to do
> so because:
> (1) the header file arrow/compute/exec/test_util.h depends on gtest, which
> seems to be a test-only dependency
> (2) the method "StartAndCollect" doesn't return a Result/Status object, so
> errors probably cannot be propagated.
>
> Is there a better way / some other public method to achieve this?
>
> Thanks,
> Li
