I ran into similar issues where a bug in a node's code led to an error that 
caused difficult-to-debug hangs or crashes during execution. I think a common 
problem in diagnosing such issues is that error messages (held in Status 
instances) produced during execution are not always surfaced. Perhaps it would 
be useful to add a debug flag to Acero that causes these error messages to at 
least be printed.
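
To make the idea concrete, here is a minimal, self-contained sketch. It does not use Arrow itself: the Status class is a simplified stand-in for arrow::Status, and ReportIfError, DebugErrorsEnabled and the ACERO_DEBUG_ERRORS environment variable are hypothetical names made up for illustration, not existing Acero features.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

// Simplified stand-in for arrow::Status (assumption: only ok/message matter here).
class Status {
 public:
  Status() = default;
  static Status OK() { return Status(); }
  static Status Invalid(std::string msg) {
    Status s;
    s.ok_ = false;
    s.message_ = std::move(msg);
    return s;
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  bool ok_ = true;
  std::string message_;
};

// Hypothetical switch: ACERO_DEBUG_ERRORS is NOT a real Arrow environment variable.
inline bool DebugErrorsEnabled() {
  const char* flag = std::getenv("ACERO_DEBUG_ERRORS");
  return flag != nullptr && std::string(flag) == "1";
}

// Prints a failed Status where it is produced, then returns it unchanged so the
// normal error path is not affected.
inline Status ReportIfError(Status st, const std::string& where, bool enabled,
                            std::ostream& out = std::cerr) {
  assert(!where.empty());  // callers should name the site that produced the error
  if (enabled && !st.ok()) {
    out << "[exec error] " << where << ": " << st.message() << "\n";
  }
  return st;
}
```

A node could then wrap its failing calls, for example ReportIfError(std::move(st), "AsofJoinNode", DebugErrorsEnabled()), so that the message is visible even if the error is later swallowed.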


Yaron.
________________________________
From: Weston Pace <weston.p...@gmail.com>
Sent: Thursday, July 14, 2022 12:38 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: cpp: Debugging 'plan destruction before finishing'

> After some quick debugging, I found that the asof node's StopProducing (a
> condition necessary to finish the plan) is called shortly after the
> error output.

StopProducing would probably be more accurately named "Abort" or
"StopRightNow".  If you run the plan to completion normally, I do not
believe you should see it being called.

> What cases would cause the plan to destruct before its nodes finish?

This may be a chicken/egg problem, but destroying a plan before it has
finished will cause this.  The destructor panics and, in a probably
futile attempt, calls StopProducing in the hope of stopping the ongoing
plan before a segmentation fault occurs, since any ongoing task is
going to assume the plan is still alive and valid.

StartAndCollect returns a future.  Are you sure you are keeping the
exec plan alive / in scope until that future completes?  Can you share
the code that is calling StartAndCollect?
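
The lifetime issue can be shown with a toy model.  This is only a sketch: ToyPlan, MakeTask and the two driver functions are hypothetical stand-ins, not Arrow's ExecPlan or StartAndCollect.  A weak_ptr is used so that the sketch can detect a destroyed plan instead of crashing the way a real pending task would.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an exec plan that owns the data its tasks read.
struct ToyPlan {
  std::vector<int> rows{1, 2, 3};
};

// Builds a pending task that needs the plan when it eventually runs.
// Returns false from the task if the plan no longer exists at that point.
std::function<bool(int*)> MakeTask(const std::shared_ptr<ToyPlan>& plan) {
  assert(plan != nullptr);
  std::weak_ptr<ToyPlan> weak = plan;
  return [weak](int* out) -> bool {
    std::shared_ptr<ToyPlan> alive = weak.lock();
    if (!alive) return false;  // plan was destroyed before the task ran
    *out = std::accumulate(alive->rows.begin(), alive->rows.end(), 0);
    return true;
  };
}

// Safe pattern: the plan stays in scope until the pending work has run.
bool PlanKeptAlive() {
  auto plan = std::make_shared<ToyPlan>();
  auto task = MakeTask(plan);
  int total = 0;
  bool ran = task(&total);
  return ran && total == 6;
}

// Problem pattern: the plan goes out of scope while the task is still pending.
bool PlanDestroyedEarly() {
  std::function<bool(int*)> task;
  {
    auto plan = std::make_shared<ToyPlan>();
    task = MakeTask(plan);
  }  // plan destroyed here
  int total = 0;
  return task(&total);  // reports false: nothing left to read from
}
```

In the real case the equivalent of PlanKeptAlive is to wait on the future returned by StartAndCollect before letting the plan go out of scope.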

An error that is unhandled and reaches a sink (since there are no
nodes that "handle errors" today this means any error) will also
trigger a call to StopProducing.  So if the AsofJoinNode is calling
ErrorReceived on its output then that would be a potential cause.  You
can probably check for this condition with a debugger.
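
As a rough picture of that chain, here is a toy model.  The class names are hypothetical and only borrow the method names used in this thread; the real control flow in Arrow is more involved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy plan that records what happens to it (not arrow's ExecPlan).
struct ToyPlan {
  std::vector<std::string> log;
  bool stopped = false;
  void StopProducing() {
    stopped = true;
    log.push_back("plan: StopProducing");
  }
};

// Toy sink: an error that reaches it was not handled upstream, so it aborts the plan.
struct ToySink {
  ToyPlan* plan;
  void ErrorReceived(const std::string& msg) {
    plan->log.push_back("sink: ErrorReceived(" + msg + ")");
    plan->StopProducing();
  }
};

// Toy join node: on bad input it forwards the error to its output.
struct ToyJoinNode {
  ToySink* output;
  bool Process(bool input_is_valid) {
    assert(output != nullptr);
    if (!input_is_valid) {
      output->ErrorReceived("invalid input");
      return false;
    }
    return true;
  }
};

// Runs the toy pipeline with bad input and returns the recorded event order.
std::vector<std::string> RunWithBadInput() {
  ToyPlan plan;
  ToySink sink{&plan};
  ToyJoinNode node{&sink};
  node.Process(false);
  return plan.log;
}
```

In a debugger, the matching check would be a breakpoint on the node's call to ErrorReceived; if it fires before StopProducing, the error path is the likely trigger.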

On Thu, Jul 14, 2022 at 9:07 AM Ivan Chau <ivan.m.c...@gmail.com> wrote:
>
> Hi all,
>
> I've been encountering a "plan destruction before finishing" output
> occurring with the AsOfJoin node, particularly when joining large tables.
>
> My execution context is configured with the default memory pool and a
> nullptr for the executor. I am calling StartAndCollect
> <https://github.com/apache/arrow/blob/6cc37cf2d1ba72c46b64fbc7ac499bd0d7296d20/cpp/src/arrow/compute/exec/test_util.cc#L183-L197>
> to execute the plan.
>
> After some quick debugging, I found that the asof node's StopProducing (a
> condition necessary to finish the plan) is called shortly after the
> error output.
>
> What cases would cause the plan to destruct before its nodes finish?
