I ran into similar issues, where a bug in a node's code led to an error that caused hard-to-debug hangs or crashes during execution. I think a common problem when diagnosing such issues is that error messages (carried in Status instances) raised during execution are not always propagated or reported. Perhaps it would be useful to add a debug flag to Acero that causes these execution-time errors to at least be printed.
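A minimal sketch of what such a flag could look like follows. It assumes a hypothetical environment variable `ACERO_DEBUG_ERRORS` and a stand-in `DemoStatus` type; neither exists in Arrow. The idea is that every non-OK status passes through a helper that prints it, together with the node that produced it, before the status is propagated as usual.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical stand-in for an error-carrying status object. This is NOT
// arrow::Status; it only carries enough state to illustrate the idea.
struct DemoStatus {
  bool is_ok;
  std::string message;
  static DemoStatus OK() { return {true, ""}; }
  static DemoStatus Invalid(std::string msg) { return {false, std::move(msg)}; }
};

// Hypothetical debug switch. ACERO_DEBUG_ERRORS is an assumed variable name;
// Acero does not define it.
bool DebugErrorsEnabled() {
  const char* value = std::getenv("ACERO_DEBUG_ERRORS");
  return value != nullptr && std::string(value) == "1";
}

// Builds the line that would be printed for a failed status, or returns an
// empty string when debugging is off or the status is OK.
std::string FormatErrorReport(const std::string& node_label,
                              const DemoStatus& st, bool enabled) {
  if (!enabled || st.is_ok) return "";
  return "[acero-debug] node '" + node_label + "' failed: " + st.message;
}

// Prints the report to stderr (if there is one) and hands the status back
// unchanged, so the caller can keep propagating it exactly as before.
DemoStatus ReportIfError(const std::string& node_label, DemoStatus st) {
  std::string report = FormatErrorReport(node_label, st, DebugErrorsEnabled());
  if (!report.empty()) std::cerr << report << std::endl;
  return st;
}
```

Because the helper returns the status untouched, wrapping existing error paths with it would only add output and would not change control flow.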
Yaron.

________________________________
From: Weston Pace <weston.p...@gmail.com>
Sent: Thursday, July 14, 2022 12:38 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: cpp: Debugging 'plan destruction before finishing'

> After some quick debugging, I found that the asof node's StopProducing (a condition necessary to finish the plan) is called shortly after the error output.

StopProducing should probably more accurately be named "Abort" or "StopRightNow". If you run the plan to completion normally, I do not believe you should see it being called.

> What cases would cause the plan to destruct before its nodes finish?

This may be a chicken-and-egg problem, but destroying a plan before it has finished will cause this. The destructor panics and, in a probably futile attempt, calls StopProducing in the hope that it can stop the ongoing plan before a segmentation fault occurs, since any ongoing task will assume the plan is still alive and valid. StartAndCollect returns a future. Are you sure you are keeping the exec plan alive and in scope until that future completes? Can you share the code that calls StartAndCollect?

An error that is unhandled and reaches a sink will also trigger a call to StopProducing. Since no nodes "handle errors" today, this applies to any error. So if the AsofJoinNode is calling ErrorReceived on its output, that would be another potential cause. You can probably check for this condition with a debugger.

On Thu, Jul 14, 2022 at 9:07 AM Ivan Chau <ivan.m.c...@gmail.com> wrote:
>
> Hi all,
>
> I've been encountering a "plan destruction before finishing" output
> occurring with the AsOfJoin node, particularly when joining large tables.
>
> My execution context is configured with the default memory pool and a
> nullptr for the executor. I am calling StartAndCollect
> <https://github.com/apache/arrow/blob/6cc37cf2d1ba72c46b64fbc7ac499bd0d7296d20/cpp/src/arrow/compute/exec/test_util.cc#L183-L197>
> to execute the plan.
>
> After some quick debugging, I found that the asof node's StopProducing (a
> condition necessary to finish the plan) is called shortly after the
> error output.
>
> What cases would cause the plan to destruct before its nodes finish?
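Weston's point about lifetimes can be illustrated with a minimal sketch. It uses a hypothetical `DemoPlan` class built only on standard C++ futures; it is not Arrow's `ExecPlan` and does not call `StartAndCollect`. The background task dereferences the plan object, so the plan must stay in scope until the returned future has completed. `RunToCompletion` shows the safe ordering: wait on the future first, and only then let the plan go out of scope.

```cpp
#include <atomic>
#include <future>
#include <iostream>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for an exec plan (NOT arrow::compute::ExecPlan).
// Its background task uses `this`, so the object must outlive the task.
class DemoPlan {
 public:
  ~DemoPlan() {
    if (!finished_.load()) {
      // Mimics the kind of warning discussed in this thread.
      std::cerr << "plan destruction before finishing" << std::endl;
    }
    // Block until the task is done so it never touches a dead object.
    if (task_.valid()) task_.wait();
  }

  // Starts work on another thread and returns a future for the result,
  // loosely analogous to a call that returns a future for collected output.
  std::shared_future<int> Start(std::vector<int> input) {
    task_ = std::async(std::launch::async, [this, input = std::move(input)] {
      int sum = std::accumulate(input.begin(), input.end(), 0);
      finished_.store(true);  // runs before the future becomes ready
      return sum;
    }).share();
    return task_;
  }

  bool finished() const { return finished_.load(); }

 private:
  std::atomic<bool> finished_{false};
  std::shared_future<int> task_;
};

// Safe pattern: the plan stays alive until its future has completed.
int RunToCompletion(std::vector<int> input) {
  auto plan = std::make_shared<DemoPlan>();
  std::shared_future<int> result = plan->Start(std::move(input));
  return result.get();  // wait while `plan` is still in scope
}  // `plan` is destroyed only after the work has finished
```

In the terms used in this thread, the analogue is making sure the exec plan outlives the future returned by StartAndCollect. One way to do that is to wait on the future before the plan is released.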