The field is always an Int64Array. Regarding the arrowBatch *error reading variable* message, we believe it is an artifact of gdb/gcc optimizations. I examined the variable in lower stack frames with gdb and it looks fine.

We replaced:

    std::static_pointer_cast<arrow::Int64Array>(
        _arrowBatch->column(_nAtts + dim))->raw_values()[index];

with:

    arrowBatch->column_data(_nAtts + dim)->GetValues<int64_t>(1)[index];

and the problem went away. So it looks like column() was the culprit. Moreover, since column() constructs a new shared_ptr on every call, switching to column_data() avoids that step entirely.

Since we make this call frequently and the number of columns of interest is small, do you recommend caching the result locally in the class? And if so, should we cache the result of column() or the result of raw_values()?
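
For concreteness, here is a rough sketch of the caching we have in mind. It is only an illustration, not code from the bridge plugin: CoordReader is a stand-in for our XChunkIterator, and it assumes the batch never changes after construction and that every coordinate column is Int64 (which, per our schema, it is).

    #include <cstdint>
    #include <memory>
    #include <vector>

    #include <arrow/api.h>

    // Sketch: cache the raw coordinate pointers once, from the thread that
    // creates the reader, and keep the RecordBatch alive so the underlying
    // buffers stay valid for as long as the cached pointers are used.
    class CoordReader {
    public:
        CoordReader(std::shared_ptr<const arrow::RecordBatch> batch,
                    size_t nAtts, size_t nDims)
            : _batch(std::move(batch)) {
            _coordValues.reserve(nDims);
            for (size_t dim = 0; dim < nDims; ++dim) {
                // column_data() hands back the ArrayData directly, so no
                // boxed shared_ptr<arrow::Array> is created per lookup.
                _coordValues.push_back(
                    _batch->column_data(static_cast<int>(nAtts + dim))
                        ->GetValues<int64_t>(1));
            }
        }

        int64_t getCoord(size_t dim, int64_t index) const {
            return _coordValues[dim][index];  // plain load, no ref-counting
        }

    private:
        std::shared_ptr<const arrow::RecordBatch> _batch;  // keeps buffers alive
        std::vector<const int64_t*> _coordValues;          // cached raw values
    };

Caching the raw values rather than the Array keeps the hot path to a single pointer read, and because all the lookups happen in the constructor it also avoids having multiple threads make the first call to column() concurrently, as discussed in the quoted thread below.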
Thanks for the pointers!
Rares

On Thu, May 20, 2021 at 5:16 PM Weston Pace <weston.p...@gmail.com> wrote:
> I like Yibo's stack overflow theory given the "error reading variable"
> but I did confirm that I can cause a segmentation fault if
> std::atomic_store / std::atomic_load are unavailable. I simulated
> this by simply commenting out the specializations rather than actually
> run against GCC 4.9.2 so it may not be perfect. I've attached a patch
> with my stress test (based on the latest master,
> #c697a41ab9c11511113e7387fe4710df920c36ed). Running that stress test
> while running `stress -c 16` on my server reproduces it pretty
> reliably.
>
> Thread 1 (Thread 0x7f6ae05fc700 (LWP 2308757)):
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f6ae352e859 in __GI_abort () at abort.c:79
> #2  0x00007f6ae37fe892 in __gnu_cxx::__verbose_terminate_handler () at
>     /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
> #3  0x00007f6ae37fcf69 in __cxxabiv1::__terminate (handler=<optimized out>) at
>     /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
> #4  0x00007f6ae37fcfab in std::terminate () at
>     /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
> #5  0x00007f6ae37fd9d0 in __cxxabiv1::__cxa_pure_virtual () at
>     /home/conda/feedstock_root/build_artifacts/ctng-compilers_1601682258120/work/.build/x86_64-conda-linux-gnu/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
> #6  0x000055a64bc4400a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f6ad0001160) at
>     /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:155
> #7  0x000055a64bc420f3 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f6ae05fa568, __in_chrg=<optimized out>) at
>     /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:730
> #8  0x000055a64bc3a4a2 in std::__shared_ptr<arrow::Array, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f6ae05fa560, __in_chrg=<optimized out>) at
>     /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr_base.h:1169
> #9  0x000055a64bc3a4be in std::shared_ptr<arrow::Array>::~shared_ptr (this=0x7f6ae05fa560, __in_chrg=<optimized out>) at
>     /home/pace/anaconda3/envs/arrow-dev/x86_64-conda-linux-gnu/include/c++/9.3.0/bits/shared_ptr.h:103
> #10 0x000055a64bc557ca in arrow::TestRecordBatch_BatchColumnBoxingStress_Test::<lambda()>::operator()(void) const (__closure=0x55a64d5f5218) at
>     ../src/arrow/record_batch_test.cc:206
>
> As a workaround to see if this is indeed your issue, you can call
> RecordBatch::column on each of the columns as soon as you create the
> RecordBatch (from one thread) which will force the boxed columns to
> materialize.
>
> -Weston
>
> On Thu, May 20, 2021 at 11:40 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Also, is it possible that the field is not an Int64Array?
> >
> > On Wed, May 19, 2021 at 10:19 PM Yibo Cai <yibo....@arm.com> wrote:
> > >
> > > On 5/20/21 4:15 AM, Rares Vernica wrote:
> > > > Hello,
> > > >
> > > > I'm using Arrow for accessing data outside the SciDB database engine. It
> > > > generally works fine but we are running into Segmentation Faults in a
> > > > corner multi-threaded case. I identified two threads that work on the same
> > > > Record Batch. I wonder if there is something internal about RecordBatch
> > > > that might help solve the mystery.
> > > >
> > > > We are using Arrow 0.16.0. The backtrace of the triggering thread looks
> > > > like this:
> > > >
> > > > Program received signal SIGSEGV, Segmentation fault.
> > > > [Switching to Thread 0x7fdad5fb4700 (LWP 3748)]
> > > > 0x00007fdaa805abe0 in ?? ()
> > > > (gdb) thread
> > > > [Current thread is 2 (Thread 0x7fdad5fb4700 (LWP 3748))]
> > > > (gdb) bt
> > > > #0  0x00007fdaa805abe0 in ?? ()
> > > > #1  0x0000000000850212 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
> > > > #2  0x00007fdae4b1fbf1 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fdad5fb1ae8, __in_chrg=<optimized out>) at
> > > >     /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:666
> > > > #3  0x00007fdae4b39d74 in std::__shared_ptr<arrow::Array, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fdad5fb1ae0, __in_chrg=<optimized out>) at
> > > >     /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:914
> > > > #4  0x00007fdae4b39da8 in std::shared_ptr<arrow::Array>::~shared_ptr (this=0x7fdad5fb1ae0, __in_chrg=<optimized out>) at
> > > >     /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:93
> > > > #5  0x00007fdae4b6a8e1 in scidb::XChunkIterator::getCoord (this=0x7fdaa807f9f0, dim=1, index=1137) at XArray.cpp:358
> > > > #6  0x00007fdae4b68ecb in scidb::XChunkIterator::XChunkIterator (this=0x7fdaa807f9f0, chunk=..., iterationMode=0,
> > > >     arrowBatch=<error reading variable: Cannot access memory at address 0xd5fb1b90>) at XArray.cpp:157
> > >
> > > FWIW, this "error reading variable" looks suspicious. Maybe the argument
> > > 'arrowBatch' is trashed accidentally (stack overflow)?
> > > https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L132
> > >
> > > > ...
> > > >
> > > > The backtrace of the other thread working on exactly the same Record Batch
> > > > looks like this:
> > > >
> > > > (gdb) thread
> > > > [Current thread is 3 (Thread 0x7fdad61b5700 (LWP 3746))]
> > > > (gdb) bt
> > > > #0  0x00007fdae3bc1ec7 in arrow::SimpleRecordBatch::column(int) const () from /lib64/libarrow.so.16
> > > > #1  0x00007fdae4b6a888 in scidb::XChunkIterator::getCoord (this=0x7fdab00c0bb0, dim=0, index=71) at XArray.cpp:357
> > > > #2  0x00007fdae4b6a5a2 in scidb::XChunkIterator::operator++ (this=0x7fdab00c0bb0) at XArray.cpp:305
> > > > ...
> > > >
> > > > In both cases, the last non-Arrow code is the getCoord function
> > > > https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L355
> > > >
> > > > int64_t XChunkIterator::getCoord(size_t dim, int64_t index)
> > > > {
> > > >     return std::static_pointer_cast<arrow::Int64Array>(
> > > >         _arrowBatch->column(_nAtts + dim))->raw_values()[index];
> > > > }
> > > > ...
> > > > std::shared_ptr<const arrow::RecordBatch> _arrowBatch;
> > > >
> > > > Do you see anything suspicious about this code? What would trigger the
> > > > shared_ptr destruction which takes place in thread 2?
> > > >
> > > > Thank you!
> > > > Rares
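
P.S. For completeness, this is roughly how we read the materialization workaround suggested above: touch every column once, from the single thread that creates the batch, so the boxed arrays already exist before any reader threads call column(). This is only a sketch; materializeColumns is a placeholder name and not part of the bridge code or the Arrow API.

    #include <memory>

    #include <arrow/api.h>

    // Touch each column once so the lazily boxed shared_ptr<arrow::Array>
    // instances are created and cached while only one thread owns the batch.
    void materializeColumns(const std::shared_ptr<const arrow::RecordBatch>& batch) {
        for (int i = 0; i < batch->num_columns(); ++i) {
            batch->column(i);  // result deliberately discarded
        }
    }

If pre-materializing like this also makes the crash disappear, that would further point at concurrent first calls to column() racing to box the same column.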