What compiler / glibc version are you using? arrow::SimpleRecordBatch::column does some non-trivial caching that uses std::atomic_load [1], which is not implemented properly on gcc < 5, so our behavior differs depending on the compiler version.
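
To make the caching concern concrete, here is a minimal sketch of the general pattern, not the actual Arrow source. It assumes the cache is a vector of std::shared_ptr slots that are read and written through the std::atomic_load / std::atomic_store free-function overloads for shared_ptr. FakeArray, LazyBatch and run_demo are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for arrow::Array; not the Arrow type.
struct FakeArray {
  explicit FakeArray(int i) : index(i) {}
  int index;
};

// Hypothetical stand-in for a record batch whose columns are built lazily.
class LazyBatch {
 public:
  explicit LazyBatch(int num_columns) : boxed_(num_columns) {}

  // Returns the cached column, creating it on first access.
  std::shared_ptr<FakeArray> column(int i) const {
    // Atomic read of the cache slot (shared_ptr overload of atomic_load).
    std::shared_ptr<FakeArray> result = std::atomic_load(&boxed_[i]);
    if (!result) {
      // Several threads may reach this point at once. Each builds its own
      // object and publishes it; the last store wins, which is harmless
      // here because all candidates are equivalent.
      result = std::make_shared<FakeArray>(i);
      std::atomic_store(&boxed_[i], result);
    }
    return result;
  }

 private:
  mutable std::vector<std::shared_ptr<FakeArray>> boxed_;
};

// Calls column() from several threads and checks every returned value.
bool run_demo() {
  LazyBatch batch(4);
  std::atomic<bool> ok{true};
  std::vector<std::thread> threads;
  for (int t = 0; t < 4; ++t) {
    threads.emplace_back([&batch, &ok] {
      for (int n = 0; n < 10000; ++n) {
        int i = n % 4;
        std::shared_ptr<FakeArray> col = batch.column(i);
        if (!col || col->index != i) ok = false;
      }
    });
  }
  for (auto& th : threads) th.join();
  return ok;
}
```

With working atomic overloads, concurrent calls to column() on the same batch are safe in this sketch. If the slot were instead read and written as a plain shared_ptr, two threads could race on its reference count, and a crash inside a shared_ptr destructor would be one plausible symptom.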

[1] https://en.cppreference.com/w/cpp/atomic/atomic_load

On Wed, May 19, 2021 at 10:15 AM Rares Vernica <rvern...@gmail.com> wrote:
>
> Hello,
>
> I'm using Arrow for accessing data outside the SciDB database engine. It
> generally works fine but we are running into Segmentation Faults in a
> corner multi-threaded case. I identified two threads that work on the same
> Record Batch. I wonder if there is something internal about RecordBatch
> that might help solve the mystery.
>
> We are using Arrow 0.16.0. The backtrace of the triggering thread looks
> like this:
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fdad5fb4700 (LWP 3748)]
> 0x00007fdaa805abe0 in ?? ()
> (gdb) thread
> [Current thread is 2 (Thread 0x7fdad5fb4700 (LWP 3748))]
> (gdb) bt
> #0 0x00007fdaa805abe0 in ?? ()
> #1 0x0000000000850212 in
> std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
> #2 0x00007fdae4b1fbf1 in
> std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
> (this=0x7fdad5fb1ae8, __in_chrg=<optimized out>) at
> /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:666
> #3 0x00007fdae4b39d74 in std::__shared_ptr<arrow::Array,
> (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fdad5fb1ae0,
> __in_chrg=<optimized out>) at
> /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr_base.h:914
> #4 0x00007fdae4b39da8 in std::shared_ptr<arrow::Array>::~shared_ptr
> (this=0x7fdad5fb1ae0, __in_chrg=<optimized out>) at
> /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/bits/shared_ptr.h:93
> #5 0x00007fdae4b6a8e1 in scidb::XChunkIterator::getCoord
> (this=0x7fdaa807f9f0, dim=1, index=1137) at XArray.cpp:358
> #6 0x00007fdae4b68ecb in scidb::XChunkIterator::XChunkIterator
> (this=0x7fdaa807f9f0, chunk=..., iterationMode=0, arrowBatch=<error reading
> variable: Cannot access memory at address 0xd5fb1b90>) at XArray.cpp:157
> ...
>
> The backtrace of the other thread working on exactly the same Record Batch
> looks like this:
>
> (gdb) thread
> [Current thread is 3 (Thread 0x7fdad61b5700 (LWP 3746))]
> (gdb) bt
> #0 0x00007fdae3bc1ec7 in arrow::SimpleRecordBatch::column(int) const ()
> from /lib64/libarrow.so.16
> #1 0x00007fdae4b6a888 in scidb::XChunkIterator::getCoord
> (this=0x7fdab00c0bb0, dim=0, index=71) at XArray.cpp:357
> #2 0x00007fdae4b6a5a2 in scidb::XChunkIterator::operator++
> (this=0x7fdab00c0bb0) at XArray.cpp:305
> ...
>
> In both cases, the last non-Arrow code is the getCoord function
> https://github.com/Paradigm4/bridge/blob/master/src/XArray.cpp#L355
>
> int64_t XChunkIterator::getCoord(size_t dim, int64_t index)
> {
>     return std::static_pointer_cast<arrow::Int64Array>(
>         _arrowBatch->column(_nAtts + dim))->raw_values()[index];
> }
> ...
> std::shared_ptr<const arrow::RecordBatch> _arrowBatch;
>
> Do you see anything suspicious about this code? What would trigger the
> shared_ptr destruction which takes place in thread 2?
>
> Thank you!
> Rares