[ 
https://issues.apache.org/jira/browse/IMPALA-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912978#comment-17912978
 ] 

Laszlo Gaal commented on IMPALA-13669:
--------------------------------------

I managed to catch an instance of the bug while the build machine was still 
online. A stack dump of the hung process yielded the following call stack:
{code}
(gdb) info threads
  Id   Target Id                          Frame 
* 1    Thread 0xffffb8c4c040 (LWP 894818) 0x0000ffffb7aeecc4 in 
__futex_abstimed_wait64 () from /lib64/libc.so.6
(gdb) bt
#0  0x0000ffffb7aeecc4 in __futex_abstimed_wait64 () from /lib64/libc.so.6
#1  0x0000ffffb7af8f80 in pthread_rwlock_wrlock@GLIBC_2.17 () from 
/lib64/libc.so.6
#2  0x0000000003632440 in glog_internal_namespace_::Mutex::Lock (this=0x556f430 
<google::log_mutex>) at src/base/mutex.h:250
#3  glog_internal_namespace_::MutexLock::MutexLock (mu=0x556f430 
<google::log_mutex>, this=<synthetic pointer>) at src/base/mutex.h:290
#4  google::LogMessage::Flush (this=0xffffc5250608) at src/logging.cc:1335
#5  0x0000000003634c30 in google::LogMessageFatal::~LogMessageFatal 
(this=<optimized out>, __in_chrg=<optimized out>) at src/logging.cc:2048
#6  0x0000000000f849e4 in impala::BufferPool::Client::MoveToDirtyUnpinned 
(this=<optimized out>, page=page@entry=0x2da57e00) at 
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool.cc:503
#7  0x0000000000f84cfc in impala::BufferPool::Unpin 
(this=this@entry=0x4177c200, client=<optimized out>, 
client@entry=0xffffc52507a8, handle=0x2e60cc80) at 
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool.cc:210
#8  0x0000000000f50048 in 
impala::BufferPoolTest_ScratchLimitZero_Test::TestBody (this=0x2dad3000) at 
/data0/jenkins/workspace/impala-private-basic-parameterized/Impala-Toolchain/toolchain-packages-gcc10.4.0/gcc-10.4.0/include/c++/10.4.0/bits/stl_vector.h:1168
#9  0x00000000037e77f8 in 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> 
(location=0x3f4f198 "the test body", method=<optimized out>, object=0x2dad3000) 
at /mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2612
#10 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> 
(object=object@entry=0x2dad3000, method=<optimized out>, 
location=location@entry=0x3f4f198 "the test body") at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2648
#11 0x00000000037cc468 in testing::Test::Run (this=0x2dad3000) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2687
#12 testing::Test::Run (this=0x2dad3000) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2677
#13 0x00000000037cc608 in testing::TestInfo::Run (this=0x27c32c60) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2836
#14 0x00000000037cc8a4 in testing::TestSuite::Run (this=0x27c3e240) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:3015
#15 testing::TestSuite::Run (this=0x27c3e240) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2968
#16 0x00000000037df65c in testing::internal::UnitTestImpl::RunAllTests 
(this=this@entry=0x27c2e000) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:5920
#17 0x00000000037cc9d8 in 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl,
 bool> (location=0x3f4f248 "auxiliary test code (environments or event 
listeners)", method=<optimized out>, object=0x27c2e000) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2601
#18 
testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl,
 bool> (location=0x3f4f248 "auxiliary test code (environments or event 
listeners)", method=<optimized out>, object=0x27c2e000) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:2648
#19 testing::UnitTest::Run (this=0x575ad50 
<testing::UnitTest::GetInstance()::instance>) at 
/mnt/source/googletest/googletest-1.14.0/googletest/src/gtest.cc:5484
#20 0x0000000000f34be4 in RUN_ALL_TESTS () at 
/data/jenkins/workspace/impala-private-basic-parameterized/Impala-Toolchain/toolchain-packages-gcc10.4.0/googletest-1.14.0/include/gtest/gtest.h:2317
#21 main (argc=<optimized out>, argv=<optimized out>) at 
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool-test.cc:2521
{code}
(This was captured on ARM.)
Listing the call environment for frame #6:
{code}
(gdb) f 6
#6  0x0000000000f849e4 in impala::BufferPool::Client::MoveToDirtyUnpinned 
(this=<optimized out>, page=page@entry=0x2da57e00) at 
/data/jenkins/workspace/impala-private-basic-parameterized/repos/Impala/be/src/runtime/bufferpool/buffer-pool.cc:503
503       DCHECK(spilling_enabled());
(gdb) list
498       handle->Reset();
499     }
500     
501     void BufferPool::Client::MoveToDirtyUnpinned(Page* page) {
502       // Only valid to unpin pages if spilling is enabled.
503       DCHECK(spilling_enabled());
504       DCHECK_EQ(0, page->pin_count);
505     
506       unique_lock<mutex> lock(lock_);
507       DCHECK_CONSISTENCY();
{code}
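This suggests that the hang might be happening in GLog: the DCHECK at 
buffer-pool.cc:503 is emitting a fatal log message, and 
google::LogMessage::Flush (frame #4) is blocked in pthread_rwlock_wrlock 
waiting for google::log_mutex.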
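
To illustrate the blocking pattern visible in frames #0 to #4, here is a minimal sketch of a writer waiting on a pthread rwlock that another thread already holds. It is not GLog's code and it does not claim to reproduce the root cause of this hang: {{log_lock}}, {{TimedWriteLock}} and {{WriterBlocksWhileLockHeld}} are hypothetical names, and a timed lock is used in place of the plain {{pthread_rwlock_wrlock}} from the stack so that the example terminates instead of hanging.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>
#include <ctime>
#include <future>
#include <thread>

// Attempts to take the write lock, giving up after timeout_ms milliseconds.
// Returns 0 on success or an errno value such as ETIMEDOUT.
int TimedWriteLock(pthread_rwlock_t* lock, long timeout_ms) {
  timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);  // timedwrlock takes an absolute time
  deadline.tv_sec += timeout_ms / 1000;
  deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec += 1;
    deadline.tv_nsec -= 1000000000L;
  }
  return pthread_rwlock_timedwrlock(lock, &deadline);
}

// Returns true if a writer times out while another thread holds the lock
// and succeeds once that thread has released it.
bool WriterBlocksWhileLockHeld() {
  // Hypothetical stand-in for a process-wide logging lock (assumption).
  pthread_rwlock_t log_lock;
  int rc = pthread_rwlock_init(&log_lock, nullptr);
  assert(rc == 0);
  (void)rc;

  std::promise<void> acquired;
  std::promise<void> release;
  std::thread holder([&] {
    pthread_rwlock_wrlock(&log_lock);
    acquired.set_value();          // tell the main thread the lock is held
    release.get_future().wait();   // keep holding until told to let go
    pthread_rwlock_unlock(&log_lock);
  });

  acquired.get_future().wait();
  // The lock is held by the other thread, so this attempt cannot succeed.
  int while_held = TimedWriteLock(&log_lock, 100);

  release.set_value();
  holder.join();
  // With the holder gone, the same call acquires the lock immediately.
  int after_release = TimedWriteLock(&log_lock, 100);
  if (after_release == 0) pthread_rwlock_unlock(&log_lock);
  pthread_rwlock_destroy(&log_lock);

  return while_held == ETIMEDOUT && after_release == 0;
}
```

Calling {{WriterBlocksWhileLockHeld()}} returns true when the first attempt times out and the second succeeds. With an untimed write lock, as in the captured stack, the first attempt would wait for as long as the lock stays held, which matches the symptom of a thread parked in {{__futex_abstimed_wait64}}.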

> buffer-pool-test hangs on Rocky 9
> ---------------------------------
>
>                 Key: IMPALA-13669
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13669
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.5.0
>            Reporter: Laszlo Gaal
>            Assignee: Laszlo Gaal
>            Priority: Critical
>
> Recent test runs on Rocky Linux 9.2 often resulted in a hang in 
> {{buffer-pool-test}} during BE tests. The hangs were observed only on Rocky 
> 9, and they were seen on both Intel and ARM CPUs.
> When the hang occurs, it is only resolved by the test run's internal watchdog 
> timing out at 20 hours, killing the build.
> Example runs:
> * https://jenkins.impala.io/job/rocky-9.2-from-scratch-ARM/4/ (ARM)
> * https://jenkins.impala.io/job/rocky-9.2-from-scratch/9/ (Intel)
> Multiple occurrences were observed in private environments as well.
> Marking as P2 (critical), as it doesn't block precommit runs, but makes it 
> impossible to make progress with Rocky 9 / RHEL 9 support.


