[
https://issues.apache.org/jira/browse/IMPALA-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joe McDonnell updated IMPALA-14465:
-----------------------------------
Description:
Nightly jobs running on Redhat8 ARM64 have been seeing failures in Kudu custom
cluster tests like TestKuduHMSIntegration. These tests restart the Kudu service
to apply different startup options, but afterwards the Kudu service is unusable
and all operations fail, e.g.:
{noformat}
E impala.error.HiveServer2Error: Query aa48dd645b659d95:060aac6c00000000 failed:
E AnalysisException: Cannot analyze Kudu table 't': Error determining if Kudu's integration with the Hive Metastore is enabled: cannot complete before timeout: KuduRpc(method=getHiveMetastoreConfig, tablet=null, attempt=95, TimeoutTracker(timeout=180000, elapsed=179251), Trace Summary(177060 ms): Sent(0), Received(0), Delayed(94), MasterRefresh(0), AuthRefresh(0), Truncated: false
E Delayed: (UNKNOWN, [ getHiveMetastoreConfig, 94 ])){noformat}
When the tests restart the Kudu cluster, the restart command inherits
environment variables:
{noformat}
def _restart_kudu_service(kudu_args=None):
  kudu_env = dict(os.environ)
  if kudu_args is not None:
    kudu_env["IMPALA_KUDU_STARTUP_FLAGS"] = kudu_args
  call = subprocess.Popen(
    ['/bin/bash', '-c', os.path.join(IMPALA_HOME,
                                     'testdata/cluster/admin restart kudu')],
    env=kudu_env)
  call.wait()
  if call.returncode != 0:
    raise RuntimeError("Unable to restart Kudu"){noformat}
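To illustrate the inheritance described above, here is a minimal, self-contained sketch (not Impala code). It shows that a child process started with a copy of os.environ sees every variable in that copy, including one whose value is the empty string, and that "set but empty" is different from "unset". The helper child_sees and the variable name DEMO_EMPTY_VAR are hypothetical and exist only for this illustration.

```python
import os
import subprocess
import sys


def child_sees(var_name, env):
  """Report how a child process started with `env` sees `var_name`."""
  probe = (
      "import os, sys; "
      "v = os.environ.get(sys.argv[1]); "
      "print('unset' if v is None else 'set:' + repr(v))"
  )
  out = subprocess.check_output(
      [sys.executable, "-c", probe, var_name], env=env)
  return out.decode().strip()


# DEMO_EMPTY_VAR is a hypothetical name used only for this illustration.
inherited_env = dict(os.environ)
inherited_env["DEMO_EMPTY_VAR"] = ""  # set, but empty
inherited_result = child_sees("DEMO_EMPTY_VAR", inherited_env)

# Removing the key entirely is what makes the variable unset in the child.
cleaned_env = {k: v for k, v in inherited_env.items()
               if k != "DEMO_EMPTY_VAR"}
cleaned_result = child_sees("DEMO_EMPTY_VAR", cleaned_env)

print(inherited_result)
print(cleaned_result)
```

A program that only checks whether a variable is present, rather than whether it is non-empty, would treat these two cases differently.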
Comparing the environment of a regular Kudu minicluster startup with that of
the restart triggered by the custom cluster test showed several differences.
Trial and error narrowed the significant one down to HEAPCHECK: the test runs
with HEAPCHECK set (but empty). Somehow that causes problems, and the Kudu
processes get stuck in this stack:
{noformat}
#0 0x0000ffffa55937e4 in syscall () from /lib64/libc.so.6
#1 0x00000000036a6878 in munmap ()
#2 0x0000000000f3a658 in locate_debug_info ()
#3 0x0000000000f3a7dc in _ULaarch64_dwarf_find_debug_frame ()
#4 0x0000000000f3aaec in _ULaarch64_dwarf_callback ()
#5 0x0000ffffa567cf88 in dl_iterate_phdr () from /lib64/libc.so.6
#6 0x0000000003426cc0 in dl_iterate_phdr (callback=0xf3a924 <_ULaarch64_dwarf_callback>, data=0xffffea532d48) at /mnt/source/kudu/kudu-54f3bd31c/src/kudu/util/debug/unwind_safeness.cc:160
#7 0x0000000000f3aff0 in _ULaarch64_dwarf_find_proc_info ()
#8 0x0000000000f3754c in _ULaarch64_dwarf_step ()
#9 0x0000000000f359bc in _ULaarch64_step ()
#10 0x0000000000f5f56c in GetStackTrace_libunwind(void**, int, int) ()
#11 0x0000000000f60304 in GetStackTrace(void**, int, int) ()
#12 0x0000000000f597fc in MallocHook_GetCallerStackTrace ()
#13 0x0000000000f62258 in NewHook(void const*, unsigned long) ()
#14 0x0000000000f59568 in MallocHook::InvokeNewHookSlow(void const*, unsigned long) ()
#15 0x00000000036a5648 in tcmalloc::allocate_full_cpp_throw_oom(unsigned long) ()
#16 0x00000000035caddc in google::protobuf::DescriptorProto* google::protobuf::Arena::CreateMaybeMessage<google::protobuf::DescriptorProto>(google::protobuf::Arena*) ()
#17 0x00000000035cf7f8 in google::protobuf::FileDescriptorProto::_InternalParse(char const*, google::protobuf::internal::ParseContext*) ()
#18 0x000000000355912c in bool google::protobuf::internal::MergeFromImpl<false>(google::protobuf::stringpiece_internal::StringPiece, google::protobuf::MessageLite*, google::protobuf::MessageLite::ParseFlags) ()
#19 0x00000000035e841c in google::protobuf::EncodedDescriptorDatabase::Add(void const*, int) ()
#20 0x0000000003588f90 in google::protobuf::DescriptorPool::InternalAddGeneratedFile(void const*, int) ()
#21 0x00000000035f482c in google::protobuf::(anonymous namespace)::AddDescriptorsImpl(google::protobuf::internal::DescriptorTable const*) ()
#22 0x00000000035f4f3c in google::protobuf::internal::AddDescriptorsRunner::AddDescriptorsRunner(google::protobuf::internal::DescriptorTable const*) ()
#23 0x00000000036a3630 in __libc_csu_init ()
#24 0x0000ffffa559432c in __libc_start_main () from /lib64/libc.so.6
#25 0x0000000000e33e60 in _start (){noformat}
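The environment comparison described above can be sketched as follows. This is a rough illustration under stated assumptions, not the method actually used and not a confirmed fix: diff_environments and without_vars are hypothetical helpers, and the two sample environments are made up (the real ones would have to be captured from the two startup paths). Whether dropping HEAPCHECK before restarting Kudu is the right remedy is likewise an assumption; the sketch only shows how that could be expressed.

```python
def diff_environments(baseline, candidate):
  """Return {name: (baseline_value, candidate_value)} for each difference.

  A value of None means the variable is unset, which is distinct from "".
  """
  names = set(baseline) | set(candidate)
  return {
      name: (baseline.get(name), candidate.get(name))
      for name in sorted(names)
      if baseline.get(name) != candidate.get(name)
  }


def without_vars(env, names):
  """Return a copy of `env` with the given variables removed entirely."""
  return {k: v for k, v in env.items() if k not in names}


# Made-up sample environments, for illustration only.
minicluster_env = {"PATH": "/usr/bin", "IMPALA_HOME": "/opt/impala"}
test_env = {"PATH": "/usr/bin", "IMPALA_HOME": "/opt/impala",
            "HEAPCHECK": ""}

# HEAPCHECK shows up as (None, ''): unset in one, set-but-empty in the other.
differences = diff_environments(minicluster_env, test_env)
print(differences)

# Assumed workaround: drop HEAPCHECK from the environment used for restart.
kudu_env = without_vars(test_env, {"HEAPCHECK"})
remaining = diff_environments(minicluster_env, kudu_env)
print(remaining)
```

Reporting unset variables as None rather than as an empty string matters here, because a comparison that treats the two as equal would not have surfaced HEAPCHECK at all.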
> Kudu cannot start up on Redhat8 ARM64 with HEAPCHECK set in environment
> -----------------------------------------------------------------------
>
> Key: IMPALA-14465
> URL: https://issues.apache.org/jira/browse/IMPALA-14465
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Reporter: Joe McDonnell
> Priority: Critical