[ 
https://issues.apache.org/jira/browse/IMPALA-14465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18022984#comment-18022984
 ] 

ASF subversion and git services commented on IMPALA-14465:
----------------------------------------------------------

Commit 48b38810e8404bb3b13acfec03151acb9135eb1f in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=48b38810e ]

IMPALA-14465: Unset HEAPCHECK when custom cluster tests restart Kudu

Custom cluster tests like TestKuduHMSIntegration restart the Kudu
service with custom startup flags. On Redhat8 ARM64, these tests
have been failing due to Kudu being unresponsive after this
restart. Debugging showed that Kudu was stuck early in startup.
This only reproduced via the custom cluster tests and never via
regular minicluster startup.

When custom cluster tests restart Kudu, the script to restart
Kudu inherits environment variables from the test runner. It
turns out that the HEAPCHECK environment variable (even when
empty) causes Kudu to get stuck during startup on Redhat8
ARM64 after the recent toolchain update.
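
To illustrate why "even when empty" matters: a set-but-empty variable is
still present in a child's environment, which is different from the
variable being unset. The following is a minimal Python sketch, not code
from the Impala tree; the helper name child_sees is hypothetical.

```python
import os
import subprocess
import sys


def child_sees(var, env):
    """Run a child Python process with `env` and report how it sees `var`.

    Returns "unset" if the variable is absent in the child, otherwise
    "set:<value>" (so a set-but-empty variable yields "set:").
    """
    probe = (
        "import os, sys; "
        "v = os.environ.get(sys.argv[1]); "
        "print('unset' if v is None else 'set:' + v)"
    )
    out = subprocess.check_output([sys.executable, "-c", probe, var], env=env)
    return out.decode().strip()


# Equivalent of `export HEAPCHECK=` in the parent: set, but empty.
inherited = dict(os.environ)
inherited["HEAPCHECK"] = ""

# Equivalent of `unset HEAPCHECK`: the key is absent altogether.
cleaned = dict(inherited)
cleaned.pop("HEAPCHECK", None)

print(child_sees("HEAPCHECK", inherited))
print(child_sees("HEAPCHECK", cleaned))
```

The first call reports the variable as set with an empty value, the
second reports it as unset.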

As a short-term fix, this unsets HEAPCHECK when restarting the
Kudu service for these tests. There will need to be further
investigation / cleanup beyond this.
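
A minimal sketch of what such a change could look like, based on the
_restart_kudu_service helper quoted in the issue description. This is an
assumption about the shape of the fix, not the committed diff;
build_kudu_restart_env is a hypothetical helper name, while HEAPCHECK and
IMPALA_KUDU_STARTUP_FLAGS come from the issue.

```python
import os


def build_kudu_restart_env(kudu_args=None, base_env=None):
    """Return the environment for the Kudu restart command.

    Copies the caller's environment, drops HEAPCHECK entirely (an empty
    value is not enough, since the variable would still be set), and
    optionally passes custom startup flags.
    """
    env = dict(os.environ if base_env is None else base_env)
    # pop() with a default is a no-op when HEAPCHECK is already absent.
    env.pop("HEAPCHECK", None)
    if kudu_args is not None:
        env["IMPALA_KUDU_STARTUP_FLAGS"] = kudu_args
    return env
```

The returned dict would be passed as env= to subprocess.Popen in place
of the plain copy of os.environ. Because it works on a copy, the test
runner's own environment is left unchanged.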

Testing:
 - Ran the Kudu custom cluster tests on Redhat8 ARM64 and
   on Ubuntu 20 x86_64

Change-Id: I51513e194d9e605df199672231b412fae40343af
Reviewed-on: http://gerrit.cloudera.org:8080/23467
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Kudu cannot start up on Redhat8 ARM64 with HEAPCHECK set in environment
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-14465
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14465
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Critical
>
> Nightly jobs running on Redhat8 ARM64 have been seeing failures in Kudu 
> custom cluster tests like TestKuduHMSIntegration. These tests restart the 
> Kudu service to apply different startup options, but the Kudu service is 
> unusable and all operations fail. For example:
> {noformat}
> E   impala.error.HiveServer2Error: Query aa48dd645b659d95:060aac6c00000000 
> failed:
> E   AnalysisException: Cannot analyze Kudu table 't': Error determining if 
> Kudu's integration with the Hive Metastore is enabled: cannot complete before 
> timeout: KuduRpc(method=getHiveMetastoreConfig, tablet=null, attempt=95, 
> TimeoutTracker(timeout=180000, elapsed=179251), Trace Summary(177060 ms): 
> Sent(0), Received(0), Delayed(94), MasterRefresh(0), AuthRefresh(0), 
> Truncated: false
> E    Delayed: (UNKNOWN, [ getHiveMetastoreConfig, 94 ])){noformat}
> When the tests restart the Kudu cluster, the restart command inherits 
> environment variables:
> {noformat}
>   def _restart_kudu_service(kudu_args=None):
>     kudu_env = dict(os.environ)
>     if kudu_args is not None:
>       kudu_env["IMPALA_KUDU_STARTUP_FLAGS"] = kudu_args
>     call = subprocess.Popen(
>         ['/bin/bash', '-c', os.path.join(IMPALA_HOME,
>                                          'testdata/cluster/admin restart kudu')],
>         env=kudu_env)
>     call.wait()
>     if call.returncode != 0:
>       raise RuntimeError("Unable to restart Kudu"){noformat}
> Comparing the environment between regular Kudu minicluster startup vs the 
> restart triggered by the custom cluster test showed several differences. 
> Trial and error showed that the significant difference is that the test runs 
> with HEAPCHECK set (but empty). For reasons not yet understood, this causes 
> the Kudu processes to get stuck in the following stack:
> {noformat}
> #0  0x0000ffffa55937e4 in syscall () from /lib64/libc.so.6
> #1  0x00000000036a6878 in munmap ()
> #2  0x0000000000f3a658 in locate_debug_info ()
> #3  0x0000000000f3a7dc in _ULaarch64_dwarf_find_debug_frame ()
> #4  0x0000000000f3aaec in _ULaarch64_dwarf_callback ()
> #5  0x0000ffffa567cf88 in dl_iterate_phdr () from /lib64/libc.so.6
> #6  0x0000000003426cc0 in dl_iterate_phdr (callback=0xf3a924 
> <_ULaarch64_dwarf_callback>, data=0xffffea532d48) at 
> /mnt/source/kudu/kudu-54f3bd31c/src/kudu/util/debug/unwind_safeness.cc:160
> #7  0x0000000000f3aff0 in _ULaarch64_dwarf_find_proc_info ()
> #8  0x0000000000f3754c in _ULaarch64_dwarf_step ()
> #9  0x0000000000f359bc in _ULaarch64_step ()
> #10 0x0000000000f5f56c in GetStackTrace_libunwind(void**, int, int) ()
> #11 0x0000000000f60304 in GetStackTrace(void**, int, int) ()
> #12 0x0000000000f597fc in MallocHook_GetCallerStackTrace ()
> #13 0x0000000000f62258 in NewHook(void const*, unsigned long) ()
> #14 0x0000000000f59568 in MallocHook::InvokeNewHookSlow(void const*, unsigned 
> long) ()
> #15 0x00000000036a5648 in tcmalloc::allocate_full_cpp_throw_oom(unsigned 
> long) ()
> #16 0x00000000035caddc in google::protobuf::DescriptorProto* 
> google::protobuf::Arena::CreateMaybeMessage<google::protobuf::DescriptorProto>(google::protobuf::Arena*)
>  ()
> #17 0x00000000035cf7f8 in 
> google::protobuf::FileDescriptorProto::_InternalParse(char const*, 
> google::protobuf::internal::ParseContext*) ()
> #18 0x000000000355912c in bool 
> google::protobuf::internal::MergeFromImpl<false>(google::protobuf::stringpiece_internal::StringPiece,
>  google::protobuf::MessageLite*, google::protobuf::MessageLite::ParseFlags) ()
> #19 0x00000000035e841c in 
> google::protobuf::EncodedDescriptorDatabase::Add(void const*, int) ()
> #20 0x0000000003588f90 in 
> google::protobuf::DescriptorPool::InternalAddGeneratedFile(void const*, int) 
> ()
> #21 0x00000000035f482c in google::protobuf::(anonymous 
> namespace)::AddDescriptorsImpl(google::protobuf::internal::DescriptorTable 
> const*) ()
> #22 0x00000000035f4f3c in 
> google::protobuf::internal::AddDescriptorsRunner::AddDescriptorsRunner(google::protobuf::internal::DescriptorTable
>  const*) ()
> #23 0x00000000036a3630 in __libc_csu_init ()
> #24 0x0000ffffa559432c in __libc_start_main () from /lib64/libc.so.6
> #25 0x0000000000e33e60 in _start (){noformat}
> Unsetting HEAPCHECK causes the Kudu startup to work normally. For some 
> reason, this is only a problem on Redhat8 ARM64.
> We should unset HEAPCHECK for this restart case (and look into removing the 
> "export HEAPCHECK=" statements).
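
The environment comparison described above can be sketched as a diff of
two KEY=VALUE dumps, for example captured with `env` in each context.
This is a minimal Python sketch; the function names and sample values
are hypothetical, and values containing newlines are not handled.

```python
def parse_env_dump(text):
    """Parse `KEY=VALUE` lines (as printed by `env`) into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '=' (e.g. blank lines)
            env[key] = value
    return env


def diff_envs(a, b):
    """Return (only_in_a, only_in_b, changed) for two environment dicts."""
    only_a = {k: a[k] for k in a.keys() - b.keys()}
    only_b = {k: b[k] for k in b.keys() - a.keys()}
    changed = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    return only_a, only_b, changed


# Hypothetical dumps: regular minicluster startup vs. test-driven restart.
regular = parse_env_dump("PATH=/usr/bin\nLANG=C\n")
from_test = parse_env_dump("PATH=/usr/bin\nLANG=C\nHEAPCHECK=\n")
only_regular, only_test, changed = diff_envs(regular, from_test)
print(only_test)  # {'HEAPCHECK': ''}
```

Keeping set-but-empty variables in the parsed dict matters here, since
an empty HEAPCHECK is exactly the kind of difference the comparison
needs to surface.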



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
