[jira] [Updated] (KUDU-3527) Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton
[ https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Martonka updated KUDU-3527:
----------------------------------
    Summary: Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton  (was: Fix kudu issues on reel 8.8 graviton)

> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton
> -------------------------------------------------------------------------
>
>                 Key: KUDU-3527
>                 URL: https://issues.apache.org/jira/browse/KUDU-3527
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Zoltan Martonka
>            Assignee: Zoltan Martonka
>            Priority: Major
>
> Tests failing in debug build:
> client_examples-test
> client-test
> predicate-test
> columnar_serialization-test
> wire_protocol-test
> block_manager-test
> log_block_manager-test
> alter_table-test
> auth_token_expire-itest
> consistency-itest
> flex_partitioning-itest
> linked_list-test
> maintenance_mode-itest
> master_replication-itest
> master-stress-test
> raft_consensus-itest
> security-unknown-tsk-itest
> stop_tablet-itest
> tablet_history_gc-itest
> tablet_server_quiescing-itest
> ts_authz-itest
> webserver-stress-itest
> dynamic_multi_master-test
> rpc-test
> kudu-tool-test
> rebalancer_tool-test
> tablet_server-test
> bitmap-test

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (KUDU-3527) Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton
[ https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Martonka updated KUDU-3527:
----------------------------------
    Description: 
BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where fs_block_size=64k.

*Cause:*
Currently, tablets fail to load if a metadata file is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is not zero, then there is a chance that, when we delete the container file, we only delete the ".meta" file but leave the ".data" file.
In the current test, deletion never occurs on systems with fs_block_size=4k. Changing to kNumAppends=64 will cause the test to randomly fail on x86 systems too, although only with a 2-3% chance (at least on my ubuntu20 machine).

*Solution:*
This test was not intended to test the file deletion itself (as it does not happen on x86_64 or on 4k ARM kernels). The deletion only occurs because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough". We should just set

  was:
Tests failing in debug build:
client_examples-test
client-test
predicate-test
columnar_serialization-test
wire_protocol-test
block_manager-test
log_block_manager-test
alter_table-test
auth_token_expire-itest
consistency-itest
flex_partitioning-itest
linked_list-test
maintenance_mode-itest
master_replication-itest
master-stress-test
raft_consensus-itest
security-unknown-tsk-itest
stop_tablet-itest
tablet_history_gc-itest
tablet_server_quiescing-itest
ts_authz-itest
webserver-stress-itest
dynamic_multi_master-test
rpc-test
kudu-tool-test
rebalancer_tool-test
tablet_server-test
bitmap-test

> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton
> -------------------------------------------------------------------------
>
>                 Key: KUDU-3527
>                 URL: https://issues.apache.org/jira/browse/KUDU-3527
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Zoltan Martonka
>            Assignee: Zoltan Martonka
>            Priority: Major
>
> BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where fs_block_size=64k.
> *Cause:*
> Currently, tablets fail to load if a metadata file is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is not zero, then there is a chance that, when we delete the container file, we only delete the ".meta" file but leave the ".data" file.
> In the current test, deletion never occurs on systems with fs_block_size=4k. Changing to kNumAppends=64 will cause the test to randomly fail on x86 systems too, although only with a 2-3% chance (at least on my ubuntu20 machine).
> *Solution:*
> This test was not intended to test the file deletion itself (as it does not happen on x86_64 or on 4k ARM kernels). The deletion only occurs because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough". We should just set

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (KUDU-3527) Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton
[ https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Martonka updated KUDU-3527:
----------------------------------
    Description: 
BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where fs_block_size=64k.

*Cause:*
Currently, tablets fail to load if a metadata file is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is not zero, then there is a chance that, when we delete the container file, we only delete the ".meta" file but leave the ".data" file.
In the current test, deletion never occurs on systems with fs_block_size=4k. Changing to kNumAppends=64 will cause the test to randomly fail on x86 systems too, although only with a 2-3% chance (at least on my ubuntu20 machine).

*Solution:*
This test was not intended to test the file deletion itself (as it does not happen on x86_64 or on 4k ARM kernels). The deletion only occurs because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough". We should just set _FLAGS_log_block_manager_delete_dead_container = false;_ to restore the original scope of the test.
There is a separate issue for the root cause (which is not ARM-specific at all):
https://issues.apache.org/jira/browse/KUDU-3528

  was:
BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where fs_block_size=64k.

*Cause:*
Currently, tablets fail to load if a metadata file is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is not zero, then there is a chance that, when we delete the container file, we only delete the ".meta" file but leave the ".data" file.
In the current test, deletion never occurs on systems with fs_block_size=4k. Changing to kNumAppends=64 will cause the test to randomly fail on x86 systems too, although only with a 2-3% chance (at least on my ubuntu20 machine).

*Solution:*
This test was not intended to test the file deletion itself (as it does not happen on x86_64 or on 4k ARM kernels). The deletion only occurs because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough". We should just set

> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton
> -------------------------------------------------------------------------
>
>                 Key: KUDU-3527
>                 URL: https://issues.apache.org/jira/browse/KUDU-3527
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Zoltan Martonka
>            Assignee: Zoltan Martonka
>            Priority: Major
>
> BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where fs_block_size=64k.
> *Cause:*
> Currently, tablets fail to load if a metadata file is missing but there is still a non-empty ".data" file. If FLAGS_env_inject_eio is not zero, then there is a chance that, when we delete the container file, we only delete the ".meta" file but leave the ".data" file.
> In the current test, deletion never occurs on systems with fs_block_size=4k. Changing to kNumAppends=64 will cause the test to randomly fail on x86 systems too, although only with a 2-3% chance (at least on my ubuntu20 machine).
> *Solution:*
> This test was not intended to test the file deletion itself (as it does not happen on x86_64 or on 4k ARM kernels). The deletion only occurs because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough". We should just set _FLAGS_log_block_manager_delete_dead_container = false;_ to restore the original scope of the test.
> There is a separate issue for the root cause (which is not ARM-specific at all):
> https://issues.apache.org/jira/browse/KUDU-3528

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
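For illustration, the proposed change amounts to a one-line flag override in the test. The sketch below is not the actual Kudu test code: the test name, the stand-in flag definitions (including their types and default values), and the EIO injection probability are assumptions made to keep the example self-contained; only the flag names and the 256 * 1024 container size come from the description above.

{code:java}
// C++ sketch of the fix described in the ticket (assumed names marked below).
#include <gflags/gflags.h>
#include <gtest/gtest.h>

// Stand-ins for the Kudu gflags mentioned in the ticket; in the real test
// these are defined elsewhere in the Kudu tree and only DECLAREd by the test.
DEFINE_uint64(log_container_max_size, 10ULL * 1024 * 1024, "stand-in flag");
DEFINE_double(env_inject_eio, 0.0, "stand-in flag");
DEFINE_bool(log_block_manager_delete_dead_container, true, "stand-in flag");

TEST(MetadataOkayDespiteFailureSketch, RestoreOriginalScope) {
  // Small containers plus injected EIO: on systems with fs_block_size=64k a
  // container can be deleted with only its ".meta" file removed, leaving a
  // non-empty ".data" file behind and failing the subsequent reload.
  FLAGS_log_container_max_size = 256 * 1024;
  FLAGS_env_inject_eio = 0.1;  // injection probability is arbitrary for the sketch

  // Proposed fix: keep dead containers instead of deleting them, so the test
  // again exercises only metadata recovery, not container deletion.
  FLAGS_log_block_manager_delete_dead_container = false;

  // ... the original block-manager write / inject-EIO / reopen exercise would follow ...
}
{code}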
[jira] [Created] (KUDU-3534) Corrupt timestamps crash the server
Abhishek Chennaka created KUDU-3534:
---------------------------------------

             Summary: Corrupt timestamps crash the server
                 Key: KUDU-3534
                 URL: https://issues.apache.org/jira/browse/KUDU-3534
             Project: Kudu
          Issue Type: Improvement
            Reporter: Abhishek Chennaka


Cam across a situation where the tablet server was crashing with the below log messages:

I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 0)

The reason behind this is that there were two separate delete ops with the exact same hybrid timestamp, which ideally should not be possible. This was noticed across multiple replicas on the same server, so most likely it is a server-specific issue (probably disk related), while the same replicas on other servers did not show this issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (KUDU-3534) Corrupt timestamps crash the server
[ https://issues.apache.org/jira/browse/KUDU-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Chennaka updated KUDU-3534:
------------------------------------
    Description: 
Cam across a situation where the tablet server was crashing with the below log messages:
{code:java}
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 0) {code}
The reason behind this is that there were two separate delete ops with the exact same hybrid timestamp, which ideally should not be possible. This was noticed across multiple replicas on the same server, so most likely it is a server-specific issue (probably disk related), while the same replicas on other servers did not show this issue.

  was:
Cam across a situation where the tablet server was crashing with the below log messages:

I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 0)

The reason behind this is that there were two separate delete ops with the exact same hybrid timestamp, which ideally should not be possible. This was noticed across multiple replicas on the same server, so most likely it is a server-specific issue (probably disk related), while the same replicas on other servers did not show this issue.

> Corrupt timestamps crash the server
> -----------------------------------
>
>                 Key: KUDU-3534
>                 URL: https://issues.apache.org/jira/browse/KUDU-3534
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Abhishek Chennaka
>            Priority: Minor
>
> Cam across a situation where the tablet server was crashing with the below log messages:
> {code:java}
> I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
> ..
> F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 0) {code}
> The reason behind this is that there were two separate delete ops with the exact same hybrid timestamp, which ideally should not be possible. This was noticed across multiple replicas on the same server, so most likely it is a server-specific issue (probably disk related), while the same replicas on other servers did not show this issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (KUDU-3534) Corrupt timestamps crash the server
[ https://issues.apache.org/jira/browse/KUDU-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhishek Chennaka updated KUDU-3534:
------------------------------------
    Description: 
Came across a situation where the tablet server was crashing with the below log messages:
{code:java}
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 0) {code}
The reason behind this is that there were two separate delete ops with the exact same hybrid timestamp, which ideally should not be possible. This was noticed across multiple replicas on the same server, so most likely it is a server-specific issue (probably disk related), while the same replicas on other servers did not show this issue.

  was:
Cam across a situation where the tablet server was crashing with the below log messages:
{code:java}
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 0) {code}
The reason behind this is that there were two separate delete ops with the exact same hybrid timestamp, which ideally should not be possible. This was noticed across multiple replicas on the same server, so most likely it is a server-specific issue (probably disk related), while the same replicas on other servers did not show this issue.

> Corrupt timestamps crash the server
> -----------------------------------
>
>                 Key: KUDU-3534
>                 URL: https://issues.apache.org/jira/browse/KUDU-3534
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Abhishek Chennaka
>            Priority: Minor
>
> Came across a situation where the tablet server was crashing with the below log messages:
> {code:java}
> I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
> ..
> F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 0) {code}
> The reason behind this is that there were two separate delete ops with the exact same hybrid timestamp, which ideally should not be possible. This was noticed across multiple replicas on the same server, so most likely it is a server-specific issue (probably disk related), while the same replicas on other servers did not show this issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
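To make the failing check concrete, below is a simplified, self-contained illustration of the invariant behind "Check failed: 0 != ret (0 vs. 0)": during compaction, a three-way comparison of two distinct delete ops for the same row should never return 0, because two distinct ops must not carry the same hybrid timestamp. This is not Kudu's actual compaction.cc code; the struct, function names, and timestamp value are made up for the example.

{code:java}
// C++ illustration of the compaction-time invariant (hypothetical names).
#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct DeleteOp {
  uint64_t hybrid_timestamp;  // hybrid-time value assigned to the op
};

// Three-way comparison of two delete ops by timestamp.
int CompareOps(const DeleteOp& a, const DeleteOp& b) {
  if (a.hybrid_timestamp < b.hybrid_timestamp) return -1;
  if (a.hybrid_timestamp > b.hybrid_timestamp) return 1;
  return 0;
}

// Mirrors the spirit of the CHECK: two distinct delete ops on the same row
// with identical timestamps indicate corrupt data, so crash loudly.
void CheckDistinctTimestamps(const DeleteOp& a, const DeleteOp& b) {
  int ret = CompareOps(a, b);
  if (ret == 0) {
    std::fprintf(stderr, "Check failed: 0 != ret (0 vs. 0)\n");
    std::abort();  // analogous to the fatal log line in the report
  }
}

int main() {
  DeleteOp first{12345};
  DeleteOp corrupt_duplicate{12345};  // corrupt on-disk data: same timestamp
  CheckDistinctTimestamps(first, corrupt_duplicate);  // aborts, as observed
  return 0;
}
{code}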
[jira] [Commented] (KUDU-3524) The TestScannerKeepAlivePeriodicallyCrossServers scenario fails with SIGABRT
[ https://issues.apache.org/jira/browse/KUDU-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797023#comment-17797023 ]

ASF subversion and git services commented on KUDU-3524:
--------------------------------------------------------

Commit 6bcda5eff94ea7c7f96c38d67ade3f83111e6743 in kudu's branch refs/heads/master from xinghuayu007
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=6bcda5eff ]

KUDU-3524 Fix crash when sending periodic keep-alive requests

Currently, Kudu client applications on macOS crash upon calling
StartKeepAlivePeriodically(), see KUDU-3524 for details. That's because
a PeriodicTimer was used to send keep-alive requests in a synchronous
manner, while attempting to wait for the response on a reactor thread.
However, reactor threads do not allow for waiting.

This patch uses 'ScannerKeepAliveAsync()', an asynchronous interface to
send keep-alive requests, to avoid this problem.

Change-Id: I130db970a091cdf7689245a79dc4ea445d1f739f
Reviewed-on: http://gerrit.cloudera.org:8080/20739
Tested-by: Alexey Serbin
Reviewed-by: Alexey Serbin


> The TestScannerKeepAlivePeriodicallyCrossServers scenario fails with SIGABRT
> ----------------------------------------------------------------------------
>
>                 Key: KUDU-3524
>                 URL: https://issues.apache.org/jira/browse/KUDU-3524
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Alexey Serbin
>            Priority: Major
>
> Running the newly added test scenario {{TestScannerKeepAlivePeriodicallyCrossServers}} fails with SIGABRT when run as the following on macOS (but I guess it's not macOS-specific) in DEBUG build:
> {noformat}
> ./bin/client-test --stress_cpu_threads=32 --gtest_filter='*TestScannerKeepAlivePeriodicallyCrossServers*'
> {noformat}
> The error message and the stacktrace are below:
> {noformat}
> F20231113 12:21:13.431455 41195482 thread_restrictions.cc:79] Check failed: LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to prevent server-wide latency aberrations and deadlocks. Thread 41195482 (name: "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> Process 77090 stopped
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
>     frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> libsystem_kernel.dylib`__pthread_kill:
> ->  0x7fff205b890e <+10>: jae    0x7fff205b8918            ; <+20>
>     0x7fff205b8910 <+12>: movq   %rax, %rdi
>     0x7fff205b8913 <+15>: jmp    0x7fff205b2ab9            ; cerror_nocancel
>     0x7fff205b8918 <+20>: retq
> Target 0: (client-test) stopped.
> (lldb) bt
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
>   * frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
>     frame #1: 0x7fff205e75bd libsystem_pthread.dylib`pthread_kill + 263
>     frame #2: 0x7fff2053c406 libsystem_c.dylib`abort + 125
>     frame #3: 0x00010f64ebd8 libglog.1.dylib`google::LogMessage::SendToLog() [inlined] google::LogMessage::Fail() at logging.cc:1946:3 [opt]
>     frame #4: 0x00010f64ebd2 libglog.1.dylib`google::LogMessage::SendToLog(this=0x70001a95e108) at logging.cc:1920:5 [opt]
>     frame #5: 0x00010f64f47a libglog.1.dylib`google::LogMessage::Flush(this=0x70001a95e108) at logging.cc:1777:5 [opt]
>     frame #6: 0x00010f65428f libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=0x70001a95e108) at logging.cc:2557:5 [opt]
>     frame #7: 0x00010f650349 libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=) at logging.cc:2556:37 [opt]
>     frame #8: 0x00010e545473 libkudu_util.dylib`kudu::ThreadRestrictions::AssertWaitAllowed() at thread_restrictions.cc:79:3
>     frame #9: 0x00010013ebb9 client-test`kudu::CountDownLatch::Wait(this=0x70001a95e2a0) const at countdown_latch.h:74:5
>     frame #10: 0x00010a1749f5 libkrpc.dylib`kudu::Notification::WaitForNotification(this=0x70001a95e2a0) const at notification.h:127:12
>     frame #11: 0x00010a1748e9 libkrpc.dylib`kudu::rpc::Proxy::SyncRequest(this=0x00011317e9b8, method="ScannerKeepAlive", req=0x70001a95e428, resp=0x70001a95e408, controller=0x70001a95e458) at proxy.cc:259:8
>     frame #12: 0x00010697220f libtserver_service_proto.dylib`kudu::tserver::TabletServerServiceProxy::ScannerKeepAlive(this=0x00011317e9b8, req=0x70001a95e428, resp=0x70001a95e408, controller=0x70001a95e458) at tserver_service.proxy.cc:98:10
>     frame #13: 0x00010525c5b6 libkudu_client.dylib`kudu::client::KuduScanner::Data::KeepAlive(this=0x00011290c700) at scanner-internal.cc:664:3
>     frame #14: 0x000105269e76 libkudu_client.dylib`kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(this=0x000112899858)::$_0::operator(
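The stack trace above shows the blocking path: Proxy::SyncRequest waits on a Notification from inside a reactor-thread timer callback, which trips the ThreadRestrictions check. As a rough illustration of the pattern the commit moves to, here is a generic, self-contained sketch; the function and type names are hypothetical stand-ins, not the Kudu client API.

{code:java}
// C++ sketch of firing a keep-alive asynchronously from a timer callback.
#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-in for an asynchronous keep-alive RPC: it returns
// immediately and invokes the callback once the response arrives.
using RpcCallback = std::function<void(bool ok)>;
void ScannerKeepAliveAsyncStub(const std::string& scanner_id, RpcCallback cb) {
  std::cout << "keep-alive request sent for scanner " << scanner_id << "\n";
  cb(true);  // a real RPC layer would invoke this later, on completion
}

// What a periodic-timer callback running on a reactor thread should do:
// issue the request asynchronously and return right away. Blocking here
// (e.g. waiting on a Notification, as in frame #10 above) aborts the process.
void OnKeepAliveTimerFired(const std::string& scanner_id) {
  ScannerKeepAliveAsyncStub(scanner_id, [](bool ok) {
    if (!ok) {
      std::cerr << "keep-alive failed; the scanner may expire on the server\n";
    }
  });
}

int main() {
  OnKeepAliveTimerFired("scanner-0001");  // illustrative scanner id
  return 0;
}
{code}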