[jira] [Updated] (KUDU-3527) Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton

2023-12-14 Thread Zoltan Martonka (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Martonka updated KUDU-3527:
--
Summary: Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel  8.8 
graviton  (was: Fix kudu issues on reel 8.8 graviton)

> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel  8.8 graviton
> -
>
> Key: KUDU-3527
> URL: https://issues.apache.org/jira/browse/KUDU-3527
> Project: Kudu
>  Issue Type: Bug
>Reporter: Zoltan Martonka
>Assignee: Zoltan Martonka
>Priority: Major
>
> Tests failing in debug build:
> client_examples-test 
> client-test   
> predicate-test   
> columnar_serialization-test 
> wire_protocol-test 
> block_manager-test 
> log_block_manager-test 
> alter_table-test 
> auth_token_expire-itest 
> consistency-itest 
> flex_partitioning-itest   
> linked_list-test 
> maintenance_mode-itest   
> master_replication-itest 
> master-stress-test   
> raft_consensus-itest   
> security-unknown-tsk-itest 
> stop_tablet-itest 
> tablet_history_gc-itest 
> tablet_server_quiescing-itest 
> ts_authz-itest   
> webserver-stress-itest 
> dynamic_multi_master-test   
> rpc-test   
> kudu-tool-test   
> rebalancer_tool-test   
> tablet_server-test   
> bitmap-test 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3527) Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton

2023-12-14 Thread Zoltan Martonka (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Martonka updated KUDU-3527:
--
Description: 
BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where 
fs_block_size=64k.

*Cause:*
Currently, a tablet fails to load if a metadata file is missing while a 
non-empty ".data" file is still present. If FLAGS_env_inject_eio is non-zero, 
there is a chance that, when we delete a container file, we delete only the 
".meta" file but leave the ".data" file behind.

In the current test, deletion never occurs on systems with fs_block_size=4k. 
Changing kNumAppends to 64 makes the test fail randomly on x86 systems too, 
although only with a 2-3% chance (at least on my Ubuntu 20 machine).

*Solution:*
This test was not intended to exercise container deletion itself (deletion does 
not happen on x86_64 or on ARM kernels with 4k pages). Deletion only occurs 
because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough".

_We should just set_ 

  was:
Tests failing in debug build:
client_examples-test 
client-test   
predicate-test   
columnar_serialization-test 
wire_protocol-test 
block_manager-test 
log_block_manager-test 
alter_table-test 
auth_token_expire-itest 
consistency-itest 
flex_partitioning-itest   
linked_list-test 
maintenance_mode-itest   
master_replication-itest 
master-stress-test   
raft_consensus-itest   
security-unknown-tsk-itest 
stop_tablet-itest 
tablet_history_gc-itest 
tablet_server_quiescing-itest 
ts_authz-itest   
webserver-stress-itest 
dynamic_multi_master-test   
rpc-test   
kudu-tool-test   
rebalancer_tool-test   
tablet_server-test   
bitmap-test 


> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel  8.8 graviton
> -
>
> Key: KUDU-3527
> URL: https://issues.apache.org/jira/browse/KUDU-3527
> Project: Kudu
>  Issue Type: Bug
>Reporter: Zoltan Martonka
>Assignee: Zoltan Martonka
>Priority: Major
>
> BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where 
> fs_block_size=64k.
> *Cause:*
> Currently, a tablet fails to load if a metadata file is missing while a 
> non-empty ".data" file is still present. If FLAGS_env_inject_eio is non-zero, 
> there is a chance that, when we delete a container file, we delete only the 
> ".meta" file but leave the ".data" file behind.
> In the current test, deletion never occurs on systems with fs_block_size=4k. 
> Changing kNumAppends to 64 makes the test fail randomly on x86 systems too, 
> although only with a 2-3% chance (at least on my Ubuntu 20 machine).
> *Solution:*
> This test was not intended to exercise container deletion itself (deletion 
> does not happen on x86_64 or on ARM kernels with 4k pages). Deletion only 
> occurs because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large 
> enough".
> _We should just set_ 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3527) Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel 8.8 graviton

2023-12-14 Thread Zoltan Martonka (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Martonka updated KUDU-3527:
--
Description: 
BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where 
fs_block_size=64k.

*Cause:*
Currently, a tablet fails to load if a metadata file is missing while a 
non-empty ".data" file is still present. If FLAGS_env_inject_eio is non-zero, 
there is a chance that, when we delete a container file, we delete only the 
".meta" file but leave the ".data" file behind.

In the current test, deletion never occurs on systems with fs_block_size=4k. 
Changing kNumAppends to 64 makes the test fail randomly on x86 systems too, 
although only with a 2-3% chance (at least on my Ubuntu 20 machine).

*Solution:*
This test was not intended to exercise container deletion itself (deletion does 
not happen on x86_64 or on ARM kernels with 4k pages). Deletion only occurs 
because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough".

We should just set FLAGS_log_block_manager_delete_dead_container = false; to 
restore the original scope of the test, as sketched below.

There is a separate issue for the root cause (which is not ARM-specific at all):
https://issues.apache.org/jira/browse/KUDU-3528
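
Below is a minimal sketch (not the actual patch) of what that flag override 
could look like in the test. The fixture and test names are hypothetical, and 
the flag is re-defined locally only so the snippet compiles standalone; in Kudu 
itself the flag is defined by the log block manager.
{code:cpp}
// Sketch only: pin the flag so the block manager never deletes a dead
// container during the test, keeping the test's original scope independent of
// fs_block_size. In Kudu the flag is defined by the log block manager; it is
// re-defined here just so this snippet stands alone.
#include <gflags/gflags.h>
#include <gtest/gtest.h>

DEFINE_bool(log_block_manager_delete_dead_container, true,
            "Stand-in definition for the real block-manager flag.");

class BlockManagerTestSketch : public ::testing::Test {  // hypothetical fixture
 protected:
  void SetUp() override {
    // With dead-container deletion disabled, an injected EIO can no longer
    // remove the ".meta" file while leaving a non-empty ".data" file behind,
    // so the 64k-page failure mode described above cannot trigger.
    FLAGS_log_block_manager_delete_dead_container = false;
  }
};

TEST_F(BlockManagerTestSketch, MetadataOkayDespiteFailureScope) {
  // The real test body (appends plus EIO injection) would go here; the point
  // of this sketch is only the flag override in SetUp().
  EXPECT_FALSE(FLAGS_log_block_manager_delete_dead_container);
}
{code}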

  was:
BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where 
fs_block_size=64k.

*Cause:*
Currently, a tablet fails to load if a metadata file is missing while a 
non-empty ".data" file is still present. If FLAGS_env_inject_eio is non-zero, 
there is a chance that, when we delete a container file, we delete only the 
".meta" file but leave the ".data" file behind.

In the current test, deletion never occurs on systems with fs_block_size=4k. 
Changing kNumAppends to 64 makes the test fail randomly on x86 systems too, 
although only with a 2-3% chance (at least on my Ubuntu 20 machine).

*Solution:*
This test was not intended to exercise container deletion itself (deletion does 
not happen on x86_64 or on ARM kernels with 4k pages). Deletion only occurs 
because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large enough".

_We should just set_ 


> Fix BlockManagerTest.TestMetadataOkayDespiteFailure on rhel  8.8 graviton
> -
>
> Key: KUDU-3527
> URL: https://issues.apache.org/jira/browse/KUDU-3527
> Project: Kudu
>  Issue Type: Bug
>Reporter: Zoltan Martonka
>Assignee: Zoltan Martonka
>Priority: Major
>
> BlockManagerTest.TestMetadataOkayDespiteFailure might fail on systems where 
> fs_block_size=64k.
> *Cause:*
> Currently, a tablet fails to load if a metadata file is missing while a 
> non-empty ".data" file is still present. If FLAGS_env_inject_eio is non-zero, 
> there is a chance that, when we delete a container file, we delete only the 
> ".meta" file but leave the ".data" file behind.
> In the current test, deletion never occurs on systems with fs_block_size=4k. 
> Changing kNumAppends to 64 makes the test fail randomly on x86 systems too, 
> although only with a 2-3% chance (at least on my Ubuntu 20 machine).
> *Solution:*
> This test was not intended to exercise container deletion itself (deletion 
> does not happen on x86_64 or on ARM kernels with 4k pages). Deletion only 
> occurs because _FLAGS_log_container_max_size = 256 * 1024;_ is not "large 
> enough".
> We should just set FLAGS_log_block_manager_delete_dead_container = false; to 
> restore the original scope of the test.
> There is a separate issue for the root cause (which is not ARM-specific at 
> all):
> https://issues.apache.org/jira/browse/KUDU-3528



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3534) Corrupt timestamps crash the server

2023-12-14 Thread Abhishek Chennaka (Jira)
Abhishek Chennaka created KUDU-3534:
---

 Summary: Corrupt timestamps crash the server
 Key: KUDU-3534
 URL: https://issues.apache.org/jira/browse/KUDU-3534
 Project: Kudu
  Issue Type: Improvement
Reporter: Abhishek Chennaka


Came across a situation where the tablet server was crashing with the below log 
messages:
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 
035c5ff8ec2f4f71878f96adb9632c3c: Scheduling 
CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 
0) 
The reason is that there were two separate delete ops with the exact same 
hybrid timestamp, which should not normally be possible. This was noticed 
across multiple replicas on the same server, so it is most likely a 
server-specific issue (probably disk related); the same replicas on other 
servers did not show the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3534) Corrupt timestamps crash the server

2023-12-14 Thread Abhishek Chennaka (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Chennaka updated KUDU-3534:

Description: 
Came across a situation where the tablet server was crashing with the below log 
messages:
{code:java}
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 
035c5ff8ec2f4f71878f96adb9632c3c: Scheduling 
CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 
0) {code}
The reason is that there were two separate delete ops with the exact same 
hybrid timestamp, which should not normally be possible. This was noticed 
across multiple replicas on the same server, so it is most likely a 
server-specific issue (probably disk related); the same replicas on other 
servers did not show the issue.

  was:
Came across a situation where the tablet server was crashing with the below log 
messages:
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 
035c5ff8ec2f4f71878f96adb9632c3c: Scheduling 
CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 
0) 
The reason is that there were two separate delete ops with the exact same 
hybrid timestamp, which should not normally be possible. This was noticed 
across multiple replicas on the same server, so it is most likely a 
server-specific issue (probably disk related); the same replicas on other 
servers did not show the issue.


> Corrupt timestamps crash the server
> ---
>
> Key: KUDU-3534
> URL: https://issues.apache.org/jira/browse/KUDU-3534
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Abhishek Chennaka
>Priority: Minor
>
> Came across a situation where the tablet server was crashing with the below 
> log messages:
> {code:java}
> I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 
> 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling 
> CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
> ..
> F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 
> 0) {code}
> The reason is that there were two separate delete ops with the exact same 
> hybrid timestamp, which should not normally be possible. This was noticed 
> across multiple replicas on the same server, so it is most likely a 
> server-specific issue (probably disk related); the same replicas on other 
> servers did not show the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3534) Corrupt timestamps crash the server

2023-12-14 Thread Abhishek Chennaka (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Chennaka updated KUDU-3534:

Description: 
Came across a situation where the tablet server was crashing with the below log 
messages:
{code:java}
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 
035c5ff8ec2f4f71878f96adb9632c3c: Scheduling 
CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 
0) {code}
The reason is that there were two separate delete ops with the exact same 
hybrid timestamp, which should not normally be possible. This was noticed 
across multiple replicas on the same server, so it is most likely a 
server-specific issue (probably disk related); the same replicas on other 
servers did not show the issue.
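
For illustration only, here is a self-contained sketch of the invariant that 
trips in compaction. It is not Kudu's compaction code: the DeleteOp struct and 
the comparator are stand-ins, used only to show how two delete ops carrying the 
identical hybrid timestamp produce a comparison result of 0 and abort with the 
same "Check failed: 0 != ret (0 vs. 0)" style of message.
{code:cpp}
// Hypothetical illustration of the failing invariant; types and comparator are
// stand-ins, not Kudu's actual compaction structures.
#include <cstdint>

#include <glog/logging.h>

struct DeleteOp {
  uint64_t hybrid_timestamp;  // hybrid (physical+logical) time of the mutation
};

// Returns <0, 0, or >0; 0 means the two ops are indistinguishable in time.
int CompareOps(const DeleteOp& a, const DeleteOp& b) {
  if (a.hybrid_timestamp < b.hybrid_timestamp) return -1;
  if (a.hybrid_timestamp > b.hybrid_timestamp) return 1;
  return 0;
}

int main() {
  // Two distinct delete ops that, due to the suspected corruption, carry the
  // exact same hybrid timestamp.
  DeleteOp first{0x18c2a9f3e4b1000ULL};
  DeleteOp second{0x18c2a9f3e4b1000ULL};

  const int ret = CompareOps(first, second);
  // Compaction assumes a strict ordering of ops on the same row, so a result
  // of 0 aborts the process with a FATAL check failure, as in the log above.
  CHECK_NE(0, ret) << "two delete ops share a hybrid timestamp";
  return 0;
}
{code}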

  was:
Came across a situation where the tablet server was crashing with the below log 
messages:
{code:java}
I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 
035c5ff8ec2f4f71878f96adb9632c3c: Scheduling 
CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
..
F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 
0) {code}
The reason is that there were two separate delete ops with the exact same 
hybrid timestamp, which should not normally be possible. This was noticed 
across multiple replicas on the same server, so it is most likely a 
server-specific issue (probably disk related); the same replicas on other 
servers did not show the issue.


> Corrupt timestamps crash the server
> ---
>
> Key: KUDU-3534
> URL: https://issues.apache.org/jira/browse/KUDU-3534
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Abhishek Chennaka
>Priority: Minor
>
> Came across a situation where the tablet server was crashing with the below 
> log messages:
> {code:java}
> I1204 03:42:13.302340 124627 maintenance_manager.cc:382] P 
> 035c5ff8ec2f4f71878f96adb9632c3c: Scheduling 
> CompactRowSetsOp(886eddb2ccca466995e400c62c1b1197): perf score=0.561641
> ..
> F1204 03:42:20.046682 124484 compaction.cc:465] Check failed: 0 != ret (0 vs. 
> 0) {code}
> The reason is that there were two separate delete ops with the exact same 
> hybrid timestamp, which should not normally be possible. This was noticed 
> across multiple replicas on the same server, so it is most likely a 
> server-specific issue (probably disk related); the same replicas on other 
> servers did not show the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3524) The TestScannerKeepAlivePeriodicallyCrossServers scenario fails with SIGABRT

2023-12-14 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17797023#comment-17797023
 ] 

ASF subversion and git services commented on KUDU-3524:
---

Commit 6bcda5eff94ea7c7f96c38d67ade3f83111e6743 in kudu's branch 
refs/heads/master from xinghuayu007
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=6bcda5eff ]

KUDU-3524 Fix crash when sending periodic keep-alive requests

Currently, Kudu client applications on macOS crash upon calling
StartKeepAlivePeriodically(), see KUDU-3524 for details. That's
because a PeriodicTimer was used to send keep-alive requests in
a synchronous manner, while attempting to wait for the response
on a reactor thread. However, reactor threads do not allow for
waiting.

This patch uses 'ScannerKeepAliveAysnc()', an asynchronous
interface to send keep-alive requests to avoid this problem.

Change-Id: I130db970a091cdf7689245a79dc4ea445d1f739f
Reviewed-on: http://gerrit.cloudera.org:8080/20739
Tested-by: Alexey Serbin 
Reviewed-by: Alexey Serbin 
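
For context, here is a self-contained sketch of the pattern the commit 
describes: start the keep-alive RPC asynchronously and handle the response in a 
callback instead of blocking on a reactor thread. The structs, function names, 
and the detached thread standing in for the reactor are hypothetical, not 
Kudu's actual client API.
{code:cpp}
// Hypothetical sketch of "don't wait on the reactor thread"; not Kudu code.
#include <chrono>
#include <functional>
#include <iostream>
#include <string>
#include <thread>

struct KeepAliveRequest { std::string scanner_id; };
struct KeepAliveResponse { bool ok = false; };
using ResponseCallback = std::function<void(const KeepAliveResponse&)>;

// Asynchronous variant: returns immediately and delivers the result to 'cb'
// later. A detached thread stands in for the RPC reactor delivering the reply.
void ScannerKeepAliveAsyncSketch(const KeepAliveRequest& req,
                                 ResponseCallback cb) {
  std::thread([req, cb = std::move(cb)] {
    cb(KeepAliveResponse{true});  // pretend the round-trip succeeded
  }).detach();
}

// What the periodic-timer callback should do: only *start* the RPC and return.
// Blocking here (the synchronous variant) is what tripped the wait-not-allowed
// CHECK on the reactor thread in the stack trace below.
void OnKeepAliveTimerFires(const std::string& scanner_id) {
  ScannerKeepAliveAsyncSketch({scanner_id},
                              [scanner_id](const KeepAliveResponse& r) {
    if (!r.ok) {
      std::cerr << "keep-alive for scanner " << scanner_id << " failed\n";
    }
  });
}

int main() {
  OnKeepAliveTimerFires("886eddb2ccca466995e400c62c1b1197");
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // let it finish
  return 0;
}
{code}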


> The TestScannerKeepAlivePeriodicallyCrossServers scenario fails with SIGABRT
> 
>
> Key: KUDU-3524
> URL: https://issues.apache.org/jira/browse/KUDU-3524
> Project: Kudu
>  Issue Type: Bug
>Reporter: Alexey Serbin
>Priority: Major
>
> Running the newly added test scenario 
> {{TestScannerKeepAlivePeriodicallyCrossServers}} fails with SIGABRT when run 
> as follows on macOS (but I guess it's not macOS-specific) in a DEBUG build:
> {noformat}
> ./bin/client-test --stress_cpu_threads=32 
> --gtest_filter='*TestScannerKeepAlivePeriodicallyCrossServers*'
> {noformat}
> The error message and the stack trace are below:
> {noformat}
> F20231113 12:21:13.431455 41195482 thread_restrictions.cc:79] Check failed: 
> LoadTLS()->wait_allowed Waiting is not allowed to be used on this thread to 
> prevent server-wide latency aberrations and deadlocks. Thread 41195482 (name: 
> "rpc reactor", category: "reactor")
> *** Check failure stack trace: ***
> Process 77090 stopped
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
> frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> libsystem_kernel.dylib`__pthread_kill:
> ->  0x7fff205b890e <+10>: jae    0x7fff205b8918            ; <+20>
> 0x7fff205b8910 <+12>: movq   %rax, %rdi
> 0x7fff205b8913 <+15>: jmp    0x7fff205b2ab9            ; cerror_nocancel
> 0x7fff205b8918 <+20>: retq   
> Target 0: (client-test) stopped.
> (lldb) bt
> * thread #335, name = 'rpc reactor-41195482', stop reason = signal SIGABRT
>   * frame #0: 0x7fff205b890e libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x7fff205e75bd libsystem_pthread.dylib`pthread_kill + 263
> frame #2: 0x7fff2053c406 libsystem_c.dylib`abort + 125
> frame #3: 0x00010f64ebd8 
> libglog.1.dylib`google::LogMessage::SendToLog() [inlined] 
> google::LogMessage::Fail() at logging.cc:1946:3 [opt]
> frame #4: 0x00010f64ebd2 
> libglog.1.dylib`google::LogMessage::SendToLog(this=0x70001a95e108) at 
> logging.cc:1920:5 [opt]
> frame #5: 0x00010f64f47a 
> libglog.1.dylib`google::LogMessage::Flush(this=0x70001a95e108) at 
> logging.cc:1777:5 [opt]
> frame #6: 0x00010f65428f 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=0x70001a95e108)
>  at logging.cc:2557:5 [opt]
> frame #7: 0x00010f650349 
> libglog.1.dylib`google::LogMessageFatal::~LogMessageFatal(this=) 
> at logging.cc:2556:37 [opt]
> frame #8: 0x00010e545473 
> libkudu_util.dylib`kudu::ThreadRestrictions::AssertWaitAllowed() at 
> thread_restrictions.cc:79:3
> frame #9: 0x00010013ebb9 
> client-test`kudu::CountDownLatch::Wait(this=0x70001a95e2a0) const at 
> countdown_latch.h:74:5
> frame #10: 0x00010a1749f5 
> libkrpc.dylib`kudu::Notification::WaitForNotification(this=0x70001a95e2a0)
>  const at notification.h:127:12
> frame #11: 0x00010a1748e9 
> libkrpc.dylib`kudu::rpc::Proxy::SyncRequest(this=0x00011317e9b8, 
> method="ScannerKeepAlive", req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at proxy.cc:259:8
> frame #12: 0x00010697220f 
> libtserver_service_proto.dylib`kudu::tserver::TabletServerServiceProxy::ScannerKeepAlive(this=0x00011317e9b8,
>  req=0x70001a95e428, resp=0x70001a95e408, 
> controller=0x70001a95e458) at tserver_service.proxy.cc:98:10
> frame #13: 0x00010525c5b6 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::KeepAlive(this=0x00011290c700)
>  at scanner-internal.cc:664:3
> frame #14: 0x000105269e76 
> libkudu_client.dylib`kudu::client::KuduScanner::Data::StartKeepAlivePeriodically(this=0x000112899858)::$_0::operator(