[jira] [Commented] (KUDU-3349) Kudu java client failed to demote leader and caused a lot of deleting rows timeout

2024-07-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868127#comment-17868127
 ] 

ASF subversion and git services commented on KUDU-3349:
---

Commit e44e0d4892b0e2469a18aefb78062f5aa2e1799c in kudu's branch 
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=e44e0d489 ]

[client] add ScanTokenStaleRaftMembershipTest

This patch adds a new test scenario TabletLeaderChange into the newly
added ScanTokenStaleRaftMembershipTest fixture.  The motivation for this
patch was a request to clarify the Kudu C++ client's behavior in
particular scenarios, which itself came up in the context of a follow-up
to KUDU-3349.

Change-Id: I6ce3d549d4ab2502c58deae1250b49ba16bbc914
Reviewed-on: http://gerrit.cloudera.org:8080/21580
Reviewed-by: Ashwani Raina 
Reviewed-by: Abhishek Chennaka 
Tested-by: Alexey Serbin 


> Kudu java client failed to demote leader and caused a lot of deleting rows 
> timeout
> --
>
> Key: KUDU-3349
> URL: https://issues.apache.org/jira/browse/KUDU-3349
> Project: Kudu
> Issue Type: Bug
> Reporter: Redriver
> Priority: Major
> Fix For: 1.16.0
>
>
> While deleting rows through Spark, I hit a lot of PendingErrors with 
> timeouts; the deletion takes a very long time and sometimes eventually 
> fails.
> {code:java}
> java.lang.RuntimeException: PendingErrors overflowed. Failed to write at 
> least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before 
> timeout: Batch{operations=100, tablet="b037e2e266b44c4c95da4065a8d5b719" 
> [0x0006, 0x0007), ignoredErrors=[], rpc=KuduRpc(method=Write, 
> tablet=b037e2e266b44c4c95da4065a8d5b719, attempt=23, 
> TimeoutTracker(timeout=3, elapsed=26852), Traces: [0ms] sending RPC to 
> server  contents="c51a7275257240b7a8c7e99d0895ae89">, [2ms] delaying RPC due to: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [2ms] received response from server 
> : 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [10ms] sending RPC to server 
> , 
> [12ms] delaying RPC due to: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [12ms] received response from server  size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [30ms] sending RPC to server  contents="c51a7275257240b7a8c7e99d0895ae89">, [32ms] delaying RPC due to: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [32ms] received response from server 
> : 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [51ms] sending RPC to server 
> , 
> [52ms] delaying RPC due to: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [52ms] received response from server  size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [70ms] sending RPC to server  contents="c51a7275257240b7a8c7e99d0895ae89">, [72ms] delaying RPC due to: 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [72ms] received response from server 
> : 
> Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this 
> config: current role FOLLOWER (error 0), [90ms] sending RPC to server 
> , 
> [92ms] delaying RPC due to: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [92ms] received response from server  size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [170ms] sending RPC to server  size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [172ms] delaying RPC 
> due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader 
> of this config: current role FOLLOWER (error 0), [172ms] received response 
> from server  contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica 
> c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role 
> FOLLOWER (error 0), [190ms] sending RPC to server  size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [192ms] delaying RPC 
> due to: Illegal st

[jira] [Assigned] (KUDU-3590) update certs in test_certs.cc

2024-07-23 Thread Attila Bukor (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Bukor reassigned KUDU-3590:
--

Assignee: Attila Bukor  (was: Bakai Ádám)

> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3592) Memory spike in tablets with huge number of updates

2024-07-23 Thread Abhishek Chennaka (Jira)

Abhishek Chennaka created KUDU-3592:
---

 Summary: Memory spike in tablets with huge number of updates
 Key: KUDU-3592
 URL: https://issues.apache.org/jira/browse/KUDU-3592
 Project: Kudu
  Issue Type: Improvement
Reporter: Abhishek Chennaka
 Attachments: Screen Shot 2024-07-18 at 12.19.25 PM.png, metrics.txt, 
metrics_144.txt, metrics_147.txt

We came across a scenario where a tablet with 868 rows received about 10-20 
million upserts (as updates) per minute, which caused strange behavior (with 
the ancient mark set to 15 minutes):

1. Any minor/major compaction on any of those tablet replicas led to a memory 
spike that used up all the memory on the server, and the process was 
eventually killed by the OS with an OOM message.
{code:java}
I20240715 12:35:40.534596 851004 maintenance_manager.cc:392] P 
d247fdcc1e4a45f8a01b8155960280a6: Scheduling 
MinorDeltaCompactionOp(f99cddc1e2444bacbcdd117b0b377a02): perf 
score=0.023000{code}
As soon as the delta compactions started, we saw a spike in memory usage of the 
tablet server process. The usage kept rising until the process was killed:
{code:java}
W20240715 12:41:10.909035 850831 tablet_service.cc:1608] Rejecting Write 
request: Soft memory limit exceeded (at 310.52% of capacity) [suppressed 9 
similar messages]
W20240715 12:41:11.816135 850916 raft_consensus.cc:1537] Rejecting consensus 
request: Soft memory limit exceeded (at 311.39% of capacity) [suppressed 5 
similar messages]
W20240715 12:41:11.920109 850831 tablet_service.cc:1608] Rejecting Write 
request: Soft memory limit exceeded (at 311.51% of capacity) [suppressed 17 
similar messages]
W20240715 12:41:12.945374 850845 tablet_service.cc:1608] Rejecting Write 
request: Soft memory limit exceeded (at 312.54% of capacity) [suppressed 26 
similar messages]
W20240715 12:41:12.923309 850906 raft_consensus.cc:1537] Rejecting consensus 
request: Soft memory limit exceeded (at 312.49% of capacity) [suppressed 3 
similar messages]{code}
Eventually the process was killed by the OS:
{code:java}
kernel: Out of memory: Killed process 850414 (kudu-tserver) 
total-vm:422911368kB, anon-rss:38764kB, file-rss:0kB, shmem-rss:0kB, 
UID:39977 pgtables:822744kB oom_score_adj:0{code}

2. Scanning this tablet also caused a memory spike in the tablet server, taking 
up to almost 90% of the memory on the server (the memory hard limit in Kudu was 
set to about 30% or less) [see the attached screenshot of the scans dashboard 
webpage]. Interestingly, the on_disk_size of this tablet was only about 6GB and 
on_disk_data_size about 37KB [see the metrics for this tablet attached in 
metrics.txt], but the memory consumed was on the order of hundreds of GB.

3. We also noticed an issue with bootstrapping one of these update-heavy tablets 
(a different tablet from the one above): the recovery WAL dir of the tablet 
took up to 1.8TB, causing the server to crash. The original directory had about 
250 WAL segments, but the recovery WAL dir had about 200k WAL segments. We 
could not collect much information on this as the tablet was deleted to avoid 
downtime, but if the issue is seen again it would be good to collect the tablet 
metadata and examine the WAL segments for the config index values present. 
[The metrics of the tablet are attached as metrics_144.txt and 
metrics_147.txt.]

While we investigate the root cause of this behavior, it might be a good idea to:
A. Impose some guardrails on the number of deltas/updates that can be 
accumulated, throttling writes until compaction is done to reduce the number 
of deltas.
B. Have stricter checks on memory usage during delta compactions and 
scanning of the tablet.

It should be noted that the workload of tens of millions of updates was not 
expected, and the changes in the application were reverted, which calmed things 
down. This could be an application error, but Kudu should have some guardrails 
to keep the entire memory from being used up.
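
To make suggestion A concrete, here is a minimal sketch in Python of an admission check that throttles and then rejects updates as undeleted deltas accumulate. It is purely illustrative and does not reflect any existing Kudu code or flag: the class `DeltaGuardrail`, its two limits, and the three return values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeltaGuardrail:
    """Hypothetical per-tablet admission check; not part of Kudu."""

    soft_limit: int          # above this many pending deltas, throttle writers
    hard_limit: int          # at or above this many, reject new updates
    pending_deltas: int = 0  # deltas accumulated since the last compaction

    def admit(self, batch_size: int) -> str:
        """Decide what to do with a batch of `batch_size` updates."""
        projected = self.pending_deltas + batch_size
        if projected >= self.hard_limit:
            # Rejected batches are not counted; the writer must retry later.
            return "reject"
        self.pending_deltas = projected
        if projected > self.soft_limit:
            return "throttle"
        return "accept"

    def on_compaction_done(self, deltas_merged: int) -> None:
        """Release budget once a delta compaction has merged some deltas."""
        self.pending_deltas = max(0, self.pending_deltas - deltas_merged)


if __name__ == "__main__":
    guard = DeltaGuardrail(soft_limit=100, hard_limit=200)
    for batch in (50, 60, 100):
        print(batch, guard.admit(batch))
    guard.on_compaction_done(100)
    print(50, guard.admit(50))
```

The two-level limit is a design choice for the sketch: a soft limit lets the server slow writers down before a hard limit forces outright rejection, so compaction has a chance to catch up.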





[jira] [Updated] (KUDU-3590) update certs in test_certs.cc

2024-07-23 Thread Attila Bukor (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Bukor updated KUDU-3590:
---
Code Review: https://gerrit.cloudera.org/c/21607/

> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>






[jira] [Updated] (KUDU-3590) update certs in test_certs.cc

2024-07-23 Thread Attila Bukor (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Bukor updated KUDU-3590:
---
Affects Version/s: 1.17.0

> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Affects Versions: 1.17.0
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>






[jira] [Commented] (KUDU-3590) update certs in test_certs.cc

2024-07-23 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868266#comment-17868266
 ] 

ASF subversion and git services commented on KUDU-3590:
---

Commit 4decbbdf553ed54796ddbbb49e1142925657110f in kudu's branch 
refs/heads/master from Attila Bukor
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=4decbbdf5 ]

KUDU-3590 Update expired test certificates

Certificates generated in 2d624f877 expired after 1 year, causing
several test failures in security-itest and rpc-test.

This commit replaces the expired certificates with ones with a validity
of 20 years.

Note: If you git blame this file in 2044 in your flying car, please make
sure to update the keys and certificates to whatever is reasonable for
your quantum computers instead of simply relying on the instructions in
the comments.

Certificate chain:

Issuer: CN=IntermediateCA, ST=California, C=US, 
emailAddress=d...@kudu.apache.org, O=Apache Software Foundation, 
OU=Intermediate CA
Validity
Not Before: Jul 23 19:58:19 2024 GMT
Not After : Apr  9 19:58:19 2044 GMT
Subject: CN=127.0.0.1, ST=California, C=US,
emailAddress=d...@kudu.apache.org, O=Apache Software Foundation,
OU=Kudu

Issuer: C=US, ST=Some-State, O=Apache Software Foundation, 
CN=127.0.0.1, emailAddress=d...@kudu.apache.org
Validity
Not Before: Jul 23 19:55:34 2024 GMT
Not After : Jul 18 19:55:34 2044 GMT
Subject: CN=IntermediateCA, ST=California, C=US, 
emailAddress=d...@kudu.apache.org, O=Apache Software Foundation, 
OU=Intermediate CA

Issuer: C=US, ST=Some-State, O=Apache Software Foundation, 
CN=127.0.0.1, emailAddress=d...@kudu.apache.org
Validity
Not Before: Jul 23 17:50:11 2024 GMT
Not After : Jun 29 17:50:11 2124 GMT
Subject: C=US, ST=Some-State, O=Apache Software Foundation, 
CN=127.0.0.1, emailAddress=d...@kudu.apache.org

Change-Id: I0d9a38926307618b81e292074732d35520e9a8e9
Reviewed-on: http://gerrit.cloudera.org:8080/21607
Reviewed-by: Marton Greber 
Tested-by: Marton Greber 
Reviewed-by: Zoltan Chovan 
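
The validity periods listed in the certificate chain above can be checked with a little date arithmetic. The sketch below, in Python, parses the Not Before / Not After values exactly as quoted and prints the validity of each certificate in days. The labels "leaf", "intermediate", and "root" and the helper `validity_days` are names chosen here for illustration; they do not appear in the commit. Parsing the month abbreviations assumes an English or C locale.

```python
from datetime import datetime

# Timestamp format used in the Validity fields quoted above.
FMT = "%b %d %H:%M:%S %Y GMT"

# (Not Before, Not After) pairs copied from the commit message.
# The dictionary keys are illustrative labels, not taken from the commit.
CERTS = {
    "leaf": ("Jul 23 19:58:19 2024 GMT", "Apr  9 19:58:19 2044 GMT"),
    "intermediate": ("Jul 23 19:55:34 2024 GMT", "Jul 18 19:55:34 2044 GMT"),
    "root": ("Jul 23 17:50:11 2024 GMT", "Jun 29 17:50:11 2124 GMT"),
}


def validity_days(not_before: str, not_after: str) -> int:
    """Return the number of whole days between the two timestamps."""
    # Collapse the double space that pads single-digit days ("Apr  9").
    start = datetime.strptime(" ".join(not_before.split()), FMT)
    end = datetime.strptime(" ".join(not_after.split()), FMT)
    return (end - start).days


if __name__ == "__main__":
    for name, (nb, na) in CERTS.items():
        days = validity_days(nb, na)
        print(f"{name:<12} {days:>6} days (~{days / 365.25:.1f} years)")
```

By this calculation the three certificates are valid for 7200, 7300, and 36500 days respectively, so the "20 years" in the commit message is a close approximation for the first two, while the self-signed certificate lasts about 100 years.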


> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Affects Versions: 1.17.0
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>



