[jira] [Commented] (KUDU-3349) Kudu java client failed to demote leader and caused a lot of deleting rows timeout
[ https://issues.apache.org/jira/browse/KUDU-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868127#comment-17868127 ]

ASF subversion and git services commented on KUDU-3349:
---

Commit e44e0d4892b0e2469a18aefb78062f5aa2e1799c in kudu's branch refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=e44e0d489 ]

[client] add ScanTokenStaleRaftMembershipTest

This patch adds a new test scenario TabletLeaderChange into the newly added ScanTokenStaleRaftMembershipTest fixture.

The motivation for this patch was a request to clarify the Kudu C++ client's behavior in particular scenarios, which in itself came up in the context of a follow-up to KUDU-3349.

Change-Id: I6ce3d549d4ab2502c58deae1250b49ba16bbc914
Reviewed-on: http://gerrit.cloudera.org:8080/21580
Reviewed-by: Ashwani Raina
Reviewed-by: Abhishek Chennaka
Tested-by: Alexey Serbin

> Kudu java client failed to demote leader and caused a lot of deleting rows timeout
> --
>
> Key: KUDU-3349
> URL: https://issues.apache.org/jira/browse/KUDU-3349
> Project: Kudu
> Issue Type: Bug
> Reporter: Redriver
> Priority: Major
> Fix For: 1.16.0
>
>
> While deleting rows through Spark, I found a lot of PendingErrors that
> caused timeouts; the deletion took a very long time and sometimes
> ultimately failed.
> {code:java}
> java.lang.RuntimeException: PendingErrors overflowed.
Failed to write at > least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before > timeout: Batch{operations=100, tablet="b037e2e266b44c4c95da4065a8d5b719" > [0x0006, 0x0007), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=b037e2e266b44c4c95da4065a8d5b719, attempt=23, > TimeoutTracker(timeout=3, elapsed=26852), Traces: [0ms] sending RPC to > server contents="c51a7275257240b7a8c7e99d0895ae89">, [2ms] delaying RPC due to: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [2ms] received response from server > : > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [10ms] sending RPC to server > , > [12ms] delaying RPC due to: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [12ms] received response from server size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [30ms] sending RPC to server contents="c51a7275257240b7a8c7e99d0895ae89">, [32ms] delaying RPC due to: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [32ms] received response from server > : > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [51ms] sending RPC to server > , > [52ms] delaying RPC due to: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [52ms] received response from server size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [70ms] sending RPC to server contents="c51a7275257240b7a8c7e99d0895ae89">, 
[72ms] delaying RPC due to: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [72ms] received response from server > : > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [90ms] sending RPC to server > , > [92ms] delaying RPC due to: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [92ms] received response from server size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [170ms] sending RPC to server size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [172ms] delaying RPC > due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader > of this config: current role FOLLOWER (error 0), [172ms] received response > from server contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [190ms] sending RPC to server size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [192ms] delaying RPC > due to: Illegal st
[jira] [Assigned] (KUDU-3590) update certs in test_certs.cc
[ https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Bukor reassigned KUDU-3590:
--
Assignee: Attila Bukor (was: Bakai Ádám)

> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KUDU-3592) Memory spike in tablets with huge number of updates
Abhishek Chennaka created KUDU-3592:
---
Summary: Memory spike in tablets with huge number of updates
Key: KUDU-3592
URL: https://issues.apache.org/jira/browse/KUDU-3592
Project: Kudu
Issue Type: Improvement
Reporter: Abhishek Chennaka
Attachments: Screen Shot 2024-07-18 at 12.19.25 PM.png, metrics.txt, metrics_144.txt, metrics_147.txt

Came across a scenario where a tablet with 868 rows received about 10-20 million upserts (as updates) per minute, which caused strange behavior (with the ancient mark set to 15 minutes):

1. Any minor/major compaction on any of those tablet replicas led to a memory spike that used up all the memory in the server, and eventually the process was killed by the OS with an OOM message.
{code:java}
I20240715 12:35:40.534596 851004 maintenance_manager.cc:392] P d247fdcc1e4a45f8a01b8155960280a6: Scheduling MinorDeltaCompactionOp(f99cddc1e2444bacbcdd117b0b377a02): perf score=0.023000{code}
As soon as the delta compactions started, we saw a spike in memory usage of the tablet server process. The usage went up until the process was killed:
{code:java}
W20240715 12:41:10.909035 850831 tablet_service.cc:1608] Rejecting Write request: Soft memory limit exceeded (at 310.52% of capacity) [suppressed 9 similar messages]
W20240715 12:41:11.816135 850916 raft_consensus.cc:1537] Rejecting consensus request: Soft memory limit exceeded (at 311.39% of capacity) [suppressed 5 similar messages]
W20240715 12:41:11.920109 850831 tablet_service.cc:1608] Rejecting Write request: Soft memory limit exceeded (at 311.51% of capacity) [suppressed 17 similar messages]
W20240715 12:41:12.945374 850845 tablet_service.cc:1608] Rejecting Write request: Soft memory limit exceeded (at 312.54% of capacity) [suppressed 26 similar messages]
W20240715 12:41:12.923309 850906 raft_consensus.cc:1537] Rejecting consensus request: Soft memory limit exceeded (at 312.49% of capacity) [suppressed 3 similar messages]{code}
Eventually the process was killed by the OS:
{code:java}
kernel: Out of memory: Killed process 850414 (kudu-tserver) total-vm:422911368kB, anon-rss:38764kB, file-rss:0kB, shmem-rss:0kB, UID:39977 pgtables:822744kB oom_score_adj:0{code}

2. Scanning this tablet also caused a memory spike, with the tablet server taking up to almost 90% of the memory in the server (the memory hard limit in Kudu was set to about 30% or less) [attached screenshot of the scans dashboard webpage]. Interestingly, the on_disk_size of this tablet was only about 6GB and on_disk_data_size about 37KB [attached metrics related to this tablet in metrics.txt], but the memory consumed was on the order of hundreds of GB.

3. We also noticed an issue with bootstrapping one of these update-heavy tablets (a different tablet from the one above): the recovery of the tablet's WAL dir took up to 1.8TB, causing the server to crash. The original directory held about 250 WAL segments, but the recovery WAL dir had about 200k WAL segments. We could not collect much information on this because the tablet was deleted to avoid downtime, but if the issue is seen again it would be good to collect the tablet metadata and examine the WAL segments for the config index values present. [The metrics of the tablet are attached as metrics_144.txt and metrics_147.txt.]

While we investigate the root cause of this behavior, it might be a good idea to:

A. Impose some guardrails on the number of deltas/updates that can be accumulated, and throttle writes until compaction is done to reduce the number of deltas.

B. Have stricter checks on memory usage during delta compactions and scanning of the tablet.

Note that the workload of tens of millions of updates was not expected, and the changes in the application were reverted, which calmed things down. This could be an application error, but Kudu should have some guardrails so that the entire memory is not used up.
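Suggestion A above could look roughly like the following. This is only an illustrative sketch and not Kudu code: the class `DeltaBackpressure`, its two thresholds, and its method names are all hypothetical, and a real guardrail would have to live in the tablet server's write path and cooperate with the maintenance manager.

```python
class DeltaBackpressure:
    """Hypothetical per-tablet guard (not a Kudu API): throttle writes
    while too many un-compacted deltas have accumulated."""

    def __init__(self, soft_limit, hard_limit):
        if not 0 < soft_limit < hard_limit:
            raise ValueError("need 0 < soft_limit < hard_limit")
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.pending_deltas = 0

    def rejection_probability(self):
        """0.0 up to the soft limit, rising linearly to 1.0 at the hard limit."""
        if self.pending_deltas <= self.soft_limit:
            return 0.0
        if self.pending_deltas >= self.hard_limit:
            return 1.0
        span = self.hard_limit - self.soft_limit
        return (self.pending_deltas - self.soft_limit) / span

    def try_admit(self, num_updates, coin):
        """Admit or reject a write batch.

        `coin` is a uniform random number in [0, 1); it is passed in by the
        caller so that the decision is easy to test deterministically.
        """
        if coin < self.rejection_probability():
            return False  # caller should back off and retry later
        self.pending_deltas += num_updates
        return True

    def on_compaction_done(self, deltas_removed):
        """Release pressure once a delta compaction has merged deltas away."""
        self.pending_deltas = max(0, self.pending_deltas - deltas_removed)
```

A caller would invoke `try_admit(len(batch), random.random())` before applying a batch and `on_compaction_done(n)` after each compaction. The gradual ramp between the two limits is a design choice made for this sketch, so that writers slow down progressively instead of hitting a hard wall.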
[jira] [Updated] (KUDU-3590) update certs in test_certs.cc
[ https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Bukor updated KUDU-3590:
---
Code Review: https://gerrit.cloudera.org/c/21607/

> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>
[jira] [Updated] (KUDU-3590) update certs in test_certs.cc
[ https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Bukor updated KUDU-3590:
---
Affects Version/s: 1.17.0

> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Affects Versions: 1.17.0
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>
[jira] [Commented] (KUDU-3590) update certs in test_certs.cc
[ https://issues.apache.org/jira/browse/KUDU-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868266#comment-17868266 ]

ASF subversion and git services commented on KUDU-3590:
---

Commit 4decbbdf553ed54796ddbbb49e1142925657110f in kudu's branch refs/heads/master from Attila Bukor
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=4decbbdf5 ]

KUDU-3590 Update expired test certificates

Certificates generated in 2d624f877 expired after 1 year, causing several test failures in security-itest and rpc-test. This commit replaces the expired certificates with ones with a validity of 20 years.

Note: If you git blame this file in 2044 in your flying car, please make sure to update the keys and certificates to whatever is reasonable for your quantum computers instead of simply relying on the instructions in the comments.

Certificate chain:

Issuer: CN=IntermediateCA, ST=California, C=US, emailAddress=d...@kudu.apache.org, O=Apache Software Foundation, OU=Intermediate CA
Validity
    Not Before: Jul 23 19:58:19 2024 GMT
    Not After : Apr 9 19:58:19 2044 GMT
Subject: CN=127.0.0.1, ST=California, C=US, emailAddress=d...@kudu.apache.org, O=Apache Software Foundation, OU=Kudu

Issuer: C=US, ST=Some-State, O=Apache Software Foundation, CN=127.0.0.1, emailAddress=d...@kudu.apache.org
Validity
    Not Before: Jul 23 19:55:34 2024 GMT
    Not After : Jul 18 19:55:34 2044 GMT
Subject: CN=IntermediateCA, ST=California, C=US, emailAddress=d...@kudu.apache.org, O=Apache Software Foundation, OU=Intermediate CA

Issuer: C=US, ST=Some-State, O=Apache Software Foundation, CN=127.0.0.1, emailAddress=d...@kudu.apache.org
Validity
    Not Before: Jul 23 17:50:11 2024 GMT
    Not After : Jun 29 17:50:11 2124 GMT
Subject: C=US, ST=Some-State, O=Apache Software Foundation, CN=127.0.0.1, emailAddress=d...@kudu.apache.org

Change-Id: I0d9a38926307618b81e292074732d35520e9a8e9
Reviewed-on: http://gerrit.cloudera.org:8080/21607
Reviewed-by: Marton Greber
Tested-by: Marton Greber
Reviewed-by: Zoltan Chovan

> update certs in test_certs.cc
> -
>
> Key: KUDU-3590
> URL: https://issues.apache.org/jira/browse/KUDU-3590
> Project: Kudu
> Issue Type: Sub-task
> Affects Versions: 1.17.0
> Reporter: Bakai Ádám
> Assignee: Attila Bukor
> Priority: Major
>
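As a quick check of the validity periods quoted in the commit message, the sketch below computes each certificate's lifetime in days from the Not Before / Not After timestamps listed in the certificate chain. The labels `leaf`, `intermediate`, and `root` are shorthand chosen here for the three entries, not names taken from the commit, and the parsing assumes an English/C locale for the month abbreviations.

```python
from datetime import datetime

# Timestamp layout as printed in the certificate chain above.
FMT = "%b %d %H:%M:%S %Y GMT"

# (Not Before, Not After) pairs copied from the commit message.
VALIDITY = {
    "leaf": ("Jul 23 19:58:19 2024 GMT", "Apr 9 19:58:19 2044 GMT"),
    "intermediate": ("Jul 23 19:55:34 2024 GMT", "Jul 18 19:55:34 2044 GMT"),
    "root": ("Jul 23 17:50:11 2024 GMT", "Jun 29 17:50:11 2124 GMT"),
}


def validity_days(not_before, not_after):
    """Return the number of whole days between the two timestamps."""
    start = datetime.strptime(not_before, FMT)
    end = datetime.strptime(not_after, FMT)
    return (end - start).days


LIFETIMES = {name: validity_days(*span) for name, span in VALIDITY.items()}

if __name__ == "__main__":
    for name, days in LIFETIMES.items():
        print(f"{name}: {days} days")
```

With the dates above, the leaf and intermediate certificates come out at 7200 and 7300 days (roughly 20 years), while the self-signed root is valid for 36500 days.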