[ 
https://issues.apache.org/jira/browse/IGNITE-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Koptilin updated IGNITE-24523:
-----------------------------------------
    Description: 
h3. Motivation

We have two messages that are printed at the change peer op start:
{noformat}
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][BaseCliRequestProcessor] 
Receive ChangePeersAndLearnersAsyncRequest with term 1 to 
<17_part_0/irt_aconlnj_1> from 192.168.1.22:3345, new conf is 
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][NodeImpl] Node 
<17_part_0/irt_aconlnj_1> change configuration from 
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3 to 
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
Then we are printing a message about the configuration that was applied in each 
node where the partition holds (unfortunately without RAFT group ID):
{noformat}
[INFO 
][%irt_aconlnj_3%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
[INFO 
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3.
[INFO 
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
And finally we are printing a message about the rebalance that was finished in 
the same node where it had been started:
{noformat}
[INFO ][%irt_aconlnj_1%rebalance-scheduler-0][RebalanceRaftGroupEventsListener] 
Rebalance finished [tablePartitionId=17_part_0, appliedPeers=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]]]
{noformat}
And again, print the message about the assignment that was changed and saved on 
each cluster node:
{noformat}
[INFO ][%irt_aconlnj_0%tableManager-io-13][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3344, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_2%tableManager-io-16][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3346, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_3%tableManager-io-15][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3347, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_1%tableManager-io-13][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3345, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
{noformat}
In some cases we can print a message that confuses:
{noformat}
Node <51_part_3/idtt_n_3344> change configuration from idtt_n_3344 to 
idtt_n_3344
{noformat}
A set of peers in the old assignment coincides with the new one.
h3. Definition of done
 * I believe we can change a log level to debug for all the messages that are 
printed on each node.
 * Chose the one message at the rebalance start (the other one move to debug).
 * Research a case where the sets of peers match and change the log message in 
order for the reason for the rebalance to become clear.
 * Use our common log format if it is violated.

*Update:*

As we discussed privately, we agreed to log messages related to configuration 
changes using the debug level, and change the message when the original 
configuration matches the new one. As for rebalance logging, it should be 
addressed comprehensively. Simply changing the logging level is unlikely to 
bring any benefit.

  was:
h3. Motivation

We have two messages that are printed at the change peer op start:
{noformat}
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][BaseCliRequestProcessor] 
Receive ChangePeersAndLearnersAsyncRequest with term 1 to 
<17_part_0/irt_aconlnj_1> from 192.168.1.22:3345, new conf is 
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][NodeImpl] Node 
<17_part_0/irt_aconlnj_1> change configuration from 
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3 to 
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
Then we are printing a message about the configuration that was applied in each 
node where the partition holds (unfortunately without RAFT group ID):
{noformat}
[INFO 
][%irt_aconlnj_3%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
[INFO 
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3.
[INFO 
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
And finally we are printing a message about the rebalance that was finished in 
the same node where it had been started:
{noformat}
[INFO ][%irt_aconlnj_1%rebalance-scheduler-0][RebalanceRaftGroupEventsListener] 
Rebalance finished [tablePartitionId=17_part_0, appliedPeers=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]]]
{noformat}
And again, print the message about the assignment that was changed and saved on 
each cluster node:
{noformat}
[INFO ][%irt_aconlnj_0%tableManager-io-13][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3344, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_2%tableManager-io-16][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3346, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_3%tableManager-io-15][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3347, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_1%tableManager-io-13][TableManager] Received update on 
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
localMemberAddress=192.168.1.22:3345, stableAssignments=[Assignment 
[consistentId=irt_aconlnj_2, isPeer=true], Assignment 
[consistentId=irt_aconlnj_0, isPeer=true], Assignment 
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
{noformat}

In some cases we can print a message that confuses:
{noformat}
Node <51_part_3/idtt_n_3344> change configuration from idtt_n_3344 to 
idtt_n_3344
{noformat}
A set of peers in the old assignment coincides with the new one.

h3. Definition of done
* I believe we can change a log level to debug for all the messages that are 
printed on each node.
* Chose the one message at the rebalance start (the other one move to debug).
* Research a case where the sets of peers match and change the log message in 
order for the reason for the rebalance to become clear.
* Use our common log format if it is violated.


> Better raft logging in case of node configuration changes
> ---------------------------------------------------------
>
>                 Key: IGNITE-24523
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24523
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexander Lapin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h3. Motivation
> We have two messages that are printed at the change peer op start:
> {noformat}
> [INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][BaseCliRequestProcessor] 
> Receive ChangePeersAndLearnersAsyncRequest with term 1 to 
> <17_part_0/irt_aconlnj_1> from 192.168.1.22:3345, new conf is 
> irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2
> [INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][NodeImpl] Node 
> <17_part_0/irt_aconlnj_1> change configuration from 
> irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3 to 
> irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
> {noformat}
> Then we are printing a message about the configuration that was applied in 
> each node where the partition holds (unfortunately without RAFT group ID):
> {noformat}
> [INFO 
> ][%irt_aconlnj_3%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
> onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
> [INFO 
> ][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
> onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3.
> [INFO 
> ][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter] 
> onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
> {noformat}
> And finally we are printing a message about the rebalance that was finished 
> in the same node where it had been started:
> {noformat}
> [INFO 
> ][%irt_aconlnj_1%rebalance-scheduler-0][RebalanceRaftGroupEventsListener] 
> Rebalance finished [tablePartitionId=17_part_0, appliedPeers=[Assignment 
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_1, isPeer=true]]]
> {noformat}
> And again, print the message about the assignment that was changed and saved 
> on each cluster node:
> {noformat}
> [INFO ][%irt_aconlnj_0%tableManager-io-13][TableManager] Received update on 
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
> localMemberAddress=192.168.1.22:3344, stableAssignments=[Assignment 
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> [INFO ][%irt_aconlnj_2%tableManager-io-16][TableManager] Received update on 
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
> localMemberAddress=192.168.1.22:3346, stableAssignments=[Assignment 
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> [INFO ][%irt_aconlnj_3%tableManager-io-15][TableManager] Received update on 
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
> localMemberAddress=192.168.1.22:3347, stableAssignments=[Assignment 
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> [INFO ][%irt_aconlnj_1%tableManager-io-13][TableManager] Received update on 
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0, 
> localMemberAddress=192.168.1.22:3345, stableAssignments=[Assignment 
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment 
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments 
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> {noformat}
> In some cases we can print a message that confuses:
> {noformat}
> Node <51_part_3/idtt_n_3344> change configuration from idtt_n_3344 to 
> idtt_n_3344
> {noformat}
> A set of peers in the old assignment coincides with the new one.
> h3. Definition of done
>  * I believe we can change a log level to debug for all the messages that are 
> printed on each node.
>  * Chose the one message at the rebalance start (the other one move to debug).
>  * Research a case where the sets of peers match and change the log message 
> in order for the reason for the rebalance to become clear.
>  * Use our common log format if it is violated.
> *Update:*
> As we discussed privately, we agreed to log messages related to configuration 
> changes using the debug level, and change the message when the original 
> configuration matches the new one. As for rebalance logging, it should be 
> addressed comprehensively. Simply changing the logging level is unlikely to 
> bring any benefit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to