[
https://issues.apache.org/jira/browse/IGNITE-24523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vyacheslav Koptilin updated IGNITE-24523:
-----------------------------------------
Description:
h3. Motivation
We have two messages that are printed at the change peer op start:
{noformat}
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 1 to
<17_part_0/irt_aconlnj_1> from 192.168.1.22:3345, new conf is
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][NodeImpl] Node
<17_part_0/irt_aconlnj_1> change configuration from
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3 to
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
Then we are printing a message about the configuration that was applied in each
node where the partition holds (unfortunately without RAFT group ID):
{noformat}
[INFO
][%irt_aconlnj_3%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
[INFO
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3.
[INFO
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
And finally we are printing a message about the rebalance that was finished in
the same node where it had been started:
{noformat}
[INFO ][%irt_aconlnj_1%rebalance-scheduler-0][RebalanceRaftGroupEventsListener]
Rebalance finished [tablePartitionId=17_part_0, appliedPeers=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]]]
{noformat}
And again, print the message about the assignment that was changed and saved on
each cluster node:
{noformat}
[INFO ][%irt_aconlnj_0%tableManager-io-13][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3344, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_2%tableManager-io-16][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3346, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_3%tableManager-io-15][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3347, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_1%tableManager-io-13][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3345, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
{noformat}
In some cases we can print a message that confuses:
{noformat}
Node <51_part_3/idtt_n_3344> change configuration from idtt_n_3344 to
idtt_n_3344
{noformat}
A set of peers in the old assignment coincides with the new one.
h3. Definition of done
* I believe we can change a log level to debug for all the messages that are
printed on each node.
* Chose the one message at the rebalance start (the other one move to debug).
* Research a case where the sets of peers match and change the log message in
order for the reason for the rebalance to become clear.
* Use our common log format if it is violated.
*Update:*
As we discussed privately, we agreed to log messages related to configuration
changes using the debug level, and change the message when the original
configuration matches the new one. As for rebalance logging, it should be
addressed comprehensively. Simply changing the logging level is unlikely to
bring any benefit.
was:
h3. Motivation
We have two messages that are printed at the change peer op start:
{noformat}
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][BaseCliRequestProcessor]
Receive ChangePeersAndLearnersAsyncRequest with term 1 to
<17_part_0/irt_aconlnj_1> from 192.168.1.22:3345, new conf is
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2
[INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][NodeImpl] Node
<17_part_0/irt_aconlnj_1> change configuration from
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3 to
irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
Then we are printing a message about the configuration that was applied in each
node where the partition holds (unfortunately without RAFT group ID):
{noformat}
[INFO
][%irt_aconlnj_3%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
[INFO
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3.
[INFO
][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
{noformat}
And finally we are printing a message about the rebalance that was finished in
the same node where it had been started:
{noformat}
[INFO ][%irt_aconlnj_1%rebalance-scheduler-0][RebalanceRaftGroupEventsListener]
Rebalance finished [tablePartitionId=17_part_0, appliedPeers=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]]]
{noformat}
And again, print the message about the assignment that was changed and saved on
each cluster node:
{noformat}
[INFO ][%irt_aconlnj_0%tableManager-io-13][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3344, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_2%tableManager-io-16][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3346, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_3%tableManager-io-15][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3347, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
[INFO ][%irt_aconlnj_1%tableManager-io-13][TableManager] Received update on
stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
localMemberAddress=192.168.1.22:3345, stableAssignments=[Assignment
[consistentId=irt_aconlnj_2, isPeer=true], Assignment
[consistentId=irt_aconlnj_0, isPeer=true], Assignment
[consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
[nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
{noformat}
In some cases we can print a message that confuses:
{noformat}
Node <51_part_3/idtt_n_3344> change configuration from idtt_n_3344 to
idtt_n_3344
{noformat}
A set of peers in the old assignment coincides with the new one.
h3. Definition of done
* I believe we can change a log level to debug for all the messages that are
printed on each node.
* Chose the one message at the rebalance start (the other one move to debug).
* Research a case where the sets of peers match and change the log message in
order for the reason for the rebalance to become clear.
* Use our common log format if it is violated.
> Better raft logging in case of node configuration changes
> ---------------------------------------------------------
>
> Key: IGNITE-24523
> URL: https://issues.apache.org/jira/browse/IGNITE-24523
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexander Lapin
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h3. Motivation
> We have two messages that are printed at the change peer op start:
> {noformat}
> [INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][BaseCliRequestProcessor]
> Receive ChangePeersAndLearnersAsyncRequest with term 1 to
> <17_part_0/irt_aconlnj_1> from 192.168.1.22:3345, new conf is
> irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2
> [INFO ][%irt_aconlnj_1%JRaft-Request-Processor-22][NodeImpl] Node
> <17_part_0/irt_aconlnj_1> change configuration from
> irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3 to
> irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
> {noformat}
> Then we are printing a message about the configuration that was applied in
> each node where the partition holds (unfortunately without RAFT group ID):
> {noformat}
> [INFO
> ][%irt_aconlnj_3%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
> onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
> [INFO
> ][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
> onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_3.
> [INFO
> ][%irt_aconlnj_2%JRaft-FSMCaller-Disruptor_stripe_1-0][StateMachineAdapter]
> onConfigurationCommitted: irt_aconlnj_0,irt_aconlnj_1,irt_aconlnj_2.
> {noformat}
> And finally we are printing a message about the rebalance that was finished
> in the same node where it had been started:
> {noformat}
> [INFO
> ][%irt_aconlnj_1%rebalance-scheduler-0][RebalanceRaftGroupEventsListener]
> Rebalance finished [tablePartitionId=17_part_0, appliedPeers=[Assignment
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment
> [consistentId=irt_aconlnj_1, isPeer=true]]]
> {noformat}
> And again, print the message about the assignment that was changed and saved
> on each cluster node:
> {noformat}
> [INFO ][%irt_aconlnj_0%tableManager-io-13][TableManager] Received update on
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
> localMemberAddress=192.168.1.22:3344, stableAssignments=[Assignment
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> [INFO ][%irt_aconlnj_2%tableManager-io-16][TableManager] Received update on
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
> localMemberAddress=192.168.1.22:3346, stableAssignments=[Assignment
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> [INFO ][%irt_aconlnj_3%tableManager-io-15][TableManager] Received update on
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
> localMemberAddress=192.168.1.22:3347, stableAssignments=[Assignment
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> [INFO ][%irt_aconlnj_1%tableManager-io-13][TableManager] Received update on
> stable assignments [key=assignments.stable.17_part_0, partition=17_part_0,
> localMemberAddress=192.168.1.22:3345, stableAssignments=[Assignment
> [consistentId=irt_aconlnj_2, isPeer=true], Assignment
> [consistentId=irt_aconlnj_0, isPeer=true], Assignment
> [consistentId=irt_aconlnj_1, isPeer=true]], pendingAssignments=Assignments
> [nodes=HashSet [], force=false, timestamp=0, fromReset=false], revision=154]
> {noformat}
> In some cases we can print a message that confuses:
> {noformat}
> Node <51_part_3/idtt_n_3344> change configuration from idtt_n_3344 to
> idtt_n_3344
> {noformat}
> A set of peers in the old assignment coincides with the new one.
> h3. Definition of done
> * I believe we can change a log level to debug for all the messages that are
> printed on each node.
> * Chose the one message at the rebalance start (the other one move to debug).
> * Research a case where the sets of peers match and change the log message
> in order for the reason for the rebalance to become clear.
> * Use our common log format if it is violated.
> *Update:*
> As we discussed privately, we agreed to log messages related to configuration
> changes using the debug level, and change the message when the original
> configuration matches the new one. As for rebalance logging, it should be
> addressed comprehensively. Simply changing the logging level is unlikely to
> bring any benefit.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)