Hi list,

Ceph version: Luminous 12.2.2

The cluster was running a write-throughput test when this problem happened.
The cluster health went to an error state:
Health check update: 27 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
Clients couldn't write any data into the cluster.
osd22 and osd40 are the OSDs responsible for the problem.
osd22's log shows the message below, repeating continuously:
2018-01-07 20:44:52.202322 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x96aa9400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969602 vs existing csq=969601 existing_state=STATE_STANDBY

2018-01-07 20:44:52.250600 b56db8e0  0 bad crc in data 3751247614 != exp 3467727689

2018-01-07 20:44:52.252470 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x95c04000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969604 vs existing csq=969603 existing_state=STATE_STANDBY

2018-01-07 20:44:52.300354 b5edb8e0  0 bad crc in data 3751247614 != exp 3467727689

2018-01-07 20:44:52.302788 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x978e7a00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969606 vs existing csq=969605 existing_state=STATE_STANDBY

2018-01-07 20:44:52.350987 b56db8e0  0 bad crc in data 3751247614 != exp 3467727689

2018-01-07 20:44:52.352953 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x97420e00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969608 vs existing csq=969607 existing_state=STATE_STANDBY

2018-01-07 20:44:52.400959 b5edb8e0  0 bad crc in data 3751247614 != exp 3467727689

osd40's log shows the message below, repeating continuously:
2018-01-07 20:44:52.200709 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484865 cs=969601 l=0).fault initiating reconnect

2018-01-07 20:44:52.251423 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484866 cs=969603 l=0).fault initiating reconnect

2018-01-07 20:44:52.301166 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484867 cs=969605 l=0).fault initiating reconnect

2018-01-07 20:44:52.351810 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484868 cs=969607 l=0).fault initiating reconnect

2018-01-07 20:44:52.401782 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484869 cs=969609 l=0).fault initiating reconnect

The NIC of osd22's host kept sending data to osd40's host at about 50 MB/s while this was happening.

After rebooting osd22, the cluster went back to normal.
This happened twice during my write tests, with the same OSDs (osd22 and osd40) involved both times.

What could cause this problem? Could it be caused by a faulty HDD?
Which data's CRC didn't match?
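
For reference, my (possibly wrong) understanding is that the "bad crc in data" line is printed by the messenger when the crc32c it computes over a received message's data payload doesn't match the CRC carried with the message, i.e. it is checking data that arrived over the network rather than data read from disk. A rough sketch of that kind of check in plain Python (the payload, the simulated bit flip and expected_crc are made-up stand-ins for the received data segment and the sender's CRC, not actual Ceph code):

# Sketch of a CRC-32C (Castagnoli) payload check, the same kind of comparison
# that would produce "bad crc in data <computed> != exp <expected>".
def crc32c(data: bytes, crc: int = 0) -> int:
    # Bitwise CRC-32C (reflected polynomial 0x82F63B78); slow but dependency-free.
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF

payload = b"example object data as sent"      # hypothetical data segment
expected_crc = crc32c(payload)                # sender-side CRC sent along with the message

received = b"example object dbta as sent"     # simulate corruption in transit
computed = crc32c(received)
if computed != expected_crc:
    print(f"bad crc in data {computed} != exp {expected_crc}")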


2018-01-09



lin.yunfan
