Hi. I rebuilt my Ceph yesterday with the latest master branch and the problem still occurs. I also found that the receive-error counter (/sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors) keeps increasing during the testing, and I think that is why the OSD connections break. I will try to figure out the cause.
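In case it helps anyone reproduce this, here is a minimal sketch of the kind of loop that can be used to watch that counter while fio is running (it assumes the same mlx4_0 device and port 1 as the path above; the script is only illustrative and not part of Ceph):

#!/usr/bin/env python3
# Poll the InfiniBand receive-error counter once per second and print any change,
# so jumps can be correlated with the "Broken pipe" timestamps in the OSD logs.
import time

COUNTER = "/sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors"

def read_counter(path):
    with open(path) as f:
        return int(f.read().strip())

prev = read_counter(COUNTER)
while True:
    time.sleep(1)
    cur = read_counter(COUNTER)
    if cur != prev:
        print("%s port_rcv_errors %d -> %d (+%d)"
              % (time.strftime("%H:%M:%S"), prev, cur, cur - prev))
    prev = cur

Running it on an OSD host during the fio job shows whether the error counter jumps at the same time the OSDs start reporting faults.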
Thanks.

Best Regards,
Hung-Wei Chiu (邱宏瑋)
--
Computer Center, Department of Computer Science
National Chiao Tung University

2017-03-21 5:07 GMT+08:00 Haomai Wang <hao...@xsky.com>:
> Please use the master branch to test RDMA.
>
> On Sun, Mar 19, 2017 at 11:08 PM, Hung-Wei Chiu (邱宏瑋) <hwc...@cs.nctu.edu.tw> wrote:
>
>> Hi,
>>
>> I want to test the performance of Ceph with RDMA, so I built Ceph with
>> RDMA support and deployed it into my test environment manually.
>>
>> I use fio for the performance evaluation and it works fine when Ceph
>> uses *async + posix* as its ms_type.
>> After changing the ms_type from *async + posix* to *async + rdma*, some
>> OSDs go down during the performance test, so fio cannot finish its job.
>> The log files of those OSDs show that something goes wrong when the OSD
>> tries to send a message, as you can see below.
>>
>> ...
>> 2017-03-20 09:43:10.096042 7faac163e700 -1 Infiniband recv_msg got error -104: (104) Connection reset by peer
>> 2017-03-20 09:43:10.096314 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6813/32315 conn(0x563de5282000 :-1 s=STATE_OPEN pgs=264 cs=29 l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.251606 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.251755 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6821/32509 conn(0x563de51f1000 :-1 s=STATE_OPEN pgs=314 cs=24 l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.254103 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.254375 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6821/48196 conn(0x563de514b000 :6809 s=STATE_OPEN pgs=275 cs=30 l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.260622 7faac1e3f700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.260693 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6805/47835 conn(0x563de537d800 :-1 s=STATE_OPEN pgs=310 cs=11 l=0).fault with nothing to send, going to standby
>> 2017-03-20 09:43:10.264621 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.264682 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.15:6829/48397 conn(0x563de5fdb000 :-1 s=STATE_OPEN pgs=231 cs=23 l=0).fault with nothing to send, going to standby
>> 2017-03-20 09:43:10.291832 7faac163e700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.291895 7faac163e700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6817/32412 conn(0x563de50f5800 :-1 s=STATE_OPEN pgs=245 cs=25 l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.387540 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.387565 7faac2e41700 -1 Infiniband send_msg send returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.387635 7faac2e41700 0 -- 10.0.0.16:6809/23853 >> 10.0.0.17:6801/32098 conn(0x563de51ab800 :6809 s=STATE_OPEN pgs=268 cs=23 l=0).fault with nothing to send, going to standby
>> 2017-03-20 09:43:11.453373 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6803 osd.0 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> 2017-03-20 09:43:11.453422 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6807 osd.1 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> 2017-03-20 09:43:11.453435 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6811 osd.2 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> 2017-03-20 09:43:11.453444 7faabdee0700 -1 osd.10 902 heartbeat_check: no reply from 10.0.0.15:6815 osd.3 since back 2017-03-20 09:42:50.610507 front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> ...
>>
>> The following is my environment.
>>
>> *[Software]*
>> *Ceph Version*: ceph version 12.0.0-1356-g7ba32cb (built by myself from the master branch)
>> *Deployment*: without ceph-deploy and systemd; every daemon is started manually.
>> *Host*: Ubuntu 16.04.1 LTS (x86_64), with Linux kernel 4.4.0-66-generic.
>> *NIC*: Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
>> *NIC Driver*: MLNX_OFED_LINUX-4.0-1.0.1.0 (OFED-4.0-1.0.1)
>>
>> *[Configuration]*
>> Ceph.conf
>>
>> [global]
>> fsid = 0612cc7e-6239-456c-978b-b4df781fe831
>> mon initial members = ceph-1,ceph-2,ceph-3
>> mon host = 10.0.0.15,10.0.0.16,10.0.0.17
>> osd pool default size = 2
>> osd pool default pg num = 1024
>> osd pool default pgp num = 1024
>> ms_type = async+rdma
>> ms_async_rdma_device_name = mlx4_0
>>
>> Fio.conf
>>
>> [global]
>> ioengine=rbd
>> clientname=admin
>> pool=rbd
>> rbdname=rbd
>> clustername=ceph
>> runtime=120
>> iodepth=128
>> numjobs=6
>> group_reporting
>> size=256G
>> direct=1
>> ramp_time=5
>> [r75w25]
>> bs=4k
>> rw=randrw
>> rwmixread=75
>>
>> *[Cluster Env]*
>> 1. Three nodes in total.
>> 2. 3 ceph monitors, one on each node.
>> 3. 8 ceph OSDs on each node (24 OSDs in total).
>>
>> Thanks