Thanks for the response, Gregory. We need to support a couple of production services that we have already migrated to Ceph, so we are in a bit of a soup.
The cluster is as follows:
```
ceph osd tree
ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       11.06848 root default
 -7        5.45799     host master
  5   hdd  5.45799         osd.5       up  1.00000 1.00000
 -5        1.81940     host node2
  7   hdd  1.81940         osd.7       up  1.00000 1.00000
 -3        1.81940     host node3
  8   hdd  1.81940         osd.8       up  1.00000 1.00000
 -9        1.81940     host node4
  6   hdd  1.81940         osd.6       up  1.00000 1.00000
-11        0.15230     host node5
  9   hdd  0.15230         osd.9       up  1.00000 1.00000
```
We have installed the Ceph cluster and the Kubernetes cluster on the same nodes (CentOS 7). We were seeing low performance from the Ceph cluster, ~10.5 MB/s, measured with:
```
dd if=/dev/zero of=./here bs=1M count=1024 oflag=direct
```
(A sketch of a rados-level benchmark we could also run is included right after the config below.)

So we were in the process of adding an additional NIC to each node, rebooting the nodes one by one and making sure each rebooted node was healthy before moving on to the next. After every few reboots (a couple), the MDS would go down and report data damage. We would follow the disaster recovery procedure and all would be merry again (the exact sequence we have been running is sketched near the end of this mail). For a couple of days now the MDS has not come up at all, and the disaster recovery procedure no longer works.

Cluster conf:
```
[global]
fsid = 2ed909ef-e3d7-4081-b01a-d04d12a1155d
mon_initial_members = master, node3, node2
mon_host = 10.10.73.45,10.10.73.44,10.10.73.43
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
public_network = 10.10.73.0/24
osd pool default size = 2        # Write an object 2 times.
osd pool default min size = 2
mon allow pool delete = true
cluster network = 10.10.73.0/24
max open files = 131072

[mon]
mon data = /var/lib/ceph/mon/ceph-$id

[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000
osd mkfs type = xfs
osd mkfs options xfs = -f
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000
journal max write bytes = 1073714824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 8
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 4
osd recovery max active = 10
osd max backfills = 4
osd skip data digest = true

[client]
rbd cache = true
rbd cache size = 268435456
rbd cache max dirty = 134217728
rbd cache max dirty age = 5
```
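In case the dd numbers are misleading, this is a sketch of how we could also benchmark the rados layer directly, bypassing the filesystem (the pool name cephfs_data is just an example; we would substitute our actual data pool):
```
# 30-second write benchmark straight against RADOS, keeping the objects
# so a read benchmark can follow (pool name is an example):
rados bench -p cephfs_data 30 write --no-cleanup

# Sequential read benchmark over the objects written above:
rados bench -p cephfs_data 30 seq

# Remove the benchmark objects afterwards:
rados -p cephfs_data cleanup
```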
ceph health:
```
master@~/ ceph -s
  cluster:
    id:     2ed909ef-e3d7-4081-b01a-d04d12a1155d
    health: HEALTH_ERR
            4 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum node2,node3,master
    mgr: master(active)
    mds: cephfs-1/1/1 up {0=master=up:active(laggy or crashed)}
    osd: 5 osds: 5 up, 5 in

  data:
    pools:   2 pools, 300 pgs
    objects: 194.1 k objects, 33 GiB
    usage:   131 GiB used, 11 TiB / 11 TiB avail
    pgs:     299 active+clean
             1   active+clean+inconsistent
```
ceph health detail:
```
ceph health detail
HEALTH_ERR 4 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 4 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 1.43 is active+clean+inconsistent, acting [5,8,7]
```
The MDS logs have already been provided.
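For reference, the disaster-recovery sequence we have been running each time is roughly the following (based on the documented procedure and reproduced from memory, so the exact options may be slightly off):
```
# Back up the MDS journal before touching anything:
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

# Flush whatever metadata is still recoverable from the journal into the
# backing pool, then reset the journal and the session table:
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-table-tool all reset session

# Reset the filesystem map so a single MDS can be brought up again:
ceph fs reset cephfs --yes-i-really-mean-it
```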
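Also, for the inconsistent pg 1.43: is the usual inspect/repair sequence below safe to run in this state, or could it make the MDS situation worse? (Sketched here only so you can tell us if it is the wrong approach.)
```
# Show which objects the deep scrub flagged as inconsistent in pg 1.43:
rados list-inconsistent-obj 1.43 --format=json-pretty

# Ask the primary OSD to repair the PG from the authoritative copies:
ceph pg repair 1.43
```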
Sincerely appreciate you reading through all of this. Thanks,

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com