Hi David,

What about memory usage?

[1] 23 OSD nodes: 15x 10TB Seagate Ironwolf, filestore with journals on Intel DC P3700, 70% full cluster, dual-socket E5-2620 v4 @ 2.10GHz, 128GB RAM.

If you upgrade to bluestore, memory usage will likely increase. With 15x 10TB OSDs per node that works out to roughly 150GB of RAM needed, especially in recovery/backfilling scenarios like these.

Kind regards,
Caspar

2018-03-15 21:53 GMT+01:00 Dan van der Ster <d...@vanderster.com>:

> Did you use perf top or iotop to try to identify where the osd is stuck?
> Did you try increasing the op thread suicide timeout from 180s?
>
> Splitting should log at the beginning and end of an op, so it should be
> clear if it's taking longer than the timeout.
>
> .. Dan
>
> On Mar 15, 2018 9:23 PM, "David Turner" <drakonst...@gmail.com> wrote:
>
> I am aware of the filestore splitting happening. I manually split all of
> the subfolders a couple of weeks ago on this cluster, but every time we
> have backfilling, the newly moved PGs have a chance to split before the
> backfilling is done. When that has happened in the past it has caused
> some blocked requests and flapping OSDs unless we increased
> osd_heartbeat_grace, but it has never consistently killed the OSDs
> during the task. Maybe that's new in Luminous due to some of the
> priority and timeout settings.
>
> This problem in general seems unrelated to the subfolder splitting,
> though, since it started very quickly into the backfilling process,
> definitely before many of the recently moved PGs would have reached
> that point. I've also confirmed that the OSDs that are dying are not
> just stuck on a process (as they appear to be during filestore
> splitting), but are actually segfaulting and restarting.
>
> On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster <d...@vanderster.com>
> wrote:
>
>> Hi,
>>
>> Do you see any split or merge messages in the osd logs?
>> I recall some surprise filestore splitting on a few osds after the
>> luminous upgrade.
>>
>> .. Dan
>>
>> On Mar 15, 2018 6:04 PM, "David Turner" <drakonst...@gmail.com> wrote:
>>
>> I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2, and last
>> week I added 2 nodes to the cluster. The backfilling has been ATROCIOUS.
>> I have OSDs consistently [2] segfaulting during recovery. There's no
>> pattern to which OSDs are segfaulting or which hosts have segfaulting
>> OSDs; it's all over the cluster. I have been trying variants of all of
>> the following settings with different levels of success, but I cannot
>> eliminate the blocked requests and segfaulting OSDs:
>> osd_heartbeat_grace, osd_max_backfills, osd_op_thread_suicide_timeout,
>> osd_recovery_max_active, osd_recovery_sleep_hdd,
>> osd_recovery_sleep_hybrid, osd_recovery_thread_timeout, and
>> osd_scrub_during_recovery. Short of setting the nobackfill flag on the
>> cluster, I can't stop OSDs from segfaulting during recovery.
>>
>> Does anyone have any ideas for this? I've been struggling with it for
>> over a week now. For the first couple of days I rebalanced the cluster
>> and had this exact same issue prior to adding the new storage. Even
>> setting osd_max_backfills to 1 and osd_recovery_sleep to 1.0, with
>> everything else at defaults, doesn't help.
>>
>> Backfilling caused things to slow down on Jewel, but I wasn't having
>> OSDs segfault multiple times per hour like I am on Luminous. So many
>> OSDs are going down that I had to set nodown to prevent the potential
>> data instability of OSDs on multiple hosts going up and down all the
>> time. That blocks IO for every OSD that dies, either until it comes
>> back up or until I manually mark it down.
>>
>> I hope someone has some ideas for me here. Our plan moving forward is
>> to only use half of the capacity of the drives, by pretending they're
>> 5TB instead of 10TB, to increase the spindle speed per TB. Also,
>> migrating to bluestore will hopefully help.
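
The throttling settings and flags David describes above are normally applied at runtime with injectargs and the cluster-wide OSD flags. A rough sketch of what that looks like, with example values rather than the numbers actually used on this cluster:

    # throttle backfill/recovery as far as possible (example values)
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_sleep_hdd 0.5'

    # pause backfill entirely and stop flapping OSDs from being marked
    # down while debugging
    ceph osd set nobackfill
    ceph osd set nodown

    # ... and remember to clear both flags again afterwards
    ceph osd unset nobackfill
    ceph osd unset nodown

Depending on the release, injectargs may warn that some of these options are not observed at runtime and only take effect after an OSD restart.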

>> [1] 23 OSD nodes: 15x 10TB Seagate Ironwolf, filestore with journals on
>> Intel DC P3700, 70% full cluster, dual-socket E5-2620 v4 @ 2.10GHz,
>> 128GB RAM.
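
The [2] trace below shows the OSD's op threads and the FileStore::op_tp thread repeatedly exceeding the 60s thread timeout, and FileStore::op_tp finally hitting the 180s suicide timeout right before the daemon aborts in the tp_fstore_op thread. Experimenting with Dan's suggestion of raising the suicide timeout, together with the heartbeat grace David mentions, would look roughly like the following. The values are examples only, filestore_op_thread_suicide_timeout is the filestore-side counterpart assumed here, and some of these options may only take full effect after an OSD restart:

    # raise the suicide timeouts well above the 180s default so a slow but
    # still-alive OSD is not aborted mid-backfill (example values)
    ceph tell osd.* injectargs '--osd_op_thread_suicide_timeout 600 --filestore_op_thread_suicide_timeout 600'

    # give heartbeat peers longer before they report the OSD down
    ceph tell osd.* injectargs '--osd_heartbeat_grace 60'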

>> [2]
>> -19> 2018-03-15 16:42:17.998074 7fe661601700 5 -- 10.130.115.25:6811/2942118 >> 10.130.115.48:0/372681 conn(0x55e3ea087000 :6811 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1920 cs=1 l=1). rx osd.254 seq 74507 0x55e3eb8e2e00 osd_ping(ping e93182 stamp 2018-03-15 16:42:17.990698) v4
>> -18> 2018-03-15 16:42:17.998091 7fe661601700 1 -- 10.130.115.25:6811/2942118 <== osd.254 10.130.115.48:0/372681 74507 ==== osd_ping(ping e93182 stamp 2018-03-15 16:42:17.990698) v4 ==== 2004+0+0 (492539280 0 0) 0x55e3eb8e2e00 con 0x55e3ea087000
>> -17> 2018-03-15 16:42:17.998109 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639772700' had timed out after 60
>> -16> 2018-03-15 16:42:17.998111 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639f73700' had timed out after 60
>> -15> 2018-03-15 16:42:17.998120 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63a774700' had timed out after 60
>> -14> 2018-03-15 16:42:17.998123 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63af75700' had timed out after 60
>> -13> 2018-03-15 16:42:17.998126 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63b776700' had timed out after 60
>> -12> 2018-03-15 16:42:17.998129 7fe661601700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe654854700' had timed out after 60
>> -11> 2018-03-15 16:42:18.004203 7fe661601700 5 -- 10.130.115.25:6811/2942118 >> 10.130.115.33:0/3348055 conn(0x55e3eb5f0000 :6811 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1894 cs=1 l=1). rx osd.169 seq 74633 0x55e3eb8e2e00 osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4
>> -10> 2018-03-15 16:42:18.004230 7fe661601700 1 -- 10.130.115.25:6811/2942118 <== osd.169 10.130.115.33:0/3348055 74633 ==== osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4 ==== 2004+0+0 (2306332339 0 0) 0x55e3eb8e2e00 con 0x55e3eb5f0000
>> -9> 2018-03-15 16:42:18.004241 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639772700' had timed out after 60
>> -8> 2018-03-15 16:42:18.004244 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe639f73700' had timed out after 60
>> -7> 2018-03-15 16:42:18.004246 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63a774700' had timed out after 60
>> -6> 2018-03-15 16:42:18.004248 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63af75700' had timed out after 60
>> -5> 2018-03-15 16:42:18.004249 7fe661601700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fe63b776700' had timed out after 60
>> -4> 2018-03-15 16:42:18.004251 7fe661601700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe654854700' had timed out after 60
>> -3> 2018-03-15 16:42:18.004256 7fe661601700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe654854700' had suicide timed out after 180
>> -2> 2018-03-15 16:42:18.004462 7fe6605ff700 5 -- 10.130.113.25:6811/2942118 >> 10.130.113.33:0/3348055 conn(0x55e3eb599800 :6811 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1937 cs=1 l=1). rx osd.169 seq 74633 0x55e3eef6d200 osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4
>> -1> 2018-03-15 16:42:18.004502 7fe6605ff700 1 -- 10.130.113.25:6811/2942118 <== osd.169 10.130.113.33:0/3348055 74633 ==== osd_ping(ping e93182 stamp 2018-03-15 16:42:17.998828) v4 ==== 2004+0+0 (2306332339 0 0) 0x55e3eef6d200 con 0x55e3eb599800
>> 0> 2018-03-15 16:42:18.015185 7fe654854700 -1 *** Caught signal (Aborted) **
>> in thread 7fe654854700 thread_name:tp_fstore_op
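
Coming back to the memory question at the top of the thread: bluestore manages its own cache in user space instead of leaning on the kernel page cache the way filestore does, so the per-OSD footprint is more visible and grows further during recovery and backfill. A back-of-the-envelope sketch for a node like [1], using the Luminous-era per-OSD cache knob; the option name is taken from the Luminous documentation and the value is only an example, so check both against the release actually in use:

    # ceph.conf sketch (illustrative values, not a recommendation)
    [osd]
    # per-OSD bluestore cache for HDD-backed OSDs
    bluestore_cache_size_hdd = 1073741824   # 1 GiB

    # rough budget for one of these nodes:
    #   15 OSDs x 10TB at the usual ~1GB RAM per TB rule of thumb
    #   is on the order of 150GB, which is why 128GB per node looks
    #   tight once recovery or backfilling kicks in.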

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com