Re: [ceph-users] Backfilling on Luminous

2018-03-30 Thread Pavan Rallabhandi
‘expected_num_objects’ at the time of pool creation, be aware of this fix: http://tracker.ceph.com/issues/22530. Thanks, -Pavan. From: David Turner, Date: Tuesday, March 20, 2018 at 1:50 PM, To: Pavan Rallabhandi, Cc: ceph-users, Subject: EXT: Re: [ceph-users] Backfilling on Luminous. @Pavan, I did not know
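
Not spelled out in the thread, but for readers who want the shape of it: a minimal sketch of passing expected_num_objects at pool creation on a Luminous-era cluster, so filestore pre-creates its directory tree instead of splitting during backfill. The pool name, PG counts and object count below are placeholders, and the linked tracker describes a bug in how the value was interpreted.

    # Illustrative only: pre-declare the expected object count at pool creation.
    # Syntax: ceph osd pool create <name> <pg_num> <pgp_num> [replicated] [crush-rule] [expected-num-objects]
    ceph osd pool create mypool 1024 1024 replicated replicated_rule 500000000

    # Confirm the pool was created
    ceph osd pool ls detail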

Re: [ceph-users] Backfilling on Luminous

2018-03-20 Thread David Turner
ry that too. Thanks, -Pavan. From: ceph-users on behalf of David Turner, Date: Monday, March 19, 2018 at 1:36 PM, To: Caspar Smit, Cc: ceph-users, Subject: EXT: Re: [ceph-users] Backfilling on Luminous

Re: [ceph-users] Backfilling on Luminous

2018-03-19 Thread Pavan Rallabhandi
Sorry for being away. I set all of my backfilling to VERY slow settings over the weekend and things have been stable, but incredibly slow (1% recovery from 3% misplaced to 2% all weekend). I'm back on it now and well rested. @Caspar, SWAP

Re: [ceph-users] Backfilling on Luminous

2018-03-19 Thread David Turner
Sorry for being away. I set all of my backfilling to VERY slow settings over the weekend and things have been stable, but incredibly slow (1% recovery from 3% misplaced to 2% all weekend). I'm back on it now and well rested. @Caspar, SWAP isn't being used on these nodes and all of the affected OS
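
The exact "VERY slow settings" aren't listed in the thread; as a rough illustration, backfill is usually throttled on Luminous with something like the following (values are placeholders):

    # Throttle backfill/recovery cluster-wide; numbers are examples only.
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.5'

    # Check what a given OSD is actually using (run on that OSD's host)
    ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'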

Re: [ceph-users] Backfilling on Luminous

2018-03-16 Thread Caspar Smit
Hi David, What about memory usage? [1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM. If you upgrade to bluestore, memory usage will likely increase. 15x 10TB ≈ 150GB RAM needed, especially in recovery
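
The 150GB figure follows the usual rule of thumb of roughly 1GB of RAM per TB of OSD capacity (15 x 10TB ≈ 150GB against the 128GB installed). On Luminous bluestore the per-OSD cache can also be capped explicitly; the value below is illustrative, not from the thread:

    # /etc/ceph/ceph.conf (example value)
    [osd]
    # Cap bluestore's per-OSD cache on HDD-backed OSDs to 1 GiB
    bluestore_cache_size_hdd = 1073741824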

Re: [ceph-users] Backfilling on Luminous

2018-03-15 Thread Dan van der Ster
Did you use perf top or iotop to try to identify where the osd is stuck? Did you try increasing the op thread suicide timeout from 180s? Splitting should log at the beginning and end of an op, so it should be clear if it's taking longer than the timeout. .. Dan On Mar 15, 2018 9:23 PM, "David
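
A sketch of the checks Dan is suggesting; the OSD id, PID and timeout values are placeholders:

    # Where is the stuck ceph-osd spending CPU time?
    perf top -p <ceph-osd-pid>

    # Or is it blocked on disk I/O?
    iotop -o

    # Raise the suicide timeouts while splitting/backfill runs (600s is arbitrary)
    ceph tell osd.12 injectargs '--osd-op-thread-suicide-timeout 600 --filestore-op-thread-suicide-timeout 600'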

Re: [ceph-users] Backfilling on Luminous

2018-03-15 Thread David Turner
I am aware of the filestore splitting happening. I manually split all of the subfolders a couple weeks ago on this cluster, but every time we have backfilling, the newly moved PGs have a chance to split before the backfilling is done. When that has happened in the past it has caused some blocked requests
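
For context (my summary, not from the thread): filestore splits a PG subfolder once its object count passes filestore_split_multiple * abs(filestore_merge_threshold) * 16, and the split can be forced ahead of time on a stopped OSD. A sketch, with the OSD id and pool name as placeholders:

    # e.g. split_multiple=2, merge_threshold=+/-10 -> split at roughly 320 objects per subdir

    # Pre-split collections offline so it doesn't happen mid-backfill
    systemctl stop ceph-osd@12
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --op apply-layout-settings --pool <poolname>
    systemctl start ceph-osd@12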

Re: [ceph-users] Backfilling on Luminous

2018-03-15 Thread Dan van der Ster
Hi, Do you see any split or merge messages in the osd logs? I recall some surprise filestore splitting on a few osds after the luminous upgrade. .. Dan On Mar 15, 2018 6:04 PM, "David Turner" wrote: I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last week I added 2 nodes t
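
A quick way to check for what Dan describes, assuming default log locations; split detail may only appear at a higher debug_filestore level:

    # Look for split/merge activity in the OSD logs
    grep -iE 'split|merge' /var/log/ceph/ceph-osd.*.log

    # Temporarily turn up filestore logging on one OSD if nothing shows
    ceph tell osd.12 injectargs '--debug-filestore 10'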

Re: [ceph-users] Backfilling on Luminous

2018-03-15 Thread David Turner
We haven't used jemalloc for anything. The only thing in our /etc/sysconfig/ceph configuration is increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES. I didn't see anything in dmesg on one of the recent hosts that had an osd segfault. I looked at your ticket and that looks like something with PGs b
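
For reference, that tcmalloc tuning normally lives in /etc/sysconfig/ceph as a single variable; the value below is an example, not the one from David's cluster, and OSDs need a restart to pick it up:

    # /etc/sysconfig/ceph (example: 256 MiB thread cache)
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456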

Re: [ceph-users] Backfilling on Luminous

2018-03-15 Thread Jan Marquardt
Hi David, On 15.03.18 at 18:03, David Turner wrote: > I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last > week I added 2 nodes to the cluster. The backfilling has been > ATROCIOUS. I have OSDs consistently [2] segfaulting during recovery. > There's no pattern of which OSDs

Re: [ceph-users] Backfilling on Luminous

2018-03-15 Thread Cassiano Pilipavicius
Hi David, for me something similar happened when I upgraded from jewel to luminous, and I discovered that the problem was the memory allocator. I had tried switching to JEMAlloc in jewel to improve performance, and when I upgraded to bluestore in luminous my osds started to crash. I've commented
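
The jemalloc switch described here is typically an LD_PRELOAD line in /etc/sysconfig/ceph; a sketch of what disabling it again looks like (the library path varies by distro):

    # /etc/sysconfig/ceph
    # Commented out again after bluestore OSDs started crashing under jemalloc
    #LD_PRELOAD=/usr/lib64/libjemalloc.so.1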

[ceph-users] Backfilling on Luminous

2018-03-15 Thread David Turner
I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last week I added 2 nodes to the cluster. The backfilling has been ATROCIOUS. I have OSDs consistently [2] segfaulting during recovery. There's no pattern of which OSDs are segfaulting, which hosts have segfaulting OSDs, etc... It
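
Not part of the original message, but a common way to gather the segfault evidence being described, assuming default log paths (osd.12 is a placeholder):

    # Which OSD logs recorded a crash, and what does the backtrace say?
    grep -l "Segmentation fault" /var/log/ceph/ceph-osd.*.log
    grep -A 30 "Caught signal (Segmentation fault)" /var/log/ceph/ceph-osd.12.log

    # Kernel-side record of the same crashes
    dmesg -T | grep -i segfault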