Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Oliver Freyermuth
Hi Stijn, 

Am 26.02.2018 um 07:58 schrieb Stijn De Weirdt:
> hi oliver,
> 
 in preparation for production, we have run very successful tests with 
 large sequential data,
 and just now a stress-test creating many small files on CephFS. 

 We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool 
 with 6 hosts with 32 OSDs each, running in EC k=4 m=2. 
 Compression is activated (aggressive, snappy). All Bluestore, LVM, 
 Luminous 12.2.3. 
> (this is all afaik;) so with EC k=4, small files get cut in 4 smaller
> parts. i'm not sure when the compression is applied, but your small
> files might be very small files before they get cut in 4 tiny parts. this
> might become pure iops wrt performance.
> with filestore (and without compression), this was quite awful. we have
> not retested with bluestore yet, but in the end a disk is just a disk.
> writing 1 file results in 6 disk writes, so you need a lot of iops and/or
> disks.
> 
> <...>

Thanks for these hints! 
I think in our case, the high number of disks / OSDs saves us from really 
noticing this. 
At least, checking with iotop / iostat during the stress testing, I saw mostly 
no disk activity on the OSDs; the MDS
was the main bottleneck. 
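
For reference, a quick way to watch this next time (the MDS daemon name below is
just a placeholder, and this assumes the admin socket is reachable on the MDS host)
would be something like:

  # live view of the MDS perf counters (request rate, cache items, journal)
  ceph daemonperf mds.mds1
  # one-off snapshot of the cache usage relative to mds_cache_memory_limit
  ceph daemon mds.mds1 cache status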

> 
 In parallel, I had reinstalled one OSD host. 
 It was backfilling well, but now, <24 hours later, before backfill has 
 finished, several OSD hosts enter OOM condition. 
 Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the 
 default bluestore cache size of 1 GB. However, it seems the processes are 
 using much more,
 up to several GBs until memory is exhausted. They then become sluggish, 
 are kicked out of the cluster, come back, and finally at some point they 
 are OOMed. 
>>>
>>> 32GB RAM for MDS, 64GB RAM for 32 OSDs per node looks very low on memory 
>>> requirements for the scale you are trying. what is the size of each osd 
>>> device?
>>> Could you also dump osd tree + more cluster info in the tracker you raised, 
>>> so that one could try to recreate at a lower scale and check.
>>
>> Done! 
>> All HDD-OSDs have 4 TB, while the SSDs used for the metadata pool have 240 
>> GB. 
> the rule of thumb is 1GB per 1 TB. that is a lot (and imho one of the
> bad things about ceph, but i'm not complaining ;)
> most of the time this memory will not be used except for cache, but eg
> recovery is one of the cases where it is used, and thus needed.
> 
> i have no idea what the real requirements are (i assume there's some
> fixed amount per OSD and the rest is linear(?) with volume). so you can
> try to use some softraid on the disks to reduce the number of OSDs per
> host; but i doubt that the fixed part is over 50%, so you will probably
> end up with having to add some memory or not use certain disks. i don't
> know if you can limit the amount of volume per disk, eg only use 2TB of
> a 4TB disk, because then you can keep the iops.

It would likely be possible to double the RAM of the OSD hosts at an affordable 
price 
(only half of the DIMM slots are occupied - we already planned for the future, 
just did not expect this to be necessary so quickly). 
This would give us 128 GB per OSD host, which matches the 32*4 TB = 128 TB of disk, 
i.e. 1 GB of RAM per 1 TB of disk. 

For the MDSes, the same is true: we could upgrade them to 64 GB or 96 GB 
without throwing away existing DIMMs. 
Potentially, we could even go for the 128 GB Linh has in the HPC setup 
and move the small DIMMs from the MDSes to the OSD hosts. 
We'll discuss... 

Many thanks for your very valuable input! 

Cheers,
Oliver


> 
> stijn
> 
>> We had initially planned to use something more lightweight on CPU and RAM 
>> (BeeGFS or Lustre),
>> but since we encountered serious issues with BeeGFS, have some bad past 
>> experience with Lustre (but it was an old version)
>> and were really happy with the self-healing features of Ceph, which also 
>> allow us to reinstall OSD hosts during an upgrade without 
>> downtime,
>> we have decided to repurpose the hardware. For this reason, the RAM is not 
>> really optimized (yet) for Ceph. 
>> We will try to adapt hardware now as best as possible. 
>>
>> Are there memory recommendations for a setup of this size? Anything's 
>> welcome. 
>>
>> Cheers and thanks!
>>  Oliver
>>
>>>

 Now, I have restarted some OSD processes and hosts which helped to reduce 
 the memory usage - but now I have some OSDs crashing continuously, 
 leading to PG unavailability, and preventing recovery from completion. 
 I have reported a ticket about that, with stacktrace and log:
 http://tracker.ceph.com/issues/23120
 This might well be a consequence of a previous OOM killer condition. 

 However, my final question after these ugly experiences is: 
 Did somebody ever stress-test CephFS with many small files? 
 Are those issues known? Can special configuration help? 
 Are the memory issues

Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Oliver Freyermuth
I second Stijn's question for more details, also on the stress testing. 

Did you "only" have each node write 2M of files per directory, or each "job", 
i.e. nodes*(number of cores per node) processes? 
Do you have monitoring of the memory usage? Is the large amount of RAM actually 
used on the MDS? 
Did you increase the mds_cache_memory_limit setting? 
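
(For reference, on Luminous the current value can be checked and changed at runtime
via the admin socket / injectargs - the MDS name and the 4 GB value below are just
placeholder examples:)

  # on the MDS host: show the current limit in bytes
  ceph daemon mds.mds1 config get mds_cache_memory_limit
  # raise it at runtime, e.g. to 4 GB
  ceph tell mds.mds1 injectargs '--mds_cache_memory_limit=4294967296'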

Am 26.02.2018 um 08:15 schrieb Linh Vu:
> Sounds like you just need more RAM on your MDS. Ours have 256GB each, and the 
> OSD nodes have 128GB each. Networking is 2x 25GbE.  
> 
> 
> We are on luminous 12.2.1, bluestore, and use CephFS for HPC, with about 
> 500-ish compute nodes. We have done stress testing with small files up to 2M 
> per directory as part of our acceptance testing, and encountered no problem.
> 
> --
> *From:* ceph-users  on behalf of Oliver 
> Freyermuth 
> *Sent:* Monday, 26 February 2018 3:45:59 AM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] CephFS very unstable with many small files
>  
> Dear Cephalopodians,
> 
> in preparation for production, we have run very successful tests with large 
> sequential data,
> and just now a stress-test creating many small files on CephFS.
> 
> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 
> hosts with 32 OSDs each, running in EC k=4 m=2.
> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 
> 12.2.3.
> There are (at the moment) only two MDS's, one is active, the other standby.
> 
> For the test, we had 1120 client processes on 40 client machines (all 
> cephfs-fuse!) extract a tarball with 150k small files
> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a 
> separate subdirectory.
> 
> Things started out rather well (but expectedly slow), we had to increase
> mds_log_max_segments => 240
> mds_log_max_expiring => 160
> due to https://github.com/ceph/ceph/pull/18624
> and adjusted mds_cache_memory_limit to 4 GB.
> 
> Even though the MDS machine has 32 GB, it is also running 2 OSDs (for 
> metadata) and so we have been careful with the cache
> (e.g. due to http://tracker.ceph.com/issues/22599 ).
> 
> After a while, we tested MDS failover and realized we entered a flip-flop 
> situation between the two MDS nodes we have.
> Increasing mds_beacon_grace to 240 helped with that.
> 
> Now, with about 100,000,000 objects written, we are in a disaster situation.
> First off, the MDS could not restart anymore - it required >40 GB of memory, 
> which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
> So it tried to recover and OOMed quickly after. Replay was reasonably fast, 
> but rejoin took many minutes:
> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
> and finally, 5 minutes later, OOM.
> 
> I stopped half of the stress-test tar's, which did not help - then I rebooted 
> half of the clients, which did help and let the MDS recover just fine.
> So it seems the client caps have been too many for the MDS to handle. I'm 
> unsure why "tar" would cause so many open file handles.
> Is there anything that can be configured to prevent this from happening?
> Now, I only lost some "stress test data", but later, it might be user's 
> data...
> 
> 
> In parallel, I had reinstalled one OSD host.
> It was backfilling well, but now, <24 hours later, before backfill has 
> finished, several OSD hosts enter OOM condition.
> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the 
> default bluestore cache size of 1 GB. However, it seems the processes are 
> using much more,
> up to several GBs until memory is exhausted. They then become sluggish, are 
> kicked out of the cluster, come back, and finally at some point they are 
> OOMed.
> 
> Now, I have restarted some OSD processes and hosts which helped to reduce the 
> memory usage - but now I have some OSDs crashing continuously,
> leading to PG unavailability, and preventing 

Re: [ceph-users] how to fix X is an unexpected clone

2018-02-26 Thread Saverio Proto
Hello Stefan,

ceph-object-tool does not exist on my setup, do you mean the command
/usr/bin/ceph-objectstore-tool that is installed with the ceph-osd package?

I have the following situation here in Ceph Luminous:

2018-02-26 07:15:30.066393 7f0684acb700 -1 log_channel(cluster) log
[ERR] : 5.111f shard 395 missing
5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152
2018-02-26 07:15:30.395189 7f0684acb700 -1 log_channel(cluster) log
[ERR] : deep-scrub 5.111f
5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152 is an
unexpected clone

I did not understand how you actually fixed the problem. Could you
provide more details ?

thanks

Saverio


On 08.08.17 12:02, Stefan Priebe - Profihost AG wrote:
> Hello Greg,
> 
> Am 08.08.2017 um 11:56 schrieb Gregory Farnum:
>> On Mon, Aug 7, 2017 at 11:55 PM Stefan Priebe - Profihost AG
>> mailto:s.pri...@profihost.ag>> wrote:
>>
>> Hello,
>>
>> how can i fix this one:
>>
>> 2017-08-08 08:42:52.265321 osd.20 [ERR] repair 3.61a
>> 3:58654d3d:::rbd_data.106dd406b8b4567.018c:9d455 is an
>> unexpected clone
>> 2017-08-08 08:43:04.914640 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>> pgs repair; 1 scrub errors
>> 2017-08-08 08:43:33.470246 osd.20 [ERR] 3.61a repair 1 errors, 0 fixed
>> 2017-08-08 08:44:04.915148 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>> scrub errors
>>
>> If i just delete manually the relevant files ceph is crashing. rados
>> does not list those at all?
>>
>> How can i fix this?
>>
>>
>> You've sent quite a few emails that have this story spread out, and I
>> think you've tried several different steps to repair it that have been a
>> bit difficult to track.
>>
>> It would be helpful if you could put the whole story in one place and
>> explain very carefully exactly what you saw and how you responded. Stuff
>> like manually copying around the wrong files, or files without a
>> matching object info, could have done some very strange things.
>> Also, basic debugging stuff like what version you're running will help. :)
>>
>> Also note that since you've said elsewhere you don't need this image, I
>> don't think it's going to hurt you to leave it like this for a bit
>> (though it will definitely mess up your monitoring).
>> -Greg
> 
> i'm sorry about that. You're correct.
> 
> I was able to fix this just a few minutes ago by using the
> ceph-object-tool and the remove operation to remove all left over files.
> 
> I did this on all OSDs with the problematic pg. After that ceph was able
> to fix itself.
> 
> A better approach might be that ceph can recover itself from an
> unexpected clone by just deleting it.
> 
> Greets,
> Stefan
> 


-- 
SWITCH
Saverio Proto, Peta Solutions
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
phone +41 44 268 15 15, direct +41 44 268 1573
saverio.pr...@switch.ch, http://www.switch.ch

http://www.switch.ch/stories


Re: [ceph-users] how to fix X is an unexpected clone

2018-02-26 Thread Stefan Priebe - Profihost AG
Am 26.02.2018 um 09:54 schrieb Saverio Proto:
> Hello Stefan,
> 
> ceph-object-tool does not exist on my setup, do you mean the command
> /usr/bin/ceph-objectstore-tool that is installed with the ceph-osd package?

Yes, sorry, I meant the ceph-objectstore-tool. With that you can
remove objects.

> 
> I have the following situation here in Ceph Luminous:
> 
> 2018-02-26 07:15:30.066393 7f0684acb700 -1 log_channel(cluster) log
> [ERR] : 5.111f shard 395 missing
> 5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152
> 2018-02-26 07:15:30.395189 7f0684acb700 -1 log_channel(cluster) log
> [ERR] : deep-scrub 5.111f
> 5:f88e2b07:::rbd_data.8a09fb8793c74f.6dce:23152 is an
> unexpected clone
> 
> I did not understand how you actually fixed the problem. Could you
> provide more details ?

something like:
ceph-objectstore-tool --data-path /.../osd.$OSD/ \
    --journal-path /dev/disk/by-partlabel/journal$OSD \
    rbd_data.$RBD remove-clone-metadata $CLONEID
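
Rough sketch of the full sequence, with placeholders (the OSD has to be stopped
before ceph-objectstore-tool can open its store, the removal has to be repeated on
every OSD that holds the problematic PG, and the default data/journal paths shown
here may differ from your layout):

  ceph osd set noout
  systemctl stop ceph-osd@$OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD \
      --journal-path /dev/disk/by-partlabel/journal$OSD \
      rbd_data.$RBD remove-clone-metadata $CLONEID
  systemctl start ceph-osd@$OSD
  ceph osd unset noout
  # afterwards, let a repair / deep-scrub re-check the PG, e.g.
  ceph pg repair 5.111f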

> 
> thanks
> 
> Saverio
> 
> 
> On 08.08.17 12:02, Stefan Priebe - Profihost AG wrote:
>> Hello Greg,
>>
>> Am 08.08.2017 um 11:56 schrieb Gregory Farnum:
>>> On Mon, Aug 7, 2017 at 11:55 PM Stefan Priebe - Profihost AG
>>> mailto:s.pri...@profihost.ag>> wrote:
>>>
>>> Hello,
>>>
>>> how can i fix this one:
>>>
>>> 2017-08-08 08:42:52.265321 osd.20 [ERR] repair 3.61a
>>> 3:58654d3d:::rbd_data.106dd406b8b4567.018c:9d455 is an
>>> unexpected clone
>>> 2017-08-08 08:43:04.914640 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>>> pgs repair; 1 scrub errors
>>> 2017-08-08 08:43:33.470246 osd.20 [ERR] 3.61a repair 1 errors, 0 fixed
>>> 2017-08-08 08:44:04.915148 mon.0 [INF] HEALTH_ERR; 1 pgs inconsistent; 1
>>> scrub errors
>>>
>>> If i just delete manually the relevant files ceph is crashing. rados
>>> does not list those at all?
>>>
>>> How can i fix this?
>>>
>>>
>>> You've sent quite a few emails that have this story spread out, and I
>>> think you've tried several different steps to repair it that have been a
>>> bit difficult to track.
>>>
>>> It would be helpful if you could put the whole story in one place and
>>> explain very carefully exactly what you saw and how you responded. Stuff
>>> like manually copying around the wrong files, or files without a
>>> matching object info, could have done some very strange things.
>>> Also, basic debugging stuff like what version you're running will help. :)
>>>
>>> Also note that since you've said elsewhere you don't need this image, I
>>> don't think it's going to hurt you to leave it like this for a bit
>>> (though it will definitely mess up your monitoring).
>>> -Greg
>>
>> i'm sorry about that. You're correct.
>>
>> I was able to fix this just a few minutes ago by using the
>> ceph-object-tool and the remove operation to remove all left over files.
>>
>> I did this on all OSDs with the problematic pg. After that ceph was able
>> to fix itself.
>>
>> A better approach might be that ceph can recover itself from an
>> unexpected clone by just deleting it.
>>
>> Greets,
>> Stefan
>>
> 
> 


Re: [ceph-users] MDS crash Luminous

2018-02-26 Thread David C
Thanks for the tips, John. I'll increase the debug level as suggested.

On 25 Feb 2018 20:56, "John Spray"  wrote:

> On Sat, Feb 24, 2018 at 10:13 AM, David C  wrote:
> > Hi All
> >
> > I had an MDS go down on a 12.2.1 cluster, the standby took over but I
> don't
> > know what caused the issue. Scrubs are scheduled to start at 23:00 on
> this
> > cluster but this appears to have started a minute before.
> >
> > Can anyone help me with diagnosing this please. Here's the relevant bit
> from
> > the MDS log:
>
> The messages about the heartbeat map not being healthy are a sign that
> somewhere in the MDS a thread is getting stuck and not letting others
> get in there to do work.  The daemon responds to that by stopping
> sending beacons to the monitors, who in turn blacklist the misbehaving
> MDS daemon.
>
> You'll have a better shot at working out what got jammed up if "debug
> mds" is set to something like 7, or if this is happening predictably
> at 22:59:30 you could even attach gdb to the running process and grab
> a backtrace of all threads.
>
> John
>
> > 2018-02-23 22:59:30.702915 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:32.960228 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:34.703001 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:02.702284 7f26e0612700  1 heartbeat_map
> > is_healthy 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:02.702334 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:02.959726 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:06.702354 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:06.702366 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:07.959804 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:10.702421 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:10.702434 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:12.959876 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:14.702522 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:14.702535 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:17.959985 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:18.702645 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:18.702670 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:22.702742 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:22.702754 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:22.960063 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:26.702841 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:26.702854 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:27.960141 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:30.702903 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:34.703014 7f26e0612700  1 mds.beacon.mdshostname _send skipping beacon,
> > heartbeat map not healthy
> > 2018-02-23 22:59:37.960301 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:38.703063 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:38.703075 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:42.703147 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:42.703160 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:42.960414 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:46.703209 7f26e0612700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:46.703222 7f26e0612700  1 mds.beacon.mdshostname _send
> > skipping beacon, heartbeat map not healthy
> > 2018-02-23 22:59:47.960487 7f26e461a700  1 heartbeat_map is_healthy
> > 'MDSRank' had timed out after 15
> > 2018-02-23 22:59:50.703305 7f26e0612700  1 hear

Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Oliver Freyermuth
Dear Cephalopodians,

I have to extend my question a bit - in our system with 105,000,000 objects in 
CephFS (mostly stabilized now after the stress-testing...),
I observe the following data distribution for the metadata pool:
# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS 
  0   ssd 0.21829  1.0  223G  9927M  213G  4.34 0.79   0 
  1   ssd 0.21829  1.0  223G  9928M  213G  4.34 0.79   0 
  2   ssd 0.21819  1.0  223G 77179M  148G 33.73 6.11 128 
  3   ssd 0.21819  1.0  223G 76981M  148G 33.64 6.10 128

osd.0 - osd.3 are all exclusively meant for cephfs-metadata, currently we use 4 
replicas with failure domain OSD there. 
I have reinstalled and reformatted osd.0 and osd.1 about 36 hours ago. 

All 128 PGs in the metadata pool are backfilling (I have increased 
osd-max-backfills temporarily to speed things up for those OSDs). 
However, they only managed to backfill < 10 GB in those 36 hours. I have not 
touched any other of the default settings concerning backfill
or recovery (but these are SSDs, so sleeps should be 0). 
The backfilling seems not to be limited by CPU, network, or disks. 
"ceph -s" confirms a backfill performance of about 60-100 keys/s. 
This metadata, as written before, is almost exclusively RocksDB:

"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 84760592384,
"db_used_bytes": 77289488384,

Is it normal that this kind of backfilling is so horrendously slow? Is there a 
way to speed it up? 
At this rate, it will take almost two weeks for 77 GB of (meta)data. 
Right now, the system is still in the testing phase, but we'd of course like to 
be able to add more MDSes and SSDs later without extensive backfilling 
periods. 
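
For the record, the knobs I am aware of for this look like the following (values
are only examples, osd.0 - osd.3 being the metadata SSDs; osd_recovery_sleep_ssd
should already default to 0 for SSD-backed OSDs):

  ceph tell osd.0 injectargs '--osd-max-backfills 16 --osd-recovery-max-active 16'
  # on the OSD host: verify the recovery sleep really is 0
  ceph daemon osd.0 config get osd_recovery_sleep_ssd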

Cheers,
Oliver

Am 25.02.2018 um 19:26 schrieb Oliver Freyermuth:
> Dear Cephalopodians,
> 
> as part of our stress test with 100,000,000 objects (all small files) we 
> ended up with
> the following usage on the OSDs on which the metadata pool lives:
> # ceph osd df | head
> ID  CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS 
> [...]
>   2   ssd 0.21819  1.0  223G 79649M  145G 34.81 6.62 128 
>   3   ssd 0.21819  1.0  223G 79697M  145G 34.83 6.63 128
> 
> The cephfs-data cluster is mostly empty (5 % usage), but contains 100,000,000 
> small objects. 
> 
> Looking with:
> ceph daemon osd.2 perf dump
> I get:
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 84760592384,
> "db_used_bytes": 78920024064,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 0,
> "slow_used_bytes": 0,
> so it seems this is almost exclusively RocksDB usage. 
> 
> Is this expected? 
> Is there a recommendation on how much MDS storage is needed for a CephFS with 
> 450 TB? 
> 
> Cheers,
>   Oliver
> 
> 
> 
> 







Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-26 Thread Caspar Smit
2018-02-24 7:10 GMT+01:00 David Turner :

> Caspar, it looks like your idea should work. Worst case scenario seems
> like the osd wouldn't start, you'd put the old SSD back in and go back to
> the idea to weight them to 0, backfilling, then recreate the osds.
> Definitely worth a try in my opinion, and I'd love to hear your experience
> after.
>
>
Hi David,

First of all, thank you for ALL your answers on this ML, you're really
putting a lot of effort into answering many questions asked here and very
often they contain invaluable information.


To follow up on this post I went out and built a very small (proxmox)
cluster (3 OSDs per host) to test my suggestion of cloning the DB/WAL SSD.
And it worked!
Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSDs)

Here's what i did on 1 node:

1) ceph osd set noout
2) systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2
3) ddrescue -f -n -vv <old SSD device> <new SSD device> /root/clone-db.log
4) removed the old SSD physically from the node
5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
6) ceph osd unset noout

I assume that once the ddrescue step is finished a 'partprobe' or something
similar is triggered and udev finds the DB partitions on the new SSD and
starts the OSD's again (kind of what happens during hotplug)
So it is probably better to clone the SSD in another (non-ceph) system to
not trigger any udev events.
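
(If you want to double-check before the OSDs come back up, something like the
following should confirm that the block.db symlinks resolve to partitions on the
new SSD - the OSD ids and the device name are placeholders:)

  for i in 0 1 2; do readlink -f /var/lib/ceph/osd/ceph-$i/block.db; done
  partprobe /dev/sdX   # only if udev did not pick up the new partition table itself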

I also tested a reboot after this and everything still worked.


The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
Delta of data was very low because it was a test cluster.

All in all the OSDs in question were 'down' for only 5 minutes (so I
stayed within the mon_osd_down_out_interval default of 10 minutes and
didn't actually need to set noout :)

Kind regards,
Caspar



> Nico, it is not possible to change the WAL or DB size, location, etc after
> osd creation. If you want to change the configuration of the osd after
> creation, you have to remove it from the cluster and recreate it. There is
> no similar functionality to how you could move, recreate, etc filesystem
> osd journals. I think this might be on the radar as a feature, but I don't
> know for certain. I definitely consider it to be a regression of bluestore.
>
>
>
>
> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> A very interesting question and I would add the follow up question:
>>
>> Is there an easy way to add an external DB/WAL devices to an existing
>> OSD?
>>
>> I suspect that it might be something on the lines of:
>>
>> - stop osd
>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>> - (maybe run some kind of osd mkfs ?)
>> - start osd
>>
>> Has anyone done this so far or recommendations on how to do it?
>>
>> Which also makes me wonder: what is actually the format of WAL and
>> BlockDB in bluestore? Is there any documentation available about it?
>>
>> Best,
>>
>> Nico
>>
>>
>> Caspar Smit  writes:
>>
>> > Hi All,
>> >
>> > What would be the proper way to preventively replace a DB/WAL SSD (when
>> it
>> > is nearing it's DWPD/TBW limit and not failed yet).
>> >
>> > It hosts DB partitions for 5 OSD's
>> >
>> > Maybe something like:
>> >
>> > 1) ceph osd reweight 0 the 5 OSD's
>> > 2) let backfilling complete
>> > 3) destroy/remove the 5 OSD's
>> > 4) replace SSD
>> > 5) create 5 new OSD's with seperate DB partition on new SSD
>> >
>> > When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved so
>> i
>> > thought maybe the following would work:
>> >
>> > 1) ceph osd set noout
>> > 2) stop the 5 OSD's (systemctl stop)
>> > 3) 'dd' the old SSD to a new SSD of same or bigger size
>> > 4) remove the old SSD
>> > 5) start the 5 OSD's (systemctl start)
>> > 6) let backfilling/recovery complete (only delta data between OSD stop
>> and
>> > now)
>> > 6) ceph osd unset noout
>> >
>> > Would this be a viable method to replace a DB SSD? Any udev/serial
>> nr/uuid
>> > stuff preventing this to work?
>> >
>> > Or is there another 'less hacky' way to replace a DB SSD without moving
>> too
>> > much data?
>> >
>> > Kind regards,
>> > Caspar
>>
>>
>> --
>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>
>


[ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Dear Cephalopodians,

in the few remaining days when we can still play at our will with parameters,
we just now tried to set:
ceph osd pool set cephfs_data fast_read 1
but did not notice any effect on sequential, large file read throughput on our 
k=4 m=2 EC pool. 

Should this become active immediately? Or do OSDs need a restart first? 
Is the option already deemed safe? 

Or is it just that we should not expect any change on throughput, since our 
system (for large sequential reads)
is purely limited by the IPoIB throughput, and the shards are nevertheless 
requested by the primary OSD? 
So the gain would not be in throughput, but the reply to the client would be 
slightly faster (before all shards have arrived)? 
Then this option would be mainly of interest if the disk IO was congested 
(which does not happen for us as of yet)
and not help so much if the system is limited by network bandwidth. 
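
For completeness, whether the flag is actually set can at least be verified at any
time, without any restart, with:

  ceph osd pool get cephfs_data fast_read

though that of course does not tell whether the OSDs already act on it.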

Cheers,
Oliver





Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Some additional information gathered from our monitoring:
It seems fast_read does indeed become active immediately, but I do not 
understand the effect. 

With fast_read = 0, we see:
~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
~ 2.3 GB/s total incoming traffic to all 6 OSD hosts

With fast_read = 1, we see:
~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
~ 3   GB/s total incoming traffic to all 6 OSD hosts

I would have expected exactly the contrary to happen... 

Cheers,
Oliver

Am 26.02.2018 um 12:51 schrieb Oliver Freyermuth:
> Dear Cephalopodians,
> 
> in the few remaining days when we can still play at our will with parameters,
> we just now tried to set:
> ceph osd pool set cephfs_data fast_read 1
> but did not notice any effect on sequential, large file read throughput on 
> our k=4 m=2 EC pool. 
> 
> Should this become active immediately? Or do OSDs need a restart first? 
> Is the option already deemed safe? 
> 
> Or is it just that we should not expect any change on throughput, since our 
> system (for large sequential reads)
> is purely limited by the IPoIB throughput, and the shards are nevertheless 
> requested by the primary OSD? 
> So the gain would not be in throughput, but the reply to the client would be 
> slightly faster (before all shards have arrived)? 
> Then this option would be mainly of interest if the disk IO was congested 
> (which does not happen for us as of yet)
> and not help so much if the system is limited by network bandwidth. 
> 
> Cheers,
>   Oliver
> 
> 
> 
> 






Re: [ceph-users] How to correctly purge a "ceph-volume lvm" OSD

2018-02-26 Thread Alfredo Deza
On Sat, Feb 24, 2018 at 1:26 PM, Oliver Freyermuth
 wrote:
> Dear Cephalopodians,
>
> when purging a single OSD on a host (created via ceph-deploy 2.0, i.e. using 
> ceph-volume lvm), I currently proceed as follows:
>
> On the OSD-host:
> $ systemctl stop ceph-osd@4.service
> $ ls -la /var/lib/ceph/osd/ceph-4
> # Check block and block.db links:
> lrwxrwxrwx.  1 ceph ceph   93 23. Feb 01:28 block -> 
> /dev/ceph-69b1fbe5-f084-4410-a99a-ab57417e7846/osd-block-cd273506-e805-40ac-b23d-c7b9ff45d874
> lrwxrwxrwx.  1 root root   43 23. Feb 01:28 block.db -> 
> /dev/ceph-osd-blockdb-ssd-1/db-for-disk-sda
> # resolve actual underlying device:
> $ pvs | grep ceph-69b1fbe5-f084-4410-a99a-ab57417e7846
>   /dev/sda   ceph-69b1fbe5-f084-4410-a99a-ab57417e7846 lvm2 a--  <3,64t    0
> # Zap the device:
> $ ceph-volume lvm zap --destroy /dev/sda
>
> Now, on the mon:
> # purge the OSD:
> $ ceph osd purge osd.4 --yes-i-really-mean-it
>
> Then I re-deploy using:
> $ ceph-deploy --overwrite-conf osd create --bluestore --block-db 
> ceph-osd-blockdb-ssd-1/db-for-disk-sda --data /dev/sda osd001
>
> from the admin-machine.
>
> This works just fine, however, it leaves a stray ceph-volume service behind:
> $ ls -la /etc/systemd/system/multi-user.target.wants/ -1 | grep 
> ceph-volume@lvm-4
> lrwxrwxrwx.  1 root root   44 24. Feb 18:30 
> ceph-volume@lvm-4-5a984083-48e1-4c2f-a1f3-3458c941e597.service -> 
> /usr/lib/systemd/system/ceph-volume@.service
> lrwxrwxrwx.  1 root root   44 23. Feb 01:28 
> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service -> 
> /usr/lib/systemd/system/ceph-volume@.service
>
> This stray service then, after reboot of the machine, stays in activating 
> state (since the disk will of course never come back):
> ---
> $ systemctl status 
> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
> ● ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service - Ceph 
> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>Loaded: loaded (/usr/lib/systemd/system/ceph-volume@.service; enabled; 
> vendor preset: disabled)
>Active: activating (start) since Sa 2018-02-24 19:21:47 CET; 1min 12s ago
>  Main PID: 1866 (timeout)
>CGroup: 
> /system.slice/system-ceph\x2dvolume.slice/ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>├─1866 timeout 1 /usr/sbin/ceph-volume-systemd 
> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>└─1872 /usr/bin/python2.7 /usr/sbin/ceph-volume-systemd 
> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>
> Feb 24 19:21:47 osd001.baf.physik.uni-bonn.de systemd[1]: Starting Ceph 
> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874...
> ---
> Manually, I can fix this by running:
> $ systemctl disable 
> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>
> My question is: Should I really remove that manually?
> Should "ceph-volume lvm zap --destroy" have taken care of it (bug)?

You should remove it manually. The problem with zapping is that we
might not have the information we need to remove the systemd unit.
Since an OSD can be made out of different devices, ceph-volume might
be asked to "zap" a device for which it cannot determine which OSD it
belongs to. The systemd units are tied to the ID and UUID of the OSD.
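
A sketch for cleaning them up by hand (the id/uuid below are just the ones from
your example and will differ per OSD):

  # list any leftover ceph-volume activation units for the old OSD id
  systemctl list-units --all 'ceph-volume@lvm-4-*'
  # disable the stale one so it no longer fires at boot
  systemctl disable ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service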



> Am I missing a step?
>
> Cheers,
> Oliver
>
>
>


Re: [ceph-users] How to correctly purge a "ceph-volume lvm" OSD

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 13:02 schrieb Alfredo Deza:
> On Sat, Feb 24, 2018 at 1:26 PM, Oliver Freyermuth
>  wrote:
>> Dear Cephalopodians,
>>
>> when purging a single OSD on a host (created via ceph-deploy 2.0, i.e. using 
>> ceph-volume lvm), I currently proceed as follows:
>>
>> On the OSD-host:
>> $ systemctl stop ceph-osd@4.service
>> $ ls -la /var/lib/ceph/osd/ceph-4
>> # Check block and block.db links:
>> lrwxrwxrwx.  1 ceph ceph   93 23. Feb 01:28 block -> 
>> /dev/ceph-69b1fbe5-f084-4410-a99a-ab57417e7846/osd-block-cd273506-e805-40ac-b23d-c7b9ff45d874
>> lrwxrwxrwx.  1 root root   43 23. Feb 01:28 block.db -> 
>> /dev/ceph-osd-blockdb-ssd-1/db-for-disk-sda
>> # resolve actual underlying device:
>> $ pvs | grep ceph-69b1fbe5-f084-4410-a99a-ab57417e7846
>>   /dev/sda   ceph-69b1fbe5-f084-4410-a99a-ab57417e7846 lvm2 a--  <3,64t    0
>> # Zap the device:
>> $ ceph-volume lvm zap --destroy /dev/sda
>>
>> Now, on the mon:
>> # purge the OSD:
>> $ ceph osd purge osd.4 --yes-i-really-mean-it
>>
>> Then I re-deploy using:
>> $ ceph-deploy --overwrite-conf osd create --bluestore --block-db 
>> ceph-osd-blockdb-ssd-1/db-for-disk-sda --data /dev/sda osd001
>>
>> from the admin-machine.
>>
>> This works just fine, however, it leaves a stray ceph-volume service behind:
>> $ ls -la /etc/systemd/system/multi-user.target.wants/ -1 | grep 
>> ceph-volume@lvm-4
>> lrwxrwxrwx.  1 root root   44 24. Feb 18:30 
>> ceph-volume@lvm-4-5a984083-48e1-4c2f-a1f3-3458c941e597.service -> 
>> /usr/lib/systemd/system/ceph-volume@.service
>> lrwxrwxrwx.  1 root root   44 23. Feb 01:28 
>> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service -> 
>> /usr/lib/systemd/system/ceph-volume@.service
>>
>> This stray service then, after reboot of the machine, stays in activating 
>> state (since the disk will of course never come back):
>> ---
>> $ systemctl status 
>> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>> ● ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service - Ceph 
>> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>>Loaded: loaded (/usr/lib/systemd/system/ceph-volume@.service; enabled; 
>> vendor preset: disabled)
>>Active: activating (start) since Sa 2018-02-24 19:21:47 CET; 1min 12s ago
>>  Main PID: 1866 (timeout)
>>CGroup: 
>> /system.slice/system-ceph\x2dvolume.slice/ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>>├─1866 timeout 1 /usr/sbin/ceph-volume-systemd 
>> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>>└─1872 /usr/bin/python2.7 /usr/sbin/ceph-volume-systemd 
>> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>>
>> Feb 24 19:21:47 osd001.baf.physik.uni-bonn.de systemd[1]: Starting Ceph 
>> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874...
>> ---
>> Manually, I can fix this by running:
>> $ systemctl disable 
>> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>>
>> My question is: Should I really remove that manually?
>> Should "ceph-volume lvm zap --destroy" have taken care of it (bug)?
> 
> You should remove it manually. The problem with zapping is that we
> might not have the information we need to remove the systemd unit.
> Since an OSD can be made out of different devices, ceph-volume might
> be asked to "zap" a device for which it cannot determine which OSD it
> belongs to. The systemd units are tied to the ID and UUID of the OSD.

Understood, thanks for the reply! 

Could this be added to the documentation at some point for all the other users 
operating the cluster manually / with ceph-deploy? 
This would likely be best to prevent others from falling into this trap ;-). 
Should I open a ticket asking for this? 

Cheers,
Oliver

> 
> 
>> Am I missing a step?
>>
>> Cheers,
>> Oliver
>>
>>
>>





[ceph-users] reweight-by-utilization reverse weight after adding new nodes?

2018-02-26 Thread Martin Palma
Hello,

from some OSDs in our cluster we got the "nearfull" warning message, so
we ran the "ceph osd reweight-by-utilization" command to better
distribute the data.

Now that we have expanded our cluster with new nodes, should we revert the
weight of the changed OSDs to 1.0?
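
For reference, the commands involved would be something like this (the osd id is a
placeholder):

  # preview what reweight-by-utilization would change right now, without applying it
  ceph osd test-reweight-by-utilization
  # reset an individual override back to 1.0
  ceph osd reweight 12 1.0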

Best,
Martin


Re: [ceph-users] Linux Distribution: Is upgrading the kernel version a good idea?

2018-02-26 Thread Massimiliano Cuttini



Not good.
I'm not worried about the time and effort.
I'm worried about having to fix this when there is no time.
Ceph is built to avoid downtime; it is not a good idea to run it on a
system with availability issues.

> It is only with switching (when installing a node); subsequent kernel
> updates should be installed without any issues

Yes, but I always get the feeling that a lot of "bad" software relies on the 
distribution in order to know in which mode to work or which features to use.

So, having a distribution with an unexpected kernel can be confusing.
Of course this is just a guess and only my paranoia.




[ceph-users] erasure-code-profile: what's "w=" ?

2018-02-26 Thread Wolfgang Lendl
hi,

I have no idea what "w=8" means and can't find any hints in docs ...
maybe someone can explain


ceph 12.2.2

# ceph osd erasure-code-profile get ec42
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
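
For context, the profile above corresponds to a create command along these lines
(as far as I understand it, "w" is the word size in bits used by the jerasure
Galois-field arithmetic, i.e. coding is done in GF(2^w); 8 is the plugin default
and there is normally no reason to set it explicitly):

  ceph osd erasure-code-profile set ec42 \
      plugin=jerasure technique=reed_sol_van k=4 m=2 w=8 \
      crush-device-class=hdd crush-failure-domain=host crush-root=default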


thx
wolfgang


Re: [ceph-users] Linux Distribution: Is upgrading the kernel version a good idea?

2018-02-26 Thread Lenz Grimmer
On 02/25/2018 01:18 PM, Massimiliano Cuttini wrote:

> Is upgrade the kernel to major version on a distribution a bad idea?
> Or is just safe as like as upgrade like any other package?
> I prefer ultra stables release instead of latest higher package.

In that case it's probably best to stick with the latest kernel that has
been released by the distributor for that particular distribution
version. Upgrading to newer kernel versions can be tricky, if they
require new userland utilities for managing new features.

> But maybe I'm in wrong thinking that latest major kernel not in the
> default repository is likely say "dev distribution" and instead is just
> a stable release as like every others.

It's in there for a reason ;) For running Ceph (on the cluster side), a
latest and greatest kernel version is not really required, as the Ceph
services run in userland anyway.

The only requirement that I could think of that might benefit from
running a more recent kernel is using the kRBD or CephFS modules, but
these are "client drivers" - no need to upgrade the kernel on your
entire cluster because of that.

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)





Re: [ceph-users] Install previous version of Ceph

2018-02-26 Thread Ronny Aasen

On 23. feb. 2018 23:37, Scottix wrote:

Hey,
We had one of our monitor servers die on us and I have a replacement 
computer now. In the meantime you have released 12.2.3, but we are 
still on 12.2.2.


We are on Ubuntu servers

I see all the binaries are in the repo, but your package cache only shows 
12.2.3; is there a reason for not keeping the previous builds for cases 
like mine?


I could do an install like
apt install ceph-mon=12.2.2

Also, how would I go about installing 12.2.2 in my scenario, since I don't 
want to update until I have this monitor running again.


Thanks,
Scott


did you figure out a solution to this ? I have the same problem now.
I assume you have to download the old version manually and install with 
dpkg -i


optionally mirror the ceph repo and build your own repo index containing 
all versions.


kind regards
Ronny Aasen



Re: [ceph-users] Migrating to new pools

2018-02-26 Thread Eugen Block

I'm following up on the rbd export/import option with a little delay.

The fact that the snapshot is not protected after the image is  
reimported is not a big problem, you could deal with that or wait for  
a fix.
But there's one major problem using this method: the VMs lose their  
rbd_children and parent data!


Although the imported VM launches successfully, it has no parent  
information. So this will eventually lead to a problem reading data  
from the parent image, I assume.


This brings up another issue: deleting glance images is now easily  
possible since the image has no clones. And if the VM loses its base  
image it probably will run into a failed state.


---cut here---
# New glance image with new VM
root@control:~ # rbd children glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap
cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk

# Parent data available
root@control:~ # rbd info  
cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk | grep parent

parent: glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap


# Export base image
root@control:~ # rbd export --export-format 2  
glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1  
/var/lib/glance/images/cirros

Exporting image: 100% complete...done.

# Export VM's disk
root@control:~ # rbd export --export-format 2  
cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk  
/var/lib/glance/images/cirros_disk

Exporting image: 100% complete...done.


# Delete VM
root@control:~ # rbd rm cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk
Removing image: 100% complete...done.

# Reimport VM's disk
root@control:~ # rbd import --export-format 2  
/var/lib/glance/images/cirros_disk  
cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk

Importing image: 100% complete...done.


# Delete glance image
root@control:~ # rbd snap unprotect  
glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap

root@control:~ # rbd snap purge glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1
Removing all snapshots: 100% complete...done.
root@control:~ # rbd rm glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1
Removing image: 100% complete...done.

# Reimport glance image
root@control:~ # rbd import --export-format 2  
/var/lib/glance/images/cirros  
glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1

Importing image: 100% complete...done.

root@control:~ # rbd snap protect  
glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap


# There are no children
root@control:~ # rbd children glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap
root@control:~ #

# VM starts successfully
root@control:~ # nova start c1
Request to start server c1 has been accepted.

# But no data in rbd_children
root@control:~ # rados -p cinder listomapvals rbd_children
root@control:~ #
---cut here---

So in conclusion, this method is not suited for OpenStack. You could  
probably consider it in case of desaster recovery for single VMs, but  
not for a whole cloud environment where you would lose all  
relationships between base images and their clones.


Regards,
Eugen


Zitat von Eugen Block :


Hi,

I created a ticket for the rbd import issue:

https://tracker.ceph.com/issues/23038

Regards,
Eugen


Zitat von Jason Dillaman :


On Fri, Feb 16, 2018 at 11:20 AM, Eugen Block  wrote:

Hi Jason,


... also forgot to mention "rbd export --export-format 2" / "rbd
import --export-format 2" that will also deeply export/import all
snapshots associated with an image and that feature is available in
the Luminous release.



thanks for that information, this could be very valuable for us. I'll have
to test that intensively, but not before next week.

But a first quick test brought up a couple of issues which I'll have to
re-check before bringing them up here.

One issue is worth mentioning, though: After I exported (rbd export
--export-format ...) a glance image and imported it back to a  
different pool
(rbd import --export-format ...) its snapshot was copied, but not  
protected.
This prevented nova from cloning the base image and left that
instance in an
error state. Protecting the snapshot manually and launching another instance
enabled nova to clone the image successfully.

Could this be worth a bug report or is it rather something I did wrong or
missed?


Definitely deserves a bug tracker ticket opened. Thanks.


I wish you all a nice weekend!

Regards
Eugen


Zitat von Jason Dillaman :


On Fri, Feb 16, 2018 at 8:08 AM, Jason Dillaman 
wrote:


On Fri, Feb 16, 2018 at 5:36 AM, Jens-U. Mozdzen  wrote:


Dear list, hello Jason,

you may have seen my message on the Ceph mailing list about RDB pool
migration - it's a common subject that pools were created in a
sub-optimum
fashion and i. e. pgnum is (not yet) reducible, so we're looking into
means
to "clone" an RBD pool into a new pool within the same cluster
(including
snapshots).

We had looked into creating a tool for this job, but soon noticed that
we're
duplicating basic functionality of rbd-mirror. So we tested the
following,
which worked out nicely:

- create a test cluster (Ceph cluster plus an Openstack cl

Re: [ceph-users] Migrating to new pools

2018-02-26 Thread Jason Dillaman
On Mon, Feb 26, 2018 at 9:56 AM, Eugen Block  wrote:
> I'm following up on the rbd export/import option with a little delay.
>
> The fact that the snapshot is not protected after the image is reimported is
> not a big problem, you could deal with that or wait for a fix.
> But there's one major problem using this method: the VMs lose their
> rbd_children and parent data!

Correct -- the images are "flattened". The data is consistent but you
lose any savings from the re-use of non-overwritten parent data.

> Although the imported VM launches successfully, it has no parent
> information. So this will eventually lead to a problem reading data from the
> parent image, I assume.
>
> This brings up another issue: deleting glance images is now easily possible
> since the image has no clones. And if the VM loses its base image it
> probably will run into a failed state.
>
> ---cut here---
> # New glance image with new VM
> root@control:~ # rbd children
> glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap
> cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk
>
> # Parent data available
> root@control:~ # rbd info cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk |
> grep parent
> parent: glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap
>
>
> # Export base image
> root@control:~ # rbd export --export-format 2
> glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1 /var/lib/glance/images/cirros
> Exporting image: 100% complete...done.
>
> # Export VM's disk
> root@control:~ # rbd export --export-format 2
> cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk
> /var/lib/glance/images/cirros_disk
> Exporting image: 100% complete...done.
>
>
> # Delete VM
> root@control:~ # rbd rm cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk
> Removing image: 100% complete...done.
>
> # Reimport VM's disk
> root@control:~ # rbd import --export-format 2
> /var/lib/glance/images/cirros_disk
> cinder/f43265e9-beab-4f83-be46-a51da013f70a_disk
> Importing image: 100% complete...done.
>
>
> # Delete glance image
> root@control:~ # rbd snap unprotect
> glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap
> root@control:~ # rbd snap purge glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1
> Removing all snapshots: 100% complete...done.
> root@control:~ # rbd rm glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1
> Removing image: 100% complete...done.
>
> # Reimport glance image
> root@control:~ # rbd import --export-format 2 /var/lib/glance/images/cirros
> glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1
> Importing image: 100% complete...done.
>
> root@control:~ # rbd snap protect
> glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap
>
> # There are no children
> root@control:~ # rbd children
> glance/4479820a-d58b-4ac6-ba2a-e8871b24fcb1@snap
> root@control:~ #
>
> # VM starts successfully
> root@control:~ # nova start c1
> Request to start server c1 has been accepted.
>
> # But no data in rbd_children
> root@control:~ # rados -p cinder listomapvals rbd_children
> root@control:~ #
> ---cut here---
>
> So in conclusion, this method is not suited for OpenStack. You could
> probably consider it in case of desaster recovery for single VMs, but not
> for a whole cloud environment where you would lose all relationships between
> base images and their clones.
>
> Regards,
> Eugen
>
>
> Zitat von Eugen Block :
>
>
>> Hi,
>>
>> I created a ticket for the rbd import issue:
>>
>> https://tracker.ceph.com/issues/23038
>>
>> Regards,
>> Eugen
>>
>>
>> Zitat von Jason Dillaman :
>>
>>> On Fri, Feb 16, 2018 at 11:20 AM, Eugen Block  wrote:

 Hi Jason,

> ... also forgot to mention "rbd export --export-format 2" / "rbd
> import --export-format 2" that will also deeply export/import all
> snapshots associated with an image and that feature is available in
> the Luminous release.



 thanks for that information, this could be very valuable for us. I'll
 have
 to test that intensively, but not before next week.

 But a first quick test brought up a couple of issues which I'll have to
 re-check before bringing them up here.

 One issue is worth mentioning, though: After I exported (rbd export
 --export-format ...) a glance image and imported it back to a different
 pool
 (rbd import --export-format ...) its snapshot was copied, but not
 protected.
 This prevented nova from cloning the base image and leaving that
 instance in
 error state. Protecting the snapshot manually and launch another
 instance
 enabled nova to clone the image successfully.

 Could this be worth a bug report or is it rather something I did wrong
 or
 missed?
>>>
>>>
>>> Definitely deserves a bug tracker ticket opened. Thanks.
>>>
 I wish you all a nice weekend!

 Regards
 Eugen


 Zitat von Jason Dillaman :

> On Fri, Feb 16, 2018 at 8:08 AM, Jason Dillaman 
> wrote:
>>
>>
>> On Fri, Feb 16, 2018 at 5:36 AM, Jens-U. Mozdzen 
>> wrote:
>>>
>>>

[ceph-users] How to "apply" and monitor bluestore compression?

2018-02-26 Thread Martin Emrich

Hi!

I just migrated my backup cluster from filestore to bluestore (8 OSDs, 
one OSD at a time, took two weeks but went smoothly).


I also enabled compression on a pool beforehand and am impressed by the 
compression ratio (snappy, aggressive, default parameters). So apparently 
during backfilling, the compression got applied.


Now I'd like to apply it to another pool (with the even stronger zstd 
algorithm). But it apparently only works on new written data.


Is there a way to "trigger" compression on the already existing objects 
(similar to a background scrub?)


Also, is there some reporting for the actual compression ratio?

Thanks

Martin



Re: [ceph-users] 【mon】Problem with mon leveldb

2018-02-26 Thread David Turner
Mons won't compact and clean up old maps while any PG is in a non-clean
state.  What is your `ceph status`?  I would guess this isn't your problem,
but thought I'd throw it out there just in case.

Also in Hammer, OSDs started telling each other when they clean up maps and
this caused a map pointer leak where the pointer for which map to keep was
set to NULL and OSDs would stop deleting maps until they were restarted
which forced them to ask the mons for what that pointer should be.  This
bug was fixed in 0.94.8.  You can check to see if you're running into this
by performing a `du` on the meta folder inside of an OSD.  I know your
complaint is about mons getting really large, but it just sounded familiar
to this issue with OSDs getting really large with maps.
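
Concrete versions of those two checks, with placeholder ids (the meta path below
assumes a default filestore layout):

  # check how much space osdmaps take on a filestore OSD host
  du -sh /var/lib/ceph/osd/ceph-*/current/meta
  # manually trigger a compaction of a monitor's store
  ceph tell mon.a compact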

On Sun, Feb 25, 2018 at 8:31 AM yu2xiangyang  wrote:

>
> Hi cephers,
>
>
> Recently I have met a problem with leveldb which is set as monitor store
> by default.
>
>
> My ceph version is 0.94.5.
>
>
> I have a disk formatted as xfs and mounted at
> /var/lib/ceph/mon/mon., and its size is 100GB.
>
>
> The monitor store size is increasing by 1GB per hour and never seems to compact,
> and my monitor store size has reached as much as 60GB.
>
>
> I stopped the monitor and backed up the monitor data for analysis, and the size
> is 23GB.
>
>
> I find that after a manual compaction of the range paxos 1 to 2 (in fact, key
> paxos 1 and key paxos 2 have already been deleted and ceph has
> already compacted the range paxos 1 to 2), the
> monitor store size is only 489MB.
>
>
> Actually we can compact the monitor store with the ceph tell mon.xxx compact
> command, but the monitor store size is exploding and gets far too big; there must be some
> problem with leveldb, or with how ceph uses leveldb.
>
>
> Has anyone ever analyzed this monitor store problem with leveldb?
>
>
> Best regards,
>  Brandy


Re: [ceph-users] How to "apply" and monitor bluestore compression?

2018-02-26 Thread Igor Fedotov

Hi Martin,


On 2/26/2018 6:19 PM, Martin Emrich wrote:

Hi!

I just migrated my backup cluster from filestore to bluestore (8 OSDs, 
one OSD at a time, took two weeks but went smoothly).


I also enabled compression on a pool beforehand and am impressed by 
the compression ratio (snappy, aggressive, default parameters). So 
apparently during backfilling, the compression got applied.


Now I'd like to apply it to another pool (with the even stronger zstd 
algorithm). But it apparently only works on newly written data.


Is there a way to "trigger" compression on the already existing 
objects (similar to a background scrub?)


Unfortunately I don't know any way but rewrite data or move it to 
another pool.

Also, is there some reporting for the actual compression ratio?
I'm working on adding compression statistics to ceph/rados df reports. 
And AFAIK currently the only way to monitor the compression ratio is to 
inspect osd performance counters.
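
As a rough sketch of that (the OSD id is a placeholder, and the counter names
are the ones I'd expect from a 12.2.x BlueStore perf dump):

$ ceph daemon osd.0 perf dump | grep -E 'compress'   # osd.0 is a placeholder

Comparing bluestore_compressed_original against bluestore_compressed_allocated
should give a rough idea of the achieved ratio.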


Thanks

Martin


Regards,
Igor

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Install previous version of Ceph

2018-02-26 Thread David Turner
In the past I downloaded the packages for a version and configured them as a
local repo on the server.  Basically it was a tar.gz that I would extract; it
placed the ceph packages in a folder for me and swapped out the repo config
file for one pointing at the local folder.  I haven't needed
to do that much, but it was helpful.  Generally it's best to just mirror
the upstream and lock it to the version you're using in production.  That's
a good rule of thumb for other repos as well, especially for ceph nodes.
When I install a new ceph node, I want all of its package versions to
match 100% to the existing nodes.  Troubleshooting problems becomes
drastically simpler once you get to that point.
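
A rough sketch of that kind of local repo (paths and versions are placeholders;
dpkg-scanpackages comes from the dpkg-dev package):

$ mkdir -p /opt/ceph-12.2.2 && cp /path/to/downloaded/*.deb /opt/ceph-12.2.2/
$ cd /opt/ceph-12.2.2 && dpkg-scanpackages . /dev/null | gzip -9 > Packages.gz
$ echo 'deb [trusted=yes] file:///opt/ceph-12.2.2 ./' > /etc/apt/sources.list.d/ceph-local.list
$ apt-get update && apt-get install ceph-mon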

On Mon, Feb 26, 2018 at 9:08 AM Ronny Aasen 
wrote:

> On 23. feb. 2018 23:37, Scottix wrote:
> > Hey,
> > We had one of our monitor servers die on us and I have a replacement
> > computer now. In between that time you have released 12.2.3 but we are
> > still on 12.2.2.
> >
> > We are on Ubuntu servers
> >
> > I see all the binaries are in the repo but your package cache only shows
> > 12.2.3, is there a reason for not keeping the previous builds like in my
> > case.
> >
> > I could do an install like
> > apt install ceph-mon=12.2.2
> >
> > Also how would I go installing 12.2.2 in my scenario since I don't want
> > to update till have this monitor running again.
> >
> > Thanks,
> > Scott
>
> did you figure out a solution to this ? I have the same problem now.
> I assume you have to download the old version manually and install with
> dpkg -i
>
> optionally mirror the ceph repo and build your own repo index containing
> all versions.
>
> kind regards
> Ronny Aasen
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Patrick Donnelly
On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
 wrote:
> Am 25.02.2018 um 21:50 schrieb John Spray:
>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>> Now, with about 100,000,000 objects written, we are in a disaster situation.
>>> First off, the MDS could not restart anymore - it required >40 GB of 
>>> memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and 
>>> swap.
>>> So it tried to recover and OOMed quickly after. Replay was reasonably fast, 
>>> but join took many minutes:
>>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
>>> and finally, 5 minutes later, OOM.
>>>
>>> I stopped half of the stress-test tar's, which did not help - then I 
>>> rebooted half of the clients, which did help and let the MDS recover just 
>>> fine.
>>> So it seems the client caps have been too many for the MDS to handle. I'm 
>>> unsure why "tar" would cause so many open file handles.
>>> Is there anything that can be configured to prevent this from happening?
>>
>> Clients will generally hold onto capabilities for files they've
>> written out -- this is pretty sub-optimal for many workloads where
>> files are written out but not likely to be accessed again in the near
>> future.  While clients hold these capabilities, the MDS cannot drop
>> things from its own cache.
>>
>> The way this is *meant* to work is that the MDS hits its cache size
>> limit, and sends a message to clients asking them to drop some files
>> from their local cache, and consequently release those capabilities.
>> However, this has historically been a tricky area with ceph-fuse
>> clients (there are some hacks for detecting kernel version and using
>> different mechanisms for different versions of fuse), and it's
>> possible that on your clients this mechanism is simply not working,
>> leading to a severely oversized MDS cache.
>>
>> The MDS should have been showing health alerts in "ceph status" about
>> this, but I suppose it's possible that it wasn't surviving long enough
>> to hit the timeout (60s) that we apply for warning about misbehaving
>> clients?  It would be good to check the cluster log to see if you were
>> getting any health messages along the lines of "Client xyz failing to
>> respond to cache pressure".
>
> This explains the high memory usage indeed.
> I can also confirm seeing those health alerts, now that I check the logs.
> The systems have been (servers and clients) all exclusively CentOS 7.4,
> so kernels are rather old, but I would have hoped things have been backported
> by RedHat.
>
> Is there anything one can do to limit client's cache sizes?

You said the clients are ceph-fuse running 12.2.3? Then they should have:

http://tracker.ceph.com/issues/22339

(Please double check you're not running older clients on accident.)

I have run small file tests with ~128 clients without issue. Generally
if there is an issue it is because clients are not releasing their
capabilities properly (due to invalidation bugs which should be caught
by the above backport) or the MDS memory usage exceeds RAM. If the
clients are not releasing their capabilities, you should see the
errors John described in the cluster log.

You said in the original post that the `mds cache memory limit = 4GB`.
If that's the case, you really shouldn't be exceeding 40GB of RAM!
It's possible you have found a bug of some kind. I suggest tracking
the MDS cache statistics (which includes the inode count in cache) by
collecting a `perf dump` via the admin socket. Then you can begin to
find out what's consuming all of the MDS memory.

Additionally, I concur with John on digging into why the MDS is
missing heartbeats by collecting debug logs (`debug mds = 15`) at that
time. It may also shed light on the issue.
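
A minimal sketch of both suggestions (daemon names are placeholders):

# inode and cap counts currently held in the MDS cache
$ ceph daemon mds.<name> perf dump mds
# raise MDS debugging while reproducing, then drop it back to the default
$ ceph tell mds.<name> injectargs '--debug_mds 15'
$ ceph tell mds.<name> injectargs '--debug_mds 1/5'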

Thanks for performing the test and letting us know the results.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread David Turner
When a Ceph system is in recovery, it uses much more RAM than it does while
running healthy.  This increase is often on the order of 4x more memory (at
least back in the days of filestore, I'm not 100% certain about bluestore,
but I would assume the same applies).  You have another thread on the ML
where you are under-provisioned on the recommended memory sizes for your
cluster by more than half.  This could be impacting your recovery.  Are you
noticing any OOM killer messages during this recovery?  Are OSDs flapping
up and down?  You would see this by additional peering in the status while
you're recovering.

You mentioned that you increased the max backfills.  What did you set that
to?  I usually watch the `ceph status` for slow requests and the OSD disks
`iostat` to know what I can sanely increase the max backfills to for a
cluster as all hardware variable make each cluster different.  Have you
confirmed that the recovery sleep is indeed 0 or are you assuming it is?
You can check this by querying the OSD daemon.
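
For example (a hedged sketch - the OSD id is a placeholder and the value is
only illustrative):

# ask the running daemon for its effective sleep setting
$ ceph daemon osd.2 config get osd_recovery_sleep
# raise backfills cluster-wide while watching iostat and `ceph -s`
$ ceph tell osd.* injectargs '--osd-max-backfills 4'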

RocksDB usage is known to scale up with the number of objects.  For an object
written once and never modified, you're likely to see around 6KB of used
RocksDB space per object.  If you modify the object regularly, this size
will increase.  A safe guess for RocksDB partition sizing is 10GB per 1TB
of storage, but that number does not apply to systems with immense amounts
of small objects.  You'll want to try to calculate that out yourself with a
guestimate of how many objects you'll have and if they'll be modified after
they're written.  7KB/object is a safe number to guestimate with for
objects not modified.  You should be able to calculate out the numbers for
your environment easily enough though.

On Mon, Feb 26, 2018 at 6:01 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Dear Cephalopodians,
>
> I have to extend my question a bit - in our system with 105,000,000
> objects in CephFS (mostly stabilized now after the stress-testing...),
> I observe the following data distribution for the metadata pool:
> # ceph osd df | head
> ID  CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL %USE  VAR  PGS
>   0   ssd 0.21829  1.0  223G  9927M  213G  4.34 0.79   0
>   1   ssd 0.21829  1.0  223G  9928M  213G  4.34 0.79   0
>   2   ssd 0.21819  1.0  223G 77179M  148G 33.73 6.11 128
>   3   ssd 0.21819  1.0  223G 76981M  148G 33.64 6.10 128
>
> osd.0 - osd.3 are all exclusively meant for cephfs-metadata, currently we
> use 4 replicas with failure domain OSD there.
> I have reinstalled and reformatted osd.0 and osd.1 about 36 hours ago.
>
> All 128 PGs in the metadata pool are backfilling (I have increased
> osd-max-backfills temporarily to speed things up for those OSDs).
> However, they only managed to backfill < 10 GB in those 36 hours. I have
> not touched any other of the default settings concerning backfill
> or recovery (but these are SSDs, so sleeps should be 0).
> The backfilling seems not to be limited by CPU, nor network, not disks.
> "ceph -s" confirms a backfill performance of about 60-100 keys/s.
> This metadata, as written before, is almost exclusively RocksDB:
>
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 84760592384,
> "db_used_bytes": 77289488384,
>
> is it normal that this kind of backfilling is so horrendously slow? Is
> there a way to speed it up?
> Like this, it will take almost two weeks for 77 GB of (meta)data.
> Right now, the system is still in the testing phase, but we'd of course
> like to be able to add more MDS's and SSD's later without extensive
> backfilling periods.
>
> Cheers,
> Oliver
>
> Am 25.02.2018 um 19:26 schrieb Oliver Freyermuth:
> > Dear Cephalopodians,
> >
> > as part of our stress test with 100,000,000 objects (all small files) we
> ended up with
> > the following usage on the OSDs on which the metadata pool lives:
> > # ceph osd df | head
> > ID  CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL %USE  VAR  PGS
> > [...]
> >   2   ssd 0.21819  1.0  223G 79649M  145G 34.81 6.62 128
> >   3   ssd 0.21819  1.0  223G 79697M  145G 34.83 6.63 128
> >
> > The cephfs-data cluster is mostly empty (5 % usage), but contains
> 100,000,000 small objects.
> >
> > Looking with:
> > ceph daemon osd.2 perf dump
> > I get:
> > "bluefs": {
> > "gift_bytes": 0,
> > "reclaim_bytes": 0,
> > "db_total_bytes": 84760592384,
> > "db_used_bytes": 78920024064,
> > "wal_total_bytes": 0,
> > "wal_used_bytes": 0,
> > "slow_total_bytes": 0,
> > "slow_used_bytes": 0,
> > so it seems this is almost exclusively RocksDB usage.
> >
> > Is this expected?
> > Is there a recommendation on how much MDS storage is needed for a CephFS
> with 450 TB?
> >
> > Cheers,
> >   Oliver
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ce

Re: [ceph-users] How to "apply" and monitor bluestore compression?

2018-02-26 Thread Martin Emrich

Hi!


Am 26.02.18 um 16:26 schrieb Igor Fedotov:
I'm working on adding compression statistics to ceph/rados df reports. 
And AFAIK currently the only way to monitor the compression ratio is to 
inspect osd performance counters.

Awesome, looking forward to it :)

Cheers,

Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Patrick Donnelly
On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
 wrote:
> Looking with:
> ceph daemon osd.2 perf dump
> I get:
> "bluefs": {
> "gift_bytes": 0,
> "reclaim_bytes": 0,
> "db_total_bytes": 84760592384,
> "db_used_bytes": 78920024064,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 0,
> "slow_used_bytes": 0,
> so it seems this is almost exclusively RocksDB usage.
>
> Is this expected?

Yes. The directory entries are stored in the omap of the objects. This
will be stored in the RocksDB backend of Bluestore.

> Is there a recommendation on how much MDS storage is needed for a CephFS with 
> 450 TB?

It seems in the above test you're using about 1KB per inode (file).
Using that you can extrapolate how much space the data pool needs
based on your file system usage. (If all you're doing is filling the
file system with empty files, of course you're going to need an
unusually large metadata pool.)
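
(As a rough illustration, assuming the ~1KB/inode figure holds: 100,000,000
files x 1KB ~= 100GB of metadata per replica, i.e. on the order of 400GB raw
for a 4-replica metadata pool.)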

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reweight-by-utilization reverse weight after adding new nodes?

2018-02-26 Thread David Turner
I would recommend continuing from where you are now and running `ceph osd
reweight-by-utilization` again.  Your weights might be a little more odd,
but your data distribution should be the same.  If you were to reset the
weights for the previous OSDs, you would only incur an additional round of
reweighting for no discernible benefit.
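
A minimal sketch of that, assuming the usual Luminous syntax (the threshold of
115% is only illustrative):

# dry-run first to see which OSDs would be touched
$ ceph osd test-reweight-by-utilization 115
# then apply it
$ ceph osd reweight-by-utilization 115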

On Mon, Feb 26, 2018 at 7:13 AM Martin Palma  wrote:

> Hello,
>
> from some OSDs in our cluster we got the "nearfull" warning message so
> we run the "ceph osd reweight-by-utilization" command to better
> distribute the data.
>
> Now we have expanded out cluster with new nodes should we reverse the
> weight of the changed OSDs to 1.0?
>
> Best,
> Martin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread David Turner
Patrick's answer supersedes what I said about RocksDB usage.  My knowledge
was more general for actually storing objects, not the metadata inside of
MDS.  Thank you for sharing Patrick.

On Mon, Feb 26, 2018 at 11:00 AM Patrick Donnelly 
wrote:

> On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>  wrote:
> > Looking with:
> > ceph daemon osd.2 perf dump
> > I get:
> > "bluefs": {
> > "gift_bytes": 0,
> > "reclaim_bytes": 0,
> > "db_total_bytes": 84760592384,
> > "db_used_bytes": 78920024064,
> > "wal_total_bytes": 0,
> > "wal_used_bytes": 0,
> > "slow_total_bytes": 0,
> > "slow_used_bytes": 0,
> > so it seems this is almost exclusively RocksDB usage.
> >
> > Is this expected?
>
> Yes. The directory entries are stored in the omap of the objects. This
> will be stored in the RocksDB backend of Bluestore.
>
> > Is there a recommendation on how much MDS storage is needed for a CephFS
> with 450 TB?
>
> It seems in the above test you're using about 1KB per inode (file).
> Using that you can extrapolate how much space the data pool needs
> based on your file system usage. (If all you're doing is filling the
> file system with empty files, of course you're going to need an
> unusually large metadata pool.)
>
> --
> Patrick Donnelly
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 16:43 schrieb Patrick Donnelly:
> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>  wrote:
>> Am 25.02.2018 um 21:50 schrieb John Spray:
>>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
 Now, with about 100,000,000 objects written, we are in a disaster 
 situation.
 First off, the MDS could not restart anymore - it required >40 GB of 
 memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and 
 swap.
 So it tried to recover and OOMed quickly after. Replay was reasonably 
 fast, but join took many minutes:
 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
 and finally, 5 minutes later, OOM.

 I stopped half of the stress-test tar's, which did not help - then I 
 rebooted half of the clients, which did help and let the MDS recover just 
 fine.
 So it seems the client caps have been too many for the MDS to handle. I'm 
 unsure why "tar" would cause so many open file handles.
 Is there anything that can be configured to prevent this from happening?
>>>
>>> Clients will generally hold onto capabilities for files they've
>>> written out -- this is pretty sub-optimal for many workloads where
>>> files are written out but not likely to be accessed again in the near
>>> future.  While clients hold these capabilities, the MDS cannot drop
>>> things from its own cache.
>>>
>>> The way this is *meant* to work is that the MDS hits its cache size
>>> limit, and sends a message to clients asking them to drop some files
>>> from their local cache, and consequently release those capabilities.
>>> However, this has historically been a tricky area with ceph-fuse
>>> clients (there are some hacks for detecting kernel version and using
>>> different mechanisms for different versions of fuse), and it's
>>> possible that on your clients this mechanism is simply not working,
>>> leading to a severely oversized MDS cache.
>>>
>>> The MDS should have been showing health alerts in "ceph status" about
>>> this, but I suppose it's possible that it wasn't surviving long enough
>>> to hit the timeout (60s) that we apply for warning about misbehaving
>>> clients?  It would be good to check the cluster log to see if you were
>>> getting any health messages along the lines of "Client xyz failing to
>>> respond to cache pressure".
>>
>> This explains the high memory usage indeed.
>> I can also confirm seeing those health alerts, now that I check the logs.
>> The systems have been (servers and clients) all exclusively CentOS 7.4,
>> so kernels are rather old, but I would have hoped things have been backported
>> by RedHat.
>>
>> Is there anything one can do to limit client's cache sizes?
> 
> You said the clients are ceph-fuse running 12.2.3? Then they should have:
> 
> http://tracker.ceph.com/issues/22339
> 
> (Please double check you're not running older clients on accident.)

I can confirm all clients have been running 12.2.3. 
Is the issue really related? It looks like a remount-failure fix. 

> 
> I have run small file tests with ~128 clients without issue. Generally
> if there is an issue it is because clients are not releasing their
> capabilities properly (due to invalidation bugs which should be caught
> by the above backport) or the MDS memory usage exceeds RAM. If the
> clients are not releasing their capabilities, you should see the
> errors John described in the cluster log.
> 
> You said in the original post that the `mds cache memory limit = 4GB`.
> If that's the case, you really shouldn't be exceeding 40GB of RAM!
> It's possible you have found a bug of some kind. I suggest tracking
> the MDS cache statistics (which includes the inode count in cache) by
> collecting a `perf dump` via the admin socket. Then you can begin to
> find out what's consuming all of the MDS memory.
> 
> Additionally, I concur with John on digging into why the MDS is
> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
> time. It may also shed light on the issue.

Yes, I confirmed this earlier - indeed I found the "failing to respond to cache 
pressure" alerts in the logs. 
The excess of RAM initially was "only" about 50 - 100 %, which was still fine - 
the main issue started after I tested MDS failover in this situation. 
If I understand correctly, the clients are only prevented from growing their 
caps to huge values if an MDS is running
and actively preventing them from doing so. Correct? 

However, since the failover took a few minutes (I played with the beacon 
timeouts and increased the mds_log_max_segments and mds_log_max_expiring to 
check impact on performance), 
this could well have been the main cause for the huge memory consumption. Do I 
understand correctly that the clients may grow their number of caps
to huge numbers if all MDS are down for a few minutes, since nobody holds their 
hands? 

This could explain why, when

[ceph-users] Luminous | PG split causing slow requests

2018-02-26 Thread David C
Hi All

I have a 12.2.1 cluster, all filestore OSDs, OSDs are spinners, journals on
NVME. Cluster primarily used for CephFS, ~20M objects.

I'm seeing some OSDs getting marked down; it appears to be related to PG
splitting, e.g.:

2018-02-26 10:27:27.935489 7f140dbe2700  1 _created [C,D] has 5121 objects,
> starting split.
>

Followed by:

2018-02-26 10:27:58.242551 7f141cc3f700  0 log_channel(cluster) log [WRN] :
> 9 slow requests, 5 included below; oldest blocked for > 30.308128 secs
> 2018-02-26 10:27:58.242563 7f141cc3f700  0 log_channel(cluster) log [WRN]
> : slow request 30.151105 seconds old, received at 2018-02-26
> 10:27:28.091312: osd_op(mds.0.5339:811969 3.5c
> 3:3bb9d743:::200.0018c6c4:head [write 73416~5897 [fadvise_dontneed]] snapc
> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
> commit_sent
> 2018-02-26 10:27:58.242569 7f141cc3f700  0 log_channel(cluster) log [WRN]
> : slow request 30.133441 seconds old, received at 2018-02-26
> 10:27:28.108976: osd_op(mds.0.5339:811970 3.5c
> 3:3bb9d743:::200.0018c6c4:head [write 79313~4866 [fadvise_dontneed]] snapc
> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
> commit_sent
> 2018-02-26 10:27:58.242574 7f141cc3f700  0 log_channel(cluster) log [WRN]
> : slow request 30.083401 seconds old, received at 2018-02-26
> 10:27:28.159016: osd_op(mds.9174516.0:444202 3.5c
> 3:3bb9d743:::200.0018c6c4:head [stat] snapc 0=[]
> ondisk+read+rwordered+known_if_redirected+full_force e13994) currently
> waiting for rw locks
> 2018-02-26 10:27:58.242579 7f141cc3f700  0 log_channel(cluster) log [WRN]
> : slow request 30.072310 seconds old, received at 2018-02-26
> 10:27:28.170107: osd_op(mds.0.5339:811971 3.5c
> 3:3bb9d743:::200.0018c6c4:head [write 84179~1941 [fadvise_dontneed]] snapc
> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently waiting
> for rw locks
> 2018-02-26 10:27:58.242584 7f141cc3f700  0 log_channel(cluster) log [WRN]
> : slow request 30.308128 seconds old, received at 2018-02-26
> 10:27:27.934288: osd_op(mds.0.5339:811964 3.5c
> 3:3bb9d743:::200.0018c6c4:head [write 0~62535 [fadvise_dontneed]] snapc
> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
> commit_sent
> 2018-02-26 10:27:59.242768 7f141cc3f700  0 log_channel(cluster) log [WRN]
> : 47 slow requests, 5 included below; oldest blocked for > 31.308410 secs
> 2018-02-26 10:27:59.242776 7f141cc3f700  0 log_channel(cluster) log [WRN]
> : slow request 30.349575 seconds old, received at 2018-02-26
> 10:27:28.893124:


I'm also experiencing some MDS crash issues which I think could be related.

Is there anything I can do to mitigate the slow requests problem? The rest
of the time the cluster is performing pretty well.

Thanks,
David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread John Spray
On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
 wrote:
> Am 26.02.2018 um 16:43 schrieb Patrick Donnelly:
>> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>>  wrote:
>>> Am 25.02.2018 um 21:50 schrieb John Spray:
 On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
> Now, with about 100,000,000 objects written, we are in a disaster 
> situation.
> First off, the MDS could not restart anymore - it required >40 GB of 
> memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and 
> swap.
> So it tried to recover and OOMed quickly after. Replay was reasonably 
> fast, but join took many minutes:
> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
> and finally, 5 minutes later, OOM.
>
> I stopped half of the stress-test tar's, which did not help - then I 
> rebooted half of the clients, which did help and let the MDS recover just 
> fine.
> So it seems the client caps have been too many for the MDS to handle. I'm 
> unsure why "tar" would cause so many open file handles.
> Is there anything that can be configured to prevent this from happening?

 Clients will generally hold onto capabilities for files they've
 written out -- this is pretty sub-optimal for many workloads where
 files are written out but not likely to be accessed again in the near
 future.  While clients hold these capabilities, the MDS cannot drop
 things from its own cache.

 The way this is *meant* to work is that the MDS hits its cache size
 limit, and sends a message to clients asking them to drop some files
 from their local cache, and consequently release those capabilities.
 However, this has historically been a tricky area with ceph-fuse
 clients (there are some hacks for detecting kernel version and using
 different mechanisms for different versions of fuse), and it's
 possible that on your clients this mechanism is simply not working,
 leading to a severely oversized MDS cache.

 The MDS should have been showing health alerts in "ceph status" about
 this, but I suppose it's possible that it wasn't surviving long enough
 to hit the timeout (60s) that we apply for warning about misbehaving
 clients?  It would be good to check the cluster log to see if you were
 getting any health messages along the lines of "Client xyz failing to
 respond to cache pressure".
>>>
>>> This explains the high memory usage indeed.
>>> I can also confirm seeing those health alerts, now that I check the logs.
>>> The systems have been (servers and clients) all exclusively CentOS 7.4,
>>> so kernels are rather old, but I would have hoped things have been 
>>> backported
>>> by RedHat.
>>>
>>> Is there anything one can do to limit client's cache sizes?
>>
>> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>>
>> http://tracker.ceph.com/issues/22339
>>
>> (Please double check you're not running older clients on accident.)
>
> I can confirm all clients have been running 12.2.3.
> Is the issue really related? It looks like a remount-failure fix.

The fuse client uses a remount internally to persuade the fuse kernel
module to really drop things from its cache (fuse doesn't provide the
ideal hooks for managing this stuff in network filesystems).

>> I have run small file tests with ~128 clients without issue. Generally
>> if there is an issue it is because clients are not releasing their
>> capabilities properly (due to invalidation bugs which should be caught
>> by the above backport) or the MDS memory usage exceeds RAM. If the
>> clients are not releasing their capabilities, you should see the
>> errors John described in the cluster log.
>>
>> You said in the original post that the `mds cache memory limit = 4GB`.
>> If that's the case, you really shouldn't be exceeding 40GB of RAM!
>> It's possible you have found a bug of some kind. I suggest tracking
>> the MDS cache statistics (which includes the inode count in cache) by
>> collecting a `perf dump` via the admin socket. Then you can begin to
>> find out what's consuming all of the MDS memory.
>>
>> Additionally, I concur with John on digging into why the MDS is
>> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
>> time. It may also shed light on the issue.
>
> Yes, I confirmed this earlier - indeed I found the "failing to respond to 
> cache pressure" alerts in the logs.
> The excess of RAM initially was "only" about 50 - 100 %, which was still fine -
> the main issue started after I tested MDS failover in this situation.
> If I understand correctly, the clients are only prevented from growing their 
> caps to huge values if an MDS is running
> and actively preventing them from doing so. Correct?

The clients have their own per-client limit on cache size
(client_cache_size) that they apply local

Re: [ceph-users] Luminous | PG split causing slow requests

2018-02-26 Thread David Turner
Splitting PGs is one of the most intensive and disruptive things you can,
and should, do to a cluster.  Tweaking recovery sleep, max backfills, and
heartbeat grace should help with this.  Heartbeat grace can be set high
enough to mitigate the OSDs flapping which slows things down by peering and
additional recovery, while still being able to detect OSDs that might fail
and go down.  The recovery sleep and max backfills are the settings you
want to look at for mitigating slow requests.  I generally tweak those
while watching iostat of some OSDs and ceph -s to make sure I'm not giving
too  much priority to the recovery operations so that client IO can still
happen.
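
A hedged sketch of the kind of runtime tweaks meant here (the values are purely
illustrative and should be tuned while watching iostat and `ceph -s`):

$ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-sleep 0.1'
$ ceph tell osd.* injectargs '--osd-heartbeat-grace 60'
# heartbeat grace is also consulted on the mon side, so persist it in ceph.conf
# for both if it turns out to help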

On Mon, Feb 26, 2018 at 11:10 AM David C  wrote:

> Hi All
>
> I have a 12.2.1 cluster, all filestore OSDs, OSDs are spinners, journals
> on NVME. Cluster primarily used for CephFS, ~20M objects.
>
> I'm seeing some OSDs getting marked down, it appears to be related to PG
> splitting, e.g:
>
> 2018-02-26 10:27:27.935489 7f140dbe2700  1 _created [C,D] has 5121
>> objects, starting split.
>>
>
> Followed by:
>
> 2018-02-26 10:27:58.242551 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : 9 slow requests, 5 included below; oldest blocked for > 30.308128 secs
>> 2018-02-26 10:27:58.242563 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : slow request 30.151105 seconds old, received at 2018-02-26
>> 10:27:28.091312: osd_op(mds.0.5339:811969 3.5c
>> 3:3bb9d743:::200.0018c6c4:head [write 73416~5897 [fadvise_dontneed]] snapc
>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
>> commit_sent
>> 2018-02-26 10:27:58.242569 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : slow request 30.133441 seconds old, received at 2018-02-26
>> 10:27:28.108976: osd_op(mds.0.5339:811970 3.5c
>> 3:3bb9d743:::200.0018c6c4:head [write 79313~4866 [fadvise_dontneed]] snapc
>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
>> commit_sent
>> 2018-02-26 10:27:58.242574 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : slow request 30.083401 seconds old, received at 2018-02-26
>> 10:27:28.159016: osd_op(mds.9174516.0:444202 3.5c
>> 3:3bb9d743:::200.0018c6c4:head [stat] snapc 0=[]
>> ondisk+read+rwordered+known_if_redirected+full_force e13994) currently
>> waiting for rw locks
>> 2018-02-26 10:27:58.242579 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : slow request 30.072310 seconds old, received at 2018-02-26
>> 10:27:28.170107: osd_op(mds.0.5339:811971 3.5c
>> 3:3bb9d743:::200.0018c6c4:head [write 84179~1941 [fadvise_dontneed]] snapc
>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently waiting
>> for rw locks
>> 2018-02-26 10:27:58.242584 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : slow request 30.308128 seconds old, received at 2018-02-26
>> 10:27:27.934288: osd_op(mds.0.5339:811964 3.5c
>> 3:3bb9d743:::200.0018c6c4:head [write 0~62535 [fadvise_dontneed]] snapc
>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
>> commit_sent
>> 2018-02-26 10:27:59.242768 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : 47 slow requests, 5 included below; oldest blocked for > 31.308410 secs
>> 2018-02-26 10:27:59.242776 7f141cc3f700  0 log_channel(cluster) log [WRN]
>> : slow request 30.349575 seconds old, received at 2018-02-26
>> 10:27:28.893124:
>
>
> I'm also experiencing some MDS crash issues which I think could be related.
>
> Is there anything I can do to mitigate the slow requests problem? The rest
> of the time the cluster is performing pretty well.
>
> Thanks,
> David
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to correctly purge a "ceph-volume lvm" OSD

2018-02-26 Thread David Turner
If we're asking for documentation updates, the man page for ceph-volume is
incredibly outdated.  In 12.2.3 it still says that bluestore is not yet
implemented and that it's planned to be supported.
'[--bluestore] filestore objectstore (not yet implemented)'
'using  a  filestore  setup (bluestore  support  is  planned)'.

On Mon, Feb 26, 2018 at 7:05 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 13:02 schrieb Alfredo Deza:
> > On Sat, Feb 24, 2018 at 1:26 PM, Oliver Freyermuth
> >  wrote:
> >> Dear Cephalopodians,
> >>
> >> when purging a single OSD on a host (created via ceph-deploy 2.0, i.e.
> using ceph-volume lvm), I currently proceed as follows:
> >>
> >> On the OSD-host:
> >> $ systemctl stop ceph-osd@4.service
> >> $ ls -la /var/lib/ceph/osd/ceph-4
> >> # Check block und block.db links:
> >> lrwxrwxrwx.  1 ceph ceph   93 23. Feb 01:28 block ->
> /dev/ceph-69b1fbe5-f084-4410-a99a-ab57417e7846/osd-block-cd273506-e805-40ac-b23d-c7b9ff45d874
> >> lrwxrwxrwx.  1 root root   43 23. Feb 01:28 block.db ->
> /dev/ceph-osd-blockdb-ssd-1/db-for-disk-sda
> >> # resolve actual underlying device:
> >> $ pvs | grep ceph-69b1fbe5-f084-4410-a99a-ab57417e7846
> >>   /dev/sda   ceph-69b1fbe5-f084-4410-a99a-ab57417e7846 lvm2 a--
> <3,64t 0
> >> # Zap the device:
> >> $ ceph-volume lvm zap --destroy /dev/sda
> >>
> >> Now, on the mon:
> >> # purge the OSD:
> >> $ ceph osd purge osd.4 --yes-i-really-mean-it
> >>
> >> Then I re-deploy using:
> >> $ ceph-deploy --overwrite-conf osd create --bluestore --block-db
> ceph-osd-blockdb-ssd-1/db-for-disk-sda --data /dev/sda osd001
> >>
> >> from the admin-machine.
> >>
> >> This works just fine, however, it leaves a stray ceph-volume service
> behind:
> >> $ ls -la /etc/systemd/system/multi-user.target.wants/ -1 | grep
> ceph-volume@lvm-4
> >> lrwxrwxrwx.  1 root root   44 24. Feb 18:30
> ceph-volume@lvm-4-5a984083-48e1-4c2f-a1f3-3458c941e597.service ->
> /usr/lib/systemd/system/ceph-volume@.service
> >> lrwxrwxrwx.  1 root root   44 23. Feb 01:28
> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service ->
> /usr/lib/systemd/system/ceph-volume@.service
> >>
> >> This stray service then, after reboot of the machine, stays in
> activating state (since the disk will of course never come back):
> >> ---
> >> $ systemctl status
> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
> >> ● ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service -
> Ceph Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
> >>Loaded: loaded (/usr/lib/systemd/system/ceph-volume@.service;
> enabled; vendor preset: disabled)
> >>Active: activating (start) since Sa 2018-02-24 19:21:47 CET; 1min
> 12s ago
> >>  Main PID: 1866 (timeout)
> >>CGroup:
> /system.slice/system-ceph\x2dvolume.slice/ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
> >>├─1866 timeout 1 /usr/sbin/ceph-volume-systemd
> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
> >>└─1872 /usr/bin/python2.7 /usr/sbin/ceph-volume-systemd
> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
> >>
> >> Feb 24 19:21:47 osd001.baf.physik.uni-bonn.de systemd[1]: Starting
> Ceph Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874...
> >> ---
> >> Manually, I can fix this by running:
> >> $ systemctl disable
> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
> >>
> >> My question is: Should I really remove that manually?
> >> Should "ceph-volume lvm zap --destroy" have taken care of it (bug)?
> >
> > You should remove it manually. The problem with zapping is that we
> > might not have the information we need to remove the systemd unit.
> > Since an OSD can be made out of different devices, ceph-volume might
> > be asked to "zap" a device which it can't compute to what OSD it
> > belongs. The systemd units are tied to the ID and UUID of the OSD.
>
> Understood, thanks for the reply!
>
> Could this be added to the documentation at some point for all the other
> users operating the cluster manually / with ceph-deploy?
> This would likely be best to prevent others from falling into this trap
> ;-).
> Should I open a ticket asking for this?
>
> Cheers,
> Oliver
>
> >
> >
> >> Am I missing a step?
> >>
> >> Cheers,
> >> Oliver
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
> On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>  wrote:
>> Looking with:
>> ceph daemon osd.2 perf dump
>> I get:
>> "bluefs": {
>> "gift_bytes": 0,
>> "reclaim_bytes": 0,
>> "db_total_bytes": 84760592384,
>> "db_used_bytes": 78920024064,
>> "wal_total_bytes": 0,
>> "wal_used_bytes": 0,
>> "slow_total_bytes": 0,
>> "slow_used_bytes": 0,
>> so it seems this is almost exclusively RocksDB usage.
>>
>> Is this expected?
> 
> Yes. The directory entries are stored in the omap of the objects. This
> will be stored in the RocksDB backend of Bluestore.
> 
>> Is there a recommendation on how much MDS storage is needed for a CephFS 
>> with 450 TB?
> 
> It seems in the above test you're using about 1KB per inode (file).
> Using that you can extrapolate how much space the data pool needs
> based on your file system usage. (If all you're doing is filling the
> file system with empty files, of course you're going to need an
> unusually large metadata pool.)
> 
Many thanks, this helps! 
We naturally hope our users will not do this - this stress test was a worst
case - but the rough number (1 kB per inode) does indeed help a lot, as does
the increase from file modifications that David laid out.

Is the slow backfilling also normal?
Will such an increase in storage (from many file modifications) be reduced again
at some point, i.e. is the database compacted / can one trigger that / is there
something like an "SQL vacuum"?

To also answer David's questions in parallel:
- Concerning the slow backfill, I am only talking about the "metadata OSDs". 
  They are fully SSD backed, and have no separate device for block.db / WAL. 
- I adjusted backfills up to 128 for those metadata OSDs, the cluster is 
currently fully empty, i.e. no client's are doing anything. 
  There are no slow requests. 
  Since no clients are doing anything and the rest of the cluster is now clean 
(apart from the two backfilling OSDs),
  right now there is also no memory pressure at all. 
  The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each. 
  The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s of 
write throughput. 
  Network traffic between the node with the clean OSDs and the 
"being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly more 
bandwidth available... 
- Checking sleeps with: 
# ceph -n osd.1 --show-config | grep sleep
osd_recovery_sleep = 0.00
osd_recovery_sleep_hdd = 0.10
osd_recovery_sleep_hybrid = 0.025000
osd_recovery_sleep_ssd = 0.00
shows there should be 0 sleep. Or is there another way to query? 

Cheers and many thanks for the valuable replies!
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Install previous version of Ceph

2018-02-26 Thread Scottix
I have been trying the dpkg -i route but am hitting a lot of dependencies, so
I'm still working on it.
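
(A hedged aside: handing dpkg all of the downloaded 12.2.2 packages in one go
usually lets it sort out the inter-package dependencies itself, e.g.

$ dpkg -i ./*.deb        # all ceph 12.2.2 debs in the current directory
$ apt-get -f install     # pulls in any remaining distro dependencies
)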

On Mon, Feb 26, 2018 at 7:36 AM David Turner  wrote:

> In the past I downloaded the packages for a version and configured it as a
> local repo on the server.  basically it was a tar.gz that I would extract
> that would place the ceph packages in a folder for me and swap out the repo
> config file to a version that points to the local folder.  I haven't needed
> to do that much, but it was helpful.  Generally it's best to just mirror
> the upstream and lock it to the version you're using in production.  That's
> a good rule of thumb for other repos as well, especially for ceph nodes.
> When I install a new ceph node, I want all of its package versions to
> match 100% to the existing nodes.  Troubleshooting problems becomes
> drastically simpler once you get to that point.
>
> On Mon, Feb 26, 2018 at 9:08 AM Ronny Aasen 
> wrote:
>
>> On 23. feb. 2018 23:37, Scottix wrote:
>> > Hey,
>> > We had one of our monitor servers die on us and I have a replacement
>> > computer now. In between that time you have released 12.2.3 but we are
>> > still on 12.2.2.
>> >
>> > We are on Ubuntu servers
>> >
>> > I see all the binaries are in the repo but your package cache only shows
>> > 12.2.3, is there a reason for not keeping the previous builds like in my
>> > case.
>> >
>> > I could do an install like
>> > apt install ceph-mon=12.2.2
>> >
>> > Also how would I go installing 12.2.2 in my scenario since I don't want
>> > to update till have this monitor running again.
>> >
>> > Thanks,
>> > Scott
>>
>> did you figure out a solution to this ? I have the same problem now.
>> I assume you have to download the old version manually and install with
>> dpkg -i
>>
>> optionally mirror the ceph repo and build your own repo index containing
>> all versions.
>>
>> kind regards
>> Ronny Aasen
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread David Turner
That was a good way to check for the recovery sleep.  Does your `ceph
status` show 128 PGs backfilling (or a number near that at least)?  The PGs
not backfilling will say 'backfill+wait'.
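
(Another way to query it, as a hedged aside, is to ask the running daemon over
its admin socket instead of reading defaults plus ceph.conf:

$ ceph daemon osd.2 config get osd_recovery_sleep_ssd   # osd.2 is a placeholder

which also reflects any values injected at runtime.)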

On Mon, Feb 26, 2018 at 11:25 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
> >  wrote:
> >> Looking with:
> >> ceph daemon osd.2 perf dump
> >> I get:
> >> "bluefs": {
> >> "gift_bytes": 0,
> >> "reclaim_bytes": 0,
> >> "db_total_bytes": 84760592384,
> >> "db_used_bytes": 78920024064,
> >> "wal_total_bytes": 0,
> >> "wal_used_bytes": 0,
> >> "slow_total_bytes": 0,
> >> "slow_used_bytes": 0,
> >> so it seems this is almost exclusively RocksDB usage.
> >>
> >> Is this expected?
> >
> > Yes. The directory entries are stored in the omap of the objects. This
> > will be stored in the RocksDB backend of Bluestore.
> >
> >> Is there a recommendation on how much MDS storage is needed for a
> CephFS with 450 TB?
> >
> > It seems in the above test you're using about 1KB per inode (file).
> > Using that you can extrapolate how much space the data pool needs
> > based on your file system usage. (If all you're doing is filling the
> > file system with empty files, of course you're going to need an
> > unusually large metadata pool.)
> >
> Many thanks, this helps!
> We naturally hope our users will not do this, this stress test was a worst
> case -
> but the rough number (1 kB per inode) does indeed help a lot, and also the
> increase with modifications
> of the file as laid out by David.
>
> Is also the slow backfilling normal?
> Will such increase in storage (by many file modifications) at some point
> also be reduced, i.e.
> is the database compacted / can one trigger that / is there something like
> "SQL vacuum"?
>
> To also answer David's questions in parallel:
> - Concerning the slow backfill, I am only talking about the "metadata
> OSDs".
>   They are fully SSD backed, and have no separate device for block.db /
> WAL.
> - I adjusted backfills up to 128 for those metadata OSDs, the cluster is
> currently fully empty, i.e. no client's are doing anything.
>   There are no slow requests.
>   Since no clients are doing anything and the rest of the cluster is now
> clean (apart from the two backfilling OSDs),
>   right now there is also no memory pressure at all.
>   The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
>   The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s
> of write throughput.
>   Network traffic between the node with the clean OSDs and the
> "being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly
> more bandwidth available...
> - Checking sleeps with:
> # ceph -n osd.1 --show-config | grep sleep
> osd_recovery_sleep = 0.00
> osd_recovery_sleep_hdd = 0.10
> osd_recovery_sleep_hybrid = 0.025000
> osd_recovery_sleep_ssd = 0.00
> shows there should be 0 sleep. Or is there another way to query?
>
> Cheers and many thanks for the valuable replies!
> Oliver
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 17:31 schrieb David Turner:
> That was a good way to check for the recovery sleep.  Does your `ceph status` 
> show 128 PGs backfilling (or a number near that at least)?  The PGs not 
> backfilling will say 'backfill+wait'.

Yes:
pgs: 37778254/593342240 objects degraded (6.367%)
 2036 active+clean
 128  active+undersized+degraded+remapped+backfilling
 6active+clean+scrubbing+deep
 6active+clean+scrubbing

The 2048 PGs are from our data pool; they are by now finally clean again and 
scrubbing. 
The 128 PGs are all from the metadata pool, so they live exclusively on the 
two "clean" OSDs and the two OSDs being backfilled:

# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL %USE  VAR  PGS 
  0   ssd 0.21829  1.0  223G 11227M  212G  4.90 0.89   0 
  1   ssd 0.21829  1.0  223G 11225M  212G  4.90 0.89   0 
  2   ssd 0.21819  1.0  223G 77813M  147G 34.00 6.16 128 
  3   ssd 0.21819  1.0  223G 77590M  147G 33.91 6.15 128

It's just... very slow. 

> 
> On Mon, Feb 26, 2018 at 11:25 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
> > mailto:freyerm...@physik.uni-bonn.de>> 
> wrote:
> >> Looking with:
> >> ceph daemon osd.2 perf dump
> >> I get:
> >>     "bluefs": {
> >>         "gift_bytes": 0,
> >>         "reclaim_bytes": 0,
> >>         "db_total_bytes": 84760592384,
> >>         "db_used_bytes": 78920024064,
> >>         "wal_total_bytes": 0,
> >>         "wal_used_bytes": 0,
> >>         "slow_total_bytes": 0,
> >>         "slow_used_bytes": 0,
> >> so it seems this is almost exclusively RocksDB usage.
> >>
> >> Is this expected?
> >
> > Yes. The directory entries are stored in the omap of the objects. This
> > will be stored in the RocksDB backend of Bluestore.
> >
> >> Is there a recommendation on how much MDS storage is needed for a 
> CephFS with 450 TB?
> >
> > It seems in the above test you're using about 1KB per inode (file).
> > Using that you can extrapolate how much space the data pool needs
> > based on your file system usage. (If all you're doing is filling the
> > file system with empty files, of course you're going to need an
> > unusually large metadata pool.)
> >
> Many thanks, this helps!
> We naturally hope our users will not do this, this stress test was a 
> worst case -
> but the rough number (1 kB per inode) does indeed help a lot, and also 
> the increase with modifications
> of the file as laid out by David.
> 
> Is also the slow backfilling normal?
> Will such increase in storage (by many file modifications) at some point 
> also be reduced, i.e.
> is the database compacted / can one trigger that / is there something 
> like "SQL vacuum"?
> 
> To also answer David's questions in parallel:
> - Concerning the slow backfill, I am only talking about the "metadata 
> OSDs".
>   They are fully SSD backed, and have no separate device for block.db / 
> WAL.
> - I adjusted backfills up to 128 for those metadata OSDs, the cluster is 
> currently fully empty, i.e. no client's are doing anything.
>   There are no slow requests.
>   Since no clients are doing anything and the rest of the cluster is now 
> clean (apart from the two backfilling OSDs),
>   right now there is also no memory pressure at all.
>   The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
>   The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s 
> of write throughput.
>   Network traffic between the node with the clean OSDs and the 
> "being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly 
> more bandwidth available...
> - Checking sleeps with:
> # ceph -n osd.1 --show-config | grep sleep
> osd_recovery_sleep = 0.00
> osd_recovery_sleep_hdd = 0.10
> osd_recovery_sleep_hybrid = 0.025000
> osd_recovery_sleep_ssd = 0.00
> shows there should be 0 sleep. Or is there another way to query?
> 
> Cheers and many thanks for the valuable replies!
>         Oliver
> 


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 17:15 schrieb John Spray:
> On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
>  wrote:
>> Am 26.02.2018 um 16:43 schrieb Patrick Donnelly:
>>> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>>>  wrote:
 Am 25.02.2018 um 21:50 schrieb John Spray:
> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>> Now, with about 100,000,000 objects written, we are in a disaster 
>> situation.
>> First off, the MDS could not restart anymore - it required >40 GB of 
>> memory, which (together with the 2 OSDs on the MDS host) exceeded RAM 
>> and swap.
>> So it tried to recover and OOMed quickly after. Replay was reasonably 
>> fast, but join took many minutes:
>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
>> and finally, 5 minutes later, OOM.
>>
>> I stopped half of the stress-test tar's, which did not help - then I 
>> rebooted half of the clients, which did help and let the MDS recover 
>> just fine.
>> So it seems the client caps have been too many for the MDS to handle. 
>> I'm unsure why "tar" would cause so many open file handles.
>> Is there anything that can be configured to prevent this from happening?
>
> Clients will generally hold onto capabilities for files they've
> written out -- this is pretty sub-optimal for many workloads where
> files are written out but not likely to be accessed again in the near
> future.  While clients hold these capabilities, the MDS cannot drop
> things from its own cache.
>
> The way this is *meant* to work is that the MDS hits its cache size
> limit, and sends a message to clients asking them to drop some files
> from their local cache, and consequently release those capabilities.
> However, this has historically been a tricky area with ceph-fuse
> clients (there are some hacks for detecting kernel version and using
> different mechanisms for different versions of fuse), and it's
> possible that on your clients this mechanism is simply not working,
> leading to a severely oversized MDS cache.
>
> The MDS should have been showing health alerts in "ceph status" about
> this, but I suppose it's possible that it wasn't surviving long enough
> to hit the timeout (60s) that we apply for warning about misbehaving
> clients?  It would be good to check the cluster log to see if you were
> getting any health messages along the lines of "Client xyz failing to
> respond to cache pressure".

 This explains the high memory usage indeed.
 I can also confirm seeing those health alerts, now that I check the logs.
 The systems have been (servers and clients) all exclusively CentOS 7.4,
 so kernels are rather old, but I would have hoped things have been 
 backported
 by RedHat.

 Is there anything one can do to limit client's cache sizes?
>>>
>>> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>>>
>>> http://tracker.ceph.com/issues/22339
>>>
>>> (Please double check you're not running older clients on accident.)
>>
>> I can confirm all clients have been running 12.2.3.
>> Is the issue really related? It looks like a remount-failure fix.
> 
> The fuse client uses a remount internally to persuade the fuse kernel
> module to really drop things from its cache (fuse doesn't provide the
> ideal hooks for managing this stuff in network filesystems).

Thanks for the explanation, now I understand! 

> 
>>> I have run small file tests with ~128 clients without issue. Generally
>>> if there is an issue it is because clients are not releasing their
>>> capabilities properly (due to invalidation bugs which should be caught
>>> by the above backport) or the MDS memory usage exceeds RAM. If the
>>> clients are not releasing their capabilities, you should see the
>>> errors John described in the cluster log.
>>>
>>> You said in the original post that the `mds cache memory limit = 4GB`.
>>> If that's the case, you really shouldn't be exceeding 40GB of RAM!
>>> It's possible you have found a bug of some kind. I suggest tracking
>>> the MDS cache statistics (which includes the inode count in cache) by
>>> collecting a `perf dump` via the admin socket. Then you can begin to
>>> find out what's consuming all of the MDS memory.
>>>
>>> Additionally, I concur with John on digging into why the MDS is
>>> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
>>> time. It may also shed light on the issue.
>>
>> Yes, I confirmed this earlier - indeed I found the "failing to respond to 
>> cache pressure" alerts in the logs.
>> The excess of RAM initially was "only" about 50 - 100 %, which was still fine 
>> - the main issue started after I tested MDS failover in this situation.
>> If I understand correctly, the clients are only prevented from growing their 
>> caps

Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Patrick Donnelly
On Mon, Feb 26, 2018 at 7:59 AM, Patrick Donnelly  wrote:
> It seems in the above test you're using about 1KB per inode (file).
> Using that you can extrapolate how much space the data pool needs

s/data pool/metadata pool/

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread John Spray
On Mon, Feb 26, 2018 at 4:50 PM, Oliver Freyermuth
 wrote:
> Am 26.02.2018 um 17:15 schrieb John Spray:
>> On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
>>  wrote:
>>> Am 26.02.2018 um 16:43 schrieb Patrick Donnelly:
 On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
  wrote:
> Am 25.02.2018 um 21:50 schrieb John Spray:
>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
>>> Now, with about 100,000,000 objects written, we are in a disaster 
>>> situation.
>>> First off, the MDS could not restart anymore - it required >40 GB of 
>>> memory, which (together with the 2 OSDs on the MDS host) exceeded RAM 
>>> and swap.
>>> So it tried to recover and OOMed quickly after. Replay was reasonably 
>>> fast, but join took many minutes:
>>> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
>>> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 
>>> rejoin_joint_start
>>> and finally, 5 minutes later, OOM.
>>>
>>> I stopped half of the stress-test tar's, which did not help - then I 
>>> rebooted half of the clients, which did help and let the MDS recover 
>>> just fine.
>>> So it seems the client caps have been too many for the MDS to handle. 
>>> I'm unsure why "tar" would cause so many open file handles.
>>> Is there anything that can be configured to prevent this from happening?
>>
>> Clients will generally hold onto capabilities for files they've
>> written out -- this is pretty sub-optimal for many workloads where
>> files are written out but not likely to be accessed again in the near
>> future.  While clients hold these capabilities, the MDS cannot drop
>> things from its own cache.
>>
>> The way this is *meant* to work is that the MDS hits its cache size
>> limit, and sends a message to clients asking them to drop some files
>> from their local cache, and consequently release those capabilities.
>> However, this has historically been a tricky area with ceph-fuse
>> clients (there are some hacks for detecting kernel version and using
>> different mechanisms for different versions of fuse), and it's
>> possible that on your clients this mechanism is simply not working,
>> leading to a severely oversized MDS cache.
>>
>> The MDS should have been showing health alerts in "ceph status" about
>> this, but I suppose it's possible that it wasn't surviving long enough
>> to hit the timeout (60s) that we apply for warning about misbehaving
>> clients?  It would be good to check the cluster log to see if you were
>> getting any health messages along the lines of "Client xyz failing to
>> respond to cache pressure".
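
A quick way to check for that (a sketch; the MDS daemon name is a placeholder)
is to look at the health output and the per-client capability counts:

$ ceph health detail | grep -i 'cache pressure'
$ ceph daemon mds.<name> session ls | grep -E '"num_caps"|"ceph_version"'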
>
> This explains the high memory usage indeed.
> I can also confirm seeing those health alerts, now that I check the logs.
> The systems have been (servers and clients) all exclusively CentOS 7.4,
> so kernels are rather old, but I would have hoped things had been
> backported by RedHat.
>
> Is there anything one can do to limit client's cache sizes?

 You said the clients are ceph-fuse running 12.2.3? Then they should have:

 http://tracker.ceph.com/issues/22339

 (Please double check you're not running older clients on accident.)
>>>
>>> I can confirm all clients have been running 12.2.3.
>>> Is the issue really related? It looks like a remount-failure fix.
>>
>> The fuse client uses a remount internally to persuade the fuse kernel
>> module to really drop things from its cache (fuse doesn't provide the
>> ideal hooks for managing this stuff in network filesystems).
>
> Thanks for the explanation, now I understand!
>
>>
 I have run small file tests with ~128 clients without issue. Generally
 if there is an issue it is because clients are not releasing their
 capabilities properly (due to invalidation bugs which should be caught
 by the above backport) or the MDS memory usage exceeds RAM. If the
 clients are not releasing their capabilities, you should see the
 errors John described in the cluster log.

 You said in the original post that the `mds cache memory limit = 4GB`.
 If that's the case, you really shouldn't be exceeding 40GB of RAM!
 It's possible you have found a bug of some kind. I suggest tracking
 the MDS cache statistics (which includes the inode count in cache) by
 collecting a `perf dump` via the admin socket. Then you can begin to
 find out what's consuming all of the MDS memory.

 Additionally, I concur with John on digging into why the MDS is
 missing heartbeats by collecting debug logs (`debug mds = 15`) at that
 time. It may also shed light on the issue.
>>>
>>> Yes, I confirmed this earlier - indeed I found the "failing to respond to 
>>> cache pressure" alerts in the logs.
>>> The excess of RAM initially was "only" about 50 - 100 % which was still fine 
>>

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-26 Thread David Turner
I'm glad that I was able to help out.  I wanted to point out that the
reason those steps worked for you as quickly as they did is likely that you
configured your block.db to use the /dev/disk/by-partuuid/{guid} instead
of /dev/sdx#.  Had you configured your osds with /dev/sdx#, then you would
have needed to either modify them to point to the partuuid path or change
them to the new device name (which is a bad choice as it will likely change
on reboot).  Changing your path for block.db is as simple as `ln -sf
/dev/disk/by-partuuid/{uuid} /var/lib/ceph/osd/ceph-#/block.db` (target first,
then the symlink path) and then restarting the osd to make sure that it can
read from the new symlink location.
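
As a sketch of that repointing (OSD id and partition uuid are placeholders; stop
the OSD first):

$ systemctl stop ceph-osd@<id>
$ ln -sf /dev/disk/by-partuuid/<db-partition-uuid> /var/lib/ceph/osd/ceph-<id>/block.db
$ systemctl start ceph-osd@<id>
$ ceph osd tree | grep osd.<id>    # confirm the OSD comes back up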

I'm curious about your OSDs starting automatically after doing those steps
as well.  I would guess you deployed them with ceph-disk instead of
ceph-volume, is that right?  ceph-volume no longer uses udev rules and
shouldn't have picked up these changes here.

On Mon, Feb 26, 2018 at 6:23 AM Caspar Smit  wrote:

> 2018-02-24 7:10 GMT+01:00 David Turner :
>
>> Caspar, it looks like your idea should work. Worst case scenario seems
>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>> the idea to weight them to 0, backfilling, then recreate the osds.
>> Definitely with a try in my opinion, and I'd love to hear your experience
>> after.
>>
>>
> Hi David,
>
> First of all, thank you for ALL your answers on this ML, you're really
> putting a lot of effort into answering many questions asked here and very
> often they contain invaluable information.
>
>
> To follow up on this post i went out and built a very small (proxmox)
> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SDD.
> And it worked!
> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>
> Here's what i did on 1 node:
>
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv <old-ssd-device> <new-ssd-device> /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
> 6) ceph osd unset noout
>
> I assume that once the ddrescue step is finished a 'partprobe' or
> something similar is triggered and udev finds the DB partitions on the new
> SSD and starts the OSD's again (kind of what happens during hotplug)
> So it is probably better to clone the SSD in another (non-ceph) system to
> not trigger any udev events.
>
> I also tested a reboot after this and everything still worked.
>
>
> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
> Delta of data was very low because it was a test cluster.
>
> All in all the OSD's in question were 'down' for only 5 minutes (so I
> stayed within the mon_osd_down_out_interval of the default 10 minutes and
> didn't actually need to set noout :)
>
> Kind regards,
> Caspar
>
>
>
>> Nico, it is not possible to change the WAL or DB size, location, etc
>> after osd creation. If you want to change the configuration of the osd
>> after creation, you have to remove it from the cluster and recreate it.
>> There is no similar functionality to how you could move, recreate, etc
>> filesystem osd journals. I think this might be on the radar as a feature,
>> but I don't know for certain. I definitely consider it to be a regression
>> of bluestore.
>>
>>
>>
>>
>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>> nico.schottel...@ungleich.ch> wrote:
>>
>>>
>>> A very interesting question and I would add the follow up question:
>>>
>>> Is there an easy way to add an external DB/WAL devices to an existing
>>> OSD?
>>>
>>> I suspect that it might be something on the lines of:
>>>
>>> - stop osd
>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>> - (maybe run some kind of osd mkfs ?)
>>> - start osd
>>>
>>> Has anyone done this so far or recommendations on how to do it?
>>>
>>> Which also makes me wonder: what is actually the format of WAL and
>>> BlockDB in bluestore? Is there any documentation available about it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>> Caspar Smit  writes:
>>>
>>> > Hi All,
>>> >
>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>> (when it
>>> > is nearing it's DWPD/TBW limit and not failed yet).
>>> >
>>> > It hosts DB partitions for 5 OSD's
>>> >
>>> > Maybe something like:
>>> >
>>> > 1) ceph osd reweight 0 the 5 OSD's
>>> > 2) let backfilling complete
>>> > 3) destroy/remove the 5 OSD's
>>> > 4) replace SSD
>>> > 5) create 5 new OSD's with seperate DB partition on new SSD
>>> >
>>> > When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved
>>> so i
>>> > thought maybe the following would work:
>>> >
>>> > 1) ceph osd set noout
>>> > 2) stop the 5 OSD's (systemctl stop)
>>> > 3) 'dd' the old SSD to a new SSD of same or bigger size
>>> > 4) remove the old SSD
>>> > 5) start the 5 OSD's (systemctl start)
>>> > 6) let backfilling/recovery complete (only delta data between OSD stop
>>> and
>>> > now

Re: [ceph-users] How to correctly purge a "ceph-volume lvm" OSD

2018-02-26 Thread Alfredo Deza
On Mon, Feb 26, 2018 at 11:24 AM, David Turner  wrote:
> If we're asking for documentation updates, the man page for ceph-volume is
> incredibly outdated.  In 12.2.3 it still says that bluestore is not yet
> implemented and that it's planned to be supported.
> '[--bluestore] filestore objectstore (not yet implemented)'
> 'using  a  filestore  setup (bluestore  support  is  planned)'.

This is a bit hard to track because ceph-deploy is an out-of-tree
project that gets pulled into the Ceph repo, and the man page lives in
the Ceph source tree.

We have updated the man page and the references to ceph-deploy to
correctly show the new API and all the flags supported, but this is in
master and was not backported
to luminous.

>
> On Mon, Feb 26, 2018 at 7:05 AM Oliver Freyermuth
>  wrote:
>>
>> Am 26.02.2018 um 13:02 schrieb Alfredo Deza:
>> > On Sat, Feb 24, 2018 at 1:26 PM, Oliver Freyermuth
>> >  wrote:
>> >> Dear Cephalopodians,
>> >>
>> >> when purging a single OSD on a host (created via ceph-deploy 2.0, i.e.
>> >> using ceph-volume lvm), I currently proceed as follows:
>> >>
>> >> On the OSD-host:
>> >> $ systemctl stop ceph-osd@4.service
>> >> $ ls -la /var/lib/ceph/osd/ceph-4
>> >> # Check block und block.db links:
>> >> lrwxrwxrwx.  1 ceph ceph   93 23. Feb 01:28 block ->
>> >> /dev/ceph-69b1fbe5-f084-4410-a99a-ab57417e7846/osd-block-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >> lrwxrwxrwx.  1 root root   43 23. Feb 01:28 block.db ->
>> >> /dev/ceph-osd-blockdb-ssd-1/db-for-disk-sda
>> >> # resolve actual underlying device:
>> >> $ pvs | grep ceph-69b1fbe5-f084-4410-a99a-ab57417e7846
>> >>   /dev/sda   ceph-69b1fbe5-f084-4410-a99a-ab57417e7846 lvm2 a--
>> >> <3,64t 0
>> >> # Zap the device:
>> >> $ ceph-volume lvm zap --destroy /dev/sda
>> >>
>> >> Now, on the mon:
>> >> # purge the OSD:
>> >> $ ceph osd purge osd.4 --yes-i-really-mean-it
>> >>
>> >> Then I re-deploy using:
>> >> $ ceph-deploy --overwrite-conf osd create --bluestore --block-db
>> >> ceph-osd-blockdb-ssd-1/db-for-disk-sda --data /dev/sda osd001
>> >>
>> >> from the admin-machine.
>> >>
>> >> This works just fine, however, it leaves a stray ceph-volume service
>> >> behind:
>> >> $ ls -la /etc/systemd/system/multi-user.target.wants/ -1 | grep
>> >> ceph-volume@lvm-4
>> >> lrwxrwxrwx.  1 root root   44 24. Feb 18:30
>> >> ceph-volume@lvm-4-5a984083-48e1-4c2f-a1f3-3458c941e597.service ->
>> >> /usr/lib/systemd/system/ceph-volume@.service
>> >> lrwxrwxrwx.  1 root root   44 23. Feb 01:28
>> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service ->
>> >> /usr/lib/systemd/system/ceph-volume@.service
>> >>
>> >> This stray service then, after reboot of the machine, stays in
>> >> activating state (since the disk will of course never come back):
>> >> ---
>> >> $ systemctl status
>> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>> >> ● ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service - Ceph
>> >> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >>Loaded: loaded (/usr/lib/systemd/system/ceph-volume@.service;
>> >> enabled; vendor preset: disabled)
>> >>Active: activating (start) since Sa 2018-02-24 19:21:47 CET; 1min
>> >> 12s ago
>> >>  Main PID: 1866 (timeout)
>> >>CGroup:
>> >> /system.slice/system-ceph\x2dvolume.slice/ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>> >>├─1866 timeout 1 /usr/sbin/ceph-volume-systemd
>> >> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >>└─1872 /usr/bin/python2.7 /usr/sbin/ceph-volume-systemd
>> >> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >>
>> >> Feb 24 19:21:47 osd001.baf.physik.uni-bonn.de systemd[1]: Starting Ceph
>> >> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874...
>> >> ---
>> >> Manually, I can fix this by running:
>> >> $ systemctl disable
>> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>> >>
>> >> My question is: Should I really remove that manually?
>> >> Should "ceph-volume lvm zap --destroy" have taken care of it (bug)?
>> >
>> > You should remove it manually. The problem with zapping is that we
>> > might not have the information we need to remove the systemd unit.
>> > Since an OSD can be made out of different devices, ceph-volume might
>> > be asked to "zap" a device which it can't compute to what OSD it
>> > belongs. The systemd units are tied to the ID and UUID of the OSD.
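
In other words, after purging an OSD this way a manual cleanup along these lines
is needed (id and uuid are placeholders taken from the stray unit's name):

$ ls /etc/systemd/system/multi-user.target.wants/ | grep 'ceph-volume@lvm-<id>-'
$ systemctl disable ceph-volume@lvm-<id>-<uuid>.service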
>>
>> Understood, thanks for the reply!
>>
>> Could this be added to the documentation at some point for all the other
>> users operating the cluster manually / with ceph-deploy?
>> This would likely be best to prevent others from falling into this trap
>> ;-).
>> Should I open a ticket asking for this?
>>
>> Cheers,
>> Oliver
>>
>> >
>> >
>> >> Am I missing a step?
>> >>
>> >> Cheers,
>> >> Oliver
>> >>
>> >>

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
After my last round of backfills completed, I started 5 more bluestore 
conversions, which helped me recognize a very specific pattern of performance.

> pool objects-ssd id 20
>   recovery io 757 MB/s, 10845 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr

The “non-throttled” backfills are only coming from filestore SSD OSD’s.
When backfilling from bluestore SSD OSD’s, they appear to be throttled at the 
aforementioned <20 ops per OSD.

This would corroborate why the first batch of SSD’s I migrated to bluestore 
were all at “full” speed, as all of the OSD’s they were backfilling from were 
filestore based, compared to increasingly bluestore backfill targets, leading 
to increasingly long backfill times as I move from one host to the next.

Looking at the recovery settings, the recovery_sleep and recovery_sleep_ssd 
values across bluestore or filestore OSDs are showing as 0 values, which means 
no sleep/throttle if I am reading everything correctly.

> sudo ceph daemon osd.73 config show | grep recovery
> "osd_allow_recovery_below_min_size": "true",
> "osd_debug_skip_full_check_in_recovery": "false",
> "osd_force_recovery_pg_log_entries_factor": "1.30",
> "osd_min_recovery_priority": "0",
> "osd_recovery_cost": "20971520",
> "osd_recovery_delay_start": "0.00",
> "osd_recovery_forget_lost_objects": "false",
> "osd_recovery_max_active": "35",
> "osd_recovery_max_chunk": "8388608",
> "osd_recovery_max_omap_entries_per_chunk": "64000",
> "osd_recovery_max_single_start": "1",
> "osd_recovery_op_priority": "3",
> "osd_recovery_op_warn_multiple": "16",
> "osd_recovery_priority": "5",
> "osd_recovery_retry_interval": "30.00",
> "osd_recovery_sleep": "0.00",
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.025000",
> "osd_recovery_sleep_ssd": "0.00",
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_thread_timeout": "30",
> "osd_scrub_during_recovery": "false",


As far as I know, the device class is configured correctly; it all shows as
ssd/hdd correctly in ceph osd tree.

So hopefully this may be enough of a smoking gun to help narrow down where this 
may be stemming from.

Thanks,

Reed

> On Feb 23, 2018, at 10:04 AM, David Turner  wrote:
> 
> Here is a [1] link to a ML thread tracking some slow backfilling on 
> bluestore.  It came down to the backfill sleep setting for them.  Maybe it 
> will help.
> 
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html 
> 
> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier  > wrote:
> Probably unrelated, but I do keep seeing this odd negative objects degraded 
> message on the fs-metadata pool:
> 
>> pool fs-metadata-ssd id 16
>>   -34/3 objects degraded (-1133.333%)
>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
> 
> Don’t mean to clutter the ML/thread, however it did seem odd, maybe its a 
> culprit? Maybe its some weird sampling interval issue thats been solved in 
> 12.2.3?
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 23, 2018, at 8:26 AM, Reed Dier > > wrote:
>> 
>> Below is ceph -s
>> 
>>>   cluster:
>>> id: {id}
>>> health: HEALTH_WARN
>>> noout flag(s) set
>>> 260610/1068004947 objects misplaced (0.024%)
>>> Degraded data redundancy: 23157232/1068004947 objects degraded 
>>> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>> 
>>>   services:
>>> mon: 3 daemons, quorum mon02,mon01,mon03
>>> mgr: mon03(active), standbys: mon02
>>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>>  flags noout
>>> 
>>>   data:
>>> pools:   5 pools, 5316 pgs
>>> objects: 339M objects, 46627 GB
>>> usage:   154 TB used, 108 TB / 262 TB avail
>>> pgs: 23157232/1068004947 objects degraded (2.168%)
>>>  260610/1068004947 objects misplaced (0.024%)
>>>  4984 active+clean
>>>  183  active+undersized+degraded+remapped+backfilling
>>>  145  active+undersized+degraded+remapped+backfill_wait
>>>  3active+remapped+backfill_wait
>>>  1active+remapped+backfilling
>>> 
>>>   io:
>>> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>>> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>> 
>> Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
>> fs-metadata pool at 32 PG.
>> 
>>> Are you sure the recovery is actually going slower, or are the individual 
>>> ops larger or more expensive?
>> 
>> The objects should not vary wildly in size.
>> Even if they 

Re: [ceph-users] Luminous | PG split causing slow requests

2018-02-26 Thread David C
Thanks, David. I think I've probably used the wrong terminology here, I'm
not splitting PGs to create more PGs. This is the PG folder splitting that
happens automatically, I believe it's controlled by the
"filestore_split_multiple" setting (which is 8 on my OSDs, I believe that's
the Luminous default...). Increasing heartbeat grace would probably still
be a good idea to prevent the flapping. I'm trying to understand if the
slow requests is to be expected or if I need to tune something or look at
hardware.
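
For reference, with filestore a subfolder is split roughly once it holds more than
filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects; the values
currently in effect can be checked per OSD via the admin socket, e.g.:

$ ceph daemon osd.<id> config get filestore_split_multiple
$ ceph daemon osd.<id> config get filestore_merge_threshold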

On Mon, Feb 26, 2018 at 4:19 PM, David Turner  wrote:

> Splitting PG's is one of the most intensive and disruptive things you can,
> and should, do to a cluster.  Tweaking recovery sleep, max backfills, and
> heartbeat grace should help with this.  Heartbeat grace can be set high
> enough to mitigate the OSDs flapping which slows things down by peering and
> additional recovery, while still being able to detect OSDs that might fail
> and go down.  The recovery sleep and max backfills are the settings you
> want to look at for mitigating slow requests.  I generally tweak those
> while watching iostat of some OSDs and ceph -s to make sure I'm not giving
> too  much priority to the recovery operations so that client IO can still
> happen.
>
> On Mon, Feb 26, 2018 at 11:10 AM David C  wrote:
>
>> Hi All
>>
>> I have a 12.2.1 cluster, all filestore OSDs, OSDs are spinners, journals
>> on NVME. Cluster primarily used for CephFS, ~20M objects.
>>
>> I'm seeing some OSDs getting marked down, it appears to be related to PG
>> splitting, e.g:
>>
>> 2018-02-26 10:27:27.935489 7f140dbe2700  1 _created [C,D] has 5121
>>> objects, starting split.
>>>
>>
>> Followed by:
>>
>> 2018-02-26 10:27:58.242551 7f141cc3f700  0 log_channel(cluster) log [WRN]
>>> : 9 slow requests, 5 included below; oldest blocked for > 30.308128 secs
>>> 2018-02-26 10:27:58.242563 7f141cc3f700  0 log_channel(cluster) log
>>> [WRN] : slow request 30.151105 seconds old, received at 2018-02-26
>>> 10:27:28.091312: osd_op(mds.0.5339:811969 3.5c
>>> 3:3bb9d743:::200.0018c6c4:head [write 73416~5897 [fadvise_dontneed]] snapc
>>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
>>> commit_sent
>>> 2018-02-26 10:27:58.242569 7f141cc3f700  0 log_channel(cluster) log
>>> [WRN] : slow request 30.133441 seconds old, received at 2018-02-26
>>> 10:27:28.108976: osd_op(mds.0.5339:811970 3.5c
>>> 3:3bb9d743:::200.0018c6c4:head [write 79313~4866 [fadvise_dontneed]] snapc
>>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
>>> commit_sent
>>> 2018-02-26 10:27:58.242574 7f141cc3f700  0 log_channel(cluster) log
>>> [WRN] : slow request 30.083401 seconds old, received at 2018-02-26
>>> 10:27:28.159016: osd_op(mds.9174516.0:444202 3.5c
>>> 3:3bb9d743:::200.0018c6c4:head [stat] snapc 0=[]
>>> ondisk+read+rwordered+known_if_redirected+full_force e13994) currently
>>> waiting for rw locks
>>> 2018-02-26 10:27:58.242579 7f141cc3f700  0 log_channel(cluster) log
>>> [WRN] : slow request 30.072310 seconds old, received at 2018-02-26
>>> 10:27:28.170107: osd_op(mds.0.5339:811971 3.5c
>>> 3:3bb9d743:::200.0018c6c4:head [write 84179~1941 [fadvise_dontneed]] snapc
>>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
>>> waiting for rw locks
>>> 2018-02-26 10:27:58.242584 7f141cc3f700  0 log_channel(cluster) log
>>> [WRN] : slow request 30.308128 seconds old, received at 2018-02-26
>>> 10:27:27.934288: osd_op(mds.0.5339:811964 3.5c
>>> 3:3bb9d743:::200.0018c6c4:head [write 0~62535 [fadvise_dontneed]] snapc
>>> 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
>>> commit_sent
>>> 2018-02-26 10:27:59.242768 7f141cc3f700  0 log_channel(cluster) log
>>> [WRN] : 47 slow requests, 5 included below; oldest blocked for > 31.308410
>>> secs
>>> 2018-02-26 10:27:59.242776 7f141cc3f700  0 log_channel(cluster) log
>>> [WRN] : slow request 30.349575 seconds old, received at 2018-02-26
>>> 10:27:28.893124:
>>
>>
>> I'm also experiencing some MDS crash issues which I think could be
>> related.
>>
>> Is there anything I can do to mitigate the slow requests problem? The
>> rest of the time the cluster is performing pretty well.
>>
>> Thanks,
>> David
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 17:59 schrieb John Spray:
> On Mon, Feb 26, 2018 at 4:50 PM, Oliver Freyermuth
>  wrote:
>> Am 26.02.2018 um 17:15 schrieb John Spray:
>>> On Mon, Feb 26, 2018 at 4:06 PM, Oliver Freyermuth
>>>  wrote:
 Am 26.02.2018 um 16:43 schrieb Patrick Donnelly:
> On Sun, Feb 25, 2018 at 3:49 PM, Oliver Freyermuth
>  wrote:
>> Am 25.02.2018 um 21:50 schrieb John Spray:
>>> On Sun, Feb 25, 2018 at 4:45 PM, Oliver Freyermuth
 Now, with about 100,000,000 objects written, we are in a disaster 
 situation.
 First off, the MDS could not restart anymore - it required >40 GB of 
 memory, which (together with the 2 OSDs on the MDS host) exceeded RAM 
 and swap.
 So it tried to recover and OOMed quickly after. Replay was reasonably 
 fast, but join took many minutes:
 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 
 rejoin_joint_start
 and finally, 5 minutes later, OOM.

 I stopped half of the stress-test tar's, which did not help - then I 
 rebooted half of the clients, which did help and let the MDS recover 
 just fine.
 So it seems the client caps have been too many for the MDS to handle. 
 I'm unsure why "tar" would cause so many open file handles.
 Is there anything that can be configured to prevent this from 
 happening?
>>>
>>> Clients will generally hold onto capabilities for files they've
>>> written out -- this is pretty sub-optimal for many workloads where
>>> files are written out but not likely to be accessed again in the near
>>> future.  While clients hold these capabilities, the MDS cannot drop
>>> things from its own cache.
>>>
>>> The way this is *meant* to work is that the MDS hits its cache size
>>> limit, and sends a message to clients asking them to drop some files
>>> from their local cache, and consequently release those capabilities.
>>> However, this has historically been a tricky area with ceph-fuse
>>> clients (there are some hacks for detecting kernel version and using
>>> different mechanisms for different versions of fuse), and it's
>>> possible that on your clients this mechanism is simply not working,
>>> leading to a severely oversized MDS cache.
>>>
>>> The MDS should have been showing health alerts in "ceph status" about
>>> this, but I suppose it's possible that it wasn't surviving long enough
>>> to hit the timeout (60s) that we apply for warning about misbehaving
>>> clients?  It would be good to check the cluster log to see if you were
>>> getting any health messages along the lines of "Client xyz failing to
>>> respond to cache pressure".
>>
>> This explains the high memory usage indeed.
>> I can also confirm seeing those health alerts, now that I check the logs.
>> The systems have been (servers and clients) all exclusively CentOS 7.4,
>> so kernels are rather old, but I would have hoped things had been
>> backported by RedHat.
>>
>> Is there anything one can do to limit client's cache sizes?
>
> You said the clients are ceph-fuse running 12.2.3? Then they should have:
>
> http://tracker.ceph.com/issues/22339
>
> (Please double check you're not running older clients on accident.)

 I can confirm all clients have been running 12.2.3.
 Is the issue really related? It looks like a remount-failure fix.
>>>
>>> The fuse client uses a remount internally to persuade the fuse kernel
>>> module to really drop things from its cache (fuse doesn't provide the
>>> ideal hooks for managing this stuff in network filesystems).
>>
>> Thanks for the explanation, now I understand!
>>
>>>
> I have run small file tests with ~128 clients without issue. Generally
> if there is an issue it is because clients are not releasing their
> capabilities properly (due to invalidation bugs which should be caught
> by the above backport) or the MDS memory usage exceeds RAM. If the
> clients are not releasing their capabilities, you should see the
> errors John described in the cluster log.
>
> You said in the original post that the `mds cache memory limit = 4GB`.
> If that's the case, you really shouldn't be exceeding 40GB of RAM!
> It's possible you have found a bug of some kind. I suggest tracking
> the MDS cache statistics (which includes the inode count in cache) by
> collecting a `perf dump` via the admin socket. Then you can begin to
> find out what's consuming all of the MDS memory.
>
> Additionally, I concur with John on digging into why the MDS is
> missing heartbeats by collecting debug logs (`debug mds = 15`) at that
> time. It may also shed light on the issue.

 Yes, I confirmed this earlier - indeed I found t

Re: [ceph-users] How to correctly purge a "ceph-volume lvm" OSD

2018-02-26 Thread David Turner
I don't follow what ceph-deploy has to do with the man page for
ceph-volume.  Is ceph-volume also out-of-tree and as such the man pages
aren't version specific with its capabilities?  It's very disconcerting to
need to ignore the man pages for CLI tools.

On Mon, Feb 26, 2018 at 12:10 PM Alfredo Deza  wrote:

> On Mon, Feb 26, 2018 at 11:24 AM, David Turner 
> wrote:
> > If we're asking for documentation updates, the man page for ceph-volume
> is
> > incredibly outdated.  In 12.2.3 it still says that bluestore is not yet
> > implemented and that it's planned to be supported.
> > '[--bluestore] filestore objectstore (not yet implemented)'
> > 'using  a  filestore  setup (bluestore  support  is  planned)'.
>
> This is a bit hard to track because ceph-deploy is an out-of-tree
> project that gets pulled into the Ceph repo, and the man page lives in
> the Ceph source tree.
>
> We have updated the man page and the references to ceph-deploy to
> correctly show the new API and all the flags supported, but this is in
> master and was not backported
> to luminous.
>
> >
> > On Mon, Feb 26, 2018 at 7:05 AM Oliver Freyermuth
> >  wrote:
> >>
> >> Am 26.02.2018 um 13:02 schrieb Alfredo Deza:
> >> > On Sat, Feb 24, 2018 at 1:26 PM, Oliver Freyermuth
> >> >  wrote:
> >> >> Dear Cephalopodians,
> >> >>
> >> >> when purging a single OSD on a host (created via ceph-deploy 2.0,
> i.e.
> >> >> using ceph-volume lvm), I currently proceed as follows:
> >> >>
> >> >> On the OSD-host:
> >> >> $ systemctl stop ceph-osd@4.service
> >> >> $ ls -la /var/lib/ceph/osd/ceph-4
> >> >> # Check block und block.db links:
> >> >> lrwxrwxrwx.  1 ceph ceph   93 23. Feb 01:28 block ->
> >> >>
> /dev/ceph-69b1fbe5-f084-4410-a99a-ab57417e7846/osd-block-cd273506-e805-40ac-b23d-c7b9ff45d874
> >> >> lrwxrwxrwx.  1 root root   43 23. Feb 01:28 block.db ->
> >> >> /dev/ceph-osd-blockdb-ssd-1/db-for-disk-sda
> >> >> # resolve actual underlying device:
> >> >> $ pvs | grep ceph-69b1fbe5-f084-4410-a99a-ab57417e7846
> >> >>   /dev/sda   ceph-69b1fbe5-f084-4410-a99a-ab57417e7846 lvm2 a--
> >> >> <3,64t 0
> >> >> # Zap the device:
> >> >> $ ceph-volume lvm zap --destroy /dev/sda
> >> >>
> >> >> Now, on the mon:
> >> >> # purge the OSD:
> >> >> $ ceph osd purge osd.4 --yes-i-really-mean-it
> >> >>
> >> >> Then I re-deploy using:
> >> >> $ ceph-deploy --overwrite-conf osd create --bluestore --block-db
> >> >> ceph-osd-blockdb-ssd-1/db-for-disk-sda --data /dev/sda osd001
> >> >>
> >> >> from the admin-machine.
> >> >>
> >> >> This works just fine, however, it leaves a stray ceph-volume service
> >> >> behind:
> >> >> $ ls -la /etc/systemd/system/multi-user.target.wants/ -1 | grep
> >> >> ceph-volume@lvm-4
> >> >> lrwxrwxrwx.  1 root root   44 24. Feb 18:30
> >> >> ceph-volume@lvm-4-5a984083-48e1-4c2f-a1f3-3458c941e597.service ->
> >> >> /usr/lib/systemd/system/ceph-volume@.service
> >> >> lrwxrwxrwx.  1 root root   44 23. Feb 01:28
> >> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service ->
> >> >> /usr/lib/systemd/system/ceph-volume@.service
> >> >>
> >> >> This stray service then, after reboot of the machine, stays in
> >> >> activating state (since the disk will of course never come back):
> >> >> ---
> >> >> $ systemctl status
> >> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
> >> >> ● ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service -
> Ceph
> >> >> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
> >> >>Loaded: loaded (/usr/lib/systemd/system/ceph-volume@.service;
> >> >> enabled; vendor preset: disabled)
> >> >>Active: activating (start) since Sa 2018-02-24 19:21:47 CET; 1min
> >> >> 12s ago
> >> >>  Main PID: 1866 (timeout)
> >> >>CGroup:
> >> >>
> /system.slice/system-ceph\x2dvolume.slice/ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
> >> >>├─1866 timeout 1 /usr/sbin/ceph-volume-systemd
> >> >> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
> >> >>└─1872 /usr/bin/python2.7 /usr/sbin/ceph-volume-systemd
> >> >> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
> >> >>
> >> >> Feb 24 19:21:47 osd001.baf.physik.uni-bonn.de systemd[1]: Starting
> Ceph
> >> >> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874...
> >> >> ---
> >> >> Manually, I can fix this by running:
> >> >> $ systemctl disable
> >> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
> >> >>
> >> >> My question is: Should I really remove that manually?
> >> >> Should "ceph-volume lvm zap --destroy" have taken care of it (bug)?
> >> >
> >> > You should remove it manually. The problem with zapping is that we
> >> > might not have the information we need to remove the systemd unit.
> >> > Since an OSD can be made out of different devices, ceph-volume might
> >> > be asked to "zap" a device which it can't compute to what OSD it
> >> > belongs. The systemd units are tied to the ID and UUID of th

Re: [ceph-users] Luminous | PG split causing slow requests

2018-02-26 Thread David Turner
The slow requests are absolutely expected on filestore subfolder
splitting.  You can, however, stop an OSD, split its subfolders, and start
it back up.  I perform this maintenance once/month.  I changed my settings
to [1]these, but I only suggest doing something this drastic if you're
committed to manually split your PGs regularly.  In my environment that
needs to be once/month.

Along with those settings, I use [2]this script to perform the subfolder
splitting. It will change your config file to [3]these settings, perform
the subfolder splitting, change them back to what you currently have, and
start your OSDs back up.  Using a negative merge threshold prevents
subfolder merging, which is useful for some environments.

The script automatically sets noout and unsets it for you afterwards; it also
won't start unless the cluster is health_ok.  Feel free to use it as is
or pick from it what's useful for you.  I highly suggest that anyone
feeling the pains of subfolder splitting do some sort of offline
splitting to get through it.  If you're using some sort of config
management like salt or puppet, be sure to disable it so that the config
won't be overwritten while the subfolders are being split.


[1] filestore_merge_threshold = -16
 filestore_split_multiple = 256

[2] https://gist.github.com/drakonstein/cb76c7696e65522ab0e699b7ea1ab1c4

[3] filestore_merge_threshold = -1
 filestore_split_multiple = 1
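
For reference, the offline split such a script performs is typically built around
ceph-objectstore-tool with the OSD stopped (a sketch, assuming a default filestore
layout; see the linked gist for the exact invocation):

$ systemctl stop ceph-osd@<id>
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
    --op apply-layout-settings --pool <pool>
$ systemctl start ceph-osd@<id>
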
On Mon, Feb 26, 2018 at 12:18 PM David C  wrote:

> Thanks, David. I think I've probably used the wrong terminology here, I'm
> not splitting PGs to create more PGs. This is the PG folder splitting that
> happens automatically, I believe it's controlled by the
> "filestore_split_multiple" setting (which is 8 on my OSDs, I believe that's
> the Luminous default...). Increasing heartbeat grace would probably still
> be a good idea to prevent the flapping. I'm trying to understand if the
> slow requests is to be expected or if I need to tune something or look at
> hardware.
>
> On Mon, Feb 26, 2018 at 4:19 PM, David Turner 
> wrote:
>
>> Splitting PG's is one of the most intensive and disruptive things you
>> can, and should, do to a cluster.  Tweaking recovery sleep, max backfills,
>> and heartbeat grace should help with this.  Heartbeat grace can be set high
>> enough to mitigate the OSDs flapping which slows things down by peering and
>> additional recovery, while still being able to detect OSDs that might fail
>> and go down.  The recovery sleep and max backfills are the settings you
>> want to look at for mitigating slow requests.  I generally tweak those
>> while watching iostat of some OSDs and ceph -s to make sure I'm not giving
>> too  much priority to the recovery operations so that client IO can still
>> happen.
>>
>> On Mon, Feb 26, 2018 at 11:10 AM David C  wrote:
>>
>>> Hi All
>>>
>>> I have a 12.2.1 cluster, all filestore OSDs, OSDs are spinners, journals
>>> on NVME. Cluster primarily used for CephFS, ~20M objects.
>>>
>>> I'm seeing some OSDs getting marked down, it appears to be related to PG
>>> splitting, e.g:
>>>
>>> 2018-02-26 10:27:27.935489 7f140dbe2700  1 _created [C,D] has 5121
 objects, starting split.

>>>
>>> Followed by:
>>>
>>> 2018-02-26 10:27:58.242551 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : 9 slow requests, 5 included below; oldest blocked for > 30.308128
 secs
 2018-02-26 10:27:58.242563 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.151105 seconds old, received at 2018-02-26
 10:27:28.091312: osd_op(mds.0.5339:811969 3.5c
 3:3bb9d743:::200.0018c6c4:head [write 73416~5897 [fadvise_dontneed]] snapc
 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
 commit_sent
 2018-02-26 10:27:58.242569 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.133441 seconds old, received at 2018-02-26
 10:27:28.108976: osd_op(mds.0.5339:811970 3.5c
 3:3bb9d743:::200.0018c6c4:head [write 79313~4866 [fadvise_dontneed]] snapc
 0=[] ondisk+write+known_if_redirected+full_force e13994) currently
 commit_sent
 2018-02-26 10:27:58.242574 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.083401 seconds old, received at 2018-02-26
 10:27:28.159016: osd_op(mds.9174516.0:444202 3.5c
 3:3bb9d743:::200.0018c6c4:head [stat] snapc 0=[]
 ondisk+read+rwordered+known_if_redirected+full_force e13994) currently
 waiting for rw locks
 2018-02-26 10:27:58.242579 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.072310 seconds old, received at 2018-02-26
 10:27:28.170107: osd_op(mds.0.5339:811971 3.5c
 3:3bb9d743:::200.0018c6c4:head [write 84179~1941 [fadvise_dontneed]] snapc
 0=[] ondisk+write+known_if_redirected+full_force e13994) currently waiting
 for rw locks
 2018-02-26 10:27:58.242584 7f141cc3f700  0 log_channel(cluster) log
 [WRN] : slow request 30.308128 seconds old, received at 2018-02-26
 10:2

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Gregory Farnum
I don’t actually know this option, but based on your results it’s clear
that “fast read” is telling the OSD it should issue reads to all k+m OSDs
storing data and then reconstruct the data from the first k to respond.
Without the fast read it simply asks the regular k data nodes to read it
back straight and sends the reply back. This is a straight trade off of
more bandwidth for lower long-tail latencies.
-Greg
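
As a back-of-the-envelope sketch of that trade-off for the k=4, m=2 pool discussed
here (the current setting can be checked with `ceph osd pool get <pool> fast_read`):

$ echo "scale=2; (4+2)/4" | bc   # fast_read fetches all 6 shards instead of 4, i.e. ~1.5x the shard reads per client read
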
On Mon, Feb 26, 2018 at 3:57 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Some additional information gathered from our monitoring:
> It seems fast_read does indeed become active immediately, but I do not
> understand the effect.
>
> With fast_read = 0, we see:
> ~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
> ~ 2.3 GB/s total incoming traffic to all 6 OSD hosts
>
> With fast_read = 1, we see:
> ~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
> ~ 3   GB/s total incoming traffic to all 6 OSD hosts
>
> I would have expected exactly the contrary to happen...
>
> Cheers,
> Oliver
>
> Am 26.02.2018 um 12:51 schrieb Oliver Freyermuth:
> > Dear Cephalopodians,
> >
> > in the few remaining days when we can still play at our will with
> parameters,
> > we just now tried to set:
> > ceph osd pool set cephfs_data fast_read 1
> > but did not notice any effect on sequential, large file read throughput
> on our k=4 m=2 EC pool.
> >
> > Should this become active immediately? Or do OSDs need a restart first?
> > Is the option already deemed safe?
> >
> > Or is it just that we should not expect any change on throughput, since
> our system (for large sequential reads)
> > is purely limited by the IPoIB throughput, and the shards are
> nevertheless requested by the primary OSD?
> > So the gain would not be in throughput, but the reply to the client
> would be slightly faster (before all shards have arrived)?
> > Then this option would be mainly of interest if the disk IO was
> congested (which does not happen for us as of yet)
> > and not help so much if the system is limited by network bandwidth.
> >
> > Cheers,
> >   Oliver
> >
> >
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS Single Threaded Performance

2018-02-26 Thread Brian Woods
I have a small test cluster (just two nodes) and after rebuilding it
several times I found my latest configuration that SHOULD be the fastest is
by far the slowest (per thread).


I have around 10 spindles that I have an erasure-coded CephFS on. When I
installed several SSDs and recreated it with the metadata and the write
cache on SSD, my performance plummeted from about 10-20MBps to 2-3MBps, but
only per thread… I did a rados benchmark and the SSD Meta and Write pools
can sustain anywhere from 50 to 150MBps without issue.


And, if I spool up multiple copies to the FS, each copy adds to that
throughput without much of a hit. In fact I can go up to about 8 copies
(about 16MBps) before they start slowing down at all. Even while I have
several threads actively writing I still benchmark around 25MBps.


Any ideas why single threaded performance would take a hit like this?
Almost everything is running on a single node (just a few OSDs on another
node) and I have plenty of RAM (96GBs) and CPU (8 Xeon Cores).
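
To separate per-thread latency from aggregate throughput, a comparison along these
lines might help (the pool name is a placeholder; rados bench uses 4 MB objects by
default):

$ rados bench -p <data-pool> 30 write -t 1 --no-cleanup
$ rados bench -p <data-pool> 30 write -t 16 --no-cleanup
$ rados -p <data-pool> cleanup
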
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados cppool, very low speed

2018-02-26 Thread Gregory Farnum
“rados cppool” is a toy. Please don’t use it for anything that matters. :)
On Sun, Feb 25, 2018 at 10:16 PM Behnam Loghmani 
wrote:

> Hi,
>
> I want to copy objects from one of my pools to another pool with "rados
> cppool" but the speed of this operation is so low. on the other hand, the
> speed of PUT/GET in radosgw is so different and it's so higher.
>
> Is there any trick to speed it up?
>
> ceph version 12.2.3
>
> Regards,
> Behnam Loghmani
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
> I don’t actually know this option, but based on your results it’s clear that 
> “fast read” is telling the OSD it should issue reads to all k+m OSDs storing 
> data and then reconstruct the data from the first k to respond. Without the 
> fast read it simply asks the regular k data nodes to read it back straight 
> and sends the reply back. This is a straight trade off of more bandwidth for 
> lower long-tail latencies.
> -Greg

Many thanks, this certainly explains it! 
Apparently I misunderstood how "normal" read works - I thought that in any 
case, all shards would be requested, and the primary OSD would check EC is 
still fine. 
However, with the explanation that indeed only the actual "k" shards are read 
in the "normal" case, it's fully clear to me that "fast_read" will be slower 
for us,
since we are limited by network bandwidth. 

On a side-note, activating fast_read also appears to increase CPU load a bit, 
which is then probably due to the EC calculations that need to be performed if 
the "wrong"
shards arrived at the primary OSD first. 

I believe this also explains why an EC pool actually does remapping in a k=4 
m=2 pool with failure domain host if one of 6 hosts goes down:
Namely, to have the "k" shards available on the "up" OSDs. This answers an 
earlier question of mine. 

Many thanks for clearing this up!

Cheers,
Oliver

> On Mon, Feb 26, 2018 at 3:57 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Some additional information gathered from our monitoring:
> It seems fast_read does indeed become active immediately, but I do not 
> understand the effect.
> 
> With fast_read = 0, we see:
> ~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
> ~ 2.3 GB/s total incoming traffic to all 6 OSD hosts
> 
> With fast_read = 1, we see:
> ~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
> ~ 3   GB/s total incoming traffic to all 6 OSD hosts
> 
> I would have expected exactly the contrary to happen...
> 
> Cheers,
>         Oliver
> 
> Am 26.02.2018 um 12:51 schrieb Oliver Freyermuth:
> > Dear Cephalopodians,
> >
> > in the few remaining days when we can still play at our will with 
> parameters,
> > we just now tried to set:
> > ceph osd pool set cephfs_data fast_read 1
> > but did not notice any effect on sequential, large file read throughput 
> on our k=4 m=2 EC pool.
> >
> > Should this become active immediately? Or do OSDs need a restart first?
> > Is the option already deemed safe?
> >
> > Or is it just that we should not expect any change on throughput, since 
> our system (for large sequential reads)
> > is purely limited by the IPoIB throughput, and the shards are 
> nevertheless requested by the primary OSD?
> > So the gain would not be in throughput, but the reply to the client 
> would be slightly faster (before all shards have arrived)?
> > Then this option would be mainly of interest if the disk IO was 
> congested (which does not happen for us as of yet)
> > and not help so much if the system is limited by network bandwidth.
> >
> > Cheers,
> >       Oliver
> >
> >
> >
> 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
> > I don’t actually know this option, but based on your results it’s clear
> that “fast read” is telling the OSD it should issue reads to all k+m OSDs
> storing data and then reconstruct the data from the first k to respond.
> Without the fast read it simply asks the regular k data nodes to read it
> back straight and sends the reply back. This is a straight trade off of
> more bandwidth for lower long-tail latencies.
> > -Greg
>
> Many thanks, this certainly explains it!
> Apparently I misunderstood how "normal" read works - I thought that in any
> case, all shards would be requested, and the primary OSD would check EC is
> still fine.
>

Nope, EC PGs can self-validate (they checksum everything) and so extra
shards are requested only if one of the OSDs has an error.


> However, with the explanation that indeed only the actual "k" shards are
> read in the "normal" case, it's fully clear to me that "fast_read" will be
> slower for us,
> since we are limited by network bandwidth.
>
> On a side-note, activating fast_read also appears to increase CPU load a
> bit, which is then probably due to the EC calculations that need to be
> performed if the "wrong"
> shards arrived at the primary OSD first.
>
> I believe this also explains why an EC pool actually does remapping in a
> k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
> Namely, to have the "k" shards available on the "up" OSDs. This answers an
> earlier question of mine.
>

I don't quite understand what you're asking/saying here, but if an OSD gets
marked out all the PGs that used to rely on it will get another OSD unless
you've instructed the cluster not to do so. The specifics of any given
erasure code have nothing to do with it. :)
-Greg


>
> Many thanks for clearing this up!
>
> Cheers,
> Oliver
>
> > On Mon, Feb 26, 2018 at 3:57 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >
> > Some additional information gathered from our monitoring:
> > It seems fast_read does indeed become active immediately, but I do
> not understand the effect.
> >
> > With fast_read = 0, we see:
> > ~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
> > ~ 2.3 GB/s total incoming traffic to all 6 OSD hosts
> >
> > With fast_read = 1, we see:
> > ~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
> > ~ 3   GB/s total incoming traffic to all 6 OSD hosts
> >
> > I would have expected exactly the contrary to happen...
> >
> > Cheers,
> > Oliver
> >
> > Am 26.02.2018 um 12:51 schrieb Oliver Freyermuth:
> > > Dear Cephalopodians,
> > >
> > > in the few remaining days when we can still play at our will with
> parameters,
> > > we just now tried to set:
> > > ceph osd pool set cephfs_data fast_read 1
> > > but did not notice any effect on sequential, large file read
> throughput on our k=4 m=2 EC pool.
> > >
> > > Should this become active immediately? Or do OSDs need a restart
> first?
> > > Is the option already deemed safe?
> > >
> > > Or is it just that we should not expect any change on throughput,
> since our system (for large sequential reads)
> > > is purely limited by the IPoIB throughput, and the shards are
> nevertheless requested by the primary OSD?
> > > So the gain would not be in throughput, but the reply to the
> client would be slightly faster (before all shards have arrived)?
> > > Then this option would be mainly of interest if the disk IO was
> congested (which does not happen for us as of yet)
> > > and not help so much if the system is limited by network bandwidth.
> > >
> > > Cheers,
> > >   Oliver
> > >
> > >
> > >
> >
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
> >  wrote:
> >> Looking with:
> >> ceph daemon osd.2 perf dump
> >> I get:
> >> "bluefs": {
> >> "gift_bytes": 0,
> >> "reclaim_bytes": 0,
> >> "db_total_bytes": 84760592384,
> >> "db_used_bytes": 78920024064,
> >> "wal_total_bytes": 0,
> >> "wal_used_bytes": 0,
> >> "slow_total_bytes": 0,
> >> "slow_used_bytes": 0,
> >> so it seems this is almost exclusively RocksDB usage.
> >>
> >> Is this expected?
> >
> > Yes. The directory entries are stored in the omap of the objects. This
> > will be stored in the RocksDB backend of Bluestore.
> >
> >> Is there a recommendation on how much MDS storage is needed for a
> CephFS with 450 TB?
> >
> > It seems in the above test you're using about 1KB per inode (file).
> > Using that you can extrapolate how much space the data pool needs
> > based on your file system usage. (If all you're doing is filling the
> > file system with empty files, of course you're going to need an
> > unusually large metadata pool.)
> >
> Many thanks, this helps!
> We naturally hope our users will not do this, this stress test was a worst
> case -
> but the rough number (1 kB per inode) does indeed help a lot, and also the
> increase with modifications
> of the file as laid out by David.
>
> Is also the slow backfilling normal?
> Will such increase in storage (by many file modifications) at some point
> also be reduced, i.e.
> is the database compacted / can one trigger that / is there something like
> "SQL vacuum"?
>
> To also answer David's questions in parallel:
> - Concerning the slow backfill, I am only talking about the "metadata
> OSDs".
>   They are fully SSD backed, and have no separate device for block.db /
> WAL.
> - I adjusted backfills up to 128 for those metadata OSDs, the cluster is
> currently fully empty, i.e. no client's are doing anything.
>   There are no slow requests.
>   Since no clients are doing anything and the rest of the cluster is now
> clean (apart from the two backfilling OSDs),
>   right now there is also no memory pressure at all.
>   The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
>   The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s
> of write throughput.
>   Network traffic between the node with the clean OSDs and the
> "being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly
> more bandwidth available...
> - Checking sleeps with:
> # ceph -n osd.1 --show-config | grep sleep
> osd_recovery_sleep = 0.00
> osd_recovery_sleep_hdd = 0.10
> osd_recovery_sleep_hybrid = 0.025000
> osd_recovery_sleep_ssd = 0.00
> shows there should be 0 sleep. Or is there another way to query?
>

Check if the OSDs are reporting their stores or their journals to be
"rotational" via "ceph osd metadata"?

If that's being detected wrong, that would cause them to be using those
sleeps.
-Greg
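
For example (the osd id is a placeholder; exact key names can differ slightly
between releases):

$ ceph osd metadata <id> | grep -i rotational   # expect "0" for SSD-backed data and journal/DB devices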

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 9:12 AM Reed Dier  wrote:

> After my last round of backfills completed, I started 5 more bluestore
> conversions, which helped me recognize a very specific pattern of
> performance.
>
> pool objects-ssd id 20
>   recovery io 757 MB/s, 10845 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
>
>
> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
> When backfilling from bluestore SSD OSD’s, they appear to be throttled at
> the aforementioned <20 ops per OSD.
>

Wait, is that the current state? What are you referencing when you talk
about recovery ops per second?

Also, what are the values for osd_recovery_sleep_hdd
and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
that your BlueStore SSD OSDs are correctly reporting both themselves and
their journals as non-rotational?
-Greg
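
If the hybrid sleep turns out to be the culprit, a runtime override is possible
while the backfill runs (a sketch; revert to the default afterwards):

$ ceph tell osd.* injectargs '--osd_recovery_sleep_hybrid 0'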


>
> This would corroborate why the first batch of SSD’s I migrated to
> bluestore were all at “full” speed, as all of the OSD’s they were
> backfilling from were filestore based, compared to increasingly bluestore
> backfill targets, leading to increasingly long backfill times as I move
> from one host to the next.
>
> Looking at the recovery settings, the recovery_sleep and
> recovery_sleep_ssd values across bluestore or filestore OSDs are showing as
> 0 values, which means no sleep/throttle if I am reading everything
> correctly.
>
> sudo ceph daemon osd.73 config show | grep recovery
> "osd_allow_recovery_below_min_size": "true",
> "osd_debug_skip_full_check_in_recovery": "false",
> "osd_force_recovery_pg_log_entries_factor": "1.30",
> "osd_min_recovery_priority": "0",
> "osd_recovery_cost": "20971520",
> "osd_recovery_delay_start": "0.00",
> "osd_recovery_forget_lost_objects": "false",
> "osd_recovery_max_active": "35",
> "osd_recovery_max_chunk": "8388608",
> "osd_recovery_max_omap_entries_per_chunk": "64000",
> "osd_recovery_max_single_start": "1",
> "osd_recovery_op_priority": "3",
> "osd_recovery_op_warn_multiple": "16",
> "osd_recovery_priority": "5",
> "osd_recovery_retry_interval": "30.00",
> *"osd_recovery_sleep": "0.00",*
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.025000",
> *"osd_recovery_sleep_ssd": "0.00",*
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_thread_timeout": "30",
> "osd_scrub_during_recovery": "false",
>
>
> As far as I know, the device class is configured correctly as far as I
> know, it all shows as ssd/hdd correctly in ceph osd tree.
>
> So hopefully this may be enough of a smoking gun to help narrow down where
> this may be stemming from.
>
> Thanks,
>
> Reed
>
> On Feb 23, 2018, at 10:04 AM, David Turner  wrote:
>
> Here is a [1] link to a ML thread tracking some slow backfilling on
> bluestore.  It came down to the backfill sleep setting for them.  Maybe it
> will help.
>
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html
>
> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier  wrote:
>
>> Probably unrelated, but I do keep seeing this odd negative objects
>> degraded message on the fs-metadata pool:
>>
>> pool fs-metadata-ssd id 16
>>   -34/3 objects degraded (-1133.333%)
>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
>>
>>
>> Don’t mean to clutter the ML/thread, however it did seem odd; maybe it's a
>> culprit? Maybe it's some weird sampling interval issue that's been solved in
>> 12.2.3?
>>
>> Thanks,
>>
>> Reed
>>
>>
>> On Feb 23, 2018, at 8:26 AM, Reed Dier  wrote:
>>
>> Below is ceph -s
>>
>>   cluster:
>> id: {id}
>> health: HEALTH_WARN
>> noout flag(s) set
>> 260610/1068004947 objects misplaced (0.024%)
>> Degraded data redundancy: 23157232/1068004947 objects
>> degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>
>>   services:
>> mon: 3 daemons, quorum mon02,mon01,mon03
>> mgr: mon03(active), standbys: mon02
>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>  flags noout
>>
>>   data:
>> pools:   5 pools, 5316 pgs
>> objects: 339M objects, 46627 GB
>> usage:   154 TB used, 108 TB / 262 TB avail
>> pgs: 23157232/1068004947 objects degraded (2.168%)
>>  260610/1068004947 objects misplaced (0.024%)
>>  4984 active+clean
>>  183  active+undersized+degraded+remapped+backfilling
>>  145  active+undersized+degraded+remapped+backfill_wait
>>  3active+remapped+backfill_wait
>>  1active+remapped+backfilling
>>
>>   io:
>> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>>
>>
>> Als

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 19:45 schrieb Gregory Farnum:
> On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
> > I don’t actually know this option, but based on your results it’s clear 
> that “fast read” is telling the OSD it should issue reads to all k+m OSDs 
> storing data and then reconstruct the data from the first k to respond. 
> Without the fast read it simply asks the regular k data nodes to read it back 
> straight and sends the reply back. This is a straight trade off of more 
> bandwidth for lower long-tail latencies.
> > -Greg
> 
> Many thanks, this certainly explains it!
> Apparently I misunderstood how "normal" read works - I thought that in 
> any case, all shards would be requested, and the primary OSD would check EC 
> is still fine.
> 
> 
> Nope, EC PGs can self-validate (they checksum everything) and so extra shards 
> are requested only if one of the OSDs has an error.
>  
> 
> However, with the explanation that indeed only the actual "k" shards are 
> read in the "normal" case, it's fully clear to me that "fast_read" will be 
> slower for us,
> since we are limited by network bandwidth.
> 
> On a side-note, activating fast_read also appears to increase CPU load a 
> bit, which is then probably due to the EC calculations that need to be 
> performed if the "wrong"
> shards arrived at the primary OSD first.
> 
> I believe this also explains why an EC pool actually does remapping in a 
> k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
> Namely, to have the "k" shards available on the "up" OSDs. This answers 
> an earlier question of mine.
> 
> 
> I don't quite understand what you're asking/saying here, but if an OSD gets 
> marked out all the PGs that used to rely on it will get another OSD unless 
> you've instructed the cluster not to do so. The specifics of any given 
> erasure code have nothing to do with it. :)
> -Greg

Ah, sorry, let me clarify. 
The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts. 
So necessarily, there is one shard for each host. If one host goes down for a 
prolonged time,
there's no "logical" advantage of redistributing things - since whatever you 
do, with 5 hosts, all PGs will stay in degraded state anyways. 

However, I noticed Ceph is remapping all PGs, and actively moving data. I 
presume now this is done for two reasons:
- The remapping is needed since the primary OSD might be the one which went 
down. But for remapping (I guess) there's no need to actually move data,
  or is there? 
- The data movement is done to have the "k" shards available. 
If it's really the case that "all shards are equal", then data movement should 
not occur - or is this a bug / bad feature? 
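
One way to check what is actually happening is to compare the "up" and "acting"
sets of a remapped PG and watch whether objects are really being copied - the
pg id below is just a placeholder, pick a real one from the dump:

ceph pg dump pgs_brief | grep remapped | head
ceph pg 2.0 query | grep -E -A 8 '"up"|"acting"'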

Cheers,
Oliver

>  
> 
> 
> Many thanks for clearing this up!
> 
> Cheers,
>         Oliver
> 
> > On Mon, Feb 26, 2018 at 3:57 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de> 
>  >> wrote:
> >
> >     Some additional information gathered from our monitoring:
> >     It seems fast_read does indeed become active immediately, but I do 
> not understand the effect.
> >
> >     With fast_read = 0, we see:
> >     ~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
> >     ~ 2.3 GB/s total incoming traffic to all 6 OSD hosts
> >
> >     With fast_read = 1, we see:
> >     ~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
> >     ~ 3   GB/s total incoming traffic to all 6 OSD hosts
> >
> >     I would have expected exactly the contrary to happen...
> >
> >     Cheers,
> >             Oliver
> >
> >     Am 26.02.2018 um 12:51 schrieb Oliver Freyermuth:
> >     > Dear Cephalopodians,
> >     >
> >     > in the few remaining days when we can still play at our will with 
> parameters,
> >     > we just now tried to set:
> >     > ceph osd pool set cephfs_data fast_read 1
> >     > but did not notice any effect on sequential, large file read 
> throughput on our k=4 m=2 EC pool.
> >     >
> >     > Should this become active immediately? Or do OSDs need a restart 
> first?
> >     > Is the option already deemed safe?
> >     >
> >     > Or is it just that we should not expect any change on throughput, 
> since our system (for large sequential reads)
> >     > is purely limited by the IPoIB throughput, and the shards are 
> nevertheless requested by the primary OSD?
> >     > So the gain would not be in throughput, but the reply to the 
> client would be slightly faster (before all shards have arrived)?
> >     > Then this option would be mainly of interest if the disk IO was 
> congested (which does not happen for us as of yet)
> >     > and not help so much if the system i

Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 19:56 schrieb Gregory Farnum:
> 
> 
> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
> > mailto:freyerm...@physik.uni-bonn.de>> 
> wrote:
> >> Looking with:
> >> ceph daemon osd.2 perf dump
> >> I get:
> >>     "bluefs": {
> >>         "gift_bytes": 0,
> >>         "reclaim_bytes": 0,
> >>         "db_total_bytes": 84760592384,
> >>         "db_used_bytes": 78920024064,
> >>         "wal_total_bytes": 0,
> >>         "wal_used_bytes": 0,
> >>         "slow_total_bytes": 0,
> >>         "slow_used_bytes": 0,
> >> so it seems this is almost exclusively RocksDB usage.
> >>
> >> Is this expected?
> >
> > Yes. The directory entries are stored in the omap of the objects. This
> > will be stored in the RocksDB backend of Bluestore.
> >
> >> Is there a recommendation on how much MDS storage is needed for a 
> CephFS with 450 TB?
> >
> > It seems in the above test you're using about 1KB per inode (file).
> > Using that you can extrapolate how much space the metadata pool needs
> > based on your file system usage. (If all you're doing is filling the
> > file system with empty files, of course you're going to need an
> > unusually large metadata pool.)
> >
> Many thanks, this helps!
> We naturally hope our users will not do this, this stress test was a 
> worst case -
> but the rough number (1 kB per inode) does indeed help a lot, and also 
> the increase with modifications
> of the file as laid out by David.
> 
> Is also the slow backfilling normal?
> Will such increase in storage (by many file modifications) at some point 
> also be reduced, i.e.
> is the database compacted / can one trigger that / is there something 
> like "SQL vacuum"?
> 
> To also answer David's questions in parallel:
> - Concerning the slow backfill, I am only talking about the "metadata 
> OSDs".
>   They are fully SSD backed, and have no separate device for block.db / 
> WAL.
> - I adjusted backfills up to 128 for those metadata OSDs, the cluster is 
> currently fully empty, i.e. no client's are doing anything.
>   There are no slow requests.
>   Since no clients are doing anything and the rest of the cluster is now 
> clean (apart from the two backfilling OSDs),
>   right now there is also no memory pressure at all.
>   The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
>   The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s 
> of write throughput.
>   Network traffic between the node with the clean OSDs and the 
> "being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly 
> more bandwidth available...
> - Checking sleeps with:
> # ceph -n osd.1 --show-config | grep sleep
> osd_recovery_sleep = 0.00
> osd_recovery_sleep_hdd = 0.10
> osd_recovery_sleep_hybrid = 0.025000
> osd_recovery_sleep_ssd = 0.00
> shows there should be 0 sleep. Or is there another way to query?
> 
> 
> Check if the OSDs are reporting their stores or their journals to be 
> "rotational" via "ceph osd metadata"?

I find:
"bluestore_bdev_model": "Micron_5100_MTFD",
"bluestore_bdev_partition_path": "/dev/sda2",
"bluestore_bdev_rotational": "0",
"bluestore_bdev_size": "239951482880",
"bluestore_bdev_type": "ssd",
[...]
"rotational": "0"

for all of them (obviously with different device paths). 
Also, they've been assigned the ssd device class automatically: 
# ceph osd df | head
ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
  0   ssd 0.21829  1.0  223G 11310M  212G  4.94 0.94   0
  1   ssd 0.21829  1.0  223G 11368M  212G  4.97 0.95   0
  2   ssd 0.21819  1.0  223G 76076M  149G 33.25 6.35 128
  3   ssd 0.21819  1.0  223G 76268M  148G 33.33 6.37 128

So this should not be the reason... 
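
As an aside, regarding the compaction question further up: as far as I
understand, a manual RocksDB compaction can be triggered per OSD - only a
sketch, commands may vary by release, osd.2 is an example id, and the offline
variant assumes the OSD is stopped:

ceph tell osd.2 compact
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-2 compact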

> 
> If that's being detected wrong, that would cause them to be using those 
> sleeps.
> -Greg
> 
> 




smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
The ‘good perf’ that I reported below was the result of beginning 5 new 
bluestore conversions which results in a leading edge of ‘good’ performance, 
before trickling off.

This performance lasted about 20 minutes, where it backfilled a small set of 
PGs off of non-bluestore OSDs.

Current performance is now hovering around:
> pool objects-ssd id 20
>   recovery io 14285 kB/s, 202 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr

> What are you referencing when you talk about recovery ops per second?

These are recovery ops as reported by ceph -s or via stats exported via influx 
plugin in mgr, and via local collectd collection.

> Also, what are the values for osd_recovery_sleep_hdd and 
> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
> your BlueStore SSD OSDs are correctly reporting both themselves and their 
> journals as non-rotational?

This yields more interesting results.
Pasting results for 3 sets of OSDs in this order
 {0}hdd+nvme block.db
{24}ssd+nvme block.db
{59}ssd+nvme journal

> ceph osd metadata | grep 'id\|rotational'
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "1",
> "bluestore_bdev_rotational": "1",
> "journal_rotational": "1",
> "rotational": “1"
> "id": 24,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": “0"
> "id": 59,
> "journal_rotational": "0",
> "rotational": “0"

I wonder if it matters/is correct to see "journal_rotational": "1" for the 
bluestore OSD’s {0,24} with nvme block.db.

Hope this may be helpful in determining the root cause.

If it helps, all of the OSD’s were originally deployed with ceph-deploy, but 
are now being redone with ceph-volume locally on each host.
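
For reference, the per-OSD redeploy boils down to something like this - only a
sketch, the device paths are placeholders and the block.db partition is assumed
to exist already:

ceph-volume lvm zap /dev/sdX
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY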

Thanks,

Reed

> On Feb 26, 2018, at 1:00 PM, Gregory Farnum  wrote:
> 
> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier  > wrote:
> After my last round of backfills completed, I started 5 more bluestore 
> conversions, which helped me recognize a very specific pattern of performance.
> 
>> pool objects-ssd id 20
>>   recovery io 757 MB/s, 10845 objects/s
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
> 
> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
> When backfilling from bluestore SSD OSD’s, they appear to be throttled at the 
> aforementioned <20 ops per OSD.
> 
> Wait, is that the current state? What are you referencing when you talk about 
> recovery ops per second?
> 
> Also, what are the values for osd_recovery_sleep_hdd and 
> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
> your BlueStore SSD OSDs are correctly reporting both themselves and their 
> journals as non-rotational?
> -Greg
>  
> 
> This would corroborate why the first batch of SSD’s I migrated to bluestore 
> were all at “full” speed, as all of the OSD’s they were backfilling from were 
> filestore based, compared to increasingly bluestore backfill targets, leading 
> to increasingly long backfill times as I move from one host to the next.
> 
> Looking at the recovery settings, the recovery_sleep and recovery_sleep_ssd 
> values across bluestore or filestore OSDs are showing as 0 values, which 
> means no sleep/throttle if I am reading everything correctly.
> 
>> sudo ceph daemon osd.73 config show | grep recovery
>> "osd_allow_recovery_below_min_size": "true",
>> "osd_debug_skip_full_check_in_recovery": "false",
>> "osd_force_recovery_pg_log_entries_factor": "1.30",
>> "osd_min_recovery_priority": "0",
>> "osd_recovery_cost": "20971520",
>> "osd_recovery_delay_start": "0.00",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_recovery_max_active": "35",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_max_omap_entries_per_chunk": "64000",
>> "osd_recovery_max_single_start": "1",
>> "osd_recovery_op_priority": "3",
>> "osd_recovery_op_warn_multiple": "16",
>> "osd_recovery_priority": "5",
>> "osd_recovery_retry_interval": "30.00",
>> "osd_recovery_sleep": "0.00",
>> "osd_recovery_sleep_hdd": "0.10",
>> "osd_recovery_sleep_hybrid": "0.025000",
>> "osd_recovery_sleep_ssd": "0.00",
>> "osd_recovery_thread_suicide_timeout": "300",
>> "osd_recovery_thread_timeout": "30",
>> "osd_scrub_during_recovery": "false",
> 
> 
> As far as I know, the device class is configured correctly; it all shows as
> ssd/hdd correctly in ceph osd tree.
> 
> So hopefully this may be enough of a smoking gun to help narrow down where 
> this may be stemming from.
> 
> Thanks,
> 
> Reed
> 
>> On Fe

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 19:45 schrieb Gregory Farnum:
> > On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >
> > Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
> > > I don’t actually know this option, but based on your results it’s
> clear that “fast read” is telling the OSD it should issue reads to all k+m
> OSDs storing data and then reconstruct the data from the first k to
> respond. Without the fast read it simply asks the regular k data nodes to
> read it back straight and sends the reply back. This is a straight trade
> off of more bandwidth for lower long-tail latencies.
> > > -Greg
> >
> > Many thanks, this certainly explains it!
> > Apparently I misunderstood how "normal" read works - I thought that
> in any case, all shards would be requested, and the primary OSD would check
> EC is still fine.
> >
> >
> > Nope, EC PGs can self-validate (they checksum everything) and so extra
> shards are requested only if one of the OSDs has an error.
> >
> >
> > However, with the explanation that indeed only the actual "k" shards
> are read in the "normal" case, it's fully clear to me that "fast_read" will
> be slower for us,
> > since we are limited by network bandwidth.
> >
> > On a side-note, activating fast_read also appears to increase CPU
> load a bit, which is then probably due to the EC calculations that need to
> be performed if the "wrong"
> > shards arrived at the primary OSD first.
> >
> > I believe this also explains why an EC pool actually does remapping
> in a k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
> > Namely, to have the "k" shards available on the "up" OSDs. This
> answers an earlier question of mine.
> >
> >
> > I don't quite understand what you're asking/saying here, but if an OSD
> gets marked out all the PGs that used to rely on it will get another OSD
> unless you've instructed the cluster not to do so. The specifics of any
> given erasure code have nothing to do with it. :)
> > -Greg
>
> Ah, sorry, let me clarify.
> The EC pool I am considering is k=4 m=2 with failure domain host, on 6
> hosts.
> So necessarily, there is one shard for each host. If one host goes down
> for a prolonged time,
> there's no "logical" advantage of redistributing things - since whatever
> you do, with 5 hosts, all PGs will stay in degraded state anyways.
>
> However, I noticed Ceph is remapping all PGs, and actively moving data. I
> presume now this is done for two reasons:
> - The remapping is needed since the primary OSD might be the one which
> went down. But for remapping (I guess) there's no need to actually move
> data,
>   or is there?
> - The data movement is done to have the "k" shards available.
> If it's really the case that "all shards are equal", then data movement
> should not occur - or is this a bug / bad feature?
>

If you lose one OSD out of a host, Ceph is going to try and re-replicate
the data onto the other OSDs in that host. Your PG size and the CRUSH rule
instructs it that the PG needs 6 different OSDs, and those OSDs need to be
placed on different hosts.

You're right that it gets very funny if your PG size is equal to the number of
hosts. We generally discourage people from running configurations like that.

Or if you mean that you are losing a host, and the data is shuffling around
on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC
pools' "indep" rather than "firstn" crush rules?)
-Greg
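
One way to check which flavor a rule uses - assuming the EC rule is named after
the pool, which is only a guess here:

ceph osd pool get cephfs_data crush_rule
ceph osd crush rule dump cephfs_data | grep -E '"op"|"type"'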
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 20:09 schrieb Oliver Freyermuth:
> Am 26.02.2018 um 19:56 schrieb Gregory Farnum:
>>
>>
>> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth 
>> mailto:freyerm...@physik.uni-bonn.de>> wrote:
>>
>> Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
>> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>> > mailto:freyerm...@physik.uni-bonn.de>> 
>> wrote:
>> >> Looking with:
>> >> ceph daemon osd.2 perf dump
>> >> I get:
>> >>     "bluefs": {
>> >>         "gift_bytes": 0,
>> >>         "reclaim_bytes": 0,
>> >>         "db_total_bytes": 84760592384,
>> >>         "db_used_bytes": 78920024064,
>> >>         "wal_total_bytes": 0,
>> >>         "wal_used_bytes": 0,
>> >>         "slow_total_bytes": 0,
>> >>         "slow_used_bytes": 0,
>> >> so it seems this is almost exclusively RocksDB usage.
>> >>
>> >> Is this expected?
>> >
>> > Yes. The directory entries are stored in the omap of the objects. This
>> > will be stored in the RocksDB backend of Bluestore.
>> >
>> >> Is there a recommendation on how much MDS storage is needed for a 
>> CephFS with 450 TB?
>> >
>> > It seems in the above test you're using about 1KB per inode (file).
>> > Using that you can extrapolate how much space the data pool needs
>> > based on your file system usage. (If all you're doing is filling the
>> > file system with empty files, of course you're going to need an
>> > unusually large metadata pool.)
>> >
>> Many thanks, this helps!
>> We naturally hope our users will not do this, this stress test was a 
>> worst case -
>> but the rough number (1 kB per inode) does indeed help a lot, and also 
>> the increase with modifications
>> of the file as laid out by David.
>>
>> Is also the slow backfilling normal?
>> Will such increase in storage (by many file modifications) at some point 
>> also be reduced, i.e.
>> is the database compacted / can one trigger that / is there something 
>> like "SQL vacuum"?
>>
>> To also answer David's questions in parallel:
>> - Concerning the slow backfill, I am only talking about the "metadata 
>> OSDs".
>>   They are fully SSD backed, and have no separate device for block.db / 
>> WAL.
>> - I adjusted backfills up to 128 for those metadata OSDs, the cluster is 
>> currently fully empty, i.e. no client's are doing anything.
>>   There are no slow requests.
>>   Since no clients are doing anything and the rest of the cluster is now 
>> clean (apart from the two backfilling OSDs),
>>   right now there is also no memory pressure at all.
>>   The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
>>   The OSDs being backfilled have 3.3 % CPU load, and have about 250 kB/s 
>> of write throughput.
>>   Network traffic between the node with the clean OSDs and the 
>> "being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly 
>> more bandwidth available...
>> - Checking sleeps with:
>> # ceph -n osd.1 --show-config | grep sleep
>> osd_recovery_sleep = 0.00
>> osd_recovery_sleep_hdd = 0.10
>> osd_recovery_sleep_hybrid = 0.025000
>> osd_recovery_sleep_ssd = 0.00
>> shows there should be 0 sleep. Or is there another way to query?
>>
>>
>> Check if the OSDs are reporting their stores or their journals to be 
>> "rotational" via "ceph osd metadata"?
> 
> I find:
> "bluestore_bdev_model": "Micron_5100_MTFD",
> "bluestore_bdev_partition_path": "/dev/sda2",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_size": "239951482880",
> "bluestore_bdev_type": "ssd",
> [...]
> "rotational": "0"
> 
> for all of them (obviously with different device paths). 
> Also, they've been assigned the ssd device class automatically: 
> # ceph osd df | head
> ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
>   0   ssd 0.21829  1.0  223G 11310M  212G  4.94 0.94   0
>   1   ssd 0.21829  1.0  223G 11368M  212G  4.97 0.95   0
>   2   ssd 0.21819  1.0  223G 76076M  149G 33.25 6.35 128
>   3   ssd 0.21819  1.0  223G 76268M  148G 33.33 6.37 128
> 
> So this should not be the reason... 
> 

Checking again with the nice "grep" expression from the other thread concerning 
bluestore backfilling...
# ceph osd metadata | grep 'id\|rotational'
yields:
"id": 0,
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"
"id": 1,
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"
"id": 2,
"bluefs_db_rotational": "0",
"bluestore_bdev_r

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 11:21 AM Reed Dier  wrote:

> The ‘good perf’ that I reported below was the result of beginning 5 new
> bluestore conversions which results in a leading edge of ‘good’
> performance, before trickling off.
>
> This performance lasted about 20 minutes, where it backfilled a small set
> of PGs off of non-bluestore OSDs.
>
> Current performance is now hovering around:
>
> pool objects-ssd id 20
>   recovery io 14285 kB/s, 202 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
>
>
> What are you referencing when you talk about recovery ops per second?
>
> These are recovery ops as reported by ceph -s or via stats exported via
> influx plugin in mgr, and via local collectd collection.
>
> Also, what are the values for osd_recovery_sleep_hdd
> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
> that your BlueStore SSD OSDs are correctly reporting both themselves and
> their journals as non-rotational?
>
>
> This yields more interesting results.
> Pasting results for 3 sets of OSDs in this order
>  {0}hdd+nvme block.db
> {24}ssd+nvme block.db
> {59}ssd+nvme journal
>
> ceph osd metadata | grep 'id\|rotational'
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "1",
> "bluestore_bdev_rotational": "1",
> *"journal_rotational": "1",*
> "rotational": “1"
>
> "id": 24,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "0",
> "bluestore_bdev_rotational": "0",
> *"journal_rotational": "1",*
> "rotational": “0"
>
> "id": 59,
> "journal_rotational": "0",
> "rotational": “0"
>
>
> I wonder if it matters/is correct to see "journal_rotational": "1" for the
> bluestore OSD’s {0,24} with nvme block.db.
>
> Hope this may be helpful in determining the root cause.
>

If you have an SSD main store and a hard drive ("rotational") journal, the
OSD will insert recovery sleeps from the osd_recovery_sleep_hybrid config
option. By default that is .025 (seconds).

I believe you can override the setting (I'm not sure how), but you really
want to correct that flag at the OS layer. Generally when we see this
there's a RAID card or something between the solid-state device and the
host which is lying about the state of the world.
-Greg
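
A sketch of both workarounds - the device name is a placeholder, the sysfs
change does not survive a reboot, and the injectargs override only lasts until
the OSDs restart:

cat /sys/block/sdX/queue/rotational
echo 0 > /sys/block/sdX/queue/rotational
ceph tell 'osd.*' injectargs '--osd_recovery_sleep_hybrid 0'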


>
> If it helps, all of the OSD’s were originally deployed with ceph-deploy,
> but are now being redone with ceph-volume locally on each host.
>
> Thanks,
>
> Reed
>
> On Feb 26, 2018, at 1:00 PM, Gregory Farnum  wrote:
>
> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier  wrote:
>
>> After my last round of backfills completed, I started 5 more bluestore
>> conversions, which helped me recognize a very specific pattern of
>> performance.
>>
>> pool objects-ssd id 20
>>   recovery io 757 MB/s, 10845 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
>>
>>
>> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
>> When backfilling from bluestore SSD OSD’s, they appear to be throttled at
>> the aforementioned <20 ops per OSD.
>>
>
> Wait, is that the current state? What are you referencing when you talk
> about recovery ops per second?
>
> Also, what are the values for osd_recovery_sleep_hdd
> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
> that your BlueStore SSD OSDs are correctly reporting both themselves and
> their journals as non-rotational?
> -Greg
>
>
>>
>> This would corroborate why the first batch of SSD’s I migrated to
>> bluestore were all at “full” speed, as all of the OSD’s they were
>> backfilling from were filestore based, compared to increasingly bluestore
>> backfill targets, leading to increasingly long backfill times as I move
>> from one host to the next.
>>
>> Looking at the recovery settings, the recovery_sleep and
>> recovery_sleep_ssd values across bluestore or filestore OSDs are showing as
>> 0 values, which means no sleep/throttle if I am reading everything
>> correctly.
>>
>> sudo ceph daemon osd.73 config show | grep recovery
>> "osd_allow_recovery_below_min_size": "true",
>> "osd_debug_skip_full_check_in_recovery": "false",
>> "osd_force_recovery_pg_log_entries_factor": "1.30",
>> "osd_min_recovery_priority": "0",
>> "osd_recovery_cost": "20971520",
>> "osd_recovery_delay_start": "0.00",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_recovery_max_active": "35",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_max_omap_entries_per_chunk": "64000",
>> "osd_recovery_max_single_start": "1",
>> "osd_recovery_op_priority": "3",
>> "osd_recovery_op_warn_multiple": "16",
>> "osd_recovery_priority": "5",
>> "osd_recovery_retry_interval": "30.00",
>> *"osd_recovery_sleep": "0.00",*

Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread David Turner
Our problem only appeared to be present in bucket creation.  Listing,
putting, etc objects in a bucket work just fine regardless of the
bucket_location setting.  I ran this test on a few different realms to see
what would happen and only 1 of them had a problem.  There isn't an obvious
thing that stands out about it.  The 2 local realms do not have multi-site,
the internal realm has multi-site and the operations were performed on the
primary zone for the zonegroup.

Worked with non 'US' bucket_location for s3cmd to create bucket:
realm=internal
zonegroup=internal-ga
zone=internal-atl

Failed with non 'US' bucket_location for s3cmd to create bucket:
realm=local-atl
zonegroup=local-atl
zone=local-atl

Worked with non 'US' bucket_location for s3cmd to create bucket:
realm=local
zonegroup=local
zone=local

I was thinking it might have to do with all of the parts being named the
same, but I made sure to do the last test to confirm.  Interestingly it's
only bucket creation that has a problem and it's fine as long as I put 'US'
as the bucket_location.
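
For reference, a minimal way to reproduce the check with s3cmd - endpoint and
bucket names are placeholders, and credentials are assumed to be configured in
~/.s3cfg:

s3cmd --host=rgw.example:8080 --host-bucket=rgw.example:8080 \
  --bucket-location=US mb s3://test-bucket
s3cmd --host=rgw.example:8080 --host-bucket=rgw.example:8080 \
  --bucket-location=somewhere-else mb s3://test-bucket2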

On Mon, Feb 19, 2018 at 6:48 PM F21  wrote:

> I am using the official ceph/daemon docker image. It starts RGW and
> creates a zonegroup and zone with their names set to an empty string:
>
> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/start_rgw.sh#L36:54
>
> $RGW_ZONEGROUP and $RGW_ZONE are both empty strings by default:
>
> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/variables_entrypoint.sh#L46
>
> Here's what I get when I query RGW:
>
> $ radosgw-admin zonegroup list
> {
>  "default_info": "",
>  "zonegroups": [
>  "default"
>  ]
> }
>
> $ radosgw-admin zone list
> {
>  "default_info": "",
>  "zones": [
>  "default"
>  ]
> }
>
> On 20/02/2018 10:33 AM, Yehuda Sadeh-Weinraub wrote:
> > What is the name of your zonegroup?
> >
> > On Mon, Feb 19, 2018 at 3:29 PM, F21  wrote:
> >> I've done some debugging and the LocationConstraint is not being set by
> the
> >> SDK by default.
> >>
> >> I do, however, need to set the region on the client to us-east-1 for it
> to
> >> work. Anything else will return an InvalidLocationConstraint error.
> >>
> >> Francis
> >>
> >>
> >> On 20/02/2018 8:40 AM, Yehuda Sadeh-Weinraub wrote:
> >>> Sounds like the go sdk adds a location constraint to requests that
>>> don't go to us-east-1. RGW itself definitely isn't tied to
> >>> us-east-1, and does not know anything about it (unless you happen to
> >>> have a zonegroup named us-east-1). Maybe there's a way to configure
> >>> the sdk to avoid doing that?
> >>>
> >>> Yehuda
> >>>
> >>> On Sun, Feb 18, 2018 at 1:54 PM, F21  wrote:
>  I am using the AWS Go SDK v2 (https://github.com/aws/aws-sdk-go-v2)
> to
>  talk
>  to my RGW instance using the s3 interface. I am running ceph in docker
>  using
>  the ceph/daemon docker images in demo mode. The RGW is started with a
>  zonegroup and zone with their names set to an empty string by the
> scripts
>  in
>  the image.
> 
>  I have ForcePathStyle for the client set to true, because I want to
>  access
>  all my buckets using the path: myrgw.instance:8080/somebucket.
> 
>  I noticed that if I set the region for the client to anything other
> than
>  us-east-1, I get this error when creating a bucket:
>  InvalidLocationConstraint: The specified location-constraint is not
>  valid.
> 
>  If I set the region in the client to something made up, such as "ceph"
>  and
>  the LocationConstraint to "ceph", I still get the same error.
> 
>  The only way to get my buckets to create successfully is to set the
>  client's
>  region to us-east-1. I have grepped the ceph code base and cannot find
>  any
>  references to us-east-1. In addition, I looked at the AWS docs for
>  calculating v4 signatures and us-east-1 is the default region but I
> can
>  see
>  that the region string is used in the calculation (i.e. the region is
> not
>  ignored when calculating the signature if it is set to us-east-1).
> 
>  Why do my buckets create successfully if I set the region in my s3
> client
>  to
>  us-east-1, but not otherwise? If I do not want to use us-east-1 as my
>  default region, for example, if I want us-west-1 as my default region,
>  what
>  should I be configuring in ceph?
> 
>  Thanks,
> 
>  Francis
> 
>  ___
>  ceph-users mailing list
>  ceph-users@lists.ceph.com
>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 11:26 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 20:09 schrieb Oliver Freyermuth:
> > Am 26.02.2018 um 19:56 schrieb Gregory Farnum:
> >>
> >>
> >> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >>
> >> Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
> >> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
> >> >  freyerm...@physik.uni-bonn.de>> wrote:
> >> >> Looking with:
> >> >> ceph daemon osd.2 perf dump
> >> >> I get:
> >> >> "bluefs": {
> >> >> "gift_bytes": 0,
> >> >> "reclaim_bytes": 0,
> >> >> "db_total_bytes": 84760592384,
> >> >> "db_used_bytes": 78920024064,
> >> >> "wal_total_bytes": 0,
> >> >> "wal_used_bytes": 0,
> >> >> "slow_total_bytes": 0,
> >> >> "slow_used_bytes": 0,
> >> >> so it seems this is almost exclusively RocksDB usage.
> >> >>
> >> >> Is this expected?
> >> >
> >> > Yes. The directory entries are stored in the omap of the objects.
> This
> >> > will be stored in the RocksDB backend of Bluestore.
> >> >
> >> >> Is there a recommendation on how much MDS storage is needed for
> a CephFS with 450 TB?
> >> >
> >> > It seems in the above test you're using about 1KB per inode
> (file).
> >> > Using that you can extrapolate how much space the data pool needs
> >> > based on your file system usage. (If all you're doing is filling
> the
> >> > file system with empty files, of course you're going to need an
> >> > unusually large metadata pool.)
> >> >
> >> Many thanks, this helps!
> >> We naturally hope our users will not do this, this stress test was
> a worst case -
> >> but the rough number (1 kB per inode) does indeed help a lot, and
> also the increase with modifications
> >> of the file as laid out by David.
> >>
> >> Is also the slow backfilling normal?
> >> Will such increase in storage (by many file modifications) at some
> point also be reduced, i.e.
> >> is the database compacted / can one trigger that / is there
> something like "SQL vacuum"?
> >>
> >> To also answer David's questions in parallel:
> >> - Concerning the slow backfill, I am only talking about the
> "metadata OSDs".
> >>   They are fully SSD backed, and have no separate device for
> block.db / WAL.
> >> - I adjusted backfills up to 128 for those metadata OSDs, the
> cluster is currently fully empty, i.e. no client's are doing anything.
> >>   There are no slow requests.
> >>   Since no clients are doing anything and the rest of the cluster
> is now clean (apart from the two backfilling OSDs),
> >>   right now there is also no memory pressure at all.
> >>   The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load
> each.
> >>   The OSDs being backfilled have 3.3 % CPU load, and have about 250
> kB/s of write throughput.
> >>   Network traffic between the node with the clean OSDs and the
> "being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly
> more bandwidth available...
> >> - Checking sleeps with:
> >> # ceph -n osd.1 --show-config | grep sleep
> >> osd_recovery_sleep = 0.00
> >> osd_recovery_sleep_hdd = 0.10
> >> osd_recovery_sleep_hybrid = 0.025000
> >> osd_recovery_sleep_ssd = 0.00
> >> shows there should be 0 sleep. Or is there another way to query?
> >>
> >>
> >> Check if the OSDs are reporting their stores or their journals to be
> "rotational" via "ceph osd metadata"?
> >
> > I find:
> > "bluestore_bdev_model": "Micron_5100_MTFD",
> > "bluestore_bdev_partition_path": "/dev/sda2",
> > "bluestore_bdev_rotational": "0",
> > "bluestore_bdev_size": "239951482880",
> > "bluestore_bdev_type": "ssd",
> > [...]
> > "rotational": "0"
> >
> > for all of them (obviously with different device paths).
> > Also, they've been assigned the ssd device class automatically:
> > # ceph osd df | head
> > ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
> >   0   ssd 0.21829  1.0  223G 11310M  212G  4.94 0.94   0
> >   1   ssd 0.21829  1.0  223G 11368M  212G  4.97 0.95   0
> >   2   ssd 0.21819  1.0  223G 76076M  149G 33.25 6.35 128
> >   3   ssd 0.21819  1.0  223G 76268M  148G 33.33 6.37 128
> >
> > So this should not be the reason...
> >
>
> Checking again with the nice "grep" expression from the other thread
> concerning bluestore backfilling...
> # ceph osd metadata | grep 'id\|rotational'
> yields:
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": "0"
> "id": 1,
> "bluefs_db_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "jou

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 20:23 schrieb Gregory Farnum:
> 
> 
> On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 26.02.2018 um 19:45 schrieb Gregory Farnum:
> > On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de> 
>  >> wrote:
> >
> >     Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
> >     > I don’t actually know this option, but based on your results it’s 
> clear that “fast read” is telling the OSD it should issue reads to all k+m 
> OSDs storing data and then reconstruct the data from the first k to respond. 
> Without the fast read it simply asks the regular k data nodes to read it back 
> straight and sends the reply back. This is a straight trade off of more 
> bandwidth for lower long-tail latencies.
> >     > -Greg
> >
> >     Many thanks, this certainly explains it!
> >     Apparently I misunderstood how "normal" read works - I thought that 
> in any case, all shards would be requested, and the primary OSD would check 
> EC is still fine.
> >
> >
> > Nope, EC PGs can self-validate (they checksum everything) and so extra 
> shards are requested only if one of the OSDs has an error.
> >  
> >
> >     However, with the explanation that indeed only the actual "k" 
> shards are read in the "normal" case, it's fully clear to me that "fast_read" 
> will be slower for us,
> >     since we are limited by network bandwidth.
> >
> >     On a side-note, activating fast_read also appears to increase CPU 
> load a bit, which is then probably due to the EC calculations that need to be 
> performed if the "wrong"
> >     shards arrived at the primary OSD first.
> >
> >     I believe this also explains why an EC pool actually does remapping 
> in a k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
> >     Namely, to have the "k" shards available on the "up" OSDs. This 
> answers an earlier question of mine.
> >
> >
> > I don't quite understand what you're asking/saying here, but if an OSD 
> gets marked out all the PGs that used to rely on it will get another OSD 
> unless you've instructed the cluster not to do so. The specifics of any given 
> erasure code have nothing to do with it. :)
> > -Greg
> 
> Ah, sorry, let me clarify.
> The EC pool I am considering is k=4 m=2 with failure domain host, on 6 
> hosts.
> So necessarily, there is one shard for each host. If one host goes down 
> for a prolonged time,
> there's no "logical" advantage of redistributing things - since whatever 
> you do, with 5 hosts, all PGs will stay in degraded state anyways.
> 
> However, I noticed Ceph is remapping all PGs, and actively moving data. I 
> presume now this is done for two reasons:
> - The remapping is needed since the primary OSD might be the one which 
> went down. But for remapping (I guess) there's no need to actually move data,
>   or is there?
> - The data movement is done to have the "k" shards available.
> If it's really the case that "all shards are equal", then data movement 
> should not occur - or is this a bug / bad feature?
> 
> 
> If you lose one OSD out of a host, Ceph is going to try and re-replicate the 
> data onto the other OSDs in that host. Your PG size and the CRUSH rule 
> instructs it that the PG needs 6 different OSDs, and those OSDs need to be 
> placed on different hosts.
> 
> You're right that gets very funny if your PG size is equal to the number of 
> hosts. We generally discourage people from running configurations like that.

Yes. k=4 with m=2 on 6 hosts (i.e. the possibility to lose 2 hosts) would be our
starting point - we may add more hosts later (not too soon-ish, but it's not
excluded that more may come in a year or so), and migrating large EC pools to
different settings still seems a bit messy.
We can't really afford to reduce available storage significantly more in the
current setup, and would like to have the possibility to lose one host (for
example for an OS upgrade) and then still lose a few disks in case they fail
with bad timing.

> 
> Or if you mean that you are losing a host, and the data is shuffling around 
> on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools' 
> "indep" rather than "firstn" crush rules?)

They are indep, which I think is the default (no manual editing done). I 
thought the main goal of indep was exactly to reduce data movement. 
Indeed, it's very funny that data is moved; it certainly does not help to 
increase redundancy ;-). 

Cheers,
Oliver

> -Greg





smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread Yehuda Sadeh-Weinraub
I'm not sure if the rgw logs (debug rgw = 20) specify explicitly why a
bucket creation is rejected in these cases, but it might be worth
trying to look at these. If not, then a tcpdump of the specific failed
request might shed some light (would be interesting to look at the
generated LocationConstraint).
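
For example, something along these lines - the admin socket path, client name
and port are placeholders:

ceph daemon /var/run/ceph/ceph-client.rgw.gateway.asok config set debug_rgw 20
tcpdump -i any -A -s 0 'port 8080' | grep -i -A 2 locationconstraint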

Yehuda

On Mon, Feb 26, 2018 at 11:29 AM, David Turner  wrote:
> Our problem only appeared to be present in bucket creation.  Listing,
> putting, etc objects in a bucket work just fine regardless of the
> bucket_location setting.  I ran this test on a few different realms to see
> what would happen and only 1 of them had a problem.  There isn't an obvious
> thing that stands out about it.  The 2 local realms do not have multi-site,
> the internal realm has multi-site and the operations were performed on the
> primary zone for the zonegroup.
>
> Worked with non 'US' bucket_location for s3cmd to create bucket:
> realm=internal
> zonegroup=internal-ga
> zone=internal-atl
>
> Failed with non 'US' bucket_location for s3cmd to create bucket:
> realm=local-atl
> zonegroup=local-atl
> zone=local-atl
>
> Worked with non 'US' bucket_location for s3cmd to create bucket:
> realm=local
> zonegroup=local
> zone=local
>
> I was thinking it might have to do with all of the parts being named the
> same, but I made sure to do the last test to confirm.  Interestingly it's
> only bucket creation that has a problem and it's fine as long as I put 'US'
> as the bucket_location.
>
> On Mon, Feb 19, 2018 at 6:48 PM F21  wrote:
>>
>> I am using the official ceph/daemon docker image. It starts RGW and
>> creates a zonegroup and zone with their names set to an empty string:
>>
>> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/start_rgw.sh#L36:54
>>
>> $RGW_ZONEGROUP and $RGW_ZONE are both empty strings by default:
>>
>> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/variables_entrypoint.sh#L46
>>
>> Here's what I get when I query RGW:
>>
>> $ radosgw-admin zonegroup list
>> {
>>  "default_info": "",
>>  "zonegroups": [
>>  "default"
>>  ]
>> }
>>
>> $ radosgw-admin zone list
>> {
>>  "default_info": "",
>>  "zones": [
>>  "default"
>>  ]
>> }
>>
>> On 20/02/2018 10:33 AM, Yehuda Sadeh-Weinraub wrote:
>> > What is the name of your zonegroup?
>> >
>> > On Mon, Feb 19, 2018 at 3:29 PM, F21  wrote:
>> >> I've done some debugging and the LocationConstraint is not being set by
>> >> the
>> >> SDK by default.
>> >>
>> >> I do, however, need to set the region on the client to us-east-1 for it
>> >> to
>> >> work. Anything else will return an InvalidLocationConstraint error.
>> >>
>> >> Francis
>> >>
>> >>
>> >> On 20/02/2018 8:40 AM, Yehuda Sadeh-Weinraub wrote:
>> >>> Sounds like the go sdk adds a location constraint to requests that
>> >>> don't go to us-east-1. RGW itself definitely isn't tied to
>> >>> us-east-1, and does not know anything about it (unless you happen to
>> >>> have a zonegroup named us-east-1). Maybe there's a way to configure
>> >>> the sdk to avoid doing that?
>> >>>
>> >>> Yehuda
>> >>>
>> >>> On Sun, Feb 18, 2018 at 1:54 PM, F21  wrote:
>>  I am using the AWS Go SDK v2 (https://github.com/aws/aws-sdk-go-v2)
>>  to
>>  talk
>>  to my RGW instance using the s3 interface. I am running ceph in
>>  docker
>>  using
>>  the ceph/daemon docker images in demo mode. The RGW is started with a
>>  zonegroup and zone with their names set to an empty string by the
>>  scripts
>>  in
>>  the image.
>> 
>>  I have ForcePathStyle for the client set to true, because I want to
>>  access
>>  all my buckets using the path: myrgw.instance:8080/somebucket.
>> 
>>  I noticed that if I set the region for the client to anything other
>>  than
>>  us-east-1, I get this error when creating a bucket:
>>  InvalidLocationConstraint: The specified location-constraint is not
>>  valid.
>> 
>>  If I set the region in the client to something made up, such as
>>  "ceph"
>>  and
>>  the LocationConstraint to "ceph", I still get the same error.
>> 
>>  The only way to get my buckets to create successfully is to set the
>>  client's
>>  region to us-east-1. I have grepped the ceph code base and cannot
>>  find
>>  any
>>  references to us-east-1. In addition, I looked at the AWS docs for
>>  calculating v4 signatures and us-east-1 is the default region but I
>>  can
>>  see
>>  that the region string is used in the calculation (i.e. the region is
>>  not
>>  ignored when calculating the signature if it is set to us-east-1).
>> 
>>  Why do my buckets create successfully if I set the region in my s3
>>  client
>>  to
>>  us-east-1, but not otherwise? If I do not want to use us-east-1 as my
>>  default region, for example, if I want us-west-1 as

Re: [ceph-users] Storage usage of CephFS-MDS

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 20:31 schrieb Gregory Farnum:
> On Mon, Feb 26, 2018 at 11:26 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 26.02.2018 um 20:09 schrieb Oliver Freyermuth:
> > Am 26.02.2018 um 19:56 schrieb Gregory Farnum:
> >>
> >>
> >> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de> 
>  >> wrote:
> >>
> >>     Am 26.02.2018 um 16:59 schrieb Patrick Donnelly:
> >>     > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
> >>     >    >> wrote:
> >>     >> Looking with:
> >>     >> ceph daemon osd.2 perf dump
> >>     >> I get:
> >>     >>     "bluefs": {
> >>     >>         "gift_bytes": 0,
> >>     >>         "reclaim_bytes": 0,
> >>     >>         "db_total_bytes": 84760592384,
> >>     >>         "db_used_bytes": 78920024064,
> >>     >>         "wal_total_bytes": 0,
> >>     >>         "wal_used_bytes": 0,
> >>     >>         "slow_total_bytes": 0,
> >>     >>         "slow_used_bytes": 0,
> >>     >> so it seems this is almost exclusively RocksDB usage.
> >>     >>
> >>     >> Is this expected?
> >>     >
> >>     > Yes. The directory entries are stored in the omap of the 
> objects. This
> >>     > will be stored in the RocksDB backend of Bluestore.
> >>     >
> >>     >> Is there a recommendation on how much MDS storage is needed for 
> a CephFS with 450 TB?
> >>     >
> >>     > It seems in the above test you're using about 1KB per inode 
> (file).
> >>     > Using that you can extrapolate how much space the data pool needs
> >>     > based on your file system usage. (If all you're doing is filling 
> the
> >>     > file system with empty files, of course you're going to need an
> >>     > unusually large metadata pool.)
> >>     >
> >>     Many thanks, this helps!
> >>     We naturally hope our users will not do this, this stress test was 
> a worst case -
> >>     but the rough number (1 kB per inode) does indeed help a lot, and 
> also the increase with modifications
> >>     of the file as laid out by David.
> >>
> >>     Is also the slow backfilling normal?
> >>     Will such increase in storage (by many file modifications) at some 
> point also be reduced, i.e.
> >>     is the database compacted / can one trigger that / is there 
> something like "SQL vacuum"?
> >>
> >>     To also answer David's questions in parallel:
> >>     - Concerning the slow backfill, I am only talking about the 
> "metadata OSDs".
> >>       They are fully SSD backed, and have no separate device for 
> block.db / WAL.
> >>     - I adjusted backfills up to 128 for those metadata OSDs, the 
> cluster is currently fully empty, i.e. no client's are doing anything.
> >>       There are no slow requests.
> >>       Since no clients are doing anything and the rest of the cluster 
> is now clean (apart from the two backfilling OSDs),
> >>       right now there is also no memory pressure at all.
> >>       The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load 
> each.
> >>       The OSDs being backfilled have 3.3 % CPU load, and have about 
> 250 kB/s of write throughput.
> >>       Network traffic between the node with the clean OSDs and the 
> "being-bbackfilled" OSDs is about 1.5 Mbit/s, while there is significantly 
> more bandwidth available...
> >>     - Checking sleeps with:
> >>     # ceph -n osd.1 --show-config | grep sleep
> >>     osd_recovery_sleep = 0.00
> >>     osd_recovery_sleep_hdd = 0.10
> >>     osd_recovery_sleep_hybrid = 0.025000
> >>     osd_recovery_sleep_ssd = 0.00
> >>     shows there should be 0 sleep. Or is there another way to query?
> >>
> >>
> >> Check if the OSDs are reporting their stores or their journals to be 
> "rotational" via "ceph osd metadata"?
> >
> > I find:
> >         "bluestore_bdev_model": "Micron_5100_MTFD",
> >         "bluestore_bdev_partition_path": "/dev/sda2",
> >         "bluestore_bdev_rotational": "0",
> >         "bluestore_bdev_size": "239951482880",
> >         "bluestore_bdev_type": "ssd",
> > [...]
> >         "rotational": "0"
> >
> > for all of them (obviously with different device paths).
> > Also, they've been assigned the ssd device class automatically:
> > # ceph osd df | head
> > ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
> >   0   ssd 0.21829  1.0  223G 11310M  212G  4.94 0.94   0
> >   1   ssd 0.21829  1.0  223G 11368M  212G  4.97 0.95   0
> >   2   ssd 0.21819  1.0  223G 76076M  149G 33.25 6.35 128
> >   3   ssd 0

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 11:33 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 20:23 schrieb Gregory Farnum:
> >
> >
> > On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >
> > Am 26.02.2018 um 19:45 schrieb Gregory Farnum:
> > > On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de 
> >> wrote:
> > >
> > > Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
> > > > I don’t actually know this option, but based on your results
> it’s clear that “fast read” is telling the OSD it should issue reads to all
> k+m OSDs storing data and then reconstruct the data from the first k to
> respond. Without the fast read it simply asks the regular k data nodes to
> read it back straight and sends the reply back. This is a straight trade
> off of more bandwidth for lower long-tail latencies.
> > > > -Greg
> > >
> > > Many thanks, this certainly explains it!
> > > Apparently I misunderstood how "normal" read works - I thought
> that in any case, all shards would be requested, and the primary OSD would
> check EC is still fine.
> > >
> > >
> > > Nope, EC PGs can self-validate (they checksum everything) and so
> extra shards are requested only if one of the OSDs has an error.
> > >
> > >
> > > However, with the explanation that indeed only the actual "k"
> shards are read in the "normal" case, it's fully clear to me that
> "fast_read" will be slower for us,
> > > since we are limited by network bandwidth.
> > >
> > > On a side-note, activating fast_read also appears to increase
> CPU load a bit, which is then probably due to the EC calculations that need
> to be performed if the "wrong"
> > > shards arrived at the primary OSD first.
> > >
> > > I believe this also explains why an EC pool actually does
> remapping in a k=4 m=2 pool with failure domain host if one of 6 hosts goes
> down:
> > > Namely, to have the "k" shards available on the "up" OSDs.
> This answers an earlier question of mine.
> > >
> > >
> > > I don't quite understand what you're asking/saying here, but if an
> OSD gets marked out all the PGs that used to rely on it will get another
> OSD unless you've instructed the cluster not to do so. The specifics of any
> given erasure code have nothing to do with it. :)
> > > -Greg
> >
> > Ah, sorry, let me clarify.
> > The EC pool I am considering is k=4 m=2 with failure domain host, on
> 6 hosts.
> > So necessarily, there is one shard for each host. If one host goes
> down for a prolonged time,
> > there's no "logical" advantage of redistributing things - since
> whatever you do, with 5 hosts, all PGs will stay in degraded state anyways.
> >
> > However, I noticed Ceph is remapping all PGs, and actively moving
> data. I presume now this is done for two reasons:
> > - The remapping is needed since the primary OSD might be the one
> which went down. But for remapping (I guess) there's no need to actually
> move data,
> >   or is there?
> > - The data movement is done to have the "k" shards available.
> > If it's really the case that "all shards are equal", then data
> movement should not occur - or is this a bug / bad feature?
> >
> >
> > If you lose one OSD out of a host, Ceph is going to try and re-replicate
> the data onto the other OSDs in that host. Your PG size and the CRUSH rule
> instructs it that the PG needs 6 different OSDs, and those OSDs need to be
> placed on different hosts.
> >
> > You're right that gets very funny if your PG size is equal to the number
> of hosts. We generally discourage people from running configurations like
> that.
>
> Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts) would be
> our starting point - since we may add more hosts later (not too soon-ish,
> but it's not excluded more may come in a year or so),
> and migrating large EC pools to different settings still seems a bit messy.
> We can't really afford to reduce available storage significantly more in
> the current setup, and would like to have the possibility to lose one host
> (for example for an OS upgrade),
> and then still lose a few disks in case they fail with bad timing.
>
> >
> > Or if you mean that you are losing a host, and the data is shuffling
> around on the remaining hosts...hrm, that'd be weird. (Perhaps a result of
> EC pools' "indep" rather than "firstn" crush rules?)
>
> They are indep, which I think is the default (no manual editing done). I
> thought the main goal of indep was exactly to reduce data movement.
> Indeed, it's very funny that data is moved, it certainly does not help to
> increase redundancy ;-).


Given that you're stuck in that state, you probably

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 20:42 schrieb Gregory Farnum:
> On Mon, Feb 26, 2018 at 11:33 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 26.02.2018 um 20:23 schrieb Gregory Farnum:
> >
> >
> > On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth 
> > <freyerm...@physik.uni-bonn.de> wrote:
> >
> >     On 26.02.2018 at 19:45, Gregory Farnum wrote:
> >     > On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth 
> >     > <freyerm...@physik.uni-bonn.de> wrote:
> >     >     On 26.02.2018 at 19:24, Gregory Farnum wrote:
> >     >     > I don’t actually know this option, but based on your 
> results it’s clear that “fast read” is telling the OSD it should issue reads 
> to all k+m OSDs storing data and then reconstruct the data from the first k 
> to respond. Without the fast read it simply asks the regular k data nodes to 
> read it back straight and sends the reply back. This is a straight trade off 
> of more bandwidth for lower long-tail latencies.
> >     >     > -Greg
> >     >
> >     >     Many thanks, this certainly explains it!
> >     >     Apparently I misunderstood how "normal" read works - I 
> thought that in any case, all shards would be requested, and the primary OSD 
> would check EC is still fine.
> >     >
> >     >
> >     > Nope, EC PGs can self-validate (they checksum everything) and so 
> extra shards are requested only if one of the OSDs has an error.
> >     >  
> >     >
> >     >     However, with the explanation that indeed only the actual "k" 
> shards are read in the "normal" case, it's fully clear to me that "fast_read" 
> will be slower for us,
> >     >     since we are limited by network bandwidth.
> >     >
> >     >     On a side-note, activating fast_read also appears to increase 
> CPU load a bit, which is then probably due to the EC calculations that need 
> to be performed if the "wrong"
> >     >     shards arrived at the primary OSD first.
> >     >
> >     >     I believe this also explains why an EC pool actually does 
> remapping in a k=4 m=2 pool with failure domain host if one of 6 hosts goes 
> down:
> >     >     Namely, to have the "k" shards available on the "up" OSDs. 
> This answers an earlier question of mine.
> >     >
> >     >
> >     > I don't quite understand what you're asking/saying here, but if 
> an OSD gets marked out all the PGs that used to rely on it will get another 
> OSD unless you've instructed the cluster not to do so. The specifics of any 
> given erasure code have nothing to do with it. :)
> >     > -Greg
> >
> >     Ah, sorry, let me clarify.
> >     The EC pool I am considering is k=4 m=2 with failure domain host, 
> on 6 hosts.
> >     So necessarily, there is one shard for each host. If one host goes 
> down for a prolonged time,
> >     there's no "logical" advantage of redistributing things - since 
> whatever you do, with 5 hosts, all PGs will stay in degraded state anyways.
> >
> >     However, I noticed Ceph is remapping all PGs, and actively moving 
> data. I presume now this is done for two reasons:
> >     - The remapping is needed since the primary OSD might be the one 
> which went down. But for remapping (I guess) there's no need to actually move 
> data,
> >       or is there?
> >     - The data movement is done to have the "k" shards available.
> >     If it's really the case that "all shards are equal", then data 
> movement should not occur - or is this a bug / bad feature?
> >
> >
> > If you lose one OSD out of a host, Ceph is going to try and 
> re-replicate the data onto the other OSDs in that host. Your PG size and the 
> CRUSH rule instructs it that the PG needs 6 different OSDs, and those OSDs 
> need to be placed on different hosts.
> >
> > You're right that gets very funny if your PG size is equal to the 
> number of hosts. We generally discourage people from running configurations 
> like that.
> 
> Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts) would 
> be our starting point - since we may add more hosts later (not too soon-ish, 
> but it's not excluded more may come in a year or so),
> and migrating large EC pools to different settings still seems a bit 
> messy.
> We can't really afford to reduce available storage significantly more in 
> the current setup, and would like to have the possibility to lose one host 
> (for example for an OS upgrade),
> and then still lose a few disks in case they fail with bad timing.
> 
> >
> > 

[ceph-users] planning a new cluster

2018-02-26 Thread Frank Ritchie
Hi all,

I am planning for a new Ceph cluster that will provide RBD storage for
OpenStack and Kubernetes. Additionally, there may be a need for a small
amount of RGW storage.

Which option would be better:

1. Defining separate pools for OpenStack images/ephemeral
vms/volumes/backups (as seen here https://ceph.com/pgcalc/) along with
pools for Kubernetes and RGW.

2. Define a single block storage pool (to be used by OpenStack and
Kubernetes) and an object pool (for RGW).

I am not sure how much space each component will require at this time.

thx
Frank
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread David Turner
I run with `debug rgw = 10` and was able to find these lines at the end of
a request to create the bucket.

Successfully creating a bucket with `bucket_location = US` looks like
[1]this.  Failing to create a bucket has "ERROR: S3 error: 400
(InvalidLocationConstraint): The specified location-constraint is not
valid" on the CLI and [2]this (excerpt from the end of the request) in the
rgw log (debug level 10).  "create bucket location constraint" was not
found in the log for successfully creating the bucket.


[1]
2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
info.flags=0x17
2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
to cache LRU end
2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
name=user.rgw.acl bl.length()=141
2018-02-26 19:52:36.423863 7f4bc9bc8700 10 RGWWatcher::handle_notify()
notify_id 344855809097728 cookie 139963970426880 notifier 39099765
bl.length()=361
2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
name=local-atl.rgw.data.root++testerton info.flags=0x17
2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
local-atl.rgw.data.root++testerton to cache LRU end

[2]
2018-02-26 19:43:37.340289 7f466bbca700  2 req 428078:0.004204:s3:PUT
/testraint/:create_bucket:executing
2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
do_aws4_auth_completion
2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
do_aws4_auth_completion
2018-02-26 19:43:37.340715 7f466bbca700 10 create bucket location
constraint: cn
2018-02-26 19:43:37.340766 7f466bbca700  0 location constraint (cn) can't
be found.
2018-02-26 19:43:37.340794 7f466bbca700  2 req 428078:0.004701:s3:PUT
/testraint/:create_bucket:completing
2018-02-26 19:43:37.341782 7f466bbca700  2 req 428078:0.005689:s3:PUT
/testraint/:create_bucket:op status=-2208
2018-02-26 19:43:37.341792 7f466bbca700  2 req 428078:0.005707:s3:PUT
/testraint/:create_bucket:http status=400
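(For completeness, the failing request in excerpt [2] was produced with an
s3cmd setup roughly like the following sketch; the endpoint is a placeholder,
while bucket_location and the bucket name match the log above.)

# ~/.s3cfg (relevant lines only)
host_base = rgw.example.com:7480
host_bucket = rgw.example.com:7480
bucket_location = cn

$ s3cmd mb s3://testraint
ERROR: S3 error: 400 (InvalidLocationConstraint): The specified location-constraint is not valid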

On Mon, Feb 26, 2018 at 2:36 PM Yehuda Sadeh-Weinraub 
wrote:

> I'm not sure if the rgw logs (debug rgw = 20) specify explicitly why a
> bucket creation is rejected in these cases, but it might be worth
> trying to look at these. If not, then a tcpdump of the specific failed
> request might shed some light (would be interesting to look at the
> generated LocationConstraint).
>
> Yehuda
>
> On Mon, Feb 26, 2018 at 11:29 AM, David Turner 
> wrote:
> > Our problem only appeared to be present in bucket creation.  Listing,
> > putting, etc objects in a bucket work just fine regardless of the
> > bucket_location setting.  I ran this test on a few different realms to
> see
> > what would happen and only 1 of them had a problem.  There isn't an
> obvious
> > thing that steps out about it.  The 2 local realms do not have
> multi-site,
> > the internal realm has multi-site and the operations were performed on
> the
> > primary zone for the zonegroup.
> >
> > Worked with non 'US' bucket_location for s3cmd to create bucket:
> > realm=internal
> > zonegroup=internal-ga
> > zone=internal-atl
> >
> > Failed with non 'US' bucket_location for s3cmd to create bucket:
> > realm=local-atl
> > zonegroup=local-atl
> > zone=local-atl
> >
> > Worked with non 'US' bucket_location for s3cmd to create bucket:
> > realm=local
> > zonegroup=local
> > zone=local
> >
> > I was thinking it might have to do with all of the parts being named the
> > same, but I made sure to do the last test to confirm.  Interestingly it's
> > only bucket creation that has a problem and it's fine as long as I put
> 'US'
> > as the bucket_location.
> >
> > On Mon, Feb 19, 2018 at 6:48 PM F21  wrote:
> >>
> >> I am using the official ceph/daemon docker image. It starts RGW and
> >> creates a zonegroup and zone with their names set to an empty string:
> >>
> >>
> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/start_rgw.sh#L36:54
> >>
> >> $RGW_ZONEGROUP and $RGW_ZONE are both empty strings by default:
> >>
> >>
> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/variables_entrypoint.sh#L46
> >>
> >> Here's what I get when I query RGW:
> >>
> >> $ radosgw-admin zonegroup list
> >> {
> >>  "default_info": "",
> >>  "zonegroups": [
> >>  "default"
> >>  ]
> >> }
> >>
> >> $ radosgw-admin zone list
> >> {
> >>  "default_info": "",
> >>  "zones": [
> >>  "default"
> >>  ]
> >> }
> >>
> >> On 20/02/2018 10:33 AM, Yehuda Sadeh-Weinraub wrote:
> >> > What is the name of your zonegroup?
> >> >
> >> > On Mon, Feb 19, 2018 at 3:29 PM, F21  wrote:
> >> >> I've done some debugging and the LocationConstraint is not being set
> by
> >> >> the
> >> >> SDK by default.
> >> >>
> >> >> I do, however, need to set the region on the client to us-east-1 for
> it
> >> >> to
> >> >> work. Anything else will retur

Re: [ceph-users] planning a new cluster

2018-02-26 Thread David Turner
Depending on what your security requirements are, you may not have a
choice.  If your OpenStack deployment shouldn't be able to load the
Kubernetes RBDs (or vice versa), then you need to keep them separate and
maintain different keyrings for the 2 services.  If that is how you go
about it, I would recommend starting with a relatively low number of PGs in
both pools, watching how the data ends up distributed between them by the
time you're 40-50% full, and increasing PG counts accordingly.  If you can
put them into the same pool, I don't see a reason why you shouldn't, unless
you foresee a time when you want to move one of them, but not the other, to
a new cluster or to faster storage.  Keeping them separate would let you
switch one pool to a different CRUSH rule to place it on different storage
in the same cluster, and some sort of rados copy tool would handle moving a
pool to a new cluster (which is less likely than changing the CRUSH rule
for a different type of storage).
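As a rough sketch of the separate-pools / separate-keyrings variant (pool
names, PG counts and client IDs below are only illustrative, and the
'profile rbd' caps assume Luminous or newer):

$ ceph osd pool create openstack-volumes 512
$ ceph osd pool create kubernetes 256
$ ceph osd pool application enable openstack-volumes rbd
$ ceph osd pool application enable kubernetes rbd
# one restricted keyring per service, so neither can read the other's RBDs
$ ceph auth get-or-create client.openstack mon 'profile rbd' \
      osd 'profile rbd pool=openstack-volumes'
$ ceph auth get-or-create client.kubernetes mon 'profile rbd' \
      osd 'profile rbd pool=kubernetes'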

On Mon, Feb 26, 2018 at 2:57 PM Frank Ritchie 
wrote:

> Hi all,
>
> I am planning for a new Ceph cluster that will provide RBD storage for
> OpenStack and Kubernetes. Additionally, there may be a need for a small
> amount of RGW storage.
>
> Which option would be better:
>
> 1. Defining separate pools for OpenStack images/ephemeral
> vms/volumes/backups (as seen here https://ceph.com/pgcalc/) along with
> pools for Kubernetes and RGW.
>
> 2. Define a single block storage pool (to be used by OpenStack and
> Kubernetes) and an object pool (for RGW).
>
> I am not sure how much space each component will require at this time.
>
> thx
> Frank
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
solution to getting the metadata configured correctly.
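(A runtime-only sketch of that workaround, for reference; osd.24 is just the
example OSD whose metadata is dumped below, and injected values do not
survive an OSD restart.)

$ ceph tell osd.24 injectargs '--osd_recovery_sleep_hybrid 0'
# verify on the OSD's host via the admin socket:
$ ceph daemon osd.24 config show | grep osd_recovery_sleep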

For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
with NVMe block.db.

> {
> "id": 24,
> "arch": "x86_64",
> "back_addr": "",
> "back_iface": "bond0",
> "bluefs": "1",
> "bluefs_db_access_mode": "blk",
> "bluefs_db_block_size": "4096",
> "bluefs_db_dev": "259:0",
> "bluefs_db_dev_node": "nvme0n1",
> "bluefs_db_driver": "KernelDevice",
> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> "bluefs_db_partition_path": "/dev/nvme0n1p4",
> "bluefs_db_rotational": "0",
> "bluefs_db_serial": " ",
> "bluefs_db_size": "16000221184",
> "bluefs_db_type": "nvme",
> "bluefs_single_shared_device": "0",
> "bluefs_slow_access_mode": "blk",
> "bluefs_slow_block_size": "4096",
> "bluefs_slow_dev": "253:8",
> "bluefs_slow_dev_node": "dm-8",
> "bluefs_slow_driver": "KernelDevice",
> "bluefs_slow_model": "",
> "bluefs_slow_partition_path": "/dev/dm-8",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_size": "1920378863616",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_access_mode": "blk",
> "bluestore_bdev_block_size": "4096",
> "bluestore_bdev_dev": "253:8",
> "bluestore_bdev_dev_node": "dm-8",
> "bluestore_bdev_driver": "KernelDevice",
> "bluestore_bdev_model": "",
> "bluestore_bdev_partition_path": "/dev/dm-8",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_size": "1920378863616",
> "bluestore_bdev_type": "ssd",
> "ceph_version": "ceph version 12.2.2 
> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
> "default_device_class": "ssd",
> "distro": "ubuntu",
> "distro_description": "Ubuntu 16.04.3 LTS",
> "distro_version": "16.04",
> "front_addr": "",
> "front_iface": "bond0",
> "hb_back_addr": "",
> "hb_front_addr": "",
> "hostname": “host00",
> "journal_rotational": "1",
> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 
> 2018",
> "kernel_version": "4.13.0-26-generic",
> "mem_swap_kb": "124999672",
> "mem_total_kb": "131914008",
> "os": "Linux",
> "osd_data": "/var/lib/ceph/osd/ceph-24",
> "osd_objectstore": "bluestore",
> "rotational": "0"
> }


So it looks like it guessed(?) the bluestore_bdev_type/default_device_class 
correctly (though that may have been an inherited value?), and bluefs_db_type 
was also correctly set to nvme.

So I’m not sure why journal_rotational is still showing 1.
Maybe something in the ceph-volume lvm piece that isn’t correctly setting that 
flag on OSD creation?
It also seems like the journal_rotational field should have been deprecated 
for bluestore, since bluefs_db_rotational should cover that; if there were a 
WAL partition as well, I assume there would be something along the lines of 
bluefs_wal_rotational, and the journal field would never be used for 
bluestore?

Appreciate the help.

Thanks,
Reed

> On Feb 26, 2018, at 1:28 PM, Gregory Farnum  wrote:
> 
> On Mon, Feb 26, 2018 at 11:21 AM Reed Dier  wrote:
> The ‘good perf’ that I reported below was the result of beginning 5 new 
> bluestore conversions which results in a leading edge of ‘good’ performance, 
> before trickling off.
> 
> This performance lasted about 20 minutes, where it backfilled a small set of 
> PGs off of non-bluestore OSDs.
> 
> Current performance is now hovering around:
>> pool objects-ssd id 20
>>   recovery io 14285 kB/s, 202 objects/s
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
> 
>> What are you referencing when you talk about recovery ops per second?
> 
> These are recovery ops as reported by ceph -s or via stats exported via 
> influx plugin in mgr, and via local collectd collection.
> 
>> Also, what are the values for osd_recovery_sleep_hdd and 
>> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
>> your BlueStore SSD OSDs are correctly reporting both themselves and their 
>> journals as non-rotational?
> 
> This yields more interesting results.
> Pasting results for 3 sets of OSDs in this order
>  {0}hdd+nvme block.db
> {24}ssd+nvme block.db
> {59}ssd+nvme journal
> 
>> ceph osd metadata | grep 'id\|rotational'
>> "id": 0,
>> "bluefs_db_rotational": "0",
>> "bluefs_slow_rotational": "1",
>> "bluestore_bdev_rotational": "1",
>> "journal_rotational": "1",
>> "rotational": “1"
>> "id":

Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread Yehuda Sadeh-Weinraub
According to the log here, it says that the location constraint it got
is "cn", can you take a look at a tcpdump, see if that's actually
what's passed in?

On Mon, Feb 26, 2018 at 12:02 PM, David Turner  wrote:
> I run with `debug rgw = 10` and was able to find these lines at the end of a
> request to create the bucket.
>
> Successfully creating a bucket with `bucket_location = US` looks like
> [1]this.  Failing to create a bucket has "ERROR: S3 error: 400
> (InvalidLocationConstraint): The specified location-constraint is not valid"
> on the CLI and [2]this (excerpt from the end of the request) in the rgw log
> (debug level 10).  "create bucket location constraint" was not found in the
> log for successfully creating the bucket.
>
>
> [1]
> 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
> name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> info.flags=0x17
> 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
> local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> to cache LRU end
> 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr: name=user.rgw.acl
> bl.length()=141
> 2018-02-26 19:52:36.423863 7f4bc9bc8700 10 RGWWatcher::handle_notify()
> notify_id 344855809097728 cookie 139963970426880 notifier 39099765
> bl.length()=361
> 2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
> name=local-atl.rgw.data.root++testerton info.flags=0x17
> 2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
> local-atl.rgw.data.root++testerton to cache LRU end
>
> [2]
> 2018-02-26 19:43:37.340289 7f466bbca700  2 req 428078:0.004204:s3:PUT
> /testraint/:create_bucket:executing
> 2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
> do_aws4_auth_completion
> 2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
> do_aws4_auth_completion
> 2018-02-26 19:43:37.340715 7f466bbca700 10 create bucket location
> constraint: cn
> 2018-02-26 19:43:37.340766 7f466bbca700  0 location constraint (cn) can't be
> found.
> 2018-02-26 19:43:37.340794 7f466bbca700  2 req 428078:0.004701:s3:PUT
> /testraint/:create_bucket:completing
> 2018-02-26 19:43:37.341782 7f466bbca700  2 req 428078:0.005689:s3:PUT
> /testraint/:create_bucket:op status=-2208
> 2018-02-26 19:43:37.341792 7f466bbca700  2 req 428078:0.005707:s3:PUT
> /testraint/:create_bucket:http status=400
>
> On Mon, Feb 26, 2018 at 2:36 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> I'm not sure if the rgw logs (debug rgw = 20) specify explicitly why a
>> bucket creation is rejected in these cases, but it might be worth
>> trying to look at these. If not, then a tcpdump of the specific failed
>> request might shed some light (would be interesting to look at the
>> generated LocationConstraint).
>>
>> Yehuda
>>
>> On Mon, Feb 26, 2018 at 11:29 AM, David Turner 
>> wrote:
>> > Our problem only appeared to be present in bucket creation.  Listing,
>> > putting, etc objects in a bucket work just fine regardless of the
>> > bucket_location setting.  I ran this test on a few different realms to
>> > see
>> > what would happen and only 1 of them had a problem.  There isn't an
>> > obvious
>> > thing that steps out about it.  The 2 local realms do not have
>> > multi-site,
>> > the internal realm has multi-site and the operations were performed on
>> > the
>> > primary zone for the zonegroup.
>> >
>> > Worked with non 'US' bucket_location for s3cmd to create bucket:
>> > realm=internal
>> > zonegroup=internal-ga
>> > zone=internal-atl
>> >
>> > Failed with non 'US' bucket_location for s3cmd to create bucket:
>> > realm=local-atl
>> > zonegroup=local-atl
>> > zone=local-atl
>> >
>> > Worked with non 'US' bucket_location for s3cmd to create bucket:
>> > realm=local
>> > zonegroup=local
>> > zone=local
>> >
>> > I was thinking it might have to do with all of the parts being named the
>> > same, but I made sure to do the last test to confirm.  Interestingly
>> > it's
>> > only bucket creation that has a problem and it's fine as long as I put
>> > 'US'
>> > as the bucket_location.
>> >
>> > On Mon, Feb 19, 2018 at 6:48 PM F21  wrote:
>> >>
>> >> I am using the official ceph/daemon docker image. It starts RGW and
>> >> creates a zonegroup and zone with their names set to an empty string:
>> >>
>> >>
>> >> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/start_rgw.sh#L36:54
>> >>
>> >> $RGW_ZONEGROUP and $RGW_ZONE are both empty strings by default:
>> >>
>> >>
>> >> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/variables_entrypoint.sh#L46
>> >>
>> >> Here's what I get when I query RGW:
>> >>
>> >> $ radosgw-admin zonegroup list
>> >> {
>> >>  "default_info": "",
>> >>  "zonegroups": [
>> >>  "default"
>> >>  ]
>> >> }
>> >>
>> >> $ radosgw-admin zone list
>> >> {
>> >>  "default_info": "",
>> >>  "zones": [
>> >>  "default"
>> >>  ]
>> >> }
>> >>
>> >> On 20/02/201

Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread David Turner
That's what I set it to in the config file.  I probably should have
mentioned that.

On Mon, Feb 26, 2018 at 4:07 PM Yehuda Sadeh-Weinraub 
wrote:

> According to the log here, it says that the location constraint it got
> is "cn", can you take a look at a tcpdump, see if that's actually
> what's passed in?
>
> On Mon, Feb 26, 2018 at 12:02 PM, David Turner 
> wrote:
> > I run with `debug rgw = 10` and was able to find these lines at the end
> of a
> > request to create the bucket.
> >
> > Successfully creating a bucket with `bucket_location = US` looks like
> > [1]this.  Failing to create a bucket has "ERROR: S3 error: 400
> > (InvalidLocationConstraint): The specified location-constraint is not
> valid"
> > on the CLI and [2]this (excerpt from the end of the request) in the rgw
> log
> > (debug level 10).  "create bucket location constraint" was not found in
> the
> > log for successfully creating the bucket.
> >
> >
> > [1]
> > 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
> >
> name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> > info.flags=0x17
> > 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
> >
> local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> > to cache LRU end
> > 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
> name=user.rgw.acl
> > bl.length()=141
> > 2018-02-26 19:52:36.423863 7f4bc9bc8700 10 RGWWatcher::handle_notify()
> > notify_id 344855809097728 cookie 139963970426880 notifier 39099765
> > bl.length()=361
> > 2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
> > name=local-atl.rgw.data.root++testerton info.flags=0x17
> > 2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
> > local-atl.rgw.data.root++testerton to cache LRU end
> >
> > [2]
> > 2018-02-26 19:43:37.340289 7f466bbca700  2 req 428078:0.004204:s3:PUT
> > /testraint/:create_bucket:executing
> > 2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
> > do_aws4_auth_completion
> > 2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
> > do_aws4_auth_completion
> > 2018-02-26 19:43:37.340715 7f466bbca700 10 create bucket location
> > constraint: cn
> > 2018-02-26 19:43:37.340766 7f466bbca700  0 location constraint (cn)
> can't be
> > found.
> > 2018-02-26 19:43:37.340794 7f466bbca700  2 req 428078:0.004701:s3:PUT
> > /testraint/:create_bucket:completing
> > 2018-02-26 19:43:37.341782 7f466bbca700  2 req 428078:0.005689:s3:PUT
> > /testraint/:create_bucket:op status=-2208
> > 2018-02-26 19:43:37.341792 7f466bbca700  2 req 428078:0.005707:s3:PUT
> > /testraint/:create_bucket:http status=400
> >
> > On Mon, Feb 26, 2018 at 2:36 PM Yehuda Sadeh-Weinraub
> > wrote:
> >>
> >> I'm not sure if the rgw logs (debug rgw = 20) specify explicitly why a
> >> bucket creation is rejected in these cases, but it might be worth
> >> trying to look at these. If not, then a tcpdump of the specific failed
> >> request might shed some light (would be interesting to look at the
> >> generated LocationConstraint).
> >>
> >> Yehuda
> >>
> >> On Mon, Feb 26, 2018 at 11:29 AM, David Turner 
> >> wrote:
> >> > Our problem only appeared to be present in bucket creation.  Listing,
> >> > putting, etc objects in a bucket work just fine regardless of the
> >> > bucket_location setting.  I ran this test on a few different realms to
> >> > see
> >> > what would happen and only 1 of them had a problem.  There isn't an
> >> > obvious
> >> > thing that steps out about it.  The 2 local realms do not have
> >> > multi-site,
> >> > the internal realm has multi-site and the operations were performed on
> >> > the
> >> > primary zone for the zonegroup.
> >> >
> >> > Worked with non 'US' bucket_location for s3cmd to create bucket:
> >> > realm=internal
> >> > zonegroup=internal-ga
> >> > zone=internal-atl
> >> >
> >> > Failed with non 'US' bucket_location for s3cmd to create bucket:
> >> > realm=local-atl
> >> > zonegroup=local-atl
> >> > zone=local-atl
> >> >
> >> > Worked with non 'US' bucket_location for s3cmd to create bucket:
> >> > realm=local
> >> > zonegroup=local
> >> > zone=local
> >> >
> >> > I was thinking it might have to do with all of the parts being named
> the
> >> > same, but I made sure to do the last test to confirm.  Interestingly
> >> > it's
> >> > only bucket creation that has a problem and it's fine as long as I put
> >> > 'US'
> >> > as the bucket_location.
> >> >
> >> > On Mon, Feb 19, 2018 at 6:48 PM F21  wrote:
> >> >>
> >> >> I am using the official ceph/daemon docker image. It starts RGW and
> >> >> creates a zonegroup and zone with their names set to an empty string:
> >> >>
> >> >>
> >> >>
> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/start_rgw.sh#L36:54
> >> >>
> >> >> $RGW_ZONEGROUP and $RGW_ZONE are both empty strings by default:
> >> >>
> >> >>
> >> >>
> https://github.com/ceph/ceph-container/blob/master/ceph-releases/luminous/ubuntu/16.04/daemon/variabl

Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread David Turner
I'm also not certain how to do the tcpdump for this.  Do you have any
pointers to how to capture that for you?

On Mon, Feb 26, 2018 at 4:09 PM David Turner  wrote:

> That's what I set it to in the config file.  I probably should have
> mentioned that.
>
> On Mon, Feb 26, 2018 at 4:07 PM Yehuda Sadeh-Weinraub 
> wrote:
>
>> According to the log here, it says that the location constraint it got
>> is "cn", can you take a look at a tcpdump, see if that's actually
>> what's passed in?
>>
>> On Mon, Feb 26, 2018 at 12:02 PM, David Turner 
>> wrote:
>> > I run with `debug rgw = 10` and was able to find these lines at the end
>> of a
>> > request to create the bucket.
>> >
>> > Successfully creating a bucket with `bucket_location = US` looks like
>> > [1]this.  Failing to create a bucket has "ERROR: S3 error: 400
>> > (InvalidLocationConstraint): The specified location-constraint is not
>> valid"
>> > on the CLI and [2]this (excerpt from the end of the request) in the rgw
>> log
>> > (debug level 10).  "create bucket location constraint" was not found in
>> the
>> > log for successfully creating the bucket.
>> >
>> >
>> > [1]
>> > 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
>> >
>> name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>> > info.flags=0x17
>> > 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
>> >
>> local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>> > to cache LRU end
>> > 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
>> name=user.rgw.acl
>> > bl.length()=141
>> > 2018-02-26 19:52:36.423863 7f4bc9bc8700 10 RGWWatcher::handle_notify()
>> > notify_id 344855809097728 cookie 139963970426880 notifier 39099765
>> > bl.length()=361
>> > 2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
>> > name=local-atl.rgw.data.root++testerton info.flags=0x17
>> > 2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
>> > local-atl.rgw.data.root++testerton to cache LRU end
>> >
>> > [2]
>> > 2018-02-26 19:43:37.340289 7f466bbca700  2 req 428078:0.004204:s3:PUT
>> > /testraint/:create_bucket:executing
>> > 2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
>> > do_aws4_auth_completion
>> > 2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
>> > do_aws4_auth_completion
>> > 2018-02-26 19:43:37.340715 7f466bbca700 10 create bucket location
>> > constraint: cn
>> > 2018-02-26 19:43:37.340766 7f466bbca700  0 location constraint (cn)
>> can't be
>> > found.
>> > 2018-02-26 19:43:37.340794 7f466bbca700  2 req 428078:0.004701:s3:PUT
>> > /testraint/:create_bucket:completing
>> > 2018-02-26 19:43:37.341782 7f466bbca700  2 req 428078:0.005689:s3:PUT
>> > /testraint/:create_bucket:op status=-2208
>> > 2018-02-26 19:43:37.341792 7f466bbca700  2 req 428078:0.005707:s3:PUT
>> > /testraint/:create_bucket:http status=400
>> >
>> > On Mon, Feb 26, 2018 at 2:36 PM Yehuda Sadeh-Weinraub <
>> yeh...@redhat.com>
>> > wrote:
>> >>
>> >> I'm not sure if the rgw logs (debug rgw = 20) specify explicitly why a
>> >> bucket creation is rejected in these cases, but it might be worth
>> >> trying to look at these. If not, then a tcpdump of the specific failed
>> >> request might shed some light (would be interesting to look at the
>> >> generated LocationConstraint).
>> >>
>> >> Yehuda
>> >>
>> >> On Mon, Feb 26, 2018 at 11:29 AM, David Turner 
>> >> wrote:
>> >> > Our problem only appeared to be present in bucket creation.  Listing,
>> >> > putting, etc objects in a bucket work just fine regardless of the
>> >> > bucket_location setting.  I ran this test on a few different realms
>> to
>> >> > see
>> >> > what would happen and only 1 of them had a problem.  There isn't an
>> >> > obvious
>> >> > thing that steps out about it.  The 2 local realms do not have
>> >> > multi-site,
>> >> > the internal realm has multi-site and the operations were performed
>> on
>> >> > the
>> >> > primary zone for the zonegroup.
>> >> >
>> >> > Worked with non 'US' bucket_location for s3cmd to create bucket:
>> >> > realm=internal
>> >> > zonegroup=internal-ga
>> >> > zone=internal-atl
>> >> >
>> >> > Failed with non 'US' bucket_location for s3cmd to create bucket:
>> >> > realm=local-atl
>> >> > zonegroup=local-atl
>> >> > zone=local-atl
>> >> >
>> >> > Worked with non 'US' bucket_location for s3cmd to create bucket:
>> >> > realm=local
>> >> > zonegroup=local
>> >> > zone=local
>> >> >
>> >> > I was thinking it might have to do with all of the parts being named
>> the
>> >> > same, but I made sure to do the last test to confirm.  Interestingly
>> >> > it's
>> >> > only bucket creation that has a problem and it's fine as long as I
>> put
>> >> > 'US'
>> >> > as the bucket_location.
>> >> >
>> >> > On Mon, Feb 19, 2018 at 6:48 PM F21  wrote:
>> >> >>
>> >> >> I am using the official ceph/daemon docker image. It starts RGW and
>> >> >> creates a zonegroup and zone with their names set to an empty
>> string:
>> >> >>
>> >> >>
>> >> >>

Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread Yehuda Sadeh-Weinraub
If that's what you set in the config file, I assume that's what passed
in. Why did you set that in your config file? You don't have a
zonegroup named 'cn', right?

On Mon, Feb 26, 2018 at 1:10 PM, David Turner  wrote:
> I'm also not certain how to do the tcpdump for this.  Do you have any
> pointers to how to capture that for you?
>
> On Mon, Feb 26, 2018 at 4:09 PM David Turner  wrote:
>>
>> That's what I set it to in the config file.  I probably should have
>> mentioned that.
>>
>> On Mon, Feb 26, 2018 at 4:07 PM Yehuda Sadeh-Weinraub 
>> wrote:
>>>
>>> According to the log here, it says that the location constraint it got
>>> is "cn", can you take a look at a tcpdump, see if that's actually
>>> what's passed in?
>>>
>>> On Mon, Feb 26, 2018 at 12:02 PM, David Turner 
>>> wrote:
>>> > I run with `debug rgw = 10` and was able to find these lines at the end
>>> > of a
>>> > request to create the bucket.
>>> >
>>> > Successfully creating a bucket with `bucket_location = US` looks like
>>> > [1]this.  Failing to create a bucket has "ERROR: S3 error: 400
>>> > (InvalidLocationConstraint): The specified location-constraint is not
>>> > valid"
>>> > on the CLI and [2]this (excerpt from the end of the request) in the rgw
>>> > log
>>> > (debug level 10).  "create bucket location constraint" was not found in
>>> > the
>>> > log for successfully creating the bucket.
>>> >
>>> >
>>> > [1]
>>> > 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
>>> >
>>> > name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>>> > info.flags=0x17
>>> > 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
>>> >
>>> > local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>>> > to cache LRU end
>>> > 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
>>> > name=user.rgw.acl
>>> > bl.length()=141
>>> > 2018-02-26 19:52:36.423863 7f4bc9bc8700 10 RGWWatcher::handle_notify()
>>> > notify_id 344855809097728 cookie 139963970426880 notifier 39099765
>>> > bl.length()=361
>>> > 2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
>>> > name=local-atl.rgw.data.root++testerton info.flags=0x17
>>> > 2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
>>> > local-atl.rgw.data.root++testerton to cache LRU end
>>> >
>>> > [2]
>>> > 2018-02-26 19:43:37.340289 7f466bbca700  2 req 428078:0.004204:s3:PUT
>>> > /testraint/:create_bucket:executing
>>> > 2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
>>> > do_aws4_auth_completion
>>> > 2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
>>> > do_aws4_auth_completion
>>> > 2018-02-26 19:43:37.340715 7f466bbca700 10 create bucket location
>>> > constraint: cn
>>> > 2018-02-26 19:43:37.340766 7f466bbca700  0 location constraint (cn)
>>> > can't be
>>> > found.
>>> > 2018-02-26 19:43:37.340794 7f466bbca700  2 req 428078:0.004701:s3:PUT
>>> > /testraint/:create_bucket:completing
>>> > 2018-02-26 19:43:37.341782 7f466bbca700  2 req 428078:0.005689:s3:PUT
>>> > /testraint/:create_bucket:op status=-2208
>>> > 2018-02-26 19:43:37.341792 7f466bbca700  2 req 428078:0.005707:s3:PUT
>>> > /testraint/:create_bucket:http status=400
>>> >
>>> > On Mon, Feb 26, 2018 at 2:36 PM Yehuda Sadeh-Weinraub
>>> > 
>>> > wrote:
>>> >>
>>> >> I'm not sure if the rgw logs (debug rgw = 20) specify explicitly why a
>>> >> bucket creation is rejected in these cases, but it might be worth
>>> >> trying to look at these. If not, then a tcpdump of the specific failed
>>> >> request might shed some light (would be interesting to look at the
>>> >> generated LocationConstraint).
>>> >>
>>> >> Yehuda
>>> >>
>>> >> On Mon, Feb 26, 2018 at 11:29 AM, David Turner 
>>> >> wrote:
>>> >> > Our problem only appeared to be present in bucket creation.
>>> >> > Listing,
>>> >> > putting, etc objects in a bucket work just fine regardless of the
>>> >> > bucket_location setting.  I ran this test on a few different realms
>>> >> > to
>>> >> > see
>>> >> > what would happen and only 1 of them had a problem.  There isn't an
>>> >> > obvious
>>> >> > thing that steps out about it.  The 2 local realms do not have
>>> >> > multi-site,
>>> >> > the internal realm has multi-site and the operations were performed
>>> >> > on
>>> >> > the
>>> >> > primary zone for the zonegroup.
>>> >> >
>>> >> > Worked with non 'US' bucket_location for s3cmd to create bucket:
>>> >> > realm=internal
>>> >> > zonegroup=internal-ga
>>> >> > zone=internal-atl
>>> >> >
>>> >> > Failed with non 'US' bucket_location for s3cmd to create bucket:
>>> >> > realm=local-atl
>>> >> > zonegroup=local-atl
>>> >> > zone=local-atl
>>> >> >
>>> >> > Worked with non 'US' bucket_location for s3cmd to create bucket:
>>> >> > realm=local
>>> >> > zonegroup=local
>>> >> > zone=local
>>> >> >
>>> >> > I was thinking it might have to do with all of the parts being named
>>> >> > the
>>> >> > same, but I made sure to do the last test to confirm.  Interestingly
>>> >> > it's
>>> >> > only bucke

Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread David Turner
I set it to that for randomness.  I don't have a zonegroup named 'us'
either, but that works fine, so I don't see why 'cn' should be any
different.  The bucket_location that made me notice this was 'gd1'.  I
don't know where that one came from, but I don't see why we should force
people to set it to 'us' when that has nothing to do with the realm.  If it
needed to be set to 'local-atl' that would make sense, but 'us' works just
fine.  Perhaps 'us' working is the real oddity here, rather than the other
values that get rejected.

I tested setting bucket_location to 'local-atl' and it did successfully
create the bucket.  So the question becomes, why do my other realms not
care what that value is set to and why does this realm allow 'us' to be
used when it isn't correct?

On Mon, Feb 26, 2018 at 4:12 PM Yehuda Sadeh-Weinraub 
wrote:

> If that's what you set in the config file, I assume that's what passed
> in. Why did you set that in your config file? You don't have a
> zonegroup named 'cn', right?
>
> On Mon, Feb 26, 2018 at 1:10 PM, David Turner 
> wrote:
> > I'm also not certain how to do the tcpdump for this.  Do you have any
> > pointers to how to capture that for you?
> >
> > On Mon, Feb 26, 2018 at 4:09 PM David Turner 
> wrote:
> >>
> >> That's what I set it to in the config file.  I probably should have
> >> mentioned that.
> >>
> >> On Mon, Feb 26, 2018 at 4:07 PM Yehuda Sadeh-Weinraub <
> yeh...@redhat.com>
> >> wrote:
> >>>
> >>> According to the log here, it says that the location constraint it got
> >>> is "cn", can you take a look at a tcpdump, see if that's actually
> >>> what's passed in?
> >>>
> >>> On Mon, Feb 26, 2018 at 12:02 PM, David Turner 
> >>> wrote:
> >>> > I run with `debug rgw = 10` and was able to find these lines at the
> end
> >>> > of a
> >>> > request to create the bucket.
> >>> >
> >>> > Successfully creating a bucket with `bucket_location = US` looks like
> >>> > [1]this.  Failing to create a bucket has "ERROR: S3 error: 400
> >>> > (InvalidLocationConstraint): The specified location-constraint is not
> >>> > valid"
> >>> > on the CLI and [2]this (excerpt from the end of the request) in the
> rgw
> >>> > log
> >>> > (debug level 10).  "create bucket location constraint" was not found
> in
> >>> > the
> >>> > log for successfully creating the bucket.
> >>> >
> >>> >
> >>> > [1]
> >>> > 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
> >>> >
> >>> >
> name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> >>> > info.flags=0x17
> >>> > 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
> >>> >
> >>> >
> local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> >>> > to cache LRU end
> >>> > 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
> >>> > name=user.rgw.acl
> >>> > bl.length()=141
> >>> > 2018-02-26 19:52:36.423863 7f4bc9bc8700 10
> RGWWatcher::handle_notify()
> >>> > notify_id 344855809097728 cookie 139963970426880 notifier 39099765
> >>> > bl.length()=361
> >>> > 2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
> >>> > name=local-atl.rgw.data.root++testerton info.flags=0x17
> >>> > 2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
> >>> > local-atl.rgw.data.root++testerton to cache LRU end
> >>> >
> >>> > [2]
> >>> > 2018-02-26 19:43:37.340289 7f466bbca700  2 req 428078:0.004204:s3:PUT
> >>> > /testraint/:create_bucket:executing
> >>> > 2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
> >>> > do_aws4_auth_completion
> >>> > 2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
> >>> > do_aws4_auth_completion
> >>> > 2018-02-26 19:43:37.340715 7f466bbca700 10 create bucket location
> >>> > constraint: cn
> >>> > 2018-02-26 19:43:37.340766 7f466bbca700  0 location constraint (cn)
> >>> > can't be
> >>> > found.
> >>> > 2018-02-26 19:43:37.340794 7f466bbca700  2 req 428078:0.004701:s3:PUT
> >>> > /testraint/:create_bucket:completing
> >>> > 2018-02-26 19:43:37.341782 7f466bbca700  2 req 428078:0.005689:s3:PUT
> >>> > /testraint/:create_bucket:op status=-2208
> >>> > 2018-02-26 19:43:37.341792 7f466bbca700  2 req 428078:0.005707:s3:PUT
> >>> > /testraint/:create_bucket:http status=400
> >>> >
> >>> > On Mon, Feb 26, 2018 at 2:36 PM Yehuda Sadeh-Weinraub
> >>> > 
> >>> > wrote:
> >>> >>
> >>> >> I'm not sure if the rgw logs (debug rgw = 20) specify explicitly
> why a
> >>> >> bucket creation is rejected in these cases, but it might be worth
> >>> >> trying to look at these. If not, then a tcpdump of the specific
> failed
> >>> >> request might shed some light (would be interesting to look at the
> >>> >> generated LocationConstraint).
> >>> >>
> >>> >> Yehuda
> >>> >>
> >>> >> On Mon, Feb 26, 2018 at 11:29 AM, David Turner <
> drakonst...@gmail.com>
> >>> >> wrote:
> >>> >> > Our problem only appeared to be present in bucket creation.
> >>> >> > Listing,
> >>> >> > putting, etc objects in a bucket work just fine regardless of the
> >>> >> > 

Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread Yehuda Sadeh-Weinraub
I don't know why 'us' works for you, but it could be that s3cmd simply
doesn't send any location constraint when 'us' is set. You can check the
capture to confirm this; wireshark works well for that (assuming an http
endpoint and not https).

Yehuda
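(A sketch of such a capture, assuming RGW serves plain HTTP on the default
civetweb port 7480; adjust the interface and port to your setup:)

# capture to a file for later inspection in wireshark:
$ tcpdump -i any -s 0 -w /tmp/rgw-create-bucket.pcap 'tcp port 7480'
# or just eyeball the PUT body (the LocationConstraint XML) on the console:
$ tcpdump -i any -s 0 -A 'tcp port 7480' | grep -i locationconstraint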

On Mon, Feb 26, 2018 at 1:21 PM, David Turner  wrote:
> I set it to that for randomness.  I don't have a zonegroup named 'us'
> either, but that works fine.  I don't see why 'cn' should be any different.
> The bucket_location that triggered me noticing this was 'gd1'.  I don't know
> where that one came from, but I don't see why we should force people setting
> it to 'us' when that has nothing to do with the realm.  If it needed to be
> set to 'local-atl' that would make sense, but 'us' works just fine.  Perhaps
> 'us' working is what shouldn't work as opposed to allowing whatever else to
> be able to work.
>
> I tested setting bucket_location to 'local-atl' and it did successfully
> create the bucket.  So the question becomes, why do my other realms not care
> what that value is set to and why does this realm allow 'us' to be used when
> it isn't correct?
>
> On Mon, Feb 26, 2018 at 4:12 PM Yehuda Sadeh-Weinraub 
> wrote:
>>
>> If that's what you set in the config file, I assume that's what passed
>> in. Why did you set that in your config file? You don't have a
>> zonegroup named 'cn', right?
>>
>> On Mon, Feb 26, 2018 at 1:10 PM, David Turner 
>> wrote:
>> > I'm also not certain how to do the tcpdump for this.  Do you have any
>> > pointers to how to capture that for you?
>> >
>> > On Mon, Feb 26, 2018 at 4:09 PM David Turner 
>> > wrote:
>> >>
>> >> That's what I set it to in the config file.  I probably should have
>> >> mentioned that.
>> >>
>> >> On Mon, Feb 26, 2018 at 4:07 PM Yehuda Sadeh-Weinraub
>> >> 
>> >> wrote:
>> >>>
>> >>> According to the log here, it says that the location constraint it got
>> >>> is "cn", can you take a look at a tcpdump, see if that's actually
>> >>> what's passed in?
>> >>>
>> >>> On Mon, Feb 26, 2018 at 12:02 PM, David Turner 
>> >>> wrote:
>> >>> > I run with `debug rgw = 10` and was able to find these lines at the
>> >>> > end
>> >>> > of a
>> >>> > request to create the bucket.
>> >>> >
>> >>> > Successfully creating a bucket with `bucket_location = US` looks
>> >>> > like
>> >>> > [1]this.  Failing to create a bucket has "ERROR: S3 error: 400
>> >>> > (InvalidLocationConstraint): The specified location-constraint is
>> >>> > not
>> >>> > valid"
>> >>> > on the CLI and [2]this (excerpt from the end of the request) in the
>> >>> > rgw
>> >>> > log
>> >>> > (debug level 10).  "create bucket location constraint" was not found
>> >>> > in
>> >>> > the
>> >>> > log for successfully creating the bucket.
>> >>> >
>> >>> >
>> >>> > [1]
>> >>> > 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
>> >>> >
>> >>> >
>> >>> > name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>> >>> > info.flags=0x17
>> >>> > 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
>> >>> >
>> >>> >
>> >>> > local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
>> >>> > to cache LRU end
>> >>> > 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
>> >>> > name=user.rgw.acl
>> >>> > bl.length()=141
>> >>> > 2018-02-26 19:52:36.423863 7f4bc9bc8700 10
>> >>> > RGWWatcher::handle_notify()
>> >>> > notify_id 344855809097728 cookie 139963970426880 notifier 39099765
>> >>> > bl.length()=361
>> >>> > 2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
>> >>> > name=local-atl.rgw.data.root++testerton info.flags=0x17
>> >>> > 2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
>> >>> > local-atl.rgw.data.root++testerton to cache LRU end
>> >>> >
>> >>> > [2]
>> >>> > 2018-02-26 19:43:37.340289 7f466bbca700  2 req
>> >>> > 428078:0.004204:s3:PUT
>> >>> > /testraint/:create_bucket:executing
>> >>> > 2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
>> >>> > do_aws4_auth_completion
>> >>> > 2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
>> >>> > do_aws4_auth_completion
>> >>> > 2018-02-26 19:43:37.340715 7f466bbca700 10 create bucket location
>> >>> > constraint: cn
>> >>> > 2018-02-26 19:43:37.340766 7f466bbca700  0 location constraint (cn)
>> >>> > can't be
>> >>> > found.
>> >>> > 2018-02-26 19:43:37.340794 7f466bbca700  2 req
>> >>> > 428078:0.004701:s3:PUT
>> >>> > /testraint/:create_bucket:completing
>> >>> > 2018-02-26 19:43:37.341782 7f466bbca700  2 req
>> >>> > 428078:0.005689:s3:PUT
>> >>> > /testraint/:create_bucket:op status=-2208
>> >>> > 2018-02-26 19:43:37.341792 7f466bbca700  2 req
>> >>> > 428078:0.005707:s3:PUT
>> >>> > /testraint/:create_bucket:http status=400
>> >>> >
>> >>> > On Mon, Feb 26, 2018 at 2:36 PM Yehuda Sadeh-Weinraub
>> >>> > 
>> >>> > wrote:
>> >>> >>
>> >>> >> I'm not sure if the rgw logs (debug rgw = 20) specify explicitly
>> >>> >> why a
>> >>> >> bucket creation is rejected in these cases, but it m

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 12:26 PM Reed Dier  wrote:

> I will try to set the hybrid sleeps to 0 on the affected OSDs as an
> interim solution to getting the metadata configured correctly.
>

Yes, that's a good workaround as long as you don't have any actual hybrid
OSDs (or aren't worried about them sleeping...I'm not sure if that setting
came from experience or not).


>
> For reference, here is the complete metadata for osd.24, bluestore SATA
> SSD with NVMe block.db.
>
> {
> "id": 24,
> "arch": "x86_64",
> "back_addr": "",
> "back_iface": "bond0",
> "bluefs": "1",
> "bluefs_db_access_mode": "blk",
> "bluefs_db_block_size": "4096",
> "bluefs_db_dev": "259:0",
> "bluefs_db_dev_node": "nvme0n1",
> "bluefs_db_driver": "KernelDevice",
> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> "bluefs_db_partition_path": "/dev/nvme0n1p4",
> "bluefs_db_rotational": "0",
> "bluefs_db_serial": " ",
> "bluefs_db_size": "16000221184",
> "bluefs_db_type": "nvme",
> "bluefs_single_shared_device": "0",
> "bluefs_slow_access_mode": "blk",
> "bluefs_slow_block_size": "4096",
> "bluefs_slow_dev": "253:8",
> "bluefs_slow_dev_node": "dm-8",
> "bluefs_slow_driver": "KernelDevice",
> "bluefs_slow_model": "",
> "bluefs_slow_partition_path": "/dev/dm-8",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_size": "1920378863616",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_access_mode": "blk",
> "bluestore_bdev_block_size": "4096",
> "bluestore_bdev_dev": "253:8",
> "bluestore_bdev_dev_node": "dm-8",
> "bluestore_bdev_driver": "KernelDevice",
> "bluestore_bdev_model": "",
> "bluestore_bdev_partition_path": "/dev/dm-8",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_size": "1920378863616",
> "bluestore_bdev_type": "ssd",
> "ceph_version": "ceph version 12.2.2
> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
> "default_device_class": "ssd",
> "distro": "ubuntu",
> "distro_description": "Ubuntu 16.04.3 LTS",
> "distro_version": "16.04",
> "front_addr": "",
> "front_iface": "bond0",
> "hb_back_addr": "",
> "hb_front_addr": "",
> "hostname": “host00",
> "journal_rotational": "1",
> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44
> UTC 2018",
> "kernel_version": "4.13.0-26-generic",
> "mem_swap_kb": "124999672",
> "mem_total_kb": "131914008",
> "os": "Linux",
> "osd_data": "/var/lib/ceph/osd/ceph-24",
> "osd_objectstore": "bluestore",
> "rotational": "0"
> }
>
>
> So it looks like it correctly guessed(?) the
> bluestore_bdev_type/default_device_class correctly (though it may have been
> an inherited value?), as did bluefs_db_type get set to nvme correctly.
>
> So I’m not sure why journal_rotational is still showing 1.
> Maybe something in the ceph-volume lvm piece that isn’t correctly setting
> that flag on OSD creation?
> Also seems like the journal_rotational field should have been deprecated
> in bluestore as bluefs_db_rotational should cover that, and if there were a
> WAL partition as well, I assume there would be something to the tune of
> bluefs_wal_rotational or something like that, and journal would never be
> used for bluestore?
>

Thanks to both of you for helping diagnose this issue. I created a ticket
and have a PR up to fix it: http://tracker.ceph.com/issues/23141,
https://github.com/ceph/ceph/pull/20602

Until that gets backported into another Luminous release you'll need to do
some kind of workaround though. :/
-Greg


>
> Appreciate the help.
>
> Thanks,
> Reed
>
> On Feb 26, 2018, at 1:28 PM, Gregory Farnum  wrote:
>
> On Mon, Feb 26, 2018 at 11:21 AM Reed Dier  wrote:
>
>> The ‘good perf’ that I reported below was the result of beginning 5 new
>> bluestore conversions which results in a leading edge of ‘good’
>> performance, before trickling off.
>>
>> This performance lasted about 20 minutes, where it backfilled a small set
>> of PGs off of non-bluestore OSDs.
>>
>> Current performance is now hovering around:
>>
>> pool objects-ssd id 20
>>   recovery io 14285 kB/s, 202 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
>>
>>
>> What are you referencing when you talk about recovery ops per second?
>>
>> These are recovery ops as reported by ceph -s or via stats exported via
>> influx plugin in mgr, and via local collectd collection.
>>
>> Also, what are the values for osd_recovery_sleep_hdd
>> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd meta

Re: [ceph-users] erasure-code-profile: what's "w=" ?

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 5:09 AM Wolfgang Lendl <
wolfgang.le...@meduniwien.ac.at> wrote:

> hi,
>
> I have no idea what "w=8" means and can't find any hints in docs ...
> maybe someone can explain
>
>
> ceph 12.2.2
>
> # ceph osd erasure-code-profile get ec42
> crush-device-class=hdd
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=4
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
>

I think that's exposing the "word" size it uses when doing the erasure
coding. It is technically configurable but I would not fuss about it.
-Greg
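(For reference, w is a jerasure plugin parameter, the word size used by the
coding; 8 is already the default for reed_sol_van, but it can be set
explicitly when creating a profile. A sketch, with an example profile name:)

$ ceph osd erasure-code-profile set ec42-test \
      plugin=jerasure technique=reed_sol_van k=4 m=2 w=8 \
      crush-failure-domain=host crush-device-class=hdd
$ ceph osd erasure-code-profile get ec42-test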

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to correctly purge a "ceph-volume lvm" OSD

2018-02-26 Thread Alfredo Deza
On Mon, Feb 26, 2018 at 12:51 PM, David Turner  wrote:
> I don't follow what ceph-deploy has to do with the man page for ceph-volume.
> Is ceph-volume also out-of-tree and as such the man pages aren't version
> specific with its capabilities?  It's very disconcerting to need to ignore
> the man pages for CLI tools.

Sorry, it seems like I completely got confused there :)

We forgot to backport the man page changes for the luminous branch.
Thanks for pointing this out.

Created http://tracker.ceph.com/issues/23142 to track this

You are absolutely right, apologies for the misfire.


>
> On Mon, Feb 26, 2018 at 12:10 PM Alfredo Deza  wrote:
>>
>> On Mon, Feb 26, 2018 at 11:24 AM, David Turner 
>> wrote:
>> > If we're asking for documentation updates, the man page for ceph-volume
>> > is
>> > incredibly outdated.  In 12.2.3 it still says that bluestore is not yet
>> > implemented and that it's planned to be supported.
>> > '[--bluestore] filestore objectstore (not yet implemented)'
>> > 'using  a  filestore  setup (bluestore  support  is  planned)'.
>>
>> This is a bit hard to track because ceph-deploy is an out-of-tree
>> project that gets pulled into the Ceph repo, and the man page lives in
>> the Ceph source tree.
>>
>> We have updated the man page and the references to ceph-deploy to
>> correctly show the new API and all the flags supported, but this is in
>> master and was not backported
>> to luminous.
>>
>> >
>> > On Mon, Feb 26, 2018 at 7:05 AM Oliver Freyermuth
>> >  wrote:
>> >>
>> >> On 26.02.2018 at 13:02, Alfredo Deza wrote:
>> >> > On Sat, Feb 24, 2018 at 1:26 PM, Oliver Freyermuth
>> >> >  wrote:
>> >> >> Dear Cephalopodians,
>> >> >>
>> >> >> when purging a single OSD on a host (created via ceph-deploy 2.0,
>> >> >> i.e.
>> >> >> using ceph-volume lvm), I currently proceed as follows:
>> >> >>
>> >> >> On the OSD-host:
>> >> >> $ systemctl stop ceph-osd@4.service
>> >> >> $ ls -la /var/lib/ceph/osd/ceph-4
>> >> >> # Check block und block.db links:
>> >> >> lrwxrwxrwx.  1 ceph ceph   93 23. Feb 01:28 block ->
>> >> >>
>> >> >> /dev/ceph-69b1fbe5-f084-4410-a99a-ab57417e7846/osd-block-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >> >> lrwxrwxrwx.  1 root root   43 23. Feb 01:28 block.db ->
>> >> >> /dev/ceph-osd-blockdb-ssd-1/db-for-disk-sda
>> >> >> # resolve actual underlying device:
>> >> >> $ pvs | grep ceph-69b1fbe5-f084-4410-a99a-ab57417e7846
>> >> >>   /dev/sda   ceph-69b1fbe5-f084-4410-a99a-ab57417e7846 lvm2 a--
>> >> >> <3,64t 0
>> >> >> # Zap the device:
>> >> >> $ ceph-volume lvm zap --destroy /dev/sda
>> >> >>
>> >> >> Now, on the mon:
>> >> >> # purge the OSD:
>> >> >> $ ceph osd purge osd.4 --yes-i-really-mean-it
>> >> >>
>> >> >> Then I re-deploy using:
>> >> >> $ ceph-deploy --overwrite-conf osd create --bluestore --block-db
>> >> >> ceph-osd-blockdb-ssd-1/db-for-disk-sda --data /dev/sda osd001
>> >> >>
>> >> >> from the admin-machine.
>> >> >>
>> >> >> This works just fine, however, it leaves a stray ceph-volume service
>> >> >> behind:
>> >> >> $ ls -la /etc/systemd/system/multi-user.target.wants/ -1 | grep
>> >> >> ceph-volume@lvm-4
>> >> >> lrwxrwxrwx.  1 root root   44 24. Feb 18:30
>> >> >> ceph-volume@lvm-4-5a984083-48e1-4c2f-a1f3-3458c941e597.service ->
>> >> >> /usr/lib/systemd/system/ceph-volume@.service
>> >> >> lrwxrwxrwx.  1 root root   44 23. Feb 01:28
>> >> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service ->
>> >> >> /usr/lib/systemd/system/ceph-volume@.service
>> >> >>
>> >> >> This stray service then, after reboot of the machine, stays in
>> >> >> activating state (since the disk will of course never come back):
>> >> >> ---
>> >> >> $ systemctl status
>> >> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>> >> >> ● ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service -
>> >> >> Ceph
>> >> >> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >> >>Loaded: loaded (/usr/lib/systemd/system/ceph-volume@.service;
>> >> >> enabled; vendor preset: disabled)
>> >> >>Active: activating (start) since Sa 2018-02-24 19:21:47 CET; 1min
>> >> >> 12s ago
>> >> >>  Main PID: 1866 (timeout)
>> >> >>CGroup:
>> >> >>
>> >> >> /system.slice/system-ceph\x2dvolume.slice/ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>> >> >>├─1866 timeout 1 /usr/sbin/ceph-volume-systemd
>> >> >> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >> >>└─1872 /usr/bin/python2.7 /usr/sbin/ceph-volume-systemd
>> >> >> lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874
>> >> >>
>> >> >> Feb 24 19:21:47 osd001.baf.physik.uni-bonn.de systemd[1]: Starting
>> >> >> Ceph
>> >> >> Volume activation: lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874...
>> >> >> ---
>> >> >> Manually, I can fix this by running:
>> >> >> $ systemctl disable
>> >> >> ceph-volume@lvm-4-cd273506-e805-40ac-b23d-c7b9ff45d874.service
>> >> >>
>> >> >> My question is: Should I really

Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit  wrote:

> 2018-02-24 7:10 GMT+01:00 David Turner :
>
>> Caspar, it looks like your idea should work. Worst case scenario seems
>> like the osd wouldn't start, you'd put the old SSD back in and go back to
>> the idea to weight them to 0, backfilling, then recreate the osds.
>> Definitely worth a try in my opinion, and I'd love to hear your experience
>> after.
>>
>>
> Hi David,
>
> First of all, thank you for ALL your answers on this ML, you're really
> putting a lot of effort into answering many questions asked here and very
> often they contain invaluable information.
>
>
> To follow up on this post I went out and built a very small (proxmox)
> cluster (3 OSD's per host) to test my suggestion of cloning the DB/WAL SSD.
> And it worked!
> Note: this was on Luminous v12.2.2 (all bluestore, ceph-disk based OSD's)
>
> Here's what I did on 1 node:
>
> 1) ceph osd set noout
> 2) systemctl stop osd.0; systemctl stop osd.1; systemctl stop osd.2
> 3) ddrescue -f -n -vv   /root/clone-db.log
> 4) removed the old SSD physically from the node
> 5) checked with "ceph -s" and already saw HEALTH_OK and all OSD's up/in
> 6) ceph osd unset noout
>
> I assume that once the ddrescue step is finished a 'partprobe' or
> something similar is triggered and udev finds the DB partitions on the new
> SSD and starts the OSD's again (kind of what happens during hotplug).
> So it is probably better to clone the SSD in another (non-ceph) system to
> not trigger any udev events.
>
> I also tested a reboot after this and everything still worked.
>
>
> The old SSD was 120GB and the new is 256GB (cloning took around 4 minutes)
> Delta of data was very low because it was a test cluster.
>
> All in all the OSD's in question were 'down' for only 5 minutes (so I
> stayed within the default 10-minute mon_osd_down_out_interval and
> didn't actually need to set noout :)
>
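For anyone wanting to repeat Caspar's experiment, here is a minimal sketch of the sequence above. The device names /dev/sdX (old SSD), /dev/sdY (new SSD) and the OSD ids are placeholders only; verify the real ones with lsblk / ceph-disk list before running anything destructive:

$ ceph osd set noout
$ systemctl stop ceph-osd@0 ceph-osd@1 ceph-osd@2   # the OSDs whose DB/WAL lives on the old SSD
$ ddrescue -f -n -vv /dev/sdX /dev/sdY /root/clone-db.log
# pull the old SSD; udev should re-detect the DB partitions on the clone (or reboot)
$ ceph osd unset noout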

I kicked off a brief discussion about this with some of the BlueStore guys
and they're aware of the problem with migrating across SSDs, but so far
it's just a Trello card:
https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
They do confirm you should be okay with dd'ing things across, assuming
symlinks get set up correctly as David noted.

I've got some other bad news, though: BlueStore has internal metadata about
the size of the block device it's using, so if you copy it onto a larger
block device, it will not actually make use of the additional space. :(
-Greg
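A hedged way to compare the size BlueStore recorded at mkfs time with what the block device now offers, assuming ceph-bluestore-tool from the same release; osd.0 and the partition path are placeholders:

$ ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block      # check the "size" field in the label
$ ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db   # same for the DB device, if any
$ blockdev --getsize64 /dev/sdY2                                           # what the underlying partition actually provides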


>
> Kind regards,
> Caspar
>
>
>
>> Nico, it is not possible to change the WAL or DB size, location, etc
>> after osd creation. If you want to change the configuration of the osd
>> after creation, you have to remove it from the cluster and recreate it.
>> There is no similar functionality to how you could move, recreate, etc
>> filesystem osd journals. I think this might be on the radar as a feature,
>> but I don't know for certain. I definitely consider it to be a regression
>> of bluestore.
>>
>>
>>
>>
>> On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius <
>> nico.schottel...@ungleich.ch> wrote:
>>
>>>
>>> A very interesting question and I would add the follow up question:
>>>
>>> Is there an easy way to add an external DB/WAL devices to an existing
>>> OSD?
>>>
>>> I suspect that it might be something on the lines of:
>>>
>>> - stop osd
>>> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
>>> - (maybe run some kind of osd mkfs ?)
>>> - start osd
>>>
>>> Has anyone done this so far or recommendations on how to do it?
>>>
>>> Which also makes me wonder: what is actually the format of WAL and
>>> BlockDB in bluestore? Is there any documentation available about it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>> Caspar Smit  writes:
>>>
>>> > Hi All,
>>> >
>>> > What would be the proper way to preventively replace a DB/WAL SSD
>>> (when it
>>> > is nearing its DWPD/TBW limit and has not failed yet).
>>> >
>>> > It hosts DB partitions for 5 OSD's
>>> >
>>> > Maybe something like:
>>> >
>>> > 1) ceph osd reweight 0 the 5 OSD's
>>> > 2) let backfilling complete
>>> > 3) destroy/remove the 5 OSD's
>>> > 4) replace SSD
>>> > 5) create 5 new OSD's with separate DB partition on new SSD
>>> >
>>> > When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved
>>> so i
>>> > thought maybe the following would work:
>>> >
>>> > 1) ceph osd set noout
>>> > 2) stop the 5 OSD's (systemctl stop)
>>> > 3) 'dd' the old SSD to a new SSD of same or bigger size
>>> > 4) remove the old SSD
>>> > 5) start the 5 OSD's (systemctl start)
>>> > 6) let backfilling/recovery complete (only delta data between OSD stop
>>> and
>>> > now)
>>> > 6) ceph osd unset noout
>>> >
>>> > Would this be a viable method to replace a DB SSD? Any udev/serial
>>> nr/uuid
>>> > stuff preventing this from working?
>>> >
>>> > Or is there another 'less hacky' way to replace a DB SSD without
>>> moving too
>>> > much data?
>>> >
>>> > Kind regards,
>>> > Caspar
>>> > _

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
For the record, I am not seeing a demonstrative fix by injecting the value of 0 
into the OSDs running.
> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require 
> restart)

If it does indeed need to be restarted, I will need to wait for the current 
backfills to finish their process as restarting an OSD would bring me under 
min_size.

However, doing config show on the osd daemon appears to have taken the value of 
0.

> ceph daemon osd.24 config show | grep recovery_sleep
> "osd_recovery_sleep": "0.00",
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.00",
> "osd_recovery_sleep_ssd": "0.00",


I may take the restart as an opportunity to also move to 12.2.3 at the same 
time, since it is not expected that that should affect this issue.

I could also attempt to change osd_recovery_sleep_hdd as well; since these are 
ssd osd's it shouldn't make a difference, but it's a free move.

Thanks,

Reed

> On Feb 26, 2018, at 3:42 PM, Gregory Farnum  wrote:
> 
> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier  > wrote:
> I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
> solution to getting the metadata configured correctly.
> 
> Yes, that's a good workaround as long as you don't have any actual hybrid 
> OSDs (or aren't worried about them sleeping...I'm not sure if that setting 
> came from experience or not).
>  
> 
> For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
> with NVMe block.db.
> 
>> {
>> "id": 24,
>> "arch": "x86_64",
>> "back_addr": "",
>> "back_iface": "bond0",
>> "bluefs": "1",
>> "bluefs_db_access_mode": "blk",
>> "bluefs_db_block_size": "4096",
>> "bluefs_db_dev": "259:0",
>> "bluefs_db_dev_node": "nvme0n1",
>> "bluefs_db_driver": "KernelDevice",
>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>> "bluefs_db_rotational": "0",
>> "bluefs_db_serial": " ",
>> "bluefs_db_size": "16000221184",
>> "bluefs_db_type": "nvme",
>> "bluefs_single_shared_device": "0",
>> "bluefs_slow_access_mode": "blk",
>> "bluefs_slow_block_size": "4096",
>> "bluefs_slow_dev": "253:8",
>> "bluefs_slow_dev_node": "dm-8",
>> "bluefs_slow_driver": "KernelDevice",
>> "bluefs_slow_model": "",
>> "bluefs_slow_partition_path": "/dev/dm-8",
>> "bluefs_slow_rotational": "0",
>> "bluefs_slow_size": "1920378863616",
>> "bluefs_slow_type": "ssd",
>> "bluestore_bdev_access_mode": "blk",
>> "bluestore_bdev_block_size": "4096",
>> "bluestore_bdev_dev": "253:8",
>> "bluestore_bdev_dev_node": "dm-8",
>> "bluestore_bdev_driver": "KernelDevice",
>> "bluestore_bdev_model": "",
>> "bluestore_bdev_partition_path": "/dev/dm-8",
>> "bluestore_bdev_rotational": "0",
>> "bluestore_bdev_size": "1920378863616",
>> "bluestore_bdev_type": "ssd",
>> "ceph_version": "ceph version 12.2.2 
>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
>> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
>> "default_device_class": "ssd",
>> "distro": "ubuntu",
>> "distro_description": "Ubuntu 16.04.3 LTS",
>> "distro_version": "16.04",
>> "front_addr": "",
>> "front_iface": "bond0",
>> "hb_back_addr": "",
>> "hb_front_addr": "",
>> "hostname": “host00",
>> "journal_rotational": "1",
>> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 
>> 2018",
>> "kernel_version": "4.13.0-26-generic",
>> "mem_swap_kb": "124999672",
>> "mem_total_kb": "131914008",
>> "os": "Linux",
>> "osd_data": "/var/lib/ceph/osd/ceph-24",
>> "osd_objectstore": "bluestore",
>> "rotational": "0"
>> }
> 
> 
> So it looks like it guessed(?) the 
> bluestore_bdev_type/default_device_class correctly (though it may have been 
> an inherited value?), and bluefs_db_type was also set to nvme correctly.
> 
> So I’m not sure why journal_rotational is still showing 1.
> Maybe something in the ceph-volume lvm piece that isn’t correctly setting 
> that flag on OSD creation?
> Also seems like the journal_rotational field should have been deprecated in 
> bluestore as bluefs_db_rotational should cover that, and if there were a WAL 
> partition as well, I assume there would be something to the tune of 
> bluefs_wal_rotational or something like that, and journal would never be used 
> for bluestore?
> 
> Thanks to both of you for helping diagnose this issue. I created a ticket and 
> have a PR up to fix it: http://tracker.ceph.com/issues/23141 
> , 
> https://github.com/ceph

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> > > The EC pool I am considering is k=4 m=2 with failure domain
> host, on 6 hosts.
> > > So necessarily, there is one shard for each host. If one host
> goes down for a prolonged time,
> > > there's no "logical" advantage of redistributing things -
> since whatever you do, with 5 hosts, all PGs will stay in degraded state
> anyways.
> > >
> > > However, I noticed Ceph is remapping all PGs, and actively
> moving data. I presume now this is done for two reasons:
> > > - The remapping is needed since the primary OSD might be the
> one which went down. But for remapping (I guess) there's no need to
> actually move data,
> > >   or is there?
> > > - The data movement is done to have the "k" shards available.
> > > If it's really the case that "all shards are equal", then data
> movement should not occur - or is this a bug / bad feature?
> > >
> > >
> > > If you lose one OSD out of a host, Ceph is going to try and
> re-replicate the data onto the other OSDs in that host. Your PG size and
> the CRUSH rule instructs it that the PG needs 6 different OSDs, and those
> OSDs need to be placed on different hosts.
> > >
> > > You're right that gets very funny if your PG size is equal to the
> number of hosts. We generally discourage people from running configurations
> like that.
> >
> > Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts)
> would be our starting point - since we may add more hosts later (not too
> soon-ish, but it's not excluded more may come in a year or so),
> > and migrating large EC pools to different settings still seems a bit
> messy.
> > We can't really afford to reduce available storage significantly
> more in the current setup, and would like to have the possibility to lose
> one host (for example for an OS upgrade),
> > and then still lose a few disks in case they fail with bad timing.
> >
> > >
> > > Or if you mean that you are losing a host, and the data is
> shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a
> result of EC pools' "indep" rather than "firstn" crush rules?)
> >
> > They are indep, which I think is the default (no manual editing
> done). I thought the main goal of indep was exactly to reduce data movement.
> > Indeed, it's very funny that data is moved, it certainly does not
> help to increase redundancy ;-).
> >
> 
> >
> > Can you also share the output of "ceph osd crush dump"?
>
> Attached.
>

Yep, that all looks simple enough.

Do you have any "ceph -s" or other records from when this was occurring? Is
it actually deleting or migrating any of the existing shards, or is it just
that the shards which were previously on the out'ed OSDs are now getting
copied onto the remaining ones?

I think I finally understand what's happening here but would like to be
sure. :)
-Greg

(In short: certain straws were previously mapping onto osd.[outed], but now
they map onto the remaining OSDs. Because everything's independent, the
actual CRUSH mapping for any shard other than the last is now going to map
onto a remaining OSD, which would displace the shard it already holds. But
the previously-present shard is going to remain "remapped" there because it
can't map successfully. So if you lose osd.5, you'll go from a CRUSH
mapping like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2
and 5 will both be on OSD 4.)
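A hedged way to inspect these mappings offline, assuming rule id 1 is the EC rule and k+m=6; adjust the ids to your own map:

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -i crushmap.bin --test --rule 1 --num-rep 6 --show-mappings | head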
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significance of the us-east-1 region when using S3 clients to talk to RGW

2018-02-26 Thread David Turner
I just realized the difference between the internal realm, local realm, and
local-atl realm.  local-atl is a Luminous cluster while the other 2 are
Jewel.  It looks like that option was completely ignored in Jewel and now
Luminous is taking it into account (which is better imo).  I think you're
right that 'us' is probably some sort of default in s3cmd that doesn't
actually send the variable to the gateway.

Unfortunately we only allow https for rgw in the environments I have set
up, but I think we found the cause of the initial randomness of things.
Thanks Yehuda.
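For anyone else hitting InvalidLocationConstraint: the client-side knob is the bucket_location entry in ~/.s3cfg, or --region on newer s3cmd versions. In this sketch, 'local-atl' is the zonegroup from this thread and s3://mybucket is a made-up bucket name:

# in ~/.s3cfg
bucket_location = local-atl
# or per invocation
$ s3cmd mb --region=local-atl s3://mybucket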

On Mon, Feb 26, 2018 at 4:26 PM Yehuda Sadeh-Weinraub 
wrote:

> I don't know why 'us' works for you, but it could be that s3cmd is
> just not sending any location constraint when 'us' is set. You can try
> looking at the capture for this. You can try using wireshark for the
> capture (assuming http endpoint and not https).
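A capture along those lines could look roughly like this (following Yehuda's assumption of a plain-http endpoint; 7480 is the default civetweb port, adjust to your rgw_frontends setting, then open the pcap in wireshark):

$ tcpdump -i any -s 0 -w rgw.pcap 'tcp port 7480'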
>
> Yehuda
>
> On Mon, Feb 26, 2018 at 1:21 PM, David Turner 
> wrote:
> > I set it to that for randomness.  I don't have a zonegroup named 'us'
> > either, but that works fine.  I don't see why 'cn' should be any
> different.
> > The bucket_location that triggered me noticing this was 'gd1'.  I don't
> know
> > where that one came from, but I don't see why we should force people
> setting
> > it to 'us' when that has nothing to do with the realm.  If it needed to
> be
> > set to 'local-atl' that would make sense, but 'us' works just fine.
> Perhaps
> > 'us' working is what shouldn't work as opposed to allowing whatever else
> to
> > be able to work.
> >
> > I tested setting bucket_location to 'local-atl' and it did successfully
> > create the bucket.  So the question becomes, why do my other realms not
> care
> > what that value is set to and why does this realm allow 'us' to be used
> when
> > it isn't correct?
> >
> > On Mon, Feb 26, 2018 at 4:12 PM Yehuda Sadeh-Weinraub  >
> > wrote:
> >>
> >> If that's what you set in the config file, I assume that's what passed
> >> in. Why did you set that in your config file? You don't have a
> >> zonegroup named 'cn', right?
> >>
> >> On Mon, Feb 26, 2018 at 1:10 PM, David Turner 
> >> wrote:
> >> > I'm also not certain how to do the tcpdump for this.  Do you have any
> >> > pointers to how to capture that for you?
> >> >
> >> > On Mon, Feb 26, 2018 at 4:09 PM David Turner 
> >> > wrote:
> >> >>
> >> >> That's what I set it to in the config file.  I probably should have
> >> >> mentioned that.
> >> >>
> >> >> On Mon, Feb 26, 2018 at 4:07 PM Yehuda Sadeh-Weinraub
> >> >> 
> >> >> wrote:
> >> >>>
> >> >>> According to the log here, it says that the location constraint it
> got
> >> >>> is "cn", can you take a look at a tcpdump, see if that's actually
> >> >>> what's passed in?
> >> >>>
> >> >>> On Mon, Feb 26, 2018 at 12:02 PM, David Turner <
> drakonst...@gmail.com>
> >> >>> wrote:
> >> >>> > I run with `debug rgw = 10` and was able to find these lines at
> the
> >> >>> > end
> >> >>> > of a
> >> >>> > request to create the bucket.
> >> >>> >
> >> >>> > Successfully creating a bucket with `bucket_location = US` looks
> >> >>> > like
> >> >>> > [1]this.  Failing to create a bucket has "ERROR: S3 error: 400
> >> >>> > (InvalidLocationConstraint): The specified location-constraint is
> >> >>> > not
> >> >>> > valid"
> >> >>> > on the CLI and [2]this (excerpt from the end of the request) in
> the
> >> >>> > rgw
> >> >>> > log
> >> >>> > (debug level 10).  "create bucket location constraint" was not
> found
> >> >>> > in
> >> >>> > the
> >> >>> > log for successfully creating the bucket.
> >> >>> >
> >> >>> >
> >> >>> > [1]
> >> >>> > 2018-02-26 19:52:36.419251 7f4bc9bc8700 10 cache put:
> >> >>> >
> >> >>> >
> >> >>> >
> name=local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> >> >>> > info.flags=0x17
> >> >>> > 2018-02-26 19:52:36.419262 7f4bc9bc8700 10 adding
> >> >>> >
> >> >>> >
> >> >>> >
> local-atl.rgw.data.root++.bucket.meta.testerton:bef43c26-daf3-47ef-a3a5-e1167e3f88ac.39099765.1
> >> >>> > to cache LRU end
> >> >>> > 2018-02-26 19:52:36.419266 7f4bc9bc8700 10 updating xattr:
> >> >>> > name=user.rgw.acl
> >> >>> > bl.length()=141
> >> >>> > 2018-02-26 19:52:36.423863 7f4bc9bc8700 10
> >> >>> > RGWWatcher::handle_notify()
> >> >>> > notify_id 344855809097728 cookie 139963970426880 notifier 39099765
> >> >>> > bl.length()=361
> >> >>> > 2018-02-26 19:52:36.423875 7f4bc9bc8700 10 cache put:
> >> >>> > name=local-atl.rgw.data.root++testerton info.flags=0x17
> >> >>> > 2018-02-26 19:52:36.423882 7f4bc9bc8700 10 adding
> >> >>> > local-atl.rgw.data.root++testerton to cache LRU end
> >> >>> >
> >> >>> > [2]
> >> >>> > 2018-02-26 19:43:37.340289 7f466bbca700  2 req
> >> >>> > 428078:0.004204:s3:PUT
> >> >>> > /testraint/:create_bucket:executing
> >> >>> > 2018-02-26 19:43:37.340366 7f466bbca700  5 NOTICE: call to
> >> >>> > do_aws4_auth_completion
> >> >>> > 2018-02-26 19:43:37.340472 7f466bbca700 10 v4 auth ok --
> >> >>> > do_aws4_auth_co

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
Quick turn around,

Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
bluestore opened the floodgates.

> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr

Graph of performance jump. Extremely marked.
https://imgur.com/a/LZR9R 

So at least we now have the gun to go with the smoke.

Thanks for the help and appreciate you pointing me in some directions that I 
was able to use to figure out the issue.

Adding to ceph.conf for future OSD conversions.
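Presumably something along these lines in the [osd] section; this is a sketch, not Reed's actual config:

[osd]
osd_recovery_sleep_hdd = 0
osd_recovery_sleep_hybrid = 0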

Thanks,

Reed


> On Feb 26, 2018, at 4:12 PM, Reed Dier  wrote:
> 
> For the record, I am not seeing a demonstrative fix by injecting the value of 
> 0 into the OSDs running.
>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require 
>> restart)
> 
> If it does indeed need to be restarted, I will need to wait for the current 
> backfills to finish their process as restarting an OSD would bring me under 
> min_size.
> 
> However, doing config show on the osd daemon appears to have taken the value 
> of 0.
> 
>> ceph daemon osd.24 config show | grep recovery_sleep
>> "osd_recovery_sleep": "0.00",
>> "osd_recovery_sleep_hdd": "0.10",
>> "osd_recovery_sleep_hybrid": "0.00",
>> "osd_recovery_sleep_ssd": "0.00",
> 
> 
> I may take the restart as an opportunity to also move to 12.2.3 at the same 
> time, since it is not expected that that should affect this issue.
> 
> I could also attempt to change osd_recovery_sleep_hdd as well, since these 
> are ssd osd’s, it shouldn’t make a difference, but its a free move.
> 
> Thanks,
> 
> Reed
> 
>> On Feb 26, 2018, at 3:42 PM, Gregory Farnum > > wrote:
>> 
>> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier > > wrote:
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
>> solution to getting the metadata configured correctly.
>> 
>> Yes, that's a good workaround as long as you don't have any actual hybrid 
>> OSDs (or aren't worried about them sleeping...I'm not sure if that setting 
>> came from experience or not).
>>  
>> 
>> For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
>> with NVMe block.db.
>> 
>>> {
>>> "id": 24,
>>> "arch": "x86_64",
>>> "back_addr": "",
>>> "back_iface": "bond0",
>>> "bluefs": "1",
>>> "bluefs_db_access_mode": "blk",
>>> "bluefs_db_block_size": "4096",
>>> "bluefs_db_dev": "259:0",
>>> "bluefs_db_dev_node": "nvme0n1",
>>> "bluefs_db_driver": "KernelDevice",
>>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>>> "bluefs_db_rotational": "0",
>>> "bluefs_db_serial": " ",
>>> "bluefs_db_size": "16000221184",
>>> "bluefs_db_type": "nvme",
>>> "bluefs_single_shared_device": "0",
>>> "bluefs_slow_access_mode": "blk",
>>> "bluefs_slow_block_size": "4096",
>>> "bluefs_slow_dev": "253:8",
>>> "bluefs_slow_dev_node": "dm-8",
>>> "bluefs_slow_driver": "KernelDevice",
>>> "bluefs_slow_model": "",
>>> "bluefs_slow_partition_path": "/dev/dm-8",
>>> "bluefs_slow_rotational": "0",
>>> "bluefs_slow_size": "1920378863616",
>>> "bluefs_slow_type": "ssd",
>>> "bluestore_bdev_access_mode": "blk",
>>> "bluestore_bdev_block_size": "4096",
>>> "bluestore_bdev_dev": "253:8",
>>> "bluestore_bdev_dev_node": "dm-8",
>>> "bluestore_bdev_driver": "KernelDevice",
>>> "bluestore_bdev_model": "",
>>> "bluestore_bdev_partition_path": "/dev/dm-8",
>>> "bluestore_bdev_rotational": "0",
>>> "bluestore_bdev_size": "1920378863616",
>>> "bluestore_bdev_type": "ssd",
>>> "ceph_version": "ceph version 12.2.2 
>>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
>>> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
>>> "default_device_class": "ssd",
>>> "distro": "ubuntu",
>>> "distro_description": "Ubuntu 16.04.3 LTS",
>>> "distro_version": "16.04",
>>> "front_addr": "",
>>> "front_iface": "bond0",
>>> "hb_back_addr": "",
>>> "hb_front_addr": "",
>>> "hostname": “host00",
>>> "journal_rotational": "1",
>>> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 
>>> UTC 2018",
>>> "kernel_version": "4.13.0-26-generic",
>>> "mem_swap_kb": "124999672",
>>> "mem_total_kb": "131914008",
>>> "os": "Linux",
>>> "osd_data": "/var/lib/ceph/osd/ceph-24",
>>> "osd_objectstore": "bluestore",
>>> "rotational": "0"
>>> }
>> 
>> 
>> So it looks like it

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 2:23 PM Reed Dier  wrote:

> Quick turn around,
>
> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
> bluestore opened the floodgates.
>

Oh right, the OSD does not (think it can) have anything it can really do if
you've got a rotational journal and an SSD main device, and since BlueStore
was misreporting itself as having a rotational journal the OSD falls back
to the hard drive settings. Sorry I didn't work through that ahead of time;
glad this works around it for you!
-Greg
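To spot other OSDs affected by the same misreporting without dumping the full metadata, something like this should work; osd.24 is just the example from this thread:

$ ceph osd metadata 24 | egrep '"(rotational|journal_rotational|bluefs_db_rotational)"'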


>
> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
>
>
> Graph of performance jump. Extremely marked.
> https://imgur.com/a/LZR9R
>
> So at least we now have the gun to go with the smoke.
>
> Thanks for the help and appreciate you pointing me in some directions that
> I was able to use to figure out the issue.
>
> Adding to ceph.conf for future OSD conversions.
>
> Thanks,
>
> Reed
>
>
> On Feb 26, 2018, at 4:12 PM, Reed Dier  wrote:
>
> For the record, I am not seeing a demonstrative fix by injecting the value
> of 0 into the OSDs running.
>
> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require
> restart)
>
>
> If it does indeed need to be restarted, I will need to wait for the
> current backfills to finish their process as restarting an OSD would bring
> me under min_size.
>
> However, doing config show on the osd daemon appears to have taken the
> value of 0.
>
> ceph daemon osd.24 config show | grep recovery_sleep
> "osd_recovery_sleep": "0.00",
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.00",
> "osd_recovery_sleep_ssd": "0.00",
>
>
> I may take the restart as an opportunity to also move to 12.2.3 at the
> same time, since it is not expected that that should affect this issue.
>
> I could also attempt to change osd_recovery_sleep_hdd as well, since these
> are ssd osd’s, it shouldn’t make a difference, but its a free move.
>
> Thanks,
>
> Reed
>
> On Feb 26, 2018, at 3:42 PM, Gregory Farnum  wrote:
>
> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier  wrote:
>
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an
>> interim solution to getting the metadata configured correctly.
>>
>
> Yes, that's a good workaround as long as you don't have any actual hybrid
> OSDs (or aren't worried about them sleeping...I'm not sure if that setting
> came from experience or not).
>
>
>>
>> For reference, here is the complete metadata for osd.24, bluestore SATA
>> SSD with NVMe block.db.
>>
>> {
>> "id": 24,
>> "arch": "x86_64",
>> "back_addr": "",
>> "back_iface": "bond0",
>> "bluefs": "1",
>> "bluefs_db_access_mode": "blk",
>> "bluefs_db_block_size": "4096",
>> "bluefs_db_dev": "259:0",
>> "bluefs_db_dev_node": "nvme0n1",
>> "bluefs_db_driver": "KernelDevice",
>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>> "bluefs_db_rotational": "0",
>> "bluefs_db_serial": " ",
>> "bluefs_db_size": "16000221184",
>> "bluefs_db_type": "nvme",
>> "bluefs_single_shared_device": "0",
>> "bluefs_slow_access_mode": "blk",
>> "bluefs_slow_block_size": "4096",
>> "bluefs_slow_dev": "253:8",
>> "bluefs_slow_dev_node": "dm-8",
>> "bluefs_slow_driver": "KernelDevice",
>> "bluefs_slow_model": "",
>> "bluefs_slow_partition_path": "/dev/dm-8",
>> "bluefs_slow_rotational": "0",
>> "bluefs_slow_size": "1920378863616",
>> "bluefs_slow_type": "ssd",
>> "bluestore_bdev_access_mode": "blk",
>> "bluestore_bdev_block_size": "4096",
>> "bluestore_bdev_dev": "253:8",
>> "bluestore_bdev_dev_node": "dm-8",
>> "bluestore_bdev_driver": "KernelDevice",
>> "bluestore_bdev_model": "",
>> "bluestore_bdev_partition_path": "/dev/dm-8",
>> "bluestore_bdev_rotational": "0",
>> "bluestore_bdev_size": "1920378863616",
>> "bluestore_bdev_type": "ssd",
>> "ceph_version": "ceph version 12.2.2
>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
>> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
>> "default_device_class": "ssd",
>> "distro": "ubuntu",
>> "distro_description": "Ubuntu 16.04.3 LTS",
>> "distro_version": "16.04",
>> "front_addr": "",
>> "front_iface": "bond0",
>> "hb_back_addr": "",
>> "hb_front_addr": "",
>> "hostname": “host00",
>> "journal_rotational": "1",
>> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44
>> UTC 2018",
>> "kernel_version": "4.13.0-26-generic",
>> "mem_swap_kb

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 23:15 schrieb Gregory Farnum:
> 
> 
> On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> >     >     The EC pool I am considering is k=4 m=2 with failure domain 
> host, on 6 hosts.
> >     >     So necessarily, there is one shard for each host. If one host 
> goes down for a prolonged time,
> >     >     there's no "logical" advantage of redistributing things - 
> since whatever you do, with 5 hosts, all PGs will stay in degraded state 
> anyways.
> >     >
> >     >     However, I noticed Ceph is remapping all PGs, and actively 
> moving data. I presume now this is done for two reasons:
> >     >     - The remapping is needed since the primary OSD might be the 
> one which went down. But for remapping (I guess) there's no need to actually 
> move data,
> >     >       or is there?
> >     >     - The data movement is done to have the "k" shards available.
> >     >     If it's really the case that "all shards are equal", then 
> data movement should not occur - or is this a bug / bad feature?
> >     >
> >     >
> >     > If you lose one OSD out of a host, Ceph is going to try and 
> re-replicate the data onto the other OSDs in that host. Your PG size and the 
> CRUSH rule instructs it that the PG needs 6 different OSDs, and those OSDs 
> need to be placed on different hosts.
> >     >
> >     > You're right that gets very funny if your PG size is equal to the 
> number of hosts. We generally discourage people from running configurations 
> like that.
> >
> >     Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 hosts) 
> would be our starting point - since we may add more hosts later (not too 
> soon-ish, but it's not excluded more may come in a year or so),
> >     and migrating large EC pools to different settings still seems a 
> bit messy.
> >     We can't really afford to reduce available storage significantly 
> more in the current setup, and would like to have the possibility to lose one 
> host (for example for an OS upgrade),
> >     and then still lose a few disks in case they fail with bad timing.
> >
> >     >
> >     > Or if you mean that you are losing a host, and the data is 
> shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a 
> result of EC pools' "indep" rather than "firstn" crush rules?)
> >
> >     They are indep, which I think is the default (no manual editing 
> done). I thought the main goal of indep was exactly to reduce data movement.
> >     Indeed, it's very funny that data is moved, it certainly does not 
> help to increase redundancy ;-).
> >
> 
> >
> > Can you also share the output of "ceph osd crush dump"?
> 
> Attached.
> 
> 
> Yep, that all looks simple enough.
> 
> Do you have any "ceph -s" or other records from when this was occurring? Is 
> it actually deleting or migrating any of the existing shards, or is it just 
> that the shards which were previously on the out'ed OSDs are now getting 
> copied onto the remaining ones?
> 
> I think I finally understand what's happening here but would like to be sure. 
> :)
> -Greg
> 
> (In short: certain straws were previously mapping onto osd.[outed], but now 
> they map onto the remaining OSDs. Because everything's independent, the 
> actual CRUSH mapping for any shard other than the last is now going to map 
> onto a remaining OSD, which would displace the shard it already holds. But 
> the previously-present shard is going to remain "remapped" there because it 
> can't map successfully. So if you lose osd.5, you'll go from a CRUSH mapping 
> like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2 and 5 
> will both be on OSD 4.)

Interesting! This would also mean that space usage on the remaining-active OSDs 
would increase by 1/6 in our setup, which is significant. 
So that's another good reason to use mon_osd_down_out_subtree_limit=host or to 
just set "ceph osd set noout" when actively reinstalling a host. 

I reproduced just now. Here's what I see (ignore the inconsistent PG, that's 
unrelated and likely a consequence of previous OSD OOM issues): 
# ceph -s
  cluster:
id: 69b1fbe5-f084-4410-a99a-ab57417e7846
health: HEALTH_ERR
41569430/513248666 objects misplaced (8.099%)
1 scrub errors
Possible data damage: 1 pg inconsistent
Degraded data redundancy: 105575103/513248666 objects degraded 
(20.570%), 2176 pgs degraded, 985 pgs undersized
 
  services:
mon: 3 daemons, quorum mon003,mon001,mon002
mgr: mon002(active), standbys: mon001, mon003
mds: cephfs_baf-1/1/1 up  {0=mon002=up:active}, 1 up:standby-replay, 1 
up:standby
osd: 196 osds: 164 up, 164 in; 1166 remapped pgs
 
  data:
pools:   2 pools, 2176 pgs
objects: 89370k objects, 4488 GB
usage:   29546 GB used, 555 TB / 584 TB avail
pgs: 105575103/5132

Re: [ceph-users] CephFS Single Threaded Performance

2018-02-26 Thread John Spray
On Mon, Feb 26, 2018 at 6:25 PM, Brian Woods  wrote:
> I have a small test cluster (just two nodes) and after rebuilding it several
> times I found my latest configuration that SHOULD be the fastest is by far
> the slowest (per thread).
>
>
> I have around 10 spindles that I have an erasure encoded CephFS on. When I
> installed several SSDs and recreated it with the meta data and the write
> cache on SSD my performance plummeted from about 10-20MBps to 2-3MBps, but
> only per thread… I did a rados benchmark and the SSDs Meta and Write pools
> can sustain anywhere from 50 to 150MBps without issue.
>
>
> And, if I spool up multiple copies to the FS, each copy adds to that
> throughput without much of a hit. In fact I can go up to about 8 copies
> (about 16MBps) before they start slowing down at all. Even while I have
> several threads actively writing I still benchmark around 25MBps.

If a CephFS system is experiencing substantial latency doing metadata
operations, then you may find that the overall data throughput is much
worse with a single writer process than with several.  That would be
because typical workloads like "cp" or "tar" are entirely serial, and
will wait for one metadata operation (such as creating a file) to
complete before doing any more work.
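To see the effect, one can compare a single serial copy against a handful of concurrent ones; the file names and mount point below are made up:

$ time cp big1 /mnt/cephfs/dir/
$ time sh -c 'for f in big1 big2 big3 big4; do cp "$f" /mnt/cephfs/dir/ & done; wait'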

In your case, I would suspect that your metadata latency got a lot
worse when you switched from dedicating your SSDs to metadata, to
sharing your SSDs between metadata and a cache tier.  This is one of
many situations in which configuring a cache tier can make your
performance worse rather than better.  Cache tiers generally only make
sense if you know you have a "hot" subset of a larger dataset, and
that subset fits in your cache tier.

> Any ideas why single threaded performance would take a hit like this? Almost
> everything is running on a single node (just a few OSDs on another node) and
> I have plenty of RAM (96GBs) and CPU (8 Xeon Cores).

In general, performance testing you do on 1-2 nodes is unlikely to
translate well to what would happen on a more usually sized cluster.
If building a "mini" Ceph cluster for performance testing, I'd suggest
at the very minimum that you start with three servers for OSDs, a
separate one for the MDS, and another separate one for the client.
That way, you have network hops in all the right places, rather than
having the 2-node situation where some arbitrary 50% of messages are
not actually traversing a network, and where clients are competing for
CPU time with servers.

John

>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 2:30 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 23:15 schrieb Gregory Farnum:
> >
> >
> > On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >
> > > > The EC pool I am considering is k=4 m=2 with failure
> domain host, on 6 hosts.
> > > > So necessarily, there is one shard for each host. If one
> host goes down for a prolonged time,
> > > > there's no "logical" advantage of redistributing things
> - since whatever you do, with 5 hosts, all PGs will stay in degraded state
> anyways.
> > > >
> > > > However, I noticed Ceph is remapping all PGs, and
> actively moving data. I presume now this is done for two reasons:
> > > > - The remapping is needed since the primary OSD might be
> the one which went down. But for remapping (I guess) there's no need to
> actually move data,
> > > >   or is there?
> > > > - The data movement is done to have the "k" shards
> available.
> > > > If it's really the case that "all shards are equal",
> then data movement should not occur - or is this a bug / bad feature?
> > > >
> > > >
> > > > If you lose one OSD out of a host, Ceph is going to try and
> re-replicate the data onto the other OSDs in that host. Your PG size and
> the CRUSH rule instructs it that the PG needs 6 different OSDs, and those
> OSDs need to be placed on different hosts.
> > > >
> > > > You're right that gets very funny if your PG size is equal
> to the number of hosts. We generally discourage people from running
> configurations like that.
> > >
> > > Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2
> hosts) would be our starting point - since we may add more hosts later (not
> too soon-ish, but it's not excluded more may come in a year or so),
> > > and migrating large EC pools to different settings still seems
> a bit messy.
> > > We can't really afford to reduce available storage
> significantly more in the current setup, and would like to have the
> possibility to lose one host (for example for an OS upgrade),
> > > and then still lose a few disks in case they fail with bad
> timing.
> > >
> > > >
> > > > Or if you mean that you are losing a host, and the data is
> shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a
> result of EC pools' "indep" rather than "firstn" crush rules?)
> > >
> > > They are indep, which I think is the default (no manual
> editing done). I thought the main goal of indep was exactly to reduce data
> movement.
> > > Indeed, it's very funny that data is moved, it certainly does
> not help to increase redundancy ;-).
> > >
> > 
> > >
> > > Can you also share the output of "ceph osd crush dump"?
> >
> > Attached.
> >
> >
> > Yep, that all looks simple enough.
> >
> > Do you have any "ceph -s" or other records from when this was occurring?
> Is it actually deleting or migrating any of the existing shards, or is it
> just that the shards which were previously on the out'ed OSDs are now
> getting copied onto the remaining ones?
> >
> > I think I finally understand what's happening here but would like to be
> sure. :)
> > -Greg
> >
> > (In short: certain straws were previously mapping onto osd.[outed], but
> now they map onto the remaining OSDs. Because everything's independent, the
> actual CRUSH mapping for any shard other than the last is now going to map
> onto a remaining OSD, which would displace the shard it already holds. But
> the previously-present shard is going to remain "remapped" there because it
> can't map successfully. So if you lose osd.5, you'll go from a CRUSH
> mapping like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2
> and 5 will both be on OSD 4.)
>
> Interesting! This would also mean that space usage on the remaining-active
> OSDs would increase by 1/6 in our setup, which is significant.
> So that's another good reason to use mon_osd_down_out_subtree_limit=host
> or to just set "ceph osd set noout" when actively reinstalling a host.
>
> I reproduced just now. Here's what I see (ignore the inconsistent PG,
> that's unrelated and likely a cause of previous OSD OOM issues):
> # ceph -s
>   cluster:
> id: 69b1fbe5-f084-4410-a99a-ab57417e7846
> health: HEALTH_ERR
> 41569430/513248666 objects misplaced (8.099%)
> 1 scrub errors
> Possible data damage: 1 pg inconsistent
> Degraded data redundancy: 105575103/513248666 objects degraded
> (20.570%), 2176 pgs degraded, 985 pgs undersized
>
>   services:
> mon: 3 daemons, quorum mon003,mon001,mon002
> mgr: mon002(active), standbys: mon001, mon003
> mds: cephfs_baf-1/1/1 up  {0=mon002=up:active}, 1 up:standby-replay, 1
> up:s

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 23:29 schrieb Gregory Farnum:
> 
> 
> On Mon, Feb 26, 2018 at 2:23 PM Reed Dier  > wrote:
> 
> Quick turn around,
> 
> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
> bluestore opened the floodgates.
> 
> 
> Oh right, the OSD does not (think it can) have anything it can really do if 
> you've got a rotational journal and an SSD main device, and since BlueStore 
> was misreporting itself as having a rotational journal the OSD falls back to 
> the hard drive settings. Sorry I didn't work through that ahead of time; glad 
> this works around it for you!
> -Greg

To chime in, this also helps for me! Replication is much faster now. 
It's a bit strange though that for my metadata-OSDs I see the following with 
iostat now:
Device:  rrqm/s  wrqm/s      r/s     w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0,00    0,00  1333,00  301,40  143391,20  42861,60    227,92     21,05  13,78     8,88    35,44   0,59  96,64
sda        0,00    0,00  1283,40  258,20  139004,80    876,00    181,47      7,18   4,66     5,11     2,40   0,54  83,32
(MDS should be doing nothing on it)
while on the OSDs to which things are backfilled I see:
Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0,00    0,00  47,20  458,20  367,20  1628,00      7,90      0,92   1,82     6,95     1,29   1,18  59,86
sdb        0,00    0,00  48,20  589,00  375,20  1892,00      7,12      0,40   0,63     0,78     0,62   0,59  37,32

So it seems the "sending" OSDs are now finally taken to their limit (they read 
and write a lot), but the receiving side is rather bored. 
Maybe this strange effect (many writes when actually reading stuff for 
backfilling) is normal for metadata => RocksDB? 

In any case, glad this "rotational" issue is in the queue to be fixed in a 
future release ;-). 

Cheers,
Oliver

>  
> 
> 
>> pool objects-ssd id 20
>>   recovery io 1512 MB/s, 21547 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
> 
> Graph of performance jump. Extremely marked.
> https://imgur.com/a/LZR9R
> 
> So at least we now have the gun to go with the smoke.
> 
> Thanks for the help and appreciate you pointing me in some directions 
> that I was able to use to figure out the issue.
> 
> Adding to ceph.conf for future OSD conversions.
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 26, 2018, at 4:12 PM, Reed Dier > > wrote:
>>
>> For the record, I am not seeing a demonstrative fix by injecting the 
>> value of 0 into the OSDs running.
>>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may 
>>> require restart)
>>
>> If it does indeed need to be restarted, I will need to wait for the 
>> current backfills to finish their process as restarting an OSD would bring 
>> me under min_size.
>>
>> However, doing config show on the osd daemon appears to have taken the 
>> value of 0.
>>
>>> ceph daemon osd.24 config show | grep recovery_sleep
>>>     "osd_recovery_sleep": "0.00",
>>>     "osd_recovery_sleep_hdd": "0.10",
>>>     "osd_recovery_sleep_hybrid": "0.00",
>>>     "osd_recovery_sleep_ssd": "0.00",
>>
>> I may take the restart as an opportunity to also move to 12.2.3 at the 
>> same time, since it is not expected that that should affect this issue.
>>
>> I could also attempt to change osd_recovery_sleep_hdd as well, since 
>> these are ssd osd’s, it shouldn’t make a difference, but its a free move.
>>
>> Thanks,
>>
>> Reed
>>
>>> On Feb 26, 2018, at 3:42 PM, Gregory Farnum >> > wrote:
>>>
>>> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier >> > wrote:
>>>
>>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an 
>>> interim solution to getting the metadata configured correctly.
>>>
>>>
>>> Yes, that's a good workaround as long as you don't have any actual 
>>> hybrid OSDs (or aren't worried about them sleeping...I'm not sure if that 
>>> setting came from experience or not).
>>>  
>>>
>>>
>>> For reference, here is the complete metadata for osd.24, bluestore 
>>> SATA SSD with NVMe block.db.
>>>
 {
         "id": 24,
         "arch": "x86_64",
         "back_addr": "",
         "back_iface": "bond0",
         "bluefs": "1",
         "bluefs_db_access_mode": "blk",
         "bluefs_db_block_size": "4096",
         "bluefs_db_dev": "259:0",
         "bluefs_db_dev_node": "nvme0n1",
         "bluefs_db_driver": "KernelDevice",
      

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 23:48 schrieb Gregory Farnum:
> 
> 
> On Mon, Feb 26, 2018 at 2:30 PM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Am 26.02.2018 um 23:15 schrieb Gregory Farnum:
> >
> >
> > On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de> 
>  >> wrote:
> >
> >     >     >     The EC pool I am considering is k=4 m=2 with failure 
> domain host, on 6 hosts.
> >     >     >     So necessarily, there is one shard for each host. If 
> one host goes down for a prolonged time,
> >     >     >     there's no "logical" advantage of redistributing things 
> - since whatever you do, with 5 hosts, all PGs will stay in degraded state 
> anyways.
> >     >     >
> >     >     >     However, I noticed Ceph is remapping all PGs, and 
> actively moving data. I presume now this is done for two reasons:
> >     >     >     - The remapping is needed since the primary OSD might 
> be the one which went down. But for remapping (I guess) there's no need to 
> actually move data,
> >     >     >       or is there?
> >     >     >     - The data movement is done to have the "k" shards 
> available.
> >     >     >     If it's really the case that "all shards are equal", 
> then data movement should not occur - or is this a bug / bad feature?
> >     >     >
> >     >     >
> >     >     > If you lose one OSD out of a host, Ceph is going to try and 
> re-replicate the data onto the other OSDs in that host. Your PG size and the 
> CRUSH rule instructs it that the PG needs 6 different OSDs, and those OSDs 
> need to be placed on different hosts.
> >     >     >
> >     >     > You're right that gets very funny if your PG size is equal 
> to the number of hosts. We generally discourage people from running 
> configurations like that.
> >     >
> >     >     Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2 
> hosts) would be our starting point - since we may add more hosts later (not 
> too soon-ish, but it's not excluded more may come in a year or so),
> >     >     and migrating large EC pools to different settings still 
> seems a bit messy.
> >     >     We can't really afford to reduce available storage 
> significantly more in the current setup, and would like to have the 
> possibility to lose one host (for example for an OS upgrade),
> >     >     and then still lose a few disks in case they fail with bad 
> timing.
> >     >
> >     >     >
> >     >     > Or if you mean that you are losing a host, and the data is 
> shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a 
> result of EC pools' "indep" rather than "firstn" crush rules?)
> >     >
> >     >     They are indep, which I think is the default (no manual 
> editing done). I thought the main goal of indep was exactly to reduce data 
> movement.
> >     >     Indeed, it's very funny that data is moved, it certainly does 
> not help to increase redundancy ;-).
> >     >
> >     
> >     >
> >     > Can you also share the output of "ceph osd crush dump"?
> >
> >     Attached.
> >
> >
> > Yep, that all looks simple enough.
> >
> > Do you have any "ceph -s" or other records from when this was 
> occurring? Is it actually deleting or migrating any of the existing shards, 
> or is it just that the shards which were previously on the out'ed OSDs are 
> now getting copied onto the remaining ones?
> >
> > I think I finally understand what's happening here but would like to be 
> sure. :)
> > -Greg
> >
> > (In short: certain straws were previously mapping onto osd.[outed], but 
> now they map onto the remaining OSDs. Because everything's independent, the 
> actual CRUSH mapping for any shard other than the last is now going to map 
> onto a remaining OSD, which would displace the shard it already holds. But 
> the previously-present shard is going to remain "remapped" there because it 
> can't map successfully. So if you lose osd.5, you'll go from a CRUSH mapping 
> like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2 and 5 
> will both be on OSD 4.)
> 
> Interesting! This would also mean that space usage on the 
> remaining-active OSDs would increase by 1/6 in our setup, which is 
> significant.
> So that's another good reason to use mon_osd_down_out_subtree_limit=host 
> or to just set "ceph osd set noout" when actively reinstalling a host.
> 
> I reproduced just now. Here's what I see (ignore the inconsistent PG, 
> that's unrelated and likely a cause of previous OSD OOM issues):
> # ceph -s
>   cluster:
>     id:     69b1fbe5-f084-4410-a99a-ab57417e7846
>     health: HEALTH_ERR
>             41569430/513248666 objects misplaced (8.099%)
>             1 scrub errors
>  

Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

>
> > Does this match expectations?
> >
> >
> > Can you get the output of eg "ceph pg 2.7cd query"? Want to make sure
> the backfilling versus acting sets and things are correct.
>
> You'll find attached:
> query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs are up and
> everything is healthy.
> query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs 164-195 (one
> host) are down and out.
>

Yep, that's what we want to see. So when everything's well, we have OSDs
91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1, 5, 6, 4.

When marking out a host, we have OSDs 91, 63, 33, 163, 123, UNMAPPED. That
corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.

So what's happened is that with the new map, when choosing the home for
shard 4, we selected host 4 instead of host 6 (which is gone). And now
shard 5 can't map properly. But of course we still have shard 5 available
on host 4, so host 4 is going to end up properly owning shard 4, but also
just carrying that shard 5 around as a remapped location.

So this is as we expect. Whew.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fast_read in EC pools

2018-02-26 Thread Oliver Freyermuth
Am 27.02.2018 um 00:10 schrieb Gregory Farnum:
> On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> 
> >     Does this match expectations?
> >
> >
> > Can you get the output of eg "ceph pg 2.7cd query"? Want to make sure 
> the backfilling versus acting sets and things are correct.
> 
> You'll find attached:
> query_allwell)  Output of "ceph pg 2.7cd query" when all OSDs are up and 
> everything is healthy.
> query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs 164-195 
> (one host) are down and out.
> 
> 
> Yep, that's what we want to see. So when everything's well, we have OSDs 91, 
> 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1, 5, 6, 4.
> 
> When marking out a host, we have OSDs 91, 63, 33, 163, 123, UNMAPPED. That 
> corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
> 
> So what's happened is that with the new map, when choosing the home for shard 
> 4, we selected host 4 instead of host 6 (which is gone). And now shard 5 
> can't map properly. But of course we still have shard 5 available on host 4, 
> so host 4 is going to end up properly owning shard 4, but also just carrying 
> that shard 5 around as a remapped location.
> 
> So this is as we expect. Whew.
> -Greg

Understood. Thanks for explaining step by step :-). 
It's of course a bit weird that this happens, since in the end, this really 
means data is moved (or rather, a shard is recreated) and takes up space 
without increasing redundancy
(well, it might, if it lands on a different OSD than shard 5, but that's not 
really ensured). I'm unsure if this can be solved "better" in any way. 

Anyways, it seems this would be another reason why running with k+m=number of 
hosts should not be a general recommendation. For us, it's fine for now,
especially since we want to keep the cluster open for later extension with more 
OSDs, and we do now know the gotchas - and I don't see a better EC 
configuration at the moment
which would accommodate our wishes (one host + x safety, don't reduce space too 
much). 

So thanks again! 

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

2018-02-26 Thread Jan Pekař - Imatic

I think I hit the same issue.
I have corrupted data on cephfs and I don't remember the same issue 
before Luminous (I did the same tests before).


It is on my 1-node test cluster with less memory than recommended (so the 
server is swapping), but it shouldn't lose data (it never did before).

So slow requests may appear in the log like Florent B mentioned.

My test is to take some bigger files (a few GB) and copy them to cephfs or 
from cephfs to cephfs and stress the cluster so that the data copying stalls 
for a while. It will resume in a few seconds/minutes and everything looks ok 
(no error on copying). But the copied file may be corrupted silently.


I checked files with md5sum and compared some corrupted files in detail. 
Some 4MB blocks of data (the cephfs object size) were missing - the 
corrupted file had those blocks filled with zeroes.
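A minimal version of that comparison, assuming the original copy lives outside CephFS; the paths are made up:

$ md5sum /data/orig/bigfile /mnt/cephfs/bigfile
$ cmp -l /data/orig/bigfile /mnt/cephfs/bigfile | head   # byte offsets of the first differences; a zeroed 4MB object shows up as a long run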


My idea is that something goes wrong when the cluster is under 
pressure and the client wants to save the block. The client gets an OK and 
continues with the next block, so data is lost and the corrupted block is 
filled with zeros.


I tried kernel client 4.x and ceph-fuse client with same result.

I'm using erasure coding for the cephfs data pool with a cache tier, and my 
storage is a mix of bluestore and filestore.


How can I help debug this, or what should I do to help find the problem?

With regards
Jan Pekar

On 14.12.2017 15:41, Yan, Zheng wrote:

On Thu, Dec 14, 2017 at 8:52 PM, Florent B  wrote:

On 14/12/2017 03:38, Yan, Zheng wrote:

On Thu, Dec 14, 2017 at 12:49 AM, Florent B  wrote:


Systems are on Debian Jessie : kernel 3.16.0-4-amd64 & libfuse 2.9.3-15.

I don't know pattern of corruption, but according to error message in
Dovecot, it seems to expect data to read but reach EOF.

All seems fine using fuse_disable_pagecache (no more corruption, and
performance increased : no more MDS slow requests on filelock requests).


I checked ceph-fuse changes since kraken, didn't find any clue. I
would be helpful if you can try recent version kernel.

Regards
Yan, Zheng


Problem occurred this morning even with fuse_disable_pagecache=true.

It seems to be a lock issue between imap & lmtp processes.

Dovecot uses fcntl as locking method. Is there any change about it in
Luminous ? I switched to flock to see if problem is still there...



I don't remenber there is any change.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Orphaned objects after removing rbd image

2018-02-26 Thread Krzysztof Dajka
Hi,

Recently I discovered that my pool doesn't reclaim all the space after
deleting volumes from OpenStack. I haven't pinpointed whether the problem is
caused by the client (cinder volume) or by the backend itself.

For now I've came to conclusion that ceph pool volumes has orphaned objects:
Total number of 'volumes':
[root@won1 STACK ~]# rbd ls volumes|wc -l
85

Total number of 'volumes' including orphaned volumes:
[root@won1 STACK ~]# rados ls -p volumes | grep rbd_data | sort | awk -F.
'{ print $2 }' |uniq -c |sort -n |wc -l
133

At the moment I'm backing up the orphaned objects using
rados get -p volumes $x /mnt/rbd_data/$x
so that later I can remove those objects which don't have a parent image.

Did anyone have any issues with orphaned objects on ceph 10.2.7?
I cannot find any bugs or anyone mentioning this issue besides this topic:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/015236.html
Is there a tool or a better way to pinpoint objects without a parent image?
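One possible way to narrow this down, sketched here: collect the block_name_prefix of every existing image and diff that against the prefixes seen in rados ls. It assumes jq is installed and that all images are format 2 (i.e. use rbd_data.* prefixes):

$ for img in $(rbd ls volumes); do rbd info volumes/$img --format json; done \
    | jq -r .block_name_prefix | sed 's/^rbd_data\.//' | sort > known_prefixes
$ rados ls -p volumes | grep '^rbd_data\.' | awk -F. '{ print $2 }' | sort -u > seen_prefixes
$ comm -13 known_prefixes seen_prefixes   # prefixes present in the pool but belonging to no image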

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com