[ceph-users] Octopus 15.2.1

2020-04-10 Thread gert . wieberdink
I am trying to install a fresh Ceph cluster on CentOS 8.
Using the latest Ceph repo for el8, the install is still not possible because of
missing dependencies:
libleveldb.so.1 is needed by ceph-osd.
Even after manually downloading and installing the
leveldb-1.20-1.el8.x86_64.rpm package, there are still unresolved dependencies:
Problem: package ceph-mgr-2:15.2.1-0.el8.x86_64 requires ceph-mgr-modules-core 
= 2:15.2.1-0.el8, but none of the providers can be installed
  - conflicting requests
  - nothing provides python3-cherrypy needed by 
ceph-mgr-modules-core-2:15.2.1-0.el8.noarch
  - nothing provides python3-pecan needed by 
ceph-mgr-modules-core-2:15.2.1-0.el8.noarch

Is there a way to perform a fresh Ceph install on CentOS 8?
Thanking in advance for your answer.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: remove S3 bucket with rados CLI

2020-04-10 Thread Paul Emmerich
Quick & dirty solution if only one OSD is full (likely, as the cluster looks
very unbalanced): take the full OSD down, delete the data, then bring it back
online.
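
A rough sketch of what that could look like on the command line - the OSD id and
bucket name are placeholders, not values from this thread:

  # keep CRUSH from rebalancing while the OSD is briefly down
  ceph osd set noout
  # on the OSD host: stop the full OSD
  systemctl stop ceph-osd@<id>
  # delete the bucket and its objects
  radosgw-admin bucket rm --bucket=<bucket> --purge-objects
  # bring the OSD back and re-enable rebalancing
  systemctl start ceph-osd@<id>
  ceph osd unset noout

Dan's alternative from the quoted thread below - temporarily raising the full
ratio - would be something like "ceph osd set-full-ratio 0.96", reverted once
space has been freed.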


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Apr 9, 2020 at 3:30 PM Dan van der Ster  wrote:
>
> On Thu, Apr 9, 2020 at 3:25 PM Robert Sander
>  wrote:
> >
> > Hi Dan,
> >
> > Am 09.04.20 um 15:08 schrieb Dan van der Ster:
> > >
> > > What do you have for full_ratio?
> >
> > The cluster is running Nautilus and the ratios should still be the
> > default values. Currently I have no direct access to report them.
> >
> > > Maybe you can unblock by setting the full_ratio to 0.96?
> >
> > We will try that on Tuesday.
> >
> > Additionally here is the output of "ceph df":
> >
> > [root@fra1s80103 ~]# ceph df
> > RAW STORAGE:
> >     CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> >     hdd    524 TiB  101 TiB  416 TiB   423 TiB      80.74
> >     ssd     11 TiB  7.8 TiB  688 MiB   3.2 TiB      28.92
> >     TOTAL  535 TiB  109 TiB  416 TiB   426 TiB      79.68
> >
> > POOLS:
> >     POOL                        ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
> >     .rgw.root                    2  1.2 KiB        4  256 KiB      0    1.4 TiB
> >     default.rgw.control          3      0 B        8      0 B      0    1.4 TiB
> >     default.rgw.meta             4  3.2 KiB       13  769 KiB      0    1.4 TiB
> >     default.rgw.log              5   48 KiB      210   48 KiB      0    1.4 TiB
> >     default.rgw.buckets.index    6  487 GiB   21.10k  487 GiB   8.09    1.4 TiB
> >     default.rgw.buckets.data     8  186 TiB  671.88M  416 TiB 100.00        0 B
> >     default.rgw.buckets.non-ec   9      0 B        0      0 B      0        0 B
> >
> > It's a four node cluster with the buckets.data pool erasure coded on hdd
> > with k=m=2 and size=4 and min_size=2, to have each part on a different node.
> >
> > New HDDs and even new nodes are currently being ordered to expand this
> > proof of concept setup for backup storage.
>
> This looks like an unbalanced cluster.
>
> # ceph osd df  | sort -n -k17
>
> should be illuminating.
>
> -- dan
>
>
> >
> > Regards
> > --
> > Robert Sander
> > Heinlein Support GmbH
> > Schwedter Str. 8/9b, 10119 Berlin
> >
> > http://www.heinlein-support.de
> >
> > Tel: 030 / 405051-43
> > Fax: 030 / 405051-19
> >
> > Zwangsangaben lt. §35a GmbHG:
> > HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> > Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus 15.2.1

2020-04-10 Thread Jeff Bailey
Leveldb is currently in epel-testing and should be moved to epel next 
week. You can get the rest of the dependencies from 
https://copr.fedorainfracloud.org/coprs/ktdreyer/ceph-el8/ and it works 
fine. Hopefully everything will make it into epel eventually, but for 
now this is good enough for me.
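
For reference, a minimal sketch of how those repos can be wired up on CentOS 8
(this assumes the EPEL release package and dnf-plugins-core are already installed):

  # leveldb from epel-testing until it reaches epel proper
  dnf --enablerepo=epel-testing install leveldb
  # the remaining python3 dependencies (cherrypy, pecan, ...) from the copr repo
  dnf copr enable ktdreyer/ceph-el8
  dnf install ceph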


On 4/10/2020 4:06 AM, gert.wieberd...@ziggo.nl wrote:

I am trying to install a fresh Ceph cluster on CentOS 8.
Using the latest Ceph repo for el8, it still is not possible because of certain 
dependencies:
libleveldb.so.1 needed by ceph-osd.
Even after manually downloading and installing the 
leveldb-1.20-1.el8.x86_64.rpm package, there are still dependencies:
Problem: package ceph-mgr-2:15.2.1-0.el8.x86_64 requires ceph-mgr-modules-core 
= 2:15.2.1-0.el8, but none of the providers can be installed
   - conflicting requests
   - nothing provides python3-cherrypy needed by 
ceph-mgr-modules-core-2:15.2.1-0.el8.noarch
   - nothing provides python3-pecan needed by 
ceph-mgr-modules-core-2:15.2.1-0.el8.noarch

Is there a way to perform a fresh Ceph install on CentOS 8?
Thanking in advance for your answer.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recommendation for decent write latency performance from HDDs

2020-04-10 Thread Reed Dier
Going to resurrect this thread to provide another option:

LVM-cache, i.e. putting a cache device in front of the bluestore LVM LV.

I only mention this because I noticed it in the SUSE documentation for SES6 
(based on Nautilus) here: 
https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html 


>  If you plan to use a fast drive as an LVM cache for multiple OSDs, be aware 
> that all OSD operations (including replication) will go through the caching 
> device. All reads will be queried from the caching device, and are only 
> served from the slow device in case of a cache miss. Writes are always 
> applied to the caching device first, and are flushed to the slow device at a 
> later time ('writeback' is the default caching mode).
> When deciding whether to utilize an LVM cache, verify whether the fast drive 
> can serve as a front for multiple OSDs while still providing an acceptable 
> amount of IOPS. You can test it by measuring the maximum amount of IOPS that 
> the fast device can serve, and then dividing the result by the number of OSDs 
> behind the fast device. If the result is lower or close to the maximum amount 
> of IOPS that the OSD can provide without the cache, LVM cache is probably not 
> suited for this setup.
> The interaction of the LVM cache device with OSDs is important. Writes are 
> periodically flushed from the caching device to the slow device. If the 
> incoming traffic is sustained and significant, the caching device will 
> struggle to keep up with incoming requests as well as the flushing process, 
> resulting in performance drop. Unless the fast device can provide much more 
> IOPS with better latency than the slow device, do not use LVM cache with a 
> sustained high volume workload. Traffic in a burst pattern is more suited for 
> LVM cache as it gives the cache time to flush its dirty data without 
> interfering with client traffic. For a sustained low traffic workload, it is 
> difficult to guess in advance whether using LVM cache will improve 
> performance. The best test is to benchmark and compare the LVM cache setup 
> against the WAL/DB setup. Moreover, as small writes are heavy on the WAL 
> partition, it is suggested to use the fast device for the DB and/or WAL 
> instead of an LVM cache.


So it sounds like you could partition your NVMe for either LVM-cache, DB/WAL, 
or both?
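
For anyone wanting to try the LVM-cache route, a minimal sketch with plain LVM
commands in the spirit of the SUSE page above - the VG/LV names, cache size and
NVMe partition are purely illustrative:

  # add an NVMe partition to the OSD's volume group
  vgextend ceph-block-0 /dev/nvme0n1p1
  # create a cache pool on the NVMe partition
  lvcreate --type cache-pool -L 100G -n osd0-cache ceph-block-0 /dev/nvme0n1p1
  # attach it to the bluestore data LV ('writeback' is the default cache mode)
  lvconvert --type cache --cachepool ceph-block-0/osd0-cache ceph-block-0/osd-block-0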

This just sounded a bit more akin to what you were looking for in your 
original post, so I figured I would share it.

I don't use it myself.

Reed

> On Apr 4, 2020, at 9:12 AM, jes...@krogh.cc wrote:
> 
> Hi.
> 
> We have a need for "bulk" storage - but with decent write latencies.
> Normally we would do this with a DAS with RAID5 and a 2GB battery-backed
> write cache in front - as cheap as possible while still getting the
> scalability features of Ceph.
> 
> In our "first" ceph cluster we did the same - just stuffed in BBWC
> in the OSD nodes and we're fine - but now we're onto the next one and
> systems like:
> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
> Does not support a Raid controller like that - but is branded as for "Ceph
> Storage Solutions".
> 
> It do however support 4 NVMe slots in the front - So - some level of
> "tiering" using the NVMe drives should be what is "suggested" - but what
> do people do? What is recommeneded. I see multiple options:
> 
> Ceph tiering at the "pool - layer":
> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> And rumors that it is "deprecated":
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
> 
> Pro: Abstract layer
> Con: Deprecated? - Lots of warnings?
> 
> Offloading the block.db on NVMe / SSD:
> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
> 
> Pro: Easy to deal with - seems heavily supported.
> Con: As far as I can tell - this will only benefit the metadata of the
> OSD - not actual data. Thus a data commit to the OSD will still be dominated
> by the write latency of the underlying - very slow - HDD.
> 
> Bcache:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
> 
> Pro: Closest to the BBWC mentioned above - but with way-way larger cache
> sizes.
> Con: It is hard to see if I end up being the only one on the planet using
> this solution.
> 
> Eat it - Writes will be as slow as hitting dead rust - anything that
> cannot live with that needs to be entirely on SSD/NVMe.
> 
> Other?
> 
> Thanks for your input.
> 
> Jesper
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.

[ceph-users] Re: Recommendation for decent write latency performance from HDDs

2020-04-10 Thread Paul Emmerich
My main problem with LVM cache was always the unpredictable
performance. It's *very* hard to benchmark properly even in a
synthetic setup, even harder to guess anything about a real-world
workload.
And testing out both configurations for a real-world setup is often
not feasible, especially as usage patterns change over the lifetime of
a cluster.
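
If someone does want to measure it, a hedged fio sketch for comparing a cached vs.
uncached LV (the device path is a placeholder, and the run overwrites whatever is
on it; sustained long runs matter far more than short bursts that fit in the cache):

  fio --name=randwrite --filename=/dev/ceph-block-0/osd-block-0 --ioengine=libaio \
      --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
      --runtime=1800 --time_based --group_reporting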

Does anyone have any real-world experience with LVM cache?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Apr 10, 2020 at 11:19 PM Reed Dier  wrote:
>
> Going to resurrect this thread to provide another option:
>
> LVM-cache, ie putting a cache device in-front of the bluestore-LVM LV.
>
> I only mention this because I noticed it in the SUSE documentation for SES6 
> (based on Nautilus) here: 
> https://documentation.suse.com/ses/6/html/ses-all/lvmcache.html
>
>  If you plan to use a fast drive as an LVM cache for multiple OSDs, be aware 
> that all OSD operations (including replication) will go through the caching 
> device. All reads will be queried from the caching device, and are only 
> served from the slow device in case of a cache miss. Writes are always 
> applied to the caching device first, and are flushed to the slow device at a 
> later time ('writeback' is the default caching mode).
> When deciding whether to utilize an LVM cache, verify whether the fast drive 
> can serve as a front for multiple OSDs while still providing an acceptable 
> amount of IOPS. You can test it by measuring the maximum amount of IOPS that 
> the fast device can serve, and then dividing the result by the number of OSDs 
> behind the fast device. If the result is lower or close to the maximum amount 
> of IOPS that the OSD can provide without the cache, LVM cache is probably not 
> suited for this setup.
>
> The interaction of the LVM cache device with OSDs is important. Writes are 
> periodically flushed from the caching device to the slow device. If the 
> incoming traffic is sustained and significant, the caching device will 
> struggle to keep up with incoming requests as well as the flushing process, 
> resulting in performance drop. Unless the fast device can provide much more 
> IOPS with better latency than the slow device, do not use LVM cache with a 
> sustained high volume workload. Traffic in a burst pattern is more suited for 
> LVM cache as it gives the cache time to flush its dirty data without 
> interfering with client traffic. For a sustained low traffic workload, it is 
> difficult to guess in advance whether using LVM cache will improve 
> performance. The best test is to benchmark and compare the LVM cache setup 
> against the WAL/DB setup. Moreover, as small writes are heavy on the WAL 
> partition, it is suggested to use the fast device for the DB and/or WAL 
> instead of an LVM cache.
>
>
> So it sounds like you could partition your NVMe for either LVM-cache, DB/WAL, 
> or both?
>
> Just figured this sounded a bit more akin to what you were looking for in 
> your original post and figured I would share.
>
> I don't use this, but figured I would share it.
>
> Reed
>
> On Apr 4, 2020, at 9:12 AM, jes...@krogh.cc wrote:
>
> Hi.
>
> We have a need for "bulk" storage - but with decent write latencies.
> Normally we would do this with a DAS with a Raid5 with 2GB Battery
> backed write cache in front - As cheap as possible but still getting the
> features of scalability of ceph.
>
> In our "first" ceph cluster we did the same - just stuffed in BBWC
> in the OSD nodes and we're fine - but now we're onto the next one and
> systems like:
> https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
> Does not support a Raid controller like that - but is branded as for "Ceph
> Storage Solutions".
>
> It do however support 4 NVMe slots in the front - So - some level of
> "tiering" using the NVMe drives should be what is "suggested" - but what
> do people do? What is recommeneded. I see multiple options:
>
> Ceph tiering at the "pool - layer":
> https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> And rumors that it is "deprectated:
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html/release_notes/deprecated_functionality
>
> Pro: Abstract layer
> Con: Deprecated? - Lots of warnings?
>
> Offloading the block.db on NVMe / SSD:
> https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
>
> Pro: Easy to deal with - seem heavily supported.
> Con: As far as I can tell - this will only benefit the metadata of the
> osd- not actual data. Thus a data-commit to the osd til still be dominated
> by the writelatency of the underlying - very slow HDD.
>
> Bcache:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
>
> Pro: Closest to the BBWC mentioned above - but with way-way larger cache
> sizes.
> Con: It is hard to see if I end up being the

[ceph-users] ceph-df free discrepancy

2020-04-10 Thread Reed Dier
Hopefully someone can sanity check me here, but I'm getting the feeling that 
the MAX AVAIL in ceph df isn't reporting the correct value in 14.2.8 
(mon/mgr/mds are .8, most OSDs are .7)

> RAW STORAGE:
>     CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
>     hdd    530 TiB  163 TiB  366 TiB   367 TiB      69.19
>     ssd    107 TiB   37 TiB   70 TiB    70 TiB      65.33
>     TOTAL  637 TiB  201 TiB  436 TiB   437 TiB      68.54
> 
> POOLS:
>     POOL                   ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
>     fs-metadata            16   44 GiB    4.16M   44 GiB   0.25    5.6 TiB
>     cephfs-hdd-3x          17   46 TiB  109.54M  144 TiB  61.81     30 TiB
>     objects-hybrid         20   46 TiB  537.08M  187 TiB  91.71    5.6 TiB
>     objects-hdd            24  224 GiB   50.81k  676 GiB   0.74     30 TiB
>     rbd-hybrid             29  3.8 TiB    1.19M   11 TiB  40.38    5.6 TiB
>     device_health_metrics  33  270 MiB      327  270 MiB      0     30 TiB
>     rbd-ssd                34  4.2 TiB    1.19M   12 TiB  41.55    5.6 TiB
>     cephfs-hdd-ec73        37   42 TiB   30.35M   72 TiB  44.78     62 TiB

I have a few pools for which the available storage doesn't appear to be
calculated correctly.

Specifically, any of my hybrid pools (20,29) or all-SSD pools (16,34).

For my hybrid pools, I have a crush rule of take 1 of host in the ssd root, 
take -1 chassis in the hdd root.
For my ssd pools, I have a crush rule of take 0 of host in the ssd root.
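
For context, a sketch of what such a hybrid rule might look like in decompiled
crushmap syntax - the rule name, id and root names are illustrative, not dumped
from this cluster:

  rule hybrid-ssd-hdd {
      id 20
      type replicated
      step take ssd
      step chooseleaf firstn 1 type host
      step emit
      step take hdd
      step chooseleaf firstn -1 type chassis
      step emit
  }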

Now I have 60 ssd osds, 1.92T each, and sadly distribution is imperfect (leaving 
those issues out of this): I have plenty of underfull and overfull osds, 
which I am trying to manually reweight to get the most-full ones down and free up 
space:
> $ ceph osd df class ssd | sort -k17
> ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA     OMAP    META    AVAIL   %USE  VAR  PGS STATUS
> MIN/MAX VAR: 0.80/1.21  STDDEV: 5.56
>                            TOTAL   107 TiB  70 TiB  68 TiB 163 GiB 431 GiB  37 TiB 65.33
>  28   ssd 1.77879  1.0     1.8 TiB 951 GiB  916 GiB 7.1 GiB 5.8 GiB 871 GiB 52.20 0.80  68 up
>  33   ssd 1.77879  1.0     1.8 TiB 1.0 TiB 1010 GiB 6.0 MiB 5.9 GiB 777 GiB 57.33 0.88  74 up
>  47   ssd 1.77879  1.0     1.8 TiB 1.0 TiB 1011 GiB 6.7 GiB 6.4 GiB 776 GiB 57.38 0.88  75 up
> [SNIP]
>  57   ssd 1.77879  0.98000 1.8 TiB 1.4 TiB  1.3 TiB 6.2 GiB 8.6 GiB 417 GiB 77.08 1.18 102 up
> 107   ssd 1.80429  1.0     1.8 TiB 1.4 TiB  1.3 TiB 7.0 GiB 8.7 GiB 422 GiB 77.15 1.18 102 up
>  50   ssd 1.77879  1.0     1.8 TiB 1.4 TiB  1.4 TiB 5.5 MiB 8.6 GiB 381 GiB 79.10 1.21 105 up
>  60   ssd 1.77879  0.92000 1.8 TiB 1.4 TiB  1.4 TiB 6.2 MiB 9.0 GiB 379 GiB 79.17 1.21 105 up

That said, as a straw-man argument: ~380GiB free, times 60 OSDs, should be 
~22.8TiB free if all OSDs grew evenly (which they won't) - still far 
short of the 37TiB raw free, as expected.
However, what doesn't track is the 5.6TiB available at the pool level, even 
for a 3x replicated pool (5.6*3 = 16.8TiB), which is about 34% less than my napkin 
math of 22.8/3 = 7.6TiB.
But what tracks even less is the hybrid pools, which consume only 1/3 of what the 
3x-replicated data does.
Meaning if my napkin math is right, they should show ~22.8TiB free.

Am I grossly misunderstanding how this is calculated?
Maybe this is fixed in Octopus?

Just trying to get a grasp on what I'm seeing not matching expectations.

Thanks,

Reed

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-df free discrepancy

2020-04-10 Thread Paul Emmerich
On Sat, Apr 11, 2020 at 12:43 AM Reed Dier  wrote:
> That said, as a straw man argument, ~380GiB free, times 60 OSDs, should be 
> ~22.8TiB free, if all OSD's grew evenly, which they won't

Yes, that's the problem. They won't grow evenly. The fullest one will
grow faster than the others. Also, your full-ratio is probably 95%, not
100%.
So it'll be full as soon as OSD 70 takes another ~360 GB of data. But
the others will take less than 360 GB each because of the bad
balancing. For example, OSD 28 will only get around 233 GB of data by
the time OSD 70 has 360 GB.
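
A back-of-the-envelope check of where a number like ~233 GB can come from, assuming
new writes land roughly in proportion to PG count (osd.28 holds 68 PGs versus 105 on
the fullest OSDs in the listing above):

  echo "scale=0; 360 * 68 / 105" | bc    # ~233 GB on osd.28 while the fullest OSD takes 360 GB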



Paul

> , which is still far short of 37TiB raw free, as expected.
> However, what doesn't track is the 5.6TiB available at the pools level, even 
> for a 3x replicated pool (5.6*3=16.8TiB, which is about 34% less than my 
> napkin math, which would be 22.8/3=7.6TiB.
> But what tracks even less is the hybrid pools, which use 1/3 of what the 
> 3x-replicated data consumes.
> Meaning if my napkin math is right, should show ~22.8TiB free.
>
> Am I grossly mis-understanding how this is calculated?
> Maybe this is fixed in Octopus?
>
> Just trying to get a grasp on what I'm seeing not matching expectations.
>
> Thanks,
>
> Reed
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-df free discrepancy

2020-04-10 Thread Reed Dier
That definitely makes sense.

However, shouldn't a hybrid pool have 3x the available space of an all-SSD pool?
There is plenty of rust behind it, which won't impede its ability to satisfy 
all 3 replicas.

Example: let's say I write 5.6TiB (the current max avail):
to a hybrid pool, that's 5.6TiB to ssd osds and 11.2TiB to hdd osds;
to an all-ssd pool, that's 16.8TiB written to ssd osds.
Those are vastly different amounts landing on the ssd osds, obviously, so the 
max avail feels misleading, at least for these hybrid pools, which are 
admittedly less common for ceph, I imagine.

But it's worth noting I'm just using crush roots to point my crush rules at, and 
not actually using the device class, although it is set properly.
And I imagine that if someone had an oddly specific crush ruleset to direct pg 
distribution, they too could see weird (possibly misleading) results like this.
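
If it helps anyone reproduce this, the pool-to-rule mapping can be inspected with
something like the following (the pool name is from the ceph df output above; the
rule name is just a placeholder):

  ceph osd pool get objects-hybrid crush_rule
  ceph osd crush rule dump hybrid-ssd-hdd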

Reed

> On Apr 10, 2020, at 5:55 PM, Paul Emmerich  wrote:
> 
> On Sat, Apr 11, 2020 at 12:43 AM Reed Dier  wrote:
>> That said, as a straw man argument, ~380GiB free, times 60 OSDs, should be 
>> ~22.8TiB free, if all OSD's grew evenly, which they won't
> 
> Yes, that's the problem. They won't grow evenly. The fullest one will
> grow faster than the others. Also, your full-ratio is probably 95% not
> 100%.
> So it'll be full as soon as OSD 70 takes another ~360 GB of data. But
> the others won't take 360 GB of data but less because of the bad
> balancing. For example, OSD 28 will only get around 233 GB of data by
> the time OSD 70 has 360 GB.
> 
> 
> 
> Paul
> 
>> , which is still far short of 37TiB raw free, as expected.
>> However, what doesn't track is the 5.6TiB available at the pools level, even 
>> for a 3x replicated pool (5.6*3=16.8TiB, which is about 34% less than my 
>> napkin math, which would be 22.8/3=7.6TiB.
>> But what tracks even less is the hybrid pools, which use 1/3 of what the 
>> 3x-replicated data consumes.
>> Meaning if my napkin math is right, should show ~22.8TiB free.
>> 
>> Am I grossly mis-understanding how this is calculated?
>> Maybe this is fixed in Octopus?
>> 
>> Just trying to get a grasp on what I'm seeing not matching expectations.
>> 
>> Thanks,
>> 
>> Reed
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io