[ceph-users] NFS version 4.0

2021-02-04 Thread Jens Hyllegaard (Soft Design A/S)
Hi.

We are trying to set up an NFS server using ceph which needs to be accessed by 
an IBM System i.
As far as I can tell the IBM System i only supports nfs v. 4.
Looking at the nfs-ganesha deployments it seems that these only support 4.1 or 
4.2. I have tried editing the configuration file to support 4.0 and it seems to 
work.
Is there a reason that it currently only supports 4.1 and 4.2?

I can of course edit the configuration file, but I would have to do that after 
any deployment or upgrade of the nfs servers.
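
(For reference, a hedged sketch of the kind of edit meant here; the block name and
its exact placement in the cephadm-generated ganesha.conf are assumptions to check
against the nfs-ganesha documentation:)

```
NFSv4 {
    # allow NFSv4.0 clients in addition to the default 4.1/4.2
    Minor_Versions = 0, 1, 2;
}
```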

Regards

Jens Hyllegaard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Mario Giammarco
Hi Federico,
here I am not mixing raid1 with ceph; I am making a comparison: is it safer
to have a server with raid1 disks, or two servers with ceph and size=2
min_size=1?
We are talking about real-world examples where a customer is buying a new
server and wants to choose.

On Thu, 4 Feb 2021 at 05:52, Federico Lucifredi <feder...@redhat.com> wrote:

> Ciao Mario,
>
>
>> That is obvious and a bit paranoid, because many servers at many customers
>> run on raid1, so you are effectively saying: you have two copies of the data
>> but both can break. Consider that in ceph recovery is automatic, while with
>> raid1 someone must manually go to the customer and change disks. So ceph is
>> already an improvement in this case even with size=2. With size 3 and min 2
>> it is a bigger improvement, I know.
>>
>
> Generally speaking, users running Ceph at any scale do not use RAID to
> mirror their drives. They rely on data resiliency as delivered by Ceph
> (three replicas on HDD, two replicas on solid state media).
>
> It is expensive to run RAID underneath Ceph, and in some cases even
> counter-productive. We do use RAID controllers whenever we can because they
> are battery-backed and ensure writes hit the local disk even on a power
> failure, but that is (ideally) the only case where you hear the words RAID
> and Ceph together.
>
>  -- "'Problem' is a bleak word for challenge" - Richard Fish
> _
> Federico Lucifredi
> Product Management Director, Ceph Storage Platform
> Red Hat
> A273 4F57 58C0 7FE8 838D 4F87 AEEB EC18 4A73 88AC
> redhat.com   TRIED. TESTED. TRUSTED.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Mario Giammarco
On Thu, 4 Feb 2021 at 00:33, Simon Ironside <sirons...@caffetine.org> wrote:

>
>
> On 03/02/2021 19:48, Mario Giammarco wrote:
>
> To labour Dan's point a bit further, maybe a RAID5/6 analogy is better
> than RAID1. Yes, I know we're not talking erasure coding pools here but
> this is similar to the reasons why people moved from RAID5 (size=2, kind
> of) to RAID6 (size=3, kind of). I.e. the more disks you have in an array
> (cluster, in our case) and the bigger those disks are, the greater the
> chance you have of encountering a second problem during a recovery.
>
Yes, I know the motivations for raid6, but to simplify the use case I am
comparing ceph size=2 to raid1.


> > What I ask is this: what happens with min_size=1 and split brain,
> > network down or similar things: does ceph block writes because it has no
> > quorum on monitors? Are there some failure scenarios that I have not
> > considered?
>
> It sounds like in your example you would have 3 physical servers in
> total. So would you have both a monitor and OSDs processes on each server?
>
>
Yes sorry if it was not clear:
- three servers
- three monitors
- three managers
- 6 osd (two disks per server)


> If so, it's not really related to min_size=1 but to answer your question
> you could lose one monitor and the cluster would continue. Losing a
> second monitor will stop your cluster until this is resolved. In your
> example setup (with colocated mons & OSDs) this would presumably also
> mean you'd have lost two OSD servers too, so you'd have bigger problems.
>
>
Losing the switch means the monitors are up but cannot communicate with each
other, so they should stop?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Mario Giammarco
On Wed, 3 Feb 2021 at 21:22, Dan van der Ster wrote:

>
> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if
> possible.
>
>
I will investigate; I heard that erasure coding is slow.

Anyway, I will write here the reason for this thread:
At my customers I usually have proxmox+ceph with:

- three servers
- three monitors
- 6 osd (two per server)
- size=3 and min_size=2

I followed the recommendations to stay safe.
But one day one disk of one server broke; the OSDs were at 55%.
What happened then?
Ceph started filling the remaining OSD to maintain size=3.
An OSD reached 90% and ceph stopped everything.
Customer VMs froze and the customer lost time and some data that had not been
written to disk.

So I got angry: size=3 and the customer still loses time and data?






> Cheers, Dan
>
> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco  wrote:
>
>> Thanks Simon, and thanks to the other people who have replied.
>> Sorry, let me try to explain myself better.
>> It is evident to me that if I have two copies of data, one breaks, and
>> while ceph is creating a new copy of the data the disk with the second
>> copy also breaks, then I lose the data.
>> That is obvious and a bit paranoid, because many servers at many customers
>> run on raid1, so you are effectively saying: you have two copies of the
>> data but both can break. Consider that in ceph recovery is automatic,
>> while with raid1 someone must manually go to the customer and change
>> disks. So ceph is already an improvement in this case even with size=2.
>> With size 3 and min 2 it is a bigger improvement, I know.
>>
>> What I ask is this: what happens with min_size=1 and split brain, network
>> down or similar things: does ceph block writes because it has no quorum on
>> monitors? Are there some failure scenarios that I have not considered?
>> Thanks again!
>> Mario
>>
>>
>>
>> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <sirons...@caffetine.org> wrote:
>>
>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>> > > Hello,
>> > > Imagine this situation:
>> > > - 3 servers with ceph
>> > > - a pool with size 2 min 1
>> > >
>> > > I know perfectly well that size 3 and min 2 is better.
>> > > I would like to know what is the worst thing that can happen:
>> >
>> > Hi Mario,
>> >
>> > This thread is worth a read, it's an oldie but a goodie:
>> >
>> >
>> >
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>> >
>> > Especially this post, which helped me understand the importance of
>> > min_size=2
>> >
>> >
>> >
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
>> >
>> > Cheers,
>> > Simon
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Alexander E. Patrakov
There is a big difference between traditional RAID1 and Ceph. Namely, with
Ceph, there are nodes where OSDs are running, and these nodes need
maintenance. You want to be able to perform maintenance even if you have
one broken OSD, that's why the recommendation is to have three copies with
Ceph. There is no such "maintenance" consideration with traditional RAID1,
so two copies are OK there.

On Thu, 4 Feb 2021 at 00:49, Mario Giammarco wrote:

> Thanks Simon, and thanks to the other people who have replied.
> Sorry, let me try to explain myself better.
> It is evident to me that if I have two copies of data, one breaks, and while
> ceph is creating a new copy of the data the disk with the second copy also
> breaks, then I lose the data.
> That is obvious and a bit paranoid, because many servers at many customers run
> on raid1, so you are effectively saying: you have two copies of the data but
> both can break. Consider that in ceph recovery is automatic, while with raid1
> someone must manually go to the customer and change disks. So ceph is
> already an improvement in this case even with size=2. With size 3 and min 2
> it is a bigger improvement, I know.
>
> What I ask is this: what happens with min_size=1 and split brain, network
> down or similar things: does ceph block writes because it has no quorum on
> monitors? Are there some failure scenarios that I have not considered?
> Thanks again!
> Mario
>
>
>
> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <sirons...@caffetine.org> wrote:
>
> > On 03/02/2021 09:24, Mario Giammarco wrote:
> > > Hello,
> > > Imagine this situation:
> > > - 3 servers with ceph
> > > - a pool with size 2 min 1
> > >
> > > I know perfectly well that size 3 and min 2 is better.
> > > I would like to know what is the worst thing that can happen:
> >
> > Hi Mario,
> >
> > This thread is worth a read, it's an oldie but a goodie:
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
> >
> > Especially this post, which helped me understand the importance of
> > min_size=2
> >
> >
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
> >
> > Cheers,
> > Simon
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Dan van der Ster
On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco  wrote:
>
>
>
> On Wed, 3 Feb 2021 at 21:22, Dan van der Ster wrote:
>>
>>
>> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if 
>> possible.
>>
>
> I will investigate; I heard that erasure coding is slow.
>
> Anyway, I will write here the reason for this thread:
> At my customers I usually have proxmox+ceph with:
>
> - three servers
> - three monitors
> - 6 osd (two per server)
> - size=3 and min_size=2
>
> I followed the recommendations to stay safe.
> But one day one disk of one server broke; the OSDs were at 55%.
> What happened then?
> Ceph started filling the remaining OSD to maintain size=3.
> An OSD reached 90% and ceph stopped everything.
> Customer VMs froze and the customer lost time and some data that had not
> been written to disk.
>
> So I got angry: size=3 and the customer still loses time and data?

You should size the osd fullness config in such a way that the failures you
expect would still leave sufficient capacity.
In our case, we plan so that we could lose and re-replicate an entire
rack and still have enough space left. -- (IOW, with 5-6 racks, we
start to add capacity when the clusters reach ~70-75% full)

In your case, the issue is more extreme:
Because you have 3 hosts, 2 osds each, and 3 replicas: when one OSD
fails and is marked out, you are telling ceph that *all* of the
objects will need to be written to the last remaining disk on that
host with the failure.
So unless your cluster was under 40-50% used, that osd is going to
become overfull. (But BTW, ceph will get backfillfull on the loaded
OSD before stopping IO -- this should not have blocked your user
unless they *also* filled the disk with new data at the same time).
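
(For anyone hitting the same wall, a minimal sketch of how to check and, if needed,
temporarily relax those thresholds; the ratio values below are only examples:)

```
ceph osd df                        # per-OSD utilisation
ceph osd dump | grep full_ratio    # current full / backfillfull / nearfull ratios
ceph osd set-nearfull-ratio 0.70
ceph osd set-backfillfull-ratio 0.85
ceph osd set-full-ratio 0.92
```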

IMO with a cluster this size, you should not ever mark out any OSDs --
rather, you should leave the PGs degraded, replace the disk (keep the
same OSD ID), then recover those objects to the new disk.
Or, keep it <40% used (which sounds like a waste).
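
(A hedged sketch of that "replace the disk, keep the OSD ID" workflow; the OSD id
and device path are placeholders, and a cephadm or proxmox deployment may wrap
these steps differently:)

```
ceph osd set noout                                  # don't let the cluster mark OSDs out
ceph osd destroy 3 --yes-i-really-mean-it           # 3 = the failed OSD's id (placeholder)
ceph-volume lvm create --osd-id 3 --data /dev/sdX   # recreate on the new disk, reusing the id
ceph osd unset noout                                # only the degraded PGs then recover
```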

-- dan





>
>
>
>
>
>>
>> Cheers, Dan
>>
>> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco  wrote:
>>>
>>> Thanks Simon and thanks to other people that have replied.
>>> Sorry but I try to explain myself better.
>>> It is evident to me that if I have two copies of data, one brokes and while
>>> ceph creates again a new copy of the data also the disk with the second
>>> copy brokes you lose the data.
>>> It is obvious and a bit paranoid because many servers on many customers run
>>> on raid1 and so you are saying: yeah you have two copies of the data but
>>> you can broke both. Consider that in ceph recovery is automatic, with raid1
>>> some one must manually go to the customer and change disks. So ceph is
>>> already an improvement in this case even with size=2. With size 3 and min 2
>>> it is a bigger improvement I know.
>>>
>>> What I ask is this: what happens with min_size=1 and split brain, network
>>> down or similar things: do ceph block writes because it has no quorum on
>>> monitors? Are there some failure scenarios that I have not considered?
>>> Thanks again!
>>> Mario
>>>
>>>
>>>
>>> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <sirons...@caffetine.org> wrote:
>>>
>>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>>> > > Hello,
>>> > > Imagine this situation:
>>> > > - 3 servers with ceph
>>> > > - a pool with size 2 min 1
>>> > >
>>> > > I know perfectly well that size 3 and min 2 is better.
>>> > > I would like to know what is the worst thing that can happen:
>>> >
>>> > Hi Mario,
>>> >
>>> > This thread is worth a read, it's an oldie but a goodie:
>>> >
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>> >
>>> > Especially this post, which helped me understand the importance of
>>> > min_size=2
>>> >
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
>>> >
>>> > Cheers,
>>> > Simon
>>> > ___
>>> > ceph-users mailing list -- ceph-users@ceph.io
>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>> >
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mon db high iops

2021-02-04 Thread Seena Fallah
This is my osdmap commit diff:
report 4231583130
"osdmap_first_committed": 300814,
"osdmap_last_committed": 304062,

My disk latency is 25ms because of the high block size that rocksdb is
using.
Should I provide a higher-performance disk than the one I'm using for my
monitor nodes?
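
(The gap above is 304062 - 300814 = 3248 osdmap epochs; as far as I know the mons
only trim old osdmaps once the PGs are clean again, so the store keeps growing while
the cluster recovers. A hedged way to keep an eye on it, assuming those fields sit at
the top level of the report:)

```
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
```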

On Thu, Feb 4, 2021 at 3:09 AM Seena Fallah  wrote:

> Hi all,
>
> My monitor nodes keep going down and coming back up because of paxos lease
> timeouts, and there are high iops (2k iops) and 500MB/s of throughput on
> /var/lib/ceph/mon/ceph.../store.db/.
> My cluster is in a recovery state and there is a bunch of degraded pgs on
> my cluster.
>
> It seems it's doing a 200k block size io on rocksdb. Is that okay?!
> Also is there any solution to fix these downtimes for monitors?
>
> Thanks for your help!
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using RBD to pack billions of small files

2021-02-04 Thread Lionel Bouton
Hi,

On 04/02/2021 at 08:41, Loïc Dachary wrote:
> Hi Frederico,
>
> On 04/02/2021 05:51, Federico Lucifredi wrote:
>> Hi Loïc,
>>    I am intrigued, but am missing something: why not using RGW, and store 
>> the source code files as objects? RGW has native compression and can take 
>> care of that behind the scenes.
> Excellent question!
>>    Is the desire to use RBD only due to minimum allocation sizes?
> I *assume* that since RGW does have

If I understand correctly I assume that you are missing a "not" here.

>  specific strategies to take advantage of the fact that objects are immutable 
> and will never be removed:
>
> * It will be slower to add artifacts in RGW than in an RBD image + index
> * The metadata in RGW will be larger than an RBD image + index
>
> However I have not verified this and if you have an opinion I'd love to hear 
> it :-)

Reading the exchanges I believe you are focused on the reading speed and
space efficiency. Did you consider the writing speed with such a scheme?

Depending on how you store the index, you could block on each write and
would have to consider Ceph latency (i.e. if your writer fails, recovery
can be tricky unless you wait for each write before updating your index).
With your 100TB target and 3kB artifact size, a 1ms latency and blocking
writes translate into a whole year spent writing. If you manage to get
a 0.1ms latency (not sure if this is achievable with Ceph yet) you end up
with a month. Depending on how you plan to populate the store this could
be a problem. You'll also have to consider whether the artifact write rate
can become a bottleneck during normal use.
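
(Rough arithmetic behind that: 100 TB / 3 kB ≈ 3.3e10 artifacts; at 1 ms per
blocking write that is ≈ 3.3e7 seconds, i.e. a bit over a year of purely serial
writing, and ≈ 38 days at 0.1 ms.)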

You can probably design a scheme that stores multiple artifacts in a single
write, but that adds complexity which might bring its own performance
problems and space overhead.

I'm not familiar with space efficiency on modern Ceph versions (still
using filestore on Hammer...); do you have a ballpark estimate of the
cost of storing artifacts as simple objects? Unless you have already worked
out the whole design, that would be my first concern: it could end up
being an inefficiency worth trading off for simplicity.

I'm unfamiliar with the gateway and how well and easily it can scale so
my first impulse was to bypass RGW to use the librados interface
directly. You can definitely begin with a RGW solution as it is a bit
easier to implement and switch to librados later if RGW ever becomes a
bottleneck. If you need speed either writing or reading, both RGW and
librados would work: you can have as many clients managing objects in
parallel, without any write locks on your end to manage. This is a
very simple storage design and simplicity can't be overrated :-)
The only potential downside (in addition to space inefficiency) that I
can see would be walking the list of objects. This is doable but with
billions of them this could be very slow. Not sure if it could become a
need given your use case though.

For reference, I just found the results of a test with a moderately
comparable test set:
https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond.
I didn't finish reading it yet but the volume seems comparable to your
use case although with 64kB objects.

Note: I've seen questions about 100TB RBDs in the thread. We use such
beasts in two clusters: they work fine but are a pain when deleting or
downsizing them. During one downsize on the slowest cluster we had to
pause the operation manually (SIGSTOP to the rbd process) during periods
of high loads and let it continue after. This took about a week (but the
cluster was admittedly underpowered for its use at the time).
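
(For reference, the pause/resume was plain SIGSTOP/SIGCONT to the client process,
roughly as below; matching the process by name is just an illustration:)

```
kill -STOP $(pgrep -f 'rbd resize')   # pause the long-running shrink
kill -CONT $(pgrep -f 'rbd resize')   # let it continue later
```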

Best regards,

--
Lionel Bouton
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Robert Sander
Hi,

Am 04.02.21 um 12:10 schrieb Frank Schilder:

> Going to 2+2 EC will not really help

On such a small cluster you cannot even use EC because there are not
enough independent hosts. As a rule of thumb there should be k+m+1 hosts
in a cluster AFAIK.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] NVMe and 2x Replica

2021-02-04 Thread Adam Boyhan
I know there are already a few threads about 2x replication but I wanted to
start one dedicated to discussion on NVMe. There are some older threads, but 
nothing recent that addresses how the vendors are now pushing the idea of 2x. 

We are in the process of considering Ceph to replace our Nimble setup. We will 
have two completely separate clusters at two different sites that we are using 
rbd-mirror snapshot replication. The plan would be to run 2x replication on 
each cluster. 3x is still an option, but for obvious reasons 2x is enticing. 

Both clusters will be spot on to the super micro example in the white paper 
below. 

It seems all the big vendors feel 2x is safe with NVMe but I get the feeling 
this community feels otherwise. Trying to wrap my head around where the
disconnect is between the big players and the community. I could be missing 
something, but even our Supermicro contact that we worked the config out with 
was in agreement with 2x on NVMe. 

Appreciate the input! 

[ https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf | 
https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf ] 

[ 
https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
 ] 
[ 
https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
 | 
https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
 ] 

[ 
https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
 | 
https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
 ] 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replace OSD without PG remapping

2021-02-04 Thread Frank Schilder
Hi Tony,

OK, I understand better now as well. I was really wondering why you wanted to 
avoid the self-healing. It's the main reason for using ceph :)

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Tony Liu 
Sent: 04 February 2021 02:41:57
To: Frank Schilder; ceph-users@ceph.io
Subject: RE: replace OSD without PG remapping

Thank you Frank, "degradation is exactly what needs to be
avoided/fixed at all cost", clear and loud, point is taken!
I didn't actually quite get it last time. I used to think
degradation would be OK, but now, I agree with you, that is
not OK at all for production storage.
Appreciate your patience!

Tony
> -Original Message-
> From: Frank Schilder 
> Sent: Tuesday, February 2, 2021 11:47 PM
> To: Tony Liu ; ceph-users@ceph.io
> Subject: Re: replace OSD without PG remapping
>
> You asked about exactly this before:
> https://lists.ceph.io/hyperkitty/list/ceph-
> us...@ceph.io/thread/IGYCYJTAMBDDOD2AQUCJQ6VSUWIO4ELW/#ZJU3555Z5WQTJDPCT
> MPZ6LOFTIUKKQUS
>
> It is not possible to avoid remapping, because if the PGs are not
> remapped you would have degraded redundancy. In any storage system, this
> degradation is exactly what needs to be avoided/fixed at all cost.
>
> I don't see an issue with health status messages issued by self-healing.
> That's the whole point of ceph, just let it do its job and don't get
> freaked out by health_warn.
>
> You can, however try to keep the window of rebalancing short and this is
> exactly what was discussed in the thread above already. As is pointed
> out there as well, even this is close to pointless. Just deploy a few
> more disks than you need, let the broken ones go and be happy that ceph
> is taking care of the rest and even tells you about its progress.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Tony Liu 
> Sent: 03 February 2021 03:10:26
> To: ceph-users@ceph.io
> Subject: [ceph-users] replace OSD without PG remapping
>
> Hi,
>
> There are multiple different procedures to replace an OSD.
> What I want is to replace an OSD without PG remapping.
>
> #1
> I tried "orch osd rm --replace", which sets OSD reweight 0 and status
> "destroyed". "orch osd rm status" shows "draining".
> All PGs on this OSD are remapped. Checked "pg dump", can't find this OSD
> any more.
>
> 1) Given [1], setting weight 0 seems better than setting reweight 0.
> Is that right? If yes, should we change the behavior of "orch osd rm --
> replace"?
>
> 2) "ceph status" doesn't show anything about OSD draining.
> Is there any way to see the progress of draining?
> Is there actually copy happening? The PG on this OSD is remapped and
> copied to another OSD, right?
>
> 3) When OSD is replaced, there will be remapping and backfilling.
>
> 4) There is remapping in #2 and remapping again in #3.
> I want to avoid it.
>
> #2
> Is there any procedure that doesn't mark OSD out (set reweight 0),
> neither set weight 0, which should keep PG map unchanged, but just warn
> about less redundancy (one out of 3 OSDs of PG is down), and when OSD is
> replaced, no remapping, just data backfilling?
>
> [1] https://ceph.com/geen-categorie/difference-between-ceph-osd-
> reweight-and-ceph-osd-crush-reweight/
>
>
> Thanks!
> Tony
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using RBD to pack billions of small files

2021-02-04 Thread Loïc Dachary


On 04/02/2021 12:08, Lionel Bouton wrote:
> Hi,
>
> On 04/02/2021 at 08:41, Loïc Dachary wrote:
>> Hi Frederico,
>>
>> On 04/02/2021 05:51, Federico Lucifredi wrote:
>>> Hi Loïc,
>>>    I am intrigued, but am missing something: why not using RGW, and store 
>>> the source code files as objects? RGW has native compression and can take 
>>> care of that behind the scenes.
>> Excellent question!
>>>    Is the desire to use RBD only due to minimum allocation sizes?
>> I *assume* that since RGW does have
> If I understand correctly I assume that you are missing a "not" here.
Yes :-)
>
>>  specific strategies to take advantage of the fact that objects are 
>> immutable and will never be removed:
>>
>> * It will be slower to add artifacts in RGW than in an RBD image + index
>> * The metadata in RGW will be larger than an RBD image + index
>>
>> However I have not verified this and if you have an opinion I'd love to hear 
>> it :-)
> Reading the exchanges I believe you are focused on the reading speed and
> space efficiency. Did you consider the writing speed with such a scheme ?
Yes and the goal is to achieve 100MB/s write speed.
>
> Depending on how you store the index, you could block on each write and
> would have to consider Ceph latency (ie: if your writer fails recovering
> can be tricky without having waited for writes to update your index).
> With your 100TB target and 3kb artifact size a 1ms latency and blocking
> writes translate to a whole year spent writing. If you manage to get to
> a 0.1ms latency (not sure if this is achievable with Ceph yet) you end
> with a month. Depending on how you plan to populate the store this could
> be a problem. You'll have to consider if the artifact writing rate limit
> can become a bottleneck during normal use too.

>
> You can probably design a scheme supporting storing multiple values in a
> single write but it seems to add complexity which might come with
> unwanted performance problems and space use itself.
>
> I'm not familiar with space efficiency on modern Ceph versions (still
> using filestore on Hammer...), do you have a ballpark estimation of the
> costs of storing artifacts as simple objects ? Unless you already worked
> out the whole design that would be my first concern : it could end up
> being an inefficiency worth the trade-off for simplicity.
I did not measure the overhead and I'm assuming it is significant
enough to justify RGW implemented packing.
>
> I'm unfamiliar with the gateway and how well and easily it can scale so
> my first impulse was to bypass RGW to use the librados interface
> directly. 
Using librados directly would work but the caller would have to implement
packing in the same way RBD or RGW does. It is a lot of work to do that
properly.
> You can definitely begin with a RGW solution as it is a bit
> easier to implement and switch to librados later if RGW ever becomes a
> bottleneck. If you need speed either writing or reading, both RGW and
> librados would work : you can have as many clients managing objects in
> parallel without any lock on writes on your end to manage. This is a
> very simple storage design and simplicity can't be overrated :-)
> The only potential downside (in addition to space inefficiency) that I
> can see would be walking the list of objects. This is doable but with
> billions of them this could be very slow. Not sure if it could become a
> need given your use case though.
I'll research more and try to figure out a way to compare write/read speed in 
both
cases.
> For reference, I just found the results of a test with a moderately
> comparable test set :
> https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond.
> I didn't finish reading it yet but the volume seems comparable to your
> use case although with 64kB objects.
That's a significant difference but the benchmark results are still
very useful.
> Note : I've seen questions about 100TB RBDs in the thread. We use such
> beasts in two clusters : they work fine but are a pain when deleting or
> downsizing them. During one downsize on the slowest cluster we had to
> pause the operation manually (SIGSTOP to the rbd process) during periods
> of high loads and let it continue after. This took about a week (but the
> cluster was admittedly underpowered for its use at the time).
Interesting! In this use case having a single RBD image does not
seem to be a good idea. Ceph is designed to scale out. But RBD images
are not designed to grow indefinitely. Having multiple 1TB images sounds like
a sane tradeoff.
>
> Best regards,
Thanks for taking the time to think about this use case :-)

Cheers
> --
> Lionel Bouton
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Loïc Dachary, Artisan Logiciel Libre




OpenPGP_signature
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-

[ceph-users] Re: NFS version 4.0

2021-02-04 Thread Daniel Gryniewicz
The preference for 4.1 and later is because 4.0 has a much less useful 
graceful restart (which is used for HA/failover as well).  Ganesha 
itself supports 4.0 perfectly fine, and it should work fine with Ceph, 
but HA setups will be much more difficult, and will be limited in 
functionality.


Daniel

On 2/4/21 3:27 AM, Jens Hyllegaard (Soft Design A/S) wrote:

Hi.

We are trying to set up an NFS server using ceph which needs to be accessed by 
an IBM System i.
As far as I can tell the IBM System i only supports nfs v. 4.
Looking at the nfs-ganesha deployments it seems that these only support 4.1 or 
4.2. I have tried editing the configuration file to support 4.0 and it seems to 
work.
Is there a reason that it currently only supports 4.1 and 4.2?

I can of course edit the configuration file, but I would have to do that after 
any deployment or upgrade of the nfs servers.

Regards

Jens Hyllegaard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-04 Thread Jens Hyllegaard (Soft Design A/S)
Hi.

I have the same situation. Running 15.2.8
I created a specification that looked just like it. With rotational in the data 
and non-rotational in the db.

First use applied fine. Afterwards it only uses the hdd, and not the ssd.
Also, is there a way to remove an unused osd service?
I managed to create osd.all-available-devices when I tried to stop the
autocreation of OSDs, using ceph orch apply osd --all-available-devices
--unmanaged=true
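
(A hedged guess at the cleanup, assuming `ceph orch rm` accepts the osd service name:)

```
ceph orch rm osd.all-available-devices
```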

I created the original OSD using the web interface.

Regards

Jens
-Original Message-
From: Eugen Block  
Sent: 3. februar 2021 11:40
To: Tony Liu 
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: db_devices doesn't show up in exported osd service 
spec

How do you manage the db_sizes of your SSDs? Is that managed automatically by 
ceph-volume? You could try to add another config and see what it does, maybe 
try to add block_db_size?


Zitat von Tony Liu :

> All mon, mgr, crash and osd are upgraded to 15.2.8. It actually fixed 
> another issue (no device listed after adding host).
> But this issue remains.
> ```
> # cat osd-spec.yaml
> service_type: osd
> service_id: osd-spec
> placement:
>   host_pattern: ceph-osd-[1-3]
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
>
> # ceph orch apply osd -i osd-spec.yaml Scheduled osd.osd-spec 
> update...
>
> # ceph orch ls --service_name osd.osd-spec --export
> service_type: osd
> service_id: osd-spec
> service_name: osd.osd-spec
> placement:
>   host_pattern: ceph-osd-[1-3]
> spec:
>   data_devices:
> rotational: 1
>   filter_logic: AND
>   objectstore: bluestore
> ```
> db_devices still doesn't show up.
> Keep scratching my head...
>
>
> Thanks!
> Tony
>> -Original Message-
>> From: Eugen Block 
>> Sent: Tuesday, February 2, 2021 2:20 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: db_devices doesn't show up in exported osd 
>> service spec
>>
>> Hi,
>>
>> I would recommend to update (again), here's my output from a 15.2.8 
>> test
>> cluster:
>>
>>
>> host1:~ # ceph orch ls --service_name osd.default --export
>> service_type: osd
>> service_id: default
>> service_name: osd.default
>> placement:
>>hosts:
>>- host4
>>- host3
>>- host1
>>- host2
>> spec:
>>block_db_size: 4G
>>data_devices:
>>  rotational: 1
>>  size: '20G:'
>>db_devices:
>>  size: '10G:'
>>filter_logic: AND
>>objectstore: bluestore
>>
>>
>> Regards,
>> Eugen
>>
>>
>> Zitat von Tony Liu :
>>
>> > Hi,
>> >
>> > When building the cluster initially with Octopus 15.2.5, here is the OSD
>> > service spec file that was applied.
>> > ```
>> > service_type: osd
>> > service_id: osd-spec
>> > placement:
>> >   host_pattern: ceph-osd-[1-3]
>> > data_devices:
>> >   rotational: 1
>> > db_devices:
>> >   rotational: 0
>> > ```
>> > After applying it, all HDDs were added and DB of each hdd is 
>> > created on SSD.
>> >
>> > Here is the export of OSD service spec.
>> > ```
>> > # ceph orch ls --service_name osd.osd-spec --export
>> > service_type: osd
>> > service_id: osd-spec
>> > service_name: osd.osd-spec
>> > placement:
>> >   host_pattern: ceph-osd-[1-3]
>> > spec:
>> >   data_devices:
>> > rotational: 1
>> >   filter_logic: AND
>> >   objectstore: bluestore
>> > ```
>> > Why db_devices doesn't show up there?
>> >
>> > When I replaced a disk recently, once the new disk was installed and
>> > zapped, the OSD was automatically re-created, but the DB was created on
>> > HDD, not SSD. I assume this is because of that missing db_devices?
>> >
>> > I tried to update service spec, the same result, db_devices doesn't 
>> > show up when export it.
>> >
>> > Is this some known issue or something I am missing?
>> >
>> >
>> > Thanks!
>> > Tony
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send 
>> > an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
>> email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Frank Schilder
> - three servers
> - three monitors
> - 6 osd (two per server)
> - size=3 and min_size=2

This is a set-up that I would not run at all. The first reason is that ceph lives
on the law of large numbers and 6 is a small number. Hence your OSD fill-up
due to uneven distribution.

What comes to my mind is a hyper-converged server with 6+ disks in a RAID10 
array, possibly with a good controller with battery-powered or other 
non-volatile cache. Ceph will never beat that performance. Put in some extra 
disks as hot-spare and you have close to self-healing storage.

Such a small ceph cluster will inherit all the baddies of ceph (performance, 
maintenance) without giving any of the goodies (scale-out, self-healing, proper 
distributed raid protection). Ceph needs size to become well-performing and pay 
off the maintenance and architectural effort.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mario Giammarco 
Sent: 04 February 2021 11:29:49
To: Dan van der Ster
Cc: Ceph Users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2

On Wed, 3 Feb 2021 at 21:22, Dan van der Ster wrote:

>
> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if
> possible.
>
>
I will investigate; I heard that erasure coding is slow.

Anyway, I will write here the reason for this thread:
At my customers I usually have proxmox+ceph with:

- three servers
- three monitors
- 6 osd (two per server)
- size=3 and min_size=2

I followed the recommendations to stay safe.
But one day one disk of one server broke; the OSDs were at 55%.
What happened then?
Ceph started filling the remaining OSD to maintain size=3.
An OSD reached 90% and ceph stopped everything.
Customer VMs froze and the customer lost time and some data that had not been
written to disk.

So I got angry: size=3 and the customer still loses time and data?






> Cheers, Dan
>
> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco  wrote:
>
>> Thanks Simon, and thanks to the other people who have replied.
>> Sorry, let me try to explain myself better.
>> It is evident to me that if I have two copies of data, one breaks, and
>> while ceph is creating a new copy of the data the disk with the second
>> copy also breaks, then I lose the data.
>> That is obvious and a bit paranoid, because many servers at many customers
>> run on raid1, so you are effectively saying: you have two copies of the
>> data but both can break. Consider that in ceph recovery is automatic,
>> while with raid1 someone must manually go to the customer and change
>> disks. So ceph is already an improvement in this case even with size=2.
>> With size 3 and min 2 it is a bigger improvement, I know.
>>
>> What I ask is this: what happens with min_size=1 and split brain, network
>> down or similar things: does ceph block writes because it has no quorum on
>> monitors? Are there some failure scenarios that I have not considered?
>> Thanks again!
>> Mario
>>
>>
>>
>> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <sirons...@caffetine.org> wrote:
>>
>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>> > > Hello,
>> > > Imagine this situation:
>> > > - 3 servers with ceph
>> > > - a pool with size 2 min 1
>> > >
>> > > I know perfectly well that size 3 and min 2 is better.
>> > > I would like to know what is the worst thing that can happen:
>> >
>> > Hi Mario,
>> >
>> > This thread is worth a read, it's an oldie but a goodie:
>> >
>> >
>> >
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>> >
>> > Especially this post, which helped me understand the importance of
>> > min_size=2
>> >
>> >
>> >
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
>> >
>> > Cheers,
>> > Simon
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Frank Schilder
> Because you have 3 hosts, 2 osds each, and 3 replicas: ...
> So unless your cluster was under 40-50% used, that osd is going to
> become overfull.

Yes, I overlooked this. With 2 disks per host, statistics are not yet at play here;
it's the deterministic case. To run it safely, you need at least 2*3=6 times the
raw storage capacity compared with the data stored. Going to 2+2 EC will not
really help, and size 2 min_size 1 will be a disaster in any case.
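(That is: 3x raw for the three replicas, times another 2x of headroom so that the
surviving OSD on a host can absorb its failed neighbour's share, which matches the
"under 40-50% used" figure quoted above.)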

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 04 February 2021 11:57:38
To: Mario Giammarco
Cc: Ceph Users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2

On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco  wrote:
>
>
>
> On Wed, 3 Feb 2021 at 21:22, Dan van der Ster wrote:
>>
>>
>> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if 
>> possible.
>>
>
> I will investigate; I heard that erasure coding is slow.
>
> Anyway, I will write here the reason for this thread:
> At my customers I usually have proxmox+ceph with:
>
> - three servers
> - three monitors
> - 6 osd (two per server)
> - size=3 and min_size=2
>
> I followed the recommendations to stay safe.
> But one day one disk of one server broke; the OSDs were at 55%.
> What happened then?
> Ceph started filling the remaining OSD to maintain size=3.
> An OSD reached 90% and ceph stopped everything.
> Customer VMs froze and the customer lost time and some data that had not
> been written to disk.
>
> So I got angry: size=3 and the customer still loses time and data?

You should size the osd fullness config in such a way that the failures you
expect would still leave sufficient capacity.
In our case, we plan so that we could lose and re-replicate an entire
rack and still have enough space left. -- (IOW, with 5-6 racks, we
start to add capacity when the clusters reach ~70-75% full)

In your case, the issue is more extreme:
Because you have 3 hosts, 2 osds each, and 3 replicas: when one OSD
fails and is marked out, you are telling ceph that *all* of the
objects will need to be written to the last remaining disk on that
host with the failure.
So unless your cluster was under 40-50% used, that osd is going to
become overfull. (But BTW, ceph will get backfillfull on the loaded
OSD before stopping IO -- this should not have blocked your user
unless they *also* filled the disk with new data at the same time).

IMO with a cluster this size, you should not ever mark out any OSDs --
rather, you should leave the PGs degraded, replace the disk (keep the
same OSD ID), then recover those objects to the new disk.
Or, keep it <40% used (which sounds like a waste).

-- dan





>
>
>
>
>
>>
>> Cheers, Dan
>>
>> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco  wrote:
>>>
>>> Thanks Simon, and thanks to the other people who have replied.
>>> Sorry, let me try to explain myself better.
>>> It is evident to me that if I have two copies of data, one breaks, and while
>>> ceph is creating a new copy of the data the disk with the second copy also
>>> breaks, then I lose the data.
>>> That is obvious and a bit paranoid, because many servers at many customers run
>>> on raid1, so you are effectively saying: you have two copies of the data but
>>> both can break. Consider that in ceph recovery is automatic, while with raid1
>>> someone must manually go to the customer and change disks. So ceph is
>>> already an improvement in this case even with size=2. With size 3 and min 2
>>> it is a bigger improvement, I know.
>>>
>>> What I ask is this: what happens with min_size=1 and split brain, network
>>> down or similar things: does ceph block writes because it has no quorum on
>>> monitors? Are there some failure scenarios that I have not considered?
>>> Thanks again!
>>> Mario
>>>
>>>
>>>
>>> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <sirons...@caffetine.org> wrote:
>>>
>>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>>> > > Hello,
>>> > > Imagine this situation:
>>> > > - 3 servers with ceph
>>> > > - a pool with size 2 min 1
>>> > >
>>> > > I know perfectly well that size 3 and min 2 is better.
>>> > > I would like to know what is the worst thing that can happen:
>>> >
>>> > Hi Mario,
>>> >
>>> > This thread is worth a read, it's an oldie but a goodie:
>>> >
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>> >
>>> > Especially this post, which helped me understand the importance of
>>> > min_size=2
>>> >
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
>>> >
>>> > Cheers,
>>> > Simon
>>> > ___
>>> > ceph-users mailing list -- ceph-users@ceph.io
>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>> >
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe se

[ceph-users] Re: Using RBD to pack billions of small files

2021-02-04 Thread Matthew Vernon

Hi,

On 04/02/2021 07:41, Loïc Dachary wrote:


On 04/02/2021 05:51, Federico Lucifredi wrote:

Hi Loïc,
    I am intrigued, but am missing something: why not using RGW, and store the 
source code files as objects? RGW has native compression and can take care of 
that behind the scenes.

Excellent question!


    Is the desire to use RBD only due to minimum allocation sizes?

I *assume* that since RGW does have specific strategies to take advantage of 
the fact that objects are immutable and will never be removed:

* It will be slower to add artifacts in RGW than in an RBD image + index
* The metadata in RGW will be larger than an RBD image + index


RGW addition is pretty quick up to fairly large buckets; and if you're 
not expecting to want to list the bucket contents often, then RGW might 
well be a good option for your object store with small files.


Or at least, using some of the RGW code (I think there's a librgw) to 
re-use a bunch of its code for your use case; this feels more natural to 
me than using RBD for this.


Regards,

Matthew
[pleased software heritage are still looking at Ceph :) ]


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Eneko Lacunza

Hi all,

On 4/2/21 at 11:56, Frank Schilder wrote:

- three servers
- three monitors
- 6 osd (two per server)
- size=3 and min_size=2

This is a set-up that I would not run at all. The first reason is that ceph lives
on the law of large numbers and 6 is a small number. Hence your OSD fill-up
due to uneven distribution.

What comes to my mind is a hyper-converged server with 6+ disks in a RAID10 
array, possibly with a good controller with battery-powered or other 
non-volatile cache. Ceph will never beat that performance. Put in some extra 
disks as hot-spare and you have close to self-healing storage.

Such a small ceph cluster will inherit all the baddies of ceph (performance, 
maintenance) without giving any of the goodies (scale-out, self-healing, proper 
distributed raid protection). Ceph needs size to become well-performing and pay 
off the maintenance and architectural effort.



It's funny that we have multiple clusters similar to this, and we and 
our customers couldn't be happier. Just use a HCI solution (like for 
example Proxmox VE, but there are others) to manage everything.


Maybe the weakest thing in that configuration is having 2 OSDs per node; 
osd nearfull must be tuned accordingly so that no OSD goes beyond about 
0.45, so that in case of failure of one disk, the other OSD in the node 
has enough space for healing replication.
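
(For reference, that threshold is the cluster-wide nearfull ratio; a minimal sketch,
using the 0.45 value mentioned above:)

```
ceph osd set-nearfull-ratio 0.45
```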


When deciding min_size, one has to balance availability (failure during 
maintenance of one node with min_size=2) vs risk of data loss (min_size=1).


Not everyone needs to max SSD disk IOPS; having a decent, HA setup can 
be of much value...


Cheers


--
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO/
https://www.linkedin.com/company/37269706/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread DHilsbos
Adam;

Earlier this week, another thread presented 3 white papers in support of 
running 2x on NVMe for Ceph.

I searched each to find the section where 2x was discussed.  What I found was 
interesting.  First, there are really only 2 positions here: Micron's and Red 
Hat's.  Supermicro copies Micron's position paragraph word for word.  Not
surprising considering that they are advertising a Supermicro / Micron solution.

This is Micron's statement:
" NVMe SSDs have high reliability with high MTBR and low bit error rate. 2x 
replication is recommended in production when deploying OSDs on NVMe versus the 
3x replication common with legacy storage."

This is Red Hat's statement:
" Given the better MTBF and MTTR of flash-based media, many Ceph customers have 
chosen to run 2x replications in
production when deploying OSDs on flash. This differs from magnetic media 
deployments, which typically use 3x replication."

Looking at these statements, these acronyms pop out at me: MTBR and MTTR.  MTBR 
is Mean Time Between Replacements, while MTTR is Mean Time Till Replacement.  
Essentially, this is saying that most companies replace these drives before
they have to worry about large numbers failing.

Regarding MTBF; I can't find any data to support Red Hat's assertion that MTBF 
is better for flash.  I looked at both Western Digital Gold, and Seagate Exos 
12 TB drives, and found they both list a MTBF of 2.5 million hours.  I was 
unable to find any information on the MTBF of Micron drives, but the MTBF of 
Kingston's DC1000B 240GB drive is 2 million hours.

Personally, this looks like marketing BS to me.  SSD shops want to sell SSDs, 
but because of the cost difference they have to convince buyers that their 
products are competitive.

Pitch is thus:
Our products cost twice as much, but LOOK you only need 2/3 as many, and you 
get all these other benefits (performance).  Plus, if you replace everything in 
2 or 3 years anyway, then you won't have to worry about them failing.

I'll address general concerns of 2x replication in another email.

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


-Original Message-
From: Adam Boyhan [mailto:ad...@medent.com] 
Sent: Thursday, February 4, 2021 4:38 AM
To: ceph-users
Subject: [ceph-users] NVMe and 2x Replica

I know there are already a few threads about 2x replication but I wanted to
start one dedicated to discussion on NVMe. There are some older threads, but 
nothing recent that addresses how the vendors are now pushing the idea of 2x. 

We are in the process of considering Ceph to replace our Nimble setup. We will 
have two completely separate clusters at two different sites that we are using 
rbd-mirror snapshot replication. The plan would be to run 2x replication on 
each cluster. 3x is still an option, but for obvious reasons 2x is enticing. 

Both clusters will be spot on to the super micro example in the white paper 
below. 

It seems all the big vendors feel 2x is safe with NVMe but I get the feeling 
this community feels otherwise. Trying to wrap my head around where the
disconnect is between the big players and the community. I could be missing 
something, but even our Supermicro contact that we worked the config out with 
was in agreement with 2x on NVMe. 

Appreciate the input! 

[ https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf | 
https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf ] 

[ 
https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
 ] 
[ 
https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
 | 
https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
 ] 

[ 
https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
 | 
https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
 ] 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Anthony D'Atri


> 
> Maybe the weakest thing in that configuration is having 2 OSDs per node; osd 
> nearfull must be tuned accordingly so that no OSD goes beyond about 0.45, so 
> that in case of failure of one disk, the other OSD in the node has enough 
> space for healing replication.
> 

A careful setting of mon_osd_down_out_subtree_limit can help in the situation 
of losing a whole node, though as you and others have noted, this topology has 
other challenges.
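
(A minimal sketch of that setting; "host" is just the example value here, the
default being "rack":)

```
ceph config set mon mon_osd_down_out_subtree_limit host
```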

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Anthony D'Atri


> I searched each to find the section where 2x was discussed.  What I found was 
> interesting.  First, there are really only 2 positions here: Micron's and Red 
> Hat's.  Supermicro copies Micron's positon paragraph word for word.  Not 
> surprising considering that they are advertising a Supermicro / Micron 
> solution.

FWIW, at Cephalocon another vendor made a similar claim during a talk.

* Failure rates are averages, not minima.  Some drives will always fail sooner
* Firmware and other design flaws can result in much higher rates of failure or 
insidious UREs that can result in partial data unavailability or loss
* Latent soft failures may not be detected until a deep scrub succeeds, which 
could be weeks later
* In a distributed system, there are up/down/failure scenarios where the 
location of even one good / canonical / latest copy of data is unclear, 
especially when drive or HBA cache is in play.
* One of these is a power failure.  Sure PDU / PSU redundancy helps, but stuff 
happens, like a DC underprovisioning amps, so that a spike in user traffic 
results in the whole row going down :-x  Various unpleasant things can happen.

I was championing R3 even pre-Ceph when I was using ZFS or HBA RAID.  As others 
have written, as drives get larger the time to fill them with replica data 
increases, as does the chance of overlapping failures.  I’ve experienced R2
overlapping failures more than once, with and before Ceph.

My sense has been that not many people run R2 for data they care about, and as 
has been written recently 2,2 EC is safer with the same raw:usable ratio.  I’ve 
figured that vendors make R2 statements like these as a selling point to assert 
lower TCO.  My first response is often “How much would it cost you directly, 
and indirectly in terms of user / customer goodwill, to lose data?”.

> Personally, this looks like marketing BS to me.  SSD shops want to sell SSDs, 
> but because of the cost difference they have to convince buyers that their 
> products are competitive.

^this.  I’m watching the QLC arena with interest for the potential to narrow 
the CapEx gap.  Durability has been one concern, though I’m seeing newer 
products claiming that eg. ZNS improves that.  It also seems that there are 
something like what, *4* separate EDSFF / ruler form factors, I really want to 
embrace those eg. for object clusters, but I’m VERY wary of the longevity of 
competing standards and any single-source for chassis or drives.

> Our products cost twice as much, but LOOK you only need 2/3 as many, and you 
> get all these other benefits (performance).  Plus, if you replace everything 
> in 2 or 3 years anyway, then you won't have to worry about them failing.

Refresh timelines. You’re funny ;)  Every time, every single time, that I’ve 
worked in an organization that claims a 3 (or 5, or whatever) year hardware 
refresh cycle, it hasn’t happened.  When you start getting close, the capex 
doesn’t materialize, and neither does the opex for DC hands and operational 
oversight.  “How do you know that the drives will start failing or getting 
slower?  Let’s revisit this in 6 months”.  Etc.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Adam Boyhan
All great input and points guys. 

Helps me lean towards 3 copies a bit more. 

I mean honestly NVMe cost per TB isn't that much more than SATA SSD now. 
Somewhat surprised the salesmen aren't pitching 3x replication as it makes them 
more money. 



From: "Anthony D'Atri"  
To: "ceph-users"  
Sent: Thursday, February 4, 2021 12:47:27 PM 
Subject: [ceph-users] Re: NVMe and 2x Replica 

> I searched each to find the section where 2x was discussed. What I found was 
> interesting. First, there are really only 2 positions here: Micron's and Red 
> Hat's. Supermicro copies Micron's positon paragraph word for word. Not 
> surprising considering that they are advertising a Supermicro / Micron 
> solution. 

FWIW, at Cephalocon another vendor made a similar claim during a talk. 

* Failure rates are averages, not minima. Some drives will always fail sooner 
* Firmware and other design flaws can result in much higher rates of failure or 
insidious UREs that can result in partial data unavailability or loss 
* Latent soft failures may not be detected until a deep scrub succeeds, which 
could be weeks later 
* In a distributed system, there are up/down/failure scenarios where the 
location of even one good / canonical / latest copy of data is unclear, 
especially when drive or HBA cache is in play. 
* One of these is a power failure. Sure PDU / PSU redundancy helps, but stuff 
happens, like a DC underprovisioning amps, so that a spike in user traffic 
results in the whole row going down :-x Various unpleasant things can happen. 

I was championing R3 even pre-Ceph when I was using ZFS or HBA RAID. As others 
have written, as drives get larger the time to fill them with replica data 
increases, as does the chance of overlapping failures. I’ve experieneced R2 
overlapping failures more than once, with and before Ceph. 

My sense has been that not many people run R2 for data they care about, and as 
has been written recently 2,2 EC is safer with the same raw:usable ratio. I’ve 
figured that vendors make R2 statements like these as a selling point to assert 
lower TCO. My first response is often “How much would it cost you directly, and 
indirectly in terms of user / customer goodwill, to loose data?”. 

> Personally, this looks like marketing BS to me. SSD shops want to sell SSDs, 
> but because of the cost difference they have to convince buyers that their 
> products are competitive. 

^this. I’m watching the QLC arena with interest for the potential to narrow the 
CapEx gap. Durability has been one concern, though I’m seeing newer products 
claiming that eg. ZNS improves that. It also seems that there are something 
like what, *4* separate EDSFF / ruler form factors, I really want to embrace 
those eg. for object clusters, but I’m VERY wary of the longevity of competing 
standards and any single-source for chassies or drives. 

> Our products cost twice as much, but LOOK you only need 2/3 as many, and you 
> get all these other benefits (performance). Plus, if you replace everything 
> in 2 or 3 years anyway, then you won't have to worry about them failing. 

Refresh timelines. You’re funny ;) Every time, every single time, that I’ve 
worked in an organization that claims a 3 (or 5, or whatever) hardware refresh 
cycle, it hasn’t happened. When you start getting close, the capex doesn’t 
materialize, or the opex cost of DC hands and operational oversight. “How do 
you know that the drives will start failing or getting slower? Let’s revisit 
this in 6 months”. Etc. 

___ 
ceph-users mailing list -- ceph-users@ceph.io 
To unsubscribe send an email to ceph-users-le...@ceph.io 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread DHilsbos
My impression is that cost / TB per drive may be approaching parity, but TB / 
drive is still well below (or, at densities approaching parity, cost / TB is 
still quite high).  I can get a Micron 15TB SSD for $2600, but why would I when 
I can get an 18TB Seagate IronWolf for <$600, an 18TB Seagate Exos for <$500, 
or an 18TB WD Gold for <$600?  Personally I wouldn't use drives that big in our 
little tiny clusters, but it exemplifies the issues around discussing cost 
parity.

As such, a cluster needs more drives for the same total size (and thus more 
nodes), which drives up the cost / TB for the cluster.

My 2 cents.

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Jack

On 2/4/21 7:17 PM, dhils...@performair.com wrote:
Why would I when I can get an 18TB Seagate IronWolf for <$600, an 18TB Seagate Exos for <$500, or an 18TB WD Gold for <$600?


IOPS
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Mark Lehrer
> It seems all the big vendors feel 2x is safe with NVMe but
> I get the feeling this community feels otherwise

Definitely!

As someone who works for a big vendor (and I have since I worked at
Fusion-IO way back in the old days), IMO the correct way to phrase
this would probably be that "someone in technical marketing at the big
vendors" was convinced that 2x was safe enough to put in a white paper
or sales document.  They (we, I guess, since I'm one of these types of
people) are focused on performance and cost numbers and as much as I
hate to admit it, it can get in the way of long-term reliability
settings sometimes.

This doesn't mean that they are "wrong" -- these documents are
primarily meant to show the capabilities of their hardware, with a
bill of materials containing their part numbers.  It is expected that
end users will adjust a few things when it comes to a production
environment.

The idea that NVMe is safer than spinning rust drives is not
necessarily true -- and it's beside the point.  You are just as likely
to run into a weird situation where an OSD or pg acts up or disappears
for non-hardware reasons.

Unless you can live with "nine fives" instead of "five nines" (say, a
caching type of application where you can re-generate the data), use a
size of at least 3 -- and if you can't afford this much storage then
look at erasure coding schemes.
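
As a rough illustration of both options, here is a sketch; the pool names, PG counts
and the 2+2 profile are just examples, not recommendations for any particular cluster:

```
# Replicated pool with the usual durability settings:
ceph osd pool create vms 128 128 replicated
ceph osd pool set vms size 3
ceph osd pool set vms min_size 2

# Or a 2+2 erasure-coded pool: same raw:usable ratio as 2x replication,
# but it tolerates two overlapping failures instead of one:
ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host
ceph osd pool create vms-ec 128 128 erasure ec22
```

(RBD or CephFS on an EC data pool additionally needs a replicated metadata/base
pool; that part is omitted here.)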

All of this is IMO of course,
Mark



On Thu, Feb 4, 2021 at 4:38 AM Adam Boyhan  wrote:
>
> I know there is already a few threads about 2x replication but I wanted to 
> start one dedicated to discussion on NVMe. There are some older threads, but 
> nothing recent that addresses how the vendors are now pushing the idea of 2x.
>
> We are in the process of considering Ceph to replace our Nimble setup. We 
> will have two completely separate clusters at two different sites that we are 
> using rbd-mirror snapshot replication. The plan would be to run 2x 
> replication on each cluster. 3x is still an option, but for obvious reasons 
> 2x is enticing.
>
> Both clusters will be spot on to the super micro example in the white paper 
> below.
>
> It seems all the big vendors feel 2x is safe with NVMe but I get the feeling 
> this community feels otherwise. Trying to wrap my head around where the 
> disconnect is between the big players and the community. I could be missing 
> something, but even our Supermicro contact that we worked the config out with 
> was in agreement with 2x on NVMe.
>
> Appreciate the input!
>
> https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf
>
> https://www.redhat.com/cms/managed-files/st-micron-ceph-performance-reference-architecture-f17294-201904-en.pdf
>
> https://www.samsung.com/semiconductor/global.semi/file/resource/2020/05/redhat-ceph-whitepaper-0521.pdf
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Anthony D'Atri


>> Why would I when I can get an 18TB Seagate IronWolf for <$600, an 18TB Seagate 
>> Exos for <$500, or an 18TB WD Gold for <$600?  
> 
> IOPS

Some installations don’t care so much about IOPS.

Less-tangible factors include:

* Time to repair and thus to restore redundancy.  When an EC pool of spinners 
takes a *month* to weight up a drive, that’s a significant operational and data 
durability / availability concern.

* RMAs.  They’re a pain, especially if you have to work them through a chassis 
vendor, who likely will be dilatory and demand unreasonable hoops like 
attaching a BMC web interface screenshot for every drive.  This translates to 
each RMA being modeled with a certain shipping / person-hour cost, which means 
that for lower unit-value items it may not be worth the hassle.  It is not 
unreasonable to guesstimate a threshold around USD 500.  So it is not uncommon 
to just trash failed / DOA spinners — or letting them stack up indefinitely in 
a corner — instead of recovering their value.

As I wrote … in 2019 I think it was, with spinners you have some manner of HBA 
in the mix.  If that HBA is a fussy RAID model, you may have significant added 
cost for the RoC, onboard RAM, and supercap/BBU.  Complexity also comes with 
neverending firmware bugs and cache management nightmares.  Gas gauge firmware… 
don’t even get me talking about that.

And consider how many TB of 3.5” spinners you fit into an RU, compared to 2.5” 
or EDSFF flash.  RUs aren’t free, and SATA HBAs will bottleneck a relatively 
dense HDD chassis long before a similar number of NVMe drives will bottleneck.  
Unless perhaps you have the misfortune of a chassis manufacturer who for some 
reason runs NVMe PCI lanes *through* an HBA.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Steven Pine
Taking a month to weight up a drive suggests the cluster doesn't have
enough spare IO capacity.

And for everyone suggesting EC, I don't understand how anyone really thinks
that's a valid alternative given the min allocation / space amplification
bug. No one in this community, not even the top developers of the project,
can provide an accurate space projection for EC usage -- and if you cannot
predict how much space an EC configuration will use in the wild because of
a bug that isn't well documented or discussed, you cannot begin to even
talk about costs and numbers. I understand '4k' min_alloc sizes may fix this
issue, but not everyone can necessarily run ceph off the latest master.
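
To make the amplification concrete, a back-of-the-envelope sketch, assuming the
pre-Pacific bluestore default min_alloc_size_hdd of 64 KiB and a 4+2 profile; the
object size is just an example:

```
# Each RADOS object is split into k data chunks plus m coding chunks, and every
# chunk occupies at least one min_alloc_size allocation on its OSD:
object_kib=16; k=4; m=2; min_alloc_kib=64
raw_kib=$(( (k + m) * min_alloc_kib ))        # 6 * 64  = 384 KiB actually allocated
nominal_kib=$(( object_kib * (k + m) / k ))   # 16 * 1.5 = 24 KiB you would expect
echo "raw=${raw_kib} KiB, nominal=${nominal_kib} KiB, amplification=$(( raw_kib / object_kib ))x"
```

Large objects amortize this away, which is why the effect is so hard to project
without knowing the object size distribution.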

There are a lot of hidden costs in using ceph which can vary depending on
usage needs, such as having spare io for recovery operations or ensuring
your total cluster disk usage stays below 60%.




-- 
Steven Pine

E: steven.p...@webair.com | P: 516.938.4100 x
Webair | 501 Franklin Avenue Suite 200, Garden City NY, 11530
webair.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread huxia...@horebdata.cn
>IMO with a cluster this size, you should not ever mark out any OSDs --
>rather, you should leave the PGs degraded, replace the disk (keep the
>same OSD ID), then recover those objects to the new disk.
>Or, keep it <40% used (which sounds like a waste).

Dear Dan,

I particularly like your idea of "leave the PGs degraded, and replace the disk 
with the same OSD ID". This is a wonderful thing I really want to do.

Could you please share some more details on how to achieve this, or some 
scripts that have already been tested?

thanks a lot,

Samuel





huxia...@horebdata.cn
 
From: Dan van der Ster
Date: 2021-02-04 11:57
To: Mario Giammarco
CC: Ceph Users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2
On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco  wrote:
>
>
>
> Il giorno mer 3 feb 2021 alle ore 21:22 Dan van der Ster 
>  ha scritto:
>>
>>
>> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if 
>> possible.
>>
>
> I will investigate. I heard that erasure coding is slow.
>
> Anyway I will write here the reason of this thread:
> In my customers I have usually proxmox+ceph with:
>
> - three servers
> - three monitors
> - 6 osd (two per server)
> - size=3 and min_size=2
>
> I followed the recommendations to stay safe.
> But one day one disk of one server has broken, osd where at 55%.
> What happened then?
> Ceph started filling the remaining OSD to maintain size=3
> OSD reached 90% ceph stopped all.
> Customer VMs froze and customer lost time and some data that was not written 
> on disk.
>
> So I got angry: size=3 and the customer still loses time and data?
 
You should size the OSD fullness config in such a way that the failures
you expect would still leave sufficient capacity.
In our case, we plan so that we could lose and re-replicate an entire
rack and still have enough space left. -- (IOW, with 5-6 racks, we
start to add capacity when the clusters reach ~70-75% full)
 
In your case, the issue is more extreme:
because you have 3 hosts, 2 OSDs each, and 3 replicas, when one OSD
fails and is marked out, you are telling ceph that *all* of its
objects will need to be written to the one remaining disk on the
host with the failure.
So unless your cluster was under 40-50% used, that OSD is going to
become overfull. (But BTW, ceph will hit backfillfull on the loaded
OSD before stopping IO -- this should not have blocked your user
unless they *also* filled the disk with new data at the same time.)
 
IMO with a cluster this size, you should not ever mark out any OSDs --
rather, you should leave the PGs degraded, replace the disk (keep the
same OSD ID), then recover those objects to the new disk.
Or, keep it <40% used (which sounds like a waste).
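
For reference, a minimal sketch of what “replace the disk, keep the same OSD ID”
typically looks like with a ceph-volume managed bluestore OSD; the id and device
name are placeholders:

```
ceph osd set noout                                   # keep the cluster from marking it out
ceph osd destroy 12 --yes-i-really-mean-it           # keeps the id and the CRUSH entry
# ... physically swap the failed drive ...
ceph-volume lvm zap /dev/sdX --destroy               # wipe any leftover LVM metadata
ceph-volume lvm create --osd-id 12 --data /dev/sdX   # rebuild the OSD under the same id
ceph osd unset noout
```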
 
-- dan
 
 
 
 
 
>
>
>
>
>
>>
>> Cheers, Dan
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco  wrote:
>>>
>>> Thanks Simon and thanks to other people that have replied.
>>> Sorry but I try to explain myself better.
>>> It is evident to me that if I have two copies of data, one brokes and while
>>> ceph creates again a new copy of the data also the disk with the second
>>> copy brokes you lose the data.
>>> It is obvious and a bit paranoid because many servers on many customers run
>>> on raid1 and so you are saying: yeah you have two copies of the data but
>>> you can broke both. Consider that in ceph recovery is automatic, with raid1
>>> some one must manually go to the customer and change disks. So ceph is
>>> already an improvement in this case even with size=2. With size 3 and min 2
>>> it is a bigger improvement I know.
>>>
>>> What I ask is this: what happens with min_size=1 and split brain, network
>>> down or similar things: do ceph block writes because it has no quorum on
>>> monitors? Are there some failure scenarios that I have not considered?
>>> Thanks again!
>>> Mario
>>>
>>>
>>>
>>> Il giorno mer 3 feb 2021 alle ore 17:42 Simon Ironside <
>>> sirons...@caffetine.org> ha scritto:
>>>
>>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>>> > > Hello,
>>> > > Imagine this situation:
>>> > > - 3 servers with ceph
>>> > > - a pool with size 2 min 1
>>> > >
>>> > > I know perfectly the size 3 and min 2 is better.
>>> > > I would like to know what is the worst thing that can happen:
>>> >
>>> > Hi Mario,
>>> >
>>> > This thread is worth a read, it's an oldie but a goodie:
>>> >
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>> >
>>> > Especially this post, which helped me understand the importance of
>>> > min_size=2
>>> >
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
>>> >
>>> > Cheers,
>>> > Simon
>>> > ___
>>> > ceph-users mailing list -- ceph-users@ceph.io
>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>> >
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Anthony D'Atri
Weighting up slowly so as not to DoS users.  Huge omaps and EC.  So yes you’re 
actually agreeing with me.

> 
> Taking a month to weight up a drive suggests the cluster doesn't have
> enough spare IO capacity.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] replace OSD failed

2021-02-04 Thread Tony Liu
Hi,

With 15.2.8, after running "ceph orch rm osd 12 --replace --force",
PGs on osd.12 are remapped, osd.12 is removed from "ceph osd tree",
the daemon is removed from "ceph orch ps", and the device shows as
"available" in "ceph orch device ls". Everything seems good at this point.

Then dry-run service spec.
```
# cat osd-spec.yaml
service_type: osd
service_id: osd-spec
placement:
  hosts:
  - ceph-osd-1
data_devices:
  rotational: 1
db_devices:
  rotational: 0

# ceph orch apply osd -i osd-spec.yaml --dry-run
+-+--++--+--+-+
|SERVICE  |NAME  |HOST|DATA  |DB|WAL  |
+-+--++--+--+-+
|osd  |osd-spec  |ceph-osd-3  |/dev/sdd  |/dev/sdb  |-|
+-+--++--+--+-+
```
It looks as expected.

Then "ceph orch apply osd -i osd-spec.yaml".
Here is the log of cephadm.
```
/bin/docker:stderr --> relative data size: 1.0
/bin/docker:stderr --> passed block_db devices: 1 physical, 0 LVM
/bin/docker:stderr Running command: /usr/bin/ceph-authtool --gen-print-key
/bin/docker:stderr Running command: /usr/bin/ceph --cluster ceph --name 
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd 
tree -f json
/bin/docker:stderr Running command: /usr/bin/ceph --cluster ceph --name 
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - 
osd new b05c3c90-b7d5-4f13-8a58-f72761c1971b 12
/bin/docker:stderr Running command: /usr/sbin/vgcreate --force --yes 
ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64 /dev/sdd
/bin/docker:stderr  stdout: Physical volume "/dev/sdd" successfully created.
/bin/docker:stderr  stdout: Volume group 
"ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64" successfully created
/bin/docker:stderr Running command: /usr/sbin/lvcreate --yes -l 572318 -n 
osd-block-b05c3c90-b7d5-4f13-8a58-f72761c1971b 
ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64
/bin/docker:stderr  stderr: Volume group 
"ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64" has insufficient free space (572317 
extents): 572318 required.
/bin/docker:stderr --> Was unable to complete a new OSD, will rollback changes
```
Q1: why is the VG name (ceph-) different from the others (ceph-block-)?
Q2: where does that 572318 come from? All HDDs are the same model, and the VG
"Total PE" of each of them is 572317.
Has anyone seen similar issues? Is there anything I am missing?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replace OSD failed

2021-02-04 Thread Tony Liu
Here is the log from ceph-volume.
```
[2021-02-05 04:03:17,000][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/vgcreate --force --yes ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64 
/dev/sdd
[2021-02-05 04:03:17,134][ceph_volume.process][INFO  ] stdout Physical volume 
"/dev/sdd" successfully created.
[2021-02-05 04:03:17,166][ceph_volume.process][INFO  ] stdout Volume group 
"ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64" successfully created
[2021-02-05 04:03:17,189][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/vgs --noheadings --readonly --units=b --nosuffix --separator=";" -S 
vg_name=ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64 -o 
vg_name,pv_count,lv_count,vg_attr,vg_extent_count,vg_free_count,vg_extent_size
[2021-02-05 04:03:17,229][ceph_volume.process][INFO  ] stdout 
ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64";"1";"0";"wz--n-";"572317";"572317";"4194304
[2021-02-05 04:03:17,229][ceph_volume.api.lvm][DEBUG ] size was passed: 2.18 TB 
-> 572318
[2021-02-05 04:03:17,235][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/lvcreate --yes -l 572318 -n 
osd-block-b05c3c90-b7d5-4f13-8a58-f72761c1971b 
ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64
[2021-02-05 04:03:17,244][ceph_volume.process][INFO  ] stderr Volume group 
"ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64" has insufficient free space (572317 
extents): 572318 required.
```
size was passed: 2.18 TB -> 572318
How is this calculated?
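
The log line itself shows where it goes wrong; this is a sketch of the arithmetic,
not the actual ceph-volume code:

```
# The VG reports 572317 free extents of 4 MiB each:
echo "572317 * 4 / 1024 / 1024" | bc -l   # ~2.18 TiB, the "2.18 TB" in the log
# ceph-volume converts that size back into an extent count and apparently
# rounds up, so lvcreate is asked for 572318 extents, exactly one more than
# the VG has, and the create fails with "insufficient free space".
```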


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replace OSD failed

2021-02-04 Thread Tony Liu
Here is the issue.
https://tracker.ceph.com/issues/47758


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] log_meta log_data was turned off in multisite and deleted

2021-02-04 Thread Szabo, Istvan (Agoda)
Hi,

Is there a way to reinitialize the stored data and make it sync from the logs?
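
If it helps, the usual way to force a full resync on the secondary zone, once
log_meta/log_data are enabled again in the period, is sketched below; this is an
assumption about the standard multisite workflow, the zone name is a placeholder,
and it should be tested outside production first:

```
# On the secondary zone, re-initialize both sync threads so they fall back to
# a full sync instead of replaying the (now missing) logs:
radosgw-admin metadata sync init
radosgw-admin data sync init --source-zone=<master-zone>
# Restart the radosgw daemons, then watch progress:
radosgw-admin sync status
```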

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Multisite reshard stale instances

2021-02-04 Thread Szabo, Istvan (Agoda)
Hi,

I found 600-700 stale instances with the reshard stale-instances list command.
Is there a way to clean them up (or, actually, should I clean them up)?
The stale-instances rm doesn't work in multisite.
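
For context, the commands in question are sketched below; whether the rm step is
safe to run against a multisite deployment is exactly the open question here:

```
# List bucket index instances left behind by resharding:
radosgw-admin reshard stale-instances list
# Remove them; this is only intended for single-site clusters, which is the
# limitation mentioned above:
radosgw-admin reshard stale-instances rm
```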

Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Pascal Ehlert
Sorry to jump in here, but would you care to explain why the total disk 
usage should stay under 60%?
This is not something I have heard before and a quick Google search 
didn't return anything useful.


Steven Pine wrote on 04.02.21 20:41:

There are a lot of hidden costs in using ceph which can vary depending on
usage needs, such as having spare io for recovery operations or ensuring
your total cluster disk usage stays below 60%.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Brian :
Certainly with a small number of nodes / OSDs this makes sense, as losing a
node could make the cluster's storage capacity fill up very quickly.
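
A quick back-of-the-envelope makes the point; five equal hosts is just an example:

```
# Five equal hosts at 60% full; one host dies and its data re-replicates
# onto the remaining four:
hosts=5; used_pct=60
after=$(( used_pct * hosts / (hosts - 1) ))   # 60 * 5 / 4 = 75
echo "average OSD fullness after healing: ${after}%"
# 75% is only the average; with uneven PG distribution the busiest OSDs will
# be pushing towards the default nearfull (0.85) and backfillfull (0.90) ratios.
```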

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io