[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-06 Thread Frank Schilder
> - three servers as recommended by Proxmox (with 10 Gb Ethernet and so on)
> - size=3 and min_size=2 as recommended by Ceph

You forgot the Ceph recommendation* to provide sufficient fail-over capacity in 
case a failure domain or disk fails. The recommendation would be to have 4 
hosts with 25% capacity left free for fail-over and another 10% for handling 
imbalance. With very few disks I would increase the buffer for imbalance.

* It's actually not a recommendation, it's a requirement for non-experimental 
clusters.
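
To illustrate the arithmetic behind that rule (my reading of it, not a quote 
from the docs):

  max fill        = 100% - 25% (fail-over) - 10% (imbalance) = 65%
  fill after loss = 65% of raw data on 75% of raw capacity ≈ 87% on the 
                    remaining 3 hosts

That is already past the default nearfull warning, which is why the imbalance 
buffer matters and why very few disks call for an even bigger one.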

Everything else has been answered already in great detail.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mario Giammarco 
Sent: 05 February 2021 21:10:33
To: Eneko Lacunza
Cc: Ceph Users
Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2

On Thu, 4 Feb 2021 at 12:19, Eneko Lacunza 
wrote:

> Hi all,
>
> > On 4/2/21 at 11:56, Frank Schilder wrote:
> >> - three servers
> >> - three monitors
> >> - 6 osd (two per server)
> >> - size=3 and min_size=2
> > This is a set-up that I would not run at all. The first reason is that
> Ceph lives on the law of large numbers and 6 is a small number; hence your
> OSDs fill up unevenly.
> >
> > What comes to my mind is a hyper-converged server with 6+ disks in a
> RAID10 array, possibly with a good controller with battery-powered or other
> non-volatile cache. Ceph will never beat that performance. Put in some
> extra disks as hot-spare and you have close to self-healing storage.
> >
> > Such a small ceph cluster will inherit all the baddies of ceph
> (performance, maintenance) without giving any of the goodies (scale-out,
> self-healing, proper distributed raid protection). Ceph needs size to
> become well-performing and pay off the maintenance and architectural effort.
> >
>
> It's funny that we have multiple clusters similar to this, and we and
> our customers couldn't be happier. Just use an HCI solution (like, for
> example, Proxmox VE, but there are others) to manage everything.
>
>



> Maybe the weakest thing in that configuration is having 2 OSDs per node;
> the OSD nearfull ratio must be tuned accordingly so that no OSD goes beyond
> about 0.45; then, if one disk fails, the other OSD in the node has enough
> space for healing replication.
>
>
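A minimal sketch of the nearfull tuning Eneko mentions above, assuming a
Luminous-or-later cluster where the ratio can be changed at runtime (0.45 is
just the value suggested above; adjust to your own capacity planning):

# ceph osd set-nearfull-ratio 0.45
# ceph osd dump | grep ratio

The second command only reads back the currently active ratios to verify the
change.
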
I reply to both: in fact I am using Proxmox VE and I am following all the
guidelines for an HA hyper-converged setup:

- three servers as recommended by Proxmox (with 10 Gb Ethernet and so on)
- size=3 and min_size=2 as recommended by Ceph

It is not that I woke up one morning and threw some random hardware together;
I followed the guidelines.
The result should be:
- if a disk (or more) breaks, work goes on
- if a server breaks, the VMs on that server start on another server and
work goes on.

The actual result is: one disk breaks, Ceph refills onto the other one in the
same server, which reaches 90%, and EVERYTHING stops, including all VMs; the
customer has lost unsaved data and cannot run the VMs it needs to keep working.
Not very "HA" as hoped.

Size=3 already means 3x the HDD cost. Now I would have to double that again to
6x. The customer will not buy more disks.

So I ask (again): apart from the known fact that with size=2 I risk a
second disk breaking before Ceph has rebuilt the second copy of the data,
are there other risks?
I repeat: I know perfectly well that size=3 is "better" and that I followed
the guidelines, but what can happen with size=2 and min_size=1?
The only thing I can imagine is that if I power down the switches I get a
split brain, but in that case monitor quorum is not reached, so Ceph should
stop writing and I do not risk inconsistent data.
Are there other things to consider?
Thanks,
Mario
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-06 Thread Frank Schilder
> How do you achieve that? 2 hours?

That's a long story. The short one is: by taking a wrong path during
troubleshooting. I should have stayed with my check-list instead. This is the
whole point of the redundancy remark I made: one admin mistake doesn't hurt,
and you are less likely to panic if one happens.

For far too long that day, I thought I had lost the whole cluster.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Konstantin Shalygin 
Sent: 06 February 2021 12:04:12
To: Frank Schilder
Cc: Alexander E. Patrakov; Mario Giammarco; ceph-users
Subject: Re: [ceph-users] Re: Worst thing that can happen if I have size= 2

How do you achieve that? 2 hours? Installing a new drive for the DB is 10
minutes of hands-on work for a DC engineer (if the drive is HHHL and the server
needs to be powered off). Then, after the server boots, your mon is already up.
After you provision the new drive: add the fstab entry, stop the mon, rm the
old mon store, mount the new one, run mon mkfs and start it. Even if this is
not covered by a script, it is at most 5 minutes to reach quorum.
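
Roughly, those steps could look like this (a sketch with placeholder names such
as MON_ID and NEW_DB_DEVICE; ownership and SELinux details omitted, so
double-check against the manual mon deployment docs for your release):

# stop the affected mon and mount the new DB device (fstab entry prepared
# beforehand); the old mon store can be removed once it is replaced
systemctl stop ceph-mon@MON_ID
mount /dev/NEW_DB_DEVICE /var/lib/ceph/mon
# fetch the current monmap and mon keyring from the surviving quorum
ceph mon getmap -o /tmp/monmap
ceph auth get mon. -o /tmp/mon.keyring
# rebuild an empty mon store and start the daemon; it then syncs from the
# other mons
ceph-mon --mkfs -i MON_ID --monmap /tmp/monmap --keyring /tmp/mon.keyring
systemctl start ceph-mon@MON_ID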


Thanks,
k

Sent from my iPhone

> On 5 Feb 2021, at 12:03, Frank Schilder  wrote:
>
> I learned this the hard way when upgrading our MON data disks. We have 3 MONs 
> and I needed to migrate each MON store to new storage. Of course I managed to 
> install the new disks in one and wipe the MON store on another MON. 2 hours 
> downtime.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-volume bluestore _read_fsid unparsable uuid

2021-02-06 Thread Frank Schilder
Hi Dave and everyone else affected,

I'm responding to a thread you opened on an issue with lvm OSD creation:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YYH3VANVV22WGM3CNL4TN4TTL63FCEVD/
https://tracker.ceph.com/issues/43868

Most important question: is there a workaround?

My observations: I'm running into the exact same issue on Mimic 13.2.10. The 
strange thing is that some OSDs get created and others fail. I can't see a 
pattern. I have one host where every create worked and another where half 
failed. The important lines in the log are probably:

 stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 
bluestore(/var/lib/ceph/osd/ceph-342/) _read_fsid unparsable uuid
 stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 bdev(0x561db199c700 
/var/lib/ceph/osd/ceph-342//block) _aio_start io_setup(2) failed with EAGAIN; 
try increasing /proc/sys/fs/aio-max-nr
 stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 
bluestore(/var/lib/ceph/osd/ceph-342/) mkfs failed, (11) Resource temporarily 
unavailable
 stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 OSD::mkfs: 
ObjectStore::mkfs failed with error (11) Resource temporarily unavailable
 stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1  ** ERROR: error 
creating empty object store in /var/lib/ceph/osd/ceph-342/: (11) Resource 
temporarily unavailable

I really need to get a decent number of disks up very soon. Any help is 
appreciated. I can provide more output if that helps.

Best regards and a good weekend!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume bluestore _read_fsid unparsable uuid

2021-02-06 Thread Frank Schilder
I just noticed one difference between the two servers:

"Broken" server:

# lvm vgs
  Failed to set up async io, using sync io.
  VG#PV #LV #SN Attr   VSize  VFree
[listing follows]

"Good" server:
# lvm vgs
  VG#PV #LV #SN Attr   VSize  VFree
[listing follows]

Could this play a role here?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume bluestore _read_fsid unparsable uuid

2021-02-06 Thread Frank Schilder
OK, found it. The second line in the error messages actually gives it away:

 stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 bdev(0x561db199c700 
/var/lib/ceph/osd/ceph-342//block) _aio_start io_setup(2) failed with EAGAIN; 
try increasing /proc/sys/fs/aio-max-nr

On my system, the default is rather small:

# sysctl fs.aio-max-nr
fs.aio-max-nr = 65536

Seemingly not a problem for ceph-disk OSDs:

# sysctl fs.aio-nr
fs.aio-nr = 32768

However, LVM OSDs seem to be quite hungry. Increasing the value to

sysctl -w fs.aio-max-nr=1048576

solved it for me.

I should have read it more carefully.
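
To make that setting survive a reboot, a minimal sketch assuming a standard
sysctl.d setup (the file name is arbitrary):

# echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/90-ceph-aio.conf
# sysctl --system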

Best regards and a nice weekend.
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-06 Thread Tony Liu
Adding dev to comment.

With 15.2.8, when applying the OSD service spec, db_devices is gone.
Here is the service spec file.
==
service_type: osd
service_id: osd-spec
placement:
  hosts:
  - ceph-osd-1
spec:
  objectstore: bluestore
  data_devices:
rotational: 1
  db_devices:
rotational: 0
==
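
Assuming the spec above is saved as osd-spec.yaml and this 15.2.x build
supports the flag, a dry run can show how cephadm interprets the spec before
anything is deployed:

# ceph orch apply osd -i osd-spec.yaml --dry-run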

Here is the logging from the mon. The message with "Tony" was added by me
in the mgr to confirm the spec content. The audit from the mon shows that
db_devices is gone. Is there anything in the mon that filters it out based
on host info? How can I trace it?
==
audit 2021-02-07T00:45:38.106171+ mgr.ceph-control-1.nxjnzz (mgr.24142551) 
4020 : audit [DBG] from='client.24184218 -' entity='client.admin' 
cmd=[{"prefix": "orch apply osd", "target": ["mon-mgr", ""]}]: dispatch
cephadm 2021-02-07T00:45:38.108546+ mgr.ceph-control-1.nxjnzz 
(mgr.24142551) 4021 : cephadm [INF] Marking host: ceph-osd-1 for OSDSpec 
preview refresh.
cephadm 2021-02-07T00:45:38.108798+ mgr.ceph-control-1.nxjnzz 
(mgr.24142551) 4022 : cephadm [INF] Saving service osd.osd-spec spec with 
placement ceph-osd-1
cephadm 2021-02-07T00:45:38.108893+ mgr.ceph-control-1.nxjnzz 
(mgr.24142551) 4023 : cephadm [INF] Tony: spec: placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='ceph-osd-1',
 network='', name='')]), service_id='osd-spec', service_type='osd', 
data_devices=DeviceSelection(rotational=1, all=False), 
db_devices=DeviceSelection(rotational=0, all=False), osd_id_claims={}, 
unmanaged=False, filter_logic='AND', preview_only=False)>
audit 2021-02-07T00:45:38.109782+ mon.ceph-control-3 (mon.2) 25 : audit 
[INF] from='mgr.24142551 10.6.50.30:0/2838166251' 
entity='mgr.ceph-control-1.nxjnzz' cmd=[{"prefix":"config-key 
set","key":"mgr/cephadm/spec.osd.osd-spec","val":"{\"created\": 
\"2021-02-07T00:45:38.108810\", \"spec\": {\"placement\": {\"hosts\": 
[\"ceph-osd-1\"]}, \"service_id\": \"osd-spec\", \"service_name\": 
\"osd.osd-spec\", \"service_type\": \"osd\", \"spec\": {\"data_devices\": 
{\"rotational\": 1}, \"filter_logic\": \"AND\", \"objectstore\": 
\"bluestore\"}}}"}]: dispatch
audit 2021-02-07T00:45:38.110133+ mon.ceph-control-1 (mon.0) 107 : audit 
[INF] from='mgr.24142551 ' entity='mgr.ceph-control-1.nxjnzz' 
cmd=[{"prefix":"config-key 
set","key":"mgr/cephadm/spec.osd.osd-spec","val":"{\"created\": 
\"2021-02-07T00:45:38.108810\", \"spec\": {\"placement\": {\"hosts\": 
[\"ceph-osd-1\"]}, \"service_id\": \"osd-spec\", \"service_name\": 
\"osd.osd-spec\", \"service_type\": \"osd\", \"spec\": {\"data_devices\": 
{\"rotational\": 1}, \"filter_logic\": \"AND\", \"objectstore\": 
\"bluestore\"}}}"}]: dispatch
audit 2021-02-07T00:45:38.152756+ mon.ceph-control-1 (mon.0) 108 : audit 
[INF] from='mgr.24142551 ' entity='mgr.ceph-control-1.nxjnzz' 
cmd='[{"prefix":"config-key 
set","key":"mgr/cephadm/spec.osd.osd-spec","val":"{\"created\": 
\"2021-02-07T00:45:38.108810\", \"spec\": {\"placement\": {\"hosts\": 
[\"ceph-osd-1\"]}, \"service_id\": \"osd-spec\", \"service_name\": 
\"osd.osd-spec\", \"service_type\": \"osd\", \"spec\": {\"data_devices\": 
{\"rotational\": 1}, \"filter_logic\": \"AND\", \"objectstore\": 
\"bluestore\"}}}"}]': finished
==
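
Two read-back commands that may help with the tracing question (a sketch; the
config-key name is the one visible in the audit lines above):

# ceph orch ls --export
# ceph config-key get mgr/cephadm/spec.osd.osd-spec

If db_devices is already missing from what these return, the drop would have
happened on the mgr/cephadm side before the config-key was written, not in
the mon.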

Thanks!
Tony
> -Original Message-
> From: Jens Hyllegaard (Soft Design A/S) 
> Sent: Thursday, February 4, 2021 6:31 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: db_devices doesn't show up in exported osd
> service spec
> 
> Hi.
> 
> I have the same situation. Running 15.2.8, I created a specification that
> looked just like it, with rotational in the data_devices and non-rotational
> in the db_devices.
> 
> The first use applied fine. Afterwards it only uses the HDDs, and not the SSD.
> Also, is there a way to remove an unused OSD service?
> I managed to create osd.all-available-devices when I tried to stop the
> auto-creation of OSDs, using: ceph orch apply osd --all-available-devices
> --unmanaged=true
> 
> I created the original OSD using the web interface.
> 
> Regards
> 
> Jens
> -Original Message-
> From: Eugen Block 
> Sent: 3. februar 2021 11:40
> To: Tony Liu 
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: db_devices doesn't show up in exported osd
> service spec
> 
> How do you manage the db_sizes of your SSDs? Is that managed
> automatically by ceph-volume? You could try to add another config and
> see what it does, maybe try to add block_db_size?
> 
> 
> Quoting Tony Liu :
> 
> > All mon, mgr, crash and osd are upgraded to 15.2.8. It actually fixed
> > another issue (no device listed after adding host).
> > But this issue remains.
> > ```
> > # cat osd-spec.yaml
> > service_type: osd
> > service_id: osd-spec
> > placement:
> >   host_pattern: ceph-osd-[1-3]
> > data_devices:
> >   rotational: 1
> > db_devices:
> >   rotational: 0
> >
> > # ceph orch apply osd -i osd-spec.yaml
> > Scheduled osd.osd-spec update...
> >
> > # ceph orch