[ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Florian Haas
Hi everyone,

I'm trying to wrap my head around an issue we recently saw, as it
relates to RBD locks, Qemu/KVM, and libvirt.

Our data center graced us with a sudden and complete dual-feed power
failure that affected both a Ceph cluster (Luminous, 12.2.12), and
OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes, these
things really happen, even in 2019.)

Once nodes were powered back up, the Ceph cluster came up gracefully
with no intervention required — all we saw was some Mon clock skew until
NTP peers had fully synced. Yay! However, our Nova compute nodes, or
rather the libvirt VMs that were running on them, were in not so great a
shape. The VMs booted up fine initially, but then blew up as soon as
they were trying to write to their RBD-backed virtio devices — which, of
course, was very early in the boot sequence as they had dirty filesystem
journals to apply.

Being able to read from, but not write to, RBDs is usually an issue with
exclusive locking, so we stopped one of the affected VMs, checked the
RBD locks on its device, and found (with rbd lock ls) that the lock was
still being held even after the VM was definitely down — both "openstack
server show" and "virsh domstate" agreed on this. We manually cleared
the lock (rbd lock rm), started the VM, and it booted up fine.
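
For reference, the per-VM cleanup boiled down to something like this (the
pool/image spec is a placeholder; lock ID and locker are taken from the
"rbd lock ls" output):

# rbd lock ls volumes/volume-1234
# rbd lock rm volumes/volume-1234 <lock-id> <locker>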

Repeat for all VMs, and we were back in business.

If I understand correctly, image locks — in contrast to image watchers —
have no timeout, so locks must always be explicitly released, or they
linger forever.

So that raises a few questions:

(1) Is it correct to assume that the lingering lock was actually from
*before* the power failure?

(2) What, exactly, triggers the lock acquisition and release in this
context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?

(3) Would the same issue be expected essentially in any hard failure of
even a single compute node, and if so, does that mean that what
https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova
evacuate" (and presumably, by extension also about "nova host-evacuate")
is inaccurate? If so, what would be required to make that work?

(4) If (3), is it correct to assume that the same considerations apply
to the Nova resume_guests_state_on_host_boot feature, i.e. that
automatic guest recovery wouldn't be expected to succeed even if a node
experienced just a hard reboot, as opposed to a catastrophic permanent
failure? And again, what would be required to make that work?  Is it
really necessary to clean all RBD locks manually?

Grateful for any insight that people could share here. I'd volunteer to
add a brief writeup of locking functionality in this context to the docs.

Thanks!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-15 Thread Wido den Hollander
Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
pool and Object the large Object is in?

Wido

On 11/15/19 12:23 AM, dhils...@performair.com wrote:
> All;
> 
> We had a warning about a large OMAP object pop up in one of our clusters 
> overnight.  The cluster is configured for CephFS, but nothing mounts a 
> CephFS, at this time.
> 
> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, and 
> the MGR log on one of the mons, with no useful references to the pool / pg 
> where the large OMAP object resides.
> 
> Is my only option to find this large OMAP object to go through the OSD logs 
> for the individual OSDs in the cluster?
> 
> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director - Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange CEPH_ARGS problems

2019-11-15 Thread Rainer Krienke
I found a typo in my post:

Of course I tried

export CEPH_ARGS="-n client.rz --keyring="

and not

export CEPH_ARGS=="-n client.rz --keyring="

Thanks
Rainer

Am 15.11.19 um 07:46 schrieb Rainer Krienke:
> Hello,
> 
> I try to use CEPH_ARGS in order to use eg rbd with a non client.admin
> user and keyring without extra parameters. On a ceph-client with Ubuntu
> 18.04.3 I get this:
> 
> # unset CEPH_ARGS
> # rbd --name=client.user --keyring=/etc/ceph/ceph.client.user.keyring ls
> a
> b
> c
> 
> # export CEPH_ARGS=="-n client.rz --keyring=/etc
> /ceph/ceph.client.user.keyring"
> # rbd ls
> 
> rbd: couldn't connect to the cluster!
> rbd: listing images failed: (22) Invalid argument
> 
> # export CEPH_ARGS=="--keyring=/etc/ceph/ceph.client.user.keyring"
> # rbd -n client.user ls
> a
> b
> c
> 
> Is this the desired behavior? I would like to set both user name and
> keyring to be used, so that I can run rbd without any parameters.
> 
> How do you do this?
> 
> Thanks
> Rainer
> 


-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange CEPH_ARGS problems

2019-11-15 Thread Janne Johansson
Is the flip between the client names "rz" and "user" also a typo? It's
hard to divine whether it is intentional, since you are mixing the two
throughout.


Den fre 15 nov. 2019 kl 10:57 skrev Rainer Krienke :

> I found a typo in my post:
>
> Of course I tried
>
> export CEPH_ARGS="-n client.rz --keyring="
>
> and not
>
> export CEPH_ARGS=="-n client.rz --keyring="
>
> Thanks
> Rainer
>
> Am 15.11.19 um 07:46 schrieb Rainer Krienke:
> > Hello,
> >
> > I try to use CEPH_ARGS in order to use eg rbd with a non client.admin
> > user and keyring without extra parameters. On a ceph-client with Ubuntu
> > 18.04.3 I get this:
> >
> > # unset CEPH_ARGS
> > # rbd --name=client.user --keyring=/etc/ceph/ceph.client.user.keyring ls
> > a
> > b
> > c
> >
> > # export CEPH_ARGS=="-n client.rz --keyring=/etc
> > /ceph/ceph.client.user.keyring"
> > # rbd ls
> > 
> > rbd: couldn't connect to the cluster!
> > rbd: listing images failed: (22) Invalid argument
> >
> > # export CEPH_ARGS=="--keyring=/etc/ceph/ceph.client.user.keyring"
> > # rbd -n client.user ls
> > a
> > b
> > c
> >
> > Is this the desired behavior? I would like to set both user name and
> > keyring to be used, so that I can run rbd without any parameters.
> >
> > How do you do this?
> >
> > Thanks
> > Rainer
> >
>
>
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange CEPH_ARGS problems

2019-11-15 Thread Konstantin Shalygin

I found a typo in my post:

Of course I tried

export CEPH_ARGS="-n client.rz --keyring="

and not

export CEPH_ARGS=="-n client.rz --keyring="


try `export CEPH_ARGS="--id rz --keyring=..."`
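
A small note on the two options: --id takes the bare ID (e.g. "rz"),
whereas -n/--name takes the full form including the type prefix (e.g.
"client.rz"). A complete example (the keyring path is just an
illustration):

export CEPH_ARGS="--id rz --keyring=/etc/ceph/ceph.client.rz.keyring"
rbd ls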



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Strange CEPH_ARGS problems

2019-11-15 Thread Rainer Krienke
This is not my day :-)

Yes, this flip between client.rz and client.user was not intended either.
It's another typo. When running rbd I used the same client.rz user and the
same keyring /etc/ceph/ceph.client.user.keyring everywhere.

Sorry
Rainer

Am 15.11.19 um 11:02 schrieb Janne Johansson:
> Is the flip between the client name "rz" and "user" also a mistype? It's
> hard to divinate if it is intentional or not since you are mixing it about.
> 
> 
> Den fre 15 nov. 2019 kl 10:57 skrev Rainer Krienke
> mailto:krie...@uni-koblenz.de>>:
> 
> I found a typo in my post:
> 
> Of course I tried
> 
> export CEPH_ARGS="-n client.rz --keyring="
> 
> and not
> 
> export CEPH_ARGS=="-n client.rz --keyring="
> 
> Thanks
> Rainer
> 
> Am 15.11.19 um 07:46 schrieb Rainer Krienke:
> > Hello,
> >
> > I try to use CEPH_ARGS in order to use eg rbd with a non client.admin
> > user and keyring without extra parameters. On a ceph-client with
> Ubuntu
> > 18.04.3 I get this:
> >
> > # unset CEPH_ARGS
> > # rbd --name=client.user
> --keyring=/etc/ceph/ceph.client.user.keyring ls
> > a
> > b
> > c
> >
> > # export CEPH_ARGS=="-n client.rz --keyring=/etc
> > /ceph/ceph.client.user.keyring"
> > # rbd ls
> > 
> > rbd: couldn't connect to the cluster!
> > rbd: listing images failed: (22) Invalid argument
> >
> > # export CEPH_ARGS=="--keyring=/etc/ceph/ceph.client.user.keyring"
> > # rbd -n client.user ls
> > a
> > b
> > c
> >
> > Is this the desired behavior? I would like to set both user name and
> > keyring to be used, so that I can run rbd without any parameters.
> >
> > How do you do this?
> >
> > Thanks
> > Rainer
> >
> 
> 
> -- 
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> -- 
> May the most significant bit of your life be positive.


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Simon Ironside

Hi Florian,

Any chance the key your compute nodes are using for the RBD pool is 
missing 'allow command "osd blacklist"' from its mon caps?
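
(A quick way to check what the key currently has; the client name below is
just an example, substitute the one your compute nodes use:

# ceph auth get client.nova

and look at the "caps mon" line.)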


Simon

On 15/11/2019 08:19, Florian Haas wrote:

Hi everyone,

I'm trying to wrap my head around an issue we recently saw, as it
relates to RBD locks, Qemu/KVM, and libvirt.

Our data center graced us with a sudden and complete dual-feed power
failure that affected both a Ceph cluster (Luminous, 12.2.12), and
OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes, these
things really happen, even in 2019.)

Once nodes were powered back up, the Ceph cluster came up gracefully
with no intervention required — all we saw was some Mon clock skew until
NTP peers had fully synced. Yay! However, our Nova compute nodes, or
rather the libvirt VMs that were running on them, were in not so great a
shape. The VMs booted up fine initially, but then blew up as soon as
they were trying to write to their RBD-backed virtio devices — which, of
course, was very early in the boot sequence as they had dirty filesystem
journals to apply.

Being able to read from, but not write to, RBDs is usually an issue with
exclusive locking, so we stopped one of the affected VMs, checked the
RBD locks on its device, and found (with rbd lock ls) that the lock was
still being held even after the VM was definitely down — both "openstack
server show" and "virsh domstate" agreed on this. We manually cleared
the lock (rbd lock rm), started the VM, and it booted up fine.

Repeat for all VMs, and we were back in business.

If I understand correctly, image locks — in contrast to image watchers —
have no timeout, so locks must always be explicitly released, or they
linger forever.

So that raises a few questions:

(1) Is it correct to assume that the lingering lock was actually from
*before* the power failure?

(2) What, exactly, triggers the lock acquisition and release in this
context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?

(3) Would the same issue be expected essentially in any hard failure of
even a single compute node, and if so, does that mean that what
https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova
evacuate" (and presumably, by extension also about "nova host-evacuate")
is inaccurate? If so, what would be required to make that work?

(4) If (3), is it correct to assume that the same considerations apply
to the Nova resume_guests_state_on_host_boot feature, i.e. that
automatic guest recovery wouldn't be expected to succeed even if a node
experienced just a hard reboot, as opposed to a catastrophic permanent
failure? And again, what would be required to make that work?  Is it
really necessary to clean all RBD locks manually?

Grateful for any insight that people could share here. I'd volunteer to
add a brief writeup of locking functionality in this context to the docs.

Thanks!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Beginner question network configuration best practice

2019-11-15 Thread Willi Schiegel

Hello All,

I'm starting to set up a Ceph cluster and am confused about the 
recommendations for the network setup.


In the Mimic manual I can read

"We recommend running a Ceph Storage Cluster with two networks: a public 
(front-side) network and a cluster (back-side) network."


In the Nautilus manual there is

"Ceph functions just fine with a public network only, but you may see 
significant performance improvement with a second “cluster” network in a 
large cluster.


It is possible to run a Ceph Storage Cluster with two networks: a public 
(front-side) network and a cluster (back-side) network. However, this 
approach complicates network configuration (both hardware and software) 
and does not usually have a significant impact on overall performance. 
For this reason, we generally recommend that dual-NIC systems either be 
configured with two IPs on the same network, or bonded."


Am I misunderstanding something or is "significant performance 
improvement" and "does not usually have a significant impact on overall 
performance" in the Nautilus doc contradictory? So, which way to go?


Thank you very much
Willi


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Wido den Hollander


On 11/15/19 11:24 AM, Simon Ironside wrote:
> Hi Florian,
> 
> Any chance the key your compute nodes are using for the RBD pool is
> missing 'allow command "osd blacklist"' from its mon caps?
> 

In addition to this, I recommend using 'profile rbd' for the mon caps.

As also stated in the OpenStack docs:
https://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication
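
A minimal sketch of what that looks like, along the lines of the linked
docs (the client and pool names are the usual OpenStack examples, adjust
to your deployment):

# ceph auth caps client.cinder \
    mon 'profile rbd' \
    osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'

The 'profile rbd' mon cap already includes the blacklist permission
mentioned above, so you don't need to grant 'allow command "osd blacklist"'
separately.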

Wido

> Simon
> 
> On 15/11/2019 08:19, Florian Haas wrote:
>> Hi everyone,
>>
>> I'm trying to wrap my head around an issue we recently saw, as it
>> relates to RBD locks, Qemu/KVM, and libvirt.
>>
>> Our data center graced us with a sudden and complete dual-feed power
>> failure that affected both a Ceph cluster (Luminous, 12.2.12), and
>> OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes, these
>> things really happen, even in 2019.)
>>
>> Once nodes were powered back up, the Ceph cluster came up gracefully
>> with no intervention required — all we saw was some Mon clock skew until
>> NTP peers had fully synced. Yay! However, our Nova compute nodes, or
>> rather the libvirt VMs that were running on them, were in not so great a
>> shape. The VMs booted up fine initially, but then blew up as soon as
>> they were trying to write to their RBD-backed virtio devices — which, of
>> course, was very early in the boot sequence as they had dirty filesystem
>> journals to apply.
>>
>> Being able to read from, but not write to, RBDs is usually an issue with
>> exclusive locking, so we stopped one of the affected VMs, checked the
>> RBD locks on its device, and found (with rbd lock ls) that the lock was
>> still being held even after the VM was definitely down — both "openstack
>> server show" and "virsh domstate" agreed on this. We manually cleared
>> the lock (rbd lock rm), started the VM, and it booted up fine.
>>
>> Repeat for all VMs, and we were back in business.
>>
>> If I understand correctly, image locks — in contrast to image watchers —
>> have no timeout, so locks must always be explicitly released, or they
>> linger forever.
>>
>> So that raises a few questions:
>>
>> (1) Is it correct to assume that the lingering lock was actually from
>> *before* the power failure?
>>
>> (2) What, exactly, triggers the lock acquisition and release in this
>> context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?
>>
>> (3) Would the same issue be expected essentially in any hard failure of
>> even a single compute node, and if so, does that mean that what
>> https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova
>> evacuate" (and presumably, by extension also about "nova host-evacuate")
>> is inaccurate? If so, what would be required to make that work?
>>
>> (4) If (3), is it correct to assume that the same considerations apply
>> to the Nova resume_guests_state_on_host_boot feature, i.e. that
>> automatic guest recovery wouldn't be expected to succeed even if a node
>> experienced just a hard reboot, as opposed to a catastrophic permanent
>> failure? And again, what would be required to make that work?  Is it
>> really necessary to clean all RBD locks manually?
>>
>> Grateful for any insight that people could share here. I'd volunteer to
>> add a brief writeup of locking functionality in this context to the docs.
>>
>> Thanks!
>>
>> Cheers,
>> Florian
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Beginner question network configuration best practice

2019-11-15 Thread Wido den Hollander


On 11/15/19 12:57 PM, Willi Schiegel wrote:
> Hello All,
> 
> I'm starting to setup a Ceph cluster and am confused about the
> recommendations for the network setup.
> 
> In the Mimic manual I can read
> 
> "We recommend running a Ceph Storage Cluster with two networks: a public
> (front-side) network and a cluster (back-side) network."
> 
> In the Nautilus manual there is
> 
> "Ceph functions just fine with a public network only, but you may see
> significant performance improvement with a second “cluster” network in a
> large cluster.
> 
> It is possible to run a Ceph Storage Cluster with two networks: a public
> (front-side) network and a cluster (back-side) network. However, this
> approach complicates network configuration (both hardware and software)
> and does not usually have a significant impact on overall performance.
> For this reason, we generally recommend that dual-NIC systems either be
> configured with two IPs on the same network, or bonded."
> 
> Am I misunderstanding something or is "significant performance
> improvement" and "does not usually have a significant impact on overall
> performance" in the Nautilus doc contradictory? So, which way to go?
> 

There is no need to have a public and cluster network with Ceph. Working
as a Ceph consultant I've deployed multi-PB Ceph clusters with a single
public network without any problems. Each node has a single IP-address,
nothing more, nothing less.

The whole idea of a separate public/cluster network dates back to the time
when 10G was expensive. But nowadays having 2x25G per node isn't that
expensive anymore and is sufficient for almost all use cases.

I'd save the money for a second network and spend it on an additional
machine in the cluster. That lets you scale out even more.

My philosophy: One node, one IP.

I've deployed dozens of clusters this way and they all work fine :-)
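
For what it's worth, a single-network setup then needs nothing more than
this in ceph.conf (the subnet is an example):

[global]
public_network = 192.168.10.0/24
# no cluster_network set: replication and heartbeat traffic simply use the
# public network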

Wido

> Thank you very much
> Willi
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node failure -- corrupt memory

2019-11-15 Thread Wido den Hollander



On 11/11/19 2:00 PM, Shawn Iverson wrote:
> Hello Cephers!
> 
> I had a node over the weekend go nuts from what appears to have been
> failed/bad memory modules and/or motherboard.
> 
> This resulted in several OSDs blocking IO for > 128s (indefinitely).
> 
> I was not watching my alerts too closely over the weekend, or else I may
> have caught it early. The servers in the entire cluster reliant on ceph
> stalled from the blocked IO on this failing node and had to be restarted
> after taking the faulty node offline.
> 
> So, my question is, is there a way to tell ceph to start setting OSDs
> out in the event of an IO blockage that exceeds a certain limit, or are
> there risks in doing so that I would be better off dealing with a
> stalled ceph cluster?
> 

In the end the OSDs should commit suicide, but this is always a problem
with failing hardware. The best approach would be to have the Linux machine
kill itself rather than relying on the OSD to handle this.

So, for example, just kill the node when memory problems occur.
panic_on_oom (which wouldn't have fired in this particular case) is
something I've set before, combined with kernel.panic=60.
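
For reference, a sketch of those settings, e.g. in /etc/sysctl.d/99-panic.conf:

vm.panic_on_oom = 1
# reboot 60 seconds after a kernel panic:
kernel.panic = 60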

Wido

> -- 
> Shawn Iverson, CETL
> Director of Technology
> Rush County Schools
> ivers...@rushville.k12.in.us 
> 
> Cybersecurity
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Florian Haas
On 15/11/2019 11:23, Simon Ironside wrote:
> Hi Florian,
> 
> Any chance the key your compute nodes are using for the RBD pool is
> missing 'allow command "osd blacklist"' from its mon caps?
> 
> Simon

Hi Simon,

I received this off-list but then subsequently saw this message pop up
in the list archive, so I hope it's OK to reply on-list?

So that cap was indeed missing, thanks for the hint! However, I am still
trying to understand how this is related to the issue we saw.

The only documentation-ish article that I found about osd blacklist caps
is this:

https://access.redhat.com/solutions/3391211

We can also confirm a bunch of "access denied" messages in the mon logs
when trying to blacklist an OSD. So the content of that article definitely
applies to our situation; I'm just not sure I follow how the absence of
that capability caused this issue.

The article talks about RBD watchers, not locks. To the best of my
knowledge, a watcher operates like a lease on the image that is
periodically renewed; if it is not renewed within 30 seconds, the cluster
considers the client dead. (Please correct me if I'm wrong.)
For us, that didn't help. We had to actively remove locks with "rbd lock
rm". Is the article using the wrong terms? Is there a link between
watchers and locks that I'm unaware of?
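
For reference, watchers and locks can be inspected separately (the image
spec below is a placeholder):

# rbd status volumes/volume-1234   <-- lists the image's current watchers
# rbd lock ls volumes/volume-1234  <-- lists the image's locks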

Semi-relatedly, as I understand it OSD blacklisting happens based either
on an IP address alone, or on a socket address (IP:port). While this comes
in handy for host evacuation, it doesn't help with in-place recovery (see
question 4 in my original message).

- If the blacklist happens based on the IP address alone (and that seems
to be what the client attempts, based on our log messages), then it would
break recovery-in-place after a hard reboot altogether.

- Even if the client blacklisted based on an address:port pair, it would
merely be very unlikely, though not impossible, for an RBD client to
reconnect from the same source port after the node recovers in place.
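
(For illustration, blacklist entries are created and listed like this; the
address is made up:

# ceph osd blacklist add 192.168.10.21:0/3891021284
# ceph osd blacklist ls

Passing a bare IP instead blacklists every client instance on that host.)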

So I am wondering: is this incorrect documentation, or incorrect
behavior, or am I simply making dead-wrong assumptions?

Cheers,
Florian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Beginner question network configuration best practice

2019-11-15 Thread Willi Schiegel

Thank you, you answer helps a lot!

On 15.11.19 13:21, Wido den Hollander wrote:



On 11/15/19 12:57 PM, Willi Schiegel wrote:

Hello All,

I'm starting to setup a Ceph cluster and am confused about the
recommendations for the network setup.

In the Mimic manual I can read

"We recommend running a Ceph Storage Cluster with two networks: a public
(front-side) network and a cluster (back-side) network."

In the Nautilus manual there is

"Ceph functions just fine with a public network only, but you may see
significant performance improvement with a second “cluster” network in a
large cluster.

It is possible to run a Ceph Storage Cluster with two networks: a public
(front-side) network and a cluster (back-side) network. However, this
approach complicates network configuration (both hardware and software)
and does not usually have a significant impact on overall performance.
For this reason, we generally recommend that dual-NIC systems either be
configured with two IPs on the same network, or bonded."

Am I misunderstanding something or is "significant performance
improvement" and "does not usually have a significant impact on overall
performance" in the Nautilus doc contradictory? So, which way to go?



There is no need to have a public and cluster network with Ceph. Working
as a Ceph consultant I've deployed multi-PB Ceph clusters with a single
public network without any problems. Each node has a single IP-address,
nothing more, nothing less.

The whole idea of a separated public/cluster dates back from the time
when 10G was expensive. But nowadays having 2x25G per node isn't that
expensive anymore and is sufficient for allmost all use cases.

I'd save the money for a second network and spend it on a additional
machine in the cluster. That let's you scale out even more.

My philosophy: One node, one IP.

I've deployed dozens of clusters this way and they all work fine :-)

Wido


Thank you very much
Willi


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Simon Ironside

Hi Florian,

On 15/11/2019 12:32, Florian Haas wrote:


I received this off-list but then subsequently saw this message pop up
in the list archive, so I hope it's OK to reply on-list?


Of course, I just clicked the wrong reply button the first time.


So that cap was indeed missing, thanks for the hint! However, I am still
trying to understand how this is related to the issue we saw.


I had exactly the same happen to me as happened to you a week or so ago. 
Compute node lost power and once restored the VMs would start booting 
but fail early on when they tried to write.


My key was also missing that cap, adding it and resetting the affected 
VMs was the only action I took to sort things out. I didn't need to go 
around removing locks by hand as you did. As you say, waiting 30 seconds 
didn't do any good so it doesn't appear to be a watcher thing.


This was mentioned in the release notes for Luminous[1], I'd missed it 
too as I redeployed Nautilus instead and skipped these steps:




Verify that all RBD client users have sufficient caps to blacklist other 
client users. RBD client users with only "allow r" monitor caps should 
be updated as follows:


# ceph auth caps client.<id> mon 'allow r, allow command "osd 
blacklist"' osd '<existing OSD caps for user>'




Simon

[1] 
https://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Florian Haas
On 15/11/2019 14:27, Simon Ironside wrote:
> Hi Florian,
> 
> On 15/11/2019 12:32, Florian Haas wrote:
> 
>> I received this off-list but then subsequently saw this message pop up
>> in the list archive, so I hope it's OK to reply on-list?
> 
> Of course, I just clicked the wrong reply button the first time.
> 
>> So that cap was indeed missing, thanks for the hint! However, I am still
>> trying to understand how this is related to the issue we saw.
> 
> I had exactly the same happen to me as happened to you a week or so ago.
> Compute node lost power and once restored the VMs would start booting
> but fail early on when they tried to write.
> 
> My key was also missing that cap, adding it and resetting the affected
> VMs was the only action I took to sort things out. I didn't need to go
> around removing locks by hand as you did. As you say, waiting 30 seconds
> didn't do any good so it doesn't appear to be a watcher thing.

Right, so suffice to say that that article is at least somewhere between
incomplete and misleading. :)

> This was mentioned in the release notes for Luminous[1], I'd missed it
> too as I redeployed Nautilus instead and skipped these steps:
> 
> 
> 
> Verify that all RBD client users have sufficient caps to blacklist other
> client users. RBD client users with only "allow r" monitor caps should
> be updated as follows:
> 
> # ceph auth caps client.<id> mon 'allow r, allow command "osd
> blacklist"' osd '<existing OSD caps for user>'
> 
> 

Yup, looks like we missed that bit of the release notes too (cluster has
been in production for several major releases now).

So it looks like we've got a fix for this. Thanks!

Also Wido, thanks for the reminder on profile rbd; we'll look into that too.

However, I'm still failing to wrap my head around the causality chain
here, and also around the interplay between watchers, locks, and
blacklists. If anyone could share some insight about this that I could
distill into a doc patch, I'd much appreciate that.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NVMe disk - size

2019-11-15 Thread Kristof Coucke
Hi all,



We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks
(14TB) and 2 NVMe disks (1,6TB).

The idea was to use the NVMe as “fast device”…

The recommendations I’ve read in the online documentation state that the
db block device should be around 4%~5% of the slow device. So, the block.db
should be somewhere between 600GB and 700GB as a best practice.

However… I was thinking to only reserve 200GB per OSD as fast device… Which
is 1/3 of the recommendation…



I’ve tested in the labs, and it does work fine with even very small devices
(the spillover does its job).

Though, before taking the system into production, I would like to verify
that no issues arise.



   - Is it recommended to still use it as a block.db, or is it recommended
   to only use it as a WAL device?
   - Should I just split the NVMe in three and only configure 3 OSDs to use
   the system? (This would mean that the performance would be degraded to
   the speed of the slowest device…)



The initial cluster is +1PB and we’re planning to expand it again with 1PB
in the near future to migrate our data.

We’ll only use the system thru the RGW (No CephFS, nor block device), and
we’ll store “a lot” of small files on it… (Millions of files a day)



The reason I’m asking is that I’ve been able to break the test system
(long story), causing OSDs to fail as they ran out of space… Expanding the
disks (the block DB device as well as the main block device) failed with
the ceph-bluestore-tool…



Thanks for your answer!



Kristof
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke  wrote:
> We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks 
> (14TB) and 2 NVMe disks (1,6TB).
> The recommendations I’ve read in the online documentation, state that the db 
> block device should be around 4%~5% of the slow device. So, the block.db 
> should be somewhere between 600GB and 700GB as a best practice.

That recommendation is unfortunately not based on any facts :(
How much you really need depends on your actual usage.

> However… I was thinking to only reserve 200GB per OSD as fast device… Which 
> is 1/3 of the recommendation…

For various weird internal reasons it'll currently only use ~30 GB in the
steady state during operation before spilling over; 300 GB would be the
next magical number
(search the mailing list for details)


> Is it recommended to still use it as a block.db

yes

> or is it recommended to only use it as a WAL device?

no, there is no advantage to that if it's that large


> Should I just split the NVMe in three and only configure 3 OSDs to use the 
> system? (This would mean that the performace shall be degraded to the speed 
> of the slowest device…)

no

> We’ll only use the system thru the RGW (No CephFS, nor block device), and 
> we’ll store “a lot” of small files on it… (Millions of files a day)

the current setup gives you around ~1.3 TB of usable metadata space
which may or may not be enough, really depends on how much "a lot" is
and how small "small" is.

It might be better to use the NVMe disks as dedicated OSDs and map all
metadata pools onto them directly; that allows you to fully utilize the
space for RGW metadata (but not Ceph metadata in the data pools) without
running into weird DB size restrictions.
There are advantages and disadvantages to both approaches.
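
A rough sketch of the second approach (the rule and pool names are
assumptions; the index/meta pool names depend on your zone, and the device
class may be reported as "ssd" rather than "nvme", check "ceph osd tree"):

# ceph osd crush rule create-replicated nvme-rule default host nvme
# ceph osd pool set default.rgw.buckets.index crush_rule nvme-rule
# ceph osd pool set default.rgw.meta crush_rule nvme-rule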

Paul

>
>
>
> The reason I’m asking it, is that I’ve been able to break the test system 
> (long story), causing OSDs to fail as they ran out of space… Expanding the 
> disks (the block DB device as well as the main block device) failed with the 
> ceph-bluestore-tool…
>
>
>
> Thanks for your answer!
>
>
>
> Kristof
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Wido den Hollander


On 11/15/19 3:19 PM, Kristof Coucke wrote:
> Hi all,
> 
>  
> 
> We’ve configured a Ceph cluster with 10 nodes, each having 13 large
> disks (14TB) and 2 NVMe disks (1,6TB).
> 
> The idea was to use the NVMe as “fast device”…
> 
> The recommendations I’ve read in the online documentation, state that
> the db block device should be around 4%~5% of the slow device. So, the
> block.db should be somewhere between 600GB and 700GB as a best practice.
> 
> However… I was thinking to only reserve 200GB per OSD as fast device…
> Which is 1/3 of the recommendation…
> 
>  
The 4% rule is way too much. Usually 10GB per 1TB of storage is
sufficient, so 1%; for a 14TB OSD that works out to roughly 140GB. You
should be safe with 200GB per OSD.

> 
> I’ve tested in the labs, and it does work fine with even very small
> devices (the spillover does its job).
> 
> Though, before taking the system in to production, I would like to
> verify that no issues arise.
> 
>  
> 
>   * Is it recommended to still use it as a block.db, or is it
> recommended to only use it as a WAL device?

Use it as DB.

>   * Should I just split the NVMe in three and only configure 3 OSDs to
> use the system? (This would mean that the performace shall be
> degraded to the speed of the slowest device…)
> 

I would split them. So 6 and 7 OSDs per NVMe. I normally use LVM on top
of each device and create 2 LVs per OSD:

- WAL: 1GB
- DB: xx GB
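
A sketch of that layout for a single OSD (device names and the DB size are
assumptions):

# vgcreate ceph-db-0 /dev/nvme0n1
# lvcreate -L 1G -n osd-0-wal ceph-db-0
# lvcreate -L 200G -n osd-0-db ceph-db-0
# ceph-volume lvm create --data /dev/sda \
    --block.wal ceph-db-0/osd-0-wal --block.db ceph-db-0/osd-0-db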

>  
> 
> The initial cluster is +1PB and we’re planning to expand it again with
> 1PB in the near future to migrate our data.
> 
> We’ll only use the system thru the RGW (No CephFS, nor block device),
> and we’ll store “a lot” of small files on it… (Millions of files a day)
> 
>  
> 
> The reason I’m asking it, is that I’ve been able to break the test
> system (long story), causing OSDs to fail as they ran out of space…
> Expanding the disks (the block DB device as well as the main block
> device) failed with the ceph-bluestore-tool…
> 
>  
> 
> Thanks for your answer!
> 
>  
> 
> Kristof
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Kristof Coucke
Hi Paul,

Thank you for the answer.
I hadn't thought of that approach... (using the NVMe for the metadata pool
of RGW).

Where do you get the limitation of 1.3TB from?

I don't get that one...

Br,

Kristof

Op vr 15 nov. 2019 om 15:26 schreef Paul Emmerich :

> On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke 
> wrote:
> > We’ve configured a Ceph cluster with 10 nodes, each having 13 large
> disks (14TB) and 2 NVMe disks (1,6TB).
> > The recommendations I’ve read in the online documentation, state that
> the db block device should be around 4%~5% of the slow device. So, the
> block.db should be somewhere between 600GB and 700GB as a best practice.
>
> That recommendation is unfortunately not based on any facts :(
> How much you really need depends on your actual usage.
>
> > However… I was thinking to only reserve 200GB per OSD as fast device…
> Which is 1/3 of the recommendation…
>
> For various weird internal reason it'll only use ~30 GB in the steady
> state during operation before spilling over at the moment, 300 GB
> would be the next magical number
> (search mailing list for details)
>
>
> > Is it recommended to still use it as a block.db
>
> yes
>
> > or is it recommended to only use it as a WAL device?
>
> no, there is no advantage to that if it's that large
>
>
> > Should I just split the NVMe in three and only configure 3 OSDs to use
> the system? (This would mean that the performace shall be degraded to the
> speed of the slowest device…)
>
> no
>
> > We’ll only use the system thru the RGW (No CephFS, nor block device),
> and we’ll store “a lot” of small files on it… (Millions of files a day)
>
> the current setup gives you around ~1.3 TB of usable metadata space
> which may or may not be enough, really depends on how much "a lot" is
> and how small "small" is.
>
> It might be better to use the NVMe disks as dedicated OSDs and map all
> metadata pools onto them directly, that allows you to fully utilize
> the space for RGW metadata (but not Ceph metadata in the data pools)
> without running into weird db size restrictions.
> There are advantages and disadvantages to both approaches
>
> Paul
>
> >
> >
> >
> > The reason I’m asking it, is that I’ve been able to break the test
> system (long story), causing OSDs to fail as they ran out of space…
> Expanding the disks (the block DB device as well as the main block device)
> failed with the ceph-bluestore-tool…
> >
> >
> >
> > Thanks for your answer!
> >
> >
> >
> > Kristof
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:04 PM Kristof Coucke  wrote:
>
> Hi Paul,
>
> Thank you for the answer.
> I didn't thought of that approach... (Using the NVMe for the meta data pool 
> of RGW).
>
> From where do you get the limitation of 1.3TB?

13 OSDs/Server * 10 Servers * 30 GB/OSD usable DB space / 3 (Replica)


>
> I don't get that one...
>
> Br,
>
> Kristof
>
> Op vr 15 nov. 2019 om 15:26 schreef Paul Emmerich :
>>
>> On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke  
>> wrote:
>> > We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks 
>> > (14TB) and 2 NVMe disks (1,6TB).
>> > The recommendations I’ve read in the online documentation, state that the 
>> > db block device should be around 4%~5% of the slow device. So, the 
>> > block.db should be somewhere between 600GB and 700GB as a best practice.
>>
>> That recommendation is unfortunately not based on any facts :(
>> How much you really need depends on your actual usage.
>>
>> > However… I was thinking to only reserve 200GB per OSD as fast device… 
>> > Which is 1/3 of the recommendation…
>>
>> For various weird internal reason it'll only use ~30 GB in the steady
>> state during operation before spilling over at the moment, 300 GB
>> would be the next magical number
>> (search mailing list for details)
>>
>>
>> > Is it recommended to still use it as a block.db
>>
>> yes
>>
>> > or is it recommended to only use it as a WAL device?
>>
>> no, there is no advantage to that if it's that large
>>
>>
>> > Should I just split the NVMe in three and only configure 3 OSDs to use the 
>> > system? (This would mean that the performace shall be degraded to the 
>> > speed of the slowest device…)
>>
>> no
>>
>> > We’ll only use the system thru the RGW (No CephFS, nor block device), and 
>> > we’ll store “a lot” of small files on it… (Millions of files a day)
>>
>> the current setup gives you around ~1.3 TB of usable metadata space
>> which may or may not be enough, really depends on how much "a lot" is
>> and how small "small" is.
>>
>> It might be better to use the NVMe disks as dedicated OSDs and map all
>> metadata pools onto them directly, that allows you to fully utilize
>> the space for RGW metadata (but not Ceph metadata in the data pools)
>> without running into weird db size restrictions.
>> There are advantages and disadvantages to both approaches
>>
>> Paul
>>
>> >
>> >
>> >
>> > The reason I’m asking it, is that I’ve been able to break the test system 
>> > (long story), causing OSDs to fail as they ran out of space… Expanding the 
>> > disks (the block DB device as well as the main block device) failed with 
>> > the ceph-bluestore-tool…
>> >
>> >
>> >
>> > Thanks for your answer!
>> >
>> >
>> >
>> > Kristof
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread EDH - Manuel Rios Fernandez
Hi,

To solve the issue: run rbd map pool/disk_id and mount the / volume on a
Linux machine (a Ceph node will do). This replays the journal and discards
the pending changes left in the OpenStack node's cache. Then unmount, rbd
unmap, and boot the instance from OpenStack again; it will work.

For Windows instances you must use ntfsfix on a Linux machine, with the
same map/unmap commands.
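
A sketch of that sequence (pool/image and device names are placeholders):

# rbd map volumes/volume-1234
# mount /dev/rbd0 /mnt    <-- mounting replays the dirty filesystem journal
# umount /mnt
# rbd unmap /dev/rbd0

For Windows guests, run ntfsfix against the NTFS partition of the mapped
device (e.g. /dev/rbd0p2) instead of mounting it.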

Regards,
Manuel




-Mensaje original-
De: ceph-users  En nombre de Simon
Ironside
Enviado el: viernes, 15 de noviembre de 2019 14:28
Para: ceph-users 
Asunto: Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM,
and Ceph RBD locks

Hi Florian,

On 15/11/2019 12:32, Florian Haas wrote:

> I received this off-list but then subsequently saw this message pop up 
> in the list archive, so I hope it's OK to reply on-list?

Of course, I just clicked the wrong reply button the first time.

> So that cap was indeed missing, thanks for the hint! However, I am 
> still trying to understand how this is related to the issue we saw.

I had exactly the same happen to me as happened to you a week or so ago. 
Compute node lost power and once restored the VMs would start booting but
fail early on when they tried to write.

My key was also missing that cap, adding it and resetting the affected VMs
was the only action I took to sort things out. I didn't need to go around
removing locks by hand as you did. As you say, waiting 30 seconds didn't do
any good so it doesn't appear to be a watcher thing.

This was mentioned in the release notes for Luminous[1], I'd missed it too
as I redeployed Nautilus instead and skipped these steps:



Verify that all RBD client users have sufficient caps to blacklist other
client users. RBD client users with only "allow r" monitor caps should be
updated as follows:

# ceph auth caps client.<id> mon 'allow r, allow command "osd blacklist"'
osd '<existing OSD caps for user>'



Simon

[1]
https://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-k
raken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
>
>  I normally use LVM on top
> of each device and create 2 LVs per OSD:
>
> - WAL: 1GB
> - DB: xx GB

Why? I've seen this a few times and I can't figure out what the advantage
of doing this explicitly at the LVM level is, instead of relying on
BlueStore to handle it.


Paul

>
> >
> >
> > The initial cluster is +1PB and we’re planning to expand it again with
> > 1PB in the near future to migrate our data.
> >
> > We’ll only use the system thru the RGW (No CephFS, nor block device),
> > and we’ll store “a lot” of small files on it… (Millions of files a day)
> >
> >
> >
> > The reason I’m asking it, is that I’ve been able to break the test
> > system (long story), causing OSDs to fail as they ran out of space…
> > Expanding the disks (the block DB device as well as the main block
> > device) failed with the ceph-bluestore-tool…
> >
> >
> >
> > Thanks for your answer!
> >
> >
> >
> > Kristof
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-15 Thread DHilsbos
All;

Thank you for your help so far.  I have found the log entries from when the 
object was found, but don't see a reference to the pool.

Here the logs:
2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
starts
2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap object 
found. Object: 
56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key count: 
380425 Size (bytes): 82896978

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: Wido den Hollander [mailto:w...@42on.com] 
Sent: Friday, November 15, 2019 1:56 AM
To: Dominic Hilsbos; ceph-users@lists.ceph.com
Cc: Stephen Self
Subject: Re: [ceph-users] Large OMAP Object

Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
pool and Object the large Object is in?

Wido

On 11/15/19 12:23 AM, dhils...@performair.com wrote:
> All;
> 
> We had a warning about a large OMAP object pop up in one of our clusters 
> overnight.  The cluster is configured for CephFS, but nothing mounts a 
> CephFS, at this time.
> 
> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, and 
> the MGR log on one of the mons, with no useful references to the pool / pg 
> where the large OMAP objects resides.
> 
> Is my only option to find this large OMAP object to go through the OSD logs 
> for the individual OSDs in the cluster?
> 
> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director - Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Wido den Hollander


On 11/15/19 4:25 PM, Paul Emmerich wrote:
> On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
>>
>>  I normally use LVM on top
>> of each device and create 2 LVs per OSD:
>>
>> - WAL: 1GB
>> - DB: xx GB
> 
> Why? I've seen this a few times and I can't figure out what the
> advantage of doing this explicitly on the LVM level instead of relying
> on BlueStore to handle this.
> 

If the WAL+DB are on an external device you want the WAL to be there as
well. That's why I specify the WAL separately.

This might be an 'old habit' as well.

Wido

> 
> Paul
> 
>>
>>>
>>>
>>> The initial cluster is +1PB and we’re planning to expand it again with
>>> 1PB in the near future to migrate our data.
>>>
>>> We’ll only use the system thru the RGW (No CephFS, nor block device),
>>> and we’ll store “a lot” of small files on it… (Millions of files a day)
>>>
>>>
>>>
>>> The reason I’m asking it, is that I’ve been able to break the test
>>> system (long story), causing OSDs to fail as they ran out of space…
>>> Expanding the disks (the block DB device as well as the main block
>>> device) failed with the ceph-bluestore-tool…
>>>
>>>
>>>
>>> Thanks for your answer!
>>>
>>>
>>>
>>> Kristof
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-15 Thread Wido den Hollander


On 11/15/19 4:35 PM, dhils...@performair.com wrote:
> All;
> 
> Thank you for your help so far.  I have found the log entries from when the 
> object was found, but don't see a reference to the pool.
> 
> Here the logs:
> 2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
> starts
> 2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap object 
> found. Object: 
> 56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key 
> count: 380425 Size (bytes): 82896978
> 

In this case it's in pool 56; check 'ceph df' to see which pool that is.

To me this looks like an RGW bucket whose index grew too big.

Use:

$ radosgw-admin bucket list
$ radosgw-admin metadata get bucket:<bucket_name>

And match that UUID back to the bucket.
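
For reference, you can also confirm the key count on the index object
directly; the pool name below is the typical default, substitute the pool
you identified via 'ceph df':

$ rados -p default.rgw.buckets.index listomapkeys \
    .dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1 | wc -l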

Wido

> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director – Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> 
> 
> 
> -Original Message-
> From: Wido den Hollander [mailto:w...@42on.com] 
> Sent: Friday, November 15, 2019 1:56 AM
> To: Dominic Hilsbos; ceph-users@lists.ceph.com
> Cc: Stephen Self
> Subject: Re: [ceph-users] Large OMAP Object
> 
> Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
> pool and Object the large Object is in?
> 
> Wido
> 
> On 11/15/19 12:23 AM, dhils...@performair.com wrote:
>> All;
>>
>> We had a warning about a large OMAP object pop up in one of our clusters 
>> overnight.  The cluster is configured for CephFS, but nothing mounts a 
>> CephFS, at this time.
>>
>> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, and 
>> the MGR log on one of the mons, with no useful references to the pool / pg 
>> where the large OMAP objects resides.
>>
>> Is my only option to find this large OMAP object to go through the OSD logs 
>> for the individual OSDs in the cluster?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA 
>> Director - Information Technology 
>> Perform Air International Inc.
>> dhils...@performair.com 
>> www.PerformAir.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Joshua M. Boniface

Hey All:

I've also quite frequently experienced this sort of issue with my Ceph 
RBD-backed QEMU/KVM
cluster (not OpenStack specifically). Should this workaround of allowing the 
'osd blacklist'
command in the caps help in that scenario as well, or is this an 
OpenStack-specific
functionality?

Thanks,
Joshua

On 2019-11-15 9:02 a.m., Florian Haas wrote:


On 15/11/2019 14:27, Simon Ironside wrote:

Hi Florian,

On 15/11/2019 12:32, Florian Haas wrote:


I received this off-list but then subsequently saw this message pop up
in the list archive, so I hope it's OK to reply on-list?

Of course, I just clicked the wrong reply button the first time.


So that cap was indeed missing, thanks for the hint! However, I am still
trying to understand how this is related to the issue we saw.

I had exactly the same happen to me as happened to you a week or so ago.
Compute node lost power and once restored the VMs would start booting
but fail early on when they tried to write.

My key was also missing that cap, adding it and resetting the affected
VMs was the only action I took to sort things out. I didn't need to go
around removing locks by hand as you did. As you say, waiting 30 seconds
didn't do any good so it doesn't appear to be a watcher thing.

Right, so suffice to say that that article is at least somewhere between
incomplete and misleading. :)


This was mentioned in the release notes for Luminous[1], I'd missed it
too as I redeployed Nautilus instead and skipped these steps:



Verify that all RBD client users have sufficient caps to blacklist other
client users. RBD client users with only "allow r" monitor caps should
be updated as follows:

# ceph auth caps client.<id> mon 'allow r, allow command "osd blacklist"' osd '<existing osd caps>'



Yup, looks like we missed that bit of the release notes too (cluster has
been in production for several major releases now).

So it looks like we've got a fix for this. Thanks!

Also Wido, thanks for the reminder on profile rbd; we'll look into that too.

However, I'm still failing to wrap my head around the causality chain
here, and also around the interplay between watchers, locks, and
blacklists. If anyone could share some insight about this that I could
distill into a doc patch, I'd much appreciate that.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:39 PM Wido den Hollander  wrote:
>
>
>
> On 11/15/19 4:25 PM, Paul Emmerich wrote:
> > On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
> >>
> >>  I normally use LVM on top
> >> of each device and create 2 LVs per OSD:
> >>
> >> - WAL: 1GB
> >> - DB: xx GB
> >
> > Why? I've seen this a few times and I can't figure out what the
> > advantage of doing this explicitly on the LVM level instead of relying
> > on BlueStore to handle this.
> >
>
> If the WAL+DB are on a external device you want the WAL to be there as
> well. That's why I specify the WAL separate.
>
> This might be an 'old habit' as well.

But the WAL will be placed onto the DB device if it isn't explicitly
specified, so there's no advantage to having a separate partition.
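
For what it's worth, that also matches how a plain ceph-volume call
behaves. A minimal sketch (device, VG and LV names are made up for the
example):

$ lvcreate -L 30G -n db-0 nvme-vg
$ ceph-volume lvm create --bluestore --data /dev/sdb --block.db nvme-vg/db-0

No --block.wal is given, so the WAL simply lives on the block.db LV; you
only need --block.wal if the WAL should sit on yet another (faster) device.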


Paul

>
> Wido
>
> >
> > Paul
> >
> >>
> >>>
> >>>
> >>> The initial cluster is +1PB and we’re planning to expand it again with
> >>> 1PB in the near future to migrate our data.
> >>>
> >>> We’ll only use the system thru the RGW (No CephFS, nor block device),
> >>> and we’ll store “a lot” of small files on it… (Millions of files a day)
> >>>
> >>>
> >>>
> >>> The reason I’m asking it, is that I’ve been able to break the test
> >>> system (long story), causing OSDs to fail as they ran out of space…
> >>> Expanding the disks (the block DB device as well as the main block
> >>> device) failed with the ceph-bluestore-tool…
> >>>
> >>>
> >>>
> >>> Thanks for your answer!
> >>>
> >>>
> >>>
> >>> Kristof
> >>>
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-15 Thread Paul Emmerich
Note that the size limit changed from 2M keys to 200k keys recently
(14.2.3 or 14.2.2 or something), so that object is probably older and
that's just the first deep scrub with the reduced limit that triggered
the warning.
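
If you want to see (or temporarily adjust) the threshold the deep scrub
uses, it is the osd_deep_scrub_large_omap_object_key_threshold option;
something like this should work on Nautilus (daemon name and value are
just examples, not a recommendation):

$ ceph config get osd.1 osd_deep_scrub_large_omap_object_key_threshold
$ ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 300000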


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 4:40 PM Wido den Hollander  wrote:
>
>
>
> On 11/15/19 4:35 PM, dhils...@performair.com wrote:
> > All;
> >
> > Thank you for your help so far.  I have found the log entries from when the 
> > object was found, but don't see a reference to the pool.
> >
> > Here the logs:
> > 2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
> > starts
> > 2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap 
> > object found. Object: 
> > 56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key 
> > count: 380425 Size (bytes): 82896978
> >
>
> In this case it's in pool 56, check 'ceph df' to see which pool that is.
>
> To me this seems like a RGW bucket which index grew too big.
>
> Use:
>
> $ radosgw-admin bucket list
> $ radosgw-admin metadata get bucket:
>
> And match that UUID back to the bucket.
>
> Wido
>
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: Wido den Hollander [mailto:w...@42on.com]
> > Sent: Friday, November 15, 2019 1:56 AM
> > To: Dominic Hilsbos; ceph-users@lists.ceph.com
> > Cc: Stephen Self
> > Subject: Re: [ceph-users] Large OMAP Object
> >
> > Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
> > pool and Object the large Object is in?
> >
> > Wido
> >
> > On 11/15/19 12:23 AM, dhils...@performair.com wrote:
> >> All;
> >>
> >> We had a warning about a large OMAP object pop up in one of our clusters 
> >> overnight.  The cluster is configured for CephFS, but nothing mounts a 
> >> CephFS, at this time.
> >>
> >> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, 
> >> and the MGR log on one of the mons, with no useful references to the pool 
> >> / pg where the large OMAP objects resides.
> >>
> >> Is my only option to find this large OMAP object to go through the OSD 
> >> logs for the individual OSDs in the cluster?
> >>
> >> Thank you,
> >>
> >> Dominic L. Hilsbos, MBA
> >> Director - Information Technology
> >> Perform Air International Inc.
> >> dhils...@performair.com
> >> www.PerformAir.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Simon Ironside

On 15/11/2019 15:44, Joshua M. Boniface wrote:

Hey All:

I've also quite frequently experienced this sort of issue with my Ceph 
RBD-backed QEMU/KVM
cluster (not OpenStack specifically). Should this workaround of allowing the 
'osd blacklist'
command in the caps help in that scenario as well, or is this an 
OpenStack-specific
functionality?


Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's 
required for all RBD clients.


Simon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Joshua M. Boniface

Thanks Simon! I've implemented it, I guess I'll test it out next time my 
homelab's power dies :-)

On 2019-11-15 10:54 a.m., Simon Ironside wrote:


On 15/11/2019 15:44, Joshua M. Boniface wrote:

Hey All:
I've also quite frequently experienced this sort of issue with my Ceph 
RBD-backed QEMU/KVM
cluster (not OpenStack specifically). Should this workaround of allowing the 
'osd blacklist'
command in the caps help in that scenario as well, or is this an 
OpenStack-specific
functionality?

Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
required for all RBD clients.
Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mimic - cephfs scrub errors

2019-11-15 Thread Andras Pataki

Dear cephers,

We've had a few (dozen or so) rather odd scrub errors in our Mimic 
(13.2.6) cephfs:


2019-11-15 07:52:52.614 7fffcc41f700  0 log_channel(cluster) log [DBG] : 
2.b5b scrub starts
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b shard 599 soid 2:dad01506:::100314224ad.0160:head : candidate 
size 4158512 info size 0 mismatch
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b shard 2768 soid 2:dad01506:::100314224ad.0160:head : candidate 
size 4158512 info size 0 mismatch
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b shard 3512 soid 2:dad01506:::100314224ad.0160:head : candidate 
size 4158512 info size 0 mismatch
2019-11-15 07:52:55.190 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b soid 2:dad01506:::100314224ad.0160:head : failed to pick 
suitable object info
2019-11-15 07:52:55.198 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
scrub 2.b5b 2:dad01506:::100314224ad.0160:head : on disk size 
(4158512) does not match object info size (0) adjusted for ondisk to (0)
2019-11-15 07:53:55.441 7fffcc41f700 -1 log_channel(cluster) log [ERR] : 
2.b5b scrub 4 errors


Finding the file - it turns out to be a very small file:

1100338046125 -rw-r- 1 schen schen 41237 Nov 14 17:18 
/mnt/ceph/users/schen/main/Jellium/3u3d_3D/fort.4321169


We use 4MB stripe size - and it looks like the scrub complains about 
object 0x160, which is way beyond the end of the file (since the file 
should fit in just one object).  Retrieving the object gets an empty one 
- and it looks like all the objects between object 1 and 0x160 also 
exist as empty objects (and object 0 contains the whole, correct file 
contents).  Any ideas why so many empty objects get created beyond the 
end of the file?  Would this be the result of the file being 
overwritten/truncated?  Just for my understanding - if it is truncation, 
is that done by the client, or the MDS?
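
For anyone who wants to reproduce the check, plain rados commands along
these lines show the same thing (the pool name is a placeholder for our
cephfs data pool, and the object names follow the usual
<inode-hex>.<stripe-index> scheme):

$ rados -p <cephfs-data-pool> ls | grep '^100314224ad\.'   # all stripe objects of the inode (slow on a big pool)
$ rados -p <cephfs-data-pool> stat '100314224ad.00000160'  # size RADOS reports for the flagged object
$ rados -p <cephfs-data-pool> get '100314224ad.00000160' /tmp/obj && wc -c /tmp/obj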


Any ideas how the inconsistencies could have come about?  Possibly 
something failed during the file truncation?


Thanks,

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Paul Emmerich
To clear up a few misconceptions here:

* RBD keyrings should use the "profile rbd" permissions, everything
else is *wrong* and should be fixed asap
* Manually adding the blacklist permission might work but isn't
future-proof, fix the keyring instead
* The suggestion to mount them elsewhere to fix this only works
because "elsewhere" probably has an admin keyring, this is a bad
work-around, fix the keyring instead
* This is unrelated to openstack and will happen with *any* reasonably
configured hypervisor that uses exclusive locking

This problem usually happens after upgrading to Luminous without
reading the change log. The change log tells you to adjust the keyring
permissions accordingly
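
For a typical OpenStack-style setup that means something like this (the
client name and pool names are placeholders, adjust them to whatever your
keyrings and pools are actually called):

$ ceph auth caps client.nova \
      mon 'profile rbd' \
      osd 'profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images'

The rbd profile already includes the blacklist permission, so nothing has
to be granted by hand and future cap changes come along for free.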

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface  wrote:
>
> Thanks Simon! I've implemented it, I guess I'll test it out next time my 
> homelab's power dies :-)
>
> On 2019-11-15 10:54 a.m., Simon Ironside wrote:
>
> On 15/11/2019 15:44, Joshua M. Boniface wrote:
>
> Hey All:
>
> I've also quite frequently experienced this sort of issue with my Ceph 
> RBD-backed QEMU/KVM
>
> cluster (not OpenStack specifically). Should this workaround of allowing the 
> 'osd blacklist'
>
> command in the caps help in that scenario as well, or is this an 
> OpenStack-specific
>
> functionality?
>
> Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> required for all RBD clients.
>
> Simon
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-15 Thread DHilsbos
Paul;

I upgraded the cluster in question from 14.2.2 to 14.2.4 just before this came 
up, so that makes sense.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Paul 
Emmerich
Sent: Friday, November 15, 2019 8:48 AM
To: Wido den Hollander
Cc: Ceph Users
Subject: Re: [ceph-users] Large OMAP Object

Note that the size limit changed from 2M keys to 200k keys recently
(14.2.3 or 14.2.2 or something), so that object is probably older and
that's just the first deep scrub with the reduced limit that triggered
the warning.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 4:40 PM Wido den Hollander  wrote:
>
>
>
> On 11/15/19 4:35 PM, dhils...@performair.com wrote:
> > All;
> >
> > Thank you for your help so far.  I have found the log entries from when the 
> > object was found, but don't see a reference to the pool.
> >
> > Here the logs:
> > 2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
> > starts
> > 2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap 
> > object found. Object: 
> > 56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key 
> > count: 380425 Size (bytes): 82896978
> >
>
> In this case it's in pool 56, check 'ceph df' to see which pool that is.
>
> To me this seems like a RGW bucket which index grew too big.
>
> Use:
>
> $ radosgw-admin bucket list
> $ radosgw-admin metadata get bucket:
>
> And match that UUID back to the bucket.
>
> Wido
>
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: Wido den Hollander [mailto:w...@42on.com]
> > Sent: Friday, November 15, 2019 1:56 AM
> > To: Dominic Hilsbos; ceph-users@lists.ceph.com
> > Cc: Stephen Self
> > Subject: Re: [ceph-users] Large OMAP Object
> >
> > Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
> > pool and Object the large Object is in?
> >
> > Wido
> >
> > On 11/15/19 12:23 AM, dhils...@performair.com wrote:
> >> All;
> >>
> >> We had a warning about a large OMAP object pop up in one of our clusters 
> >> overnight.  The cluster is configured for CephFS, but nothing mounts a 
> >> CephFS, at this time.
> >>
> >> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, 
> >> and the MGR log on one of the mons, with no useful references to the pool 
> >> / pg where the large OMAP objects resides.
> >>
> >> Is my only option to find this large OMAP object to go through the OSD 
> >> logs for the individual OSDs in the cluster?
> >>
> >> Thank you,
> >>
> >> Dominic L. Hilsbos, MBA
> >> Director - Information Technology
> >> Perform Air International Inc.
> >> dhils...@performair.com
> >> www.PerformAir.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large OMAP Object

2019-11-15 Thread DHilsbos
Wido;

Ok, yes, I have tracked it down to the index for one of our buckets.  I missed 
the ID in the ceph df output previously.  Next time I'll wait to read replies 
until I've finished my morning coffee.

How would I go about correcting this?

The content for this bucket is basically just junk, as we're still doing 
production qualification, and workflow planning.  Moving from Windows file 
shares to self-hosted cloud storage is a significant undertaking.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Friday, November 15, 2019 8:40 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Large OMAP Object



On 11/15/19 4:35 PM, dhils...@performair.com wrote:
> All;
> 
> Thank you for your help so far.  I have found the log entries from when the 
> object was found, but don't see a reference to the pool.
> 
> Here the logs:
> 2019-11-14 03:10:16.508601 osd.1 (osd.1) 21 : cluster [DBG] 56.7 deep-scrub 
> starts
> 2019-11-14 03:10:18.325881 osd.1 (osd.1) 22 : cluster [WRN] Large omap object 
> found. Object: 
> 56:f7d15b13:::.dir.f91aeff8-a365-47b4-a1c8-928cd66134e8.44130.1:head Key 
> count: 380425 Size (bytes): 82896978
> 

In this case it's in pool 56, check 'ceph df' to see which pool that is.

To me this seems like a RGW bucket which index grew too big.

Use:

$ radosgw-admin bucket list
$ radosgw-admin metadata get bucket:

And match that UUID back to the bucket.

Wido

> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director – Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> 
> 
> 
> -Original Message-
> From: Wido den Hollander [mailto:w...@42on.com] 
> Sent: Friday, November 15, 2019 1:56 AM
> To: Dominic Hilsbos; ceph-users@lists.ceph.com
> Cc: Stephen Self
> Subject: Re: [ceph-users] Large OMAP Object
> 
> Did you check /var/log/ceph/ceph.log on one of the Monitors to see which
> pool and Object the large Object is in?
> 
> Wido
> 
> On 11/15/19 12:23 AM, dhils...@performair.com wrote:
>> All;
>>
>> We had a warning about a large OMAP object pop up in one of our clusters 
>> overnight.  The cluster is configured for CephFS, but nothing mounts a 
>> CephFS, at this time.
>>
>> The cluster mostly uses RGW.  I've checked the cluster log, the MON log, and 
>> the MGR log on one of the mons, with no useful references to the pool / pg 
>> where the large OMAP objects resides.
>>
>> Is my only option to find this large OMAP object to go through the OSD logs 
>> for the individual OSDs in the cluster?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA 
>> Director - Information Technology 
>> Perform Air International Inc.
>> dhils...@performair.com 
>> www.PerformAir.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating from block to lvm

2019-11-15 Thread Mike Cave
Greetings all!

I am looking at upgrading to Nautilus in the near future (currently on Mimic). 
We have a cluster built on 480 OSDs all using multipath and simple block 
devices. I see that the ceph-disk tool is now deprecated and the ceph-volume 
tool doesn’t do everything that ceph-disk did for simple devices (e.g. I’m 
unable to activate a new osd and set the location of wal/block.db, so far as I 
have been able to figure out). So for disk replacements going forward it could 
get ugly.

We deploy/manage using Ceph Ansible.

I’m okay with updating the OSDs to LVM and understand that it will require a 
full rebuild of each OSD.

I was thinking of going OSD by OSD through the cluster until they are all 
completed. However, someone suggested doing an entire node at a time (that 
would be 20 OSDs at a time in this case). Is one method going to be better than 
the other?

Also a question about setting-up LVM: given I’m using multipath devices, do I 
have to preconfigure the LVM devices before running the ansible plays or will 
ansible take care of the LVM setup (even though they are on multipath)?

I would then do the upgrade to Nautilus from Mimic after all the OSDs were 
converted.

I’m looking for opinions on best practices to complete this as I’d like to 
minimize impact to our clients.

Cheers,
Mike Cave
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread vitalif
Use 30 GB for all OSDs. Other values are pointless; see
https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing for why.

You can use the rest of the free NVMe space for bcache - it's much better
than just allocating it to block.db.
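
If you go that route, the rough shape is something like this (an untested
sketch; device names are placeholders and it needs bcache-tools plus a
kernel with bcache support):

$ make-bcache -C /dev/nvme0n1p3        # leftover NVMe space as the cache set
$ make-bcache -B /dev/sdb              # HDD as backing device, appears as /dev/bcache0
$ echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
$ ceph-volume lvm create --data /dev/bcache0 --block.db /dev/nvme0n1p2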

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Full FLash NVME Cluster recommendation

2019-11-15 Thread Nathan Fish
Bluestore will use about 4 cores, but in my experience, the maximum
utilization I've seen has been something like: 100%, 100%, 50%, 50%

So those first 2 cores are the bottleneck for pure OSD IOPS. This sort
of pattern isn't uncommon in multithreaded programs. This was on HDD
OSDs with DB/WAL on NVMe, as well as some small metadata OSDs on pure
NVMe. SSD OSDs default to 2 threads per shard, and HDD to 1, but we
had to set HDD to 2 as well when we enabled NVMe WAL/DB. Otherwise the
OSDs ran out of CPU and failed to heartbeat when under load. I believe
that if we had 50% faster cores, we might not have needed to do this.

On SSDs/NVMe you can compensate for slower cores with more OSDs, but
of course only for parallel operations. Anything that is
serial+synchronous, not so much. I would expect something like 4 OSDs
per NVMe, 4 cores per OSD. That's already 16 cores per node just for
OSDs.

Our bottleneck in practice is the Ceph MDS, which seems to use exactly
2 cores and has no setting to change this. As far as I can tell, if we
had 50% faster cores just for the MDS, I would expect roughly +50%
performance in terms of metadata ops/second. Each filesystem has its
own rank-0 MDS, so this load will be split across daemons. The MDS can
also use a ton of RAM (32GB) if the clients have a working set of 1
million+ files. Multi-MDS exists to further split the load, but is
quite new and I would not trust it. CephFS in general is likely where
you will have the most issues, as it is both new and complex compared to
a simple object store. Having an MDS in standby-replay mode keeps its
RAM cache synced with the active, so you get far faster failover
(O(seconds) rather than O(minutes) with a few million file caps), but
you use the same RAM again.

So, IMHO, you will want at least:
CPU:
16 cores per 1-card NVMe OSD node. 2 cores per filesystem (maybe 1 if
you don't expect a lot of simultaneous load?)

RAM:
The Bluestore default is 4GB per OSD, so 16GB per node.
~32GB of RAM per active and standby-replay MDS if you expect file
counts in the millions, so 64GB per filesystem.

128GB of RAM per node ought to do, if you have less than 14 filesystems?

YMMV.

On Fri, Nov 15, 2019 at 11:17 AM Anthony D'Atri  wrote:
>
> I’ve been trying, unsuccessfully, to convince some folks of the need for fast
> cores; there’s an idea that the effect would be slight.  Do you have any
> numbers?  I’ve also read a claim that each BlueStore OSD will use 3-4 cores.
> They’re listening to me though about splitting the card into multiple OSDs.
>
> > On Nov 15, 2019, at 7:38 AM, Nathan Fish  wrote:
> >
> > In order to get optimal performance out of NVMe, you will want very
> > fast cores, and you will probably have to split each NVMe card into
> > 2-4 OSD partitions in order to throw enough cores at it.
> >
> > On Fri, Nov 15, 2019 at 10:24 AM Yoann Moulin  wrote:
> >>
> >> Hello,
> >>
> >> I'm going to deploy a new cluster soon based on 6.4TB NVME PCI-E Cards, I 
> >> will have only 1 NVME card per node and 38 nodes.
> >>
> >> The use case is to offer cephfs volumes for a k8s platform, I plan to use 
> >> an EC-POOL 8+3 for the cephfs_data pool.
> >>
> >> Do you have recommendations for the setup or mistakes to avoid? I use 
> >> ceph-ansible to deploy all myclusters.
> >>
> >> Best regards,
> >>
> >> --
> >> Yoann Moulin
> >> EPFL IC-IT
> >> ___
> >> ceph-users mailing list -- ceph-us...@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-us...@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Paul Emmerich
You'll have to tell LVM about multi-path, otherwise LVM gets confused.
But that should be the only thing you need to take care of.
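
A sketch of what that usually boils down to in /etc/lvm/lvm.conf (the
filter patterns are placeholders and depend on how your multipath devices
are named):

devices {
    # prefer the dm-multipath devices and hide their component paths
    multipath_component_detection = 1
    global_filter = [ "a|^/dev/mapper/mpath.*|", "r|^/dev/sd.*|" ]
}

With that in place you can point ceph-volume (or create your VG/LVs) at
/dev/mapper/mpathX like any other block device.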

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 6:04 PM Mike Cave  wrote:
>
> Greetings all!
>
>
>
> I am looking at upgrading to Nautilus in the near future (currently on 
> Mimic). We have a cluster built on 480 OSDs all using multipath and simple 
> block devices. I see that the ceph-disk tool is now deprecated and the 
> ceph-volume tool doesn’t do everything that ceph-disk did for simple devices 
> (e.g. I’m unable to activate a new osd and set the location of wal/block.db, 
> so far as I have been able to figure out). So for disk replacements going 
> forward it could get ugly.
>
>
>
> We deploy/manage using Ceph Ansible.
>
>
>
> I’m okay with updating the OSDs to LVM and understand that it will require a 
> full rebuild of each OSD.
>
>
>
> I was thinking of going OSD by OSD through the cluster until they are all 
> completed. However, someone suggested doing an entire node at a time (that 
> would be 20 OSDs at a time in this case). Is one method going to be better 
> than the other?
>
>
>
> Also a question about setting-up LVM: given I’m using multipath devices, do I 
> have to preconfigure the LVM devices before running the ansible plays or will 
> ansible take care of the LVM setup (even though they are on multipath)?
>
>
>
> I would then do the upgrade to Nautilus from Mimic after all the OSDs were 
> converted.
>
>
>
> I’m looking for opinions on best practices to complete this as I’d like to 
> minimize impact to our clients.
>
>
>
> Cheers,
>
> Mike Cave
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Mike Cave
So would you recommend doing an entire node at the same time or per-osd?

 
Senior Systems Administrator

Research Computing Services Team

University of Victoria

O: 250.472.4997

On 2019-11-15, 10:28 AM, "Paul Emmerich"  wrote:

You'll have to tell LVM about multi-path, otherwise LVM gets confused.
But that should be the only thing

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 6:04 PM Mike Cave  wrote:
>
> Greetings all!
>
>
>
> I am looking at upgrading to Nautilus in the near future (currently on 
Mimic). We have a cluster built on 480 OSDs all using multipath and simple 
block devices. I see that the ceph-disk tool is now deprecated and the 
ceph-volume tool doesn’t do everything that ceph-disk did for simple devices 
(e.g. I’m unable to activate a new osd and set the location of wal/block.db, so 
far as I have been able to figure out). So for disk replacements going forward 
it could get ugly.
>
>
>
> We deploy/manage using Ceph Ansible.
>
>
>
> I’m okay with updating the OSDs to LVM and understand that it will 
require a full rebuild of each OSD.
>
>
>
> I was thinking of going OSD by OSD through the cluster until they are all 
completed. However, someone suggested doing an entire node at a time (that 
would be 20 OSDs at a time in this case). Is one method going to be better than 
the other?
>
>
>
> Also a question about setting-up LVM: given I’m using multipath devices, 
do I have to preconfigure the LVM devices before running the ansible plays or 
will ansible take care of the LVM setup (even though they are on multipath)?
>
>
>
> I would then do the upgrade to Nautilus from Mimic after all the OSDs 
were converted.
>
>
>
> I’m looking for opinions on best practices to complete this as I’d like 
to minimize impact to our clients.
>
>
>
> Cheers,
>
> Mike Cave
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Janne Johansson
On Fri, 15 Nov 2019 at 19:40, Mike Cave wrote:

> So would you recommend doing an entire node at the same time or per-osd?
>

You should be able to do it per-OSD (or per-disk in case you run more than
one OSD per disk), to minimize data movement over the network, letting
other OSDs on the same host take a bit of the load while re-making the
disks one by one. You can use "ceph osd reweight <osd-id> 0.0" to make the
particular OSD release its data but still claim it supplies $crush-weight
to the host, meaning the other disks will have to take its data more or
less.
Moving data between disks in the same host usually goes lots faster than
over the network to other hosts.
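
A sketch of the per-OSD cycle (osd id 42 and the device names are just
examples; the exact ceph-volume invocation depends on your layout):

$ ceph osd reweight 42 0.0                    # drain PGs off osd.42, mostly onto its host peers
# wait until "ceph -s" shows no misplaced or degraded objects, then:
$ systemctl stop ceph-osd@42
$ ceph osd destroy 42 --yes-i-really-mean-it  # keeps the id and crush entry for reuse
$ ceph-volume lvm create --osd-id 42 --data <new-lv-or-device> --block.db <db-lv>

Once the rebuilt osd.42 is up and backfilled, check "ceph osd tree" and
set its reweight back to 1.0 if it did not come back at 1.0 already.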

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Martin Verges
I would consider doing it host by host, as you should always be able
to handle the complete loss of a node. This would be much faster in the end,
as you save a lot of time not migrating data back and forth. However, this
can lead to problems if your cluster is not configured in line with the
hardware performance available.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 15 Nov 2019 at 20:46, Janne Johansson <icepic...@gmail.com> wrote:

> On Fri, 15 Nov 2019 at 19:40, Mike Cave wrote:
>
>> So would you recommend doing an entire node at the same time or per-osd?
>>
>
> You should be able to do it per-OSD (or per-disk in case you run more than
> one OSD per disk), to minimize data movement over the network, letting
> other OSDs on the same host take a bit of the load while re-making the
> disks one by one. You can use "ceph osd reweight  0.0" to make the
> particular OSD release its data but still claim it supplies $crush-weight
> to the host, meaning the other disks will have to take its data more or
> less.
> Moving data between disks in the same host usually goes lots faster than
> over the network to other hosts.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Mike Cave
Good points, thank you for the insight.

Given that I’m hosting the journals (WAL/block.db) on SSDs, would I need to redo 
all the OSDs hosted on each journal SSD at the same time? I’m fairly sure this 
would be the case.


Senior Systems Administrator
Research Computing Services Team
University of Victoria
O: 250.472.4997

From: Janne Johansson 
Date: Friday, November 15, 2019 at 11:46 AM
To: Cave Mike 
Cc: Paul Emmerich , ceph-users 

Subject: Re: [ceph-users] Migrating from block to lvm

On Fri, 15 Nov 2019 at 19:40, Mike Cave <mc...@uvic.ca> wrote:
So would you recommend doing an entire node at the same time or per-osd?

You should be able to do it per-OSD (or per-disk in case you run more than one 
OSD per disk), to minimize data movement over the network, letting other OSDs 
on the same host take a bit of the load while re-making the disks one by one. 
You can use "ceph osd reweight  0.0" to make the particular OSD release 
its data but still claim it supplies $crush-weight to the host, meaning the 
other disks will have to take its data more or less.
Moving data between disks in the same host usually goes lots faster than over 
the network to other hosts.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating from block to lvm

2019-11-15 Thread Mike Cave
Losing a node is not a big deal for us (dual bonded 10G connection to each 
node).

I’m thinking:

  1.  Drain node
  2.  Redeploy with Ceph Ansible

It would require much less hands-on time for our group.

I know the churn on the cluster would be high, which was my only concern.

Mike


Senior Systems Administrator
Research Computing Services Team
University of Victoria

From: Martin Verges 
Date: Friday, November 15, 2019 at 11:52 AM
To: Janne Johansson 
Cc: Cave Mike , ceph-users 
Subject: Re: [ceph-users] Migrating from block to lvm

I would consider doing it host-by-host wise, as you should always be able to 
handle the complete loss of a node. This would be much faster in the end as you 
save a lot of time not migrating data back and forth. However this can lead to 
problems if your cluster is not configured according to the hardware 
performance given.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 15 Nov 2019 at 20:46, Janne Johansson <icepic...@gmail.com> wrote:
On Fri, 15 Nov 2019 at 19:40, Mike Cave <mc...@uvic.ca> wrote:
So would you recommend doing an entire node at the same time or per-osd?

You should be able to do it per-OSD (or per-disk in case you run more than one 
OSD per disk), to minimize data movement over the network, letting other OSDs 
on the same host take a bit of the load while re-making the disks one by one. 
You can use "ceph osd reweight  0.0" to make the particular OSD release 
its data but still claim it supplies $crush-weight to the host, meaning the 
other disks will have to take its data more or less.
Moving data between disks in the same host usually goes lots faster than over 
the network to other hosts.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com