[ceph-users] Re: How to let osd up which is down

2021-11-22 Thread Janne Johansson
On Mon, 22 Nov 2021 at 06:52, GHui  wrote:
>
> I have done "systemctl restart ceph.target", but the osd service is not started.
> It's strange that osd.2 is up, but I can't find the osd service up,
> or the osd container up.
> [root@GHui cephconfig]# ceph osd df
> ID  CLASS  WEIGHT   REWEIGHT  SIZE  RAW USE  DATA  OMAP  META  AVAIL  %USE  VAR   PGS  STATUS
>  0    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
>  1    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
>  4    ssd  0.36389   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
>  2    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0      up
>  3    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
>  5    ssd  0.36389   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
>                        TOTAL   0 B      0 B   0 B   0 B   0 B    0 B     0
> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
> I would very much appreciate any advice.

Looks a bit like what happens when you create your OSDs pointing to sda, sdb
and so on, then reboot: the system assigns new letters (or new numbers for
dm-1, dm-2, ...) and the links under /var/lib/ceph/osd/*/ now point to the
wrong devices.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to let osd up which is down

2021-11-22 Thread Janne Johansson
On Mon, 22 Nov 2021 at 09:03, Janne Johansson  wrote:
>
> On Mon, 22 Nov 2021 at 06:52, GHui  wrote:
> >
> > I have done "systemctl restart ceph.target", but the osd service is not
> > started.
> > It's strange that osd.2 is up, but I can't find the osd service up,
> > or the osd container up.
> > [root@GHui cephconfig]# ceph osd df
> > ID  CLASS  WEIGHT   REWEIGHT  SIZE  RAW USE  DATA  OMAP  META  AVAIL  %USE  VAR   PGS  STATUS
> >  0    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
> >  1    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
> >  4    ssd  0.36389   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
> >  2    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0      up
> >  3    ssd  1.74660   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
> >  5    ssd  0.36389   1.00000   0 B      0 B   0 B   0 B   0 B    0 B     0  1.00    0    down
> >                        TOTAL   0 B      0 B   0 B   0 B   0 B    0 B     0
> > MIN/MAX VAR: 1.00/1.00  STDDEV: 0
> > I would very much appreciate any advice.
>
> Looks a bit like what happens when you create your OSDs pointing to sda, sdb
> and so on, then reboot: the system assigns new letters (or new numbers for
> dm-1, dm-2, ...) and the links under /var/lib/ceph/osd/*/ now point to the
> wrong devices.
>
> --
> May the most significant bit of your life be positive.


And to "fix" this you need to repair all the links, for example:
lrwxrwxrwx. 1 ceph ceph 48 May 13  2019 block -> /dev/mapper/5785afc3-bfbb-47b8-8343-4a532888b912

Perhaps "ceph-volume lvm list" and/or "ceph-volume inventory" can help
identify which raw drive was related to which osd-number.
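Roughly something like this (an untested sketch, adjust to your setup; under a
cephadm/container deployment run it inside "cephadm shell"):

ceph-volume lvm list            # shows which LV/raw device backs which OSD id
ceph-volume inventory           # overview of the disks the host can see
ceph-volume lvm activate --all  # recreates the tmpfs mounts and block symlinks

After that the per-OSD services (ceph-osd@<id> units or the OSD containers)
should be able to start again.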

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-22 Thread 胡 玮文
Hi David,

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

I think this is the reason. Although the page describes an erasure-coded
pool, I think it also applies to replicated pools. You may check that page
and try the steps described there.
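The steps there boil down to raising the retry count in the CRUSH rule,
roughly like this (a sketch based on that page, not tested on your cluster):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: in the rule used by the affected pool, add/raise
#   step set_choose_tries 100
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new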

Weiwen Hu

From: David Tinker
Sent: 22 November 2021 15:13
To: Stefan Kooman
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

I set osd.7 as "in", uncordoned the node, scaled the OSD deployment back up
and things are recovering with cluster status HEALTH_OK.

I found this message from the archives:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47071.html

"You have a large difference in the capacities of the nodes. This results in
a different host weight, which in turn might lead to problems with the
crush algorithm. It is not able to get three different hosts for OSD placement
for some of the PGs.

CEPH and crush do not cope well with heterogeneous setups. I would suggest
to move one of the OSDs from host ceph1 to ceph4 to equalize the host
weight."

My nodes do have very different weights. What I am trying to do is
re-install each node in the cluster so they all have the same amount of
space for Ceph (much less than before .. we need more space for hostpath
stuff).

# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                        STATUS  REWEIGHT  PRI-AFF
 -1         13.77573  root default
 -5         13.77573      region FSN1
-22          0.73419          zone FSN1-DC13
-21                0              host node5-redacted-com
-27          0.73419              host node7-redacted-com
  1    ssd   0.36710                  osd.1                up   1.00000  1.00000
  5    ssd   0.36710                  osd.5                up   1.00000  1.00000
-10          6.20297          zone FSN1-DC14
 -9          6.20297              host node3-redacted-com
  2    ssd   3.10149                  osd.2                up   1.00000  1.00000
  4    ssd   3.10149                  osd.4                up   1.00000  1.00000
-18          3.19919          zone FSN1-DC15
-17          3.19919              host node4-redacted-com
  7    ssd   3.19919                  osd.7              down         0  1.00000
 -4          2.90518          zone FSN1-DC16
 -3          2.90518              host node1-redacted-com
  0    ssd   1.45259                  osd.0                up   1.00000  1.00000
  3    ssd   1.45259                  osd.3                up   1.00000  1.00000
-14          0.73419          zone FSN1-DC18
-13                0              host node2-redacted-com
-25          0.73419              host node6-redacted-com
 10    ssd   0.36710                  osd.10               up   1.00000  1.00000
 11    ssd   0.36710                  osd.11               up   1.00000  1.00000


Should I just change the weights before/after removing OSD 7?

With something like "ceph osd crush reweight osd.7 1.0"?

Thanks


On Thu, Nov 18, 2021 at 9:41 PM Stefan Kooman  wrote:

> On 11/18/21 17:08, David Tinker wrote:
> > Would it be worth setting the OSD I removed back to "in" (or whatever
> > the opposite of "out") is and seeing if things recovered?
>
> ceph osd in osd.7 would that be. It shouldn't hurt. But I really don't
> understand why this won't resolve itself.
>
> If this gets it fixed you might want to try the pgremapper "drain" command
> from [1]. And when that is done set the osd out.
>
> Gr. Stefan
>
> [1]: https://github.com/digitalocean/pgremapper/
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Is ceph itself a single point of failure?

2021-11-22 Thread Marius Leustean
Many of us deploy ceph as a solution to storage high-availability.

During the time, I've encountered a couple of moments when ceph refused to
deliver I/O to VMs even when a tiny part of the PGs were stuck in
non-active states due to challenges on the OSDs.
So I found myself in very unpleasant situations when an entire cluster went
down because of 1 single node, even if that cluster was supposed to be
fault-tolerant.

Regardless of the reason, the cluster itself can be a single point of
failure, even if it has a lot of nodes.

How do you segment your deployments so that your business doesn't
get jeopardised in the case when your ceph cluster misbehaves?

Does anyone even use ceph for very large clusters, or do you prefer to
separate everything into smaller clusters?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs Maximum number of files supported

2021-11-22 Thread Nigel Williams
On Sat, 20 Nov 2021 at 02:26, Yan, Zheng  wrote:

> we have FS contain more than 40 billions small files.
>

That is an impressive statistic! Are you able to share the output of ceph
-s / ceph df /etc to get an idea of your cluster deployment?

thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is ceph itself a single point of failure?

2021-11-22 Thread Marc
> 
> Many of us deploy ceph as a solution to storage high-availability.
> 
> During the time, I've encountered a couple of moments when ceph refused
> to
> deliver I/O to VMs even when a tiny part of the PGs were stuck in
> non-active states due to challenges on the OSDs.

I do not know what you mean by this; you can tune this with your min size and
replication. It is hard to believe that the drives backing exactly the same pg
all fail at once. I wonder if this is not more related to your 'non-default' config?

> So I found myself in very unpleasant situations when an entire cluster
> went
> down because of 1 single node, even if that cluster was supposed to be
> fault-tolerant.

That is also very hard to believe, since I am updating ceph and rebooting one node 
at a time, which goes just fine.

> 
> Regardless of the reason, the cluster itself can be a single point of
> failure, even if it's has a lot of nodes.

Indeed, like the data center, and like the planet. The question you should ask 
yourself is: do you have a better alternative? In the 3-4 years I have been using 
ceph, I have not found a better alternative (and I am not looking for one ;))

> How do you segment your deployments so that your business doesn't
> get jeopardised in the case when your ceph cluster misbehaves?
> 
> Does anyone even use ceph for a very large clusters, or do you prefer to
> separate everything into smaller clusters?

If you would read and investigate, you would not need to ask this question. 
Is your lack of knowledge of ceph maybe a critical issue? I know the ceph 
organization likes to make everything as simple as possible for everyone. But 
this has of course its flip side when users run into serious issues.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is ceph itself a single point of failure?

2021-11-22 Thread Marius Leustean
> I do not know what you mean by this, you can tune this with your min size
and replication. It is hard to believe that exactly harddrives fail in the
same pg. I wonder if this is not more related to your 'non-default' config?

In my setup size=2 and min_size=1. I had cases when 1 PG being stuck in
peering state was causing all the VMs in that pool to not get any I/O. My
setup is really "default", deployed with minimal config changes derived
from ceph-ansible and with an even number of OSDs per host.

> That is also very hard to believe, since I am updating ceph and reboot
one node at time, which is just going fine.

Real case: a host goes down, individual OSDs on other hosts start
consuming >100GB RAM during backfill and get OOM-killed (but hey, the
documentation says that "provisioning ~8GB per BlueStore OSD is advised.")
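If backfill memory is the problem, the usual knobs are the per-OSD memory
target and the backfill concurrency. A rough sketch (the values are only
examples, not a recommendation for your hardware):

ceph config set osd osd_memory_target 4294967296   # ~4 GiB per OSD
ceph config set osd osd_max_backfills 1            # throttle concurrent backfills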

> If you would read and investigate, you would not need to ask this
question.

I was thinking of getting insights on other people's environments, thus
asking questions :)

> Is your lack of knowledge of ceph maybe a critical issue?

I'm just that poor guy reading and understanding the official documentation
and lists, but getting hit by real-world ceph.

On Mon, Nov 22, 2021 at 12:23 PM Marc  wrote:

> >
> > Many of us deploy ceph as a solution to storage high-availability.
> >
> > During the time, I've encountered a couple of moments when ceph refused
> > to
> > deliver I/O to VMs even when a tiny part of the PGs were stuck in
> > non-active states due to challenges on the OSDs.
>
> I do not know what you mean by this, you can tune this with your min size
> and replication. It is hard to believe that exactly harddrives fail in the
> same pg. I wonder if this is not more related to your 'non-default' config?
>
> > So I found myself in very unpleasant situations when an entire cluster
> > went
> > down because of 1 single node, even if that cluster was supposed to be
> > fault-tolerant.
>
> That is also very hard to believe, since I am updating ceph and reboot one
> node at time, which is just going fine.
>
> >
> > Regardless of the reason, the cluster itself can be a single point of
> > failure, even if it's has a lot of nodes.
>
> Indeed, like the data center, and like the planet. The question you should
> ask yourself, do you have a better alternative? For the 3-4 years I have
> been using ceph, I did not find a better alternative (also not looking for
> it ;))
>
> > How do you segment your deployments so that your business doesn't
> > get jeopardised in the case when your ceph cluster misbehaves?
> >
> > Does anyone even use ceph for a very large clusters, or do you prefer to
> > separate everything into smaller clusters?
>
> If you would read and investigate, you would not need to ask this
> question.
> Is your lack of knowledge of ceph maybe a critical issue? I know the ceph
> organization likes to make everything as simple as possible for everyone.
> But this has of course its flip side when users run into serious issues.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is ceph itself a single point of failure?

2021-11-22 Thread Janne Johansson
On Mon, 22 Nov 2021 at 11:40, Marius Leustean  wrote:
> > I do not know what you mean by this, you can tune this with your min size
> and replication. It is hard to believe that exactly harddrives fail in the
> same pg. I wonder if this is not more related to your 'non-default' config?
>
> In my setup size=2 and min_size=1. I had cases when 1 PG being stuck in
> peering state was causing all the VMs in that pool to not get any I/O. My
> setup is really "default", deployed with minimal config changes derived
> from ceph-ansible and with even number of OSDs per host.

No no, the default is repl=3, min_size=2, for the very reason that you need
to be able to continue when one OSD is down. You put yourself into this
position by reducing the safety margin, and ceph reacted by stopping writes
rather than allowing you to lose data.

If you were afraid of losing access, you should have tuned it in the other
direction instead: repl=4 or 5 with min_size=2 or 3, so you could lose two
drives and still recover/continue.
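For reference, checking and changing this per pool is something like the
following (a sketch; substitute your pool name, and note that raising size
triggers data movement):

ceph osd pool get <pool> size
ceph osd pool get <pool> min_size
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2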

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is ceph itself a single point of failure?

2021-11-22 Thread Martin Verges
> In my setup size=2 and min_size=1

just don't.

> Real case: host goes down, individual OSDs from other hosts started
consuming >100GB RAM during backfill and get OOM-killed

configuring your cluster in a better way can help

There will never be a single system so redundant that it has 100% uptime.
And as you can see on a regular basis, even big corps like facebook seem to
have outages of their highly redundant systems. But there is a difference
between losing data and being unable to access your data for a short period.
You can design Ceph to be super redundant, to not lose data, and to keep
running without downtime even if one datacenter burns down. But this all
comes with costs, sometimes quite high costs. Often it's cheaper to live with
a short interruption, or to build 2 separate systems, than to add more nines
to the availability of a single one.

--
Martin Verges
Managing director

Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


On Mon, 22 Nov 2021 at 11:40, Marius Leustean  wrote:

> > I do not know what you mean by this, you can tune this with your min size
> and replication. It is hard to believe that exactly harddrives fail in the
> same pg. I wonder if this is not more related to your 'non-default' config?
>
> In my setup size=2 and min_size=1. I had cases when 1 PG being stuck in
> peering state was causing all the VMs in that pool to not get any I/O. My
> setup is really "default", deployed with minimal config changes derived
> from ceph-ansible and with even number of OSDs per host.
>
> > That is also very hard to believe, since I am updating ceph and reboot
> one node at time, which is just going fine.
>
> Real case: host goes down, individual OSDs from other hosts started
> consuming >100GB RAM during backfill and get OOM-killed (but hey,
> documentation says that "provisioning ~8GB per BlueStore OSD is advised.")
>
> > If you would read and investigate, you would not need to ask this
> question.
>
> I was thinking of getting insights on other people's environments, thus
> asking questions :)
>
> > Is your lack of knowledge of ceph maybe a critical issue?
>
> I'm just that poor guy reading and understanding the official documentation
> and lists, but getting hit by the real world ceph.
>
> On Mon, Nov 22, 2021 at 12:23 PM Marc  wrote:
>
> > >
> > > Many of us deploy ceph as a solution to storage high-availability.
> > >
> > > During the time, I've encountered a couple of moments when ceph refused
> > > to
> > > deliver I/O to VMs even when a tiny part of the PGs were stuck in
> > > non-active states due to challenges on the OSDs.
> >
> > I do not know what you mean by this, you can tune this with your min size
> > and replication. It is hard to believe that exactly harddrives fail in
> the
> > same pg. I wonder if this is not more related to your 'non-default'
> config?
> >
> > > So I found myself in very unpleasant situations when an entire cluster
> > > went
> > > down because of 1 single node, even if that cluster was supposed to be
> > > fault-tolerant.
> >
> > That is also very hard to believe, since I am updating ceph and reboot
> one
> > node at time, which is just going fine.
> >
> > >
> > > Regardless of the reason, the cluster itself can be a single point of
> > > failure, even if it's has a lot of nodes.
> >
> > Indeed, like the data center, and like the planet. The question you
> should
> > ask yourself, do you have a better alternative? For the 3-4 years I have
> > been using ceph, I did not find a better alternative (also not looking
> for
> > it ;))
> >
> > > How do you segment your deployments so that your business doesn't
> > > get jeopardised in the case when your ceph cluster misbehaves?
> > >
> > > Does anyone even use ceph for a very large clusters, or do you prefer
> to
> > > separate everything into smaller clusters?
> >
> > If you would read and investigate, you would not need to ask this
> > question.
> > Is your lack of knowledge of ceph maybe a critical issue? I know the ceph
> > organization likes to make everything as simple as possible for everyone.
> > But this has of course its flip side when users run into serious issues.
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is ceph itself a single point of failure?

2021-11-22 Thread Eino Tuominen
On Monday, November 22, 2021 at 12:39 Marius Leustean  
wrote:

> In my setup size=2 and min_size=1.

I'm sorry, but that's the root cause of the problems you're seeing. You really 
want size=3, min_size=2 for your production cluster unless you have some 
specific uncommon use case and you really know what you're doing.

-- 
  Eino Tuominen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: have buckets with low number of shards

2021-11-22 Thread mahnoosh shahidi
 Can anyone help me with these questions?

On Sun, Nov 21, 2021 at 11:23 AM mahnoosh shahidi 
wrote:

> Hi,
>
> Running a cluster on octopus 15.2.12. We have a big bucket with about 800M
> objects, and resharding this bucket causes many slow ops on our bucket index
> osds. I want to know what happens if I don't reshard this bucket any more.
> How does it affect performance? Would the performance problem be limited to
> that bucket, or does it affect the entire bucket index pool?
>
> Regards,
> Mahnoosh
>
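For anyone looking at the same question: the current shard count and object
count per bucket, and whether RGW thinks a bucket is over the
objects-per-shard threshold, can be checked with something like this (a
sketch; the bucket name is a placeholder):

radosgw-admin bucket stats --bucket=<bucket>   # num_shards and num_objects
radosgw-admin bucket limit check               # flags buckets exceeding the threshold
radosgw-admin reshard status --bucket=<bucket>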
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-22 Thread Manuel Lausch
Yeah, the starting patch works.
The stopping side is still missing, though. Do you already have patches for
https://tracker.ceph.com/issues/53327 that I could test?

Thanks,
Manuel
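
For anyone following along: the two settings Sage refers to in the quoted
message below can be checked and set cluster-wide roughly like this (a
sketch, verify the option names against your release):

ceph osd dump | grep require_osd_release
ceph config set osd osd_fast_shutdown true
ceph config set osd osd_fast_shutdown_notify_mon true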


On Fri, 19 Nov 2021 11:56:41 +0100
Manuel Lausch  wrote:

> Nice. I am building a 16.2.6 release with this patch right now and will
> test it.
> 
> Thanks,
> Manuel
> 
> 
> On Thu, 18 Nov 2021 15:02:38 -0600
> Sage Weil  wrote:
> 
> > Okay, good news: on the osd start side, I identified the bug (and easily
> > reproduced locally).  The tracker and fix are:
> > 
> >  https://tracker.ceph.com/issues/53326
> >  https://github.com/ceph/ceph/pull/44015
> > 
> > These will take a while to work through QA and get backported.
> > 
> > Also, to reiterate what I said on the call earlier today about the osd
> > stopping issues:
> >  - A key piece of the original problem you were seeing was because
> > require_osd_release wasn't up to date, which meant that the the dead_epoch
> > metadata wasn't encoded in the OSDMap and we would basically *always* go
> > into the read lease wait when an OSD stopped.
> >  - Now that that is fixed, it appears as though setting both
> > osd_fast_shutdown *and* osd_fast_shutdown_notify_mon is the winning
> > combination.
> > 
> > I would be curious to hear if adjusting the icmp throttle kernel setting
> > makes things behave better when osd_fast_shutdown_notify_mon=false (the
> > default), but this is more out of curiosity--I think we've concluded that
> > we should set this option to true by default.
> > 
> > If I'm missing anything, please let me know!
> > 
> > Thanks for your patience in tracking this down.  It's always a bit tricky
> > when there are multiple contributing factors (in this case, at least 3).
> > 
> > sage
> > 
> > 
> >   
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] SATA SSD recommendations.

2021-11-22 Thread Luke Hall

Hello,

We are looking to replace the 36 aging 4TB HDDs in our 6 OSD machines 
with 36x 4TB SATA SSDs.


There's obviously a big range of prices for large SSDs so I would 
appreciate any recommendations of Manufacturer/models to consider/avoid.


I expect the balance to be between

price/performance/durability

Thanks in advance for any advice offered.

Luke

--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Martin Verges
As the price for SSDs is the same regardless of the interface, I would not
invest so much money in a still slow and outdated platform.
Just buy some new chassis as well and go NVMe. It adds only a little cost
but will increase performance drastically.

--
Martin Verges
Managing director

Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


On Mon, 22 Nov 2021 at 13:57, Luke Hall  wrote:

> Hello,
>
> We are looking to replace the 36 aging 4TB HDDs in our 6 OSD machines
> with 36x 4TB SATA SSDs.
>
> There's obviously a big range of prices for large SSDs so I would
> appreciate any recommendations of Manufacturer/models to consider/avoid.
>
> I expect the balance to be between
>
> price/performance/durability
>
> Thanks in advance for any advice offered.
>
> Luke
>
> --
> All postal correspondence to:
> The Positive Internet Company, 24 Ganton Street, London. W1F 7QY
>
> *Follow us on Twitter* @posipeople
>
> The Positive Internet Company Limited is registered in England and Wales.
> Registered company number: 3673639. VAT no: 726 7072 28.
> Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-22 Thread David Tinker
Yes it is on:

# ceph balancer status
{
"active": true,
"last_optimize_duration": "0:00:00.001867",
"last_optimize_started": "Mon Nov 22 13:10:24 2021",
"mode": "upmap",
"optimize_result": "Unable to find further optimization, or pool(s)
pg_num is decreasing, or distribution is already perfect",
"plans": []
}

On Mon, Nov 22, 2021 at 10:17 AM Stefan Kooman  wrote:

> On 11/22/21 08:12, David Tinker wrote:
> > I set osd.7 as "in", uncordoned the node, scaled the OSD deployment back
> > up and things are recovering with cluster status HEALTH_OK.
> >
> > I found this message from the archives:
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47071.html
> > 
> >
> > "You have a large difference in the capacities of the nodes. This
> > results in a different host weight, which in turn might lead to problems
> > with the crush algorithm. It is not able to get three different hosts for
> > OSD placement for some of the PGs.
> >
> > CEPH and crush do not cope well with heterogeneous setups. I would suggest
> > to move one of the OSDs from host ceph1 to ceph4 to equalize the host
> > weight."
> >
> > My nodes do have very different weights. What I am trying to do is
> > re-install each node in the cluster so they all have the same amount of
> > space for Ceph (much less than before .. we need more space for hostpath
> > stuff).
> >
> > # ceph osd tree
> > ID   CLASS  WEIGHT    TYPE NAME                        STATUS  REWEIGHT  PRI-AFF
> >  -1         13.77573  root default
> >  -5         13.77573      region FSN1
> > -22          0.73419          zone FSN1-DC13
> > -21                0              host node5-redacted-com
> > -27          0.73419              host node7-redacted-com
> >   1    ssd   0.36710                  osd.1                up   1.00000  1.00000
> >   5    ssd   0.36710                  osd.5                up   1.00000  1.00000
> > -10          6.20297          zone FSN1-DC14
> >  -9          6.20297              host node3-redacted-com
> >   2    ssd   3.10149                  osd.2                up   1.00000  1.00000
> >   4    ssd   3.10149                  osd.4                up   1.00000  1.00000
> > -18          3.19919          zone FSN1-DC15
> > -17          3.19919              host node4-redacted-com
> >   7    ssd   3.19919                  osd.7              down         0  1.00000
> >  -4          2.90518          zone FSN1-DC16
> >  -3          2.90518              host node1-redacted-com
> >   0    ssd   1.45259                  osd.0                up   1.00000  1.00000
> >   3    ssd   1.45259                  osd.3                up   1.00000  1.00000
> > -14          0.73419          zone FSN1-DC18
> > -13                0              host node2-redacted-com
> > -25          0.73419              host node6-redacted-com
> >  10    ssd   0.36710                  osd.10               up   1.00000  1.00000
> >  11    ssd   0.36710                  osd.11               up   1.00000  1.00000
> >
> >
> > Should I just change the weights before/after removing OSD 7?
> >
> > With something like "ceph osd crush reweight osd.7 1.0"?
>
> The ceph balancer is there to balance PGs across all nodes. Do you have
> it enabled?
>
> ceph balancer status
>
> The most efficient way is to use mode upmap (should work with modern
> clients):
>
> ceph balancer mode upmap
>
> Gr. Stefn
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Luke Hall

On 22/11/2021 12:59, Martin Verges wrote:
As the price for SSDs is the same regardless of the interface, I would 
not invest so much money in a still slow and outdated platform.
Just buy some new chassis as well and go NVMe. It adds only a little 
cost but will increase performance drastically.


Thanks, but replacing these chassis is something we will look to do in 
perhaps a year's time. For now we need a stop-gap, and switching to 
SATA SSDs is the easiest option.


On Mon, 22 Nov 2021 at 13:57, Luke Hall > wrote:


Hello,

We are looking to replace the 36 aging 4TB HDDs in our 6 OSD machines
with 36x 4TB SATA SSDs.

There's obviously a big range of prices for large SSDs so I would
appreciate any recommendations of Manufacturer/models to consider/avoid.

I expect the balance to be between

price/performance/durability

Thanks in advance for any advice offered.

Luke

-- 
All postal correspondence to:

The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and
Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts,
EN4 9EE.
___
ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io




--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-22 Thread David Tinker
I just had a look at the balancer docs and it says "No adjustments will be
made to the PG distribution if the cluster is degraded (e.g., because an
OSD has failed and the system has not yet healed itself).". That implies
that the balancer won't run until the disruption caused by the removed OSD
has been sorted out?

On Mon, Nov 22, 2021 at 3:12 PM David Tinker  wrote:

> Yes it is on:
>
> # ceph balancer status
> {
> "active": true,
> "last_optimize_duration": "0:00:00.001867",
> "last_optimize_started": "Mon Nov 22 13:10:24 2021",
> "mode": "upmap",
> "optimize_result": "Unable to find further optimization, or pool(s)
> pg_num is decreasing, or distribution is already perfect",
> "plans": []
> }
>
> On Mon, Nov 22, 2021 at 10:17 AM Stefan Kooman  wrote:
>
>> On 11/22/21 08:12, David Tinker wrote:
>> > I set osd.7 as "in", uncordoned the node, scaled the OSD deployment back
>> > up and things are recovering with cluster status HEALTH_OK.
>> >
>> > I found this message from the archives:
>> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47071.html
>> > 
>> >
>> > "You have a large difference in the capacities of the nodes. This
>> > results in a different host weight, which in turn might lead to problems
>> > with the crush algorithm. It is not able to get three different hosts for
>> > OSD placement for some of the PGs.
>> >
>> > CEPH and crush do not cope well with heterogeneous setups. I would suggest
>> > to move one of the OSDs from host ceph1 to ceph4 to equalize the host
>> > weight."
>> >
>> > My nodes do have very different weights. What I am trying to do is
>> > re-install each node in the cluster so they all have the same amount of
>> > space for Ceph (much less than before .. we need more space for
>> hostpath
>> > stuff).
>> >
>> > # ceph osd tree
>> > ID   CLASS  WEIGHT    TYPE NAME                        STATUS  REWEIGHT  PRI-AFF
>> >  -1         13.77573  root default
>> >  -5         13.77573      region FSN1
>> > -22          0.73419          zone FSN1-DC13
>> > -21                0              host node5-redacted-com
>> > -27          0.73419              host node7-redacted-com
>> >   1    ssd   0.36710                  osd.1                up   1.00000  1.00000
>> >   5    ssd   0.36710                  osd.5                up   1.00000  1.00000
>> > -10          6.20297          zone FSN1-DC14
>> >  -9          6.20297              host node3-redacted-com
>> >   2    ssd   3.10149                  osd.2                up   1.00000  1.00000
>> >   4    ssd   3.10149                  osd.4                up   1.00000  1.00000
>> > -18          3.19919          zone FSN1-DC15
>> > -17          3.19919              host node4-redacted-com
>> >   7    ssd   3.19919                  osd.7              down         0  1.00000
>> >  -4          2.90518          zone FSN1-DC16
>> >  -3          2.90518              host node1-redacted-com
>> >   0    ssd   1.45259                  osd.0                up   1.00000  1.00000
>> >   3    ssd   1.45259                  osd.3                up   1.00000  1.00000
>> > -14          0.73419          zone FSN1-DC18
>> > -13                0              host node2-redacted-com
>> > -25          0.73419              host node6-redacted-com
>> >  10    ssd   0.36710                  osd.10               up   1.00000  1.00000
>> >  11    ssd   0.36710                  osd.11               up   1.00000  1.00000
>> >
>> >
>> > Should I just change the weights before/after removing OSD 7?
>> >
>> > With something like "ceph osd crush reweight osd.7 1.0"?
>>
>> The ceph balancer is there to balance PGs across all nodes. Do you have
>> it enabled?
>>
>> ceph balancer status
>>
>> The most efficient way is to use mode upmap (should work with modern
>> clients):
>>
>> ceph balancer mode upmap
>>
>> Gr. Stefn
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread mj

Hi,

We were in the same position as you, and replaced our 24 4TB hard disks 
with Samsung PM883 SSDs.


They seem to work quite nicely, and their wearout (after one year) is 
still at 1% for our use.


MJ

Op 22-11-2021 om 13:57 schreef Luke Hall:

Hello,

We are looking to replace the 36 aging 4TB HDDs in our 6 OSD machines 
with 36x 4TB SATA SSDs.


There's obviously a big range of prices for large SSDs so I would 
appreciate any recommendations of Manufacturer/models to consider/avoid.


I expect the balance to be between

price/performance/durability

Thanks in advance for any advice offered.

Luke


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Luke Hall

On 22/11/2021 15:18, mj wrote:

Hi,

We were in the same position as you, and replaced our 24 4TB harddisks 
with Samsung PM883 .


They seem to work quite nicely, and their wearout (after one year) is 
still at 1% for our use.


Thanks, that's really useful to know.


Op 22-11-2021 om 13:57 schreef Luke Hall:

Hello,

We are looking to replace the 36 aging 4TB HDDs in our 6 OSD machines 
with 36x 4TB SATA SSDs.


There's obviously a big range of prices for large SSDs so I would 
appreciate any recommendations of Manufacturer/models to consider/avoid.


I expect the balance to be between

price/performance/durability

Thanks in advance for any advice offered.

Luke


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Peter Lieven
Am 22.11.21 um 16:25 schrieb Luke Hall:
> On 22/11/2021 15:18, mj wrote:
>> Hi,
>>
>> We were in the same position as you, and replaced our 24 4TB harddisks with 
>> Samsung PM883 .
>>
>> They seem to work quite nicely, and their wearout (after one year) is still 
>> at 1% for our use.
>
> Thanks, that's really useful to know.


Whatever SSD you choose, check that it supports power-loss protection, and make 
sure you disable the volatile write cache.
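
On SATA drives that is roughly (a sketch; the device name is a placeholder,
and the setting may need to be reapplied after a power cycle, e.g. via a udev
rule):

smartctl -g wcache /dev/sdX   # show the current write-cache state
hdparm -W 0 /dev/sdX          # disable the volatile write cache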


Peter



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG states question and improving peering times

2021-11-22 Thread Stephen Smith6
I believe this is a fairly straight-forward question, but is it true that any PG not in "active+..." (Peering, down, etc.) blocks writes to the entire pool? I'm also wondering if there are methods for improving peering times for placement groups.
 
Thanks,
Eric

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Annoying MDS_CLIENT_RECALL Warning

2021-11-22 Thread 胡 玮文
Thanks Patrick and Dan,

I conclude that it is caused by a large number of inotify watches that keep 
the inodes from being evicted. I used this script [1] and found that the number 
of watches matched the num_caps. And if I kill the process (VS Code server in 
our case) holding the inotify instance, the caps can be released.

So there is not much I can do. Maybe tell users not to open a huge directory 
with vs code, or just increase “mds_min_caps_working_set”.

[1]: 
https://github.com/fatso83/dotfiles/blob/master/utils/scripts/inotify-consumers
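
If you go the config route, that would be something like (a sketch; the value
is only an example, the default is lower):

ceph config set mds mds_min_caps_working_set 20000
ceph config get mds mds_min_caps_working_set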

Weiwen Hu

From: Patrick Donnelly
Sent: 20 November 2021 3:20
To: 胡 玮文
Cc: Dan van der Ster; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Annoying MDS_CLIENT_RECALL Warning

On Fri, Nov 19, 2021 at 2:14 AM 胡 玮文  wrote:
>
> Thanks Dan,
>
> I choose one of the stuck client to investigate, as shown below, it currently 
> holds ~269700 caps, which is pretty high with no obvious reason. I cannot 
> understand most of the output, and failed to find any documents about it.
>
> # ceph tell mds.cephfs.gpu018.ovxvoz client ls id=7915658
> [
> {
> "id": 7915658,
> "entity": {
> "name": {
> "type": "client",
> "num": 7915658
> },
> "addr": {
> "type": "v1",
> "addr": "202.38.247.227:0",
> "nonce": 3019311016
> }
> },
> "state": "open",
> "num_leases": 0,
> "num_caps": 269695,
> "request_load_avg": 184,
> "uptime": 1340483.111458218,
> "requests_in_flight": 0,
> "num_completed_requests": 0,
> "num_completed_flushes": 1,
> "reconnecting": false,
> "recall_caps": {
> "value": 1625220.0378812221,
> "halflife": 60
> },
> "release_caps": {
> "value": 69.432671270941171,
> "halflife": 60
> },
> "recall_caps_throttle": {
> "value": 63255.667075845187,
> "halflife": 1.5
> },
> "recall_caps_throttle2o": {
> "value": 26064.679002183591,
> "halflife": 0.5
> },
> "session_cache_liveness": {
> "value": 259.9718480278375,
> "halflife": 300
> },

The MDS considers your client to be quiescent so it's asking it to
release caps. However it's not doing so. This may be a bug in the
kernel client.

> "cap_acquisition": {
> "value": 0,
> "halflife": 10
> },
> "delegated_inos": [... 7 items removed ],
> "inst": "client.7915658 v1:202.38.247.227:0/3019311016",
> "completed_requests": [],
> "prealloc_inos": [ ... 9 items removed ],
> "client_metadata": {
> "client_features": {
> "feature_bits": "0x7bff"
> },
> "metric_spec": {
> "metric_flags": {
> "feature_bits": "0x001f"
> }
> },
> "entity_id": "smil",
> "hostname": "gpu027",
> "kernel_version": "5.11.0-37-generic",
> "root": "/"
> }
> }
> ]
>
> I suspect that some files are in use so that their caps cannot be released. 
> However, "sudo lsof +f -- /mnt/cephfs | wc -l" just shows about 9k open 
> files, well below "num_caps".
>
> I also looked at 
> /sys/kernel/debug/ceph/e88d509a-f6fc-11ea-b25d-a0423f3ac864.client7915658/caps
>  on the client. The number of lines in it matches the "num_caps" reported by 
> MDS. This file also tells me which caps are not released. I investigated some 
> of them, but cannot see anything special. One example is attached here.
>
> # ceph tell mds.cephfs.gpu018.ovxvoz dump inode 0x100068b9d24
> {
> "path": "/dataset/coco2017/train2017/00342643.jpg",
> "ino": 1099621440804,
> "rdev": 0,
> "ctime": "2021-04-23T09:49:54.433652+",
> "btime": "2021-04-23T09:49:54.425652+",
> "mode": 33204,
> "uid": 85969,
> "gid": 85969,
> "nlink": 1,
> "dir_layout": {
> "dir_hash": 0,
> "unused1": 0,
> "unused2": 0,
> "unused3": 0
> },
> "layout": {
> "stripe_unit": 4194304,
> "stripe_count": 1,
> "object_size": 4194304,
> "pool_id": 5,
> "pool_ns": ""
> },
> "old_pools": [],
> "size": 147974,
> "truncate_seq": 1,
> "truncate_size": 18446744073709551615,
> "truncate_from": 0,
> "truncate_pending": 0,
> "mtime": "2021-04-23T09:49:54.433652+",
> "atime": "2021-04-23T09:49:54.425652+",
> "time_warp_seq": 0,
> "change_attr": 1,
> "export_pin": -1,
> "export_ephemeral_random_pin": 0,
> "export_ephemeral_distributed_pin": false,
> "client_ranges

[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Dan van der Ster
On Mon, Nov 22, 2021 at 4:52 PM Nico Schottelius
 wrote:
>
>
> Peter Lieven  writes:
> > Whatever SSD you choose, look if they support power-loss-protection and 
> > make sure you disable the write cache.
>
> I have read this statement multiple times now, but I am still puzzled by
> the disabling write cache statement. What is wrong with having a (BBU
> based) write cache in front of the SSDs / how does it decrease the
> perfomance?
>
> As far as I can tell, this is just another buffer/cache that is added in
> the write chain and thus should improve speed for small writes. For
> write bursts / longer writes the cache will not play a role anymore.
>
> What am I missing?

This: https://github.com/ceph/ceph/pull/43848/files
https://yourcmc.ru/wiki/Ceph_performance#Drive_cache_is_slowing_you_down

-- dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Darren Soothill
This depends on how the write cache is implemented and where the cache is.

If it's on a caching controller that has a BBU, then it depends on what happens 
when an fsync is issued.

If it forces the write to go down to the underlying devices then it could be a 
bad thing.

With many caching controllers as the controller is the end device that you 
connect to at an OS level then you can get substantial performance increases by 
having the cache enabled.

If you are directly connected to the drive, as in there is no caching controller 
in front of it, then some drives exhibit a performance degradation under 
BlueStore due to all the fsyncs that force writes down to the media. 

Some drives don’t allow this parameter to be changed and some just ignore 
whatever setting you put on it. 

Darren

> On 22 Nov 2021, at 15:42, Nico Schottelius  
> wrote:
> 
> 
> Peter Lieven  writes:
>> Whatever SSD you choose, look if they support power-loss-protection and make 
>> sure you disable the write cache.
> 
> I have read this statement multiple times now, but I am still puzzled by
> the disabling write cache statement. What is wrong with having a (BBU
> based) write cache in front of the SSDs / how does it decrease the
> perfomance?
> 
> As far as I can tell, this is just another buffer/cache that is added in
> the write chain and thus should improve speed for small writes. For
> write bursts / longer writes the cache will not play a role anymore.
> 
> What am I missing?
> 
> --
> Sustainable and modern Infrastructures by ungleich.ch
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Anthony D'Atri



> This depends on how the write cache is implemented and where the cache is.

Exactly!

> With many caching controllers as the controller is the end device that you 
> connect to at an OS level then you can get substantial performance increases 
> by having the cache enabled.

A couple of years ago I posted a litany of my experiences with RoC HBAs, 
hinting at the substantial costs involved. 


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recursive delete hangs on cephfs

2021-11-22 Thread Gregory Farnum
Oh, I misread your initial email and thought you were on hard drives.
These do seem slow for SSDs.

You could try tracking down where the time is spent; perhaps run
strace and see which calls are taking a while, and go through the op
tracker on the MDS and see if it has anything that's obviously taking
a long time.
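Something like the following against the active MDS (run on the host where
the MDS lives; a sketch, the daemon name is a placeholder) shows what is
currently in the op tracker:

ceph daemon mds.<name> dump_ops_in_flight
ceph daemon mds.<name> dump_historic_ops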
-Greg

On Wed, Nov 17, 2021 at 8:00 PM Sasha Litvak
 wrote:
>
> Gregory,
> Thank you for your reply, I do understand that a number of serialized lookups 
> may take time.  However if 3.25 sec is OK,  11.2 seconds sounds long, and I 
> had once removed a large subdirectory which took over 20 minutes to complete. 
>  I attempted to use nowsync mount option with kernel 5.15 and it seems to 
> hide the latency (i.e. it returns the prompt almost immediately after a
> recursive directory removal). However, I am not sure whether nowsync is safe
> to use with kernel >= 5.8.  I also have kernel 5.3 on one of the client
> clusters; nowsync is not supported there, yet all rm operations happen
> reasonably fast.  So the second question is: does 5.3's libceph behave
> differently on recursive rm compared to 5.4 or 5.8?
>
>
> On Wed, Nov 17, 2021 at 9:52 AM Gregory Farnum  wrote:
>>
>> On Sat, Nov 13, 2021 at 5:25 PM Sasha Litvak
>>  wrote:
>> >
>> > I continued looking into the issue and have no idea what hinders the
>> > performance yet. However:
>> >
>> > 1. A client operating with kernel 5.3.0-42 (ubuntu 18.04) has no such
>> > problems.  I delete a directory with hashed subdirs (00 - ff) and total
>> > space taken by files ~707MB spread across those 256 in 3.25 s.
>>
>> Recursive rm first requires the client to get capabilities on the
>> files in question, and the MDS to read that data off disk.
>> Newly-created directories will be cached, but old ones might not be.
>>
>> So this might just be the consequence of having to do 256 serialized
>> disk lookups on hard drives. 3.25 seconds seems plausible to me.
>>
>> The number of bytes isn't going to have any impact on how long it
>> takes to delete from the client side — that deletion is just marking
>> it in the MDS, and then the MDS does the object removals in the
>> background.
>> -Greg
>>
>> >
>> > 2. A client operating with kernel 5.8.0-53 (ubuntu 20.04) processes a
>> > similar directory with less space taken ~ 530 MB spread across 256 subdirs
>> > in 11.2 s.
>> >
>> > 3.Yet another client with kernel 5.4.156 has similar latency removing
>> > directories as in line 2.
>> >
>> > In all scenarios, mounts are set with the same options, i.e.
>> > noatime,secret-file,acl.
>> >
>> > Client 1 has luminous, client 2 has octopus, client 3 has nautilus.   While
>> > they are all on the same LAN, ceph -s on 2 and 3 returns in ~ 800 ms and on
>> > client in ~300 ms.
>> >
>> > Any ideas are appreciated,
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Nov 12, 2021 at 8:44 PM Sasha Litvak 
>> > wrote:
>> >
>> > > The metadata pool is on the same type of drives as other pools; every 
>> > > node
>> > > uses SATA SSDs.  They are all read / write mix DC types.  Intel and 
>> > > Seagate.
>> > >
>> > > On Fri, Nov 12, 2021 at 8:02 PM Anthony D'Atri 
>> > > wrote:
>> > >
>> > >> MDS RAM cache vs going to the metadata pool?  What type of drives is 
>> > >> your
>> > >> metadata pool on?
>> > >>
>> > >> > On Nov 12, 2021, at 5:30 PM, Sasha Litvak 
>> > >> > 
>> > >> wrote:
>> > >> >
>> > >> > I am running Pacific 16.2.4 cluster and recently noticed that rm -rf
>> > >> >  visibly hangs on the old directories.  Cluster is healthy,
>> > >> has a
>> > >> > light load, and any newly created directories deleted immediately 
>> > >> > (well
>> > >> rm
>> > >> > returns command prompt immediately).  The directories in question have
>> > >> 10 -
>> > >> > 20 small text files so nothing should be slow when removing them.
>> > >> >
>> > >> > I wonder if someone can please give me a hint on where to start
>> > >> > troubleshooting as I see no "big bad bear" yet.
>> > >> > ___
>> > >> > ceph-users mailing list -- ceph-users@ceph.io
>> > >> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> > >>
>> > >>
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread mj
Yes, we were a little bit concerned about the write endurance of those 
drives. There are SSDs with much higher DWPD endurance, but we expected 
that we would not need the higher endurance. So we decided not to pay 
the extra price.


Turns out to have been a good guess. (educated guess, but still)

MJ

Op 22-11-2021 om 16:25 schreef Luke Hall:


They seem to work quite nicely, and their wearout (after one year) is 
still at 1% for our use.


Thanks, that's really useful to know.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SATA SSD recommendations.

2021-11-22 Thread Anthony D'Atri
I have direct experience with SATA SSDs used for RBD with an active public 
cloud (QEMU/KVM) workload.  Drives rated ~ 1 DWPD after 3+ years of service 
consistently reported <10% of lifetime used.

SMART lifetime counters are often (always?) based on rated PE cycles, which I 
would expect to be more or less linear over the drive’s lifetime.

There’s a lot of FUD around endurance.  One sees very few if any actual cases 
of production drives running out, especially when they are legit “enterprise” 
class models.  Some drives report lifetime USED; some report lifetime 
REMAINING.  Sometimes the smartmontools drive.db entries get the polarity 
wrong.  Trending lifetime and reallocation blocks used vs remaining over time 
can be very illuminating, especially when certain models may exhibit (ahem) 
firmware deficiencies.

It is common to depreciate server gear over 5 years (at least in the US).  Mind 
you, depreciation is one thing, and CapEx approval for refresh is quite 
another, but I would expect chassis to experience more failures over time and 
issues with replacement parts availability than the drives themselves.

ymmocv

— aad

> 
> Yes, we were a little bit concerned about the write endurence of those 
> drives. There are SSD with much higher DWPD endurance, but we expected that 
> we would not need the higher endurance. So we decided not to pay the extra 
> price.
> 
> Turns out to have been a good guess. (educated guess, but still)
> 
> MJ
> 
> Op 22-11-2021 om 16:25 schreef Luke Hall:
>>> 
>>> They seem to work quite nicely, and their wearout (after one year) is still 
>>> at 1% for our use.
>> Thanks, that's really useful to know.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW support IAM user authentication

2021-11-22 Thread nio
hi,all:
In the process of using RGW, I still cannot authenticate users through
IAM. In the near future, will RGW support IAM to manage user permissions
and authentication functions?


Looking forward to your reply 😁
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW support IAM user authentication

2021-11-22 Thread Pritha Srivastava
Hi Nio,

Can you provide more details around what you are trying to do?

RGW supports attaching IAM policies to users that aid in managing their
permissions.

Thanks,
Pritha

On Tue, Nov 23, 2021 at 11:43 AM nio  wrote:

> hi,all:
> In the process of using RGW, I still cannot authenticate users through
> IAM. In the near future, will RGW support IAM to manage user permissions
> and authentication functions?
>
>
> Looking forward to your reply 😁
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG states question and improving peering times

2021-11-22 Thread Janne Johansson
On Mon, 22 Nov 2021 at 16:36, Stephen Smith6  wrote:
>
> I believe this is a fairly straight-forward question, but is it true that any 
> PG not in "active+..." (Peering, down, etc.) blocks writes to the entire pool?

I'm not sure if this is strictly true, but take the example of a VM
with a 40G rbd image as its hard drive: rbd will split this into some
10,000 4M pieces which will be spread across your PGs, and if you have
fewer than 10,000 PGs in the rbd pool, these pieces will end up on all
PGs. So, if one or more PGs are inactive in some way, it is only a matter
of time before a read or write from this VM hits one of the inactive PGs
and stops there.

Since most of the data is spread around regardless of whether you do rgw,
rbd, cephfs and so on, the effect might as well feel like "one bad PG
stops the pool", since all the load balancing done in the cluster will
make your clients get stuck on the inactive PGs sooner or later.
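
To see which PGs (if any) are in a non-active state right now, something
like this is usually enough:

ceph health detail
ceph pg dump_stuck inactive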

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to let osd up which is down

2021-11-22 Thread Janne Johansson
On Tue, 23 Nov 2021 at 05:56, GHui  wrote:
>
> I use "systemctl start/stop ceph.target" to start and stop the Ceph cluster.
> Maybe this is the problem. After I restarted the computer, the osds are all up.
> Is there any way to safely restart the Ceph cluster?

That is how you stop and start all ceph services on one host, yes.
I don't know if you did have the issue I described, I just noticed
that it looked very much like it.

Read the separate osd logs in /var/log/ceph/ceph-osd..log to
see why they do not start in your case.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: One pg stuck in active+undersized+degraded after OSD down

2021-11-22 Thread David Tinker
Yes, it recovered when I put the OSD back in. The issue is that it fails to
sort itself out when I remove that OSD, even though I have loads of space
and 8 other OSDs in 4 different zones to choose from. The weights are very
different (some 3.2, others 0.36) and that post I found suggested that this
might cause trouble for the crush algorithm. So I was thinking about making
them more even before removing the OSD.

On Mon, Nov 22, 2021 at 3:59 PM Stefan Kooman  wrote:

> On 11/22/21 14:50, David Tinker wrote:
> > I just had a look at the balance docs and it says "No adjustments will
> > be made to the PG distribution if the cluster is degraded (e.g., because
> > an OSD has failed and the system has not yet healed itself).". That
> > implies that the balancer won't run until the disruption caused by the
> > removed OSD has been sorted out?
>
> That's correct. Is the cluster recovering / backfilling now?
>
> Gr. Stefan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io