[ceph-users] Re: Planning cluster

2023-07-11 Thread Ml Ml
Never ever use osd pool default min size = 1

It will break your neck and really does not make sense.

:-)

On Mon, Jul 10, 2023 at 7:33 PM Dan van der Ster
 wrote:
>
> Hi Jan,
>
> On Sun, Jul 9, 2023 at 11:17 PM Jan Marek  wrote:
>
> > Hello,
> >
> > I have a cluster, which have this configuration:
> >
> > osd pool default size = 3
> > osd pool default min size = 1
> >
>
> Don't use min_size = 1 during regular stable operations. Instead, use
> min_size = 2 to ensure data safety, and then you can set the pool to
> min_size = 1 manually in the case of an emergency. (E.g. in case the 2
> copies fail and will not be recoverable).
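>
> For reference, that switch is just a pool-level setting; roughly (the
> pool name is a placeholder):
>
>     ceph osd pool set <pool> min_size 2    # normal operation
>     ceph osd pool set <pool> min_size 1    # emergency only, revert once recovered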
>
>
> > I have 5 monitor nodes and 7 OSD nodes.
> >
>
> Three monitors are probably enough. Put two in the DC holding 2 replicas,
> and the third in the DC with 1 replica.
>
>
> > I have changed a crush map to divide ceph cluster to two
> > datacenters - in the first one will be a part of cluster with 2
> > copies of data and in the second one will be part of cluster
> > with one copy - only emergency.
> >
> > I still have this cluster in one location.
> >
> > This cluster has 1 PiB of raw data capacity, thus it is very
> > expensive to add a further 300 TB of capacity to get 2+2 data redundancy.
> >
> > Will it work?
> >
> > If I turn off the 1/3 location, will it be operational?
>
>
> Yes, the PGs should be active and accept IO. But the cluster will be
> degraded; it cannot stay in this state permanently. (You will need to
> recover the 3rd replica or change the crush map.)
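>
> A rule for that 2+1 layout could look roughly like this -- the bucket
> names dc1/dc2 are placeholders for your datacenter buckets, and the
> rule is added by editing the decompiled CRUSH map:
>
>     rule two_dc_2plus1 {
>         id 10                                  # any unused rule id
>         type replicated
>         step take dc1
>         step chooseleaf firstn 2 type host     # 2 copies in the main DC
>         step emit
>         step take dc2
>         step chooseleaf firstn -2 type host    # remaining copy in the other DC
>         step emit
>     }
>
>     # ceph osd getcrushmap -o crush.bin
>     # crushtool -d crush.bin -o crush.txt      (add the rule, then recompile)
>     # crushtool -c crush.txt -o crush.new
>     # ceph osd setcrushmap -i crush.new
>     # ceph osd pool set <pool> crush_rule two_dc_2plus1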
>
>
>
> > I believe it will, and that this is the better choice. And what if the
> > 2/3 location dies?
>
>
> With min_size = 2, the PGs will be inactive, but the data will be safe. If
> this happens, then set min_size = 1 to activate the PGs.
> Mon will not have quorum though -- you need a plan for that. And also plan
> where you put your MDSs.
>
> -- dan
>
> __
> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>
>
>
>
> > On this cluster is pool with cephfs - this is a main
> > part of CEPH.
> >
> > Many thanks for your notices.
> >
> > Sincerely
> > Jan Marek
> > --
> > Ing. Jan Marek
> > University of South Bohemia
> > Academic Computer Centre
> > Phone: +420389032080
> > http://www.gnu.org/philosophy/no-word-attachments.cs.html
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW dynamic resharding blocks write ops

2023-07-11 Thread Eugen Block

Hi again, I got the log excerpt with the rgw error message:

s3:put_obj block_while_resharding ERROR: bucket is still resharding,  
please retry


Below is the message in context; I don't see a return code though,
only 206 for the GET requests. Unfortunately, we only have a recorded
PuTTY session, so the information is limited. Does it help anyway to
get to the bottom of this?
One more thing: we noticed that the default for
rgw_reshard_bucket_lock_duration was changed somewhere in Nautilus from
120 to 360 seconds. They hadn't reported these errors before the
upgrade, so I feel like either they were lucky and simply didn't run
into resharding while trying to write, or the lock duration was
actually only 120 seconds, which may have been okay for the
application. It's all just guessing at the moment; we don't have dumps
of all the configs in place from before the upgrade, at least not that
I'm aware of.
Anyway, I still need to discuss with them whether disabling dynamic
resharding is the way to go (and then manually resharding during
maintenance windows), and whether to preshard new buckets if they can
tell how many objects are expected in them.
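
A sketch of what that could look like -- the shard count is a
placeholder that would need sizing per bucket (the usual rule of thumb
is at most ~100k objects per shard):

  # disable dynamic resharding for the RGWs
  ceph config set client.rgw rgw_dynamic_resharding false

  # manually reshard a bucket during a maintenance window
  radosgw-admin bucket reshard --bucket=<bucket> --num-shards=101

  # preshard new buckets at creation time
  ceph config set client.rgw rgw_override_bucket_index_max_shards 101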
The error message repeats for around three and a half minutes;
apparently that is the time it took to reshard the bucket. Maybe
reducing the lock duration to 120 seconds could also help here, but I
wonder what the consequences would be. Would it stop resharding after
2 minutes and leave something orphaned behind, or how exactly does the
lock duration impact the process?
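
The setting itself is easy to inspect and change if they want to
experiment (the value below is just the old default mentioned above):

  ceph config get client.rgw rgw_reshard_bucket_lock_duration
  ceph config set client.rgw rgw_reshard_bucket_lock_duration 120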
One more question: I see these INFO messages ("found lock on "), but
the error message "bucket is still resharding" doesn't contain the
bucket name. I saw the INFO message a lot, not only during the
application timeout errors, so they don't seem to be related. How can I
tell which bucket is throwing the error during resharding?


Thanks,
Eugen

709+ 7f23a77be700  1 beast: 0x7f24dc1d85d0:  - ICAS_nondicom  
[06/Jul/2023:00:06:58.613 +] "GET  
/shaprod-lts/20221114193114-20521114-77bfd>
305+ 7f239dfab700  0 req 17231063235781096603 91.367729187s  
s3:put_obj block_while_resharding ERROR: bucket is still resharding,  
please retry
313+ 7f23246b8700  0 req 13860563368404093374 91.383728027s  
s3:put_obj block_while_resharding ERROR: bucket is still resharding,  
please retry
313+ 7f2382f75700  0 req 17231063235781096603 91.375732422s  
s3:put_obj NOTICE: resharding operation on bucket index detected,  
blocking
313+ 7f231669c700  0 req 13860563368404093374 91.383728027s  
s3:put_obj NOTICE: resharding operation on bucket index detected,  
blocking
365+ 7f23a0fb1700  0 INFO: RGWReshardLock::lock found lock on  
jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be  
held by another RGW p>
365+ 7f22fe66c700  0 INFO: RGWReshardLock::lock found lock on  
jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be  
held by another RGW p>
365+ 7f237c768700  0 INFO: RGWReshardLock::lock found lock on  
jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be  
held by another RGW p>
365+ 7f2361732700  0 INFO: RGWReshardLock::lock found lock on  
jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be  
held by another RGW p>
409+ 7f231669c700  0 INFO: RGWReshardLock::lock found lock on  
jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be  
held by another RGW p>
409+ 7f2382f75700  0 INFO: RGWReshardLock::lock found lock on  
jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be  
held by another RGW p>
669+ 7f22e3e37700  0 req 18215535743838894575 91.735725403s  
s3:put_obj block_while_resharding ERROR: bucket is still resharding,  
please retry
809+ 7f2326ebd700  0 req 18215535743838894575 91.875732422s  
s3:put_obj NOTICE: resharding operation on bucket index detected,  
blocking





Zitat von Eugen Block :

We had quite a small window yesterday to debug. I found the error
messages, but we didn't collect the logs yet; I will ask them to do
that on Monday. I *think* the error was something like this:


resharding operation on bucket index detected, blocking  
block_while_resharding ERROR: bucket is still resharding, please  
retry


But I'll verify and ask them to collect the logs.

[1]  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/4XMMPSHW7OQ3NU7IE4QFK6A2QVDQ2CJR/


Zitat von Casey Bodley :


while a bucket is resharding, rgw will retry several times internally
to apply the write before returning an error to the client. while most
buckets can be resharded within seconds, very large buckets may hit
these timeouts. any other cause of slow osd ops could also have that
effect. it can be helpful to pre-shard very large buckets to avoid
these resharding delays

can you tell which error code was returned to the client there? it
should be a retryable error, and many http clients have retry logic to
prevent these errors from reaching the app
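
For example, with botocore-based clients (like boto3 or the AWS CLI)
the built-in retry behaviour can usually be tuned via environment
variables; the values here are only illustrative:

  export AWS_RETRY_MODE=standard
  export AWS_MAX_ATTEMPTS=10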

[ceph-users] cephadm problem with MON deployment

2023-07-11 Thread Adam Huffman
Hello

I'm trying to add MONs in advance of a planned downtime.

This has actually ended up removing an existing MON, which isn't helpful.

The error I'm seeing is:

Invalid argument: /var/lib/ceph/mon/ceph-/store.db: does not
exist (create_if_missing is false)
error opening mon data directory at '/var/lib/ceph/mon/ceph-':
(22) Invalid argument

It appears that the fsid is being stripped, because the directory was there.
It's now in /var/lib/ceph//removed

This appears to be similar to:
https://tracker.ceph.com/issues/45167
which was closed for lack of a reproducer.

The command I ran was:

sudo ceph orch apply mon --placement="comma-separated hostname list"

after running that with "--dry-run".

Would be grateful for some advice here - I wasn't expecting to reduce the
MON count.

Best Wishes,
Adam
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm problem with MON deployment

2023-07-11 Thread Adam Huffman
Forgot to say we're on Pacific 16.2.13.

On Tue, 11 Jul 2023 at 08:55, Adam Huffman 
wrote:

> Hello
>
> I'm trying to add MONs in advance of a planned downtime.
>
> This has actually ended up removing an existing MON, which isn't helpful.
>
> The error I'm seeing is:
>
> Invalid argument: /var/lib/ceph/mon/ceph-/store.db: does not
> exist (create_if_missing is false)
> error opening mon data directory at '/var/lib/ceph/mon/ceph-':
> (22) Invalid argument
>
> It appears that the fsid is being stripped, because the directory was
> there.
> It's now in /var/lib/ceph//removed
>
> This appears to be similar to:
> https://tracker.ceph.com/issues/45167
> which was closed for lack of a reproducer.
>
> The command I ran was:
>
> sudo ceph orch apply mon --placement="comma-separated hostname list"
>
> after running that with "--dry-run".
>
> Would be grateful for some advice here - I wasn't expecting to reduce the
> MON count.
>
> Best Wishes,
> Adam
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm problem with MON deployment

2023-07-11 Thread Adam Huffman
Okay, this turned out to be down to cephadm rejecting the request because
the new MON was not in the list of public networks.
I had seen that error in the logs, but it looked as though it was a
consequence of the store.db error, rather than a cause.

After adding the new network, and repeating the request, the new MONs were
created.
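
For anyone hitting the same thing, a rough sketch of the steps (the
subnets and hostnames are placeholders):

  ceph config get mon public_network
  ceph config set mon public_network "10.0.1.0/24,10.0.2.0/24"
  ceph orch apply mon --placement="host1,host2,host3" --dry-run
  ceph orch apply mon --placement="host1,host2,host3"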

On Tue, 11 Jul 2023 at 08:57, Adam Huffman 
wrote:

> Forgot to say we're on Pacific 16.2.13.
>
> On Tue, 11 Jul 2023 at 08:55, Adam Huffman 
> wrote:
>
>> Hello
>>
>> I'm trying to add MONs in advance of a planned downtime.
>>
>> This has actually ended up removing an existing MON, which isn't helpful.
>>
>> The error I'm seeing is:
>>
>> Invalid argument: /var/lib/ceph/mon/ceph-/store.db: does not
>> exist (create_if_missing is false)
>> error opening mon data directory at '/var/lib/ceph/mon/ceph-':
>> (22) Invalid argument
>>
>> It appears that the fsid is being stripped, because the directory was
>> there.
>> It's now in /var/lib/ceph//removed
>>
>> This appears to be similar to:
>> https://tracker.ceph.com/issues/45167
>> which was closed for lack of a reproducer.
>>
>> The command I ran was:
>>
>> sudo ceph orch apply mon --placement="comma-separated hostname list"
>>
>> after running that with "--dry-run".
>>
>> Would be grateful for some advice here - I wasn't expecting to reduce the
>> MON count.
>>
>> Best Wishes,
>> Adam
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm fails to deploy loki with promtail correctly

2023-07-11 Thread Sake Ceph
I'm not sure if it's a bug in cephadm, but it looks like it. I've got Loki
deployed on one machine and Promtail deployed to all machines. After creating a
login, I can only view the logs of the host on which Loki is running.

When inspecting the Promtail configuration, the configured URL for Loki is set
to http://host.containers.internal:3100. Shouldn't this be configured by
cephadm to point to the Loki host?

This looks a lot like the issues with incorrectly set Grafana or
Prometheus URLs; bug 57018 was created for those. Should I create another bug
report?

And does someone know a workaround to set the correct URL for the time being?

Best regards,
Sake
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD memory usage after cephadm adoption

2023-07-11 Thread Luis Domingues
Hi everyone,

We recently migrated a cluster from ceph-ansible to cephadm. Everything went as
expected.
But now we have some alerts about high memory usage. The cluster is running
Ceph 16.2.13.

Of course, after adoption OSDs ended up in the  zone:

NAME PORTS RUNNING REFRESHED AGE PLACEMENT
osd 88 7m ago - 

But the weirdest thing I observed is that the OSDs seem to use more memory
than the mem limit:

NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER 
ID
osd.0  running (5d) 2m ago 5d 19.7G 6400M 16.2.13 327f301eff51 
ca07fe74a0fa
osd.1  running (5d) 2m ago 5d 7068M 6400M 16.2.13 327f301eff51 
6223ed8e34e9
osd.10  running (5d) 10m ago 5d 7235M 6400M 16.2.13 327f301eff51 073ddc0d7391
osd.100  running (5d) 2m ago 5d 7118M 6400M 16.2.13 327f301eff51 b7f9238c0c24

Does anybody know why OSDs would use more memory than the limit?

Thanks

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON sync time depends on outage duration

2023-07-11 Thread Eugen Block
I'm not so sure anymore that this could really help here. The dump-keys
output from the mon contains 42 million osd_snap prefix entries, 39
million of them "purged_snap" keys. I also compared with other
clusters; those aren't tombstones but the expected "history" of
purged snapshots. So I don't think removing a couple of hundred trash
snapshots will actually reduce the number of osd_snap keys. At least
doubling the payload_size seems to have a positive impact. The
compaction during the sync has a negative impact, of course, same as
not having the mon store on SSDs.
I'm currently playing with a test cluster, removing all "purged_snap"
entries from the mon db (not finished yet) to see what that will do
to the mon and whether it will even start correctly. But has anyone done
that, removing keys from the mon store? Not sure what to expect yet...
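
For reference, the knobs being discussed here boil down to something
like this (the values are just what we are testing, not
recommendations):

  # double the sync payload size (default 1048576 = 1M)
  ceph config set mon mon_sync_max_payload_size 2097152
  ceph config get mon mon_sync_max_payload_keys

  # if it still stalls, turn up mon/paxos debugging during a sync
  ceph config set mon debug_mon 10/10
  ceph config set mon debug_paxos 10/10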


Zitat von Dan van der Ster :


Oh yes, sounds like purging the rbd trash will be the real fix here!
Good luck!

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Mon, Jul 10, 2023 at 6:10 AM Eugen Block  wrote:


Hi,
I got a customer response with payload size 4096, that made things
even worse. The mon startup time was now around 40 minutes. My doubts
wrt decreasing the payload size seem confirmed. Then I read Dan's
response again which also mentions that the default payload size could
be too small. So I asked them to double the default (2M instead of 1M)
and am now waiting for a new result. I'm still wondering why this only
happens when the mon is down for more than 5 minutes. Does anyone have
an explanation for that time factor?
Another thing they're going to do is to remove lots of snapshot
tombstones (rbd mirroring snapshots in the trash namespace); maybe
that will reduce the osd_snap keys in the mon db, which should then
improve the startup time. We'll see...

Zitat von Eugen Block :

> Thanks, Dan!
>
>> Yes that sounds familiar from the luminous and mimic days.
>> The workaround for zillions of snapshot keys at that time was to use:
>>   ceph config set mon mon_sync_max_payload_size 4096
>
> I actually did search for mon_sync_max_payload_keys, not bytes so I
> missed your thread, it seems. Thanks for pointing that out. So the
> defaults seem to be these in Octopus:
>
> "mon_sync_max_payload_keys": "2000",
> "mon_sync_max_payload_size": "1048576",
>
>> So it could be in your case that the sync payload is just too small to
>> efficiently move 42 million osd_snap keys? Using debug_paxos and
debug_mon
>> you should be able to understand what is taking so long, and tune
>> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>
> I'm confused, if the payload size is too small, why would decreasing
> it help? Or am I misunderstanding something? But it probably won't
> hurt to try it with 4096 and see if anything changes. If not we can
> still turn on debug logs and take a closer look.
>
> >> And in addition to Dan's suggestion: HDD is not a good choice for
> >> RocksDB, which is most likely the reason for this thread. I think
> >> that from the 3rd time on, the database just goes into compaction
> >> maintenance.
>
> Believe me, I know... but there's not much they can currently do
> about it, quite a long story... But I have been telling them that
> for months now. Anyway, I will make some suggestions and report back
> if it worked in this case as well.
>
> Thanks!
> Eugen
>
> Zitat von Dan van der Ster :
>
>> Hi Eugen!
>>
>> Yes that sounds familiar from the luminous and mimic days.
>>
>> Check this old thread:
>>
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
>> (that thread is truncated but I can tell you that it worked for Frank).
>> Also the even older referenced thread:
>>
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
>>
>> The workaround for zillions of snapshot keys at that time was to use:
>>   ceph config set mon mon_sync_max_payload_size 4096
>>
>> That said, that sync issue was supposed to be fixed by way of adding the
>> new option mon_sync_max_payload_keys, which has been around since
nautilus.
>>
>> So it could be in your case that the sync payload is just too small to
>> efficiently move 42 million osd_snap keys? Using debug_paxos and
debug_mon
>> you should be able to understand what is taking so long, and tune
>> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>>
>> Good luck!
>>
>> Dan
>>
>> __
>> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>>
>>
>>
>> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block  wrote:
>>
>>> Hi *,
>>>
>>> I'm investigating an interesting issue on two customer clusters (used
>>> for mirroring) I've not solved yet, but today we finally made some
>>> progress. Maybe someone has an idea where to look next, I'd appreciate
>>> any hi

[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-11 Thread Mark Nelson

Hi Luis,


Can you do a "ceph tell osd. perf dump" and "ceph daemon osd. 
dump_mempools"?  Those should help us understand how much memory is 
being used by different parts of the OSD/bluestore and how much memory 
the priority cache thinks it has to work with.
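
On a cephadm-managed host that would be roughly the following, using
osd.0 as an example id:

  ceph tell osd.0 perf dump > osd.0-perf.json
  cephadm enter --name osd.0      # to reach the admin socket inside the container
  ceph daemon osd.0 dump_mempools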



Mark

On 7/11/23 4:57 AM, Luis Domingues wrote:

Hi everyone,

We recently migrate a cluster from ceph-ansible to cephadm. Everything went as 
expected.
But now we have some alerts on high memory usage. Cluster is running ceph 
16.2.13.

Of course, after adoption OSDs ended up in the  zone:

NAME PORTS RUNNING REFRESHED AGE PLACEMENT
osd 88 7m ago - 

But the weirdest thing I observed, is that the OSDs seem to use more memory 
that the mem limit:

NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER 
ID
osd.0  running (5d) 2m ago 5d 19.7G 6400M 16.2.13 327f301eff51 
ca07fe74a0fa
osd.1  running (5d) 2m ago 5d 7068M 6400M 16.2.13 327f301eff51 
6223ed8e34e9
osd.10  running (5d) 10m ago 5d 7235M 6400M 16.2.13 327f301eff51 073ddc0d7391 
osd.100  running (5d) 2m ago 5d 7118M 6400M 16.2.13 327f301eff51 b7f9238c0c24

Does anybody knows why OSDs would use more memory than the limit?

Thanks

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] upload-part-copy gets access denied after cluster upgrade

2023-07-11 Thread Motahare S
Hello everyone,
We have a Ceph cluster which was recently updated from Octopus (15.2.12) to
Pacific (16.2.13). There has been a problem with multipart uploads: when doing
an UPLOAD_PART_COPY from a valid, existing, previously uploaded part, it gets
403, ONLY WHEN IT'S CALLED BY THE SERVICE-USER. The same scenario gets a 200
response with a full-access sub-user, and both the sub-user and the
service-user get 200 for the same scenario on the Octopus version. The policy
for the service-user's access is as below:

{
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam:::user/wid:suserid"
            },
            "Action": "*",
            "Resource": [
                "arn:aws:s3:::bucketname",
                "arn:aws:s3:::bucketname/*"
            ]
        }
    ]
}

Note that this very service-user can perform a multipart upload without any
problem on both versions; only upload_part_copy, and only on Pacific, gets the
403, which makes it unlikely to be an access problem. Has anyone encountered
this issue?
I performed the multipart upload using boto3, but the same issue occurs
with other clients as well.
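
For reference, the failing call can be reproduced from the CLI with
something like the following (bucket, keys, endpoint and upload id are
placeholders):

  aws s3api create-multipart-upload --bucket bucketname --key target-key \
      --endpoint-url https://rgw.example.com
  aws s3api upload-part-copy --bucket bucketname --key target-key \
      --copy-source bucketname/existing-object --part-number 1 \
      --upload-id <UploadId> --endpoint-url https://rgw.example.com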

regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON sync time depends on outage duration

2023-07-11 Thread Josh Baergen
Out of curiosity, what is your require_osd_release set to? (ceph osd
dump | grep require_osd_release)

Josh

On Tue, Jul 11, 2023 at 5:11 AM Eugen Block  wrote:
>
> I'm not so sure anymore if that could really help here. The dump-keys
> output from the mon contains 42 million osd_snap prefix entries, 39
> million of them are "purged_snap" keys. I also compared to other
> clusters as well, those aren't tombstones but expected "history" of
> purged snapshots. So I don't think removing a couple of hundred trash
> snapshots will actually reduce the number of osd_snap keys. At least
> doubling the payload_size seems to have a positive impact. The
> compaction during the sync has a negative impact, of course, same as
> not having the mon store on SSDs.
> I'm currently playing with a test cluster, removing all "purged_snap"
> entries from the mon db (not finished yet) to see what that will do
> with the mon and if it will even start correctly. But has anyone done
> that, removing keys from the mon store? Not sure what to expect yet...
>
> Zitat von Dan van der Ster :
>
> > Oh yes, sounds like purging the rbd trash will be the real fix here!
> > Good luck!
> >
> > __
> > Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
> >
> >
> >
> >
> > On Mon, Jul 10, 2023 at 6:10 AM Eugen Block  wrote:
> >
> >> Hi,
> >> I got a customer response with payload size 4096, that made things
> >> even worse. The mon startup time was now around 40 minutes. My doubts
> >> wrt decreasing the payload size seem confirmed. Then I read Dan's
> >> response again which also mentions that the default payload size could
> >> be too small. So I asked them to double the default (2M instead of 1M)
> >> and am now waiting for a new result. I'm still wondering why this only
> >> happens when the mon is down for more than 5 minutes. Does anyone have
> >> an explanation for that time factor?
> >> Another thing they're going to do is to remove lots of snapshot
> >> tombstones (rbd mirroring snapshots in the trash namespace), maybe
> >> that will reduce the osd_snap keys in the mon db, which then would
> >> increase the startup time. We'll see...
> >>
> >> Zitat von Eugen Block :
> >>
> >> > Thanks, Dan!
> >> >
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >>   ceph config set mon mon_sync_max_payload_size 4096
> >> >
> >> > I actually did search for mon_sync_max_payload_keys, not bytes so I
> >> > missed your thread, it seems. Thanks for pointing that out. So the
> >> > defaults seem to be these in Octopus:
> >> >
> >> > "mon_sync_max_payload_keys": "2000",
> >> > "mon_sync_max_payload_size": "1048576",
> >> >
> >> >> So it could be in your case that the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> >> debug_mon
> >> >> you should be able to understand what is taking so long, and tune
> >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
> >> >
> >> > I'm confused, if the payload size is too small, why would decreasing
> >> > it help? Or am I misunderstanding something? But it probably won't
> >> > hurt to try it with 4096 and see if anything changes. If not we can
> >> > still turn on debug logs and take a closer look.
> >> >
> >> >> And additional to Dan suggestion, the HDD is not a good choices for
> >> >> RocksDB, which is most likely the reason for this thread, I think
> >> >> that from the 3rd time the database just goes into compaction
> >> >> maintenance
> >> >
> >> > Believe me, I know... but there's not much they can currently do
> >> > about it, quite a long story... But I have been telling them that
> >> > for months now. Anyway, I will make some suggestions and report back
> >> > if it worked in this case as well.
> >> >
> >> > Thanks!
> >> > Eugen
> >> >
> >> > Zitat von Dan van der Ster :
> >> >
> >> >> Hi Eugen!
> >> >>
> >> >> Yes that sounds familiar from the luminous and mimic days.
> >> >>
> >> >> Check this old thread:
> >> >>
> >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
> >> >> (that thread is truncated but I can tell you that it worked for Frank).
> >> >> Also the even older referenced thread:
> >> >>
> >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
> >> >>
> >> >> The workaround for zillions of snapshot keys at that time was to use:
> >> >>   ceph config set mon mon_sync_max_payload_size 4096
> >> >>
> >> >> That said, that sync issue was supposed to be fixed by way of adding the
> >> >> new option mon_sync_max_payload_keys, which has been around since
> >> nautilus.
> >> >>
> >> >> So it could be in your case that the sync payload is just too small to
> >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
> >> debug_mon
> >> >> you should

[ceph-users] Re: MON sync time depends on outage duration

2023-07-11 Thread Eugen Block

It was installed with Octopus and hasn't been upgraded yet:

"require_osd_release": "octopus",


Zitat von Josh Baergen :


Out of curiosity, what is your require_osd_release set to? (ceph osd
dump | grep require_osd_release)

Josh

On Tue, Jul 11, 2023 at 5:11 AM Eugen Block  wrote:


I'm not so sure anymore if that could really help here. The dump-keys
output from the mon contains 42 million osd_snap prefix entries, 39
million of them are "purged_snap" keys. I also compared to other
clusters as well, those aren't tombstones but expected "history" of
purged snapshots. So I don't think removing a couple of hundred trash
snapshots will actually reduce the number of osd_snap keys. At least
doubling the payload_size seems to have a positive impact. The
compaction during the sync has a negative impact, of course, same as
not having the mon store on SSDs.
I'm currently playing with a test cluster, removing all "purged_snap"
entries from the mon db (not finished yet) to see what that will do
with the mon and if it will even start correctly. But has anyone done
that, removing keys from the mon store? Not sure what to expect yet...

Zitat von Dan van der Ster :

> Oh yes, sounds like purging the rbd trash will be the real fix here!
> Good luck!
>
> __
> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>
>
>
>
> On Mon, Jul 10, 2023 at 6:10 AM Eugen Block  wrote:
>
>> Hi,
>> I got a customer response with payload size 4096, that made things
>> even worse. The mon startup time was now around 40 minutes. My doubts
>> wrt decreasing the payload size seem confirmed. Then I read Dan's
>> response again which also mentions that the default payload size could
>> be too small. So I asked them to double the default (2M instead of 1M)
>> and am now waiting for a new result. I'm still wondering why this only
>> happens when the mon is down for more than 5 minutes. Does anyone have
>> an explanation for that time factor?
>> Another thing they're going to do is to remove lots of snapshot
>> tombstones (rbd mirroring snapshots in the trash namespace), maybe
>> that will reduce the osd_snap keys in the mon db, which then would
>> increase the startup time. We'll see...
>>
>> Zitat von Eugen Block :
>>
>> > Thanks, Dan!
>> >
>> >> Yes that sounds familiar from the luminous and mimic days.
>> >> The workaround for zillions of snapshot keys at that time was to use:
>> >>   ceph config set mon mon_sync_max_payload_size 4096
>> >
>> > I actually did search for mon_sync_max_payload_keys, not bytes so I
>> > missed your thread, it seems. Thanks for pointing that out. So the
>> > defaults seem to be these in Octopus:
>> >
>> > "mon_sync_max_payload_keys": "2000",
>> > "mon_sync_max_payload_size": "1048576",
>> >
>> >> So it could be in your case that the sync payload is just too small to
>> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
>> debug_mon
>> >> you should be able to understand what is taking so long, and tune
>> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
>> >
>> > I'm confused, if the payload size is too small, why would decreasing
>> > it help? Or am I misunderstanding something? But it probably won't
>> > hurt to try it with 4096 and see if anything changes. If not we can
>> > still turn on debug logs and take a closer look.
>> >
>> >> And additional to Dan suggestion, the HDD is not a good choices for
>> >> RocksDB, which is most likely the reason for this thread, I think
>> >> that from the 3rd time the database just goes into compaction
>> >> maintenance
>> >
>> > Believe me, I know... but there's not much they can currently do
>> > about it, quite a long story... But I have been telling them that
>> > for months now. Anyway, I will make some suggestions and report back
>> > if it worked in this case as well.
>> >
>> > Thanks!
>> > Eugen
>> >
>> > Zitat von Dan van der Ster :
>> >
>> >> Hi Eugen!
>> >>
>> >> Yes that sounds familiar from the luminous and mimic days.
>> >>
>> >> Check this old thread:
>> >>
>>  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
>> >> (that thread is truncated but I can tell you that it worked  
for Frank).

>> >> Also the even older referenced thread:
>> >>
>>  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/

>> >>
>> >> The workaround for zillions of snapshot keys at that time was to use:
>> >>   ceph config set mon mon_sync_max_payload_size 4096
>> >>
>> >> That said, that sync issue was supposed to be fixed by way of  
adding the

>> >> new option mon_sync_max_payload_keys, which has been around since
>> nautilus.
>> >>
>> >> So it could be in your case that the sync payload is just too small to
>> >> efficiently move 42 million osd_snap keys? Using debug_paxos and
>> debug_mon
>> >> you should be able to understand what is taking so long, and tune
>> >> mo

[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-11 Thread Luis Domingues
Here you have. Perf dump:

{
"AsyncMessenger::Worker-0": {
"msgr_recv_messages": 12239872,
"msgr_send_messages": 12284221,
"msgr_recv_bytes": 43759275160,
"msgr_send_bytes": 61268769426,
"msgr_created_connections": 754,
"msgr_active_connections": 100,
"msgr_running_total_time": 939.476931816,
"msgr_running_send_time": 337.873686715,
"msgr_running_recv_time": 360.728238752,
"msgr_running_fast_dispatch_time": 183.737116872,
"msgr_send_messages_queue_lat": {
"avgcount": 12284206,
"sum": 1538.989479364,
"avgtime": 0.000125281
},
"msgr_handle_ack_lat": {
"avgcount": 5258403,
"sum": 1.005075918,
"avgtime": 0.00191
}
},
"AsyncMessenger::Worker-1": {
"msgr_recv_messages": 12099771,
"msgr_send_messages": 12138795,
"msgr_recv_bytes": 56967534605,
"msgr_send_bytes": 130548664272,
"msgr_created_connections": 647,
"msgr_active_connections": 91,
"msgr_running_total_time": 977.277996439,
"msgr_running_send_time": 362.155959231,
"msgr_running_recv_time": 365.376281473,
"msgr_running_fast_dispatch_time": 191.186643292,
"msgr_send_messages_queue_lat": {
"avgcount": 12138818,
"sum": 1557.187685700,
"avgtime": 0.000128281
},
"msgr_handle_ack_lat": {
"avgcount": 6155265,
"sum": 1.096270527,
"avgtime": 0.00178
}
},
"AsyncMessenger::Worker-2": {
"msgr_recv_messages": 11858354,
"msgr_send_messages": 11960404,
"msgr_recv_bytes": 60727084610,
"msgr_send_bytes": 168534726650,
"msgr_created_connections": 1043,
"msgr_active_connections": 103,
"msgr_running_total_time": 937.324084772,
"msgr_running_send_time": 351.174710644,
"msgr_running_recv_time": 2744.276782474,
"msgr_running_fast_dispatch_time": 172.960322050,
"msgr_send_messages_queue_lat": {
"avgcount": 11960392,
"sum": 1763.762581924,
"avgtime": 0.000147466
},
"msgr_handle_ack_lat": {
"avgcount": 2651457,
"sum": 0.538495450,
"avgtime": 0.00203
}
},
"bluefs": {
"db_total_bytes": 128005955584,
"db_used_bytes": 3271557120,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 1810377216,
"slow_used_bytes": 0,
"num_files": 70,
"log_bytes": 13045760,
"log_compactions": 58,
"logged_bytes": 922333184,
"files_written_wal": 2,
"files_written_sst": 13,
"bytes_written_wal": 1988489216,
"bytes_written_sst": 268890112,
"bytes_written_slow": 0,
"max_bytes_wal": 0,
"max_bytes_db": 3271557120,
"max_bytes_slow": 0,
"read_random_count": 577484,
"read_random_bytes": 2879541532,
"read_random_disk_count": 284290,
"read_random_disk_bytes": 1540394118,
"read_random_buffer_count": 319088,
"read_random_buffer_bytes": 1339147414,
"read_count": 1086625,
"read_bytes": 15054317429,
"read_prefetch_count": 1069462,
"read_prefetch_bytes": 14506469332,
"read_zeros_candidate": 0,
"read_zeros_errors": 0
},
"bluestore": {
"kv_flush_lat": {
"avgcount": 225099,
"sum": 526.605165277,
"avgtime": 0.002339438
},
"kv_commit_lat": {
"avgcount": 225099,
"sum": 61.412175620,
"avgtime": 0.000272822
},
"kv_sync_lat": {
"avgcount": 225099,
"sum": 588.017340897,
"avgtime": 0.002612261
},
"kv_final_lat": {
"avgcount": 225096,
"sum": 6.516869320,
"avgtime": 0.28951
},
"state_prepare_lat": {
"avgcount": 241063,
"sum": 173.705759592,
"avgtime": 0.000720582
},
"state_aio_wait_lat": {
"avgcount": 241063,
"sum": 1008.936150524,
"avgtime": 0.004185362
},
"state_io_done_lat": {
"avgcount": 241063,
"sum": 2.923457351,
"avgtime": 0.12127
},
"state_kv_queued_lat": {
"avgcount": 241063,
"sum": 560.050193021,
"avgtime": 0.002323252
},
"state_kv_commiting_lat": {
"avgcount": 241063,
"sum": 68.355225981,
"avgtime": 0.000283557
},
"state_kv_done_lat": {
"avgcount": 241063,
"sum": 0.097836444,
"avgtime": 0.00405
},
"state_deferred_queued_lat": {
"avgcount": 47230,
  

[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-11 Thread Mark Nelson

On 7/11/23 09:44, Luis Domingues wrote:


 "bluestore-pricache": {
 "target_bytes": 6713193267,
 "mapped_bytes": 6718742528,
 "unmapped_bytes": 467025920,
 "heap_bytes": 7185768448,
 "cache_bytes": 4161537138
 },


Hi Luis,


Looks like the mapped bytes for this OSD process are very close to (just
a little over) the target bytes that were set when you did the perf
dump. There is some unmapped memory that can be reclaimed by the kernel,
but we can't force the kernel to reclaim it. It could be that the
kernel is being a little lazy if there isn't memory pressure.


The way the memory autotuning works in Ceph is that periodically the 
prioritycache system will look at the mapped memory usage of the 
process, then grow/shrink the aggregate size of the in-memory caches to 
try and stay near the target.  It's reactive in nature, meaning that it 
can't completely control for spikes.  It also can't shrink the caches 
below a small minimum size, so if there is a memory leak it will help to 
an extent but can't completely fix it.  Once the aggregate memory size 
is decided on, it goes through a process of looking at how hot the 
different caches are and assigns memory based on where it thinks the 
memory would be most useful.  Again this is based on mapped memory 
though.  It can't force the kernel to reclaim memory that has already 
been released.
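
If you want to inspect or adjust the target itself, something along
these lines should work (the ratio and byte values below are just
examples):

  # what the OSDs are currently aiming for
  ceph config get osd osd_memory_target
  ceph config get osd osd_memory_target_autotune

  # with cephadm autotuning, the per-host memory fraction given to OSDs
  ceph config set mgr mgr/cephadm/autotune_memory_target_ratio 0.5

  # or pin an explicit per-OSD target (in bytes)
  ceph config set osd osd_memory_target 6442450944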


Thanks,

Mark

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Per minor-version view on docs.ceph.com

2023-07-11 Thread Satoru Takeuchi
Hi,

I have a request about docs.ceph.com. Could you provide per-minor-version views
on docs.ceph.com? Currently, we can select the Ceph version by using
https://docs.ceph.com/en/<version>/, where <version> can be a major version's
code name (e.g., "quincy") or "latest". However, we can't use minor version
numbers like "v17.2.6". It would be convenient for me (and I guess for many
other users, too) to be able to select the documentation for the version we
actually use.

In my recent case, I read the quincy mclock documentation because I use
v17.2.6. However, that document has changed a lot between v17.2.6 and the
latest quincy state because of the recent mclock rework.

Thanks,
Satoru
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io