"Anthony D'Atri" <a...@dreamsnake.net> schreef op 28 september 2024 16:24:
>> No retries.
>> Is it expected that resharding can take so long?
>> (in a setup with all NVMe drives)
> 
> Which drive SKU(s)? How full are they? Is their firmware up to date? How many 
> RGWs? Have you tuned
> your server network stack? Disabled Nagle? How many bucket OSDs? How many 
> index OSDs? How many PGs
> in the bucket and index pools? How many buckets? Do you have like 200M 
> objects per? Do you have the
> default max objects/shard setting?
> 
> Tiny objects are the devil of many object systems. I can think of cases where 
> the above questions
> could affect this case. I think resharding in advance might help you.

- Drives advertise themselves as “Dell Ent NVMe v2 AGN MU U.2 6.4TB” (think 
that is Samsung under the Dell sticker), running the newest 2.5.0 firmware. 
They are pretty empty, although about 10% of the capacity is being used by 
other stuff (RBD images).

- Single bucket. My import application already errored out after only 72 M 
objects/476 GiB of data, and I need to store a lot more. Objects are between 
0 bytes and 1 MB, 7 KB on average.

- Currently using only 1 RGW during my test run to simplify looking at logs, 
although I have 4.

- I cannot touch the TCP socket options in my Java application.
When you build an S3AsyncClient with the Java AWS SDK using the .crtBuilder(), 
the SDK outsources the communication to the AWS aws-c-s3/aws-c-http/aws-c-io 
CRT libraries written in C, and I never get to see the raw socket in Java.
Looking at the source, I don’t think Amazon is disabling the Nagle algorithm in 
their code. At least I don’t see TCP_NODELAY or similar options being used at 
the place where they seem to set the socket options:
https://github.com/awslabs/aws-c-io/blob/c345d77274db83c0c2e30331814093e7c84c45e2/source/posix/socket.c#L1216
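
For context, a CRT-based client gets constructed roughly like the sketch below 
(endpoint, credentials, region and concurrency values are placeholders, not my 
actual configuration); all socket handling happens inside the native CRT layer:

import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;

public class RgwCrtClientSketch {
    public static void main(String[] args) {
        // The CRT builder hands HTTP and socket work to the native
        // aws-c-s3/aws-c-http/aws-c-io libraries, so socket options like
        // TCP_NODELAY are out of reach from the Java side.
        S3AsyncClient s3 = S3AsyncClient.crtBuilder()
                .endpointOverride(URI.create("http://rgw.example.local:8080")) // placeholder RGW endpoint
                .region(Region.US_EAST_1) // arbitrary; RGW does not care
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("ACCESS_KEY", "SECRET_KEY"))) // placeholders
                .forcePathStyle(true)   // path-style addressing for RGW
                .maxConcurrency(250)    // same ballpark as my 250 connections
                .build();
        s3.close();
    }
}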

- Did not tune any network settings, and it is pretty quiet on the network 
side, nowhere near saturating bandwidth because objects are so small.

- Did not really tune anything else either yet. Pretty much a default cephadm 
setup for now.

- See it (automagically) allocated 1024 PGs for the .data pool and 32 for the 
.index pool.

- Think the main delay is just Ceph wanting to make sure everything is synced 
to storage before reporting success, which is why I am making a lot of 
concurrent connections to perform multiple PUT requests simultaneously. But 
even with 250 connections, it only does around 5000 objects per second 
according to the “object ingress/egress” Grafana graph. Can probably raise it 
some more…
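
Simplified sketch of the kind of upload loop I mean (bucket name, key naming 
and the 250 cap below are illustrative, not my actual import code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ConcurrentPutSketch {
    // Cap the number of PUT requests in flight at roughly the connection count.
    private static final int MAX_IN_FLIGHT = 250;

    public static void uploadAll(S3AsyncClient s3, String bucket, List<byte[]> objects)
            throws InterruptedException {
        Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);
        List<CompletableFuture<?>> futures = new ArrayList<>();

        int i = 0;
        for (byte[] body : objects) {
            inFlight.acquire();                    // block once MAX_IN_FLIGHT PUTs are pending
            String key = "import/object-" + (i++); // illustrative key naming
            CompletableFuture<?> f = s3.putObject(
                            PutObjectRequest.builder().bucket(bucket).key(key).build(),
                            AsyncRequestBody.fromBytes(body))
                    .whenComplete((resp, err) -> inFlight.release());
            futures.add(f);
        }
        // Wait for all outstanding PUTs to finish (or fail).
        CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).join();
    }
}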


Had the default max objects per shard setting for dynamic resharding, but have 
now manually resharded to 10069 shards and will have a go to see if it works 
better now.
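
For reference, the manual reshard was done with the usual radosgw-admin 
command, along these lines (bucket name is a placeholder):

radosgw-admin bucket reshard --bucket=mybucket --num-shards=10069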


Yours sincerely,

Floris Bos