[ceph-users] Re: Dashboard : Block image listing and infos

2021-02-11 Thread Gilles Mocellin

Hello Ernesto,

On 2021-02-10 18:37, Ernesto Puerta wrote:

Thanks, Gilles. I recently opened a PR to improve RBD image listing
(https://github.com/ceph/ceph/pull/39344). In your specific case, I
think that part of the issue could come from calculating the actually
provisioned capacity.
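
(As a side note, the per-image provisioned vs. used figures can also be pulled on the CLI; a sketch only, and not necessarily what the dashboard computes internally:

rbd du veeam-repos/veeam-repo1-vol1

)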

Could you please share the image details (or an `rbd info `
dump), like this?


Here is the rbd info for one of my images:

fcadmin -> sudo rbd --id veeam --image veeam-repos/veeam-repo1-vol1 info
rbd image 'veeam-repo1-vol1':
size 40 TiB in 10485760 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 16c20a43741ab4
data_pool: veeam-repos.data
block_name_prefix: rbd_data.13.16c20a43741ab4
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool

op_features:
flags:
create_timestamp: Fri Jan 15 18:25:13 2021
access_timestamp: Fri Jan 15 18:25:13 2021
modify_timestamp: Fri Jan 15 19:22:54 2021






Kind Regards,
Ernesto

On Thu, Jan 21, 2021 at 11:02 PM Gilles Mocellin wrote:


Hi !

I respond to the list, as it may help others.
I also reorder the response.


On Mon, Jan 18, 2021 at 2:41 PM Gilles Mocellin <gilles.mocel...@nuagelibre.org> wrote:

Hello Cephers,

On a new cluster, I only have 2 RBD block images, and the Dashboard doesn't manage to list them correctly.

I have this message:
   Warning
   Displaying previously cached data for pool veeam-repos.

Sometimes it disappears, but as soon as I reload or return to the listing page, it's there.

What I've seen is a high CPU load due to ceph-mgr on the active manager, and also stack traces like this:

[...]

dashboard.exceptions.ViewCacheNoDataException: ViewCache: unable to retrieve data

I also see that, when I try to edit an image:

2021-01-18T11:13:26.383+0100 7f00199ca700  0 [dashboard ERROR frontend.error]
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/#/block/rbd/edit/veeam-repos%252Fveeam-repo2-vol1): Cannot read property 'features_name' of undefined

  TypeError: Cannot read property 'features_name' of undefined

[...]


But that's perhaps just because I open an Edit window on the image and it does not have the data.
The Edit window is empty and I can't edit anything; in particular, I want to resize the image.


[...]

--
Gilles


On Thursday, 21 January 2021 at 21:56:58 CET, Ernesto Puerta wrote:

Hey Gilles,

If I'm not wrong, that exception (ViewCacheNoDataException) happens when the dashboard is unable to gather all required data from Ceph within a defined timeout (5 secs I think, since the UI refreshes the data every ~5 seconds).

It'd be great if you could provide the steps to reproduce it and some insights into your environment (number of RBD pools, number of RBD images, snapshots, etc.).

Kind Regards,

Ernesto


OK,
As it is now, it always happens: on the image listing I get the Warning, and the list is not always up to date; if I create an image, I must wait a very long time to see it.
Also, I cannot edit the 2 big images I have. Perhaps the size is important: they are 2 images of 40 TB.
If I create a 1 GB test image, I can edit and resize it.
But it's impossible with the big images: the window opens but all the fields are empty.
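
(For now, resizing from the CLI works as a workaround while the Edit dialog stays empty; a minimal sketch, the target size being purely illustrative:

sudo rbd --id veeam resize --size 45T veeam-repos/veeam-repo1-vol1

)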

Also, if it matters, the images use a data pool (EC 3+2).

I have 2 pools: a replicated one for metadata, veeam-repos (replica x3), and a data pool, veeam-repos.data (EC 3+2).
My cluster has 6 nodes, each with a 16-core AMD CPU, 128 GB RAM and 10 x 8 TB HDDs.
So 60 OSDs. Soon doubling everything to 12 nodes.

Usage, as the pool and image names suggest, is to mount an RBD image as an XFS filesystem for a Veeam Backup Repository (krbd, because rbd-nbd failed regularly, especially during fstrim).

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 15.2.8 mgr keep crashing every few days

2021-02-11 Thread levin ng
Hi all,

I recently deployed Ceph 15.2.8 with 3 (mon, mgr, rgw, mds) and 4 (osd) hosts, 7 hosts in total. However, I encounter mgr crashes a few times a week, and the crashing mgr can be any one of the 3. I couldn't identify the problem behind it, so here is the crash info; I'd appreciate any suggestions that could help me narrow it down.

Thank you very much.

{
"assert_condition": "ret == 0",
"assert_file":
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el8/BUILD/ceph-15.2.8/src/common/Thread.cc",
"assert_func": "void Thread::create(const char*, size_t)",
"assert_line": 157,
"assert_msg":
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el8/BUILD/ceph-15.2.8/src/common/Thread.cc:
In function 'void Thread::create(const char*, size_t)' thread 7f833addc700
time
2021-02-10T20:00:32.980508+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el8/BUILD/ceph-15.2.8/src/common/Thread.cc:
157: FAILED ceph_assert(ret == 0)\n",
"assert_thread_name": "mgr-fin",
"backtrace": [
"(()+0x12b20) [0x7f835a51cb20]",
"(gsignal()+0x10f) [0x7f8358f6d7ff]",
"(abort()+0x127) [0x7f8358f57c35]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a9) [0x7f835c07b735]",
"(()+0x27a8fe) [0x7f835c07b8fe]",
"(()+0x34cef6) [0x7f835c14def6]",
"(DispatchQueue::start()+0x3a) [0x7f835c29697a]",
"(AsyncMessenger::ready()+0xcd) [0x7f835c3340cd]",
"(Messenger::add_dispatcher_head(Dispatcher*)+0x68)
[0x7f835c3f8478]",
"(MonClient::get_monmap_and_config()+0xbb) [0x7f835c3f66ab]",
"(ceph_mount_info::init()+0x4d) [0x7f834298435d]",
"(()+0x3680f) [0x7f8342cd280f]",
"(()+0x19d421) [0x7f835ba5c421]",
"(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
"(()+0x179c78) [0x7f835ba38c78]",
"(()+0x19d1c7) [0x7f835ba5c1c7]",
"(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
"(()+0x179c78) [0x7f835ba38c78]",
"(()+0x19d1c7) [0x7f835ba5c1c7]",
"(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
"(()+0x1221d4) [0x7f835b9e11d4]",
"(()+0x122c55) [0x7f835b9e1c55]",
"(()+0x19cf27) [0x7f835ba5bf27]",
"(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
"(_PyFunction_FastCallDict()+0x122) [0x7f835b9b9ec2]",
"(_PyObject_FastCallDict()+0x70e) [0x7f835b9bac9e]",
"(()+0x10dc70) [0x7f835b9ccc70]",
"(_PyObject_FastCallDict()+0x6ec) [0x7f835b9bac7c]",
"(PyObject_CallFunctionObjArgs()+0xe8) [0x7f835b9dbd48]",
"(_PyEval_EvalFrameDefault()+0x2588) [0x7f835ba5eef8]",
"(()+0xf99b4) [0x7f835b9b89b4]",
"(()+0x179e60) [0x7f835ba38e60]",
"(()+0x19d1c7) [0x7f835ba5c1c7]",
"(_PyEval_EvalFrameDefault()+0x10d5) [0x7f835ba5da45]",
"(()+0x179c78) [0x7f835ba38c78]",
"(()+0x19d1c7) [0x7f835ba5c1c7]",
"(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
"(()+0xfa326) [0x7f835b9b9326]",
"(()+0x179e60) [0x7f835ba38e60]",
"(()+0x19d1c7) [0x7f835ba5c1c7]",
"(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
"(()+0x179c78) [0x7f835ba38c78]",
"(()+0x19d1c7) [0x7f835ba5c1c7]",
"(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
"(_PyFunction_FastCallDict()+0x122) [0x7f835b9b9ec2]",
"(_PyObject_FastCallDict()+0x70e) [0x7f835b9bac9e]",
"(()+0x10dc70) [0x7f835b9ccc70]",
"(PyObject_Call()+0x4b) [0x7f835b9c1acb]",
"(PyObject_CallMethod()+0x10b) [0x7f835ba5ac6b]",
"(ActivePyModule::handle_command(ModuleCommand const&, MgrSession
const&, std::map,
std::allocator >, boost::variant, std::allocator >, bool, long, double,
std::vector,
std::allocator >, std::allocator, std::allocator > > >, std::vector >, std::vector > >,
std::less, std::allocator, std::allocator > const,
boost::variant,
std::allocator >, bool, long, double,
std::vector,
std::allocator >, std::allocator, std::allocator > > >, std::vector >, std::vector > > > >
> const&, ceph::buffer::v15_2_0::list const&,
std::__cxx11::basic_stringstream,
std::allocator >*, std::__cxx11::basic_stringstream, std::allocator >*)+0x222) [0x55bc0b8a0cb2]",
"(()+0x1b0fdd) [0x55bc0b8f5fdd]",
"(Context::complete(int)+0xd) [0x55bc0b8b0bdd]",
"(Finisher::finisher_thread_entry()+0x1a5) [0x7f835c10b465]",
"(()+0x814a) [0x7f835a51214a]",
"(clone()+0x43) [0x7f8359032f23]"
],
"ceph_version": "15.2.8",
"crash_id":
"2021-02-10T20:00:32.989661Z_201fd5fb-6e0a-4b50-8a95-fdf9ed9aeb81",
"entity_name": "mgr.sds01-cp.cwcxek",
"os_id": "centos",
"os_na

[ceph-users] Data sync init vs bucket sync init

2021-02-11 Thread Szabo, Istvan (Agoda)
Hi,

What's the difference between data sync init and bucket sync init? Does data sync init (re)initialise sync for the complete cluster, and bucket sync init only for a single bucket?

I see that when the init has finished, there are shards behind, but it doesn't do anything with them?

What are the proper steps to bring things back in sync?

Init
Run
Restart
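
(For concreteness, the commands I mean take roughly this form; a sketch only, zone and bucket names are placeholders:

radosgw-admin data sync init --source-zone=<other-zone>
radosgw-admin bucket sync init --bucket=<bucket> --source-zone=<other-zone>

followed by restarting the radosgw daemons so sync restarts from the new markers.)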


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 15.2.8 mgr keep crashing every few days

2021-02-11 Thread Sebastian Luna Valero
Hi,

The following thread on this mailing list might be relevant:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IAJRTIMFALJTZD3KYBHT4G7GEL6EHRR5/#IAJRTIMFALJTZD3KYBHT4G7GEL6EHRR5

Best regards,
Sebastian


On Thu, 11 Feb 2021 at 10:32, levin ng () wrote:

> Hi all,
>
> I’d recently deployed ceph 15.2.8 with  3(mon,mgr,rgw,mds) and 4 (osd)
> total 7 host, however I encountered mgr crash a few times a week, the
> crashing mgr can be any one of 3. I couldn’t identify the problem behind
> and here is the crash info, appreciate anyone if you have suggestions that
> I could narrow it down.
>
> Thank you very much.
>
> {
> "assert_condition": "ret == 0",
> "assert_file":
>
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el8/BUILD/ceph-15.2.8/src/common/Thread.cc",
> "assert_func": "void Thread::create(const char*, size_t)",
> "assert_line": 157,
> "assert_msg":
>
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el8/BUILD/ceph-15.2.8/src/common/Thread.cc:
> In function 'void Thread::create(const char*, size_t)' thread 7f833addc700
> time
>
> 2021-02-10T20:00:32.980508+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.8/rpm/el8/BUILD/ceph-15.2.8/src/common/Thread.cc:
> 157: FAILED ceph_assert(ret == 0)\n",
> "assert_thread_name": "mgr-fin",
> "backtrace": [
> "(()+0x12b20) [0x7f835a51cb20]",
> "(gsignal()+0x10f) [0x7f8358f6d7ff]",
> "(abort()+0x127) [0x7f8358f57c35]",
> "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1a9) [0x7f835c07b735]",
> "(()+0x27a8fe) [0x7f835c07b8fe]",
> "(()+0x34cef6) [0x7f835c14def6]",
> "(DispatchQueue::start()+0x3a) [0x7f835c29697a]",
> "(AsyncMessenger::ready()+0xcd) [0x7f835c3340cd]",
> "(Messenger::add_dispatcher_head(Dispatcher*)+0x68)
> [0x7f835c3f8478]",
> "(MonClient::get_monmap_and_config()+0xbb) [0x7f835c3f66ab]",
> "(ceph_mount_info::init()+0x4d) [0x7f834298435d]",
> "(()+0x3680f) [0x7f8342cd280f]",
> "(()+0x19d421) [0x7f835ba5c421]",
> "(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
> "(()+0x179c78) [0x7f835ba38c78]",
> "(()+0x19d1c7) [0x7f835ba5c1c7]",
> "(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
> "(()+0x179c78) [0x7f835ba38c78]",
> "(()+0x19d1c7) [0x7f835ba5c1c7]",
> "(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
> "(()+0x1221d4) [0x7f835b9e11d4]",
> "(()+0x122c55) [0x7f835b9e1c55]",
> "(()+0x19cf27) [0x7f835ba5bf27]",
> "(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
> "(_PyFunction_FastCallDict()+0x122) [0x7f835b9b9ec2]",
> "(_PyObject_FastCallDict()+0x70e) [0x7f835b9bac9e]",
> "(()+0x10dc70) [0x7f835b9ccc70]",
> "(_PyObject_FastCallDict()+0x6ec) [0x7f835b9bac7c]",
> "(PyObject_CallFunctionObjArgs()+0xe8) [0x7f835b9dbd48]",
> "(_PyEval_EvalFrameDefault()+0x2588) [0x7f835ba5eef8]",
> "(()+0xf99b4) [0x7f835b9b89b4]",
> "(()+0x179e60) [0x7f835ba38e60]",
> "(()+0x19d1c7) [0x7f835ba5c1c7]",
> "(_PyEval_EvalFrameDefault()+0x10d5) [0x7f835ba5da45]",
> "(()+0x179c78) [0x7f835ba38c78]",
> "(()+0x19d1c7) [0x7f835ba5c1c7]",
> "(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
> "(()+0xfa326) [0x7f835b9b9326]",
> "(()+0x179e60) [0x7f835ba38e60]",
> "(()+0x19d1c7) [0x7f835ba5c1c7]",
> "(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
> "(()+0x179c78) [0x7f835ba38c78]",
> "(()+0x19d1c7) [0x7f835ba5c1c7]",
> "(_PyEval_EvalFrameDefault()+0x498) [0x7f835ba5ce08]",
> "(_PyFunction_FastCallDict()+0x122) [0x7f835b9b9ec2]",
> "(_PyObject_FastCallDict()+0x70e) [0x7f835b9bac9e]",
> "(()+0x10dc70) [0x7f835b9ccc70]",
> "(PyObject_Call()+0x4b) [0x7f835b9c1acb]",
> "(PyObject_CallMethod()+0x10b) [0x7f835ba5ac6b]",
> "(ActivePyModule::handle_command(ModuleCommand const&, MgrSession
> const&, std::map,
> std::allocator >, boost::variant std::char_traits, std::allocator >, bool, long, double,
> std::vector,
> std::allocator >, std::allocator std::char_traits, std::allocator > > >, std::vector std::allocator >, std::vector > >,
> std::less, std::allocator std::char_traits, std::allocator > const,
> boost::variant,
> std::allocator >, bool, long, double,
> std::vector,
> std::allocator >, std::allocator std::char_traits, std::allocator > > >, std::vector std::allocator >, std::vector > > > >
> > const&, ceph::buffer::v15_2_0::list const&,
> std::__cxx11::basic_str



[ceph-users] how far can we go using vstart.sh script for fake dev cluster-HELP

2021-02-11 Thread Bobby
Hi,

The Ceph source code contains a script called vstart.sh which allows developers to quickly test their code using a simple deployment on their development system.

Here: https://docs.ceph.com/en/latest//dev/quick_guide/

I am really curious how far we can go with the vstart.sh script.

While my development cluster is running, I use tools like rados bench, rbd and rbd-nbd to benchmark simple workloads and test my code. Do we have options to change the network settings in the fake cluster built from the vstart script and then benchmark it? For example, trying 1 Gbit and 10 Gbit Ethernet.
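
For reference, the kind of loop I mean looks roughly like this (a sketch, run from the build directory; daemon counts and the pool name are just examples):

MON=1 OSD=3 MGR=1 ../src/vstart.sh -d -n -x
bin/ceph osd pool create bench 32
bin/rados bench -p bench 30 write
../src/stop.sh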

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: krbd crc

2021-02-11 Thread Ilya Dryomov
On Thu, Feb 11, 2021 at 1:34 AM Seena Fallah  wrote:
>
> Hi,
> I have a few questions about krbd on kernel 4.15
>
> 1. Does it support msgr v2? (If not which kernel supports msgr v2?)

No.  Support for msgr2 has been merged into kernel 5.11, due to be
released this weekend.

Note that the kernel client will only support revision 1 of the msgr2
protocol (also referred to as msgr2.1).  The original msgr2 protocol has
security, integrity and some general robustness issues that made it not
conducive to bringing into the kernel.

msgr2.1 protocol was implemented in nautilus 14.2.11 and octopus
15.2.5, so if you want e.g. in-transit encryption with krbd, you will
need at least those versions on the server side.

The original msgr2 protocol is considered deprecated.
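
(For illustration, on a 5.11+ kernel the mode is chosen at map time via the ms_mode option; a sketch, pool/image names assumed:

rbd map mypool/myimage -o ms_mode=prefer-crc   # msgr2.1 crc mode if the server allows it, secure otherwise
rbd map mypool/myimage -o ms_mode=secure       # msgr2.1 with in-transit encryption

Without ms_mode the kernel client keeps using msgr v1.)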

> 2. If krbd is using msgr v1, does it checksum (CRC) the messages that it
> sends to see for example if the write is correct or not? and if it does
> checksums, If there were a problem in write how does it react to that? For
> example, does it raise I/O Error or retry or...?

Yes, it does.  In case of a crc mismatch, the messenger will reset the
session and the write will be retried automatically.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: krbd crc

2021-02-11 Thread Seena Fallah
Many thanks for your response.
One more question: in the case of a CRC mismatch, how many times does it retry, and does it raise any error logs in the kernel so I can see whether there was a CRC mismatch or not?

On Thu, Feb 11, 2021 at 3:05 PM Ilya Dryomov  wrote:

> On Thu, Feb 11, 2021 at 1:34 AM Seena Fallah 
> wrote:
> >
> > Hi,
> > I have a few questions about krbd on kernel 4.15
> >
> > 1. Does it support msgr v2? (If not which kernel supports msgr v2?)
>
> No.  Support for msgr2 has been merged into kernel 5.11, due to be
> released this weekend.
>
> Note that the kernel client will only support revision 1 of the msgr2
> protocol (also referred to as msgr2.1).  The original msgr2 protocol has
> security, integrity and some general robustness issues that made it not
> conducive to bringing into the kernel.
>
> msgr2.1 protocol was implemented in nautilus 14.2.11 and octopus
> 15.2.5, so if you want e.g. in-transit encryption with krbd, you will
> need at least those versions on the server side.
>
> The original msgr2 protocol is considered deprecated.
>
> > 2. If krbd is using msgr v1, does it checksum (CRC) the messages that it
> > sends to see for example if the write is correct or not? and if it does
> > checksums, If there were a problem in write how does it react to that?
> For
> > example, does it raise I/O Error or retry or...?
>
> Yes, it does.  In case of a crc mismatch, the messenger will reset the
> session and the write will be retried automatically.
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: krbd crc

2021-02-11 Thread Ilya Dryomov
On Thu, Feb 11, 2021 at 12:44 PM Seena Fallah  wrote:
>
> Many thanks for your response.
> One more question, In the case of a CRC mismatch how many times does it retry 
> and does it raise any error logs in the kernel to see if it had a CRC 
> mismatch or not?

You will see bad "crc/signature" errors in dmesg.

When the session is reset all its state is discarded, so it will retry
indefinitely.
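
For reference, they can be spotted on the client host with something like (a trivial sketch):

dmesg -T | grep -i 'bad crc/signature'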

>
> On Thu, Feb 11, 2021 at 3:05 PM Ilya Dryomov  wrote:
>>
>> On Thu, Feb 11, 2021 at 1:34 AM Seena Fallah  wrote:
>> >
>> > Hi,
>> > I have a few questions about krbd on kernel 4.15
>> >
>> > 1. Does it support msgr v2? (If not which kernel supports msgr v2?)
>>
>> No.  Support for msgr2 has been merged into kernel 5.11, due to be
>> released this weekend.
>>
>> Note that the kernel client will only support revision 1 of the msgr2
>> protocol (also referred to as msgr2.1).  The original msgr2 protocol has
>> security, integrity and some general robustness issues that made it not
>> conducive to bringing into the kernel.
>>
>> msgr2.1 protocol was implemented in nautilus 14.2.11 and octopus
>> 15.2.5, so if you want e.g. in-transit encryption with krbd, you will
>> need at least those versions on the server side.
>>
>> The original msgr2 protocol is considered deprecated.
>>
>> > 2. If krbd is using msgr v1, does it checksum (CRC) the messages that it
>> > sends to see for example if the write is correct or not? and if it does
>> > checksums, If there were a problem in write how does it react to that? For
>> > example, does it raise I/O Error or retry or...?
>>
>> Yes, it does.  In case of a crc mismatch, the messenger will reset the
>> session and the write will be retried automatically.
>>
>> Thanks,
>>
>> Ilya

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Storage-class split objects

2021-02-11 Thread Marcelo
Hi Casey, thank you for the reply.

I was wondering, just as the placement target is in the bucket metadata in
the index, if it would not be possible to insert the storage-class
information in the metadata of the object that is in the index as well. Or
did I get it wrong and there is absolutely no type of object metadata in
the index, just a listing of the objects?

Thanks again, Marcelo.



On Wed, Feb 10, 2021 at 11:43 AM, Casey Bodley wrote:

> On Wed, Feb 10, 2021 at 8:31 AM Marcelo  wrote:
> >
> > Hello all!
> >
> > We have a cluster where there are HDDs for data and NVMEs for journals
> and
> > indexes. We recently added pure SSD hosts, and created a storage class
> SSD.
> > To do this, we create a default.rgw.hot.data pool, associate a crush rule
> > using SSD and create a HOT storage class in the placement-target. The
> > problem is when we send an object to use a HOT storage class, it is in
> both
> > the STANDARD storage class pool and the HOT pool.
> >
> > STANDARD pool:
> > # rados -p default.rgw.buckets.data ls
> > d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
> >
> > # rados -p default.rgw.buckets.data stat
> > d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
> >
> default.rgw.buckets.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
> > mtime 2021-02-09 14: 54: 14.00, size 0
> >
> >
> > HOT pool:
> > # rados -p default.rgw.hot.data ls
> >
> d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
> >
> >
> > # rados -p default.rgw.hot.data stat
> >
> d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
> >
> default.rgw.hot.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
> > mtime 2021-02-09 14: 54: 14.00, size 15220
> >
> > The object itself is in the HOT pool, however it creates this other
> object
> > similar to an index in the STANDARD pool. Monitoring with iostat we
> noticed
> > that this behavior generates an unnecessary IO on disks that do not need
> to
> > be touched.
> >
> > Why this behavior? Are there any ways around it?
>
> this object in the STANDARD pool is called the 'head object', and it
> holds the s3 object's metadata - including an attribute that says
> which storage class the object's data is in
>
> when an S3 client downloads the object with a 'GET /bucket/LICENSE'
> request, it doesn't specify the storage class. so radosgw has to find
> its head object in a known location (the bucket's default storage
> class pool) in order to figure out which pool holds the object's data
>
> >
> > Thanks, Marcelo
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW/Swift 404 error when listing/deleting a newly created empty bucket

2021-02-11 Thread Mike Cave
So, as the subject states, I have an issue with buckets returning a 404 error when they are listed immediately after being created; the bucket also fails to be deleted if you try to delete it immediately after creation.

The behaviour is intermittent.

If I leave the bucket in place for a few minutes, the bucket behaves normally. 
I’m thinking this is a metadata issue or something along those lines but I’m 
out of my depth now.

To the best of our knowledge the cluster has not changed in any way since the 
same tests were run in December with no errors.

We are running Ceph 14.2.16 on all parts of the cluster.

I am using the python-swift client for the connection on a CentOS7 machine.

Can replicate the results from the mons or an external client as well.

I’m willing to share my test script as well if you would like to see how I’m 
generating the error.

Here is a piece of the logs in case I missed something in the interpretation 
(log level at 20):

14:23:17.069 7faba00df700  1 ====== starting new request req=0x55fb7a138700 ======
14:23:17.069 7faba00df700  2 req 148 0.000s initializing for trans_id = 
tx00094-0060245cd5-2b8949-default
14:23:17.069 7faba00df700 10 rgw api priority: s3=8 s3website=7
14:23:17.069 7faba00df700 10 host=
14:23:17.069 7faba00df700 20 subdomain= domain= in_hosted_domain=0 
in_hosted_domain_s3website=0
14:23:17.069 7faba00df700 -1 res_query() failed
14:23:17.069 7faba00df700 20 final domain/bucket subdomain= domain= 
in_hosted_domain=0 in_hosted_domain_s3website=0 s->info.domain= 
s->info.request_uri=/swift/v1/404test
14:23:17.069 7faba00df700 10 ver=v1 first=404test req=
14:23:17.069 7faba00df700 10 handler=28RGWHandler_REST_Bucket_SWIFT
14:23:17.069 7faba00df700  2 req 148 0.000s getting op 2
14:23:17.069 7faba00df700 10 req 148 0.000s swift:delete_bucket scheduling with 
dmclock client=3 cost=1
14:23:17.069 7faba00df700 10 op=30RGWDeleteBucket_ObjStore_SWIFT
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket verifying 
requester
14:23:17.069 7faba00df700 20 req 148 0.000s swift:delete_bucket 
rgw::auth::swift::DefaultStrategy: trying rgw::auth::swift::TempURLEngine
14:23:17.069 7faba00df700 20 req 148 0.000s swift:delete_bucket 
rgw::auth::swift::TempURLEngine denied with reason=-13
14:23:17.069 7faba00df700 20 req 148 0.000s swift:delete_bucket 
rgw::auth::swift::DefaultStrategy: trying rgw::auth::swift::SignedTokenEngine
14:23:17.069 7faba00df700 10 req 148 0.000s swift:delete_bucket 
swift_user=xmcc:swift
14:23:17.069 7faba00df700 20 build_token 
token=0a00786d63633a73776966748960ea4653df708a55ae2560e58acf01
14:23:17.069 7faba00df700 20 req 148 0.000s swift:delete_bucket 
rgw::auth::swift::SignedTokenEngine granted access
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket normalizing 
buckets and tenants
14:23:17.069 7faba00df700 10 s->object= s->bucket=404test
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket init permissions
14:23:17.069 7faba00df700 20 get_system_obj_state: rctx=0x55fb7a137770 
obj=default.rgw.meta:root:404test state=0x55fb7a060ac0 s->prefetch_data=0
14:23:17.069 7faba00df700 10 cache get: name=default.rgw.meta+root+404test : 
hit (negative entry)
14:23:17.069 7faba00df700 20 get_system_obj_state: rctx=0x55fb7a137130 
obj=default.rgw.meta:users.uid:xmcc state=0x55fb7a060f40 s->prefetch_data=0
14:23:17.069 7faba00df700 10 cache get: name=default.rgw.meta+users.uid+xmcc : 
hit (requested=0x6, cached=0x17)
14:23:17.069 7faba00df700 20 get_system_obj_state: s->obj_tag was set empty
14:23:17.069 7faba00df700 20 Read xattr: user.rgw.idtag
14:23:17.069 7faba00df700 20 get_system_obj_state: rctx=0x55fb7a137130 
obj=default.rgw.meta:users.uid:xmcc state=0x55fb7a060f40 s->prefetch_data=0
14:23:17.069 7faba00df700 10 cache get: name=default.rgw.meta+users.uid+xmcc : 
hit (requested=0x6, cached=0x17)
14:23:17.069 7faba00df700 20 get_system_obj_state: s->obj_tag was set empty
14:23:17.069 7faba00df700 20 Read xattr: user.rgw.idtag
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket recalculating 
target
14:23:17.069 7faba00df700 10 Starting retarget
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket reading 
permissions
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket init op
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket verifying op 
mask
14:23:17.069 7faba00df700 20 req 148 0.000s swift:delete_bucket required_mask= 
4 user.op_mask=7
14:23:17.069 7faba00df700  2 req 148 0.000s swift:delete_bucket verifying op 
permissions
14:23:17.069 7faba00df700 20 req 148 0.000s swift:delete_bucket -- Getting 
permissions begin with perm_mask=50
14:23:17.069 7faba00df700  5 req 148 0.000s swift:delete_bucket Searching 
permissions for identity=rgw::auth::ThirdPartyAccountApplier() -> 
rgw::auth::SysReqApplier -> rgw::auth::LocalApplier(acct_user=xmcc, 
acct_name=xmcc, subuser=swift, perm_mask=15, is_admin=0) mask=50
14:23:17.069 7faba00df700  5 Searching per

[ceph-users] Re: Storage-class split objects

2021-02-11 Thread Casey Bodley
On Thu, Feb 11, 2021 at 9:31 AM Marcelo  wrote:
>
> Hi Casey, thank you for the reply.
>
> I was wondering, just as the placement target is in the bucket metadata in
> the index, if it would not be possible to insert the storage-class
> information in the metadata of the object that is in the index as well. Or
> did I get it wrong and there is absolutely no type of object metadata in
> the index, just a listing of the objects?

the bucket index is for bucket listing, so each entry in the index
stores enough metadata (mtime, etag, size, etc) to satisfy the
s3/swift bucket listing APIs. this does include the storage class for
each object

but GetObject requests don't read from the bucket index, they just
look for a 'head object' with the object's name

for objects in the default storage class, we also store the first
chunk (4M) of data in the head object - so a GetObject request can
satisfy small object reads in a single round trip

for objects in non-default storage classes, we need one level of
indirection to locate the data. we *could* potentially go through the
bucket index for this, but the index itself is optional (see indexless
buckets) and has a looser consistency model than the head object,
which we can write atomically when an upload finishes
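
(as a side note, you can see that head-object metadata from the admin side; a sketch, bucket/object names assumed and the exact output fields vary by version:

radosgw-admin object stat --bucket=mybucket --object=LICENSE

the manifest in that output is where the storage class / tail placement of the data shows up)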

>
> Thanks again, Marcelo.
>
> 
>
> On Wed, Feb 10, 2021 at 11:43 AM, Casey Bodley wrote:
>
> > On Wed, Feb 10, 2021 at 8:31 AM Marcelo  wrote:
> > >
> > > Hello all!
> > >
> > > We have a cluster where there are HDDs for data and NVMEs for journals
> > and
> > > indexes. We recently added pure SSD hosts, and created a storage class
> > SSD.
> > > To do this, we create a default.rgw.hot.data pool, associate a crush rule
> > > using SSD and create a HOT storage class in the placement-target. The
> > > problem is when we send an object to use a HOT storage class, it is in
> > both
> > > the STANDARD storage class pool and the HOT pool.
> > >
> > > STANDARD pool:
> > > # rados -p default.rgw.buckets.data ls
> > > d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
> > >
> > > # rados -p default.rgw.buckets.data stat
> > > d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
> > >
> > default.rgw.buckets.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4_LICENSE
> > > mtime 2021-02-09 14: 54: 14.00, size 0
> > >
> > >
> > > HOT pool:
> > > # rados -p default.rgw.hot.data ls
> > >
> > d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
> > >
> > >
> > > # rados -p default.rgw.hot.data stat
> > >
> > d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
> > >
> > default.rgw.hot.data/d86dade5-d401-427b-870a-0670ec3ecb65.385198.4__shadow_.rmpla1NTgArcUQdSLpW4qEgTDlbhn9f_0
> > > mtime 2021-02-09 14: 54: 14.00, size 15220
> > >
> > > The object itself is in the HOT pool, however it creates this other
> > object
> > > similar to an index in the STANDARD pool. Monitoring with iostat we
> > noticed
> > > that this behavior generates an unnecessary IO on disks that do not need
> > to
> > > be touched.
> > >
> > > Why this behavior? Are there any ways around it?
> >
> > this object in the STANDARD pool is called the 'head object', and it
> > holds the s3 object's metadata - including an attribute that says
> > which storage class the object's data is in
> >
> > when an S3 client downloads the object with a 'GET /bucket/LICENSE'
> > request, it doesn't specify the storage class. so radosgw has to find
> > its head object in a known location (the bucket's default storage
> > class pool) in order to figure out which pool holds the object's data
> >
> > >
> > > Thanks, Marcelo
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-11 Thread Tony Liu
/dev/sdb is the SSD holding DB LVs for multiple HDDs.
What I expect is that, as long as there is sufficient space on db_devices 
specified
in service spec, a LV should be created.
Now, circling back to the original question, how does OSD replacement work?
I've been trying for a few weeks and hitting different issues, no luck.

Thanks!
Tony

From: Jens Hyllegaard (Soft Design A/S) 
Sent: February 10, 2021 11:54 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: db_devices doesn't show up in exported osd service 
spec

According to your "pvs" you still have a VG on your sdb device. As long as that 
is on there, it will not be available to ceph. I have had to do a lvremove, 
like this:
lvremove ceph-78c78efb-af86-427c-8be1-886fa1d54f8a 
osd-db-72784b7a-b5c0-46e6-8566-74758c297adc

Do a lvs command to see the right parameters.

Regards

Jens

-Original Message-
From: Tony Liu 
Sent: 10. februar 2021 22:59
To: David Orman 
Cc: Jens Hyllegaard (Soft Design A/S) ; 
ceph-users@ceph.io
Subject: Re: [ceph-users] Re: db_devices doesn't show up in exported osd 
service spec

Hi David,

===
# pvs
  PV         VG                                                   Fmt  Attr PSize    PFree
  /dev/sda3  vg0                                                  lvm2 a--  1.09t    0
  /dev/sdb   ceph-block-dbs-f8d28f1f-2dd3-47d0-9110-959e88405112  lvm2 a--  <447.13g 127.75g
  /dev/sdc   ceph-block-8f85121e-98bf-4466-aaf3-d888bcc938f6      lvm2 a--  2.18t    0
  /dev/sde   ceph-block-0b47f685-a60b-42fb-b679-931ef763b3c8      lvm2 a--  2.18t    0
  /dev/sdf   ceph-block-c526140d-c75f-4b0d-8c63-fbb2a8abfaa2      lvm2 a--  2.18t    0
  /dev/sdg   ceph-block-52b422f7-900a-45ff-a809-69fadabe12fa      lvm2 a--  2.18t    0
  /dev/sdh   ceph-block-da269f0d-ae11-4178-bf1e-6441b8800336      lvm2 a--  2.18t    0
===
After "orch osd rm", which doesn't clean up DB LV on OSD node, I manually clean 
it up by running "ceph-volume lvm zap --osd-id 12", which does the cleanup.
Is "orch device ls" supposed to show SSD device available if there is free 
space?
That could be another issue.

Thanks!
Tony

From: David Orman 
Sent: February 10, 2021 01:19 PM
To: Tony Liu
Cc: Jens Hyllegaard (Soft Design A/S); ceph-users@ceph.io
Subject: Re: [ceph-users] Re: db_devices doesn't show up in exported osd 
service spec

It's displaying sdb (what I assume you want to be used as a DB device) as 
unavailable. What's "pvs" output look like on that "ceph-osd-1" host? Perhaps 
it is full. I see the other email you sent regarding replacement; I suspect the 
pre-existing LV from your previous OSD is not re-used. You may need to delete 
it then the service specification should re-create it along with the OSD. If I 
remember correctly, I stopped the automatic application of the service spec 
(ceph orch rm osd.servicespec) when I had to replace a failed OSD, removed the 
OSD, nuked the LV on the db device in question, put in the new drive, then 
re-enabled the service-spec (ceph orch apply osd -i) and the OSD + DB/WAL were 
created appropriately. I don't remember the exact sequence, and it may depend 
on the ceph version. I'm also unsure if the "orch osd rm  --replace 
[--force]" will allow preservation of the db/wal mapping, it might be worth 
looking at in the future.
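
Laid out as commands, that sequence is roughly the following (a sketch only; the OSD id, spec name and spec file are placeholders from this thread, and details may differ by release):

ceph orch rm osd.osd-spec                   # stop the spec from being re-applied for now
ceph orch osd rm 12 --replace               # drain/remove the failed OSD, keeping its id reserved
ceph-volume lvm zap --osd-id 12 --destroy   # on the OSD host: clean up the old block/DB LVs
ceph orch apply osd -i osd-spec.yaml        # re-apply the spec once the new drive is in place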

On Wed, Feb 10, 2021 at 2:22 PM Tony Liu 
mailto:tonyliu0...@hotmail.com>> wrote:
Hi David,

Request info is below.

# ceph orch device ls ceph-osd-1
HOST        PATH      TYPE  SIZE   DEVICE_ID                           MODEL            VENDOR   ROTATIONAL  AVAIL  REJECT REASONS
ceph-osd-1  /dev/sdd  hdd   2235G  SEAGATE_DL2400MM0159_WBM2VL2G       DL2400MM0159     SEAGATE  1           True
ceph-osd-1  /dev/sda  hdd   1117G  SEAGATE_ST1200MM0099_WFK4NNDY       ST1200MM0099     SEAGATE  1           False  LVM detected, Insufficient space (<5GB) on vgs, locked
ceph-osd-1  /dev/sdb  ssd   447G   ATA_MZ7KH480HAHQ0D3_S5CNNA0N305738  MZ7KH480HAHQ0D3  ATA      0           False  LVM detected, locked
ceph-osd-1  /dev/sdc  hdd   2235G  SEAGATE_DL2400MM0159_WBM2WNSE       DL2400MM0159     SEAGATE  1           False  LVM detected, Insufficient space (<5GB) on vgs, locked
ceph-osd-1  /dev/sde  hdd   2235G  SEAGATE_DL2400MM0159_WBM2WP2S       DL2400MM0159     SEAGATE  1           False  LVM detected, Insufficient space (<5GB) on vgs, locked
ceph-osd-1  /dev/sdf  hdd   2235G  SEAGATE_DL2400MM0159_WBM2VK99       DL2400MM0159     SEAGATE  1           False  LVM detected, Insufficient space (<5GB) on vgs, locked
ceph-osd-1  /dev/sdg  hdd   2235G  SEAGATE_DL2400MM0159_WBM2VJBT       DL2400MM0159     SEAGATE  1           False  LVM detected, Insufficient space (<5GB) on vgs, locked
ceph-osd-1  /dev/sdh  hdd   2235G  SEAGATE_DL2400MM0159_WBM2VMFK       DL2400MM0159     SEAGATE  1           False  LVM detected, Insufficient space (<5GB) o

[ceph-users] Re: Proper solution of slow_ops

2021-02-11 Thread Davor Cubranic
But the config reference says “high” is already the default value? (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/)

This selects which priority ops will be sent to the strict queue verses the 
normal queue. The low setting sends all replication ops and higher to the 
strict queue, while the high option sends only replication acknowledgment ops 
and higher to the strict queue. Setting this to high should help when a few 
OSDs in the cluster are very busy especially when combined with wpq in the 
osd_op_queue setting. OSDs that are very busy handling replication traffic 
could starve primary client traffic on these OSDs without these settings. 
Requires a restart.
Valid Choices: low, high
Default: high


> On Feb 9, 2021, at 4:42 AM, Milan Kupcevic  wrote:
> 
> On 2/9/21 7:29 AM, Michal Strnad wrote:
>> 
>> we are looking for a proper solution of slow_ops. When the disk failed,
>> node is restated ... a lot of slow operations appear. Even if disk (OSD)
>> or node is back again most of slow_ops are still there. On the internet
>> we found only advice that we have to restart monitor. But this is not
>> right approach. Do you have some better solution? How did you treat
>> slow_ops in your production clusters?
>> 
>> We are running the latest nautilus on all clusters.
>> 
> 
> 
> 
> This config setting should help:
> 
> ceph config set osd osd_op_queue_cut_off high
> 
> 
> 
> -- 
> Milan Kupcevic
> Senior Cyberinfrastructure Engineer at Project NESE
> Harvard University
> FAS Research Computing
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Increasing QD=1 performance (lowering latency)

2021-02-11 Thread Joachim Kraftmayer

Hi Wido,

Do you know what happened to Mellanox's Ceph RDMA project of 2018?

We will test ARM Ampere for all-flash this half-year and probably get 
the opportunity to experiment with software defined memory.


Regards, Joachim

___

Clyso GmbH

Am 08.02.2021 um 14:21 schrieb Paul Emmerich:

A few things that you can try on the network side to shave off microseconds:

1) 10G Base-T has quite some latency compared to fiber or DAC. I've
measured 2 µs on Base-T vs. 0.3µs on fiber for one link in one direction,
so that's 8µs you can save for a round-trip if it's client -> switch -> osd
and back. Note that my measurement was for small packets, not sure how big
that penalty still is with large packets. Some of it comes from the large
block size (~3 kbit IIRC) of the layer 1 encoding, some is just processing
time of that complex encoding.

2) Setting the switch to cut-through instead of store-and-forward can help,
especially on slower links. Serialization time is 0.8ns per byte on 10
gbit, so ~3.2µs for a 4kb packet.

3) Depending on which NIC you use: check if it has some kind of interrupt
throttling feature that you can adjust or disable. If your Base-T NIC is an
Intel NIC, especially on the older Niantic ones (i.e. X5xx X5xx using ixgbe
probably also X7xx, i40e), that can make a large difference. Try setting
itr=0 for the ixgbe kernel module. Note that you might want to compile your
kernel with CONFIG_IRQ_TIME_ACCOUNTING when using this option, otherwise
CPU usage statistics will be wildly inaccurate if the driver takes a
significant amount of CPU time (should not be a problem for the setup
described here, but something to be aware of). This may get you up to 100µs
in the best case. No idea about other NICs
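
An alternative way to poke at the same knob, if the module parameter isn't available, is ethtool's coalescing controls (a sketch; interface name assumed, and not every NIC/driver honours all of these):

ethtool -c eth0                              # show current interrupt coalescing settings
ethtool -C eth0 adaptive-rx off rx-usecs 0   # disable adaptive coalescing / interrupt throttling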

4) No idea about the state in Ceph, but: SO_BUSY_POLL on sockets does help
with latency, but I forgot the details

5) Correct NUMA pinning (a single socket AMD system is NUMA) can reduce
tail latency, but doesn't do anything for average and median latency and I
have no insights specific to Ceph, though.


This could get you a few microseconds, I think especially 3 and 4 are worth
trying. Please do report results if you test this, I'm always interested in
hearing stories about low-level performance optimizations :)

Paul



On Tue, Feb 2, 2021 at 10:17 AM Wido den Hollander  wrote:


Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One thing where Ceph isn't the fastest are 4k blocks written at Queue
Depth 1.

Some applications benefit very much from high performance/low latency
I/O at qd=1, for example Single Threaded applications which are writing
small files inside a VM running on RBD.

With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (Replication, size=3)

I benchmark this using fio:

$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..

700us latency means the result will be about ~1500 IOps (1000 / 0.7)
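
For completeness, a full invocation looks something like this (a sketch; pool/image names are placeholders and the rbd ioengine options may need adjusting for your fio build):

$ fio --name=qd1 --ioengine=rbd --clientname=admin --pool=rbd --rbdname=bench \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --direct=1 \
      --time_based --runtime=60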

When comparing this to let's say a BSD machine running ZFS that's on the
low side. With ZFS+NVMe you'll be able to reach about somewhere between
7.000 and 10.000 IOps, the latency is simply much lower.

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3,84TB
- 10Gbit Base-T networking

Things to configure/tune:

- C-State pinning to 1
- CPU governor to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)

Higher clock speeds (New AMD Epyc coming in March!) help to reduce the
latency and going towards 25Gbit/100Gbit might help as well.

These are however only very small increments and might help to reduce
the latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOps other applications can do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and production ready.

In the meantime, am I overseeing some things here? Can we reduce the
latency further of the current OSDs?

Reaching a ~500us latency would already be great!

Thanks,

Wido


[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-11 Thread Davor Cubranic
I ran into the same situation when I created my first Octopus cluster. After 
purging everything, I started over and included a “model” instead of 
“rotational: 0” for data_devices in the spec, and this time it worked fine (it 
appears both in the output of `orch apply` and `orch ls --export`, as well as in 
the “devices” of `ceph osd metadata ID` output). Try that instead of “size”, 
maybe? (It helped that all SSDs are the same type in this cluster.)

Note: as part of purging my previous attempt, I made sure all traces of LVM 
were gone from those drives, i.e., lvremove/vgremove/pvremove.
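
Translated to the db_devices case discussed in this thread, the same idea would look roughly like this (a sketch; the placement and model string are taken from earlier in the thread and are only illustrative):

cat > osd-spec.yaml <<EOF
service_type: osd
service_id: osd-spec
placement:
  hosts:
  - ceph-osd-1
spec:
  objectstore: bluestore
  data_devices:
    rotational: true
  db_devices:
    model: MZ7KH480HAHQ0D3
EOF
ceph orch apply osd -i osd-spec.yaml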

Davor

> On Feb 9, 2021, at 10:05 PM, Tony Liu  wrote:
> 
> With db_devices.size, db_devices shows up from "orch ls --export",
> but no DB device/lvm created for the OSD. Any clues?
> 
> Thanks!
> Tony
> 
> From: Jens Hyllegaard (Soft Design A/S) 
> Sent: February 9, 2021 01:16 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: db_devices doesn't show up in exported osd service 
> spec
> 
> Hi Tony.
> 
I assume they used a size constraint instead of rotational. So if all your SSDs are 1 TB or less, and all HDDs are larger than that, you could use:
> 
spec:
  objectstore: bluestore
  data_devices:
    rotational: true
  filter_logic: AND
  db_devices:
    size: ':1TB'
> 
> It was usable in my test environment, and seems to work.
> 
> Regards
> 
> Jens
> 
> 
> -Original Message-
> From: Tony Liu 
> Sent: 9. februar 2021 02:09
> To: David Orman 
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: db_devices doesn't show up in exported osd service 
> spec
> 
> Hi David,
> 
> Could you show me an example of OSD service spec YAML to workaround it by 
> specifying size?
> 
> Thanks!
> Tony
> 
> From: David Orman 
> Sent: February 8, 2021 04:06 PM
> To: Tony Liu
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: db_devices doesn't show up in exported osd 
> service spec
> 
> Adding ceph-users:
> 
> We ran into this same issue, and we used a size specification to workaround 
> for now.
> 
> Bug and patch:
> 
> https://tracker.ceph.com/issues/49014
> https://github.com/ceph/ceph/pull/39083
> 
> Backport to Octopus:
> 
> https://github.com/ceph/ceph/pull/39171
> 
> On Sat, Feb 6, 2021 at 7:05 PM Tony Liu 
> mailto:tonyliu0...@hotmail.com>> wrote:
> Add dev to comment.
> 
> With 15.2.8, when apply OSD service spec, db_devices is gone.
> Here is the service spec file.
> ==
service_type: osd
service_id: osd-spec
placement:
  hosts:
  - ceph-osd-1
spec:
  objectstore: bluestore
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
> ==
> 
> Here is the logging from mon. The message with "Tony" is added by me in mgr 
> to confirm. The audit from mon shows db_devices is gone.
> Is there anything in mon to filter that out based on host info?
> How can I trace it?
> ==
> audit 2021-02-07T00:45:38.106171+ mgr.ceph-control-1.nxjnzz 
> (mgr.24142551) 4020 : audit [DBG] from='client.24184218 -' 
> entity='client.admin' cmd=[{"prefix": "orch apply osd", "target": ["mon-mgr", 
> ""]}]: dispatch cephadm 2021-02-07T00:45:38.108546+ 
> mgr.ceph-control-1.nxjnzz (mgr.24142551) 4021 : cephadm [INF] Marking host: 
> ceph-osd-1 for OSDSpec preview refresh.
> cephadm 2021-02-07T00:45:38.108798+ mgr.ceph-control-1.nxjnzz 
> (mgr.24142551) 4022 : cephadm [INF] Saving service osd.osd-spec spec with 
> placement ceph-osd-1 cephadm 2021-02-07T00:45:38.108893+ 
> mgr.ceph-control-1.nxjnzz (mgr.24142551) 4023 : cephadm [INF] Tony: spec: 
>  DriveGroupSpec(name=osd-spec->placement=PlacementSpec(hosts=[HostPlacementSpec(hostname='ceph-osd-1',
>  network='', name='')]), service_id='osd-spec', service_type='osd', 
> data_devices=DeviceSelection(rotational=1, all=False), 
> db_devices=DeviceSelection(rotational=0, all=False), osd_id_claims={}, 
> unmanaged=False, filter_logic='AND', preview_only=False)> audit 
> 2021-02-07T00:45:38.109782+ mon.ceph-control-3 (mon.2) 25 : audit [INF] 
> from='mgr.24142551 10.6.50.30:0/2838166251' 
> entity='mgr.ceph-control-1.nxjnzz' cmd=[{"prefix":"config-key 
> set","key":"mgr/cephadm/spec.osd.osd-spec","val":"{\"created\": 
> \"2021-02-07T00:45:38.108810\", \"spec\": {\"plac
> ement\": {\"hosts\": [\"ceph-osd-1\"]}, \"service_id\": \"osd-spec\", 
> \"service_name\": \"osd.osd-spec\", \"service_type\": \"osd\", \"spec\": 
> {\"data_devices\": {\"rotational\": 1}, \"filter_logic\": \"AND\", 
> \"objectstore\": \"bluestore\"}}}"}]: dispatch audit 
> 2021-02-07T00:45:38.110133+ mon.ceph-control-1 (mon.0) 107 : audit [INF] 
> from='mgr.24142551 ' entity='mgr.ceph-control-1.nxjnzz' 
> cmd=[{"prefix":"config-key 
> set","key":"mgr/cephadm/spec.osd.osd-spec","val":"{\"created\": 
> \"2021-02-07T00:45:38.108810\", \"spec\": {\"pl

[ceph-users] Re: Proper solution of slow_ops

2021-02-11 Thread Milan Kupcevic
On 2/11/21 1:39 PM, Davor Cubranic wrote:
> But the config reference says “high” is already the default value?
> (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/)
> 


It is not default in Nautilus. See
https://docs.ceph.com/en/nautilus/rados/configuration/osd-config-ref/?#operations


osd op queue cut off

Description

This selects which priority ops will be sent to the strict queue
verses the normal queue. The low setting sends all replication ops and
higher to the strict queue, while the high option sends only replication
acknowledgement ops and higher to the strict queue. Setting this to high
should help when a few OSDs in the cluster are very busy especially when
combined with wpq in the osd op queue setting. OSDs that are very busy
handling replication traffic could starve primary client traffic on
these OSDs without these settings. Requires a restart.

Type
String

Valid Choices
low, high

Default
low
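
To check what a given cluster is actually running with (a sketch; the OSD id is arbitrary):

ceph config get osd osd_op_queue_cut_off            # value in the central config
ceph daemon osd.0 config get osd_op_queue_cut_off   # value the running daemon actually uses

and remember the option only takes effect after an OSD restart.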





> 
>> On Feb 9, 2021, at 4:42 AM, Milan Kupcevic > > wrote:
>>
>> On 2/9/21 7:29 AM, Michal Strnad wrote:
>>>
>>> we are looking for a proper solution of slow_ops. When the disk failed,
>>> node is restated ... a lot of slow operations appear. Even if disk (OSD)
>>> or node is back again most of slow_ops are still there. On the internet
>>> we found only advice that we have to restart monitor. But this is not
>>> right approach. Do you have some better solution? How did you treat
>>> slow_ops in your production clusters?
>>>
>>> We are running the latest nautilus on all clusters.
>>>
>>
>>
>>
>> This config setting should help:
>>
>> ceph config set osd osd_op_queue_cut_off high
>>
>>



-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MultiSite Sync Problem and Shard Number Relation?

2021-02-11 Thread George Yil
Any input would be much appreciated.

Thanks.
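
For the record, the manual reshard step I refer to below is along these lines (a sketch; bucket name and shard count are placeholders, and in a multisite setup all RGWs should be stopped first and the secondary zone resynced afterwards):

radosgw-admin bucket reshard --bucket=big-bucket --num-shards=1024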

> On 9 Feb 2021 at 16:41, George Yil wrote:
> 
> Hi,
> 
> I am sort of newby to RGW multisite. I guess there is an important limitation 
> about bucket index sharding if you run multisite. I would like to learn 
> better or correct myself. And I also want to leave a bookmark here for future 
> cephers if possible. I apologize if this is asked before however I am not 
> able to find a good explanation from the mailing list.
> 
> AFAIK RGW needs bucket indexes as to serve bucket listing. This is optional 
> but mostly demanded.
> 
> Now you have to set a value for the number of shards of the bucket. This is 
> not dynamic since multisite does not support dynamic resharding yet.
> AFAIK if your bucket hits the shard limit then your bucket sync is in trouble 
> or maybe stopped. And you should plan a cluster outage in order to  reshard 
> the bucket and resync from start. 
> 
> I would like to verify if this is a correct assumption?
> 
> Here is the problem I've encounter right now.
> 
> I have 2 nautilus 14.2.9 RGW multisite clusters. Since I was not fully aware 
> of multisite/bucket limitations, I find myself with a fairly large bucket 
> with 256 millions of objects. 
> Now my problem is that bucket syncing seems to be stopped or stalled. I am 
> %100 certain of which but I failed to figure out the exact problem from 
> "radosgw-admin bucket sync status". It is quite possible that I maybe still 
> don't know the proper tool to figure out the root cause.
> 
> From the RGW logs I could capture this: "check_bucket_shards: resharding 
> needed: stats.num_objects=256166901 shard max_objects=7500". Cluster 
> ceph.conf includes "rgw override bucket index max shards = 750".
> 
> I suppose I need to reshard this bucket to continue syncing. And the only way 
> to reshard is that I have to stop all RGW services of both clusters, delete 
> the secondary zone, reshard the bucket and resync the bucket from the 
> beginning. Or for a better solution I should divide this large bucket into 
> smaller buckets. This might have no easy way but to migrate with some kind of 
> S3 sync tool (rather a fast one!).
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph osd df results

2021-02-11 Thread Marc
Should the ceph osd df results not show this for every device class? I do not think there are people mixing these classes in pools.

MIN/MAX VAR: 0.78/4.28  STDDEV: 6.15
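
For what it's worth, per-class figures can already be pulled with the class filter (a sketch, assuming Nautilus or later):

ceph osd df tree class hdd
ceph osd df tree class ssd

but the summary quoted above comes from the unfiltered output, which mixes the classes together.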

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] adding a second rgw instance on the same zone

2021-02-11 Thread Adrian Nicolae

Hi guys,

I have a Mimic cluster with only one RGW machine.  My setup is simple - 
one realm, one zonegroup, one zone.  How can I safely add a second RGW 
server to the same zone ?


Is it safe to just run "ceph-deploy rgw create" for the second server 
without impacting the existing metadata pools ?  What about the existing 
S3/Swift users - they should be available to the second RGW from the 
current pools, right ?


My biggest concern is that the second RGW server will try to recreate 
some internal pools when going online so I just want to double-check 
that I will not mess the current setup when adding the second instance :)


Thanks.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm upgrade issue persisting with one node

2021-02-11 Thread Darrin Hodges
Hi all,

I'm still getting the upgrade issue with cephadm ("Upgrade: failed to pull target image"). On each of the nodes in the cluster I can do:


docker pull docker.io/ceph/ceph:v15.2.8


And there is no error, but the upgrade command still fails. I can see an entry in the logs for:


Feb 11 22:27:56 ceph-admin01 bash[2641]: audit
2021-02-11T22:27:55.339997+ mon.ceph-admin01 (mon.0) 61 : audit
[INF] from='mgr.12064184 ' entity='mgr.ceph-osd01.rpgexq'
cmd=[{"prefix":"config-key
set","key":"mgr/cephadm/upgrade_state","val":"{\"target_name\":
\"docker.io/ceph/ceph:v15.2.8\", \"progress_id\":
\"7cf2e315-6cfe-4e9a-88bc-ec8d611b6b4f\", \"error\":
\"UPGRADE_FAILED_PULL: Upgrade: failed to pull target image\",
\"paused\": true}"}]: dispatch


Any ideas how I can find extra info on what is going on there?
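
The extra places I know to look are roughly these (a sketch; command spellings as I understand the cephadm docs):

ceph orch upgrade status                                     # current target, progress, last error
ceph log last 100 debug cephadm                              # recent cephadm module log entries
ceph config set mgr mgr/cephadm/log_to_cluster_level debug   # make the cephadm module chattier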


thanks

Darrin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] File I/O with mixed read/write and high streaming performance

2021-02-11 Thread perkins
Hi all,

I'm trying to understand if the CephFS is a good approach for the following 
scenario.  From some of the OLD benchmarks, GlusterFS significantly beat the 
CephFS when many file I/Os were required.   But... this was an OLD benchmark.   
I'd like your thoughts on the matter.

What I need to perform are  the following two steps:

Re-organize files  (Step one)

I need to take a large directory structure (assumed to reside on CephFS) and 
"re-arrange" it via a copy or link mechanism.  I will want to make a full copy 
of the directory structure but do a simple disk span chunking so that all the 
files in the original copy end up in a set of folders where each folder is no 
larger than a fixed size.  This is like what we did back in the days where we 
needed to write data in CDROM sized chunks.  There is a set of tools that will 
do this in the genisoimage package (dssplit and dirsplit).  Folder Axe was the 
MS Windows equivalent

Presumably, this would put a large random read and random write load on the 
cluster.  Since the size can be large (hundreds of G (maybe up to 1TB) with 10s 
to 100s of thousands of small files), I would need for this to be well 
optimized.  One mechanism that might be available is to use hard or soft links 
so that no actual copying is done (Don't know if CephFS/POSIX supports this).   
The linking approach would probably put a large strain on the MDS servers but 
not so much on the storage.

Write to media (Step two)

I need to stream the chunked folders to a set of media devices (think tape 
drive) that can ingest at high speed (about 200 megabytes per second... yes 
bytes).  I'd like to make sure that we can feed the ingest at the max rate (if 
possible). Whether we can write the folder chunks one at a time or in 
parallel (to multiple tape drives) remains to be seen.  Presumably, this would 
put a large random read load on the cluster.  Once the media has been 
successfully written, the chunked copy can be deleted.

Notes:

Currently, planning for all access to be done via Linux servers.  I'm eagerly 
watching the windows native CephFS beta.
The server performing the chunking job will be the only reader/writer of the 
data.
The server performing the streaming job will also be the only reader/writer of 
the data. 
If we can support parallel, then there may be 2-3 chunking servers and 2-3 
streaming servers operating concurrently.
There are only a few system in play... NOT hundreds of concurrent clients 
accessing the data.
One might assume that we could keep the raw data on cheaper disk and then 
"reconstruct" the copy on flash.  In this scenario, we can stream from flash.

I'd definitely appreciate your feedback on whether CephFS would be a good fit.

Thanks in advance for your thoughts!

- Steve
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Some slow writes over time

2021-02-11 Thread João Victor Mafra
I am running a simple workload (just 1 client and 1 file) of random writes on CephFS, and I noticed that approximately 3% of the operations (well spread over time) show latencies higher than the other 97% (100 ms vs 10 ms). Is there any reason for this to happen?

- I'm using fio with O_DIRECT to avoid cache buffer, so it is expected that the 
operations will only be completed after writing to the disk. 
- My WAL is also disabled, so there is no reason for ceph to be doing deferred 
writing.
- I performed the same workload on the gluster fs and the latencies were 
uniform over time.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cannot access "Object Gateway" in dashboard after setting rgw api keys

2021-02-11 Thread Alfonso Martinez Hidalgo
Hi Troels,

1) It seems you need to set up the user id like this:

ceph dashboard set-rgw-api-user-id 

More info here:
https://docs.ceph.com/en/nautilus/mgr/dashboard/#enabling-the-object-gateway-management-frontend

2) Have you set up multisite configuration (realms/zonegroups/zones) ?
Please paste the output of:

radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zone list

Regards,
-- 

Alfonso Martínez

Senior Software Engineer, Ceph Storage

Red Hat 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm upgrade issue persisting with one node

2021-02-11 Thread Darrin Hodges
It's all good, I finally got it working; some of the OSD nodes had an incorrect default gateway.

cheers
Darrin

On 12/2/21 9:41 am, Darrin Hodges wrote:
> Hi all,
>
> Still getting upgrade issue with cephadm " Upgrade: failed to pull
> target image".  On each of the nodes in the cluster I can do:
>
>
> docker pull docker.io/ceph/ceph:v15.2.8
>
>
> And there is no error but the upgrade command fails still. I can see an
> entry in the logs for:
>
>
> Feb 11 22:27:56 ceph-admin01 bash[2641]: audit
> 2021-02-11T22:27:55.339997+ mon.ceph-admin01 (mon.0) 61 : audit
> [INF] from='mgr.12064184 ' entity='mgr.ceph-osd01.rpgexq'
> cmd=[{"prefix":"config-key
> set","key":"mgr/cephadm/upgrade_state","val":"{\"target_name\":
> \"docker.io/ceph/ceph:v15.2.8\", \"progress_id\":
> \"7cf2e315-6cfe-4e9a-88bc-ec8d611b6b4f\", \"error\":
> \"UPGRADE_FAILED_PULL: Upgrade: failed to pull target image\",
> \"paused\": true}"}]: dispatch
>
>
> any ideas how I can find extra info on what is going on there?
>
>
> thanks
>
> Darrin
>
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph osd df results

2021-02-11 Thread Fox, Kevin M
+1


From: Marc 
Sent: Thursday, February 11, 2021 12:09 PM
To: ceph-users
Subject: [ceph-users] ceph osd df results


Should the ceph osd df results not have this result for every device class? I 
do not think that there people mixing these classes in pools.

MIN/MAX VAR: 0.78/4.28  STDDEV: 6.15

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io