Hello,

We are configuring a new Ceph cluster with Mellanox 2x100Gbps cards.

We have bonded these two ports into an MLAG bond0 interface.
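
For context, the bond is defined in /etc/network/interfaces roughly as
follows (a sketch only: the slave interface names and the prefix length
are placeholders, but MLAG on the switch side implies 802.3ad/LACP):

auto bond0
iface bond0 inet static
    # ceph1-nvme's address from the GID table below; /24 is a guess
    address 160.217.5.216/24
    bond-slaves enp65s0f0np0 enp65s0f1np1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4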

In async+posix mode everything is OK; the cluster is in the
HEALTH_OK state.

The Ceph version is 18.2.1.

Then we tried to configure RoCE for the cluster part of the network, but
without success.

Our ceph config dump (only the relevant options):

global                   advanced  ms_async_rdma_device_name  mlx5_bond_0                              *
global                   advanced  ms_async_rdma_gid_idx      3
global  host:ceph1-nvme  advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d8  *
global  host:ceph2-nvme  advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d7  *
global  host:ceph3-nvme  advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d6  *
global                   advanced  ms_async_rdma_roce_ver     2
global                   advanced  ms_async_rdma_type         rdma                                     *
global                   advanced  ms_cluster_type            async+rdma                               *
global                   advanced  ms_public_type             async+posix                              *


On ceph1-nvme, show_gids.sh gives this output:

# ./show_gids.sh
DEV          PORT  INDEX  GID                                      IPv4            VER  DEV
---          ----  -----  ---                                      ------------    ---  ---
mlx5_bond_0  1     0      fe80:0000:0000:0000:0e42:a1ff:fe93:b004                  v1   bond0
mlx5_bond_0  1     1      fe80:0000:0000:0000:0e42:a1ff:fe93:b004                  v2   bond0
mlx5_bond_0  1     2      0000:0000:0000:0000:0000:ffff:a0d9:05d8  160.217.5.216   v1   bond0
mlx5_bond_0  1     3      0000:0000:0000:0000:0000:ffff:a0d9:05d8  160.217.5.216   v2   bond0
n_gids_found=4
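
(As a cross-check outside of Ceph, the same GID and RoCE-version
information can be read directly from sysfs and ibverbs, e.g. for GID
index 3 on port 1:)

# cat /sys/class/infiniband/mlx5_bond_0/ports/1/gids/3
# cat /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types/3
# ibv_devinfo -d mlx5_bond_0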

I have set this line in /etc/security/limits.conf:

*       hard    memlock unlimited
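
(The OSDs run as podman containers under cephadm, so for reference this
is a sketch of how the effective memlock limit for the OSD can be checked,
using the unit/container names that appear in the log further down:)

# systemctl show ceph-87483e28-c19a-11ee-90ed-0c42a193b004@osd.2.service -p LimitMEMLOCK
# podman exec ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2 sh -c 'ulimit -l'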

But when I tried to restart ceph.target, the OSDs didn't start,
with the errors shown in the log excerpt below.

The Mellanox drivers are the in-kernel ones from Debian bookworm.

Is there something missing in the config, or some error in it?

When I change ms_cluster_type back to async+posix and restart
ceph.target, the cluster converges to the HEALTH_OK state...
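
(i.e. reverting with something along the lines of:)

# ceph config set global ms_cluster_type async+posix
# systemctl restart ceph.target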

Thanks for any advice...

Sincerely
Jan Marek
-- 
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
2024-02-05T09:56:50.249344+01:00 ceph3-nvme ceph-osd[21139]: auth: 
KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
2024-02-05T09:56:50.249362+01:00 ceph3-nvme ceph-osd[21139]: auth: 
KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
2024-02-05T09:56:50.249377+01:00 ceph3-nvme ceph-osd[21139]: 
asok(0x559592978000) register_command rotate-key hook 0x7fffcc26d398
2024-02-05T09:56:50.249391+01:00 ceph3-nvme ceph-osd[21139]: 
log_channel(cluster) update_config to_monitors: true to_syslog: false 
syslog_facility:  prio: info to_graylog: false graylog_host: 127.0.0.1 
graylog_port: 12201)
2024-02-05T09:56:50.249409+01:00 ceph3-nvme ceph-osd[21139]: osd.2 109 
log_to_monitors true
2024-02-05T09:56:50.249424+01:00 ceph3-nvme ceph-osd[21139]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc:
 In function 'void Infiniband::init()' thread 7fb3bb042700 time 
2024-02-05T08:56:50.142198+0000#012/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc:
 1061: FAILED ceph_assert(device)#012#012 ceph version 18.2.1 
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)#012 1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) 
[0x55958fb905b3]#012 2: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]#012 3: 
(Infiniband::init()+0x95b) [0x55959077df0b]#012 4: 
(RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, 
ServerSocket*)+0x30) [0x55959053bba0]#012 5: /usr/bin/ceph-osd(+0xfbb03f) 
[0x55959051e03f]#012 6: (EventCenter::process_events(unsigned int, 
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) 
[0x55959052f4c4]#012 7: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]#012 8: 
/lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]#012 9: 
/lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]#012 10: clone()
2024-02-05T09:56:50.249447+01:00 ceph3-nvme ceph-osd[21139]: *** Caught signal 
(Aborted) **#012 in thread 7fb3bb042700 thread_name:msgr-worker-0#012#012 ceph 
version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)#012 1: 
/lib64/libpthread.so.0(+0x12d20) [0x7fb3c2498d20]#012 2: gsignal()#012 3: 
abort()#012 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x18f) [0x55958fb9060d]#012 5: /usr/bin/ceph-osd(+0x62d779) 
[0x55958fb90779]#012 6: (Infiniband::init()+0x95b) [0x55959077df0b]#012 7: 
(RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, 
ServerSocket*)+0x30) [0x55959053bba0]#012 8: /usr/bin/ceph-osd(+0xfbb03f) 
[0x55959051e03f]#012 9: (EventCenter::process_events(unsigned int, 
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) 
[0x55959052f4c4]#012 10: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]#012 11: 
/lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]#012 12: 
/lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]#012 13: clone()#012 NOTE: a 
copy of the executable, or `objdump -rdS <executable>` is needed to interpret 
this.
2024-02-05T09:56:50.249475+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  -2717> 
2024-02-05T08:56:50.132+0000 7fb3c49da640 -1 osd.2 109 log_to_monitors true
2024-02-05T09:56:50.249503+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  -2716> 
2024-02-05T08:56:50.140+0000 7fb3bb042700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc:
 In function 'void Infiniband::init()' thread 7fb3bb042700 time 
2024-02-05T08:56:50.142198+0000
2024-02-05T09:56:50.249530+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc:
 1061: FAILED ceph_assert(device)
2024-02-05T09:56:50.249551+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 
2024-02-05T09:56:50.249571+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  ceph version 18.2.1 
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
2024-02-05T09:56:50.249593+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  1: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) 
[0x55958fb905b3]
2024-02-05T09:56:50.249614+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  2: 
/usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]
2024-02-05T09:56:50.249634+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  3: 
(Infiniband::init()+0x95b) [0x55959077df0b]
2024-02-05T09:56:50.249656+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  4: 
(RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, 
ServerSocket*)+0x30) [0x55959053bba0]
2024-02-05T09:56:50.249678+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  5: 
/usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]
2024-02-05T09:56:50.249699+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  6: 
(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, 
std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]
2024-02-05T09:56:50.249721+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  7: 
/usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]
2024-02-05T09:56:50.249741+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  8: 
/lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]
2024-02-05T09:56:50.249761+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  9: 
/lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]
2024-02-05T09:56:50.249784+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  10: clone()
2024-02-05T09:56:50.249804+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 
2024-02-05T09:56:50.249824+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  -2715> 
2024-02-05T08:56:50.168+0000 7fb3bb042700 -1 *** Caught signal (Aborted) **
2024-02-05T09:56:50.249843+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  in thread 7fb3bb042700 
thread_name:msgr-worker-0
2024-02-05T09:56:50.249862+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 
2024-02-05T09:56:50.249881+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  ceph version 18.2.1 
(7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
2024-02-05T09:56:50.249901+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  1: 
/lib64/libpthread.so.0(+0x12d20) [0x7fb3c2498d20]
2024-02-05T09:56:50.249920+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  2: gsignal()
2024-02-05T09:56:50.249940+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  3: abort()
2024-02-05T09:56:50.249959+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  4: 
(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) 
[0x55958fb9060d]
2024-02-05T09:56:50.249982+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  5: 
/usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]
2024-02-05T09:56:50.250002+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  6: 
(Infiniband::init()+0x95b) [0x55959077df0b]
2024-02-05T09:56:50.250021+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  7: 
(RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, 
ServerSocket*)+0x30) [0x55959053bba0]
2024-02-05T09:56:50.250042+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  8: 
/usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]
2024-02-05T09:56:50.250066+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  9: 
(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, 
std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]
2024-02-05T09:56:50.250090+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  10: 
/usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]
2024-02-05T09:56:50.250110+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  11: 
/lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]
2024-02-05T09:56:50.250132+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  12: 
/lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]
2024-02-05T09:56:50.250154+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  13: clone() 
2024-02-05T09:56:50.250174+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:  NOTE: a copy of the 
executable, or `objdump -rdS <executable>` is needed to interpret this.
2024-02-05T09:56:50.250196+01:00 ceph3-nvme 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 
2024-02-05T09:56:50.332927+01:00 ceph3-nvme podman[21994]: 2024-02-05 
09:56:50.332706257 +0100 CET m=+0.015245010 container died 
7b35d0bcf4e2c6af121e5973c4fec96940f5626f3eb7530d560d83bc14a7fea4 
(image=quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f,
 name=ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2, 
org.label-schema.vendor=CentOS, CEPH_POINT_RELEASE=-18.2.1, GIT_CLEAN=True, 
maintainer=Guillaume Abrioux <gabri...@redhat.com>, 
org.label-schema.name=CentOS Stream 8 Base Image, 
GIT_COMMIT=4e397bbc7ff93e76025ef390087dfcea05ef676e, 
org.label-schema.schema-version=1.0, org.label-schema.build-date=20240102, 
org.label-schema.license=GPLv2, GIT_BRANCH=HEAD, RELEASE=HEAD, 
io.buildah.version=1.29.1, GIT_REPO=https://github.com/ceph/ceph-container.git, 
ceph=True)
2024-02-05T09:56:50.337613+01:00 ceph3-nvme systemd[1]: 
var-lib-containers-storage-overlay-3df2065cca86c325ca5c08e2c4dd88773c177351b5ff65b21e53ace6e5b2fecb-merged.mount:
 Deactivated successfully.
2024-02-05T09:56:50.341278+01:00 ceph3-nvme podman[21994]: 2024-02-05 
09:56:50.341208686 +0100 CET m=+0.023747429 container remove 
7b35d0bcf4e2c6af121e5973c4fec96940f5626f3eb7530d560d83bc14a7fea4 
(image=quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f,
 name=ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2, ceph=True, 
org.label-schema.schema-version=1.0, io.buildah.version=1.29.1, 
GIT_REPO=https://github.com/ceph/ceph-container.git, GIT_BRANCH=HEAD, 
GIT_COMMIT=4e397bbc7ff93e76025ef390087dfcea05ef676e, maintainer=Guillaume 
Abrioux <gabri...@redhat.com>, org.label-schema.license=GPLv2, 
org.label-schema.vendor=CentOS, org.label-schema.build-date=20240102, 
org.label-schema.name=CentOS Stream 8 Base Image, GIT_CLEAN=True, RELEASE=HEAD, 
CEPH_POINT_RELEASE=-18.2.1)
2024-02-05T09:56:50.343181+01:00 ceph3-nvme systemd[1]: 
ceph-87483e28-c19a-11ee-90ed-0c42a193b004@osd.2.service: Main process exited, 
code=exited, status=139/n/a
