Thanks for the reply.

My ceph.conf:

   [global]
             auth client required = none
             auth cluster required = none
             auth service required = none
             bluestore_block_db_size = 64424509440
             cluster network = 10.10.10.0/24
             fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
             keyring = /etc/pve/priv/$cluster.$name.keyring
             mon allow pool delete = true
             osd journal size = 5120
             osd pool default min size = 2
             osd pool default size = 3
             public network = 10.10.10.0/24

   [client]
             rbd cache = true
             rbd cache max dirty = 134217728
             rbd cache max dirty age = 2
             rbd cache size = 268435456
             rbd cache target dirty = 67108864
             rbd cache writethrough until flush = true

   [osd]
             keyring = /var/lib/ceph/osd/ceph-$id/keyring

   [mon.pve-hs-3]
             host = pve-hs-3
             mon addr = 10.10.10.253:6789

   [mon.pve-hs-main]
             host = pve-hs-main
             mon addr = 10.10.10.251:6789

   [mon.pve-hs-2]
             host = pve-hs-2
             mon addr = 10.10.10.252:6789
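
A side note on the raw byte values above: they all work out to round power-of-two sizes. A quick sanity check with plain shell arithmetic (no cluster access needed; the numbers are copied straight from the config):

```shell
# Convert the byte values from ceph.conf into human-readable units.
echo "bluestore_block_db_size: $((64424509440 / 1073741824)) GiB"  # 60 GiB
echo "rbd cache size:          $((268435456 / 1048576)) MiB"       # 256 MiB
echo "rbd cache max dirty:     $((134217728 / 1048576)) MiB"       # 128 MiB
echo "rbd cache target dirty:  $((67108864 / 1048576)) MiB"        # 64 MiB
```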


Each node has two ethernet cards in an LACP bond on the 10.10.10.x network:

auto bond0
iface bond0 inet static
        address  10.10.10.252
        netmask  255.255.255.0
        slaves enp4s0 enp4s1
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer3+4
#CLUSTER BOND
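
In case it helps, the kernel reports per-slave LACP state in /proc/net/bonding/bond0. A small sketch of how to pull out just the slave and link-status lines (the heredoc below stands in for the real file, with sample content I made up to match the format, so it runs anywhere):

```shell
# On a live node, replace the heredoc with /proc/net/bonding/bond0 as input.
# Prints the bond-level and per-slave MII status so you can confirm that
# both enp4s0 and enp4s1 actually joined the 802.3ad aggregate.
awk -F': ' '/^Slave Interface|^MII Status/ {print $1 ": " $2}' <<'EOF'
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up
Slave Interface: enp4s0
MII Status: up
Slave Interface: enp4s1
MII Status: up
EOF
```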


The LAG on the switch (a TP-Link TL-SG2008) is enabled; I can see it in the output of "show run":

#
interface gigabitEthernet 1/0/1
  channel-group 4 mode active
#
interface gigabitEthernet 1/0/2
  channel-group 4 mode active
#
interface gigabitEthernet 1/0/3
  channel-group 2 mode active
#
interface gigabitEthernet 1/0/4
  channel-group 2 mode active
#
interface gigabitEthernet 1/0/5
  channel-group 3 mode active
#
interface gigabitEthernet 1/0/6
  channel-group 3 mode active
#
interface gigabitEthernet 1/0/7
#
interface gigabitEthernet 1/0/8


Node 1 is on ports 1 and 2, node 2 on ports 3 and 4, and node 3 on ports 5 and 6.


Routing table, shown with "ip -4 route show table all":

default via 192.168.2.1 dev vmbr0 onlink
10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252
192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 linkdown
192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
broadcast 10.10.10.0 dev bond0 table local proto kernel scope link src 10.10.10.252
local 10.10.10.252 dev bond0 table local proto kernel scope host src 10.10.10.252
broadcast 10.10.10.255 dev bond0 table local proto kernel scope link src 10.10.10.252
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
local 192.168.1.252 dev vmbr1 table local proto kernel scope host src 192.168.1.252
broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link src 192.168.2.252
local 192.168.2.252 dev vmbr0 table local proto kernel scope host src 192.168.2.252
broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope link src 192.168.2.252


Network configuration

$ ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
       valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 10.10.10.252/24 brd 10.10.10.255 scope global bond0
       valid_lft forever preferred_lft forever
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.2.252/24 brd 192.168.2.255 scope global vmbr0
       valid_lft forever preferred_lft forever

$ ip -4 link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
3: enp4s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
4: enp4s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
9: tap104i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether b2:47:55:9f:d3:0b brd ff:ff:ff:ff:ff:ff
11: veth103i0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether fe:03:27:0d:02:38 brd ff:ff:ff:ff:ff:ff link-netnsid 0
13: veth106i0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether fe:ce:4f:09:24:45 brd ff:ff:ff:ff:ff:ff link-netnsid 1
14: tap109i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 3a:f0:99:3f:6a:75 brd ff:ff:ff:ff:ff:ff
15: tap201i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 16:99:8a:56:6d:7f brd ff:ff:ff:ff:ff:ff


I think that's everything.

Thanks



On 23/10/2017 15:42, Denes Dolhay wrote:

Hi,

Maybe some routing issue?


"CEPH has public and cluster network on 10.10.10.0/24"

Did you mean that the nodes have a public and a cluster network specified separately, both on 10.10.10.0/24, or that you did not specify a separate cluster network at all?

Please provide the route table, the ifconfig output, and ceph.conf.


Regards,

Denes


On 10/23/2017 03:35 PM, Marco Baldini - H.S. Amiata wrote:

Hello

I have a CEPH cluster with 3 nodes, each with 3 OSDs, running Proxmox. CEPH versions:

{
     "mon": {
         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
     },
     "mgr": {
         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
     },
     "osd": {
         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
     },
     "mds": {},
     "overall": {
         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
     }
}

CEPH has its public and cluster network both on 10.10.10.0/24. The three nodes are 10.10.10.251, 10.10.10.252 and 10.10.10.253, and networking is working well (I kept a ping running from one of the nodes to the other two for hours and had 0 packet loss).

On one node, with IP 10.10.10.252, I get strange messages in dmesg:

kern  :info  : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000730] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000330] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000951] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000733] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000524] libceph: mon1 10.10.10.252:6789 session established

and this has been going on all day.
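
What strikes me is that the "session lost" lines repeat at an almost constant ~30.7 s interval. A small sketch to extract the relative offsets and confirm the cadence (two sample lines from the dmesg output above are inlined via a heredoc; on a live node you would feed it the dmesg output instead):

```shell
# Pull the "+seconds" offsets out of the libceph "session lost" lines;
# a near-constant ~30.7 s spacing may point at a periodic timeout rather
# than random packet loss.
grep 'session lost' <<'EOF' | grep -o '+[0-9.]*' | tr -d '+'
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
EOF
```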

In ceph -w I get

2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0

pve-hs-main is the host with ip 10.10.10.251

Right now the CEPH storage sees very little use, on average 200 kB/s read or write (as shown by ceph -s), so I don't think it's a problem with the load on the cluster.

The strange thing is that I see "mon1 10.10.10.252:6789 session lost" in the log of node 10.10.10.252 itself, i.e. it is losing the connection to the monitor running on the same node, so I don't think it's network related.

I already tried rebooting the nodes and restarting ceph-mon and ceph-mgr, but the problem is still there.

Any ideas?

Thanks




--
*Marco Baldini*
*H.S. Amiata Srl*
Ufficio:        0577-779396
Cellulare:      335-8765169
WEB:    www.hsamiata.it <https://www.hsamiata.it>
EMAIL:  [email protected] <mailto:[email protected]>



_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


