Finally, I've disabled the mon_osd_report_timeout option and it seems to work fine.
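For the record, in case someone else hits this after a Jewel -> Luminous upgrade: as far as I can tell, Luminous OSDs report to the monitors with periodic beacons (osd_beacon_report_interval, 300 seconds by default if I'm not mistaken), so the mon_osd_report_timeout = 25 override in my [mon] section made the monitors mark perfectly healthy OSDs down ("marked down after no beacon for ~25 seconds" in the log below). What I did was simply remove that override so the option falls back to its default (900 seconds, I believe). Roughly:

    # In ceph.conf, [mon] section: remove or comment out the old override
    #mon_osd_report_timeout = 25

    # Optionally push the default back to the running monitors without waiting
    # for a restart (value assumed to be the 900 s default):
    ceph tell mon.* injectargs '--mon_osd_report_timeout 900'

    # Then restart the monitors so they also pick the change up from ceph.conf
    # (unit name may differ on non-systemd installs):
    systemctl restart ceph-mon.target

Raising the timeout to anything comfortably above the beacon interval would probably work as well; I just went back to the default.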
Greetings!

2017-10-17 19:02 GMT+02:00 Daniel Carrasco <d.carra...@i2tic.com>:
> Thanks!!
>
> I'll take a look later.
>
> Anyway, all my Ceph daemons are on the same version on all nodes (I've
> upgraded the whole cluster).
>
> Cheers!!
>
> On 17 Oct 2017, 6:39 p.m., "Marc Roos" <m.r...@f1-outsourcing.eu> wrote:
>
> Did you check this?
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39886.html
>
>
> -----Original Message-----
> From: Daniel Carrasco [mailto:d.carra...@i2tic.com]
> Sent: Tuesday, 17 October 2017 17:49
> To: ceph-us...@ceph.com
> Subject: [ceph-users] OSD are marked as down after jewel -> luminous upgrade
>
> Hello,
>
> Today I've decided to upgrade my Ceph cluster to the latest LTS version. To
> do it I've followed the steps posted in the release notes:
> http://ceph.com/releases/v12-2-0-luminous-released/
>
> After upgrading all the daemons I've noticed that all OSD daemons are
> marked as down even though all of them are running, so the cluster goes down.
>
> Maybe the problem is the command "ceph osd require-osd-release luminous",
> but all OSDs are on the Luminous version.
>
> ------------------------------------------------------------------------
>
> # ceph versions
> {
>     "mon": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
>     },
>     "mgr": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 3
>     },
>     "osd": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 2
>     },
>     "mds": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 2
>     },
>     "overall": {
>         "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 10
>     }
> }
>
> ------------------------------------------------------------------------
>
> # ceph osd versions
> {
>     "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)": 2
> }
>
> # ceph osd tree
> ID CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
> -1       0.08780 root default
> -2       0.04390     host alantra_fs-01
>  0   ssd 0.04390         osd.0              up  1.00000 1.00000
> -3       0.04390     host alantra_fs-02
>  1   ssd 0.04390         osd.1              up  1.00000 1.00000
> -4             0     host alantra_fs-03
>
> ------------------------------------------------------------------------
>
> # ceph -s
>   cluster:
>     id:     5f8e66b5-1adc-4930-b5d8-c0f44dc2037e
>     health: HEALTH_WARN
>             nodown flag(s) set
>
>   services:
>     mon: 3 daemons, quorum alantra_fs-02,alantra_fs-01,alantra_fs-03
>     mgr: alantra_fs-03(active), standbys: alantra_fs-01, alantra_fs-02
>     mds: cephfs-1/1/1 up {0=alantra_fs-01=up:active}, 1 up:standby
>     osd: 2 osds: 2 up, 2 in
>          flags nodown
>
>   data:
>     pools:   3 pools, 192 pgs
>     objects: 40177 objects, 3510 MB
>     usage:   7486 MB used, 84626 MB / 92112 MB avail
>     pgs:     192 active+clean
>
>   io:
>     client: 564 kB/s rd, 767 B/s wr, 33 op/s rd, 0 op/s wr
>
> ------------------------------------------------------------------------
> Log:
> 2017-10-17 16:15:25.466807 mon.alantra_fs-02 [INF] osd.0 marked down after no beacon for 29.864632 seconds
> 2017-10-17 16:15:25.467557 mon.alantra_fs-02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2017-10-17 16:15:25.467587 mon.alantra_fs-02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:15:27.494526 mon.alantra_fs-02 [WRN] Health check failed: Degraded data redundancy: 63 pgs unclean (PG_DEGRADED)
> 2017-10-17 16:15:27.501956 mon.alantra_fs-02 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2017-10-17 16:15:27.501997 mon.alantra_fs-02 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
> 2017-10-17 16:15:27.502012 mon.alantra_fs-02 [INF] Cluster is now healthy
> 2017-10-17 16:15:27.518798 mon.alantra_fs-02 [INF] osd.0 10.20.1.109:6801/3319 boot
> 2017-10-17 16:15:26.414023 osd.0 [WRN] Monitor daemon marked osd.0 down, but it is still running
> 2017-10-17 16:15:30.470477 mon.alantra_fs-02 [INF] osd.1 marked down after no beacon for 25.007336 seconds
> 2017-10-17 16:15:30.471014 mon.alantra_fs-02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2017-10-17 16:15:30.471047 mon.alantra_fs-02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:15:30.532427 mon.alantra_fs-02 [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; Degraded data redundancy: 63 pgs unclean
> 2017-10-17 16:15:31.590661 mon.alantra_fs-02 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 63 pgs unclean)
> 2017-10-17 16:15:34.703027 mon.alantra_fs-02 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2017-10-17 16:15:34.703061 mon.alantra_fs-02 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
> 2017-10-17 16:15:34.703078 mon.alantra_fs-02 [INF] Cluster is now healthy
> 2017-10-17 16:15:34.714002 mon.alantra_fs-02 [INF] osd.1 10.20.1.97:6801/2310 boot
> 2017-10-17 16:15:33.614640 osd.1 [WRN] Monitor daemon marked osd.1 down, but it is still running
> 2017-10-17 16:15:35.767050 mon.alantra_fs-02 [WRN] Health check failed: Degraded data redundancy: 40176/80352 objects degraded (50.000%), 63 pgs unclean, 192 pgs degraded (PG_DEGRADED)
> 2017-10-17 16:15:40.852094 mon.alantra_fs-02 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 19555/80352 objects degraded (24.337%), 63 pgs unclean, 96 pgs degraded)
> 2017-10-17 16:15:40.852129 mon.alantra_fs-02 [INF] Cluster is now healthy
> 2017-10-17 16:15:55.475549 mon.alantra_fs-02 [INF] osd.0 marked down after no beacon for 25.005072 seconds
> 2017-10-17 16:15:55.476086 mon.alantra_fs-02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2017-10-17 16:15:55.476114 mon.alantra_fs-02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:15:57.557651 mon.alantra_fs-02 [WRN] Health check failed: Degraded data redundancy: 63 pgs unclean (PG_DEGRADED)
> 2017-10-17 16:15:57.558176 mon.alantra_fs-02 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2017-10-17 16:15:57.558206 mon.alantra_fs-02 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
> 2017-10-17 16:15:57.558230 mon.alantra_fs-02 [INF] Cluster is now healthy
> 2017-10-17 16:15:57.596646 mon.alantra_fs-02 [INF] osd.0 10.20.1.109:6801/3319 boot
> 2017-10-17 16:15:56.447979 osd.0 [WRN] Monitor daemon marked osd.0 down, but it is still running
> 2017-10-17 16:16:00.479015 mon.alantra_fs-02 [INF] osd.1 marked down after no beacon for 25.004706 seconds
> 2017-10-17 16:16:00.479536 mon.alantra_fs-02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2017-10-17 16:16:00.479577 mon.alantra_fs-02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:16:01.634966 mon.alantra_fs-02 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 63 pgs unclean)
> 2017-10-17 16:16:02.643766 mon.alantra_fs-02 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2017-10-17 16:16:02.643798 mon.alantra_fs-02 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
> 2017-10-17 16:16:02.643815 mon.alantra_fs-02 [INF] Cluster is now healthy
> 2017-10-17 16:16:02.691761 mon.alantra_fs-02 [INF] osd.1 10.20.1.97:6801/2310 boot
> 2017-10-17 16:16:01.153925 osd.1 [WRN] Monitor daemon marked osd.1 down, but it is still running
> 2017-10-17 16:16:25.497378 mon.alantra_fs-02 [INF] osd.0 marked down after no beacon for 25.018358 seconds
> 2017-10-17 16:16:25.497946 mon.alantra_fs-02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2017-10-17 16:16:25.497973 mon.alantra_fs-02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:16:27.517811 mon.alantra_fs-02 [WRN] Health check failed: Degraded data redundancy: 62 pgs unclean (PG_DEGRADED)
> 2017-10-17 16:16:28.538617 mon.alantra_fs-02 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
> 2017-10-17 16:16:28.538647 mon.alantra_fs-02 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
> 2017-10-17 16:16:28.552535 mon.alantra_fs-02 [INF] osd.0 10.20.1.109:6801/3319 boot
> 2017-10-17 16:16:27.287020 osd.0 [WRN] Monitor daemon marked osd.0 down, but it is still running
> 2017-10-17 16:16:30.500686 mon.alantra_fs-02 [INF] osd.1 marked down after no beacon for 25.007173 seconds
> 2017-10-17 16:16:30.501217 mon.alantra_fs-02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
> 2017-10-17 16:16:30.501250 mon.alantra_fs-02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:16:30.532618 mon.alantra_fs-02 [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; Degraded data redundancy: 62 pgs unclean
> 2017-10-17 16:16:34.869504 mon.alantra_fs-02 [WRN] Health check update: Degraded data redundancy: 40177/80354 objects degraded (50.000%), 63 pgs unclean, 192 pgs degraded (PG_DEGRADED)
> 2017-10-17 16:16:34.192978 osd.1 [WRN] Monitor daemon marked osd.1 down, but it is still running
> 2017-10-17 16:16:55.505503 mon.alantra_fs-02 [INF] osd.0 marked down after no beacon for 25.004803 seconds
> 2017-10-17 16:16:55.506192 mon.alantra_fs-02 [WRN] Health check update: 2 osds down (OSD_DOWN)
> 2017-10-17 16:16:55.506223 mon.alantra_fs-02 [WRN] Health check update: 3 hosts (2 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:16:55.506242 mon.alantra_fs-02 [WRN] Health check failed: 1 root (2 osds) down (OSD_ROOT_DOWN)
> 2017-10-17 16:16:56.530112 mon.alantra_fs-02 [INF] Health check cleared: OSD_ROOT_DOWN (was: 1 root (2 osds) down)
> 2017-10-17 16:16:56.554446 mon.alantra_fs-02 [INF] osd.0 10.20.1.109:6801/3319 boot
> 2017-10-17 16:16:55.542656 osd.0 [WRN] Monitor daemon marked osd.0 down, but it is still running
> 2017-10-17 16:17:00.524161 mon.alantra_fs-02 [WRN] Health check update: 1 osds down (OSD_DOWN)
> 2017-10-17 16:17:00.524217 mon.alantra_fs-02 [WRN] Health check update: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:17:00.553635 mon.alantra_fs-02 [INF] mon.1 10.20.1.109:6789/0
> 2017-10-17 16:17:00.553691 mon.alantra_fs-02 [INF] mon.2 10.20.1.216:6789/0
> 2017-10-17 16:17:16.885662 mon.alantra_fs-02 [WRN] Health check update: Degraded data redundancy: 40177/80354 objects degraded (50.000%), 96 pgs unclean, 192 pgs degraded (PG_DEGRADED)
> 2017-10-17 16:17:25.528348 mon.alantra_fs-02 [INF] osd.0 marked down after no beacon for 25.004060 seconds
> 2017-10-17 16:17:25.528960 mon.alantra_fs-02 [WRN] Health check update: 2 osds down (OSD_DOWN)
> 2017-10-17 16:17:25.528991 mon.alantra_fs-02 [WRN] Health check update: 3 hosts (2 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:17:25.529011 mon.alantra_fs-02 [WRN] Health check failed: 1 root (2 osds) down (OSD_ROOT_DOWN)
> 2017-10-17 16:17:26.544228 mon.alantra_fs-02 [INF] Health check cleared: OSD_ROOT_DOWN (was: 1 root (2 osds) down)
> 2017-10-17 16:17:26.568819 mon.alantra_fs-02 [INF] osd.0 10.20.1.109:6801/3319 boot
> 2017-10-17 16:17:25.557037 osd.0 [WRN] Monitor daemon marked osd.0 down, but it is still running
> 2017-10-17 16:17:30.532840 mon.alantra_fs-02 [WRN] overall HEALTH_WARN 1 osds down; 1 host (1 osds) down; Degraded data redundancy: 40177/80354 objects degraded (50.000%), 96 pgs unclean, 192 pgs degraded
> 2017-10-17 16:17:30.538294 mon.alantra_fs-02 [WRN] Health check update: 1 osds down (OSD_DOWN)
> 2017-10-17 16:17:30.538333 mon.alantra_fs-02 [WRN] Health check update: 1 host (1 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:17:31.602434 mon.alantra_fs-02 [WRN] Health check update: Degraded data redundancy: 40177/80354 objects degraded (50.000%), 192 pgs unclean, 192 pgs degraded (PG_DEGRADED)
> 2017-10-17 16:17:55.540005 mon.alantra_fs-02 [INF] osd.0 marked down after no beacon for 25.001599 seconds
> 2017-10-17 16:17:55.540538 mon.alantra_fs-02 [WRN] Health check update: 2 osds down (OSD_DOWN)
> 2017-10-17 16:17:55.540562 mon.alantra_fs-02 [WRN] Health check update: 3 hosts (2 osds) down (OSD_HOST_DOWN)
> 2017-10-17 16:17:55.540585 mon.alantra_fs-02 [WRN] Health check failed: 1 root (2 osds) down (OSD_ROOT_DOWN)
> 2017-10-17 16:18:28.916734 mon.alantra_fs-02 [WRN] Health check update: Degraded data redundancy: 40177/80354 objects degraded (50.000%), 192 pgs unclean, 192 pgs degraded, 192 pgs undersized (PG_DEGRADED)
> 2017-10-17 16:18:30.533096 mon.alantra_fs-02 [WRN] overall HEALTH_WARN 2 osds down; 3 hosts (2 osds) down; 1 root (2 osds) down; Degraded data redundancy: 40177/80354 objects degraded (50.000%), 192 pgs unclean, 192 pgs degraded, 192 pgs undersized
> 2017-10-17 16:18:56.929295 mon.alantra_fs-02 [WRN] Health check failed: Reduced data availability: 192 pgs stale (PG_AVAILABILITY)
>
> ------------------------------------------------------------------------
>
> ceph.conf
>
> [global]
> fsid = 5f8e66b5-1adc-4930-b5d8-c0f44dc2037e
> mon_initial_members = alantra_fs-01, alantra_fs-02, alantra_fs-03
> mon_host = 10.20.1.109,10.20.1.97,10.20.1.216
> public_network = 10.20.1.0/24
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> ##
> ### OSD
> ##
> [osd]
> osd_mon_heartbeat_interval = 5
> osd_mon_report_interval_max = 10
> osd_heartbeat_grace = 10
> osd_fast_fail_on_connection_refused = True
> osd_pool_default_pg_num = 128
> osd_pool_default_pgp_num = 128
> osd_pool_default_size = 2
> osd_pool_default_min_size = 2
>
> ##
> ### Monitors
> ##
> [mon]
> mon_allow_pool_delete = false
> mon_osd_report_timeout = 25
> mon_osd_min_down_reporters = 1
>
> [mon.alantra_fs-01]
> host = alantra_fs-01
> mon_addr = 10.20.1.109:6789
>
> [mon.alantra_fs-02]
> host = alantra_fs-02
> mon_addr = 10.20.1.97:6789
>
> [mon.alantra_fs-03]
> host = alantra_fs-03
> mon_addr = 10.20.1.216:6789
>
> ##
> ### MDS
> ##
> [mds]
> mds_cache_size = 250000
>
> ##
> ### Client
> ##
> [client]
> client_cache_size = 32768
> client_mount_timeout = 30
> client_oc_max_objects = 2000
> client_oc_size = 629145600
> rbd_cache = true
> rbd_cache_size = 671088640
>
> ------------------------------------------------------------------------
>
> For now I've added the nodown flag to keep all OSDs online, and everything
> is working fine, but this is not the best way to do it.
>
> Does someone know how to fix this problem? Maybe this release needs new
> ports opened on the firewall?
>
> Thanks!!
>
> --
> _________________________________________
>
> Daniel Carrasco Marín
> Ingeniería para la Innovación i2TIC, S.L.
> Tlf: +34 911 12 32 84 Ext: 223
> www.i2tic.com <http://www.i2tic.com/>
> _________________________________________


--
_________________________________________
Daniel Carrasco Marín
Ingeniería para la Innovación i2TIC, S.L.
Tlf: +34 911 12 32 84 Ext: 223
www.i2tic.com
_________________________________________
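P.S. For anyone who finds this thread in the archives: the temporary "nodown" workaround mentioned in the quoted message is just the cluster-wide flag, so it can be set and removed with something like:

    # Tell the monitors not to mark OSDs down (temporary debugging aid only)
    ceph osd set nodown

    # Remove the flag again once the real cause has been fixed
    ceph osd unset nodown

Keep in mind that while nodown is set the monitors will not mark OSDs down even if they really die, so it hides genuine failures and should not stay set permanently.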
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com