[ceph-users] Degraded data redundancy (low space): 1 pg backfill_toofull

2018-07-28 Thread Sebastian Igerl
 Hi,

I added 4 more OSDs on my 4-node test cluster and now I'm in HEALTH_ERR
state. Right now it's still recovering, but still, should this happen? None
of my OSDs are full. Maybe I need more PGs? But since my %USE is < 40%,
shouldn't it be able to recover without going into HEALTH_ERR?

  data:
pools:   7 pools, 484 pgs
objects: 2.70 M objects, 10 TiB
usage:   31 TiB used, 114 TiB / 146 TiB avail
pgs: 2422839/8095065 objects misplaced (29.930%)
 343 active+clean
 101 active+remapped+backfill_wait
 39  active+remapped+backfilling
 1   active+remapped+backfill_wait+backfill_toofull

  io:
recovery: 315 MiB/s, 78 objects/s





ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 0   hdd 2.72890  1.0 2.7 TiB 975 GiB 1.8 TiB 34.89 1.62  31
 1   hdd 2.72899  1.0 2.7 TiB 643 GiB 2.1 TiB 23.00 1.07  36
 8   hdd 7.27739  1.0 7.3 TiB 1.7 TiB 5.5 TiB 23.85 1.11  83
12   hdd 7.27730  1.0 7.3 TiB 1.1 TiB 6.2 TiB 14.85 0.69  81
16   hdd 7.27730  1.0 7.3 TiB 2.0 TiB 5.3 TiB 27.68 1.29  74
20   hdd 9.09569  1.0 9.1 TiB 108 GiB 9.0 TiB  1.16 0.05  43
 2   hdd 2.72899  1.0 2.7 TiB 878 GiB 1.9 TiB 31.40 1.46  36
 3   hdd 2.72899  1.0 2.7 TiB 783 GiB 2.0 TiB 28.02 1.30  39
 9   hdd 7.27739  1.0 7.3 TiB 2.0 TiB 5.3 TiB 27.58 1.28  85
13   hdd 7.27730  1.0 7.3 TiB 2.2 TiB 5.1 TiB 30.10 1.40  78
17   hdd 7.27730  1.0 7.3 TiB 2.1 TiB 5.2 TiB 28.23 1.31  84
21   hdd 9.09569  1.0 9.1 TiB 192 GiB 8.9 TiB  2.06 0.10  41
 4   hdd 2.72899  1.0 2.7 TiB 927 GiB 1.8 TiB 33.18 1.54  34
 5   hdd 2.72899  1.0 2.7 TiB 1.0 TiB 1.7 TiB 37.57 1.75  28
10   hdd 7.27739  1.0 7.3 TiB 2.2 TiB 5.0 TiB 30.66 1.43  87
14   hdd 7.27730  1.0 7.3 TiB 1.8 TiB 5.5 TiB 24.23 1.13  89
18   hdd 7.27730  1.0 7.3 TiB 2.5 TiB 4.8 TiB 33.83 1.57  93
22   hdd 9.09569  1.0 9.1 TiB 210 GiB 8.9 TiB  2.26 0.10  44
 6   hdd 2.72899  1.0 2.7 TiB 350 GiB 2.4 TiB 12.51 0.58  21
 7   hdd 2.72899  1.0 2.7 TiB 980 GiB 1.8 TiB 35.07 1.63  35
11   hdd 7.27739  1.0 7.3 TiB 2.8 TiB 4.4 TiB 39.14 1.82  99
15   hdd 7.27730  1.0 7.3 TiB 1.6 TiB 5.6 TiB 22.49 1.05  82
19   hdd 7.27730  1.0 7.3 TiB 2.1 TiB 5.2 TiB 28.49 1.32  77
23   hdd 9.09569  1.0 9.1 TiB 285 GiB 8.8 TiB  3.06 0.14  52
TOTAL 146 TiB  31 TiB 114 TiB 21.51
MIN/MAX VAR: 0.05/1.82  STDDEV: 11.78




Right after adding the OSDs it showed degraded for a few minutes. Since all
my pools have a replication of 3 and I'm only adding OSDs, I'm a bit confused
why this happens. I get why objects are misplaced, but why undersized and degraded?

pgs: 4611/8095032 objects degraded (0.057%)
 2626460/8095032 objects misplaced (32.445%)
 215 active+clean
 192 active+remapped+backfill_wait
 26  active+recovering+undersized+remapped
 17  active+recovery_wait+undersized+degraded+remapped
 16  active+recovering
 11  active+recovery_wait+degraded
 6   active+remapped+backfilling
 1   active+remapped+backfill_toofull


Maybe someone can give me some pointers on what I'm missing to understand
what's happening here?

Thanks!

Sebastian


Re: [ceph-users] Degraded data redundancy (low space): 1 pg backfill_toofull

2018-07-28 Thread Sinan Polat
Ceph has tried to (re)balance your data; backfill_toofull means there is no
space available on the backfill target OSD to move the data to, even though
overall you have plenty of space.
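
If it helps, this is roughly how I would check which threshold and which OSD
the warning is about (just a sketch; the exact state filter syntax for
'ceph pg ls' may differ slightly between releases):

  # thresholds the cluster currently uses
  ceph osd dump | grep ratio
  # the PG(s) flagged backfill_toofull and the OSDs they map to
  ceph pg ls backfill_toofull
  # per-OSD fill level in CRUSH-tree order
  ceph osd df tree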

Why do you have so few PGs? I would increase the number of PGs, but before
doing so let's see what others say.

Sinan

> On 28 Jul 2018 at 11:50, Sebastian Igerl  wrote:
> [...]


Re: [ceph-users] Degraded data redundancy (low space): 1 pg backfill_toofull

2018-07-28 Thread Sebastian Igerl
I set up my test cluster many years ago with only 3 OSDs and never
increased the PGs :-) I plan on doing so after it's healthy again... it's
long overdue... maybe 512 :-)
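
Roughly what I have in mind (the pool name is just an example; on Mimic,
pg_num and pgp_num still need to be raised separately):

  ceph osd pool get rbd pg_num
  ceph osd pool set rbd pg_num 512
  ceph osd pool set rbd pgp_num 512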

And yes, that's what I thought too... it should have more than enough space
to move the data... hmm.

I wouldn't be surprised if it fixes itself after recovery... but it would
still be nice to know what's going on.

And the initial degraded still confuses me...

By the way, I'm on Mimic :-) the latest version as of today, 13.2.1.

Sebastian


On Sat, Jul 28, 2018 at 12:03 PM Sinan Polat  wrote:

> [...]


Re: [ceph-users] Degraded data redundancy (low space): 1 pg backfill_toofull

2018-07-28 Thread Sebastian Igerl
Well... it repaired itself... hmm, still strange :-)

[INF] Health check cleared: PG_DEGRADED_FULL (was: Degraded data redundancy
(low space): 1 pg backfill_toofull)

On Sat, Jul 28, 2018 at 12:03 PM Sinan Polat  wrote:

> [...]


[ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread ceph . novice
Dear users and developers.
 
I've updated our dev cluster from v13.2.0 to v13.2.1 yesterday and since then
everything is badly broken.
I've restarted all Ceph components via "systemctl" and also rebooted the servers
SDS21 and SDS24; nothing changed.

This cluster started as Kraken, was updated to Luminous (up to v12.2.5) and 
then to Mimic.

Here is some system-related info, see
https://semestriel.framapad.org/p/DTkBspmnfU

Somehow I guess this may have to do with the various "ceph-disk",
"ceph-volume", "ceph-lvm" changes in recent months?!?

Thanks & regards
 Anton

--

 

Sent: Saturday, 28 July 2018 at 00:22
From: "Bryan Stillwell" 
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] v13.2.1 Mimic released

I decided to upgrade my home cluster from Luminous (v12.2.7) to Mimic (v13.2.1)
today and ran into a couple of issues:

1. When restarting the OSDs during the upgrade, the cluster seems to forget my
upmap settings.  I had to manually return them to the way they were with
commands like:
 
ceph osd pg-upmap-items 5.1 11 18 8 6 9 0
ceph osd pg-upmap-items 5.1f 11 17
 
I also saw this when upgrading from v12.2.5 to v12.2.7.
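 
A rough workaround I'm considering for the next upgrade (untested): dump the
upmap entries beforehand so they can be re-applied if they get dropped again:
 
ceph osd dump | grep pg_upmap_items > upmap-before-upgrade.txt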
 
2. Also after restarting the first OSD during the upgrade I saw 21 messages 
like these in ceph.log:
 
2018-07-27 15:53:49.868552 osd.1 osd.1 10.0.0.207:6806/4029643 97 : cluster [WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.922365 osd.6 osd.6 10.0.0.16:6804/90400 25 : cluster [WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.925585 osd.6 osd.6 10.0.0.16:6804/90400 26 : cluster [WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.944414 osd.18 osd.18 10.0.0.15:6808/120845 8 : cluster [WRN] failed to encode map e100467 with expected crc
2018-07-27 15:53:49.944756 osd.17 osd.17 10.0.0.15:6800/120749 13 : cluster [WRN] failed to encode map e100467 with expected crc
 
Is this a sign that full OSD maps were sent out by the mons to every OSD like
back in the Hammer days? I seem to remember that OSD maps should be a lot
smaller now, so maybe this isn't as big of a problem as it was back then?
 
Thanks,
Bryan
 

From: ceph-users  on behalf of Sage Weil 

Date: Friday, July 27, 2018 at 1:25 PM
To: "ceph-annou...@lists.ceph.com" , 
"ceph-users@lists.ceph.com" , 
"ceph-maintain...@lists.ceph.com" , 
"ceph-de...@vger.kernel.org" 
Subject: [ceph-users] v13.2.1 Mimic released

 

This is the first bugfix release of the Mimic v13.2.x long term stable release
series. This release contains many fixes across all components of Ceph,
including a few security fixes. We recommend that all users upgrade.

Notable Changes
---------------

* CVE 2018-1128: auth: cephx authorizer subject to replay attack (issue#24836
http://tracker.ceph.com/issues/24836, Sage Weil)

* CVE 2018-1129: auth: cephx signature check is weak (issue#24837
http://tracker.ceph.com/issues/24837, Sage Weil)

* CVE 2018-10861: mon: auth checks not correct for pool ops (issue#24838
http://tracker.ceph.com/issues/24838, Jason Dillaman)

Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread Sage Weil
Can you include more of your osd log file?

On July 28, 2018 9:46:16 AM CDT, ceph.nov...@habmalnefrage.de wrote:
>Dear users and developers.
>
>I've updated our dev-cluster from v13.2.0 to v13.2.1 yesterday and
>since then everything is badly broken.
>[...]

Re: [ceph-users] Slack-IRC integration

2018-07-28 Thread Alex Gorbachev
-- Forwarded message --
From: Matt.Brown 


Can you please add me to the ceph-storage slack channel?  Thanks!


Me too, please

--
Alex Gorbachev
Storcium


- Matt Brown | Lead Engineer | Infrastructure Services – Cloud &
Compute | Target | 7000 Target Pkwy N., NCE-0706 | Brooklyn Park, MN
55445 | 612.304.4956




Re: [ceph-users] Setting up Ceph on EC2 i3 instances

2018-07-28 Thread Sean Redmond
Hi,

You may need to consider the latency between the AZs; it may make it
difficult to get very high IOPS - I suspect that is the reason EBS is
replicated within a single AZ.

Do you have any data that shows the latency between the AZs?
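
If you don't have numbers yet, something quick and dirty like this would
already be telling (the host and pool names are placeholders):

  ping -c 100 host-in-other-az | tail -n 1       # raw network RTT between AZs
  rados bench -p testpool 30 write -b 4096 -t 1  # single-threaded 4K write latency through Ceph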

Thanks

On Sat, 28 Jul 2018, 05:52 Mansoor Ahmed,  wrote:

> Hello,
>
> We are working on setting up Ceph on AWS i3 instances that have NVMe SSD
> as instance store to create our own EBS that spans multiple availability
> zones. We want to achieve better performance compared to EBS with
> provisioned IOPS.
>
> I thought it would be good to reach out to the community to see if anyone
> has done this, would advise against it, or has any other advice that could
> be of help.
>
> Thank you for your help in advance.
>
> Regards
> Mansoor
>


[ceph-users] Upgrade Ceph 13.2.0 -> 13.2.1 and Windows iSCSI clients breakup

2018-07-28 Thread Wladimir Mutel

Dear all,

	I want to share some experience of upgrading my experimental 1-host
Ceph cluster from v13.2.0 to v13.2.1.
First, I fetched the new packages and installed them using 'apt
dist-upgrade', which went smoothly as usual.
Then I noticed from 'lsof' that the Ceph daemons had not been restarted after
the upgrade ('ceph osd versions' still showed 13.2.0).
Following the Luminous->Mimic upgrade instructions, I decided to restart the
ceph-{mon,mgr,osd}.target units.
And sure enough, on restarting ceph-osd.target, the iSCSI sessions broke on
the tcmu-runner side ('Timing out cmd', 'Handler connection lost'), and the
Windows (2008 R2) clients lost their iSCSI devices.

But that was only the beginning of the surprises that followed.
Looking into Windows Disk Management, I noticed that the iSCSI disks were
re-detected with a size about 0.12 GB larger, i.e. 2794.52 GB instead of
2794.40 GB, and of course the system lost sight of their GPT labels. I
quickly checked 'rbd info' on the Ceph side and did not notice any increase
in the RBD images; they were still exactly 715398 4 MB objects, as I
intended initially.
Restarting the iSCSI initiator service on Windows did not help. Restarting
the whole Windows machine did not help. Restarting tcmu-runner on the Ceph
side did not help. What resolved the problem, to my great surprise, was
removing/re-adding the MPIO feature and re-adding iSCSI multipath support.
After that, Windows detected the iSCSI disks with the proper size again and
restored visibility of the GPT partitions, dynamic disk metadata and all the
rest.


OK, I avoided data loss this time, but I have some remaining questions:

1. Can Ceph minor version upgrades be made less disruptive and traumatic?
For example, some kind of staged/rolling OSD daemon restart within a single
upgraded host, without losing librbd sessions?
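
(What I was hoping for is something like the following sketch, assuming
systemd-managed OSDs; the OSD ids 0-3 are just my single host's ids:)

  ceph osd set noout
  for id in 0 1 2 3; do
      systemctl restart ceph-osd@$id
      sleep 10
      # noout keeps overall health at WARN, so wait for PG_DEGRADED to clear
      # instead of waiting for HEALTH_OK
      while ceph health detail | grep -q PG_DEGRADED; do sleep 10; done
  done
  ceph osd unset noout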


2. Is Windows (2008 R2) MPIO support really that screwed up and crippled?
Were there any improvements in Win2012/2016? I have physical servers with
Windows 2008 R2, and I would like to mirror their volumes to Ceph iSCSI
targets, then convert them into QEMU/KVM virtual machines where the same
data will be accessed with librbd. During my initial experiments, I found
that reinstalling MPIO and re-enabling iSCSI multipath would fix most
problems with Windows iSCSI access, but I would like a faster way of
resetting the iSCSI+MPIO state when something goes wrong on the Windows
side, as in my case.


3. Does anybody have an idea where these extra 0.12 GB (probably 120 or
128 MB) came from?


Thank you in advance for your responses.


Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread ceph . novice
Hi Sage.

Sure. Any specific OSD(s) log(s)? Or just any?

Sent: Saturday, 28 July 2018 at 16:49
From: "Sage Weil" 
To: ceph.nov...@habmalnefrage.de, ceph-users@lists.ceph.com,
ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

Can you include more of your osd log file?
 


[ceph-users] Help needed to recover from cache tier OSD crash

2018-07-28 Thread Dmitry
Hello all,

Would someone please help with recovering from a recent failure of all cache
tier pool OSDs?

My Ceph cluster has a usual replica-2 pool with a writeback cache tier over
it, made of two 500 GB SSD OSDs (also replica 2).

Both cache OSDs were created with the standard ceph-deploy tool and have two
partitions (one journal and one XFS data partition).

The target_max_bytes parameter for this cache pool was set to 70% of the size
of a single SSD to avoid overflow. This configuration worked fine for years.

But recently, for some unknown reason, when exporting a large 300 GB raw RBD
image with the 'rbd export' command, both cache OSDs got 100% full and crashed.

In an attempt to flush all the data from the cache to the underlying pool and
avoid further damage, I switched the cache pool into 'forward' mode and
restarted both cache OSDs.

Both ran for a few minutes, segfaulted again, and do not start anymore.
Debugging the crash errors, I found that they are related to decoding
attributes.

When I checked random object files and directories on the affected OSDs with
'getfattr -d', I discovered that NO extended attributes exist at all anymore.
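
For reference, this is roughly how I checked (the OSD path and PG id are
examples from my setup; on a healthy FileStore OSD I would expect attributes
like user.ceph._ and user.ceph.snapset to show up):

  getfattr -d -m '.*' /var/lib/ceph/osd/ceph-12/current/3.1f_head/<object file>
  # -> comes back empty on the affected OSDs

If the on-disk object data itself is still intact, I was also thinking of
trying to export the cache PGs with ceph-objectstore-tool while the OSD is
stopped, e.g.:

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --op export --pgid 3.1f --file /backup/3.1f.export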

So I suspect that, due to the filesystems getting 100% full and the OSD
daemons being restarted several times, the XFS filesystems were somehow
corrupted and lost the extended attributes that Ceph requires to operate.


The question is: is it possible to somehow recover the attributes, or to
flush the cached data back to the cold storage pool?

Would someone advise or help to recover the data please?

—
Regards,
Dmitry



Re: [ceph-users] Slack-IRC integration

2018-07-28 Thread Dan van der Ster
It's here: https://ceph-storage.slack.com/ but for some reason the list of
accepted email domains is limited. I have no idea who is maintaining this.

Anyway, the Slack is just mirroring #ceph and #ceph-devel on IRC, so it's
better to connect there directly.

Cheers, Dan




On Sat, Jul 28, 2018, 6:59 PM Alex Gorbachev  wrote:

> [...]


Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread ceph . novice
Have you guys changed something in the systemd startup of the OSDs?

I've stopped and disabled all the OSDs on all my hosts via "systemctl
stop|disable ceph-osd.target" and rebooted all the nodes. Everything looks
just the same.
Then I started all the OSD daemons one after the other from the CLI with
"/usr/bin/ceph-osd -f --cluster ceph --id $NR --setuser ceph --setgroup ceph >
/tmp/osd.${NR}.log 2>&1 &", and now everything (OK, besides the Zabbix mgr
module?!?) seems to work :|
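
Next I will probably try to put them back under systemd instead of running
them in the foreground; roughly like this (after killing the foreground
processes; the OSD ids are taken from /var/lib/ceph/osd on each host):

  for NR in $(ls /var/lib/ceph/osd | sed 's/ceph-//'); do
      systemctl enable --now ceph-osd@$NR
  done
  # or, if the OSDs were deployed with ceph-volume:
  ceph-volume lvm activate --all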


  cluster:
id: 2a919338-4e44-454f-bf45-e94a01c2a5e6
health: HEALTH_WARN
Failed to send data to Zabbix

  services:
mon: 3 daemons, quorum sds20,sds21,sds22
mgr: sds22(active), standbys: sds20, sds21
osd: 18 osds: 18 up, 18 in
rgw: 4 daemons active

  data:
pools:   25 pools, 1390 pgs
objects: 2.55 k objects, 3.4 GiB
usage:   26 GiB used, 8.8 TiB / 8.8 TiB avail
pgs: 1390 active+clean

  io:
client:   11 KiB/s rd, 10 op/s rd, 0 op/s wr

Any hints?

--
 

Sent: Saturday, 28 July 2018 at 23:35
From: ceph.nov...@habmalnefrage.de
To: "Sage Weil" 
Cc: ceph-users@lists.ceph.com, ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")
Hi Sage.

Sure. Any specific OSD(s) log(s)? Or just any?

Sent: Saturday, 28 July 2018 at 16:49
From: "Sage Weil" 
To: ceph.nov...@habmalnefrage.de, ceph-users@lists.ceph.com,
ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

Can you include more of your osd log file?
 


Re: [ceph-users] HELP! --> CLUSER DOWN (was "v13.2.1 Mimic released")

2018-07-28 Thread Vasu Kulkarni
On Sat, Jul 28, 2018 at 6:02 PM,   wrote:
> Have you guys changed something with the systemctl startup of the OSDs?
I think there is some kind of systemd issue hidden in mimic,
https://tracker.ceph.com/issues/25004

> [...]