[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-04 Thread Vitaliy Filippov

Hi,

Try to repeat your test with numjobs=1; I've already seen strange
behaviour with parallel jobs against one RBD image.
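
For a quick check, the job file from your mail below can be reused with only
numjobs changed (a sketch; ioengine=rbd is taken from the fio result header,
anything else is assumed to be supplied the same way as in your original run):

[global]
pool=benchmarks
rbdname=disk-1
ioengine=rbd
direct=1
# single writer instead of 2
numjobs=1
iodepth=1
blocksize=4k
group_reporting=1
[writer]
readwrite=write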


Also as usual: https://yourcmc.ru/wiki/Ceph_performance :-)


Hi,

We have a production cluster of 27 OSDs across 5 servers (all SSDs
running bluestore), and have started to notice a possible performance issue.


In order to isolate the problem, we built a single server with a single
OSD, and ran a few FIO tests. The results are puzzling, not that we were
expecting good performance on a single OSD.

In short, with a sequential write test, we are seeing huge numbers of reads
hitting the actual SSD.

Key FIO parameters are:

[global]
pool=benchmarks
rbdname=disk-1
direct=1
numjobs=2
iodepth=1
blocksize=4k
group_reporting=1
[writer]
readwrite=write

iostat results are:
Device:         rrqm/s   wrqm/s     r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00   105.00 4896.00  294.00  312080.00  1696.00   120.92    17.25    3.35    3.55    0.02   0.02  12.60

There are nearly ~5000 reads/second (~300 MB/sec), compared with only ~300
writes (~1.5 MB/sec), when we are doing a sequential write test? The system
is otherwise idle, with no other workload.

Running the same fio test with only 1 thread (numjobs=1) still shows a high
number of reads (110).

Device:         rrqm/s   wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00  1281.00  110.00 1463.00  440.00 12624.00    16.61     0.03    0.02    0.05    0.02   0.02   3.40

Can anyone kindly offer any comments on why we are seeing this behaviour?

I can understand if there's the occasional read here and there if
RocksDB/WAL entries need to be read from disk during the sequential write
test, but this seems significantly high and unusual.

FIO results (numjobs=2)
writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
...
fio-3.7
Starting 2 processes
Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta 01m:00s]
writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb  3 22:46:16 2020
  write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
    slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
    clat (msec): min=2, max=210, avg=58.32, stdev=70.54
     lat (msec): min=2, max=210, avg=58.35, stdev=70.54
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[   54], 60.00th=[   62],
     | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[  194],
     | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[  209],
     | 99.99th=[  211]
   bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08, stdev=38.22, samples=239
   iops        : min=    6, max=   36, avg=16.97, stdev= 9.55, samples=239
  lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
  cpu          : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
  IO depths    : 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s), io=8228KiB (8425kB), run=60038-60038msec
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bluestore cache parameter precedence

2020-02-04 Thread Boris Epstein
Hello list,

As stated in this document:

https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

there are multiple parameters defining cache limits for BlueStore. You have
bluestore_cache_size (presumably controlling the cache size),
bluestore_cache_size_hdd (presumably doing the same for HDD storage only)
and bluestore_cache_size_ssd (presumably being the equivalent for SSD). My
question is: does bluestore_cache_size override the disk-specific
parameters, or do I need to set the disk-specific (or, rather,
storage-type-specific) ones separately if I want to keep them at a certain value?
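
For what it's worth, this is how I plan to check what an OSD has actually
resolved for these options (a sketch; osd.0 is just an example id, and "ceph
config get" assumes Mimic or newer):

# values as seen by a running OSD:
ceph daemon osd.0 config show | grep bluestore_cache_size
# values stored in the monitor config database:
ceph config get osd.0 bluestore_cache_size
ceph config get osd.0 bluestore_cache_size_ssd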

Thanks in advance.

Boris.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recovery_unfound

2020-02-04 Thread Jake Grimmett
Hi Paul,

Many thanks for your helpful suggestions.

Yes, we have 13 pgs with "might_have_unfound" entries.

(There is also 1 pg without "might_have_unfound" entries stuck in the
active+recovery_unfound+degraded+repair state.)

Taking one pg with unfound objects:

[root@ceph1 ~]# ceph health detail | grep  5.5c9
pg 5.5c9 has 2 unfound objects
pg 5.5c9 is active+recovery_unfound+degraded, acting
[347,442,381,215,91,260,31,94,178,302], 2 unfound
pg 5.5c9 is active+recovery_unfound+degraded, acting
[347,442,381,215,91,260,31,94,178,302], 2 unfound
pg 5.5c9 not deep-scrubbed since 2020-01-16 08:05:43.119336
pg 5.5c9 not scrubbed since 2020-01-16 08:05:43.119336

Checking the state:

[root@ceph1 ~]# ceph pg 5.5c9 query | jq .recovery_state
[
  {
"name": "Started/Primary/Active",
"enter_time": "2020-02-03 09:57:30.982038",
"might_have_unfound": [
  {
"osd": "31(6)",
"status": "already probed"
  },
  {
"osd": "91(4)",
"status": "already probed"
  },
  {
"osd": "94(7)",
"status": "already probed"
  },
  {
"osd": "178(8)",
"status": "already probed"
  },
  {
"osd": "215(3)",
"status": "already probed"
  },
  {
"osd": "260(5)",
"status": "already probed"
  },
  {
"osd": "302(9)",
"status": "already probed"
  },
  {
"osd": "381(2)",
"status": "already probed"
  },
  {
"osd": "442(1)",
"status": "already probed"
  }
],
"recovery_progress": {
  "backfill_targets": [],
  "waiting_on_backfill": [],
  "last_backfill_started": "MIN",
  "backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
  },
  "peer_backfill_info": [],
  "backfills_in_flight": [],
  "recovering": [],
  "pg_backend": {
"recovery_ops": [],
"read_ops": []
  }
},
"scrub": {
  "scrubber.epoch_start": "0",
  "scrubber.active": false,
  "scrubber.state": "INACTIVE",
  "scrubber.start": "MIN",
  "scrubber.end": "MIN",
  "scrubber.max_end": "MIN",
  "scrubber.subset_last_update": "0'0",
  "scrubber.deep": false,
  "scrubber.waiting_on_whom": []
}
  },
  {
"name": "Started",
"enter_time": "2020-02-03 09:57:29.788310"
  }
]

-

Taking your advice, I restart the primary osd for this pg:

[root@ceph1 ~]# ceph osd down 347

This doesn't change the output of "ceph pg 5.5c9 query", apart from
updating the Started time, and ceph health still shows unfound objects.

To fix this, do we need to issue a scrub (or deep scrub) so that the
objects can be found?

Just in case, I've issued a manual scrub:

[root@ceph1 ~]# ceph pg scrub 5.5c9
instructing pg 5.5c9s0 on osd.347 to scrub

The cluster is currently busy deleting snapshots, so it may take a while
before the scrub starts.
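
In the meantime, in case restarting each OSD listed under "might_have_unfound"
(not just the primary) is what's needed, this is how I can pull the full list
(a sketch based on the query output above):

ceph pg 5.5c9 query | jq -r '.recovery_state[0].might_have_unfound[].osd'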

best regards,

Jake

On 2/3/20 6:31 PM, Paul Emmerich wrote:
> This might be related to recent problems with OSDs not being queried
> for unfound objects properly in some cases (which I think was fixed in
> master?)
> 
> Anyways: run ceph pg  query on the affected PGs, check for "might
> have unfound" and try restarting the OSDs mentioned there. Probably
> also sufficient to just run "ceph osd down" on the primaries on the
> affected PGs to get them to re-check.
> 
> 
> Paul
> 


-- 
Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSDs crashing

2020-02-04 Thread Raymond Clotfelter
I have 30 or so OSDs, on a cluster with 240, that just keep crashing. Below is
the last part of one of the log files showing the crash; can anyone please help
me read this to figure out what is going on and how to correct it? When I start
the OSDs they generally seem to work for 5-30 minutes, and then one by one they
will start dropping out with logs similar to this.

Thanks.

   -29> 2020-02-04 06:00:23.447 7fe300d41700  5 osd.168 pg_epoch: 459335 
pg[6.1217s2( v 443432'5 (0'0,443432'5] local-lis/les=459328/459329 n=1 
ec=260267/6574 lis/c 459331/428950 les/c/f 459332/440468/290442 
459333/459334/459294) 
[2147483647,107,168,2147483647,102]/[81,107,168,89,102]p81(0) r=2 lpr=459334 
pi=[428950,459334)/82 crt=443432'5 lcod 0'0 remapped NOTIFY mbc={}] exit 
Started/Stray 1.017145 6 0.000323
   -28> 2020-02-04 06:00:23.447 7fe300d41700  5 osd.168 pg_epoch: 459335 
pg[6.1217s2( v 443432'5 (0'0,443432'5] local-lis/les=459328/459329 n=1 
ec=260267/6574 lis/c 459331/428950 les/c/f 459332/440468/290442 
459333/459334/459294) 
[2147483647,107,168,2147483647,102]/[81,107,168,89,102]p81(0) r=2 lpr=459334 
pi=[428950,459334)/82 crt=443432'5 lcod 0'0 remapped NOTIFY mbc={}] enter 
Started/ReplicaActive
   -27> 2020-02-04 06:00:23.447 7fe300d41700  5 osd.168 pg_epoch: 459335 
pg[6.1217s2( v 443432'5 (0'0,443432'5] local-lis/les=459328/459329 n=1 
ec=260267/6574 lis/c 459331/428950 les/c/f 459332/440468/290442 
459333/459334/459294) 
[2147483647,107,168,2147483647,102]/[81,107,168,89,102]p81(0) r=2 lpr=459334 
pi=[428950,459334)/82 crt=443432'5 lcod 0'0 remapped NOTIFY mbc={}] enter 
Started/ReplicaActive/RepNotRecovering
   -26> 2020-02-04 06:00:23.455 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -25> 2020-02-04 06:00:23.455 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -24> 2020-02-04 06:00:23.455 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -23> 2020-02-04 06:00:23.459 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -22> 2020-02-04 06:00:23.459 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -21> 2020-02-04 06:00:23.459 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -20> 2020-02-04 06:00:23.459 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -19> 2020-02-04 06:00:23.463 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -18> 2020-02-04 06:00:23.463 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -17> 2020-02-04 06:00:23.471 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -16> 2020-02-04 06:00:23.471 7fe309d53700  3 osd.168 459335 handle_osd_map 
epochs [459335,459335], i have 459335, src has [403399,459335]
   -15> 2020-02-04 06:00:23.471 7fe300d41700  5 osd.168 pg_epoch: 459335 
pg[6.1217s2( v 443432'5 (0'0,443432'5] local-lis/les=459334/459335 n=1 
ec=260267/6574 lis/c 459334/428950 les/c/f 459335/440468/290442 
459333/459334/459294) 
[2147483647,107,168,2147483647,102]/[81,107,168,89,102]p81(0) r=2 lpr=459334 
pi=[428950,459334)/82 luod=0'0 crt=443432'5 lcod 0'0 active+remapped mbc={}] 
exit Started/ReplicaActive/RepNotRecovering 0.021923 2 0.98
   -14> 2020-02-04 06:00:23.471 7fe300d41700  5 osd.168 pg_epoch: 459335 
pg[6.1217s2( v 443432'5 (0'0,443432'5] local-lis/les=459334/459335 n=1 
ec=260267/6574 lis/c 459334/428950 les/c/f 459335/440468/290442 
459333/459334/459294) 
[2147483647,107,168,2147483647,102]/[81,107,168,89,102]p81(0) r=2 lpr=459334 
pi=[428950,459334)/82 luod=0'0 crt=443432'5 lcod 0'0 active+remapped mbc={}] 
enter Started/ReplicaActive/RepWaitRecoveryReserved
   -13> 2020-02-04 06:00:23.471 7fe300d41700  5 osd.168 pg_epoch: 459335 
pg[6.1217s2( v 443432'5 (0'0,443432'5] local-lis/les=459334/459335 n=1 
ec=260267/6574 lis/c 459334/428950 les/c/f 459335/440468/290442 
459333/459334/459294) 
[2147483647,107,168,2147483647,102]/[81,107,168,89,102]p81(0) r=2 lpr=459334 
pi=[428950,459334)/82 luod=0'0 crt=443432'5 lcod 0'0 active+remapped mbc={}] 
exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000137 1 0.80
   -12> 2020-02-04 06:00:23.471 7fe300d41700  5 osd.168 pg_epoch: 459335 
pg[6.1217s2( v 443432'5 (0'0,443432'5] local-lis/les=459334/459335 n=1 
ec=260267/6574 lis/c 459334/428950 les/c/f 459335/440468/290442 
459333/459334/459294) 
[2147483647,107,168,2147483647,102]/[81,107,168,89,102]p81(0) r=2 lpr=459334 
pi=[428950,459334)/82 luod=0'0 crt=443432'5 lcod 0'0 active+remapped mbc={}] 
enter Started/ReplicaActive/RepRecovering
   -11> 202

[ceph-users] Doubt about AVAIL space on df

2020-02-04 Thread German Anders
Hello Everyone,

I would like to understand if this output is right:

*# ceph df*
GLOBAL:
    SIZE    AVAIL   RAW USED %RAW USED
    85.1TiB 43.7TiB 41.4TiB  48.68
POOLS:
    NAME    ID USED    %USED MAX AVAIL OBJECTS
    volumes 13 13.8TiB 64.21 7.68TiB   3620495

I only have (1) pool called 'volumes' which is using 13.8TiB (we have a
replica of 3), so it's actually using 41.4TiB and that would be the RAW
USED; up to this point it's fine. But then it says in the GLOBAL section that
the AVAIL space is 43.7TiB and the %RAW USED is only 48.68%.

So if I use the 7.68TiB of MAX AVAIL and the pool goes up to 100% of usage,
that would not add up to the total space of the cluster, right? I mean, where
are those 43.7TiB of AVAIL space?

I'm using Luminous 12.2.12 release.

Sorry if it's a silly question or if it has been answered before.

Thanks in advance,

Best regards,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Doubt about AVAIL space on df

2020-02-04 Thread EDH - Manuel Rios
Hi German,

Can you post "ceph osd df tree"?

It looks like your usage distribution is not perfectly even, and that's why you
end up with less usable space than the raw numbers suggest.
Regards


-Original Message-
From: German Anders
Sent: Tuesday, February 4, 2020 14:00
To: ceph-us...@ceph.com
Subject: [ceph-users] Doubt about AVAIL space on df

Hello Everyone,

I would like to understand if this output is right:

*# ceph df*
GLOBAL:
SIZEAVAIL   RAW USED %RAW USED
85.1TiB 43.7TiB  41.4TiB 48.68
POOLS:
NAMEID USED%USED MAX AVAIL OBJECTS
volumes 13 13.8TiB 64.21   7.68TiB 3620495

I only have (1) pool called 'volumes' which is using 13.8TiB (we have a replica 
of 3) so it's actually using 41,4TiB and that would be the RAW USED, at this 
point is fine, but, then it said on the GLOBAL section that the AVAIL space is 
43.7TiB and the %RAW USED is only 48.68%.

So if I use the 7.68TiB of MAX AVAIL and the pool goes up to 100% of usage, 
that would not lead to the total space of the cluster, right? I mean were are 
those 43.7TiB of AVAIL space?

I'm using Luminous 12.2.12 release.

Sorry if it's a silly question or if it has been answered before.

Thanks in advance,

Best regards,
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Doubt about AVAIL space on df

2020-02-04 Thread German Anders
Hi Manuel,

Sure thing:

# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 0  nvme 1.0  1.0 1.09TiB  496GiB  622GiB 44.35 0.91 143
 1  nvme 1.0  1.0 1.09TiB  488GiB  630GiB 43.63 0.89 141
 2  nvme 1.0  1.0 1.09TiB  537GiB  581GiB 48.05 0.99 155
 3  nvme 1.0  1.0 1.09TiB  473GiB  644GiB 42.36 0.87 137
 4  nvme 1.0  1.0 1.09TiB  531GiB  587GiB 47.52 0.97 153
 5  nvme 1.0  1.0 1.09TiB  476GiB  642GiB 42.55 0.87 137
 6  nvme 1.0  1.0 1.09TiB  467GiB  651GiB 41.77 0.86 135
 7  nvme 1.0  1.0 1.09TiB  543GiB  574GiB 48.61 1.00 157
 8  nvme 1.0  1.0 1.09TiB  481GiB  636GiB 43.08 0.88 139
 9  nvme 1.0  1.0 1.09TiB  457GiB  660GiB 40.92 0.84 133
10  nvme 1.0  1.0 1.09TiB  513GiB  604GiB 45.92 0.94 148
11  nvme 1.0  1.0 1.09TiB  484GiB  634GiB 43.29 0.89 140
12  nvme 1.0  1.0 1.09TiB  498GiB  620GiB 44.57 0.91 144
13  nvme 1.0  1.0 1.09TiB  560GiB  557GiB 50.13 1.03 162
14  nvme 1.0  1.0 1.09TiB  576GiB  542GiB 51.55 1.06 167
15  nvme 1.0  1.0 1.09TiB  545GiB  572GiB 48.78 1.00 158
16  nvme 1.0  1.0 1.09TiB  537GiB  581GiB 48.02 0.98 155
17  nvme 1.0  1.0 1.09TiB  507GiB  611GiB 45.36 0.93 147
18  nvme 1.0  1.0 1.09TiB  490GiB  628GiB 43.86 0.90 142
19  nvme 1.0  1.0 1.09TiB  533GiB  584GiB 47.72 0.98 155
20  nvme 1.0  1.0 1.09TiB  467GiB  651GiB 41.75 0.86 134
21  nvme 1.0  1.0 1.09TiB  447GiB  671GiB 39.97 0.82 129
22  nvme 1.00099  1.0 1.09TiB  561GiB  557GiB 50.16 1.03 162
23  nvme 1.0  1.0 1.09TiB  441GiB  677GiB 39.46 0.81 127
24  nvme 1.0  1.0 1.09TiB  500GiB  618GiB 44.72 0.92 145
25  nvme 1.0  1.0 1.09TiB  462GiB  656GiB 41.30 0.85 133
26  nvme 1.0  1.0 1.09TiB  445GiB  672GiB 39.85 0.82 129
27  nvme 1.0  1.0 1.09TiB  564GiB  554GiB 50.45 1.03 162
28  nvme 1.0  1.0 1.09TiB  512GiB  605GiB 45.84 0.94 148
29  nvme 1.0  1.0 1.09TiB  553GiB  565GiB 49.49 1.01 160
30  nvme 1.0  1.0 1.09TiB  526GiB  592GiB 47.07 0.97 152
31  nvme 1.0  1.0 1.09TiB  484GiB  633GiB 43.34 0.89 140
32  nvme 1.0  1.0 1.09TiB  504GiB  613GiB 45.13 0.93 146
33  nvme 1.0  1.0 1.09TiB  550GiB  567GiB 49.23 1.01 159
34  nvme 1.0  1.0 1.09TiB  497GiB  620GiB 44.51 0.91 143
35  nvme 1.0  1.0 1.09TiB  457GiB  661GiB 40.88 0.84 132
36  nvme 1.0  1.0 1.09TiB  539GiB  578GiB 48.25 0.99 156
37  nvme 1.0  1.0 1.09TiB  516GiB  601GiB 46.19 0.95 149
38  nvme 1.0  1.0 1.09TiB  518GiB  600GiB 46.35 0.95 149
39  nvme 1.0  1.0 1.09TiB  456GiB  662GiB 40.81 0.84 132
40  nvme 1.0  1.0 1.09TiB  527GiB  591GiB 47.13 0.97 152
41  nvme 1.0  1.0 1.09TiB  536GiB  581GiB 47.98 0.98 155
42  nvme 1.0  1.0 1.09TiB  521GiB  597GiB 46.62 0.96 151
43  nvme 1.0  1.0 1.09TiB  459GiB  659GiB 41.05 0.84 132
44  nvme 1.0  1.0 1.09TiB  549GiB  569GiB 49.12 1.01 158
45  nvme 1.0  1.0 1.09TiB  569GiB  548GiB 50.95 1.04 164
46  nvme 1.0  1.0 1.09TiB  450GiB  668GiB 40.28 0.83 130
47  nvme 1.0  1.0 1.09TiB  491GiB  626GiB 43.97 0.90 142
48  nvme 1.0  1.0  931GiB  551GiB  381GiB 59.13 1.21 159
49  nvme 1.0  1.0  931GiB  469GiB  463GiB 50.34 1.03 136
50  nvme 1.0  1.0  931GiB  548GiB  384GiB 58.78 1.21 158
51  nvme 1.0  1.0  931GiB  380GiB  552GiB 40.79 0.84 109
52  nvme 1.0  1.0  931GiB  486GiB  445GiB 52.20 1.07 141
53  nvme 1.0  1.0  931GiB  502GiB  429GiB 53.93 1.11 146
54  nvme 1.0  1.0  931GiB  479GiB  452GiB 51.42 1.05 139
55  nvme 1.0  1.0  931GiB  521GiB  410GiB 55.93 1.15 150
56  nvme 1.0  1.0  931GiB  570GiB  361GiB 61.25 1.26 165
57  nvme 1.0  1.0  931GiB  404GiB  527GiB 43.43 0.89 117
58  nvme 1.0  1.0  931GiB  455GiB  476GiB 48.89 1.00 132
59  nvme 1.0  1.0  931GiB  535GiB  397GiB 57.39 1.18 154
60  nvme 1.0  1.0  931GiB  499GiB  433GiB 53.56 1.10 144
61  nvme 1.0  1.0  931GiB  446GiB  485GiB 47.92 0.98 129
62  nvme 1.0  1.0  931GiB  505GiB  427GiB 54.18 1.11 146
63  nvme 1.0  1.0  931GiB  563GiB  369GiB 60.39 1.24 162
64  nvme 1.0  1.0  931GiB  605GiB  326GiB 64.99 1.33 175
65  nvme 1.0  1.0  931GiB  476GiB  455GiB 51.10 1.05 138
66  nvme 1.0  1.0  931GiB  460GiB  471GiB 49.38 1.01 133
67  nvme 1.0  1.0  931GiB  483GiB  449GiB 51.82 1.06 140
68  nvme 1.0  1.0  931GiB  520GiB  411GiB 55.86 1.15 151
69  nvme 1.0  1.0  931GiB  481GiB  450GiB 51.64 1.06 139
70  nvme 1.0  1.0  931GiB  505GiB  426GiB 54.24 1.11 146
71  nvme 1.0  1.0  931GiB  576GiB  356GiB 61.81 1.27 166
72  nvme 1.0  1.0  931GiB  552GiB  379GiB 59.30 1.22 160
73  nvme 1.0  1.0  931GiB  442GiB  489GiB 47.47 0.97 128
74  nvme 1.0  1.0  931GiB  450GiB  482GiB 48.28 0.99 130
75  nvme 1.0  1.

[ceph-users] osd_memory_target ignored

2020-02-04 Thread Frank Schilder
I recently upgraded from 13.2.2 to 13.2.8 and observe two changes that I 
struggle with:

- from release notes: The bluestore_cache_* options are no longer needed. They 
are replaced by osd_memory_target, defaulting to 4GB.
- the default for bluestore_allocator has changed from stupid to bitmap,

which seem to conflict with each other, or at least I seem unable to achieve
what I want.

I have a number of OSDs for which I would like to increase the cache size. In 
the past I used bluestore_cache_size=8G and it worked like a charm. I now 
changed that to osd_memory_target=8G without any effect. The usage stays at 4G 
and the virtual size is about 5G. I would expect both to be close to 8G. The 
read cache for these OSDs usually fills up within a few hours. The cluster is 
now running a few days with the new configs to no avail.

The documentation of osd_memory_target refers to tcmalloc a lot. Is this in
conflict with allocator=bitmap? If so, what is the way to tune cache sizes (say,
if tcmalloc is not used, and how do I check that)? Are bluestore_cache_* indeed
obsolete, as the above release notes suggest, or is this not true?

Many thanks for your help.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Doubt about AVAIL space on df

2020-02-04 Thread EDH - Manuel Rios
With “ceph osd df tree” it will be clearer, but right now I can already see that
per-OSD %USE ranges between roughly 44% and 65%.

“ceph osd df tree” also shows the balance at host level.

Do you have the balancer enabled? A less-than-“perfect” distribution means you
cannot use the full space.

In our case we gained space by manually rebalancing disks; that causes some
objects to move to other OSDs, but you get the space back fairly quickly.
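
For reference, on Luminous the balancer module can do this automatically (a
sketch; check the min-compat-client implications for your clients before
enabling upmap):

ceph mgr module enable balancer
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
# or, as a one-off manual step:
ceph osd reweight-by-utilization 110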

Regards


From: German Anders
Sent: Tuesday, February 4, 2020 14:20
To: EDH - Manuel Rios
CC: ceph-us...@ceph.com
Subject: Re: [ceph-users] Doubt about AVAIL space on df

Hi Manuel,

Sure thing:

# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
 0  nvme 1.0  1.0 1.09TiB  496GiB  622GiB 44.35 0.91 143
 1  nvme 1.0  1.0 1.09TiB  488GiB  630GiB 43.63 0.89 141
 2  nvme 1.0  1.0 1.09TiB  537GiB  581GiB 48.05 0.99 155
 3  nvme 1.0  1.0 1.09TiB  473GiB  644GiB 42.36 0.87 137
 4  nvme 1.0  1.0 1.09TiB  531GiB  587GiB 47.52 0.97 153
 5  nvme 1.0  1.0 1.09TiB  476GiB  642GiB 42.55 0.87 137
 6  nvme 1.0  1.0 1.09TiB  467GiB  651GiB 41.77 0.86 135
 7  nvme 1.0  1.0 1.09TiB  543GiB  574GiB 48.61 1.00 157
 8  nvme 1.0  1.0 1.09TiB  481GiB  636GiB 43.08 0.88 139
 9  nvme 1.0  1.0 1.09TiB  457GiB  660GiB 40.92 0.84 133
10  nvme 1.0  1.0 1.09TiB  513GiB  604GiB 45.92 0.94 148
11  nvme 1.0  1.0 1.09TiB  484GiB  634GiB 43.29 0.89 140
12  nvme 1.0  1.0 1.09TiB  498GiB  620GiB 44.57 0.91 144
13  nvme 1.0  1.0 1.09TiB  560GiB  557GiB 50.13 1.03 162
14  nvme 1.0  1.0 1.09TiB  576GiB  542GiB 51.55 1.06 167
15  nvme 1.0  1.0 1.09TiB  545GiB  572GiB 48.78 1.00 158
16  nvme 1.0  1.0 1.09TiB  537GiB  581GiB 48.02 0.98 155
17  nvme 1.0  1.0 1.09TiB  507GiB  611GiB 45.36 0.93 147
18  nvme 1.0  1.0 1.09TiB  490GiB  628GiB 43.86 0.90 142
19  nvme 1.0  1.0 1.09TiB  533GiB  584GiB 47.72 0.98 155
20  nvme 1.0  1.0 1.09TiB  467GiB  651GiB 41.75 0.86 134
21  nvme 1.0  1.0 1.09TiB  447GiB  671GiB 39.97 0.82 129
22  nvme 1.00099  1.0 1.09TiB  561GiB  557GiB 50.16 1.03 162
23  nvme 1.0  1.0 1.09TiB  441GiB  677GiB 39.46 0.81 127
24  nvme 1.0  1.0 1.09TiB  500GiB  618GiB 44.72 0.92 145
25  nvme 1.0  1.0 1.09TiB  462GiB  656GiB 41.30 0.85 133
26  nvme 1.0  1.0 1.09TiB  445GiB  672GiB 39.85 0.82 129
27  nvme 1.0  1.0 1.09TiB  564GiB  554GiB 50.45 1.03 162
28  nvme 1.0  1.0 1.09TiB  512GiB  605GiB 45.84 0.94 148
29  nvme 1.0  1.0 1.09TiB  553GiB  565GiB 49.49 1.01 160
30  nvme 1.0  1.0 1.09TiB  526GiB  592GiB 47.07 0.97 152
31  nvme 1.0  1.0 1.09TiB  484GiB  633GiB 43.34 0.89 140
32  nvme 1.0  1.0 1.09TiB  504GiB  613GiB 45.13 0.93 146
33  nvme 1.0  1.0 1.09TiB  550GiB  567GiB 49.23 1.01 159
34  nvme 1.0  1.0 1.09TiB  497GiB  620GiB 44.51 0.91 143
35  nvme 1.0  1.0 1.09TiB  457GiB  661GiB 40.88 0.84 132
36  nvme 1.0  1.0 1.09TiB  539GiB  578GiB 48.25 0.99 156
37  nvme 1.0  1.0 1.09TiB  516GiB  601GiB 46.19 0.95 149
38  nvme 1.0  1.0 1.09TiB  518GiB  600GiB 46.35 0.95 149
39  nvme 1.0  1.0 1.09TiB  456GiB  662GiB 40.81 0.84 132
40  nvme 1.0  1.0 1.09TiB  527GiB  591GiB 47.13 0.97 152
41  nvme 1.0  1.0 1.09TiB  536GiB  581GiB 47.98 0.98 155
42  nvme 1.0  1.0 1.09TiB  521GiB  597GiB 46.62 0.96 151
43  nvme 1.0  1.0 1.09TiB  459GiB  659GiB 41.05 0.84 132
44  nvme 1.0  1.0 1.09TiB  549GiB  569GiB 49.12 1.01 158
45  nvme 1.0  1.0 1.09TiB  569GiB  548GiB 50.95 1.04 164
46  nvme 1.0  1.0 1.09TiB  450GiB  668GiB 40.28 0.83 130
47  nvme 1.0  1.0 1.09TiB  491GiB  626GiB 43.97 0.90 142
48  nvme 1.0  1.0  931GiB  551GiB  381GiB 59.13 1.21 159
49  nvme 1.0  1.0  931GiB  469GiB  463GiB 50.34 1.03 136
50  nvme 1.0  1.0  931GiB  548GiB  384GiB 58.78 1.21 158
51  nvme 1.0  1.0  931GiB  380GiB  552GiB 40.79 0.84 109
52  nvme 1.0  1.0  931GiB  486GiB  445GiB 52.20 1.07 141
53  nvme 1.0  1.0  931GiB  502GiB  429GiB 53.93 1.11 146
54  nvme 1.0  1.0  931GiB  479GiB  452GiB 51.42 1.05 139
55  nvme 1.0  1.0  931GiB  521GiB  410GiB 55.93 1.15 150
56  nvme 1.0  1.0  931GiB  570GiB  361GiB 61.25 1.26 165
57  nvme 1.0  1.0  931GiB  404GiB  527GiB 43.43 0.89 117
58  nvme 1.0  1.0  931GiB  455GiB  476GiB 48.89 1.00 132
59  nvme 1.0  1.0  931GiB  535GiB  397GiB 57.39 1.18 154
60  nvme 1.0  1.0  931GiB  499GiB  433GiB 53.56 1.10 144
61  nvme 1.0  1.0  931GiB  446GiB  485GiB 47.92 0.98 129
62  nvme 1.0  1.0  931GiB  505GiB  427GiB 54.18 1.11 146
63  nvme 1.0  1.0  931GiB  563GiB  369GiB 60.39 1.24 162
64  nvme 1.0  1.0  931GiB  605GiB  326GiB 64.99 1.33 175
65  nvme 1.0  1.0  931GiB  476GiB  455GiB 51.10 1.05 138
66  nvme 1.0  1.0  931Gi

[ceph-users] Re: Doubt about AVAIL space on df

2020-02-04 Thread German Anders
Manuel, find the output of ceph osd df tree command:

# ceph osd df tree
ID  CLASS WEIGHT   REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
 -7   84.00099- 85.1TiB 41.6TiB 43.6TiB 48.82 1.00   - root root
 -5   12.0- 13.1TiB 5.81TiB 7.29TiB 44.38 0.91   - rack
rack1
 -1   12.0- 13.1TiB 5.81TiB 7.29TiB 44.38 0.91   -
node cpn01
  0  nvme  1.0  1.0 1.09TiB  496GiB  621GiB 44.40 0.91 143
osd.0
  1  nvme  1.0  1.0 1.09TiB  489GiB  629GiB 43.72 0.90 141
osd.1
  2  nvme  1.0  1.0 1.09TiB  537GiB  581GiB 48.03 0.98 155
osd.2
  3  nvme  1.0  1.0 1.09TiB  474GiB  644GiB 42.40 0.87 137
osd.3
  4  nvme  1.0  1.0 1.09TiB  532GiB  586GiB 47.57 0.97 153
osd.4
  5  nvme  1.0  1.0 1.09TiB  476GiB  642GiB 42.60 0.87 137
osd.5
  6  nvme  1.0  1.0 1.09TiB  467GiB  650GiB 41.82 0.86 135
osd.6
  7  nvme  1.0  1.0 1.09TiB  544GiB  574GiB 48.65 1.00 157
osd.7
  8  nvme  1.0  1.0 1.09TiB  482GiB  636GiB 43.12 0.88 139
osd.8
  9  nvme  1.0  1.0 1.09TiB  458GiB  660GiB 40.96 0.84 133
osd.9
 10  nvme  1.0  1.0 1.09TiB  514GiB  604GiB 45.97 0.94 148
osd.10
 11  nvme  1.0  1.0 1.09TiB  484GiB  633GiB 43.34 0.89 140
osd.11
 -6   12.00099- 13.1TiB 6.02TiB 7.08TiB 45.98 0.94   - rack
rack2
 -2   12.00099- 13.1TiB 6.02TiB 7.08TiB 45.98 0.94   -
node cpn02
 12  nvme  1.0  1.0 1.09TiB  499GiB  619GiB 44.61 0.91 144
osd.12
 13  nvme  1.0  1.0 1.09TiB  561GiB  557GiB 50.19 1.03 162
osd.13
 14  nvme  1.0  1.0 1.09TiB  577GiB  541GiB 51.60 1.06 167
osd.14
 15  nvme  1.0  1.0 1.09TiB  546GiB  572GiB 48.84 1.00 158
osd.15
 16  nvme  1.0  1.0 1.09TiB  537GiB  580GiB 48.07 0.98 155
osd.16
 17  nvme  1.0  1.0 1.09TiB  508GiB  610GiB 45.41 0.93 147
osd.17
 18  nvme  1.0  1.0 1.09TiB  490GiB  628GiB 43.86 0.90 142
osd.18
 19  nvme  1.0  1.0 1.09TiB  534GiB  584GiB 47.76 0.98 155
osd.19
 20  nvme  1.0  1.0 1.09TiB  467GiB  651GiB 41.80 0.86 134
osd.20
 21  nvme  1.0  1.0 1.09TiB  447GiB  671GiB 40.01 0.82 129
osd.21
 22  nvme  1.00099  1.0 1.09TiB  561GiB  556GiB 50.21 1.03 162
osd.22
 23  nvme  1.0  1.0 1.09TiB  441GiB  677GiB 39.45 0.81 127
osd.23
-15   12.0- 13.1TiB 5.92TiB 7.18TiB 45.20 0.93   - rack
rack3
 -3   12.0- 13.1TiB 5.92TiB 7.18TiB 45.20 0.93   -
node cpn03
 24  nvme  1.0  1.0 1.09TiB  500GiB  617GiB 44.77 0.92 145
osd.24
 25  nvme  1.0  1.0 1.09TiB  462GiB  655GiB 41.37 0.85 133
osd.25
 26  nvme  1.0  1.0 1.09TiB  446GiB  672GiB 39.88 0.82 129
osd.26
 27  nvme  1.0  1.0 1.09TiB  565GiB  553GiB 50.54 1.04 162
osd.27
 28  nvme  1.0  1.0 1.09TiB  513GiB  605GiB 45.89 0.94 148
osd.28
 29  nvme  1.0  1.0 1.09TiB  554GiB  564GiB 49.55 1.01 160
osd.29
 30  nvme  1.0  1.0 1.09TiB  527GiB  591GiB 47.12 0.97 152
osd.30
 31  nvme  1.0  1.0 1.09TiB  484GiB  634GiB 43.31 0.89 140
osd.31
 32  nvme  1.0  1.0 1.09TiB  505GiB  612GiB 45.21 0.93 146
osd.32
 33  nvme  1.0  1.0 1.09TiB  551GiB  567GiB 49.28 1.01 159
osd.33
 34  nvme  1.0  1.0 1.09TiB  498GiB  620GiB 44.52 0.91 143
osd.34
 35  nvme  1.0  1.0 1.09TiB  457GiB  660GiB 40.93 0.84 132
osd.35
-16   12.0- 13.1TiB 6.00TiB 7.10TiB 45.77 0.94   - rack
rack4
 -4   12.0- 13.1TiB 6.00TiB 7.10TiB 45.77 0.94   -
node cpn04
 36  nvme  1.0  1.0 1.09TiB  540GiB  578GiB 48.29 0.99 156
osd.36
 37  nvme  1.0  1.0 1.09TiB  517GiB  601GiB 46.25 0.95 149
osd.37
 38  nvme  1.0  1.0 1.09TiB  519GiB  599GiB 46.42 0.95 149
osd.38
 39  nvme  1.0  1.0 1.09TiB  457GiB  661GiB 40.85 0.84 132
osd.39
 40  nvme  1.0  1.0 1.09TiB  527GiB  590GiB 47.17 0.97 152
osd.40
 41  nvme  1.0  1.0 1.09TiB  537GiB  581GiB 48.01 0.98 155
osd.41
 42  nvme  1.0  1.0 1.09TiB  522GiB  596GiB 46.68 0.96 151
osd.42
 43  nvme  1.0  1.0 1.09TiB  459GiB  658GiB 41.09 0.84 132
osd.43
 44  nvme  1.0  1.0 1.09TiB  550GiB  568GiB 49.17 1.01 158
osd.44
 45  nvme  1.0  1.0 1.09TiB  570GiB  548GiB 51.00 1.04 164
osd.45
 46  nvme  1.0  1.0 1.09TiB  451GiB  667GiB 40.32 0.83 130
osd.46
 47  nvme  1.0  1.0 1.09TiB  492GiB  626GiB 44.03 0.90 142
osd.47
-20   12.0- 10.9TiB 5.77TiB 5.15TiB 52.84 1.08   - rack
rack5
-19   12.0- 10.9TiB 5.77TiB 5.15TiB 52.84 1.08   -
node cpn05
 48  nvme  1.0  1.0  931GiB  551GiB  380GiB 59.19 1.21 159
osd.48
 49  nvme  1.0  1.0  931GiB  469GiB  462GiB 50.39 1.03 136
osd.49
 50  nvme  1.0  1.0  931GiB  548GiB  384GiB 58.83 1.20 158
osd.50
 51 

[ceph-users] Re: Doubt about AVAIL space on df

2020-02-04 Thread Wido den Hollander



On 2/4/20 2:00 PM, German Anders wrote:
> Hello Everyone,
> 
> I would like to understand if this output is right:
> 
> *# ceph df*
> GLOBAL:
> SIZEAVAIL   RAW USED %RAW USED
> 85.1TiB 43.7TiB  41.4TiB 48.68
> POOLS:
> NAMEID USED%USED MAX AVAIL OBJECTS
> volumes 13 13.8TiB 64.21   7.68TiB 3620495
> 
> I only have (1) pool called 'volumes' which is using 13.8TiB (we have a
> replica of 3) so it's actually using 41,4TiB and that would be the RAW
> USED, at this point is fine, but, then it said on the GLOBAL section that
> the AVAIL space is 43.7TiB and the %RAW USED is only 48.68%.
> 
> So if I use the 7.68TiB of MAX AVAIL and the pool goes up to 100% of usage,
> that would not lead to the total space of the cluster, right? I mean were
> are those 43.7TiB of AVAIL space?
> 

MAX AVAIL looks at the fullest OSD and then also takes the nearfull
ratio of an OSD into account, which is 85% by default.

It will prevent you from filling up the system 100%.
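
As a rough illustration with the numbers above (simplified; the real
calculation follows the per-OSD utilisation and CRUSH weights):

MAX AVAIL * replica size = 7.68 TiB * 3 = ~23 TiB of raw space still writable
global AVAIL (raw)       = 43.7 TiB
difference               = ~20 TiB of raw space that cannot be used before
                           the fullest OSD crosses the nearfull ratio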

Wido

> I'm using Luminous 12.2.12 release.
> 
> Sorry if it's a silly question or if it has been answered before.
> 
> Thanks in advance,
> 
> Best regards,
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Write i/o in CephFS metadata pool

2020-02-04 Thread Samy Ascha



> On 2 Feb 2020, at 12:45, Patrick Donnelly  wrote:
> 
> On Wed, Jan 29, 2020 at 1:25 AM Samy Ascha  wrote:
>> 
>> Hi!
>> 
>> I've been running CephFS for a while now and ever since setting it up, I've 
>> seen unexpectedly large write i/o on the CephFS metadata pool.
>> 
>> The filesystem is otherwise stable and I'm seeing no usage issues.
>> 
>> I'm in a read-intensive environment, from the clients' perspective and 
>> throughput for the metadata pool is consistently larger than that of the 
>> data pool.
>> 
>> For example:
>> 
>> # ceph osd pool stats
>> pool cephfs_data id 1
>>  client io 7.6 MiB/s rd, 19 KiB/s wr, 404 op/s rd, 1 op/s wr
>> 
>> pool cephfs_metadata id 2
>>  client io 338 KiB/s rd, 43 MiB/s wr, 84 op/s rd, 26 op/s wr
>> 
>> I realise, of course, that this is a momentary display of statistics, but I 
>> see this unbalanced r/w activity consistently when monitoring it live.
>> 
>> I would like some insight into what may be causing this large imbalance in 
>> r/w, especially since I'm in a read-intensive (web hosting) environment.
> 
> The MDS is still writing its journal and updating the "open file
> table". The MDS needs to record certain information about the state of
> its cache and the state issued to clients. Even if the clients aren't
> changing anything. (This is workload dependent but will be most
> obvious when clients are opening files _not_ in cache already.)
> 
> -- 
> Patrick Donnelly, Ph.D.
> He / Him / His
> Senior Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> 

Hi Patrick,

Thanks for this extra information.

I should be able to confirm this by checking the network traffic flowing from the
MDSes to the OSDs and comparing it to what's coming in from the CephFS clients.
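
(A possible shortcut, sketched here with the counter names I expect in the MDS
perf dump, so worth verifying: the journal and objecter counters on the MDS
itself should already show where those writes come from.)

ceph daemon mds.<name> perf dump | jq '{mds_log: .mds_log, objecter_writes: .objecter.op_w}'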

I'll report back when I have more information on that. I'm a little caught up 
in other stuff right now, but I wanted to just acknowledge your message.

Samy



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_memory_target ignored

2020-02-04 Thread Stefan Kooman
Hi,

Quoting Frank Schilder (fr...@dtu.dk):
> I recently upgraded from 13.2.2 to 13.2.8 and observe two changes that
> I struggle with:
> 
> - from release notes: The bluestore_cache_* options are no longer
> needed. They are replaced by osd_memory_target, defaulting to 4GB.  -
> the default for bluestore_allocator has changed from stupid to bitmap,
> 
> which seem to conflict each other, or at least I seem unable to
> achieve what I want.
> 
> I have a number of OSDs for which I would like to increase the cache
> size. In the past I used bluestore_cache_size=8G and it worked like a
> charm. I now changed that to osd_memory_target=8G without any effect.
> The usage stays at 4G and the virtual size is about 5G. I would expect
> both to be close to 8G. The read cache for these OSDs usually fills up
> within a few hours. The cluster is now running a few days with the new
> configs to no avail.

How do you check the memory usage? We have an osd_memory_target of 11G and
the OSDs consume exactly that amount of RAM (ps aux | grep osd). We are
running 13.2.8. "ceph daemon osd.$id dump_mempools" would give ~4 GiB of
RAM, so there is obviously more RAM usage than what the mempools alone
account for.
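
For comparison, this is roughly what I put side by side (a sketch; field names
as printed by 13.2.x, worth double-checking):

ceph config get osd.$id osd_memory_target
ceph daemon osd.$id dump_mempools | jq .mempool.total
ps aux | grep "ceph-osd.*-i $id"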

> 
> The documentation of osd_memory_target refers to tcmalloc a lot. Is
> this in conflict with allocator=bitmap? If so, what is the way to tune
> cache sizes (say if tcmalloc is not used/how to check?)? Are
> bluestore_cache_* indeed obsolete as the above release notes suggest,
> or is this not true?

AFAIK these are not related. We use "bluefs_allocator": "bitmap" and
"bluestore_allocator": "bitmap".

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] All pgs peering indefinetely

2020-02-04 Thread Rodrigo Severo - Fábrica
Hi,


I have a rather small cephfs cluster with 3 machines right now: all of
them sharing MDS, MON, MGS and OSD roles.

I had to move all machines to a new physical location and,
unfortunately, I had to move all of them at the same time.

They are already on again but ceph won't be accessible as all pgs are
in peering state and OSD keep going down and up again.

Here is some info about my cluster:

---
# ceph -s
  cluster:
id: e348b63c-d239-4a15-a2ce-32f29a00431c
health: HEALTH_WARN
1 filesystem is degraded
1 MDSs report slow metadata IOs
2 osds down
1 host (2 osds) down
Reduced data availability: 324 pgs inactive, 324 pgs peering
7 daemons have recently crashed
10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops

  services:
mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
rgw: 1 daemon active (a2-df)

  data:
pools:   7 pools, 324 pgs
objects: 850.25k objects, 744 GiB
usage:   2.3 TiB used, 14 TiB / 16 TiB avail
pgs: 100.000% pgs not active
 324 peering
---

---
# ceph osd df tree
ID  CLASS    WEIGHT   REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1          16.37366        -  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -        root default
-10          16.37366        -  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83 1.00   -            datacenter df
 -3           5.45799        - 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a1-df
  3 hdd-slow  3.63899  1.0     3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00   0   down             osd.3
  0 hdd       1.81898  1.0     1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB 1.1 TiB 41.43 3.00   0   down             osd.0
 -5           5.45799        - 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a2-df
  4 hdd-slow  3.63899  1.0     3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.4
  1 hdd       1.81898  1.0     1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB 1.1 TiB 41.42 3.00 224     up             osd.1
 -7           5.45767        - 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB 4.7 TiB 13.83 1.00   -                host a3-df
  5 hdd-slow  3.63869  1.0     3.6 TiB 1.1 GiB  90 MiB     0 B   1 GiB 3.6 TiB  0.03 0.00 100     up             osd.5
  2 hdd       1.81898  1.0     1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB 1.1 TiB 41.43 3.00 224     up             osd.2
                          TOTAL  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB  14 TiB 13.83
MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
---

At this exact moment both OSDs from server a1-df were down but that's
changing. Sometimes I have only one OSD down, but most of the time I
have 2. And exactly which ones are actually down keeps changing.

What should I do to get my cluster back up? Just wait?


Regards,

Rodrigo Severo
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_memory_target ignored

2020-02-04 Thread Frank Schilder
Dear Stefan,

I check the total allocation with top. ps -aux gives:

USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
ceph  784155 15.8  3.1 6014276 4215008 ? Sl   Jan31 932:13 
/usr/bin/ceph-osd --cluster ceph -f -i 243 ...
ceph  784732 16.6  3.0 6058736 4082504 ? Sl   Jan31 976:59 
/usr/bin/ceph-osd --cluster ceph -f -i 247 ...
ceph  785812 17.1  3.0 5989576 3959996 ? Sl   Jan31 1008:46 
/usr/bin/ceph-osd --cluster ceph -f -i 254 ...
ceph  786352 14.9  3.1 5955520 4132840 ? Sl   Jan31 874:37 
/usr/bin/ceph-osd --cluster ceph -f -i 256 ...

These should have 8GB resident by now, but stay at or just below 4G. The other 
options are set as

[root@ceph-04 ~]# ceph config get osd.243 bluefs_allocator
bitmap
[root@ceph-04 ~]# ceph config get osd.243 bluestore_allocator
bitmap
[root@ceph-04 ~]# ceph config get osd.243 osd_memory_target
8589934592
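
For completeness, these are the related knobs I can still inspect (a sketch;
option names as in 13.2.x, worth double-checking against the local build):

[root@ceph-04 ~]# ceph config get osd.243 bluestore_cache_autotune
[root@ceph-04 ~]# ceph config get osd.243 osd_memory_cache_min
[root@ceph-04 ~]# ceph daemon osd.243 dump_mempools | jq .mempool.total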

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 04 February 2020 16:34:34
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] osd_memory_target ignored

Hi,

Quoting Frank Schilder (fr...@dtu.dk):
> I recently upgraded from 13.2.2 to 13.2.8 and observe two changes that
> I struggle with:
>
> - from release notes: The bluestore_cache_* options are no longer
> needed. They are replaced by osd_memory_target, defaulting to 4GB.  -
> the default for bluestore_allocator has changed from stupid to bitmap,
>
> which seem to conflict each other, or at least I seem unable to
> achieve what I want.
>
> I have a number of OSDs for which I would like to increase the cache
> size. In the past I used bluestore_cache_size=8G and it worked like a
> charm. I now changed that to osd_memory_target=8G without any effect.
> The usage stays at 4G and the virtual size is about 5G. I would expect
> both to be close to 8G. The read cache for these OSDs usually fills up
> within a few hours. The cluster is now running a few days with the new
> configs to no avail.

How do you check the memory usage? We have a osd_memory_target=11G and
the OSDs consume this exact amount of RAM (ps aux |grep osd). We are
running 13.2.8. ceph daemon osd.$id dump_mempools would give ~ 4 GiB of
RAM. So there is more RAM usage than only specified by "mempool"
obviously.

>
> The documentation of osd_memory_target refers to tcmalloc a lot. Is
> this in conflict with allocator=bitmap? If so, what is the way to tune
> cache sizes (say if tcmalloc is not used/how to check?)? Are
> bluestore_cache_* indeed obsolete as the above release notes suggest,
> or is this not true?

AFAIK these are not related. We use "bluefs_allocator": "bitmap" and
"bluestore_allocator": "bitmap".

Gr. Stefan

--
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread DHilsbos
Rodrigo;

Are all your hosts using the same IP addresses as before the move?  Is the new 
network structured the same?
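
A few quick checks that usually pinpoint this after a physical move (a sketch;
host, interface and log names are examples):

# what the OSDs think the networks are:
ceph daemon osd.0 config show | grep -E 'public_network|cluster_network'
# why OSDs are being marked down:
grep -iE "wrongly marked me down|heartbeat_check" /var/log/ceph/ceph-osd.0.log
# if jumbo frames were in use before, confirm the new switches still pass them:
ping -M do -s 8972 <other-osd-host>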

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: Rodrigo Severo - Fábrica [mailto:rodr...@fabricadeideias.com] 
Sent: Tuesday, February 04, 2020 8:40 AM
To: ceph-users
Subject: [ceph-users] All pgs peering indefinetely

Hi,


I have a rather small cephfs cluster with 3 machines right now: all of
them sharing MDS, MON, MGS and OSD roles.

I had to move all machines to a new physical location and,
unfortunately, I had to move all of them at the same time.

They are already on again but ceph won't be accessible as all pgs are
in peering state and OSD keep going down and up again.

Here is some info about my cluster:

---
# ceph -s
  cluster:
id: e348b63c-d239-4a15-a2ce-32f29a00431c
health: HEALTH_WARN
1 filesystem is degraded
1 MDSs report slow metadata IOs
2 osds down
1 host (2 osds) down
Reduced data availability: 324 pgs inactive, 324 pgs peering
7 daemons have recently crashed
10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow ops

  services:
mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
rgw: 1 daemon active (a2-df)

  data:
pools:   7 pools, 324 pgs
objects: 850.25k objects, 744 GiB
usage:   2.3 TiB used, 14 TiB / 16 TiB avail
pgs: 100.000% pgs not active
 324 peering
---

---
# ceph osd df tree
ID  CLASSWEIGHT   REWEIGHT SIZERAW USE DATAOMAPMETA
AVAIL   %USE  VAR  PGS STATUS TYPE NAME
 -1  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
 14 TiB 13.83 1.00   -root default
-10  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
 14 TiB 13.83 1.00   -datacenter df
 -3   5.45799- 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB
4.7 TiB 13.83 1.00   -host a1-df
  3 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
3.6 TiB  0.03 0.00   0   down osd.3
  0  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB
1.1 TiB 41.43 3.00   0   down osd.0
 -5   5.45799- 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB
4.7 TiB 13.83 1.00   -host a2-df
  4 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
3.6 TiB  0.03 0.00 100 up osd.4
  1  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB
1.1 TiB 41.42 3.00 224 up osd.1
 -7   5.45767- 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB
4.7 TiB 13.83 1.00   -host a3-df
  5 hdd-slow  3.63869  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
3.6 TiB  0.03 0.00 100 up osd.5
  2  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB
1.1 TiB 41.43 3.00 224 up osd.2
 TOTAL  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
 14 TiB 13.83
MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
---

At this exact moment both OSDs from server a1-df were down but that's
changing. Sometimes I have only one OSD down, but most of the times I
have 2. And exactly which ones are actually down keeps changing.

What should I do to get my cluster back up? Just wait?


Regards,

Rodrigo Severo
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] More OMAP Issues

2020-02-04 Thread DHilsbos
All;

We're back to having large OMAP object warnings regarding our RGW index pool.

This cluster is now in production, so I can't simply dump the buckets / pools and
hope everything works out.

I did some additional research on this issue, and it looks like I need to 
(re)shard the bucket (index?).  I found information that suggests that, for 
older versions of Ceph, buckets couldn't be sharded after creation[1].  Other 
information suggests the Nautilus (which we are running), can re-shard 
dynamically, but not when multi-site replication is configured[2].

This suggests that a "manual" resharding on a Nautilus cluster should be
possible, but I can't find the commands to do it.  Has anyone done this?  Does
anyone have the commands to do it?  I can schedule downtime for the cluster,
and take the RADOSGW instance(s) and dependent user services offline.

[1]: https://ceph.io/geen-categorie/radosgw-big-index/
[2]: https://docs.ceph.com/docs/master/radosgw/dynamicresharding/

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-04 Thread Bradley Kite
Hi Vitaliy

Yes - I tried this and I can still see a number of reads (~110 iops,
440KB/sec) on the SSD, so it is significantly better, but the result is
still puzzling - I'm trying to understand what is causing the reads. The
problem is amplified with numjobs >= 2 but it looks like it is still there
with just 1.

It is as if some caching parameter is not correct, and the same blocks are being
read over and over during a write?

Could anyone advise on the best way for me to investigate further?

I've tried strace (with -k) and 'perf record' but neither produce any
useful stack traces to help understand what's going on.

Regards
--
Brad




On Tue, 4 Feb 2020 at 11:05, Vitaliy Filippov  wrote:

> Hi,
>
> Try to repeat your test with numjobs=1, I've already seen strange
> behaviour with parallel jobs to one RBD image.
>
> Also as usual: https://yourcmc.ru/wiki/Ceph_performance :-)
>
> > Hi,
> >
> > We have a production cluster of 27 OSD's across 5 servers (all SSD's
> > running bluestore), and have started to notice a possible performance
> > issue.
> >
> > In order to isolate the problem, we built a single server with a single
> > OSD, and ran a few FIO tests. The results are puzzling, not that we were
> > expecting good performance on a single OSD.
> >
> > In short, with a sequential write test, we are seeing huge numbers of
> > reads
> > hitting the actual SSD
> >
> > Key FIO parameters are:
> >
> > [global]
> > pool=benchmarks
> > rbdname=disk-1
> > direct=1
> > numjobs=2
> > iodepth=1
> > blocksize=4k
> > group_reporting=1
> > [writer]
> > readwrite=write
> >
> > iostat results are:
> > Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> > avgrq-sz
> > avgqu-sz   await r_await w_await  svctm  %util
> > nvme0n1   0.00   105.00 4896.00  294.00 312080.00  1696.00
> > 120.92
> >17.253.353.550.02   0.02  12.60
> >
> > There are nearly ~5000 reads/second (~300 MB/sec), compared with only
> > ~300
> > writes (~1.5MB/sec), when we are doing a sequential write test? The
> > system
> > is otherwise idle, with no other workload.
> >
> > Running the same fio test with only 1 thread (numjobs=1) still shows a
> > high
> > number of reads (110).
> >
> > Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> > avgrq-sz
> > avgqu-sz   await r_await w_await  svctm  %util
> > nvme0n1   0.00  1281.00  110.00 1463.00   440.00 12624.00
> > 16.61
> > 0.030.020.050.02   0.02   3.40
> >
> > Can anyone kindly offer any comments on why we are seeing this behaviour?
> >
> > I can understand if there's the occasional read here and there if
> > RocksDB/WAL entries need to be read from disk during the sequential write
> > test, but this seems significantly high and unusual.
> >
> > FIO results (numjobs=2)
> > writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=rbd, iodepth=1
> > ...
> > fio-3.7
> > Starting 2 processes
> > Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta
> > 01m:00s]
> > writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb  3 22:46:16 2020
> >   write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
> > slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
> > clat (msec): min=2, max=210, avg=58.32, stdev=70.54
> >  lat (msec): min=2, max=210, avg=58.35, stdev=70.54
> > clat percentiles (msec):
> >  |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[
> > 3],
> >  | 30.00th=[3], 40.00th=[3], 50.00th=[   54], 60.00th=[
> > 62],
> >  | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[
> > 194],
> >  | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[
> > 209],
> >  | 99.99th=[  211]
> >bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08,
> > stdev=38.22,
> > samples=239
> >iops: min=6, max=   36, avg=16.97, stdev= 9.55,
> > samples=239
> >   lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
> >   cpu  : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
> >   IO depths: 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >> =64=0.0%
> >  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >> =64=0.0%
> >  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >> =64=0.0%
> >  issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
> >  latency   : target=0, window=0, percentile=100.00%, depth=1
> >
> > Run status group 0 (all jobs):
> >   WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s),
> > io=8228KiB (8425kB), run=60038-60038msec
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> With best regards,
>Vitaliy Filippov
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@cep

[ceph-users] Re: recovery_unfound

2020-02-04 Thread Chad William Seys

Hi Jake and all,
  We're having what looks to be the exact same problem.  In our case it 
happened when I was "draining" an OSD for removal.  (ceph crush 
remove...)  Adding the OSD back doesn't help work around the bug.
Everything is either triply replicated or EC k3m2, either of which
should withstand the loss of two hosts (much less one OSD).

  We're running 13.2.6 .
  I tried various OSD restarts, deep-scrubs, with no change. I'm 
leaving things alone hoping that croit.io will update their package to 
13.2.8 soonish.  Maybe that will help kick it in the pants.


Chad.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-04 Thread Igor Fedotov

Hi Bradley,

you might want to check performance counters for this specific OSD.

They are available via the 'ceph daemon osd.0 perf dump' command in Nautilus; the
command is slightly different for Luminous, AFAIR.


Then look for the 'read' substring in the dump and try to find unexpectedly
high read-related counter values, if any.


And/or share it here for brief analysis.
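
For example (a sketch):

ceph daemon osd.0 perf dump | jq . | grep -i read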


Thanks,

Igor


On 2/4/2020 7:36 PM, Bradley Kite wrote:

Hi Vitaliy

Yes - I tried this and I can still see a number of reads (~110 iops,
440KB/sec) on the SSD, so it is significantly better, but the result is
still puzzling - I'm trying to understand what is causing the reads. The
problem is amplified with numjobs >= 2 but it looks like it is still there
with just 1.

Like some caching parameter is not correct, and the same blocks are being
read over and over when doing a write?

Could anyone advise on the best way for me to investigate further?

I've tried strace (with -k) and 'perf record' but neither produce any
useful stack traces to help understand what's going on.

Regards
--
Brad




On Tue, 4 Feb 2020 at 11:05, Vitaliy Filippov  wrote:


Hi,

Try to repeat your test with numjobs=1, I've already seen strange
behaviour with parallel jobs to one RBD image.

Also as usual: https://yourcmc.ru/wiki/Ceph_performance :-)


Hi,

We have a production cluster of 27 OSD's across 5 servers (all SSD's
running bluestore), and have started to notice a possible performance
issue.

In order to isolate the problem, we built a single server with a single
OSD, and ran a few FIO tests. The results are puzzling, not that we were
expecting good performance on a single OSD.

In short, with a sequential write test, we are seeing huge numbers of
reads
hitting the actual SSD

Key FIO parameters are:

[global]
pool=benchmarks
rbdname=disk-1
direct=1
numjobs=2
iodepth=1
blocksize=4k
group_reporting=1
[writer]
readwrite=write

iostat results are:
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
nvme0n1   0.00   105.00 4896.00  294.00 312080.00  1696.00
120.92
17.253.353.550.02   0.02  12.60

There are nearly ~5000 reads/second (~300 MB/sec), compared with only
~300
writes (~1.5MB/sec), when we are doing a sequential write test? The
system
is otherwise idle, with no other workload.

Running the same fio test with only 1 thread (numjobs=1) still shows a
high
number of reads (110).

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
nvme0n1   0.00  1281.00  110.00 1463.00   440.00 12624.00
16.61
 0.030.020.050.02   0.02   3.40

Can anyone kindly offer any comments on why we are seeing this behaviour?

I can understand if there's the occasional read here and there if
RocksDB/WAL entries need to be read from disk during the sequential write
test, but this seems significantly high and unusual.

FIO results (numjobs=2)
writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=rbd, iodepth=1
...
fio-3.7
Starting 2 processes
Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta
01m:00s]
writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb  3 22:46:16 2020
   write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
 slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
 clat (msec): min=2, max=210, avg=58.32, stdev=70.54
  lat (msec): min=2, max=210, avg=58.35, stdev=70.54
 clat percentiles (msec):
  |  1.00th=[3],  5.00th=[3], 10.00th=[3], 20.00th=[
3],
  | 30.00th=[3], 40.00th=[3], 50.00th=[   54], 60.00th=[
62],
  | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[
194],
  | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[
209],
  | 99.99th=[  211]
bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08,
stdev=38.22,
samples=239
iops: min=6, max=   36, avg=16.97, stdev= 9.55,
samples=239
   lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
   cpu  : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
   IO depths: 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,

=64=0.0%

  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,

=64=0.0%

  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,

=64=0.0%

  issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
  latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s),
io=8228KiB (8425kB), run=60038-60038msec
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
With best regards,
Vitaliy Filippov


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send a

[ceph-users] Re: More OMAP Issues

2020-02-04 Thread Paul Emmerich
Are you running a multi-site setup?
In this case it's best to set the default shard size to a large enough
number *before* enabling multi-site.

If you didn't do this: well... I think the only way is still to
completely re-sync the second site...


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Feb 4, 2020 at 5:23 PM  wrote:
>
> All;
>
> We're back to having large OMAP object warnings regarding our RGW index 
> pool.
>
> This cluster is now in production, so I can't simply dump the buckets / pools 
> and hope everything works out.
>
> I did some additional research on this issue, and it looks like I need to 
> (re)shard the bucket (index?).  I found information that suggests that, for 
> older versions of Ceph, buckets couldn't be sharded after creation[1].  Other 
> information suggests the Nautilus (which we are running), can re-shard 
> dynamically, but not when multi-site replication is configured[2].
>
> This suggests that a "manual" resharding of a Nautilus cluster should be 
> possible, but I can't find the commands to do it.  Has anyone done this?  
> Does anyone have the commands to do it?  I can schedule down time for the 
> cluster, and take the RADOSGW instance(s), and dependent user services 
> offline.
>
> [1]: https://ceph.io/geen-categorie/radosgw-big-index/
> [2]: https://docs.ceph.com/docs/master/radosgw/dynamicresharding/
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: More OMAP Issues

2020-02-04 Thread DHilsbos
Paul;

Yes, we are running a multi-site setup.

Re-sync would be acceptable at this point, as we only have 4 TiB in use right 
now.

Tearing down and reconfiguring the second site would also be acceptable, except 
that I've never been able to cleanly remove a zone from a zone group.  The only 
way I've found to remove a zone completely is to tear down the entire RADOSGW 
configuration (delete .rgw.root pool from both clusters).
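
For what it's worth, the sequence that is supposed to detach and then delete
a secondary zone without wiping .rgw.root looks roughly like this (a sketch
only; "us" and "us-west" are placeholder names and this should be verified
against the multi-site docs before touching a production realm):

  # on the master zone side, drop the zone from the zonegroup:
  radosgw-admin zonegroup remove --rgw-zonegroup=us --rgw-zone=us-west
  radosgw-admin period update --commit

  # then delete the zone definition on the secondary cluster:
  radosgw-admin zone delete --rgw-zone=us-west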

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: Paul Emmerich [mailto:paul.emmer...@croit.io] 
Sent: Tuesday, February 04, 2020 9:52 AM
To: Dominic Hilsbos
Cc: ceph-users
Subject: Re: [ceph-users] More OMAP Issues

Are you running a multi-site setup?
In this case it's best to set the default number of index shards to a
large enough value *before* enabling multi-site.

If you didn't do this: well... I think the only way is still to
completely re-sync the second site...


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Feb 4, 2020 at 5:23 PM  wrote:
>
> All;
>
> We're back to having large OMAP object warnings regarding our RGW index 
> pool.
>
> This cluster is now in production, so I can't simply dump the buckets / pools 
> and hope everything works out.
>
> I did some additional research on this issue, and it looks like I need to 
> (re)shard the bucket (index?).  I found information that suggests that, for 
> older versions of Ceph, buckets couldn't be sharded after creation[1].  Other 
> information suggests the Nautilus (which we are running), can re-shard 
> dynamically, but not when multi-site replication is configured[2].
>
> This suggests that a "manual" resharding of a Nautilus cluster should be 
> possible, but I can't find the commands to do it.  Has anyone done this?  
> Does anyone have the commands to do it?  I can schedule down time for the 
> cluster, and take the RADOSGW instance(s), and dependent user services 
> offline.
>
> [1]: https://ceph.io/geen-categorie/radosgw-big-index/
> [2]: https://docs.ceph.com/docs/master/radosgw/dynamicresharding/
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore cache parameter precedence

2020-02-04 Thread Igor Fedotov

Hi Boris,

general settings (unless they are set to zero) override the disk-specific
settings.


I.e. bluestore_cache_size overrides both bluestore_cache_size_hdd and 
bluestore_cache_size_ssd.


Here is the code snippet in case you know C++:

  if (cct->_conf->bluestore_cache_size) {
    cache_size = cct->_conf->bluestore_cache_size;
  } else {
    // choose global cache size based on backend type
    if (_use_rotational_settings()) {
      cache_size = cct->_conf->bluestore_cache_size_hdd;
    } else {
      cache_size = cct->_conf->bluestore_cache_size_ssd;
    }
  }
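
For completeness, a quick way to confirm which value a given OSD actually
ended up with (osd.0 here is just a placeholder):

  ceph config get osd.0 bluestore_cache_size
  ceph config get osd.0 bluestore_cache_size_ssd
  # or ask the running daemon directly over the admin socket:
  ceph daemon osd.0 config get bluestore_cache_size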

Thanks,

Igor

On 2/4/2020 2:14 PM, Boris Epstein wrote:

Hello list,

As stated in this document:

https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

there are multiple parameters defining cache limits for BlueStore. You have
bluestore_cache_size (presumably controlling the cache size),
bluestore_cache_size_hdd (presumably doing the same for HDD storage only)
and bluestore_cache_size_ssd (presumably being the equivalent for SSD). My
question is, does bluestore_cache_size override the disk-specific
parameters, or do I need to set the disk-specific (or, rather, storage type
specific ones separately if I want to keep them to a certain value.

Thanks in advance.

Boris.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-04 Thread vitalif
SSD (block.db) partition contains object metadata in RocksDB so it 
probably loads the metadata before modifying objects (if it's not in 
cache yet). Also it sometimes performs compaction which also results in 
disk reads and writes. There are other things going on that I'm not 
completely aware of. There's the RBD object map... Maybe there are some 
locks that come into play when you issue parallel writes...


There's a config option to enable RocksDB performance counters. You can 
have a look into it.
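
A rough sketch of how to enable and read them (treat the option name as an
assumption and check 'ceph config help rocksdb_perf' on your release first):

  ceph config set osd rocksdb_perf true
  # restart the OSD, then the extra counters appear under the rocksdb
  # section of the perf dump:
  ceph daemon osd.0 perf dump rocksdb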


However if you're just trying to understand why RBD isn't super fast 
then I don't think these reads are the cause...



Hi Vitaliy

Yes - I tried this and I can still see a number of reads (~110 iops,
440KB/sec) on the SSD, so it is significantly better, but the result
is still puzzling - I'm trying to understand what is causing the
reads. The problem is amplified with numjobs >= 2 but it looks like it
is still there with just 1.

Like some caching parameter is not correct, and the same blocks are
being read over and over when doing a write?

Could anyone advise on the best way for me to investigate further?

I've tried strace (with -k) and 'perf record' but neither produce any
useful stack traces to help understand what's going on.

Regards
--
Brad

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephalocon Seoul is canceled

2020-02-04 Thread Sage Weil
Hi everyone,

We are sorry to announce that, due to the recent coronavirus outbreak, we 
are canceling Cephalocon for March 3-5 in Seoul.

More details will follow about how to best handle cancellation of hotel 
reservations and so forth.  Registrations will of course be 
refunded--expect an email with details in the next day or two.

We are still looking into whether it makes sense to reschedule the event 
for later in the year.

Thank you to everyone who has helped to plan this event, submitted talks, 
and agreed to sponsor.  It makes us sad to cancel, but the safety of 
our community is of the utmost importance, and it was looking increasingly 
unlikely that we could make this event a success.

Stay tuned...
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore cache parameter precedence

2020-02-04 Thread Boris Epstein
Hi Igor,

Thanks!

I think the code needs to be corrected - the choice criteria for which
setting to use when

cct->_conf->bluestore_cache_size == 0

should be as follows:

1) See what kind of storage you have.

2) Select the type-appropriate cache size setting.

Is this code public-editable? I'll be happy to correct that.

Regards,

Boris.

On Tue, Feb 4, 2020 at 12:10 PM Igor Fedotov  wrote:

> Hi Boris,
>
> general settings (unless they are set to zero) override the disk-specific
> settings.
>
> I.e. bluestore_cache_size overrides both bluestore_cache_size_hdd and
> bluestore_cache_size_ssd.
>
> Here is the code snippet in case you know C++
>
>   if (cct->_conf->bluestore_cache_size) {
>     cache_size = cct->_conf->bluestore_cache_size;
>   } else {
>     // choose global cache size based on backend type
>     if (_use_rotational_settings()) {
>       cache_size = cct->_conf->bluestore_cache_size_hdd;
>     } else {
>       cache_size = cct->_conf->bluestore_cache_size_ssd;
>     }
>   }
>
> Thanks,
>
> Igor
>
> On 2/4/2020 2:14 PM, Boris Epstein wrote:
> > Hello list,
> >
> > As stated in this document:
> >
> >
> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
> >
> > there are multiple parameters defining cache limits for BlueStore. You
> have
> > bluestore_cache_size (presumably controlling the cache size),
> > bluestore_cache_size_hdd (presumably doing the same for HDD storage only)
> > and bluestore_cache_size_ssd (presumably being the equivalent for SSD).
> My
> > question is, does bluestore_cache_size override the disk-specific
> > parameters, or do I need to set the disk-specific (or, rather, storage
> type
> > specific ones separately if I want to keep them to a certain value.
> >
> > Thanks in advance.
> >
> > Boris.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Bucket rename with

2020-02-04 Thread EDH - Manuel Rios
Hi

A customer asked us about what should be a simple task: they want to rename a bucket.

Checking the Nautilus documentation, it looks like this is not possible right 
now, but I checked the master documentation and a CLI command should apparently 
accomplish this.

$ radosgw-admin bucket link --bucket=foo --bucket-new-name=bar --uid=johnny

Will this be backported to Nautilus? Or is it still just for developer/master users?

https://docs.ceph.com/docs/master/man/8/radosgw-admin/

Regards
Manuel

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread Rodrigo Severo - Fábrica
On Tue, Feb 4, 2020 at 1:11 PM  wrote:
>
> Rodrigo;
>
> Are all your hosts using the same IP addresses as before the move?  Is the 
> new network structured the same?

Yes for both questions.


Rodrigo
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread Rodrigo Severo - Fábrica
On Tue, Feb 4, 2020 at 12:39 PM Rodrigo Severo - Fábrica
 wrote:
>
> Hi,
>
>
> I have a rather small cephfs cluster with 3 machines right now: all of
> them sharing MDS, MON, MGS and OSD roles.
>
> I had to move all machines to a new physical location and,
> unfortunately, I had to move all of them at the same time.
>
> They are already on again but ceph won't be accessible as all pgs are
> in peering state and OSD keep going down and up again.
>
> Here is some info about my cluster:
>
> ---
> # ceph -s
>   cluster:
> id: e348b63c-d239-4a15-a2ce-32f29a00431c
> health: HEALTH_WARN
> 1 filesystem is degraded
> 1 MDSs report slow metadata IOs
> 2 osds down
> 1 host (2 osds) down
> Reduced data availability: 324 pgs inactive, 324 pgs peering
> 7 daemons have recently crashed
> 10 slow ops, oldest one blocked for 206 sec, mon.a2-df has slow 
> ops
>
>   services:
> mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
> mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
> mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
> osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
> rgw: 1 daemon active (a2-df)
>
>   data:
> pools:   7 pools, 324 pgs
> objects: 850.25k objects, 744 GiB
> usage:   2.3 TiB used, 14 TiB / 16 TiB avail
> pgs: 100.000% pgs not active
>  324 peering
> ---
>
> ---
> # ceph osd df tree
> ID  CLASSWEIGHT   REWEIGHT SIZERAW USE DATAOMAPMETA
> AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>  -1  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>  14 TiB 13.83 1.00   -root default
> -10  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>  14 TiB 13.83 1.00   -datacenter df
>  -3   5.45799- 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB
> 4.7 TiB 13.83 1.00   -host a1-df
>   3 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
> 3.6 TiB  0.03 0.00   0   down osd.3
>   0  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB
> 1.1 TiB 41.43 3.00   0   down osd.0
>  -5   5.45799- 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB
> 4.7 TiB 13.83 1.00   -host a2-df
>   4 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
> 3.6 TiB  0.03 0.00 100 up osd.4
>   1  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB
> 1.1 TiB 41.42 3.00 224 up osd.1
>  -7   5.45767- 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB
> 4.7 TiB 13.83 1.00   -host a3-df
>   5 hdd-slow  3.63869  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
> 3.6 TiB  0.03 0.00 100 up osd.5
>   2  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB
> 1.1 TiB 41.43 3.00 224 up osd.2
>  TOTAL  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>  14 TiB 13.83
> MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
> ---
>
> At this exact moment both OSDs from server a1-df were down but that's
> changing. Sometimes I have only one OSD down, but most of the times I
> have 2. And exactly which ones are actually down keeps changing.
>
> What should I do to get my cluster back up? Just wait?

I just found out that I have a few pgs "stuck peering":

---
# ceph health detail | grep peering
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs;
2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs
inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow
ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow
ops.
PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs peering
pg 1.39 is stuck peering for 14011.965915, current state peering,
last acting [0,1]
pg 1.3a is stuck peering for 14084.993947, current state peering,
last acting [0,1]
pg 1.3b is stuck peering for 14274.225311, current state peering,
last acting [0,1]
pg 1.3c is stuck peering for 15937.859532, current state peering,
last acting [1,0]
pg 1.3d is stuck peering for 15786.873447, current state peering,
last acting [1,0]
pg 1.3e is stuck peering for 15841.947891, current state peering,
last acting [1,0]
pg 1.3f is stuck peering for 15841.912853, current state peering,
last acting [1,0]
pg 1.40 is stuck peering for 14031.769901, current state peering,
last acting [0,1]
pg 1.41 is stuck peering for 14010.216124, current state peering,
last acting [0,1]
pg 1.42 is stuck peering for 15841.895446, current state peering,
last acting [1,0]
pg 1.43 is stuck peering for 15915.024413, current state peering,
last acting [1,0]
pg 1.44 is stuck peering for 

[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread Wesley Dillingham
I would guess that you have something preventing osd to osd communication
on ports 6800-7300 or osd to mon communication on  port 6789 and/or 3300.
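
A couple of quick checks, sketched with host names from your output (a2-df
is just an example; repeat from and to each node):

  # verify the mon and a sample of the OSD ports are reachable:
  nc -zv a2-df 3300
  nc -zv a2-df 6789
  nc -zv a2-df 6800
  # and confirm which addresses the OSDs are actually listening on:
  ss -tlnp | grep ceph-osd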


Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Tue, Feb 4, 2020 at 12:44 PM Rodrigo Severo - Fábrica <
rodr...@fabricadeideias.com> wrote:

> On Tue, Feb 4, 2020 at 12:39 PM Rodrigo Severo - Fábrica
>  wrote:
> >
> > Hi,
> >
> >
> > I have a rather small cephfs cluster with 3 machines right now: all of
> > them sharing MDS, MON, MGS and OSD roles.
> >
> > I had to move all machines to a new physical location and,
> > unfortunately, I had to move all of them at the same time.
> >
> > They are already on again but ceph won't be accessible as all pgs are
> > in peering state and OSD keep going down and up again.
> >
> > Here is some info about my cluster:
> >
> > ---
> > # ceph -s
> >   cluster:
> > id: e348b63c-d239-4a15-a2ce-32f29a00431c
> > health: HEALTH_WARN
> > 1 filesystem is degraded
> > 1 MDSs report slow metadata IOs
> > 2 osds down
> > 1 host (2 osds) down
> > Reduced data availability: 324 pgs inactive, 324 pgs peering
> > 7 daemons have recently crashed
> > 10 slow ops, oldest one blocked for 206 sec, mon.a2-df has
> slow ops
> >
> >   services:
> > mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
> > mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
> > mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
> > osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
> > rgw: 1 daemon active (a2-df)
> >
> >   data:
> > pools:   7 pools, 324 pgs
> > objects: 850.25k objects, 744 GiB
> > usage:   2.3 TiB used, 14 TiB / 16 TiB avail
> > pgs: 100.000% pgs not active
> >  324 peering
> > ---
> >
> > ---
> > # ceph osd df tree
> > ID  CLASSWEIGHT   REWEIGHT SIZERAW USE DATAOMAPMETA
> > AVAIL   %USE  VAR  PGS STATUS TYPE NAME
> >  -1  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
> >  14 TiB 13.83 1.00   -root default
> > -10  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
> >  14 TiB 13.83 1.00   -datacenter df
> >  -3   5.45799- 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB
> > 4.7 TiB 13.83 1.00   -host a1-df
> >   3 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
> > 3.6 TiB  0.03 0.00   0   down osd.3
> >   0  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB
> > 1.1 TiB 41.43 3.00   0   down osd.0
> >  -5   5.45799- 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB
> > 4.7 TiB 13.83 1.00   -host a2-df
> >   4 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
> > 3.6 TiB  0.03 0.00 100 up osd.4
> >   1  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB
> > 1.1 TiB 41.42 3.00 224 up osd.1
> >  -7   5.45767- 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB
> > 4.7 TiB 13.83 1.00   -host a3-df
> >   5 hdd-slow  3.63869  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
> > 3.6 TiB  0.03 0.00 100 up osd.5
> >   2  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB
> > 1.1 TiB 41.43 3.00 224 up osd.2
> >  TOTAL  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
> >  14 TiB 13.83
> > MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
> > ---
> >
> > At this exact moment both OSDs from server a1-df were down but that's
> > changing. Sometimes I have only one OSD down, but most of the times I
> > have 2. And exactly which ones are actually down keeps changing.
> >
> > What should I do to get my cluster back up? Just wait?
>
> I just found out that I have a few pgs "stuck peering":
>
> ---
> # ceph health detail | grep peering
> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs;
> 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs
> inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow
> ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow
> ops.
> PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs
> peering
> pg 1.39 is stuck peering for 14011.965915, current state peering,
> last acting [0,1]
> pg 1.3a is stuck peering for 14084.993947, current state peering,
> last acting [0,1]
> pg 1.3b is stuck peering for 14274.225311, current state peering,
> last acting [0,1]
> pg 1.3c is stuck peering for 15937.859532, current state peering,
> last acting [1,0]
> pg 1.3d is stuck peering for 15786.873447, current state peering,
> last acting [1,0]

[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread Rodrigo Severo - Fábrica
On Tue, Feb 4, 2020 at 2:54 PM Wesley Dillingham
 wrote:
>
>
> I would guess that you have something preventing osd to osd communication on 
> ports 6800-7300 or osd to mon communication on  port 6789 and/or 3300.

The 3 servers are on the same subnet. They are connected to an
unmanaged switch, and none of them have any firewall (iptables) rules
blocking anything. They can ping one another.

Can you think about some other way that some traffic could be blocked?
Or some other test I could do to check for connectivity?


Regards,

Rodrigo





>
>
> Respectfully,
>
> Wes Dillingham
> w...@wesdillingham.com
> LinkedIn
>
>
> On Tue, Feb 4, 2020 at 12:44 PM Rodrigo Severo - Fábrica 
>  wrote:
>>
>> On Tue, Feb 4, 2020 at 12:39 PM Rodrigo Severo - Fábrica
>>  wrote:
>> >
>> > Hi,
>> >
>> >
>> > I have a rather small cephfs cluster with 3 machines right now: all of
>> > them sharing MDS, MON, MGS and OSD roles.
>> >
>> > I had to move all machines to a new physical location and,
>> > unfortunately, I had to move all of them at the same time.
>> >
>> > They are already on again but ceph won't be accessible as all pgs are
>> > in peering state and OSD keep going down and up again.
>> >
>> > Here is some info about my cluster:
>> >
>> > ---
>> > # ceph -s
>> >   cluster:
>> > id: e348b63c-d239-4a15-a2ce-32f29a00431c
>> > health: HEALTH_WARN
>> > 1 filesystem is degraded
>> > 1 MDSs report slow metadata IOs
>> > 2 osds down
>> > 1 host (2 osds) down
>> > Reduced data availability: 324 pgs inactive, 324 pgs peering
>> > 7 daemons have recently crashed
>> > 10 slow ops, oldest one blocked for 206 sec, mon.a2-df has 
>> > slow ops
>> >
>> >   services:
>> > mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
>> > mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
>> > mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
>> > osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
>> > rgw: 1 daemon active (a2-df)
>> >
>> >   data:
>> > pools:   7 pools, 324 pgs
>> > objects: 850.25k objects, 744 GiB
>> > usage:   2.3 TiB used, 14 TiB / 16 TiB avail
>> > pgs: 100.000% pgs not active
>> >  324 peering
>> > ---
>> >
>> > ---
>> > # ceph osd df tree
>> > ID  CLASSWEIGHT   REWEIGHT SIZERAW USE DATAOMAPMETA
>> > AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>> >  -1  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>> >  14 TiB 13.83 1.00   -root default
>> > -10  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>> >  14 TiB 13.83 1.00   -datacenter df
>> >  -3   5.45799- 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB
>> > 4.7 TiB 13.83 1.00   -host a1-df
>> >   3 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
>> > 3.6 TiB  0.03 0.00   0   down osd.3
>> >   0  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB
>> > 1.1 TiB 41.43 3.00   0   down osd.0
>> >  -5   5.45799- 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB
>> > 4.7 TiB 13.83 1.00   -host a2-df
>> >   4 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
>> > 3.6 TiB  0.03 0.00 100 up osd.4
>> >   1  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB
>> > 1.1 TiB 41.42 3.00 224 up osd.1
>> >  -7   5.45767- 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB
>> > 4.7 TiB 13.83 1.00   -host a3-df
>> >   5 hdd-slow  3.63869  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
>> > 3.6 TiB  0.03 0.00 100 up osd.5
>> >   2  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB
>> > 1.1 TiB 41.43 3.00 224 up osd.2
>> >  TOTAL  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>> >  14 TiB 13.83
>> > MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
>> > ---
>> >
>> > At this exact moment both OSDs from server a1-df were down but that's
>> > changing. Sometimes I have only one OSD down, but most of the times I
>> > have 2. And exactly which ones are actually down keeps changing.
>> >
>> > What should I do to get my cluster back up? Just wait?
>>
>> I just found out that I have a few pgs "stuck peering":
>>
>> ---
>> # ceph health detail | grep peering
>> HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs;
>> 2 osds down; 1 host (2 osds) down; Reduced data availability: 324 pgs
>> inactive, 324 pgs peering; 7 daemons have recently crashed; 80 slow
>> ops, oldest one blocked for 33 sec, daemons [osd.0,osd.1] have slow
>> ops.
>> PG_AVAILABILITY Reduced data availability: 324 pgs inactive, 324 pgs peering
>> pg 1.39 is stuc

[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread DHilsbos
Rodrigo;

Best bet would be to check logs.  Check the OSD logs on the affected server.  
Check cluster logs on the MONs.  Check OSD logs on other servers.

Your Ceph version(s) and your OS distribution and version would also be useful 
to help you troubleshoot this OSD flapping issue.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: Rodrigo Severo - Fábrica [mailto:rodr...@fabricadeideias.com] 
Sent: Tuesday, February 04, 2020 11:05 AM
To: Wesley Dillingham
Cc: ceph-users
Subject: [ceph-users] Re: All pgs peering indefinetely

On Tue, Feb 4, 2020 at 2:54 PM Wesley Dillingham
 wrote:
>
>
> I would guess that you have something preventing osd to osd communication on 
> ports 6800-7300 or osd to mon communication on  port 6789 and/or 3300.

The 3 servers are on the same subnet. They are connected to an
unmanaged switch, and none of them have any firewall (iptables) rules
blocking anything. They can ping one another.

Can you think about some other way that some traffic could be blocked?
Or some other test I could do to check for connectivity?


Regards,

Rodrigo





>
>
> Respectfully,
>
> Wes Dillingham
> w...@wesdillingham.com
> LinkedIn
>
>
> On Tue, Feb 4, 2020 at 12:44 PM Rodrigo Severo - Fábrica 
>  wrote:
>>
>> On Tue, Feb 4, 2020 at 12:39 PM Rodrigo Severo - Fábrica
>>  wrote:
>> >
>> > Hi,
>> >
>> >
>> > I have a rather small cephfs cluster with 3 machines right now: all of
>> > them sharing MDS, MON, MGS and OSD roles.
>> >
>> > I had to move all machines to a new physical location and,
>> > unfortunately, I had to move all of them at the same time.
>> >
>> > They are already on again but ceph won't be accessible as all pgs are
>> > in peering state and OSD keep going down and up again.
>> >
>> > Here is some info about my cluster:
>> >
>> > ---
>> > # ceph -s
>> >   cluster:
>> > id: e348b63c-d239-4a15-a2ce-32f29a00431c
>> > health: HEALTH_WARN
>> > 1 filesystem is degraded
>> > 1 MDSs report slow metadata IOs
>> > 2 osds down
>> > 1 host (2 osds) down
>> > Reduced data availability: 324 pgs inactive, 324 pgs peering
>> > 7 daemons have recently crashed
>> > 10 slow ops, oldest one blocked for 206 sec, mon.a2-df has 
>> > slow ops
>> >
>> >   services:
>> > mon: 3 daemons, quorum a2-df,a3-df,a1-df (age 47m)
>> > mgr: a2-df(active, since 82m), standbys: a3-df, a1-df
>> > mds: cephfs:1/1 {0=a2-df=up:replay} 2 up:standby
>> > osd: 6 osds: 4 up (since 5s), 6 in (since 47m)
>> > rgw: 1 daemon active (a2-df)
>> >
>> >   data:
>> > pools:   7 pools, 324 pgs
>> > objects: 850.25k objects, 744 GiB
>> > usage:   2.3 TiB used, 14 TiB / 16 TiB avail
>> > pgs: 100.000% pgs not active
>> >  324 peering
>> > ---
>> >
>> > ---
>> > # ceph osd df tree
>> > ID  CLASSWEIGHT   REWEIGHT SIZERAW USE DATAOMAPMETA
>> > AVAIL   %USE  VAR  PGS STATUS TYPE NAME
>> >  -1  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>> >  14 TiB 13.83 1.00   -root default
>> > -10  16.37366-  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>> >  14 TiB 13.83 1.00   -datacenter df
>> >  -3   5.45799- 5.5 TiB 773 GiB 770 GiB 382 MiB 2.7 GiB
>> > 4.7 TiB 13.83 1.00   -host a1-df
>> >   3 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
>> > 3.6 TiB  0.03 0.00   0   down osd.3
>> >   0  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 382 MiB 1.7 GiB
>> > 1.1 TiB 41.43 3.00   0   down osd.0
>> >  -5   5.45799- 5.5 TiB 773 GiB 770 GiB 370 MiB 2.7 GiB
>> > 4.7 TiB 13.83 1.00   -host a2-df
>> >   4 hdd-slow  3.63899  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
>> > 3.6 TiB  0.03 0.00 100 up osd.4
>> >   1  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 370 MiB 1.7 GiB
>> > 1.1 TiB 41.42 3.00 224 up osd.1
>> >  -7   5.45767- 5.5 TiB 773 GiB 770 GiB 387 MiB 2.7 GiB
>> > 4.7 TiB 13.83 1.00   -host a3-df
>> >   5 hdd-slow  3.63869  1.0 3.6 TiB 1.1 GiB  90 MiB 0 B   1 GiB
>> > 3.6 TiB  0.03 0.00 100 up osd.5
>> >   2  hdd  1.81898  1.0 1.8 TiB 772 GiB 770 GiB 387 MiB 1.7 GiB
>> > 1.1 TiB 41.43 3.00 224 up osd.2
>> >  TOTAL  16 TiB 2.3 TiB 2.3 TiB 1.1 GiB 8.1 GiB
>> >  14 TiB 13.83
>> > MIN/MAX VAR: 0.00/3.00  STDDEV: 21.82
>> > ---
>> >
>> > At this exact moment both OSDs from server a1-df were down but that's
>> > changing. Sometimes I have only one OSD down, but most of the times I
>> > have 2. And 

[ceph-users] Re: All pgs peering indefinetely

2020-02-04 Thread Rodrigo Severo - Fábrica
On Tue, Feb 4, 2020 at 3:19 PM  wrote:
>
> Rodrigo;
>
> Best bet would be to check logs.  Check the OSD logs on the affected server.  
> Check cluster logs on the MONs.  Check OSD logs on other servers.
>
> Your Ceph version(s) and your OS distribution and version would also be 
> useful to help you troubleshoot this OSD flapping issue.

Looking at the logs I finally found the issue: when I said that there
were no changes in network topology, I was mistaken. I removed an
unused (or so I thought) network card from each server.

These servers had two network cards that I installed and configured so
I would have a "public network" and a "cluster network". That was when
I was first installing the ceph cluster.

After having some problems with this set up I was advised by members
of this list to not use this dual network setup as it could make
debugging much more difficult. I followed this advice, or at least
tried to.

To make a long story short, ceph was still trying to use the second
network for some OSDs. With a "ceph config rm global cluster_network"
and a general restart of the cluster, everything started working
again.

Thanks for the help and sorry for the confusion.


Regards,

Rodrigo
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_memory_target ignored

2020-02-04 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk):
> Dear Stefan,
> 
> I check with top the total allocation. ps -aux gives:
> 
> USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> ceph  784155 15.8  3.1 6014276 4215008 ? Sl   Jan31 932:13 
> /usr/bin/ceph-osd --cluster ceph -f -i 243 ...
> ceph  784732 16.6  3.0 6058736 4082504 ? Sl   Jan31 976:59 
> /usr/bin/ceph-osd --cluster ceph -f -i 247 ...
> ceph  785812 17.1  3.0 5989576 3959996 ? Sl   Jan31 1008:46 
> /usr/bin/ceph-osd --cluster ceph -f -i 254 ...
> ceph  786352 14.9  3.1 5955520 4132840 ? Sl   Jan31 874:37 
> /usr/bin/ceph-osd --cluster ceph -f -i 256 ...
> 
> These should have 8GB resident by now, but stay at or just below 4G. The 
> other options are set as
> 
> [root@ceph-04 ~]# ceph config get osd.243 bluefs_allocator
> bitmap
> [root@ceph-04 ~]# ceph config get osd.243 bluestore_allocator
> bitmap
> [root@ceph-04 ~]# ceph config get osd.243 osd_memory_target
> 8589934592

What does "bluestore_cache_size" read? Our OSDs report "0".
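
It is also worth comparing what the daemon itself reports with what the mons
think is set, e.g. (using osd.243 from your example):

  ceph daemon osd.243 config get osd_memory_target
  ceph daemon osd.243 dump_mempools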

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Migrate journal to Nvme from old SSD journal drive?

2020-02-04 Thread Alex L
Hi,
I finally got my Samsung PM983 [1] to use as a journal for about 6 drives plus 
drive cache, replacing a consumer SSD (a Kingston SV300).

But I can't for the life of me figure out how to move an existing journal to 
this NVME on my Nautilus cluster. 

# Created a new big partition on the NVME
sgdisk --new=1:2048:+33GiB --change-name="1:ceph block.db" 
--typecode="1:30cd0809-c2b2-499c-8879-2d6b78529876" --mbrtogpt /dev/nvme0n1
partprobe
sgdisk -p /dev/nvme0n1

# The below assumes there is already a partition+ fs on the nvme?
ceph-bluestore-tool bluefs-bdev-migrate –dev-target /dev/nvme0n1p1 -devs-source 
/var/lib/ceph/osd/ceph-1/block.db
- too many positional options have been specified on the command line

ceph-bluestore-tool bluefs-bdev-migrate -–path 
/var/lib/ceph/osd/ceph-1/block.db –-dev-target /dev/nvme0n1p1
- too many positional options have been specified on the command line

# Or should I create a new block device? If yes, will the WAL come along? And how 
do I remove the old SSD journal partition?
ceph-bluestore-tool bluefs-bdev-new-db -–path /var/lib/ceph/osd/ceph-1/block.db 
–-dev-target /dev/nvme0n1p1

The documentation is not very clear on what the migration does, nor does it 
seem to use the same concept of a DEVICE (/dev/sda is a device to me).
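
For reference, the form the man page seems to expect is roughly the
following (a sketch using the paths above; note that --path points at the
OSD data directory rather than at block.db, the options take plain double
dashes, and the OSD should be stopped first):

  systemctl stop ceph-osd@1
  ceph-bluestore-tool bluefs-bdev-migrate \
      --path /var/lib/ceph/osd/ceph-1 \
      --devs-source /var/lib/ceph/osd/ceph-1/block.db \
      --dev-target /dev/nvme0n1p1

bluefs-bdev-new-db, by contrast, is meant for adding a DB device to an OSD
that does not have a separate one yet.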

Thanks in advance,
Alex


[1] - Performance stats: 
https://docs.google.com/spreadsheets/d/1LXupjEUnNdf011QNr24pkAiDBphzpz5_MwM0t9oAl54/edit?usp=sharing
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Understanding Bluestore performance characteristics

2020-02-04 Thread Bradley Kite
Hi Igor,

This has been very helpful.

I have identified (when numjobs=1, the least-worst case) that there are
approximately just as many bluestore_write_small_pre_read per second as
there are sequential-write IOPS per second:

Tue  4 Feb 22:44:34 GMT 2020
"bluestore_write_small_pre_read": 572818,
Tue  4 Feb 22:44:36 GMT 2020
"bluestore_write_small_pre_read": 576640,
Tue  4 Feb 22:44:37 GMT 2020
"bluestore_write_small_pre_read": 580501,

(Approx ~3800 write small pre-read)

With fio showing (1-minute average)

  write: IOPS=3292, BW=12.9MiB/s (13.5MB/s)(772MiB/60002msec)

This is my first dive into the code, but it looks like the
"bluestore_write_small_pre_read" counter gets incremented when there is a
head-read or tail-read of the block being written.

I don't understand enough about bluestore yet, but my thinking up until this
point was that most bluestore writes would have been aligned to the
allocation-chunk size, avoiding the need for head/tail reads? I've
specifically tried to tune bluestore_min_alloc_size.

Furthermore, I also noticed that the majority of these writes are actually
being written to the bluestore WAL - I've also got a very high number
of deferred_write_ops (marginally lower than bluestore_write_small_pre_read
per second - ~2200 vs 3800).

I tried to tune out deferred writes by
setting bluestore_prefer_deferred_size but it did not have any impact - I'm
guessing because the deferred writes are coming from the fact that the
writes are somehow not aligned with the originally allocated chunk sizes,
and head/tail (bluestore_write_small_pre_read) writes are *always* written
as deferred writes?

This is the first time I'm dipping my toe into this, so I've got a lot to
learn - but my obvious question at this point is: is it possible to tune
Bluestore so that all writes are 4k aligned, to avoid the head/tail reads
that I'm seeing? This is purely an RBD solution (no RGW or CephFS) and all
file systems residing on the RBD volumes use 4k block sizes, so I'm assuming
all writes should already be 4k aligned?
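
One way to watch this while fio is running is to sample the relevant
counters and the effective allocation size (osd.0 and the grep pattern are
just placeholders here):

  watch -n 2 "ceph daemon osd.0 perf dump | grep -E 'write_small|write_big|deferred_write'"
  # min_alloc_size is applied when an OSD is created, so the value below only
  # tells you what a newly created OSD would use:
  ceph daemon osd.0 config get bluestore_min_alloc_size_ssd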

Thanks for your help so far.

Regards
--
Brad.


On Tue, 4 Feb 2020 at 16:51, Igor Fedotov  wrote:

> Hi Bradley,
>
> you might want to check performance counters for this specific OSD.
>
> Available via 'ceph daemon osd.0 perf dump'  command in Nautilus. A bit
> different command for Luminous AFAIR.
>
> Then look for 'read' substring in the dump and try to find unexpectedly
> high read-related counter values if any.
>
> And/or share it here for brief analysis.
>
>
> Thanks,
>
> Igor
>
>
>
> On 2/4/2020 7:36 PM, Bradley Kite wrote:
>
> Hi Vitaliy
>
> Yes - I tried this and I can still see a number of reads (~110 iops,
> 440KB/sec) on the SSD, so it is significantly better, but the result is
> still puzzling - I'm trying to understand what is causing the reads. The
> problem is amplified with numjobs >= 2 but it looks like it is still there
> with just 1.
>
> Like some caching parameter is not correct, and the same blocks are being
> read over and over when doing a write?
>
> Could anyone advise on the best way for me to investigate further?
>
> I've tried strace (with -k) and 'perf record' but neither produce any
> useful stack traces to help understand what's going on.
>
> Regards
> --
> Brad
>
>
>
>
> On Tue, 4 Feb 2020 at 11:05, Vitaliy Filippov  
>  wrote:
>
>
> Hi,
>
> Try to repeat your test with numjobs=1, I've already seen strange
> behaviour with parallel jobs to one RBD image.
>
> Also as usual: https://yourcmc.ru/wiki/Ceph_performance :-)
>
>
> Hi,
>
> We have a production cluster of 27 OSD's across 5 servers (all SSD's
> running bluestore), and have started to notice a possible performance
> issue.
>
> In order to isolate the problem, we built a single server with a single
> OSD, and ran a few FIO tests. The results are puzzling, not that we were
> expecting good performance on a single OSD.
>
> In short, with a sequential write test, we are seeing huge numbers of
> reads
> hitting the actual SSD
>
> Key FIO parameters are:
>
> [global]
> pool=benchmarks
> rbdname=disk-1
> direct=1
> numjobs=2
> iodepth=1
> blocksize=4k
> group_reporting=1
> [writer]
> readwrite=write
>
> iostat results are:
> Device:         rrqm/s   wrqm/s       r/s      w/s      rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.00   105.00   4896.00   294.00  312080.00   1696.00   120.92    17.25    3.35    3.55    0.02   0.02  12.60
>
> There are nearly ~5000 reads/second (~300 MB/sec), compared with only
> ~300
> writes (~1.5MB/sec), when we are doing a sequential write test? The
> system
> is otherwise idle, with no other workload.
>
> Running the same fio test with only 1 thread (numjobs=1) still shows a
> high
> number of reads (110).
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1   0.00  1281.00  110.00 1463.00   440.00 12624.00
> 16.61
> 0.03