[ceph-users] Ceph health status reports: subtrees have overcommitted pool target_size_ratio + subtrees have overcommitted pool target_size_bytes

2019-10-15 Thread Thomas
Hi,

checking my Ceph health status I get this warning:
1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have
overcommitted pool target_size_ratio

The details are as follows:
Pools ['hdb_backup'] overcommit available storage by 1.288x due to target_size_bytes 0 on pools []

However, the relevant settings behind the POOL_TARGET_SIZE_RATIO_OVERCOMMITTED
health check are not even set on the pool:
root@ld3955:~# ceph osd pool get hdb_backup target_size_ratio
Error ENOENT: option 'target_size_ratio' is not set on pool 'hdb_backup'
root@ld3955:~# ceph osd pool get hdb_backup target_size_bytes
Error ENOENT: option 'target_size_bytes' is not set on pool 'hdb_backup'

Therefore I would consider this warning a bug.
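
For reference, a way to cross-check what the pg_autoscaler module (which raises
this warning) currently believes about the pools, and to clear any leftover
target size hints if they do show up somewhere (a hedged sketch):

  # show TARGET SIZE / RATIO and the autoscaler's view per pool
  ceph osd pool autoscale-status

  # if a hint turns out to be set after all, it can be cleared with:
  ceph osd pool set hdb_backup target_size_bytes 0
  ceph osd pool set hdb_backup target_size_ratio 0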

Regards
Thomas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph health status reports: Reduced data availability and this is resulting in slow requests are blocked

2019-10-15 Thread Thomas
Hi,

I want to use balancer mode "upmap" for all pools.
This mode is currently enabled for pool "hdb_backup" with ~600 TB of used space.
root@ld3955:~# rados df
POOL_NAME           USED    OBJECTS  CLONES     COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED   RD_OPS      RD    WR_OPS      WR USED COMPR UNDER COMPR
backup               0 B           0      0          0                  0       0        0        0     0 B         0     0 B        0 B         0 B
cephfs_data      1.1 TiB       89592      0     268776                  0       0        0        0     0 B     43443 144 GiB        0 B         0 B
cephfs_metadata  311 MiB          48      0        144                  0       0        0        6   6 KiB      7465 106 MiB        0 B         0 B
hdb_backup       585 TiB    51077985      0  153233955                  0       0        0 12577024 4.3 TiB 281002173 523 TiB        0 B         0 B
hdd              6.3 TiB      585051      0    1755153                  0       0        0  4420255  69 GiB   8219453 1.2 TiB        0 B         0 B

root@ld3955:~# ceph osd lspools
11 hdb_backup
51 hdd
52 backup
57 cephfs_data
58 cephfs_metadata

I started with pool "cephfs_metadata", which holds comparably little data.
The moment I executed
ceph config set mgr mgr/balancer/pool_ids 11,52,58
the Ceph status showed an increasing number of "Reduced data availability:
pg inactive, pg peering" warnings.

And the number of "slow requests are blocked > 32 sec" messages increased
sharply, up to 180.

Watching the Ceph log (ceph -w), it becomes clear that there is a correlation
between inactive PGs and blocked slow requests in my cluster.

How can I start analyzing why the cluster reports slow requests?
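
For reference, a starting point I would use (a hedged sketch; osd.<id> stands
for whichever OSDs "ceph health detail" names as having blocked requests, and
the daemon commands have to be run on the node hosting that OSD):

  # which PGs/OSDs are affected right now
  ceph health detail

  # on the host of an affected OSD, inspect what the slow ops are waiting on
  ceph daemon osd.<id> dump_blocked_ops
  ceph daemon osd.<id> dump_ops_in_flight
  ceph daemon osd.<id> dump_historic_ops

The event timestamps in those dumps usually show whether ops are stuck waiting
on peering, on subops from other OSDs, or on the local disk.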

THX

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-users Digest, Vol 81, Issue 28

2019-10-15 Thread renjianxinlover
Hi,
if I need the sizes of all root-level directories in a CephFS file system, is
there any simple way to do that via the Ceph system tools?
Thanks
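
For reference, CephFS keeps recursive statistics per directory and exposes them
as virtual extended attributes, so something like this should be enough (a
hedged sketch, assuming the file system is mounted at /mnt/cephfs):

  # recursive byte count of every top-level directory (kernel or FUSE mount)
  for d in /mnt/cephfs/*/; do
      printf '%s\t' "$d"
      getfattr --only-values -n ceph.dir.rbytes "$d"
      echo
  done

ceph.dir.rentries and ceph.dir.rfiles work the same way if file or entry counts
are needed instead of bytes.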


| |
renjianxinlover
|
|
renjianxinlo...@163.com
|
签名由网易邮箱大师定制
On 10/15/2019 04:57, wrote:
Send ceph-users mailing list submissions to
ceph-users@ceph.io

To subscribe or unsubscribe via email, send a message with subject or
body 'help' to
ceph-users-requ...@ceph.io

You can reach the person managing the list at
ceph-users-ow...@ceph.io

When replying, please edit your Subject line so it is more specific
than "Re: Contents of ceph-users digest..."

Today's Topics:

1. Past_interval start interval mismatch (last_clean_epoch reported)
(Huseyin Cotuk)
2. Re: Constant write load on 4 node ceph cluster (Ingo Schmidt)
3. Re: Constant write load on 4 node ceph cluster (Paul Emmerich)
4. RGW blocking on large objects (Robert LeBlanc)
5. Re: Recurring issue: PG is inconsistent, but lists no inconsistent objects
(Florian Haas)
6. Re: Recurring issue: PG is inconsistent, but lists no inconsistent objects
(Reed Dier)


--

Date: Mon, 14 Oct 2019 18:41:42 +0300
From: Huseyin Cotuk 
Subject: [ceph-users] Past_interval start interval mismatch
(last_clean_epoch reported)
To: ceph-users@ceph.io
Hi all,

I also hit bug #24866 in my test environment. According to the logs, the
last_clean_epoch in the specified OSD/PG is 17703, but the interval starts
with 17895. So the OSD fails to start. There are some other OSDs in the same
state.

2019-10-14 18:22:51.908 7f0a275f1700 -1 osd.21 pg_epoch: 18432 pg[18.51( v 18388'4 lc 18386'3 (0'0,18388'4] local-lis/les=18430/18431 n=1 ec=295/295 lis/c 18430/17702 les/c/f 18431/17703/0 18428/18430/18421) [11,21]/[11,21,20] r=1 lpr=18431 pi=[17895,18430)/3 crt=18388'4 lcod 0'0 unknown m=1 mbc={}] 18.51 past_intervals [17895,18430) start interval does not contain the required bound [17703,18430) start

The cause is that pg 18.51 went clean in 17703, but 17895 was reported to the
monitor.

I am using the latest stable version of Mimic (13.2.6).

Any idea how to fix it? Is there any way to bypass this check or fix the
reported epoch?

Thanks in advance.

Best regards,
Huseyin Cotuk
hco...@gmail.com
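
For reference, the epochs the down OSD itself has recorded for that PG can be
inspected offline with ceph-objectstore-tool (a hedged sketch; the OSD must be
stopped, and the data path below assumes the default directory layout):

  # dump the PG's info structure (including past_intervals and epoch data)
  # straight from osd.21's local store
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
      --pgid 18.51 --op info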



--

Date: Mon, 14 Oct 2019 18:34:17 +0200 (CEST)
From: Ingo Schmidt 
Subject: [ceph-users] Re: Constant write load on 4 node ceph cluster
To: Ashley Merrick 
Cc: ceph-users 

Great, this helped a lot. Although "ceph iostat" didn't give I/O stats for
single images, just a general overview of I/O, I remembered the new Nautilus
RBD performance monitoring.

https://ceph.com/rbd/new-in-nautilus-rbd-performance-monitoring/

With a "simple"
rbd perf image iotop
i was able to see that the writes indeed are from the Log Server and the Zabbix 
Monitoring Server. I didn't expect that it would cause that much I/O... 
unbelieveable...

----- Original Message -----
From: "Ashley Merrick" 
To: "i schmidt" 
Cc: "ceph-users" 
Sent: Monday, October 14, 2019 15:20:46
Subject: Re: [ceph-users] Constant write load on 4 node ceph cluster

Is the storage being used for the whole VM disk?

If so, have you checked that none of your software is writing constant logs, or
something else that could continuously write to disk?

If you're running a new version 

[ceph-users] Dealing with changing EC Rules with drive classifications

2019-10-15 Thread Jeremi Avenant
Good day

I'm currently administering a Ceph cluster that consists of HDDs &
SSDs. The rule for cephfs_data (ec) writes to both of these device
classes (HDD+SSD). I would like to change it so that
cephfs_metadata (non-ec) writes to SSD and cephfs_data (erasure coded, "ec")
writes to HDD, since we're experiencing high disk latency.

1) The first option to come to mind would be to migrate each pool to a new
rule but this would mean moving a tonne of data around. (How is disk space
calculated on this, if I use 600 TB in an EC pool, do I need another 600 TB
pool to move it over, or does it shrink the existing pool as it inflates
the new pool while moving?)

2) I would like to know if the alternative is possible:
i.e. Delete the SSDs from the default host bucket (leave everything as it
is) and move the metadata pool to the SSD based crush rule.

However, I'm not sure if this is possible, as it would mean deleting a leaf
from a bucket in our default root. And then, when you add a new SSD OSD, where
does it end up?

crush map - http://pastefile.fr/6f37e7e594a61d0edd9dc947349c756b
ceph osd pool ls detail -
http://pastefile.fr/0f215e1252ec58c144d9abfe1688adc8
osd tree - http://pastefile.fr/2acdd377a2db021b6af2996929b85082

If anyone has any input it would be greatly appreciated.

Regards

-- 
Jeremi-Ernst Avenant, Mr.
Cloud Infrastructure Specialist
Inter-University Institute for Data Intensive Astronomy
5th Floor, Department of Physics and Astronomy,
University of Cape Town
Rondebosch, Cape Town, 7600

Tel: 021 959 4137
Web: www.idia.ac.za
E-mail (IDIA): jer...@idia.ac.za
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RDMA

2019-10-15 Thread Max Krasilnikov
Hello! 

 Mon, Oct 14, 2019 at 07:28:07AM -, gabryel.mason-williams wrote: 

> Hello,
> 
> I was wondering what user experience was with using Ceph over RDMA? 
>   - How you set it up?

We used RoCE LAG with Mellanox ConnectX-4 Lx.

>   - Documentation used to set it up?

Generally, Mellanox community docs and Ceph docs:
https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide

>   - Known issues when using it?

Ceph's distribution does not ship systemd units with a LimitMEMLOCK=infinity
setting. We also needed to start Ceph as root to work around some limits.
Ceph RBD clients, as well as mgr daemons, do not support RDMA, so we had to set:
ms_cluster_type = async+rdma
ms_type = async+rdma
ms_public_type = async+posix
[mgr]
ms_type = async+posix

And we needed to disable any Jumbo Frames support in order to work with RDMA.
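
For reference, the memlock limit can be raised without patching the packaged
units via a systemd drop-in (a hedged sketch; the unit and file names are
assumptions, and mon/mgr/mds units would need the same treatment):

  # /etc/systemd/system/ceph-osd@.service.d/memlock.conf
  [Service]
  LimitMEMLOCK=infinity

  # then:
  systemctl daemon-reload
  systemctl restart ceph-osd.target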


>   - If you still use it?

As I can see on my graphs, there is a latency drop with Nautilus+RDMA. As of
now, the cluster has been up and running for 2 weeks without any issues under
our production load (rbd, radosgw, cephfs).
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RDMA

2019-10-15 Thread Vitaliy Filippov

Wow, does it really work?

And why is it not supported by RBD?

Can you show us the latency graphs before and after and tell the I/O  
pattern to which the latency applies? Previous common knowledge was that  
RDMA almost doesn't affect latency with Ceph, because most of the latency  
is in Ceph itself.



Hello!

 Mon, Oct 14, 2019 at 07:28:07AM -, gabryel.mason-williams wrote:


Hello,

I was wondering what user experience was with using Ceph over RDMA?
  - How you set it up?


We had used RoCE Lag with Mellanox ConnectX-4 Lx.


  - Documentation used to set it up?


Generally, Mellanox community docs and Ceph docs:
https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide


  - Known issues when using it?


Ceph's distribution does not include Systemd units with  
LimitMEMLOCK=infinity
setting. Also it was needed to start Ceph as root to workaround some  
limits.
Ceph rbd clients, so as mgr daemons, do not suport rdma, so it was  
needed to set

ms_cluster_type = async+rdma
ms_type = async+rdma
ms_public_type = async+posix
[mgr]
ms_type = async+posix

And we needed to disable any Jumbo Frames support in order to work with  
RDMA.




  - If you still use it?


As I can see on my graphs, it is latency drop with Nautilus+RDMA. As for  
now,
cluster is up and running for 2 weeks without any issues and with our  
production

load (rbd, radosgw, cephfs).
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RDMA

2019-10-15 Thread Wido den Hollander


On 10/15/19 1:29 PM, Vitaliy Filippov wrote:
> Wow, does it really work?
> 
> And why is it not supported by RBD?
> 
> Can you show us the latency graphs before and after and tell the I/O
> pattern to which the latency applies? Previous common knowledge was that
> RDMA almost doesn't affect latency with Ceph, because most of the
> latency is in Ceph itself.
> 

This is still the case. RDMA might shave a bit off the network latency,
but the code/CPU latency is still the highest in Ceph.

There is no real benefit in using RDMA over Ethernet+IP as that latency
is already very low with modern chips and switches.

Wido

>> Hello!
>>
>>  Mon, Oct 14, 2019 at 07:28:07AM -, gabryel.mason-williams wrote:
>>
>>> Hello,
>>>
>>> I was wondering what user experience was with using Ceph over RDMA?
>>>   - How you set it up?
>>
>> We had used RoCE Lag with Mellanox ConnectX-4 Lx.
>>
>>>   - Documentation used to set it up?
>>
>> Generally, Mellanox community docs and Ceph docs:
>> https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide
>>
>>
>>>   - Known issues when using it?
>>
>> Ceph's distribution does not include Systemd units with
>> LimitMEMLOCK=infinity
>> setting. Also it was needed to start Ceph as root to workaround some
>> limits.
>> Ceph rbd clients, so as mgr daemons, do not suport rdma, so it was
>> needed to set
>> ms_cluster_type = async+rdma
>> ms_type = async+rdma
>> ms_public_type = async+posix
>> [mgr]
>> ms_type = async+posix
>>
>> And we needed to disable any Jumbo Frames support in order to work
>> with RDMA.
>>
>>
>>>   - If you still use it?
>>
>> As I can see on my graphs, it is latency drop with Nautilus+RDMA. As
>> for now,
>> cluster is up and running for 2 weeks without any issues and with our
>> production
>> load (rbd, radosgw, cephfs).
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Inconsistent PG with data_digest_mismatch_info on all OSDs

2019-10-15 Thread Wido den Hollander
Hi,

I have a Mimic 13.2.6 cluster which is reporting an error about a PG being
inconsistent.

PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 21.e6d is active+clean+inconsistent, acting [988,508,825]

I checked 'list-inconsistent-obj' (See below) and it shows:

selected_object_info: "data_digest": "0xf4342d4a"
all osds: "data_digest": "0x224d6ca6"

This looks like issue 24994 [0], but this is Mimic 13.2.6 and not
Luminous 12.2.7

I also tried to download the object:

rados -p  get rb.0.304abf.238e1f29.0001cc48
rb.0.304abf.238e1f29.0001cc48

That doesn't work. It blocks forever and causes osd.988 to report a
slow request.

I don't want to repair the PG at the moment as this might be a bug.

Right now I'm thinking of restarting all three OSDs of that PG as this
might get things moving again.

But how could this happen with all three OSDs reporting the same data
digest?

Ideas?

Wido


{
  "epoch": 1351206,
  "inconsistents": [
{
  "object": {
"name": "rb.0.304abf.238e1f29.0001cc48",
"nspace": "",
"locator": "",
"snap": "head",
"version": 28429940
  },
  "errors": [],
  "union_shard_errors": [
"data_digest_mismatch_info"
  ],
  "selected_object_info": {
"oid": {
  "oid": "rb.0.304abf.238e1f29.0001cc48",
  "key": "",
  "snapid": -2,
  "hash": 367865453,
  "max": 0,
  "pool": 21,
  "namespace": ""
},
"version": "1246902'28520530",
"prior_version": "1240840'28429940",
"last_reqid": "osd.736.0:3448586",
"user_version": 28429940,
"size": 4194304,
"mtime": "2019-08-12 05:24:02.672911",
"local_mtime": "2019-08-12 05:24:02.673401",
"lost": 0,
"flags": [
  "dirty",
  "data_digest",
  "omap_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0xf4342d4a",
"omap_digest": "0x",
"expected_object_size": 4194304,
"expected_write_size": 4194304,
"alloc_hint_flags": 0,
"manifest": {
  "type": 0
},
"watchers": {}
  },
  "shards": [
{
  "osd": 508,
  "primary": false,
  "errors": [
"data_digest_mismatch_info"
  ],
  "size": 4194304,
  "omap_digest": "0x",
  "data_digest": "0x224d6ca6"
},
{
  "osd": 825,
  "primary": false,
  "errors": [
"data_digest_mismatch_info"
  ],
  "size": 4194304,
  "omap_digest": "0x",
  "data_digest": "0x224d6ca6"
},
{
  "osd": 988,
  "primary": true,
  "errors": [
"data_digest_mismatch_info"
  ],
  "size": 4194304,
  "omap_digest": "0x",
  "data_digest": "0x224d6ca6"
}
  ]
}
  ]
}

[0]: https://tracker.ceph.com/issues/24994
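
For reference, one way to check whether the three on-disk copies are at least
identical to each other, independent of the digest recorded in the object info
(a hedged sketch; each ceph-objectstore-tool run requires the corresponding OSD
to be stopped first, so this is disruptive):

  # on the node of each OSD in the acting set (988, 508, 825), with that OSD down:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-988 \
      --pgid 21.e6d rb.0.304abf.238e1f29.0001cc48 get-bytes /tmp/obj.988

  # then compare checksums of the three extracted copies:
  sha256sum /tmp/obj.*

  # (if the plain object name is ambiguous, the JSON spec printed by
  #  "--op list" can be used instead)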
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RDMA

2019-10-15 Thread Max Krasilnikov
Good day!

 Tue, Oct 15, 2019 at 02:29:58PM +0300, vitalif wrote: 

> Wow, does it really work?
> 
> And why is it not supported by RBD?

I haven't dived into the sources, but it is stated in the docs.

> 
> Can you show us the latency graphs before and after and tell the I/O pattern
> to which the latency applies? Previous common knowledge was that RDMA almost
> doesn't affect latency with Ceph, because most of the latency is in Ceph
> itself.

There is a graph here. It was pure Nautilus before 10-05 and Nautilus+RDMA after:
https://nc.avalon.org.ua/s/LptPTEaTeTTyKtD
The link expires on Nov 1.

Most of my clients are OpenStack instances with RBD volumes. The cluster
consists of 30 SSD and 10 HDD OSDs; the RBD volumes live on SSD.

It was an experiment with RDMA, but its result was reasonably good, so we will
test it for a longer time.

> >>I was wondering what user experience was with using Ceph over RDMA?
> >>  - How you set it up?
> >
> >We had used RoCE Lag with Mellanox ConnectX-4 Lx.
> >
> >>  - Documentation used to set it up?
> >
> >Generally, Mellanox community docs and Ceph docs:
> >https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide
> >
> >>  - Known issues when using it?
> >
> >Ceph's distribution does not include Systemd units with
> >LimitMEMLOCK=infinity
> >setting. Also it was needed to start Ceph as root to workaround some
> >limits.
> >Ceph rbd clients, so as mgr daemons, do not suport rdma, so it was needed
> >to set
> >ms_cluster_type = async+rdma
> >ms_type = async+rdma
> >ms_public_type = async+posix
> >[mgr]
> >ms_type = async+posix
> >
> >And we needed to disable any Jumbo Frames support in order to work with
> >RDMA.
> >
> >
> >>  - If you still use it?
> >
> >As I can see on my graphs, it is latency drop with Nautilus+RDMA. As for
> >now,
> >cluster is up and running for 2 weeks without any issues and with our
> >production
> >load (rbd, radosgw, cephfs).
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RDMA

2019-10-15 Thread vitalif

I don't see any changes here...

There is graph here. It was pure Nautilus before 10-05 and 
Nautilus+RDMA after.

https://nc.avalon.org.ua/s/LptPTEaTeTTyKtD
Link expires on Nov 1.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RDMA

2019-10-15 Thread Paul Emmerich
That's apply/commit latency (the two have been identical since BlueStore, btw,
so there's no point in tracking both). It should not contain any network component.

Since the path you are optimizing is inter-OSD communication: check
out subop latency, that's the one where this should show up.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Oct 15, 2019 at 2:39 PM  wrote:
>
> I don't see any changes here...
>
> > There is graph here. It was pure Nautilus before 10-05 and
> > Nautilus+RDMA after.
> > https://nc.avalon.org.ua/s/LptPTEaTeTTyKtD
> > Link expires on Nov 1.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW blocking on large objects

2019-10-15 Thread Robert LeBlanc
On Mon, Oct 14, 2019 at 2:58 PM Paul Emmerich  wrote:
>
> Could the 4 GB GET limit saturate the connection from rgw to Ceph?
> Simple to test: just rate-limit the health check GET

I don't think so; we have dual 25 Gbps in a LAG, so Ceph to RGW has
multiple paths, but we aren't balancing on port yet, so RGW to HAProxy
is probably limited to one link.

> Did you increase "objecter inflight ops" and "objecter inflight op bytes"?
> You absolutely should adjust these settings for large RGW setups,
> defaults of 1024 and 100 MB are way too low for many RGW setups, we
> default to 8192 and 800MB
>
> Sometimes "ms async op threads" and "ms async max op threads" might
> help as well (we adjust them by default, but for other reasons)

Thanks, I'll look into these options and see if they help.
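
For anyone following along, these are plain config options, so they would
typically go into the RGW client section of ceph.conf on the gateway hosts
(a hedged sketch using Paul's suggested values; the section name depends on
how your rgw instances are named):

  [client.rgw.gateway1]
  # defaults are 1024 ops and 100 MB; 838860800 bytes = 800 MB
  objecter_inflight_ops = 8192
  objecter_inflight_op_bytes = 838860800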


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dealing with changing EC Rules with drive classifications

2019-10-15 Thread Robert LeBlanc
On Tue, Oct 15, 2019 at 2:42 AM Jeremi Avenant  wrote:

> Good day
>
> I'm currently administrating a Ceph cluster that consists out of HDDs &
> SSDs. The rule for cephfs_data (ec) is to write to both these drive
> classifications (HDD+SSD). I would like to change it so that
> cephfs_metadata (non-ec) writes to SSD & cephfs_data (erasure encoded "ec")
> writes to HDD since we're experiencing high disk latency.
>
> 1) The first option to come to mind would be to migrate each pool to a new
> rule but this would mean moving a tonne of data around. (How is disk space
> calculated on this, if I use 600 TB in an EC pool, do I need another 600 TB
> pool to move it over, or does it shrink the existing pool as it inflates
> the new pool while moving?)
>
> 2) I would like to know if the alternative is possible:
> i.e. Delete the SSDs from the default host bucket (leave everything as it
> is) and move the metadata pool to the SSD based crush rule.
>
> However I'm not sure if this is possible as it will be deleting a leaf
> from a bucket in our default root. Which means when you add a new SSD osd
> where does it end up?
>
> crush map - http://pastefile.fr/6f37e7e594a61d0edd9dc947349c756b
> ceph osd pool ls detail -
> http://pastefile.fr/0f215e1252ec58c144d9abfe1688adc8
> osd tree - http://pastefile.fr/2acdd377a2db021b6af2996929b85082
>
> If anyone has any input it would be greatly appreciated.
>

What version of Ceph are you running? You may be able to use device classes
instead of munging the CRUSH tree.

Updating the rule to change the destinations will only move data around (it
may be a large data movement) and will only need as much space as PGs in
flight use. For instance if your PG size is 100 GB and an erasure encoding
of 10+2, then each PG takes 10 GB on each OSD. If your osd_max_backfills =
1, then you only need 10 GB of head room on each OSD to make the data
movement. If your osd_max_backfills = 2, then you need 20 GBs as two PGs
may be moved onto the OSD before any PGs may be deleted off of it.

By changing the rule to only use the HDD device class, it will migrate the data
off the SSDs and onto the HDDs (only moving PG shards as needed). Then you
can change the replication rule for the metadata to only use SSD, and it
will migrate the PG replicas off the HDDs.
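
For example, something along these lines (a hedged sketch; the rule and profile
names here are made up, k/m are taken from the 10+2 example above, and changing
a pool's crush_rule will of course kick off the large data movement described):

  # replicated rule restricted to the SSD device class, for cephfs_metadata
  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set cephfs_metadata crush_rule ssd-only

  # for the EC data pool, the device class comes from the erasure-code profile
  ceph osd erasure-code-profile set ec-hdd k=10 m=2 \
      crush-device-class=hdd crush-failure-domain=host
  ceph osd crush rule create-erasure ec-hdd-rule ec-hdd
  ceph osd pool set cephfs_data crush_rule ec-hdd-rule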

Setting the following in /etc/ceph/ceph.conf on the OSDs and restarting the
OSDs before backfilling will reduce the impact of the backfills.

osd op queue = wpq
osd op queue cut off = high


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Corrupted block.db for osd. How to extract particular PG from that osd?

2019-10-15 Thread Alexey Kalinkin
Hello cephers!

We lost the block.db file for one of our OSDs. This results in a down OSD and
an incomplete PG. The block file of the OSD, which symlinks to a particular
/dev device, is alive and correct.
My main question: is there any theoretical possibility to extract a particular
PG from an OSD whose block.db device has been lost? Is there any way I can do
it using ceph-objectstore-tool or ceph-bluestore-tool? We're running ceph
version 12.2.12.

-- 
Very truly yours,
Kalinkin Alexey
Research fellow of Epigenetic lab, RCMG.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS and 32-bit inode numbers

2019-10-15 Thread Dan van der Ster
Hi all,

One of our users has some 32-bit commercial software that they want to
use with CephFS, but it's not working because our inode numbers are
too large. E.g. his application gets a "file too big" error trying to
stat inode 0x40008445FB3.

I'm aware that CephFS offsets the inode numbers by (mds_rank + 1) *
2^40; in the case above the file is managed by mds.3.

Did anyone see this same issue and find a workaround? (I read that
GlusterFS has an enable-ino32 client option -- does CephFS have
something like that planned?)

Thanks!

Dan
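
For reference, the kernel client does have a mount option that maps inode
numbers into 32-bit space (a hedged sketch; the monitor address and credentials
are placeholders, and as the follow-ups in this thread discuss, hash collisions
are a real concern):

  # kernel CephFS mount with 32-bit inode numbers (64-bit inos are hashed down)
  mount -t ceph mon1:6789:/ /mnt/cephfs \
      -o name=myuser,secretfile=/etc/ceph/myuser.secret,ino32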
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Corrupted block.db for osd. How to extract particular PG from that osd?

2019-10-15 Thread Paul Emmerich
No, it's not possible to recover from a *completely dead* block.db, it
contains all the metadata (like... object names)

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Oct 15, 2019 at 6:02 PM Alexey Kalinkin  wrote:
>
> Hello cephers!
>
> We're lost block.db file for one of our osd. This results in down osd and 
> incomplete PG. Block file from osd, which symlinks to particular /dev folder 
> is live and correct.
> My main question is there any thereoretical possibility to extract particular 
> PG from osd which block.db device has been lost? Is there any way I can make 
> it using ceph-objectstore-tool or ceph-bluestore-tool? We're running ceph 
> version 12.2.12
>
> --
> Very truly yours,
> Kalinkin Alexey
> Research fellow of Epigenetic lab, RCMG.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recurring issue: PG is inconsistent, but lists no inconsistent objects

2019-10-15 Thread Florian Haas
On 14/10/2019 22:57, Reed Dier wrote:
> I had something slightly similar to you.
> 
> However, my issue was specific/limited to the device_health_metrics pool
> that is auto-created with 1 PG when you turn that mgr feature on.
> 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56315.html

Thank you — yes that does look superficially similar, though in my case
it's an RGW pool. (Also, my sympathy on the OSD crashes; that must have
been quite the jolt.)

However, the similarities unfortunately end where the pg repair fixes
things for you. For me, the scrub error keeps coming back. It's quite odd.

Cheers,
Florian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS and 32-bit Inode Numbers

2019-10-15 Thread Nathan Fish
I'm not sure exactly what would happen on an inode collision, but I'm
guessing Bad Things. If my math is correct, with a 2^32 inode space you
should expect the first collision after roughly 2^16 entries (the birthday
bound). As that's only 65536 files, that's not safe at all.

On Mon, Oct 14, 2019 at 8:14 AM Dan van der Ster  wrote:
>
> OK I found that the kernel has an "ino32" mount option which hashes 64
> bit inos to 32-bit space.
> Has anyone tried this?
> What happens if two files collide?
>
> -- Dan
>
> On Mon, Oct 14, 2019 at 1:18 PM Dan van der Ster  wrote:
> >
> > Hi all,
> >
> > One of our users has some 32-bit commercial software that they want to
> > use with CephFS, but it's not working because our inode numbers are
> > too large. E.g. his application gets a "file too big" error trying to
> > stat inode 0x40008445FB3.
> >
> > I'm aware that CephFS is offsets the inode numbers by (mds_rank + 1) *
> > 2^40; in the case above the file is managed by mds.3.
> >
> > Did anyone see this same issue and find a workaround? (I read that
> > GlusterFS has an enable-in32 client option -- does CephFS have
> > something like that planned?)
> >
> > Thanks!
> >
> > Dan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Librados in openstack

2019-10-15 Thread solarflow99
I was wondering if this is provided somehow? All I see mentioned is RBD and
RadosGW. If you have applications built with librados, surely OpenStack must
have a way to provide it?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] run-s3tests.sh against Nautilus

2019-10-15 Thread Francisco Londono

I ran the s3 test (run-s3tests.sh)  in vstart mode against Nautilus. Is there a 
better guide out there than this one?

https://docs.ceph.com/ceph-prs/17381/dev/#testing-how-to-run-s3-tests-locally

Thus far I ran into a ceph.conf parse issue, a keyring permission issue and a 
radosgw crash.

+ radosgw-admin user create --uid=s3test1 --display-name=tester1 
--access-key=access1 --secret=secret1 --email=test...@ceph.com
server name not found: [v2:xx.x.x.xx:40466 (Name or service not known)
unable to parse addrs in '[v2:xx.x.x.xx:40466,v1:xx.x.x.xx:40467]'
couldn't init storage provider

The ceph.conf under the global header shows:
   mon host =  [v2:xx.x.x.xx:40466,v1:xx.x.x.xx:40467]

In Luminous, the ceph.conf created used to be in this format:
  host = (hostname)
mon addr = xx.x.x.xx:6789

Thanks.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS and 32-bit Inode Numbers

2019-10-15 Thread Janne Johansson
On Tue, 15 Oct 2019 at 19:40, Nathan Fish wrote:

> I'm not sure exactly what would happen on an inode collision, but I'm
> guessing Bad Things. If my math is correct, a 2^32 inode space will
> have roughly 1 collision per 2^16 entries. As that's only 65536,
> that's not safe at all.
>

Yeah, the birthday paradox will make sure you hit it very soon. 8-(

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS and 32-bit Inode Numbers

2019-10-15 Thread Gregory Farnum
Once upon a time ceph-fuse did its own internal hash-map of live
inodes to handle that (by just remembering which 64-bit inode any
32-bit one actually referred to).

Unfortunately I believe this has been ripped out because it caused
problems when the kernel tried to do lookups on 32-bit inodes that
were so old they'd been recycled or dropped out of the mapping table.

It's conceivable the kernel can implement this more safely and I'd
certainly defer to Zheng or somebody as this is just out of my head
and from prepping the source tree to see what still exists, but I
would not expect it to be a safe long-term solution.
-Greg

On Tue, Oct 15, 2019 at 12:01 PM Janne Johansson  wrote:
>
>
>
> Den tis 15 okt. 2019 kl 19:40 skrev Nathan Fish :
>>
>> I'm not sure exactly what would happen on an inode collision, but I'm
>> guessing Bad Things. If my math is correct, a 2^32 inode space will
>> have roughly 1 collision per 2^16 entries. As that's only 65536,
>> that's not safe at all.
>
>
> Yeah, the birthday paradox will make sure you hit it very soon. 8-(
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS Crashes on “ceph fs volume v011”

2019-10-15 Thread Guilherme Geronimo
Dear ceph users,
we're experiencing a segfault during MDS startup (replay process) which is 
making our FS inaccessible.

MDS log messages:

Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c08f49700  1 -- 192.168.8.195:6800/3181891717 <== osd.26 
192.168.8.209:6821/2419345 3  osd_op_reply(21 1. [getxattr] v0'0 
uv0 ondisk = -61 ((61) No data available)) v8  154+0+0 (3715233608 0 0) 
0x2776340 con 0x18bd500
Oct 15 03:41:39.894584 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10 mds.0.cache.ino(0x100) _fetched got 0 and 544
Oct 15 03:41:39.894658 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10 mds.0.cache.ino(0x100)  magic is 'ceph fs volume v011' 
(expecting 'ceph fs volume v011')
Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10  mds.0.cache.snaprealm(0x100 seq 1 0x1799c00) open_parents 
[1,head]
Oct 15 03:41:39.894735 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10 mds.0.cache.ino(0x100) _fetched [inode 0x100 [...2,head] ~mds0/ 
auth v275131 snaprealm=0x1799c00 f(v0 1=1+0) n(v76166 rc2020-07-17 
15:29:27.00 b41838692297 -3184=-3168+-16)/n() (iversion lock) 0x18bf800]
Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10 MDSIOContextBase::complete: 18C_IO_Inode_Fetched
Oct 15 03:41:39.894821 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10 mds.0.cache.ino(0x1) _fetched got 0 and 482
Oct 15 03:41:39.894891 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.201 
7f3c00589700 10 mds.0.cache.ino(0x1)  magic is 'ceph fs volume v011' (expecting 
'ceph fs volume v011')
Oct 15 03:41:39.894958 mds1 ceph-mds:   -472> 2019-10-15 00:40:30.205 7f3c00589700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3c00589700 thread_name:fn_anonymous

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x11390) [0x7f3c0e48a390]
 2: (operator<<(std::ostream&, SnapRealm const&)+0x42) [0x72cb92]
 3: (SnapRealm::merge_to(SnapRealm*)+0x308) [0x72f488]
 4: (CInode::decode_snap_blob(ceph::buffer::list&)+0x53) [0x6e1f63]
 5: (CInode::decode_store(ceph::buffer::list::iterator&)+0x76) [0x702b86]
 6: (CInode::_fetched(ceph::buffer::list&, ceph::buffer::list&, Context*)+0x1b2) [0x702da2]
 7: (MDSIOContextBase::complete(int)+0x119) [0x74fcc9]
 8: (Finisher::finisher_thread_entry()+0x12e) [0x7f3c0ebffece]
 9: (()+0x76ba) [0x7f3c0e4806ba]
 10: (clone()+0x6d) [0x7f3c0dca941d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Oct 15 03:41:39.895400 mds1 ceph-mds: --- logging levels ---
Oct 15 03:41:39.895473 mds1 ceph-mds:0/ 5 none
Oct 15 03:41:39.895473 mds1 ceph-mds:0/ 1 lockdep


Cluster status information:

  cluster:
id: b8205875-e56f-4280-9e52-6aab9c758586
health: HEALTH_WARN
1 filesystem is degraded
1 nearfull osd(s)
11 pool(s) nearfull

  services:
mon: 3 daemons, quorum mon1,mon2,mon3
mgr: mon1(active), standbys: mon2, mon3
mds: fs_padrao-1/1/1 up  {0=mds1=up:replay(laggy or crashed)}
osd: 90 osds: 90 up, 90 in

  data:
pools:   11 pools, 1984 pgs
objects: 75.99 M objects, 285 TiB
usage:   457 TiB used, 181 TiB / 639 TiB avail
pgs: 1896 active+clean
 87   active+clean+scrubbing+deep+repair
 1active+clean+scrubbing

  io:
client:   89 KiB/s wr, 0 op/s rd, 3 op/s wr

Has anyone seen anything like this?
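
For reference, it may help to capture a verbose MDS log of the next replay
attempt before deciding on anything drastic (a hedged sketch; debug level 20 is
extremely chatty, so revert it once the crash has been captured):

  # raise MDS debug logging via the centralized config (Mimic and later)
  ceph config set mds debug_mds 20
  ceph config set mds debug_journaler 20

  # ... let the MDS attempt replay, collect /var/log/ceph/ceph-mds.mds1.log ...

  # then revert to the defaults
  ceph config rm mds debug_mds
  ceph config rm mds debug_journaler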

Regards,

[]'s
Arthur

Sent from my iPhone
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Day Content & Sponsors Needed

2019-10-15 Thread Mike Perez
Hi everyone,

Our schedule has filled up for Ceph Day London, but we're still looking for
content for Ceph Day Poland on October 28, as well as Ceph Day San Diego
November 18. If you're interested in giving a community talk, please see
any of the Ceph Days links from my earlier email for the CFP form. If you
need help with preparing your title and abstract, please contact me
directly. Thank you!

On Sat, Oct 5, 2019 at 11:03 AM Mike Perez  wrote:

> Hi everyone,
>
> We have some exciting Ceph Days coming up and we're still looking for
> content and sponsors. Take a look:
>
> * Ceph Day Argentina (Spanish speakers only) - October 16:
> https://ceph.io/cephdays/ceph-day-argentina-2019/
>
> * Ceph Day London - October 24:
> https://ceph.io/cephdays/ceph-day-london-2019/
>
> * Ceph Day Poland - October 28:
> https://ceph.io/cephdays/ceph-day-poland-2019/
>
> * Ceph Day San Diego - November 18:
> https://ceph.io/cephdays/ceph-rook-day-san-diego-2019/
>
> If you or you know someone who can speak at these events please follow the
> call for proposal links on each page. If you need help working on your
> title/abstract, please feel free to reach out to me directly.
>
> If your company is interested in sponsoring a Ceph Day, please take a look
> at our brochure:
>
> https://ceph.io/wp-content/uploads/2019/10/Ceph-Day-Partner-Sponsorship.pdf
>
> Thanks!
>
> --
>
> Mike Perez
>
> he/him
>
> Ceph Community Manager
>
>
> M: +1-951-572-2633
>
> 494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee
>


-- 

Mike Perez

he/him

Ceph Community Manager


M: +1-951-572-2633

494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io