Re: [ceph-users] Balanced MDS, all as active and recommended client settings.

2018-02-23 Thread Daniel Carrasco
Finally, I've changed the configuration to the following:

##
### MDS
##
[mds]
  mds_cache_memory_limit = 792723456
  mds_bal_mode = 1

##
### Client
##
[client]
  client_cache_size = 32768
  client_mount_timeout = 30
  client_oc_max_objects = 2
  client_oc_size = 1048576000
  client_permissions = false
  client_quota = false
  rbd_cache = true
  rbd_cache_size = 671088640

I've disabled client_permissions and client_quota because the cluster is
only for the webpage and the network is isolated, so it doesn't need to check
the permissions every time, and I've disabled the quota check because there
is no quota on this cluster.
This should lower the number of requests to the MDS and the CPU usage, right?
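
(For reference, a quick way to confirm the running MDS actually picked up the new
values is a "config get" against its admin socket. A minimal sketch, assuming the
default socket location and that the MDS daemon is named after the host; adjust
the daemon name to your deployment:)

ceph daemon mds.$(hostname -s) config get mds_cache_memory_limit
ceph daemon mds.$(hostname -s) config get mds_bal_mode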

Greetings!!

2018-02-22 19:34 GMT+01:00 Patrick Donnelly :

> On Wed, Feb 21, 2018 at 11:17 PM, Daniel Carrasco 
> wrote:
> > I want to search also if there is any way to cache file metadata on
> client,
> > to lower the MDS load. I suppose that files are cached but the client
> check
> > with MDS if there are changes on files. On my server files are the most
> of
> > time read-only so MDS data can be also cached for a while.
>
> The MDS issues capabilities that allow clients to coherently cache
> metadata.
>
> --
> Patrick Donnelly
>



-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Some monitors have still not reached quorum

2018-02-23 Thread Kevin Olbrich
 Hi!

On a new cluster, I get the following error. All 3x mons are connected to
the same switch and ping between them works (firewalls disabled).
Mon-nodes are Ubuntu 16.04 LTS on Ceph Luminous.


[ceph_deploy.mon][ERROR ] Some monitors have still not reached quorum:
[ceph_deploy.mon][ERROR ] mon03
[ceph_deploy.mon][ERROR ] mon02
[ceph_deploy.mon][ERROR ] mon01


root@adminnode:~# cat ceph.conf
[global]
fsid = 2689defb-8715-47bb-8d78-e862089adf7a
ms_bind_ipv6 = true
mon_initial_members = mon01, mon02, mon03
mon_host =
[fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[fd91:462b:4243:47e::1:3]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = fdd1:ecbd:731f:ee8e::/64
cluster network = fd91:462b:4243:47e::/64


root@mon01:~# ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: eth0:  mtu 9000 qdisc pfifo_fast state
UP group default qlen 1000
link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
inet 172.17.1.1/16 brd 172.17.255.255 scope global eth0
   valid_lft forever preferred_lft forever
inet6 fd91:462b:4243:47e::1:1/64 scope global
   valid_lft forever preferred_lft forever
inet6 fe80::baae:edff:fee9:b661/64 scope link
   valid_lft forever preferred_lft forever
3: wlan0:  mtu 1500 qdisc noop state DOWN group
default qlen 1000
link/ether 00:db:df:64:34:d5 brd ff:ff:ff:ff:ff:ff
4: eth0.22@eth0:  mtu 9000 qdisc noqueue
state UP group default qlen 1000
link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
inet6 fdd1:ecbd:731f:ee8e::1:1/64 scope global
   valid_lft forever preferred_lft forever
inet6 fe80::baae:edff:fee9:b661/64 scope link
   valid_lft forever preferred_lft forever


Don't mind wlan0, that's just because this node is built from an Intel NUC.

Any idea?

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG mapped to OSDs on same host although 'chooseleaf type host'

2018-02-23 Thread Wido den Hollander



On 02/23/2018 12:42 AM, Mike Lovell wrote:
was the pg-upmap feature used to force a pg to get mapped to a 
particular osd?




Yes it was. This is a semi-production cluster where the balancer module 
has been enabled with the upmap feature.


It seems it remapped PGs to OSDs on the same host.

root@man:~# ceph osd dump|grep pg_upmap|grep 1.41
pg_upmap_items 1.41 [9,15,11,7,10,2]
root@man:~#

I don't know exactly what I have to extract from that output, but it 
does seem to be the case here.


I removed the upmap entry for this PG and that fixed it:

$ ceph osd rm-pg-upmap-items 1.41

I also disabled the balancer for now (will report an issue) and removed
all other upmap entries:


$ ceph osd dump|grep pg_upmap_items|awk '{print $2}'|xargs -n 1 ceph osd 
rm-pg-upmap-items
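
(As a follow-up sanity check, here is a minimal sketch of a scan for PGs whose
acting set still lands on a single host. It assumes bash and jq, and that the
JSON field names match this Luminous build; treat it as untested and adjust as
needed:)

# list every PG whose acting OSDs map to fewer distinct hosts than OSDs
for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do
    # resolve each acting OSD to its CRUSH host
    hosts=$(for osd in $(ceph pg map "$pg" -f json | jq -r '.acting[]'); do
                ceph osd find "$osd" | jq -r '.crush_location.host'
            done)
    dups=$(echo "$hosts" | sort | uniq -d)
    [ -n "$dups" ] && echo "$pg: acting set shares host(s): $dups"
done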


Thanks for the hint!

Wido


mike

On Thu, Feb 22, 2018 at 10:28 AM, Wido den Hollander wrote:


Hi,

I have a situation with a cluster which was recently upgraded to
Luminous and has a PG mapped to OSDs on the same host.

root@man:~# ceph pg map 1.41
osdmap e21543 pg 1.41 (1.41) -> up [15,7,4] acting [15,7,4]
root@man:~#

root@man:~# ceph osd find 15|jq -r '.crush_location.host'
n02
root@man:~# ceph osd find 7|jq -r '.crush_location.host'
n01
root@man:~# ceph osd find 4|jq -r '.crush_location.host'
n02
root@man:~#

As you can see, OSD 15 and 4 are both on the host 'n02'.

This PG went inactive when the machine hosting both OSDs went down
for maintenance.

My first suspect was the CRUSHMap and the rules, but those are fine:

rule replicated_ruleset {
         id 0
         type replicated
         min_size 1
         max_size 10
         step take default
         step chooseleaf firstn 0 type host
         step emit
}

This is the only rule in the CRUSHMap.

ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       19.50325 root default
-2        2.78618     host n01
  5   ssd  0.92999         osd.5      up  1.0 1.0
  7   ssd  0.92619         osd.7      up  1.0 1.0
14   ssd  0.92999         osd.14     up  1.0 1.0
-3        2.78618     host n02
  4   ssd  0.92999         osd.4      up  1.0 1.0
  8   ssd  0.92619         osd.8      up  1.0 1.0
15   ssd  0.92999         osd.15     up  1.0 1.0
-4        2.78618     host n03
  3   ssd  0.92999         osd.3      up  0.94577 1.0
  9   ssd  0.92619         osd.9      up  0.82001 1.0
16   ssd  0.92999         osd.16     up  0.84885 1.0
-5        2.78618     host n04
  2   ssd  0.92999         osd.2      up  0.93501 1.0
10   ssd  0.92619         osd.10     up  0.76031 1.0
17   ssd  0.92999         osd.17     up  0.82883 1.0
-6        2.78618     host n05
  6   ssd  0.92999         osd.6      up  0.84470 1.0
11   ssd  0.92619         osd.11     up  0.80530 1.0
18   ssd  0.92999         osd.18     up  0.86501 1.0
-7        2.78618     host n06
  1   ssd  0.92999         osd.1      up  0.88353 1.0
12   ssd  0.92619         osd.12     up  0.79602 1.0
19   ssd  0.92999         osd.19     up  0.83171 1.0
-8        2.78618     host n07
  0   ssd  0.92999         osd.0      up  1.0 1.0
13   ssd  0.92619         osd.13     up  0.86043 1.0
20   ssd  0.92999         osd.20     up  0.77153 1.0

Here you see osd.15 and osd.4 on the same host 'n02'.

This cluster was upgraded from Hammer to Jewel and now Luminous and
it doesn't have the latest tunables yet, but should that matter? I
never encountered this before.

tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

I don't want to touch this yet in the case this is a bug or glitch
in the matrix somewhere.

I hope it's just an admin mistake, but so far I'm not able to find a
clue pointing to that.

root@man:~# ceph osd dump|head -n 12
epoch 21545
fsid 0b6fb388-6233-4eeb-a55c-476ed12bdf0a
created 2015-04-28 14:43:53.950159
modified 2018-02-22 17:56:42.497849
flags sortbitwise,recovery_deletes,purged_snapdirs
crush_version 22
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release luminous
root@man:~#

I also downloaded the CRUSHmap and ran crushtool with --test and
--show-mappings, but that didn't show any PG mapped to the same host.

Any ideas on what might be going on here?

Wido
___
ceph-users mailing list
ceph-users@lists.cep

Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-02-23 Thread Caspar Smit
Hi all,

Thanks for all your follow ups on this. The Samsung SM863a is indeed a very
good alternative, thanks!
We ordered both (SM863a & DC S4600) so we can compare.

Intel's response (or rather the lack of it) is not very promising. Although
we have very good experiences with Intel DC SSDs, we still want to give
them a chance.
Hopefully the SCV10111 firmware has fixed the issue! (The changelog for the
firmware doesn't mention any major problem fixed though, only 'bugfixes')

Will let you know the results (probably in a month or two).

Kind regards,
Caspar

2018-02-23 0:18 GMT+01:00 Mike Lovell :

> adding ceph-users back on.
>
> it sounds like the enterprise samsungs and hitachis have been mentioned
> on the list as alternatives. i have 2 micron 5200 (pro i think) that i'm
> beginning testing on and have some micron 9100 nvme drives to use as
> journals. so the enterprise micron might be good. i did try some micron
> m600s a couple years ago and was disappointed by them so i'm avoiding the
> "prosumer" ones from micron if i can. my use case has been the 1TB range
> ssds and am using them mainly as a cache tier and filestore. my needs might
> not line up closely with yours though.
>
> mike
>
> On Thu, Feb 22, 2018 at 3:58 PM, Hans Chris Jones <
> chris.jo...@lambdastack.io> wrote:
>
>> Interesting. This does not inspire confidence. What SSDs (2TB or 4TB) do
>> people have good success with in high use production systems with bluestore?
>>
>> Thanks
>>
>> On Thu, Feb 22, 2018 at 5:32 PM, Mike Lovell 
>> wrote:
>>
>>> hrm. intel has, until a year ago, been very good with ssds. the
>>> description of your experience definitely doesn't inspire confidence. intel
>>> also dropping the entire s3xxx and p3xxx series last year before having a
>>> viable replacement has been driving me nuts.
>>>
>>> i don't know that i have the luxury of being able to return all of the
>>> ones i have or just buying replacements. i'm going to need to at least try
>>> them in production. it'll probably happen with the s4600 limited to a
>>> particular fault domain. these are also going to be filestore osds so maybe
>>> that will result in a different behavior. i'll try to post updates as i
>>> have them.
>>>
>>> mike
>>>
>>> On Thu, Feb 22, 2018 at 2:33 PM, David Herselman  wrote:
>>>
 Hi Mike,



 I eventually got hold of a customer relations manager at Intel but his
 attitude was lack luster and Intel never officially responded to any
 correspondence we sent them. The Intel s4600 drives all passed our standard
 burn-in tests, they exclusively appear to fail once they handle production
 BlueStore usage, generally after a couple days use.



 Intel really didn’t seem interested, even after explaining that the
 drives were in different physical systems in different data centres and
 that I had been in contact with another Intel customer who had experienced
 similar failures in Dell equipment (our servers are pure Intel).





 Perhaps there’s interest in a Lawyer picking up the issue and their
 attitude. Not advising customers of a known issue which leads to data loss
 is simply negligent, especially on a product that they tout as being more
 reliable than spinners and has their Data Centre reliability stamp.



 I returned the lot and am done with Intel SSDs, will advise as many
 customers and peers to do the same…





 Regards

 David Herselman



>>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some monitors have still not reached quorum

2018-02-23 Thread Kevin Olbrich
I always see this:

[mon01][DEBUG ] "mons": [
[mon01][DEBUG ]   {
[mon01][DEBUG ] "addr": "[fd91:462b:4243:47e::1:1]:6789/0",
[mon01][DEBUG ] "name": "mon01",
[mon01][DEBUG ] "public_addr": "[fd91:462b:4243:47e::1:1]:6789/0",
[mon01][DEBUG ] "rank": 0
[mon01][DEBUG ]   },
[mon01][DEBUG ]   {
[mon01][DEBUG ] "addr": "0.0.0.0:0/1",
[mon01][DEBUG ] "name": "mon02",
[mon01][DEBUG ] "public_addr": "0.0.0.0:0/1",
[mon01][DEBUG ] "rank": 1
[mon01][DEBUG ]   },
[mon01][DEBUG ]   {
[mon01][DEBUG ] "addr": "0.0.0.0:0/2",
[mon01][DEBUG ] "name": "mon03",
[mon01][DEBUG ] "public_addr": "0.0.0.0:0/2",
[mon01][DEBUG ] "rank": 2
[mon01][DEBUG ]   }
[mon01][DEBUG ] ]


DNS is working fine and the hostnames are also listed in /etc/hosts.
I already purged the mon but still the same problem.

- Kevin


2018-02-23 10:26 GMT+01:00 Kevin Olbrich :

> Hi!
>
> On a new cluster, I get the following error. All 3x mons are connected to
> the same switch and ping between them works (firewalls disabled).
> Mon-nodes are Ubuntu 16.04 LTS on Cep Luminous.
>
>
> [ceph_deploy.mon][ERROR ] Some monitors have still not reached quorum:
> [ceph_deploy.mon][ERROR ] mon03
> [ceph_deploy.mon][ERROR ] mon02
> [ceph_deploy.mon][ERROR ] mon01
>
>
> root@adminnode:~# cat ceph.conf
> [global]
> fsid = 2689defb-8715-47bb-8d78-e862089adf7a
> ms_bind_ipv6 = true
> mon_initial_members = mon01, mon02, mon03
> mon_host = [fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[
> fd91:462b:4243:47e::1:3]
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = fdd1:ecbd:731f:ee8e::/64
> cluster network = fd91:462b:4243:47e::/64
>
>
> root@mon01:~# ip a
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group
> default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0:  mtu 9000 qdisc pfifo_fast
> state UP group default qlen 1000
> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
> inet 172.17.1.1/16 brd 172.17.255.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet6 fd91:462b:4243:47e::1:1/64 scope global
>valid_lft forever preferred_lft forever
> inet6 fe80::baae:edff:fee9:b661/64 scope link
>valid_lft forever preferred_lft forever
> 3: wlan0:  mtu 1500 qdisc noop state DOWN group
> default qlen 1000
> link/ether 00:db:df:64:34:d5 brd ff:ff:ff:ff:ff:ff
> 4: eth0.22@eth0:  mtu 9000 qdisc noqueue
> state UP group default qlen 1000
> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
> inet6 fdd1:ecbd:731f:ee8e::1:1/64 scope global
>valid_lft forever preferred_lft forever
> inet6 fe80::baae:edff:fee9:b661/64 scope link
>valid_lft forever preferred_lft forever
>
>
> Don't mind wlan0, thats because this node is built from an Intel NUC.
>
> Any idea?
>
> Kind regards
> Kevin
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some monitors have still not reached quorum

2018-02-23 Thread Kevin Olbrich
I found a fix: it is *mandatory* to set the public network to the same
network the mons use.
If you skip this while the mon host has another network interface, garbage
addresses end up in the monmap.
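
(For reference, a minimal sketch of the corrected [global] section, assuming the
mons should stay on the fd91:... subnet; the essential point is that "public
network" now covers the mon_host addresses. Whether to keep a separate cluster
network on the other subnet is an assumption here, and only OSD nodes need an
address on it:)

[global]
mon_host = [fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[fd91:462b:4243:47e::1:3]
public network = fd91:462b:4243:47e::/64
# optional: keep the other subnet as the OSD-to-OSD cluster network
cluster network = fdd1:ecbd:731f:ee8e::/64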

- Kevin

2018-02-23 11:38 GMT+01:00 Kevin Olbrich :

> I always see this:
>
> [mon01][DEBUG ] "mons": [
> [mon01][DEBUG ]   {
> [mon01][DEBUG ] "addr": "[fd91:462b:4243:47e::1:1]:6789/0",
> [mon01][DEBUG ] "name": "mon01",
> [mon01][DEBUG ] "public_addr": "[fd91:462b:4243:47e::1:1]:6789/0",
> [mon01][DEBUG ] "rank": 0
> [mon01][DEBUG ]   },
> [mon01][DEBUG ]   {
> [mon01][DEBUG ] "addr": "0.0.0.0:0/1",
> [mon01][DEBUG ] "name": "mon02",
> [mon01][DEBUG ] "public_addr": "0.0.0.0:0/1",
> [mon01][DEBUG ] "rank": 1
> [mon01][DEBUG ]   },
> [mon01][DEBUG ]   {
> [mon01][DEBUG ] "addr": "0.0.0.0:0/2",
> [mon01][DEBUG ] "name": "mon03",
> [mon01][DEBUG ] "public_addr": "0.0.0.0:0/2",
> [mon01][DEBUG ] "rank": 2
> [mon01][DEBUG ]   }
> [mon01][DEBUG ] ]
>
>
> DNS is working fine and the hostnames are also listed in /etc/hosts.
> I already purged the mon but still the same problem.
>
> - Kevin
>
>
> 2018-02-23 10:26 GMT+01:00 Kevin Olbrich :
>
>> Hi!
>>
>> On a new cluster, I get the following error. All 3x mons are connected to
>> the same switch and ping between them works (firewalls disabled).
>> Mon-nodes are Ubuntu 16.04 LTS on Cep Luminous.
>>
>>
>> [ceph_deploy.mon][ERROR ] Some monitors have still not reached quorum:
>> [ceph_deploy.mon][ERROR ] mon03
>> [ceph_deploy.mon][ERROR ] mon02
>> [ceph_deploy.mon][ERROR ] mon01
>>
>>
>> root@adminnode:~# cat ceph.conf
>> [global]
>> fsid = 2689defb-8715-47bb-8d78-e862089adf7a
>> ms_bind_ipv6 = true
>> mon_initial_members = mon01, mon02, mon03
>> mon_host = [fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[fd91:
>> 462b:4243:47e::1:3]
>> auth_cluster_required = cephx
>> auth_service_required = cephx
>> auth_client_required = cephx
>> public network = fdd1:ecbd:731f:ee8e::/64
>> cluster network = fd91:462b:4243:47e::/64
>>
>>
>> root@mon01:~# ip a
>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group
>> default qlen 1000
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> inet 127.0.0.1/8 scope host lo
>>valid_lft forever preferred_lft forever
>> inet6 ::1/128 scope host
>>valid_lft forever preferred_lft forever
>> 2: eth0:  mtu 9000 qdisc pfifo_fast
>> state UP group default qlen 1000
>> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
>> inet 172.17.1.1/16 brd 172.17.255.255 scope global eth0
>>valid_lft forever preferred_lft forever
>> inet6 fd91:462b:4243:47e::1:1/64 scope global
>>valid_lft forever preferred_lft forever
>> inet6 fe80::baae:edff:fee9:b661/64 scope link
>>valid_lft forever preferred_lft forever
>> 3: wlan0:  mtu 1500 qdisc noop state DOWN group
>> default qlen 1000
>> link/ether 00:db:df:64:34:d5 brd ff:ff:ff:ff:ff:ff
>> 4: eth0.22@eth0:  mtu 9000 qdisc
>> noqueue state UP group default qlen 1000
>> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
>> inet6 fdd1:ecbd:731f:ee8e::1:1/64 scope global
>>valid_lft forever preferred_lft forever
>> inet6 fe80::baae:edff:fee9:b661/64 scope link
>>valid_lft forever preferred_lft forever
>>
>>
>> Don't mind wlan0, thats because this node is built from an Intel NUC.
>>
>> Any idea?
>>
>> Kind regards
>> Kevin
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph scrub logs: _scan_snaps no head for $object?

2018-02-23 Thread Mehmet

Sage Wrote( Tue, 2 Jan 2018 17:57:32 + (UTC)):

Hi Stefan, Mehmet,



Hi Sage,
Sorry for the *extremely late* response!


Are these clusters that were upgraded from prior versions, or fresh
luminous installs?


My cluster was initially installed with Jewel (10.2.1), has seen some 
minor updates, and was finally upgraded from Jewel (10.2.10) to Luminous 
(12.2.1).


Actualy is installed:

- ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
luminous (stable)


I had a look at my logfiles and still have log entries like:

... .. .
2018-02-23 11:23:34.247878 7feaa2a2d700 -1 osd.59 pg_epoch: 36269 
pg[0.346( v 36269'30160204 (36269'30158634,36269'30160204] 
local-lis/les=36253/36254 n=12956 ec=141/141 lis/c 36253/36253 les/c/f 
36254/36264/0 36253/36253/36190) [4,59,23] r=1 lpr=36253 luod=0'0 
crt=36269'30160204 lcod 36269'30160203 active] _scan_snaps no head for 
0:62e347cd:::rbd_data.63efee238e1f29.038c:48 (have MIN)

... .. .

Do you need further information?
- Mehmet



This message indicates that there is a stray clone object with no
associated head or snapdir object.  That normally should never
happen--it's presumably the result of a (hopefully old) bug.  The scrub
process doesn't even clean them up, which maybe says something about 
how

common it is/was...

sage


On Sun, 24 Dec 2017, ceph@xx wrote:

> Hi Stefan,
>
> Am 14. Dezember 2017 09:48:36 MEZ schrieb Stefan Kooman :
> >Hi,
> >
> >We see the following in the logs after we start a scrub for some osds:
> >
> >ceph-osd.2.log:2017-12-14 06:50:47.180344 7f0f47db2700  0
> >log_channel(cluster) log [DBG] : 1.2d8 scrub starts
> >ceph-osd.2.log:2017-12-14 06:50:47.180915 7f0f47db2700 -1 osd.2
> >pg_epoch: 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209]
> >local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f
> > >11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733
> >crt=11890'165209 lcod 11890'165208 mlcod 11890'165208
> >active+clean+scrubbing] _scan_snaps no head for
> >1:1b518155:::rbd_data.620652ae8944a.0126:29 (have MIN)
> >ceph-osd.2.log:2017-12-14 06:50:47.180929 7f0f47db2700 -1 osd.2
> >pg_epoch: 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209]
> >local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f
> >11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733
> >crt=11890'165209 lcod 11890'165208 mlcod 11890'165208
> >active+clean+scrubbing] _scan_snaps no head for
> >1:1b518155:::rbd_data.620652ae8944a.0126:14 (have MIN)
> >ceph-osd.2.log:2017-12-14 06:50:47.180941 7f0f47db2700 -1 osd.2
> >pg_epoch: 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209]
> >local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f
> >11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733
> >crt=11890'165209 lcod 11890'165208 mlcod 11890'165208
> >active+clean+scrubbing] _scan_snaps no head for
> >1:1b518155:::rbd_data.620652ae8944a.0126:a (have MIN)
> >ceph-osd.2.log:2017-12-14 06:50:47.214198 7f0f43daa700  0
> >log_channel(cluster) log [DBG] : 1.2d8 scrub ok
> >
> >So finally it logs "scrub ok", but what does " _scan_snaps no head for
> >..." mean?
>
> I also see this lines in our Logfiles and am wonder  what this means.
>
> >Does this indicate a problem?
>
> I don't think so, because we actually don't have any issues.
>
> >
> >Ceph 12.2.2 with bluestore on lvm
>
> We using 12.2.2 with filestore on xfs.
>
> - Mehmet
> >
> >Gr. Stefan
> ___
> ceph-users mailing list

> ceph-users@xx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


___
ceph-users mailing list
ceph-users@xx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] erasure coding chunk distribution

2018-02-23 Thread Dennis Benndorf

Hi,

at the moment we use ceph with one big rbd pool and size=4 and use a 
rule to ensure that 2 copies are in each of our two rooms. This works 
great for VMs. But there is some big data which should be stored online 
but a bit cheaper. We think about using cephfs for it with erasure 
coding and k=4 and m=4. How would the placement of the chunks work, I 
mean is the above possible for erasure coding also?


In addition, is there anybody out there using cephfs with erasure coding 
in big scale (hundreds of TB) and who can tell me something about 
his/her experience on stability?


Thanks in advance,
Dennis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-23 Thread Behnam Loghmani
Finally, the problem was solved by replacing the whole hardware of the failing
server except the hard disks.

The last test I did before replacing the server was to cross-exchange the SSD
disks between the failing server (node A) and one of the healthy servers
(node B) and recreate the cluster.
In this test node A again failed with the "Corruption: block checksum mismatch"
error, so we figured out that there is something wrong with the board of node A
and that the disks are healthy.

In this scenario, various tests were done:
1- recreating OSDs
2- changing SSD disk
3- changing SATA port and cable
4- cross exchanging SSD disks

To those who helped me with this problem, I sincerely thank you so much.

Best regards,
Behnam Loghmani



On Thu, Feb 22, 2018 at 3:18 PM, David Turner  wrote:

> Did you remove and recreate the OSDs that used the SSD for their WAL/DB?
> Or did you try to do something to not have to do that?  That is an integral
> part of the OSD and changing the SSD would destroy the OSDs involved unless
> you attempted some sort of dd.  If you did that, then any corruption for
> the mon very well might still persist.
>
> On Thu, Feb 22, 2018 at 3:44 AM Behnam Loghmani 
> wrote:
>
>> Hi Brian,
>>
>> The issue started with failing mon service and after that both OSDs on
>> that node failed to start.
>> Mon service is on SSD disk and WAL/DB of OSDs on that SSD too with lvm.
>> I have changed SSD disk with new one, and changing SATA port and cable
>> but the problem is still remaining.
>> All disk tests are fine and disk doesn't have any error.
>>
>> On Wed, Feb 21, 2018 at 10:03 PM, Brian :  wrote:
>>
>>> Hello
>>>
>>> Wasn't this originally an issue with mon store now you are getting a
>>> checksum error from an OSD? I think some hardware here in this node is just
>>> hosed.
>>>
>>>
>>> On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani <
>>> behnam.loghm...@gmail.com> wrote:
>>>
 Hi there,

 I changed SATA port and cable of SSD disk and also update ceph to
 version 12.2.3 and rebuild OSDs
 but when recovery starts OSDs failed with this error:


 2018-02-21 21:12:18.037974 7f3479fe2d00 -1 
 bluestore(/var/lib/ceph/osd/ceph-7)
 _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
 expected 0xaf1040a2, device location [0x1~1000], logical extent
 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
 2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable
 to read osd superblock
 2018-02-21 21:12:18.038009 7f3479fe2d00  1 
 bluestore(/var/lib/ceph/osd/ceph-7)
 umount
 2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
 shutdown
 2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
 2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
 [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
 ceph-12.2.3/src/rocksdb/db/db_impl.cc:217] Shutdown: ca
 nceling all background work
 2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
 2018/02/21-21:12:18.041514) [/home/jenkins-build/build/
 workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
 AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/
 release/12.2.3/rpm/el7/BUILD/ceph-12.
 2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
 level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
 MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
 0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
 hutdown in progress: Database shutdown or Column
 2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
 2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
 "job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
 "output_level": 1, "num_output_files": 1, "total_ou
 tput_size": 902552, "num_input_records": 4470, "num_output_records":
 4377, "num_subcompactions": 1, "num_single_delete_mismatches": 0,
 "num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0,
 0]}
 2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
 {"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
 "file_number": 249}
 2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
 [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
 ceph-12.2.3/src/rocksdb/db/db_impl.cc:343] Shutdown com
 plete
 2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
 2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
 shutdown
 2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
 shutdo

[ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-23 Thread Caspar Smit
Hi All,

What would be the proper way to preventively replace a DB/WAL SSD (when it
is nearing its DWPD/TBW limit and has not failed yet)?

It hosts DB partitions for 5 OSD's

Maybe something like:

1) ceph osd reweight 0 the 5 OSD's
2) let backfilling complete
3) destroy/remove the 5 OSD's
4) replace SSD
5) create 5 new OSD's with seperate DB partition on new SSD

When these 5 OSD's are big HDD's (8TB), a LOT of data has to be moved, so I
thought maybe the following would work:

1) ceph osd set noout
2) stop the 5 OSD's (systemctl stop)
3) 'dd' the old SSD to a new SSD of same or bigger size
4) remove the old SSD
5) start the 5 OSD's (systemctl start)
6) let backfilling/recovery complete (only delta data between OSD stop and
now)
7) ceph osd unset noout
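
(In shell form, the sequence above might look roughly like this on one node. A
sketch only, untested; the OSD ids and the device names /dev/sdx and /dev/sdy
are hypothetical, so verify them with lsblk and the block.db symlinks before
running anything like it:)

ceph osd set noout
systemctl stop ceph-osd@10 ceph-osd@11 ceph-osd@12 ceph-osd@13 ceph-osd@14
# clone the old DB SSD onto the new one
dd if=/dev/sdx of=/dev/sdy bs=4M conv=fsync status=progress
# check that each OSD's block.db symlink still resolves to a valid device
ls -l /var/lib/ceph/osd/ceph-*/block.db
systemctl start ceph-osd@10 ceph-osd@11 ceph-osd@12 ceph-osd@13 ceph-osd@14
ceph osd unset noout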

Would this be a viable method to replace a DB SSD? Is there any udev/serial
nr/uuid stuff preventing this from working?

Or is there another 'less hacky' way to replace a DB SSD without moving too
much data?

Kind regards,
Caspar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-23 Thread Nico Schottelius

A very interesting question, and I would add a follow-up question:

Is there an easy way to add an external DB/WAL device to an existing
OSD?

I suspect that it might be something along the lines of:

- stop the osd
- create a link in ...ceph/osd/ceph-XX/block.db to the target device
  (see the sketch below for what that layout looks like)
- (maybe run some kind of osd mkfs?)
- start the osd
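
(For illustration, an existing bluestore OSD that was created with a separate DB
device just has an extra block.db symlink next to "block" in its data directory,
and the devices carry bluestore labels. A quick way to look at both, with
hypothetical paths:)

ls -l /var/lib/ceph/osd/ceph-0/block /var/lib/ceph/osd/ceph-0/block.db
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0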

Has anyone done this so far or recommendations on how to do it?

Which also makes me wonder: what is actually the format of WAL and
BlockDB in bluestore? Is there any documentation available about it?

Best,

Nico


Caspar Smit  writes:

> Hi All,
>
> What would be the proper way to preventively replace a DB/WAL SSD (when it
> is nearing it's DWPD/TBW limit and not failed yet).
>
> It hosts DB partitions for 5 OSD's
>
> Maybe something like:
>
> 1) ceph osd reweight 0 the 5 OSD's
> 2) let backfilling complete
> 3) destroy/remove the 5 OSD's
> 4) replace SSD
> 5) create 5 new OSD's with seperate DB partition on new SSD
>
> When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved so i
> thought maybe the following would work:
>
> 1) ceph osd set noout
> 2) stop the 5 OSD's (systemctl stop)
> 3) 'dd' the old SSD to a new SSD of same or bigger size
> 4) remove the old SSD
> 5) start the 5 OSD's (systemctl start)
> 6) let backfilling/recovery complete (only delta data between OSD stop and
> now)
> 6) ceph osd unset noout
>
> Would this be a viable method to replace a DB SSD? Any udev/serial nr/uuid
> stuff preventing this to work?
>
> Or is there another 'less hacky' way to replace a DB SSD without moving too
> much data?
>
> Kind regards,
> Caspar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread Reed Dier
Below is ceph -s

>   cluster:
> id: {id}
> health: HEALTH_WARN
> noout flag(s) set
> 260610/1068004947 objects misplaced (0.024%)
> Degraded data redundancy: 23157232/1068004947 objects degraded 
> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
> 
>   services:
> mon: 3 daemons, quorum mon02,mon01,mon03
> mgr: mon03(active), standbys: mon02
> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>  flags noout
> 
>   data:
> pools:   5 pools, 5316 pgs
> objects: 339M objects, 46627 GB
> usage:   154 TB used, 108 TB / 262 TB avail
> pgs: 23157232/1068004947 objects degraded (2.168%)
>  260610/1068004947 objects misplaced (0.024%)
>  4984 active+clean
>  183  active+undersized+degraded+remapped+backfilling
>  145  active+undersized+degraded+remapped+backfill_wait
>  3active+remapped+backfill_wait
>  1active+remapped+backfilling
> 
>   io:
> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
> recovery: 37057 kB/s, 50 keys/s, 217 objects/s

Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
fs-metadata pool at 32 PG.

> Are you sure the recovery is actually going slower, or are the individual ops 
> larger or more expensive?

The objects should not vary wildly in size.
Even if they were differing in size, the SSDs are roughly idle in their current 
state of backfilling when examining wait in iotop, or atop, or sysstat/iostat.

This compares to when I was fully saturating the SATA backplane with over 
1000MB/s of writes to multiple disks when the backfills were going “full speed.”

Here is a breakdown of recovery io by pool:

> pool objects-ssd id 20
>   recovery io 6779 kB/s, 92 objects/s
>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
> 
> pool cephfs-hdd id 17
>   recovery io 40542 kB/s, 158 objects/s
>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr

So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client traffic 
at the moment, which seems conspicuous to me.

Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops, with 
one OSD occasionally spiking up to 300-500 for a few minutes. Stats being 
pulled by both local CollectD instances on each node, as well as the Influx 
plugin in MGR as we evaluate that against collectd.

Thanks,

Reed


> On Feb 22, 2018, at 6:21 PM, Gregory Farnum  wrote:
> 
> What's the output of "ceph -s" while this is happening?
> 
> Is there some identifiable difference between these two states, like you get 
> a lot of throughput on the data pools but then metadata recovery is slower?
> 
> Are you sure the recovery is actually going slower, or are the individual ops 
> larger or more expensive?
> 
> My WAG is that recovering the metadata pool, composed mostly of directories 
> stored in omap objects, is going much slower for some reason. You can adjust 
> the cost of those individual ops some by changing 
> osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure 
> which way you want to go or indeed if this has anything to do with the 
> problem you're seeing. (eg, it could be that reading out the omaps is 
> expensive, so you can get higher recovery op numbers by turning down the 
> number of entries per request, but not actually see faster backfilling 
> because you have to issue more requests.)
> -Greg
> 
> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier  > wrote:
> Hi all,
> 
> I am running into an odd situation that I cannot easily explain.
> I am currently in the midst of destroy and rebuild of OSDs from filestore to 
> bluestore.
> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
> 
> My path to replacing the OSDs is to set the noout, norecover, norebalance 
> flag, destroy the OSD, create the OSD back, (iterate n times, all within a 
> single failure domain), unset the flags, and let it go. It finishes, rinse, 
> repeat.
> 
> For the SSD OSDs, they are SATA SSDs (Samsung SM863a) , 10 to a node, with 2 
> NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions for 
> block.db (previously filestore journals).
> 2x10GbE networking between the nodes. SATA backplane caps out at around 10 
> Gb/s as its 2x 6 Gb/s controllers. Luminous 12.2.2.
> 
> When the flags are unset, recovery starts and I see a very large rush of 
> traffic, however, after the first machine completed, the performance tapered 
> off at a rapid pace and trickles. Comparatively, I’m getting 100-200 recovery 
> ops on 3 HDDs, backfilling from 21 other HDDs, where as I’m getting 150-250 
> recovery ops on 5 SSDs, ba

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread Reed Dier
Probably unrelated, but I do keep seeing this odd negative objects degraded 
message on the fs-metadata pool:

> pool fs-metadata-ssd id 16
>   -34/3 objects degraded (-1133.333%)
>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr

Don’t mean to clutter the ML/thread, however it did seem odd, maybe its a 
culprit? Maybe its some weird sampling interval issue thats been solved in 
12.2.3?

Thanks,

Reed


> On Feb 23, 2018, at 8:26 AM, Reed Dier  wrote:
> 
> Below is ceph -s
> 
>>   cluster:
>> id: {id}
>> health: HEALTH_WARN
>> noout flag(s) set
>> 260610/1068004947 objects misplaced (0.024%)
>> Degraded data redundancy: 23157232/1068004947 objects degraded 
>> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>> 
>>   services:
>> mon: 3 daemons, quorum mon02,mon01,mon03
>> mgr: mon03(active), standbys: mon02
>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>  flags noout
>> 
>>   data:
>> pools:   5 pools, 5316 pgs
>> objects: 339M objects, 46627 GB
>> usage:   154 TB used, 108 TB / 262 TB avail
>> pgs: 23157232/1068004947 objects degraded (2.168%)
>>  260610/1068004947 objects misplaced (0.024%)
>>  4984 active+clean
>>  183  active+undersized+degraded+remapped+backfilling
>>  145  active+undersized+degraded+remapped+backfill_wait
>>  3active+remapped+backfill_wait
>>  1active+remapped+backfilling
>> 
>>   io:
>> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
> 
> Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
> fs-metadata pool at 32 PG.
> 
>> Are you sure the recovery is actually going slower, or are the individual 
>> ops larger or more expensive?
> 
> The objects should not vary wildly in size.
> Even if they were differing in size, the SSDs are roughly idle in their 
> current state of backfilling when examining wait in iotop, or atop, or 
> sysstat/iostat.
> 
> This compares to when I was fully saturating the SATA backplane with over 
> 1000MB/s of writes to multiple disks when the backfills were going “full 
> speed.”
> 
> Here is a breakdown of recovery io by pool:
> 
>> pool objects-ssd id 20
>>   recovery io 6779 kB/s, 92 objects/s
>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>> 
>> pool cephfs-hdd id 17
>>   recovery io 40542 kB/s, 158 objects/s
>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
> 
> So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client 
> traffic at the moment, which seems conspicuous to me.
> 
> Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops, with 
> one OSD occasionally spiking up to 300-500 for a few minutes. Stats being 
> pulled by both local CollectD instances on each node, as well as the Influx 
> plugin in MGR as we evaluate that against collectd.
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum > > wrote:
>> 
>> What's the output of "ceph -s" while this is happening?
>> 
>> Is there some identifiable difference between these two states, like you get 
>> a lot of throughput on the data pools but then metadata recovery is slower?
>> 
>> Are you sure the recovery is actually going slower, or are the individual 
>> ops larger or more expensive?
>> 
>> My WAG is that recovering the metadata pool, composed mostly of directories 
>> stored in omap objects, is going much slower for some reason. You can adjust 
>> the cost of those individual ops some by changing 
>> osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure 
>> which way you want to go or indeed if this has anything to do with the 
>> problem you're seeing. (eg, it could be that reading out the omaps is 
>> expensive, so you can get higher recovery op numbers by turning down the 
>> number of entries per request, but not actually see faster backfilling 
>> because you have to issue more requests.)
>> -Greg
>> 
>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier > > wrote:
>> Hi all,
>> 
>> I am running into an odd situation that I cannot easily explain.
>> I am currently in the midst of destroy and rebuild of OSDs from filestore to 
>> bluestore.
>> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
>> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>> 
>> My path to replacing the OSDs is to set the noout, norecover, norebalance 
>> flag, destroy the OSD, create the OSD back, (iterate n times, all within a 
>> single failure domain), unset the flags, and let it go. It finishes, rinse, 
>> r

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread David Turner
Here is a [1] link to a ML thread tracking some slow backfilling on
bluestore.  It came down to the backfill sleep setting for them.  Maybe it
will help.

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html
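
(If you want to rule that out, a minimal sketch for inspecting the Luminous-era
sleep knobs on a running OSD and lowering them at runtime via injectargs; option
names and defaults can differ between releases, and any injectargs change should
be reverted once testing is done:)

ceph daemon osd.0 config show | grep osd_recovery_sleep
# if an OSD class is being throttled harder than expected, adjust at runtime:
ceph tell osd.* injectargs '--osd_recovery_sleep_ssd 0'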

On Fri, Feb 23, 2018 at 10:46 AM Reed Dier  wrote:

> Probably unrelated, but I do keep seeing this odd negative objects
> degraded message on the fs-metadata pool:
>
> pool fs-metadata-ssd id 16
>   -34/3 objects degraded (-1133.333%)
>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
>
>
> Don’t mean to clutter the ML/thread, however it did seem odd, maybe its a
> culprit? Maybe its some weird sampling interval issue thats been solved in
> 12.2.3?
>
> Thanks,
>
> Reed
>
>
> On Feb 23, 2018, at 8:26 AM, Reed Dier  wrote:
>
> Below is ceph -s
>
>   cluster:
> id: {id}
> health: HEALTH_WARN
> noout flag(s) set
> 260610/1068004947 objects misplaced (0.024%)
> Degraded data redundancy: 23157232/1068004947 objects degraded
> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>
>   services:
> mon: 3 daemons, quorum mon02,mon01,mon03
> mgr: mon03(active), standbys: mon02
> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>  flags noout
>
>   data:
> pools:   5 pools, 5316 pgs
> objects: 339M objects, 46627 GB
> usage:   154 TB used, 108 TB / 262 TB avail
> pgs: 23157232/1068004947 objects degraded (2.168%)
>  260610/1068004947 objects misplaced (0.024%)
>  4984 active+clean
>  183  active+undersized+degraded+remapped+backfilling
>  145  active+undersized+degraded+remapped+backfill_wait
>  3active+remapped+backfill_wait
>  1active+remapped+backfilling
>
>   io:
> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>
>
> Also the two pools on the SSDs, are the objects pool at 4096 PG, and the
> fs-metadata pool at 32 PG.
>
> Are you sure the recovery is actually going slower, or are the individual
> ops larger or more expensive?
>
> The objects should not vary wildly in size.
> Even if they were differing in size, the SSDs are roughly idle in their
> current state of backfilling when examining wait in iotop, or atop, or
> sysstat/iostat.
>
> This compares to when I was fully saturating the SATA backplane with over
> 1000MB/s of writes to multiple disks when the backfills were going “full
> speed.”
>
> Here is a breakdown of recovery io by pool:
>
> pool objects-ssd id 20
>   recovery io 6779 kB/s, 92 objects/s
>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>
> pool cephfs-hdd id 17
>   recovery io 40542 kB/s, 158 objects/s
>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
>
>
> So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client
> traffic at the moment, which seems conspicuous to me.
>
> Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops,
> with one OSD occasionally spiking up to 300-500 for a few minutes. Stats
> being pulled by both local CollectD instances on each node, as well as the
> Influx plugin in MGR as we evaluate that against collectd.
>
> Thanks,
>
> Reed
>
>
> On Feb 22, 2018, at 6:21 PM, Gregory Farnum  wrote:
>
> What's the output of "ceph -s" while this is happening?
>
> Is there some identifiable difference between these two states, like you
> get a lot of throughput on the data pools but then metadata recovery is
> slower?
>
> Are you sure the recovery is actually going slower, or are the individual
> ops larger or more expensive?
>
> My WAG is that recovering the metadata pool, composed mostly of
> directories stored in omap objects, is going much slower for some reason.
> You can adjust the cost of those individual ops some by
> changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm
> not sure which way you want to go or indeed if this has anything to do with
> the problem you're seeing. (eg, it could be that reading out the omaps is
> expensive, so you can get higher recovery op numbers by turning down the
> number of entries per request, but not actually see faster backfilling
> because you have to issue more requests.)
> -Greg
>
> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier  wrote:
>
>> Hi all,
>>
>> I am running into an odd situation that I cannot easily explain.
>> I am currently in the midst of destroy and rebuild of OSDs from filestore
>> to bluestore.
>> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing
>> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>>
>> My path to replacing the OSDs is to set the noout, norecover, norebalance
>> flag, destroy the OSD, create 

Re: [ceph-users] erasure coding chunk distribution

2018-02-23 Thread Gregory Farnum
On Fri, Feb 23, 2018 at 5:05 AM Dennis Benndorf <
dennis.bennd...@googlemail.com> wrote:

> Hi,
>
> at the moment we use ceph with one big rbd pool and size=4 and use a
> rule to ensure that 2 copies are in each of our two rooms. This works
> great for VMs. But there is some big data which should be stored online
> but a bit cheaper. We think about using cephfs for it with erasure
> coding and k=4 and m=4. How would the placement of the chunks work, I
> mean is the above possible for erasure coding also?
>

Yes. It's mostly the same procedure as I imagine you used on your
replicated pool. Except you'd use "indep" instead of "rep" and you'd be
choosing 4 leafs out of each DC instead of 2.
-Greg
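
(A minimal sketch of what such an EC rule could look like, assuming a bucket
type named "room" as in the replicated setup and k=4, m=4; the id, the size
bounds and the retry tunables are placeholders to adapt:)

rule cephfs_ec_two_rooms {
        id 2
        type erasure
        min_size 8
        max_size 8
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 2 type room
        step chooseleaf indep 4 type host
        step emit
}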


>
> In addition, is there anybody out there using cephfs with erasure coding
> in big scale (hundreds of TB) and who can tell me something about
> his/her experience on stability?
>
> Thanks in advance,
> Dennis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balanced MDS, all as active and recommended client settings.

2018-02-23 Thread Patrick Donnelly
On Fri, Feb 23, 2018 at 12:54 AM, Daniel Carrasco  wrote:
>  client_permissions = false

Yes, this will potentially reduce checks against the MDS.

>   client_quota = false

This option no longer exists since Luminous; quota enforcement is no
longer optional. However, if you don't have any quotas then there is
no added load on the client/mds.

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Install previous version of Ceph

2018-02-23 Thread Scottix
Hey,
We had one of our monitor servers die on us, and I have a replacement
machine now. In the meantime you have released 12.2.3, but we are
still on 12.2.2.

We are on Ubuntu servers

I see all the binaries are in the repo, but your package cache only shows
12.2.3. Is there a reason for not keeping the previous builds available, for
cases like mine?

I could do an install like
apt install ceph-mon=12.2.2
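
(One way this could be done is with an apt pin, sketched below. The exact version
strings available can be checked first with "apt-cache madison ceph-mon", and the
glob only covers packages whose names start with "ceph", so companion libraries
such as librados2 may need the same treatment:)

cat > /etc/apt/preferences.d/ceph-12.2.2 <<'EOF'
Package: ceph*
Pin: version 12.2.2*
Pin-Priority: 1001
EOF
apt update
apt install ceph-mon ceph-common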

Also, how would I go about installing 12.2.2 in my scenario, since I don't
want to update until I have this monitor running again?

Thanks,
Scott
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] wrong stretch package dependencies (was Luminous v12.2.3 released)

2018-02-23 Thread kefu chai
i am starting a new thread. replied inlined.

On Thu, Feb 22, 2018 at 9:24 PM, Micha Krause  wrote:
> Hi,
>
> Debian Packages for stretch have broken dependencies:
>
> The following packages have unmet dependencies:
>  ceph-common : Depends: libleveldb1 but it is not installable
>Depends: libsnappy1 but it is not installable
>  ceph-mon : Depends: libleveldb1 but it is not installable
> Depends: libsnappy1 but it is not installable
>  ceph-osd : Depends: libleveldb1 but it is not installable
> Depends: libsnappy1 but it is not installable
>  ceph-base : Depends: libleveldb1 but it is not installable
>  Depends: libsnappy1 but it is not installable
>
>
> https://packages.debian.org/search?keywords=libsnappy1&searchon=names&suite=all§ion=all
>
> https://packages.debian.org/search?suite=all§ion=all&arch=any&searchon=names&keywords=libleveldb1
>
> They should Probably depend on the 1v5 packages, and they did in version
> 12.2.2.

agreed. but the packages built for stretch do depend on the library
packages with "v5" suffix [0].


$ wget --quiet 
https://download.ceph.com/debian-luminous/dists/stretch/main/binary-amd64/Packages.bz2
-O - | bzcat | grep libsnappy

Depends: binutils, ceph-common (= 12.2.3-1~bpo90+1), cryptsetup-bin |
cryptsetup, debianutils, findutils, gdisk, grep, logrotate, psmisc,
xfsprogs, python-pkg-resources, python2.7:any, python:any (<< 2.8),
python:any (>= 2.7.5-5~), libaio1 (>= 0.3.93), libblkid1 (>= 2.16),
libc6 (>= 2.16), libfuse2 (>= 2.2), libgcc1 (>= 1:3.0),
libgoogle-perftools4, libibverbs1 (>= 1.1.6), libleveldb1v5, libnspr4
(>= 2:4.9-2~), libnss3 (>= 2:3.13.4-2~), librados2, libsnappy1v5,
libstdc++6 (>= 6), zlib1g (>= 1:1.1.4)
Depends: librbd1 (= 12.2.3-1~bpo90+1), python-cephfs (=
12.2.3-1~bpo90+1), python-prettytable, python-rados (=
12.2.3-1~bpo90+1), python-rbd (= 12.2.3-1~bpo90+1), python-requests,
python-rgw (= 12.2.3-1~bpo90+1), init-system-helpers (>= 1.18~),
python2.7:any, python:any (<< 2.8), python:any (>= 2.7.5-5~), libaio1
(>= 0.3.9), libbabeltrace-ctf1 (>= 1.2.1), libbabeltrace1 (>= 1.2.1),
libblkid1 (>= 2.17.2), libc6 (>= 2.16), libcephfs2, libcurl3-gnutls
(>= 7.28.0), libexpat1 (>= 2.0.1), libfuse2 (>= 2.2), libgcc1 (>=
1:3.0), libgoogle-perftools4, libibverbs1 (>= 1.1.6), libkeyutils1 (>=
1.4), libldap-2.4-2 (>= 2.4.7), libleveldb1v5, libnspr4 (>= 2:4.9-2~),
libnss3 (>= 2:3.13.4-2~), librados2, libradosstriper1, libsnappy1v5,
libstdc++6 (>= 6), libudev1 (>= 183), zlib1g (>= 1:1.1.4)
Depends: ceph-base (= 12.2.3-1~bpo90+1), python-flask,
init-system-helpers (>= 1.18~), libaio1 (>= 0.3.9), libblkid1 (>=
2.16), libc6 (>= 2.16), libfuse2 (>= 2.2), libgcc1 (>= 1:3.0),
libgoogle-perftools4, libibverbs1 (>= 1.1.6), libleveldb1v5, libnspr4
(>= 2:4.9-2~), libnss3 (>= 2:3.13.4-2~), librados2, libsnappy1v5,
libstdc++6 (>= 6), zlib1g (>= 1:1.1.4)
Depends: ceph-base (= 12.2.3-1~bpo90+1), parted, lvm2,
init-system-helpers (>= 1.18~), python-pkg-resources, python2.7:any,
python:any (<< 2.8), python:any (>= 2.7.5-5~), libaio1 (>= 0.3.93),
libblkid1 (>= 2.17.2), libc6 (>= 2.16), libfuse2 (>= 2.8), libgcc1 (>=
1:3.0), libgoogle-perftools4, libibverbs1 (>= 1.1.6), libleveldb1v5,
liblttng-ust0 (>= 2.5.0), libnspr4 (>= 2:4.9-2~), libnss3 (>=
2:3.13.4-2~), librados2, libsnappy1v5, libstdc++6 (>= 6), zlib1g (>=
1:1.1.4)
Depends: ceph-common, curl, jq, socat, xmlstarlet, libaio1 (>=
0.3.93), libblkid1 (>= 2.17.2), libc6 (>= 2.16), libcephfs2,
libcurl3-gnutls (>= 7.28.0), libexpat1 (>= 2.0.1), libfuse2 (>= 2.2),
libgcc1 (>= 1:3.0), libgoogle-perftools4, libibverbs1 (>= 1.1.6),
libkeyutils1 (>= 1.4), libldap-2.4-2 (>= 2.4.7), libleveldb1v5,
libnspr4 (>= 2:4.9-2~), libnss3 (>= 2:3.13.4-2~), librados2,
libradosstriper1, librbd1, libsnappy1v5, libstdc++6 (>= 6), libudev1
(>= 183), zlib1g (>= 1:1.1.4)

i also downloaded ceph-osd_12.2.3-1~bpo90+1_amd64.deb to check its dependencies:

$ wget 
https://download.ceph.com/debian-luminous/pool/main/c/ceph/ceph-osd_12.2.3-1~bpo90%2B1_amd64.deb
$ dpkg-deb -R ceph-osd_12.2.3-1~bpo90+1_amd64.deb ceph-osd
$ grep libsnappy ceph-osd/DEBIAN/control
Depends: ceph-base (= 12.2.3-1~bpo90+1), parted, lvm2,
init-system-helpers (>= 1.18~), python-pkg-resources, python2.7:any,
python:any (<< 2.8), python:any (>= 2.7.5-5~), libaio1 (>= 0.3.93),
libblkid1 (>= 2.17.2), libc6 (>= 2.16), libfuse2 (>= 2.8), libgcc1 (>=
1:3.0), libgoogle-perftools4, libibverbs1 (>= 1.1.6), libleveldb1v5,
liblttng-ust0 (>= 2.5.0), libnspr4 (>= 2:4.9-2~), libnss3 (>=
2:3.13.4-2~), librados2, libsnappy1v5, libstdc++6 (>= 6), zlib1g (>=
1:1.1.4)

so, which package were you trying to install? and from where did you
download it?

--
[0] https://wiki.debian.org/GCC5

-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some monitors have still not reached quorum

2018-02-23 Thread David Turner
The only communication on the cluster network is between OSDs. All other
traffic for clients, mons, MDS, etc. is on the public network. The cluster
network is what the OSDs use to backfill, recover, send replica copies of
your data to the secondary OSDs, read parts of EC objects before the
primary sends them to the client, scrub, and anything else where OSDs talk
to each other. No node other than the nodes with OSDs in them needs, or
should, have an IP on the cluster network.

On Fri, Feb 23, 2018, 6:43 AM Kevin Olbrich  wrote:

> I found a fix: It is *mandatory *to set the public network to the same
> network the mons use.
> Skipping this while the mon has another network interface, saves garbage
> to the monmap.
>
> - Kevin
>
> 2018-02-23 11:38 GMT+01:00 Kevin Olbrich :
>
>> I always see this:
>>
>> [mon01][DEBUG ] "mons": [
>> [mon01][DEBUG ]   {
>> [mon01][DEBUG ] "addr": "[fd91:462b:4243:47e::1:1]:6789/0",
>> [mon01][DEBUG ] "name": "mon01",
>> [mon01][DEBUG ] "public_addr": "[fd91:462b:4243:47e::1:1]:6789/0",
>> [mon01][DEBUG ] "rank": 0
>> [mon01][DEBUG ]   },
>> [mon01][DEBUG ]   {
>> [mon01][DEBUG ] "addr": "0.0.0.0:0/1",
>> [mon01][DEBUG ] "name": "mon02",
>> [mon01][DEBUG ] "public_addr": "0.0.0.0:0/1",
>> [mon01][DEBUG ] "rank": 1
>> [mon01][DEBUG ]   },
>> [mon01][DEBUG ]   {
>> [mon01][DEBUG ] "addr": "0.0.0.0:0/2",
>> [mon01][DEBUG ] "name": "mon03",
>> [mon01][DEBUG ] "public_addr": "0.0.0.0:0/2",
>> [mon01][DEBUG ] "rank": 2
>> [mon01][DEBUG ]   }
>> [mon01][DEBUG ] ]
>>
>>
>> DNS is working fine and the hostnames are also listed in /etc/hosts.
>> I already purged the mon but still the same problem.
>>
>> - Kevin
>>
>>
>> 2018-02-23 10:26 GMT+01:00 Kevin Olbrich :
>>
>>> Hi!
>>>
>>> On a new cluster, I get the following error. All 3x mons are connected
>>> to the same switch and ping between them works (firewalls disabled).
>>> Mon-nodes are Ubuntu 16.04 LTS on Cep Luminous.
>>>
>>>
>>> [ceph_deploy.mon][ERROR ] Some monitors have still not reached quorum:
>>> [ceph_deploy.mon][ERROR ] mon03
>>> [ceph_deploy.mon][ERROR ] mon02
>>> [ceph_deploy.mon][ERROR ] mon01
>>>
>>>
>>> root@adminnode:~# cat ceph.conf
>>> [global]
>>> fsid = 2689defb-8715-47bb-8d78-e862089adf7a
>>> ms_bind_ipv6 = true
>>> mon_initial_members = mon01, mon02, mon03
>>> mon_host =
>>> [fd91:462b:4243:47e::1:1],[fd91:462b:4243:47e::1:2],[fd91:462b:4243:47e::1:3]
>>> auth_cluster_required = cephx
>>> auth_service_required = cephx
>>> auth_client_required = cephx
>>> public network = fdd1:ecbd:731f:ee8e::/64
>>> cluster network = fd91:462b:4243:47e::/64
>>>
>>>
>>> root@mon01:~# ip a
>>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
>>> group default qlen 1000
>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>> inet 127.0.0.1/8 scope host lo
>>>valid_lft forever preferred_lft forever
>>> inet6 ::1/128 scope host
>>>valid_lft forever preferred_lft forever
>>> 2: eth0:  mtu 9000 qdisc pfifo_fast
>>> state UP group default qlen 1000
>>> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
>>> inet 172.17.1.1/16 brd 172.17.255.255 scope global eth0
>>>valid_lft forever preferred_lft forever
>>> inet6 fd91:462b:4243:47e::1:1/64 scope global
>>>valid_lft forever preferred_lft forever
>>> inet6 fe80::baae:edff:fee9:b661/64 scope link
>>>valid_lft forever preferred_lft forever
>>> 3: wlan0:  mtu 1500 qdisc noop state DOWN group
>>> default qlen 1000
>>> link/ether 00:db:df:64:34:d5 brd ff:ff:ff:ff:ff:ff
>>> 4: eth0.22@eth0:  mtu 9000 qdisc
>>> noqueue state UP group default qlen 1000
>>> link/ether b8:ae:ed:e9:b6:61 brd ff:ff:ff:ff:ff:ff
>>> inet6 fdd1:ecbd:731f:ee8e::1:1/64 scope global
>>>valid_lft forever preferred_lft forever
>>> inet6 fe80::baae:edff:fee9:b661/64 scope link
>>>valid_lft forever preferred_lft forever
>>>
>>>
>>> Don't mind wlan0, that's because this node is built on an Intel NUC.
>>>
>>> Any idea?
>>>
>>> Kind regards
>>> Kevin
>>>
>>
>>


Re: [ceph-users] Ceph Bluestore performance question

2018-02-23 Thread David Turner
Your 6.7 GB DB partition for each 4 TB OSD is on the very small side of
things. It's been discussed a few times on the ML, and the general guidance
seems to be about 10 GB of DB per 1 TB of OSD. That would be roughly a 40 GB
DB partition for each of your OSDs. This general rule covers most things
except for RGW and CephFS with millions and millions of small files. The DB
grows with the number of objects as well as with how much the objects are
modified.

With a DB partition of 6.7 GB for these OSDs, you will likely be spilling
over from the SSD into the HDD for your DB by the time your cluster is 25%
full. There is no hard and fast rule for how big to make the DB partitions,
but you should watch closely how big your DBs grow as your cluster fills up.
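
One way to watch that (a rough sketch; osd.0 is a placeholder ID, and the counter names are the BlueFS perf counters as I remember them from Luminous, so verify them on your version) is to query the OSD's admin socket on the OSD host and check whether slow_used_bytes starts growing, which would mean the DB has spilled over onto the HDD:

  # run on the host that carries the OSD
  ceph daemon osd.0 perf dump bluefs | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'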

On Thu, Feb 22, 2018, 7:08 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Hi Vadim,
>
> many thanks for these benchmark results!
>
> This indeed looks extremely similar to what we achieve after enabling
> connected mode.
>
> Our 6 OSD-hosts are Supermicro systems with 2 HDDs (Raid 1) for the OS,
> and 32 HDDs (4 TB) + 2 SSDs for the OSDs.
> The 2 SSDs have 16 LVM volumes each (which have ~ 6.7 GB each) to contain
> the Bluestore BlockDB for the 32 OSDs.
>
> So in our case, we have 32 OSDs "behind" one IPoIB link, and the link is
> clearly defining the limit.
> Also we are running an EC pool, so inter-OSD traffic for any read and
> write operation is heavy.
> If I perform a test with "iperf -d" (i.e. send and receive in parallel), I
> sadly note that the observed limit of ~20 Gbit/s, which you also get,
> applies to the sum of both directions.
>
> My expectation is that for you, too, the limit might be set by the IPoIB
> link speed - the disks could probably go much faster, especially if you
> switch to Bluestore.
>
> Our workload, by the way, is also HPC - or maybe rather, HTC (High
> Throughput Computing), but luckily our users are used to a significantly
> slower filesystem from the old cluster and will likely not make use of the
> throughput we can already achieve with IPoIB.
>
> Many thanks again for sharing your benchmarks!
>
> Cheers,
> Oliver
>
> Am 22.02.2018 um 13:15 schrieb Vadim Bulst:
> > Hi Oliver,
> >
> > i also use Infiniband and Cephfs for HPC purposes.
> >
> > My setup:
> >
> >   * 4x Dell R730xd and expansion shelf, 24 OSD à 8TB, 128GB Ram,
> 2x10Core Intel 4th Gen, Mellanox ConnectX-3, no SSD-Cache
> >
> >   * 7x Dell R630 Clients
> >
> >   * Ceph-Cluster running on Ubuntu Xenial and Ceph Jewel deployed with
> Ceph-Ansible
> >   * Cephfs-Clients on Debian Stretch and Cephfs kernel module
> >
> >   * IPoIB for public and cluster network, IB adapters are in connected
> > mode and MTU is 65520
> >
> >
> > Future improvements: moving the cephfs_metadata pool to an NVMe pool,
> > updating to Luminous and Bluestore
> >
> > root@polstor02:/home/urzadmin# ceph -s
> > cluster 7c4bfd06-046f-49e4-bb77-0402d7ca98e5
> >  health HEALTH_OK
> >  monmap e2: 3 mons at {polstor01=
> 10.10.144.211:6789/0,polstor02=10.10.144.212:6789/0,polstor03=10.10.144.213:6789/0
> }
> > election epoch 5034, quorum 0,1,2
> polstor01,polstor02,polstor03
> >   fsmap e2091562: 1/1/1 up {0=polstor02=up:active}, 1
> up:standby-replay, 1 up:standby
> >  osdmap e2078945: 95 osds: 95 up, 95 in
> > flags sortbitwise,require_jewel_osds
> >   pgmap v8638409: 4224 pgs, 2 pools, 93414 GB data, 34592 kobjects
> > 274 TB used, 416 TB / 690 TB avail
> > 4221 active+clean
> >3 active+clean+scrubbing+deep
> >   client io 1658 B/s rd, 3 op/s rd, 0 op/s wr
> >
> >
> > These are my messurements:
> >
> > 
> > Server listening on TCP port 5001
> > TCP window size: 85.3 KByte (default)
> > 
> > [  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port
> 42584
> > [ ID] Interval   Transfer Bandwidth
> > [  4]  0.0-10.0 sec  27.2 GBytes  23.3 Gbits/sec
> > [  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port
> 42586
> > [  5]  0.0-10.0 sec  25.4 GBytes  21.8 Gbits/sec
> > [  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port
> 42588
> > [  4]  0.0-10.0 sec  19.9 GBytes  17.1 Gbits/sec
> > [  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port
> 42590
> > [  5]  0.0-10.0 sec  20.2 GBytes  17.3 Gbits/sec
> > [  4] local 10.10.144.213 port 5001 connected with 10.10.144.212 port
> 42592
> > [  4]  0.0-10.0 sec  30.2 GBytes  25.9 Gbits/sec
> > [  5] local 10.10.144.213 port 5001 connected with 10.10.144.212 port
> 42594
> > [  5]  0.0-10.0 sec  26.1 GBytes  22.4 Gbits/sec
> >
> > root@polstor02:/home/urzadmin# rados bench -p cephfs_data 10 write --no-cleanup -t 40
> > Maintaining 40 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
> > Object prefix: benchmar

Re: [ceph-users] Ceph auth caps - make it more user error proof

2018-02-23 Thread David Turner
+1 for this. I messed up a cap the same way on a cluster I was configuring.
Luckily it wasn't production and I could fix it quickly.
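
Until something like that exists, a defensive workflow is to dump the existing caps first and then re-issue all of them plus the new mon cap in a single command. A sketch only; the cap strings and pool names below are placeholders, not the ones from Enrico's cluster:

  # show the current caps so they can be copied into the new command
  ceph auth get client.cinder

  # ceph auth caps replaces ALL caps, so every section has to be repeated
  ceph auth caps client.cinder \
      mon 'allow r, allow command "osd blacklist"' \
      osd 'allow rwx pool=volumes, allow rwx pool=vms, allow rx pool=images'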

On Thu, Feb 22, 2018, 8:09 PM Gregory Farnum  wrote:

> On Wed, Feb 21, 2018 at 10:54 AM, Enrico Kern
>  wrote:
> > Hey all,
> >
> > i would suggest some changes to the ceph auth caps command.
> >
> > Today i almost fucked up half of one of our openstack regions with i/o
> > errors because of user failure.
> >
> > I tried to add osd blacklist caps to a cinder keyring after luminous
> > upgrade.
> >
> > I did so by issuing ceph auth caps client.cinder mon 'bla'
> >
> > Doing this, I forgot that it also wipes the other caps instead of only
> > updating the caps for mon, because you need to specify everything in one
> > line. The result was all of our VMs ending up with read-only filesystems
> > after a while because the OSD caps were gone.
> >
> > I suggest that if you only pass
> >
> > Ceph auth caps mon
> >
> > it should only update the caps for mon (or osd, etc.) and leave the others
> > untouched, or at least print some huge error message.
> >
> > I know it is more of a PEBKAC problem, but Ceph is doing great at being
> > idiot proof, and this would make it even more idiot proof ;)
>
> This sounds like a good idea to me! I created a ticket at
> http://tracker.ceph.com/issues/23096
> -Greg


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-23 Thread David Turner
Caspar, it looks like your idea should work. The worst case scenario seems
to be that the OSDs wouldn't start; you'd put the old SSD back in and fall
back to the plan of weighting them to 0, letting backfill finish, and then
recreating the OSDs. Definitely worth a try in my opinion, and I'd love to
hear about your experience afterwards.

Nico, it is not possible to change the WAL or DB size, location, etc. after
OSD creation. If you want to change the configuration of the OSD after
creation, you have to remove it from the cluster and recreate it. There is
no equivalent to the way you could move or recreate FileStore OSD journals.
I think this might be on the radar as a feature, but I don't know for
certain. I definitely consider it a regression in BlueStore.
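
For reference, the remove-and-recreate path looks roughly like this on Luminous with ceph-volume (a sketch under the assumption that osd.12, /dev/sdX and /dev/nvme0n1p1 stand in for the OSD ID, the data disk and the new DB partition):

  # drain the old OSD and wait for backfill to finish
  ceph osd out 12
  systemctl stop ceph-osd@12
  ceph osd purge 12 --yes-i-really-mean-it

  # recreate it with the DB on the new device
  ceph-volume lvm zap /dev/sdX
  ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1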



On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius 
wrote:

>
> A very interesting question and I would add the follow up question:
>
> Is there an easy way to add an external DB/WAL devices to an existing
> OSD?
>
> I suspect that it might be something on the lines of:
>
> - stop osd
> - create a link in ...ceph/osd/ceph-XX/block.db to the target device
> - (maybe run some kind of osd mkfs ?)
> - start osd
>
> Has anyone done this so far or recommendations on how to do it?
>
> Which also makes me wonder: what is actually the format of WAL and
> BlockDB in bluestore? Is there any documentation available about it?
>
> Best,
>
> Nico
>
>
> Caspar Smit  writes:
>
> > Hi All,
> >
> > What would be the proper way to preventively replace a DB/WAL SSD (when
> it
> > is nearing it's DWPD/TBW limit and not failed yet).
> >
> > It hosts DB partitions for 5 OSD's
> >
> > Maybe something like:
> >
> > 1) ceph osd reweight 0 the 5 OSD's
> > 2) let backfilling complete
> > 3) destroy/remove the 5 OSD's
> > 4) replace SSD
> > 5) create 5 new OSD's with seperate DB partition on new SSD
> >
> > When these 5 OSDs are big HDDs (8 TB), a LOT of data has to be moved, so I
> > thought maybe the following would work:
> >
> > 1) ceph osd set noout
> > 2) stop the 5 OSD's (systemctl stop)
> > 3) 'dd' the old SSD to a new SSD of same or bigger size
> > 4) remove the old SSD
> > 5) start the 5 OSD's (systemctl start)
> > 6) let backfilling/recovery complete (only delta data between OSD stop
> and
> > now)
> > 6) ceph osd unset noout
> >
> > Would this be a viable method to replace a DB SSD? Is there any
> > udev/serial number/UUID magic preventing this from working?
> >
> > Or is there another 'less hacky' way to replace a DB SSD without moving
> > too much data?
> >
> > Kind regards,
> > Caspar
>
>
> --
> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch


Re: [ceph-users] PG overdose protection causing PG unavailability

2018-02-23 Thread David Turner
There was another part to my suggestion, which was to set the initial crush
weight to 0 in ceph.conf. After you add all of your OSDs, you could download
the crush map, weight the new OSDs to what they should be, and upload the
crush map to give them all the ability to take PGs at the same time. With
this method you never have any OSDs on the host that can take PGs until all
of them can.
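
In config/command form that is roughly the following (a sketch; osd.42 and the weight of 3.64 for a 4 TB drive are placeholders, and reweighting each OSD from the CLI is an alternative to editing the crush map by hand):

  # ceph.conf on the OSD hosts, set before the OSDs are created
  [osd]
    osd crush initial weight = 0

  # once every OSD on the host exists, give them their real weights
  ceph osd crush reweight osd.42 3.64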

On Thu, Feb 22, 2018, 7:14 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 23.02.2018 um 01:05 schrieb Gregory Farnum:
> >
> >
> > On Wed, Feb 21, 2018 at 2:46 PM Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >
> > Dear Cephalopodians,
> >
> > in a Luminous 12.2.3 cluster with a pool with:
> > - 192 Bluestore OSDs total
> > - 6 hosts (32 OSDs per host)
> > - 2048 total PGs
> > - EC profile k=4, m=2
> > - CRUSH failure domain = host
> > which results in 2048*6/192 = 64 PGs per OSD on average, I run into
> issues with PG overdose protection.
> >
> > In case I reinstall one OSD host (zapping all disks), and recreate
> the OSDs one by one with ceph-volume,
> > they will usually come back "slowly", i.e. one after the other.
> >
> > This means the first OSD will initially be assigned all 2048 PGs (to
> fulfill the "failure domain host" requirement),
> > thus breaking through the default osd_max_pg_per_osd_hard_ratio of 2.
> > We also use mon_max_pg_per_osd default, i.e. 200.
> >
> > This appears to cause the previously active (but of course
> undersized+degraded) PGs to enter an "activating+remapped" state,
> > and hence they become unavailable.
> > Thus, data availability is reduced. All this is caused by adding an
> OSD!
> >
> > Of course, as more and more OSDs are added until all 32 are back
> online, this situation is relaxed.
> > Still, I observe that some PGs get stuck in this "activating" state,
> and can't seem to figure out from logs or by dumping them
> > what's the actual reason. Waiting does not help, PGs stay
> "activating", data stays inaccessible.
> >
> >
> > Can you upload logs from each of the OSDs that are (and should be, but
> > aren't) involved with one of the PGs this happens to (ceph-post-file)? And
> > create a ticket about it?
>
> I'll reproduce it over the weekend and then capture the logs; at least I
> did not see anything in there, but I am also not yet very used to reading
> them.
>
> What I can already confirm for sure is that after I set:
> osd_max_pg_per_osd_hard_ratio = 32
> in ceph.conf (global) and deploy new OSD hosts with that, the problem has
> fully vanished. I have already tested this with two machines.
>
> Cheers,
> Oliver
>
> >
> > Once you have a good map, all the PGs should definitely activate
> themselves.
> > -Greg
> >
> >
> > Waiting a bit and manually restarting the ceph-OSD-services on the
> reinstalled host seems to bring them back.
> > Also, adjusting osd_max_pg_per_osd_hard_ratio to something large
> (e.g. 10) appears to prevent the issue.
> >
> > So my best guess is that this is related to PG overdose protection.
> > Any ideas on how to best overcome this / similar observations?
> >
> > It would be nice to be able to reinstall an OSD host without temporarily
> > making data unavailable; right now the only thing that comes to my mind
> > is to effectively disable PG overdose protection.
> >
> > Cheers,
> > Oliver
> >