Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder

Hi Thomas,

No, we haven't gotten any closer to resolving this. In fact, we had 
another issue when we added a new NVMe drive with weight 1.7, instead of 
the usual 0.728, to our NVMe servers (storage11, storage12 and storage13). 
Below is what an NVMe and HDD server pair at a site looks like; it broke 
when adding osd.10 (adding the NVMe drive to storage12 and storage13 
worked, it failed when adding the last one to storage11). Changing 
osd.10's weight to 1.0 instead and recompiling the crushmap allowed all 
PGs to activate.
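
For reference, the weight change itself is just the usual crushmap
decompile/edit/recompile cycle; a rough sketch (file names are placeholders):

    ceph osd getcrushmap -o crush.bin          # grab the current map
    crushtool -d crush.bin -o crush.txt        # decompile to text
    # edit crush.txt: item osd.10 weight 1.700 -> item osd.10 weight 1.000
    crushtool -c crush.txt -o crush.new        # recompile
    crushtool -i crush.new --test --num-rep 3 --show-bad-mappings   # sanity check
    ceph osd setcrushmap -i crush.new          # inject the new map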


Unfortunately this is a production cluster that we were hoping to expand 
as needed, so when there is a problem we quickly have to revert to the 
last working crushmap, leaving no time to debug :(


We are currently building a virtualized copy of the environment, and I 
hope that we will be able to re-create the issue there, since we will be 
able to break it at will :)



host storage11 {
    id -5   # do not change unnecessarily
    id -6 class nvme    # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 4.612
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.728
    item osd.3 weight 0.728
    item osd.6 weight 0.728
    item osd.7 weight 0.728
    item osd.10 weight 1.700
}
host storage21 {
    id -13  # do not change unnecessarily
    id -14 class nvme   # do not change unnecessarily
    id -15 class hdd    # do not change unnecessarily
    # weight 65.496
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 5.458
    item osd.13 weight 5.458
    item osd.14 weight 5.458
    item osd.15 weight 5.458
    item osd.16 weight 5.458
    item osd.17 weight 5.458
    item osd.18 weight 5.458
    item osd.19 weight 5.458
    item osd.20 weight 5.458
    item osd.21 weight 5.458
    item osd.22 weight 5.458
    item osd.23 weight 5.458
}


On 2018-01-26 at 08:45, Thomas Bennett wrote:

Hi Peter,

Not sure if you have got to the bottom of your problem, but I seem to 
have found what might be a similar one. I recommend reading below, as 
there could be a hidden problem lurking in your cluster too.


Yesterday our cluster went into HEALTH_WARN state and I noticed 
that one of my PGs was listed as 'activating' and marked as 
'inactive' and 'unclean'.


We also have a mixed OSD system - 768 HDDs and 16 NVMes - with three 
crush rules for object placement: the default replicated_rule (I 
never deleted it) and two new ones, replicate_rule_hdd and 
replicate_rule_nvme.


Running a query on the pg (in my case pg 15.792) did not yield 
anything out of place, except for it telling me that its state 
was 'activating' (which isn't even among the documented pg states), and 
that made me slightly alarmed.


The bits of information that alerted me to the issue were:

1. Running 'ceph pg dump' and finding the 'activating' pg showed the 
following information:


15.792 activating [4,724,242] #for pool 15 pg there are osds 4,724,242


2. Running 'ceph osd tree | grep "osd.4 "' and getting the following 
information:


4 nvme osd.4

3. Now checking what pool 15 is by running 'ceph osd pool ls detail':

pool 15 'default.rgw.data' replicated size 3 min_size 2 crush_rule 1


These three bits of information made me realise what was going on:

  * OSDs 4, 724 and 242 are all nvmes
  * Pool 15 should obey crush_rule 1 (replicate_rule_hdd)
  * Pool 15 has pgs that use nvmes!

I found the following really useful tool online, which showed me the 
depth of the problem: Get the Number of Placement Groups Per OSD.


So it turns out that in my case pool 15 has pgs on all of the nvmes!

To test a fix (and to be able to mimic the problem again later), I 
executed the following command: 'ceph osd pg-upmap-items 15.792 4 22 724 67 76 242'


It remapped the osds used by the 'activating' pg; my cluster status went 
back to HEALTH_OK and the pg went back to normal, making the 
cluster appear healthy.
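
For anyone else trying this: pg-upmap-items takes the pgid followed by
pairs of <from-osd> <to-osd>; a minimal sketch of the general form, with
hypothetical osd ids (it also requires luminous-or-later clients, i.e.
'ceph osd set-require-min-compat-client luminous'):

    # each pair moves one replica of the pg from the first osd to the second
    ceph osd pg-upmap-items 15.792 4 22 724 67
    # the exception can be removed again later with:
    ceph osd rm-pg-upmap-items 15.792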


Luckily for me we've not put the cluster into production so I'll just 
blow away the pool and recreate it.


What I've not yet figured out is how this happened.

The steps (I think) I took were:

 1. Run ceph-ansible; the 'default.rgw.data' pool was created
    automatically.
 2. I think I then increased the pg count.
 3. Create a new rule: ceph osd crush rule create-replicated
    replicated_rule_hdd default host hdd
 4. Move the pool to the new rule: ceph osd pool set
    default.rgw.data crush_rule replicated_rule_hdd
    (a quick sanity check for this is sketched below)
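
A quick way to sanity-check where a pool's pgs actually land (a sketch using
standard commands; exact output columns may vary by version):

    ceph osd pool get default.rgw.data crush_rule    # should report replicated_rule_hdd
    ceph osd crush rule dump replicated_rule_hdd     # the 'take' step should select class hdd
    ceph pg ls-by-pool default.rgw.data              # UP/ACTING columns show the osds each pg maps to
    ceph osd tree                                    # cross-check the device class of those osds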

I don't know what the expected behaviour of the set command is, so I'm 
planning to see if I can recreate the problem on a test cluster to see 
which part of the process created the problem. Perhaps I should have 
first migra

[ceph-users] swift capabilities support in radosgw

2018-01-26 Thread Syed Armani
Hello folks,


I am getting this error "Capabilities GET failed: https://SWIFT:8080/info 404 
Not Found", 
when executing a "$ swift capabilities" command against a radosgw cluster.


I was wondering whether radosgw supports the listing of activated 
capabilities[0] via Swift API? 
Something a user can see with "$ swift capabilities" in a native swift cluster.

 
[0] 
https://developer.openstack.org/api-ref/object-store/index.html#list-activated-capabilities

Thanks!

Cheers,
Syed




Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Thomas Bennett
Hi Peter,

Just to check if your problem is similar to mine:

   - Do you have any pools that follow a crush rule to only use osds that
   are backed by hdds (i.e not nvmes)?
   - Do these pools obey that rule? i.e do they maybe have pgs that are on
   nvmes?

Regards,
Tom

On Fri, Jan 26, 2018 at 11:48 AM, Peter Linder 
wrote:

> Hi Thomas,
>
> No, we haven't gotten any closer to resolving this, in fact we had another
> issue again when we added a new nvme drive to our nvme servers (storage11,
> storage12 and storage13) that had weight 1.7 instead of the usual 0.728
> size. This (see below) is what a nvme and hdd server pair at a site looks
> like, and it broke when adding osd.10 (adding the nvme drive to storage12
> and storage13 worked, it failed when adding the last one to storage11).
> Changing osd.10's weight to 1.0 instead and recompiling crushmap allowed
> all PGs to activate.
>
> Unfortunately this is a production cluster that we were hoping to expand
> as needed, so if there is a problem we quickly have to revert to the last
> working crushmap, so no time to debug :(
>
> We are currently building a copy of the environment though virtualized and
> I hope that we will be able to re-create the issue there as we will be able
> to break it at will :)
>
>
> host storage11 {
> id -5   # do not change unnecessarily
> id -6 class nvme# do not change unnecessarily
> id -10 class hdd# do not change unnecessarily
> # weight 4.612
> alg straw2
> hash 0  # rjenkins1
> item osd.0 weight 0.728
> item osd.3 weight 0.728
> item osd.6 weight 0.728
> item osd.7 weight 0.728
> item osd.10 weight 1.700
> }
> host storage21 {
> id -13  # do not change unnecessarily
> id -14 class nvme   # do not change unnecessarily
> id -15 class hdd# do not change unnecessarily
> # weight 65.496
> alg straw2
> hash 0  # rjenkins1
> item osd.12 weight 5.458
> item osd.13 weight 5.458
> item osd.14 weight 5.458
> item osd.15 weight 5.458
> item osd.16 weight 5.458
> item osd.17 weight 5.458
> item osd.18 weight 5.458
> item osd.19 weight 5.458
> item osd.20 weight 5.458
> item osd.21 weight 5.458
> item osd.22 weight 5.458
> item osd.23 weight 5.458
> }
>
>
> On 2018-01-26 at 08:45, Thomas Bennett wrote:
>
> Hi Peter,
>
> Not sure if you have got to the bottom of your problem,  but I seem to
> have found what might be a similar problem. I recommend reading below,  as
> there could be a potential hidden problem.
>
> Yesterday our cluster went into *HEALTH_WARN* state and I noticed that
> one of my pg's was listed as '*activating*' and marked as '*inactive*'
> and '*unclean*'.
>
> We also have a mixed OSD system - 768 HDDs and 16 NVMEs with three crush
> rules for object placement: the default *replicated_rule* (I never
> deleted it) and then two new ones for *replicate_rule_hdd* and
> *replicate_rule_nvme.*
>
> Running a query on the pg (in my case pg 15.792) did not yield anything
> out of place, except for it telling me that that it's state was '
> *activating*' (that's not even a pg state: pg states
> ) and made
> me slightly alarmed.
>
> The bits of information that alerted me to the issue where:
>
> 1. Running 'ceph pg dump' and finding the 'activating' pg showed the
> following information:
>
> 15.792 activating [4,724,242] #for pool 15 pg there are osds 4,724,242
>
>
> 2. Running 'ceph osd tree | grep 'osd.4 ' and getting the following
> information:
>
> 4 nvme osd.4
>
> 3. Now checking what pool 15 is by running 'ceph osd pool ls detail':
>
> pool 15 'default.rgw.data' replicated size 3 min_size 2 crush_rule 1
>
>
> These three bits of information made me realise what was going on:
>
>- OSD 4,724,242 are all nvmes
>- Pool 15 should obey crush_rule 1 (*replicate_rule_hdd)*
>- Pool 15 has pgs that use nvmes!
>
> I found the following really useful tool online which showed me the depth
> of the problem: Get the Number of Placement Groups Per Osd
> 
>
> So it turns out in my case pool 15 has osds in all the nvmes!
>
> To test a fix to mimic the problem again - I executed the following
> command: 'ceph osd pg-upmap-items 15.792 4 22 724 67 76 242'
>
> It remap the osds used by the 'activating' pg and my cluster status when
> back to *HEALTH_OK *and the pg went back to normal making the cluster
> appear healthy.
>
> Luckily for me we've not put the cluster into production so I'll just blow
> away the pool and recreate it.
>
> What I've not yet figured out is how this happened.
>
> The steps (I think) I took where:
>
>1. Run ceph-ansible and  'default.rgw.data' pool w

[ceph-users] Can't make LDAP work

2018-01-26 Thread Theofilos Mouratidis
They gave me an LDAP server with users in it, and I want to create
tokens for these users so they can use S3 with their LDAP credentials.
I tried using the sanity check and I got this one working:

ldapsearch -x -D "CN=cephs3,OU=Users,OU=Organic Units,DC=example,DC=com" -W
-H ldaps://ldap.example.com:636 -b 'OU=Users,OU=Organic
Units,DC=example,DC=com' 'cn=*' dn

My config is like this:
[global]
rgw_ldap_binddn = "CN=cephs3,OU=Users,OU=Organic Units,DC=example,DC=com"
rgw_ldap_dnattr = "cn"
rgw_ldap_searchdn = "OU=Users,OU=Organic Units,DC=example,DC=com"
rgw_ldap_secret = "plaintext_pass"
rgw_ldap_uri = ldaps://ldap.example.com:636
rgw_s3_auth_use_ldap = true

I create my token to test the ldap feature:

export RGW_ACCESS_KEY_ID="myuser" #where "dn: cn=myuser..." is in
ldap.example.com
export RGW_SECRET_ACCESS_KEY="mypass"
radosgw-token --encode --ttype=ad
abcad=
radosgw-token --encode --ttype=ldap
abcldap=

Now I go to s3cmd and in the config I have something like this:
access_key = abcad=
secret_key =
use_https = false
host_base = ceph_rgw.example.com:8080
host_bucket = ceph_rgw.example.com:8080


I get access denied.
Then I try with the ldap key and I get the same problem.
I created a local user out of curiosity, put its access and secret keys in
s3cmd, and I could create a bucket. What am I doing wrong?


Re: [ceph-users] ceph-volume raw disks

2018-01-26 Thread Alfredo Deza
That looks like Luminous, but not 12.2.2

The 'raw' device handling is supported in 12.2.2 for sure.
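
On 12.2.1 a workaround is to create the volume group and logical volume
yourself and pass vg/lv instead of the raw device; a rough sketch (names are
placeholders):

    vgcreate ceph-vg-sdb /dev/sdb
    lvcreate -l 100%FREE -n osd-lv ceph-vg-sdb
    ceph-volume lvm create --bluestore --data ceph-vg-sdb/osd-lv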

On Thu, Jan 25, 2018 at 10:42 PM, David Turner  wrote:
> Did you wipe all of the existing partitions and such first?  Which version
> of ceph?  The below commands are what I ran to re-add my osds as bluestore
> after moving all data off of them.
>
> ceph-volume lvm zap /dev/sdb
> ceph-volume lvm create --bluestore --data /dev/sdb
>
> On Thu, Jan 25, 2018 at 9:41 PM Nathan Dehnel  wrote:
>>
>> The doc at
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#ceph-volume-lvm-prepare
>> says I can pass a physical device to ceph-volume. But when I try to do that:
>>
>> gentooserver ~ # ceph-volume lvm create --bluestore --data /dev/sdb
>> usage: ceph-volume lvm create [-h] [--journal JOURNAL] --data DATA
>>   [--journal-size GB] [--bluestore]
>> [--filestore]
>>   [--osd-id OSD_ID] [--osd-fsid OSD_FSID]
>> ceph-volume lvm create: error: Logical volume must be specified as
>> 'volume_group/logical_volume' but got: /dev/sdb
>>
>> Am I missing something?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder
Well, we do, but our problem is with our hybrid setup (1 nvme and 2 
hdds). The other two rules (which we rarely use) are nvme only and hdd 
only; as far as I can tell they work, and their "take" command uses a 
device class to select only the relevant OSDs.


I'll just paste our entire crushmap dump here. This one starts working 
when the 1.7 weight is changed to 1.0... crushtool --test doesn't show any 
errors in either case, and all PGs seem to be properly assigned to osds.
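
For reference, a typical crushtool --test invocation looks roughly like this
(rule id and replica count are placeholders for our hybrid rule):

    crushtool -c crushmap.txt -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-statistics
    crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings   # silent when every pg maps fully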


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class nvme
device 25 osd.25 class nvme
device 26 osd.26 class nvme
device 27 osd.27 class nvme
device 36 osd.36 class hdd
device 37 osd.37 class hdd
device 38 osd.38 class hdd
device 39 osd.39 class hdd
device 40 osd.40 class hdd
device 41 osd.41 class hdd
device 42 osd.42 class hdd
device 43 osd.43 class hdd
device 44 osd.44 class hdd
device 45 osd.45 class hdd
device 46 osd.46 class hdd
device 47 osd.47 class hdd
device 48 osd.48 class hdd
device 49 osd.49 class hdd
device 50 osd.50 class hdd
device 51 osd.51 class hdd
device 52 osd.52 class hdd
device 53 osd.53 class hdd
device 54 osd.54 class hdd
device 55 osd.55 class hdd
device 56 osd.56 class hdd
device 57 osd.57 class hdd
device 58 osd.58 class hdd
device 59 osd.59 class hdd

# types
type 0 osd
type 1 host
type 2 hostgroup
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host storage11 {
    id -5   # do not change unnecessarily
    id -6 class nvme    # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 4.612
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.728
    item osd.3 weight 0.728
    item osd.6 weight 0.728
    item osd.7 weight 0.728
    item osd.10 weight 1.700
}
host storage21 {
    id -13  # do not change unnecessarily
    id -14 class nvme   # do not change unnecessarily
    id -15 class hdd    # do not change unnecessarily
    # weight 65.496
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 5.458
    item osd.13 weight 5.458
    item osd.14 weight 5.458
    item osd.15 weight 5.458
    item osd.16 weight 5.458
    item osd.17 weight 5.458
    item osd.18 weight 5.458
    item osd.19 weight 5.458
    item osd.20 weight 5.458
    item osd.21 weight 5.458
    item osd.22 weight 5.458
    item osd.23 weight 5.458
}
datacenter HORN79 {
    id -19  # do not change unnecessarily
    id -26 class nvme   # do not change unnecessarily
    id -27 class hdd    # do not change unnecessarily
    # weight 70.108
    alg straw2
    hash 0  # rjenkins1
    item storage11 weight 4.612
    item storage21 weight 65.496
}
host storage13 {
    id -7   # do not change unnecessarily
    id -8 class nvme    # do not change unnecessarily
    id -11 class hdd    # do not change unnecessarily
    # weight 4.612
    alg straw2
    hash 0  # rjenkins1
    item osd.24 weight 0.728
    item osd.25 weight 0.728
    item osd.26 weight 0.728
    item osd.27 weight 0.728
    item osd.8 weight 1.700
}
host storage23 {
    id -16  # do not change unnecessarily
    id -17 class nvme   # do not change unnecessarily
    id -18 class hdd    # do not change unnecessarily
    # weight 65.784
    alg straw2
    hash 0  # rjenkins1
    item osd.36 weight 5.482
    item osd.37 weight 5.482
    item osd.38 weight 5.482
    item osd.39 weight 5.482
    item osd.40 weight 5.482
    item osd.41 weight 5.482
    item osd.42 weight 5.482
    item osd.43 weight 5.482
    item osd.44 weight 5.482
    item osd.45 weight 5.482
    item osd.58 weight 5.482
    item osd.59 weight 5.482
}
datacenter WAR {
    id -20  # do not change unnecessarily
    id -24 class nvme   # do not change

Re: [ceph-users] How ceph client read data from ceph cluster

2018-01-26 Thread Maged Mokhtar
On 2018-01-26 09:09, shadow_lin wrote:

> Hi List, 
> I read an old article about how the ceph client reads from the ceph cluster. It said the 
> client only reads from the primary osd. Since a ceph cluster in replicated mode 
> has several copies of the data, reading from only one copy seems to waste the 
> performance of concurrent reads from all the copies. 
> But that article is rather old, so maybe ceph has improved to read from all 
> the copies? But I haven't found any info about that. 
> Any info about that would be appreciated. 
> Thanks 
> 
> 2018-01-26 
> -
> shadow_lin 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi  

In the majority of cases you will have more concurrent io requests than
disks, so the load will already be distributed evenly. If this is not
the case and you have a large cluster with fewer clients, you may
consider using object/rbd striping so that each io will be divided into
different osd requests. 
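
For example, striping can be set when an rbd image is created; a minimal
sketch (pool, image name and sizes are just illustrative):

    # each 64K chunk goes to a different object, cycling over 16 objects,
    # so a single sequential stream fans out over more osds
    rbd create mypool/myimage --size 100G --object-size 4M --stripe-unit 64K --stripe-count 16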

Maged


[ceph-users] RGW Upgrade to Luminous Inconsistent PGs in index pools

2018-01-26 Thread David Turner
I just upgraded to Luminous yesterday and before the upgrade was complete,
we had SSD OSDs flapping up and down and scrub errors in the RGW index
pools.  I consistently made sure that we had all OSDs back up and the
cluster healthy before continuing and never reduced the min_size below 2
for the pools on the NVMes.  The RGW daemons for our 2 multi-site realms
restarted themselves (due to a long-standing memory leak supposedly fixed
in 12.2.2) and prematurely upgraded themselves before all of the OSDs had
been upgraded and I thought that was the reason for the scrub errors and
inconsistent PGs... however this morning I had a scrub error in our local
only realm which does not use multi-site and had not restarted any of its
RGW daemons until after all of the OSDs had been upgraded.

Is there anything we should be looking at for this?  Any idea what could be
causing these scrub errors?  I can issue a repair on the PG and the scrub
errors go away, but then they keep coming back on the same PGs later.  I
can also issue a deep-scrub on every PG in these pools and they return
clean, but then later show back up with the scrub errors and inconsistent
PGs on the same PGs.
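
For reference, this is roughly what I run to find and repair them (pool name
and pgid are placeholders):

    rados list-inconsistent-pg <index pool>                  # list pgs with scrub errors in a pool
    rados list-inconsistent-obj <pgid> --format=json-pretty  # per-object detail of the mismatch
    ceph pg repair <pgid>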


Re: [ceph-users] RGW Upgrade to Luminous Inconsistent PGs in index pools

2018-01-26 Thread David Turner
That last part about the scrubs returning clean when manually run may not
be accurate.  Doing more testing, but the problem is definitely persistent
even after a repair shows the PG as clean.

On Fri, Jan 26, 2018 at 7:41 AM David Turner  wrote:

> I just upgraded to Luminous yesterday and before the upgrade was complete,
> we had SSD OSDs flapping up and down and scrub errors in the RGW index
> pools.  I consistently made sure that we had all OSDs back up and the
> cluster healthy before continuing and never reduced the min_size below 2
> for the pools on the NVMes.  The RGW daemons for our 2 multi-site realms
> restarted themselves (due to a long-standing memory leak supposedly fixed
> in 12.2.2) and prematurely upgraded themselves before all of the OSDs had
> been upgraded and I thought that was the reason for the scrub errors and
> inconsistent PGs... however this morning I had a scrub error in our local
> only realm which does not use multi-site and had not restarted any of it's
> RGW daemons until after all of the OSDs had been upgraded.
>
> Is there anything we should be looking at for this?  Any idea what could
> be causing these scrub errors?  I can issue a repair on the PG and the
> scrub errors go away, but then they keep coming back on the same PGs
> later.  I can also issue a deep-scrub on every PG in these pools and they
> return clean, but then later show back up with the scrub errors and
> inconsistent PGs on the same PGs.
>


[ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

2018-01-26 Thread David Turner
http://tracker.ceph.com/issues/22796

I was curious if anyone here had any ideas or experience with this
problem.  I created the tracker for this yesterday when I woke up to find
all 3 of my SSD OSDs not running and unable to start due to this segfault.
These OSDs are in my small home cluster and hold the cephfs_cache and
cephfs_metadata pools.

To recap, I upgraded from 10.2.10 to 12.2.2, successfully swapped out my 9
OSDs to Bluestore, reconfigured my crush rules to utilize OSD classes,
failed to remove the CephFS cache tier due to
http://tracker.ceph.com/issues/22754, created these 3 SSD OSDs and updated
the cephfs_cache and cephfs_metadata pools to use the replicated_ssd crush
rule... fast forward 2 days of this working great to me waking up with all
3 of them crashed and unable to start.  There is an OSD log with debug
bluestore = 5 attached to the tracker at the top of the email.

My CephFS is completely down while these 2 pools are inaccessible.  The
OSDs themselves are intact if I need to move the data out manually to the
HDDs or something.  Any help is appreciated.


Re: [ceph-users] Snapshot trimming

2018-01-26 Thread David Turner
You may find the information in this ML thread useful.
https://www.spinics.net/lists/ceph-users/msg41279.html

It talks about a couple ways to track your snaptrim queue.
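
For a quick look, the per-pg snap trim queue also shows up in a pg query;
a small sketch with a placeholder pgid:

    ceph pg <pgid> query | grep -A1 snap_trimq   # a non-empty interval set means trimming is still pending
    ceph osd pool ls detail                      # removed_snaps shows what has been scheduled for trimming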

On Fri, Jan 26, 2018 at 2:09 AM Karun Josy  wrote:

> Hi,
>
> We have set the noscrub and nodeep-scrub flags on a ceph cluster.
> When we are deleting snapshots we are not seeing any change in used space.
>
> I understand that Ceph OSDs delete data asynchronously, so deleting a
> snapshot doesn't free up the disk space immediately. But we are not seeing
> any change for some time.
>
> What can be possible reason ? Any suggestions would be really helpful as
> the cluster size seems to be growing each day even though snapshots are
> deleted.
>
>
> Karun
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] ceph-volume raw disks

2018-01-26 Thread David Turner
I didn't test those commands on 12.2.1 either.  I've only used 12.2.2.

On Fri, Jan 26, 2018 at 6:35 AM Alfredo Deza  wrote:

> That looks like Luminous, but not 12.2.2
>
> The 'raw' device handling is supported in 12.2.2 for sure.
>
> On Thu, Jan 25, 2018 at 10:42 PM, David Turner 
> wrote:
> > Did you wipe all of the existing partitions and such first?  Which
> version
> > of ceph?  The below commands are what I ran to re-add my osds as
> bluestore
> > after moving all data off of them.
> >
> > ceph-volume lvm zap /dev/sdb
> > ceph-volume lvm create --bluestore --data /dev/sdb
> >
> > On Thu, Jan 25, 2018 at 9:41 PM Nathan Dehnel 
> wrote:
> >>
> >> The doc at
> >>
> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#ceph-volume-lvm-prepare
> >> says I can pass a physical device to ceph-volume. But when I try to do
> that:
> >>
> >> gentooserver ~ # ceph-volume lvm create --bluestore --data /dev/sdb
> >> usage: ceph-volume lvm create [-h] [--journal JOURNAL] --data DATA
> >>   [--journal-size GB] [--bluestore]
> >> [--filestore]
> >>   [--osd-id OSD_ID] [--osd-fsid OSD_FSID]
> >> ceph-volume lvm create: error: Logical volume must be specified as
> >> 'volume_group/logical_volume' but got: /dev/sdb
> >>
> >> Am I missing something?
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>


Re: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

2018-01-26 Thread Nick Fisk
I can see this in the logs:

 

2018-01-25 06:05:56.292124 7f37fa6ea700 -1 log_channel(cluster) log [ERR] : 
full status failsafe engaged, dropping updates, now 101% full

2018-01-25 06:05:56.325404 7f3803f9c700 -1 bluestore(/var/lib/ceph/osd/ceph-9) 
_do_alloc_write failed to reserve 0x4000

2018-01-25 06:05:56.325434 7f3803f9c700 -1 bluestore(/var/lib/ceph/osd/ceph-9) 
_do_write _do_alloc_write failed with (28) No space left on device

2018-01-25 06:05:56.325462 7f3803f9c700 -1 bluestore(/var/lib/ceph/osd/ceph-9) 
_txc_add_transaction error (28) No space left on device not handled on 
operation 10 (op 0, counting from 0)

 

Are they out of space, or is something mis-reporting?

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David 
Turner
Sent: 26 January 2018 13:03
To: ceph-users 
Subject: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

 

http://tracker.ceph.com/issues/22796

 

I was curious if anyone here had any ideas or experience with this problem.  I 
created the tracker for this yesterday when I woke up to find all 3 of my SSD 
OSDs not running and unable to start due to this segfault.  These OSDs are in 
my small home cluster and hold the cephfs_cache and cephfs_metadata pools.

 

To recap, I upgraded from 10.2.10 to 12.2.2, successfully swapped out my 9 OSDs 
to Bluestore, reconfigured my crush rules to utilize OSD classes, failed to 
remove the CephFS cache tier due to http://tracker.ceph.com/issues/22754, 
created these 3 SSD OSDs and updated the cephfs_cache and cephfs_metadata pools 
to use the replicated_ssd crush rule... fast forward 2 days of this working 
great to me waking up with all 3 of them crashed and unable to start.  There is 
an OSD log with debug bluestore = 5 attached to the tracker at the top of the 
email.

 

My CephFS is completely down while these 2 pools are inaccessible.  The OSDs 
themselves are in-tact if I need to move the data out manually to the HDDs or 
something.  Any help is appreciated.



Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder
Ok, so creating our setup in the lab and adding the pools, our hybrid 
pool cannot even be properly created with around 1/3 of the PGs stuck in 
various states:


  cluster:
    id: e07f568d-056c-4e01-9292-732c64ab4f8e
    health: HEALTH_WARN
    Reduced data availability: 1070 pgs inactive, 204 pgs peering
    Degraded data redundancy: 1087 pgs unclean, 69 pgs 
degraded, 69 pgs undersized

    too many PGs per OSD (215 > max 200)

  services:
    mon: 3 daemons, quorum s11,s12,s13
    mgr: s12(active), standbys: s11, s13
    osd: 51 osds: 51 up, 51 in

  data:
    pools:   3 pools, 4608 pgs
    objects: 0 objects, 0 bytes
    usage:   56598 MB used, 706 GB / 761 GB avail
    pgs: 17.643% pgs unknown
 5.577% pgs not active
 3521 active+clean
 813  unknown
 204  creating+peering
 46   undersized+degraded+peered
 17   active+undersized+degraded
 6    creating+activating+undersized+degraded
 1    creating+activating


It is stuck like this, and I can't query the problematic PGs:

# ceph pg 2.7cf query

Error ENOENT: i don't have pgid 2.7cf

So, so far, great success :). Now I only have to learn how to fix it, 
any ideas anyone?




On 2018-01-26 at 12:59, Peter Linder wrote:


Well, we do, but our problem is with our hybrid setup (1 nvme and 2 
hdds). The other two (that we rarely use) are nvme only and hdd only, 
as far as I can tell they work and "take" command uses class to select 
only the relevant OSDs.


I'll just paste our entire crushmap dump here. This one starts working 
when changing the 1.7 weight to 1.0... crushtool --test doesn't show 
any errors in any case, all PGs seem to be properly assigned to osds.


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class nvme
device 25 osd.25 class nvme
device 26 osd.26 class nvme
device 27 osd.27 class nvme
device 36 osd.36 class hdd
device 37 osd.37 class hdd
device 38 osd.38 class hdd
device 39 osd.39 class hdd
device 40 osd.40 class hdd
device 41 osd.41 class hdd
device 42 osd.42 class hdd
device 43 osd.43 class hdd
device 44 osd.44 class hdd
device 45 osd.45 class hdd
device 46 osd.46 class hdd
device 47 osd.47 class hdd
device 48 osd.48 class hdd
device 49 osd.49 class hdd
device 50 osd.50 class hdd
device 51 osd.51 class hdd
device 52 osd.52 class hdd
device 53 osd.53 class hdd
device 54 osd.54 class hdd
device 55 osd.55 class hdd
device 56 osd.56 class hdd
device 57 osd.57 class hdd
device 58 osd.58 class hdd
device 59 osd.59 class hdd

# types
type 0 osd
type 1 host
type 2 hostgroup
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host storage11 {
    id -5   # do not change unnecessarily
    id -6 class nvme    # do not change unnecessarily
    id -10 class hdd    # do not change unnecessarily
    # weight 4.612
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 0.728
    item osd.3 weight 0.728
    item osd.6 weight 0.728
    item osd.7 weight 0.728
    item osd.10 weight 1.700
}
host storage21 {
    id -13  # do not change unnecessarily
    id -14 class nvme   # do not change unnecessarily
    id -15 class hdd    # do not change unnecessarily
    # weight 65.496
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 5.458
    item osd.13 weight 5.458
    item osd.14 weight 5.458
    item osd.15 weight 5.458
    item osd.16 weight 5.458
    item osd.17 weight 5.458
    item osd.18 weight 5.458
    item osd.19 weight 5.458
    item osd.20 weight 5.458
    item osd.21 weight 5.458
    item osd.22 weight 5.458
    item osd.23 weight 5.458
}
datacenter HORN79 {
    id -19  # do not change unnecessarily
    id -26 class nvme   # do not change unnecessarily
    id -27 class hdd    # do not change unnecessarily
    # weight 70.108
    alg straw2
    hash 0  # rjenkins1
  

Re: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

2018-01-26 Thread David Turner
I wouldn't be shocked if they were out of space, but `ceph osd df` only
showed them as 45% full when I was first diagnosing this.  Now they are
showing completely full with the same command.  I'm thinking the cache tier
behavior might have changed in Luminous, because I was keeping my cache
completely empty before with a max target objects of 0, which flushed things
out consistently after my minimum flush age.  I noticed it wasn't keeping
up with the flushing as well as it had in Jewel, but didn't think too much
of it.  Anyway, that's something I can tinker with after the pools are back
up and running.

If they are full and on Bluestore, what can I do to clean them up?  I
assume that I need to keep the metadata pool intact, but I don't need to
maintain any data in the cache pool.  I have a copy of everything written
in the last 24 hours prior to this incident, and nothing is modified after
it is in cephfs.
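
If it comes to moving data out manually, my understanding is that
ceph-objectstore-tool can export/import whole pgs while the osd is stopped;
a rough, untested sketch (paths and pgid are placeholders):

    systemctl stop ceph-osd@9
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --op list-pgs
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --pgid <pgid> \
        --op export --file /backup/<pgid>.export
    # and later, on a stopped osd that has free space:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --pgid <pgid> \
        --op import --file /backup/<pgid>.export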

On Fri, Jan 26, 2018 at 8:23 AM Nick Fisk  wrote:

> I can see this in the logs:
>
>
>
> 2018-01-25 06:05:56.292124 7f37fa6ea700 -1 log_channel(cluster) log [ERR]
> : full status failsafe engaged, dropping updates, now 101% full
>
> 2018-01-25 06:05:56.325404 7f3803f9c700 -1
> bluestore(/var/lib/ceph/osd/ceph-9) _do_alloc_write failed to reserve 0x4000
>
> 2018-01-25 06:05:56.325434 7f3803f9c700 -1
> bluestore(/var/lib/ceph/osd/ceph-9) _do_write _do_alloc_write failed with
> (28) No space left on device
>
> 2018-01-25 06:05:56.325462 7f3803f9c700 -1
> bluestore(/var/lib/ceph/osd/ceph-9) _txc_add_transaction error (28) No
> space left on device not handled on operation 10 (op 0, counting from 0)
>
>
>
> Are they out of space, or is something mis-reporting?
>
>
>
> Nick
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *David Turner
> *Sent:* 26 January 2018 13:03
> *To:* ceph-users 
> *Subject:* [ceph-users] BlueStore.cc: 9363: FAILED assert(0 ==
> "unexpected error")
>
>
>
> http://tracker.ceph.com/issues/22796
>
>
>
> I was curious if anyone here had any ideas or experience with this
> problem.  I created the tracker for this yesterday when I woke up to find
> all 3 of my SSD OSDs not running and unable to start due to this segfault.
> These OSDs are in my small home cluster and hold the cephfs_cache and
> cephfs_metadata pools.
>
>
>
> To recap, I upgraded from 10.2.10 to 12.2.2, successfully swapped out my 9
> OSDs to Bluestore, reconfigured my crush rules to utilize OSD classes,
> failed to remove the CephFS cache tier due to
> http://tracker.ceph.com/issues/22754, created these 3 SSD OSDs and
> updated the cephfs_cache and cephfs_metadata pools to use the
> replicated_ssd crush rule... fast forward 2 days of this working great to
> me waking up with all 3 of them crashed and unable to start.  There is an
> OSD log with debug bluestore = 5 attached to the tracker at the top of the
> email.
>
>
>
> My CephFS is completely down while these 2 pools are inaccessible.  The
> OSDs themselves are in-tact if I need to move the data out manually to the
> HDDs or something.  Any help is appreciated.
>


Re: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

2018-01-26 Thread David Turner
If I could get it started, I could flush-evict the cache, but that doesn't
seem likely.

On Fri, Jan 26, 2018 at 8:33 AM David Turner  wrote:

> I wouldn't be shocked if they were out of space, but `ceph osd df` only
> showed them as 45% full when I was first diagnosing this.  Now they are
> showing completely full with the same command.  I'm thinking the cache tier
> behavior might have changed to Luminous because I was keeping my cache
> completely empty before with a max target objects of 0 which flushed things
> out consistently after my minimum flush age.  I noticed it wasn't keeping
> up with the flushing as well as it had in Jewel, but didn't think too much
> of it.  Anyway, that's something I can tinker with after the pools are back
> up and running.
>
> If they are full and on Bluestore, what can I do to clean them up?  I
> assume that I need to keep the metadata pool in-tact, but I don't need to
> maintain any data in the cache pool.  I have a copy of everything written
> in the last 24 hours prior to this incident and nothing is modified after
> it is in cephfs.
>
> On Fri, Jan 26, 2018 at 8:23 AM Nick Fisk  wrote:
>
>> I can see this in the logs:
>>
>>
>>
>> 2018-01-25 06:05:56.292124 7f37fa6ea700 -1 log_channel(cluster) log [ERR]
>> : full status failsafe engaged, dropping updates, now 101% full
>>
>> 2018-01-25 06:05:56.325404 7f3803f9c700 -1
>> bluestore(/var/lib/ceph/osd/ceph-9) _do_alloc_write failed to reserve 0x4000
>>
>> 2018-01-25 06:05:56.325434 7f3803f9c700 -1
>> bluestore(/var/lib/ceph/osd/ceph-9) _do_write _do_alloc_write failed with
>> (28) No space left on device
>>
>> 2018-01-25 06:05:56.325462 7f3803f9c700 -1
>> bluestore(/var/lib/ceph/osd/ceph-9) _txc_add_transaction error (28) No
>> space left on device not handled on operation 10 (op 0, counting from 0)
>>
>>
>>
>> Are they out of space, or is something mis-reporting?
>>
>>
>>
>> Nick
>>
>>
>>
>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
>> Of *David Turner
>> *Sent:* 26 January 2018 13:03
>> *To:* ceph-users 
>> *Subject:* [ceph-users] BlueStore.cc: 9363: FAILED assert(0 ==
>> "unexpected error")
>>
>>
>>
>> http://tracker.ceph.com/issues/22796
>>
>>
>>
>> I was curious if anyone here had any ideas or experience with this
>> problem.  I created the tracker for this yesterday when I woke up to find
>> all 3 of my SSD OSDs not running and unable to start due to this segfault.
>> These OSDs are in my small home cluster and hold the cephfs_cache and
>> cephfs_metadata pools.
>>
>>
>>
>> To recap, I upgraded from 10.2.10 to 12.2.2, successfully swapped out my
>> 9 OSDs to Bluestore, reconfigured my crush rules to utilize OSD classes,
>> failed to remove the CephFS cache tier due to
>> http://tracker.ceph.com/issues/22754, created these 3 SSD OSDs and
>> updated the cephfs_cache and cephfs_metadata pools to use the
>> replicated_ssd crush rule... fast forward 2 days of this working great to
>> me waking up with all 3 of them crashed and unable to start.  There is an
>> OSD log with debug bluestore = 5 attached to the tracker at the top of the
>> email.
>>
>>
>>
>> My CephFS is completely down while these 2 pools are inaccessible.  The
>> OSDs themselves are in-tact if I need to move the data out manually to the
>> HDDs or something.  Any help is appreciated.
>>
>


Re: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")

2018-01-26 Thread David Turner
I also just got my new 480GB SSDs, if they could be used to move the PGs
to.  Thank you for your help.

On Fri, Jan 26, 2018 at 8:33 AM David Turner  wrote:

> If I could get it started, I could flush-evict the cache, but that's not
> seeming likely.
>
> On Fri, Jan 26, 2018 at 8:33 AM David Turner 
> wrote:
>
>> I wouldn't be shocked if they were out of space, but `ceph osd df` only
>> showed them as 45% full when I was first diagnosing this.  Now they are
>> showing completely full with the same command.  I'm thinking the cache tier
>> behavior might have changed to Luminous because I was keeping my cache
>> completely empty before with a max target objects of 0 which flushed things
>> out consistently after my minimum flush age.  I noticed it wasn't keeping
>> up with the flushing as well as it had in Jewel, but didn't think too much
>> of it.  Anyway, that's something I can tinker with after the pools are back
>> up and running.
>>
>> If they are full and on Bluestore, what can I do to clean them up?  I
>> assume that I need to keep the metadata pool in-tact, but I don't need to
>> maintain any data in the cache pool.  I have a copy of everything written
>> in the last 24 hours prior to this incident and nothing is modified after
>> it is in cephfs.
>>
>> On Fri, Jan 26, 2018 at 8:23 AM Nick Fisk  wrote:
>>
>>> I can see this in the logs:
>>>
>>>
>>>
>>> 2018-01-25 06:05:56.292124 7f37fa6ea700 -1 log_channel(cluster) log
>>> [ERR] : full status failsafe engaged, dropping updates, now 101% full
>>>
>>> 2018-01-25 06:05:56.325404 7f3803f9c700 -1
>>> bluestore(/var/lib/ceph/osd/ceph-9) _do_alloc_write failed to reserve 0x4000
>>>
>>> 2018-01-25 06:05:56.325434 7f3803f9c700 -1
>>> bluestore(/var/lib/ceph/osd/ceph-9) _do_write _do_alloc_write failed with
>>> (28) No space left on device
>>>
>>> 2018-01-25 06:05:56.325462 7f3803f9c700 -1
>>> bluestore(/var/lib/ceph/osd/ceph-9) _txc_add_transaction error (28) No
>>> space left on device not handled on operation 10 (op 0, counting from 0)
>>>
>>>
>>>
>>> Are they out of space, or is something mis-reporting?
>>>
>>>
>>>
>>> Nick
>>>
>>>
>>>
>>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
>>> Behalf Of *David Turner
>>> *Sent:* 26 January 2018 13:03
>>> *To:* ceph-users 
>>> *Subject:* [ceph-users] BlueStore.cc: 9363: FAILED assert(0 ==
>>> "unexpected error")
>>>
>>>
>>>
>>> http://tracker.ceph.com/issues/22796
>>>
>>>
>>>
>>> I was curious if anyone here had any ideas or experience with this
>>> problem.  I created the tracker for this yesterday when I woke up to find
>>> all 3 of my SSD OSDs not running and unable to start due to this segfault.
>>> These OSDs are in my small home cluster and hold the cephfs_cache and
>>> cephfs_metadata pools.
>>>
>>>
>>>
>>> To recap, I upgraded from 10.2.10 to 12.2.2, successfully swapped out my
>>> 9 OSDs to Bluestore, reconfigured my crush rules to utilize OSD classes,
>>> failed to remove the CephFS cache tier due to
>>> http://tracker.ceph.com/issues/22754, created these 3 SSD OSDs and
>>> updated the cephfs_cache and cephfs_metadata pools to use the
>>> replicated_ssd crush rule... fast forward 2 days of this working great to
>>> me waking up with all 3 of them crashed and unable to start.  There is an
>>> OSD log with debug bluestore = 5 attached to the tracker at the top of the
>>> email.
>>>
>>>
>>>
>>> My CephFS is completely down while these 2 pools are inaccessible.  The
>>> OSDs themselves are in-tact if I need to move the data out manually to the
>>> HDDs or something.  Any help is appreciated.
>>>
>>


Re: [ceph-users] ceph-volume raw disks

2018-01-26 Thread Alfredo Deza
On Fri, Jan 26, 2018 at 8:07 AM, David Turner  wrote:
> I didn't test those commands on 12.2.1 either.  I've only used 12.2.2.
>
> On Fri, Jan 26, 2018 at 6:35 AM Alfredo Deza  wrote:
>>
>> That looks like Luminous, but not 12.2.2
>>
>> The 'raw' device handling is supported in 12.2.2 for sure.
>>
>> On Thu, Jan 25, 2018 at 10:42 PM, David Turner 
>> wrote:
>> > Did you wipe all of the existing partitions and such first?  Which
>> > version
>> > of ceph?  The below commands are what I ran to re-add my osds as
>> > bluestore
>> > after moving all data off of them.
>> >
>> > ceph-volume lvm zap /dev/sdb
>> > ceph-volume lvm create --bluestore --data /dev/sdb
>> >
>> > On Thu, Jan 25, 2018 at 9:41 PM Nathan Dehnel 
>> > wrote:
>> >>
>> >> The doc at
>> >>
>> >> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#ceph-volume-lvm-prepare

That link is using the 'master' branch version of the docs.
Unfortunately, you would need to know that our URLs send you straight to
master for docs; they will not tell you that 'master' is unreleased, and
will not give you an option to point to a different version.

Again, the functionality you are looking for is in 12.2.2, not in 12.2.1

>> >> says I can pass a physical device to ceph-volume. But when I try to do
>> >> that:
>> >>
>> >> gentooserver ~ # ceph-volume lvm create --bluestore --data /dev/sdb
>> >> usage: ceph-volume lvm create [-h] [--journal JOURNAL] --data DATA
>> >>   [--journal-size GB] [--bluestore]
>> >> [--filestore]
>> >>   [--osd-id OSD_ID] [--osd-fsid OSD_FSID]
>> >> ceph-volume lvm create: error: Logical volume must be specified as
>> >> 'volume_group/logical_volume' but got: /dev/sdb
>> >>
>> >> Am I missing something?
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >


Re: [ceph-users] Can't make LDAP work

2018-01-26 Thread Matt Benjamin
Hi Theofilos,

I'm not sure what's going wrong offhand; I see all the pieces in your writeup.

The first thing I would verify is that "CN=cephs3,OU=Users,OU=Organic
Units,DC=example,DC=com" can see the users in
ldaps://ldap.example.com:636, and that "cn=myuser..." can itself do a
simple bind using standard tools.
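
i.e. something along these lines, with placeholder names matching your
earlier ldapsearch (whether the user's DN really is cn=myuser under that OU
is an assumption):

    # 1) the bind DN used by rgw can find the user:
    ldapsearch -x -D "CN=cephs3,OU=Users,OU=Organic Units,DC=example,DC=com" -W \
        -H ldaps://ldap.example.com:636 \
        -b "OU=Users,OU=Organic Units,DC=example,DC=com" "(cn=myuser)" dn
    # 2) the user itself can simple-bind with the same password that went into the token:
    ldapsearch -x -D "CN=myuser,OU=Users,OU=Organic Units,DC=example,DC=com" -W \
        -H ldaps://ldap.example.com:636 \
        -b "OU=Users,OU=Organic Units,DC=example,DC=com" -s base dn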

What Ceph version are you running?

Matt

On Fri, Jan 26, 2018 at 5:27 AM, Theofilos Mouratidis
 wrote:
> They gave me a ldap server working with users inside, and I want to create
> tokens for these users
>  to use s3 from their ldap credentials.
> I tried using the sanity check and I got this one working:
>
> ldapsearch -x -D "CN=cephs3,OU=Users,OU=Organic Units,DC=example,DC=com" -W
> -H ldaps://ldap.example.com:636 -b 'OU=Users,OU=Organic
> Units,DC=example,DC=com' 'cn=*' dn
>
> My config is like this:
> [global]
> rgw_ldap_binddn = "CN=cephs3,OU=Users,OU=Organic Units,DC=example,DC=com"
> rgw_ldap_dnattr = "cn"
> rgw_ldap_searchdn = "OU=Users,OU=Organic Units,DC=example,DC=com"
> rgw_ldap_secret = "plaintext_pass"
> rgw_ldap_uri = ldaps://ldap.example.com:636
> rgw_s3_auth_use_ldap = true
>
> I create my token to test the ldap feature:
>
> export RGW_ACCESS_KEY_ID="myuser" #where "dn: cn=myuser..." is in
> ldap.example.com
> export RGW_SECRET_ACCESS_KEY="mypass"
> radosgw-token --encode --ttype=ad
> abcad=
> radosgw-token --encode --ttype=ldap
> abcldap=
>
> Now I go to s3cmd and in config I have something like this:
> acess_key = abcad=
> secret_key =
> use_https = false
> host_base = ceph_rgw.example.com:8080
> host_bucket = ceph_rgw.example.com:8080
>
>
> I get access denied,
> then I try with the ldap key and I get the same problem.
> I created a local user out of curiosity and I put in s3cmd acess and secret
> and I could create a bucket. What am I doing wrong?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


Re: [ceph-users] swift capabilities support in radosgw

2018-01-26 Thread Matt Benjamin
Hi Syed,

RGW supports Swift /info in Luminous.

By default, IIRC, it isn't at the root of the URL hierarchy, but there
has been an option to change that since last year; see
https://github.com/ceph/ceph/pull/10280.

Matt

On Fri, Jan 26, 2018 at 5:10 AM, Syed Armani  wrote:
> Hello folks,
>
>
> I am getting this error "Capabilities GET failed: https://SWIFT:8080/info 404 
> Not Found",
> when executing a "$ swift capabilities" command against a radosgw cluster.
>
>
> I was wondering whether radosgw supports the listing of activated 
> capabilities[0] via Swift API?
> Something a user can see with "$ swift capabilities" in a native swift 
> cluster.
>
>
> [0] 
> https://developer.openstack.org/api-ref/object-store/index.html#list-activated-capabilities
>
> Thanks!
>
> Cheers,
> Syed
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


[ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread David

Hi!

On luminous 12.2.2

I'm migrating some OSDs from filestore to bluestore using the "simple" method 
as described in docs: 
http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds
 

Mark out and Replace.

However, at step 9 (ceph-volume create --bluestore --data $DEVICE --osd-id $ID) 
it seems to create the bluestore OSD, but it fails to authenticate with the 
old osd-id's auth key.
(The command above is also missing the lvm or simple subcommand.)
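
For reference, the sequence I'm following from that doc looks roughly like
this (device and id are placeholders, and the "wait until safe" loop is
paraphrased):

    ID=0
    DEVICE=/dev/sdc
    ceph osd out $ID
    while ! ceph osd safe-to-destroy $ID; do sleep 60; done
    systemctl stop ceph-osd@$ID
    ceph-volume lvm zap $DEVICE
    ceph osd destroy $ID --yes-i-really-mean-it
    ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID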

I think it's related to this:
http://tracker.ceph.com/issues/22642 

# ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
Running command: sudo vgcreate --force --yes 
ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
 stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
enabling it!
 stdout: Physical volume "/dev/sdc" successfully created
 stdout: Volume group "ceph-efad7df8-721d-43d8-8d02-449406e70b90" successfully 
created
Running command: sudo lvcreate --yes -l 100%FREE -n 
osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9 
ceph-efad7df8-721d-43d8-8d02-449406e70b90
 stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
enabling it!
 stdout: Logical volume "osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9" 
created.
Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
Running command: chown -R ceph:ceph /dev/dm-4
Running command: sudo ln -s 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
 /var/lib/ceph/osd/ceph-0/block
Running command: sudo ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 
/var/lib/ceph/osd/ceph-0/activate.monmap
 stderr: got monmap epoch 2
Running command: ceph-authtool /var/lib/ceph/osd/ceph-0/keyring 
--create-keyring --name osd.0 --add-key 
 stdout: creating /var/lib/ceph/osd/ceph-0/keyring
 stdout: added entity osd.0 auth auth(auid = 18446744073709551615 key=  
with 0 caps)
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore 
--mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --key 
 --osd-data /var/lib/ceph/osd/ceph-0/ 
--osd-uuid 138ce507-f28a-45bf-814c-7fa124a9d9b9 --setuser ceph --setgroup ceph
 stderr: 2018-01-26 14:59:10.039549 7fd7ef951cc0 -1 
bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
of struct encoding
 stderr: 2018-01-26 14:59:10.039744 7fd7ef951cc0 -1 
bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
of struct encoding
 stderr: 2018-01-26 14:59:10.039925 7fd7ef951cc0 -1 
bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
of struct encoding
 stderr: 2018-01-26 14:59:10.039984 7fd7ef951cc0 -1 
bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
 stderr: 2018-01-26 14:59:11.359951 7fd7ef951cc0 -1 key 
 stderr: 2018-01-26 14:59:11.888476 7fd7ef951cc0 -1 created object store 
/var/lib/ceph/osd/ceph-0/ for osd.0 fsid efad7df8-721d-43d8-8d02-449406e70b90
Running command: sudo ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
 --path /var/lib/ceph/osd/ceph-0
Running command: sudo ln -snf 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
 /var/lib/ceph/osd/ceph-0/block
Running command: chown -R ceph:ceph /dev/dm-4
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
Running command: sudo systemctl enable 
ceph-volume@lvm-0-138ce507-f28a-45bf-814c-7fa124a9d9b9
 stderr: Created symlink from 
/etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-0-138ce507-f28a-45bf-814c-7fa124a9d9b9.service
 to /lib/systemd/system/ceph-volume@.service.
Running command: sudo systemctl start ceph-osd@0

ceph-osd.0.log shows:

2018-01-26 15:09:07.379039 7f545d3b9cc0  4 rocksdb: 
[/build/ceph-12.2.2/src/rocksdb/db/version_set.cc:2859] Recovered from manifest 
file:db/MANIFEST-95 succeeded,manifest_file_number is 95, next_file_number 
is 97, last_sequence is 21, log_number is 0,prev_log_number is 
0,max_column_family is 0

2018-01-26 15:09:07.379046 7f545d3b9cc0  4 rocksdb: 
[/build/ceph-12.2.2/src/rocksdb/db/version_set.cc:2867] Column family [default] 
(ID 0), log number is 94

[ceph-users] Bluefs WAL : bluefs _allocate failed to allocate on bdev 0

2018-01-26 Thread Dietmar Rieder
Hi all,

I have a question regarding the bluestore wal/db:


We are running a 10 OSD node + 3 MON/MDS node cluster (luminous 12.2.2).
Each OSD node has 22xHDD (8TB) OSDs, 2xSSD (1.6TB) OSDs and 2xNVME (800
GB) for bluestore wal and db.

We have separated wal and db partitions
wal partitions are 1GB
db partitions are 64GB

The cluster is providing cephfs from one HDD (EC 6+3) and one SSD
(3xrep) pool.
Since the cluster is "new" we have not much data ~30TB (HDD EC) and
~140GB (SSD rep) stored on it yet.

I just noticed that the wal db usage for the SSD OSDs is all more or
less equal at ~518MB. The wal db usage for the HDD OSDs is also quite
balanced at 284-306MB; however, there is one OSD whose wal db usage is ~1GB:


   "bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 68719468544,
"db_used_bytes": 1114636288,
"wal_total_bytes": 1073737728,
"wal_used_bytes": 1072693248,
"slow_total_bytes": 320057901056,
"slow_used_bytes": 0,
"num_files": 16,
"log_bytes": 862326784,
"log_compactions": 0,
"logged_bytes": 850575360,
"files_written_wal": 2,
"files_written_sst": 9,
"bytes_written_wal": 744469265,
"bytes_written_sst": 568855830
},


and I got the following log entries:

2018-01-26 16:31:05.484284 7f65ea28a700  1 bluefs _allocate failed to
allocate 0x40 on bdev 0, free 0xff000; fallback to bdev 1

Is there any reason for this difference (~300MB vs 1GB)?
I had in mind that 1GB of wal should be enough, and that old logs should be
purged to free space. (Can this be triggered manually?)
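
For context, the numbers above come from the OSD admin socket; a small sketch
of how they can be pulled (osd id is a placeholder, and jq is assumed to be
installed):

    ceph daemon osd.<id> perf dump bluefs
    # or compute the usage percentages directly:
    ceph daemon osd.<id> perf dump | jq '.bluefs |
        {wal_pct: (.wal_used_bytes / .wal_total_bytes * 100),
         db_pct:  (.db_used_bytes  / .db_total_bytes  * 100),
         slow_used_bytes}'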

Could this be related to the fact that the HDD OSD in question was
failing some week ago and we replaced it with with a new HDD?

Do we have to expect problems/performance reductions with the fallback
to bdev 1?

Thanks for any clarifying comment
   Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Peter Linder
Ok, by randomly toggling settings *MOST* of the PGs in the test cluster 
are online, but a few are not. No matter how much I change, a few of them 
seem to never activate. They are running bluestore with version 12.2.2, I 
think created with ceph-volume.


Here is the output from 'ceph pg X query' for one that won't activate (it 
was active before but got remapped due to one of my changes). What should 
I look for, and where should I look next, to understand what is going on?
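
For reference, the checks that usually narrow this down are comparing the
pool's crush rule against the device classes of the OSDs in the pg's up set
(a sketch; pool id 3 and the up set [6,0,12] are taken from the query below):

ceph pg dump_stuck inactive               # any pgs stuck activating/inactive
ceph osd pool ls detail | grep 'pool 3 '  # which crush_rule pool 3 uses
ceph osd crush rule dump                  # what that rule actually selects
ceph osd tree | egrep 'osd\.(0|6|12) '    # device class of each OSD in the up set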



# ceph pg 3.12c query
{
    "state": "activating",
    "snap_trimq": "[]",
    "epoch": 918,
    "up": [
    6,
    0,
    12
    ],
    "acting": [
    6,
    0,
    12
    ],
    "actingbackfill": [
    "0",
    "6",
    "12"
    ],
    "info": {
    "pgid": "3.12c",
    "last_update": "0'0",
    "last_complete": "0'0",
    "log_tail": "0'0",
    "last_user_version": 0,
    "last_backfill": "MAX",
    "last_backfill_bitwise": 0,
    "purged_snaps": [],
    "history": {
    "epoch_created": 314,
    "epoch_pool_created": 314,
    "last_epoch_started": 862,
    "last_interval_started": 860,
    "last_epoch_clean": 862,
    "last_interval_clean": 860,
    "last_epoch_split": 0,
    "last_epoch_marked_full": 0,
    "same_up_since": 872,
    "same_interval_since": 915,
    "same_primary_since": 789,
    "last_scrub": "0'0",
    "last_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_deep_scrub": "0'0",
    "last_deep_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_clean_scrub_stamp": "2018-01-26 13:29:35.010846"
    },
    "stats": {
    "version": "0'0",
    "reported_seq": "427",
    "reported_epoch": "918",
    "state": "activating",
    "last_fresh": "2018-01-26 17:26:39.603121",
    "last_change": "2018-01-26 17:26:36.161131",
    "last_active": "2018-01-26 17:25:09.770406",
    "last_peered": "2018-01-26 17:24:17.510532",
    "last_clean": "2018-01-26 17:24:17.510532",
    "last_became_active": "2018-01-26 17:24:09.211916",
    "last_became_peered": "2018-01-26 17:24:09.211916",
    "last_unstale": "2018-01-26 17:26:39.603121",
    "last_undegraded": "2018-01-26 17:26:39.603121",
    "last_fullsized": "2018-01-26 17:26:39.603121",
    "mapping_epoch": 915,
    "log_start": "0'0",
    "ondisk_log_start": "0'0",
    "created": 314,
    "last_epoch_clean": 862,
    "parent": "0.0",
    "parent_split_bits": 0,
    "last_scrub": "0'0",
    "last_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_deep_scrub": "0'0",
    "last_deep_scrub_stamp": "2018-01-26 13:29:35.010846",
    "last_clean_scrub_stamp": "2018-01-26 13:29:35.010846",
    "log_size": 0,
    "ondisk_log_size": 0,
    "stats_invalid": false,
    "dirty_stats_invalid": false,
    "omap_stats_invalid": false,
    "hitset_stats_invalid": false,
    "hitset_bytes_stats_invalid": false,
    "pin_stats_invalid": false,
    "stat_sum": {
    "num_bytes": 0,
    "num_objects": 0,
    "num_object_clones": 0,
    "num_object_copies": 0,
    "num_objects_missing_on_primary": 0,
    "num_objects_missing": 0,
    "num_objects_degraded": 0,
    "num_objects_misplaced": 0,
    "num_objects_unfound": 0,
    "num_objects_dirty": 0,
    "num_whiteouts": 0,
    "num_read": 0,
    "num_read_kb": 0,
    "num_write": 0,
    "num_write_kb": 0,
    "num_scrub_errors": 0,
    "num_shallow_scrub_errors": 0,
    "num_deep_scrub_errors": 0,
    "num_objects_recovered": 0,
    "num_bytes_recovered": 0,
    "num_keys_recovered": 0,
    "num_objects_omap": 0,
    "num_objects_hit_set_archive": 0,
    "num_bytes_hit_set_archive": 0,
    "num_flush": 0,
    "num_flush_kb": 0,
    "num_evict": 0,
    "num_evict_kb": 0,
    "num_promote": 0,
    "num_flush_mode_high": 0,
    "num_flush_mode_low": 0,
    "num_evict_mode_some": 0,
    "num_evict_mode_full": 0,
    "num_objects_pinned": 0,
    "num_legacy_snapsets": 0
    },
    "up": [
    6,
    0,
    12
    ],
    "acting": [
    6,
    0,
    12
    ],
    "blocked_by": [],
    "up_primary": 6,
    "acting_primary": 6
    },
    "empt

Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread Reed Dier
This is the exact issue that I ran into when starting my bluestore conversion 
journey.

See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html 


Specifying --osd-id causes it to fail.

Below are my steps for OSD replace/migrate from filestore to bluestore.

BIG caveat here in that I am doing destructive replacement, in that I am not 
allowing my objects to be migrated off of the OSD I’m replacing before nuking 
it.
With 8TB drives it just takes way too long, and I trust my failure domains and 
other hardware to get me through the backfills.
So instead of 1) reading data off, writing data elsewhere 2) remove/re-add 3) 
reading data elsewhere, writing back on, I am taking step one out, and trusting 
my two other copies of the objects. Just wanted to clarify my steps.

I also set norecover and norebalance flags immediately prior to running these 
commands so that it doesn’t try to start moving data unnecessarily. Then when 
done, remove those flags, and let it backfill.

> systemctl stop ceph-osd@$ID.service
> ceph-osd -i $ID --flush-journal
> umount /var/lib/ceph/osd/ceph-$ID
> ceph-volume lvm zap /dev/$ID
> ceph osd crush remove osd.$ID
> ceph auth del osd.$ID
> ceph osd rm osd.$ID
> ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME

So essentially I fully remove the OSD from crush and the osdmap, and when I add 
the OSD back, like I would a new OSD, it fills in the numeric gap with the $ID 
it had before.
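
For what it's worth, on Luminous the crush remove / auth del / osd rm triple
can usually be collapsed into a single purge call (a sketch, with the same
caveats as the steps above):

ceph osd purge $ID --yes-i-really-mean-it   # removes the crush entry, auth key and osd id in one go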

Hope this is helpful.
Been working well for me so far, doing 3 OSDs at a time (half of a failure 
domain).

Reed

> On Jan 26, 2018, at 10:01 AM, David  wrote:
> 
> 
> Hi!
> 
> On luminous 12.2.2
> 
> I'm migrating some OSDs from filestore to bluestore using the "simple" method 
> as described in docs: 
> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds
>  
> 
> Mark out and Replace.
> 
> However, at 9.: ceph-volume create --bluestore --data $DEVICE --osd-id $ID
> it seems to create the bluestore but it fails to authenticate with the old 
> osd-id auth.
> (the command above is also missing lvm or simple)
> 
> I think it's related to this:
> http://tracker.ceph.com/issues/22642 
> 
> # ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
> Running command: sudo vgcreate --force --yes 
> ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
>  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
> enabling it!
>  stdout: Physical volume "/dev/sdc" successfully created
>  stdout: Volume group "ceph-efad7df8-721d-43d8-8d02-449406e70b90" 
> successfully created
> Running command: sudo lvcreate --yes -l 100%FREE -n 
> osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9 
> ceph-efad7df8-721d-43d8-8d02-449406e70b90
>  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
> enabling it!
>  stdout: Logical volume "osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9" 
> created.
> Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
> Running command: chown -R ceph:ceph /dev/dm-4
> Running command: sudo ln -s 
> /dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
>  /var/lib/ceph/osd/ceph-0/block
> Running command: sudo ceph --cluster ceph --name client.bootstrap-osd 
> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 
> /var/lib/ceph/osd/ceph-0/activate.monmap
>  stderr: got monmap epoch 2
> Running command: ceph-authtool /var/lib/ceph/osd/ceph-0/keyring 
> --create-keyring --name osd.0 --add-key 
>  stdout: creating /var/lib/ceph/osd/ceph-0/keyring
>  stdout: added entity osd.0 auth auth(auid = 18446744073709551615 key= 
>  with 0 caps)
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
> Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore 
> --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --key 
>  --osd-data /var/lib/ceph/osd/ceph-0/ 
> --osd-uuid 138ce507-f28a-45bf-814c-7fa124a9d9b9 --setuser ceph --setgroup ceph
>  stderr: 2018-01-26 14:59:10.039549 7fd7ef951cc0 -1 
> bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
> label at offset 102: buffer::malformed_input: void 
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
> of struct encoding
>  stderr: 2018-01-26 14:59:10.039744 7fd7ef951cc0 -1 
> bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
> label at offset 102: buffer::malformed_input: void 
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
> of struct encoding
>  stderr: 2018-01-26 14:59:10.039925 7fd7ef951cc0 -1 
> bluestore(/var/l

Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread David Majchrzak
Thanks that helped!

Since I had already "halfway" created an LVM volume I wanted to start from the 
beginning and zap it.

Tried to zap the raw device but failed since --destroy doesn't seem to be in 
12.2.2

http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/ 


root@int1:~# ceph-volume lvm zap /dev/sdc --destroy
usage: ceph-volume lvm zap [-h] [DEVICE]
ceph-volume lvm zap: error: unrecognized arguments: --destroy

So i zapped it with the vg/lvm instead.
ceph-volume lvm zap 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9

However I couldn't run create on it since the LVM was already there.
So I zapped it with sgdisk and ran dmsetup remove. After that I was able to 
create it again.

However - each "ceph-volume lvm create" that I ran and that failed still 
successfully added an osd to the crush map ;)

So I've got this now:

root@int1:~# ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS TYPE NAME
-1   2.60959- 2672G  1101G  1570G 41.24 1.00   - root default
-2   0.87320-  894G   369G   524G 41.36 1.00   - host int1
 3   ssd 0.43660  1.0  447G   358G 90295M 80.27 1.95 301 osd.3
 8   ssd 0.43660  1.0  447G 11273M   436G  2.46 0.06  19 osd.8
-3   0.86819-  888G   366G   522G 41.26 1.00   - host int2
 1   ssd 0.43159  1.0  441G   167G   274G 37.95 0.92 147 osd.1
 4   ssd 0.43660  1.0  447G   199G   247G 44.54 1.08 173 osd.4
-4   0.86819-  888G   365G   523G 41.09 1.00   - host int3
 2   ssd 0.43159  1.0  441G   193G   248G 43.71 1.06 174 osd.2
 5   ssd 0.43660  1.0  447G   172G   274G 38.51 0.93 146 osd.5
 0 00 0  0  0 00   0 osd.0
 6 00 0  0  0 00   0 osd.6
 7 00 0  0  0 00   0 osd.7

I guess I can just remove them from crush, auth and rm them?

Kind Regards,

David Majchrzak

> 26 jan. 2018 kl. 18:09 skrev Reed Dier :
> 
> This is the exact issue that I ran into when starting my bluestore conversion 
> journey.
> 
> See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html 
> 
> 
> Specifying --osd-id causes it to fail.
> 
> Below are my steps for OSD replace/migrate from filestore to bluestore.
> 
> BIG caveat here in that I am doing destructive replacement, in that I am not 
> allowing my objects to be migrated off of the OSD I’m replacing before nuking 
> it.
> With 8TB drives it just takes way too long, and I trust my failure domains 
> and other hardware to get me through the backfills.
> So instead of 1) reading data off, writing data elsewhere 2) remove/re-add 3) 
> reading data elsewhere, writing back on, I am taking step one out, and 
> trusting my two other copies of the objects. Just wanted to clarify my steps.
> 
> I also set norecover and norebalance flags immediately prior to running these 
> commands so that it doesn’t try to start moving data unnecessarily. Then when 
> done, remove those flags, and let it backfill.
> 
>> systemctl stop ceph-osd@$ID.service
>> ceph-osd -i $ID --flush-journal
>> umount /var/lib/ceph/osd/ceph-$ID
>> ceph-volume lvm zap /dev/$ID
>> ceph osd crush remove osd.$ID
>> ceph auth del osd.$ID
>> ceph osd rm osd.$ID
>> ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME
> 
> So essentially I fully remove the OSD from crush and the osdmap, and when I 
> add the OSD back, like I would a new OSD, it fills in the numeric gap with 
> the $ID it had before.
> 
> Hope this is helpful.
> Been working well for me so far, doing 3 OSDs at a time (half of a failure 
> domain).
> 
> Reed
> 
>> On Jan 26, 2018, at 10:01 AM, David > > wrote:
>> 
>> 
>> Hi!
>> 
>> On luminous 12.2.2
>> 
>> I'm migrating some OSDs from filestore to bluestore using the "simple" 
>> method as described in docs: 
>> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds
>>  
>> 
>> Mark out and Replace.
>> 
>> However, at 9.: ceph-volume create --bluestore --data $DEVICE --osd-id $ID
>> it seems to create the bluestore but it fails to authenticate with the old 
>> osd-id auth.
>> (the command above is also missing lvm or simple)
>> 
>> I think it's related to this:
>> http://tracker.ceph.com/issues/22642 
>> 
>> # ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
>> Running command: sudo vgcreate --force --yes 
>> ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
>>  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
>> enabling it!
>>  stdout: Physical volume "/dev/sdc" successfully created
>>  stdout: Volume group "ceph-efad7df8-

Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread David Majchrzak
Ran:

ceph auth del osd.0
ceph auth del osd.6
ceph auth del osd.7
ceph osd rm osd.0
ceph osd rm osd.6
ceph osd rm osd.7

which seems to have removed them.

Thanks for the help Reed!

Kind Regards,
David Majchrzak


> 26 jan. 2018 kl. 18:32 skrev David Majchrzak :
> 
> Thanks that helped!
> 
> Since I had already "halfway" created a lvm volume I wanted to start from the 
> beginning and zap it.
> 
> Tried to zap the raw device but failed since --destroy doesn't seem to be in 
> 12.2.2
> 
> http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/ 
> 
> 
> root@int1:~# ceph-volume lvm zap /dev/sdc --destroy
> usage: ceph-volume lvm zap [-h] [DEVICE]
> ceph-volume lvm zap: error: unrecognized arguments: --destroy
> 
> So i zapped it with the vg/lvm instead.
> ceph-volume lvm zap 
> /dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
> 
> However I run create on it since the LVM was already there.
> So I zapped it with sgdisk and ran dmsetup remove. After that I was able to 
> create it again.
> 
> However - each "ceph-volume lvm create" that I ran that failed, successfully 
> added an osd to crush map ;)
> 
> So I've got this now:
> 
> root@int1:~# ceph osd df tree
> ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS TYPE NAME
> -1   2.60959- 2672G  1101G  1570G 41.24 1.00   - root default
> -2   0.87320-  894G   369G   524G 41.36 1.00   - host int1
>  3   ssd 0.43660  1.0  447G   358G 90295M 80.27 1.95 301 osd.3
>  8   ssd 0.43660  1.0  447G 11273M   436G  2.46 0.06  19 osd.8
> -3   0.86819-  888G   366G   522G 41.26 1.00   - host int2
>  1   ssd 0.43159  1.0  441G   167G   274G 37.95 0.92 147 osd.1
>  4   ssd 0.43660  1.0  447G   199G   247G 44.54 1.08 173 osd.4
> -4   0.86819-  888G   365G   523G 41.09 1.00   - host int3
>  2   ssd 0.43159  1.0  441G   193G   248G 43.71 1.06 174 osd.2
>  5   ssd 0.43660  1.0  447G   172G   274G 38.51 0.93 146 osd.5
>  0 00 0  0  0 00   0 osd.0
>  6 00 0  0  0 00   0 osd.6
>  7 00 0  0  0 00   0 osd.7
> 
> I guess I can just remove them from crush,auth and rm them?
> 
> Kind Regards,
> 
> David Majchrzak
> 
>> 26 jan. 2018 kl. 18:09 skrev Reed Dier > >:
>> 
>> This is the exact issue that I ran into when starting my bluestore 
>> conversion journey.
>> 
>> See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html 
>> 
>> 
>> Specifying --osd-id causes it to fail.
>> 
>> Below are my steps for OSD replace/migrate from filestore to bluestore.
>> 
>> BIG caveat here in that I am doing destructive replacement, in that I am not 
>> allowing my objects to be migrated off of the OSD I’m replacing before 
>> nuking it.
>> With 8TB drives it just takes way too long, and I trust my failure domains 
>> and other hardware to get me through the backfills.
>> So instead of 1) reading data off, writing data elsewhere 2) remove/re-add 
>> 3) reading data elsewhere, writing back on, I am taking step one out, and 
>> trusting my two other copies of the objects. Just wanted to clarify my steps.
>> 
>> I also set norecover and norebalance flags immediately prior to running 
>> these commands so that it doesn’t try to start moving data unnecessarily. 
>> Then when done, remove those flags, and let it backfill.
>> 
>>> systemctl stop ceph-osd@$ID.service 
>>> ceph-osd -i $ID --flush-journal
>>> umount /var/lib/ceph/osd/ceph-$ID
>>> ceph-volume lvm zap /dev/$ID
>>> ceph osd crush remove osd.$ID
>>> ceph auth del osd.$ID
>>> ceph osd rm osd.$ID
>>> ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME
>> 
>> So essentially I fully remove the OSD from crush and the osdmap, and when I 
>> add the OSD back, like I would a new OSD, it fills in the numeric gap with 
>> the $ID it had before.
>> 
>> Hope this is helpful.
>> Been working well for me so far, doing 3 OSDs at a time (half of a failure 
>> domain).
>> 
>> Reed
>> 
>>> On Jan 26, 2018, at 10:01 AM, David >> > wrote:
>>> 
>>> 
>>> Hi!
>>> 
>>> On luminous 12.2.2
>>> 
>>> I'm migrating some OSDs from filestore to bluestore using the "simple" 
>>> method as described in docs: 
>>> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds
>>>  
>>> 
>>> Mark out and Replace.
>>> 
>>> However, at 9.: ceph-volume create --bluestore --data $DEVICE --osd-id $ID
>>> it seems to create the bluestore but it fails to authenticate with the old 
>>> osd-id auth.
>>> (the command above is also missing lvm or

Re: [ceph-users] Importance of Stable Mon and OSD IPs

2018-01-26 Thread Mayank Kumar
Resending in case this email was lost

On Tue, Jan 23, 2018 at 10:50 PM Mayank Kumar  wrote:

> Thanks Burkhard for the detailed explanation. Regarding the following:-
>
> >>>The ceph client (librbd accessing a volume in this case) gets
> asynchronous notification from the ceph mons in case of relevant changes,
> e.g. updates to the osd map reflecting the failure of an OSD.
> i have some more questions:-
> 1:  Does the asynchronous notification for both osdmap and monmap comes
> from mons ?
> 2:  Are these asynchronous notifications retriable ?
> 3: Is it possible that the these asynchronous notifications are lost  ?
> 4: Does the monmap and osdmap reside in the kernel or user space ? The
> reason i am asking is , for a rbd volume that is already mounted on a host,
> will it continue to receive those asynchronoous notifications for changes
> to both osd and mon ips or not ? If All mon ips change,  but the mon
> configuration file is updated to reflect the new mon ips, should the
> existing rbd volume mounted still be able to contact the osd's and mons or
> is there some form of caching in the kernel space for an already mounted
> rbd volume
>
>
> Some more context for why i am getting all these doubts:-
> We internally had a ceph cluster with rbd volumes being provisioned by
> Kubernetes. With existing rbd volumes already mounted , we wiped out the
> old ceph cluster and created a brand new ceph cluster . But the existing
> rbd volumes from the old cluster still remained. Any kubernetes pods that
> landed on the same host as an old rbd volume would not create because the
> volume failed to attach and mount. Looking at the kernel messages we saw
> the following:-
>
> -- Logs begin at Fri 2018-01-19 02:05:38 GMT, end at Fri 2018-01-19
> 19:23:14 GMT. --
>
> Jan 19 19:20:39 host1.com kernel: *libceph: osd2 10.231.171.131:6808
>  socket closed (con state CONNECTING)*
>
> Jan 19 19:18:30 host1.com kernel: *libceph: osd28 10.231.171.52:6808
>  socket closed (con state CONNECTING)*
>
> Jan 19 19:18:30 host1.com kernel: *libceph: osd0 10.231.171.131:6800
>  socket closed (con state CONNECTING)*
>
> Jan 19 19:15:40 host1.com kernel: *libceph: osd21 10.231.171.99:6808
>  wrong peer at address*
>
> Jan 19 19:15:40 host1.com kernel: *libceph: wrong peer,
> want 10.231.171.99:6808/42661 ,
> got 10.231.171.99:6808/73168 *
>
> Jan 19 19:15:34 host1.com kernel: *libceph: osd11 10.231.171.114:6816
>  wrong peer at address*
>
> Jan 19 19:15:34 host1.com kernel: *libceph: wrong peer,
> want 10.231.171.114:6816/130908 ,
> got 10.231.171.114:6816/85562 *
>
> The Ceph cluster had new osd ip and mon ips.
>
> So my questions, since these messages are coming from the kernel module,
> why cant the kernel module figure out that the mon and osd ips have
> changed. Is there some caching in the kernel ? when rbd create/attach is
> called on that host, it is passed new mon ips , so doesnt that update the
> old already mounted rbd volumes.
>
> Hope i made my doubts clear and yes i am a beginner in Ceph with very
> limited knowledge.
>
> Thanks for your help again
> Mayank
>
>
> On Tue, Jan 23, 2018 at 1:24 AM, Burkhard Linke <
> burkhard.li...@computational.bio.uni-giessen.de> wrote:
>
>> Hi,
>>
>>
>> On 01/23/2018 09:53 AM, Mayank Kumar wrote:
>>
>>> Hi Ceph Experts
>>>
>>> I am a new user of Ceph and currently using Kubernetes to deploy Ceph
>>> RBD Volumes. We our doing some initial work rolling it out to internal
>>> customers and in doing that we are using the ip of the host as the ip of
>>> the osd and mons. This means if a host goes down , we loose that ip. While
>>> we are still experimenting with these behaviors, i wanted to see what the
>>> community thinks for the following scenario :-
>>>
>>> 1: a rbd volume is already attached and mounted on host A
>>> 2: the osd on which this rbd volume resides, dies and never comes back up
>>> 3: another osd is replaced in its place. I dont know the intricacies
>>> here, but i am assuming the data for this rbd volume either moves to
>>> different osd's or goes back to the newly installed osd
>>> 4: the new osd has completley new ip
>>> 5: will the rbd volume attached to host A learn the new osd ip on which
>>> its data resides and everything just continues to work ?
>>>
>>> What if all the mons also have changed ip ?
>>>
>> A volume does not reside "on a osd". The volume is striped, and each
>> strip is stored in a placement group; the placement group on the other hand
>> is distributed to several OSDs depending on the crush rules and the number
>> of replicates.
>>
>> If an OSD dies, ceph will backfill the now missing replicates to another
>> OSD, given another OSD satisfying the crush rules is available. The same
>

Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread Wido den Hollander



On 01/26/2018 06:37 PM, David Majchrzak wrote:

Ran:

ceph auth del osd.0
ceph auth del osd.6
ceph auth del osd.7
ceph osd rm osd.0
ceph osd rm osd.6
ceph osd rm osd.7

which seems to have removed them.



Did you destroy the OSD prior to running ceph-volume?

$ ceph osd destroy 6

After you've done that you can use ceph-volume to re-create the OSD.

Wido


Thanks for the help Reed!

Kind Regards,
David Majchrzak


26 jan. 2018 kl. 18:32 skrev David Majchrzak >:


Thanks that helped!

Since I had already "halfway" created a lvm volume I wanted to start 
from the beginning and zap it.


Tried to zap the raw device but failed since --destroy doesn't seem to 
be in 12.2.2


http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/

root@int1:~# ceph-volume lvm zap /dev/sdc --destroy
usage: ceph-volume lvm zap [-h] [DEVICE]
ceph-volume lvm zap: error: unrecognized arguments: --destroy

So i zapped it with the vg/lvm instead.
ceph-volume lvm zap 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9


However I run create on it since the LVM was already there.
So I zapped it with sgdisk and ran dmsetup remove. After that I was 
able to create it again.


However - each "ceph-volume lvm create" that I ran that failed, 
successfully added an osd to crush map ;)


So I've got this now:

root@int1:~# ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL  %USE  VAR  PGS TYPE NAME
-1       2.60959        - 2672G  1101G  1570G 41.24 1.00   - root default
-2       0.87320        -  894G   369G   524G 41.36 1.00   -     host int1
 3   ssd 0.43660  1.0  447G   358G 90295M 80.27 1.95 301         osd.3
 8   ssd 0.43660  1.0  447G 11273M   436G  2.46 0.06  19         osd.8
-3       0.86819        -  888G   366G   522G 41.26 1.00   -     host int2
 1   ssd 0.43159  1.0  441G   167G   274G 37.95 0.92 147         osd.1
 4   ssd 0.43660  1.0  447G   199G   247G 44.54 1.08 173         osd.4
-4       0.86819        -  888G   365G   523G 41.09 1.00   -     host int3
 2   ssd 0.43159  1.0  441G   193G   248G 43.71 1.06 174         osd.2
 5   ssd 0.43660  1.0  447G   172G   274G 38.51 0.93 146         osd.5
 0             0        0     0      0      0     0    0   0 osd.0
 6             0        0     0      0      0     0    0   0 osd.6
 7             0        0     0      0      0     0    0   0 osd.7

I guess I can just remove them from crush,auth and rm them?

Kind Regards,

David Majchrzak

26 jan. 2018 kl. 18:09 skrev Reed Dier >:


This is the exact issue that I ran into when starting my bluestore 
conversion journey.


See my thread here: 
https://www.spinics.net/lists/ceph-users/msg41802.html


Specifying --osd-id causes it to fail.

Below are my steps for OSD replace/migrate from filestore to bluestore.

BIG caveat here in that I am doing destructive replacement, in that I 
am not allowing my objects to be migrated off of the OSD I’m 
replacing before nuking it.
With 8TB drives it just takes way too long, and I trust my failure 
domains and other hardware to get me through the backfills.
So instead of 1) reading data off, writing data elsewhere 2) 
remove/re-add 3) reading data elsewhere, writing back on, I am taking 
step one out, and trusting my two other copies of the objects. Just 
wanted to clarify my steps.


I also set norecover and norebalance flags immediately prior to 
running these commands so that it doesn’t try to start moving data 
unnecessarily. Then when done, remove those flags, and let it backfill.



systemctl stop ceph-osd@$ID.service 
ceph-osd -i $ID --flush-journal
umount /var/lib/ceph/osd/ceph-$ID
ceph-volume lvm zap /dev/$ID
ceph osd crush remove osd.$ID
ceph auth del osd.$ID
ceph osd rm osd.$ID
ceph-volume lvm create --bluestore --data /dev/$DATA --block.db 
/dev/$NVME


So essentially I fully remove the OSD from crush and the osdmap, and 
when I add the OSD back, like I would a new OSD, it fills in the 
numeric gap with the $ID it had before.


Hope this is helpful.
Been working well for me so far, doing 3 OSDs at a time (half of a 
failure domain).


Reed

On Jan 26, 2018, at 10:01 AM, David > wrote:



Hi!

On luminous 12.2.2

I'm migrating some OSDs from filestore to bluestore using the 
"simple" method as described in docs: 
http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds

Mark out and Replace.

However, at 9.: ceph-volume create --bluestore --data $DEVICE 
--osd-id $ID
it seems to create the bluestore but it fails to authenticate with 
the old osd-id auth.

(the command above is also missing lvm or simple)

I think it's related to this:
http://tracker.ceph.com/issues/22642

# ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
Running command: sudo vgcreate --force --yes 
ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
 stderr: WARNING: lvmetad is running bu

Re: [ceph-users] How ceph client read data from ceph cluster

2018-01-26 Thread shadow_lin
Hi Maged,
I just want to make sure I understand how the ceph client reads from the cluster. So 
with the current version of ceph (12.2.2) the client only reads from the primary 
osd (one copy), is that true?

2018-01-27 


lin.yunfan



From: Maged Mokhtar
Sent: 2018-01-26 20:27
Subject: Re: [ceph-users] How ceph client read data from ceph cluster
To: "shadow_lin"
Cc: "ceph-users"



On 2018-01-26 09:09, shadow_lin wrote:
Hi List,
I read an old article about how the ceph client reads from the ceph cluster. It said the 
client only reads from the primary osd. Since a ceph cluster in replicated mode 
has several copies of the data, reading from only one copy seems to waste the 
performance of concurrent reads from all the copies.
But that article is rather old, so maybe ceph has improved to read from all the 
copies? But I haven't found any info about that.
Any info about that would be appreciated.
Thanks
2018-01-26


shadow_lin




Hi,
In the majority of cases you will have more concurrent IO requests than disks, so 
the load will already be distributed evenly. If this is not the case and you 
have a large cluster with fewer clients, you may consider using object/rbd 
striping so each IO will be divided into requests to different OSDs.
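
A striped image can be created with explicit striping parameters, e.g. (a
sketch; the pool and image names and the values are placeholders, and note
that at the time of writing the kernel rbd client does not support stripe
counts above 1, so this applies to librbd clients):

rbd create rbd/striped-vol --size 102400 --object-size 4M --stripe-unit 65536 --stripe-count 8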
Maged


Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread David Majchrzak
I did do that.
It didn't add the auth key to ceph, so I had to do that manually. Then it said 
that osd.0 was set as destroyed, which yes, it was still in crushmap.

I followed the docs to a point.


> 26 jan. 2018 kl. 18:50 skrev Wido den Hollander :
> 
> 
> 
> On 01/26/2018 06:37 PM, David Majchrzak wrote:
>> Ran:
>> ceph auth del osd.0
>> ceph auth del osd.6
>> ceph auth del osd.7
>> ceph osd rm osd.0
>> ceph osd rm osd.6
>> ceph osd rm osd.7
>> which seems to have removed them.
> 
> Did you destroy the OSD prior to running ceph-volume?
> 
> $ ceph osd destroy 6
> 
> After you've done that you can use ceph-volume to re-create the OSD.
> 
> Wido
> 
>> Thanks for the help Reed!
>> Kind Regards,
>> David Majchrzak
>>> 26 jan. 2018 kl. 18:32 skrev David Majchrzak >> >:
>>> 
>>> Thanks that helped!
>>> 
>>> Since I had already "halfway" created a lvm volume I wanted to start from 
>>> the beginning and zap it.
>>> 
>>> Tried to zap the raw device but failed since --destroy doesn't seem to be 
>>> in 12.2.2
>>> 
>>> http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/
>>> 
>>> root@int1:~# ceph-volume lvm zap /dev/sdc --destroy
>>> usage: ceph-volume lvm zap [-h] [DEVICE]
>>> ceph-volume lvm zap: error: unrecognized arguments: --destroy
>>> 
>>> So i zapped it with the vg/lvm instead.
>>> ceph-volume lvm zap 
>>> /dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
>>> 
>>> However I run create on it since the LVM was already there.
>>> So I zapped it with sgdisk and ran dmsetup remove. After that I was able to 
>>> create it again.
>>> 
>>> However - each "ceph-volume lvm create" that I ran that failed, 
>>> successfully added an osd to crush map ;)
>>> 
>>> So I've got this now:
>>> 
>>> root@int1:~# ceph osd df tree
>>> ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS TYPE NAME
>>> -1   2.60959- 2672G  1101G  1570G 41.24 1.00   - root default
>>> -2   0.87320-  894G   369G   524G 41.36 1.00   - host int1
>>>  3   ssd 0.43660  1.0  447G   358G 90295M 80.27 1.95 301 osd.3
>>>  8   ssd 0.43660  1.0  447G 11273M   436G  2.46 0.06  19 osd.8
>>> -3   0.86819-  888G   366G   522G 41.26 1.00   - host int2
>>>  1   ssd 0.43159  1.0  441G   167G   274G 37.95 0.92 147 osd.1
>>>  4   ssd 0.43660  1.0  447G   199G   247G 44.54 1.08 173 osd.4
>>> -4   0.86819-  888G   365G   523G 41.09 1.00   - host int3
>>>  2   ssd 0.43159  1.0  441G   193G   248G 43.71 1.06 174 osd.2
>>>  5   ssd 0.43660  1.0  447G   172G   274G 38.51 0.93 146 osd.5
>>>  0 00 0  0  0 00   0 osd.0
>>>  6 00 0  0  0 00   0 osd.6
>>>  7 00 0  0  0 00   0 osd.7
>>> 
>>> I guess I can just remove them from crush,auth and rm them?
>>> 
>>> Kind Regards,
>>> 
>>> David Majchrzak
>>> 
 26 jan. 2018 kl. 18:09 skrev Reed Dier >>> >:
 
 This is the exact issue that I ran into when starting my bluestore 
 conversion journey.
 
 See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html
 
 Specifying --osd-id causes it to fail.
 
 Below are my steps for OSD replace/migrate from filestore to bluestore.
 
 BIG caveat here in that I am doing destructive replacement, in that I am 
 not allowing my objects to be migrated off of the OSD I’m replacing before 
 nuking it.
 With 8TB drives it just takes way too long, and I trust my failure domains 
 and other hardware to get me through the backfills.
 So instead of 1) reading data off, writing data elsewhere 2) remove/re-add 
 3) reading data elsewhere, writing back on, I am taking step one out, and 
 trusting my two other copies of the objects. Just wanted to clarify my 
 steps.
 
 I also set norecover and norebalance flags immediately prior to running 
 these commands so that it doesn’t try to start moving data unnecessarily. 
 Then when done, remove those flags, and let it backfill.
 
> systemctl stop ceph-osd@$ID.service 
> ceph-osd -i $ID --flush-journal
> umount /var/lib/ceph/osd/ceph-$ID
> ceph-volume lvm zap /dev/$ID
> ceph osd crush remove osd.$ID
> ceph auth del osd.$ID
> ceph osd rm osd.$ID
> ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME
 
 So essentially I fully remove the OSD from crush and the osdmap, and when 
 I add the OSD back, like I would a new OSD, it fills in the numeric gap 
 with the $ID it had before.
 
 Hope this is helpful.
 Been working well for me so far, doing 3 OSDs at a time (half of a failure 
 domain).
 
 Reed
 
> On Jan 26, 2018, at 10:01 AM, David  > wrot

Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread Wido den Hollander



On 01/26/2018 06:53 PM, David Majchrzak wrote:

I did do that.
It didn't add the auth key to ceph, so I had to do that manually. Then it said 
that osd.0 was set as destroyed, which yes, it was still in crushmap.

I followed the docs to a point.



Odd, the 'destroy' command should remove the auth key. Afterwards 
ceph-volume will use the bootstrap-osd key to create it again.


I didn't try this with ceph-volume yet, but I'm in the process of doing 
the same with ceph-disk going to BlueStore and that works just fine.
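
(For reference, the ceph-disk equivalent of the create step looks roughly like 
this; a sketch, device paths are placeholders:

ceph-disk prepare --bluestore /dev/sdc --block.db /dev/nvme0n1

followed by 'ceph-disk activate /dev/sdc1', or letting udev trigger the 
activation.)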


Wido




26 jan. 2018 kl. 18:50 skrev Wido den Hollander :



On 01/26/2018 06:37 PM, David Majchrzak wrote:

Ran:
ceph auth del osd.0
ceph auth del osd.6
ceph auth del osd.7
ceph osd rm osd.0
ceph osd rm osd.6
ceph osd rm osd.7
which seems to have removed them.


Did you destroy the OSD prior to running ceph-volume?

$ ceph osd destroy 6

After you've done that you can use ceph-volume to re-create the OSD.

Wido


Thanks for the help Reed!
Kind Regards,
David Majchrzak

26 jan. 2018 kl. 18:32 skrev David Majchrzak mailto:da...@visions.se>>:

Thanks that helped!

Since I had already "halfway" created a lvm volume I wanted to start from the 
beginning and zap it.

Tried to zap the raw device but failed since --destroy doesn't seem to be in 
12.2.2

http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/

root@int1:~# ceph-volume lvm zap /dev/sdc --destroy
usage: ceph-volume lvm zap [-h] [DEVICE]
ceph-volume lvm zap: error: unrecognized arguments: --destroy

So i zapped it with the vg/lvm instead.
ceph-volume lvm zap 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9

However I run create on it since the LVM was already there.
So I zapped it with sgdisk and ran dmsetup remove. After that I was able to 
create it again.

However - each "ceph-volume lvm create" that I ran that failed, successfully 
added an osd to crush map ;)

So I've got this now:

root@int1:~# ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS TYPE NAME
-1   2.60959- 2672G  1101G  1570G 41.24 1.00   - root default
-2   0.87320-  894G   369G   524G 41.36 1.00   - host int1
  3   ssd 0.43660  1.0  447G   358G 90295M 80.27 1.95 301 osd.3
  8   ssd 0.43660  1.0  447G 11273M   436G  2.46 0.06  19 osd.8
-3   0.86819-  888G   366G   522G 41.26 1.00   - host int2
  1   ssd 0.43159  1.0  441G   167G   274G 37.95 0.92 147 osd.1
  4   ssd 0.43660  1.0  447G   199G   247G 44.54 1.08 173 osd.4
-4   0.86819-  888G   365G   523G 41.09 1.00   - host int3
  2   ssd 0.43159  1.0  441G   193G   248G 43.71 1.06 174 osd.2
  5   ssd 0.43660  1.0  447G   172G   274G 38.51 0.93 146 osd.5
  0 00 0  0  0 00   0 osd.0
  6 00 0  0  0 00   0 osd.6
  7 00 0  0  0 00   0 osd.7

I guess I can just remove them from crush,auth and rm them?

Kind Regards,

David Majchrzak


26 jan. 2018 kl. 18:09 skrev Reed Dier mailto:reed.d...@focusvq.com>>:

This is the exact issue that I ran into when starting my bluestore conversion 
journey.

See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html

Specifying --osd-id causes it to fail.

Below are my steps for OSD replace/migrate from filestore to bluestore.

BIG caveat here in that I am doing destructive replacement, in that I am not 
allowing my objects to be migrated off of the OSD I’m replacing before nuking 
it.
With 8TB drives it just takes way too long, and I trust my failure domains and 
other hardware to get me through the backfills.
So instead of 1) reading data off, writing data elsewhere 2) remove/re-add 3) 
reading data elsewhere, writing back on, I am taking step one out, and trusting 
my two other copies of the objects. Just wanted to clarify my steps.

I also set norecover and norebalance flags immediately prior to running these 
commands so that it doesn’t try to start moving data unnecessarily. Then when 
done, remove those flags, and let it backfill.


systemctl stop ceph-osd@$ID.service 
ceph-osd -i $ID --flush-journal
umount /var/lib/ceph/osd/ceph-$ID
ceph-volume lvm zap /dev/$ID
ceph osd crush remove osd.$ID
ceph auth del osd.$ID
ceph osd rm osd.$ID
ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME


So essentially I fully remove the OSD from crush and the osdmap, and when I add 
the OSD back, like I would a new OSD, it fills in the numeric gap with the $ID 
it had before.

Hope this is helpful.
Been working well for me so far, doing 3 OSDs at a time (half of a failure 
domain).

Reed


On Jan 26, 2018, at 10:01 AM, David mailto:da...@visions.se>> wrote:


Hi!

On luminous 12.2.2

I'm migrating some OSDs from filestore to bluestore using the "simple" method 
as described in docs: 
http://docs.ceph.com/docs/mas

Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread David Majchrzak
destroy did remove the auth key, however create didn't add the auth, so I had to do 
it manually.
Then I tried to start osd.0 again and it failed because the osdmap said it was 
destroyed.

I've summed my steps below:


Here are my commands prior to create:

root@int1:~# ceph osd out 0

<-- wait for rebalance/recover -->

root@int1:~# ceph osd safe-to-destroy 0
OSD(s) 0 are safe to destroy without reducing data durability.

root@int1:~# systemctl kill ceph-osd@0
root@int1:~# ceph status
  cluster:
id: efad7df8-721d-43d8-8d02-449406e70b90
health: HEALTH_OK

  services:
mon: 3 daemons, quorum int1,int2,int3
mgr: int1(active), standbys: int3, int2
osd: 6 osds: 5 up, 5 in

  data:
pools:   2 pools, 320 pgs
objects: 97038 objects, 364 GB
usage:   1096 GB used, 1128 GB / 2224 GB avail
pgs: 320 active+clean

  io:
client:   289 kB/s rd, 870 kB/s wr, 46 op/s rd, 48 op/s wr

root@int1:~# mount | grep /var/lib/ceph/osd/ceph-0
/dev/sdc1 on /var/lib/ceph/osd/ceph-0 type xfs 
(rw,noatime,attr2,inode64,noquota)
root@int1:~# umount /var/lib/ceph/osd/ceph-0
root@int1:~# ceph-volume lvm zap /dev/sdc
Zapping: /dev/sdc
Running command: sudo wipefs --all /dev/sdc
 stdout: /dev/sdc: 8 bytes were erased at offset 0x0200 (gpt): 45 46 49 20 
50 41 52 54
/dev/sdc: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 
52 54
/dev/sdc: 2 bytes were erased at offset 0x01fe (PMBR): 55 aa
/dev/sdc: calling ioctl to re-read partition table: Success
Running command: dd if=/dev/zero of=/dev/sdc bs=1M count=10
 stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB) copied
 stderr: , 0.0253999 s, 413 MB/s
--> Zapping successful for: /dev/sdc
root@int1:~# ceph osd destroy 0 --yes-i-really-mean-it
destroyed osd.0
root@int1:~# ceph status
  cluster:
id: efad7df8-721d-43d8-8d02-449406e70b90
health: HEALTH_OK

  services:
mon: 3 daemons, quorum int1,int2,int3
mgr: int1(active), standbys: int3, int2
osd: 6 osds: 5 up, 5 in

  data:
pools:   2 pools, 320 pgs
objects: 97038 objects, 364 GB
usage:   1096 GB used, 1128 GB / 2224 GB avail
pgs: 320 active+clean

  io:
client:   56910 B/s rd, 1198 kB/s wr, 15 op/s rd, 48 op/s wr

root@int1:~# ceph-volume create --bluestore --data /dev/sdc --osd-id 0
usage: ceph-volume [-h] [--cluster CLUSTER] [--log-level LOG_LEVEL]
   [--log-path LOG_PATH]
ceph-volume: error: unrecognized arguments: create --bluestore --data /dev/sdc 
--osd-id 0
root@int1:~# ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
Running command: sudo vgcreate --force --yes 
ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
 stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
enabling it!
 stdout: Physical volume "/dev/sdc" successfully created
 stdout: Volume group "ceph-efad7df8-721d-43d8-8d02-449406e70b90" successfully 
created
Running command: sudo lvcreate --yes -l 100%FREE -n 
osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9 
ceph-efad7df8-721d-43d8-8d02-449406e70b90
 stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
enabling it!
 stdout: Logical volume "osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9" 
created.
Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
Running command: chown -R ceph:ceph /dev/dm-4
Running command: sudo ln -s 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
 /var/lib/ceph/osd/ceph-0/block
Running command: sudo ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 
/var/lib/ceph/osd/ceph-0/activate.monmap
 stderr: got monmap epoch 2
Running command: ceph-authtool /var/lib/ceph/osd/ceph-0/keyring 
--create-keyring --name osd.0 --add-key AQA5Qmta9LERFhAAKU+AmT1Sm56nk7sWx2BATQ==
 stdout: creating /var/lib/ceph/osd/ceph-0/keyring
 stdout: added entity osd.0 auth auth(auid = 18446744073709551615 
key=AQA5Qmta9LERFhAAKU+AmT1Sm56nk7sWx2BATQ== with 0 caps)
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore 
--mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --key 
 --osd-data /var/lib/ceph/osd/ceph-0/ 
--osd-uuid 138ce507-f28a-45bf-814c-7fa124a9d9b9 --setuser ceph --setgroup ceph
 stderr: 2018-01-26 14:59:10.039549 7fd7ef951cc0 -1 
bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
of struct encoding
 stderr: 2018-01-26 14:59:10.039744 7fd7ef951cc0 -1 
bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode pas

Re: [ceph-users] OSDs missing from cluster all from one node

2018-01-26 Thread Andre Goree

On 2018/01/25 7:09 pm, Brad Hubbard wrote:
...


It's highly likely this is a network connectivity issue (or your
machine is struggling under load, but that should be obvious to
detect).


...




--
Cheers,
Brad



We've confirmed a networking issue within the cluster.  Thanks for 
pointing me in the right direction!



--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-


[ceph-users] Signature check failures.

2018-01-26 Thread Cary
Hello,

 We are running Luminous 12.2.2. 6 OSD hosts with 12 1TB OSDs, and 64GB
RAM. Each host has a SSD for Bluestore's block.wal and block.db.
There are 5 monitor nodes as well with 32GB RAM. All servers have
Gentoo with kernel, 4.12.12-gentoo.

When I export an image using:
rbd export pool-name/volume-name  /location/image-name.raw

Messages similar to those below are displayed. The signature check fails
randomly, and sometimes there is a message about a bad authorizer, but not
every time.
The image is still exported successfully.

2018-01-24 17:35:15.616080 7fc8d4024700  0 cephx:
verify_authorizer_reply bad nonce got 4552544084014661633 expected
4552499520046621785 sent 4552499520046621784
2018-01-24 17:35:15.616098 7fc8d4024700  0 --
172.21.32.16:0/1412094654 >> 172.21.32.6:6802/6219 conn(0x7fc8b0078a50
:-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
l=1)._process_connection failed verifying authorize reply
2018-01-24 17:35:15.699004 7fc8d4024700  0 SIGN: MSG 2 Message
signature does not match contents.
2018-01-24 17:35:15.699020 7fc8d4024700  0 SIGN: MSG 2Signature on message:
2018-01-24 17:35:15.699021 7fc8d4024700  0 SIGN: MSG 2sig:
8189090775647585001
2018-01-24 17:35:15.699047 7fc8d4024700  0 SIGN: MSG 2Locally
calculated signature:
2018-01-24 17:35:15.699048 7fc8d4024700  0 SIGN: MSG 2
sig_check:140500325643792
2018-01-24 17:35:15.699049 7fc8d4024700  0 Signature failed.
2018-01-24 17:35:15.699050 7fc8d4024700  0 --
172.21.32.16:0/1412094654 >> 172.21.32.2:6807/153106
conn(0x7fc8bc020870 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH
pgs=26018 cs=1 l=1).process Signature check failed

Does anyone know what could cause this, and what I can do to fix it?

Thank you,

Cary
-Dynamic
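
For reference, one way to capture more detail around these failures is to
re-run the export with messenger and auth debugging raised on the client (a
sketch; any Ceph CLI accepts config options as flags, and where the extra
output lands depends on the client log settings):

rbd --debug-ms 1 --debug-auth 20 export pool-name/volume-name /location/image-name.raw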


Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread Wido den Hollander



On 01/26/2018 07:09 PM, David Majchrzak wrote:

destroy did remove the auth key, however create didnt add the auth, I had to do 
it manually.
Then I tried to start the osd.0 again and it failed because osdmap said it was 
destroyed.



That seems like this bug: http://tracker.ceph.com/issues/22673


I've summed my steps below:


Here are my commands prior to create:

root@int1:~# ceph osd out 0



<-- wait for rebalance/recover -->

root@int1:~# ceph osd safe-to-destroy 0
OSD(s) 0 are safe to destroy without reducing data durability.



Although it's a very safe route it's not required. You'll have a double 
rebalance here.



root@int1:~# systemctl kill ceph-osd@0


I recommend using 'stop' and not kill. The stop is a clear and graceful 
shutdown.


As I haven't used ceph-volume before I'm not able to tell exactly why 
the commands underneath fail.


Wido


root@int1:~# ceph status
   cluster:
 id: efad7df8-721d-43d8-8d02-449406e70b90
 health: HEALTH_OK

   services:
 mon: 3 daemons, quorum int1,int2,int3
 mgr: int1(active), standbys: int3, int2
 osd: 6 osds: 5 up, 5 in

   data:
 pools:   2 pools, 320 pgs
 objects: 97038 objects, 364 GB
 usage:   1096 GB used, 1128 GB / 2224 GB avail
 pgs: 320 active+clean

   io:
 client:   289 kB/s rd, 870 kB/s wr, 46 op/s rd, 48 op/s wr

root@int1:~# mount | grep /var/lib/ceph/osd/ceph-0
/dev/sdc1 on /var/lib/ceph/osd/ceph-0 type xfs 
(rw,noatime,attr2,inode64,noquota)
root@int1:~# umount /var/lib/ceph/osd/ceph-0
root@int1:~# ceph-volume lvm zap /dev/sdc
Zapping: /dev/sdc
Running command: sudo wipefs --all /dev/sdc
  stdout: /dev/sdc: 8 bytes were erased at offset 0x0200 (gpt): 45 46 49 20 
50 41 52 54
/dev/sdc: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 41 
52 54
/dev/sdc: 2 bytes were erased at offset 0x01fe (PMBR): 55 aa
/dev/sdc: calling ioctl to re-read partition table: Success
Running command: dd if=/dev/zero of=/dev/sdc bs=1M count=10
  stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB) copied
  stderr: , 0.0253999 s, 413 MB/s
--> Zapping successful for: /dev/sdc
root@int1:~# ceph osd destroy 0 --yes-i-really-mean-it
destroyed osd.0
root@int1:~# ceph status
   cluster:
 id: efad7df8-721d-43d8-8d02-449406e70b90
 health: HEALTH_OK

   services:
 mon: 3 daemons, quorum int1,int2,int3
 mgr: int1(active), standbys: int3, int2
 osd: 6 osds: 5 up, 5 in

   data:
 pools:   2 pools, 320 pgs
 objects: 97038 objects, 364 GB
 usage:   1096 GB used, 1128 GB / 2224 GB avail
 pgs: 320 active+clean

   io:
 client:   56910 B/s rd, 1198 kB/s wr, 15 op/s rd, 48 op/s wr

root@int1:~# ceph-volume create --bluestore --data /dev/sdc --osd-id 0
usage: ceph-volume [-h] [--cluster CLUSTER] [--log-level LOG_LEVEL]
[--log-path LOG_PATH]
ceph-volume: error: unrecognized arguments: create --bluestore --data /dev/sdc 
--osd-id 0
root@int1:~# ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
Running command: sudo vgcreate --force --yes 
ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
enabling it!
  stdout: Physical volume "/dev/sdc" successfully created
  stdout: Volume group "ceph-efad7df8-721d-43d8-8d02-449406e70b90" successfully 
created
Running command: sudo lvcreate --yes -l 100%FREE -n 
osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9 
ceph-efad7df8-721d-43d8-8d02-449406e70b90
  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
enabling it!
  stdout: Logical volume "osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9" 
created.
Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
Running command: chown -R ceph:ceph /dev/dm-4
Running command: sudo ln -s 
/dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
 /var/lib/ceph/osd/ceph-0/block
Running command: sudo ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 
/var/lib/ceph/osd/ceph-0/activate.monmap
  stderr: got monmap epoch 2
Running command: ceph-authtool /var/lib/ceph/osd/ceph-0/keyring 
--create-keyring --name osd.0 --add-key AQA5Qmta9LERFhAAKU+AmT1Sm56nk7sWx2BATQ==
  stdout: creating /var/lib/ceph/osd/ceph-0/keyring
  stdout: added entity osd.0 auth auth(auid = 18446744073709551615 
key=AQA5Qmta9LERFhAAKU+AmT1Sm56nk7sWx2BATQ== with 0 caps)
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore 
--mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --key 
 --osd-data /var/lib/ceph/osd/ceph-0/ 
--osd-uuid 138ce507-f28a-45bf-814c-7fa124a9d9b9 --setuser ceph --setgroup ceph
  stderr: 2018-01-26 14:59:10.039549 7fd7ef951cc0 -1 
bluestore(/var/lib/ceph/osd/cep

Re: [ceph-users] How ceph client read data from ceph cluster

2018-01-26 Thread Maged Mokhtar
Hi Lin, 

Yes it will read from the primary osd, but for the reasons stated this
should not impact performance. 

Maged 

On 2018-01-26 19:52, shadow_lin wrote:

> Hi Maged, 
> I just want to make sure if I understand how ceph client read from cluster.So 
> with current version of ceph(12.2.2) the client only read from the primary 
> osd(one copy),is that true? 
> 
> 2018-01-27
> -
> 
> lin.yunfan 
> -
> 
> From: Maged Mokhtar
> Sent: 2018-01-26 20:27
> Subject: Re: [ceph-users] How ceph client read data from ceph cluster
> To: "shadow_lin"
> Cc: "ceph-users"
> 
> On 2018-01-26 09:09, shadow_lin wrote: 
> Hi List, 
> I read a old article about how ceph client read from ceph cluster.It said the 
> client only read from the primary osd. Since ceph cluster in replicate mode 
> have serveral copys of data only read from one copy seems waste the 
> performance of concurrent read from all the copys. 
> But that artcile is rather old so maybe ceph has imporved to read from all 
> the copys? But I haven't find any info about that. 
> Any info about that would be appreciated. 
> Thanks 
> 
> 2018-01-26 
> -
> shadow_lin 
> 
> Hi  
> 
> The majority of cases you will have more concurrent io requests than disks, 
> so the load will already be distributed evenly. If this is not the case and 
> you have a large cluster with fewer clients, you may consider using 
> object/rbd striping so each io will be divided into different osd requests. 
> 
> Maged


Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread David Majchrzak
Yeah, the next one will be without the double rebalance, I just had a lot of time on my 
hands.

I had never used kill before, but I followed the docs here. They should probably be 
updated.

http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds
 


Is this the tracker for the docs?
http://tracker.ceph.com/projects/ceph-website/issues?set_filter=1&tracker_id=6 



> 26 jan. 2018 kl. 19:22 skrev Wido den Hollander :
> 
> 
> 
> On 01/26/2018 07:09 PM, David Majchrzak wrote:
>> destroy did remove the auth key, however create didnt add the auth, I had to 
>> do it manually.
>> Then I tried to start the osd.0 again and it failed because osdmap said it 
>> was destroyed.
> 
> That seems like this bug: http://tracker.ceph.com/issues/22673
> 
>> I've summed my steps below:
>> Here are my commands prior to create:
>> root@int1:~# ceph osd out 0
> 
>> <-- wait for rebalance/recover -->
>> root@int1:~# ceph osd safe-to-destroy 0
>> OSD(s) 0 are safe to destroy without reducing data durability.
> 
> Although it's a very safe route it's not required. You'll have a double 
> rebalance here.
> 
>> root@int1:~# systemctl kill ceph-osd@0
> 
> I recommend using 'stop' and not kill. The stop is a clear and graceful 
> shutdown.
> 
> As I haven't used ceph-volume before I'm not able to tell exactly why the 
> commands underneath fail.
> 
> Wido
> 
>> root@int1:~# ceph status
>>   cluster:
>> id: efad7df8-721d-43d8-8d02-449406e70b90
>> health: HEALTH_OK
>>   services:
>> mon: 3 daemons, quorum int1,int2,int3
>> mgr: int1(active), standbys: int3, int2
>> osd: 6 osds: 5 up, 5 in
>>   data:
>> pools:   2 pools, 320 pgs
>> objects: 97038 objects, 364 GB
>> usage:   1096 GB used, 1128 GB / 2224 GB avail
>> pgs: 320 active+clean
>>   io:
>> client:   289 kB/s rd, 870 kB/s wr, 46 op/s rd, 48 op/s wr
>> root@int1:~# mount | grep /var/lib/ceph/osd/ceph-0
>> /dev/sdc1 on /var/lib/ceph/osd/ceph-0 type xfs 
>> (rw,noatime,attr2,inode64,noquota)
>> root@int1:~# umount /var/lib/ceph/osd/ceph-0
>> root@int1:~# ceph-volume lvm zap /dev/sdc
>> Zapping: /dev/sdc
>> Running command: sudo wipefs --all /dev/sdc
>>  stdout: /dev/sdc: 8 bytes were erased at offset 0x0200 (gpt): 45 46 49 
>> 20 50 41 52 54
>> /dev/sdc: 8 bytes were erased at offset 0x6fc86d5e00 (gpt): 45 46 49 20 50 
>> 41 52 54
>> /dev/sdc: 2 bytes were erased at offset 0x01fe (PMBR): 55 aa
>> /dev/sdc: calling ioctl to re-read partition table: Success
>> Running command: dd if=/dev/zero of=/dev/sdc bs=1M count=10
>>  stderr: 10+0 records in
>> 10+0 records out
>> 10485760 bytes (10 MB) copied
>>  stderr: , 0.0253999 s, 413 MB/s
>> --> Zapping successful for: /dev/sdc
>> root@int1:~# ceph osd destroy 0 --yes-i-really-mean-it
>> destroyed osd.0
>> root@int1:~# ceph status
>>   cluster:
>> id: efad7df8-721d-43d8-8d02-449406e70b90
>> health: HEALTH_OK
>>   services:
>> mon: 3 daemons, quorum int1,int2,int3
>> mgr: int1(active), standbys: int3, int2
>> osd: 6 osds: 5 up, 5 in
>>   data:
>> pools:   2 pools, 320 pgs
>> objects: 97038 objects, 364 GB
>> usage:   1096 GB used, 1128 GB / 2224 GB avail
>> pgs: 320 active+clean
>>   io:
>> client:   56910 B/s rd, 1198 kB/s wr, 15 op/s rd, 48 op/s wr
>> root@int1:~# ceph-volume create --bluestore --data /dev/sdc --osd-id 0
>> usage: ceph-volume [-h] [--cluster CLUSTER] [--log-level LOG_LEVEL]
>>[--log-path LOG_PATH]
>> ceph-volume: error: unrecognized arguments: create --bluestore --data 
>> /dev/sdc --osd-id 0
>> root@int1:~# ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
>> Running command: sudo vgcreate --force --yes 
>> ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
>>  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
>> enabling it!
>>  stdout: Physical volume "/dev/sdc" successfully created
>>  stdout: Volume group "ceph-efad7df8-721d-43d8-8d02-449406e70b90" 
>> successfully created
>> Running command: sudo lvcreate --yes -l 100%FREE -n 
>> osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9 
>> ceph-efad7df8-721d-43d8-8d02-449406e70b90
>>  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
>> enabling it!
>>  stdout: Logical volume "osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9" 
>> created.
>> Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
>> Running command: chown -R ceph:ceph /dev/dm-4
>> Running command: sudo ln -s 
>> /dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
>>  /var/lib/ceph/osd/ceph-0/block
>> Running command: sudo ceph --cluster ceph --name client.bootstrap-osd 
>> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 
>> /var/lib/ceph/osd/ceph-0/activate.monmap
>> 

[ceph-users] Ceph OSDs fail to start with RDMA

2018-01-26 Thread Moreno, Orlando
Hi all,

I am trying to bring up a Ceph cluster where the private network is 
communicating via RoCEv2. The storage nodes have 2 dual-port 25Gb Mellanox 
ConnectX-4 NICs, with each NIC's ports bonded (2x25Gb, mode 4). I have set 
memory limits to unlimited, can rping to each node, and have 
ms_async_rdma_device_name set to the ibdev (mlx5_bond_1). Everything goes 
smoothly until I start bringing up OSDs. Nothing appears in stderr, but upon 
further inspection of the OSD log, I see the following error:

RDMAConnectedSocketImpl activate failed to transition to RTR state: (19) No 
such device
/build/ceph-12.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 
'void RDMAConnectedSocketImpl::handle_connection()' thread 7f908633c700 time 
2018-01-26 10:47:51.607573
/build/ceph-12.2.2/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 221: FAILED 
assert(!r)

ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) 
[0x564a2ccf7892]
2: (RDMAConnectedSocketImpl::handle_connection()+0xb4a) [0x564a2d007fba]
3: (EventCenter::process_events(int, std::chrono::duration >*)+0xa08) [0x564a2cd9a418]
4: (()+0xb4f3a8) [0x564a2cd9e3a8]
5: (()+0xb8c80) [0x7f9088c04c80]
6: (()+0x76ba) [0x7f90892f36ba]
7: (clone()+0x6d) [0x7f908836a41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
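
For reference, the rping check I mentioned looks roughly like this, and the
bonded ibdev itself can be inspected with the standard rdma-core tools (a
sketch only; <peer-ip> is a placeholder):

rping -s                      # listener on one node
rping -c -a <peer-ip>         # client from another node
ibv_devices                   # the bonded device (mlx5_bond_1) should be listed
ibv_devinfo -d mlx5_bond_1    # port state and link layer of the bonded device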

Anyone see this before or have any suggestions?

Thanks,
Orlando


Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread Reed Dier
A bit late for this to be helpful, but instead of zapping the LVM labels, you 
could alternatively destroy the LVM volumes by hand.

> lvremove -f <vg_name>/<lv_name>
> vgremove <vg_name>
> pvremove /dev/ceph-device (should wipe labels)


Then you should be able to run 'ceph-volume lvm zap /dev/sdX' and retry the 
'ceph-volume lvm create' command (sans the --osd-id flag), which should then succeed.
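
If it isn't obvious which VG/LV belongs to which OSD, ceph-volume records that
mapping in LVM tags, so something like this should show it (a sketch, assuming
ceph-volume-created LVs with their usual ceph.* tags):

> lvs -o lv_name,vg_name,lv_tags | grep ceph.osd_id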

Hopefully this info is useful for anyone who is no better versed in LVM than I 
was at the time I needed it.

Reed

> On Jan 26, 2018, at 11:32 AM, David Majchrzak  wrote:
> 
> Thanks that helped!
> 
> Since I had already "halfway" created an LVM volume, I wanted to start from the 
> beginning and zap it.
> 
> I tried to zap the raw device, but it failed since --destroy doesn't seem to be in 
> 12.2.2
> 
> http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/ 
> 
> 
> root@int1:~# ceph-volume lvm zap /dev/sdc --destroy
> usage: ceph-volume lvm zap [-h] [DEVICE]
> ceph-volume lvm zap: error: unrecognized arguments: --destroy
> 
> So I zapped it with the VG/LV path instead.
> ceph-volume lvm zap 
> /dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
> 
> However, create still wouldn't run on it since the LVM was already there.
> So I zapped it with sgdisk and ran dmsetup remove. After that I was able to 
> create it again.
> 
> However - each "ceph-volume lvm create" that I ran and that failed still 
> successfully added an OSD to the crush map ;)
> 
> So I've got this now:
> 
> root@int1:~# ceph osd df tree
> ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL  %USE  VAR  PGS TYPE NAME
> -1   2.60959- 2672G  1101G  1570G 41.24 1.00   - root default
> -2   0.87320-  894G   369G   524G 41.36 1.00   - host int1
>  3   ssd 0.43660  1.0  447G   358G 90295M 80.27 1.95 301 osd.3
>  8   ssd 0.43660  1.0  447G 11273M   436G  2.46 0.06  19 osd.8
> -3   0.86819-  888G   366G   522G 41.26 1.00   - host int2
>  1   ssd 0.43159  1.0  441G   167G   274G 37.95 0.92 147 osd.1
>  4   ssd 0.43660  1.0  447G   199G   247G 44.54 1.08 173 osd.4
> -4   0.86819-  888G   365G   523G 41.09 1.00   - host int3
>  2   ssd 0.43159  1.0  441G   193G   248G 43.71 1.06 174 osd.2
>  5   ssd 0.43660  1.0  447G   172G   274G 38.51 0.93 146 osd.5
>  0             0        0     0      0      0     0    0   0     osd.0
>  6             0        0     0      0      0     0    0   0     osd.6
>  7             0        0     0      0      0     0    0   0     osd.7
> 
> I guess I can just remove them from crush and auth, and then rm them?
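> i.e., something roughly like this for the three 0-weight entries above:
> 
> for id in 0 6 7; do
>   ceph osd crush remove osd.$id
>   ceph auth del osd.$id
>   ceph osd rm osd.$id
> done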
> 
> Kind Regards,
> 
> David Majchrzak
> 
>> On 26 Jan 2018, at 18:09, Reed Dier wrote:
>> 
>> This is the exact issue that I ran into when starting my bluestore 
>> conversion journey.
>> 
>> See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html 
>> 
>> 
>> Specifying --osd-id causes it to fail.
>> 
>> Below are my steps for OSD replace/migrate from filestore to bluestore.
>> 
>> BIG caveat here: I am doing a destructive replacement, meaning I am not 
>> allowing my objects to be migrated off of the OSD I'm replacing before 
>> nuking it.
>> With 8TB drives that just takes way too long, and I trust my failure domains 
>> and other hardware to get me through the backfills.
>> So instead of 1) reading data off and writing it elsewhere, 2) remove/re-add, 
>> 3) reading it back and writing it onto the new OSD, I am taking step one out 
>> and trusting my two other copies of the objects. Just wanted to clarify my steps.
>> 
>> I also set the norecover and norebalance flags immediately prior to running 
>> these commands so that the cluster doesn't try to start moving data unnecessarily. 
>> Then, when done, remove those flags and let it backfill.
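>> That is, roughly:
>> 
>> ceph osd set norecover
>> ceph osd set norebalance
>> # ... run the replacement steps below and wait for the new OSD to come up ...
>> ceph osd unset norebalance
>> ceph osd unset norecover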
>> 
>>> systemctl stop ceph-osd@$ID.service 
>>> ceph-osd -i $ID --flush-journal
>>> umount /var/lib/ceph/osd/ceph-$ID
>>> ceph-volume lvm zap /dev/$ID
>>> ceph osd crush remove osd.$ID
>>> ceph auth del osd.$ID
>>> ceph osd rm osd.$ID
>>> ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME
>> 
>> So essentially I fully remove the OSD from crush and the osdmap, and when I 
>> add the OSD back, like I would a new OSD, it fills in the numeric gap with 
>> the $ID it had before.
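>> 
>> A quick sanity check that the rebuilt OSD really came back as bluestore
>> (sketch):
>> 
>> ceph osd metadata $ID | grep osd_objectstore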
>> 
>> Hope this is helpful.
>> Been working well for me so far, doing 3 OSDs at a time (half of a failure 
>> domain).
>> 
>> Reed
>> 
>>> On Jan 26, 2018, at 10:01 AM, David wrote:
>>> 
>>> 
>>> Hi!
>>> 
>>> On luminous 12.2.2
>>> 
>>> I'm migrating some OSDs from filestore to bluestore using the "simple" 
>>> method as described in the docs: 
>>> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds
>>>  
>>> 

Re: [ceph-users] Snapshot trimming

2018-01-26 Thread Karun Josy
Are scrubbing and deep scrubbing necessary for the snaptrim operation to happen?

Karun Josy

On Fri, Jan 26, 2018 at 9:29 PM, Karun Josy  wrote:

> Thank you for your quick response!
>
> I used the command to fetch the snap_trimq from many PGs; however, it seems
> they don't have anything in the queue?
>
> For eg :
> 
> $ echo $(( $(ceph pg  55.4a query | grep snap_trimq | cut -d[ -f2 | cut
> -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg  55.5a query | grep snap_trimq | cut -d[ -f2 | cut
> -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg  55.88 query | grep snap_trimq | cut -d[ -f2 | cut
> -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg  55.55 query | grep snap_trimq | cut -d[ -f2 | cut
> -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg  54.a query | grep snap_trimq | cut -d[ -f2 | cut -d]
> -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg  34.1d query | grep snap_trimq | cut -d[ -f2 | cut
> -d] -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> $ echo $(( $(ceph pg  1.3f query | grep snap_trimq | cut -d[ -f2 | cut -d]
> -f1 | tr ',' '\n' | wc -l) - 1 ))
> 0
> =
>
>
> While going through the PG query, I also find that these PGs have no value in
> the purged_snaps section.
> For eg :
> ceph pg  55.80 query
> --
> ---
> ---
>  {
> "peer": "83(3)",
> "pgid": "55.80s3",
> "last_update": "43360'15121927",
> "last_complete": "43345'15073146",
> "log_tail": "43335'15064480",
> "last_user_version": 15066124,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 1,
> "purged_snaps": [],
> "history": {
> "epoch_created": 5950,
> "epoch_pool_created": 5950,
> "last_epoch_started": 43339,
> "last_interval_started": 43338,
> "last_epoch_clean": 43340,
> "last_interval_clean": 43338,
> "last_epoch_split": 0,
> "last_epoch_marked_full": 42032,
> "same_up_since": 43338,
> "same_interval_since": 43338,
> "same_primary_since": 43276,
> "last_scrub": "35299'13072533",
> "last_scrub_stamp": "2018-01-18 14:01:19.557972",
> "last_deep_scrub": "31372'12176860",
> "last_deep_scrub_stamp": "2018-01-15 12:21:17.025305",
> "last_clean_scrub_stamp": "2018-01-18 14:01:19.557972"
> },
>
> Not sure if it is related.
>
> The cluster is not open to any new clients. However, we see a steady growth
> of space usage every day.
> In the worst-case scenario, it might grow faster than we can add more space,
> which would be dangerous.
>
> Any help is really appreciated.
>
> Karun Josy
>
> On Fri, Jan 26, 2018 at 8:23 PM, David Turner wrote:
>
>> "snap_trimq": "[]",
>>
>> That is exactly what you're looking for to see how many objects a PG
>> still has that need to be cleaned up. I think something like this should
>> give you the number of objects in the snap_trimq for a PG:
>>
>> echo $(( $(ceph pg $pg query | grep snap_trimq | cut -d[ -f2 | cut -d]
>> -f1 | tr ',' '\n' | wc -l) - 1 ))
>>
>> Note, I'm not at a computer and am typing this from my phone, so it's not
>> pretty and I know of a few ways to do that better, but it should work all
>> the same.
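>> 
>> If jq happens to be installed, a tidier sketch is to just print the raw
>> snap_trimq string for the PG you quoted (an empty [] means nothing is queued):
>> 
>> ceph pg 55.77 query | jq -r '.snap_trimq'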
>>
>> For your needs a visual inspection of several PGs should be sufficient to
>> see if there is anything in the snap_trimq to begin with.
>>
>> On Fri, Jan 26, 2018, 9:18 AM Karun Josy  wrote:
>>
>>>  Hi David,
>>>
>>> Thank you for the response. To be honest, I am afraid it is going to be
>>> an issue in our cluster.
>>> It seems snaptrim has not been going on for some time now, maybe because
>>> we have been expanding the cluster, adding nodes, for the past few weeks.
>>>
>>> I would be really glad if you could guide me on how to overcome this.
>>> The cluster has about 30 TB of data and 11 million objects, with about 100 disks
>>> spread across 16 nodes. The version is 12.2.2.
>>> Searching through the mailing lists, I can see many cases where
>>> performance was affected while snaptrimming.
>>>
>>> Can you help me figure out these:
>>>
>>> - How do I find the snaptrim queue of a PG?
>>> - Can snaptrim be started on just one PG?
>>> - How can I make sure cluster IO performance is not affected?
>>> I read about osd_snap_trim_sleep; how can it be changed?
>>> Is this the command: ceph tell osd.* injectargs '--osd_snap_trim_sleep
>>> 0.005'
>>>
>>> If yes, what is the recommended value that we can use?
>>>
>>> Also, what other parameters should we be concerned about? I would really
>>> appreciate any suggestions.
>>>
>>>
>>> Below is a brief extract of a PG query:
>>> 
>>> ceph pg  55.77 query
>>> {
>>> "state": "active+clean",
>>> "snap_trimq": "[]",
>>> ---
>>> 
>>>
>>> "pgid": "55.77s7",
>>> "last_