[ceph-users] Adding multiple osd's to an active cluster

2017-02-17 Thread nigel davies
Hey all,

What is the best way to add multiple OSDs to an active cluster?

The last time I did this I almost killed the VMs we had running on
the cluster.

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG stuck peering after host reboot

2017-02-17 Thread george.vasilakakos
Hi Wido,

In an effort to get the cluster to complete peering that PG (as we need to be 
able to use our pool) we have removed osd.595 from the CRUSH map to allow a new 
mapping to occur.

When I left the office yesterday osd.307 had replaced osd.595 in the up set but 
the acting set had CRUSH_ITEM_NONE in place of the primary. The PG was in a 
remapped+peering state and recovery was taking place for the other PGs that 
lived on that OSD.
Worth noting that osd.307 is on the same host as osd.595.

We’ll have a look on osd.595 like you suggested.



On 17/02/2017, 06:48, "Wido den Hollander"  wrote:

>
>> On 16 February 2017 at 14:55, george.vasilaka...@stfc.ac.uk wrote:
>> 
>> 
>> Hi folks,
>> 
>> I have just made a tracker for this issue: 
>> http://tracker.ceph.com/issues/18960
>> I used ceph-post-file to upload some logs from the primary OSD for the 
>> troubled PG.
>> 
>> Any help would be appreciated.
>> 
>> If we can't get it to peer, we'd like to at least get it unstuck, even if it 
>> means data loss.
>> 
>> What's the proper way to go about doing that?
>
>Can you try this:
>
>1. Go to the host
>2. Stop OSD 595
>3. ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info 
>--pgid 1.323
>
>What does osd.595 think about that PG?
>
>You could even try 'rm-past-intervals' with the object-store tool, but that 
>might be a bit dangerous. Wouldn't do that immediately.
>
>Wido
>
>> 
>> Best regards,
>> 
>> George
>> 
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
>> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
>> Sent: 14 February 2017 10:27
>> To: bhubb...@redhat.com; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> 
>> Hi Brad,
>> 
>> I'll be doing so later in the day.
>> 
>> Thanks,
>> 
>> George
>> 
>> From: Brad Hubbard [bhubb...@redhat.com]
>> Sent: 13 February 2017 22:03
>> To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
>> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> 
>> I'd suggest creating a tracker and uploading a full debug log from the
>> primary so we can look at this in more detail.
>> 
>> On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
>> > Hi Brad,
>> >
>> > I could not tell you that as `ceph pg 1.323 query` never completes, it 
>> > just hangs there.
>> >
>> > On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>> >
>> > On Thu, Feb 9, 2017 at 3:36 AM,   wrote:
>> > > Hi Corentin,
>> > >
>> > > I've tried that, the primary hangs when trying to injectargs so I 
>> > set the option in the config file and restarted all OSDs in the PG, it 
>> > came up with:
>> > >
>> > > pg 1.323 is remapped+peering, acting 
>> > [595,1391,2147483647,127,937,362,267,320,7,634,716]
>> > >
>> > > Still can't query the PG, no error messages in the logs of osd.240.
>> > > The logs on osd.595 and osd.7 still fill up with the same messages.
>> >
>> > So what does "peering_blocked_by_detail" show in that case since it
>> > can no longer show "peering_blocked_by_history_les_bound"?
>> >
>> > >
>> > > Regards,
>> > >
>> > > George
>> > > 
>> > > From: Corentin Bonneton [l...@titin.fr]
>> > > Sent: 08 February 2017 16:31
>> > > To: Vasilakakos, George (STFC,RAL,SC)
>> > > Cc: ceph-users@lists.ceph.com
>> > > Subject: Re: [ceph-users] PG stuck peering after host reboot
>> > >
>> > > Hello,
>> > >
>> > > I already had the case, I applied the parameter 
>> > (osd_find_best_info_ignore_history_les) to all the osd that have reported 
>> > the queries blocked.
>> > >
>> > > --
>> > > Cordialement,
>> > > CEO FEELB | Corentin BONNETON
>> > > cont...@feelb.io
>> > >
>> > > On 8 Feb 2017 at 17:17, george.vasilaka...@stfc.ac.uk wrote:
>> > >
>> > > Hi Ceph folks,
>> > >
>> > > I have a cluster running Jewel 10.2.5 using a mix of EC and replicated 
>> > pools.
>> > >
>> > > After rebooting a host last night, one PG refuses to complete peering
>> > >
>> > > pg 1.323 is stuck inactive for 73352.498493, current state peering, 
>> > last acting [595,1391,240,127,937,362,267,320,7,634,716]
>> > >
>> > > Restarting OSDs or hosts does nothing to help, or sometimes results 
>> > in things like this:
>> > >
>> > > pg 1.323 is remapped+peering, acting 
>> > [2147483647,1391,240,127,937,362,267,320,7,634,716]
>> > >
>> > >
>> > > The host that was rebooted is home to osd.7 (8). If I go onto it to 
>> > look at the logs for osd.7 this is what I see:
>> > >
>> > > $ tail -f /var/log/ceph/ceph-osd.7.log
>> > > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- 
>> > XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 
>> > pipe(0x7f6074a0b400 sd=34 :42828 s=

Re: [ceph-users] Question regarding CRUSH algorithm

2017-02-17 Thread Richard Hesketh
On 16/02/17 20:44, girish kenkere wrote:
> Thanks David,
> 
> It's not quite what I was looking for. Let me explain my question in more 
> detail -
> 
> This is an excerpt from the CRUSH paper; it explains how the CRUSH algorithm running on 
> each client/osd maps a PG to an OSD during the write operation [let's assume].
> 
> /"Tree buckets are structured as a weighted binary search tree with items at 
> the leaves. Each interior node knows the total weight of its left and right 
> subtrees and is labeled according to a fixed strategy (described below). In 
> order to select an item within a bucket, CRUSH starts at the root of the tree 
> and calculates the hash of the input key x, replica number r, the bucket 
> identifier, and the label at the current tree node (initially the root). The 
> result is compared to the weight ratio of the left and right subtrees to 
> decide which child node to visit next. This process is repeated until a leaf 
> node is reached, at which point the associated item in the bucket is chosen. 
> Only logn hashes and node comparisons are needed to locate an item.:"/
> 
>  My question is: along the way the tree structure changes, weights of the 
> nodes change and some nodes even go away. In that case, how do future reads 
> lead to the same PG-to-OSD mapping? It's not cached anywhere, and the same algorithm runs for 
> every future read - what I am missing is how it picks the same OSD (where the data 
> resides) every time. With a modified CRUSH map, won't we end up with a 
> different leaf node if we apply the same algorithm? 
> 
> Thanks
> Girish
> 
> On Thu, Feb 16, 2017 at 12:05 PM, David Turner wrote:
> 
> As a piece to the puzzle, the client always has an up to date osd map 
> (which includes the crush map).  If it's out of date, then it has to get a 
> new one before it can request to read or write to the cluster.  That way the 
> client will never have old information and if you add or remove storage, the 
> client will always have the most up to date map to know where the current 
> copies of the files are.
> 
> This can cause slow downs in your cluster performance if you are updating 
> your osdmap frequently, which can be caused by deleting a lot of snapshots as 
> an example.

> 
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of girish kenkere 
> [kngen...@gmail.com]
> *Sent:* Thursday, February 16, 2017 12:43 PM
> *To:* ceph-users@lists.ceph.com 
> *Subject:* [ceph-users] Question regarding CRUSH algorithm
> 
> Hi, I have a question regarding CRUSH algorithm - please let me know how 
> this works. CRUSH paper talks about how given an object we select OSD via two 
> mapping - first one is obj to PG and then PG to OSD. 
> 
> This PG-to-OSD mapping is something I don't understand. It uses the PG number, 
> cluster map, and placement rules. How is it guaranteed to return the correct OSD 
> for future reads after the cluster map/placement rules have changed due to 
> nodes coming and going?
> 
> Thanks
> Girish

I think there is confusion over when the CRUSH algorithm is being run. It's my 
understanding that the object->PG mapping is always dynamically computed, and 
that's pretty simple (hash the object ID, take it modulo [num_pgs in pool], 
prepend pool ID, 8.0b's your uncle), but the PG->OSD mapping is only computed 
when new PGs are created or the CRUSH map changes. The result of that 
computation is stored in the cluster map and then locating a particular PG is a 
matter of looking it up in the map, not recalculating its location - PG 
placement is pseudorandom and nondeterministic anyway, so that would never work.

So - the client DOES run CRUSH to find the location of an object, but only in 
the sense of working out which PG it's in. It then looks up the PG in the 
cluster map (which includes the osdmap that David mentioned).

See http://docs.ceph.com
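For anyone who wants to watch both mappings happen, the monitors will do the
calculation for you; a quick sketch (the pool and object names below are
placeholders, not anything from this thread):

# map an object: pool/object -> PG -> up/acting OSD set
ceph osd map rbd myobject
# prints something along the lines of:
#   osdmap e1234 pool 'rbd' (1) object 'myobject' -> pg 1.5f2ab9ac (1.2ac) -> up ([12,37,88], p12) acting ([12,37,88], p12)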

Re: [ceph-users] PG stuck peering after host reboot

2017-02-17 Thread Wido den Hollander

> On 17 February 2017 at 11:09, george.vasilaka...@stfc.ac.uk wrote:
> 
> 
> Hi Wido,
> 
> In an effort to get the cluster to complete peering that PG (as we need to be 
> able to use our pool) we have removed osd.595 from the CRUSH map to allow a 
> new mapping to occur.
> 
> When I left the office yesterday osd.307 had replaced osd.595 in the up set 
> but the acting set had CRUSH_ITEM_NONE in place of the primary. The PG was in 
> a remapped+peering state and recovery was taking place for the other PGs that 
> lived on that OSD.
> Worth noting that osd.307 in on the same host as osd.595.
> 
> We’ll have a look on osd.595 like you suggested.
> 

If the PG still doesn't recover do the same on osd.307 as I think that 'ceph pg 
X query' still hangs?

The info from ceph-objectstore-tool might shed some more light on this PG.
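For completeness, the same check against the new set member would be (a sketch
only -- adjust the data path to your deployment and run it with the OSD
stopped):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-307 --op info --pgid 1.323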

Wido

> 
> 
> On 17/02/2017, 06:48, "Wido den Hollander"  wrote:
> 
> >
> >> Op 16 februari 2017 om 14:55 schreef george.vasilaka...@stfc.ac.uk:
> >> 
> >> 
> >> Hi folks,
> >> 
> >> I have just made a tracker for this issue: 
> >> http://tracker.ceph.com/issues/18960
> >> I used ceph-post-file to upload some logs from the primary OSD for the 
> >> troubled PG.
> >> 
> >> Any help would be appreciated.
> >> 
> >> If we can't get it to peer, we'd like to at least get it unstuck, even if 
> >> it means data loss.
> >> 
> >> What's the proper way to go about doing that?
> >
> >Can you try this:
> >
> >1. Go to the host
> >2. Stop OSD 595
> >3. ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info 
> >--pgid 1.323
> >
> >What does osd.595 think about that PG?
> >
> >You could even try 'rm-past-intervals' with the object-store tool, but that 
> >might be a bit dangerous. Wouldn't do that immediately.
> >
> >Wido
> >
> >> 
> >> Best regards,
> >> 
> >> George
> >> 
> >> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
> >> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
> >> Sent: 14 February 2017 10:27
> >> To: bhubb...@redhat.com; ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] PG stuck peering after host reboot
> >> 
> >> Hi Brad,
> >> 
> >> I'll be doing so later in the day.
> >> 
> >> Thanks,
> >> 
> >> George
> >> 
> >> From: Brad Hubbard [bhubb...@redhat.com]
> >> Sent: 13 February 2017 22:03
> >> To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
> >> Subject: Re: [ceph-users] PG stuck peering after host reboot
> >> 
> >> I'd suggest creating a tracker and uploading a full debug log from the
> >> primary so we can look at this in more detail.
> >> 
> >> On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
> >> > Hi Brad,
> >> >
> >> > I could not tell you that as `ceph pg 1.323 query` never completes, it 
> >> > just hangs there.
> >> >
> >> > On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
> >> >
> >> > On Thu, Feb 9, 2017 at 3:36 AM,   
> >> > wrote:
> >> > > Hi Corentin,
> >> > >
> >> > > I've tried that, the primary hangs when trying to injectargs so I 
> >> > set the option in the config file and restarted all OSDs in the PG, it 
> >> > came up with:
> >> > >
> >> > > pg 1.323 is remapped+peering, acting 
> >> > [595,1391,2147483647,127,937,362,267,320,7,634,716]
> >> > >
> >> > > Still can't query the PG, no error messages in the logs of osd.240.
> >> > > The logs on osd.595 and osd.7 still fill up with the same messages.
> >> >
> >> > So what does "peering_blocked_by_detail" show in that case since it
> >> > can no longer show "peering_blocked_by_history_les_bound"?
> >> >
> >> > >
> >> > > Regards,
> >> > >
> >> > > George
> >> > > 
> >> > > From: Corentin Bonneton [l...@titin.fr]
> >> > > Sent: 08 February 2017 16:31
> >> > > To: Vasilakakos, George (STFC,RAL,SC)
> >> > > Cc: ceph-users@lists.ceph.com
> >> > > Subject: Re: [ceph-users] PG stuck peering after host reboot
> >> > >
> >> > > Hello,
> >> > >
> >> > > I already had the case, I applied the parameter 
> >> > (osd_find_best_info_ignore_history_les) to all the osd that have 
> >> > reported the queries blocked.
> >> > >
> >> > > --
> >> > > Cordialement,
> >> > > CEO FEELB | Corentin BONNETON
> >> > > cont...@feelb.io
> >> > >
> >> > > Le 8 févr. 2017 à 17:17, 
> >> > george.vasilaka...@stfc.ac.uk a 
> >> > écrit :
> >> > >
> >> > > Hi Ceph folks,
> >> > >
> >> > > I have a cluster running Jewel 10.2.5 using a mix EC and 
> >> > replicated pools.
> >> > >
> >> > > After rebooting a host last night, one PG refuses to complete 
> >> > peering
> >> > >
> >> > > pg 1.323 is stuck inactive for 73352.498493, current state 
> >> > peering, last acting [595,1391,240,127,937,362,267,320,7,634,716]
> >> > >
> >> > > Restarting OSDs or hosts does no

Re: [ceph-users] PG stuck peering after host reboot

2017-02-17 Thread george.vasilakakos
On 17/02/2017, 12:00, "Wido den Hollander"  wrote:



>
>> Op 17 februari 2017 om 11:09 schreef george.vasilaka...@stfc.ac.uk:
>> 
>> 
>> Hi Wido,
>> 
>> In an effort to get the cluster to complete peering that PG (as we need to 
>> be able to use our pool) we have removed osd.595 from the CRUSH map to allow 
>> a new mapping to occur.
>> 
>> When I left the office yesterday osd.307 had replaced osd.595 in the up set 
>> but the acting set had CRUSH_ITEM_NONE in place of the primary. The PG was 
>> in a remapped+peering state and recovery was taking place for the other PGs 
>> that lived on that OSD.
>> Worth noting that osd.307 in on the same host as osd.595.
>> 
>> We’ll have a look on osd.595 like you suggested.
>> 
>
>If the PG still doesn't recover do the same on osd.307 as I think that 'ceph 
>pg X query' still hangs?

Will do, what’s even more worrying is that ceph pg X query also hangs on PG 
with osd.1391 as the primary (which is rank 1 in the stuck PG). osd.1391 had 3 
threads running 100% CPU during recovery, osd.307 was idling. OSDs 595 and 1391 
were also unresponsive to ceph tell but responsive to ceph daemon. I’ve not 
tried it with 307.

>
>The info from ceph-objectstore-tool might shed some more light on this PG.
>
>Wido
>
>> 
>> 
>> On 17/02/2017, 06:48, "Wido den Hollander"  wrote:
>> 
>> >
>> >> Op 16 februari 2017 om 14:55 schreef george.vasilaka...@stfc.ac.uk:
>> >> 
>> >> 
>> >> Hi folks,
>> >> 
>> >> I have just made a tracker for this issue: 
>> >> http://tracker.ceph.com/issues/18960
>> >> I used ceph-post-file to upload some logs from the primary OSD for the 
>> >> troubled PG.
>> >> 
>> >> Any help would be appreciated.
>> >> 
>> >> If we can't get it to peer, we'd like to at least get it unstuck, even if 
>> >> it means data loss.
>> >> 
>> >> What's the proper way to go about doing that?
>> >
>> >Can you try this:
>> >
>> >1. Go to the host
>> >2. Stop OSD 595
>> >3. ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-595 --op info 
>> >--pgid 1.323
>> >
>> >What does osd.595 think about that PG?
>> >
>> >You could even try 'rm-past-intervals' with the object-store tool, but that 
>> >might be a bit dangerous. Wouldn't do that immediately.
>> >
>> >Wido
>> >
>> >> 
>> >> Best regards,
>> >> 
>> >> George
>> >> 
>> >> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of 
>> >> george.vasilaka...@stfc.ac.uk [george.vasilaka...@stfc.ac.uk]
>> >> Sent: 14 February 2017 10:27
>> >> To: bhubb...@redhat.com; ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> >> 
>> >> Hi Brad,
>> >> 
>> >> I'll be doing so later in the day.
>> >> 
>> >> Thanks,
>> >> 
>> >> George
>> >> 
>> >> From: Brad Hubbard [bhubb...@redhat.com]
>> >> Sent: 13 February 2017 22:03
>> >> To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
>> >> Subject: Re: [ceph-users] PG stuck peering after host reboot
>> >> 
>> >> I'd suggest creating a tracker and uploading a full debug log from the
>> >> primary so we can look at this in more detail.
>> >> 
>> >> On Mon, Feb 13, 2017 at 9:11 PM,   wrote:
>> >> > Hi Brad,
>> >> >
>> >> > I could not tell you that as `ceph pg 1.323 query` never completes, it 
>> >> > just hangs there.
>> >> >
>> >> > On 11/02/2017, 00:40, "Brad Hubbard"  wrote:
>> >> >
>> >> > On Thu, Feb 9, 2017 at 3:36 AM,   
>> >> > wrote:
>> >> > > Hi Corentin,
>> >> > >
>> >> > > I've tried that, the primary hangs when trying to injectargs so I 
>> >> > set the option in the config file and restarted all OSDs in the PG, it 
>> >> > came up with:
>> >> > >
>> >> > > pg 1.323 is remapped+peering, acting 
>> >> > [595,1391,2147483647,127,937,362,267,320,7,634,716]
>> >> > >
>> >> > > Still can't query the PG, no error messages in the logs of 
>> >> > osd.240.
>> >> > > The logs on osd.595 and osd.7 still fill up with the same 
>> >> > messages.
>> >> >
>> >> > So what does "peering_blocked_by_detail" show in that case since it
>> >> > can no longer show "peering_blocked_by_history_les_bound"?
>> >> >
>> >> > >
>> >> > > Regards,
>> >> > >
>> >> > > George
>> >> > > 
>> >> > > From: Corentin Bonneton [l...@titin.fr]
>> >> > > Sent: 08 February 2017 16:31
>> >> > > To: Vasilakakos, George (STFC,RAL,SC)
>> >> > > Cc: ceph-users@lists.ceph.com
>> >> > > Subject: Re: [ceph-users] PG stuck peering after host reboot
>> >> > >
>> >> > > Hello,
>> >> > >
>> >> > > I already had the case, I applied the parameter 
>> >> > (osd_find_best_info_ignore_history_les) to all the osd that have 
>> >> > reported the queries blocked.
>> >> > >
>> >> > > --
>> >> > > Cordialement,
>> >> > > CEO FEELB | Corentin BONNETON
>> >> > > cont...@feelb.io
>> >> > >
>> >> > > Le 8 févr. 2017 à 17:17, 
>> >> > george.vasi

[ceph-users] moving rgw pools to ssd cache

2017-02-17 Thread Малков Петр Викторович

Hello!

I'm looking for a method to make an rgw ssd cache tier in front of hdd.

https://blog-fromsomedude.rhcloud.com/2015/11/06/Ceph-RadosGW-Placement-Targets/

I successfully created rgw pools for ssd as described above,
and the placement targets are written to the user's profile,
so data can be written to either the hdd or the ssd pool:
"placement_targets": [
{
"name": "default-placement",
"tags": ["default-placement"]
},
{
"name": "fast-placement",
   "tags": ["fast-placement"]
}
],
"default_placement": "default-placement",


"placement_pools": [
{
"key": "default-placement",
"val": {
"index_pool": "default.rgw.buckets.index",
"data_pool": "default.rgw.buckets.data",
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_type": 0
}
},
{
"key": "fast-placement",
"val": {
"index_pool": "default.rgw.fast.buckets.index",
"data_pool": "default.rgw.fast.buckets.data",
"data_extra_pool": "default.rgw.fast.buckets.non-ec",
"index_type": 0
}


"keys": [
{
"user": "test2",
"default_placement": "fast-placement",
"placement_tags": ["default-placement", "fast-placement"],


NAME                           ID  USED    %USED  MAX AVAIL  OBJECTS
default.rgw.fast.buckets.data  16  2906M   0.13   2216G      4339
default.rgw.buckets.data       18  20451M  0.12   16689G     6944


Then I make tier:
ceph osd tier add default.rgw.buckets.data default.rgw.fast.buckets.data 
--force-nonempty
ceph osd tier cache-mode default.rgw.fast.buckets.data writeback
ceph osd tier set-overlay default.rgw.buckets.data default.rgw.fast.buckets.data

ceph osd pool set default.rgw.fast.buckets.data hit_set_type bloom
ceph osd pool set default.rgw.fast.buckets.data hit_set_count 1
ceph osd pool set default.rgw.fast.buckets.data hit_set_period 300
ceph osd pool set default.rgw.fast.buckets.data target_max_bytes 22000
ceph osd pool set default.rgw.fast.buckets.data cache_min_flush_age 300
ceph osd pool set default.rgw.fast.buckets.data cache_min_evict_age 300
ceph osd pool set default.rgw.fast.buckets.data cache_target_dirty_ratio 0.01
ceph osd pool set default.rgw.fast.buckets.data cache_target_full_ratio 0.02


I put in some data; it fell down to the ssd cache and then lower to the hdd pool.
The cluster has no active clients that could keep data warm,
but after 300 seconds there is no flushing/evicting. Only the direct command works:

rados -p default.rgw.fast.buckets.data cache-flush-evict-all

How can I fix this?
(Jewel version 10.2.5)

--
Petr Malkov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-02-17 Thread John Spray
On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah
 wrote:
> On one of our platforms mgr uses 3 CPU cores. Is there a ticket available for
> this issue?

Not that I'm aware of, you could go ahead and open one.

Cheers,
John

> Thanks,
> Muthu
>
> On 14 February 2017 at 03:13, Brad Hubbard  wrote:
>>
>> Could one of the reporters open a tracker for this issue and attach
>> the requested debugging data?
>>
>> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis 
>> wrote:
>> > I am having the same issue. When I looked at my idle cluster this
>> > morning,
>> > one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of
>> > that.  I
>> > have 3 AIO nodes, and only one of them seemed to be affected.
>> >
>> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard 
>> > wrote:
>> >>
>> >> Want to install debuginfo packages and use something like this to try
>> >> and find out where it is spending most of its time?
>> >>
>> >> https://poormansprofiler.org/
>> >>
>> >> Note that you may need to do multiple runs to get a "feel" for where
>> it is spending most of its time. Also note that likely only one or two
>> >> threads will be using the CPU (you can see this in ps output using a
>> >> command like the following) the rest will likely be idle or waiting
>> >> for something.
>> >>
>> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>> >>
>> >> Observation of these two and maybe a couple of manual gstack dumps
>> >> like this to compare thread ids to ps output (LWP is the thread id
>> >> (tid) in gdb output) should give us some idea of where it is spinning.
>> >>
>> >> # gstack $(pidof ceph-mgr)
>> >>
>> >>
>> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>> >>  wrote:
>> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
>> >> > CentOS
>> >> > 7 w/
>> >> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and
>> >> > has
>> >> > allocated ~11GB of RAM after a single day of usage. Only the active
>> >> > manager
>> >> > is performing this way. The growth is linear and reproducible.
>> >> >
>> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with
>> >> > 45x8TB
>> >> > OSDs
>> >> > each.
>> >> >
>> >> >
>> >> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
>> >> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
>> >> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
>> >> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812 buff/cache
>> >> > KiB Swap:  2097148 total,  2097148 free,        0 used.  4836772 avail Mem
>> >> >
>> >> >   PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
>> >> >  2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8  2094:27 ceph-mgr
>> >> >  2302 ceph  20   0  620316 267992 157620 S   2.3  1.6 65:11.50 ceph-mon
>> >> >
>> >> >
>> >> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>> >> >  wrote:
>> >> >>
>> >> >> John,
>> >> >>
>> >> >> This morning I compared the logs from yesterday and I show a
>> >> >> noticeable
>> >> >> increase in messages like these:
>> >> >>
>> >> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all mon_status
>> >> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all health
>> >> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all pg_summary
>> >> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
>> >> >> mgrdigest v1
>> >> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest
>> >> >> v1
>> >> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all mon_status
>> >> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all health
>> >> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all
>> >> >> notify_all:
>> >> >> notify_all pg_summary
>> >> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
>> >> >> mgrdigest v1
>> >> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest
>> >> >> v1
>> >> >>
>> >> >>
>> >> >> In a 1 minute period yesterday I saw 84 times this group of messages
>> >> >> showed up.  Today that same group of messages showed up 156 times.
>> >> >>
>> >> >> Other than that I did see an increase in this messages from 9 times
>> >> >> a
>> >> >> minute to 14 times a minute:
>> >> >>
>> >> >> 2017-01-11 09:00:00.

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Matyas Koszik


I'm not sure which variable I should be looking at exactly, but after
reading through all of them I don't see anything suspicious; all values are
0. I'm attaching it anyway, in case I missed something:
https://atw.hu/~koszik/ceph/osd26-perf


I tried debugging the ceph pg query a bit more, and it seems that it
gets stuck communicating with the mon - it doesn't even try to connect to
the osd. This is the end of the log:

13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
{"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"..., 
53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
{"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
msg_flags=0}, MSG_NOSIGNAL) = 98
13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, NULL, 
NULL) = 9
13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
{"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, MSG_DONTWAIT, 
NULL, NULL) = 9
13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
{"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, NULL, 
NULL) = 9
13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
{"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, NULL, 
NULL) = 9

... and this goes on for as long as I let it. When I kill it, I get this:
RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
"pgid": "6.245"}']": exception 'int' object is not iterable

I restarted (again) osd26 with max debugging; after grepping for 6.245,
this is the log I get:
https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245

Matyas


On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:

> If the PG cannot be queried I would bet on OSD message throttler. Check with 
> "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which is 
> holding this PG  if message throttler current value is not equal max. If it 
> is, increase the max value in ceph.conf and restart OSD.
>
> --
> Tomasz Kuzemko
> tomasz.kuze...@corp.ovh.com
>
> On 17.02.2017 at 01:59, Matyas Koszik wrote:
>
> >
> > Hi,
> >
> > It seems that my ceph cluster is in an erroneous state of which I cannot
> > see right now how to get out of.
> >
> > The status is the following:
> >
> > health HEALTH_WARN
> >   25 pgs degraded
> >   1 pgs stale
> >   26 pgs stuck unclean
> >   25 pgs undersized
> >   recovery 23578/9450442 objects degraded (0.249%)
> >   recovery 45/9450442 objects misplaced (0.000%)
> >   crush map has legacy tunables (require bobtail, min is firefly)
> > monmap e17: 3 mons at x
> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
> >   flags require_jewel_osds
> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
> >   36750 GB used, 61947 GB / 98697 GB avail
> >   23578/9450442 objects degraded (0.249%)
> >   45/9450442 objects misplaced (0.000%)
> >   4362 active+clean
> > 24 active+undersized+degraded
> >  1 stale+active+undersized+degraded+remapped
> >  1 active+remapped
> >
> >
> > I tried restarting all OSDs, to no avail, it actually made things a bit
> > worse.
> > From a user point of view the cluster works perfectly (apart from that
> > stale pg, which fortunately hit the pool on which I keep swap images
> > only).
> >
> > A little background: I made the mistake of creating the cluster with
> > size=2 pools, which I'm now in the process of rectifying, but that
> > requires some fiddling around. I also tried moving to more optimal
> > tunables (firefly), but the documentation is a bit optimistic
> > with the 'up to 10%' data movement - it was over 50% in my case, so I
> > reverted to bobtail immediately after I saw that number. I then started
> > reweighing the osds in anticipation of the size=3 bump, and I think that's
> > when this bug hit me.
> >
> > Right now I have a pg (6.245) that cannot even be queried - the command
> > times out, or gives this output: https://atw.hu/~koszik/ceph/pg6.245
> >
> > I queried a few other pgs that are acting up, but cannot see anything
> > suspicious, other than the fact they do not have a working peer:
> > https://atw.hu/~koszik/ceph/pg4.2ca
> > https://atw.hu/~koszik/ceph/pg4.2e4
> >
> > Health details can be found here: https://atw.hu/~koszik/ceph/health
> > OSD tree: https://atw.hu/~koszik/ceph/tree (here the weight sum of
> > ssd/store3_ssd seems to be off, but that has been the case for quite some
> > time - not sure if it's related to any of this)
> >
> >
> > I tried setting debugging to 20/20 on some of the affected osds, but there
> > was nothing there that gav

Re: [ceph-users] KVM/QEMU rbd read latency

2017-02-17 Thread Jason Dillaman
On Fri, Feb 17, 2017 at 2:14 AM, Alexandre DERUMIER  wrote:
> and I have good hope that this new feature
> "RBD: Add support readv,writev for rbd"
> http://marc.info/?l=ceph-devel&m=148726026914033&w=2

Definitely will eliminate 1 unnecessary data copy -- but sadly it
still will make a single copy within librbd immediately since librados
*might* touch the IO memory after it has ACKed the op. Once that issue
is addressed, librbd can eliminate that copy if the librbd cache is
disabled. We also need to support >1 librbd/librados-internal IO
thread for outbound/inbound paths.

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Tendrl-devel] Calamari-server for CentOS

2017-02-17 Thread Ken Dreyer
I think the most up-to-date source of Calamari CentOS packages would
be https://shaman.ceph.com/repos/calamari/1.5/

On Fri, Feb 17, 2017 at 7:38 AM, Martin Kudlej  wrote:
> Hello all,
>
> I would like to ask again about calamari-server package for CentOS 7. Is
> there any plan to have calamari-server in Storage SIG in CentOS 7, please?
>
>
> On 01/17/2017 02:31 PM, Martin Kudlej wrote:
>>
>> Hello Ceph users,
>>
>> I've installed Ceph from
>> SIG(https://wiki.centos.org/SpecialInterestGroup/Storage) on CentOS 7.
>> I would like to install Calamari server too. It is not available in
>> SIG(http://mirror.centos.org/centos/7/storage/x86_64/ceph-jewel/). I've
>> found
>> https://github.com/ksingh7/ceph-calamari-packages/tree/master/CentOS-el7
>> but there I cannot be sure
>> that it is a well-formed and maintained version for CentOS.
>>
>> Where can I find Calamari server package for CentOS 7, please?
>>
>
> --
> Best Regards,
> Martin Kudlej.
> RHSC/USM Senior Quality Assurance Engineer
> Red Hat Czech s.r.o.
>
> Phone: +420 532 294 155
> E-mail:mkudlej at redhat.com
> IRC:   mkudlej at #brno, #gluster, #storage-qa, #rhs, #rh-ceph, #usm-meeting
> @ redhat
>   #tendrl-devel @ freenode
>
> ___
> Tendrl-devel mailing list
> tendrl-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/tendrl-devel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM/QEMU rbd read latency

2017-02-17 Thread Alexandre DERUMIER
>>We also need to support >1 librbd/librados-internal IO
>>thread for outbound/inbound paths.

Could be wonderful!
Multiple iothreads per disk are coming for qemu too. (I have seen Paolo Bonzini 
sending a lot of patches this month.)



- Original Message -
From: "Jason Dillaman" 
To: "aderumier" 
Cc: "Phil Lacroute" , "ceph-users" 

Sent: Friday, 17 February 2017 15:16:39
Subject: Re: [ceph-users] KVM/QEMU rbd read latency

On Fri, Feb 17, 2017 at 2:14 AM, Alexandre DERUMIER  
wrote: 
> and I have good hope than this new feature 
> "RBD: Add support readv,writev for rbd" 
> http://marc.info/?l=ceph-devel&m=148726026914033&w=2 

Definitely will eliminate 1 unnecessary data copy -- but sadly it 
still will make a single copy within librbd immediately since librados 
*might* touch the IO memory after it has ACKed the op. Once that issue 
is addressed, librbd can eliminate that copy if the librbd cache is 
disabled. We also need to support >1 librbd/librados-internal IO 
thread for outbound/inbound paths. 

-- 
Jason 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Disable debug logging: best practice or not?

2017-02-17 Thread Kostis Fardelas
Hi,
I keep reading recommendations about disabling debug logging in Ceph
in order to improve performance. There are two things that are unclear
to me though:

a. what do we lose if we decrease default debug logging and where is
the sweet point in order to not lose critical messages?

I would say for example that we would lose OSD traces after a random
OSD crash if we set debug level of osd subsystem to zero (log and
memory). Or a "Errno 24: Too many open files" on mon logs if we set
debug level of mon subsystem to zero.

b. is there a recommended set of debug settings to mute in client
nodes versus the Ceph daemons nodes? I mean are there settings that
you would mute in client nodes, but prefer to keep them on default
levels on Ceph nodes?

Best regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple osd's to an active cluster

2017-02-17 Thread Brian Andrus
As described recently in several other threads, we like to add OSDs in to
their proper CRUSH location, but with the following parameter set:

  osd crush initial weight = 0

We then bring the OSDs in to the cluster (0 impact in our environment), and
then gradually increase CRUSH weight to bring them to their final desired
value, all at the same time.

The script I use basically checks for all OSDs below our target weight in
each iteration, moves them closer to the target weight by a defined
increment, and then waits for HEALTH_OK or another acceptable state.

I would suggest starting with .001 for large groups of OSDs. We can
comfortably bring in 100 OSDs with increments of .004 at a time or so.
Theoretically we could just let them all weigh in at once, but this allows
us to find a comfortable rate and pause the process whenever/wherever we
want if it does cause issues.
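A minimal sketch of that loop, for the archives (this is not our actual
script; the OSD ids, target weight, step and sleep below are placeholders you
would set for your own cluster):

#!/bin/bash
# Gradually raise the CRUSH weight of newly added OSDs in small increments.
NEW_OSDS="101 102 103"   # ids of the freshly added OSDs (placeholder)
TARGET="3.640"           # final CRUSH weight (placeholder)
STEP="0.004"             # increment per iteration
CUR="0"

while awk -v c="$CUR" -v t="$TARGET" 'BEGIN { exit !(c < t) }'; do
    # move one step closer to the target without overshooting it
    CUR=$(awk -v c="$CUR" -v s="$STEP" -v t="$TARGET" \
        'BEGIN { n = c + s; if (n > t) n = t; print n }')
    for id in $NEW_OSDS; do
        ceph osd crush reweight "osd.$id" "$CUR"
    done
    # let the cluster settle before the next increment
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
done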

Hope that helps.

On Fri, Feb 17, 2017 at 1:42 AM, nigel davies  wrote:

> Hay All
>
> How is the best way to added multiple osd's to an active cluster?
>
> As the last time i done this i all most killed the VM's we had running on
> the cluster
>
> Thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.and...@dreamhost.com | www.dreamhost.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] S3 Radosgw : how to grant a user within a tenant

2017-02-17 Thread Vincent Godin
I created 2 users, jack & bob, inside tenant_A.
jack created a bucket named BUCKET_A and wants to give read access to the
user bob.

With s3cmd, I can grant a user without a tenant easily: s3cmd setacl
--acl-grant=read:user s3://BUCKET_A

But with an explicit tenant, I tried:
--acl-grant=read:bob
--acl-grant=read:tenant_A$bob
--acl-grant=read:tenant_A\$bob
--acl-grant=read:"tenant_A:bob"

Each time I got an S3 error: 400 (InvalidArgument).

Does someone know the solution?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disable debug logging: best practice or not?

2017-02-17 Thread Wido den Hollander

> On 17 February 2017 at 17:44, Kostis Fardelas wrote:
> 
> 
> Hi,
> I keep reading recommendations about disabling debug logging in Ceph
> in order to improve performance. There are two things that are unclear
> to me though:
> 
> a. what do we lose if we decrease default debug logging and where is
> the sweet point in order to not lose critical messages?
> 

Having logs enabled consumes CPU power, mainly on the OSDs. It introduces 
latency, so when people disable logging they do this to lower latency.

I usually disable logs on the OSDs and leave them on on the MONs. Enable them 
on the OSDs when needed.

[osd]
debug_osd = 0
debug_filestore = 0
debug_journal = 0
debug_ms = 0
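When you do need them on a particular OSD, a sketch of turning them up (and
back down) at runtime -- osd.3 and the levels here are only examples:

ceph tell osd.3 injectargs '--debug_osd 20 --debug_ms 1'
# or, on the OSD's host, via the admin socket:
ceph daemon osd.3 config set debug_osd 20
# and back down afterwards:
ceph tell osd.3 injectargs '--debug_osd 0 --debug_ms 0'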

Wido

> I would say for example that we would lose OSD traces after a random
> OSD crash if we set debug level of osd subsystem to zero (log and
> memory). Or a "Errno 24: Too many open files" on mon logs if we set
> debug level of mon subsystem to zero.
> 
> b. is there a recommended set of debug settings to mute in client
> nodes versus the Ceph daemons nodes? I mean are there settings that
> you would mute in client nodes, but prefer to keep them on default
> levels on Ceph nodes?
> 
> Best regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] S3 Radosgw : how to grant a user within a tenant

2017-02-17 Thread Bastian Rosner
On 02/17/2017 06:25 PM, Vincent Godin wrote:
> I created 2 users : jack & bob inside a tenant_A
> jack created a bucket named BUCKET_A and want to give read access to the
> user bob
> 
> with s3cmd, i can grant a user without tenant easylly: s3cmd setacl
> --acl-grant=read:user s3://BUCKET_A
> 
> but with an explicit tenant, i tried :
> --acl-grant=read:bob
> --acl-grant=read:tenant_A$bob
> --acl-grant=read:tenant_A\$bob
> --acl-grant=read:"tenant_A:bob"
> 
> each time, i got a s3 error : 400 (invalidArgument)
> 
> Does someone know the solution ?

Have you tried using email-address instead of tenant:UID?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] crushtool mappings wrong

2017-02-17 Thread Gregory Farnum
On Thu, Feb 16, 2017 at 3:51 PM, Blair Bethwaite
 wrote:
> Hi Brian,
>
> After another hour of staring at the decompiled crushmap and playing around
> with crushtool command lines I finally looked harder at your command line
> and noticed I was also specifying "--simulate", removing that gives me
> mappings that make much more sense (at least for the existing production
> ruleset that I know actually work on the cluster)! So, thanks!
>
> This looks like the same problem that Robert ran into some time ago:
> https://www.spinics.net/lists/ceph-users/msg16950.html, where I believe he
> filed a bug (http://tracker.ceph.com/issues/11224) that was rather
> unhelpfully rejected with no explanation. I'm going to update that now.
> Would be great to get an explanation of what that damned simulate flag is
> supposed to do, the documentation for crushtool is somewhat thin on the
> subject.

Heh. You're right of course. Best as I remember, --simulate is
designed to generate an actually random placement of the PGs so a user
can do statistical comparisons between that and the CRUSH output to
measure quality of balance, etc. This was a project by a summer intern
many years ago and I'm not sure anybody knows how to use it properly
any more, and I wouldn't expect that flag to interact well
with...anything else?
-Greg


>
> Cheers,
>
> On 17 February 2017 at 05:56, Brian Andrus 
> wrote:
>>
>> v10.2.5 - crushtool working fine to show rack mappings. How are you
>> running the command? Get some sleep! ha.
>>
>> crushtool -i /tmp/crush.map --test --ruleset 3 --num-rep 3 --show-mappings
>>
>> rule byrack {
>> ruleset 3
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type rack
>> step emit
>> }
>>
>>
>> On Thu, Feb 16, 2017 at 7:10 AM, Blair Bethwaite
>>  wrote:
>>>
>>> Am I going nuts (it is extremely late/early here), or is crushtool
>>> totally broken? I'm trying to configure a ruleset that will place
>>> exactly one replica into three different racks (under each of which
>>> there are 8-10 hosts). crushtool has given me empty mappings for just
>>> about every rule I've tried that wasn't just the simplest: chooseleaf
>>> 0 host. Suspecting something was up with crushtool I have now tried to
>>> verify correctness on an existing rule and it is including OSDs in the
>>> result mappings that are not even in this hierarchy...
>>>
>>> (this is on a 10.2.2 install)
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> --
>> Brian Andrus | Cloud Systems Engineer | DreamHost
>> brian.and...@dreamhost.com | www.dreamhost.com
>
>
>
>
> --
> Cheers,
> ~Blairo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Gregory Farnum
Stable situations with lots of undersized PGs like this generally
mean that the CRUSH map is failing to allocate enough OSDs for certain
PGs. The log you have says the OSD is trying to NOTIFY the new primary
that the PG exists here on this replica.

I'd guess you only have 3 hosts and are trying to place all your
replicas on independent boxes. Bobtail tunables have trouble with that
and you're going to need to pay the cost of moving to more modern
ones.
-Greg
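For the archive, a sketch of what that move might look like (the recovery
throttle values are only examples, not a recommendation):

# see which tunable profile the current map uses
ceph osd crush show-tunables
# optionally throttle recovery so the resulting rebalance is gentler (example values)
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
# then switch profiles and accept the data movement
ceph osd crush tunables firefly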

On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
>
>
> I'm not sure what variable should I be looking at exactly, but after
> reading through all of them I don't see anyting supsicious, all values are
> 0. I'm attaching it anyway, in case I missed something:
> https://atw.hu/~koszik/ceph/osd26-perf
>
>
> I tried debugging the ceph pg query a bit more, and it seems that it
> gets stuck communicating with the mon - it doesn't even try to connect to
> the osd. This is the end of the log:
>
> 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
> {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"..., 
> 53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
> {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
> msg_flags=0}, MSG_NOSIGNAL) = 98
> 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, NULL, 
> NULL) = 9
> 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 
> 9
> 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, MSG_DONTWAIT, 
> NULL, NULL) = 9
> 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
> 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, 
> NULL, NULL) = 9
> 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
> 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, NULL, 
> NULL) = 9
>
> ... and this goes on for as long as I let it. When I kill it, I get this:
> RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
> "pgid": "6.245"}']": exception 'int' object is not iterable
>
> I restarted (again) osd26 with max debugging; after grepping for 6.245,
> this is the log I get:
> https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
>
> Matyas
>
>
> On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
>
>> If the PG cannot be queried I would bet on OSD message throttler. Check with 
>> "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which is 
>> holding this PG  if message throttler current value is not equal max. If it 
>> is, increase the max value in ceph.conf and restart OSD.
>>
>> --
>> Tomasz Kuzemko
>> tomasz.kuze...@corp.ovh.com
>>
>> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
>>
>> >
>> > Hi,
>> >
>> > It seems that my ceph cluster is in an erroneous state of which I cannot
>> > see right now how to get out of.
>> >
>> > The status is the following:
>> >
>> > health HEALTH_WARN
>> >   25 pgs degraded
>> >   1 pgs stale
>> >   26 pgs stuck unclean
>> >   25 pgs undersized
>> >   recovery 23578/9450442 objects degraded (0.249%)
>> >   recovery 45/9450442 objects misplaced (0.000%)
>> >   crush map has legacy tunables (require bobtail, min is firefly)
>> > monmap e17: 3 mons at x
>> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
>> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
>> >   flags require_jewel_osds
>> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
>> >   36750 GB used, 61947 GB / 98697 GB avail
>> >   23578/9450442 objects degraded (0.249%)
>> >   45/9450442 objects misplaced (0.000%)
>> >   4362 active+clean
>> > 24 active+undersized+degraded
>> >  1 stale+active+undersized+degraded+remapped
>> >  1 active+remapped
>> >
>> >
>> > I tried restarting all OSDs, to no avail, it actually made things a bit
>> > worse.
>> > From a user point of view the cluster works perfectly (apart from that
>> > stale pg, which fortunately hit the pool on which I keep swap images
>> > only).
>> >
>> > A little background: I made the mistake of creating the cluster with
>> > size=2 pools, which I'm now in the process of rectifying, but that
>> > requires some fiddling around. I also tried moving to more optimal
>> > tunables (firefly), but the documentation is a bit optimistic
>> > with the 'up to 10%' data movement - it was over 50% in my case, so I
>> > reverted to bobtail immediately after I saw that number. I then started
>> > reweighing the osds in anticipation of the size=3 bump, and I think that's
>> > when this bug hit me.
>> >
>> > Right now I have a pg (6.245) that cannot even be queried - the command
>> > times out, or gives this ou

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Shinobu Kinjo
Can you run this?

 * ceph osd getcrushmap -o ./crushmap.o; crushtool -d ./crushmap.o -o
./crushmap.txt

On Sat, Feb 18, 2017 at 3:52 AM, Gregory Farnum  wrote:
> Situations that are stable lots of undersized PGs like this generally
> mean that the CRUSH map is failing to allocate enough OSDs for certain
> PGs. The log you have says the OSD is trying to NOTIFY the new primary
> that the PG exists here on this replica.
>
> I'd guess you only have 3 hosts and are trying to place all your
> replicas on independent boxes. Bobtail tunables have trouble with that
> and you're going to need to pay the cost of moving to more modern
> ones.
> -Greg
>
> On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
>>
>>
>> I'm not sure what variable should I be looking at exactly, but after
>> reading through all of them I don't see anyting supsicious, all values are
>> 0. I'm attaching it anyway, in case I missed something:
>> https://atw.hu/~koszik/ceph/osd26-perf
>>
>>
>> I tried debugging the ceph pg query a bit more, and it seems that it
>> gets stuck communicating with the mon - it doesn't even try to connect to
>> the osd. This is the end of the log:
>>
>> 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
>> {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"..., 
>> 53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
>> {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
>> msg_flags=0}, MSG_NOSIGNAL) = 98
>> 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, NULL, 
>> NULL) = 9
>> 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
>> = 9
>> 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, MSG_DONTWAIT, 
>> NULL, NULL) = 9
>> 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 
>> 9
>> 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, 
>> NULL, NULL) = 9
>> 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
>> 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, 
>> NULL, NULL) = 9
>>
>> ... and this goes on for as long as I let it. When I kill it, I get this:
>> RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
>> "pgid": "6.245"}']": exception 'int' object is not iterable
>>
>> I restarted (again) osd26 with max debugging; after grepping for 6.245,
>> this is the log I get:
>> https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
>>
>> Matyas
>>
>>
>> On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
>>
>>> If the PG cannot be queried I would bet on OSD message throttler. Check 
>>> with "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which 
>>> is holding this PG  if message throttler current value is not equal max. If 
>>> it is, increase the max value in ceph.conf and restart OSD.
>>>
>>> --
>>> Tomasz Kuzemko
>>> tomasz.kuze...@corp.ovh.com
>>>
>>> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
>>>
>>> >
>>> > Hi,
>>> >
>>> > It seems that my ceph cluster is in an erroneous state of which I cannot
>>> > see right now how to get out of.
>>> >
>>> > The status is the following:
>>> >
>>> > health HEALTH_WARN
>>> >   25 pgs degraded
>>> >   1 pgs stale
>>> >   26 pgs stuck unclean
>>> >   25 pgs undersized
>>> >   recovery 23578/9450442 objects degraded (0.249%)
>>> >   recovery 45/9450442 objects misplaced (0.000%)
>>> >   crush map has legacy tunables (require bobtail, min is firefly)
>>> > monmap e17: 3 mons at x
>>> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
>>> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
>>> >   flags require_jewel_osds
>>> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
>>> >   36750 GB used, 61947 GB / 98697 GB avail
>>> >   23578/9450442 objects degraded (0.249%)
>>> >   45/9450442 objects misplaced (0.000%)
>>> >   4362 active+clean
>>> > 24 active+undersized+degraded
>>> >  1 stale+active+undersized+degraded+remapped
>>> >  1 active+remapped
>>> >
>>> >
>>> > I tried restarting all OSDs, to no avail, it actually made things a bit
>>> > worse.
>>> > From a user point of view the cluster works perfectly (apart from that
>>> > stale pg, which fortunately hit the pool on which I keep swap images
>>> > only).
>>> >
>>> > A little background: I made the mistake of creating the cluster with
>>> > size=2 pools, which I'm now in the process of rectifying, but that
>>> > requires some fiddling around. I also tried moving to more optimal
>>> > tunables (firefly), but the documentation is a bit optimistic
>>> > with the 'up to 10%' data movement - it was over 50% in my case, so I
>>> > reverted to b

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Matyas Koszik

It's at https://atw.hu/~koszik/ceph/crushmap.txt


On Sat, 18 Feb 2017, Shinobu Kinjo wrote:

> Can you do?
>
>  * ceph osd getcrushmap -o ./crushmap.o; crushtool -d ./crushmap.o -o
> ./crushmap.txt
>
> On Sat, Feb 18, 2017 at 3:52 AM, Gregory Farnum  wrote:
> > Situations that are stable lots of undersized PGs like this generally
> > mean that the CRUSH map is failing to allocate enough OSDs for certain
> > PGs. The log you have says the OSD is trying to NOTIFY the new primary
> > that the PG exists here on this replica.
> >
> > I'd guess you only have 3 hosts and are trying to place all your
> > replicas on independent boxes. Bobtail tunables have trouble with that
> > and you're going to need to pay the cost of moving to more modern
> > ones.
> > -Greg
> >
> > On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
> >>
> >>
> >> I'm not sure what variable should I be looking at exactly, but after
> >> reading through all of them I don't see anyting supsicious, all values are
> >> 0. I'm attaching it anyway, in case I missed something:
> >> https://atw.hu/~koszik/ceph/osd26-perf
> >>
> >>
> >> I tried debugging the ceph pg query a bit more, and it seems that it
> >> gets stuck communicating with the mon - it doesn't even try to connect to
> >> the osd. This is the end of the log:
> >>
> >> 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
> >> {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"...,
> >>  53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
> >> {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
> >> msg_flags=0}, MSG_NOSIGNAL) = 98
> >> 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, 
> >> NULL, NULL) = 9
> >> 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> >> {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, 
> >> MSG_NOSIGNAL) = 9
> >> 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, 
> >> MSG_DONTWAIT, NULL, NULL) = 9
> >> 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> >> {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
> >> = 9
> >> 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, 
> >> NULL, NULL) = 9
> >> 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> >> {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 
> >> 9
> >> 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, 
> >> NULL, NULL) = 9
> >>
> >> ... and this goes on for as long as I let it. When I kill it, I get this:
> >> RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
> >> "pgid": "6.245"}']": exception 'int' object is not iterable
> >>
> >> I restarted (again) osd26 with max debugging; after grepping for 6.245,
> >> this is the log I get:
> >> https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
> >>
> >> Matyas
> >>
> >>
> >> On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
> >>
> >>> If the PG cannot be queried I would bet on OSD message throttler. Check 
> >>> with "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which 
> >>> is holding this PG  if message throttler current value is not equal max. 
> >>> If it is, increase the max value in ceph.conf and restart OSD.
> >>>
> >>> --
> >>> Tomasz Kuzemko
> >>> tomasz.kuze...@corp.ovh.com
> >>>
> >>> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
> >>>
> >>> >
> >>> > Hi,
> >>> >
> >>> > It seems that my ceph cluster is in an erroneous state of which I cannot
> >>> > see right now how to get out of.
> >>> >
> >>> > The status is the following:
> >>> >
> >>> > health HEALTH_WARN
> >>> >   25 pgs degraded
> >>> >   1 pgs stale
> >>> >   26 pgs stuck unclean
> >>> >   25 pgs undersized
> >>> >   recovery 23578/9450442 objects degraded (0.249%)
> >>> >   recovery 45/9450442 objects misplaced (0.000%)
> >>> >   crush map has legacy tunables (require bobtail, min is firefly)
> >>> > monmap e17: 3 mons at x
> >>> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
> >>> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
> >>> >   flags require_jewel_osds
> >>> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
> >>> >   36750 GB used, 61947 GB / 98697 GB avail
> >>> >   23578/9450442 objects degraded (0.249%)
> >>> >   45/9450442 objects misplaced (0.000%)
> >>> >   4362 active+clean
> >>> > 24 active+undersized+degraded
> >>> >  1 stale+active+undersized+degraded+remapped
> >>> >  1 active+remapped
> >>> >
> >>> >
> >>> > I tried restarting all OSDs, to no avail, it actually made things a bit
> >>> > worse.
> >>> > From a user point of view the cluster works perfectly (apart from that
> >>> > stale pg, which fortunately hit the pool on which I keep swap images
> >>> > only).
> >>> >
> >>> > A little background: I made the mistake of creating 

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Matyas Koszik

I have size=2 and 3 independent nodes. I'm happy to try firefly tunables,
but a bit scared that it would make things even worse.


On Fri, 17 Feb 2017, Gregory Farnum wrote:

> Situations that are stable with lots of undersized PGs like this generally
> mean that the CRUSH map is failing to allocate enough OSDs for certain
> PGs. The log you have says the OSD is trying to NOTIFY the new primary
> that the PG exists here on this replica.
>
> I'd guess you only have 3 hosts and are trying to place all your
> replicas on independent boxes. Bobtail tunables have trouble with that
> and you're going to need to pay the cost of moving to more modern
> ones.
> -Greg
>
> On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
> >
> >
> > I'm not sure what variable should I be looking at exactly, but after
> > reading through all of them I don't see anything suspicious, all values are
> > 0. I'm attaching it anyway, in case I missed something:
> > https://atw.hu/~koszik/ceph/osd26-perf
> >
> >
> > I tried debugging the ceph pg query a bit more, and it seems that it
> > gets stuck communicating with the mon - it doesn't even try to connect to
> > the osd. This is the end of the log:
> >
> > 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
> > {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"..., 
> > 53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
> > {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
> > msg_flags=0}, MSG_NOSIGNAL) = 98
> > 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> > 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> > {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
> > = 9
> > 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> > 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> > {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
> > = 9
> > 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> > 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> > {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
> > 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> >
> > ... and this goes on for as long as I let it. When I kill it, I get this:
> > RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
> > "pgid": "6.245"}']": exception 'int' object is not iterable
> >
> > I restarted (again) osd26 with max debugging; after grepping for 6.245,
> > this is the log I get:
> > https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
> >
> > Matyas
> >
> >
> > On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
> >
> >> If the PG cannot be queried I would bet on OSD message throttler. Check 
> >> with "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which 
> >> is holding this PG  if message throttler current value is not equal max. 
> >> If it is, increase the max value in ceph.conf and restart OSD.
> >>
> >> --
> >> Tomasz Kuzemko
> >> tomasz.kuze...@corp.ovh.com
> >>
> >> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
> >>
> >> >
> >> > Hi,
> >> >
> >> > It seems that my ceph cluster is in an erroneous state of which I cannot
> >> > see right now how to get out of.
> >> >
> >> > The status is the following:
> >> >
> >> > health HEALTH_WARN
> >> >   25 pgs degraded
> >> >   1 pgs stale
> >> >   26 pgs stuck unclean
> >> >   25 pgs undersized
> >> >   recovery 23578/9450442 objects degraded (0.249%)
> >> >   recovery 45/9450442 objects misplaced (0.000%)
> >> >   crush map has legacy tunables (require bobtail, min is firefly)
> >> > monmap e17: 3 mons at x
> >> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
> >> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
> >> >   flags require_jewel_osds
> >> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
> >> >   36750 GB used, 61947 GB / 98697 GB avail
> >> >   23578/9450442 objects degraded (0.249%)
> >> >   45/9450442 objects misplaced (0.000%)
> >> >   4362 active+clean
> >> > 24 active+undersized+degraded
> >> >  1 stale+active+undersized+degraded+remapped
> >> >  1 active+remapped
> >> >
> >> >
> >> > I tried restarting all OSDs, to no avail, it actually made things a bit
> >> > worse.
> >> > From a user point of view the cluster works perfectly (apart from that
> >> > stale pg, which fortunately hit the pool on which I keep swap images
> >> > only).
> >> >
> >> > A little background: I made the mistake of creating the cluster with
> >> > size=2 pools, which I'm now in the process of rectifying, but that
> >> > requires some fiddling around. I also tried moving to more optimal
> >> > tunables (firefly), but the documenta

Re: [ceph-users] KVM/QEMU rbd read latency

2017-02-17 Thread Phil Lacroute
Thanks everyone for the suggestions.  Disabling the RBD cache, disabling the 
debug logging and building qemu with jemalloc each had a significant impact.  
Performance is up from 25K IOPS to 63K IOPS.  Hopefully the ongoing work to 
reduce the number of buffer copies will yield further improvements.
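
For anyone reproducing this, the jemalloc part can be done two ways; a
sketch only (the library path is a typical Debian one and the configure
flags are upstream QEMU options, not the exact build used here):

  # option 1: preload jemalloc without rebuilding (the rest of the command
  # line is whatever libvirt or your start script already passes)
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 qemu-system-x86_64 ...

  # option 2: build QEMU with jemalloc as its allocator
  ./configure --target-list=x86_64-softmmu --enable-rbd --enable-jemalloc
  make -j"$(nproc)"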

I have a followup question about the debug logging.  Is there any way to dump 
the in-memory logs from the QEMU RBD client?  If not (and I couldn’t find a way 
to do this), then nothing is lost by disabling the logging on client machines.

Thanks,
Phil

> On Feb 16, 2017, at 1:20 PM, Jason Dillaman  wrote:
> 
> Few additional suggestions:
> 
> 1) For high IOPS random read workloads, the librbd cache is most likely going 
> to be a bottleneck and is providing zero benefit. Recommend setting 
> "cache=none" on your librbd QEMU disk to disable it.
> 
> 2) Disable logging via your ceph.conf. Example settings:
> 
> debug_auth = 0/0
> debug_buffer = 0/0
> debug_context = 0/0
> debug_crypto = 0/0
> debug_finisher = 0/0
> debug_ms = 0/0
> debug_objectcacher = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_striper = 0/0
> debug_tp = 0/0
> 
> The above two config changes on my small development cluster take my librbd 
> 4K random reads IOPS from ~9.5K to ~12.5K IOPS (+32%)
> 
> 3) librbd / librados is very heavy with small memory allocations on the IO 
> path and previous reports have indicated that using jemalloc w/ QEMU shows 
> large improvements.
> 
> LD_PRELOADing jemalloc within fio using the optimized config takes me from 
> ~12.5K IOPS to ~13.5K IOPS (+8%).
> 
> 
> On Thu, Feb 16, 2017 at 3:38 PM, Steve Taylor  > wrote:
> 
> You might try running fio directly on the host using the rbd ioengine (direct 
> librbd) and see how that compares. The major difference between that and the 
> krbd test will be the page cache readahead, which will be present in the krbd 
> stack but not with the rbd ioengine. I would have expected the guest OS to 
> normalize that some due to its own page cache in the librbd test, but that 
> might at least give you some more clues about where to look further.
> 
> 
> 
>   Steve Taylor | Senior Software 
> Engineer | StorageCraft Technology Corporation 
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799  |
> 
> 
> 
> If you are not the intended recipient of this message or received it 
> erroneously, please notify the sender and delete it, together with any 
> attachments, and be advised that any dissemination or copying of this message 
> is prohibited.
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of Phil Lacroute
> Sent: Thursday, February 16, 2017 11:54 AM
> To: ceph-users@lists.ceph.com 
> Subject: [ceph-users] KVM/QEMU rbd read latency
> 
> Hi,
> 
> I am doing some performance characterization experiments for ceph with KVM 
> guests, and I’m observing significantly higher read latency when using the 
> QEMU rbd client compared to krbd.  Is that expected or have I missed some 
> tuning knobs to improve this?
> 
> Cluster details:
> Note that this cluster was built for evaluation purposes, not production, 
> hence the choice of small SSDs with low endurance specs.
> Client host OS: Debian, 4.7.0 kernel
> QEMU version 2.7.0
> Ceph version Jewel 10.2.3
> Client and OSD CPU: Xeon D-1541 2.1 GHz
> OSDs: 5 nodes, 3 SSDs each, one journal partition and one data partition per 
> SSD, XFS data file system (15 OSDs total)
> Disks: DC S3510 240GB
> Network: 10 GbE, dedicated switch for storage traffic Guest OS: Debian, 
> virtio drivers
> 
> Performance testing was done with fio on raw disk devices using this config:
> ioengine=libaio
> iodepth=128
> direct=1
> size=100%
> rw=randread
> bs=4k
> 
> Case 1: krbd, fio running on the raw rbd device on the client host (no guest)
> IOPS: 142k
> Average latency: 0.9 msec
> 
> Case 2: krbd, fio running in a guest (libvirt config below)
>
>  [libvirt <disk> XML for the krbd-mapped device was stripped by the list archive]
>
> IOPS: 119k
> Average Latency: 1.1 msec
> 
> Case 3: QEMU RBD client, fio running in a guest (libvirt config below)
>
>  [libvirt <disk> XML for the QEMU/librbd device was stripped by the list archive]
>
> IOPS: 25k
> Average Latency: 5.2 msec
> 
> The question is why the test with the QEMU RBD client (case 3) shows 4 msec 
> of additional latency compared the guest using the krbd-mapped image (case 2).
> 
> Note that the IOPS bottleneck for all of these cases is the rate at which the 
> client issues requests, which is limited by the average latency and the 
> maximum number of outstanding requests (128).  Since the latency is the 
> dominant factor in average read throughput for these small accesses, we would 
> really like to understand the source of the additional latency.
> 
> Thanks,
> Phil
> 
> 
> 
> 
> 
> 

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Shinobu Kinjo
You may need to increase ``choose_total_tries`` from the default of 50 up
to 100; a rough sketch of the edit follows the links below.

 - 
http://docs.ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map

 - https://github.com/ceph/ceph/blob/master/doc/man/8/crushtool.rst
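
Roughly, the edit looks like this (a sketch; the map file names are arbitrary):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # in crushmap.txt change:  tunable choose_total_tries 50  ->  tunable choose_total_tries 100
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new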

On Sat, Feb 18, 2017 at 5:25 AM, Matyas Koszik  wrote:
>
> I have size=2 and 3 independent nodes. I'm happy to try firefly tunables,
> but a bit scared that it would make things even worse.
>
>
> On Fri, 17 Feb 2017, Gregory Farnum wrote:
>
>> Situations that are stable with lots of undersized PGs like this generally
>> mean that the CRUSH map is failing to allocate enough OSDs for certain
>> PGs. The log you have says the OSD is trying to NOTIFY the new primary
>> that the PG exists here on this replica.
>>
>> I'd guess you only have 3 hosts and are trying to place all your
>> replicas on independent boxes. Bobtail tunables have trouble with that
>> and you're going to need to pay the cost of moving to more modern
>> ones.
>> -Greg
>>
>> On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
>> >
>> >
>> > I'm not sure what variable should I be looking at exactly, but after
>> > reading through all of them I don't see anything suspicious, all values are
>> > 0. I'm attaching it anyway, in case I missed something:
>> > https://atw.hu/~koszik/ceph/osd26-perf
>> >
>> >
>> > I tried debugging the ceph pg query a bit more, and it seems that it
>> > gets stuck communicating with the mon - it doesn't even try to connect to
>> > the osd. This is the end of the log:
>> >
>> > 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
>> > {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"...,
>> >  53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
>> > {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
>> > msg_flags=0}, MSG_NOSIGNAL) = 98
>> > 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, 
>> > NULL, NULL) = 9
>> > 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> > {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, 
>> > MSG_NOSIGNAL) = 9
>> > 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, 
>> > MSG_DONTWAIT, NULL, NULL) = 9
>> > 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> > {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
>> > = 9
>> > 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, 
>> > NULL, NULL) = 9
>> > 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> > {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 
>> > 9
>> > 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, 
>> > NULL, NULL) = 9
>> >
>> > ... and this goes on for as long as I let it. When I kill it, I get this:
>> > RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
>> > "pgid": "6.245"}']": exception 'int' object is not iterable
>> >
>> > I restarted (again) osd26 with max debugging; after grepping for 6.245,
>> > this is the log I get:
>> > https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
>> >
>> > Matyas
>> >
>> >
>> > On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
>> >
>> >> If the PG cannot be queried I would bet on OSD message throttler. Check 
>> >> with "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which 
>> >> is holding this PG  if message throttler current value is not equal max. 
>> >> If it is, increase the max value in ceph.conf and restart OSD.
>> >>
>> >> --
>> >> Tomasz Kuzemko
>> >> tomasz.kuze...@corp.ovh.com
>> >>
> >> >> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
>> >>
>> >> >
>> >> > Hi,
>> >> >
>> >> > It seems that my ceph cluster is in an erroneous state of which I cannot
>> >> > see right now how to get out of.
>> >> >
>> >> > The status is the following:
>> >> >
>> >> > health HEALTH_WARN
>> >> >   25 pgs degraded
>> >> >   1 pgs stale
>> >> >   26 pgs stuck unclean
>> >> >   25 pgs undersized
>> >> >   recovery 23578/9450442 objects degraded (0.249%)
>> >> >   recovery 45/9450442 objects misplaced (0.000%)
>> >> >   crush map has legacy tunables (require bobtail, min is firefly)
>> >> > monmap e17: 3 mons at x
>> >> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
>> >> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
>> >> >   flags require_jewel_osds
>> >> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
>> >> >   36750 GB used, 61947 GB / 98697 GB avail
>> >> >   23578/9450442 objects degraded (0.249%)
>> >> >   45/9450442 objects misplaced (0.000%)
>> >> >   4362 active+clean
>> >> > 24 active+undersized+degraded
>> >> >  1 stale+active+undersized+degraded+remapped
>> >> >  1 active+remapped
>> >> >
>> >> >
>> >> > I tried restarting all OSDs, to no avail, it actually made things a bit
>> >> > worse.
>> >> > From a user point of 

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Matyas Koszik


I set it to 100, then restarted osd26, but after recovery everything is as
it was before.
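
A quick way to confirm the value that is actually in force, for example:

  ceph osd crush show-tunables | grep choose_total_tries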


On Sat, 18 Feb 2017, Shinobu Kinjo wrote:

> You may need to increase ``choose_total_tries`` to more than 50
> (default) up to 100.
>
>  - 
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map
>
>  - https://github.com/ceph/ceph/blob/master/doc/man/8/crushtool.rst
>
> On Sat, Feb 18, 2017 at 5:25 AM, Matyas Koszik  wrote:
> >
> > I have size=2 and 3 independent nodes. I'm happy to try firefly tunables,
> > but a bit scared that it would make things even worse.
> >
> >
> > On Fri, 17 Feb 2017, Gregory Farnum wrote:
> >
> >> Situations that are stable with lots of undersized PGs like this generally
> >> mean that the CRUSH map is failing to allocate enough OSDs for certain
> >> PGs. The log you have says the OSD is trying to NOTIFY the new primary
> >> that the PG exists here on this replica.
> >>
> >> I'd guess you only have 3 hosts and are trying to place all your
> >> replicas on independent boxes. Bobtail tunables have trouble with that
> >> and you're going to need to pay the cost of moving to more modern
> >> ones.
> >> -Greg
> >>
> >> On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
> >> >
> >> >
> >> > I'm not sure what variable should I be looking at exactly, but after
> >> > reading through all of them I don't see anything suspicious, all values are
> >> > 0. I'm attaching it anyway, in case I missed something:
> >> > https://atw.hu/~koszik/ceph/osd26-perf
> >> >
> >> >
> >> > I tried debugging the ceph pg query a bit more, and it seems that it
> >> > gets stuck communicating with the mon - it doesn't even try to connect to
> >> > the osd. This is the end of the log:
> >> >
> >> > 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
> >> > {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"...,
> >> >  53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
> >> > {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
> >> > msg_flags=0}, MSG_NOSIGNAL) = 98
> >> > 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, 
> >> > NULL, NULL) = 9
> >> > 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> >> > {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, 
> >> > MSG_NOSIGNAL) = 9
> >> > 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, 
> >> > MSG_DONTWAIT, NULL, NULL) = 9
> >> > 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> >> > {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, 
> >> > MSG_NOSIGNAL) = 9
> >> > 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, 
> >> > MSG_DONTWAIT, NULL, NULL) = 9
> >> > 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> >> > {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
> >> > = 9
> >> > 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, 
> >> > NULL, NULL) = 9
> >> >
> >> > ... and this goes on for as long as I let it. When I kill it, I get this:
> >> > RuntimeError: "None": exception "['{"prefix": 
> >> > "get_command_descriptions", "pgid": "6.245"}']": exception 'int' object 
> >> > is not iterable
> >> >
> >> > I restarted (again) osd26 with max debugging; after grepping for 6.245,
> >> > this is the log I get:
> >> > https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
> >> >
> >> > Matyas
> >> >
> >> >
> >> > On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
> >> >
> >> >> If the PG cannot be queried I would bet on OSD message throttler. Check 
> >> >> with "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD 
> >> >> which is holding this PG  if message throttler current value is not 
> >> >> equal max. If it is, increase the max value in ceph.conf and restart 
> >> >> OSD.
> >> >>
> >> >> --
> >> >> Tomasz Kuzemko
> >> >> tomasz.kuze...@corp.ovh.com
> >> >>
> >> >> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
> >> >>
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > It seems that my ceph cluster is in an erroneous state of which I 
> >> >> > cannot
> >> >> > see right now how to get out of.
> >> >> >
> >> >> > The status is the following:
> >> >> >
> >> >> > health HEALTH_WARN
> >> >> >   25 pgs degraded
> >> >> >   1 pgs stale
> >> >> >   26 pgs stuck unclean
> >> >> >   25 pgs undersized
> >> >> >   recovery 23578/9450442 objects degraded (0.249%)
> >> >> >   recovery 45/9450442 objects misplaced (0.000%)
> >> >> >   crush map has legacy tunables (require bobtail, min is firefly)
> >> >> > monmap e17: 3 mons at x
> >> >> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
> >> >> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
> >> >> >   flags require_jewel_osds
> >> >> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
> >> >> >   36750 GB used, 61947 GB / 98697 GB avail
> >> >> >   23578/9450442 objects 

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Matyas Koszik


Looks like you've provided me with the solution, thanks!
I've set the tunables to firefly, and now I only see the normal states
associated with a recovering cluster; there are no more stale pgs.
I hope it'll stay like this when it's done, but that'll take quite a
while.
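
If client I/O suffers while it churns, recovery can be watched and, if
needed, throttled along these lines (the values are only illustrative):

  ceph -s        # or: ceph -w, to follow progress live
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'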

Matyas


On Fri, 17 Feb 2017, Gregory Farnum wrote:

> Situations that are stable with lots of undersized PGs like this generally
> mean that the CRUSH map is failing to allocate enough OSDs for certain
> PGs. The log you have says the OSD is trying to NOTIFY the new primary
> that the PG exists here on this replica.
>
> I'd guess you only have 3 hosts and are trying to place all your
> replicas on independent boxes. Bobtail tunables have trouble with that
> and you're going to need to pay the cost of moving to more modern
> ones.
> -Greg
>
> On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
> >
> >
> > I'm not sure what variable should I be looking at exactly, but after
> > reading through all of them I don't see anything suspicious, all values are
> > 0. I'm attaching it anyway, in case I missed something:
> > https://atw.hu/~koszik/ceph/osd26-perf
> >
> >
> > I tried debugging the ceph pg query a bit more, and it seems that it
> > gets stuck communicating with the mon - it doesn't even try to connect to
> > the osd. This is the end of the log:
> >
> > 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
> > {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"..., 
> > 53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
> > {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
> > msg_flags=0}, MSG_NOSIGNAL) = 98
> > 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> > 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> > {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
> > = 9
> > 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> > 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> > {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
> > = 9
> > 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> > 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
> > {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 9
> > 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, 
> > NULL, NULL) = 9
> >
> > ... and this goes on for as long as I let it. When I kill it, I get this:
> > RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
> > "pgid": "6.245"}']": exception 'int' object is not iterable
> >
> > I restarted (again) osd26 with max debugging; after grepping for 6.245,
> > this is the log I get:
> > https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
> >
> > Matyas
> >
> >
> > On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
> >
> >> If the PG cannot be queried I would bet on OSD message throttler. Check 
> >> with "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which 
> >> is holding this PG  if message throttler current value is not equal max. 
> >> If it is, increase the max value in ceph.conf and restart OSD.
> >>
> >> --
> >> Tomasz Kuzemko
> >> tomasz.kuze...@corp.ovh.com
> >>
> >> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
> >>
> >> >
> >> > Hi,
> >> >
> >> > It seems that my ceph cluster is in an erroneous state of which I cannot
> >> > see right now how to get out of.
> >> >
> >> > The status is the following:
> >> >
> >> > health HEALTH_WARN
> >> >   25 pgs degraded
> >> >   1 pgs stale
> >> >   26 pgs stuck unclean
> >> >   25 pgs undersized
> >> >   recovery 23578/9450442 objects degraded (0.249%)
> >> >   recovery 45/9450442 objects misplaced (0.000%)
> >> >   crush map has legacy tunables (require bobtail, min is firefly)
> >> > monmap e17: 3 mons at x
> >> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
> >> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
> >> >   flags require_jewel_osds
> >> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
> >> >   36750 GB used, 61947 GB / 98697 GB avail
> >> >   23578/9450442 objects degraded (0.249%)
> >> >   45/9450442 objects misplaced (0.000%)
> >> >   4362 active+clean
> >> > 24 active+undersized+degraded
> >> >  1 stale+active+undersized+degraded+remapped
> >> >  1 active+remapped
> >> >
> >> >
> >> > I tried restarting all OSDs, to no avail, it actually made things a bit
> >> > worse.
> >> > From a user point of view the cluster works perfectly (apart from that
> >> > stale pg, which fortunately hit the pool on which I keep swap images
> >> > only).
> >> >
> >> > A little background: I made the mistake of creating the cluster with
> >> > size=2 pools, which I'm now in the 

Re: [ceph-users] pgs stuck unclean

2017-02-17 Thread Shinobu Kinjo
On Sat, Feb 18, 2017 at 9:03 AM, Matyas Koszik  wrote:
>
>
> Looks like you've provided me with the solution, thanks!

:)

> I've set the tunables to firefly, and now I only see the normal states
> associated with a recovering cluster, there're no more stale pgs.
> I hope it'll stay like this when it's done, but that'll take quite a
> while.
>
> Matyas
>
>
> On Fri, 17 Feb 2017, Gregory Farnum wrote:
>
>> Situations that are stable with lots of undersized PGs like this generally
>> mean that the CRUSH map is failing to allocate enough OSDs for certain
>> PGs. The log you have says the OSD is trying to NOTIFY the new primary
>> that the PG exists here on this replica.
>>
>> I'd guess you only have 3 hosts and are trying to place all your
>> replicas on independent boxes. Bobtail tunables have trouble with that
>> and you're going to need to pay the cost of moving to more modern
>> ones.
>> -Greg
>>
>> On Fri, Feb 17, 2017 at 5:30 AM, Matyas Koszik  wrote:
>> >
>> >
>> > I'm not sure what variable should I be looking at exactly, but after
>> > reading through all of them I don't see anything suspicious, all values are
>> > 0. I'm attaching it anyway, in case I missed something:
>> > https://atw.hu/~koszik/ceph/osd26-perf
>> >
>> >
>> > I tried debugging the ceph pg query a bit more, and it seems that it
>> > gets stuck communicating with the mon - it doesn't even try to connect to
>> > the osd. This is the end of the log:
>> >
>> > 13:36:07.006224 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"\7", 1}, 
>> > {"\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\17\0\177\0\2\0\27\0\0\0\0\0\0\0\0\0"...,
>> >  53}, {"\1\0\0\0\6\0\0\0osdmap9\4\1\0\0\0\0\0\1", 23}, 
>> > {"\255UC\211\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1", 21}], msg_controllen=0, 
>> > msg_flags=0}, MSG_NOSIGNAL) = 98
>> > 13:36:07.207010 recvfrom(3, "\10\6\0\0\0\0\0\0\0", 4096, MSG_DONTWAIT, 
>> > NULL, NULL) = 9
>> > 13:36:09.963843 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> > {"9\356\246X\245\330r9", 8}], msg_controllen=0, msg_flags=0}, 
>> > MSG_NOSIGNAL) = 9
>> > 13:36:09.964340 recvfrom(3, "\0179\356\246X\245\330r9", 4096, 
>> > MSG_DONTWAIT, NULL, NULL) = 9
>> > 13:36:19.964154 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> > {"C\356\246X\24\226w9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) 
>> > = 9
>> > 13:36:19.964573 recvfrom(3, "\17C\356\246X\24\226w9", 4096, MSG_DONTWAIT, 
>> > NULL, NULL) = 9
>> > 13:36:29.964439 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"\16", 1}, 
>> > {"M\356\246X|\353{9", 8}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 
>> > 9
>> > 13:36:29.964938 recvfrom(3, "\17M\356\246X|\353{9", 4096, MSG_DONTWAIT, 
>> > NULL, NULL) = 9
>> >
>> > ... and this goes on for as long as I let it. When I kill it, I get this:
>> > RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
>> > "pgid": "6.245"}']": exception 'int' object is not iterable
>> >
>> > I restarted (again) osd26 with max debugging; after grepping for 6.245,
>> > this is the log I get:
>> > https://atw.hu/~koszik/ceph/ceph-osd.26.log.6245
>> >
>> > Matyas
>> >
>> >
>> > On Fri, 17 Feb 2017, Tomasz Kuzemko wrote:
>> >
>> >> If the PG cannot be queried I would bet on OSD message throttler. Check 
>> >> with "ceph --admin-daemon PATH_TO_ADMIN_SOCK perf dump" on each OSD which 
>> >> is holding this PG  if message throttler current value is not equal max. 
>> >> If it is, increase the max value in ceph.conf and restart OSD.
>> >>
>> >> --
>> >> Tomasz Kuzemko
>> >> tomasz.kuze...@corp.ovh.com
>> >>
>> >> Dnia 17.02.2017 o godz. 01:59 Matyas Koszik  napisał(a):
>> >>
>> >> >
>> >> > Hi,
>> >> >
>> >> > It seems that my ceph cluster is in an erroneous state of which I cannot
>> >> > see right now how to get out of.
>> >> >
>> >> > The status is the following:
>> >> >
>> >> > health HEALTH_WARN
>> >> >   25 pgs degraded
>> >> >   1 pgs stale
>> >> >   26 pgs stuck unclean
>> >> >   25 pgs undersized
>> >> >   recovery 23578/9450442 objects degraded (0.249%)
>> >> >   recovery 45/9450442 objects misplaced (0.000%)
>> >> >   crush map has legacy tunables (require bobtail, min is firefly)
>> >> > monmap e17: 3 mons at x
>> >> >   election epoch 8550, quorum 0,1,2 store1,store3,store2
>> >> > osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
>> >> >   flags require_jewel_osds
>> >> > pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
>> >> >   36750 GB used, 61947 GB / 98697 GB avail
>> >> >   23578/9450442 objects degraded (0.249%)
>> >> >   45/9450442 objects misplaced (0.000%)
>> >> >   4362 active+clean
>> >> > 24 active+undersized+degraded
>> >> >  1 stale+active+undersized+degraded+remapped
>> >> >  1 active+remapped
>> >> >
>> >> >
>> >> > I tried restarting all OSDs, to no avail, it actually made things a bit
>> >> > worse.
>> >> > From a user point of view the cluster works perfectly (apart from that
>> >> > stale pg, which fo

[ceph-users] How safe is ceph pg repair these days?

2017-02-17 Thread Tracy Reed
I have a 3-replica cluster. A couple of times I have run into inconsistent
PGs. I googled it, and the Ceph docs and various blogs say to run a repair
first. But a couple of people on IRC and a mailing list thread from 2015
say that ceph blindly copies the primary over the secondaries and calls
it good.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html

I sure hope that isn't the case. If so it would seem highly
irresponsible to implement such a naive command called "repair". I have
recently learned how to properly analyze the OSD logs and manually fix
these things but not before having run repair on a dozen inconsistent
PGs. Now I'm worried about what sort of corruption I may have
introduced. Repairing things by hand is a simple heuristic based on
comparing the size or checksum (as indicated by the logs) for each of
the 3 copies and figuring out which is correct. Presumably matching two
out of three should win and the odd object out should be deleted since
having the exact same kind of error on two different OSDs is highly
improbable. I don't understand why ceph repair wouldn't have done this
all along.
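
For what it's worth, on Jewel and later you can at least see what scrub
flagged before running anything; a sketch, with the pool name, PG id and
primary OSD id as placeholders:

  rados list-inconsistent-pg <pool>                        # which PGs scrub flagged
  rados list-inconsistent-obj <pgid> --format=json-pretty  # per-copy size/checksum/omap errors
  grep <pgid> /var/log/ceph/ceph-osd.<primary>.log         # scrub errors are logged on the primary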

What is the current best practice in the use of ceph repair?

Thanks!

-- 
Tracy Reed


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Experience with 5k RPM/archive HDDs

2017-02-17 Thread Mike Miller

Hi,

Don't go there. We tried this with SMR drives, which slow down to
somewhere around 2-3 IOPS during backfilling/recovery, and that renders
the cluster useless for client IO. Things might change in the future,
but for now I would strongly recommend against SMR.


Go for normal SATA drives with only slightly higher price/capacity ratios.

- mike

On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:

On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander"
 wrote:




Op 3 februari 2017 om 11:03 schreef Maxime Guyot
:


Hi,

Interesting feedback!

  > In my opinion the SMR can be used exclusively for the RGW.
  > Unless it's something like a backup/archive cluster or pool with
little to none concurrent R/W access, you're likely to run out of IOPS
(again) long before filling these monsters up.

That's exactly the use case I am considering those archive HDDs for:
something like AWS Glacier, a form of offsite backup probably via
radosgw. The classic Seagate enterprise class HDD provide "too much"
performance for this use case, I could live with 1/4 of the performance
for that price point.



If you go down that route I suggest that you make a mixed cluster for RGW.

A (small) set of OSDs running on top of proper SSDs, eg Samsung SM863 or
PM863 or a Intel DC series.

All pools by default should go to those OSDs.

Only the RGW buckets data pool should go to the big SMR drives. However,
again, expect very, very low performance of those disks.


One of the other concerns you should think about is recovery time when one
of these drives fail.  The more OSDs you have, the less this becomes an
issue, but on a small cluster it might take over a day to fully recover
from an OSD failure.  Which is a decent amount of time to have degraded
PGs.

Bryan

E-MAIL CONFIDENTIALITY NOTICE:
The contents of this e-mail message and any attachments are intended solely for 
the addressee(s) and may contain confidential and/or legally privileged 
information. If you are not the intended recipient of this message or if this 
message has been addressed to you in error, please immediately alert the sender 
by reply e-mail and then delete this message and any attachments. If you are 
not the intended recipient, you are notified that any use, dissemination, 
distribution, copying, or storage of this message or any attachment is strictly 
prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM/QEMU rbd read latency

2017-02-17 Thread Jason Dillaman
On Fri, Feb 17, 2017 at 3:35 PM, Phil Lacroute 
wrote:

> I have a followup question about the debug logging.  Is there any way to
> dump the in-memory logs from the QEMU RBD client?  If not (and I couldn’t
> find a way to do this), then nothing is lost by disabling the logging on
> client machines.
>

If you have the admin socket properly configured for client applications
(via the "admin socket" config option), you can run "ceph --admin-daemon
/path/to/asok log dump".
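
For completeness, a sketch of the client-side configuration that makes
this possible (paths and the log file name are just examples; the QEMU
process needs write access to the socket directory):

  [client]
      admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
      log file = /var/log/ceph/qemu-client.$pid.log

  # then, per running VM, on the hypervisor:
  ls /var/run/ceph/                                # find the guest's .asok
  ceph --admin-daemon /var/run/ceph/<guest>.asok log dump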


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How safe is ceph pg repair these days?

2017-02-17 Thread Shinobu Kinjo
if ``ceph pg deep-scrub <pgid>`` does not work
then
  do
``ceph pg repair <pgid>``


On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed  wrote:
> I have a 3 replica cluster. A couple times I have run into inconsistent
> PGs. I googled it and ceph docs and various blogs say run a repair
> first. But a couple people on IRC and a mailing list thread from 2015
> say that ceph blindly copies the primary over the secondaries and calls
> it good.
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html
>
> I sure hope that isn't the case. If so it would seem highly
> irresponsible to implement such a naive command called "repair". I have
> recently learned how to properly analyze the OSD logs and manually fix
> these things but not before having run repair on a dozen inconsistent
> PGs. Now I'm worried about what sort of corruption I may have
> introduced. Repairing things by hand is a simple heuristic based on
> comparing the size or checksum (as indicated by the logs) for each of
> the 3 copies and figuring out which is correct. Presumably matching two
> out of three should win and the odd object out should be deleted since
> having the exact same kind of error on two different OSDs is highly
> improbable. I don't understand why ceph repair wouldn't have done this
> all along.
>
> What is the current best practice in the use of ceph repair?
>
> Thanks!
>
> --
> Tracy Reed
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How safe is ceph pg repair these days?

2017-02-17 Thread Tracy Reed
Well, that's the question...is that safe? Because the link to the
mailing list post (possibly outdated) says that what you just suggested
is definitely NOT safe. Is the mailing list post wrong? Has the
situation changed? Exactly what does ceph repair do now? I suppose I
could go dig into the code but I'm not an expert and would hate to get
it wrong and post possibly bogus info to the list for other newbies to
find and worry about and possibly lose their data.

On Fri, Feb 17, 2017 at 06:08:39PM PST, Shinobu Kinjo spake thusly:
> if ``ceph pg deep-scrub `` does not work
> then
>   do
> ``ceph pg repair 
> 
> 
> On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed  wrote:
> > I have a 3 replica cluster. A couple times I have run into inconsistent
> > PGs. I googled it and ceph docs and various blogs say run a repair
> > first. But a couple people on IRC and a mailing list thread from 2015
> > say that ceph blindly copies the primary over the secondaries and calls
> > it good.
> >
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html
> >
> > I sure hope that isn't the case. If so it would seem highly
> > irresponsible to implement such a naive command called "repair". I have
> > recently learned how to properly analyze the OSD logs and manually fix
> > these things but not before having run repair on a dozen inconsistent
> > PGs. Now I'm worried about what sort of corruption I may have
> > introduced. Repairing things by hand is a simple heuristic based on
> > comparing the size or checksum (as indicated by the logs) for each of
> > the 3 copies and figuring out which is correct. Presumably matching two
> > out of three should win and the odd object out should be deleted since
> > having the exact same kind of error on two different OSDs is highly
> > improbable. I don't understand why ceph repair wouldn't have done this
> > all along.
> >
> > What is the current best practice in the use of ceph repair?
> >
> > Thanks!
> >
> > --
> > Tracy Reed
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

-- 
Tracy Reed


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com