Re: [ceph-users] snaps & consistency group

2016-05-03 Thread Yair Magnezi
Thank you Jason .

Are RBD volume consistency groups supported in Jewel? Can we take
consistent snapshots for a volume consistency group?
The implementation target is OpenStack -->
http://docs.openstack.org/admin-guide/blockstorage-consistency-groups.html

Thanks Again .


Yair Magnezi
Storage & Data Protection TL // Kenshoo
Office +972 7 32862423 // Mobile +972 50 575-2955



On Tue, May 3, 2016 at 12:31 AM, Jason Dillaman  wrote:

> There is no current capability to support snapshot consistency groups
> within RBD; however, support for snapshot consistency groups is
> currently being developed for the Ceph kraken release.
>
> On Sun, May 1, 2016 at 11:04 AM, Yair Magnezi 
> wrote:
> > Hello Guys .
> >
> > I'm a little bit confused about Ceph's capability to take consistent
> > snapshots (more than one RBD image).
> >
> > Is there a way to do this? (We're running Hammer right now.)
> >
> > Thanks
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Web based S3 client

2016-05-03 Thread 张灿
S3 is originally Amazon's protocol, so the details can be found in Amazon's
documentation. If you want to know more about the RESTful API, see
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketPUTcors.html
Sree has now been updated with a bucket action to help set up bucket CORS. You
might be interested in checking out the new version: https://github.com/cannium/Sree
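
As a concrete example, one way to apply a CORS rule to an RGW bucket from the
command line might be the AWS CLI pointed at the RGW endpoint; the bucket name,
endpoint URL, port and credentials below are only placeholders:

# cors.json -- a permissive example rule set:
# { "CORSRules": [ { "AllowedOrigins": ["*"],
#                    "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
#                    "AllowedHeaders": ["*"],
#                    "MaxAgeSeconds": 3000 } ] }
export AWS_ACCESS_KEY_ID=<s3-access-key>
export AWS_SECRET_ACCESS_KEY=<s3-secret-key>
aws s3api put-bucket-cors --bucket my-bucket \
    --cors-configuration file://cors.json \
    --endpoint-url http://rgw.example.com:7480
# verify what was stored
aws s3api get-bucket-cors --bucket my-bucket --endpoint-url http://rgw.example.com:7480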

Cheers,
Can ZHANG

On May 3, 2016, at 11:07, Ben Hines <bhi...@gmail.com> wrote:

Can you provide info on how to set up CORS for a Ceph bucket? The docs you
linked to are Amazon-specific.

On Mon, May 2, 2016 at 8:05 PM, Ben Hines <bhi...@gmail.com> wrote:
IMO, having to enter your S3 key/secret makes deployment difficult for users.
Everyone knows their LDAP login -- no one knows the S3 keys. We'd most likely
only use this for Ceph debugging purposes.

Perhaps i can implement this myself, thanks :)

-Ben

On Mon, May 2, 2016 at 7:48 PM, Can Zhang (张灿) <zhang...@le.com> wrote:
Hi Ben,

Actually, if not for CORS issues, we would have implemented Sree as a JavaScript
app running fully in the browser. By design the backend should be as simple as
possible (currently less than 150 lines of Python) to ease deployment for new
users. So for now we won't support LDAP auth, but if you need help developing
your own auth mechanism, don't hesitate to email me.


Cheers,
Can ZHANG


On Apr 30, 2016, at 01:54, Ben Hines <bhi...@gmail.com> wrote:

This is nice.  For my use case, I'd love to see LDAP authentication for the 
client frontend with the S3 credentials hidden from the user.

I could probably even handle the LDAP auth part with nginx, if the Sree
application saved its configuration on the server side.

-Ben

On Thu, Apr 28, 2016 at 7:29 PM, Can Zhang (张灿) <zhang...@le.com> wrote:
Hi,

In order to help new users get hands-on with S3, we developed a web-based S3
client called “Sree”, and we hope to see whether it could become part of Ceph.
We currently host the project at:
Currently we host the project at:

https://github.com/cannium/Sree

Users can use Sree to manage their files in the browser through Ceph’s S3
interface. I think it’s friendlier for new users than s3cmd, and it would help
Ceph reach more users.

Any suggestions are welcome. I hope to see your replies.


Cheers,
Can ZHANG

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lab Newbie Here: Where do I start?

2016-05-03 Thread Michael Ferguson
Thanks.

Warm Regards.

 

 

From: Tu Holmes [mailto:tu.hol...@gmail.com] 
Sent: Tuesday, May 03, 2016 1:38 AM
To: ceph-users@lists.ceph.com; Michael Ferguson 
Subject: Re: [ceph-users] Lab Newbie Here: Where do I start?

 

I would start here. 

 

https://www.redhat.com/en/resources/red-hat-ceph-storage-hardware-configuration-guide

 

 

//Tu

 

_
From: Michael Ferguson <fergu...@eastsidemiami.com>
Sent: Monday, May 2, 2016 12:30 PM
Subject: [ceph-users] Lab Newbie Here: Where do I start?
To: <ceph-users@lists.ceph.com>



G’Day All,

 

I have two old Promise VTrak E310s JBODs (still with support), each with 4 x
600GB Seagate SAS HDDs and 8 x 2TB SATA HDDs, and two old HP DL360s.

While I am seeing plenty of "ceph-deploy this" and "ceph-deploy that", I have
not found any help that starts with the hardware.

There seem to be lots of assumptions about the hardware.

Can anyone provide some directional oversight on getting the hardware going so
as to accept Ceph in an HA setting?

For example, how should all these drives be provisioned on each VTrak and
served up to the HP DL360: RAID or no RAID?

I plan to use CentOS

Once Ceph is installed and has control of the storage from the VTraks, I plan
to install VirtualBox or Oracle VM or VMware or anything else that I can use at
zero or minimal cost.

I am not averse to paying for some instructional help in this regard.

Please advise

 

Ferguson, 

“First, your place, and then, the world’s”

“Good work ain’t cheap, and cheap work ain’t good”
  fergu...@eastsidemiami.com

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure pool performance expectations

2016-05-03 Thread Nick Fisk
Hi Peter,


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Peter Kerdisle
> Sent: 02 May 2016 08:17
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Erasure pool performance expectations
> 
> Hi guys,
> 
> I am currently testing the performance of RBD using a cache pool and a 4/2
> erasure profile pool.
> 
> I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data) with
> 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public and 10Gbit
> cluster network for the erasure pool (10x3TB without separate journal). This
> is all on Jewel.
> 
> What I would like to know is if the performance I'm seeing is to be expected
> and if there is some way to test this in a more qualifiable way.
> 
> Everything works as expected if the files are present on the cache pool,
> however when things need to be retrieved from the cache pool I see
> performance degradation. I'm trying to simulate real usage as much as
> possible and trying to retrieve files from the RBD volume over FTP from a
> client server. What I'm seeing is that the FTP transfer will stall for 
> seconds at a
> time and then get some more data which results in an average speed of
> 200KB/s. From the cache this is closer to 10MB/s. Is this the expected
> behaviour from a erasure coded tier with cache in front?

Unfortunately, yes. The whole erasure/cache combination only really works well
if the data in the EC tier is accessed infrequently; otherwise the overheads of
cache promotion/flushing quickly bring the cluster to its knees. However, it
looks as though you are mainly doing reads, which means you can probably alter
your cache settings to not promote so aggressively on reads, as reads can be
proxied through to the EC tier instead of promoting. This should reduce the
number of required cache promotions.

Can you try setting min_read_recency_for_promote to something higher?

Also, can you check what your hit_set_period and hit_set_count are currently
set to?
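
For example, the relevant values can be read and adjusted per pool with the
standard pool commands (the cache pool name here is just a placeholder):

ceph osd pool get cache-pool hit_set_count
ceph osd pool get cache-pool hit_set_period
ceph osd pool get cache-pool min_read_recency_for_promote
# require more read recency before promoting, so single reads get proxied
ceph osd pool set cache-pool min_read_recency_for_promote 2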


> Right now I'm unsure how to scientifically test the performance retrieving
> files when there is a cache miss. If somebody could point me towards a
> better way of doing that I would appreciate the help.
> 
> An other thing is that I'm seeing a lot of messages popping up in dmesg on
> my client server on which the RBD volumes are mounted. (IPs removed)
> 
> [685881.477383] libceph: osd50 :6800 socket closed (con state OPEN)
> [685895.597733] libceph: osd54 :6808 socket closed (con state OPEN)
> [685895.663971] libceph: osd54 :6808 socket closed (con state OPEN)
> [685895.710424] libceph: osd54 :6808 socket closed (con state OPEN)
> [685895.749417] libceph: osd54 :6808 socket closed (con state OPEN)
> [685896.517778] libceph: osd54 :6808 socket closed (con state OPEN)
> [685906.690445] libceph: osd74 :6824 socket closed (con state OPEN)
> 
> Is this a symptom of something?

These are just stale connections to the OSDs timing out after the idle period
and are nothing to worry about.

> 
> Thanks in advance,
> 
> Peter


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure pool performance expectations

2016-05-03 Thread Peter Kerdisle
Hey Nick,

Thanks for taking the time to answer my questions. Some in-line comments.

On Tue, May 3, 2016 at 10:51 AM, Nick Fisk  wrote:

> Hi Peter,
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Peter Kerdisle
> > Sent: 02 May 2016 08:17
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] Erasure pool performance expectations
> >
> > Hi guys,
> >
> > I am currently testing the performance of RBD using a cache pool and a
> 4/2
> > erasure profile pool.
> >
> > I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data) with
> > 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public and 10Gbit
> > cluster network for the erasure pool (10x3TB without separate journal).
> This
> > is all on Jewel.
> >
> > What I would like to know is if the performance I'm seeing is to be
> expected
> > and if there is some way to test this in a more qualifiable way.
> >
> > Everything works as expected if the files are present on the cache pool,
> > however when things need to be retrieved from the cache pool I see
> > performance degradation. I'm trying to simulate real usage as much as
> > possible and trying to retrieve files from the RBD volume over FTP from a
> > client server. What I'm seeing is that the FTP transfer will stall for
> seconds at a
> > time and then get some more data which results in an average speed of
> > 200KB/s. From the cache this is closer to 10MB/s. Is this the expected
> > behaviour from a erasure coded tier with cache in front?
>
> Unfortunately yes. The whole Erasure/Cache thing only really works well if
> the data in the EC tier is only accessed infrequently, otherwise the
> overheads in cache promotion/flushing quickly brings the cluster down to
> its knees. However it looks as though you are mainly doing reads, which
> means you can probably alter your cache settings to not promote so
> aggressively on reads, as reads can be proxied through to the EC tier
> instead of promoting. This should reduce the amount of required cache
> promotions.
>

You are correct that reads should have a lower priority for being cached; in an
ideal situation this should only be done when they are read very frequently.


>
> Can you try setting min_read_recency_for promote to something higher?
>

I looked into the setting before, but I must admit its exact purpose still
eludes me. Would it be correct to simplify it as
'min_read_recency_for_promote determines the number of times a piece
would have to be read in a certain interval (set by hit_set_period) in
order to promote it to the caching tier'?


> Also can you check what your hit_set_period and hit_set_count is currently
> set to.
>

hit_set_count is set to 1 and hit_set_period to 1800.

What would increasing the hit_set_count do exactly?


>
> > Right now I'm unsure how to scientifically test the performance
> retrieving
> > files when there is a cache miss. If somebody could point me towards a
> > better way of doing that I would appreciate the help.
> >
> > An other thing is that I'm seeing a lot of messages popping up in dmesg
> on
> > my client server on which the RBD volumes are mounted. (IPs removed)
> >
> > [685881.477383] libceph: osd50 :6800 socket closed (con state OPEN)
> > [685895.597733] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685895.663971] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685895.710424] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685895.749417] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685896.517778] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685906.690445] libceph: osd74 :6824 socket closed (con state OPEN)
> >
> > Is this a symptom of something?
>
> This is just stale connections to the OSD's timing out after the idle
> period and is nothing to worry about.
>

Glad to hear that, I was fearing something might be wrong.

Thanks again.

Peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] snaps & consistency group

2016-05-03 Thread Jason Dillaman
On Tue, May 3, 2016 at 3:20 AM, Yair Magnezi  wrote:
> Are RBD volume consistency groups supported in Jewel? Can we take
> consistent snapshots for a volume consistency group?

No, this feature is being actively worked on for the Kraken release of
Ceph (the next major release after Jewel).
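
In the meantime, a rough manual approximation (not a true consistency group) is
to quiesce I/O on the client and snapshot each image individually; the pool,
image and mount-point names below are hypothetical:

fsfreeze -f /mnt/vol1                      # quiesce writes on every filesystem involved
fsfreeze -f /mnt/vol2
rbd snap create rbd/vol1@group-20160503    # snapshot each image with a common name
rbd snap create rbd/vol2@group-20160503
fsfreeze -u /mnt/vol2                      # resume I/O
fsfreeze -u /mnt/vol1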

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster not recovering after OSD deamon is down

2016-05-03 Thread Gaurav Bafna
Hi Cephers,

I am running a very small cluster of 3 storage and 2 monitor nodes.

After I kill 1 OSD daemon, the cluster never recovers fully. 9 PGs
remain undersized for an unknown reason.

After I restart that 1 OSD daemon, the cluster recovers in no time.

The size of all pools is 3 and min_size is 2.

Can anybody please help ?

Output of  "ceph -s"
cluster fac04d85-db48-4564-b821-deebda046261
 health HEALTH_WARN
9 pgs degraded
9 pgs stuck degraded
9 pgs stuck unclean
9 pgs stuck undersized
9 pgs undersized
recovery 3327/195138 objects degraded (1.705%)
pool .users pg_num 512 > pgp_num 8
 monmap e2: 2 mons at
{dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
 osdmap e857: 69 osds: 68 up, 68 in
  pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
279 GB used, 247 TB / 247 TB avail
3327/195138 objects degraded (1.705%)
 887 active+clean
   9 active+undersized+degraded
  client io 395 B/s rd, 0 B/s wr, 0 op/s

ceph health detail output :

HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
degraded (1.705%); pool .users pg_num 512 > pgp_num 8
pg 7.a is stuck unclean for 322742.938959, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck unclean for 322754.823455, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck unclean for 322750.685684, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck unclean for 322732.665345, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck unclean for 331869.103538, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck unclean for 331871.208948, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck unclean for 331822.771240, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck unclean for 323021.274535, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck unclean for 323007.574395, current state
active+undersized+degraded, last acting [43,1]
pg 7.a is stuck undersized for 322487.284302, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck undersized for 322487.287164, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck undersized for 322487.285566, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck undersized for 322487.287168, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck undersized for 331351.476170, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck undersized for 331351.475707, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck undersized for 322487.280309, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck undersized for 322487.286347, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck undersized for 322487.280027, current state
active+undersized+degraded, last acting [43,1]
pg 7.a is stuck degraded for 322487.284340, current state
active+undersized+degraded, last acting [38,2]
pg 5.27 is stuck degraded for 322487.287202, current state
active+undersized+degraded, last acting [26,19]
pg 5.32 is stuck degraded for 322487.285604, current state
active+undersized+degraded, last acting [39,19]
pg 6.13 is stuck degraded for 322487.287207, current state
active+undersized+degraded, last acting [30,16]
pg 5.4e is stuck degraded for 331351.476209, current state
active+undersized+degraded, last acting [16,38]
pg 5.72 is stuck degraded for 331351.475746, current state
active+undersized+degraded, last acting [16,49]
pg 4.17 is stuck degraded for 322487.280348, current state
active+undersized+degraded, last acting [47,20]
pg 5.2c is stuck degraded for 322487.286386, current state
active+undersized+degraded, last acting [47,18]
pg 5.37 is stuck degraded for 322487.280066, current state
active+undersized+degraded, last acting [43,1]
pg 5.72 is active+undersized+degraded, acting [16,49]
pg 5.4e is active+undersized+degraded, acting [16,38]
pg 5.32 is active+undersized+degraded, acting [39,19]
pg 5.37 is active+undersized+degraded, acting [43,1]
pg 5.2c is active+undersized+degraded, acting [47,18]
pg 5.27 is active+undersized+degraded, acting [26,19]
pg 6.13 is active+undersized+degraded, acting [30,16]
pg 4.17 is active+undersized+degraded, acting [47,20]
pg 7.a is active+undersized+degraded, acting [38,2]
recovery 3327/195138 objects degraded (1.705%)
pool .users pg_num 512 > pgp_num 8


My crush map is default.

Ceph.conf is :

[osd]
osd mkfs type=xfs
osd recovery threads=2
osd disk thread ioprio class=idle
osd disk thread ioprio prio

Re: [ceph-users] Erasure pool performance expectations

2016-05-03 Thread Nick Fisk


> -Original Message-
> From: Peter Kerdisle [mailto:peter.kerdi...@gmail.com]
> Sent: 03 May 2016 12:15
> To: n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Erasure pool performance expectations
> 
> Hey Nick,
> 
> Thanks for taking the time to answer my questions. Some in-line comments.
> 
> On Tue, May 3, 2016 at 10:51 AM, Nick Fisk  wrote:
> Hi Peter,
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> > Peter Kerdisle
> > Sent: 02 May 2016 08:17
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] Erasure pool performance expectations
> >
> > Hi guys,
> >
> > I am currently testing the performance of RBD using a cache pool and a 4/2
> > erasure profile pool.
> >
> > I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data) with
> > 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public and 10Gbit
> > cluster network for the erasure pool (10x3TB without separate journal).
> This
> > is all on Jewel.
> >
> > What I would like to know is if the performance I'm seeing is to be
> expected
> > and if there is some way to test this in a more qualifiable way.
> >
> > Everything works as expected if the files are present on the cache pool,
> > however when things need to be retrieved from the cache pool I see
> > performance degradation. I'm trying to simulate real usage as much as
> > possible and trying to retrieve files from the RBD volume over FTP from a
> > client server. What I'm seeing is that the FTP transfer will stall for 
> > seconds
> at a
> > time and then get some more data which results in an average speed of
> > 200KB/s. From the cache this is closer to 10MB/s. Is this the expected
> > behaviour from a erasure coded tier with cache in front?
> 
> Unfortunately yes. The whole Erasure/Cache thing only really works well if
> the data in the EC tier is only accessed infrequently, otherwise the overheads
> in cache promotion/flushing quickly brings the cluster down to its knees.
> However it looks as though you are mainly doing reads, which means you can
> probably alter your cache settings to not promote so aggressively on reads,
> as reads can be proxied through to the EC tier instead of promoting. This
> should reduce the amount of required cache promotions.
> 
> You are correct that reads have a lower priority of being cached, only when
> they are used very frequently should this be done in an ideal situation.
> 
> 
> Can you try setting min_read_recency_for promote to something higher?
> 
> I looked into the setting before but I must admit it's exact purpose eludes me
> still. Would it be correct to simplify it as 'min_read_recency_for_promote
> determines the amount of times a piece would have to be read in a certain
> interval (set by hit_set_period) in order to promote it to the caching tier' ?

Yes, that's correct. Every hit_set_period (assuming there is I/O going on) a new
hitset is created, up until the hit_set_count limit. The recency defines how
many of the last x hitsets an object must have been accessed in.

Tuning it is a bit of a dark art at the moment, as you have to try to get all
the values to match your workload. For starters try something like:

min_read_recency_for_promote = 2 or 3
hit_set_count = 10
hit_set_period = 60

This means that if an object is read in 2 or 3 hitsets in a row within the
last few minutes it will be promoted. There is no granularity below a single
hitset, so if an object gets hit 1000 times in 1 minute but then nothing for
5 minutes it will not cause a promotion.
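
For instance, those starting values could be applied to the cache pool like
this (the pool name is a placeholder):

ceph osd pool set cache-pool hit_set_count 10
ceph osd pool set cache-pool hit_set_period 60
ceph osd pool set cache-pool min_read_recency_for_promote 3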

> 
> 
> Also can you check what your hit_set_period and hit_set_count is currently
> set to.
> 
> hit_set_count is set to 1 and hit_set_period to 1800.
> 
> What would increasing the hit_set_count do exactly?
> 
> 
> 
> > Right now I'm unsure how to scientifically test the performance retrieving
> > files when there is a cache miss. If somebody could point me towards a
> > better way of doing that I would appreciate the help.
> >
> > An other thing is that I'm seeing a lot of messages popping up in dmesg on
> > my client server on which the RBD volumes are mounted. (IPs removed)
> >
> > [685881.477383] libceph: osd50 :6800 socket closed (con state OPEN)
> > [685895.597733] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685895.663971] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685895.710424] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685895.749417] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685896.517778] libceph: osd54 :6808 socket closed (con state OPEN)
> > [685906.690445] libceph: osd74 :6824 socket closed (con state OPEN)
> >
> > Is this a symptom of something?
> 
> This is just stale connections to the OSD's timing out after the idle period 
> and
> is nothing to worry about.
> 
> Glad to hear that, I was fearing something might be wrong.
> 
> Thanks again.
> 
> Peter


Re: [ceph-users] Cluster not recovering after OSD deamon is down

2016-05-03 Thread Tupper Cole
The degraded PGs are mapped to the down OSD and have not been remapped to a new
OSD. Removing the OSD would likely result in a full recovery.
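
For reference, the usual removal sequence for a dead OSD looks roughly like
this (OSD id 50 is only an example, and the stop command depends on your init
system):

ceph osd out 50                # if it is not already marked out
systemctl stop ceph-osd@50     # or: service ceph stop osd.50
ceph osd crush remove osd.50
ceph auth del osd.50
ceph osd rm 50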

As a note, having two monitors (or any even number of monitors) is not
recommended. If either monitor goes down you will lose quorum. The
recommended number of monitors for any cluster is at least three.
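
A quick way to check the current quorum, and one common way to grow it to three
monitors (assuming a ceph-deploy managed cluster), might be:

ceph quorum_status --format json-pretty
ceph-deploy mon add <third-mon-hostname>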

On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna  wrote:

> Hi Cephers,
>
> I am running a very small cluster of 3 storage and 2 monitor nodes.
>
> After I kill 1 osd-daemon, the cluster never recovers fully. 9 PGs
> remain undersized for unknown reason.
>
> After I restart that 1 osd deamon, the cluster recovers in no time .
>
> Size of all pools are 3 and min_size is 2.
>
> Can anybody please help ?
>
> Output of  "ceph -s"
> cluster fac04d85-db48-4564-b821-deebda046261
>  health HEALTH_WARN
> 9 pgs degraded
> 9 pgs stuck degraded
> 9 pgs stuck unclean
> 9 pgs stuck undersized
> 9 pgs undersized
> recovery 3327/195138 objects degraded (1.705%)
> pool .users pg_num 512 > pgp_num 8
>  monmap e2: 2 mons at
> {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
> election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
>  osdmap e857: 69 osds: 68 up, 68 in
>   pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
> 279 GB used, 247 TB / 247 TB avail
> 3327/195138 objects degraded (1.705%)
>  887 active+clean
>9 active+undersized+degraded
>   client io 395 B/s rd, 0 B/s wr, 0 op/s
>
> ceph health detail output :
>
> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
> pg 7.a is stuck unclean for 322742.938959, current state
> active+undersized+degraded, last acting [38,2]
> pg 5.27 is stuck unclean for 322754.823455, current state
> active+undersized+degraded, last acting [26,19]
> pg 5.32 is stuck unclean for 322750.685684, current state
> active+undersized+degraded, last acting [39,19]
> pg 6.13 is stuck unclean for 322732.665345, current state
> active+undersized+degraded, last acting [30,16]
> pg 5.4e is stuck unclean for 331869.103538, current state
> active+undersized+degraded, last acting [16,38]
> pg 5.72 is stuck unclean for 331871.208948, current state
> active+undersized+degraded, last acting [16,49]
> pg 4.17 is stuck unclean for 331822.771240, current state
> active+undersized+degraded, last acting [47,20]
> pg 5.2c is stuck unclean for 323021.274535, current state
> active+undersized+degraded, last acting [47,18]
> pg 5.37 is stuck unclean for 323007.574395, current state
> active+undersized+degraded, last acting [43,1]
> pg 7.a is stuck undersized for 322487.284302, current state
> active+undersized+degraded, last acting [38,2]
> pg 5.27 is stuck undersized for 322487.287164, current state
> active+undersized+degraded, last acting [26,19]
> pg 5.32 is stuck undersized for 322487.285566, current state
> active+undersized+degraded, last acting [39,19]
> pg 6.13 is stuck undersized for 322487.287168, current state
> active+undersized+degraded, last acting [30,16]
> pg 5.4e is stuck undersized for 331351.476170, current state
> active+undersized+degraded, last acting [16,38]
> pg 5.72 is stuck undersized for 331351.475707, current state
> active+undersized+degraded, last acting [16,49]
> pg 4.17 is stuck undersized for 322487.280309, current state
> active+undersized+degraded, last acting [47,20]
> pg 5.2c is stuck undersized for 322487.286347, current state
> active+undersized+degraded, last acting [47,18]
> pg 5.37 is stuck undersized for 322487.280027, current state
> active+undersized+degraded, last acting [43,1]
> pg 7.a is stuck degraded for 322487.284340, current state
> active+undersized+degraded, last acting [38,2]
> pg 5.27 is stuck degraded for 322487.287202, current state
> active+undersized+degraded, last acting [26,19]
> pg 5.32 is stuck degraded for 322487.285604, current state
> active+undersized+degraded, last acting [39,19]
> pg 6.13 is stuck degraded for 322487.287207, current state
> active+undersized+degraded, last acting [30,16]
> pg 5.4e is stuck degraded for 331351.476209, current state
> active+undersized+degraded, last acting [16,38]
> pg 5.72 is stuck degraded for 331351.475746, current state
> active+undersized+degraded, last acting [16,49]
> pg 4.17 is stuck degraded for 322487.280348, current state
> active+undersized+degraded, last acting [47,20]
> pg 5.2c is stuck degraded for 322487.286386, current state
> active+undersized+degraded, last acting [47,18]
> pg 5.37 is stuck degraded for 322487.280066, current state
> active+undersized+degraded, last acting [43,1]
> pg 5.72 is active+undersized+degraded, acting [16,49]
> pg 5.4e is active+undersized+degraded, acting [16,38]
> pg 5.32 is active+undersized+d

[ceph-users] existing ceph cluster - clean start

2016-05-03 Thread Andrei Mikhailovsky
Hello, 

I am planning to make some changes to our Ceph cluster and would like to ask
the community about the best route to take.

Our existing cluster is made of 3 OSD servers (two of which are also mon
servers) and a total of 3 mon servers. The cluster is currently running on
Ubuntu 14.04.x LTS. Due to historical testing, troubleshooting and cluster
setup, the servers are not really uniform (software wise) and I would like to
standardise as many things as possible. I am slowly migrating to Saltstack for
infrastructure management and would like to manage my Ceph cluster with Salt as
well. My initial thought is to start with a clean Ubuntu 16.04 server install,
connect it to the salt server and manage all software installs through Salt.
This will make sure that all servers would be pretty much standard in terms of
software.

My question is: what is the best way to migrate the existing cluster without
downtime? Should I OS-wipe one of the OSD servers (wipe the OS disk and not the
osd/journal disks), install the OS with Salt and point Ceph to the existing
OSDs? After that, do the same with the second OSD server and finally with the
third one.

Is Ceph smart enough to figure out that the OSDs belong to an existing cluster
and join the reinstalled OSD server back into the cluster? If this can be done,
I assume it is the fastest way to achieve this. If not, what is the best route
to take?
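
One possible sequence, assuming the OSDs were prepared with ceph-disk/GPT so
they can simply be re-activated after the reinstall, might look like:

ceph osd set noout        # keep the cluster from rebalancing while the host is down
# reinstall the OS with Salt, leaving the OSD data and journal disks untouched,
# then restore /etc/ceph/ceph.conf and the OSD bootstrap keyring
ceph-disk activate-all    # remount and start the existing OSDs
ceph osd unset noout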

Many thanks 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster not recovering after OSD deamon is down

2016-05-03 Thread Gaurav Bafna
Thanks Tupper for replying.

Shouldn't the PGs be remapped to other OSDs?

Yes, removing the OSD from the cluster results in a full recovery.
But that should not be needed, right?



On Tue, May 3, 2016 at 6:31 PM, Tupper Cole  wrote:
> The degraded pgs are mapped to the down OSD and have not mapped to a new
> OSD. Removing the OSD would likely result in a full recovery.
>
> As a note, having two monitors (or any even number of monitors) is not
> recommended. If either monitor goes down you will lose quorum. The
> recommended number of monitors for any cluster is at least three.
>
> On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna  wrote:
>>
>> Hi Cephers,
>>
>> I am running a very small cluster of 3 storage and 2 monitor nodes.
>>
>> After I kill 1 osd-daemon, the cluster never recovers fully. 9 PGs
>> remain undersized for unknown reason.
>>
>> After I restart that 1 osd deamon, the cluster recovers in no time .
>>
>> Size of all pools are 3 and min_size is 2.
>>
>> Can anybody please help ?
>>
>> Output of  "ceph -s"
>> cluster fac04d85-db48-4564-b821-deebda046261
>>  health HEALTH_WARN
>> 9 pgs degraded
>> 9 pgs stuck degraded
>> 9 pgs stuck unclean
>> 9 pgs stuck undersized
>> 9 pgs undersized
>> recovery 3327/195138 objects degraded (1.705%)
>> pool .users pg_num 512 > pgp_num 8
>>  monmap e2: 2 mons at
>> {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
>> election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
>>  osdmap e857: 69 osds: 68 up, 68 in
>>   pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
>> 279 GB used, 247 TB / 247 TB avail
>> 3327/195138 objects degraded (1.705%)
>>  887 active+clean
>>9 active+undersized+degraded
>>   client io 395 B/s rd, 0 B/s wr, 0 op/s
>>
>> ceph health detail output :
>>
>> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
>> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
>> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
>> pg 7.a is stuck unclean for 322742.938959, current state
>> active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck unclean for 322754.823455, current state
>> active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck unclean for 322750.685684, current state
>> active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck unclean for 322732.665345, current state
>> active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck unclean for 331869.103538, current state
>> active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck unclean for 331871.208948, current state
>> active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck unclean for 331822.771240, current state
>> active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck unclean for 323021.274535, current state
>> active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck unclean for 323007.574395, current state
>> active+undersized+degraded, last acting [43,1]
>> pg 7.a is stuck undersized for 322487.284302, current state
>> active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck undersized for 322487.287164, current state
>> active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck undersized for 322487.285566, current state
>> active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck undersized for 322487.287168, current state
>> active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck undersized for 331351.476170, current state
>> active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck undersized for 331351.475707, current state
>> active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck undersized for 322487.280309, current state
>> active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck undersized for 322487.286347, current state
>> active+undersized+degraded, last acting [47,18]
>> pg 5.37 is stuck undersized for 322487.280027, current state
>> active+undersized+degraded, last acting [43,1]
>> pg 7.a is stuck degraded for 322487.284340, current state
>> active+undersized+degraded, last acting [38,2]
>> pg 5.27 is stuck degraded for 322487.287202, current state
>> active+undersized+degraded, last acting [26,19]
>> pg 5.32 is stuck degraded for 322487.285604, current state
>> active+undersized+degraded, last acting [39,19]
>> pg 6.13 is stuck degraded for 322487.287207, current state
>> active+undersized+degraded, last acting [30,16]
>> pg 5.4e is stuck degraded for 331351.476209, current state
>> active+undersized+degraded, last acting [16,38]
>> pg 5.72 is stuck degraded for 331351.475746, current state
>> active+undersized+degraded, last acting [16,49]
>> pg 4.17 is stuck degraded for 322487.280348, current state
>> active+undersized+degraded, last acting [47,20]
>> pg 5.2c is stuck d

[ceph-users] 4kN vs. 512E drives and choosing drives

2016-05-03 Thread Oliver Dzombic
Hi,

I am currently trying to make a more or less smart decision about which HDDs
will be used for the cold storage behind the SSD cache tier.

As I have seen, there are lately different drive variants available:

512N ( 512 bytes native sector size )
512E ( 512 bytes emulated on 4k sector size )
4kN ( 4k native sector size )

So the question is whether anyone has any experience with these in terms of
performance.

For me, in my current deployment, i have:

osd_mkfs_options_xfs = -f -i size=2048

This will generate a sector size of 2048 bytes, which will be suboptimal or
maybe not even work with the 4kN drives.

So, especially for Ceph, what is better: bigger sector sizes like 4k,
especially when working with a 4kN drive?

From what I see within Ceph on my XFS drives, most (all?) files are
4194304 bytes long, i.e. about 4 MB. So a block size of 4k would be
fine with that, I think.

Or are smaller sector sizes better?
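
For what it's worth, a quick way to check what a drive reports, and to format
it explicitly, might be (the device name is a placeholder):

blockdev --getss --getpbsz /dev/sdX     # logical and physical sector size
# 4k sectors, keeping the inode-size option already used above
mkfs.xfs -f -s size=4096 -i size=2048 /dev/sdX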

#

And about choosing drives:

So far, these are the options for me, for now:


- MG04SCA40EA:
Toshiba 3.5" 4TB SAS 12Gb/s 7.2K RPM 128M 4Kn (Tomcat R)

- HUS726040AL4210:
HGST 3.5" 4TB SAS 12Gb/s 7.2K RPM 128M 0F22794 4kn ISE (Aries KP)

- ST4000NM0014:
Seagate 3.5" 4TB SAS 12Gb/s 7.2K RPM 128M Makara (4kN)


This drives are also available in 512E versions.


Does anyone have any experience with any of these drives?

Any input is welcome.

Thank you !


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster not recovering after OSD deamon is down

2016-05-03 Thread Tupper Cole
Yes, the PG *should* get remapped, but that is not always the case. For
discussion on this, check out the tracker below. Your particular
circumstances may be a little different, but the idea is the same.

http://tracker.ceph.com/issues/3806



On Tue, May 3, 2016 at 9:16 AM, Gaurav Bafna  wrote:

> Thanks Tupper for replying.
>
> Shouldn't the PG be remapped to other OSDs ?
>
> Yes , removing OSD from the cluster is resulting into full recovery.
> But that should not be needed , right ?
>
>
>
> On Tue, May 3, 2016 at 6:31 PM, Tupper Cole  wrote:
> > The degraded pgs are mapped to the down OSD and have not mapped to a new
> > OSD. Removing the OSD would likely result in a full recovery.
> >
> > As a note, having two monitors (or any even number of monitors) is not
> > recommended. If either monitor goes down you will lose quorum. The
> > recommended number of monitors for any cluster is at least three.
> >
> > On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna  wrote:
> >>
> >> Hi Cephers,
> >>
> >> I am running a very small cluster of 3 storage and 2 monitor nodes.
> >>
> >> After I kill 1 osd-daemon, the cluster never recovers fully. 9 PGs
> >> remain undersized for unknown reason.
> >>
> >> After I restart that 1 osd deamon, the cluster recovers in no time .
> >>
> >> Size of all pools are 3 and min_size is 2.
> >>
> >> Can anybody please help ?
> >>
> >> Output of  "ceph -s"
> >> cluster fac04d85-db48-4564-b821-deebda046261
> >>  health HEALTH_WARN
> >> 9 pgs degraded
> >> 9 pgs stuck degraded
> >> 9 pgs stuck unclean
> >> 9 pgs stuck undersized
> >> 9 pgs undersized
> >> recovery 3327/195138 objects degraded (1.705%)
> >> pool .users pg_num 512 > pgp_num 8
> >>  monmap e2: 2 mons at
> >> {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
> >> election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
> >>  osdmap e857: 69 osds: 68 up, 68 in
> >>   pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
> >> 279 GB used, 247 TB / 247 TB avail
> >> 3327/195138 objects degraded (1.705%)
> >>  887 active+clean
> >>9 active+undersized+degraded
> >>   client io 395 B/s rd, 0 B/s wr, 0 op/s
> >>
> >> ceph health detail output :
> >>
> >> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
> >> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
> >> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
> >> pg 7.a is stuck unclean for 322742.938959, current state
> >> active+undersized+degraded, last acting [38,2]
> >> pg 5.27 is stuck unclean for 322754.823455, current state
> >> active+undersized+degraded, last acting [26,19]
> >> pg 5.32 is stuck unclean for 322750.685684, current state
> >> active+undersized+degraded, last acting [39,19]
> >> pg 6.13 is stuck unclean for 322732.665345, current state
> >> active+undersized+degraded, last acting [30,16]
> >> pg 5.4e is stuck unclean for 331869.103538, current state
> >> active+undersized+degraded, last acting [16,38]
> >> pg 5.72 is stuck unclean for 331871.208948, current state
> >> active+undersized+degraded, last acting [16,49]
> >> pg 4.17 is stuck unclean for 331822.771240, current state
> >> active+undersized+degraded, last acting [47,20]
> >> pg 5.2c is stuck unclean for 323021.274535, current state
> >> active+undersized+degraded, last acting [47,18]
> >> pg 5.37 is stuck unclean for 323007.574395, current state
> >> active+undersized+degraded, last acting [43,1]
> >> pg 7.a is stuck undersized for 322487.284302, current state
> >> active+undersized+degraded, last acting [38,2]
> >> pg 5.27 is stuck undersized for 322487.287164, current state
> >> active+undersized+degraded, last acting [26,19]
> >> pg 5.32 is stuck undersized for 322487.285566, current state
> >> active+undersized+degraded, last acting [39,19]
> >> pg 6.13 is stuck undersized for 322487.287168, current state
> >> active+undersized+degraded, last acting [30,16]
> >> pg 5.4e is stuck undersized for 331351.476170, current state
> >> active+undersized+degraded, last acting [16,38]
> >> pg 5.72 is stuck undersized for 331351.475707, current state
> >> active+undersized+degraded, last acting [16,49]
> >> pg 4.17 is stuck undersized for 322487.280309, current state
> >> active+undersized+degraded, last acting [47,20]
> >> pg 5.2c is stuck undersized for 322487.286347, current state
> >> active+undersized+degraded, last acting [47,18]
> >> pg 5.37 is stuck undersized for 322487.280027, current state
> >> active+undersized+degraded, last acting [43,1]
> >> pg 7.a is stuck degraded for 322487.284340, current state
> >> active+undersized+degraded, last acting [38,2]
> >> pg 5.27 is stuck degraded for 322487.287202, current state
> >> active+undersized+degraded, last acting [26,19]
> >> pg 5.32 is stuck degraded for 322487.285604, current state
> >> active+undersized+degraded, la

Re: [ceph-users] Cluster not recovering after OSD deamon is down

2016-05-03 Thread Gaurav Bafna
Also, the old PGs are not mapped to the down OSD, as seen from the
ceph health detail output:

pg 5.72 is active+undersized+degraded, acting [16,49]
pg 5.4e is active+undersized+degraded, acting [16,38]
pg 5.32 is active+undersized+degraded, acting [39,19]
pg 5.37 is active+undersized+degraded, acting [43,1]
pg 5.2c is active+undersized+degraded, acting [47,18]
pg 5.27 is active+undersized+degraded, acting [26,19]
pg 6.13 is active+undersized+degraded, acting [30,16]
pg 4.17 is active+undersized+degraded, acting [47,20]
pg 7.a is active+undersized+degraded, acting [38,2]

From the pg query of 7.a:

{
"state": "active+undersized+degraded",
"snap_trimq": "[]",
"epoch": 857,
"up": [
38,
2
],
"acting": [
38,
2
],
"actingbackfill": [
"2",
"38"
],
"info": {
"pgid": "7.a",
"last_update": "0'0",
"last_complete": "0'0",
"log_tail": "0'0",
"last_user_version": 0,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": {
"epoch_created": 13,
"last_epoch_started": 818,
"last_epoch_clean": 818,
"last_epoch_split": 0,
"same_up_since": 817,
"same_interval_since": 817,


Complete pg query info at: http://pastebin.com/ZHB6M4PQ

On Tue, May 3, 2016 at 6:46 PM, Gaurav Bafna  wrote:
> Thanks Tupper for replying.
>
> Shouldn't the PG be remapped to other OSDs ?
>
> Yes , removing OSD from the cluster is resulting into full recovery.
> But that should not be needed , right ?
>
>
>
> On Tue, May 3, 2016 at 6:31 PM, Tupper Cole  wrote:
>> The degraded pgs are mapped to the down OSD and have not mapped to a new
>> OSD. Removing the OSD would likely result in a full recovery.
>>
>> As a note, having two monitors (or any even number of monitors) is not
>> recommended. If either monitor goes down you will lose quorum. The
>> recommended number of monitors for any cluster is at least three.
>>
>> On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna  wrote:
>>>
>>> Hi Cephers,
>>>
>>> I am running a very small cluster of 3 storage and 2 monitor nodes.
>>>
>>> After I kill 1 osd-daemon, the cluster never recovers fully. 9 PGs
>>> remain undersized for unknown reason.
>>>
>>> After I restart that 1 osd deamon, the cluster recovers in no time .
>>>
>>> Size of all pools are 3 and min_size is 2.
>>>
>>> Can anybody please help ?
>>>
>>> Output of  "ceph -s"
>>> cluster fac04d85-db48-4564-b821-deebda046261
>>>  health HEALTH_WARN
>>> 9 pgs degraded
>>> 9 pgs stuck degraded
>>> 9 pgs stuck unclean
>>> 9 pgs stuck undersized
>>> 9 pgs undersized
>>> recovery 3327/195138 objects degraded (1.705%)
>>> pool .users pg_num 512 > pgp_num 8
>>>  monmap e2: 2 mons at
>>> {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
>>> election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
>>>  osdmap e857: 69 osds: 68 up, 68 in
>>>   pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
>>> 279 GB used, 247 TB / 247 TB avail
>>> 3327/195138 objects degraded (1.705%)
>>>  887 active+clean
>>>9 active+undersized+degraded
>>>   client io 395 B/s rd, 0 B/s wr, 0 op/s
>>>
>>> ceph health detail output :
>>>
>>> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
>>> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
>>> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
>>> pg 7.a is stuck unclean for 322742.938959, current state
>>> active+undersized+degraded, last acting [38,2]
>>> pg 5.27 is stuck unclean for 322754.823455, current state
>>> active+undersized+degraded, last acting [26,19]
>>> pg 5.32 is stuck unclean for 322750.685684, current state
>>> active+undersized+degraded, last acting [39,19]
>>> pg 6.13 is stuck unclean for 322732.665345, current state
>>> active+undersized+degraded, last acting [30,16]
>>> pg 5.4e is stuck unclean for 331869.103538, current state
>>> active+undersized+degraded, last acting [16,38]
>>> pg 5.72 is stuck unclean for 331871.208948, current state
>>> active+undersized+degraded, last acting [16,49]
>>> pg 4.17 is stuck unclean for 331822.771240, current state
>>> active+undersized+degraded, last acting [47,20]
>>> pg 5.2c is stuck unclean for 323021.274535, current state
>>> active+undersized+degraded, last acting [47,18]
>>> pg 5.37 is stuck unclean for 323007.574395, current state
>>> active+undersized+degraded, last acting [43,1]
>>> pg 7.a is stuck undersized for 322487.284302, current state
>>> active+undersized+degraded, last acting [38,2]
>>> pg 5.27 is stuck undersized for 322487.287164, current state
>>> active+undersized+degraded, last acting [26,19]
>>> pg 5.32 is stuck undersized for 322487.285566, current state
>>> active+undersized+degraded, last acting [39,19]
>>> pg 6.13 is s

Re: [ceph-users] Cluster not recovering after OSD deamon is down

2016-05-03 Thread Varada Kari
PGs are degraded because they don't have enough copies of the data. What
is your replication size?

You can refer to
http://docs.ceph.com/docs/master/rados/operations/pg-states/  for PG states.
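
For example, the replication settings can be confirmed per pool (using the pool
name from your output):

ceph osd pool get .users size
ceph osd pool get .users min_size
ceph osd dump | grep 'replicated size'    # or check all pools at once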

Varada

On Tuesday 03 May 2016 06:56 PM, Gaurav Bafna wrote:
> Also , the old PGs are not mapped to the down osd as seen from the
> ceph health detail
>
> pg 5.72 is active+undersized+degraded, acting [16,49]
> pg 5.4e is active+undersized+degraded, acting [16,38]
> pg 5.32 is active+undersized+degraded, acting [39,19]
> pg 5.37 is active+undersized+degraded, acting [43,1]
> pg 5.2c is active+undersized+degraded, acting [47,18]
> pg 5.27 is active+undersized+degraded, acting [26,19]
> pg 6.13 is active+undersized+degraded, acting [30,16]
> pg 4.17 is active+undersized+degraded, acting [47,20]
> pg 7.a is active+undersized+degraded, acting [38,2]
>
> From pg query of 7.a
>
> {
> "state": "active+undersized+degraded",
> "snap_trimq": "[]",
> "epoch": 857,
> "up": [
> 38,
> 2
> ],
> "acting": [
> 38,
> 2
> ],
> "actingbackfill": [
> "2",
> "38"
> ],
> "info": {
> "pgid": "7.a",
> "last_update": "0'0",
> "last_complete": "0'0",
> "log_tail": "0'0",
> "last_user_version": 0,
> "last_backfill": "MAX",
> "purged_snaps": "[]",
> "history": {
> "epoch_created": 13,
> "last_epoch_started": 818,
> "last_epoch_clean": 818,
> "last_epoch_split": 0,
> "same_up_since": 817,
> "same_interval_since": 817,
>
>
> Complete pq query info at : http://pastebin.com/ZHB6M4PQ
>
> On Tue, May 3, 2016 at 6:46 PM, Gaurav Bafna  wrote:
>> Thanks Tupper for replying.
>>
>> Shouldn't the PG be remapped to other OSDs ?
>>
>> Yes , removing OSD from the cluster is resulting into full recovery.
>> But that should not be needed , right ?
>>
>>
>>
>> On Tue, May 3, 2016 at 6:31 PM, Tupper Cole  wrote:
>>> The degraded pgs are mapped to the down OSD and have not mapped to a new
>>> OSD. Removing the OSD would likely result in a full recovery.
>>>
>>> As a note, having two monitors (or any even number of monitors) is not
>>> recommended. If either monitor goes down you will lose quorum. The
>>> recommended number of monitors for any cluster is at least three.
>>>
>>> On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna  wrote:
 Hi Cephers,

 I am running a very small cluster of 3 storage and 2 monitor nodes.

 After I kill 1 osd-daemon, the cluster never recovers fully. 9 PGs
 remain undersized for unknown reason.

 After I restart that 1 osd deamon, the cluster recovers in no time .

 Size of all pools are 3 and min_size is 2.

 Can anybody please help ?

 Output of  "ceph -s"
 cluster fac04d85-db48-4564-b821-deebda046261
  health HEALTH_WARN
 9 pgs degraded
 9 pgs stuck degraded
 9 pgs stuck unclean
 9 pgs stuck undersized
 9 pgs undersized
 recovery 3327/195138 objects degraded (1.705%)
 pool .users pg_num 512 > pgp_num 8
  monmap e2: 2 mons at
 {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
 election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
  osdmap e857: 69 osds: 68 up, 68 in
   pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
 279 GB used, 247 TB / 247 TB avail
 3327/195138 objects degraded (1.705%)
  887 active+clean
9 active+undersized+degraded
   client io 395 B/s rd, 0 B/s wr, 0 op/s

 ceph health detail output :

 HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
 degraded (1.705%); pool .users pg_num 512 > pgp_num 8
 pg 7.a is stuck unclean for 322742.938959, current state
 active+undersized+degraded, last acting [38,2]
 pg 5.27 is stuck unclean for 322754.823455, current state
 active+undersized+degraded, last acting [26,19]
 pg 5.32 is stuck unclean for 322750.685684, current state
 active+undersized+degraded, last acting [39,19]
 pg 6.13 is stuck unclean for 322732.665345, current state
 active+undersized+degraded, last acting [30,16]
 pg 5.4e is stuck unclean for 331869.103538, current state
 active+undersized+degraded, last acting [16,38]
 pg 5.72 is stuck unclean for 331871.208948, current state
 active+undersized+degraded, last acting [16,49]
 pg 4.17 is stuck unclean for 331822.771240, current state
 active+undersized+degraded, last acting [47,20]
 pg 5.2c is stuck unclean for 323021.274535, current state
 active+undersized+degraded, last acting [47,18]
 pg 5.37 is stuck unclean for 323007.574395, current s

Re: [ceph-users] Erasure pool performance expectations

2016-05-03 Thread Peter Kerdisle
Thank you, I will attempt to play around with these settings and see if I
can achieve better read performance.

Appreciate your insights.

Peter
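
For reference, the settings discussed below are per-pool cache-tier options; a rough sketch of how they could be adjusted, with purely illustrative values and "hot-pool" standing in for the actual cache pool name:

$ ceph osd pool set hot-pool hit_set_count 10
$ ceph osd pool set hot-pool hit_set_period 60
$ ceph osd pool set hot-pool min_read_recency_for_promote 2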

On Tue, May 3, 2016 at 3:00 PM, Nick Fisk  wrote:

>
>
> > -Original Message-
> > From: Peter Kerdisle [mailto:peter.kerdi...@gmail.com]
> > Sent: 03 May 2016 12:15
> > To: n...@fisk.me.uk
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Erasure pool performance expectations
> >
> > Hey Nick,
> >
> > Thanks for taking the time to answer my questions. Some in-line comments.
> >
> > On Tue, May 3, 2016 at 10:51 AM, Nick Fisk  wrote:
> > Hi Peter,
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of
> > > Peter Kerdisle
> > > Sent: 02 May 2016 08:17
> > > To: ceph-users@lists.ceph.com
> > > Subject: [ceph-users] Erasure pool performance expectations
> > >
> > > Hi guys,
> > >
> > > I am currently testing the performance of RBD using a cache pool and a
> 4/2
> > > erasure profile pool.
> > >
> > > I have two SSD cache servers (2 SSDs for journals, 7 SSDs for data)
> with
> > > 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public and
> 10Gbit
> > > cluster network for the erasure pool (10x3TB without separate journal).
> > This
> > > is all on Jewel.
> > >
> > > What I would like to know is if the performance I'm seeing is to be
> > expected
> > > and if there is some way to test this in a more qualifiable way.
> > >
> > > Everything works as expected if the files are present on the cache
> pool,
> > > however when things need to be retrieved from the cache pool I see
> > > performance degradation. I'm trying to simulate real usage as much as
> > > possible and trying to retrieve files from the RBD volume over FTP
> from a
> > > client server. What I'm seeing is that the FTP transfer will stall for
> seconds
> > at a
> > > time and then get some more data which results in an average speed of
> > > 200KB/s. From the cache this is closer to 10MB/s. Is this the expected
> > > behaviour from a erasure coded tier with cache in front?
> >
> > Unfortunately yes. The whole Erasure/Cache thing only really works well
> if
> > the data in the EC tier is only accessed infrequently, otherwise the
> overheads
> > in cache promotion/flushing quickly brings the cluster down to its knees.
> > However it looks as though you are mainly doing reads, which means you
> can
> > probably alter your cache settings to not promote so aggressively on
> reads,
> > as reads can be proxied through to the EC tier instead of promoting. This
> > should reduce the amount of required cache promotions.
> >
> > You are correct that reads have a lower priority of being cached, only
> when
> > they are used very frequently should this be done in an ideal situation.
> >
> >
> > Can you try setting min_read_recency_for promote to something higher?
> >
> > I looked into the setting before but I must admit it's exact purpose
> eludes me
> > still. Would it be correct to simplify it as
> 'min_read_recency_for_promote
> > determines the amount of times a piece would have to be read in a certain
> > interval (set by hit_set_period) in order to promote it to the caching
> tier' ?
>
> Yes that’s correct. Every hit_set_period (assuming there is IO going on) a
> new hitset is created up until the hit_set_count limit. The recency defines
> how many of the last x hitsets an object must have been accessed in.
>
> Tuning it is a bit of a dark art at the moment as you have to try and get
> all the values to match your workload. For starters try something like
>
> Read recency =  2 or 3
> Hit_set_count =10
> Hit_set_period=60
>
> Which will mean if an object is read more than 2 or 3 times in a row
> within the last few minutes it will be promoted. There is no granularity
> below a single hitset, so if an object gets hit a 1000 times in 1 minute
> but then nothing for 5 minutes it will not cause a promotion.
>
> >
> >
> > Also can you check what your hit_set_period and hit_set_count is
> currently
> > set to.
> >
> > hit_set_count is set to 1 and hit_set_period to 1800.
> >
> > What would increasing the hit_set_count do exactly?
> >
> >
> >
> > > Right now I'm unsure how to scientifically test the performance
> retrieving
> > > files when there is a cache miss. If somebody could point me towards a
> > > better way of doing that I would appreciate the help.
> > >
> > > An other thing is that I'm seeing a lot of messages popping up in
> dmesg on
> > > my client server on which the RBD volumes are mounted. (IPs removed)
> > >
> > > [685881.477383] libceph: osd50 :6800 socket closed (con state OPEN)
> > > [685895.597733] libceph: osd54 :6808 socket closed (con state OPEN)
> > > [685895.663971] libceph: osd54 :6808 socket closed (con state OPEN)
> > > [685895.710424] libceph: osd54 :6808 socket closed (con state OPEN)
> > > [685895.749417] libceph: osd54 :6808 socket closed (con state OPEN)
> > > [685896.517778] libceph: osd54 :

Re: [ceph-users] Cluster not recovering after OSD deamon is down

2016-05-03 Thread Gaurav Bafna
The replication size is 3 and min_size is 2. Yes, they don't have
enough copies. Ceph by itself should recover from this state to ensure
durability.

@Tupper: In that bug, each node is hosting only three OSDs. In my
setup, every node has 23 OSDs, so that should not be the issue here.
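
To see what CRUSH is actually mapping those PGs to (and whether a third OSD is being chosen at all), something like this helps — 7.a is just one of the stuck PGs from the listing above:

$ ceph pg dump_stuck unclean
$ ceph pg 7.a query | grep -A 3 '"up"'
$ ceph osd tree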



On Tue, May 3, 2016 at 7:00 PM, Varada Kari  wrote:
> Pgs are degraded because they don't have enough copies of the data. What
> is your replication size?
>
> You can refer to
> http://docs.ceph.com/docs/master/rados/operations/pg-states/  for PG states.
>
> Varada
>
> On Tuesday 03 May 2016 06:56 PM, Gaurav Bafna wrote:
>> Also , the old PGs are not mapped to the down osd as seen from the
>> ceph health detail
>>
>> pg 5.72 is active+undersized+degraded, acting [16,49]
>> pg 5.4e is active+undersized+degraded, acting [16,38]
>> pg 5.32 is active+undersized+degraded, acting [39,19]
>> pg 5.37 is active+undersized+degraded, acting [43,1]
>> pg 5.2c is active+undersized+degraded, acting [47,18]
>> pg 5.27 is active+undersized+degraded, acting [26,19]
>> pg 6.13 is active+undersized+degraded, acting [30,16]
>> pg 4.17 is active+undersized+degraded, acting [47,20]
>> pg 7.a is active+undersized+degraded, acting [38,2]
>>
>> From pg query of 7.a
>>
>> {
>> "state": "active+undersized+degraded",
>> "snap_trimq": "[]",
>> "epoch": 857,
>> "up": [
>> 38,
>> 2
>> ],
>> "acting": [
>> 38,
>> 2
>> ],
>> "actingbackfill": [
>> "2",
>> "38"
>> ],
>> "info": {
>> "pgid": "7.a",
>> "last_update": "0'0",
>> "last_complete": "0'0",
>> "log_tail": "0'0",
>> "last_user_version": 0,
>> "last_backfill": "MAX",
>> "purged_snaps": "[]",
>> "history": {
>> "epoch_created": 13,
>> "last_epoch_started": 818,
>> "last_epoch_clean": 818,
>> "last_epoch_split": 0,
>> "same_up_since": 817,
>> "same_interval_since": 817,
>>
>>
>> Complete pq query info at : http://pastebin.com/ZHB6M4PQ
>>
>> On Tue, May 3, 2016 at 6:46 PM, Gaurav Bafna  wrote:
>>> Thanks Tupper for replying.
>>>
>>> Shouldn't the PG be remapped to other OSDs ?
>>>
>>> Yes , removing OSD from the cluster is resulting into full recovery.
>>> But that should not be needed , right ?
>>>
>>>
>>>
>>> On Tue, May 3, 2016 at 6:31 PM, Tupper Cole  wrote:
 The degraded pgs are mapped to the down OSD and have not mapped to a new
 OSD. Removing the OSD would likely result in a full recovery.

 As a note, having two monitors (or any even number of monitors) is not
 recommended. If either monitor goes down you will lose quorum. The
 recommended number of monitors for any cluster is at least three.

 On Tue, May 3, 2016 at 8:42 AM, Gaurav Bafna  wrote:
> Hi Cephers,
>
> I am running a very small cluster of 3 storage and 2 monitor nodes.
>
> After I kill 1 osd-daemon, the cluster never recovers fully. 9 PGs
> remain undersized for unknown reason.
>
> After I restart that 1 osd deamon, the cluster recovers in no time .
>
> Size of all pools are 3 and min_size is 2.
>
> Can anybody please help ?
>
> Output of  "ceph -s"
> cluster fac04d85-db48-4564-b821-deebda046261
>  health HEALTH_WARN
> 9 pgs degraded
> 9 pgs stuck degraded
> 9 pgs stuck unclean
> 9 pgs stuck undersized
> 9 pgs undersized
> recovery 3327/195138 objects degraded (1.705%)
> pool .users pg_num 512 > pgp_num 8
>  monmap e2: 2 mons at
> {dssmon2=10.140.13.13:6789/0,dssmonleader1=10.140.13.11:6789/0}
> election epoch 1038, quorum 0,1 dssmonleader1,dssmon2
>  osdmap e857: 69 osds: 68 up, 68 in
>   pgmap v106601: 896 pgs, 9 pools, 435 MB data, 65047 objects
> 279 GB used, 247 TB / 247 TB avail
> 3327/195138 objects degraded (1.705%)
>  887 active+clean
>9 active+undersized+degraded
>   client io 395 B/s rd, 0 B/s wr, 0 op/s
>
> ceph health detail output :
>
> HEALTH_WARN 9 pgs degraded; 9 pgs stuck degraded; 9 pgs stuck unclean;
> 9 pgs stuck undersized; 9 pgs undersized; recovery 3327/195138 objects
> degraded (1.705%); pool .users pg_num 512 > pgp_num 8
> pg 7.a is stuck unclean for 322742.938959, current state
> active+undersized+degraded, last acting [38,2]
> pg 5.27 is stuck unclean for 322754.823455, current state
> active+undersized+degraded, last acting [26,19]
> pg 5.32 is stuck unclean for 322750.685684, current state
> active+undersized+degraded, last acting [39,19]
> pg 6.13 is stuck unclean for 322732.665345, current state
> active+undersized+degraded, last acting [30,16]
> pg 5.4e is stuck unclean for 331869.103538,

Re: [ceph-users] Erasure pool performance expectations

2016-05-03 Thread Mark Nelson
In addition to what Nick said, it's really valuable to watch your cache 
tier's write behavior during heavy IO.  One thing I noticed is that you said 
you have 2 SSDs for journals and 7 SSDs for data.  If they are all of 
the same type, you're likely bottlenecked by the journal SSDs for 
writes, which, compounded with the heavy promotions, is really going to 
hold you back.
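
One quick way to sanity-check whether the journal devices are the write bottleneck is to bench an individual OSD and watch the devices while load is applied (illustrative; osd.0 is just an example):

$ ceph tell osd.0 bench
$ iostat -x 1    # compare utilisation of the journal SSDs vs the data SSDs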


What you really want:

1) (assuming filestore) equal large write throughput between the 
journals and data disks.


2) promotions to be limited by some reasonable fraction of the cache 
tier and/or network throughput (say 70%).  This is why the 
user-configurable promotion throttles were added in jewel.


3) The cache tier to fill up quickly when empty but change slowly once 
it's full (ie limiting promotions and evictions).  No real way to do 
this yet.


Mark

On 05/03/2016 08:40 AM, Peter Kerdisle wrote:

Thank you, I will attempt to play around with these settings and see if
I can achieve better read performance.

Appreciate your insights.

Peter

On Tue, May 3, 2016 at 3:00 PM, Nick Fisk mailto:n...@fisk.me.uk>> wrote:



> -Original Message-
> From: Peter Kerdisle [mailto:peter.kerdi...@gmail.com
]
> Sent: 03 May 2016 12:15
> To: n...@fisk.me.uk 
> Cc: ceph-users@lists.ceph.com 
> Subject: Re: [ceph-users] Erasure pool performance expectations
>
> Hey Nick,
>
> Thanks for taking the time to answer my questions. Some in-line
comments.
>
> On Tue, May 3, 2016 at 10:51 AM, Nick Fisk mailto:n...@fisk.me.uk>> wrote:
> Hi Peter,
>
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
] On Behalf
> Of
> > Peter Kerdisle
> > Sent: 02 May 2016 08:17
> > To: ceph-users@lists.ceph.com 
> > Subject: [ceph-users] Erasure pool performance expectations
> >
> > Hi guys,
> >
> > I am currently testing the performance of RBD using a cache pool
and a 4/2
> > erasure profile pool.
> >
> > I have two SSD cache servers (2 SSDs for journals, 7 SSDs for
data) with
> > 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public
and 10Gbit
> > cluster network for the erasure pool (10x3TB without separate
journal).
> This
> > is all on Jewel.
> >
> > What I would like to know is if the performance I'm seeing is to be
> expected
> > and if there is some way to test this in a more qualifiable way.
> >
> > Everything works as expected if the files are present on the
cache pool,
> > however when things need to be retrieved from the cache pool I see
> > performance degradation. I'm trying to simulate real usage as
much as
> > possible and trying to retrieve files from the RBD volume over
FTP from a
> > client server. What I'm seeing is that the FTP transfer will
stall for seconds
> at a
> > time and then get some more data which results in an average
speed of
> > 200KB/s. From the cache this is closer to 10MB/s. Is this the
expected
> > behaviour from a erasure coded tier with cache in front?
>
> Unfortunately yes. The whole Erasure/Cache thing only really works
well if
> the data in the EC tier is only accessed infrequently, otherwise
the overheads
> in cache promotion/flushing quickly brings the cluster down to its
knees.
> However it looks as though you are mainly doing reads, which means
you can
> probably alter your cache settings to not promote so aggressively
on reads,
> as reads can be proxied through to the EC tier instead of
promoting. This
> should reduce the amount of required cache promotions.
>
> You are correct that reads have a lower priority of being cached,
only when
> they are used very frequently should this be done in an ideal
situation.
>
>
> Can you try setting min_read_recency_for promote to something higher?
>
> I looked into the setting before but I must admit it's exact
purpose eludes me
> still. Would it be correct to simplify it as
'min_read_recency_for_promote
> determines the amount of times a piece would have to be read in a
certain
> interval (set by hit_set_period) in order to promote it to the
caching tier' ?

Yes that’s correct. Every hit_set_period (assuming there is IO going
on) a new hitset is created up until the hit_set_count limit. The
recency defines how many of the last x hitsets an object must have
been accessed in.

Tuning it is a bit of a dark art at the moment as you have to try
and get all the values to match your workload. For starters try
something like

Read recency =  2 or 3
Hit_set_count =10
Hit_set_perio

Re: [ceph-users] Erasure pool performance expectations

2016-05-03 Thread Nick Fisk
Mark,

Thanks for pointing out the throttles, they completely slipped my
mind. But then it got me thinking: why weren't they kicking in and stopping
too many promotions from happening in the OP's case?

I had a quick look at my current OSD settings

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
promote
"osd_tier_promote_max_objects_sec": "5242880",
"osd_tier_promote_max_bytes_sec": "25",

Uh oh... they look the wrong way round to me?

Github shows the same

https://github.com/ceph/ceph/search?utf8=%E2%9C%93&q=osd_tier_promote_max_bytes_sec
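
Until that is fixed, the two values can at least be overridden at runtime — a sketch, simply swapping the defaults back the way they were presumably intended:

$ ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 25 --osd_tier_promote_max_bytes_sec 5242880'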

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: 03 May 2016 15:05
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Erasure pool performance expectations
> 
> In addition to what nick said, it's really valuable to watch your cache
tier write
> behavior during heavy IO.  One thing I noticed is you said you have 2 SSDs
for
> journals and 7 SSDs for data.  If they are all of the same type, you're
likely
> bottlenecked by the journal SSDs for writes, which compounded with the
> heavy promotions is going to really hold you back.
> 
> What you really want:
> 
> 1) (assuming filestore) equal large write throughput between the journals
> and data disks.
> 
> 2) promotions to be limited by some reasonable fraction of the cache tier
> and/or network throughput (say 70%).  This is why the user-configurable
> promotion throttles were added in jewel.
> 
> 3) The cache tier to fill up quickly when empty but change slowly once
it's full
> (ie limiting promotions and evictions).  No real way to do this yet.
> 
> Mark
> 
> On 05/03/2016 08:40 AM, Peter Kerdisle wrote:
> > Thank you, I will attempt to play around with these settings and see
> > if I can achieve better read performance.
> >
> > Appreciate your insights.
> >
> > Peter
> >
> > On Tue, May 3, 2016 at 3:00 PM, Nick Fisk  > > wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Peter Kerdisle [mailto:peter.kerdi...@gmail.com
> > ]
> > > Sent: 03 May 2016 12:15
> > > To: n...@fisk.me.uk 
> > > Cc: ceph-users@lists.ceph.com 
> > > Subject: Re: [ceph-users] Erasure pool performance expectations
> > >
> > > Hey Nick,
> > >
> > > Thanks for taking the time to answer my questions. Some in-line
> > comments.
> > >
> > > On Tue, May 3, 2016 at 10:51 AM, Nick Fisk  > > wrote:
> > > Hi Peter,
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> > ] On Behalf
> > > Of
> > > > Peter Kerdisle
> > > > Sent: 02 May 2016 08:17
> > > > To: ceph-users@lists.ceph.com 
> > > > Subject: [ceph-users] Erasure pool performance expectations
> > > >
> > > > Hi guys,
> > > >
> > > > I am currently testing the performance of RBD using a cache pool
> > and a 4/2
> > > > erasure profile pool.
> > > >
> > > > I have two SSD cache servers (2 SSDs for journals, 7 SSDs for
> > data) with
> > > > 2x10Gbit bonded each and a six OSD nodes with a 10Gbit public
> > and 10Gbit
> > > > cluster network for the erasure pool (10x3TB without separate
> > journal).
> > > This
> > > > is all on Jewel.
> > > >
> > > > What I would like to know is if the performance I'm seeing is to
be
> > > expected
> > > > and if there is some way to test this in a more qualifiable way.
> > > >
> > > > Everything works as expected if the files are present on the
> > cache pool,
> > > > however when things need to be retrieved from the cache pool I
see
> > > > performance degradation. I'm trying to simulate real usage as
> > much as
> > > > possible and trying to retrieve files from the RBD volume over
> > FTP from a
> > > > client server. What I'm seeing is that the FTP transfer will
> > stall for seconds
> > > at a
> > > > time and then get some more data which results in an average
> > speed of
> > > > 200KB/s. From the cache this is closer to 10MB/s. Is this the
> > expected
> > > > behaviour from a erasure coded tier with cache in front?
> > >
> > > Unfortunately yes. The whole Erasure/Cache thing only really works
> > well if
> > > the data in the EC tier is only accessed infrequently, otherwise
> > the overheads
> > > in cache promotion/flushing quickly brings the cluster down to its
> > knees.
> > > However it looks as though you are mainly doing reads, which means
> > you can
> > > probably alter your cache settings to not promote so aggressively
> > on reads,
> > > as reads ca

Re: [ceph-users] Erasure pool performance expectations

2016-05-03 Thread Mark Nelson

Aha!  I blame Sage and take no responsibility. :D

https://github.com/ceph/ceph/commit/49c3521b05c33fff68a926d404d5216d1b078955

On 05/03/2016 09:24 AM, Nick Fisk wrote:

https://github.com/ceph/ceph/search?utf8=%E2%9C%93&q=osd_tier_promote_max_bytes_sec

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Read/Write Speed

2016-05-03 Thread Roozbeh Shafiee
Hi,

I have a test Ceph cluster in my lab which will be a storage backend for one of 
my projects.
This cluster is my first experience on CentOS 7, although I have recently used 
Ceph on Ubuntu 14.04 as well.

Everything works fine and the cluster is fully functional, but the main problem is 
read and write performance. Throughput swings a lot, between 60 KB/s and 70 MB/s, 
especially on reads.
How can I tune this cluster into a stable storage backend for my use case?

More information:
Number of OSDs: 5 physical servers with 4x 4 TB disks each - 16 GB of RAM - Core i7 CPU
Number of Monitors: 1 virtual machine with 180 GB on SSD - 16 GB of RAM - on a 
KVM virtualization host
All Operating Systems: CentOS 7.2 with default kernel 3.10
All File Systems: XFS
Ceph Version: 10.2 Jewel
Switch for Private Networking: D-Link DGS-1008D (8-port Gigabit)
NICs: 2x Gb/s NICs per server
Block Device on Client Server: Linux kernel RBD module

Thank you
Roozbeh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Disabling POSIX locking semantics for CephFS

2016-05-03 Thread Burkhard Linke

Hi,

we have a number of legacy applications that do not cope well with the 
POSIX locking semantics in CephFS because they themselves lack locking support 
(e.g. no flock syscalls). We are able to fix some of these applications, but 
others are binary only.


Is it possible to disable POSIX locking completely in CephFS (either 
kernel client or ceph-fuse)?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling POSIX locking semantics for CephFS

2016-05-03 Thread Gregory Farnum
On Tue, May 3, 2016 at 9:30 AM, Burkhard Linke
 wrote:
> Hi,
>
> we have a number of legacy applications that do not cope well with the POSIX
> locking semantics in CephFS due to missing locking support (e.g. flock
> syscalls). We are able to fix some of these applications, but others are
> binary only.
>
> Is it possible to disable POSIX locking completely in CephFS (either kernel
> client or ceph-fuse)?

I'm confused. CephFS supports all of these — although some versions of
FUSE don't; you need a new-ish kernel.

So are you saying that
1) in your setup, it doesn't support both fcntl and flock,
2) that some of your applications don't do well under that scenario?

I don't really see how it's safe for you to just disable the
underlying file locking in an application which depends on it. You may
need to upgrade enough that all file locks are supported.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Read/Write Speed

2016-05-03 Thread Mark Nelson

Hi Roozbeh,

There isn't nearly enough information here regarding your benchmark and 
test parameters to be able to tell why you are seeing performance 
swings.  It could be anything from network hiccups, to throttling in the 
ceph stack, to unlucky randomness in object distribution, to vibrations 
in the rack causing your disk heads to resync, to fragmentation of the 
underlying filesystem (especially important for sequential reads).


Generally speaking if you want to try to isolate the source of the 
problem, it's best to find a way to make the issue repeatable on demand, 
then setup your tests so you can record system metrics (device 
queue/service times, throughput stalls, network oddities, etc) and start 
systematically tracking down when and why slowdowns occur.  Sometimes 
you might even be able to reproduce issues outside of Ceph (Network 
problems are often a common source).


It might also be worth looking at your PG and data distribution.  IE if 
you have some clumpiness you might see variation in performance as some 
OSDs starve for IOs while others are overloaded.
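
As a starting point for making the problem repeatable and for spotting uneven distribution, something along these lines is useful (the pool name is an assumption):

$ rados bench -p rbd 60 write --no-cleanup
$ rados bench -p rbd 60 seq
$ ceph osd df      # per-OSD utilisation and PG counts
$ ceph osd perf    # per-OSD commit/apply latency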


Good luck!

Mark

On 05/03/2016 11:16 AM, Roozbeh Shafiee wrote:

Hi,

I have a test Ceph cluster in my lab which will be a storage backend for
one of my projects.
This cluster is my first experience on CentOS-7, but recently I had some
use case on Ubuntu 14.04 too.

Actually everything works fine and I have a good functionality on this
cluster, but the main problem is the performance
of cluster in read and write data. I have too much swing in read and
write and the rate of this swing is between 60 KB/s - 70 MB/s, specially
on read.
how can I tune this cluster as stable storage backend for my case?

More information:

Number of OSDs: 5 physical server with 4x4TB - 16 GB of RAM - Core
i7 CPU
Number of Monitors: 1 virtual machine with 180 GB on SSD - 16 GB of
RAM - on an KVM Virtualization Machine
All Operating Systems: CentOS 7.2 with default kernel 3.10
All File Systems: XFS
Ceph Version: 10.2 Jewel
Switch for Private Networking: D-Link DGS-1008D Gigabit 8
NICs: Gb/s NIC x 2 for each server
Block Device on Client Server: Linux kernel RBD module


Thank you
Roozbeh



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel, cephfs and selinux

2016-05-03 Thread Gregory Farnum
On Sun, May 1, 2016 at 5:52 PM, Andrus, Brian Contractor
 wrote:
> All,
>
>
>
> I thought there was a way to mount CephFS using the kernel driver and be
> able to honor selinux labeling.
>
> Right now, if I do ‘ls -lZ' on a mounted cephfs, I get question marks
> instead of any contexts.
>
> When I mount it, I see in dmesg:
>
>
>
> [858946.554719] SELinux: initialized (dev ceph, type ceph), not configured
> for labeling
>
>
>
>
>
> Is this something that is in the works and will be available to test?

I don't know much about this, but http://tracker.ceph.com/issues/5486

So I think we're all set, but the administrator/SELinux upstream needs
to do something in order to let SELinux actually use/set labels on
CephFS mounts.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub Errors

2016-05-03 Thread Blade Doyle
Hi Oliver,

Thanks for your reply.

The problem could have been caused by crashing/flapping OSDs. The cluster
is stable now, but lots of PG problems remain.

$ ceph health
HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded; 1
pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized; 4 pgs
undersized; recovery 1489/523934 objects degraded (0.284%); recovery
2620/523934 objects misplaced (0.500%); 158 scrub errors

Example: for pg 1.32 :

$ ceph health detail | grep "pg 1.32"
pg 1.32 is stuck inactive for 13260.118985, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck unclean for 945560.550800, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck undersized for 12855.304944, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck degraded for 12855.305305, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is undersized+degraded+peered, acting [6]

I tried various things like:

$ ceph pg repair 1.32
instructing pg 1.32 on osd.6 to repair

$ ceph pg deep-scrub 1.32
instructing pg 1.32 on osd.6 to deep-scrub

It's odd that I never see any log output on osd.6 about scrubbing or repairing
that PG (after waiting many hours).  I attached the "ceph pg query" output and a grep
of the OSD logs for that PG.  If there is a better way to provide large logs,
please let me know.

For reference the last mention of that pg in the logs is:

2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418  kicking pg 1.32
2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
349347/349347 349418/349418/349418) [] r=-1 lpr=349418 pi=349346-349417/1
crt=338815'7743 lcod 0'0 inactive NOTIFY] lock


Suggestions appreciated,
Blade.




On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle  wrote:

> Hi Ceph-Users,
>
> Help with how to resolve these would be appreciated.
>
> 2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log [INF] :
> 4.97 deep-scrub starts
> 2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640 >>
> 192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0 pgs=0 cs=0 l=0
> c=0x272da0a0).accept peer addr is really 192.168.2.32:0/3983425916
> (socket is 192.168.2.32:38514/0)
> 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145
> dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 deep-scrub 1 errors
> 2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log [INF] :
> 4.97 scrub starts
> 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145 dirty,
> 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts, 365855441/365855441
> bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 scrub 1 errors
>
> Thanks Much,
> Blade.
>
{
"state": "undersized+degraded+peered",
"snap_trimq": "[]",
"epoch": 350071,
"up": [
6
],
"acting": [
6
],
"actingbackfill": [
"6"
],
"info": {
"pgid": "1.32",
"last_update": "338815'7745",
"last_complete": "338815'7745",
"log_tail": "20981'4727",
"last_user_version": 99149,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": {
"epoch_created": 17,
"last_epoch_started": 349421,
"last_epoch_clean": 349491,
"last_epoch_split": 0,
"same_up_since": 349420,
"same_interval_since": 349490,
"same_primary_since": 349420,
"last_scrub": "338815'7745",
"last_scrub_stamp": "2016-04-21 22:05:56.984147",
"last_deep_scrub": "338815'7745",
"last_deep_scrub_stamp": "2016-04-21 22:05:56.984147",
"last_clean_scrub_stamp": "2016-04-21 22:05:56.984147"
},
"stats": {
"version": "338815'7745",
"reported_seq": "61243",
"reported_epoch": "350068",
"state": "undersized+degraded+peered",
"last_fresh": "2016-05-02 19:30:21.999749",
"last_change": "2016-05-02 17:10:46.95",
"last_active": "2016-05-02 17:04:01.016156",
"last_peered": "2016-05-02 19:30:21.999749",
"last_clean": "2016-04-21 22:05:40.584862",
"last_became_active": "0.00",
"last_became_peered": "0.00",
"last_unstale": "2016-05-02 19:30:21.999749",
"last_undegraded": "2016-05-02 17:10:45.831094",
"last_fullsized": "2016-05-02 17:10:45.831094",
"mapping_epoch": 349418,
"log_start": "20981'4727",
"ondisk_lo

Re: [ceph-users] Scrub Errors

2016-05-03 Thread Oliver Dzombic
Hi Blade,

If you don't see anything in the logs, then you should raise the debug
level/frequency.
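
For example, debug logging on the acting OSD can be raised at runtime (osd.6 taken from your output; 20/20 is just a commonly used verbose level):

$ ceph tell osd.6 injectargs '--debug-osd 20/20 --debug-ms 1'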

You must at least see that the repair command has been issued (started).

Also, I am wondering about the [6] in your output.

That means there is only 1 copy of it (on osd.6).

What is your setting for the minimum required number of copies?

osd_pool_default_min_size = ??

And what is the setting for the number of copies to create?

osd_pool_default_size = ???

Please give us the output of

ceph osd pool ls detail

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 03.05.2016 um 19:11 schrieb Blade Doyle:
> Hi Oliver,
> 
> Thanks for your reply.
> 
> The problem could have been caused by crashing/flapping OSD's. The
> cluster is stable now, but lots of pg problems remain.
> 
> $ ceph health
> HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded; 1
> pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized; 4 pgs
> undersized; recovery 1489/523934 objects degraded (0.284%); recovery
> 2620/523934 objects misplaced (0.500%); 158 scrub errors
> 
> Example: for pg 1.32 :
> 
> $ ceph health detail | grep "pg 1.32"
> pg 1.32 is stuck inactive for 13260.118985, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck unclean for 945560.550800, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck undersized for 12855.304944, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck degraded for 12855.305305, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is undersized+degraded+peered, acting [6]
> 
> I tried various things like:
> 
> $ ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
> 
> $ ceph pg deep-scrub 1.32
> instructing pg 1.32 on osd.6 to deep-scrub
> 
> Its odd that I never do see any log on osd.6 about scrubbing or
> repairing that pg (after waiting many hours).  I attached "ceph pg
> query" and a grep of osd logs for that page.  If there is a better way
> to provide large logs please let me know.
> 
> For reference the last mention of that pg in the logs is:
> 
> 2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418  kicking pg 1.32
> 2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
> 338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
> 349347/349347 349418/349418/349418) [] r=-1 lpr=349418
> pi=349346-349417/1 crt=338815'7743 lcod 0'0 inactive NOTIFY] lock
> 
> 
> Suggestions appreciated,
> Blade.
> 
> 
> 
> 
> On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle  > wrote:
> 
> Hi Ceph-Users,
> 
> Help with how to resolve these would be appreciated.
> 
> 2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log
> [INF] : 4.97 deep-scrub starts
> 2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640
>  >> 192.168.2.32:0/3983425916
>  pipe(0x27406000 sd=111 :6800 s=0
> pgs=0 cs=0 l=0 c=0x272da0a0).accept peer addr is really
> 192.168.2.32:0/3983425916  (socket
> is 192.168.2.32:38514/0 )
> 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0
> clones, 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137
> whiteouts, 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 deep-scrub 1 errors
> 2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log
> [INF] : 4.97 scrub starts
> 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones,
> 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 scrub 1 errors
> 
> Thanks Much,
> Blade.
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Incorrect crush map

2016-05-03 Thread Ben Hines
My crush map keeps putting some OSDs on the wrong node. Restarting them
fixes it temporarily, but they eventually hop back to the other node that
they aren't really on.

Is there anything that could cause this that I should look for?

Ceph 9.2.1

-Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling POSIX locking semantics for CephFS

2016-05-03 Thread Burkhard Linke

Hi,

On 03.05.2016 18:39, Gregory Farnum wrote:

On Tue, May 3, 2016 at 9:30 AM, Burkhard Linke
 wrote:

Hi,

we have a number of legacy applications that do not cope well with the POSIX
locking semantics in CephFS due to missing locking support (e.g. flock
syscalls). We are able to fix some of these applications, but others are
binary only.

Is it possible to disable POSIX locking completely in CephFS (either kernel
client or ceph-fuse)?

I'm confused. CephFS supports all of these — although some versions of
FUSE don't; you need a new-ish kernel.

So are you saying that
1) in your setup, it doesn't support both fcntl and flock,
2) that some of your applications don't do well under that scenario?

I don't really see how it's safe for you to just disable the
underlying file locking in an application which depends on it. You may
need to upgrade enough that all file locks are supported.


The application in question does a binary search in a large data file 
(~75 GB), which is stored on CephFS. It uses open and mmap without any 
further locking controls (neither fcntl nor flock). The performance was 
very poor with CephFS (Ubuntu Trusty with the 4.4 backport kernel from Xenial, and 
ceph-fuse) compared to the same application on NFS-based storage. I 
haven't had the time to dig further into the kernel implementation yet, 
but I assume that the root cause is locking of the pages accessed via the 
memory-mapped file. Adding a simple flock syscall to mark the data 
file globally as shared solved the problem for us, reducing the overall 
runtime from nearly 2 hours to 5 minutes (and thus comparable to the NFS 
control case). The application runs on our HPC cluster, so several hundred 
instances may access the same data file at once.
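
For binary-only tools, a similar shared lock can sometimes be taken from outside the process with util-linux flock(1) — a sketch, with placeholder paths, and no guarantee it has the same effect as a lock taken on the mmap'ed file descriptor itself:

$ flock --shared /cephfs/data/reference.dat ./legacy-app /cephfs/data/reference.dat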


We have other applications that were written without locking support and 
that do not perform very well with CephFS. There was a thread in 
February with a short discussion about CephFS mmap performance 
(http://article.gmane.org/gmane.comp.file-systems.ceph.user/27501). As 
pointed out in that thread, the problem is not only related to mmap 
itself, but also to the need to implement proper invalidation. We 
cannot fix this for all our applications due to the lack of manpower 
and, in some cases, the lack of source code. We either have to find a way 
to make them work with CephFS, or use a different setup, e.g. an extra 
NFS-based mount point re-exporting CephFS. I would like to avoid 
the latter solution...


Disabling the POSIX semantics and falling back to more NFS-like 
semantics without guarantees is a setback, but probably the easier way 
(if it is possible at all). Most data accessed by these applications is 
read-only, so complex locking is not necessary in these cases.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Implications of using directory as Ceph OSD devices

2016-05-03 Thread Vincenzo Pii
ceph-disk can prepare a disk, a partition, or a directory to be used as a device.

What are the implications and limits of using a directory?
Can it be used both for journal and storage?
What file system should the directory exist on?
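
For context, the directory form looks roughly like this (placeholder paths, assuming Jewel's ceph-disk):

$ mkdir -p /srv/ceph/osd-0
$ ceph-disk prepare /srv/ceph/osd-0
$ ceph-disk activate /srv/ceph/osd-0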


Vincenzo Pii | TERALYTICS
DevOps Engineer
Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
phone: +41 (0) 79 191 11 08
email: vincenzo@teralytics.net 
www.teralytics.net 

Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de 
Vries

This e-mail message contains confidential information which is for the sole 
attention and use of the intended recipient. Please notify us at once if you 
think that it may not be intended for you and delete it immediately. 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Status of ceph-docker

2016-05-03 Thread Vincenzo Pii
https://github.com/ceph/ceph-docker

Is someone using ceph-docker in production or the project is meant more for 
development and experimentation?

Vincenzo Pii | TERALYTICS
DevOps Engineer
Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
phone: +41 (0) 79 191 11 08
email: vincenzo@teralytics.net 
www.teralytics.net 

Company registration number: CH-020.3.037.709-7 | Trade register Canton Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann de 
Vries

This e-mail message contains confidential information which is for the sole 
attention and use of the intended recipient. Please notify us at once if you 
think that it may not be intended for you and delete it immediately. 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph degraded writes

2016-05-03 Thread Ben Hines
The Hammer .93 to .94 notes said:
If upgrading from v0.93, set 'osd enable degraded writes = false' on all
OSDs prior to upgrading. The degraded writes feature has been reverted due
to issue 11155.

Our cluster is now on Infernalis 9.2.1 and we still have this setting set.
Can we get rid of it? Was this release note just needed for the upgrade? I
think we may be encountering problems in our cluster during recovery
because we can't write to any object which has less than 3 copies even
though we have min_size at 1.

thanks,

-Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph degraded writes

2016-05-03 Thread Gregory Farnum
On Tue, May 3, 2016 at 4:10 PM, Ben Hines  wrote:
> The Hammer .93 to .94 notes said:
> If upgrading from v0.93, setosd enable degraded writes = false   on all osds
> prior to upgrading. The degraded writes feature has been reverted due to
> 11155.
>
> Our cluster is now on Infernalis 9.2.1 and we still have this setting set.
> Can we get rid of it? Was this release note just needed for the upgrade? I
> think we may be encountering problems in our cluster during recovery because
> we can't write to any object which has less than 3 copies even though we
> have min_size at 1.

Looks like this was only necessary for the upgrade, but I don't think
it will be impacting anything any more.
-Greg
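
Whether the option is even still recognised by a running OSD can be checked directly (osd.0 is just an example):

$ ceph daemon osd.0 config show | grep degraded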
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Changing pg_num on cache pool

2016-05-03 Thread Michael Shuey
I mistakenly created a cache pool with way too few PGs.  It's attached
as a write-back cache to an erasure-coded pool, has data in it, etc.;
cluster's using Infernalis.  Normally, I can increase pg_num live, but
when I try in this case I get:

# ceph osd pool set cephfs_data_cache pg_num 256

Error EPERM: splits in cache pools must be followed by scrubs and
leave sufficient free space to avoid overfilling.  use
--yes-i-really-mean-it to force.


Is there something I need to do, before increasing PGs on a cache
pool?  Can this be (safely) done live?

--
Mike Shuey
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Incorrect crush map

2016-05-03 Thread Wade Holler
Hi Ben,

What OS+Version ?

Best Regards,
Wade


On Tue, May 3, 2016 at 2:44 PM Ben Hines  wrote:

> My crush map keeps putting some OSDs on the wrong node. Restarting them
> fixes it temporarily, but they eventually hop back to the other node that
> they aren't really on.
>
> Is there anything that can cause this to look for?
>
> Ceph 9.2.1
>
> -Ben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] May CDM Moved

2016-05-03 Thread Patrick McGarry
Hey cephers,

Sorry for the late notice here, but due to an unavoidable conflict it
seems we’ll have to move this month’s CDM to next week. I’m leaving
the URL for blueprints the same in case there are bookmarks or other
links still floating around out there, but please submit at least a
couple of sentences about work you are doing in Ceph so that we can
review progress and status.

http://tracker.ceph.com/projects/ceph/wiki/CDM_04-MAY-2016

The May CDM will be 11 May 2016 @ 9p EST. Thanks.

-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com