[ceph-users] mds "Behind on trimming"

2016-03-21 Thread Dzianis Kahanovich
I have a stuck MDS warning (for the second time): "Behind on trimming (63/30)". The
cluster looks to be working. What does it mean, how do I avoid it, and how do I fix it
(other than stopping/migrating the active MDS)?

Both times it happened at night, probably during long backup/write operations
(something like a compressed local root backup to cephfs). Also, all local mounts
inside the cluster (fuse) have been moved to automount to reduce client pressure.
There are still 5 permanent kernel clients.

I have now remounted all but one of the kernel clients. The message persists.

Current version is 9.2.1-12-g3c10a09 with
https://github.com/ceph/ceph/commit/24de350d936e5ed70835d0ab2ad6b0b4f506123f.patch
; the previous incident was on an older version, without the patch.

-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: object unfound before backfill

2016-03-21 Thread lin zhou
Hi guys,

My cluster ran into a network problem, which caused some errors. After solving
the network problem, the latency of some OSDs on one node stayed high; according
to ceph osd perf it reached 3000+.

So I deleted this OSD from the cluster but kept its data device.
After recovery and backfill I hit the problem described in the subject. ceph
health detail shows:

pg 4.438 is active+recovering+degraded+remapped, acting [7,11], 1 unfound
pg 4.438 is stuck unclean for 135368.626141, current state
active+recovering+degraded+remapped, last acting [7,11]
recovery 1062/4842087 objects degraded (0.022%); 1/2028378 unfound (0.000%)

root@node-67:~# ceph pg map 4.438
osdmap e42522 pg 4.438 (4.438) -> up [34,20,30] acting [7,11]

I can still see the PG data on the deleted osd.6, and it differs somewhat from the
data on the existing osd.7 and osd.11.
Can I copy the PG data to the new up set and ignore the acting set?

Some info is below; the output of pg query is attached.


root@node-67:~# ceph pg 4.438 list_missing
{ "offset": { "oid": "",
  "key": "",
  "snapid": 0,
  "hash": 0,
  "max": 0,
  "pool": -1,
  "namespace": ""},
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
{ "oid": { "oid": "rbd_data.188b9163c78e9.15f2",
  "key": "",
  "snapid": -2,
  "hash": 2427198520,
  "max": 0,
  "pool": 4,
  "namespace": ""},
  "need": "39188'2314230",
  "have": "39174'2314229",
  "locations": []}],

root@node-67:~# ceph pg 4.438 mark_unfound_lost revert
Error EINVAL: pg has 1 unfound objects but we haven't probed all sources,
not marking lost

root@node-67:~# ceph pg 4.438 mark_unfound_lost delete
Error EINVAL: pg has 1 unfound objects but we haven't probed all sources,
not marking lost

pg query output is :
https://drive.google.com/file/d/0B08hG89CXoPbb2p0ZFc2OGRRQmpkcGVuZnoxNFJnQS05UDlv/view
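
For reference, the candidate sources and whether they have been probed yet show up
in the recovery_state section of the query output ("might_have_unfound"); a quick
way to look at just that part is something like:

root@node-67:~# ceph pg 4.438 query | grep -A 12 might_have_unfound
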
-
hnuzhoul...@gmail.com 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fresh install - all OSDs remain down and out

2016-03-21 Thread Markus Goldberg

Hi,
I have upgraded my hardware and installed Ceph completely from scratch as described 
in http://docs.ceph.com/docs/master/rados/deployment/
The last step was creating the OSDs: 
http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-osd/
I used the create command; after that the OSDs should be in and 
up, but they are all down and out.

An additional osd activate command does not help.

Ubuntu 14.04.4 kernel 4.2.1
ceph 10.0.2

What should I do, where is my mistake?

This is ceph.conf:

[global]
fsid = 122e929a-111b-4067-80e4-3fef39e66ecf
mon_initial_members = bd-0, bd-1, bd-2
mon_host = xxx.xxx.xxx.20,xxx.xxx.xxx.21,xxx.xxx.xxx.22
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = xxx.xxx.xxx.0/24
cluster network = 192.168.1.0/24
osd_journal_size = 10240
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1
osd_mkfs_type = btrfs
osd_mkfs_options_btrfs = -f -n 32k -l 32k
osd_mount_options_btrfs = rw,noatime,nodiratime,autodefrag
mds_max_file_size = 50


This is the log of the last osd:
##
bd-2:/dev/sdaf:/dev/sdaf2
ceph-deploy disk zap bd-2:/dev/sdaf
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.31): /usr/bin/ceph-deploy osd 
create --fs-type btrfs bd-2:/dev/sdaf:/dev/sdaf2

[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  disk  : [('bd-2', 
'/dev/sdaf', '/dev/sdaf2')]

[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: create
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   : 
/etc/ceph/dmcrypt-keys

[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   : 


[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : btrfs
[ceph_deploy.cli][INFO  ]  func  : <function osd at 0x7f944e16b500>

[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  zap_disk  : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks 
bd-2:/dev/sdaf:/dev/sdaf2

[bd-2][DEBUG ] connected to host: bd-2
[bd-2][DEBUG ] detect platform information from remote host
[bd-2][DEBUG ] detect machine type
[bd-2][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] Deploying osd to bd-2
[bd-2][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host bd-2 disk /dev/sdaf journal 
/dev/sdaf2 activate True
[bd-2][INFO  ] Running command: ceph-disk -v prepare --cluster ceph 
--fs-type btrfs -- /dev/sdaf /dev/sdaf2
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--check-allows-journal -i 0 --cluster ceph
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--check-wants-journal -i 0 --cluster ceph
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--check-needs-journal -i 0 --cluster ceph
[bd-2][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdaf uuid path is 
/sys/dev/block/65:240/dm/uuid
[bd-2][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdaf uuid path is 
/sys/dev/block/65:240/dm/uuid
[bd-2][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdaf uuid path is 
/sys/dev/block/65:240/dm/uuid
[bd-2][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdaf2 uuid path is 
/sys/dev/block/65:242/dm/uuid
[bd-2][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdaf2 uuid path is 
/sys/dev/block/65:242/dm/uuid
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=fsid
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd 
--cluster=ceph --show-config-value=osd_journal_size
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[bd-2][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf 
--cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[bd-2][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdaf uuid path is 
/sys/dev/block/65:240/dm/uuid
[bd-2][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdaf2 uuid path is 

[ceph-users] DSS 7000 for large scale object storage

2016-03-21 Thread Bastian Rosner

Hi,

any chance that somebody here already got hands on Dell DSS 7000 
machines?


4U chassis containing 90x 3.5" drives and 2x dual-socket server sleds 
(DSS7500). Sounds ideal for high capacity and density clusters, since 
each of the server-sleds would run 45 drives, which I believe is a 
suitable number of OSDs per node.


When searching for this model there's not much detailed information out 
there.
Sadly I could not find a review from somebody who actually owns a bunch 
of them and runs a decent PB-size cluster with it.


Cheers, Bastian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] DSS 7000 for large scale object storage

2016-03-21 Thread David
Sounds like you’ll have a field day waiting for rebuild in case of a node 
failure or an upgrade of the crush map ;)

David


> On 21 Mar 2016, at 09:55, Bastian Rosner wrote:
> 
> Hi,
> 
> any chance that somebody here already got hands on Dell DSS 7000 machines?
> 
> 4U chassis containing 90x 3.5" drives and 2x dual-socket server sleds 
> (DSS7500). Sounds ideal for high capacity and density clusters, since each of 
> the server-sleds would run 45 drives, which I believe is a suitable number of 
> OSDs per node.
> 
> When searching for this model there's not much detailed information out there.
> Sadly I could not find a review from somebody who actually owns a bunch of 
> them and runs a decent PB-size cluster with it.
> 
> Cheers, Bastian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs infernalis (ceph version 9.2.1) - bonnie++

2016-03-21 Thread Yan, Zheng
On Mon, Mar 21, 2016 at 2:33 PM, Michael Hanscho  wrote:
> On 2016-03-21 05:07, Yan, Zheng wrote:
>> On Sat, Mar 19, 2016 at 9:38 AM, Michael Hanscho  wrote:
>>> Hi!
>>>
>>> Trying to run bonnie++ on cephfs mounted via the kernel driver on a
>>> centos 7.2.1511 machine resulted in:
>>>
>>> # bonnie++ -r 128 -u root -d /data/cephtest/bonnie2/
>>> Using uid:0, gid:0.
>>> Writing a byte at a time...done
>>> Writing intelligently...done
>>> Rewriting...done
>>> Reading a byte at a time...done
>>> Reading intelligently...done
>>> start 'em...done...done...done...done...done...
>>> Create files in sequential order...done.
>>> Stat files in sequential order...done.
>>> Delete files in sequential order...Bonnie: drastic I/O error (rmdir):
>>> Directory not empty
>>> Cleaning up test directory after error.
>>
>> Please check if there are leftover files in the test directory. This
>> seems like a readdir bug (some files are missing in the readdir result) in an
>> old kernel. Which version of the kernel were you using?
>
> The bonnie++ directory and a file (0 bytes) in it were left over after
> the error message - you are right.
> Kernel: 3.10.0-327.10.1.el7.x86_64 (latest CentOS 7.2 kernel)
>
> (If I run the same test (Version: 1.96) on a local HD on the same
> machine - it is working as expected.)

The bug was introduced by RHEL7 backports. It can be fixed by the
attached patch.


Thank you for reporting this.
Yan, Zheng

>
> Regards
> Michael
>
>
>> Regards
>> Yan, Zheng
>>
>>>
>>> # ceph -w
>>> cluster 
>>>  health HEALTH_OK
>>>  monmap e3: 3 mons at
>>> {cestor4=:6789/0,cestor5=:6789/0,cestor6=:6789/0}
>>> election epoch 62, quorum 0,1,2 cestor4,cestor5,cestor6
>>>  mdsmap e30: 1/1/1 up {0=cestor2=up:active}, 1 up:standby
>>>  osdmap e703: 60 osds: 60 up, 60 in
>>> flags sortbitwise
>>>   pgmap v135437: 1344 pgs, 4 pools, 4315 GB data, 2315 kobjects
>>> 7262 GB used, 320 TB / 327 TB avail
>>> 1344 active+clean
>>>
>>> Any ideas?
>>>
>>> Regards
>>> Michael
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


readdir.patch
Description: Binary data
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] DSS 7000 for large scale object storage

2016-03-21 Thread Sean Redmond
I used a unit a little like this (
https://www.sgi.com/products/storage/servers/mis_server.html) for a SATA
pool in Ceph - rebuilds after a failure of a node can be painful without a
fair amount of testing & tuning.

I have opted for more units with fewer disks for future builds, using the R730XD.

On Mon, Mar 21, 2016 at 9:33 AM, David  wrote:

> Sounds like you’ll have a field day waiting for rebuild in case of a node
> failure or an upgrade of the crush map ;)
>
> David
>
>
> > On 21 Mar 2016, at 09:55, Bastian Rosner wrote:
> >
> > Hi,
> >
> > any chance that somebody here already got hands on Dell DSS 7000
> machines?
> >
> > 4U chassis containing 90x 3.5" drives and 2x dual-socket server sleds
> (DSS7500). Sounds ideal for high capacity and density clusters, since each
> of the server-sleds would run 45 drives, which I believe is a suitable
> number of OSDs per node.
> >
> > When searching for this model there's not much detailed information out
> there.
> > Sadly I could not find a review from somebody who actually owns a bunch
> of them and runs a decent PB-size cluster with it.
> >
> > Cheers, Bastian
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] DONTNEED fadvise flag

2016-03-21 Thread Kenneth Waegeman
Thanks! As we are using the kernel client of EL7, does someone know if 
that client supports it?


On 16/03/16 20:29, Gregory Farnum wrote:

On Wed, Mar 16, 2016 at 9:46 AM, Kenneth Waegeman
 wrote:

Hi all,

Quick question: Does cephFS pass the fadvise DONTNEED flag and take it into
account?
I want to use the --drop-cache option of rsync 3.1.1 to not fill the cache
when rsyncing to cephFS

It looks like ceph-fuse unfortunately does not. I'm not sure about the
kernel client though.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds "Behind on trimming"

2016-03-21 Thread John Spray
On Mon, Mar 21, 2016 at 7:44 AM, Dzianis Kahanovich
 wrote:
> I have a stuck MDS warning (for the second time): "Behind on trimming (63/30)". The
> cluster looks to be working. What does it mean, how do I avoid it, and how do I fix
> it (other than stopping/migrating the active MDS)?

The MDS has a metadata journal, whose length is measured in
"segments", and it is trimmed when the number of segments gets greater
than a certain limit.  The warning is telling you that the journal is
meant to be trimmed after 30 segments, but you currently have 63
segments.

This can happen when something (including a client) is failing to
properly clean up after itself, and leaving extra references to
something in one of the older segments.  In fact, a bug in the kernel
client was the original motivation for adding this warning message.
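
For what it's worth, the current limit and the connected clients can be inspected
through the MDS admin socket, and the limit can be raised temporarily if you just
want the warning to clear while investigating (a sketch; it assumes the admin
socket is reachable on the active MDS host and <name> is your MDS name):

# ceph daemon mds.<name> config show | grep mds_log_max_segments
# ceph daemon mds.<name> config set mds_log_max_segments 60
# ceph daemon mds.<name> session ls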

John

> Both times it happened at night, probably during long backup/write operations
> (something like a compressed local root backup to cephfs). Also, all local mounts
> inside the cluster (fuse) have been moved to automount to reduce client pressure.
> There are still 5 permanent kernel clients.
>
> I have now remounted all but one of the kernel clients. The message persists.

There's probably a reason you haven't already done this, but the next
logical debug step would be to try unmounting that last kernel client
(and mention what version it is)

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot remove rbd locks

2016-03-21 Thread Christoph Adomeit
Thanks Jason,

this worked ...
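
For the archive, the general form is:

# rbd lock list <image>
# rbd lock remove <image> "<lock id>" <locker>

where the locker (client.xxx) and the lock id (e.g. "auto ...") are the two columns
printed by rbd lock list.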

On Fri, Mar 18, 2016 at 02:31:44PM -0400, Jason Dillaman wrote:
> Try the following:
> 
> # rbd lock remove vm-114-disk-1 "auto 140454012457856" client.71260575
> 
> -- 
> 
> Jason Dillaman 
> 
> 
> - Original Message -
> > From: "Christoph Adomeit" 
> > To: ceph-us...@ceph.com
> > Sent: Friday, March 18, 2016 11:14:00 AM
> > Subject: [ceph-users] Cannot remove rbd locks
> > 
> > Hi,
> > 
> > some of my rbds show they have an exclusive lock.
> > 
> > I think the lock can be stale or weeks old.
> > 
> > We have also once added feature exclusive lock and later removed that 
> > feature
> > 
> > I can see the lock:
> > 
> > root@machine:~# rbd lock list vm-114-disk-1
> > There is 1 exclusive lock on this image.
> > Locker  ID   Address
> > client.71260575 auto 140454012457856 10.67.1.14:0/1131494432
> > 
> > But I cannot remove the lock:
> > 
> > root@machine:~# rbd lock remove vm-114-disk-1 auto client.71260575
> > rbd: releasing lock failed: (2) No such file or directory
> > 
> > How can I remove the locks ?
> > 
> > Thanks
> >   Christoph
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 

-- 
Christoph Adomeit
GATWORKS GmbH
Reststrauch 191
41199 Moenchengladbach
Sitz: Moenchengladbach
Amtsgericht Moenchengladbach, HRB 6303
Geschaeftsfuehrer:
Christoph Adomeit, Hans Wilhelm Terstappen

christoph.adom...@gatworks.de Internetloesungen vom Feinsten
Fon. +49 2166 9149-32  Fax. +49 2166 9149-10
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] DSS 7000 for large scale object storage

2016-03-21 Thread Bastian Rosner
Yes, rebuild in case of a whole chassis failure is indeed an issue. That 
depends on what the failure domain looks like.


I'm currently thinking of initially not running fully equipped nodes.
Let's say four of these machines with 60x 6TB drives each, so only 
loaded 2/3.

That's raw 1440TB distributed over eight OSD nodes.
Each individual OSD-node would therefore host "only" 30 OSDs but still 
allow for fast expansion.


Usually delivery and installation of a bunch of HDDs is much faster than 
servers.


I really wonder how easy it is to add additional disks and whether the 
chance of node- or even chassis-failure increases.


Cheers, Bastian

On 2016-03-21 10:33, David wrote:

Sounds like you’ll have a field day waiting for rebuild in case of a
node failure or an upgrade of the crush map ;)

David



On 21 Mar 2016, at 09:55, Bastian Rosner wrote:

Hi,

any chance that somebody here already got hands on Dell DSS 7000 
machines?


4U chassis containing 90x 3.5" drives and 2x dual-socket server sleds 
(DSS7500). Sounds ideal for high capacity and density clusters, since 
each of the server-sleds would run 45 drives, which I believe is a 
suitable number of OSDs per node.


When searching for this model there's not much detailed information 
out there.
Sadly I could not find a review from somebody who actually owns a 
bunch of them and runs a decent PB-size cluster with it.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds "Behind on trimming"

2016-03-21 Thread Dzianis Kahanovich
PS: I have now stopped this MDS, the active role migrated elsewhere and the warning cleared. I cannot experiment any further.

Dzianis Kahanovich writes:
> John Spray writes:
> 
>>> Both times it happened at night, probably during long backup/write operations
>>> (something like a compressed local root backup to cephfs). Also, all local
>>> mounts inside the cluster (fuse) have been moved to automount to reduce client
>>> pressure. There are still 5 permanent kernel clients.
>>>
>>> I have now remounted all but one of the kernel clients. The message persists.
>>
>> There's probably a reason you haven't already done this, but the next
>> logical debug step would be to try unmounting that last kernel client
>> (and mention what version it is)
> 
> 4.5.0. This VM eventually deadlocked in some places (maybe the problem has the
> same roots) and was hard-restarted; it is now mounted again. The message persists.
> 
> Just about a week ago I removed some additional mount options. Since the old days
> (when the VMs were on the same servers as the cluster) I mounted with
> "wsize=131072,rsize=131072,write_congestion_kb=128,readdir_max_bytes=131072"
> (and net.ipv4.tcp_notsent_lowat = 131072) to conserve RAM. After obtaining good
> servers for the VMs I removed those options. Maybe it is better to turn them back
> on for a better congestion quantum.
> 


-- 
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fresh install - all OSDs remain down and out

2016-03-21 Thread 施柏安
Hi Markus

You should define the "osd" devices and "host" buckets to make the Ceph cluster work.
Take the types in your map (osd, host, chassis, root) and design the
crushmap according to your needs.
Example:

host node1 {
id -1
alg straw
hash 0
item osd.0 weight 1.00
item osd.1 weight 1.00
}
host node2 {
id -2
alg straw
hash 0
item osd.2 weight 1.00
item osd.3 weight 1.00
}
root default {
id 0
alg straw
hash 0
item node1 weight 2.00 (sum of its item)
item node2 weight 2.00
}

Then you can use the default ruleset. It is set to take the root "default".

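If you prefer not to edit the decompiled map by hand, the same can be done from the
CLI (a sketch; the host and osd names are examples, adjust to your own and repeat
per host/OSD):

# ceph osd crush add-bucket node1 host
# ceph osd crush move node1 root=default
# ceph osd crush add osd.0 1.0 host=node1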

2016-03-21 19:50 GMT+08:00 Markus Goldberg :

> Hi desmond,
> this is my decompile_map:
> root@bd-a:/etc/ceph# cat decompile_map
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable straw_calc_version 1
>
> # devices
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> root default {
> id -1   # do not change unnecessarily
> # weight 0.000
> alg straw
> hash 0  # rjenkins1
> }
>
> # rules
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # end crush map
> root@bd-a:/etc/ceph#
>
> How should I change it?
> I never had to edit anything in this area in former versions of Ceph. Has
> something changed?
> Is any new parameter necessary in ceph.conf while installing?
>
> Thank you,
>   Markus
>
On 21.03.2016 at 10:34, 施柏安 wrote:
>
> It seems that no weight is set on any of your OSDs, so the PGs are stuck
> in creating.
> You can use the following commands to edit the crushmap and set the weights:
>
> # ceph osd getcrushmap -o map
> # crushtool -d map -o decompile_map
# vim decompile_map (then you can change the weight of each of your OSDs and
its host's weight)
> # crushtool -c decompile_map -o changed_map
> # ceph osd setcrushmap -i changed_map
>
> Then, it should work in your situation.
>
>
> 2016-03-21 17:20 GMT+08:00 Markus Goldberg :
>
>> Hi,
>> root@bd-a:~# ceph osd tree
>> ID WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1  0 root default
>>  0  0 osd.0   down0  1.0
>>  1  0 osd.1   down0  1.0
>>  2  0 osd.2   down0  1.0
>> ...delete all the other OSDs as they are the same
>> ...
>> 88  0 osd.88  down0  1.0
>> 89  0 osd.89  down0  1.0
>> root@bd-a:~#
>>
>> bye,
>>   Markus
>>
>> On 21.03.2016 at 10:10, 施柏安 wrote:
>>
>> What does your crushmap show? Or what does the command 'ceph osd tree' show?
>>
>> 2016-03-21 16:39 GMT+08:00 Markus Goldberg < 
>> goldb...@uni-hildesheim.de>:
>>
>>> Hi,
>>> I have upgraded my hardware and installed Ceph completely from scratch as described
>>> in 
>>> http://docs.ceph.com/docs/master/rados/deployment/
>>> The last step was creating the OSDs:
>>> 
>>> http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-osd/
>>> I used the create command; after that the OSDs should be in and
>>> up, but they are all down and out.
>>> An additional osd activate command does not help.
>>>
>>> Ubuntu 14.04.4 kernel 4.2.1
>>> ceph 10.0.2
>>>
>>> What should I do, where is my mistake?
>>>
>>> This is ceph.conf:
>>>
>>> [global]
>>> fsid = 122e929a-111b-4067-80e4-3fef39e66ecf
>>> mon_initial_members = bd-0, bd-1, bd-2
>>> mon_host = xxx.xxx.xxx.20,xxx.xxx.xxx.21,xxx.xxx.xxx.22
>>> auth_cluster_required = cephx
>>> auth_service_required = cephx
>>> auth_client_required = cephx
>>> public network = xxx.xxx.xxx.0/24
>>> cluster network = 192.168.1.0/24
>>> osd_journal_size = 10240
>>> osd pool default size = 2
>>> osd pool default min size = 1
>>> osd pool default pg num = 333
>>> osd pool default pgp num = 333
>>> osd crush chooseleaf type = 1
>>> osd_mkfs_type = btrfs
>>> osd_mkfs_options_btrfs = -f -n 32k -l 32k
>>> osd_mount_options_btrfs = rw,noatime,nodiratime,autodefrag
>>> mds_max_file_size = 50
>>>
>>>
>>> This is the log of the last osd:
>>> ##
>>> bd-2:/dev/sdaf:/dev/sdaf2
>>> ceph-deploy disk zap bd-2:/dev/sdaf
>>> [ceph_deploy.conf][DEBUG ] found configuration file at:
>>> /root/.cephdeploy.conf
>>> [ceph_deploy.cli][INFO  ] Invoked (1.5.31): /usr/bin/ceph-deploy osd
>>> create --fs-type btrfs bd-2:/dev/sdaf:/dev/sdaf2
>>> [ceph_deploy.cli][INFO  ] ceph-deploy options:
>>> [ceph_deploy.cli][INFO  ]  username  : None
>>> [ceph_deploy.cli][INFO  ]  disk  

Re: [ceph-users] DSS 7000 for large scale object storage

2016-03-21 Thread David
From my experience you’ll be better off planning exactly how many OSDs and 
nodes you’re going to have and, if possible, equipping them fully from the start.

If you just add new drives to the same pool, Ceph will start to rearrange data 
across the whole cluster, which might mean less client IO depending on what 
you’re comfortable with. In a worst-case scenario your clients won’t have 
enough IO and your services might be ”down” until the cluster is healthy again.

Rebuilding 60 x 6TB drives will take quite some time. Each SATA drive has about 
75-125 MB/s throughput at best, so a rebuild of one such drive would take 
approx. 16-17 hours. Usually it takes 2x or 3x longer in a normal 
case, and more if your controllers or network are the limit.
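
(Roughly: 6 TB at ~100 MB/s is about 60,000 seconds, i.e. the 16-17 hours above, 
before any real-world overhead.)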

// david


> On 21 Mar 2016, at 13:13, Bastian Rosner wrote:
> 
> Yes, rebuild in case of a whole chassis failure is indeed an issue. That 
> depends on how the failure domain looks like.
> 
> I'm currently thinking of initially not running fully equipped nodes.
> Let's say four of these machines with 60x 6TB drives each, so only loaded 2/3.
> That's raw 1440TB distributed over eight OSD nodes.
> Each individual OSD-node would therefore host "only" 30 OSDs but still allow 
> for fast expansion.
> 
> Usually delivery and installation of a bunch of HDDs is much faster than 
> servers.
> 
> I really wonder how easy it is to add additional disks and whether chance for 
> node- or even chassis-failure increases.
> 
> Cheers, Bastian
> 
> On 2016-03-21 10:33, David wrote:
>> Sounds like you’ll have a field day waiting for rebuild in case of a
>> node failure or an upgrade of the crush map ;)
>> David
>>> On 21 Mar 2016, at 09:55, Bastian Rosner wrote:
>>> Hi,
>>> any chance that somebody here already got hands on Dell DSS 7000 machines?
>>> 4U chassis containing 90x 3.5" drives and 2x dual-socket server sleds 
>>> (DSS7500). Sounds ideal for high capacity and density clusters, since each 
>>> of the server-sleds would run 45 drives, which I believe is a suitable 
>>> number of OSDs per node.
>>> When searching for this model there's not much detailed information out 
>>> there.
>>> Sadly I could not find a review from somebody who actually owns a bunch of 
>>> them and runs a decent PB-size cluster with it.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does object map feature lock snapshots ?

2016-03-21 Thread Christoph Adomeit
Hi Jason,

I can reproduce the issue 100%

Use standard ceph version 9.2.1 from repository

Create a VM RBD image (format 2); in my example it is:

vm-192-disk-1

enable these features:

rbd feature enable $IMG exclusive-lock
rbd feature enable $IMG object-map
rbd feature enable $IMG fast-diff

Start the VM and run some IO inside it; I ran bonnie++ in a loop.

then go ahead and create first snapshot

/usr/bin/rbd snap create rbd/vm-192-disk-1@initial.20160321-130439
export the snapshot (don't know if it is necessary)

/usr/bin/rbd export --rbd-concurrent-management-ops 20 
vm-192-disk-1@initial.20160321-130439 -|pigz -b 512|/bin/dd 
of=/backups/ceph/vm-192-disk-1.initial.20160321-130439.gz.tmp && /bin/mv 
/backups/ceph/vm-192-disk-1.initial.20160321-130439.gz.tmp 
/backups/ceph/vm-192-disk-1.initial.20160321-130439.gz 


this is no problem, it will work

then create the second snapshot:

/usr/bin/rbd snap create rbd/vm-192-disk-1@incremental.20160321-130741

after a few seconds you see on the console:

2016-03-21 13:08:46.091526 7f8ab372a7c0 -1 librbd::ImageWatcher: 0x561d8a394150 
no lock owners detected


So it is not the export diff that is hanging, it is the rbd snap create 
operation on an
additional snapshot

Often the IO in the VM also hangs, and sometimes the load in the VM goes up to 
800 or more.

Even after stopping the vm I can see the image has an exclusive lock:

# rbd lock ls vm-192-disk-1
There is 1 exclusive lock on this image.
Locker  ID   Address 
client.71565451 auto 140269345641344 10.67.1.15:0/2701777604 

Without the image features I do not have these problems.
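
A possible workaround, assuming the features are not otherwise required, is to
disable them again in reverse dependency order:

rbd feature disable vm-192-disk-1 fast-diff
rbd feature disable vm-192-disk-1 object-map
rbd feature disable vm-192-disk-1 exclusive-lock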

Can you reproduce this ?

Greetings
  Christoph


On Sun, Mar 20, 2016 at 10:57:16AM -0400, Jason Dillaman wrote:
> Definitely not a known issue but from a quick test (running export-diff 
> against an image being actively written) I wasn't able to recreate on v9.2.1. 
>  Are you able to recreate this reliably, and if so, can you share the steps 
> you used?
> 
> Thanks,
> 
> -- 
> 
> Jason Dillaman 
> 
> 
> - Original Message -
> > From: "Christoph Adomeit" 
> > To: "Jason Dillaman" 
> > Cc: ceph-us...@ceph.com
> > Sent: Friday, March 18, 2016 6:19:16 AM
> > Subject: Re: [ceph-users] Does object map feature lock snapshots ?
> > 
> > Hi,
> > 
> > I had no special logging activated.
> > 
> > Today I re-enabled exclusive-lock object-map and fast-diff on an image in
> > 9.2.1
> > 
> > As soon as I ran an rbd export-diff I had lots of these error messages on 
> > the
> > console of the rbd export process:
> > 
> > 2016-03-18 11:18:21.546658 7f77245d1700  1 heartbeat_map is_healthy
> > 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
> > 2016-03-18 11:18:26.546750 7f77245d1700  1 heartbeat_map is_healthy
> > 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
> > 2016-03-18 11:18:31.546840 7f77245d1700  1 heartbeat_map is_healthy
> > 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
> > 2016-03-18 11:18:36.546928 7f77245d1700  1 heartbeat_map is_healthy
> > 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
> > 2016-03-18 11:18:41.547017 7f77245d1700  1 heartbeat_map is_healthy
> > 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
> > 2016-03-18 11:18:46.547105 7f77245d1700  1 heartbeat_map is_healthy
> > 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60
> > 
> > 
> > Is this a known issue ?
> > 
> > 
> > 
> > 
> > 
> > On Tue, Mar 08, 2016 at 11:22:17AM -0500, Jason Dillaman wrote:
> > > Is there anyway for you to provide debug logs (i.e. debug rbd = 20) from
> > > your rbd CLI and qemu process when you attempt to create a snapshot?  In
> > > v9.2.0, there was an issue [1] where the cache flush writeback from the
> > > snap create request was being blocked when the exclusive lock feature was
> > > enabled, but that should have been fixed in v9.2.1.
> > > 
> > > [1] http://tracker.ceph.com/issues/14542
> > > 
> > > --
> > > 
> > > Jason Dillaman
> > > 
> > > 
> > > - Original Message -
> > > > From: "Christoph Adomeit" 
> > > > To: ceph-us...@ceph.com
> > > > Sent: Tuesday, March 8, 2016 11:13:04 AM
> > > > Subject: [ceph-users] Does object map feature lock snapshots ?
> > > > 
> > > > Hi,
> > > > 
> > > > i have installed ceph 9.21 on proxmox with kernel 4.2.8-1-pve.
> > > > 
> 

Re: [ceph-users] DONTNEED fadvise flag

2016-03-21 Thread Yan, Zheng

> On Mar 21, 2016, at 18:17, Kenneth Waegeman  wrote:
> 
> Thanks! As we are using the kernel client of EL7, does someone know if that 
> client supports it?
> 

fadvise DONTNEED is supported by the kernel memory management subsystem. Fadvise 
DONTNEED works for all filesystems (including the cephfs kernel client) that use 
the page cache. 

Yan, Zheng
 
> On 16/03/16 20:29, Gregory Farnum wrote:
>> On Wed, Mar 16, 2016 at 9:46 AM, Kenneth Waegeman
>>  wrote:
>>> Hi all,
>>> 
>>> Quick question: Does cephFS pass the fadvise DONTNEED flag and take it into
>>> account?
>>> I want to use the --drop-cache option of rsync 3.1.1 to not fill the cache
>>> when rsyncing to cephFS
>> It looks like ceph-fuse unfortunately does not. I'm not sure about the
>> kernel client though.
>> -Greg
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph RBD client on OSD nodes - how about a Docker deployment?

2016-03-21 Thread Christian Sarrasin

Hi there,

The docs have an ominous warning that one shouldn't run the RBD client 
(to mount block devices) on a machine which also serves OSDs [1]


Due to budget constraints, this topology would be useful in our 
situation.  Couple of q's:


1) Does the limitation also apply if the OSD daemon is run off a docker 
container [2]?


2) Any similar restrictions for a machine running MON daemons?

Many thanks!
Chris

---
[1] http://docs.ceph.com/docs/master/start/quick-rbd/
[2] https://hub.docker.com/r/ceph/daemon/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD client on OSD nodes - how about a Docker deployment?

2016-03-21 Thread Gregory Farnum
On Mon, Mar 21, 2016 at 11:45 AM, Christian Sarrasin
 wrote:
> Hi there,
>
> The docs have an ominous warning that one shouldn't run the RBD client (to
> mount block devices) on a machine which also serves OSDs [1]
>
> Due to budget constraints, this topology would be useful in our situation.
> Couple of q's:
>
> 1) Does the limitation also apply if the OSD daemon is run off a docker
> container [2]?
>
> 2) Any similar restrictions for a machine running MON daemons?

This is specifically a problem for kernel RBD clients, not userspace
ones — the problem being that if you get into a low-memory situation,
it may try to flush dirty pages out over RBD, which will make the OSD
process try to allocate memory which may not be available. Deadlock!

I don't think this should be a problem if you're running librbd in KVM
etc, but I don't *think* Docker provides enough isolation to prevent
it as a theoretical possibility.
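
For reference, a minimal sketch of the userspace (librbd) path via qemu, assuming
a pool named rbd and a qemu built with rbd support:

# qemu-img create -f raw rbd:rbd/test-image 10G
# qemu-system-x86_64 -m 1024 -drive format=raw,file=rbd:rbd/test-image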

Monitors shouldn't be an issue, although if they go down you're going
to have a very, very unhappy cluster so I wouldn't call it wise if
there's any possibility of resource contention
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD client on OSD nodes - how about a Docker deployment?

2016-03-21 Thread Gregory Farnum
...wow. Sorry for the spam at this point.

(How did you get some gmane address named after me to be in the
recipients list?)

On Mon, Mar 21, 2016 at 1:25 PM, Gregory Farnum  wrote:
> Heh, I failed to re-add the list the first time. Trying again, since
> they can probably help more than me on this topic.
>
> On Mon, Mar 21, 2016 at 1:24 PM, Gregory Farnum  wrote:
>> [ Re-adding the list. ]
>>
>> On Mon, Mar 21, 2016 at 1:13 PM, Christian Sarrasin
>>  wrote:
>>>
>>>
>>> Thanks Greg, really appreciate your detailed insight!
>>>
>>> If I wanted to go down the KVM route, I suppose it's best to run the OSDs on
>>> the hypervisor host (as they're likely to be more performance sensitive) and
>>> run the clients on the VMs?
>>
>> Probably? I haven't done this myself or seen many users doing it. I
>> think people are just starting to work on that sort of
>> "hyper-converged" setup but they're focusing on KVM+OSDs, not on
>> kernel mounts.
>> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] dependency of ceph_objectstore_tool in unhealthy ceph0.80.7 in ubuntu12.04

2016-03-21 Thread lin zhou
Hi,

I want to use ceph_objectstore_tool to export a PG from an OSD which
has been deleted from the cluster, just as
https://ceph.com/community/incomplete-pgs-oh-my/ does.


My ceph version is 0.80.7, and ceph_objectstore_tool has a dependency
on libgoogle-perftools0.

But libgoogle-perftools4 is installed by default.


I tested installing ceph-common (0.80.11) and ceph-test (0.80.11) on a pure
Ubuntu 12.04; there libgoogle-perftools0 is installed, not
libgoogle-perftools4.

But my cluster is currently not healthy, and I do not know whether I can
update to 0.80.11 first.


>pg 4.438 is stuck unclean for 352515.041170, current state 
>active+recovery_wait+degraded+remapped, last acting [7,11]

>pg 4.438 is active+recovery_wait+degraded+remapped, acting [7,11], 1 unfound


>root@node-65:~# ceph pg map 4.438

>osdmap e42701 pg 4.438 (4.438) -> up [34,20,30] acting [7,11]
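
For reference, the export/import workflow from that article looks roughly like this
(a sketch: the OSD must be stopped while the tool runs, and the paths and ids below
are only examples for this cluster):

# ceph_objectstore_tool --op export --pgid 4.438 --data-path /var/lib/ceph/osd/ceph-6 --journal-path /var/lib/ceph/osd/ceph-6/journal --file /tmp/4.438.export
# ceph_objectstore_tool --op import --data-path /var/lib/ceph/osd/ceph-7 --journal-path /var/lib/ceph/osd/ceph-7/journal --file /tmp/4.438.export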
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Optimations of cephfs clients on WAN: Looking for suggestions.

2016-03-21 Thread Goncalo Borges
Dear CephFS gurus...

I would like your advice on how to improve performance without compromising 
reliability for CephFS clients deployed over a WAN.

Currently, our infrastructure relies on:
- ceph infernalis
- a ceph object cluster, with all core infrastructure components sitting in the 
same data centre:
  a./ 8 storage servers (8 OSDs per server running on single spins; 2 SSDs with 
4 partitions each for OSD journals)
  b. 3 MONS
- one active MDS and another MDS on standby replay mode, also in the same data 
centre.
- OSDs, MONs and MDS all natively connected at 10 Gb/s
- cephfs clients mounted via ceph-fuse in different physical geographical 
locations (different network ranges)
- The communication bottleneck between cephfs clients and the core Ceph/CephFS 
infrastructure is not the network but the 1 GB Eth cards of some of the hosts 
where cephfs clients are deployed.

Although this setup is not exactly what we are aiming for in the future, for now I 
would like to ask for suggestions on which parameters to tune to improve 
performance without compromising reliability, especially for those cephfs 
clients behind 1 Gb/s links.
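
For context, the kind of client-side ceph.conf knobs I have been wondering about are 
along these lines (the values are only placeholders, not recommendations; 
client_oc_size and client_readahead_max_bytes control the ceph-fuse object cache 
and readahead):

[client]
client oc size = 209715200
client readahead max bytes = 4194304
client readahead max periods = 4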

In the past I have found some generic article which debated this issue but I am 
not able to find it now, nor other relevant info.

Help is appreciated.

Thank you for your feedback

Cheers
Goncalo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any suggestion to deal with slow request?

2016-03-21 Thread lin zhou
I am facing the same problem.

My osd.7 reports slow requests, and many PGs are in the state active+recovery_wait.


I checked the network and the device behind osd.7; no errors.


Have you solved your problem?
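
In case it is useful, this is how I have been sampling per-OSD latencies (a sketch;
adjust the osd id, and the admin socket must be reachable on the OSD host):

# ceph osd perf
# ceph daemon osd.7 perf dump | python -m json.tool | grep -i -A 2 journal_latency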

2016-01-08 13:06 GMT+08:00 Christian Balzer :
>
> Hello,
>
>
> On Fri, 8 Jan 2016 12:22:04 +0800 Jevon Qiao wrote:
>
>> Hi Robert,
>>
>> Thank you for the prompt response.
>>
>> The OSDs are built on XFS and the drives are Intel SSDs.  Each SSD is
>> parted into two partitions, one is for journal, the other is for data.
>> There is no alignment issue for the partitions.
>>
>
> As Robert said, details. All of them can be crucial.
>
> The missing detail here is which exact model of Intel SSDs.
>
> What you're describing below is not typical for Intel DC type SSDs (they
> perform at full speed and are very consistent at that).
>
> My suspicion is that you're using consumer grade SSDs.
>
>
>> When slow request msg is outputted, the workload is quite light on the
>> replication OSDs.
>>
>> Device: rrqm/s   wrqm/s r/s w/s rMB/swMB/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sda   0.00 0.000.50   30.00 0.00 0.18
>> 12.33 0.000.08   0.08   0.25
>> sdb   0.00 0.500.50   78.00 0.00 0.75
>> 19.57 0.091.20   0.08   0.60
>> sdc   0.00 0.500.00   28.00 0.00 0.24
>> 17.75 0.010.32   0.11   0.30
>>
>
> Look into atop, it gives you (with a big enough window) a very
> encompassing view of what your system is doing and were bottlenecks are
> likely to be.
>
>> I benchmarked some OSDs with 'ceph tell osd.x bench',and learned that
>> the throughput for some OSDs(the disk usage is over 60%) is 21MB/s,
>> which seems abnormal.
>>
>> $ ceph tell osd.24 bench
>> { "bytes_written": 1073741824,
>>"blocksize": 4194304,
>>"bytes_per_sec": "22995975.00"}
>>
>> But the throughput for some newly added OSDs can reach 370MB/s. I
>> suspect if it is related to the GC of SSD. If so, it might explain why
>> it takes such long time to write journal. Any idea?
>>
> There are lots of threads in this ML about which type of SSDs are suitable
> for journals or not.
>
> Regards,
>
> Chibi
>> Another phenomenon that the journal_write is queued in writeq for 3
>> seconds, I checked the corresponding process logic in function
>> FileJournal::submit_entry() and FileJournal::write_thread_entry(), I did
>> not find anything suspicious point.
>>
>> Thanks,
>> Jevon
>> On 8/1/16 00:43, Robert LeBlanc wrote:
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA256
>> >
>> > What is the file system on the OSDs? Anything interesting in
>> > iostat/atop? What are the drives backing the OSDs? A few more details
>> > would be helpful.
>> > - 
>> > Robert LeBlanc
>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> >
>> >
>> > On Wed, Jan 6, 2016 at 9:03 PM, Jevon Qiao  wrote:
>> >> Hi Cephers,
>> >>
>> >> We have a Ceph cluster running 0.80.9, which consists of 36 OSDs with
>> >> 3 replicas. Recently, some OSDs keep reporting slow request and the
>> >> cluster has a performance downgrade.
>> >>
>> >>  From the log of one OSD, I observe that all the slow requests are
>> >> resulted from waiting for the replicas to complete. And the
>> >> replication OSDs are not always some specific ones but could be any
>> >> other two OSDs.
>> >>
>> >> 2016-01-06 08:17:11.887016 7f175ef25700  0 log [WRN] : slow request
>> >> 1.162776 seconds old, received at 2016-01-06 08:17:11.887092:
>> >> osd_op(client.13302933.0:839452
>> >> rbd_data.c2659c728b0ddb.0024 [stat,set-alloc-hint
>> >> object_size 16777216 write_size 16777216,write 12099584~8192]
>> >> 3.abd08522 ack+ondisk+write e4661) v4 currently waiting for subops
>> >> from 24,31
>> >>
>> >> I dumped out the historic Ops of the OSD and noticed the following
>> >> information:
>> >> 1) wait about 8 seconds for the replies from the replica OSDs.
>> >>  { "time": "2016-01-06 08:17:03.879264",
>> >>"event": "op_applied"},
>> >>  { "time": "2016-01-06 08:17:11.684598",
>> >>"event": "sub_op_applied_rec"},
>> >>  { "time": "2016-01-06 08:17:11.687016",
>> >>"event": "sub_op_commit_rec"},
>> >>
>> >> 2) spend more than 3 seconds in writeq and 2 seconds to write the
>> >> journal. { "time": "2016-01-06 08:19:16.887519",
>> >>"event": "commit_queued_for_journal_write"},
>> >>  { "time": "2016-01-06 08:19:20.109339",
>> >>"event": "write_thread_in_journal_buffer"},
>> >>  { "time": "2016-01-06 08:19:22.177952",
>> >>"event": "journaled_completion_queued"},
>> >>
>> >> Any ideas or suggestions?
>> >>
>> >> BTW, I checked the underlying network with iperf, it works fine.
>> >>
>> >> Th