Re: [ceph-users] Is it safe to enable rbd cache with qemu?

2014-08-23 Thread Sage Weil
For Giant, we have changed the default librbd caching options to:

 rbd cache = true
 rbd cache writethrough until flush = true

The second option enables the cache for reads but does writethrough until 
we observe a FLUSH command come through, which implies that the guest OS 
is issuing barriers.  This doesn't guarantee they are doing it properly, 
of course, but it means they are at least trying.  Once we see a flush, we 
infer that writeback is safe.
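On releases before Giant, the same behavior can be opted into explicitly on the 
client side; a minimal ceph.conf sketch (the two options above, placed in the 
[client] section on the host running qemu/librbd):

 [client]
 # same options Sage lists above; values match the new Giant defaults
 rbd cache = true
 rbd cache writethrough until flush = true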

sage


On Sat, 23 Aug 2014, Alexandre DERUMIER wrote:

> >>But what about Windows? Does NTFS support barriers too?
> 
> Windows releases newer than 2003 support FUA (like newer Linux kernels do), so it's safe. 
> 
> The virtio-win drivers have supported it too for a year or two. 
> 
> I had a discussion about this some time ago, see: 
> 
> https://github.com/YanVugenfirer/kvm-guest-drivers-windows/issues/3
> 
> 
> ----- Original Message ----- 
> 
> From: "Yufang" 
> To: "Alexandre DERUMIER" 
> Cc: ceph-users@lists.ceph.com 
> Sent: Friday, 22 August 2014 18:05:32 
> Subject: Re: [ceph-users] Is it safe to enable rbd cache with qemu? 
> 
> Thanks, Alexandre. But what about Windows? Does NTFS support barriers too? 
> Should I be confident that a Win2k3 guest could survive without data loss on a 
> host/guest crash? 
> 
> > On 22 August 2014, at 23:07, Alexandre DERUMIER wrote: 
> > 
> > Hi, 
> > for RHEL5, I'm not sure. 
> > 
> > Barrier support may not be implemented in virtio devices, LVM, dm-raid 
> > and some filesystems, depending on the kernel version. 
> > 
> > Not sure what is backported into the RHEL5 kernel. 
> > 
> > 
> > see 
> > http://monolight.cc/2011/06/barriers-caches-filesystems/ 
> > 
> > 
> > 
> > ----- Original Message ----- 
> > 
> > From: "Yufang Zhang" 
> > To: ceph-users@lists.ceph.com 
> > Sent: Friday, 22 August 2014 13:05:02 
> > Subject: [ceph-users] Is it safe to enable rbd cache with qemu? 
> > 
> > 
> > Hi guys, 
> > 
> > 
> > Apologies if this question has been asked before. I'd like to know whether it is 
> > safe to enable rbd cache with qemu (cache mode set to writeback) in 
> > production. Currently there are 4 types of guest OS supported in our 
> > production: RHEL5, RHEL6, Win2k3 and Win2k8. Our host is RHEL6.2, on which qemu 
> > supports barrier passing. We are therefore confident that RHEL6 guests (with 
> > barriers enabled by default) will work well with rbd cache enabled. But as 
> > for RHEL5, Win2k3 and Win2k8, I am not sure it is 100% safe in scenarios 
> > such as guest crash, host crash or power loss. Could anybody give some 
> > suggestions? I really appreciate your help. 
> > 
> > 
> > Yufang 


[ceph-users] Monitor/OSD report tuning question

2014-08-23 Thread Bruce McFarland
Hello,
I have a cluster with 30 OSDs distributed over 3 storage servers connected by a 
10G cluster link and connected to the monitor over 1G. I still have a lot to 
learn about Ceph. Watching the cluster messages in a "ceph -w" window I see a 
lot of OSD "flapping" while the cluster is just sitting in a configured state, 
with placement groups constantly changing status. The cluster was configured 
and came up with 1920 'active+clean' placement groups.

The 3 status outputs below were issued over the course of a couple of minutes. 
As you can see there is a lot of activity where, I assume, OSD reporting 
occasionally falls outside the heartbeat timeout, and various placement groups 
get set to 'stale' and/or 'degraded' but still 'active'. There are OSDs being 
marked down in the osd map, which I see in the watch window as failure reports 
that very quickly turn into "wrongly marked me down". I'm assuming I need to 
tune some of the many timeout values so that the OSDs and placement groups can 
all report within the timeout window.


A quick look at the --admin-daemon config show command tells me that I might 
consider tuning some of these values:

[root@ceph0 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config 
show | grep report
  "mon_osd_report_timeout": "900",
  "mon_osd_min_down_reporters": "1",
  "mon_osd_min_down_reports": "3",
  "osd_mon_report_interval_max": "120",
  "osd_mon_report_interval_min": "5",
  "osd_pg_stat_report_interval_max": "500",
[root@ceph0 ceph]#
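
The heartbeat-related settings can be dumped from the same admin socket; I have 
not dug into those yet, but the 'grace' values in the failure messages below 
suggest they may be the more relevant ones:

ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config show | grep heartbeat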

Which osd and/or mon settings should I increase/decrease to eliminate all this 
state flapping while the cluster sits configured with no data?
Thanks,
Bruce

2014-08-23 13:16:15.564932 mon.0 [INF] osd.20 209.243.160.83:6800/20604 failed 
(65 reports from 20 peers after 23.380808 >= grace 21.991016)
2014-08-23 13:16:15.565784 mon.0 [INF] osd.23 209.243.160.83:6810/29727 failed 
(79 reports from 20 peers after 23.675170 >= grace 21.990903)
2014-08-23 13:16:15.566038 mon.0 [INF] osd.25 209.243.160.83:6808/31984 failed 
(65 reports from 20 peers after 23.380921 >= grace 21.991016)
2014-08-23 13:16:15.566206 mon.0 [INF] osd.26 209.243.160.83:6811/518 failed 
(65 reports from 20 peers after 23.381043 >= grace 21.991016)
2014-08-23 13:16:15.566372 mon.0 [INF] osd.27 209.243.160.83:6822/2511 failed 
(65 reports from 20 peers after 23.381195 >= grace 21.991016)
.
.
.
2014-08-23 13:17:09.547684 osd.20 [WRN] map e27128 wrongly marked me down
2014-08-23 13:17:10.826541 osd.23 [WRN] map e27130 wrongly marked me down
2014-08-23 13:20:09.615826 mon.0 [INF] osdmap e27134: 30 osds: 26 up, 30 in
2014-08-23 13:17:10.954121 osd.26 [WRN] map e27130 wrongly marked me down
2014-08-23 13:17:19.125177 osd.25 [WRN] map e27135 wrongly marked me down

[root@ceph-mon01 ceph]# ceph -s
cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
 health HEALTH_OK
 monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, 
quorum 0 ceph-mon01
 osdmap e26636: 30 osds: 30 up, 30 in
  pgmap v56534: 1920 pgs, 3 pools, 0 bytes data, 0 objects
26586 MB used, 109 TB / 109 TB avail
1920 active+clean
[root@ceph-mon01 ceph]# ceph -s
cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
 health HEALTH_WARN 160 pgs degraded; 83 pgs stale
 monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, 
quorum 0 ceph-mon01
 osdmap e26641: 30 osds: 30 up, 30 in
  pgmap v56545: 1920 pgs, 3 pools, 0 bytes data, 0 objects
26558 MB used, 109 TB / 109 TB avail
  83 stale+active+clean
 160 active+degraded
1677 active+clean
[root@ceph-mon01 ceph]# ceph -s
cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
 health HEALTH_OK
 monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2, 
quorum 0 ceph-mon01
 osdmap e26657: 30 osds: 30 up, 30 in
  pgmap v56584: 1920 pgs, 3 pools, 0 bytes data, 0 objects
26610 MB used, 109 TB / 109 TB avail
1920 active+clean
[root@ceph-mon01 ceph]#



Re: [ceph-users] Monitor/OSD report tuning question

2014-08-23 Thread Bruce McFarland
Based on the ceph -w output, my guess is that the osd_heartbeat_grace default of 
20 is causing my reporting issues. I've seen failures, all of which recover, 
from reports arriving after 22 to ~28 seconds. I was unable to set 
osd_heartbeat_grace using the runtime command; every syntax I tried failed. I 
changed the setting in ceph.conf and restarted all of the daemons. The runtime 
config now reflects the new osd_heartbeat_grace of 30, but I still see OSD 
failures in the ceph -w output for reporting outside the 20 second grace.


-  What am I overlooking?

-  What is the proper syntax for changing the osd_heartbeat_grace at 
runtime?


[root@ceph0 ceph]# ceph osd tell osd.* injectargs '--osd_heartbeat_grace 30'
"osd tell" is deprecated; try "tell osd." instead (id can be "*")
[root@ceph0 ceph]# ceph osd tell osd.20 injectargs '--osd_heartbeat_grace 30'
"osd tell" is deprecated; try "tell osd." instead (id can be "*")
[root@ceph0 ceph]# ceph osd tell * injectargs '--osd_heartbeat_grace 30'
"osd tell" is deprecated; try "tell osd." instead (id can be "*")
[root@ceph0 ceph]# ceph ceph-osd tell osd.* injectargs '--osd_heartbeat_grace 
30'
no valid command found; 10 closest matches:
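
Going by the deprecation message, my guess at the runtime form it wants is 
something like this, though I have not confirmed the syntax:

ceph tell osd.20 injectargs '--osd_heartbeat_grace 30'   # single OSD; guessed from the hint above
ceph tell osd.* injectargs '--osd_heartbeat_grace 30'    # the hint says the id can be "*"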

After making the change to ceph.conf and restarting all daemons, 
osd_heartbeat_grace now reports 30, but OSDs are still being failed for 
exceeding the 20 second default grace.
[root@ceph0 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config 
show | grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
[root@ceph0 ceph]#

2014-08-23 14:16:18.069827 mon.0 [INF] osd.20 209.243.160.83:6806/23471 failed 
(76 reports from 20 peers after 24.267838 >= grace 20.994852)
2014-08-23 14:13:20.057523 osd.26 [WRN] map e28337 wrongly marked me down


Re: [ceph-users] One Mon log huge and this Mon down often

2014-08-23 Thread debian Only
This happened when I used *ceph-deploy create ceph01-vm ceph02-vm ceph04-vm* to
create the 3 mon members in one step. Now, every 10 hours or so, one mon goes
down, every time with the error below, even though the disk sometimes still has
plenty of space left, such as 30G.

When I deployed Ceph before, I created only one mon in the first step
(*ceph-deploy create ceph01-vm*) and then ran *ceph-deploy mon add ceph02-vm*;
I did not hit this problem then.

I do not know why.

2014-08-23 10:19:43.910650 7f3c0028c700  0 mon.ceph01-vm@1(peon).data_health(56) update_stats avail 5% total 15798272 used 12941508 avail 926268
2014-08-23 10:19:43.910806 7f3c0028c700 -1 mon.ceph01-vm@1(peon).data_health(56) reached critical levels of available space on local monitor storage -- shutdown!
2014-08-23 10:19:43.910811 7f3c0028c700  0 ** Shutdown via Data Health Service **
2014-08-23 10:19:43.931427 7f3bffa8b700  1 mon.ceph01-vm@1(peon).paxos(paxos active c 15814..16493) is_readable now=2014-08-23 10:19:43.931433 lease_expire=2014-08-23 10:19:45.989585 has v0 lc 16493
2014-08-23 10:19:43.931486 7f3bfe887700 -1 mon.ceph01-vm@1(peon) e2 *** Got Signal Interrupt ***
2014-08-23 10:19:43.931515 7f3bfe887700  1 mon.ceph01-vm@1(peon) e2 shutdown
2014-08-23 10:19:43.931725 7f3bfe887700  0 quorum service shutdown
2014-08-23 10:19:43.931730 7f3bfe887700  0 mon.ceph01-vm@1(shutdown).health(56) HealthMonitor::service_shutdown 1 services
2014-08-23 10:19:43.931735 7f3bfe887700  0 quorum service shutdown
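
The shutdown above is the monitor's data-health service stopping the mon when
its store hits the critical free-space threshold (the 5% in the log looks like
the default 'mon data avail crit'). A sketch of the checks I plan to run to see
what is eating the space, assuming the default mon data directory naming on
this host:

df -h /var/lib/ceph/mon/ceph-ceph01-vm                # free space on the filesystem holding the mon store (path assumed)
du -sh /var/lib/ceph/mon/ceph-ceph01-vm/store.db      # size of the mon's leveldb store (path assumed)
ceph tell mon.ceph01-vm compact                       # ask the mon to compact its store if it has grown large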



2014-08-22 21:31 GMT+07:00 debian Only :

> This time ceph01-vm went down, and no big log was produced; the other 2 are ok. I do
> not know what the reason is. This is not my first time installing Ceph, but it is the
> first time I have seen a mon go down again and again.
>
> ceph.conf on each OSD and MON:
>  [global]
> fsid = 075f1aae-48de-412e-b024-b0f014dbc8cf
> mon_initial_members = ceph01-vm, ceph02-vm, ceph04-vm
> mon_host = 192.168.123.251,192.168.123.252,192.168.123.250
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
>
> rgw print continue = false
> rgw dns name = ceph-radosgw
> osd pool default pg num = 128
> osd pool default pgp num = 128
>
>
> [client.radosgw.gateway]
> host = ceph-radosgw
> keyring = /etc/ceph/ceph.client.radosgw.keyring
> rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
> log file = /var/log/ceph/client.radosgw.gateway.log
>
>
> 2014-08-22 18:15 GMT+07:00 Joao Eduardo Luis :
>
> On 08/22/2014 10:21 AM, debian Only wrote:
>>
>>> I have 3 mons in Ceph 0.80.5 on Wheezy, plus one RadosGW.
>>>
>>> When this happened the first time, I increased the mon log level.
>>> This time mon.ceph02-vm is down; only this mon is down, the other 2 are ok.
>>>
>>> Please, can someone give me some guidance?
>>>
>>>   27M Aug 22 02:11 ceph-mon.ceph04-vm.log
>>>   43G Aug 22 02:11 ceph-mon.ceph02-vm.log
>>>   2G Aug 22 02:11 ceph-mon.ceph01-vm.log
>>>
>>
>> Depending on the debug level you set, and on which subsystems you set a
>> higher debug level for, the monitor can spit out A LOT of information in a
>> short period of time.  43GB is nothing compared to some 100+ GB logs I've
>> had to churn through in the past.
>>
>> However, I'm not grasping what kind of help you need.  According to your
>> 'ceph -s' below the monitors seem okay -- all are in, health is OK.
>>
>> If your issue is with that one monitor spitting out humongous
>> amounts of debug info, here's what you need to do:
>>
>> - If you added one or more 'debug <subsystem> = X' lines to that monitor's
>> ceph.conf, you will want to remove them so that on a future restart the
>> monitor doesn't start with non-default debug levels.
>>
>> - You will want to inject default debug levels into that one monitor.
>>
>> Depending on what debug levels you have increased, you will want to run a
>> version of "ceph tell mon.ceph02-vm injectargs '--debug-mon 1/5 --debug-ms
>> 0/5 --debug-paxos 1/5'"
>>
>>   -Joao
>>
>>
>>> # ceph -s
>>>  cluster 075f1aae-48de-412e-b024-b0f014dbc8cf
>>>   health HEALTH_OK
>>>   monmap e2: 3 mons at
>>> {ceph01-vm=192.168.123.251:6789/0,ceph02-vm=192.168.123.252:6789/0,ceph04-vm=192.168.123.250:6789/0},
>>>
>>> election epoch 44, quorum 0,1,2 ceph04-vm,ceph01-vm,ceph02-vm
>>>   mdsmap e10: 1/1/1 up {0=ceph06-vm=up:active}
>>>   osdmap e145: 10 osds: 10 up, 10 in
>>>pgmap v4394: 2392 pgs, 21 pools, 4503 MB data, 1250 objects
>>>  13657 MB used, 4908 GB / 4930 GB avail
>>>  2392 active+clean
>>>
>>>
>>> 2014-08-22 02:06:34.738828 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9756) is_readable
>>> now=2014-08-22 02:06:34.738830 lease_expire=2014-08-22 02:06:39.701305 has v0 lc 9756
>>> 2014-08-22 02:06:36.618805 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9756) is_readable
>>> now=2014-08-22 02:06:36.618807 lease_expire=2014-08-22 02:06:39.701305 has v0 lc 975

Re: [ceph-users] ceph cluster inconsistency?

2014-08-23 Thread Haomai Wang
It's really strange! I wrote a test program following the key ordering
you provided and parsed the corresponding values, and everything checks out.

I have no idea now. If you have time, could you add this debug line to
"src/os/GenericObjectMap.cc", inserted *before* "assert(start <=
header.oid);":

dout(0) << "start: " << start << " header.oid: " << header.oid << dendl;

Then you need to recompile ceph-osd and run it again. The output log
should help us figure it out!
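
Once the rebuilt ceph-osd hits the problem again, the new output should be easy
to pull out of its log; something like the following, assuming the default log
location for that OSD:

grep "start: " /var/log/ceph/ceph-osd.67.log | tail -n 20   # log path assumed; adjust to your setup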

On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang  wrote:
> I feel a little embarrassed; the 1024 rows still look correct to me.
>
> I was wondering if you could send all of your keys via
> "ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
> _GHOBJTOSEQ_ > keys.log".
>
> thanks!
>
> On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman
>  wrote:
>>
>> - Message from Haomai Wang  -
>>Date: Tue, 19 Aug 2014 12:28:27 +0800
>>
>>From: Haomai Wang 
>> Subject: Re: [ceph-users] ceph cluster inconsistency?
>>  To: Kenneth Waegeman 
>>  Cc: Sage Weil , ceph-users@lists.ceph.com
>>
>>
>>> On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman
>>>  wrote:


 - Message from Haomai Wang  -
Date: Mon, 18 Aug 2014 18:34:11 +0800

From: Haomai Wang 
 Subject: Re: [ceph-users] ceph cluster inconsistency?
  To: Kenneth Waegeman 
  Cc: Sage Weil , ceph-users@lists.ceph.com



> On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman
>  wrote:
>>
>>
>> Hi,
>>
>> I tried this after restarting the osd, but I guess that was not the aim
>> (
>> # ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list
>> _GHOBJTOSEQ_|
>> grep 6adb1100 -A 100
>> IO error: lock /var/lib/ceph/osd/ceph-67/current//LOCK: Resource
>> temporarily
>> unavailable
>> tools/ceph_kvstore_tool.cc: In function 'StoreTool::StoreTool(const
>> string&)' thread 7f8fecf7d780 time 2014-08-18 11:12:29.551780
>> tools/ceph_kvstore_tool.cc: 38: FAILED assert(!db_ptr->open(std::cerr))
>> ..
>> )
>>
>> When I run it after bringing the osd down, it takes a while, but it has no
>> output. (When running it without the grep, I get a huge list.)
>
>
>
> Oh, sorry about that! I made a mistake: the hash value (6adb1100) is
> stored reversed in leveldb.
> So grepping for "benchmark_data_ceph001.cubone.os_5560_object789734" should
> find it.
>
 this gives:

 [root@ceph003 ~]# ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/
 list
 _GHOBJTOSEQ_ | grep 5560_object789734 -A 100

 _GHOBJTOSEQ_:3%e0s0_head!0011BDA6!!3!!benchmark_data_ceph001%ecubone%eos_5560_object789734!head

 _GHOBJTOSEQ_:3%e0s0_head!0011C027!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1330170!head

 _GHOBJTOSEQ_:3%e0s0_head!0011C6FD!!3!!benchmark_data_ceph001%ecubone%eos_4919_object227366!head

 _GHOBJTOSEQ_:3%e0s0_head!0011CB03!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1363631!head

 _GHOBJTOSEQ_:3%e0s0_head!0011CDF0!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1573957!head

 _GHOBJTOSEQ_:3%e0s0_head!0011D02C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1019282!head

 _GHOBJTOSEQ_:3%e0s0_head!0011E2B5!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1283563!head

 _GHOBJTOSEQ_:3%e0s0_head!0011E511!!3!!benchmark_data_ceph001%ecubone%eos_4919_object273736!head

 _GHOBJTOSEQ_:3%e0s0_head!0011E547!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1170628!head

 _GHOBJTOSEQ_:3%e0s0_head!0011EAAB!!3!!benchmark_data_ceph001%ecubone%eos_4919_object256335!head

 _GHOBJTOSEQ_:3%e0s0_head!0011F446!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1484196!head

 _GHOBJTOSEQ_:3%e0s0_head!0011FC59!!3!!benchmark_data_ceph001%ecubone%eos_5560_object884178!head

 _GHOBJTOSEQ_:3%e0s0_head!001203F3!!3!!benchmark_data_ceph001%ecubone%eos_5560_object853746!head

 _GHOBJTOSEQ_:3%e0s0_head!001208E3!!3!!benchmark_data_ceph001%ecubone%eos_5560_object36633!head

 _GHOBJTOSEQ_:3%e0s0_head!00120B37!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1235337!head

 _GHOBJTOSEQ_:3%e0s0_head!001210B6!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1661351!head

 _GHOBJTOSEQ_:3%e0s0_head!001210CB!!3!!benchmark_data_ceph001%ecubone%eos_5560_object238126!head

 _GHOBJTOSEQ_:3%e0s0_head!0012184C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object339943!head

 _GHOBJTOSEQ_:3%e0s0_head!00121916!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1047094!head

 _GHOBJTOSEQ_:3%e0s0_head!001219C1!!3!!benchmark_data_ceph001%ecubone%eos_31461_object520642!head

 _GHOBJTOSEQ_:3%e0s0_head!001222BB!!3!!benchmark_data_ceph001%ecubone%eos_5560_object639565!head

 _GHOBJTOSEQ_:3%e0s0_head!001223AA!!3!!benchmark_data_ceph001%ecubone%eos_4919_object231080!head

 _GHOBJTOSEQ_:3%e0s0_head!0012243C!!3!!ben

[ceph-users] osd_heartbeat_grace set to 30 but osd's still fail for grace > 20

2014-08-23 Thread Bruce McFarland
I see OSDs being failed for heartbeat reporting beyond the default 
osd_heartbeat_grace of 20, but the runtime config shows that the grace is set 
to 30. Is there another variable for the OSD or the mon that I need to set for 
the non-default osd_heartbeat_grace of 30 to take effect?

2014-08-23 23:03:08.982590 mon.0 [INF] osd.23 209.243.160.83:6812/31567 failed 
(73 reports from 20 peers after 20.462129 >= grace 20.00)
2014-08-23 23:03:09.058927 mon.0 [INF] osdmap e37965: 30 osds: 29 up, 30 in
2014-08-23 23:03:09.070575 mon.0 [INF] pgmap v82213: 1920 pgs: 62 
stale+active+clean, 1858 active+clean; 0 bytes data, 8193 MB used, 109 TB / 109 
TB avail
2014-08-23 23:03:09.860169 mon.0 [INF] osd.20 209.243.160.83:6806/29554 failed 
(62 reports from 20 peers after 21.339816 >= grace 20.995899)
2014-08-23 23:03:09.860246 mon.0 [INF] osd.26 209.243.160.83:6811/1098 failed 
(66 reports from 20 peers after 21.339380 >= grace 20.995899)
2014-08-23 23:03:09.860307 mon.0 [INF] osd.29 209.243.160.83:6804/3217 failed 
(62 reports from 20 peers after 21.339341 >= grace 20.995899)
2014-08-23 23:03:10.076721 mon.0 [INF] osdmap e37966: 30 osds: 26 up, 30 in
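
My working guess, which I have not verified: the grace in those failure
messages is the value the monitor uses when it evaluates the failure reports,
so the override probably needs to be visible to the mon as well, not just to
the OSDs. A sketch of what I mean for ceph.conf on the monitor host, followed
by a mon restart:

# untested guess: make sure the mon also sees the override
[global]
osd heartbeat grace = 30

or, at runtime, injected into the mon with the same injectargs form (also
unverified):

ceph tell mon.ceph-mon01 injectargs '--osd_heartbeat_grace 30'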


[root@ceph1 ceph]# sh -x ./ceph1-daemon-config.sh grace
+ '[' 1 '!=' 1 ']'
+ for i in '{0..9}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ grep grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ grep grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.5.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ grep grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.8.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{0..9}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.9.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
[root@ceph1 ceph]#

[root@ceph2 ceph]# sh -x ./ceph2-daemon-config.sh grace
+ '[' 1 '!=' 1 ']'
+ for i in '{10..19}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.10.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ grep grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.11.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ grep grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ grep grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.13.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.14.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.15.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ grep grace
+ ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok config show
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ ceph --admin-daemon /var/run/ceph/ceph-osd.17.asok config show
+ grep grace
  "mon_osd_adjust_heartbeat_grace": "true",
  "mds_beacon_grace": "15",
  "osd_heartbeat_grace": "30",
+ for i in '{10..19}'
+ ceph --admin-daemon /var/run/ce