JC, Thanks
I think the max scrub option that you refer to is a value per osd and not per cluster. So, the default is not to run more than 1 scrub per osd. So, if you have 100 osds by default it will not run more than 100 scurb processes at the same time. However, I want to limit the number on a cluster basis rather than on an osd basis. Andrei ----- Original Message ----- > From: "Jean-Charles Lopez" <jelo...@redhat.com> > To: "Andrei Mikhailovsky" <and...@arhont.com> > Cc: ceph-users@lists.ceph.com > Sent: Sunday, 12 April, 2015 5:17:10 PM > Subject: Re: [ceph-users] deep scrubbing causes osd down > Hi andrei > There is one parameter, osd_max_scrub I think, that controls the > number of scrubs per OSD. But the default is 1 if I'm correct. > Can you check on one of your OSDs with the admin socket? > Then it remains the option of scheduling the deep scrubs via a cron > job after setting nodeep-scrub to prevent automatic deep scrubbing. > Dan Van Der Ster had a post on this ML regarding this. > JC > While moving. Excuse unintended typos. > On Apr 12, 2015, at 05:21, Andrei Mikhailovsky < and...@arhont.com > > wrote: > > JC, > > > the restart of the osd servers seems to have stabilised the > > cluster. > > It has been a few hours since the restart and I haven't not seen a > > single osd disconnect. > > > Is there a way to limit the total number of scrub and/or deep-scrub > > processes running at the same time? For instance, I do not want to > > have more than 1 or 2 scrub/deep-scrubs running at the same time on > > my cluster. How do I implement this? > > > Thanks > > > Andrei > > > ----- Original Message ----- > > > > From: "Andrei Mikhailovsky" < and...@arhont.com > > > > > > > To: "LOPEZ Jean-Charles" < jelo...@redhat.com > > > > > > > Cc: ceph-users@lists.ceph.com > > > > > > Sent: Sunday, 12 April, 2015 9:02:05 AM > > > > > > Subject: Re: [ceph-users] deep scrubbing causes osd down > > > > > > JC, > > > > > > I've implemented the following changes to the ceph.conf and > > > restarted > > > mons and osds. > > > > > > osd_scrub_chunk_min = 1 > > > > > > osd_scrub_chunk_max =5 > > > > > > Things have become considerably worse after the changes. Shortly > > > after doing that, majority of osd processes started taking up > > > over > > > 100% cpu and the cluster has considerably slowed down. All my vms > > > are reporting high IO wait (between 30-80%), even vms which are > > > pretty idle and don't do much. > > > > > > i have tried restarting all osds, but shortly after the restart > > > the > > > cpu usage goes up. The osds are showing the following logs: > > > > > > 2015-04-12 08:39:28.853860 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 60.277590 seconds old, received at > > > 2015-04-12 > > > 08:38:28.576168: osd_op(client.69637439.0:290325926 > > > rbd_data.265f967a5f7514.0000000000004a00 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 1249280~4096] 5.cb2620e0 snapc > > > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > > > waiting for missing object > > > > > > 2015-04-12 08:39:28.853863 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 60.246943 seconds old, received at > > > 2015-04-12 > > > 08:38:28.606815: osd_op(client.69637439.0:290325927 > > > rbd_data.265f967a5f7514.0000000000004a00 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 1310720~4096] 5.cb2620e0 snapc > > > ac=[ac] ack+ondisk+write+known_if_redirected e74834) currently > > > waiting for missing object > > > > > > 2015-04-12 08:39:36.855180 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : 7 slow requests, 1 included below; oldest blocked for > > > > 68.278951 secs > > > > > > 2015-04-12 08:39:36.855191 7f96f81dd700 0 log_channel(default) > > > log > > > [WRN] : slow request 30.268450 seconds old, received at > > > 2015-04-12 > > > 08:39:06.586669: osd_op(client.64965167.0:1607510 > > > rbd_data.1f264b2ae8944a.0000000000000228 [set-alloc-hint > > > object_size > > > 4194304 write_size 4194304,write 3584000~69632] 5.30418007 > > > ack+ondisk+write+known_if_redirected e74834) currently waiting > > > for > > > subops from 9 > > > > > > 2015-04-12 08:40:43.570004 7f96dd693700 0 <cls> > > > cls/rgw/cls_rgw.cc:1458: gc_iterate_entries > > > end_key=1_01428824443.569998000 > > > > > > [In total i've got around 40,000 slow request entries accumulated > > > overnight ((( ] > > > > > > On top of that, I have reports of osds going down and back up as > > > frequently as every 10-20 minutes. This effects all osds and not > > > a > > > particular set of osds. > > > > > > I will restart the osd servers to see if it makes a difference, > > > otherwise, I will need to revert back to the default settings as > > > the > > > cluster as it currently is is not functional. > > > > > > Andrei > > > > > > ----- Original Message ----- > > > > > > > From: "LOPEZ Jean-Charles" < jelo...@redhat.com > > > > > > > > > > > To: "Andrei Mikhailovsky" < and...@arhont.com > > > > > > > > > > > Cc: "LOPEZ Jean-Charles" < jelo...@redhat.com >, > > > > ceph-users@lists.ceph.com > > > > > > > > > > Sent: Saturday, 11 April, 2015 7:54:18 PM > > > > > > > > > > Subject: Re: [ceph-users] deep scrubbing causes osd down > > > > > > > > > > Hi Andrei, > > > > > > > > > > 1) what ceph version are you running? > > > > > > > > > > 2) what distro and version are you running? > > > > > > > > > > 3) have you checked the disk elevator for the OSD devices to be > > > > set > > > > to cfq? > > > > > > > > > > 4) Have have you considered exploring the following parameters > > > > to > > > > further tune > > > > > > > > > > - osd_scrub_chunk_min lower the default value of 5. e.g. = 1 > > > > > > > > > > - osd_scrub_chunk_max lower the default value of 25. e.g. = 5 > > > > > > > > > > - osd_deep_scrub_stride If you have lowered parameters above, > > > > you > > > > can > > > > play with this one to fit best your physical disk behaviour. - > > > > osd_scrub_sleep introduce a half second sleep between 2 scrubs; > > > > e.g. > > > > = 0.5 to start with a half second delay > > > > > > > > > > Cheers > > > > > > > > > > JC > > > > > > > > > > > On 10 Apr 2015, at 12:01, Andrei Mikhailovsky < > > > > > and...@arhont.com > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi guys, > > > > > > > > > > > > > > > I was wondering if anyone noticed that the deep scrubbing > > > > > process > > > > > causes some osd to go down? > > > > > > > > > > > > > > > I have been keeping an eye on a few remaining stability > > > > > issues > > > > > in > > > > > my > > > > > test cluster. One of the unsolved issues is the occasional > > > > > reporting > > > > > of osd(s) going down and coming back up after about 20-30 > > > > > seconds. > > > > > This happens to various osds throughout the cluster. I have a > > > > > small > > > > > cluster of just 2 osd servers with 9 osds each. > > > > > > > > > > > > > > > The common trend that i see week after week is that whenever > > > > > there > > > > > is > > > > > a long deep scrubbing activity on the cluster it triggers one > > > > > or > > > > > more osds to go down for a short period of time. After the > > > > > osd > > > > > is > > > > > marked down, it goes back up after about 20 seconds. > > > > > Obviously > > > > > there > > > > > is a repair process that kicks in which causes more load on > > > > > the > > > > > cluster. While looking at the logs, i've not seen the osds > > > > > being > > > > > marked down when the cluster is not deep scrubbing. It > > > > > _always_ > > > > > happens when there is a deep scrub activity. I am seeing the > > > > > reports > > > > > of osds going down about 3-4 times a week. > > > > > > > > > > > > > > > The latest happened just recently with the following log > > > > > entries: > > > > > > > > > > > > > > > 2015-04-10 19:32:48.330430 mon.0 192.168.168.13:6789/0 > > > > > 3441533 > > > > > : > > > > > cluster [INF] pgmap v50849466: 8508 pgs: 8506 active+clean, 2 > > > > > active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, > > > > > 23310 > > > > > GB > > > > > / 50206 GB avail; 1005 B/s rd, 1005 > > > > > > > > > > > > > > > B/s wr, 0 op/s > > > > > > > > > > > > > > > 2015-04-10 19:32:52.950633 mon.0 192.168.168.13:6789/0 > > > > > 3441542 > > > > > : > > > > > cluster [INF] osd.6 192.168.168.200:6816/3738 failed (5 > > > > > reports > > > > > from > > > > > 5 peers after 60.747890 >= grace 46.701350) > > > > > > > > > > > > > > > 2015-04-10 19:32:53.121904 mon.0 192.168.168.13:6789/0 > > > > > 3441544 > > > > > : > > > > > cluster [INF] osdmap e74309: 18 osds: 17 up, 18 in > > > > > > > > > > > > > > > 2015-04-10 19:32:53.231730 mon.0 192.168.168.13:6789/0 > > > > > 3441545 > > > > > : > > > > > cluster [INF] pgmap v50849467: 8508 pgs: 599 > > > > > stale+active+clean, > > > > > 7907 active+clean, 1 stale+active+clean+scrubbing+deep, 1 > > > > > active+clean+scrubbing+deep; 13213 GB data, 26896 GB used, > > > > > 23310 > > > > > GB > > > > > / 50206 GB avail; 375 B/s rd, 0 op/s > > > > > > > > > > > > > > > osd.6 logs around the same time are: > > > > > > > > > > > > > > > 2015-04-10 19:16:29.110617 7fad6d5ec700 0 > > > > > log_channel(default) > > > > > log > > > > > [INF] : 5.3d7 deep-scrub ok > > > > > > > > > > > > > > > 2015-04-10 19:27:47.561389 7fad6bde9700 0 > > > > > log_channel(default) > > > > > log > > > > > [INF] : 5.276 deep-scrub ok > > > > > > > > > > > > > > > 2015-04-10 19:31:11.611321 7fad6d5ec700 0 > > > > > log_channel(default) > > > > > log > > > > > [INF] : 5.287 deep-scrub ok > > > > > > > > > > > > > > > 2015-04-10 19:31:53.339881 7fad7ce0b700 1 heartbeat_map > > > > > is_healthy > > > > > 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 > > > > > > > > > > > > > > > 2015-04-10 19:31:53.339887 7fad7ce0b700 1 heartbeat_map > > > > > is_healthy > > > > > 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 > > > > > > > > > > > > > > > 2015-04-10 19:31:53.339890 7fad7ce0b700 1 heartbeat_map > > > > > is_healthy > > > > > 'OSD::osd_op_tp thread 0x7fad705f2700' had timed out after 15 > > > > > > > > > > > > > > > 2015-04-10 19:31:53.340050 7fad7e60e700 1 heartbeat_map > > > > > is_healthy > > > > > 'OSD::osd_op_tp thread 0x7fad735f8700' had timed out after 15 > > > > > > > > > > > > > > > 2015-04-10 19:31:53.340053 7fad7e60e700 1 heartbeat_map > > > > > is_healthy > > > > > 'OSD::osd_op_tp thread 0x7fad745fa700' had timed out after 15 > > > > > > > > > > > > > > > [.....] > > > > > > > > > > > > > > > 2015-04-10 19:32:53.010609 7fad7e60e700 1 heartbeat_map > > > > > is_healthy > > > > > 'FileStore::op_tp thread 0x7fad86132700' had timed out after > > > > > 60 > > > > > > > > > > > > > > > 2015-04-10 19:32:53.010611 7fad7e60e700 1 heartbeat_map > > > > > is_healthy > > > > > 'FileStore::op_tp thread 0x7fad88937700' had timed out after > > > > > 60 > > > > > > > > > > > > > > > 2015-04-10 19:32:53.111470 7fad66ed2700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6837/4409 > > > > > pipe(0x2b793b80 sd=179 :6817 s=2 pgs=5 cs=1 l=0 > > > > > c=0x21e8b420).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.111496 7fad6329d700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6827/4208 > > > > > pipe(0x2b793600 sd=172 :6817 s=2 pgs=7 cs=1 l=0 > > > > > c=0x1791ab00).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.111463 7fad55bd0700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6822/3910 > > > > > pipe(0x2cb55dc0 sd=262 :6817 s=2 pgs=8 cs=1 l=0 > > > > > c=0xe7802c0).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.121815 7fad6218c700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6807/3575 > > > > > pipe(0x2cf8e080 sd=294 :6817 s=2 pgs=4 cs=1 l=0 > > > > > c=0x138669a0).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.121856 7fad67bdf700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 > > > > > pipe(0x2b792580 sd=190 :6817 s=2 pgs=9 cs=1 l=0 > > > > > c=0x138922c0).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.123545 7fad651bc700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6801/3053 > > > > > pipe(0x15e538c0 sd=260 :6817 s=2 pgs=1 cs=1 l=0 > > > > > c=0x16bf09a0).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.128729 7fad53eb3700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6832/4257 > > > > > pipe(0x37dcb80 sd=311 :6817 s=2 pgs=3 cs=1 l=0 > > > > > c=0x1131f420).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.132691 7fad53fb4700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6812/3674 > > > > > pipe(0x37dbdc0 sd=417 :6817 s=2 pgs=3 cs=1 l=0 > > > > > c=0x1c0badc0).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > 2015-04-10 19:32:53.139856 7fad65cc7700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 > > > > > pipe(0x2cf8eb80 sd=365 :6817 s=2 pgs=5 cs=1 l=0 > > > > > c=0xbd1f2c0).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > [ followed by about 500 or so lines which look similar to:] > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021252 7fad65cc7700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 > > > > > pipe(0x2b792580 sd=24 :45523 s=2 pgs=1721 cs=3371 l=0 > > > > > c=0x138922c0).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021280 7fad53eb3700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6807/3575 > > > > > pipe(0x2cf8e080 sd=37 :51859 s=2 pgs=1755 cs=3449 l=0 > > > > > c=0x138669a0).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021443 7fad66ed2700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6812/3674 > > > > > pipe(0x37dbdc0 sd=18 :46453 s=2 pgs=1782 cs=3489 l=0 > > > > > c=0x1c0badc0).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021702 7fad6218c700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6837/4409 > > > > > pipe(0x2b793b80 sd=38 :53724 s=2 pgs=1757 cs=3449 l=0 > > > > > c=0x21e8b420).fault, initiating reconnect > > > > > > > > > > > > > > > [followed by:] > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021830 7fad8e142700 0 > > > > > log_channel(default) > > > > > log > > > > > [WRN] : 6 slow requests, 6 included below; oldest blocked for > > > > > > > > > > > 94.284115 secs > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021834 7fad67bdf700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 > > > > > pipe(0x2cf8eb80 sd=29 :50115 s=2 pgs=1717 cs=3395 l=0 > > > > > c=0xbd1f2c0).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021840 7fad8e142700 0 > > > > > log_channel(default) > > > > > log > > > > > [WRN] : slow request 62.902554 seconds old, received at > > > > > 2015-04-10 > > > > > 19:32:06.119164: osd_op(client.64965167.0:1562644 > > > > > rbd_data.1f264b2ae8944a.0000000000000400 [set-alloc-hint > > > > > object_size > > > > > 4194304 write_size 4194304,write 1155072~4096] 5.f4cd34bb > > > > > ack+ondisk+write+known_if_redirected e74308) currently > > > > > waiting > > > > > for > > > > > subops from 13 > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021846 7fad8e142700 0 > > > > > log_channel(default) > > > > > log > > > > > [WRN] : slow request 63.004012 seconds old, received at > > > > > 2015-04-10 > > > > > 19:32:06.017706: osd_op(client.69637439.0:278581863 > > > > > rbd_data.265f967a5f7514.0000000000064e2e [set-alloc-hint > > > > > object_size > > > > > 4194304 write_size 4194304,write 1040384~4096] 5.bf3d2652 > > > > > snapc > > > > > ac=[ac] ack+ondisk+write+known_if_redirected e74308) > > > > > currently > > > > > waiting for subops from 11 > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021851 7fad8e142700 0 > > > > > log_channel(default) > > > > > log > > > > > [WRN] : slow request 94.284115 seconds old, received at > > > > > 2015-04-10 > > > > > 19:31:34.737603: osd_sub_op(unknown.0.0:0 5.299 0//0//-1 > > > > > [scrub-map] > > > > > v 0'0 snapset=0=[]:[] snapc=0=[]) currently started > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021856 7fad8e142700 0 > > > > > log_channel(default) > > > > > log > > > > > [WRN] : slow request 92.669022 seconds old, received at > > > > > 2015-04-10 > > > > > 19:31:36.352696: osd_op(client.69645050.0:2750479 > > > > > rbd_data.1b704a8f51a837.000000000000023d [stat,set-alloc-hint > > > > > object_size 4194304 write_size 4194304,write 1449984~4096] > > > > > 5.44a257d7 snapc > > > > > 9cc=[9cc,9c1,9b5,9a9,99d,991,984,972,953,2f8] > > > > > ack+ondisk+write+known_if_redirected e74308) currently > > > > > waiting > > > > > for > > > > > subops from 17 > > > > > > > > > > > > > > > 2015-04-10 19:33:09.021861 7fad8e142700 0 > > > > > log_channel(default) > > > > > log > > > > > [WRN] : slow request 76.733358 seconds old, received at > > > > > 2015-04-10 > > > > > 19:31:52.288360: osd_op(client.64987185.0:26635606 > > > > > rbd_data.1b652a2ae8944a.0000000000000623 [set-alloc-hint > > > > > object_size > > > > > 4194304 write_size 4194304,write 1282048~4096] 5.64d42a36 > > > > > ack+ondisk+write+known_if_redirected e74308) currently > > > > > waiting > > > > > for > > > > > subops from 17 > > > > > > > > > > > > > > > [and finally] > > > > > > > > > > > > > > > 2015-04-10 19:33:09.024361 7fad5c83b700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6832/4257 > > > > > pipe(0x37dcb80 sd=35 :55060 s=2 pgs=1719 cs=3359 l=0 > > > > > c=0x1131f420).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.024418 7fad82616700 0 > > > > > log_channel(default) > > > > > log > > > > > [WRN] : map e74309 wrongly marked me down > > > > > > > > > > > > > > > 2015-04-10 19:33:09.024466 7fad53fb4700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6827/4208 > > > > > pipe(0x2b793600 sd=33 :36954 s=2 pgs=1731 cs=3403 l=0 > > > > > c=0x1791ab00).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.024476 7fad67bdf700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 > > > > > pipe(0x2cf8eb80 sd=24 :50133 s=2 pgs=1720 cs=3401 l=0 > > > > > c=0xbd1f2c0).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.024553 7fad65cc7700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 > > > > > pipe(0x2b792580 sd=29 :45548 s=2 pgs=1725 cs=3379 l=0 > > > > > c=0x138922c0).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.024629 7fad53eb3700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6807/3575 > > > > > pipe(0x2cf8e080 sd=37 :51884 s=2 pgs=1759 cs=3457 l=0 > > > > > c=0x138669a0).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.024783 7fad650bb700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6842/4442 > > > > > pipe(0x2b792580 sd=29 :45548 s=1 pgs=1725 cs=3380 l=0 > > > > > c=0x138922c0).fault > > > > > > > > > > > > > > > 2015-04-10 19:33:09.025056 7fad6218c700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6837/4409 > > > > > pipe(0x2b793b80 sd=38 :53747 s=2 pgs=1761 cs=3457 l=0 > > > > > c=0x21e8b420).fault, initiating reconnect > > > > > > > > > > > > > > > 2015-04-10 19:33:09.025062 7fad61580700 0 -- > > > > > 192.168.168.200:6817/3738 >> 192.168.168.201:6817/3864 > > > > > pipe(0x2cf8eb80 sd=24 :50140 s=4 pgs=1720 cs=3402 l=0 > > > > > c=0xbd1f2c0).connect got RESETSESSION but no longer > > > > > connecting > > > > > > > > > > > > > > > 2015-04-10 19:33:16.999741 7fad5f964700 0 -- > > > > > 192.168.168.200:6846/1003738 >> 192.168.168.200:6837/4266 > > > > > pipe(0xa711580 sd=257 :6846 s=0 pgs=0 cs=0 l=0 > > > > > c=0x1afbf840).accept > > > > > connect_seq 0 vs existing 0 state connecting > > > > > > > > > > > > > > > 2015-04-10 19:33:16.999883 7fad5f863700 0 -- > > > > > 192.168.168.200:6846/1003738 >> 192.168.168.200:6827/3938 > > > > > pipe(0x2c48a8c0 sd=260 :6846 s=0 pgs=0 cs=0 l=0 > > > > > c=0x18a85580).accept > > > > > connect_seq 0 vs existing 0 state connecting > > > > > > > > > > > > > > > 2015-04-10 19:36:26.247402 7fad6bde9700 0 > > > > > log_channel(default) > > > > > log > > > > > [INF] : 5.299 deep-scrub ok > > > > > > > > > > > > > > > 2015-04-10 19:40:18.577423 7fad725f6700 0 <cls> > > > > > cls/rgw/cls_rgw.cc:1458: gc_iterate_entries > > > > > end_key=1_01428691218.577419000 > > > > > > > > > > > > > > > 2015-04-10 19:42:22.709291 7fad6edef700 0 > > > > > log_channel(default) > > > > > log > > > > > [INF] : 5.2ae deep-scrub ok > > > > > > > > > > > > > > > 2015-04-10 19:45:51.790224 7fad6bde9700 0 > > > > > log_channel(default) > > > > > log > > > > > [INF] : 5.2bb deep-scrub ok > > > > > > > > > > > > > > > 2015-04-10 19:48:17.022413 7fad5f863700 0 -- > > > > > 192.168.168.200:6846/1003738 >> 192.168.168.200:6827/3938 > > > > > pipe(0x2c48a8c0 sd=260 :6846 s=2 pgs=63 cs=1 l=0 > > > > > c=0x3f81160).fault > > > > > with nothing to send, going to standby > > > > > > > > > > > > > > > In the past I have been asked to verify that the networking > > > > > is > > > > > working okay. I've ran days and days of various tests and > > > > > verified > > > > > that the network is not the cause of the problems. The > > > > > network > > > > > is > > > > > running perfectly well with no errors reported on the > > > > > interfaces > > > > > nor > > > > > any delays. I ran multitude of tcp and udp tests including > > > > > firing > > > > > hundreds to thousands connections per second for weeks with > > > > > 0% > > > > > packet loss or any other network related errors. > > > > > > > > > > > > > > > Does anyone have any idea what could be causing the osds to > > > > > be > > > > > marked > > > > > down during the deep scrub activity? > > > > > > > > > > > > > > > I have enabled the following tuning options in the ceph.conf: > > > > > > > > > > > > > > > #change to decrease rebuild/recovery priority (should make > > > > > guest > > > > > vms > > > > > more responsive) and decrease scrubbing priority > > > > > > > > > > > > > > > osd recovery max chunk = 8388608 > > > > > > > > > > > > > > > osd recovery op priority = 1 > > > > > > > > > > > > > > > osd max backfills = 1 > > > > > > > > > > > > > > > osd recovery max active = 1 > > > > > > > > > > > > > > > osd recovery threads = 1 > > > > > > > > > > > > > > > osd_disk_thread_ioprio_priority = 7 > > > > > > > > > > > > > > > osd_disk_thread_ioprio_class = idle > > > > > > > > > > > > > > > #reporting of down osds > > > > > > > > > > > > > > > mon osd min down reporters = 6 > > > > > > > > > > > > > > > mon osd report timeout = 1800 > > > > > > > > > > > > > > > mon osd min down reports = 7 > > > > > > > > > > > > > > > osd heartbeat grace = 60 > > > > > > > > > > > > > > > Many thanks for your help > > > > > > > > > > > > > > > Andrei > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > > ceph-users mailing list > > > > > > > > > > > > > > > ceph-users@lists.ceph.com > > > > > > > > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > > > > > > _______________________________________________ > > > > > > ceph-users mailing list > > > > > > ceph-users@lists.ceph.com > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > >
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com