Hi, I can see ~17% hardware interrupts, which I find a little high - can you check /proc/interrupts and make sure the interrupt load is spread over all your cores?
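
Something like this quick, untested sketch (it just sums the per-CPU columns of /proc/interrupts and assumes the usual layout) should show you roughly how the interrupts are spread:

#!/usr/bin/env python
# Sum the per-CPU interrupt counts in /proc/interrupts to see whether
# most IRQs land on a single core.
with open("/proc/interrupts") as f:
    cpus = f.readline().split()            # header line: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in f:
        fields = line.split()[1:]          # drop the "NN:" / "NMI:" label
        for i in range(min(len(cpus), len(fields))):
            if fields[i].isdigit():
                totals[i] += int(fields[i])
grand = float(sum(totals)) or 1.0
for cpu, count in zip(cpus, totals):
    print("%s: %12d (%.1f%%)" % (cpu, count, 100.0 * count / grand))

If one core is taking most of them, irqbalance or pinning the NIC/HBA IRQs by hand usually helps.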
What about disk util once you restart them? Are they all 100% utilized, or is it 'only' mostly CPU-bound? (There's a quick sketch for checking this at the bottom of this mail.) Also, you're running a monitor on this node - how does the load on the nodes where you run a monitor compare to those where you don't?

Cheers,
Martin

On Thu, Mar 20, 2014 at 10:18 AM, Quenten Grasso <qgra...@onq.com.au> wrote:
> Hi All,
>
> I left out my OS/kernel version: Ubuntu 12.04.4 LTS with kernel
> 3.10.33-031033-generic (we upgrade our kernels to 3.10 due to Dell drivers).
>
> Here's an example of starting all the OSDs after a reboot:
>
> top - 09:10:51 up 2 min, 1 user, load average: 332.93, 112.28, 39.96
> Tasks: 310 total, 1 running, 309 sleeping, 0 stopped, 0 zombie
> Cpu(s): 50.3%us, 32.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 17.2%hi, 0.0%si, 0.0%st
> Mem: 32917276k total, 6331224k used, 26586052k free, 1332k buffers
> Swap: 33496060k total, 0k used, 33496060k free, 1474084k cached
>
>   PID USER PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
> 15875 root 20  0  910m 381m 50m S   60  1.2 0:50.57 ceph-osd
>  2996 root 20  0  867m 330m 44m S   59  1.0 0:58.32 ceph-osd
>  4502 root 20  0  907m 372m 47m S   58  1.2 0:55.14 ceph-osd
> 12465 root 20  0  949m 418m 55m S   58  1.3 0:51.79 ceph-osd
>  4171 root 20  0  886m 348m 45m S   57  1.1 0:56.17 ceph-osd
>  3707 root 20  0  941m 405m 50m S   57  1.3 0:59.68 ceph-osd
>  3560 root 20  0  924m 394m 51m S   56  1.2 0:59.37 ceph-osd
>  4318 root 20  0  965m 435m 55m S   56  1.4 0:54.80 ceph-osd
>  3337 root 20  0  935m 407m 51m S   56  1.3 1:01.96 ceph-osd
>  3854 root 20  0  897m 366m 48m S   55  1.1 1:00.55 ceph-osd
>  3143 root 20  0 1364m 424m 24m S   16  1.3 1:08.72 ceph-osd
>  2509 root 20  0  652m 261m 62m S    2  0.8 0:26.42 ceph-mon
>     4 root 20  0     0    0   0 S    0  0.0 0:00.08 kworker/0:0
>
> Regards,
> Quenten Grasso
>
> From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso
> Sent: Tuesday, 18 March 2014 10:19 PM
> To: 'ceph-users@lists.ceph.com'
> Subject: [ceph-users] OSD Restarts cause excessively high load average and "requests are blocked > 32 sec"
>
> Hi All,
>
> I'm trying to troubleshoot a strange issue with my Ceph cluster.
>
> We're running Ceph version 0.72.2. All nodes are Dell R515s with a 6-core AMD CPU, 32 GB RAM, 12 x 3 TB Nearline SAS drives and 2 x 100 GB Intel DC S3700 SSDs for journals. All pools have a replica count of 2 or better, e.g. metadata has a replica count of 3.
>
> I have 55 OSDs in the cluster across 5 nodes. When I restart the OSDs on a single node (any node), the load average of that node shoots up to 230+ and the whole cluster starts blocking IO requests until it settles down, after which it's fine again.
>
> Any ideas on why the load average goes so crazy and starts to block IO?
>
> <snips from my ceph.conf>
> [osd]
> osd data = /var/ceph/osd.$id
> osd journal size = 15000
> osd mkfs type = xfs
> osd mkfs options xfs = "-i size=2048 -f"
> osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k"
> osd max backfills = 5
> osd recovery max active = 3
>
> [osd.0]
> host = pbnerbd01
> public addr = 10.100.96.10
> cluster addr = 10.100.128.10
> osd journal = /dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1
> devs = /dev/sda4
> </end>
>
> Thanks,
> Quenten
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
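
PS: for the disk-util question above, if you don't have iostat/sysstat handy, here is an untested sketch that gets roughly the same %util figure as iostat -x by sampling /proc/diskstats twice (it assumes your data disks show up as whole sdX devices):

#!/usr/bin/env python
# Rough per-disk %util: delta of the "time spent doing I/O (ms)" column
# of /proc/diskstats over a sampling interval, same idea as iostat -x.
import time

def io_ms():
    busy = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            dev = fields[2]
            # whole disks only (sda, sdb, ...); skip partitions like sda1
            if dev.startswith("sd") and not dev[-1].isdigit():
                busy[dev] = int(fields[12])   # 10th stats field: ms doing I/O
    return busy

interval = 5.0
before = io_ms()
time.sleep(interval)
after = io_ms()
for dev in sorted(after):
    util = (after[dev] - before.get(dev, 0)) / (interval * 10.0)  # ms -> %
    print("%-6s %5.1f%% util" % (dev, util))

If the spinners all sit near 100% during the restart, it is the disks; if they don't, the CPU/interrupt side is the more likely bottleneck.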
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com