Hi,
It happened again, and worse, this time the system crashed. We could not
even connect to it over ssh.
I used the sar command to capture statistics about it. Here are the
details:
[1] CPU (sar -u). We had to restart the system, as the LINUX RESTART line
in the logs shows.
--------------------------------------------------------------------------------------------------
03:00:01 PM     all      7.61      0.00      0.92      0.07      0.00     91.40
03:10:01 PM     all      7.71      0.00      1.29      0.06      0.00     90.94
03:20:01 PM     all      7.62      0.00      1.98      0.06      0.00     90.34
03:30:35 PM     all      5.65      0.00     31.08      0.04      0.00     63.23
03:42:40 PM     all     47.58      0.00     52.25      0.00      0.00      0.16
Average:        all      8.21      0.00      1.57      0.05      0.00     90.17
04:42:04 PM       LINUX RESTART
04:50:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
05:00:01 PM     all      3.49      0.00      0.62      0.15      0.00     95.75
05:10:01 PM     all      9.03      0.00      0.92      0.28      0.00     89.77
05:20:01 PM     all      7.06      0.00      0.78      0.05      0.00     92.11
05:30:01 PM     all      6.67      0.00      0.79      0.06      0.00     92.48
05:40:01 PM     all      6.26      0.00      0.76      0.05      0.00     92.93
05:50:01 PM     all      5.49      0.00      0.71      0.05      0.00     93.75
--------------------------------------------------------------------------------------------------
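To pull the suspicious intervals out of a longer sar -u log, a small filter along these lines might help. It is only a sketch: the function name flag_high_sys is made up for illustration, the 30% default mirrors the threshold in the capture script quoted below, and the field positions assume the 12-hour timestamp format shown above (time, AM/PM, CPU, %user, %nice, %system, %iowait, %steal, %idle).

```shell
# flag_high_sys [limit]: print the sar -u rows whose %system exceeds the limit.
# Assumption: 12-hour timestamps, so %system is field 6 and %idle is field 9.
# Header, "Average:" and "LINUX RESTART" lines are skipped because their
# third field is not "all".
flag_high_sys() {
    awk -v limit="${1:-30}" '
        $3 == "all" && $6 + 0 > limit {
            print $1, $2, "%system=" $6, "%idle=" $9
        }
    '
}
```

It would be used as `sar -u | flag_high_sys 30`. On the table above it should pick out only the 03:30:35 and 03:42:40 rows, the two samples taken before the restart.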
[2] Memory (sar -r)
--------------------------------------------------------------------------------------------------
03:00:01 PM   1519272 196633272     99.23    361112  76364340 143574212     47.77
03:10:01 PM   1451764 196700780     99.27    361196  76336340 143581608     47.77
03:20:01 PM   1453400 196699144     99.27    361448  76248584 143551128     47.76
03:30:35 PM   1513844 196638700     99.24    361648  76022016 143828244     47.85
03:42:40 PM   1481108 196671436     99.25    361676  75718320 144478784     48.07
Average:      5051607 193100937     97.45    362421  81775777 142758861     47.50
04:42:04 PM       LINUX RESTART
04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
05:00:01 PM 154357132  43795412     22.10     92012  18648644 134950460     44.90
05:10:01 PM 136468244  61684300     31.13    219572  31709216 134966548     44.91
05:20:01 PM 135092452  63060092     31.82    221488  32162324 134949788     44.90
05:30:01 PM 133410464  64742080     32.67    233848  32793848 134976828     44.91
05:40:01 PM 132022052  66130492     33.37    235812  33278908 135007268     44.92
05:50:01 PM 130630408  67522136     34.08    237140  33900912 135099764     44.95
Average:    136996792  61155752     30.86    206645  30415642 134991776     44.91
--------------------------------------------------------------------------------------------------
As the rows from 03:30:35 onward show, the machine started to fail at
03:30:35. It hung until I restarted it manually at 04:42:04.
All of the above is only a snapshot of the performance at the time of the
crash; nothing in it reveals the cause. I have also checked
/var/log/messages and found nothing useful.
Note that I also ran sar -v, and it shows something abnormal:
------------------------------------------------------------------------------------------------
02:50:01 PM 11542262 9216 76446 258
03:00:01 PM 11645526 9536 76421 258
03:10:01 PM 11748690 9216 76451 258
03:20:01 PM 11850191 9152 76331 258
03:30:35 PM 11972313 10112 132625 258
03:42:40 PM 12177319 13760 340227 258
Average: 8293601 8950 68187 161
04:42:04 PM LINUX RESTART
04:50:01 PM dentunusd file-nr inode-nr pty-nr
05:00:01 PM 35410 7616 35223 4
05:10:01 PM 137320 7296 42632 6
05:20:01 PM 247010 7296 42839 9
05:30:01 PM 358434 7360 42697 9
05:40:01 PM 471543 7040 42929 10
05:50:01 PM 583787 7296 42837 13
------------------------------------------------------------------------------------------------
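To make the abnormal rows easier to spot in a longer sar -v log, something like the following could flag sudden growth in inode-nr between consecutive samples. This is a rough sketch, not a diagnosis: the function name flag_inode_jumps and the 1.5x default factor are arbitrary choices for illustration, and the field positions assume the 12-hour timestamp format shown above (time, AM/PM, dentunusd, file-nr, inode-nr, pty-nr).

```shell
# flag_inode_jumps [factor]: print samples where inode-nr grew by more than
# `factor` times the previous sample. Assumption: inode-nr is field 5.
# Header and "Average:" lines are skipped; a restart resets the baseline.
flag_inode_jumps() {
    awk -v factor="${1:-1.5}" '
        /RESTART/ { prev = 0; next }
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $5 ~ /^[0-9]+$/ {
            cur = $5 + 0
            if (prev > 0 && cur > prev * factor)
                print $1, $2, "inode-nr", prev, "->", cur
            prev = cur
        }
    '
}
```

It would be used as `sar -v | flag_inode_jumps`. On the table above it should report the 03:30:35 and 03:42:40 samples, where inode-nr went from 76331 to 132625 and then to 340227.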
I checked the man page for the -v option:
------------------------------------------------------------------------------------------------
*-v* Report status of inode, file and other kernel tables. The following
values are displayed:
*dentunusd*
Number of unused cache entries in the directory cache.
*file-nr*
Number of file handles used by the system.
*inode-nr*
Number of inode handlers used by the system.
*pty-nr*
Number of pseudo-terminals used by the system.
------------------------------------------------------------------------------------------------
Is there any clue here about the crash? Could you please give me some
suggestions?
Best Regards.
2016-03-16 14:01 GMT+08:00 YouPeng Yang <[email protected]>:
> Hello
> The problem has appeared several times, but I could not capture the top
> output. My script is below.
> It checks whether the sys CPU usage exceeds 30%. The other metrics are
> dumped successfully, but not the top output.
> Could you check my script? I am not able to figure out what is wrong.
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> #!/bin/bash
>
> while :
> do
>     # Print 1 while %sys (field 6 of mpstat's 12-hour output) is below 30, else 0.
>     sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}')
>
>     if [ "$sysusage" -eq 0 ]; then
>         #echo $sysusage
>         #perf record -o perf$(date +%Y%m%d%H%M%S).data -a -g -F 1000
>         sleep 30
>         file=$(date +%Y%m%d%H%M%S)
>         # -b (batch mode) lets top write to a file when no terminal is attached.
>         top -b -n 2 >> top$file.data
>         iotop -b -n 2 >> iotop$file.data
>         iostat >> iostat$file.data
>         netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
>     fi
>     sleep 5
> done
>
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 2016-03-08 21:39 GMT+08:00 YouPeng Yang <[email protected]>:
>
>> Hi all
>> Thanks for your replies. I have spent a lot of time investigating, and I
>> will post some logs from 'top' and the IO statistics in a few days, when
>> the crash happens again.
>>
>> 2016-03-08 10:45 GMT+08:00 Shawn Heisey <[email protected]>:
>>
>>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
>>> > How does this relate to YouPeng reporting that the CPU usage increases?
>>> >
>>> > This is not a snark. YouPeng mentions kernel issues. It might very well
>>> > be that IO is the real problem, but that it manifests in a
>>> > non-intuitive way. Before memory-mapping it was easy: Just look at
>>> > IO-Wait. Now I am not so sure. Can high kernel load (Sy% in *nix top)
>>> > indicate that the IO system is struggling, even if IO-Wait is low?
>>>
>>> It might turn out to be not directly related to memory, you're right
>>> about that. A very high query rate or particularly CPU-heavy queries or
>>> analysis could cause high CPU usage even when memory is plentiful, but
>>> in that situation I would expect high user percentage, not kernel. I'm
>>> not completely sure what might cause high kernel usage if iowait is low,
>>> but no specific information was given about iowait. I've seen iowait
>>> percentages of 10% or less with problems clearly caused by iowait.
>>>
>>> With the available information (especially seeing 700GB of index data),
>>> I believe that the "not enough memory" scenario is more likely than
>>> anything else. If the OP replies and says they have plenty of memory,
>>> then we can move on to the less common (IMHO) reasons for high CPU with
>>> a large index.
>>>
>>> If the OS is one that reports load average, I am curious what the 5
>>> minute average is, and how many real (non-HT) CPU cores there are.
>>>
>>> Thanks,
>>> Shawn
>>>
>>>
>>
>