Hi,

I'm trying to track down the source of a performance problem with an application from a firewall vendor. The component that is suffering is its log handling daemon. The log daemon is single threaded: it receives logs from firewalls and also services various consumers, which can be simple log-viewing CLI or GUI clients, or log reporting engines running on other servers.
The performance issue is seen by both the producers and the consumers. Producers complain that the log server is not responding and are therefore unable to send logs. The consumers don't complain (but the users do!) but are very slow when reading the logs, filtering on certain fields, etc. I'm at a bit of a loss on where to go from here. Below is the analysis I've done so far. I'm looking for advice on whether this can be solved by throwing more resources at it or whether it is being limited by a poorly written application. Please let me know if you'd like to see more output from dtrace or other commands.

Note: one thing I have seen is that if I stop the reporting engine running on a different server, the number of cross-calls drops significantly.

The log server specs are: 4 core T5120 with 16GB RAM and 2TB of SAN storage (mirrored ZFS pool).

$ uname -a
SunOS logserver 5.10 Generic_137111-01 sun4v sparc SUNW,SPARC-Enterprise-T5120

Here is what I've looked at so far:

1. vmstat - seems ok to me.

# vmstat 3
 kthr      memory            page            disk          faults      cpu
 r b w   swap     free   re  mf pi po fr de sr m0 m1 m2 m1   in    sy   cs us sy id
 0 0 0 15422904 700912  133  15  6  0  0  0  0  0  0  0  1 1618 15569 1774  3  3 94
 0 0 0 15457416 730808   54  26 13  0  0  0  0  0  0  0  0 1679 18773 1913  3  3 94
 0 0 0 15455944 729320  148   8  0  0  0  0  0  0  0  0  0 1073 19944 1000  4  3 93
 0 0 0 15454744 728632  223   3  0  0  0  0  0  0  0  0  0 1864 22512 2104  4  3 93
 0 0 0 15454232 729328  160   4  0  0  0  0  0  0  0  0  0  994 18935  997  3  3 94
 0 0 0 15453400 730080  163   4  0  0  0  0  0  0  0  0  0 1777 18536 2088  3  3 94
 0 0 0 15452920 730040  185   3  0  0  0  0  0  0  0  0  0 1950 21576 2198  4  3 93
 0 0 0 15452416 730456  146   0  0  0  0  0  0  0  0  0  0 1033 20987  958  3  3 94

2. iostat - just looking at the fibre channel HBA throughput. Not really pushing the SAN to its limits.

# iostat -xznC 1 | egrep c.$
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.5    1.0     5.7    14.1  0.0  0.0    0.0   15.5   0   1 c1
    1.8   87.9   187.7  3028.6  0.0  1.5    0.2   16.5   0  26 c2
    1.9   88.1   194.4  3066.4  0.0  1.5    0.2   16.5   0  26 c3
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  521.4     0.0 15009.5  0.0  1.5    0.0    3.0   0  78 c2
    0.0  518.4     0.0 15647.2  0.0  1.5    0.0    2.9   0  78 c3
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    5.0    3.0     0.4     0.0  0.0  0.0    0.0    0.0   0   0 c2
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    4.0     0.0     2.0  0.0  0.0    0.0    4.9   0   2 c1
    0.0  387.4     0.0 12853.2  0.0  1.0    0.0    2.6   0  59 c2
    0.0  414.5     0.0 13446.9  0.0  1.0    0.0    2.5   0  57 c3

Running a `dd` of a 2GB log file shows how fast we can push the SAN, which peaks at 222 MBytes/s!
# time dd if=2009-06-28_224348_21.log of=/dev/null bs=128k
16368+1 records in
16368+1 records out

real    0m9.298s
user    0m0.094s
sys     0m4.138s

while dd is happening:

# iostat -xznC 1 | egrep c.$
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    0.5    1.0      5.7     14.0  0.0  0.0    0.0   15.5   0    1 c1
    1.9   87.9    189.6   3028.4  0.0  1.5    0.2   16.5   0   26 c2
    1.9   88.1    196.2   3066.3  0.0  1.5    0.2   16.4   0   26 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    2.1    0.0    263.5      0.0  0.0  0.0    0.0    8.4   0    2 c2
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    5.1    0.0    651.9      0.0  0.0  0.1    0.0   18.1   0    9 c2
    5.1    0.0    651.9      0.0  0.0  0.3    0.0   64.2   0   33 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    5.0    3.0      0.4      0.0  0.0  0.0    0.0    0.0   0    0 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    1.0    0.0      4.9      0.0  0.0  0.0    0.0    0.2   0    0 c1
  386.8    0.0  49513.3      0.0  0.0 20.0    0.0   51.8   0  678 c2
  392.7    0.0  50269.1      0.0  0.0 20.0    0.0   51.0   0  667 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
  869.9    3.0 110697.5      0.0  0.0 49.6    0.0   56.8   0 1389 c2
  843.6    0.0 107982.0      0.0  0.0 49.6    0.0   58.8   0 1356 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
  737.5  380.6  94397.2  16839.9  0.0 53.3    0.0   47.7   0 1347 c2
  742.4  362.8  95025.5  15487.8  0.0 53.7    0.0   48.6   0 1366 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
  884.6    0.0 113231.1      0.0  0.0 53.5    0.0   60.5   0 1547 c2
  894.7    0.0 114516.7      0.0  0.0 53.7    0.0   60.0   0 1528 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
  824.0    0.0 105467.2      0.0  0.0 50.3    0.0   61.1   0 1425 c2
  831.9    0.0 106478.8      0.0  0.0 50.1    0.0   60.3   0 1439 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
  837.2    0.0 107162.6      0.0  0.0 51.8    0.0   61.9   0 1472 c2
  838.3    0.0 107299.9      0.0  0.0 51.5    0.0   61.4   0 1468 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
  844.7    0.0 108124.9      0.0  0.0 51.9    0.0   61.4   0 1468 c2
  842.7    0.0 107865.9      0.0  0.0 51.3    0.0   60.9   0 1464 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
  866.4  154.9 110896.3  12598.9  0.0 60.8    0.0   59.6   0 1551 c2
  857.4  390.7 109745.6  17811.3  0.0 59.2    0.0   47.5   0 1544 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    0.0   51.1      0.0     39.6  0.0  0.9    0.0   18.3   0   30 c1
  830.5    0.0 106298.7      0.0  0.0 55.7    0.0   67.1   0 1511 c2
  842.5    0.0 107839.9      0.0  0.0 54.9    0.0   65.2   0 1530 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    0.0   13.0      0.0      8.5  0.0  0.1    0.0    9.5   0    8 c1
  355.1    0.0  45454.7      0.0  0.0 13.3    0.0   37.4   0  565 c2
  318.0    0.0  40705.3      0.0  0.0 13.3    0.0   41.8   0  572 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    1.0    0.0    128.3      0.0  0.0  0.0    0.0    6.2   0    1 c2
    5.0    3.0      0.4      0.0  0.0  0.0    0.0    0.0   0    0 c3
    r/s    w/s     kr/s     kw/s wait actv wsvc_t asvc_t  %w   %b device
    0.0    2.0      0.0      1.0  0.0  0.0    0.0    1.6   0    0 c1
    0.0  482.1      0.0  14731.3  0.0  1.2    0.0    2.4   0   64 c2
    0.0  428.1      0.0  15401.4  0.0  1.1    0.0    2.6   0   61 c3

I think I can safely eliminate the SAN as the bottleneck.
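One caveat I'm aware of: dd measures a single large sequential read, while the log daemon presumably issues lots of small reads and writes with plenty of seeking, so raw streaming throughput may not tell the whole story. To get a feel for the request sizes the daemon actually issues, the next thing I'm thinking of running is something like this (an untested sketch; "fwd" is the log daemon's process name, which shows up in the dtrace and pfiles output further down):

# dtrace -n '
    /* distribution of read()/write() request sizes issued by fwd; arg2 is the byte count */
    syscall::read:entry  /execname == "fwd"/ { @["read bytes"]  = quantize(arg2); }
    syscall::write:entry /execname == "fwd"/ { @["write bytes"] = quantize(arg2); }'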
3. mpstat - the column that stands out to me is xcal. I've seen it peak up to 1.5 million xcalls on a single CPU!

# mpstat 2
[snip]
CPU minf mjf    xcal  intr ithr  csw icsw migr smtx  srw  syscl  usr sys  wt idl
  0    0   0   96318   337  150  194    3   50   40    7    640    7  12   0  81
  1  801   0     338    87    0  144    4   33  171    0    889   61  31   0   8
  2    0   0       7    36    0   45    0   16   28    1    543    3   3   0  94
  3    0   0       1    15    1   12    0    6    3    0      0    0   0   0 100
  4    0   0   50458   305  253  123    9   23   29    4   3513   27  13   0  60
  5    0   0     223   215  176   89    2   18   70    0    109    7   1   0  91
  6    0   0   10920     9    0   17    0    1    1    1     38    0   1   0  99
  7    0   0     195   168  163   10    0    0   38    0      0    0   1   0  99
 16    1   0     172   222    0  465    9   80   84    9   2324   22   4   0  73
 17    0   0      24    83    1  157    5   16   35    0    870   15   1   0  84
 18    0   0      23    52    0  108    0    3   28    0      0    0   0   0 100
 19    0   0       1    35    0   69    0    0    0    0      0    0   0   0 100
 20    9   0      31     6    0    0    5    0    1    0     17  100   0   0   0
 21    0   0      10    26    0   49    0   10   27    0     60    0   0   0 100
 22    0   0       0     2    1    0    0    0    0    0      0    0   0   0 100
 23    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 24   83   0  592427    26    0   42    7   15   58    1   1922   32  46   0  23
 25    1   0     212    50    0   98    1   23   21    3    949    5   2   0  93
 26    0   0       1     8    0   16    0    7    8    0      2    0   0   0 100
 27    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 28    0   0     117    53    0  105    5   17   24    7   1923   41   6   0  53
 29    0   0       3    24    0   47    0   10    9    1    497    2   1   0  97
 30    0   0       0     1    0    1    0    0    0    0      0    0   0   0 100
 31    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 32    1   0    2254   119    0  251    5   57  102    6   2284   33   8   0  59
 33    4   0     334    79    0  163    7   20   37    6   2007   13   6   0  81
 34    0   0       3     5    0    9    0    2    2    0      0    0   0   0 100
 35    0   0       3     9    0   17    0    0    0    0      0    0   0   0 100
 36    0   0  259875   110    0  239   11   33  102    9   2727   10  33   0  57
 37    0   0       4    13    0   24    0    5    3    0    282    5   0   0  95
 38    0   0       3    37    0   76    0    0    0    0      0    0   0   0 100
 39    0   0       0     1    0    1    0    0    0    0      0    0   0   0 100
CPU minf mjf    xcal  intr ithr  csw icsw migr smtx  srw  syscl  usr sys  wt idl
  0    7   0    1593   256  154   28    6    7   42    0   9074   48  25   0  27
  1    0   0       2    13    0   23    0    7    0    0     30    0   1   0  99
  2    0   0       0     3    0    4    0    0    0    0     80    0   0   0 100
  3    0   0       0     2    1    0    0    0    0    0      0    0   0   0 100
  4    0   0     523   429  411  124    4    6   24    0   1703    7   5   0  87
  5    0   0       0    52    0  102    0    0    0    0     72    0   0   0 100
  6    0   0       0     5    0    8    0    0    0    0      0    0   0   0 100
  7    0   0       1    19    0   36    0    0    0    0      0    0   0   0 100
 16    0   0  125980    44    0   79    7   13   12    0   3319   22  18   0  59
 17    0   0       3    15    0   27    0    1    2    0    293    3   1   0  96
 18    0   0       1     5    0    9    0    0    0    0      0    0   0   0 100
 19    0   0       2    19    0   35    0    0    0    0      0    0   0   0 100
 20    0   0       4    23    0   46    0    6    6    0   1254    6   1   0  93
 21    0   0       0     2    0    2    0    0    0    0      0    0   0   0 100
 22    0   0       0     2    1    0    0    0    0    0      0    0   0   0 100
 23    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 24    3   0      94    49    0   89    4   12   20    0   3314   14   7   0  78
 25    0   0       0     6    0   10    0    2   10    0     38    0   0   0 100
 26    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 27    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 28    0   0     311    32    0   58    3    7    5    0   1156    8   5   0  86
 29    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 30    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 31    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 32    1   0     407    53    0  104    4   16   27    1   3042   15   7   0  78
 33    0   0      36    86    0  170    0    2    0    0      2    0   0   0 100
 34    0   0       0     3    0    5    0    0    0    0      0    0   0   0 100
 35    0   0       3     7    0   12    0    0    0    0      0    0   0   0 100
 36    0   0      12    49    0   96    1    8    2    0    878    4   1   0  95
 37    1   0      32     3    0    4    0    1    0    0     12    0   0   0 100
 38    0   0       3    27    0   51    0    0    0    0      0    0   0   0 100
 39    0   0       0     4    0    6    0    0    0    0      0    0   0   0 100
CPU minf mjf    xcal  intr ithr  csw icsw migr smtx  srw  syscl  usr sys  wt idl
  0    0   0     204   258  146   92    3   16    3    0   2692    7   6   0  87
  1    0   0       4    26    0   38    2    7   10    0   1644   13   2   0  84
  2    0   0       0     6    0    8    0    1    3    0    160    0   1   0  99
  3    0   0       0     2    1    0    0    0    0    0      0    0   0   0 100
  4    0   0     238   451  406  171    5   13    6    0    972    4   3   0  93
  5    0   0       6    40    4   72    0    1    0    0     52    0   0   0 100
  6    0   0       4    17    0   33    0    1    6    0    243    1   1   0  98
  7    0   0       1     6    0   10    0    0    0    0      0    0   0   0 100
 16    2   0  187838    42    0   72   12    5   37    0   5551   35  33   0  32
 17    0   0       5    21    0   40    0    4    0    0      4    0   0   0 100
 18    0   0       3     7    0   13    0    0    0    0      0    0   0   0 100
 19    0   0       1    21    0   40    0    0    0    0      0    0   0   0 100
 20    0   0       2    18    0   35    0    7   14    0    532    2   0   0  98
 21    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 22    0   0       0     2    1    0    0    0    0    0      0    0   0   0 100
 23    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 24    2   0     128    32    0   57    3    9   90    0   4981   33   8   0  59
 25    0   0       1     4    0    7    0    2    0    0     39    0   0   0 100
 26    0   0       0     1    0    2    0    1    0    0      0    0   0   0 100
 27    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 28    0   0      94    26    0   47    1    5   76    0   1136    4   6   0  89
 29    0   0       0     3    0    4    0    1    0    0      2    0   0   0 100
 30    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 31    0   0       0     1    0    0    0    0    0    0      0    0   0   0 100
 32    0   0     561    38    0   66    5   12   16    0   3872   23  15   0  62
 33    0   0       8    78    0  155    2    3    1    0     38    0   0   0 100
 34    0   0       0     2    0    3    0    0    0    0      0    0   0   0 100
 35    0   0       3     9    0   17    0    0    0    0      0    0   0   0 100
 36    0   0      15    60    0  115    6    6    6    0   1457    8   2   0  90
 37    0   0       1     4    0    7    0    0    0    0      0    0   0   0 100
 38    0   0       5    28    0   53    0    0    0    0      0    0   0   0 100
 39    0   0       1     4    0    6    0    0    0    0      0    0   0   0 100

4. who is making all the cross calls

# dtrace -n 'sysinfo:::xcalls { @num[pid,execname] = count(); }'
dtrace: description 'sysinfo:::xcalls ' matched 4 probes
^C
        7  svc.startd               1
      402  nscd                     1
     1384  snmpd                    1
     1759  cpd                      1
     1921  cpmad                    1
[...snip...]
     2164  fwd                     26
     1966  fwd                     28
     1927  fwd                     31
     2055  cpd                     45
     2144  fwd                     45
      826  snmpd                  102
        3  fsflush                190
    21288  fwm                    479
    21289  fw_loader              495
    21286  fwm                   1093
    21287  dtrace                1430
        0  sched                26601
     2153  fwd                 337655
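What I haven't worked out yet is which kernel code paths are behind the cross-calls attributed to pid 2153. The next step I have in mind is aggregating the same xcalls probe on kernel stacks, along these lines (an untested sketch, keeping only the top 10 stacks):

# dtrace -n '
    /* kernel stacks behind cross-calls initiated while fwd (pid 2153) is on CPU */
    sysinfo:::xcalls /pid == 2153/ { @[stack()] = count(); }
    END { trunc(@, 10); }'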
5. System calls made by this process:

# dtrace -n 'syscall:::entry /pid == 2153/ { @num[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 229 probes
^C
  getpid                  1
  pollsys                74
  pread64                87
  pwrite64              134
  lwp_sigmask           148
  llseek              11010
  lseek               18126
  read                19654
  write               57857

6. What is it reading and writing? The first column is the file descriptor number, the second the number of times it was seen:

# dtrace -n 'syscall::read:entry /pid == 2153/ { @[arg0] = count(); }'
dtrace: description 'syscall::read:entry ' matched 1 probe
^C
       47        1
       49       12
       83       12
       73       17
       51       18
       50       21
       53       24
       56       24
       57      282
       58      381
       55      401
       78     5742
       70     5777
       77    12835

# dtrace -n 'syscall::write:entry /pid == 2153/ { @[arg0] = count(); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C
       13        4
       82        7
       78      158
       70      160
       24    21201
       23    24974
       22    28747

# pfiles 2153
2153:   /opt/CPmds-R65/customers/XXX/CPsuite-R65/fw1/bin/fwd -n
  Current rlimit: 65536 file descriptors
[...snip...]
  22: S_IFREG mode:0644 dev:256,65541 ino:96177 uid:0 gid:0 size:1197740174
      O_RDWR|O_CREAT|O_TRUNC
      /var/opt/CPmds-R65/customers/XXX/CPsuite-R65/fw1/log/fw.log
  23: S_IFREG mode:0644 dev:256,65541 ino:96178 uid:0 gid:0 size:47337100
      O_RDWR|O_CREAT|O_TRUNC
      /var/opt/CPmds-R65/customers/XXX/CPsuite-R65/fw1/log/fw.logptr
  24: S_IFREG mode:0644 dev:256,65541 ino:96179 uid:0 gid:0 size:39200652
      O_RDWR|O_CREAT|O_TRUNC
      /var/opt/CPmds-R65/customers/XXX/CPsuite-R65/fw1/log/fw.loginitial_ptr
[...snip...]
  70: S_IFSOCK mode:0666 dev:332,0 ino:25771 uid:0 gid:0 size:0
      O_RDWR|O_NDELAY
        SOCK_STREAM
        SO_REUSEADDR,SO_OOBINLINE,SO_SNDBUF(49152),SO_RCVBUF(49640),IP_NEXTHOP(0.0.193.232)
        sockname: AF_INET 1.1.1.42  port: 18184
        peername: AF_INET 1.1.1.5  port: 55011
[...snip...]
  77: S_IFREG mode:0644 dev:256,65541 ino:125222 uid:0 gid:0 size:2145391731
      O_RDONLY
      /var/opt/CPmds-R65/customers/XXX/CPsuite-R65/fw1/log/2009-06-08_215011_153.log
  78: S_IFSOCK mode:0666 dev:332,0 ino:44055 uid:0 gid:0 size:0
      O_RDWR|O_NDELAY
        SOCK_STREAM
        SO_REUSEADDR,SO_OOBINLINE,SO_SNDBUF(49152),SO_RCVBUF(49640),IP_NEXTHOP(0.0.193.232)
        sockname: AF_INET 1.1.1.42  port: 18184
        peername: AF_INET 1.1.1.5  port: 55147
[...snip...]

So the file being written is fw.log (fd 22), which is to be expected. The busiest file being read is fd 77, 2009-06-08_215011_153.log, closely followed by fds 70 and 78, which are TCP connections to the log reporting server. An educated guess would be that this is the file the reporting server is trying to index, and checking the reporting server confirms it.

regards

Matthew Flanagan
--
http://wadofstuff.blogspot.com
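P.S. The other thing on my to-do list is to sample where fwd actually spends its time on CPU while the reporting engine is pulling that old log file, to see whether it is burning cycles in its own code rather than waiting on I/O. Something along these lines (again, an untested sketch):

# dtrace -n '
    /* sample fwd user stacks at 997Hz for 30s; print the 20 hottest stacks */
    profile-997 /pid == 2153/ { @[ustack()] = count(); }
    tick-30s { trunc(@, 20); printa(@); exit(0); }'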