Comments inline.
Matthew Flanagan wrote:
Hi Phil
Matthew,
In addition to Roland's comment about patching (which is particularly
important due to the rate of change in ZFS) ...
1) Your dd test doesn't really tell us much. Do you know what size
writes fwd is doing? I doubt it is 128k. You could use truss/dtrace to
find out and then use dd again with the same blocksize.
truss shows writes like this:
/1: write(22, "\0 4\001 :919C\v J H90 b".., 96) = 96
/1: write(23, " %97 *AF", 4) = 4
/1: write(24, " %97 *AF", 4) = 4
and reads like this:
/1: read(77, "\0 D\003 :919F16 J ,EACF".., 131072) = 131072
/1: read(77, "\0 D\003 :919F16 J ,EADC".., 131072) = 131072
/1: read(77, "\0 D\003 :919F16 J ,EADC".., 131072) = 131072
/1: read(78, 0x033BD918, 5) Err#11 EAGAIN
/1: read(78, 0x033BD918, 5) Err#11 EAGAIN
/1: read(78, 0x033BD918, 5) Err#11 EAGAIN
/1: read(78, 0x033BD918, 5) Err#11 EAGAIN
I had done a truss before and seen the 128k reads, though I hadn't noticed so
many EAGAINs.
The writes are tiny. That's a lot of overhead. But I think we need the
big picture, so some DTrace quantizations of per-fd sizes might be useful
to fill out the picture some more.
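For example, something along these lines would do it (a sketch only; PID 2153
is taken from the prstat output below, and the probe arguments assume the
stock Solaris syscall provider, where arg0 is the fd and arg2 the size):

```shell
# Quantize read/write sizes per file descriptor for the fwd process.
# PID 2153 comes from the prstat output; adjust as needed.
# Run for a fixed interval, then ^C to print the aggregations.
dtrace -n '
  syscall::write:entry /pid == 2153/ { @["write", arg0] = quantize(arg2); }
  syscall::read:entry  /pid == 2153/ { @["read",  arg0] = quantize(arg2); }
'
```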
Those EAGAINs are a concern. It would be good to get a handle on how
much that happens, but it really points to code that needs fixing.
Unconstrained polling is a big no-no. I get the impression that this
code isn't up to the kind of load being put through it. I'd like to see
multiple threads and/or the use of poll(2) or event ports.
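To get a handle on how often the polling spins, a sketch along the same lines
(PID again assumed to be 2153; the EAGAIN symbol comes from the errno
definitions DTrace loads by default):

```shell
# Count read(2) calls failing with EAGAIN, per file descriptor.
# self->fd carries the descriptor from the entry probe to the return probe.
dtrace -n '
  syscall::read:entry  /pid == 2153/ { self->fd = arg0; }
  syscall::read:return /pid == 2153 && errno == EAGAIN/ { @[self->fd] = count(); }
  syscall::read:return /pid == 2153/ { self->fd = 0; }
'
```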
2) You've said the application is single threaded. This isn't a good fit
for the T-series (which was designed for throughput workloads at the
expense of single thread performance ... e.g. the M-series has typically
4x the single thread performance of the T-series).
The server has handled the workload quite well until recently, when we added
a few new firewalls that do quite a large volume of logging. If I can build a
case for a faster server, it will be easier to convince the business to buy
it.
The application has a finite capacity. It will max out sooner or later.
Faster CPUs might put off the evil day, but the code really needs fixing
too. Your T5120 has plenty of capacity, but not with this code, and your
prstat data confirms it ...
3) It would be helpful to see 'prstat -Lm' to get a better overall feel
for what is running where (especially, where fwd is spending its time).
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
2153 root 41 59 0.1 0.0 0.0 0.0 0.3 0.2 383 11 7K 0 fwd/1
That's one whole VCPU dedicated to this task. Either we improve the code
(e.g. save cycles) or you need a faster CPU with respect to single thread
performance. Of course, if we could also make the application
multithreaded or multiprocess, the T5120 has plenty of CPU to spare!
1923 root 16 5.4 0.1 0.0 0.0 0.0 78 0.1 52 18 2K 0 fwd/1
2144 root 12 1.3 0.0 0.0 0.0 0.0 87 0.0 10 4 503 0 fwd/1
1927 root 5.4 4.2 0.0 0.0 0.0 0.0 90 0.0 51 1 1K 0 fwd/1
1941 root 2.5 4.3 0.0 0.0 0.0 0.0 93 0.0 29 0 855 0 fwd/1
1975 root 6.4 0.4 0.0 0.0 0.0 0.0 93 0.0 1 2 936 0 cpmad/1
2166 root 5.5 0.1 0.0 0.0 0.0 0.0 94 0.0 5 1 36 0 fwm/1
22500 root 4.8 0.6 0.0 0.0 0.0 0.0 95 0.0 1 2 1K 0 cpmad/1
23745 root 0.7 4.6 0.0 0.0 0.0 0.0 95 0.0 42 1 603 0 prstat/1
1966 root 2.8 2.1 0.0 0.0 0.0 0.0 95 0.0 12 1 996 0 fwd/1
2250 root 3.6 0.1 0.0 0.0 0.0 0.0 96 0.0 4 0 36 0 fwm/1
2175 root 1.7 0.9 0.0 0.0 0.0 0.0 97 0.0 19 0 404 0 fwd/1
21873 root 2.3 0.1 0.0 0.0 0.0 0.0 98 0.0 1 0 396 0 cpmad/1
23523 root 1.8 0.2 0.0 0.0 0.0 0.0 98 0.0 1 1 423 0 cpmad/1
2164 root 1.2 0.5 0.0 0.0 0.0 0.0 98 0.0 5 0 320 0 fwd/1
1902 root 0.4 0.8 0.0 0.0 0.0 0.0 99 0.0 1 0 159 0 fwd/1
402 root 0.2 0.1 0.0 0.0 0.0 0.0 100 0.0 39 0 234 0 nscd/25
1901 root 0.2 0.1 0.0 0.0 0.0 0.0 100 0.0 7 0 201 0 fwd/1
1957 root 0.1 0.1 0.0 0.0 0.0 0.0 100 0.0 1 0 21 0 fwd/1
23657 mflanaga 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 3 2 24 0 sshd/1
1747 root 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 20 0 cpd/1
1934 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 3 0 90 0 fwd/1
2137 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 1 0 18 0 fwd/1
2157 root 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 1 0 8 0 fwm/1
2145 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 1 0 9 0 fwd/1
1921 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 5 0 cpmad/1
2055 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 12 0 cpd/1
1960 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 7 0 fwm/1
1944 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 7 0 fwm/1
1917 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 8 0 fwd/1
522 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 10 2 xntpd/1
1968 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 7 0 fwm/1
2147 root 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 4 0 fwm/1
2056 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 3 0 5 0 cpd/2
1757 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 3 0 5 0 cpd/2
1747 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 3 0 5 0 cpd/2
1759 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 3 0 5 0 cpd/2
1745 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 3 0 5 0 cpd/2
2039 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 3 0 5 0 cpd/2
Total: 115 processes, 412 lwps, load averages: 2.07, 2.05, 2.05
4) It's cool to see you using DTrace. You've shown again that one-liners
can tell us so much, and so easily. My only comment is that it would be
useful to know how long your tests ran before you hit ^C.
I usually ran each dtrace for 20 seconds.
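A tick probe can make each run self-terminating, so every sample covers an
identical interval, e.g.:

```shell
# Same crosscall aggregation, but exiting after exactly 20 seconds
# instead of waiting for ^C.
dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); } tick-20s { exit(0); }'
```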
5) How about some DTrace to find out where in the kernel the crosscalls
are happening? For example ...
dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'
... although bringing the system up-to-date with patches may resolve
this issue anyway.
# dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'
[...snip...]
unix`send_one_mondo+0x1c
unix`xt_one_unchecked+0xc8
genunix`sleepq_wakeone_chan+0x70
genunix`cv_signal+0x58
genunix`taskq_dispatch+0x94
genunix`callout_schedule+0x48
genunix`clock+0x6a0
genunix`cyclic_softint+0xa4
unix`cbe_level10+0x8
unix`intr_thread+0x168
2068
unix`send_mondo_set+0x314
unix`xt_some+0x1a0
unix`sfmmu_tlb_range_demap+0x1f0
unix`hat_unload_callback+0x850
unix`segkmem_free_vn+0x58
genunix`kmem_magazine_destroy+0xd8
genunix`kmem_depot_ws_reap+0x50
genunix`taskq_thread+0x1a4
unix`thread_start+0x4
2181
unix`send_mondo_set+0x314
unix`xt_some+0x1a0
unix`xt_sync+0x158
unix`hat_unload_callback+0x898
unix`segkmem_free_vn+0x58
genunix`kmem_magazine_destroy+0xd8
genunix`kmem_depot_ws_reap+0x50
genunix`taskq_thread+0x1a4
unix`thread_start+0x4
2181
unix`send_one_mondo+0x1c
unix`xt_one_unchecked+0xc8
genunix`sleepq_wakeone_chan+0x70
genunix`cv_signal+0x58
dls`soft_ring_process+0x214
dls`dls_soft_ring_fanout+0x2bc
dls`i_dls_link_rx+0x1f0
mac`mac_rx+0x8c
e1000g`e1000g_intr_pciexpress+0xbc
px`px_msiq_intr+0x1e8
unix`intr_thread+0x168
2841
unix`send_one_mondo+0x1c
unix`xt_one_unchecked+0xc8
genunix`sleepq_wakeone_chan+0x70
genunix`cv_signal+0x58
genunix`taskq_dispatch+0x94
zfs`zio_done+0x190
zfs`zio_vdev_io_assess+0x178
genunix`taskq_thread+0x1a4
unix`thread_start+0x4
3498
unix`send_one_mondo+0x1c
unix`xt_one_unchecked+0xc8
genunix`sleepq_wakeone_chan+0x70
genunix`cv_signal+0x58
genunix`taskq_dispatch+0x94
emcp`PowerPlatformTopIodone+0x60
emcp`PowerTopIodone+0xc
emcp`PowerProcessTopIodonePirps+0x2c
emcp`PowerBottomIodoneNew+0x308
emcp`PowerPlatformBottomIodone+0x2c
ssd`ssd_return_command+0x240
ssd`ssdintr+0x268
qlc`ql_fast_fcp_post+0x178
qlc`ql_24xx_status_entry+0x2ec
qlc`ql_response_pkt+0x268
qlc`ql_isr_aif+0x7dc
px`px_msiq_intr+0x1e8
unix`intr_thread+0x168
5017
unix`send_mondo_set+0x314
unix`xt_some+0x1a0
unix`sfmmu_tlb_range_demap+0x1f0
unix`hat_unload_callback+0x850
unix`ppmapout+0x94
zfs`mappedread+0xa4
zfs`zfs_read+0x19c
genunix`fop_read+0x20
genunix`read+0x274
unix`syscall_trap32+0xcc
135480
unix`send_mondo_set+0x314
unix`xt_some+0x1a0
unix`xt_sync+0x158
unix`hat_unload_callback+0x898
unix`ppmapout+0x94
zfs`mappedread+0xa4
zfs`zfs_read+0x19c
genunix`fop_read+0x20
genunix`read+0x274
unix`syscall_trap32+0xcc
135480
Ah! This is very telling since zfs_read() only calls mappedread() if it
thinks someone also has the file mmapped. ZFS and mmap(2) do not mix
well because ZFS has its own cache (the ARC) apart from the Solaris page
cache. However, mmap(2) currently uses the Solaris page cache for mapped
ZFS files, so ZFS has to keep the two caches in sync (so everyone sees
the file in the same state). The crosscall storm is the result.
I worked on a partial fix for mapped read performance, but I doubt that
will help your application much (because it wasn't aimed at read()
performance of an elsewhere mapped file).
So the big question is: who has the file mapped, and can you stop
that? Note: utilities like cat(1) and cp(1) use mmap(2) to read data,
but dd(1) doesn't (e.g. if you had a script that copied the files
periodically, you could substitute dd(1) and lose the crosscalls).
Does fwd mmap the file itself? (You can use pmap(1) to find out.)
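A few sketches for tracking the culprit down (the PID, the fbt probe's
availability on your kernel build, and the file paths are all illustrative
assumptions):

```shell
# 1) Look for the log file in fwd's address space; mapped files appear
#    with their pathname in the pmap output (PID 2153 from prstat above).
pmap -x 2153 | grep -i log

# 2) Or ask the kernel directly which processes take the mappedread()
#    path (fbt probe names depend on the kernel build).
dtrace -n 'fbt:zfs:mappedread:entry { @[pid, execname] = count(); }'

# 3) If a periodic copy job is mapping the file, substitute dd(1), which
#    reads without mmap(2). Path and block size are illustrative.
dd if=/var/opt/fw/log/fw.log of=/backup/fw.log bs=128k
```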
I've also done the same dtrace but with ustack() just for the process.
# dtrace -n 'sysinfo:::xcalls /pid == 2153/ { @[ustack()] = count(); }'
[snip]
libc_psr.so.1`memcpy+0x18c
libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
libCPLogRepository.so`__1cICLogFileHReadLog6MpnJILogEvent_LpL33_i_+0xd4
libCPLogRepository.so`__1cICLogFileIReadNext6MpnJILogEvent_pL33_i_+0x40
libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x164
libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
libCPLogLuFile.so`0xfc493e9c
libfw1.so`__1cNlog_read_next6FpnPICplogLuLogFile_pnJILogEvent_bpp3pi6_3_+0xfc
libfwdopsec.so`lea_send_line+0x108
libopsec.so`lea_send_50_records+0x68
libopsec.so`lea_server_event_handler+0xc8
libopsec.so`opsec_fwasync_conn_handler+0x2fc
libComUtils.so`0xfe393bc4
libComUtils.so`T_event_mainloop_iter+0x4c4
libComUtils.so`T_event_mainloop+0x118
libfw1.so`fwd_cmain+0x131c
libfw1.so`fw_cmain+0x910
fwd`_start+0x108
18
libc_psr.so.1`memcpy+0x18c
libCPLogRepository.so`__1cOCBinaryLogFileHReadLog6MpnJILogEvent_pnNCLogFileChain__b_+0xc
libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
libCPLogRepository.so`__1cICLogFileHReadLog6MpnJILogEvent_LpL33_i_+0xd4
libCPLogRepository.so`__1cICLogFileIReadNext6MpnJILogEvent_pL33_i_+0x40
libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x164
libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
0x3
32
libc_psr.so.1`memcpy+0x18c
libCPLogRepository.so`__1cOCBinaryLogFileHReadLog6MpnJILogEvent_pnNCLogFileChain__b_+0xc
libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x654
libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
libCPLogLuFile.so`0xfc493e9c
libfw1.so`__1cNlog_read_next6FpnPICplogLuLogFile_pnJILogEvent_bpp3pi6_3_+0xfc
libfwdopsec.so`lea_send_line+0x108
libopsec.so`lea_send_50_records+0x68
libopsec.so`lea_server_event_handler+0xc8
libopsec.so`opsec_fwasync_conn_handler+0x2fc
libComUtils.so`0xfe393bc4
libComUtils.so`T_event_mainloop_iter+0x4c4
libComUtils.so`T_event_mainloop+0x118
libfw1.so`fwd_cmain+0x131c
libfw1.so`fw_cmain+0x910
fwd`_start+0x108
62
libc.so.1`_read+0x8
libc.so.1`fread+0x298
libCPLogRepository.so`__1cOCBinaryLogFileNReadLogHeader6kMpnJILogEvent_pnNCLogFileChain__b_+0x280
libCPLogRepository.so`__1cOCBinaryLogFileHReadLog6MpnJILogEvent_pnNCLogFileChain__b_+0xc
libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x3f8
libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
libCPLogLuFile.so`0xfc493e9c
libfw1.so`__1cNlog_read_next6FpnPICplogLuLogFile_pnJILogEvent_bpp3pi6_3_+0xfc
libfwdopsec.so`lea_send_line+0x108
libopsec.so`lea_send_50_records+0x68
libopsec.so`lea_server_event_handler+0xc8
libopsec.so`opsec_fwasync_conn_handler+0x2fc
libComUtils.so`0xfe393bc4
libComUtils.so`T_event_mainloop_iter+0x4c4
libComUtils.so`T_event_mainloop+0x118
libfw1.so`fwd_cmain+0x131c
libfw1.so`fw_cmain+0x910
fwd`_start+0x108
226302
Hope this helps,
Phil
I hope so too. And thanks for the fast response!
regards
Matthew
Nothing else there, except that taking an xcall during a memcpy() might
indicate a mapped read. However, xcalls can happen in many ways. A
PID-predicated stack aggregation of the xcalls might shed a little more
light ...
dtrace -n 'sysinfo:::xcalls /pid == 2153/ { @[stack()] = count(); }'
Phil
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org