Comments inline.

Matthew Flanagan wrote:
Hi Phil
Matthew,

In addition to Roland's comment about patching (which
is particularly important due to the rate of change in ZFS) ...

1) Your dd test doesn't really tell us much. Do you
know what size writes fwd is doing? I doubt it is 128k. You could use
truss/dtrace to find out, and then use dd again with the same block size.
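To illustrate the point, here's a sketch of re-running the dd test with a block size that matches what fwd actually issues. The file path is a stand-in, not the real log file; 96 bytes matches the write size seen in the truss output below.

```shell
# Illustrative only: /tmp/fw.sample stands in for the real log file.
dd if=/dev/zero of=/tmp/fw.sample bs=128k count=8 2>/dev/null  # build a 1MB test file
dd if=/tmp/fw.sample of=/dev/null bs=128k 2>/dev/null          # few large I/Os
dd if=/tmp/fw.sample of=/dev/null bs=96 2>/dev/null            # many tiny I/Os, much more syscall overhead
```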

truss shows writes like this:

/1:     write(22, "\0 4\001 :919C\v J H90 b".., 96)     = 96
/1:     write(23, " %97 *AF", 4)                        = 4
/1:     write(24, " %97 *AF", 4)                        = 4

and reads like this:
/1:     read(77, "\0 D\003 :919F16 J ,EACF".., 131072)  = 131072
/1:     read(77, "\0 D\003 :919F16 J ,EADC".., 131072)  = 131072
/1:     read(77, "\0 D\003 :919F16 J ,EADC".., 131072)  = 131072
/1:     read(78, 0x033BD918, 5)                         Err#11 EAGAIN
/1:     read(78, 0x033BD918, 5)                         Err#11 EAGAIN
/1:     read(78, 0x033BD918, 5)                         Err#11 EAGAIN
/1:     read(78, 0x033BD918, 5)                         Err#11 EAGAIN

I had done a truss before and seen the 128k reads, though I hadn't noticed so
many EAGAINs.

The writes are tiny. That's a lot of overhead. But I think we need the big picture, so some DTrace quantizations of per-fd I/O sizes would help fill it out.
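An untested sketch of what I mean, keyed to the fwd pid from the prstat output below (the 20-second tick mirrors the run length Matthew mentions later):

dtrace -n 'syscall::write:entry /pid == 2153/ { @w[arg0] = quantize(arg2); }
           syscall::read:entry  /pid == 2153/ { @r[arg0] = quantize(arg2); }
           tick-20s { exit(0); }'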

Those EAGAINs are a concern. It would be good to get a handle on how much that happens, but it really points to code that needs fixing. Unconstrained polling is a big no-no. I get the impression that this code isn't up to the kind of load being put through it. I'd like to see multiple threads and/or the use of poll(2) or event ports.

2) You've said the application is single threaded.
This isn't a good fit for the T-series, which was designed for throughput workloads at the expense of single-thread performance (e.g. the M-series typically has 4x the single-thread performance of the T-series).

The server has handled the workload quite well until recently, when we added
a few new firewalls that do quite a large volume of logging. If I can build a
case for a faster server, it will be easier to convince the business to buy
it.

The application has a finite capacity. It will max out sooner or later. Faster CPUs might put off the evil day, but the code really needs fixing too. Your T5120 has plenty of capacity, but not with this code, and your prstat data confirms ...

3) It would be helpful to see 'prstat -Lm' to get a
better overall feel for what is running where (especially, where fwd is
spending its time).

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  2153 root      41  59 0.1 0.0 0.0 0.0 0.3 0.2 383  11  7K   0 fwd/1

That's one whole VCPU dedicated to this task. Either we improve the code (e.g. save cycles), or you need a CPU with faster single-thread performance. Of course, if we could also make the application multithreaded or multiprocess, the T5120 has plenty of CPU to spare!

  1923 root      16 5.4 0.1 0.0 0.0 0.0  78 0.1  52  18  2K   0 fwd/1
  2144 root      12 1.3 0.0 0.0 0.0 0.0  87 0.0  10   4 503   0 fwd/1
  1927 root     5.4 4.2 0.0 0.0 0.0 0.0  90 0.0  51   1  1K   0 fwd/1
  1941 root     2.5 4.3 0.0 0.0 0.0 0.0  93 0.0  29   0 855   0 fwd/1
  1975 root     6.4 0.4 0.0 0.0 0.0 0.0  93 0.0   1   2 936   0 cpmad/1
  2166 root     5.5 0.1 0.0 0.0 0.0 0.0  94 0.0   5   1  36   0 fwm/1
 22500 root     4.8 0.6 0.0 0.0 0.0 0.0  95 0.0   1   2  1K   0 cpmad/1
 23745 root     0.7 4.6 0.0 0.0 0.0 0.0  95 0.0  42   1 603   0 prstat/1
  1966 root     2.8 2.1 0.0 0.0 0.0 0.0  95 0.0  12   1 996   0 fwd/1
  2250 root     3.6 0.1 0.0 0.0 0.0 0.0  96 0.0   4   0  36   0 fwm/1
  2175 root     1.7 0.9 0.0 0.0 0.0 0.0  97 0.0  19   0 404   0 fwd/1
 21873 root     2.3 0.1 0.0 0.0 0.0 0.0  98 0.0   1   0 396   0 cpmad/1
 23523 root     1.8 0.2 0.0 0.0 0.0 0.0  98 0.0   1   1 423   0 cpmad/1
  2164 root     1.2 0.5 0.0 0.0 0.0 0.0  98 0.0   5   0 320   0 fwd/1
  1902 root     0.4 0.8 0.0 0.0 0.0 0.0  99 0.0   1   0 159   0 fwd/1
   402 root     0.2 0.1 0.0 0.0 0.0 0.0 100 0.0  39   0 234   0 nscd/25
  1901 root     0.2 0.1 0.0 0.0 0.0 0.0 100 0.0   7   0 201   0 fwd/1
  1957 root     0.1 0.1 0.0 0.0 0.0 0.0 100 0.0   1   0  21   0 fwd/1
 23657 mflanaga 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0   3   2  24   0 sshd/1
  1747 root     0.1 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0  20   0 cpd/1
  1934 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   3   0  90   0 fwd/1
  2137 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0  18   0 fwd/1
  2157 root     0.1 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   8   0 fwm/1
  2145 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   9   0 fwd/1
  1921 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0   5   0 cpmad/1
  2055 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0  12   0 cpd/1
  1960 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0   7   0 fwm/1
  1944 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0   7   0 fwm/1
  1917 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0   8   0 fwd/1
   522 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0  10   2 xntpd/1
  1968 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0   7   0 fwm/1
  2147 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0   4   0 fwm/1
  2056 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   3   0   5   0 cpd/2
  1757 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   3   0   5   0 cpd/2
  1747 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   3   0   5   0 cpd/2
  1759 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   3   0   5   0 cpd/2
  1745 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   3   0   5   0 cpd/2
  2039 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   3   0   5   0 cpd/2
Total: 115 processes, 412 lwps, load averages: 2.07, 2.05, 2.05

4) It's cool to see you using DTrace. You've shown
again that one-liners can tell us so much, and so easily. My only comment is that it would be useful to know how long your tests ran before you hit
^C.

I usually ran each dtrace for 20 seconds.
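One way (a sketch, untested here) to make the interval explicit is a tick probe, so the run length is recorded with the data instead of depending on when ^C lands:

dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); } tick-20s { exit(0); }'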

5) How about some DTrace to find out where in the
kernel the crosscalls are happening? For example ...

dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'

... although bringing the system up to date with
patches may resolve this issue anyway.

# dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'
[...snip...]
              unix`send_one_mondo+0x1c
              unix`xt_one_unchecked+0xc8
              genunix`sleepq_wakeone_chan+0x70
              genunix`cv_signal+0x58
              genunix`taskq_dispatch+0x94
              genunix`callout_schedule+0x48
              genunix`clock+0x6a0
              genunix`cyclic_softint+0xa4
              unix`cbe_level10+0x8
              unix`intr_thread+0x168
             2068

              unix`send_mondo_set+0x314
              unix`xt_some+0x1a0
              unix`sfmmu_tlb_range_demap+0x1f0
              unix`hat_unload_callback+0x850
              unix`segkmem_free_vn+0x58
              genunix`kmem_magazine_destroy+0xd8
              genunix`kmem_depot_ws_reap+0x50
              genunix`taskq_thread+0x1a4
              unix`thread_start+0x4
             2181

              unix`send_mondo_set+0x314
              unix`xt_some+0x1a0
              unix`xt_sync+0x158
              unix`hat_unload_callback+0x898
              unix`segkmem_free_vn+0x58
              genunix`kmem_magazine_destroy+0xd8
              genunix`kmem_depot_ws_reap+0x50
              genunix`taskq_thread+0x1a4
              unix`thread_start+0x4
             2181

              unix`send_one_mondo+0x1c
              unix`xt_one_unchecked+0xc8
              genunix`sleepq_wakeone_chan+0x70
              genunix`cv_signal+0x58
              dls`soft_ring_process+0x214
              dls`dls_soft_ring_fanout+0x2bc
              dls`i_dls_link_rx+0x1f0
              mac`mac_rx+0x8c
              e1000g`e1000g_intr_pciexpress+0xbc
              px`px_msiq_intr+0x1e8
              unix`intr_thread+0x168
             2841

              unix`send_one_mondo+0x1c
              unix`xt_one_unchecked+0xc8
              genunix`sleepq_wakeone_chan+0x70
              genunix`cv_signal+0x58
              genunix`taskq_dispatch+0x94
              zfs`zio_done+0x190
              zfs`zio_vdev_io_assess+0x178
              genunix`taskq_thread+0x1a4
              unix`thread_start+0x4
             3498

              unix`send_one_mondo+0x1c
              unix`xt_one_unchecked+0xc8
              genunix`sleepq_wakeone_chan+0x70
              genunix`cv_signal+0x58
              genunix`taskq_dispatch+0x94
              emcp`PowerPlatformTopIodone+0x60
              emcp`PowerTopIodone+0xc
              emcp`PowerProcessTopIodonePirps+0x2c
              emcp`PowerBottomIodoneNew+0x308
              emcp`PowerPlatformBottomIodone+0x2c
              ssd`ssd_return_command+0x240
              ssd`ssdintr+0x268
              qlc`ql_fast_fcp_post+0x178
              qlc`ql_24xx_status_entry+0x2ec
              qlc`ql_response_pkt+0x268
              qlc`ql_isr_aif+0x7dc
              px`px_msiq_intr+0x1e8
              unix`intr_thread+0x168
             5017

              unix`send_mondo_set+0x314
              unix`xt_some+0x1a0
              unix`sfmmu_tlb_range_demap+0x1f0
              unix`hat_unload_callback+0x850
              unix`ppmapout+0x94
              zfs`mappedread+0xa4
              zfs`zfs_read+0x19c
              genunix`fop_read+0x20
              genunix`read+0x274
              unix`syscall_trap32+0xcc
           135480

              unix`send_mondo_set+0x314
              unix`xt_some+0x1a0
              unix`xt_sync+0x158
              unix`hat_unload_callback+0x898
              unix`ppmapout+0x94
              zfs`mappedread+0xa4
              zfs`zfs_read+0x19c
              genunix`fop_read+0x20
              genunix`read+0x274
              unix`syscall_trap32+0xcc
           135480

Ah! This is very telling since zfs_read() only calls mappedread() if it thinks someone also has the file mmapped. ZFS and mmap(2) do not mix well because ZFS has its own cache (the ARC) apart from the Solaris page cache. However, mmap(2) currently uses the Solaris page cache for mapped ZFS files, so ZFS has to keep the two caches in sync (so everyone sees the file in the same state). The crosscall storm is the result.

I worked on a partial fix for mapped read performance, but I doubt that will help your application much (it wasn't aimed at the read() performance of a file that is mapped elsewhere).

So the big question is: who has the file mapped, and can you stop that? Note: utilities like cat(1) and cp(1) use mmap(2) to read data, but dd(1) doesn't (e.g. if you have a script that copies the files periodically, you could substitute dd(1) and lose the crosscalls).
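A sketch of the substitution in such a copy script (paths are examples only; the printf line just builds a stand-in for the real log file):

```shell
# dd(1) copies with plain read(2)/write(2), so it avoids the
# mmap-triggered crosscalls that cp(1) and cat(1) can incur on ZFS.
printf 'example log data\n' > /tmp/fw.log                  # stand-in source file
dd if=/tmp/fw.log of=/tmp/fw.log.copy bs=128k 2>/dev/null  # copy without mmap
```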

Does fwd mmap the file itself? (you can use pmap(1) to find out).
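An untested sketch of that check, using the fwd pid from prstat (the grep pattern is illustrative; substitute the actual log file name):

pmap -x 2153 | grep fw.log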

I've also done the same dtrace but with ustack() just for the process.

# dtrace -n 'sysinfo:::xcalls /pid == 2153/ { @[ustack()] = count(); }'

[snip]
              libc_psr.so.1`memcpy+0x18c
              libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
              libCPLogRepository.so`__1cICLogFileHReadLog6MpnJILogEvent_LpL33_i_+0xd4
              libCPLogRepository.so`__1cICLogFileIReadNext6MpnJILogEvent_pL33_i_+0x40
              libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x164
              libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
              libCPLogLuFile.so`0xfc493e9c
              libfw1.so`__1cNlog_read_next6FpnPICplogLuLogFile_pnJILogEvent_bpp3pi6_3_+0xfc
              libfwdopsec.so`lea_send_line+0x108
              libopsec.so`lea_send_50_records+0x68
              libopsec.so`lea_server_event_handler+0xc8
              libopsec.so`opsec_fwasync_conn_handler+0x2fc
              libComUtils.so`0xfe393bc4
              libComUtils.so`T_event_mainloop_iter+0x4c4
              libComUtils.so`T_event_mainloop+0x118
              libfw1.so`fwd_cmain+0x131c
              libfw1.so`fw_cmain+0x910
              fwd`_start+0x108
               18

              libc_psr.so.1`memcpy+0x18c
              libCPLogRepository.so`__1cOCBinaryLogFileHReadLog6MpnJILogEvent_pnNCLogFileChain__b_+0xc
              libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
              libCPLogRepository.so`__1cICLogFileHReadLog6MpnJILogEvent_LpL33_i_+0xd4
              libCPLogRepository.so`__1cICLogFileIReadNext6MpnJILogEvent_pL33_i_+0x40
              libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x164
              libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
              0x3
               32

              libc_psr.so.1`memcpy+0x18c
              libCPLogRepository.so`__1cOCBinaryLogFileHReadLog6MpnJILogEvent_pnNCLogFileChain__b_+0xc
              libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
              libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x654
              libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
              libCPLogLuFile.so`0xfc493e9c
              libfw1.so`__1cNlog_read_next6FpnPICplogLuLogFile_pnJILogEvent_bpp3pi6_3_+0xfc
              libfwdopsec.so`lea_send_line+0x108
              libopsec.so`lea_send_50_records+0x68
              libopsec.so`lea_server_event_handler+0xc8
              libopsec.so`opsec_fwasync_conn_handler+0x2fc
              libComUtils.so`0xfe393bc4
              libComUtils.so`T_event_mainloop_iter+0x4c4
              libComUtils.so`T_event_mainloop+0x118
              libfw1.so`fwd_cmain+0x131c
              libfw1.so`fw_cmain+0x910
              fwd`_start+0x108
               62

              libc.so.1`_read+0x8
              libc.so.1`fread+0x298
              libCPLogRepository.so`__1cOCBinaryLogFileNReadLogHeader6kMpnJILogEvent_pnNCLogFileChain__b_+0x280
              libCPLogRepository.so`__1cOCBinaryLogFileHReadLog6MpnJILogEvent_pnNCLogFileChain__b_+0xc
              libCPLogRepository.so`__1cICLogFilePReadLogByOffset6MpnJILogEvent_LpL3_i_+0x118
              libCPLogLuFile.so`__1cKCLuLogFileLgatherChain6Mn0APtReadNavigation_kLrnKCSrvVector4CpnJILogEvent_nDstdTallocator_interface4n0DJallocator4C4__C4____r4_i_+0x3f8
              libCPLogLuFile.so`__1cKCLuLogFileJdoReadLog6MpnJILogEvent_n0APtReadNavigation_kLipp2pi_i_+0x80
              libCPLogLuFile.so`0xfc493e9c
              libfw1.so`__1cNlog_read_next6FpnPICplogLuLogFile_pnJILogEvent_bpp3pi6_3_+0xfc
              libfwdopsec.so`lea_send_line+0x108
              libopsec.so`lea_send_50_records+0x68
              libopsec.so`lea_server_event_handler+0xc8
              libopsec.so`opsec_fwasync_conn_handler+0x2fc
              libComUtils.so`0xfe393bc4
              libComUtils.so`T_event_mainloop_iter+0x4c4
              libComUtils.so`T_event_mainloop+0x118
              libfw1.so`fwd_cmain+0x131c
              libfw1.so`fw_cmain+0x910
              fwd`_start+0x108
           226302

Hope this helps,
Phil

I hope so too. And thanks for the fast response!

regards

Matthew

Nothing else there, except that taking an xcall during a memcpy() might indicate a mapped read. However, xcalls can happen in many ways. A PID-predicated stack aggregation of the xcalls might shed a little more light ...

 dtrace -n 'sysinfo:::xcalls /pid == 2153/ { @[stack()] = count(); }'


Phil
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
