I'm working with a colleague on a network latency problem he is having 
on a T5220 server.  Basically, his server reads some data off the 
network interface via sockets from a different hardware device, and then 
the T5220 server responds with some data written out to the same the 
network interface. 
There is a timing "requirement", where the reads happen about every 80 
milliseconds, and the responding writes need to happen and be complete 
in about 240 milliseconds.  On other hardware (and even on Solaris 8), 
typical responses are in the 20 msec range, and had no latency issues at 
all.

What he is seeing is the write response happens OK for a while, then it 
slows down where the response (socket write from the T5220) takes longer 
than 240 msecs.

He used the socketsnoop.d script from the DTrace Toolkit, and had the 
following observations:

> The problem in a nutshell is that there is a message (controls for 
> hardware) sent about every 80 msec from the T5220/Solaris v10 to an 
> MVME6100/VxWorks v5.5.1 via TCP/IP over the e1000g1 interface.  The 
> message needs to get to the MVME6100 within 240 msec of being sent or 
> the signal processor will run out of commands.  The process sending the 
> message is running at a real time priority and only one other controller
> process has a higher real time priority.  From a dtrace that monitored 
> fbt:ip:tcp_output:entry, I see 5 socket writes each time my message is
> sent:
>
>  
>
>            TIME                WALLTIME    UID   PID DIR  SIZE CMD (wrapped 
> on next line)
>
> 499194.679681840 2008 Oct 20 14:23:35.61     0 21542   W 14600 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680047812 2008 Oct 20 14:23:35.62     0 21542   W 14600 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680408073 2008 Oct 20 14:23:35.62     0 21542   W 14600 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680613575 2008 Oct 20 14:23:35.62     0 21542   W 14600 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.680732307 2008 Oct 20 14:23:35.62     0 21542   W  4160 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched.
>
>  
>
> When the problem occurred I saw only 3 write followed by a 2 references to 
> sched:
>
>  
>
> 499194.761630757 2008 Oct 20 14:23:35.70     0 21542   W 14600 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.761992335 2008 Oct 20 14:23:35.70     0 21542   W 14600 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.762465836 2008 Oct 20 14:23:35.70     0 21542   W 14600 
> /home/cm/block3/v66_block3/20081015/rdp/bin/rc_radar_sched
>
> 499194.763077839 2008 Oct 20 14:23:35.70     0     0   W 14600 sched 
> 499194.763134514 2008 Oct 20 14:23:35.70     0     0   W  4160 sched
>
>  
>   
The 5 socket write()'s above, (four at 14600 bytes, and the fifth @ 4160 
bytes) from the rc_radar_sched process, are normal, and within the time 
window we need. But when it takes longer than 240 msecs, we only see 3 
socket write()'s at 14600 bytes from rc_radar_sched, followed by two 
socket write()'s from the sched process (the first is 14600 bytes, and 
the second 4160). This is when the delay happens for subsequent write()'s.

What could be going on here? Any more data we should dtrace?

Thanks,
Pat



_______________________________________________
dtrace-discuss mailing list
dtrace-discuss@opensolaris.org

Reply via email to