Hi all,

First, I'll say my intent is not to spam a bunch of lists, but after 
posting to opensolaris-discuss, someone suggested offline that these 
lists might be a better place to start.  So here we are.  For those on 
all three lists, sorry for the repetition.

Second, this message is meant to solicit help in diagnosing the issue 
described below.  Any hints on how DTrace may help, or where in general 
to start would be much appreciated.  Back to the subject at hand.

-----------------------

I'm testing an application which makes use of a large file mmap'd into 
memory, as if the application were using malloc().  The file is 
roughly 2x the size of physical RAM.  Basically, I'm seeing the system 
stall for long periods of time, 60+ seconds, and then resume.  The file 
lives on an SSD (Intel X25-E), and I'm using ZFS's lzjb compression to 
make more efficient use of the ~30G of space provided by that SSD.

The general flow is: start the application and ask it to use a 50G
file.  The file is created sparsely at the designated location, then
mmap is called on the entire file.  All fine up to this point.

I then start loading data into the application, and it starts pushing
data to the file as you'd expect.  Data is pushed to the file early and 
often, since it's mmap'd with the MAP_SHARED flag.  When the 
application's resident size reaches about 80% of the physical RAM on the 
system, the system starts paging, and things still work relatively 
well, though slower, as expected.

Soon after, when reaching about 40G of data, I get stalls accessing the
SSD (according to iostat); in other words, no IO to that drive at all.
When I started looking into what could be causing it, such as IO
timeouts, I ran dmesg, and it hung after printing a timestamp.  I can
ctrl-C dmesg, but subsequent runs fare no better.  Nor do I see any new
messages in /var/adm/messages, where I'd expect them to appear.

Eventually the system recovers; in the latest case it took over 10
minutes after I killed the application mentioned above, and at that
point I do see disk timeouts in dmesg.

So I can only assume that there's either a driver bug in the SATA/SAS
controller I'm using and it's throwing timeouts, or the SSD itself is
having issues.  Looking at the zpool configuration, I see that
failmode=wait, and since that SSD is the only member of the zpool, I
would expect IO to hang.

But does that mean that dmesg should hang also?  Does that mean that
the kernel has at least one thread stuck?  Would failmode=continue be
more desirable, or more resilient?

During the hang, load-avg is artificially high, fmd being the one
process that sticks out in prstat output.  But fmdump -v doesn't show
anything relevant.

Anyone have ideas on how to diagnose what's going on there?

Thanks,
Ethan

System: Sun x4240, dual AMD Opteron 2347, 32G of RAM
SAS/SATA controller: LSI 3081E
OS: OpenSolaris snv_98
SSD: Intel X25-E


_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org