Ethan Erchinger wrote:
> Hi all,
>
> First, I'll say my intent is not to spam a bunch of lists, but after
> posting to opensolaris-discuss I had someone communicate with me offline
> that these lists would possibly be a better place to start. So here we
> are. For those on all three lists, sorry for the repetition.
>
> Second, this message is meant to solicit help in diagnosing the issue
> described below. Any hints on how DTrace may help, or where in general
> to start, would be much appreciated. Back to the subject at hand.
>
> -----------------------
>
> I'm testing an application which makes use of a large file mmap'd into
> memory, as if the application were using malloc(). The file is roughly
> 2x the size of physical ram. Basically, I'm seeing the system stall for
> long periods of time, 60+ seconds, and then resume. The file lives on
> an SSD (Intel X25-E), and I'm using ZFS's lzjb compression to make more
> efficient use of the ~30G of space provided by that SSD.
>
> The general flow of things is: start the application and ask it to use
> a 50G file. The file is created sparsely at the designated location,
> then mmap is called on the entire file. All fine up to this point.
>
> I then start loading data into the application, and it starts pushing
> data to the file as you'd expect. Data is pushed to the file early and
> often, as it's mmap'd with the MAP_SHARED flag. When the application's
> resident size reaches about 80% of the physical ram on the system, the
> system starts paging, and things still work relatively well, though
> slower, as expected.
>
> Soon after, at about 40G of data, I see stalls accessing the SSD
> (according to iostat); in other words, no IO to that drive. When I
> started looking into what could be causing it, such as IO timeouts, I
> ran dmesg and it hung after printing a timestamp. I can ctrl-c dmesg,
> but subsequent runs provide no better results.
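[For readers unfamiliar with the pattern: the flow described above — create a sparse file at the target size, then map the whole thing shared — looks roughly like the sketch below. This is an illustration in Python's mmap module, not Ethan's code; the path and the small 1 MiB size are stand-ins for the real 50G file on the SSD pool.]

```python
import mmap
import os

def map_backing_file(path, length):
    """Sparsely create `path` at `length` bytes and map it MAP_SHARED."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    # ftruncate on a fresh file allocates no data blocks, so the file is
    # sparse: space is only consumed as pages are dirtied and written back.
    os.ftruncate(fd, length)
    # MAP_SHARED: stores into the mapping dirty page-cache pages that the
    # kernel writes back to the file (and so to the pool) on its own
    # schedule -- hence data hitting the file "early and often".
    mem = mmap.mmap(fd, length,
                    flags=mmap.MAP_SHARED,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    os.close(fd)  # the mapping holds its own reference to the file
    return mem

# Illustration only; the real case maps a 50G file.
mem = map_backing_file("/tmp/backing.dat", 1 << 20)
mem[:4] = b"data"
mem.flush()  # msync(2): push dirty pages out now rather than later
```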
> I see no new messages in /var/adm/messages, as I'd expect.
>
> Eventually the system recovers; the latest case took over 10 minutes
> to recover after killing the application mentioned above, and I do
> then see disk timeouts in dmesg.
>
> So I can only assume that either there's a driver bug in the SATA/SAS
> controller I'm using and it's throwing timeouts, or the SSD is having
> issues. Looking at the zpool configuration, I see that failmode=wait,
> and since that SSD is the only member of the zpool, I would expect IO
> to hang.
>
> But does that mean that dmesg should hang also? Does that mean that
> the kernel has at least one thread stuck? Would failmode=continue be
> more desirable, or more resilient?
>
> During the hang, load-avg is artificially high, fmd being the one
> process that sticks out in prstat output. But fmdump -v doesn't show
> anything relevant.
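[One mechanism consistent with a multi-minute stall in a writeback-heavy MAP_SHARED workload is a large backlog of dirty pages being flushed in one burst. As an illustration only (this is not something proposed in the thread), the application could msync(2) the mapping in bounded windows as it writes, spreading writeback out over time. A sketch in Python mmap terms, with the 1 MiB window size an arbitrary tuning assumption:]

```python
import mmap

WINDOW = 1 << 20  # flush in 1 MiB windows; page-aligned, tuning is a guess

def write_with_bounded_dirty(mem, offset, payload):
    """Write `payload` at `offset`, then msync each window it touched.

    Keeps the amount of this write's data left dirty in the page cache
    to at most about one window, instead of letting it all accumulate.
    """
    end = offset + len(payload)
    mem[offset:end] = payload
    pos = (offset // WINDOW) * WINDOW  # round down to a window boundary
    while pos < end:
        size = min(WINDOW, len(mem) - pos)
        mem.flush(pos, size)  # msync(2) on that window
        pos += WINDOW
```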
I've seen these symptoms when a large number of errors were reported in
a short period of time and memory was low. What does "fmdump -eV" show?
It would also help to know which OS release you are using.
 -- richard

> Anyone have ideas on how to diagnose what's going on there?
>
> Thanks,
> Ethan
>
> System: Sun X4240, dual AMD Opteron 2347, 32G of ram
> SAS/SATA Controller: LSI 3081E
> OS: osol snv_98
> SSD: Intel X25-E
>
> _______________________________________________
> zfs-discuss mailing list
> [EMAIL PROTECTED]
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org