Ethan Erchinger wrote:
> Hi all,
>
> First, I'll say my intent is not to spam a bunch of lists, but after
> posting to opensolaris-discuss I had someone communicate with me offline
> that these lists would possibly be a better place to start. So here we
> are. For those on all three lists, sorry for the repetition.
>
> Second, this message is meant to solicit help in diagnosing the issue
> described below. Any hints on how DTrace may help, or where in general
> to start, would be much appreciated. Back to the subject at hand.
>
> -----------------------
>
> I'm testing an application which makes use of a large file mmap'd into
> memory, as if the application were using malloc(). The file is roughly
> 2x the size of physical ram. Basically, I'm seeing the system stall for
> long periods of time, 60+ seconds, and then resume. The file lives on
> an SSD (Intel X25-E), and I'm using ZFS's lzjb compression to make more
> efficient use of the ~30G of space provided by that SSD.
>
> The general flow of things is: start the application and ask it to use
> a 50G file. The file is created sparsely at the designated location,
> then mmap is called on the entire file. All fine up to this point.
>
> I then start loading data into the application, and it starts pushing
> data to the file as you'd expect. Data is pushed to the file early and
> often, as it's mmap'd with the MAP_SHARED flag. When the application's
> resident size reaches about 80% of the physical ram on the system, the
> system starts paging, and things still work relatively well, though
> slower, as expected.
>
> Soon after, at about 40G of data, I see stalls accessing the SSD
> (according to iostat); in other words, no IO to that drive. When I
> started looking into what could be causing it, such as IO timeouts, I
> ran dmesg and it hung after printing a timestamp. I can ctrl-c dmesg,
> but subsequent runs provide no better results.
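[For readers unfamiliar with the pattern: the flow described above — create a sparse file at the target size, then map the whole thing shared — looks roughly like the sketch below. This is an illustration in Python's mmap module, not Ethan's code; the path and the small 1 MiB size are stand-ins for the real 50G file on the SSD pool.]

```python
import mmap
import os

def map_backing_file(path, length):
    """Sparsely create `path` at `length` bytes and map it MAP_SHARED."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    # ftruncate on a fresh file allocates no data blocks, so the file is
    # sparse: space is only consumed as pages are dirtied and written back.
    os.ftruncate(fd, length)
    # MAP_SHARED: stores into the mapping dirty page-cache pages that the
    # kernel writes back to the file (and so to the pool) on its own
    # schedule -- hence data hitting the file "early and often".
    mem = mmap.mmap(fd, length,
                    flags=mmap.MAP_SHARED,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    os.close(fd)  # the mapping holds its own reference to the file
    return mem

# Illustration only; the real case maps a 50G file.
mem = map_backing_file("/tmp/backing.dat", 1 << 20)
mem[:4] = b"data"
mem.flush()  # msync(2): push dirty pages out now rather than later
```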
> I see no new messages in /var/adm/messages, as I'd expect.
>
> Eventually the system recovers; the latest case took over 10 minutes
> to recover after killing the application mentioned above, and I do
> then see disk timeouts in dmesg.
>
> So I can only assume that either there's a driver bug in the SATA/SAS
> controller I'm using and it's throwing timeouts, or the SSD is having
> issues. Looking at the zpool configuration, I see that failmode=wait,
> and since that SSD is the only member of the zpool, I would expect IO
> to hang.
>
> But does that mean that dmesg should hang also? Does that mean that
> the kernel has at least one thread stuck? Would failmode=continue be
> more desirable, or more resilient?
>
> During the hang, load-avg is artificially high, fmd being the one
> process that sticks out in prstat output. But fmdump -v doesn't show
> anything relevant.
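[One mechanism consistent with a multi-minute stall in a writeback-heavy MAP_SHARED workload is a large backlog of dirty pages being flushed in one burst. As an illustration only (this is not something proposed in the thread), the application could msync(2) the mapping in bounded windows as it writes, spreading writeback out over time. A sketch in Python mmap terms, with the 1 MiB window size an arbitrary tuning assumption:]

```python
import mmap

WINDOW = 1 << 20  # flush in 1 MiB windows; page-aligned, tuning is a guess

def write_with_bounded_dirty(mem, offset, payload):
    """Write `payload` at `offset`, then msync each window it touched.

    Keeps the amount of this write's data left dirty in the page cache
    to at most about one window, instead of letting it all accumulate.
    """
    end = offset + len(payload)
    mem[offset:end] = payload
    pos = (offset // WINDOW) * WINDOW  # round down to a window boundary
    while pos < end:
        size = min(WINDOW, len(mem) - pos)
        mem.flush(pos, size)  # msync(2) on that window
        pos += WINDOW
```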
I've seen these symptoms when a large number of errors were reported in
a short period of time and memory was low. What does "fmdump -eV" show?
It would also help to know which OS release you are using.
 -- richard

> Anyone have ideas on how to diagnose what's going on there?
>
> Thanks,
> Ethan
>
> System: Sun X4240, dual AMD Opteron 2347, 32G of ram
> SAS/SATA Controller: LSI 3081E
> OS: osol snv_98
> SSD: Intel X25-E
>
> _______________________________________________
> zfs-discuss mailing list
> [EMAIL PROTECTED]
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org