Hi Marion,

Thanks for your swift reply!
> Have you got the latest firmware on your LSI 1068E HBA's? These have been
> known to have lockups/timeouts when used with SAS expanders (disk enclosures)
> with incompatible firmware revisions, and/or with older mpt drivers.

I'll need to check that out -- I'm 90% sure that these are fresh out-of-the-box
HBAs. Will try an upgrade there and see if we get any joy...

> The MD1220 is a 6Gbit/sec device. You may be better off with a matching
> HBA -- Dell has certainly told us the MD1200-series is not intended for
> use with the 3Gbit/sec HBA's. We're doing fine with the LSI SAS 9200-8e,
> for example, when connecting to Dell MD1200's with the 2TB "nearline SAS"
> disk drives.

I was aware the MD1220 is a 6G device, but I figured that since our IO
throughput doesn't actually come close to saturating 3Gbit/sec, it would just
operate at the lower speed and be OK. I guess it is something to look at if I
run out of other options...

> Last, are you sure it's memory-related? You might keep an eye on "arcstat.pl"
> output and see what the ARC sizes look like just prior to lockup. Also,
> maybe you can look up instructions on how to force a crash dump when the
> system hangs -- one of the experts around here could tell a lot from a
> crash dump file.

I'm starting to doubt that it is a memory issue now -- especially since I now
have some results from my latest "test"...

Output of arcstat.pl looked like this just prior to the lockup:

19:57:36    24G   24G   94   161    61   194       1       1
19:57:41    24G   24G   96   174    62   213       0       0
    time  arcsz     c  mh%  mhit  hit%  hits  l2hit%  l2hits
19:57:46    23G   24G   94   161    62   192       1       1
19:57:51    24G   24G   96   169    63   205       0       0
19:57:56    24G   24G   95   169    61   206       0       0
^-- This is the very last line printed...

I actually discovered and rebooted the machine via DRAC at around 20:44, so it
had been in its bad state for around 1 hour.

Some snippets from the output some 20 minutes earlier show the point at which
arcsz grew to reach the maximum:

    time  arcsz     c  mh%  mhit  hit%  hits  l2hit%  l2hits
19:36:45    21G   24G   95   152    58   177       0       0
19:37:00    22G   24G   95   156    57   182       0       0
19:37:15    22G   24G   95   159    59   185       0       0
19:37:30    23G   24G   94   153    58   178       0       0
19:37:45    23G   24G   95   169    59   195       0       0
19:38:00    24G   24G   95   160    59   187       0       0
19:38:25    24G   24G   96   151    58   177       0       0

So it seems that arcsz reaching the 24G maximum wasn't necessarily to blame,
since the system operated for a good 20 mins in this state.

I was also logging "vmstat 5" prior to the crash (though I forgot to include
timestamps in my output) and these are the final lines recorded in that log:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 25885248 18012208 71 2090 0 0 0 0 0 0 0 0 22 17008 210267 30229 1 5 94
 0 0 0 25884764 18001848 71 2044 0 0 0 0 0 0 0 0 25 14846 151228 25911 1 5 94
 0 0 0 25884208 17991876 71 2053 0 0 0 0 0 0 0 0  8 16343 185416 28946 1 5 93

So it seems there was some 17-18G free in the system when the lockup occurred.
Curious...

I was also capturing some arc info from mdb -k, and the output prior to the
lockup was:

Monday, October 31, 2011 07:57:51 PM UTC
arc_no_grow      =         0
arc_tempreserve  =         0 MB
arc_meta_used    =      4621 MB
arc_meta_limit   =     20480 MB
arc_meta_max     =      4732 MB

Monday, October 31, 2011 07:57:56 PM UTC
arc_no_grow      =         0
arc_tempreserve  =         0 MB
arc_meta_used    =      4622 MB
arc_meta_limit   =     20480 MB
arc_meta_max     =      4732 MB

Looks like metadata was not primarily responsible for consuming all of that
24G of ARC in the arcstat.pl output...
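For reference, the mdb figures above were collected with a small loop along
these lines (just a sketch -- the ::arc dcmd does print those fields, but the
egrep pattern, timestamp format and 5-second interval here are illustrative
rather than exact):

  #!/bin/sh
  # Sample the ZFS ARC metadata counters from the live kernel every 5 seconds.
  while true; do
      date -u
      echo "::arc" | mdb -k | \
          egrep "arc_no_grow|arc_tempreserve|arc_meta_(used|limit|max)"
      echo
      sleep 5
  done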
Also, it seems there is nothing interesting in /var/adm/messages leading up to
my rebooting:

Oct 31 18:42:57 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:44:01 mslvstdp02r last message repeated 1 time
Oct 31 18:45:05 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:46:09 mslvstdp02r last message repeated 1 time
Oct 31 18:47:23 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:06:13 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:09:27 mslvstdp02r last message repeated 4 times
Oct 31 19:25:04 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:28:17 mslvstdp02r last message repeated 3 times
Oct 31 19:46:17 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:49:32 mslvstdp02r last message repeated 4 times
Oct 31 20:44:33 mslvstdp02r genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_151a 64-bit
Oct 31 20:44:33 mslvstdp02r genunix: [ID 877030 kern.notice] Copyright (c) 1983, 2010, Oracle and/or its affiliates. All rights reserved.

Just some ntpd stuff really...

I'm going to check out a firmware upgrade next. I can't believe that this is
really an out-of-memory situation when there is 17-18G free reported by
vmstat... Let's see...

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office:    +1 (415) 671 6080