Re: [zfs-discuss] panic in buf_hash_remove
Out of curiosity, is this panic reproducible? A bug should be filed on this for more investigation. Feel free to open one, or I'll open it if you forward me info on where the crash dump is and information on the I/O stress test you were running.

thanks,
Noel :-)

** "Question all the answers"

On Jun 12, 2006, at 3:45 PM, Daniel Rock wrote:

Hi,

I recently had this panic during some I/O stress tests:

> $BAD TRAP: type=e (#pf Page fault) rp=fe80005c3980 addr=30 occurred in module "zfs" due to a NULL pointer dereference
sched: #pf Page fault
Bad kernel fault at addr=0x30
pid=0, pc=0xf3ee322e, sp=0xfe80005c3a70, eflags=0x10206
cr0: 8005003b cr4: 6f0 cr2: 30 cr3: a49a000 cr8: c
        rdi: fe80f0aa2b40 rsi: 89c3a050 rdx: 6352
        rcx: 2f           r8: 0         r9: 30
        rax: 64f2         rbx: 2        rbp: fe80005c3aa0
        r10: fe80f0c979   r11: bd7189449a7087 r12: 89c3a040
        r13: 89c3a040     r14: 32790    r15: 0
        fsb: 8000         gsb: 8149d800 ds: 43
        es: 43            fs: 0         gs: 1c3
        trp: e            err: 0        rip: f3ee322e
        cs: 28            rfl: 10206    rsp: fe80005c3a70
        ss: 30

fe80005c3870 unix:die+eb ()
fe80005c3970 unix:trap+14f9 ()
fe80005c3980 unix:cmntrap+140 ()
fe80005c3aa0 zfs:buf_hash_remove+54 ()
fe80005c3b00 zfs:arc_change_state+1bd ()
fe80005c3b70 zfs:arc_evict_ghost+d1 ()
fe80005c3b90 zfs:arc_adjust+10f ()
fe80005c3bb0 zfs:arc_kmem_reclaim+d0 ()
fe80005c3bf0 zfs:arc_kmem_reap_now+30 ()
fe80005c3c60 zfs:arc_reclaim_thread+108 ()
fe80005c3c70 unix:thread_start+8 ()

syncing file systems... done
dumping to /dev/md/dsk/swap, offset 644874240, content: kernel

> $c
buf_hash_remove+0x54(89c3a040)
arc_change_state+0x1bd(c0099370, 89c3a040, c0098f30)
arc_evict_ghost+0xd1(c0099470, 14b5c0c4)
arc_adjust+0x10f()
arc_kmem_reclaim+0xd0()
arc_kmem_reap_now+0x30(0)
arc_reclaim_thread+0x108()
thread_start+8()

> ::status
debugging crash dump vmcore.0 (64-bit) from server
operating system: 5.11 snv_39 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fe80005c3980 addr=30 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only


Daniel
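For reference, the usual way to pull this kind of information out of a saved crash dump is mdb(1). A minimal sketch, assuming savecore(1M) wrote the dump as unix.0/vmcore.0 under the default crash directory (the path and dump number are assumptions, adjust to your setup):

    # ::status - dump and panic summary
    # ::msgbuf - console messages leading up to the panic
    # $c       - stack trace of the panicking thread
    # $r       - register state at the trap
    cd /var/crash/`hostname`
    echo '::status' | mdb unix.0 vmcore.0
    echo '::msgbuf' | mdb unix.0 vmcore.0
    echo '$c'       | mdb unix.0 vmcore.0
    echo '$r'       | mdb unix.0 vmcore.0

That output, plus the build number and a description of the workload, is usually enough to get a bug started even before the full dump is uploaded anywhere.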
Re: [zfs-discuss] panic in buf_hash_remove
Noel Dellofano wrote:
> Out of curiosity, is this panic reproducible?

Hmm, not directly. The panic happened during a long-running I/O stress test in the middle of the night. The tests had already been running for ~6 hours at that time.

> A bug should be filed on this for more investigation. Feel free to open one or I'll open it if you forward me info on where the crash dump is and information on the I/O stress test you were running.

The crash dump is very large; even compressed with bzip2 it is still ~300MB. I will upload it to my external server tonight and post details on where it can be found.

The tests I ran were Oracle database tests with many concurrent connections to the database. At the time of the crash, system load and I/O were just average, though.


Daniel
Re: [zfs-discuss] zfs destroy - destroying a snapshot
On Mon, Jun 12, 2006 at 12:58:17PM +0200, Robert Milkowski wrote:
> I'm writing a script to automatically take snapshots and destroy old
> ones. I think it would be great to add another option to zfs destroy
> so that only snapshots can be destroyed. Something like:
>
> zfs destroy -s SNAPSHOT
>
> so if something other than a snapshot is provided as an argument,
> zfs destroy wouldn't actually destroy it.
> That way it would be much safer to write scripts.
>
> What do you think?

I think that you shouldn't run commands that you don't want run. If you need some safeguards while developing a script, you can always write a wrapper script around zfs(1M).

However, 'zfs destroy <filesystem>' will fail if the filesystem has snapshots (presumably most will, if your intent is to destroy a snapshot), which provides you with some safeguards.

--matt
Re: [zfs-discuss] zfs destroy - destroying a snapshot
On Tue, Jun 13, 2006 at 01:43:08PM -0700, Matthew Ahrens wrote:
> On Mon, Jun 12, 2006 at 12:58:17PM +0200, Robert Milkowski wrote:
> > I'm writing a script to automatically take snapshots and destroy old
> > ones. I think it would be great to add another option to zfs destroy
> > so that only snapshots can be destroyed. Something like:
> >
> > zfs destroy -s SNAPSHOT
> >
> > so if something other than a snapshot is provided as an argument,
> > zfs destroy wouldn't actually destroy it.
> > That way it would be much safer to write scripts.
> >
> > What do you think?
>
> I think that you shouldn't run commands that you don't want run. If you
> need some safeguards while developing a script, you can always write a
> wrapper script around zfs(1m).

Alternatively, you could just make sure your argument always has a '@' in it:

zfs destroy -s [EMAIL PROTECTED]

Cheers,
- jonathan

> However, 'zfs destroy <filesystem>' will fail if the filesystem has snapshots
> (presumably most will, if your intent is to destroy a snapshot), which
> provides you with some safeguards.
>
> --matt

--
Jonathan Adams, Solaris Kernel Development
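Combining Matt's wrapper-script suggestion with the '@' check, a minimal sketch of such a guard might look like the following (the script name, messages, and the exact checks are illustrative only, not an existing tool):

    #!/bin/sh
    # destroy_snap.sh -- refuse to destroy anything that isn't a snapshot.
    # Illustrative sketch only.

    snap="$1"

    # A snapshot name always contains '@' (dataset@snapname).
    case "$snap" in
    *@*) ;;
    *)   echo "refusing: '$snap' does not look like a snapshot" >&2; exit 1 ;;
    esac

    # Belt and suspenders: ask zfs what the object actually is.
    type=`zfs get -H -o value type "$snap"` || exit 1
    if [ "$type" != "snapshot" ]; then
            echo "refusing: '$snap' is a $type, not a snapshot" >&2
            exit 1
    fi

    zfs destroy "$snap"

Invoked as './destroy_snap.sh tank/home@2006-06-13' it destroys the snapshot; given a plain filesystem name it refuses at the first check.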
[zfs-discuss] Re: ZFS and databases
Sorry for resurrecting this interesting discussion so late: I'm skimming backwards through the forum.

One comment about segregating database logs is that people who take their data seriously often want a 'belt plus suspenders' approach to recovery. Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't sufficient (though RAID-6 might be): they want at least the redo logs kept separate, so that in the extremely unlikely event that they lose something in the (already replicated) database, the failure is guaranteed not to have affected the redo logs as well, from which they can reconstruct the current database state from a backup.

True, this means that you can't aggregate redo log activity with other transaction bulk-writes, but that's at least partly good as well: databases are often extremely sensitive to redo log write latency and would not want such writes delayed by combination with other updates, let alone by up to a 5-second delay. ZFS's synchronous write intent log could help here (if you replicate it: serious database people would consider even the very temporary exposure to a single failure inherent in an unmirrored log completely unacceptable), but that could also be slowed by other synchronous small-write activity. Conversely, databases often couldn't care less about the latency of many of their other writes, because their own (replicated) redo log has already established the persistence that they need.

As for direct I/O, it's not clear why ZFS couldn't support it: it could verify each read in user memory against its internal checksum and perform its self-healing magic if necessary before returning completion status (which would be the same status it would return if the same situation occurred during its normal mode of operation: either unconditional success, or success-after-recovery if the application might care to know that). It could handle each synchronous write analogously, and if direct I/O mechanisms support lazy writes then presumably they tie up the user buffer until the write completes, such that you could use your normal mechanisms there as well (just operating on the user buffer instead of your cache). In this I'm assuming that 'direct I/O' refers not to raw device access but to file-oriented access that simply avoids any internal cache use, such that you could still use your no-overwrite approach.

Of course, this also assumes that the direct I/O is always being performed in aligned, integral multiples of checksum units by the application; if not, you'd either have to drop the checksum facility (not an entirely unreasonable option to offer, given that some sophisticated applications might want to use their own even higher-level integrity mechanisms, e.g., across geographically separated sites, and would not need yours) or run everything through cache as you normally do. In suitably aligned cases where you do validate the data, you could avoid half the copy overhead (an issue of memory bandwidth as well as simply operation latency: TPC-C submissions can be affected by this, though it may be rare in real-world use) by integrating the checksum calculation with the copy, but you would still have multiple copies of the data taking up memory in a situation (direct I/O) where the application *by definition* does not expect you to be caching the data (quite likely because it is doing any desirable caching itself).
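To make the redo-log segregation above concrete, here is a minimal sketch of one way to lay it out with ZFS. The pool, disk, and dataset names are invented for the example, and the recordsize tuning is only a commonly cited starting point, not a recommendation:

    # Tablespaces live in one pool, redo logs in another, so that losing
    # (or corrupting) one pool cannot take the other with it.
    zpool create datapool raidz c2t0d0 c2t1d0 c2t2d0
    zpool create redopool mirror c3t0d0 c3t1d0

    zfs create datapool/oradata
    zfs create redopool/oralog

    # Match recordsize to the database block size (8K is a common Oracle
    # block size; the right value is workload-dependent).
    zfs set recordsize=8k datapool/oradata

The main point is the physical separation of the two pools; any property tuning on top of that is secondary.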
Tablespace contiguity may, however, be a deal-breaker for some users: it is common for tablespaces to be scanned sequentially (when selection criteria don't mesh with existing indexes, perhaps especially in joins where the smaller tablespace (still too large to be retained in cache, though) is scanned repeatedly in an inner loop), and a DBMS often goes to some effort to keep them defragmented. Until ZFS provides some effective continuous defragmenting mechanism of its own, its no-overwrite policy may do more harm than good in such cases (since the database's own logs keep persistence latency low, while the backing tablespaces can then be updated at leisure).

I do want to comment on the observation that "enough concurrent 128K I/O can saturate a disk" - the apparent implication being that one could therefore do no better with larger accesses, which is an incorrect conclusion. Current disks can stream out 128 KB in 1.5 - 3 ms, while taking 5.5 - 12.5 ms for the average seek plus partial rotation required to get to that 128 KB in the first place. Thus on a full drive, serial random accesses to 128 KB chunks will yield only about 20% of the drive's streaming capability (by contrast, accessing data using serial random accesses in 4 MB contiguous chunks achieves around 90% of a drive's streaming capability). One can do better on disks that support queuing if one allows queues to form, but this trades significantly increased average operation latency for the increase in throughput (and said increas
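Those percentages are easy to sanity-check. A small sketch, assuming round numbers from the ranges above (~65 MB/s streaming rate, ~9 ms average seek plus partial rotation):

    #!/bin/sh
    # Back-of-the-envelope check of the ~20% / ~90% figures above.
    STREAM_MBS=65        # assumed streaming rate, MB/s
    POSITION_MS=9        # assumed average seek + partial rotation, ms

    for CHUNK_KB in 128 4096; do
        # transfer time for one chunk, in ms
        XFER_MS=`echo "scale=2; $CHUNK_KB * 1000 / 1024 / $STREAM_MBS" | bc`
        # fraction of the streaming rate actually delivered
        EFF=`echo "scale=2; $XFER_MS * 100 / ($XFER_MS + $POSITION_MS)" | bc`
        echo "${CHUNK_KB} KB chunks: ${XFER_MS} ms transfer, ${EFF}% of streaming rate"
    done

This prints roughly 18% for 128 KB chunks and 87% for 4 MB chunks, in line with the 20% and 90% figures quoted above.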
[zfs-discuss] slow mkdir
Hello zfs-discuss,

NFS server on snv_39/SPARC, zfs filesystems exported. Solaris 10 x64 clients (zfs-s10-0315), filesystems mounted from the nfs server using NFSv3 over TCP.

What I see from the NFS clients is that mkdir operations on ZFS filesystems can take as long as 20s! while on UFS exported filesystems I can't see even one taking over 1s (there are also UFS filesystems exported from other NFS servers).

How do I measure the time? On the nfs client I do:

bash-3.00# dtrace -n syscall::mkdir:entry'/execname == "our-app"/{self->t=timestamp;self->vt=vtimestamp;self->arg0=arg0}' -n syscall::mkdir:return'/self->t/[EMAIL PROTECTED] copyin(self->arg0,11)]=max((timestamp-self->t)/10);self->arg0=0;self->t=0;self->vt=0;}' -n tick-5s'{printa(@);}'
bash-3.00#

What I get is times of even 20-30s, but only for ZFS exported filesystems. It's not that all mkdirs are that bad. On one of those filesystems I tried several times to just mkdir some directory from the command line - for many tries the new directory was created immediately, but then it hung for ~8s.

-bash-3.00$ truss -ED -v all mkdir www
[...]
 0.     0.     umask(022)                       = 0
mkdir("www", 0777)      (sleeping...)
 8.0158 0.0001 mkdir("www", 0777)               = 0
 0.0002 0.     _exit(0)

I tried it locally on ZFS (the same filesystem), not over NFS - this time I get a very fast mkdir every time I try it. So it's probably something with client<->NFSv3<->ZFS.

It looks like when traffic is lighter I see 3s at most, so it's much better. Any idea?

ps. of course there are no collisions, etc. on the network. At least I can't find anything unusual.

--
Best regards,
Robert                          mailto:[EMAIL PROTECTED]
                                http://milek.blogspot.com
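The aggregation in the one-liner above was garbled by the list's address obfuscation, so here is a sketch of an equivalent client-side measurement; the execname "our-app" is just a placeholder for whatever process issues the mkdirs:

    #!/bin/sh
    # Distribution of mkdir(2) latency seen by the client application,
    # keyed by the path being created.
    dtrace \
      -n 'syscall::mkdir:entry /execname == "our-app"/
          { self->ts = timestamp; self->arg0 = arg0; }' \
      -n 'syscall::mkdir:return /self->ts/
          { @lat[copyinstr(self->arg0)] = quantize(timestamp - self->ts);
            self->ts = 0; self->arg0 = 0; }' \
      -n 'tick-5s { printa(@lat); trunc(@lat); }'

The quantize() histogram (in nanoseconds) makes it easy to see whether the slow mkdirs are a tail of outliers or a second distinct mode.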
[zfs-discuss] ZFS panic while mounting lofi device?
I believe ZFS is causing a panic whenever I attempt to mount an iso image (SXCR build 39) that happens to reside on a ZFS file system. The problem is 100% reproducible. I'm quite new to OpenSolaris, so I may be incorrect in saying it's ZFS' fault. Also, let me know if you need any additional information or debug output to help diagnose things.

Config:

bash-3.00# uname -a
SunOS mathrock-opensolaris 5.11 opensol-20060605 i86pc i386 i86pc

Scenario:

bash-3.00# mount -F hsfs -o ro `lofiadm -a /data/OS/Solaris/sol-nv-b39-x86-dvd.iso` /tmp/test

After typing that the system hangs, the network drops, and the machine panics and reboots. "/data" is a ZFS file system built on a raidz pool of 3 disks.

bash-3.00# zpool status sata
  pool: sata
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        sata        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0

errors: No known data errors

bash-3.00# zfs list sata/data
NAME        USED  AVAIL  REFER  MOUNTPOINT
sata/data  16.9G   533G  16.9G  /data

Error:

Jun 13 19:33:01 mathrock-opensolaris pseudo: [ID 129642 kern.info] pseudo-device: lofi0
Jun 13 19:33:01 mathrock-opensolaris genunix: [ID 936769 kern.info] lofi0 is /pseudo/[EMAIL PROTECTED]
Jun 13 19:33:04 mathrock-opensolaris unix: [ID 836849 kern.notice]
Jun 13 19:33:04 mathrock-opensolaris ^Mpanic[cpu1]/thread=d1fafde0:
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 920532 kern.notice] page_unlock: page c51b29e0 is not locked
Jun 13 19:33:04 mathrock-opensolaris unix: [ID 10 kern.notice]
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafb54 unix:page_unlock+160 (c51b29e0)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafbb0 zfs:zfs_getpage+27a (d1e897c0, 3000, 0, )
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafc0c genunix:fop_getpage+36 (d1e897c0, 8000, 0, )
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafca0 genunix:segmap_fault+202 (ce043f58, fec23310,)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafd08 genunix:segmap_getmapflt+6fc (fec23310, d1e897c0,)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafd78 lofi:lofi_strategy_task+2c8 (d2b6bee0, 0, 0, 0, )
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafdc8 genunix:taskq_thread+194 (c5e87f30, 0)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafdd8 unix:thread_start+8 ()
Jun 13 19:33:04 mathrock-opensolaris unix: [ID 10 kern.notice]
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 672855 kern.notice] syncing file systems...
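One way to narrow this down before filing a bug would be to split the compound command into separate steps, to see whether the lofi attach, a raw read of the lofi device, or only the hsfs mount triggers the panic. A sketch, reusing the iso path from above:

    # Step 1: attach the file to a lofi device, nothing mounted yet.
    DEV=`lofiadm -a /data/OS/Solaris/sol-nv-b39-x86-dvd.iso`
    echo "attached as $DEV"

    # Step 2: read from the lofi device directly (no hsfs involved).
    dd if=$DEV of=/dev/null bs=128k count=100

    # Step 3: only now try the hsfs mount.
    mount -F hsfs -o ro $DEV /tmp/test

    # Cleanup, if the box survives this far.
    umount /tmp/test
    lofiadm -d $DEV

If step 2 alone reproduces the panic, the problem is in the lofi-on-ZFS read path; if only step 3 does, the hsfs mount is part of the trigger. Either way that detail is worth including in the bug report.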
Re[2]: [zfs-discuss] zfs destroy - destroying a snapshot
Hello Matthew,

Tuesday, June 13, 2006, 10:43:08 PM, you wrote:

MA> On Mon, Jun 12, 2006 at 12:58:17PM +0200, Robert Milkowski wrote:
>> I'm writing a script to automatically take snapshots and destroy old
>> ones. I think it would be great to add another option to zfs destroy
>> so that only snapshots can be destroyed. Something like:
>>
>> zfs destroy -s SNAPSHOT
>>
>> so if something other than a snapshot is provided as an argument,
>> zfs destroy wouldn't actually destroy it.
>> That way it would be much safer to write scripts.
>>
>> What do you think?

MA> I think that you shouldn't run commands that you don't want run. If you
MA> need some safeguards while developing a script, you can always write a
MA> wrapper script around zfs(1M).

Well, it's like saying that we don't need the '-f' option for zpool. It's just too easy to screw up with ZFS. Using snapshots in ZFS is so easy and penalty-free that I believe it will become common. Many sysadmins will write their own scripts, and it's just too easy to destroy a filesystem instead of a snapshot unintentionally. I know you can write wrappers, etc., but that just complicates life, while a simple option would solve the problem.

The same goes for 'zpool destroy' - imho it should never allow destroying a pool if any fs|clone|snapshot is mounted unless -f is provided.

--
Best regards,
Robert                          mailto:[EMAIL PROTECTED]
                                http://milek.blogspot.com
Re[2]: [zfs-discuss] zpool status and CKSUM errors
Hello Eric,

Monday, June 12, 2006, 11:21:24 PM, you wrote:

ES> I reproduced this pretty easily on a lab machine. I've filed:

ES> 6437568 ditto block repair is incorrectly propagated to root vdev

Good, thank you.

ES> To track this issue. Keep in mind that you do have a flakey
ES> controller/lun/something. If this had been a user data block, your data
ES> would be gone.

Well, probably something is wrong. But it surprises me that every time I get a CKSUM error in that config it relates to metadata... that's quite unlikely, isn't it?

btw: if it were a data block, then the app reading that block would get a proper error and that's it - right?

--
Best regards,
Robert                          mailto:[EMAIL PROTECTED]
                                http://milek.blogspot.com
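On the last question: yes, when no good copy of a user data block is left, the read is returned to the application as an I/O error, and the pool records the error. A quick way to check what the pool has recorded (the pool name 'sata' is borrowed from another thread in this digest purely as an example):

    # Show error counters; with -v, builds that keep a persistent error
    # log also list the affected files or objects.
    zpool status -v sata

    # Force every allocated block (data and metadata) to be re-read and
    # verified, then look at the counters again afterwards.
    zpool scrub sata
    zpool status -v sata

A scrub that keeps turning up checksum errors on the same controller would also support Eric's "flakey controller/lun/something" theory.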