Re: [zfs-discuss] panic in buf_hash_remove

2006-06-13 Thread Noel Dellofano
Out of curiosity, is this panic reproducible? A bug should be filed on  
this for more investigation. Feel free to open one or I'll open it if  
you forward me info on where the crash dump is and information on the  
I/O stress test you were running.


thanks,
Noel :-)


 


"Question all the answers"
On Jun 12, 2006, at 3:45 PM, Daniel Rock wrote:


Hi,

had recently this panic during some I/O stress tests:

> $BAD TRAP: type=e (#pf Page fault) rp=fe80005c3980 addr=30 occurred  
in module "zfs" due to a NULL pointer dereference



sched:
#pf Page fault
Bad kernel fault at addr=0x30
pid=0, pc=0xf3ee322e, sp=0xfe80005c3a70, eflags=0x10206
cr0: 8005003b cr4: 6f0
cr2: 30 cr3: a49a000 cr8: c
        rdi: fe80f0aa2b40  rsi: 89c3a050  rdx: 6352
        rcx: 2f   r8: 0   r9: 30
        rax: 64f2  rbx: 2  rbp: fe80005c3aa0
        r10: fe80f0c979  r11: bd7189449a7087  r12: 89c3a040
        r13: 89c3a040  r14: 32790  r15: 0
        fsb: 8000  gsb: 8149d800   ds: 43
         es: 43    fs: 0   gs: 1c3
        trp: e    err: 0  rip: f3ee322e
         cs: 28   rfl: 10206  rsp: fe80005c3a70
         ss: 30

fe80005c3870 unix:die+eb ()
fe80005c3970 unix:trap+14f9 ()
fe80005c3980 unix:cmntrap+140 ()
fe80005c3aa0 zfs:buf_hash_remove+54 ()
fe80005c3b00 zfs:arc_change_state+1bd ()
fe80005c3b70 zfs:arc_evict_ghost+d1 ()
fe80005c3b90 zfs:arc_adjust+10f ()
fe80005c3bb0 zfs:arc_kmem_reclaim+d0 ()
fe80005c3bf0 zfs:arc_kmem_reap_now+30 ()
fe80005c3c60 zfs:arc_reclaim_thread+108 ()
fe80005c3c70 unix:thread_start+8 ()

syncing file systems...
 done
dumping to /dev/md/dsk/swap, offset 644874240, content: kernel
> $c
buf_hash_remove+0x54(89c3a040)
arc_change_state+0x1bd(c0099370, 89c3a040, c0098f30)

arc_evict_ghost+0xd1(c0099470, 14b5c0c4)
arc_adjust+0x10f()
arc_kmem_reclaim+0xd0()
arc_kmem_reap_now+0x30(0)
arc_reclaim_thread+0x108()
thread_start+8()
> ::status
debugging crash dump vmcore.0 (64-bit) from server
operating system: 5.11 snv_39 (i86pc)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fe80005c3980 addr=30 occurred  
in module "zfs" due to a NULL pointer dereference

dump content: kernel pages only



Daniel


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] panic in buf_hash_remove

2006-06-13 Thread Daniel Rock

Noel Dellofano wrote:

Out of curiosity, is this panic reproducible?


Hmm, not directly. The panic happened during a long-running I/O stress test 
in the middle of the night. The tests had already been running for ~6 hours 
at that point.



> A bug should be filed on
this for more investigation. Feel free to open one or I'll open it if 
you forward me info on where the crash dump is and information on the 
I/O stress test you were running.


The core dump is very large. Even compressed with bzip2 it is still ~300MB 
in size. I will upload it to my external server tonight and post details on 
where the crash dump can be found.


The tests I ran were Oracle database tests with many concurrent connections 
to the database. At the time of the crash, system load and I/O were only 
average, though.



Daniel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy - destroying a snapshot

2006-06-13 Thread Matthew Ahrens
On Mon, Jun 12, 2006 at 12:58:17PM +0200, Robert Milkowski wrote:
>   I'm writing a script to automatically take snapshots and destroy old
>   ones. I think it would be great to add another option to zfs destroy
>   so that only snapshots can be destroyed. Something like:
> 
>  zfs destroy -s SNAPSHOT
> 
>  so if something other than a snapshot is provided as an argument,
>  zfs destroy wouldn't actually destroy it.
>  That way it would be much safer to write scripts.
> 
>  What do you think?

I think that you shouldn't run commands that you don't want run.  If you
need some safeguards while developing a script, you can always write a
wrapper script around zfs(1m).
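
For example, a minimal wrapper along those lines (an untested sketch; the
script name is made up) could refuse to forward anything that does not name
an existing snapshot:

#!/bin/sh
# destroy_snap.sh -- hypothetical safety wrapper around zfs(1m):
# only pass the argument on to 'zfs destroy' if it looks like, and
# already exists as, a snapshot.
target="$1"
case "$target" in
*@*)    zfs list "$target" > /dev/null || exit 1    # must already exist
        zfs destroy "$target"
        ;;
*)      echo "refusing to destroy '$target': not a snapshot" >&2
        exit 1
        ;;
esac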

However, 'zfs destroy <filesystem>' will fail if the filesystem has snapshots
(presumably most will, if your intent is to destroy a snapshot), which
provides you with some safeguards.

--matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs destroy - destroying a snapshot

2006-06-13 Thread Jonathan Adams
On Tue, Jun 13, 2006 at 01:43:08PM -0700, Matthew Ahrens wrote:
> On Mon, Jun 12, 2006 at 12:58:17PM +0200, Robert Milkowski wrote:
> >   I'm writing a script to automatically take snapshots and destroy old
> >   ones. I think it would be great to add another option to zfs destroy
> >   so that only snapshots can be destroyed. Something like:
> > 
> >  zfs destroy -s SNAPSHOT
> > 
> >  so if something other than a snapshot is provided as an argument,
> >  zfs destroy wouldn't actually destroy it.
> >  That way it would be much safer to write scripts.
> > 
> >  What do you think?
> 
> I think that you shouldn't run commands that you don't want run.  If you
> need some safeguards while developing a script, you can always write a
> wrapper script around zfs(1m).

Alternatively, you could just make sure your argument always has a '@' in
it:

zfs destroy -s filesystem@snapshot

Cheers,
- jonathan

> However, 'zfs destroy <filesystem>' will fail if the filesystem has snapshots
> (presumably most will, if your intent is to destroy a snapshot), which
> provides you with some safeguards.
> 
> --matt

-- 
Jonathan Adams, Solaris Kernel Development
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: ZFS and databases

2006-06-13 Thread can you guess?
Sorry for resurrecting this interesting discussion so late:  I'm skimming 
backwards through the forum.

One comment about segregating database logs is that people who take their data 
seriously often want a 'belt plus suspenders' approach to recovery.  
Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't 
sufficient (though RAID-6 might be):  they want at least the redo logs kept 
separate, so that in the extremely unlikely event that they lose something in 
the (already replicated) database, the failure is guaranteed not to have 
affected the redo logs as well, and those logs can be used to reconstruct the 
current database state from a backup.

True, this will mean that you can't aggregate redo log activity with other 
transaction bulk-writes, but that's at least partly good as well:  databases 
are often extremely sensitive to redo log write latency and would not want such 
writes delayed by combination with other updates, let alone by up to a 5-second 
delay.
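
As a concrete illustration (the device and dataset names below are just 
placeholders), segregating the redo logs can be as simple as giving them 
their own small mirrored pool, separate from the pool holding the 
tablespaces:

   zpool create dbpool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
   zpool create redopool mirror c2t0d0 c2t1d0
   zfs create dbpool/oradata
   zfs create redopool/oralog

A failure that damages dbpool then cannot also take out the copy of the redo 
stream needed to roll the database forward from a backup.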

ZFS's synchronous write intent log could help here (if you replicate it:  
serious database people would consider even the very temporary exposure to a 
single failure inherent in an unmirrored log completely unacceptable), but that 
could also be slowed by other small synchronous write activity; conversely, databases 
often couldn't care less about the latency of many of their other writes, 
because their own (replicated) redo log has already established the persistence 
that they need.

As for direct I/O, it's not clear why ZFS couldn't support it:  it could verify 
each read in user memory against its internal checksum and perform its 
self-healing magic if necessary before returning completion status (which would 
be the same status it would return if the same situation occurred during its 
normal mode of operation:  either unconditional success or 
success-after-recovery if the application might care to know that); it could 
handle each synchronous write analogously, and if direct I/O mechanisms support 
lazy writes then presumably they tie up the user buffer until the write 
completes such that you could use your normal mechanisms there as well (just 
operating on the user buffer instead of your cache).  In this I'm assuming that 
'direct I/O' refers not to raw device access but to file-oriented access that 
simply avoids any internal cache use, such that you could still use your 
no-overwrite approach.

Of course, this also assumes that the direct I/O is always being performed in 
aligned integral multiples of checksum units by the application; if not, you'd 
either have to bag the checksum facility (this would not be an entirely 
unreasonable option to offer, given that some sophisticated applications might 
want to use their own even higher-level integrity mechanisms, e.g., across 
geographically-separated sites, and would not need yours) or run everything 
through cache as you normally do.  In suitably-aligned cases where you do 
validate the data you could avoid half the copy overhead (an issue of memory 
bandwidth as well as simply operation latency:  TPC-C submissions can be 
affected by this, though it may be rare in real-world use) by integrating the 
checksum calculation with the copy, but would still have multiple copies of the 
data taking up memory in a situation (direct I/O) where the application *by 
definition* does not expect you to be caching the data (quite likely because it 
is doing any desirable caching itself).

Tablespace contiguity may, however, be a deal-breaker for some users:  it is 
common for tablespaces to be scanned sequentially (when selection criteria 
don't mesh with existing indexes, perhaps especially in joins where the smaller 
tablespace (still too large to be retained in cache, though) is scanned 
repeatedly in an inner loop), and a DBMS often goes to some effort to keep them 
defragmented.  Until ZFS provides some effective continuous defragmenting 
mechanisms of its own, its no-overwrite policy may do more harm than good in 
such cases (since the database's own logs keep persistence latency low, while 
the backing tablespaces can then be updated at leisure).

I do want to comment on the observation that "enough concurrent 128K I/O can 
saturate a disk" - the apparent implication being that one could therefore do 
no better with larger accesses, an incorrect conclusion.  Current disks can 
stream out 128 KB in 1.5 - 3 ms., while taking 5.5 - 12.5 ms. for the 
average-seek-plus-partial-rotation required to get to that 128 KB in the first 
place.  Thus on a full drive serial random accesses to 128 KB chunks will yield 
only about 20% of the drive's streaming capability (by contrast, accessing data 
using serial random accesses in 4 MB contiguous chunks achieves around 90% of a 
drive's streaming capability):  one can do better on disks that support queuing 
if one allows queues to form, but this trades significantly increased average 
operation latency for the increase in throughput (and said increas
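
For anyone who wants to check the arithmetic, the claimed ratios follow 
directly from the figures above (using assumed mid-range values of 2.25 ms 
to stream 128 KB and 9 ms for the average seek plus partial rotation):

$ bc -l <<'EOF'
2.25 / (9 + 2.25)                /* random 128 KB chunks vs. streaming */
(32 * 2.25) / (9 + 32 * 2.25)    /* random 4 MB chunks (32 x 128 KB)   */
EOF
.20000000000000000000
.88888888888888888888

i.e. roughly 20% and 90% of the drive's streaming rate, respectively.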

[zfs-discuss] slow mkdir

2006-06-13 Thread Robert Milkowski
Hello zfs-discuss,

  NFS server on snv_39/SPARC, zfs filesystems exported.
  Solaris 10 x64 clients (zfs-s10-0315), filesystems mounted from nfs
  server using NFSv3 over TCP.

  What I see from NFS clients is that mkdir operations to ZFS filesystems
  can take as long as 20s, while for the UFS-exported filesystems I can't
  see even one taking more than 1s
  (there are also UFS filesystems exported from other NFS servers).


 How do I measure the time? On the NFS client I run:
bash-3.00# dtrace -n syscall::mkdir:entry'/execname == 
"our-app"/{self->t=timestamp;self->vt=vtimestamp;self->arg0=arg0}' -n 
syscall::mkdir:return'/self->t/[EMAIL PROTECTED] 
copyin(self->arg0,11)]=max((timestamp-self->t)/10);self->arg0=0;self->t=0;self->vt=0;}'
 -n tick-5s'{printa(@);}'
bash-3.00#
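
That one-liner got mangled by the list software around the aggregation. A
best-effort reconstruction - assuming the anonymous aggregation '@' that the
later printa(@) refers to, copyinstr() for the directory name, and a divisor
of 1000000 so the maxima are reported in milliseconds (the original divisor
was cut off) - would look roughly like this:

dtrace -n 'syscall::mkdir:entry /execname == "our-app"/
    { self->t = timestamp; self->vt = vtimestamp; self->arg0 = arg0; }' \
       -n 'syscall::mkdir:return /self->t/
    { @[copyinstr(self->arg0)] = max((timestamp - self->t) / 1000000);
      self->arg0 = 0; self->t = 0; self->vt = 0; }' \
       -n 'tick-5s { printa(@); }'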


What I get is times of up to 20-30s, but only for ZFS-exported filesystems.
It's not that every mkdir is that bad.

On one of those filesystems I tried several times to just mkdir a
directory from the command line - many tries created the new directory
immediately, but then one hung for ~8s.

-bash-3.00$ truss -ED -v all mkdir www
[...]
 0.  0. umask(022)  = 0
mkdir("www", 0777)  (sleeping...)
 8.0158  0.0001 mkdir("www", 0777)  = 0
 0.0002  0. _exit(0)

I tried it locally on ZFS (the same filesystem), not over NFS - this
time I got a very fast mkdir every time I tried it.

So the problem is probably somewhere in the client<->NFSv3<->ZFS path.

It looks like when traffic is lighter I see 3s at most, so
it's much better.

Any idea?


PS: of course there are no collisions, etc. on the network. At least I can't
find anything unusual.




-- 
Best regards,
 Robert  mailto:[EMAIL PROTECTED]
 http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS panic while mounting lofi device?

2006-06-13 Thread Nathanael Burton
I believe ZFS is causing a panic whenever I attempt to mount an ISO image (SXCR 
build 39) that happens to reside on a ZFS file system.  The problem is 100% 
reproducible.  I'm quite new to OpenSolaris, so I may be incorrect in saying 
it's ZFS's fault.  Also, let me know if you need any additional information or 
debug output to help diagnose things.

Config:
bash-3.00# uname -a
SunOS mathrock-opensolaris 5.11 opensol-20060605 i86pc i386 i86pc

Scenario:
bash-3.00# mount -F hsfs -o ro `lofiadm -a /data/OS/Solaris/sol-nv-b39-x86-dvd.iso` /tmp/test

After typing that, the system hangs, the network drops, and the machine panics 
and reboots. "/data" is a ZFS file system built on a raidz pool of 3 disks.

bash-3.00# zpool status sata
  pool: sata
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        sata        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0

errors: No known data errors
bash-3.00# zfs list sata/data
NAME        USED  AVAIL  REFER  MOUNTPOINT
sata/data  16.9G   533G  16.9G  /data

Error:
Jun 13 19:33:01 mathrock-opensolaris pseudo: [ID 129642 kern.info] pseudo-device: lofi0
Jun 13 19:33:01 mathrock-opensolaris genunix: [ID 936769 kern.info] lofi0 is /pseudo/lofi@0
Jun 13 19:33:04 mathrock-opensolaris unix: [ID 836849 kern.notice]
Jun 13 19:33:04 mathrock-opensolaris panic[cpu1]/thread=d1fafde0:
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 920532 kern.notice] page_unlock: page c51b29e0 is not locked
Jun 13 19:33:04 mathrock-opensolaris unix: [ID 10 kern.notice]
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafb54 unix:page_unlock+160 (c51b29e0)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafbb0 zfs:zfs_getpage+27a (d1e897c0, 3000, 0, )
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafc0c genunix:fop_getpage+36 (d1e897c0, 8000, 0, )
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafca0 genunix:segmap_fault+202 (ce043f58, fec23310,)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafd08 genunix:segmap_getmapflt+6fc (fec23310, d1e897c0,)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafd78 lofi:lofi_strategy_task+2c8 (d2b6bee0, 0, 0, 0, )
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafdc8 genunix:taskq_thread+194 (c5e87f30, 0)
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 353471 kern.notice] d1fafdd8 unix:thread_start+8 ()
Jun 13 19:33:04 mathrock-opensolaris unix: [ID 10 kern.notice]
Jun 13 19:33:04 mathrock-opensolaris genunix: [ID 672855 kern.notice] syncing file systems...
 
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] zfs destroy - destroying a snapshot

2006-06-13 Thread Robert Milkowski
Hello Matthew,

Tuesday, June 13, 2006, 10:43:08 PM, you wrote:

MA> On Mon, Jun 12, 2006 at 12:58:17PM +0200, Robert Milkowski wrote:
>>   I'm writing a script to automatically take snapshots and destroy old
>>   ones. I think it would be great to add another option to zfs destroy
>>   so that only snapshots can be destroyed. Something like:
>> 
>>  zfs destroy -s SNAPSHOT
>> 
>>  so if something other than a snapshot is provided as an argument,
>>  zfs destroy wouldn't actually destroy it.
>>  That way it would be much safer to write scripts.
>> 
>>  What do you think?

MA> I think that you shouldn't run commands that you don't want run.  If you
MA> need some safeguards while developing a script, you can always write a
MA> wrapper script around zfs(1m).

Well, that's like saying we don't need the '-f' option for zpool.

It's just too easy to screw up with ZFS.
Using snapshots in ZFS is so easy and penalty-free that I believe it will be
common. Many sysadmins will write their own scripts, and it's just too easy
to destroy a filesystem instead of a snapshot unintentionally. I know you can
write wrappers, etc., but that just complicates life, while a simple option
would solve the problem.

The same goes for 'zpool destroy' - IMHO it should never allow destroying
a pool if any fs|clone|snapshot is mounted unless -f is provided.

-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] zpool status and CKSUM errors

2006-06-13 Thread Robert Milkowski
Hello Eric,

Monday, June 12, 2006, 11:21:24 PM, you wrote:

ES> I reproduced this pretty easily on a lab machine.  I've filed:

ES> 6437568 ditto block repair is incorrectly propagated to root vdev

Good, thank you.

ES> To track this issue.  Keep in mind that you do have a flakey
ES> controller/lun/something.  If this had been a user data block, your data
ES> would be gone.

Well, probably something is wrong.
But it surprises me that every time I get a CKSUM error in that config it
relates to metadata... that's quite unlikely, isn't it?

BTW: if it were a data block, then the app reading that block would get a
proper error and that's it - right?


-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss