[zfs-discuss] About bug 6486493 (ZFS boot incompatible with the SATA framework)

2007-10-03 Thread Marc Bevand
I would like to test ZFS boot on my home server, but according to bug 
6486493 ZFS boot cannot be used if the disks are attached to a SATA
controller handled by a driver using the new SATA framework (which
is my case: driver si3124). I have never heard of someone having
successfully used ZFS boot with the SATA framework, so I assume this
bug is real and everybody out there playing with ZFS boot is doing so
with PATA controllers, or SATA controllers operating in compatibility
mode, or SCSI controllers, right ?

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Roch - PAE
Rayson Ho writes:

 > 1) Modern DBMSs cache database pages in their own buffer pool because
 > it is less expensive than to access data from the OS. (IIRC, MySQL's
 > MyISAM is the only one that relies on the FS cache, but a lot of MySQL
 > sites use INNODB which has its own buffer pool)
 > 

The DB can and should cache data whether or not directio is used.

 > 2) Also, direct I/O is faster because it avoid double buffering.
 > 

A piece of data can be in one buffer, 2 buffers, 3
buffers. That says nothing about performance. More below.

So I guess you mean DIO is faster because it avoids the
extra copy: DMA straight to the user buffer rather than DMA
to a kernel buffer followed by a copy to the user buffer. If
an I/O is 5 ms and an 8K copy is about 10 usec, is avoiding
the copy really the most urgent thing to work on ?
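
(As a back-of-the-envelope check on that ratio, using the numbers above; bc(1) here is just one way to do the arithmetic:)

# 10 usec copy expressed in ms, divided by a 5 ms I/O
$ echo "scale=4; 0.010 / 5.0" | bc
.0020
# i.e. the copy is roughly 0.2% of the I/O time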



 > Rayson
 > 
 > 
 > 
 > 
 > On 10/2/07, eric kustarz <[EMAIL PROTECTED]> wrote:
 > > Not yet, see:
 > > 6429855 Need way to tell ZFS that caching is a lost cause
 > >
 > > Is there a specific reason why you need to do the caching at the DB
 > > level instead of the file system?  I'm really curious as i've got
 > > conflicting data on why people do this.  If i get more data on real
 > > reasons on why we shouldn't cache at the file system, then this could
 > > get bumped up in my priority queue.
 > >

I can't answer this, although I can well imagine that the DB is
the most efficient place to cache its own data, all organised
and formatted to respond to queries.

But once the DB has signified to the FS that it doesn't
require the FS to cache data, the benefit from this RFE
is that the memory used to stage the data can be quickly
recycled by ZFS for subsequent operations. It means the ZFS
memory footprint is more likely to contain useful ZFS
metadata, and not cached data blocks we know are unlikely to
be used again anytime soon.

We would also operate better in mixed DIO/non-DIO workloads.


See also:
http://blogs.sun.com/roch/entry/zfs_and_directio

-r



 > > eric
 > > ___
 > > zfs-discuss mailing list
 > > zfs-discuss@opensolaris.org
 > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 > >
 > ___
 > zfs-discuss mailing list
 > zfs-discuss@opensolaris.org
 > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] future ZFS Boot and ZFS "copies"

2007-10-03 Thread Jesus Cea
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

I know that the first release of ZFS boot will support single-disk and
mirroring configurations. With ZFS "copies" support in Solaris 10 U5 (I
hope), I was wondering about breaking my current mirror and using both
disks in stripe mode, protecting the critical bits with ZFS "copies".
Those bits would include the OS.

Would ZFS boot be able to boot from a "copies" boot dataset when one of
the disks is failing? Counting on the ditto blocks being spread between
both disks, of course.

PS: ZFS "copies" = Ditto blocks.
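
A minimal sketch of the layout being contemplated (pool and dataset names
are invented for illustration; whether the boot loader can actually read
such a dataset with one disk dead is exactly the open question):

# two-disk stripe instead of a mirror
# zpool create tank c0t0d0s0 c0t1d0s0
# protect only the critical bits with ditto blocks
# zfs set copies=2 tank/ROOT

(Note that "copies" only applies to data written after the property is set.)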

- --
Jesus Cea Avion _/_/  _/_/_/_/_/_/
[EMAIL PROTECTED] http://www.argo.es/~jcea/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:[EMAIL PROTECTED] _/_/_/_/  _/_/_/_/_/
   _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQCVAwUBRwOVbJlgi5GaxT1NAQLyRQP/dSRx8tIlx+wsBtxWOgCLEnknNeBI/0sV
DPWEYXiv8Y60hSoW6+3UbhdhD0CLrunFZR7OCL1Dykq3roj/51Aabm1ZwK3QMujR
TRTrW93oPkluM2bQEmkK/NUYh4iGcBtGfZVa5RI9DT0eKCQPe1grGv5If9c4xEZE
Z34tbQ2I8PI=
=5FjP
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future ZFS Boot and ZFS "copies"

2007-10-03 Thread Darren J Moffat
Jesus Cea wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> I know that the first release of ZFS boot will support single-disk and
> mirroring configurations. With ZFS "copies" support in Solaris 10 U5 (I
> hope), I was wondering about breaking my current mirror and using both
> disks in stripe mode, protecting the critical bits with ZFS "copies".
> Those bits would include the OS.

Why would you do that, when it would reduce your protection and ZFS boot 
can boot from a mirror anyway?

What problem are you trying to solve ?  The only thing I can think of is 
attempting to increase performance by increasing the number of spindles.

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future ZFS Boot and ZFS "copies"

2007-10-03 Thread Jesus Cea
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Darren J Moffat wrote:
> Why would you do that when it would reduce your protection and ZFS boot 
> can boot from a mirror anyway.

I guess ditto blocks would be protection enough, since the data would be
duplicated between both disks. Of course, backups are your friend.

> What problem are you trying to solve ?  The only thing I can think of is 
> attempting to increase performance by increasing the number of spindles.

Read performance would double, and that is very nice, but my main
motivation would be disk space: I have some hundreds of gigabytes of data
that I could easily recover from a backup, or that I wouldn't mind
losing if something catastrophic enough occurred. For example, DivX
movies or MP3 files. Since I do daily backups, selective ZFS "copies"
could almost double my disk space. I don't need to mirror my "/usr/local/"
if I have daily backups. But I could protect the boot environment or my
mail dataset using ditto blocks.

Playing with ZFS "copies", I can use a single pool and modulate
space/protection per dataset according to my needs and trade-offs.
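
As a sketch of that per-dataset modulation (dataset names are invented for
illustration):

# expendable, easily restored data: single copy, maximum space
# zfs set copies=1 tank/media
# critical data: two copies, which the allocator tries to spread
# across both disks of the stripe
# zfs set copies=2 tank/mail
# zfs get -r copies tank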

- --
Jesus Cea Avion _/_/  _/_/_/_/_/_/
[EMAIL PROTECTED] http://www.argo.es/~jcea/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:[EMAIL PROTECTED] _/_/_/_/  _/_/_/_/_/
   _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQCVAwUBRwObzZlgi5GaxT1NAQJq7gP/V1g6KUPS8T9hnA3KDmKMbIeDKoqphRO5
POehmhnWsPlO8BPa+CxT/ZRUwbNYCte9kYYWeJzXNRpUyGtFvREBjtgK6swIQXUC
n0D0gG0yI4aU1qzdX8X4bqomDaoL/Ho7YQu00j+P8mEfUdYzqY/odOVklZKq92U3
zfyDj7fgTVQ=
=cDSg
-END PGP SIGNATURE-
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Matty
On 10/3/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
> Rayson Ho writes:
>
>  > 1) Modern DBMSs cache database pages in their own buffer pool because
>  > it is less expensive than to access data from the OS. (IIRC, MySQL's
>  > MyISAM is the only one that relies on the FS cache, but a lot of MySQL
>  > sites use INNODB which has its own buffer pool)
>  >
>
> The DB can and should cache data whether or not directio is used.

It does, which leads to the core problem. Why do we have to store the
exact same data twice in memory (i.e., once in the ARC, and once in
the shared memory segment that Oracle uses)? Due to the lack of direct
I/O and kernel asynchronous I/O in ZFS, my employer has decided to
stick with VxFS. I would love nothing more than to use ZFS with our
databases, but unfortunately these missing features prevent us from
doing so. :(

Thanks,
- Ryan
-- 
UNIX Administrator
http://prefetch.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future ZFS Boot and ZFS "copies"

2007-10-03 Thread Moore, Joe
 
Jesus Cea wrote:
> Darren J Moffat wrote:
> > Why would you do that when it would reduce your protection 
> and ZFS boot 
> > can boot from a mirror anyway.
> 
> I guess ditto blocks would be protection enough, since the 
> data would be
> duplicated between both disks. Of course, backups are your friend.

I asked almost the exact same question when I first heard about ditto
blocks.  (See
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-May/040596.html
and followups)

There are 2 key differences between ditto blocks and mirrors:

1) The ZFS pool is considered "unprotected".  That means a device
failure will result in a kernel panic.

2) Ditto block separation is not enforced.  The allocator tries to keep
the second copy "far" from the first one, but it is possible that both
copies of your /etc/passwd file are on the same VDEV.  This means that a
device failure could result in real loss of data.

It would be really nice if there was some sort of
enforced-ditto-separation (fail w/ device full if unable to satisfy) but
that doesn't exist currently.

--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] future ZFS Boot and ZFS "copies"

2007-10-03 Thread Darren J Moffat
Moore, Joe wrote:
> It would be really nice if there was some sort of
> enforced-ditto-separation (fail w/ device full if unable to satisfy) but
> that doesn't exist currently.

How would that be different to a mirror ?

I guess it is different to a mirror because only some datasets in the 
pool would be "mirrored" instead of all of them.

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Roch - PAE

Matty writes:
 > On 10/3/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
 > > Rayson Ho writes:
 > >
 > >  > 1) Modern DBMSs cache database pages in their own buffer pool because
 > >  > it is less expensive than to access data from the OS. (IIRC, MySQL's
 > >  > MyISAM is the only one that relies on the FS cache, but a lot of MySQL
 > >  > sites use INNODB which has its own buffer pool)
 > >  >
 > >
 > > The DB can and should cache data whether or not directio is used.
 > 
 > It does, which leads to the core problem. Why do we have to store the
 > exact same data twice in memory (i.e., once in the ARC, and once in
 > the shared memory segment that Oracle uses)? 

We do not retain 2 copies of the same data.

If the DB cache is made large enough to consume most of memory,
the ZFS copy will quickly be evicted to stage other I/Os on
their way to the DB cache.

What problem does that pose ?
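
(One way to see this is to watch the ARC size kstat while the DB runs;
the module/statistic names below are as on the OpenSolaris builds
discussed in this thread:)

# current ARC size and target size, in bytes
$ kstat -p zfs:0:arcstats:size zfs:0:arcstats:c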

-r

 > 
 > Thanks,
 > - Ryan
 > -- 
 > UNIX Administrator
 > http://prefetch.net
 > ___
 > zfs-discuss mailing list
 > zfs-discuss@opensolaris.org
 > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss Digest, Vol 24, Issue 5

2007-10-03 Thread Solaris
Richard,

  Having read your blog regarding the copies feature, do you have an
opinion on whether mirroring or copies are better for a SAN situation?

  It strikes me that since we're discussing SAN and not local physical
disk, the configuration choices for a system needing 100GB of usable
storage (size chosen for round numbers) present an interesting
discussion.  I would probably throw out raidz(2) right off the top,
leaving the choice between the SAN presenting one 200GB LUN, two 100GB
LUNs, or four 50GB LUNs for the respective configs of copies=2, raid 0,
and raid 0+1.  My assumption would be that the 0+1 would still be the
best all-around solution to balance speed and redundancy.  It would also
give you more flexibility moving forward by adding more mirror vdevs for
future space requirements.

  Setting copies=X does not seem to change the fact that your storage
will cost X times as much as a plain UFS (or ZFS) arrangement which
relies on the hardware for redundancy.  Based on the diagrams in your
blog, it would also seem that writes would be slower by roughly a factor
of X.  Maybe cost and speed aren't a factor, but as I often hear, "Cheap,
Fast, Reliable: pick two", and since we're looking at SAN, I would think
the choice is Fast and Reliable.

  Do you have the time/hardware to examine the same things you have
blogged about before using ZFS+JBOD, now instead using ZFS+SAN?
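
For concreteness, the two ends of that comparison might be built like this
(LUN device names are placeholders):

# one 200GB LUN, redundancy via ditto blocks
# zpool create tank c2t0d0
# zfs set copies=2 tank
# vs. "0+1": four 50GB LUNs as a stripe of two mirrors, grown later by
# adding more mirror vdevs
# zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0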


> Date: Tue, 02 Oct 2007 17:12:06 -0700
> From: Richard Elling <[EMAIL PROTECTED]>
> Subject: Re: [zfs-discuss] zfs in san
> To: Todd Sawyers <[EMAIL PROTECTED]>
> Cc: zfs-discuss@opensolaris.org
> Message-ID: <[EMAIL PROTECTED]>
> Content-Type: text/plain; format=flowed; charset=ISO-8859-1
>
> Todd Sawyers wrote:
> > I am planning to use ZFS with fibre-attached SAN disk from an EMC Symmetrix.
> > Based on a note in the admin guide it appears that even though
> > the Symmetrix will handle the hardware RAID it is still advisable to create
> > a ZFS mirror on the host side to take full advantage of ZFS's self
> > healing/error checking and correcting.
> >
> > Is this true ?
>
> Yes, though you have more options than just mirroring.  Consider setting
> policies on important file systems such as copies=2.
>
> > Additionally I am wondering how the ZFS mirror will handle a backend
> > disk failure on the Symmetrix ?
> > With VxVM, disk failures are transparent and nothing needs to be done on
> > the host side.
>
> As long as the Symmetrix does not propagate the error to the host, then
> it should work the same.  However, rather than just trust the storage,
> ZFS will verify the data with checksums.  This is a good thing.
>
> > Will this be the same with zfs ? or will zpool replace commands need to
> > be run ?
>
> As long as the data at the host is good, no ZFS commands are needed.
> However, there have been occasional reports of failed SAN hardware
> corrupting data.  ZFS can detect this, and given enough redundancy, can
> correct the data.
>   -- richard
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Nicolas Williams
On Wed, Oct 03, 2007 at 10:42:53AM +0200, Roch - PAE wrote:
> Rayson Ho writes:
>  > 2) Also, direct I/O is faster because it avoid double buffering.
> 
> A piece of data can be in one buffer, 2 buffers, 3
> buffers. That says nothing about performance. More below.
> 
> So I guess you mean DIO is faster because it avoids the
> extra copy: DMA straight to the user buffer rather than DMA
> to a kernel buffer followed by a copy to the user buffer. If
> an I/O is 5 ms and an 8K copy is about 10 usec, is avoiding
> the copy really the most urgent thing to work on ?

If the DB is huge relative to RAM, and very busy, then memory pressure
could become a problem.  And it's not just the time spent copying
buffers, but the resources spent managing those copies.  (Just guessing.)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] replacing a device with itself doesn't work

2007-10-03 Thread MP
Hi,
I hope someone can help, because at the moment ZFS's logic seems a little askew.
I just swapped a failing 200GB drive that was one half of a 400GB gstripe 
device which I was using as one of the devices in a 3-device raidz1. When the 
OS came back up after the drive had been changed, the necessary metadata was of 
course not on the new drive, so the stripe didn't exist. ZFS understandably 
complained it couldn't open the stripe; however, it did not show the array as 
degraded. I didn't save the output, but it was just like what is described in 
this thread:

http://www.nabble.com/Shooting-yourself-in-the-foot-with-ZFS:-is-quite-easy-t4512790.html

I recreated the gstripe device under the same name stripe/str1 and assumed I 
could just:

# zpool replace pool stripe/str1
invalid vdev specification
stripe/str1 is in use (r1w1e1)

It also told me to try -f, which I did, but was greeted with the same error.
Why can I not replace a device with itself?
As the man page describes just this procedure I'm a little confused.
Try as I might (online, offline, scrub), I could not get the array to rebuild, 
just like the guy described in the thread above. I eventually resorted to 
recreating the stripe with a different name, stripe/str2. I could then perform a:

# zpool replace pool stripe/str1 stripe/str2

Is there a reason I have to jump through these seemingly pointless hoops to 
replace a device with itself?
Many thanks.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Strange zfs mirror behavior

2007-10-03 Thread Alex
Hi,

we are running a V240 with a ZFS pool mirrored across two 3310s (SCSI). During 
a redundancy test, when offlining one 3310, all ZFS data became unusable:
- zpool hangs without displaying any info
- trying to read the filesystem hangs the command (df, ls, ...)
- /var/log/messages keeps logging errors for the faulty disk
 scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],70/[EMAIL 
PROTECTED],1/[EMAIL PROTECTED],0 (sd41): offline
scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],70/[EMAIL 
PROTECTED]/[EMAIL PROTECTED],0 (sd5):
disk not responding to selection
and others


Using sotruss with the zpool command leads to:
libzfs.so.2:*zpool_iter(0x92b88, 0x1ae38, 0x87fa0)

and using mdb shows the zpool commands are all in the same state:
> ::pgrep zpool | ::walk thread | ::findstack
stack pointer for thread 30003adb900: 2a105092d51
[ 02a105092d51 turnstile_block+0x604() ]
  02a105092e01 mutex_vector_enter+0x478()
  02a105092ec1 spa_all_configs+0x64()
  02a105092f81 zfs_ioc_pool_configs+4()
  02a105093031 zfsdev_ioctl+0x15c()
  02a1050930e1 fop_ioctl+0x20()
  02a105093191 ioctl+0x184()
  02a1050932e1 syscall_trap32+0xcc()
stack pointer for thread 3000213a9e0: 2a1050b2ca1
[ 02a1050b2ca1 cv_wait+0x38() ]
  02a1050b2d51 spa_config_enter+0x38()
  02a1050b2e01 spa_open_common+0x1e0()
  02a1050b2eb1 spa_get_stats+0x1c()
  02a1050b2f71 zfs_ioc_pool_stats+0x10()
  02a1050b3031 zfsdev_ioctl+0x15c()
  02a1050b30e1 fop_ioctl+0x20()
  02a1050b3191 ioctl+0x184()
  02a1050b32e1 syscall_trap32+0xcc()
stack pointer for thread 30003adafa0: 2a100c7aca1
[ 02a100c7aca1 cv_wait+0x38() ]
  02a100c7ad51 spa_config_enter+0x38()
  02a100c7ae01 spa_open_common+0x1e0()
  02a100c7aeb1 spa_get_stats+0x1c()
  02a100c7af71 zfs_ioc_pool_stats+0x10()
  02a100c7b031 zfsdev_ioctl+0x15c()
  02a100c7b0e1 fop_ioctl+0x20()
  02a100c7b191 ioctl+0x184()
  02a100c7b2e1 syscall_trap32+0xcc()
stack pointer for thread 3000213a080: 2a1051e8ca1
[ 02a1051e8ca1 cv_wait+0x38() ]
  02a1051e8d51 spa_config_enter+0x38()
  02a1051e8e01 spa_open_common+0x1e0()
  02a1051e8eb1 spa_get_stats+0x1c()
  02a1051e8f71 zfs_ioc_pool_stats+0x10()
  02a1051e9031 zfsdev_ioctl+0x15c()
  02a1051e90e1 fop_ioctl+0x20()
  02a1051e9191 ioctl+0x184()
  02a1051e92e1 syscall_trap32+0xcc()
stack pointer for thread 30001725960: 2a100d98c91
[ 02a100d98c91 cv_wait+0x38() ]
  02a100d98d41 spa_config_enter+0x88()
  02a100d98df1 spa_vdev_enter+0x20()
  02a100d98ea1 spa_vdev_setpath+0x10()
  02a100d98f71 zfs_ioc_vdev_setpath+0x3c()
  02a100d99031 zfsdev_ioctl+0x15c()
  02a100d990e1 fop_ioctl+0x20()
  02a100d99191 ioctl+0x184()
  02a100d992e1 syscall_trap32+0xcc()


Has anyone got info about a problem like this with ZFS?
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] About bug 6486493 (ZFS boot incompatible with the SATA framework)

2007-10-03 Thread Eric Schrock
This bug was rendered moot via 6528732 in build snv_68 (and s10_u5).  We
now store physical device paths with the vnodes, so even though the
SATA framework doesn't correctly support open-by-devid in early boot, we
can fall back to the device path just fine.  ZFS root works great on
thumper, which uses the marvell SATA driver.

- Eric

On Wed, Oct 03, 2007 at 08:10:16AM +, Marc Bevand wrote:
> I would like to test ZFS boot on my home server, but according to bug 
> 6486493 ZFS boot cannot be used if the disks are attached to a SATA
> controller handled by a driver using the new SATA framework (which
> is my case: driver si3124). I have never heard of someone having
> successfully used ZFS boot with the SATA framework, so I assume this
> bug is real and everybody out there playing with ZFS boot is doing so
> with PATA controllers, or SATA controllers operating in compatibility
> mode, or SCSI controllers, right ?
> 
> -marc
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Eric Schrock, Solaris Kernel Development   http://blogs.sun.com/eschrock
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Interim way to pfinstall into zfs root?

2007-10-03 Thread Gordon Ross
Has anyone figured out a way to make pfinstall work
sufficiently to just pkgadd all the packages in a DVD
(or netinstall) image into a new ZFS root?

I have a ZFS root pool and an initial root FS that was
copied in from a cpio archive of a previous UFS root.
That much works great.  BFU works for the OS.  But
I really want to snapshot, clone, pfinstall to upgrade.

I tried creating a script to run pfinstall similar to how
Live Upgrade runs it, but pfinstall always seemed eager
to newfs my root pool.  Is there a tricky way to do this?
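
The snapshot/clone half of that workflow is straightforward (names are
illustrative); it's driving pfinstall, or even a plain pkgadd -R, into the
clone that remains the open problem:

# zfs snapshot rootpool/root@pre-upgrade
# zfs clone rootpool/root@pre-upgrade rootpool/root-new
# ... then point pfinstall (or pkgadd -R) at the mounted clone ...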

Thanks,
Gordon
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Rayson Ho
On 10/3/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
> We do not retain 2 copies of the same data.
>
> If the DB cache is made large enough to consume most of memory,
> the ZFS copy will quickly be evicted to stage other I/Os on
> their way to the DB cache.
>
> What problem does that pose ?

Hi Roch,

1) The memory copy operations are expensive... I think the following
is a good intro to this problem:

"Copying data in memory can be a serious bottleneck in DBMS software
today. This fact is often a surprise to database students, who assume
that main-memory operations are "free" compared to disk I/O. But in
practice, a well-tuned database installation is typically not
I/O-bound."  (section 3.2)

http://mitpress.mit.edu/books/chapters/0262693143chapm2.pdf

(Ch 2: Anatomy of a Database System, Readings in Database Systems, 4th Ed)


2) If you look at the TPC-C disclosure reports, you will see vendors
using thousands of disks for the top 10 systems. With that many disks
working in parallel, I/O latency is not as big of a problem as on
systems with fewer disks.


3) Also interesting is Concurrent I/O, which was introduced in AIX 5.2:

"Improving Database Performance With AIX Concurrent I/O"
http://www-03.ibm.com/systems/p/os/aix/whitepapers/db_perf_aix.html

"Improve database performance on file system containers in IBM DB2 UDB
V8.2 using Concurrent I/O on AIX"
http://www-128.ibm.com/developerworks/db2/library/techarticle/dm-0408lee/

Rayson



>
> -r
>
>  >
>  > Thanks,
>  > - Ryan
>  > --
>  > UNIX Administrator
>  > http://prefetch.net
>  > ___
>  > zfs-discuss mailing list
>  > zfs-discuss@opensolaris.org
>  > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replacing a device with itself doesn't work

2007-10-03 Thread Richard Elling
MP wrote:
> Hi,
> I hope someone can help cos ATM zfs' logic seems a little askew.
> I just swapped a failing 200gb drive that was one half of a 400gb gstripe 
> device which I was using as one of the devices in a 3 device raidz1. When the 
> OS came back up after the drive had been changed, the necessary metadata was 
> of course not on the new drive so the stripe didn't exist. Zfs understandably 
> complained it couldn't open the stripe, however it did not show the array as 
> degraded. I didn't save the output, but it was just like described in this 
> thread:
> 
> http://www.nabble.com/Shooting-yourself-in-the-foot-with-ZFS:-is-quite-easy-t4512790.html
> 
> I recreated the gstripe device under the same name stripe/str1 and assumed I 
> could just:
> 
> # zpool replace pool stripe/str1
> invalid vdev specification
> stripe/str1 is in use (r1w1e1)
> 
> It also told me to try -f, which I did, but was greeted with the same error.
> Why can I not replace a device with itself?
> As the man page describes just this procedure I'm a little confused.
> Try as I might (online, offline, scrub) I could not get the array to rebuild, 
> just like was the guy described in that thread above. I eventually resorted 
> to recreating the stripe with a different name stripe/str2. I could then 
> perform a:
> 
> # zpool replace pool stripe/str1 stripe/str2
> 
> Is there a reason I have to jump through these seemingly pointless hoops to 
> replace a device with itself?
> Many thanks.

Yes.  From the fine manual on zpool:
  zpool replace [-f] pool old_device [new_device]

  Replaces old_device with new_device. This is  equivalent
  to attaching new_device, waiting for it to resilver, and
  then detaching old_device.
...
  If  new_device  is  not  specified,   it   defaults   to
  old_device.  This form of replacement is useful after an
  existing  disk  has  failed  and  has  been   physically
  replaced.  In  this case, the new disk may have the same
  /dev/dsk path as the old device, even though it is actu-
  ally a different disk. ZFS recognizes this.

For a stripe, you don't have redundancy, so you cannot replace the
disk with itself.  You would have to specify the [new_device]
I've submitted CR6612596 for a better error message and CR6612605
to mention this in the man page.
  -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Dale Ghent
On Oct 3, 2007, at 10:31 AM, Roch - PAE wrote:

> If the DB cache is made large enough to consume most of memory,
> the ZFS copy will quickly be evicted to stage other I/Os on
> their way to the DB cache.
>
> What problem does that pose ?

Personally, I'm still not completely sold on the performance  
(performance as in ability, not speed) of ARC eviction. Often times,  
especially during a resilver, a server with ~2GB of RAM free under  
normal circumstances will dive down to the minfree floor, causing  
processes to be swapped out. We've had to take to manually  
constraining ARC max size so this situation is avoided. This is on  
s10u2/3. I haven't tried anything heavy duty with Nevada simply  
because I don't put Nevada in production situations.

Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm  
surprised that this is being met with skepticism considering that  
Oracle highly recommends direct IO be used,  and, IIRC, Oracle  
performance was the main motivation to adding DIO to UFS back in  
Solaris 2.6. This isn't a problem with ZFS or any specific fs per se,  
it's the buffer caching they all employ. So I'm a big fan of seeing  
6429855 come to fruition.
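
(For reference, the manual constraint mentioned above is usually a line in
/etc/system on builds where zfs_arc_max is tunable; on the older updates
mentioned it may have to be poked with mdb instead. The 1 GB value is only
an example:)

* cap the ARC at 1 GB (example value only)
set zfs:zfs_arc_max = 0x40000000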

/dale
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Jim Mauro

Hey Roch -
> We do not retain 2 copies of the same data.
>
> If the DB cache is made large enough to consume most of memory,
> the ZFS copy will quickly be evicted to stage other I/Os on
> their way to the DB cache.
>
> What problem does that pose ?

Can't answer that question empirically, because we can't measure this, but
I imagine there's some overhead to ZFS cache management in evicting and
replacing blocks, and that overhead could be eliminated if ZFS could be
told not to cache the blocks at all.

Now, obviously, whether this overhead would be at the noise level or
something that actually hurts sustainable performance will depend on
several things, but I can envision scenarios where it's overhead I'd
rather avoid if I could.

Thanks,
/jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replacing a device with itself doesn't work

2007-10-03 Thread Richard Elling
more below...

MP wrote:
> On 03/10/2007, *Richard Elling* <[EMAIL PROTECTED] 
> > wrote:
> 
> Yes.  From the fine manual on zpool:
>   zpool replace [-f] pool old_device [new_device]
> 
>   Replaces old_device with new_device. This is  equivalent
>   to attaching new_device, waiting for it to resilver, and
>   then detaching old_device.
> ...
>   If  new_device  is  not  specified,   it   defaults   to
>   old_device.  This form of replacement is useful after an
>   existing  disk  has  failed  and  has  been   physically
>   replaced.  In  this case, the new disk may have the same
>   /dev/dsk path as the old device, even though it is actu-
>   ally a different disk. ZFS recognizes this.
> 
> For a stripe, you don't have redundancy, so you cannot replace the
> disk with itself. 
> 
> 
> I don't see how a stripe makes a difference. It's just 2 drives joined 
> together logically to make a
> new device. It can be used by the system just like a normal hard drive.  Just 
> like a normal hard
> drive it too has no redundancy?

Correct.  It would be redundant if it were a mirror, raidz, or raidz2.  In the
case of stripes of mirrors, raidz, or raidz2 vdevs, they are redundant.

> You would have to specify the [new_device]
> I've submitted CR6612596 for a better error message and CR6612605
> to mention this in the man page.
> 
> 
> Perhaps I was a little unclear. Zfs did a few things during this whole 
> escapade which seemed wrong.
> 
> # mdconfig -a -tswap -s64m
> md0
> # mdconfig -a -tswap -s64m
> md1
> # mdconfig -a -tswap -s64m
> md2

I presume you're not running Solaris, so please excuse me if I take a
Solaris view to this problem.

> # zpool create tank raidz md0 md1 md2
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
> 
> NAMESTATE READ WRITE CKSUM
> tankONLINE   0 0 0
>   raidz1ONLINE   0 0 0
> md0 ONLINE   0 0 0
> md1 ONLINE   0 0 0
> md2 ONLINE   0 0 0
> 
> errors: No known data errors
> # zpool offline tank md0
> Bringing device md0 offline
> # dd if=/dev/zero of=/dev/md0 bs=1m
> dd: /dev/md0: end of device
> 65+0 records in
> 64+0 records out
> 67108864 bytes transferred in 0.044925 secs (1493798602 bytes/sec)
> # zpool status -v tank
>   pool: tank
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
> Sufficient replicas exist for the pool to continue functioning in a
> degraded state.
> action: Online the device using 'zpool online' or replace the device with
> 'zpool replace'.
>  scrub: none requested
> config:
> 
> NAMESTATE READ WRITE CKSUM
> tankDEGRADED 0 0 0
>   raidz1DEGRADED 0 0 0
> md0 OFFLINE  0 0 0
> md1 ONLINE   0 0 0
> md2 ONLINE   0 0 0
> 
> errors: No known data errors
> 
> 
> At this point where the drive is offline a 'zpool replace tank md0' will 
> fix the array.

Correct.  The pool is redundant.

> However, if instead the other advice given, 'zpool online tank md0', is 
> used, then problems start to occur:
> 
> 
> # zpool online tank md0
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
> status: One or more devices could not be used because the label is 
> missing or
> invalid.  Sufficient replicas exist for the pool to continue
> functioning in a degraded state.
> action: Replace the device using 'zpool replace'.
>see: http://www.sun.com/msg/ZFS-8000-4J
>  scrub: resilver completed with 0 errors on Wed Oct  3 18:44:22 2007
> config:
> 
> NAMESTATE READ WRITE CKSUM
> tankONLINE   0 0 0
>   raidz1ONLINE   0 0 0
> md0 UNAVAIL  0 0 0  corrupted data
> md1 ONLINE   0 0 0
> md2 ONLINE   0 0 0
> 
> errors: No known data errors
> 
> -
> ^^^
> Surely this is wrong? Zpool shows the pool as 'ONLINE'  and not 
> degraded. Whereas the status explanation
> says that it is degraded and 'zpool replace' is required. That's just 
> confusing.

I agree, I would expect the STATE to be DEGRADED.

> -
> 
> # zpool scrub tank
> # zpool status -v tank
>   pool: tank
>  state: ONLINE
> status: One or more devices could not be used because the label is 
> missing or
> invalid.  Sufficient replicas exist for the pool to continue
> functioning in a degraded state.
> action: Replace the device using 'zpool replace'.
>see: http://www.sun.com/msg/ZFS-8000-4J
>  

Re: [zfs-discuss] replacing a device with itself doesn't work

2007-10-03 Thread Pawel Jakub Dawidek
On Wed, Oct 03, 2007 at 12:10:19PM -0700, Richard Elling wrote:
> > -
> > 
> > # zpool scrub tank
> > # zpool status -v tank
> >   pool: tank
> >  state: ONLINE
> > status: One or more devices could not be used because the label is 
> > missing or
> > invalid.  Sufficient replicas exist for the pool to continue
> > functioning in a degraded state.
> > action: Replace the device using 'zpool replace'.
> >see: http://www.sun.com/msg/ZFS-8000-4J
> >  scrub: resilver completed with 0 errors on Wed Oct  3 18:45:06 2007
> > config:
> > 
> > NAMESTATE READ WRITE CKSUM
> > tankONLINE   0 0 0
> >   raidz1ONLINE   0 0 0
> > md0 UNAVAIL  0 0 0  corrupted data
> > md1 ONLINE   0 0 0
> > md2 ONLINE   0 0 0
> > 
> > errors: No known data errors
> > # zpool replace tank md0
> > invalid vdev specification
> > use '-f' to override the following errors:
> > md0 is in use (r1w1e1)
> > # zpool replace -f tank md0
> > invalid vdev specification
> > the following errors must be manually repaired:
> > md0 is in use (r1w1e1)
> > 
> > -
> > Well the advice of 'zpool replace' doesn't work. At this point the user 
> > is now stuck. There seems to
> > be just no way to now use the existing device md0.
> 
> In Solaris NV b72, this works as you expect.
> # zpool replace zwimming /dev/ramdisk/rd1
> # zpool status -v zwimming
>pool: zwimming
>   state: DEGRADED
>   scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
> config:
> 
>  NAMESTATE READ WRITE CKSUM
>  zwimmingDEGRADED 0 0 0
>raidz1DEGRADED 0 0 0
>  replacing   DEGRADED 0 0 0
>/dev/ramdisk/rd1/old  FAULTED  0 0 0  corrupted 
> data
>/dev/ramdisk/rd1  ONLINE   0 0 0
>  /dev/ramdisk/rd2ONLINE   0 0 0
>  /dev/ramdisk/rd3ONLINE   0 0 0
> 
> errors: No known data errors
> # zpool status -v zwimming
>pool: zwimming
>   state: ONLINE
>   scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
> config:
> 
>  NAME  STATE READ WRITE CKSUM
>  zwimming  ONLINE   0 0 0
>raidz1  ONLINE   0 0 0
>  /dev/ramdisk/rd1  ONLINE   0 0 0
>  /dev/ramdisk/rd2  ONLINE   0 0 0
>  /dev/ramdisk/rd3  ONLINE   0 0 0
> 
> errors: No known data errors

Good to know, but I think it's still partly a ZFS fault. The error
message 'md0 is in use (r1w1e1)' means that something (I'm quite sure
it's ZFS) keeps the device open. Why does it keep it open when it doesn't
recognize it? Or maybe it tries to open it twice for write (exclusively)
when replacing, which is not allowed in GEOM on FreeBSD.

I can take a look at whether it is the former or the latter, but it should
be fixed in ZFS itself, IMHO.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!


pgprcvACVf6zj.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] replacing a device with itself doesn't work

2007-10-03 Thread MC
I think I might have run into the same problem.  At the time I assumed I was 
doing something wrong, but...

I made a b72 raidz out of three new 1gb virtual disks in vmware.  I shut the vm 
off, replaced one of the disks with a new 1.5gb virtual disk.  No matter what 
command I tried, I couldn't get the new disk into the array.  The docs said 
that replacing the vdev with itself would work, but it didn't.  Nor did setting 
the 'automatic replace' feature on the pool and plugging a new device in.  I 
recall most of the errors being "device in use".

Maybe I wasn't the problem after all?  0_o
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Richard Elling
Rayson Ho wrote:
> On 10/3/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
>> We do not retain 2 copies of the same data.
>>
>> If the DB cache is made large enough to consume most of memory,
>> the ZFS copy will quickly be evicted to stage other I/Os on
>> their way to the DB cache.
>>
>> What problem does that pose ?
> 
> Hi Roch,
> 
> 1) The memory copy operations are expensive... I think the following
> is a good intro to this problem:
> 
> "Copying data in memory can be a serious bottleneck in DBMS software
> today. This fact is often a surprise to database students, who assume
> that main-memory operations are "free" compared to disk I/O. But in
> practice, a well-tuned database installation is typically not
> I/O-bound."  (section 3.2)

... just the ones people are complaining about ;-)
Indeed it seems rare that a DB performance escalation does not involve
I/O tuning :-(

> http://mitpress.mit.edu/books/chapters/0262693143chapm2.pdf
> 
> (Ch 2: Anatomy of a Database System, Readings in Database Systems, 4th Ed)
> 
> 
> 2) If you look at the TPC-C disclosure reports, you will see vendors
> using thousands of disks for the top 10 systems. With that many disks
> working in parallel, I/O latency is not as big of a problem as on
> systems with fewer disks.
> 
> 
> 3) Also interesting is Concurrent I/O, which was introduced in AIX 5.2:
> 
> "Improving Database Performance With AIX Concurrent I/O"
> http://www-03.ibm.com/systems/p/os/aix/whitepapers/db_perf_aix.html

This is a pretty decent paper, and some of the issues are the same with
UFS.  To wit, direct I/O is not always a win (qv. Bob Sneed's blog).
It also describes what we call the single-writer lock problem, which IBM
solves with Concurrent I/O.  See also:
http://www.solarisinternals.com/wiki/index.php/Direct_I/O

ZFS doesn't have the single writer lock problem.  See also:
http://blogs.sun.com/roch/entry/zfs_to_ufs_performance_comparison

Slightly off-topic, in looking at some field data this morning (looking
for something completely unrelated) I notice that the use of directio
on UFS is declining over time.  I'm not sure what that means... hopefully
not more performance escalations...
  -- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Dale Ghent
On Oct 3, 2007, at 5:21 PM, Richard Elling wrote:

> Slightly off-topic, in looking at some field data this morning  
> (looking
> for something completely unrelated) I notice that the use of directio
> on UFS is declining over time.  I'm not sure what that means...  
> hopefully
> not more performance escalations...

Sounds like someone from the ZFS team needs to get with someone from
Oracle/MySQL/Postgres and get the skinny on how the I/O rubber->road
boundary should look, because it doesn't sound like there's a
definitive, or at least a sure, answer here.

Oracle trumpets the use of DIO, and there are benchmarks and first-hand
accounts out there from DBAs on its virtues - at least when
running on UFS (and EXT2/3 on Linux, etc.)

As it relates to ZFS mechanics specifically, there doesn't appear to  
be any settled opinion.

/dale
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best option for my home file server?

2007-10-03 Thread Christopher
Would the nv_sata driver also be used on nforce 590 sli? I found Asus M2N32 WS 
PRO at my hw shop which has 9 internal sata connectors.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best option for my home file server?

2007-10-03 Thread Richard Elling
I believe so.  The Solaris device detection tool will show the MCP version, too.
 http://www.sun.com/bigadmin/hcl/hcts/device_detect.html
  -- richard

Christopher wrote:
> Would the nv_sata driver also be used on nforce 590 sli? I found Asus M2N32 
> WS PRO at my hw shop which has 9 internal sata connectors.
>  
>  
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Jason J. W. Williams
Hi Dale,

We're testing out the enhanced arc_max enforcement (tracking DNLC
entries) using Build 72 right now. Hopefully, it will fix the memory
creep, which seems to me to be the only real downside to ZFS for DB
work. Frankly, our DB loads have improved performance with ZFS. I
suspect it's because we are write-heavy.

-J

On 10/3/07, Dale Ghent <[EMAIL PROTECTED]> wrote:
> On Oct 3, 2007, at 10:31 AM, Roch - PAE wrote:
>
> > If the DB cache is made large enough to consume most of memory,
> > the ZFS copy will quickly be evicted to stage other I/Os on
> > their way to the DB cache.
> >
> > What problem does that pose ?
>
> Personally, I'm still not completely sold on the performance
> (performance as in ability, not speed) of ARC eviction. Often times,
> especially during a resilver, a server with ~2GB of RAM free under
> normal circumstances will dive down to the minfree floor, causing
> processes to be swapped out. We've had to take to manually
> constraining ARC max size so this situation is avoided. This is on
> s10u2/3. I haven't tried anything heavy duty with Nevada simply
> because I don't put Nevada in production situations.
>
> Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm
> surprised that this is being met with skepticism considering that
> Oracle highly recommends direct IO be used,  and, IIRC, Oracle
> performance was the main motivation to adding DIO to UFS back in
> Solaris 2.6. This isn't a problem with ZFS or any specific fs per se,
> it's the buffer caching they all employ. So I'm a big fan of seeing
> 6429855 come to fruition.
>
> /dale
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Kugutsumen
Postgres assumes that the OS takes care of caching:

"PLEASE NOTE. PostgreSQL counts a lot on the OS to cache data files  
and hence does not bother with duplicating its file caching effort.  
The shared buffers parameter assumes that OS is going to cache a lot  
of files and hence it is generally very low compared with system RAM.  
Even for a dataset in excess of 20GB, a setting of 128MB may be too  
much, if you have only 1GB RAM and an aggressive-at-caching OS like  
Linux." Tuning PostgreSQL for performance, Shridhar Daithankar, Josh  
Berkus, 2003, http://www.varlena.com/GeneralBits/Tidbits/perf.html

Slightly off-topic, I have noticed at least a 25% performance gain on
my PostgreSQL database after installing Wu Fengguang's adaptive
read-ahead disk cache patch for the Linux kernel:
http://lkml.org/lkml/2005/9/15/185

http://www.samag.com/documents/s=10101/sam0616a/0616a.htm

I was wondering if Solaris uses a similar approach.

On 04/10/2007, at 4:44 AM, Dale Ghent wrote:


> On Oct 3, 2007, at 5:21 PM, Richard Elling wrote:
>
>
>> Slightly off-topic, in looking at some field data this morning
>> (looking
>> for something completely unrelated) I notice that the use of directio
>> on UFS is declining over time.  I'm not sure what that means...
>> hopefully
>> not more performance escalations...
>>
>
> Sounds like someone from ZFS team needs to get with someone from
> Oracle/MySQL/Postgres and get the skinny on how the IO rubber->road
> boundary should look, because it doesn't sound like there's a
> definitive or at least a sure answer here.
>
> Oracle trumpets the use of DIO, and there are benchmarks and first-
> hand accounts out there from DBAs on its virtues - at least when
> running on UFS (and EXT2/3 on Linux, etc)
>
> As it relates to ZFS mechanics specifically, there doesn't appear to
> be any settled opinion.
>
> /dale
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread eric kustarz
>
> Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm
> surprised that this is being met with skepticism considering that
> Oracle highly recommends direct IO be used,  and, IIRC, Oracle
> performance was the main motivation to adding DIO to UFS back in
> Solaris 2.6. This isn't a problem with ZFS or any specific fs per se,
> it's the buffer caching they all employ. So I'm a big fan of seeing
> 6429855 come to fruition.

The point is that directI/O typically means two things:
1) concurrent I/O
2) no caching at the file system

Most file systems (ufs, vxfs, etc.) don't do 1) or 2) without turning  
on "directI/O".

ZFS *does* 1.  It doesn't do 2 (currently).

That is what we're trying to discuss here.

Where does the win come from with "directI/O"?  Is it 1), 2), or some  
combination?  If its a combination, what's the percentage of each  
towards the win?

We need to tease 1) and 2) apart to have a full understanding.  I'm  
not against adding 2) to ZFS but want more information.  I suppose  
i'll just prototype it and find out for myself.
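
For reference, "turning on directI/O" on UFS usually means the directio(3C)
advisory call; a minimal sketch follows (error handling trimmed, the path is
made up, and this is the UFS-style knob under discussion, not something ZFS
honours today):

#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
        /* a data file on a UFS file system (path is illustrative) */
        int fd = open("/ufsdata/datafile", O_RDWR);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        /* ask the file system to bypass its page cache for this file */
        if (directio(fd, DIRECTIO_ON) != 0)
                perror("directio");
        /* subsequent reads/writes go straight between disk and user buffer */
        return (0);
}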

eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread eric kustarz

On Oct 3, 2007, at 3:44 PM, Dale Ghent wrote:

> On Oct 3, 2007, at 5:21 PM, Richard Elling wrote:
>
>> Slightly off-topic, in looking at some field data this morning
>> (looking
>> for something completely unrelated) I notice that the use of directio
>> on UFS is declining over time.  I'm not sure what that means...
>> hopefully
>> not more performance escalations...
>
> Sounds like someone from ZFS team needs to get with someone from
> Oracle/MySQL/Postgres and get the skinny on how the IO rubber->road
> boundary should look, because it doesn't sound like there's a
> definitive or at least a sure answer here.

I've done that already (Oracle, Postgres, JavaDB, etc.).  Because the  
holy grail of "directI/O" is an overloaded term, we don't really know  
where the win within "directI/O" lies.  In any event, it seems the  
only way to get a definitive answer here is to prototype a no caching  
property...
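
Purely as a sketch of what such a prototype knob might look like from the
command line (this property name is hypothetical, not an existing ZFS
feature):

# hypothetical per-dataset switch, named here only for illustration
# zfs set cache=none tank/oracle_data
# zfs set cache=all tank/home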

eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?

2007-10-03 Thread Nathan Kroenert
Some people are just dumb. Take me, for instance... :)

I was just looking into ZFS over iSCSI, doing some painful and unnatural 
things to my boxes, and dropped a panic I was not expecting.

Here is what I did.

Server: (S10_u4 sparc)
  - zpool create usb /dev/dsk/c4t0d0s0
 (on a 4gb USB stick, if it matters)
  - zfs create -s -V 200mb usb/is0
  - zfs set shareiscsi=on usb/is0

On Client A (nv_72 amd64)
  - iscsiadm stuff to enable sendtargets discovery and set the 
discovery-address to the server above
  - svcadm enable iscsi_initiator
  - zpool create server_usb iscsi_target_created_above
  - created a few files
  - exported pool

On Client B (nv_65 amd64 xen dom0)
  - iscsiadm stuff and enable service and import pool - import failed 
due to newer pool version... dang.
  - re-create pool
  - create some other files and stuff
  - export pool

Client A
  - import pool make couple-o-changes

Client B
  - import pool -f  (heh)

Client A + B - With both mounting the same pool, touched a couple of 
files, and removed a couple of files from each client

Client A + B - zpool export

Client A - Attempted import and dropped the panic.

Oct  4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80:
Oct  4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion 
failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x5 
== 0x0)
, file: ../../common/fs/zfs/space_map.c, line: 339
Oct  4 15:03:12 fozzie unix: [ID 10 kern.notice]
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160 
genunix:assfail3+b9 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200 
zfs:space_map_load+2ef ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240 
zfs:metaslab_activate+66 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300 
zfs:metaslab_group_alloc+24e ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0 
zfs:metaslab_alloc_dva+192 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470 
zfs:metaslab_alloc+82 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0 
zfs:zio_dva_allocate+68 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0 
zfs:zio_next_stage+b3 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510 
zfs:zio_checksum_generate+6e ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530 
zfs:zio_next_stage+b3 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0 
zfs:zio_write_compress+239 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0 
zfs:zio_next_stage+b3 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610 
zfs:zio_wait_for_children+5d ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630 
zfs:zio_wait_children_ready+20 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650 
zfs:zio_next_stage_async+bb ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670 
zfs:zio_nowait+11 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960 
zfs:dbuf_sync_leaf+1ac ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0 
zfs:dbuf_sync_list+51 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10 
zfs:dnode_sync+23b ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50 
zfs:dmu_objset_sync_dnodes+55 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0 
zfs:dmu_objset_sync+13d ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40 
zfs:dsl_pool_sync+199 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0 
zfs:spa_sync+1c5 ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60 
zfs:txg_sync_thread+19a ()
Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70 
unix:thread_start+8 ()
Oct  4 15:03:12 fozzie unix: [ID 10 kern.notice]

Yep - sure, I did some boneheaded things here (grin) and deserved a good 
kick in the groin. However, should I panic a whole box just because I 
have attempted to import a dud pool??

Without re-creating the pool, I can now panic the system reliably just 
by attempting to import the pool.

I was a little surprised, as I would have thought that there should have 
been no chance for really nasty things to happen at a systemwide 
level, and we should have just bailed on the mount / import.

I see a few bugs that were closeish to this, but not a great match...

Is this a known issue, already fixed in a later build, or should I bug it?

After spending a little time playing with iscsi, I have to say it's 
almost inevitable that someone is going to do this by accident and panic 
a big box for what I see as no good reason. (though I'm happy to be 
educated... ;)

Oh - and also - kudos to the ZFS team and the others involved in the 
whole iSCSI thing. So easy a

Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?

2007-10-03 Thread Dick Davies
On 04/10/2007, Nathan Kroenert <[EMAIL PROTECTED]> wrote:

> Client A
>   - import pool make couple-o-changes
>
> Client B
>   - import pool -f  (heh)

> Oct  4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80:
> Oct  4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion
> failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x5
> == 0x0)
> , file: ../../common/fs/zfs/space_map.c, line: 339
> Oct  4 15:03:12 fozzie unix: [ID 10 kern.notice]
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160
> genunix:assfail3+b9 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200
> zfs:space_map_load+2ef ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240
> zfs:metaslab_activate+66 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300
> zfs:metaslab_group_alloc+24e ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0
> zfs:metaslab_alloc_dva+192 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470
> zfs:metaslab_alloc+82 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0
> zfs:zio_dva_allocate+68 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0
> zfs:zio_next_stage+b3 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510
> zfs:zio_checksum_generate+6e ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530
> zfs:zio_next_stage+b3 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0
> zfs:zio_write_compress+239 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0
> zfs:zio_next_stage+b3 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610
> zfs:zio_wait_for_children+5d ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630
> zfs:zio_wait_children_ready+20 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650
> zfs:zio_next_stage_async+bb ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670
> zfs:zio_nowait+11 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960
> zfs:dbuf_sync_leaf+1ac ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0
> zfs:dbuf_sync_list+51 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10
> zfs:dnode_sync+23b ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50
> zfs:dmu_objset_sync_dnodes+55 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0
> zfs:dmu_objset_sync+13d ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40
> zfs:dsl_pool_sync+199 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0
> zfs:spa_sync+1c5 ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60
> zfs:txg_sync_thread+19a ()
> Oct  4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70
> unix:thread_start+8 ()
> Oct  4 15:03:12 fozzie unix: [ID 10 kern.notice]

> Is this a known issue, already fixed in a later build, or should I bug it?

It shouldn't panic the machine, no. I'd raise a bug.

> After spending a little time playing with iscsi, I have to say it's
> almost inevitable that someone is going to do this by accident and panic
> a big box for what I see as no good reason. (though I'm happy to be
> educated... ;)

You use ACLs and TPGT groups to ensure 2 hosts can't simultaneously
access the same LUN by accident. You'd have the same problem with
Fibre Channel SANs.
-- 
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Louwtjie Burger
Would it be easier to...

1) Change the ZFS code to enable a sort of directIO emulation and then run
various tests... or

2) Use Sun's performance team, which has all the experience in the
world when it comes to performing benchmarks on Solaris and Oracle,
plus a DTrace master to drill down and see what the difference is between
UFS and UFS/DIO... and where the real win lies.


On 10/4/07, eric kustarz <[EMAIL PROTECTED]> wrote:
>
> On Oct 3, 2007, at 3:44 PM, Dale Ghent wrote:
>
> > On Oct 3, 2007, at 5:21 PM, Richard Elling wrote:
> >
> >> Slightly off-topic, in looking at some field data this morning
> >> (looking
> >> for something completely unrelated) I notice that the use of directio
> >> on UFS is declining over time.  I'm not sure what that means...
> >> hopefully
> >> not more performance escalations...
> >
> > Sounds like someone from ZFS team needs to get with someone from
> > Oracle/MySQL/Postgres and get the skinny on how the IO rubber->road
> > boundary should look, because it doesn't sound like there's a
> > definitive or at least a sure answer here.
>
> I've done that already (Oracle, Postgres, JavaDB, etc.).  Because the
> holy grail of "directI/O" is an overloaded term, we don't really know
> where the win within "directI/O" lies.  In any event, it seems the
> only way to get a definitive answer here is to prototype a no caching
> property...
>
> eric
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss