Re: [zfs-discuss] Public ZFS API ?

2009-03-18 Thread Ian Collins

Cherry Shu wrote:
Are there any plans for an API that would allow ZFS commands, including 
snapshot/rollback, to be integrated with a customer's application?



libzfs.h?

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Casper . Dik

>Recently there's been discussion [1] in the Linux community about how
>filesystems should deal with rename(2), particularly in the case of a crash.
>ext4 was found to truncate files after a crash, that had been written with
>open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
> because ext4 uses delayed allocation and may not write the contents to disk
>immediately, but commits metadata changes quite frequently. So when
>rename("foo.tmp","foo") is committed to disk, it has a length of zero which
>is later updated when the data is written to disk. This means after a crash,
>"foo" is zero-length, and both the new and the old data has been lost, which
>is undesirable. This doesn't happen when using ext3's default settings
>because ext3 writes data to disk before metadata (which has performance
>problems, see Firefox 3 and fsync[2])

The belief that, somehow, "metadata" is more important than "other data"
should have been put to rest with UFS.  Yes, it's easier to "fsck" the
filesystem when the metadata is correct, and that gets you a valid 
filesystem, but it doesn't mean that you get a filesystem with valid contents.

>Ted T'so's (the main author of ext3 and ext4) response is that applications
>which perform open(),write(),close(),rename() in the expectation that they
>will either get the old data or the new data, but not no data at all, are
>broken, and instead should call open(),write(),fsync(),close(),rename().
>Most other people are arguing that POSIX says rename(2) is atomic, and while
>POSIX doesn't specify crash recovery, returning no data at all after a crash
>is clearly wrong, and excessive use of fsync is overkill and
>counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for
>fsync). I've omitted a lot of detail, but I think this is the core of the
>argument.


As long as POSIX assumes that systems don't crash, there is clearly
nothing in the standard which would help the argument on either side.

It is a "quality of implementation" property.  Apparently, T'so feels
that reordering filesystem operations is fine.


>Now the question I have, is how does ZFS deal with
>open(),write(),close(),rename() in the case of a crash? Will it always
>return the new data or the old data, or will it sometimes return no data? Is
> returning no data defensible, either under POSIX or common sense? Comments
>about other filesystems, eg UFS are also welcome. As a counter-point, XFS
>(written by SGI) is notorious for data-loss after a crash, but its authors
>defend the behaviour as POSIX-compliant.

I didn't know about XFS behaviour on crash.  I don't know exactly how ZFS 
commits transaction groups; the ZFS authors can say, and I hope they chime 
in.

The only place POSIX comes into question is when the fileserver crashes: 
whether or not the NFS server keeps its promises.  Some typical Linux 
configurations would break some of those promises.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Joerg Schilling
James Andrewartha  wrote:

> Recently there's been discussion [1] in the Linux community about how
> filesystems should deal with rename(2), particularly in the case of a crash.
> ext4 was found to truncate files after a crash, that had been written with
> open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
>  because ext4 uses delayed allocation and may not write the contents to disk
> immediately, but commits metadata changes quite frequently. So when
> rename("foo.tmp","foo") is committed to disk, it has a length of zero which
> is later updated when the data is written to disk. This means after a crash,
> "foo" is zero-length, and both the new and the old data has been lost, which
> is undesirable. This doesn't happen when using ext3's default settings
> because ext3 writes data to disk before metadata (which has performance
> problems, see Firefox 3 and fsync[2])
>
> Ted T'so's (the main author of ext3 and ext4) response is that applications
> which perform open(),write(),close(),rename() in the expectation that they
> will either get the old data or the new data, but not no data at all, are
> broken, and instead should call open(),write(),fsync(),close(),rename().
> Most other people are arguing that POSIX says rename(2) is atomic, and while
> POSIX doesn't specify crash recovery, returning no data at all after a crash
> is clearly wrong, and excessive use of fsync is overkill and
> counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for
> fsync). I've omitted a lot of detail, but I think this is the core of the
> argument.

The problem in this case is not whether rename() is atomic but whether the
file that replaces the old file in an atomic rename() operation is in a 
stable state on the disk before calling rename().

The calling sequence of the failing code was:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, "dat", size);
close(f);
rename("new", "old");

The only guaranteed way to have the file "new" in a stable state on the disk
is to call:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, "dat", size);
fsync(f);
close(f);

Do not forget to check error codes.

If the application would call:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
if (write(f, "dat", size) != size)
        fail();
if (fsync(f) < 0)
        fail();
if (close(f) < 0)
        fail();
if (rename("new", "old") < 0)
        fail();

and if after a crash there is neither the old file nor the
new file on the disk in a consistent state, then you may blame the
file system.
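
Put together as a self-contained program, that pattern looks like the
sketch below (the file names, the exit-on-error handling and the sample
payload are only illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the data under a temporary name, force it to stable storage with
 * fsync(), then atomically replace the old file with rename().  After a
 * crash you should find either the complete old file or the complete new
 * file, never an empty one.
 */
static void
atomic_replace(const char *path, const char *tmp, const char *data, size_t size)
{
	int f = open(tmp, O_WRONLY|O_CREAT|O_TRUNC, 0666);

	if (f < 0) {
		perror("open");
		exit(1);
	}
	if (write(f, data, size) != (ssize_t)size) {
		perror("write");
		exit(1);
	}
	if (fsync(f) < 0) {		/* data must be stable before the rename */
		perror("fsync");
		exit(1);
	}
	if (close(f) < 0) {
		perror("close");
		exit(1);
	}
	if (rename(tmp, path) < 0) {	/* atomic replacement of "old" */
		perror("rename");
		exit(1);
	}
}

int
main(void)
{
	const char *data = "dat";

	atomic_replace("old", "new", data, strlen(data));
	return (0);
}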


Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Public ZFS API ?

2009-03-18 Thread Darren J Moffat

Ian Collins wrote:

Cherry Shu wrote:
Are there any plans for an API that would allow ZFS commands, including 
snapshot/rollback, to be integrated with a customer's application?



libzfs.h?


The API in there is Contracted Consolidation Private.  Note that private 
does not mean hidden; it means:


 Private

 A Private interface is an interface provided by  a  com-
 ponent  (or  product)  intended only for the use of that
 component. A Private interface might still be visible to
 or  accessible  by  other components. Because the use of
 interfaces private to another  component  carries  great
 stability  risks,  such use is explicitly not supported.
 Components not supplied by Sun Microsystems  should  not
 use Private interfaces.

 Most Private interfaces are not  documented.  It  is  an
 exceptional case when a Private interface is documented.
 Reasons for documenting a Private interface include, but
 are  not  limited  to,  the intention that the interface
 might be reclassified to one  of  the  public  stability
 level classifications in the future or the fact that the
 interface is inordinately visible.

That "not suppied by Sun Microsystems" should change to be not included 
as part of the OpenSolaris distribution.


--
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Moore, Joe
Joerg Schilling wrote:
> James Andrewartha  wrote:
> > Recently there's been discussion [1] in the Linux community about how 
> > filesystems should deal with rename(2), particularly in the case of a crash.
> > ext4 was found to truncate files after a crash, that had been written with
> > open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
> >  because ext4 uses delayed allocation and may not write the contents to disk
> > immediately, but commits metadata changes quite frequently. So when
> > rename("foo.tmp","foo") is committed to disk, it has a length of zero which
> > is later updated when the data is written to disk. This means after a crash,
> > "foo" is zero-length, and both the new and the old data has been lost, which
> > is undesirable. This doesn't happen when using ext3's default settings
> > because ext3 writes data to disk before metadata (which has performance
> > problems, see Firefox 3 and fsync[2])
> >
> > Ted T'so's (the main author of ext3 and ext4) response is that applications
> > which perform open(),write(),close(),rename() in the expectation that they
> > will either get the old data or the new data, but not no data at all, are
> > broken, and instead should call open(),write(),fsync(),close(),rename().
>
> The only guaranteed way to have the file "new" in a stable state on the
> disk
> is to call:
> 
> f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
> write(f, "dat", size);
> fsync(f);
> close(f);

AFAIUI, the ZFS transaction group maintains write ordering, at least as far as 
write()s to the file would be in the ZIL ahead of the rename() metadata updates.

So I think the atomicity is maintained without requiring the application to 
call fsync() before closing the file.  If the TXG is applied and the rename() 
is included, then the file writes have been too, so foo would have the new 
contents.  If the TXG containing the rename() isn't complete and on the ZIL 
device at crash time, foo would have the old contents.

Posix doesn't require the OS to sync() the file contents on close for local 
files like it does for NFS access?  How odd.

--Joe

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Casper . Dik


>AFAIUI, the ZFS transaction group maintains write ordering, at least as far as 
>write()s to the file would be in the ZIL ahead of the rename() metadata updates.
>
>So I think the atomicity is maintained without requiring the application to 
>call fsync() before closing the file.  If the TXG is applied and the rename() 
>is included, then the file writes have been too, so foo would have the new 
>contents.  If the TXG containing the rename() isn't complete and on the ZIL 
>device at crash time, foo would have the old contents.
>
>Posix doesn't require the OS to sync() the file contents on close for local 
>files like it does for NFS access?  How odd.

Perhaps sync() but not fsync().

But I'm not sure that that is the case.  UFS does that: it schedules 
writing the modified content when the file is closed, but only on the last 
close.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Public ZFS API ?

2009-03-18 Thread Erast Benson
On Tue, 2009-03-17 at 14:53 -0400, Cherry Shu wrote:
> Are there any plans for an API that would allow ZFS commands, including 
> snapshot/rollback, to be integrated with a customer's application?

Sounds like you are looking for an abstraction layer on top of an
integrated solution such as NexentaStor. Take a look at the API it provides
here:

http://www.nexenta.com/nexentastor-api

SA-API has bindings for C, C++, Perl, Python and Ruby. This
documentation contains examples and samples to demonstrate SA-API
applications in C, C++, Perl, Python and Ruby. You can develop and run
SA-API applications on both Windows and Linux platforms.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Public ZFS API ?

2009-03-18 Thread Richard Elling

Cherry Shu wrote:
Are there any plans for an API that would allow ZFS commands, including 
snapshot/rollback, to be integrated with a customer's application?


This is trivially implemented with system(3c). It is somewhat more difficult
with libzfs. So it really depends on how much work they want to do.
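
For example, a minimal sketch of the system(3c) route (the dataset name
"tank/data" and the snapshot label are placeholders, and a real application
would want to validate any user-supplied strings before building the
command line):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char cmd[256];

	/* zfs(1M) does the real work; we only build and run the command line */
	(void) snprintf(cmd, sizeof (cmd),
	    "zfs snapshot tank/data@%s", "before-upgrade");
	if (system(cmd) != 0) {
		(void) fprintf(stderr, "snapshot failed\n");
		return (1);
	}

	/* rolling back to that snapshot is the same idea */
	(void) snprintf(cmd, sizeof (cmd),
	    "zfs rollback -r tank/data@%s", "before-upgrade");
	if (system(cmd) != 0) {
		(void) fprintf(stderr, "rollback failed\n");
		return (1);
	}
	return (0);
}

Capturing command output (for example from "zfs list") works the same way
with popen(3c) instead of system(3c).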
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Bob Friesenhahn

On Wed, 18 Mar 2009, Joerg Schilling wrote:


The problem in this case is not whether rename() is atomic but whether the
file that replaces the old file in an atomic rename() operation is in a
stable state on the disk before calling rename().


This topic is quite disturbing to me ...


The calling sequence of the failing code was:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, "dat", size);
close(f);
rename("new", "old");

The only guaranteed way to have the file "new" in a stable state on the disk
is to call:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, "dat", size);
fsync(f);
close(f);


But the problem is not that the file "new" is in an unstable state. 
The problem is that it seems that some filesystems are not preserving 
the ordering of requests.  Failing to preserve the ordering of 
requests is fraught with peril.


POSIX does not care about "disks" or "filesystems".  The only correct 
behavior is for operations to be applied in the order that they are 
requested of the operating system.  This is a core function of any 
operating system.  It is therefore ok for some (or all) of the data 
which was written to "new" to be lost, or for the rename operation to 
be lost, but it is not ok for the rename to end up with a corrupted 
file with the new name.


In summary, I don't agree with you that the misbehavior is correct, 
but I do agree that copious expensive fsync()s should be assured to 
work around the problem.


As it happens, current versions of my own application should be safe 
from this Linux filesystem bug, but older versions are not.   There is 
even a way to request fsync() on every file close, but that could be 
quite expensive so it is not the default.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Dyer-Bennet

On Wed, March 18, 2009 05:08, Joerg Schilling wrote:

> The problem in this case is not whether rename() is atomic but whether the
> file that replaces the old file in an atomic rename() operation is in a
> stable state on the disk before calling rename().

Good, I was hoping somebody saw it that way.

People tend to assume that a successful close() guarantees the data
written to that file is on disk, and I don't believe that is actually
promised by POSIX (though I'm by no means a POSIX rules lawyer) or most
other modern systems.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread A Darren Dunham
On Tue, Mar 17, 2009 at 03:51:25PM -0700, Neal Pollack wrote:
> >Step 3, you'll be presented with the disks to be selected as in 
> >previous releases. So, for example, to select the boot disks on the 
> >Thumper,
> >select both of them:
> >
> >[x] c5t0d0
> >[x] c4t0d0
> 
> Why have the controller numbers/mappings changed between Solaris 10
> and Solaris Nevada?  I just installed Solaris Nevada 110 to see what
> it would do.  Thank you, and I now understand that to find the disk
> name, like above c5t0d0 for physical slot 0 on X4500, I can use
> "cfgadm | grep sata3/0"

> I also now understand that in the installer screens, I can select 2 
> disks and they
> will become a mirrored root zpool.
> 

> What I do not understand, is that on Solaris Nevada 110,  the x4500 
> Thumper physical
> disk slots 0 and 1 are labeled as controller 3 and not controller 5. 
> For example;
> 
> # cfgadm | grep sata3/0
> sata3/0::dsk/c3t0d0disk connectedconfigured   ok
> # cfgadm | grep sata3/4
> sata3/4::dsk/c3t4d0disk connectedconfigured   ok
> # uname -a
> SunOS zcube-1 5.11 snv_110 i86pc i386 i86pc

The numberings are not pre-set, and probably have nothing to do with
Solaris 10 vs Nevada (or ZFS).

Controller numberings are sequential as they are discovered by the OS.
So a different probe order, post-boot hardware installations, or the point
at which drivers get installed can cause the assigned numbers to be
different on different machines.

> Of course, that means I should stay away from all the X4500 and ZFS docs if
> I run Solaris Nevada on an X4500?

Why would that be?  It doesn't claim that there will be a particular
mapping for a particular X4500.

> Any ideas why the mapping is not matching s10 or the docs?

As far as I read, the docs are giving you an example.  They're not
declaring that yours will be the same.  

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Richard Elling

Bob Friesenhahn wrote:
As it happens, current versions of my own application should be safe 
from this Linux filesystem bug, but older versions are not. There is 
even a way to request fsync() on every file close, but that could be 
quite expensive so it is not the default. 


Pragmatically, it is much easier to change the file system once, than
to test or change the zillions of applications that might be broken.
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Tim
On Wed, Mar 18, 2009 at 10:59 AM, A Darren Dunham  wrote:

> On Tue, Mar 17, 2009 at 03:51:25PM -0700, Neal Pollack wrote:
> > >Step 3, you'll be presented with the disks to be selected as in
> > >previous releases. So, for example, to select the boot disks on the
> > >Thumper,
> > >select both of them:
> > >
> > >[x] c5t0d0
> > >[x] c4t0d0
> >
> > Why have the controller numbers/mappings changed between Solaris 10
> > and Solaris Nevada?  I just installed Solaris Nevada 110 to see what
> > it would do.  Thank you, and I now understand that to find the disk
> > name, like above c5t0d0 for physical slot 0 on X4500, I can use
> > "cfgadm | grep sata3/0"
>
> > I also now understand that in the installer screens, I can select 2
> > disks and they
> > will become a mirrored root zpool.
> >
>
> > What I do not understand, is that on Solaris Nevada 110,  the x4500
> > Thumper physical
> > disk slots 0 and 1 are labeled as controller 3 and not controller 5.
> > For example;
> >
> > # cfgadm | grep sata3/0
> > sata3/0::dsk/c3t0d0disk connectedconfigured   ok
> > # cfgadm | grep sata3/4
> > sata3/4::dsk/c3t4d0disk connectedconfigured   ok
> > # uname -a
> > SunOS zcube-1 5.11 snv_110 i86pc i386 i86pc
>
> The numberings are not pre-set, and probably have nothing to do with
> Solaris 10 vs Nevada (or ZFS).
>
> Controller numberings are sequential as they are discovered by the OS.
> So different probe order, post-boot hardware installations, or when
> drivers get installed can case the number assigned to be different on
> different machines.
>
> > Of course, that means I shold stay away from all the X4500 and ZFS docs
> if
> > I run Solaris Nevada on an X4500?
>
> Why would that be?  It doesn't claim that there will be a particular
> mapping for a particular X4500.
>
> > Any ideas why the mapping is not matching s10 or the docs?
>
> As far as I read, the docs are giving you an example.  They're not
> declaring that yours will be the same.
>

Just an observation, but it sort of defeats the purpose of buying sun
hardware with sun software if you can't even get a "this is how your drives
will map" out of the deal...

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Richard Elling

Tim wrote:


Just an observation, but it sort of defeats the purpose of buying sun 
hardware with sun software if you can't even get a "this is how your 
drives will map" out of the deal... 


Sun could fix that, but would you really want a replacement for BIOS?
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Bryan Allen
+--
| On 2009-03-18 10:14:26, Richard Elling wrote:
| 
| >Just an observation, but it sort of defeats the purpose of buying sun 
| >hardware with sun software if you can't even get a "this is how your 
| >drives will map" out of the deal... 
| 
| Sun could fix that, but would you really want a replacement for BIOS?

Well, actually... :)
-- 
bda
Cyberpunk is dead.  Long live cyberpunk.
http://mirrorshades.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Nicolas Williams
On Wed, Mar 18, 2009 at 11:15:48AM -0400, Moore, Joe wrote:
> Posix doesn't require the OS to sync() the file contents on close for
> local files like it does for NFS access?  How odd.

Why should it?  If POSIX is agnostic as to system crashes / power
failures, then why should it say anything about when data should hit the
disk in the absence of explicit sync()/fsync() calls?

NFS is a different beast though.  Client cache coherency and other
issues come up.  So to maintain POSIX semantics a number of NFS
operations must be synchronous and close() on the client requires
flushing dirty buffers to the server.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Tim
On Wed, Mar 18, 2009 at 12:14 PM, Richard Elling
wrote:

> Tim wrote:
>
>>
>> Just an observation, but it sort of defeats the purpose of buying sun
>> hardware with sun software if you can't even get a "this is how your drives
>> will map" out of the deal...
>>
>
> Sun could fix that, but would you really want a replacement for BIOS?
> -- richard
>
>
Yes, I really would.  I also have a hard time believing BIOS is the issue.
I have a 7110 sitting directly below an x4240 in one of my racks... the 7110
has no issues reporting disks properly.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Neal Pollack

On 03/18/09 10:43 AM, Tim wrote:
On Wed, Mar 18, 2009 at 12:14 PM, Richard Elling 
<richard.ell...@gmail.com> wrote:


Tim wrote:


Just an observation, but it sort of defeats the purpose of
buying sun hardware with sun software if you can't even get a
"this is how your drives will map" out of the deal...


Sun could fix that, but would you really want a replacement for BIOS?
-- richard


Yes, I really would.  I also have a hard time believing BIOS is the 
issue.  I have a 7110 sitting directly below an x4240 in one of my 
racks... the 7110 has no issues reporting disks properly.


BIOS is indeed an issue.  In many x86/x64 PC architecture designs, and with 
the current enumeration design of Solaris, if you add or move a controller 
card after a previous OS installation, the controller numbers and ordering 
change on all the devices.  ZFS apparently does not care, but UFS would, 
since the BIOS designates a specific disk to boot from, and the OS would 
have a specific boot path including a controller number, such as 
/dev/dsk/c3t4d0s0, that could change and hence no longer boot.

Getting to EFI firmware, dumping BIOS, and redesigning the Solaris device 
enumeration framework would make things a little more flexible in that 
type of scenario.




--Tim


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Tim
On Wed, Mar 18, 2009 at 12:49 PM, Neal Pollack  wrote:

>  On 03/18/09 10:43 AM, Tim wrote:
>
> On Wed, Mar 18, 2009 at 12:14 PM, Richard Elling  > wrote:
>
>> Tim wrote:
>>
>>>
>>> Just an observation, but it sort of defeats the purpose of buying sun
>>> hardware with sun software if you can't even get a "this is how your drives
>>> will map" out of the deal...
>>>
>>
>>  Sun could fix that, but would you really want a replacement for BIOS?
>> -- richard
>>
>>
> Yes, I really would.  I also have a hard time believing BIOS is the issue.
> I have a 7110 sitting directly below an x4240 in one of my racks... the 7110
> has no issues reporting disks properly.
>
>
> BIOS is indeed an issue.  In many x86/x64 PC architecture designs, and the
> current enumeration design of Solaris,
> if you add controller cards, or move a controller card, after a previous OS
> installation, then the controller numbers
> and ordering changes on all the devices.  ZFS apparently does not care, but
> UFS would, since bios designates a specific
> disk to boot from, and the OS would have a specific boot path including a
> controller number such as
> /dev/dsk/c3t4d0s0 that could change, hence no longer boot.
>
> Getting to EFI firmware, dumping BIOS, and redesigning the Solaris device
> enumeration framework would
> make things a little more flexible in that type of scenario.
>
>

How does any of that affect an x4500 with onboard controllers that can't
ever be moved?

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Carsten Aulbert
Hi Tim,

Tim wrote:
> 
> How does any of that affect an x4500 with onboard controllers that can't
> ever be moved?

Well, consider one box being installed from CD (external USB-CD) and
another one which is jumpstarted via the network. The results usually
are two different boot device names :(

Q: Is there an easy way to reset this without breaking everything?

Cheers

Carsten
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Public ZFS API ?

2009-03-18 Thread Ian Collins

Darren J Moffat wrote:

Ian Collins wrote:

Cherry Shu wrote:
Are there any plans for an API that would allow ZFS commands, including 
snapshot/rollback, to be integrated with a customer's application?



libzfs.h?


The API in there is Contracted Consolidation Private.  Note that 
private does not mean hidden; it means:


 Private

 A Private interface is an interface provided by  a  com-
 ponent  (or  product)  intended only for the use of that
 component. A Private interface might still be visible to
 or  accessible  by  other components. Because the use of
 interfaces private to another  component  carries  great
 stability  risks,  such use is explicitly not supported.
 Components not supplied by Sun Microsystems  should  not
 use Private interfaces.

 Most Private interfaces are not  documented.  It  is  an
 exceptional case when a Private interface is documented.
 Reasons for documenting a Private interface include, but
 are  not  limited  to,  the intention that the interface
 might be reclassified to one  of  the  public  stability
 level classifications in the future or the fact that the
 interface is inordinately visible.

That "not suppied by Sun Microsystems" should change to be not 
included as part of the OpenSolaris distribution.


Maybe I should complete my C++ abstraction layer and try to get it 
included as part of the OpenSolaris distribution?


libzfs is too useful to keep hidden away.

--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Nicolas Williams
On Wed, Mar 18, 2009 at 11:43:09AM -0500, Bob Friesenhahn wrote:
> In summary, I don't agree with you that the misbehavior is correct, 
> but I do agree that copious expensive fsync()s should be assured to 
> work around the problem.

fsync() is, indeed, expensive.  Lots of calls to fsync() that are not
necessary for correct application operation EXCEPT as a workaround for
lame filesystem re-ordering are a sure way to kill performance.

I'd rather the filesystems were fixed than end up with sync;sync;sync;
type folklore.  Or just don't use lame filesystems.

> As it happens, current versions of my own application should be safe 
> from this Linux filesystem bug, but older versions are not.   There is 
> even a way to request fsync() on every file close, but that could be 
> quite expensive so it is not the default.

So now you pepper your apps with an option to fsync() on close()?  Ouch.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Bob Friesenhahn

On Wed, 18 Mar 2009, Richard Elling wrote:


Bob Friesenhahn wrote:
As it happens, current versions of my own application should be safe from 
this Linux filesystem bug, but older versions are not. There is even a way 
to request fsync() on every file close, but that could be quite expensive 
so it is not the default. 


Pragmatically, it is much easier to change the file system once, than
to test or change the zillions of applications that might be broken.


Yes, and particularly because fsync() can be very expensive.  At one 
time fsync() was the same as sync() for ZFS.  Presumably it is 
improved by now.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Miles Nordin
> "ja" == James Andrewartha  writes:

ja> other people are arguing that POSIX says rename(2) is atomic,

Their statement is true but it's NOT an argument against T'so who is
100% right: the applications using that calling sequence for crash
consistency are not portable under POSIX.

atomic has nothing to do with crash consistency.  

It's about the view of the filesystem by other processes on the same
system, ex., the security vulnerabilities one can have with setuid
binaries that work in /tmp if said binaries don't take advantage of
certain guarantees of atomicity to avoid race conditions.  Obviously
/tmp has zero to do with what the filesystem looks like after a crash:
it always looks _empty_.

For ext4 the argument is settled, fix the app.  But a more productive
way to approach the problem would be to look at tradeoffs between
performance and crash consistency.  Maybe we need fbarrier() (which
could return faster---it sounds like on ZFS it could be a noop)
instead of fsync(), or maybe something more, something genuinely
post-Unix like limited filesystem-transactions that can open, commit,
rollback.  It's hard for a generation that grew up under POSIX to
think outside it.

A hypothetical new API ought to help balance performance/consistency
for networked filesystems, too, like NFS or Lustre/OCFS/...  For
example, networked filesystems often promise close-to-open
consistency, and the promise doesn't necessarily have to do with
crashing.  It means,

  client A                   client B
   write
   close
   sendmsg      --->          poll
                              open
                              read    (will see all A's writes)


  client A                   client B
   write
   wait a while
   sendmsg      --->          poll
                              read    (all bets are off)

This could stand obvious improvements in two ways.  First, if I'm
trying to send data to B using the filesystem 

 (monkey chorus: don't do that!  it won't work!  you
  have to send data between nodes with
  libgnetdatasender and its associated avahi-using
  setuid-nobody daemon!  just check it out of svn.  no
  it doesn't support IPv6 but the NEXT VERSION, what,
  1000 nodes? well then you definitely don't want
  to---

  DOWN, monkeychorus!  If I feel like writing in
  Python or Javurscript or even PHP, let me.  If I
  feel like sending data through a filesystem, find a
  way to let me!  why the hell not do it?  I said
  post-POSIX.)

send USING THE FILESYSTEM, then maybe I don't want to close the file
all the time because that's slow or just annoying.  Is there some
dance I can do using locks on B or A to say, ``I need B to see the
data, but I do not necessarily need, nor want to wait, for it to be
committed to disk---I just want it consistent on all clients''?  like,
suppose I keep the file open on A and B at the same time over NFS.
Will taking a write lock on A and a read lock on B actually flush the
client's cache and get the information moved from A to B faster?
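
For reference, the lock dance itself is just ordinary POSIX advisory
locking; a minimal sketch of the writer's (client A's) side follows, with
the path a placeholder (whether it really forces the client caches to be
revalidated is exactly the open question above):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Lock or unlock the whole file; type is F_WRLCK, F_RDLCK or F_UNLCK. */
static int
lock_whole_file(int fd, short type)
{
	struct flock fl;

	(void) memset(&fl, 0, sizeof (fl));
	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;			/* zero length means "to end of file" */
	return (fcntl(fd, F_SETLKW, &fl));	/* blocking lock request */
}

int
main(void)
{
	int fd = open("/net/server/export/shared.dat", O_RDWR);

	if (fd < 0) {
		perror("open");
		return (1);
	}
	/* client A: take the write lock, write, release; client B would do
	 * the same around its read, using F_RDLCK */
	if (lock_whole_file(fd, F_WRLCK) == 0) {
		(void) write(fd, "hello", 5);
		(void) lock_whole_file(fd, F_UNLCK);
	}
	(void) close(fd);
	return (0);
}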

Second, we've discussed before NFSv3 write-write-write-commit batching
doesn't work across close/open, so people need slogs to make their
servers fast for the task of writing thousands of tiny files while for
mounting VM disk images over NFS the slog might not be so badly
needed.  Even with the slog, the tiny-files scenario would be slowed
down by network roundtrips.  If we had a transaction API, we could
open a transaction, write 1000 files, then close it.  On a high-rtt
network this could be many orders of magnitude faster than what we
have now.  but it's hard to imagine a transactional API that doesn't
break the good things about POSIX-style like ``relatively simple'',
``apparently-stateless NFS client-server sessions'', ``advisory
locking only'', ...


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mount ZFS hangs on boot

2009-03-18 Thread Brent Jones
On Wed, Mar 18, 2009 at 11:28 AM, Miles Nordin  wrote:
>> "bj" == Brent Jones  writes:
>
>    bj> I only have about 50 filesystems, and just a handful of
>    bj> snapshots for each filesystem.
>
> there were earlier stories of people who had imports taking hours to
> complete with no feedback because ZFS was rolling forward some
> partly-completed operation interrupted by the crash, like destroying a
> snapshot or something.  maybe you should just wait.
>

Wait I did, and it did finally come up.
A partially completed operation may make sense: when the iSCSI target was
blocked due to a Windows box hanging and the connection not letting go, a
ZFS destroy on that pool never did complete.  So maybe it tried to finish
that action.

A mystery for sure, but it's up and working now.

-- 
Brent Jones
br...@servuhome.net
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Miles Nordin
> "c" == Miles Nordin  writes:

 c> fbarrier()

on second thought that couldn't help this problem.  The goal is to
associate writing to the directory (rename) with writing to the file
referenced by that inode/handle (write/fsync/``fbarrier''), and in
POSIX these two things are pretty distant and unrelated to each other.
The POSIX way to associate these two things is to wait for fsync() to
return before asking for the rename.  The waiting is expressive---it's
an extremely simple, easy-to-understand API for associating one thing
with another.  I thought maybe this was so simple there was only one
thing, not two, so the wait could be skipped, but I am wrong.

It is too bad because as others have said it means these fsync()'s
will have to go in to make the app correct/portable with the API we
have to work under, even though ZFS has certain convenient quirks and
probably doesn't need them.

IMHO the best reaction to the KDE hysteria would be to make sure
SQLite and BerkeleyDB are fast as possible and effortlessly correct on
ZFS, and anything that's slow because of too much synchronous writing
to tiny files should use a library instead.  This is not currently the
case because for high performance one has to manually match DB and ZFS
record sizes which isn't practical for these tiny throwaway databases
that must share a filesystem with nonDB stuff, and there might be room
for improvement in terms of online defragmentation too.
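
For what it's worth, the record-size matching itself is mechanical; a
minimal sketch, assuming the database lives on a dataset created with
something like "zfs create -o recordsize=8k tank/db" (the dataset, path
and the 8 KB figure are placeholders), would be:

#include <stdio.h>
#include <sqlite3.h>	/* build with -lsqlite3 */

int
main(void)
{
	sqlite3 *db;
	char *err = NULL;

	if (sqlite3_open("/tank/db/test.db", &db) != SQLITE_OK) {
		(void) fprintf(stderr, "open: %s\n", sqlite3_errmsg(db));
		return (1);
	}
	/* The page size must be set before the first table is created,
	 * so that database pages line up with the ZFS recordsize. */
	if (sqlite3_exec(db, "PRAGMA page_size = 8192;",
	    NULL, NULL, &err) != SQLITE_OK ||
	    sqlite3_exec(db, "CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT);",
	    NULL, NULL, &err) != SQLITE_OK) {
		(void) fprintf(stderr, "sqlite: %s\n", err ? err : "unknown");
		sqlite3_free(err);
		(void) sqlite3_close(db);
		return (1);
	}
	(void) sqlite3_close(db);
	return (0);
}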


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Casper . Dik

>On Wed, Mar 18, 2009 at 11:43:09AM -0500, Bob Friesenhahn wrote:
>> In summary, I don't agree with you that the misbehavior is correct, 
>> but I do agree that copious expensive fsync()s should be assured to 
>> work around the problem.
>
>fsync() is, indeed, expensive.  Lots of calls to fsync() that are not
>necessary for correct application operation EXCEPT as a workaround for
>lame filesystem re-ordering are a sure way to kill performance.
>
>I'd rather the filesystems were fixed than end up with sync;sync;sync;
>type folklore.  Or just don't use lame filesystems.
>
>> As it happens, current versions of my own application should be safe 
>> from this Linux filesystem bug, but older versions are not.   There is 
>> even a way to request fsync() on every file close, but that could be 
>> quite expensive so it is not the default.
>
>So now you pepper your apps with an option to fsync() on close()?  Ouch.


fsync() was always a wart.

Many of the Unix filesystem writers didn't think that it was a problem, but it
still is.  This is now part of the folklore: "you must fsync".

But why do filesystem writers insist that the filesystem can reorder
all operations?  And why do they believe that "meta data" is more
important?

Clearly, that is false: how else can you rename files which the system 
hasn't written already?

I noticed that our old ufs code issued two synchronous writes when
creating a file.  Unfortunately, it should have used three even when we 
don't care what's in the file.

Casper



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Dyer-Bennet

On Wed, March 18, 2009 11:43, Bob Friesenhahn wrote:
> On Wed, 18 Mar 2009, Joerg Schilling wrote:
>>
>> The problem in this case is not whether rename() is atomic but whether
>> the
>> file that replaces the old file in an atomic rename() operation is in a
>> stable state on the disk before calling rename().
>
> This topic is quite disturbing to me ...
>
>> The calling sequence of the failing code was:
>>
>> f = open("new", O_WRONLY|O_CREATE|O_TRUNC, 0666);
>> write(f, "dat", size);
>> close(f);
>> rename("new", "old");
>>
>> The only granted way to have the file "new" in a stable state on the
>> disk
>> is to call:
>>
>> f = open("new", O_WRONLY|O_CREATE|O_TRUNC, 0666);
>> write(f, "dat", size);
>> fsync(f);
>> close(f);
>
> But the problem is not that the file "new" is in an unstable state.
> The problem is that it seems that some filesystems are not preserving
> the ordering of requests.  Failing to preserve the ordering of
> requests is fraught with peril.

Only in very limited cases.  For example, writing the blocks of a file can
occur in any order, so long as no block is written twice and so long as no
reads are performed.  It simply doesn't matter what order that goes to
disk in.  As soon as somebody reads one of the blocks written, then some
of the ordering becomes important.

You're trying, I think, to argue from first principles; may I suggest that
a lot is known about filesystem (and database) semantics, and that we will
get further if we work within what's already known about that, rather than
trying to reinvent the wheel from scratch?

>
> POSIX does not care about "disks" or "filesystems".  The only correct
> behavior is for operations to be applied in the order that they are
> requested of the operating system.  This is a core function of any
> operating system.

Is this what it actually says in the POSIX documents?  Or in any other
filesystem formal definition?

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Dyer-Bennet

On Wed, March 18, 2009 11:59, Richard Elling wrote:
> Bob Friesenhahn wrote:
>> As it happens, current versions of my own application should be safe
>> from this Linux filesystem bug, but older versions are not. There is
>> even a way to request fsync() on every file close, but that could be
>> quite expensive so it is not the default.
>
> Pragmatically, it is much easier to change the file system once, than
> to test or change the zillions of applications that might be broken.

On the other hand, by doing so we've set limits on the behavior of all
future applications.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread Neal Pollack

On 03/18/09 11:09 AM, Tim wrote:



On Wed, Mar 18, 2009 at 12:49 PM, Neal Pollack > wrote:


On 03/18/09 10:43 AM, Tim wrote:

On Wed, Mar 18, 2009 at 12:14 PM, Richard Elling
<richard.ell...@gmail.com> wrote:

Tim wrote:


Just an observation, but it sort of defeats the purpose
of buying sun hardware with sun software if you can't
even get a "this is how your drives will map" out of the
deal...


Sun could fix that, but would you really want a replacement
for BIOS?
-- richard


Yes, I really would.  I also have a hard time believing BIOS is
the issue.  I have a 7110 sitting directly below an x4240 in one
of my racks... the 7110 has no issues reporting disks properly.


BIOS is indeed an issue.  In many x86/x64 PC architecture designs,
and the current enumeration design of Solaris,
if you add controller cards, or move a controller card, after a
previous OS installation, then the controller numbers
and ordering changes on all the devices.  ZFS apparently does not
care, but UFS would, since bios designates a specific
disk to boot from, and the OS would have a specific boot path
including a controller number such as
/dev/dsk/c3t4d0s0 that could change, hence no longer boot.

Getting to EFI firmware, dumping BIOS, and redesigning the Solaris
device enumeration framework would
make things a little more flexible in that type of scenario.



How does any of that affect an x4500 with onboard controllers that 
can't ever be moved?


Stick a fibre channel controller card into your X4500 PCI slot, then go 
back and look at your controller numbering, even for the built-in disk 
controller chips.  Here is the cfgadm output for an X4500 that I set up 
yesterday.  Notice that the first two controller numbers are for the fibre 
channel devices, and then notice that the disk controller numbers no longer 
match your documentation, or your blogs about suggested configuration:

$ cat zcube1.txt
Ap_Id                          Type  Receptacle  Occupant      Condition
c6                             fc    connected   unconfigured  unknown
c7                             fc    connected   unconfigured  unknown

sata0/0::dsk/c0t0d0            disk  connected   configured    ok
sata0/1::dsk/c0t1d0            disk  connected   configured    ok
sata0/2::dsk/c0t2d0            disk  connected   configured    ok
sata0/3::dsk/c0t3d0            disk  connected   configured    ok
sata0/4::dsk/c0t4d0            disk  connected   configured    ok
sata0/5::dsk/c0t5d0            disk  connected   configured    ok
sata0/6::dsk/c0t6d0            disk  connected   configured    ok
sata0/7::dsk/c0t7d0            disk  connected   configured    ok
sata1/0::dsk/c1t0d0            disk  connected   configured    ok
sata1/1::dsk/c1t1d0            disk  connected   configured    ok
sata1/2::dsk/c1t2d0            disk  connected   configured    ok
sata1/3::dsk/c1t3d0            disk  connected   configured    ok
sata1/4::dsk/c1t4d0            disk  connected   configured    ok
sata1/5::dsk/c1t5d0            disk  connected   configured    ok
sata1/6::dsk/c1t6d0            disk  connected   configured    ok
sata1/7::dsk/c1t7d0            disk  connected   configured    ok
sata2/0::dsk/c2t0d0            disk  connected   configured    ok
sata2/1::dsk/c2t1d0            disk  connected   configured    ok
sata2/2::dsk/c2t2d0            disk  connected   configured    ok
sata2/3::dsk/c2t3d0            disk  connected   configured    ok
sata2/4::dsk/c2t4d0            disk  connected   configured    ok
sata2/5::dsk/c2t5d0            disk  connected   configured    ok
sata2/6::dsk/c2t6d0            disk  connected   configured    ok
sata2/7::dsk/c2t7d0            disk  connected   configured    ok
sata3/0::dsk/c3t0d0            disk  connected   configured    ok   <<-- Boot disk, slot 0
sata3/1::dsk/c3t1d0            disk  connected   configured    ok
sata3/2::dsk/c3t2d0            disk  connected   configured    ok
sata3/3::dsk/c3t3d0            disk  connected   configured    ok
sata3/4::dsk/c3t4d0            disk  connected   configured    ok   <<-- Boot disk, slot 1
sata3/5::dsk/c3t5d0            disk  connected   configured    ok
sata3/6::dsk/c3t6d0            disk  connected   configured    ok
sata3/7::dsk/c3t7d0            disk  connected   configured    ok
sata4/0::dsk/c4t0d0            disk  connected   configured    ok
sata4/1::dsk/c4t1d0            disk  connected   configured    ok
sata4/2::dsk/c4t2d0            disk  connected   configured    ok
sata4/3::dsk/c4t3d0            disk  connecte

Re: [zfs-discuss] How do I "mirror" zfs rpool, x4500?

2009-03-18 Thread A Darren Dunham
On Wed, Mar 18, 2009 at 07:13:41PM +0100, Carsten Aulbert wrote:
> Well, consider one box being installed from CD (external USB-CD) and
> another one which is jumpstarted via the network. The results usually
> are two different boot device names :(
> 
> Q: Is there an easy way to reset this without breaking everything?

The mapping should be in /dev/cfg (and possibly portions of
/etc/path_to_inst).  In the old days I'd say that changing a non-boot
controller there should be enough.

I'm not sure if anything in the boot archive needs to be changed as
well.

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] SQLite3 on ZFS (Re: rename(2), atomicity, crashes and fsync())

2009-03-18 Thread Nicolas Williams
On Wed, Mar 18, 2009 at 03:01:30PM -0400, Miles Nordin wrote:
> IMHO the best reaction to the KDE hysteria would be to make sure
> SQLite and BerkeleyDB are fast as possible and effortlessly correct on
> ZFS, and anything that's slow because of too much synchronous writing

I tried to do that for SQLite3.

I ran into these problems:

1) The max page size for SQLite3 is 16KB.  It can be made 32KB but I got
   some tests to core dump when I did that.  It cannot go beyond that
   without massive changes to SQLite3.  Or maybe the sizes in question
   were 32KB and 64KB -- either way, smaller than ZFS' preferred block
   size.

2) The SQLite3 tests depend on the page size being 1KB.  So changing
   SQLite3 to select the underlying filesystem's preferred block size
   causes spurious test failures.

3) The default SQLite3 cache size becomes a very small 60 or so pages
   when maxing the page size.  I suspect that will mean more pread(2)
   syscalls; whether that's a problem or not, I'm not sure.

Therefore I held off putting back this change to SQLite3 in the
OpenSolaris SFW consolidation.

Nico
-- 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Public ZFS API ?

2009-03-18 Thread Cherry Shu

Thank you all for replying to my email!

Will pass all the information to the customer.

Thanks,
Cherry

Darren J Moffat wrote:

Ian Collins wrote:

Cherry Shu wrote:
Are there any plans for an API that would allow ZFS commands, including 
snapshot/rollback, to be integrated with a customer's application?



libzfs.h?


The API in there is Contracted Consolidation Private.  Note that 
private does not mean hidden; it means:


 Private

 A Private interface is an interface provided by  a  com-
 ponent  (or  product)  intended only for the use of that
 component. A Private interface might still be visible to
 or  accessible  by  other components. Because the use of
 interfaces private to another  component  carries  great
 stability  risks,  such use is explicitly not supported.
 Components not supplied by Sun Microsystems  should  not
 use Private interfaces.

 Most Private interfaces are not  documented.  It  is  an
 exceptional case when a Private interface is documented.
 Reasons for documenting a Private interface include, but
 are  not  limited  to,  the intention that the interface
 might be reclassified to one  of  the  public  stability
 level classifications in the future or the fact that the
 interface is inordinately visible.

That "not suppied by Sun Microsystems" should change to be not 
included as part of the OpenSolaris distribution.




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Magda

On Mar 18, 2009, at 12:43, Bob Friesenhahn wrote:

POSIX does not care about "disks" or "filesystems".  The only  
correct behavior is for operations to be applied in the order that  
they are requested of the operating system.  This is a core function  
of any operating system.  It is therefore ok for some (or all) of  
the data which was written to "new" to be lost, or for the rename  
operation to be lost, but it is not ok for the rename to end up with  
a corrupted file with the new name.


Out of curiosity, is this what POSIX actually specifies? If that is  
the case, wouldn't that mean that the behaviour of ext3/4 is  
incorrect? (Assuming that it does re-order operations.)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread James Litchfield

POSIX has a Synchronized I/O Data (and File) Integrity Completion
definition (line 115434 of the Issue 7 (POSIX.1-2008) specification).
What it says is that writes for a byte range in a file must complete
before any pending reads for that byte range are satisfied.

It does not say that if you have 3 pending writes and pending reads for a
byte range, the writes must complete in the order issued - simply that
they must all complete before any reads complete.  See lines 71371-71376
in the write() discussion.  The specification explicitly avoids discussing
the "behavior of concurrent writes to a file from multiple processes" and
suggests that applications doing this "should use some form of concurrency
control."

It is true that because of these semantics, many file system
implementations will use locks to ensure that no reads can occur in the
entire file while writes are happening, which has the side effect of
ensuring the writes are executed in the order they are issued.  This is an
implementation detail that can be complicated by async I/O as well.  The
only guarantee POSIX offers is that all pending writes to the relevant
byte range in the file will be completed before a read to that byte range
is allowed.  An in-progress read is expected to block any writes to the
relevant byte range until the read completes.

The specification also does not say that the bits for a file must end up
on the disk without an intervening fsync() operation unless you've
explicitly asked for data synchronization (O_SYNC, O_DSYNC) when you
opened the file.  The fsync() discussion (line 31956) says that the bits
must undergo a "physical write of data from the buffer cache" that should
be completed when the fsync() call returns.  If there are errors, the
return from the fsync() call should express the fact that one or more
errors occurred.  The only guarantee that the physical write happens is if
the system supports the _POSIX_SYNCHRONIZED_IO option.  If not, the advice
is to read the system's conformance documentation (if any) to see what
actually does happen.  In the case that _POSIX_SYNCHRONIZED_IO is not
supported, it's perfectly allowable for fsync() to be a no-op.
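
To illustrate that last point, here is a minimal sketch of asking for
data synchronization at open() time instead of relying on a later fsync()
(file name and payload are placeholders; as noted above, what this
actually guarantees depends on _POSIX_SYNCHRONIZED_IO support):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *data = "dat";
	/* O_DSYNC: each write() returns only after the data (though not
	 * necessarily all of the metadata) has reached stable storage. */
	int fd = open("new", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC, 0666);

	if (fd < 0 ||
	    write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
	    close(fd) < 0) {
		perror("synchronized write");
		return (1);
	}
	return (0);
}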

Jim Litchfield
---
David Magda wrote:

On Mar 18, 2009, at 12:43, Bob Friesenhahn wrote:

POSIX does not care about "disks" or "filesystems".  The only correct 
behavior is for operations to be applied in the order that they are 
requested of the operating system.  This is a core function of any 
operating system.  It is therefore ok for some (or all) of the data 
which was written to "new" to be lost, or for the rename operation to 
be lost, but it is not ok for the rename to end up with a corrupted 
file with the new name.


Out of curiosity, is this what POSIX actually specifies? If that is 
the case, wouldn't that mean that the behaviour of ext3/4 is 
incorrect? (Assuming that it does re-order operations.)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Miles Nordin
> "dm" == David Magda  writes:

dm> is this what POSIX actually specifies?

i doubt it.  If it did, it would basically mandate a log-structured /
COW filesystem, which, although not a _bad_ idea, is way too far from
a settled debate to be enshrining in a mandatory ``standard'' (ex.,
the database fragmentation problems with LFS, WaFL, ZFS.  And the
large number of important deployed non-COW filesystems on POSIX
systems).  

There's no other so-far-demonstrated way than log-structured/COW to
achieve this property which some people think they're entitled to take
for granted: ``after a reboot, the system must appear as though it did
not reorder any writes.  The filesystem must recover to some exact
state that it passed through in the minutes leading up to the crash,
some state as observed from the POSIX userland (above all write
caches).''

It's a nice property.  Nine years ago when i was trying to get Linux
users to try NetBSD, I flogged this as a great virtue of LFS.  And if
I were designing a non-POSIX operating system to replace Unix, I'd
probably promise developers this property.  But achieving it is just
too constraining to belong in POSIX.

If you can find some application that can safely disable some safety
feature when it knows it's running on ZFS that it needs to keep on
other filesystems and thus perform absurdly faster on ZFS with no
risk, then you can demonstrate the worth of promising this property.
The fsync() that i'm sure KDE will add into all their broken apps is
such an example, but I doubt it will be ``absurdly faster'' enough to
get ZFS any attention.  Maybe something to do with virtual disk
backing stores for VM's?

But I don't think pushing exaggerated expectations as ``obvious'' in
front of people who don't know the nasty details yet, nor overstating
POSIX's minimal crash requirements, is going to work.  There are just
too many smart people ready to defend the non-log-stuctured
write-in-place filesystems.  And I believe it *is* possible to write a
correct database or MTA, even with the level of guarantee those
systems provide (provide in practice, not provide as specified by
POSIX).

And the guarantees ARE minimal---just:

 http://www.google.com/search?q=POSIX+%22crash+consistency%22

and you'll find even people against T'so's who want to change ext4
still agree POSIX is on T'so's side.

My own opinion is that the apps are unportable and need to be fixed,
and that what the side against T'so wants changed is so poorly stated
it's no more than ad-hoc ``make the apps not broken, because otherwise
anything which does the exact same thing as the broken app we just
found will also be broken!!!''; it's not a clearly articulable
guarantee like that AIUI provided by transaction groups.

But linux app developers never seem to give much of a flying shit
whether their apps work on notLinux, which is why they think it's
``practical'' to change ext4 rather than the nonconformant app, so
dragging out the POSIX horse for flogging in support of ``change
ext4'' looks highly hypocritical, while flogging the same horse to
support ``ZFS is the only POSIXly correct filesystem on the planet''
is flatly incorrect but at least not hypocritical. :)


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss