Erblichs writes:
 > Jeff Bonwick,
 > 
 >      Do you agree that there is a major tradeoff of
 >      "builds up a wad of transactions in memory"?
 > 
 >      We lose the changes if we have an unstable
 >      environment.
 > 
 >      Thus, I don't quite understand why a 2-phase
 >      approach to commits isn't done. First, take the
 >      transactions as they come and do a minimal amount
 >      of delayed writing. If the number of transactions
 >      builds up, then convert to the delayed write scheme.
 > 

I probably don't understand the proposition. It seems that
this is about making all writes synchronous and initially go
through the ZIL, then converting to the pool sync when load
builds up? The problem is that if we make all writes go
through the synchronous ZIL, the load will be limited so
much that we never build a backlog (unless we scale to
thousands of threads). So is this about an option to enable
O_DSYNC for all files?
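
For concreteness, this is what per-file O_DSYNC looks like from an
application (a minimal C sketch; the path and data are made up):
each write(2) then commits through the ZIL before returning.

    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* O_DSYNC makes every write(2) synchronous; on ZFS each
             * write is committed through the ZIL before it returns. */
            int fd = open("/tank/fs/log", O_WRONLY | O_CREAT | O_DSYNC, 0644);
            if (fd == -1)
                    return (1);
            (void) write(fd, "record\n", 7);    /* durable on return */
            (void) close(fd);
            return (0);
    }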


 >      The assumption is that not all ZFS environments are
 >      write-heavy; many are write-once, read-many type accesses.
 >      My assumption is that attribute/meta reading
 >      outweighs all other accesses.
 >      
 >      Wouldn't this approach allow minimal outstanding
 >      transactions and favor read access? Yes, the assumption
 >      is that once the "wad" is started, the amount of writing
 >      could be substantial and thus the amount of available
 >      bandwidth for reading is reduced. This would then allow
 >      for more N states to be available. Right?

So reads _are_ prioritized over pool writes by the I/O
scheduler. But it is correct that the pool sync does impact
read latency, at least on JBOD. There are already
suggestions for reducing the impact (reserved read slots,
throttling writers, ...). Also, in the next build the
overhead of the pool sync is reduced, which opens up the
possibility of testing with a smaller txg_time.
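
To show what a smaller txg_time would change, here is a minimal
sketch of the mechanism it controls (illustration only, not the
actual ZFS code): the sync thread wakes every txg_time seconds and
pushes the accumulated transaction group out as one commit.

    #include <stdio.h>
    #include <unistd.h>

    static int txg_time = 5;    /* seconds between pool syncs (tunable) */

    int
    main(void)
    {
            /* A shorter txg_time means smaller, more frequent commits:
             * less data at risk per group, more sync overhead. */
            for (int txg = 1; txg <= 3; txg++) {
                    sleep(txg_time);                 /* accumulate transactions */
                    printf("syncing txg %d\n", txg); /* stand-in for pool sync */
            }
            return (0);
    }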

I would be interested to know what problems you have
observed, to see whether we have them covered.

 > 
 >      Second, there are multiple uses of "then" (then pushes,
 >      then flushes all disk..., then writes the new uberblock,
 >      then flushes the caches again), which seems to leave
 >      some level of possible parallelism that should reduce the
 >      latency from the start to the final write. Or did you just
 >      say that for simplicity's sake?
 > 

The parallelism level of those operations seems very high to
me, and it was improved last week (for the tail end of the
pool sync). But note that the pool sync does not commonly
hold up a write or a ZIL commit. It does so only when the
storage has been saturated for tens of seconds. Given that
memory is finite, we have to throttle applications at some
point.
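
Roughly, the throttle works like this (a sketch with made-up
names, not the actual ZFS source): a transaction assignment only
blocks when earlier groups are still syncing to saturated storage,
so dirty data in memory cannot grow without bound.

    #define MAX_GROUPS_IN_FLIGHT    2

    static int groups_in_flight;    /* updated by the sync thread */

    static void
    wait_for_sync(void)
    {
            /* block until the sync thread retires a group */
    }

    static void
    tx_assign(void)
    {
            while (groups_in_flight >= MAX_GROUPS_IN_FLIGHT)
                    wait_for_sync();    /* throttle the application */
            /* common case: join the open group without waiting */
    }

    int
    main(void)
    {
            tx_assign();
            return (0);
    }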


-r

 >      Mitchell Erblich
 >      -------------------
 >      
 > 
 > Jeff Bonwick wrote:
 > > 
 > > Toby Thain wrote:
 > > > I'm no guru, but would not ZFS already require strict ordering for its
 > > > transactions ... which property Peter was exploiting to get "fbarrier()"
 > > > for free?
 > > 
 > > Exactly.  Even if you disable the intent log, the transactional nature
 > > of ZFS ensures preservation of event ordering.  Note that disk caches
 > > don't come into it: ZFS builds up a wad of transactions in memory,
 > > then pushes them out as a transaction group.  That entire group will
 > > either commit or not.  ZFS writes all the new data to new locations,
 > > then flushes all disk write caches, then writes the new uberblock,
 > > then flushes the caches again.  Thus you can lose power at any point
 > > in the middle of committing transaction group N, and you're guaranteed
 > > that upon reboot, everything will either be at state N or state N-1.
 > > 
 > > I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
 > > thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
 > > every system call.
 > > 
 > > Jeff
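
For reference, the commit sequence Jeff describes can be sketched
like this (hypothetical helper names, not the actual ZFS code).
Because the uberblock goes out only after the first cache flush, a
crash at any point leaves the pool at state N or state N-1, never
in between.

    /* Hypothetical helpers standing in for the real pool-sync steps. */
    static void write_new_data_to_new_locations(void) { }
    static void flush_all_disk_write_caches(void)     { }
    static void write_new_uberblock(void)             { }

    int
    main(void)
    {
            write_new_data_to_new_locations();  /* copy-on-write: state N-1 intact */
            flush_all_disk_write_caches();      /* data durable before it is referenced */
            write_new_uberblock();              /* atomic switch to state N */
            flush_all_disk_write_caches();      /* make the new uberblock durable */
            return (0);
    }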
