Erblichs writes:

> Jeff Bonwick,
>
> Do you agree that there is a major tradeoff of
> "builds up a wad of transactions in memory"?
>
> We lose the changes if we have an unstable
> environment.
>
> Thus, I don't quite understand why a 2-phase
> approach to commits isn't done. First, take the
> transactions as they come and do a minimal amount
> of delayed write. If the number of transactions
> builds up, then convert to the delayed write scheme.
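For concreteness, here is one way to read that two-phase proposal as a
C sketch. Everything in it is hypothetical (the names, the threshold,
the stubbed write path); it illustrates the idea, it is not ZFS code.
Transactions are written through individually while the group is
small, then left to the delayed-write commit once the count builds up.

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BACKLOG_LIMIT 8            /* made-up switchover threshold */

    struct tx {
        struct tx *next;
        int id;                        /* stand-in for the real payload */
    };

    static struct tx *group;           /* the accumulating "wad" */
    static size_t ngroup;

    /* Stub for a minimal, immediate on-disk write of one transaction. */
    static void write_through(struct tx *t)
    {
        printf("phase 1: immediate write of tx %d\n", t->id);
    }

    void submit_tx(struct tx *t)
    {
        t->next = group;               /* every tx joins the current group */
        group = t;
        ngroup++;

        /*
         * Phase 1: while the group is small, also push each transaction
         * out right away, so an unstable environment loses little.
         * Phase 2: once the count builds up, stop forcing writes and let
         * the delayed-write (transaction group) scheme commit the wad.
         */
        if (ngroup <= BACKLOG_LIMIT)
            write_through(t);
    }

    int main(void)
    {
        for (int i = 0; i < 12; i++) {
            struct tx *t = calloc(1, sizeof (*t));
            t->id = i;
            submit_tx(t);
        }
        printf("%zu txs held for the delayed group commit\n", ngroup);
        return 0;
    }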
I probably don't understand the proposition. It seems that this is
about making all writes synchronous, going initially through the ZIL,
and then converting to the pool sync when load builds up? The problem
is that if we make all writes go through the synchronous ZIL, this
will limit the load so much that we'll never build a backlog (unless
we scale to thousands of threads). So is this about an option to
enable O_DSYNC for all files?

> This assumption is that not all ZFS envs are write
> heavy versus write-once and read-many type accesses.
> My assumption is that attribute/meta reading
> outweighs all other accesses.
>
> Wouldn't this approach allow minimal outstanding
> transactions and favor read access? Yes, the assumption
> is that once the "wad" is started, the amount of writing
> could be substantial and thus the amount of available
> bandwidth for reading is reduced. This would then allow
> for more N states to be available. Right?

So the reads _are_ prioritized over pool writes by the I/O scheduler.
But it is correct that the pool sync does impact read latency, at
least on JBOD. There already are suggestions for reducing the impact
(reserved read slots, throttling writers, ...). Also, for the next
build the overhead of the pool sync is reduced, which opens up the
possibility of testing with a smaller txg_time. I would be interested
to know the problems you have observed, to see if we're covered.

> Second, there are multiple uses of "then" (then pushes,
> then flushes all disk..., then writes the new uberblock,
> then flushes the caches again), which seems to have
> some level of possible parallelism that should reduce the
> latency from the start to the final write. Or did you just
> say that for simplicity's sake?

The parallelism level of those operations seems very high to me, and
it was improved last week (for the tail end of the pool sync). But
note that the pool sync does not commonly hold up a write or a ZIL
commit. It does so only when the storage is saturated for tens of
seconds. Given that memory is finite, we have to throttle
applications at some point.

-r

> Mitchell Erblich
> -------------------
>
>
> Jeff Bonwick wrote:
> >
> > Toby Thain wrote:
> > > I'm no guru, but would not ZFS already require strict ordering for its
> > > transactions ... which property Peter was exploiting to get "fbarrier()"
> > > for free?
> >
> > Exactly. Even if you disable the intent log, the transactional nature
> > of ZFS ensures preservation of event ordering. Note that disk caches
> > don't come into it: ZFS builds up a wad of transactions in memory,
> > then pushes them out as a transaction group. That entire group will
> > either commit or not. ZFS writes all the new data to new locations,
> > then flushes all disk write caches, then writes the new uberblock,
> > then flushes the caches again. Thus you can lose power at any point
> > in the middle of committing transaction group N, and you're guaranteed
> > that upon reboot, everything will either be at state N or state N-1.
> >
> > I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool
> > thing is that on ZFS, fbarrier() is a no-op. It's implicit after
> > every system call.
> >
> > Jeff
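The commit sequence Jeff describes condenses to a few lines of C. The
helper functions below are hypothetical stand-ins, not the actual ZFS
routines, but the ordering of the steps is the one he gives:

    #include <stdio.h>

    /* Hypothetical stand-ins; not actual ZFS functions. */
    static void write_new_blocks(int txg)
    {
        printf("txg %d: write all new data to new locations\n", txg);
    }

    static void flush_disk_caches(void)
    {
        printf("flush all disk write caches\n");
    }

    static void write_uberblock(int txg)
    {
        printf("txg %d: write the new uberblock\n", txg);
    }

    /*
     * Commit transaction group N.  Power loss at any point leaves the
     * pool at state N (new uberblock durable) or at state N-1 (old
     * uberblock still the newest valid one), never anything in between.
     */
    void commit_txg(int txg)
    {
        write_new_blocks(txg);    /* copy-on-write: live data untouched */
        flush_disk_caches();      /* new blocks durable before referenced */
        write_uberblock(txg);     /* the atomic switch to state N */
        flush_disk_caches();      /* uberblock durable: state N committed */
    }

    int main(void)
    {
        commit_txg(42);
        return 0;
    }

Because every write joins this same ordered pipeline, ordering between
any two system calls is preserved automatically; that is the sense in
which fbarrier() has nothing left to do on ZFS.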
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss