On Mon, Apr 11, 2016 at 9:56 AM, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
> > > +Raghavendra G, who implemented this option in write-behind, to this
> > > upstream patch review discussion
>
> Thanks Pranith. Kritika did inform us about the discussion. We were working on ways to find solutions to the problems raised (and it was a long festive weekend in Bangalore).
>
> Sorry for top-posting. I am trying to address two issues raised in this mail thread:
> 1. No way to identify whether setting the option succeeded in gfapi.
> 2. Why is retaining the cache after an fsync failure not the default behaviour?
>
> 1. No way to identify whether setting the option succeeded in gfapi:
>
> I've added Poornima and Raghavendra Talur, who work on gfapi, to assist on this.

There is currently no such feature in gfapi. We could think of two possible solutions:

a. Have the qemu-glusterfs driver require a version of glusterfs-api.rpm which is known to have this write-behind option. In that case, if you use glfs_set_xlator_option before glfs_init with the right key and value, there is no way the volume set gets rejected. Note that this is an installation-time dependency; we don't have libgfapi library versioning. This solution is good enough if the logic in qemu is:

    ret = glfs_set_xlator_option (fs, xlator, key, value);
    if (ret) {
            /* exit because we don't have the required environment */
    }
    /* proceed with work */

b. The second solution is to implement a case in the write-behind getxattr FOP to handle such a request and reply with the current setting. Look at dht_get_real_filename for a similar feature. You would have to implement logic similar to the following:

    ret = glfs_getxattr (fs, "/",
                         "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
                         &value, size);
    if ((ret == -1 && errno == ENODATA) || value == 0) {
            /* Either write-behind in the stack does not understand this
             * option, or write-behind has the required option set to off:
             * work with the assumption that there are no retries. */
    } else {
            /* We do have the required option. */
    }
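Putting the two together, a rough and untested sketch of what the qemu side could look like is below. The volume name, the server address, the "*-write-behind" xlator pattern, the "resync-failed-syncs-after-fsync" option name and the format of the xattr value are all assumptions on my part based on the patch under review; only the gfapi calls themselves are existing API.

    /* probe_wb.c: check whether write-behind will retain failed syncs.
     * Compile roughly as: gcc probe_wb.c -lgfapi */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <glusterfs/api/glfs.h>

    int
    main (void)
    {
            glfs_t *fs = glfs_new ("testvol");     /* example volume name */
            char    value[16] = {0,};
            ssize_t ret;

            if (!fs)
                    return EXIT_FAILURE;

            glfs_set_volfile_server (fs, "tcp", "localhost", 24007);

            /* Solution (a): request the behaviour before glfs_init().
             * If this fails, exit because we don't have the required
             * environment. */
            if (glfs_set_xlator_option (fs, "*-write-behind",
                                        "resync-failed-syncs-after-fsync",
                                        "on") < 0) {
                    fprintf (stderr, "required write-behind option missing\n");
                    glfs_fini (fs);
                    return EXIT_FAILURE;
            }

            if (glfs_init (fs) < 0) {
                    glfs_fini (fs);
                    return EXIT_FAILURE;
            }

            /* Solution (b): probe the proposed virtual xattr to confirm the
             * running write-behind understands the option and has it on. */
            ret = glfs_getxattr (fs, "/",
                                 "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
                                 value, sizeof (value));
            if ((ret == -1 && errno == ENODATA) ||
                (ret > 0 && strncmp (value, "off", 3) == 0)) {
                    /* write-behind does not understand the option, or it is
                     * off: assume failed syncs are never retried. */
                    printf ("no retry of failed syncs after a failed fsync\n");
            } else if (ret > 0) {
                    printf ("failed syncs retained until flush\n");
            }

            glfs_fini (fs);
            return EXIT_SUCCESS;
    }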
One negative aspect of both the solutions above is the loosely coupled nature of the logic. If the behaviour ever changes in the lower layer (gfapi or write-behind in this case), the upper layer (qemu) will have to
a. carry a dependency which defines both the minimum and maximum version of the package required, and
b. interpret the return values with more if-else conditions to maintain backward compatibility.

We are thinking of other ways too, but the above are the solutions that currently come to mind.

Thanks,
Raghavendra Talur

> 2. Why is retaining the cache after an fsync failure not the default behaviour?
>
> It was mostly so as not to break applications that synchronize with each other and rely on fsync failures to retry. Details of the discussion can be found below. The other reason was that we could not figure out POSIX's take on the state of earlier writes after an fsync failure (other filesystems, e.g. xfs and ext4, had non-uniform behaviour). The question, more specifically, was "is it correct for a write issued before a failed fsync to succeed on the backend storage (persisting happened after the fsync failure)?". I've also added Vijay Bellur, who was involved in the earlier discussion, to the cc list.
>
> Following is the discussion we had earlier with Kevin, Jeff Cody and others on the same topic. I am quoting it verbatim below.
>
> <quote>
>
> > > > > > I would actually argue that this setting would be right for everyone, not just qemu. Can you think of a case where keeping the data cached after a failed fsync would actively hurt any application? I can only think of cases where data is unnecessarily lost if data is dropped.
>
> > > > > I worry about use cases with concurrent writers to the same file and how different applications would handle fsync() failures with our new behavior.
>
> > > > Any specific scenario you're worried about?
>
> > > > > Keeping the known old behavior as the default will ensure that we do not break anything once this is out. qemu/virt users with gluster are accustomed to setting the virt group and hence no additional knobs would need to be altered by them.
>
> > > > Not changing anything is a safe way to avoid regressions. But it's also a safe way to leave bugs unfixed. I would say that the current behaviour is at least borderline buggy and very hard for applications to handle correctly.
>
> > > I tend to agree with Kevin on this. If we have an error handling strategy that is POSIX-compliant, I don't think there is a need to add one more option. Most of the time, options tend to be used with default values, which is equivalent to not providing an option at all. However, before doing that it's better we think through whether it can affect any existing deployments adversely (even when they are not POSIX-compliant).
>
> > One pattern that I can think of -
> >
> > Applications that operate on the same file from different clients through some locking or other co-ordination would have this pattern:
> >
> > lock (file), write (file), fsync (file), unlock (file);
> >
> > Now, if the first fsync() fails, based on application logic the offset used for the failed write + fsync could be re-utilized by a co-ordinating application on another node to write out legitimate data. When control returns to the application that received the failure, the subsequent write + fsync can cause data to be overwritten at the old offset along with new data being written at the new offset.
>
> Yeah. I agree. Co-ordinated applications on different mounts will have issues if they are working on the assumption that after fsync no older writes will hit the backend. Given that there seems to be a fair bit of confusion on the status of retries of older writes after an fsync failure, we can expect some regressions. So, to summarize:
>
> 1. Keep the behaviour in patchset 11 of [1] as the default. Both fsync and flush act as barriers and will make sure that either
>    a. older writes are synced to the backend, or
>    b. old writes are failed and never retried,
>    after a failed fsync/flush.
>
> 2. Modify the behaviour of patchset 11 of [1] to keep failed syncs after a failed fsync and retry them till a flush. After a flush, no retries of failed syncs will be attempted. This behaviour will be introduced as an option.
>
> 3. Transient and non-transient errors will be treated similarly and failed syncs will be retried alike.
>
> Does everyone agree on the above points? If yes, I'll modify [1] accordingly.
>
> [1] http://review.gluster.org/#/c/12594/
>
> </quote>
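P.S. For anyone following the co-ordinated writer concern quoted above, the pattern in question boils down to something like the rough sketch below (plain POSIX advisory locking; fd is an already-open descriptor on the shared file, and the offset comes from whatever co-ordination protocol the applications use):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* One iteration of the lock/write/fsync/unlock pattern from the quoted
     * discussion, run by each co-ordinating client. */
    static int
    write_record (int fd, off_t offset, const char *buf, size_t len)
    {
            struct flock lk = {
                    .l_type   = F_WRLCK,
                    .l_whence = SEEK_SET,
                    .l_start  = offset,
                    .l_len    = (off_t) len,
            };
            int ret = 0;

            /* lock (file) */
            if (fcntl (fd, F_SETLKW, &lk) == -1)
                    return -1;

            /* write (file) + fsync (file) */
            if (pwrite (fd, buf, len, offset) != (ssize_t) len)
                    ret = -1;
            else if (fsync (fd) == -1)
                    ret = -1;   /* caller assumes this region never reached the
                                 * backend and may hand the offset to a peer */

            /* unlock (file) */
            lk.l_type = F_UNLCK;
            fcntl (fd, F_SETLK, &lk);

            return ret;
    }

If fsync() fails here, the application assumes the data for that region did not and will not reach the backend; silently retrying the old writes after the failure is exactly what would break that assumption, which is why the retry behaviour is being introduced as an option rather than as the default.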