>Number: 165923 >Category: kern >Synopsis: Writing to NFS-backed mmapped files fails if flushed >automatically >Confidential: no >Severity: serious >Priority: medium >Responsible: freebsd-bugs >State: open >Quarter: >Keywords: >Date-Required: >Class: sw-bug >Submitter-Id: current-users >Arrival-Date: Sun Mar 11 11:10:09 UTC 2012 >Closed-Date: >Last-Modified: >Originator: Joel Ray Holveck <jo...@juniper.net> >Release: FreeBSD 8.3-PRERELEASE i386 >Organization: Juniper Networks, Inc. >Environment: System: FreeBSD thor.piquan.org 8.3-PRERELEASE FreeBSD 8.3-PRERELEASE #2: Sat Feb 25 15:52:16 PST 2012 r...@thor.piquan.org:/usr/obj/usr/src/sys/THOR i386
>Description: When a process is using mmap to write to an NFS-backed file, under certain common circumstances, changes to data may not be properly flushed. (Note that presently, msync will report success in these conditions; this will be addressed in a separate PR.) If a process has an NFS-backed file mmapped in and dirties the data, the uid used for the RPC depends on what prompted the flush. If the process initiated the flush with msync (or sync), then the process's euid is used; this would be the expected behavior. However, if the syncer daemon initiates the flush, then uid 0 is used. Since it is common for NFS servers to use -maproot=-2:-2 or something similar to map accesses by uid 0 to a different (non-privileged) uid, the server behaves as if a different user was writing to the file than the process that's using mmap. Assuming the file is mode 0644 or more restrictive, the server will return NFSERR_ACCES. Formerly (e.g., in 8.2-RELEASE), this would cause the client's VM subsystem to go into an infinite loop: the client would attempt to flush to the server, the server returns NFSERR_ACCES, the client leaves the pages on the dirty list but still needs to flush them, repeat ad infinitum. In r223054 (on stable/8; MFC r222586), this behavior was changed: the VM system marks the pages as clean to avoid this type of loop. However, this comes with its own set of problems. As an example, consider the case where a process is gathering data into an mmap-backed datastore. The process gathers some data into the datastore. Before the process has a chance to msync, the syncer causes the datastore to be flushed, but since it fails, the pages are marked as clean. After this, even if the data-gathering process calls msync, the data will never be written to disk. As another example, the process may not explicitly call msync, and simply trust the VM system to ensure that the data is written. While such a process SHOULD call msync to verify that its data is successfully written, this is akin to checking the return value of close(2): important for data integrity but often skipped. If a process doesn't ever call msync, then the data will never be flushed. This is contrary to the mmap(2) and msync(2) man pages, which both state that "[t]he msync(2) system call is obsolete since BSD implements a coherent file system buffer cache." (A separate PR will be opened to update these man pages to consider the present bug, as well as the error-reporting usefulness of msync.) >How-To-Repeat: The trivial reproduction program would mmap a file over NFS as a user other than root, when the server has -maproot=-2:-2 (or similar), with a mode that would prevent uid -2 from writing to it. Then, change a byte, and exit. The program attached here demonstrates a few different aspects of this bug. As provided, this demonstrates a case in which the data is not properly written, but msync reports success. (The present bug is only about the data not being written; the failure of msync to report an error will be addressed in a separate PR.) See also "NOTES ON ALTERNATE SCENARIO" below. To use: 1. Set up an NFS server and a filesystem to export. Do not set -maproot (alias -r) or -mapall in exports(5) for this filesystem. 2. Set up an NFS client using FreeBSD 8.2-STABLE after svn commit 223054 (2011-06-13), 8.3-PRERELEASE, etc. Mount the NFS filesystem on /mnt. (IMPORTANT: Using 8.2 from before r223054 may have adverse effects on your network; see "EFFECTS ON 8.2-RELEASE", below.) 3. Pick a user other than root. (Your usual uid is fine.) 4. Create a file /mnt/backing-store owned by this user, mode 0644. 5. Compile the attached program. 6. (Optional): Start netstat -iw1 -I$YOUR_NFS_INTERFACE 7. Run the program as the user from step 3. Note the value that it expects to write. 8. (Optional): Once you see a 500k burst in the netstat from step 6, interrupt the program (froms step 7) with ^C. (This is just to save you from waiting.) 9. On the NFS server, view backing-store using "od -X" or similar. Note that the value from step 7 has not been properly written. * NOTES ON ALTERNATE SCENARIO The #defines WAIT_FOR_SYNC and DO_MSYNC are provided to explore other scenarios. Notably, if WAIT_FOR_SYNC is not defined, then the DO_MSYNC define allows the demonstration of the underlying bug: the data will be properly flushed if and only if DO_MSYNC is enabled. More fundamentally, the uid used to flush the data is the process's euid if DO_MSYNC is enabled, but root if it is not. This is in contradiction to the intended design of the VM system, which is intended to make msync obsolete. This obsolescence is documented in the msync(2) and mmap(2) man pages (although the error-reporting properties of msync(2) still make it useful; this will be addressed in a separate doc PR.) Note that, even if WAIT_FOR_SYNC is not defined, it still possible for the syncer daemon to attempt to flush data between when a process writes to a mapped page and when it calls msync; the call to sleep(3) in the attached program is just to force that race. (I hypothesize that this would also be the case if the page were evicted due to memory pressure, rather than flushed by the syncer.) This indicates that msync(2) is a necessary, but NOT sufficient, way for a process to verify that mapped files are flushed. The idea that it's necessary is contrary to the VM system's intent, as documented in the msync(2) and mmap(2) man pages. The fact that it's not sufficient is contrary to POSIX's assertion that msync may be used "for synchronized I/O data integrity completion" (and more explicit verbage in the informative sections). * EFFECTS ON 8.2-RELEASE If this program is run on a FreeBSD 8.2 client prior to r223054 (2011-06-13), then a different bug will appear: the kernel will enter an infinite loop in which it tries to flush the data, fails, and retries indefinitely. In extreme (but realistic and indeed observed) circumstances, this can significantly lower the client's responsiveness. It also can put significant load on both the network and the NFS server. If you find yourself in this situation (which you can identify from high traffic with netstat -i or, more reliably, using tcpdump to inspect NFS traffic), chmod the backing-store file to mode 0666, wait a moment for the client to flush the data, then chmod it back to 0644. Do NOT remove the backing-store file before it's written! If the backing-store file is removed during this loop, the error will change from NFSERR_ACCES to NFSERR_STALE, and the loop cannot (AFAIK) be stopped without rebooting the client. --- mmap-repro.c begins here --- #include <sys/types.h> #include <sys/mman.h> #include <sys/sysctl.h> #include <assert.h> #include <err.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #define WAIT_FOR_SYNC #define DO_MSYNC #define FILENAME "/mnt/backing-store" #define SIZE 0x100000 /* Large enough to see the sync when watching netstat */ #ifdef WAIT_FOR_SYNC static void sig_just_interrupt(int sig __unused) { return; } #endif int main(void) { int rv; int fd; pid_t *region; pid_t value; unsigned int i; #ifdef WAIT_FOR_SYNC int sleeptime; size_t sleeptime_size = sizeof(sleeptime); #endif /* * Setup the mapped region. */ fd = open(FILENAME, O_RDWR | O_CREAT, 0644); if (fd < 0) err(1, "%s", FILENAME); rv = ftruncate(fd, SIZE); if (rv < 0) err(1, "%s: ftruncate", FILENAME); region = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (region == NULL) err(1, "%s: mmap", FILENAME); rv = close(fd); if (rv != 0) err(1, "%s: close", FILENAME); /* * Fill the region with our pid. */ value = getpid(); printf("Writing %08x\n", value); for (i = 0; i < SIZE / sizeof(value); i++) region[i] = value; #ifdef WAIT_FOR_SYNC /* * Optionally, give the syncer time to flush the page. The * bug is that this will be written with uid 0, not our euid. */ rv = sysctlbyname("kern.filedelay", &sleeptime, &sleeptime_size, NULL, 0); if (rv != 0) err(1, "kern.filedelay"); assert(sleeptime_size == sizeof(sleeptime)); /* Add a little slack for the sync operation to complete. */ sleeptime += 2; /* * Set a handler for SIGINT while waiting. This gives the * user an opportunity to press ^C during this sleep to end it * early, if they have reason to believe that the syncer has * run (for instance, by seeing activity in netstat -iw1). */ printf("Waiting %i seconds for the syncer to run.\n", sleeptime); signal(SIGINT, sig_just_interrupt); sleep(sleeptime); signal(SIGINT, SIG_DFL); #endif /* WAIT_FOR_SYNC */ #ifdef DO_MSYNC /* * Call msync with MS_SYNC, which should ensure that the * mapped region gets properly written, and let us know if * there's a problem. * * If WAIT_FOR_SYNC is set, this call will reveal a second * aspect to the bug: msync will return success, even though * the changes were actually discarded. * * If WAIT_FOR_SYNC is not set, this will demonstrate a * difference between using msync and waiting for the pager: * the msync will cause the dirty data to be properly written. * This is at odds with the msync and mmap man pages, which * specify that "[t]he msync() system call is obsolete since * BSD implements a coherent file system buffer cache." * * Note that calling msync here has the same end result as * running sync(8) with the same euid as this process. */ rv = msync(region, SIZE, MS_SYNC); if (rv) err(1, "msync"); #endif /* DO_MSYNC */ /* This munmap is not needed, but is here for completeness. */ rv = munmap(region, SIZE); if (rv) err(1, "munmap"); return 0; } --- mmap-repro.c ends here --- >Fix: As a workaround, applications using mmap-backed files should call msync explicitly. Note that there is a race condition still present; the syncer daemon may attempt to flush the mapped file before the process calls msync. However, calling msync soon after changing the mapped region does lower the probability of failure. I haven't implemented a fix, but my current thoughts are as follows. In sys/vm/vm_object.c:vm_object_set_writeable_dirty, the vm_object_t should be marked with the credentials of the process that is writing. These credentials should be passed to VOP_PUTPAGES and used for the eventual NFS RPC in sys/nfsclient/nfs_bio.c:nfs_putpages. (Note that the line "cred = curthread->td_ucred;" in sys/nfsclient/nfs_bio.c:nfs_putpages is flagged with /* XXX */.) If I understand correctly, this parameter can be added without breaking source or binary compatibility with third-party filesystems. On the caller side, it seems unlikely that any third-party code calls VOP_PUTPAGES, but this change would break both source and binary compatibility if it did. There are uglier ways to implement this while maintaining compatibility, but these are not described here. >Release-Note: >Audit-Trail: >Unformatted: _______________________________________________ freebsd-bugs@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-bugs To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"