On 04/11/2012 01:07, Andrey Zonov wrote:
On 10.04.2012 20:19, Alan Cox wrote:
On 04/09/2012 10:26, John Baldwin wrote:
On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote:
On 04/04/2012 02:17, Konstantin Belousov wrote:
On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
Hi,

I open the file, call mmap() on the whole file, and get a pointer, which
I then work with. I expect that each page only needs to be touched once
to bring it into memory (the disk cache?), but this doesn't work!
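
(The actual test is only attached to the original mail; purely as an
illustration of the approach just described, and not the attached code
itself, a minimal version might look like the sketch below. The
clock_gettime() timing and the mincore()-based none/res/super/other
counters are assumptions inferred from the output further down.)

/*
 * Minimal sketch (not the attached test): map a file, touch one byte in
 * every page per pass, and report how mincore() classifies the pages.
 */
#include <sys/mman.h>
#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        struct stat st;
        struct timespec t0, t1;
        char *p, *vec;
        size_t i, none, other, res, super, npages;
        long pagesize;
        int fd, pass, passes;
        volatile char c;

        if (argc != 3)
                errx(1, "usage: %s <file> <passes>", argv[0]);
        passes = atoi(argv[2]);
        if ((fd = open(argv[1], O_RDONLY)) == -1)
                err(1, "open");
        if (fstat(fd, &st) == -1)
                err(1, "fstat");
        pagesize = sysconf(_SC_PAGESIZE);
        npages = (st.st_size + pagesize - 1) / pagesize;
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err(1, "mmap");
        if ((vec = malloc(npages)) == NULL)
                err(1, "malloc");
        for (pass = 1; pass <= passes; pass++) {
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (i = 0; i < npages; i++)
                        c = p[i * pagesize];    /* touch each page once */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                if (mincore(p, st.st_size, vec) == -1)
                        err(1, "mincore");
                none = other = res = super = 0;
                for (i = 0; i < npages; i++) {
                        if (vec[i] == 0)
                                none++;
                        else if (vec[i] & MINCORE_SUPER)
                                super++;
                        else if (vec[i] & MINCORE_INCORE)
                                res++;
                        else
                                other++;
                }
                printf("mmap: %d pass took: %f (none: %zu; res: %zu; "
                    "super: %zu; other: %zu)\n", pass,
                    (t1.tv_sec - t0.tv_sec) +
                    (t1.tv_nsec - t0.tv_nsec) / 1e9,
                    none, res, super, other);
        }
        return (0);
}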

I wrote a test (attached) and ran it on a 1 GB file generated from
/dev/random; the result is the following:

Prepare file:
# swapoff -a
# newfs /dev/ada0b
# mount /dev/ada0b /mnt
# dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024

Purge cache:
# umount /mnt
# mount /dev/ada0b /mnt

Run test:
$ ./mmap /mnt/random-1024 30
mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0)
mmap: 2 pass took: 7.356670 (none: 261648; res: 496; super: 0; other: 0)
mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0)
mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0)
mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0)
mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0)
mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0)
mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0)
mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0)
mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0)
mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0)
mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0)
mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0)
mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0)
mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0)
mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0)
mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0)
mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0)
mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0)
mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0)
mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0)
mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0)
mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0)
mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0)
mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0)
mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0)
mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0)
mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0)
mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0)
mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0)

If I run this:
$ cat /mnt/random-1024 > /dev/null
before the test, then the result is the following:

$ ./mmap /mnt/random-1024 5
mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0)
mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0)
mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0)
mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0)
mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0)

This is what I expect. But why doesn't this work without reading the
file manually first?
The issue seems to be some change in the behaviour of the reserv or
phys allocator. I've Cc:ed Alan.
I'm pretty sure that the behavior here hasn't significantly changed in
about twelve years. Otherwise, I agree with your analysis.

On more than one occasion, I've been tempted to change:

pmap_remove_all(mt);
if (mt->dirty != 0)
vm_page_deactivate(mt);
else
vm_page_cache(mt);

to:

vm_page_dontneed(mt);

because I suspect that the current code does more harm than good. In
theory, it saves activations of the page daemon. However, more often
than not, I suspect that we are spending more on page reactivations than
we are saving on page daemon activations. The sequential access
detection heuristic is just too easily triggered. For example, I've
seen it triggered by demand paging of the gcc text segment. Also, I
think that pmap_remove_all() and especially vm_page_cache() are too
severe for a detection heuristic that is so easily triggered.
Are you planning to commit this?


Not yet. I did some tests with a file that was several times larger than
DRAM, and I didn't like what I saw. Initially, everything behaved as
expected, but about halfway through the test the bulk of the pages were
active. Despite the call to pmap_clear_reference() in
vm_page_dontneed(), the page daemon is finding the pages to be
referenced and reactivating them. The net result is that the time it
takes to read the file (from a relatively fast SSD) goes up by about
12%. So, this still needs work.


Hi Alan,

What do you think about attached patch?



Sorry for the slow reply, I've been rather busy for the past couple of weeks. What you propose is clearly good for sequential accesses, but not so good for random accesses. Keep in mind, the potential costs of unconditionally increasing the read window include not only wasted I/O but also increased memory pressure. Rather than argue about which is more important, sequential or random access, I think it's more productive to replace the sequential access heuristic. The current heuristic is just not that sophisticated. It's easy to do better.

The attached patch implements a new heuristic, which starts with the same initial read window as the current heuristic, but arithmetically grows the window on sequential page faults. From a stylistic standpoint, this patch also cleanly separates the "read ahead" logic from the "cache behind" logic.

At the same time, this new heuristic is more selective about performing cache behind. It requires three or four sequential page faults before cache behind is enabled. More precisely, it requires the read ahead window to reach its maximum size before cache behind is enabled.
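
As a rough illustration of the growth (and of when cache behind kicks
in), the user-space snippet below replays the window arithmetic from
the patch under a few stated assumptions: 4 KB pages and a 128 KB
MAXPHYS, so that VM_FAULT_READ_AHEAD_MAX works out to
min(atop(MAXPHYS) - 1, UINT8_MAX) = 31, and a fault address far enough
into the mapping that "behind" is always capped at
VM_FAULT_READ_BEHIND. It is only a sketch of the arithmetic, not the
in-kernel code path.

#include <stdio.h>

/* Illustrative values; the real ones are computed in vm_map.h/vm_fault.c. */
#define VM_FAULT_READ_BEHIND            8
#define VM_FAULT_READ_AHEAD_INIT        15
#define VM_FAULT_READ_AHEAD_MAX         31      /* min(atop(MAXPHYS) - 1, UINT8_MAX) */

int
main(void)
{
        int era, fault, nera;

        era = VM_FAULT_READ_AHEAD_INIT;
        for (fault = 1; fault <= 5; fault++) {
                nera = era + VM_FAULT_READ_BEHIND;      /* arithmetic growth */
                if (nera > VM_FAULT_READ_AHEAD_MAX)
                        nera = VM_FAULT_READ_AHEAD_MAX;
                printf("sequential fault %d: window %d -> %d pages%s\n",
                    fault, era, nera,
                    era == VM_FAULT_READ_AHEAD_MAX ?
                    " (cache behind enabled)" : "");
                era = nera;
        }
        return (0);
}

With those numbers the window reaches its maximum on the third
sequential fault, which is where the "three or four sequential page
faults" figure above comes from.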

For long, sequential accesses, the results of my performance tests are just as good as unconditionally increasing the window size. I'm also seeing fewer pages needlessly cached by the cache behind heuristic. That said, there is still room for improvement. We are still not achieving the same sequential performance as "dd", and there are still more pages being cached than I would like.

Alan


Index: vm/vm_map.c
===================================================================
--- vm/vm_map.c (revision 234106)
+++ vm/vm_map.c (working copy)
@@ -1300,6 +1300,8 @@ charged:
        new_entry->protection = prot;
        new_entry->max_protection = max;
        new_entry->wired_count = 0;
+       new_entry->read_ahead = VM_FAULT_READ_AHEAD_INIT;
+       new_entry->next_read = OFF_TO_IDX(offset);
 
        KASSERT(cred == NULL || !ENTRY_CHARGED(new_entry),
            ("OVERCOMMIT: vm_map_insert leaks vm_map %p", new_entry));
Index: vm/vm_map.h
===================================================================
--- vm/vm_map.h (revision 234106)
+++ vm/vm_map.h (working copy)
@@ -112,8 +112,9 @@ struct vm_map_entry {
        vm_prot_t protection;           /* protection code */
        vm_prot_t max_protection;       /* maximum protection */
        vm_inherit_t inheritance;       /* inheritance */
+       uint8_t read_ahead;             /* pages in the read-ahead window */
        int wired_count;                /* can be paged if = 0 */
-       vm_pindex_t lastr;              /* last read */
+       vm_pindex_t next_read;          /* index of the next sequential read */
        struct ucred *cred;             /* tmp storage for creator ref */
 };
 
@@ -330,6 +331,14 @@ long vmspace_wired_count(struct vmspace *vmspace);
 #define        VM_FAULT_DIRTY 2                /* Dirty the page; use w/VM_PROT_COPY */
 
 /*
+ * Initially, mappings are slightly sequential.  The maximum window size must
+ * account for the map entry's "read_ahead" field being defined as an uint8_t.
+ */
+#define        VM_FAULT_READ_AHEAD_MIN         7
+#define        VM_FAULT_READ_AHEAD_INIT        15
+#define        VM_FAULT_READ_AHEAD_MAX         min(atop(MAXPHYS) - 1, UINT8_MAX)
+
+/*
  * The following "find_space" options are supported by vm_map_find()
  */
 #define        VMFS_NO_SPACE           0       /* don't find; use the given range */
Index: vm/vm_fault.c
===================================================================
--- vm/vm_fault.c       (revision 234106)
+++ vm/vm_fault.c       (working copy)
@@ -118,9 +118,11 @@ static int prefault_pageorder[] = {
 static int vm_fault_additional_pages(vm_page_t, int, int, vm_page_t *, int *);
 static void vm_fault_prefault(pmap_t, vm_offset_t, vm_map_entry_t);
 
-#define VM_FAULT_READ_AHEAD 8
-#define VM_FAULT_READ_BEHIND 7
-#define VM_FAULT_READ (VM_FAULT_READ_AHEAD+VM_FAULT_READ_BEHIND+1)
+#define        VM_FAULT_READ_BEHIND    8
+#define        VM_FAULT_READ_MAX       (1 + VM_FAULT_READ_AHEAD_MAX)
+#define        VM_FAULT_NINCR          (VM_FAULT_READ_MAX / VM_FAULT_READ_BEHIND)
+#define        VM_FAULT_SUM            (VM_FAULT_NINCR * (VM_FAULT_NINCR + 1) / 2)
+#define        VM_FAULT_CACHE_BEHIND   (VM_FAULT_READ_BEHIND * VM_FAULT_SUM)
 
 struct faultstate {
        vm_page_t m;
@@ -136,6 +138,8 @@ struct faultstate {
        int vfslocked;
 };
 
+static void vm_fault_cache_behind(const struct faultstate *fs, int distance);
+
 static inline void
 release_page(struct faultstate *fs)
 {
@@ -236,13 +240,13 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_
     int fault_flags, vm_page_t *m_hold)
 {
        vm_prot_t prot;
-       int is_first_object_locked, result;
-       boolean_t growstack, wired;
+       long ahead, behind;
+       int alloc_req, era, faultcount, nera, reqpage, result;
+       boolean_t growstack, is_first_object_locked, wired;
        int map_generation;
        vm_object_t next_object;
-       vm_page_t marray[VM_FAULT_READ], mt, mt_prev;
+       vm_page_t marray[VM_FAULT_READ_MAX];
        int hardfault;
-       int faultcount, ahead, behind, alloc_req;
        struct faultstate fs;
        struct vnode *vp;
        int locked, error;
@@ -252,7 +256,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_
        PCPU_INC(cnt.v_vm_faults);
        fs.vp = NULL;
        fs.vfslocked = 0;
-       faultcount = behind = 0;
+       faultcount = reqpage = 0;
 
 RetryFault:;
 
@@ -460,76 +464,48 @@ readrest:
                 */
                if (TRYPAGER) {
                        int rv;
-                       int reqpage = 0;
                        u_char behavior = vm_map_entry_behavior(fs.entry);
 
                        if (behavior == MAP_ENTRY_BEHAV_RANDOM ||
                            P_KILLED(curproc)) {
+                               behind = 0;
                                ahead = 0;
+                       } else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
                                behind = 0;
+                               ahead = atop(fs.entry->end - vaddr) - 1;
+                               if (ahead > VM_FAULT_READ_AHEAD_MAX)
+                                       ahead = VM_FAULT_READ_AHEAD_MAX;
+                               if (fs.pindex == fs.entry->next_read)
+                                       vm_fault_cache_behind(&fs,
+                                           VM_FAULT_READ_MAX);
                        } else {
-                               behind = (vaddr - fs.entry->start) >> PAGE_SHIFT;
+                               /*
+                                * If this is a sequential page fault, then
+                                * arithmetically increase the number of pages
+                                * in the read-ahead window.  Otherwise, reset
+                                * the read-ahead window to its smallest size.
+                                */
+                               behind = atop(vaddr - fs.entry->start);
                                if (behind > VM_FAULT_READ_BEHIND)
                                        behind = VM_FAULT_READ_BEHIND;
-
-                               ahead = ((fs.entry->end - vaddr) >> PAGE_SHIFT) - 1;
-                               if (ahead > VM_FAULT_READ_AHEAD)
-                                       ahead = VM_FAULT_READ_AHEAD;
+                               ahead = atop(fs.entry->end - vaddr) - 1;
+                               era = fs.entry->read_ahead;
+                               if (fs.pindex == fs.entry->next_read) {
+                                       nera = era + behind;
+                                       if (nera > VM_FAULT_READ_AHEAD_MAX)
+                                               nera = VM_FAULT_READ_AHEAD_MAX;
+                                       behind = 0;
+                                       if (ahead > nera)
+                                               ahead = nera;
+                                       if (era == VM_FAULT_READ_AHEAD_MAX)
+                                               vm_fault_cache_behind(&fs,
+                                                   VM_FAULT_CACHE_BEHIND);
+                               } else if (ahead > VM_FAULT_READ_AHEAD_MIN)
+                                       ahead = VM_FAULT_READ_AHEAD_MIN;
+                               if (era != ahead)
+                                       fs.entry->read_ahead = ahead;
                        }
-                       is_first_object_locked = FALSE;
-                       if ((behavior == MAP_ENTRY_BEHAV_SEQUENTIAL ||
-                            (behavior != MAP_ENTRY_BEHAV_RANDOM &&
-                             fs.pindex >= fs.entry->lastr &&
-                             fs.pindex < fs.entry->lastr + VM_FAULT_READ)) &&
-                           (fs.first_object == fs.object ||
-                            (is_first_object_locked = VM_OBJECT_TRYLOCK(fs.first_object))) &&
-                           fs.first_object->type != OBJT_DEVICE &&
-                           fs.first_object->type != OBJT_PHYS &&
-                           fs.first_object->type != OBJT_SG) {
-                               vm_pindex_t firstpindex;
 
-                               if (fs.first_pindex < 2 * VM_FAULT_READ)
-                                       firstpindex = 0;
-                               else
-                                       firstpindex = fs.first_pindex - 2 * VM_FAULT_READ;
-                               mt = fs.first_object != fs.object ?
-                                   fs.first_m : fs.m;
-                               KASSERT(mt != NULL, ("vm_fault: missing mt"));
-                               KASSERT((mt->oflags & VPO_BUSY) != 0,
-                                   ("vm_fault: mt %p not busy", mt));
-                               mt_prev = vm_page_prev(mt);
-
-                               /*
-                                * note: partially valid pages cannot be 
-                                * included in the lookahead - NFS piecemeal
-                                * writes will barf on it badly.
-                                */
-                               while ((mt = mt_prev) != NULL &&
-                                   mt->pindex >= firstpindex &&
-                                   mt->valid == VM_PAGE_BITS_ALL) {
-                                       mt_prev = vm_page_prev(mt);
-                                       if (mt->busy ||
-                                           (mt->oflags & VPO_BUSY))
-                                               continue;
-                                       vm_page_lock(mt);
-                                       if (mt->hold_count ||
-                                           mt->wire_count) {
-                                               vm_page_unlock(mt);
-                                               continue;
-                                       }
-                                       pmap_remove_all(mt);
-                                       if (mt->dirty != 0)
-                                               vm_page_deactivate(mt);
-                                       else
-                                               vm_page_cache(mt);
-                                       vm_page_unlock(mt);
-                               }
-                               ahead += behind;
-                               behind = 0;
-                       }
-                       if (is_first_object_locked)
-                               VM_OBJECT_UNLOCK(fs.first_object);
-
                        /*
                         * Call the pager to retrieve the data, if any, after
                         * releasing the lock on the map.  We hold a ref on
@@ -899,7 +875,7 @@ vnode_locked:
         * without holding a write lock on it.
         */
        if (hardfault)
-               fs.entry->lastr = fs.pindex + faultcount - behind;
+               fs.entry->next_read = fs.pindex + faultcount - reqpage;
 
        if ((prot & VM_PROT_WRITE) != 0 ||
            (fault_flags & VM_FAULT_DIRTY) != 0) {
@@ -992,6 +968,56 @@ vnode_locked:
 }
 
 /*
+ * Speed up the reclamation of up to "distance" pages that precede the
+ * faulting pindex within the first object of the shadow chain.
+ */
+static void
+vm_fault_cache_behind(const struct faultstate *fs, int distance)
+{
+       vm_page_t m, m_prev;
+       vm_pindex_t pindex;
+       boolean_t is_first_object_locked;
+
+       VM_OBJECT_LOCK_ASSERT(fs->object, MA_OWNED);
+       is_first_object_locked = FALSE;
+       if (fs->first_object != fs->object && !(is_first_object_locked =
+           VM_OBJECT_TRYLOCK(fs->first_object)))
+               return;
+       if (fs->first_object->type != OBJT_DEVICE &&
+           fs->first_object->type != OBJT_PHYS &&
+           fs->first_object->type != OBJT_SG) {
+               if (fs->first_pindex < distance)
+                       pindex = 0;
+               else
+                       pindex = fs->first_pindex - distance;
+               if (pindex < OFF_TO_IDX(fs->entry->offset))
+                       pindex = OFF_TO_IDX(fs->entry->offset);
+               m = fs->first_object != fs->object ? fs->first_m : fs->m;
+               KASSERT(m != NULL, ("vm_fault_cache_behind: page missing"));
+               KASSERT((m->oflags & VPO_BUSY) != 0,
+                   ("vm_fault_cache_behind: page %p is not busy", m));
+               m_prev = vm_page_prev(m);
+               while ((m = m_prev) != NULL && m->pindex >= pindex &&
+                   m->valid == VM_PAGE_BITS_ALL) {
+                       m_prev = vm_page_prev(m);
+                       if (m->busy != 0 || (m->oflags & VPO_BUSY) != 0)
+                               continue;
+                       vm_page_lock(m);
+                       if (m->hold_count == 0 && m->wire_count == 0) {
+                               pmap_remove_all(m);
+                               if (m->dirty != 0)
+                                       vm_page_deactivate(m);
+                               else
+                                       vm_page_cache(m);
+                       }
+                       vm_page_unlock(m);
+               }
+       }
+       if (is_first_object_locked)
+               VM_OBJECT_UNLOCK(fs->first_object);
+}
+
+/*
  * vm_fault_prefault provides a quick way of clustering
  * pagefaults into a processes address space.  It is a "cousin"
  * of vm_map_pmap_enter, except it runs at page fault time instead