Re: nfs performance at high loads
I believe David Miller's latest zero-copy patches might help here. In his patch, the pull-up buffer is now allocated near the top of the stack (in the sunrpc code), so it can be a blocking allocation. This doesn't fix the core VM problems, but does relieve the pressure _slightly_ on the VM (I assume, haven't tried David's patch yet).

One of the core problems is that the VM keeps no measure of page fragmentation in the free page pool. The system reaches the state of having plenty of free single pages (so kswapd and friends aren't kicked - or if they are, they do little or no work), and very few buddied pages (which you need for some of the NFS requests).

Unfortunately, even keeping a measure of fragmentation, and ensuring work is done when the threshold is reached, doesn't solve the next problem. When a large order request comes in, the inactive_clean page list is reaped. As reclaim_page() simply selects the "oldest" page it can, with no regard as to whether it will buddy (now, or possibly in the near future), this list is quickly shrunk by a large order request - far too quickly for a well behaved system.

An NFS write request, with an 8K block size, needs an order-2 (16K) pull-up buffer (we shouldn't really be pulling the header into the same buffer as the data - perhaps we aren't any more?). On a well used system, an order-2 _blocking_ allocation ends up populating the order-0 and order-1 free lists with quite a few pages from the inactive_clean list.

This then triggers another problem. :( As large (non-zero) order requests are always from the NORMAL or DMA zones, these zones tend to have a lot of free pages (put there by the blind reclaim_page() - well, once you can do a blocking allocation they are, or when the fragmentation kicking is working). New allocations of page-cache pages often ignore the HIGHMEM zone (it reaches a steady state), and so it is passed over by the loop at the head of __alloc_pages(). However, the NORMAL and DMA zones tend to be above pages_low (due to the reason above), and so new page-cache pages come from these zones. On a HIGHMEM system this leads to thrashing of the NORMAL zone, while the HIGHMEM zone stays (relatively) quiet.

Note: To make matters even worse under this condition, pulling pages out of the NORMAL zone is exactly what you don't want to happen! It would be much better if they could be left alone for a (short) while to give them a chance to buddy - Linux (at present) doesn't care about the buddying of pages in the HIGHMEM zone (no non-zero order allocations come from there).

I was working on these problems (and some others) a few months back, and will return to them shortly. Unfortunately, the changes started to look too large for 2.4.

Also, for NFS, the best solution now might be to give the nfsd threads a receive buffer. With David's patches, the pull-up occurs in the context of a thread, making this possible. This doesn't solve the problem for other subsystems which do non-zero order page allocations, but (perhaps) they have a low enough frequency not to be of real issue.

Kapish,
Note: Ensure you put a "sync" in your /etc/exports - the default behaviour was "async" (not legal for a valid SpecFS run).

Mark

On Wed, 4 Apr 2001, Alan Cox wrote:
> > We have been seeing some problems with running nfs benchmarks
> > at very high loads and were wondering if somebody could show
> > some pointers to where the problem lies.
> > The system is a 2.4.0 kernel on a 6.2 Red Hat distribution (so
>
> Use 2.2.19.
> The 2.4 VM is currently too broken to survive high I/O benchmark tests
> without going silly
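To make the "no measure of fragmentation" point a little more concrete, a per-zone check only needs to know how many free blocks exist at a given order. A rough sketch of such a measure (assuming the 2.4 zone_t layout, where each zone has free_area[order].free_list; 2.4 keeps no per-order counts, so this walks the buddy lists - illustration only, not part of any posted patch):

	static unsigned long free_blocks_at_order(zone_t *zone, unsigned int order)
	{
		struct list_head *entry;
		unsigned long count = 0;

		/* caller holds zone->lock for a stable answer */
		list_for_each(entry, &zone->free_area[order].free_list)
			count++;
		return count;
	}

Something like kswapd could then be kicked when, say, free_blocks_at_order(zone, 2) drops below a small threshold, even though zone->free_pages still looks perfectly healthy.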
[PATCH] vmalloc fault race
Hi,

If two processes, sharing the same page tables, hit an unloaded vmalloc address in the kernel at the same time, one of the processes is killed (with the message "Unable to handle kernel paging request"). This occurs because the test on a vmalloc fault is too tight. On x86, it contains;

	if (pmd_present(*pmd) || !pmd_present(*pmd_k))
		goto bad_area_nosemaphore;

If two processes are racing, chances are that "pmd_present(*pmd)" will be true for one of them.

It appears that alpha, x86, arm, cris and sparc all have the same/similar test. I've included a patch for all the above archs, but have only tested on x86 (so no guarantee it is correct for the other archs).

Note: Even though this race can only occur on SMP (assuming we cannot context switch on entering a page-fault), I've unconditionally removed the test against pmd. It could be argued that it should be left in for UP...

Patch is against 2.4.2pre3.

Mark

diff -urN -X dontdiff vxfs-2.4.2pre3/arch/alpha/mm/fault.c markhe-2.4.2pre3/arch/alpha/mm/fault.c
--- vxfs-2.4.2pre3/arch/alpha/mm/fault.c	Fri Dec 29 22:07:19 2000
+++ markhe-2.4.2pre3/arch/alpha/mm/fault.c	Wed Feb 14 12:10:21 2001
@@ -223,7 +223,7 @@
 	pgd = current->active_mm->pgd + offset;
 	pgd_k = swapper_pg_dir + offset;
 
-	if (!pgd_present(*pgd) && pgd_present(*pgd_k)) {
+	if (pgd_present(*pgd_k)) {
 		pgd_val(*pgd) = pgd_val(*pgd_k);
 		return;
 	}
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/arm/mm/fault-common.c markhe-2.4.2pre3/arch/arm/mm/fault-common.c
--- vxfs-2.4.2pre3/arch/arm/mm/fault-common.c	Mon Feb 12 10:10:27 2001
+++ markhe-2.4.2pre3/arch/arm/mm/fault-common.c	Wed Feb 14 12:12:13 2001
@@ -185,8 +185,6 @@
 		goto bad_area;
 
 	pmd = pmd_offset(pgd, addr);
-	if (!pmd_none(*pmd))
-		goto bad_area;
 	set_pmd(pmd, *pmd_k);
 	return 1;
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/cris/mm/fault.c markhe-2.4.2pre3/arch/cris/mm/fault.c
--- vxfs-2.4.2pre3/arch/cris/mm/fault.c	Mon Feb 12 10:10:27 2001
+++ markhe-2.4.2pre3/arch/cris/mm/fault.c	Wed Feb 14 12:13:10 2001
@@ -381,7 +381,7 @@
 	pmd = pmd_offset(pgd, address);
 	pmd_k = pmd_offset(pgd_k, address);
 
-	if (pmd_present(*pmd) || !pmd_present(*pmd_k))
+	if (!pmd_present(*pmd_k))
 		goto bad_area_nosemaphore;
 	set_pmd(pmd, *pmd_k);
 	return;
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/i386/mm/fault.c markhe-2.4.2pre3/arch/i386/mm/fault.c
--- vxfs-2.4.2pre3/arch/i386/mm/fault.c	Sun Nov 12 03:01:11 2000
+++ markhe-2.4.2pre3/arch/i386/mm/fault.c	Wed Feb 14 12:14:13 2001
@@ -340,7 +340,7 @@
 	pmd = pmd_offset(pgd, address);
 	pmd_k = pmd_offset(pgd_k, address);
 
-	if (pmd_present(*pmd) || !pmd_present(*pmd_k))
+	if (!pmd_present(*pmd_k))
 		goto bad_area_nosemaphore;
 	set_pmd(pmd, *pmd_k);
 	return;
diff -urN -X dontdiff vxfs-2.4.2pre3/arch/sparc/mm/fault.c markhe-2.4.2pre3/arch/sparc/mm/fault.c
--- vxfs-2.4.2pre3/arch/sparc/mm/fault.c	Mon Jan  1 18:37:41 2001
+++ markhe-2.4.2pre3/arch/sparc/mm/fault.c	Wed Feb 14 12:17:36 2001
@@ -378,7 +378,7 @@
 	pmd = pmd_offset(pgd, address);
 	pmd_k = pmd_offset(pgd_k, address);
 
-	if (pmd_present(*pmd) || !pmd_present(*pmd_k))
+	if (!pmd_present(*pmd_k))
 		goto bad_area_nosemaphore;
 	pmd_val(*pmd) = pmd_val(*pmd_k);
 	return;
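For anyone trying to picture the race with the old test, it runs roughly like this (illustrative timeline only): two threads share one mm, and both fault on the same vmalloc address that is present in swapper_pg_dir but not yet in their shared pgd.

	CPU A                               CPU B
	pmd_present(*pmd)   -> false
	pmd_present(*pmd_k) -> true
	set_pmd(pmd, *pmd_k)
	                                    pmd_present(*pmd) -> now true
	                                    old test jumps to
	                                    bad_area_nosemaphore; B is killed
	                                    although the fault is resolvable

With the relaxed test, CPU B simply writes the same kernel pmd entry again, which is harmless.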
[PATCH] HIGHMEM+buffers
Hi,

On a 4GB SMP box, configured with HIGHMEM support, making a 670G (obviously using a volume manager) ext2 file system takes 12 minutes (over 10 minutes of sys time).

One problem is that buffer allocations do not use HIGHMEM, but nr_free_buffer_pages() doesn't take this into account, causing balance_dirty_state() to return the wrong state.

The attached patch fixes the worst part of nr_free_buffer_pages() - with HIGHMEM it now only counts free+inactive pages in the NORMAL and DMA zones. It doesn't fix the inactive_dirty calculation.

Also, in buffer.c:flush_dirty_buffers(), it is worthwhile to kick the disk queues if it decides to reschedule.

With the patch, that 670G filesystem can be created in 5m36secs (under half the time).

Patch is against 2.4.1-ac18.

Mark

diff -urN -X dontdiff vanilla-2.4.1-ac18/fs/buffer.c markhe-2.4.1-ac18/fs/buffer.c
--- vanilla-2.4.1-ac18/fs/buffer.c	Sun Feb 18 15:06:26 2001
+++ markhe-2.4.1-ac18/fs/buffer.c	Sun Feb 18 19:03:19 2001
@@ -2638,8 +2638,11 @@
 		ll_rw_block(WRITE, 1, &bh);
 		atomic_dec(&bh->b_count);
 
-		if (current->need_resched)
+		if (current->need_resched) {
+			/* kick what we've already pushed down */
+			run_task_queue(&tq_disk);
 			schedule();
+		}
 		goto restart;
 	}
 out_unlock:
diff -urN -X dontdiff vanilla-2.4.1-ac18/include/linux/swap.h markhe-2.4.1-ac18/include/linux/swap.h
--- vanilla-2.4.1-ac18/include/linux/swap.h	Sun Feb 18 15:06:29 2001
+++ markhe-2.4.1-ac18/include/linux/swap.h	Sun Feb 18 18:11:03 2001
@@ -65,7 +65,9 @@
 extern int nr_swap_pages;
 FASTCALL(unsigned int nr_free_pages(void));
+FASTCALL(unsigned int nr_free_pages_zone(int));
 FASTCALL(unsigned int nr_inactive_clean_pages(void));
+FASTCALL(unsigned int nr_inactive_clean_pages_zone(int));
 FASTCALL(unsigned int nr_free_buffer_pages(void));
 extern int nr_active_pages;
 extern int nr_inactive_dirty_pages;
diff -urN -X dontdiff vanilla-2.4.1-ac18/mm/page_alloc.c markhe-2.4.1-ac18/mm/page_alloc.c
--- vanilla-2.4.1-ac18/mm/page_alloc.c	Sun Feb 18 15:06:29 2001
+++ markhe-2.4.1-ac18/mm/page_alloc.c	Sun Feb 18 19:04:36 2001
@@ -547,6 +547,23 @@
 }
 
 /*
+ * Total amount of free (allocatable) RAM in a given zone.
+ */
+unsigned int nr_free_pages_zone (int zone_type)
+{
+	pg_data_t *pgdat;
+	unsigned int sum;
+
+	sum = 0;
+	pgdat = pgdat_list;
+	while (pgdat) {
+		sum += (pgdat->node_zones+zone_type)->free_pages;
+		pgdat = pgdat->node_next;
+	}
+	return sum;
+}
+
+/*
  * Total amount of inactive_clean (allocatable) RAM:
  */
 unsigned int nr_inactive_clean_pages (void)
@@ -565,14 +582,43 @@
 }
 
 /*
+ * Total amount of inactive_clean (allocatable) RAM in a given zone.
+ */
+unsigned int nr_inactive_clean_pages_zone (int zone_type)
+{
+	pg_data_t *pgdat;
+	unsigned int sum;
+
+	sum = 0;
+	pgdat = pgdat_list;
+	while (pgdat) {
+		sum += (pgdat->node_zones+zone_type)->inactive_clean_pages;
+		pgdat = pgdat->node_next;
+	}
+	return sum;
+}
+
+
+/*
  * Amount of free RAM allocatable as buffer memory:
+ *
+ * For HIGHMEM systems don't count HIGHMEM pages.
+ * This function is still far from perfect for HIGHMEM systems, but
+ * it is close enough for the time being.
  */
 unsigned int nr_free_buffer_pages (void)
 {
 	unsigned int sum;
 
-	sum = nr_free_pages();
-	sum += nr_inactive_clean_pages();
+#if CONFIG_HIGHMEM
+	sum = nr_free_pages_zone(ZONE_NORMAL) +
+	      nr_free_pages_zone(ZONE_DMA) +
+	      nr_inactive_clean_pages_zone(ZONE_NORMAL) +
+	      nr_inactive_clean_pages_zone(ZONE_DMA);
+#else
+	sum = nr_free_pages() +
+	      nr_inactive_clean_pages();
+#endif
 	sum += nr_inactive_dirty_pages;
 
 	/*
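To see the scale of the problem the patch addresses, some rough, purely illustrative numbers for a 4GB box (not measurements): NORMAL+DMA is around 900MB, roughly 230,000 4K pages, while HIGHMEM is about 3.1GB, roughly 800,000 pages. If most of HIGHMEM is idle, the unpatched nr_free_buffer_pages() reports something near a million "buffer" pages, so the ratio computed in balance_dirty_state() stays far below its soft and hard limits even when nearly all of the ~230,000 pages that can actually hold buffers are dirty - bdflush is never woken, and flushing only starts once the low zones are nearly exhausted.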
[PATCH] nfsd + scalability
Hi Neil, all, The nfs daemons run holding the global kernel lock. They still hold this lock over calls to file_op's read and write. The file system kernel interface (FSKI) doesn't require the kernel lock to be held over these read/write calls. The nfs daemons do not require that the reads or writes do not block (would be v silly if they did), so they have no guarantee the lock isn't dropped and retaken during blocking. ie. they aren't using it as a guard across the calls. Dropping the kernel lock around read and write in fs/nfsd/vfs.c is a _big_ SMP scalability win! Attached patch is against 2.4.1-ac18, but should apply to most recent kernel versions. Mark --- vanilla-2.4.1-ac18/fs/nfsd/vfs.cSun Feb 18 15:06:27 2001 +++ markhe-2.4.1-ac18/fs/nfsd/vfs.c Sun Feb 18 19:32:18 2001 @@ -30,6 +30,7 @@ #include #include #include +#include #include #define __NO_VERSION__ #include @@ -602,12 +603,28 @@ file.f_ralen = ra->p_ralen; file.f_rawin = ra->p_rawin; } + + /* +* The nfs daemons run holding the global kernel lock, but +* f_op->read() doesn't need the lock to be held. +* Drop it here to help scalability. +* +* The "kernel_locked()" test isn't perfect (someone else could be +* holding the lock when we're not), but it will eventually catch +* any cases of entering here without the lock held. +*/ + if (!kernel_locked()) + BUG(); + unlock_kernel(); + file.f_pos = offset; oldfs = get_fs(); set_fs(KERNEL_DS); err = file.f_op->read(&file, buf, *count, &file.f_pos); set_fs(oldfs); + lock_kernel(); + /* Write back readahead params */ if (ra != NULL) { dprintk("nfsd: raparms %ld %ld %ld %ld %ld\n", @@ -664,6 +681,22 @@ goto out_close; #endif + /* +* The nfs daemons run holding the global kernel lock, but +* f_op->write() doesn't need the lock to be held. +* Also, as the struct file is private, the export is read-locked, +* and the inode attached to the dentry cannot change under us, the +* lock can be dropped ahead of the call to write() for even better +* scalability. +* +* The "kernel_locked()" test isn't perfect (someone else could be +* holding the lock when we're not), but it will eventually catch +* any cases of entering here without the lock held. +*/ + if (!kernel_locked()) + BUG(); + unlock_kernel(); + dentry = file.f_dentry; inode = dentry->d_inode; exp = fhp->fh_export; @@ -692,9 +725,12 @@ /* Write the data. */ oldfs = get_fs(); set_fs(KERNEL_DS); err = file.f_op->write(&file, buf, cnt, &file.f_pos); + set_fs(oldfs); + + lock_kernel(); + if (err >= 0) nfsdstats.io_write += cnt; - set_fs(oldfs); /* clear setuid/setgid flag after write */ if (err >= 0 && (inode->i_mode & (S_ISUID | S_ISGID))) {
Re: [PATCH] nfsd + scalability
On Thu, 22 Feb 2001, Neil Brown wrote:
> On Sunday February 18, [EMAIL PROTECTED] wrote:
> > Hi Neil, all,
> >
> > The nfs daemons run holding the global kernel lock. They still hold
> > this lock over calls to file_op's read and write.
> > [snip]
> > Dropping the kernel lock around read and write in fs/nfsd/vfs.c is a
> > _big_ SMP scalability win!
>
> Certainly I would like to not hold the BKL so much, but I'm curious
> how much effect it will really have. Do you have any data on the
> effect of this change?

Depends very much on the hardware configuration, and the underlying filesystem. On an I/O bound system, obviously this has little effect. Using a filesystem which has a quite deep stack (CPU cycle heavy), this is a big win.

I've been running with this for so long that I can't find my original data files at the moment, but it was around +8% improvement in throughput for a 4-way box under SpecFS with vxfs as the underlying filesystem. Less benefit for ext2 (all filesystems NFS exported "sync" and "no_subtree_check"). Some of the benefit came from the fact that there is also a volume manager sitting under the filesystem (more CPU cycles with the kernel lock held!).

Holding the kernel lock for fewer cycles has an important side benefit! If it is held for fewer cycles, then the probability of it being held when an interrupt is processed is reduced. This benefit is further enhanced if there are bottom halves running off the back of the interrupt (such as networking code). I need to get actual figures for how many cycles we are spinning at the reacquire_kernel_lock() (in schedule()), but my gut feeling is that we aren't doing too well when running as a file server.

> Also, I would much rather drop the lock in nfsd_dispatch() and then go
> around reclaiming it wherever it was needed.
> Subsequently we would drop the lock in nfsd() and make sure sunrpc is
> safe.

sunrpc definitely tries to be SMP safe; I haven't convinced myself that it actually is. In sunrpc, dropping the kernel lock when checksumming the UDP packet, and when sending, is another win-win case (again, need to find my data files, but this was approx +3% improvement in throughput).

> This would be more work (any volunteers?:-) but I feel that dropping
> it in little places like this is a bit unsatisfying.

True, not 100% satisfying, but even little places give an overall improvement in scalability. If they can be proved to be correct, then why not do it? It moves the code in the right direction.

I am planning on slowly expanding the dropping of the kernel lock within NFS, rather than do it in one single bang. It looks doable, but there _might_ be an issue with the interaction with the dcache.

Mark
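For anyone wanting to try this incrementally, the pattern being expanded is small enough to wrap up; a hypothetical sketch (the helper name and placement are mine, not part of the posted patch):

	/*
	 * Drop the BKL across a blocking f_op->read().  The caller must
	 * already hold the kernel lock, and must not be relying on it for
	 * any state across the call.
	 */
	static inline ssize_t nfsd_fop_read_unlocked(struct file *file, char *buf,
						     size_t count, loff_t *pos)
	{
		ssize_t err;

		unlock_kernel();
		err = file->f_op->read(file, buf, count, pos);
		lock_kernel();
		return err;
	}

The same shape works for f_op->write(), and for the sunrpc checksum/send paths mentioned above.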
Re: Q: explicit alignment control for the slab allocator
On Thu, 1 Mar 2001, Manfred Spraul wrote:
> Alan added a CONFIG option for FORCED_DEBUG slab debugging, but there
> is one minor problem with FORCED_DEBUG: FORCED_DEBUG disables
> HW_CACHEALIGN, and several drivers assume that HW_CACHEALIGN implies a
> certain alignment (iirc usb/uhci.c assumes 16-byte alignment)
>
> I've attached a patch that fixes the explicit alignment control in
> kmem_cache_create().
>
> The parameter 'offset' [the minimum offset to be used for cache
> coloring] actually is the guaranteed alignment, except that the
> implementation was broken. I've fixed the implementation and renamed
> 'offset' to 'align'.

As the original author of the slab allocator, I can assure you there is nothing guaranteed about "offset". Neither is it to do with any minimum.

The original idea behind offset was for objects with a "hot" area greater than a single L1 cache line. By using offset correctly (and to my knowledge it has never been used anywhere in the Linux kernel), a SLAB cache creator (caller of kmem_cache_create()) could ask the SLAB for more than one colour (space/L1 cache lines) offset between objects.

As no one uses the feature it could well be broken, but is that a reason to change its meaning?

Mark
Re: Q: explicit alignment control for the slab allocator
On Thu, 1 Mar 2001, Manfred Spraul wrote:
> Mark Hemment wrote:
> >
> > The original idea behind offset was for objects with a "hot" area
> > greater than a single L1 cache line. By using offset correctly (and to my
> > knowledge it has never been used anywhere in the Linux kernel), a SLAB
> > cache creator (caller of kmem_cache_create()) could ask the SLAB for more
> > than one colour (space/L1 cache lines) offset between objects.
>
> What's the difference between this definition of 'offset' and alignment?

The positioning of the first object within a slab (at least that is how it is supposed to work). The distance between all objects within a slab is constant, so the colouring of objects depends upon the cache line (offset) the first object is placed on.

The alignment is the boundary objects fall upon within a slab. This may require 'padding' between the objects so they fall on the correct boundaries (ie. they aren't a 'natural' size). For kmem_cache_create(), a zero offset means the offset is the same as the alignment.

Take the case of offset being 64, and alignment being 32. Here, the allocator attempts to place the first object on a 64-byte boundary (say, at offset 0), and all subsequent objects (within the same cache) on a 32-byte boundary. Now, when it comes to constructing the next slab, it tries to place the first object 64 bytes offset from the first object in the previous slab (say, at offset 64). The distance between the objects is still the same - ie. they fall on 32-byte boundaries. See the difference?

> alignment means that (addr%alignment==0)
> offset means that (addr1-addr2 == n*offset)
>
> Isn't the only difference the alignment of the first object in a slab?

Yes (as explained above). It is important.

> Some hardware drivers use HW_CACHEALIGN and assume certain byte
> alignments, and arm needs 1024 byte aligned blocks.

I should have put a big comment in the allocator, saying alignment/offset are only hints to the allocator and not guarantees. Unfortunately, the allocator was always returning L1 aligned objects with HW_CACHEALIGN, so folks started to depend on it. Too late to break that now.

It sounds as if HW_CACHEALIGN has been broken by a config option, and that needs to be fixed. But leave 'offset' alone?! If it isn't working as described above, then it needs fixing, but don't change its definition.

Mark
Re: Q: explicit alignment control for the slab allocator
On Thu, 1 Mar 2001, Manfred Spraul wrote:
> Yes, I see the difference, but I'm not sure that it will work as
> intended.
> offset must be a multiple of the alignment, everything else won't work.

The code does force the offset to be a multiple of the alignment - rounding the offset up. The idea was that a caller could do something like;

	kmem_cache_create("foo", sizeof(foo_s), offsetof(foo_s, member), ...);

where structure members in foo_s are "hot" up until the 'member' member.

> In which cases an offset > alignment is really a win?

You've got me. :) I don't know. In the Bonwick paper, such a facility was described, so I thought "hey, sounds like that might be useful".

Could be a win on archs with small L1 cache line sizes (16 bytes on a 486) - but most modern processors have larger lines. Hmm, on that note, seen the L1 line size defined for a Pentium 4? 128 bytes!! (CONFIG_X86_L1_CACHE_SHIFT of 7). That is probably going to waste a lot of space for small objects.

> Obviously using offset 32 bytes for a structure with a 64 byte hot zone
> means that 2 slabs with a different "color" compete for the same cache
> lines [just assuming 32 byte cache lines for simplicity] in 50% of the
> cases, but otoh offset==64 halves the number of possible colors.

Yes. It is possible to improve on the current slab allocator, to get an extra colour or two out of it for some object sizes (eg. when the slab management is on-slab, it is only ever at the front of a slab - it could also wrap round to the rear).

Mark
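To spell the intended usage out a little more, something along these lines (hypothetical structure and cache; the 2.4 kmem_cache_create() arguments are name, size, offset, flags, ctor, dtor):

	struct foo_s {
		/* "hot" members, touched on every lookup */
		struct list_head  hash;
		unsigned long     key;
		void             *data;
		/* "cold" members, rarely touched */
		char              name[64];
		unsigned long     stats[8];
	};

	static kmem_cache_t *foo_cachep;

	static int __init foo_init(void)
	{
		/* colour in steps of the hot area, keep normal L1 alignment */
		foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo_s),
					       offsetof(struct foo_s, name),
					       SLAB_HWCACHE_ALIGN, NULL, NULL);
		return foo_cachep ? 0 : -ENOMEM;
	}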
Re: Q: explicit alignment control for the slab allocator
On Fri, 2 Mar 2001, Manfred Spraul wrote:
> Zitiere Mark Hemment <[EMAIL PROTECTED]>:
> > Could be a win on archs with small L1 cache line sizes (16 bytes on a
> > 486) - but most modern processors have larger lines.
>
> IIRC cache colouring was introduced for some sun hardware with 2 memory
> busses: one handles (addr%256<128), the other one (addr%256>=128)
>
> If everything is perfectly aligned, the load on one bus was far higher
> than the load on the other bus.

Yes. High-end Intel PCs have also had interleaved buses for a few years now. So it is not just for Sun h/w.

> > Hmm, on that note, seen the L1 line size defined for a Pentium 4?
> > 128 bytes!! (CONFIG_X86_L1_CACHE_SHIFT of 7). That is probably going to
> > waste a lot of space for small objects.
>
> No, it doesn't:
> HWCACHE_ALIGN means "do not cross a cache line boundary".

Ah, I broke my code! :(

In my original slab, the code to do "packing" of objects into a single cache line was #if-def'ed out for SMP, to avoid the possibility of false-sharing between objects. Not a large possibility, but it exists.

> The question is who should decide about the cache colour offset?
>
> a) the slab allocator always chooses the smallest sensible offset (i.e.
> the alignment)
> b) the caller can specify the offset, e.g. if the caller knows that the
> hot zone is large he would use a larger colour offset.

Only the caller knows about the attributes and usage of an object, so they should be able to request (not demand) the offset/alignment from the allocator. (OK, they can demand the alignment.)

> Even if the hot zone is larger than the default offset, is there any
> advantage of increasing the colour offset beyond the alignment?
>
> I don't see an advantage.

I do, but like you, I don't have any data to prove my point. Time to get profiling?

Mark
Re: 4G SGI quad Xeon - memory-related slowdowns
Hi Paul, > 2) Other block I/O output (eg dd if=/dev/zero of=/dev/sdi bs=4M) also > run very slowly What do you notice when running "top" and doing the above? Does the "buff" value grow high (+700MB), with high CPU usage? If so, I think this might be down to nr_free_buffer_pages(). This function includes the pages in all zones (including the HIGHMEM zone) in its calculations, while only DMA and NORMAL zone pages are used for buffers. This upsets the result from balance_dirty_state() (fs/buffer.c), and as a result the required flushing of buffers is only done as a result of running v low of pages in the DMA and NORMAL zones. I've attached a "quick hack" I did for 2.4.0. It doesn't completely solve the problem, but moves it in the right direction. Please let me know if this helps. Mark diff -urN -X dontdiff linux-2.4.0/mm/page_alloc.c markhe-2.4.0/mm/page_alloc.c --- linux-2.4.0/mm/page_alloc.c Wed Jan 3 17:59:06 2001 +++ markhe-2.4.0/mm/page_alloc.cMon Jan 15 15:35:14 2001 @@ -583,6 +583,27 @@ } /* + * Free pages in zone "type", and the zones below it. + */ +unsigned int nr_free_pages_zone (int type) +{ + unsigned int sum; + zone_t *zone; + pg_data_t *pgdat = pgdat_list; + + if (type >= MAX_NR_ZONES) + BUG(); + + sum = 0; + while (pgdat) { + for (zone = pgdat->node_zones; zone < pgdat->node_zones + type; +zone++) + sum += zone->free_pages; + pgdat = pgdat->node_next; + } + return sum; +} + +/* * Total amount of inactive_clean (allocatable) RAM: */ unsigned int nr_inactive_clean_pages (void) @@ -600,6 +621,25 @@ return sum; } +unsigned int nr_inactive_clean_pages_zone(int type) +{ + unsigned int sum; + zone_t *zone; + pg_data_t *pgdat = pgdat_list; + + if (type >= MAX_NR_ZONES) + BUG(); + type++; + + sum = 0; + while (pgdat) { + for (zone = pgdat->node_zones; zone < pgdat->node_zones + type; +zone++) + sum += zone->inactive_clean_pages; + pgdat = pgdat->node_next; + } + return sum; +} + /* * Amount of free RAM allocatable as buffer memory: */ @@ -607,9 +647,9 @@ { unsigned int sum; - sum = nr_free_pages(); - sum += nr_inactive_clean_pages(); - sum += nr_inactive_dirty_pages; + sum = nr_free_pages_zone(ZONE_NORMAL); + sum += nr_inactive_clean_pages_zone(ZONE_NORMAL); + sum += nr_inactive_dirty_pages; /* XXX */ /* * Keep our write behind queue filled, even if
[PATCH] Linus elevator
Hi,

Looking at the second loop in elevator_linus_merge(), it is possible for requests to have their elevator_sequence go negative. This can cause a very long latency before the request is finally serviced.

Say, for example, a request (in the queue) is jumped in the first loop in elevator_linus_merge() as "cmd != rw", even though its elevator_sequence is zero. If it is found that the new request will merge, the walking back over requests which were jumped makes no test for an already zeroed elevator_sequence. Hence negative values can occur.

With high default values for read/write_latency, this hardly ever occurs.

A simple fix for this is to test for zero before decrementing (patch below) in the second loop. Alternatively, should the testing in the first loop be modified?

Mark

diff -u --recursive --new-file -X dontdiff linux-2.4.0-test12/drivers/block/elevator.c markhe-2.4.0-test12/drivers/block/elevator.c
--- linux-2.4.0-test12/drivers/block/elevator.c	Tue Dec  5 23:05:26 2000
+++ markhe-2.4.0-test12/drivers/block/elevator.c	Mon Dec 18 17:50:19 2000
@@ -90,6 +90,9 @@
 	if (ret != ELEVATOR_NO_MERGE && *req) {
 		while ((entry = entry->next) != &q->queue_head) {
 			struct request *tmp = blkdev_entry_to_request(entry);
+
+			if (!tmp->elevator_sequence)
+				continue;
 			tmp->elevator_sequence--;
 		}
 	}
Re: test13-pre5
Hi,

On Thu, 28 Dec 2000, David S. Miller wrote:
>    Date: Thu, 28 Dec 2000 23:17:22 +0100
>    From: Andi Kleen <[EMAIL PROTECTED]>
>
>    Would you consider patches for any of these points?
>
> To me it seems just as important to make sure struct page is
> a power of 2 in size, with the waitq debugging turned off this
> is true for both 32-bit and 64-bit hosts last time I checked.

Checking test11 (which I'm running here), even with waitq debugging turned off, on 32-bit (IA32) the struct page is 68 bytes (since the "age" member was re-introduced a while back).

For my development testing, I'm running a _heavily_ hacked kernel. One of these hacks is to pull the wait_queue_head out of struct page; the waitq-heads are in a separate allocated area of memory, with a waitq-head pointer embedded in the page structure (allocated/initialised in free_area_init_core()). This gives a page structure of 60 bytes, giving me one free double-word to play with (which I'm using as a pointer to a release function).

In fact, there doesn't need to be a waitq-head allocated for each page structure - they can share, with a performance overhead of a false wakeup in __wait_on_page().

Note, for those of us running on 32-bit with lots of physical memory, the available virtual address-space is a major consideration. Reducing the size of the page structure is more than just reducing cache misses - it gives us more virtual to play with...

Mark
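For illustration, the "share the waitq-heads" idea can be as simple as hashing the page to a small table; a sketch only (names and sizes are made up here - the cost is the false wakeups in __wait_on_page() mentioned above):

	#define PAGE_WAIT_SHIFT	8
	#define PAGE_WAIT_SIZE	(1 << PAGE_WAIT_SHIFT)

	static wait_queue_head_t page_wait_table[PAGE_WAIT_SIZE];

	static inline wait_queue_head_t *page_waitqueue(struct page *page)
	{
		unsigned long hash = (unsigned long) page;

		/* cheap hash: fold the pointer down to a table index */
		hash ^= hash >> PAGE_WAIT_SHIFT;
		hash ^= hash >> (2 * PAGE_WAIT_SHIFT);
		return &page_wait_table[hash & (PAGE_WAIT_SIZE - 1)];
	}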
Re: test13-pre5
On Fri, 29 Dec 2000, Tim Wright wrote:
> Yes, this is a very important point if we ever want to make serious use
> of large memory machines on ia32. We ran into this with DYNIX/ptx when the
> P6 added 36-bit physical addressing. Conserving KVA (kernel virtual address
> space), became a very high priority. Eventually, we had to add code to play
> silly segment games and "magically" materialize and dematerialize a 4GB
> kernel virtual address space instead of the 1GB. This only comes into play
> with really large amounts of memory, and is almost certainly not worth the
> agony of implementation on Linux, but we'll need to be careful elsewhere to
> conserve it as much as possible.

Indeed. I'm compiling my kernels with 2GB virtual. Not because I want more NORMAL pages in the page cache (HIGH memory is fine), but because I need NORMAL pages for kernel data/structures (memory allocated from slab-caches) which need to be constantly mapped in.

Mark
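For reference, the arithmetic behind the 2GB choice (assuming the usual i386 layout): with the stock PAGE_OFFSET of 0xC0000000 the kernel gets 1GB of virtual space, and once vmalloc and fixmap space is subtracted the direct map covers a bit under 1GB of RAM - everything else must be HIGHMEM. Moving PAGE_OFFSET down to 0x80000000 doubles the kernel's share to 2GB, so nearly 2GB of DMA+NORMAL memory can stay permanently mapped for slab and other kernel allocations, at the price of capping each user process at 2GB of virtual address space.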
Re: the new VMt
Hi,

On Mon, 25 Sep 2000, Stephen C. Tweedie wrote:
> So you have run out of physical memory --- what do you do about it?

Why let the system get into the state where it is necessary to kill a process?

Per-user/task resource counters should prevent unprivileged users from soaking up too many resources. That is the DoS protection. So an OOM is possibly;

  1) A privileged, legally resource hungry, app(s) has taken all the
     memory. Could be too important to simply kill (it should exit
     gracefully).
  2) Simply too many tasks*(memory-requirements-of-each-task).

Ignoring allocations done by the kernel, the situation comes down to the fact that the system has over committed its memory resources. ie. it has sold too many tickets for the number of seats in the plane, and all the passengers have turned up. (Note, I use the term "memory" and not "physical memory"; I'm including swap space.)

Why not protect the system from over committing its memory resources? It is possible to do true, system wide, resource counting of physical memory and swap space, and to deny a fork() or mmap() which would cause over committing of memory resources if everyone cashed in their requirements.

Named pages (those which came from a file) are the simplest to handle. If dirty, they already have allocated backing store, so we know there is somewhere to put them when memory is low. How many named pages need to be held in physical memory at any one instance for the system to function? Only a few, although if you reach that state, the system will be thrashing itself to death.

Anonymous and copied (those faulted from a write to a MAP_PRIVATE|MAP_WRITE mapping) pages can be stored in either physical memory or on swap. To avoid getting into the OOM situation, when these mappings are created the system needs to check that it has (and will have, in the future) space for every page that _could_ be allocated for the mapping - ie. work out the worst case (including page-tables). This space could be on swap or in physical memory. It is the accounting which needs to be done, not the actual allocation (and not even the decision of where to store the page when allocated - that is made much later, when it needs to be). If a machine has 2GB of RAM, a 1MB swap, and 1GB of dirty anon or copied pages, that is fine.

I'm stressing this point, as the scheme of reserving space for an (as yet) unallocated page is sometimes referred to as "eager swap allocation" (or some such similar term). This is confusing. People then start to believe they need backing store for each anon/copied page. You don't. You simply need somewhere to store it, and that could be a physical page. It is all in the accounting. :)

Allocations made by the kernel, for the kernel, are (obviously) pinned memory. To ensure kernel allocations do not completely exhaust physical memory (or cause physical memory to be over committed if the worst case occurs), they need to be limited. How to limit? As a first guess (and this is only a guess);

  1) don't let kernel allocations exceed 25% of physical memory (tunable)
  2) don't let kernel allocations succeed if they would cause over
     commitment.

Both conditions would need to pass before an allocation could succeed. This does need much more thought. Should some tuning be per subsystem? I don't know. Perhaps 1) isn't needed. I'm not sure.

Because of 2), the total physical memory accounted for anon/copied pages needs to have a high watermark.
Otherwise, in the accounting, the system could allow too much physical memory to be reserved for these types of pages (there doesn't need to be space on swap for each anon/copied page, just space somewhere - a watermark would prevent too much of this being physical memory). Note, this doesn't mean start swapping earlier - remember, this is accounting of anon/copied pages to avoid over commitment.

For named pages, the page cache needs to have a reserved number of physical pages (ie. how small is it allowed to get before pruning stops). Again, these reserved pages are in the accounting.

mlock()ed pages need to have accounting also, to prevent over commitment of physical memory. All fun.

The disadvantages;

  1) Extra code to do the accounting. This shouldn't be too heavy.
  2) mmap(MAP_ANON)/mmap(MAP_PRIVATE|MAP_SHARED) can fail more readily.

Programs which expect to memory map areas (which would create anon/copied pages when written to) will see an increased failure rate in mmap(). This can be very annoying, especially when you know the mapping will be used sparsely. One solution is to add a new mmap() flag, which tells the kernel to let this mmap() exceed the actual resources. With such a flag, the mmap() will be allowed, but the task should expect to be killed if memory is exhausted. (It could be
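To make the accounting concrete, a minimal sketch of the idea (all names hypothetical, not an existing kernel interface): keep one global count of "promised" pages against a limit of RAM plus swap, charge the worst case when a writable private/anon mapping (or a fork of one) is created, and uncharge on unmap/exit.

	static unsigned long commit_limit;	/* pages of RAM + swap we may promise */
	static unsigned long committed_pages;	/* pages currently promised */
	static spinlock_t commit_lock = SPIN_LOCK_UNLOCKED;

	/* called at mmap()/fork() time with the mapping's worst case */
	static int charge_commitment(unsigned long pages)
	{
		int ret = -ENOMEM;

		spin_lock(&commit_lock);
		if (committed_pages + pages <= commit_limit) {
			committed_pages += pages;
			ret = 0;	/* promise accepted */
		}
		spin_unlock(&commit_lock);
		return ret;		/* otherwise refuse the mmap()/fork() up front */
	}

	static void uncharge_commitment(unsigned long pages)
	{
		spin_lock(&commit_lock);
		committed_pages -= pages;
		spin_unlock(&commit_lock);
	}

The watermark discussed above is then just a second limit on how much of committed_pages may be backed by physical memory rather than swap.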
Re: test10-pre1 problems on 4-way SuperServer8050
Hi Tigran,

On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> malfunctioning, interrupts are generated but no traffic gets through (YES,
> I did plug it in correctly, this time, and I repeat 2.2.16 works!)

I saw this the other week on our two-way Dell under a reasonably heavy load - but with the 3c59x.c driver, the eepro100s survived! Either NIC (had two Tornados) could go this way after anything from 1 to 36 hours of load. They would end up running in "poll" mode off the transmit watchdog timer.

Swapped them for a dual-port eepro100 and no more problems.

Mark
[PATCH] kmap performance
Hi, Two performance changes against 2.4.3. flush_all_zero_pkmaps() is guarding against a race which cannot happen, and thus hurting performance. It uses the atomic fetch-and-clear "ptep_get_and_clear()" operation, which is much stronger than needed. No-one has the page mapped, and cannot get at the page's virtual address without taking the kmap_lock, which flush_all_zero_pkmaps() holds. Even with speculative execution, there are no races which need to be closed via an atomic op. On a two-way, Pentium-III system, flush_all_zero_pkmaps() was taking over 200K CPU cycles (with the flush_tlb_all() only accounting for ~9K of those cycles). This patch replaces ptep_get_and_clear() with a pte_page(), and pte_clear(). This reduces flush_all_zero_pkmaps() to around 75K cycles. The second part of this patch adds a conditional guard to the wake_up() call in kunmap_high(). With most usage patterns, a page will not be simultaneously mapped more than once, hence the most common case (by far) is for pkmap_count[] to decrement to 1. This was causing an unconditional call to wake_up(), when (again) the common case is to have no tasks in the wait-queue. This patches adds a guard to the wake_up() using an inlined waitqueue_active(), and so avoids unnecessary function calls. It also drops the actual wake_up() to be outside of the spinlock. This is safe, as any waiters will have placed themselves onto the queue under the kmap_lock, and kunmap_high() tests the queue under this lock. Mark diff -urN -X dontdiff linux-2.4.3/mm/highmem.c markhe-2.4.3/mm/highmem.c --- linux-2.4.3/mm/highmem.cTue Nov 28 20:31:02 2000 +++ markhe-2.4.3/mm/highmem.c Sat Mar 31 15:03:43 2001 @@ -46,7 +46,7 @@ for (i = 0; i < LAST_PKMAP; i++) { struct page *page; - pte_t pte; + /* * zero means we don't have anything to do, * >1 means that it is still in use. Only @@ -56,10 +56,21 @@ if (pkmap_count[i] != 1) continue; pkmap_count[i] = 0; - pte = ptep_get_and_clear(pkmap_page_table+i); - if (pte_none(pte)) + + /* sanity check */ + if (pte_none(pkmap_page_table[i])) BUG(); - page = pte_page(pte); + + /* +* Don't need an atomic fetch-and-clear op here; +* no-one has the page mapped, and cannot get at +* its virtual address (and hence PTE) without first +* getting the kmap_lock (which is held here). +* So no dangers, even with speculative execution. +*/ + page = pte_page(pkmap_page_table[i]); + pte_clear(&pkmap_page_table[i]); + page->virtual = NULL; } flush_tlb_all(); @@ -139,6 +150,7 @@ { unsigned long vaddr; unsigned long nr; + int need_wakeup; spin_lock(&kmap_lock); vaddr = (unsigned long) page->virtual; @@ -150,13 +162,31 @@ * A count must never go down to zero * without a TLB flush! */ + need_wakeup = 0; switch (--pkmap_count[nr]) { case 0: BUG(); case 1: - wake_up(&pkmap_map_wait); + /* +* Avoid an unnecessary wake_up() function call. +* The common case is pkmap_count[] == 1, but +* no waiters. +* The tasks queued in the wait-queue are guarded +* by both the lock in the wait-queue-head and by +* the kmap_lock. As the kmap_lock is held here, +* no need for the wait-queue-head's lock. Simply +* test if the queue is empty. +*/ + need_wakeup = waitqueue_active(&pkmap_map_wait); } spin_unlock(&kmap_lock); + + /* +* Can do wake-up, if needed, race-free outside of +* the spinlock. 
+*/ + if (need_wakeup) + wake_up(&pkmap_map_wait); } /*
[PATCH] Possible SCSI + block-layer bugs
Hi, I've never seen these trigger, but they look theoretically possible. When processing the completion of a SCSI request in a bottom-half, __scsi_end_request() can find all the buffers associated with the request haven't been completed (ie. leftovers). One question is; can this ever happen? If it can't then the code should be removed from __scsi_end_request(), if it can happen then there appears to be a few problems; The request is re-queued to the block layer via scsi_queue_next_request(), which uses the "special" pointer in the request structure to remember the Scsi_Cmnd associated with the request. The SCSI request function is then called, but doesn't guarantee to immediately process the re-queued request even though it was added at the head (say, the queue has become plugged). This can trigger two possible bugs. The first is that __scsi_end_request() doesn't decrement the hard_nr_sectors count in the request. As the request is back on the queue, it is possible for newly arriving buffer-heads to merge with the heads already hanging off the request. This merging uses the hard_nr_sectors when calculating both the merged hard_nr_sectors and nr_sectors counts. As the request is at the head, only back-merging can occur, but if __scsi_end_request() triggers another uncompleted request to be re-queued, it is possible to get front merging as well. The merging of a re-queued request looks safe, except for the hard_nr_sectors. This patch corrects the hard_nr_sectors accounting. The second bug is from request merging in attempt_merge(). For a re-queued request, the request structure is the one embedded in the Scsi_Cmnd (which is a copy of the request taken in the scsi_request_fn). In attempt_merge(), q->merge_requests_fn() is called to see the requests are allowed to merge. __scsi_merge_requests_fn() checks number of segments, etc, but doesn't check if one of the requests is a re-queued one (ie. no test against ->special). This can lead to attempt_merge() releasing the embedded request structure (which, as an extract copy, has the ->q set, so to blkdev_release_request() it looks like a request which originated from the block layer). This isn't too healthy. The fix here is to add a check in __scsi_merge_requests_fn() to check for ->special being non-NULL. The attached patch is against 2.4.3. Jens, Eric, anyone, comments? Mark diff -urN -X dontdiff linux-2.4.3/drivers/scsi/scsi_lib.c markhe-2.4.3/drivers/scsi/scsi_lib.c --- linux-2.4.3/drivers/scsi/scsi_lib.c Sat Mar 3 02:38:39 2001 +++ markhe-2.4.3/drivers/scsi/scsi_lib.cSat Mar 31 16:56:31 2001 @@ -377,12 +377,15 @@ nsect = bh->b_size >> 9; blk_finished_io(nsect); req->bh = bh->b_reqnext; - req->nr_sectors -= nsect; - req->sector += nsect; bh->b_reqnext = NULL; sectors -= nsect; bh->b_end_io(bh, uptodate); if ((bh = req->bh) != NULL) { + req->hard_sector += nsect; + req->hard_nr_sectors -= nsect; + req->sector += nsect; + req->nr_sectors -= nsect; + req->current_nr_sectors = bh->b_size >> 9; if (req->nr_sectors < req->current_nr_sectors) { req->nr_sectors = req->current_nr_sectors; diff -urN -X dontdiff linux-2.4.3/drivers/scsi/scsi_merge.c markhe-2.4.3/drivers/scsi/scsi_merge.c --- linux-2.4.3/drivers/scsi/scsi_merge.c Fri Feb 9 19:30:23 2001 +++ markhe-2.4.3/drivers/scsi/scsi_merge.c Sat Mar 31 16:38:27 2001 @@ -597,6 +597,13 @@ Scsi_Device *SDpnt; struct Scsi_Host *SHpnt; + /* +* First check if the either of the requests are re-queued +* requests. Can't merge them if they are. 
+*/ + if (req->special || next->special) + return 0; + SDpnt = (Scsi_Device *) q->queuedata; SHpnt = SDpnt->host;
[Patch] Pushing the global kernel lock (aka BKL)
Hi, Several places in the kernel run holding the global kernel lock when it isn't needed. This usually occurs when where is a function which can be called via many different paths; some holding the lock, others not. Now, if a function can block (and hence drop the kernel lock) the caller cannot rely on the kernel lock for its own integerity. Such a functon _may_ be a candidate for dropping the lock (if it is held) on entry, and retaken on exit. This improves SMP scalability. A good example of this is generic_make_request(). This function can be arrived at by several different paths, some of which will be holding the lock, and some not. It (or, rather the functions it can call) may block and a caller has no control over this. generic_make_request() does not need the kernel lock itself, and any functions it calls which do require the lock should be taking it (as they cannot guarantee it is already held). This makes it a good candidate for dropping the lock early (rather than only dropping when blocking). Dropping the kernel lock, even for a short period, helps scalability. Note, if the lock is held when an interrupt arrives, the interrupt handler runs holding the lock and so do any bottom-half handlers run on the back of the interrupt. So, less time it is held, the smaller the chance of being interrupted while holding it, the better the scalability. As the current nfsd code isn't threaded, it runs holding the kernel lock. Any reduction in holding the lock helps nfs server scalability. The current macros used to drop and retake the kernel lock in schedule() cannot be used in many cases as they do not nest. The attached patch (against 2.4.1-pre10) defines two new macros for x86 (save_and_drop_kernel_lock(x) and restore_kernel_lock(x)) and a new declaration macro (DECLARE_KERNEL_LOCK_COUNTER(x)). These allow "pushing" and "poping" of the kernel lock. The patch also contains some examples of usage (although the one in nfsd/vfs.c could be done with an unlock_kernel()/lock_kernel() pair). If the idea is acceptable, I'll tidy up the patch and add the necessary macros for other archs. My questions are; a) Does this make sense? b) Is it acceptable in the kernel? c) Any better suggestions for the macro names? I'd be interested in any results from those applying this patch and running benchmarks on SMP boxes - mainly filesystem benchmarks. I admit this is not the cleanest of ideas. Mark diff -urN -X dontdiff vxfs-2.4.1-pre10/drivers/block/ll_rw_blk.c markhe-2.4.1-pre10/drivers/block/ll_rw_blk.c --- vxfs-2.4.1-pre10/drivers/block/ll_rw_blk.c Wed Jan 24 10:56:23 2001 +++ markhe-2.4.1-pre10/drivers/block/ll_rw_blk.cThu Jan 25 09:47:09 2001 @@ -907,10 +907,24 @@ { int major = MAJOR(bh->b_rdev); request_queue_t *q; + DECLARE_KERNEL_LOCK_COUNTER(lock_depth) if (!bh->b_end_io) BUG(); + /* +* The caller cannot make any assumptions as to whether this +* function (or, rather, the funcs called from here) will block +* or not. +* Also, we can be called with or without the global kernel lock. +* As the kernel lock isn't need here, and should be taken by +* any functions called from here which need it, it is safe to +* drop the lock. This helps scalability (and _really_ helps +* scalability when there is a threaded volume manager sitting +* below). 
+*/ + save_and_drop_kernel_lock(lock_depth); + if (blk_size[major]) { unsigned long maxsector = (blk_size[major][MINOR(bh->b_rdev)] << 1) + 1; unsigned long sector = bh->b_rsector; @@ -931,6 +945,7 @@ blk_size[major][MINOR(bh->b_rdev)]); } bh->b_end_io(bh, 0); + restore_kernel_lock(lock_depth); return; } } @@ -953,6 +968,8 @@ break; } } while (q->make_request_fn(q, rw, bh)); + + restore_kernel_lock(lock_depth); } diff -urN -X dontdiff vxfs-2.4.1-pre10/fs/nfsd/vfs.c markhe-2.4.1-pre10/fs/nfsd/vfs.c --- vxfs-2.4.1-pre10/fs/nfsd/vfs.c Fri Dec 29 22:07:23 2000 +++ markhe-2.4.1-pre10/fs/nfsd/vfs.cThu Jan 25 10:01:51 2001 @@ -30,6 +30,7 @@ #include #include #include +#include #include #define __NO_VERSION__ #include @@ -596,6 +597,7 @@ mm_segment_toldfs; int err; struct file file; + DECLARE_KERNEL_LOCK_COUNTER(lock_depth) err = nfsd_open(rqstp, fhp, S_IFREG, MAY_READ, &file); if (err) @@ -618,12 +620,25 @@ file.f_ralen = ra->p_ralen; file.f_rawin = ra->p_rawin; } + + /* +* The nfs daemons run holding the glob
Re: [patch] 4GB I/O, cut three
Hi Jens,

I ran this (well, cut-two) on a 4-way box with 4GB of memory and a modified qlogic fibre channel driver with 32 disks hanging off it, without any problems. The test used was SpecFS 2.0.

Performance is definitely up - but I can't give an exact number, as the run with this patch was compiled with no-omit-frame-pointer for debugging any probs.

I did change the patch so that bounce-pages always come from the NORMAL zone, hence the ZONE_DMA32 zone isn't needed. I avoided the new zone, as I'm not 100% sure the VM is capable of keeping the zones it already has balanced - and adding another one might break the camel's back. But as the test box has 4GB, it wasn't bouncing anyway.

Mark

On Tue, 29 May 2001, Jens Axboe wrote:
> Another day, another version.
>
> Bugs fixed in this version: none
> Known bugs in this version: none
>
> In other words, it's perfect of course.
>
> Changes:
>
> - Added ide-dma segment coalescing
> - Only print highmem I/O enable info when HIGHMEM is actually set
>
> Please give it a test spin, especially if you have 1GB of RAM or more.
> You should see something like this when booting:
>
> hda: enabling highmem I/O
> ...
> SCSI: channel 0, id 0: enabling highmem I/O
>
> depending on drive configuration etc.
>
> Plea to maintainers of the different architectures: could you please add
> the arch parts to support this? This includes:
>
> - memory zoning at init time
> - page_to_bus
> - pci_map_page / pci_unmap_page
> - set_bh_sg
> - KM_BH_IRQ (for HIGHMEM archs)
>
> I think that's it, feel free to send me questions and (even better)
> patches.
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> >
> > I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32 disks hanging off it, without
> > any problems. The test used was SpecFS 2.0
>
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> change I'm interested in, I'm just not sure what driver you used :-)

The qlogic driver is the one from;

	http://www.feral.com/isp.html

I find this much more stable than the one already in the kernel. It did just need the one-liner change, but as the driver isn't in the kernel there isn't much point adding that change to your patch. :)

> > I did change the patch so that bounce-pages always come from the NORMAL
> > zone, hence the ZONE_DMA32 zone isn't needed. I avoided the new zone, as
> > I'm not 100% sure the VM is capable of keeping the zones it already has
> > balanced - and adding another one might break the camel's back. But as the
> > test box has 4GB, it wasn't bouncing anyway.
>
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing handle
> the extra zone?

In theory it should do - ie. there isn't anything to stop it. With NFS loads, over a ported VxFS filesystem, I do see some problems between the NORMAL and HIGH zones. Thinking about it, ZONE_DMA32 shouldn't make this any worse.

Rik, Andrea, quick description of a balancing problem;

Consider a VM which is under load (but not stressed), such that all zone free-page pools are between their MIN and LOW marks, with pages in the inactive_clean lists.

The NORMAL zone has non-zero page order allocations thrown at it. This causes __alloc_pages() to reap pages from the NORMAL inactive_clean list until the required buddy is built. The blind reaping causes the NORMAL zone to have a large number of free pages (greater than ->pages_low).

Now, when HIGHMEM allocations come in (for page cache pages), they skip the HIGH zone and use the NORMAL zone (as it now has plenty of free pages) - the code at the top of __alloc_pages(), which checks against ->pages_low. But the NORMAL zone is usually under more pressure than the HIGH zone - as many more allocations need ready-mapped memory. This causes the page-cache pages from the NORMAL zone to come under more pressure, and to be "recycled" quicker than page-cache pages in the HIGHMEM zone.

OK, we shouldn't be throwing too many non-zero order page allocations at __alloc_pages(), but it does happen. Also, the problem isn't as bad as it first looks - HIGHMEM page-cache pages do get "recycled" (reclaimed), but there is a slight imbalance.

Mark
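For readers without the source to hand, the first pass at the top of __alloc_pages() behaves roughly like this (paraphrased from 2.4, not verbatim):

	zone_t **zone_p = zonelist->zones;	/* e.g. HIGHMEM, NORMAL, DMA */
	zone_t *z;

	while ((z = *zone_p++) != NULL) {
		if (z->free_pages >= z->pages_low) {
			page = rmqueue(z, order);	/* grab from the buddy lists */
			if (page)
				return page;
		}
	}
	/* otherwise fall through to the harder, reclaiming paths */

So once blind reclaim has pushed the NORMAL zone's free_pages well above pages_low while HIGHMEM hovers around its mark, this first pass keeps handing out NORMAL pages for page-cache allocations that could happily have been satisfied from HIGHMEM.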
ll_rw_blk.c and high memory
Hi Jens, all,

In drivers/block/ll_rw_blk.c:blk_dev_init(), the high and low queued sectors are calculated from the total number of free pages in all memory zones. Shouldn't this calculation be based upon the number of pages upon which I/O can be done directly (ie. without bounce pages)?

On a box with gigabytes of memory, high_queued_sectors becomes larger than the amount of memory upon which block I/O can be directly done - the test in ll_rw_block() against high_queued_sectors is then never true. The throttling of submitted I/O happens much earlier in the stack - at the allocation of a bounce page. This doesn't seem right.

Mark
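A sketch of what that suggestion could look like in blk_dev_init() (illustrative only - the fractions are made up, and the nr_*_zone() helpers are the ones from the earlier HIGHMEM patches, not mainline):

	unsigned long directio_pages;

	/* only DMA + NORMAL pages can take block I/O without bouncing */
	directio_pages = nr_free_pages_zone(ZONE_NORMAL) +
			 nr_free_pages_zone(ZONE_DMA);

	/* base the throttling thresholds on that, in 512-byte sectors */
	high_queued_sectors = (directio_pages / 3) * (PAGE_SIZE >> 9);
	low_queued_sectors  = high_queued_sectors / 2;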
Re: [patch] 4GB I/O, cut three
Hi again, :) On Tue, 29 May 2001, Jens Axboe wrote: > Another day, another version. > > Bugs fixed in this version: none > Known bugs in this version: none > > In other words, it's perfect of course. With the scsi-high patch, I'm not sure about the removal of the line from __scsi_end_request(); req->buffer = bh->b_data; A requeued request is not always processed immediately, so new buffer-heads arriving at the block-layer can be merged against it. A requeued request is placed at the head of a request list, so nothing can merge with it - but what about if multiple requests are requeued on the same queue? In Linus's tree, requests requeued via the SCSI layer can cause problems (corruption). I sent out a patch to cover this a few months back, which got picked up by Alan (its in the -ac series - see the changes to scsi_lib.c and scsi_merge.c) but no one posted any feedback. I've included some of the original message below. Mark -- >From [EMAIL PROTECTED] Sat Mar 31 16:07:14 2001 +0100 Date: Sat, 31 Mar 2001 16:07:13 +0100 (BST) From: Mark Hemment <[EMAIL PROTECTED]> Subject: [PATCH] Possible SCSI + block-layer bugs Hi, I've never seen these trigger, but they look theoretically possible. When processing the completion of a SCSI request in a bottom-half, __scsi_end_request() can find all the buffers associated with the request haven't been completed (ie. leftovers). One question is; can this ever happen? If it can't then the code should be removed from __scsi_end_request(), if it can happen then there appears to be a few problems; The request is re-queued to the block layer via scsi_queue_next_request(), which uses the "special" pointer in the request structure to remember the Scsi_Cmnd associated with the request. The SCSI request function is then called, but doesn't guarantee to immediately process the re-queued request even though it was added at the head (say, the queue has become plugged). This can trigger two possible bugs. The first is that __scsi_end_request() doesn't decrement the hard_nr_sectors count in the request. As the request is back on the queue, it is possible for newly arriving buffer-heads to merge with the heads already hanging off the request. This merging uses the hard_nr_sectors when calculating both the merged hard_nr_sectors and nr_sectors counts. As the request is at the head, only back-merging can occur, but if __scsi_end_request() triggers another uncompleted request to be re-queued, it is possible to get front merging as well. The merging of a re-queued request looks safe, except for the hard_nr_sectors. This patch corrects the hard_nr_sectors accounting. The second bug is from request merging in attempt_merge(). For a re-queued request, the request structure is the one embedded in the Scsi_Cmnd (which is a copy of the request taken in the scsi_request_fn). In attempt_merge(), q->merge_requests_fn() is called to see the requests are allowed to merge. __scsi_merge_requests_fn() checks number of segments, etc, but doesn't check if one of the requests is a re-queued one (ie. no test against ->special). This can lead to attempt_merge() releasing the embedded request structure (which, as an extract copy, has the ->q set, so to blkdev_release_request() it looks like a request which originated from the block layer). This isn't too healthy. The fix here is to add a check in __scsi_merge_requests_fn() to check for ->special being non-NULL. 
Re: [patch] 4GB I/O, cut three
On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > This can lead to attempt_merge() releasing the embedded request
> > structure (which, as an exact copy, has ->q set, so to
> > blkdev_release_request() it looks like a request which originated from
> > the block layer). This isn't too healthy.
> >
> > The fix here is to add a check in __scsi_merge_requests_fn() for
> > ->special being non-NULL.
>
> How about just adding
>
> 	if (req->cmd != next->cmd
> 	    || req->rq_dev != next->rq_dev
> 	    || req->nr_sectors + next->nr_sectors > q->max_sectors
> 	    || next->sem || req->special)
> 		return;
>
> ie check for special too, that would make sense to me. Either way would
> work, but I'd rather make this explicit in the block layer that 'not
> normal' requests are left alone. That includes stuff with the sem set,
> or special.

Yes, that is an equivalent fix. In the original patch I wanted to keep the change local (ie. in the SCSI layer), but pushing the check up into the generic block layer makes sense.

Are you going to push this change to Linus, or should I? I'm assuming the other SCSI-layer changes in Alan's tree will eventually be pushed.

Mark
Re: [PATCH] copy_from_high_bh
On Sun, 8 Jul 2001, Linus Torvalds wrote:
> On Sun, 8 Jul 2001 [EMAIL PROTECTED] wrote:
> >
> > mm/highmem.c/copy_from_high_bh() blocks interrupts while copying "down"
> > to a bounce buffer, for writing.
> > This function is only ever called from create_bounce() (which cannot
> > be called from an interrupt context - it may block), and the kmap
> > 'KM_BOUNCE_WRITE' is only used in copy_from_high_bh().
>
> If so it's not just the interrupt disable that is unnecessary, the
> "kmap_atomic()" should also be just a plain "kmap()", I think.

That might eat through kmap()'s pool of virtual windows too quickly.

I did think about adding a test to see whether the page is already mapped, and falling back to kmap_atomic() if it isn't. That should give the best of both worlds? (A sketch follows below.)

Mark
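Something like this is what I was thinking of (a sketch only, not a tested patch; the already-mapped test is hypothetical, not an existing 2.4 interface):

	/*
	 * Copy a highmem buffer_head down to a bounce buffer without
	 * disabling interrupts.  If the page already has a kernel mapping,
	 * reuse it via kmap() (a cheap reference-count bump); otherwise take
	 * a short-lived atomic mapping so we don't consume one of kmap()'s
	 * virtual windows for a one-off copy.
	 */
	static inline void copy_from_high_bh(struct buffer_head *to,
					     struct buffer_head *from)
	{
		struct page *p_from = from->b_page;
		char *vfrom;

		if (page_already_mapped(p_from)) {	/* hypothetical test */
			vfrom = kmap(p_from);
			memcpy(to->b_data, vfrom + bh_offset(from), to->b_size);
			kunmap(p_from);
		} else {
			vfrom = kmap_atomic(p_from, KM_BOUNCE_WRITE);
			memcpy(to->b_data, vfrom + bh_offset(from), to->b_size);
			kunmap_atomic(vfrom, KM_BOUNCE_WRITE);
		}
	}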
Re: [PATCH] allow PF_MEMALLOC tasks to directly reclaim pages
Marcelo,

In fact, the test can be made even weaker than that. We only want to avoid the inactive-clean list when allocating from within an interrupt (or from a bottom-half handler), to avoid deadlocking on the pagecache_lock and pagemap_lru_lock. Note: no allocations are done while holding either of these locks.

Even GFP_ATOMIC doesn't matter; that simply says whether we may block or not (it could be an allocation made while holding a spinlock, but not inside an interrupt).

I've been running very heavy load on a 4-way server with

	if (order == 0 && !in_interrupt())
		direct_reclaim = 1;

without any problems (a short sketch of how this sits in __alloc_pages() follows the quoted patch below). Of course, in_interrupt() is heavier than (gfp_mask & __GFP_WAIT), but the benefits outweigh the slight fattening.

Mark

On Fri, 27 Apr 2001, Marcelo Tosatti wrote:
>
> Linus,
>
> Currently __alloc_pages() does not allow PF_MEMALLOC tasks to free clean
> inactive pages.
>
> This is senseless --- if the allocation has __GFP_WAIT set, its ok to grab
> the pagemap_lru_lock/pagecache_lock/etc.
>
> I checked all possible codepaths after reclaim_page() and they are ok.
>
> The following patch fixes that.
>
> --- linux/mm/page_alloc.c.orig	Fri Apr 27 05:59:35 2001
> +++ linux/mm/page_alloc.c	Fri Apr 27 05:59:48 2001
> @@ -295,8 +295,7 @@
>  	 * Can we take pages directly from the inactive_clean
>  	 * list?
>  	 */
> -	if (order == 0 && (gfp_mask & __GFP_WAIT) &&
> -		!(current->flags & PF_MEMALLOC))
> +	if (order == 0 && (gfp_mask & __GFP_WAIT))
>  		direct_reclaim = 1;
>
>  	/*
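In __alloc_pages() terms, the weaker test sits where Marcelo's hunk touches (sketch only):

	/*
	 * Allow direct reclaim from the inactive_clean list for order-0
	 * allocations whenever we are not in interrupt (or bottom-half)
	 * context; the only thing to avoid is deadlocking on the
	 * pagecache_lock/pagemap_lru_lock.  PF_MEMALLOC and GFP_ATOMIC no
	 * longer matter here.
	 */
	if (order == 0 && !in_interrupt())
		direct_reclaim = 1;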
kernel lock in dcache.c
Hi,

d_move() in fs/dcache.c checks that the kernel lock is held (switch_names() does the same, but is only called from d_move()). My question is: why? I can't see what it is using the kernel lock to synchronise with or protect against.

Anyone out there know?

Thanks,
Mark
[PATCH] allocation looping + kswapd CPU cycles
In 2.4.3pre6, code in page_alloc.c:__alloc_pages() changed from

	try_to_free_pages(gfp_mask);
	wakeup_bdflush();
	if (!order)
		goto try_again;

to

	try_to_free_pages(gfp_mask);
	wakeup_bdflush();
	goto try_again;

This introduced the effect that a non-zero order, __GFP_WAIT allocation (without PF_MEMALLOC set) never returns failure: the allocation keeps looping in __alloc_pages(), kicking kswapd, until it succeeds.

If there is plenty of memory in the free-pools and inactive-lists, free_shortage() returns false, so kswapd does nothing to 'improve' the state of those free-pools/inactive-lists. If nothing else is changing/improving the free-pools or inactive-lists, the allocation loops forever (kicking kswapd).

Does anyone know why the 2.4.3pre6 change was made?

The attached patch (against 2.4.5-pre1) fixes the looping symptom by adding a counter and looping only twice for non-zero order allocations. The real fix is to measure fragmentation and the progress of kswapd, but that is too drastic for 2.4.x.

Mark

diff -ur linux-2.4.5-pre1/mm/page_alloc.c markhe-2.4.5-pre1/mm/page_alloc.c
--- linux-2.4.5-pre1/mm/page_alloc.c	Fri Apr 27 22:18:08 2001
+++ markhe-2.4.5-pre1/mm/page_alloc.c	Tue May 8 13:42:12 2001
@@ -275,6 +275,7 @@
 {
 	zone_t **zone;
 	int direct_reclaim = 0;
+	int loop;
 	unsigned int gfp_mask = zonelist->gfp_mask;
 	struct page * page;
@@ -313,6 +314,7 @@
 	    && nr_inactive_dirty_pages >= freepages.high)
 		wakeup_bdflush(0);
+	loop = 0;
 try_again:
 	/*
 	 * First, see if we have any zones with lots of free memory.
@@ -453,7 +455,8 @@
 	if (gfp_mask & __GFP_WAIT) {
 		memory_pressure++;
 		try_to_free_pages(gfp_mask);
-		goto try_again;
+		if (!order || loop++ < 2)
+			goto try_again;
 	}
 }
Re: [PATCH] allocation looping + kswapd CPU cycles
On Wed, 9 May 2001, Marcelo Tosatti wrote:
> On Wed, 9 May 2001, Mark Hemment wrote:
> > Could we introduce another allocation flag (__GFP_FAIL?) which is or'ed
> > with __GFP_WAIT to limit the looping?
>
> __GFP_FAIL is in the -ac tree already and it is being used by the bounce
> buffer allocation code.

Thanks for the pointer.

For non-zero order allocations, the test against __GFP_FAIL is a little too soon; it would be better placed after we've tried to reclaim pages from the inactive-clean list (a rough sketch follows below). Any nasty side effects to doing that?

Plus, the code still prevents PF_MEMALLOC processes from using the inactive-clean list for non-zero order allocations. As the trend seems to be to make zero and non-zero order allocations 'equivalent', shouldn't this restriction be lifted?

Mark
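Roughly what I have in mind (a sketch only, not a tested patch against the -ac code; the tried_inactive_clean flag is illustrative, not an existing page_alloc.c variable):

	/*
	 * For a non-zero order allocation, only honour __GFP_FAIL once the
	 * inactive_clean list has been tried, so a caller with a fallback
	 * (emergency buffer, vmalloc) still gives reclaim one chance before
	 * bailing out.
	 */
	if (order > 0 && (gfp_mask & __GFP_FAIL) && tried_inactive_clean)
		return NULL;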
Re: nasty SCSI performance drop-outs (potential test case)
On Fri, 11 May 2001, null wrote:
> Time to mkfs the same two 5GB LUNs in parallel is 54 seconds. Hmmm.
> Bandwidth on two CPUs is totally consumed (99.9%) and a third CPU is
> usually consumed by the kupdated process. Activity lights on the storage
> device are mostly idle during this time.

I see you've got 1.2GBytes of memory, so are you using HIGHMEM support?

I submitted a patch a few weeks back, against the buffer cache, which makes it behave better with HIGHMEM. The patch improved the time taken to create large filesystems. It got picked up by Alan in his -ac series; I can't remember exactly when Alan merged it, but it is definitely in 2.4.4-ac3. I'd recommend giving it a try.

Mark
Re: [PATCH] allocation looping + kswapd CPU cycles
On Tue, 8 May 2001, David S. Miller wrote:
> Actually, the change was made because it is illogical to try only
> once on multi-order pages. Especially because we depend upon order
> 1 pages so much (every task struct allocated). We depend upon them
> even more so on sparc64 (certain kinds of page tables need to be
> allocated as 1 order pages).
>
> The old code failed _far_ too easily, it was unacceptable.
>
> Why put some strange limit in there? Whatever number you pick
> is arbitrary, and I can probably piece together an allocation
> state where the chosen limit is too small.

Agreed, but some non-zero order allocations can fall back to other schemes (such as an emergency buffer, or using vmalloc for a temporary buffer) and don't want to be trapped in __alloc_pages() for too long.

Could we introduce another allocation flag (__GFP_FAIL?) which is or'ed with __GFP_WAIT to limit the looping?

> So instead, you could test for the condition that prevents any
> possible forward progress, no?

Yes, it is possible to detect when kswapd might not make any useful progress for a failing non-zero order allocation, and to set a global "force" flag (kswapd_force) to ensure it does something useful. For order-1 allocations, that would work (a rough sketch follows below).

For order-2 (and above) it becomes much more difficult, as the page 'reap' routines release/process pages based upon age and do not factor in whether a page may/will buddy (now or in the near future). This 'blind' processing of pages can wipe out a significant percentage of the page cache when trying to build a buddy at a high order.

Of course, no one should be doing really large order allocations and expecting them to succeed. But if they are doing this, the allocation should at least fail.

Mark
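For the order-1 case, the idea is no more than this sketch (kswapd_force is hypothetical, the loop is a simplification of kswapd's real main loop, and the entry point into page reclaim is assumed):

	int kswapd_force;	/* set by a failing non-zero order, blocking allocation */

	for (;;) {
		/*
		 * Do work when there is a conventional shortage, or when a
		 * failing high-order allocation has asked for help even
		 * though free_shortage() reports nothing to do.
		 */
		if (free_shortage() || kswapd_force) {
			kswapd_force = 0;
			try_to_free_pages(GFP_KSWAPD);	/* assumed reclaim entry point */
		}
		/* ... sleep until woken by an allocator ... */
	}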