On Tue, Jan 31, 2017 at 04:01:05PM +1100, Gavin Shan wrote: >On Tue, Jan 31, 2017 at 08:11:39AM +1100, Michael Ellerman wrote: >>Gavin Shan <gws...@linux.vnet.ibm.com> writes: >> >>I'd like to see some test results from multi-node systems. >> >>I'd also like to understand what has changed since we changed >>RECLAIM_DISTANCE in the first place, ie. why did it used to work and now >>doesn't? >> > >[Ccing Mel] > >Michael, thanks for review. I would like to explain a bit more. The issue >addressed by the patch is irrelevant to the number of NUMA nodes. There >is one procfs entry ("/proc/sys/vm/zone_reclaim_mode") which corresponds >to variable @node_reclaim_mode (their names don't match!). it can have >belowing bits or any combination of them. Its default value is RECLAIM_OFF (0). >Note RECLAIM_ZONE was obsoleted and I will send one patch to remove it >later. > >#define RECLAIM_OFF 0 >#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */ >#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ >#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */ > >When @node_reclaim_mode is set to (RECLAIM_WRITE | RECLAIM_UNMAP), >node_reclaim() >isn't called on the preferred node as the condition is false: >zone_allows_reclaim( >node-A, node-A). As I observed, the distance from node-A to node-A is 10, >equal to >RECLAIM_DISTANCE. > >static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >{ > return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < > RECLAIM_DISTANCE; >} > > >__alloc_pages_nodemask > get_page_from_freelist <- WATERMARK_LOW > zone_watermark_fast <- Assume the allocation is breaking > WATERMARK_LOW > node_reclaim <- @node_reclaim_node isn't 0 and > zone_allows_reclaim(preferred_zone, > current_zone) returns true > __node_reclaim <- SWAP, WRITEPAGE and UNMAP setting from > @node_reclaim_node > shrink_node > buffered_rmqueue > __alloc_pages_slowpath > get_page_from_freelist <- WATERMARK_MIN > __alloc_pages_direct_compact <- If it's costly allocation (order > 3) > wake_all_kswapds > get_page_from_freelist <- NO_WATERMARK, CPU local node is set to > preferred one > __alloc_pages_direct_reclaim > __perform_reclaim > try_to_free_pages <- WRITEPAGE + UNMAP + SWAP > do_try_to_free_pages > shrink_zones <- Stop until priority (12) reaches to 0 > or reclaimed enough > shrink_node > __alloc_pages_direct_compact > > >Also, RECLAIM_DISTANCE is set to 30 in include/linux/topology.h. It's used >when arch >doesn't provide one. It's why I set this macro to 30 in this patch. This issue >is >introduced by commit 5f7a75acdb2 ("mm: page_alloc: do not cache reclaim >distances"). >In the patch, it had wrong replacement. So I would correct the wrong >replacement >alternatively. Or both of them. Which way do you think is the best? Maybe Mel >also >has thoughts. > > 39 static bool zone_allows_reclaim(struct zone *local_zone, struct zone > *zone) > 40 { > 41 - return node_isset(local_zone->node, > zone->zone_pgdat->reclaim_nodes); > 42 -} > 43 - > 44 -static void __paginginit init_zone_allows_reclaim(int nid) > 45 -{ > 46 - int i; > 47 - > 48 - for_each_node_state(i, N_MEMORY) > 49 - if (node_distance(nid, i) <= RECLAIM_DISTANCE) > 50 - node_set(i, NODE_DATA(nid)->reclaim_nodes); > 51 + return node_distance(zone_to_nid(local_zone), > zone_to_nid(zone)) < > 52 + RECLAIM_DISTANCE; > 53 } >
Sorry, to make it clear. The patch replaced "<=" with "<" wrongly :) Thanks, Gavin