memory types that are not
>> > initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>
> Signed-off-by: Ho-Ren (Jack) Chuang
> Signed-off-by: Hao Xiang
> Reviewed-by: "Huang, Ying"
> ---
> mm/memory-tiers.c | 94 +++
> 1 file chan
"node_memory_types[nid].memtype"
will be !NULL. And it's possible (in theory) that some nodes become
"node_state(nid, N_CPU) == true" between memory_tier_init() and
memory_tier_late_init().
Otherwise, looks good to me. Feel free to add
Reviewed-by: "Huang, Ying"
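To make the ordering concern concrete, here is a userspace toy model (not the kernel code; memtype_of[], node_has_cpu[] and both helpers are invented for illustration). The late pass keys off "has a memory type already been assigned?" rather than "does the node have a CPU?", so a node whose CPU state changes between the two passes is neither skipped nor set up twice:

#include <stdio.h>
#include <stdbool.h>

#define MAX_NODES 4

static int memtype_of[MAX_NODES];	/* 0 == no memory type assigned yet */
static bool node_has_cpu[MAX_NODES] = { true, false, true, false };

static void early_init(void)		/* models the early init pass */
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (!node_has_cpu[nid])
			continue;	/* defer CPU-less nodes */
		memtype_of[nid] = 1;	/* "default DRAM" type */
	}
}

static void late_init(void)		/* models the late init pass */
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (memtype_of[nid])
			continue;	/* already set up, regardless of CPU state */
		memtype_of[nid] = 2;	/* fallback/default type */
	}
}

int main(void)
{
	early_init();
	node_has_cpu[3] = true;		/* node gains a CPU between the passes */
	late_init();
	for (int nid = 0; nid < MAX_NODES; nid++)
		printf("node %d -> type %d\n", nid, memtype_of[nid]);
	return 0;
}

The real code would of course key off node_memory_types[nid].memtype and the node state machinery instead of these toy arrays.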
> &default_memory_types);
> if (IS_ERR(default_dram_type))
> panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -868,6 +919,14 @@ static int __init memory_tier_init(void)
> * types assigned.
> */
> for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> + * Defer memory tier initialization on CPUless numa nodes.
> + * These will be initialized after firmware and devices are
> + * initialized.
> + */
> + continue;
> +
> memtier = set_node_memory_tier(node);
> if (IS_ERR(memtier))
> /*
--
Best Regards,
Huang, Ying
"Ho-Ren (Jack) Chuang" writes:
> On Fri, Mar 22, 2024 at 1:41 AM Huang, Ying wrote:
>>
>> "Ho-Ren (Jack) Chuang" writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 mem
_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> + &default_memory_types);
> if (IS_ERR(default_dram_type))
> panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -868,6 +913,14 @@ static int __init memory_tier_init(void)
> * types assigned.
> */
> for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> + * Defer memory tier initialization on CPUless numa nodes.
> + * These will be initialized after firmware and devices are
> + * initialized.
> + */
> + continue;
> +
> memtier = set_node_memory_tier(node);
> if (IS_ERR(memtier))
> /*
--
Best Regards,
Huang, Ying
* For now we can have 4 faster memory tiers with smaller adistance
> * than default DRAM tier.
> */
> - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> + &default_memory_types);
> if (IS_ERR(default_dram_type))
> panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -836,6 +908,14 @@ static int __init memory_tier_init(void)
> * types assigned.
> */
> for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> + * Defer memory tier initialization on CPUless numa nodes.
> + * These will be initialized after firmware and devices are
> + * initialized.
> + */
> + continue;
> +
> memtier = set_node_memory_tier(node);
> if (IS_ERR(memtier))
> /*
--
Best Regards,
Huang, Ying
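For readers following along, the change from alloc_memory_type() to mt_find_alloc_memory_type(..., &default_memory_types) presumably boils down to a find-or-allocate lookup keyed by abstract distance. A minimal userspace sketch of that idea (the struct, the refcount field and find_alloc_memtype() are invented here; this is not the kernel implementation):

#include <stdio.h>
#include <stdlib.h>

struct memory_dev_type {
	int adistance;
	int refcount;
	struct memory_dev_type *next;
};

/* Return an existing type with the same abstract distance, or allocate one. */
static struct memory_dev_type *find_alloc_memtype(int adistance,
						  struct memory_dev_type **list)
{
	struct memory_dev_type *t;

	for (t = *list; t; t = t->next) {
		if (t->adistance == adistance) {
			t->refcount++;
			return t;
		}
	}
	t = calloc(1, sizeof(*t));
	if (!t)
		return NULL;
	t->adistance = adistance;
	t->refcount = 1;
	t->next = *list;
	*list = t;
	return t;
}

int main(void)
{
	struct memory_dev_type *list = NULL;
	struct memory_dev_type *a = find_alloc_memtype(512, &list);
	struct memory_dev_type *b = find_alloc_memtype(512, &list);

	printf("same object: %d, refcount: %d\n", a == b, a->refcount);
	return 0;
}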
"Ho-Ren (Jack) Chuang" writes:
> On Tue, Mar 12, 2024 at 2:21 AM Huang, Ying wrote:
>>
>> "Ho-Ren (Jack) Chuang" writes:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 mem
> default_dram_perf.write_latency) *
> (default_dram_perf.read_bandwidth +
> default_dram_perf.write_bandwidth) /
> (perf->read_bandwidth + perf->write_bandwidth);
> - mutex_unlock(&memory_tier_lock);
> + mutex_unlock(&mt_perf_lock);
>
> return 0;
> }
> @@ -836,6 +890,14 @@ static int __init memory_tier_init(void)
> * types assigned.
> */
> for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> + * Defer memory tier initialization on CPUless numa nodes.
> + * These will be initialized when HMAT information is
HMAT is platform specific; we should avoid mentioning it in general code
if possible.
> + * available.
> + */
> + continue;
> +
> memtier = set_node_memory_tier(node);
> if (IS_ERR(memtier))
> /*
--
Best Regards,
Huang, Ying
i.e. no memmap_on_memory semantics, to
> preserve legacy behavior. For dax devices via CXL, the default is on.
> The sysfs control allows the administrator to override the above
> defaults if needed.
>
> Cc: David Hildenbrand
> Cc: Dan Williams
> Cc: Dave Jiang
> Cc: Dave
"Verma, Vishal L" writes:
> On Tue, 2023-12-12 at 08:30 +0800, Huang, Ying wrote:
>> Vishal Verma writes:
>>
>> > Add a sysfs knob for dax devices to control the memmap_on_memory setting
>> > if the dax device were to be hotplugged as system mem
ernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/
>
> Cc: Andrew Morton
> Cc: David Hildenbrand
> Cc: Michal Hocko
> Cc: Oscar Salvador
> Cc: Dan Williams
> Cc: Dave Jiang
> Cc: Dave Hansen
> Cc: Huang Ying
> Suggested-by: David Hildenbrand
s via CXL. For non-CXL dax regions, retain the existing
> default behavior of hot adding without memmap_on_memory semantics.
>
> Cc: Andrew Morton
> Cc: David Hildenbrand
> Cc: Michal Hocko
> Cc: Oscar Salvador
> Cc: Dan Williams
> Cc: Dave Jiang
> Cc: Dave Hans
"Verma, Vishal L" writes:
> On Tue, 2023-10-17 at 13:18 +0800, Huang, Ying wrote:
>> "Verma, Vishal L" writes:
>>
>> > On Thu, 2023-10-05 at 14:16 -0700, Dan Williams wrote:
>> > > Vishal Verma wrote:
>> > > &g
if (!dax_region->dev->driver) {
>>
>> Is the polarity backwards here? I.e. if the device is already attached to
>> the kmem driver it is too late to modify memmap_on_memory policy.
>
> Hm this sounded logical until I tried it. After a reconfigure-device to
> devda
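The semantics being debated here -- "the policy may only be changed while the device is not yet handed to a driver" -- can be sketched with a userspace toy. struct toy_dev and store_memmap_on_memory() are invented names; whether the real check's polarity matches this is exactly the open question above:

#include <stdio.h>
#include <errno.h>
#include <stdbool.h>

struct toy_dev {
	const void *driver;		/* non-NULL once bound to a driver */
	bool memmap_on_memory;
};

static int store_memmap_on_memory(struct toy_dev *dev, bool val)
{
	if (dev->driver)		/* already handed to the driver: too late */
		return -EBUSY;
	dev->memmap_on_memory = val;
	return 0;
}

int main(void)
{
	struct toy_dev d = { .driver = NULL, .memmap_on_memory = false };

	printf("unbound write -> %d\n", store_memmap_on_memory(&d, true));
	d.driver = &d;			/* pretend the kmem driver attached */
	printf("bound write   -> %d\n", store_memmap_on_memory(&d, true));
	return 0;
}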
ved might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton
> Cc: David Hildenbrand
> Cc: Michal Hocko
> Cc: Oscar Salvador
> Cc: Dan Williams
> Cc: Dave Jiang
> Cc: Dave Hansen
> Cc: Huang Ying
>
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> "Huang, Ying" writes:
>>>
>>>> Hi, Alistair,
>>>>
>>>> Sorry for late response. Just come back from vacation.
>
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> Huang Ying writes:
>>>
>>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>>>> used for slow memory type in km
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> Huang Ying writes:
>>>
>>>> A memory tiering abstract distance calculation algorithm based on ACPI
>>>> HMAT is implemented. The ba
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Hi, Alistair,
>>
>> Sorry for late response. Just come back from vacation.
>
> Ditto for this response :-)
>
> I see Andrew has taken this into mm-unstable though, so my bad for not
&g
"Verma, Vishal L" writes:
> On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote:
>> Vishal Verma writes:
>>
>> >
>> > @@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
>> > }
>> > EXPORT_SYMBOL(try_offline_node);
>
"Verma, Vishal L" writes:
> On Mon, 2023-07-24 at 11:16 +0800, Huang, Ying wrote:
>> "Aneesh Kumar K.V" writes:
>> >
>> > > @@ -1339,27 +1367,20 @@ int __ref add_memory_resource(int nid,
>> > > struct resource *res, mhp_t mhp_flags
and latency numbers reported by HMAT here, but FWIW, this patchset
> puts the CXL nodes on a lower tier than DRAM nodes.
Thank you very much!
Can I add your "Tested-by" for the series?
--
Best Regards,
Huang, Ying
Hi, Alistair,
Sorry for late response. Just come back from vacation.
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> "Huang, Ying" writes:
>>>
>>>> Alistair Popple writes:
>>
Hi, Jonathan,
Thanks for review!
Jonathan Cameron writes:
> On Fri, 21 Jul 2023 09:29:30 +0800
> Huang Ying wrote:
>
>> Previously, in hmat_register_target_initiators(), the performance
>> attributes are calculated and the corresponding sysfs links and files
>>
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Alistair Popple writes:
>>
>>> "Huang, Ying" writes:
>>>
>>>>>> And, I don't think that we are forced to use the general notifier
>>>>>
Alistair Popple writes:
> "Huang, Ying" writes:
>
>>>> The other way (suggested by this series) is to make dax/kmem call a
>>>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>>>> device and calculate the distance if th
Alistair Popple writes:
> "Huang, Ying" writes:
>
>> Hi, Alistair,
>>
>> Thanks a lot for comments!
>>
>> Alistair Popple writes:
>>
>>> Huang Ying writes:
>>>
>>>> The abstract distance may be calculated b
Alistair Popple writes:
> Huang Ying writes:
>
>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>> used for slow memory type in kmem driver. This limits the usage of
>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>>
Alistair Popple writes:
> Huang Ying writes:
>
>> A memory tiering abstract distance calculation algorithm based on ACPI
>> HMAT is implemented. The basic idea is as follows.
>>
>> The performance attributes of system default DRAM nodes are recorded
>&g
Hi, Alistair,
Thanks a lot for comments!
Alistair Popple writes:
> Huang Ying writes:
>
>> The abstract distance may be calculated by various drivers, such as
>> ACPI HMAT, CXL CDAT, etc. While it may be used by various code which
>> hot-add memory node, such as da
t; Cc: Oscar Salvador
> Cc: Dan Williams
> Cc: Dave Jiang
> Cc: Dave Hansen
> Cc: Huang Ying
> Reviewed-by: David Hildenbrand
> Signed-off-by: Vishal Verma
> ---
> include/linux/memory_hotplug.h | 5 +
> mm/memory_hotplug.c| 1 +
> 2 files chan
being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton
> Cc: David Hildenbrand
> Cc: Oscar Salvador
> Cc: Dan Williams
> Cc: Dave Jiang
> Cc: Dave Hansen
> Cc: Huang Ying
> Sug
conditions for
>> it are met. Teach try_remove_memory() to also expect that a memory
>> range being removed might have been split up into memblock sized chunks,
>> and to loop through those as needed.
>>
>> Cc: Andrew Morton
>> Cc: David Hildenbrand
>> Cc:
k.
Signed-off-by: "Huang, Ying"
Cc: Aneesh Kumar K.V
Cc: Wei Xu
Cc: Alistair Popple
Cc: Dan Williams
Cc: Dave Hansen
Cc: Davidlohr Bueso
Cc: Johannes Weiner
Cc: Jonathan Cameron
Cc: Michal Hocko
Cc: Yang Shi
Cc: Rafael J Wysocki
---
drivers/dax/k
distance of a memory node (target) to
MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
attributes of the node to that of the default DRAM nodes.
Signed-off-by: "Huang, Ying"
Cc: Aneesh Kumar K.V
Cc: Wei Xu
Cc: Alistair Popple
Cc: Dan Williams
Cc: Dave Hansen
Cc:
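Reading the shape of the expression quoted earlier in the thread, the scaling is roughly "the DRAM abstract distance, multiplied by the latency ratio and divided by the bandwidth ratio". A standalone restatement with made-up numbers (the field names and the 512 baseline are placeholders, not the kernel's constants):

#include <stdio.h>

struct perf {
	double read_lat, write_lat;	/* e.g. ns */
	double read_bw, write_bw;	/* e.g. GB/s */
};

/* Slower latency and lower bandwidth than DRAM => larger abstract distance. */
static double scale_adistance(double adist_dram, struct perf dram, struct perf node)
{
	return adist_dram *
	       (node.read_lat + node.write_lat) / (dram.read_lat + dram.write_lat) *
	       (dram.read_bw + dram.write_bw) / (node.read_bw + node.write_bw);
}

int main(void)
{
	struct perf dram = { 100, 100, 200, 200 };	/* made-up baseline */
	struct perf slow = { 300, 300, 100, 100 };	/* made-up slow node */

	printf("node adistance = %.0f (baseline 512)\n",
	       scale_adistance(512, dram, slow));
	return 0;
}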
specified via
priority (notifier_block.priority).
Signed-off-by: "Huang, Ying"
Cc: Aneesh Kumar K.V
Cc: Wei Xu
Cc: Alistair Popple
Cc: Dan Williams
Cc: Dave Hansen
Cc: Davidlohr Bueso
Cc: Johannes Weiner
Cc: Jonathan Cameron
Cc: Michal Hocko
Cc: Yang Shi
Cc: Rafael J Wysocki
--
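As a rough illustration of the registration scheme described above -- callbacks ordered by a priority field, each getting a chance to fill in the abstract distance -- here is a self-contained toy chain. It deliberately does not use the kernel's notifier API; all names, priorities and values are invented:

#include <stdio.h>

struct toy_notifier {
	int priority;				/* higher runs first */
	int (*call)(int nid, int *adist);
	struct toy_notifier *next;
};

static struct toy_notifier *chain;

static void toy_register(struct toy_notifier *nb)
{
	struct toy_notifier **p = &chain;

	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
}

static void toy_call_chain(int nid, int *adist)
{
	for (struct toy_notifier *nb = chain; nb; nb = nb->next)
		nb->call(nid, adist);
}

/* e.g. a firmware-table-backed callback and a device-table-backed callback */
static int fw_cb(int nid, int *adist)  { *adist = 640; return 0; }
static int dev_cb(int nid, int *adist) { if (!*adist) *adist = 700; return 0; }

int main(void)
{
	struct toy_notifier fw  = { .priority = 100, .call = fw_cb };
	struct toy_notifier dev = { .priority = 50,  .call = dev_cb };
	int adist = 0;

	toy_register(&dev);
	toy_register(&fw);
	toy_call_chain(0, &adist);
	printf("adist = %d\n", adist);
	return 0;
}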
calculate the performance attributes for
a memory target without creating sysfs links and files.
To do that, hmat_register_target_initiators() is refactored to make it
possible to calculate performance attributes separately.
Signed-off-by: "Huang, Ying"
Cc: Aneesh Kumar K.V
Cc: Wei Xu
Cc
Optane DCPMM.
Changelog:
V1 (from RFC):
- Added some comments per Aneesh's comments, Thanks!
Best Regards,
Huang, Ying
Miaohe Lin writes:
> It appears that destroy_memory_type() isn't a very good name because
> we usually will not free the memory_type here. So rename it to a more
> appropriate name i.e. put_memory_type().
>
> Suggested-by: Huang, Ying
> Signed-off-by: Miaohe Lin
LGTM.
ally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki"
> Cc: Len Brown
> Cc: Andrew Morton
> Cc: David Hildenbrand
> Cc: Oscar Salvador
> Cc: Dan Williams
> Cc: D
; Cc: "Rafael J. Wysocki"
> Cc: Len Brown
> Cc: Andrew Morton
> Cc: David Hildenbrand
> Cc: Oscar Salvador
> Cc: Dan Williams
> Cc: Dave Jiang
> Cc: Dave Hansen
> Cc: Huang Ying
> Signed-off-by: Vishal Verma
> ---
> include/linux/memory_hotplug.h
wap_info_struct *p,
> int prio,
>
> static void _enable_swap_info(struct swap_info_struct *p)
> {
> - p->flags |= SWP_WRITEOK | SWP_VALID;
> + p->flags |= SWP_WRITEOK;
> atomic_long_add(p->pages, &nr_swap_pages);
> total_swap_pages +=
.vma = &pvma,
> };
>
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(swap);
Better to put get/put_swap_device() in shmem_swapin_page(); that makes it
possible for us to remove get/put_swap_device() in lookup_swap_cache().
Best Regards,
Huang, Ying
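The suggested placement amounts to the usual guard pattern: take the reference before touching the swap device, bail out if swapoff has already begun, and drop the reference on every exit path. A userspace toy model of that shape (toy_get_swap_device()/toy_put_swap_device() and the struct are stand-ins, not the kernel functions):

#include <stdio.h>
#include <stdbool.h>

struct toy_swapdev {
	bool online;	/* cleared by "swapoff" */
	int users;	/* references held by in-flight swap-ins */
};

static struct toy_swapdev *toy_get_swap_device(struct toy_swapdev *si)
{
	if (!si->online)
		return NULL;	/* swapoff already started: caller must bail out */
	si->users++;
	return si;
}

static void toy_put_swap_device(struct toy_swapdev *si)
{
	si->users--;
}

/* Shape of the swap-in path: take the reference up front, drop it on exit. */
static int toy_swapin(struct toy_swapdev *dev)
{
	struct toy_swapdev *si = toy_get_swap_device(dev);

	if (!si)
		return -1;
	/* ... look up / read the page while the device cannot go away ... */
	toy_put_swap_device(si);
	return 0;
}

int main(void)
{
	struct toy_swapdev dev = { .online = true };

	printf("while online: %d\n", toy_swapin(&dev));
	dev.online = false;	/* a concurrent swapoff in the real code */
	printf("after swapoff: %d\n", toy_swapin(&dev));
	return 0;
}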
cted.
The race isn't important because it will not cause problems.
Best Regards,
Huang, Ying
> But the swap_entry
> isn't used in this function and we will have enough checking when we really
> operate the PTE entries later. So checking for non_swap_entry() is not
> really
sually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window (as suggested by Huang, Ying).
This needs to be revised too. Unless you squash 1/4 and 2/4.
> Fixes: 0b
Miaohe Lin writes:
> On 2021/4/19 15:09, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> On 2021/4/19 10:48, Huang, Ying wrote:
>>>> Miaohe Lin writes:
>>>>
>>>>> We will use percpu-refcount to serialize against concurrent
Miaohe Lin writes:
> On 2021/4/19 15:04, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> On 2021/4/19 10:15, Huang, Ying wrote:
>>>> Miaohe Lin writes:
>>>>
>>>>> When I was investigating the swap code,
Miaohe Lin writes:
> On 2021/4/19 10:48, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>> patch adds the percpu_ref support for swap.
>>>
>>> Signed-off-by:
Miaohe Lin writes:
> On 2021/4/19 10:15, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CP
; atomic_long_add(p->pages, &nr_swap_pages);
> total_swap_pages += p->pages;
>
> @@ -2507,7 +2504,7 @@ static void enable_swap_info(struct swap_info_struct
> *p, int prio,
> spin_unlock(&swap_lock);
> /*
>* Guarantee swap_
> *si)
> SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> {
> struct swap_info_struct *p;
> - struct filename *name;
> + struct filename *name = NULL;
> struct file *swap_file = NULL;
> struct address_space *mapping;
> int p
f is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window (as suggested by Huang, Ying).
I still suggest squashing PATCH 1-3, or at least PATCH 1-2. That will
chan
node *inode = si->swap_file->f_mapping->host;[oops!]
>
> Close this race window by using get/put_swap_device() to guard against
> concurrent swapoff.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
No. This isn't the commit that introduces the race cond
ia git-blame to find
it out.
The patch itself looks good to me.
Best Regards,
Huang, Ying
> Signed-off-by: Miaohe Lin
> ---
> mm/swap_state.c | 6 --
> 1 file changed, 6 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 272ea2108c9d..df5405384520
Miaohe Lin writes:
> On 2021/4/15 22:31, Dennis Zhou wrote:
>> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>>> Dennis Zhou writes:
>>>
>>>> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>>>>> Dennis Zhou w
Kent Overstreet writes:
> On Thu, Apr 15, 2021 at 09:42:56PM -0700, Paul E. McKenney wrote:
>> On Tue, Apr 13, 2021 at 10:47:03AM +0800, Huang Ying wrote:
>> > One typical use case of percpu_ref_tryget() family functions is as
>> > follows,
>> >
Dennis Zhou writes:
> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>> Dennis Zhou writes:
>>
>> > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>> >> Dennis Zhou writes:
>> >>
>> >> > On Wed, Apr 1
a to combine the page table scanning
and rmap scanning in the page reclaiming. For example, if the
working-set is transitioned, we can take advantage of the fast page
table scanning to identify the new working-set quickly, while we can
fall back to the rmap scanning if the page table scanning doesn't help.
Best Regards,
Huang, Ying
"Zi Yan" writes:
> On 13 Apr 2021, at 23:00, Huang, Ying wrote:
>
>> Yang Shi writes:
>>
>>> The generic migration path will check refcount, so no need check refcount
>>> here.
>>> But the old code actually prevents from migrating shared T
Dennis Zhou writes:
> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>> Dennis Zhou writes:
>>
>> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> >> Dennis Zhou writes:
>> >>
>> >> > Hello,
>&g
Yu Zhao writes:
> On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying wrote:
>>
>> Yu Zhao writes:
>>
>> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel wrote:
>> >>
>> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
>> >
pressure?), and if we use the rmap, we need to
> scan a lot of pages anyway. Why not just scan them all?
This may not be the case. For rmap scanning, it's possible to scan only
a small portion of memory. But with the page table scanning, you need
to scan almost all (I understand you have som
Dennis Zhou writes:
> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> Dennis Zhou writes:
>>
>> > Hello,
>> >
>> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> >> Miaohe Lin writes:
>> >>
Dennis Zhou writes:
> Hello,
>
> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> Miaohe Lin writes:
>> >>
>> >>>
Miaohe Lin writes:
> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>>
us from migrating shared THP? If no, why not just remove
the old refcount checking?
Best Regards,
Huang, Ying
> Signed-off-by: Yang Shi
> ---
> mm/migrate.c | 16
> 1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.
b/mm/huge_memory.c
> @@ -1418,93 +1418,21 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> pmd_t pmd = vmf->orig_pmd;
> - struct anon_vma *anon_vma = NULL;
> + pmd_t oldpmd;
nit: the usage of oldpm
Miaohe Lin writes:
> On 2021/4/14 9:17, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>> "Huang, Ying" writes:
>>>>
>>>>> Miaohe Lin writes:
>>>>>
>>>&g
Miaohe Lin writes:
> On 2021/4/12 15:24, Huang, Ying wrote:
>> "Huang, Ying" writes:
>>
>>> Miaohe Lin writes:
>>>
>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>> patch adds the perc
Tim Chen writes:
> On 4/12/21 6:27 PM, Huang, Ying wrote:
>
>>
>> This isn't the commit that introduces the race. You can use `git blame`
>> find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of
Yu Zhao writes:
> On Fri, Mar 26, 2021 at 12:21 AM Huang, Ying wrote:
>>
>> Mel Gorman writes:
>>
>> > On Thu, Mar 25, 2021 at 12:33:45PM +0800, Huang, Ying wrote:
>> >> > I caution against this patch.
>> >> >
>> >&g
Yu Zhao writes:
> On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying wrote:
>>
>> Yu Zhao writes:
>>
>> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> >> Yu Zhao writes:
>> >>
>> >> > On Wed, Mar 17,
re are
> many single-page VMAs, i.e., not returning to the PGD table for each
> of such VMAs. Just a heads-up.
>
> The rmap, on the other hand, had to
> 1) lock each (shmem) page it scans
> 2) go through five levels of page tables for each page, even though
> some of them have the same LCAs
> during the test. The second part is worse given that I have 5 levels
> of page tables configured.
>
> Any additional benchmarks you would suggest? Thanks.
Hi, Yu,
Thanks for your data.
In addition to the data your measured above, is it possible for you to
measure some raw data? For example, how many CPU cycles does it take to
scan all pages in the system? For the page table scanning, the page
tables of all processes will be scanned. For the rmap scanning, all
pages in the LRU will be scanned. And we can do that with different
parameters, for example, shared vs. non-shared, sparse vs. dense. Then
we can get an idea about how fast the page table scanning can be.
Best Regards,
Huang, Ying
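The measurement being asked for ultimately needs instrumentation in the kernel scanners themselves, but a crude userspace proxy gives the flavor: time how long it takes to touch one byte in every page of a region, and vary the size or density. Everything below is an invented harness, not a substitute for the requested numbers:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE 4096

int main(int argc, char **argv)
{
	long npages = argc > 1 ? atol(argv[1]) : (1L << 16);	/* 256 MB default */
	unsigned char *mem = calloc(npages, PAGE_SIZE);
	struct timespec t0, t1;
	long sum = 0;

	if (!mem)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < npages; i++)	/* "scan": read one byte per page */
		sum += mem[i * PAGE_SIZE];
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("touched %ld pages in %.0f ns (%.1f ns/page, checksum %ld)\n",
	       npages, ns, ns / npages, sum);
	free(mem);
	return 0;
}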
gotten from the
other fields may be invalid or inconsistent. To guarantee the correct
memory ordering, percpu_ref_tryget*() needs to be an ACQUIRE
operation.
This function implements that by using smp_load_acquire() in
__ref_is_percpu() to read the percpu pointer.
Signed-off-by: "Huang, Ying"
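The pairing described above is the standard release/acquire publication pattern: the writer initializes the other fields first and then publishes with a release store; a reader that observes the published value through an acquire load is guaranteed to see the fields as initialized. A small self-contained analogue using C11 atomics and pthreads (compile with -pthread; this models the idea, not percpu_ref itself):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;		/* stands for "the other fields" */
static _Atomic int ready;	/* stands for the pointer/flag being published */

static void *writer(void *arg)
{
	payload = 42;						/* initialize fields */
	atomic_store_explicit(&ready, 1, memory_order_release);	/* publish */
	return NULL;
}

static void *reader(void *arg)
{
	/* The load-acquire pairs with the store-release: once we observe
	 * ready == 1, we are also guaranteed to observe payload == 42. */
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;
	printf("payload = %d\n", payload);
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}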
If tier 0 memory used by the cgroup exceeds
> this high boundary, allocation of tier 0 memory by the cgroup will
> be throttled. The tier 0 memory used by this cgroup
> will also be subjected to heavy demotion.
I thi
o_swap_page() has been fixed. We need to fix shmem_swapin().
Best Regards,
Huang, Ying
> Signed-off-by: Miaohe Lin
> ---
> mm/swap_state.c | 11 +--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3bf0d0
unnecessary. The only caller has already guaranteed that the swap device
won't be swapped off.
Best Regards,
Huang, Ying
> ---
> mm/swap_state.c | 9 ++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 272ea2108c9d..709c260d644a 100
e overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window(as suggested by Huang, Ying).
>
> Fixes: 235b62176712 ("mm/swap: add cluster lock")
This isn't the commit that introduces the race. You can use `git blame`
"Huang, Ying" writes:
> Miaohe Lin writes:
>
>> We will use percpu-refcount to serialize against concurrent swapoff. This
>> patch adds the percpu_ref support for later fixup.
>>
>> Signed-off-by: Miaohe Lin
>> ---
>> includ
L);
> + if (unlikely(error))
> + goto bad_swap;
> +
> name = getname(specialfile);
> if (IS_ERR(name)) {
> error = PTR_ERR(name);
> @@ -3356,6 +3374,7 @@ SYSCALL_DEFINE2(swapon, const char __user *,
> specialfile, int, swap_flags)
> bad_swap_unlock_inode:
> inode_unlock(inode);
> bad_swap:
> + percpu_ref_exit(&p->users);
Usually the resource freeing order is the reverse of the allocation
order. So, if there's no special reason, please follow that rule.
Best Regards,
Huang, Ying
> free_percpu(p->percpu_cluster);
> p->percpu_cluster = NULL;
> free_percpu(p->cluster_next_cpu);
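The rule being asked for is the usual goto-unwind shape: error labels release resources in the reverse order of their allocation. A minimal sketch of that shape (plain malloc/free stand in for percpu_ref_exit(), free_percpu() and friends):

#include <stdlib.h>

struct ctx {
	void *a, *b, *c;	/* allocated in the order a, b, c */
};

static int setup(struct ctx *ctx)
{
	ctx->a = malloc(16);
	if (!ctx->a)
		goto fail_a;
	ctx->b = malloc(16);
	if (!ctx->b)
		goto fail_b;
	ctx->c = malloc(16);
	if (!ctx->c)
		goto fail_c;
	return 0;

fail_c:
	free(ctx->b);		/* b came after a, so it is released first */
fail_b:
	free(ctx->a);
fail_a:
	return -1;
}

int main(void)
{
	struct ctx ctx = { 0 };

	if (setup(&ctx))
		return 1;
	/* normal teardown mirrors the unwind: reverse order again */
	free(ctx.c);
	free(ctx.b);
	free(ctx.a);
	return 0;
}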
..
>>p->swap_file
>> = NULL;
>> struct file *swap_file = sis->swap_file;
>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>> ...
>> ...
>>
>
> Agree. This is also what I meant to illustrate. And you provide a better one.
> Many thanks!
For the pages that are swapped in through the swap cache, that isn't an
issue. Because the page is locked, the swap entry will be marked with
SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.
So the race is for the fast path as follows,
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1)
I found it in your original patch description. But please make it more
explicit to reduce the potential confusion.
Best Regards,
Huang, Ying
Miaohe Lin writes:
> On 2021/4/9 16:50, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> While we released the pte lock, somebody else might faulted in this pte.
>>> So we should check whether it's swap pte first to guard against such race
>>> o
't a real issue. entry or swap_entry isn't used in this
function. And we have enough checking when we really operate the PTE
entries later. But I admit it's confusing. So I suggest just removing
the checking. We will check it when necessary.
Best Regards,
Huang, Ying
otential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.
Per my understanding, memory.low/min are for memory protection
instead of memory limiting. memory.high is for memory limiting.
Best Regards,
Huang, Ying
Mel Gorman writes:
> On Fri, Apr 02, 2021 at 04:27:17PM +0800, Huang Ying wrote:
>> With NUMA balancing, in hint page fault handler, the faulting page
>> will be migrated to the accessing node if necessary. During the
>> migration, TLB will be shot down on all CPUs tha
to ~9.7e5) with about 9.2e6
pages (35.8GB) migrated. From the perf profile, it can be found that
the CPU cycles spent by try_to_unmap() and its callees reduces from
6.02% to 0.47%. That is, the CPU cycles spent by TLB shooting down
decreases greatly.
Signed-off-by: "Huang, Ying"
Re
Mel Gorman writes:
> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>> Mel Gorman writes:
>>
>> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>> >> For NUMA balancing, in hint page fault handler, the faulting page will
>&g
Mel Gorman writes:
> On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>> For NUMA balancing, in hint page fault handler, the faulting page will
>> be migrated to the accessing node if necessary. During the migration,
>> TLB will be shot down on all CPUs that t
Yu Zhao writes:
> On Mon, Mar 29, 2021 at 9:44 PM Huang, Ying wrote:
>>
>> Miaohe Lin writes:
>>
>> > On 2021/3/30 9:57, Huang, Ying wrote:
>> >> Hi, Miaohe,
>> >>
>> >> Miaohe Lin writes:
>> >>
>> >
Miaohe Lin writes:
> On 2021/3/30 9:57, Huang, Ying wrote:
>> Hi, Miaohe,
>>
>> Miaohe Lin writes:
>>
>>> Hi all,
>>> I am investigating the swap code, and I found the below possible race
>>> window:
>>>
>&g
would be really grateful. Thanks! :)
This appears possible. Even for the swapcache case, we can't guarantee that the
swap entry gotten from the page table is always valid. The
underlying swap device can be swapped off at the same time. So we use
get/put_swap_device() for that. Maybe we need similar stuff here.
Best Regards,
Huang, Ying