Kernel 3.6-rc3 BUG at mm/slab.c:2629

2012-08-28 Thread Haggai Eran
Hi,

I believe I have encountered a bug in kernel 3.6-rc3. It starts with the
assertion in mm/slab.c:2629 failing, and then the system hangs. I can
reproduce this bug by running a large compilation (compiling the kernel
for instance).

Here's what I see in netconsole:
> [ cut here ]
> kernel BUG at mm/slab.c:2629!
> invalid opcode:  [#1] SMP DEBUG_PAGEALLOC

I'm attaching netconsole logs I got with kernel 3.6-rc1, which contain a
little more detail after the crash, but for some reason netconsole
didn't capture the full stack trace of the assertion. I caught a glimpse
of the console and saw that RIP was at cache_alloc_refill.

I'd be happy to provide further information or perform testing to help
solve this issue.

Regards,
Haggai Eran
2012-Aug-20 12:34:44 -   [ cut here ]
2012-Aug-20 12:34:44 - kernel BUG at mm/slab.c:2629!
2012-Aug-20 12:34:44 - invalid opcode:  [#1] SMP DEBUG_PAGEALLOC
2012-Aug-20 12:36:13 - Modules linked in: netconsole configfs nfs3 nfs_acl nfs 
fscache lockd autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm 
ib_cm iw_cm ib_addr ipv6 ib_sa dm_mirror dm_region_hash dm_log dm_mod uinput 
iTCO_wdt iTCO_vendor_support dcdbas coretemp hwmon kvm_intel kvm crc32c_intel 
ghash_clmulni_intel serio_raw pcspkr sg ses enclosure lpc_ich mfd_core mlx4_ib 
ib_mad ib_core mlx4_en mlx4_core bnx2 ext3 jbd mbcache sr_mod cdrom sd_mod 
crc_t10dif aesni_intel ablk_helper cryptd aes_x86_64 aes_generic pata_acpiBUG: 
spinlock lockup suspected on CPU#8, fixdep/28208
2012-Aug-20 12:36:13 - BUG: spinlock lockup suspected on CPU#0, fixdep/28209
2012-Aug-20 12:36:13 -  lock: 0x88100f1536a8, .magic: dead4ead, .owner: 
fixdep/28201, .owner_cpu: 9
2012-Aug-20 12:36:13 - Pid: 28209, comm: fixdep Tainted: G  D W
3.6.0-rc1-vanilla #47
2012-Aug-20 12:36:13 - Call Trace:
2012-Aug-20 12:36:13 -  [] spin_dump+0x78/0xc0
2012-Aug-20 12:36:13 -  [] do_raw_spin_lock+0x133/0x140
2012-Aug-20 12:36:13 -  [] _raw_spin_lock+0x49/0x50
2012-Aug-20 12:36:13 -  [] ? cache_alloc_refill+0xad/0x3b0
2012-Aug-20 12:36:13 -  [] ? 
cache_alloc_debugcheck_after+0x140/0x270
2012-Aug-20 12:36:13 -  [] cache_alloc_refill+0xad/0x3b0
2012-Aug-20 12:36:13 -  [] ? 
__change_page_attr_set_clr+0x75f/0xbd0
2012-Aug-20 12:36:13 -  [] ? get_empty_filp+0x79/0x1c0
2012-Aug-20 12:36:13 -  [] kmem_cache_alloc+0x1d4/0x300
2012-Aug-20 12:36:13 -  [] get_empty_filp+0x79/0x1c0
2012-Aug-20 12:36:13 -  [] path_openat+0x48/0x4a0
2012-Aug-20 12:36:13 -  [] ? getname_flags+0x37/0x100
2012-Aug-20 12:36:13 -  [] do_filp_open+0x49/0xa0
2012-Aug-20 12:36:13 -  [] ? _raw_spin_unlock+0x2b/0x50
2012-Aug-20 12:36:13 -  [] ? alloc_fd+0x145/0x250
2012-Aug-20 12:36:13 -  [] do_sys_open+0x108/0x1e0
2012-Aug-20 12:36:13 -  [] sys_open+0x21/0x30
2012-Aug-20 12:36:13 -  [] system_call_fastpath+0x16/0x1b
2012-Aug-20 12:36:13 - BUG: spinlock lockup suspected on CPU#6, fixdep/28210
2012-Aug-20 12:36:13 -  lock: 0x88100f1536a8, .magic: dead4ead, .owner: 
fixdep/28201, .owner_cpu: 9
2012-Aug-20 12:36:13 - Pid: 28210, comm: fixdep Tainted: G  D W
3.6.0-rc1-vanilla #47
2012-Aug-20 12:36:13 - Call Trace:
2012-Aug-20 12:36:13 -  [] spin_dump+0x78/0xc0
2012-Aug-20 12:36:13 -  [] do_raw_spin_lock+0x133/0x140
2012-Aug-20 12:36:13 -  [] _raw_spin_lock+0x49/0x50
2012-Aug-20 12:36:13 -  [] ? cache_alloc_refill+0xad/0x3b0
2012-Aug-20 12:36:13 -  [] cache_alloc_refill+0xad/0x3b0
2012-Aug-20 12:36:13 -  [] ? 
__change_page_attr_set_clr+0x75f/0xbd0
2012-Aug-20 12:36:13 -  [] ? get_empty_filp+0x79/0x1c0
2012-Aug-20 12:36:13 -  [] kmem_cache_alloc+0x1d4/0x300
2012-Aug-20 12:36:13 -  [] get_empty_filp+0x79/0x1c0
2012-Aug-20 12:36:13 -  [] path_openat+0x48/0x4a0
2012-Aug-20 12:36:13 -  [] ? getname_flags+0x37/0x100
2012-Aug-20 12:36:13 -  [] do_filp_open+0x49/0xa0
2012-Aug-20 12:36:13 -  [] ? _raw_spin_unlock+0x2b/0x50
2012-Aug-20 12:36:13 -  [] ? alloc_fd+0x145/0x250
2012-Aug-20 12:36:13 -  [] do_sys_open+0x108/0x1e0
2012-Aug-20 12:36:13 -  [] sys_open+0x21/0x30
2012-Aug-20 12:36:13 -  [] system_call_fastpath+0x16/0x1b
2012-Aug-20 12:36:13 - BUG: spinlock lockup suspected on CPU#4, cc1/28166
2012-Aug-20 12:36:13 -  lock: 0x88100f1536a8, .magic: dead4ead, .owner: 
fixdep/28201, .owner_cpu: 9
2012-Aug-20 12:36:13 - Pid: 28166, comm: cc1 Tainted: G  D W
3.6.0-rc1-vanilla #47
2012-Aug-20 12:36:13 - Call Trace:
2012-Aug-20 12:36:13 -  [] spin_dump+0x78/0xc0
2012-Aug-20 12:36:13 -  [] do_raw_spin_lock+0x133/0x140
2012-Aug-20 12:36:13 -  [] _raw_spin_lock+0x49/0x50
2012-Aug-20 12:36:13 -  [] ? cache_alloc_refill+0xad/0x3b0
2012-Aug-20 12:36:13 -  [] cache_alloc_refill+0xad/0x3b0
2012-Aug-20 12:36:13 -  [] ? 
__change_page_attr_set_clr+0x75f/0xbd0
2012-Aug-20 12:36:13 -  [] ? get_empty_filp+0x79/0x1c0
2012-Aug-20 12:36:13 -  [] kmem_cache_alloc+0x1d4/0x300
2012-Aug-20 12:36:13 -  [] get_empty_filp+0x79/0x1c0
2012-Aug-20 12:36:13 -  [] path_openat+0x48/0x4a0
2012-Aug-20 12:36:

Re: [patch v3.6] mm, slab: lock the correct nodelist after reenabling irqs

2012-08-29 Thread Haggai Eran
On 29/08/2012 05:57, David Rientjes wrote:
> On Tue, 28 Aug 2012, Haggai Eran wrote:
>
>> Hi,
>>
>> I believe I have encountered a bug in kernel 3.6-rc3. It starts with the
>> assertion in mm/slab.c:2629 failing, and then the system hangs. I can
>> reproduce this bug by running a large compilation (compiling the kernel
>> for instance).
>>
>> Here's what I see in netconsole:
>>> [ cut here ]
>>> kernel BUG at mm/slab.c:2629!
>>> invalid opcode:  [#1] SMP DEBUG_PAGEALLOC
>> I'm attaching netconsole logs I got with kernel 3.6-rc1, which contain a
>> little more details after the crash, but for some reason netconsole
>> didn't capture the full stack trace of the assertion. I caught a glimpse
>> at the console and I saw RIP was at cache_alloc_refill.
>>
> It only gets called from cache_alloc_refill().
>
> Looks like a problem in 072bb0aa5e0 ("mm: sl[au]b: add knowledge of 
> PFMEMALLOC reserve pages").  cache_grow() can reenable irqs which allows 
> this to be scheduled on a different cpu, possibly with a different node.  
> So it turns out that we lock the wrong node's list_lock because we don't 
> check the new node id when irqs are disabled again.
>
> I doubt you can reliably reproduce this, but the following should fix the 
> issue.
Your patch did solve the issue. Thanks!
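
For reference, a minimal sketch of the idea described above, with the 3.6
slab internals paraphrased from memory (this is not the actual mm/slab.c
patch): the node id has to be re-read after interrupts are disabled again,
so that the correct nodelist's list_lock is taken.

static void refill_locks_correct_node(struct kmem_cache *cachep)
{
        int node = numa_mem_id();       /* node we were on when refill started */
        struct kmem_list3 *l3;

        /*
         * cache_grow() runs here; it can re-enable irqs, letting the task
         * be rescheduled on a CPU that belongs to a different node.
         */

        node = numa_mem_id();           /* must be re-read once irqs are off again */
        l3 = cachep->nodelists[node];   /* lock the *current* node's list */
        spin_lock(&l3->list_lock);
        /* ... refill the per-cpu array from this node's slabs ... */
        spin_unlock(&l3->list_lock);
}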


Re: [patch] mm: fix build warning for uninitialized value

2012-11-06 Thread Haggai Eran
On 05/11/2012 23:36, David Rientjes wrote:
> do_wp_page() sets mmun_called if mmun_start and mmun_end were initialized 
> and, if so, may call mmu_notifier_invalidate_range_end() with these 
> values.  This doesn't prevent gcc from emitting a build warning though:
> 
> mm/memory.c: In function ‘do_wp_page’:
> mm/memory.c:2530: warning: ‘mmun_start’ may be used uninitialized in this 
> function
> mm/memory.c:2531: warning: ‘mmun_end’ may be used uninitialized in this 
> function

I haven't seen these warnings. Perhaps I used a different compiler
version, or the right flags.

> 
> It's much easier to initialize the variables to impossible values and do a 
> simple comparison to determine if they were initialized to remove the bool 
> entirely.

This solution looks great to me.

> 
> Signed-off-by: David Rientjes 
> ---
>  mm/memory.c |   10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2527,9 +2527,8 @@ static int do_wp_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   int ret = 0;
>   int page_mkwrite = 0;
>   struct page *dirty_page = NULL;
> - unsigned long mmun_start;   /* For mmu_notifiers */
> - unsigned long mmun_end; /* For mmu_notifiers */
> - bool mmun_called = false;   /* For mmu_notifiers */
> + unsigned long mmun_start = 0;   /* For mmu_notifiers */
> + unsigned long mmun_end = 0; /* For mmu_notifiers */
>  
>   old_page = vm_normal_page(vma, address, orig_pte);
>   if (!old_page) {
> @@ -2708,8 +2707,7 @@ gotten:
>   goto oom_free_new;
>  
>   mmun_start  = address & PAGE_MASK;
> - mmun_end= (address & PAGE_MASK) + PAGE_SIZE;
> - mmun_called = true;
> + mmun_end= mmun_start + PAGE_SIZE;
>   mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  
>   /*
> @@ -2778,7 +2776,7 @@ gotten:
>   page_cache_release(new_page);
>  unlock:
>   pte_unmap_unlock(page_table, ptl);
> - if (mmun_called)
> + if (mmun_end > mmun_start)
>   mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>   if (old_page) {
>   /*
> 



[PATCH] mm: call invalidate_range_end in do_wp_page even for zero pages

2012-09-19 Thread Haggai Eran
The previous patch "mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end" only called the
invalidate_range_end mmu notifier function in do_wp_page when the new_page
variable wasn't NULL. This was done in order to only call invalidate_range_end
after invalidate_range_start was called. Unfortunately, there are situations
where new_page is NULL and invalidate_range_start is called. This caused
invalidate_range_start to be called without a matching invalidate_range_end,
causing kvm to loop indefinitely on the first page fault.

This patch adds a flag variable to do_wp_page that marks whether the
invalidate_range_start notifier was called. invalidate_range_end is then
called if the flag is true.

Reported-by: Jiri Slaby 
Cc: Avi Kivity 
Cc: Andrew Morton 
Signed-off-by: Haggai Eran 
---
I tested this patch against yesterday's linux-next (next-20120918), and it
seems to solve the problem with kvm. I used the same command line you reported:

  qemu-kvm -k en-us -usbdevice tablet -balloon virtio -hda IMAGE -smp 2 \
  -m 1000M -net user -net nic,model=e1000 -usb -serial pty

I was hoping you could also test it yourself and verify that it works for
you, if you don't mind.

 mm/memory.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1a92d87..76ec199 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2529,6 +2529,7 @@ static int do_wp_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
struct page *dirty_page = NULL;
unsigned long mmun_start;   /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
+   bool mmun_called = false;   /* For mmu_notifiers */
 
old_page = vm_normal_page(vma, address, orig_pte);
if (!old_page) {
@@ -2706,8 +2707,9 @@ gotten:
if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;
 
-   mmun_start = address & PAGE_MASK;
-   mmun_end   = (address & PAGE_MASK) + PAGE_SIZE;
+   mmun_start  = address & PAGE_MASK;
+   mmun_end= (address & PAGE_MASK) + PAGE_SIZE;
+   mmun_called = true;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 
/*
@@ -2776,8 +2778,7 @@ gotten:
page_cache_release(new_page);
 unlock:
pte_unmap_unlock(page_table, ptl);
-   if (new_page)
-   /* Only call the end notifier if the begin was called. */
+   if (mmun_called)
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
if (old_page) {
/*
-- 
1.7.11.2



Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task

2015-09-07 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct 
> *p)
>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>   * pins the final release of task.io_context.  Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
s/projtects/protects/
>   *
>   * Nests both inside and outside of read_lock(&tasklist_lock).
>   * It must not be nested with write_lock_irq(&tasklist_lock),



Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.

2015-09-07 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
> index 8b64221..cdbdd60 100644
> --- a/include/linux/device_cgroup.h
> +++ b/include/linux/device_cgroup.h
> @@ -1,6 +1,57 @@
> +#ifndef _DEVICE_CGROUP
> +#define _DEVICE_CGROUP
> +
>  #include 
> +#include 
> +#include 

You cannot add this include line before adding the device_rdma_cgroup.h
(added in patch 5). You should reorder the patches so that after each
patch the kernel builds correctly.

I also noticed that in patch 2 you add device_rdma_cgroup.o to the Makefile
before the corresponding source file is added.

Regards,
Haggai


Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> +/* RDMA resources from device cgroup perspective */
> +enum devcgroup_rdma_rt {
> + DEVCG_RDMA_RES_TYPE_UCTX,
> + DEVCG_RDMA_RES_TYPE_CQ,
> + DEVCG_RDMA_RES_TYPE_PD,
> + DEVCG_RDMA_RES_TYPE_AH,
> + DEVCG_RDMA_RES_TYPE_MR,
> + DEVCG_RDMA_RES_TYPE_MW,
I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?
> + DEVCG_RDMA_RES_TYPE_SRQ,
> + DEVCG_RDMA_RES_TYPE_QP,
> + DEVCG_RDMA_RES_TYPE_FLOW,
> + DEVCG_RDMA_RES_TYPE_MAX,
> +};

> +struct devcgroup_rdma_tracker {
> + int limit;
> + atomic_t usage;
> + int failcnt;
> +};
Have you considered using struct res_counter?
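
For illustration, a rough sketch of what that could look like (the
res_counter API details here are recalled from memory, so treat the exact
signatures as assumptions):

#include <linux/res_counter.h>

struct devcgroup_rdma {
        /* one hierarchical counter per RDMA resource type */
        struct res_counter tracker[DEVCG_RDMA_RES_TYPE_MAX];
};

static void devcg_rdma_counters_init(struct devcgroup_rdma *rdma,
                                     struct devcgroup_rdma *parent)
{
        int i;

        for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++)
                res_counter_init(&rdma->tracker[i],
                                 parent ? &parent->tracker[i] : NULL);
}

static int devcg_rdma_charge(struct devcgroup_rdma *rdma, int type, int num)
{
        struct res_counter *fail_at;

        /* charges the whole hierarchy; usage, limit and failcnt come for free */
        return res_counter_charge(&rdma->tracker[type], num, &fail_at);
}

static void devcg_rdma_uncharge(struct devcgroup_rdma *rdma, int type, int num)
{
        res_counter_uncharge(&rdma->tracker[type], num);
}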

> + * RDMA resource limits are hierarchical, so the highest configured limit of
> + * the hierarchy is enforced. Allowing resource limit configuration to 
> default
> + * cgroup allows fair share to kernel space ULPs as well.
In what way is the highest configured limit of the hierarchy enforced? I
would expect all the limits along the hierarchy to be enforced.

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> + int type = seq_cft(sf)->private;
> + u32 usage;
> +
> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> + } else {
> + usage = dev_cg->rdma.tracker[type].limit;
If this is the resource limit, don't name it 'usage'.

> + seq_printf(sf, "%u\n", usage);
> + }
> + return 0;
> +}

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> + int type = seq_cft(sf)->private;
> + u32 usage;
> +
> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
I'm not sure hiding the actual number is good, especially in the
show_usage case.

> + } else {
> + usage = dev_cg->rdma.tracker[type].limit;
> + seq_printf(sf, "%u\n", usage);
> + }
> + return 0;
> +}

> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +   enum devcgroup_rdma_rt type, int num)
> +{
> + struct dev_cgroup *dev_cg, *p;
> + struct task_struct *ctx_task;
> +
> + if (!num)
> + return;
> +
> + /* get cgroup of ib_ucontext it belong to, to uncharge
> +  * so that when its called from any worker tasks or any
> +  * other tasks to which this resource doesn't belong to,
> +  * it can be uncharged correctly.
> +  */
> + if (ucontext)
> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> + else
> + ctx_task = current;
> + dev_cg = task_devcgroup(ctx_task);
> +
> + spin_lock(&ctx_task->rdma_res_counter->lock);
Don't you need an rcu read lock and rcu_dereference to access
rdma_res_counter?
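
Something along these lines, perhaps (untested sketch, using the fields
from your patch):

        struct task_rdma_res_counter *res_cnt;

        rcu_read_lock();
        res_cnt = rcu_dereference(ctx_task->rdma_res_counter);
        if (res_cnt) {
                spin_lock(&res_cnt->lock);
                res_cnt->usage[type] -= num;
                spin_unlock(&res_cnt->lock);
        }
        rcu_read_unlock();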

> + ctx_task->rdma_res_counter->usage[type] -= num;
> +
> + for (p = dev_cg; p; p = parent_devcgroup(p))
> + uncharge_resource(p, type, num);
> +
> + spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> + if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> + rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);

> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
> +{
> + struct dev_cgroup *dev_cg = task_devcgroup(current);
> + struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
> + int status;
> +
> + if (!res_cnt) {
> + res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
> + if (!res_cnt)
> + return -ENOMEM;
> +
> + spin_lock_init(&res_cnt->lock);
> + rcu_assign_pointer(current->rdma_res_counter, res_cnt);
Don't you need the task lock to update rdma_res_counter here?
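
For example (untested sketch):

        task_lock(current);
        rcu_assign_pointer(current->rdma_res_counter, res_cnt);
        task_unlock(current);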

> + }
> +
> + /* synchronize with migration task by taking lock, to avoid
> +  * race condition of performing cgroup resource migration
> +  * in non atomic way with this task, which can leads to leaked
> +  * resources in older cgroup.
> +  */
> + spin_lock(&res_cnt->lock);
> + status = try_charge_resource(dev_cg, type, num);
> + if (status)
> + goto busy;
> +
> + /* single task updating its rdma resource usage, so atomic is
> +  * not required.
> +  */
> + current->rdma_res_counter->usage[type] += num;
> +
> +busy:
> + spin_unlock(&res_cnt->lock);
> + return status;
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);

Regards,
Haggai


Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task

2015-09-08 Thread Haggai Eran
On 08/09/2015 10:04, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran  wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct 
>>> task_struct *p)
>>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>>> - * ->cgroup.subsys[]. And ->vfork_done.
>>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
>> s/projtects/protects/
>>>   *
>>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>>
> 
> Hi Haggai Eran,
> Did you miss to put comments or I missed something?

Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in
your comment. You should change the word "projtects" to "protects".

Haggai



Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +   enum devcgroup_rdma_rt type, int num)
> +{
> + struct dev_cgroup *dev_cg, *p;
> + struct task_struct *ctx_task;
> +
> + if (!num)
> + return;
> +
> + /* get cgroup of ib_ucontext it belong to, to uncharge
> +  * so that when its called from any worker tasks or any
> +  * other tasks to which this resource doesn't belong to,
> +  * it can be uncharged correctly.
> +  */
> + if (ucontext)
> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> + else
> + ctx_task = current;
So what happens if a process creates a ucontext, forks, and then the
child creates and destroys a CQ? If I understand correctly, created
resources are always charged to the current process (the child), but
when the resource is destroyed, the owner of the ucontext (the parent)
will be uncharged.

Since ucontexts are not meant to be used by multiple processes, I think
it would be okay to always charge the owner process (the one that
created the ucontext).
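
In other words, something like this rough sketch, assuming the charge path
is also given the ib_ucontext:

        /* charge the ucontext owner, mirroring the uncharge path above */
        struct task_struct *owner;

        if (ucontext)
                owner = get_pid_task(ucontext->tgid, PIDTYPE_PID);
        else
                owner = current;

        dev_cg = task_devcgroup(owner);
        /* ... charge dev_cg and owner->rdma_res_counter instead of current's ... */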

> + dev_cg = task_devcgroup(ctx_task);
> +
> + spin_lock(&ctx_task->rdma_res_counter->lock);
> + ctx_task->rdma_res_counter->usage[type] -= num;
> +
> + for (p = dev_cg; p; p = parent_devcgroup(p))
> + uncharge_resource(p, type, num);
> +
> + spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> + if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> + rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);



Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
> +{
> + INIT_LIST_HEAD(&ucontext->pd_list);
> + INIT_LIST_HEAD(&ucontext->mr_list);
> + INIT_LIST_HEAD(&ucontext->mw_list);
> + INIT_LIST_HEAD(&ucontext->cq_list);
> + INIT_LIST_HEAD(&ucontext->qp_list);
> + INIT_LIST_HEAD(&ucontext->srq_list);
> + INIT_LIST_HEAD(&ucontext->ah_list);
> + INIT_LIST_HEAD(&ucontext->xrcd_list);
> + INIT_LIST_HEAD(&ucontext->rule_list);
> +}

I don't see how this change is related to the patch.


Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-08 Thread Haggai Eran
On 07/09/2015 23:38, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.
> 
> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.
I don't think extending the device cgroup is the right place for these
limits. It is currently a very generic controller and adding various
RDMA resources to it looks out of place. Why not create a new controller
for rdma?

Another thing I noticed is that all limits in this cgroup are global,
while the resources they control are hardware device specific.
I think it would be better if the cgroup controlled the limits of each
device separately.

> With this patch, user verbs module queries rdma device cgroup controller
> to query process's limit to consume such resource. It uncharge resource 
> counter after resource is being freed.
This is another reason why per-device limits would be better. Since
limits are reflected to user space when querying a specific device, the
same maximum limit will be shown for every device opened. If the user
opens 3 devices, they might expect to be able to allocate 3 times the
number of resources they actually can.

Regards,
Haggai


Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.

2015-09-08 Thread Haggai Eran
On 08/09/2015 13:22, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran  wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
>>> +{
>>> + INIT_LIST_HEAD(&ucontext->pd_list);
>>> + INIT_LIST_HEAD(&ucontext->mr_list);
>>> + INIT_LIST_HEAD(&ucontext->mw_list);
>>> + INIT_LIST_HEAD(&ucontext->cq_list);
>>> + INIT_LIST_HEAD(&ucontext->qp_list);
>>> + INIT_LIST_HEAD(&ucontext->srq_list);
>>> + INIT_LIST_HEAD(&ucontext->ah_list);
>>> + INIT_LIST_HEAD(&ucontext->xrcd_list);
>>> + INIT_LIST_HEAD(&ucontext->rule_list);
>>> +}
>>
>> I don't see how this change is related to the patch.
> 
> Its not but code which I added makes this function to grow longer, so
> to keep it to same readability level, I did the cleanup.
> May be I can send separate patch for cleanup?

Sounds good to me.



Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 08/09/2015 13:18, Parav Pandit wrote:
>> >
>>> >> + * RDMA resource limits are hierarchical, so the highest configured 
>>> >> limit of
>>> >> + * the hierarchy is enforced. Allowing resource limit configuration to 
>>> >> default
>>> >> + * cgroup allows fair share to kernel space ULPs as well.
>> > In what way is the highest configured limit of the hierarchy enforced? I
>> > would expect all the limits along the hierarchy to be enforced.
>> >
> In  hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied.
> 
> Lets take example to clarify.
> Say cg_A, cg_B, cg_C
> Role  name   limit
> Parent   cg_A   100
> Child_level1  cg_B (child of cg_A)20
> Child_level2: cg_C (child of cg_B)50
> 
> If the process allocating rdma resource belongs to cg_C, limit lowest
> limit in the hierarchy is applied during charge() stage.
> If cg_A limit happens to be 10, since 10 is lowest, its limit would be
> applicable as you expected.

Looking at the code, the usage at every level is charged, which is what I
would expect. I just think the comment is a bit misleading.

>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>> +{
>>> + struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>> + int type = seq_cft(sf)->private;
>>> + u32 usage;
>>> +
>>> + if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>> + seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> I'm not sure hiding the actual number is good, especially in the
>> show_usage case.
> 
> This is similar to following other controller same as newly added PID
> subsystem in showing max limit.

Okay.

>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> + struct dev_cgroup *dev_cg, *p;
>>> + struct task_struct *ctx_task;
>>> +
>>> + if (!num)
>>> + return;
>>> +
>>> + /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +  * so that when its called from any worker tasks or any
>>> +  * other tasks to which this resource doesn't belong to,
>>> +  * it can be uncharged correctly.
>>> +  */
>>> + if (ucontext)
>>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> + else
>>> + ctx_task = current;
>>> + dev_cg = task_devcgroup(ctx_task);
>>> +
>>> + spin_lock(&ctx_task->rdma_res_counter->lock);
>> Don't you need an rcu read lock and rcu_dereference to access
>> rdma_res_counter?
> 
> I believe, its not required because when uncharge() is happening, it
> can happen only from 3 contexts.
> (a) from the caller task context, who has made allocation call, so no
> synchronizing needed.
> (b) from the dealloc resource context, again this is from the same
> task context which allocated, it so this is single threaded, no need
> to syncronize.
I don't think that is true. You can access uverbs from multiple threads.
What may help your case here, I think, is the fact that the
rdma_res_counter field can only change when the last ucontext is
released, and ucontext release takes the ib_uverbs_file->mutex.

Still, I think it would be best to use rcu_dereference(), if only for
documentation and sparse.

> (c) from the fput() context when process is terminated abruptly or as
> part of differed cleanup, when this is happening there cannot be
> allocator task anyway.



Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.

2015-09-08 Thread Haggai Eran
On 08/09/2015 13:50, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran  wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> + struct dev_cgroup *dev_cg, *p;
>>> + struct task_struct *ctx_task;
>>> +
>>> + if (!num)
>>> + return;
>>> +
>>> + /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +  * so that when its called from any worker tasks or any
>>> +  * other tasks to which this resource doesn't belong to,
>>> +  * it can be uncharged correctly.
>>> +  */
>>> + if (ucontext)
>>> + ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> + else
>>> + ctx_task = current;
>> So what happens if a process creates a ucontext, forks, and then the
>> child creates and destroys a CQ? If I understand correctly, created
>> resources are always charged to the current process (the child), but
>> when it is destroyed the owner of the ucontext (the parent) will be
>> uncharged.
>>
>> Since ucontexts are not meant to be used by multiple processes, I think
>> it would be okay to always charge the owner process (the one that
>> created the ucontext).
> 
> I need to think about it. I would like to avoid keep per task resource
> counters for two reasons.
> For a while I thought that native fork() doesn't take care to share
> the RDMA resources and all CQ, QP dmaable memory from PID namespace
> perspective.
> 
> 1. Because, it could well happen that process and its child process is
> created in PID namespace_A, after which child is migrated to new PID
> namespace_B.
> after which parent from the namespace_A is terminated. I am not sure
> how the ucontext ownership changes from parent to child process at
> that point today.
> I prefer to keep this complexity out if at all it exists as process
> migration across namespaces is not a frequent event for which to
> optimize the code for.
> 
> 2. by having per task counter (as cost of memory some memory) allows
> to avoid using atomic during charge(), uncharge().
> 
> The intent is to have per task (process and thread) to have their
> resource counter instance, but I can see that its broken where its
> charging parent process as of now without atomics.
> As you said its ok to always charge the owner process, I have to relax
> 2nd requirement and fallback to use atomics for charge(), uncharge()
> or I have to get rid of ucontext from the uncharge() API which is
> difficult due to fput() being in worker thread context.
> 

I think the cost of atomic operations here would normally be negligible
compared to the cost of accessing the hardware to allocate or deallocate
these resources.


Re: [PATCH 2/7] mmu_notifier: keep track of active invalidation ranges v2

2014-12-25 Thread Haggai Eran
On 22/12/2014 18:48, j.gli...@gmail.com wrote:
>  static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -unsigned long start,
> -unsigned long end,
> -enum mmu_event event)
> +struct 
> mmu_notifier_range *range)
>  {
> + /*
> +  * Initialize list no matter what in case a mmu_notifier register after
> +  * a range_start but before matching range_end.
> +  */
> + INIT_LIST_HEAD(&range->list);

I don't see how an mmu_notifier can register after a range_start but
before a matching range_end. The mmu_notifier registration takes all the
mm locks, and that should prevent any invalidation from running, right?

>   if (mm_has_notifiers(mm))
> - __mmu_notifier_invalidate_range_start(mm, start, end, event);
> + __mmu_notifier_invalidate_range_start(mm, range);
>  }

...

>  void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -unsigned long start,
> -unsigned long end,
> -enum mmu_event event)
> +struct mmu_notifier_range *range)
>  
>  {
>   struct mmu_notifier *mn;
> @@ -185,21 +183,36 @@ void __mmu_notifier_invalidate_range_start(struct 
> mm_struct *mm,
>   id = srcu_read_lock(&srcu);
>   hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>   if (mn->ops->invalidate_range_start)
> - mn->ops->invalidate_range_start(mn, mm, start,
> - end, event);
> + mn->ops->invalidate_range_start(mn, mm, range);
>   }
>   srcu_read_unlock(&srcu, id);
> +
> + /*
> +  * This must happen after the callback so that subsystem can block on
> +  * new invalidation range to synchronize itself.
> +  */
> + spin_lock(&mm->mmu_notifier_mm->lock);
> + list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
> + mm->mmu_notifier_mm->nranges++;
> + spin_unlock(&mm->mmu_notifier_mm->lock);
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

Don't you have a race here because you add the range struct after the
callback?

-
Thread A| Thread B
-
call mmu notifier callback  |
  clear SPTE|
| device page fault
|   mmu_notifier_range_is_valid returns true
|   install new SPTE
add event struct to list|
mm clears/modifies the PTE  |
-

So we are left with different entries in the host page table and the
secondary page table.

I would think you'd want the event struct to be added to the list before
the callback is run.
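
Concretely, something like this reordering of the quoted code (untested
sketch):

void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
                                           struct mmu_notifier_range *range)
{
        struct mmu_notifier *mn;
        int id;

        /* publish the range first, so a racing device fault sees it */
        spin_lock(&mm->mmu_notifier_mm->lock);
        list_add_tail(&range->list, &mm->mmu_notifier_mm->ranges);
        mm->mmu_notifier_mm->nranges++;
        spin_unlock(&mm->mmu_notifier_mm->lock);

        id = srcu_read_lock(&srcu);
        hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
                if (mn->ops->invalidate_range_start)
                        mn->ops->invalidate_range_start(mn, mm, range);
        }
        srcu_read_unlock(&srcu, id);
}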

Best regards,
Haggai




Re: [PATCH 3/7] HMM: introduce heterogeneous memory management.

2014-12-31 Thread Haggai Eran
Hi,

On 22/12/2014 18:48, j.gli...@gmail.com wrote:
> +/* hmm_device_register() - register a device with HMM.
> + *
> + * @device: The hmm_device struct.
> + * Returns: 0 on success or -EINVAL otherwise.
> + *
> + *
> + * Call when device driver want to register itself with HMM. Device driver 
> can
> + * only register once. It will return a reference on the device thus to 
> release
> + * a device the driver must unreference the device.

I see that the code doesn't actually have a reference count on the
hmm_device, but just registers and unregisters it through the
hmm_device_register/hmm_device_unregister functions. Perhaps you should
update the comment here to reflect that.

> + */
> +int hmm_device_register(struct hmm_device *device)
> +{
> + /* sanity check */
> + BUG_ON(!device);
> + BUG_ON(!device->name);
> + BUG_ON(!device->ops);
> + BUG_ON(!device->ops->release);
> +
> + mutex_init(&device->mutex);
> + INIT_LIST_HEAD(&device->mirrors);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(hmm_device_register);

Regards,
Haggai


Re: [PATCH 5/6] HMM: add per mirror page table.

2015-01-08 Thread Haggai Eran
On 06/01/2015 00:44, j.gli...@gmail.com wrote:
> + /* fence_wait() - to wait on device driver fence.
> +  *
> +  * @fence: The device driver fence struct.
> +  * Returns: 0 on success,-EIO on error, -EAGAIN to wait again.
> +  *
> +  * Called when hmm want to wait for all operations associated with a
> +  * fence to complete (including device cache flush if the event mandate
> +  * it).
> +  *
> +  * Device driver must free fence and associated resources if it returns
> +  * something else thant -EAGAIN. On -EAGAIN the fence must not be free
> +  * as hmm will call back again.
> +  *
> +  * Return error if scheduled operation failed or if need to wait again.
> +  * -EIO Some input/output error with the device.
> +  * -EAGAIN The fence not yet signaled, hmm reschedule waiting thread.
> +  *
> +  * All other return value trigger warning and are transformed to -EIO.
> +  */
> + int (*fence_wait)(struct hmm_fence *fence);

According to the comment, the device frees the fence struct when the
fence_wait callback returns zero or -EIO, but the code below calls
fence_unref after fence_wait on the same fence.

> +
> + /* fence_ref() - take a reference fence structure.
> +  *
> +  * @fence: Fence structure hmm is referencing.
> +  */
> + void (*fence_ref)(struct hmm_fence *fence);

I don't see fence_ref being called anywhere in the patchset. Is it
actually needed?

> +static void hmm_device_fence_wait(struct hmm_device *device,
> +   struct hmm_fence *fence)
> +{
> + struct hmm_mirror *mirror;
> + int r;
> +
> + if (fence == NULL)
> + return;
> +
> + list_del_init(&fence->list);
> + do {
> + r = device->ops->fence_wait(fence);
> + if (r == -EAGAIN)
> + io_schedule();
> + } while (r == -EAGAIN);
> +
> + mirror = fence->mirror;
> + device->ops->fence_unref(fence);
> + if (r)
> + hmm_mirror_release(mirror);
> +}
> +

Regards,
Haggai


Re: [PATCH 5/6] HMM: add per mirror page table.

2015-01-10 Thread Haggai Eran
On 10/01/2015 08:48, Jerome Glisse wrote:
> On Thu, Jan 08, 2015 at 01:05:41PM +0200, Haggai Eran wrote:
>> On 06/01/2015 00:44, j.gli...@gmail.com wrote:
>>> +   /* fence_wait() - to wait on device driver fence.
>>> +*
>>> +* @fence: The device driver fence struct.
>>> +* Returns: 0 on success,-EIO on error, -EAGAIN to wait again.
>>> +*
>>> +* Called when hmm want to wait for all operations associated with a
>>> +* fence to complete (including device cache flush if the event mandate
>>> +* it).
>>> +*
>>> +* Device driver must free fence and associated resources if it returns
>>> +* something else thant -EAGAIN. On -EAGAIN the fence must not be free
>>> +* as hmm will call back again.
>>> +*
>>> +* Return error if scheduled operation failed or if need to wait again.
>>> +* -EIO Some input/output error with the device.
>>> +* -EAGAIN The fence not yet signaled, hmm reschedule waiting thread.
>>> +*
>>> +* All other return value trigger warning and are transformed to -EIO.
>>> +*/
>>> +   int (*fence_wait)(struct hmm_fence *fence);
>>
>> According to the comment, the device frees the fence struct when the
>> fence_wait callback returns zero or -EIO, but the code below calls
>> fence_unref after fence_wait on the same fence.
> 
> Yes comment is out of date, i wanted to simplify fence before readding
> it once needed (by device memory migration).
> 
>>
>>> +
>>> +   /* fence_ref() - take a reference fence structure.
>>> +*
>>> +* @fence: Fence structure hmm is referencing.
>>> +*/
>>> +   void (*fence_ref)(struct hmm_fence *fence);
>>
>> I don't see fence_ref being called anywhere in the patchset. Is it
>> actually needed?
> 
> Not right now but the page migration to device memory use it. But i
> can remove it now.
> 
> I can respin to make comment match code but i would like to know where
> i stand on everythings else.
> 

Well, I've read patches 1 through 4, and they seemed fine, although I
still want to have a deeper look into patch 4, because the page table
code seems a little tricky. I haven't finished reading patches 5 and 6 yet.

Haggai


Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access

2015-04-02 Thread Haggai Eran
On 02/04/2015 16:30, Yann Droneaud wrote:
> Hi,
> 
> On Thursday, 2 April 2015 at 10:52, Shachar Raindel wrote:
>>> -Original Message-
>>> From: Yann Droneaud [mailto:ydrone...@opteya.com]
>>> Sent: Thursday, April 02, 2015 1:05 PM
>>> On Wednesday, 18 March 2015 at 17:39, Shachar Raindel wrote:
> 
 +  /*
 +   * If the combination of the addr and size requested for this
>>> memory
 +   * region causes an integer overflow, return error.
 +   */
 +  if ((PAGE_ALIGN(addr + size) <= size) ||
 +  (PAGE_ALIGN(addr + size) <= addr))
 +  return ERR_PTR(-EINVAL);
 +
>>>
>>> Can access_ok() be used here ?
>>>
>>>  if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
>>> addr, size))
>>>   return ERR_PTR(-EINVAL);
>>>
>>
>> No, this will break the current ODP semantics.
>>
>> ODP allows the user to register memory that is not accessible yet.
>> This is a critical design feature, as it allows avoiding holding
>> a registration cache. Adding this check will break the behavior,
>> forcing memory to be all accessible when registering an ODP MR.
>>
> 
> Where's the check for the range being in userspace memory space,
> especially for the ODP case ?
> 
> For non ODP case (eg. plain old behavior), does get_user_pages()
> ensure the requested pages fit in userspace region on all 
> architectures ? I think so.

Yes, get_user_pages will return a smaller number of pages than requested
if it encounters an unmapped region (or a region without write
permissions for write requests). If this happens, the loop in
ib_umem_get calls get_user_pages again with the next set of pages, and
this time, if the first page still cannot be mapped, an error is returned.
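
Roughly, the pattern looks like this (a simplified sketch, not the actual
ib_umem_get() code, using the get_user_pages() signature of that kernel
era as I remember it):

        while (npages) {
                ret = get_user_pages(current, current->mm, cur_base,
                                     min_t(unsigned long, npages,
                                           PAGE_SIZE / sizeof(struct page *)),
                                     1 /* write */, 0 /* force */,
                                     page_list, NULL);
                if (ret < 0)
                        return ret;     /* first page of this chunk unmapped: fail */

                /* ret may be smaller than requested if a hole was hit */
                cur_base += ret * PAGE_SIZE;
                npages -= ret;
                /* ... map and record the 'ret' pinned pages ... */
        }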

> 
> In ODP case, I'm not sure such check is ever done ?

In ODP, we also call get_user_pages, but only when a page fault occurs
(see ib_umem_odp_map_dma_pages()). This allows the user to pre-register
a memory region that contains unmapped virtual space, and then mmap
different files into that area without needing to re-register.

> (Aside, does it take special mesure to protect shared mapping from
> being read and/or *written* ?)

I'm not sure I understand the question. Shared mappings that the process
is allowed to read or write are also accessible to the HCA (specifically,
for local and remote operations the same process performs using the HCA),
provided the application has registered their virtual address range as a
memory region.

Regards,
Haggai


Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access

2015-04-02 Thread Haggai Eran

On Thursday, April 2, 2015 7:44 PM, Shachar Raindel wrote:
>> -Original Message-
>> From: Yann Droneaud [mailto:ydrone...@opteya.com]
>> Sent: Thursday, April 02, 2015 7:35 PM
>> To: Haggai Eran
>> Cc: Shachar Raindel; Sagi Grimberg; oss-secur...@lists.openwall.com;
>>  (linux-r...@vger.kernel.org); linux-
>> ker...@vger.kernel.org; sta...@vger.kernel.org
>> Subject: Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected
>> physical memory access
>>
>> Hi Haggai,
>>
>> On Thursday, 2 April 2015 at 18:18 +0300, Haggai Eran wrote:
>> > On 02/04/2015 16:30, Yann Droneaud wrote:
>> >> Hi,
>> >>
>> >> Le jeudi 02 avril 2015 à 10:52 +, Shachar Raindel a écrit :
>> >>>> -Original Message-
>> >>>> From: Yann Droneaud [mailto:ydrone...@opteya.com]
>> >>>> Sent: Thursday, April 02, 2015 1:05 PM
>> >>>> On Wednesday, 18 March 2015 at 17:39, Shachar Raindel wrote:
>> >>
>> >>>>> +  /*
>> >>>>> +   * If the combination of the addr and size requested for this
>> >>>> memory
>> >>>>> +   * region causes an integer overflow, return error.
>> >>>>> +   */
>> >>>>> +  if ((PAGE_ALIGN(addr + size) <= size) ||
>> >>>>> +  (PAGE_ALIGN(addr + size) <= addr))
>> >>>>> +  return ERR_PTR(-EINVAL);
>> >>>>> +
>> >>>>
>> >>>> Can access_ok() be used here ?
>> >>>>
>> >>>>  if (!access_ok(writable ? VERIFY_WRITE : VERIFY_READ,
>> >>>> addr, size))
>> >>>>   return ERR_PTR(-EINVAL);
>> >>>>
>> >>>
>> >>> No, this will break the current ODP semantics.
>> >>>
>> >>> ODP allows the user to register memory that is not accessible yet.
>> >>> This is a critical design feature, as it allows avoiding holding
>> >>> a registration cache. Adding this check will break the behavior,
>> >>> forcing memory to be all accessible when registering an ODP MR.
>> >>>
>> >>
>> >> Where's the check for the range being in userspace memory space,
>> >> especially for the ODP case ?
>> >>
>> >> For non ODP case (eg. plain old behavior), does get_user_pages()
>> >> ensure the requested pages fit in userspace region on all
>> >> architectures ? I think so.
>> >
>> > Yes, get_user_pages will return a smaller amount of pages than
>> requested
>> > if it encounters an unmapped region (or a region without write
>> > permissions for write requests). If this happens, the loop in
>> > ib_umem_get calls get_user_pages again with the next set of pages, and
>> > this time if it the first page still cannot be mapped an error is
>> returned.
>> >
>> >>
>> >> In ODP case, I'm not sure such check is ever done ?
>> >
>> > In ODP, we also call get_user_pages, but only when a page fault occurs
>> > (see ib_umem_odp_map_dma_pages()). This allows the user to pre-
>> register
>> > a memory region that contains unmapped virtual space, and then mmap
>> > different files into that area without needing to re-register.
>> >
>>
>> OK, thanks for the description.
>>
>> ...
>>
>> Another related question: as the large memory range could be registered
>> by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
>> what's prevent the kernel to map a file as the result of mmap(0, ...)
>> in this  region, making it available remotely through IBV_WR_RDMA_READ /
>> IBV_WR_RDMA_WRITE ?
>>
> 
> This is not a bug. This is a feature.
> 
> Exposing a file through RDMA, using ODP, can be done exactly like this.
> Given that the application explicitly requested this behavior, I don't
> see why it is a problem. Actually, some of our tests use such flows.
> The mmu notifiers mechanism allow us to do this safely. When the page is
> written back to disk, it is removed from the ODP mapping. When it is
> accessed by the HCA, it is brought back to RAM.
> 

I want to add that we would like to see users registering a very large memory
region (perhaps the entire process address space) for local access, and then
enabling remote access only to specific regions using memory windows. However,
this isn't supported yet by our driver. Still, there are valid cases where you
would want the results of an mmap(0,...) call to be remotely accessible, when
there is enough trust between the local process and the remote process. It may
help a middleware communication library register a large portion of the address
space in advance, and still work with random pointers given to it by another
application module.

Regards,
Haggai


Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access

2015-04-03 Thread Haggai Eran
On Thursday, April 2, 2015 11:40 PM, Yann Droneaud  wrote:
> On Thursday, 2 April 2015 at 16:44, Shachar Raindel wrote:
>> > -Original Message-
>> > From: Yann Droneaud [mailto:ydrone...@opteya.com]
>> > Sent: Thursday, April 02, 2015 7:35 PM
> 
>> > Another related question: as the large memory range could be registered
>> > by user space with ibv_reg_mr(pd, base, size, IB_ACCESS_ON_DEMAND),
>> > what's prevent the kernel to map a file as the result of mmap(0, ...)
>> > in this  region, making it available remotely through IBV_WR_RDMA_READ /
>> > IBV_WR_RDMA_WRITE ?
>> >
>>
>> This is not a bug. This is a feature.
>>
>> Exposing a file through RDMA, using ODP, can be done exactly like this.
>> Given that the application explicitly requested this behavior, I don't
>> see why it is a problem.
> 
> If the application cannot choose what will end up in the region it has
> registered, it's an issue !
> 
> What might happen if one library in a program call mmap(0, size, ...) to
> load a file storing a secret (a private key), and that file ends up
> being mapped in an registered but otherwise free region (afaict, the
> kernel is allowed to do it) ?
> What might happen if one library in a program call call mmap(0,
> size, ..., MAP_ANONYMOUS,...) to allocate memory, call mlock(), then
> write in this location a secret (a passphrase), and that area ends up
> in the memory region registered for on demand paging ?
> 
> The application haven't choose to disclose these confidential piece of
> information, but they are available for reading/writing by remote
> through RDMA given it knows the rkey of the memory region (which is a
> 32bits value).
> 
> I hope I'm missing something, because I'm not feeling confident such
> behavor is a feature.

What we are aiming for is the ability to register the entire process's
address space for RDMA operations (if the process chooses to use this
feature). This is similar to multiple threads accessing the same address
space. I'm sure you wouldn't be complaining about the ability of one thread
to access the secret passphrase mmapped by another thread in your example.

> I'm trying to understand how the application can choose what is exposed
> through RDMA if it registers a very large memory region for later use
> (but do not actually explicitly map something there yet): what's the
> consequences ?
> 
>void *start = sbrk(0);
>size_t size = ULONG_MAX - (unsigned long)start;
> 
>ibv_reg_mr(pd, start, size, IB_ACCESS_ON_DEMAND)

The consequences are exactly as you wrote. Just as giving a non-ODP rkey 
to a remote node allows the node to access the registered memory behind that 
rkey, giving an ODP rkey to a remote node allows that node to access the 
virtual address space behind that rkey.

Regards,
Haggai


Re: CVE-2014-8159 kernel: infiniband: uverbs: unprotected physical memory access

2015-04-14 Thread Haggai Eran
On 13/04/2015 16:29, Yann Droneaud wrote:
> On Thursday, 2 April 2015 at 18:12 +0000, Haggai Eran wrote:
...
>>
>> I want to add that we would like to see users registering a very large
>> memory region (perhaps the entire process address space) for local
>> access, and then enabling remote access only to specific regions using
>> memory windows. However, this isn't supported yet by our driver.
> 
> In such scheme, the registration must still be handled "manually":
> one has to create a memory window to get a rkey to be exchanged with
> a peer, so why one would want to register such a large memory region
> (the whole process space) ?
> 
> I guess creating the memory window is faster than registering memory
> region. 
Right.

It takes time to create and fill the hardware's page tables. Using
memory windows allows you to reuse the work done previously, while still
having more granular control over the RDMA permissions. The larger MR
can be created with only local permissions, and the memory window can
add specific remote permissions to a smaller range. The memory window
utilizes the page tables created for the memory region.
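
As a rough userspace sketch (the libibverbs memory-window API details here
are recalled from memory and simplified, so treat the exact struct fields
and the send_rkey_to_peer() helper as assumptions):

        /* Register a large range once, local access only, but allow MW binds. */
        struct ibv_mr *mr = ibv_reg_mr(pd, base, large_len,
                                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_MW_BIND);

        /* Later, grant remote access to a small sub-range by binding a type-1
         * memory window; the bind reuses the page tables of the existing MR. */
        struct ibv_mw *mw = ibv_alloc_mw(pd, IBV_MW_TYPE_1);
        struct ibv_mw_bind bind = {
                .wr_id      = 1,
                .send_flags = IBV_SEND_SIGNALED,
                .bind_info  = {
                        .mr              = mr,
                        .addr            = (uintptr_t)buf,
                        .length          = buf_len,
                        .mw_access_flags = IBV_ACCESS_REMOTE_READ |
                                           IBV_ACCESS_REMOTE_WRITE,
                },
        };

        if (ibv_bind_mw(qp, mw, &bind) == 0)
                send_rkey_to_peer(mw->rkey);    /* peer can access [buf, buf+buf_len) only */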

> I'd rather say this is not an excuse to register a larger
> memory region (up to the whole process space, current and future) as it 
> sounds like a surprising optimisation: let the HCA known too many
> pages just to be sure it already knows some when the process want to 
> use them. It seems it would become difficult to handle if there's too
> many processes.
Are you worried about pinning too many pages? That is an issue we want
to solve with ODP :)

> 
> I would prefer creating memory region becoming costless (through ODP :).
I agree :)

> 
>>  Still, there are valid cases where you would still want the results
>> of an mmap(0,...) call to be remotely accessible, in cases where there
>> is enough trust between the local process and the remote process.
> 
> mmap(0, , fd) let the kernel choose where to put the file in 
> process virtual memory space, so it may, may not, partially, end up in 
> an ODP pre registered memory region for a range unallocated/unused yet.
> 
> I don't think one want such to happen.
I think that in some cases the benefit of allowing this outweighs the
risks. This is why it is an opt-in feature.

> 
>>  It may help a middleware communication library register a large
>> portion of the address space in advance, and still work with random
>> pointers given to it by another application module.
>>
> 
> But as said in the beginnig of your message, the middleware would have
> bind a memory window before posting work request / exposing rkey for
> the "random pointers".
> 
> So I fail to understand how could be used ODP when it comes to 
> registering a memory region not yet backed by something.

In this scenario, the middleware would first register the full address
space as an ODP memory region with local permissions only. When it wants
to provide remote access to some buffer, it would bind a memory window
over the ODP MR. This is possible with multiple processes because it
uses the virtual memory system without pinning. It won't cause random
mmap regions to be mapped for RDMA without the specific intent of the
application.

However, we currently don't have support for memory windows over ODP
MRs. Even if we did, there is some performance penalty due to binding
and invalidating memory windows. Some applications will still need full
process address space access for RDMA.

Regards,
Haggai


Re: [PATCH 16/36] HMM: add special swap filetype for memory migrated to HMM device memory.

2015-06-24 Thread Haggai Eran
On 21/05/2015 22:31, j.gli...@gmail.com wrote:
> From: Jerome Glisse 
> 
> When migrating anonymous memory from system memory to device memory
> CPU pte are replaced with special HMM swap entry so that page fault,
> get user page (gup), fork, ... are properly redirected to HMM helpers.
> 
> This patch only add the new swap type entry and hooks HMM helpers
> functions inside the page fault and fork code path.
> 
> Signed-off-by: Jérôme Glisse 
> Signed-off-by: Sherry Cheung 
> Signed-off-by: Subhash Gutti 
> Signed-off-by: Mark Hairgrove 
> Signed-off-by: John Hubbard 
> Signed-off-by: Jatin Kumar 
> ---
>  include/linux/hmm.h | 34 ++
>  include/linux/swap.h| 12 +++-
>  include/linux/swapops.h | 43 ++-
>  mm/hmm.c| 21 +
>  mm/memory.c | 22 ++
>  5 files changed, 130 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 186f497..f243eb5 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -257,6 +257,40 @@ void hmm_mirror_range_dirty(struct hmm_mirror *mirror,
>   unsigned long start,
>   unsigned long end);
>  
> +int hmm_handle_cpu_fault(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + pmd_t *pmdp, unsigned long addr,
> + unsigned flags, pte_t orig_pte);
> +
> +int hmm_mm_fork(struct mm_struct *src_mm,
> + struct mm_struct *dst_mm,
> + struct vm_area_struct *dst_vma,
> + pmd_t *dst_pmd,
> + unsigned long start,
> + unsigned long end);
> +
> +#else /* CONFIG_HMM */
> +
> +static inline int hmm_handle_mm_fault(struct mm_struct *mm,
I think this should be hmm_handle_cpu_fault, to match the function
declared above in the CONFIG_HMM case.

> +   struct vm_area_struct *vma,
> +   pmd_t *pmdp, unsigned long addr,
> +   unsigned flags, pte_t orig_pte)
> +{
> + return VM_FAULT_SIGBUS;
> +}
> +
> +static inline int hmm_mm_fork(struct mm_struct *src_mm,
> +   struct mm_struct *dst_mm,
> +   struct vm_area_struct *dst_vma,
> +   pmd_t *dst_pmd,
> +   unsigned long start,
> +   unsigned long end)
> +{
> + BUG();
> + return -ENOMEM;
> +}
>  
>  #endif /* CONFIG_HMM */
> +
> +
>  #endif

Regards,
Haggai


Re: [PATCH 33/36] IB/odp/hmm: add core infiniband structure and helper for ODP with HMM.

2015-06-24 Thread Haggai Eran
On 21/05/2015 23:23, jgli...@redhat.com wrote:
> +int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
> +{
> + struct mm_struct *mm = get_task_mm(current);
> + struct ib_device *ib_device = context->device;
> + struct ib_mirror *ib_mirror;
> + struct pid *our_pid;
> + int ret;
> +
> + if (!mm || !ib_device->hmm_ready)
> + return -EINVAL;
> +
> + /* FIXME can this really happen ? */
No, following Yann Droneaud's patch 8abaae62f3fd ("IB/core: disallow
registering 0-sized memory region"), ib_umem_get() checks against
zero-sized umems.

> + if (unlikely(ib_umem_start(umem) == ib_umem_end(umem)))
> + return -EINVAL;
> +
> + /* Prevent creating ODP MRs in child processes */
> + rcu_read_lock();
> + our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
> + rcu_read_unlock();
> + put_pid(our_pid);
> + if (context->tgid != our_pid) {
> + mmput(mm);
> + return -EINVAL;
> + }
> +
> + umem->hugetlb = 0;
> + umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
> + if (umem->odp_data == NULL) {
> + mmput(mm);
> + return -ENOMEM;
> + }
> + umem->odp_data->private = NULL;
> + umem->odp_data->umem = umem;
> +
> + mutex_lock(&ib_device->hmm_mutex);
> + /* Is there an existing mirror for this process mm ? */
> + ib_mirror = ib_mirror_ref(context->ib_mirror);
> + if (!ib_mirror)
> + list_for_each_entry(ib_mirror, &ib_device->ib_mirrors, list) {
> + if (ib_mirror->base.hmm->mm != mm)
> + continue;
> + ib_mirror = ib_mirror_ref(ib_mirror);
> + break;
> + }
> +
> + if (ib_mirror == NULL ||
> + ib_mirror == list_first_entry(&ib_device->ib_mirrors,
> +   struct ib_mirror, list)) {
Is the second check an attempt to check if the list_for_each_entry above
passed through all the entries and didn't break? Maybe I'm missing
something, but I think that would cause the ib_mirror to hold a pointer
such that ib_mirror->list == ib_mirrors (pointing to the list head), and
not to the first entry.

In any case, I think it would be more clear if you add another ib_mirror
variable for iterating the ib_mirrors list.
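
Something along these lines (a sketch only, with a hypothetical "found"
variable, reusing the structures from the patch) would make the not-found
case unambiguous:

	struct ib_mirror *found = NULL, *iter;

	list_for_each_entry(iter, &ib_device->ib_mirrors, list) {
		if (iter->base.hmm->mm != mm)
			continue;
		found = ib_mirror_ref(iter);
		break;
	}
	if (!found) {
		/* no existing mirror for this mm: allocate a new one */
	}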

> + /* We need to create a new mirror. */
> + ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
> + if (ib_mirror == NULL) {
> + mutex_unlock(&ib_device->hmm_mutex);
> + mmput(mm);
> + return -ENOMEM;
> + }
> + kref_init(&ib_mirror->kref);
> + init_rwsem(&ib_mirror->hmm_mr_rwsem);
> + ib_mirror->umem_tree = RB_ROOT;
> + ib_mirror->ib_device = ib_device;
> +
> + ib_mirror->base.device = &ib_device->hmm_dev;
> + ret = hmm_mirror_register(&ib_mirror->base);
> + if (ret) {
> + mutex_unlock(&ib_device->hmm_mutex);
> + kfree(ib_mirror);
> + mmput(mm);
> + return ret;
> + }
> +
> + list_add(&ib_mirror->list, &ib_device->ib_mirrors);
> + context->ib_mirror = ib_mirror_ref(ib_mirror);
> + }
> + mutex_unlock(&ib_device->hmm_mutex);
> + umem->odp_data.ib_mirror = ib_mirror;
> +
> + down_write(&ib_mirror->umem_rwsem);
> + rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
> + up_write(&ib_mirror->umem_rwsem);
> +
> + mmput(mm);
> + return 0;
> +}



Re: HMM (Heterogeneous Memory Management) v8

2015-05-31 Thread Haggai Eran
On 21/05/2015 22:31, j.gli...@gmail.com wrote:
> From design point of view not much changed since last patchset (2).
> Most of the change are in small details of the API expose to device
> driver. This version also include device driver change for Mellanox
> hardware to use HMM as an alternative to ODP (which provide a subset
> of HMM functionality specificaly for RDMA devices). Long term plan
> is to have HMM completely replace ODP.

Hi,

I think HMM would be a good long term solution indeed. For now I would
want to keep ODP and HMM side by side (as the patchset seems to do)
mainly since HMM is introduced as a STAGING feature and ODP is part of
the mainline kernel.

It would be nice if you could provide a git repository to access the
patches. I couldn't apply them to the current linux-next tree.

A minor thing: I noticed some style issues in the patches. You should
run checkpatch.pl on the patches and get them to match the coding style.

Regards,
Haggai



Re: [PATCH 1/1] RDMA/odp: convert to use HMM for ODP

2019-02-06 Thread Haggai Eran
On 1/29/2019 6:58 PM, jgli...@redhat.com wrote:
 > Convert ODP to use HMM so that we can build on common infrastructure
 > for different class of devices that want to mirror a process address
 > space into a device. There is no functional changes.

Thanks for sending this patch. I think in general it is a good idea to 
use a common infrastructure for ODP.

I have a couple of questions below.

> -static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
> - const struct mmu_notifier_range *range)
> -{
> - struct ib_ucontext_per_mm *per_mm =
> - container_of(mn, struct ib_ucontext_per_mm, mn);
> -
> - if (unlikely(!per_mm->active))
> - return;
> -
> - rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
> -   range->end,
> -   invalidate_range_end_trampoline, true, 
> NULL);
>   up_read(&per_mm->umem_rwsem);
> + return ret;
>   }
Previously the code held the umem_rwsem between range_start and 
range_end calls. I guess that was in order to guarantee that no device 
page faults take reference to the pages being invalidated while the 
invalidation is ongoing. I assume this is now handled by hmm instead, 
correct?

> +
> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> + ODP_READ_BIT,   /* HMM_PFN_VALID */
> + ODP_WRITE_BIT,  /* HMM_PFN_WRITE */
> + ODP_DEVICE_BIT, /* HMM_PFN_DEVICE_PRIVATE */
It seems that the mlx5_ib code in this patch currently ignores the 
ODP_DEVICE_BIT (e.g., in umem_dma_to_mtt). Is that okay? Or is it 
handled implicitly by the HMM_PFN_SPECIAL case?


> @@ -327,9 +287,10 @@ void put_per_mm(struct ib_umem_odp *umem_odp)
>   up_write(&per_mm->umem_rwsem);
>  
>   WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
> - mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
> + hmm_mirror_unregister(&per_mm->mirror);
>   put_pid(per_mm->tgid);
> - mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
> +
> + kfree(per_mm);
>  }
Previously the per_mm struct was released through call_srcu, but now it
is released immediately. Is it safe? I saw that hmm_mirror_unregister 
calls mmu_notifier_unregister_no_release, so I don't understand what 
prevents concurrently running invalidations from accessing the released 
per_mm struct.

> @@ -578,11 +578,27 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct 
> mlx5_ib_mr *mr,
>  
>  next_mr:
>   size = min_t(size_t, bcnt, ib_umem_end(&odp->umem) - io_virt);
> -
>   page_shift = mr->umem->page_shift;
>   page_mask = ~(BIT(page_shift) - 1);
> + off = (io_virt & (~page_mask));
> + size += (io_virt & (~page_mask));
> + io_virt = io_virt & page_mask;
> + off += (size & (~page_mask));
> + size = ALIGN(size, 1UL << page_shift);
> +
> + if (io_virt < ib_umem_start(&odp->umem))
> + return -EINVAL;
> +
>   start_idx = (io_virt - (mr->mmkey.iova & page_mask)) >> page_shift;
>  
> + if (odp_mr->per_mm == NULL || odp_mr->per_mm->mm == NULL)
> + return -ENOENT;
> +
> + ret = hmm_range_register(&range, odp_mr->per_mm->mm,
> +  io_virt, io_virt + size, page_shift);
> + if (ret)
> + return ret;
> +
>   if (prefetch && !downgrade && !mr->umem->writable) {
>   /* prefetch with write-access must
>* be supported by the MR
Isn't there a mistake in the calculation of the variable size? It is
first set to the size of the page fault range, but then you add the
virtual address, so I guess it is actually the range end. Then you pass
io_virt + size to hmm_range_register. Doesn't it double the size of the
range?

> -void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
> -  u64 bound)
> +void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp,
> +  u64 virt, u64 bound)
>  {
> + struct device *device = umem_odp->umem.context->device->dma_device;
>   struct ib_umem *umem = &umem_odp->umem;
> - int idx;
> - u64 addr;
> - struct ib_device *dev = umem->context->device;
> + unsigned long idx, page_mask;
> + struct hmm_range range;
> + long ret;
> +
> + if (!umem->npages)
> + return;
> +
> + bound = ALIGN(bound, 1UL << umem->page_shift);
> + page_mask = ~(BIT(umem->page_shift) - 1);
> + virt &= page_mask;
>  
>   virt  = max_t(u64, virt,  ib_umem_start(umem));
>   bound = min_t(u64, bound, ib_umem_end(umem));
> - /* Note that during the run of this function, the
> -  * notifiers_count of the MR is > 0, preventing any racing
> -  * faults from completion. We might be racing with other
> -  * invalidations, so we must make sure we free each page only
> -  * once. */
> +
> + idx = ((unsigned long)virt - ib_umem_start(umem)) >> PAGE_SHIFT;
> +
> + range.p

Re: [PATCH] mm/mmu_notifier: restore set_pte_at_notify semantics

2014-01-22 Thread Haggai Eran
On 22/01/2014 15:10, Andrea Arcangeli wrote:
> On Wed, Jan 15, 2014 at 11:40:34AM +0200, Mike Rapoport wrote:
>> Commit 6bdb913f0a70a4dfb7f066fb15e2d6f960701d00 (mm: wrap calls to
>> set_pte_at_notify with invalidate_range_start and invalidate_range_end)
>> breaks semantics of set_pte_at_notify. When calls to set_pte_at_notify
>> are wrapped with mmu_notifier_invalidate_range_start and
>> mmu_notifier_invalidate_range_end, KVM zaps pte during
>> mmu_notifier_invalidate_range_start callback and set_pte_at_notify has
>> no spte to update and therefore it's called for nothing.
>>
>> As Andrea suggested (1), the problem is resolved by calling
>> mmu_notifier_invalidate_page after PT lock has been released and only
>> for mmu_notifiers that do not implement change_ptr callback.
>>
>> (1) http://thread.gmane.org/gmane.linux.kernel.mm/111710/focus=111711
>>
>> Reported-by: Izik Eidus 
>> Signed-off-by: Mike Rapoport 
>> Cc: Andrea Arcangeli 
>> Cc: Haggai Eran 
>> Cc: Peter Zijlstra 
>> ---
>>  include/linux/mmu_notifier.h | 31 ++-
>>  kernel/events/uprobes.c  | 12 ++--
>>  mm/ksm.c | 15 +--
>>  mm/memory.c  | 14 +-
>>  mm/mmu_notifier.c| 24 ++--
>>  5 files changed, 64 insertions(+), 32 deletions(-)
> 
> Reviewed-by: Andrea Arcangeli 
> 

Hi Andrea, Mike,

Did you get a chance to consider the scenario I wrote about in the other
thread?

I'm worried about the following scenario:

Given a read-only page, suppose one host thread (thread 1) writes to
that page and performs COW, but before it calls the
mmu_notifier_invalidate_page_if_missing_change_pte function, another host
thread (thread 2) writes to the same page (this time without a page
fault). Then we have a valid entry in the secondary page table pointing
to a stale page, and someone (thread 3) may read stale data from there.

Here's a diagram that shows this scenario:

Thread 1                                  | Thread 2        | Thread 3
------------------------------------------+-----------------+--------------
do_wp_page(page 1)                        |                 |
  ...                                     |                 |
  set_pte_at_notify                       |                 |
  ...                                     | write to page 1 |
                                          |                 | stale access
  pte_unmap_unlock                        |                 |
  invalidate_page_if_missing_change_pte   |                 |

This is currently prevented by the use of the range start and range end
notifiers.

Do you agree that this scenario is possible with the new patch, or am I
missing something?

Regards,
Haggai



Re: [PATCH] mm/mmu_notifier: restore set_pte_at_notify semantics

2014-04-02 Thread Haggai Eran

On 03/30/2014 11:33 PM, Jerome Glisse wrote:

On Wed, Jan 22, 2014 at 04:01:15PM +0200, Haggai Eran wrote:

I'm worried about the following scenario:

Given a read-only page, suppose one host thread (thread 1) writes to
that page and performs COW, but before it calls the
mmu_notifier_invalidate_page_if_missing_change_pte function, another host
thread (thread 2) writes to the same page (this time without a page
fault). Then we have a valid entry in the secondary page table pointing
to a stale page, and someone (thread 3) may read stale data from there.

Here's a diagram that shows this scenario:

Thread 1                                  | Thread 2        | Thread 3
------------------------------------------+-----------------+--------------
do_wp_page(page 1)                        |                 |
  ...                                     |                 |
  set_pte_at_notify                       |                 |
  ...                                     | write to page 1 |
                                          |                 | stale access
  pte_unmap_unlock                        |                 |
  invalidate_page_if_missing_change_pte   |                 |

This is currently prevented by the use of the range start and range end
notifiers.

Do you agree that this scenario is possible with the new patch, or am I
missing something?


I believe you are right, but of all the upstream user of the mmu_notifier
API only xen would suffer from this ie any user that do not have a proper
change_pte callback can see the bogus scenario you describe above.
Yes. I sent our RDMA paging RFC patch-set on linux-rdma [1] last month, 
and it would also suffer from this scenario, but it's not upstream yet.

The issue i see is with user that want to/or might sleep when they are
invalidation the secondary page table. The issue being that change_pte is
call with the cpu page table locked (well at least for the affected pmd).

I would rather keep the invalidate_range_start/end bracket around change_pte
and invalidate page. I think we can fix the kvm regression by other means.
Perhaps another possibility would be to do the 
invalidate_range_start/end bracket only when the mmu_notifier is missing 
a change_pte implementation.


Best regards,
Haggai

[1] http://www.spinics.net/lists/linux-rdma/msg18906.html


Re: [PATCH 4/5] hmm: heterogeneous memory management v3

2014-07-30 Thread Haggai Eran
On 28/07/2014 18:39, Jerome Glisse wrote:
> On Mon, Jul 28, 2014 at 03:27:14PM +0300, Haggai Eran wrote:
>> On 14/06/2014 03:48, Jérôme Glisse wrote:> From: Jérôme Glisse 
>> 
>>>
>>> Motivation:
>>>
>>> ...
>>>
>>> The aim of the heterogeneous memory management is to provide a common API 
>>> that
>>> can be use by any such devices in order to mirror process address. The hmm 
>>> code
>>> provide an unique entry point and interface itself with the core mm code of 
>>> the
>>> linux kernel avoiding duplicate implementation and shielding device driver 
>>> code
>>> from core mm code.
>>>
>>> Moreover, hmm also intend to provide support for migrating memory to device
>>> private memory, allowing device to work on its own fast local memory. The 
>>> hmm
>>> code would be responsible to intercept cpu page fault on migrated range of 
>>> and
>>> to migrate it back to system memory allowing cpu to resume its access to the
>>> memory.
>>>
>>> Another feature hmm intend to provide is support for atomic operation for 
>>> the
>>> device even if the bus linking the device and the cpu do not have any such
>>> capabilities.
>>>
>>> We expect that graphic processing unit and network interface to be among the
>>> first users of such api.
>>
>> Hi,
>>
>> Sorry I'm only now replying to this email. I'm hoping my feedback is still 
>> relevant :)
>>
> 
> Any feedback is welcome.
> 
>> At Mellanox we are currently working on similar technology for avoiding
>> pinning memory for RDMA [1]. We currently have our own MMU notifier code
>> but once the HMM makes it into the kernel I hope we will be able to use it.
>>
>> I have a couple of questions below:
>>
>>>
>>> Hardware requirement:
>>>
>>> Because hmm is intended to be use by device driver there are minimum 
>>> features
>>> requirement for the hardware mmu :
>>>   - hardware have its own page table per process (can be share btw != 
>>> devices)
>>>   - hardware mmu support page fault and suspend execution until the page 
>>> fault
>>> is serviced by hmm code. The page fault must also trigger some form of
>>> interrupt so that hmm code can be call by the device driver.
>>>   - hardware must support at least read only mapping (otherwise it can not
>>> access read only range of the process address space).
>>>
>>> For better memory management it is highly recommanded that the device also
>>> support the following features :
>>>   - hardware mmu set access bit in its page table on memory access (like 
>>> cpu).
>>>   - hardware page table can be updated from cpu or through a fast path.
>>>   - hardware provide advanced statistic over which range of memory it access
>>> the most.
>>>   - hardware differentiate atomic memory access from regular access allowing
>>> to support atomic operation even on platform that do not have atomic
>>> support with there bus link with the device.
>>>
>>> Implementation:
>>>
>>> ...
>>
>>> +
>>> +/* struct hmm_event - used to serialize change to overlapping range of 
>>> address.
>>> + *
>>> + * @list:   List of pending|in progress event.
>>> + * @faddr:  First address (inclusive) for the range this event affect.
>>> + * @laddr:  Last address (exclusive) for the range this event affect.
>>> + * @iaddr:  First invalid address.
>>> + * @fences: List of device fences associated with this event.
>>> + * @etype:  Event type (munmap, migrate, truncate, ...).
>>> + * @backoff:Should this event backoff ie a new event render it 
>>> obsolete.
>>> + */
>>> +struct hmm_event {
>>> +   struct list_headlist;
>>> +   unsigned long   faddr;
>>> +   unsigned long   laddr;
>>> +   unsigned long   iaddr;
>>> +   struct list_headfences;
>>> +   enum hmm_etype  etype;
>>> +   boolbackoff;
>>
>> The backoff field is always being set to false in this patch, right? Is
>> it intended to be used only for device page migration?
> 
> Correct, migration to remote memory might happen concurently with other
> memory event that render migration pointless.
> 
> 
>>
>>> +};
>>> +
>&

Re: [RFC PATCH 1/6] mmu_notifier: add event information to address invalidation v4

2014-09-11 Thread Haggai Eran
On 29/08/2014 22:10, j.gli...@gmail.com wrote:
> + * - MMU_MUNMAP: the range is being unmapped (outcome of a munmap syscall or
> + * process destruction). However, access is still allowed, up until the
> + * invalidate_range_free_pages callback. This also implies that secondary
> + * page table can be trimmed, because the address range is no longer valid.

I couldn't find the invalidate_range_free_pages callback. Is that a leftover
from a previous version of the patch?

Also, I think that you have to invalidate the secondary PTEs of the range being 
unmapped immediately, because put_page may be called immediately after the 
invalidate_range_start returns.

Haggai


Re: [PATCH 4/5] hmm: heterogeneous memory management v3

2014-07-28 Thread Haggai Eran
On 14/06/2014 03:48, Jérôme Glisse wrote:> From: Jérôme Glisse 

> 
> Motivation:
> 
> ...
> 
> The aim of the heterogeneous memory management is to provide a common API that
> can be use by any such devices in order to mirror process address. The hmm 
> code
> provide an unique entry point and interface itself with the core mm code of 
> the
> linux kernel avoiding duplicate implementation and shielding device driver 
> code
> from core mm code.
> 
> Moreover, hmm also intend to provide support for migrating memory to device
> private memory, allowing device to work on its own fast local memory. The hmm
> code would be responsible to intercept cpu page fault on migrated range of and
> to migrate it back to system memory allowing cpu to resume its access to the
> memory.
> 
> Another feature hmm intend to provide is support for atomic operation for the
> device even if the bus linking the device and the cpu do not have any such
> capabilities.
> 
> We expect that graphic processing unit and network interface to be among the
> first users of such api.

Hi,

Sorry I'm only now replying to this email. I'm hoping my feedback is still 
relevant :)

At Mellanox we are currently working on similar technology for avoiding
pinning memory for RDMA [1]. We currently have our own MMU notifier code
but once HMM makes it into the kernel I hope we will be able to use it.

I have a couple of questions below:

> 
> Hardware requirement:
> 
> Because hmm is intended to be use by device driver there are minimum features
> requirement for the hardware mmu :
>   - hardware have its own page table per process (can be share btw != devices)
>   - hardware mmu support page fault and suspend execution until the page fault
> is serviced by hmm code. The page fault must also trigger some form of
> interrupt so that hmm code can be call by the device driver.
>   - hardware must support at least read only mapping (otherwise it can not
> access read only range of the process address space).
> 
> For better memory management it is highly recommanded that the device also
> support the following features :
>   - hardware mmu set access bit in its page table on memory access (like cpu).
>   - hardware page table can be updated from cpu or through a fast path.
>   - hardware provide advanced statistic over which range of memory it access
> the most.
>   - hardware differentiate atomic memory access from regular access allowing
> to support atomic operation even on platform that do not have atomic
> support with there bus link with the device.
> 
> Implementation:
> 
> ...

> +
> +/* struct hmm_event - used to serialize change to overlapping range of 
> address.
> + *
> + * @list:   List of pending|in progress event.
> + * @faddr:  First address (inclusive) for the range this event affect.
> + * @laddr:  Last address (exclusive) for the range this event affect.
> + * @iaddr:  First invalid address.
> + * @fences: List of device fences associated with this event.
> + * @etype:  Event type (munmap, migrate, truncate, ...).
> + * @backoff:Should this event backoff ie a new event render it obsolete.
> + */
> +struct hmm_event {
> + struct list_headlist;
> + unsigned long   faddr;
> + unsigned long   laddr;
> + unsigned long   iaddr;
> + struct list_headfences;
> + enum hmm_etype  etype;
> + boolbackoff;

The backoff field is always being set to false in this patch, right? Is
it intended to be used only for device page migration?

> +};
> +
> +
> +
> +
> +/* hmm_device - Each device driver must register one and only one hmm_device.
> + *
> + * The hmm_device is the link btw hmm and each device driver.
> + */
> +
> +/* struct hmm_device_operations - hmm device operation callback
> + */
> +struct hmm_device_ops {
> + /* device_destroy - free hmm_device (call when refcount drop to 0).
> +  *
> +  * @device: The device hmm specific structure.
> +  */
> + void (*device_destroy)(struct hmm_device *device);
> +
> + /* mirror_release() - device must stop using the address space.
> +  *
> +  * @mirror: The mirror that link process address space with the device.
> +  *
> +  * Called when as result of hmm_mirror_unregister or when mm is being
> +  * destroy.
> +  *
> +  * It's illegal for the device to call any hmm helper function after
> +  * this call back. The device driver must kill any pending device
> +  * thread and wait for completion of all of them.
> +  *
> +  * Note that even after this callback returns the device driver might
> +  * get call back from hmm. Callback will stop only once mirror_destroy
> +  * is call.
> +  */
> + void (*mirror_release)(struct hmm_mirror *hmm_mirror);
> +
> + /* mirror_destroy - free hmm_mirror (call when refcount drop to 0).
> +  *
> +  * @mirror: The mirror that link proc

Re: [PATCH] ib_umem_release should decrement mm->pinned_vm from ib_umem_get

2014-08-28 Thread Haggai Eran
On 26/08/2014 00:07, Shawn Bohrer wrote:
 The following patch fixes the issue by storing the mm_struct of the
>> > 
>> > You are doing more than just storing the mm_struct - you are taking
>> > a reference to the process' mm.  This can lead to a massive resource
>> > leakage. The reason is bit complex: The destruction flow for IB
>> > uverbs is based upon releasing the file handle for it. Once the file
>> > handle is released, all MRs, QPs, CQs, PDs, etc. that the process
>> > allocated are released.  For the kernel to release the file handle,
>> > the kernel reference count to it needs to reach zero.  Most IB
>> > implementations expose some hardware registers to the application by
>> > allowing it to mmap the uverbs device file.  This mmap takes a
>> > reference to uverbs device file handle that the application opened.
>> > This reference is dropped when the process mm is released during the
>> > process destruction.  Your code takes a reference to the mm that
>> > will only be released when the parent MR/QP is released.
>> > 
>> > Now, we have a deadlock - the mm is waiting for the MR to be
>> > destroyed, the MR is waiting for the file handle to be destroyed,
>> > and the file handle is waiting for the mm to be destroyed.
>> > 
>> > The proper solution is to keep a reference to the task_pid (using
>> > get_task_pid), and use this pid to get the task_struct and from it
>> > the mm_struct during the destruction flow.
>  
> I'll put together a patch using get_task_pid() and see if I can
> test/reproduce the issue.  This may take a couple of days since we
> have to test this in production at the moment.
>  

Hi,

I just wanted to point out that while working on the on demand paging patches
we also needed to keep a reference to the task pid (to make sure we always 
handle page faults on behalf of the correct mm struct). You can find the 
relevant code in the patch titled "IB/core: Add support for on demand paging 
regions" [1].

Haggai

[1] https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg20552.html


Re: [PATCH v1 for-next 00/16] On demand paging

2014-09-09 Thread Haggai Eran
On 04/09/2014 00:15, Roland Dreier wrote:
> Have you done any review or testing of these changes?  If so can you
> share the results?

We have tested this feature thoroughly inside Mellanox. We ran random
tests that performed MR registrations, memory mappings and unmappings,
calls to madvise with MADV_DONTNEED for invalidations, sending and
receiving of data, and RDMA operations. The test validated the integrity
of the data, and we verified the integrity of kernel memory by running
the tests under a debugging kernel.

Best regards,
Haggai



Re: [PATCH v1 for-next 00/16] On demand paging

2014-09-10 Thread Haggai Eran
On 09/09/2014 17:21, Haggai Eran wrote:
> On 04/09/2014 00:15, Roland Dreier wrote:
>> Have you done any review or testing of these changes?  If so can you
>> share the results?
> 
> We have tested this feature thoroughly inside Mellanox. We ran random
> tests that performed MR registrations, memory mappings and unmappings,
> calls to madvise with MADV_DONTNEED for invalidations, sending and
> receiving of data, and RDMA operations. The test validated the integrity
> of the data, and we verified the integrity of kernel memory by running
> the tests under a debugging kernel.

We wanted to add, regarding performance testing of these patches, that we
have tested ODP on several setups, including low-level RDMA micro-benchmarks,
MPI applications, and iSER. In all cases, ODP delivers the *same*
bare-metal performance as obtained with standard MRs, in terms of both
BW and latency. In addition, performance of standard MRs is not affected
by the presence of ODP applications.

The main benefits of ODP are the simplified programming model, simplified
management, and avoiding worst-case memory commitment.
For example, we were able to run multiple concurrent instances of iSER
targets, allowing over-commitment that otherwise wouldn’t be possible.
In the MPI case, both IMB (Pallas) and applications achieved the same
performance as the pin-down cache, with minimal memory locking
privileges while avoiding any glibc hooks for detecting invalidations.

Regards,
Haggai


Re: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-06 Thread Haggai Eran
On Friday, December 4, 2015 8:02 PM, Nicholas Krause  
wrote:
> To: dledf...@redhat.com
> Cc: sean.he...@intel.com; hal.rosenst...@gmail.com; Haggai Eran; 
> jguntho...@obsidianresearch.com; Matan Barak; yun.w...@profitbricks.com; 
> ted.h@oracle.com; Doron Tsur; Erez Shitrit; david.ah...@oracle.com; 
> linux-r...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] infiniband:core:Add needed error path in cm_init_av_by_path
> 
> This adds a needed error path in the function, cm_init_av_by_path
> after the call to ib_init_ah_from_path in order to avoid incorrectly
> accessing the path pointer of structure type ib_sa_path_rec if this
> function call fails to complete its intended work successfully by
> returning a error code.
> 
> Signed-off-by: Nicholas Krause 

The subject doesn't seem to match the convention, but apart from that,

Reviewed-by: Haggai Eran 

I wonder if this should go to stable. If I understand correctly, this will fail 
only when the SGID isn't found in the GID table, but such connections would 
fail later on when creating a QP, right?

Haggai


Re: [PATCH 3/3] IB/core: constify mmu_notifier_ops structures

2015-11-29 Thread Haggai Eran
On 30/11/2015 00:02, Julia Lawall wrote:
> This mmu_notifier_ops structure is never modified, so declare it as
> const, like the other mmu_notifier_ops structures.
> 
> Done with the help of Coccinelle.
> 
> Signed-off-by: Julia Lawall 

Reviewed-by: Haggai Eran 

Thanks,
Haggai



Re: RFC rdma cgroup

2015-10-29 Thread Haggai Eran
On 28/10/2015 10:29, Parav Pandit wrote:
> 3. Resources are not defined by the RDMA cgroup. Resources are defined
> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
> subsystem can evolve without the need of rdma cgroup update. A new
> resource can be easily added by the RDMA/IB subsystem without touching
> rdma cgroup.
Resources exposed by the cgroup are basically a UAPI, so we have to be
careful to make it stable when it evolves. I understand the need for
vendor specific resources, following the discussion on the previous
proposal, but could you describe how you plan to allow this set of
resources to evolve?

> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
> hw resource pool per such device.
> (Nothing stops to have more devices and pools, but design is around
> this use case).
In what way does the design depend on this assumption?

> 9. Resource pool object is created in following situations.
> (a) administrative operation is done to set the limit and no previous
> resource pool exist for the device of interest for the cgroup.
> (b) no resource limits were configured, but IB/RDMA subsystem tries to
> charge the resource. so that when applications are running without
> limits and later on when limits are enforced, during uncharging, it
> correctly uncharges them, otherwise usage count will drop to negative.
> This is done using default resource pool.
> Instead of implementing any sort of time markers, default pool
> simplifies the design.
Having a default resource pool kind of implies there is a non-default
one. Is the only difference between the default and non-default the fact
that the second was created with an administrative operation and has
specified limits, or is there some other difference?

> (c) When process migrate from one to other cgroup, resource is
> continue to be owned by the creator cgroup (rather css).
> After process migration, whenever new resource is created in new
> cgroup, it will be owned by new cgroup.
It sounds a little different from how other cgroups behave. I agree that
mostly processes will create the resources in their cgroup and won't
migrate, but why not move the charge during migration?

Finally, I wanted to ask about other limitations an RDMA cgroup could
handle. It would be great to be able to limit a container to using only
a subset of the MAC/VLAN pairs programmed to a device, or only a subset
of the P_Keys and GIDs it has. Do you see such limitations
also as part of this cgroup?

Thanks,
Haggai


Re: [PATCH 3/8] IB/odp: export rbt_ib_umem_for_each_in_range()

2015-07-21 Thread Haggai Eran
On 17/07/2015 22:01, Jérôme Glisse wrote:
> The mlx5 driver will need this function for its driver specific bit
> of ODP (on demand paging) on HMM (Heterogeneous Memory Management).
> 
> Signed-off-by: Jérôme Glisse 
> ---
>  drivers/infiniband/core/umem_rbtree.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/infiniband/core/umem_rbtree.c 
> b/drivers/infiniband/core/umem_rbtree.c
> index 727d788..f030ec0 100644
> --- a/drivers/infiniband/core/umem_rbtree.c
> +++ b/drivers/infiniband/core/umem_rbtree.c
> @@ -92,3 +92,4 @@ int rbt_ib_umem_for_each_in_range(struct rb_root *root,
>  
>   return ret_val;
>  }
> +EXPORT_SYMBOL(rbt_ib_umem_for_each_in_range);
> 

Perhaps it would be better if the driver didn't access the internal
rbtree directly, and instead used something like an
ib_mirror_for_each_range() to do this.


Re: [PATCH 4/8] IB/odp/hmm: add new kernel option to use HMM for ODP.

2015-07-21 Thread Haggai Eran
On 17/07/2015 22:01, Jérôme Glisse wrote:
> This is a preparatory patch for HMM implementation of ODP (on demand
> paging). It introduce a new configure option and add proper build
> time conditional code section. Enabling INFINIBAND_ON_DEMAND_PAGING_HMM
> will result in build error with this patch.
> 
> Signed-off-by: Jérôme Glisse 
> ---
...
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 3da0b16..765aeb3 100644

> @@ -100,28 +124,13 @@ void ib_umem_odp_release(struct ib_umem *umem);
>  
>  #define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
>  
> +
Please avoid adding double blank lines. You can find more of these by
running checkpatch on the patch.

Other than that:
Reviewed-by: Haggai Eran 


Re: [PATCH 6/8] IB/mlx5/hmm: add mlx5 HMM device initialization and callback v3.

2015-07-21 Thread Haggai Eran
On 17/07/2015 22:01, Jérôme Glisse wrote:
> diff --git a/drivers/infiniband/core/umem_odp.c 
> b/drivers/infiniband/core/umem_odp.c
> index ac87ac6..c5e7461 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -134,7 +134,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct 
> ib_umem *umem)
>   return -ENOMEM;
>   }
>   kref_init(&ib_mirror->kref);
> - init_rwsem(&ib_mirror->hmm_mr_rwsem);
> + init_rwsem(&ib_mirror->umem_rwsem);
>   ib_mirror->umem_tree = RB_ROOT;
>   ib_mirror->ib_device = ib_device;
>  

I think this line can be squashed into the previous patch.


Re: [PATCH 6/8] IB/mlx5/hmm: add mlx5 HMM device initialization and callback v3.

2015-07-21 Thread Haggai Eran
On 17/07/2015 22:01, Jérôme Glisse wrote:
> @@ -151,10 +151,11 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct 
> ib_umem *umem)
>   context->ib_mirror = ib_mirror_ref(ib_mirror);
>   }
>   mutex_unlock(&ib_device->hmm_mutex);
> - umem->odp_data.ib_mirror = ib_mirror;
> + umem->odp_data->ib_mirror = ib_mirror;
>  
>   down_write(&ib_mirror->umem_rwsem);
> - rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
> + rbt_ib_umem_insert(&umem->odp_data->interval_tree,
> +&ib_mirror->umem_tree);
>   up_write(&ib_mirror->umem_rwsem);
>  
>   mmput(mm);
> @@ -163,7 +164,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct 
> ib_umem *umem)
>  
>  void ib_umem_odp_release(struct ib_umem *umem)
>  {
> - struct ib_mirror *ib_mirror = umem->odp_data;
> + struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
>  
>   /*
>* Ensure that no more pages are mapped in the umem.

It doesn't look like this code would have compiled before this patch,
and as far as I can see the previous patch removed the #error line.
Could you make sure all of the patches build correctly? You could use
tools/testing/ktest for instance.


Re: [PATCHv3 1/3] rdmacg: Added rdma cgroup controller.

2016-02-10 Thread Haggai Eran
On 01/02/2016 20:59, Parav Pandit wrote:
> On Tue, Feb 2, 2016 at 12:10 AM, Tejun Heo  wrote:
>> So, I'm really not gonna go for individual drivers defining resources
>> on their own.  That's a trainwreck waiting to happen.  There needs to
>> be a lot more scrutiny than that.
>>
> Not every low level driver. I started with that infrastructure in
> v2,v3 but I got your inputs and
> I align with that. It could be just single IB stack in one header file
> in one enum list would be sufficient.
> You have already given that example.
> With that mostly two resource type that I have can also shrink to just
> single type.
> Will wait to hear from them, in case if they have any different thought.

Hi,

Sorry for the late reply.

I think that starting with the standard set of resources that uverbs
provides is good, and if we need new types of resources in the future,
we can add them later.

On 31/01/2016 19:50, Parav Pandit wrote:
> How would you like to see RDMA verb resources being defined - in RDMA
> cgroup or in IB stack?
> In current patch v5, its defined by the IB stack which is often
> shipped as different package due to high amount of changes, bug fixes,
> features.
> In v0 patch it was defined by the RDMA cgroup, which means any new
> resource addition/definition requires kernel upgrade. Which is hard to
> change often.

There is indeed an effort to backport the latest RDMA subsystem modules to 
older kernels, and it would be preferable to be able to introduce new
resources through these modules. However, I understand that there are no
other cgroups that are modules or depend on modules this way, so I would
understand if you decide against it.

> If resources are defined by RDMA cgroup kernel than adding/removing
> resource means, someone have to do lot of engineering with different
> versions of kernel support and loadable module support using compat.h
> etc at driver level, which in my mind is some amount of engineering
> compare to what v5 has to offer and its already available. With one
> round of cleanup in resource definition, it should be usable.
If I understand correctly, if the resources are defined in the cgroup,
you simply won't be able to add new resources with a module update,
no matter how hard you work.

I agree that if the cgroup code is changed for cleanup or whatever 
reason, the backporting may become difficult, but that's just life.

> Its important to provide this feedback to Tejun and me, so that we
> take informed design decision.

Sure. I hope this patchset gets accepted eventually, as I believe it 
solves a real problem. Today an RDMA application can easily hog these
resources, and the rdma cgroup allows users to prevent that.

Regards,
Haggai


Re: [PATCHv6 3/3] rdmacg: Added documentation for rdmacg

2016-02-28 Thread Haggai Eran
On 24/02/2016 17:21, Parav Pandit wrote:
> On Wed, Feb 24, 2016 at 7:56 PM, Haggai Eran  wrote:
>> On 20/02/2016 13:00, Parav Pandit wrote:
>>> Added documentation for v1 and v2 version describing high
>>> level design and usage examples on using rdma controller.
>>>
>>> Signed-off-by: Parav Pandit 
>>
>> I think you might want to mention that resource limits are reflected
>> in the results returned from ib_uverbs_query_device/ibv_query_device
>> or printed from "ibv_devinfo -v".
>>
> Its valid point.
> Since this documentation is for rdma controller, I was wondering
> should I have it this documentation or should I add the uverbs_cmds.c?

I was thinking it should be in the documentation because an application
developer might look there first, without reading uverbs_cmd.c.
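
For example, something as simple as this (a sketch, assuming the limits
are indeed reflected through the regular query path) is what an
application would typically do:

	#include <stdio.h>
	#include <infiniband/verbs.h>

	/* ctx is an already-opened struct ibv_context */
	struct ibv_device_attr attr;

	if (!ibv_query_device(ctx, &attr))
		printf("effective max QPs for this process: %d\n", attr.max_qp);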

Haggai



Re: [PATCHv6 2/3] IB/core: added support to use rdma cgroup controller

2016-02-24 Thread Haggai Eran
On 20/02/2016 13:00, Parav Pandit wrote:
> +/**
> + * ib_device_unregister_rdmacg - unregister with rdma cgroup.
> + * @device: device to unregister.
> + *
> + * Unregister with the rdma cgroup. Should be called after
> + * all the resources are deallocated, and after a stage when any
> + * other resource allocation of user application cannot be done
> + * for this device to avoid any leak in accounting.
> + * HCA drivers should clear resource pool ops after ib stack
> + * unregisters with rdma cgroup.
HCA drivers don't supply their own ops in this version, right?
If so, you can update the comment.

> + */
> +void ib_device_unregister_rdmacg(struct ib_device *device)
> +{
> + rdmacg_unregister_device(&device->cg_device);
> +}

> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index 179e813..c3bd24c 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -352,10 +352,22 @@ int ib_register_device(struct ib_device *device,
>   goto out;
>   }
>  
> +#ifdef CONFIG_CGROUP_RDMA
> + ret = ib_device_register_rdmacg(device);
> + if (ret) {
> + printk(KERN_WARNING "Couldn't set up InfiniBand P_Key/GID 
> cache\n");
You should update the error string, and I think checkpatch recommends
using pr_warn().

> + ib_cache_cleanup_one(device);
> + goto out;
> + }
> +#endif
> +
>   ret = ib_device_register_sysfs(device, port_callback);
>   if (ret) {
>   printk(KERN_WARNING "Couldn't register device %s with driver 
> model\n",
>  device->name);
> +#ifdef CONFIG_CGROUP_RDMA
> + ib_device_unregister_rdmacg(device);
> +#endif
>   ib_cache_cleanup_one(device);
>   goto out;
>   }
> @@ -405,6 +417,10 @@ void ib_unregister_device(struct ib_device *device)
>  
>   mutex_unlock(&device_mutex);
>  
> +#ifdef CONFIG_CGROUP_RDMA
> + ib_device_unregister_rdmacg(device);
> +#endif
> +
>   ib_device_unregister_sysfs(device);
>   ib_cache_cleanup_one(device);
>  
> diff --git a/drivers/infiniband/core/uverbs_cmd.c 
> b/drivers/infiniband/core/uverbs_cmd.c
> index 1c02dea..7c51e8a 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c

> @@ -1777,6 +1851,12 @@ static int create_qp(struct ib_uverbs_file *file,
> &qp_lock_class);
>   down_write(&obj->uevent.uobject.mutex);
>  
> + pd  = idr_read_pd(cmd->pd_handle, file->ucontext);
> + if (!pd) {
> + ret = -EINVAL;
> + goto err_put;
> + }
> +
I'm not sure I understand why you need to keep the PD here. Why 
don't you use the same ib_device that is used to create the QP? 
The same comment also applies to other uverbs commands.

>   if (cmd->qp_type == IB_QPT_XRC_TGT) {
>   xrcd = idr_read_xrcd(cmd->pd_handle, file->ucontext,
>&xrcd_uobj);
> @@ -1811,8 +1891,7 @@ static int create_qp(struct ib_uverbs_file *file,
>  
>   scq = idr_read_cq(cmd->send_cq_handle, file->ucontext, !!rcq);
>   rcq = rcq ?: scq;
> - pd  = idr_read_pd(cmd->pd_handle, file->ucontext);
> - if (!pd || !scq) {
> + if (!scq) {
>   ret = -EINVAL;
>   goto err_put;
>   }
> @@ -1858,6 +1937,11 @@ static int create_qp(struct ib_uverbs_file *file,
>   goto err_put;
>   }
>  
> + ret = ib_rdmacg_try_charge(&obj->uevent.uobject.cg_obj, pd->device,
> +RDMA_VERB_RESOURCE_QP, 1);
> + if (ret)
> + goto err_put;
> +
>   if (cmd->qp_type == IB_QPT_XRC_TGT)
>   qp = ib_create_qp(pd, &attr);
>   else
> @@ -1865,7 +1949,7 @@ static int create_qp(struct ib_uverbs_file *file,
>  
>   if (IS_ERR(qp)) {
>   ret = PTR_ERR(qp);
> - goto err_put;
> + goto err_create;
>   }
>  
>   if (cmd->qp_type != IB_QPT_XRC_TGT) {
> @@ -1940,6 +2024,10 @@ err_cb:
>  err_destroy:
>   ib_destroy_qp(qp);
>  
> +err_create:
> + ib_rdmacg_uncharge(&obj->uevent.uobject.cg_obj, device,
> +RDMA_VERB_RESOURCE_QP, 1);
> +
>  err_put:
>   if (xrcd)
>   put_xrcd_read(xrcd_uobj);

> @@ -3323,6 +3441,11 @@ static int __uverbs_create_xsrq(struct ib_uverbs_file 
> *file,
>   obj->uevent.events_reported = 0;
>   INIT_LIST_HEAD(&obj->uevent.event_list);
>  
> + ret = ib_rdmacg_try_charge(&obj->uevent.uobject.cg_obj, pd->device,
> +RDMA_VERB_RESOURCE_SRQ, 1);
> + if (ret)
> + goto err_put_cq;
> +
I think you need a new error label that releases the PD IDR but does not
uncharge.

>   srq = pd->device->create_srq(pd, &attr, udata);
>   if (IS_ERR(srq)) {
>   ret = PTR_ERR(srq);
> @@ -3387,6 +3510,8 @@ err_destroy:
>  

Re: [PATCHv6 3/3] rdmacg: Added documentation for rdmacg

2016-02-24 Thread Haggai Eran
On 20/02/2016 13:00, Parav Pandit wrote:
> Added documentation for v1 and v2 version describing high
> level design and usage examples on using rdma controller.
> 
> Signed-off-by: Parav Pandit 

I think you might want to mention that resource limits are reflected
in the results returned from ib_uverbs_query_device/ibv_query_device 
or printed from "ibv_devinfo -v".



Re: [PATCHv6 1/3] rdmacg: Added rdma cgroup controller

2016-02-24 Thread Haggai Eran
Hi,

Overall the patch looks good to me. I have a few comments below.

On 20/02/2016 13:00, Parav Pandit wrote:
> Resource pool is created/destroyed dynamically whenever
> charging/uncharging occurs respectively and whenever user
> configuration is done. Its a tradeoff of memory vs little more code
Its -> It's
> space that creates resource pool whenever necessary,
> instead of creating them during cgroup creation and device registration
> time.
> 
> Signed-off-by: Parav Pandit 

> diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
> new file mode 100644
> index 000..b370733
> --- /dev/null
> +++ b/include/linux/cgroup_rdma.h

> +struct rdmacg_device {
> + struct rdmacg_pool_info pool_info;
> + struct list_headrdmacg_list;
> + struct list_headrpool_head;
> + spinlock_t  rpool_lock; /* protects rsource pool list */
rsource -> resource

> + char*name;
> +};
> +
> +/* APIs for RDMA/IB stack to publish when a device wants to
> + * participate in resource accounting
> + */
> +int rdmacg_register_device(struct rdmacg_device *device);
> +void rdmacg_unregister_device(struct rdmacg_device *device);
> +
> +/* APIs for RDMA/IB stack to charge/uncharge pool specific resources */
> +int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
> +   struct rdmacg_device *device,
> +   int resource_index,
> +   int num);
> +void rdmacg_uncharge(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device,
> +  int resource_index,
> +  int num);
> +void rdmacg_query_limit(struct rdmacg_device *device,
> + int *limits, int max_count);
You can drop the max_count parameter, and require the caller to
always provide pool_info->table_len items, couldn't you?

> +
> +#endif   /* CONFIG_CGROUP_RDMA */
> +#endif   /* _CGROUP_RDMA_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 0df0336a..d0e597c 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -56,6 +56,10 @@ SUBSYS(hugetlb)
>  SUBSYS(pids)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_CGROUP_RDMA)
> +SUBSYS(rdma)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index 2232080..1741b65 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1054,6 +1054,16 @@ config CGROUP_PIDS
> since the PIDs limit only affects a process's ability to fork, not to
> attach to a cgroup.
>  
> +config CGROUP_RDMA
> + bool "RDMA controller"
> + help
> +   Provides enforcement of RDMA resources defined by IB stack.
> +   It is fairly easy for consumers to exhaust RDMA resources, which
> +   can result into resource unavailibility to other consumers.
unavailibility -> unavailability
> +   RDMA controller is designed to stop this from happening.
> +   Attaching processes with active RDMA resources to the cgroup
> +   hierarchy is allowed even if can cross the hierarchy's limit.
> +
>  config CGROUP_FREEZER
>   bool "Freezer controller"
>   help
> diff --git a/kernel/Makefile b/kernel/Makefile
> index baa55e5..501f5df 100644

> +/**
> + * uncharge_cg_resource - uncharge resource for rdma cgroup
> + * @cg: pointer to cg to uncharge and all parents in hierarchy
It only uncharges a single cg, right?
> + * @device: pointer to ib device
> + * @index: index of the resource to uncharge in cg (resource pool)
> + * @num: the number of rdma resource to uncharge
> + *
> + * It also frees the resource pool in the hierarchy for the resource pool
> + * which was created as part of charing operation.
charing -> charging
> + */
> +static void uncharge_cg_resource(struct rdma_cgroup *cg,
> +  struct rdmacg_device *device,
> +  int index, int num)
> +{
> + struct rdmacg_resource_pool *rpool;
> + struct rdmacg_pool_info *pool_info = &device->pool_info;
> +
> + spin_lock(&cg->rpool_list_lock);
> + rpool = find_cg_rpool_locked(cg, device);
Is it possible for rpool to be NULL?

> +
> + /*
> +  * A negative count (or overflow) is invalid,
> +  * it indicates a bug in the rdma controller.
> +  */
> + rpool->resources[index].usage -= num;
> +
> + WARN_ON_ONCE(rpool->resources[index].usage < 0);
> + rpool->refcnt--;
> + if (rpool->refcnt == 0 && rpool->num_max_cnt == pool_info->table_len) {
> + /*
> +  * No user of the rpool and all entries are set to max, so
> +  * safe to delete this rpool.
> +  */
> + list_del(&rpool->cg_list);
> + spin_unlock(&cg->rpool_list_lock);
> +
> + free_cg_rpool(rpool);
> + return;
> + }
> + spin_unlock(&cg->rpool_list_lock);
> +}

> +/**
> + * charge_cg_resource - charg

Re: [PATCHv6 1/3] rdmacg: Added rdma cgroup controller

2016-02-25 Thread Haggai Eran
On 24/02/2016 18:16, Parav Pandit wrote:
>>> + struct rdmacg_resource_pool *rpool;
>>> + struct rdmacg_pool_info *pool_info = &device->pool_info;
>>> +
>>> + spin_lock(&cg->rpool_list_lock);
>>> + rpool = find_cg_rpool_locked(cg, device);
>> Is it possible for rpool to be NULL?
>>
> Unlikely, unless we have but in cgroup implementation.
> It may be worth to add WARN_ON and return from here to avoid kernel crash.
Sounds good.

>>> +static int charge_cg_resource(struct rdma_cgroup *cg,
>>> +   struct rdmacg_device *device,
>>> +   int index, int num)
>>> +{
>>> + struct rdmacg_resource_pool *rpool;
>>> + s64 new;
>>> + int ret = 0;
>>> +
>>> +retry:
>>> + spin_lock(&cg->rpool_list_lock);
>>> + rpool = find_cg_rpool_locked(cg, device);
>>> + if (!rpool) {
>>> + spin_unlock(&cg->rpool_list_lock);
>>> + ret = alloc_cg_rpool(cg, device);
>>> + if (ret)
>>> + goto err;
>>> + else
>>> + goto retry;
>> Instead of retrying after allocation of a new rpool, why not just return the
>> newly allocated rpool (or the existing one) from alloc_cg_rpool?
> 
> It can be done, but locking semantics just becomes difficult to
> review/maintain with that where alloc_cg_rpool will unlock and lock
> conditionally later on.
Maybe I'm missing something, but couldn't you simply lock rpool_list_lock
inside alloc_cg_rpool()? It already does that around its call to 
find_cg_rpool_locked() and the insertion to cg_list.

> This path will be hit anyway on first allocation typically. Once
> application is warm up, it will be unlikely to enter here.
> I should change if(!rpool) to if (unlikely(!rpool)).
Theoretically the new allocated rpool can be released again by the time you
get to the second call to find_cg_rpool_locked().

>>> + spin_lock(&cg->rpool_list_lock);
>>> + rpool = find_cg_rpool_locked(cg, device);
>>> + if (!rpool) {
>>> + spin_unlock(&cg->rpool_list_lock);
>>> + ret = alloc_cg_rpool(cg, device);
>>> + if (ret)
>>> + goto opt_err;
>>> + else
>>> + goto retry;
>> You can avoid the retry here too. Perhaps this can go into a function.
>>
> In v5 I had wrapper around code which used to similar hiding using
> get_cg_rpool and put_cg_rpool helper functions.
> But Tejun was of opinion that I should have locks outside of all those
> functions. With that approach, this is done.
> So I think it's OK to have it this way.
I thought that was about functions that only locked the lock, called the 
find function, and released the lock. What I'm suggesting is to have one
function that does "lock + find + allocate if needed + unlock", and another
function that does (under caller's lock) "check ref count + check max count +
release rpool".

>>> + }
>>> +
>>> + /* now set the new limits of the rpool */
>>> + while (enables) {
>>> + /* if user set the limit, enables bit is set */
>>> + if (enables & BIT(i)) {
>>> + enables &= ~BIT(i);
>>> + set_resource_limit(rpool, i, new_limits[i]);
>>> + }
>>> + i++;
>>> + }
>>> + if (rpool->refcnt == 0 &&
>>> + rpool->num_max_cnt == pool_info->table_len) {
>>> + /*
>>> +  * No user of the rpool and all entries are
>>> +  * set to max, so safe to delete this rpool.
>>> +  */
>>> + list_del(&rpool->cg_list);
>>> + spin_unlock(&cg->rpool_list_lock);
>>> + free_cg_rpool(rpool);
>>> + } else {
>>> + spin_unlock(&cg->rpool_list_lock);
>>> + }
>> You should consider putting this piece of code in a function (the
>> check of the reference counts and release of the rpool).
>>
> Yes. I did. Same as above comment. Also this function will have to
> unlock. Its usually better to lock/unlock from same function level,
> instead of locking at one level and unlocking from inside the
> function.
> Or
> I should have
> cg_rpool_cond_free_unlock() for above code (check of the reference
> counts and release of the rpool)?
It is confusing to lock and unlock in different contexts. Why not lock
in the caller context? free_cg_rpool() can be called under rpool_list_lock,
couldn't it? It locks device->rpool_lock, but uncharge_cg_resource() also
locks both in the same order.

Thanks,
Haggai


Re: [PATCHv6 1/3] rdmacg: Added rdma cgroup controller

2016-02-25 Thread Haggai Eran
On 25/02/2016 15:34, Parav Pandit wrote:
> On Thu, Feb 25, 2016 at 5:33 PM, Haggai Eran  wrote:
>>>>> +retry:
>>>>> + spin_lock(&cg->rpool_list_lock);
>>>>> + rpool = find_cg_rpool_locked(cg, device);
>>>>> + if (!rpool) {
>>>>> + spin_unlock(&cg->rpool_list_lock);
>>>>> + ret = alloc_cg_rpool(cg, device);
>>>>> + if (ret)
>>>>> + goto err;
>>>>> + else
>>>>> + goto retry;
>>>> Instead of retrying after allocation of a new rpool, why not just return 
>>>> the
>>>> newly allocated rpool (or the existing one) from alloc_cg_rpool?
>>>
>>> It can be done, but locking semantics just becomes difficult to
>>> review/maintain with that where alloc_cg_rpool will unlock and lock
>>> conditionally later on.
>> Maybe I'm missing something, but couldn't you simply lock rpool_list_lock
>> inside alloc_cg_rpool()? It already does that around its call to
>> find_cg_rpool_locked() and the insertion to cg_list.
> 
> No. ref_count and usage counters are updated at level where lock is
> taken in charge_cg_resource().
> If I move locking rpool_list_lock inside alloc_cg_rpool, unlocking
> will continue outside, alloc_cg_rpool() when its found or allocated.
> As you acknowledged in below comment that this makes confusing to
> lock/unlock from different context, I think current implementation
> achieves both.
> (a) take lock from single context
> (b) keep functionality of find and alloc in two separate individual functions

Okay, fair enough.

>> I thought that was about functions that only locked the lock, called the
>> find function, and released the lock. What I'm suggesting is to have one
>> function that does "lock + find + allocate if needed + unlock",
> 
> I had similar function in past which does,
> "lock + find + allocate if needed + + inc_ref_cnt + unlock", (get_cg_rpool)
> update usage_counter atomically, because other thread/process might update 
> too.
> check atomic_dec_cnt - on reaching zero, "lock + del_entry + unlock + free".
> 
> Tejun asked to simplify this to,
> 
> "lock + find + allocate if needed + inc_ref_cnt_without_atomic" + unlock".
> which I did in this patch v6.

Okay.

>> and another
>> function that does (under caller's lock) "check ref count + check max count +
>> release rpool".
> This can be done. Have one dumb basic question for that.
> Can we call kfree() with spin_lock held? All these years I tend to
> avoid doing so.
> 

I think so. This is an old link but I think it still applies: 
https://lkml.org/lkml/2004/11/21/130


Re: [PATCHv6 1/3] rdmacg: Added rdma cgroup controller

2016-02-25 Thread Haggai Eran
On 25/02/2016 16:26, Parav Pandit wrote:
>> Can we call kfree() with spin_lock held? All these years I tend to
>> avoid doing so.
> Also it doesn't look correct to hold the lock while freeing the memory
> which is totally unrelated to the lock.
> With that I think current code appears ok with exception that its
> duplicated at two place for code readability around lock.
> What say?
Yes, the only thing that bothered me there was the duplication.
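For example, the duplicated block could become a small helper that runs under
the caller's rpool_list_lock and only tells the caller whether to free the
rpool after unlocking (a sketch only; the helper name is made up, it just
reuses the fields and helpers from the patch):

/*
 * Sketch: called with cg->rpool_list_lock held. Unlinks the rpool when it
 * has no users and all limits are at the default max; the caller frees it
 * after dropping the lock.
 */
static bool cg_rpool_unlink_if_unused(struct rdmacg_resource_pool *rpool,
				      struct rdmacg_pool_info *pool_info)
{
	if (rpool->refcnt != 0 ||
	    rpool->num_max_cnt != pool_info->table_len)
		return false;

	list_del(&rpool->cg_list);
	return true;
}

Both call sites would then look like:

	spin_lock(&cg->rpool_list_lock);
	...
	free_it = cg_rpool_unlink_if_unused(rpool, pool_info);
	spin_unlock(&cg->rpool_list_lock);
	if (free_it)
		free_cg_rpool(rpool);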



Re: RFC rdma cgroup

2015-11-02 Thread Haggai Eran
On 29/10/2015 20:46, Parav Pandit wrote:
> On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran  wrote:
>> On 28/10/2015 10:29, Parav Pandit wrote:
>>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>>> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
>>> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
>>> subsystem can evolve without the need of rdma cgroup update. A new
>>> resource can be easily added by the RDMA/IB subsystem without touching
>>> rdma cgroup.
>> Resources exposed by the cgroup are basically a UAPI, so we have to be
>> careful to make it stable when it evolves. I understand the need for
>> vendor specific resources, following the discussion on the previous
>> proposal, but could you write on how you plan to allow these set of
>> resources to evolve?
> 
> Its fairly simple.
> Here is the code snippet on how resources are defined in my tree.
> It doesn't have the RSS work queues yet, but can be added right after
> this patch.
> 
> Resource are defined as index and as match_table_t.
> 
> enum rdma_resource_type {
> RDMA_VERB_RESOURCE_UCTX,
> RDMA_VERB_RESOURCE_AH,
> RDMA_VERB_RESOURCE_PD,
> RDMA_VERB_RESOURCE_CQ,
> RDMA_VERB_RESOURCE_MR,
> RDMA_VERB_RESOURCE_MW,
> RDMA_VERB_RESOURCE_SRQ,
> RDMA_VERB_RESOURCE_QP,
> RDMA_VERB_RESOURCE_FLOW,
> RDMA_VERB_RESOURCE_MAX,
> };
> So UAPI RDMA resources can evolve by just adding more entries here.
Are the names that appear in userspace also controlled by uverbs? What
about the vendor specific resources?
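For example, if the user-visible names come from a match_table_t along these
lines (the strings below are only a guess, following linux/parser.h
conventions, not the actual patch), then those strings are effectively UAPI
too and need the same stability guarantees as the enum:

#include <linux/parser.h>

/* Hypothetical token strings; whatever the IB stack exposes becomes UAPI. */
static match_table_t rdmacg_resource_tokens = {
	{RDMA_VERB_RESOURCE_UCTX,	"uctx=%d"},
	{RDMA_VERB_RESOURCE_AH,		"ah=%d"},
	{RDMA_VERB_RESOURCE_PD,		"pd=%d"},
	{RDMA_VERB_RESOURCE_CQ,		"cq=%d"},
	{RDMA_VERB_RESOURCE_MR,		"mr=%d"},
	{RDMA_VERB_RESOURCE_MW,		"mw=%d"},
	{RDMA_VERB_RESOURCE_SRQ,	"srq=%d"},
	{RDMA_VERB_RESOURCE_QP,		"qp=%d"},
	{RDMA_VERB_RESOURCE_FLOW,	"flow=%d"},
	{RDMA_VERB_RESOURCE_MAX,	NULL},
};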

>>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
>>> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
>>> hw resource pool per such device.
>>> (Nothing stops to have more devices and pools, but design is around
>>> this use case).
>> In what way does the design depend on this assumption?
> 
> Current code when performs resource charging/uncharging, it needs to
> identify the resource pool which one to charge to.
> This resource pool is maintained as list_head and so its linear search
> per device.
> If we are thinking of hundreds of RDMA devices per container, then linear
> search will not be a good way and a different data structure needs to be
> deployed.
Okay, sounds fine to me.

>>> (c) When process migrate from one to other cgroup, resource is
>>> continue to be owned by the creator cgroup (rather css).
>>> After process migration, whenever new resource is created in new
>>> cgroup, it will be owned by new cgroup.
>> It sounds a little different from how other cgroups behave. I agree that
>> mostly processes will create the resources in their cgroup and won't
>> migrate, but why not move the charge during migration?
>>
> With fork() process doesn't really own the resource (unlike other file
> and socket descriptors).
> Parent process might have died also.
> There is possibly no clear way to transfer resource to right child.
> Child that cgroup picks might not even want to own RDMA resources.
> RDMA resources might be allocated by one process and freed by other
> process (though this might not be the way they use it).
> Its pretty similar to other cgroups with exception in migration area,
> such exception comes from different behavior of how RDMA resources are
> owned, created and used.
> Recent unified hierarchy patch from Tejun equally highlights to not
> frequently migrate processes among cgroups.
> 
> So in current implementation, (like other),
> if process created a RDMA resource, forked a child.
> child and parent both can allocate and free more resources.
> child moved to different cgroup. But resource is shared among them.
> child can free also the resource. All crazy combinations are possible
> in theory (without much use cases).
> So at best they are charged to the first cgroup css in which
> parent/child are created and reference is hold to CSS.
> cgroup, process can die, but css remains until RDMA resources are freed.
> This is similar to process behavior where task struct is release but
> id is hold up for a while.

I guess there aren't a lot of options when the resources can belong to
multiple cgroups. So after migrating, new resources will belong to the
new cgroup or the old one?

>> I finally wanted to ask about other limitations an RDMA cgroup could
>> handle. It would be great to be able to limit a container to be allowed
>> to use only a subset of the MAC/VLAN pairs programmed to a device,
> 
> Truly. I agree. That was one of the prime reason I originally ha

Re: RFC rdma cgroup

2015-11-04 Thread Haggai Eran
On 03/11/2015 21:11, Parav Pandit wrote:
> So it looks like below,
> #cat rdma.resources.verbs.list
> Output:
> mlx4_0 uctx ah pd cq mr mw srq qp flow
> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
What happens if you set a limit of rss_wq to mlx4_0 in this example?
Would it fail? I think it would be simpler for administrators if they
can configure every resource supported by uverbs. If a resource is not
supported by a specific device, you can never go over the limit anyway.

> #cat rdma.resources.hw.list
> hfi1 hw_qp hw_mr sw_pd
> (This particular one is a hypothetical example, I haven't actually coded
> this, unlike uverbs which is real).
Sounds fine to me. We will need to be careful to make sure that driver
maintainers don't break backward compatibility with this interface.

>> I guess there aren't a lot of options when the resources can belong to
>> multiple cgroups. So after migrating, new resources will belong to the
>> new cgroup or the old one?
> Resource always belongs to the cgroup in which its created, regardless
> of process migration.
> Again, it's owned at the css level instead of the cgroup. Therefore the
> original cgroup can also be deleted, but an internal reference to the data
> structure is kept and it is freed when the last rdma resource is freed.
Okay.

>>> For applications that doesn't use RDMA-CM, query_device and query_port
>>> will filter out the GID entries based on the network namespace in
>>> which caller process is running.
>> This could work well for RoCE, as each entry in the GID table is
>> associated with a net device and a network namespace. However, in
>> InfiniBand, the GID table isn't directly related to the network
>> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
>> namespace by the set of IPoIB netdevs in the network namespace, but
>> InfiniBand is designed to also work without IPoIB, so I don't think it's
>> a good idea.
> Got it. Yeah, this code can be under if(device_type RoCE).
IIRC there's a core capability for the new GID table code that contains
namespace, so you can use that.

>> I think it would be better to allow each cgroup to limit the pkeys and
>> gids its processes can use.
> 
> o.k. So the use case is P_Key? So I believe requirement would similar
> to device cgroup.
> Where set of GID table entries are configured as white list entries.
> and when they are queried or used during create_ah or modify_qp, its
> compared against the white list (or in other words as ACL).
> If they are found in ACL, they are reported in query_device or in
> create_ah, modify_qp. If not, those calls are failed with
> appropriate status?
> Does this look ok? 
Yes, that sounds good to me.

> Can we address requirement as additional feature just after first path?
> Tejun had some other idea on this kind of requirement, and I need to
> discuss with him.
Of course. I think there's use for the RDMA cgroup even without a pkey
or GID ACL, just to make sure one application doesn't hog hardware
resources.

>>> One of the idea I was considering is: to create virtual RDMA device
>>> mapped to physical device.
>>> And configure GID count limit via configfs for each such device.
>> You could probably achieve what you want by creating a virtual RDMA
>> device and use the device cgroup to limit access to it, but it sounds to
>> me like an overkill.
> 
> Actually not much. Basically this virtual RDMA device points to the
> struct device of the physical device itself.
> So only overhead is linking this structure to native device structure
> and  passing most of the calls to native ib_device with thin filter
> layer in control path.
> post_send/recv/poll_cq will directly go native device and same performance.
Still, I think we already have code that wraps ib_device calls for
userspace, which is the ib_uverbs module. There's no need for an extra
layer.

Regards,
Haggai


Re: [PATCH 1/1] RDMA/odp: convert to use HMM for ODP

2019-02-12 Thread Haggai Eran
On 2/12/2019 6:11 PM, Jerome Glisse wrote:
> On Wed, Feb 06, 2019 at 08:44:26AM +0000, Haggai Eran wrote:
>> On 1/29/2019 6:58 PM, jgli...@redhat.com wrote:
>>   > Convert ODP to use HMM so that we can build on common infrastructure
>>   > for different class of devices that want to mirror a process address
>>   > space into a device. There is no functional changes.
>>
>> Thanks for sending this patch. I think in general it is a good idea to
>> use a common infrastructure for ODP.
>>
>> I have a couple of questions below.
>>
>>> -static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>>> -   const struct mmu_notifier_range *range)
>>> -{
>>> -   struct ib_ucontext_per_mm *per_mm =
>>> -   container_of(mn, struct ib_ucontext_per_mm, mn);
>>> -
>>> -   if (unlikely(!per_mm->active))
>>> -   return;
>>> -
>>> -   rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
>>> - range->end,
>>> - invalidate_range_end_trampoline, true, 
>>> NULL);
>>> up_read(&per_mm->umem_rwsem);
>>> +   return ret;
>>>}
>> Previously the code held the umem_rwsem between range_start and
>> range_end calls. I guess that was in order to guarantee that no device
>> page faults take reference to the pages being invalidated while the
>> invalidation is ongoing. I assume this is now handled by hmm instead,
>> correct?
> 
> It is a mix of HMM and driver in pagefault_mr() mlx5/odp.c
>  mutex_lock(&odp->umem_mutex);
>  if (hmm_vma_range_done(range)) {
>  ...
> 
> This is what serialize programming the hw and any concurrent CPU page
> table invalidation. This is also one of the thing i want to improve
> long term as mlx5_ib_update_xlt() can do memory allocation and i would
> like to avoid that ie make mlx5_ib_update_xlt() and its sub-functions
> as small and to the points as possible so that they could only fail if
> the hardware is in bad state not because of memory allocation issues.
I wonder if it would be possible to make use of the memory that is 
already allocated (ib_umem_odp->dma_list) for that purpose. This would 
probably mean that this area will need to be formatted according to the 
device hardware requirements (e.g., big endian), and then you can 
instruct the device to DMA the updated translations directly from there.
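Roughly along these lines (a sketch only; the entry layout and the flag bit
below are made up and are not the real mlx5 XLT/MTT format):

/*
 * Sketch: keep the per-MR translation array in the device's native layout
 * (assumed here to be 64-bit big-endian entries with a low "present" bit),
 * so the HCA could fetch updated translations by DMA instead of
 * mlx5_ib_update_xlt() allocating a bounce buffer.
 */
static void format_xlt_entry_inplace(__be64 *xlt, size_t idx,
				     dma_addr_t addr, bool present)
{
	u64 entry = (u64)addr & ~0xffULL;	/* assumed: low bits hold flags */

	if (present)
		entry |= 1;			/* assumed "present" flag */
	xlt[idx] = cpu_to_be64(entry);
}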

> 
> 
>>
>>> +
>>> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
>>> +   ODP_READ_BIT,   /* HMM_PFN_VALID */
>>> +   ODP_WRITE_BIT,  /* HMM_PFN_WRITE */
>>> +   ODP_DEVICE_BIT, /* HMM_PFN_DEVICE_PRIVATE */
>> It seems that the mlx5_ib code in this patch currently ignores the
>> ODP_DEVICE_BIT (e.g., in umem_dma_to_mtt). Is that okay? Or is it
>> handled implicitly by the HMM_PFN_SPECIAL case?
> 
> This is because HMM expects a bit for device memory, as the same API is
> used for GPUs which have device memory. I can add a comment explaining
> that it is not used for ODP but is there just to comply with the HMM API.
> 
>>
>>> @@ -327,9 +287,10 @@ void put_per_mm(struct ib_umem_odp *umem_odp)
>>> up_write(&per_mm->umem_rwsem);
>>>   
>>> WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
>>> -   mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
>>> +   hmm_mirror_unregister(&per_mm->mirror);
>>> put_pid(per_mm->tgid);
>>> -   mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
>>> +
>>> +   kfree(per_mm);
>>>   }
>> Previously the per_mm struct was released through call srcu, but now it
>> is released immediately. Is it safe? I saw that hmm_mirror_unregister
>> calls mmu_notifier_unregister_no_release, so I don't understand what
>> prevents concurrently running invalidations from accessing the released
>> per_mm struct.
> 
> Yes it is safe, the hmm struct has its own refcount and mirror holds a
> reference on it, the mm struct itself has a reference on the mm struct.
> So no structure can vanish before the other. However once release call-
> back happens you can no longer fault anything it will -EFAULT if you
> try to (not to mention that by then all the vma have been tear down).
> So even if some kernel thread race with destruction it will not be able
> to fault anything or use mirror struct in any meaning full way.
> 
> Note that in a regular tear down the ODP put_per_mm() will happen before
> the release callback as iirc file including d

Re: [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13

2016-11-23 Thread Haggai Eran
On 11/18/2016 8:18 PM, Jérôme Glisse wrote:
> Cliff note: HMM offers 2 things (each standing on its own). First
> it allows to use device memory transparently inside any process
> without any modifications to process program code. Second it allows
> to mirror process address space on a device.
> 
> Change since v12 is the use of struct page for device memory even if
> the device memory is not accessible by the CPU (because of limitation
> impose by the bus between the CPU and the device).
> 
> Using struct page means that there are minimal changes to core mm
> code. HMM build on top of ZONE_DEVICE to provide struct page, it
> adds new features to ZONE_DEVICE. The first 7 patches implement
> those changes.
> 
> Rest of patchset is divided into 3 features that can each be use
> independently from one another. First is the process address space
> mirroring (patch 9 to 13), this allow to snapshot CPU page table
> and to keep the device page table synchronize with the CPU one.
> 
> Second is a new memory migration helper which allow migration of
> a range of virtual address of a process. This memory migration
> also allow device to use their own DMA engine to perform the copy
> between the source memory and destination memory. This can be
> useful even outside the HMM context in many use cases.
> 
> Third part of the patchset (patch 17-18) is a set of helper to
> register a ZONE_DEVICE node and manage it. It is meant as a
> convenient helper so that device drivers do not each have to
> reimplement over and over the same boiler plate code.
> 
> 
> I am hoping that this can now be consider for inclusion upstream.
> Bottom line is that without HMM we can not support some of the new
> hardware features on x86 PCIE. I do believe we need some solution
> to support those features or we won't be able to use such hardware
> in standard like C++17, OpenCL 3.0 and others.
> 
> I have been working with NVidia to bring up this feature on their
> Pascal GPU. There are real hardware that you can buy today that
> could benefit from HMM. We also intend to leverage this inside the
> open source nouveau driver.


Hi,

I think the way this new version of the patchset uses ZONE_DEVICE looks
promising and makes the patchset a little simpler than the previous
versions.

The mirroring code seems like it could be used to simplify the on-demand
paging code in the mlx5 driver and the RDMA subsystem. It currently uses
mmu notifiers directly.

I'm also curious whether it can be used to allow peer to peer access
between devices. For instance, if one device calls hmm_vma_get_pfns on a
process that has unaddressable memory mapped in, with some additional
help from DMA-API, its driver can convert these pfns to bus addresses
directed to another device's MMIO region and thus enable peer to peer
access. Then by handling invalidations through HMM's mirroring callbacks
it can safely handle cases where the peer migrates the page back to the
CPU or frees it.

Haggai


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-27 Thread Haggai Eran
On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
>
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>>
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
>
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
>
> Yes, it is DMA, so this is a valid approach.
>
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
>
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.
I think blocking mmu notifiers against something that is basically
controlled by user-space can be problematic. This can block things like
memory reclaim. If you have user-space access to the device's queues,
user-space can block the mmu notifier forever.

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately. This should work on legacy devices without ODP
support, and allows the system to safely terminate a process that
misbehaves. The downside of course is that it cannot transparently
migrate memory but I think for user-space RDMA doing that transparently
requires hardware support for paging, via something like HMM.

...

> I'm hearing most people say ZONE_DEVICE is the way to handle this,
> which means the missing remaining piece for RDMA is some kind of DMA
> core support for p2p address translation..

Yes, this is definitely something we need. I think Will Davis's patches
are a good start.

Another thing I think is that while HMM is good for user-space
applications, for kernel p2p use there is no need for that. Using
ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
pages for the short duration as you wrote above could work fine for
kernel uses in which we can guarantee they are short.

Haggai



Re: [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13

2016-11-27 Thread Haggai Eran
On 11/25/2016 6:16 PM, Jerome Glisse wrote:
> Yes this is something i have work on with NVidia, idea is that you will
> see the hmm_pfn_t with the device flag set you can then retrieve the struct
> device from it. Issue is now to figure out how from that you can know that
> this is a device with which you can interact. I would like a common and
> device agnostic solution but i think as first step you will need to rely
> on some back channel communication.
Maybe this can be done with the same DMA-API changes you mention below.
Given two device structs (the peer doing the mapping and the device that
provided the pages) and some (unaddressable) ZONE_DEVICE page structs,
ask the DMA-API to provide bus addresses for that p2p transaction.

> Once you have setup a peer mapping to the GPU memory its lifetime will be
> tie with CPU page table content ie if the CPU page table is updated either
> to remove the page (because of munmap/truncate ...) or because the page
> is migrated to some other place. In both case the device using the peer
> mapping must stop using it and refault to update its page table with the
> new page where the data is.
Sounds good.

> Issue to implement the above lie in the order in which mmu_notifier call-
> back are call. We want to tear down the peer mapping only once we know
> that any device using it is gone. If all device involve use the HMM mirror
> API then this can be solve easily. Otherwise it will need some change to
> mmu_notifier.
I'm not sure I understand how p2p would work this way. If the device
that provides the memory is using HMM for migration it marks the CPU
page tables with the special swap entry. Another device that is not
using HMM mirroring won't be able to translate this into a pfn, even if
it uses mmu notifiers.

> Note that all of the above would rely on change to DMA-API to allow to
> IOMMAP (through iommu) PCI bar address into a device IOMMU context. But
> this is an orthogonal issue.

Even without an IOMMU, I think the DMA-API is a good place to tell
whether p2p is at all possible, or whether it is a good idea in terms of
performance.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Haggai Eran
On Mon, 2016-11-28 at 09:48 -0500, Serguei Sagalovitch wrote:
> On 2016-11-27 09:02 AM, Haggai Eran wrote
> > 
> > On PeerDirect, we have some kind of a middle-ground solution for
> > pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately. This should work on legacy devices without
> > ODP
> > support, and allows the system to safely terminate a process that
> > misbehaves. The downside of course is that it cannot transparently
> > migrate memory but I think for user-space RDMA doing that
> > transparently
> > requires hardware support for paging, via something like HMM.
> > 
> > ...
> May be I am wrong but my understanding is that PeerDirect logic
> basically
> follow  "RDMA register MR" logic 
Yes. The only difference from regular MRs is the invalidation process I
mentioned, and the fact that we get the addresses not from
get_user_pages but from a peer driver.

> so basically nothing prevent to "terminate"
> process for "MMU notifier" case when we are very low on memory
> not making it similar (not worse) then PeerDirect case.
I'm not sure I understand. I don't think any solution prevents
terminating an application. The paragraph above is just trying to
explain how a non-ODP device/MR can handle an invalidation.

> > > I'm hearing most people say ZONE_DEVICE is the way to handle this,
> > > which means the missing remaining piece for RDMA is some kind of DMA
> > > core support for p2p address translation..
> > Yes, this is definitely something we need. I think Will Davis's
> > patches
> > are a good start.
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that.
> About HMM: I do not think that in the current form HMM would  fit in
> requirement for generic P2P transfer case. My understanding is that at
> the current stage HMM is good for "caching" system memory
> in device memory for fast GPU access but in RDMA MR non-ODP case
> it will not work because  the location of memory should not be
> changed so memory should be allocated directly in PCIe memory.
The way I see it there are two ways to handle non-ODP MRs. Either you
prevent the GPU from migrating / reusing the MR's VRAM pages for as long
as the MR is alive (if I understand correctly you didn't like this
solution), or you allow the GPU to somehow notify the HCA to invalidate
the MR. If you do that, you can use mmu notifiers or HMM or something
else, but HMM provides a nice framework to facilitate that notification.

> > 
> > Using ZONE_DEVICE with or without something like DMA-BUF to pin and
> > unpin
> > pages for the short duration as you wrote above could work fine for
> > kernel uses in which we can guarantee they are short.
> Potentially there is another issue related to pin/unpin. If memory
> could
> be used a lot of time then there is no sense to rebuild and program
> s/g tables each time if location of memory was not changed.
Is this about the kernel use or user-space? In user-space I think the MR
concept captures a long-lived s/g table so you don't need to rebuild it
(unless the mapping changes).

Haggai

Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Haggai Eran
On Mon, 2016-11-28 at 09:57 -0700, Jason Gunthorpe wrote:
> On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:
> > I think blocking mmu notifiers against something that is basically
> > controlled by user-space can be problematic. This can block things
> > like
> > memory reclaim. If you have user-space access to the device's
> > queues,
> > user-space can block the mmu notifier forever.
> Right, I mentioned that..
Sorry, I must have missed it.

> > On PeerDirect, we have some kind of a middle-ground solution for
> > pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately.
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? 
Naturally this doesn't support migration. The GPU is expected to pin
these pages as long as the MR lives. The MR invalidation is done only as
a last resort to keep system correctness.

I think it is similar to how non-ODP MRs rely on user-space today to
keep them correct. If you do something like madvise(MADV_DONTNEED) on a
non-ODP MR's pages, you can still get yourself into a data corruption
situation (HCA sees one page and the process sees another for the same
virtual address). The pinning that we use only guarantees the HCA's page
won't be reused.

> I would not support that
> upstream without a lot more explanation..
> 
> I know people don't like requiring new hardware, but in this case we
> really do need ODP hardware to get all the semantics people want..
> 
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that. Using
> From what I understand we are not really talking about kernel p2p,
> everything proposed so far is being mediated by a userspace VMA, so
> I'd focus on making that work.
Fair enough, although we will need both eventually, and I hope the
infrastructure can be shared to some degree.

Re: [PATCH] ib umem: bug: put pid back before return from error path

2016-12-22 Thread Haggai Eran
On 12/22/2016 9:11 AM, Kenneth Lee wrote:
> I caught this bug when reading the code. I'm sorry I have no hardware to test
> it. But it is obviously a bug.
> 
> Signed-off-by: Kenneth Lee 
> ---
>  drivers/infiniband/core/umem.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 1e62a5f..4609b92 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -134,6 +134,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
> unsigned long addr,
>IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
>  
>   if (access & IB_ACCESS_ON_DEMAND) {
> + put_pid(umem->pid);
>   ret = ib_umem_odp_get(context, umem);
>   if (ret) {
>   kfree(umem);
> @@ -149,6 +150,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
> unsigned long addr,
>  
>   page_list = (struct page **) __get_free_page(GFP_KERNEL);
>   if (!page_list) {
> + put_pid(umem->pid);
>   kfree(umem);
>   return ERR_PTR(-ENOMEM);
>   }
> 

Looks good to me. Thanks!

Reviewed-by: Haggai Eran 


Re: [PATCH 0/3] iopmem : A block device for PCIe memory

2016-10-26 Thread Haggai Eran
On 10/19/2016 6:51 AM, Dan Williams wrote:
> On Tue, Oct 18, 2016 at 2:42 PM, Stephen Bates  wrote:
>> 1. Address Translation. Suggestions have been made that in certain
>> architectures and topologies the dma_addr_t passed to the DMA master
>> in a peer-2-peer transfer will not correctly route to the IO memory
>> intended. However in our testing to date we have not seen this to be
>> an issue, even in systems with IOMMUs and PCIe switches. It is our
>> understanding that an IOMMU only maps system memory and would not
>> interfere with device memory regions.
I'm not sure that's the case. I think it works because with ZONE_DEVICE,
the iommu driver will simply treat a dma_map_page call as any other PFN,
and create a mapping as it does for any memory page.
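In other words, something like the sketch below already goes through the IOMMU
driver like any other page; whether the resulting bus address actually routes
correctly peer-to-peer is exactly the open question:

#include <linux/dma-mapping.h>

/* Sketch: map a ZONE_DEVICE page (e.g. one backing a BAR) for DMA by a peer. */
static dma_addr_t map_peer_page(struct device *dma_dev, struct page *page)
{
	dma_addr_t addr;

	addr = dma_map_page(dma_dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dma_dev, addr))
		return 0;	/* 0 denotes failure in this sketch */
	return addr;
}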

>> (It certainly has no opportunity
>> to do so if the transfer gets routed through a switch).
It can still go through the IOMMU if you enable ACS upstream forwarding.

> There may still be platforms where peer-to-peer cycles are routed up
> through the root bridge and then back down to target device, but we
> can address that when / if it happens.
I agree.

> I wonder if we could (ab)use a
> software-defined 'pasid' as the requester id for a peer-to-peer
> mapping that needs address translation.
Why would you need that? Isn't it enough to map the peer-to-peer
addresses correctly in the iommu driver?

Haggai


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Haggai Eran
On 11/30/2016 7:28 PM, Serguei Sagalovitch wrote:
> On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
>>> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
>>> Or do we need to extend the OOM killer to manage GPU pages?
>> I don't know..
> We could use send_sig_info to send signal from  kernel  to user space. So 
> theoretically GPU driver
> could issue KILL signal to some process.
> 
>> On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
>>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API 
>>> support
>>> for peer to peer. I'm not sure we need vmap. We need a way to have a 
>>> scatterlist
>>> of MMIO pfns, and ZONE_DEVICE allows that.
> I do not think that using DMA-API as it is is the best solution (at least in 
> the current form):
> 
> -  It deals with handles/fd for the whole allocation but client could/will 
> use sub-allocation as
> well as theoretically possible to "merge" several allocations in one from GPU 
> perspective.
> -  It require knowledge to export but because "sharing" is controlled from 
> user space it
> means that we must "export" all allocation by default
> - It deals with 'fd'/handles but user application may work with 
> addresses/pointers.

Aren't you confusing DMABUF and DMA-API? DMA-API is how you program the IOMMU 
(dma_map_page/dma_map_sg/etc.).
The comment above is just about the need to extend this API to allow mapping 
peer device pages to bus addresses.

In the past I sent an RFC for using DMABUF for peer to peer. I think it had some
advantages for legacy devices. I agree that working with addresses and pointers
through something like HMM/ODP is much more flexible and easier to program from
user-space. For legacy, DMABUF would have allowed you a way to pin the pages so
the GPU knows not to move them. However, that can probably also be achieved
simply via the reference count on ZONE_DEVICE pages. The other nice thing about
DMABUF is that it migrates the buffer itself during attachment according to the
requirements of the device that is attaching, so you can automatically decide in
the exporter whether to use p2p or a staging buffer.

> 
> Also current  DMA-API force each time to do all DMA table programming 
> unrelated if
> location was changed or not. With  vma / mmu  we are  able to install 
> notifier to intercept
> changes in location and update  translation tables only as needed (we do not 
> need to keep
> get_user_pages()  lock).
I agree.


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Haggai Eran
On 11/30/2016 8:01 PM, Logan Gunthorpe wrote:
> 
> 
> On 30/11/16 09:23 AM, Jason Gunthorpe wrote:
>>> Two cases I can think of are RDMA access to an NVMe device's controller
>>> memory buffer,
>>
>> I'm not sure on the use model there..
> 
> The NVMe fabrics stuff could probably make use of this. It's an
> in-kernel system to allow remote access to an NVMe device over RDMA. So
> they ought to be able to optimize their transfers by DMAing directly to
> the NVMe's CMB -- no userspace interface would be required but there
> would need some kernel infrastructure.

Yes, that's what I was thinking. The NVMe/f driver needs to map the CMB for
RDMA. I guess if it used ZONE_DEVICE like in the iopmem patches it would be
relatively easy to do.



Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Haggai Eran
On 11/30/2016 6:23 PM, Jason Gunthorpe wrote:
>> and O_DIRECT operations that access GPU memory.
> This goes through user space so there is still a VMA..
> 
>> Also, HMM's migration between two GPUs could use peer to peer in the
>> kernel, although that is intended to be handled by the GPU driver if
>> I understand correctly.
> Hum, presumably these migrations are VMA backed as well...
I guess so.

>>> Presumably in-kernel could use a vmap or something and the same basic
>>> flow?
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API 
>> support
>> for peer to peer. I'm not sure we need vmap. We need a way to have a 
>> scatterlist
>> of MMIO pfns, and ZONE_DEVICE allows that.
> Well, if there is no virtual map then we are back to how do you do
> migrations and other things people seem to want to do on these
> pages. Maybe the loose 'struct page' flow is not for those users.
I was thinking that kernel use cases would disallow migration, similar to how
non-ODP MRs would work. Either they are short-lived (like an O_DIRECT transfer)
or they can be long-lived but non-migratable (like perhaps a CMB staging buffer).

> But I think if you want kGPU or similar then you probably need vmaps
> or something similar to represent the GPU pages in kernel memory.
Right, although sometimes the GPU pages are simply inaccessible to the CPU.
In any case, I haven't thought about kGPU as a use-case.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-30 Thread Haggai Eran
On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
>>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>>> destroyed immediately.
>>> That sounds horrible. How can that possibly work? What if the MR is
>>> being used when the GPU decides to migrate? 
>> Naturally this doesn't support migration. The GPU is expected to pin
>> these pages as long as the MR lives. The MR invalidation is done only as
>> a last resort to keep system correctness.
> 
> That just forces applications to handle horrible unexpected
> failures. If this sort of thing is needed for correctness then OOM
> kill the offending process, don't corrupt its operation.
Yes, that sounds fine. Can we simply kill the process from the GPU driver?
Or do we need to extend the OOM killer to manage GPU pages?

> 
>> I think it is similar to how non-ODP MRs rely on user-space today to
>> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
>> non-ODP MR's pages, you can still get yourself into a data corruption
>> situation (HCA sees one page and the process sees another for the same
>> virtual address). The pinning that we use only guarantees the HCA's page
>> won't be reused.
> 
> That is not really data corruption - the data still goes where it was
> originally destined. That is an application violating the
> requirements of a MR. 
I guess it is a matter of terminology. If you compare it to the ODP case 
or the CPU case then you usually expect a single virtual address to map to
a single physical page. Violating this causes some of your writes to be dropped,
which is data corruption in my book, even if the application caused it.

> An application cannot munmap/mremap a VMA
> while a non ODP MR points to it and then keep using the MR.
Right. And it is perfectly fine to have some similar requirements from the 
application
when doing peer to peer with a non-ODP MR. 

> That is totally different from a GPU driver wanthing to mess with
> translation to physical pages.
> 
>>> From what I understand we are not really talking about kernel p2p,
>>> everything proposed so far is being mediated by a userspace VMA, so
>>> I'd focus on making that work.
> 
>> Fair enough, although we will need both eventually, and I hope the
>> infrastructure can be shared to some degree.
> 
> What use case do you see for in kernel?
Two cases I can think of are RDMA access to an NVMe device's controller 
memory buffer, and O_DIRECT operations that access GPU memory.
Also, HMM's migration between two GPUs could use peer to peer in the kernel,
although that is intended to be handled by the GPU driver if I understand
correctly.

> Presumably in-kernel could use a vmap or something and the same basic
> flow?
I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
of MMIO pfns, and ZONE_DEVICE allows that.
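For example (a sketch only; the pages are assumed to come from the owning
driver, e.g. via devm_memremap_pages()):

#include <linux/scatterlist.h>

/* Sketch: describe MMIO backed by ZONE_DEVICE pages with a scatterlist. */
static void build_mmio_sgl(struct scatterlist *sgl, struct page **pages,
			   unsigned int npages)
{
	unsigned int i;

	sg_init_table(sgl, npages);
	for (i = 0; i < npages; i++)
		sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);
}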

Haggai


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-02 Thread Haggai Eran
On 03/03/2016 04:49, Parav Pandit wrote:
> Hi Tejun, Haggai,
> 
> On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
 + rpool->refcnt--;
 + if (rpool->refcnt == 0 && rpool->num_max_cnt == 
 pool_info->table_len) {
>>>
>>> If the caller charges 2 and then uncharges 1 two times, the refcnt
>>> underflows?  Why not just track how many usages are zero?
>>>
>> This is certainly a must-fix bug. Changed refcnt to usage_sum and changed to do
>> usage_sum -= num during uncharging
>> and
>> usage_sum += num during charging.
> 
> This is not sufficient as css_get() and put are done only once per
> call, which leads to similar problem as of refcnt.
Are css_get_many() and css_put_many() relevant here?

> As I think more, I realised that this particular test is missing that
> resulted in this related bug, I realize that we don't have use case to
> have "num" field from the IB stack side.
> For bulk free IB stack will have to keep track of different or same
> rdmacg returned values to call uncharge() with right number of
> resources, all of that complexity just doesn't make sense and not
> required.
> So as first step to further simplify this, I am removing "num" input
> field from charge and uncharge API.

IIRC there are no instances in the RDMA subsystem today where userspace
allocates more than one resource at a time.

Yishai, in your proposed RSS patchset did you have a verb to allocate
multiple work queues at once?

Haggai



Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-03 Thread Haggai Eran
On 03/03/2016 05:18, Parav Pandit wrote:
> On Thu, Mar 3, 2016 at 1:28 AM, Parav Pandit  wrote:
>> On Wed, Mar 2, 2016 at 11:09 PM, Tejun Heo  wrote:
>>> Nothing seems to prevent @cg from going away if this races with
>>> @current being migrated to a different cgroup.  Have you run this with
>>> lockdep and rcu debugging enabled?  This should have triggered a
>>> warning.
> I am able to reproduce this race. Looking into how to address it.

If I understand correctly, task_css() requires rcu read lock being held.
Is task_get_css() suitable here?
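For example (a sketch only; the subsystem id name and the rdma_cgroup layout
are assumptions on my side):

/* Sketch: pin the task's rdma css so it survives a concurrent migration. */
static struct rdma_cgroup *get_current_rdmacg(void)
{
	struct cgroup_subsys_state *css;

	css = task_get_css(current, rdma_cgrp_id);	/* takes a css reference */
	return container_of(css, struct rdma_cgroup, css);
}

The charge path would then use the returned cgroup and drop the reference with
css_put(&cg->css) once it is done.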


Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller

2016-03-06 Thread Haggai Eran
On 05/03/2016 19:20, Parav Pandit wrote:
>> > 3 is fine but resource [un]charging is not hot path?
> charge/uncharge is hot path from cgroup perspective.

Most of the resources the RDMA cgroup handles are only allocated at
the beginning of the application. The RDMA subsystem allows direct
user-space access to the devices, so most of the hot path operations
don't go through the kernel at all.

It is true though that for some applications MR registration and
de-registration are in the hot path.

Haggai


Re: [PATCHv7 1/3] rdmacg: Added rdma cgroup controller

2016-03-01 Thread Haggai Eran
On 28/02/2016 16:13, Parav Pandit wrote:
> Added rdma cgroup controller that does accounting, limit enforcement
> on rdma/IB verbs and hw resources.
> 
> Added rdma cgroup header file which defines its APIs to perform
> charging/uncharging functionality and device registration which will
> participate in controller functions of accounting and limit
> enforcements. It also define rdmacg_device structure to bind IB stack
> and RDMA cgroup controller.
> 
> RDMA resources are tracked using resource pool. Resource pool is per
> device, per cgroup entity which allows setting up accounting limits
> on per device basis.
> 
> Resources are not defined by the RDMA cgroup, instead they are defined
> by the external module IB stack. This allows extending IB stack
> without changing kernel, as IB stack is going through changes
> and enhancements.
> 
> Resource pool is created/destroyed dynamically whenever
> charging/uncharging occurs respectively and whenever user
> configuration is done. Its a tradeoff of memory vs little more code
> space that creates resource pool whenever necessary,
> instead of creating them during cgroup creation and device registration
> time.
> 
> Signed-off-by: Parav Pandit 

Reviewed-by: Haggai Eran 



Re: [PATCHv7 2/3] IB/core: added support to use rdma cgroup controller

2016-03-01 Thread Haggai Eran
On 28/02/2016 16:13, Parav Pandit wrote:
> +void ib_rdmacg_query_limit(struct ib_device *device, int *limits, int 
> max_count)
> +{
> + rdmacg_query_limit(&device->cg_device, limits);
> +}
> +EXPORT_SYMBOL(ib_rdmacg_query_limit);

You can remove the max_count parameter here as well.
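That is, a sketch of the simplified signature:

void ib_rdmacg_query_limit(struct ib_device *device, int *limits)
{
	rdmacg_query_limit(&device->cg_device, limits);
}
EXPORT_SYMBOL(ib_rdmacg_query_limit);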


Re: [PATCHv7 2/3] IB/core: added support to use rdma cgroup controller

2016-03-01 Thread Haggai Eran
On 28/02/2016 16:13, Parav Pandit wrote:
> diff --git a/drivers/infiniband/core/device.c 
> b/drivers/infiniband/core/device.c
> index 00da80e..54ea8ce 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -343,28 +343,38 @@ int ib_register_device(struct ib_device *device,
>  
>   ret = read_port_immutable(device);
>   if (ret) {
> - printk(KERN_WARNING "Couldn't create per port immutable data 
> %s\n",
> -device->name);
> + pr_warn("Couldn't create per port immutable data %s\n",
> + device->name);
>   goto out;

This change doesn't belong in the patch.


Re: [PATCHv7 2/3] IB/core: added support to use rdma cgroup controller

2016-03-01 Thread Haggai Eran
On 01/03/2016 11:22, Parav Pandit wrote:
> On Tue, Mar 1, 2016 at 2:42 PM, Haggai Eran  wrote:
>> On 28/02/2016 16:13, Parav Pandit wrote:
>>> diff --git a/drivers/infiniband/core/device.c 
>>> b/drivers/infiniband/core/device.c
>>> index 00da80e..54ea8ce 100644
>>> --- a/drivers/infiniband/core/device.c
>>> +++ b/drivers/infiniband/core/device.c
>>> @@ -343,28 +343,38 @@ int ib_register_device(struct ib_device *device,
>>>
>>>   ret = read_port_immutable(device);
>>>   if (ret) {
>>> - printk(KERN_WARNING "Couldn't create per port immutable data 
>>> %s\n",
>>> -device->name);
>>> + pr_warn("Couldn't create per port immutable data %s\n",
>>> + device->name);
>>>   goto out;
>>
>> This change doesn't belong in the patch.
> I agree, but few warnings are with pr_warn and few with printk just
> make code look uneven.
> So I changed printk to pr_warn in same function instead of spinning
> complete new patch.
Still, I think it would be better to have such cosmetic changes in a
separate patch, so that we have a cleaner git history. You can send 
this extra patch separately from this patchset so that Doug can take 
it independently.


Re: [PATCHv7 3/3] rdmacg: Added documentation for rdmacg

2016-03-01 Thread Haggai Eran
On 28/02/2016 16:13, Parav Pandit wrote:
> Added documentation for v1 and v2 version describing high
> level design and usage examples on using rdma controller.
> 
> Signed-off-by: Parav Pandit 

Reviewed-by: Haggai Eran 



Re: [PATCHv9 2/3] IB/core: added support to use rdma cgroup controller

2016-03-02 Thread Haggai Eran
On 01/03/2016 21:05, Parav Pandit wrote:
> Added support APIs for IB core to register/unregister every IB/RDMA
> device with rdma cgroup for tracking verbs and hw resources.
> IB core registers with rdma cgroup controller and also defines
> resources that can be accounted.
> Added support APIs for uverbs layer to make use of rdma controller.
> Added uverbs layer to perform resource charge/uncharge functionality.
> Added support during query_device uverb operation to ensure it
> returns resource limits by honoring rdma cgroup configured limits.
> 
> Signed-off-by: Parav Pandit 

Looks good to me. Thanks.

Reviewed-by: Haggai Eran 



Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

2015-09-20 Thread Haggai Eran
On 15/09/2015 06:45, Jason Gunthorpe wrote:
> No, I'm saying the resource pool is *well defined* and *fixed* by each
> hardware.
> 
> The only question is how do we expose the N resource limits, the list
> of which is totally vendor specific.

I don't see why you say the limits are vendor specific. It is true that
different RDMA devices have different implementations and capabilities,
but they all expose the same set of RDMA objects with their
limitations. Whether those limitations come from hardware limitations,
from the driver, or just because the address space is limited, they can
still be exhausted.

> Yes, using a % scheme fixes the ratios, 1% is going to be a certain
> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
> configuration. That is the trade off for API simplicity.
>
> 
> Yes, this results in some resources being over provisioned.

I agree that such a scheme will be easy to configure, but I don't think
it can work well in all situations. Imagine you want to let one
container use almost all RC QPs as you want it to connect to the entire
cluster through RC. Other containers can still use a single datagram QP
to connect to the entire cluster, but they would require many address
handles. If you force a fixed ratio of resources given to each container
it would be hard to describe such a partitioning.

I think it would be better to expose different controls for the
different RDMA resources.

Regards,
Haggai


[PATCH] dma-debug: skip debug_dma_assert_idle() when disabled

2015-07-15 Thread Haggai Eran
If dma-debug is disabled due to a memory error, DMA unmaps do not affect
the dma_active_cacheline radix tree anymore, and debug_dma_assert_idle()
can print false warnings.

Disable debug_dma_assert_idle() when dma_debug_disabled() is true.

Fixes: 0abdd7a81b7e ("dma-debug: introduce debug_dma_assert_idle()")
Cc: Dan Williams 
Cc: Joerg Roedel 
Cc: Vinod Koul 
Cc: Russell King 
Cc: James Bottomley 
Cc: Andrew Morton 
Cc: Florian Fainelli 
Cc: Sebastian Ott 
Cc: Jiri Kosina 
Cc: Horia Geanta 
Signed-off-by: Haggai Eran 
---
 lib/dma-debug.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index ae4b65e17e64..dace71fe41f7 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -574,6 +574,9 @@ void debug_dma_assert_idle(struct page *page)
unsigned long flags;
phys_addr_t cln;
 
+   if (dma_debug_disabled())
+   return;
+
if (!page)
return;
 
-- 
1.7.11.2



Re: [PATCH 03/28] [v2] infiniband: shut up a maybe-uninitialized warning

2016-10-18 Thread Haggai Eran
On 10/18/2016 1:05 AM, Arnd Bergmann wrote:
> @@ -1309,7 +1311,7 @@ static bool validate_net_dev(struct net_device *net_dev,
>  static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event,
> const struct cma_req_info *req)
>  {
> - struct sockaddr_storage listen_addr_storage, src_addr_storage;
> + struct sockaddr_storage listen_addr_storage = {}, src_addr_storage = {};

Doesn't this still translate to an extra initialization that Doug was
worried about?

Haggai


Re: [PATCH 03/28] [v2] infiniband: shut up a maybe-uninitialized warning

2016-10-18 Thread Haggai Eran
On 10/18/2016 1:18 PM, Arnd Bergmann wrote:
> On Tuesday, October 18, 2016 9:47:31 AM CEST Haggai Eran wrote:
>> On 10/18/2016 1:05 AM, Arnd Bergmann wrote:
>>> @@ -1309,7 +1311,7 @@ static bool validate_net_dev(struct net_device 
>>> *net_dev,
>>>  static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event,
>>> const struct cma_req_info *req)
>>>  {
>>> - struct sockaddr_storage listen_addr_storage, src_addr_storage;
>>> + struct sockaddr_storage listen_addr_storage = {}, src_addr_storage = 
>>> {};
>>
>> Doesn't this still translate to an extra initialization that Doug was
>> worried about?
> 
> Thanks for spotting this. I must have screwed up while rebasing the patch
> at some point, this one change should not be there, the other changes by
> themselves sufficiently address the warning.

Okay, other than this the patch looks good to me.